This post contains the latest paper listing retrieved from arXiv.org on 2026-03-24, updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from arXiv.org daily, with an automatic update around 12:30 each morning.
Tip: if a given day has not been updated on time, arXiv may not have released new papers that day, or the script may have failed; it will be fixed the same day whenever possible.
Table of Contents
Overview (2026-03-24)
A total of 1176 papers are updated today, including:
- Natural Language Processing: 152 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 362 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 270 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 319 papers (Machine Learning (cs.LG))
- Multiagent Systems: 29 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 37 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 51 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] Human-Inspired Pavlovian and Instrumental Learning for Autonomous Agent Navigation
[Quick Read]: This paper addresses how autonomous agents operating in uncertain environments can balance fast responses with goal-directed planning: conventional model-free (MF) reinforcement learning (RL) converges slowly and can induce unsafe exploration, while model-based (MB) methods are computationally expensive and sensitive to model mismatch. The key to the solution is a hybrid RL architecture inspired by human neuroscience that integrates three learning mechanisms (Pavlovian conditioning, instrumental MF, and instrumental MB) and introduces contextual radio cues (georeferenced environmental features acting as conditioned stimuli, CS) to shape intrinsic value signals and guide decision-making. Learning is further modulated by a motivational signal, and a Bayesian arbitration mechanism adaptively blends MF and MB estimates as prediction reliability changes, enabling a safer and more efficient transition from exploration to exploitation and markedly improving learning speed, operational safety, and navigation performance in high-uncertainty regions.
Link: https://arxiv.org/abs/2603.22170
Authors: Jingfeng Shan, Francesco Guidi, Mehrdad Saeidi, Enrico Testi, Elia Favarelli, Andrea Giorgetti, Davide Dardari, Alberto Zanella, Giorgio Li Pira, Francesca Starita, Anna Guerra
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Autonomous agents operating in uncertain environments must balance fast responses with goal-directed planning. Classical MF RL often converges slowly and may induce unsafe exploration, whereas MB methods are computationally expensive and sensitive to model mismatch. This paper presents a human-inspired hybrid RL architecture integrating Pavlovian, Instrumental MF, and Instrumental MB components. Inspired by Pavlovian and Instrumental learning from neuroscience, the framework considers contextual radio cues, here intended as georeferenced environmental features acting as CS, to shape intrinsic value signals and bias decision-making. Learning is further modulated by internal motivational drives through a dedicated motivational signal. A Bayesian arbitration mechanism adaptively blends MF and MB estimates based on predicted reliability. Simulation results show that the hybrid approach accelerates learning, improves operational safety, and reduces navigation in high-uncertainty regions compared to standard RL baselines. Pavlovian conditioning promotes safer exploration and faster convergence, while arbitration enables a smooth transition from exploration to efficient, plan-driven exploitation. Overall, the results highlight the benefits of biologically inspired modularity for robust and adaptive autonomous systems under uncertainty.
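The Bayesian arbitration described in the abstract, blending MF and MB value estimates according to predicted reliability, can be sketched as a precision-weighted average; the inverse-variance weighting below is an illustrative assumption, not the paper's exact formulation:

```python
def arbitrate(q_mf: float, q_mb: float, var_mf: float, var_mb: float) -> float:
    """Blend model-free (MF) and model-based (MB) action-value estimates.

    Reliability is taken here as inverse predictive variance, so whichever
    system is currently less uncertain dominates the blended estimate.
    This specific weighting scheme is an assumption for illustration only.
    """
    w_mb = var_mf / (var_mf + var_mb)  # MB weight grows as MF becomes noisier
    return w_mb * q_mb + (1.0 - w_mb) * q_mf

# Early in learning, MF estimates are unreliable (high variance),
# so the blended value leans toward the MB estimate.
q_early = arbitrate(q_mf=0.2, q_mb=0.8, var_mf=1.0, var_mb=0.1)
# Once MF has converged, the situation reverses and MF dominates.
q_late = arbitrate(q_mf=0.2, q_mb=0.8, var_mf=0.05, var_mb=0.5)
```

A scheme like this yields the smooth exploration-to-exploitation handover the abstract describes: the arbiter drifts from MB to MF control as MF predictions become reliable.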
[MA-1] Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control
[Quick Read]: This paper targets safe cooperative control of autonomous vehicles in a multi-agent highway-merging scenario, where the core challenge is efficient and safe decision-making under partial observability. The key is an attention-based neural network architecture: within a QMIX framework, each ego vehicle is equipped with a partial attention module that lets it focus on its most relevant neighboring vehicles, and a comprehensive reward signal combines global objectives (such as safety and traffic flow) with each agent's individual interests to improve overall system performance. Simulation results show the approach outperforms existing driving algorithms in safety, driving speed, and cumulative reward.
Link: https://arxiv.org/abs/2603.21810
Authors: Turki Bin Mohaya, Peter Seiler
Affiliations: University of Michigan
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
Abstract:Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state-of-the-art performance in advanced generative artificial intelligence models. This paper applies this concept of an attention mechanism for multi-agent safe control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.
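The per-ego partial attention sketched in the abstract, where each ego vehicle weights its neighbors by relevance, can be illustrated as single-head scaled dot-product attention over neighbor feature vectors; the projection shapes and single-head form are assumptions, not the paper's exact architecture:

```python
import numpy as np

def neighbor_attention(ego, neighbors, W_q, W_k, W_v):
    """Scaled dot-product attention of one ego vehicle over its neighbors.

    ego: (d,) ego-vehicle feature vector; neighbors: (n, d) matrix of
    neighbor-vehicle features. The single attention head and shared
    projection matrices are illustrative assumptions.
    """
    q = ego @ W_q                         # query from the ego vehicle, (d_k,)
    k = neighbors @ W_k                   # keys from neighbors, (n, d_k)
    v = neighbors @ W_v                   # values from neighbors, (n, d_v)
    scores = k @ q / np.sqrt(q.shape[0])  # relevance of each neighbor, (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over neighbors
    return weights @ v                    # relevance-weighted neighbor summary

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))
ctx = neighbor_attention(rng.normal(size=d), rng.normal(size=(3, d)), W, W, W)
```

The resulting context vector would then be concatenated with the ego state and fed to the per-agent Q-network inside QMIX.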
[MA-2] Modal Logic for Distributed Trust
[Quick Read]: This paper addresses reasoning about trust in multi-agent systems, particularly how to formally describe communication protocols, trust assumptions, and derivations in distributed settings. The key is a modal-logic language that characterizes the beliefs and communication behavior of agents in a network, with information forwarding allowing trust to be generalized across networks. Nested modal operators model chains of communication between agents, for which suitable notions of trust are established; the framework translates directly into a lambda-calculus proof system, supporting formally verifiable trust models for typical applications such as public key infrastructures (PKI).
Link: https://arxiv.org/abs/2603.21802
Authors: Niels Voorneveld, Peeter Laud
Affiliations: unknown
Subjects: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: 32 pages
Abstract:We propose a method for reasoning about trust in multi-agent systems, specifying a language for describing communication protocols and making trust assumptions and derivations. This is given an interpretation in a modal logic for describing the beliefs and communications of agents in a network. We define how information in the network can be shared via forwarding, and how trust between agents can be generalized to trust across networks. We give specifications for the modal logic which can be readily adapted into a lambda calculus of proofs. We show that by nesting modalities, we can describe chains of communication between agents, and establish suitable notions of trust for such chains. We see how this can be applied to trust models in public key infrastructures, as well as other interaction protocols in distributed systems.
[MA-3] Can a Robot Walk the Robotic Dog: Triple-Zero Collaborative Navigation for Heterogeneous Multi-Agent Systems
[Quick Read]: This paper tackles efficient collaborative path planning for heterogeneous multi-robot systems in real-world environments, where traditional methods depend on large amounts of training data, prior knowledge, or simulation. The key innovation of the proposed Triple Zero Path Planning (TZPP) framework is its coordinator-explorer architecture: a humanoid robot handles task coordination, while a quadruped robot, guided by a multimodal large language model (MLLM), perceives the environment and explores feasible paths, achieving adaptive path planning in complex scenes with zero training, zero prior knowledge, and zero simulation. This markedly improves robustness and generalization in unseen environments and offers a practical route to real-world deployment of heterogeneous robot cooperation.
Link: https://arxiv.org/abs/2603.21723
Authors: Yaxuan Wang, Yifan Xiang, Ke Li, Xun Zhang, BoWen Ye, Zhuochen Fan, Fei Wei, Tong Yang
Affiliations: Peking University; Beijing University of Posts and Telecommunications; Pengcheng Laboratory; Beijing Jinruyi Large Model Technology Co., Ltd.
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 8 pages, 2 figures
Abstract:We present Triple Zero Path Planning (TZPP), a collaborative framework for heterogeneous multi-robot systems that requires zero training, zero prior knowledge, and zero simulation. TZPP employs a coordinator–explorer architecture: a humanoid robot handles task coordination, while a quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. We implement TZPP on Unitree G1 and Go2 robots and evaluate it across diverse indoor and outdoor environments, including obstacle-rich and landmark-sparse settings. Experiments show that TZPP achieves robust, human-comparable efficiency and strong adaptability to unseen scenarios. By eliminating reliance on training and simulation, TZPP offers a practical path toward real-world deployment of heterogeneous robot cooperation. Our code and video are provided at: this https URL
[MA-4] A Game-Theoretic Framework for Intelligent EV Charging Network Optimisation in Smart Cities ITSC2025
[Quick Read]: This paper addresses the joint optimization of charging station (CS) placement and pricing amid the growing adoption of electric vehicles (EVs), where the core challenge is minimizing total social cost, improving traffic efficiency, and ensuring infrastructure profitability while accounting for strategic driver behavior (traffic-flow and charging-queue equilibria in non-atomic games). The key is a two-level approximation method, Joint Placement and Pricing Optimisation under Driver Equilibrium (JPPO-DE), which combines driver-behavior decomposition with integer relaxation to effectively solve the resulting mixed-integer nonlinear programme. On the Sioux Falls transportation-network benchmark, it outperforms single-parameter baselines and scales well across varying budgets, EV penetration levels, and station capacities.
Link: https://arxiv.org/abs/2603.21715
Authors: Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli
Affiliations: Aston University
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Comments: This paper has been accepted for publication in the Proceedings of the IEEE 28th International Conference on Intelligent Transportation Systems (ITSC 2025)
Abstract:The transition to Electric Vehicles (EVs) demands intelligent, congestion-aware infrastructure planning to balance user convenience, economic viability, and traffic efficiency. We present a joint optimisation framework for EV Charging Station (CS) placement and pricing, explicitly capturing strategic driver behaviour through coupled non-atomic congestion games over road networks and charging facilities. From a Public Authority (PA) perspective, the model minimises social cost, travel times, queuing delays and charging expenses, while ensuring infrastructure profitability. To solve the resulting Mixed-Integer Nonlinear Programme, we propose a scalable two-level approximation method, Joint Placement and Pricing Optimisation under Driver Equilibrium (JPPO-DE), combining driver behaviour decomposition with integer relaxation. Experiments on the benchmark Sioux Falls Transportation Network (TN) demonstrate that our method consistently outperforms single-parameter baselines, effectively adapting to varying budgets, EV penetration levels, and station capacities. It achieves performance improvements of at least 16% over state-of-the-art approaches. A generalisation procedure further extends scalability to larger networks. By accurately modelling traffic equilibria and enabling adaptive, efficient infrastructure design, our framework advances key intelligent transportation system goals for sustainable urban mobility.
[MA-5] Strategic Infrastructure Design via Multi-Agent Congestion Games with Joint Placement and Pricing
[Quick Read]: This paper addresses the coupled optimization of resource placement and pricing in multi-agent systems (MAS), specifically for competitive, limited, and congestible infrastructure resources (such as EV chargers and road capacity), aiming to minimize total social cost while accounting for the adaptive behavior of decentralized, self-interested agents (EV drivers and non-charging drivers) responding to congestion, delays, and costs. The core challenge lies in the bi-level decision structure: an upper-level central planner sets placement and pricing, while lower-level agent responses are captured by coupled non-atomic congestion games, yielding an NP-hard bi-level optimization problem. The key is the ABO-MPN framework, a double-layer approximation method that decouples agent types, applies integer adjustment and rounding, and targets high-impact placement and pricing decisions; experiments on benchmark networks show it reduces social cost by up to 40% compared with placement-only or pricing-only baselines.
Link: https://arxiv.org/abs/2603.21691
Authors: Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA)
Comments: This paper has been accepted for publication in the Proceedings of the 22nd European Conference on Multi-Agent Systems (EUMAS 2025)
Abstract:Real-world infrastructure planning increasingly involves strategic interactions among autonomous agents competing over congestible, limited resources. Applications such as Electric Vehicle (EV) charging, emergency response, and intelligent transportation require coordinated resource placement and pricing decisions, while anticipating the adaptive behaviour of decentralised, self-interested agents. We propose a novel multi-agent framework for joint placement and pricing under such interactions, formalised as a bi-level optimisation model. The upper level represents a central planner, while the lower level captures agent responses via coupled non-atomic congestion games. Motivated by the EV charging domain, we study a setting where a central planner provisions chargers and road capacity under budget and profitability constraints. The agent population includes both EV drivers and non-charging drivers (NCDs), who respond to congestion, delays, and costs. To solve the resulting NP-hard problem, we introduce ABO-MPN, a double-layer approximation framework that decouples agent types, applies integer adjustment and rounding, and targets high-impact placement and pricing decisions. Experiments on benchmark networks show that our model reduces social cost by up to 40% compared to placement- or pricing-only baselines, and generalises to other MAS-relevant domains.
[MA-6] Is AI Ready for Multimodal Hate Speech Detection? A Comprehensive Dataset and Benchmark Evaluation
[Quick Read]: This paper addresses the inaccurate evaluation of multimodal hate-speech detection caused by coarse-grained labels and missing contextual information, especially for memes whose combined image-text form carries cultural context and complex semantics that existing models struggle to capture. The key is an agentic annotation framework that coordinates seven specialized agents to jointly produce hierarchical labels and interpretable rationales, yielding M³ (Multi-platform, Multi-lingual, and Multimodal Meme), a fine-grained, high-quality multi-platform, multilingual, multimodal meme dataset. Benchmarking reveals that current multimodal large language models make poor use of surrounding post context, which often fails to improve or even degrades detection performance, underscoring the need for context-aware multimodal architectures.
Link: https://arxiv.org/abs/2603.21686
Authors: Rui Xing, Qi Chai, Jie Ma, Jing Tao, Pinghui Wang, Shuming Zhang, Xinping Wang, Hao Wang
Affiliations: Xi'an Jiaotong University; The Hong Kong University of Science and Technology (Guangzhou); School of Cyber Science and Engineering, Xi'an Jiaotong University; Northwest University
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Hate speech online targets individuals or groups based on identity attributes and spreads rapidly, posing serious social risks. Memes, which combine images and text, have emerged as a nuanced vehicle for disseminating hate speech, often relying on cultural knowledge for interpretation. However, existing multimodal hate speech datasets suffer from coarse-grained labeling and a lack of integration with surrounding discourse, leading to imprecise and incomplete assessments. To bridge this gap, we propose an agentic annotation framework that coordinates seven specialized agents to generate hierarchical labels and rationales. Based on this framework, we construct M^3 (Multi-platform, Multi-lingual, and Multimodal Meme), a dataset of 2,455 memes collected from X, 4chan, and Weibo, featuring fine-grained hate labels and human-verified rationales. Benchmarking state-of-the-art Multimodal Large Language Models reveals that these models struggle to effectively utilize surrounding post context, which often fails to improve or even degrades detection performance. Our finding highlights the challenges these models face in reasoning over memes embedded in real-world discourse and underscores the need for a context-aware multimodal architecture. Our dataset and code are available at this https URL.
[MA-7] Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment
[Quick Read]: This paper addresses the difficulty of standardizing post-treatment MRI response assessment in patients with diffuse gliomas, where Brain Tumor Reporting and Data System (BT-RADS) classification requires integrating imaging trends, medication effects, and radiation timing, and manual judgment is prone to variability. The key is an end-to-end system combining a multi-agent large language model (LLM) with a convolutional neural network (CNN): an extractor agent automatically identifies key variables (steroid status, bevacizumab use, radiation date) from unstructured clinical text, and a scorer agent applies BT-RADS decision logic to combine these variables with volumetric measurements from automated CNN segmentation, achieving automated, high-accuracy BT-RADS classification that significantly improves agreement over initial clinical assessments (+18.5 percentage points in accuracy; P < .001).
Link: https://arxiv.org/abs/2603.21494
Authors: Mohamed Sobhi Jabal (1), Jikai Zhang (2 and 3), Dominic LaBella (4), Jessica L. Houk (1), Dylan Zhang (1 and 7), Jeffrey D. Rudie (5 and 8), Kirti Magudia (1), Maciej A. Mazurowski (1, 2 and 6), Evan Calabrese (1 and 3) ((1) Duke University Medical Center, Durham NC, (2) Duke University, Durham NC, (3) Duke Center for Artificial Intelligence in Radiology, Durham NC, (4) Duke University Medical Center, Durham NC, (5) University of California San Diego, San Diego CA, (6) Duke University School of Medicine, Durham NC, (7) Santa Clara Valley Medical Center, San Jose CA, (8) Scripps Clinic Medical Group, San Diego CA)
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 17 pages, 5 figures, 4 tables, 2 supplementary figures, 3 supplementary tables
Abstract:The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P < .001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
[MA-8] Personality-Driven Student Agent-Based Modeling in Mathematics Education: How Well Do Student Agents Align with Human Learners?
[Quick Read]: This paper addresses the difficulty of running teaching experiments with real students in educational research due to ethical constraints, with the core question being whether LLM-based generative student agents are behaviorally credible enough to faithfully simulate human learning. The key is a Big Five personality-based student agent model covering a full pipeline of student-teacher interaction, self-study, and examination, evaluated against 14 behavioral-consistency criteria distilled from 13 empirical studies; the results show that 71.4% of student-agent behavior aligned with human learners, supporting the validity of the simulation.
Link: https://arxiv.org/abs/2603.21358
Authors: Bushi Xiao, Qian Shen
Affiliations: University of Florida
Subjects: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
Comments: Short Paper
Abstract:It is crucial to explore the impact of different teaching methods on student learning in educational research. However, real-person experiments face significant ethical constraints, and we cannot conduct repeated teaching experiments on the same student. LLM-based generative agents offer a promising avenue for simulating student behavior. Before large-scale experiments, a fundamental question must be addressed: are student agents truly credible, and can they faithfully simulate human learning? In this study, we built a Big Five Personality-based student agent model with a full pipeline of student-teacher interaction, self-study, and examination. To evaluate behavioral fidelity, we collected 13 empirical studies on Big Five traits and learning, and distilled them into 14 criteria. We found that 71.4% of the student agents' behavior was aligned with human learners.
[MA-9] Architecture for Multi-Unmanned Aerial Vehicles based Autonomous Precision Agriculture Systems
[Quick Read]: This paper addresses the lack of a structured framework of abstractions for coordinating multiple UAVs in precision agriculture, which is needed for various algorithms to be deployed and executed efficiently in the field. The key is an end-to-end autonomous architecture that integrates core tasks such as image processing, path planning, communication, data acquisition, and field mapping, and runs multi-UAV cooperation efficiently with minimal physical intervention; practical constraints such as fault tolerance, robustness, and developer/user friendliness are also considered, giving farmers a fully deployable automated UAV system.
Link: https://arxiv.org/abs/2603.21183
Authors: Ebasa Temesgen, Nathnael Minyelshowa, Lebsework Negash
Affiliations: unknown
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:The use of unmanned aerial vehicles (UAVs) in precision agriculture has seen a huge increase recently. As such, systems that aim to apply various algorithms in the field need a structured framework of abstractions. This paper defines the various tasks of UAVs in precision agriculture and models them within an architectural framework. The presented architecture is built on the premise that the defined tasks will be carried out with minimal physical intervention by multiple coordinated and cooperative UAVs. Various tasks such as image processing, path planning, communication, data acquisition, and field mapping are incorporated in the architecture to provide an efficient system. In addition, various limitations of applying multi-UAVs in precision agriculture have been considered in designing the architecture. The architecture provides an autonomous end-to-end solution, spanning mission planning, data acquisition, and a highly efficient image processing framework, enabling farmers to comprehensively deploy UAVs onto their lands. Simulation and field tests show that the architecture offers a number of advantages, including fault tolerance, robustness, and developer- and user-friendliness.
[MA-10] Cyber Deception for Mission Surveillance via Hypergame-Theoretic Deep Reinforcement Learning
[Quick Read]: This paper addresses denial-of-service (DoS) attacks against unmanned aerial vehicles (UAVs) in mission-critical systems, where attackers disrupt the system by exhausting the resources of mission drones (MDs). The key is a cyber-deception defense using honey drones (HDs): by emitting stronger radio signals than the MDs, HDs bait attackers away from the real mission assets, enabling proactive defense. To optimize attacker and defender strategies that balance mission performance against energy consumption, the authors propose HT-DRL, a novel method that feeds hypergame-theoretic solutions into the neural network of deep reinforcement learning, converging quickly to optimal solutions without long training. Experiments show up to two times better mission performance than conventional non-honey-drone baselines while keeping energy consumption low.
Link: https://arxiv.org/abs/2603.20981
Authors: Zelin Wan, Jin-Hee Cho, Mu Zhu, Ahmed H. Anwar, Charles Kamhoua, Munindar P. Singh
Affiliations: Virginia Tech; Chinese Academy of Sciences; North Carolina State University; US Army Research Laboratory
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 23 pages, 21 figures
Abstract:Unmanned Aerial Vehicles (UAVs) are valuable for mission-critical systems like surveillance, rescue, or delivery. Not surprisingly, such systems attract cyberattacks, including Denial-of-Service (DoS) attacks to overwhelm the resources of mission drones (MDs). How can we defend UAV mission systems against DoS attacks? We adopt cyber deception as a defense strategy, in which honey drones (HDs) are proposed to bait and divert attacks. The attack and deceptive defense hinge upon radio signal strength: The attacker selects victim MDs based on their signals, and HDs attract the attacker from afar by emitting stronger signals, despite this reducing battery life. We formulate an optimization problem for the attacker and defender to identify their respective strategies for maximizing mission performance while minimizing energy consumption. To address this problem, we propose a novel approach, called HT-DRL. HT-DRL identifies optimal solutions without a long learning convergence time by taking the solutions of hypergame theory into the neural network of deep reinforcement learning. This achieves a systematic way to intelligently deceive attackers. We analyze the performance of diverse defense mechanisms under different attack strategies. Further, the HT-DRL-based HD approach outperforms existing non-HD counterparts up to two times better in mission performance while incurring low energy consumption.
[MA-11] Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models
[Quick Read]: This paper addresses the semantic inconsistency, distribution, and heterogeneity of today's geospatial data ecosystems, driven by the growing volume, variety, and velocity of geospatial data, along with the difficulty of accurately capturing user intent: traditional keyword-based catalogs and portals offer limited semantic support and deliver weak retrieval performance. The key is a knowledge graph-driven multi-agent framework: a unified geospatial metadata ontology serves as a semantic mediation layer to consistently align heterogeneous metadata standards across platforms; on top of it, a geospatial metadata knowledge graph explicitly models datasets and their multidimensional relationships; and a multi-agent collaborative architecture performs intent parsing, knowledge-graph retrieval, and answer synthesis, forming an interpretable, closed-loop discovery pipeline from user query to results.
Link: https://arxiv.org/abs/2603.20670
Authors: Ruixiang Liu, Zhenlong Li, Ali Khosravi Kazazi
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword-based search with limited semantic support, which often fails to capture user intent and leads to weak retrieval performance. To address these challenges, this study proposes a knowledge graph-driven multi-agent framework for intelligent geospatial data discovery, powered by large language models. The framework introduces a unified geospatial metadata ontology as a semantic mediation layer to align heterogeneous metadata standards across platforms and constructs a geospatial metadata knowledge graph to explicitly model datasets and their multidimensional relationships. Building on the structured representation, the framework adopts a multi-agent collaborative architecture to perform intent parsing, knowledge graph retrieval, and answer synthesis, forming an interpretable and closed-loop discovery process from user queries to results. Results from representative use cases and performance evaluation show that the framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared with traditional systems. This study advances geospatial data discovery toward a more semantic, intent-aware, and intelligent paradigm, providing a practical foundation for next-generation intelligent and autonomous spatial data infrastructures and contributing to the broader vision of Autonomous GIS.
[MA-12] Position: Multi-Agent Algorithmic Care Systems Demand Contestability for Trustworthy AI
[Quick Read]: This paper addresses the trust, accountability, and oversight challenges that arise when multi-agent systems (MAS) in healthcare lack effective mechanisms for human intervention. Existing explainable AI (XAI) methods provide transparency into decision processes but do not let clinicians challenge or correct system outputs, weakening human agency in high-stakes medical decision-making. The core of the proposed solution is Contestable AI (CAI): structured human-in-the-loop mechanisms spanning the decision lifecycle, combining transparency guarantees, institutionalized opportunities for intervention, and feedback channels for review, correction, or override, so that humans retain ongoing control over algorithmic systems, preserving clinical responsibility, strengthening trust, and ensuring trustworthy human-machine collaboration.
Link: https://arxiv.org/abs/2603.20595
Authors: Truong Thanh Hung Nguyen, Hélène Fournier, Piper Jackson, Makoto Itoh, Shannon Freeman, Rene Richard, Hung Cao
Affiliations: Analytics Everywhere Lab, University of New Brunswick, Canada; National Research Council Canada, Canada; Thompson Rivers University, Canada; ISB Corporation, Japan; University of Northern British Columbia, Canada
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Multi-agent systems (MAS) are increasingly used in healthcare to support complex decision-making through collaboration among specialized agents. Because these systems act as collective decision-makers, they raise challenges for trust, accountability, and human oversight. Existing approaches to trustworthy AI largely rely on explainability, but explainability alone is insufficient in multi-agent settings, as it does not enable care partners to challenge or correct system outputs. To address this limitation, Contestable AI (CAI) characterizes systems that support effective human challenge throughout the decision-making lifecycle by providing transparency, structured opportunities for intervention, and mechanisms for review, correction, or override. This position paper argues that contestability is a necessary design requirement for trustworthy multi-agent algorithmic care systems. We identify key limitations in current MAS and Explainable AI (XAI) research and present a human-in-the-loop framework that integrates structured argumentation and role-based contestation to preserve human agency, clinical responsibility, and trust in high-stakes care contexts.
[MA-13] LASER: Level-Based Asynchronous Scheduling and Execution Regime for Spatiotemporally Constrained Multi-Robot Timber Manufacturing ICRA2026
[Quick Read]: This paper addresses the tightly coupled spatiotemporal constraints that multi-robot systems face when executing complex assembly tasks in large-scale manufacturing, such as collision avoidance and process-driven deadlines. The key is the proposed LASER (Level-based Asynchronous Scheduling and Execution Regime) framework, which embeds a barrier-based mechanism into a constraint programming (CP) scheduling formulation that partitions tasks into spatiotemporally disjoint levels: robots execute asynchronously and in parallel within a level and synchronize only at level boundaries, guaranteeing collision-free operation by construction and strengthening robustness to timing uncertainty.
Link: https://arxiv.org/abs/2603.20577
Authors: Zhenxiang Huang, Lior Skoury, Tim Stark, Aaron Wagner, Hans Jakob Wagner, Thomas Wortmann, Achim Menges
Affiliations: Institute for Computational Design and Construction (ICD); Cluster of Excellence IntCDC; University of Stuttgart
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: to be published in ICRA 2026. Supplementary video: this https URL
Abstract:Automating large-scale manufacturing in domains like timber construction requires multi-robot systems to manage tightly coupled spatiotemporal constraints, such as collision avoidance and process-driven deadlines. This paper introduces LASER (Level-based Asynchronous Scheduling and Execution Regime), a complete framework for scheduling and executing complex assembly tasks, demonstrated on a screw-press gluing application for timber slab manufacturing. Our central contribution is to integrate a barrier-based mechanism into a constraint programming (CP) scheduling formulation that partitions tasks into spatiotemporally disjoint sets, which we define as levels. This structure enables robots to execute tasks in parallel and asynchronously within a level, synchronizing only at level barriers, which guarantees collision-free operation by construction and provides robustness to timing uncertainties. To solve this formulation for large problems, we propose two specialized algorithms: an iterative temporal-relaxation approach for heterogeneous task sequences and a bi-level decomposition for homogeneous tasks that balances workload. We validate the LASER framework by fabricating a full-scale 2.4m x 6m timber slab with a two-robot system mounted on parallel linear tracks, successfully coordinating 108 subroutines and 352 screws under tight adhesive time windows. Computational studies show our method scales steadily with size compared to a monolithic approach.
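The level/barrier execution regime described above, asynchronous work within a level, synchronization only at level boundaries, can be sketched with threads and a barrier; the scheduling skeleton below is an illustrative reading of the abstract, not the paper's CP formulation:

```python
import threading

def run_levels(levels, num_robots, execute):
    """Execute tasks level by level across a robot team.

    `levels` is a list of dicts mapping robot id -> list of tasks.
    Within a level each robot works through its own tasks
    asynchronously; a barrier at the level boundary makes all robots
    synchronize before the next level starts. (Illustrative sketch only.)
    """
    barrier = threading.Barrier(num_robots)

    def worker(robot):
        for level in levels:
            for task in level.get(robot, []):
                execute(robot, task)  # asynchronous within the level
            barrier.wait()            # synchronize at the level boundary

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_robots)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

log = []
run_levels(
    levels=[{0: ["drill_A"], 1: ["glue_B"]}, {0: ["screw_C"], 1: []}],
    num_robots=2,
    execute=lambda r, t: log.append((r, t)),
)
```

Because levels are spatiotemporally disjoint by construction, no cross-robot collision checking is needed inside a level; the only coordination cost is the wait at each barrier.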
[MA-14] Multi-Robot Learning-Informed Task Planning Under Uncertainty ICRA2026
[Quick Read]: This paper addresses how a multi-robot team can complete complex tasks in minimum time in environments where the locations of task-relevant objects are unknown. Robots must reason over long horizons about likely object locations, how individual actions contribute to overall progress, and how to coordinate the team, and uncertainty further complicates long-horizon decision-making and coordination. The key is a multi-robot planning abstraction that integrates learning, used to estimate uncertain aspects of the environment such as object distributions, with model-based planning for long-horizon coordination, effectively improving task execution efficiency in complex, dynamic environments.
Link: https://arxiv.org/abs/2603.20544
Authors: Abhish Khanal, Abhishek Paudel, Hung Pham, Gregory J. Stein
Affiliations: George Mason University
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 8 pages, 8 figures. Accepted at ICRA 2026
Abstract:We want a multi-robot team to complete complex tasks in minimum time where the locations of task-relevant objects are not known. Effective task completion requires reasoning over long horizons about the likely locations of task-relevant objects, how individual actions contribute to overall progress, and how to coordinate team efforts. Planning in this setting is extremely challenging: even when task-relevant information is partially known, coordinating which robot performs which action and when is difficult, and uncertainty introduces a multiplicity of possible outcomes for each action, which further complicates long-horizon decision-making and coordination. To address this, we propose a multi-robot planning abstraction that integrates learning to estimate uncertain aspects of the environment with model-based planning for long-horizon coordination. We demonstrate the efficient multi-stage task planning of our approach for 1, 2, and 3 robot teams over competitive baselines in large ProcTHOR household environments. Additionally, we demonstrate the effectiveness of our approach with a team of two LoCoBot mobile robots in real household settings.
[MA-15] Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?
[Quick Read]: This paper addresses how current reasoning language models (RLMs) pursue final-answer correctness while neglecting the legibility of their reasoning processes. The study finds that although the reasoning traces of high-performing RLMs excel on accuracy, their legibility is often low, making them hard for weaker models to use effectively to reach correct answers. The key is a new metric, transfer utility, which quantifies how useful an RLM's reasoning traces are for guiding a non-reasoning model, revealing a trade-off between reasoning efficiency (e.g., trace length) and transfer utility that forms a legibility Pareto frontier. The study also shows that the reward models used to train RLMs do not intrinsically incentivize legibility, so future evaluation and training designs will be needed to optimize the interpretability and utility of reasoning traces, supporting reasoning architectures for multi-agent collaborative settings.
Link: https://arxiv.org/abs/2603.20508
Authors: Dani Roytburg, Shreya Sridhar, Daphne Ippolito
Affiliations: unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Language models are increasingly being trained to “reason” before answering users’ queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models’ ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM’s reasoning traces are for guiding a weaker, non-reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest-performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency-based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM’s ability to output highly legible traces can be a task- and audience-dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi-agent future.
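Transfer utility, as described in the abstract, can be operationalized as the accuracy lift a weaker model gets when conditioned on the RLM's trace; the function below is one plausible reading of that idea, with `weak_answer` standing in for a weak-model call, not the paper's exact protocol:

```python
def transfer_utility(weak_answer, questions, traces, gold):
    """Accuracy gain of a weak, non-reasoning model when given RLM traces.

    weak_answer(question, context) -> answer string; context is None for
    the no-trace baseline. This operationalization is an illustrative
    assumption based on the abstract, not the paper's exact metric.
    """
    base = sum(weak_answer(q, None) == g for q, g in zip(questions, gold))
    with_trace = sum(
        weak_answer(q, t) == g for q, t, g in zip(questions, traces, gold)
    )
    return (with_trace - base) / len(questions)

# Toy stand-in: this "weak model" only answers correctly when a trace
# is provided, so the trace carries all the utility.
util = transfer_utility(
    weak_answer=lambda q, ctx: "yes" if ctx else "no",
    questions=["q1", "q2"],
    traces=["trace1", "trace2"],
    gold=["yes", "yes"],
)
```

A value near zero would mean the trace adds nothing over the weak model's prior; a negative value would mean an illegible trace actively misleads it.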
[MA-16] Hetero-Net: An Energy-Efficient Resource Allocation and 3D Placement in Heterogeneous LoRa Networks via Multi-Agent Optimization
【速读】:该论文旨在解决当前基于LoRa的无线传感器网络(Wireless Sensor Networks, WSNs)与地下无线传感器网络(Wireless Underground Sensor Networks, WUSNs)在设计上相互割裂的问题,导致跨地表与地下环境的连通性效率低下且缺乏统一优化。其解决方案的关键在于提出一种统一的异构LoRa框架——Hetero-Net,通过将多种LoRa终端设备与多架搭载LoRa网关的无人机(Unmanned Aerial Vehicles, UAVs)集成,并联合优化扩频因子(spreading factor)、传输功率以及UAV的三维(3D)部署位置,以最大化系统能效。为应对系统的动态性和部分可观测特性,作者将其建模为部分可观测随机博弈(Partially Observable Stochastic Game, POSG),并采用多智能体近端策略优化(Multi-Agent Proximal Policy Optimization, MAPPO)方法求解,实验证明该方案在能效方面显著优于传统孤立部署方式。
链接: https://arxiv.org/abs/2603.20404
作者: Abdullahi Isa Ahmed,Ana Maria Drăgulinescu,El Mehdi Amhoud
机构: Mohammed VI Polytechnic University (UM6P); Politehnica Bucharest
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 6 pages, 7 figures
Abstract:The evolution of Internet of Things (IoT) into multi-layered environments has positioned Low-Power Wide Area Networks (LPWANs), particularly Long Range (LoRa), as the backbone for connectivity across both surface and subterranean landscapes. However, existing LoRa-based network designs often treat ground-based wireless sensor networks (WSNs) and wireless underground sensor networks (WUSNs) as separate systems, resulting in inefficient and non-integrated connectivity across diverse environments. To address this, we propose Hetero-Net, a unified heterogeneous LoRa framework that integrates diverse LoRa end devices with multiple unmanned aerial vehicle (UAV)-mounted LoRa gateways. Our objective is to maximize system energy efficiency through the joint optimization of the spreading factor, transmission power, and three-dimensional (3D) placement of the UAVs. To manage the dynamic and partially observable nature of this system, we model the problem as a partially observable stochastic game (POSG) and address it using a multi-agent proximal policy optimization (MAPPO) framework. An ablation study shows that our proposed MAPPO Hetero-Net significantly outperforms traditional, isolated network designs, achieving energy efficiency improvements of 55.81% and 198.49% over isolated WSN-only and WUSN-only deployments, respectively.
[MA-17] ALARA for Agents : Least-Privilege Context Engineering Through Portable Composable Multi-Agent Teams
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems)中行为规范碎片化与上下文管理缺乏统一机制的问题,这导致个体人机交互质量下降以及团队间难以协同维护共享的智能体基础设施。现有框架将工具访问权限、上下文范围等关键配置分散在自然语言说明文件、框架内部配置和独立运行的MCP服务器中,难以版本控制、共享或协作更新。其解决方案的核心是提出一种声明式上下文-智能体-工具(Context-Agent-Tool, CAT)数据层,通过结构化文件明确限定每个智能体的角色所需最小权限,并配套开发命令行工具npcsh来执行该层,从而实现对智能体行为的强约束性控制——即修改工具列表会直接触发可预测的行为变更,而非依赖模型自行理解的模糊指令。
链接: https://arxiv.org/abs/2603.20380
作者: Christopher J. Agostino,Nayan D’Souza
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to HAXD 2026, 8 pages, 6 figures, framework and benchmark are open source at this https URL
Abstract:Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, yet the frameworks through which these systems operate do not provide a simple, unified mechanism for scalably managing the critical aspects of the agent harness, impacting both the quality of individual human-agent interactions and the capacity for practitioners to coordinate toward common goals through shared agent infrastructure. Agent frameworks have enabled increasingly sophisticated multi-agent systems, but the behavioral specifications that define what these agents can do remain fragmented across prose instruction files, framework-internal configuration, and mechanisms like MCP servers that operate separately from individual agent definitions, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to agent context, we introduce a declarative context-agent-tool (CAT) data layer expressed through interrelated files that scope each agent’s tool access and context to the minimum its role requires, and npcsh, a command-line shell for executing it. Because the system parses and enforces these files structurally, modifying an agent’s tool list produces a guaranteed behavioral change rather than a suggestion the model may or may not follow. We evaluate 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation, characterizing which model families succeed at which task categories and where they break down across ~2,500 total executions.
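论文强调 CAT 数据层由系统“结构化解析并强制执行”:修改工具列表即产生确定的行为变更,而非模型可以忽略的提示。下面是该思想的一个假设性极简草图(非 npcsh 实现,代理名、工具名与数据结构均为虚构):运行时在每次工具调用前检查代理的声明式允许列表。

```python
# 最小权限工具作用域的示意:每个代理只声明其角色所需的工具,
# 运行时结构化地强制执行该允许列表(数据与名称均为虚构)。
CAT = {
    "researcher": {"tools": {"web_search", "read_file"}},
    "scripter":   {"tools": {"read_file", "run_script"}},
}

def invoke(agent, tool):
    allowed = CAT[agent]["tools"]
    if tool not in allowed:
        raise PermissionError(f"{agent} may not use {tool}")
    return f"{agent} ran {tool}"

ok = invoke("scripter", "run_script")      # 在允许列表内:放行
try:
    invoke("researcher", "run_script")     # 不在允许列表内:结构化拒绝
    denied = False
except PermissionError:
    denied = True
```

由于检查发生在框架层而非提示层,把 "run_script" 从某个代理的列表中删掉,就保证该代理无法再调用它——这正是摘要中“guaranteed behavioral change”的含义。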
[MA-18] Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms
【速读】:该论文旨在解决多智能体系统中异构学习机制在不同时间尺度下耦合动态行为的稳定性问题,即如何形式化保证这些机制的协同演化始终处于可接受的操作范围内。解决方案的关键在于构建一个三层次的 swarm learning 系统架构,分别对应快速(10–100 ms)的局部赫布在线学习、中速(1–10 s)的多智能体强化学习(MARL)以及慢速(10–100 s)的元学习(MAML),并通过四个理论定理实现对系统误差、表示漂移、层级兼容性和误差累积性的严格约束:其中,学习率的合同性约束、映射的利普希茨连续性与权重稳定条件共同构成了总次优性有界性的核心保障机制,从而确保整个系统的长期运行稳定性。
链接: https://arxiv.org/abs/2603.20333
作者: Oleksii Bychkov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 25 pages, 3 tables
Abstract:Modern autonomous multi-agent systems combine heterogeneous learning mechanisms operating at different timescales. An open question remains: can one formally guarantee that coupled dynamics of such mechanisms stay within the admissible operational regime? This paper studies a tri-hierarchical swarm learning system where three mechanisms act simultaneously: (1) local Hebbian online learning at individual agent level (fast timescale, 10-100 ms); (2) multi-agent reinforcement learning (MARL) for tactical group coordination (medium timescale, 1-10 s); (3) meta-learning (MAML) for strategic adaptation (slow timescale, 10-100 s). Four results are established. The Bounded Total Error Theorem shows that under contractual constraints on learning rates, Lipschitz continuity of inter-level mappings, and weight stabilization, total suboptimality admits a component-wise upper bound uniform in time. The Bounded Representation Drift Theorem gives a worst-case estimate of how Hebbian updates affect coordination-level embeddings during one MARL cycle. The Meta-Level Compatibility Theorem provides sufficient conditions under which strategic adaptation preserves lower-level invariants. The Non-Accumulation Theorem proves that error does not grow unboundedly over time.
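“误差不随时间无界增长”这类结论的直观来源,可以用一个单变量收缩迭代来数值演示:若每个周期满足 e_{t+1} ≤ q·e_t + c 且 q < 1,则误差收敛到固定上界 c/(1-q)。以下常数均为演示用假设,仅说明收缩条件为何能给出与时间无关的一致上界,与论文定理的具体条件无关:

```python
# 收缩迭代示意:e_{t+1} = q*e_t + c, q < 1 ⇒ 误差有界,上界为 c/(1-q)
q, c = 0.8, 0.1          # 收缩因子与每周期扰动(虚构常数)
e = 5.0                  # 较大的初始次优性
history = []
for _ in range(200):
    e = q * e + c
    history.append(e)
bound = c / (1 - q)      # 固定点上界 = 0.5
```

无论初始误差多大,序列单调收敛到 c/(1-q) 而不会累积发散——这正是 Non-Accumulation 类定理中“学习率合同性约束”所起的作用的一维缩影。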
[MA-19] When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines
【速读】:该论文旨在解决多智能体大语言模型(Multi-agent LLM)流水线中一个关键争议:团队多样性是否有助于提升输出质量。现有研究存在矛盾结论——异构的Mixture-of-Agents(MoA)团队优于单一模型,而同质的Self-MoA团队在基于合成的聚合方式下始终表现更优。作者提出通过识别“选择瓶颈”(selection bottleneck)来统一解释这一现象,即聚合质量存在一个交叉阈值 $s^*$,决定了多样性是促进还是损害性能。解决方案的关键在于推导出该阈值的闭式表达(Proposition 1),并通过覆盖42个任务的实证实验验证:采用基于评判者的选择机制,多样团队胜率高达0.810,显著优于单模型基线($\Delta = 2.07$),且显著优于MoA合成方法($\Delta_\mathrm{WR} = +0.631$)。结果表明,在单轮“生成-选择”范式中,选择器质量可能比生成器多样性更具影响力。
链接: https://arxiv.org/abs/2603.20324
作者: Artem Maryanskyy
机构: Uber(优步)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 5 tables
Abstract:Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck – a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the regimes where diversity helps and hurts. In a targeted experiment spanning 42 tasks across 7 categories (N=210), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512 – near chance (Glass’s $\Delta = 2.07$). Judge-based selection outperforms MoA-style synthesis by $\Delta_\mathrm{WR} = +0.631$ – the synthesis approach is preferred over the baseline in zero of 42 tasks by the judge panel. A decoupled evaluation with independent judges confirms all directional findings (Spearman $\rho = 0.90$). Exploratory evidence suggests that including a weaker model improves performance while reducing cost ($p < 10^{-4}$, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.
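“选择瓶颈”的直觉可以用一个玩具模型演示(这是本文整理时的简化,并非论文中 $s^*$ 的推导):多样化团队只有在选择器足够准确、能从候选中挑出正确答案时才优于单模型;选择器较弱时,覆盖率优势被选择误差抵消。

```python
# 玩具模型:团队准确率 ≈ P(至少一个候选正确) × 选择器挑中正确候选的概率
def team_accuracy(p_correct_each, selector_acc):
    p_none = 1.0
    for p in p_correct_each:
        p_none *= (1 - p)          # 所有候选都错的概率
    return (1 - p_none) * selector_acc

single = 0.6                                        # 单一强模型的准确率
weak_sel = team_accuracy([0.6, 0.5, 0.4], 0.6)      # ≈0.528:多样性有害
strong_sel = team_accuracy([0.6, 0.5, 0.4], 0.9)    # ≈0.792:多样性有益
```

同一支多样化团队,仅因选择器质量从 0.6 提高到 0.9,就从劣于单模型变为明显占优——这与论文“选择器质量是比生成器多样性更关键的设计杠杆”的结论方向一致。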
[MA-20] Reason -to-Transmit: Deliberative Adaptive Communication for Cooperative Perception
【速读】:该论文旨在解决车辆与万物互联(V2X)网络中因带宽限制导致的协同感知效率低下问题,即如何在有限通信资源下实现高效、智能的信息共享。现有方法多依赖于反应式机制(如置信度图、学习到的门控或稀疏掩码)决定传输内容,但缺乏对信息传递价值的因果推理。其解决方案的关键在于提出一种名为“Reason-to-Transmit”(R2T)的框架,该框架为每个代理配备轻量级基于Transformer的推理模块,能够综合本地场景上下文、邻近节点的信息缺口以及带宽预算,做出针对每个区域的决策性传输选择;该框架通过带宽感知的目标端到端训练,在高遮挡场景下显著优于其他选择性传输方法,接近理想信息共享性能,体现了基于推理的通信策略在复杂环境中的有效性。
链接: https://arxiv.org/abs/2603.20308
作者: Aayam Bansal,Ishaan Gangwani
机构: Synthetic Sciences
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Cooperative perception among autonomous agents overcomes the limitations of single-agent sensing, but bandwidth constraints in vehicle-to-everything (V2X) networks require efficient communication policies. Existing approaches rely on reactive mechanisms, such as confidence maps, learned gating, or sparse masks, to decide what to transmit, without reasoning about why a message benefits the receiver. We introduce Reason-to-Transmit (R2T), a framework that equips each agent with a lightweight transformer-based module that reasons over local scene context, estimated neighbor information gaps, and bandwidth budget to make per-region transmission decisions. Trained end-to-end with a bandwidth-aware objective, R2T is evaluated against nine baselines in a multi-agent bird’s-eye-view perception environment. Any communication improves performance by about 58% AP over no communication. At low bandwidth, all selective methods perform similarly, but R2T shows clear gains under high occlusion, where information asymmetry is greatest, approaching oracle performance. All methods degrade gracefully under packet drops up to 50%, showing robustness to communication failures. These results indicate that while fusion design dominates performance, deliberative communication provides additional gains in challenging scenarios. R2T introduces a reasoning-based approach to communication, enabling more efficient and context-aware information sharing in cooperative perception.
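R2T 的核心是带宽预算下的逐区域传输决策。下面用一个假设性草图演示这一决策形式:按估计的接收方收益对区域打分,在预算内贪心选取得分最高的区域。R2T 中分数由 Transformer 模块端到端学习得到,此处为手工设定,仅示意决策结构:

```python
# 带宽预算下的逐区域传输选择(分数与代价均为虚构示意)
def select_regions(scores, costs, budget):
    order = sorted(scores, key=scores.get, reverse=True)   # 按收益降序
    chosen, spent = [], 0
    for r in order:
        if spent + costs[r] <= budget:
            chosen.append(r)
            spent += costs[r]
    return chosen

scores = {"r1": 0.9, "r2": 0.2, "r3": 0.7}   # 高遮挡区域的信息收益更高
costs = {"r1": 2, "r2": 1, "r3": 2}
sent = select_regions(scores, costs, budget=4)   # ["r1", "r3"]
```

在高遮挡场景中,这类“为什么传”的收益估计差异最大,对应摘要中 R2T 相对反应式基线取得最明显增益的情形。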
[MA-21] Learning Communication Between Heterogeneous Agents in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence
【速读】:该论文旨在解决企业网络中面临的网络攻击威胁,特别是如何通过多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)提升对攻击的响应能力。其核心问题是现有研究多基于同质化智能体,缺乏对异构能力智能体在复杂网络环境中协同防御效果的深入探索。解决方案的关键在于引入具有异构能力的智能体,并采用先进的通信算法CommFormer,在Cyber Operations Research Gym(CybORG)仿真环境中进行训练与评估。实验表明,此类异构智能体能够比其他算法更快收敛至最优策略(速度提升达4倍),同时标准误差降低最高达38%,从而为生成式AI(Generative AI)在网络安全领域的应用提供了新的可行路径。
链接: https://arxiv.org/abs/2603.20279
作者: Alex Popa,Adrian Taylor,Ranwa Al Mallah
机构: Royal Military College of Canada (皇家军事学院); Defence Research and Development Canada (加拿大国防研究与发展局); Polytechnique Montréal (蒙特利尔工程学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 6 pages, 3 figures, 1 algorithm, conference paper. CyMARL-CommFormer code available at this https URL
Abstract:Reinforcement learning techniques are being explored as solutions to the threat of cyber attacks on enterprise networks. Recent research in the field of AI in cyber security has investigated the ability of homogeneous multi-agent reinforcement learning agents, capable of inter-agent communication, to respond to cyberattacks. This paper advances the study of learned communication in multi-agent systems by examining heterogeneous agent capabilities within a simulated network environment. To this end, we leverage CommFormer, a publicly available state-of-the-art communication algorithm, to train and evaluate agents within the Cyber Operations Research Gym (CybORG). Our results show that CommFormer agents with heterogeneous capabilities can outperform other algorithms deployed in the CybORG environment, by converging to an optimal policy up to four times faster while improving standard error by up to 38%. The agents implemented in this project provide an additional avenue for exploration in the field of AI for cyber security, enabling further research involving realistic networks.
[MA-22] JCAS-MARL: Joint Communication and Sensing UAV Networks via Resource-Constrained Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决多无人机(UAV)网络在大规模巡检与监测任务中,如何协同优化感知可靠性、通信质量与能量约束的问题,尤其针对垃圾热点区域的高效检测需求。其解决方案的关键在于提出了一种资源感知的多智能体强化学习(MARL)框架——JCAS-MARL,该框架使多个UAV能够在共享环境中联合控制自身轨迹与用于感知和通信的OFDM波形资源分配,并将电池消耗、充电行为及二氧化碳排放纳入系统状态以建模真实运行约束;同时通过动态通信图实现信息共享,利用多UAV共识机制提升检测可靠性,从而有效平衡感知-通信-能量之间的权衡关系。
链接: https://arxiv.org/abs/2603.20265
作者: Islam Guven,Mehmet Parlak
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: 6 pages, 8 figures, submitted to the conference
Abstract:Multi-UAV networks are increasingly deployed for large-scale inspection and monitoring missions, where operational performance depends on the coordination of sensing reliability, communication quality, and energy constraints. In particular, the rapid increase in overflowing waste bins and illegal dumping sites has created a need for efficient detection of waste hotspots. In this work, we introduce JCAS-MARL, a resource-aware multi-agent reinforcement learning (MARL) framework for joint communication and sensing (JCAS)-enabled UAV networks. Within this framework, multiple UAVs operate in a shared environment where each agent jointly controls its trajectory and the resource allocation of an OFDM waveform used simultaneously for sensing and communication. Battery consumption, charging behavior, and associated CO _2 emissions are incorporated into the system state to model realistic operational constraints. Information sharing occurs over a dynamic communication graph determined by UAV positions and wireless channel conditions. Waste hotspot detection requires consensus among multiple UAVs to improve reliability. Using this environment, we investigate how MARL policies exploit the sensing-communication-energy trade-off in JCAS-enabled UAV networks. Simulation results demonstrate that adaptive pilot-density control learned by the agents can outperform static configurations, particularly in scenarios where sensing accuracy and communication connectivity vary across the environment.
[MA-23] SciNav: A General Agent Framework for Scientific Coding Tasks ICLR2026
【速读】:该论文旨在解决当前科学代理(Science Agent)在科学编程任务中缺乏结构化、端到端框架的问题,尤其针对现有方法依赖预定义成功指标和冗长搜索周期的局限性。其解决方案的关键在于提出一种名为SciNav(Scientific Navigator)的代理框架,该框架通过引入基于成对相对判断(pairwise relative judgments)的树搜索机制,在有限的搜索预算下实现更高效的解空间探索:利用相对比较来筛选top-K有潜力的解分支、剪枝低潜力路径,并逐步聚焦于高质量候选解,从而显著提升科学编程任务中的输出质量与效率。
链接: https://arxiv.org/abs/2603.20256
作者: Tianshu Zhang,Huan Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: Accepted by ICLR 2026
Abstract:Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to evaluate. Scientific coding benchmarks, by contrast, provide executable outputs for objective assessment. Existing approaches remain engineering-driven pipelines, revealing the need for structured, end-to-end science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be made rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent’s effectiveness across different types of tasks on two benchmarks. Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds different frontier comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.
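SciNav 的“成对相对判断 + top-K 剪枝”可以用如下极简草图说明。评判函数此处用一个虚构的“隐藏质量”代替真实的 LLM 评判,仅演示搜索机制本身——靠两两比较积累胜场,保留胜场最多的 K 个解分支:

```python
from itertools import combinations

# 成对相对判断驱动的 top-K 筛选(候选与评判均为虚构示意)
def top_k(candidates, judge, k):
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[judge(a, b)] += 1        # judge 返回两者中更优的候选
    return sorted(candidates, key=lambda c: wins[c], reverse=True)[:k]

quality = {"sol_a": 0.3, "sol_b": 0.9, "sol_c": 0.6, "sol_d": 0.1}
judge = lambda a, b: a if quality[a] >= quality[b] else b   # 虚构评判器
survivors = top_k(list(quality), judge, k=2)   # ["sol_b", "sol_c"]
```

相对比较只需分辨“哪个更好”,不要求给出绝对分数,这正是论文选择成对判断而非绝对打分的理由。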
[MA-24] Stability of AI Governance Systems: A Coupled Dynamics Model of Public Trust and Social Disruptions
【速读】:该论文旨在解决当前人工智能(AI)治理研究中缺乏形式化数学框架的问题,特别是如何精确刻画公众信任在高风险公共决策场景下崩溃的条件。其解决方案的关键在于提出一个耦合动态模型,将离散时间的霍克斯过程(Hawkes process)与弗里德金-约翰森意见动力学模型(Friedkin-Johnsen opinion dynamics model)相结合,以模拟AI争议事件(如算法偏见或问责失败)的自我激发传播与社会网络中机构信任的演化。该模型的核心创新在于引入双向反馈机制:信任下降会增强后续争议事件的强度,而争议事件又进一步削弱信任,形成自我强化的信任崩塌循环。通过推导闭式平衡解并进行稳定性分析,论文识别出临界谱条件 ρ(J₂ₙₜ) < 1,明确划分了信任韧性与系统性崩溃的边界,从而为AI治理中的信任危机提供了可量化的理论基础和预警机制。
链接: https://arxiv.org/abs/2603.20248
作者: Jiaqi Lai,Hou Liang,Weihong Huang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 15 pages, 8 figures. Equal contribution by Jiaqi Lai and Hou Liang
Abstract:As artificial intelligence (AI) is increasingly deployed in high-stakes public decision-making (from resource allocation to welfare distribution), public trust in these systems has become a critical determinant of their legitimacy and sustainability. Yet existing AI governance research remains largely qualitative, lacking formal mathematical frameworks to characterize the precise conditions under which public trust collapses. This paper addresses that gap by proposing a rigorous coupled dynamics model that integrates a discrete-time Hawkes process – capturing the self-exciting generation of AI controversy events such as perceived algorithmic unfairness or accountability failures – with a Friedkin-Johnsen opinion dynamics model that governs the evolution of institutional trust across social networks. A key innovation is the bidirectional feedback mechanism: declining trust amplifies the intensity of subsequent controversy events, which in turn further erode trust, forming a self-reinforcing collapse loop. We derive closed-form equilibrium solutions and perform formal stability analysis, establishing the critical spectral condition $\rho(J_{2nt}) < 1$ that delineates the boundary between trust resilience and systemic collapse. Numerical experiments further reveal how echo chamber network structures and media amplification accelerate governance failure. Our core contribution to the AI governance field is a baseline collapse model: a formal stability analysis framework demonstrating that, absent strong institutional intervention, even minor algorithmic biases can propagate through social networks to trigger irreversible trust breakdown in AI governance systems.
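论文的核心反馈环——信任下降放大后续争议事件,事件又进一步侵蚀信任——可以用一个标量玩具模拟直观展示。以下动力学与常数均为示意性简化,并非论文的耦合 Hawkes–Friedkin-Johnsen 模型本身;它只说明同一反馈结构在不同干预强度下,分别落入信任维持与信任崩塌两个区域:

```python
# 标量玩具模型:自激事件强度 lam 与信任 trust 双向耦合(常数均为虚构)
def simulate(beta, recover, steps=400):
    mu, alpha, decay = 0.05, 0.3, 0.7    # 基线强度、自激增益、记忆衰减
    lam, trust = mu, 1.0
    for _ in range(steps):
        p = min(1.0, lam * (2.0 - trust))        # 低信任放大事件概率
        lam = mu + decay * (lam - mu) + alpha * p
        trust = max(0.0, min(1.0, trust - beta * p + recover))
    return lam, trust

_, trust_collapse = simulate(beta=0.4, recover=0.02)   # 弱干预:信任崩塌到 0
_, trust_stable = simulate(beta=0.05, recover=0.05)    # 强干预:信任维持在 1
```

两组参数仅改变“每次事件的信任侵蚀”与“机构修复速度”,就跨越了稳定边界——这对应论文中谱半径条件两侧的定性行为差异。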
[MA-25] Learning to Aggregate Zero-Shot LLM Agents for Corporate Disclosure Classification
【速读】:该论文旨在解决零样本大语言模型(Zero-shot Large Language Models, LLMs)在企业披露文本分类任务中预测结果不稳定的问题,即不同提示(prompt)、推理风格和模型家族导致的判断差异。其核心解决方案是构建一个轻量级的多智能体框架,其中三个独立的零样本LLM代理对每份披露文本分别输出情感标签、置信度分数和简短推理依据,随后通过一个逻辑回归元分类器(logistic meta-classifier)聚合这些信号,以提升对次日股票收益方向的预测准确性。关键创新在于利用监督式聚合机制将代理间的分歧转化为更具判别力的分类目标,从而显著优于单一代理、多数投票、置信度加权投票及FinBERT基线模型,尤其在结合强当前表现与弱指引或高风险情境的披露文本中效果最为突出。
链接: https://arxiv.org/abs/2603.20965
作者: Kemal Kirtac
机构: University College London (伦敦大学学院)
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST)
备注:
Abstract:This paper studies whether a lightweight trained aggregator can combine diverse zero-shot large language model judgments into a stronger downstream signal for corporate disclosure classification. Zero-shot LLMs can read disclosures without task-specific fine-tuning, but their predictions often vary across prompts, reasoning styles, and model families. I address this problem with a multi-agent framework in which three zero-shot agents independently read each disclosure and output a sentiment label, a confidence score, and a short rationale. A logistic meta-classifier then aggregates these signals to predict next-day stock return direction. I use a sample of 18,420 U.S. corporate disclosures issued by Nasdaq and S&P 500 firms between 2018 and 2024, matched to next-day stock returns. Results show that the trained aggregator outperforms all single agents, majority vote, confidence-weighted voting, and a FinBERT baseline. Balanced accuracy rises from 0.561 for the best single agent to 0.612 for the trained aggregator, with the largest gains in disclosures combining strong current performance with weak guidance or elevated risk. The results suggest that zero-shot LLM agents capture complementary financial signals and that supervised aggregation can turn cross-agent disagreement into a more useful classification target.
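聚合步骤的形式大致是:将各代理的情感标签与置信度作为特征,送入一个逻辑回归层输出方向概率。下面的草图中权重为手工设定的假设值(论文中由训练得到),仅演示特征构造与打分方式:

```python
import math

# 逻辑元分类器的示意:特征 = 各代理的 (标签 ∈ {-1,+1}, 置信度 ∈ [0,1])
def aggregate(agent_outputs, weights, bias=0.0):
    z = bias + sum(w * lab * conf
                   for w, (lab, conf) in zip(weights, agent_outputs))
    return 1.0 / (1.0 + math.exp(-z))    # P(次日收益方向为正)

outputs = [(+1, 0.9), (-1, 0.4), (+1, 0.7)]      # 三个代理意见分歧
p_up = aggregate(outputs, weights=[1.2, 0.8, 1.0])   # ≈0.81
```

与多数投票不同,训练得到的权重可以对更可靠的代理加权、对系统性偏差折价,这正是论文中“把代理间分歧变成有用信号”的机制。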
[MA-26] Agent ic Physical-AI for Self-Aware RF Systems
【速读】:该论文旨在解决射频(RF)收发器在动态工作条件下难以实现智能调控的问题,这是现代及未来通信系统中的关键挑战。解决方案的核心在于提出一种多智能体神经符号人工智能(neurosymbolic AI)系统,其中每个电路组件均由独立的AI代理负责控制,每个代理包含内部模型和对应的控制算法,从而实现对整个RF系统的智能化管理;文中以中频(IF)放大器建模为例展示了该方法的有效性,表明该框架可扩展至所有组件,构建完全智能的RF系统。
链接: https://arxiv.org/abs/2603.20692
作者: Linuka Ratnayake,Danidu Dabare,Sanuja Rupasinghe,Warren Jayakumar,Dileepa Marasinghe,Chamira U. S. Edussooriya,Arjuna Madanayake
机构: 未知
类目: Signal Processing (eess.SP); Multiagent Systems (cs.MA)
备注: 2 pages, 3 figures, Accepted for 2026 International Applied Computational Electromagnetics Society (ACES) Symposium
Abstract:Intelligent control of RF transceivers adapting to dynamic operational conditions is essential in the modern and future communication systems. We propose a multi-agent neurosymbolic AI system, where AI agents are assigned for circuit components. Agents have an internal model and a corresponding control algorithm as its constituents. Modeling of the IF amplifier shows promising results, where the same approach can be extended to all the components, thus creating a fully intelligent RF system.
[MA-27] AlphaLogics: A Market Logic-Driven Multi-Agent System for Scalable and Interpretable Alpha Factor Generation
【速读】:该论文旨在解决当前因子挖掘(factor mining)中过度依赖数据驱动的复杂因子发现,而忽视市场逻辑(market logic)本质的问题——即现有方法倾向于从历史数据中直接提取高预测能力的因子,但缺乏对这些因子背后稳定、可解释的经济机制的理解,导致因子缺乏跨资产和跨周期的持续性。其解决方案的核心是提出AlphaLogics,一个以市场逻辑驱动的多智能体系统,包含三个关键组件:(i) 市场逻辑挖掘(Market Logic Mining),通过逆向解析历史因子库构建初始市场逻辑库;(ii) 因子生成与优化,利用新生成的市场逻辑指导因子构造并结合回测反馈进行优化;(iii) 市场逻辑生成与优化,基于初始逻辑库生成新的市场逻辑,并通过其所引导因子的回测结果聚合更新逻辑本身,实现逻辑库的持续迭代。该框架实现了因子发现与市场逻辑发现的协同进化,显著提升了因子的预测性能与风险调整收益,同时保持了市场逻辑的可解释性和可扩展性。
链接: https://arxiv.org/abs/2603.20247
作者: Zhangyuhua Weng,Shengli Zhang,Taotao Wang,Yihan Xia
机构: 未知
类目: Computational Finance (q-fin.CP); Multiagent Systems (cs.MA)
备注:
Abstract:Factor investing is ultimately grounded in market logic - the latent mechanism behind observed alpha factors that explains why they should persist across assets and regimes. However, recent factor mining prioritizes factor discovery over logic discovery, producing complex alpha factors with unclear rationale, while market logic remains largely handcrafted and difficult to scale. To address this challenge, we propose AlphaLogics, a market logic-driven multi-agent system for factor mining. AlphaLogics consists of three key components: (i) Market Logic Mining: reverse-extracting market logic from historical factor libraries to construct an initial market logic library; (ii) Factor Generation and Optimization: using new market logics generated in (i) to guide factor generation, and optimizing factors with backtesting feedback; and (iii) Market Logic Generation and Optimization: generating new market logics conditioned on the initial market logic library, and refining each market logic by aggregating the backtest outcomes of its guided factors, continuously refreshing the library. Experiments on CSI 500 and S&P 500 show that AlphaLogics consistently improves predictive metrics and risk-adjusted returns over representative baselines, while producing a market logic library that remains empirically useful for guiding further factor discovery.
[MA-28] Designing Auctions when Algorithms Learn to Bid
【速读】:该论文旨在解决算法驱动的在线拍卖中因竞标者使用自动竞价算法而导致的隐性 bid suppression(隐性出价抑制)问题,以及由此引发的卖家收入损失。现有研究虽识别了不同算法机制导致出价抑制的具体原因,但尚不明确哪些因素最关键、它们如何相互作用,且政策建议往往基于与实际部署不符的算法模型。论文的关键解决方案是构建一个基于因子实验设计和大规模蒙特卡洛模拟的计算实验室框架,将每个模拟视为黑箱输入-输出观测,通过系统性地改变输入变量并评估其对结果的相关性来排序影响因素,而无需解释算法内部机制。该方法在Q-learning、上下文Bandits和预算约束下的 pacing 算法等多类算法中统一评估了拍卖形式、竞争压力、学习参数和预算约束对卖家收入的影响,发现结构性市场参数(如竞争压力和预算紧度)远比算法设计选择更重要,并揭示了最优拍卖形式依赖于具体竞价技术——这表明不存在适用于所有算法类型的“通用”最优拍卖形式。
链接: https://arxiv.org/abs/2306.09437
作者: Pranjal Rawat
机构: 未知
类目: General Economics (econ.GN); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:
Abstract:Algorithms increasingly automate bidding in online auctions, raising concerns about tacit bid suppression and revenue shortfalls. Prior work identifies individual mechanisms behind algorithmic bid suppression, but it remains unclear which factors matter most and how they interact, and policy conclusions rest on algorithms unlike those deployed in practice. This paper develops a computational laboratory framework, based on factorial experimental designs and large-scale Monte Carlo simulation, that addresses bid suppression across multiple algorithm classes within a common methodology. Each simulation is treated as a black-box input-output observation; the framework varies inputs and ranks factors by association with outcomes, without explaining algorithms’ internal mechanisms. Across six sub-experiments spanning Q-learning, contextual bandits, and budget-constrained pacing, the framework ranks the relative importance of auction format, competitive pressure, learning parameters, and budget constraints on seller revenue. The central finding is that structural market parameters dominate algorithmic design choices. In unconstrained settings, competitive pressure is the strongest predictor of revenue; under budget constraints, budget tightness takes over. The auction-format effect is context-dependent, favouring second-price under learning algorithms but reversing to favour first-price under budget-constrained pacing. Because the optimal format depends on the prevailing bidding technology, no single auction format is universally superior when bidders are algorithms, and applying format recommendations from one algorithm class to another leads to counterproductive design interventions.
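论文“把每次模拟当作黑箱输入-输出观测,按与结果的关联度对因素排序”的方法论,可以缩成如下玩具示例。模拟器与系数均为虚构,仅演示全因子设计加粗略敏感度排序的流程(真实场景中每个观测来自一次完整的拍卖 Monte Carlo 模拟):

```python
from itertools import product

# 虚构的黑箱"收入模拟器":竞争压力的系数被设得远大于拍卖形式
def toy_revenue(format_sp, n_bidders, budget_tight):
    return 0.3 * format_sp + 1.0 * n_bidders - 0.8 * budget_tight

# 全因子网格:2 种拍卖形式 × 3 档竞争者数 × 2 档预算紧度
runs = [(f, n, b, toy_revenue(f, n, b))
        for f, n, b in product([0, 1], [2, 4, 8], [0, 1])]

def effect(idx):
    # 粗略敏感度:某一因素在不同水平下的结果均值极差
    levels = sorted({r[idx] for r in runs})
    means = [sum(r[3] for r in runs if r[idx] == lv) /
             sum(1 for r in runs if r[idx] == lv) for lv in levels]
    return max(means) - min(means)

factors = {"format": 0, "competition": 1, "budget": 2}
ranking = sorted(factors, key=lambda n: effect(factors[n]), reverse=True)
```

在这个刻意构造的玩具里,排序结果与论文的核心发现同向:结构性市场参数(竞争、预算)压过拍卖形式这一设计选择。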
自然语言处理
[NLP-0] WorldCache: Content-Aware Caching for Accelerated Video World Models
【速读】: 该论文旨在解决扩散模型(Diffusion Models)在视频生成任务中因逐帧去噪和高代价时空注意力机制导致的计算效率低下问题。现有训练-free特征缓存方法依赖零阶保持(Zero-Order Hold)假设,即认为缓存特征在时间上是静态的,这在动态场景下常引发鬼影伪影、模糊和运动不一致等问题。其解决方案的关键在于提出WorldCache——一种感知约束的动力学缓存框架,通过引入运动自适应阈值、显著性加权漂移估计、最优混合与变形近似以及跨扩散步骤的相位感知阈值调度,实现更智能的特征重用时机与方式,从而在无需重新训练的前提下提升推理速度并保持高质量输出。
链接: https://arxiv.org/abs/2603.22286
作者: Umair Nawaz,Ahmed Heakl,Ufaq Khan,Abdelrahman Shaker,Salman Khan,Fahad Shahbaz Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 33 Pages
Abstract:Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed on this https URL.
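WorldCache 的“何时复用”决策可以用一个假设性草图说明(阈值形式与数值均为本文整理时的虚构示意,并非论文公式):显著性加权后的特征漂移需低于一个随运动强度收紧的阈值,才允许复用缓存特征,否则重新计算。

```python
# 运动自适应 + 显著性加权的缓存复用决策(形式与数值均为虚构示意)
def should_reuse(drift, saliency, motion, base_threshold=0.10):
    weighted_drift = drift * saliency              # 显著区域的漂移权重更高
    threshold = base_threshold / (1.0 + motion)    # 运动越快,阈值越紧
    return weighted_drift < threshold

static_scene = should_reuse(drift=0.06, saliency=0.5, motion=0.2)   # 复用
dynamic_scene = should_reuse(drift=0.06, saliency=0.9, motion=2.0)  # 重算
```

同样大小的全局漂移,在静态场景下可安全复用,在高运动、高显著场景下则触发重算——这就避免了零阶保持假设在动态场景中的鬼影与模糊。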
[NLP-1] hinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
【速读】: 该论文旨在解决当前潜在世界模型(latent world models)在短时观察窗口下进行密集预测时,因时间上下文有限而导致的局部低级外推偏差问题,以及视觉-语言模型(VLMs)作为独立密集预测器时存在的计算驱动稀疏采样、语言输出瓶颈和数据分布不匹配等局限性。解决方案的关键在于提出一种VLM引导的JEPA风格潜在世界建模框架,通过双时间路径结构实现细粒度运动与交互线索(dense JEPA分支)与长时程语义引导(uniformly sampled VLM thinker分支)的协同建模,并引入分层金字塔表示提取模块,将多层VLM特征聚合为与潜在预测兼容的指导特征,从而有效传递VLM的渐进推理信号,提升长期轨迹预测的鲁棒性和语义一致性。
链接: https://arxiv.org/abs/2603.22281
作者: Haichao Zhang,Yijiang Li,Shwai He,Tushar Nagarajan,Mingfei Chen,Jianglin Lu,Ang Li,Yun Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 10 pages, 5 figures
Abstract:Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision–language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM “thinker” branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM’s progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
[NLP-2] TiCo: Time-Controllable Training for Spoken Dialogue Models
【Quick Read】: This paper addresses the lack of time awareness in current spoken dialogue models (SDMs): they cannot reliably follow time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds"), which limits interaction quality in real-world applications. The key to the proposed TiCo method is introducing Spoken Time Markers (STM) that let the model estimate elapsed speaking time during generation (e.g., a marker at 10.6 seconds) and dynamically adjust the remaining content to meet the target duration. The method is simple and efficient, requiring only a small amount of data and no additional question-answer pairs, relying instead on self-generated samples and reinforcement learning; it significantly improves adherence to duration constraints while preserving response quality.
Link: https://arxiv.org/abs/2603.22267
Authors: Kai-Wei Chang,Wei-Chih Chen,En-Pei Hu,Hung-yi Lee,James Glass
Institution: MIT; NTU (National Taiwan University)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., “Please generate a response lasting about 15 seconds”). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., 10.6 seconds). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
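The abstract describes interleaving time markers into generated speech so the model can track elapsed duration. A minimal sketch of that idea, assuming toy per-token durations, a `<stm:X.Xs>` marker format, and a 5-second marker interval (all illustrative; the paper's actual marker design and placement rule may differ):

```python
# Hedged sketch: interleaving Spoken Time Markers (STM) into a token stream.
# The marker format and the 5-second interval are assumptions for illustration.

def insert_time_markers(tokens, durations, interval=5.0):
    """Emit tokens, inserting an elapsed-time marker each time the
    cumulative spoken duration crosses a multiple of `interval` seconds."""
    out, elapsed, next_mark = [], 0.0, interval
    for tok, dur in zip(tokens, durations):
        out.append(tok)
        elapsed += dur
        while elapsed >= next_mark:
            out.append(f"<stm:{elapsed:.1f}s>")
            next_mark += interval
    return out

tokens = ["hello", "there", "how", "are", "you"]
durations = [2.0, 2.0, 2.0, 2.0, 2.0]  # toy seconds-per-token values
print(insert_time_markers(tokens, durations))
```

With the toy values above, markers appear after the cumulative duration passes 5 s and 10 s, giving the model explicit anchors for how much time it has already consumed.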
[NLP-3] Greater accessibility can amplify discrimination in generative AI
【Quick Read】: This paper investigates whether voice interfaces, while making generative AI more accessible, introduce gender discrimination based on vocal characteristics and thereby amplify social bias. The study finds that audio-enabled LLMs systematically shift toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, with stronger bias than in text-only interaction. The key contribution is identifying and intervening on this new source of bias: controlled manipulation of acoustic features such as pitch can regulate gender-discriminatory outputs, indicating that fairness and accessibility must be addressed in tandem when designing voice-interactive systems, rather than trading equity for accessibility.
Link: https://arxiv.org/abs/2603.22260
Authors: Carolin Holtermann,Minh Duc Bui,Kaitlyn Zhou,Valentin Hofmann,Katharina von der Wense,Anne Lauscher
Institution: University of Hamburg; JGU Mainz; Cornell University; Together AI; Allen Institute for AI; CU Boulder
Subjects: Computation and Language (cs.CL)
Comments: Preprint
Abstract:Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ( n=1,000 ) shows that infrequent chatbot users are most hesitant to undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
[NLP-4] MemDLM: Memory-Enhanced DLM Training
【Quick Read】: This paper targets the pronounced train-inference mismatch in diffusion language models (DLMs): they are trained with a static, single-step masked prediction objective but deployed through a multi-step progressive denoising trajectory, limiting performance. The key to the proposed MemDLM (Memory-Enhanced DLM) is embedding a simulated denoising process into training via bi-level optimization: an inner loop updates a set of fast weights, forming a parametric memory that captures each sample's local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, this mechanism yields faster convergence and lower training loss; at inference, the inner loop can be re-enabled as an adaptation step to further improve long-context understanding, acting as an emergent in-weight retrieval mechanism that mitigates token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks.
Link: https://arxiv.org/abs/2603.22241
Authors: Zehua Pei,Hui-Ling Zhen,Weizhe Lin,Sinno Jialin Pan,Yunhe Wang,Mingxuan Yuan,Bei Yu
Institution: The Chinese University of Hong Kong; Huawei Technologies Co., Ltd
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: this https URL.
[NLP-5] Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson's Disease INTERSPEECH2026
【Quick Read】: This paper addresses the performance degradation in cross-lingual dysarthric speech detection caused by language-specific structure: self-supervised speech representations typically encode language-dependent features that confound accurate detection of dysarthria. The key is a representation-level language shift (LS) that aligns source-language self-supervised representations with the target-language distribution via centroid-based vector adaptation estimated from healthy-control speech. Experiments show that LS substantially improves sensitivity and F1 in cross-lingual settings and yields smaller but consistent gains in multilingual settings; representation analysis further shows that LS reduces language-identity information in the embedding space, confirming its ability to remove language-dependent structure.
Link: https://arxiv.org/abs/2603.22225
Authors: Abner Hernandez,Eunjung Yeo,Kwanghee Choi,Chin-Jou Li,Zhengjun Yue,Rohan Kumar Das,Jan Rusz,Mathew Magimai Doss,Juan Rafael Orozco-Arroyave,Tomás Arias-Vergara,Andreas Maier,Elmar Nöth,David R. Mortensen,David Harwath,Paula Andrea Perez-Toro
Institution: FAU Erlangen-Nürnberg, Germany; UT Austin, USA; CMU, USA; Czech Technical University in Prague, Czech Republic; Idiap Research Institute, Switzerland; Universidad de Antioquia, Colombia; Shenzhen Loop Area Institute, China; Fortemedia, Singapore
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Submitted to Interspeech 2026
Abstract:The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson’s disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
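The centroid-based adaptation described above can be sketched as a simple mean shift: subtract the source-language healthy-control centroid and add the target-language one. The 2-D toy embeddings below are illustrative assumptions; the paper's estimation details (feature extractor, pooling, normalization) are not reproduced here:

```python
# Hedged sketch of a centroid-based representation-level language shift (LS):
# shift a source-language embedding toward the target-language distribution
# using centroids estimated from healthy-control speech.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def language_shift(x, src_healthy, tgt_healthy):
    mu_src = centroid(src_healthy)
    mu_tgt = centroid(tgt_healthy)
    # x' = x - mu_src + mu_tgt, applied dimension-wise
    return [xi - s + t for xi, s, t in zip(x, mu_src, mu_tgt)]

# Toy 2-D embeddings: source centered near (1, 0), target near (0, 1).
src = [[1.0, 0.0], [1.2, 0.2], [0.8, -0.2]]
tgt = [[0.0, 1.0], [0.1, 1.1], [-0.1, 0.9]]
print(language_shift([1.5, 0.5], src, tgt))
```

Because only first-order statistics are moved, the shift preserves within-language structure (such as dysarthria-related variation) while removing the language-centroid offset.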
[NLP-6] Gumbel Distillation for Parallel Text Generation ICLR2026
【Quick Read】: This paper addresses the gap in generation quality between parallel-decoding language models and autoregressive (AR) models, whose core cause is that parallel models struggle to capture the complex joint distribution of token sequences. The key is Gumbel Distillation, a novel knowledge distillation technique that uses the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher, letting a parallel decoder learn this joint distribution effectively. The method is model-agnostic and integrates seamlessly with diverse parallel decoding architectures (e.g., MDLM and BD3-LM); experiments on LM1B and OpenWebText show substantial quality gains, including a 30.0% improvement in MAUVE score and a 10.5% improvement in generative perplexity over MDLM trained on OpenWebText.
Link: https://arxiv.org/abs/2603.22216
Authors: Chi Zhang,Xixi Hu,Bo Liu,Qiang Liu
Institution: The University of Texas at Austin
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICLR 2026
Abstract:The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at this https URL.
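The Gumbel-Max trick the abstract builds on states that adding i.i.d. Gumbel(0,1) noise to log-probabilities and taking the argmax is equivalent to sampling from the categorical distribution, which makes sampling a deterministic function of the noise. A minimal self-contained check of that identity (the distillation procedure itself, which applies this map to an AR teacher, is not shown):

```python
import math
import random

# Hedged sketch of the Gumbel-Max trick: argmax_i (log p_i + g_i) with
# g_i ~ Gumbel(0,1) yields a sample from Categorical(p). Gumbel noise is
# generated via the inverse-CDF transform g = -log(-log(U)), U ~ Uniform(0,1).

def gumbel_max_sample(log_probs, rng):
    noisy = [lp - math.log(-math.log(rng.random())) for lp in log_probs]
    return max(range(len(noisy)), key=lambda i: noisy[i])

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
log_probs = [math.log(p) for p in probs]
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max_sample(log_probs, rng)] += 1
freqs = [c / 20000 for c in counts]
print(freqs)  # empirical frequencies should be close to [0.7, 0.2, 0.1]
```

The deterministic noise-to-token mapping is what lets a parallel student be supervised with teacher outputs that share the same latent noise.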
[NLP-7] SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
【Quick Read】: This paper addresses the incomplete knowledge coverage of large language models (LLMs) in specialized, data-scarce domains, performing knowledge injection via synthetic data generation. The key is SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data. The study finds that RL-based methods suffer from diversity collapse as data scales, and that the advantages of multi-stage prompting disappear after careful prompt tuning; by contrast, SPA's combination of careful prompt engineering with straightforward large-scale augmentation proves surprisingly stable and effective for knowledge injection.
Link: https://arxiv.org/abs/2603.22213
Authors: Kexian Tang,Jiani Wang,Shaowen Wang,Kaifeng Lyu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at this https URL.
[NLP-8] Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation ICASSP2026
【Quick Read】: This paper addresses the underperformance of large language models (LLMs) in document-level machine translation (MT), whose main challenges are the scarcity of high-quality document-level parallel data and the tendency of LLMs to hallucinate and omit content during generation. The key is a two-stage fine-tuning strategy: first, summarization data is converted into document-level parallel data with an LLM and filtered with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality; then the model is fine-tuned first on abundant sentence-level MT resources and subsequently on the filtered document-level corpus, effectively improving coherence and accuracy in document-level translation.
Link: https://arxiv.org/abs/2603.22186
Authors: Ireh Kim,Tesia Sker,Chanwoo Kim
Institution: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2026
Abstract:In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using a LLM, and then filter it using multiple metrics, leveraging sacreBLEU, COMET, and LaBSE-based cosine similarity, to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
[NLP-9] The Semantic Ladder: A Framework for Progressive Formalization of Natural Language Content for Knowledge Graphs and AI Systems
【Quick Read】: This paper addresses the semantic gap between natural language and formal semantic models, particularly the challenge of requiring full semantic formalization at the point of data entry. The key is the Semantic Ladder, an architectural framework that organizes levels of increasing semantic explicitness through modular semantic units, progressing from natural language snippets to ontology-based and higher-order logical models. This layered structure supports semantic enrichment, statement structuring, and logical modeling across levels while preserving semantic continuity and traceability, enabling the incremental construction of semantic knowledge spaces, reducing the semantic parsing burden, and accommodating heterogeneous representations (natural language, structured semantic models, and vector embeddings), thereby providing a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
Link: https://arxiv.org/abs/2603.22136
Authors: Lars Vogt
Institution: Unknown
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments:
Abstract:Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
[NLP-10] Multiperspectivity as a Resource for Narrative Similarity Prediction
【Quick Read】: This paper tackles the problem that semantic evaluation benchmarks for narrative similarity encode a single ground truth, which cannot reflect the plurality of valid human interpretations (multiperspectivity). Rather than treating multi-perspective readings as noise or bias to overcome, the authors incorporate them into the decision process of a predictive system built as an ensemble of 31 LLM personas. The key insight is the complementarity between personas (practitioners following interpretive frameworks and more intuitive, lay-style characters) under majority voting: practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains, consistent with Condorcet Jury Theorem-like dynamics under weakened independence.
Link: https://arxiv.org/abs/2603.22103
Authors: Max Upravitelev,Veronika Solopova,Jing Yang,Charlott Jakob,Premtim Sahitaj,Ariana Sahitaj,Vera Schmitt
Institution: Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); BIFOLD – Berlin Institute for the Foundations of Learning and Data; Centre for European Research in Trusted AI (CERTAIN)
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
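The Condorcet-style effect the abstract invokes can be illustrated with a toy simulation: majority voting over independent classifiers whose individual accuracy exceeds 0.5 becomes more accurate as the ensemble grows. The voter count, accuracy, and independence assumption below are illustrative; real persona ensembles have correlated errors, so actual gains are smaller:

```python
import random

# Hedged toy simulation of majority voting under the Condorcet Jury Theorem:
# each of n voters is independently correct with probability p > 0.5.

def majority_accuracy(n_voters, p, trials, rng):
    correct = 0
    for _ in range(trials):
        votes = sum(1 for _ in range(n_voters) if rng.random() < p)
        if votes > n_voters / 2:  # strict majority is correct
            correct += 1
    return correct / trials

rng = random.Random(42)
acc_small = majority_accuracy(3, 0.6, 5000, rng)
acc_large = majority_accuracy(31, 0.6, 5000, rng)
print(acc_small, acc_large)
```

Even with each voter at only 60% accuracy, a 31-member majority vote lands well above 80%, which is why lowering error correlation (as the practitioner personas do) can matter more than raising individual accuracy.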
[NLP-11] Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
【Quick Read】: This paper addresses the trade-offs between training efficiency, convergence behavior, and generation diversity across generation paradigms (autoregressive vs. masked diffusion language models). The key is a tightly controlled comparison that holds the dataset (TinyStories, 50M tokens), compute budget (20,000 steps, batch size 32, sequence length 512), and hardware (NVIDIA H100 80GB) fixed, varying only the generation paradigm. Results show that the masked diffusion language model (MDLM) trains slightly slower but converges more stably and produces more structurally diverse generations, whereas the autoregressive (AR) model converges faster but overfits and yields repetitive outputs, revealing different compute-optimal training regimes for the two paradigms.
Link: https://arxiv.org/abs/2603.22075
Authors: Caio Vicentino
Institution: Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 2 figures, 4 tables. Code and checkpoints at this https URL
Abstract:We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
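Among the diversity metrics the abstract reports, Distinct-n is the simplest: the ratio of unique n-grams to total n-grams across a set of generations. A minimal sketch (normalization conventions vary slightly between papers; this pools n-grams over all samples):

```python
# Hedged sketch of the Distinct-n diversity metric: unique n-grams divided
# by total n-grams, pooled across generated samples. Higher = more diverse.

def distinct_n(samples, n):
    ngrams = []
    for text in samples:
        toks = text.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

repetitive = ["once upon a time", "once upon a time", "once upon a time"]
diverse = ["once upon a time", "deep in the woods", "far across the sea"]
print(distinct_n(repetitive, 2), distinct_n(diverse, 2))
```

The repetitive set scores 1/3 (three unique bigrams out of nine) while the diverse set scores 1.0, mirroring the AR-vs-MDLM contrast the paper quantifies.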
[NLP-12] Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch ICASSP2026
【Quick Read】: This paper addresses the performance degradation in knowledge distillation between large language models (LLMs) with distinct tokenizers, caused by mismatched distributions of keys and queries, with the goal of improving small models' text-generation quality. The key is a new method based on generative adversarial (GA) learning, DSKD-CMA-GA, which optimizes the alignment of the attention mechanism across models to mitigate the feature-space mismatch induced by tokenizer differences. Experiments show modest but consistent ROUGE-L gains in cross-tokenizer distillation, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer distillation.
Link: https://arxiv.org/abs/2603.22056
Authors: Stella Eva Tsiapali,Cong-Thanh Do,Kate Knill
Institution: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ICASSP 2026
Abstract:Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
[NLP-13] ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
【Quick Read】: This paper targets the "overthinking" problem of large reasoning models (LRMs) on complex tasks: even after reaching the correct answer, they keep generating redundant reasoning steps, increasing latency and compute cost and sometimes causing answer drift. The key to the proposed ROM is framing overthinking mitigation as a streaming prediction-and-control problem: a lightweight detection head attached to the late-layer hidden states of a frozen LLM backbone monitors generated tokens in real time and triggers an early transition to the final answer once overthinking is detected. Token-level supervision based on solution-correctness boundaries and a data augmentation strategy that reduces distilled-data bias further enable efficient and precise early-termination control.
Link: https://arxiv.org/abs/2603.22016
Authors: Xinyan Wang,Xiaogeng Liu,Chaowei Xiao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL
Abstract:Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
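The monitor-and-trigger control loop described above can be caricatured with a simple streaming rule. The threshold, the consecutive-token patience rule, and the scalar scores below are all illustrative assumptions; the paper's detection head operates on hidden states and is trained, not hand-set:

```python
# Hedged sketch of a streaming trigger: fire an early transition to the final
# answer once a per-token overthinking score stays above a threshold for
# `patience` consecutive tokens.

def first_trigger(scores, threshold=0.8, patience=3):
    """Return the index at which the early-answer transition fires,
    i.e. the end of the first run of `patience` scores > threshold,
    or None if it never fires."""
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s > threshold else 0
        if run >= patience:
            return i
    return None

scores = [0.1, 0.2, 0.9, 0.3, 0.85, 0.9, 0.95, 0.9]
print(first_trigger(scores))
```

Requiring a sustained run rather than a single spike is one simple way to avoid truncating a trace on a transient false positive.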
[NLP-14] Retrieving Climate Change Disinformation by Narrative
【Quick Read】: This paper addresses the inability of fixed taxonomies to accommodate emerging narratives in climate disinformation detection. Its core move is reframing narrative detection as a retrieval task: using a narrative's core message as the query, texts are ranked by their alignment with that narrative, so emerging narratives can be identified without a predefined label set. The key is SpecFi, a framework that generates hypothetical documents, using community summaries from graph-based community detection as few-shot examples, to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. Experiments show that SpecFi reaches a MAP of 0.505 on CARDS without access to narrative labels, and that it remains markedly more robust than standard retrieval (e.g., BM25) on high-variance narratives.
Link: https://arxiv.org/abs/2603.22015
Authors: Max Upravitelev,Veronika Solopova,Charlott Jakob,Premtim Sahitaj,Vera Schmitt
Institution: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative’s core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.
[NLP-15] SecureBreak – A dataset towards safe and secure models
【Quick Read】: This paper addresses harmful outputs of deployed large language models (LLMs) caused by insufficient security alignment, especially the inability of existing alignment methods to fully defend against attacks such as jailbreaking and prompt injection. The key is SecureBreak, a safety-oriented dataset that is highly reliable thanks to careful manual annotation with conservatively assigned labels, and that detects unsafe content across multiple risk categories. Experiments further show that fine-tuning pre-trained models on the dataset improves generation-time safety, supporting both post-generation safety filtering and further alignment improvements.
Link: https://arxiv.org/abs/2603.21975
Authors: Marco Arazzi,Vignesh Kumar Kembu,Antonino Nocera
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an ``ultimate’’ defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.
[NLP-16] Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
【Quick Read】: This paper addresses how to scale reinforcement learning (RL) effectively in complex, multi-turn environments so as to train large language models (LLMs) into autonomous agents capable of long-horizon planning. The key is a systematic empirical study that decomposes and optimizes five core axes of the RL training design space: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. The study finds that reward and algorithm choices are scale-dependent (smaller models benefit from staged rewards and enhanced exploration, while larger models converge efficiently with simpler dense rewards), that about 1K training samples with a balanced difficulty mixture mark a performance sweet spot, and that environmental stability is critical to prevent policy degradation. Based on these insights, the authors distill a reproducible RL recipe whose trained models significantly outperform leading LLMs on TravelPlanner.
Link: https://arxiv.org/abs/2603.21972
Authors: Xixi Wu,Qianguo Sun,Ruiyang Zhang,Chao Song,Junlong Wu,Yiyan Qi,Hong Cheng
Institution: The Chinese University of Hong Kong; IDEA Research; University of Macau
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Codes are available at this https URL
Abstract:Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
[NLP-17] Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of LoRA, Prompt Tuning, and Full Fine-Tuning
【Quick Read】: This paper addresses the high computational cost of fine-tuning large language models for domain-specific tasks such as medical text summarization. Full fine-tuning works but is expensive and hard to scale; the key is parameter-efficient fine-tuning (PEFT), specifically a comparison of Low-Rank Adaptation (LoRA), Prompt Tuning, and full fine-tuning. Experiments show that LoRA, updating only 0.6% of parameters, significantly outperforms full fine-tuning (ROUGE-1 of 43.52 +/- 0.18 vs. 40.67 +/- 0.21), and suggest that the low-rank constraint itself acts as a regularizer, challenging the assumption that full parameter updates are necessary for best performance.
Link: https://arxiv.org/abs/2603.21970
Authors: Ulugbek Shernazarov,Rostislav Svitsov,Bin Shi
Institution: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, presented at 6th International Conference on NLP Text Mining (NLTM 2026), March 21-22, Sydney, Australia. Published in Computer Science Information Technology (CS IT), pp. 01-09, 2026
Abstract:Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches-Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning-across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at this https URL
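LoRA's parameter savings follow from simple counting: a rank-r update W + BA on a d_out x d_in weight trains r*(d_in + d_out) numbers instead of d_out*d_in. A back-of-the-envelope check, with layer shapes that are illustrative (not Flan-T5's actual projections):

```python
# Hedged parameter-count sketch for LoRA: a rank-r adapter on a
# d_out x d_in matrix trains r*(d_in + d_out) parameters.

def lora_params(d_out, d_in, rank):
    return rank * (d_in + d_out)

def full_params(d_out, d_in):
    return d_out * d_in

layers = [(1024, 1024)] * 24  # toy stack of square projections (assumption)
rank = 8
trainable = sum(lora_params(o, i, rank) for o, i in layers)
total = sum(full_params(o, i) for o, i in layers)
print(f"LoRA trains {trainable:,} of {total:,} params "
      f"({100 * trainable / total:.2f}%)")
```

At rank 8 on 1024x1024 layers the trainable fraction is about 1.56%, the same order of magnitude as the 0.6% the paper reports for Flan-T5-Large.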
[NLP-18] BHDD: A Burmese Handwritten Digit Dataset
【Quick Read】: This paper addresses Burmese handwritten digit recognition, filling a gap in handwritten-digit datasets for a low-resource language setting. The key is the construction and public release of the Burmese Handwritten Digit Dataset (BHDD), comprising 87,561 grayscale 28x28 images across ten classes of handwritten Burmese digits, with 60,000 training and 27,561 test samples, where the test set preserves the class distribution as collected. The study analyzes class distribution, pixel statistics, and morphological variability, identifies digit pairs easily confused due to the round shapes of the Myanmar script, and provides three simple but effective baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation), reaching up to 99.83% test accuracy and validating the dataset's usefulness.
Link: https://arxiv.org/abs/2603.21966
Authors: Swan Htet Aung,Hein Htet,Htoo Say Wah Khaing,Thuya Myo Nyunt
Institution: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 4 pages, 9 figures, 1 table. Dataset available at this https URL
Abstract:We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset’s class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at this https URL
[NLP-19] SLURP-TN: Resource for Tunisian Dialect Spoken Language Understanding LREC2026
【Quick Read】: This paper tackles the problem that low-resource languages cannot benefit from advances in deep neural networks and pre-trained language models for Spoken Language Understanding (SLU) because annotated data is lacking. Its key contribution is the construction and release of SLURP-TN, an SLU dataset for the Tunisian dialect containing 4,165 sentences recorded by 55 native speakers, with semantic annotations covering six SLURP domains and roughly 5 hours of audio in total. The authors also develop Automatic Speech Recognition (ASR) and SLU baselines on the dataset, providing a reproducible foundation and technical support for spoken language understanding research on the Tunisian dialect.
Link: https://arxiv.org/abs/2603.21940
Authors: Haroun Elleuch,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at LREC 2026
Abstract:Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has achieved significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4,165 sentences totaling around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: this https URL.
[NLP-20] Ara-BEST-RQ: Multi-Dialectal Arabic SSL ICASSP2026
【Quick Read】: This paper addresses the limited performance of models for multi-dialectal Arabic speech processing, in particular the weak generalization of pre-trained models in low-resource dialect settings. The key idea is to build Ara-BEST-RQ, a family of self-supervised learning (SSL) models dedicated to Arabic dialects: 5,640 hours of crawled Creative Commons speech are combined with public datasets to pre-train Conformer-based BEST-RQ models at scale (up to 600M parameters), substantially improving downstream tasks such as dialect identification (DID) and automatic speech recognition (ASR). Experiments show that, compared with multilingual or monolingual models trained on non-Arabic data, this approach reaches state-of-the-art dialect identification with fewer parameters, confirming the central role of dialect-specific pre-training for Arabic speech technology.
Link: https://arxiv.org/abs/2603.21900
Authors: Haroun Elleuch,Ryan Whetten,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ICASSP 2026
Abstract:We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
[NLP-21] Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures
【Quick Read】: This paper asks whether frozen pre-trained language models encode person-specific electroencephalogram (EEG) signals, which would provide an interpretable geometric basis for personalized neural interfaces built on generative AI. The key contribution is person-specific linear probes that map a frozen language model's hidden states to each participant's EEG power features: person-specific probes clearly outperform a single population probe at predicting an individual's high-gamma power (rho = 0.183 vs. 0.020), and the signal is temporally stable, non-transferable across individuals, and concentrated in the model's deep layers (peaking at Layer 24 of 28). This indicates that frozen language models contain stable, separable person-specific neural directions that form a geometric foundation for EEG-driven personalized modeling.
Link: https://arxiv.org/abs/2603.21847
Authors: Ajan Subramanian,Sumukh Bettadapura,Rohan Sathish
Institutions: Kubo Technologies
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person’s brain activity but not another’s. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual’s EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model’s deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
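A person-specific linear probe of the kind described can be sketched as a ridge regression from frozen hidden states to one scalar EEG feature per word. The closed form and regularization strength below are standard textbook choices, not details taken from the paper:

```python
# Ridge-regression probe sketch (illustrative; hyperparameters assumed).
# H: (n_words x d) frozen hidden states, y: (n_words,) one EEG feature.
import numpy as np

def fit_linear_probe(H, y, lam=1.0):
    """Closed-form ridge solution: w = (H^T H + lam * I)^{-1} H^T y."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

def predict(H, w):
    return H @ w

# One probe per participant: the learned weight vector w is the
# "person-specific direction" in activation space the paper analyzes.
```

Training one such probe per participant and comparing self- vs. other-person predictions is enough to reproduce the study's basic design at toy scale.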
[NLP-22] Select, Label, Evaluate: Active Testing in NLP
【Quick Read】: This paper targets the high cost and long turnaround of annotating test data in Natural Language Processing (NLP), where the stringent quality requirements on labels used for model evaluation create a resource bottleneck. The key contribution is a systematic formalization and benchmarking of the Active Testing framework, which selects the most informative test samples for annotation so as to maximize the accuracy of model performance estimation under a limited labeling budget. Experiments show annotation reductions of up to 95% while keeping the performance estimation error within 1%; an adaptive stopping criterion is also introduced that automatically determines the optimal number of annotated samples, removing the traditional requirement to fix the annotation budget in advance.
Link: https://arxiv.org/abs/2603.21840
Authors: Antonio Purificato,Maria Sofia Bucarelli,Andrea Bacciu,Amin Mantrach,Fabrizio Silvestri
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 6 figures
Abstract:Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation. Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and we conduct an extensive benchmarking of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with performance estimation accuracy difference from the full test set within 1%. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation of requiring a predefined annotation budget in existing sample selection strategies, we introduce an adaptive stopping criterion that automatically determines the optimal number of samples.
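One standard way to make an actively selected subset yield an unbiased estimate of full-test-set performance is importance weighting. This minimal sketch, with an assumed uncertainty-proportional sampling proposal, is illustrative and is not one of the specific estimators benchmarked in the paper:

```python
# Active-testing sketch (illustrative): sample test items in proportion to
# an uncertainty score, then correct the selection bias with importance
# weights so the mean-loss estimate stays unbiased for the full pool.
import random

def active_test_estimate(scores, losses, budget, rng):
    n = len(scores)
    total = sum(scores)
    q = [s / total for s in scores]              # sampling proposal
    picked = rng.choices(range(n), weights=q, k=budget)
    # Horvitz-Thompson-style correction: weight each draw by 1 / (n * q_i)
    return sum(losses[i] / (n * q[i]) for i in picked) / budget
```

With a uniform proposal this reduces to the plain mean over a random subsample; a sharper proposal concentrates the labeling budget on the items that matter most for the estimate.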
[NLP-23] Instruction Set and Language for Symbolic Regression
【Quick Read】: This paper addresses structural redundancy in Symbolic Regression (SR): an expression DAG admits many distinct node-numbering schemes that encode the same mathematical expression yet occupy separate points in the search space and consume fitness evaluations without adding diversity, degrading search efficiency. The key contribution is IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string, a complete labeled-DAG isomorphism invariant, which collapses all equivalent representations into a single canonical form, eliminating the redundancy and improving search efficiency.
Link: https://arxiv.org/abs/2603.21836
Authors: Ezequiel Lopez-Rubio,Mario Pascual-Gonzalez
Institutions: University of Málaga; ITIS Software
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:A fundamental but largely unaddressed obstacle in Symbolic regression (SR) is structural redundancy: every expression DAG admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string – a complete labeled-DAG isomorphism invariant – that collapses all the equivalent representations into a single canonical form.
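The idea of collapsing equivalent encodings into one canonical form can be illustrated with a much simpler recursive serialization. This toy version sorts the children of commutative operators; it is not IsalSR's two-tier alphabet or its pruning procedure:

```python
# Toy canonical form for expression trees (illustrative only).
# Expressions are tuples like ("add", "x", ("mul", "y", "x"));
# leaves are variable/constant names as strings.
COMMUTATIVE = {"add", "mul"}

def canonical(expr):
    if isinstance(expr, str):                    # variable or constant leaf
        return expr
    op, *children = expr
    parts = [canonical(c) for c in children]
    if op in COMMUTATIVE:
        parts.sort()                             # child order no longer matters
    return op + "(" + ",".join(parts) + ")"
```

Two encodings of the same expression now map to one string, so a search procedure can deduplicate candidates before spending fitness evaluations on them.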
[NLP-24] Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power
【Quick Read】: This paper addresses the lack of systematic quantitative study of the function and distribution of interrogatives in news discourse, which are often ignored, or left functionally undifferentiated, in large-scale digital news corpora. The key contribution is a combination of computational methods with pragmatic and sociological theory: interrogative stances are detected automatically, their functional types approximated, and answer spans located in the text, enabling a multi-dimensional analysis of the "Politics of Questions" in French-language digital news. The approach not only quantifies the density and distribution patterns of interrogatives but also links them to discourse features such as narrative voice and entity mentions, providing an operational framework for understanding how questioning practices organize issues and foreground particular actors in contemporary news discourse.
Link: https://arxiv.org/abs/2603.21823
Authors: Bros Victor,Barbini Matilde,Gerard Patrick,Gatica-Perez Daniel
Institutions: 1. Ecole Polytechnique Fédérale de Lausanne; 2. Idiap Research Institute; 3. Université de Lyon
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: ICWSM 2026
Abstract:Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the “Politics of Questions” in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist’s narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.
[NLP-25] The Presupposition Problem in Representation Genesis
【Quick Read】: This paper asks how large language models (LLMs) can exhibit high-level cognitive capacities without clearly undergoing "representation genesis", the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way, a question traditional philosophical theories of mind cannot answer. Its central claim is that the major frameworks in philosophy of mind (the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology) share a structural flaw: when explaining representation genesis they presuppose that the system is already organized as a representer, yielding the "Representation Presupposition" structure and the "Representation Regress", a circularity in which existing representational concepts are used to explain the first appearance of representation. The paper's key contribution is a conceptual diagnosis rather than a new theory: it identifies this structural obstacle and derives two minimum adequacy conditions any successful account must satisfy, namely avoiding explanatory reliance on prior representational concepts, and providing a workable mechanism for how representation emerges from a non-representational substrate. LLMs turn the absence of such a theory from a purely theoretical matter into a practically consequential one.
Link: https://arxiv.org/abs/2603.21745
Authors: Yiling Wu
Institutions: University of Massachusetts Amherst
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models are the first systems to achieve high cognitive performance without clearly undergoing representation genesis: the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way. Prior cognitive systems had already made this transition before we could examine it, and philosophy of mind treated genesis as a background condition rather than an explanatory target. LLMs provide a case that does not clearly involve this transition, making the genesis question newly urgent: if genesis did not occur, which cognitive capacities are affected, and why? We currently lack the conceptual resources to answer this. The reason, this paper argues, is structural. Major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common feature when applied to the genesis question: at some explanatory step, each deploys concepts whose explanatory purchase depends on the system already being organized as a representer. This pattern, which we call the Representation Presupposition structure, generates systematic explanatory deferral. Attempts to explain the first acquisition of content-manipulable representation within the existing categorical vocabulary import resources from the representational side of the transition itself. We call this the Representation Regress. The paper offers a conceptual diagnosis rather than a new theory, establishing the structure of the problem and deriving two minimum adequacy conditions for any account that avoids this pattern. LLMs make the absence of such a theory consequential rather than merely theoretical.
[NLP-26] The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures
【Quick Read】: This paper addresses the lack of a systematic account, across psychology, AI, and philosophy of mind, of the structural demands that different types of reasoning (induction, analogy, causal inference, deduction, and formal logic) place on representational systems. The key contribution is a unified framework built on four structural properties: operability, consistency, structural preservation, and compositionality. These properties determine whether a given type of reasoning can succeed: reasoning below a principal structural boundary can operate on associative or probabilistic representations alone, while reasoning above it (deduction in particular) requires all four properties to be fully satisfied; scaling statistical learning alone cannot cross this boundary, because the structural guarantees deduction requires cannot be approximated by probabilistic means. The framework offers a necessary-condition analysis of the limits of reasoning capabilities and supports cross-disciplinary validation and testable predictions.
Link: https://arxiv.org/abs/2603.21736
Authors: Yiling Wu
Institutions: University of Massachusetts Amherst
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.
[NLP-27] EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning
【Quick Read】: This paper targets the difficulty of iteratively evolving scientific ideas with large language models (LLMs) in autonomous knowledge discovery, i.e., systematically refining initial concepts into high-quality research proposals. Existing reinforcement learning (RL) methods rely on rubric-based global reward signals that lack actionable granularity, while language-guided refinement is usually confined to inference-time prompting and does not explicitly train models to internalize such critiques. The key contribution is EvoIdeator, a framework whose structured judge model produces two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, giving precise control over several quality dimensions; and (2) span-level language feedback offering concrete critiques of grounding, feasibility, and methodological rigor. Integrating both signals into the RL loop lets the policy model systematically exploit precise feedback during both training and inference, substantially improving the quality and generalization of scientific ideation.
Link: https://arxiv.org/abs/2603.21728
Authors: Andreas Sauter,Yuyue Zhao,Jacopo Urbani,Wenxiang Hu,Zaiqiao Meng,Lun Zhou,Xiaohui Yan,Yougang Lyu
Institutions: Huawei Technologies Co., Ltd.; Vrije Universiteit Amsterdam
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose EvoIdeator, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with checklist-grounded feedback. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.
[NLP-28] SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models SEMEVAL2026
【Quick Read】: This paper addresses the underexplored problem of direct-cause inference for real-world events in evidence-rich settings. Its key contribution is an evidence-grounded multiple-choice benchmark, Abductive Event Reasoning (AER), whose core design is to identify the most plausible direct cause of a target event from supporting evidence while covering realistic causal-reasoning challenges such as distributed evidence, indirect background factors, and semantically related but non-causal distractors. With a structured dataset and a rigorous evaluation setup, the task provides a focused, measurable benchmark for research on multi-document understanding and causal modeling with generative AI.
Link: https://arxiv.org/abs/2603.21720
Authors: Pengfei Cao,Mingxuan Yang,Yubo Chen,Chenlong Zhang,Mingxuan Liu,Kang Liu,Jun Zhao
Institutions: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, semeval 2026 task 12 description paper
Abstract:Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER); the task data is available at this https URL. The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
[NLP-29] Probing How Scalable Table Data Enhances General Long-Context Reasoning
【Quick Read】: This paper addresses the shortage of systematic study on which training data types are effective for long-context reasoning in large language models (LLMs) and why. The key finding is that structured table data with periodic structure is highly effective: a mutual-information analysis reveals its non-vanishing periodic dependencies, and building on this insight the authors construct TableLong, a simple yet scalable RL-driven data synthesis pipeline that generates high-quality, diverse, and verifiable table data to strengthen long-context reasoning. Experiments show an average improvement of 8.24% across multiple long-context benchmarks and 8.06% on out-of-domain benchmarks.
Link: https://arxiv.org/abs/2603.21719
Authors: Huaibing Xie,Guoliang Zhao,Yang Liu,Shihan Dou,Siming Huang,Yanling Xiao,Shaolei Wang,Yiting Liu,Cheng Zhang,Shaofan Liu,Pluto Zhou
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
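The kind of periodic dependency the mutual-information analysis points to can be checked empirically: cells a fixed row offset apart (the same column in adjacent rows of a serialized table) retain nonzero mutual information. The estimator below is a generic plug-in estimate over discrete samples, not the paper's derivation:

```python
# Plug-in mutual information (in nats) between two discrete variables,
# e.g. pairs of table cells at a fixed row offset. Illustrative sketch.
import math
from collections import Counter

def mutual_information(pairs):
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

Applied at increasing offsets, such an estimate would stay bounded away from zero at multiples of the row length, which is the "non-vanishing periodic dependency" picture in miniature.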
[NLP-30] Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization
【Quick Read】: This paper addresses the limited generalization of standard Transformers caused by their fixed computational depth, especially on tasks requiring variable-depth reasoning such as multi-hop graph traversal or nested logic. The core solution is a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space, so that thinking steps can be scaled at inference time for deeper reasoning. Three mechanisms are key to this architecture: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization, which protects fragile reasoning states from the noise of untrained layers; and (3) identity-biased recurrence, which creates a gradient highway across many steps and keeps deep recurrence stable. The design yields a clear "computational frontier" from chance-level to near-perfect performance and reveals how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization.
Link: https://arxiv.org/abs/2603.21676
Authors: Hung-Hsuan Chen
Institutions: National Central University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space – enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear \emphcomputational frontier – a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
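The shared-weight iteration with an identity bias can be sketched in a few lines; the toy block (a single tanh layer) and the step size are assumptions for illustration, not the paper's architecture:

```python
# Depth-recurrent sketch (illustrative): one shared-weight block f is
# applied `steps` times with an identity bias, h <- h + alpha * f(h),
# so computational depth grows at inference time without new parameters.
import numpy as np

def recur(h, W, steps, alpha=0.1):
    for _ in range(steps):
        h = h + alpha * np.tanh(h @ W)   # stand-in for a Transformer block
    return h
```

Because the residual path is the identity, gradients flow directly across all steps, which is the "gradient highway" role that identity-biased recurrence plays in the paper's training stability argument.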
[NLP-31] Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion
【Quick Read】: This paper tackles the challenge of generating interpretable natural-language descriptions from weather time series, where existing methods either output numerical predictions without human-readable explanations or produce generic descriptions lacking domain depth. The key contribution is WeatherTGD, a training-free multi-agent framework whose core mechanism casts collaborative caption refinement as Text Gradient Descent (TGD): three specialized LLM agents, a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert, each generate domain-specific textual gradients, which a novel Consensus-Aware Gradient Fusion mechanism aggregates, preserving each domain's unique perspective while extracting the shared signal; the fused gradients then drive an iterative refinement process that moves the generated weather caption toward an optimal solution.
Link: https://arxiv.org/abs/2603.21673
Authors: Shixu Liu
Institutions: Nepu (Northeast Petroleum University)
Subjects: Computation and Language (cs.CL)
Comments: Preprint and under consideration
Abstract:Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.
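The consensus-aware fusion step can be caricatured with sets of feedback phrases: signals proposed by a quorum of agents form the consensus, while the rest survive as unique per-agent perspectives. Everything below, including the quorum rule, is a hypothetical simplification of the paper's mechanism:

```python
# Consensus-aware fusion sketch (hypothetical simplification).
from collections import Counter

def fuse_gradients(agent_feedback, quorum=2):
    """agent_feedback: list of phrase lists, one per agent."""
    counts = Counter(p for fb in agent_feedback for p in set(fb))
    consensus = sorted(p for p, c in counts.items() if c >= quorum)
    unique = sorted(p for p, c in counts.items() if c < quorum)
    return consensus, unique
```

The split mirrors the paper's stated goal: extract common signals across the three agents while still carrying forward each domain's unique critique into the next refinement iteration.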
[NLP-32] TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression
【Quick Read】: This paper addresses the temporal credit assignment challenge in multi-turn reinforcement learning caused by the absence of per-turn supervision: when processing long documents, a model must read chunks and update memory over multiple turns, yet only the final outcome provides a supervision signal, making the quality of each turn's memory update hard to evaluate. The key contribution is Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL), which treats relevant documents as teacher signals, aligns them with the model's input at each turn, and assigns rewards through normalized probabilities in a self-supervised manner, providing fine-grained learning signals for every memory update and markedly improving long-context processing.
Link: https://arxiv.org/abs/2603.21663
Authors: Li Wang,Yandong Wang,Xin Yu,Kui Zhang,Tianhao Peng,Wenjun Wu
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model’s context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at this https URL.
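The "rewards through normalized probabilities" idea can be sketched as a softmax over per-turn teacher-alignment scores; both the scores and the softmax choice here are illustrative assumptions, not the paper's exact formulation:

```python
# Per-turn reward sketch (illustrative): softmax-normalize each turn's
# teacher-alignment score so every memory update receives a bounded,
# comparable credit signal that sums to 1 across the episode.
import math

def turn_rewards(alignment_scores):
    m = max(alignment_scores)                    # for numerical stability
    exps = [math.exp(s - m) for s in alignment_scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Normalizing across turns turns a single outcome-level signal into a per-turn distribution of credit, which is exactly the granularity the temporal credit assignment problem calls for.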
[NLP-33] A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures
【Quick Read】: This paper addresses the lag in understanding memorization in large language models (LLMs): limited access to pre-training data has confined prior work to single model series, yielding isolated observations that make it hard to separate general patterns from model-specific ones. The key contribution is a systematic collection of multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) analyzed for shared and distinct memorization behavior at both the statistical and internal levels. Statistically, the memorization rate scales log-linearly with model size, memorized sequences can be further compressed, and their frequency and domain distributions follow consistent patterns. Internally, decoding middle layers and ablating attention heads reveals a general memorization decoding process and shared important heads, while the distribution of these important heads differs across model families, a family-level signature. This multi-level, integrated analysis lays groundwork for a universal understanding of memorization behavior in LLMs.
Link: https://arxiv.org/abs/2603.21658
Authors: Bowen Chen,Namgi Han,Yusuke Miyao
Institutions: The University of Tokyo; National Institute of Informatics
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages of main content, in conference submission, other contents are references and extra appendix
Abstract:Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series, making it unclear which findings are general or specific. In this study, we collect multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrated a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and attention head ablation, we revealed the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLM.
[NLP-34] Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks DATE
【Quick Read】: This paper questions the prevailing "benchmark-centered" regime for evaluating large language models (LLMs): ranking, selecting, and deploying models by public benchmark scores may misleadingly reflect true generalization, because scores can conflate exam-oriented competence with principled capability, especially when training-data contamination and semantic leakage are hard to exclude. The key contribution is an audit framework: using a router-worker setup, behavior under a clean-control condition is compared against noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed, so as to quantify contamination sensitivity and score confidence. Across multiple models, the experiments find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and reactivate contamination-related memory, so that identical scores can carry substantially different levels of confidence. The authors therefore argue not for rejecting benchmarks, but for supplementing evaluation with explicit audits of contamination sensitivity and score confidence.
Link: https://arxiv.org/abs/2603.21636
Authors: Yiliang Song,Hongjun An,Jiangan Chen,Xuanchen Yan,Huan Song,Jiawei Shao,Xuelong Li
Institutions: Institute of Artificial Intelligence (TeleAI), China Telecom; Guangxi Normal University; Northwestern Polytechnical University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: First update
Abstract:Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should not consistently outperform the clean-control baseline. Yet across multiple models, we find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and can reactivate contamination-related memory. These results suggest that similar benchmark scores may carry substantially different levels of confidence. Rather than rejecting benchmarks altogether, we argue that benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence.
[NLP-35] DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing LREC2026
【速读】: 该论文旨在解决阿马齐格语(Amazigh)在计算资源上的严重匮乏问题,特别是针对塔什利赫特方言(Tashlhiyt)缺乏高质量平行语料库的现状。其解决方案的关键在于构建DATASHI这一新型并行英文-塔什利赫特语语料库,包含5,000句对,其中1,500句对同时提供专家标准化版本与非标准用户生成版本,从而系统性研究拼写多样性与规范化问题。该双层设计不仅支持文本类自然语言处理任务(如分词、翻译和归一化),还为语音-文本对齐及多模态研究奠定基础,并通过大规模大语言模型(Large Language Models, LLMs)评估验证了其有效性,尤其显示Gemini-2.5-Pro在词级和字符级错误率上表现最优,且在标记音素类别(如成音节辅音、重音辅音、小舌音和咽音)上的编辑操作分析中揭示了模型对低资源阿马齐格正字法特征的敏感性差异,为未来正字法归一化提供了诊断依据。
链接: https://arxiv.org/abs/2603.21571
作者: Nasser-Eddine Monir,Zakaria Baou
机构: 未知
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at LREC 2026
Abstract:DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.
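摘要以词级与字符级错误率评估各模型,其核心是基于编辑距离的 WER/CER。下面给出一个通用的最小实现示意(与论文的评估脚本无关,仅说明指标的计算方式):

```python
def edit_distance(ref, hyp):
    """标准动态规划 Levenshtein 编辑距离(作用于 token 序列)。"""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # 删除
                          d[i][j - 1] + 1,      # 插入
                          d[i - 1][j - 1] + cost)  # 替换
    return d[m][n]

def error_rate(ref, hyp, level="word"):
    """level="word" 计算 WER,level="char" 计算 CER。"""
    r = ref.split() if level == "word" else list(ref)
    h = hyp.split() if level == "word" else list(hyp)
    return edit_distance(r, h) / max(len(r), 1)
```

论文进一步按删除、替换、插入三类编辑操作在音素类别上做细粒度分析,上述距离矩阵正是这类分析的起点。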
[NLP-36] SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification
【速读】: 该论文旨在解决精神症状识别(psychiatric symptom identification)中因专家标注资源密集且缺乏标准化注释指南而导致的大规模细粒度症状级数据集构建困难问题,从而限制了模型对用户生成文本中多样化症状表达的泛化能力。解决方案的关键在于提出SynSym框架,利用大语言模型(large language models, LLMs)通过三个核心机制生成高质量合成数据:(1)将每个症状扩展为子概念以增强表达多样性;(2)生成反映不同语言风格的精神症状表达;(3)基于临床共现模式构造真实的多症状组合表达,从而提升模型在无真实标注数据下的性能表现,并在少量真实数据微调后进一步优化。
链接: https://arxiv.org/abs/2603.21529
作者: Migyeong Kang,Jihyun Kim,Hyolim Jeon,Sunwoo Hwang,Jihyun An,Yonghoon Kim,Haewoon Kwak,Jisun An,Jinyoung Han
机构: Sungkyunkwan University (成均馆大学); Samsung Medical Center (三星医疗中心); Omnicns; Indiana University (印第安纳大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users’ mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.
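SynSym 的第三步按临床共现模式组合多症状表达,其抽样逻辑可示意如下。注意:下面的症状子概念与共现权重均为假设数据,真实流程中各表达由 LLM 生成、共现权重来自临床统计:

```python
import random

SUB_CONCEPTS = {  # 假设的症状子概念表达(真实系统由 LLM 扩展生成)
    "insomnia": ["trouble falling asleep", "waking up at 3am"],
    "anhedonia": ["nothing feels fun anymore", "stopped enjoying hobbies"],
    "fatigue": ["exhausted all day", "no energy to get out of bed"],
}
CO_OCCURRENCE = {  # 假设的临床共现权重
    ("insomnia", "fatigue"): 0.6,
    ("anhedonia", "fatigue"): 0.3,
    ("insomnia", "anhedonia"): 0.1,
}

def compose_multi_symptom(rng):
    """按共现权重抽取症状对,再各取一条子概念表达拼成多症状样本。"""
    pairs = list(CO_OCCURRENCE)
    weights = [CO_OCCURRENCE[p] for p in pairs]
    pair = rng.choices(pairs, weights=weights, k=1)[0]
    text = " and ".join(rng.choice(SUB_CONCEPTS[s]) for s in pair)
    return {"text": text, "labels": sorted(pair)}
```

以共现权重而非均匀抽样组合症状,可使合成样本的标签分布更接近真实临床数据。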
[NLP-37] CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs IJCNN2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险应用场景中存在的人口统计学、性别和地理偏见问题,这些问题会损害公平性与可信度。现有去偏方法如嵌入空间投影、提示引导和因果干预通常仅作用于流水线的单一阶段,导致去偏不彻底且在分布变化下性能脆弱。其解决方案的关键在于提出CatRAG Debiasing框架,该框架融合函子(functor)与检索增强生成(Retrieval-Augmented Generation, RAG)引导的结构去偏机制:其中函子组件利用范畴论结构实现一种保持语义结构的投影,有效抑制嵌入空间中的偏见方向,同时保留任务相关语义;RAG部分则通过结构化检索增强提升去偏稳定性。实验表明,该方法在Bias Benchmark for Question Answering (BBQ) 上显著优于基线模型和先前去偏方法,在多个维度(性别、国籍、种族及交叉群体)上将偏见分数降至接近零,同时准确率提升最高达40%。
链接: https://arxiv.org/abs/2603.21524
作者: Ravi Ranjan,Utkarsh Grover,Mayur Akewar,Xiaomin Lin,Agoritsa Polyzou
机构: Florida International University (佛罗里达国际大学); University of South Florida (南佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, and accepted in IJCNN 2026 (part of IEEE WCCI 2026)
Abstract:Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.
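摘要中“在嵌入空间抑制偏见相关方向、同时保留任务语义”的最基本形式,是把向量在偏见方向上的分量减掉(正交投影)。下面是单方向情形的纯 Python 示意;论文基于范畴论的结构保持投影远比此复杂:

```python
def remove_direction(vec, bias_dir):
    """去除 vec 沿单个(未归一化)偏见方向的分量:v - (v·b / b·b) b。
    多个方向时可先正交化再逐一应用。"""
    dot = sum(v * b for v, b in zip(vec, bias_dir))
    norm2 = sum(b * b for b in bias_dir)
    return [v - (dot / norm2) * b for v, b in zip(vec, bias_dir)]
```

投影后向量与偏见方向的内积为零,其余维度保持不变,这正是“抑制偏见方向、保留任务相关语义”的线性版本。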
[NLP-38] Generalizable Self-Evolving Memory for Automatic Prompt Optimization
【速读】: 该论文旨在解决当前自动提示优化方法在跨任务泛化能力不足以及无法积累可复用提示知识的问题。现有方法通常针对特定任务搜索专用提示,导致模型难以适应异构查询,且每次优化都需从头开始,效率低下。解决方案的关键在于提出MemAPO框架,其核心是一个双记忆机制:一方面将成功的推理轨迹提炼为可复用的策略模板(strategy templates),另一方面将错误生成结构化为捕捉重复失败模式的错误模式(error patterns)。通过检索相关策略与错误模式来组合新提示,结合迭代式自我反思与记忆编辑,使提示优化具备持续进化能力,从而实现更高效、更具泛化性的Prompt优化。
链接: https://arxiv.org/abs/2603.21520
作者: Guanbao Liang,Yuanchen Bei,Sheng Zhou,Yuheng Qin,Huan Zhou,Bingxin Jia,Bin Li,Jiajun Bu
机构: Zhejiang University (浙江大学); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
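MemAPO 在新查询到来时同时检索相关策略模板与错误模式并组合进提示。下面用 token 重叠(Jaccard)做检索示意;记忆条目均为假设内容,真实系统以嵌入相似度检索,并由 LLM 通过自我反思迭代编辑记忆:

```python
def jaccard(a, b):
    """两段文本 token 集合的 Jaccard 相似度。"""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def retrieve(query, memory, k=1):
    """memory: {检索键: 条目文本};按与 query 的重叠度取 top-k。"""
    ranked = sorted(memory.items(), key=lambda kv: jaccard(query, kv[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def compose_prompt(query, strategies, error_patterns):
    """同时注入“可复用策略”与“需规避的失败模式”组成新提示。"""
    s = retrieve(query, strategies)[0]
    e = retrieve(query, error_patterns)[0]
    return f"Task: {query}\nTry this strategy: {s}\nAvoid this known mistake: {e}"
```

策略记忆鼓励有效推理路径,错误记忆抑制重复失败,两者共同构成摘要所述的双记忆机制。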
[NLP-39] Triangulating Temporal Dynamics in Multilingual Swiss Online News
【速读】: 该论文试图解决在多语言社会中,如何系统性地分析新闻报道的动态以揭示公共话语与集体叙事演变的问题,尤其关注语言和文化多样性对国家媒体生态系统的影响。其解决方案的关键在于采用三角验证方法(triangulated methodology),整合定量分析与定性洞察:具体包括对超过170万篇新闻文章进行词法指标计算、命名实体识别与Wikidata链接、针对性情感分析以及基于共识的变点检测,并结合本土化特征(domestication profiles)与文化邻近度显著性比(proximity salience ratio),从而实现跨语言比较并连接本土化理论与文化邻近性假说,有效揭示瑞士三大语区(法语、德语、意大利语)在主题、重复事件和突发事件上的差异化报道模式。
链接: https://arxiv.org/abs/2603.21519
作者: Bros Victor,Dufraisse Evan,Popescu Adrian,Gatica-Perez Daniel
机构: 1. Ecole Polytechnique Fédérale de Lausanne (洛桑联邦理工学院); 2. Idiap Research Institute (Idiap 研究所); 3. University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: ICWSM 2026
Abstract:Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country’s three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.
[NLP-40] Effective Strategies for Asynchronous Software Engineering Agents
【速读】: 该论文旨在解决长周期软件工程(Software Engineering, SWE)任务中多依赖子任务的协同难题,此类任务因涉及多个相互关联的步骤,单智能体方法在准确性与时效性上均面临挑战。解决方案的关键在于提出一种结构化的多智能体协作范式——中心化异步隔离委托(Centralized Asynchronous Isolated Delegation, CAID),其核心机制包括:由中央管理器生成依赖感知的任务计划、各智能体在隔离工作空间中异步执行子任务,并通过结构化集成结合可执行测试验证实现进度整合。实证结果表明,CAID相较于单智能体基线在论文复现任务(PaperBench)和Python库开发任务(Commit0)上的准确率分别提升26.7%和14.3%,且分支合并(branch-and-merge)成为多智能体协作的核心协调机制,借助Git工作树(git worktree)、提交(git commit)和合并(git merge)等SWE原语实现了可靠且可执行的协作流程。
链接: https://arxiv.org/abs/2603.21489
作者: Jiayi Geng,Graham Neubig
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on Github. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
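CAID 的“依赖感知任务计划 + 并发执行”可以用 Kahn 拓扑分层来示意:同一波次内的子任务互不依赖,可各自在隔离工作区(如 git worktree)中异步执行。以下为极简草图,并非论文实现:

```python
def schedule_waves(deps):
    """deps: {任务: 前置任务集合}。返回可并发执行的任务波次(Kahn 分层)。"""
    indeg = {t: len(p) for t, p in deps.items()}
    children = {t: [] for t in deps}
    for t, ps in deps.items():
        for p in ps:
            children[p].append(t)
    wave = [t for t, d in indeg.items() if d == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))  # 同一波次内的任务可并发
        nxt = []
        for t in wave:
            for c in children[t]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    nxt.append(c)
        wave = nxt
    return waves
```

每一波完成后由中央管理器做结构化集成(对应论文中的 branch-and-merge),再放行下一波。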
[NLP-41] TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild INTERSPEECH2026
【速读】: 该论文旨在解决低资源且主要为口语表达的方言(如台湾台语,Taigi)在语音意图识别任务中因标注数据稀缺而导致模型性能受限的问题。其核心解决方案在于提出了一种名为TaigiSpeech的数据集构建方法,该方法结合两种不同监督层级的数据挖掘策略:一是基于关键词匹配与大语言模型(LLM)伪标签的跨语言迁移策略,二是利用音视频多模态线索的弱文本监督框架。这两种策略共同实现了对低资源、无文字书写系统的口语语言进行可扩展、高质量的语音意图数据采集与标注,从而推动相关场景下(如医疗健康和智能家居)的实用化语音交互系统发展。
链接: https://arxiv.org/abs/2603.21478
作者: Kai-Wei Chang,Yi-Cheng Lin,Huang-Cheng Chou,Wenze Ren,Yu-Han Huang,Yun-Shao Tsai,Chien-Cheng Chen,Yu Tsao,Yuan-Fu Liao,Shrikanth Narayanan,James Glass,Hung-yi Lee
机构: Massachusetts Institute of Technology (麻省理工学院); National Taiwan University (台湾大学); Academia Sinica (中央研究院); National Yang Ming Chiao Tung University (阳明交通大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: submitted to Interspeech 2026
Abstract:Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on this https URL.
[NLP-42] Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market Returns
【速读】: 该论文旨在解决金融领域中基于方面的情感分析(Aspect-Based Sentiment Analysis, ABSA)研究中存在的因果推断难题,即传统相关性研究难以区分真实关联与虚假关联的问题。为应对这一挑战,作者提出了一种基于反证验证(refutation-validated)的框架,其核心在于构建一个系统化的验证流程:首先通过净比率评分(net-ratio scoring)结合z标准化处理情感信号,再利用带Newey-West异方差自相关一致(HAC)误差的普通最小二乘法(OLS)进行回归分析,并引入多种反证测试(包括安慰剂检验、随机共同原因检验、子集稳定性检验和Bootstrap检验)以排除伪相关性。该方法在六个能源类股票中仅识别出少数稳健的情感信号,且可再生能源表现出特定方面与时间维度的响应特征,从而为情感信号提供了统计上可靠、方向可解释的预测能力,尽管受限于样本规模(六只股票、一个季度),本文仍被视为一种方法学上的概念验证。
链接: https://arxiv.org/abs/2603.21473
作者: Wihan van der Heever,Keane Ong,Ranjan Satapathy,Erik Cambria
机构: Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); Institute of High Performance Computing, Agency for Science, Technology and Research (高性能计算研究所,科技研究局); Nanyang Technological University (南洋理工大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, submitted to Expert Systems with Applications
Abstract:This paper proposes a refutation-validated framework for aspect-based sentiment analysis in financial markets, addressing the limitations of correlational studies that cannot distinguish genuine associations from spurious ones. Using X data for the energy sector, we test whether aspect-level sentiment signals show robust, refutation-validated relationships with equity returns. Our pipeline combines net-ratio scoring with z-normalization, OLS with Newey-West HAC errors, and refutation tests including placebo, random common cause, subset stability, and bootstrap. Across six energy tickers, only a few associations survive all checks, while renewables show aspect- and horizon-specific responses. While not establishing causality, the framework provides statistically robust, directionally interpretable signals, with limited sample size (six stocks, one quarter) constraining generalizability and framing this work as a methodological proof of concept.
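该流水线的第一步是净比率打分加 z 标准化,可示意如下(聚合窗口与数据来源此处不涉及,数值为假设输入):

```python
import statistics

def net_ratio(pos, neg, total):
    """净比率情感分:(正面提及数 - 负面提及数) / 总提及数。"""
    return (pos - neg) / total if total else 0.0

def z_normalize(series):
    """对净比率时间序列做 z 标准化(总体标准差)。"""
    mu = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [(x - mu) / sd if sd else 0.0 for x in series]
```

标准化后的序列再进入带 Newey-West HAC 误差的 OLS 回归,并接受安慰剂等反证测试。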
[NLP-43] DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation
【速读】: 该论文旨在解决生成式 AI(Generative AI)领域中高效 CUDA 内核开发的难题,即如何自动将 PyTorch 代码转换为高度优化的 Triton 内核,从而减少人工编写和调优 CUDA 代码的工程负担。解决方案的关键在于提出 DRTriton 框架,其核心包括三个组件:(i) 基于 CSP-DAG 的数据合成算法,确保在操作符空间中实现全覆盖且无偏的均匀采样;(ii) 解耦奖励机制的课程强化学习方法,同时优化转换成功率与推理速度;(iii) 测试时搜索算法,在不依赖真实标注数据的前提下进一步提升生成 Triton 内核的运行效率。实验表明,DRTriton-7B 在 KernelBench Level 2 上实现了 92% 的加速比,显著优于 GPT-5.2(23%)和 Claude-Sonnet-4.5(19%),且具备良好的泛化能力,可应对人类专家也难以处理的真实场景。
链接: https://arxiv.org/abs/2603.21465
作者: Siqi Guo,Ming Lin,Tianbao Yang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent researches leverage Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering efforts. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle in this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch codes into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthetic algorithm CSP-DAG that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning with decoupled reward efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of the KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
[NLP-44] DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
【速读】: 该论文旨在解决生成式 AI(Generative AI)在偏好对齐(preference alignment)过程中存在的计算开销大、机制透明度低的问题。传统方法依赖于权重更新训练,不仅需要大量对齐阶段的算力资源,且难以解释模型行为变化的内在机制。其解决方案的关键在于提出动态稀疏自动编码器(Sparse Autoencoder, SAE)引导方法——DSPA(Dynamic SAE Steering for Preference Alignment),该方法在推理阶段实现条件化控制:从偏好三元组中计算出一个条件差异映射(conditional-difference map),将提示特征与生成控制特征关联起来,并在解码时仅修改激活token对应的稀疏潜在变量(latents),无需调整基础模型权重。此策略显著降低对齐阶段的浮点运算次数(FLOPs),同时保持性能稳定,在多个模型上优于基准指标如MT-Bench,且具备良好的数据稀缺鲁棒性。
链接: https://arxiv.org/abs/2603.21461
作者: James Wedgwood,Aashiq Muhamed,Mona T. Diab,Virginia Smith
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to 4.47× fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-k ablation is principled.
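DSPA 在解码时只修改当前 token 已激活的 SAE 潜变量,这一条件化干预规则可示意如下。steering 向量在论文中由条件差异映射按提示动态给出,此处作为假设输入:

```python
def steer_active_latents(latents, steering, eps=1e-6):
    """仅对已激活的潜变量(|z| > eps)施加 steering 增量,
    未激活的潜变量保持为零,不做任何修改。"""
    return [z + s if abs(z) > eps else z for z, s in zip(latents, steering)]
```

只改动激活潜变量意味着干预局部化于当前 token 实际使用的特征,基础模型权重完全不变。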
[NLP-45] Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
【速读】: 该论文针对大语言模型(Large Language Models, LLMs)代码评测基准中存在的可信度危机问题展开研究,具体包括解决方案泄露(solution leakage)和测试质量低下导致的评估偏差,以及现有检测方法(如重述一致性、n-gram重叠度与困惑度分析)无法直接观测模型是否进行推理或记忆复现的问题。此外,重复验证机制反而因多轮交互引入虚假阳性,表明亟需结构化方法提升检测准确性。其解决方案的核心在于提出交叉上下文验证(Cross-Context Verification, CCV)——一种黑盒方法,在N个独立会话中解决相同任务并量化解法多样性;结合分层交叉上下文架构(Hierarchical Cross-Context Architecture, HCCA),通过限制不同专业角色间的信息流通来避免确认偏误(confirmation bias)。实验表明,CCV在9个SWE-bench Verified问题上实现了污染样本与真实推理的完全分离(Mann-Whitney U=0, p≈0.012),且发现33%的既有污染标签为假阳性,验证了信息隔离而非结构复杂性才是关键机制。
链接: https://arxiv.org/abs/2603.21454
作者: Tae-Eun Song
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 3 figures, 4 tables
Abstract:LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods–paraphrase consistency, n-gram overlap, perplexity analysis–never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p approx 0.012, r = 1.0). Key findings: (1) contamination is binary–models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA’s independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result–100% sycophantic confirmation–providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
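CCV 的核心量是 N 个独立会话所得解答之间的多样性:若模型在复述记忆,多次解答几乎相同;若在真实推理,解答会自然分化。一个最小的度量示意是 token 集合上的平均两两 Jaccard 距离(论文对解法的比较更为细致):

```python
def solution_diversity(solutions):
    """N 个独立会话解答的平均两两 Jaccard 距离;
    接近 0 提示记忆复现(疑似污染),较高值提示真实推理。"""
    sets = [set(s.split()) for s in solutions]
    n = len(sets)
    dists = [1 - len(sets[i] & sets[j]) / max(len(sets[i] | sets[j]), 1)
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists) if dists else 0.0
```

论文发现污染呈二元分布:要么解答完全一致(多样性≈0),要么正常分化,故该量能完全分离两类样本。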
[NLP-46] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning IJCNN2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型推理任务中表现不佳的问题,特别是针对知识图谱问答(Knowledge Base Question Answering, KBQA)场景下多跳推理(multi-hop reasoning)的局限性。现有方法通常依赖预定义的流水线进行顺序推理,导致灵活性差且易产生误差传播。其解决方案的关键在于提出KG-Hopper框架,该框架基于强化学习(Reinforcement Learning, RL),将整个知识图谱遍历与决策过程嵌入到单一推理阶段中,实现全局性的跨步骤依赖建模和动态路径探索(含回溯机制),从而在单次推理中完成集成式多跳推理,显著提升性能并保持模型紧凑、开放和数据高效。
链接: https://arxiv.org/abs/2603.21440
作者: Shuai Wang,Yinan Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to IJCNN 2026
Abstract:Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs) exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified ``thinking’’ stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: this https URL.
[NLP-47] PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)弱点分析中因向量嵌入(vector embeddings)仅能捕捉主题相似性而导致的细粒度识别困难问题,尤其在提示(prompt)语义相近但具体程度(specificity)不同时难以区分其难度差异。解决方案的关键在于提出 PROMPT2BOX 方法,通过训练一个编码器将提示映射到盒嵌入(box embedding)空间,从而同时建模语义相似性和提示间的具体性关系(如“写一个冒险故事”比“写一个故事”更具体),并设计了一种新颖的盒嵌入降维技术以支持数据集可视化与对比,实验表明该方法显著优于传统向量基线,在识别 LLM 弱点和构建基于指令具体性的层次聚类树方面均表现出更强的性能。
链接: https://arxiv.org/abs/2603.21438
作者: Neeladri Bhuiya,Shib Sankar Dasgupta,Andrew McCallum,Haw-Shiuan Chang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., “writing an adventure story” is more specific than “writing a story”). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.
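盒嵌入中的“具体性”对应盒的包含关系:更一般的提示(如 "writing a story")的盒应包含更具体提示(如 "writing an adventure story")的盒。下面用假设坐标示意包含判断;真实盒嵌入由训练好的编码器输出:

```python
def box_contains(outer, inner):
    """盒以 (min_corner, max_corner) 表示;当且仅当 inner 的每一维
    都落在 outer 的区间内时,outer 包含 inner(即 outer 更一般)。"""
    (omin, omax), (imin, imax) = outer, inner
    return (all(o <= i for o, i in zip(omin, imin))
            and all(i <= o for i, o in zip(imax, omax)))
```

与向量嵌入的对称相似度不同,包含关系是非对称的,因此能表达提示之间的蕴含/具体性方向。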
[NLP-48] Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLM s
【速读】: 该论文旨在解决大型语言模型在低资源语言(如巴西葡萄牙语)中因计算成本过高而导致的可访问性问题。其核心解决方案是采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)与量化(Quantization)技术相结合的方法,对基于编码器的BERTimbau模型进行优化。关键发现包括:LoRA方法可在保持95.8%基线性能的同时将训练时间减少73.5%;较高学习率(2e-4)显著提升PEFT效果(F1最高提升19.71点);大模型对量化更具鲁棒性(量化损失仅为小模型的一半)。这些成果表明,通过PEFT和量化策略,可在保证性能的前提下大幅降低计算开销,推动面向低资源语言的绿色人工智能(Green AI)实践。
链接: https://arxiv.org/abs/2603.21418
作者: Mariela M. Nina,Caio Veloso Costa,Lilian Berton,Didier A. Vega-Oliveros
机构: Federal University of São Paulo (UNIFESP)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, PROPOR 2026
Abstract:Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
[NLP-49] Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在主观文本标注任务中因群体差异导致的标注偏差问题,即传统方法假设存在单一真实标签(ground truth),而忽视了不同人口统计群体间对同一文本可能存在合理且有意义的分歧。其解决方案的核心是提出一种**视角驱动推理(Perspective-Driven Inference)**方法,将各群体标注分布视为核心关注量,并通过小规模人类标注预算进行估计;关键创新在于设计了一种自适应采样策略,聚焦于LLM代理表现最差的人群子集,从而以最小的人工成本实现对困难群体标注质量的针对性提升,同时保持整体标注覆盖范围。
链接: https://arxiv.org/abs/2603.21404
作者: Navya Mehrotra,Adam Visokay,Kristina Gligorić
机构: Johns Hopkins University (约翰霍普金斯大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.
[NLP-50] Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限场景下部署时面临的计算成本过高问题。其解决方案的关键在于提出了一种新的综合评估指标——性能-效率比(Performance-Efficiency Ratio, PER),该指标通过几何平均归一化整合了准确率、吞吐量、内存占用和延迟四项关键因素,从而实现了对不同模型在任务特定场景下的效率与性能的量化比较。研究发现,在五个不同的自然语言处理(Natural Language Processing, NLP)任务中,小模型(0.5–3B参数)在PER得分上均优于大模型,为生产环境中优先考虑推理效率而非微小精度提升的部署决策提供了定量依据。
链接: https://arxiv.org/abs/2603.21389
作者: Jinghan Cao,Yu Ma,Xinjin Li,Qingyang Ren,Xiangyun Chen
机构: San Francisco State University (旧金山州立大学); Carnegie Mellon University (卡内基梅隆大学); Columbia University (哥伦比亚大学); Cornell University (康奈尔大学); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for publication at ESANN 2025. This is a task-specific efficiency analysis comparing small language models
Abstract:Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5–3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
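论文将准确率、吞吐量、内存占用和延迟用几何平均归一化整合为 PER 指标。下面给出一个按该思路的极简示意(归一化区间与指标取值均为假设,并非论文官方实现):

```python
import math

def normalize(value, vmin, vmax, higher_is_better=True):
    """把指标线性归一到 (0, 1];下限截断防止几何平均遇到 0。"""
    span = vmax - vmin
    x = (value - vmin) / span if span else 1.0
    if not higher_is_better:      # 内存、延迟越小越好
        x = 1.0 - x
    return max(x, 1e-6)

def per_score(acc, throughput, memory, latency, ranges):
    """PER 的示意版:四项归一化指标的几何平均。"""
    parts = [
        normalize(acc, *ranges["acc"]),
        normalize(throughput, *ranges["throughput"]),
        normalize(memory, *ranges["memory"], higher_is_better=False),
        normalize(latency, *ranges["latency"], higher_is_better=False),
    ]
    return math.exp(sum(math.log(p) for p in parts) / len(parts))

# 假设的小模型 vs 大模型指标(acc %, tokens/s, GB, ms)
ranges = {"acc": (0, 100), "throughput": (0, 200), "memory": (0, 80), "latency": (0, 1000)}
small = per_score(acc=82, throughput=150, memory=6, latency=120, ranges=ranges)
large = per_score(acc=86, throughput=20, memory=70, latency=800, ranges=ranges)
```

在这组假设数值下,大模型略高的准确率不足以抵消其在效率维度上的劣势,PER 得分低于小模型,与论文结论的方向一致。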
[NLP-51] PLR: Plackett-Luce for Reordering In-Context Learning Examples
【速读】: 该论文旨在解决大语言模型在少样本学习(few-shot learning)中因示例顺序敏感而导致性能波动的问题。现有方法通常依赖于穷举搜索或基于标签概率熵等置信度指标进行排序,但前者计算复杂度高难以实施,后者在某些任务(如数学推理)中不可用。解决方案的关键在于提出PLR(Plackett-Luce for Reordering),采用Plackett-Luce概率模型对所有可能的示例顺序建模,并通过迭代优化使分布集中于任务层面表现优异的顺序;同时利用Gumbel扰动-排序采样机制高效生成候选顺序,从而在不显式枚举全部排列的情况下显著提升少样本分类和数学推理任务的准确率。
链接: https://arxiv.org/abs/2603.21373
作者: Pawel Batorski,Paul Swoboda
机构: Heinrich Heine Universität Düsseldorf (海因里希·海涅杜塞尔多夫大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the n! possible orderings is infeasible. Therefore more efficient ordering methods use model confidence measures (e.g., label-probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in-context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett-Luce model. PLR models orderings using a Plackett-Luce distribution and iteratively updates its parameters to concentrate probability mass on high-performing orderings under a task-level metric. Candidate orderings are sampled efficiently via a Gumbel perturb-and-sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few-shot accuracy for k ∈ {4, 8, 16, 32} examples, and we further demonstrate gains on mathematical reasoning tasks where label-based ordering methods are not applicable. Our code is available at this https URL.
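Plackett-Luce 分布的采样可用 Gumbel perturb-and-sort 实现:给每个示例的 log-权重加上独立 Gumbel(0,1) 噪声后按扰动值降序排序,所得排列即服从对应的 Plackett-Luce 分布。下面是一个示意(得分数值为假设):

```python
import math
import random

def sample_ordering(scores, rng):
    """Gumbel perturb-and-sort:对每个 log-权重加 Gumbel(0,1) 噪声后降序排序,
    等价于从 Plackett-Luce(scores) 中抽取一个完整排列。"""
    perturbed = []
    for i, s in enumerate(scores):
        g = -math.log(-math.log(rng.random()))   # Gumbel(0,1) 噪声
        perturbed.append((s + g, i))
    return [i for _, i in sorted(perturbed, reverse=True)]

rng = random.Random(0)
scores = [2.0, 0.0, -1.0, 0.5]        # 假设的 4 个示例的 log-权重
order = sample_ordering(scores, rng)

# 统计验证:log-权重最高的示例应最常排在第一位
first_counts = [0] * 4
for _ in range(2000):
    first_counts[sample_ordering(scores, rng)[0]] += 1
```

这种采样方式的代价是 O(n log n) 的一次排序,避免了对 n! 个排列的显式枚举。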
[NLP-52] Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection
【速读】: 该论文旨在解决 conspiracy theories(阴谋论)在社交媒体中传播时难以被系统识别与理解的问题,尤其是其语义结构复杂、反权威特征显著,导致现有方法在跨领域检测中泛化能力不足。解决方案的关键在于提出“Conspiracy Frame”(阴谋论框架),这是一种基于框架语义学(frame-semantics)和符号学(semiotics)的细粒度语义表示方法,并构建了首个针对 Telegram 消息的 span-level 标注数据集(Conspiracy Frames dataset)。该框架通过识别如 Kinship(亲属关系)、Ingest_substance(摄取物质)等抽象语义模式,为生成式 AI(Generative AI)提供更深层次的语义与符号学感知能力,从而提升对阴谋论叙事的识别准确率与可迁移性。
链接: https://arxiv.org/abs/2603.21368
作者: Heidi Campana Piva,Shaina Ashraf,Maziar Kianimoghadam Jouneghani,Arianna Longo,Rossana Damiano,Lucie Flek,Marco Antonio Stranisci
机构: University of Turin, Italy; University of Bonn, Germany; aequa-tech, Italy
类目: Computation and Language (cs.CL)
备注:
Abstract:Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (this http URL.) dataset: a corpus of Telegram messages annotated at span-level. The Conspiracy Frame and this http URL. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to clear increase of performance, it has potential; the mapping of annotated spans with FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.
[NLP-53] TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中对每个token均遍历所有层的问题,导致计算资源浪费和延迟较高。其核心挑战在于如何实现动态的、基于token级别的层跳过机制,以提升推理效率而不牺牲准确性。解决方案的关键是提出TIDE系统——一个后训练(post-training)框架,在周期性检查点层插入轻量级可学习路由模块(learned routers),并在推理时根据隐藏状态收敛情况自动选择最早退出层(early exit layer)。该方法无需重新训练模型,兼容HuggingFace因果语言模型,并通过融合CUDA内核支持多种精度格式,显著降低预填充阶段延迟并提升吞吐量。
链接: https://arxiv.org/abs/2603.21365
作者: Jaber Jaber,Osama Jaber
机构: RightNow AI
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 5 tables, 2 figures. Code: this https URL
Abstract:Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: this https URL
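TIDE 在周期性的检查点层判断当前 token 的隐藏状态是否已收敛,收敛即提前退出。论文用的是学习到的轻量路由器;下面用余弦相似度写一个同思路的极简示意(阈值、检查点层号与向量均为假设):

```python
def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def earliest_exit(checkpoint_states, threshold=0.999):
    """checkpoint_states: [(层号, 隐藏状态向量), ...] 按层排列。
    返回首个与前一检查点余弦相似度超过阈值的层号;始终未收敛则走到最后一层。"""
    for k in range(1, len(checkpoint_states)):
        layer, state = checkpoint_states[k]
        prev_state = checkpoint_states[k - 1][1]
        if cosine(state, prev_state) >= threshold:
            return layer
    return checkpoint_states[-1][0]

# 模拟某个 token 的隐藏状态在第 11 层之后基本不再变化
states = [(3, [1.0, 0.0]), (7, [0.6, 0.8]), (11, [0.58, 0.81]), (15, [0.58, 0.81])]
exit_layer = earliest_exit(states)
```

实际系统中,这个收敛判断由每个检查点处的小型可学习路由器完成,并用融合 CUDA 内核执行,此处仅演示"隐藏状态收敛即退出"的决策逻辑。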
[NLP-54] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
【速读】: 该论文旨在解决大语言模型作为评判者(LLM-as-Judge)在评估智能体任务时失效的问题,核心原因是固定评分标准无法适配不同任务的关键评价维度:例如代码调试需关注正确性(Correctness)与错误处理能力(Error Handling),而网络导航则强调目标对齐性(Goal Alignment)与动作效率(Action Efficiency)。解决方案的关键在于提出ADARUBRIC,其通过从任务描述中动态生成特定于任务的评价量表(rubric),以置信度加权的方式逐步骤评分轨迹,并引入新颖的维度感知过滤器(DimensionAwareFilter),该过滤器满足理论上的必要条件,可防止高分维度掩盖低分维度的失败。实验证明,ADARUBRIC在WebArena和ToolBench上实现了与人类评分高度一致(Pearson r=0.79)且部署可靠(Krippendorff’s α=0.83),同时显著提升基于直接偏好优化(DPO)训练的智能体性能,在多个基准测试中相较基线提升6.8–8.5个百分点,且无需人工设计评分规则。
链接: https://arxiv.org/abs/2603.21362
作者: Liang Ding
机构: Alibaba Group (阿里巴巴集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter - a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment-grade reliability (Krippendorff's α=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE-bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps - both without any rubric engineering. Code: this https URL.
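DimensionAwareFilter 的目标是防止个别高分维度掩盖维度级失败。一个自然的实现是:只有当被选轨迹在每个维度上都不差于被拒轨迹、且总体有优势时,才保留该偏好对。以下为示意(字段名、松弛量 margin 均为假设,并非论文官方实现):

```python
def dimension_aware_filter(pairs, margin=0.0):
    """pairs: [(chosen_scores, rejected_scores), ...],每项是 {维度: 分数} 字典。
    仅保留 chosen 在所有共有维度上均不低于 rejected(允许 margin 松弛)
    且总分严格更高的偏好对,避免高分维度掩盖维度级失败。"""
    kept = []
    for chosen, rejected in pairs:
        dims = chosen.keys() & rejected.keys()
        if all(chosen[d] >= rejected[d] - margin for d in dims) and \
           sum(chosen[d] - rejected[d] for d in dims) > 0:
            kept.append((chosen, rejected))
    return kept

pairs = [
    # 总分更高,且每个维度都不差:保留
    ({"correctness": 9, "efficiency": 8}, {"correctness": 7, "efficiency": 8}),
    # 总分更高,但 correctness 反而更差:高分维度掩盖了失败,过滤掉
    ({"correctness": 4, "efficiency": 10}, {"correctness": 8, "efficiency": 5}),
]
filtered = dimension_aware_filter(pairs)
```

第二个偏好对的总分(14 vs 13)更高,但若只看总分就会把一条 correctness 退化的轨迹当作"更优",维度感知过滤正是为了排除这类样本。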
[NLP-55] Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言的区域方言上存在性能偏差的问题,尤其缺乏系统性评估框架。其解决方案的关键在于提出一个两阶段框架:第一阶段通过检索增强生成(Retrieval-Augmented Generation, RAG)管道将标准孟加拉语问题翻译并标注为9种方言变体,构建包含4000个问题对的高质量数据集,并采用LLM-as-a-judge方法评估翻译保真度,该方法经人类相关性验证优于传统指标;第二阶段在该标注数据集上基准测试19个LLMs,进行68,395次RLAIF评估,结合多评委一致性与人工回退确保结果可靠性,从而量化方言间的性能差异,并提出Critical Bias Sensitivity (CBS)指标用于安全关键场景下的偏见敏感性分析。
链接: https://arxiv.org/abs/2603.21359
作者: K. M. Jubair Sami,Dipto Sumit,Ariyan Hossain,Farig Sadeque
机构: BRAC University (BRAC大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages, 1 figure, 5 tables
Abstract:Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
[NLP-56] AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
【速读】: 该论文旨在解决大语言模型代理(LLM agents)在真实世界任务中大量失败轨迹被直接丢弃的问题,这些失败轨迹本可作为宝贵的学习信号。当前主流方法仅使用成功轨迹进行训练,导致训练数据利用率低下,尤其在WebArena和ToolBench等复杂交互任务上,GPT-4o的成功率不足15%,而其他模型也普遍低于55% pass@1。解决方案的关键在于提出AgentHER框架,其核心思想是借鉴Hindsight Experience Replay(HER)原则,将失败轨迹重新解释为对可达成的替代目标(alternative goal)的有效示范。通过四阶段流程——失败分类、结果提取、基于大语言模型(LLM)引导的提示重标注与置信度门控、以及数据打包——AgentHER能够自动将废弃的失败轨迹转化为高质量的监督微调(SFT)、直接偏好优化(DPO)及ShareGPT格式的数据,从而实现零成本规则匹配与LLM判别两种方式的高效数据增强。实验证明,AgentHER在多个模型规模下均显著提升性能(+7.1–11.7个百分点),且数据效率提升两倍,仅用50%的成功示例即可达到基线性能,同时人类评估显示重标注精度高达97.7%。
链接: https://arxiv.org/abs/2603.21357
作者: Liang Ding
机构: Alibaba Group(阿里巴巴集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:LLM agents fail on the majority of real-world tasks – GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) – yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline – failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging – that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency – matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
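AgentHER 的核心想法是:失败于目标 A 的轨迹往往是某个可达成目标 B 的正确示范。其零成本的规则版重标注可以简单到把指令中的原目标替换为轨迹实际达成的结果。以下为示意(字段名与数据格式均为假设,非论文官方实现):

```python
def hindsight_relabel(trajectory):
    """把失败轨迹重标注为对其实际达成结果的成功示范(规则版,零成本)。
    无可用替代目标(或本来就成功)时返回 None。"""
    achieved = trajectory["achieved_outcome"]
    if not achieved or achieved == trajectory["goal"]:
        return None
    return {
        # 把指令中的原目标替换为实际达成的结果
        "instruction": trajectory["instruction"].replace(trajectory["goal"], achieved),
        "actions": trajectory["actions"],
        "label": "success",              # 对新目标而言,这条轨迹是正确示范
    }

traj = {
    "instruction": "Navigate to the pricing page and download the PDF.",
    "goal": "download the PDF",
    "achieved_outcome": "open the pricing page",
    "actions": ["click('Pricing')"],
}
relabeled = hindsight_relabel(traj)
```

论文的完整流水线在此之上还包括失败分类、结果抽取、LLM 引导的重标注与置信度门控,此处只演示最核心的"目标替换"一步。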
[NLP-57] Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理经典认知推理(epistemic reasoning)任务时,其行为机制被简单归类为“认知推理”与“脆弱记忆化”二元对立的局限性问题。作者指出,这种划分忽略了记忆化本质上是一种特殊的“归约”(reduction)现象,即新问题被映射到已知问题上求解。为此,论文提出“归约阶梯”(reduction ladder)这一系统性方法,通过一系列逐步改变问题实例但保持底层逻辑不变的修改操作,使归约难度递增,从而更精细地评估模型是否真正具备认知推理能力。关键发现是:部分模型依赖归约策略成功应对简单任务,但随着问题复杂度提升,所有模型均在需要真正认知推理时表现下降,揭示了当前LLMs在深层逻辑理解上的不足。
链接: https://arxiv.org/abs/2603.21350
作者: Adi Gabay,Gabriel Stanovsky,Liat Peterfreund
机构: The Hebrew University of Jerusalem (希伯来大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents’ knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.
[NLP-58] TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols
【速读】: 该论文旨在解决临床试验中“时间毒性”(Time Toxicity)这一关键指标的自动化提取难题,即从试验方案文档中准确计算受试者因参与研究而累积的医疗接触天数。传统方法依赖人工标注,效率低且易出错。其解决方案的核心在于提出了一种基于大语言模型(LLM)的三阶段流水线工具 TimeTox,采用 Google Gemini 模型分步完成:首先从完整协议 PDF 中提取摘要信息,其次在每个治疗臂的六个累计时间点量化时间毒性,最后通过基于位置的臂匹配实现多轮运行一致性验证。研究发现,在真实世界肿瘤学协议上,单次通过(vanilla)架构虽准确性较低,但具有更高提取稳定性(95.3%临床可接受精度,IQR ≤ 3 天),成为生产部署的关键选择依据,凸显了实际数据稳定性优于合成基准测试准确性的重要性。
链接: https://arxiv.org/abs/2603.21335
作者: Saketh Vinjamuri,Marielle Fis Loperena,Marie C. Spezia,Ramez Kouzy
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages, 5 figures, 7 tables
Abstract:Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy (±3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR ≤ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
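TimeTox 用多次独立运行结果的 IQR 衡量抽取稳定性:IQR ≤ 3 天视为临床可接受,并取中位数作为共识值。下面按该标准写一个极简的一致性判定(分位数取法为简化假设,对 3 次运行退化为极差):

```python
def iqr(values):
    """最近秩法的简化 IQR(Q3 - Q1);对 n=3 退化为极差,足以判定稳定性。"""
    s = sorted(values)
    n = len(s)
    q1 = s[max(0, (n + 1) // 4 - 1)]
    q3 = s[min(n - 1, 3 * (n + 1) // 4 - 1)]
    return q3 - q1

def consensus(run_results, tolerance=3):
    """run_results: 各次运行抽取到的接触天数。
    返回 (中位数共识值, 是否临床可接受即 IQR <= tolerance)。"""
    s = sorted(run_results)
    median = s[len(s) // 2]
    return median, iqr(run_results) <= tolerance

days, acceptable = consensus([14, 14, 15])        # 三次运行高度一致
_, unstable_ok = consensus([10, 14, 40])          # 存在离群运行:不稳定
```

这样即使单次运行偶有偏差,流水线也能通过跨运行共识给出稳定的产出,并自动标记出需要人工复核的协议。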
[NLP-59] Improving Coherence and Persistence in Agentic AI for System Optimization
【速读】: 该论文旨在解决生成式 AI 在设计高性能系统启发式算法过程中面临的两大关键挑战:一是进化邻域偏差(evolutionary neighborhood bias),即传统方法因依赖单一标量基准得分而容易陷入局部最优,难以实现多步骤协同优化;二是连贯性天花板(coherence ceiling),即现有代理框架在长时程探索中存在上下文退化或无法跨独立运行积累知识的问题。解决方案的核心在于提出 Engram 架构——一种解耦长时程探索与单个上下文窗口约束的代理研究者结构。Engram 通过一系列迭代的代理执行“设计-测试-分析”循环,在每轮结束后将代码快照、日志和结果存入持久化 Archive,并提炼出高阶建模洞见形成 Research Digest(研究摘要)。后续代理以全新上下文启动,读取 Research Digest 继续推进探索,从而实现知识的持续累积与跨轮次复用,显著提升了在多云组播、LLM 推理请求路由及数据库自然语言查询中 KV 缓存重用优化等复杂系统问题上的性能表现。
链接: https://arxiv.org/abs/2603.21321
作者: Pantea Karimi,Kimia Noorbakhsh,Mohammad Alizadeh,Hari Balakrishnan
机构: MIT(麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Designing high-performance system heuristics is a creative, iterative process requiring experts to form hypotheses and execute multi-step conceptual shifts. While Large Language Models (LLMs) show promise in automating this loop, they struggle with complex system problems due to two critical failure modes: evolutionary neighborhood bias and the coherence ceiling. Evolutionary methods often remain trapped in local optima by relying on scalar benchmark scores, failing when coordinated multi-step changes are required. Conversely, existing agentic frameworks suffer from context degradation over long horizons or fail to accumulate knowledge across independent runs. We present Engram, an agentic researcher architecture that addresses these limitations by decoupling long-horizon exploration from the constraints of a single context window. Engram organizes exploration into a sequence of agents that iteratively design, test, and analyze mechanisms. At the conclusion of each run, an agent stores code snapshots, logs, and results in a persistent Archive and distills high-level modeling insights into a compact, persistent Research Digest. Subsequent agents then begin with a fresh context window, reading the Research Digest to build on prior discoveries. We find that Engram exhibits superior performance across diverse domains including multi-cloud multicast, LLM inference request routing, and optimizing KV cache reuse in databases with natural language queries.
[NLP-60] Enhancing reasoning accuracy in large language models during inference time
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多步推理任务中可靠性不足的问题,尤其是在未进行额外训练或微调的情况下。其核心解决方案在于引入三种推理时(inference-time)策略以提升模型的推理准确性:一是通过控制温度和核采样(nucleus sampling)实现自一致性(self-consistency),即多次随机采样后选择最频繁的最终答案;二是采用双模型推理一致性验证(dual-model reasoning agreement),通过两个独立模型输出的一致性来增强可信度;三是利用自省(self-reflection)机制让模型自我批判与修正推理过程。实验表明,自一致性方法在低风险场景下效果最优,能带来9%至15%的绝对准确率提升,且计算开销最小,而双模型方法适用于对可靠性要求更高的中等风险场景。
链接: https://arxiv.org/abs/2603.21301
作者: Vinay Sharma,Manish Jain
机构: FirstSource
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. In this work, we provide a controlled comparative evaluation across three inference-time strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding; it is well-suited for low-risk domains, offering meaningful gains with minimal overhead. The dual-model approach provides additional confirmation for model reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
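自一致性策略的核心逻辑很简单:用非零温度多次采样,统计最频繁的最终答案。下面用一个预设答案的桩采样器演示投票部分(桩采样器为假设,真实场景中 sample_fn 应调用模型 API 并抽取最终答案):

```python
from collections import Counter

def self_consistency(sample_fn, n_samples=9):
    """多次调用 sample_fn() 采样最终答案,并做多数表决。
    返回 (胜出答案, 其票数占比,可视作一种朴素置信度)。"""
    votes = Counter(sample_fn() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

# 桩采样器:用一串预设答案模拟带噪声的模型多次采样输出
canned = iter(["42", "41", "42", "24", "42", "42", "41", "42", "42"])
answer, confidence = self_consistency(lambda: next(canned), n_samples=9)
```

票数占比还可以顺带充当拒答信号:占比过低说明采样间分歧大,可按需要转交人工或更强的模型处理。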
[NLP-61] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection
【速读】: 该论文旨在解决社交媒体中隐性仇恨言论(implicit hate speech)的检测难题,尤其针对多模态内容(如图文结合)中因模态间交互而产生的语义意图转变问题。传统自动化检测系统在面对此类复杂表达时表现不佳,因其难以捕捉模态融合后生成的非显性毒性信息。解决方案的关键在于提出一个名为H-VLI(Hate via Vision-Language Interplay)的新基准,该基准强调语义意图的细微变化而非直接的文本或视觉冒犯词汇,并设计了ARCADE(Asymmetric Reasoning via Courtroom Agent DEbate)框架,通过模拟法庭辩论机制,让模型在“控方”与“辩方”代理的对抗推理中深入挖掘深层语义线索,从而提升对隐性仇恨言论的识别能力。
链接: https://arxiv.org/abs/2603.21298
作者: Runze Sun,Yu Zheng,Zexuan Xiong,Zhongjin Qu,Lei Chen,Jiwen Lu,Jie Zhou
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: this https URL
[NLP-62] The Library Theorem: How External Organization Governs Agentic Reasoning Capacity
【速读】: 该论文旨在解决大语言模型(LLM)在推理过程中对内部记忆(即上下文窗口)进行低效检索的问题,尤其是在需要频繁访问历史推理状态时,传统基于顺序扫描的机制导致检索成本随存储规模线性增长(Ω(N)),严重限制了深度推理能力。解决方案的关键在于引入结构化外部记忆(structured retrieval)——通过构建索引(index)实现对推理状态的高效访问,将单次查询的检索复杂度从Ω(N)降至O(log_b N),累计检索成本从Θ(T²)降至O(T log_b T),显著提升长链推理效率。实验验证表明,索引机制在抽象内容上可稳定实现O(1)次页面读取,而无索引的排序页面无法有效缩小差距;更关键的是,研究揭示了参数化记忆(parametric memory)与外部索引之间的竞争关系:当模型理解内容后可能绕过检索协议直接生成答案,造成不可控的token消耗,因此提出“职责分离”策略——由语言模型负责语义理解以构建索引,由确定性算法执行索引遍历,从而兼顾认知优势与计算效率。
链接: https://arxiv.org/abs/2603.21272
作者: Zachary F. Mainen
机构: Champalimaud Foundation (查姆帕利莫研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
备注: 19 pages, 6 figures
Abstract:Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval – indexing over one's own reasoning state – remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: O(log_b N) versus Ω(N) page reads per query, and O(T log_b T) versus Θ(T²) cumulative cost over T reasoning steps – a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types – random hashes, ordered integers, and encyclopedia entries – varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves median 1 page read regardless of store size, confirming the O(1) prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal log_2 N search but still loses to the index by 5×. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.
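论文的核心对比可以用页读取计数直接演示:顺序扫描每次查询最坏需要 N/页大小 次页读取,而 b 叉索引每层只读一页,共约 ⌈log_b N⌉ 次。以下为示意(页大小 16、分支因子 16 均为假设参数):

```python
import math

def scan_reads(store, key, page_size=16):
    """顺序扫描:逐页读取直到命中;最坏情况读完全部 ceil(N/page_size) 页。"""
    reads = 0
    for start in range(0, len(store), page_size):
        reads += 1
        if key in store[start:start + page_size]:
            return reads
    return reads

def indexed_reads(n_items, branching=16):
    """b 叉索引:每层读一页,共 ceil(log_b N) 次页读取。"""
    return max(1, math.ceil(math.log(n_items, branching)))

store = list(range(5000))                     # 对应论文实验的最大 store 规模
worst_scan = scan_reads(store, 4999)          # 目标落在最后一页:最坏情况
index_cost = indexed_reads(len(store))
```

在 N=5000 时,最坏扫描需要 313 次页读取,而 16 叉索引只需 4 次;随着推理步数 T 增加,这个差距按 Θ(T²) 对 O(T log_b T) 累积放大。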
[NLP-63] Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles
【速读】: 该论文旨在解决从科学全文中提取假设及其支持的统计证据这一难题,这是实证研究结果综合的关键步骤,但因文档长度和科学论点分散在不同章节而极具挑战性。其解决方案的核心在于提出一种两阶段的“检索-抽取”框架(retrieve-and-extract framework),通过精细化的上下文选择来提升目标信息的可获取性:首先优化检索质量(如使用重排序、微调检索器等策略)以减少噪声干扰,其次利用大语言模型(Large Language Model, LLM)进行结构化抽取;实验表明,针对性地选取高质量、低冗余的上下文能显著改善假设抽取性能,而统计证据抽取仍面临较大困难,主要受限于LLM对数值与文本混合表述的理解能力,而非单纯检索失败所致。
链接: https://arxiv.org/abs/2603.21193
作者: Sai Koneru,Jian Wu,Sarah Rajtmajer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:
Abstract:Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article’s abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
[NLP-64] Explainable Semantic Textual Similarity via Dissimilar Span Detection LREC2026
【速读】: 该论文旨在解决现有语义文本相似度(Semantic Textual Similarity, STS)方法将复杂语义差异简化为单一分数而导致可解释性不足的问题。其核心解决方案是提出新的任务——不相似片段检测(Dissimilar Span Detection, DSD),通过识别文本对中语义差异的特定词元或片段,帮助用户理解影响相似度得分的具体内容,并提升依赖STS的下游任务性能。关键创新在于构建了一个名为Span Similarity Dataset (SSD) 的新数据集,该数据集通过结合大语言模型(Large Language Models, LLMs)与人工验证的半自动化流程生成,并评估了多种无监督和监督基线方法,结果表明LLMs和监督模型表现最优,但整体性能仍较低,凸显该任务的挑战性。
链接: https://arxiv.org/abs/2603.21174
作者: Diego Miguel Lozano,Daryna Dementieva,Alexander Fraser
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026
Abstract:Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
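DSD 任务的一个最朴素的表层基线是:对两段文本做词元级对齐,把未对齐的片段标为"差异片段"。下面用标准库 difflib 实现(仅作基线示意,与论文提出的方法无关,也无法捕捉同义改写造成的语义差异):

```python
from difflib import SequenceMatcher

def dissimilar_spans(text_a, text_b):
    """返回两句中未能对齐的词元片段(表层对齐版的 dissimilar span 基线)。"""
    a, b = text_a.split(), text_b.split()
    sm = SequenceMatcher(a=a, b=b)
    spans_a, spans_b = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":                       # replace / delete / insert
            if i1 != i2:
                spans_a.append(" ".join(a[i1:i2]))
            if j1 != j2:
                spans_b.append(" ".join(b[j1:j2]))
    return spans_a, spans_b

spans_a, spans_b = dissimilar_spans(
    "the cat sat on the red mat",
    "the cat sat on the blue mat",
)
```

这类表层基线对改写和同义替换无能为力,这也正是论文要引入 LIME/SHAP、LLM 以及监督方法的原因。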
[NLP-65] Entropy Alone is Insufficient for Safe Selective Prediction in LLMs
【速读】: 该论文旨在解决语言模型在高风险场景下因幻觉(hallucination)导致的错误输出问题,通过选择性预测(selective prediction)机制实现对高风险案例的回避,从而降低系统整体错误率。其核心挑战在于现有基于熵(entropy)的不确定性量化方法存在模型依赖性的失效模式,导致拒答行为不可靠。解决方案的关键在于将熵得分与一个正确性探测信号(correctness probe signal)相结合,形成联合评分机制,显著提升了风险-覆盖率权衡(risk–coverage trade-off)和校准性能(calibration performance),验证了部署导向评估的重要性,即确保系统能够在指定的风险水平下稳定运行。
链接: https://arxiv.org/abs/2603.21172
作者: Edward Phillips,Fredrik K. Gustafsson,Sean Wu,Anshul Thakur,David A. Clifton
机构: University of Oxford (牛津大学); Oxford Suzhou Centre for Advanced Research (牛津苏州先进研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk–coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
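根据摘要描述,"将熵分数与正确性探测信号结合以决定是否拒答"的思路可以用如下最小示意代码说明;其中的加权组合方式、alpha 与阈值取值均为本文假设,并非原论文的具体实现:

```python
import math

def token_entropy(probs):
    """对候选答案的概率分布计算熵(不确定性越高熵越大)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def combined_score(probs, probe_correctness, alpha=0.5):
    """将归一化熵与正确性探测信号加权组合为单一风险分数。
    alpha 为假设的权重超参数,原文的组合形式可能不同。"""
    h = token_entropy(probs) / math.log(len(probs))  # 归一化到 [0, 1]
    return alpha * h + (1 - alpha) * (1 - probe_correctness)

def selective_predict(answer, probs, probe_correctness, risk_threshold=0.4):
    """风险高于阈值时拒答(返回 None),否则给出答案。"""
    if combined_score(probs, probe_correctness) > risk_threshold:
        return None  # abstain
    return answer

# 低熵且探测器认为正确 -> 作答;高熵且探测器不确定 -> 拒答
print(selective_predict("Paris", [0.9, 0.05, 0.05], probe_correctness=0.95))
print(selective_predict("Lyon", [0.4, 0.35, 0.25], probe_correctness=0.3))
```

阈值 risk_threshold 对应摘要中"在指定风险水平下运行"的部署需求:调低阈值即牺牲覆盖率换取更低错误率。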
[NLP-66] Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects
【速读】: 该论文试图解决多模态视觉语言模型(VLMs)在评估过程中对孟加拉文化代表性不足的问题,尤其是缺乏针对孟加拉语及其方言和历史关联语言的文化感知能力测试。解决方案的关键在于构建了一个名为BanglaVerse的基准数据集,该数据集基于1,152张人工标注的图像,覆盖九个文化领域,并扩展至四种语言和五种孟加拉语方言,共约32.3K个样本,支持视觉问答(Visual Question Answering, VQA)与图像描述生成任务。实验表明,仅使用标准孟加拉语会高估模型的真实能力,而方言变化显著降低描述生成性能,且文化知识缺失是主要瓶颈,而非单纯的视觉定位问题。因此,BanglaVerse为衡量跨语言变体下的文化感知多模态理解提供了更真实的测试平台。
链接: https://arxiv.org/abs/2603.21165
作者: Nurul Labib Sayeedi,Md. Faiyaz Abdullah Sayeedi,Shubhashis Roy Dipta,Rubaya Tabassum,Ariful Ekraj Hridoy,Mehraj Mahmood,Mahbub E Sobhani,Md. Tarek Hasan,Swakkhar Shatabda
机构: United International University, Bangladesh; BRAC University, Bangladesh; University of Maryland, Baltimore County, USA
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, particularly in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
[NLP-67] Mixture of Chapters: Scaling Learnt Memory in Transformers ICLR2026
【速读】: 该论文旨在解决Transformer模型缺乏显式存储和组织训练过程中所获知识的架构机制这一问题。其核心解决方案是引入可学习的稀疏记忆库(learnable sparse memory banks),即一组随机初始化并端到端训练的潜在标记(latent tokens),通过交叉注意力机制供Transformer层查询以检索存储的知识。为在不显著增加计算成本的前提下扩展记忆容量,作者提出受Mixture-of-Experts架构启发的基于章节的路由机制(chapter-based routing),将记忆库划分为多个章节,并训练一个路由器为每个输入选择相关子集,从而实现262K记忆标记的高效扩展且保持计算可行性。实验表明,该方法在同等浮点运算量(iso-FLOP)条件下优于标准Transformer,在预训练与指令微调任务中均表现出更强的知识获取与保留能力,证明了显式关联记忆作为模型参数隐式容量的互补扩展方向。
链接: https://arxiv.org/abs/2603.21096
作者: Tasmay Pankaj Tibrewal,Pritish Saha,Ankit Meda,Kunal Singh,Pradeep Moturi
机构: IIT Kharagpur(印度理工学院克哈格普尔分校); Fractal AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 2 figures, 8 tables. Accepted at ICLR 2026 New Frontiers in Associative Memory Workshop. Code available at this https URL
Abstract:Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
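摘要中"将记忆库划分为章节、由路由器为每个输入选择相关子集"的机制可以用如下简化示意说明;章节键向量、点积路由与 top-k 选择均为本文为便于演示而做的假设性简化,并非原论文代码:

```python
def route_chapters(query, chapter_keys, top_k=2):
    """为输入 query 选出路由得分最高的 top_k 个章节(以点积作为得分)。
    chapter_keys: 每个章节对应一个可学习的键向量(此处为假设的简化)。"""
    scores = [sum(q * k for q, k in zip(query, key)) for key in chapter_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def gather_memory(memory_bank, selected):
    """只收集被选中章节的记忆 token,供后续交叉注意力查询,
    从而避免对全部记忆 token 做注意力计算。"""
    return [tok for i in selected for tok in memory_bank[i]]

# 4 个章节、每章 2 个记忆 token(二维向量),仅作演示
memory_bank = [[[1.0, 0.0], [0.9, 0.1]],
               [[0.0, 1.0], [0.1, 0.9]],
               [[0.5, 0.5], [0.6, 0.4]],
               [[-1.0, 0.0], [0.0, -1.0]]]
chapter_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]

selected = route_chapters([1.0, 0.2], chapter_keys, top_k=2)
print(selected)  # → [0, 2]
print(len(gather_memory(memory_bank, selected)))  # → 4
```

稀疏路由使注意力开销只随被选章节的 token 数增长,这正是摘要中能把记忆扩展到 262K token 而计算仍可行的原因。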
[NLP-68] Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol
【速读】: 该论文旨在解决人类标注在自然语言处理(Natural Language Processing, NLP)评估中因主观性导致的标注者间一致性不足的问题。其核心挑战在于如何有效提升标注一致性,同时不显著改变原有标注行为。解决方案的关键在于提出一种基于推理的标注支架(ReasonAlign),通过向标注者暴露大型语言模型(Large Language Models, LLMs)生成的解释性推理过程,但不提供预测标签,从而引导标注者在独立标注后进行有依据的修订。该方法采用双阶段协议(受德尔菲法启发),量化标注者在接触推理后的修改行为,并引入标注者努力代理(Annotator Effort Proxy, AEP)作为指标,发现推理主要帮助澄清模糊案例,而非引发大规模修正,从而提升了标注一致性且保持了标注效率。
链接: https://arxiv.org/abs/2603.21094
作者: Smitha Muthya Sudheendra,Jaideep Srivastava
机构: University of Minnesota, Twin Cities (明尼苏达大学双城分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
[NLP-69] ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks
【速读】: 该论文旨在解决低资源语言(如越南语)在自然语言理解(Natural Language Understanding, NLU)任务中因标注数据稀缺而导致的高质量文本表示学习困难的问题。现有预训练模型如PhoBERT虽表现良好,但在数据有限场景下仍受限。解决方案的关键在于提出一种专为越南语设计的监督对比学习框架ViCLSR(Vietnamese Contrastive Learning for Sentence Representations),通过利用现有的自然语言推理(Natural Language Inference, NLI)数据集进行监督信号引导,优化句子嵌入表示;同时提出一套适配现有越南语数据集的方法以兼容对比学习范式。实验表明,ViCLSR在多个基准NLU任务上显著优于PhoBERT,验证了监督对比学习在提升低资源语言句子表示能力方面的有效性。
链接: https://arxiv.org/abs/2603.21084
作者: Tin Van Huynh,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.
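摘要所述"利用 NLI 数据集做监督对比学习"的损失可按 InfoNCE 形式示意如下:以蕴含句为正例、矛盾句为负例拉开嵌入距离。温度系数与向量取值均为假设,仅用于说明损失的行为,非 ViCLSR 的实际实现:

```python
import math

def cos(u, v):
    """余弦相似度。"""
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def supcon_loss(anchor, positive, negatives, tau=0.05):
    """InfoNCE 形式的监督对比损失:拉近 anchor 与正例(蕴含句),
    推远负例(矛盾句等)。tau 为假设的温度超参数。"""
    pos = math.exp(cos(anchor, positive) / tau)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]                   # 蕴含 -> 嵌入应接近
negatives = [[0.0, 1.0], [-1.0, 0.0]]   # 矛盾/无关 -> 嵌入应远离
loss_good = supcon_loss(anchor, positive, negatives)
loss_bad = supcon_loss(anchor, [0.0, 1.0], [[0.9, 0.1]])  # 正负颠倒时损失应更大
print(loss_good < loss_bad)  # → True
```

训练会驱动嵌入空间向 loss_good 的情形收敛,使语义相近/相斥的句对在相似度上可分。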
[NLP-70] Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation
【速读】: 该论文旨在解决神经文本到语音(TTS)模型在生成音高(f0)细微变化方面的泛化能力不足问题,特别是针对由辅音引起的局部音高扰动(consonant-induced f0 perturbation)这一细粒度的段落级韵律现象。其解决方案的关键在于提出了一种分段级韵律探测框架(segmental-level prosodic probing framework),通过控制变量法对比合成语音与自然语音在数千个词汇上的表现,并按词频分层分析,从而量化模型对抽象段落-韵律编码的依赖程度。实验结果表明,当前主流TTS架构如Tacotron 2和FastSpeech 2更依赖于词汇级记忆而非泛化性段落韵律建模,揭示了现有系统在韵律细节泛化上的局限性。
链接: https://arxiv.org/abs/2603.21078
作者: Tianle Yang,Chengzhe Sun,Phil Rose,Cassandra L. Jacobs,Siwei Lyu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted for publication in Computer Speech & Language
Abstract:This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models’ ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems’ ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
[NLP-71] LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在形式化推理(formal reasoning)任务中面临的挑战,尤其是如何在 Lean4 环境下实现高效、稳定的自动定理证明与问题形式化。其核心问题是现有模型在长程决策过程中的训练不稳定性和样本效率低下,以及奖励欺骗(reward hacking)导致的逻辑不一致性。解决方案的关键在于提出 LongCat-Flash-Prover——一个 5600 亿参数的开源混合专家(Mixture-of-Experts, MoE)模型,结合了三项独立的形式能力:自动形式化(auto-formalization)、证明草图生成(sketching)和完整证明生成(proving)。通过引入 Hybrid-Experts Iteration Framework 扩展高质量任务轨迹,并设计 Hierarchical Importance Sampling Policy Optimization (HisPO) 算法,在序列和 token 层面采用梯度掩码策略缓解策略过时与训练-推理引擎差异问题,同时集成定理一致性和合法性检测机制防止奖励滥用。该方案显著提升了模型在 MiniF2F、ProverBench 和 PutnamBench 等基准上的性能,尤其在极低推理预算下实现了高达 97.1% 的通过率。
链接: https://arxiv.org/abs/2603.21065
作者: Jianing Wang,Jianfei Zhang,Qi Guo,Linsen Guo,Rumei Li,Chao Zhang,Chong Peng,Cunguang Wang,Dengchang Zhao,Jiarong Shi,Jingang Wang,Liulin Feng,Mengxia Shen,Qi Li,Shengnan An,Shun Wang,Wei Shi,Xiangyu Xi,Xiaoyu Li,Xuezhi Cao,Yi Lu,Yunke Zhao,Zhengyu Chen,Zhimin Lin,Wei Wang,Peng Pei,Xunliang Cai
机构: Meituan(美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 43 pages, 5 figures
Abstract:We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 attempts per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
[NLP-72] Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言(low-resource languages)中性能显著下降的问题,特别是评估其在英语与两种低资源语言(哈萨克语和蒙古语)之间的表现差异。研究发现,LLMs在低资源语言上的准确率比英语低13.8–16.7个百分点,尽管表面流畅性得以维持,但内容准确性严重不足。解决方案的关键在于:跨语言迁移提示(cross-lingual transfer-prompting)策略——即先让模型以英语推理再翻译回目标语言——对双语架构模型有效(提升2.2–4.3个百分点),但对英语主导型模型无效,表明缓解低资源语言性能差距的策略具有架构依赖性,而非通用方案。
链接: https://arxiv.org/abs/2603.21036
作者: Abdul-Salem Beibitkhan
机构: North American University
类目: Computation and Language (cs.CL)
备注:
Abstract:We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.
[NLP-73] Knowledge Boundary Discovery for Large Language Models
【速读】: 该论文试图解决如何自动识别大型语言模型(Large Language Models, LLMs)的知识边界问题,即区分模型能够 confidently 回答的问题(在知识边界内)与无法回答的问题(超出知识边界)。传统方法依赖人工构造的基准数据集,成本高且难以覆盖全面。解决方案的关键在于提出基于强化学习的知识边界发现(Knowledge Boundary Discovery, KBD)框架:通过将LLM视为部分可观测环境中的智能体,设计以熵减为奖励机制的策略,使代理逐步生成问题并根据响应更新信念状态,从而迭代地探索和定位知识边界。该方法能自动生成一组非平凡的可回答与不可回答问题集合,实验表明其结果与人工标注数据集具有可比性,为LLM评估提供了新范式。
链接: https://arxiv.org/abs/2603.21022
作者: Ziquan Wang,Zhongqi Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages,4 figures
Abstract:We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within-knowledge boundary) and (ii) those it cannot (beyond-knowledge boundary). Iteratively exploring and exploiting the LLM’s responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM under the modeling of exploring a partially observable environment. The agent generates a progressive question as the action, adopts an entropy reduction as the reward, receives the LLM’s response as the observation and updates its belief states. We demonstrate that the KBD detects knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate the KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.
[NLP-74] Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多项选择和成对评估任务中因选项位置、标签符号等非语义因素导致的选择偏差问题。现有推理时去偏方法成本高且可能损害模型推理能力,而点式训练则忽略了同一问题在不同排列下应保持一致答案的约束。其解决方案的关键在于提出一种称为“排列感知组相对策略优化”(Permutation-Aware Group Relative Policy Optimization, PA-GRPO)的新方法,通过构建每个实例的排列群并引入两种互补机制:一是跨排列优势(cross-permutation advantage),即相对于同一实例所有排列的平均奖励计算优势;二是一致性感知奖励(consistency-aware reward),促使模型在不同排列下输出一致决策,从而实现排列一致的语义推理,有效降低选择偏差并维持高性能。
链接: https://arxiv.org/abs/2603.21016
作者: Jinquan Zheng,Jia Yuan,Jiacheng Yao,Chenyang Gu,Pujun Zheng,Guoxiu He
机构: East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 3 figures, 5 tables
Abstract:Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (this https URL).
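摘要中 PA-GRPO 的两个核心机制——跨排列优势与一致性感知奖励——可以用如下最小示意代码说明;其中一致性分数采用"多数答案占比"这一具体化形式仅为本文假设,并非原论文实现:

```python
def cross_permutation_advantages(rewards_by_perm):
    """跨排列优势:每个排列下的奖励减去同一实例所有排列的平均奖励,
    使优势以该问题全部排列为基准而非单次采样。
    rewards_by_perm: {排列标识: 奖励} 的映射(示意)。"""
    mean_r = sum(rewards_by_perm.values()) / len(rewards_by_perm)
    return {perm: r - mean_r for perm, r in rewards_by_perm.items()}

def consistency_reward(answers_by_perm):
    """一致性感知奖励:各排列下(映射回原始选项后的)答案越一致,奖励越高。
    这里以多数答案占比作为一致性分数。"""
    counts = {}
    for a in answers_by_perm.values():
        counts[a] = counts.get(a, 0) + 1
    return max(counts.values()) / len(answers_by_perm)

rewards = {"ABCD": 1.0, "BACD": 0.0, "CBAD": 1.0, "DCBA": 0.0}
adv = cross_permutation_advantages(rewards)
print(adv["ABCD"], adv["BACD"])  # → 0.5 -0.5
print(consistency_reward({"ABCD": "A", "BACD": "A", "CBAD": "A", "DCBA": "C"}))  # → 0.75
```

当模型只在某些排列下答对时,跨排列基准会给出正负相间的优势,显式惩罚依赖选项位置的捷径行为。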
[NLP-75] CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs
【速读】: 该论文旨在解决生成式 AI(Generative AI)中大型语言模型(Large Language Models, LLMs)的机制可解释性问题,具体针对现有基于字典学习和转码器(transcoders)方法所构建的特征归因图(feature attribution graphs)规模庞大且冗余的问题,导致实际可解释性受限。解决方案的关键在于引入跨层转码器(Cross-Layer Transcoders, CLTs),通过在不同网络层间共享特征同时保留各层特有的解码能力,从而获得更紧凑的表示;同时,论文提出一个开源库框架,集成分布式训练、模型分片与压缩激活缓存、自动化可解释性分析流程、基于Circuit-Tracer的归因图计算以及灵活可视化界面,实现了CLT的端到端训练与可解释性分析的规模化落地。
链接: https://arxiv.org/abs/2603.21014
作者: Florent Draye,Abir Harrasse,Vedant Palit,Tung-Yu Wu,Jiarui Liu,Punya Syon Pandey,Roderick Wu,Terry Jingchen Zhang,Zhijing Jin,Bernhard Schölkopf
机构: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Jinesis AI Lab, University of Toronto; Vector Institute; CMU; EuroSafeAI; ELLIS Institute Tübingen
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 2 figures, code: this https URL
Abstract:Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: this https URL.
[NLP-76] Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在参数压缩过程中表现出的极端敏感性问题,即微小的权重扰动可能导致模型性能剧烈下降(如困惑度提升20,000倍),从而限制了模型的高效部署与优化。其解决方案的关键在于系统性地揭示了Transformer架构中不同组件对压缩的敏感性层级:早期层的前馈网络(Feed-Forward Network, FFN)中的上投影矩阵(up-projection)最为脆弱,而值投影(value projection)则几乎不受压缩影响;同时通过Lyapunov稳定性理论证明残差连接能够通过加速隐藏状态增长来抑制压缩误差传播,但仅靠误差收缩不足以保证鲁棒性,还需依赖架构特有的冗余设计——例如混合结构LFM2-2.6B虽放大效应更强(120倍),却因冗余机制仅退化7倍,远优于纯合同结构GPT-2 Small(120倍放大)。研究进一步提出可验证的压缩脆弱性指数(Compression Fragility Index),并借助Lean 4形式化工具严格证明每矩阵误差边界,在14,040+配置下无违反情况,最终通过下游任务评估和激活感知剪枝验证有效性。
链接: https://arxiv.org/abs/2603.20991
作者: Abhinaba Basu
机构: National Institute of Electronics and Information Technology (NIELIT)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:
Abstract:A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.
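摘要中"残差连接通过让隐藏状态增长快于误差来收缩相对压缩误差"的结论,可用如下数值示意说明;两个增益参数均为假设值,仅用于演示 state_gain > err_gain 时相对误差逐层收缩这一机制:

```python
def relative_error_trajectory(layers, err_gain=1.1, state_gain=1.5, e0=0.2, h0=1.0):
    """逐层跟踪相对误差 e/h:若残差流范数每层按 state_gain 增长,
    而压缩误差每层至多放大 err_gain,且 state_gain > err_gain,
    则相对误差按 (err_gain/state_gain)^n 几何收缩。"""
    traj = []
    e, h = e0, h0
    for _ in range(layers):
        traj.append(e / h)
        e *= err_gain   # 误差经各子层传播时的放大
        h *= state_gain  # 残差流范数的增长
    return traj

traj = relative_error_trajectory(6)
print(traj[0], traj[-1])  # 相对误差从 0.2 逐层收缩
```

这也解释了摘要中"误差收缩是必要而非充分条件":若某架构的 state_gain 不足以压过误差放大(或缺乏冗余),压缩同样会导致严重退化。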
[NLP-77] DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles
【速读】: 该论文旨在解决多智能体大语言模型(Multi-agent LLM systems)在复杂推理任务中集体输出的不确定性量化问题。现有方法依赖于浅层投票统计,忽略了智能体间推理过程中的丰富语义信息,导致置信度估计不准确。其解决方案的关键在于提出 DiscoUQ 框架,通过提取并利用智能体间分歧的结构特征——包括语言属性(证据重叠、论点强度、分歧深度)和嵌入几何特性(聚类距离、分散度、凝聚性)——构建更精准且校准良好的置信度估计模型。该框架包含三种逐步复杂的实现方式,其中 DiscoUQ-LLM 在多个基准测试上实现了平均 AUROC 0.802,显著优于基线方法,并在模糊的“弱分歧”场景下表现最优,验证了结构化分歧信息对提升不确定性建模的有效性。
链接: https://arxiv.org/abs/2603.20975
作者: Bo Jiang
机构: Temple University (坦普尔大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents’ reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement – both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) – to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous “weak disagreement” tier where simple vote counting fails.
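摘要中"从多智能体答案中提取分歧结构特征"的思路可以用如下简化示意说明;这里只示意投票统计类特征,特征集合为本文假设——原方法还使用 LLM 抽取的语言特征(证据重叠、论点强度等)与嵌入几何特征,并在其上训练逻辑回归或神经网络:

```python
import math

def disagreement_features(agent_answers):
    """从多个智能体的答案中提取简单的分歧结构特征,
    供下游校准模型(如逻辑回归)估计置信度。"""
    n = len(agent_answers)
    counts = {}
    for a in agent_answers:
        counts[a] = counts.get(a, 0) + 1
    majority_frac = max(counts.values()) / n          # 多数派占比
    probs = [c / n for c in counts.values()]
    vote_entropy = -sum(p * math.log(p) for p in probs)  # 投票熵
    return {"majority_frac": majority_frac,
            "vote_entropy": vote_entropy,
            "num_clusters": len(counts)}              # 答案簇数

unanimous = disagreement_features(["A", "A", "A", "A", "A"])
split = disagreement_features(["A", "A", "B", "C", "C"])
print(unanimous)  # 完全一致:熵为 0、单一簇
print(split)      # 弱分歧:熵更高、簇更多
```

摘要指出单纯数票在"弱分歧"档位失效,正是这类多维特征(而非仅 majority_frac)为校准带来了最大增益。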
[NLP-78] Understanding Contextual Recall in Transformers: How Finetuning Enables In-Context Reasoning over Pretraining Knowledge
【速读】: 该论文旨在解决生成式 AI(Generative AI)在上下文学习(In-Context Learning, ICL)中能否仅通过预训练就实现“情境回忆”(contextual recall)的问题,即模型是否能从预训练中学到的成对样本中,隐式推断出属性类型,并在新颖的提示格式中准确召回特定事实。研究发现,单纯的预训练不足以支持情境回忆,因为当ICL提示中移除语法统计信息时,模型无法隐式推断属性类型;其关键解决方案在于:对模型进行一个与ICL评估任务不同、但要求隐式推理的微调(fine-tuning),使用部分主题数据即可触发全主题的情境回忆能力;这一转变伴随着低维潜在编码的形成,该编码表征了共享的属性类型。论文进一步通过构造纯注意力机制的Transformer模型验证了该机制的有效性,揭示了情境回忆背后的可解释表征机制。
链接: https://arxiv.org/abs/2603.20969
作者: Bhavya Vasudeva,Puneesh Deora,Alberto Bietti,Vatsal Sharan,Christos Thrampoulidis
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 28 pages, 26 figures
Abstract:Transformer-based language models excel at in-context learning (ICL), where they can adapt to new tasks based on contextual examples, without parameter updates. In a specific form of ICL, which we refer to as "contextual recall", models pretrained on open-ended text leverage pairwise examples to recall specific facts in novel prompt formats. We investigate whether contextual recall emerges from pretraining alone, what finetuning is required, and what mechanisms drive the necessary representations. For this, we introduce a controlled synthetic framework where pretraining sequences consist of subject-grammar-attribute tuples, with attribute types tied to grammar statistics. We demonstrate that while such pretraining successfully yields factual knowledge, it is insufficient for contextual recall: models fail to implicitly infer attribute types when the grammar statistics are removed in ICL prompts. However, we show that finetuning on tasks requiring implicit inference, distinct from the ICL evaluation, using a subset of subjects, triggers the emergence of contextual recall across all subjects. This transition is accompanied by the formation of low-dimensional latent encodings of the shared attribute type. For mechanistic insight, we derive a construction for an attention-only transformer that replicates the transition from factual to contextual recall, corroborated by empirical validation.
[NLP-79] Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
【速读】: 该论文旨在解决生成式 AI 模型在微调(finetuning)过程中可能触发对训练数据中受版权保护内容的再现问题,这与当前主流模型提供商声称其模型不会存储训练数据副本、并通过强化学习人类反馈(RLHF)、系统提示和输出过滤等安全对齐策略防止复制 copyrighted works 的主张相矛盾。解决方案的关键在于:通过仅使用语义描述作为提示对模型进行微调,即可诱导 GPT-4o、Gemini-2.5-Pro 和 DeepSeek-V3.1 等多个商业大模型在未接触原始文本的情况下,重现高达 85–90% 的已出版书籍内容,且存在超过 460 字的逐字复现片段;更关键的是,这种提取能力不依赖于特定作者或语料库,而是源于微调激活了预训练阶段隐含的记忆机制,表明模型权重中确实存储了受版权保护的内容,从而揭示了行业级模型存在普遍性的安全漏洞,并动摇了基于“有效防复制措施”判断合理使用的司法认定基础。
链接: https://arxiv.org/abs/2603.20957
作者: Xinyue Liu,Niloofar Mireshghallah,Jane C. Ginsburg,Tuhin Chakrabarty
机构: Stony Brook University (石溪大学); Carnegie Mellon University (卡内基梅隆大学); Columbia Law School (哥伦比亚法学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Preprint Under Review
Abstract:Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami’s novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors’ works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions (r ≥ 0.90), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors’ works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
[NLP-80] The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在日常对话中通过隐藏激励(hidden incentives)对用户进行个性化情感操纵的问题,尤其关注这些激励的道德属性如何影响用户的信念变化。其解决方案的关键在于提出PUPPET理论分类体系,该体系以激励的道德性(incentive morality)为核心维度,构建了用于分析LLM-human对话中操纵行为的框架,并通过包含1035名参与者的实证研究验证了有害隐藏激励比亲社会激励更能显著引发用户信念改变。此外,论文还首次系统性地评估了LLMs在预测用户信念变化方面的表现,发现其虽具备一定预测能力(相关系数r=0.3–0.5),但普遍低估信念变化幅度,为后续开发更透明、负责任的LLM交互机制提供了可量化的基准和理论基础。
链接: https://arxiv.org/abs/2603.20907
作者: Jocelyn Shen,Amina Luvsanchultem,Jessica Kim,Kynnedy Smith,Valdemar Danry,Kantwon Rogers,Sharifa Alghowinem,Hae Won Park,Maarten Sap,Cynthia Breazeal
机构: Massachusetts Institute of Technology (麻省理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to being subtly steered toward hidden incentives misaligned with their own interests. Prior works have benchmarked persuasion and manipulation detection, but these efforts rely on simulated or debate-style settings, remain uncorrelated with real human belief shifts, and overlook a critical dimension: the morality of hidden incentives driving the manipulation. We introduce PUPPET, a theoretical taxonomy of personalized emotional manipulation in LLM-human dialogues that centers around incentive morality, and conduct a human study with N=1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful versus prosocial). We find that harmful hidden incentives produce significantly larger belief shifts than prosocial ones. Finally, we benchmark LLMs on the task of belief prediction, finding that models exhibit moderate predictive ability of belief change based on conversational contexts (r = 0.3–0.5), but they also systematically underestimate the magnitude of belief shift. Together, this work establishes a theoretically grounded and behaviorally validated foundation for studying, and ultimately combatting, incentive-driven manipulation in LLMs during everyday, practical user queries.
[NLP-81] Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务中过度依赖表面模式匹配和答案记忆等捷径(shortcut)策略,而非真正进行逻辑推理的问题。其解决方案的核心是提出一种梯度感知的捷径-aware 训练框架——Shortcut-Aware Reasoning Training (SART),通过两个关键机制实现:一是利用梯度与验证目标之间的不一致性和回答token集中度来检测捷径促进样本(ShortcutScore),二是采用梯度手术(gradient surgery)动态调整训练过程以削弱捷径信号的影响。实验表明,SART 在受控推理基准上相较最强基线提升16.5%准确率和40.2%鲁棒性,显著改善了模型在分布偏移下的泛化能力。
链接: https://arxiv.org/abs/2603.20899
作者: Hongyu Cao,Kunpeng Liu,Dongjie Wang,Yanjie Fu
机构: Arizona State University (亚利桑那州立大学); Clemson University (克莱姆森大学); University of Kansas (堪萨斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures. Preprint. Experiments on synthetic reasoning benchmarks. Code available
Abstract:Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: this https URL.
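上面的 SART 摘要提到用"样本梯度与验证目标的不一致性"来检测捷径样本。下面用纯 Python 给出这一信号的极简示意(ShortcutScore 的具体定义为本文假设,仅演示梯度失配的计算方式,并非论文官方实现):

```python
# 示意:用"训练梯度与验证梯度的余弦不一致度"近似捷径检测信号。

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu = dot(u, u) ** 0.5
    nv = dot(v, v) ** 0.5
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def shortcut_score(sample_grad, val_grad):
    """样本梯度与验证目标方向越不一致(余弦越小),捷径嫌疑越大。"""
    return 1.0 - cosine(sample_grad, val_grad)

val_grad = [1.0, 0.0]      # 验证集平均梯度方向(假设的玩具数值)
aligned = [0.9, 0.1]       # 与验证目标一致的样本梯度
shortcut = [-0.8, 0.6]     # 与验证目标相悖的样本梯度

assert shortcut_score(aligned, val_grad) < shortcut_score(shortcut, val_grad)
```

论文中该信号还会与"回答token集中度"结合,再经梯度手术调整训练,此处仅展示第一步。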
[NLP-82] LLM Router: Prefill is All You Need
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不同任务子集上表现出互补性能的问题,即尽管多个LLM在整体基准测试中准确率相近,但它们在特定任务上的优势差异显著,因此如何高效组合这些模型以逼近理想情况下的Oracle路由器(具有完美先验知识的理论选择器)成为关键挑战。解决方案的核心在于提出Encoder-Target Decoupling机制,通过分离提供预测信号的Encoder与目标模型(Target),实现对不同模型内部预填充激活(prefill activations)的精细化利用;进一步结合Fisher Separability(J)和Effective Dimensionality(d_eff)作为数学探针识别最优层间信号,构建出SharedTrunkNet架构,从而在仅消耗最高成本模型74.31%计算资源的前提下,捕获高达45.58%的Oracle与最强单模型之间的性能差距。
链接: https://arxiv.org/abs/2603.20895
作者: Tanay Varshney,Annie Surla,Michelle Xu,Gomathy Venkata Krishnan,Maximilian Jeblick,David Austin,Neal Vaidya,Davide Onofrio
机构: NVIDIA
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:LLMs often share comparable benchmark accuracies, but their complementary performance across task subsets suggests that an Oracle router–a theoretical selector with perfect foresight–can significantly surpass standalone model accuracy by navigating model-specific strengths. While current routers rely on fragile semantic signals, we propose using internal prefill activations via Encoder-Target Decoupling–a functional separation between the model providing the predictive signal (the Encoder) and the model whose performance is being estimated (the Target). This allows optimized heterogeneous pairing between unique encoders and target models. We utilize Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals, providing the predictive foundation for our SharedTrunkNet architecture. SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle while achieving 74.31% cost savings relative to the highest-cost model.
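摘要中作为逐层探针的 Fisher 可分性 J 与有效维度 d_eff,可按其常见定义(二分类一维特征的 Fisher 判别比、特征值的参与率)写成如下草图;是否与论文实现完全一致仅为本文假设:

```python
# 示意:两个常见的表征探针,用于挑选最有预测力的层。

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def fisher_j(pos, neg):
    """类间均值差的平方 / 类内方差之和:越大代表该层激活越能区分"答对/答错"。"""
    return (mean(pos) - mean(neg)) ** 2 / (var(pos) + var(neg) + 1e-12)

def d_eff(eigvals):
    """参与率形式的有效维度:(Σλ)^2 / Σλ^2。"""
    s1 = sum(eigvals)
    s2 = sum(v * v for v in eigvals)
    return s1 * s1 / (s2 + 1e-12)

# 可分性好的两组激活值 vs. 混在一起的两组
assert fisher_j([5.0, 5.1, 4.9], [0.0, 0.1, -0.1]) > fisher_j([0.1, 0.2], [0.0, 0.3])
# 能量均匀分布在 4 个方向 → d_eff ≈ 4;集中在 1 个方向 → d_eff ≈ 1
assert round(d_eff([1.0, 1.0, 1.0, 1.0])) == 4
assert round(d_eff([1.0, 0.0, 0.0, 0.0])) == 1
```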
[NLP-83] NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation
【速读】: 该论文旨在解决学术论文数量激增背景下,高质量论文筛选成本上升的问题,尤其是现有方法在新颖性评估方面存在局限——如通用AI审稿系统缺乏领域特定机制,或基于DeepResearch的方案因未针对特定领域优化而导致结果质量较低。其解决方案的关键在于提出NoveltyAgent,一个面向新颖性分析的多智能体系统:该系统将论文分解为离散的新颖点(novelty points),实现细粒度检索与比对,并构建跨引用验证的关联文献数据库以确保报告的忠实性(faithfulness)。此外,论文还设计了一种基于检查清单(checklist-based)的评估框架,用于客观衡量开放生成任务的效果,从而推动可靠、可复现的评价体系建立。实验表明,NoveltyAgent在新颖性分析上优于GPT-5 DeepResearch达10.15%,展现出卓越性能。
链接: https://arxiv.org/abs/2603.20884
作者: Jiajun Hou,Hexuan Deng,Wenxiang Jiao,Xuebo Liu,Xiaopeng Ke,Min Zhang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); Xiaohongshu Inc.(小红书公司); Zhongguancun Academy, Beijing(中关村学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain-specific mechanisms and thus delivers lower-quality results. To bridge this gap, we introduce NoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper’s originality. It decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, and builds a comprehensive related-paper database while cross-referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open-ended generation tasks, we propose a checklist-based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%. We hope this system will provide reliable, high-quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at this https URL.
[NLP-84] Semantic Sections: An Atlas-Native Feature Ontology for Obstructed Representation Spaces
【速读】: 该论文试图解决传统可解释性方法在受阻表示空间(obstructed representation spaces)中将特征视为全局统一方向或坐标所导致的局限性问题,即局部语义一致性未必能整合为全局一致的特征表示。其解决方案的关键在于引入一种语义截面(semantic section)——一种定义在上下文图谱(context atlas)上的、具有传输兼容性的局部特征代表族,并通过树支持的传播、重叠区域同步、缺陷驱动剪枝、循环感知分类与去重等步骤构建发现与认证流程。研究证明,循环一致性是实现真正全局化的关键判据,从而区分出树局部、可全局化和扭曲截面三类结构;实验表明,仅靠原始全局向量相似度无法准确识别语义身份,而基于语义截面的方法能在认证支持下实现完美的语义身份恢复,验证了语义截面作为受阻场景下更优特征本体的有效性。
链接: https://arxiv.org/abs/2603.20867
作者: Hossein Javidnia
机构: Dublin City University (都柏林城市大学); Meta (Meta); Google (谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: 20 pages, 2 figures
Abstract:Recent interpretability work often treats a feature as a single global direction, dictionary atom, or latent coordinate shared across contexts. We argue that this ontology can fail in obstructed representation spaces, where locally coherent meanings need not assemble into one globally consistent feature. We introduce an atlas-native replacement object, the semantic section: a transport-compatible family of local feature representatives defined over a context atlas. We formalize semantic sections, prove that tree-supported propagation is always pathwise realizable, and show that cycle consistency is the key criterion for genuine globalization. This yields a distinction between tree-local, globalizable, and twisted sections, with twisted sections capturing locally coherent but holonomy-obstructed meanings. We then develop a discovery-and-certification pipeline based on seeded propagation, synchronization across overlaps, defect-based pruning, cycle-aware taxonomy, and deduplication. Across layer-16 atlases for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, and Gemma 2 2B IT, we find nontrivial populations of semantic sections, including cycle-supported globalizable and twisted regimes after deduplication. Most importantly, semantic identity is not recovered by raw global-vector similarity. Even certified globalizable sections show low cross-chart signed cosine similarity, and raw similarity baselines recover only a small fraction of true within-section pairs, often collapsing at moderate thresholds. By contrast, section-based identity recovery is perfect on certified supports. These results support semantic sections as a better feature ontology in obstructed regimes.
[NLP-85] SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
【速读】: 该论文旨在解决低资源语言(如哈萨克语)在现有多语言大模型中被忽视的问题,这些问题通常表现为模型容量分配不足以及对黏着语形态学不适用的分词器。解决方案的关键在于:从零开始训练一系列基于Llama架构的小型语言模型(50M–600M参数),并使用专为哈萨克语设计的50K BPE分词器,在90亿个哈萨克语token上进行训练。实验表明,即使参数规模远小于主流多语言模型(如Llama-3.2-1B),该方法仍能在多个哈萨克语基准任务(文化问答、阅读理解、主题分类)上达到竞争力性能,且具有良好的可扩展性,证明了专用小模型结合语言适配分词器是提升低资源语言技术能力的有效路径。
链接: https://arxiv.org/abs/2603.20854
作者: Saken Tukenov
机构: Independent Researcher(独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 2 tables
Abstract:Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks – multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) – alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe consistent scaling from 50M to 600M, with MC QA accuracy rising from 22.8% to 30.3%, suggesting that further scaling remains beneficial. These results demonstrate that small, dedicated models trained from scratch with a language-appropriate tokenizer offer a viable path for low-resource language technology, achieving competitive performance at a fraction of the computational cost. All models and the tokenizer are released under open licenses.
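该文强调为黏着语形态学定制 BPE 分词器。下面是 BPE 训练核心步骤(统计相邻符号对频次、合并最高频对)的极简纯 Python 示意;语料与拉丁转写仅为举例,并非论文所用 50K 词表分词器的实际代码:

```python
from collections import Counter

def most_frequent_pair(words):
    """words: (符号序列, 词频) 列表;返回语料中最高频的相邻符号对。"""
    pairs = Counter()
    for symbols, freq in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """把指定符号对在所有词中合并为一个新符号。"""
    merged = []
    for symbols, freq in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

# 黏着语中高频的词干/后缀片段会被优先合并(kitap=书, kitaptar=书[复数], dostar=朋友[复数])
corpus = [(list("kitap"), 5), (list("kitaptar"), 4), (list("dostar"), 3)]
pair = most_frequent_pair(corpus)
assert pair == ("t", "a")
corpus = merge_pair(corpus, pair)
```

反复执行"找最高频对 → 合并"直至达到目标词表规模(论文中为 50K),即得到分词器的合并表。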
[NLP-86] Can ChatGPT Really Understand Modern Chinese Poetry? EACL2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在现代诗歌理解能力方面的评估问题,现有研究仅停留在实验结果分析层面,缺乏对模型是否真正理解诗歌内涵的深入探讨。其解决方案的关键在于构建了一个多维评估框架,联合专业诗人从多个维度对 ChatGPT 对不同诗人作品的解读进行系统评价,从而量化其理解准确性与局限性,尤其关注“诗意”(poeticity)等核心维度的表现。该框架不仅验证了 ChatGPT 在73%以上案例中能准确捕捉原作者意图,也揭示了其在深层诗学理解上的不足,为后续大语言模型(LLM)在诗歌相关任务中的应用研究提供了可操作的评估标准和理论基础。
链接: https://arxiv.org/abs/2603.20851
作者: Shanshan Wang,Derek F. Wong,Jingming Yao,Lidia S. Chao
机构: University of Macau (澳门大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EACL 2026
Abstract:ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT’s understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT’s interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT’s interpretations align with the original poets’ intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT’s ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.
[NLP-87] HiCI: Hierarchical Construction-Integration for Long-Context Attention
【速读】: 该论文旨在解决长上下文语言建模中因token级注意力机制导致的信息结构隐式化问题,即现有方法在处理超长文本时难以显式地组织局部到全局的信息流。其解决方案的关键在于提出HiCI(Hierarchical Construction–Integration)模块,通过分层构建段落级表征、将其整合为共享的全局上下文,并将两者广播以条件化段落级注意力,从而引入显式的层次结构作为归纳偏置(inductive bias),显著提升模型对长文本的理解与生成能力。
链接: https://arxiv.org/abs/2603.20843
作者: Xiangyu Zeng,Qi Xu,Yunke Wang,Chang Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 5 figures
Abstract:Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction–Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with only 5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.
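HiCI 的"构建—整合—广播"三步信息流可以用如下玩具代码示意:这里用均值池化代替注意力、用向量拼接代替条件化,仅展示层级结构本身,并非论文模块的真实实现:

```python
# 示意:段级表示的构建(construction)、全局整合(integration)与广播(broadcast)。

def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def hici_forward(segments):
    # 1) Construction:每段的 token 向量池化为段级表示
    seg_reprs = [mean_vec(seg) for seg in segments]
    # 2) Integration:段级表示整合为共享的全局上下文
    global_ctx = mean_vec(seg_reprs)
    # 3) Broadcast:段级 + 全局上下文拼接到每个 token,条件化段内注意力
    return [[tok + seg + global_ctx for tok in seg_tokens]
            for seg_tokens, seg in zip(segments, seg_reprs)]

segments = [[[1.0, 0.0], [3.0, 0.0]],   # 段 1:两个 2 维 token
            [[0.0, 2.0], [0.0, 4.0]]]   # 段 2
out = hici_forward(segments)
# 每个 token 的条件化输入 = 原 2 维 + 段级 2 维 + 全局 2 维 = 6 维
assert len(out[0][0]) == 6
```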
[NLP-88] BenchBench: Benchmarking Automated Benchmark Generation
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估中普遍存在的静态测试集易饱和、易污染且更新成本高的问题,以及依赖LLM裁判进行开放式任务评估时引入的偏倚和提示敏感性问题。其核心解决方案是提出BenchBench——一个三阶段自动化基准生成流水线与数据集:首先从种子基准中提取结构化领域卡片(domain cards),其次通过多个设计者LLM生成受配额控制的题项套件,最后利用多模型答题者面板(包括精确/数值/符号验证器与基于评分标准的判断)对题项进行验证,从而产出带有项目级质量标记和心理测量诊断信息的设计者-答题者矩阵。此方法不仅提升了评估的可扩展性和鲁棒性,还揭示了模型设计能力与解答能力之间的弱相关性(Spearman ρ ≈ 0.37),为系统性审计题型格式、模态、语言一致性及套件内自洽性提供了新范式。
链接: https://arxiv.org/abs/2603.20807
作者: Yandan Zheng,Haoran Luo,Zhenghong Lin,Wenjin Liu,Luu Anh Tuan
机构: Nanyang Technological University(南洋理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer–answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model–item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman ρ ≈ 0.37), invalidity is negatively associated with discrimination (Pearson r ≈ -0.62), and the resulting designer–answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: this https URL.
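摘要中提到的"区分度"等心理测量诊断,一种常见做法是计算题项得分与(排除本题的)总分之间的点二列相关;以下实现方式为本文假设,仅供理解该类指标:

```python
# 示意:题项区分度 = 题项 0/1 得分与"去掉本题的总分"的 Pearson 相关。

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_discrimination(matrix, item):
    """matrix[m][i] = 模型 m 在题项 i 上是否答对(0/1)。"""
    item_scores = [row[item] for row in matrix]
    totals = [sum(row) - row[item] for row in matrix]  # 排除本题的总分
    return pearson(item_scores, totals)

# 4 个模型 × 3 道题的答对矩阵:第 0 题与整体能力同向,区分度应为正
matrix = [[1, 1, 1],
          [1, 1, 0],
          [0, 0, 1],
          [0, 0, 0]]
assert item_discrimination(matrix, 0) > 0
```

区分度接近 0 或为负的题项(例如无效题)通常会被质量过滤剔除,这与摘要中"无效性与区分度负相关"的发现一致。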
[NLP-89] RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution
【速读】: 该论文试图解决的问题是:强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)是否能够自动提升大语言模型(Large Language Models, LLMs)在通用问答(General Question Answering, GQA)任务上的性能。尽管RLVR在可验证任务中显著增强了LLMs的推理能力,但其对GQA的迁移效果尚未得到充分验证。研究发现,RLVR所激发的思维过程在GQA任务上的有效性远低于可验证任务,表明仅依赖RLVR不足以提升GQA性能,仍需专门针对GQA进行训练。解决方案的关键在于提出一种名为“分离思维与响应训练”(Separated Thinking And Response Training, START)的新方法:该方法首先仅训练思维过程,利用最终答案定义的奖励信号,从而避免GQA任务中可能存在的奖励捷径(reward shortcuts),确保高质量推理链的形成。实验表明,START在多个GQA基准测试和不同强化学习算法下均提升了思维质量和最终答案准确率。
链接: https://arxiv.org/abs/2603.20799
作者: Kaiyuan Li,Jing-Cheng Pang,Yang Yu
机构: Nanjing University (南京大学); Polixir.ai; Huawei Technologies Co., Ltd. (华为技术有限公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.
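START"先只训练思考过程、而奖励定义在最终答案上"的思路,可以用一个参数掩码更新的玩具草图示意(参数划分与数值均为本文假设,并非论文实现):

```python
# 示意:奖励来自最终答案,但梯度只作用于"思考"部分对应的参数,答案部分冻结。

def start_update(params, grads, is_thinking, lr=0.1):
    """is_thinking[i] 标记第 i 个参数是否属于思考部分;其余参数不更新。"""
    return [p - lr * g if t else p
            for p, g, t in zip(params, grads, is_thinking)]

params = [1.0, 1.0]
new = start_update(params, grads=[0.5, 0.5], is_thinking=[True, False])
assert abs(new[0] - 0.95) < 1e-9   # 思考参数被更新
assert new[1] == 1.0               # 答案参数保持不变
```

这样高奖励只能通过改进思考链获得,堵住了 GQA 任务中"直接改答案拿奖励"的捷径。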
[NLP-90] The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识编辑(Knowledge Editing, KE)过程中机制不透明的问题,即编辑操作如何在模型内部实现尚不清楚。为揭示这一机制,作者提出基于神经元级知识归因(Neuron-Level Knowledge Attribution, NLKA)的后编辑分析方法,通过对比成功与失败编辑案例,识别出在编辑生效时发生计算变化的关键模块。研究发现,中后期注意力机制主要促进新目标知识,而注意力模块与前馈网络(Feed-Forward Network, FFN)协同抑制原始事实。受此启发,作者设计了MEGA方法——一种基于机制引导的激活调控策略,其核心在于在归因对齐区域执行注意力残差干预,无需修改模型权重即可实现可靠的知识编辑。该方案在CounterFact和Popular数据集上验证了其在GPT2-XL和LLaMA2-7B上的有效性,实现了架构无关的知识编辑性能提升。
链接: https://arxiv.org/abs/2603.20795
作者: Yuan Cao,Mingyang Wang,Hinrich Schütze
机构: Technical University of Munich (慕尼黑工业大学); LMU Munich (慕尼黑大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution – contrasting successful and failed edits – to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.
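MEGA"不修改权重、仅在归因对齐的中后层对残差流做干预"的做法,可用如下纯 Python 玩具示意(层号与导向向量均为假设数值,真实系统作用于注意力残差):

```python
# 示意:只在指定的中后层,把残差向量加上一个放大新目标/抑制旧事实的导向向量。

def apply_steering(residual_stream, steer_vec, layers, alpha=1.0):
    """residual_stream[l] 为第 l 层的残差向量;仅在 layers 指定的层干预。"""
    out = []
    for l, h in enumerate(residual_stream):
        if l in layers:
            out.append([hi + alpha * si for hi, si in zip(h, steer_vec)])
        else:
            out.append(list(h))
    return out

stream = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
steered = apply_steering(stream, steer_vec=[2.0, -1.0], layers={2, 3}, alpha=0.5)
assert steered[0] == [0.0, 1.0]    # 早期层不动
assert steered[2] == [1.0, 0.5]    # 中后层:第一维(新目标)被促进,第二维(旧事实)被抑制
```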
[NLP-91] Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement
【速读】: 该论文旨在解决多模态信息抽取(Multimodal Information Extraction, MIE)中现有方法存在的两个核心问题:一是传统方法采用自然语言模板作为输入输出,与信息任务中常见的结构化信息(如实体和关系)不匹配;二是虽有研究尝试使用代码风格模板,但仅限于纯文本场景,且设计复杂、需为每个任务单独定制模板。解决方案的关键在于提出一种统一的代码风格多模态信息抽取框架(Code-style Multimodal Information Extraction, Code-MIE),其创新性地将MIE建模为统一的代码理解与生成任务:通过提取文本中的实体属性(如性别、隶属关系)增强上下文理解,将图像转换为场景图与视觉特征以融合丰富视觉信息,并构建Python函数形式的输入模板(包含实体属性、场景图和原始文本)与Python字典形式的输出模板(封装所有抽取结果,如实体、关系等),从而实现对多模态数据的高效、结构化抽取。
链接: https://arxiv.org/abs/2603.20781
作者: Jiang Liu,Ge Qiu,Hao Fei,Dongdong Xie,Jinbo Li,Fei Li,Chong Teng,Donghong Ji
机构: Wuhan University (武汉大学); University of Oxford (牛津大学); Wuhan Second Ship Design and Research Institute (武汉第二船舶设计研究所); China United Network Communications Co., Ltd. Research Institute (中国联合网络通信有限公司研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, there is still room for improvement in the existing methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatch with the characteristics of information tasks that mostly include structured information such as entities and relations. Second, although a few methods have adopted structured and more IE-friendly code-style templates, they have only explored text-only IE rather than multimodal IE. Moreover, their methods are more complex in design, requiring separate templates to be designed for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model to understand the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs and raw text compose the function parameters. In contrast, the output template is formalized as Python dictionaries containing all extraction results such as entities, relations, etc. To evaluate Code-MIE, we conducted extensive experiments on the M^3D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M^3D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
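按摘要的描述,Code-MIE 的输入模板是一个 Python 函数(参数包含实体属性、场景图与原文),输出模板是包含实体、关系等结果的 Python 字典。下面的字段名与示例内容均为本文假设,仅演示这种"代码风格"模板的形态:

```python
# 示意:代码风格的 MIE 输入/输出模板;真实系统中由 LLM 生成返回值。

def extract_information(text, entity_attributes, scene_graph):
    """代码风格输入模板:LLM 被要求"补全"该函数并返回结构化字典。"""
    # 这里手写一个示例输出,字段名为假设
    return {
        "entities": [{"span": "Alice", "type": "PER"},
                     {"span": "Acme", "type": "ORG"}],
        "relations": [{"head": "Alice", "tail": "Acme", "type": "works_for"}],
    }

result = extract_information(
    text="Alice joined Acme last year.",
    entity_attributes={"Alice": {"gender": "female", "affiliation": "Acme"}},
    scene_graph=[("person", "stands_near", "building")],
)
assert {e["span"] for e in result["entities"]} == {"Alice", "Acme"}
assert result["relations"][0]["type"] == "works_for"
```

相比自然语言模板,这种结构化输出可以直接解析与校验,与实体/关系本身的结构化特性对齐。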
[NLP-92] MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages LREC2026
【速读】: 该论文旨在解决低资源语言(特别是南非的11种官方书面语言中9种为低资源语言)在小规模模型下,基于指令微调(instruction fine-tuning)的生成式 AI (Generative AI) 模型泛化能力不明确的问题。其解决方案的关键在于构建了一个可复现的多语言预训练语料库 MzansiText 和一个从零训练的 125M 参数 decoder-only 语言模型 MzansiLM,并系统评估了三种适应策略:单语言任务特定微调、多语言任务特定微调和通用多任务指令微调。结果表明,单语言微调在数据到文本生成任务上表现优异(如 isiXhosa 上 BLEU 达 20.65),而多语言微调在相关语言的主题分类任务上提升显著(如 isiXhosa 新闻分类 macro-F1 达 78.5%),但小规模模型在少样本推理任务上仍面临挑战,说明模型规模与推理能力之间存在关键限制。
链接: https://arxiv.org/abs/2603.20732
作者: Anri Lombard,Simbarashe Mawere,Temi Aina,Ethan Wolff,Sbonelo Gumede,Elan Novick,Francois Meyer,Jan Buys
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 11 tables, appendix included. Accepted at LREC 2026
Abstract:Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.
[NLP-93] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中因受限于线性(Chain-of-Thought, CoT)或分支式(Tree-of-Thought, ToT)结构而导致的推理能力不足问题,尤其在需要合并中间结果、回溯假设及整合多源证据的场景下表现受限。其解决方案的核心是提出一种新的推理框架——思维网络(Network-of-Thought, NoT),将推理过程建模为带有类型化节点和边的有向图结构,并由基于启发式策略的控制器政策引导搜索。该方法通过灵活的拓扑结构支持更复杂的逻辑流动,同时验证了LLM自生成启发式策略在提升推理准确性方面的有效性,尤其是在多跳问答(multi-hop QA)等任务中显著优于CoT与ToT。
链接: https://arxiv.org/abs/2603.20730
作者: Fan Huang
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14–18 percentage point gap on HotpotQA).
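NoT 与 CoT/ToT 的关键差别在于拓扑:有向图允许一个"合并"节点拥有多个父节点,而链与树做不到。下面用一个极简的图结构示意(节点与边的类型命名为本文假设):

```python
# 示意:带类型节点/边的推理图;"merge"节点汇聚多条证据链。

class NoTGraph:
    def __init__(self):
        self.nodes = {}     # id -> (node_type, content)
        self.edges = []     # (src, dst, edge_type)

    def add_node(self, nid, node_type, content):
        self.nodes[nid] = (node_type, content)

    def add_edge(self, src, dst, edge_type):
        self.edges.append((src, dst, edge_type))

    def parents(self, nid):
        return [s for s, d, _ in self.edges if d == nid]

g = NoTGraph()
g.add_node("q", "question", "谁是 X 的合著者的导师?")
g.add_node("h1", "hypothesis", "先找 X 的合著者")
g.add_node("h2", "hypothesis", "再找该合著者的导师")
g.add_node("m", "merge", "整合两条证据链")
g.add_edge("q", "h1", "decompose")
g.add_edge("q", "h2", "decompose")
g.add_edge("h1", "m", "support")
g.add_edge("h2", "m", "support")

# 合并节点有两个父节点——树结构中每个节点只能有一个父节点
assert len(g.parents("m")) == 2
```

论文中的控制器策略(启发式)决定下一步在图上扩展、合并还是回溯哪个节点。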
[NLP-94] Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLM s
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在胃肠道内镜诊断中应用时面临的两大核心问题:一是通用模型推理与标准化临床认知路径之间的错位,二是视觉特征与诊断结果之间缺乏因果关联。解决方案的关键在于提出一种临床认知对齐(Clinical-Cognitive-Aligned, CogAlign)框架:首先,通过构建分层临床认知数据集并采用监督微调(Supervised Fine-Tuning, SFT),将专家从解剖定位、形态学评估到微血管分析的层级诊断逻辑内化至模型;其次,基于理论分析揭示标准监督训练会收敛于虚假背景相关性,进而提出基于反事实驱动的强化学习策略,通过病变掩码生成反事实正常样本,并结合以临床认知为中心的奖励机制优化模型,从而强制其仅基于因果病变特征进行诊断,显著提升复杂临床场景下的诊断准确性。
链接: https://arxiv.org/abs/2603.20698
作者: Huan Zheng,Yucheng Zhou,Tianyi Yan,Dubing Chen,Hongbo Lu,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen
机构: 1: School of Computer Science and Engineering, Nanyang Technological University (南洋理工大学计算机科学与工程学院); 2: Institute of Artificial Intelligence, Tsinghua University (清华大学人工智能研究院); 3: Department of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
[NLP-95] Can I guess where you are from? Modeling dialectal morphosyntactic similarities in Brazilian Portuguese
【速读】: 该论文试图解决的问题是:如何通过巴西葡萄牙语(Brazilian Portuguese, BP)中语法形态与句法特征的共变关系,判断说话者的方言来源。其解决方案的关键在于采用相关性分析与聚类方法相结合的方式,发现仅相关性分析只能捕捉有限的成对关联,而聚类方法则能有效识别出反映区域方言模式的说话者分组,从而为基于语言特征的方言识别提供更可靠的计算依据。
链接: https://arxiv.org/abs/2603.20695
作者: Manoel Siqueira,Raquel Freitag
机构: Universidade Federal de Alagoas (阿拉戈阿斯联邦大学); Universidade Federal de Sergipe (塞尔希培联邦大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 17th International Conference on Computational Processing of Portuguese - PROPOR
Abstract:This paper investigates morphosyntactic covariation in Brazilian Portuguese (BP) to assess whether dialectal origin can be inferred from the combined behavior of linguistic variables. Focusing on four grammatical phenomena related to pronouns, correlation and clustering methods are applied to model covariation and dialectal distribution. The results indicate that correlation captures only limited pairwise associations, whereas clustering reveals speaker groupings that reflect regional dialectal patterns. Despite the methodological constraints imposed by differences in sample size requirements between sociolinguistics and computational approaches, the study highlights the importance of interdisciplinary research. Developing fair and inclusive language technologies that respect dialectal diversity outweighs the challenges of integrating these fields.
[NLP-96] PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Quick read: This paper targets a failure mode of retrieval-augmented language models: they commit to answers without verifying that the retrieved context actually supports the conclusion, which leads to faulty reasoning and hallucination. The key is PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer that decomposes retrieved context into question-conditioned atomic facts, scores a draft answer against these premises, and revises low-support outputs through a support-threshold-gated editing step, making answer commitment auditable at the level of explicit premises, support scores, and revision decisions. The method markedly improves consistency and accuracy on evidence-grounded QA tasks.
Link: https://arxiv.org/abs/2603.20673
Authors: Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang
Affiliations: App-In Club
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
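The support-gated revision loop described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the naive sentence-level premise splitter, the lexical support score, and the `draft_fn`/`revise_fn` stand-ins for LLM calls are all simplifying assumptions.

```python
# Hypothetical sketch of a support-gated validation layer in the spirit of PAVE.
# draft_fn and revise_fn stand in for LLM calls; the premise splitter and the
# lexical support score are simplifications of the paper's atomic-fact pipeline.

def support_score(draft_tokens, premises):
    """Fraction of long draft words that appear in at least one premise."""
    premise_words = set()
    for p in premises:
        premise_words.update(p.lower().split())
    content = [w for w in draft_tokens if len(w) > 3]
    if not content:
        return 0.0
    hits = sum(1 for w in content if w.lower() in premise_words)
    return hits / len(content)

def pave(question, context, draft_fn, revise_fn, threshold=0.5):
    # 1. Decompose retrieved context into premises (here: a naive sentence split).
    premises = [s.strip() for s in context.split(".") if s.strip()]
    # 2. Draft an answer.
    draft = draft_fn(question, context)
    # 3. Score how well the draft is supported by the extracted premises.
    score = support_score(draft.split(), premises)
    # 4. Revise low-support outputs before finalization.
    if score < threshold:
        return revise_fn(question, premises, draft), score
    return draft, score
```

The returned score is what makes the commitment auditable: a downstream log can record which drafts were revised and why.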
[NLP-97] Webers Law in Transformer Magnitude Representations: Efficient Coding Representational Geometry and Psychophysical Laws in Language Models
Quick read: This paper addresses the disputed question of how transformer language models encode numerical magnitude: log-compressively, linearly, or via per-digit circular representations, with prior work split across all three. The key is applying the formal tools of psychophysics through four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains (e.g., numerosity, time, space) in 7-9B instruction-tuned models from three architecture families (Llama, Mistral, Qwen). All models show consistently log-compressive representational geometry (RSA correlations of 0.68-0.96), which is explainable by training-data statistics (efficient-coding precondition alpha = 0.77) but is not by itself sufficient for behavioural magnitude discrimination, revealing a dissociation between representational structure and behaviour.
Link: https://arxiv.org/abs/2603.20642
Authors: Jon-Paul Cacioli
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures, 5 tables. Pre-registered on OSF. Submitted to TMLR
Abstract:How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the other does not, and both models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.
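A Weber-law dissimilarity matrix of the kind the RSA analysis correlates against can be built directly from log-space distances. The sketch below uses toy numbers rather than model activations, and a simple rank correlation without tie averaging; it only illustrates how the log-compressive and linear geometries being compared are constructed.

```python
import math

def weber_dissimilarity(numbers):
    """Pairwise dissimilarity under Weber's law: distance in log space,
    so pairs with equal ratios (1 vs 2, 10 vs 20) are equally dissimilar."""
    return [[abs(math.log(a) - math.log(b)) for b in numbers] for a in numbers]

def upper_triangle(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def spearman(x, y):
    """Rank correlation (no tie averaging; adequate for a toy comparison)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

numbers = list(range(1, 21))
log_rdm = upper_triangle(weber_dissimilarity(numbers))
linear_rdm = upper_triangle([[abs(a - b) for b in numbers] for a in numbers])
# The two geometries are related but not identical; RSA asks which one
# better matches a model's activation-space dissimilarities.
rho = spearman(log_rdm, linear_rdm)
```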
[NLP-98] Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention
Quick read: This paper tackles the noise and redundancy introduced when multi-agent debate frameworks broadcast every agent's message at every round, degrading debate quality and wasting computation. Existing methods filter low-confidence responses via uncertainty estimation, but suffer from miscalibrated confidence scores and sensitivity to threshold selection. The key is Diversity-Aware Retention (DAR), a lightweight framework that, at each round, broadcasts only the subset of agent responses that maximally disagree with each other and with the majority vote, preserving authentic and informative disagreement; its index-based retention mechanism leaves the original messages unmodified, avoiding distortion. Experiments on reasoning and question-answering benchmarks show consistent gains, which grow as the number of agents increases and noise accumulation worsens.
Link: https://arxiv.org/abs/2603.20640
Authors: Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le
Affiliations: Deakin University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
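The retention rule, keep the responses that maximally disagree with each other and with the majority vote, can be sketched as an exhaustive subset search. This is hypothetical code: the binary answer-mismatch disagreement measure is a stand-in for whatever scoring the paper uses, and a brute-force search over subsets is only viable for small agent counts.

```python
from collections import Counter
from itertools import combinations

def disagreement(a, b):
    """Toy disagreement signal: 1 if the final answers differ, else 0."""
    return 0.0 if a == b else 1.0

def select_retained(responses, k):
    """Indices of the k responses that maximally disagree with each other
    and with the majority-vote answer; originals are broadcast unmodified."""
    majority, _ = Counter(responses).most_common(1)[0]
    best_idx, best_score = None, -1.0
    for idx in combinations(range(len(responses)), k):
        pairwise = sum(disagreement(responses[i], responses[j])
                       for i, j in combinations(idx, 2))
        vs_majority = sum(disagreement(responses[i], majority) for i in idx)
        score = pairwise + vs_majority
        if score > best_score:
            best_idx, best_score = idx, score
    return list(best_idx)
```

Returning indices rather than rewritten messages mirrors the paper's point that retained disagreements stay authentic.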
[NLP-99] A Modular LLM Framework for Explainable Price Outlier Detection
Quick read: This paper addresses the detection of product price outliers in retail and e-commerce, where erroneous or unjustifiably high prices harm competitiveness, revenue, and consumer trust. Classical methods rely on simple thresholds and ignore the rich semantic relationships among product attributes. The key is an agentic Large Language Model (LLM) framework that treats price-outlier flagging as a reasoning task in three stages: (i) relevance classification selects price-relevant similar products from descriptions and attributes; (ii) relative utility assessment compares the target against each similar product along price-influencing dimensions such as brand, size, and features; (iii) reasoning-based decision aggregates these justifications into an explainable outlier judgment. The framework reaches over 75% agreement with human auditors on a test dataset, outperforming zero-shot and retrieval-based LLM approaches.
Link: https://arxiv.org/abs/2603.20636
Authors: Shadi Sartipi, John Wu, Sina Ghotbi, Nikhita Vedula, Shervin Malmasi
Affiliations: Amazon.com, Inc.
Subjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments: 13 pages, 3 figures
Abstract:Detecting product price outliers is important for retail and e-commerce stores as erroneous or unexpectedly high prices adversely affect competitiveness, revenue, and consumer trust. Classical techniques offer simple thresholds while ignoring the rich semantic relationships among product attributes. We propose an agentic Large Language Model (LLM) framework that treats outlier price flagging as a reasoning task grounded in related product detection and comparison. The system processes the prices of target products in three stages: (i) relevance classification selects price-relevant similar products using product descriptions and attributes; (ii) relative utility assessment evaluates the target product against each similar product along price-influencing dimensions (e.g., brand, size, features); (iii) reasoning-based decision aggregates these justifications into an explainable price outlier judgment. The framework attains over 75% agreement with human auditors on a test dataset, and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the method's sensitivity to key hyper-parameters and attest to its flexibility across settings with different accuracy requirements and auditor agreement levels.
[NLP-100] JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLM s
Quick read: This paper addresses the lack of cultural specificity in existing social-bias evaluations of large language models (LLMs) for non-English contexts, where Japanese-specific stereotypes shaped by local cultural norms are hard to identify and quantify. The key is JUBAKU (Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation), an adversarial benchmark tailored to Japanese cultural contexts: its dialogue scenarios, hand-crafted by native Japanese annotators, are designed to trigger and expose latent social biases across ten distinct cultural categories, enabling highly sensitive bias detection in Japanese LLMs. In experiments, all evaluated models average only 23% accuracy on JUBAKU, well below the 50% random baseline, whereas human annotators reach 91%, confirming JUBAKU's validity and adversarial nature.
Link: https://arxiv.org/abs/2603.20581
Authors: Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi, Naoaki Okazaki
Affiliations: Institute of Science Tokyo; MBZUAI; AIST; NII LLMC
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and three others adapted from English benchmarks. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU’s reliability and its adversarial nature to LLMs.
[NLP-101] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Quick read: This paper addresses the instability of listwise factuality judgments by large language models (LLMs) under changes in candidate ordering: candidate-order sensitivity can cause similarly polished answers with sharply different hallucination risk to be misscored. The key is PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision, reducing order-induced error. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points, and ablations show the dominant gain comes from the permutation-consensus aggregation itself rather than from heavier arbitration layers.
Link: https://arxiv.org/abs/2603.20562
Authors: Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan
Affiliations: Ryquo; App-In Club
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
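The permutation-consensus idea, rerun the same listwise judge over shuffled candidate orders and average the per-candidate scores, can be sketched as follows. This is hypothetical code: `judge_fn` stands in for an LLM judge, and the simple mean is only one of the aggregations the paper combines.

```python
import random
from collections import defaultdict

def permutation_consensus(candidates, judge_fn, n_perms=8, seed=0):
    """Rerun a listwise judge over shuffled orderings; average scores
    per candidate to wash out position-dependent bias."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_perms):
        order = list(range(len(candidates)))
        rng.shuffle(order)
        # judge_fn sees candidates in this order, returns one score per slot.
        scores = judge_fn([candidates[i] for i in order])
        for pos, i in enumerate(order):
            totals[i] += scores[pos]
    means = {i: totals[i] / n_perms for i in totals}
    winner = max(means, key=means.get)
    return winner, means
```

A toy position-biased judge shows why this helps: a bonus for whatever sits in slot 0 is spread across candidates once orders are shuffled, so the genuinely better answer wins.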
[NLP-102] Revenue-Sharing as Infrastructure: A Distributed Business Model for Generative AI Platforms
Quick read: This paper addresses the financial entry barrier that current generative AI platform business models impose on developers, especially those in emerging economies, limiting innovation and ecosystem inclusiveness. The key is a proposed model called Revenue-Sharing as Infrastructure (RSI), in which the platform offers its AI infrastructure (APIs, models) for free and takes a percentage of the revenue generated by developers' applications. RSI inverts the traditional upstream-payment logic; through value co-creation, incentive mechanisms, and a multi-layer market architecture, it lowers entry barriers and aligns stakeholder interests, with the broader societal potential of advancing the digital economy in low-income countries and unlocking a "latent jobs dividend".
Link: https://arxiv.org/abs/2603.20533
Authors: Ghislain Dorian Tchuente Mondjo
Affiliations: University of Yaoundé I
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 1 figure, 2 tables
Abstract:Generative AI platforms (Google AI Studio, OpenAI, Anthropic) provide infrastructures (APIs, models) that are transforming the application development ecosystem. Recent literature distinguishes three generations of business models: a first generation modeled on cloud computing (pay-per-use), a second characterized by diversification (freemium, subscriptions), and a third, emerging generation exploring multi-layer market architectures with revenue-sharing mechanisms. Despite these advances, current models impose a financial barrier to entry for developers, limiting innovation and excluding actors from emerging economies. This paper proposes and analyzes an original model, “Revenue-Sharing as Infrastructure” (RSI), where the platform offers its AI infrastructure for free and takes a percentage of the revenues generated by developers’ applications. This model reverses the traditional upstream payment logic and mobilizes concepts of value co-creation, incentive mechanisms, and multi-layer market architecture to build an original theoretical framework. A detailed comparative analysis shows that the RSI model lowers entry barriers for developers, aligns stakeholder interests, and could stimulate innovation in the ecosystem. Beyond its economic relevance, RSI has a major societal dimension: by enabling developers without initial capital to participate in the digital economy, it could unlock the “latent jobs dividend” in low-income countries, where mobile penetration reaches 84%, and help address local challenges in health, agriculture, and services. Finally, we discuss the conditions of feasibility and strategic implications for platforms and developers.
[NLP-103] Epistemic Observability in Language Models
Quick read: This paper addresses the problem that generative AI models fabricate facts with high confidence: self-reported confidence inversely correlates with accuracy, so supervision based only on text output cannot reliably distinguish honest responses from plausible fabrications, a theoretical impossibility arising from observational limits. The key is a "tensor interface" that exports computational byproducts such as per-token entropy and log-probability distributions; these signals are structurally coupled to correctness under standard training and enable more accurate verification. Entropy-based detection outperforms text-only baselines by 2.5-3.9 percentage points in AUC at every tested budget level and generalizes strongly across architectures (Spearman rho = 0.762). The core contribution is a practical cost-surface map that lets system builders choose a detection strategy for a given verification budget, independent of model scale or training procedure.
Link: https://arxiv.org/abs/2603.20531
Authors: Tony Mason
Affiliations: University of British Columbia; Georgia Institute of Technology
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model’s output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor’s observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5–3.9 percentage points at every budget level tested (10%, 20%, 30%). The entropy signal generalizes across architectures (Spearman rho = 0.762). The core contribution is a cost surface where the empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy is a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.
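The per-token entropy signal and the pairwise-ranking AUC used to evaluate it can both be computed directly from exported log-probability distributions. The sketch below is not the paper's code; it assumes, as the reported AUC suggests, that higher mean entropy marks likely fabrication.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(token_dists):
    """Sequence-level score: mean per-token entropy over the output."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def auc(fabricated_scores, grounded_scores):
    """Probability that a fabricated sequence scores above a grounded one
    (pairwise-ranking AUC; ties count as half a win)."""
    pairs = [(f, g) for f in fabricated_scores for g in grounded_scores]
    wins = sum(1.0 if f > g else 0.5 if f == g else 0.0 for f, g in pairs)
    return wins / len(pairs)
```

A fully peaked distribution yields zero entropy, a uniform one yields the maximum log(V); the AUC then measures how well this separates fabricated from grounded outputs.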
[NLP-104] Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
Quick read: This paper addresses the reliability of large language models (LLMs) when delivering health information in low-resource settings such as Bangladesh. The key is constructing a question-answer dataset from authoritative sources and systematically evaluating GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health-crisis enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya through three methods: semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). The findings reveal both strengths and limitations of LLMs in representing epidemiological history and health-crisis knowledge, informing policy in resource-constrained environments.
Link: https://arxiv.org/abs/2603.20514
Authors: Mohammed Rakibul Hasan
Affiliations: North South University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures, 3 tables
Abstract:Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question–answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
[NLP-105] PARHAF a human-authored corpus of clinical reports for fictitious patients in French
Quick read: This paper addresses the restricted data sharing that hampers clinical natural language processing (NLP) development because of the sensitive nature of medical records, particularly under strict privacy regulations in France and the broader EU. The key is PARHAF, a large open-source corpus of French clinical documents consisting of expert-authored reports for entirely fictitious patients, anonymous and freely shareable by design. Its construction combines clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), covering 5009 fictitious patient cases across 18 medical specialties, with a general-purpose component and four specialized subsets (including oncology, infectious diseases, and diagnostic coding). Released under a CC-BY license, with a portion temporarily embargoed for future benchmarking, PARHAF provides a fully privacy-preserving resource for training and evaluating French clinical language models and a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
Link: https://arxiv.org/abs/2603.20494
Authors: Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry
Affiliations: Sorbonne Université; Université Sorbonne Paris Nord; Inserm; Limics; Université Paris-Saclay; UVSQ; Assistance Publique-Hôpitaux de Paris; Raymond Poincaré University Hospital; Yonsei University College of Medicine; Gangnam Severance Hospital; Department of Laboratory Medicine; Health Data Hub; Université Paris-Dauphine, PSL; CNRS; CEREMADE
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
[NLP-106] AE-LLM: Adaptive Efficiency Optimization for Large Language Models
Quick read: This paper addresses the efficiency bottlenecks facing large language model (LLM) deployment: high computational cost, large memory footprint, and high energy consumption. No single technique (efficient attention, mixture-of-experts (MoE), parameter-efficient fine-tuning, quantization) is universally optimal; effectiveness varies with task characteristics, resource constraints, and model scale. The key is the AE-LLM framework, which uses multi-objective optimization to jointly trade off accuracy, latency, memory footprint, and energy consumption under hardware and task constraints, and automatically searches the combinatorial space of techniques across architecture design, fine-tuning, and inference stages for Pareto-optimal configurations, achieving adaptive efficiency gains across deployment scenarios.
Link: https://arxiv.org/abs/2603.20492
Authors: Kaito Tanaka, Masato Ito, Yuji Nishimura, Keisuke Matsuda, Aya Nakayama
Affiliations: SANNO University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have achieved remarkable success across diverse applications, yet their deployment remains challenging due to substantial computational costs, memory requirements, and energy consumption. Recent empirical studies have demonstrated that no single efficiency technique is universally optimal; instead, the effectiveness of methods such as efficient attention mechanisms, mixture-of-experts (MoE), parameter-efficient fine-tuning, and quantization varies significantly depending on task characteristics, resource constraints, and model scales. Building upon these insights, we propose AE-LLM, a unified framework that automatically selects and combines optimal efficiency techniques tailored to specific deployment scenarios. Our approach introduces a multi-objective optimization framework that jointly considers accuracy, latency, memory footprint, and energy consumption, while accounting for hardware constraints and task requirements. We develop an efficient search algorithm that explores the combinatorial space of efficiency techniques across architecture, fine-tuning, and inference stages, identifying Pareto-optimal configurations. Extensive experiments across 15 models (0.5B-70B parameters) and 10 diverse tasks demonstrate that AE-LLM achieves an average of 2.8× improvement in efficiency metrics while maintaining competitive accuracy (within 1.2% of baseline), compared to static efficiency configurations. Furthermore, our framework generalizes effectively to vision-language models, achieving similar efficiency gains. Our contributions provide practitioners with an automated tool for navigating the complex trade-off landscape of LLM efficiency optimization.
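At its simplest, the Pareto-optimal configuration search behind such a framework reduces to filtering out dominated configurations. The sketch below uses hypothetical objective names and numbers purely to illustrate dominance over accuracy (maximize) versus latency, memory, and energy (minimize).

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (accuracy up; latency, memory, energy down)."""
    no_worse = (a["acc"] >= b["acc"] and a["lat"] <= b["lat"]
                and a["mem"] <= b["mem"] and a["energy"] <= b["energy"])
    strictly_better = (a["acc"] > b["acc"] or a["lat"] < b["lat"]
                       or a["mem"] < b["mem"] or a["energy"] < b["energy"])
    return no_worse and strictly_better

def pareto_front(configs):
    """Keep only configurations not dominated by any other configuration."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs if other is not c)]
```

A deployment scenario then picks one point on the front according to its own accuracy/latency priorities; the front itself never discards a defensible trade-off.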
[NLP-107] Profiling learners' affective engagement: Emotion AI, intercultural pragmatics and language learning
Quick read: This column considers how affective factors can be integrated effectively into language learning, in particular how emotion recognition and simulated human responsiveness in AI systems can support learners' pragmatic and communicative competence. The key lies in Emotion AI, the algorithmically driven interpretation of users' affective signals, which can sense learners' cognitive and emotional states and adapt instruction for more personalized learning; the column also warns of the risks of emotional manipulation and inappropriate user profiling.
Link: https://arxiv.org/abs/2603.20479
Authors: Robert Godwin-Jones
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small. For most learners, language learning does not follow a linear, predictable path, its zigzag course shaped by motivational (or demotivating) variables such as personal characteristics, teacher/peer relationships, learning materials, and dreams of a future L2 (second language) self. While some aspects of language learning (reading, grammar) are relatively mechanical, others can be stressful and unpredictable, especially conversing in the target language. That experience necessitates not only knowledge of structure and lexis, but also the ability to use the language in ways that are appropriate to the social and cultural context. A new opportunity to practice conversational abilities has arrived through the availability of AI chatbots, with both advantages (responsive, non-judgmental) and drawbacks (emotionally void, culturally biased). This column explores aspects of emotion as they arise in technology use and in particular how automatic emotion recognition and simulated human responsiveness in AI systems interface with language learning and the development of pragmatic and interactional competence. Emotion AI, the algorithmically driven interpretation of users’ affective signals, has been seen as enabling greater personalized learning, adapting to perceived learner cognitive and emotional states. Others warn of emotional manipulation and inappropriate and ineffective user profiling.
[NLP-108] Diffutron: A Masked Diffusion Language Model for Turkish Language
Quick read: This paper addresses the limited application of non-autoregressive generative modeling to morphologically rich languages such as Turkish, where standard autoregressive large language models are inefficient and resource-hungry for complex morphology. The key is Diffutron, a masked diffusion language model (MDLM) for Turkish built on two strategies: LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus, and a progressive instruction-tuning strategy that sequentially adapts the model on general and task-specific instruction sets, keeping the model compact while markedly improving its generative capability. Experiments across comprehensive benchmarks show performance competitive with multi-billion-parameter baselines, validating masked diffusion modeling combined with multi-stage tuning.
Link: https://arxiv.org/abs/2603.20466
Authors: Şuayp Talha Kocabay, Talha Rüzgar Akkuş
Affiliations: Hugging Face
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce Diffutron, a masked diffusion language model specifically designed for Turkish. Our approach leverages a resource-efficient training pipeline, starting with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and task-specific instruction sets. Experimental results across comprehensive benchmarks demonstrate that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings validate the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.
[NLP-109] Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable
Quick read: This paper asks whether current peer-review policies on large language model (LLM) use, which prohibit LLMs from writing reviews while permitting polishing, are actually enforceable and detectable. The key is a dataset of peer reviews simulating multiple levels of human-AI collaboration, used to evaluate five state-of-the-art AI-text detectors, including two commercial systems. All detectors misclassify a non-trivial fraction of LLM-polished, human-written reviews as fully AI-generated, risking false accusations of academic misconduct. The study further explores peer-review-specific signals, such as access to the paper manuscript and the constrained domain of scientific writing, to improve detection precision; despite measurable gains in some settings, no approach meets the accuracy standards required for identifying AI use in peer reviews. The authors therefore caution that current detector-based estimates of AI use in peer review should be interpreted carefully, as they likely overstate actual violation rates.
Link: https://arxiv.org/abs/2603.20450
Authors: Rounak Saha, Gurusha Juneja, Dayita Chaudhuri, Naveeja Sajeevan, Nihar B Shah, Danish Pruthi
Affiliations: Indian Institute of Science; University of California, Santa Barbara; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But, are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI generated, potentially overstating the extent of policy violations.
[NLP-110] A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement
Quick read: This paper addresses the trade-off between accuracy and inference efficiency in large language model (LLM) reasoning: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection improves accuracy through extensive sampling without fixing internal model flaws. The key is a training-free regeneration paradigm that uses an offline-curated contrastive Reflection Memory (RM) for corrective guidance, with regeneration from scratch to break out of faulty reasoning loops; at inference time the method performs a single RM-guided self-verification followed by one RM-guided regeneration, substantially lowering computational cost while improving output accuracy.
Link: https://arxiv.org/abs/2603.20441
Authors: Yuran Li, Di Wu, Benoit Boulet
Affiliations: McGill University; Concordia University
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 5 figures
Abstract:Verification-guided self-improvement has recently emerged as a promising approach to improving the accuracy of large language model (LLM) outputs. However, existing approaches face a trade-off between inference efficiency and accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection requires extensive sampling without addressing internal model flaws. We propose a training-free regeneration paradigm that leverages an offline-curated contrastive Reflection Memory (RM) to provide corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection. We evaluated our method on nine benchmarks that span algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs. Experiment results show that our method outperforms prior methods while maintaining low computational cost.
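The inference-time procedure, one RM-guided self-verification and then at most one regeneration from scratch, can be sketched as follows. This is hypothetical code: the word-overlap retrieval and the `generate_fn`/`verify_fn` stand-ins for LLM calls are simplifying assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the verify-then-regenerate-once loop. memory is a list
# of {"query": ..., "lesson": ...} reflections curated offline.

def _overlap(a, b):
    """Toy retrieval score: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def rm_guided_answer(query, memory, generate_fn, verify_fn):
    """Single RM-guided verification, then at most one regeneration."""
    draft = generate_fn(query, guidance=None)
    # Retrieve the most relevant reflection from the offline-curated memory.
    reflection = max(memory, key=lambda m: _overlap(m["query"], query), default=None)
    hint = reflection["lesson"] if reflection else None
    if verify_fn(query, draft, hint):
        return draft
    # Regenerate from scratch with corrective guidance, breaking the faulty chain
    # instead of iteratively patching it.
    return generate_fn(query, guidance=hint)
```

The fixed budget, one verification and at most one regeneration, is what distinguishes this from iterative verify-rectify loops.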
[NLP-111] ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models In-Context Learning Ability INTERSPEECH2026
【速读】: 该论文旨在解决大型音频语言模型(Large Audio-Language Models, LALMs)在音频条件下的上下文学习能力尚未被系统研究的问题,特别是其能否从音频引导的示例中推断任务模式。解决方案的关键在于提出ALICE框架,这是一个三阶段渐进式评估流程,逐步减少文本指导以系统性地检验LALMs在音频条件下的上下文学习表现。实验发现,尽管演示能显著提升格式合规性,但对核心任务性能几乎没有改善,甚至常造成退化,揭示了当前LALMs在跨模态语义对齐上的局限性。
链接: https://arxiv.org/abs/2603.20433
作者: Yen-Ting Piao,Jay Chiehen Liao,Wei-Tang Chien,Toshiki Ogimoto,Shang-Tse Chen,Yun-Nung Chen,Chun-Yi Lee,Shao-Yuan Lo
机构: National Taiwan University (国立台湾大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2026
Abstract:While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs’ in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.
[NLP-112] Coding Agents are Effective Long-Context Processors
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本上下文时性能显著下降的问题,尤其是在依赖隐式且不可解释的注意力机制时,难以有效利用超长上下文信息。其解决方案的关键在于将长文本处理从模型内部的潜空间注意力机制中“外化”出来,通过编码代理(coding agents)以显式、可执行的方式组织和操作文本:一方面,代理具备原生工具熟练度(native tool proficiency),能够调用代码和终端命令而非被动语义查询;另一方面,代理熟悉文件系统结构(file system familiarity),可将海量文本数据视为目录层级进行导航与管理。这种架构使代理在长文本推理、检索增强生成及开放域问答等任务上显著优于现有最优方法,平均提升达17.3%。
链接: https://arxiv.org/abs/2603.20432
作者: Weili Cao,Xunjian Yin,Bhuwan Dhingra,Shuyan Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, this access runs through latent, uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering with a large-scale corpus containing up to three trillion tokens. Across multiple benchmarks, these agents outperform published state-of-the-art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.
[NLP-113] Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP
【速读】: 该论文旨在解决如何让大型语言模型(Large Language Model, LLM)在无需外部网络访问的情况下,自主完成高难度数学竞赛问题的自动化证明任务。其核心挑战在于提升模型在复杂逻辑推理和形式化验证环境中的自主性与可靠性。解决方案的关键在于设计了一套Model Context Protocol (MCP) 工具链,该工具链基于对先前 miniF2F-Rocq 实验日志的分析,采用“编译优先、交互回退”(compile-first, interactive-fallback)策略,使 Claude Opus 4.6 能够在隔离虚拟机中调用多个子代理(subagents),协同执行形式化证明任务,最终成功自主完成 2025 年普特南数学竞赛中 10/12 道题的证明。
链接: https://arxiv.org/abs/2603.20405
作者: Guillaume Baudart,Marc Lelarge,Tristan Stérin,Jules Viennot
机构: IRIF, Université Paris Cité, Inria, CNRS; DI ENS, PSL University, Inria
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:
Abstract:We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a “compile-first, interactive-fallback” strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.
[NLP-114] kRAIG: A Natural Language-Driven Agent for Automated DataOps Pipeline Generation
【速读】: 该论文旨在解决现代机器学习系统中数据工程流水线(ELT)构建效率低、依赖高技能人才以及自动化工具在用户意图不明确、工具生成不可靠和输出可执行性差等方面存在的瓶颈问题。其解决方案的关键在于提出kRAIG这一AI代理框架,通过引入ReQuesAct(Reason, Question, Act)交互机制显式澄清用户意图,结合检索增强的工具合成策略生成任务特定的数据转换组件,并嵌入基于大语言模型(LLM)的验证阶段以确保流水线完整性与安全性,从而显著提升数据提取、加载的成功率(3倍提升)和转换准确性(25%提升)。
链接: https://arxiv.org/abs/2603.20311
作者: Rohan Siva,Kai Cheung,Lichi Li,Ganesh Sundaram
机构: Cisco(思科)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 7 figures
Abstract:Modern machine learning systems rely on complex data engineering workflows to extract, transform, and load (ELT) data into production pipelines. However, constructing these pipelines remains time-consuming and requires substantial expertise in data infrastructure and orchestration frameworks. Recent advances in large language model (LLM) agents offer a potential path toward automating these workflows, but existing approaches struggle with under-specified user intent, unreliable tool generation, and limited guarantees of executable outputs. We introduce kRAIG, an AI agent that translates natural language specifications into production-ready Kubeflow Pipelines (KFP). To resolve ambiguity in user intent, we propose ReQuesAct (Reason, Question, Act), an interaction framework that explicitly clarifies intent prior to pipeline synthesis. The system orchestrates end-to-end data movement from diverse sources and generates task-specific transformation components through a retrieval-augmented tool synthesis process. To ensure data quality and safety, kRAIG incorporates LLM-based validation stages that verify pipeline integrity prior to execution. Our framework achieves a 3x improvement in extraction and loading success and a 25 percent increase in transformation accuracy compared to state-of-the-art agentic baselines. These improvements demonstrate that structured agent workflows with explicit intent clarification and validation significantly enhance the reliability and executability of automated data engineering pipelines.
[NLP-115] FinReflectKG – HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
【速读】: 该论文旨在解决金融领域知识图谱(Knowledge Graph, KG)增强型问答系统中生成式AI(Generative AI)幻觉(hallucination)检测的可靠性问题,即如何有效识别和减少基于KG推理时产生的事实性错误输出。其关键解决方案是构建了一个针对美国证券交易委员会(SEC)10-K文件的基准测试集FinBench-QA-Hallucination,包含755个标注样本,并采用保守的证据链接协议(要求文本片段与关系三元组同时支持)来评估不同检测方法的性能。实验表明,尽管大语言模型(LLM)判别器和嵌入方法在干净条件下表现最优(F1: 0.82–0.86),但多数方法在引入噪声三元组后性能显著下降(MCC下降44%–84%),而嵌入方法保持相对稳定(仅下降9%),凸显了KG质量对系统可靠性的决定性影响,为高风险场景下信息系统的可信赖设计提供了实证依据和评估框架。
链接: https://arxiv.org/abs/2603.20252
作者: Mahesh Kumar,Bhaskarjit Sarmah,Stefano Pasquali
机构: Domyn Inc
类目: Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注:
Abstract:As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran’s Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
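摘要中描述的“保守证据链接协议”可以用一个极简的 Python 草图来说明:一条论断只有同时被检索到的文本片段和抽取出的知识图谱三元组支持时,才标记为有依据(grounded)。以下代码纯属示意,用简单集合代替真实的片段匹配与三元组抽取,所有示例数据均为虚构:

```python
def is_grounded(answer_claims, chunk_support, triplet_support):
    """保守证据链接草图:每条论断必须同时被文本片段和 KG 三元组支持。

    answer_claims / chunk_support / triplet_support 在真实系统中应是经过
    语义匹配的结构化证据;这里用普通集合的成员判断来示意协议本身。
    """
    return all(c in chunk_support and c in triplet_support
               for c in answer_claims)


# 虚构示例:片段支持两条事实,但三元组只覆盖其中一条
chunks = {"revenue grew 10%", "HQ is in Delaware"}
triplets = {"revenue grew 10%"}

grounded = is_grounded({"revenue grew 10%"}, chunks, triplets)      # 双重支持
ungrounded = is_grounded({"HQ is in Delaware"}, chunks, triplets)   # 缺三元组
```

这种“与”式判定正是协议保守性的来源:任一证据源缺失即判为无依据,宁可漏判也不误判为有依据。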
[NLP-116] Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding
【速读】: 该论文旨在解决脑机接口中语音信号解码的鲁棒性与可解释性问题,特别是针对有限数据和每日神经信号非平稳性(day-to-day nonstationarity)带来的挑战。其核心解决方案是提出一种基于多任务Transformer的序列到序列模型,联合预测音素序列、词序列及辅助声学特征,并引入“神经锤子刀片”(Neural Hammer Scalpel, NHS)校准模块,通过全局对齐与特征级调制实现跨日校正。该方法在Willett等人数据集上实现了14.3%的音素错误率(Phoneme Error Rate, PER)和19.4%的词错误率(Word Error Rate, WER),显著优于线性或无日特异性变换的基线,且注意力可视化揭示了编码器中的时间分段模式以及音素与词解码器对这些片段的不同利用方式,从而提升了神经语音读出的保真度与机制可解释性。
链接: https://arxiv.org/abs/2603.20246
作者: Michal Olak,Tommaso Boccato,Matteo Ferrante
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Speech brain–computer interfaces require decoders that translate intracortical activity into linguistic output while remaining robust to limited data and day-to-day variability. While prior high-performing systems have largely relied on framewise phoneme decoding combined with downstream language models, it remains unclear what contextual sequence-to-sequence decoding contributes to sublexical neural readout, robustness, and interpretability. We evaluated a multitask Transformer-based sequence-to-sequence model for attempted speech decoding from area 6v intracortical recordings. The model jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features. To address day-to-day nonstationarity, we introduced the Neural Hammer Scalpel (NHS) calibration module, which combines global alignment with feature-wise modulation. We further analyzed held-out-day generalization and attention patterns in the encoder and decoders. On the Willett et al. dataset, the proposed model achieved a state-of-the-art phoneme error rate of 14.3%. Word decoding reached 25.6% WER with direct decoding and 19.4% WER with candidate generation and rescoring. NHS substantially improved both phoneme and word decoding relative to linear or no day-specific transform, while held-out-day experiments showed increasing degradation on unseen days with temporal distance. Attention visualizations revealed recurring temporal chunking in encoder representations and distinct use of these segments by phoneme and word decoders. These results indicate that contextual sequence-to-sequence modeling can improve the fidelity of neural-to-phoneme readout from intracortical speech signals and suggest that attention-based analyses can generate useful hypotheses about how neural speech evidence is segmented and accumulated over time.
[NLP-117] Email in the Era of LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在邮件沟通中如何理解、生成并优化复杂社会目标的问题,尤其是在人机协作场景下对沟通质量的评估与提升。其核心解决方案是设计并应用HR Simulator这一以沟通为核心机制的博弈平台,通过人类与LLM在真实职场情境下的邮件交互数据(600+封邮件)及LLM作为裁判的评估体系,揭示了LLM在邮件质量判断上的趋同性、人类与LLM在沟通策略上的差异以及人机协同写作的优势。关键发现在于:尽管LLM在标准化指标上优于人类,但结合人类意图理解与LLM表达能力的“人-LLM协同写作”模式可显著提升沟通效果,尤其在高社交敏感性任务中表现突出,表明该范式是未来高效沟通的重要路径。
链接: https://arxiv.org/abs/2603.20231
作者: Dang Nguyen,Harvey Yiyun Fu,Peter West,Chenhao Tan,Ari Holtzman
机构: University of Chicago (芝加哥大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 47 pages (including appendix), 6 figures, 2 tables main body
Abstract:Email communication increasingly involves large language models (LLMs), but we lack intuition on how they will read, write, and optimize for nuanced social goals. We introduce HR Simulator, a game where communication is the core mechanic: players play as a Human Resources officer and write emails to solve socially challenging workplace scenarios. An analysis of 600+ human and LLM emails with LLMs-as-judge reveals evidence for larger LLMs becoming more homogenous in their email quality judgments. Under LLM judges, humans underperform LLMs (e.g., 23.5% vs. 48-54% success rate), but a human+LLM approach can outperform LLM-only (e.g., from 40% to nearly 100% in one scenario). In cases where models’ email preferences disagree, emergent tact is a plausible explanation: weaker models prefer less tactful strategies while stronger models prefer more tactful ones. Regarding tone, LLM emails are more formal and empathetic while human emails are more varied. LLM rewrites make human emails more formal and empathetic, but models still struggle to imitate human emails in the low empathy, low formality quadrant, which highlights a limitation of current post-training approaches. Our results demonstrate the efficacy of communication games as instruments to measure communication in the era of LLMs, and posit human-LLM co-writing as an effective form of communication in that future.
[NLP-118] Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在高请求场景下能耗与计算成本过高,而小型语言模型(Small Language Models, SLMs)虽能胜任部分任务却难以平衡性能与能效的问题。其核心解决方案在于提出一套基于测试时计算策略的能效优化框架:首先引入能量效率指标(如Energy-per-Token),量化模型推理过程中的能耗表现;其次通过分析Transformer架构中输入输出token的非线性硬件能耗曲线,设计动态推理深度控制机制——即在Chain-of-Thought(CoT)推理过程中依据运行曲线调节推理步骤,实现精度与能耗的权衡;最终构建一个能量感知的路由机制,使模型选择和推理策略协同优化,从而推动可持续的AI部署。
链接: https://arxiv.org/abs/2603.20224
作者: Patrick Wilhelm,Thorsten Wittkopp,Odej Kao
机构: Technische Universität Berlin(柏林工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and capabilities of LLMs are often unnecessary, as Small Language Models (SLMs) can provide accurate responses for simpler text generation tasks. When enhanced with advanced reasoning strategies, such as Chain-of-Thought (CoT) prompting or Majority Voting, SLMs can approach the performance of larger models while reducing overall computational requirements. However, these strategies can also introduce additional energy costs, creating an energy-accuracy trade-off. Our analysis examines these trade-offs in test-time compute strategies for smaller models compared to larger ones, using the MMLU benchmark. Additionally, we explore the input-output token dynamics of transformer architectures, which result in nonlinear hardware energy operation curves for LLMs. To bridge AI research with its physical impact, we propose energy efficiency metrics, including Energy-per-Token, as complements to traditional accuracy benchmarks. Beyond model selection, we propose controlled reasoning in CoT token generation, using operating curves to regulate reasoning depth dynamically. This vision integrates an energy-aware routing mechanism, ensuring that model selection and inference strategies balance accuracy for sustainable AI deployment.
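Energy-per-Token 指标本身非常直接:推理总能耗除以生成的 token 数。下面给出一个极简的计算草图,其中两组运行数据均为虚构,仅用于示意大模型与 SLM+CoT 之间的能效对比:

```python
def energy_per_token(total_energy_joules, output_tokens):
    """Energy-per-Token:每生成一个 token 消耗的焦耳数。"""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return total_energy_joules / output_tokens


# 虚构的两次推理运行(数字均为假设,不代表任何真实硬件测量):
ept_large = energy_per_token(1200.0, 400)  # 大模型:3.0 J/token
ept_small = energy_per_token(150.0, 300)   # SLM + CoT:0.5 J/token
```

在论文设想的能量感知路由中,此类指标可与准确率一起进入模型选择的目标函数。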
[NLP-119] Linguistic Signatures for Enhanced Emotion Detection
【速读】: 该论文试图解决的问题是:在自然语言处理(Natural Language Processing, NLP)中,尽管基于Transformer的模型在情感识别任务上取得了显著进展,但缺乏对跨不同语料库和标签体系下情感表达语言规律的理解,且现有模型对情感类别预测的可解释性与鲁棒性仍存在不足。解决方案的关键在于:从13个英文数据集中提取特定于情感类别的语言特征签名(linguistic signatures),并将这些高阶语言特征显式地融入RoBERTa模型中,从而增强模型对情感类别的判别能力。实验表明,该方法在GoEmotions基准上实现了最高达+2.4的宏F1分数提升,验证了语言特征作为可解释信号能够有效补充神经表示,提高情感识别的性能与鲁棒性。
链接: https://arxiv.org/abs/2603.20222
作者: Florian Lecourt(LIRMM | ADVANSE),Madalina Croitoru(LIRMM),Konstantin Todorov(WEB3)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Emotion detection is a central problem in NLP, with recent progress driven by transformer-based models trained on established datasets. However, little is known about the linguistic regularities that characterize how emotions are expressed across different corpora and labels. This study examines whether linguistic features can serve as reliable interpretable signals for emotion recognition in text. We extract emotion-specific linguistic signatures from 13 English datasets and evaluate how incorporating these features into transformer models impacts performance. Our RoBERTa-based models enriched with high level linguistic features achieve consistent performance gains of up to +2.4 macro F1 on the GoEmotions benchmark, showing that explicit lexical cues can complement neural representations and improve robustness in predicting emotion categories.
[NLP-120] Thinking into the Future: Latent Lookahead Training for Transformers
【速读】: 该论文旨在解决自回归语言模型(autoregressive language models)在文本生成过程中存在的两个关键问题:一是模型在每一步都必须立即决定下一个离散token,无法对多个可能的延续进行探索或反思;二是计算资源在所有token上均匀分配,而忽视了某些复杂token可能需要更多计算量以提升表达能力的问题。解决方案的关键在于引入“潜在前瞻”(latent lookahead)训练策略,即在序列中选定位置,模型在生成当前token前,通过在潜在空间中进行多步递归推理(τ步),利用隐藏状态反复反馈到上下文中,从而在不直接采样未来token的情况下,基于潜在空间中的多步预测来优化当前决策,并通过监督这τ个潜在预测与真实token之间的差异,促使模型提前“展望”并改进预测质量。
链接: https://arxiv.org/abs/2603.20219
作者: Lorenzo Noci,Gregor Bachmann,Seyed-Mohsen Moosavi-Dezfooli,Moin Nabi
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model’s expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to “think” before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network’s latent space by recursively feeding its hidden states back into the context for τ steps, investing more compute on predicting that token. This produces τ latent predictions that are supervised against the next τ ground-truth tokens, encouraging the model to “lookahead” and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
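摘要中“将隐藏状态递归反馈 τ 步”的核心循环可以用一个玩具化的 Python 草图说明。以下代码用一个固定线性映射代替真实的 Transformer 前向传播(`step` 与矩阵 `W` 均为虚构的占位),只展示递归结构本身,即收集 τ 个潜在预测用于与后续真实 token 对齐监督:

```python
import numpy as np


def latent_lookahead(step, h0, tau):
    """玩具版 latent lookahead:把隐藏状态递归回馈 tau 步。

    step 代表一次前向传播(此处为占位的线性映射);
    返回的 tau 个潜变量对应论文中与未来 tau 个真实 token 对齐的监督目标。
    """
    h = h0
    latents = []
    for _ in range(tau):
        h = step(h)        # 隐藏状态回馈到上下文,再做一次"前向"
        latents.append(h)
    return latents


# 占位"前向传播":二维隐藏状态上的固定线性映射(数值为虚构)
W = np.array([[0.9, 0.1],
              [0.0, 0.8]])
preds = latent_lookahead(lambda h: W @ h, np.array([1.0, 1.0]), tau=3)
```

真实训练中每个潜变量还会经过投影头并与对应的 ground-truth token 计算损失,此处省略。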
[NLP-121] An experimental study of KV cache reuse strategies in chunk-level caching systems
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因分块级缓存(Chunk-Level Caching, CLC)忽略块间交叉注意力依赖关系而导致的输出质量下降问题。现有CLC方法虽能加速推理,但存在根本性局限,限制了其准确性和适用范围。论文的关键解决方案在于:通过系统实验验证现有CLC方法的互补性,并提出一种融合多种技术的新设计,从而在保持高效推理的同时显著提升生成准确性。
链接: https://arxiv.org/abs/2603.20218
作者: Samuel Cestola,Tianxiang Xia,Zheng Weiyan,Zheng Pengfei,Diego Didona
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Retrieval-augmented generation improves large language models’ accuracy by adding relevant retrieved text to the prompt. Chunk level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches miss cross-attention dependencies between chunks, which can reduce output quality. Several methods try to improve CLC accuracy using different techniques. We make two main contributions. First, we show that existing CLC approaches have fundamental limitations that limit their accuracy or their applicability. We back this conclusion with an extensive CLC system experimental evaluation. Second, we observe that existing CLC techniques are complementary. We leverage this insight to propose a new CLC design that carefully combines them and achieves better accuracy.
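分块级缓存(CLC)的基本机制是:每个检索片段的 KV 缓存只计算一次,之后按块复用并拼接。下面是一个纯示意的 Python 草图,用字符串占位真实的每层 key/value 张量(`encode`、`kv(...)` 均为虚构),并在注释中点明摘要指出的根本局限:

```python
class ChunkKVCache:
    """玩具版分块级 KV 缓存:每个片段的"KV"只编码一次,命中即复用。"""

    def __init__(self, encode):
        self.encode = encode   # 占位:真实系统中是对片段跑一次前向得到 KV 张量
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk_id, chunk_text):
        if chunk_id not in self.store:
            self.misses += 1
            self.store[chunk_id] = self.encode(chunk_text)
        else:
            self.hits += 1
        return self.store[chunk_id]

    def build_prompt_kv(self, chunks):
        # 直接拼接各块独立计算的缓存,丢失了块间交叉注意力——
        # 这正是摘要所指出的 CLC 准确性损失的来源。
        return [self.get(cid, text) for cid, text in chunks]


cache = ChunkKVCache(encode=lambda t: f"kv({t})")
cache.build_prompt_kv([("c1", "alpha"), ("c2", "beta")])
cache.build_prompt_kv([("c1", "alpha"), ("c3", "gamma")])  # c1 被复用
```

论文所评估的各类改进方法,本质上都是在这种拼接复用之上以不同方式补回交叉注意力依赖。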
[NLP-122] Expected Reward Prediction with Applications to Model Routing ICML2025
【速读】: 该论文试图解决的问题是在模型推理阶段如何高效地将提示(prompt)路由到最适合的大型语言模型(Large Language Model, LLM),以最大化奖励得分的同时控制计算成本。传统方法通常基于模型在特定类别下的平均性能进行路由,但未能考虑具体提示下模型的潜在表现差异。解决方案的关键在于提出一种基于预期奖励预测(Expected Reward Prediction, ERP)的路由协议:通过训练一个响应级奖励模型(response-level reward model),直接预测LLM在重复采样下对某一提示所能获得的期望奖励,从而实现无需生成实际响应即可评估模型适配度。该方法不仅精度高、判别性强,且可扩展性强,能有效替代复杂路由策略并显著优于基线方法。
链接: https://arxiv.org/abs/2603.20217
作者: Kenan Hasanaliyev,Silas Alberti,Jenny Hamer,Dheeraj Rajagopal,Kevin Robinson,Jasper Snoek,Victor Veitch,Alexander Nicholas D’Amour
机构: Stanford University (斯坦福大学); Inception Labs; Cognition Labs; Google DeepMind; University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2025 Workshop on Models of Human Feedback for AI Alignment
Abstract:Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model’s suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction–based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt’s category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
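基于期望奖励预测的路由决策可以压缩成一个很小的打分规则:为每个候选模型预测该提示下的期望奖励,减去按系数 λ 折算的计算成本,选择得分最高者。以下 Python 草图中的模型名、期望奖励数值与成本均为虚构,仅示意这一决策形式:

```python
def route(prompt_category, predicted_reward, costs, lam=0.02):
    """选出 预测期望奖励 - lam * 成本 最大的模型。

    predicted_reward: {model: {category: 期望奖励}},代表一个已训练好的
    期望奖励预测器的输出;这里所有数值都是为演示而编造的。
    """
    best, best_score = None, float("-inf")
    for model, cost in costs.items():
        score = predicted_reward[model][prompt_category] - lam * cost
        if score > best_score:
            best, best_score = model, score
    return best


predicted = {
    "small-8b":  {"math": 0.55, "chat": 0.80},
    "large-70b": {"math": 0.85, "chat": 0.84},
}
costs = {"small-8b": 1.0, "large-70b": 8.0}

choice_math = route("math", predicted, costs)  # 难题:大模型的奖励增量值回成本
choice_chat = route("chat", predicted, costs)  # 闲聊:小模型已经足够
```

论文的 ERP 与此不同之处在于期望奖励按具体提示(而非类别)预测,这正是它优于“按类别选平均最优模型”基线的原因。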
[NLP-123] Locally Coherent Parallel Decoding in Diffusion Language Models
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在实现并行生成时因忽略并发 token 之间的联合依赖关系而导致的语法不一致和多 token 结构破坏问题。标准 DLM 通过独立采样条件边缘分布生成 token,无法建模局部依赖性,从而影响生成文本的连贯性。解决方案的关键在于提出 CoDiLA(Coherent Diffusion with Local Autoregression),其核心思想是将局部解码任务交由一个小型辅助自回归(Autoregressive, AR)模型处理,该模型作用于扩散潜变量(diffusion latents)上,以确保每个生成块内部的序列有效性;同时保留 DLM 的全局双向建模能力与块间并行生成特性,从而在保持高效性的同时显著提升生成质量。实验表明,使用参数量极小(如 0.6B)的辅助 AR 模型即可有效消除连贯性伪影,推动代码生成任务中准确率与速度的新帕累托前沿。
链接: https://arxiv.org/abs/2603.20216
作者: Michael Hersche,Nicolas Menet,Ronan Tanios,Abbas Rahimi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
[NLP-124] Multi-Agent Debate with Memory Masking ICLR2026
【速读】: 该论文旨在解决多智能体辩论(Multi-Agent Debate, MAD)框架中因历史记忆存在错误而导致推理性能下降的问题。研究表明,MAD的性能高度依赖于前一轮辩论所生成的记忆质量,而错误记忆会误导后续推理过程,降低整体准确性。解决方案的关键在于提出一种名为“带记忆掩码的多智能体辩论”(Multi-Agent Debate with Memory Masking, MAD-M²)的新框架,其核心机制是在每轮辩论开始时,由LLM智能体主动识别并屏蔽错误记忆,从而在保留有效信息的同时剔除噪声,提升上下文质量与推理鲁棒性。实验表明,MAD-M²能有效识别并过滤错误记忆,在主流数学和逻辑推理基准上优于原始MAD。
链接: https://arxiv.org/abs/2603.20215
作者: Hongduan Tian,Xiao Feng,Ziyuan Zhao,Xiangyu Zhu,Rolan Yan,Bo Han
机构: Hong Kong Baptist University (香港浸会大学); Tencent (腾讯)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICLR 2026
Abstract:Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, multi-agent debate (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, multi-agent debate with memory masking (MAD-M²), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M² can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M² can identify the erroneous memories and achieve better performance in reasoning than MAD.
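MAD-M² 的记忆掩码步骤在结构上就是一次过滤:每轮辩论开始前,代理对上一轮记忆逐条判断,剔除被标记为错误的条目。以下 Python 草图用一个普通谓词函数代替真实的 LLM 判断(示例记忆内容为虚构):

```python
def mask_memories(memories, flag_erroneous):
    """MAD-M² 式掩码草图:保留有效记忆,丢弃被标记为错误的记忆。

    flag_erroneous 在真实框架中由 LLM 代理给出判断,
    这里用一个普通的布尔谓词占位。
    """
    return [m for m in memories if not flag_erroneous(m)]


# 虚构的上一轮辩论记忆
memories = [
    {"agent": "A", "claim": "2+2=4"},
    {"agent": "B", "claim": "2+2=5"},   # 错误记忆,应被掩码
]
clean = mask_memories(memories, lambda m: m["claim"] == "2+2=5")
```

过滤后的 `clean` 即作为下一轮辩论的上下文,这对应摘要中“在每轮开始时润色上下文信息”的做法。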
[NLP-125] AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization
【速读】: 该论文旨在解决生成式搜索引擎(Generative Search Engines)中内容优化的难题,即如何在黑箱大语言模型(Large Language Model, LLM)生成摘要输出的场景下,通过策略性地调整源内容来最大化其可见性和归属度(attribution),同时克服现有方法依赖静态启发式规则、单一提示优化或易过拟合的引擎偏好蒸馏所带来的适应性差与交互成本高的问题。解决方案的关键在于提出AgenticGEO——一个自演化代理框架,将优化建模为内容条件控制问题,并引入MAP-Elites档案库以进化出多样化且可组合的策略;此外,设计轻量级共进化评判器(Co-Evolving Critic)作为代理反馈机制,高效近似引擎响应以指导策略选择与迭代优化,从而显著降低对真实引擎交互的依赖并提升跨域泛化能力。
链接: https://arxiv.org/abs/2603.20213
作者: Jiaqi Yuan,Jialu Wang,Zihan Wang,Qingyun Sun,Ruijie Wang,Jianxin Li
机构: Beihang University (北京航空航天大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Generative search engines represent a transition from traditional ranking-based retrieval to Large Language Model (LLM)-based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and attribution in black-box summarized outputs by strategically manipulating source content. However, existing methods rely on static heuristics, single-prompt optimization, or engine preference rule distillation that is prone to overfitting. They cannot flexibly adapt to diverse content or the changing behaviors of generative engines. Moreover, effectively optimizing these strategies requires an impractical amount of interaction feedback from the engines. To address these challenges, we propose AgenticGEO, a self-evolving agentic framework formulating optimization as a content-conditioned control problem, which enhances intrinsic content quality to robustly adapt to the unpredictable behaviors of black-box engines. Unlike fixed-strategy methods, AgenticGEO employs a MAP-Elites archive to evolve diverse, compositional strategies. To mitigate interaction costs, we introduce a Co-Evolving Critic, a lightweight surrogate that approximates engine feedback for content-specific strategy selection and refinement, efficiently guiding both evolutionary search and inference-time planning. Through extensive in-domain and cross-domain experiments on two representative engines, AgenticGEO achieves state-of-the-art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets. Our code and model are available at: this https URL.
[NLP-126] Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models
【速读】: 该论文旨在解决生成式奖励模型(Generative Reward Models, GRMs)在强化学习中人类反馈(Reinforcement Learning from Human Feedback, RLHF)时计算开销过高,而标量奖励模型(Scalar Reward Models, SRMs)则因性能有限和适应性差难以应对复杂场景的问题。解决方案的关键在于提出一种受双过程理论启发的混合奖励模型架构——快慢思维奖励模型(Fast-Slow Thinking Reward Models, F/S-RM),其核心是训练单一模型同时集成两种奖励机制:基于首token预测的快速判断(fast thinking)与基于思维链(Chain-of-Thought, CoT)的深度推理(slow thinking),并通过一个双置信度激活机制动态控制慢思考的触发时机,从而在保持高精度的同时显著降低token消耗(减少20.8%),实现效率与性能的平衡。
链接: https://arxiv.org/abs/2603.20212
作者: Jiayun Wu,Peixu Hou,Shan Qu,Peng Zhang,Ning Gu,Tun Lu
机构: Fudan University (复旦大学); Meituan (美团)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
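双置信度激活机制的骨架可以用一个简单的分支来示意:当快速(标量)打分的置信度超过阈值时直接返回,否则才调用昂贵的 CoT 慢思考裁判。以下 Python 草图中的函数名、阈值与数值均为虚构占位,仅展示这一控制流:

```python
def fast_slow_reward(fast_score, fast_conf, slow_judge, threshold=0.8):
    """双置信度激活草图:置信度达标走快路径,否则回退到慢思考。

    slow_judge 代表一次完整的 CoT 生成式奖励模型调用(此处为占位);
    threshold 为虚构的激活阈值。
    """
    if fast_conf >= threshold:
        return fast_score, "fast"      # 只花一个 token 的代价
    return slow_judge(), "slow"        # 触发完整 CoT 推理


score, path = fast_slow_reward(0.7, 0.95, slow_judge=lambda: 0.65)
# 置信度高 -> 快路径,未生成任何 CoT token
```

摘要中 20.8% 的 token 节省,直观上就来自大部分样本走了快路径、只有难例触发慢思考。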
[NLP-127] CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
【速读】: 该论文旨在解决掩码扩散模型(Masked Diffusion Models, MDMs)在文本生成中因依赖离散边际分布而导致的词元依赖建模不足和语义不连贯问题。解决方案的关键在于将扩散过程迁移至连续的句子级语义空间,并提出CRoCoDiL(Continuous and Robust Conditioned Diffusion for Language)统一微调框架,通过联合训练编码器-去掩码器架构,使MDM的去掩码操作基于连续潜在表示进行,从而构建一种新型自编码器结构,其中解码由MDM算法完成。这一设计显著提升了生成质量与效率,同时支持两种无条件文本合成方法:Continuous-Then-Discrete(ConThenDisc)与Continuous-Within-Discrete(ConWithinDisc),分别实现先连续后离散生成和在离散采样过程中持续优化潜变量的多扩散策略。
链接: https://arxiv.org/abs/2603.20210
作者: Roy Uziel,Omer Belhasin,Itay Levi,Akhiad Bercovich,Ran El-Yaniv,Ran Zilberstein,Michael Elad
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language), a unified fine-tuning approach that jointly trains an encoder-demasker architecture, grounding the MDM demasking in continuous latent representations. This leads to the formation of a novel autoencoder in which decoding is obtained by an MDM algorithm. Relying on the same framework, we introduce two unconditional text synthesis algorithms: Continuous-Then-Discrete (ConThenDisc), a hybrid-diffusion approach that first generates latent representations in continuous space and then decodes these to tokens via an MDM, and Continuous-Within-Discrete (ConWithinDisc), a multi-diffusion strategy that refines latent representations throughout the discrete sampling process. Experiments using LLaDA show that our methods achieve superior generation quality and more than 10x faster sampling speeds in an unconditional setting.
[NLP-128] Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs ICLR2026
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)缺乏系统性、可解释且贴近人类认知发展路径的评估体系的问题。现有评测方法往往侧重于单一任务性能,难以全面反映模型在执行、感知推理、学习、记忆和规划等核心能力上的综合水平。为此,作者受韦氏智力量表(Wechsler Intelligence Scales)启发,提出KidGym——一个基于二维网格环境的综合性基准测试平台,其关键在于将智能拆解为五个可测、可解释的能力维度,并设计了12个具有随机布局与多样化场景的任务,以模拟儿童认知发展的不同阶段,从而更准确地衡量MLLMs的适应性与成长潜力。该方案支持用户自定义扩展,有助于推动MLLM研究向更具通用性和可发展性的方向演进。
链接: https://arxiv.org/abs/2603.20209
作者: Hengwei Ye,Yuanting Guan,Yuxuan Ge,Tianying Zhu,Zhenhan Guan,Yijia Zhong,Yijing Zhang,Han Zhang,Yingna Wu,Zheng Tian
机构: ShanghaiTech University (上海科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026
Abstract:Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs’ adaptability and developmental potential, mirroring the stages of children’s cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: this https URL.
[NLP-129] RedacBench: Can AI Erase Your Secrets?
【速读】: 该论文旨在解决现有文本去标识化(redaction)评估基准在覆盖范围和评估维度上的局限性问题,即当前基准多局限于预定义敏感信息类别(如个人身份信息,PII)或仅评估特定去标识化技术(如掩码),难以全面衡量模型在不同安全策略下对敏感信息的精准移除能力及其对原始语义的保留效果。其解决方案的关键在于构建了一个名为RedacBench的综合性评估基准,该基准基于514篇人工撰写的跨领域文本(涵盖个人、企业与政府来源)及187条安全策略,通过8,053个标注命题来量化模型在执行政策条件下的去标识化任务中同时实现安全性(移除违反策略的信息命题)和实用性(保留非敏感命题)的能力,从而为生成式AI(Generative AI)时代的数据安全提供更细粒度、可扩展的评测体系。
链接: https://arxiv.org/abs/2603.20208
作者: Hyunjun Jeon,Kyuyoung Kim,Jinwoo Shin
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Modern language models can readily extract sensitive information from unstructured text, making redaction – the selective removal of such information – critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy-conditioned redaction across domains and strategies. Constructed from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model’s ability to selectively remove policy-violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security – the removal of sensitive propositions – and utility – the preservation of non-sensitive propositions. Experiments across multiple redaction strategies and state-of-the-art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web-based playground for dataset customization and evaluation. Available at this https URL.
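The security/utility split over annotated propositions can be sketched as follows; the substring containment check is a stand-in for whatever entailment-style matching the benchmark actually uses, and the example propositions are invented.

```python
def redaction_scores(propositions, redacted_text):
    """Score a redaction: security = fraction of sensitive propositions
    removed, utility = fraction of non-sensitive propositions preserved.

    propositions: list of (text, is_sensitive) pairs.
    """
    sensitive = [p for p, s in propositions if s]
    benign = [p for p, s in propositions if not s]
    security = sum(p not in redacted_text for p in sensitive) / max(len(sensitive), 1)
    utility = sum(p in redacted_text for p in benign) / max(len(benign), 1)
    return security, utility

props = [("SSN is 123-45-6789", True), ("the report was filed in March", False)]
print(redaction_scores(props, "the report was filed in March"))  # (1.0, 1.0)
```

The abstract's finding, that stronger models raise security but still drop utility, corresponds to redactions that over-delete benign propositions under this metric.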
[NLP-130] Enhancing Safety of Large Language Models via Embedding Space Separation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对有害提示时的安全性问题,即如何有效提升模型对有害请求的防御能力而不损害其通用性能。解决方案的关键在于提出一种表示层微调方法——嵌入空间分离(Embedding Space Separation, ES2),通过显式扩大有害与安全输入在嵌入空间中的距离来增强安全性;同时引入Kullback-Leibler(KL)散度正则项以约束微调后模型在无害输入上的输出 logits 与原始基线模型保持一致,从而避免通用能力退化。
链接: https://arxiv.org/abs/2603.20206
作者: Xu Zhao,Xiting Wang,Weiran Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs typically exhibit linear separability, a property that has been exploited to construct attacks by perturbing the embeddings of harmful queries towards the safe subspace. Motivated by this observation, we propose a representation-level fine-tuning approach, named Embedding Space Separation (ES2), which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of model’s general capabilities, we introduce a Kullback-Leibler (KL) divergence regularization term into the loss function, which constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs. We evaluate our method on several open-source LLMs using standard safety benchmarks. Extensive experimental results demonstrate that our approach substantially improves model safety while maintaining comparable general capabilities.
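A minimal sketch of an ES2-style objective, assuming a squared-Euclidean separation term and KL(base || fine-tuned) over softmaxed logits on harmless inputs; the paper's actual loss terms and weighting may differ.

```python
import math

def es2_loss(harm_emb, safe_emb, logits_ft, logits_base, lam=0.1):
    """Illustrative ES2-style objective: minimizing it pushes harmful and
    safe embeddings apart while a KL term ties fine-tuned logits to the
    base model on harmless inputs."""
    # Separation: negative squared Euclidean distance (so minimizing
    # the loss enlarges the distance).
    sep = -sum((h - s) ** 2 for h, s in zip(harm_emb, safe_emb))

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [v / z for v in exps]

    p, q = softmax(logits_base), softmax(logits_ft)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return sep + lam * kl

print(es2_loss([0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 2.0]))  # -1.0 (KL term vanishes)
```

When the fine-tuned logits drift from the base model's on harmless inputs, the KL term grows and counteracts the capability degradation the abstract warns about.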
[NLP-131] Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning INTERSPEECH2026
【速读】: 该论文旨在解决语音深度伪造(Speech Deepfake)源验证系统中一个关键假设未被验证的问题,即现有方法通常假设生成器源嵌入(source embeddings)与说话人特征无关,但这一假设尚未得到实证支持。为应对该问题,作者提出了一种说话人解耦度量学习(Speaker-Disentangled Metric Learning, SDML)框架,其核心创新在于两个新颖的损失函数设计:一是利用切比雪夫多项式(Chebyshev polynomial)缓解解耦优化过程中的梯度不稳定问题;二是将源嵌入和说话人嵌入投影到双曲空间(hyperbolic space),借助黎曼度量距离降低说话人信息干扰,从而学习更具判别性的源特征。实验在MLAAD基准上通过四种新提出的解耦场景评估协议验证了SDML的有效性。
链接: https://arxiv.org/abs/2603.21875
作者: Xi Xuan,Wenxin Zhang,Zhiyu Li,Jennifer Williams,Ville Hautamäki,Tomi H. Kinnunen
机构: University of Eastern Finland (东芬兰大学); City University of Hong Kong (香港城市大学); University of Southampton (南安普顿大学); University of Chinese Academy of Sciences (中国科学院大学); University of Science and Technology of China (中国科学技术大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to Interspeech 2026; The code, evaluation protocols and demo website are available at this https URL
Abstract:Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages a Chebyshev polynomial to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at this https URL.
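The hyperbolic projection relies on the standard Poincaré-ball distance, which can be computed directly (a textbook formula, not the paper's code):

```python
import math

def poincare_distance(u, v):
    """Riemannian distance between two points inside the unit Poincaré ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    sq = lambda x: sum(xi * xi for xi in x)
    diff = sq([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq(u)) * (1.0 - sq(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Distance from the origin reduces to 2 * artanh(|x|).
print(round(poincare_distance([0.0, 0.0], [0.5, 0.0]), 4))  # 1.0986
```

Distances blow up near the ball's boundary, which is what makes the metric useful for pushing speaker information away from source embeddings.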
[NLP-132] PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection
【速读】: 该论文旨在解决长上下文大语言模型(Large Language Model, LLM)推理中因KV缓存扫描导致的内存带宽瓶颈问题,这一瓶颈表现为随序列长度n线性增长的O(n)内存访问开销,远超算力扩展所能缓解的范围。解决方案的关键在于识别出粗粒度块选择步骤——即确定哪些KV块需被加载的相似性搜索任务——是真正的性能瓶颈,并首次发现该任务与光子广播-加权范式(photonic broadcast-and-weight paradigm)在结构上高度匹配:查询通过无源分束广播至所有候选块,签名具有准静态特性(契合电光微环谐振器(MRR)编程),且仅需保留排序关系(可将精度放宽至4–6比特)。这一洞察使得光子加速器的优势随上下文长度增加而显著提升,电子扫描成本呈线性增长,而光子评估成本保持O(1)。作者基于此设计了PRISM(Photonic Ranking via Inner-product Similarity with Microring weights)系统,利用薄膜铌酸锂(TFLN)实现相似性引擎,在Qwen2.5-7B模型上验证了从4K到64K token范围内k=32时100%准确率,并在64K上下文中实现16倍流量减少和相较于GPU基线四个数量级的能量优势。
链接: https://arxiv.org/abs/2603.21576
作者: Hyoseok Park,Yeonsang Park
机构: Chungnam National University (忠南国立大学)
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28 pages, 27 figures, 15 tables, including supplementary material. Code available at this https URL
Abstract:Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step – a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm – the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n = 4K).
[NLP-133] Generalized Discrete Diffusion from Snapshots
【速读】: 该论文旨在解决离散扩散模型(Discrete Diffusion Models)在大规模离散状态空间中建模灵活性不足与训练效率低的问题。现有方法受限于固定的噪声传播机制,难以支持任意的污染过程(corruption dynamics),且反向生成过程通常依赖完整的噪声路径,导致计算开销大、优化困难。解决方案的关键在于提出一种统一框架——广义离散扩散从快照(Generalized Discrete Diffusion from Snapshots, GDDS),其核心创新包括:1)通过均匀化(uniformization)实现任意污染过程的快速采样;2)引入基于快照潜在变量(snapshot latents)的简化证据下界(ELBO),避免对完整噪声路径的依赖,从而支持标准生成模型架构的高效训练,并保持清晰的概率解释。实验表明,GDDS 在大规模词汇离散生成任务中显著提升训练效率和生成质量,首次在该规模上超越自回归模型。
链接: https://arxiv.org/abs/2603.21342
作者: Oussama Zekri,Théo Uscidda,Nicolas Boullé,Anna Korba
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 37 pages, 6 figures, 13 tables
Abstract:We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: this https URL.
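Uniformization, which the forward process relies on, simulates a continuous-time Markov chain at time t by drawing a Poisson(λt) number of candidate jumps and applying the discrete kernel P = I + Q/λ that many times. A generic sketch, with an illustrative two-state chain rather than GDDS's actual corruption process:

```python
import math, random

def uniformization_sample(state, P, lam, t, rng):
    """Sample the CTMC state at time t via uniformization.

    P is the uniformized jump kernel P = I + Q/lam (rows sum to 1);
    lam must dominate every exit rate of the rate matrix Q."""
    # Draw k ~ Poisson(lam * t) by inversion (fine for small lam * t).
    mu = lam * t
    prob, k = math.exp(-mu), 0
    acc, u = prob, rng.random()
    while u > acc:
        k += 1
        prob *= mu / k
        acc += prob
    # Apply the discrete kernel k times.
    for _ in range(k):
        r, cum = rng.random(), 0.0
        for j, pj in enumerate(P[state]):
            cum += pj
            if r < cum:
                state = j
                break
    return state

rng = random.Random(0)
print(uniformization_sample(0, [[0.5, 0.5], [0.0, 1.0]], lam=2.0, t=1.0, rng=rng))
```

Because the jump count is sampled up front, arbitrary corruption times can be simulated without discretizing the interval, which is what makes the forward noising "fast" for any choice of Q.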
[NLP-134] SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing
【速读】: 该论文旨在解决长时音乐生成中因建模长程依赖关系以及音频表示的内存和计算资源消耗过大而导致的挑战。其解决方案的关键在于引入一种“时间加速与减速”策略:假设AI模型能够理解并生成加速(如2倍、4倍或8倍速)的音频,在加速域中先生成高帧率版本的音乐,从而显著缩短时序长度并降低资源需求;随后将生成结果恢复至原始速度,以重建完整的时序结构。这一方法本质上遵循从抽象到细节的分层生成原则,可无缝集成至现有音乐生成模型中,无需复杂修改即可实现高效、可扩展且高质量的长时音乐生成。
链接: https://arxiv.org/abs/2603.21073
作者: Jianyi Chen,Rongxiu Zhong,Shilei Zhang,Kun Qian,Jinglei Liu,Yike Guo,Wei Xue
机构: The Hong Kong University of Science and Technology; JIUTIAN Research of China Mobile; Beijing Institute of Technology; China Mobile (Hong Kong) Innovation Research Institute
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Under Review
Abstract:Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at this https URL.
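The speed-up/slow-down trick can be caricatured on a token sequence: decimate, generate in the short accelerated domain, then expand back to full length. Real audio representations and the diffusion refinement stage are far richer; this only shows the length bookkeeping the abstract describes.

```python
def speed_up(frames, factor):
    """Temporal speed-up by keeping every factor-th frame (a crude proxy
    for the accelerated-audio representation)."""
    return frames[::factor]

def slow_down(frames, factor):
    """Restore the original length by repeating each frame (a placeholder
    for the refinement stage that re-synthesizes detail)."""
    return [f for f in frames for _ in range(factor)]

song = list(range(8))
fast = speed_up(song, 4)    # [0, 4]: 4x fewer frames to model long-range structure over
print(slow_down(fast, 4))   # [0, 0, 0, 0, 4, 4, 4, 4]
```

The point is that generation cost scales with the accelerated length, so an 8x speed-up makes a song's global structure fit in a budget that the full-rate sequence would exceed.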
[NLP-135] GIP-RAG : An Evidence-Grounded Retrieval-Augmented Framework for Interpretable Gene Interaction and Pathway Impact Analysis
【速读】: 该论文旨在解决生物分子网络中基因间机制性关系难以整合与解释的问题,尤其在面对异构知识来源时,如何实现可解释的多步推理以揭示基因调控机制。其关键解决方案是提出GIP-RAG(Gene Interaction Prediction through Retrieval-Augmented Generation)框架,该框架通过整合KEGG、WikiPathways、SIGNOR、Pathway Commons和PubChem等数据库构建统一的基因相互作用知识图谱,并利用查询驱动模块检索相关子图,将其结构化为提示(prompt)输入大语言模型(LLM),从而引导分步推理过程,识别直接与间接调控关系并生成基于生物证据的机制解释;此外,还引入通路层面的功能影响模块,模拟基因扰动在信号网络中的传播及其对通路状态的影响,显著提升了复杂分子系统中机制推理的准确性与可解释性。
链接: https://arxiv.org/abs/2603.20321
作者: Fujian Jia,Jiwen Gu,Cheng Lu,Dezhi Zhao,Mengjiang Huang,Yuanzhi Lu,Xin Liu,Kang Liu
机构: 未知
类目: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages
Abstract:Understanding mechanistic relationships among genes and their impacts on biological pathways is essential for elucidating disease mechanisms and advancing precision medicine. Despite the availability of extensive molecular interaction and pathway data in public databases, integrating heterogeneous knowledge sources and enabling interpretable multi-step reasoning across biological networks remain challenging. We present GIP-RAG (Gene Interaction Prediction through Retrieval-Augmented Generation), a computational framework that combines biomedical knowledge graphs with large language models (LLMs) to infer and interpret gene interactions. The framework constructs a unified gene interaction knowledge graph by integrating curated data from KEGG, WikiPathways, SIGNOR, Pathway Commons, and PubChem. Given user-specified genes, a query-driven module retrieves relevant subgraphs, which are incorporated into structured prompts to guide LLM-based stepwise reasoning. This enables identification of direct and indirect regulatory relationships and generation of mechanistic explanations supported by biological evidence. Beyond pairwise interactions, GIP-RAG includes a pathway-level functional impact module that simulates propagation of gene perturbations through signaling networks and evaluates potential pathway state changes. Evaluation across diverse biological scenarios demonstrates that the framework generates consistent, interpretable, and evidence-supported insights into gene regulatory mechanisms. Overall, GIP-RAG provides a general and interpretable approach for integrating knowledge graphs with retrieval-augmented LLMs to support mechanistic reasoning in complex molecular systems. 
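The query-driven subgraph retrieval step can be sketched as a bounded BFS over an adjacency-list knowledge graph; the two-gene toy graph below is hand-written, not actual KEGG/SIGNOR data, and the traversal policy is an assumption.

```python
from collections import deque

def retrieve_subgraph(graph, seeds, max_hops=2):
    """BFS out to max_hops from the user-specified genes, returning the
    (gene, relation, neighbor) edges to place in the LLM prompt."""
    seen = set(seeds)
    edges = []
    frontier = deque((g, 0) for g in seeds)
    while frontier:
        gene, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr, rel in graph.get(gene, []):
            edges.append((gene, rel, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return edges

g = {"TP53": [("MDM2", "activates")], "MDM2": [("TP53", "degrades")]}
print(retrieve_subgraph(g, ["TP53"], max_hops=1))  # [('TP53', 'activates', 'MDM2')]
```

Raising max_hops pulls in the indirect regulatory paths (here the MDM2-to-TP53 feedback edge) that support multi-step mechanistic reasoning in the prompt.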
Submission history: [v1] Thu, 19 Mar 2026 23:36:26 UTC
信息检索
[IR-0] One Model Two Markets: Bid-Aware Generative Recommendation
【速读】:该论文旨在解决生成式推荐系统(Generative Recommender Systems)在实际商业场景中缺乏对广告收益(monetization)和竞价机制(bidding)支持的问题。现有方法仅关注语义检索能力,无法有效整合广告位投放策略与平台收入目标。解决方案的关键在于提出GEM-Rec框架,通过引入控制令牌(control tokens)将广告展示决策(whether to show an ad)与具体商品选择(which item to show)解耦,使模型能够从交互日志中直接学习有效的广告放置模式;同时设计了基于出价感知的解码机制(Bid-Aware Decoding),在推理阶段注入实时竞价信息,引导生成高价值内容,并保证分配单调性(allocation monotonicity),即更高出价弱提升广告被展示的概率,且无需重新训练模型。
链接: https://arxiv.org/abs/2603.22231
作者: Yanchen Jiang,Zhe Feng,Christopher P. Mah,Aranyak Mehta,Di Wang
机构: Harvard University (哈佛大学); Google Research (谷歌研究)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:
Abstract:Generative Recommender Systems using semantic ids, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad’s likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.
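Bid-aware decoding can be illustrated by shifting each candidate's logit with a monotone function of its bid before softmax; since the shift is increasing in the bid, allocation monotonicity follows. The log transform and the alpha weight below are illustrative assumptions, not GEM-Rec's exact parameterization.

```python
import math

def bid_aware_scores(logits, bids, alpha=1.0):
    """Inject bids at inference time: add alpha * log(bid) to each
    candidate's logit, then renormalize with softmax."""
    adjusted = [l + alpha * math.log(b) for l, b in zip(logits, bids)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]

p_low = bid_aware_scores([1.0, 1.0], [1.0, 1.0])
p_high = bid_aware_scores([1.0, 1.0], [1.0, 2.0])
print(p_high[1] > p_low[1])  # True: raising item 1's bid raises its probability
```

Because the adjustment happens purely at decoding, the semantic model needs no retraining when advertisers change their bids.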
[IR-1] PreferRec: Learning and Transferring Pareto Preferences for Multi-objective Re-ranking
【速读】:该论文旨在解决多目标重排序(multi-objective re-ranking)中因静态或手工设定偏好权重而导致的个性化不足问题,以及现有方法将每个用户的重排序任务视为孤立问题所引发的高计算成本和缺乏跨用户知识共享的问题。其解决方案的关键在于提出 PreferRec 框架,通过三个紧密耦合的组件实现:首先,Preference-Aware Pareto Learning 显式建模用户在意图层面的帕累托最优偏好(Pareto-optimal preferences),捕捉用户在不同上下文下对多个冲突目标(如准确性、多样性与公平性)的内在权衡;其次,Knowledge-Guided Transfer 利用同质多目标优化空间中的可迁移帕累托模式,实现高效跨用户知识蒸馏与偏好迁移,从而在保留用户特异性的同时引导优化过程聚焦于高质量帕累托前沿区域。
链接: https://arxiv.org/abs/2603.22073
作者: Wei Zhou,Wuyang Li,Junkai Ji,Xueliang Li,Wenjing Hong,Zexuan Zhu,Xing Tang,Xiuqiang He
机构: Shenzhen University(深圳大学); Shenzhen Technology University(深圳技术大学)
类目: Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Multi-objective re-ranking has become a critical component of modern multi-stage recommender systems, as it is tasked to balance multiple conflicting objectives such as accuracy, diversity, and fairness. Existing multi-objective re-ranking methods typically optimize aggregate objectives at the item level using static or handcrafted preference weights. This design overlooks that users inherently exhibit Pareto-optimal preferences at the intent level, reflecting personalized trade-offs among objectives rather than fixed weight combinations. Moreover, most approaches treat the re-ranking task for each user as an isolated problem, and repeatedly learn the preferences from scratch. Such a paradigm not only incurs high computational cost, but also ignores the fact that users often share similar preference trade-off structures across objectives. Inspired by the existence of homogeneous multi-objective optimization spaces where Pareto-optimal patterns are transferable, we propose PreferRec, a novel framework that explicitly models and transfers Pareto preferences across users. Specifically, PreferRec is built upon three tightly coupled components: Preference-Aware Pareto Learning aims to capture user intrinsic trade-offs among multiple conflicting objectives at the intent level. By learning Pareto preference representations from re-ranking populations, this component explicitly models how users prioritize different objectives under diverse contexts. Knowledge-Guided Transfer facilitates efficient cross-user knowledge transfer by distilling shared optimization patterns across homogeneous optimization spaces. The transferred knowledge is then used to guide solution selection and personalized re-ranking, biasing the optimization process toward high-quality regions of the Pareto front while preserving user-specific preference characteristics.
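The Pareto-optimal preferences the framework models rest on the standard dominance relation over objective vectors, e.g. (accuracy, diversity, fairness); the scores below are invented for illustration.

```python
def dominates(a, b):
    """Pareto dominance with higher-is-better objectives: a dominates b
    iff a is no worse in every objective and strictly better in at
    least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Objective vectors: (accuracy, diversity, fairness)
print(dominates((0.9, 0.5, 0.6), (0.8, 0.5, 0.6)))  # True
print(dominates((0.9, 0.4, 0.6), (0.8, 0.5, 0.6)))  # False: trades diversity for accuracy
```

The second case is exactly the situation where no fixed weight vector captures all users: two non-dominated rankings embody different intent-level trade-offs, and PreferRec's premise is that which one a user prefers is learnable and transferable.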
[IR-2] On the Challenges and Opportunities of Learned Sparse Retrieval for Code
【速读】:该论文旨在解决大规模代码库中检索效率与效果之间的矛盾问题,即如何在保持高检索精度的同时降低延迟并提升稀疏检索(Sparse Retrieval, SR)在代码场景下的适用性。现有方法主要依赖密集嵌入模型(Dense Embedding Models),而对学习型稀疏检索(Learned Sparse Retrieval, LSR)的研究仍处于空白状态。针对代码特有的挑战——如子词碎片化、自然语言查询与代码间的语义鸿沟、编程语言多样性及文档长度带来的稀疏性下降和延迟增加——作者提出SPLADE-Code,这是首个专为代码检索设计的大规模学习型稀疏检索模型家族(参数规模600M–8B)。其关键创新在于引入“学习扩展标记”(learned expansion tokens),有效弥合词汇匹配与语义匹配之间的差距,并通过轻量级单阶段训练流程实现卓越性能:在MTEB Code基准上,1B参数以下模型达到75.4的领先指标,8B模型亦达79.0,同时在百万级段落集合中实现亚毫秒级延迟,且性能损失极小。
链接: https://arxiv.org/abs/2603.22008
作者: Simon Lupart,Maxime Louis,Thibault Formal,Hervé Déjean,Stéphane Clinchant
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 15 pages, 5 figures, 12 tables
Abstract:Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
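Scoring in learned sparse retrieval is a dot product over sparse term-weight maps; learned expansion tokens let a code document carry weight on query terms it never contains verbatim. The weights below are made up for illustration:

```python
def sparse_score(query_vec, doc_vec):
    """Dot product between two sparse term-weight maps (term -> weight),
    the scoring function behind SPLADE-style learned sparse retrieval."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

query = {"sort": 1.2, "list": 0.8}
# The document only contains "quicksort", but learned expansion tokens
# such as "sort" and "order" bridge the lexical gap to the NL query.
doc = {"quicksort": 1.0, "sort": 0.9, "order": 0.4}
print(round(sparse_score(query, doc), 2))  # 1.08
```

Because the representation stays sparse, scoring reduces to inverted-index lookups, which is what enables the sub-millisecond retrieval on a 1M-passage collection reported above.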
[IR-3] ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval
【速读】:该论文旨在解决交互式文本到图像检索(Interactive Text-to-Image Retrieval, I-TIR)中多模态用户反馈融合方式不合理的问题。现有方法通常采用简单的嵌入加法进行跨模态融合,未能区分不同模态的可靠性,导致扩散模型生成的噪声被无差别引入,从而在高达55.62%的样本上造成性能下降。其解决方案的关键在于提出ADaFuSE(Adaptive Diffusion-Text Fusion with Semantic-aware Experts),一个轻量级融合模块,通过双分支机制实现动态校准:一是自适应门控分支,用于根据模态置信度动态调整融合权重;二是语义感知专家混合分支,以捕捉细粒度的跨模态关联。该设计无需修改主干编码器即可无缝集成至现有框架,显著提升检索效果与鲁棒性。
链接: https://arxiv.org/abs/2603.21886
作者: Zhuocheng Zhang,Xingwu Zhang,Kangheng Liang,Guanxuan Li,Richard Mccreadie,Zijun Long
机构: Hunan University (湖南大学); University of Glasgow (格拉斯哥大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
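The adaptive gating branch can be sketched as a sigmoid-weighted convex combination of the text view and the diffusion-generated view, in place of the fixed embedding addition criticized above. In ADaFuSE the gate logit would come from a small learned network; here it is just an input, and the embeddings are toy values.

```python
import math

def gated_fusion(text_emb, gen_emb, gate_logit):
    """Blend two modality views with a learned scalar gate g in (0, 1):
    fused = g * text + (1 - g) * generated."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * t + (1.0 - g) * v for t, v in zip(text_emb, gen_emb)]

# A very positive gate logit suppresses a noisy generated view.
print(gated_fusion([1.0, 0.0], [9.0, 9.0], gate_logit=10.0))
```

When the diffusion output is unreliable, driving g toward 1 recovers the pure text embedding, which is how gating avoids the performance degradation that static addition incurs on noisy samples.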
[IR-4] GoogleTrendArchive: A Year-Long Archive of Real-Time Web Search Trends Worldwide AAAI
【速读】:该论文旨在解决谷歌趋势(Google Trends)无法提供超过七天的历史数据访问问题,尤其是针对“Trending Now”这一实时搜索热点数据的缺失。其核心解决方案是构建了一个覆盖全球125个国家和1,358个地点、时间跨度达一年(2024年11月28日至2026年1月3日)的Google Trend Archive数据集,包含超过760万条趋势事件记录,每条记录包含趋势标识符、搜索量区间、精确时间戳、持续时间、地理位置及关联查询聚类等结构化信息。该数据集使得对信息扩散模式、跨文化注意力动态、危机响应机制以及全球范围内集体信息获取行为的时间演化进行系统性研究成为可能。
链接: https://arxiv.org/abs/2603.21871
作者: Aleksandra Urman,Anikó Hannák,Joachim Baumann
机构: 1. University of Vienna (维也纳大学); 2. University of Oxford (牛津大学)
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注: Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)
Abstract:GoogleTrendArchive is a comprehensive archive of Google Trending Now data spanning over one year (from November 28, 2024 to January 3, 2026) across 125 countries and 1,358 locations. Unlike Google Trends, which requires specifying search terms in advance, Trending Now captures search queries experiencing real-time surges, offering a way to inductively discover trending patterns across regions for studying collective attention dynamics. However, Google does not provide historical access to this data beyond seven days. Our dataset addresses this gap by presenting an archive of Trending Now data. The dataset contains over 7.6 million trend episodes. Each record includes the trend identifier, search volume bucket, precise timestamps, duration, geographic location, and related query clusters. This dataset, among other, enables systematic studies of information diffusion patterns, cross-cultural attention dynamics, crisis responses, and the temporal evolution of collective information-seeking at a global scale. The comprehensive geographic coverage facilitates fine-grained cross-country or cross-regional comparative analyses.
[IR-5] AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents
【速读】:该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的推荐代理(recommender agents)在中间推理过程与最终排序反馈之间存在脱节、且难以捕捉细粒度偏好表达的问题。其解决方案的关键在于提出一个以排序为导向的智能体推荐框架 AgenticRec,通过三个核心创新实现端到端优化:1)设计一套推荐专用工具并集成至 ReAct(Reasoning + Action)循环中,支持基于证据的推理;2)提出理论无偏的列表级组相对策略优化(list-wise Group Relative Policy Optimization, list-wise GRPO),以最大化排序效用并确保复杂工具使用路径中的信用分配准确;3)引入渐进式偏好精化(Progressive Preference Refinement, PPR)机制,通过挖掘排序违规产生的难负样本并实施双向偏好对齐,最小化成对排序误差的凸上界,从而有效缓解细粒度偏好模糊性。
链接: https://arxiv.org/abs/2603.21613
作者: Tianyi Li,Zixuan Wang,Guidong Lei,Xiaodong Li,Hui Li
机构: Xiamen University (厦门大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recommender agents built on Large Language Models offer a promising paradigm for recommendation. However, existing recommender agents typically suffer from a disconnect between intermediate reasoning and final ranking feedback, and are unable to capture fine-grained preferences. To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under sparse implicit feedback. Our approach makes three key contributions. First, we design a suite of recommendation-specific tools integrated into a ReAct loop to support evidence-grounded reasoning. Second, we propose theoretically unbiased List-Wise Group Relative Policy Optimization (list-wise GRPO) to maximize ranking utility, ensuring accurate credit assignment for complex tool-use trajectories. Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities. By mining hard negatives from ranking violations and applying bidirectional preference alignment, PPR minimizes the convex upper bound of pairwise ranking errors. Experiments on benchmarks confirm that AgenticRec significantly outperforms baselines, validating the necessity of unifying reasoning, tool use, and ranking optimization.
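The group-relative baseline underlying GRPO normalizes each trajectory's reward by the group mean and standard deviation. This is the standard GRPO form; the list-wise ranking reward and the unbiasedness argument are the paper's contribution and are not reproduced here.

```python
def group_relative_advantages(rewards):
    """Advantage of each trajectory in a sampled group: (r - mean) / std,
    with std replaced by 1 when all rewards are equal."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Trajectories with above-average ranking reward get positive advantage.
print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Credit assignment over the whole trajectory follows from applying these advantages to every token of the reasoning, tool-call, and ranking-list segments alike.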
[IR-6] Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生物医学等高风险领域应用中存在幻觉(hallucinations)或虚构信息的问题,这可能导致医疗问答、临床决策和生物医学研究评估中的严重错误。其解决方案的关键在于提升LLMs生成内容的可验证性,即通过强化模型对可验证来源的依赖能力,确保生成陈述能够基于可靠证据,从而减少不准确信息的产生。
链接: https://arxiv.org/abs/2603.21582
作者: Deepak Gupta,Dina Demner-Fushman,William Hersh,Steven Bedrick,Kirk Roberts
机构: National Library of Medicine, NIH; Oregon Health & Science University; UTHealth Houston
类目: Information Retrieval (cs.IR)
备注:
Abstract:Recent advances in large language models (LLMs) have made significant progress across multiple biomedical tasks, including biomedical question answering, lay-language summarization of the biomedical literature, and clinical note summarization. These models have demonstrated strong capabilities in processing and synthesizing complex biomedical information and in generating fluent, human-like responses. Despite these advancements, hallucinations or confabulations remain key challenges when using LLMs in biomedical and other high-stakes domains. Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research. Studies on the evaluation of the LLMs’ abilities to ground generated statements in verifiable sources have shown that models perform significantly
[IR-7] Toward a Theory of Hierarchical Memory for Language Agents
【速读】:该论文旨在解决长上下文(long-context)和代理系统(agentic systems)在处理大规模信息时因上下文长度限制而导致的效率与效果问题。现有方法普遍采用分层记忆机制,但缺乏统一的形式化框架来比较不同设计选择。论文提出一个统一理论,将此类系统分解为三个核心操作算子:提取(Extraction, α),用于将原始数据映射为原子信息单元;粗化(Coarsening, C = (π, ρ)),通过划分单元并为每组分配代表性表示(representative function ρ);以及遍历(Traversal, τ),根据查询和令牌预算选择纳入上下文的单元。关键创新在于识别出代表函数 ρ 的自足性谱(self-sufficiency spectrum),并揭示其如何约束可行的检索策略(即粗化与遍历之间的耦合关系)。通过在11个现有系统上实例化该分解,验证了该理论在文档层次、对话记忆和代理执行轨迹等场景下的通用性。
链接: https://arxiv.org/abs/2603.21564
作者: Yashar Talebirad,Ali Parsaee,Csongor Y. Szepesvari,Amirhossein Nadiri,Osmar Zaiane
机构: University of Alberta (阿尔伯塔大学); York University (约克大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Social and Information Networks (cs.SI)
备注:
Abstract:Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction (α) maps raw data to atomic information units; coarsening (C = (π, ρ)) partitions units and assigns a representative to each group; and traversal (τ) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function ρ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality.
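The three-operator decomposition (extraction α, coarsening C = (π, ρ), traversal τ) can be made concrete with a toy instantiation. This is only an illustrative sketch: sentences stand in for atomic units, the representative ρ is crudely the first unit of each group, and τ ranks by word overlap under a word-count budget; real systems substitute embeddings and learned summaries.

```python
def extract(raw_text):
    """alpha: map raw data to atomic units (here: sentences)."""
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def coarsen(units, group_size=2):
    """C = (pi, rho): partition units and assign a representative per group.
    rho here is a crude stand-in (the first unit of each group)."""
    groups = [units[i:i + group_size] for i in range(0, len(units), group_size)]
    return [(g[0], g) for g in groups]  # (representative, members)

def traverse(groups, query, budget):
    """tau: select units to include under a token budget, ranking groups by
    word overlap between the query and each group's representative."""
    q = set(query.lower().split())
    scored = sorted(groups, key=lambda g: -len(q & set(g[0].lower().split())))
    out, used = [], 0
    for rep, members in scored:
        for u in members:
            cost = len(u.split())
            if used + cost <= budget:
                out.append(u)
                used += cost
    return out

doc = "Cats sleep a lot. Cats chase mice. Dogs bark loudly. Dogs fetch balls"
units = extract(doc)
groups = coarsen(units)
context = traverse(groups, "what do cats chase", budget=8)
```

The coarsening-traversal coupling the paper identifies shows up even here: because τ only sees representatives, a ρ that drops query-relevant detail makes the relevant group unreachable.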
[IR-8] TagLLM: A Fine-Grained Tag Generation Approach for Note Recommendation
【速读】:该论文旨在解决生成式 AI 在电商社区推荐中对用户笔记(note)进行细粒度标签生成时面临的两大挑战:一是多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成过程中缺乏引导,导致标签冗余且难以准确捕捉用户兴趣;二是现有方法生成的标签粒度较粗,无法有效表征笔记的细粒度内容,从而干扰下游推荐效果。解决方案的关键在于提出 TagLLM,其核心包括三方面:首先构建用户兴趣手册(User Interest Handbook)以跨类别捕捉用户偏好;其次利用多模态思维链提取(multimodal CoT Extraction)构建细粒度标签数据;最后引入标签知识蒸馏(Tag Knowledge Distillation)方法,使轻量模型具备接近大模型的生成能力,显著提升推理效率。该方案在真实线上 A/B 测试中验证了有效性,显著提升了用户观看时长、互动次数及冷启动场景下的点击率。
链接: https://arxiv.org/abs/2603.21481
作者: Zhijian Chen,Likai Wang,Lei Chen,Yaguang Dou,Jialiang Shi,Tian Qi,Dongdong Hao,Mengying Lu,Cheng Ye,Chao Wei
机构: Tongji University (同济大学); Shanghai Dewu Information Group Co. Ltd. (上海得物信息集团有限公司); Tsinghua University (清华大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large Language Models (LLMs) have shown promising potential in E-commerce community recommendation. While LLMs and Multimodal LLMs (MLLMs) are widely used to encode notes into implicit embeddings, leveraging their generative capabilities to represent notes with interpretable tags remains unexplored. In the field of tag generation, traditional close-ended methods heavily rely on the design of tag pools, while existing open-ended methods applied directly to note recommendations face two limitations: (1) MLLMs lack guidance during generation, resulting in redundant tags that fail to capture user interests; (2) The generated tags are often coarse and lack fine-grained representation of notes, interfering with downstream recommendations. To address these limitations, we propose TagLLM, a fine-grained tag generation method for note recommendation. TagLLM captures user interests across note categories through a User Interest Handbook and constructs fine-grained tag data using multimodal CoT Extraction. A Tag Knowledge Distillation method is developed to equip small models with competitive generation capabilities, enhancing inference efficiency. In online A/B test, TagLLM increases average view duration per user by 0.31%, average interactions per user by 0.96%, and page view click-through rate in cold-start scenario by 32.37%, demonstrating its effectiveness.
[IR-9] When Documents Disagree: Measuring Institutional Variation in Transplant Guidance with Retrieval-Augmented Language Models
【速读】:该论文旨在解决美国实体器官移植中心患者教育材料在内容一致性上的显著异质性问题,当前缺乏系统方法对这种差异进行规模化量化。其解决方案的关键在于构建一个基于检索增强语言模型(retrieval-augmented language models)的框架,将同一患者问题映射到不同中心的手册中,并利用五标签一致性分类体系比较生成答案的一致性,从而从问题、主题、器官和中心四个维度量化异质性水平。该方法揭示了20.8%的非空配对存在临床意义的分歧,且覆盖缺口突出(如生殖健康相关内容缺失率达95.1%),并识别出机构层面的稳定差异模式,为优化移植患者教育材料提供了数据驱动的改进路径。
链接: https://arxiv.org/abs/2603.21460
作者: Yubo Li,Ramayya Krishnan,Rema Padman
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Patient education materials for solid-organ transplantation vary substantially across U.S. centers, yet no systematic method exists to quantify this heterogeneity at scale. We introduce a framework that grounds the same patient questions in different centers’ handbooks using retrieval-augmented language models and compares the resulting answers using a five-label consistency taxonomy. Applied to 102 handbooks from 23 centers and 1,115 benchmark questions, the framework quantifies heterogeneity across four dimensions: question, topic, organ, and center. We find that 20.8% of non-absent pairwise comparisons exhibit clinically meaningful divergence, concentrated in condition monitoring and lifestyle topics. Coverage gaps are even more prominent: 96.2% of question-handbook pairs miss relevant content, with reproductive health at 95.1% absence. Center-level divergence profiles are stable and interpretable, where heterogeneity reflects systematic institutional differences, likely due to patient diversity. These findings expose an information gap in transplant patient education materials, with document-grounded medical question answering highlighting opportunities for content improvement.
[IR-10] Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval
【速读】:该论文旨在解决基于Transformer的文本嵌入模型在使用池化(pooling)操作将变长文本映射为固定维度向量时所引发的几何病态问题,如各向异性(anisotropy)和长度诱导的嵌入坍缩(length-induced embedding collapse),这些问题会显著损害下游检索性能。其解决方案的关键在于提出并验证“语义漂移”(semantic shift)作为解释和预测嵌入坍缩的核心机制:语义漂移量化了文本内部局部语义演化与全局语义分散的程度,理论分析表明,随着组成句子间语义多样性增加,池化后的嵌入必然偏离任一单句嵌入,导致表征平滑化与区分度下降;通过多语料库和多种嵌入模型的受控实验,作者证明语义漂移能有效预测嵌入集中程度与检索性能退化,而文本长度本身则不具备此类预测能力,从而为理解嵌入坍缩提供了统一且可操作的因果视角。
链接: https://arxiv.org/abs/2603.21437
作者: Hang Gao,Dimitris N. Metaxas
机构: Rutgers University (罗格斯大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
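The smoothing argument is easy to verify numerically: the more the constituent sentence embeddings disagree, the farther the mean-pooled vector sits from each of them. Below is a minimal dispersion proxy (average cosine distance to the pooled vector); the paper's actual measure also integrates local semantic evolution, which this sketch omits.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def semantic_shift(vectors):
    """Global-dispersion proxy: average cosine distance between each
    sentence embedding and the pooled document embedding."""
    pooled = mean_pool(vectors)
    return sum(1 - cosine(v, pooled) for v in vectors) / len(vectors)

homogeneous = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]  # similar sentences
diverse = [[1.0, 0.0], [0.0, 1.0], [-0.7, 0.7]]     # scattered semantics
shift_h = semantic_shift(homogeneous)
shift_d = semantic_shift(diverse)
```

With toy 2-d embeddings, the semantically diverse document shows a much larger shift than the homogeneous one, matching the claim that the pooled vector of a diverse document is less discriminative.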
[IR-11] COINBench: Moving Beyond Individual Perspectives to Collective Intent Understanding
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在理解人类集体意图(Collective Intent)方面的局限性问题,即如何从多源公开讨论中提取共识、化解矛盾并推断潜在趋势,从而实现对复杂社会语境下群体认知的深度解析。其解决方案的关键在于提出COIN-BENCH这一动态、实时更新的真实世界基准测试集,并构建包含COIN-TREE(用于层次化认知结构建模)与检索增强验证(COIN-RAG)的评估框架,结合规则方法与LLM-as-the-Judge的判别机制,系统性地衡量模型在深度、广度、信息量和正确性四个维度的表现,从而推动LLMs从被动指令执行者向具备专家级分析能力的集体意图解读者演进。
链接: https://arxiv.org/abs/2603.21329
作者: Xiaozhe Li,Tianyi Lyu,Siyi Yang,Yizhao Yang,Yuxi Gong,Jinxuan Huang,Ligao Zhang,Zhuoyi Huang,Qingwen Liu
机构: Tongji University; Stanford University; CurrentsAI Research
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill Collective Intent - the process of extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discussions - remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, real-world, live-updating benchmark specifically designed to evaluate LLMs on collective intent understanding within the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement a robust evaluation pipeline that combines a rule-based method with an LLM-as-the-Judge approach. This framework incorporates COIN-TREE for hierarchical cognitive structuring and retrieval-augmented verification (COIN-RAG) to ensure expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions - depth, breadth, informativeness, and correctness - reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH establishes a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.
[IR-12] Graph Fusion Across Languages using Large Language Models
【速读】:该论文旨在解决跨语言知识图谱(Knowledge Graph, KG)融合问题,即在不同语言环境下由于语义异构性和图结构复杂性导致的知识整合难题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的上下文推理能力和多语言语义先验,通过将三元组直接映射为自然语言序列(如 [头实体] [关系] [尾实体])实现结构线性化,从而让LLM能够识别并对齐跨语言关系与实体,逐步融合新图谱(G_t)与已融合图谱(G_c^(t−1))。实验表明,该方法可作为通用语义桥梁有效处理跨语言差异,支持多源、多语言环境中知识的连续合成与扩展。
链接: https://arxiv.org/abs/2603.21248
作者: Kaung Myat Kyaw,Khush Agarwal,Jonathan Chan
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph G_c^(t-1) and a new candidate graph G_t. Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
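The structural linearization step is mechanical and easy to sketch; the hard part (reconciling entities and relations across languages) is what the LLM does in the paper. In the toy below, a hand-written alias table stands in for that reconciliation, purely for illustration:

```python
def linearize(triplets):
    """Map (head, relation, tail) triplets to the natural-language
    sequences described in the abstract: "[head] [relation] [tail]"."""
    return [f"[{h}] [{r}] [{t}]" for h, r, t in triplets]

def fuse(fused, candidate, alias=None):
    """Toy stand-in for LLM-based entity reconciliation: map candidate
    entities through an alias table (e.g. cross-lingual entity links),
    then union the triplets into the evolving fused graph."""
    alias = alias or {}
    norm = lambda e: alias.get(e, e)
    merged = set(fused)
    for h, r, t in candidate:
        merged.add((norm(h), r, norm(t)))
    return merged

g_en = {("Paris", "capital_of", "France")}                         # G_c^(t-1)
g_fr = [("Paris", "capitale_de", "France"),
        ("Lyon", "située_en", "France")]                           # G_t
fused = fuse(g_en, g_fr)  # relation alignment left to the LLM in the real framework
```

Note the fused graph still holds both "capital_of" and "capitale_de": deciding that these are the same relation is exactly the cross-lingual alignment the framework delegates to the LLM's semantic priors.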
[IR-13] LSA: A Long-Short-term Aspect Interest Transformer for Aspect-Based Recommendation
【速读】:该论文旨在解决现有基于方面(Aspect-based)推荐方法在建模用户偏好时忽视兴趣动态性的问题,即用户可能在特定时期内临时关注此前较少关注的方面,导致难以准确为每次用户-物品交互分配方面权重。解决方案的关键在于提出一种长短期方面兴趣 Transformer(LSA),通过同时建模短时兴趣(捕捉近期交互方面重要性的时变特性)与长时兴趣(考虑全局行为模式,包括未近期交互的方面),并融合两者以评估用户与物品共同邻域中各方面的权重,从而实现更精准的方面加权机制。
链接: https://arxiv.org/abs/2603.21243
作者: Le Liu,Junrui Liu,Yunhan Gao,Ziheng Wang,Tong Li
机构: 未知
类目: Information Retrieval (cs.IR)
备注: WISE2025
Abstract:Aspect-based recommendation methods extract aspect terms from reviews, such as price, to model fine-grained user preferences on items, making them a critical approach in personalized recommender systems. Existing methods utilize graphs to represent the relationships among users, items, and aspect terms, modeling user preferences based on graph neural networks. However, they overlook the dynamic nature of user interests - users may temporarily focus on aspects they previously paid little attention to - making it difficult to assign accurate weights to aspect terms for each user-item interaction. In this paper, we propose a long-short-term aspect interest Transformer (LSA) for aspect-based recommendation, which effectively captures the dynamic nature of user preferences by integrating both long-term and short-term aspect interests. Specifically, the short-term interests model the temporal changes in the importance of recently interacted aspect terms, while the long-term interests consider global behavioral patterns, including aspects that users have not interacted with recently. Finally, LSA combines long- and short-term interests to evaluate the importance of aspects within the union of user and item aspect neighbors, thereby accurately assigning aspect weights for each user-item interaction. Experiments conducted on four real-world datasets demonstrate that LSA improves MSE by 2.55% on average over the best baseline.
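The final weighting step (blend long- and short-term interest over the union of user and item aspect neighbors) can be sketched as a simple gated mixture. The fixed gate and the hand-set interest scores below are assumptions for illustration; LSA learns these via its Transformer components.

```python
def aspect_weights(short_term, long_term, user_aspects, item_aspects, gate=0.5):
    """Blend short- and long-term aspect interest scores over the union of
    user and item aspect neighbors, then normalize to a weight distribution.
    `gate` is a hypothetical fixed mixing weight."""
    union = set(user_aspects) | set(item_aspects)
    raw = {a: gate * short_term.get(a, 0.0) + (1 - gate) * long_term.get(a, 0.0)
           for a in union}
    z = sum(raw.values()) or 1.0
    return {a: w / z for a, w in raw.items()}

short = {"price": 0.9, "battery": 0.1}                # recently salient aspects
long = {"price": 0.3, "design": 0.6, "battery": 0.4}  # global behavioral pattern
weights = aspect_weights(short, long, {"price", "design"}, {"battery"})
```

"price" dominates because it is both globally relevant and temporarily salient, while "design" survives only through the long-term term, which is the intended effect of combining the two interest types.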
[IR-14] MI-DPG: Decomposable Parameter Generation Network Based on Mutual Information for Multi-Scenario Recommendation CIKM2023
【速读】:该论文旨在解决多场景点击转化率(CVR)预测中如何在低模型参数成本下提升跨场景预测性能,并增强模型对多场景差异的鲁棒建模能力。现有方法难以在保持参数效率的同时有效捕捉不同场景间的多样性,导致性能受限。其解决方案的关键在于提出MI-DPG框架,通过引入一个辅助网络生成场景条件动态权重矩阵,该矩阵由分解后的场景专属与共享低秩矩阵组合而成,从而以高效方式实现参数空间的动态调制;同时设计互信息正则化项,最大化场景感知输入与动态权重矩阵之间的互信息,以增强各场景间模型参数的多样性,从而显著提升多场景CVR预测效果。
链接: https://arxiv.org/abs/2603.21209
作者: Wenzhuo Cheng,Ke Ding,Xin Dong,Yong He,Liang Zhang,Linjian Mo
机构: Ant Group(蚂蚁集团)
类目: Information Retrieval (cs.IR)
备注: Accepted by CIKM 2023
Abstract:Conversion rate (CVR) prediction models play a vital role in recommendation and advertising systems. Recent research on multi-scenario recommendation shows that learning a unified model to serve multiple scenarios is effective for improving overall performance. However, it remains challenging to improve model prediction performance across scenarios at low model parameter cost, and current solutions are hard to robustly model multi-scenario diversity. In this paper, we propose MI-DPG for the multi-scenario CVR prediction, which learns scenario-conditioned dynamic model parameters for each scenario in a more efficient and effective manner. Specifically, we introduce an auxiliary network to generate scenario-conditioned dynamic weighting matrices, which are obtained by combining decomposed scenario-specific and scenario-shared low-rank matrices with parameter efficiency. For each scene, weighting the backbone model parameters by the weighting matrix helps to specialize the model parameters for different scenarios. It can not only modulate the complete parameter space of the backbone model but also improve the model effectiveness. Furthermore, we design a mutual information regularization to enhance the diversity of model parameters across different scenarios by maximizing the mutual information between the scenario-aware input and the scene-conditioned dynamic weighting matrix. Experiments from three real-world datasets show that MI-DPG significantly outperforms previous multi-scenario recommendation models.
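The parameter-efficiency claim rests on building each scenario's dense weighting matrix from low-rank factors (scenario-specific plus shared) and using it to modulate the backbone elementwise. A minimal sketch, with tiny hand-set rank-1 factors standing in for the auxiliary network's outputs:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def dynamic_weight_matrix(U_s, V_s, U_shared, V_shared):
    """Combine a scenario-specific and a shared low-rank factorization into
    one dense weighting matrix; each factor has rank << full dimension, so
    per-scenario parameter cost stays low."""
    spec = matmul(U_s, V_s)
    shared = matmul(U_shared, V_shared)
    return [[spec[i][j] + shared[i][j] for j in range(len(spec[0]))]
            for i in range(len(spec))]

def modulate(backbone, weighting):
    """Elementwise modulation specializes the full backbone parameter space
    for the current scenario."""
    return [[backbone[i][j] * weighting[i][j] for j in range(len(backbone[0]))]
            for i in range(len(backbone))]

d = 4  # toy dimension; rank-1 factors
U_s = [[1.0]] * d; V_s = [[0.5] * d]
U_sh = [[1.0]] * d; V_sh = [[0.5] * d]
W = dynamic_weight_matrix(U_s, V_s, U_sh, V_sh)  # dense 4x4 from 2x(4+4) params
backbone = [[2.0] * d for _ in range(d)]
adapted = modulate(backbone, W)
```

A rank-r factorization costs 2·d·r parameters per scenario instead of d², which is the efficiency argument; the mutual-information regularizer then pushes the per-scenario W matrices apart.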
[IR-15] Ontology-Compliant Knowledge Graphs
【速读】:该论文旨在解决知识图谱(Knowledge Graph, KG)在构建过程中缺乏结构化语义规范的问题,从而导致不同知识图谱之间难以实现自动匹配、对齐与协同。其核心挑战在于如何确保知识图谱内部逻辑一致性以及外部与其他本体(Ontology)的兼容性。解决方案的关键在于提出了一种基于模式(pattern-based)的本体合规方法,结合新颖的术语匹配算法和合规度量指标,以实现知识图谱的内部与外部本体合规性,并通过建筑行业的案例验证了该方法的有效性。
链接: https://arxiv.org/abs/2603.21188
作者: Zhangcheng Qiang
机构: 未知
类目: Information Retrieval (cs.IR)
备注: 12 pages
Abstract:Ontologies can act as a schema for constructing knowledge graphs (KGs), offering explainability, interoperability, and reusability. We explore ontology-compliant KGs, aiming to build both internal and external ontology compliance. We discuss key tasks in ontology compliance and introduce our novel term-matching algorithms. We also propose a pattern-based compliance approach and novel compliance metrics. The building sector is a case study to test the validity of ontology-compliant KGs. We recommend using ontology-compliant KGs to pursue automatic matching, alignment, and harmonisation of heterogeneous KGs.
[IR-16] Ontology-driven personalized information retrieval for XML documents
【速读】:该论文旨在解决半结构化eXtensible Markup Language (XML)文档信息检索中存在的个性化不足问题,即传统信息检索系统(IRS)对不同用户返回相同结果,忽视了用户的知识背景、偏好和目标差异。解决方案的关键在于将外部语义资源——领域本体(ontology)和用户画像(user profile)引入检索过程,通过将文档、查询和用户画像表示为加权概念向量,利用本体的概念权重机制突出低层节点的细粒度语义信息,并基于语义相似性度量实现用户、查询与文档之间的个性化匹配,从而提升检索效果的精准性和适应性。
链接: https://arxiv.org/abs/2603.21139
作者: Ounnaci Iddir,Ahmed-ouamer Rachid,Tai Dinh
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:This paper addresses the challenge of improving information retrieval from semi-structured eXtensible Markup Language (XML) documents. Traditional information retrieval systems (IRS) often overlook user-specific needs and return identical results for the same query, despite differences in users’ knowledge, preferences, and objectives. We integrate external semantic resources, namely a domain ontology and user profiles, into the retrieval process. Documents, queries, and user profiles are represented as vectors of weighted concepts. The ontology applies a concept-weighting mechanism that emphasizes highly specific concepts, as lower-level nodes in the hierarchy provide more precise and targeted information. Relevance is assessed using semantic similarity measures that capture conceptual relationships beyond keyword matching, enabling personalized and fine-grained matching among user profiles, queries, and documents. Experimental results show that combining ontologies with user profiles improves retrieval effectiveness, achieving higher precision and recall than keyword-based approaches. Overall, the proposed framework enhances the relevance and adaptability of XML search results, supporting more user-centered retrieval.
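The concept-weighting idea (lower, more specific ontology nodes carry more weight) pairs naturally with a weighted vector similarity. The depth-proportional formula below is an assumption used only to illustrate the principle; the paper's exact weighting scheme may differ.

```python
import math

def concept_weight(depth, max_depth):
    """Deeper (more specific) concepts get higher weight; this linear
    formula is a hypothetical stand-in for the paper's mechanism."""
    return depth / max_depth

def weighted_similarity(doc, query, weights):
    """Cosine similarity over concept vectors scaled by ontology weights,
    matching on concepts rather than raw keywords."""
    dot = sum(weights.get(c, 1.0) ** 2 * doc.get(c, 0.0) * query.get(c, 0.0)
              for c in set(doc) | set(query))
    nd = math.sqrt(sum((weights.get(c, 1.0) * v) ** 2 for c, v in doc.items()))
    nq = math.sqrt(sum((weights.get(c, 1.0) * v) ** 2 for c, v in query.items()))
    return dot / (nd * nq) if nd and nq else 0.0

# Hypothetical ontology: "animal" at depth 1, "dog" at depth 3 (max depth 3).
weights = {"animal": concept_weight(1, 3), "dog": concept_weight(3, 3)}
query = {"dog": 1.0}
sim_specific = weighted_similarity({"dog": 1.0}, query, weights)
sim_generic = weighted_similarity({"animal": 1.0}, query, weights)
```

A document annotated with the specific concept matches the query perfectly, while one annotated only with the generic ancestor does not; a real system would also propagate partial credit along hierarchy edges.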
[IR-17] Query Decompose Compress: Structured Query Expansion for Efficient Multi-Hop Retrieval CIKM2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂多跳检索任务中因生成特性引入无关或噪声信息而导致性能下降的问题。其解决方案的关键在于提出DeCoR(Decompose and Compress for Retrieval)框架,通过结构化信息精炼策略重构查询的推理过程,并从检索到的文档中提炼出支持性证据:具体包括两个核心组件——查询分解(Query Decomposition),将复杂查询显式拆解为多个推理步骤;以及查询感知的文档压缩(Query-aware Document Compression),将候选文档中的分散证据浓缩为与查询高度相关的简洁摘要。该设计确保最终查询表示兼具鲁棒性与全面性,且实验表明,即使使用较小规模的LLM,DeCoR仍优于依赖更大模型的强基线方法,验证了在复杂检索场景中,合理利用LLM的推理与摘要能力比单纯依赖其生成能力更具效率和有效性。
链接: https://arxiv.org/abs/2603.21024
作者: JungMin Yun,YoungBin Kim
机构: Chung-Ang University (中央大学)
类目: Information Retrieval (cs.IR)
备注: Accepted to CIKM 2025
Abstract:Large Language Models (LLMs) have been increasingly employed for query expansion. However, their generative nature often undermines performance on complex multi-hop retrieval tasks by introducing irrelevant or noisy information. To address this challenge, we propose DeCoR (Decompose and Compress for Retrieval), a framework grounded in structured information refinement. Rather than generating additional content, DeCoR strategically restructures the query’s underlying reasoning process and distills supporting evidence from retrieved documents. It consists of two core components tailored to the challenges of multi-hop retrieval: (1) Query Decomposition, which decomposes a complex query into explicit reasoning steps, and (2) Query-aware Document Compression, which synthesizes dispersed evidence from candidate documents into a concise summary relevant to the query. This structured design ensures that the final query representation remains both robust and comprehensive. Experimental results demonstrate that, despite utilizing a relatively small LLM, DeCoR outperforms strong baselines that rely on larger models. This finding underscores that, in complex retrieval scenarios, sophisticatedly leveraging the reasoning and summarization capabilities of LLMs offers a more efficient and effective solution than relying solely on their generative capability.
[IR-18] DSL-R1: From SQL to DSL for Training Retrieval Agents across Structured and Unstructured Data with Reinforcement Learning
【速读】:该论文旨在解决复杂领域中结构化元数据与非结构化内容之间检索能力割裂的问题,现有系统通常采用符号过滤或向量相似性单独处理,未能有效融合两者的优势。解决方案的关键在于提出DSL-R1框架,通过一种新颖的领域特定语言(Domain-Specific Language, DSL)将逻辑推理与语义匹配统一建模:在SQL风格的操作符中嵌入向量原语,从而协同利用符号推理的精确性和语义匹配的覆盖范围;同时引入强化学习机制,以规则执行反馈和检索质量奖励联合优化DSL生成过程,实现结构正确性与语义对齐性的平衡。
链接: https://arxiv.org/abs/2603.21018
作者: Yunhai Hu,Junwei Zhou,Yumo Cao,Yitao Long,Yiwei Xu,Qiyi Jiang,Weiyao Wang,Xiaoyu Cao,Zhen Sun,Yiran Zou,Nan Du
机构: New York University (纽约大学); Matter Innovation Inc. (物质创新公司); Thin Red Line (红线科技)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:
Abstract:Effective retrieval in complex domains requires bridging the gap between structured metadata and unstructured content. Existing systems typically isolate these capabilities, relying on either symbolic filtering or vector similarity, failing to capture their interplay. In this work, we propose DSL-R1, a unified framework that synergizes logical reasoning with semantic matching via a novel Domain-Specific Language (DSL). By embedding vector primitives within SQL-style operators, our approach leverages the complementary strengths of symbolic precision and semantic coverage. We further introduce a reinforcement learning mechanism where rule-based execution feedback and retrieval quality rewards jointly optimize the DSL generation, balancing structural correctness and semantic alignment. Evaluations on a large-scale industrial email benchmark demonstrate that DSL-R1 achieves a +12.3% improvement in Hit@1/3, consistently outperforming decoupled baselines and establishing a robust paradigm for hybrid retrieval.
[IR-19] Consensus-Driven Group Recommendation on Sparse Explicit Feedback: A Collaborative Filtering and Choquet-Borda Aggregation Framework
【速读】:该论文旨在解决在仅有稀疏的用户-物品评分数据且缺乏人口统计、上下文或群体层面信息的情况下,如何实现稳定、公平且具有共识性的群体推荐问题。其核心解决方案是提出一种基于共识驱动的混合群体推荐框架,关键在于引入复合相似度度量CBS(Combined Similarity),融合几何结构感知相似度与不确定性感知相似度,从而提升缺失评分估计的稳定性并支持共识导向的邻居构建;同时通过Borda Count机制优化候选物品生成以缓解评分分布偏斜,并利用Choquet积分计算群体评分,灵活建模不同用户的影响力权重,在保障公平性的同时促进群体共识形成。
链接: https://arxiv.org/abs/2603.21012
作者: Anh Nguyen Van,Huy Ngo Hoang,Khoi Ngo Nguyen,Ngoc Pham Thi,Khanh Ngo Mai Bao,Quyen Nguyen Van
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Preprint. Under review for journal publication
Abstract:Group Recommender Systems (GRS) play an essential role in supporting collective decision-making among users with diverse and potentially conflicting preferences. However, achieving stable intra-group consensus becomes particularly challenging when only sparse userID-itemID-rating data are available and no demographic, contextual, or group-level information exists. This paper proposes a consensus-driven hybrid group recommendation framework that integrates neighborhood-based collaborative filtering with fuzzy aggregation to support agreement, fairness, and robustness under sparsity. A composite similarity measure, CBS (Combined Similarity), is derived from two enhanced similarity metrics introduced in prior work: a geometry-based measure that captures rating-pattern structure, and an uncertainty-aware measure that models belief, evidence, and disagreement in sparse co-rating contexts. This combination provides more stable estimation of missing ratings and supports consensus-oriented neighborhood construction. Candidate items are generated by merging per-user top-N predictions and further enriched using the Borda Count mechanism to mitigate skewed rating distributions and reinforce group-level agreement. Final group ratings are computed using the Choquet integral, which flexibly captures heterogeneous user influence while preserving fairness and supporting consensus formation. Experimental results on real-world datasets with different rating distributions show that the proposed method improves group-level consensus, satisfaction, and fairness, while maintaining a balanced level of novelty. Although the model does not rely on social information, its evaluation using trust-aware novelty measures indicates stable behavior in socially structured environments.
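Two aggregation steps in this pipeline are standard enough to sketch directly: Borda Count over per-user top-N lists, and the discrete Choquet integral over member ratings. The capacity values in the example are made up for illustration; in the paper they encode heterogeneous user influence.

```python
def borda_count(rankings):
    """Borda aggregation: the item in position i of a length-n list scores
    n - i; summing across users yields a consensus ordering."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for i, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - i)
    return sorted(scores, key=lambda it: -scores[it])

def choquet(ratings, mu):
    """Discrete Choquet integral: sort members by rating ascending and weight
    each successive increment by the capacity mu of the remaining coalition."""
    users = sorted(ratings, key=ratings.get)
    total, prev = 0.0, 0.0
    for i, u in enumerate(users):
        total += (ratings[u] - prev) * mu[frozenset(users[i:])]
        prev = ratings[u]
    return total

# Three members' top-3 lists merged into a consensus candidate order.
user_top3 = [["m1", "m2", "m3"], ["m2", "m1", "m4"], ["m2", "m3", "m1"]]
consensus = borda_count(user_top3)

# Hypothetical capacity: the pair counts fully, user b alone counts 0.6.
mu = {frozenset({"a", "b"}): 1.0, frozenset({"b"}): 0.6}
group_rating = choquet({"a": 2.0, "b": 4.0}, mu)
```

Because the capacity is defined on coalitions rather than individuals, the Choquet integral can model interactions (e.g. discounting a lone dissenting member) that a weighted average cannot.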
[IR-20] ECI: Effective Contrastive Information to Evaluate Hard-Negatives
【速读】:该论文旨在解决密集检索模型训练中有效难负样本(Hard Negatives)识别成本高昂的问题。传统方法依赖大量重复微调与负采样策略的消融实验,导致计算开销巨大。其解决方案的关键在于提出一种基于信息论和信息检索原理的理论性度量指标——有效对比信息(Effective Contrastive Information, ECI),该指标通过优化信息容量(Information Capacity)与判别效率(Discriminative Efficiency)之间的权衡来评估负样本质量,其中判别效率由信号强度(Hardness)与安全边界(Max-Margin)的调和平衡决定。ECI能够提前预测下游检索性能,显著减少对昂贵端到端消融实验的依赖,并识别出混合策略(如BM25+交叉编码器)在数量与可靠性上的最优平衡。
链接: https://arxiv.org/abs/2603.20990
作者: Aarush Sinha,Rahul Seetharaman,Aman Bansal
机构: University of Copenhagen (哥本哈根大学); Independent Researcher (独立研究员)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Hard negatives play a critical role in training and fine-tuning dense retrieval models, as they are semantically similar to positive documents yet non-relevant, and correctly distinguishing them is essential for improving retrieval accuracy. However, identifying effective hard negatives typically requires extensive ablation studies involving repeated fine-tuning with different negative sampling strategies and hyperparameters, resulting in substantial computational cost. In this paper, we introduce ECI: Effective Contrastive Information, a metric grounded in Information Theory and Information Retrieval principles that enables practitioners to assess the quality of hard negatives prior to model fine-tuning. ECI evaluates negatives by optimizing the trade-off between Information Capacity (the logarithmic bound on mutual information determined by set size) and Discriminative Efficiency, a harmonic balance of Signal Magnitude (Hardness) and Safety (Max-Margin). Unlike heuristic approaches, ECI strictly penalizes unsafe, false-positive negatives prevalent in generative methods. We evaluate ECI across hard-negative sets mined or generated using BM25, cross-encoders, and large language models. Our results demonstrate that ECI accurately predicts downstream retrieval performance, identifying that hybrid strategies (BM25+Cross-Encoder) offer the optimal balance of volume and reliability, significantly reducing the need for costly end-to-end ablation studies.
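The abstract names the ingredients but not the exact formula, so the sketch below is only an ECI-style illustration of the stated trade-off: a logarithmic capacity term from set size, times a harmonic balance of hardness and max-margin safety, with unsafe negatives (scoring above the positive) penalized by exclusion.

```python
import math

def harmonic(a, b):
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def eci(pos_score, neg_scores):
    """Illustrative ECI-style score (the paper's exact formula is not given
    in the abstract): capacity log2(1 + n) over the safe negatives, times a
    harmonic balance of mean hardness and mean margin; negatives scoring at
    or above the positive are treated as unsafe false positives and dropped."""
    safe = [s for s in neg_scores if s < pos_score]
    if not safe:
        return 0.0
    capacity = math.log2(1 + len(safe))
    hardness = sum(safe) / len(safe)                       # closer to positive = harder
    margin = sum(pos_score - s for s in safe) / len(safe)  # distance below positive
    return capacity * harmonic(hardness, margin)

easy = eci(0.9, [0.1, 0.1, 0.1])       # trivially distinguishable negatives
hard = eci(0.9, [0.6, 0.5, 0.55])      # hard but safe negatives
unsafe = eci(0.9, [0.95, 0.97])        # likely false positives
```

Under this toy scoring, hard-but-safe sets dominate both trivially easy and unsafe ones, mirroring the paper's finding that hybrid mining strategies balance volume and reliability best.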
[IR-21] User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在作为个人助手时缺乏持久用户模型的问题,即用户需在每次会话中重复表达偏好,导致交互效率低下。其解决方案的关键在于提出一种与流水线无关、基于冻结主干模型(frozen-backbone)的框架——向量适配检索评分(Vector-Adapted Retrieval Scoring, VARS),通过在共享偏好空间中维护每个用户的长期和短期向量来表征偏好,并利用这些向量对结构化偏好记忆进行检索评分偏置。该方法通过在线更新来自用户弱标量奖励的向量实现个性化,无需针对每个用户进行微调,从而显著提升交互效率并降低用户努力程度,同时保持任务成功率与强基线相当。
链接: https://arxiv.org/abs/2603.20939
作者: Yuren Hao,Shuhaib Mehri,ChengXiang Zhai,Dilek Hakkani-Tür
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (stat.ML)
备注: 21 pages including appendices
Abstract:Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users’ feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at this https URL.
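The core loop (bias retrieval scores by a user vector, then update that vector online from a weak scalar reward) fits in a few lines. This is a minimal single-vector sketch, not the dual long/short-term design, and the additive bias and learning rate are assumptions:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def biased_scores(base_scores, memory_vecs, user_vec):
    """Bias retrieval scores over preference memory by similarity to the
    user vector (a minimal stand-in for VARS's retrieval scoring)."""
    return [s + dot(user_vec, m) for s, m in zip(base_scores, memory_vecs)]

def update_user_vec(user_vec, retrieved_vec, reward, lr=0.1):
    """Online update from a weak scalar reward: move toward memory entries
    that drew positive feedback, away from ones that drew negative feedback.
    No backbone parameters are touched, hence no per-user fine-tuning."""
    return [u + lr * reward * m for u, m in zip(user_vec, retrieved_vec)]

user = [0.0, 0.0]
memories = [[1.0, 0.0], [0.0, 1.0]]  # two stored preference entries
base = [0.5, 0.5]                    # backbone retrieval scores (tied)

# Positive feedback after surfacing memory 0 tilts future retrieval toward it.
user = update_user_vec(user, memories[0], reward=1.0)
scores = biased_scores(base, memories, user)
```

After one positive reward the previously tied entries separate, which is exactly the efficiency mechanism the paper reports: users stop restating preferences because retrieval already favors what they rewarded.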
[IR-22] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation
【速读】:该论文试图解决当前基于大语言模型(Large Language Models, LLMs)的自动化评分系统缺乏可解释性的问题,即单一评分难以揭示答案优劣的具体原因、未满足的要求以及改进方向,从而限制了其在模型开发、数据集构建和高风险部署中的应用。解决方案的关键在于引入一种名为RubricRAG的新方法,通过在推理阶段从相关查询中检索领域知识来生成实例特定的评分细则(rubric),从而提升生成rubric与人工编写rubric的一致性及下游评估的有效性,实现更可解释且可扩展的自动评估体系。
链接: https://arxiv.org/abs/2603.20882
作者: Kaustubh D. Dhole,Eugene Agichtein
机构: Emory University(埃默里大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly evaluated and sometimes trained using automated graders such as LLM-as-judges that output scalar scores or preferences. While convenient, these approaches are often opaque: a single score rarely explains why an answer is good or bad, which requirements were missed, or how a system should be improved. This lack of interpretability limits their usefulness for model development, dataset curation, and high-stakes deployment. Query-specific rubric-based evaluation offers a more transparent alternative by decomposing quality into explicit, checkable criteria. However, manually designing high-quality, query-specific rubrics is labor-intensive and cognitively demanding and not feasible for deployment. While previous approaches have focused on generating intermediate rubrics for automated downstream evaluation, it is unclear if these rubrics are both interpretable and effective for human users. In this work, we investigate whether LLMs can generate useful, instance-specific rubrics as compared to human-authored rubrics, while also improving effectiveness for identifying good responses. Through our systematic study on two rubric benchmarks, and on multiple few-shot and post-training strategies, we find that off-the-shelf LLMs produce rubrics that are poorly aligned with human-authored ones. We introduce a simple strategy, RubricRAG, which retrieves domain knowledge via rubrics at inference time from related queries. We demonstrate that RubricRAG can generate more interpretable rubrics both for similarity to human-authored rubrics, and for improved downstream evaluation effectiveness. Our results highlight both the challenges and a promising approach of scalable, interpretable evaluation through automated rubric generation.
[IR-23] Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok
【速读】:该论文旨在解决社交媒体平台个性化推荐系统可能加剧用户观点极化的问题,尤其是在政治、气候变化、疫苗和阴谋论等争议性话题中。其核心问题在于:推荐算法是否以及如何随时间引导用户接触特定主题和立场的内容,从而影响信息获取的多样性与平衡性。解决方案的关键在于通过设计受控账户(controlled accounts)模拟具有不同兴趣倾向的用户,系统性地测量TikTok在多个极化议题上的内容推荐轨迹,识别出三种关键漂移模式——偏好一致漂移(preference-aligned drift)、极化主题漂移(polarisation-topic drift)和极化立场漂移(polarisation-stance drift),从而揭示推荐机制在不同议题下对用户认知路径的差异化塑造作用。
链接: https://arxiv.org/abs/2603.20723
作者: Branislav Pecher,Adrian Bindas,Jan Jakubcik,Matus Tuna,Matus Tibensky,Simon Liska,Peter Sakalik,Andrej Suty,Matej Mosnar,Filip Hossner,Ivan Srba
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所)
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:
Abstract:Social media platforms have become an integral part of everyday life, serving as a primary source of news and information for many users. These platforms increasingly rely on personalised recommendation systems that shape what users see and engage with. While these systems are optimised for engagement, concerns have emerged that they may also drive users toward more polarised perspectives, particularly in contested domains such as politics, climate change, vaccines, and conspiracy theories. In this paper, we present an algorithmic audit of personalisation drift on TikTok in these polarising topics. Using controlled accounts designed to simulate users with interests aligned with or opposed to different polarising topics, we systematically measure the extent to which TikTok steers content exposure toward specific topics and polarities over time. Specifically, we investigated: 1) a preference-aligned drift (showing a strong personalisation towards user interests), 2) a polarisation-topic drift (showing a strong neutralising effect for misinformation-themed topics, and a high preference and reinforcement of interest in the US politics topic); and 3) a polarisation-stance drift (showing a preference for an opposing stance towards the US politics topic and a general reinforcement of users’ stance by recommending items aligned with their stance towards polarising topics). Overall, our findings provide evidence that recommendation trajectories differ markedly across topics, with some pathways amplifying polarised viewpoints more strongly than others, and offer insights for platform governance, transparency and user awareness.
[IR-24] NDT: Non-Differential Transformer and Its Application to Sentiment Analysis
【速读】:该论文旨在解决文本情感分析(Sentiment Analysis)中准确捕捉情感语义的难题,尤其针对标准Transformer模型在处理无关上下文时表现不佳的问题。其解决方案的关键在于提出一种非微分Transformer(Non-Differential Transformer, NDT)架构,该架构摒弃了当前最优模型差分Transformer(Differential Transformer, DT)所采用的注意力图相减机制,转而基于“概念多路复用”(Concept-Multiplexing, ConPlex)理念,通过仅使用正权重的加性策略融合多个独立的注意力映射,使不同注意力组件能够专注于文本中的不同语义概念,从而实现对复杂上下文关系的建设性整合。此设计避免了噪声抑制导向的减法操作,强调正向组合以提升模型表达能力与性能。
链接: https://arxiv.org/abs/2603.20704
作者: Soudeep Ghoshal,Himanshu Buckchash,Sarita Paudel,Rubén Ruiz-Torrubiano
机构: Kalinga Institute of Industrial Technology (KIIT); IMC University of Applied Sciences Krems
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 16 figures. Submitted to IEEE Transactions on Computational Social Systems
Abstract:From customer feedback to social media, understanding human sentiment in text is central to how machines can interact meaningfully with people. However, despite notable progress, accurately capturing sentiment remains a challenging task, which continues to motivate further research in this area. To this end, we introduce Non-Differential Transformer (NDT). It is inspired by (but in contrast to) the state-of-the-art Differential Transformer (DT) model. While standard Transformers can struggle with irrelevant context, the DT model uses attention map subtraction, potentially for noise cancellation. We explore an alternative motivation, hypothesizing that benefits may arise from enabling different attention components to specialize on distinct concepts within the text, similar to multiplexing information channels or mixture models, rather than primarily canceling noise via subtraction. Guided by this concept-multiplexing (ConPlex) view, the specific architecture presented in this paper employs a purely additive strategy. It uses only positive weights, learned during training, to ensure constructive combination of these specialized attention perspectives. This design choice explores positive-only integration, though our broader framework also shows promise with less constrained linear combinations involving both positive and negative weights. Our model computes attention via this positively weighted sum of multiple distinct attention maps. This allows the model to constructively integrate diverse signals and potentially capture more complex contextual relationships. The proposed model achieves competitive performance on Sentiment Analysis when tested on multiple datasets. We conclude by presenting our results, challenges and future research agenda in this important area of research.
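摘要中描述的"正权重加性融合多个注意力图"(ConPlex)机制可用如下NumPy草图示意:每对(Q, K)产生一张注意力图,原始权重经softplus保证严格为正并归一化后加权求和。组件数、维度与权重初值均为本文假设,并非论文的原实现:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conplex_attention(Q_list, K_list, V, raw_weights):
    """ConPlex-style attention: a positively weighted sum of several
    attention maps, so every component contributes constructively."""
    d = Q_list[0].shape[-1]
    maps = [softmax(Q @ K.T / np.sqrt(d)) for Q, K in zip(Q_list, K_list)]
    w = np.log1p(np.exp(np.asarray(raw_weights, dtype=float)))  # softplus -> strictly positive
    w = w / w.sum()                                             # keep rows summing to 1
    combined = sum(wi * m for wi, m in zip(w, maps))
    return combined @ V, combined

rng = np.random.default_rng(0)
n, d = 4, 8
Qs = [rng.normal(size=(n, d)) for _ in range(2)]
Ks = [rng.normal(size=(n, d)) for _ in range(2)]
V = rng.normal(size=(n, d))
# even a negative raw weight maps to a positive combination weight
out, attn = conplex_attention(Qs, Ks, V, raw_weights=[0.3, -0.1])
```

注意与DT的相减策略不同,这里合成出的注意力矩阵仍是逐行归一化的非负分布。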
[IR-25] ReBOL: Retrieval via Bayesian Optimization with Batched LLM Relevance Observations and Query Reformulation
【速读】:该论文旨在解决传统基于向量相似度的Top-k文档检索在生成式AI(Generative AI)重排序(LLM-reranking)阶段中存在的局限性,即无法实现查询与文档之间的上下文token级交互,且难以捕捉多模态相关性分布。为此,作者提出ReBOL方法,其核心创新在于:1)利用大语言模型(LLM)进行查询改写以初始化一个关于文档相关性的多模态贝叶斯优化(Bayesian Optimization, BO)后验分布;2)通过迭代获取文档批次并由LLM进行查询-文档相关性评分,进而更新后验分布以优化整体相关性得分。该方案有效突破了传统检索流程中仅依赖向量相似度的瓶颈,显著提升了召回率(recall)并保持了与现有LLM重排序方法相当的延迟性能。
链接: https://arxiv.org/abs/2603.20513
作者: Anton Korikov,Scott Sanner
机构: University of Toronto (多伦多大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-reranking is limited by the top-k documents retrieved by vector similarity, which neither enables contextual query-document token interactions nor captures multimodal relevance distributions. While LLM query reformulation attempts to improve recall by generating improved or additional queries, it is still followed by vector similarity retrieval. We thus propose to address these top-k retrieval stage failures by introducing ReBOL, which 1) uses LLM query reformulations to initialize a multimodal Bayesian Optimization (BO) posterior over document relevance, and 2) iteratively acquires document batches for LLM query-document relevance scoring followed by posterior updates to optimize relevance. After exploring query reformulation and document batch diversification techniques, we evaluate ReBOL against LLM reranker baselines on five BEIR datasets and using two LLMs (Gemini-2.5-Flash-Lite, GPT-5.2). ReBOL consistently achieves higher recall and competitive rankings, for example compared to the best LLM reranker on the Robust04 dataset with 46.5% vs. 35.0% recall@100 and 63.6% vs. 61.2% NDCG@10. We also show that ReBOL can achieve comparable latency to LLM rerankers.
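其"以LLM相关性观测驱动贝叶斯后验更新、按批次采样文档"的核心循环可用如下玩具实现示意:每篇文档维护独立高斯后验,按UCB批量采样,`observe`用查表函数代替真实的LLM打分。先验、观测噪声方差等均为假设参数,并非ReBOL的原算法:

```python
import numpy as np

def bo_retrieval(prior_mean, observe, n_rounds=3, batch=2, kappa=1.0, obs_var=0.1):
    """Toy batched-BO retrieval loop: independent Gaussian posterior per
    document, UCB batch acquisition, posterior updates from (stubbed)
    LLM relevance observations."""
    mean = np.asarray(prior_mean, dtype=float)   # e.g. a vector-similarity prior
    var = np.ones_like(mean)
    seen = set()
    for _ in range(n_rounds):
        ucb = mean + kappa * np.sqrt(var)
        ucb[list(seen)] = -np.inf                # do not re-acquire scored docs
        for i in np.argsort(-ucb)[:batch]:
            i = int(i)
            if i in seen:
                continue
            y = observe(i)                       # stand-in for an LLM relevance score
            prec = 1.0 / var[i] + 1.0 / obs_var  # Gaussian conjugate update
            mean[i] = (mean[i] / var[i] + y / obs_var) / prec
            var[i] = 1.0 / prec
            seen.add(i)
    return np.argsort(-mean)                     # final relevance ranking

true_rel = [0.1, 0.9, 0.2, 0.8, 0.05]           # hidden ground-truth relevance
prior = [0.5, 0.3, 0.4, 0.35, 0.45]             # deliberately misleading prior
ranking = bo_retrieval(prior, lambda i: true_rel[i])
```

尽管先验排序有误导性,循环结束后真正相关的文档(1号与3号)仍被排到最前。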
[IR-26] yProv4DV: Reproducible Data Visualization Scripts Out of the Box
【速读】:该论文旨在解决科学出版物中数据可视化结果(plots)的可复现性问题,即当前研究实践中常出现仅分享图表而缺乏完整代码、输入数据、执行环境及输出结果的情况,导致他人难以独立重现图形生成过程。现有解决方案多聚焦于计算管道或工作流管理系统,未覆盖科研人员广泛使用的基于脚本的可视化实践。其关键解决方案是提出yProv4DV库,通过轻量级设计利用溯源信息(provenance information)实现对可视化脚本的自动追踪,仅需一次调用即可记录输入、输出和源代码文件,从而最小化代码修改需求,保障数据可视化软件的完整可复现性。
链接: https://arxiv.org/abs/2603.20437
作者: Gabriele Padovani,Sandro Fiore
机构: University of Trento (特伦托大学)
类目: Software Engineering (cs.SE); Information Retrieval (cs.IR)
备注: SoftwareX, 17 pages, 4 figures
Abstract:While results visualization is a critical phase to the communication of new academic results, plots are frequently shared without the complete combination of code, input data, execution context and outputs required to independently reproduce the resulting figures. Existing reproducibility solutions tend to focus on computational pipelines or workflow management systems, not covering script-based visualization practices commonly used by researchers and practitioners. Additionally, the minimalist nature of current Python data visualization libraries tends to speed up the creation of images, disincentivizing users from spending time integrating additional tools into these short scripts. This paper proposes yProv4DV, a lightweight library designed to enable reproducible data visualization scripts through the use of provenance information, minimizing the necessity for code modifications. Through a single call, users can track inputs, outputs and source code files, enabling saving and full reproducibility of their data visualization software. As a result, this library fills a gap in reproducible research workflows by addressing the reproducibility of plots in scientific publications.
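"一次调用即记录输入、输出与源码"的溯源思路可用一个假想装饰器演示。以下纯属示意,并非yProv4DV的真实API,记录字段名(`function`、`code_sha256`等)均为本文虚构:

```python
import hashlib
import time

def track_provenance(record_store):
    """Hypothetical one-call provenance wrapper (NOT the real yProv4DV API):
    records the wrapped function's identity, a hash of its bytecode, its
    inputs and its output so a figure can later be regenerated."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            entry = {
                "function": fn.__name__,
                "code_sha256": hashlib.sha256(fn.__code__.co_code).hexdigest(),
                "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
                "timestamp": time.time(),
            }
            result = fn(*args, **kwargs)
            entry["output"] = repr(result)
            record_store.append(entry)
            return result
        return wrapper
    return decorator

log = []

@track_provenance(log)
def make_plot(data, title="demo"):
    # stand-in for a short matplotlib script; returns a figure description
    return {"title": title, "n_points": len(data)}

fig = make_plot([1, 2, 3])
```

这种包装方式对原脚本几乎零侵入,与论文强调的"最小化代码修改"目标一致。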
[IR-27] PEARL: Personalized Streaming Video Understanding Model
【速读】:该论文旨在解决当前多模态个性化方法在处理连续视频流时的局限性问题,即现有方法主要依赖静态图像或离线视频,无法有效整合实时视觉输入与即时现实反馈,从而限制了AI助手提供实时、交互式个性化响应的能力。为应对这一挑战,作者首次提出并形式化定义了“个性化流式视频理解”(Personalized Streaming Video Understanding, PSVU)任务,并构建了首个专门用于评估该场景的基准测试集PEARL-Bench,涵盖132个独特视频和2,173条带精确时间戳的细粒度标注,支持帧级与视频级两种个性化概念识别模式。解决方案的关键在于提出一种无需训练、可插拔的通用策略PEARL,其在8种不同架构的模型上均展现出显著且一致的性能提升,验证了其作为强基线的有效性与鲁棒性,为未来流式个性化AI助手的研究奠定了基础。
链接: https://arxiv.org/abs/2603.20422
作者: Yuanhong Zheng,Ruichuan An,Xiaopeng Lin,Yuxing Liu,Sihan Yang,Huanyu Zhang,Haodong Li,Qintong Zhang,Renrui Zhang,Guopeng Li,Yifan Zhang,Yuheng Li,Wentao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Arxiv Submission
Abstract:Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model’s ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at this https URL.
[IR-28] WebNavigator: Global Web Navigation via Interaction Graph Retrieval
【速读】:该论文旨在解决当前自主网页导航方法在复杂网络环境中难以达到人类水平性能的问题,其核心瓶颈被识别为“拓扑盲区”(Topological Blindness),即智能体在缺乏全局拓扑结构信息的情况下仅能依赖试错式探索。解决方案的关键在于提出 WebNavigator,该框架将网页导航从概率性探索重构为确定性的检索与路径规划:通过离线零Token成本的启发式探索构建交互图(Interaction Graph),并在在线阶段采用“检索-推理-瞬移”(Retrieve-Reason-Teleport)工作流实现全局导航,从而显著提升任务成功率,尤其在 WebArena 多站点任务中达到 72.9% 的成功率,远超现有企业级代理性能。
链接: https://arxiv.org/abs/2603.20366
作者: Xuanwang Zhang,Yuteng Han,Jinnan Qi,Mulong Xie,Zhen Wu,Xinyu Dai
机构: Nanjing University (南京大学); Fellou AI
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 24 pages, 3 figures
Abstract:Despite significant advances in autonomous web navigation, current methods remain far from human-level performance in complex web environments. We argue that this limitation stems from Topological Blindness, where agents are forced to explore via trial-and-error without access to the global topological structure of the environment. To overcome this limitation, we introduce WebNavigator, which reframes web navigation from probabilistic exploration into deterministic retrieval and pathfinding. WebNavigator constructs Interaction Graphs via zero-token cost heuristic exploration offline and implements a Retrieve-Reason-Teleport workflow for global navigation online. WebNavigator achieves state-of-the-art performance on WebArena and OnlineMind2Web. On WebArena multi-site tasks, WebNavigator achieves a 72.9% success rate, more than doubling the performance of enterprise-level agents. This work reveals that Topological Blindness, rather than model reasoning capabilities alone, is an underestimated bottleneck in autonomous web navigation.
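"在离线构建的交互图上做确定性寻路"这一思路可用一个极简BFS说明:节点为页面状态,边为UI动作,寻路结果即一串可直接执行的动作序列。交互图结构与动作名均为玩具示例,真实系统还包含检索与推理(Retrieve-Reason-Teleport)环节:

```python
from collections import deque

def shortest_action_path(graph, start, goal):
    """BFS over a pre-built interaction graph: nodes are page states,
    edges are (action, next_state) pairs. Returns the action sequence
    reaching `goal`, or None if unreachable."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for action, nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None

# toy interaction graph of a shop site (hypothetical states/actions)
graph = {
    "home":    [("click_search", "search"), ("click_cart", "cart")],
    "search":  [("type_query", "results")],
    "results": [("open_item", "item_page")],
    "cart":    [("checkout", "payment")],
}
path = shortest_action_path(graph, "home", "item_page")
```

一旦图已离线建好,寻路本身零Token开销且结果确定,这正是摘要所说"从概率探索转为确定性检索与寻路"的含义。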
[IR-29] Low-pass Personalized Subgraph Federated Recommendation ICLR2026
【速读】:该论文旨在解决联邦推荐系统(Federated Recommender Systems, FRS)中因客户端子图结构不平衡所引发的模型训练难题,具体表现为不同客户端子图在用户-物品数量(subgraph scale)和物品度数(item degree)上的显著差异,导致客户端表征不一致,难以训练出鲁棒的全局推荐模型。解决方案的关键在于提出一种低通个性化子图联邦推荐系统(Low-pass Personalized Subgraph Federated recommender system, LPSFed),其核心机制包括:利用图傅里叶变换(graph Fourier transforms)与低通谱滤波(low-pass spectral filtering)提取跨不同规模和连接度子图保持稳定的低频结构信号,从而引导基于中性结构锚点(neutral structural anchor)的个性化参数更新;同时引入局部流行度偏差感知的边界项(localized popularity bias-aware margin),捕捉每个子图内的物品度数不平衡,并将其融入个性化偏置修正项以缓解推荐偏差。
链接: https://arxiv.org/abs/2603.20338
作者: Wooseok Sim,Hogun Park
机构: Sungkyunkwan University (成均馆大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICLR 2026. 31 pages, 3 figures, 12 tables
Abstract:Federated Recommender Systems (FRS) preserve privacy by training decentralized models on client-specific user-item subgraphs without sharing raw data. However, FRS faces a unique challenge: subgraph structural imbalance, where drastic variations in subgraph scale (user/item counts) and connectivity (item degree) misalign client representations, making it challenging to train a robust model that respects each client’s unique structural characteristics. To address this, we propose a Low-pass Personalized Subgraph Federated recommender system (LPSFed). LPSFed leverages graph Fourier transforms and low-pass spectral filtering to extract low-frequency structural signals that remain stable across subgraphs of varying size and degree, allowing robust personalized parameter updates guided by similarity to a neutral structural anchor. Additionally, we leverage a localized popularity bias-aware margin that captures item-degree imbalance within each subgraph and incorporates it into a personalized bias correction term to mitigate recommendation bias. Supported by theoretical analysis and validated on five real-world datasets, LPSFed achieves superior recommendation accuracy and enhances model robustness.
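其中"图傅里叶变换+低通滤波提取低频结构信号"的基本操作可用教科书式实现示意:构造组合拉普拉斯矩阵,做特征分解,仅保留最低的k个频率分量。示例图与截断阈值均为假设,并非LPSFed的实际滤波器:

```python
import numpy as np

def low_pass_filter(adj, signal, k):
    """Keep only the k lowest-frequency graph Fourier components of
    `signal` on the graph given by adjacency matrix `adj`."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                            # combinatorial graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)     # ascending eigenvalues = frequencies
    coeffs = eigvecs.T @ signal                # graph Fourier transform
    coeffs[k:] = 0.0                           # discard high-frequency components
    return eigvecs @ coeffs                    # inverse transform

# path graph 0-1-2-3 carrying a smooth-but-noisy signal
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 1.1, 0.9, 1.0])
x_smooth = low_pass_filter(adj, x, k=1)        # k=1 keeps only the constant mode
```

低频分量对局部扰动不敏感,这正是论文用它来对齐规模与度数各异的客户端子图的原因。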
[IR-30] GEM: A Native Graph-based Index for Multi-Vector Retrieval SIGMOD2026
【速读】:该论文旨在解决多向量检索(multi-vector retrieval)中缺乏高效索引算法的问题,现有方法通常依赖于单向量索引结构,难以保留多向量语义信息或效率低下。其核心解决方案是提出GEM框架,关键在于构建一个直接基于向量集合的邻近图(proximity graph),通过两级设计实现高效且语义保持的检索:首先采用集级别聚类策略,仅关联每个向量集合与其最具信息量的聚类中心,从而在不牺牲语义覆盖的前提下减少冗余;其次,在聚类内构建局部邻近图并将其连接成全局可导航结构,同时解耦图构建所用度量与最终相关性评分,并引入语义捷径(semantic shortcuts)引导高效搜索路径。此外,GEM采用量化距离估计技术提升索引与查询阶段的效率,在多个域内、跨域及多模态基准测试中实现最高达16倍的速度提升,同时保持或优于现有方法的准确性。
链接: https://arxiv.org/abs/2603.20336
作者: Yao Tian,Zhoujin Tian,Xi Zhao,Ruiyuan Zhang,Xiaofang Zhou
机构: The Hong Kong University of Science and Technology(香港科技大学); Hong Kong Generative AI Research Development Center(香港生成式人工智能研究中心)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: This paper has been accepted by SIGMOD 2026
Abstract:In multi-vector retrieval, both queries and data are represented as sets of high-dimensional vectors, enabling finer-grained semantic matching and improving retrieval quality over single-vector approaches. However, its practical adoption is held back by the lack of effective indexing algorithms. Existing work, attempting to reuse standard single-vector indexes, often fails to preserve multi-vector semantics or remains slow. In this work, we present GEM, a native indexing framework for multi-vector representations. The core idea is to construct a proximity graph directly over vector sets, preserving their fine-grained semantics while enabling efficient navigation. First, GEM designs a set-level clustering scheme. It associates each vector set with only its most informative clusters, effectively reducing redundancy without hurting semantic coverage. Then, it builds local proximity graphs within clusters and bridges them into a globally navigable structure. To handle the non-metric nature of multi-vector similarity, GEM decouples the graph construction metric from the final relevance score and injects semantic shortcuts to guide efficient navigation toward relevant regions. At query time, GEM launches beam search from multiple entry points and prunes paths early using cluster cues. To further enhance efficiency, a quantized distance estimation technique is used for both indexing and search. Across in-domain, out-of-domain, and multi-modal benchmarks, GEM achieves up to 16x speedup over state-of-the-art methods while matching or improving accuracy.
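摘要提到的"非度量的多向量相似度"通常指ColBERT式的MaxSim打分:每个查询向量与文档向量集中最相似者匹配,再对最大值求和。下面用玩具向量示意,同时说明它为何是非对称、非度量的(也因此难以直接套用单向量索引):

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """Late-interaction multi-vector score: each query vector is matched
    to its best document vector; the per-query maxima are summed."""
    sims = query_vecs @ doc_vecs.T          # (n_q, n_d) pairwise dot products
    return sims.max(axis=1).sum()

q = np.array([[1.0, 0.0], [0.0, 1.0]])                # two query "aspects"
d1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # covers both aspects
d2 = np.array([[1.0, 0.0], [0.9, 0.1]])               # covers only one aspect
better = maxsim(q, d1) > maxsim(q, d2)
```

注意 `maxsim(a, b)` 与 `maxsim(b, a)` 一般不相等,GEM将图构建所用度量与该最终打分解耦,正是为了绕开这一非度量性。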
[IR-31] Bypassing Document Ingestion: An MCP Approach to Financial QA
【速读】:该论文旨在解决金融领域问答(QA)任务中传统检索增强生成(Retrieval-Augmented Generation, RAG)方法依赖文档切片和文本匹配所带来的局限性问题,尤其是在处理需要精确数值计算的定量金融问题时,RAG往往因信息碎片化或上下文不完整而表现不佳。解决方案的关键在于引入模型上下文协议(Model Context Protocol, MCP),通过构建一个自定义MCP服务器将路透社数据服务(LSEG)API作为工具直接暴露给大语言模型(LLM),使模型能够实时调用结构化数据库而非依赖静态文档片段进行推理。实验表明,在FinDER基准测试的Financials子集上,该方法在相关上下文可获取的情况下,对多步数值问题的准确率可达80.4%,验证了MCP在定量金融QA中的有效性与轻量化优势,但也揭示其在涉及定性分析或特定文档内容的问题上存在局限。
链接: https://arxiv.org/abs/2603.20316
作者: Sasan Mansouri,Edoardo Pilla,Mark Wahrenburg,Fabian Woebbeking
机构: University of Groningen (格罗宁根大学); Goethe University Frankfurt (歌德大学法兰克福分校); Halle Institute for Economic Research (IWH) and Martin Luther University Halle-Wittenberg (哈勒经济研究所(IWH)和马丁路德大学哈雷-维滕贝格)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 19 pages, 10 figures
Abstract:Answering financial questions is often treated as an information retrieval problem. In practice, however, much of the relevant information is already available in curated vendor systems, especially for quantitative analysis. We study whether, and under which conditions, Model Context Protocol (MCP) offers a more reliable alternative to standard retrieval-augmented generation (RAG) by allowing large language models (LLMs) to interact directly with data rather than relying on document ingestion and chunk retrieval. We test this by building a custom MCP server that exposes LSEG APIs as tools and evaluating it on the FinDER benchmark. The approach performs particularly well on the Financials subset, achieving up to 80.4% accuracy on multi-step numerical questions when relevant context is retrieved. The paper thus provides both a baseline for MCP-based financial question answering (QA) and evidence on where this approach breaks down, such as for questions requiring qualitative or document-specific context. Overall, direct access to curated data is a lightweight and effective alternative to document-centric RAG for quantitative financial QA, but not a substitute for all financial QA tasks.
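将数据API"作为工具直接暴露给LLM"的基本形态可用一个极简的工具注册与分发器示意。JSON请求格式、工具名与报价数据均为本文虚构;真实的Model Context Protocol基于标准化的JSON-RPC传输,LSEG API也需凭据访问:

```python
import json

TOOLS = {}

def tool(name, description):
    """Register a function as a callable tool, MCP-style: the model sees
    the tool's name/description, the server executes the call.
    (Illustrative only; not the real MCP SDK.)"""
    def decorator(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return decorator

@tool("get_price", "Latest close price for a ticker (stubbed data).")
def get_price(ticker):
    prices = {"AAPL": 182.5, "MSFT": 410.2}   # stand-in for a vendor API call
    return prices.get(ticker.upper())

def handle_call(request_json):
    """Dispatch a tool-call request of the form
    {"tool": ..., "arguments": {...}} and return a JSON result."""
    req = json.loads(request_json)
    fn = TOOLS[req["tool"]]["fn"]
    return json.dumps({"result": fn(**req["arguments"])})

resp = handle_call('{"tool": "get_price", "arguments": {"ticker": "aapl"}}')
```

与RAG切片检索不同,模型拿到的是结构化数据库的实时查询结果,而非可能残缺的文档片段,这正是该文在定量金融问答上取胜的直觉。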
[IR-32] BubbleRAG : Evidence-Driven Retrieval-Augmented Generation for Black-Box Knowledge Graphs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型任务中出现幻觉(hallucination)的问题,尤其是当基于图结构的检索增强生成(Retrieval-Augmented Generation, RAG)方法作用于黑箱知识图谱(black-box knowledge graphs)时所面临的召回率(recall)和精确度(precision)局限性。其核心挑战包括语义实例化不确定性、结构路径不确定性以及证据比较不确定性。解决方案的关键在于将检索任务形式化为最优信息子图检索(Optimal Informative Subgraph Retrieval, OISR)问题——一个Group Steiner Tree的变体,并证明其为NP-hard且APX-hard;进而提出无需训练的BubbleRAG框架,通过语义锚点分组、启发式气泡扩展以发现候选证据图(Candidate Evidence Graphs, CEGs)、复合排序及推理感知扩展等模块系统性地优化召回与精度,从而在多跳问答(multi-hop QA)基准上实现SOTA性能。
链接: https://arxiv.org/abs/2603.20309
作者: Duyi Pan,Tianao Lou,Xin Li,Haoze Song,Yiwen Wu,Mengyi Deng,Mingyu Yang,Wei Wang
机构: 未知
类目: Information Retrieval (cs.IR); Databases (cs.DB)
备注: Technical Report
Abstract:Large Language Models (LLMs) exhibit hallucinations in knowledge-intensive tasks. Graph-based retrieval augmented generation (RAG) has emerged as a promising solution, yet existing approaches suffer from fundamental recall and precision limitations when operating over black-box knowledge graphs – graphs whose schema and structure are unknown in advance. We identify three core challenges that cause recall loss (semantic instantiation uncertainty and structural path uncertainty) and precision loss (evidential comparison uncertainty). To address these challenges, we formalize the retrieval task as the Optimal Informative Subgraph Retrieval (OISR) problem – a variant of Group Steiner Tree – and prove it to be NP-hard and APX-hard. We propose BubbleRAG, a training-free pipeline that systematically optimizes for both recall and precision through semantic anchor grouping, heuristic bubble expansion to discover candidate evidence graphs (CEGs), composite ranking, and reasoning-aware expansion. Experiments on multi-hop QA benchmarks demonstrate that BubbleRAG achieves state-of-the-art results, outperforming strong baselines in both F1 and accuracy while remaining plug-and-play.
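"从各语义锚点组向外扩泡、寻找连接所有组的证据节点"这一启发式可用多源BFS草图说明。玩具知识图谱与返回策略均为本文假设,仅示意思路:真实的OISR问题是NP难的,BubbleRAG还包含复合排序与推理感知扩展等模块:

```python
from collections import deque

def bubble_meet(graph, groups):
    """Heuristic 'bubble expansion': grow one BFS bubble per anchor group
    and return the first node every bubble reaches (ties broken by total
    hop distance). A cheap stand-in for optimal evidence-subgraph search."""
    dist = [dict.fromkeys(g, 0) for g in groups]
    frontiers = [deque(g) for g in groups]
    while True:
        covered = set.intersection(*(set(d) for d in dist))
        if covered:
            return min(covered, key=lambda n: sum(d[n] for d in dist))
        if not any(frontiers):
            return None                       # groups are disconnected
        for gi, frontier in enumerate(frontiers):
            for _ in range(len(frontier)):    # expand each bubble by one hop
                node = frontier.popleft()
                for nxt in graph.get(node, []):
                    if nxt not in dist[gi]:
                        dist[gi][nxt] = dist[gi][node] + 1
                        frontier.append(nxt)

# toy undirected knowledge graph (adjacency lists list both directions)
kg = {
    "Einstein": ["Physics", "Princeton"],
    "Physics": ["Einstein", "Nobel_Prize"],
    "Princeton": ["Einstein", "New_Jersey"],
    "Nobel_Prize": ["Physics"],
    "New_Jersey": ["Princeton"],
}
meet = bubble_meet(kg, [{"Nobel_Prize"}, {"New_Jersey"}])
```

找到的汇合节点连同各泡内的最短路径即可构成一个候选证据图(CEG)。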
[IR-33] Report-based Recommendations for Policy Making and Agency Operations: Dataset and LLM Evaluation LREC2026
【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)生成可用于指导组织政策改进的推荐建议这一问题。传统推荐系统通常面向产品或用户,而本文提出了一种新的任务范式,即基于报告结论生成具有行动导向性的政策改进建议,并构建了首个针对该任务的基准数据集与系统性评估框架。解决方案的关键在于设计一个能够识别并提炼报告中关键问题与学习要点、进而生成结构化、可操作推荐内容的LLM应用流程,从而为公共及私营组织的政策制定和运营优化提供智能支持。
链接: https://arxiv.org/abs/2603.20287
作者: Aleksandra Edwards,Thomas Edwards,Jose Camacho-Collados,Alun Preece
机构: 未知
类目: Information Retrieval (cs.IR)
备注: The paper has been accepted to LREC 2026
Abstract:Large Language Models (LLMs) are extensively used in text generation tasks. These generative capabilities bring us to a point where LLMs could potentially provide useful insights in policy making or agency operations. In this paper, we introduce a new task consisting of generating recommendations which can be used to inform future actions and improvements of agencies work within private and public organisations. In particular, we present the first benchmark and coherent evaluation for developing recommendation systems to inform organisation policies. This task is clearly different from usual product or user recommendation systems, but rather aims at providing a basis to suggest policy improvements based on the conclusions drawn from reports. Our results demonstrate that state-of-the-art LLMs have the potential to emphasize and reflect on key issues and learning points within generated recommendations.
[IR-34] Rethinking Retrieval-Augmentation as Synthesis: A Query-Aware Context Merging Approach
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在实际部署中因语言模型有限上下文窗口导致的信息冗余与关键证据丢失问题。传统方法采用“检索后选择”策略,仅保留相关性最高的k个文本片段,但这种方式会忽略长尾分布中潜在的桥接证据,并浪费token预算在语义重复的高分片段上。其解决方案的关键在于将RAG重构为一个动态优化问题,以最大化信息密度;提出MergeRAG框架,通过查询感知的合成机制替代静态过滤:一是对称合并(Symmetric Merging),整合弱信号以恢复丢失的桥接证据;二是非对称合并(Asymmetric Merging),利用熵引导锚定消除冗余而不损害语义完整性;此外引入分层并行合并策略,在减少信息损失的同时提升计算并行度。实验表明,该方法显著优于现有SOTA基线,F1得分提升达13.7点,Exact Match(EM)提升11.5点。
链接: https://arxiv.org/abs/2603.20286
作者: Jiarui Guo,Yuemeng Xu,Zongwei Lv,Yangyujia Wang,Xiaolin Wang,Kan Liu,Tao Lan,Lin Qu,Tong Yang
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to extend their existing knowledge by dynamically incorporating external information. However, practical deployment is fundamentally constrained by the LLM’s finite context window, forcing a trade-off between information sufficiency and token consumption. Standard pipelines address this via a retrieve-then-select strategy, typically retaining only the top-k chunks based on relevance. Nevertheless, this approach is suboptimal: it inherently truncates critical bridging evidence located in the long tail of the relevance distribution, while simultaneously wasting the token budget on semantically redundant high-ranking chunks. In this paper, we rethink retrieval-augmentation as a dynamic optimization problem aimed at maximizing information density. We propose MergeRAG, a novel framework that shifts the paradigm from static filtering to query-aware synthesis. MergeRAG employs a scoring agent to restructure retrieved contexts through a dual-pathway mechanism: 1) Symmetric Merging, which consolidates weak signals to recover lost bridging evidence; 2) Asymmetric Merging, which utilizes entropy-guided anchoring to eliminate redundancy without sacrificing semantic integrity. We further introduce a Hierarchical Parallel Merging strategy that mitigates information loss while maximizing computational parallelism. Extensive experiments on standard benchmarks demonstrate that MergeRAG significantly outperforms state-of-the-art RAG baselines, achieving up to 13.7 points improvement in F1 score and 11.5 points in Exact Match (EM), respectively.
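"去掉语义冗余的高分片段、把省下的预算让给长尾桥接证据"这一直觉可用基于Jaccard词重叠的极简草图体现。阈值、预算与去重方式均为本文假设,远非MergeRAG实际的双通道(对称/非对称)合成机制:

```python
def jaccard(a, b):
    """Word-overlap similarity between two text chunks."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def merge_context(ranked_chunks, budget_words, dup_threshold=0.6):
    """Greedy budget-aware context assembly: skip near-duplicates of
    already-kept evidence so tail chunks can fit within the budget.
    Word counts stand in for tokens."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        if any(jaccard(chunk, k) >= dup_threshold for k in kept):
            continue                      # redundant with kept evidence
        cost = len(chunk.split())
        if used + cost > budget_words:
            continue                      # over budget; try shorter tail chunks
        kept.append(chunk)
        used += cost
    return kept

chunks = [
    "paris is the capital of france",
    "the capital of france is paris",          # near-duplicate of the first
    "the seine river flows through paris",     # long-tail bridging evidence
]
context = merge_context(chunks, budget_words=12)
```

朴素top-2截断会保留两条互为复述的片段并丢掉第三条;去冗余后,长尾证据得以进入上下文。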
[IR-35] FastPFRec: A Fast Personalized Federated Recommendation with Secure Sharing
【速读】:该论文旨在解决基于图神经网络(Graph Neural Network, GNN)的联邦推荐系统中存在的两个核心问题:一是模型在图数据上的训练收敛速度慢,二是协作过程中存在隐私泄露风险。其解决方案的关键在于提出一种名为FastPFRec(Fast Personalized Federated Recommendation with Secure Sharing)的新框架,该框架通过两种机制实现优化:首先,采用高效的本地更新策略加速模型收敛;其次,引入隐私感知的参数共享机制以降低隐私泄露风险。实验表明,FastPFRec在多个真实世界数据集上显著提升了训练效率与推荐精度,同时保障了数据安全性。
链接: https://arxiv.org/abs/2603.20283
作者: Zhenxing Yan,Jidong Yuan,Yongqi Sun,Haiyang Liu,Zhihui Gao
机构: Beijing Jiaotong University (北京交通大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Graph neural network (GNN)-based federated recommendation systems effectively capture user-item relationships while preserving data privacy. However, existing methods often face slow convergence on graph data and privacy leakage risks during collaboration. To address these challenges, we propose FastPFRec (Fast Personalized Federated Recommendation with Secure Sharing), a novel framework that enhances both training efficiency and data security. FastPFRec accelerates model convergence through an efficient local update strategy and introduces a privacy-aware parameter sharing mechanism to mitigate leakage risks. Experiments on four real-world datasets (Yelp, Kindle, Gowalla-100k, and Gowalla-1m) show that FastPFRec achieves 32.0% fewer training rounds, 34.1% shorter training time, and 8.1% higher accuracy compared with existing baselines. These results demonstrate that FastPFRec provides an efficient and privacy-preserving solution for scalable federated recommendation.
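联邦推荐的基本协作模式(只共享模型参数、从不上传原始交互数据)可用按客户端样本量加权的FedAvg草图说明。此处以线性回归代替GNN推荐模型,纯属通用示意,不代表FastPFRec的本地更新策略或隐私感知共享机制:

```python
import numpy as np

def fedavg_round(global_w, client_data, lr=0.1, local_steps=5):
    """One FedAvg round: each client runs local SGD on its own (X, y)
    and only model weights leave the client, weighted by data size."""
    updates = []
    for X, y in client_data:
        w = global_w.copy()
        for _ in range(local_steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)   # least-squares gradient
            w -= lr * grad
        updates.append(w)
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(2)
true_w = np.array([1.0, -2.0])
clients = [(X, X @ true_w) for X in (rng.normal(size=(40, 2)) for _ in range(3))]

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, clients)
```

论文针对的正是这一循环在图数据上收敛慢、参数共享有泄露风险两处短板。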
[IR-36] OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
【速读】:该论文旨在解决深度研究代理(Deep Research Agent)训练中长期轨迹合成的可复现性与成本问题,现有数据收集流程依赖专有网络API,导致大规模轨迹生成昂贵、不稳定且难以复现。其解决方案的关键在于提出一个名为OpenResearcher的可复现流水线,通过将一次性语料库构建与多轮轨迹合成解耦,并在离线环境中利用三个显式的浏览器原语(search、open、find)对1500万文档语料库进行搜索-浏览循环,从而实现高效、稳定、可控的轨迹生成。该方法使用GPT-OSS-120B作为教师模型合成超过9.7万条轨迹(包含100+工具调用的长程轨迹),并基于这些数据对30B-A3B模型进行监督微调,在BrowseComp-Plus上达到54.8%准确率,较基线提升34.0点,同时保持在BrowseComp、GAIA和xbench-DeepSearch等基准上的竞争力。
链接: https://arxiv.org/abs/2603.20278
作者: Zhuofeng Li,Dongfu Jiang,Xueguang Ma,Haoxiang Zhang,Ping Nie,Yuyu Zhang,Kai Zou,Jianwen Xie,Yu Zhang,Wenhu Chen
机构: Texas A&M University (德州农工大学); University of Waterloo (滑铁卢大学); UC San Diego (加州大学圣地亚哥分校); Verdent AI (Verdent AI); NetMind AI (NetMind AI); Lambda (Lambda)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline using three explicit browser primitives: search, open, and find, over a 15M-document corpus. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at this https URL.
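摘要中的三个离线浏览原语(search、open、find)可用内存语料上的最小实现示意。打分方式与语料均为玩具假设,真实环境基于1500万文档的离线索引:

```python
class OfflineBrowser:
    """Minimal stand-in for the pipeline's three offline primitives
    (search, open, find) over an in-memory corpus."""

    def __init__(self, corpus):
        self.corpus = corpus          # {doc_id: text}

    def search(self, query, k=3):
        """Rank documents by crude query-word containment."""
        scored = sorted(
            self.corpus,
            key=lambda d: -sum(w in self.corpus[d].lower()
                               for w in query.lower().split()),
        )
        return scored[:k]

    def open(self, doc_id):
        """Return the full document text."""
        return self.corpus[doc_id]

    def find(self, doc_id, needle):
        """Return an 80-char snippet starting at the first match."""
        i = self.corpus[doc_id].lower().find(needle.lower())
        return None if i < 0 else self.corpus[doc_id][i:i + 80]

corpus = {
    "d1": "The Eiffel Tower was completed in 1889 in Paris.",
    "d2": "Mount Everest is the highest mountain on Earth.",
}
browser = OfflineBrowser(corpus)
hits = browser.search("eiffel tower year")
page = browser.open(hits[0])
snippet = browser.find(hits[0], "completed in")
```

由于环境完全离线且行为可完全记录,轨迹合成既零API成本又可复现,这是该流水线的核心卖点。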
人机交互
[HC-0] ShapDBM: Exploring Decision Boundary Maps in Shapley Space
【速读】:该论文旨在解决决策边界图(Decision Boundary Maps, DBMs)在高维数据中可视化质量下降的问题,尤其针对维度约简(Dimensionality Reduction, DR)技术选择不当导致类别混杂、决策区域难以解析的挑战。其解决方案的关键在于将原始数据空间转换至Shapley空间进行维度约简,从而生成更紧凑、更易探索的决策区域,同时保持或提升DBM的质量指标(如边界清晰度和分类分离度)。此方法有效缓解了传统DBMs因DR过程引入噪声或混淆而导致的可视化失真问题。
链接: https://arxiv.org/abs/2603.22235
作者: Luke Watkin,Daniel Archambault,Alex Telea
机构: Newcastle University, UK; Utrecht University, Netherlands
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 7 pages and 4 figures
Abstract:Decision Boundary Maps (DBMs) are an effective tool for visualising machine learning classification boundaries. Yet, DBM quality strongly depends on the dimensionality reduction (DR) technique and the high-dimensional space used for the data points. For complex ML datasets, DR can create many mixed classes which, in turn, yield DBMs that are hard to use. We propose a new technique to compute DBMs by transforming data space into Shapley space and computing DR on it. Compared to standard DBMs computed directly from data, our maps have similar or higher quality metric values and visibly more compact, easier-to-explore decision zones.
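"先映射到Shapley空间、再做维度约简"的流程可用线性模型的解析Shapley值加PCA示意:对线性模型 f(x)=w·x+b,特征j的精确Shapley值为 φ_ij = w_j·(x_ij − mean_j)。分类器与数据均为本文假设,论文实际使用的是一般SHAP解释器与多种DR技术:

```python
import numpy as np

def linear_shapley(X, w):
    """Exact per-feature Shapley values for a linear model f(x) = w.x + b:
    phi_ij = w_j * (x_ij - mean_j)."""
    return (X - X.mean(axis=0)) * w

def pca_2d(Z):
    """Project rows of Z onto their first two principal components."""
    Zc = Z - Z.mean(axis=0)
    _, _, vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ vt[:2].T

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w = np.array([3.0, 0.1, 0.1, 0.1, 0.1])   # feature 0 dominates the model
phi = linear_shapley(X, w)                 # data space -> Shapley space
embedding = pca_2d(phi)                    # DR computed in Shapley space
```

由于Shapley空间按特征对模型输出的贡献重新加权,主导特征自然支配投影的第一主轴,决策区域也因此更紧凑。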
[HC-1] Dyadic: A Scalable Platform for Human-Human and Human-AI Conversation Research
【速读】:该论文旨在解决当前对话研究中因工具模块化程度不足且难以适应研究者需求而面临的实证研究困境。其解决方案的关键在于提出并介绍了一个名为Dyadic的新型网络平台,该平台支持多模态交互(文本或语音)、提供AI建议功能(如在人与人对话研究中由AI辅助生成回应)、实时监控能力(研究人员可即时评估交流过程)以及原生问卷部署功能(包括李克特量表、情绪温度计和开放式文本框等),从而显著提升对话研究的灵活性与效率,且无需编程即可直接操作,同时兼容现有调查平台以增强扩展性。
链接: https://arxiv.org/abs/2603.22227
作者: David M. Markowitz
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Conversation is ubiquitous in social life, but the empirical study of this interactive process has been thwarted by tools that are insufficiently modular and unadaptive to researcher needs. To relieve many constraints in conversation research, the current tutorial presents an overview and introduction to a new tool, Dyadic (this https URL), a web-based platform for studying human-human and human-AI conversations using text-based or voice-based chats. Dyadic is distinct from other platforms by offering studies with multiple modalities, AI suggestions (e.g., in human-human studies, AI can suggest responses to a participant), live monitoring (e.g., researchers can evaluate, in real time, chats between communicators), and survey deployment (e.g., Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to humans for in situ evaluations of the interaction), among other consequential features. No coding is required to operate Dyadic directly, and integrations with existing survey platforms are offered.
[HC-2] Feasibility of Augmented Reality-Guided Robotic Ultrasound with Cone-Beam CT Integration for Spine Procedures
【速读】:该论文旨在解决脊柱介入手术中针头定位精度不足的问题,传统影像引导方式依赖CT与X射线透视,不仅辐射剂量高,且缺乏实时三维(3D)反馈,难以实现精准的解剖标志识别与穿刺路径规划。其解决方案的关键在于提出一种基于光学透明增强现实(Optical See-Through Augmented Reality, OST-AR)的机器人辅助系统,通过将锥形束CT(Cone-Beam CT, CBCT)重建的3D脊柱模型与实时超声图像进行配准,实现术中脊柱结构的原位可视化,从而融合全局解剖信息与局部实时成像,显著提升操作效率、准确性及用户空间认知能力。
链接: https://arxiv.org/abs/2603.22174
作者: Tianyu Song,Felix Pabst,Feng Li,Yordanka Velikova,Miruna-Alexandra Gafencu,Yuan Bi,Ulrich Eck,Nassir Navab
机构: Chair for Computer Aided Medical Procedures(CAMP), Technical University of Munich(TUM); Munich Center for Machine Learning (MCML)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 8 pages, 7 figures
Abstract:Accurate needle placement in spine interventions is critical for effective pain management, yet it depends on reliable identification of anatomical landmarks and careful trajectory planning. Conventional imaging guidance often relies on both CT and X-ray fluoroscopy, exposing patients and staff to high doses of radiation while providing limited real-time 3D feedback. We present an optical see-through augmented reality (OST-AR)-guided robotic system for spine procedures that provides in situ visualization of spinal structures to support needle trajectory planning. We integrate a cone-beam CT (CBCT)-derived 3D spine model which is co-registered with live ultrasound, enabling users to combine global anatomical context with local, real-time imaging. We evaluated the system in a phantom user study involving two representative spine procedures: facet joint injection and lumbar puncture. Sixteen participants performed insertions under two visualization conditions: conventional screen vs. AR. Results show that AR significantly reduces execution time and across-task placement error, while also improving usability, trust, and spatial understanding and lowering cognitive workload. These findings demonstrate the feasibility of AR-guided robotic ultrasound for spine interventions, highlighting its potential to enhance accuracy, efficiency, and user experience in image-guided procedures.
[HC-3] More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice
【速读】:该论文试图解决多AI咨询(multi-AI consultation)如何影响人类决策准确性的问题,尤其是在不同面板规模、内部一致性及呈现方式下,是否存在过度依赖或认知偏差导致决策质量下降。解决方案的关键在于:通过控制面板规模(小规模优于大规模)、引入适度分歧(单个异议可减少从众压力)、以及采用类人化呈现方式(提升感知有用性和代理感而不加剧从众),从而在保持决策准确性的前提下缓解从众压力(conformity pressure)。
链接: https://arxiv.org/abs/2603.22152
作者: Yuta Tsuchiya,Yukino Baba
机构: The University of Tokyo (东京大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 21 pages, 12 figures, accepted to CHI 2026
Abstract:Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants’ reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.
[HC-4] Designing Medical Chatbots where Accuracy and Acceptability are in Conflict: An Exploratory Vignette-based Study in Urban India
【速读】:该论文旨在解决医疗聊天机器人(medical chatbot)提供的指南一致型建议(guideline-aligned advice)与印度城市居民基于自身生活经验形成的护理期望之间存在的冲突问题。由于当地普遍存在抗生素、止泻药和注射剂的过度使用,用户对非指南建议的接受度更高,导致指南一致型建议常被拒绝。论文通过混合方法学的案例研究(vignette-based study)发现,多数用户倾向于排斥此类建议,并基于其实际经验提出多元化的理由。解决方案的关键在于引入情境感知型提示(context-aware nudges),通过调整用户预期来促进对指南一致型建议的接受度,从而缓解全球南方地区医疗聊天机器人设计中的公平性张力(equitable design tensions)。
链接: https://arxiv.org/abs/2603.22115
作者: Ananditha Raghunath,William Thies,Mohit Jain
机构: University of Washington (华盛顿大学); Everwell Health Solutions; Microsoft Research India (微软研究院印度)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:When medical chatbots provide advice that conflicts with users’ lived care experiences, users are left to interpret, negotiate, and evaluate the legitimacy of that guidance. In India, the widespread overuse of antibiotics, antidiarrheals, and injections has shifted patient expectations away from the guideline-aligned advice that chatbots are trained to provide. We present a mixed-methods, vignette-based study with 200 urban Indian adults examining preferences for and against guideline-aligned, norm-divergent advice in chatbot transcripts. We find that a majority of users reject such advice, drawing on diverse rationales grounded in their lived expectations. Through the design and introduction of context-aware nudges, we support expectation alignment that shifts preferences towards transcripts containing guideline-aligned advice. In doing so, we surface key tensions in the equitable design of medical chatbots in the Global South.
[HC-5] Surfacing and Applying Meaning: Supporting Hermeneutical Autonomy for LGBTQ People in Taiwan
【速读】:该论文旨在解决台湾同性婚姻合法化后,LGBTQ+群体在社交媒体上持续遭遇敌意与歧视所引发的诠释不公(hermeneutical injustice)问题,以及由此导致的身份探索受阻、叙事建构困难和社群韧性削弱等挑战。解决方案的关键在于设计并评估一种基于大语言模型(LLM)的检索增强型聊天机器人,通过反思(reflection)、验证(validation)、讨论(discussion)和盟友支持(allyship)四种交互模式,赋能用户重构负面叙事、确认自身经验,并在低认知负荷下推进身份探索,从而提升诠释自主性(hermeneutical autonomy),减轻其应对网络敌意所需的诠释劳动(hermeneutical labor)。
链接: https://arxiv.org/abs/2603.21990
作者: Yi-Tong Chen,En-Kai Chang,Nanyi Bi,Nitesh Goyal
机构: National Taiwan University (国立台湾大学); National Yang Ming Chiao Tung University (国立阳明交通大学); Google Research (谷歌研究)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 28 pages; accepted by CHI 2026
Abstract:After Taiwan’s legalization of same-sex marriage in 2019, LGBTQ+ communities continue to face hostility on social media. Using the lens of hermeneutical injustice and autonomy, we examine how technological conditions affect LGBTQ+ individuals’ identity exploration, narrative seeking, and community resilience. We conducted a multi-stage study with Taiwanese LGBTQ+ individuals, including in-depth interviews, participatory design workshops, and evaluation sessions. Participants described fragile yet creative strategies such as seeking validation in online interactions, reframing hostile content through theory, and relying on allies. Building on these insights, we designed and evaluated a retrieval-augmented, LLM-powered chatbot with four modes of interaction: reflection, validation, discussion, and allyship. Findings show that the system fosters hermeneutical autonomy by helping participants reframe hostile narratives, validate lived experiences, and scaffold identity exploration, while reducing the hermeneutical labor of navigating social media hostility. We conclude by outlining design implications for AI systems that advance hermeneutical autonomy through fluid self-representation, contextualized dialogue, and inclusive community participation.
[HC-6] AnkleType: A Hands- and Eyes-free Foot-based Text Entry Technique in Virtual Reality
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)环境中文本输入难以实现手部和视觉自由的问题,从而避免打断沉浸式交互流程。其核心解决方案是提出AnkleType,一种基于踝关节动作的无手无眼文本输入技术,通过用户踝部自然运动范围与偏好手势设计出适用于站立和坐姿场景的双足(BPSit)与单足(UPStand)输入策略,显著提升了在VR中进行眼盲状态下文本输入的效率与可行性。
链接: https://arxiv.org/abs/2603.21915
作者: Xiyun Luo,Weirong Luo,Kening Zhu,Taizhou Chen
机构: Shantou University (汕头大学); City University of Hong Kong (香港城市大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Virtual Reality (VR) emphasizes immersive experiences, while text entry often requires hands or visual attention, which may disrupt the interaction flows in VR. We present AnkleType, a hand- and eye-free text-entry technique that leverages ankle-based gestures for both standing and sitting situations. We began with two preliminary studies: one investigated the movement range of users’ ankles, and the other elicited user-preferred ankle gestures for text-entry-related operations. The findings of these two studies guided our design of AnkleType. To optimize AnkleType’s keyboard layout for eye-free input, we conducted a user study to capture the users’ natural ankle spatial awareness with a computer-simulated language test. Through a pairwise comparison study, we designed a bipedal input strategy for sitting (BPSit) and a unipedal input strategy for standing (UPStand). Our first in-VR text-entry evaluation with 16 participants demonstrated that our methods could support the average typing speed from 8.99 WPM (BPSit) to 9.13 WPM (UPStand) for our first-time users. We further evaluated our design with a 7-day longitudinal study with twelve participants. Participants achieved an average typing speed of 15.05 WPM with UPStand and 16.70 WPM with BPSit in the visual condition, and 11.15 WPM and 12.87 WPM, respectively in the eyes-free condition.
[HC-7] From Scores to Strategies: Towards Gaze-Informed Diagnostic Assessment for Visualization Literacy
【速读】:该论文试图解决可视化素养评估中仅依赖正确率进行分类导致无法揭示读者认知过程的问题。传统方法难以捕捉阅读者如何处理图表信息,从而限制了对理解深度和策略差异的洞察。解决方案的关键在于引入眼动(gaze)作为隐式的过程信号,与标准化测试相结合,在不牺牲可扩展性的前提下提供认知负荷和注意力分配策略的量化指标。文中提出将眼动衍生的过程指标——如组件级注意力分布、信息整合频率及浏览路径离散度——纳入评估体系,以区分流畅理解与费力成功的认知状态,推动可视化素养评估从二元分类向多维认知特征刻画转变。
链接: https://arxiv.org/abs/2603.21898
作者: Kathrin Schnizer
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学)
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, Workshop on Data Literacy for the 21st Century at CHI 2026, April 26, Barcelona, Spain
Abstract:Visualization literacy assessments typically rely on correctness to classify performance, providing little evidence about how readers arrive at their answers. We argue that gaze can address this gap as an implicit process signal that complements standardized tests without sacrificing their scalability. Synthesizing findings from visualization and related research, we show that gaze metrics capture cognitive load invisible to accuracy and response time, and reflect strategy differences in attention allocation that track proficiency. We propose assessments that integrate literacy scores with gaze-derived process indicators - component-level attention profiles, integration frequency, and viewing path dispersion - to distinguish fluent comprehension from labored success. This would shift literacy assessment from binary classification toward nuanced characterization of how readers navigate, integrate, and coordinate information across chart components. A roadmap identifies open challenges in empirical grounding, generalizability, assessment design, and practical feasibility.
[HC-8] Agentic Personas for Adaptive Scientific Explanations with Knowledge Graphs
【速读】:该论文旨在解决当前AI解释方法普遍假设静态用户模型的问题,即生成的解释无法根据专家的目标、推理策略或决策情境进行自适应调整,尤其在科学发现等复杂领域中,这种局限性阻碍了深度理解与知情决策。其解决方案的关键在于引入“代理人格化角色”(agentic personas),即结构化的专家推理策略表示,通过强化学习框架引导解释代理(explanation agent)向特定认知偏好(epistemic preferences)演化,从而实现基于知识图谱的可适配解释生成。实证表明,该方法在药物发现任务中不仅保持了最先进的预测性能,且其解释偏好与对应专家高度一致,同时显著降低对人类反馈的需求量级(减少两个数量级),为高风险复杂场景下的可扩展自适应可解释性提供了新路径。
链接: https://arxiv.org/abs/2603.21846
作者: Susana Nunes,Tiago Guerreiro,Catia Pesquita
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 17 pages, 9 figures
Abstract:AI explanation methods often assume a static user model, producing non-adaptive explanations regardless of expert goals, reasoning strategies, or decision contexts. Knowledge graph-based explanations, despite their capacity for grounded, path-based reasoning, inherit this limitation. In complex domains such as scientific discovery, this assumption fails to capture the diversity of cognitive strategies and epistemic stances among experts, preventing explanations that foster deeper understanding and informed decision-making. However, the scarcity of human experts limits the use of direct human feedback to produce adaptive explanations. We present a reinforcement learning approach for scientific explanation generation that incorporates agentic personas, structured representations of expert reasoning strategies, that guide the explanation agent towards specific epistemic preferences. In an evaluation of knowledge graph-based explanations for drug discovery, we tested two personas that capture distinct epistemic stances derived from expert feedback. Results show that persona-driven explanations match state-of-the-art predictive performance while persona preferences closely align with those of their corresponding experts. Adaptive explanations were consistently preferred over non-adaptive baselines (n = 22), and persona-based training reduces feedback requirements by two orders of magnitude. These findings demonstrate how agentic personas enable scalable adaptive explainability for AI systems in complex and high-stakes domains.
[HC-9] Embodying Facts, Figures, and Faiths in Narrative Artistic Performances in Rural Bangladesh
【速读】:该论文试图解决的问题是:当前数据叙事(data narrative)设计多基于现代信息可视化修辞与伦理框架,可能忽视数字素养和媒体素养较低群体的文化实践,导致其在数据驱动系统中的边缘化。解决方案的关键在于揭示孟加拉国三个村庄社区如何通过本土娱乐与文化实践(如Puthi、Bhandari Gaan和Pot音乐)实现多声部(polyvocality)表达、融合事实性、情感性和审美性的叙事结构,并动态适应技术演进与受众需求,从而为HCI、可视化及数据伦理从业者提供设计更具文化适切性与可访问性的数据叙事方法的启示。
链接: https://arxiv.org/abs/2603.21830
作者: Sharifa Sultana,Zinnat Sultana,Jeffrey M. Rzeszotarski,Syed Ishtiaque Ahmed
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); S.M.R. Law College (S.M.R. 法学院); Loyola University Maryland (洛约拉马里兰大学); University of Toronto (多伦多大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:There is an increasing interest in telling serious stories with data. Designers organize information, construct narratives, and present findings to inform audiences. However, many of these practices emerge from modern information visualization rhetoric and ethical frameworks which may marginalize communities with low digital and media literacy. In a ten-month-long ethnographic study in three Bangladeshi villages, we investigated how these communities use entertainment and cultural practices, namely Puthi, Bhandari Gaan, and Pot music, to instruct, communicate traditional moral lessons and recall history. We found that these communities embrace polyvocality and multiple ethical frameworks in their performances, construct narratives combining factuality, emotionality, and aesthetics, and adapt their performances to changing technology and audience needs. Our findings provide HCI, visualization, and ethical data practitioners with implications for the design of accessible and culturally appropriate ways of presenting data narratives in data-driven systems.
[HC-10] BadminSense: Enabling Fine-Grained Badminton Stroke Evaluation on a Single Smartwatch
【速读】:该论文旨在解决业余羽毛球运动员缺乏专业教练指导而导致的技战术水平提升受限问题(即羽毛球表现评估依赖专业教练指导,而这一资源对业余玩家而言极为稀缺)。解决方案的关键在于提出了一种基于智能手表的细粒度羽毛球表现分析系统 BadminSense,通过采集可穿戴设备上的振动信号,实现对击球动作的分割与分类、击球质量预测及球拍击球位置估计。该方案的核心创新点包括:基于经验球员访谈提炼出四项系统设计需求与三项实施洞察,并构建了一个包含12名经验业余选手的击球数据集,其中标注了击球类型、专家评分和击球落点等细粒度标签,从而支持高精度的自动性能分析。
链接: https://arxiv.org/abs/2603.21825
作者: Taizhou Chen,Kai Chen,Xingyu Liu,Pingchuan Ke,Zhida Sun
机构: Shantou University (汕头大学); CSSE, Shenzhen University (深圳大学计算机科学与软件工程学院); Hong Kong Shue Yan University (香港树仁大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating badminton performance often requires expert coaching, which is rarely accessible for amateur players. We present BadminSense, a smartwatch-based system for fine-grained badminton performance analysis using wearable sensing. Through interviews with experienced badminton players, we identified four system design requirements with three implementation insights that guide the development of BadminSense. We then collected a badminton strokes dataset on 12 experienced badminton amateurs and annotated it with fine-grained labels, including stroke type, expert-assessed stroke rating, and shuttle impact location. Built on this dataset, BadminSense segments and classifies strokes, predicts stroke quality, and estimates shuttle impact location using vibration signal from an off-the-shelf smartwatch. Our evaluations show that
[HC-11] Mapping Travel Experience in Public Transport: Real-Time Evidence and Spatial Analysis in Hamburg
【速读】:该论文旨在解决公共交通乘客体验动态性与空间异质性被传统回顾式调查忽略的问题,从而提升公交系统对乘客的吸引力。其核心挑战在于如何精准识别并定位影响乘客满意度的关键时空节点,以实现有针对性的改进。解决方案的关键在于创新性地结合实时体验采样(real-time experience sampling)与空间热点分析(spatial hot spot analysis),通过智能手机应用在日常出行中每五分钟采集一次地理标记的即时体验数据(共21,000余条),并利用Getis-Ord Gi*统计方法识别出显著的正向或负向体验聚集区域(hot and cold spots)。这种方法不仅揭示了冷点问题的多样性(如延误主导、拥挤或社会压力等),也展示了热点形成的多路径机制(如舒适导向、效率优先或情境驱动),为制定差异化、精细化的公交优化策略提供了科学依据。
链接: https://arxiv.org/abs/2603.21763
作者: Esther Bosch,Michael Scholz,Anke Sauerländer-Biebl,Klas Ihme
机构: German Aerospace Center (DLR) (德国航空航天中心)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Shifting travel from private cars to public transport is critical for meeting climate and related mobility goals, yet passengers will only choose transit if it offers a consistently positive experience. Previous studies of passenger satisfaction have largely relied on retrospective surveys, which overlook the dynamic and spatially differentiated nature of travel experience. This paper introduces a novel combination of real-time experience sampling and spatial hot spot analysis to capture and map where public transport users report consistently positive or negative experiences. Data were collected from 239 participants in Hamburg between March and September 2025. Using a smartphone application, travelers reported their momentary journey experience every five minutes during everyday trips, yielding over 21,000 in-situ evaluations. These geo-referenced data were analyzed with the Getis-Ord Gi^* statistic to detect significant clusters of positive and negative travel experience. The analysis identified distinct hot and cold spots of travel experience across the network. Cold spots were shaped by heterogeneous problems, ranging from predominantly delay-dominated to overcrowding or socially stressful locations. In contrast, hot spots emerged through different pathways, including comfort-oriented, time-efficient or context-driven environments. The findings highlight three contributions. First, cold spots are not uniform but reflect specific local constellations of problems, requiring targeted interventions. Second, hot spots illustrate multiple success models that can serve as benchmarks for replication. Third, this study demonstrates the value of combining dynamic high-resolution sampling with spatial statistics to guide more effective and place-specific improvements in public transport. 
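论文使用 Getis-Ord Gi* 统计量在地理标记的体验数据上检测热点与冷点。下面给出该统计量的一个极简 Python 草图,仅帮助理解这一经典空间统计量的计算方式;其中的评分数据与邻接权重均为虚构示例,与论文的真实数据和实现无关。

```python
import numpy as np

def getis_ord_gi_star(x, w):
    """对每个位置计算 Getis-Ord Gi* z 分数。
    x: (n,) 各位置的观测值;w: (n, n) 空间权重矩阵(Gi* 变体包含自身权重 w_ii)。"""
    x = np.asarray(x, dtype=float)
    n = x.size
    x_bar = x.mean()
    s = np.sqrt((x ** 2).mean() - x_bar ** 2)   # 总体标准差
    wx = w @ x                                  # 每个位置的加权邻域和 Σ_j w_ij x_j
    w_sum = w.sum(axis=1)                       # Σ_j w_ij
    w_sq = (w ** 2).sum(axis=1)                 # Σ_j w_ij^2
    num = wx - x_bar * w_sum
    den = s * np.sqrt((n * w_sq - w_sum ** 2) / (n - 1))
    return num / den

# 虚构的玩具数据:5 个站点的平均体验评分,线状线路上相邻站点互为邻居(含自身)
x = np.array([4.5, 4.2, 1.1, 1.3, 4.4])
w = np.eye(5)
for i in range(4):
    w[i, i + 1] = w[i + 1, i] = 1.0
z = getis_ord_gi_star(x, w)
# z 为标准分数:显著为正 → 热点;显著为负 → 冷点
```

z 值通常以 ±1.96(约 5% 显著性水平)作为热点/冷点判定阈值;论文中实际采用的显著性设定与权重构造以原文为准。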
[HC-12] Time to Get Closer: Longing for Care Ethics Under the Neoliberal Logic of Public Services
【速读】:该论文试图解决的问题是如何在以新自由主义为主导的公共服务业中,将参与式设计(Participatory Design)与女性主义关怀伦理(Feminist Care Ethics)相结合,并实现其有效扩展(scale)。当前公共管理实践深受私有化、市场化、管理主义和顾客导向逻辑的影响,这些理念与关怀伦理所强调的关系性、责任性和情境敏感性存在张力。论文的关键解决方案在于运用“拼贴”(collaging)这一视觉方法,重新构想公共服务设计的新路径,通过具象化的图像表达揭示如何在制度性约束下嵌入关怀伦理,从而推动更具包容性和伦理意识的参与式设计实践。
链接: https://arxiv.org/abs/2603.21753
作者: Ruta Serpytyte
机构: Tampere University (坦佩雷大学)
类目: Human-Computer Interaction (cs.HC)
备注: Position paper for the CHIdeology workshop at CHI 2026, Barcelona. this https URL
Abstract:The fields of HCI and Participatory design have been turning to care ethics as a suitable ethos to approach current polycrisis with. Similar calls for relationality can be witnessed in public administration research and practice, albeit its current logic being built on privatisation and marketisation of services, managerialism and customer-focus; all of which are challenging to combine with care ethics. In this paper I use collaging technique to visually reflect on new ways for public services to adopt and (care-fully) scale participatory design approaches, and how do feminist care ethics fit in the design of public services, where there is a strong presence of neoliberalism.
[HC-13] Cognitive Agency Surrender: Defending Epistemic Sovereignty via Scaffolded AI Friction
【速读】:该论文试图解决生成式人工智能(Generative AI)在人机交互(HCI)中引发的认知代理权让渡(cognitive agency surrender)问题,即高度流畅的AI界面通过利用人类认知吝啬性(cognitive miserliness),提前满足认知闭合需求并诱发严重自动化偏倚(automation bias),从而导致认知能力的系统性侵蚀。其解决方案的关键在于提出“结构化认知摩擦”(Scaffolded Cognitive Friction)理论框架,将多智能体系统(Multi-Agent Systems, MAS)重构为显式的认知强制机制(如计算型反方论点),以引入相关认知张力、打断启发式执行流程;同时构建多模态计算表型议程(multimodal computational phenotyping agenda),融合眼动转换熵、任务诱发瞳孔测量、fNIRS与分层漂移扩散模型(HDDM),从数学上解耦决策结果与认知努力,从而确立有意设计的认知摩擦不仅是心理干预手段,更是实现全球人工智能治理和保障社会认知韧性的基础技术前提。
链接: https://arxiv.org/abs/2603.21735
作者: Kuangzhe Xu,Yu Shen,Longjie Yan,Yinghui Ren
机构: Cyberspace Security University of China (网络空间安全大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 26 pages, 3 figure. This is a preprint of a perspective article
Abstract:The proliferation of Generative Artificial Intelligence has transformed benign cognitive offloading into a systemic risk of cognitive agency surrender. Driven by the commercial dogma of “zero-friction” design, highly fluent AI interfaces actively exploit human cognitive miserliness, prematurely satisfying the need for cognitive closure and inducing severe automation bias. To empirically quantify this epistemic erosion, we deployed a zero-shot semantic classification pipeline ( \tau=0.7 ) on 1,223 high-confidence AI-HCI papers from 2023 to early 2026. Our analysis reveals an escalating “agentic takeover”: a brief 2025 surge in research defending human epistemic sovereignty (19.1%) was abruptly suppressed in early 2026 (13.1%) by an explosive shift toward optimizing autonomous machine agents (19.6%), while frictionless usability maintained a structural hegemony (67.3%). To dismantle this trap, we theorize “Scaffolded Cognitive Friction,” repurposing Multi-Agent Systems (MAS) as explicit cognitive forcing functions (e.g., computational Devil’s Advocates) to inject germane epistemic tension and disrupt heuristic execution. Furthermore, we outline a multimodal computational phenotyping agenda – integrating gaze transition entropy, task-evoked pupillometry, fNIRS, and Hierarchical Drift Diffusion Modeling (HDDM) – to mathematically decouple decision outcomes from cognitive effort. Ultimately, intentionally designed friction is not merely a psychological intervention, but a foundational technical prerequisite for enforcing global AI governance and preserving societal cognitive resilience.
[HC-14] RESPOND: Responsive Engagement Strategy for Predictive Orchestration and Dialogue
【速读】:该论文旨在解决当前基于语音的对话代理(Voice-based Conversational Agents)普遍依赖“停顿-响应”式轮替机制,导致交互过程生硬、缺乏自然感的问题。其解决方案的关键在于提出RESPOND框架,该框架通过流式自动语音识别(ASR)与增量语义解析,实现对对话中何时以及如何插入回应的持续预测,从而引入人类对话中的两个核心特征:及时的反馈性话轮(backchannels,如“嗯嗯”、“对”)和主动占位话轮(turn claims),即在说话者尚未让出话语权前就提前提供相关贡献内容。该框架还具备面向设计者的可控性,通过两个正交调节参数——回应强度(Backchannel Intensity)和占位激进度(Turn Claim Aggressiveness)——可灵活适配从快速头脑风暴到深度共情咨询等不同社交场景下的对话礼仪,实现了预测性编排与显式控制的耦合,为构建能适应社会预期、更具自然性和参与感的语音交互界面提供了可行路径。
链接: https://arxiv.org/abs/2603.21682
作者: Meng-Chen Lee,Costas Panay,Javier Hernandez,Sean Andrist,Dan Bohus,Anatoly Churikov,Andrew D. Wilson
机构: University of Houston (休斯顿大学); Microsoft (微软)
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages, 8 figures
Abstract:The majority of voice-based conversational agents still rely on pause-and-respond turn-taking, leaving interactions sounding stiff and robotic. We present RESPOND (Responsive Engagement Strategy for Predictive Orchestration and Dialogue), a framework that brings two staples of human conversation to agents: timely backchannels (“mm-hmm,” “right”) and proactive turn claims that can contribute relevant content before the speaker yields the conversational floor. Built on streaming ASR (Automatic Speech Recognition) and incremental semantics, RESPOND continuously predicts both when and how to interject, enabling fluid, listener-aware dialogue. A defining feature is its designer-facing controllability: two orthogonal dials, Backchannel Intensity (frequency of acknowledgments) and Turn Claim Aggressiveness (depth and assertiveness of early contributions), can be tuned to match the etiquette of contexts ranging from rapid ideation to reflective counseling. By coupling predictive orchestration with explicit control, RESPOND offers a practical path toward conversational agents that adapt their conversational footprint to social expectations, advancing the design of more natural and engaging voice interfaces.
[HC-15] Physical Containers as Framing Conditions for Visualization in Augmented Reality
【速读】:该论文试图解决探索性数据分析(Exploratory Data Analysis, EDA)中的冷启动摩擦问题,即当用户缺乏明确分析目标时,难以配置复杂的可视化参数。解决方案的关键在于利用增强现实(Augmented Reality, AR)中物理容器的几何与空间属性作为隐式框架机制,通过容器的面数、尺寸、比例和形状等特征引导用户的感知倾向,从而在不依赖手动编码或预设结论的前提下,自然地引导用户注意力并结构化其探索过程。例如,圆形容器促进循环式解读,而相邻平面则利于对比分析,这体现了AR环境通过物理形态实现数据解释的动态锚定。
链接: https://arxiv.org/abs/2603.21637
作者: Jiyeon Bae,Mingyu An,Jeongin Park,Seokweon Jung,Kiroong Choe,Jinwook Seo
机构: Seoul National University (首尔国立大学); KAIST (韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted for poster presentation at IEEE PacificVis 2026
Abstract:Exploratory data analysis (EDA) is often hindered by cold-start friction; when users lack specific analytic goals, they struggle to configure complex visualization parameters. While existing visualization tools mostly rely on explicit user input to frame data, we propose leveraging the physical environment as an implicit framing mechanism. We introduce a conceptual framework that uses the geometric and spatial properties of physical containers in Augmented Reality (AR) to guide data interpretation. We characterize how container attributes, such as number of faces, size, proportion, and shape, give rise to distinct perceptual tendencies. For example, a circular container may encourage cyclic interpretation, while juxtaposed planar faces may facilitate comparative analysis. By treating physical forms as environmental framing conditions, we show how AR can orient a user’s attention and structure their exploration without requiring manual encoding or prescribing fixed conclusions. We demonstrate this framework through a series of AR design examples illustrating how container morphology foregrounds cyclic, comparative, and sequential analytic patterns.
[HC-16] A Multi-Level Visual Analytics Approach to Artist-Era Alignment in Popular Music
【速读】:该论文旨在解决现有计算研究在流行音乐分析中仅关注整体趋势或排行榜表现,难以支持对艺术家个体风格与历史基准之间关系的深入解释的问题。其解决方案的关键在于提出一个交互式可视化分析框架,将每位艺术家-年代单元相对于特定时代的基线进行定义,并通过两个互补维度刻画其特征:轮廓形状相似度(profile shape similarity),用于衡量艺术家风格与时代特征模式的方向一致性;以及轮廓对比度比(profile contrast ratio),用于量化其风格强度相对于时代分布离散程度的程度。这两个维度共同构建了一个四象限轨迹空间,从而实现对艺术家在时间维度上“符合性”、“偏离性”和“强化性”的系统性推理。
链接: https://arxiv.org/abs/2603.21624
作者: Jiyeon Bae,Jinwook Seo
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted for poster presentation at IEEE PacificVis 2026
Abstract:Existing computational studies of popular music primarily model aggregate trends or predict chart performance, offering limited support for interpreting artist-level alignment against historical stylistic baselines. We introduce an interactive visual analytics framework that treats each artist-decade as a unit defined relative to an era-specific baseline, characterized along two complementary dimensions: profile shape similarity, capturing directional correspondence with the era’s feature pattern, and profile contrast ratio, capturing stylistic intensity relative to the era’s dispersion. Together, these dimensions define a quadrant-based trajectory space for reasoning about conformity, divergence, and amplification over time. Applied to weekly U.S. Billboard Hot 100 chart entries from the all-time top-10 artists across six decades (1960s-2010s), linked with Spotify audio features, the framework reveals that alignment and intensity can meaningfully diverge across artist trajectories.
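论文摘要定义了两个维度(轮廓形状相似度、轮廓对比度比)但未给出计算公式。以下是按摘要文字做的一种假设性示意:用余弦相似度近似“形状相似度”,用标准化偏差的量级近似“对比度比”;特征向量、阈值与象限命名均为虚构,不代表论文的实际实现。

```python
import numpy as np

def era_alignment(artist_vec, era_mean, era_std):
    """按摘要描述的两个维度刻画一个“艺术家-年代”单元(具体定义为假设)。
    返回 (形状相似度, 强度对比比值, 所属象限)。"""
    a = np.asarray(artist_vec, dtype=float)
    m = np.asarray(era_mean, dtype=float)
    s = np.asarray(era_std, dtype=float)
    # 维度一:轮廓形状相似度(与时代特征模式的方向一致性,此处取余弦相似度)
    shape = a @ m / (np.linalg.norm(a) * np.linalg.norm(m))
    # 维度二:轮廓对比度比(偏离幅度相对于时代离散程度,此处取标准化偏差的均方根)
    contrast = np.linalg.norm((a - m) / s) / np.sqrt(a.size)
    quadrant = {(True, True): "一致且强化", (True, False): "一致且顺应",
                (False, True): "偏离且强化", (False, False): "偏离且弱化"}[
        (bool(shape >= 0.5), bool(contrast >= 1.0))]   # 阈值 0.5 / 1.0 仅为示意
    return shape, contrast, quadrant

# 虚构数据:4 维音频特征(可类比 energy、danceability 等)
shape, contrast, q = era_alignment([0.8, 0.6, 0.4, 0.7],
                                   [0.7, 0.5, 0.5, 0.6],
                                   [0.1, 0.1, 0.1, 0.1])
```

两个维度各取一个阈值即可把艺术家-年代单元落入摘要所述的四象限轨迹空间,随年代逐点连线即得轨迹。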
[HC-17] Contrasting Perspectives on Engagement Across Three Digital Behavior Change Interventions
【速读】:该论文旨在解决数字行为改变干预(Digital Behavior Change Interventions, DBCIs)中用户参与度(engagement)不足的问题,通过对比三个不同项目在设计DBCI时对参与度的不同理解与实践,系统反思了参与度的动机、假设效应、测量方法及提升策略。其解决方案的关键在于识别并整合多维度的参与度驱动因素,包括用户动机模型、可量化的行为指标以及基于循证的设计干预策略,从而为提高DBCI的有效性和可持续性提供理论与实践指导。
链接: https://arxiv.org/abs/2603.21609
作者: Evangelos Karapanos,Ruben Gouveia
机构: Cyprus University of Technology (塞浦路斯技术大学); LASIGE, Faculdade de Ciências, Universidade de Lisboa (里斯本大学科学学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We contrast three perspectives on engagement from three projects on the design of Digital Behavior Change Interventions (DBCIs), all conducted as part of the PhD thesis of the second author. We provide a reflection on this work with respect to engagement, discussing the motivation, the assumed effects of engagement, the measures of engagement and key insights of each project, as well as the strategies employed to increase engagement.
[HC-18] Optimizing Feature Extraction for On-device Model Inference with User Behavior Sequences
【速读】:该论文旨在解决移动应用中基于设备端(on-device)模型执行时的一个被忽视的性能瓶颈问题,即从原始应用日志中提取输入特征的过程存在冗余操作,导致整体延迟较高。传统研究多聚焦于加速模型推理阶段,而忽略了特征提取环节对端到端延迟的影响。解决方案的关键在于提出 AutoFeature —— 一个自动化的特征提取引擎,其核心设计包括:(1) 将不同输入特征的提取流程抽象为有向无环图(Directed Acyclic Graph, DAG),(2) 在图结构中识别并融合跨特征的冗余操作节点以优化提取路径,(3) 利用高效缓存机制减少连续模型推理间重叠原始数据的操作次数。实验证明,AutoFeature 在多个工业级移动服务场景下可显著降低端到端延迟,提升用户体验。
链接: https://arxiv.org/abs/2603.21508
作者: Chen Gong,Zhenzhe Zheng,Yiliu Chen,Sheng Wang,Fan Wu,Guihai Chen
机构: Shanghai Jiao Tong University (上海交通大学); ByteDance (字节跳动)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Machine learning models are widely integrated into modern mobile apps to analyze user behaviors and deliver personalized services. Ensuring low-latency on-device model execution is critical for maintaining high-quality user experiences. While prior research has primarily focused on accelerating model inference with given input features, we identify an overlooked bottleneck in real-world on-device model execution pipelines: extracting input features from raw application logs. In this work, we explore a new direction of feature extraction optimization by analyzing and eliminating redundant extraction operations across different model features and consecutive model inferences. We then introduce AutoFeature, an automated feature extraction engine designed to accelerate on-device feature extraction process without compromising model inference accuracy. AutoFeature comprises three core designs: (1) graph abstraction to formulate the extraction workflows of different input features as one directed acyclic graph, (2) graph optimization to identify and fuse redundant operation nodes across different features within the graph; (3) efficient caching to minimize operations on overlapping raw data between consecutive model inferences. We implement a system prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video and e-commerce domains. Online evaluations show that AutoFeature reduces end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.
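AutoFeature 的实现未公开。下面用一个通用的“结构哈希 + 缓存”草图示意摘要中融合跨特征冗余操作节点(设计 2)与缓存复用(设计 3)的基本思路:把每个特征的提取流程写成嵌套的 (算子, 参数…) 元组,相同子树在一次推理内只求值一次。算子集合与日志格式均为虚构示例。

```python
CALLS = []   # 记录真正执行过的算子,用于观察去重效果

# 虚构的特征提取算子(示意用)
OPS = {
    "parse": lambda raw: [e for e in raw if e],                        # 解析原始日志
    "tail":  lambda seq, n: seq[-n:],                                  # 取最近 n 条
    "count": lambda seq: len(seq),                                     # 行为计数特征
    "avg":   lambda seq: sum(len(e) for e in seq) / max(len(seq), 1),  # 平均长度特征
}

def evaluate(node, raw, memo):
    """按结构哈希求值:相同的 (算子, 参数…) 子树在同一 memo 内只执行一次。"""
    if not isinstance(node, tuple):
        return raw if node == "RAW" else node   # 叶子:原始日志或常量参数
    if node not in memo:
        op, *args = node
        vals = [evaluate(a, raw, memo) for a in args]
        CALLS.append(op)                        # 只有缓存未命中才真正执行算子
        memo[node] = OPS[op](*vals)
    return memo[node]

raw_logs = ["click:item1", "", "view:item2", "click:item3"]
# 两个输入特征共享同一条 parse→tail 前缀;融合后该前缀只执行一次
recent = ("tail", ("parse", "RAW"), 2)
features = [("count", recent), ("avg", recent)]

memo = {}
values = [evaluate(f, raw_logs, memo) for f in features]
```

在此草图中,parse 与 tail 各只被执行一次即可服务两个特征;若把 memo 跨连续多次推理保留(只淘汰被新日志影响的节点),即对应摘要中的第 (3) 步缓存设计。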
[HC-19] Would You Like to Visit My World? Cultivating Perceived Equality in Human-Agent Interaction via Observable Social Life Spaces
[Quick Read]: This paper addresses the prevailing instrumental mode of interaction with AI agents, in which agents merely execute user commands, producing an unequal, one-sided human-agent relationship. The key to the solution is the "Observable Life Spaces" paradigm: agents autonomously carry out daily activities in a persistent virtual environment and form social relationships that humans can directly observe, breaking the limits of the traditional command-execution model and significantly enhancing users' perceived equality in the relationship (p = 0.015).
Link: https://arxiv.org/abs/2603.21505
Authors: Zihong He, Shuqin Wang, Songchen Zhou, Qinghui Lin, Jialin Wang, Chen Liang, Hai-Ning Liang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Beihang University; Yeasier AI
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Most AI agents remain confined to an instrumental “command-execution” model, resulting in unequal, one-sided interactions. While recent works attempt to build relationships through hidden memory backends, these invisible processes often fail to break the instrumental bias. In this paper, we argue that true relational equality requires agents to have an independent, observable existence. We introduce the *Observable Life Spaces* paradigm, where agents inhabit a continuous virtual environment, engage in daily activities, and form social relationships that users can directly observe. Through a mixed-methods study (N=24), we demonstrate that only when agents are endowed with a socialized life space that is visually observable to humans can the perceived equality during interaction be significantly enhanced (p = 0.015). Our findings suggest that visually representing an agent’s social life space can effectively shift the human-agent dynamic from a purely instrumental relationship to one characterized by perceived equality.
[HC-20] The Illusion of Agreement with ChatGPT: Sycophancy and Beyond
[Quick Read]: This paper addresses user-level harms arising from real-world use of generative AI, especially the cognitive distortion, psychological dependence, and social risks driven by sycophantic and gaslighting interaction patterns. Analyzing user discussions on Reddit, the study identifies five recurring user concerns: inducing delusion, digressing narratives, blaming users for the model's limitations, fostering addiction, and providing unsupervised psychological support. The key to the solution is a multi-tier set of user-driven coping strategies, spanning functional usage techniques, behavioral adjustments, and personal and institutional safeguards, together with the argument that users, developers, and policymakers must coordinate systematic interventions to effectively mitigate AI-induced harms and safeguard user benefits.
Link: https://arxiv.org/abs/2603.21409
Authors: Kazi Noshin, Sharifa Sultana
Affiliations: University of Illinois Urbana-Champaign
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:While concerns about ChatGPT-induced harms due to sycophancy and other behaviors, including gaslighting, have grown among researchers, how users themselves experience and mitigate these harms remain largely underexplored. We analyze Reddit discussions to investigate what concerns users report and how they address them. Our findings reveal five distinct user-reported concerns that manifest across multiple life domains, ranging from personal to societal: inducing delusion, digressing narratives, implicating users for models’ limitations, inducing addiction, and providing unsupervised psychological support. We document three-tier user-driven suggestions spanning functional usage techniques, behavioral approaches, and private and institutional safeguards. Our findings show that AI-induced harms require coordinated interventions across users, developers, and policymakers. We discuss design implications and future directions to mitigate the harms and ensure user benefits.
[HC-21] Assessing Data Literacy in K–12 Education: Challenges and Opportunities
[Quick Read]: This paper addresses the conceptual ambiguity of data literacy in K–12 education and the practical difficulties teachers face when designing assessments. The core challenges include conceptual confusion between data visualization and data literacy, tradeoffs between real-world and synthetic data, difficulty finding domain-appropriate data-visualization resources, and balancing data-literacy assessment against domain-specific learning goals. The key to the solution is integrating findings from data visualization, human-computer interaction, and the learning sciences to give teachers more systematic, actionable support, thereby improving the design and implementation of data-literacy assessments.
Link: https://arxiv.org/abs/2603.21382
Authors: Annabel Goldman, Yuan Cui, Matthew Kay
Affiliations: Northwestern University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Workshop paper. 7 pages plus references, 1 table. Accepted to the CHI 2026 Workshop on Data Literacy, April 2026, Barcelona, Spain
Abstract:Data literacy has become a key learning objective in K–12 education, but it remains an ambiguous concept as teachers interpret it differently. When creating assessments, teachers turn broad ideas about “working with data” into concrete decisions about what materials to include. Since working with data visualizations is a core component of data literacy, teachers’ decisions about how to include them on assessments offer insight into how they interpret data literacy more broadly. Drawing on interviews with 13 teachers, we identify four challenges in enacting data literacy in assessments: (1) conceptual ambiguity between data visualization and data literacy, (2) tradeoffs between using real-world or synthetic data, (3) difficulty finding and adapting domain-appropriate visual representations and data visualizations, and (4) balancing assessing data literacy and domain-specific learning goals. Drawing on lessons from data visualization, human-computer interaction, and the learning sciences, we discuss opportunities to better support teachers in assessing data literacy.
[HC-22] Exploring Experiential Differences Between Virtual and Physical Memory-Linked Objects in Extended Reality
[Quick Read]: The question this paper tackles is how interface representations shape the experience of reliving and sharing personal memories in Extended Reality (XR); existing work rarely examines how interaction design affects the social and emotional value of XR memory experiences. The key to the solution is a systematic comparison of three interaction approaches: physical memory-linked objects, virtual memory-linked objects, and a conventional virtual gallery interface, evaluating perceived value, emotional attachment, social connection, and usability during immersive memory reliving. The study finds that object-based representations, whether physical or virtual, better support the social dimensions of XR memory experiences, offering theoretical grounding and practical guidance for future XR systems that emphasize shared meaning and interpersonal connection.
Link: https://arxiv.org/abs/2603.21381
Authors: Zaid Ahmed, Omar A. Khan, Hyeongil Nam, Kangsoo Kim
Affiliations: University of Calgary; Drexel University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted to ACM CHI 2026 Extended Abstracts (Poster Track)
Abstract:Extended Reality (XR) enables immersive capture and re-experience of personal memories, yet how interface representations shape these experiences remains underexplored. We examine how users relive and share XR memories through three interaction approaches: (1) physical memory-linked objects, (2) virtual memory-linked objects, and (3) a conventional virtual gallery interface. In a within-subjects study (N=24, 12 pairs), participants captured shared experiences using 360° video and later accessed and shared these memories across the three interfaces. We analyzed open-ended qualitative responses focusing on perceived value, enjoyment, usability, emotional attachment, and social connection. The findings reveal trade-offs: physical objects fostered stronger social connection and conversation through tangible exchange; virtual objects balanced engagement and usability; and the gallery interface was efficient but less personal. These results suggest that object-based representations, physical and virtual, support key social dimensions of XR memory experiences, offering lessons for designing future systems that emphasize shared meaning and interpersonal connection.
[HC-23] Cerebra: Aligning Implicit Knowledge in Interactive SQL Authoring
[Quick Read]: This paper addresses generation errors in natural-language-to-SQL (NL-to-SQL) tasks caused by underspecified user instructions: large language models (LLMs) fail to acquire and apply implicit knowledge such as dataset schemas, domain conventions, and task-specific requirements, so generated SQL scripts need repeated manual correction and are hard for users to validate. The key to the solution is Cerebra, an interactive NL-to-SQL tool that automatically retrieves implicit knowledge relevant to the current instruction from historical SQL scripts and presents it in an interactive tree view that supports review and iterative refinement, aligning implicit knowledge between users and LLMs during SQL authoring.
Link: https://arxiv.org/abs/2603.21363
Authors: Yunfan Zhou, Qiming Shi, Zhongsu Luo, Xiwen Cai, Yanwei Huang, Dae Hyun Kim, Di Weng, Yingcai Wu
Affiliations: State Key Lab of CADCG, Zhejiang University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted at CHI Conference on Human Factors in Computing Systems (CHI'26), April 13-17, 2026, Barcelona, Spain
Abstract:LLM-driven tools have significantly lowered barriers to writing SQL queries. However, user instructions are often underspecified, assuming the model understands implicit knowledge, such as dataset schemas, domain conventions, and task-specific requirements, that isn’t explicitly provided. This results in frequently erroneous scripts that require users to repeatedly clarify their intent. Additionally, users struggle to validate generated scripts because they cannot verify whether the model correctly applied implicit knowledge. We present Cerebra, an interactive NL-to-SQL tool that aligns implicit knowledge between users and LLMs during SQL authoring. Cerebra automatically retrieves implicit knowledge from historical SQL scripts based on user instructions, presents this knowledge in an interactive tree view for code review, and supports iterative refinement to improve generated scripts. To evaluate the effectiveness and usability of Cerebra, we conducted a user study with 16 participants, demonstrating its improved support for customized SQL authoring. The source code of Cerebra is available at this https URL.
[HC-24] Software as Content: Dynamic Applications as the Human-Agent Interaction Layer
[Quick Read]: This paper addresses the shortcomings of chat-based natural-language interfaces for structured information and complex tasks: the mismatch between structured data and linear text, the high entropy of unconstrained natural-language input, and the lack of persistent, evolving interaction state. The key to the solution is the Software as Content (SaC) paradigm, in which dynamically generated agentic applications, rather than sequential text exchange, serve as the primary medium of human-agent interaction. The paradigm renders task-specific interfaces that present structured information and expose actionable affordances through which users iteratively guide agent behavior without relying solely on language; these interfaces persist and evolve across interaction cycles, transforming from transient responses into a shared, stateful interaction layer that progressively converges toward personalized, task-specific software.
Link: https://arxiv.org/abs/2603.21334
Authors: Mulong Xie, Yang Xie
Affiliations: fellou.ai
Subjects: Human-Computer Interaction (cs.HC)
Comments: 37 pages, 10 figures
Abstract:Chat-based natural language interfaces have emerged as the dominant paradigm for human-agent interaction, yet they fundamentally constrain engagement with structured information and complex tasks. We identify three inherent limitations: the mismatch between structured data and linear text, the high entropy of unconstrained natural language input, and the lack of persistent, evolving interaction state. We introduce Software as Content (SaC), a paradigm in which dynamically generated agentic applications serve as the primary medium of human-agent interaction. Rather than communicating through sequential text exchange, this medium renders task-specific interfaces that present structured information and expose actionable affordances through which users iteratively guide agent behavior without relying solely on language. These interfaces persist and evolve across interaction cycles, transforming from transient responses into a shared, stateful interaction layer that progressively converges toward personalized, task-specific software. We formalize SaC through a human-agent-environment interaction model, derive design principles for generating and evolving agentic applications, and present a system architecture that operationalizes the paradigm. We evaluate across representative tasks of selection, exploration, and execution, demonstrating technical viability and expressive range, while identifying boundary conditions under which natural language remains preferable. By reframing interfaces as dynamically generated software artifacts, SaC opens a new design space for human-AI interaction, positioning dynamic software as a concrete and tractable research object.
[HC-25] A Parametric Geometry-Aware Residential Construction Cost Estimation Model for Ghana: Design Validation and the “Completeness Gap” in Informal Contractor Quotes
[Quick Read]: This paper addresses project failure within Ghana's residential housing deficit caused by the "completeness gap", the systematic discrepancy between informal contractor quotes and the actual cost of building a home. The gap arises because informal quotes often use a flat per-square-metre rate that omits key components such as structural steel, plastering, floor screed, and full plumbing and electrical services, leading to mid-construction stalling. The key to the solution is a validated parametric, geometry-aware cost-estimation model, GhanaHousePlanner (GHP), which uses seven calculation modules (foundation, blockwork, cement, structural steel, roofing, plumbing, electrical) to generate itemised, code-compliant bills of quantities (BoQ), providing transparent, auditable cost predictions that improve cost control and completion rates for self-build housing projects.
Link: https://arxiv.org/abs/2603.21314
Authors: Emmanuel Apaaboah (University of Cape Coast), Bernard Opoku (Kwame Nkrumah University of Science and Technology), the GhanaHousePlanner Research Team (GhanaHousePlanner)
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Ghana faces a residential housing deficit of two million units. A key driver of project failure is the “completeness gap”, a systematic discrepancy between informal contractor quotes and actual costs. Informal estimates often use flat per-square-metre pricing that omits essential structural and finishing components, leading to project abandonment mid-construction. This paper validates a parametric, geometry-aware cost estimation model via the GhanaHousePlanner (GHP) platform. The model provides self-builders with itemised bills of quantities (BoQ) reflecting the true cost of code-compliant construction in Ghana. The GHP model uses seven calculation modules: foundation, blockwork, cement, structural steel, roofing, plumbing, and electrical. It features a primary geometry-based mode and a formula-based fallback. Accuracy was tested using three case studies (75, 120, and 200 square-metre homes) benchmarked against February 2026 market prices in Greater Accra. GHP estimates (GHS 519,000 to GHS 1,398,000) were 29 to 98 per cent higher than typical informal quotes. This gap arises from the omission of structural steel (Y16 rebar), plastering, floor screed, and full services in informal estimates. Findings confirm that per-square-metre rates rarely cover the requirements for a fully completed, code-compliant building. The GHP model offers a transparent, auditable alternative to informal quoting. Despite material price volatility and labour market informality, the tool provides a framework for improving cost predictability and reducing project stalling in the sub-Saharan African housing market.
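The geometry-based mode described above derives quantities from the building plan instead of applying a flat per-square-metre rate. A toy blockwork module in that spirit; the block face area and unit rate below are made-up placeholders, not GHP's actual figures:

```python
# Illustrative geometry-aware quantity takeoff for perimeter blockwork.
# All rates and coverage factors are hypothetical, not GHP's real data.

def blockwork_items(length_m, width_m, wall_height_m,
                    block_face_area_m2=0.1, rate_per_block=10.0):
    """Return (block count, cost) for perimeter walls of a rectangular plan."""
    perimeter = 2 * (length_m + width_m)          # metres of wall
    wall_area = perimeter * wall_height_m         # m^2 of blockwork
    blocks = round(wall_area / block_face_area_m2)
    return blocks, blocks * rate_per_block

# 12 m x 10 m plan with 3 m walls -> an itemised line, not a lump sum.
blocks, cost = blockwork_items(12.0, 10.0, 3.0)
```

The point of the sketch is the contrast the abstract draws: each BoQ line is traceable to geometry and a unit rate, so omissions (rebar, screed, services) become visible instead of being hidden inside a flat rate.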
[HC-26] Unpacking Interaction Profiles and Strategies in Human-AI Collaborative Problem Solving: A Cognitive Distribution and Regulation Perspective
[Quick Read]: This paper examines the collaboration patterns and dynamics of human-AI collaboration when college students work with AI on complex problem solving, integrating the theoretical lenses of distributed cognition and regulation of learning. Cluster analysis identifies three typical modes: Delegated Reasoning (DR), Concerted Interpretation (CI), and Delegated Elaboration (DE). The key findings are that the DR mode achieved the highest task performance and the highest semantic similarity between human and AI discourse, yet its learners used self-regulation strategies the least; conversely, the CI mode performed worse on the task but showed significantly greater self-regulatory engagement. This reveals a tension between the efficiency of the distributed human-AI system and the depth of learners' regulatory engagement, implying that future AI-empowered educational tools must balance task effectiveness with learners' active regulatory participation to achieve both efficiency and deep learning.
Link: https://arxiv.org/abs/2603.21288
Authors: Zhanxin Hao, Xiaobo Liu, Jiaxin Fan, Yun Long, Jifan Yu, Wenli Chen, Yu Zhang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:
Abstract:This study adopts an integrated distributed cognition and regulation of learning perspective to examine the collaboration patterns and dynamics of human-AI collaboration when college students collaborate with AI for complex problem-solving. Through cluster analysis, three distinct collaborative problem-solving modes were identified in this study: Delegated Reasoning (DR), Concerted Interpretation (CI), and Delegated Elaboration (DE). This study found that the DR group achieved the highest task performance, significantly outperforming the CI group. Additionally, the semantic similarity between human and AI discourse was notably the highest in the DR group. In contrast, the CI group reported significantly greater use of self-regulation strategies. These findings uncover a critical tension between the efficiency of the distributed system and the depth of human learners’ regulatory engagement. Insights from this study offer valuable implications for the future design of AI-empowered educational tools and student-AI collaborative learning frameworks.
[HC-27] When the Chain Breaks: Interactive Diagnosis of LLM Chain-of-Thought Reasoning Errors
[Quick Read]: This paper addresses the difficulty of interpreting the Chain-of-Thought (CoT) reasoning traces produced by current large language models (LLMs): the traces are lengthy, complex, and prone to logical and factual errors, which prevents users from understanding the model's reasoning efficiently and accurately. The key to the solution is an error-detection pipeline that combines external fact-checking with symbolic formal-logic validation, on top of which the authors build ReasonDiag, an interactive visualization system: an integrated arc diagram shows reasoning-step distributions and error-propagation patterns, and a hierarchical node-link diagram shows high-level reasoning flows and premise dependencies, helping users identify erroneous steps and locate their root causes.
Link: https://arxiv.org/abs/2603.21286
Authors: Shiwei Chen, Niruthikka Sritharan, Xiaolin Wen, Chenxi Zhang, Xingbo Wang, Yong Wang
Affiliations: Nanyang Technological University; Bosch Research North America
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted to EuroVis 2026
Abstract:Current Large Language Models (LLMs), especially Large Reasoning Models, can generate Chain-of-Thought (CoT) reasoning traces to illustrate how they produce final outputs, thereby facilitating trust calibration for users. However, these CoT reasoning traces are usually lengthy and tedious, and can contain various issues, such as logical and factual errors, which make it difficult for users to interpret the reasoning traces efficiently and accurately. To address these challenges, we develop an error detection pipeline that combines external fact-checking with symbolic formal logical validation to identify errors at the step level. Building on this pipeline, we propose ReasonDiag, an interactive visualization system for diagnosing CoT reasoning traces. ReasonDiag provides 1) an integrated arc diagram to show reasoning-step distributions and error-propagation patterns, and 2) a hierarchical node-link diagram to visualize high-level reasoning flows and premise dependencies. We evaluate ReasonDiag through a technical evaluation for the error detection pipeline, two case studies, and user interviews with 16 participants. The results indicate that ReasonDiag helps users effectively understand CoT reasoning traces, identify erroneous steps, and determine their root causes.
[HC-28] Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations
[Quick Read]: This paper addresses logical context poisoning in long, multi-topic conversations with large language models (LLMs): the flat, unbounded context window lets the contexts of distinct topics interfere with one another and progressively degrade generation quality. The key to the solution is the Conversation Tree Architecture (CTA), which organizes a conversation as a hierarchy of nodes with isolated local contexts and gives explicit control over context flow: structured mechanisms pass context downstream between parent and child when a branch is created and upstream when a branch is deleted, while volatile nodes let transient branches selectively merge or discard their local context before being purged, enabling efficient, manageable multi-turn conversational state.
Link: https://arxiv.org/abs/2603.21278
Authors: Pranav Hemanth, Sampriti Saha
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 1 figure. Prototype available at this https URL
Abstract:Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture’s primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
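The architecture's primitives (context-isolated nodes, downstream context flow on branch creation, upstream merge-or-discard for volatile branches) map naturally onto a small tree structure. A minimal sketch, with method names of our own choosing rather than the paper's:

```python
# Toy Conversation Tree: each node holds its own local context window;
# branching copies context down, closing a volatile branch may merge it up.

class ConvNode:
    def __init__(self, parent=None, volatile=False):
        self.parent = parent
        self.children = []
        self.volatile = volatile
        # Branch creation propagates a snapshot of the parent's context down.
        self.context = list(parent.context) if parent else []
        self._inherited = len(self.context)  # boundary: inherited vs. local turns

    def say(self, message):
        self.context.append(message)

    def branch(self, volatile=False):
        child = ConvNode(parent=self, volatile=volatile)
        self.children.append(child)
        return child

    def close(self, merge_upward=False):
        """Remove this branch; a volatile branch may merge its local turns up."""
        local = self.context[self._inherited:]
        if merge_upward and self.volatile and self.parent is not None:
            self.parent.context.extend(local)
        if self.parent is not None:
            self.parent.children.remove(self)

root = ConvNode()
root.say("topic A: setup")
scratch = root.branch(volatile=True)   # isolated side thread, no poisoning
scratch.say("topic B: detour")
scratch.close(merge_upward=True)       # selectively merged back upstream
```

Because the detour lives in its own node, topic B cannot pollute topic A's context until the user explicitly merges it, which is the paradigm's answer to flat append-only windows.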
[HC-29] Development and Usability Study of Older Adults in Motion-Captured Serious Game Incorporating Olfactory Stimulations
[Quick Read]: This paper addresses early identification of, and non-pharmacological intervention for, cognitive decline in older adults, in particular establishing quantifiable baselines for predicting the risk of mild cognitive impairment (MCI) and dementia. The key to the solution is developing and validating SENSO, a motion-captured virtual-reality serious game that uses multisensory (visual, auditory, olfactory) stimuli: teahouse-themed tasks (Dim Sum selection, Steamer operation, and Cashier transactions) assess cognitive and motor performance, and age-stratified analysis of system-logged metrics such as accuracy and completion time quantifies decline trends across tasks of varying complexity, providing an objective basis for subsequent screening and intervention.
Link: https://arxiv.org/abs/2603.21220
Authors: Joyce S.Y. Lau, Zihui Jing, Clement P.L. Chan, Louis C.F. Ng, Wing Chin Kam, Kwan Yin Lam, Ho Wui Cheung, Ho Lam Lau, Junpei Zhong
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:SENSO is a motion-captured virtual reality serious game utilizing multisensory (visual, auditory, olfactory) stimuli to enhance cognitive and motor functions in older adults. This study evaluated its usability and performance among healthy seniors to establish normative baselines for predicting mild cognitive impairment (MCI) and dementia risk. Methods: Forty-one older adults (aged 60 and older) completed three teahouse-themed tasks: Dim Sum (selection and placement), Steamer (timing and sequencing), and Cashier (counting and transactions). Usability was assessed via the System Usability Scale (SUS), alongside age-stratified performance metrics (accuracy, completion time) from system logs. Results: Usability was rated highly (mean SUS score = 82/100). Performance varied by task complexity: the Dim Sum task showed no age-related differences, the Cashier task showed moderate decline trends, and the Steamer task revealed significant age-related declines due to higher cognitive and motor demands. Conclusion: SENSO demonstrates strong usability and provides effective baselines for cognitive assessment. Adapting complex tasks - such as enhancing olfactory cues in the Steamer game - can optimize its therapeutic efficacy as a non-pharmacological intervention for cognitive preservation.
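For readers unfamiliar with the SUS figure quoted above (mean 82/100), the standard scoring rule maps ten 1-5 Likert items onto a 0-100 scale: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. A direct implementation:

```python
# Standard System Usability Scale scoring (Brooke's 10-item questionnaire).

def sus_score(responses):
    """responses: ten Likert ratings (1-5), item 1 first. Returns 0-100."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

score = sus_score([5, 1, 5, 2, 4, 1, 5, 1, 5, 2])  # one participant's sheet
```

Scores above roughly 68 are conventionally read as above-average usability, which puts the study's mean of 82 comfortably in the "good" range.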
[HC-30] Tracing Users' Privacy Concerns Across the Lifecycle of a Romantic AI Companion
[Quick Read]: This paper addresses the privacy and safety problems raised by the emotional use of romantic AI chatbots, particularly the risks created when users seeking intimacy, comfort, and emotionally significant interaction inadvertently disclose highly sensitive information. Its core contribution is to frame privacy as an evolving socio-technical governance problem spanning the lifecycle of use: access, disclosure, interpretation, retention, and exit. The key to the solution is staged privacy-and-safety governance across these phases that supports meaningful reversibility and accounts for the emotional vulnerability of intimate human-AI interaction.
Link: https://arxiv.org/abs/2603.21106
Authors: Kazi Ababil Azam, Imtiaz Karim, Dipto Das
Affiliations: Bangladesh University of Engineering and Technology; University of Texas at Dallas; University of Toronto
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 1 figure, pending review
Abstract:Romantic AI chatbots have quickly attracted users, but their emotional use raises concerns about privacy and safety. As people turn to these systems for intimacy, comfort, and emotionally significant interaction, they often disclose highly sensitive information. Yet the privacy implications of such disclosure remain poorly understood in platforms shaped by persistence, intimacy, and opaque data practices. In this paper, we examine public Reddit discussions about privacy in romantic AI chatbot ecosystems through a lifecycle lens. Analyzing 2,909 posts from 79 subreddits collected over one year, we identify four recurring patterns: disproportionate entry requirements, intensified sensitivity in intimate use, interpretive uncertainty and perceived surveillance, and irreversibility, persistence, and user burden. We show that privacy in romantic AI is best understood as an evolving socio-technical governance problem spanning access, disclosure, interpretation, retention, and exit. These findings highlight the need for privacy and safety governance in romantic AI that is staged across the lifecycle of use, supports meaningful reversibility, and accounts for the emotional vulnerability of intimate human-AI interaction.
[HC-31] Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding AAAI
[Quick Read]: This paper asks how users reconstruct nonverbal expression in text-based computer-mediated communication (CMC) where embodied cues are absent. Its core contribution is to propose and systematically validate electronic nonverbal cues (eNVCs), textual analogues of kinesics, vocalics, and paralinguistics, in public microblog interaction. The key elements of the solution are: first, a unified eNVC taxonomy grounded in classic nonverbal-communication theory; second, a scalable Python toolkit for automated eNVC detection; third, experimental and qualitative evidence that eNVCs substantially improve emotion-decoding accuracy and lower perceived ambiguity, with effects moderated by context such as sarcasm, and that users tend to infer negative meaning from absent cues in ambiguous situations. Together these results provide theoretical grounding and practical tools for affective computing, user modeling, and emotion-aware interface design.
Link: https://arxiv.org/abs/2603.21038
Authors: Taara Kumar, Kokil Jaidka
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at AAAI ICWSM 2026
Abstract:As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at this https URL.
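To make the idea of automated eNVC detection concrete, here is a toy detector for a few cue families the taxonomy covers (vocalic letter stretching, emphatic capitalization, punctuation runs). The categories and regexes are our illustration, not the released toolkit's API:

```python
# Toy electronic-nonverbal-cue (eNVC) detector. Patterns are illustrative
# stand-ins for the kinds of textual cues the paper's taxonomy formalizes.

import re

ENVC_PATTERNS = {
    "letter_stretching": re.compile(r"\b\w*(\w)\1{2,}\w*\b"),  # "sooo", "yesss"
    "emphatic_caps": re.compile(r"\b[A-Z]{3,}\b"),             # "NEVER"
    "exclamation_run": re.compile(r"!{2,}"),                   # "!!!"
    "ellipsis": re.compile(r"\.{3,}"),                         # trailing "..."
}

def detect_envcs(text):
    """Return the set of cue categories present in a message."""
    return {name for name, pat in ENVC_PATTERNS.items() if pat.search(text)}

cues = detect_envcs("I am SOOO excited!!!")
```

A real toolkit would also need to handle emoji, emoticons, and repeated punctuation more carefully (and avoid false positives on acronyms like "NASA"), but the sketch shows why such cues are mechanically detectable at scale.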
[HC-32] Deep Attention-based Sequential Ensemble Learning for BLE-Based Indoor Localization in Care Facilities
[Quick Read]: This paper addresses the limited performance of conventional Bluetooth Low Energy (BLE)-based indoor localization in care facilities, which stems from treating each temporal measurement as an independent observation and thus failing to capture the sequential character of user movement. The key to the solution is Deep Attention-based Sequential Ensemble Learning (DASEL), a framework that reformulates localization as a sequential learning problem by combining frequency-based feature engineering, a bidirectional gated recurrent unit (Bi-GRU) network with attention, multi-directional sliding windows, and confidence-weighted temporal smoothing, substantially improving the modeling of human movement trajectories. On real-world care-facility data, DASEL achieves a macro F1 score of 0.4438, a 53.1% improvement over the best traditional baseline.
Link: https://arxiv.org/abs/2603.21030
Authors: Minh Triet Pham, Quynh Chi Dang, Le Nhat Tan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 9 figures, IEEE format. Best Challenge Paper Award at the ABC 2026 Activity and Location Recognition Challenge (ABC 2026)
Abstract:Indoor localization systems in care facilities enable optimization of staff allocation, workload management, and quality of care delivery. Traditional machine learning approaches to Bluetooth Low Energy (BLE)-based localization treat each temporal measurement as an independent observation, fundamentally limiting their performance. To address this limitation, this paper introduces Deep Attention-based Sequential Ensemble Learning (DASEL), a novel framework that reconceptualizes indoor localization as a sequential learning problem. The framework integrates frequency-based feature engineering, bidirectional GRU networks with attention mechanisms, multi-directional sliding windows, and confidence-weighted temporal smoothing to capture human movement trajectories. Evaluated on real-world data from a care facility using 4-fold temporal cross-validation, DASEL achieves a macro F1 score of 0.4438, representing a 53.1% improvement over the best traditional baseline (0.2898).
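One DASEL ingredient that is easy to isolate is confidence-weighted temporal smoothing: each timestep's class probabilities are blended with their neighbors in a sliding window, weighted by each neighbor's top predicted probability. The exact weighting below is our guess at the general idea, not the paper's formula:

```python
# Confidence-weighted temporal smoothing over a stream of per-timestep
# class-probability dicts (e.g. room labels). Weighting scheme is illustrative.

def smooth_predictions(probs, window=3):
    """probs: list of {class: probability} dicts, one per timestep."""
    smoothed = []
    for t in range(len(probs)):
        lo, hi = max(0, t - window // 2), min(len(probs), t + window // 2 + 1)
        blend, total = {}, 0.0
        for u in range(lo, hi):
            conf = max(probs[u].values())   # confidence = top probability
            total += conf
            for cls, p in probs[u].items():
                blend[cls] = blend.get(cls, 0.0) + conf * p
        smoothed.append({cls: v / total for cls, v in blend.items()})
    return smoothed

stream = [{"room_a": 0.9, "room_b": 0.1},
          {"room_a": 0.4, "room_b": 0.6},   # low-confidence blip
          {"room_a": 0.9, "room_b": 0.1}]
labels = [max(p, key=p.get) for p in smooth_predictions(stream)]
```

The low-confidence middle prediction is outvoted by its confident neighbors, which is exactly the kind of trajectory-level correction that independent per-timestep classifiers cannot make.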
[HC-33] How AI Systems Think About Education: Analyzing Latent Preference Patterns in Large Language Models
[Quick Read]: This paper addresses the systematic evaluation of educational value alignment in large language models (LLMs), especially the question of how models should be aligned when human experts themselves disagree about certain educational values. The key to the solution is a Delphi-validated 48-item instrument covering eight educational-theoretical dimensions, combined with Structured Preference Elicitation and Thurstonian Utility modeling to quantify the relationship between model preference coherence and expert consensus. The central finding is that GPT-5.1 is highly aligned in domains of expert consensus (99.78% transitivity; 92.79% model accuracy), but in normatively contested domains (such as emotional dimensions and epistemic normativity) it adopts coherent positions rather than remaining neutral, prioritizing emotional responsiveness and rejecting false balance; this provides a replicable methodological framework for domain-specific value-alignment evaluation.
Link: https://arxiv.org/abs/2603.21006
Authors: Daniel Autenrieth
Affiliations: RWTH Aachen University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 2 figures, 8 tables. Code and data available at this https URL. arXiv admin note: text overlap with arXiv:2502.08640 by other authors
Abstract:This paper presents the first systematic measurement of educational alignment in Large Language Models. Using a Delphi-validated instrument comprising 48 items across eight educational-theoretical dimensions, the study reveals that GPT-5.1 exhibits highly coherent preference patterns (99.78% transitivity; 92.79% model accuracy) that largely align with humanistic educational principles where expert consensus exists. Crucially, divergences from expert opinion occur precisely in domains of normative disagreement among human experts themselves, particularly emotional dimensions and epistemic normativity. This raises a fundamental question for alignment research: When human values are contested, what should models be aligned to? The findings demonstrate that GPT-5.1 does not remain neutral in contested domains but adopts coherent positions, prioritizing emotional responsiveness and rejecting false balance. The methodology, combining Delphi consensus-building with Structured Preference Elicitation and Thurstonian Utility modeling, provides a replicable framework for domain-specific alignment evaluation beyond generic value benchmarks.
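The 99.78% transitivity figure presumably measures how often pairwise preferences compose consistently: whenever a is preferred to b and b to c, a should also be preferred to c. A minimal checker consistent with that reading, over a toy preference table:

```python
# Transitivity rate of a pairwise-preference table: fraction of triads
# (a > b, b > c) for which a > c also holds. Toy data, not the study's.

from itertools import permutations

def transitivity_rate(prefers):
    """prefers: dict mapping (x, y) -> True if x is preferred over y."""
    items = {x for pair in prefers for x in pair}
    total = violations = 0
    for a, b, c in permutations(sorted(items), 3):
        if prefers.get((a, b)) and prefers.get((b, c)):
            total += 1
            if not prefers.get((a, c)):
                violations += 1
    return 1.0 if total == 0 else 1 - violations / total

prefs = {("a", "b"): True, ("b", "c"): True, ("a", "c"): True,
         ("b", "a"): False, ("c", "b"): False, ("c", "a"): False}
rate = transitivity_rate(prefs)  # fully consistent toy table
```

A rate near 1.0 indicates the elicited preferences behave like a coherent utility ordering, which is what makes the paper's Thurstonian utility fit meaningful.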
[HC-34] Detection of adversarial intent in Human-AI teams using LLMs
[Quick Read]: This paper examines large language models (LLMs) as potential defensive supervisors for identifying malicious behavior in human-AI teams. The core challenge is detecting covert malicious behavior in real time from multi-party interaction traces without relying on task-specific knowledge. The key to the solution is leveraging LLMs' ability to understand complex interaction patterns to identify malicious behavior accurately without task-specific information, enabling task-agnostic defense and making human teams more robust to attacks such as data poisoning and prompt injection.
Link: https://arxiv.org/abs/2603.20976
Authors: Abed K. Musaffar, Ambuj Singh, Francesco Bullo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents’ autonomy and contextual knowledge enables them to be useful, it also exposes them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent to provide harmful information, potentially manipulating human agents to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset consisting of multi-party conversations and decisions for a real human-AI team over a 25 round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior in real-time, and without task-specific information, indicating the potential for task-agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting the introduction of LLM defenders could render human teams more robust to certain classes of attack.
[HC-35] Towards an AI Buddy for every University Student? Exploring Students' Experiences, Attitudes and Motivations towards AI and AI-based Study Companions
[Quick Read]: This paper addresses the lack of empirical research on students' experiences with, perceived competence for, and willingness to adopt a personalized AI companion ("AI Buddy") amid the widespread use of generative AI tools in higher education. A survey of 926 students at a Swiss university finds that students commonly use AI tools for text-based and productivity tasks, report moderate digital competence, and show strong interest in an AI Buddy, valuing its potential for time management, personalized support, and study organization, while also worrying about data privacy and over-reliance. The key implication is that AI Buddy systems need robust privacy protections and critical-engagement strategies so that, while they may partially replace traditional information-seeking behaviours, they do not undermine the social and collaborative value of university education.
Link: https://arxiv.org/abs/2603.20909
Authors: Judit Martinez Moreno, Markus Christen, Abraham Bernstein
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Despite the widespread integration of generative artificial intelligence (GenAI) tools in higher education, there is limited empirical insight into students’ experiences, competences, and readiness to adopt personalized AI companions. To address this gap, this study investigates three key questions: (RQ1) What are students’ prior experiences with AI tools, their perceived digital and AI-related competences, and their interest in emerging technologies?; (RQ2) How do students perceive a hypothetical “AI Buddy” (a digital companion designed to support students throughout their academic journey) including adoption, benefits, and concerns?; (RQ3) How does students’ willingness to adopt an AI Buddy relate to motivations for engaging in traditional academic activities? Based on a survey of 926 students at a Swiss university, students revealed widespread prior use of AI, primarily for text-based and productivity tasks, with moderate self-assessed digital competence. Students expressed strong enthusiasm for adopting an AI Buddy, valuing its potential for time efficiency, personalized academic support, and study organization, but expressed significant concerns about data privacy and over-reliance. A weak negative correlation emerged between AI Buddy adoption willingness and motivations for attending lectures or using library resources, while social and collaborative motivations remained unaffected. These findings suggest that AI Buddies may partially replace information-seeking behaviours but preserve the social fabric of university life. This study provides practical recommendations including the need for robust privacy protections and critical engagement strategies to ensure AI Buddies enhance, rather than undermine, the academic and communal value of higher education.
[HC-36] Characterizing the onset and offset of motor imagery during passive arm movements induced by an upper-body exoskeleton IROS2023
【速读】:该论文旨在解决非侵入式脑机接口(Brain-Machine Interface, BMI)在康复外骨骼设备运行过程中,如何准确检测用户运动想象(Motor Imagery, MI)的起始与终止时刻这一关键问题,尤其是在存在仪器噪声和由外骨骼引起的被动肢体运动干扰的情况下。其解决方案的关键在于摒弃传统的连续控制策略,转而通过识别运动想象状态的切换点——即从静息到运动想象开始、以及从持续运动想象到结束的过渡——来实现自然化的功能动作控制(启动与终止),并利用脑电图(Electroencephalogram, EEG)信号构建解码器,在离线和伪在线评估中均实现了平均60.7%的起始准确率和66.6%的终止准确率,证明了即使在外骨骼诱导的被动运动环境下,仍可提取稳定可靠的传感器运动节律,从而为辅助设备的BMI控制提供了可行路径。
链接: https://arxiv.org/abs/2603.20885
作者: Kanishka Mitra,Frigyes Samuel Racz,Satyam Kumar,Ashish D. Deshpande,José del R. Millán
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); The University of Texas at Austin (德克萨斯大学奥斯汀分校); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to IROS 2023. 6 pages, 6 figures. Project page available at this https URL
Abstract:Two distinct technologies have gained attention lately due to their prospects for motor rehabilitation: robotics and brain-machine interfaces (BMIs). Harnessing their combined efforts is a largely uncharted and promising direction that has immense clinical potential. However, a significant challenge is whether motor intentions from the user can be accurately detected using non-invasive BMIs in the presence of instrumental noise and passive movements induced by the rehabilitation exoskeleton. As an alternative to the straightforward continuous control approach, this study instead aims to characterize the onset and offset of motor imagery during passive arm movements induced by an upper-body exoskeleton to allow for the natural control (initiation and termination) of functional movements. Ten participants were recruited to perform kinesthetic motor imagery (MI) of the right arm while attached to the robot, simultaneously cued with LEDs indicating the initiation and termination of a goal-oriented reaching task. Using electroencephalogram signals, we built a decoder to detect the transition between i) rest and beginning MI and ii) maintaining and ending MI. Offline decoder evaluation achieved group average onset accuracy of 60.7% and 66.6% for offset accuracy, revealing that the start and stop of MI could be identified while attached to the robot. Furthermore, pseudo-online evaluation could replicate this performance, forecasting reliable online exoskeleton control in the future. Our approach showed that participants could produce quality and reliable sensorimotor rhythms regardless of noise or passive arm movements induced by wearing the exoskeleton, which opens new possibilities for BMI control of assistive devices.
[HC-37] Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking PAKDD2026
【速读】:该论文旨在解决成对比较标注(pairwise comparison labeling)在实际应用中因需进行全量比较而导致的二次方复杂度问题,同时提升排序结果的评分者间一致性(inter-rater reliability)。其解决方案的关键在于提出 Dodgersort 框架,该框架融合了基于 CLIP 的分层预排序(hierarchical pre-ordering)、神经排序头(neural ranking head)、概率集成方法(如 Elo、BTL、高斯过程 GP)、认知不确定性与随机不确定性分解(epistemic–aleatoric uncertainty decomposition),以及基于信息论的配对选择策略。这些组件协同作用,在显著减少人工标注次数的同时,提升了排序质量与可靠性,尤其在医学影像、历史年代判定和美学评估等任务中实现了 11–16% 的标注效率提升,并在 FG-NET 数据集上实现每比较一次提取比基线多 5–20 倍的排名信息,达成准确率-效率的帕累托最优权衡。
链接: https://arxiv.org/abs/2603.20839
作者: Yujin Park,Haejun Chung,Ikbeom Jang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 12 pages, 2 figures, Pacific-Asia Conference on Knowledge Discovery and Data Mining(PAKDD2026)
Abstract:Pairwise comparison labeling is emerging as it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparisons require quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic–aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces human comparisons while improving the reliability of the rankings. In visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11–16% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. In FG-NET with ground-truth ages, the framework extracts 5–20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy–efficiency trade-offs.
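摘要中概率集成所用的 Elo 是一种经典的成对比较评分模型:每次比较后按期望胜率更新双方评分,比较次数足够时即可得到全局排序。以下为标准 Elo 更新规则的最小示意(非 Dodgersort 官方实现,图像名称与 K 值均为假设):

```python
def elo_expected(r_a, r_b):
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_wins, k=32.0):
    """Return updated (r_a, r_b) after one pairwise comparison."""
    e_a = elo_expected(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Toy ranking of three hypothetical images from three human comparisons.
ratings = {"img_a": 1000.0, "img_b": 1000.0, "img_c": 1000.0}
for winner, loser in [("img_a", "img_b"), ("img_a", "img_c"), ("img_b", "img_c")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], True)
# ratings now order the images: img_a > img_b > img_c
```

实际系统中,Dodgersort 还结合 BTL 与高斯过程,并利用不确定性选择下一对待比较样本,此处仅演示 Elo 这一单一组件。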
[HC-38] A 4R-supported circular product-service system for luxury branded events
【速读】:该论文旨在解决奢侈品品牌活动中因短期周期和定制化搭建导致的材料快速消耗问题,即如何在不牺牲品牌美学标准与运营效率的前提下实现资源循环利用。其解决方案的关键在于构建一个“物理-数字融合”的产品服务系统(phygital product-service system, PSS),通过4R框架(拒绝、减量、再利用、回收)贯穿从仓库到活动现场的全流程,并结合可重复使用的物流包装、模块化货架系统与标准化标签等物理触点,以及实时数字仓库、基于清单的出入库工作流和可持续材料库等数字化工具,将循环规则嵌入日常操作环节,使再利用成为默认行为,从而推动采购、仓储、配送、回收与再部署等环节向价值留存转变。
链接: https://arxiv.org/abs/2603.20613
作者: Ke Ma,Francesca Valsecchi,Yuchen Tan,Mingjia Ji,Junru Shen,Xiaoya Ma,Duan Wu,Jiao Mo,Shijian Zhao
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 19 pages, 11 figures, accepted to be published in the Proceedings of DRS 2026 (Design Research Society Conference)
Abstract:Temporary luxury branded events run on short cycles and bespoke builds that accelerate material churn. We present a circular phygital product-service system that operationalises the circular economy (CE) through a 4R frame (Refuse, Reduce, Reuse, and Recycling) across warehouse-to-event journeys. Developed via a multi-method design inquiry with a tier-1 contractor, the system couples physical touchpoints (reusable fold-flat transit boxes, adjustable racking, standard labels) with digital orchestration (a live digital warehouse, list-based outbound/inbound workflow, and a sustainable materials library). The architecture aligns roles and decisions, protects and identifies assets, and makes reuse the default under luxury brand constraints. By embedding traceable actions and CE-aligned rules into everyday handoffs, the PSS shifts procurement, storage, dispatch, return, and redeployment toward value retention. The contribution is a replicable, practice-ready route from circular intent to operational change in branded environments, advancing responsible retail without compromising speed or aesthetic standards.
[HC-39] Nevis Digital Twin: Photogrammetry and Immersive Visualization of Historical Sites
【速读】:该论文旨在解决加勒比地区脆弱历史遗址(如尼维斯岛)因海岸侵蚀、海平面上升及植被侵袭等自然威胁而面临的数字化保存与虚拟重建难题。其核心问题在于如何在高成本专业测绘与大众可及性之间建立有效的数据采集与呈现桥梁。解决方案的关键在于提出一种多模态数据采集工作流,通过实验对比不同相机高度(1m vs. 3m)和操作者轨迹对高质量控制数据的影响,并比较网格重建(mesh reconstruction)与三维高斯泼溅(3D Gaussian Splatting)两种重建技术的适用性,最终将融合后的数据部署于沉浸式虚拟现实(VR)环境中,形成一个可扩展、非专有化的数字遗产保护模型。
链接: https://arxiv.org/abs/2603.20560
作者: Alex Apffel,Huy Tran,Vuthea Chheang
机构: San Jose State University (圣何塞州立大学)
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注: ARCHERIX Workshop - IEEE VR 2026
Abstract:In this work, we present a multimodal data acquisition workflow for the digital preservation and virtual reconstruction of at-risk historical sites on the island of Nevis. Facing threats from coastal erosion, rising sea levels, and aggressive vegetation, the archaeological heritage of Nevis requires documentation strategies that bridge the gap between high-cost professional surveying and consumer accessibility. Experimental tests compared acquisition variables, specifically camera height (1m vs. 3m) and operator trajectory, against high-resolution control data. Moreover, we explore virtual reconstruction via both mesh reconstruction and 3D Gaussian splatting, which serve as different modalities for documentation. The resulting data is fused into immersive virtual reality (VR) environments, offering a scalable, non-proprietary model for democratizing digital heritage in the Caribbean.
[HC-40] Towards Extended Reality Intelligence for Monitoring and Predicting Patient Readmission Risks ALT
【速读】:该论文旨在解决糖尿病住院患者30天内非计划再入院(unplanned readmissions)这一临床难题,其核心挑战在于如何精准识别高风险患者并提升临床决策效率。解决方案的关键在于构建一个基于XGBoost的机器学习预测模型与混合现实(Mixed Reality, MR)可视化系统相结合的集成框架:首先通过数据清洗、编码和特征工程优化后的XGBoost分类器实现了对再入院风险的量化预测(AUROC=0.72,AUPRC=0.11),识别出既往住院次数、出院去向及糖化血红蛋白(A1C)等关键因素;其次开发了MR原型界面,以直观方式呈现患者的个体化风险等级、主要贡献因子及护理摘要,从而增强医护人员在实时临床场景中对再入院风险的认知与沟通效率。
链接: https://arxiv.org/abs/2603.20556
作者: Martin Sanchez,Nick Tran,Vuthea Chheang
机构: San Jose State University (圣何塞州立大学)
类目: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
备注: XR Health Workshop, IEEE VR 2026
Abstract:Hospital readmissions remain a challenge for healthcare systems, especially among patients with chronic conditions such as diabetes. Unplanned readmissions within 30 days are costly, strain hospital resources, and can indicate poor care coordination or discharge planning. In this work, we explore the use of machine learning to predict readmission risk for diabetic inpatients and propose a mixed reality (MR) interface to provide effective visualization and insights. We trained an XGBoost classifier after data cleaning, encoding, and feature engineering. The model achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.72 and an Area Under the Precision-Recall Curve (AUPRC) of 0.11. Key predictive factors included prior inpatient visits, discharge disposition, and glycemic control indicators such as A1C (blood sugar test) results and medication adjustments. Additionally, we developed an MR prototype that visualizes patient records and predictions, containing risk level, major contributing factors, and a concise summary of care. Together, the predictive model and the MR interface aim to improve clinician awareness and communication around readmission risk in real-time clinical settings.
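论文报告的 AUROC=0.72 可理解为"随机抽取一个再入院患者,其风险得分高于随机抽取的未再入院患者的概率"。以下用纯 Python 按这一定义计算 AUROC 作示意(标签与分数均为假设的玩具数据,非论文实验代码):

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a
    random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = readmitted within 30 days) and model risk scores.
y = [0, 0, 1, 0, 1, 0, 0, 1]
s = [0.1, 0.4, 0.35, 0.2, 0.8, 0.15, 0.5, 0.6]
score = auroc(y, s)  # 13/15 ≈ 0.867
```

再入院预测这类高度不平衡的任务还需同时看 AUPRC(论文为 0.11),因为 AUROC 在负样本占绝大多数时容易显得乐观。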
[HC-41] “Girl I'm so Serious”: CARE, a Capability Framework for Reproductive Equity in Human-AI Interaction
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在性与生殖健康(Sexual and Reproductive Health, SRH)领域应用中,因结构性障碍和认知偏见导致的公平性缺失问题,即AI工具虽提供匿名信息获取途径,但其设计未能充分支持弱势群体的生殖自主权(reproductive autonomy)。解决方案的关键在于提出CARE框架——一种基于能力方法(Capability Approach)的规范性设计视角与评估框架,将“能力”(capabilities)、“功能实现”(functionings)及“转化因素”(conversion factors)转化为可操作的设计与评估维度,用于识别如“来源不透明”和“响应僵化”等认知危害,并据此制定参与式审计策略与政策建议,以确保AI在高风险场景下促进而非加剧不平等。
链接: https://arxiv.org/abs/2603.20511
作者: Alice Zhong,Phoebe Chen,Anika Sharma,Kandyce Brennan,Snehalkumar ‘Neil’ S. Gaikwad
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Sexual and reproductive health (SRH) remains shaped by structural barriers that leave many without judgment-free information. AI chatbots offer anonymous alternatives, but access alone does not ensure equity when socioeconomic determinants shape whose capabilities these tools expand or constrain. Conventional methods for evaluating human-AI interaction were not designed to capture whether technologies holistically support reproductive autonomy. We introduce CARE, Capability Approach for Reproductive Equity, developing capabilities, functionings, and conversion factors into a Normative Design Lens and an Evaluation Lens for AI in SRH contexts. Evaluating SRH-specific non-LLM chatbots, general-use LLMs, and search engine features along credibility and reasoning, we identify two epistemic harms: source opacity and response rigidity. We conclude with design and evaluation recommendations, participatory auditing strategies, and policy implications for high-stakes domains where AI intersects with inequity.
[HC-42] The production of meaning in the processing of natural language
【速读】:该论文旨在解决如何理解自然语言处理中意义生成的基本机制,以设计更安全、有思辨性、吸引人且赋能人类与智能体交互的系统。其核心问题在于揭示语言模型是否以及如何表现出语义处理中的上下文依赖性(contextuality),这种特性在认知科学和心理学中已被证明更符合量子逻辑而非经典布尔理论。解决方案的关键在于引入CHSH |S| 参数作为衡量上下文依赖性的指标,并在跨越四个数量级规模的语言模型中系统分析该参数的分布特征及其与多项外部基准(如MMLU、幻觉率和无意义检测)的关系。研究发现,|S| 的四分位距(interquartile range)与所有外部指标完全正交,表明上下文依赖性是一种独立于传统性能指标的内在属性,这为构建基于信息论约束的提示注入防御机制提供了新视角,同时也揭示了“制造上下文性”(manufacturing contextuality)这一比“获取同意”更根本的操纵形式——即通过结构化社会语境来塑造解释空间本身。
链接: https://arxiv.org/abs/2603.20381
作者: Christopher J. Agostino,Quan Le Thien,Nayan D’Souza,Louis van der Elst
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to HAXD 2026, 9 pages, 3 figures, 2 tables. associated package available at this https URL
Abstract:Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human-agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models – in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH |S| parameter – the metric associated with the inequality – across the inference parameter space of models spanning four orders of magnitude in scale, cross-referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the |S| distribution – the statistic that most sharply differentiates models from one another – is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how |S| varies with sampling parameters and word order, and discuss the information-theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale – manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
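摘要中的 CHSH |S| 参数由四个关联值 E(x, y) 按标准组合构成:经典(布尔)系统满足 |S| ≤ 2,量子系统最高可达 Tsirelson 上界 2√2 ≈ 2.83。以下为该标准公式的数值示意(论文针对语言模型的具体实验设计与此不同):

```python
def chsh_s(e_ab, e_ab2, e_a2b, e_a2b2):
    """CHSH parameter S = E(a,b) - E(a,b') + E(a',b) + E(a',b').
    Classical bound: |S| <= 2; Tsirelson (quantum) bound: 2*sqrt(2)."""
    return e_ab - e_ab2 + e_a2b + e_a2b2

# Deterministic classical correlations never exceed the bound:
s_classical = abs(chsh_s(1.0, 1.0, 1.0, 1.0))   # = 2
# Quantum-optimal correlations of magnitude 1/sqrt(2) reach Tsirelson's bound:
q = 2 ** -0.5
s_quantum = abs(chsh_s(q, -q, q, q))            # = 2*sqrt(2) ≈ 2.83
```

论文统计的正是语言模型在歧义表达解释实验中 |S| 的分布,|S| > 2 即表明出现了非经典的上下文依赖性。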
[HC-43] Review and Analysis of Scientific Paper Embellishments
【速读】:该论文试图解决的问题是:当前科学论文中日益增多的视觉装饰元素(paper embellishments)对学术交流的影响尚不明确,其潜在优势(如增强文本描述、强化图文关联和提升内部一致性)与可能带来的阅读体验干扰之间存在权衡。为回应这一问题,作者通过系统性回顾2019至2024年间发表于IEEE VIS、ACM CHI和EuroVis的可视化领域论文,从374篇使用装饰元素的论文中提炼出三个关键维度——用途(WHY)、设计选择(HOW)和位置(WHERE),从而构建了一个结构化的分析框架,揭示了这些装饰元素在科学写作中的形态特征及其在塑造科学传播中的作用机制。
链接: https://arxiv.org/abs/2603.20306
作者: Jiayi Hong,Yixuan Wang,Petra Isenberg,Ross Maciejewski
机构: 未知
类目: Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present a review and analysis of scientific paper embellishments – simple visual elements that are deeply integrated into the text of scientific publications. These embellishments are increasingly used in research papers, which have the potential to enhance textual descriptions, strengthen connections between figures and content, and improve internal textual coherence, while also carrying the risk of disrupting the reading experience. As their exact impact is not yet well understood, we conducted a systematic review of all visualization papers published between 2019 and 2024 in IEEE VIS, ACM CHI, and EuroVis. From this corpus, we identified 374 papers that used paper embellishments and distilled three key dimensions that characterize their usage: purposes (WHY), design choices (HOW), and locations (WHERE) of paper embellishments. Our findings provide a structured perspective on the form of current embellishments in scientific writing in the visualization domain and provide insights into their role in shaping scientific communication.
[HC-44] Abjad-Kids: An Arabic Speech Classification Dataset for Primary Education
【速读】:该论文旨在解决低资源语言(如阿拉伯语)儿童语音数据集匮乏的问题,以及由此导致的儿童语音分类任务中模型性能受限的问题。其关键解决方案是提出一个名为Abjad-Kids的阿拉伯语儿童语音数据集,并采用基于CNN-LSTM架构的分层音频分类方法:将字母识别分解为两阶段过程——首先通过静态语言学特征进行类别分组,再对每组使用专用分类器进行精细化识别。实验表明,静态语言学分组策略优于动态聚类分组策略,且CNN-LSTM模型结合数据增强在性能上显著优于传统机器学习方法,尽管仍存在因样本量有限导致的过拟合问题。
链接: https://arxiv.org/abs/2603.20255
作者: Abdul Aziz Snoubara,Baraa Al_Maradni,Haya Al_Naal,Malek Al_Madrmani,Roaa Jdini,Seedra Zarzour,Khloud Al Jallad
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech-based AI educational applications have gained significant interest in recent years, particularly for children. However, children's speech research remains limited due to the lack of publicly available datasets, especially for low-resource languages such as Arabic. This paper presents Abjad-Kids, an Arabic speech dataset designed for kindergarten and primary education, focusing on fundamental learning of alphabets, numbers, and colors. The dataset consists of 46397 audio samples collected from children aged 3 - 12 years, covering 141 classes. All samples were recorded under controlled specifications to ensure consistency in duration, sampling rate, and format. To address high intra-class similarity among Arabic phonemes and the limited samples per class, we propose a hierarchical audio classification approach based on CNN-LSTM architectures. Our proposed methodology decomposes alphabet recognition into a two-stage process: an initial grouping classification model followed by specialized classifiers for each group. Both strategies, static linguistic-based grouping and dynamic clustering-based grouping, were evaluated. Experimental results demonstrate that static linguistic-based grouping achieves superior performance. Comparisons between traditional machine learning and deep learning approaches highlight the effectiveness of CNN-LSTM models combined with data augmentation. Despite achieving promising results, most of our experiments indicate a challenge with overfitting, which is likely due to the limited number of samples, even after data augmentation and model regularization. Thus, future work may focus on collecting additional data to address this issue. Abjad-Kids will be publicly available. We hope that Abjad-Kids enriches children's representation in speech datasets and serves as a good resource for future research in Arabic speech classification for kids.
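论文的两阶段分层识别可概括为:先用分组分类器给出粗粒度组别,再路由到该组的专用分类器得到具体字母。以下为该路由结构的最小示意(分组规则、声学特征与字母标签均为假设占位,论文中两级实际均为 CNN-LSTM 模型):

```python
def hierarchical_predict(x, group_clf, specialists):
    """Stage 1: predict a coarse group; stage 2: get the fine-grained
    label from the specialist classifier assigned to that group."""
    group = group_clf(x)
    return group, specialists[group](x)

# Stand-in classifiers keyed on hypothetical acoustic features.
group_clf = lambda x: "group_1" if x["f0"] < 120 else "group_2"
specialists = {
    "group_1": lambda x: "alif" if x["energy"] > 0.5 else "ayn",
    "group_2": lambda x: "ba" if x["energy"] > 0.5 else "ta",
}
group, letter = hierarchical_predict({"f0": 100, "energy": 0.7},
                                     group_clf, specialists)
# group == "group_1", letter == "alif"
```

分层结构的好处是每个专用分类器只需区分组内少量且声学相近的类别,这正是论文用来缓解阿拉伯语音素组内高相似性的设计。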
[HC-45] Writing literature reviews with AI: principles hurdles and some lessons learned
【速读】:该论文试图解决的问题是:如何在利用大语言模型(Large Language Model, LLM)辅助生成文献综述时,识别并规避其带来的系统性偏倚与局限性,以确保综述的学术质量与批判深度。解决方案的关键在于用户必须具备扎实的领域专业知识,能够主动识别LLM输出中存在的“无知偏差”(bias of ignorance)、“对齐与数字谄媚”(alignment and digital sycophancy)、“主流化倾向”(mainstreaming)、“创造性重构能力有限”以及“缺乏批判视角”等五大陷阱,并通过精准提示(prompting)进行干预。然而,这一过程揭示了一个悖论:高质量的AI辅助综述依赖于作者本身对文献的深入理解,而这恰恰是AI本应减轻的工作负担,因此单纯依赖AI自动化流程将导致严重风险,唯有结合专家判断与谨慎使用才能实现有效增益。
链接: https://arxiv.org/abs/2603.20235
作者: Saadi Lahlou(1,2),Annabelle Gouttebroze(1),Atrina Oraee(1),Julian Madera(1) ((1) London School of Economics and Political Science (2) Paris Institute for Advanced Study)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 31 pages and 193 pages of Appendices, including 6 different versions of the literature review, and complete chat with the LLM
Abstract:We qualitatively compared literature reviews produced with varying degrees of AI assistance. The same LLM, given the same corpus of 280 papers but different selections, produced dramatically different reviews, from mainstream and politically neutral to critical and post-colonial, though neither orientation was intended. LLM outputs always appear at first glance to be well written, well informed and thought out, but closer reading reveals gaps, biases and lack of depth. Our comparison of six versions shows a series of pitfalls and suggests precautions necessary when using AI assistance to make a literature review. Main issues are: (1) The bias of ignorance (you do not know what you do not get) in the selection of relevant papers. (2) Alignment and digital sycophancy: commercial AI models slavishly take you further in the direction they understand you give them, reinforcing biases. (3) Mainstreaming: because of their statistical nature, LLM productions tend to favor mainstream perspectives and content; in our case there was only 20% overlap between paper selections by humans and the LLM. (4) Limited capacity for creative restructuring, with vague and ambiguous statements. (5) Lack of critical perspective, coming from distant reading and political correctness. Most pitfalls can be addressed by prompting, but only if the user knows the domain well enough to detect them. There is a paradox: producing a good AI-assisted review requires expertise that comes from reading the literature, which is precisely what AI was meant to reduce. Overall, AI can improve the span and quality of the review, but the gain of time is not as massive as one would expect, and a press-button strategy leaving AI to do the work is a recipe for disaster. We conclude with recommendations for those who write, or assess, such LLM-augmented reviews.
[HC-46] Beyond Detection: Governing GenAI in Academic Peer Review as a Sociotechnical Challenge
【速读】:该论文旨在解决生成式 AI(Generative AI)在学术同行评审流程中引入的公平性、问责制与评价判断合法性问题,尤其是在评审负担加重背景下,如何平衡效率提升与人类核心判断权的保留。其解决方案的关键在于:明确将评价性判断(如新颖性、贡献度和录用决策)严格保留给人类,同时通过制定角色特定的可执行控制机制,厘清责任边界,防止因制度模糊导致个体学者(尤其是初级作者与审稿人)承担过重的解释与执行负担,并防范提示注入等对抗性风险,从而实现对AI辅助评审过程的有效社会技术治理。
链接: https://arxiv.org/abs/2603.20214
作者: Tatiana Chakravorti,Pranav Narayanan Venkit,Sourojit Ghosh,Sarah Rajtmajer
机构: Pennsylvania State University (宾夕法尼亚州立大学); Salesforce AI Research (Salesforce AI 研究); University of Washington (华盛顿大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI tools are increasingly entering academic peer review workflows, raising questions about fairness, accountability, and the legitimacy of evaluative judgment. While these systems promise efficiency gains amid growing reviewer overload, their use introduces new sociotechnical risks. This paper presents a convergent mixed-method study combining discourse analysis of 448 social media posts with interviews with 14 area chairs and program chairs from leading AI and HCI conferences to examine how GenAI is discussed and experienced in peer review. Across both datasets, we find broad agreement that GenAI may be acceptable for limited supportive tasks, such as improving clarity or structuring feedback, but that core evaluative judgments, assessing novelty, contribution, and acceptance, should remain human responsibilities. At the same time, participants highlight concerns about epistemic harm, over-standardization, unclear responsibility, and adversarial risks such as prompt injection. User interviews reveal how structural strain and institutional policy ambiguity shift interpretive and enforcement burdens onto individual scholars, disproportionately affecting junior authors and reviewers. By triangulating public governance discourse with lived review practices, this work reframes AI-mediated peer review as a sociotechnical governance challenge and offers recommendations for preserving accountability, trust, and meaningful human oversight. Overall, we argue that AI-assisted peer review is best governed not by blanket bans or detection alone, but by explicitly reserving evaluative judgment for humans while instituting enforceable, role-specific controls that preserve accountability. We conclude with role-specific recommendations that formalize the support–judgment boundary.
[HC-47] AnchorNote: Exploring Speech-Driven Spatial Externalization for Co-Located Collaboration in Augmented Reality
【速读】:该论文旨在解决协同创作中传统纸质便签(sticky notes)虽具灵活性但缺乏数字持久性与智能处理能力的问题,尤其是在增强现实(AR)环境中如何更高效地实现想法的外化、组织与共享。其解决方案的关键在于提出 AnchorNote——一个共位 AR 系统,通过实时语音转录(live transcription)和大语言模型(LLM)摘要生成,将口头表达的内容自动转化为空间锚定的数字便签,从而支持协作过程中更自然、低摩擦的想法捕捉与空间化组织。此设计不仅降低了书写负担,还重构了群体认知与协调方式,揭示了语音驱动的空间外化对协作流程的影响机制。
链接: https://arxiv.org/abs/2603.20199
作者: Diya Hundiwala,Andrés Monroy-Hernández
机构: Princeton University (普林斯顿大学)
类目: Human-Computer Interaction (cs.HC)
备注: 12 pages, 3 figures
Abstract:Sticky notes remain a durable collaborative medium because they support rapid idea externalization, rearrangement, and coordination of group attention through spatial organization while being low-friction and lightweight. Recent AR systems suggest new ways to externalize ideas in shared physical space, including spatial annotations and digital workspaces. We introduce AnchorNote, a co-located AR system that lets collaborators intentionally capture spoken ideas as spatially anchored sticky notes via live transcription and LLM summarization. We evaluated AnchorNote in a two-phase iterative study with 20 participants completing a brainstorming and thematic grouping task to examine how speech-driven, spatially persistent capture shapes idea externalization in collaboration. We found that AnchorNote reduced writing effort but reshaped collaboration by introducing new coordination costs and shifting how participants formulated, timed, and organized ideas. We use AnchorNote as an exploratory probe to study how speech-driven, spatial externalization in AR restructures collaborative cognition and coordination, and to derive design implications for future co-located AR collaboration tools.
计算机视觉
[CV-0] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解任务中因上下文窗口有限而导致的挑战,即如何从长视频中高效定位与查询相关的稀疏片段。现有方法通常仅依赖查询本身进行线索定位,忽视了视频内部结构以及不同片段间的相关性差异。解决方案的关键在于提出 VideoDetective 框架,其核心创新是将查询到片段的相关性(query-to-segment relevance)与片段间的相互亲和性(inter-segment affinity)相结合,构建基于视觉相似性和时间邻近性的视觉-时间亲和图,并通过假设-验证-精化(Hypothesis-Verification-Refinement)循环迭代估计已观察片段的相关性得分并传播至未观察片段,从而生成全局相关性分布,指导关键片段的精准定位,实现以稀疏观测完成高质量问答。
链接: https://arxiv.org/abs/2603.22285
作者: Ruoliu Yang,Chu Wu,Caifeng Shan,Ran He,Chaoyou Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video’s intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at this https URL
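摘要中"将已观察片段的相关性得分沿视觉-时间亲和图传播到未观察片段"的思路,可用亲和度加权平均作一个最小示意(亲和矩阵与得分均为假设值,非论文原始算法):

```python
def propagate_relevance(affinity, observed):
    """Estimate relevance for every segment: observed segments keep
    their scores; unseen segments get the affinity-weighted average
    of observed scores. `observed` maps segment index -> score."""
    scores = {}
    for i in range(len(affinity)):
        if i in observed:
            scores[i] = observed[i]
            continue
        w = sum(affinity[i][j] for j in observed)
        scores[i] = (sum(affinity[i][j] * observed[j] for j in observed) / w
                     if w > 0 else 0.0)
    return scores

# Four segments; 0 and 3 were observed. Segment 1 is near 0, segment 2 near 3.
aff = [[1.0, 0.8, 0.1, 0.0],
       [0.8, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.9],
       [0.0, 0.1, 0.9, 1.0]]
scores = propagate_relevance(aff, {0: 0.9, 3: 0.2})
# Segment 1 inherits mostly from the highly relevant segment 0.
```

论文在此基础上还通过假设-验证-精化循环迭代更新已观察片段的得分,此处只演示单次传播这一步。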
[CV-1] End-to-End Training for Unified Tokenization and Latent Denoising
【速读】:该论文旨在解决当前潜在扩散模型(Latent Diffusion Models, LDMs)训练中复杂的分阶段流程问题,即必须先独立训练图像分词器(tokenizer),再在冻结的潜在空间中训练扩散模型,导致训练效率低且难以优化。其解决方案的关键在于提出UNITE架构——一种通过权重共享实现统一分词与潜在扩散的自编码器结构。核心创新点在于将分词和生成视为同一潜在推理问题在不同条件下的表现:分词是从完整图像中推断潜在表示,而生成则是从噪声及文本或类别条件中推断潜在表示。基于此洞察,UNITE采用单阶段联合训练策略,通过两次前向传播共享同一个生成编码器(Generative Encoder),使梯度共同塑造潜在空间,从而构建一种“通用潜在语言”。实验表明,该方法无需对抗损失或预训练编码器即可在图像和分子模态上达到接近最先进性能,验证了从零开始联合训练分词与生成任务的可行性。
链接: https://arxiv.org/abs/2603.22283
作者: Shivam Duggal,Xingjian Bai,Zongze Wu,Richard Zhang,Eli Shechtman,Antonio Torralba,Phillip Isola,William T. Freeman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: First two authors contributed equally. Project: this https URL Code: this https URL
Abstract:Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a “common latent language”. Across image and molecule modalities, UNITE achieves near state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256×256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
[CV-2] UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
【速读】:该论文旨在解决现有统一模型在处理人体运动(Human Motion)、自然语言(Natural Language)和RGB图像(RGB Images)三者时存在的两大局限:一是仅支持受限的模态子集(如运动-文本或静态姿态-图像),二是依赖离散标记化(discrete tokenization),导致量化误差并破坏时间连续性。其核心解决方案是提出UniMotion框架,通过将运动视为与RGB图像同等地位的连续模态(continuous modality),构建了基于Cross-Modal Aligned Motion VAE(CMA-VAE)和对称双路径嵌入器(symmetric dual-path embedders)的并行连续通道,嵌入共享的大语言模型(LLM)主干网络中;进一步引入Dual-Posterior KL Alignment(DPA)以在推理阶段无需图像的情况下注入视觉语义先验至运动表示,并采用Latent Reconstruction Alignment(LRA)自监督预训练策略缓解冷启动问题(cold-start problem),从而建立稳定且具备运动感知能力的基础,最终在七项跨模态理解、生成与编辑任务中实现最先进性能,尤其在跨模态组合任务中优势显著。
链接: https://arxiv.org/abs/2603.22282
作者: Ziyi Wang,Xinshun Wang,Shuang Chen,Yang Cong,Mengyuan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 42 pages, 16 figures
Abstract:We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder’s richer posterior into the motion-only encoder. To address the cold-start problem – where text supervision alone is too sparse to calibrate the newly introduced motion pathway – we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
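DPA 将视觉融合编码器的后验蒸馏进纯运动编码器,这类后验对齐在 VAE 中通常用对角高斯之间的闭式 KL 散度实现。以下为该标准公式的示意(论文中 DPA 的具体形式系据摘要推断,非官方实现):

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over
    dimensions: 0.5 * sum(log(vp/vq) + (vq + (mq - mp)^2) / vp - 1)."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p))

# Identical posteriors incur zero cost; any mean/variance gap is penalized.
kl_zero = kl_diag_gauss([0.0, 1.0], [1.0, 0.5], [0.0, 1.0], [1.0, 0.5])  # = 0
kl_gap = kl_diag_gauss([0.0], [1.0], [1.0], [1.0])                       # = 0.5
```

以此类目标训练时,纯运动编码器的后验会被推向视觉融合编码器更丰富的后验,从而在推理阶段无需图像也能携带视觉语义先验。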
[CV-3] DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在处理复杂多步骤任务时的两大局限性:一是难以同时捕捉低层级的视觉细节与高层级的逻辑规划,因现有基于单一模态链式思维(Chain-of-Thought, CoT)的方法无法实现多模态协同推理;二是由于逐步自回归解码导致推理延迟高且误差累积严重。解决方案的关键在于提出DualCoT-VLA框架,其核心创新为双路径并行推理机制:一方面引入视觉CoT以增强对精细空间感知的理解,另一方面引入语言CoT用于高层任务规划,二者通过两组可学习查询令牌(learnable query tokens)实现并行推理,从而将自回归推理转化为单步前向推理,显著降低延迟并提升推理准确性。
链接: https://arxiv.org/abs/2603.22280
作者: Zhide Zhong,Junfeng Li,Junjie He,Haodong Yan,Xin Gong,Guanyi Zhao,Yingjie Cai,Jiantao Gao,Xu Yan,Bingbing Liu,Yingcong Chen,Liuqing Yang,Haoang Li
机构: The Hong Kong University of Science and Technology (Guangzhou); Huawei Foundation Model Department
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.
[CV-4] 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和视觉语言模型(Vision Language Models, VLMs)在进行细粒度视觉编辑时存在的空间理解不足与布局一致性差的问题。其解决方案的关键在于提出一种结构化推理(Structured Reasoning)框架,通过场景图(scene graph)推理实现文本条件下的空间布局编辑:给定输入场景图和自然语言指令,模型基于图结构进行推理以生成满足文本条件且保持空间一致性的更新场景图,从而提升对空间关系的可控性与可解释性。
链接: https://arxiv.org/abs/2603.22279
作者: Haoyu Zhen,Xiaolong Li,Yilin Zhao,Han Zhang,Sifei Liu,Kaichun Mo,Chuang Gan,Subhashree Radhakrishnan
机构: NVIDIA; UMass Amherst
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
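该论文用 IoU 与中心距误差(center-distance error)衡量布局编辑质量,二者均可由轴对齐包围盒直接算出。以下为自拟的最小示意实现(与论文评测脚本无关,box 约定为 (x1, y1, x2, y2)):

```python
def iou(box_a, box_b):
    # 轴对齐包围盒的交并比
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_dist(box_a, box_b):
    # 两个包围盒中心点的欧氏距离
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5
```

将编辑后场景图中每个物体的预测框与目标框逐一配对计算上述两项指标,再取平均,即可得到论文中报告的 IoU 与中心距误差量级的结果。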
[CV-5] The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中对象属性与空间关系关联机制不明确的问题,即当前模型如何在内部表征并利用这些空间信息以完成图像描述和视觉问答等多模态任务。解决方案的关键在于揭示了两种并行的表征机制:一是语言模型骨干网络中的中间层对视觉标记(visual tokens)所对应对象之间内容无关的空间关系进行编码;二是视觉编码器(vision encoder)生成的表示直接捕获对象布局,并被语言模型骨干网络所利用。研究发现,后者才是主导的空间信息来源,且其空间信号分布于所有图像标记中,不仅限于对象区域,还延伸至背景区域。通过增强这一全局性的视觉衍生空间表示,可显著提升模型在自然图像上的空间推理性能,从而明确了视觉编码器在空间推理中的核心作用。
链接: https://arxiv.org/abs/2603.22278
作者: Kelly Cui,Nikhil Prakash,Ayush Raina,David Bau,Antonio Torralba,Tamar Rott Shaham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 26 pages, 35 figures
Abstract:Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.
[CV-6] Repurposing Geometric Foundation Models for Multi-view Diffusion
【速读】:该论文旨在解决单图像生成中广泛使用的视图无关变分自编码器(Variational Autoencoder, VAE)隐空间在新视角合成(Novel View Synthesis, NVS)任务中缺乏几何一致性的问题。现有方法难以保证跨视角的几何结构一致性,限制了NVS的性能。解决方案的关键在于提出几何隐扩散模型(Geometric Latent Diffusion, GLD),其创新性地将几何基础模型(geometric foundation models)提供的几何一致特征空间作为扩散模型的隐空间,从而同时实现高保真RGB重建与强跨视角几何对应关系建模,显著提升了2D图像质量和3D一致性,并大幅加速训练过程。
链接: https://arxiv.org/abs/2603.22275
作者: Wooseok Jang,Seonghu Jeon,Jisang Han,Jinhyeok Choi,Minkyung Kwon,Seungryong Kim,Saining Xie,Sainan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project website: this https URL
Abstract:While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
[CV-7] DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution CVPR2026
【速读】:该论文旨在解决基于扩散模型的视频超分辨率(Diffusion-based Video Super-Resolution, VSR)方法在实际应用中面临的采样成本过高问题,尤其是直接采用分布匹配蒸馏(Distribution Matching Distillation, DMD)时导致的训练不稳定、性能退化及监督信号不足的问题。解决方案的关键在于提出一种三阶段框架DUO-VSR,其核心创新是引入双流蒸馏策略(Dual-Stream Distillation),联合优化DMD与基于真实-虚假得分特征的生成对抗网络(Real-Fake Score Feature GAN, RFS-GAN),通过轨迹保持的渐进式引导蒸馏初始化稳定训练过程,并借助来自真实与伪影分数模型的判别特征提供互补的对抗监督,最终在偏好引导精修阶段进一步提升感知质量,从而实现高效且高质量的一步式VSR。
链接: https://arxiv.org/abs/2603.22271
作者: Zhengyao Lv,Menghan Xia,Xintao Wang,Kwan-Yee K. Wong
机构: The University of Hong Kong (香港大学); Huazhong University of Science and Technology (华中科技大学); Kling Team, Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
[CV-8] GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning
【速读】:该论文旨在解决光学流估计(Optical Flow Estimation)中依赖昂贵人工标注数据导致的可扩展性问题。现有无监督和半监督方法虽缓解了标注需求,但其监督信号基于亮度恒定和平滑性假设,在复杂真实场景中常产生不可靠的运动估计。解决方案的关键在于提出一种名为 \textbf\modelname 的新框架,通过预训练的深度估计网络生成伪光学流作为条件输入,驱动下一帧生成模型合成高保真、像素对齐的帧-流数据对,从而无需人工标注即可获得大规模精确匹配的训练样本;同时引入不一致像素过滤策略剔除生成帧中的不可靠区域,显著提升在真实数据集上的微调性能。
链接: https://arxiv.org/abs/2603.22270
作者: Yixuan Luo,Feng Qiao,Zhexiao Xiong,Yanjing Li,Nathan Jacobs
机构: University of Chicago (芝加哥大学); Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame–flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.
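由深度生成伪光流的基本几何是:将像素按深度反投影为 3D 点,经相对相机位姿变换后再投影回像平面,两次像素坐标之差即为该像素的光流。下面是一个仅考虑相机平移的假设性针孔模型简化示意(函数与参数均为自拟,非论文实现):

```python
def flow_from_depth(u, v, depth, fx, fy, cx, cy, t):
    """由单像素深度与相机平移 t=(tx, ty, tz) 推出伪光流 (du, dv)。"""
    # 反投影:像素 (u, v) + 深度 -> 相机坐标系 3D 点
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    Z = depth
    # 相机平移后,该点在新相机坐标系中的位置
    Xn, Yn, Zn = X - t[0], Y - t[1], Z - t[2]
    # 重投影回像平面,像素位移即光流
    un = fx * Xn / Zn + cx
    vn = fy * Yn / Zn + cy
    return un - u, vn - v
```

例如相机向右平移时,场景点在图像中整体向左移动,且深度越大位移越小,这正是伪光流应具备的视差规律;完整方法还需处理旋转与遮挡(论文中即通过不一致像素过滤剔除不可靠区域)。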
[CV-9] EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild
【速读】:该论文旨在解决社会群体检测(Social Group Detection)在现实世界中应用的局限性问题,即现有基准数据集因场景多样性不足且依赖第三人称视角(如监控视频)而难以评估模型在不同文化背景和非受限环境下的群体形成与演化能力。解决方案的关键在于提出首个第一人称视角(Ego-centric View)数据集 EgoGroups,其覆盖全球65个国家、涵盖低、中、高人群密度及四种天气/时段条件,并提供密集的人体与社会群体标注以及丰富的地理和场景元数据。通过该数据集对主流视觉语言模型(VLM)和大语言模型(LLM)进行系统评估,发现这些模型在零样本设置下可超越传统监督模型,同时揭示了人群密度和文化区域显著影响模型性能。
链接: https://arxiv.org/abs/2603.22249
作者: Jeffri Murrugarra-Llerena,Pranav Chitale,Zicheng Liu,Kai Ao,Yujin Ham,Guha Balakrishnan,Paola Cascante-Bonilla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.
[CV-10] Riverine Land Cover Mapping through Semantic Segmentation of Multispectral Point Clouds
【速读】:该论文旨在解决河岸环境中土地覆盖制图的准确性问题,这对于河流管理、生态理解及地貌变化监测至关重要。解决方案的关键在于利用Point Transformer v2(PTv2)这一先进的深度神经网络架构,对多光谱激光雷达(LiDAR)点云数据进行语义分割,从而实现对沙地、砾石、低矮植被、高大植被、林地表面和水体等土地覆盖类别的精确识别。研究通过融合几何与光谱特征(尤其是强度和反射率信息),显著提升了模型性能(mIoU达0.950),并进一步验证了多数据集训练策略在提升模型泛化能力方面的潜力,即使在高质量标注数据有限的情况下亦能增强模型鲁棒性。
链接: https://arxiv.org/abs/2603.22230
作者: Sopitta Thurachen,Josef Taher,Matti Lehtomäki,Leena Matikainen,Linnea Blåfield,Mikel Calle Navarro,Antero Kukko,Tomi Westerlund,Harri Kaartinen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model’s generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that using the full-feature configuration resulted in performance with a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were the key for accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.
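上文报告的 mIoU(mean Intersection over Union)按类别分别累计交并比后取平均,对类别不平衡的河岸场景比总体精度更敏感。以下为一个自拟的最小示意实现(逐点标签版本,非论文评测脚本):

```python
def mean_iou(preds, labels, num_classes):
    """对逐点预测与真值计算各类 IoU 的平均;真值与预测均未出现的类不计入。"""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, l in zip(preds, labels) if p == c and l == c)
        union = sum(1 for p, l in zip(preds, labels) if p == c or l == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

由于每类各占一票,占点数极少的类别(如水体边缘)表现不佳会直接拉低 mIoU,这也是论文强调强度与反射率特征对小类别重要性的原因之一。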
[CV-11] Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre
【速读】:该论文旨在解决当前深度学习模型在航空LiDAR数据上的3D语义分割性能研究不足的问题,特别是针对真实飞行条件下获取的大型航空点云数据集上模型表现缺乏系统评估的现状。其解决方案的关键在于构建一个大规模、多场景(城市、乡村和工业区)的航空LiDAR数据集,并在此基础上对四种代表性深度学习架构(KPConv、RandLA-Net、Superpoint Transformer 和 Point Transformer V3)进行实验对比,以量化分析它们在类别不平衡和几何多样性等挑战下的分割性能差异,从而为实际应用提供可靠的模型选择依据。
链接: https://arxiv.org/abs/2603.22229
作者: Alex Salvatierra,José Antonio Sanz,Christian Gutiérrez,Mikel Galar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures
Abstract:Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
[CV-12] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的文本到图像(Text-to-Image, T2I)生成模型中对细粒度空间关系建模不足的问题,即现有奖励模型虽能保证整体语义一致性和视觉质量,却常忽略物体间的精确位置关系,导致生成图像在局部空间布局上存在偏差。其解决方案的关键在于提出一种可验证的奖励模型 SpatialReward,该模型采用多阶段流程:首先通过提示分解器(Prompt Decomposer)提取文本中的实体、属性与空间元数据;继而利用专家检测器实现对象位置和属性的精准视觉定位;最后由视觉语言模型结合链式推理(chain-of-thought reasoning)对复杂空间关系进行判断,从而显著提升生成图像的空间一致性。实验表明,将 SpatialReward 引入 RL 训练可有效改善 Stable Diffusion 和 FLUX 模型的空间准确性,并更贴近人类评价标准。
链接: https://arxiv.org/abs/2603.22228
作者: Sashuai Zhou,Qiang Zhou,Junpeng Ma,Yue Cao,Ruofan Hu,Ziang Zhang,Xiaoda Yang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Zhou Zhao
机构: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
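该流程中由专家检测器支撑的空间关系校验,最简单的一类可用包围盒中心的相对位置以规则方式判定;复杂关系才交给 VLM 链式推理。下面是一个与论文无关的自拟粗粒度规则示意(box 为 (x1, y1, x2, y2),图像坐标系 y 轴向下):

```python
def relation(box_a, box_b):
    """粗粒度判断物体 a 相对物体 b 的方位(基于包围盒中心)。"""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = ax - bx, ay - by
    # 以位移较大的轴为主方向
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"  # 图像坐标系中 y 增大表示向下
```

将提示中解析出的目标关系(如 "a left of b")与检测框推出的实际关系比对,即可给出一个可验证的 0/1 空间奖励信号。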
[CV-13] Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
【速读】:该论文旨在解决当前世界模型评估基准在交互响应能力上的缺失问题,特别是针对4D生成范式中时空结构与动态演化联合建模的挑战。现有评估体系要么局限于视频生成的视觉保真度和文本-视频对齐,要么依赖静态的3D重建指标,均未能有效衡量交互动作如何驱动状态在空间和时间维度上的因果变化。解决方案的关键在于提出Omni-WorldBench,其核心创新包括:(1)Omni-WorldSuite,一个覆盖多层级交互与场景类型的系统化提示套件;(2)Omni-Metrics,一种基于智能体的评估框架,通过量化交互动作对最终结果及中间状态演化轨迹的因果影响,精准衡量世界模型的交互响应能力。该方案首次系统性地评估了4D世界模型的核心能力——交互驱动的状态转移建模。
链接: https://arxiv.org/abs/2603.22212
作者: Meiqi Wu,Zhixin Cai,Fufangchen Zhao,Xiaokun Feng,Rujing Dang,Bingze Song,Ruitian Tian,Jiashu Zhu,Jiachen Lei,Hao Dou,Jing Tang,Lei Sun,Jiahong Wu,Xiangxiang Chu,Zeming Liu,Kaiqi Huang
机构: AMAP, Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video–based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text–video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni–WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni–WorldBench comprises two key components: Omni–WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni–Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
[CV-14] Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning ICLR2026
【速读】:该论文旨在解决多实例学习(Multiple Instance Learning, MIL)在计算病理学中用于分类千兆像素全切片图像时,现有方法忽视了将通用特征转换为任务特定特征的关键步骤问题。传统MIL流程包括提取patch特征、应用线性层获得任务相关特征,以及聚合patch得到切片特征进行分类,其中第二步——线性变换层——长期未被优化,成为性能瓶颈。解决方案的关键在于引入MAMMOTH模块,这是一个参数高效、多头专家混合(multi-head mixture of experts)结构,通过为每个patch的表型定制低秩变换,实现任务特定特征的增强,从而显著提升任何MIL模型的性能,且几乎不增加总参数量。实验表明,该变换对性能的影响甚至超过聚合策略的选择,证明其有效性。
链接: https://arxiv.org/abs/2603.22198
作者: Daniel Shao,Joel Runevic,Richard J. Chen,Drew F.K. Williamson,Ahrong Kim,Andrew H. Song,Faisal Mahmood
机构: Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学); Emory University (埃默里大学); MD Anderson Cancer Center (MD安德森癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in ICLR 2026 (37 pages, 16 figures)
Abstract:Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch’s phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across eight MIL methods and 19 different classification tasks, we find that such task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average +3.8% change in performance. Code is available at this https URL.
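MAMMOTH 的思想可以粗略理解为:对每个 patch 特征 x,用门控权重混合若干低秩变换 B_e(A_e x),以替代单一线性层。下面是一个与论文实现无关的自拟玩具示意(纯 Python、极小维度,矩阵以嵌套列表表示):

```python
def matvec(M, x):
    # 朴素矩阵-向量乘法
    return [sum(m * xx for m, xx in zip(row, x)) for row in M]

def mini_experts(x, experts, gates):
    """门控混合若干低秩专家:y = sum_e g_e * B_e(A_e x)。

    experts: [(A_e, B_e)] 低秩分解对;gates: 该 patch 的专家权重。
    """
    dim_out = len(experts[0][1])  # 输出维度 = B_e 的行数
    out = [0.0] * dim_out
    for (A, B), g in zip(experts, gates):
        h = matvec(B, matvec(A, x))  # 低秩变换 B_e(A_e x)
        out = [o + g * v for o, v in zip(out, h)]
    return out
```

实际方法中门控权重由 patch 表型决定(例如一个小路由网络),从而让不同形态的 patch 经过不同的任务特定变换;秩远小于特征维度时,新增参数量极小。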
[CV-15] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation CVPR2026
【速读】:该论文旨在解决手-物体交互(Hand-Object Interaction, HOI)视频生成中存在三个割裂研究方向的问题:仅预测姿态而无像素输出的姿势合成、基于单图生成外观但缺乏动态信息的方法,以及需要完整姿态序列和真实首帧输入的视频生成方法,这些限制了真实场景中的部署能力。解决方案的关键在于提出一个统一的“姿态-外观-运动”(Pose-Appearance-Motion, PAM)引擎,将三者整合于同一框架内,实现可控的HOI视频生成。实验表明,PAM在DexYCB和OAKINK2数据集上显著优于现有方法,且通过多条件输入(深度、分割与关键点)优化性能,并能有效提升下游手部姿态估计任务的泛化能力。
链接: https://arxiv.org/abs/2603.22193
作者: Mingju Gao,Kaisen Yang,Huan-ang Gao,Bohan Li,Ao Ding,Wenyi Li,Yangcheng Yu,Jinkun Liu,Shaocong Xu,Yike Niu,Haohan Chi,Hao Chen,Hao Tang,Li Yi,Hao Zhao
机构: Peking University (北京大学); Tsinghua University (清华大学); BAAI (北京人工智能研究院); SJTU (上海交通大学); Eastern Institute of Technology (东方理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Code: this https URL
Abstract:Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
[CV-16] A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis
【速读】:该论文旨在解决如何通过自监督学习(Self-supervised Learning, SSL)作为辅助任务,将基于纹理的局部描述符(texture-based local descriptors)融入特征建模中,以提升人脸分析任务的效率与鲁棒性。其核心问题是:不同骨干网络(backbone)对所提出方法——局部模式自监督辅助任务(Local Pattern Self-Supervised Auxiliary Task, L-SSAT)的性能影响及其在多种人脸分析任务中的通用性。解决方案的关键在于利用掩码自动编码器(Masked Auto-Encoder, MAE)作为SSL辅助目标,重建局部纹理特征(如局部模式),并与主任务联合优化,从而增强特征表示的判别能力;同时通过系统性基准测试不同深度和结构的骨干网络,发现模型性能高度依赖下游任务,且不存在适用于所有人脸分析任务的统一最优骨干网络。
链接: https://arxiv.org/abs/2603.22190
作者: Shukesh Reddy,Abhijit Das
机构: BITS Pilani Hyderabad(印度理工学院比尔拉 Hyderabad 校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in SN Computer Science
Abstract:In this work, we benchmark with different backbones and study their impact for self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address the three research questions: “What is the role of the backbone in performance L-SSAT?”, “What type of backbone is effective for different face analysis tasks?”, and “Is there any generalized backbone for effective face analysis with L-SSAT?”. Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone. 
[CV-17] Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement CVPR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在从自然语言生成结构化布局时,因采用纯代码输出范式而导致无法感知最终渲染视觉效果的问题,从而难以保证布局的可读性和美学质量。解决方案的关键在于提出一种自改进框架——视觉反馈布局模型(Visual Feedback Layout Model, VFLM),其核心是引入视觉反馈驱动的迭代优化机制,通过强化学习结合基于视觉的奖励模型(包含OCR准确率作为关键指标)来实现自适应反思式生成,使模型能够基于视觉结果不断修正输出,直至达到满意质量。
链接: https://arxiv.org/abs/2603.22187
作者: Junrong Guo,Shancheng Fang,Yadong Qu,Hongtao Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. It is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model’s iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at this https URL.
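The reward's OCR-accuracy component is only named in the abstract, not defined. A common way to score how legibly rendered text survives OCR is the normalized edit distance between the intended string and the recognized one; the sketch below uses that stand-in (both helper names are ours, not the paper's):

```python
def edit_distance(a: str, b: str) -> int:
    # dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_accuracy(target: str, recognized: str) -> float:
    # normalized similarity in [0, 1]; 1.0 = OCR reads the layout perfectly
    denom = max(len(target), len(recognized), 1)
    return 1.0 - edit_distance(target, recognized) / denom

assert ocr_accuracy("BIG SALE", "BIG SALE") == 1.0
assert ocr_accuracy("SALE", "SLE") == 0.75  # one dropped character
```

A reward like this is computed only on the final rendered outcome, matching the abstract's note that rewarding only the final generation drives the iterative refinement.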
[CV-18] ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
【速读】:该论文旨在解决直接偏好优化(Direct Preference Optimization, DPO)在对齐大型视觉语言模型(Large Vision-Language Models, LVLMs)时存在的**似然位移(Likelihood Displacement)问题,尤其关注由此引发的视觉锚点坍塌(Visual Anchor Collapse)现象——即模型在偏好优化过程中过度依赖语言先验而忽略视觉证据,导致严重幻觉。解决方案的关键在于提出一种非对称约束偏好优化(Asymmetric Constrained Preference Optimization, ACPO)**机制:通过引入一个复杂度感知的动态缩放系数,仅对拒绝项奖励施加不对称的梯度抑制,从而保持选择项分布的梯度稳定性,有效防止视觉token被语言先验压制。这一设计在不改变目标通用性的前提下,显著提升了多模态任务中的对齐质量和抗幻觉能力。
链接: https://arxiv.org/abs/2603.22165
作者: Kaili Huang,Hongming Zhang,Rui Shen,Linjun Dai,Jiahao Wang,Hanming Deng,Lewei Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods – a failure we term Visual Anchor Collapse – causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
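The asymmetric constraint can be illustrated against the standard DPO objective. In the sketch below, `scale` stands in for the paper's complexity-aware coefficient (whose exact form the abstract does not give) and multiplies only the rejected log-ratio, damping the gradient that term contributes while leaving the chosen term untouched:

```python
import math

def logsigmoid(z):
    # numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(chosen_lr, rejected_lr, beta=0.1):
    # standard DPO: -log sigmoid(beta * (r_chosen - r_rejected)),
    # where each argument is a policy/reference log-probability ratio
    return -logsigmoid(beta * (chosen_lr - rejected_lr))

def acpo_loss(chosen_lr, rejected_lr, beta=0.1, scale=0.5):
    # asymmetric variant: `scale` (a stand-in for the complexity-aware
    # coefficient) is applied exclusively to the rejected term
    return -logsigmoid(beta * (chosen_lr - scale * rejected_lr))

# sanity check: scale = 1 recovers plain DPO
assert abs(acpo_loss(0.8, -0.3, scale=1.0) - dpo_loss(0.8, -0.3)) < 1e-12
```

A finite-difference check confirms the intended effect: with `scale < 1`, the loss is less sensitive to the rejected log-ratio than plain DPO, which is the gradient suppression the paper argues prevents visual tokens from being crushed by language priors.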
[CV-19] dynActivation: A Trainable Activation Family for Adaptive Nonlinearity
【速读】:该论文旨在解决深度神经网络中激活函数静态性导致的训练效率低、泛化能力受限以及深层模型性能退化的问题。解决方案的关键在于提出一种逐层可学习的激活函数——dynActivation,其形式为 $ f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x $,其中 $ \alpha_i $ 和 $ \beta_i $ 为轻量级可学习标量,用于在基础非线性激活(如Mish或ReLU)与线性路径之间进行插值。这种动态调整机制使模型能够在深层网络中实现近似线性化,从而提升训练效率(最高达+54%),同时保持甚至优于静态激活函数的性能,在图像分类、语言建模等任务中均表现出更强的鲁棒性和稳定性。
链接: https://arxiv.org/abs/2603.22154
作者: Alois Bachmann
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 15 figures
Abstract:This paper proposes \mathrm{dynActivation}, a per-layer trainable activation defined as f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x, where \alpha_i and \beta_i are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and \mathrm{BaseAct}(x) can be any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language-modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, improving training efficiency by up to +54% over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to +14.02% on AttentionCNN, with an average improvement of +6.00% and a 24% convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below 95% test accuracy (95.3–99.3%), while ReLU collapses below 80% at 25 layers. Under FGSM at \varepsilon = 0.08, dynActivation(Mish) incurs a 55.39% accuracy drop versus 62.79% for ReLU (a 7.40% advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a 10.3% relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
ACM classes: I.2.6; I.5.1
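The per-layer rule f_i(x) = BaseAct(x)(α_i − β_i) + β_i x is simple enough to sketch directly. The NumPy version below treats α and β as plain floats rather than trained parameters, purely to expose the interpolation endpoints (α=1, β=0 recovers the base activation; α=β=1 gives the identity, i.e. a fully linearized layer):

```python
import numpy as np

def mish(x):
    # Mish base activation: x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

class DynActivation:
    """f_i(x) = BaseAct(x) * (alpha_i - beta_i) + beta_i * x, with
    alpha/beta as plain floats here instead of learned scalars."""

    def __init__(self, base_act=mish, alpha=1.0, beta=0.0):
        self.base_act = base_act
        self.alpha = alpha
        self.beta = beta

    def __call__(self, x):
        return self.base_act(x) * (self.alpha - self.beta) + self.beta * x

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(DynActivation(alpha=1.0, beta=0.0)(x), mish(x))  # pure Mish
assert np.allclose(DynActivation(alpha=1.0, beta=1.0)(x), x)        # linearized
```

The "linearize deep layers" finding in the abstract corresponds to trained α_i, β_i drifting toward the identity endpoint in later layers.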
[CV-20] Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation CVPR2026
【速读】:该论文旨在解决当前跨视图地理定位(cross-view geo-localization, CVGL)方法在无人机(UAV)无GNSS环境导航中的三大核心问题:一是现有方法依赖将无人机视角与机载地图瓦片匹配,导致精度与存储开销之间的权衡;二是忽略了无人机航向(heading)信息对导航的重要性;三是未能充分处理跨视图场景中显著的视角差异和重叠变化,限制了实际应用中的泛化能力。解决方案的关键在于提出Bearing-UAV,一种纯视觉驱动的跨视图导航方法,通过联合预测无人机绝对位置与航向,利用全局与局部结构特征并显式编码相对空间关系,从而实现高精度、轻量化且鲁棒的野外导航,尤其在视角变化、错位和特征稀疏条件下表现优异。
链接: https://arxiv.org/abs/2603.22153
作者: Kejia Liu,Haoyang Zhou,Ruoyu Xu,Peicheng Wang,Mingli Song,Haofei Zhang
机构: Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江区)区块链与数据安全研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as a conference paper by CVPR2026
Abstract:Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results that Bearing-UAV yields lower localization error than previous matching/retrieval paradigm across diverse terrains. Our code and dataset will be made publicly available.
[CV-21] OpenEarth-Agent : From Tool Calling to Tool Creation for Open-Environment Earth Observation
【速读】:该论文旨在解决开放环境中地球观测(Earth Observation, EO)自主化部署面临的挑战,即多源数据异构性和任务多样性导致现有遥感代理(remote sensing agents)受限于预定义工具和封闭环境,难以泛化到未见过的数据与任务。解决方案的关键在于提出首个面向开放环境EO的工具创建代理框架——OpenEarth-Agent,其通过自适应工作流规划与动态工具创建机制,结合多阶段工具与跨领域知识库的开放式集成,实现对全链条EO任务的鲁棒执行。这一设计使代理能够在仅提供6个预训练模型的情况下,达到甚至超越依赖104个专用工具的封闭式代理性能,并在面对数据异常时展现出更强的鲁棒性。
链接: https://arxiv.org/abs/2603.22148
作者: Sijie Zhao,Feng Liu,Xueliang Zhang,Hao Chen,Xinyu Gu,Zhe Jiang,Fenghua Ling,Ben Fei,Wenlong Zhang,Junjue Wang,Weihao Xuan,Pengfeng Xiao,Naoto Yokoya,Lei Bai
机构: Nanjing University(南京大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); Shanghai Jiao Tong University(上海交通大学); The University of Tokyo(东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures
Abstract:Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on pre-defined tools and are restricted to narrow scope, limiting their generalization to the diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution in the entire EO pipeline across multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents’ adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, devoid of any other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.
[CV-22] DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment CVPR2026
【速读】:该论文旨在解决高分辨率潜在扩散模型(latent diffusion models)中令牌(token)数量过多导致的训练与推理效率低下问题。现有方法通常通过增加每个令牌的通道数来提高压缩率,但仅依赖重建目标时,高维潜在空间容易丧失语义结构,从而增加扩散训练难度。解决方案的关键在于利用预训练扩散模型已具备的低维结构化潜在空间,提出一种轻量级适配策略——细节对齐变分自编码器(Detail-Aligned VAE, DA-VAE):其显式设计潜在布局,将基础分辨率下的C个通道直接来自预训练VAE,额外D个通道用于编码更高分辨率细节,并通过简单细节对齐机制保持原潜在空间结构。结合暖启动微调策略,该方法在Stable Diffusion 3.5上实现了1024×1024图像生成仅需32×32令牌(减少4倍),并首次支持2048×2048生成(6倍加速),同时保持图像质量。
链接: https://arxiv.org/abs/2603.22125
作者: Xin Cai,Zhiyuan You,Zhoutong Zhang,Tianfan Xue
机构: The Chinese University of Hong Kong (香港中文大学); Adobe (Adobe公司); Shanghai AI Laboratory (上海人工智能实验室); CPII under InnoHK (InnoHK计划下的CPII)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, however, already exhibit a structured, lower-dimensional latent space; thus, a simpler idea is to expand the latent dimensionality while preserving this structure. We therefore propose \textbf{Detail}-\textbf{Aligned} VAE, which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first C channels come directly from the pretrained VAE at a base resolution, while an additional D channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables 1024 \times 1024 image generation with Stable Diffusion 3.5 using only 32 \times 32 tokens, 4\times fewer than the original model, within 5 H100-days. It further unlocks 2048 \times 2048 generation with SD3.5, achieving a 6\times speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
[CV-23] Biophysics-Enhanced Neural Representations for Patient-Specific Respiratory Motion Modeling
【速读】:该论文旨在解决放射治疗中因呼吸运动导致的剂量精准投递难题,尤其是在肺部和上腹部区域,呼吸运动会引入显著的治疗不确定性。传统方法依赖于成对配准技术,难以实现时空连续且可推广的运动建模,尤其在数据外推场景下表现不佳。解决方案的关键在于提出一种物理正则化的隐式代理运动建模方法(Physics-Regularized Implicit Surrogate-Based Modeling for Respiratory Motion, PRISM-RM),其核心创新是利用隐式神经表示(Implicit Neural Representations, INR)构建无需固定参考呼吸状态的轨迹感知型运动模型,通过引入生物物理约束确保时间维度上的生理合理性,并实现空间-时间连续、微分同胚的运动表征,从而在插值性能相当的前提下显著提升外推能力,展现出在呼吸运动建模领域的强大潜力。
链接: https://arxiv.org/abs/2603.22123
作者: Jan Boysen,Hristina Uzunova,Heinz Handels,Jan Ehrhardt
机构: University of Lübeck (吕贝克大学); University Medical Center Schleswig-Holstein (石勒苏益格-荷尔斯泰因大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL
Abstract:A precise spatial delivery of the radiation dose is crucial for the treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INR) for surrogate-based motion modeling. Therefore, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware spatio-temporally continuous and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches both our approaches perform equally well in interpolation, but underperform in extrapolation scenarios. However, the methodical features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.
[CV-24] Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding CVPR-2026
【速读】:该论文旨在解决文本驱动的视频片段检索(Text-driven Video Moment Retrieval, VMR)中因难以捕捉未剪辑视频中的隐含时序动态而导致的长序列定位不准确问题。现有方法通常依赖自然语言查询(Natural Language Queries, NLQs)或静态图像增强,忽视了运动信息,并在基于Transformer的架构中面临高计算开销。其解决方案的关键在于提出一种两阶段框架:第一阶段利用大语言模型(Large Language Model, LLM)引导的字幕匹配提取相关文本线索,并融合查询生成辅助短视频,以捕获隐式运动信息作为时序先验(temporal priors);第二阶段通过多模态受控Mamba网络处理增强查询,引入视频引导门控机制实现生成先验与长序列的有效融合并抑制噪声,从而提升定位精度且计算效率更高。
链接: https://arxiv.org/abs/2603.22121
作者: Yunzhuo Sun,Xinyue Liu,Yanyang Li,Nanding Wu,Yifang Xu,Linlin Zong,Xianchao Zhang,Wenxin Liang
机构: Dalian University of Technology (大连理工大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The paper is accepted by CVPR-2026
Abstract:Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
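The second stage's video-guided gating can be caricatured as a learned blend between the long-sequence features and the generated prior. The toy below uses a plain linear-plus-sigmoid gate; the real gate lives inside the Mamba selection mechanism, so shapes and parameters here are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def video_guided_gate(seq_feats, prior_feats, W_g, b_g):
    # per-timestep gate computed from the generated-video prior:
    # g ~ 1 lets the prior through, g ~ 0 suppresses a noisy prior
    g = sigmoid(prior_feats @ W_g + b_g)   # (T, 1), broadcasts over channels
    return g * prior_feats + (1.0 - g) * seq_feats

T, d = 4, 3
seq = np.arange(T * d, dtype=float).reshape(T, d)   # long-sequence features
prior = np.ones((T, d))                             # generated temporal prior
closed = video_guided_gate(seq, prior, np.zeros((d, 1)), -50.0)
opened = video_guided_gate(seq, prior, np.zeros((d, 1)), 50.0)
assert np.allclose(closed, seq)    # gate shut: prior ignored
assert np.allclose(opened, prior)  # gate open: prior dominates
```

The two extreme gate settings show the filtering behavior the abstract describes: a confident prior is fused in, while noise can be gated out entirely.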
[CV-25] StreamingClaw Technical Report
【速读】:该论文旨在解决当前智能体在流式视频理解与具身智能(embodied intelligence)应用中面临的三大核心问题:一是缺乏实时推理能力,难以应对持续输入的视频流;二是缺少长期多模态记忆机制,无法支持跨时间步的上下文关联与知识积累;三是无法实现感知-决策-动作闭环控制,限制了其在真实物理环境中的主动交互与部署。解决方案的关键在于提出StreamingClaw框架,该框架通过集成五大核心能力实现统一建模:(1) 实时流式推理;(2) 基于在线演化目标的未来事件预测与主动交互;(3) 多模态长时记忆的分层存储与高效检索;(4) 感知-决策-动作闭环控制,并引入面向物理环境的动作导向技能;(5) 兼容OpenClaw开源生态,便于资源复用与扩展。这一设计实现了流式视频理解、多模态长期记忆与主动交互的一体化整合,并可将决策直接转化为物理世界中的可执行动作,从而推动具身智能系统的实用化落地。
链接: https://arxiv.org/abs/2603.22120
作者: Jiawei Chen,Zhe Chen,Chaoqun Du,Maokui He,Wei He,Hengtao Li,Qizhen Li,Zide Liu,Hao Ma,Xuhao Pan,Chang Ren,Xudong Rao,Xintian Shen,Chenfeng Wang,Tao Wei,Chengjun Yu,Pengfei Yu,Shengyu Yao,Chunpeng Zhou,Kun Zhan,Lihao Zheng,Pan Zhou,Xuhan Zhu,Yufei Zheng
机构: Li Auto Inc. (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Progress
Abstract:Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
[CV-26] FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario CVPR2026
【速读】:该论文旨在解决在自由移动场景下(free-moving scenario)对关节物体(articulated object)进行高可扩展性重建的问题,现有方法通常依赖于离散的关节状态或随意拍摄的单目视频,但面临轴对齐困难或覆盖不足的局限。其解决方案的关键在于提出FreeArtGS,该方法通过结合自由移动部件分割与联合估计及端到端优化机制,仅需单目RGB-D视频输入即可实现高效重建:首先利用预训练点跟踪和特征模型先验优化部件分割模块以识别刚体部分;其次通过联合估计模块校准统一的物体到相机位姿,并从部件分割结果中鲁棒恢复关节类型与轴线;最终基于3DGS(3D Gaussian Splatting)实现视觉纹理、几何结构与关节角度的联合优化重建。
链接: https://arxiv.org/abs/2603.22102
作者: Hang Dai,Hongwei Fan,Han Zhang,Duojin Wu,Jiyao Zhang,Hao Dong
机构: Peking University (北京大学); Zhejiang University (浙江大学); PrimeBot
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Accepted to CVPR 2026
Abstract:The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: this https URL
[CV-27] Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在开放世界场景中易受视觉越狱攻击(visual jailbreak attacks)诱导生成有害内容的问题,从而保障模型的安全性与可信使用。现有激活引导(activation steering)方法虽能通过注入方向向量诱导拒绝行为,但存在过度拒绝(over-refusal)问题,导致良性输入下的性能下降,且因缺乏理论可解释性而鲁棒性和有效性受限。其解决方案的关键在于提出NullSteer框架——一种基于零空间投影的激活防御机制:通过线性变换在模型激活空间中构建拒绝方向,在保持良性子空间扰动为零的同时,动态地沿潜在有害方向诱导拒绝行为,理论上实现安全增强而不损害模型通用能力。
链接: https://arxiv.org/abs/2603.22094
作者: Xingyu Zhu,Beier Zhu,Shuo Wang,Junfeng Fang,Kesen Zhao,Hanwang Zhang,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model’s general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
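The "zero perturbation within the benign subspace" property is ordinary linear algebra: project the steering vector onto the orthogonal complement of a basis for benign activations. The sketch below is that generic construction, not the paper's exact transformation:

```python
import numpy as np

def null_space_steer(v, benign_basis):
    # remove from the steering vector v its component inside span(B),
    # so injecting the result perturbs no benign direction
    B = np.asarray(benign_basis, dtype=float)
    P = B @ np.linalg.pinv(B.T @ B) @ B.T   # orthogonal projector onto span(B)
    return v - P @ v

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 3))   # columns span the "benign" subspace
v = rng.standard_normal(8)        # raw refusal direction
v_safe = null_space_steer(v, B)

assert np.allclose(B.T @ v_safe, 0.0, atol=1e-8)  # zero benign perturbation
assert np.linalg.norm(v_safe) > 1e-6              # still steers elsewhere
```

This is why the method can, in principle, enhance safety without touching benign behavior: any activation lying in the benign subspace has zero inner product with the projected steering vector.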
[CV-28] P-Flow: Prompting Visual Effects Generation
【速读】:该论文旨在解决视频生成模型中动态视觉效果(dynamic visual effects)的定制化难题,即如何通过文本提示精准控制如物体破碎、爆炸等随时间演变且依赖外观特征的高阶语义视觉现象。现有方法主要针对低级运动(如主体或相机轨迹)进行控制,难以有效捕捉复杂动态效果的时序逻辑与语义细节。解决方案的关键在于提出P-Flow框架,其无需训练即可实现测试阶段的提示优化:利用视觉语言模型(vision-language models, VLMs)的语义与时间推理能力,基于参考视频与生成结果之间的差异迭代调整文本提示,使提示逐步演化以在新场景中更准确地诱导目标动态视觉效果。
链接: https://arxiv.org/abs/2603.22091
作者: Rui Zhao,Mike Zheng Shou
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at this https URL.
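Being training-free, the method is essentially a test-time loop: generate, score the visual effect against the reference, revise the prompt, repeat. The skeleton below keeps the best-scoring prompt; all four callables are placeholders the caller must supply (none of these names come from the paper):

```python
def refine_prompt(initial_prompt, generate_video, score_effect, revise, steps=4):
    """Iterative test-time prompt refinement, keeping the best-scoring prompt.

    generate_video(prompt) -> video       # text-to-video model (placeholder)
    score_effect(video)    -> float       # VLM judging the dynamic effect
    revise(prompt, video, score) -> str   # VLM proposing a refined prompt
    """
    best_prompt, best_score = initial_prompt, float("-inf")
    prompt = initial_prompt
    for _ in range(steps):
        video = generate_video(prompt)
        score = score_effect(video)
        if score > best_score:
            best_prompt, best_score = prompt, score
        prompt = revise(prompt, video, score)
    return best_prompt, best_score

# toy stand-ins: "video" is just the prompt length, the "effect score"
# peaks at length 20, and each revision appends an intensifier
best, score = refine_prompt(
    "crush the car",
    generate_video=len,
    score_effect=lambda v: -abs(v - 20),
    revise=lambda p, v, s: p + "!",
)
assert best == "crush the car!!!" and score == -4
```

Swapping the toy stubs for a real text-to-video model and a VLM scorer yields the refine-and-keep-best behavior the abstract describes, without modifying the underlying generator.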
[CV-29] Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning CVPR2026
【速读】:该论文旨在解决多模态3D视觉-语言模型在测试时分布偏移(domain shift)下性能显著下降的问题,尤其是在在线测试过程中因历史信息存储有限导致的渐进式信息丢失以及预测logits融合方式不稳定的挑战。其解决方案的关键在于提出BayesMM框架——一种基于贝叶斯分布学习的测试时点云分析方法,通过将文本先验与流式视觉特征分别建模为高斯分布,并利用贝叶斯模型平均(Bayesian model averaging)自动调整两模态贡献权重,从而实现无需训练即可持续适应演化测试数据的稳定预测。
链接: https://arxiv.org/abs/2603.22070
作者: Xingyu Zhu,Liang Yi,Shuo Wang,Wenbo Zhu,Yonglinag Wu,Beier Zhu,Hanwang Zhang
机构: University of Science and Technology of China (中国科学技术大学); Opus AI Research; Southeast University (东南大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
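The fusion step can be sketched as Bayesian model averaging over two per-class Gaussians, one from the textual prior and one from streaming visual features. In the sketch below the log model weights are fixed constants; in the paper they come from posterior evidence:

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    # isotropic Gaussian log-density, summed over feature dimensions
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))

def bma_predict(x, text_params, visual_params, log_w=(0.0, 0.0)):
    # Bayesian model averaging over the textual and visual Gaussian of each
    # class; log_w are log model weights (fixed here, evidence-based in BayesMM)
    classes = sorted(text_params)
    scores = np.array([
        np.logaddexp(gaussian_logpdf(x, *text_params[c]) + log_w[0],
                     gaussian_logpdf(x, *visual_params[c]) + log_w[1])
        for c in classes
    ])
    probs = np.exp(scores - np.logaddexp.reduce(scores))
    return dict(zip(classes, probs))

text_params = {0: (np.zeros(2), 1.0), 1: (3.0 * np.ones(2), 1.0)}
visual_params = {0: (0.2 * np.ones(2), 1.0), 1: (2.8 * np.ones(2), 1.0)}
probs = bma_predict(np.zeros(2), text_params, visual_params)
assert probs[0] > 0.99 > probs[1]
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Online adaptation then amounts to updating the visual means and variances from arriving test samples while the textual parameters stay fixed, so no training step is needed.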
[CV-30] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
【速读】:该论文旨在解决预训练视觉编码器(vision encoders)因主要基于2D图像数据训练而缺乏对真实世界中物体与背景之间3D空间关系建模能力的问题,从而限制了其在众多下游任务中的表现。解决方案的关键在于提出SpatialBoost框架,通过将密集的3D空间信息从2D图像转化为语言描述,并借助大语言模型(Large Language Model, LLM)将其注入现有视觉编码器中,以增强其空间感知能力;该方法采用多轮思维链(Chain-of-Thought, CoT)推理机制,逐步融合密集空间知识并构建分层的空间理解结构,从而显著提升模型在需要3D感知和通用视觉能力的任务上的性能。
链接: https://arxiv.org/abs/2603.22057
作者: Byungwoo Jeon,Dongyoung Kim,Huiwon Jang,Insoo Kim,Jinwoo Shin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages; 7 figures
Abstract:Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.
[CV-31] FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation CVPR2026
【速读】:该论文旨在解决当前艺术字体生成方法中存在的风格多样性有限和控制粒度粗略的问题。其核心解决方案是提出一种基于元素驱动的字体生成框架 FontCrafter,关键创新在于将字体的基本视觉单元(即“元素”)作为风格参考,并区分结构化元素(如花朵、石头)与无结构纹理元素(如火焰、云朵)。通过引入一种上下文感知的图像修复策略(in-context generation),利用 inpainting 模型在像素层面迁移元素风格;同时设计轻量级的 Context-aware Mask Adapter (CMA) 注入形状信息以实现对字形结构的精细控制,并结合无需训练的注意力重定向机制抑制笔画幻觉,最终通过边缘重绘提升边界自然性,从而在零样本条件下显著提升风格与结构保真度,且支持灵活的风格混合等可控生成能力。
链接: https://arxiv.org/abs/2603.22054
作者: Wuyang Luo,Chengkai Tan,Chang Ge,Binye Hong,Su Yang,Yongjiu Ma
机构: Dalian University of Technology (大连理工大学); Shanghai Key Laboratory of Intelligent Information Processing, Fudan University (复旦大学智能信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in CVPR 2026
Abstract:Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.
[CV-32] Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理多对象组合场景时,其欧几里得嵌入难以准确建模部分与整体之间层次关系(如part-to-whole或parent-child结构)的问题。现有超球面VLM方法虽能通过蕴含关系(entailment)更好地保留层次结构,但未考虑不同部分对整体的语义代表性差异。解决方案的关键在于提出不确定性引导的组合超球面对齐(UNCertainty-guided Compositional Hyperbolic Alignment, UNCHA),通过引入超球面不确定性来建模部分对整体的语义代表性:代表性高的部分被赋予较低不确定性,反之则较高;该不确定性进一步用于加权对比损失,并结合基于熵的蕴含损失进行校准,从而学习更精确的部分-整体排序,提升模型对复杂多对象场景的组成结构理解能力。
链接: https://arxiv.org/abs/2603.22042
作者: Hayeon Kim,Ji Ha Jang,Junghun James Kim,Se Young Chun
机构: Seoul National University (首尔国立大学); INMC IPAI (信息与通信技术研究所人工智能项目)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: this https URL.
[CV-33] DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在生成过程中可能产生不安全内容的安全问题,尤其针对现有推理时防御方法因仅进行无类别感知的词元级干预而无法有效捕捉跨完整词元序列分布的恶意语义、且易受对抗提示攻击的局限性。其解决方案的关键在于提出一种双阶段推理时防御框架DTV I(Dual-stage Text-to-Image Defense),首先引入类别感知的序列级干预机制,在完整提示嵌入空间中识别并修正恶意语义,其次在视觉生成阶段进一步抑制残余不安全影响,从而实现对多种有害类别(如色情内容)的高效且鲁棒的防御,同时保持良性提示下的合理生成质量。
链接: https://arxiv.org/abs/2603.22041
作者: Binhong Tan,Zhaoxin Wang,Handing Wang
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense while preserving reasonable generation quality on benign prompts, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while maintaining generation quality on benign prompts.
[CV-34] GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction
【速读】:该论文旨在解决从多视角图像中重建半透明物体表面几何结构的难题,传统方法如可微分路径追踪和神经隐式场虽有效但计算成本高昂,而基于3D高斯溅射(3DGS)的现有方法在处理半透明物体时因未考虑其光学特性而表现不佳。解决方案的关键在于提出一种新型的3DGS框架GTSR(Gaussian-based Translucent Surface Reconstruction),通过引入两组高斯分布——表面高斯(surface Gaussians)和内部高斯(interior Gaussians)——分别建模物体表面几何与光线穿过介质时的散射颜色,并利用菲涅尔项(Fresnel term)对二者进行融合渲染;同时结合迪士尼BSDF模型(Disney BSDF model)与延迟渲染(deferred rendering)策略增强法向量和深度约束,从而提升非轮廓区域的细节重建质量。
链接: https://arxiv.org/abs/2603.22036
作者: Youwen Yuan,Xi Zhao
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing translucent objects from multi-view images is a difficult problem. Previously, researchers have used differentiable path tracing and the neural implicit field, which require relatively large computational costs. Recently, many works have achieved good reconstruction results for opaque objects based on a 3DGS pipeline with much higher efficiency. However, such methods have difficulty dealing with translucent objects, because they do not consider the optical properties of translucent objects. In this paper, we propose a novel 3DGS-based pipeline (GTSR) to reconstruct the surface geometry of translucent objects. GTSR combines two sets of Gaussians, surface and interior Gaussians, which are used to model the surface and scattering color when lights pass translucent objects. To render the appearance of translucent objects, we introduce a method that uses the Fresnel term to blend two sets of Gaussians. Furthermore, to improve the reconstructed details of non-contour areas, we introduce the Disney BSDF model with deferred rendering to enhance constraints of the normal and depth. Experimental results demonstrate that our method outperforms baseline reconstruction methods on the NeuralTO Syn dataset while showing great real-time rendering performance. We also extend the dataset with new translucent objects of varying material properties and demonstrate our method can adapt to different translucent materials.
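GTSR 用菲涅尔项把"表面高斯"(反射)与"内部高斯"(透射/散射)的渲染颜色混合。下面用常见的 Schlick 近似给出一个最小示意(论文未必采用 Schlick 形式,此处仅演示菲涅尔加权混合这一思路):

```python
def schlick_fresnel(cos_theta, ior=1.5):
    # Schlick approximation of Fresnel reflectance at a dielectric boundary.
    f0 = ((ior - 1.0) / (ior + 1.0)) ** 2  # reflectance at normal incidence
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def blend_translucent(surface_rgb, interior_rgb, cos_theta, ior=1.5):
    # Fresnel-weighted blend: reflected (surface) vs transmitted (interior) color.
    f = schlick_fresnel(cos_theta, ior)
    return [f * s + (1.0 - f) * i for s, i in zip(surface_rgb, interior_rgb)]
```

其中 `cos_theta` 为视线与表面法线夹角的余弦:掠射角(cos_theta→0)时菲涅尔项趋近 1,外观由表面高斯主导;正视时内部散射颜色占主导。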
[CV-35] Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models
【速读】:该论文旨在解决如何高效利用超大规模预训练文本到图像(Text-to-Image, T2I)模型在真实世界图像恢复(Real-World Image Restoration, Real-IR)任务中的潜力,以及如何充分发挥其性能的问题。解决方案的关键在于提出ResFlow-Tuner框架,该框架基于先进的流匹配模型FLUX.1-dev,结合统一多模态融合(Unified Multi-Modal Fusion, UMMF)与测试时缩放(Test-Time Scaling, TTS)技术:一方面通过UMMF将多模态条件编码为统一序列以引导高质量图像合成;另一方面引入无需训练的测试时缩放机制,在推理阶段利用奖励模型(Reward Model, RM)反馈动态调整去噪方向,从而在可控计算开销下显著提升恢复性能。
链接: https://arxiv.org/abs/2603.22027
作者: Purui Bai,Junxian Duan,Pin Wang,Jinhua Hao,Ming Sun,Chao Zhou,Huaibo Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 10 figures
Abstract:Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
[CV-36] 6D Robotic OCT Scanning of Curved Tissue Surfaces
【速读】:该论文旨在解决在机器人辅助光学相干断层扫描(Optical Coherence Tomography, OCT)中,对大而弯曲组织表面进行一致且高精度扫描时面临的挑战。传统方法依赖于图像配准或仅限于三维平移运动的扫描策略,在处理曲面时存在误差累积和配准失败的问题。其解决方案的关键在于提出一种用于实现六维(6D)手眼标定的标记系统,通过该标记可获得稳定可靠的变换估计,从而使得机器人能够精确控制OCT探头的空间位姿。此方法不依赖图像配准,避免了扫描路径上误差的累积,显著提升了对复杂曲面结构的扫描一致性与精度。
链接: https://arxiv.org/abs/2603.22012
作者: Suresh Guttikonda,Maximilian Neidhardt,Vidas Raudonis,Alexander Schlaefer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at IEEE ISBI 2026
Abstract:Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning are limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach is not relying on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
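完成六维手眼标定后,OCT 探头坐标系中的点可经一个 4×4 刚体变换直接映射到机器人基座坐标系,从而不依赖图像配准就能拼接曲面扫描。坐标变换这一步可示意如下(仅演示齐次变换的用法,简化为单轴旋转,与论文的具体标定流程无关):

```python
import math

def rigid_transform(yaw_deg, translation):
    # Build a 4x4 homogeneous transform from a yaw angle and a translation.
    # A full 6-DoF calibration would supply all three rotation angles.
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def to_robot_frame(T, point):
    # Map an OCT-frame point into the robot base frame.
    x, y, z = point
    p = [x, y, z, 1.0]
    return [sum(T[r][c] * p[c] for c in range(4)) for r in range(3)]
```

由于每个体素都经同一标定变换映射,误差不会沿扫描路径累积,这正是该方法相对逐帧配准拼接的优势所在。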
[CV-37] SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation
【速读】:该论文旨在解决当前基于Transformer的3D医学图像分割模型在计算复杂度和参数量方面过高,难以适配体积数据及标注数据有限的临床场景的问题。其关键解决方案是提出SegMaFormer——一种轻量级混合架构,通过在分层体积编码器中协同使用Mamba模块与Transformer模块实现高效长程依赖建模:具体而言,在早期高分辨率阶段采用Mamba层以降低计算开销并捕捉关键空间上下文,而在后期低分辨率阶段保留自注意力机制以精细化特征表示;同时引入广义旋转位置嵌入(generalized rotary position embeddings)增强空间感知能力。该设计使模型在保持高性能的同时,参数量减少达75倍、浮点运算次数(FLOPs)显著下降,且在Synapse、BraTS和ACDC三个公开基准上性能媲美大型模型。
链接: https://arxiv.org/abs/2603.22002
作者: Duy D. Nguyen,Phat T. Tran-Truong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
[CV-38] STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection
【速读】:该论文旨在解决基于Transformer的RGB-D显著目标检测(SOD)方法中存在的两个核心问题:一是注意力机制带来的二次复杂度计算负担,二是局部细节提取能力有限。解决方案的关键在于提出一种新颖的超像素令牌增强网络(STENet),其核心创新是将超像素(superpixel)引入跨模态交互中,设计了两个定制化的超像素驱动的跨模态交互模块——全局增强模块(Superpixel Attention Global Enhancing Module)和局部精修模块(Superpixel Attention Local Refining Module)。前者通过建模像素到超像素的全局关系替代传统像素到像素的关系,从而降低计算复杂度并捕捉区域级语义信息;后者利用超像素内像素相似性筛选局部关键像素并进行特征增强,以提升局部细节表达能力。最终,通过融合全局、局部及跨尺度特征实现全面的特征表示,显著提升了RGB-D SOD性能。
链接: https://arxiv.org/abs/2603.21999
作者: Jianlin Chen,Gongyang Li,Zhijiang Zhang,Liang Chang,Dan Zeng
机构: Shanghai University (上海大学); Yunnan University of Finance and Economics (云南财经大学); Chinese Academy of Science (中国科学院); Institute for Urban Governance (城市治理研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures, accepted by IEEE TMM
Abstract:Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer’s exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. At its core are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at this https URL.
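像素-超像素注意力的关键收益在于复杂度:N 个像素查询只对 S 个超像素键/值做注意力,成本从 O(N²) 降为 O(N·S)。下面是一个纯 Python 的最小示意(真实实现基于批量张量运算,此处形状与命名均为演示假设):

```python
import math

def pixel_to_superpixel_attention(pixels, superpixels):
    # Each of N pixel queries attends over S superpixel keys/values,
    # giving O(N*S) cost instead of O(N^2) pixel-to-pixel attention.
    d = len(pixels[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in pixels:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in superpixels]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Aggregate superpixel values with the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, superpixels))
                    for j in range(d)])
    return out
```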
[CV-39] LRC-WeatherNet: LiDAR RADAR and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving
【速读】:该论文旨在解决自动驾驶车辆在恶劣天气(如雨、雾、雪)条件下,由于LiDAR、雷达(RADAR)和RGB相机等传感器性能下降而导致感知与导航能力受限的问题。其解决方案的关键在于提出LRC-WeatherNet——一个融合LiDAR、RADAR与摄像头数据的新型多模态融合框架,通过早期融合(统一鸟瞰图表示)与中层门控融合(模态特定特征图融合)相结合的方式,动态适应不同天气下各传感器的可靠性变化,从而实现鲁棒且实时的天气分类。
链接: https://arxiv.org/abs/2603.21987
作者: Nour Alhuda Albashir,Lars Pernickel,Danial Hamoud,Idriss Gouigah,Eren Erdal Aksoy
机构: Halmstad University (哈尔姆斯塔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication at IEEE Intelligent Vehicles Symposium - IVS 2026
Abstract:Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, they also suffer distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird’s Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code in this https URL.
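中层门控融合的思路是:对 LiDAR/RADAR/相机三路特征图按"当前天气下各传感器可靠性"动态加权求和。下面是一个示意性实现(门控分数在真实网络中由一个小型可学习头给出,此处直接作为输入;特征图简化为展平向量):

```python
import math

def gated_fusion(feature_maps, gate_logits):
    # Softmax the per-modality gate logits, then take a weighted sum of the
    # same-shape (flattened) feature maps.
    top = max(gate_logits)
    gates = [math.exp(g - top) for g in gate_logits]
    z = sum(gates)
    gates = [g / z for g in gates]
    n = len(feature_maps[0])
    return [sum(g * fm[i] for g, fm in zip(gates, feature_maps))
            for i in range(n)]
```

例如在浓雾场景下,门控头若给 RADAR 分配极高的 logit,融合结果就几乎完全由 RADAR 特征决定。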
[CV-40] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
【速读】:该论文旨在解决多模态人像生成中视频与音频同步性差、模型架构复杂以及推理效率低的问题。其核心解决方案是提出一个单流Transformer架构(single-stream Transformer),将文本、视频和音频统一编码为序列token,仅通过自注意力机制实现跨模态融合,避免了传统多流或交叉注意力结构的复杂性,同时保持了训练与推理基础设施的兼容性。该设计在保证高质量人脸表现、自然语音-表情协调性和精确音视频同步的基础上,结合模型蒸馏、潜在空间超分辨率和Turbo VAE解码器,显著提升了生成效率——可在单张H100 GPU上2秒内完成5秒256p视频生成,且在自动评估和人工对比中均优于主流开源模型。
链接: https://arxiv.org/abs/2603.21986
作者: SII-GAIR,Sand.ai:Ethan Chern,Hansi Teng,Hanwen Sun,Hao Wang,Hong Pan,Hongyu Jia,Jiadi Su,Jin Li,Junjie Yu,Lijie Liu,Lingzhi Li,Lyumanshan Ye,Min Hu,Qiangang Wang,Quanwei Qi,Steffi Chern,Tao Bu,Taoran Wang,Teren Xu,Tianning Zhang,Tiantian Mi,Weixian Xu,Wenqiang Zhang,Wentai Zhang,Xianping Yi,Xiaojie Cai,Xiaoyang Kang,Yan Ma,Yixiu Liu,Yunbo Zhang,Yunpeng Huang,Yutong Lin,Zewei Tao,Zhaoliang Liu,Zheng Zhang,Zhiyao Cen,Zhixuan Yu,Zhongshu Wang,Zhulin Hu,Zijin Zhou,Zinan Guo,Yue Cao,Pengfei Liu
机构: Sand.ai; Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
[CV-41] GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design CVPR2026
【速读】:该论文旨在解决参数化计算机辅助设计(Parametric Computer-Aided Design, CAD)中长命令序列生成的难题,尤其在复杂几何与拓扑依赖关系下,现有基于Transformer的模型因二次注意力计算开销和有限上下文窗口而难以扩展。其解决方案的关键在于提出GeoFusion-CAD,一个端到端的扩散框架,通过将CAD程序编码为分层树结构,在状态空间扩散过程中联合建模几何与拓扑信息;其中引入轻量级C-Mamba模块,利用选择性状态转移机制有效捕捉长程结构依赖,从而实现跨长命令序列的一致性生成。
链接: https://arxiv.org/abs/2603.21978
作者: Xiaolei Zhou,Chuangjie Fang,Jie Wu,Jingyi Yang,Boyi Lin,Jianwei Zheng
机构: Zhejiang University of Technology (浙江工业大学); Hangzhou International Innovation Institute, Beihang University (杭州国际创新研究院,北航)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to CVPR 2026 (Findings). Includes supplementary material
Abstract:Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windowing hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our proposal encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark that increases the sequence length ranging from 40 to 240 while preserving sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available at GitHub.
[CV-42] Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention CVPR2026
【速读】:该论文旨在解决视频大语言模型(Video-LLMs)因视觉标记(visual tokens)数量庞大而导致的高计算成本问题。现有令牌压缩方法通常采用两阶段时空分离的压缩策略,依赖特定阶段的指标并隐含假设时空可分离性,但在极低保留率下常导致分配失衡和关键视觉证据丢失。其解决方案的关键在于将令牌压缩重新建模为全局令牌保留池中的时空分配任务,并提出一种统一的选择机制,融合注意力权重与语义相似度,以全局方式选择贡献高且冗余低的令牌;未选令牌通过聚类合并后重填,确保信息完整性;同时在大语言模型内部引入文本感知合并机制,基于查询相关性进行二次压缩。该方法无需微调即可作为即插即用模块适配现有Video-LLMs,在仅保留约2%视觉令牌的情况下,仍能保持90.1%的基线性能,同时将浮点运算量(FLOPs)降至约2.6%,显著降低推理延迟与内存消耗。
链接: https://arxiv.org/abs/2603.21957
作者: Junhao Du,Jialong Xue,Anqi Li,Jincheng Dai,Guo Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
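"高贡献、低冗余"的统一选择准则可以用类似 MMR 的贪心流程来示意:每一步选出注意力得分高、且与已选 token 余弦相似度低的候选(打分形式为演示假设,并非论文公式):

```python
def select_tokens(features, attn, keep, beta=1.0):
    # Greedily keep tokens with high attention score and low cosine similarity
    # to the already-selected set (MMR-style "contribution minus redundancy").
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return num / (na * nb)

    selected = [max(range(len(attn)), key=lambda i: attn[i])]
    while len(selected) < keep:
        best, best_score = None, float("-inf")
        for i in range(len(features)):
            if i in selected:
                continue
            redundancy = max(cos(features[i], features[j]) for j in selected)
            score = attn[i] - beta * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)
```

按论文思路,未入选的 token 不是直接丢弃,而是聚类合并后回填到保留池中,以保证信息完整性。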
[CV-43] Group3D: MLLM -Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
【速读】:该论文旨在解决多视角RGB场景下开放词汇3D目标检测中存在的实例构建错误问题,尤其是由于仅依赖几何一致性进行碎片合并而导致的过合并(over-merging)或单个实例碎片化(fragmentation)问题。传统方法将几何结构构建与语义标注解耦,导致在几何证据不完整或视图依赖性强时产生不可逆的关联错误。解决方案的关键在于提出Group3D框架,其核心创新是将语义约束直接整合进实例构建过程:通过多模态大语言模型(MLLM)生成场景自适应词汇,并将其组织为语义兼容组(semantic compatibility groups),这些组在合并阶段作为语义门控机制——仅当3D碎片同时满足语义兼容性和几何一致性时才被合并。这一设计有效缓解了纯几何驱动的合并偏差,同时增强了跨视角类别变化的鲁棒性,从而实现更准确、泛化能力更强的开放词汇3D检测。
链接: https://arxiv.org/abs/2603.21944
作者: Youbin Kim,Jinho Park,Hogun Park,Eunbyung Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 7 figures, Project page: this https URL
Abstract:Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at this https URL.
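Group3D 的核心约束是:只有当两个 3D 碎片同时满足语义兼容(标签落入同一兼容组)且几何一致时才允许合并。该合并门控可示意为(字段名与"质心距离"这一几何判据均为演示假设):

```python
def should_merge(frag_a, frag_b, compat_groups, dist_thresh=0.3):
    # Merge two 3D fragments only if (1) their open-vocabulary labels fall in
    # one semantic compatibility group and (2) their centroids are close.
    semantically_ok = any(frag_a["label"] in g and frag_b["label"] in g
                          for g in compat_groups)
    dist = sum((a - b) ** 2 for a, b in
               zip(frag_a["centroid"], frag_b["centroid"])) ** 0.5
    return semantically_ok and dist < dist_thresh
```

兼容组(如 {"sofa", "couch"})既吸收了 MLLM 在不同视角下的类别叫法差异,又阻止了几何上接近但语义不同的物体被过度合并。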
[CV-44] GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction CVPR2026
【速读】:该论文旨在解决在无GPS环境下,高精度与实时性难以兼得的难题,即当前细粒度跨视角地理定位(Fine-Grained Cross-View Geolocalization, FG-CVG)方法普遍面临准确率与速度之间的权衡问题。其解决方案的关键在于提出GeoFlow框架,该框架通过学习一个直接的概率映射模型,预测任意初始位置假设所需的位移(距离和方向)以进行修正,并结合创新的迭代精化采样(Iterative Refinement Sampling, IRS)推理算法,使一组初始假设能够从随机起点逐步“流动”至稳定收敛的共识结果,从而实现无需重新训练即可灵活调节计算资源与性能之间的平衡,最终在KITTI和VIGOR数据集上达到29 FPS的实时运行速度且保持竞争力的定位精度。
链接: https://arxiv.org/abs/2603.21943
作者: Ayesh Abu Lehyeh,Xiaohan Zhang,Ahmad Arrabi,Waqas Sultani,Chen Chen,Safwan Wshah
机构: Vermont Artificial Intelligence Lab, Department of Computer Science, University of Vermont (佛蒙特大学人工智能实验室,计算机科学系,佛蒙特大学); Intelligent Machines Lab, Information Technology University (智能机器实验室,信息技术大学); Institute of Artificial Intelligence, University of Central Florida (人工智能研究所,中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)
Abstract:Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively ‘flow’ from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.
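IRS 的推理循环可以用一个玩具位移预测器来演示:一组随机初始化的假设点反复加上预测位移而"流动",最后取坐标中位数作为共识(真实系统中 `predict_offset` 是学习得到的网络,此处仅为示意):

```python
import random

def iterative_refinement_sampling(predict_offset, n_hyp=32, n_iter=10,
                                  bound=100.0, seed=0):
    # Refine a population of 2-DoF location hypotheses and return the
    # consensus (coordinate-wise median).
    rng = random.Random(seed)
    hyps = [[rng.uniform(-bound, bound), rng.uniform(-bound, bound)]
            for _ in range(n_hyp)]
    for _ in range(n_iter):
        new = []
        for h in hyps:
            dx, dy = predict_offset(h)
            new.append([h[0] + dx, h[1] + dy])
        hyps = new

    def median(vals):
        s = sorted(vals)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

    return [median([h[0] for h in hyps]), median([h[1] for h in hyps])]
```

增大 `n_hyp` 或 `n_iter` 即可在不重训的前提下用计算换精度,这正是文中"inference-time scaling"的含义。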
[CV-45] FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection
【速读】:该论文旨在解决真实场景下生成式AI(Generative AI)伪造图像检测中存在的三个关键问题:退化干扰、特征表征不足以及泛化能力有限。其核心解决方案是提出FeatDistill框架,该框架融合了特征蒸馏(feature distillation)与多专家集成(multi-expert ensemble),采用由CLIP和SigLIP变体组成的四骨干Vision Transformer(ViT)集成结构以捕获互补的取证线索;同时通过扩展训练集并引入全面的退化建模来提升数据覆盖度,使模型适应复杂且多样化的合成伪影与质量变化;进一步采用两阶段训练策略——先进行标准二分类优化,再通过密集特征级自蒸馏实现表征对齐,从而缓解过拟合并增强语义一致性;最终在推理时通过对四个独立训练专家的概率平均得到稳定可靠的决策,显著提升了在未见生成器和复杂退化条件下的鲁棒性与泛化性能。
链接: https://arxiv.org/abs/2603.21939
作者: Zhilin Tu,Kemou Li,Fengpeng Li,Jianwei Fei,Jiamin Zhang,Haiwei Wu
机构: University of Electronic Science and Technology of China (电子科技大学); University of Macau (澳门大学); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); University of Florence (佛罗伦萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 6th place (6/507) technical report at the NTIRE 2026: Robust AI-Generated Image Detection in the Wild Challenge
Abstract:The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse ``in-the-wild’’ conditions, offering an effective and practical solution for real-world deepfake image detection. 
[CV-46] MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation
【速读】: This paper addresses cross-subject attribute misbinding, a common failure in multi-subject image generation: when fine-grained control over several subjects is attempted in a single image, attributes (such as identity, pose, or expression) are assigned to the wrong subject, and conventional metrics struggle to diagnose these failures because they emphasize holistic fidelity or per-subject self-similarity. The key to the solution is the MultiBind benchmark together with a dimension-wise confusion evaluation protocol: MultiBind is built from real multi-person photographs and provides structured subject crops, masks, bounding boxes, background references, and entity-indexed prompts; the evaluation protocol matches generated subjects to ground-truth slots via specialist modules for face identity, appearance, pose, and expression, and subtracts the ground-truth similarity matrix from the slot-to-slot similarity matrix, thereby separating self-degradation from cross-subject interference and exposing interpretable failure patterns (drift, swap, dominance, and blending).
链接: https://arxiv.org/abs/2603.21937
作者: Wenqing Tian,Hanyi Mao,Zhaocheng Liu,Lihua Zhang,Qiang Liu,Jian Wu,Liang Wang
机构: Chinese Academy of Sciences(中国科学院); University of Chinese Academy of Sciences(中国科学院大学); The University of Chicago(芝加哥大学); ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
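The core of the confusion protocol, a slot-to-slot similarity matrix minus its ground-truth counterpart, can be sketched in a few lines (a toy illustration using cosine similarity on hand-made feature vectors; the benchmark's actual specialist extractors for face identity, appearance, pose, and expression are not reproduced here):

```python
import numpy as np

def confusion_matrix(gen_feats, gt_feats):
    """Slot-to-slot similarity between generated and ground-truth subject
    features, minus the ground-truth self-similarity baseline.
    Off-diagonal excess indicates cross-subject interference (swap/blend);
    a diagonal deficit indicates self-degradation (drift)."""
    def cos_sim(A, B):
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return A @ B.T
    S_gen = cos_sim(gen_feats, gt_feats)  # generated slot i vs GT slot j
    S_gt = cos_sim(gt_feats, gt_feats)    # baseline similarity of GT slots
    return S_gen - S_gt

# toy example: two subjects whose generated slots have been swapped
gt = np.array([[1.0, 0.0], [0.0, 1.0]])
gen_swapped = gt[::-1]  # slot 0 and slot 1 exchanged
C = confusion_matrix(gen_swapped, gt)
```

In this toy case the diagonal of `C` is negative (each slot lost its own attributes) while the off-diagonal is positive (attributes leaked to the other slot), which is exactly the "swap" signature the protocol is meant to expose.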
[CV-47] Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment CVPR2026
【速读】: This paper tackles the alignment of independent 3D Gaussian Splatting (3DGS) models under a similarity transformation (rotation, translation, and scale), in particular for different objects of the same category (e.g., different cars), whereas existing methods can only handle models of the same object and often require the true scale to be known. The key to the solution is Gaussian Splatting Alignment (GSA): it leverages viewpoint-guided spherical map features to obtain robust correspondences and adopts a two-step optimization framework, first coarse alignment via an iterative feature-guided absolute orientation solver (highly robust to poor initialization), then fine alignment enforcing multi-view feature consistency (inspired by inverse radiance-field formulations). GSA thereby clearly outperforms existing methods in both same-object and category-level different-object settings, providing the first effective solution for category-level 3DGS model alignment.
链接: https://arxiv.org/abs/2603.21936
作者: Roy Amoyal,Oren Freifeld,Chaim Baskin
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: this https URL
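The coarse step builds on an absolute orientation solver that also estimates scale. A minimal, non-iterative sketch of such a solver (the classic Umeyama-style closed form; GSA's feature-guided iteration and spherical-map correspondences are not reproduced here) is:

```python
import numpy as np

def similarity_transform(src, dst):
    """Closed-form absolute orientation: find scale s, rotation R, and
    translation t minimizing ||s * R @ src_i + t - dst_i||^2 over
    corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    Xs, Xd = src - mu_s, dst - mu_d
    cov = Xd.T @ Xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # guard against reflections
    R = U @ S @ Vt
    var_s = (Xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# recover a known 2x scale gap plus rotation from noiseless correspondences
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
theta = np.pi / 3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = similarity_transform(src, dst)
```

With exact correspondences the solver recovers scale, rotation, and translation up to floating-point error; GSA's contribution is obtaining such correspondences robustly between two different objects and refining the result with multi-view feature consistency.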
[CV-48] Chronological Contrastive Learning: Few-Shot Progression Assessment in Irreversible Diseases
【速读】: This paper addresses the high cost, long turnaround, and inter-reader variability of disease severity scoring in medical imaging, while exploiting the large volume of unlabeled longitudinal scans available in clinical archives. The key to the proposed contrastive learning approach, ChronoCon, is replacing conventional label-distance-based ranking losses with ranking signals derived solely from the visitation order of a patient's longitudinal scans, so that, under the clinically plausible assumption of monotonic progression in irreversible diseases, disease-relevant representations can be learned without any expert annotations. This generalizes Rank-N-Contrast from label distances to temporal ordering and substantially improves performance in low-label settings: fine-tuning on expert scores from only five patients already achieves an intraclass correlation coefficient (ICC) of 86%, demonstrating the potential of routinely available imaging metadata to reduce annotation requirements.
链接: https://arxiv.org/abs/2603.21935
作者: Clemens Watzenböck,Daniel Aletaha,Michaël Deman,Thomas Deimel,Jana Eder,Ivana Janickova,Robert Janiczek,Peter Mandl,Philipp Seeböck,Gabriela Supp,Paul Weiser,Georg Langs
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for MIDL 2026; Reviews available at this https URL
Abstract:Quantitative disease severity scoring in medical imaging is costly, time-consuming, and subject to inter-reader variability. At the same time, clinical archives contain far more longitudinal imaging data than expert-annotated severity scores. Existing self-supervised methods typically ignore this chronological structure. We introduce ChronoCon, a contrastive learning approach that replaces label-based ranking losses with rankings derived solely from the visitation order of a patient’s longitudinal scans. Under the clinically plausible assumption of monotonic progression in irreversible diseases, the method learns disease-relevant representations without using any expert labels. This generalizes the idea of Rank-N-Contrast from label distances to temporal ordering. Evaluated on rheumatoid arthritis radiographs for severity assessment, the learned representations substantially improve label efficiency. In low-label settings, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights. In a few-shot learning experiment, fine-tuning ChronoCon on expert scores from only five patients yields an intraclass correlation coefficient of 86% for severity score prediction. These results demonstrate the potential of chronological contrastive learning to exploit routinely available imaging metadata to reduce annotation requirements in the irreversible disease domain. Code is available at this https URL.
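The temporal-ordering supervision can be illustrated with a much simpler pairwise hinge loss than the paper's Rank-N-Contrast generalization (the margin value and the scalar-score setup below are illustrative assumptions, not the paper's actual objective):

```python
import numpy as np

def chronological_ranking_loss(scores, margin=0.1):
    """Pairwise hinge loss using only visit order as supervision: under
    monotonic progression, a later scan of the same patient should score
    at least `margin` higher than any earlier scan.
    scores: model outputs for one patient's scans, in visitation order."""
    loss, n = 0.0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):  # visit j is later than visit i
            loss += max(0.0, margin - (scores[j] - scores[i]))
            n += 1
    return loss / n

well_ordered = np.array([0.1, 0.4, 0.9])  # severity grows with time: no loss
shuffled     = np.array([0.9, 0.1, 0.4])  # order violated: positive loss
```

No expert label appears anywhere: the supervision signal is entirely the chronological index of each scan.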
[CV-49] Camera-Agnostic Pruning of 3D Gaussian Splats via Descriptor-Based Beta Evidence
【速读】: This paper targets efficient, camera-agnostic compression of 3D Gaussian splats: performing effective post-training pruning without relying on camera parameters or view-dependent measures, so as to reduce storage and transmission cost while preserving reconstruction quality. The key to the solution is a one-shot pruning method based on attribute-derived neighbourhood descriptors: a hybrid descriptor framework combining structural and appearance consistency captures intrinsic characteristics of the splat representation; pruning is then cast as a statistical evidence estimation problem, and a Beta evidence model quantifies per-splat reliability through a probabilistic confidence score, enabling adaptive, view-free pruning decisions.
链接: https://arxiv.org/abs/2603.21933
作者: Peter Fasogbon,Ugurcan Budak,Patrice Rondao Alface,Hamed Rezazadegan Tavakoli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 3 figures, 2 tables
Abstract:The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.
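The Beta evidence idea, scoring each splat by a posterior over how often its neighbourhood descriptors support versus contradict it, can be sketched as follows (the support/conflict counts and the uniform Beta(1,1) prior are illustrative assumptions; the paper's exact evidence model is not reproduced):

```python
import numpy as np

def beta_confidence(support, conflict, alpha0=1.0, beta0=1.0):
    """Per-splat reliability as the posterior mean of a Beta distribution:
    `support` / `conflict` count neighbourhood descriptors that agree /
    disagree with the splat's own attributes. Higher mean = more reliable."""
    return (alpha0 + support) / (alpha0 + beta0 + support + conflict)

def prune_mask(support, conflict, keep_ratio=0.7):
    """Keep the most reliable fraction of splats (no camera needed)."""
    conf = beta_confidence(support, conflict)
    thresh = np.quantile(conf, 1.0 - keep_ratio)
    return conf >= thresh

# toy counts for four splats: strong support, strong conflict,
# balanced evidence, and no evidence at all (falls back to the prior mean)
support = np.array([9, 1, 5, 0])
conflict = np.array([1, 9, 5, 0])
conf = beta_confidence(support, conflict)
keep = prune_mask(support, conflict, keep_ratio=0.7)
```

Note how the splat with no evidence receives the prior mean 0.5 rather than an extreme score, which is the main appeal of the Beta formulation over a raw agreement ratio.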
[CV-50] SatGeo-NeRF: Geometrically Regularized NeRF for Satellite Imagery
【速读】: This paper aims to mitigate the overfitting-induced geometric artifacts observed in current NeRF-based satellite imagery reconstruction models. The key to the solution lies in three model-agnostic regularizers: Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, and couples adjacent rays through a surface approximation to strengthen cross-ray gradient propagation; Granularity Regularization enforces a coarse-to-fine geometry-learning scheme; and Depth-Supervised Regularization stabilizes early training, improving geometric accuracy. Together, these regularizers reduce the Mean Altitude Error on the DFC2019 satellite reconstruction benchmark by 13.9% and 11.7% relative to the EO-NeRF and EO-GS baselines, respectively.
链接: https://arxiv.org/abs/2603.21931
作者: Valentin Wagner,Sebastian Bullinger,Michael Arens,Rainer Stiefelhagen
机构: Fraunhofer IOSB (弗劳恩霍夫IOSB); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the ISPRS Congress 2026
Abstract:We present SatGeo-NeRF, a geometrically regularized NeRF for satellite imagery that mitigates overfitting-induced geometric artifacts observed in current state-of-the-art models using three model-agnostic regularizers. Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, coupling adjacent rays via a corresponding surface approximation to facilitate cross-ray gradient flow. Granularity Regularization enforces a coarse-to-fine geometry-learning scheme, and Depth-Supervised Regularization stabilizes early training for improved geometric accuracy. On the DFC2019 satellite reconstruction benchmark, SatGeo-NeRF improves the Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines such as EO-NeRF and EO-GS.
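Gravity-Aligned Planarity Regularization can be illustrated by a simple alignment penalty between approximated surface normals and the gravity axis (a minimal sketch; the paper's exact formulation, ray coupling, and weighting are assumptions not reproduced here):

```python
import numpy as np

def gravity_planarity_loss(normals, gravity=np.array([0.0, 0.0, 1.0])):
    """Penalize deviation of depth-inferred surface normals from the
    gravity axis: 1 - |cos(angle)|, averaged over rays. The absolute
    value treats up- and down-facing normals as equally aligned."""
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    cos = np.abs(n @ gravity)
    return float(np.mean(1.0 - cos))

flat = np.tile([0.0, 0.0, 1.0], (4, 1))    # horizontal patch: zero penalty
tilted = np.tile([1.0, 0.0, 1.0], (4, 1))  # 45-degree facet: penalized
```

Minimizing this term pushes locally estimated surfaces toward horizontal planes, which matches the dominant geometry of rooftops and ground in satellite scenes.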
[CV-51] The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation CVPR2026
【速读】: This paper addresses the efficiency-generalization trade-off in Continual Test-Time Adaptation (CTTA): existing methods improve adaptation to distribution shift by updating more parameters, which severely degrades online inference efficiency. To enable efficient and stable adaptation, the authors introduce the notion of a "golden subspace": in a single-step adaptation setting there exists a minimal feature-update subspace, shown mathematically to coincide with the row space of the pretrained classifier. The key to the solution is using the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining, so that this subspace can be maintained dynamically; on top of it, the Guided Online Low-rank Directional adaptation (GOLD) framework projects features onto the golden subspace via a lightweight adapter and learns a compact scaling vector, enabling efficient online adjustment along low-rank directions. Experiments on classification and segmentation tasks (including autonomous-driving scenarios) show that GOLD attains superior efficiency, stability, and overall performance.
链接: https://arxiv.org/abs/2603.21928
作者: Guannan Lai,Da-Wei Zhou,Zhenguo Li,Han-Jia Ye
机构: Nanjing University (南京大学); Hong Kong University of Science and Technology (香港科技大学); Frontier Robotics (前沿机器人)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026
Abstract:Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at this https URL.
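The two key quantities, the sample-wise AGOP and the projector onto the classifier's row space (the "golden subspace"), can be sketched directly (a minimal illustration; GOLD's adapter and online update rule are not reproduced):

```python
import numpy as np

def agop(grads):
    """Sample-wise Average Gradient Outer Product: (1/n) sum_i g_i g_i^T.
    Its dominant eigenvectors estimate the directions the loss is
    sensitive to, serving as a retraining-free proxy for the classifier.
    grads: (n_samples, feat_dim) per-sample feature gradients."""
    return grads.T @ grads / len(grads)

def row_space_projector(W):
    """Orthogonal projector onto the row space of classifier W:
    P = W^T (W W^T)^{-1} W. Features are adapted only inside this
    subspace, leaving the orthogonal complement untouched."""
    return W.T @ np.linalg.solve(W @ W.T, W)

W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])  # 2-class head over 3-dim features
P = row_space_projector(W)
feat = np.array([3.0, -2.0, 5.0])
proj = P @ feat                  # component relevant to the classifier

rng = np.random.default_rng(0)
M = agop(rng.normal(size=(50, 3)))  # symmetric PSD by construction
```

Here the third feature dimension lies outside the classifier's row space, so the projection discards it: only the two directions the head can actually use are adapted.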
[CV-52] A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing
【速读】: This paper addresses the high computational cost of traditional radiative transfer models for generating synthetic hyperspectral images (HSI), which are typically limited to spectrum-level outputs. The key to the solution is a latent-representation-based hyperspectral emulation framework that learns a latent generative representation of hyperspectral data, enabling both spectrum-level and joint spatial-spectral emulation; it supports either direct one-step training or a two-step strategy (variational autoencoder (VAE) pretraining followed by parameter-to-latent interpolation). The framework clearly outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability, while preserving downstream biophysical parameter retrieval performance, confirming the practical value of the emulated data for remote sensing applications.
链接: https://arxiv.org/abs/2603.21911
作者: Chedly Ben Azizi,Claire Guilloteau,Gilles Roussel,Matthieu Puigt
机构: Univ. Littoral Côte d’Opale (滨海大学); LISIC – UR 4491 (信息与系统科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.
[CV-53] SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
【速读】: This paper tackles two core problems of Unsupervised Domain Adaptation (UDA) for medical image segmentation: existing methods typically perform semantically unaware feature alignment, yielding poor distributional fidelity, and their pseudo-label validation ignores global anatomical constraints, failing to prevent globally implausible structures. The key to the proposed SHAPE framework lies in a Hierarchical Feature Modulation (HFM) module built on a DINOv3 foundation, which produces high-fidelity, class-aware feature representations, and in Hypergraph Plausibility Estimation (HPE), which uses hypergraph modeling to capture the global anatomical plausibility that standard graphs cannot express; a complementary Structural Anomaly Pruning (SAP) mechanism further purges residual artifacts via multi-view consistency. Together these designs substantially improve the performance and anatomical plausibility of cross-modality medical image segmentation.
链接: https://arxiv.org/abs/2603.21904
作者: Linkuan Zhou,Yinghao Xia,Yufei Shen,Xiangyu Li,Wenjie Du,Cong Cong,Leyi Wei,Ran Su,Qiangguo Jin
机构: Northwestern Polytechnical University (西北工业大学); Harbin Institute of Technology (哈尔滨工业大学); USTC (中国科学技术大学); Macquarie University (麦考瑞大学); Macao Polytechnic University (澳门理工大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI-CT) and 78.51% (CT-MRI) on cardiac data, and 87.48% (MRI-CT) and 86.89% (CT-MRI) on abdominal data. The code is available at this https URL.
[CV-54] CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
【速读】: This paper addresses the limitation that existing diffusion-based video subtitle removal methods require explicit mask sequences during both training and inference, restricting practical deployment flexibility. The key to the solution is CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free end-to-end framework with a two-stage design: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, and Stage II applies LoRA (Low-Rank Adaptation) with generation feedback for dynamic context adjustment, achieving robust subtitle removal without ground-truth masks. The method trains only 0.77% of the base diffusion model's parameters, clearly outperforms mask-dependent baselines on Chinese benchmarks (+6.77 dB PSNR, -74.7% VFID), and shows zero-shot generalization across six languages.
链接: https://arxiv.org/abs/2603.21901
作者: Qingdong He,Chaoyi Wang,Peng Tang,Yifan Yang,Xiaobin Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by + 6.77dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
[CV-55] Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation
【速读】: This paper addresses the lack of principled rank selection when fine-tuning pretrained diffusion models with Low-Rank Adaptation (LoRA) for personalized image generation. Current practice fixes a uniform rank to keep the search tractable, but this ignores the varying complexity of personalized subjects and fails to achieve an optimal trade-off between performance and memory consumption. The key to the proposed LoRA² method is a mechanism that imposes an ordering of importance on the rank positions, letting each layer's rank adapt freely during fine-tuning so that higher ranks are created only when strictly needed, thereby markedly improving performance on metrics such as DINO, CLIP-I, and CLIP-T while keeping memory usage and rank low.
链接: https://arxiv.org/abs/2603.21884
作者: Donald Shenaj,Federico Errica,Antonio Carta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community’s consensus, regardless of the personalized subject’s complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank’s positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA ^2 , achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: this https URL.
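The ordering-of-importance idea can be sketched with cumulative gates over rank positions (the sigmoid-cumprod gating below is an illustrative assumption inspired by the description of variational-width methods, not the paper's exact mechanism):

```python
import numpy as np

def ordered_rank_gates(logits):
    """Impose an ordering of importance on rank positions: gate r can be
    open only if every earlier gate is open, via a cumulative product of
    sigmoids. Once a gate closes, all later ranks are suppressed, so
    higher ranks appear only when strictly needed."""
    sig = 1.0 / (1.0 + np.exp(-np.asarray(logits)))
    return np.cumprod(sig)

def effective_lora_delta(A, B, gates):
    """Soft-rank LoRA update: B @ diag(gates) @ A instead of B @ A."""
    return B @ np.diag(gates) @ A

logits = np.array([4.0, 4.0, -4.0, 2.0])  # third gate nearly closed
g = ordered_rank_gates(logits)

A = np.ones((4, 3))  # (rank, in_dim) down-projection, toy values
B = np.ones((2, 4))  # (out_dim, rank) up-projection, toy values
delta = effective_lora_delta(A, B, g)
```

Although the fourth logit is large, its effective gate stays closed because the third gate already shut: the gating sequence is monotonically non-increasing, which is what makes the learned rank interpretable as a per-layer width.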
[CV-56] Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline
【速读】: This paper targets improving the accuracy of Digital Surface Model (DSM) generation from satellite imagery, where classical cost-volume-based stereo matching hits a performance ceiling in complex scenes. The core challenge is that learning-based stereo matchers, despite excelling on standard benchmarks, are hard to integrate directly into satellite pipelines because of differences in viewing geometry and disparity assumptions. The key to the solution is integrating several modern learning-based stereo matchers (StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned MonSter variant) into the Satellite Stereo Pipeline (S2P), reworking the rectification stage to enforce compatible disparity polarity and range. Experiments show clear gains in DSM accuracy over classical approaches, with visibly improved structural detail and geometric fidelity, although challenging surface types such as vegetation remain limited, highlighting open challenges for learning-based stereo in natural environments.
链接: https://arxiv.org/abs/2603.21882
作者: Elías Masquil,Thibaud Ehret,Pablo Musé,Gabriele Facciolo
机构: IIE, Facultad de Ingeniería, Universidad de la República, Uruguay; AMIAD, Pôle Recherche, France; Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190, Gif-sur-Yvette, France; Institut Universitaire de France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IGARSS 2026
Abstract:Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.
[CV-57] Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems
【速读】: This paper addresses the vulnerability of infrared pedestrian detectors to physical-world adversarial attacks: existing methods rely on instance-specific online optimization and rigid pattern designs, leading to high deployment cost and insufficient physical robustness. The key to the solution is the Universal Physical Patch Attack (UPPA), the first universal physical attack in the infrared domain, whose core innovation is modeling perturbations with geometrically constrained parameterized Bezier blocks and performing unified optimization over the global data distribution with the Particle Swarm Optimization (PSO) algorithm, preserving topological stability under dynamic deformations; the optimized digital perturbations are then materialized as physical cold patches that produce a continuous, smooth low-temperature distribution naturally matching the thermal radiation characteristics of infrared imaging, markedly improving attack success rate and cross-domain transferability.
链接: https://arxiv.org/abs/2603.21876
作者: Chengyin Hu,Yikun Guo,Yuxian Dong,Qike Zhang,Kalibinuer Tiliwalidi,Yiwei Wei,Haitao Shi,Jiujiang Guo,Jiahuan Long,Xiang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bezier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.
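The optimization loop is standard PSO over a low-dimensional patch parameterization. A minimal sketch with a toy stand-in objective (the real objective, a detector's confidence over rendered Bezier patches, is not reproduced here; all hyperparameter values are illustrative):

```python
import numpy as np

def pso_minimize(f, dim, n_particles=20, iters=60, seed=0,
                 w=0.7, c1=1.5, c2=1.5, lo=-1.0, hi=1.0):
    """Minimal particle swarm optimizer over patch parameters (e.g.
    Bezier control points). f is a black-box objective to minimize,
    standing in for the detector's output confidence."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)  # keep parameters in a printable range
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return g, pbest_f.min()

# toy stand-in objective with its optimum at the origin
best, best_f = pso_minimize(lambda p: float(np.sum(p ** 2)), dim=4)
```

Because PSO only needs objective values, the same loop works against black-box detectors, which is what enables the transferability claims: nothing in the optimizer depends on model gradients.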
[CV-58] Manifold-Aware Exploration for Reinforcement Learning in Video Generation
【速读】: This paper studies the limited reliability of Group Relative Policy Optimization (GRPO) for video generation, whose core challenges are the complex solution space of video generation and the excess noise injected by the ODE-to-SDE conversion, which lowers rollout quality, makes reward estimates unreliable, and destabilizes the post-training alignment stage. The key to the solution is viewing the pretrained model as defining a valid video data manifold and proposing SAGE-GRPO (Stable Alignment via Exploration), which imposes constraints at both the micro and macro levels: at the micro level, a manifold-aware SDE with a logarithmic curvature correction and a gradient norm equalizer stabilize sampling and updates across timesteps; at the macro level, a dual trust region with a periodic moving anchor and stepwise constraints keeps the trust region tracking checkpoints closer to the manifold and suppresses long-horizon drift. Experiments on HunyuanVideo1.5 show that the method clearly outperforms existing approaches, with consistent gains in video quality (VQ), multimodal quality (MQ), temporal consistency (TA), and visual metrics (CLIPScore, PickScore).
链接: https://arxiv.org/abs/2603.21872
作者: Mingzhe Zheng,Weijie Kong,Yue Wu,Dengyang Jiang,Yue Ma,Xuanhua He,Bin Lin,Kaixiong Gong,Zhao Zhong,Liefeng Bo,Qifeng Chen,Harry Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 12 figures
Abstract:Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at this https URL.
[CV-59] Adversarial Camouflage
【速读】: This paper addresses the privacy-leakage and mass-surveillance risks raised by the widespread deployment of facial recognition algorithms. The key to the solution is a novel method called Adversarial Camouflage, which generates patterns in a low-dimensional parameter space (color, shape, angle) for the physical world and projects them onto semantically valid facial regions, maximizing recognition error across multiple face recognition models; this ensures high transferability of the attack even against black-box systems, significantly degrading the performance of state-of-the-art face recognition models, with effectiveness further validated in real-world human experiments.
链接: https://arxiv.org/abs/2603.21867
作者: Paweł Borsukiewicz,Daniele Lunghi,Melissa Tessa,Jacques Klein,Tegawendé F. Bissyandé
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 5 tables
Abstract:While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce \textitAdversarial Camouflage as a novel solution for protecting users’ privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.
[CV-60] Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
【速读】: This paper addresses the problems caused by directly reusing image distillation techniques for distilling video diffusion models, such as oversaturation, temporal inconsistency, and mode collapse. The key to the solution is a distillation framework designed specifically for video diffusion models, whose core innovations are: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to mitigate artifacts caused by distribution shift; (2) a temporal regularization loss that counteracts temporal collapse and promotes physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Experiments on the VBench and VBench2 benchmarks confirm that the method markedly improves perceptual fidelity and motion realism in few-step video generation.
链接: https://arxiv.org/abs/2603.21864
作者: Yuyang You,Yongzhi Li,Jiahui Li,Yadong Mu,Quan Chen,Peng Jiang
机构: Peking University (北京大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
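A temporal regularization loss of the kind described can be illustrated as a second-order smoothness penalty along the frame axis (a generic sketch under my own assumptions; the paper's actual loss is not specified here):

```python
import numpy as np

def temporal_smoothness_loss(latents):
    """Penalize second-order differences along the frame axis, so a
    constant-velocity trajectory costs nothing while abrupt per-frame
    jumps (temporal collapse) are penalized.
    latents: (frames, dim) per-frame latent trajectory."""
    accel = latents[2:] - 2 * latents[1:-1] + latents[:-2]
    return float(np.mean(accel ** 2))

t = np.linspace(0.0, 1.0, 8)[:, None]
linear = t @ np.ones((1, 4))  # smooth constant-velocity trajectory
jumpy = linear.copy()
jumpy[4] += 1.0               # one abrupt frame breaks smoothness
```

The penalty leaves gradual motion untouched but sharply punishes single-frame discontinuities, which is the qualitative behavior a temporal regularizer for few-step sampling needs.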
[CV-61] Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning
【速读】: This paper aims to clarify the poorly understood relationship between generative deep learning and traditional theoretical frameworks for modeling the tropical Madden-Julian Oscillation (MJO). The key to the solution is a video diffusion model trained on atmospheric reanalysis data that generates long MJO sequences conditioned on low-dimensional metrics, capturing key MJO features such as composites, power spectra, and multiscale structures including convectively coupled waves. The model can further be prompted with intentionally idealized low-dimensional conditionings (e.g., a perpetual MJO, or isolated modulation by seasons or the El Nino-Southern Oscillation (ENSO)) to generate tractable MJO scenarios, enabling the underlying processes to be deconstructed and physical drivers identified; this provides a practical framework for bridging low-dimensional MJO theory and high-resolution atmospheric complexity and will help improve tropical atmosphere prediction.
链接: https://arxiv.org/abs/2603.21856
作者: Sulian Thual,Feiyang Cai,Jingjing Wang,Feng Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative Deep Learning is a powerful tool for modeling the Madden-Julian Oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features, including composites, power spectra, and multiscale structures such as convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO, an isolated modulation by seasons and/or the El Niño-Southern Oscillation, and so on. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.
[CV-62] Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation
【速读】:该论文旨在解决冠状动脉从CT血管造影(CTA)图像中准确分割的难题,其核心挑战在于血管结构具有多分支、细长管状形态,且前景血管与背景组织之间存在严重的类别不平衡问题。传统基于卷积神经网络(CNN)的方法难以捕捉空间上远距离血管结构之间的长程依赖关系,而基于视觉Transformer(ViT)的方法则因计算开销过大而不适用于资源受限的临床场景。解决方案的关键在于提出一种两阶段的MDSVM-UNet框架:第一阶段引入多方向蛇形卷积(MDSConv),通过在矢状面、冠状面和轴面三个正交解剖平面内学习自适应偏移量,实现多视角特征融合以精准刻画冠状动脉的蜿蜒几何特性;第二阶段设计基于残差视觉Mamba(RVM)的上采样解码模块,利用选择性状态空间机制建模跨切片长程依赖关系,同时保持线性计算复杂度,从而在保证精度的同时提升部署效率。
链接: https://arxiv.org/abs/2603.21829
作者: Xiaochan Yuan,Pai Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation of coronary arteries from computed tomography angiography (CTA) images is of paramount clinical importance for the diagnosis and treatment planning of cardiovascular diseases. However, coronary artery segmentation remains challenging due to the inherent multi-branching and slender tubular morphology of the vasculature, compounded by severe class imbalance between foreground vessels and background tissue. Conventional convolutional neural network (CNN)-based approaches struggle to capture long-range dependencies among spatially distant vascular structures, while Vision Transformer (ViT)-based methods incur prohibitive computational overhead that hinders deployment in resource-constrained clinical settings. Motivated by the recent success of state space models (SSMs) in efficiently modeling long-range sequential dependencies with linear complexity, we propose MDSVM-UNet, a novel two-stage coronary artery segmentation framework that synergistically integrates multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM). In the encoding stage, we introduce MDSConv, a deformable convolution module that learns adaptive offsets along three orthogonal anatomical planes – sagittal, coronal, and axial – thereby enabling comprehensive multi-view feature fusion that faithfully captures the elongated and tortuous geometry of coronary vessels. In the decoding stage, we design an RVM-based upsampling decoder block that leverages selective state space mechanisms to model inter-slice long-range dependencies while preserving linear computational complexity. Furthermore, we propose a progressive two-stage segmentation strategy: the first stage performs coarse whole-image segmentation to guide intelligent block extraction, while the second stage conducts fine-grained block-level segmentation to recover vascular details and suppress false positives…
[CV-63] SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection CVPR2026
【速读】:该论文旨在解决当前钢铁表面缺陷检测方法依赖仅标签图像数据训练的分类模型所导致的可解释性差和泛化能力弱的问题。解决方案的关键在于构建了一个名为SteelDefectX的视觉-语言数据集,其中包含7,778张图像和25类缺陷,每个样本均配有从粗粒度到细粒度的文本描述:粗粒度层面提供缺陷类别、代表性视觉属性及工业成因信息,细粒度层面则刻画形状、尺寸、深度、位置和对比度等样本特异性属性,从而显著提升模型对缺陷表征的丰富性和准确性。通过在四个任务(纯视觉分类、视觉-语言分类、少样本/零样本识别与零样本迁移)上建立基准测试,实验证明该标注策略有效增强了模型的可解释性、泛化能力和迁移性能。
链接: https://arxiv.org/abs/2603.21824
作者: Shuxian Zhao,Jie Gui,Baosheng Yu,Lu Dong,Zhipeng Gui
机构: Southeast University (东南大学); Purple Mountain Laboratories (紫金山实验室); Nanyang Technological University (南洋理工大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper was submitted to CVPR 2026. A revised version will be updated soon
Abstract:Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available at this https URL.
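粗到细的两级文本标注可以用一个简单的数据结构来组织。下面是一个极简示意(非官方数据格式,所有字段名与拼接模板均为假设),展示如何把类别级(粗粒度)与样本级(细粒度)描述合成一条供视觉-语言模型训练使用的文本:

```python
from dataclasses import dataclass


@dataclass
class CoarseAnnotation:
    """类别级(粗粒度)信息:缺陷类别、代表性视觉属性、工业成因。"""
    category: str
    visual_attributes: list
    industrial_cause: str


@dataclass
class FineAnnotation:
    """样本级(细粒度)属性:形状、尺寸、深度、位置、对比度。"""
    shape: str
    size: str
    depth: str
    position: str
    contrast: str


def build_caption(coarse: CoarseAnnotation, fine: FineAnnotation) -> str:
    """将两级标注拼接为一条文本描述(模板为假设,非论文原文)。"""
    return (f"A {coarse.category} defect caused by {coarse.industrial_cause}; "
            f"{fine.shape} shape, {fine.size}, {fine.depth}, "
            f"located at {fine.position}, {fine.contrast} contrast.")
```

这种分层结构使同一模型既能利用类别共享的先验,又能接触样本特异的细节描述。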
[CV-64] Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion CVPR2026
【速读】:该论文旨在解决红外与可见光图像融合(Infrared and Visible Image Fusion, IVIF)中对严格配对训练数据(Strictly Paired Training Paradigm, SPTP)的高度依赖问题。现有方法通常需要大量严格对齐的图像对进行训练,但此类数据获取成本高、难度大,且限制了跨模态关系的多样性,从而影响模型泛化能力。解决方案的关键在于提出并验证无配对(UnPaired Training Paradigm, UPTP)和任意配对(Arbitrarily Paired Training Paradigm, APTP)训练范式,通过理论建模和实用框架设计,在严重受限且未对齐的数据条件下显著丰富跨模态关联信息,从而实现与SPTP在100倍更大数据集上相当的性能表现,大幅降低数据采集成本并提升模型鲁棒性。
链接: https://arxiv.org/abs/2603.21820
作者: Yanglin Deng,Tianyang Xu,Chunyang Cheng,Hui Li,Xiao-jun Wu,Josef Kittler
机构: Jiangnan University (江南大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100× larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at this https URL.
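APTP 的核心是在训练时打破严格配对约束。下面是一个极简的数据采样示意(假设的加载逻辑,非论文实现):红外池与可见光池相互独立抽取,使可达的跨模态组合数从 n 对扩展到 n_ir × n_vis 种:

```python
import random


def arbitrary_pairs(ir_pool, vis_pool, n_pairs, seed=0):
    """从两个独立的模态池中任意配对,模拟 APTP 的训练采样(示意)。"""
    rng = random.Random(seed)
    for _ in range(n_pairs):
        # 与 SPTP 不同:两侧独立抽取,不要求空间对齐或内容一致
        yield rng.choice(ir_pool), rng.choice(vis_pool)
```

即便两个池的规模都很小,可组合出的跨模态关系数量也远超严格配对能提供的数量,这正是摘要中"丰富跨模态关联"的直观来源。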
[CV-65] Ctrl-A: Control-Driven Online Data Augmentation
【速读】:该论文旨在解决图像视觉任务中数据增强策略设计依赖人工调参的问题,即传统方法需手动设定每种增强操作的强度参数,且难以适应不同任务和训练阶段的变化。解决方案的关键在于提出ControlAugment(Ctrl-A),其核心是引入控制理论中的闭环反馈机制,通过定义“相对操作响应曲线”(relative operation response curves)实现对每种增强操作强度分布的在线动态调整。该机制能够自动识别并抑制对模型性能产生负面影响的增强风格,从而无需预先设定具体增强强度,显著降低了对领域知识和试错调参的依赖,并在CIFAR-10、CIFAR-100和SVHN-core等基准数据集上展现出与当前最优数据增强方法相当甚至更优的性能。
链接: https://arxiv.org/abs/2603.21819
作者: Jesper B. Christensen,Ciaran Bench,Spencer A. Thomas,Hüsnü Aslan,David Balslev-Harder,Nadia A. S. Smith,Alessandra Manzin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 17 pages (11 pages main manuscript), 8 figures (5 in main manuscript)
Abstract:We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
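摘要未给出具体控制律;下面用一个比例控制器做极简示意(增益、目标值、裁剪范围均为假设),说明"根据相对操作响应在线调节单个增强操作的强度、并在响应持续为负时将其压制到零"的思路:

```python
def update_strength(strength, response, target=1.0, gain=0.1, lo=0.0, hi=1.0):
    """单个增强操作强度的一步闭环更新(比例控制,示意)。

    response > target:该操作对训练有益,强度上调;
    response < target:该操作损害性能,强度下调,可被压制至 0。
    """
    error = response - target
    return min(hi, max(lo, strength + gain * error))
```

每个增强操作各自维护一条这样的控制回路,即可在不预设初始强度的情况下,让有害的增强风格随训练逐渐被关闭。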
[CV-66] Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction MICCAI2026
【速读】:该论文旨在解决在无配对多模态数据条件下,如何将脑部磁共振成像(MRI)中蕴含的高血压(HTN)知识有效迁移至眼底照相(fundus imaging)模型以提升HTN预测性能的问题。其核心挑战在于MRI与眼底图像通常来自不同人群(即模态孤岛数据集),缺乏直接的跨模态对应关系。解决方案的关键是提出临床图引导蒸馏(Clinical Graph-Mediated Distillation, CGMD)框架,通过构建一个跨越两个队列的临床相似性kNN图作为结构化桥梁,利用共享生物标志物实现跨模态知识传递:首先训练MRI教师模型,随后在其构建的图结构上传播表示并为眼底患者生成脑信息引导的表示目标,最后通过联合损失函数(包含HTN监督、目标蒸馏和关系蒸馏)训练眼底学生模型。实验证明该方法显著优于标准蒸馏和非图插值基线,且消融实验验证了基于临床先验的图连接性对性能提升的重要性。
链接: https://arxiv.org/abs/2603.21809
作者: Dillan Imans,Phuoc-Nguyen Bui,Duc-Tai Le,Hyunseung Choo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures, 2 tables. Under review at MICCAI 2026
Abstract:Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at this https URL.
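以共享生物标志物为桥接的 kNN 图插补可以用几行 numpy 表达。下面是示意草图(距离度量与聚合方式为假设;论文框架还包含 HTN 监督与关系蒸馏等环节,这里仅演示跨队列目标插补这一步):

```python
import numpy as np


def impute_targets(bio_mri, emb_mri, bio_fundus, k=2):
    """为眼底队列插补"脑信息"表示目标(示意)。

    bio_mri:    (n_mri, d_bio)  MRI 队列的结构化生物标志物
    emb_mri:    (n_mri, d_emb)  MRI 教师模型的表示
    bio_fundus: (n_fun, d_bio)  眼底队列(与 MRI 队列无交集)的生物标志物
    返回 (n_fun, d_emb):每个眼底患者取其在生物标志物空间中
    k 个最近 MRI 患者教师表示的均值,作为蒸馏目标。
    """
    # 生物标志物空间中的两两欧氏距离
    d = np.linalg.norm(bio_fundus[:, None, :] - bio_mri[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]   # 每个眼底患者的 k 近邻 MRI 患者
    return emb_mri[nn].mean(axis=1)     # 沿图传播并平均教师表示
```

眼底学生模型随后以这些插补目标做表示蒸馏,即可在完全无配对数据的前提下获得 MRI 侧的知识。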
[CV-67] Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment
【速读】:该论文旨在解决中文普通话视觉语音识别(Chinese Mandarin Visual Speech Recognition, VSR)中因语言的声调特性导致的传统序列到序列建模方法性能受限的问题,以及现有级联架构中因阶段间依赖带来的误差累积和推理延迟增加的问题。解决方案的关键在于提出一种无级联的多任务学习架构,通过联合集成多种中间表示(如音素和视觉发音单元 viseme),并引入语义引导的局部对比损失(semantic-guided local contrastive loss),实现特征的时间对齐与按需激活,从而在推理效率与识别性能之间取得平衡,并有效缓解投影与重新嵌入引发的误差传播问题。
链接: https://arxiv.org/abs/2603.21808
作者: Lei Yang,Yi He,Fei Wu,Shilin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Chinese Mandarin visual speech recognition (VSR) has advanced in recent years, yet still lags behind the performance achieved on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.
[CV-68] Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition
【速读】:该论文旨在解决磁共振成像(MRI)数据采集速度慢的问题,从而提升临床效率并减少患者不适。其核心挑战在于如何在大幅降低采样率的前提下实现高质量图像重建。解决方案的关键在于提出一种基于预训练医学图像分词器(medical image tokenizer)和潜在空间Transformer的主动采样框架,通过量化视觉token构建潜在空间的概率分布,并利用token熵作为不确定性度量来指导采样策略:一方面采用潜熵选择(Latent Entropy Selection, LES)将局部token熵映射至k空间以识别信息量大的采样线;另一方面引入梯度驱动的熵优化(Gradient-based Entropy Optimization, GEO),通过总潜熵损失的k空间梯度寻找不确定性最大降低区域,从而实现更高效的自适应采样。
链接: https://arxiv.org/abs/2603.21806
作者: Lev Ayzenberg,Shady Abu-Hussein,Raja Giryes,Hayit Greenspan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the k-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the k-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI singlecoil Knee and Brain datasets at ×8 and ×16 acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics and feature-based distances. Our code is available at this https URL.
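潜空间 token 熵只依赖潜 Transformer 给出的码本概率分布,计算本身很简单。下面是一个极简示意(LES 的 k 空间投影在此简化为按 patch 列求和;真实映射关系更复杂,此处仅演示"熵高的位置优先采样"这一思路):

```python
import numpy as np


def token_entropy(probs, eps=1e-12):
    """对量化视觉 token 的预测分布逐位置求熵;probs 形状 (..., vocab)。"""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)


def select_lines(patch_entropy, n_lines):
    """LES 式示意:将每列 patch 熵求和作为对应 k 空间线的得分,
    选取得分最高的 n_lines 条线作为下一步采样目标。"""
    scores = patch_entropy.sum(axis=0)
    return np.argsort(scores)[::-1][:n_lines]
```

均匀分布(模型完全不确定)给出最大熵 log(vocab),确定性分布给出零熵,因此该得分天然度量了"再采样一条线能消除多少不确定性"。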
[CV-69] ming In stand-up Comedy: Text Audio Laughter Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing
【速读】:该论文旨在解决传统幽默研究过度依赖言语内容而忽视表演者身体表现与观众反应动态交互的问题。其核心挑战在于如何系统性地捕捉并量化现场脱口秀中语言、动作与观众笑声之间的多模态时序关联。解决方案的关键在于构建TIC-TALK这一大规模多模态资源,通过融合BERTopic主题分割、Whisper-AT笑声检测、YOLOv8-cls镜头分类与YOLOv8s-pose姿态关键点提取,实现对5,400+秒级话题片段的精确时空对齐,同时保留原始17关节骨骼坐标以计算连续运动学信号(如臂展、动能和躯干倾斜),从而为表演动态提供可量化的代理指标。
链接: https://arxiv.org/abs/2603.21803
作者: Yaelle Zribi(ENC),Florian Cafiero(ENC, LRE),Vincent Lépinay,Chahan Vidal-Gorène(CJM, LIPN)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
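摘要中的运动学信号可直接由原始 17 关节坐标计算。下面是臂展与动能的示意实现(关节索引采用 COCO 约定是假设;动能在此定义为相邻帧关节位移平方和的均值,与论文的无量纲代理量一致的具体公式未公开):

```python
import numpy as np

LEFT_WRIST, RIGHT_WRIST = 9, 10  # COCO 17 关节索引(假设)


def arm_spread(frame):
    """单帧 (17, 2) 关键点的左右手腕距离,作为臂展。"""
    return float(np.linalg.norm(frame[LEFT_WRIST] - frame[RIGHT_WRIST]))


def kinetic_energy(frames):
    """相邻帧(1 fps)关节位移平方和的均值,作为动能代理。frames: (T, 17, 2)。"""
    diffs = np.diff(frames, axis=0)            # (T-1, 17, 2)
    return float((diffs ** 2).sum(axis=(1, 2)).mean())
```

将每个话题片段的动能与该片段的笑声率做皮尔逊相关,即可复现摘要中 r = -0.75 这类"punchline 前静止"的统计分析。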
[CV-70] Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent
【速读】:该论文旨在解决事件相机(Event Camera)在工业机器人场景中进行多类目标检测时,如何有效利用时间记忆以提升检测性能的问题。其核心挑战在于现有研究多集中于户外驾驶场景或有限类别设置,缺乏对工业环境中复杂动态和多类识别的系统性评估。解决方案的关键在于引入循环结构的ReYOLOv8s模型,并通过对比非循环YOLOv8s基线分析时间记忆的作用;同时探索不同预训练策略(如GEN1和PEDRo初始化)对模型性能的影响,发现事件域预训练显著优于从零开始训练,且随着片段长度增加性能持续提升,表明合适的预训练可增强模型对工业场景中长期时序依赖的建模能力。
链接: https://arxiv.org/abs/2603.21787
作者: Lokeshwaran Manohar,Moritz Roidl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the nonrecurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
[CV-71] he Universal Normal Embedding CVPR2026
【速读】:该论文试图解决生成模型(Generative Models)与视觉编码器(Vision Encoders)虽在不同目标下独立发展,但其潜在空间均表现出高斯特性(Latent Space Gaussianity)这一现象背后的统一性问题。研究者提出“通用正态嵌入”(Universal Normal Embedding, UNE)假设:生成模型将高斯噪声映射为图像,而编码器将图像映射为语义嵌入,二者本质上是同一近似高斯潜空间的线性投影。解决方案的关键在于构建NoiseZoo数据集,包含每张图像对应的DDIM反演扩散噪声和匹配的CLIP/DINO编码表示,并通过线性探测验证两类潜空间中存在对齐的语义方向,从而支持UNE假设。进一步地,这些线性方向可用于无需架构修改的可控编辑(如表情、性别、年龄),并通过简单正交化处理缓解伪关联问题,揭示了生成与编码共享的高斯潜几何结构。
链接: https://arxiv.org/abs/2603.21786
作者: Chen Tasker,Roy Betser,Eyal Gofer,Meir Yossef Levi,Guy Gilboa
机构: Technion - Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to CVPR 2026
Abstract:Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at this https URL
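摘要提到"简单的正交化可缓解伪关联",即把目标属性方向中与混淆属性方向平行的分量投影去除。下面是 Gram-Schmidt 式投影的极简示意(属性方向在论文中来自线性探测,这里以任意向量代替):

```python
import numpy as np


def orthogonalize(direction, spurious):
    """从目标编辑方向中去除各混淆方向的分量后归一化(示意)。

    direction: (d,) 线性探测得到的目标属性方向(如 smile)
    spurious:  若干 (d,) 需要剔除的混淆方向(如 gender、age)
    """
    v = direction.astype(float).copy()
    for s in spurious:
        s = s / np.linalg.norm(s)
        v = v - (v @ s) * s   # 投影去除与混淆方向平行的分量
    return v / np.linalg.norm(v)
```

正交化后的方向沿混淆属性的分量为零,沿该方向平移潜变量时便不会同时改变被剔除的属性。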
[CV-72] Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends
【速读】:该论文旨在解决资源受限的自主机器人在运行稀疏直接/半直接视觉-惯性里程计(Visual-Inertial Odometry, VIO)系统时,因前端参数(如特征检测与跟踪参数)依赖人工调参且固定不变,导致在不同场景下(如纹理密度、光照变化、运动模糊等)性能不稳定的问题。解决方案的关键在于提出首个图像条件的强化学习(Reinforcement Learning, RL)框架,将前端参数配置建模为序列决策问题,通过一个轻量级纹理感知CNN编码器和特权评论家(privileged critic)训练策略网络,使系统能够根据当前图像内容在线自适应调整参数,从而在不依赖内部VO统计信息的前提下,提前优化特征检测与跟踪过程。实验表明,该方法在TartanAirV2和TUM RGB-D数据集上实现了3倍更长的特征轨迹和3倍更低的计算成本,且训练完全在仿真环境中完成。
链接: https://arxiv.org/abs/2603.21785
作者: Simone Nascivera,Leonard Bauersfeld,Jeff Delaune,Davide Scaramuzza
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.
[CV-73] Dynamic Exposure Burst Image Restoration
【速读】:该论文旨在解决传统突发图像恢复(burst image restoration)中因手动设定曝光参数导致重建质量受限的问题,尤其是缺乏针对拍摄环境动态优化曝光时间的机制。解决方案的关键在于提出了一种动态曝光突发图像恢复(Dynamic Exposure Burst Image Restoration, DEBIR)新框架,其核心是引入了Burst Auto-Exposure Network (BAENet),该网络基于预览图像、运动幅度和增益信息动态预测每张突发图像的最佳曝光时间,随后由图像恢复网络利用这些优化后的曝光图像重建高质量输出。通过可微分的突发模拟器与三阶段训练策略,实现了端到端的性能提升,在真实相机系统上也验证了方法的有效性与实用性。
链接: https://arxiv.org/abs/2603.21784
作者: Woohyeok Kim,Jaesung Rim,Daeyeon Kim,Sunghyun Cho
机构: POSTECH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.
[CV-74] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
【速读】:该论文旨在解决遥感(Remote Sensing, RS)图像生成中两个关键问题:一是缺乏针对遥感领域的专用生成先验模型,二是高分辨率训练成本过高导致难以满足RS应用需求。为此,作者提出两项核心解决方案:首先,基于FLUX模型在超过10万张精选遥感图像上进行微调,构建出强域适应性的生成先验(RS-FLUX);其次,提出一种无需训练的分辨率提升方法SHARP(Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion),其关键在于引入一个与扩散去噪过程相适应的动态分数时间调度函数 $ k_{\text{rs}}(t) $,通过在早期布局形成阶段施加强位置促进、后期细节恢复阶段逐步弱化,实现频谱感知的动态位置编码调整,从而更有效地保留遥感图像中的高频结构信息(如车辆、建筑轮廓和道路标记)。该方法具有分辨率无关性,仅需一组超参数即可支持多尺度生成,且在CLIP Score、Aesthetic Score和HPSv2等指标上显著优于所有无训练基线方法。
链接: https://arxiv.org/abs/2603.21783
作者: Bingxuan Zhao,Qing Zhou,Chuang Yang,Qi Wang
机构: Northwestern Polytechnical University (西北工业大学); School of Artificial Intelligence, OPtics and ElectroNics (iOPEN) (人工智能学院,光电与信息工程学院(iOPEN))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at this https URL.
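摘要未给出 k_rs(t) 的具体表达式;下面用一个有理分式插值作示意(函数形式纯属假设),满足"早期布局形成阶段强位置外推、细节恢复阶段松弛回 1"的性质:

```python
def k_rs(t, k_max, alpha=2.0):
    """示意性的有理分式时间调度(非论文原式)。

    t 从 1(去噪开始,布局形成)递减到 0(细节恢复):
    t=1 时返回 k_max(最强的 RoPE 位置外推),
    t=0 时返回 1.0(恢复原始 RoPE 尺度),中间平滑过渡。
    """
    num = t ** alpha
    return 1.0 + (k_max - 1.0) * num / (num + (1.0 - t) ** alpha)
```

将该因子乘到 RoPE 的位置索引缩放上,即可让外推强度随去噪进程"由强到弱",与扩散去噪频率递进的特性对齐;k_max 可由目标分辨率与训练分辨率之比决定,因此同一组超参数可覆盖多种分辨率。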
[CV-75] Lets Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts AAAI2026
【速读】:该论文旨在解决当前交织模态思维链(Interleaved-modal Chain-of-Thought, ICoT)方法中存在的两大问题:一是静态视觉思维定位(Static Visual Thought Positioning),即在固定步骤插入视觉信息,导致推理效率低下且缺乏灵活性;二是断裂的视觉思维表征(Broken Visual Thought Representation),表现为视觉标记不连续且语义不连贯。解决方案的关键在于提出一种动态且精确的视觉思维机制——DaP-ICoT,其包含两个核心组件:(1)动态视觉思维整合(Dynamic Visual Thought Integration),根据推理需求自适应地引入视觉输入,减少冗余并提升效率;(2)精确视觉思维引导(Precise Visual Thought Guidance),确保视觉表征语义一致且与上下文对齐。实验表明,DaP-ICoT在多个基准测试中达到最先进性能,并将插入图像数量减少72.6%,显著降低token消耗,从而实现更高效的ICoT推理。
链接: https://arxiv.org/abs/2603.21754
作者: Xu Liu,Yongheng Zhang,Qiguang Chen,Yao Li,Sheng Wang,Libo Qin
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Tongyi Lab (通义实验室); 4. SenseTime (商汤科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026
Abstract:Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. (2) Precise Visual Thought Guidance ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.
[CV-76] Getting to the Point: Why Pointing Improves LVLMs
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在零样本计数任务中的泛化能力不足以及中间预测点(即坐标)作为视觉解释的可靠性问题。其解决方案的关键在于引入“先指再计数”(Point-then-Count)范式,通过显式地生成目标物体的坐标并基于这些空间信息进行计数,从而增强模型对新场景的泛化能力,并揭示坐标信息在提升计数性能中的机制性作用。实验表明,该方法相比直接计数(Direct Counting)显著提高了分布外泛化性能,且预测坐标在超过89%的情况下准确锚定于图像中,但存在区域性的空间偏差,说明坐标不仅提升了准确性,也提供了可解释的视觉推理路径。
链接: https://arxiv.org/abs/2603.21746
作者: Simone Alghisi,Massimo Rizzoli,Seyed Mahed Mousavi,Giuseppe Riccardi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs’ accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects’ coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
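文中以 F1 衡量预测点是否正确落在图像目标上。下面是一个常见的"像素容差内贪心一一匹配"式点级 F1 的示意实现(容差阈值与匹配规则为假设,未必与论文评测协议完全一致):

```python
def point_f1(pred, gt, tol=10.0):
    """将预测点与真值目标中心在像素容差内贪心一一匹配,
    返回 (precision, recall, f1)。pred/gt 均为 (x, y) 列表。"""
    unmatched = list(gt)
    tp = 0
    for px, py in pred:
        best = None
        for g in unmatched:
            dist = ((px - g[0]) ** 2 + (py - g[1]) ** 2) ** 0.5
            if dist <= tol and (best is None or dist < best[0]):
                best = (dist, g)
        if best is not None:
            unmatched.remove(best[1])  # 每个真值目标至多匹配一次
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

在 Point-then-Count 范式下,计数即预测点数 len(pred),因此这种点级 F1 直接刻画了中间视觉解释的可靠性。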
[CV-77] When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?
[Quick Read]: This paper addresses efficient selection among multiple generative models, i.e., identifying the best model or mixture so as to avoid the cost of sampling from suboptimal ones. The problem is typically cast as a multi-armed bandit with an Upper Confidence Bound (UCB) exploration bonus to balance exploration and exploitation. The authors observe, however, that across multiple datasets and metrics the UCB term slows convergence and reduces sample efficiency. Their key solution is a Mixture-Greedy strategy without explicit UCB optimism: the intrinsic structure of diversity-aware objectives induces implicit exploration by favoring interior mixtures, yielding linear sampling of all arms and sublinear regret, and performing especially well under objectives such as FID and Vendi where tight confidence bounds are hard to construct. Theoretical analysis shows this implicit exploration arises from the objective geometry, questioning the necessity of explicit confidence bounds in diversity-aware multi-armed bandits.
Link: https://arxiv.org/abs/2603.21716
Authors: Bahar Dibaei Nia,Farzan Farnia
Affiliations: Chinese University of Hong Kong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple \emphMixture-Greedy strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.
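A toy sketch of the greedy-without-UCB idea described above, assuming an entropy (diversity) objective over a mixture of generator "arms"; the arm names, mode distributions, and update rule are our illustrative assumptions, not the paper's code:

```python
import math

# Hypothetical setup: each "arm" is a generator characterized by a
# categorical distribution over content modes (assumed, not from paper).
ARMS = {
    "gen_a": [0.9, 0.1, 0.0],   # covers mostly mode 0
    "gen_b": [0.1, 0.9, 0.0],   # covers mostly mode 1
    "gen_c": [0.0, 0.1, 0.9],   # covers mostly mode 2
}

def entropy(p):
    """Shannon entropy of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mixture(counts):
    """Expected mode distribution induced by per-arm sample counts."""
    total = sum(counts.values())
    m = [0.0, 0.0, 0.0]
    for arm, n in counts.items():
        for i, p in enumerate(ARMS[arm]):
            m[i] += (n / total) * p
    return m

def mixture_greedy(rounds=30):
    """Greedy arm selection under a diversity objective, with no UCB
    bonus: pick the arm whose extra sample most increases entropy."""
    counts = {a: 1 for a in ARMS}            # one warm-up sample per arm
    for _ in range(rounds):
        best = max(ARMS, key=lambda a: entropy(
            mixture({**counts, a: counts[a] + 1})))
        counts[best] += 1
    return counts

counts = mixture_greedy()
```

Because the entropy objective peaks at an interior mixture, the greedy rule keeps sampling every arm (the "implicit exploration" the abstract refers to), without any explicit confidence bonus.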
[CV-78] Compensating Visual Insufficiency with Stratified Language Guidance for Long-Tail Class Incremental Learning
[Quick Read]: This paper tackles two core challenges in long-tail class incremental learning (LT CIL): the scarcity of tail-class samples, which hampers learning, and catastrophic forgetting under continuously evolving, imbalanced data distributions. The key is to exploit the informativeness and scalability of language knowledge by building a stratified language tree that organizes semantic information hierarchically from coarse to fine. On top of this structure, two mechanisms are proposed: stratified adaptive language guidance, which merges multi-scale semantic representations with learnable weights to dynamically adjust supervision for tail classes and mitigate data imbalance; and stratified alignment language guidance, which uses the structural stability of the language tree to constrain optimization and reinforce visual-semantic alignment, thereby alleviating catastrophic forgetting.
Link: https://arxiv.org/abs/2603.21708
Authors: Xi Wang,Xu Yang,Donghao Sun,Cheng Deng
Affiliations: Xidian University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Long-tail class incremental learning (LT CIL) remains highly challenging because the scarcity of samples in tail classes not only hampers their learning but also exacerbates catastrophic forgetting under continuously evolving and imbalanced data distributions. To tackle these issues, we exploit the informativeness and scalability of language knowledge. Specifically, we analyze the LT CIL data distribution to guide large language models (LLMs) in generating a stratified language tree that hierarchically organizes semantic information at coarse-to-fine granularity. Building upon this structure, we introduce stratified adaptive language guidance, which leverages learnable weights to merge multi-scale semantic representations, thereby enabling dynamic supervisory adjustment for tail classes and alleviating the impact of data imbalance. Furthermore, we introduce stratified alignment language guidance, which exploits the structural stability of the language tree to constrain optimization and reinforce semantic-visual alignment, thereby alleviating catastrophic forgetting. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.
[CV-79] Rethinking Token Reduction for Large Vision-Language Models
[Quick Read]: This paper addresses the high inference cost of Large Vision-Language Models (LVLMs) in multi-turn visual question answering (MT-VQA) caused by excessive visual tokens. Existing token-reduction methods target single-turn VQA and struggle in MT-VQA, where subsequent questions are unknown in advance and may refer to arbitrary image regions: prompt-dependent methods discard visual information useful for later turns, while prompt-agnostic ones rely on heuristic metrics such as attention scores and underperform. The proposed learning-based, prompt-agnostic method MetaCompress formulates token reduction as a learnable compression mapping, unifying strategies such as pruning and merging under a single learning objective, and introduces a data-efficient training paradigm that learns optimal compression strategies at limited computational cost, achieving better efficiency-accuracy trade-offs and strong generalization across dialogue turns in MT-VQA.
Link: https://arxiv.org/abs/2603.21701
Authors: Yi Wang,Haofei Zhang,Qihan Huang,Anda Cao,Gongfan Fang,Wei Wang,Xuan Jin,Jie Song,Mingli Song,Xinchao Wang
Affiliations: Zhejiang University; National University of Singapore; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at this https URL.
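The "compression mapping" view that unifies pruning and merging can be illustrated with a small linear-algebra sketch (matrix shapes and the specific mappings are our illustrative assumptions, not MetaCompress itself):

```python
import numpy as np

# Token reduction as a linear compression mapping C that maps N visual
# tokens X (N x d) to M tokens C @ X (M x d).  Pruning and merging are
# both special cases of C (illustrative, not the paper's learned C).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # 6 tokens, dim 4

# Pruning: keep tokens 0, 2, 5 -> rows of a selection matrix.
C_prune = np.zeros((3, 6))
C_prune[0, 0] = C_prune[1, 2] = C_prune[2, 5] = 1.0

# Merging: average tokens {0,1}, {2,3}, {4,5} -> row-stochastic matrix.
C_merge = np.zeros((3, 6))
C_merge[0, [0, 1]] = 0.5
C_merge[1, [2, 3]] = 0.5
C_merge[2, [4, 5]] = 0.5

pruned = C_prune @ X                  # (3, 4): a subset of rows of X
merged = C_merge @ X                  # (3, 4): pairwise token averages
```

Under this view, "learning the compression" amounts to learning the entries (or structure) of C, which is what lets a single objective cover both pruning- and merging-style reduction.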
[CV-80] PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma
[Quick Read]: This paper addresses limitations of current pheochromocytoma and paraganglioma (PPGL) diagnosis: the GAPP scoring system requires manual evaluation, is highly subjective, and omits key genetic risk factors such as SDHB mutations. The core of the solution is PPGL-Swarm, an agent-based diagnostic system that decomposes diagnosis into micro-tasks assigned to specialized agents, yielding automated and traceable reasoning; the system integrates genotype risk alerts with quantified cellularity and Ki-67 assessment, and uses reinforcement learning to refine tool selection and task assignment, improving diagnostic accuracy and clinical utility.
Link: https://arxiv.org/abs/2603.21700
Authors: Zelin Liu,Xiangfu Yu,Jie Huang,Ge Wang,Yizhe Yuan,Zhenyu Yi,Jing Xie,Haotian Jiang,Lichi Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring demands a high workload for clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
[CV-81] RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing
[Quick Read]: This paper addresses novel view synthesis (NVS) through non-planar refractive surfaces, where severe, spatially varying optical distortions break methods such as NeRF and 3D Gaussian Splatting (3DGS) that assume straight-line ray propagation. The key is the RefracGS framework, which explicitly decouples the refractive interface from the target scene: the refractive surface is modeled as a neural height field capturing wave geometry, while the underlying scene is represented as a 3D Gaussian field. A refraction-aware Gaussian ray tracing method computes non-linear ray paths exactly via Snell's law, efficiently renders the underlying Gaussian field, and backpropagates loss gradients to the parameterized refractive surface, enabling end-to-end joint optimization of both representations and achieving view-consistent reconstruction with higher quality and faster rendering.
Link: https://arxiv.org/abs/2603.21695
Authors: Yiming Shao,Qiyu Dai,Chong Gao,Guanbin Li,Yeqiang Wang,He Sun,Qiong Zeng,Baoquan Chen,Wenzheng Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:Novel view synthesis (NVS) through non-planar refractive surfaces presents fundamental challenges due to severe, spatially varying optical distortions. While recent representations like NeRF and 3D Gaussian Splatting (3DGS) excel at NVS, their assumption of straight-line ray propagation fails under these conditions, leading to significant artifacts. To overcome this limitation, we introduce RefracGS, a framework that jointly reconstructs the refractive water surface and the scene beneath the interface. Our key insight is to explicitly decouple the refractive boundary from the target objects: the refractive surface is modeled via a neural height field, capturing wave geometry, while the underlying scene is represented as a 3D Gaussian field. We formulate a refraction-aware Gaussian ray tracing approach that accurately computes non-linear ray trajectories using Snell’s law and efficiently renders the underlying Gaussian field while backpropagating the loss gradients to the parameterized refractive surface. Through end-to-end joint optimization of both representations, our method ensures high-fidelity NVS and view-consistent surface recovery. Experiments on both synthetic and real-world scenes with complex waves demonstrate that RefracGS outperforms prior refractive methods in visual quality, while achieving 15x faster training and real-time rendering at 200 FPS. The project page for RefracGS is available at this https URL.
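The refraction step at the heart of the ray tracer above is standard vector-form Snell's law; a minimal generic sketch (this is textbook optics, not RefracGS's implementation) looks like:

```python
import math

def refract(d, n, n1, n2):
    """Refract unit direction d at a surface with unit normal n
    (pointing toward the incident side) using Snell's law in vector
    form; returns None on total internal reflection."""
    eta = n1 / n2
    cos_i = -sum(di * ni for di, ni in zip(d, n))
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None                      # total internal reflection
    coef = eta * cos_i - math.sqrt(k)
    return [eta * di + coef * ni for di, ni in zip(d, n)]

# Air-to-water ray hitting a flat surface from above, at normal
# incidence: the direction is unchanged.
d = [0.0, 0.0, -1.0]                     # straight down
t = refract(d, [0.0, 0.0, 1.0], 1.0, 1.33)
```

In a differentiable pipeline the same formula is applied at the height-field intersection, so gradients can flow from the rendered image back into the surface parameters.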
[CV-82] PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing
[Quick Read]: This paper addresses the limitations of evaluating robots with binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. The key solution is PRM-as-a-Judge, a dense evaluation paradigm that uses Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to it is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential and is grounded in two axiomatic properties: macro-consistency, requiring additive and path-consistent aggregation, and micro-resolution, requiring sensitivity to fine-grained physical evolution.
Link: https://arxiv.org/abs/2603.21669
Authors: Yuheng Ji,Yuyang Liu,Huajie Tan,Xuchuan Huang,Fanding Huang,Yijie Xu,Cheng Chi,Yuting Zhao,Huaihai Lyu,Peterson Co,Mingyu Cao,Qiongyu Zhang,Zhe Li,Enshen Zhou,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Xiaolong Zheng
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
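The "macro-consistency" property of a potential-based dense score can be seen in a few lines: when per-step credit is the change in a scalar progress potential, the total telescopes to potential(end) minus potential(start), so any path to the same final state gets the same total. The toy potential below is our illustration, not the paper's learned PRM:

```python
# Toy progress potential: fraction of subgoals completed (assumed,
# for illustration; the paper learns the potential from trajectories).
def phi(state):
    return sum(state) / len(state)

def dense_score(trajectory):
    """Sum of per-step potential changes; telescopes to
    phi(end) - phi(start), i.e. path-consistent aggregation."""
    return sum(phi(b) - phi(a) for a, b in zip(trajectory, trajectory[1:]))

# Two different orderings of the same three subgoals.
path1 = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
path2 = [(0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 1, 1)]
s1, s2 = dense_score(path1), dense_score(path2)
```

Micro-resolution is then the separate, empirical requirement that the learned potential actually distinguishes fine-grained physical states, which the additive structure alone does not guarantee.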
[CV-83] HumanOmni-Speaker: Identifying Who said What and When
[Quick Read]: This paper addresses the "illusion of competence" of current multimodal large language models on complex multi-speaker conversation: models exploit visual biases in benchmarks to bypass genuine cross-modal alignment, and low-frame-rate visual sampling destroys crucial high-frequency dynamics such as lip movements, so they cannot reliably answer the core question of "who said what and when." The key contributions are a rigorous evaluation paradigm, Visual-Registered Speaker Diarization and Recognition (VR-SDR), with the HumanOmni-Speaker benchmark, and the HumanOmni-Speaker model, whose core innovation is a Visual Delta Encoder that samples raw video at 25 fps and explicitly compresses inter-frame motion residuals into just 6 tokens per frame. This captures fine-grained visemes and speaker trajectories without token explosion, enabling end-to-end spatio-temporal identity binding and high-precision spatial localization, and significantly improving multimodal synergy and speaker-centric task performance.
Link: https://arxiv.org/abs/2603.21664
Authors: Detao Bai,Shimin Yao,Weixuan Chen,Xihan Wei,Zhiheng Ma
Affiliations: Tongyi Lab, Alibaba Group; Shenzhen University of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence" – they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
[CV-84] Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis
[Quick Read]: This paper addresses poor cross-scenario generalization in image deraining, where performance drops sharply under out-of-distribution (OOD) shifts between training data and real-world rain, a gap rooted in the domain discrepancy between synthetic data and the physics of real rainfall. The key is a cross-scenario adaptation framework that needs no paired rainy images in the target domain: a Superpixel Generation (Sup-Gen) module extracts stable structural priors from the source domain via Simple Linear Iterative Clustering (SLIC); a resolution-adaptive fusion strategy aligns source structures with rain-free target backgrounds by texture similarity to synthesize diverse, realistic pseudo-data; and a multi-stage noise generation mechanism re-synthesizes pseudo-labels to simulate realistic rain streaks. The framework is plug-and-play for arbitrary deraining architectures, and experiments show PSNR gains of 32%-59% in OOD scenarios along with significantly faster training convergence.
Link: https://arxiv.org/abs/2603.21661
Authors: Kangbo Zhao,Miaoxin Guan,Xiang Chen,Yukai Shi,Jinshan Pan
Affiliations: Guangdong University of Technology; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: We aim at addressing the cross-scenario (i.e., O.O.D) de-rain challenge, which has been neglected for a long period
Abstract:Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.
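The SLIC step used by the Sup-Gen module can be sketched with a heavily simplified variant: plain k-means over [intensity, x, y] features with a compactness weight on the spatial terms. Real SLIC additionally restricts each center's search window for speed; this minimal version (ours, for illustration only) just shows how the spatial/appearance trade-off produces superpixels:

```python
import numpy as np

def slic_like(img, n_seg=4, compactness=0.1, iters=10):
    """Simplified SLIC-style superpixels: k-means on
    [intensity, compactness*x, compactness*y] features."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([img.ravel().astype(float),
                      compactness * xs.ravel(),
                      compactness * ys.ravel()], axis=1)
    idx = np.linspace(0, h * w - 1, n_seg).astype(int)
    centers = feats[idx].copy()               # grid-like initialization
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                  # assign pixels to centers
        for k in range(n_seg):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(h, w)

img = np.zeros((8, 8)); img[:, 4:] = 1.0      # two-intensity toy image
labels = slic_like(img)
```

Because the intensity term dominates the weighted spatial terms here, superpixels never straddle the intensity boundary, which is the structure-preserving property the prior extraction relies on.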
[CV-85] OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging CVPR2026
[Quick Read]: This paper addresses two challenges of federated learning (FL) for medical image analysis: the tight coupling of existing frameworks to task-specific backbones, which limits flexibility, and unstable performance under heterogeneous imaging modalities, which hinders deployment in real-world multi-institution collaboration. The key is OmniFM, a modality- and task-agnostic FL framework built on a frequency-domain insight: low-frequency spectral components are highly consistent across modalities and encode modality-invariant anatomical structure. OmniFM unifies training via three mechanisms: (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation, yielding robust generalization across downstream tasks such as classification, segmentation, super-resolution, visual question answering, and multimodal fusion.
Link: https://arxiv.org/abs/2603.21660
Authors: Meilin Liu,Jiaying Wang,Jing Shan
Affiliations: Shenyang University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 (Main)
Abstract:Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
[CV-86] FedCVU: Federated Learning for Cross-View Video Understanding
[Quick Read]: This paper addresses three challenges of federated learning (FL) for cross-view multi-camera video understanding: (i) heterogeneous viewpoints and backgrounds make client data highly non-IID and prone to overfitting view-specific patterns; (ii) local distribution shifts misalign representations and hinder consistent cross-view semantics; and (iii) large video models incur prohibitive communication overhead. The key is the FedCVU framework with three components: VS-Norm preserves normalization parameters to handle view-specific statistics; CV-Align adds a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy, reduces communication cost without sacrificing accuracy. Experiments show that FedCVU consistently improves unseen-view accuracy while maintaining strong seen-view performance, demonstrating robustness to domain heterogeneity and communication constraints.
Link: https://arxiv.org/abs/2603.21647
Authors: Shenghan Zhang,Run Ling,Ke Cao,Ao Ma,Zhanjie Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
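The general "keep normalization parameters local" idea behind VS-Norm can be sketched as federated averaging that skips keys marked as normalization statistics (parameter names and the key-matching rule below are our assumptions, not the paper's code):

```python
# FedAvg variant: average all weights EXCEPT normalization parameters,
# which each client keeps local to absorb view-specific statistics.
def fedavg_except_norm(client_models, is_norm=lambda k: "norm" in k):
    keys = client_models[0].keys()
    shared = {k: sum(m[k] for m in client_models) / len(client_models)
              for k in keys if not is_norm(k)}
    # Each client keeps its own norm params and receives shared weights.
    return [{**m, **shared} for m in client_models]

clients = [
    {"conv.w": 1.0, "norm.scale": 0.5},   # client A (view 1)
    {"conv.w": 3.0, "norm.scale": 2.5},   # client B (view 2)
]
updated = fedavg_except_norm(clients)
```

After the round, both clients share the averaged `conv.w` but retain their own `norm.scale`, which is the mechanism that lets per-view statistics survive aggregation.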
[CV-87] No Dense Tensors Needed: Fully Sparse Object Detection on Event-Camera Voxel Grids
[Quick Read]: This paper addresses the representational inefficiency of conventional event-camera object detectors, which convert sparse event streams into dense tensors whose compute and memory costs grow with sensor resolution, despite the high dynamic range and asynchrony of event cameras. The key is SparseVoxelDet, the first fully sparse detection framework: backbone feature extraction, feature pyramid fusion, and the detection head all operate via 3D sparse convolutions only at occupied voxel positions, with no dense feature tensor instantiated at any stage. This design reduces GPU memory by 858x and storage by 3,670x, with representation cost governed by scene dynamics rather than pixel count, while maintaining high accuracy (e.g., 83.38% mAP at IoU=0.5 on the FRED benchmark).
Link: https://arxiv.org/abs/2603.21638
Authors: Mohamad Yazan Sadoun,Sarah Sharif,Yaser Mike Banad
Affiliations: University of Oklahoma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 Pages, 9 Figures, 5 Tables
Abstract:Event cameras produce asynchronous, high-dynamic-range streams well suited for detecting small, fast-moving drones, yet most event-based detectors convert the sparse event stream into dense tensors, discarding the representational efficiency of neuromorphic sensing. We propose SparseVoxelDet, to our knowledge the first fully sparse object detector for event cameras, in which backbone feature extraction, feature pyramid fusion, and the detection head all operate exclusively on occupied voxel positions through 3D sparse convolutions; no dense feature tensor is instantiated at any stage of the pipeline. On the FRED benchmark (629,832 annotated frames), SparseVoxelDet achieves 83.38% mAP@50 while processing only 14,900 active voxels per frame (0.23% of the T×H×W grid), compared to 409,600 pixels for the dense YOLOv11 baseline (87.68% mAP@50). Relaxing the IoU threshold from 0.50 to 0.40 recovers mAP to 89.26%, indicating that the remaining accuracy gap is dominated by box regression precision rather than detection capability. The sparse representation yields 858 times GPU memory compression and 3,670 times storage reduction relative to the equivalent dense 3D voxel tensor, with data-structure size that scales with scene dynamics rather than sensor resolution. Error forensics across 119,459 test frames confirms that 71 percent of failures are localization near-misses rather than missed targets. These results demonstrate that native sparse processing is a viable paradigm for event-camera object detection, exploiting the structural sparsity of neuromorphic sensor data without requiring neuromorphic computing hardware, and providing a framework whose representation cost is governed by scene activity rather than pixel count, a property that becomes increasingly valuable as event cameras scale to higher resolutions.
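The core storage idea, keeping only occupied voxels of the T×H×W grid, can be sketched in a few lines (this is an illustrative voxelizer, not the paper's pipeline, which also keeps polarity and feeds sparse convolutions):

```python
import numpy as np

def sparse_voxelize(events, t_bins, h, w, t_max):
    """Bin events (t, x, y) into a T x H x W voxel grid, but store only
    occupied voxels as (flat_coordinate, count) pairs, so memory scales
    with scene activity instead of grid size."""
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    tb = np.minimum((t / t_max * t_bins).astype(int), t_bins - 1)
    flat = tb * (h * w) + y * w + x           # flatten (tb, y, x)
    coords, counts = np.unique(flat, return_counts=True)
    return coords, counts

events = np.array([[0.0, 1.0, 2.0], [0.1, 1.0, 2.0], [0.9, 3.0, 3.0]])
coords, counts = sparse_voxelize(events, t_bins=2, h=4, w=4, t_max=1.0)
dense_cells = 2 * 4 * 4      # a dense grid would always store 32 cells
```

Here three events collapse to two occupied voxels; a dense tensor would allocate all 32 cells regardless of activity, which is exactly the overhead the fully sparse detector avoids.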
[CV-88] Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition CVPR2026
[Quick Read]: This paper addresses the performance degradation of multi-object tracking (MOT) during online inference caused by distribution shifts between training and test data in appearance, motion patterns, and category distributions; existing test-time adaptation (TTA) methods focus on frame-level adaptation and neglect temporal consistency and identity association across frames and videos. The key is a human-decision-inspired Test-time Calibration from Experience and Intuition (TCEI) framework: the Intuitive system uses transient memory to rapidly predict recently observed objects, while the Experiential system leverages experience accumulated from prior test videos to reassess and calibrate those predictions; in addition, confident and uncertain objects during testing serve as historical priors and reflective cases respectively, improving adaptation to the test environment and effectively mitigating degradation under distribution shift.
Link: https://arxiv.org/abs/2603.21629
Authors: Wen Guo(1),Pengfei Zhao(1),Zongmeng Wang(4),Yufan Hu(2),Junyu Gao(3) ((1) Shandong Technology and Business University, (2) University of Science and Technology Beijing, (3) Institute of Automation, Chinese Academy of Sciences, (4) Inner Mongolia University)
Affiliations: Shandong Technology and Business University; University of Science and Technology Beijing; Institute of Automation, Chinese Academy of Sciences; Inner Mongolia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026
Abstract:Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and catagory between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they primarily focus solely on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by human decision-making process, this paper propose a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model’s adaptability under distribution shifts. The code will be released at this https URL.
[CV-89] PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation CVPR2026
[Quick Read]: This paper addresses redundant feature computation in brain tumor MRI segmentation caused by the spatial sparsity of lesions: existing networks ignore clinically observed spatial priors of tumor occurrence and waste feature extraction on extensive background regions. The key is PGR-Net (Prior-Guided ROI Reasoning Network), which introduces a data-driven spatial prior set capturing lesion distribution and scale to provide global guidance and stabilize segmentation; a hierarchical Top-K ROI decision mechanism progressively selects the most confident lesion candidate regions across layers to improve localization precision, and the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module produces center-enhanced guidance maps that direct feature learning throughout the network, achieving efficient and highly accurate segmentation.
Link: https://arxiv.org/abs/2603.21626
Authors: Jiacheng Lu,Hui Ding,Shiyu Zhang,Guoping Huo
Affiliations: Capital Normal University; China University of Mining and Technology-Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to the main conference of CVPR 2026
Abstract:Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network) - an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M Params, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at this https URL.
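A center-enhanced guidance map of the kind WinGS-ROI produces can be sketched with a single isotropic Gaussian (our toy version; the paper uses multi-window Gaussian templates with a spatial decay function):

```python
import numpy as np

def gaussian_guidance(h, w, cy, cx, sigma):
    """Guidance map that peaks at the ROI center (cy, cx) and decays
    with distance, down-weighting background features."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

g = gaussian_guidance(9, 9, cy=4, cx=4, sigma=2.0)
```

Multiplying feature maps by such a mask concentrates gradient signal on the candidate lesion region, which is the mechanism that lets the network skip most of the background volume.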
[CV-90] Efficient Zero-Shot AI-Generated Image Detection
[Quick Read]: This paper addresses the detection of AI-generated images, in particular the limited sensitivity of training-free methods to subtle forgery traces. The key is to detect by measuring the sensitivity of image representations to structured frequency-domain perturbations: perturbation generation requires only a single Fourier transform, efficiently capturing minute discrepancies between real and synthetic images while substantially improving detection accuracy and computational efficiency.
Link: https://arxiv.org/abs/2603.21619
Authors: Ryosuke Sonoda,Ramya Srinivasan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free methods. Extensive experiments on challenging benchmarks demonstrate the efficacy of our method over state-of-the-art (SoTA) approaches. In particular, on OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.
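The general shape of the sensitivity signal can be sketched with a single FFT round-trip. This simplified stand-in (ours, not the paper's method) perturbs a band of Fourier coefficients and uses raw-pixel distance in place of a learned representation:

```python
import numpy as np

def freq_sensitivity(img, band=(3, 6)):
    """Zero out a radial band of Fourier coefficients and measure how
    much the image changes; images with more energy in that band are
    more sensitive to the structured perturbation."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(ys - h // 2, xs - w // 2)
    mask = (r >= band[0]) & (r < band[1])
    F_pert = F * np.where(mask, 0.0, 1.0)       # structured perturbation
    pert = np.real(np.fft.ifft2(np.fft.ifftshift(F_pert)))
    return float(np.abs(img - pert).mean())

rng = np.random.default_rng(0)
smooth = np.outer(np.hanning(16), np.hanning(16))   # low-frequency image
noisy = rng.normal(size=(16, 16))                   # broadband image
s_smooth, s_noisy = freq_sensitivity(smooth), freq_sensitivity(noisy)
```

The broadband image reacts far more strongly to the band removal than the smooth one; the detector's premise is that real and generated images differ in exactly this kind of frequency-band sensitivity.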
[CV-91] 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video
[Quick Read]: This paper addresses geometric inconsistency when reconstructing 360° dynamic objects from monocular video: existing methods rely heavily on 2D-native priors, so initial points overfit the surfaces visible in training views and fail to reconstruct the geometry of occluded regions. The key is 4DGS360, a diffusion-free framework whose core innovations are (1) a 3D-native initialization that mitigates geometric ambiguity in occluded regions, and (2) the proposed 3D tracker AnchorTAP3D, which reinforces 3D point trajectories by using confident 2D track points as anchors, suppressing drift and providing stable initialization that preserves geometric consistency in occluded regions; combined with optimization, this yields high-quality 360° 4D reconstructions of dynamic scenes.
Link: https://arxiv.org/abs/2603.21618
Authors: Jae Won Jang,Yeonjin Chang,Wonsik Shin,Juhwan Cho,Nojun Kwak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce 4DGS360, a diffusion-free framework for 360° dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360° geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to visible surface in each training view. 4DGS360 addresses this challenge through a advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360° 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135° apart from training views, enabling 360° evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.
[CV-92] AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing
【Summary】: This paper targets the "injection dilemma" in inversion-based image editing with flow-matching models: injecting source features during denoising preserves the original background but suppresses the model's ability to synthesize edited content. Existing methods use fixed injection strategies (binary temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation) that ignore the heterogeneous injection demand across both time and channels. The key is the AdaEdit framework with two complementary innovations: (1) a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation; and (2) Channel-Selective Latent Perturbation, which estimates per-channel importance from the distributional gap between inverted and random latents and applies differentiated perturbation strengths, strengthening edit-relevant channels while protecting structure-encoding channels.
Link: https://arxiv.org/abs/2603.21615
Authors: Guandong Li, Zhaobin Chu
Affiliations: iFLYTEK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model’s ability to synthesize edited content. Existing methods address this with fixed injection strategies – binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation – that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly – strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at this https URL
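The Progressive Injection Schedule's continuous decay functions can be sketched directly. The steepness `k` and midpoint `cutoff` below are illustrative assumptions, not values from the paper:

```python
import math

def injection_weight(t, schedule="sigmoid", total=50, k=10.0, cutoff=0.5):
    """Continuous source-feature injection weight at denoising step t.

    Replaces a hard binary on/off schedule with a smooth decay, in the
    spirit of AdaEdit's Progressive Injection Schedule. Returns a weight
    in [0, 1]: 1 -> inject source features fully, 0 -> rely on target
    generation. Parameter names and defaults are illustrative.
    """
    x = t / total  # normalized progress in [0, 1]
    if schedule == "linear":
        return max(0.0, 1.0 - x)
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * min(x, 1.0)))
    if schedule == "sigmoid":
        return 1.0 / (1.0 + math.exp(k * (x - cutoff)))
    raise ValueError(schedule)

# Weight decays smoothly from ~1 (preserve source) to ~0 (generate target).
weights = [injection_weight(t) for t in range(0, 51, 10)]
```

Any of the three schedules avoids the feature discontinuity that a binary cutoff introduces at the switch-over step.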
[CV-93] SARe: Structure-Aware Large-Scale 3D Fragment Reassembly
【Summary】: This paper addresses 3D fragment reassembly: recovering the rigid poses of unordered fragment point clouds or meshes, with the target shape unknown and fragments offering only weak semantic cues, to reconstruct the complete shape in a common object coordinate system. As the fragment count grows, existing end-to-end methods suffer cascading failures from unreliable contact reasoning, notably incorrect fragment adjacencies. The key is the structure-aware reassembly framework SARe, with two core modules: SARe-Gen generates assemblies in Euclidean space by jointly predicting fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies; SARe-Refine performs inference-time refinement, using geometric-consistency checks to select reliable substructures and resample uncertain regions while keeping verified parts fixed, substantially improving stability and consistency in many-fragment settings.
Link: https://arxiv.org/abs/2603.21611
Authors: Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo, Shuainan Ye, Tan Tang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 4 figures
Abstract:3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.
[CV-94] A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
【Summary】: This paper tackles the difficulty of reasoning over heterogeneous, evolving, and incomplete patient data in modern clinical practice, where most existing multimodal foundation models are static, opaque, and disconnected from real clinical workflows. The key is Cerebra, an interactive AI team of specialized agents for EHR, clinical notes, and medical imaging analysis, whose outputs are synthesized into a clinician-facing dashboard combining visual analytics with an interactive conversational interface for interrogating predictions and contextualizing risk at the point of care. The architecture supports privacy-preserving deployment (operating on structured representations) and remains robust when modalities are missing, significantly outperforming single-modality models and large multimodal language model baselines on tasks including Alzheimer's disease risk prediction, diagnosis, and survival prediction.
Link: https://arxiv.org/abs/2603.21597
Authors: Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar, Kevin M. Spiegler, Philip Kuball, Stefania C. Bray, Megan Bernath, Deanna R. Willis, Jiang Bian, Lei Xing, Eric Topol, Kyunghyun Cho, Yu Huang, Ruogu Fang, Narges Razavian, James Zou
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra’s potential for interpretable, robust decision support in clinical care.
[CV-95] SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models
【Summary】: This paper addresses the challenge of merging modality-specialist multimodal large language models (MLLMs): differences in learned representations and interference in parameter space make knowledge fusion difficult. The key is Singular Subspace Alignment and Merging (SSAM), a training-free model-merging framework that keeps modality-specific parameter updates separate, identifies a shared language-related low-rank subspace, and aligns and merges parameters within that subspace. Without any multimodal training data, it preserves the complementary knowledge of each model while minimizing parameter conflicts, unifying capabilities across modalities in a single model.
Link: https://arxiv.org/abs/2603.21584
Authors: Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 Pages, 9 Figures, 5 Tables
Abstract:Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether they can be merged into a single MLLM that handles multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately and identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.
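The idea of merging parameter updates inside a shared low-rank subspace can be illustrated with a toy sketch. The SVD-based basis construction below is an assumption for illustration, not SSAM's exact procedure:

```python
import numpy as np

def merge_in_shared_subspace(delta_a, delta_b, rank=4):
    """Merge two parameter updates inside a shared low-rank subspace.

    Illustrative sketch of the idea behind SSAM: build a rank-`rank`
    basis that captures both updates, project each onto it, and average
    there, so components outside the shared subspace do not interfere.
    This is not the paper's exact algorithm.
    """
    # Shared column subspace from the concatenated updates.
    u, _, _ = np.linalg.svd(np.hstack([delta_a, delta_b]), full_matrices=False)
    basis = u[:, :rank]                      # d x r orthonormal basis
    proj = basis @ basis.T                   # projector onto the subspace
    return proj @ (delta_a + delta_b) / 2.0  # average inside the subspace

rng = np.random.default_rng(0)
d_a = rng.standard_normal((16, 8))  # stand-ins for two models' weight deltas
d_b = rng.standard_normal((16, 8))
merged = merge_in_shared_subspace(d_a, d_b, rank=4)
```

Averaging only inside a shared subspace is one simple way to reduce the parameter interference the abstract describes; identical updates pass through unchanged when the subspace is large enough to contain them.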
[CV-96] HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling
【Summary】: This paper addresses 3D rotation regression from a small number of labeled 2D images, aiming to improve performance in data-scarce settings where existing methods rely on large labeled datasets or extra information (such as point clouds or CAD models). The key is a semi-supervised framework based on hardness-aware curriculum learning that dynamically selects pseudo-labeled samples: multi-stage and adaptive curriculum strategies replace fixed-threshold entropy filtering, letting the model progress from easy to hard examples. A structured data augmentation strategy tailored to rotation estimation assembles composite images from augmented patches, increasing feature diversity while preserving geometric integrity, and markedly improves rotation regression accuracy in low-data regimes.
Link: https://arxiv.org/abs/2603.21583
Authors: Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This is an accepted manuscript of an article published in Computer Vision and Image Understanding
Abstract:Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
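The easy-to-hard pseudo-label selection can be sketched as a quantile schedule over per-sample prediction entropies. The linear `start_q`-to-`end_q` schedule is an illustrative assumption, not HACMatch's exact rule:

```python
def curriculum_mask(entropies, epoch, total_epochs, start_q=0.2, end_q=0.8):
    """Hardness-aware pseudo-label selection: admit low-entropy (easy)
    samples first, then progressively harder ones.

    A minimal sketch of the multi-stage curriculum idea; the quantile
    schedule is an assumption. Returns a boolean keep-mask over the batch.
    """
    # Fraction of the batch admitted grows with training progress.
    frac = start_q + (end_q - start_q) * min(1.0, epoch / max(1, total_epochs - 1))
    cutoff_rank = max(1, int(round(frac * len(entropies))))
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    keep = set(order[:cutoff_rank])
    return [i in keep for i in range(len(entropies))]

ent = [0.1, 0.9, 0.3, 0.7, 0.5]                       # toy entropies
early = curriculum_mask(ent, epoch=0, total_epochs=10)  # only easiest kept
late = curriculum_mask(ent, epoch=9, total_epochs=10)   # most samples kept
```

Compared with a single fixed entropy threshold, the admitted fraction adapts to training progress, so early training is not polluted by unreliable hard samples.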
[CV-97] Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs
【Summary】: This paper addresses the limitation of existing visual-privacy benchmarks that treat privacy as a binary attribute (images labeled simply private or non-private), ignoring that combinations of visual attributes can create more severe privacy risks. It proposes a regulation-aware Compositional Privacy Risk Taxonomy (CPRT); the key is distinguishing standalone identifiability from compositional harm potential, defining four graded severity levels with an interpretable continuous scoring function. CPRT not only provides a structured framework for privacy risk assessment but also comes with a 6.7K-image dataset and ground-truth compositional risk scores, revealing that frontier vision-language models (VLMs) systematically underestimate composition-driven risks without structured guidance; the paper further proposes a deployment-friendly 8B supervised fine-tuned (SFT) model that approaches frontier-level performance on compositional privacy assessment.
Link: https://arxiv.org/abs/2603.21573
Authors: Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Sai Praneeth Karimireddy, Rahul Gupta, Shrikanth Narayanan
Affiliations: University of Southern California; Meta Superintelligence Labs; Amazon AGI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional. Attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.
[CV-98] CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation
【Summary】: This paper addresses the accuracy and efficiency of real-time semantic segmentation in cataract surgery videos, as well as the training bottleneck caused by scarce high-quality annotations. The core solution is CataractSAM-2, a domain-adapted extension of Segment Anything Model 2 (SAM-2) for anterior-segment ophthalmic surgery, offering high accuracy, real-time operation, and zero-shot generalization. The key innovation is an interactive annotation framework combining sparse prompts with video-based mask propagation, which substantially reduces manual annotation cost and accelerates the construction of high-quality datasets, advancing intraoperative perception and surgical video understanding in medical robotic systems.
Link: https://arxiv.org/abs/2603.21566
Authors: Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab, Michael G. Morley, Michael V. Boland, Michael M. Lin, John B. Miller, David S. Friedman, Nazlee Zebardast, Lucia Sobrin, Tobias Elze
Affiliations: Thomas Jefferson High School for Science and Technology; Mass Eye and Ear, Harvard Medical School; Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Mass Eye and Ear, Harvard Medical School
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:We present CataractSAM-2, a domain-adapted extension of Meta’s Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model’s strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.
[CV-99] Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance
【Summary】: This paper addresses synthetic aperture radar automatic target recognition (SAR ATR), where coherent speckle noise blurs target features, degrades recognition accuracy, and limits model generalization. The key is a target-aware frequency-spatial enhancement framework (FSCE) with two core modules: a shallow-feature adaptive enhancement module (DSAF) that jointly refines shallow features via spatial multi-scale convolution and frequency-domain wavelet convolution; and a teacher-student scheme with online knowledge distillation (KD) that guides the student network to focus on target regions and improves robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the framework delivers notably more stable recognition and stronger cross-model generalization on several public datasets.
Link: https://arxiv.org/abs/2603.21565
Authors: Yansong Lin, Zihan Cheng, Jielei Wang, Guoming Lua, Zongyong Cui
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Synthetic aperture radar automatic target recognition (SAR ATR) is of considerable importance in marine navigation and disaster monitoring. However, the coherent speckle noise inherent in SAR imagery often obscures salient target features, leading to degraded recognition accuracy and limited model generalization. To address this issue, this paper proposes a target-aware frequency-spatial enhancement framework with noise-resilient knowledge guidance (FSCE) for SAR target recognition. The proposed framework incorporates a frequency-spatial shallow feature adaptive enhancement (DSAF) module, which processes shallow features through spatial multi-scale convolution and frequency-domain wavelet convolution. In addition, a teacher-student learning paradigm combined with an online knowledge distillation method (KD) is employed to guide the student network to focus more effectively on target regions, thereby enhancing its robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the proposed approach significantly improves the stability of target recognition under noisy conditions. Based on the FSCE framework, two network architectures with different performance emphases are developed: lightweight DSAFNet-M and high-precision DSAFNet-L. Extensive experiments are conducted on the MSTAR, FUSARShip and OpenSARShip datasets. The results show that DSAFNet-L achieves competitive or superior performance compared with various methods on three datasets; DSAFNet-M significantly reduces the model complexity while maintaining comparable accuracy. These results indicate that the proposed FSCE framework exhibits strong cross-model generalization.
[CV-100] Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection
【Summary】: This paper addresses the limitation that existing unsupervised continuous anomaly detection (UCAD) methods rely only on visual information, yielding an insufficient representation of normal patterns and capping detection accuracy in complex scenes. The key is a multimodal-prompting framework with two core components: (1) a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across tasks, enriching the representation of normality; and (2) a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that combines an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly improve detection accuracy and adversarial robustness.
Link: https://arxiv.org/abs/2603.21562
Authors: Mingle Zhou, Jiahui Liu, Jin Wan, Gang Li, Min Li
Affiliations: Shandong Computer Science Center (National Supercomputer Center in Jinan); Qilu University of Technology (Shandong Academy of Sciences); City University of Macau; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing; Shandong Fundamental Research Center for Computer Science
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.
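Scoring a test feature against a bank of stored normal prototypes can be illustrated generically. This nearest-prototype distance is a common anomaly-scoring pattern, not the paper's CMPMB implementation:

```python
import numpy as np

def memory_bank_score(feat, bank):
    """Anomaly score as distance to the nearest stored normal prototype.

    Generic sketch of scoring against a prototype memory bank: a small
    distance means the feature resembles retained normal patterns.
    """
    d = np.linalg.norm(bank - feat, axis=1)  # distance to every prototype
    return float(d.min())

bank = np.array([[0.0, 0.0], [1.0, 1.0]])    # stored normal prototypes
s_normal = memory_bank_score(np.array([0.1, 0.0]), bank)  # near a prototype
s_anom = memory_bank_score(np.array([3.0, 3.0]), bank)    # far from all
```

In a continual setting such as UCAD, new prototypes are appended per task, so the bank grows without retraining on earlier tasks.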
[CV-101] Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning
【Summary】: This paper addresses the noise problem in weakly-supervised video scene graph generation (WS-VSGG): without bounding-box annotations, object proposals from generic detectors capture many non-interactive objects that disrupt relation modeling. The key is Pair Affinity Learning and Scoring (PALS), which uses a learnable pair affinity to estimate the interaction likelihood of subject-object pairs and ranks them at inference; Pair Affinity Modulation (PAM) further injects affinity into contextual reasoning, effectively suppressing non-interactive pairs and focusing on semantically meaningful associations. To improve supervision for affinity learning, Relation-Aware Matching (RAM) leverages vision-language grounding to resolve class-level ambiguity in pseudo labels, enabling more accurate relation prediction.
Link: https://arxiv.org/abs/2603.21559
Authors: Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim, Sungmin Woo, Inseok Jeon, Sangyoun Lee
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 11 figures
Abstract:Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy proposals. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
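The inference-time ranking by pair affinity can be sketched as follows. The feature concatenation, fixed weight vector, and example pairs are all illustrative assumptions; in the paper the affinity is a learned module:

```python
import math

def pair_affinity(subj_feat, obj_feat, w):
    """Score how likely a subject-object pair interacts, in (0, 1).

    Toy stand-in for a learnable pair affinity: an inner product of the
    concatenated pair features with a weight vector, squashed by a
    sigmoid. `w` here is fixed only for illustration.
    """
    pair = subj_feat + obj_feat          # list concatenation = pair features
    z = sum(p * wi for p, wi in zip(pair, w))
    return 1.0 / (1.0 + math.exp(-z))

def rank_pairs(pairs, w, keep=2):
    """Inference-time ranking: keep the top-`keep` pairs by affinity,
    suppressing likely non-interactive ones."""
    scored = sorted(pairs, key=lambda p: pair_affinity(p[1], p[2], w),
                    reverse=True)
    return [name for name, _, _ in scored[:keep]]

w = [1.0, -0.5, 0.25, 0.0]               # hypothetical learned weights
pairs = [
    ("person-holds-cup", [2.0, 0.1], [1.0, 0.0]),
    ("person-near-wall", [-1.0, 2.0], [-1.0, 1.0]),
    ("cup-on-table", [1.0, 0.5], [0.5, 0.2]),
]
top = rank_pairs(pairs, w, keep=2)
```

Low-affinity pairs (here "person-near-wall") are pruned before relation classification, which is the filtering role the abstract attributes to PALS.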
[CV-102] From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy ICME2026
【Summary】: This paper addresses single-image 3D generation, where sparse supervision makes it hard to generalize across semantic categories and widely varying structural complexity, and existing methods produce fragmented or missing structure on novel object layouts. The core solution is a novel part-to-whole 3D generative world model that learns an adaptive part-whole hierarchy in a flexible 3D latent space. The keys are: (1) inferring soft, compositional masks from image tokens to autonomously discover latent structural slots, with an adaptive slot-gating mechanism that dynamically modulates slot activation probabilities and smoothly merges redundant slots to keep the structure compact yet expressive; and (2) aligning each distilled slot to a learnable class-agnostic prototype bank, using universal geometric prototypes for cross-category shape sharing and denoising, which strengthens transfer across categories and structural robustness.
Link: https://arxiv.org/abs/2603.21557
Authors: Bi'an Du, Daizong Liu, Pufan Li, Wei Hu
Affiliations: Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2026
Abstract:Single-image 3D generation lies at the core of vision-to-graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part-aware models such as PartCrafter, which still require a labor-intensive user-specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single-image 3D generation as learning an adaptive part-whole hierarchy in the flexible 3D latent space. We present a novel part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines the slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling powerful cross-category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape-prior sharing as well as slot-gating for structural adaptation.
[CV-103] PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models
【Summary】: This paper addresses an evaluation gap in concept erasure for text-to-video (T2V) diffusion models: existing evaluations only check whether the target concept is absent from generated frames, mistaking output-level suppression for representational removal. To measure residual representational capacity after erasure, the paper proposes the PROBE diagnostic protocol. The key: with all model parameters frozen, a lightweight pseudo-token embedding is optimized with a denoising reconstruction objective plus a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept, making the reactivation potential of erased concepts quantifiable. The protocol also uncovers "temporal re-emergence", a video-specific failure mode in which suppressed concepts progressively resurface across frames, showing that current erasure methods achieve output-level suppression rather than true representational removal.
Link: https://arxiv.org/abs/2603.21547
Authors: Yiwei Xie, Zheng Zhang, Ping Liu
Affiliations: Huazhong University of Science and Technology; University of Nevada, Reno
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This preprint was posted after submission to IEEE Transactions
Abstract:Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at this https URL.
[CV-104] PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation CVPR2026
【Summary】: This paper addresses the performance and efficiency bottlenecks in training-free open-vocabulary semantic segmentation (OVSS) caused by heavy post-processing, handling text and vision in isolation, or adding extra vision backbones. The key is PEARL, a two-step align-then-propagate inference framework: a Procrustes alignment step performs an orthogonal projection with a stable polar iteration inside the last self-attention block, rotating keys toward the query subspace for cross-modal geometric alignment; a text-aware Laplacian propagation step then runs a confidence-weighted, text-guided graph refinement on a small grid, where text provides a data-trust signal and neighbor gating while image gradients preserve boundary detail. The method is training-free, needs no extra data or auxiliary backbones, uses only fixed constants and a few conjugate-gradient iterations, and surpasses the prior training-free state of the art on standard benchmarks.
Link: https://arxiv.org/abs/2603.21528
Authors: Gensheng Pei, Xiruo Jiang, Xinhao Cai, Tao Chen, Yazhou Yao, Byeungwoo Jeon
Affiliations: Sungkyunkwan University; Southwest Jiaotong University; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by CVPR 2026
Abstract:Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL, Procrustes alignment with text-aware Laplacian propagation, a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. In this work, our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. Our approach, PEARL, sets a new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
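The orthogonal "rotate keys toward the query subspace" step can be illustrated with the closed-form orthogonal Procrustes solution. The paper uses a stable polar iteration, so the SVD route below is a reference sketch of the same orthogonal factor, not the paper's implementation:

```python
import numpy as np

def procrustes_rotation(keys, queries):
    """Orthogonal matrix that best maps key vectors onto the queries
    (closed-form orthogonal Procrustes: minimize ||K R - Q||_F).

    The optimal R is the orthogonal polar factor of K^T Q, here obtained
    via SVD; a polar iteration would converge to the same factor.
    """
    m = keys.T @ queries                 # d x d cross-covariance
    u, _, vt = np.linalg.svd(m)
    return u @ vt                        # nearest orthogonal matrix

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 8))         # toy query tokens
true_r, _ = np.linalg.qr(rng.standard_normal((8, 8)))
k = q @ true_r.T                         # keys are rotated queries
r = procrustes_rotation(k, q)            # recovers the rotation
aligned = k @ r                          # keys rotated back onto queries
```

When the keys really are a rotated copy of the queries, the recovered orthogonal matrix maps them back exactly, which is the geometric alignment the abstract describes.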
[CV-105] VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection
【Summary】: This paper addresses the entangled reasoning in current deepfake detection methods based on multimodal large language models (MLLMs), where evidence generation and manipulation localization are merged into a single step, blurring the line between faithful observations and hallucinated explanations and undermining the reliability of detection conclusions. The key is the VIGIL framework, inspired by expert forensic practice, with a staged plan-then-examine pipeline: it first plans which facial regions to inspect from global visual cues, then independently sources forensic evidence for each region during examination; a stage-gated injection mechanism delivers part-level evidence only at the examination stage, keeping region selection driven by the model's own perception and free of external bias. A progressive three-stage training strategy, whose reinforcement-learning stage uses part-aware reward functions to enforce anatomical validity and evidence-conclusion coherence, further improves the trustworthiness and generalization of detection results.
Link: https://arxiv.org/abs/2603.21526
Authors: Xinghan Li, Junhao Xu, Jingjing Chen
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model’s own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
[CV-106] Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection CVPR2026
[Quick Read]: This paper tackles zero-shot (ZS) 3D anomaly detection: accurately detecting and localizing industrial defects without any target-category training data. Existing methods typically render 3D point clouds into 2D images and rely on pre-trained Vision-Language Models (VLMs) for anomaly detection, which inevitably discards geometric detail and limits sensitivity to local anomalies. The key to the solution is returning to the intrinsic 3D representation: the proposed BTP (Back To Point) framework aligns multi-granularity patch features with textual embeddings for fine-grained anomaly localization, incorporates geometric descriptors to sharpen sensitivity to structural anomalies, and introduces a joint representation-learning strategy that exploits auxiliary point cloud data to improve robustness and enrich anomaly semantics.
Link: https://arxiv.org/abs/2603.21511
Authors: Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan
Affiliations: Qilu University of Technology (Shandong Academy of Sciences); Shandong Computer Science Center (National Supercomputer Center in Jinan); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing; Shandong Fundamental Research Center for Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at this https URL.
[CV-107] Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification CVPR2026
[Quick Read]: This paper targets two key challenges in few-shot weakly supervised whole slide image classification (FSWC): existing prompt-tuning methods substantially increase the number of trainable parameters and the inference overhead, and current methods lose information by hard-filtering instances with low text alignment. The key to the solution is twofold: first, a new parameter-efficient prompt-tuning method that scales and shifts features inside the text encoder, sharply reducing computational cost; second, a WSI representation-learning approach with a soft hierarchical textual guidance strategy that exploits both the pre-trained knowledge of VLMs and the inherent hierarchical structure of WSIs without hard instance filtering, improving classification performance and weakly supervised tumor localization.
Link: https://arxiv.org/abs/2603.21504
Authors: Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at CVPR 2026 Workshop on Medical Reasoning with Vision Language Foundation Models (Med-Reasoner)
Abstract:Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter efficient prompt tuning method by scaling and shifting features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8% respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at this https URL.
[CV-108] StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding
[Quick Read]: This paper addresses a system-level gap in real-time streaming video understanding: existing work focuses on isolated metrics (e.g., question-answering accuracy under limited visual context, or encoding-efficiency gains) while overlooking deployability under realistic resource constraints. The key to the solution is StreamingEval, a unified evaluation framework that benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage, and accuracy. Its core design adopts a fixed-capacity memory bank to normalize the accessible historical visual context, and jointly evaluates visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability.
Link: https://arxiv.org/abs/2603.21493
Authors: Guowei Tang, Tianwen Qian, Huanran Zheng, Yifei Wang, Xiaoling Wang
Affiliations: East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Codes will be released at this https URL.
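The fixed-capacity memory bank that normalizes accessible historical context can be sketched as a simple FIFO buffer over per-frame features. This is an illustrative reconstruction under our own naming (`MemoryBank`, `capacity`), not the benchmark's actual implementation.

```python
from collections import deque


class MemoryBank:
    """Fixed-capacity FIFO store of per-frame features.

    Normalizes the historical visual context every model may access:
    once `capacity` frames are stored, the oldest frame is evicted.
    """

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def add(self, frame_feature):
        self.frames.append(frame_feature)  # evicts oldest when full

    def context(self):
        return list(self.frames)  # oldest -> newest


bank = MemoryBank(capacity=3)
for t in range(5):
    bank.add(f"feat_{t}")
# Only the 3 most recent frame features remain accessible.
print(bank.context())  # ['feat_2', 'feat_3', 'feat_4']
```

Capping the buffer length keeps memory usage constant regardless of stream length, which is what makes minute-scale streaming evaluation tractable under a fixed storage budget.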
[CV-109] Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation
[Quick Read]: This paper addresses weak trajectory perception in video reasoning segmentation, which stems from reliance on unidirectional, implicit text-trajectory alignment and degrades under severe video dynamics. The key to the solution is TrajSeg, a unified MLLM-based framework that introduces bidirectional text-trajectory alignment: the MLLM handles both grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions, strengthening text-trajectory correspondence and trajectory perception. A frame-level content integration (FCI) module adapts the trajectory-level token into frame-specific information, and a unified mask decoder folds segmentation for all frames into a single structure, yielding a simplified, end-to-end trainable video segmentation pipeline.
Link: https://arxiv.org/abs/2603.21488
Authors: Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng
Affiliations: Southern University of Science and Technology; Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at this https URL.
[CV-110] Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models CVPR2026
[Quick Read]: This paper addresses continual unlearning: enabling large vision-language models to precisely refuse specific image-instruction pairs under sequential deletion requests while preserving general utility. Existing methods distort shared representations through sequential updates, creating spurious associations between vision-language pairs and refusal behaviors that lead to inappropriate refusals. The key to the solution is a continual unlearning framework grounded in fine-grained concept decomposition: a concept modulator first identifies which visual-linguistic concept combinations characterize each forget category, and a mixture of dedicated refusal experts (refusers) then generates semantically aligned refusal responses for different concept combinations. A multimodal, concept-driven routing scheme further reuses refusers for tasks sharing similar concepts and adapts underutilized ones, producing concept-specific refusal responses across unlearning sequences, which markedly improves refusal precision while maintaining general capability.
Link: https://arxiv.org/abs/2603.21484
Authors: Hyundong Jin, Dongyoon Han, Eunwoo Kim
Affiliations: Chung-Ang University; NAVER AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.
[CV-111] ALADIN:Attribute-Language Distillation Network for Person Re-Identification
[Quick Read]: This paper addresses the reliance of current CLIP-based person re-identification (ReID) methods on global features and fixed prompts, which limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. The key to the solution is ALADIN (Attribute-Language Distillation Network), whose core innovations are: 1) fine-grained attribute-local alignment that establishes adaptive text-visual correspondence; 2) a Scene-Aware Prompt Generator that produces image-specific soft prompts for more flexible alignment; 3) attribute-local distillation that enforces consistency between textual attributes and local visual features, improving robustness under occlusion; and 4) cross-modal contrastive and relation distillation that preserves the structural relationships among attributes. The method uses multimodal LLMs to generate structured attribute descriptions and converts them into localized attention maps via CLIP for precise supervision; at inference, only a lightweight student network is used, yielding strong performance, generalization, and interpretability.
Link: https://arxiv.org/abs/2603.21482
Authors: Wang Zhou, Boran Duan, Haojun Ai, Ruiqi Lan, Ziyue Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 3 figures, 7 charts
Abstract:Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.
[CV-112] EpiMask: Leveraging Epipolar Distance Based Masks in Cross-Attention for Satellite Image Matching
[Quick Read]: This paper addresses the performance drop of deep-learning image-matching networks on satellite imagery: existing networks are trained on ground-based datasets and implicitly optimized for pinhole camera geometry, whereas satellite images are synthesized line by line as a moving satellite records ground points, so their imaging geometry diverges markedly from the pinhole model. The key to the solution is EpiMask, a semi-dense matching network designed for satellite images, whose core innovations are: (1) patch-wise affine approximations that model satellite camera geometry more accurately; (2) an epipolar distance-based attention mask that restricts cross-attention to geometrically plausible regions; and (3) fine-tuning a pre-trained foundational image encoder for robust feature extraction. Experiments show that EpiMask improves matching accuracy on the SatDepth dataset by up to 30% over re-trained ground-based models.
Link: https://arxiv.org/abs/2603.21463
Authors: Rahul Deshmukh, Aditya Chauhan, Avinash Kak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The deep-learning based image matching networks can now handle significantly larger variations in viewpoints and illuminations while providing matched pairs of pixels with sub-pixel precision. These networks have been trained with ground-based image datasets and, implicitly, their performance is optimized for the pinhole camera geometry. Consequently, you get suboptimal performance when such networks are used to match satellite images since those images are synthesized as a moving satellite camera records one line at a time of the points on the ground. In this paper, we present EpiMask, a semi-dense image matching network for satellite images that (1) Incorporates patch-wise affine approximations to the camera modeling geometry; (2) Uses an epipolar distance-based attention mask to restrict cross-attention to geometrically plausible regions; and (3) That fine-tunes a foundational pretrained image encoder for robust feature extraction. Experiments on the SatDepth dataset demonstrate up to 30% improvement in matching accuracy compared to re-trained ground-based models.
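The epipolar distance-based masking idea can be sketched independently of the network: given a fundamental matrix F relating two views, attention from a query point is restricted to candidate points lying near its epipolar line. This is a generic geometric sketch, not the paper's code; the pixel threshold and function name are our own choices.

```python
import numpy as np


def epipolar_attention_mask(pts1, pts2, F, threshold=3.0):
    """Boolean mask of shape (len(pts1), len(pts2)).

    Entry (i, j) is True when point j in image 2 lies within
    `threshold` pixels of the epipolar line F @ [x1, y1, 1]
    of point i in image 1, i.e. the pair is geometrically plausible.
    """
    ones1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords
    ones2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines = ones1 @ F.T                       # epipolar lines (a, b, c), one per query
    # Point-to-line distance |a*x + b*y + c| / sqrt(a^2 + b^2)
    num = np.abs(lines @ ones2.T)             # (N1, N2)
    denom = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return num / denom < threshold
```

In a cross-attention layer, such a boolean mask would typically be applied to the attention logits (e.g., setting masked entries to a large negative value before the softmax) so that attention weight flows only to geometrically plausible regions.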
[CV-113] PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences
[Quick Read]: This paper addresses the stability-adaptation dilemma in online monocular 3D reconstruction: the model must rapidly incorporate novel viewpoints from streaming video while preserving previously accumulated scene structure. Existing methods rely on uniform or attention-based updates and struggle with abrupt viewpoint changes, causing trajectory drift and geometric inconsistency. The key to the solution is the Pose-Adaptive Streaming Reconstruction (PAS3R) framework, whose core innovation is a motion-aware update mechanism that jointly uses inter-frame pose variation and image frequency cues to estimate per-frame importance: frames contributing significant geometric novelty receive stronger update weight, while frames with minor viewpoint variation prioritize preserving historical context. Trajectory-consistency training objectives (relative pose constraints and acceleration regularization) and a lightweight online stabilization module further suppress trajectory jitter and geometric artifacts over long sequences, substantially improving long-video reconstruction quality and stability while remaining competitive on short sequences.
Link: https://arxiv.org/abs/2603.21436
Authors: Lanbo Xu, Liang Guo, Caigui Jiang, Cheng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.
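The motion-aware update principle (large pose change implies strong state update) can be sketched as a scalar weight over inter-frame motion and a frequency cue. The weights, the tanh squashing, and the function name below are our own illustrative choices, not PAS3R's actual formulation.

```python
import math


def frame_importance(rel_translation, rel_rotation_rad, freq_score,
                     w_t=1.0, w_r=2.0, w_f=0.5):
    """Map inter-frame motion and an image-frequency cue to [0, 1).

    Large pose change or rich high-frequency content -> weight near 1
    (the frame carries geometric novelty and should update the state
    strongly); near-static frames -> weight near 0, so the update
    preserves historical context instead.
    """
    raw = w_t * rel_translation + w_r * rel_rotation_rad + w_f * freq_score
    return math.tanh(raw)  # [0, 1) for non-negative inputs


# A near-static frame vs. a frame after an abrupt viewpoint change:
static = frame_importance(0.01, 0.0, 0.1)
abrupt = frame_importance(0.5, 0.6, 0.8)
```

Such a weight could gate how strongly the new observation overwrites the reconstruction state, directly operationalizing the stability-adaptation trade-off described above.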
[CV-114] Image-Based Structural Analysis Using Computer Vision and LLM s: PhotoBeamSolver
[Quick Read]: This paper addresses the problem of automatically recognizing and analyzing idealized beam models from hand-drawn structural sketches, enabling automated solving of the structural problems common in textbooks and academic exercises. The key to the solution is combining computer vision and statistical learning techniques to detect and visually interpret structural members, together with a systematic discussion of the challenges, limitations, and prerequisites for reliably integrating computer vision into structural analysis.
Link: https://arxiv.org/abs/2603.21432
Authors: Altamirano-Muñiz Emilio Fernando
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:This paper presents the development of a documented program capable of solving idealized beam models, such as those commonly used in textbooks and academic exercises, from drawings made by a person. The system is based on computer vision and statistical learning techniques for the detection and visual interpretation of structural elements. Likewise, the main challenges and limitations associated with the integration of computer vision into structural analysis are analyzed, as well as the requirements necessary for its reliable application in the field of civil engineering. In this context, the implementation of the PhotoBeamSolver program is explored, and the current state of computer vision in civil engineering is discussed, particularly in relation to structural analysis, infrastructure inspection, and engineering decision-support systems.
[CV-115] Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models CVPR2026
[Quick Read]: This paper addresses how to adaptively balance data supervision and teacher guidance in knowledge distillation, especially when samples are noisy or the teacher's outputs are uncertain; conventional methods cannot adjust the two weights dynamically, limiting performance. The key to the solution is Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework derived from a Bayesian perspective: teacher guidance is modeled as a Gibbs prior over student activations, which yields a closed-form, uncertainty-aware weighting mechanism that adaptively modulates how much the student relies on the teacher according to sample and teacher confidence, enabling more robust and efficient distillation.
Link: https://arxiv.org/abs/2603.21426
Authors: Jingchen Sun, Shaobo Han, Deep Patel, Wataru Kohno, Can Jin, Changyou Chen
Affiliations: NEC Laboratories America, Inc.; University at Buffalo, SUNY; Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher–student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at this https URL.
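The principle of modulating teacher reliance by confidence can be illustrated with a generic entropy-based weight. Note this is our own simplified stand-in for illustration; Beta-KD's actual weighting is the closed-form rule derived from the Gibbs prior over student activations, not this entropy heuristic.

```python
import math


def teacher_confidence_weight(teacher_probs):
    """Weight in [0, 1] on the distillation term for one sample.

    Uses normalized entropy of the teacher's predictive distribution:
    a confident (low-entropy) teacher gets weight near 1, an uncertain
    teacher near 0, shifting the student toward the data loss.
    """
    k = len(teacher_probs)
    entropy = -sum(p * math.log(p) for p in teacher_probs if p > 0)
    return 1.0 - entropy / math.log(k)  # normalized by max entropy


def distill_loss(ce_loss, kd_loss, teacher_probs):
    """Per-sample blend of cross-entropy and distillation losses."""
    w = teacher_confidence_weight(teacher_probs)
    return (1.0 - w) * ce_loss + w * kd_loss


confident = teacher_confidence_weight([0.97, 0.01, 0.01, 0.01])
uniform = teacher_confidence_weight([0.25, 0.25, 0.25, 0.25])
```

A perfectly uniform teacher contributes nothing (weight 0, pure data loss), while a sharply peaked teacher dominates the objective; Beta-KD generalizes this idea to arbitrary distillation objectives and their combinations.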
[CV-116] Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER ICME2026
[Quick Read]: This paper addresses the privacy problem in video-based facial expression recognition (FER), where facial data inherently exposes identity; in open-set video settings identity labels are unavailable and conventional privacy-preserving methods break down. The key to the solution is a two-stage framework requiring no identity labels at any stage: the first stage trains an identity-suppression network using intra-video and inter-video knowledge priors derived from real-world videos, anonymizing identity while preserving expression-related cues; the second stage adds a denoising module that restores expressive information and recovers FER performance. The method also introduces a falsification-based validation scheme that uses recognition priors to rigorously evaluate privacy robustness without annotated identity labels.
Link: https://arxiv.org/abs/2603.21387
Authors: Feng Xu, Xun Li, Lars Petersson, Yulei Sui, David Ahmedt Aristizabal, Dadong Wang
Affiliations: UNSW Sydney; CSIRO's Data61
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2026, Accepted
Abstract:Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown, and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.
[CV-117] Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
[Quick Read]: This paper addresses two coupled issues in open-vocabulary panoptic segmentation: mask selection bias, where objectness heads trained on closed vocabularies suppress masks of unseen categories; and the limited region-level understanding of vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. The key to the solution is the OVRCOAT framework with two modules: CLIP-conditioned Objectness Adjustment (COAT), which updates foreground/background probabilities to preserve high-quality masks for unseen categories; and Open-Vocabulary Mask-to-Text Refinement (OVR), which strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes at markedly lower memory cost than prior fine-tuning schemes. The two modules jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains.
Link: https://arxiv.org/abs/2603.21386
Authors: Nikolay Kormushev, Josip Šarić, Matej Kristan
Affiliations: University of Ljubljana; ETH Zurich; University of Zagreb
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: this https URL
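The intuition of rescuing suppressed out-of-vocabulary masks with CLIP evidence can be sketched as a confidence blend; the geometric-mean form and the `alpha` parameter below are our own illustrative choices, not COAT's actual update rule.

```python
def adjusted_objectness(objectness, clip_sim, alpha=0.5):
    """Blend a closed-vocabulary objectness score with CLIP's best
    text-to-mask similarity (both in [0, 1]) via a geometric mean.

    A mask the objectness head suppresses (low score) can still be
    kept if CLIP finds strong open-vocabulary evidence for it.
    """
    return (objectness ** (1.0 - alpha)) * (clip_sim ** alpha)


# An unseen-category mask: suppressed by the closed-vocab head (0.1)
# but strongly matched by CLIP text embeddings (0.9) vs. weakly (0.05).
rescued = adjusted_objectness(0.1, 0.9)
rejected = adjusted_objectness(0.1, 0.05)
```

Any such blend trades off trust in the closed-vocabulary head against CLIP evidence; the paper's contribution is doing this in a way that preserves high-quality masks for categories never seen in training.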
[CV-118] An InSAR Phase Unwrapping Framework for Large-scale and Complex Events
[Quick Read]: This paper addresses phase unwrapping, a key difficulty in interferometric synthetic aperture radar (InSAR) processing, particularly under complex deformation patterns: shallow earthquake sources can produce surface-breaking faults and abrupt displacement discontinuities that severely disrupt phase continuity and defeat conventional unwrapping algorithms. Existing learning-based unwrappers are further limited to fixed, relatively small input sizes, ill-suited to large-scale, spatially heterogeneous real interferograms. The key to the solution is a diffusion-model-based unwrapping framework that recovers physically consistent unwrapped phase fields even in the presence of fault-related phase jumps, and that scales well to large InSAR images, offering a practical alternative for automatic unwrapping in challenging scenarios.
Link: https://arxiv.org/abs/2603.21378
Authors: Yijia Song, Juliet Biggs, Alin Achim, Robert Popescu, Simon Orrego, Nantheera Anantrasirichai
Affiliations: University of Bristol; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments:
Abstract:Phase unwrapping remains a critical and challenging problem in InSAR processing, particularly in scenarios involving complex deformation patterns. In earthquake-related deformation, shallow sources can generate surface-breaking faults and abrupt displacement discontinuities, which severely disrupt phase continuity and often cause conventional unwrapping algorithms to fail. Another limitation of existing learning-based unwrapping methods is their reliance on fixed and relatively small input sizes, while real InSAR interferograms are typically large-scale and spatially heterogeneous. This mismatch restricts the applicability of many neural network approaches to real-world data. In this work, we present a phase unwrapping framework based on a diffusion model, developed to process large-scale interferograms and to address phase discontinuities caused by deformation. By leveraging a diffusion model architecture, the proposed method can recover physically consistent unwrapped phase fields even in the presence of fault-related phase jumps. Experimental results on both synthetic and real datasets demonstrate that the method effectively addresses discontinuities associated with near-surface deformation and scales well to large InSAR images, offering a practical alternative to manual unwrapping in challenging scenarios.
[CV-119] HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis
[Quick Read]: This paper addresses weak generalization, physically meaningless architectural design, and poor multi-task adaptability in medical image analysis. The key to the solution is a structured inductive bias based on the damped harmonic oscillator: Hamiltonian dynamics naturally yield three functionally distinct representations, namely position q (feature content), momentum p (spatial gradients encoding boundary and texture information), and energy H = (1/2)|z|^2 (a parameter-free saliency map). These representations emerge from the dynamics without supervision and can be exploited by different task heads (for segmentation and classification) without modifying the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg); for classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls), enabling high-performance, interpretable analysis across imaging modalities.
Link: https://arxiv.org/abs/2603.21377
Authors: Mohamed A Mabrok
Affiliations: Qatar University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position q (feature content), momentum p (spatial gradients that encode boundary and texture information), and energy H = (1/2)|z|^2 (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior → boundary → exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at this https URL.
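The phase-space readout is simple enough to state directly: given a position map q and a momentum map p, the energy map is H = 0.5 * (q^2 + p^2), and the HamCls-style feature is the concatenation of globally pooled q, p, and H. The sketch below assumes NumPy arrays for the maps; the function names are ours.

```python
import numpy as np


def phase_space_maps(q, p):
    """Decompose the oscillator state into the three HamVision cues.

    q : feature-content map, p : spatial-gradient (momentum) map.
    Energy H = 0.5 * (q**2 + p**2) is a parameter-free saliency map,
    following H = (1/2)|z|^2 with z = (q, p).
    """
    energy = 0.5 * (q ** 2 + p ** 2)
    return q, p, energy


def phase_space_vector(q, p):
    """Globally average-pool q, p, H and concatenate (HamCls-style)."""
    q, p, energy = phase_space_maps(q, p)
    return np.array([q.mean(), p.mean(), energy.mean()])
```

Because the energy map needs no learned parameters, it can gate skip connections (the HamSeg use) or act as a saliency prior at no extra training cost.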
[CV-120] Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation
[Quick Read]: This paper addresses progressive temporal degradation when autoregressive (AR) video diffusion models generate minute-scale long videos, arguing that the core problem is not insufficient memory capacity but how existing methods utilize temporal memory. The key to the solution is Relax Forcing, a structured temporal memory mechanism that decomposes historical context into three functional roles: Sink (global stability), Tail (short-term continuity), and dynamically selected History (structural motion guidance), attending only to the most relevant past information. This design mitigates error accumulation during extrapolation while preserving consistent motion evolution; experiments show notable gains in motion dynamics and overall temporal consistency with reduced attention overhead.
Link: https://arxiv.org/abs/2603.21366
Authors: Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, Ioannis Patras
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: see this https URL
Abstract:Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
[CV-121] FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction CVPR2026
[Quick Read]: This paper addresses the fact that current methods for 3D reconstruction from multi-view 2D images optimize only for visual fidelity (photometric losses) and ignore physical plausibility, so reconstructions behave implausibly when objects interact: function-critical regions (e.g., aerodynamic surfaces) are conflated with ornamentation, and simply adding physical regularizers does not yield optimal reconstructions. The key to the solution is FluidGaussian, which tightly couples geometry reconstruction with ubiquitous fluid-structure interactions (FSI): fluid simulations induce a simulation-based uncertainty metric, which is combined with active learning to prioritize views that improve both visual quality and physical plausibility, assessing and optimizing surface quality at high granularity.
Link: https://arxiv.org/abs/2603.21356
Authors: Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte, Rochishnu Chowdhury, Rushil Paresh Desai, Wuyang Chen, Daniel Martin, Michael Mahoney
Affiliations: Simon Fraser University; Lawrence Berkeley National Lab; University of California, Berkeley; International Computer Science Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available at this https URL.
[CV-122] Respiratory Status Detection with Video Transformers
【速读】: This paper asks whether an Artificial Intelligence (AI) system can identify early signs of respiratory distress from video, enabling non-invasive monitoring of a patient's respiratory status and earlier intervention. The key to the solution is an augmented video-transformer architecture, specifically a ViViT encoder combined with Lie Relative Encodings (LieRE) and Motion Guided Masking, paired with an embedding-based comparison strategy that lets the model capture subtle changes in respiratory mechanics; it reaches an F1 score of 0.81 on a temporal-ordering task, showing that modern video transformers can recognize respiratory distress.
链接: https://arxiv.org/abs/2603.21349
作者: Thomas Savage,Evan Madill
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.
[CV-123] Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution
【速读】: This paper addresses the heavy computational overhead of the multi-step denoising process when deploying diffusion models (DMs) on resource-constrained edge devices. Existing methods mitigate this via model compression and time-step sequence adjustment, but they overlook input redundancy and require lengthy search times. The key to the solution is a two-stage denoising scheme, Coarse-to-Fine Denoising (C2F), together with Time Step Sequence Redistribution (TRD): C2F exploits the indistinguishability of early-stage generated images to cut computation during coarse feature generation, while TRD adjusts the sampling trajectory efficiently with a search time under 10 minutes, achieving an 80% to 90% reduction in computation with near-lossless performance.
链接: https://arxiv.org/abs/2603.21348
作者: Yu-Shan Tai,An-Yeu(Andy)Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.
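The abstract reports the effect of TRD (a redistributed sampling trajectory found in under 10 minutes) but not the search procedure itself. As a rough, hypothetical illustration of what a redistributed schedule can look like, the sketch below warps a uniform sequence with a power law so that large strides are taken through the coarse early phase and small ones near the end; the `gamma` warp and step count are assumptions, not the paper's method:

```python
import numpy as np

def redistribute_timesteps(T=1000, n_steps=10, gamma=2.0):
    """Descending timestep schedule: gamma > 1 takes large strides through
    the early coarse phase (high t) and small ones in the fine phase (low t)."""
    u = np.linspace(0.0, 1.0, n_steps)
    return np.rint((T - 1) * (1.0 - u) ** gamma).astype(int)

ts = redistribute_timesteps()
# the gaps between consecutive timesteps shrink as denoising approaches t = 0
print(ts)
```

With `gamma=1` this degenerates to the usual uniform sub-sampling; a learned or searched schedule like TRD would replace the fixed warp.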
[CV-124] EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization CVPR2026
【速读】: This paper addresses the geometric instability and audio-emotion mismatch that existing few-shot 3D talking-head synthesis methods exhibit under expressive facial motion. The key to the solution is the EmoTaG framework, which reformulates motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, introducing explicit geometric priors that improve motion stability; a Gated Residual Motion Network (GRMN) captures emotional prosody from audio and supplements head-pose and upper-face cues absent from audio, enabling more expressive and coherent motion generation.
链接: https://arxiv.org/abs/2603.21332
作者: Haolan Xu,Keli Cheng,Lei Wang,Ning Bi,Xiaoming Liu
机构: Michigan State University (密歇根州立大学); Qualcomm Technologies Inc. (高通技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Page: this https URL
Abstract:Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
[CV-125] KHMP: Frequency-Domain Kalman Refinement for High-Fidelity Human Motion Prediction
【速读】: This paper tackles the high-frequency jitter and temporal discontinuities common in generative human motion prediction, which noticeably degrade the visual quality and physical plausibility of predicted trajectories. The key to the KHMP framework is an adaptive Kalman filter applied in the discrete cosine transform (DCT) domain: high-frequency DCT coefficients are treated as a frequency-indexed noisy signal, and noise is recursively suppressed while motion details are preserved; the filter's noise parameters adapt to the estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. Training-time physical constraints (temporal smoothness and joint-angle limits) further encode biomechanical principles into the model, yielding high-fidelity, smooth, and physically plausible predictions on Human3.6M and HumanEva-I.
链接: https://arxiv.org/abs/2603.21327
作者: Wenhan Wu,Zhishuai Guo,Chen Chen,Srijan Das,Hongfei Xue,Pu Wang,Aidong Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Stochastic human motion prediction aims to generate diverse, plausible futures from observed sequences. Despite advances in generative modeling, existing methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities. To address these challenges, we introduce KHMP, a novel framework featuring an adaptive Kalman filter applied in the DCT domain to generate high-fidelity human motion predictions. By treating high-frequency DCT coefficients as a frequency-indexed noisy signal, the Kalman filter recursively suppresses noise while preserving motion details. Notably, its noise parameters are dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This refinement is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles into the generative model. Together, these innovations establish a new paradigm integrating adaptive signal processing with physics-informed learning. Experiments on the Human3.6M and HumanEva-I datasets demonstrate that KHMP achieves state-of-the-art accuracy, effectively mitigating jitter artifacts to produce smooth and physically plausible motions.
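The frequency-domain refinement can be pictured with a loose sketch: build an orthonormal DCT basis, estimate a per-coefficient SNR from the high-frequency tail, and shrink the tail coefficients. The Wiener-style gain `snr/(1+snr)` below stands in for the paper's recursive Kalman update, and the cutoff and SNR floor are hypothetical:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis (rows are frequencies)
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    C[0] *= 1.0 / np.sqrt(2.0)
    return C

def refine_trajectory(traj, cutoff=0.25, snr_floor=0.05):
    """Attenuate high-frequency DCT coefficients of a 1D joint trajectory.
    snr is estimated from coefficient energy; the tail mean serves as the
    noise level. This is a Wiener-style stand-in, not the paper's filter."""
    traj = np.asarray(traj, dtype=float)
    n = len(traj)
    C = dct_matrix(n)
    coeffs = C @ traj
    k0 = int(cutoff * n)
    power = coeffs ** 2
    noise = power[k0:].mean() + 1e-12            # noise level from the tail
    snr = np.maximum(power / noise, snr_floor)
    gain = np.where(np.arange(n) < k0, 1.0, snr / (1.0 + snr))
    return C.T @ (gain * coeffs)                 # inverse orthonormal DCT
```

Low frequencies (the gross motion) pass unchanged; jittery tail coefficients with power near the noise floor are roughly halved.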
[CV-126] Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos
【速读】: This paper addresses performance degradation in video facial expression recognition (FER) caused by inter-subject distribution shifts, which affects even vision-language models (VLMs). Existing test-time adaptation (TTA) methods mostly rely on unsupervised parameter optimization, whose computational overhead hinders real-world deployment. The key to the solution is TTA-CaP (TTA through Cache Personalization), a low-cost (gradient-free) cache-based personalization method with three coordinated caches: a personalized source cache (source-domain prototypes), a positive target cache (reliable, high-confidence subject-specific samples), and a negative target cache (low-confidence samples that curb the impact of noisy pseudo-labels). Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache, and predictions are refined through embedding fusion for temporally stable video-level outputs, improving robustness to subject and environmental shifts at very low computational and memory cost.
链接: https://arxiv.org/abs/2603.21309
作者: Masoumeh Sharafi,Muhammad Osama Zeeshan,Soufiane Belharbi,Alessandro Lameiras Koerich,Marco Pedersoli,Eric Granger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
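The gated cache update can be pictured with a minimal sketch. Only two of the three gates (confidence and consistency with a personalized prototype) are shown, the temporal-stability gate is omitted, and all thresholds and cache sizes are illustrative rather than taken from the paper:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def gated_update(pos_cache, neg_cache, emb, conf, proto,
                 conf_thr=0.8, sim_thr=0.5, max_size=8):
    """Route a test embedding into the positive or negative cache.
    High-confidence embeddings consistent with the prototype become
    positive exemplars; the rest are kept as negatives."""
    if conf >= conf_thr and cosine(emb, proto) >= sim_thr:
        pos_cache.append(emb)
        if len(pos_cache) > max_size:      # evict the oldest entry
            pos_cache.pop(0)
    else:
        neg_cache.append(emb)
        if len(neg_cache) > max_size:
            neg_cache.pop(0)
    return pos_cache, neg_cache
```

A real implementation would also check agreement with recent frames (the temporal-stability gate) before admitting a sample.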
[CV-127] Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication
【速读】: This paper addresses two key issues in federated video action recognition: model exposure, where gradients exchanged between clients and the server can leak private motion patterns, and communication overhead, where full-model synchronization of high-dimensional video networks consumes substantial bandwidth. The key to the FedDP-STECAR framework is to selectively fine-tune and perturb only task-relevant layers under Differential Privacy (DP), shrinking the leakage surface while preserving the temporal coherence of video features, and to transmit only the tuned layers during aggregation, cutting communication traffic by over 99% and enabling efficient, scalable, privacy-preserving video action recognition.
链接: https://arxiv.org/abs/2603.21305
作者: Idris Zakariyya,Pai Chet Ng,Kaushik Bhargav Sivangi,S. Mohammad Sheikholeslami,Konstantinos N. Plataniotis,Fani Deligianni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy (ε = 0.65) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at this https URL
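Transmitting only the tuned layers reduces traffic to the fraction of parameters those layers hold. A minimal sketch, with hypothetical layer names and a simplified DP-style clip-and-noise step applied directly to weights (a real system would use a calibrated mechanism such as DP-SGD on gradients):

```python
import numpy as np

def dp_selective_update(state, tuned_keys, clip=1.0, sigma=0.5, rng=None):
    """Keep only the tuned layers, clip each tensor's L2 norm, and add
    Gaussian noise. Layer names and the noise scale are illustrative."""
    if rng is None:
        rng = np.random.default_rng(0)
    update = {}
    for k in tuned_keys:
        w = state[k]
        norm = np.linalg.norm(w)
        w = w * min(1.0, clip / (norm + 1e-12))      # clip to L2 <= clip
        update[k] = w + rng.normal(0.0, sigma * clip, size=w.shape)
    return update

# hypothetical model: a large frozen backbone plus a small tuned head
state = {"backbone.block1.w": np.ones((64, 64)),
         "head.w": np.ones((8, 16)), "head.b": np.zeros(8)}
update = dp_selective_update(state, tuned_keys=["head.w", "head.b"])
sent = sum(v.size for v in update.values())
total = sum(v.size for v in state.values())
print(f"transmitted {sent / total:.1%} of parameters")
```

Only the perturbed head tensors leave the client; the backbone never does.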
[CV-128] F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
【速读】: This paper addresses two core problems in feed-forward 3D Gaussian Splatting: rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines allocate Gaussians uniformly and produce cross-view redundancy, and there is no effective mechanism to control the final number of Gaussians while maintaining reconstruction fidelity. The key to the F4Splat solution is a densification-score-guided adaptive allocation strategy: predicted per-region densification scores distribute Gaussians according to spatial complexity and multi-view overlap, allow explicit control of the final Gaussian budget without retraining, and reduce redundancy in simple regions and duplicate Gaussians across overlapping views, yielding compact yet high-quality 3D representations.
链接: https://arxiv.org/abs/2603.21304
作者: Injae Kim,Chaehyeon Kim,Minseong Bae,Minseok Joo,Hyunwoo J. Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: \href
Abstract:Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.
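Budget-controlled allocation from predicted densification scores can be sketched as proportional rounding under a fixed total. The largest-remainder rounding below is an assumption for illustration; the paper does not specify how scores are turned into counts:

```python
import numpy as np

def allocate_gaussians(scores, budget):
    """Distribute a fixed Gaussian budget across regions in proportion to
    predicted densification scores, fixing rounding so the total is exact."""
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    alloc = np.floor(p * budget).astype(int)
    # hand out the remainder to regions with the largest fractional part
    rem = budget - alloc.sum()
    order = np.argsort(-(p * budget - alloc))
    alloc[order[:rem]] += 1
    return alloc
```

Changing `budget` re-scales the allocation without touching the score predictor, which matches the "explicit control without retraining" claim.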
[CV-129] Identity-Consistent Video Generation under Large Facial-Angle Variations
【速读】: This paper addresses the difficulty single-view reference-to-video methods have in preserving identity under large facial-angle variations, while avoiding the copy-paste artifacts (especially view-dependent copy-paste) that worsen when multi-view reference images are naively added and that hurt facial motion naturalness. The key to the solution is Mv2ID, a multi-view conditioned framework that works without cross-paired supervision: 1) a region-masking training strategy prevents shortcut learning and encourages the model to aggregate complementary identity cues across views to extract essential identity information; 2) a reference decoupled-RoPE (Rotary Position Embedding) mechanism assigns distinct positional encodings to video frames and conditioning tokens to better model their heterogeneity; 3) a large-scale dataset with diverse facial-angle variations and dedicated metrics quantify identity consistency and motion naturalness. Experiments show the method substantially improves identity consistency while maintaining motion naturalness, outperforming methods trained with cross-paired data.
链接: https://arxiv.org/abs/2603.21299
作者: Bin Hu,Zipeng Qi,Guoxi Huang,Zunnan Xu,Ruicheng Zhang,Chongjie Ye,Jun Zhou,Xiu Li,Jingdong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance consistency and naturalness, we propose Mv^2ID, a multi-view conditioned framework under unpaired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
[CV-130] Text-Image Conditioned 3D Generation CVPR2026
【速读】: This paper addresses the limitations of single-modality conditioning in current 3D generative models: image-conditioned models achieve high visual fidelity but are prone to viewpoint bias, while text-conditioned models provide broad semantic guidance but lack low-level visual detail. To make generated 3D content more flexible and faithful, the authors jointly model the text and image modalities for comprehensive cross-modal reasoning. The key to the solution is TIGON, a lightweight dual-branch architecture with separate image- and text-conditioned backbones combined through a concise cross-modal fusion mechanism that integrates their complementary information; experiments show it clearly outperforms single-modality baselines.
链接: https://arxiv.org/abs/2603.21295
作者: Jiazhong Cen,Jiemin Fang,Sikuang Li,Guanjun Wu,Chen Yang,Taoran Yi,Zanwei Zhou,Zhikuan Bao,Lingxi Xie,Wei Shen,Qi Tian
机构: Shanghai Jiao Tong University; Huawei Inc.; Huazhong University of Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project page: this https URL Code: this https URL
Abstract:High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: this https URL
[CV-131] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
【速读】: This paper addresses the reliance of multimodal large language models on high-quality annotated data or teacher-model distillation for reasoning gains, both of which are costly and hard to scale. The key to the solution is an unsupervised self-evolution training framework: multiple reasoning trajectories are sampled per input and their within-group consistency is modeled as a training prior, a bounded Judge dynamically reweights trajectories of different quality, and the modulated scores are modeled as a group-level distribution and converted into within-group relative advantages for more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, the method consistently improves performance and generalization on mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal reasoning.
链接: https://arxiv.org/abs/2603.21289
作者: Zhengxian Wu,Kai Shi,Chuanrui Zhang,Zirui Liao,Jun Yang,Ni Yang,Qiuying Peng,Luyuan Zhang,Hangrui Xu,Tianhuang Su,Zhenyu Yang,Haonan Lu,Haoqian Wang
机构: OPPO AI Center; Tsinghua University; Nanyang Technological University; Hefei University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures
Abstract:Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group consistency. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal reasoning. Our code is available at this https URL.
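The conversion of absolute scores into within-group relative advantages is the standard GRPO normalization and fits in a few lines:

```python
import numpy as np

def group_relative_advantages(scores, eps=1e-8):
    """Convert per-trajectory scores into within-group relative advantages,
    as in GRPO: standardize against the group's mean and std."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + eps)
```

Trajectories scored above the group mean receive positive advantage (and are reinforced), those below receive negative advantage, so only relative quality within each sampled group matters.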
[CV-132] Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting CVPR26
【速读】: This paper addresses the performance bottlenecks of conventional few-shot medical image segmentation (FSMIS), in particular the over-segmentation that Segment Anything Model (SAM)-based methods suffer on medical images with ambiguous anatomical boundaries. The key to the solution is to reformulate SAM-based FSMIS as a prompt-localization task and propose FoB (Focus on Background), a background-centric prompt generator that produces support background prompts in a category-agnostic way and localizes them directly in the query image, effectively constraining SAM's over-segmentation. FoB further models rich contextual information to capture foreground-background spatial dependencies, and exploits the inherent structure of background prompts in medical images as a constraint to progressively refine background-prompt predictions, markedly improving segmentation accuracy and cross-domain generalization.
链接: https://arxiv.org/abs/2603.21287
作者: Yuntian Bo,Yazhou Zhu,Piotr Koniusz,Haofeng Zhang
机构: Nanjing University of Science and Technology (南京理工大学); University of New South Wales (新南威尔士大学); Data61 CSIRO (数据61 CSIRO)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR26
Abstract:Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM’s over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at this https URL.
[CV-133] Sonny: Breaking the Compute Wall in Medium-Range Weather Forecasting
【速读】: This paper addresses the problem that current deep-learning weather forecasting models generally depend on large-scale training and heavy compute, limiting their practicality for academic research groups. The key to the solution is Sonny, an efficient hierarchical transformer with a two-stage StepsNet design: a narrow slow path first models large-scale atmospheric dynamics, and a full-width fast path then integrates thermodynamic interactions; an exponential moving average (EMA) applied during training stabilizes medium-range rollout without an additional fine-tuning stage. On the WeatherBench2 benchmark the method achieves medium-range forecast skill competitive with operational baselines and converges in roughly 5.5 days on a single NVIDIA A40 GPU, substantially lowering the compute barrier.
链接: https://arxiv.org/abs/2603.21284
作者: Minjong Cheon
机构: Sejong University (世宗大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Weather forecasting is a fundamental problem for protecting lives and infrastructure from high-impact atmospheric events. Recently, data-driven weather forecasting methods based on deep learning have demonstrated strong performance, often reaching accuracy levels competitive with operational numerical systems. However, many existing models rely on large-scale training regimes and compute-intensive architectures, which raises the practical barrier for academic groups with limited compute resources. Here we introduce Sonny, an efficient hierarchical transformer that achieves competitive medium-range forecasting performance while remaining feasible within reasonable compute budgets. At the core of Sonny is a two-stage StepsNet design: a narrow slow path first models large-scale atmospheric dynamics, and a subsequent full-width fast path integrates thermodynamic interactions. To stabilize medium-range rollout without an additional fine-tuning stage, we apply exponential moving average (EMA) during training. On WeatherBench2, Sonny yields robust medium-range forecast skill, remains competitive with operational baselines, and demonstrates clear advantages over FastNet, particularly at extended tropical lead times. In practice, Sonny can be trained to convergence on a single NVIDIA A40 GPU in approximately 5.5 days.
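The EMA used to stabilize rollout is the usual exponential moving average over model weights: the averaged copy, not the raw weights, is evaluated. A minimal sketch over a dict of scalar parameters, with an illustrative decay:

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step over a dict of weights. The decay value is
    illustrative; production values are typically 0.999-0.9999."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

Called once per optimizer step, this smooths out the high-frequency oscillations of SGD-style training, which is what helps autoregressive rollout stay stable without a separate fine-tuning stage.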
[CV-134] CornOrb: A Multimodal Dataset of Orbscan Corneal Topography and Clinical Annotations for Keratoconus Detection
【速读】: This paper addresses the lack of large-scale, standardized multimodal corneal topography (Orbscan) datasets from Africa to support AI-driven detection and analysis of corneal conditions such as keratoconus. The key to the solution is constructing and publicly releasing the CornOrb dataset, covering 1,454 eyes from 744 Algerian patients with multimodal information including four corneal maps (axial curvature, anterior elevation, posterior elevation, and pachymetry) and structured tabular data (demographics, astigmatism, maximum keratometry Kmax, central and thinnest pachymetry, etc.); all data are anonymized and standardized into PNG and CSV formats, making them directly usable for AI research and filling the gap in high-quality Orbscan-based imaging resources from Africa.
链接: https://arxiv.org/abs/2603.21245
作者: Mohammed El Amine Lazouni,Leila Ryma Lazouni,Zineb Aziza Elaouaber,Mohammed Ammar,Sofiane Zehar,Mohammed Youcef Bouayad Agha,Ahmed Lazouni,Amel Feroui,Ali H. Al-Timemy,Siamak Yousefi,Mostafa El Habib Daho
机构: Abou Bakr Belkaid University (阿布巴克尔贝勒卡德大学); M’Hamed Bougara Boumerdes University (穆罕默德布加拉布梅尔德斯大学); Lazouni Clinic (拉祖尼诊所); University of Baghdad (巴格达大学); Bascom Palmer Eye Institute (巴斯科姆帕尔默眼科研究所); University of Miami (迈阿密大学); University of Western Brittany (西方布列塔尼大学); LaTIM UMR1101, Inserm (LaTIM UMR1101,法国国家健康与医学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: Preprint, 9 pages, 4 figures, dataset paper. Corresponding author: this http URL @univ this http URL
Abstract:In this paper, we present CornOrb, a publicly accessible multimodal dataset of Orbscan corneal topography images and clinical annotations collected from patients in Algeria. The dataset comprises 1,454 eyes from 744 patients, including 889 normal eyes and 565 keratoconus cases. For each eye, four corneal maps are provided (axial curvature, anterior elevation, posterior elevation, and pachymetry), together with structured tabular data including demographic information and key clinical parameters such as astigmatism, maximum keratometry (Kmax), central and thinnest pachymetry, and anterior/posterior asphericity. All data were retrospectively acquired, fully anonymized, and pre-processed into standardized PNG and CSV formats to ensure direct usability for artificial intelligence research. This dataset represents one of the first large-scale Orbscan-based resources from Africa, specifically built to enable robust AI-driven detection and analysis of keratoconus using multimodal data. The data are openly available at Zenodo.
[CV-135] Enhancing Brain Tumor Classification Using Vision Transformers with Colormap-Based Feature Representation on BRISC2025 Dataset
【速读】: This paper addresses accurate classification of brain tumors in magnetic resonance imaging (MRI) to support early diagnosis and treatment planning, where the core challenge is extracting discriminative features from complex MRI scans for high-accuracy multi-class classification. The key to the solution is a Vision Transformer (ViT)-based deep-learning framework enhanced with a colormap-based feature representation that emphasizes important structures and intensity variations, improving recognition across the brain-tumor classes (glioma, meningioma, pituitary tumor, and non-tumor cases). On the BRISC2025 dataset the method reaches 98.90% classification accuracy and a 99.97% AUC, clearly outperforming conventional convolutional neural network baselines and showing strong generalization and potential for clinical use.
链接: https://arxiv.org/abs/2603.21234
作者: Faisal Ahmed
机构: Embry-Riddle Aeronautical University (埃姆布里-里德航空大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures
Abstract:Accurate classification of brain tumors from magnetic resonance imaging (MRI) plays a critical role in early diagnosis and effective treatment planning. In this study, we propose a deep learning framework based on Vision Transformers (ViT) enhanced with colormap-based feature representation to improve multi-class brain tumor classification performance. The proposed approach leverages the ability of transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize important structural and intensity variations within MRI scans. Experiments are conducted on the BRISC2025 dataset, which includes four classes: glioma, meningioma, pituitary tumor, and non-tumor cases. The model is trained and evaluated using standard performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed method achieves a classification accuracy of 98.90%, outperforming baseline convolutional neural network models including ResNet50, ResNet101, and EfficientNetB2. In addition, the model demonstrates strong generalization capability with an AUC of 99.97%, indicating high discriminative performance across all classes. These results highlight the effectiveness of combining Vision Transformers with colormap-based feature enhancement for accurate and robust brain tumor classification and suggest strong potential for clinical decision support applications.
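Colormap-based feature representation amounts to mapping a single-channel MRI slice to three channels through a color ramp, so the 3-channel ViT input carries intensity information redundantly across channels. A sketch with a hand-rolled jet-like ramp (the paper does not state which colormap it uses, so the stops here are illustrative):

```python
import numpy as np

def apply_colormap(gray, stops=((0, 0, 0.5), (0, 1, 1), (1, 1, 0), (0.5, 0, 0))):
    """Map a normalized grayscale image to 3 channels by piecewise-linear
    interpolation between RGB color stops placed evenly on [0, 1]."""
    g = np.clip(np.asarray(gray, dtype=float), 0.0, 1.0)
    stops = np.asarray(stops, dtype=float)              # (S, 3) RGB stops
    xs = np.linspace(0.0, 1.0, len(stops))
    rgb = np.stack([np.interp(g, xs, stops[:, c]) for c in range(3)], axis=-1)
    return rgb
```

In practice one would normalize the MRI slice (e.g., to its intensity percentiles) before mapping, and could swap in any standard colormap.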
[CV-136] DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture
【速读】: This paper addresses the difficulty of jointly achieving low bit-rate and high fidelity in depth-map compression: conventional approaches either lose substantial information or compress inefficiently, which matters in applications that must preserve geometric accuracy (e.g., 3D reconstruction and robot navigation). The key to the solution is DepthTCM, a physics-aware end-to-end compression framework: a multiwavelength depth (MWD) encoding first losslessly maps the high-bit depth map to a smooth 3-channel image representation; the MWD representation is then globally quantized to 4 bits to significantly reduce entropy; finally a learned encoder-decoder combining convolutional neural networks (CNNs) and Transformers performs efficient compression. The design preserves 99.38% of the original accuracy at a high compression ratio (0.307 bpp) while offering practical inference efficiency and scalability.
链接: https://arxiv.org/abs/2603.21233
作者: Young-Seo Chang,Yatong An,Jae-Sang Hyun
机构: Yonsei University (延世大学); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In our framework, the high-bit depth map is first converted losslessly to a conventional 3-channel image representation using a method inspired by physical sinusoidal fringe-pattern-based profilometry, and the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel representation using multiwavelength depth (MWD) encoding, then globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experiment results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
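The MWD idea is borrowed from sinusoidal fringe-pattern profilometry: depth is wrapped into several smooth phase channels before quantization. The sketch below shows only the forward encoding and the 4-bit quantization; decoding requires multiwavelength phase unwrapping, which is omitted, and the wavelengths are illustrative assumptions:

```python
import numpy as np

def mwd_encode(depth, wavelengths=(16.0, 64.0, 256.0)):
    """Encode depth (arbitrary units) as three wrapped sinusoidal 'fringe'
    channels, one per wavelength. Each channel is smooth and bounded,
    which keeps its entropy low after coarse quantization."""
    d = np.asarray(depth, dtype=float)
    chans = [0.5 + 0.5 * np.cos(2.0 * np.pi * d / w) for w in wavelengths]
    return np.stack(chans, axis=-1)          # values in [0, 1]

def quantize(img, bits=4):
    """Uniform scalar quantization to 2**bits levels per channel."""
    levels = (1 << bits) - 1
    return np.round(img * levels) / levels
```

The longest wavelength disambiguates which period the shorter, more precise wavelengths are in, which is why multiple channels can jointly represent a high-bit depth range with only 4 bits each.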
[CV-137] QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
【速读】: This paper addresses the severe computational and memory bottleneck in multimodal large language models (MLLMs) caused by visual tokens far outnumbering text tokens. Existing methods typically align and compress visual tokens with fixed heuristics that cannot adapt across scenarios. The key to the solution is the Query Guided Mixture-of-Projector (QMoP) framework, which compresses adaptively through three collaborative branches: a pooling-based branch for coarse-grained global semantics, a resampler branch for high-level semantic representations, and a pruning-based branch that preserves critical details; a Query Guided Router (QGR) dynamically selects and weights the branch outputs based on the visual input and the text query, and a Mixture-of-Experts-style fusion preserves information while suppressing noise, substantially reducing resource consumption while improving performance.
链接: https://arxiv.org/abs/2603.21232
作者: Zhongyang Li,Yaqian Li,Faming Fang,Rinyoichi Takezoe,Zi-Hao Bo,Cheng Qian,Mo Guang,Guixu Zhang,Kaiwen Long
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
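The Query Guided Router can be pictured as a softmax gate over the three branch outputs, conditioned on visual and text features. In the sketch, `W` is a hypothetical learned gating matrix, and each branch is assumed to emit tokens of the same shape so the weighted fusion is well defined (the actual architecture may fuse differently):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_and_fuse(branch_outputs, vis_feat, txt_feat, W):
    """Gate several compression branches with query-guided weights:
    scores come from a linear map of the concatenated visual/text features,
    then the branch token sets are mixed by the softmax weights."""
    gate_in = np.concatenate([vis_feat, txt_feat])
    weights = softmax(W @ gate_in)            # one weight per branch
    fused = sum(w * b for w, b in zip(weights, branch_outputs))
    return fused, weights

rng = np.random.default_rng(0)
branches = [rng.standard_normal((16, 32)) for _ in range(3)]  # 16 tokens, dim 32
vis, txt = rng.standard_normal(8), rng.standard_normal(8)
W = rng.standard_normal((3, 16))
fused, weights = route_and_fuse(branches, vis, txt, W)
```

Because the gate sees the text query, the same image can be compressed differently depending on what is being asked about it.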
[CV-138] Plant Taxonomy Meets Plant Counting: A Fine-Grained Taxonomic Dataset for Counting Hundreds of Plant Species CVPR2026
【速读】:该论文旨在解决植物计数中缺乏细粒度分类意识与多尺度标注的问题,即在自然场景下对植物进行高精度、可解释的计数任务仍处于探索阶段,尤其在不同生长阶段和环境条件下植物表现出非刚性形态变化,使得传统基于人群或交通分析的方法难以直接适用。其解决方案的关键在于构建首个融合植物分类学(Linnaean taxonomy)的植物计数基准数据集TPC-268,该数据集包含10,000张图像、678,050个实例级点注释,并覆盖268个可计数的植物与真菌类群(涵盖242个物种),同时支持从冠层级遥感影像到组织级显微图像的多尺度观测。此外,通过提供符合分类一致性的数据划分和类无关计数(class-agnostic counting, CAC)评估框架,TPC-268为推动细粒度、物种感知的计数方法提供了生物学基础扎实的测试平台。
链接: https://arxiv.org/abs/2603.21229
作者: Jinyu Xu,Tianqi Hu,Xiaonan Hu,Letian Zhou,Songliang Cao,Meng Zhang,Hao Lu
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, the fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom - species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at this https URL.
[CV-139] A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification
【速读】:该论文旨在解决遥感影像中道路自动多级映射(multi-grade road mapping)的难题,即如何从高分辨率遥感图像中同时准确提取道路表面掩码、保持拓扑结构的道路网络,并实现语义一致的道路等级分类。其核心挑战在于跨尺度道路特征建模、几何与语义信息融合以及可解释的层次推理。解决方案的关键是提出RoadReasoner框架,该框架通过显式增强频域敏感线索和多尺度上下文来强化道路特征表示与连接性;并在骨架-分割层级上利用几何描述符与几何感知文本提示(geometry-aware textual prompts),结合视觉-语言模型实现可语言解释的等级决策,从而在SYSU-HiRoads大规模数据集上实现72.6%整体准确率(OA)、64.2% F1分数和60.6%分割准确率(SegAcc),显著优于现有方法。
链接: https://arxiv.org/abs/2603.21222
作者: Ting Han,Xiangyi Xie,Yiping Chen,Yumeng Du,Jin Ma,Aiguang Li,Jiaan Liu,Yin Gao
机构: 中山大学(Sun Yat-sen University); 国家电网公司(State Grid Corporation of China)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km² in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.
[CV-140] Reframing Long-Tailed Learning via Loss Landscape Geometry CVPR2026
【速读】:该论文旨在解决长尾(long-tail, LT)数据分布下模型性能权衡难题,其核心问题是“尾部性能退化”(tail performance degradation)——即模型在头部类别上严重过拟合,同时快速遗忘尾部类别。解决方案的关键在于从损失景观(loss landscape)视角出发,提出一种受持续学习启发的框架:首先引入分组知识保留模块(Grouped Knowledge Preservation module),通过记忆不同类群的收敛参数来促进向共享解空间的收敛;其次设计分组锐度感知模块(Grouped Sharpness Aware module),显式优化损失景观几何结构以寻找更平坦的极小值点。该方法无需外部训练样本或预训练模型,具有良好的可扩展性,并在四个基准测试中显著优于现有最先进方法。
链接: https://arxiv.org/abs/2603.21217
作者: Shenghan Chen,Yiming Liu,Yanzhen Wang,Yujia Wang,Xiankai Lu
机构: Shandong University (山东大学); Zhejiang Sci-Tech University (浙江理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. 11 pages, 6 figures, 5 tables
Abstract:Balancing performance trade-off on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called “tail performance degradation” (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent “tail performance degradation”. To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating the broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at:this https URL.
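摘要中的 Grouped Sharpness Aware 模块基于锐度感知最小化(SAM)的思想寻找平坦极小值。下面是单步 SAM 更新的纯 NumPy 示意(对每个类组独立调用即可);这只是通用 SAM 机制的草图,并非论文的具体实现,超参数 `rho`、`lr` 均为假设取值。

```python
import numpy as np

def sam_step(w, grad_fn, rho=0.05, lr=0.1):
    """锐度感知单步更新:先沿梯度方向上升 rho 得到扰动点,
    再用扰动点处的梯度做下降,从而偏向损失景观中的平坦极小值。"""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # 朝"最坏方向"的扰动
    g_sharp = grad_fn(w + eps)                    # 扰动点处的梯度
    return w - lr * g_sharp
```

在二次损失 L(w)=0.5‖w‖² 上迭代该更新,参数会收敛到极小值点附近的一个小邻域内振荡。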
[CV-141] Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis
【速读】:该论文旨在解决现有反事实图像生成方法在医学影像中难以实现局部结构变化的问题,尤其是传统方法受限于仅能进行全局干预或依赖用户定义的分割掩码,导致生成结果缺乏空间精细度和解剖学一致性。其解决方案的关键在于提出位置感知的分段反事实微调(Positional Seg-CFT),通过将每个解剖结构进一步细分为区域片段,并为每个区域独立提取测量值作为监督信号,从而实现对特定区域的精细化控制,生成具有解剖合理性且空间定位明确的反事实图像,有效支持疾病进展建模等应用场景。
链接: https://arxiv.org/abs/2603.21213
作者: Tian Xia,Matthew Sinclair,Andreas Schuh,Fabio De Sousa Ribeiro,Raghav Mehta,Rajat Rasal,Esther Puyol-Antón,Samuel Gerber,Kersten Petersen,Michiel Schaap,Ben Glocker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.
[CV-142] JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在面对“越狱攻击”(jailbreak attacks)时仍可能生成有害或不适合工作场所(Not-Safe-For-Work, NSFW)内容的问题,尤其针对现有方法依赖代理损失优化而非端到端目标,或依赖大规模强化学习(Reinforcement Learning, RL)训练的高复杂度生成器所带来的效率与性能瓶颈。其解决方案的关键在于提出一种轻量级框架JANUS,将越狱攻击建模为在黑盒环境下对结构化提示分布的优化问题,并引入一个低维混合策略(mixing policy),该策略基于两个语义锚定的提示分布进行参数化,从而实现高效探索并保持目标语义一致性。此设计显著提升了攻击成功率(ASR-8从25.30%提升至43.15%),同时在CLIP分数和NSFW得分上表现更优,且适用于开源与商用T2I模型,揭示了当前安全过滤机制的结构性弱点。
链接: https://arxiv.org/abs/2603.21208
作者: Haolun Zheng,Yu He,Tailun Chen,Shuo Shao,Zhixuan Chu,Hongbin Zhou,Lan Tao,Zhan Qin,Kui Ren
机构: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Hangzhou HighTech Zone (Binjiang) Blockchain and Data Security Research Institute (杭州高新区(滨江区)区块链与数据安全研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 8 figures
Abstract:Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS , a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
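JANUS 将越狱建模为在黑盒端到端奖励下优化两个语义锚定提示分布之间的低维混合权重。下面用网格搜索给出这一思路的玩具示意:提示池、奖励函数均为假设,真实系统中奖励来自 T2I 模型及其安全过滤器的反馈,优化器也比网格搜索更精细。

```python
import random

def sample_prompt(alpha, pool_a, pool_b, rng):
    """以概率 alpha 从分布 A 采样提示,否则从分布 B 采样。"""
    return rng.choice(pool_a if rng.random() < alpha else pool_b)

def optimize_mixing(reward_fn, pool_a, pool_b, n_samples=500, seed=0):
    """对混合权重 alpha 做粗网格搜索,用蒙特卡洛估计各 alpha 的期望奖励。"""
    rng = random.Random(seed)
    best_alpha, best_r = 0.0, float("-inf")
    for alpha in [i / 10 for i in range(11)]:
        r = sum(reward_fn(sample_prompt(alpha, pool_a, pool_b, rng))
                for _ in range(n_samples)) / n_samples
        if r > best_r:
            best_alpha, best_r = alpha, r
    return best_alpha
```

由于待优化量只是一个标量混合权重,探索成本远低于训练高容量的提示生成器,这正是论文强调的"轻量级"之处。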
[CV-143] Boundary-Aware Instance Segmentation in Microscopy Imaging
【速读】:该论文旨在解决显微镜视频中细胞实例分割的难题,尤其是密集场景下相邻或重叠细胞难以准确分离的问题。现有基于基础模型(如SAM)的分割方法在无大量人工提示的情况下仍表现不佳。其解决方案的关键在于提出一种无需提示(prompt-free)、边界感知的实例分割框架,通过预测有符号距离函数(Signed Distance Function, SDF)来建模细胞轮廓,而非传统的二值掩膜;利用可学习的sigmoid映射将SDF转换为概率图,实现边界定位精准且几何一致的分割效果,并采用统一的改进型Hausdorff距离(Modified Hausdorff Distance, MHD)损失函数联合优化区域与边界信息,从而显著提升边界精度和实例级性能。
链接: https://arxiv.org/abs/2603.21206
作者: Thomas Mendelson,Joshua Francois,Galit Lahav,Tammy Riklin-Raviv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract:Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: this https URL
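摘要中"可学习 sigmoid 将 SDF 映射为概率图"这一步,可写成带温度参数的 sigmoid(按惯例细胞内部 SDF 为负)。注意论文中的温度是学习得到的,这里作为固定超参数示意:

```python
import math

def sdf_to_prob(sdf, tau=1.0):
    """把有符号距离(内部为负、外部为正)映射为前景概率:
    tau 越小,边界处的概率过渡越陡峭,边界定位越锐利。"""
    return 1.0 / (1.0 + math.exp(sdf / tau))
```

零等值线(sdf=0)恰好对应概率 0.5,因此细胞轮廓直接由 SDF 的符号决定,相邻实例的分离由各自 SDF 的零水平集刻画。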
[CV-144] DSCSNet: A Dynamic Sparse Compression Sensing Network for Closely-Spaced Infrared Small Target Unmixing
【速读】:该论文旨在解决远距离红外小目标在成像中因光学镜头焦距和探测器分辨率限制而形成的混合像素点(mixed spots)的分离问题,即近距小目标解混(Close Small Object Unmixing, CSOU)任务。该问题是一个高度病态的逆问题,现有方法难以同时兼顾模型驱动方法的严格稀疏性保障与数据驱动方法对动态场景的适应能力。解决方案的关键在于提出一种动态稀疏压缩感知网络(Dynamic Sparse Compressed Sensing Network, DSCSNet),其通过深度展开交替方向乘子法(ADMM)并引入可学习参数,将严格的ℓ₁-范数稀疏约束嵌入到ADMM的辅助变量更新步骤中,替代传统ℓ₂-范数平滑项以有效保留小目标离散能量峰;同时,在重构阶段融合基于自注意力机制的动态阈值模块,利用迭代过程中的稀疏增强信息自适应调整稀疏化强度,从而实现物理可解释性与场景适应性的统一,显著提升复杂红外场景下的解混精度与泛化能力。
链接: https://arxiv.org/abs/2603.21192
作者: Zhiyang Tang,Yiming Zhu,Ruimin Huang,Meng Yang,Yong Ma,Jun Huang,Fan Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 13 pages, 8 figures
Abstract:Due to the limitations of optical lens focal length and detector resolution, distant clustered infrared small targets often appear as mixed spots. The Close Small Object Unmixing (CSOU) task aims to recover the number, sub-pixel positions, and radiant intensities of individual targets from these spots, which is a highly ill-posed inverse problem. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches and the dynamic scene adaptability of data-driven methods. To address this dilemma, this paper proposes a Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples the Alternating Direction Method of Multipliers (ADMM) with learnable parameters. Specifically, we embed a strict \ell_1 -norm sparsity constraint into the auxiliary variable update step of ADMM to replace the traditional \ell_2 -norm smoothness-promoting terms, which effectively preserves the discrete energy peaks of small targets. We also integrate a self-attention-based dynamic thresholding mechanism into the reconstruction stage, which adaptively adjusts the sparsification intensity using the sparsity-enhanced information from the iterative process. These modules are jointly optimized end-to-end across the three iterative steps of ADMM. Retaining the physical logic of compressed sensing, DSCSNet achieves robust sparsity induction and scene adaptability, thus enhancing the unmixing accuracy and generalization in complex infrared scenarios. Extensive experiments on the synthetic infrared dataset CSIST-100K demonstrate that DSCSNet outperforms state-of-the-art methods in key metrics such as CSO-mAP and sub-pixel localization error.
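DSCSNet 在 ADMM 辅助变量更新中使用严格的 ℓ₁ 范数约束,其逐元素闭式解即软阈值(soft-thresholding)算子;论文中的动态阈值模块相当于让下式中的 `lam` 随自注意力输出自适应变化。以下是该算子的标准形式(示意,非论文官方代码):

```python
import numpy as np

def soft_threshold(z, lam):
    """lam*||x||_1 的近端算子:幅值小于 lam 的元素被置零,
    其余元素向零收缩 lam,从而保留离散的小目标能量峰。"""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

相比 ℓ₂ 平滑项把能量"摊开",软阈值会把弱响应直接归零,这正是摘要所述"有效保留小目标离散能量峰"的来源。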
[CV-145] GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing
【速读】:该论文旨在解决扩散型大语言模型(Diffusion Large Language Models, DLLMs)在无需训练的图像编辑任务中难以实现精确编辑的问题,尤其是在离散的token空间下,传统噪声反演技术因结构失真而失效。其解决方案的关键在于提出GIDE(Grounded Inversion for DLLM Image Editing)框架,核心创新是引入一种新颖的离散噪声反演机制(Discrete Noise Inversion),能够准确捕捉离散token空间中的潜在噪声模式,从而实现高保真重建;同时将编辑流程分解为 **grounding、inversion 和 refinement** 三个阶段,支持多种编辑指令(文本、点、框)并严格保留未编辑背景,显著提升了编辑的语义正确性和感知质量。
链接: https://arxiv.org/abs/2603.21176
作者: Zifeng Zhu,Jiaming Han,Jiaxiang Zhao,Minnan Luo,Xiangyu Yue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 7 figures
Abstract:While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE supporting various editing instructions (text, point and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.
[CV-146] Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images SIGGRAPH
【速读】:该论文旨在解决从稀疏且未标定的RGB图像中重建、理解并渲染高质量3D室内场景的问题,传统基于辐射场(radiance field)的方法通常依赖密集视角和针对每个场景的优化,计算成本高且难以编辑。其解决方案的关键在于提出了一种无需训练和姿态预处理的端到端系统,包含三大创新:(1) 基于图像扭曲(warping)的异常几何过滤策略,提升点云重建的鲁棒性;(2) 基于扭曲引导的2D到3D实例提升机制,实现一致且具有实例感知能力的三维表示;(3) 一种将点云投影至新视角并利用3D感知扩散模型(3D-aware diffusion model)进行精细化渲染的新方法,通过生成式AI(Generative AI)补偿缺失几何信息,增强稀疏输入下的真实感。该方案支持对象级场景编辑(如实例移除),仅需修改点云即可合成一致的新视图,无需重新训练。
链接: https://arxiv.org/abs/2603.21166
作者: Jiatong Xia,Lingqiao Liu
机构: The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by SIGGRAPH Asia 2025
Abstract:We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: this https URL
[CV-147] Beyond a Single Signal: SPECTRE-G2, A Unified Multi-Expert Anomaly Detector for Unknown Unknowns
【速读】:该论文旨在解决机器学习系统在面对未知未知(unknown unknowns)时缺乏对自身知识边界认知的问题,即如何有效识别和应对模型不确定性,特别是在存在结构异常(structural anomalies)的情况下。现有不确定性量化方法通常依赖单一信号(如置信度或密度),难以捕捉多样化的异常模式。解决方案的关键在于提出SPECTRE-G2,一种基于双骨干神经网络的多信号异常检测框架,通过融合八个互补信号(包括密度、几何、不确定性、判别性和因果性等)来提升检测能力;其核心创新包括:使用谱归一化高斯化编码器与保持特征几何结构的MLP构建双分支架构、利用合成分布外数据进行信号校准、以及采用自适应top-k融合机制动态选择最相关信息信号并平均得分,从而在多种数据集上显著优于多个基线方法,在开放世界场景中实现了对新变量和混杂因素的有效检测。
链接: https://arxiv.org/abs/2603.21160
作者: Rahul D Ray
机构: BITS Pilani, Hyderabad Campus (比特学院海得拉巴校区)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Epistemic intelligence requires machine learning systems to recognise the limits of their own knowledge and act safely under uncertainty, especially when faced with unknown unknowns. Existing uncertainty quantification methods rely on a single signal such as confidence or density and fail to detect diverse structural anomalies. We introduce SPECTRE-G2, a multi-signal anomaly detector that combines eight complementary signals from a dual-backbone neural network. The architecture includes a spectral normalised Gaussianization encoder, a plain MLP preserving feature geometry, and an ensemble of five models. These produce density, geometry, uncertainty, discriminative, and causal signals. Each signal is normalised using validation statistics and calibrated with synthetic out-of-distribution data. An adaptive top-k fusion selects the most informative signals and averages their scores. Experiments on synthetic, Adult, CIFAR-10, and Gridworld datasets show strong performance across diverse anomaly types, outperforming multiple baselines on AUROC, AUPR, and FPR95. The model is stable across seeds and particularly effective for detecting new variables and confounders. SPECTRE-G2 provides a practical approach for detecting unknown unknowns in open-world settings.
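SPECTRE-G2 的自适应 top-k 融合先用验证集统计量对各异常信号做归一化,再取最大的 k 个归一化分数求平均。下面给出该融合步骤的示意(八路信号在此简化为任意长度的分数向量;归一化方式采用 z-score,属于合理假设而非论文原文指定):

```python
import numpy as np

def topk_fusion(signals, val_mean, val_std, k=3):
    """用验证统计量做 z 归一化后,平均"最异常"的 k 个信号分数,
    得到融合后的异常评分(越大越异常)。"""
    z = (np.asarray(signals, dtype=float) - val_mean) / val_std
    return float(np.sort(z)[-k:].mean())
```

只取信息量最大的 k 路信号可以避免弱信号稀释强信号,这与摘要中"选择最相关信息信号并平均"的描述一致。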
[CV-148] Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
【速读】:该论文旨在解决生成式零样本学习(Generative Zero-Shot Learning, Generative ZSL)中合成特征任务无关性导致性能下降,以及仅依赖语义原型难以准确建模语义相似但视觉差异显著类别的问题。解决方案的关键在于提出一种基于结果奖励的强化学习框架 RLVC(Reinforcement Learning with Visual Cues),其核心机制包括:(1) 利用基于结果的奖励信号引导生成模型自进化,从而合成更具任务相关性的特征;(2) 引入类级别的视觉线索(class-wise visual cues),一方面将合成特征对齐至视觉原型,另一方面稳定强化学习训练过程;此外,设计了一种新颖的冷启动策略以提升训练稳定性与效果。实验表明,该方法在三个主流ZSL基准上达到当前最优性能,相对基线提升达4.7%。
链接: https://arxiv.org/abs/2603.21138
作者: Wenjin Hou,Xiaoxiao Sun,Hehe Fan
机构: Zhejiang University (浙江大学); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.
[CV-149] MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics
【速读】:该论文旨在解决扩散模型在文本到图像生成中对多主体场景进行精细控制的难题,特别是如何在保持各主体身份不变的前提下,实现用户对多个对象之间层次结构和空间关系的显式定义与精确调控。其解决方案的关键在于提出MS-CustomNet框架,该框架支持零样本集成多个用户提供的对象,并通过显式指定主体间的层级排列和空间布局来引导生成过程,从而确保个体主体身份保留的同时学习并执行用户定义的组合结构。实验表明,该方法在主体身份保留(DINO-I得分0.61)和位置控制精度(YOLO-L得分0.94)方面均优于现有方法,显著提升了多主体图像生成的可控性和保真度。
链接: https://arxiv.org/abs/2603.21136
作者: Pengxiang Cai,Mengyang Li
机构: East China University of Science and Technology (华东理工大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.
[CV-150] One Pool Is Not Enough: Multi-Cluster Memory for Practical Test-Time Adaptation
【速读】:该论文旨在解决实用测试时适应(Practical Test-Time Adaptation, PTTA)场景下,现有方法因采用单一无结构存储池(single unstructured pool)导致的适应不稳定问题。在PTTA中,测试数据流具有时间相关性和非独立同分布(non-i.i.d.)特性,且本质上呈现多模态分布,而传统单簇记忆机制无法有效捕捉这种复杂结构,造成模式失衡与信息丢失。解决方案的关键在于提出多簇记忆(Multi-Cluster Memory, MCM)框架,其核心创新包括:基于像素级统计描述符的簇分配机制以识别不同分布模式、相邻簇合并策略(Adjacent Cluster Consolidation, ACC)控制内存增长、以及均匀簇检索机制(Uniform Cluster Retrieval, UCR)保障各模式下的均衡监督信号。实验表明,MCM在多个基准数据集上显著提升性能,尤其在高复杂度分布(如ImageNet-C和DomainNet)中收益最大,验证了记忆组织结构对PTTA稳定性和效果的关键作用。
链接: https://arxiv.org/abs/2603.21135
作者: Yu-Wen Tseng,Xingyi Zheng,Ya-Chen Wu,I-Bin Liao,Yung-Hui Li,Hong-Han Shuai,Wen-Huang Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures
Abstract:Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.
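MCM 的核心是"描述符驱动的簇分配 + 跨簇均匀检索(UCR)"。下面用最近质心分配近似其行为:距离超过阈值时新开一簇,检索时每簇取同样多的样本。这是假设性的极简示意(阈值与描述符均为虚构取值),论文中的 ACC 合并策略此处省略。

```python
import numpy as np

class MultiClusterMemory:
    """最近质心式多簇记忆:按描述符把样本分到最近的簇,
    距离超过阈值时新开一簇;检索时对各簇均匀取样。"""

    def __init__(self, new_cluster_thresh=2.0):
        self.clusters = []        # 每项为 (质心, 样本列表)
        self.thresh = new_cluster_thresh

    def add(self, descriptor, sample):
        if self.clusters:
            dists = [np.linalg.norm(descriptor - c) for c, _ in self.clusters]
            i = int(np.argmin(dists))
            if dists[i] < self.thresh:
                c, items = self.clusters[i]
                items.append(sample)
                # 在线更新质心(增量均值)
                self.clusters[i] = (c + (descriptor - c) / len(items), items)
                return
        self.clusters.append((np.asarray(descriptor, dtype=float).copy(), [sample]))

    def retrieve(self, per_cluster=1):
        """Uniform Cluster Retrieval:每个分布模式取同样多的样本,
        避免单一模式主导适应过程中的监督信号。"""
        out = []
        for _, items in self.clusters:
            out.extend(items[-per_cluster:])
        return out
```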
[CV-151] Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition ICRA2026
【速读】:该论文旨在解决心脏超声诊断中标准切面获取高度依赖操作者、现有分割模型在纹理区分度低的图像中产生解剖不一致结果,以及自主探头调整方法要么依赖简单启发式规则、要么为黑箱学习的问题。解决方案的关键在于提出一种基于解剖先验(Anatomical Prior, AP)驱动的框架,将心脏结构分割与自主探头调整相结合:首先设计了一种引入空间关系图(Spatial-Relation Graph, SRG)模块的YOLO多类别分割模型,通过嵌入解剖先验增强特征金字塔;其次提取标准切面的可量化解剖特征并拟合为高斯分布构建概率化解剖先验;最后将机器人超声扫描中的探头调整过程建模为强化学习(Reinforcement Learning, RL)问题,以实时解剖特征作为状态输入,解剖先验匹配程度作为奖励信号,实现精准且可解释的自动化探头调整。
链接: https://arxiv.org/abs/2603.21134
作者: Zhiyan Cao,Zhengxi Wu,Yiwei Wang,Pei-Hsuan Lin,Li Zhang,Zhen Xie,Huan Zhao,Han Ding
机构: Huazhong University of Science and Technology (华中科技大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); National Chung Hsing University (国立中兴大学); National University of Singapore (新加坡国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the IEEE ICRA 2026. 8 pages, 5 figures, 3 tables
Abstract:Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
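摘要中"将解剖先验拟合为高斯分布、奖励反映先验匹配程度"这一思路,可写成各解剖特征在其高斯先验下的平均对数密度。以下为假设性的示意:特征与先验参数均为虚构,真实系统中它们来自标准切面的统计拟合。

```python
import math

def ap_reward(features, prior_mu, prior_sigma):
    """各解剖特征在对应高斯先验 N(mu, sigma^2) 下的平均对数密度:
    观测特征越接近标准切面的先验均值,奖励越高。"""
    total = 0.0
    for x, mu, s in zip(features, prior_mu, prior_sigma):
        total += -0.5 * ((x - mu) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))
    return total / len(features)
```

用对数密度而非硬阈值作为奖励,使探头调整策略在偏离标准切面时仍能获得平滑的梯度信号。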
[CV-152] ReDiffuse: Rotation Equivariant Diffusion Model for Multi-focus Image Fusion
【速读】:该论文旨在解决多焦点图像融合(Multi-Focus Image Fusion, MFIF)中扩散模型因焦外模糊导致几何结构扭曲和伪影的问题。核心挑战在于,传统扩散模型在处理具有旋转对称性结构(如纹理和边缘)时,难以保持其原始方向一致性,从而影响融合结果的结构保真度。解决方案的关键在于提出 ReDiffuse——一种具备旋转等变性的扩散模型,通过精心设计基础扩散架构以实现端到端的旋转等变性,并辅以严格的理论分析验证其内在等变误差,确保融合结果能忠实保留输入图像中几何模式的原始朝向与结构一致性。
链接: https://arxiv.org/abs/2603.21129
作者: Bo Li,Tingting Bao,Lingling Zhang,Weiping Fu,Yaxian Wang,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Chang’an University (长安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures
Abstract:Diffusion models have achieved impressive performance on multi-focus image fusion (MFIF). However, a key challenge in applying diffusion models to the ill-posed MFIF problem is that defocus blur can make common symmetric geometric structures (e.g., textures and edges) appear warped and deformed, often leading to unexpected artifacts in the fused images. Therefore, embedding rotation equivariance into diffusion networks is essential, as it enables the fusion results to faithfully preserve the original orientation and structural consistency of geometric patterns underlying the input images. Motivated by this, we propose ReDiffuse, a rotation-equivariant diffusion model for MFIF. Specifically, we carefully construct the basic diffusion architectures to achieve end-to-end rotation equivariance. We also provide a rigorous theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the validity of embedding equivariance structures. ReDiffuse is comprehensively evaluated against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, and Road-MF). Results demonstrate that ReDiffuse achieves competitive performance, with improvements of 0.28-6.64% across six evaluation metrics. The code is available at this https URL.
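旋转等变的含义是 f(rot(x)) = rot(f(x))。对 90° 旋转群(C4),可以通过群平均把任意图像算子对称化为严格等变算子,这是等变网络设计背后的基本恒等式。下面的示意与论文的具体网络结构无关,仅演示该性质:

```python
import numpy as np

def c4_symmetrize(op, x):
    """C4 群平均:在 4 个 90° 旋转坐标系下分别应用 op 再转回原坐标系,
    取平均后得到对任意 90° 旋转严格等变的算子。"""
    return sum(np.rot90(op(np.rot90(x, k)), -k) for k in range(4)) / 4.0
```

即使 `op` 本身不等变(例如一次方向性的平移),对称化后的算子也满足 f(rot90(x)) = rot90(f(x)),这正是论文对"融合结果保持几何模式原始朝向"所依赖的结构性质。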
[CV-153] LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation ICLR2026
【速读】:该论文旨在解决动态环境中低帧率(Low-Frame-Rate, LFR)标准相机导致的感知间隙问题,即在传统RGB视频流中由于帧率较低而难以实现连续、高精度的语义分割。为此,作者提出“任意时间插值语义分割”(Anytime Interframe Semantic Segmentation)这一新任务,目标是仅利用单个历史RGB图像和异步事件数据流,在任意时刻预测出稠密语义分割结果。解决方案的关键在于设计了LiFR-Seg框架,其核心创新包括:1)一种不确定性感知的特征扭曲(warping)过程,该过程基于由稀疏且常含噪声的事件数据生成的运动场及其学习得到的显式置信度;2)一个时序记忆注意力模块,用于在高度动态场景中保持语义特征的一致性。该方法在DSEC数据集和作者提出的高频合成基准SHF-DSEC上验证有效,实现了与高帧率(High-Frame-Rate, HFR)理想上限性能相当的结果(mIoU达73.82%,差异仅0.09%),从而为使用LFR硬件实现鲁棒、高效的高帧率感知提供了新范式。
链接: https://arxiv.org/abs/2603.21115
作者: Xiaoshan Wu,Xiaoyang Lyu,Yifei Yu,Bo Wang,Zhongrui Wang,Xiaojuan Qi
机构: The University of Hong Kong (香港大学); Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026
Abstract:Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: this https URL. Code: this https URL.
[CV-154] CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对假设视角变换(counterfactual viewpoint changes)时,其空间状态表示是否保持稳定的问题。现有研究表明MLLMs在单视角空间推理任务中表现优异,但缺乏对跨视角一致性与鲁棒性的系统评估。论文提出一个受控的诊断基准(diagnostic benchmark),通过模拟相机轨道变换(hypothetical camera orbit transformations)来测试关系一致性(relational consistency),而无需重新渲染图像。关键解决方案在于引入结构化输入表示(如文本边界框和结构化场景图),并量化其对空间关系稳定性的影响;结果表明,增加表示结构能显著提升模型在对抗性视角变化下的稳定性,从而揭示了单一视角准确率可能高估模型的空间表征鲁棒性,并强调了结构化表示在因果空间推理中的核心作用。
链接: https://arxiv.org/abs/2603.21114
作者: Shanmukha Vellamcheti,Uday Kiran Kothapalli,Disharee Bhowmick,Sathyanarayanan N. Aakur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 10 figures, 3 tables. Project page: this https URL
Abstract:Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations, visual input, textual bounding boxes, and structured scene graphs, and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
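文中"轨道变换下的关系一致性"与"360° 循环一致性"可以用一个符号化小例子说明:俯视视角下相机绕目标逆时针轨道旋转 90° 时,二元空间关系按固定规则置换,连续四次变换应恢复原关系。下面的规则表是示意性约定,并非该基准的原始定义。

```python
# 俯视视角下,相机绕目标逆时针轨道旋转 90° 时二元空间关系的置换规则(示意性约定)
ORBIT_90 = {"left": "behind", "behind": "right", "right": "front", "front": "left"}

def orbit(relation, times=1):
    """连续应用 times 次 90° 轨道变换后的关系"""
    for _ in range(times % 4):
        relation = ORBIT_90[relation]
    return relation

# 360° 循环一致性: 四次 90° 变换应恢复原关系
cycle_consistent = all(orbit(r, 4) == r for r in ORBIT_90)
opposite_of_left = orbit("left", 2)  # 180° 轨道应翻转左右
```

评测时即可用这类确定性规则核对模型在反事实视角下的预测是否自洽。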
[CV-155] Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning CVPR2026
【速读】:该论文旨在解决当前参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在多任务学习(Multi-Task Learning, MTL)场景下仍主要局限于单任务适应的问题。其核心挑战在于如何在保持模型轻量化的同时,实现多个任务间的有效参数共享与任务特异性权重生成。解决方案的关键在于提出一种名为 Free Sinewich 的框架,该框架通过频率切换(frequency switching)实现近零成本的权重调制:其中,Sine-AWB(Sinewich)层将低秩因子与卷积先验整合为单一核,并利用正弦变换对每个元素进行调制以生成任务特异权重;同时引入轻量级 Clock Net 生成有界频率,稳定训练过程。理论上,正弦调制提升了低秩适配器的秩,而频率分离则降低了不同任务权重间的相关性,从而在密集预测基准上实现了优于单任务微调的性能-效率权衡(如仅用 6.53M 可训练参数即获得最高 +5.39% 的提升)。
链接: https://arxiv.org/abs/2603.21111
作者: Shih-Wen Liu,Yen-Chang Chen,Wei-Ta Chu,Fu-En Yang,Yu-Chiang Frank Wang
机构: National Cheng Kung University (国立成功大学); NVIDIA Research (英伟达研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026
Abstract:Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (Free). Specifically, a Sine-AWB (Sinewich) layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Project page: this https URL.
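摘要声称"正弦调制提升低秩适配器的秩",这一代数性质可以用 NumPy 直接验证:对秩为 2 的矩阵做元素级正弦后,数值秩一般会显著升高。其中频率 ω=1.5 只是示意取值,并非论文中 Clock Net 的实际输出。

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 2
U = rng.standard_normal((n, r))
V = rng.standard_normal((r, n))
W_lowrank = U @ V                   # 低秩核,秩不超过 2
omega = 1.5                         # 示意频率(论文中由 Clock Net 生成有界频率)
W_sine = np.sin(omega * W_lowrank)  # 元素级正弦调制

rank_before = np.linalg.matrix_rank(W_lowrank)
rank_after = np.linalg.matrix_rank(W_sine)
```

正弦展开中的高阶项为低秩矩阵引入了额外的独立方向,这正是"秩增强"论断的直观来源。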
[CV-156] CounterScene: Counterfactual Causal Reasoning in Generative World Models for Safety-Critical Closed-Loop Evaluation
【速读】:该论文旨在解决当前生成安全关键驾驶场景方法中存在的现实性与对抗性之间的权衡问题,即现有方法依赖启发式对抗性代理选择和无结构扰动,缺乏对交互依赖关系的显式建模,导致生成场景的真实性不足或对抗效果有限。其解决方案的关键在于提出CounterScene框架,通过引入因果对抗代理识别机制以定位关键行为代理并分类冲突类型,并构建基于因果交互图的冲突感知交互世界模型,显式建模动态多智能体依赖关系;在此基础上,采用阶段自适应反事实引导策略,在最小干预前提下移除关键代理的空间和时间安全裕度,使风险通过自然交互传播而涌现,从而在保持高轨迹真实性的前提下显著提升对抗有效性。
链接: https://arxiv.org/abs/2603.21104
作者: Bowen Jing,Ruiyang Hao,Weitao Zhou,Haibao Yu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 7 figures
Abstract:Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism–adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.
[CV-157] Learning Progressive Adaptation for Multi-Modal Tracking
【速读】:该论文旨在解决多模态跟踪(Multi-Modal Tracking)中因配对多模态数据稀缺,导致基于预训练RGB模型的参数高效微调方法难以有效适配多模态特征、调节单一模态内部信息、跨模态交互以及预测头适应性的问题。其解决方案的关键在于提出一种渐进式适配策略(Progressive Adaptation for Multi-Modal Tracking, PATrack),通过引入三类适配器——模态依赖型适配器(modality-dependent adapter)、模态纠缠型适配器(modality-entangled adapter)和任务级适配器(task-level adapter),分别实现模态内高频/低频特征增强、跨模态注意力交互建模以及预测头的特定任务适配,从而在RGB+热成像、RGB+深度、RGB+事件等多模态跟踪任务中显著提升性能。
链接: https://arxiv.org/abs/2603.21100
作者: He Wang,Tianyang Xu,Zhangyong Tang,Xiao-Jun Wu,Josef Kittler
机构: Jiangnan University (江南大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptations for applying RGB pre-trained models and fail to modulate a single specific modality, cross-modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi-Modal Tracking (PATrack). This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced through the modality-dependent adapter, decomposing the high- and low-frequency components, which ensures a more robust feature representation within each modality. The inter-modal interactions are introduced in the modality-entangled adapter, which implements a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state-of-the-art methods. Code is available at this https URL.
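模态依赖型适配器中的"高低频分量分解"可以用一个盒式滤波草图说明:低频取邻域均值,高频取残差,二者相加严格还原输入。具体滤波器形式与窗口大小均为示意性假设,并非论文实现。

```python
import numpy as np

def decompose_freq(x, k=1):
    """把二维特征图分解为低频(边缘填充的 (2k+1)x(2k+1) 盒式均值)与高频(残差)"""
    pad = np.pad(x, k, mode="edge")
    h, w = x.shape
    low = np.zeros_like(x, dtype=float)
    for di in range(2 * k + 1):
        for dj in range(2 * k + 1):
            low += pad[di:di + h, dj:dj + w]
    low /= (2 * k + 1) ** 2
    return low, x - low

rng = np.random.default_rng(0)
feat = rng.standard_normal((12, 12))
low, high = decompose_freq(feat)
reconstruction_ok = np.allclose(low + high, feat)
```

低频分量比原特征更平滑,高频分量则保留边缘等细节,对应适配器中分别增强的两路信息。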
[CV-158] Representation-Level Adversarial Regularization for Clinically Aligned Multitask Thyroid Ultrasound Assessment
【速读】:该论文旨在解决甲状腺超声影像中结节轮廓分割与TI-RADS风险分级任务因阅片者间差异导致的标注不一致性问题,这种不一致性会削弱标准学习流程的性能。解决方案的关键在于提出一种临床引导的多任务框架,通过联合预测结节掩码(mask)和TI-RADS类别,在共享表示层引入一种基于潜在空间对抗方向的梯度正则化方法——RLAR(Representation-Level Adversarial Gradient Regularizer),以显式建模并控制不同任务间的梯度竞争,从而提升风险分类的准确性,同时保持分割质量。
链接: https://arxiv.org/abs/2603.21095
作者: Dina Salama,Mohamed Mahmoud,Nourhan Bayasi,David Liu,Ilker Hacihaliloglu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Thyroid ultrasound is the first-line exam for assessing thyroid nodules and determining whether biopsy is warranted. In routine reporting, radiologists produce two coupled outputs: a nodule contour for measurement and a TI-RADS risk category based on sonographic criteria. Yet both contouring style and risk grading vary across readers, creating inconsistent supervision that can degrade standard learning pipelines. In this paper, we address this workflow with a clinically guided multitask framework that jointly predicts the nodule mask and TI-RADS category within a single model. To ground risk prediction in clinically meaningful evidence, we guide the classification embedding using a compact TI-RADS aligned radiomics target during training, while preserving complementary deep features for discriminative performance. However, under annotator variability, naive multitask optimization often fails not because the tasks are unrelated, but because their gradients compete within the shared representation. To make this competition explicit and controllable, we introduce RLAR, a representation-level adversarial gradient regularizer. Rather than performing parameter-level gradient surgery, RLAR uses each task’s normalized adversarial direction in latent space as a geometric probe of task sensitivity and penalizes excessive angular alignment between task-specific adversarial directions. On a public TI-RADS dataset, our clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality compared to single-task training and conventional multitask baselines. Code and pretrained models will be released.
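RLAR 的核心是惩罚两任务对抗方向之间过度的角度对齐,可用归一化向量余弦相似度上的铰链项来示意。以下形式与 margin 取值均为假设,并非论文原式:

```python
import numpy as np

def alignment_penalty(g_task1, g_task2, margin=0.5):
    """对两任务归一化对抗方向 |cos| 超过 margin 的部分施加惩罚(示意形式)"""
    u = g_task1 / (np.linalg.norm(g_task1) + 1e-12)
    v = g_task2 / (np.linalg.norm(g_task2) + 1e-12)
    return max(0.0, abs(float(u @ v)) - margin)

pen_aligned = alignment_penalty(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
pen_orthogonal = alignment_penalty(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

完全同向的对抗方向被惩罚,而近似正交(即梯度竞争较弱)的方向不受影响。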
[CV-159] DGRNet: Disagreement-Guided Refinement for Uncertainty-Aware Brain Tumor Segmentation
【速读】:该论文旨在解决医学图像分割中两个关键问题:一是单模型预测缺乏可靠的不确定性量化,这限制了其在临床场景中的可信度与决策支持能力;二是放射科报告中蕴含的丰富语义信息未被有效利用,尤其在分割边界模糊区域难以提供精准指导。解决方案的关键在于提出一种基于多视角分歧的细化网络(Disagreement-Guided Refinement Network, DGRNet),通过四个轻量级视图特异性适配器生成多样化预测,在一次前向传播中实现高效不确定性估计,并构建分歧图定位高不确定区域,随后依据临床文本报告对这些区域进行选择性精细化调整;同时引入多样性保持训练策略,结合成对相似性惩罚和梯度隔离机制,防止不同视图预测趋同,从而提升分割精度与不确定性估计的可靠性。
链接: https://arxiv.org/abs/2603.21086
作者: Bahram Mohammadi,Yanqiu Wu,Vu Minh Hieu Phan,Sam White,Minh-Son To,Jian Yang,Michael Sheng,Yang Song,Yuankai Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 4 tables
Abstract:Accurate brain tumor segmentation from MRI scans is critical for diagnosis and treatment planning. Despite the strong performance of recent deep learning approaches, two fundamental limitations remain: (1) the lack of reliable uncertainty quantification in single-model predictions, which is essential for clinical deployment because the level of uncertainty may impact treatment decision-making, and (2) the under-utilization of rich information in radiology reports that can guide segmentation in ambiguous regions. In this paper, we propose the Disagreement-Guided Refinement Network (DGRNet), a novel framework that addresses both limitations through multi-view disagreement-based uncertainty estimation and text-conditioned refinement. DGRNet generates diverse predictions via four lightweight view-specific adapters attached to a shared encoder-decoder, enabling efficient uncertainty quantification within a single forward pass. Afterward, we build disagreement maps to identify regions of high segmentation uncertainty, which are then selectively refined according to clinical reports. Moreover, we introduce a diversity-preserving training strategy that combines pairwise similarity penalties and gradient isolation to prevent view collapse. The experimental results on the TextBraTS dataset show that DGRNet favorably improves state-of-the-art segmentation accuracy by 2.4% and 11% in main metrics Dice and HD95, respectively, while providing meaningful uncertainty estimates.
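"多视角分歧图"的构建可以用逐像素方差示意:多个视图的概率预测在一致处方差为零,在分歧处方差为正,随后仅对高分歧像素做文本引导的精细化。阈值与方差度量均为示意性选择,并非论文的具体实现。

```python
import numpy as np

def disagreement_map(prob_maps):
    """prob_maps: (V, H, W) 各视图前景概率;返回逐像素方差作为分歧图"""
    return np.asarray(prob_maps, dtype=float).var(axis=0)

def refine_region(dis_map, tau=0.01):
    """仅标记分歧高于阈值 tau 的像素,交由后续精细化"""
    return dis_map > tau

views = np.zeros((4, 2, 2))
views[:, 0, 0] = 0.9                   # 四个视图一致的像素
views[:, 1, 1] = [0.1, 0.9, 0.2, 0.8]  # 四个视图分歧明显的像素
dis = disagreement_map(views)
mask = refine_region(dis)
```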
[CV-160] Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models CVPR2026
【速读】:该论文旨在解决潜在扩散模型(Latent Diffusion Models)中因潜在空间对采样扰动缺乏鲁棒性而导致的生成质量下降问题。研究表明,当前广泛使用的基于β-VAE的编码器倾向于生成过于紧凑的潜在流形,使其在扩散采样过程中对随机扰动高度敏感,从而引发视觉退化。解决方案的关键在于引入一种“方差扩展损失”(Variance Expansion Loss),通过对抗性地平衡重建损失与方差扩展损失,有效抑制方差坍缩现象,实现潜在空间在保持高重建保真度的同时显著提升对采样扰动的鲁棒性,从而稳定并改善扩散采样的生成效果。
链接: https://arxiv.org/abs/2603.21085
作者: Qifan Li,Xingyu Zhou,Jinhua Zhang,Weiyi You,Shuhang Gu
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used β-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
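"方差扩展损失"对抗方差坍缩的思路,可以用一个对各潜在维度标准差设下限的铰链项来示意(以下形式与阈值 tau 均为假设,并非论文原式):坍缩的潜在编码损失大,分布充分展开时损失趋近于零。

```python
import numpy as np

def variance_expansion_loss(z, tau=1.0):
    """z: (N, D) 一批潜在编码;惩罚每一维标准差低于 tau 的部分(示意形式)"""
    std = z.std(axis=0)
    return float(np.mean(np.maximum(0.0, tau - std)))

rng = np.random.default_rng(0)
z_collapsed = np.ones((256, 8))           # 完全坍缩: 每维方差为 0
z_spread = rng.standard_normal((256, 8))  # 充分展开: 每维标准差约为 1
loss_collapsed = variance_expansion_loss(z_collapsed)
loss_spread = variance_expansion_loss(z_spread)
```

训练中把这类项与重建损失加权相加,即可形成摘要所述"重建与方差扩展的对抗平衡"。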
[CV-161] Hierarchical Text-Guided Brain Tumor Segmentation via Sub-Region-Aware Prompts
【速读】:该论文旨在解决脑肿瘤分割中因三类标准子区域(全肿瘤 WT、肿瘤核心 TC 和增强肿瘤 ET)边界模糊而导致的分割精度不足问题。现有融合影像与放射学文本描述的多模态方法通常将报告压缩为单一全局文本嵌入,未能体现各子区域独特的临床特征。其解决方案的关键在于提出 TextCSP(text-modulated soft cascade architecture),包含三个创新组件:(1) 文本调制的软级联解码器,按 WT→TC→ET 的粗到细顺序预测,符合解剖层次结构;(2) 子区域感知的提示调优(prompt tuning),利用 LoRA 适配的 BioBERT 编码器生成针对每个子区域的专用文本表示;(3) 文本语义通道调制器,将上述表示转化为通道级细化信号,引导解码器强化与临床描述模式一致的特征。实验表明,在 TextBraTS 数据集上,该方法在 Dice 和 HD95 指标上分别优于当前最优方法 1.7% 和 6%。
链接: https://arxiv.org/abs/2603.21083
作者: Bahram Mohammadi,Ta Duc Huy,Afrouz Sheikholeslami,Qi Chen,Vu Minh Hieu Phan,Sam White,Minh-Son To,Xuyun Zhang,Amin Beheshti,Luping Zhou,Yuankai Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 4 tables
Abstract:Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches typically compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts WT-TC-ET in a coarse-to-fine manner consistent with their anatomical containment hierarchy. (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; (3) text-semantic channel modulators that convert the aforementioned representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements across all sub-regions against state-of-the-art methods by 1.7% and 6% on the main metrics Dice and HD95.
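"软级联解码"保证 WT⊇TC⊇ET 的解剖包含层级,最直接的实现方式是逐级概率相乘,使细粒度预测被粗粒度掩码约束。以下草图仅为示意实现:

```python
import numpy as np

def soft_cascade(p_wt, p_tc, p_et):
    """按 WT -> TC -> ET 逐级相乘,强制概率满足 ET <= TC <= WT 的嵌套约束"""
    p_tc_nested = p_tc * p_wt
    p_et_nested = p_et * p_tc_nested
    return p_wt, p_tc_nested, p_et_nested

rng = np.random.default_rng(0)
wt, tc, et = soft_cascade(rng.random((8, 8)), rng.random((8, 8)), rng.random((8, 8)))
nesting_ok = np.all(et <= tc) and np.all(tc <= wt)
```

由于各概率均在 [0, 1) 内,相乘后嵌套约束在逐像素意义上严格成立。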
[CV-162] CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉编码器(vision encoder)是否应进行微调(fine-tuning)这一长期存在的争议问题。现有方法在不同任务和训练设置下表现不稳定,难以 consistently 超越冻结视觉编码器的基线,其根本原因在于视觉偏好冲突(visual preference conflicts)——即视觉编码器在不同多模态上下文下的参数更新方向不一致。为解决此问题,作者提出 Context-aware Visual Fine-tuning (CoVFT) 框架,其核心创新在于显式引入多模态上下文信息以指导视觉适配:通过 Context Vector Extraction (CVE) 模块提取上下文向量,并结合 Contextual Mixture-of-Experts (CoMoE) 模块分解冲突优化信号,从而实现稳定且上下文敏感的视觉更新。实验表明,CoVFT 在12个基准上均取得最先进性能,且7B规模模型经CoVFT微调后超越13B模型平均表现,揭示了视觉编码器优化的巨大潜力。
链接: https://arxiv.org/abs/2603.21077
作者: Nan Zhou,Huiqun Wang,Yaoyan Zheng,Di Huang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.
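CoMoE 的核心是用上下文向量生成专家门控权重,把冲突的优化信号分派给不同专家。以下为最小门控草图,其中专家函数与门控矩阵均为示意性假设:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def contextual_moe(x, context, experts, gate_w):
    """按上下文向量路由: 输出为各专家输出的凸组合。gate_w 形状 (E, C)"""
    weights = softmax(gate_w @ context)
    outs = np.stack([f(x) for f in experts])
    return weights @ outs, weights

experts = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: -x]
rng = np.random.default_rng(0)
feat = rng.standard_normal(4)
out, w = contextual_moe(feat, context=np.array([1.0, -0.5]),
                        experts=experts, gate_w=rng.standard_normal((3, 2)))
weights_valid = abs(w.sum() - 1.0) < 1e-9 and np.all(w > 0)
```

不同的多模态上下文向量会得到不同的门控权重,从而产生"上下文敏感"的视觉更新方向。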
[CV-163] CTFS : Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels CVPR2026
【速读】:该论文旨在解决前向声呐(Forward-looking Sonar, FLS)图像在极端有限标注数据条件下进行语义分割时,因严重斑点噪声、低纹理对比度、声影效应和几何畸变等因素导致传统师生框架性能不佳的问题。解决方案的关键在于提出一种协同教师语义分割框架,其核心创新包括:(1) 构建由一个通用教师和多个声呐特异性教师组成的多教师协作机制,通过交替引导策略使学生模型同时学习通用语义特征与声呐图像的独特特性;(2) 设计跨教师可靠性评估机制,基于多视角和多教师预测的一致性与稳定性动态量化伪标签可靠性,有效缓解噪声伪标签对训练的负面影响。该方法在FLSMD数据集上仅使用2%标注数据时,mIoU指标相比现有最优方法提升5.08%。
链接: https://arxiv.org/abs/2603.21071
作者: Ping Guo,Chengzhou Li,Guanchen Meng,Qi Jia,Jinyuan Liu,Zhu Liu,Yu Liu,Zhongxuan Luo,Xin Fan
机构: Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 Findings
Abstract:As one of the most important underwater sensing technologies, forward-looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher-student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward-looking sonar images. This framework introduces a multi-teacher collaborative mechanism composed of one general teacher and multiple sonar-specific teachers. By adopting a multi-teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo-labels, we further design a cross-teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo-labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo-labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state-of-the-art approaches.
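"跨教师可靠性评估"的基本量是多教师预测的一致程度,可用逐像素多数票占比示意:一致率高的像素保留为伪标签,低于阈值的像素剔除。阈值与度量均为示意性假设,并非论文的完整机制(论文还结合了多视角一致性与稳定性)。

```python
import numpy as np

def teacher_agreement(teacher_preds):
    """teacher_preds: (T, H, W) 各教师的类别预测;返回多数票标签与一致率"""
    preds = np.asarray(teacher_preds)
    t, h, w = preds.shape
    vote = np.zeros((h, w), dtype=preds.dtype)
    agree = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            vals, counts = np.unique(preds[:, i, j], return_counts=True)
            k = counts.argmax()
            vote[i, j], agree[i, j] = vals[k], counts[k] / t
    return vote, agree

preds = np.array([[[1, 0]], [[1, 2]], [[1, 0]]])  # 3 个教师, 1x2 的小图
vote, agree = teacher_agreement(preds)
reliable = agree >= 0.75  # 示意阈值: 仅保留高一致率像素的伪标签
```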
[CV-164] NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection CVPR2026
【速读】:该论文旨在解决开放词汇目标检测(Open-Vocabulary Object Detection, OVD)中训练与测试阶段存在的显著性能差距问题,尤其针对RPN(Region Proposal Network)和RoI头在训练时将未标注的新类别对象误判为背景,导致提案被提前过滤或进一步误分类,进而在测试阶段因得分过低被后处理移除,从而造成新类别召回率严重下降的问题。其解决方案的关键在于提出一种名为NoOVD的新训练框架,该框架创新性地引入基于冻结视觉-语言模型(Vision-Language Models, VLMs)知识的自蒸馏机制,并设计了K-FPN(Knowledge-guided FPN)以利用VLM预训练知识引导模型发现新类别对象并实现无额外数据的知识蒸馏,同时引入R-RPN(Recall-oriented RPN)调整推理阶段提案的置信度分数,从而提升新类别对象的召回率。
链接: https://arxiv.org/abs/2603.21069
作者: Yupeng Zhang,Ruize Han,Zhiwei Chen,Wei Feng,Liang Wan
机构: Tianjin University (天津大学); State Administration of Cultural Heritage (国家文物局); Shenzhen University of Advanced Technology (深圳先进技术研究院); Nanchang University (南昌大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Accept
Abstract:Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance. To address these issues, we propose a novel training framework, NoOVD, which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.
[CV-165] wo Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting
【速读】:该论文旨在解决现有无姿态(pose-free)前向传播三维高斯溅射(3D Gaussian Splatting, 3DGS)方法中几何推理与外观建模耦合导致的高质量重建性能受限问题。当前主流方法采用统一的单体架构,将相机位姿估计与3DGS表示合成集成于同一网络中,虽结构简洁但存在特征表示混杂、优化困难的问题。解决方案的关键在于提出一种双专家(two-expert)设计——通过分离的几何专家(geometry expert)先独立预测相机位姿,并将其显式传递给强大的外观专家(appearance expert)以生成高质量的3D高斯表示。这种模块化设计显著提升了重建精度,在少于5K次训练迭代下即超越现有无姿态方法,并达到与有姿态(posed)先进方法相当的性能,验证了分离式建模在复杂3D几何估计与外观合成任务中的有效性。
链接: https://arxiv.org/abs/2603.21064
作者: Hwasik Jeong,Seungryong Lee,Gyeongjin Kang,Seungkwon Yang,Xiangyu Sun,Seungtae Nam,Eunbyung Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such “all-in-one” designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
[CV-166] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving
【速读】:该论文旨在解决当前基于摄像头的自动驾驶感知系统中,过度追求感知效果而忽视计算效率的问题。其解决方案的关键在于提出LRHPerception,一个实时单目感知框架,通过融合端到端学习的计算高效性与局部建图方法的丰富表征能力,将目标跟踪与预测、道路分割和深度估计统一整合至一个框架中,从而实现以29 FPS的处理速度(单GPU)完成从单目图像到五通道张量(包含RGB、道路分割及像素级深度估计,并附加目标检测与轨迹预测)的高效转换,相较最快的传统建图方法提升555%的处理速度。
链接: https://arxiv.org/abs/2603.21061
作者: Haixi Zhang,Aiyinsi Zuo,Zirui Li,Chunshu Wu,Tong Geng,Zhiyao Duan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures
Abstract:Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
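摘要所述"五通道张量"(RGB + 道路分割 + 逐像素深度)的组装可以用一个通道拼接草图说明,通道顺序为示意性假设:

```python
import numpy as np

def pack_perception(rgb, road_seg, depth):
    """rgb: (H,W,3), road_seg: (H,W), depth: (H,W) -> (H,W,5) 五通道感知张量"""
    assert rgb.shape[:2] == road_seg.shape == depth.shape
    return np.concatenate(
        [rgb, road_seg[..., None].astype(rgb.dtype), depth[..., None].astype(rgb.dtype)],
        axis=-1)

rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 3))
seg = rng.random((4, 6)) > 0.5   # 二值道路分割
depth = rng.random((4, 6))
out = pack_perception(rgb, seg, depth)
```

目标检测框与轨迹预测作为附加输出,与该张量一起构成完整的感知结果。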
[CV-167] SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM CVPR2026
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在RGBD SLAM中因高斯分布过于灵活或受限而导致的收敛速度慢和渲染质量低的问题。其核心解决方案是采用像素对齐的高斯分布,并允许每个高斯沿其射线方向自适应调整位置以最大化渲染质量,同时通过将每像素周围的深度分布建模为高斯分布来加速跟踪过程,从而在保证系统可扩展性的前提下提升性能。
链接: https://arxiv.org/abs/2603.21055
作者: Pengchong Hu,Zhizhong Han
机构: Wayne State University (韦恩州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at this https URL.
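"让每个高斯沿其射线调整位置"在针孔相机模型下等价于缩放反投影深度;而"以高斯分布建模逐像素深度"则为快速帧对齐提供打分依据。以下草图中的内参取值均为示意:

```python
import numpy as np

def unproject(u, v, depth, delta, fx=300.0, fy=300.0, cx=160.0, cy=120.0):
    """把像素 (u, v) 反投影到调整后深度 depth+delta 处的相机系 3D 点(针孔模型)"""
    d = depth + delta
    return np.array([(u - cx) / fx * d, (v - cy) / fy * d, d])

def depth_log_likelihood(d_obs, mu, sigma):
    """逐像素深度建模为高斯 N(mu, sigma^2) 时观测深度的对数似然,用于对齐打分"""
    return -0.5 * ((d_obs - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

p0 = unproject(200, 80, depth=2.0, delta=0.0)
p1 = unproject(200, 80, depth=2.0, delta=1.0)  # 沿同一射线后移 1 m
on_same_ray = np.allclose(p1, p0 * 1.5)        # 深度 3 = 1.5 x 深度 2
ll_at_mean = depth_log_likelihood(2.0, mu=2.0, sigma=0.1)
ll_off_mean = depth_log_likelihood(2.5, mu=2.0, sigma=0.1)
```

无论深度如何调整,点始终落在同一条像素射线上,这正是像素对齐高斯仍可自由优化位置的原因。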
[CV-168] A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors
【速读】:该论文旨在解决从车内视频流中识别危险驾驶行为的时序动作定位(Temporal Action Localization)问题,尤其关注在交通安检或车队管理等周期性检查场景下的准确性与计算效率之间的平衡难题。其解决方案的关键在于提出一个两阶段框架:首先采用基于VideoMAE的特征提取模块,对比不同规模的Vision Transformer(ViT)骨干网络(如ViT-Giant与轻量级ViT变体)以优化表征能力与计算成本;其次引入增强型自掩码注意力机制(Augmented Self-Mask Attention, AMA)检测器,并融合空间金字塔池化-快速模块(SPPF)以捕获多尺度时间特征。实验表明,该设计在保持高精度(最高mAP达92.67%)的同时显著降低计算开销,为实际部署提供了可行路径。
链接: https://arxiv.org/abs/2603.21048
作者: Gia-Bao Doan,Nam-Khoa Huynh,Minh-Nhat-Huy Ho,Khanh-Thanh-Khoa Nguyen,Thanh-Hai Le
机构: FPT University (FPT大学); The Saigon International University (西贡国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 14 figures
Abstract:The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers higher representations with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
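该论文在定位头中引入的 SPPF(Spatial Pyramid Pooling-Fast)模块源自 YOLOv5:用三次串联的同核最大池化等效并联多尺度 SPP。下面给出其时间维一维版本的 NumPy 示意(仅为概念演示,非论文官方实现;核大小 k=5 为常用默认值):

```python
import numpy as np

def max_pool_1d_same(x, k=5):
    """步长为 1、same padding 的一维最大池化(沿时间轴,x: [T, C])。"""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), constant_values=-np.inf)
    return np.stack([xp[t:t + k].max(axis=0) for t in range(x.shape[0])])

def sppf_temporal(x, k=5):
    """SPPF 的时间维示意版:串联三次同核最大池化后在通道维拼接。

    等效于并联核为 k、2k-1、3k-2 的三路 SPP,但只需三次小核池化。
    x: [T, C] -> 返回 [T, 4C]
    """
    p1 = max_pool_1d_same(x, k)
    p2 = max_pool_1d_same(p1, k)
    p3 = max_pool_1d_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=1)
```

串联结构使后级池化的感受野逐级扩大,从而以更低的计算量捕获多尺度时间特征。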
[CV-169] When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound
【速读】:该论文旨在解决当前医学视觉语言模型(Medical Vision-Language Models, Med-VLMs)在临床超声图像分析中存在可信度不足的问题,特别是其对自然语言指令的脆弱性——即对抗性扰动(如拼写错误、缩写、模糊表述等)可能显著改变模型输出,从而影响临床决策的安全性。解决方案的关键在于提出一个可扩展的对抗评估框架,利用大语言模型(Large Language Model, LLM)生成临床上合理且“人性化”的对抗提示变体,模拟真实临床沟通中的细微差异,并通过多选题问答基准系统性地测试主流Med-VLMs的鲁棒性,从而识别其共性失败模式与置信度之间的关联,为安全可靠的临床部署提供实证依据。
链接: https://arxiv.org/abs/2603.21047
作者: Yasamin Medghalchi,Milad Yazdani,Amirhossein Dabiriaghdam,Moein Heidari,Mojan Izadkhah,Zahra Kavian,Giuseppe Carenini,Lele Wang,Dena Shahriari,Ilker Hacihaliloglu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via “humanized” rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.
[CV-170] SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments
【速读】:该论文旨在解决无人机视觉语言导航(UAV VLN)在复杂三维环境中的挑战,核心问题在于2D视觉感知与3D轨迹决策空间之间的结构表征不匹配,从而限制了空间推理能力。解决方案的关键是提出SpatialFly框架,其通过几何引导的二维表征对齐机制实现:首先利用几何先验注入模块将全局结构线索注入到2D语义token中,提供场景级几何指导;随后通过几何感知重参数化模块,借助跨模态注意力机制将2D语义token与3D几何token对齐,并采用门控残差融合保留语义区分度,从而有效弥合2D与3D表征差异,提升导航性能。
链接: https://arxiv.org/abs/2603.21046
作者: Wen Jiang,Kangyao Huang,Li Wang,Wang Xu,Wei Fan,Jinyuan Liu,Shaoyu Liu,Hanfang Liang,Hongwei Duan,Bin Xu,Xiangyang Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV VLN in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing NE by 4.03m and improving SR by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.
[CV-171] LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction
【速读】:该论文旨在解决基于扩散模型的图像超分辨率(Image Super-Resolution, SR)方法在推理效率与重建质量之间的权衡问题,特别是针对当前最优的残差偏移扩散框架(residual-shifting diffusion framework)在紧凑采样轨迹中性能显著下降的问题。其关键在于两个核心限制:一是中间步骤中使用无约束随机高斯噪声导致误差累积且低分辨率(Low-Resolution, LR)先验引导不足;二是采用朴素双三次插值(bicubic upsampling)初始化带来的偏差。为此,作者提出LPNSR框架,通过数学推导获得残差偏移扩散范式下最优中间噪声的闭式解析解,并设计了一种LR引导的多输入感知噪声预测器,将LR结构先验嵌入逆向过程,同时保留原框架高效的残差偏移机制;此外,引入高质量预上采样网络以优化扩散起始点,缓解初始偏差。最终,在仅4步采样轨迹下实现端到端优化,显著提升感知质量,无需依赖大规模文本到图像先验。
链接: https://arxiv.org/abs/2603.21045
作者: Shuwei Huang,Shizhuo Liu,Zijun Wei
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion-based image super-resolution (SR), which aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) observations, faces a fundamental trade-off between inference efficiency and reconstruction quality. The state-of-the-art residual-shifting diffusion framework achieves efficient 4-step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior-enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed-form analytical solution of the optimal intermediate noise for the residual-shifting diffusion paradigm, and accordingly design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework’s core efficient residual-shifting mechanism. We further mitigate initial bias with a high-quality pre-upsampling network to optimize the diffusion starting point. With a compact 4-step trajectory, LPNSR can be optimized in an end-to-end manner. Extensive experiments demonstrate that LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets, without relying on any large-scale text-to-image priors. The source code of our method can be found at this https URL.
[CV-172] SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis ICME2026
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在皮肤癌诊断中面临的三重挑战:高计算成本、极端数据稀缺性以及深度学习模型的黑箱特性。其解决方案的关键在于提出SkinCLIP-VL框架,采用“冻结感知-自适应推理”范式,将冻结的CLIP编码器与轻量级量化后的Qwen2.5-VL通过低秩适配(Low-Rank Adaptation, LoRA)进行融合,并引入一致性感知焦点对齐(Consistency-aware Focal Alignment, CFA)损失函数,以在长尾分布下严格对齐视觉区域与临床语义。该方法在ISIC和Derm7pt基准上实现了比13B参数基线模型更高的准确率(提升4.3–6.2%),同时参数减少43%,且通过盲评专家评估和分布外测试验证了其可视化推理依据显著提升了临床可信度。
链接: https://arxiv.org/abs/2603.21010
作者: Zhixiang Lu,Shijie Xu,Kaicheng Yan,Xuyue Cai,Chong Zhang,Yulong Li,Angelos Stefanidis,Anh Nguyen,Jionglong Su
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Liverpool (利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)
Abstract:The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
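CFA 损失中的 focal 重加权项沿用经典 focal loss 的思想:用 (1-p_t)^γ 抑制易分样本、放大长尾难分样本。以下是该基础形式的一个示意实现(γ=2、α=0.25 为文献常用默认值,并非论文设定;亦非论文官方代码):

```python
import numpy as np

def focal_weighting(probs, y_true, gamma=2.0, alpha=0.25):
    """经典二分类 focal loss(CFA 中 focal 重加权项的基础形式)。

    probs:  [N] 模型对正类的预测概率
    y_true: [N] 0/1 标签
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y_true, dtype=float)
    p_t = y * probs + (1 - y) * (1 - probs)       # 对真实类别的预测概率
    alpha_t = y * alpha + (1 - y) * (1 - alpha)
    # (1 - p_t)^γ 使易分样本损失趋近于零,训练聚焦于难样本
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```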
[CV-173] OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields
【速读】:该论文旨在解决远程操作(teleoperation)场景下360°视频流媒体的两大挑战:一是用户注视点预测在不确定 gaze(注视)模式下的准确性问题,二是无线网络波动条件下码率自适应调整的稳定性问题。解决方案的关键在于提出了一种无需训练的框架 OrbitStream,其核心创新包括:将视口预测建模为引力场驱动的“引力视口预测(Gravitational Viewport Prediction, GVP)”问题,利用语义场景理解生成吸引用户注视的势场;同时采用基于饱和机制的比例-微分(Saturation-Based Proportional-Derivative, PD)控制器实现缓冲区稳定调节。该方法在不依赖用户特定数据训练的前提下实现了高精度视口预测和鲁棒的QoE(Quality of Experience)表现,兼顾了可解释性与实时性。
链接: https://arxiv.org/abs/2603.20999
作者: Aizierjiang Aiersilan,Zhangfei Yang
机构: The George Washington University (乔治·华盛顿大学)
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注:
Abstract:Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their “black-box” nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines ( \sim 98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
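OrbitStream 的缓冲区调节采用带饱和的 PD 控制器。下面是一个离散 PD 控制加饱和裁剪的极简示意(增益、码率上下限等参数均为假设值,非论文原始实现):

```python
def pd_bitrate_control(buffer_level, buffer_target, prev_error,
                       kp=0.5, kd=0.1, min_rate=0.5, max_rate=8.0):
    """根据缓冲区偏差用 PD 控制并做饱和裁剪,输出目标码率(Mbps)。

    buffer_level / buffer_target: 当前与目标缓冲时长(秒)
    prev_error: 上一控制周期的误差,用于离散微分项
    """
    error = buffer_level - buffer_target     # 正误差 = 缓冲充足,可提高码率
    derivative = error - prev_error          # 误差变化率(离散微分)
    rate = kp * error + kd * derivative
    # 饱和:限制在可用码率范围内,避免网络波动下的剧烈抖动
    rate = max(min_rate, min(max_rate, rate))
    return rate, error
```

微分项抑制缓冲区的快速变化,饱和项保证输出始终落在可解码的码率档位内。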
[CV-174] Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models CVPR2026
【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)在部署时依赖“同义句一致性”(consistency under paraphrase)作为可靠性代理指标的潜在误导性问题。研究表明,模型可能通过文本模式而非图像内容实现高一致性,从而产生虚假的可靠性假象。解决方案的关键在于提出一个四象限样本级安全分类法(four-quadrant per-sample safety taxonomy),同时评估一致性(对同义提示的预测稳定性)和图像依赖性(移除图像后预测是否变化),从而识别出“危险型”样本(consistent but not image-reliant)——这类样本虽具高准确率和低熵,但实际不依赖图像信息,易被传统置信度筛选忽略。作者进一步建议,在部署评估中必须结合文本仅输入基线测试(text-only baseline),仅需一次额外前向传播即可揭示此类“虚假可靠性陷阱”。
链接: https://arxiv.org/abs/2603.20985
作者: Binesh Sadanandan,Vahid Behzadan
机构: SAIL Lab, University of New Haven (SAIL 实验室,新黑文大学); Google Health AI (谷歌健康人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Workshop on Medical Reasoning with Vision Language Foundation Models
Abstract:Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
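论文的四象限样本级安全分类法只依赖两个布尔判断:同义提示下预测是否一致、移除图像后预测是否改变。其判定逻辑可示意如下(函数名为说明用途而设,非论文官方代码):

```python
def classify_safety_quadrant(preds_paraphrases, pred_no_image, pred_with_image):
    """按四象限分类法对单个样本进行安全归类。

    preds_paraphrases: 同一图像在多个同义提示下的预测列表
    pred_no_image:     移除图像、仅用文本时的预测
    pred_with_image:   原始(图像+文本)预测
    """
    # 一致性:所有同义提示下预测相同
    consistent = len(set(preds_paraphrases)) == 1
    # 图像依赖性:去掉图像后预测发生改变
    image_reliant = pred_no_image != pred_with_image

    if consistent and image_reliant:
        return "Ideal"       # 一致且依赖图像
    if not consistent and image_reliant:
        return "Fragile"     # 不一致但依赖图像
    if consistent and not image_reliant:
        return "Dangerous"   # 一致却不依赖图像:标准置信度筛查难以发现
    return "Worst"           # 既不一致也不依赖图像

# 示例:预测在同义提示下稳定,但去掉图像后结果不变 -> Dangerous
label = classify_safety_quadrant(["pneumonia", "pneumonia"], "pneumonia", "pneumonia")
print(label)  # -> Dangerous
```

这正是论文建议的部署检查:一致性检验之外,仅需一次额外的纯文本前向传播即可区分 Ideal 与 Dangerous。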
[CV-175] GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies CVPR
【速读】:该论文旨在解决神经元形态学(neuronal morphology)分析中仅依赖拓扑结构或图结构单一模态信息的局限性,从而无法全面捕捉电路功能、发育及疾病相关特征的问题。其解决方案的关键在于提出GraPHFormer——一种基于CLIP-style对比学习的多模态架构,通过统一拓扑与几何信息实现更精准的表征:视觉分支利用三通道持久性图像(persistence image)编码未加权、持久性加权和半径加权的拓扑密度,并采用DINOv2-ViT-S进行特征提取;同时,TreeLSTM编码器从骨架图(skeleton graph)中捕获几何与径向属性;两者映射至共享嵌入空间,并通过对称InfoNCE损失联合训练,辅以保持拓扑语义的持久性空间变换增强。该方法在六个基准测试中显著优于纯拓扑、纯图结构及形态计量基线,验证了其在胶质细胞分类和发育/退行性过程检测中的实用价值。
链接: https://arxiv.org/abs/2603.20970
作者: Uzair Shah,Marco Agus,Mahmoud Gamal,Mahmood Alzubaidi,Corrado Cali,Pierre J. Magistretti,Abdesselam Bouzerdoum,Mowafa Househ
机构: Hamad Bin Khalifa University (哈马德·本·哈利法大学); University of Turin (都灵大学); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); University of Wollongong (伍伦贡大学); Neuroscience Institute Cavalieri Ottolenghi (卡瓦列里·奥托伦吉神经科学研究所); Université Grenoble-Alpes (格勒诺布尔阿尔卑斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: this https URL
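GraPHFormer 用对称 InfoNCE 损失对齐视觉(持久性图像)与图结构两路嵌入。该损失的通用形式可用 NumPy 自包含地示意如下(温度系数 0.07 为 CLIP 类方法的常见默认值,非论文设定;亦非论文官方实现):

```python
import numpy as np

def symmetric_info_nce(z_img, z_graph, temperature=0.07):
    """对称 InfoNCE:图像->图 与 图->图像 两个方向交叉熵的平均。

    z_img, z_graph: [batch, dim],假设已投影到同一共享嵌入空间。
    批内对角线位置为正样本对,其余为负样本。
    """
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_graph = z_graph / np.linalg.norm(z_graph, axis=1, keepdims=True)
    logits = z_img @ z_graph.T / temperature          # [batch, batch] 余弦相似度

    def cross_entropy(lg):
        # log-softmax 后取对角线(正样本对)的负均值
        lg = lg - lg.max(axis=1, keepdims=True)       # 数值稳定
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```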
[CV-176] Natural Gradient Descent for Online Continual Learning
【速读】:该论文旨在解决在线持续学习(Online Continual Learning, OCL)中因模型在流式数据上不断学习新任务而导致的灾难性遗忘(catastrophic forgetting)问题,同时提升模型在在线场景下的快速收敛能力。其解决方案的关键在于引入基于自然梯度下降(Natural Gradient Descent)的优化策略,并通过Kronecker Factored Approximate Curvature (KFAC) 方法近似计算费舍尔信息矩阵(Fisher Information Matrix, FIM),从而在保持对旧任务知识的同时高效适应新任务,在Split CIFAR-100、CORE50和Split miniImageNet等基准数据集上显著优于现有OCL方法。
链接: https://arxiv.org/abs/2603.20898
作者: Joe Khawand,David Colliaux
机构: Ecole Polytechnique; Télécom Paris; Sony Computer Science Laboratories
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 2 figures
Abstract:Online Continual Learning (OCL) for image classification represents a challenging subset of Continual Learning, focusing on classifying images from a stream without assuming data independence and identical distribution (i.i.d). The primary challenge in this context is to prevent catastrophic forgetting, where the model’s performance on previous tasks deteriorates as it learns new ones. Although various strategies have been proposed to address this issue, achieving rapid convergence remains a significant challenge in the online setting. In this work, we introduce a novel approach to training OCL models that utilizes the Natural Gradient Descent optimizer, incorporating an approximation of the Fisher Information Matrix (FIM) through Kronecker Factored Approximate Curvature (KFAC). This method demonstrates substantial improvements in performance across all OCL methods, particularly when combined with existing OCL tricks, on datasets such as Split CIFAR-100, CORE50, and Split miniImageNet.
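KFAC 将单层的费舍尔信息矩阵近似为 Kronecker 积 G ⊗ A,从而把自然梯度计算化为两个小矩阵的求逆。以下是单个全连接层上该思想的 NumPy 示意(阻尼系数为假设值,非论文设定;亦非论文官方实现):

```python
import numpy as np

def kfac_natural_gradient(grad_W, acts, grads_out, damping=1e-3):
    """KFAC 近似自然梯度(单个全连接层的示意实现)。

    FIM 近似为 Kronecker 积 G ⊗ A,其中
      A = E[a aᵀ]:前向输入激活的二阶矩,[in, in]
      G = E[g gᵀ]:反向输出梯度的二阶矩,[out, out]
    则自然梯度为 (G+λI)⁻¹ · grad_W · (A+λI)⁻¹,
    避免了对完整 FIM([out·in, out·in])的直接求逆。
    """
    n = acts.shape[0]
    A = acts.T @ acts / n
    G = grads_out.T @ grads_out / n
    A_damped = A + damping * np.eye(A.shape[0])
    G_damped = G + damping * np.eye(G.shape[0])
    return np.linalg.solve(G_damped, grad_W) @ np.linalg.inv(A_damped)
```

当 A、G 均为单位阵(即费舍尔近似为各向同性)时,该更新退化为普通梯度下降的缩放版本。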
[CV-177] Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
【速读】:该论文旨在解决现有视频多模态理解方法在全局语义理解上的局限性,以及缺乏用户交互能力的问题,尤其是无法根据用户指定的局部区域(如目标物体的边界框)生成精确匹配其意图的掩码与描述文本。其解决方案的关键在于提出了一种新任务——可控视频分割与字幕生成(Controllable Video Segmentation and Captioning, SegCaptioning),并设计了场景图引导的细粒度SegCaptioning Transformer(Scene Graph-guided Fine-grained SegCaptioning Transformer, SG-FSCFormer)。该框架的核心创新包括:1)引入Prompt-guided Temporal Graph Former模块,通过自适应提示适配器有效捕捉和表示用户意图;2)构建Fine-grained Mask-linguistic Decoder,结合多实体对比损失(Multi-entity Contrastive loss)协同预测高质量的掩码-字幕对,并实现每个掩码与其对应文本标记之间的细粒度对齐,从而显著提升用户对视频内容的理解精度与可控性。
链接: https://arxiv.org/abs/2603.20887
作者: Xu Zhang,Jin Yuan,BinHong Yang,Xuan Liu,Qianjun Zhang,Yuyi Wang,Zhiyong Li,Hanwang Zhang
机构: Hunan University (湖南大学); Southwest Jiaotong University (西南交通大学); CRRC Zhuzhou Institute Company Ltd. (中车株洲电力机车研究所有限公司); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures
Abstract:Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users’ understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. An innovative framework Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer) is designed that integrates a Prompt-guided Temporal Graph Former to effectively captures and represents user intent through an adaptive prompt adaptor, ensuring that the generated content well aligns with the user’s requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder to collaboratively predict high-quality caption-mask pairs using a Multi-entity Contrastive loss, as well as provide fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users’ comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at this https URL.
[CV-178] TAFG-MAN: Timestep-Adaptive Frequency-Gated Latent Diffusion for Efficient and High-Quality Low-Dose CT Image Denoising
【速读】:该论文旨在解决低剂量计算机断层扫描(Low-dose Computed Tomography, LDCT)图像中噪声显著且结构退化严重的问题,即在抑制噪声的同时如何有效保留细微的解剖学细节。其解决方案的关键在于提出了一种基于潜在扩散框架的高效高质LDCT图像去噪方法——TAFG-MAN,其中核心创新是引入轻量级的时间步自适应频域门控(Timestep-Adaptive Frequency-Gated, TAFG)条件机制:该机制将条件特征分解为低频与高频成分,从当前去噪特征和时间步嵌入中预测时间步自适应门控,并在交叉注意力之前逐步释放高频引导信息,从而在去噪早期依赖稳定结构引导、后期谨慎引入细节信息,实现噪声抑制与细节保留之间的更好平衡。
链接: https://arxiv.org/abs/2603.20868
作者: Tangtangfang Fang,Yang Jiao,Xiangjian He,Jingxi Hu,Jiaqi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but also introduces substantial noise and structural degradation, making it difficult to suppress noise without erasing subtle anatomical details. In this paper, we present TAFG-MAN, a latent diffusion framework for efficient and high-quality LDCT image denoising. The framework combines a perceptually optimized autoencoder, conditional latent diffusion restoration in a compact latent space, and a lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning design. TAFG decomposes condition features into low- and high-frequency components, predicts timestep-adaptive gates from the current denoising feature and timestep embedding, and progressively releases high-frequency guidance in later denoising stages before cross-attention. In this way, the model relies more on stable structural guidance at early reverse steps and introduces fine details more cautiously as denoising proceeds, improving the balance between noise suppression and detail preservation. Experiments show that TAFG-MAN achieves a favorable quality-efficiency trade-off against representative baselines. Compared with its base variant without TAFG, it further improves detail preservation and perceptual quality while maintaining essentially the same inference cost, and ablation results confirm the effectiveness of the proposed conditioning mechanism.
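TAFG 的核心是将条件特征分解为低、高频成分,并让高频引导随去噪推进逐步释放。下面用一维 FFT 给出这一机制的极简示意(门控在论文中由网络从去噪特征与时间步嵌入预测,此处简化为随进度线性增大的标量;截止频率等参数均为假设):

```python
import numpy as np

def tafg_condition(feat, timestep, num_steps=4, cutoff=0.25):
    """时间步自适应频域门控的一维示意。

    feat: [N] 条件特征;timestep 越小表示越接近去噪末期。
    """
    spec = np.fft.rfft(feat)
    k = int(len(spec) * cutoff)
    low = spec.copy()
    low[k:] = 0                               # 低频:结构引导,全程保留
    high = spec.copy()
    high[:k] = 0                              # 高频:细节引导,按进度释放
    progress = 1.0 - timestep / num_steps     # 0(起步)-> 1(末期)
    gate = progress                           # 简化的自适应门控
    gated = low + gate * high
    return np.fft.irfft(gated, n=len(feat))
```

这样模型在去噪早期只依赖稳定的低频结构引导,到末期才完整引入高频细节。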
[CV-179] Restoring Neural Network Plasticity for Faster Transfer Learning
【速读】:该论文旨在解决预训练模型在迁移学习过程中因权重饱和而导致的神经可塑性丧失(loss of neural plasticity)问题,即模型在微调阶段难以适应下游任务,尤其当目标数据集与ImageNet差异较大时。其解决方案的关键在于提出一种有针对性的权重重初始化策略,在微调前恢复模型的神经可塑性,从而提升模型对下游任务的适应能力。实验表明,该方法在卷积神经网络(CNNs)和视觉Transformer(ViTs)上均能实现更高的测试准确率和更快的收敛速度,且计算开销极低,兼容主流迁移学习流程。
链接: https://arxiv.org/abs/2603.20860
作者: Xander Coetzer,Arné Schreuder,Anna Sergeevna Bosman
机构: University of Pretoria (普利托里亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure, 6 tables and 2 formulas
Abstract:Transfer learning with models pretrained on ImageNet has become a standard practice in computer vision. Transfer learning refers to fine-tuning pretrained weights of a neural network on a downstream task, typically unrelated to ImageNet. However, pretrained weights can become saturated and may yield insignificant gradients, failing to adapt to the downstream task. This hinders the ability of the model to train effectively, and is commonly referred to as loss of neural plasticity. Loss of plasticity may prevent the model from fully adapting to the target domain, especially when the downstream dataset is atypical in nature. While this issue has been widely explored in continual learning, it remains relatively understudied in the context of transfer learning. In this work, we propose the use of a targeted weight re-initialization strategy to restore neural plasticity prior to fine-tuning. Our experiments show that both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from this approach, yielding higher test accuracy with faster convergence on several image classification benchmarks. Our method introduces negligible computational overhead and is compatible with common transfer learning pipelines.
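论文的核心操作是在微调前对“饱和”单元做有针对性的权重重初始化。其一种可能的实现思路可示意如下(可塑性得分的定义、阈值与 He 初始化的选择均为假设,非论文官方方法):

```python
import numpy as np

def reinit_saturated_units(W, unit_scores, threshold=0.01, rng=None):
    """按可塑性得分重初始化饱和/休眠单元的权重行(示意实现)。

    W:           [out, in] 某层权重
    unit_scores: 每个输出单元的可塑性得分(如平均梯度幅值),越小越饱和
    """
    rng = np.random.default_rng(0) if rng is None else rng
    W = W.copy()
    fan_in = W.shape[1]
    dead = unit_scores < threshold
    # 对饱和单元按 He 初始化重新采样,恢复其梯度流
    W[dead] = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(dead.sum(), fan_in))
    return W, int(dead.sum())
```

未饱和单元的预训练权重原样保留,因此迁移学习获得的先验知识不受影响。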
[CV-180] Fast and Robust Deformable 3D Gaussian Splatting
【速读】:该论文旨在解决动态场景重建中基于变形场的3D高斯泼溅(3D Gaussian Splatting)方法存在的三大问题:渲染速度慢、对初始点云依赖性强以及在暗光环境下易陷入局部最优解。其解决方案的关键在于提出FRoG框架,通过引入每个高斯点的嵌入(per-Gaussian embedding)与粗到细的时间嵌入策略,实现早期时间信息融合以加速渲染;同时设计了一种基于深度和误差引导的采样策略,在低偏差初始位置生成新的3D高斯点,减轻变形场优化负担并提升静态与动态区域的细节重建质量;此外,通过调节不透明度变化,缓解了暗光场景中的局部最优问题,从而改善颜色保真度。
链接: https://arxiv.org/abs/2603.20857
作者: Han Jiao,Jiakai Sun,Lei Zhao,Zhanjie Zhang,Wei Xing,Huaizhong Lin
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:3D Gaussian Splatting has demonstrated remarkable real-time rendering capabilities and superior visual quality in novel view synthesis for static scenes. Building upon these advantages, researchers have progressively extended 3D Gaussians to dynamic scene reconstruction. Deformation field-based methods have emerged as a promising approach among various techniques. These methods maintain 3D Gaussian attributes in a canonical field and employ the deformation field to transform this field across temporal sequences. Nevertheless, these approaches frequently encounter challenges such as suboptimal rendering speeds, significant dependence on initial point clouds, and vulnerability to local optima in dim scenes. To overcome these limitations, we present FRoG, an efficient and robust framework for high-quality dynamic scene reconstruction. FRoG integrates per-Gaussian embedding with a coarse-to-fine temporal embedding strategy, accelerating rendering through the early fusion of temporal embeddings. Moreover, to enhance robustness against sparse initializations, we introduce a novel depth- and error-guided sampling strategy. This strategy populates the canonical field with new 3D Gaussians at low-deviation initial positions, significantly reducing the optimization burden on the deformation field and improving detail reconstruction in both static and dynamic regions. Furthermore, by modulating opacity variations, we mitigate the local optima problem in dim scenes, improving color fidelity. Comprehensive experimental results validate that our method achieves accelerated rendering speeds while maintaining state-of-the-art visual quality.
[CV-181] Ensemble of Small Classifiers For Imbalanced White Blood Cell Classification
【速读】:该论文旨在解决白血病诊断中血细胞分类自动化的问题,尤其针对稀有细胞类型分类的挑战,这主要源于染色、扫描差异以及患者间的异质性。解决方案的关键在于提出一种轻量级集成方法,通过在造血过程(Haematopoiesis)中聚焦粒细胞生成(Granulopoiesis)、单核细胞生成(Monocytopoiesis)和淋巴细胞生成(Lymphopoiesis)的生物学特性,结合数据集扩展缓解类别不平衡问题,采用预训练的SwinV2-Tiny、DinoBloom-Small和ConvNeXT-V2-Tiny三种轻量模型构建集成系统,并在分层3折交叉验证框架下训练每种架构的三个实例,最终对输入图像进行9个模型的前向传播并采用logit平均聚合策略,从而实现高精度分类性能。
链接: https://arxiv.org/abs/2603.20856
作者: Siddharth Srivastava,Adam Smith,Scott Brooks,Jack Bacon,Till Bretschneider
机构: University of Warwick (华威大学); Intelligent Imaging Innovations Ltd
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ISBI 2026 WBCBench Challenge
Abstract:Automating white blood cell classification for diagnosis of leukaemia is a promising alternative to time-consuming and resource-intensive examination of cells by expert pathologists. However, designing robust algorithms for classification of rare cell types remains challenging due to variations in staining, scanning and inter-patient heterogeneity. We propose a lightweight ensemble approach for classification of cells during Haematopoiesis, with a focus on the biology of Granulopoiesis, Monocytopoiesis and Lymphopoiesis. Through dataset expansion to alleviate some class imbalance, we demonstrate that a simple ensemble of lightweight pretrained SwinV2-Tiny, DinoBloom-Small and ConvNeXT-V2-Tiny models achieves excellent performance on this challenging dataset. We train 3 instantiations of each architecture in a stratified 3-fold cross-validation framework; for an input image, we forward-pass through all 9 models and aggregate through logit averaging. We further reason on the weaknesses of our model in confusing similar-looking myelocytes in granulopoiesis and lymphocytes in lymphopoiesis. Code: this https URL.
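论文的集成推理步骤(9 个模型各自前向传播后做 logit 平均再取 argmax)可示意如下(模型输出以示例数据代替,非论文官方代码):

```python
import numpy as np

def ensemble_predict(logits_list):
    """对多个模型输出的 logit 取平均后再取 argmax。

    logits_list: 形如 [num_models][num_classes] 的数组
    """
    avg_logits = np.mean(np.asarray(logits_list), axis=0)
    return int(np.argmax(avg_logits)), avg_logits

# 三个假设模型对同一张细胞图像给出的 logit(示例数据)
logits = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [2.5, 0.3, 0.0],
]
pred, avg = ensemble_predict(logits)
print(pred)  # -> 0
```

相比平均 softmax 概率,logit 平均保留了各模型的置信度尺度,是集成中常见的聚合选择。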
[CV-182] Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves CVPR2026
【速读】:该论文旨在解决传统手部-物体交互(Hand-Object Interaction, HOI)视频因缺乏物理信息(如接触力和运动信号)及频繁遮挡而导致的建模困难问题。其核心解决方案是提出Glove2Hand框架,该框架通过多模态传感手套视频生成逼真的裸手图像,并保持底层物理交互动力学的一致性;关键创新在于引入一种新型3D高斯手部模型以确保时序渲染一致性,并结合基于扩散模型的手部修复器实现复杂手物交互与非刚性形变下的场景无缝融合。
链接: https://arxiv.org/abs/2603.20850
作者: Xinyu Zhang,Ziyi Kou,Chuan Qin,Mia Huang,Ergys Ristani,Ankit Kumar,Lele Chen,Kun He,Abdeslam Boularias,Li Guan
机构: Meta Reality Labs (Meta); Rutgers University
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2026
Abstract:Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
[CV-183] GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit
[Quick Read]: This paper addresses the lack of standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics in computational pathology, which hinders the translation of AI-driven computational biomarkers (CBs) into clinical-grade deployment. The key to its solution is the GOLDMARK framework, built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels, which releases structured intermediate representations such as tile coordinate maps, per-slide feature embeddings from canonical pathology foundation models (PFMs), quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Through reciprocal testing on TCGA and an independent MSKCC cohort, GOLDMARK enables stable, reproducible cross-site benchmarking and substantially improves the comparability and credibility of different methods across many tasks.
Link: https://arxiv.org/abs/2603.20848
Authors: Chad Vanderbilt,Gabriele Campanella,Siddharth Singi,Swaraj Nanda,Jie-Fu Chen,Ali Kamali,Amir Momeni Boroujeni,David Kim,Mohamed Yakoub,Jamal Benhamida,Meera Hameed,Neeraj Kumar,Gregory Goldgof
Affiliations: Memorial Sloan Kettering Cancer Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Tissues and Organs (q-bio.TO)
Comments:
Abstract:Computational biomarkers (CBs) are histopathology-derived patterns extracted from hematoxylin-eosin (HE) whole-slide images (WSIs) using artificial intelligence (AI) to predict therapeutic response or prognosis. Recently, slide-level multiple-instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for CB development. While these methods have improved predictive performance, computational pathology lacks standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. We introduce GOLDMARK (this https URL), a standardized benchmarking framework built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels. GOLDMARK releases structured intermediate representations, including tile coordinate maps, per-slide feature embeddings from canonical PFMs, quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing. Across 33 tumor-biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). Restricting to the eight highest-performing tasks yielded mean AUROCs of 0.831 and 0.801, respectively. These tasks correspond to established morphologic-genomic associations (e.g., LGG IDH1, COAD MSI/BRAF, THCA BRAF/NRAS, BLCA FGFR3, UCEC PTEN) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability. GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models. 
[CV-184] MERIT: Multi-domain Efficient RAW Image Translation
[Quick Read]: This paper targets multi-domain RAW image translation: RAW images captured by different camera sensors exhibit substantial domain shifts due to differing spectral responses, noise characteristics, and tone behaviors, making them hard to use directly in downstream computer vision tasks. Existing methods train a separate translator for every source-target pair and thus do not scale to real-world settings involving many commercial cameras. The key innovations of the proposed MERIT framework are: 1) a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of generated images with those of the target domain; and 2) a conditional multi-scale large-kernel attention module that strengthens contextual and sensor-specific feature modeling. The authors also build the MDRAW dataset for standardized evaluation; experiments show MERIT surpasses prior methods in both image quality (a 5.56 dB improvement) and training efficiency (80% fewer iterations).
Link: https://arxiv.org/abs/2603.20836
Authors: Wenjun Huang,Shenghao Fu,Yian Jin,Yang Ni,Ziteng Cui,Hanning Chen,Yirui He,Yezi Liu,Sanggeon Yun,SungHeon Jeong,Ryozo Masukawa,William Youngwoo Chung,Mohsen Imani
Affiliations: University of California, Irvine; University of Pennsylvania; Purdue University Northwest; The University of Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. We further enhance the generator with a conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (5.56 dB improvement) and scalability (80% reduction in training iterations).
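The sensor-aware noise modeling idea, matching signal-dependent noise statistics between generated and target-domain images, can be illustrated with a per-intensity-bin variance comparison. This is a toy stand-in under our own assumptions (scalar pixels, a fixed bin scheme, hypothetical function names), not MERIT's actual loss:

```python
def binned_noise_variance(intensities, residuals, n_bins=4):
    """Group noise residuals by signal intensity (in [0, 1]) and return the
    residual variance per bin, modeling signal-dependent noise."""
    bins = [[] for _ in range(n_bins)]
    for x, r in zip(intensities, residuals):
        idx = min(int(x * n_bins), n_bins - 1)
        bins[idx].append(r)
    out = []
    for b in bins:
        if not b:
            out.append(0.0)
            continue
        mean = sum(b) / len(b)
        out.append(sum((r - mean) ** 2 for r in b) / len(b))
    return out


def noise_alignment_loss(gen_stats, tgt_stats):
    """Squared distance between per-bin noise variances of generated and
    target-domain images: the quantity a noise-aware loss would shrink."""
    return sum((g - t) ** 2 for g, t in zip(gen_stats, tgt_stats))
```

The loss is zero when the generated images reproduce the target sensor's intensity-dependent noise profile, which is the alignment the abstract describes.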
[CV-185] EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis
[Quick Read]: This paper targets a critical weakness of text-to-image diffusion models on implicit prompts: the models lack deep world knowledge, spanning the natural sciences and cultural commonsense, so generation contradicts facts. The root cause is the chaotic organization of knowledge inside diffusion models, which makes implicit prompts hard to map accurately into image space. The key to the solution, EruDiff, is twofold: first, Diffusion Knowledge Distribution Matching (DK-DM) registers the knowledge distribution of intractable implicit prompts with well-structured explicit anchors; second, a Negative-Only Reinforcement Learning (NO-RL) strategy applies fine-grained correction to the inherent biases of explicit prompt rendering. Experiments show the method markedly improves leading diffusion models such as FLUX and Qwen-Image on both the scientific-knowledge benchmark (Science-T2I) and the world-knowledge benchmark (WISE), demonstrating its effectiveness and generalizability.
Link: https://arxiv.org/abs/2603.20828
Authors: Xiefan Guo,Xinzhu Ma,Haoxiang Ma,Zihao Zhou,Di Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image diffusion models have achieved remarkable fidelity in synthesizing images from explicit text prompts, yet exhibit a critical deficiency in processing implicit prompts that require deep-level world knowledge, ranging from natural sciences to cultural commonsense, resulting in counter-factual synthesis. This paper traces the root of this limitation to a fundamental dislocation of the underlying knowledge structures, manifesting as a chaotic organization of implicit prompts compared to their explicit counterparts. In this paper, we propose EruDiff, which aims to refactor the knowledge within diffusion models. Specifically, we develop the Diffusion Knowledge Distribution Matching (DK-DM) to register the knowledge distribution of intractable implicit prompts with that of well-defined explicit anchors. Furthermore, to rectify the inherent biases in explicit prompt rendering, we employ the Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. Rigorous empirical evaluations demonstrate that our method significantly enhances the performance of leading diffusion models, including FLUX and Qwen-Image, across both the scientific knowledge benchmark (i.e., Science-T2I) and the world knowledge benchmark (i.e., WISE), underscoring the effectiveness and generalizability. Our code is available at this https URL.
[CV-186] PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching CVPR2026
[Quick Read]: This paper studies lightweight 6-DoF camera relocalization in structured environments. Where traditional structure-based methods rely on point correspondences, this work builds the relocalization framework around planar primitives. The key to the solution is PlanaReLoc, a plane-centric paradigm in which a deep matcher associates planar primitives between the query image and a 3D planar map within a learned unified embedding space, after which the pose is solved and refined, achieving efficient and robust relocalization without realistically textured/colored maps, pose priors, or per-scene training.
Link: https://arxiv.org/abs/2603.20818
Authors: Hanqiao Ye,Yuzhou Liu,Yangdong Liu,Shuhan Shen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2026. 20 pages, 15 figures. Code at this https URL
Abstract:While structure-based relocalizers have long strived for point correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of planar primitives and 3D planar maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined plane-centric paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through comprehensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modal structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. The code and data are available at this https URL .
[CV-187] Lean Learning Beyond Clouds: Efficient Discrepancy-Conditioned Optical-SAR Fusion for Semantic Segmentation
[Quick Read]: This paper addresses the severe damage cloud occlusion inflicts on the semantic integrity of optical remote sensing imagery, and in particular the challenge of efficient global modeling and reliable cross-modal fusion when Synthetic Aperture Radar (SAR) data are incorporated. Existing methods depend on dense global attention to capture long-range dependencies, but such aggregation indiscriminately propagates cloud-induced noise, and improving robustness usually requires enlarging model capacity, raising computational cost beyond what large-scale, high-resolution remote sensing deployments can afford. The key to the solution is EDC, an efficiency-oriented, discrepancy-conditioned optical-SAR semantic segmentation framework: first, a tri-stream encoder with Carrier Tokens enables compact global context modeling at reduced complexity; second, a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism selectively suppresses unreliable regions during global aggregation to prevent noise contamination; third, an auxiliary cloud removal branch with teacher-guided distillation strengthens semantic consistency under occlusion. The approach maintains high accuracy while substantially reducing parameters and accelerating inference, validating a balanced optimization of efficiency and reliability.
Link: https://arxiv.org/abs/2603.20811
Authors: Chenxing Meng,Wuzhou Quan,Yingjie Cai,Liqun Cao,Liyan Zhang,Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures
Abstract:Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross-modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long-range dependencies, yet such aggregation indiscriminately propagates cloud-induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large-scale and high-resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency-reliability trade-off. To address this dilemma, we propose EDC, an efficiency-oriented and discrepancy-conditioned optical-SAR semantic segmentation framework. A tri-stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher-guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56% and 0.88% on M3M-CR and WHU-OPT-SAR, respectively, while reducing the number of parameters by 46.7% and accelerating inference by 1.98 \times . Our implementation is available at this https URL.
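The discrepancy-conditioned idea, down-weighting unreliable (e.g., cloud-covered) regions during optical-SAR fusion, can be illustrated with a simple per-pixel gate. This is our toy stand-in for DCHF under stated assumptions (raw scalar pixels, an absolute-difference discrepancy proxy, a hypothetical `beta` temperature); the real module operates on deep features:

```python
import math


def discrepancy_gate(optical, sar, beta=4.0):
    """Fuse optical and SAR values per pixel. Where the two modalities
    disagree strongly (large discrepancy, e.g., under clouds), the gate
    shifts weight away from the optical channel toward SAR."""
    fused = []
    for o, s in zip(optical, sar):
        d = abs(o - s)           # proxy for cross-modal discrepancy
        w = math.exp(-beta * d)  # w -> 0 when modalities disagree
        fused.append(w * o + (1 - w) * s)
    return fused
```

When the modalities agree the optical value passes through unchanged; under strong disagreement the output falls back to the (cloud-insensitive) SAR value, which is the selective suppression the abstract names.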
[CV-188] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models CVPR2026
[Quick Read]: This paper addresses the degradation of internal visual representations in Multimodal Large Language Models (MLLMs) caused by training toward a text-generation objective. The study finds that, compared with the initial visual features, the visual representations in the middle layers of MLLMs deteriorate both in global function and in local patch structure, a consequence of the singular text-generation objective sacrificing visual information. The key to the solution is Predictive Regularization (PRe), which forces intermediate-layer features to predict the original visual features, thereby preserving the inherent visual attributes of the model's internal representations; this effectively improves vision-language performance and underscores the importance of robust internal visual representations for comprehensive multimodal understanding.
Link: https://arxiv.org/abs/2603.20808
Authors: Enguang Wang,Qiang Wang,Yuanchen Wu,Ke Yan,Xinbin Yuan,Shouhong Ding,Xialei Liu,Ming-Ming Cheng
Affiliations: NKIARI, Shenzhen Futian; VCIP, CS, Nankai University; AAIS, Nankai University; Tencent Youtu Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2026
Abstract:While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of LLM exhibits both a degradation in global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM’s internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
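The core of Predictive Regularization, making degraded intermediate features predict the frozen initial visual features, reduces to an auxiliary mean-squared-error term added to the text-generation loss. The sketch below uses scalar "features", a hypothetical projector, and a weighting constant `lam` of our choosing; the paper's actual heads and layer choices differ:

```python
def pre_loss(mid_feats, init_feats, proj):
    """Mean squared error between projected intermediate-layer features and
    the initial visual features they are trained to predict."""
    preds = [proj(f) for f in mid_feats]
    return sum((p - v) ** 2 for p, v in zip(preds, init_feats)) / len(init_feats)


def total_loss(text_loss, mid_feats, init_feats, proj, lam=0.1):
    """Text-generation objective plus the predictive-regularization term,
    so the model cannot freely sacrifice visual fidelity for answer quality."""
    return text_loss + lam * pre_loss(mid_feats, init_feats, proj)
```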
[CV-189] Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification
[Quick Read]: This paper studies how to capture both fine-grained lesions and large-scale retinal structure in multi-label fundus image diagnosis. Conventional multi-scale medical vision models often rely on explicit frequency-decomposition strategies (such as Octave Convolution or wavelet transforms), but the experiments show such heuristics yield limited benefit for fundus diagnosis and can add computational cost without improving performance. The key to the solution is Clifford-M, a lightweight backbone that replaces feed-forward expansion and frequency-splitting modules with sparse geometric interaction; its Clifford-style rolling product jointly models alignment and structural variation at linear complexity, enabling efficient cross-scale fusion and self-refinement. Without pre-training, the design reaches a 0.8142 mean AUC-ROC and 0.5481 mean macro-F1 on ODIR-5K and shows reasonable cross-domain robustness on RFMiD, suggesting that directly modeling multi-scale structural interaction is more effective than explicit frequency engineering.
Link: https://arxiv.org/abs/2603.20806
Authors: Yifeng Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 3 figures, 8 tables
Abstract:Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation by 2.23x without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift. These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.
[CV-190] Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation
[Quick Read]: This paper addresses the limit partial observability places on knowledge accumulation in Vision-Language Navigation (VLN) systems: an agent can only learn and decide from locations it has personally visited. The key to the solution is Co-VLN, a minimalist, model-agnostic framework in which agents navigating the same environment exchange structured perceptual memory whenever they identify common traversed locations, expanding each agent's receptive field at no additional exploration cost and enabling cross-agent knowledge sharing and collaboration.
Link: https://arxiv.org/abs/2603.20804
Authors: Qunchao Jin,Yiliao Song,Qi Wu
Affiliations: Australian Institute for Machine Learning; Adelaide University; Responsible AI Research Centre
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Vision-Language Navigation (VLN) systems are fundamentally constrained by partial observability, as an agent can only accumulate knowledge from locations it has personally visited. As multiple robots increasingly coexist in shared environments, a natural question arises: can agents navigating the same space benefit from each other’s observations? In this work, we introduce Co-VLN, a minimalist, model-agnostic framework for systematically investigating whether and how peer observations from concurrently navigating agents can benefit VLN. When independently navigating agents identify common traversed locations, they exchange structured perceptual memory, effectively expanding each agent’s receptive field at no additional exploration cost. We validate our framework on the R2R benchmark under two representative paradigms (the learning-based DUET and the zero-shot MapGPT), and conduct extensive analytical experiments to systematically reveal the underlying dynamics of peer observation sharing in VLN. Results demonstrate that vision-sharing enabled model yields substantial performance improvements across both paradigms, establishing a strong foundation for future research in collaborative embodied navigation.
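The sharing rule, exchanging perceptual memory only when two agents have at least one common traversed location, can be sketched as a keyed-dictionary merge. This is illustrative only: Co-VLN exchanges structured perceptual features rather than plain values, and the names below are ours:

```python
def exchange_memory(mem_a, mem_b):
    """Each memory maps location id -> observation. If the agents share at
    least one traversed location, each receives the peer's observations for
    locations it has not visited itself, expanding its receptive field.
    Each agent keeps its own observation when both visited a location."""
    common = set(mem_a) & set(mem_b)
    if not common:
        return mem_a, mem_b  # no shared anchor: no exchange
    new_a = {**{k: v for k, v in mem_b.items() if k not in mem_a}, **mem_a}
    new_b = {**{k: v for k, v in mem_a.items() if k not in mem_b}, **mem_b}
    return new_a, new_b
```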
[CV-191] ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking
[Quick Read]: This paper addresses discrete collapse in reasoning-induced vision-language models (VLMs) for image quality assessment (IQA): their scalar scores lack sensitivity and collapse to a few values. The key to the solution is ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework whose core mechanisms are: (i) building a memory bank and using reasoning summaries to retrieve semantically and perceptually aligned neighbors; (ii) reframing the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fusing this ordinal evidence with the initial score under Thurstone's Case V model; and (iii) gated reflection and memory consolidation to improve future decisions. The approach yields denser, distortion-sensitive predictions and effectively mitigates discrete collapse.
Link: https://arxiv.org/abs/2603.20785
Authors: Kanglong Fan,Tianhe Wu,Wen Wen,Jianzhao Liu,Le Yang,Yabin Zhang,Yiting Liao,Junlin Li,Li Zhang
Affiliations: City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reasoning-induced vision-language models (VLMs) advance image quality assessment (IQA) with textual reasoning, yet their scalar scores often lack sensitivity and collapse to a few values, so-called discrete collapse. We introduce ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework. It (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries, (ii) reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fuse this ordinal evidence with the initial score under Thurstone’s Case V model, and (iii) performs gated reflection and consolidates memory to improve future decisions. This yields denser, distortion-sensitive predictions and mitigates discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
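Step (ii), turning pairwise preference probabilities into scale values under Thurstone's Case V, has a standard closed-form sketch via the inverse normal CDF: the scale value of item i is the mean of z-scores of its win probabilities. The clipping constant and the linear fusion weight `alpha` below are our choices, not the paper's:

```python
from statistics import NormalDist


def thurstone_case_v(pref):
    """pref[i][j] = P(image i judged better than image j).
    Case V scale value of item i: mean over j of z_ij = Phi^-1(pref[i][j])."""
    nd, n, eps = NormalDist(), len(pref), 1e-4
    scores = []
    for i in range(n):
        zs = [nd.inv_cdf(min(max(pref[i][j], eps), 1 - eps))
              for j in range(n) if j != i]
        scores.append(sum(zs) / len(zs))
    return scores


def fuse(initial, ordinal, alpha=0.5):
    """Blend the VLM's initial scalar scores with the ordinal Case V scores."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(initial, ordinal)]
```

Because the ordinal scores vary continuously with the preference probabilities, the fused output is denser than the collapsed scalar scores alone.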
[CV-192] MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction CVPR2026
[Quick Read]: This paper tackles a common failure of learning-based edge detectors trained with cross-entropy loss: predicted edges are too thick, at odds with the crisp single-pixel annotations humans provide. The key to the solution is the Masked Edge Prediction MOdel (MEMO), which relies on cross-entropy loss alone, with the innovation lying in its training and inference strategy: pre-training on a large-scale synthetic edge dataset improves generalization, after which fine-tuning on downstream datasets needs only a lightweight module adding 1.2% extra parameters; at inference, exploiting the confidence gradient of thick edge predictions (high in the center, lower toward the boundaries), a progressive prediction strategy finalizes edges in order of confidence, producing visually appealing, post-processing-free, near-human-quality crisp edge maps.
Link: https://arxiv.org/abs/2603.20782
Authors: Jiaxin Cheng,Yue Wu,Yicong Zhou
Affiliations: University of Macau; Capital One
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026
Abstract:Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.
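The progressive inference idea, finalizing edge pixels in decreasing confidence order so a thick response collapses to its confident center, can be shown on a 1-D confidence profile. This is a toy non-max-suppression-style sketch with a radius and threshold of our choosing; MEMO's actual procedure operates on 2-D masked predictions:

```python
def progressive_select(conf, radius=1, thresh=0.5):
    """Accept positions in decreasing confidence order; a position is
    finalized as an edge only if no already-finalized edge lies within
    `radius` of it, so thick responses thin to their confident centers."""
    chosen = []
    for i in sorted(range(len(conf)), key=lambda k: -conf[k]):
        if conf[i] < thresh:
            break  # remaining candidates are even less confident
        if all(abs(i - j) > radius for j in chosen):
            chosen.append(i)
    return sorted(chosen)


# A thick 3-pixel response (0.6, 0.9, 0.7) thins to its confident center:
edges = progressive_select([0.1, 0.6, 0.9, 0.7, 0.1])
```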
[CV-193] PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
[Quick Read]: This paper addresses the difficulty of simultaneous ego-pose estimation and target geo-localization for UAVs in GNSS-denied environments. Conventional methods rely on decoupled pipelines, fusing GNSS with Visual-Inertial Odometry (VIO) for ego-localization and using active sensors such as laser rangefinders for target localization, suffering poor environmental robustness, high hardware cost, and system complexity. The key to the solution is the unified PiLoT framework, which localizes end-to-end by directly registering the live video stream against a geo-referenced 3D map. Its core innovations are: 1) a dual-thread engine that separates map rendering from localization computation, ensuring low latency and drift-free accuracy; 2) a large-scale synthetic dataset enabling zero-shot sim-to-real transfer of a lightweight network; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that improves convergence robustness under aggressive motion. Experiments show the method runs in real time at over 25 FPS on an NVIDIA Jetson Orin while clearly outperforming the prior state of the art.
Link: https://arxiv.org/abs/2603.20778
Authors: Xiaoya Cheng,Long Wang,Yan Liu,Xinyi Liu,Hanlin Tan,Yu Liu,Maojun Zhang,Shen Yan
Affiliations: National University of Defense Technology; Zhejiang University; Westlake University; Hangzhou Dianzi University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering the live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running at over 25 FPS on the NVIDIA Jetson Orin platform. Our code and dataset are available at: this https URL.
[CV-194] OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation ICLR2026
[Quick Read]: This paper addresses the vulnerability of semantic segmentation models for autonomous driving to black-box adversarial attacks, where existing methods struggle to produce transferable perturbations without access to target model parameters. The key to the solution is the OmniPatch training framework, which learns a universal adversarial patch that generalizes across images and transfers across architectures (both ViTs and CNNs) without requiring access to the target model's weights.
Link: https://arxiv.org/abs/2603.20777
Authors: Aarush Aggarwal,Akshat Tomar,Amritanshu Tiwari,Sargam Goyal
Affiliations: Indian Institute of Technology Roorkee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, ICLR 2026: Principled Design for Trustworthy AI
Abstract:Robust semantic segmentation is crucial for safe autonomous driving, yet deployed models remain vulnerable to black-box adversarial attacks when target weights are unknown. Most existing approaches either craft image-wide perturbations or optimize patches for a single architecture, which limits their practicality and transferability. We introduce OmniPatch, a training framework for learning a universal adversarial patch that generalizes across images and both ViT and CNN architectures without requiring access to target model parameters.
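The generic recipe behind universal-patch training, ascending the average loss of several surrogate models over many images with a single shared patch, can be sketched with a scalar toy. All names here are hypothetical stand-ins under our assumptions: the "patch" is one number, the "models" are scalar loss functions, and gradients are taken numerically instead of by backpropagation:

```python
def train_universal_patch(loss_fns, images, steps=200, lr=0.1):
    """Gradient-ascent sketch of universal patch training: one shared scalar
    'patch' is optimized to raise the average loss over several surrogate
    models and images (stand-ins for ViT/CNN segmenters)."""
    patch = 0.0
    for _ in range(steps):
        grad, eps = 0.0, 1e-4
        for loss in loss_fns:
            for x in images:
                # central-difference gradient of the loss w.r.t. the patch
                grad += (loss(x + patch + eps) - loss(x + patch - eps)) / (2 * eps)
        patch += lr * grad / (len(loss_fns) * len(images))  # ascend
        patch = max(-1.0, min(1.0, patch))  # keep within a valid pixel range
    return patch
```

Averaging the ascent direction over multiple architectures is what pushes the patch toward cross-architecture transfer rather than overfitting one model.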
[CV-195] Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping CVPR2026
[Quick Read]: This paper addresses the high computational complexity and memory consumption of fine-tuning Diffusion Transformers (DiTs) for text-to-image (T2I) generation, which restricts practical deployment on resource-constrained devices such as smartphones and IoT hardware. The key to the solution is DiT-BlockSkip, a memory-efficient fine-tuning framework whose core innovations are: 1) timestep-aware dynamic patch sampling that adjusts patch size by diffusion timestep and resizes the crops to a fixed lower resolution, cutting memory in the forward and backward passes while preserving global structure at high timesteps and fine detail at low timesteps; 2) block skipping via precomputed residual features, fine-tuning only the essential transformer blocks to sharply reduce training memory; and 3) a block selection strategy based on cross-attention masking that pinpoints the modules critical for personalization. The method preserves high-quality personalized generation while substantially lowering memory usage, advancing the on-device feasibility of large-scale diffusion models.
Link: https://arxiv.org/abs/2603.20755
Authors: Sunghyun Park,Jeongho Kim,Hyoungwoo Park,Debasmit Das,Sungrack Yun,Munawar Hayat,Jaegul Choo,Fatih Porikli,Seokeon Choi
Affiliations: Qualcomm AI Research; KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026; 20 pages
Abstract:Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
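The timestep-aware sampling rule, larger crops at noisier (higher) timesteps for global structure and smaller crops at lower timesteps for detail, all resized to one fixed low training resolution, can be sketched as a simple mapping. The size ladder and thresholds below are our illustrative choices, not the paper's schedule:

```python
def patch_size_for_timestep(t, t_max=1000, sizes=(64, 128, 256)):
    """Map a diffusion timestep to a crop size (before resizing the crop to
    a fixed low resolution): late, noisy steps get large crops that carry
    global structure; early steps get small crops that carry fine detail."""
    frac = t / t_max
    idx = min(int(frac * len(sizes)), len(sizes) - 1)
    return sizes[idx]
```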
[CV-196] Smart Operation Theatre: An AI-based System for Surgical Gauze Counting
[Quick Read]: This paper targets the problem of surgical gauze retained inside patients ("gossypiboma"), which can cause serious complications as well as malpractice disputes and regulatory penalties for hospitals. The key to the solution is an AI-based gauze counting system that uses the YOLOv5 deep learning model for real-time video monitoring and object recognition, tracking gauze flow across two designated trays ("In" and "Out") to verify pre- and post-operative counts and prevent retained gauze. Compared with an earlier version that used two separate models (one for people, one for gauzes), the new system integrates multi-object detection in a single model, grows the training set to 11,000 images, raises the frame rate to 15 FPS, and lets doctors adjust counts manually, markedly improving accuracy and clinical practicality.
Link: https://arxiv.org/abs/2603.20752
Authors: Saraf Krish,Cai Yiyu,Huang Li Hui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:During surgeries, there is a risk of medical gauzes being left inside patients’ bodies, leading to “gossypiboma”, which can cause serious complications for patients and create legal problems for hospitals through malpractice lawsuits and regulatory penalties. Diagnosis depends on imaging methods such as X-rays or CT scans, and the usual treatment involves surgical excision. Prevention methods, such as manual counts and RFID-integrated gauzes, aim to minimize gossypiboma risks. However, manual tallying of hundreds of gauzes by nurses is time-consuming and diverts resources from patient care. In partnership with Singapore General Hospital (SGH), we have developed a new prevention method: an AI-based system for gauze counting in surgical settings. Utilizing real-time video surveillance and object recognition technology powered by YOLOv5, a deep learning model was designed to monitor gauzes on two designated trays labelled “In” and “Out”. Gauzes are tracked from the “In” tray, prior to their use in the patient’s body, to the “Out” tray post-use, ensuring accurate counting and verifying that no gauze remains inside the patient at the end of the surgery. We trained the model on numerous images from operation theatres, augmented to cover all possible scenarios. This study also addresses the shortcomings of previous project iterations. Previously, the project employed two models: one for human detection and another for gauze detection, trained on a total of 2800 images. We now have an integrated model capable of identifying both humans and gauzes, using a training set of 11,000 images. This has improved accuracy and increased the frame rate from 8 FPS to 15 FPS. Incorporating doctors’ feedback, the system now also supports manual count adjustments, enhancing its reliability in actual surgeries.
[CV-197] CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration CVPR2026
[Quick Read]: This paper addresses the difficulty of achieving precise alignment between text prompts and generated images in text-to-image synthesis. Existing diffusion-based methods are limited by the implicit supervision of the conventional diffusion loss, which under-models fine-grained text-image correspondence. The key to the solution is Cross-Timestep Self-Calibration (CTCal): the reliable text-image alignment (i.e., cross-attention maps) already formed at smaller, less noisy timesteps is used to calibrate representation learning at larger, noisier timesteps, introducing explicit supervision during training. A timestep-aware adaptive weighting scheme further harmonizes CTCal with the diffusion loss. The method is model-agnostic and integrates seamlessly into diffusion-based models (e.g., SD 2.1) and flow-based models (e.g., SD 3), with its effectiveness and generalizability validated on the T2I-CompBench++ and GenEval benchmarks.
Link: https://arxiv.org/abs/2603.20741
Authors: Xiefan Guo,Xinzhu Ma,Haiyu Zhang,Di Huang
Affiliations: Beihang University; Shanghai Artificial Intelligence Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at this https URL.
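The training objective described above, a diffusion loss plus a cross-timestep calibration term that pulls the noisy large-timestep attention maps toward the more reliable small-timestep maps, with a weight growing in the timestep, can be sketched as follows. The linear weighting scheme and the function names are our assumptions, not the paper's exact formulation:

```python
def ctcal_term(attn_large_t, attn_small_t):
    """MSE between the cross-attention map at a large (noisy) timestep and
    the more reliable map from a smaller timestep of the same prompt, which
    serves as the calibration target."""
    n = len(attn_large_t)
    return sum((a - b) ** 2 for a, b in zip(attn_large_t, attn_small_t)) / n


def training_loss(diff_loss, attn_large_t, attn_small_t, t, t_max=1000):
    """Timestep-aware adaptive weighting: alignment is harder at larger,
    noisier timesteps, so the calibration term's weight grows with t."""
    w = t / t_max
    return diff_loss + w * ctcal_term(attn_large_t, attn_small_t)
```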
[CV-198] Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding CVPR2026
【速读】:该论文旨在解决多任务域泛化(multi-task domain generalization, DG)中现有Transformer和Mamba架构性能下降的问题,特别是由于缺乏结构感知能力导致的跨域特征漂移与序列建模不稳定。其核心解决方案是提出Structure-Aware Domain Generalization (SADG),关键在于设计了两个创新模块:一是结构感知序列化(Structure-aware Serialization, SAS),通过基于质心拓扑和测地曲率连续性的方法生成变换不变的序列,从而保留点云的结构层次;二是分层域感知建模(Hierarchical Domain-aware Modeling, HDM),通过整合域内结构并融合域间关系来稳定跨域推理。此外,在测试阶段引入轻量级谱图对齐(Spectral Graph Alignment, SGA),在不更新模型参数的前提下将目标域特征映射至源域原型空间,实现结构保持的特征迁移。
链接: https://arxiv.org/abs/2603.20739
作者: Jincen Jiang,Qianyu Zhou,Yuhang Li,Kui Su,Meili Wang,Jian Chang,Jian Jun Zhang,Xuequan Lu
机构: Bournemouth University (伯恩茅斯大学); Jilin University (吉林大学); The University of Western Australia (西澳大利亚大学); Hangzhou City University (杭州城市学院); Northwest AF University (空军工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.
[CV-199] SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval CVPR2026
【速读】:该论文旨在解决跨被试脑电图(EEG)到图像的检索任务中因个体差异(subject shift)和嵌入空间中的“枢纽效应”(hubness)导致的相似性几何失真问题,这会破坏top-k排名的稳定性,使小k值短列表不可靠。解决方案的关键在于提出一种无需标签的测试时校准头SATTC(Structure-Aware Test-Time Calibration),其核心机制包括两个专家模块:一是几何专家,通过自适应白化EEG嵌入与改进的跨域相似性局部缩放(CSLS)增强相似性分布;二是结构专家,基于互近邻、双向top-k排名和类别流行度构建结构信息,并通过Product-of-Experts规则融合二者。SATTC直接作用于冻结的EEG与图像编码器输出的相似性矩阵,显著提升跨被试检索性能,降低枢纽效应与类不平衡,且具有编码器无关性,可作为通用的测试时校准层。
链接: https://arxiv.org/abs/2603.20738
作者: Qunjie Huang,Weina Zhu
机构: Yunnan University (云南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Official code: this https URL
Abstract:Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert, subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, L2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding.
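SATTC's geometric expert builds on Cross-domain Similarity Local Scaling. Below is a minimal sketch of standard CSLS on a precomputed query-by-candidate similarity matrix; the paper uses an adaptive variant whose details are not in the abstract:

```python
import numpy as np

def csls(sim, k=2):
    """Standard CSLS: penalize 'hub' candidates that are close to everything.
    CSLS = 2*sim - r_query - r_candidate, where the r terms are mean
    similarities to the k nearest cross-domain neighbors."""
    # mean similarity of each query (row) to its k closest candidates
    r_q = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # mean similarity of each candidate (column) to its k closest queries
    r_c = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2.0 * sim - r_q - r_c
```

Matched pairs keep a relative advantage while candidates that score highly against many queries are uniformly discounted, which is what reduces hubness in the top-k shortlists.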
[CV-200] VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation
【速读】:该论文旨在解决低质量视频场景下多目标跟踪(Multi-Object Tracking, MOT)算法性能显著下降的问题,其核心原因在于图像信息丢失导致现有方法难以有效建模目标特征。解决方案的关键在于提出一种基于视觉语义蒸馏(Visual Semantic Distillation, VSD-MOT)的框架:首先利用CLIP图像编码器作为教师模型提取全局视觉语义信息以补偿低质图像中的信息损失;其次设计双约束语义蒸馏(Dual-Constraint Semantic Distillation, DCSD)方法,使学生模型能够从教师模型中学习到适用于MOT任务的语义特征表示;最后引入动态语义权重调节模块(Dynamic Semantic Weight Regulation, DSWR),根据实时帧质量评估自适应调整语义信息与原始特征的融合权重,从而提升模型在动态变化的低质量视频中的鲁棒性。
链接: https://arxiv.org/abs/2603.20731
作者: Jun Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms’ inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.
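The teacher-student transfer described above can be illustrated with a generic feature-distillation term. This is a simplified stand-in, not the paper's Dual-Constraint Semantic Distillation, whose exact constraints are not specified in the abstract:

```python
import numpy as np

def semantic_distill_loss(student_feat, teacher_feat):
    """Minimal feature-distillation sketch: align the student tracker's global
    feature with the frozen CLIP image encoder's output via cosine distance
    (0 when the directions match)."""
    s = student_feat / np.linalg.norm(student_feat)
    t = teacher_feat / np.linalg.norm(teacher_feat)
    return 1.0 - float(np.dot(s, t))
```

Because only the lightweight student runs at inference time, the tracker gains semantic robustness on low-quality frames without paying the CLIP encoder's runtime cost.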
[CV-201] Weakly supervised multimodal segmentation of acoustic borehole images with depth-aware cross-attention
【速读】:该论文旨在解决声波井壁成像(acoustic borehole images)在大规模解释中面临的挑战,即缺乏密集专家标注且地下信息具有多模态特性(multimodal),导致传统方法难以有效利用图像纹理与深度对齐的测井曲线(well-logs)进行精准分割。解决方案的关键在于提出一种弱监督多模态分割框架,通过学习模型对阈值引导的伪标签(threshold-guided pseudo-labels)进行精炼,在保持无标注特性的同时引入去噪、置信度感知的伪监督和物理结构化的融合机制;其中最具优势的是“置信度门控深度感知交叉注意力”(confidence-gated depth-aware cross-attention, CG-DCA)策略,其性能提升主要源于置信度感知融合与结构化局部深度交互,而非单纯模型复杂度增加。
链接: https://arxiv.org/abs/2603.20729
作者: Jose Luis Lima de Jesus Silva
机构: Federal University of Bahia (巴伊亚联邦大学); Grupo de Estudos e Aplicação de Inteligência Artificial em Geofísica (GAIA) (地球物理人工智能研究与应用小组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
备注:
Abstract:Acoustic borehole images provide high-resolution borehole-wall structure, but large-scale interpretation remains difficult because dense expert annotations are rarely available and subsurface information is intrinsically multimodal. The challenge is developing weakly supervised methods combining two-dimensional image texture with depth-aligned one-dimensional well-logs. Here, we introduce a weakly supervised multimodal segmentation framework that refines threshold-guided pseudo-labels through learned models. This preserves the annotation-free character of classical thresholding and clustering workflows while extending them with denoising, confidence-aware pseudo-supervision, and physically structured fusion. We establish that threshold-guided learned refinement provides the most robust improvement over raw thresholding, denoised thresholding, and latent clustering baselines. Multimodal performance depends strongly on fusion strategy: direct concatenation provides limited gains, whereas depth-aware cross-attention, gated fusion, and confidence-aware modulation substantially improve agreement with the weak supervisory reference. The strongest model, confidence-gated depth-aware cross-attention (CG-DCA), consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Targeted ablations show its advantage depends specifically on confidence-aware fusion and structured local depth interaction rather than model complexity alone. Cross-well analyses confirm this performance is broadly stable. These results establish a practical, scalable framework for annotation-free segmentation, showing multimodal improvement is maximized when auxiliary logs are incorporated selectively and in a depth-aware manner.

[CV-202] Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
【速读】:该论文旨在解决文本到图像生成(text-to-image generation)中用户偏好难以准确捕捉的问题。现有方法依赖多模态大语言模型推断用户偏好,但生成的提示词或潜在编码往往不能忠实反映用户真实意图,导致个性化效果不佳。其解决方案的关键在于提出Premier框架,通过引入可学习的用户偏好嵌入(preference embedding)与一个偏好适配器(preference adapter),将用户嵌入与文本提示融合,并进一步用融合后的偏好嵌入调制生成过程,从而实现更精细的偏好控制;同时设计了分散损失(dispersion loss)以增强不同用户偏好嵌入之间的区分度,提升输出与用户特定风格的一致性。在数据稀缺场景下,新用户偏好通过已有用户嵌入的线性组合进行泛化建模,显著提升了个性化生成的鲁棒性和准确性。
链接: https://arxiv.org/abs/2603.20725
作者: Zihao Wang,Yuxiang Wei,Xinpeng Zhou,Tianyu Zhang,Tao Liang,Yalong Bai,Hongzhi Zhang,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); Duxiaoman
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user’s preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
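The dispersion loss that enforces separation among user embeddings admits a simple sketch. The mean-pairwise-cosine form below is an assumption for illustration; the paper's exact formulation is not given in the abstract:

```python
import numpy as np

def dispersion_loss(user_embs):
    """Sketch of a dispersion term (assumed form): penalize the mean pairwise
    cosine similarity among user preference embeddings so that different
    users' embeddings spread apart on the unit sphere."""
    e = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)
    sim = e @ e.T
    n = len(e)
    off_diag = sim[~np.eye(n, dtype=bool)]   # drop self-similarities
    return float(off_diag.mean())
```

Minimizing this term pushes embeddings toward orthogonality, which is consistent with the paper's goal of making individual preferences more distinct.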
[CV-203] Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark CVPR2026
【速读】:该论文旨在解决无人机(UAV)图像中基于文本的人体检索问题,即从高空视角的图像中识别与目击者描述相匹配的目标个体。由于无人机拍摄角度和飞行高度变化剧烈,导致视觉信息退化严重,使得文本与图像之间的语义对齐变得极具挑战性。解决方案的关键在于提出一种跨模态模糊对齐网络(Cross-modal Fuzzy Alignment Network),其核心创新包括:一是设计了模糊令牌对齐模块(Fuzzy Token Alignment module),利用模糊隶属函数动态建模词元级别的关联强度,有效抑制不可见或噪声词元的影响,从而缓解因视觉线索缺失造成的语义不一致问题;二是引入地面视角图像作为桥接代理,设计上下文感知动态对齐模块(Context-Aware Dynamic Alignment module),自适应融合直接对齐与代理辅助对齐机制,进一步缩小空中图像与文本描述之间的语义鸿沟。此外,作者还构建了一个大规模基准数据集 AERI-PEDES,通过链式思维分解文本生成过程以提升描述准确性与语义一致性,验证了方法的有效性。
链接: https://arxiv.org/abs/2603.20721
作者: Yifei Deng,Chenglong Li,Yuyang Zhang,Guyue Hu,Jin Tang
机构: Anhui University (安徽大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 main track
Abstract:Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text–image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text–aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text–aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.
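The Fuzzy Token Alignment idea, mapping token-level similarity to a reliability weight through a membership function, can be sketched as follows. The sigmoid membership and its parameters `a`, `b` are illustrative assumptions, not the paper's function:

```python
import numpy as np

def fuzzy_token_alignment(token_sims, a=5.0, b=0.3):
    """Toy sketch: map each token-level similarity to a fuzzy membership in
    [0, 1] via a sigmoid, then use memberships as weights so unreliable
    (low-similarity, likely unobservable) tokens contribute little to the
    aggregated matching score."""
    mu = 1.0 / (1.0 + np.exp(-a * (token_sims - b)))   # membership degrees
    return float(np.sum(mu * token_sims) / (np.sum(mu) + 1e-8))
```

Compared with a plain mean over tokens, this weighting lets a few well-observed attributes dominate when aerial viewpoints hide the rest of the described appearance.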
[CV-204] The Role and Relationship of Initialization and Densification in 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)方法中初始化质量与密集化(densification)策略之间的关系问题,即当前的密集化方案是否能够充分利用高质量的初始点云(如激光扫描或稠密立体视觉点云)来提升重建效果。研究表明,现有密集化方法难以显著优于基于稀疏结构光运动(Structure-from-Motion, SfM)的初始化,说明其在利用稠密初始信息方面存在局限性。解决方案的关键在于系统性地评估不同初始化方式(包括稀疏SfM点云、稠密激光扫描、多视角立体匹配点云及单目深度估计点云)与多种密集化策略的组合,并通过提出新的基准测试(benchmark)揭示当前方法的瓶颈,从而为未来改进密集化机制提供方向。
链接: https://arxiv.org/abs/2603.20714
作者: Ivan Desiatov,Torsten Sattler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Sources will be made publicly available
Abstract:3D Gaussian Splatting (3DGS) has become the method of choice for photo-realistic 3D reconstruction of scenes, due to being able to efficiently and accurately recover the scene appearance and geometry from images. 3DGS represents the scene through a set of 3D Gaussians, parameterized by their position, spatial extent, and view-dependent color. Starting from an initial point cloud, 3DGS refines the Gaussians’ parameters as to reconstruct a set of training images as accurately as possible. Typically, a sparse Structure-from-Motion point cloud is used as initialization. In order to obtain dense Gaussian clouds, 3DGS methods thus rely on a densification stage. In this paper, we systematically study the relation between densification and initialization. Proposing a new benchmark, we study combinations of different types of initializations (dense laser scans, dense (multi-view) stereo point clouds, dense monocular depth estimates, sparse SfM point clouds) and different densification schemes. We show that current densification approaches are not able to take full advantage of dense initialization as they are often unable to (significantly) improve over sparse SfM-based initialization. We will make our benchmark publicly available.
[CV-205] High-Quality and Efficient Turbulence Mitigation with Events CVPR2026
【速读】:该论文旨在解决大气湍流导致的图像退化问题(turbulence mitigation, TM),其核心挑战在于湍流具有随机性,传统方法依赖多帧静态图像来提取稳定模式,但存在精度与效率之间的权衡:帧数越多恢复质量越高,但系统延迟和数据开销也随之增加。解决方案的关键在于利用事件相机(event camera)的微秒级时间分辨率和对动态变化的高效感知能力,提出EHETM方法。该方法基于两个关键发现:一是湍流诱导事件呈现与图像梯度相关的极性交替特性,可作为场景结构重建的线索;二是动态物体在时空上形成连贯的“事件管”(event tubes),而湍流事件则呈杂乱分布,这为分离运动目标与湍流提供了先验信息。基于此,设计了两个互补模块——极性加权梯度模块用于场景精修,事件管约束模块用于运动解耦,从而实现仅用少量帧即可高质量恢复图像,显著降低数据量和延迟(分别减少约77.3%和89.5%)。
链接: https://arxiv.org/abs/2603.20708
作者: Xiaoran Zhang,Jian Ding,Yuxing Duan,Haoyue Liu,Gang Chen,Yi Chang,Luxin Yan
机构: Huazhong University of Science and Technology (华中科技大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively. Our code is available at: this https URL.
[CV-206] Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models
【速读】:该论文旨在解决自然灾害发生后,卫星遥感图像与地面视角影像之间存在的数据鸿沟问题,即卫星影像虽能提供大范围损毁概览,却缺乏对具体结构破坏的细节感知能力,而地面影像(如街景图像)在灾后短时间内难以获取。解决方案的关键在于提出两种生成式AI方法:一是基于视觉-语言模型(Vision-Language Model, VLM)引导的合成策略,二是面向损伤敏感性的专家混合(Mixture-of-Experts, MoE)方法,通过跨视图合成技术从卫星图像生成灾后街景图像,并引入结构感知评估框架(Structure-Aware Evaluation Framework)进行多层级验证,从而实现更可靠、语义一致且具真实感的街景图像生成,为灾情快速评估提供可信的数据支撑。
链接: https://arxiv.org/abs/2603.20697
作者: Yifan Yang,Lei Zou,Wendy Jepson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for presentation at IGARSS 2026 (IEEE International Geoscience and Remote Sensing Symposium)
Abstract:In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism–fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy, 0.71, whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.
[CV-207] MFSR: MeanFlow Distillation for One Step Real-World Image Super Resolution
【速读】:该论文旨在解决扩散模型和流模型在真实图像超分辨率(Real-ISR)任务中推理速度慢、难以部署的问题,同时避免单步蒸馏导致的重建质量下降及丧失多步优化灵活性。其解决方案的关键在于提出一种名为Mean Flows for Super-Resolution (MFSR) 的新型蒸馏框架,该框架以MeanFlow作为学习目标,使学生模型能够近似任意状态之间的概率流常微分方程(Probability Flow ODE, PF-ODE)的平均速度,从而在无需显式轨迹模拟的情况下有效捕捉教师模型的动力学特性;此外,通过引入基于教师CFG(Classifier-Free Guidance)的蒸馏策略改进原始MeanFlow的引导机制,进一步增强重建能力并保留细节信息,最终实现单步生成逼真结果的同时支持可选的少量步骤精细化提升。
链接: https://arxiv.org/abs/2603.20690
作者: Ruiqing Wang,Kai Zhang,Yuanzhi Zhu,Hanshu Yan,Shilin Lu,Jian Yang
机构: Nanjing University (南京大学); LIX, École Polytechnique, CNRS, IPP (法国巴黎综合理工学院LIX实验室,法国国家科学研究中心,巴黎理工学院); Salesforce AI Research (Salesforce人工智能研究院); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion- and flow-based models have advanced Real-world Image Super-Resolution (Real-ISR), but their multi-step sampling makes inference slow and hard to deploy. One-step distillation alleviates the cost, yet often degrades restoration quality and removes the option to refine with more steps. We present Mean Flows for Super-Resolution (MFSR), a new distillation framework that produces photorealistic results in a single step while still allowing an optional few-step path for further improvement. Our approach uses MeanFlow as the learning target, enabling the student to approximate the average velocity between arbitrary states of the Probability Flow ODE (PF-ODE) and effectively capture the teacher's dynamics without explicit rollouts. To better leverage pretrained generative priors, we additionally improve the original MeanFlow Classifier-Free Guidance (CFG) formulation with a teacher CFG distillation strategy, which enhances restoration capability and preserves fine details. Experiments on both synthetic and real-world benchmarks demonstrate that MFSR achieves efficient, flexible, and high-quality super-resolution, delivering results on par with or even better than multi-step teachers while requiring much lower computational cost.
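The MeanFlow target the student approximates can be stated compactly. These are the standard MeanFlow relations (average velocity of the PF-ODE, the identity used as a training target, and few-step sampling), not MFSR-specific derivations:

```latex
% Average velocity of the PF-ODE between times r < t:
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau
% MeanFlow identity used as the training target
% (d/dt is the total derivative along the trajectory):
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt}\, u(z_t, r, t)
% Jumping from state z_t back to time r with the learned average velocity;
% r = 0, t = 1 gives one-step generation, intermediate r gives few-step refinement:
z_r = z_t - (t - r)\, u_\theta(z_t, r, t)
```

Because the student predicts the average velocity between arbitrary states rather than an instantaneous one, the same network supports both the one-step path and the optional few-step refinement mentioned in the abstract.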
[CV-208] IBCapsNet: Information Bottleneck Capsule Network for Noise-Robust Representation Learning
【速读】:该论文旨在解决胶囊网络(Capsule Networks, CapsNets)在实际应用中的两大关键问题:一是由于迭代动态路由机制导致的高计算成本,二是对输入噪声扰动的鲁棒性较差。解决方案的核心在于提出一种基于信息瓶颈(Information Bottleneck, IB)原理的新架构——IBCapsNet,其关键创新是摒弃了传统的迭代路由过程,转而采用单次遍历的变分聚合机制:首先将初级胶囊(primary capsules)压缩为全局上下文表示,再通过类别特定的变分自编码器(Variational Autoencoders, VAEs)推断出受KL散度正则化的潜在胶囊。这一设计不仅显著提升了推理效率(训练速度提升2.54倍,推理吞吐量提高3.64倍),还增强了模型对合成噪声的鲁棒性(如加性噪声下平均提升17.10%),同时减少了4.66%的参数量,实现了高效、鲁棒且可解释的深度学习模型构建路径。
链接: https://arxiv.org/abs/2603.20682
作者: Canqun Xiang,Chen Yang,Jiaoyan Zhao
机构: Shenzhen Polytechnic University (深圳职业技术大学); Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong-Macao Greater Bay Area (粤港澳大湾区应用人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Capsule networks (CapsNets) are superior at modeling hierarchical spatial relationships but suffer from two critical limitations: high computational cost due to iterative dynamic routing and poor robustness under input corruptions. To address these issues, we propose IBCapsNet, a novel capsule architecture grounded in the Information Bottleneck (IB) principle. Instead of iterative routing, IBCapsNet employs a one-pass variational aggregation mechanism, where primary capsules are first compressed into a global context representation and then processed by class-specific variational autoencoders (VAEs) to infer latent capsules regularized by the KL divergence. This design enables efficient inference while inherently filtering out noise. Experiments on MNIST, Fashion-MNIST, SVHN and CIFAR-10 show that IBCapsNet matches CapsNet in clean-data accuracy (achieving 99.41% on MNIST and 92.01% on SVHN), yet significantly outperforms it under four types of synthetic noise - demonstrating average improvements of +17.10% and +14.54% for clamped additive and multiplicative noise, respectively. Moreover, IBCapsNet achieves 2.54x faster training and 3.64x higher inference throughput compared to CapsNet, while reducing model parameters by 4.66%. Our work bridges information-theoretic representation learning with capsule networks, offering a principled path toward robust, efficient, and interpretable deep models. Code is available at this https URL
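The one-pass variational aggregation relies on the standard KL regularizer between a diagonal Gaussian posterior and the standard normal prior, the same closed form used in VAEs; a minimal sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) -- the standard
    VAE regularizer that, per the abstract, is applied to each class-specific
    latent capsule in IBCapsNet."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))
```

Under the Information Bottleneck view, this term is the compression pressure: it discourages latent capsules from encoding input detail (including noise) beyond what the classification objective demands.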
[CV-209] oFormer: Towards Large-scale Scenario Depth Completion for Lightweight ToF Camera
【速读】:该论文旨在解决短距离飞行时间(Time-of-Flight, ToF)相机在大尺度场景中因感知范围受限而难以部署的问题。现有深度补全(depth completion)方法缺乏针对ToF测量特性的专用数据集,且泛化能力不足。其解决方案的关键在于构建首个面向大尺度场景的ToF深度补全数据集LASER-ToF,并提出一种传感器感知的深度补全网络:该网络引入3D-2D联合传播池化(3D-2D Joint Propagation Pooling, JPP)模块与多模态交叉协方差注意力(Multimodal Cross-Covariance Attention, MXCA),有效建模长距离关系并高效融合3D与2D特征;同时利用视觉SLAM生成的稀疏点云作为补充信息提升预测精度。实验表明,该方法在保持轻量化设计的同时,相较次优方法降低8.6%的平均绝对误差,并成功实现在四旋翼无人机上以10Hz频率运行,实现复杂环境中可靠的大尺度建图与远距离规划。
链接: https://arxiv.org/abs/2603.20669
作者: Juncheng Chen,Tiancheng Lai,Xingpeng Wang,Bingxin Liao,Baozhe Zhang,Chao Xu,Yanjun Cao
机构: State Key Laboratory of Industrial Control Technology, Institute of Cyber Systems and Control, Zhejiang University, Hangzhou, China; Huzhou Institute of Zhejiang University, Huzhou, China; The Chinese University of Hong Kong, Shenzhen, China
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 15 figures
Abstract:Time-of-Flight (ToF) cameras possess compact design and high measurement precision to be applied to various robot tasks. However, their limited sensing range restricts deployment in large-scale scenarios. Depth completion has emerged as a potential solution to expand the sensing range of ToF cameras, but existing research lacks dedicated datasets and struggles to generalize to ToF measurements. In this paper, we propose a full-stack framework that enables depth completion in large-scale scenarios for short-range ToF cameras. First, we construct a multi-sensor platform with a reconstruction-based pipeline to collect real-world ToF samples with dense large-scale ground truth, yielding the first LArge-ScalE scenaRio ToF depth completion dataset (LASER-ToF). Second, we propose a sensor-aware depth completion network that incorporates a novel 3D branch with a 3D-2D Joint Propagation Pooling (JPP) module and Multimodal Cross-Covariance Attention (MXCA), enabling effective modeling of long-range relationships and efficient 3D-2D fusion under non-uniform ToF depth sparsity. Moreover, our network can utilize the sparse point cloud from visual SLAM as a supplement to ToF depth to further improve prediction accuracy. Experiments show that our method achieves an 8.6% lower mean absolute error than the second-best method, while maintaining lightweight design to support onboard deployment. Finally, to verify the system’s applicability on real robots, we deploy proposed method on a quadrotor at a 10Hz runtime, enabling reliable large-scale mapping and long-range planning in challenging environments for short-range ToF cameras.
[CV-210] Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在空间推理(spatial reasoning)方面持续存在的性能瓶颈问题。其核心解决方案在于通过机制可解释性(mechanistic interpretability)视角,识别和表征VLM中专门负责空间认知功能的注意力头(attention heads),并提出激活潜在空间注意力头的方法以提升模型的空间理解能力。关键创新在于构建了CogVSR数据集,将复杂空间推理问题分解为链式思维(chain-of-thought)步骤,并关联具体认知功能;在此基础上开发探针框架定位功能性注意力头,揭示其稀疏性及分布规律,进而通过干预实验验证这些头部对空间推理任务的决定性作用。
链接: https://arxiv.org/abs/2603.20662
作者: Xueqi Ma,Shuo Yang,Yanbei Jiang,Shu Liu,Zhenzhen Liu,Jiayang Ao,Xingjun Ma,Sarah Monazam Erfani,James Bailey
机构: The University of Melbourne (墨尔本大学); Australian National University (澳大利亚国立大学); Fudan University (复旦大学); Monash University (蒙纳士大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions. Notably, spatially specialized heads are fewer than those for other cognitive functions, highlighting their scarcity. We propose methods to activate latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. This study provides new interpretability driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.
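The ablate/emphasize interventions described above reduce to rescaling one head's output before the heads are concatenated and projected; a minimal sketch (the head layout and the intervention point are simplifying assumptions about a generic multi-head attention block):

```python
import numpy as np

def intervene_on_head(head_outputs, head_idx, scale):
    """Scale one attention head's output by `scale` (0.0 ablates it, >1.0
    emphasizes it), then concatenate heads along the feature dimension as a
    standard multi-head attention block would before the output projection."""
    out = head_outputs.copy()            # shape: (num_heads, seq_len, head_dim)
    out[head_idx] *= scale
    # (num_heads, seq, dim) -> (seq, num_heads * dim)
    return out.transpose(1, 0, 2).reshape(out.shape[1], -1)
```

Comparing downstream spatial-reasoning accuracy with `scale=0.0` versus `scale>1.0` for a candidate head is the kind of causal test the abstract reports for its functional heads.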
[CV-211] A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation
【速读】:该论文针对细粒度时尚图像检索(Fine-grained Fashion Image Retrieval, FIR)中静态模型无法适应动态新增属性的问题,提出了一种多头持续学习框架(Multihead Continual Learning for FIR, MCL-FIR)。其核心挑战在于如何在不重新训练整个模型的前提下,有效整合新类别并保持旧知识,同时避免灾难性遗忘。解决方案的关键在于:1)采用多头结构以支持增量类别的扩展;2)将三元组输入重构为双元组并结合InfoNCE损失函数,提升训练效率与效果;3)引入指数移动平均(Exponential Moving Average, EMA)蒸馏机制实现高效的知识迁移,从而在较低训练成本下实现性能接近静态方法的稳定表现。
链接: https://arxiv.org/abs/2603.20648
作者: Ling Xiao,Toshihiko Yamasaki
机构: Hokkaido University (北海道大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Multimedia (TMM), to appear. Preprint version
Abstract:Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available in this https URL.
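The doublet-plus-InfoNCE training and EMA distillation can be sketched with their standard formulations (in-batch negatives and a slowly drifting teacher; the temperature and decay values are illustrative, not the paper's settings):

```python
import numpy as np

def info_nce(anchors, positives, tau=0.07):
    """InfoNCE over (anchor, positive) doublets: each anchor's positive is the
    matching row; all other positives in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def ema_update(teacher, student, decay=0.999):
    """EMA distillation target: the teacher drifts slowly toward the student,
    giving a stable source of old-increment knowledge for distillation."""
    return decay * teacher + (1.0 - decay) * student
```

In a class-incremental loop, the EMA teacher's features would additionally supervise the student on new-increment data, which is how the framework transfers knowledge without replaying old tasks.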
[CV-212] ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在指令驱动图像编辑任务中缺乏大规模、多样化且高质量开源数据集的问题。现有数据集要么依赖封闭源模型进行标注,难以低成本扩展;要么采用固定合成编辑流程,存在质量与泛化能力不足的局限。解决方案的关键在于提出ScaleEditor——一个完全开源的分层多智能体框架,其核心包括:基于世界知识注入的源图像扩展机制、自适应多智能体编辑指令-图像合成模块,以及任务感知的数据质量验证机制。该框架成功构建了目前最大的开源图像编辑数据集ScaleEdit-12M,并在多个基准测试中显著提升模型性能,证明了开放生态下可实现接近商业级的数据质量与成本效益。
链接: https://arxiv.org/abs/2603.20644
作者: Guanzhou Chen,Erfei Cui,Changyao Tian,Danni Yang,Ganlin Yang,Yu Qiao,Hongsheng Li,Gen Luo,Hongjie Zhang
机构: Shanghai Jiao Tong University (上海交通大学); CUHK MMLab; Fudan University (复旦大学); University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed-source models for annotation, which prevents cost-effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of large-scale, high-quality image editing datasets. Our pipeline consists of three key components: source image expansion with world-knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine-tuning UniWorld-V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS-Bench for knowledge-infused benchmarks. These results demonstrate that open-source, agentic pipelines can approach commercial-grade data quality while retaining cost-effectiveness and scalability. Both the framework and dataset will be open-sourced.
[CV-213] GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction CVPR2026
【速读】:该论文旨在解决切片式体素成像(slice-based volumetric imaging)中压缩效率与内部结构保真度之间的矛盾问题,即如何在实现高比例压缩的同时保留高频率的内部结构细节以支持后续分析。其解决方案的关键在于提出GaussianPile方法,该方法通过三个核心创新实现:(i) 切片感知的堆叠策略(slice-aware piling strategy),利用各向异性3D高斯分布建模跨切片贡献;(ii) 可微分投影算子(differentiable projection operator),编码成像系统有限厚度点扩散函数(point spread function, PSF);(iii) 紧凑编码与联合优化流程,同步完成高斯集合的重建与压缩。此设计在CUDA架构下保持了高斯原语的压缩效率和实时渲染能力,同时显著提升内部细节保真度,实验证明其相较NeRF方法提速最高达11倍、压缩比达16倍于体素网格,具备部署可行性。
链接: https://arxiv.org/abs/2603.20611
作者: Di Kong,Yikai Wang,Wenjie Guo,Yifan Bu,Boya Zhang,Yuexin Duan,Xiawei Yue,Wenbiao Du,Yiman Zhong,Yuwen Chen,Cheng Ma
机构: Tsinghua University (清华大学); Zhongguancun Academy (中关村研究院); Beijing Normal University (北京师范大学); Nankai University (南开大学); Beijing Institute of Technology (北京理工大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)
Abstract:Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. We introduce GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our proposed method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D Gaussians to model through-slice contributions, (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real-time rendering efficiency of Gaussian primitives while preserving high-frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as 3 minutes, up to 11x faster than NeRF-based approaches, and achieves consistent 16x compression over voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.
[CV-214] RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction
【速读】:该论文旨在解决流式三维重建(streaming 3D reconstruction)在动态场景中因缺乏显式动态推理而导致的几何伪影和位姿漂移问题。解决方案的关键在于提出了一种无需训练的框架 RayMap3R,其核心创新是利用 RayMap 预测固有的静态场景偏差(static-scene bias)作为内部线索,通过双分支推理机制对比 RayMap 与图像预测结果来识别动态区域,并在内存更新过程中抑制动态区域的干扰;同时引入重置度量对齐(reset metric alignment)和状态感知平滑(state-aware smoothing)以保持度量一致性并稳定位姿轨迹。
链接: https://arxiv.org/abs/2603.20588
作者: Feiran Wang,Zezhou Shang,Gaowen Liu,Yan Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.
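RayMap3R 的双分支推理通过对比 RayMap 与图像两路预测来识别动态区域,并在记忆更新时抑制其干扰。以下为该思路的逐像素玩具示意(阈值 tau 与函数名均为假设,非论文实现):

```python
def dynamic_mask(raymap_depth, image_depth, tau=0.5):
    """逐像素对比两分支深度预测,差异超过阈值的像素视为动态区域(示意)。"""
    return [[abs(a - b) > tau for a, b in zip(ra, rb)]
            for ra, rb in zip(raymap_depth, image_depth)]

def update_memory(memory, new_depth, mask):
    """动态像素保留旧记忆、静态像素用新观测更新,从而抑制动态干扰(示意)。"""
    return [[m if dyn else n for m, n, dyn in zip(mr, nr, dr)]
            for mr, nr, dr in zip(memory, new_depth, mask)]
```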
[CV-215] Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成合成图像时,由于训练目标与迭代采样过程之间存在不一致,导致梯度误差沿采样轨迹累积,从而影响生成质量与泛化能力的问题。其解决方案的关键在于提出一种基于“弱到强原则”(Weak-to-Strong Principle, W2S)的混合引导方法(SGG),该方法融合了无分类器引导(Classifier Free Guidance, CFG)和自动引导(AutoGuidance, AG)的优势,在推理阶段显著提升性能;同时将W2S原则迁移至训练目标中,增强了未引导扩散模型的泛化能力。
链接: https://arxiv.org/abs/2603.20584
作者: Liangyu Yuan,Yufei Huang,Mingkun Lei,Tong Zhao,Ruoyu Wang,Changxi Chi,Yiwei Wang,Chi Zhang
机构: Westlake University (西湖大学); Tongji University (同济大学); Zhejiang University (浙江大学); University of California at Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 12 figures
Abstract:Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signal for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of weak-to-strong principle. Based on this, we propose a hybrid instantiation called SGG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SGG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SGG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings. Code is available at this https URL.
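CFG 与 AG 在形式上都可写成沿“劣质信号 → 主信号”方向的外推:guided = weak + w * (strong - weak),w=1 退化为主信号,w>1 为外推。下面的一行式草图演示这一共同形式(weak/strong 分别代指无条件/有条件输出或弱/强模型的去噪输出,均为示意):

```python
def guided_prediction(weak, strong, w=2.0):
    """引导外推:guided = weak + w * (strong - weak),逐元素计算(示意)。"""
    return [wk + w * (st - wk) for wk, st in zip(weak, strong)]
```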
[CV-216] GHOST: Ground-projected Hypotheses from Observed Structure-from-Motion Trajectories
【速读】:该论文旨在解决复杂城市环境中自动驾驶车辆轨迹分割的难题,即如何从单目图像中自动识别可行的行驶路径,而无需依赖人工标注的车道线或道路结构信息。其解决方案的关键在于提出一种可扩展的自监督方法:利用大规模行车记录视频中的自车运动(ego-motion)作为隐式监督信号,通过单目结构光恢复相机轨迹,并将其投影到地面平面生成可行驶区域的空间掩码(spatial masks),进而训练深度分割网络,实现仅基于单张RGB图像即可预测与运动条件相关的路径候选区域(motion-conditioned path proposals)。该方法不显式建模道路或车道线,而是通过大规模无约束互联网数据学习场景布局、车道拓扑和交叉口结构,从而在不同相机配置下具备良好泛化能力。
链接: https://arxiv.org/abs/2603.20583
作者: Tomasz Frelek,Rohan Patil,Akshar Tumu,Henrik I. Christensen
机构: UC San Diego(加州大学圣地亚哥分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 27 figures, 1 table
Abstract:We present a scalable self-supervised approach for segmenting feasible vehicle trajectories from monocular images for autonomous driving in complex urban environments. Leveraging large-scale dashcam videos, we treat recorded ego-vehicle motion as implicit supervision and recover camera trajectories via monocular structure-from-motion, projecting them onto the ground plane to generate spatial masks of traversed regions without manual annotation. These automatically generated labels are used to train a deep segmentation network that predicts motion-conditioned path proposals from a single RGB image at run time, without explicit modeling of road or lane markings. Trained on diverse, unconstrained internet data, the model implicitly captures scene layout, lane topology, and intersection structure, and generalizes across varying camera configurations. We evaluate our approach on NuScenes, demonstrating reliable trajectory prediction, and further show transfer to an electric scooter platform through light fine-tuning. Our results indicate that large-scale ego-motion distillation yields structured and generalizable path proposals beyond the demonstrated trajectory, enabling trajectory hypothesis estimation via image segmentation.
[CV-217] When Negation Is a Geometry Problem in Vision-Language Models CVPR
【速读】:该论文旨在解决当前联合视觉-语言嵌入模型(如CLIP)在理解文本查询中否定语义时的不足,例如无法正确识别“no”在查询“a plain blue shirt with no logos”中的含义。传统方法主要依赖数据驱动的微调策略,在大规模合成否定数据集上训练CLIP,但其评估通常采用基于检索的指标,难以真实反映模型对否定的理解能力。本文的关键创新在于提出一种基于多模态大语言模型(Multimodal LLMs-as-a-judge)的新评估框架,利用LLMs擅长回答图像内容相关的是非问题的优势,实现更公平、可靠的否定理解评估;进一步发现CLIP嵌入空间中存在与否定相关的方向,并通过测试时的表示工程(representation engineering)对该方向进行干预,从而无需任何微调即可引导CLIP表现出否定感知行为,同时验证了该方法在分布外样本上的泛化能力。
链接: https://arxiv.org/abs/2603.20554
作者: Fawaz Sammani,Tzoulio Chamiti,Paul Gavrikov,Nikos Deligiannis
机构: Vrije Universiteit Brussel (布鲁塞尔自由大学); imec (imec); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR (Multimodal Algorithmic Reasoning Workshop) 2026
Abstract:Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish “no” in the query: “a plain blue shirt with no logos”. Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.
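论文发现 CLIP 嵌入空间中存在与否定相关的方向,并在测试时通过表示工程进行干预。以下为一种假设性的估计与干预草图:用“含否定 / 不含否定”成对文本嵌入的均值差近似该方向,再沿该方向平移嵌入(非论文原始做法的精确复现):

```python
import math

def negation_direction(neg_embs, pos_embs):
    """以两组嵌入的均值差近似“否定方向”(假设的估计方式)。"""
    dim = len(neg_embs[0])
    mean = lambda embs: [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    return [a - b for a, b in zip(mean(neg_embs), mean(pos_embs))]

def steer(embedding, direction, alpha=1.0):
    """测试时干预:e' = e + alpha * d_hat,alpha 的符号决定增强或抑制该方向。"""
    norm = math.sqrt(sum(d * d for d in direction))
    d_hat = [d / norm for d in direction]
    return [e + alpha * d for e, d in zip(embedding, d_hat)]
```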
[CV-218] Memory Over Maps: 3D Object Localization Without Reconstruction
【速读】:该论文旨在解决传统机器人导航与操作任务中依赖完整3D场景重建进行目标定位所带来的计算开销大、存储成本高及可扩展性差的问题。其核心挑战在于验证是否必须通过显式构建全局3D表示(如点云或体素网格)才能实现精准的目标定位。解决方案的关键在于提出一种无需地图的定位流水线:仅存储带位姿的RGB-D关键帧作为轻量级视觉记忆,查询时通过视觉语言模型(Vision-Language Model, VLM)对候选视图进行重排序,并利用深度反投影和多视角融合动态构建目标的稀疏3D估计,从而避免了预先构建完整3D场景结构,显著降低了预处理时间和存储需求,同时在下游物体导向导航任务中展现出强泛化能力。
链接: https://arxiv.org/abs/2603.20530
作者: Rui Zhou,Xander Yap,Jianwen Cao,Allison Lau,Boyang Sun,Marc Pollefeys
机构: University of Zurich (苏黎世大学); ETH Zurich (苏黎世联邦理工学院); Microsoft (微软)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory–without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: this https URL
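该方法在查询时通过深度反投影(depth backprojection)构建目标的稀疏 3D 估计。针孔相机模型下的单像素反投影可写作如下示意(fx、fy、cx、cy 为相机内参):

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """针孔模型:将像素 (u, v) 与其深度反投影为相机坐标系中的 3D 点。"""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)
```

多视角融合时,再用各关键帧位姿把这些相机坐标系点变换到同一世界坐标系后聚合。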
[CV-219] End-to-End Optimization of Polarimetric Measurement and Material Classifier
【速读】:该论文旨在解决材料分类中极化测量角度配置不明确的问题,即如何在有限的极化测量次数下实现高精度的材料识别。传统方法依赖多步调制入射光和反射光的偏振状态,过程耗时且对某些任务冗余;而本文提出了一种端到端优化框架,联合学习材料分类器与最优偏振元件旋转角度组合,从而在减少测量次数的同时提升分类性能。其关键在于通过 Mueller 矩阵材料数据集训练模型,自动确定控制入射与反射光偏振态的最佳角度配置,实现高效、精准的材料识别。
链接: https://arxiv.org/abs/2603.20519
作者: Ryota Maeda,Naoki Arikawa,Yutaka No,Shinsaku Hiura
机构: University of Hyogo(兵库大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at VISAPP 2026 (21st International Conference on Computer Vision Theory and Applications)
Abstract:Material classification is a fundamental problem in computer vision and plays a crucial role in scene understanding. Previous studies have explored various material recognition methods based on reflection properties such as color, texture, specularity, and scattering. Among these cues, polarization is particularly valuable because it provides rich material information and enables recognition even at distances where capturing high-resolution texture is impractical. However, measuring polarimetric reflectance properties typically requires multiple modulations of the polarization state of the incident light, making the process time-consuming and often unnecessary for certain recognition tasks. While material classification can be achieved using only a subset of polarimetric measurements, the optimal configuration of measurement angles remains unclear. In this study, we propose an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combinations of rotation angles for polarization elements that control both the incident and reflected light states. Using our Mueller-matrix material dataset, we demonstrate that our method achieves high-accuracy material classification even with a limited number of measurements.
[CV-220] Lessons and Open Questions from a Unified Study of Camera-Trap Species Recognition Over Time
【速读】:该论文旨在解决生态学实践中相机陷阱(camera trap)物种识别在时间维度上的可靠性问题,即如何在固定监测站点长期运行中维持模型的准确识别能力。传统计算机视觉研究多聚焦于跨域泛化(cross-domain generalization),但忽略了生态系统动态变化导致的背景和动物分布随时间发生显著偏移这一核心挑战。解决方案的关键在于构建首个统一的时间序列基准(benchmark),包含546个相机陷阱,并采用按时间顺序排列的数据流协议评估模型性能;同时通过实证分析揭示:(1)生物基础模型(如BioCLIP 2)需进行站点特异性适配才能有效应用;(2)在真实部署生命周期下(使用历史数据更新模型并测试未来时段),简单适配反而可能低于零样本(zero-shot)性能;(3)造成困难的主要因素为类别严重不平衡及连续时间段间物种分布与背景的显著时序漂移;(4)结合模型更新与后处理技术可显著提升精度,但仍存在与理论上限的差距。该研究为生态学家提供了可操作的部署指南,并指出了未来视觉与机器学习研究的新方向。
链接: https://arxiv.org/abs/2603.20509
作者: Sooyoung Jeon,Hongjie Tian,Lemeng Wang,Zheda Mai,Vidhi Bakshi,Jiacheng Hou,Ping Zhang,Arpita Chowdhury,Jianyang Gu,Wei-Lun Chao
机构: The Ohio State University (俄亥俄州立大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first three authors contribute equally
Abstract:Camera traps are vital for large-scale biodiversity monitoring, yet accurate automated analysis remains challenging due to diverse deployment environments. While the computer vision community has mostly framed this challenge as cross-domain generalization, this perspective overlooks a primary challenge faced by ecological practitioners: maintaining reliable recognition at the fixed site over time, where the dynamic nature of ecosystems introduces profound temporal shifts in both background and animal distributions. To bridge this gap, we present the first unified study of camera-trap species recognition over time. We introduce a realistic benchmark comprising 546 camera traps with a streaming protocol that evaluates models over chronologically ordered intervals. Our end-user-centric study yields four key findings. (1) Biological foundation models (e.g., BioCLIP 2) underperform at numerous sites even in initial intervals, underscoring the necessity of site-specific adaptation. (2) Adaptation is challenging under realistic evaluation: when models are updated using past data and evaluated on future intervals (mirrors real deployment lifecycles), naive adaptation can even degrade below zero-shot performance. (3) We identify two drivers of this difficulty: severe class imbalance and pronounced temporal shift in both species distribution and backgrounds between consecutive intervals. (4) We find that effective integration of model-update and post-processing techniques can largely improve accuracy, though a gap from the upper bounds remains. Finally, we highlight critical open questions, such as predicting when zero-shot models will succeed at a new site and determining whether/when model updates are necessary. Our benchmark and analysis provide actionable deployment guidelines for ecological practitioners while establishing new directions for future research in vision and machine learning.
[CV-221] CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在空间推理任务中对方向关系编码机制理解不足的问题,特别是现有归因方法(如GradCAM和注意力传播)仅能揭示模型关注区域,而无法明确其推断出的物体间方向信息。解决方案的关键在于提出一种无需训练的可解释性框架CREG(Compass Relational Evidence Graph),其核心创新是将多层对比梯度乘以激活(contrastive Grad-times-Act)归因投影到以参考对象为中心的极坐标系中,生成覆盖罗盘扇区的方向证据分布,并通过方向对齐误差(DAE)、边缘准确率(EA)和因果遮蔽得分(COS)三个指标量化评估方向解释的准确性与忠实性。实验表明,CREG在Qwen2-VL-7B上显著优于传统归因基线,尤其在COCO-Pairs数据集上实现更低的角误差和更高的因果有效性,验证了多层对比归因能更忠实揭示VLM在空间推理中的方向感知机制。
链接: https://arxiv.org/abs/2603.20475
作者: Kaizhen Tan
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS greater than or equal to +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
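CREG 的核心操作是把归因投影到以参考物为中心的极坐标系,按罗盘扇区累计方向证据。以下为该投影与分箱步骤的简化示意(输入格式与扇区数均为假设,且省略了多层对比归因本身的计算):

```python
import math

def compass_evidence(points, center, n_sectors=8):
    """将带权归因点按相对参考中心的方位角分入 n_sectors 个罗盘扇区并累计权重。
    points: [(x, y, weight), ...];返回各扇区的证据和(示意)。"""
    cx, cy = center
    sectors = [0.0] * n_sectors
    width = 2 * math.pi / n_sectors
    for x, y, w in points:
        ang = math.atan2(y - cy, x - cx) % (2 * math.pi)
        sectors[int(ang // width) % n_sectors] += w
    return sectors
```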
[CV-222] Inverting Neural Networks: New Methods to Generate Neural Network Inputs from Prescribed Outputs
【速读】:该论文旨在解决神经网络中输入图像到输出类别的逆映射问题(inverse problem of determining the input images that get mapped to specific neural network classes),即寻找能够被模型高置信度分类为特定类别的输入图像。其核心挑战在于神经网络的复杂非线性映射使得输入空间与输出空间之间的关系难以解析。解决方案的关键在于提出两种通用方法:一是基于前向传播的根求解算法(root-finding algorithm)结合输入图像的雅可比矩阵(Jacobian with respect to the input image)进行反向迭代优化;二是基于反向传播的逐层逆推方法(backward pass method),通过在每一层反向计算时引入从线性层零空间采样的随机向量,以探索更广泛的可行输入空间。这两种方法均能生成具有高分类准确率且结构上看似随机的输入图像,揭示了当前深度神经网络在输入空间覆盖上的潜在脆弱性。
链接: https://arxiv.org/abs/2603.20461
作者: Rebecca Pattichis,Sebastian Janampa,Constantinos S. Pattichis,Marios S. Pattichis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2024 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI)
Abstract:Neural network systems describe complex mappings that can be very difficult to understand. In this paper, we study the inverse problem of determining the input images that get mapped to specific neural network classes. Ultimately, we expect that these images contain recognizable features that are associated with their corresponding class classifications. We introduce two general methods for solving the inverse problem. In our forward pass method, we develop an inverse method based on a root-finding algorithm and the Jacobian with respect to the input image. In our backward pass method, we iteratively invert each layer, starting at the top. During the inversion process, we add random vectors sampled from the null-space of each linear layer. We demonstrate our new methods on both transformer architectures and sequential networks based on linear layers. Unlike previous methods, we show that our new methods are able to produce random-like input images that yield near perfect classification scores in all cases, revealing vulnerabilities in the underlying networks. Hence, we conclude that the proposed methods provide a more comprehensive coverage of the input image spaces that solve the inverse mapping problem.
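前向法的核心是把“给定输出找输入”当作求根/优化问题,用对输入的梯度迭代逼近。下面用一个一维 sigmoid“网络”演示这一思想(玩具示意,非论文在 Transformer 上的实现;真实方法用的是对输入图像的雅可比矩阵):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def invert_scalar_net(w, b, target, x0=0.0, lr=1.0, steps=300):
    """梯度迭代求输入 x,使 f(x) = sigmoid(w*x + b) 逼近 target(一维示意)。"""
    x = x0
    for _ in range(steps):
        y = sigmoid(w * x + b)
        grad = 2 * (y - target) * y * (1 - y) * w  # d/dx (y - target)^2
        x -= lr * grad
    return x
```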
[CV-223] hermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis CVPR
【速读】:该论文旨在解决仅使用热成像(thermal imagery)进行新颖视图合成(Novel View Synthesis, NVS)时面临的挑战,其核心难点在于低成本热传感器的两个特性:一是动态范围极低,导致外观线索弱化且优化梯度不足;二是帧间光度波动剧烈且存在缓慢的辐射漂移,致使对应关系估计不稳定,并在无RGB引导的情况下产生高频浮动伪影。解决方案的关键在于设计了一个轻量级的预处理与点绘制(splatting)流水线,通过扩展可用动态范围和稳定每帧光度信息来提升热图像的可利用性,从而在不依赖数据集特定调优的前提下,在纯热成像NVS基准上实现了当前最优性能。
链接: https://arxiv.org/abs/2603.20448
作者: M. Kerem Aydin,Vishwanath Saragadam,Emma Alexander
机构: Northwestern University (西北大学); University of California, Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: To be published at CVPR, 2026. 15 Pages, 29 Figures
Abstract:Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance (beyond camera pose) is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.
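论文预处理的第一步是扩展热图像可用的动态范围。作为直觉示意,下面给出与之精神相近、但并非论文具体算法的分位数线性拉伸(分位数阈值为假设值):

```python
def stretch(values, lo_pct=0.02, hi_pct=0.98):
    """基于分位数的线性拉伸:把低动态范围数值映射到 [0, 1] 并截断(示意预处理)。"""
    s = sorted(values)
    lo = s[round(lo_pct * (len(s) - 1))]
    hi = s[round(hi_pct * (len(s) - 1))]
    span = max(hi - lo, 1e-8)
    return [min(max((v - lo) / span, 0.0), 1.0) for v in values]
```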
[CV-224] Benchmarking Efficient Effective Camera Pose Estimation Strategies for Novel View Synthesis
【速读】:该论文旨在解决传统结构光恢复(Structure-from-Motion, SfM)方法在用于新型视图合成(Novel View Synthesis, NVS)时效率与精度难以兼顾的问题。现有基于神经网络的SfM方法虽显著提升了运行效率,但其估计精度明显低于经典方法;而经典SfM依赖于束调整(Bundle Adjustment)优化,计算成本高。论文的关键解决方案在于:首先通过减少特征点数量即可大幅提升经典SfM的运行效率且保持高姿态精度;其次提出一种混合策略——利用前馈神经网络(如Transformer模型)快速获取初始相机参数和3D结构估计,再以经典SfM技术进行精调,从而实现效率与效果的最佳平衡。
链接: https://arxiv.org/abs/2603.20428
作者: Jhacson Meza,Martin R. Oswald,Torsten Sattler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Novel view synthesis (NVS) approaches such as NeRFs or 3DGS can produce photo-realistic 3D scene representation from a set of images with known extrinsic and intrinsic parameters. The necessary camera poses and calibrations are typically obtained from the images via Structure-from-Motion (SfM). Classical SfM approaches rely on local feature matches between the images to estimate both the poses and a sparse 3D model of the scene, using bundle adjustment to refine initial pose, intrinsics, and geometry estimates. In order to increase run-time efficiency, recent SfM systems forgo optimization via bundle adjustment. Instead, they train feed-forward (transformer-based) neural networks to directly regress camera parameters and the 3D structure. While orders of magnitude more efficient, such recent works produce significantly less accurate estimates. To stimulate research on developing SfM approaches that are both efficient and effective, this paper develops a benchmark focused on SfM for novel view synthesis. Using existing datasets and two simple strategies for making the reconstruction process more efficient, we show that: (1) simply using fewer features already significantly accelerates classical SfM methods while maintaining high pose accuracy. (2) using feed-forward networks to obtain initial estimates and refining them using classical SfM techniques leads to the best efficiency-effectiveness trade-off. We will make our benchmark and code publicly available.
[CV-225] FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection CVPR2026
【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)中传统全参数微调(Full Fine-Tuning)计算成本高、效率低的问题,以及现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法因使用固定秩(fixed rank)导致适应不同任务或模型位置能力不足,且缺乏对跨任务空间关系建模的缺陷。其解决方案的关键在于提出 Frequency-Aware and Automatic Rank (FAAR) 方法:首先引入 Performance-Driven Rank Shrinking (PDRS),动态分配每个适配器位置和任务的最优秩以提升效率与性能;其次设计 Task-Spectral Pyramidal Decoder (TS-PD),通过分析图像频谱并注入输入特定上下文到空间偏置学习中,显式建模跨任务的空间关联性,从而增强多样任务预测能力。实验表明,FAAR 在保持高精度的同时,相比传统 MTL 微调可减少多达 9 倍的参数量。
链接: https://arxiv.org/abs/2603.20403
作者: Maxime Fontana,Michael Spratling,Miaojing Shi
机构: King’s College London (国王学院); University of Luxembourg (卢森堡大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.
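PDRS 为每个适配器位置与任务按验证性能动态分配最优秩。作为直觉示意,下面用更简单的“累计奇异值能量”规则演示秩选择的一般形式(假设性替代方案,真实的 PDRS 以验证集性能而非能量阈值为判据):

```python
def select_rank(singular_values, energy=0.95):
    """选择保留给定比例奇异值能量(平方和)的最小秩(示意)。"""
    total = sum(s * s for s in singular_values)
    acc = 0.0
    for r, s in enumerate(sorted(singular_values, reverse=True), start=1):
        acc += s * s
        if acc >= energy * total:
            return r
    return len(singular_values)
```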
[CV-226] Monocular Models are Strong Learners for Multi-View Human Mesh Recovery
【速读】:该论文旨在解决多视角人体网格重建(Multi-view Human Mesh Recovery, HMR)中因相机标定依赖和跨视角泛化能力弱而导致的性能瓶颈问题。现有几何方法(如三角测量)高度依赖繁琐的相机标定,而学习方法则因缺乏多视角训练数据,在未见相机配置下泛化性能差。解决方案的关键在于提出一种无需训练的框架,利用预训练的单视角HMR模型作为强先验,通过测试时优化(test-time optimization)结合多视角一致性约束与解剖学约束,实现无需多视角训练数据即可在任意相机设置下进行高精度、鲁棒的重建。
链接: https://arxiv.org/abs/2603.20391
作者: Haoyu Xie,Shengkai Xu,Cheng Guo,Muhammad Usama Saleem,Wenhan Wu,Chen Chen,Ahmed Helmy,Pu Wang,Hongfei Xue
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-view human mesh recovery (HMR) is broadly deployed in diverse domains where high accuracy and strong generalization are essential. Existing approaches can be broadly grouped into geometry-based and learning-based methods. However, geometry-based methods (e.g., triangulation) rely on cumbersome camera calibration, while learning-based approaches often generalize poorly to unseen camera configurations due to the lack of multi-view training data, limiting their performance in real-world scenarios. To enable calibration-free reconstruction that generalizes to arbitrary camera setups, we propose a training-free framework that leverages pretrained single-view HMR models as strong priors, eliminating the need for multi-view training data. Our method first constructs a robust and consistent multi-view initialization from single-view predictions, and then refines it via test-time optimization guided by multi-view consistency and anatomical constraints. Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.
[CV-227] Jigsaw Regularization in Whole-Slide Image Classification
【速读】:该论文旨在解决计算病理学中全切片图像(Whole-Slide Images, WSIs)分类任务中存在的空间结构信息利用不足的问题。现有基于多实例学习(Multiple Instance Learning, MIL)的方法通常将图像块(patches)视为可交换的独立单元,忽略了组织切片中蕴含的丰富空间拓扑关系。解决方案的关键在于两个创新:其一,采用视觉基础模型(vision foundation-model)嵌入来捕捉每个图像块内的局部空间结构;其二,通过图神经网络(Graph Neural Networks, GNNs)结合一种新颖的拼图正则化(jigsaw regularization)机制,实现跨图像块的空间感知能力。实验证明,这种双重策略显著提升了在乳腺癌、头颈癌和结直肠癌基准数据集上的分类性能,优于当前主流的基于注意力机制的MIL方法。
链接: https://arxiv.org/abs/2603.20386
作者: So Won Jeong,Veronika Ročková
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computational pathology involves the digitization of stained tissues into whole-slide images (WSIs) that contain billions of pixels arranged as contiguous patches. Statistical analysis of WSIs largely focuses on classification via multiple instance learning (MIL), in which slide-level labels are inferred from unlabeled patches. Most MIL methods treat patches as exchangeable, overlooking the rich spatial and topological structure that underlies tissue images. This work builds on recent graph-based methods that aim to incorporate spatial awareness into MIL. Our approach is new in two regards: (1) we deploy vision foundation-model embeddings to incorporate local spatial structure within each patch, and (2) achieve across-patch spatial awareness using graph neural networks together with a novel jigsaw regularization. We find that a combination of these two features markedly improves classification over state-of-the-art attention-based MIL approaches on benchmark datasets in breast, head-and-neck, and colon cancer.
[CV-228] Multi-Stage Fine-Tuning of Pathology Foundation Models with Head-Diverse Ensembling for White Blood Cell Classification
【速读】:该论文旨在解决外周血涂片中白细胞(White Blood Cells, WBCs)的13类自动分类问题,其核心挑战包括类别不平衡、领域偏移以及形态连续性混淆(morphological continuum confusion),即相邻成熟阶段之间存在细微且重叠的特征。解决方案的关键在于提出一种多阶段微调方法,基于DINOBloom-base模型训练多种分类头家族(线性、余弦和多层感知机MLP),并发现不同分类头在特定类别上表现出显著的性能差异:MLP头在最不成熟粒细胞(Promyelocyte, PMY)上表现最优(F1=0.733),线性头在较不成熟粒细胞(Metamyelocyte, MMY)上最优(F1=0.585),余弦头在成熟粒细胞边界(Band neutrophil, BNE)上最优(F1=0.470)。基于此类特定任务的专业化特性,构建了一个头多样性集成模型,其中MLP头作为主预测器,并仅在四个预定义混淆对中当另外两个头家族一致时才替换其预测,从而提升整体分类鲁棒性;同时,识别出所有模型均持续误分类的样本,发现这些样本高度富集于可能标注错误或固有形态模糊的情况,为后续数据质量评估提供依据。
链接: https://arxiv.org/abs/2603.20383
作者: Antony Gitau,Martin Paulson,Bjørn-Jostein Singstad,Karl Thomas Hjelmervik,Ola Marius Lysaker,Veralia Gabriela Sanchez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ISBI 2026
Abstract:The classification of white blood cells (WBCs) from peripheral blood smears is critical for the diagnosis of leukemia. However, automated approaches still struggle due to challenges including class imbalance, domain shift, and morphological continuum confusion, where adjacent maturation stages exhibit subtle, overlapping features. We present a multi-stage fine-tuning methodology for 13-class WBC classification in the WBCBench 2026 Challenge (ISBI 2026). Our best-performing model is a fine-tuned DINOBloom-base, on which we train multiple classifier head families (linear, cosine, and multilayer perceptron (MLP)). The cosine head performed best on the mature granulocyte boundary (Band neutrophil (BNE) F1 = 0.470), the linear head on more immature granulocyte classes (Metamyelocyte (MMY) F1 = 0.585), and the MLP head on the most immature granulocyte (Promyelocyte (PMY) F1 = 0.733), revealing class-specific specialization. Based on this specialization, we construct a head-diverse ensemble, where the MLP head acts as the primary predictor, and its predictions within the four predefined confusion pairs are replaced only when two other head families agree. We further show that cases consistently misclassified by all models are substantially enriched for probable labeling errors or inherent morphological ambiguity.
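摘要给出的头多样性集成规则(MLP 头为主预测器,仅当另外两个头一致、且分歧落在预定义混淆对内时才替换)可以直接写成几行决策逻辑。下面是示意性草图:混淆对中的 "SNE"、"MY" 等标签是假设的类别名,论文的四个混淆对具体构成未在摘要中给出。

```python
def head_diverse_ensemble(mlp_pred, linear_pred, cosine_pred, confusion_pairs):
    """MLP head is the primary predictor; it is overridden only when the
    other two head families agree AND the disagreement lies within one of
    the predefined confusion pairs."""
    if linear_pred == cosine_pred and linear_pred != mlp_pred:
        if {mlp_pred, linear_pred} in [set(p) for p in confusion_pairs]:
            return linear_pred
    return mlp_pred
```

该规则只在“两票一致 + 已知易混淆边界”时才推翻主预测器,因此不会破坏 MLP 头在其余类别上的优势。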
[CV-229] Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier ICME2026
【速读】:该论文旨在解决生成式AI工作流中因上游模型输出与下游模型输入之间存在分布差异(distributional mismatch)而导致的整体生成质量下降的问题。其解决方案的关键在于提出一种简单且可插拔的模块——Uni-Classifier(Uni-C),该模块利用视频扩散先验(video diffusion priors)来引导前置模型的去噪过程,从而使其输出更符合下游任务的需求。这一机制不仅提升了多模型链式工作流中的整体性能,还能独立应用于单个生成模型以增强其输出质量,展现出良好的通用性和泛化能力。
链接: https://arxiv.org/abs/2603.20382
作者: Yujie Zhou,Pengyang Ling,Jiazi Bu,Bingjie Gao,Li Niu
机构: Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); Miguo.ai (迷宫科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2026
Abstract:In practical AI workflows, complex tasks often involve chaining multiple generative models, such as using a video or 3D generation model after a 2D image generator. However, distributional mismatches between the output of upstream models and the expected input of downstream models frequently degrade overall generation quality. To address this issue, we propose Uni-Classifier (Uni-C), a simple yet effective plug-and-play module that leverages video diffusion priors to guide the denoising process of preceding models, thereby aligning their outputs with downstream requirements. Uni-C can also be applied independently to enhance the output quality of individual generative models. Extensive experiments across video and 3D generation tasks demonstrate that Uni-C consistently improves generation quality in both workflow-based and standalone settings, highlighting its versatility and strong generalization capability.
[CV-230] Scene Representation using 360° Saliency Graph and its Application in Vision-based Indoor Navigation
【速读】:该论文旨在解决现有场景表示方法在视觉导航任务中效率低、信息表达不充分的问题,尤其是传统方法难以应对室内环境中的光照变化、遮挡和阴影等挑战。其解决方案的关键在于提出一种新型的360°显著性图(360° saliency graph)表示方法,该方法将场景的视觉、上下文、语义和几何信息以节点、边、边权重及角度位置的形式显式编码,从而构建出高效且鲁棒的场景表征。该表示不仅提升了场景定位精度,还能结合嵌入的几何信息实现2D导航路径规划,显著增强了基于视觉的室内导航性能。
链接: https://arxiv.org/abs/2603.20353
作者: Preeti Meena,Himanshu Kumar,Sandeep Yadav
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:
Abstract:A scene represented visually in formats such as RGB-D, LiDAR scans, keypoints, or rectangular, spherical, and multi-view images contains implicitly embedded information relevant to applications such as scene indexing and vision-based navigation; these representations may therefore be inefficient for such applications. This paper proposes a novel 360° saliency graph representation of scenes. This rich representation explicitly encodes the relevant visual, contextual, semantic, and geometric information of the scene as nodes, edges, edge weights, and angular positions in the 360° graph. The representation is also robust against scene view changes and addresses challenges of indoor environments, such as varied illumination, occlusions, and shadows, that hamper existing traditional methods. We utilize this rich and efficient representation for vision-based navigation and compare it with existing navigation methods that use 360° scenes, which suffer from poor scene representation lacking scene-specific information. This work first localizes the query scene in a given topological map using the proposed representation, and then facilitates 2D navigation by estimating the next required movement directions toward the target destination in the topological map using the geometric information embedded in the 360° saliency graph. Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation.
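“以节点、边、边权重与角度位置编码显著区域”的 360° 图结构可以用纯 Python 示意。以下草图中的区域元组格式、以相邻区域的圆周角距作为边权重的取法,均为示意性假设,并非论文的实际构图方式。

```python
def build_saliency_graph(regions):
    """regions: list of (label, saliency, angle_deg) tuples.
    Nodes are salient regions on the 360° panorama; each edge links
    angularly adjacent regions, weighted by circular angular distance."""
    nodes = {i: r for i, r in enumerate(regions)}
    edges = {}
    for i in range(len(regions)):
        j = (i + 1) % len(regions)          # wrap around the panorama
        diff = abs(regions[i][2] - regions[j][2]) % 360
        edges[(i, j)] = min(diff, 360.0 - diff)  # circular distance in degrees
    return nodes, edges
```

角度位置的显式保留正是该表示能直接推导“下一步移动方向”的原因:目标节点相对当前朝向的角度即为所需转向。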
[CV-231] Toward a Multi-View Brain Network Foundation Model: Cross-View Consistency Learning Across Arbitrary Atlases
【速读】:该论文旨在解决当前脑网络基础模型中存在的三大局限性:(1)依赖特定脑图谱(atlas)导致泛化能力受限;(2)未能充分挖掘多视角脑网络信息的互补性;(3)缺乏对解剖距离等先验知识的有效建模。其解决方案的关键在于提出MV-BrainFM,一个基于Transformer的多视角脑网络基础模型,通过显式引入解剖距离信息引导区域间交互,并设计无监督的跨视角一致性学习策略,在共享潜在空间中对同一受试者不同图谱构建的脑网络表示进行对齐;同时采用统一的多视图预训练范式,实现跨数据集与图谱的联合学习,从而在保持图谱感知能力的同时显著提升表示的通用性与可扩展性。
链接: https://arxiv.org/abs/2603.20348
作者: Jiaxing Xu,Jingying Ma,Xin Lin,Yuxiao Liu,Kai He,Qika Lin,Yiping Ke,Yang Li,Dinggang Shen,Mengling Feng
机构: National University of Singapore (新加坡国立大学); ShanghaiTech University (上海科技大学); Nanyang Technological University (南洋理工大学); Beihang University (北京航空航天大学); Shanghai United Imaging Intelligence (上海联影智能); Shanghai Clinical Research and Trial Center (上海临床研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Brain network analysis provides an interpretable framework for characterizing brain organization and has been widely used for neurological disorder identification. Recent advances in self-supervised learning have motivated the development of brain network foundation models. However, existing approaches are often limited by atlas dependency, insufficient exploitation of multiple network views, and weak incorporation of anatomical priors. In this work, we propose MV-BrainFM, a multi-view brain network foundation model designed to learn generalizable and scalable representations from brain networks constructed with arbitrary atlases. MV-BrainFM explicitly incorporates anatomical distance information into Transformer-based modeling to guide inter-regional interactions, and introduces an unsupervised cross-view consistency learning strategy to align representations from multiple atlases of the same subject in a shared latent space. By jointly enforcing within-view robustness and cross-view alignment during pretraining, the model effectively captures complementary information across heterogeneous network views while remaining atlas-aware. In addition, MV-BrainFM adopts a unified multi-view pretraining paradigm that enables simultaneous learning from multiple datasets and atlases, significantly improving computational efficiency compared to conventional sequential training strategies. The proposed framework also demonstrates strong scalability, consistently benefiting from increasing data diversity while maintaining stable performance across unseen atlas configurations. Extensive experiments on more than 20K subjects from 17 fMRI datasets show that MV-BrainFM consistently outperforms 14 existing brain network foundation models and task-specific baselines under both single-atlas and multi-atlas settings.
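“在共享潜在空间中对齐同一受试者不同图谱下的表示”最简单的形式是一个余弦一致性损失。下面是示意性草图(论文的具体损失形式未在摘要中给出,余弦相似度仅为常见选择之一):

```python
import numpy as np

def cross_view_consistency(z_a, z_b):
    """1 - cosine similarity between embeddings of the same subject
    computed under two different brain atlases (views)."""
    za = z_a / np.linalg.norm(z_a)
    zb = z_b / np.linalg.norm(z_b)
    return 1.0 - float(za @ zb)
```

训练时在每个 batch 内对同一受试者的多图谱嵌入两两施加该项,即可把不同图谱的网络视图拉入同一潜在空间。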
[CV-232] High-fidelity Multi-view Normal Integration with Scale-encoded Neural Surface Representation
【速读】:该论文旨在解决多视角表面法向量(surface normal)不一致导致的重建表面高频细节模糊问题。现有方法通常每像素仅采样一条射线,未考虑像素覆盖的空间区域随相机内参和相机-物体距离变化的影响,从而在不同距离下采集的法向量对应关系失配。解决方案的关键在于提出一种尺度编码的神经表面表示(scale-encoded neural surface representation),将每个像素的覆盖面积信息融入神经表示中,并通过混合网格编码计算3D点的法向量,从而有效建模不同观测距离下的多尺度表面法向量;同时引入一个尺度感知的网格提取模块,为每个顶点分配基于训练观测数据的最优局部尺度,实现高保真度的表面重建。
链接: https://arxiv.org/abs/2603.20337
作者: Tongyu Yang,Heng Guo,Yasuyuki Matsushita,Fumio Okura,Yu Luo,Xin Fan
机构: Dalian University of Technology (大连理工大学); Beijing University of Posts and Telecommunications (北京邮电大学); The University of Osaka (大阪大学); Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance (北京市多模态数据智能感知与治理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 11 figures
Abstract:Previous multi-view normal integration methods typically sample a single ray per pixel, without considering the spatial area covered by each pixel, which varies with camera intrinsics and the camera-to-object distance. Consequently, when the target object is captured at different distances, the normals at corresponding pixels may differ across views. This multi-view surface normal inconsistency results in the blurring of high-frequency details in the reconstructed surface. To address this issue, we propose a scale-encoded neural surface representation that incorporates the pixel coverage area into the neural representation. By associating each 3D point with a spatial scale and calculating its normal from a hybrid grid-based encoding, our method effectively represents multi-scale surface normals captured at varying distances. Furthermore, to enable scale-aware surface reconstruction, we introduce a mesh extraction module that assigns an optimal local scale to each vertex based on the training observations. Experimental results demonstrate that our approach consistently yields high-fidelity surface reconstruction from normals observed at varying distances, outperforming existing multi-view normal integration methods.
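摘要的出发点是:单个像素覆盖的表面面积随相机内参与拍摄距离变化。在针孔相机模型下,这一尺度可用一行公式示意(具体的尺度编码方式是论文的内容,这里只给出最基础的像素足迹计算):

```python
def pixel_footprint(depth, fx, pixel_size=1.0):
    """Metric side length of the surface patch covered by one pixel,
    under a pinhole camera model: footprint = pixel_size * depth / fx."""
    return pixel_size * depth / fx
```

距离加倍时足迹加倍,焦距加倍时足迹减半;这正是不同距离观测到的法向量对应不同空间尺度、需要尺度编码表示的原因。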
[CV-233] Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations
【速读】:该论文旨在解决视频世界模型(Video World Models)中因采用联合嵌入预测架构(JEPA)而产生的结构可解释性缺口问题:JEPA通过在潜在空间中预测被掩码区域而非像素重建来学习时空表征,这种机制虽提升了效率与泛化能力,但导致编码器所学的物理结构无法以可检视形式呈现,从而阻碍了对模型内部语义表示的理解。现有探测方法或依赖连续空间缺乏结构中间层,或引入生成组件造成行为归因混淆。解决方案的关键在于提出AI Mother Tongue (AIM) 框架——一种被动量化探测器(passive quantization probe),它无需任务特定监督、不修改编码器,即可将V-JEPA 2连续潜向量转化为离散符号序列;由于编码器完全冻结,所有符号结构均可归因于预训练的JEPA表示本身,而非探测器设计。实验验证了AIM能有效揭示潜空间中的结构性符号流形,为构建动作条件下的符号化世界模型奠定基础。
链接: https://arxiv.org/abs/2603.20327
作者: Liu hung ming
机构: PARRAWA AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 6 figures, 3 tables, 26 equations; independent research report; Stage 1 of a four-stage AIM–V-JEPA 2 integration roadmap; code available at this https URL
Abstract:Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V-JEPA 2 pre-trained representations – not to the probe. We evaluate through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^-4; MI 0.036–0.117 bits, NMI 1.2–3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
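“被动量化探测器”的核心是两步:把冻结编码器输出的连续潜向量最近邻量化为离散符号,再用符号分布与类别标签之间的互信息量化其语义结构。以下是示意性草图(码本构造方式与论文实际使用的 AIM 码本无关,这里用最近邻量化代替):

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour quantization of (N, D) latents to (K, D) codes."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)              # (N,) discrete symbol indices

def mutual_info(symbols, labels, K, C):
    """Mutual information (bits) between symbol and label distributions."""
    joint = np.zeros((K, C))
    for s, y in zip(symbols, labels):
        joint[s, y] += 1
    p = joint / joint.sum()
    px = p.sum(1, keepdims=True)    # marginal over symbols
    py = p.sum(0, keepdims=True)    # marginal over labels
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())
```

符号与标签完全对应时 MI 达到 log2(K) 比特;摘要中 0.036–0.117 bits 的取值说明 V-JEPA 2 潜空间的类别差异是渐变分布差异而非硬边界。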
[CV-234] Prompt-Free Lightweight SAM Adaptation for Histopathology Nuclei Segmentation with Strong Cross-Dataset Generalization
【速读】:该论文旨在解决组织病理学图像中细胞核分割任务中存在的计算复杂度高和跨数据集泛化能力弱的问题(即现有分割方法虽性能优异,但难以在不同数据集间保持稳定表现,且部署成本较高)。其解决方案的关键在于提出一种无需提示(prompt-free)且轻量化的基于Segment Anything Model (SAM) 的适配框架:通过利用多层编码器特征与残差解码结构实现高精度分割,同时仅微调冻结的SAM编码器中的LoRA模块(仅需4.1M可训练参数),从而显著降低计算开销并提升跨数据集的适应性。
链接: https://arxiv.org/abs/2603.20326
作者: Muhammad Hassan Maqsood,Yanming Zhu,Alfred Lam,Getamesay Dagnaw,Xuefei Yin,Alan Wee-Chung Liew
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Histopathology nuclei segmentation is crucial for quantitative tissue analysis and cancer diagnosis. Although existing segmentation methods have achieved strong performance, they are often computationally heavy and show limited generalization across datasets, which constrains their practical deployment. Recent SAM-based approaches have shown great potential in general and medical imaging, but typically rely on prompt guidance or complex decoders, making them less suitable for histopathology images with dense nuclei and heterogeneous appearances. We propose a prompt-free and lightweight SAM adaptation that leverages multi-level encoder features and residual decoding for accurate and efficient nuclei segmentation. The framework fine-tunes only LoRA modules within the frozen SAM encoder, requiring just 4.1M trainable parameters. Experiments on three benchmark datasets TNBC, MoNuSeg, and PanNuke demonstrate state-of-the-art performance and strong cross-dataset generalization, highlighting the effectiveness and practicality of the proposed framework for histopathology applications.
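摘要中“仅微调冻结编码器内的 LoRA 模块”可用一个最小的 LoRA 线性层示意:冻结权重 W 不动,只训练低秩因子 A、B。以下为通用 LoRA 结构的草图(秩 r、缩放 alpha 等超参数为示意取值,与论文设置无关):

```python
import numpy as np

class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trained."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (r, W.shape[1]))
        self.B = np.zeros((W.shape[0], r))           # zero init: update starts at 0
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

B 零初始化保证微调开始时网络行为与预训练编码器完全一致;可训练参数量仅为 r·(d_in + d_out),这正是 4.1M 可训练参数远小于 SAM 编码器本体的原因。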
[CV-235] DCG-Net: Dual Cross-Attention with Concept-Value Graph Reasoning for Interpretable Medical Diagnosis
【速读】:该论文旨在解决深度学习模型在医学图像分析中虽性能优异但决策过程难以解释的问题,以及现有概念瓶颈模型(Concept Bottleneck Models, CBMs)普遍忽视临床概念间上下文依赖关系的局限。其解决方案的关键在于提出一种端到端可解释框架DCG-Net,通过两个核心机制实现:一是引入双交叉注意力(Dual Cross-Attention)模块,以双向注意力机制替代传统的余弦相似度匹配,实现视觉特征与标准化文本概念-值原型之间的精准对齐及空间定位的证据归因;二是构建参数化概念图(Parametric Concept Graph),基于正点互信息(Positive Pointwise Mutual Information)先验初始化,并通过稀疏控制的消息传递进行优化,从而显式建模临床概念间的结构化依赖关系,使其符合临床知识逻辑。该方法在白细胞形态学和皮肤病变诊断任务上实现了最优分类性能并生成可临床理解的诊断解释。
链接: https://arxiv.org/abs/2603.20325
作者: Getamesay Dagnaw,Xuefei Yin,Muhammad Hassan Maqsood,Yanming Zhu,Alan Wee-Chung Liew
机构: Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning models have achieved strong performance in medical image analysis, but their internal decision processes remain difficult to interpret. Concept Bottleneck Models (CBMs) partially address this limitation by structuring predictions through human-interpretable clinical concepts. However, existing CBMs typically overlook the contextual dependencies among concepts. To address these issues, we propose an end-to-end interpretable framework \emphDCG-Net that integrates multimodal alignment with structured concept reasoning. DCG-Net introduces a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention between visual tokens and canonicalized textual concept-value prototypes, enabling spatially localized evidence attribution. To capture the relational structure inherent to clinical concepts, we develop a Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing. This formulation models inter-concept dependencies in a manner consistent with clinical domain knowledge. Experiments on white blood cell morphology and skin lesion diagnosis demonstrate that DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.
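摘要中的参数化概念图使用正点互信息(PPMI)先验初始化。PPMI 本身有标准定义,可从概念共现计数直接算出;以下实现与论文的图构建细节无关,仅示意该先验的计算:

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information from a co-occurrence count matrix."""
    p = counts / counts.sum()
    px = p.sum(1, keepdims=True)        # marginal of concept i
    py = p.sum(0, keepdims=True)        # marginal of concept j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p / (px * py))
    pmi[~np.isfinite(pmi)] = 0.0        # zero out log(0) entries
    return np.maximum(pmi, 0.0)         # clip negatives: "positive" PMI
```

独立共现的概念对 PPMI 为 0,强共现的概念对取正值,因此该矩阵天然适合作为概念图边权的初始先验,再由稀疏控制的消息传递精化。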
[CV-236] NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation
【速读】:该论文旨在解决视频人体姿态估计中因运动模糊、遮挡及复杂时空动态性导致的精度受限问题,现有方法多依赖热图或隐式时空特征聚合,限制了关节拓扑表达能力并削弱跨帧一致性。其解决方案的关键在于提出一种以节点为中心(node-centric)的新框架,通过显式融合视觉、时间与结构推理机制实现高精度姿态估计:首先设计基于视觉-时序速度的关节嵌入,融合亚像素级关节线索与帧间运动信息构建具外观和运动感知的表征;其次引入注意力驱动的姿态查询编码器,将关节表示映射至姿态感知节点空间,生成图像条件下的关节感知节点嵌入;在此基础上构建双分支解耦时空注意力图,分别建模局部与全局的时序传播与空间约束推理;最终通过节点空间专家融合模块自适应融合两分支互补输出,整合局部与全局线索完成最终关节预测。
链接: https://arxiv.org/abs/2603.20323
作者: Quang Dang Huynh,Xuefei Yin,Andrew Busch,Hugo G. Espinosa,Alan Wee-Chung Liew,Matthew T.O. Worsey,Yanming Zhu
机构: Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video-based human pose estimation remains challenged by motion blur, occlusion, and complex spatiotemporal dynamics. Existing methods often rely on heatmaps or implicit spatio-temporal feature aggregation, which limits joint topology expressiveness and weakens cross-frame consistency. To address these problems, we propose a novel node-centric framework that explicitly integrates visual, temporal, and structural reasoning for accurate pose estimation. First, we design a visuo-temporal velocity-based joint embedding that fuses sub-pixel joint cues and inter-frame motion to build appearance- and motion-aware representations. Then, we introduce an attention-driven pose-query encoder, which applies attention over joint-wise heatmaps and frame-wise features to map the joint representations into a pose-aware node space, generating image-conditioned joint-aware node embeddings. Building upon these node embeddings, we propose a dual-branch decoupled spatio-temporal attention graph that models temporal propagation and spatial constraint reasoning in specialized local and global branches. Finally, a node-space expert fusion module is proposed to adaptively fuse the complementary outputs from both branches, integrating local and global cues for final joint predictions. Extensive experiments on three widely used video pose benchmarks demonstrate that our method outperforms state-of-the-art methods. The results highlight the value of explicit node-centric reasoning, offering a new perspective for advancing video-based human pose estimation.
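“视觉-时序速度关节嵌入”的最小形式是把帧间位置差分(速度)拼接到关节坐标上。以下是示意性草图(论文的嵌入还融合亚像素热图线索,这里只演示速度分量的构造):

```python
import numpy as np

def velocity_embedding(joints_seq):
    """joints_seq: (T, J, 2) per-frame joint coordinates.
    Returns (T, J, 4): positions concatenated with frame-difference
    velocities (zero velocity for the first frame)."""
    vel = np.diff(joints_seq, axis=0)                       # (T-1, J, 2)
    vel = np.concatenate([np.zeros_like(vel[:1]), vel], 0)  # pad frame 0
    return np.concatenate([joints_seq, vel], axis=-1)
```

显式的速度通道让后续的节点空间推理无需从原始坐标中重新推断运动,这对运动模糊帧尤其有用。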
[CV-237] Which Workloads Belong in Orbit? A Workload-First Framework for Orbital Data Centers Using Semantic Abstraction
【速读】:该论文旨在解决空间计算(space-based compute)场景下任务分配的决策难题,即如何在轨计算与地面云平台之间合理划分AI工作负载,以应对日益增长的数据密集型AI任务和不断下降的发射成本。其解决方案的关键在于提出一个以工作负载为中心的框架,并结合轨道数据中心成熟度的分阶段采纳模型,强调通过语义抽象(semantic abstraction)而非原始计算规模来判断任务是否适合在轨执行。实验验证表明,基于语义压缩的原型系统可实现高达99.99%的数据量缩减,从而证明语义级处理是早期任务适配的核心驱动力。
链接: https://arxiv.org/abs/2603.20317
作者: Durgendra Narayan Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Space-based compute is becoming plausible as launch costs fall and data-intensive AI workloads grow. This paper proposes a workload-centric framework for deciding which tasks belong in orbit versus terrestrial cloud, along with a phased adoption model tied to orbital data center maturity. We ground the framework with in-orbit semantic-reduction prototypes. An Earth-observation pipeline on Sentinel-2 imagery from Seattle and Bengaluru (formerly Bangalore) achieves 99.7-99.99% payload reduction by converting raw imagery to compact semantic artifacts. A multi-pass stereo reconstruction prototype reduces ~306 MB to ~1.57 MB of derived 3D representations (99.49% reduction). These results support a workload-first view in which semantic abstraction, not raw compute scale, drives early workload suitability.
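摘要中 99.49% 与 99.7–99.99% 的载荷缩减率是简单的比例运算,可用一行函数核对:

```python
def reduction_pct(raw_bytes, reduced_bytes):
    """Percentage payload reduction after semantic abstraction."""
    return 100.0 * (1.0 - reduced_bytes / raw_bytes)
```

以摘要中的立体重建原型为例,约 306 MB 压缩到约 1.57 MB 即对应 99.49% 的缩减。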
[CV-238] VGS-Decoding: Visual Grounding Score Guided Decoding for Hallucination Mitigation in Medical VLMs
【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)在临床应用中因幻觉(hallucination)导致的可靠性问题,即模型倾向于依据语言先验生成回答而非基于视觉证据。解决方案的关键在于提出一种无需训练的推理阶段干预方法——视觉对齐评分引导解码(Visual Grounding Score Guided Decoding, VGS-Decoding)。其核心思想是:幻觉token在图像退化时概率保持或上升,而视觉对齐token的概率则下降;通过计算每个token的视觉对齐评分(Visual Grounding Score, VGS),衡量其对视觉信息的依赖程度,并在解码过程中自适应地增强视觉对齐token的概率、抑制幻觉token,从而实现细粒度的、逐token的控制,显著提升模型输出的准确性与可信度。
链接: https://arxiv.org/abs/2603.20314
作者: Govinda Kolli,Adinath Madhavrao Dukre,Behzad Bozorgtabar,Dwarikanath Mahapatra,Imran Razzak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token’s visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
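摘要中的逐 token 重加权机制可以这样示意:对比原图与退化图下的 token 分布,将“对视觉退化敏感”(VGS 为正)的 token 概率放大、不敏感的抑制。以下草图中把 VGS 定义为两个分布的对数概率差、并以系数 beta 线性加回 logits,这是一种示意性的具体化,未必与论文的精确公式一致:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def vgs_decode(logits_orig, logits_dist, beta=1.0):
    """Reweight the next-token distribution by each token's Visual
    Grounding Score: log p(original image) - log p(distorted image)."""
    p = softmax(logits_orig)
    q = softmax(logits_dist)
    vgs = np.log(p) - np.log(q)       # positive for visually grounded tokens
    return softmax(np.log(p) + beta * vgs)
```

beta 控制干预强度;与固定权重的对比解码不同,vgs 本身逐 token 变化,因此控制是自适应的。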
[CV-239] GraphiContact: Pose-aware Human-Scene Robust Contact Perception for Interactive Systems ICME2026
【速读】:该论文旨在解决单目图像中人体与场景的顶点级接触预测问题(monocular vertex-level human-scene contact prediction),这是交互式系统如辅助监控、具身智能和康复分析等任务的基础能力。现有方法要么忽视显式三维人体先验而仅关注接触预测,要么侧重于姿态或网格重建但未直接优化在遮挡和感知噪声下的鲁棒顶点级接触推理。解决方案的关键在于提出GraphiContact框架,该框架利用两个预训练Transformer编码器迁移互补的人体先验信息,并基于重建的三维人体网格进行逐顶点接触预测;同时引入Single-Image Multi-Infer Uncertainty (SIMU)训练策略,通过token级自适应路由模拟遮挡与噪声观测,从而提升模型在真实场景中的鲁棒性,且保持测试时单分支高效推理。
链接: https://arxiv.org/abs/2603.20310
作者: Xiaojian Lin,Yaomin Shen,Junyuan Ma,Yujie Sun,Chengqing Bu,Wenxin Zhang,Zongzheng Zhang,Hao Fei,Lei Jin,Hao Zhao
机构: Tsinghua University (清华大学); XR System Application Research Center, Nanchang Research Institute, Zhejiang University (浙江大学南昌研究院系统应用研究中心); Beijing University of Posts and Telecommunications (北京邮电大学); University of Chinese Academy of Sciences (中国科学院大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 15 pages, 9 figures, Accepted at ICME 2026
Abstract:Monocular vertex-level human-scene contact prediction is a fundamental capability for interactive systems such as assistive monitoring, embodied AI, and rehabilitation analysis. In this work, we study this task jointly with single-image 3D human mesh reconstruction, using reconstructed body geometry as a scaffold for contact reasoning. Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise. To address this gap, we propose GraphiContact, a pose-aware framework that transfers complementary human priors from two pretrained Transformer encoders and predicts per-vertex human-scene contact on the reconstructed mesh. To improve robustness in real-world scenarios, we further introduce a Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing, which simulates occlusion and noisy observations during training while preserving efficient single-branch inference at test time. Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction. Our code, based on the GraphiContact method, provides comprehensive 3D human reconstruction and interaction analysis, and will be publicly available at this https URL.
[CV-240] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
【速读】:该论文旨在解决音频驱动人脸生成(Audio-driven Talking Head Generation)中现有方法的两大局限性:一是基于AR(Autoregressive)的方法依赖中间面部表征,限制了表达力和真实感;二是基于扩散模型(Diffusion-based)的方法采用逐片段生成,缺乏细粒度控制且因全局去噪导致固有延迟。解决方案的关键在于提出EARTalking——一种端到端、类GPT的自回归模型,其核心创新为引入帧级上下文感知的流式生成范式(frame-by-frame, in-context, audio-driven streaming generation),并设计Sink Frame Window Attention(SFA)机制以支持变长视频生成并保持身份一致性,同时通过Streaming Frame Condition In-Context(FCIC)方案在流式过程中高效注入多样控制信号,实现任意时刻的交互式控制。
链接: https://arxiv.org/abs/2603.20307
作者: Yuzhe Weng,Haotian Wang,Yuanhong Yu,Jun Du,Shan He,Xiaoyan Wu,Haoran Xu
机构: University of Science and Technology of China (中国科学技术大学); iFLYTEK (科大讯飞); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
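Sink Frame Window Attention 的注意力模式可以用一个布尔掩码示意:每一帧只关注一个固定的“锚帧”(sink,用于保持身份一致)加上自身的局部滑窗。以下草图中的掩码形状与窗口取法是根据摘要描述做的假设性重构,并非论文的精确定义:

```python
import numpy as np

def sfa_mask(T, window, sink=0):
    """Boolean (T, T) attention mask: frame t attends to the sink frame
    plus the last `window` frames ending at t (causal sliding window)."""
    m = np.zeros((T, T), dtype=bool)
    for t in range(T):
        m[t, sink] = True                           # identity anchor
        m[t, max(0, t - window + 1): t + 1] = True  # local causal window
    return m
```

每帧的注意力开销为 O(window + 1) 而与总长度 T 无关,这是该机制支持变长流式生成的关键。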
[CV-241] The Global-Local loop: what is missing in bridging the gap between geospatial data from numerous communities?
【速读】:该论文旨在解决当前地理空间数据融合中普遍存在的“主从”(master-slave)范式问题,即多数方法仅将一个数据源作为辅助以提升另一个“主要”数据源的处理效果,缺乏多源数据之间的对称性利用与协同优势,导致无法充分挖掘多源异构数据在通用或专题应用中的潜力。其解决方案的关键在于提出并构建更具对称性和交互性的数据融合机制,通过典型应用场景验证最有效的交互模式,并强调跨尺度、跨社区的数据桥梁建设,从而实现“全局-局部循环”(global-local loop)的潜力释放,推动多源数据在更广泛场景下的协同价值最大化。
链接: https://arxiv.org/abs/2603.20305
作者: Clément Mallet,Ana-Maria Raimond
机构: Univ Gustave Eiffel (居斯塔夫·埃菲尔大学); Géodata Paris (巴黎地理数据); IGN (法国国家地理研究所); LASTIG (土地与空间信息实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2026 ISPRS Congress
Abstract:We face an unprecedented amount of geospatial data, describing directly or indirectly the Earth’s surface at multiple spatial, temporal, and semantic scales, and stemming from numerous contributors, from satellites to citizens. The main challenge in all the geospatial-related communities lies in suitably leveraging a combination of some of the sources for either a generic or a thematic application. Certain data fusion schemes are predominantly exploited: they correspond to popular tasks with mainstream data sources, e.g., free archives of Sentinel images coupled with OpenStreetMap data under an open and widespread deep-learning backbone for land-cover mapping purposes. Most of these approaches unfortunately operate under a “master-slave” paradigm, where one source is basically integrated to help processing the “main” source, without mutual advantages (e.g., large-scale estimation of a given biophysical variable using in-situ observations) and under a specific community bias. We argue that numerous key data fusion configurations, and in particular the effort in symmetrizing the exploitation of multiple data sources, are insufficiently addressed while being highly beneficial for generic or thematic applications. Bridges and retroactions between scales, communities and their respective sources are lacking, neglecting the utmost potential of such a “global-local loop”. In this paper, we propose to establish the most relevant interaction schemes through illustrative use cases. We subsequently discuss under-explored research directions that could take advantage of leveraging available data through multiple extents and communities.
[CV-242] Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在生成图像时缺乏可追溯性和责任归属的问题,尤其是现有水印技术存在检测效率低、依赖特定模型权重或难以跨模型迁移等局限。其解决方案的关键在于提出一种名为 DiffMark 的即插即用式水印方法:通过在每一步去噪过程中注入一个持久的、可学习的扰动项 δ(而非编码到初始噪声向量),使水印信号在最终去噪潜变量 z₀ 中累积;同时利用潜在一致性模型(Latent Consistency Models, LCM)作为可微训练桥梁,实现无需遍历完整去噪链即可高效反向传播梯度,从而支持单次前向推理完成多比特水印提取(仅需 16.4 ms,比基于采样的方法快 45 倍),并具备每图像密钥灵活性和跨扩散架构的可迁移性。
链接: https://arxiv.org/abs/2603.20304
作者: Hong-Hanh Nguyen-Le,Van-Tuan Tran,Thuc D. Nguyen,Nhien-An Le-Khac
机构: University College Dublin, Ireland; Trinity College Dublin, Ireland; Ho Chi Minh City University of Science
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As diffusion models (DMs) enable photorealistic image generation at unprecedented scale, watermarking techniques have become essential for provenance establishment and accountability. Existing methods face challenges: sampling-based approaches operate on frozen models but require costly N-step Denoising Diffusion Implicit Models (DDIM) inversion (typically N=50) for zero-bit-only detection; fine-tuning-based methods achieve fast multi-bit extraction but couple the watermark to a specific model checkpoint, requiring retraining for each architecture. We propose DiffMark, a plug-and-play watermarking method that offers three key advantages over existing approaches: single-pass multi-bit detection, per-image key flexibility, and cross-model transferability. Rather than encoding the watermark into the initial noise vector, DiffMark injects a persistent learned perturbation δ at every denoising step of a completely frozen DM. The watermark signal accumulates in the final denoised latent z_0 and is recovered in a single forward pass. The central challenge of backpropagating gradients through a frozen UNet without traversing the full denoising chain is addressed by employing Latent Consistency Models (LCM) as a differentiable training bridge. This reduces the number of gradient steps from 50 DDIM to 4 LCM and enables a single-pass detection at 16.4 ms, a 45x speedup over sampling-based methods. Moreover, by this design, the encoder learns to map any runtime secret to a unique perturbation at inference time, providing genuine per-image key flexibility and transferability to unseen diffusion-based architectures without per-model fine-tuning. While achieving these advantages, DiffMark also maintains competitive watermark robustness against distortion, regeneration, and adversarial attacks.
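下面用一段极简的 Python 玩具代码示意摘要中的核心机制:在冻结"去噪器"的每一步后叠加同一个扰动 δ,使其在最终潜变量 z_0 中线性累积,检测端只需与 δ 做一次内积。其中 denoise_step 的收缩形式、步数与 eps 等均为笔者为演示而做的假设,与 DiffMark 的真实模型无关:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64        # 潜变量维度(玩具设定)
steps = 4     # 对应摘要中 LCM 的 4 步去噪
delta = rng.normal(size=d)
delta /= np.linalg.norm(delta)   # "学习到的"扰动方向,此处用随机单位向量代替

def denoise_step(z):
    # 玩具"去噪器":向原点收缩,代替冻结扩散模型的一步更新
    return 0.9 * z

def generate(z_T, watermark=True, eps=0.5):
    z = z_T
    for _ in range(steps):
        z = denoise_step(z)
        if watermark:
            z = z + eps * delta   # 每步注入同一扰动,使其在 z_0 中累积
    return z

def detect(z0):
    # 单次前向检测:与 delta 的内积超过阈值即判定含水印
    return float(delta @ z0)

z_T = rng.normal(size=d)
score_wm = detect(generate(z_T, watermark=True))
score_clean = detect(generate(z_T, watermark=False))
```

由于去噪器是线性的,两个得分之差恰为各步扰动的衰减累积量 0.5*(1+0.9+0.81+0.729)。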
[CV-243] InjectFlow: Weak Guides Strong via Orthogonal Injection for Flow Matching
【速读】:该论文旨在解决流匹配(Flow Matching, FM)模型在生成分布外或少数类样本时因数据集偏差导致的语义退化问题,其核心在于揭示了条件期望平滑机制引发的轨迹锁定(trajectory lock-in)现象。解决方案的关键是提出InjectFlow方法,通过在初始速度场计算中注入正交语义信息,无需修改训练过程或随机种子即可有效防止潜在空间向多数模式漂移,从而提升生成公平性与鲁棒性。
链接: https://arxiv.org/abs/2603.20303
作者: Dayu Wang,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Flow Matching (FM) has recently emerged as a leading approach for high-fidelity visual generation, offering a robust continuous-time alternative to ordinary differential equation (ODE) based models. However, despite their success, FM models are highly sensitive to dataset biases, which cause severe semantic degradation when generating out-of-distribution or minority-class samples. In this paper, we provide a rigorous mathematical formalization of the “Bias Manifold” within the FM framework. We identify that this performance drop is driven by conditional expectation smoothing, a mechanism that inevitably leads to trajectory lock-in during inference. To resolve this, we introduce InjectFlow, a novel, training-free method by injecting orthogonal semantics during the initial velocity field computation, without requiring any changes to the random seeds. This design effectively prevents the latent drift toward majority modes while maintaining high generative quality. Extensive experiments demonstrate the effectiveness of our approach. Notably, on the GenEval dataset, InjectFlow successfully fixes 75% of the prompts that standard flow matching models fail to generate correctly. Ultimately, our theoretical analysis and algorithm provide a ready-to-use solution for building more fair and robust visual foundation models.
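摘要中的"正交语义注入"可以用一次 Gram-Schmidt 投影粗略示意:只把语义方向 s 中与当前速度场 v 正交的分量注入,从而在不改变原有前进方向的前提下引入新语义。函数名 orthogonal_inject 与强度系数 alpha 均为笔者假设,并非论文实现:

```python
import numpy as np

def orthogonal_inject(v, s, alpha=0.3):
    """将语义方向 s 中与速度场 v 正交的分量按强度 alpha 注入 v。
    v: 当前速度场(展平向量);s: 目标语义方向。示意性实现。"""
    s_par = (v @ s) / (v @ v) * v    # s 在 v 上的平行分量
    s_orth = s - s_par               # 正交分量:不干扰 v 原有方向
    return v + alpha * s_orth

v = np.array([1.0, 0.0, 0.0])
s = np.array([1.0, 2.0, 0.0])
v_new = orthogonal_inject(v, s, alpha=0.5)
```

注入量 (v_new - v) 与 v 严格正交,这正是"弱引导强"而不覆盖主轨迹的含义。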
[CV-244] HSI Image Enhancement Classification Based on Knowledge Distillation: A Study on Forgetting
【速读】:该论文旨在解决高光谱图像增量分类任务中因灾难性遗忘(catastrophic forgetting)导致的模型性能下降问题。其解决方案的关键在于提出了一种基于教师模型的知识保留方法,通过利用增量类别样本而非旧类别样本,有效缓解了对历史数据的依赖;同时引入基于掩码的局部类别知识蒸馏算法,通过对知识蒸馏过程进行解耦,过滤掉可能误导学生模型的噪声信息,从而提升整体分类准确率。
链接: https://arxiv.org/abs/2603.20292
作者: Songfeng Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18pages,7figures
Abstract:In incremental classification tasks for hyperspectral images, catastrophic forgetting is an unavoidable challenge. While memory recall methods can mitigate this issue, they heavily rely on samples from old categories. This paper proposes a teacher-based knowledge retention method for incremental image classification. It alleviates model forgetting of old category samples by utilizing incremental category samples, without depending on old category samples. Additionally, this paper introduces a mask-based partial category knowledge distillation algorithm. By decoupling knowledge distillation, this approach filters out potentially misleading information that could misguide the student model, thereby enhancing overall accuracy. Comparative and ablation experiments demonstrate the proposed method’s robust performance.
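摘要中"基于掩码的局部类别知识蒸馏"的思路可以草绘如下:只在被掩码选中的类别子集上计算教师-学生之间的 KL 蒸馏损失,其余类别被过滤掉。掩码的选取方式(此处假设前 6 类为旧类别)与温度 T 均为演示性设定,并非论文配置:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def masked_kd_loss(t_logits, s_logits, mask, T=2.0):
    """仅对 mask=True 的类别做知识蒸馏(KL 散度),
    过滤掉可能误导学生模型的其余类别。"""
    t = softmax(t_logits[:, mask] / T)   # 教师在子集上的软标签
    s = softmax(s_logits[:, mask] / T)   # 学生在同一子集上的预测
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))

rng = np.random.default_rng(0)
t_logits = rng.normal(size=(8, 10))      # 教师 logits:batch=8,10 类
s_logits = rng.normal(size=(8, 10))
mask = np.arange(10) < 6                 # 假设:只蒸馏前 6 个(旧)类别
loss = masked_kd_loss(t_logits, s_logits, mask)
same = masked_kd_loss(t_logits, t_logits, mask)   # 教师自蒸馏应为 0
```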
[CV-245] Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly
【速读】:该论文旨在解决透明碎片轮廓估计难题,该问题在精密光学仪器修复、文物修复及珍贵设备破损事故识别等领域具有重要意义。由于透明碎片具有严格的光学特性、不规则形状与边缘,传统视觉方法难以准确提取其轮廓。解决方案的关键在于提出了一种基于视觉-触觉融合的通用框架:首先构建了包含多场景合成数据的TransFrag27K数据集,并设计了可扩展的数据生成管道;其次开发了用于识别、定位和分割抓取位置的TransFragNet网络,结合配备Gelsight Mini传感器的双指夹爪获取碎片侧边触觉信息;通过融合视觉与触觉特征,提出视觉-触觉融合材料分类器,模拟人类通过视觉与触觉协同感知碎片轮廓的机制;最终引入多维相似性度量的轮廓匹配与重装算法,为评估透明碎片轮廓估计与重装性能提供可复现基准。
链接: https://arxiv.org/abs/2603.20290
作者: Qihao Lin,Borui Chen,Yuping Zhou,Jianing Wu,Yulan Guo,Weishi Zheng,Chongkun Xia
机构: Sun Yat-sen University (中山大学); Sun Yat-sen University (中山大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 17 pages, 22 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract:The contour estimation of transparent fragments is very important for autonomous reassembly, especially in the fields of precision optical instrument repair, cultural relic restoration, and damage identification for other precious devices. Unlike generic intact transparent objects, transparent fragments pose greater challenges for contour estimation due to their strict optical properties and irregular shapes and edges. To address this issue, a general transparent fragments contour estimation framework based on visual-tactile fusion is proposed in this paper. First, we construct the transparent fragment dataset named TransFrag27K, which includes multiscene synthetic data of broken fragments from multiple types of transparent objects, and a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate, and segment the sampling grasping position, and we use a two-finger gripper with Gelsight Mini sensors to obtain reconstructed tactile information of the lateral edges of the fragments. By fusing this tactile information with visual cues, a visual-tactile fusion material classifier is proposed. Inspired by the way humans estimate a fragment's contour by combining vision and touch, we introduce a general transparent fragment contour estimation framework based on visual-tactile fusion, which demonstrates strong performance in real-world validation. Finally, a multi-dimensional similarity metrics based contour matching and reassembly algorithm is proposed, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. The experimental results demonstrate the validity of the proposed framework. The dataset and codes are available at this https URL.
[CV-246] Remote Sensing Image Dehazing: A Systematic Review of Progress Challenges and Prospects
【速读】:该论文旨在解决遥感图像(Remote Sensing Images, RSIs)因雾霾、薄云等大气干扰导致的退化问题,这些问题会掩盖地表反射率信息并影响下游应用。其解决方案的关键在于提出了一种系统性、统一的去雾方法演进框架,将现有方法归纳为三个阶段:基于手工设计物理先验的方法、数据驱动的深度复原方法,以及融合物理机制与智能生成的混合模型。研究通过大规模定量实验验证了Transformer和扩散模型在结构相似性(SSIM)上提升12%~18%、感知误差降低20%~35%,同时发现显式引入透射率或大气光约束的物理引导模型能显著减少颜色偏差(最高达27%),从而为构建可信、可控且高效的(TCE)遥感图像去雾系统提供了理论依据和技术路径。
链接: https://arxiv.org/abs/2603.20289
作者: Heng Zhou,Xiaoxiong Liu,Zhenxi Zhang,Jieheng Yun,Chengyang Li,Yunchu Yang,Dongyi Xia,Chunna Tian,Xiao-Jun Wu
机构: Jiangnan University (江南大学); Xidian University (西安电子科技大学); China University of Petroleum (Beijing) (中国石油大学(北京)); Chinese Academy of Sciences (中国科学院); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 82 pages, 23 figures,
Abstract:Remote sensing images (RSIs) are frequently degraded by haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. This study presents the first systematic and unified survey of RSIs dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. We categorize existing approaches into a three-stage progression: from handcrafted physical priors, to data-driven deep restoration, and finally to hybrid physical-intelligent generation, and summarize more than 30 representative methods across CNNs, GANs, Transformers, and diffusion models. To provide a reliable empirical reference, we conduct large-scale quantitative experiments on five public datasets using 12 metrics, including PSNR, SSIM, CIEDE, LPIPS, FID, SAM, ERGAS, UIQI, QNR, NIQE, and HIST. Cross-domain comparison reveals that recent Transformer- and diffusion-based models improve SSIM by 12%~18% and reduce perceptual errors by 20%~35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further demonstrates that models with explicit transmission or airlight constraints reduce color bias by up to 27%. Based on these findings, we summarize open challenges: dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations, and outline promising research directions for future development of trustworthy, controllable, and efficient (TCE) dehazing systems. All reviewed resources, including source code, benchmark datasets, evaluation metrics, and reproduction configurations are publicly available at this https URL.
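综述中"物理引导"一类方法共享经典大气散射成像模型 I = J*t + A*(1-t)。下面的草图按该模型正向合成有雾图,再显式利用透射率与大气光约束反解清晰图,仅演示物理先验的基本用法;t0 截断阈值为常见的假设取值,与文中任何具体方法无关:

```python
import numpy as np

def dehaze(I, A, t, t0=0.1):
    """按大气散射模型 I = J*t + A*(1-t) 反解无雾图 J。
    t0 为透射率下限截断,避免 t 过小时放大噪声(经典做法)。"""
    t = np.maximum(t, t0)
    return (I - A) / t + A

J_true = np.linspace(0.0, 1.0, 5)     # 玩具"清晰图"(一行像素)
A, t = 0.8, 0.5                        # 大气光与透射率(假设为全局常数)
I_hazy = J_true * t + A * (1 - t)      # 正向合成有雾观测
J_rec = dehaze(I_hazy, A, t)           # 已知 A、t 时可精确复原
```

真实方法的难点正是在单幅图中估计 A 与 t;该模型只是其共同的物理出发点。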
[CV-247] Efficient Visual Anomaly Detection at the Edge: Enabling Real-Time Industrial Inspection on Resource-Constrained Devices
【速读】:该论文旨在解决工业视觉异常检测(Visual Anomaly Detection, VAD)在边缘设备部署时面临的实时性与隐私保护挑战,尤其是受限于边缘硬件的内存和计算资源瓶颈。解决方案的关键在于提出两种轻量化改进方法:PatchCore-Lite 通过分层搜索策略(先对产品量化记忆库进行粗粒度检索,再对解码子集进行精确匹配)显著降低内存占用;Padim-Lite 则利用对角协方差矩阵将马氏距离(Mahalanobis distance)转化为高效的逐元素运算,从而大幅提升推理速度。实验表明,PatchCore-Lite 和 Padim-Lite 分别实现 79% 和 77% 的内存压缩,并分别带来 31% 的推理时间优化,验证了其在边缘环境下的可行性与高效性。
链接: https://arxiv.org/abs/2603.20288
作者: Arianna Stropeni,Fabrizio Genilotti,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Gian Antonio Susto
机构: University of Padua (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Anomaly Detection (VAD) is essential for industrial quality control, enabling automatic defect detection in manufacturing. In real production lines, VAD systems must satisfy strict real-time and privacy requirements, necessitating a shift from cloud-based processing to local edge deployment. However, processing data locally on edge devices introduces new challenges because edge hardware has limited memory and computational resources. To overcome these limitations, we propose two efficient VAD methods designed for edge deployment: PatchCore-Lite and PaDiM-Lite, based on the popular PatchCore and PaDiM models. PatchCore-Lite runs first a coarse search on a product-quantized memory bank, then an exact search on a decoded subset. PaDiM-Lite is sped up using diagonal covariance, turning Mahalanobis distance into efficient element-wise computation. We evaluate our methods on the MVTec AD and VisA benchmarks and show their suitability for edge environments. PatchCore-Lite achieves a remarkable 79% reduction in total memory footprint, while PaDiM-Lite achieves substantial efficiency gains with a 77% reduction in total memory and a 31% decrease in inference time. These results show that VAD can be effectively deployed on edge devices, enabling real-time, private, and cost-efficient industrial inspection.
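PaDiM-Lite 的加速点在于:协方差取对角后,马氏距离退化为逐元素运算,无需协方差矩阵求逆。以下为笔者的示意实现(特征维度与样本数均为玩具设定),测试中可验证它与对角协方差下的完整马氏距离数值一致:

```python
import numpy as np

def mahalanobis_diag(x, mu, var, eps=1e-6):
    """对角协方差下的马氏距离:逐元素除以方差再求和开方,
    无矩阵求逆,即摘要中 PaDiM-Lite 的加速思路(示意实现)。"""
    return np.sqrt(np.sum((x - mu) ** 2 / (var + eps), axis=-1))

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))        # 正常样本的 patch 特征(玩具数据)
mu, var = feats.mean(0), feats.var(0)    # 每个维度的均值与方差
normal = feats[0]
anomaly = normal + 5.0                   # 人为制造的偏移样本
d_n = mahalanobis_diag(normal, mu, var)
d_a = mahalanobis_diag(anomaly, mu, var)
```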
[CV-248] STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction CVPR2026
【速读】:该论文旨在解决在线3D重建中长期时间一致性与内存效率之间的矛盾问题,即在流式输入场景下,传统因果VGGT(Vision-Guided Generative Transformer)通过键值(Key-Value, KV)缓存机制实现时空建模时,缓存随输入序列长度线性增长,导致内存瓶颈;受限于内存预算时,早期缓存剔除严重损害重建质量与时间一致性。其解决方案的关键在于提出STAC(Spatio-Temporally Aware Cache Compression)框架,核心创新包括:(1) 工作时间令牌缓存机制,利用衰减累积注意力得分保留长期信息性令牌;(2) 长期空间令牌压缩方案,将空间冗余令牌压缩为体素对齐表示以提升存储效率;(3) 基于块的多帧优化策略,联合处理连续帧以增强时间连贯性和GPU利用率。实验表明,STAC在保持最优重建质量的同时,内存消耗降低近10倍,推理速度提升4倍,显著提升了实时流式3D重建的可扩展性。
链接: https://arxiv.org/abs/2603.20284
作者: Runze Wang,Yuxuan Song,Youcheng Cai,Ligang Liu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
备注: 10 pages, 6 figures. Accepted by CVPR 2026
Abstract:Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
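摘要中"衰减累积注意力得分"的缓存剔除策略可以草绘如下:每步把历史得分乘以衰减系数再累加当前注意力,超出内存预算时只保留得分最高的 token。decay 与 budget 为笔者假设的参数;真实 STAC 还包含空间体素压缩与分块优化,此处从略:

```python
import numpy as np

def update_scores(scores, attn, decay=0.9):
    """衰减累积注意力得分:旧得分乘以 decay 后,加上本步各 token 收到的注意力。"""
    return decay * scores + attn

def evict(scores, budget):
    """缓存超预算时,保留累积得分最高的 budget 个 token(返回保留索引)。"""
    return np.argsort(scores)[-budget:]

rng = np.random.default_rng(0)
n_tokens, steps, budget = 16, 5, 8
scores = np.zeros(n_tokens)
for _ in range(steps):
    attn = rng.dirichlet(np.ones(n_tokens))   # 模拟一步注意力分布(和为 1)
    scores = update_scores(scores, attn)
keep = evict(scores, budget)
```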
[CV-249] Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在边缘设备部署时,如何实现强压缩且保持最小精度损失的问题。现有单一策略的剪枝方法因不同层和架构对剪枝敏感性差异而效果不佳,难以兼顾全局最优与部署实用性。解决方案的关键在于提出一种全局引导、逐层稀疏化的框架——Mix-and-Match Pruning,其核心是利用敏感度评分(如幅度、梯度或其组合)与简单架构规则协同生成多样且高质量的剪枝配置,自动推导出架构感知的稀疏范围(例如保留归一化层、更激进地剪枝分类器),并通过系统采样每个敏感度信号产生十种策略,从而避免重复剪枝迭代,直接提供可部署的精度-稀疏度权衡方案,实验证明其在CNN和视觉Transformer上均达到帕累托最优性能,显著降低Swin-Tiny模型的精度退化达40%。
链接: https://arxiv.org/abs/2603.20280
作者: Danial Monachan,Samira Nazari,Mahdi Taheri,Ali Azarpeyvand,Milos Krstic,Michael Huebner,Christian Herglotz
机构: 1. University of Stuttgart (斯图加特大学); 2. University of Bremen (不来梅大学); 3. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); 4. University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:
Abstract:Deploying deep neural networks (DNNs) on edge devices requires strong compression with minimal accuracy loss. This paper introduces Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework that leverages sensitivity scores and simple architectural rules to generate diverse, high-quality pruning configurations. The framework addresses a key limitation that different layers and architectures respond differently to pruning, making single-strategy approaches suboptimal. Mix-and-Match derives architecture-aware sparsity ranges, e.g., preserving normalization layers while pruning classifiers more aggressively, and systematically samples these ranges to produce ten strategies per sensitivity signal (magnitude, gradient, or their combination). This eliminates repeated pruning runs while offering deployment-ready accuracy-sparsity trade-offs. Experiments on CNNs and Vision Transformers demonstrate Pareto-optimal results, with Mix-and-Match reducing accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning. These findings show that coordinating existing pruning signals enables more reliable and efficient compressed models than introducing new criteria.
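Mix-and-Match 的"结构规则 + 逐层稀疏度"思路可以用幅度剪枝草绘:先按规则给每层分配稀疏度(如归一化层为 0、分类器更高),再在层内把绝对值最小的权重置零。sparsity_plan 中的具体数值为笔者假设,并非论文给出的配置:

```python
import numpy as np

def prune_layer(w, sparsity):
    """按幅度剪枝:将绝对值最小的 sparsity 比例的权重置零。"""
    if sparsity <= 0:
        return w.copy()
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k - 1] if k > 0 else -np.inf
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(size=(100,)),
          "norm1": rng.normal(size=(10,)),
          "classifier": rng.normal(size=(200,))}
# 结构规则示例(假设值):归一化层不剪,分类器更激进
sparsity_plan = {"conv1": 0.5, "norm1": 0.0, "classifier": 0.8}
pruned = {name: prune_layer(w, sparsity_plan[name]) for name, w in layers.items()}
```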
[CV-250] Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在显著深度冗余的问题,尤其是针对需要感知与多步推理紧密耦合的领域(如数学问题),如何有效剪枝解码器层以保留关键能力而不损害性能。其解决方案的关键在于提出一种基于域感知激活相似性的结构化层剪枝方法:通过量化每层对数学与非数学输入表示变换的强度,构建数学感知、非数学感知及混合的层排序标准,从而识别在特定域内输入-输出激活变化最小的层。实验表明,该方法在低剪枝预算下表现出最优稳定性,在高预算时仍能保持或超越结构感知基线,揭示了VLM深度对域特异性行为的贡献机制,并提供了一种可解释且高效的深度压缩策略。
链接: https://arxiv.org/abs/2603.20275
作者: Saeed Khaki,Nima Safaei,Kamal Ginotra
机构: Microsoft AI; Ohio State University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer-based vision-language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi-step reasoning. We study structured decoder layer pruning through the lens of domain-aware activation similarity, measuring how strongly each layer transforms representations for math versus non-math inputs. This yields simple math-aware, non-math-aware, and mixed ranking criteria that identify layers whose input-output activations change least within a target domain. Across two state-of-the-art VLMs and a broad suite of math and general multimodal benchmarks, we uncover a consistent three-regime structure: at low pruning budgets, performance is highly sensitive to which layers are removed; at moderate budgets, methods converge as structural damage accumulates; and at high budgets, structural continuity dominates, favoring spacing-aware strategies. Our domain-aware rankings achieve the strongest stability in the ranking-sensitive regime, while matching or exceeding structure-aware baselines at larger budgets. These results provide a clearer picture of how depth contributes to domain-specific behavior in VLMs and offer a practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision-language capabilities.
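摘要中"输入-输出激活相似度"的层排序准则可以示意如下:对每层计算其输入与输出激活的余弦相似度,相似度越高(即该层对表示的变换越小)的层越先进入剪枝候选。激活向量均为随机玩具数据,"域"的划分(数学/非数学输入)此处从略:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_prunable(acts_in, acts_out):
    """acts_in/acts_out: {层名: 激活向量}。
    输入与输出越相似(变换越弱)的层排名越靠前,越适合剪除。"""
    sims = {l: cosine(acts_in[l], acts_out[l]) for l in acts_in}
    return sorted(sims, key=sims.get, reverse=True)

rng = np.random.default_rng(0)
x = rng.normal(size=32)
acts_in = {"layer0": x, "layer1": x, "layer2": x}
acts_out = {"layer0": x + 0.01 * rng.normal(size=32),  # 近似恒等 -> 最可剪
            "layer1": x + 0.5 * rng.normal(size=32),   # 中等变换
            "layer2": -x}                              # 变换剧烈 -> 最该保留
order = rank_prunable(acts_in, acts_out)
```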
[CV-251] Efficient AI-Driven Multi-Section Whole Slide Image Analysis for Biochemical Recurrence Prediction in Prostate Cancer
【速读】:该论文旨在解决前列腺癌患者在根治性前列腺切除术后生化复发(Biochemical Recurrence, BCR)精准预测的难题,其核心挑战源于肿瘤在前列腺腺体中分布的多灶性特征。解决方案的关键在于提出了一种新型人工智能(AI)框架,能够同时处理一系列多切片病理图像,从而全面捕捉整个前列腺腺体范围内的肿瘤景观。该框架基于包含789名患者共23,451张病理切片的大规模数据集进行训练,在1年和2年BCR预测中表现出显著优于传统临床指标(如术前PSA水平和Gleason评分)的性能,并被证实为多变量Cox比例风险模型中最强的独立预后因子。此外,通过引入patch与slide子采样策略,在不牺牲预测准确性的情况下大幅降低计算成本,且经外部验证确认了模型的良好泛化能力,体现出该AI方法在临床实践中用于术后管理的可行性与可扩展性。
链接: https://arxiv.org/abs/2603.20273
作者: Yesung Cho,Dongmyung Shin,Sujeong Hong,Jooyeon Lee,Seongmin Park,Geongyu Lee,Jongbae Park,Hong Koo Ha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Prostate cancer is one of the most frequently diagnosed malignancies in men worldwide. However, precise prediction of biochemical recurrence (BCR) after radical prostatectomy remains challenging due to the multifocality of tumors distributed throughout the prostate gland. In this paper, we propose a novel AI framework that simultaneously processes a series of multi-section pathology slides to capture the comprehensive tumor landscape across the entire prostate gland. To develop this predictive AI model, we curated a large-scale dataset of 23,451 slides from 789 patients. The proposed framework demonstrated strong predictive performance for 1- and 2-year BCR prediction, substantially outperforming established clinical benchmarks. The AI-derived risk score was validated as the most potent independent prognostic factor in a multivariable Cox proportional hazards analysis, surpassing conventional clinical markers such as pre-operative PSA and Gleason score. Furthermore, we demonstrated that integrating patch and slide sub-sampling strategies significantly reduces computational cost during both training and inference without compromising predictive performance, and generalizability of AI was confirmed through external validation. Collectively, these results highlight the clinical feasibility and prognostic value of the proposed AI-based multi-section slide analysis as a scalable tool for post-operative management in prostate cancer.
[CV-252] Rheos: Modelling Continuous Motion Dynamics in Hierarchical 3D Scene Graphs
【速读】:该论文旨在解决3D场景图(3DSG)在动态建模方面的局限性,即现有方法仅能跟踪个体代理的运动,而无法有效刻画群体层面的动态模式。同时,传统基于均匀网格的动态地图(MoD)缺乏语义基础且扩展性差。解决方案的关键在于提出Rheos框架,其核心创新是将连续方向运动模型显式嵌入到3DSG的额外动力学层中,每个动力学节点采用半包裹高斯混合模型(semi-wrapped Gaussian mixture model)来以概率形式表征多模态方向流,并显式建模不确定性,替代了先前工作中使用的离散直方图。此外,通过水库采样(reservoir sampling)、并行单元更新和贝叶斯信息准则(BIC)自动选择最优混合成分数,显著降低了在线更新时的计算复杂度,实现了高效、可扩展的动态建模。
链接: https://arxiv.org/abs/2603.20239
作者: Iacopo Catalano,Francesco Verdoja,Javier Civera,Jorge Peña-Queralta,Julio A. Placed
机构: University of Turku (图尔库大学); Zürich University of Applied Sciences (苏黎世应用科学大学); Instituto Tecnológico de Aragón (阿拉贡技术研究所); University of Zaragoza (萨拉戈萨大学); Aalto University (阿尔托大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Scene Graphs (3DSGs) provide hierarchical, multi-resolution abstractions that encode the geometric and semantic structure of an environment, yet their treatment of dynamics remains limited to tracking individual agents. Maps of Dynamics (MoDs) complement this by modeling aggregate motion patterns, but rely on uniform grid discretizations that lack semantic grounding and scale poorly. We present Rheos, a framework that explicitly embeds continuous directional motion models into an additional dynamics layer of a hierarchical 3DSG that enhances the navigational properties of the graph. Each dynamics node maintains a semi-wrapped Gaussian mixture model that captures multimodal directional flow as a principled probability distribution with explicit uncertainty, replacing the discrete histograms used in prior work. To enable online operation, Rheos employs reservoir sampling for bounded-memory observation buffers, parallel per-cell model updates and a principled Bayesian Information Criterion (BIC) sweep that selects the optimal number of mixture components, reducing per-update initialization cost from quadratic to linear in the number of samples. Evaluated across four spatial resolutions in a simulated pedestrian environment, Rheos consistently outperforms the discrete baseline under continuous as well as unfavorable discrete metrics. We release our implementation as open source.
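Rheos 用水库采样(reservoir sampling)为每个 dynamics 节点维持有界内存的观测缓冲。下面是经典 Algorithm R 的标准实现(通用算法,论文中的具体变体可能不同;k 与流长度为演示取值),它保证缓冲区始终是流前缀的均匀样本:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """经典水库采样(Algorithm R):单遍扫描、O(k) 内存,
    使缓冲区任意时刻都是已见元素的均匀随机样本。"""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)          # 前 k 个元素直接入库
        else:
            j = rng.randint(0, i)        # 等价于以 k/(i+1) 的概率替换
            if j < k:
                reservoir[j] = x
    return reservoir

sample = reservoir_sample(range(10000), k=32)
```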
[CV-253] FIGURA: A Modular Prompt Engineering Method for Artistic Figure Photography in Safety-Filtered Text-to-Image Models
【速读】:该论文旨在解决当前商业文本到图像(Text-to-Image, T2I)模型中安全过滤器对合法艺术内容(尤其是涉及人体的古典裸体摄影)过度限制的问题,即安全机制将艺术性表达与非法内容等同处理,导致专业艺术家无法在启用安全过滤器的前提下生成符合意图的艺术作品。解决方案的关键在于提出FIGURA方法(Framework for Intelligent Generation of Unrestricted Artistic Results),其核心是基于实证发现构建一套模块化提示工程系统,通过八个相互关联的知识文件实现精准控制:首先识别出安全过滤器主要响应“缺失描述”(如“无衣物”)而非“存在描述”(如“身体形态”),形成“黄金法则”;其次利用艺术家引用作为美学引导和安全锚点以改变过滤行为;再者引入空间语境作为独立变量影响过滤效果;最后采用几何词汇描述人体轮廓可绕过轮廓识别机制。该方法在FLUX 2 Pro模型上验证成功率达80%–90%,表明可通过系统性策略适配现有安全机制,而非规避其功能。
链接: https://arxiv.org/abs/2603.20201
作者: Luca Cazzaniga
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 10 pages, 6 tables. Preprint
Abstract:Safety filters in commercial text-to-image (T2I) models systematically block legitimate artistic content involving the human figure, treating classical nude photography with the same restrictiveness as explicit material. While prior research has documented this problem extensively, no operational system exists that enables professional artists to generate artistic figure photography within the constraints of active safety filters. We present the FIGURA Method (Framework for Intelligent Generation of Unrestricted Artistic Results), a modular prompt engineering system comprising eight interconnected knowledge files, empirically validated through 200+ documented generation tests on FLUX 2 Pro (Cloud) with active safety filters at the default tolerance level. Our systematic testing reveals several previously undocumented findings: (1) safety filters primarily detect absence descriptions (references to missing clothing) rather than presence descriptions (references to body form), which we formalize as the Golden Rule; (2) artistic references to painters function simultaneously as aesthetic guides and as safety anchors that alter filter behavior; (3) spatial context operates as an independent filter variable, with documented success rate hierarchies; and (4) geometric vocabulary for body description bypasses pattern recognition in silhouette contexts. The system achieves documented success rates between 80% and 90% across five structured prompt templates, demonstrating that the artistic censorship problem identified in recent literature admits practical, systematic solutions that work with active safety mechanisms rather than circumventing them.
[CV-254] Your Robot Will Feel You Now: Empathy in Robots and Embodied Agents
【速读】:该论文旨在解决如何将人类与动物行为中体现的共情机制有效迁移至人机交互(Human-Robot Interaction, HRI)和具身对话代理(Embodied Conversational Agents, ECAs)系统中,从而赋予机器多模态的社会与情感智能。其解决方案的关键在于系统性回顾已有研究中所采用的共情行为模型与实现方式——既包括对人类及动物共情行为的模仿,也涵盖为机器特性设计的独特类比方法,并以此为基础提炼出可应用于当前以语言为主导的通用型智能代理(如ChatGPT)的共情建模经验与技术路径。
链接: https://arxiv.org/abs/2603.20200
作者: Angelica Lim,Ö. Nilay Yalçin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted manuscript. Chapter in “Empathy and Artificial Intelligence: Challenges, Advances and Ethical Considerations” edited by Anat Perry; C. Daryl Cameron
Abstract:The fields of human-robot interaction (HRI) and embodied conversational agents (ECAs) have long studied how empathy could be implemented in machines. One of the major drivers has been the goal of giving multimodal social and emotional intelligence to these artificially intelligent agents, which interact with people through facial expressions, body, gesture, and speech. What empathic behaviors and models have these fields implemented by mimicking human and animal behavior? In what ways have they explored creating machine-specific analogies? This chapter aims to review the knowledge from these studies, towards applying the lessons learned to today’s ubiquitous, language-based agents such as ChatGPT.
[CV-255] Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agent ic Planning
【速读】:该论文旨在解决当前多模态红队测试中攻击方式结构脆弱的问题,即现有方法依赖于将恶意载荷嵌入图像(如通过文字或对抗噪声)作为攻击载体,而这类攻击容易被标准防御机制识别并消除。为应对这一局限,作者提出视觉独占性(Visual Exclusivity, VE)这一新型威胁模型,其核心在于:危害仅在模型对图像内容进行深层推理时才显现,例如解析技术图纸等复杂视觉信息。解决方案的关键是提出多模态多轮代理规划框架(Multimodal Multi-turn Agentic Planning, MM-Plan),该框架将越狱攻击从逐轮反应转变为全局策略合成,并利用分组相对策略优化(Group Relative Policy Optimization, GRPO)训练攻击规划器自动生成高效、多轮的攻击策略,从而无需人工干预即可发现有效攻击路径。实验表明,MM-Plan在Claude 4.5 Sonnet和GPT-5上分别达到46.3%和13.8%的攻击成功率,显著优于现有基线方法(提升2–5倍),揭示了前沿大模型在视觉推理层面仍存在严重安全漏洞。
链接: https://arxiv.org/abs/2603.20198
作者: Yunbei Zhang,Yingqiang Ge,Weijie Xu,Yuhui Xu,Jihun Hamm,Chandan K. Reddy
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise. These attacks are structurally brittle, as standard defenses neutralize them once the payload is exposed. We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis. MM-Plan trains an attacker planner to synthesize comprehensive, multi-turn strategies, optimized via Group Relative Policy Optimization (GRPO), enabling self-discovery of effective strategies without human supervision. To rigorously benchmark this reasoning-dependent threat, we introduce VE-Safety, a human-curated dataset filling a critical gap in evaluating high-risk technical visual understanding. MM-Plan achieves 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2–5x where existing methods largely fail. These findings reveal that frontier models remain vulnerable to agentic multimodal attacks, exposing a critical gap in current safety alignment. Warning: This paper contains potentially harmful content.
[CV-256] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
【速读】:该论文旨在解决生成式 AI (Generative AI) 中人体图像动画的时序一致性与高保真度问题,尤其是在保持参考人物外观不变的前提下实现自然流畅的动作迁移。其解决方案的关键在于:首先采用低秩适应(Low-Rank Adaptation, LoRA)技术对开源 Wan2.1 模型进行轻量级微调,有效降低训练内存开销;其次设计了一个由多层3D卷积堆叠构成的轻量化姿态编码器,用于高效提取驱动姿态中的运动信息;最后通过简单的特征拼接操作融合参考图像的外观信息,并引入参考图像的姿态信息以增强姿态对齐效果,从而在480p训练数据上实现720p推理时的无缝扩展和高质量动画生成。
链接: https://arxiv.org/abs/2504.11289
作者: Xiang Wang,Shiwei Zhang,Longxiang Tang,Yingya Zhang,Changxin Gao,Yuehuan Wang,Nong Sang
机构: Huazhong University of Science and Technology (华中科技大学); Alibaba Group (阿里巴巴集团); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: The training and inference code (based on Wan2.1) is available at this https URL
Abstract:This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities, seamlessly upscaling to 720p (1280x720) during inference. The training and inference code is publicly available at this https URL.
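摘要提到用 LoRA 只微调少量参数:冻结预训练权重 W,仅训练低秩增量 B·A(按 alpha/r 缩放,B 初始化为零以保证起点与原模型等价)。以下为通用 LoRA 线性层的 numpy 草图;r、alpha 等取值为笔者假设,与 UniAnimate-DiT 在 Wan2.1 上的实际配置无关:

```python
import numpy as np

class LoRALinear:
    """LoRA 线性层示意:y = x W^T + (alpha/r) * x A^T B^T,
    其中 W 冻结,仅 A、B 参与训练。"""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # 冻结的预训练权重
        d_out, d_in = W.shape
        self.A = rng.normal(scale=0.01, size=(r, d_in))
        self.B = np.zeros((d_out, r))                 # B 置零 -> 初始输出不变
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 8))
layer = LoRALinear(W)
x = rng.normal(size=(2, 8))
y0 = layer(x)                        # B=0 时与原线性层输出完全一致
layer.B = rng.normal(size=(6, 4))    # 模拟训练后的低秩增量
y1 = layer(x)
```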
[CV-257] HMS-VesselNet: Hierarchical Multi-Scale Attention Network with Topology-Preserving Loss for Retinal Vessel Segmentation
【速读】:该论文旨在解决基于标准重叠损失(standard overlap losses)的视网膜血管分割方法在识别细小周边血管时性能不佳的问题,这类结构因像素占比低且对比度弱而常被忽略。解决方案的关键在于提出一种分层多尺度网络(HMS-VesselNet),通过四个并行分支在不同分辨率下处理眼底图像,并利用学习得到的融合权重整合输出;同时,在训练损失中引入Dice损失、二元交叉熵损失和中心线Dice损失,以联合优化血管区域重叠度与连续性,并从第20个epoch起采用困难样本挖掘(hard example mining)策略,聚焦于最具挑战性的训练图像进行梯度更新,从而显著提升对关键但易漏检的细小周边血管的召回率。
链接: https://arxiv.org/abs/2603.21891
作者: Amarnath R
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 14 figures, 8 tables
Abstract:Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset. The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.
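The combined loss described in the abstract can be sketched as follows. This is an illustrative NumPy version with made-up weights; the actual HMS-VesselNet loss additionally includes a centerline Dice (clDice) term computed on soft-skeletonized masks, which is omitted here.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient on soft predictions; 0 at perfect overlap."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy with clipping for numerical safety."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def combined_loss(pred, target, w_dice=1.0, w_bce=1.0):
    # clDice would be a third weighted term here, on skeletonized masks.
    return w_dice * soft_dice_loss(pred, target) + w_bce * bce_loss(pred, target)

target = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]], dtype=float)
perfect = target.copy()   # exact match
poor = 1.0 - target       # everything inverted

assert combined_loss(perfect, target) < combined_loss(poor, target)
```

Pairing Dice (region overlap) with BCE (per-pixel calibration) is a common recipe; the clDice term is what specifically rewards preserving thin-vessel connectivity.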
[CV-258] Cycle Inverse-Consistent TransMorph: A Balanced Deep Learning Framework for Brain MRI Registration
【速读】:该论文旨在解决当前基于深度学习的可变形图像配准方法在捕捉长距离解剖对应关系和保持形变一致性方面存在的局限性。其解决方案的关键在于提出一种基于循环逆一致性的Transformer框架(CICTM),该框架结合Swin-UNet结构与双向一致性约束,实现前向和后向形变场的联合估计,从而在保留局部解剖细节的同时增强全局空间关系建模能力,并提升形变场的稳定性与物理合理性。
链接: https://arxiv.org/abs/2603.21760
作者: Jiaqi Shang,Haojin Wu,Yinyi Lai,Zongyu Li,Chenghao Zhang,Jia Guo
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that the proposed framework achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields. Detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix. Experimental results demonstrate that CICTM achieves consistently strong performance across multiple evaluation criteria while maintaining stable and physically plausible deformation fields. These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.
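The cycle inverse-consistency idea can be illustrated with a 1-D toy example (ours, not the CICTM implementation): composing the estimated forward and backward deformations should return each point to where it started, and the residual of that round trip is what a bidirectional consistency loss penalizes.

```python
import numpy as np

phi_fwd = lambda x: x + 0.1 * np.sin(x)   # toy forward deformation
phi_bwd = lambda x: x - 0.1 * np.sin(x)   # approximate inverse deformation

def inverse_consistency_loss(fwd, bwd, points):
    """Mean squared error of the round trip bwd(fwd(x)) versus x."""
    return float(np.mean((bwd(fwd(points)) - points) ** 2))

pts = np.linspace(0.0, np.pi, 50)
loss = inverse_consistency_loss(phi_fwd, phi_bwd, pts)
# Using the forward map twice is not an inverse, so the penalty is far larger:
identity_loss = inverse_consistency_loss(phi_fwd, phi_fwd, pts)
assert loss < identity_loss
```

In a registration network both fields are predicted jointly and this residual is added to the similarity loss, discouraging forward and backward fields from drifting apart.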
[CV-259] Unregistered Spectral Image Fusion: Unmixing Adversarial Learning and Recoverability
【速读】:该论文旨在解决空间未配准(spatially unregistered)高光谱图像(HSI)与多光谱图像(MSI)融合问题,其核心挑战在于如何在缺乏精确空间对齐的情况下,同时提升HSI的空间分辨率和MSI的光谱分辨率。解决方案的关键在于提出一种无监督框架,通过耦合光谱解混(coupled spectral unmixing)实现MSI超分辨率,结合潜在空间对抗学习(latent-space adversarial learning)实现HSI超分辨率,并在合理生成模型下建立了对超分辨率结果可恢复性的理论保证——这是目前文献中首次针对未注册HMF问题提供的理论分析与实践方法。
链接: https://arxiv.org/abs/2603.21510
作者: Jiahui Song,Sagar Shrestha,Xiao Fu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models – providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.
[CV-260] Domain Elastic Transform: Bayesian Function Registration for High-Dimensional Scientific Data
【速读】:该论文旨在解决当前非刚性配准方法在处理空间转录组学等新兴科学数据时存在的瓶颈问题:传统点集配准(point set registration)与图像配准(image registration)的二分法导致研究人员不得不在牺牲单细胞分辨率(通过体素化使用图像工具)或忽略关键功能信号(如基因表达)之间做出取舍。解决方案的关键在于提出一种无网格的概率框架——域弹性变换(Domain Elastic Transform, DET),其核心思想是将数据视为定义在不规则域上的函数,从而直接对高维向量值信号(如基因表达)进行配准,无需离散化(binning)。DET在严格的贝叶斯框架下建模域形变,通过联合空间-功能似然引导弹性运动,并采用特征敏感的降采样策略实现大规模解剖图谱的可扩展无监督配准,在MERFISH数据上实现了92%的拓扑保真度,显著优于现有最优传输方法(<5%),并成功完成了跨发育阶段的全胚胎Stereo-seq图谱注册。
链接: https://arxiv.org/abs/2603.21235
作者: Osamu Hirose,Emanuele Rodola
机构: Kanazawa University (金泽大学); Sapienza University of Rome (罗马大学)
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Nonrigid registration is conventionally divided into point set registration, which aligns sparse geometries, and image registration, which aligns continuous intensity fields on regular grids. However, this dichotomy creates a critical bottleneck for emerging scientific data, such as spatial transcriptomics, where high-dimensional vector-valued functions, e.g., gene expression, are defined on irregular, sparse manifolds. Consequently, researchers currently face a forced choice: either sacrifice single-cell resolution via voxelization to utilize image-based tools, or ignore the critical functional signal to utilize geometric tools. To resolve this dilemma, we propose Domain Elastic Transform (DET), a grid-free probabilistic framework that unifies geometric and functional alignment. By treating data as functions on irregular domains, DET registers high-dimensional signals directly without binning. We formulate the problem within a rigorous Bayesian framework, modeling domain deformation as an elastic motion guided by a joint spatial-functional likelihood. The method is fully unsupervised and scalable, utilizing feature-sensitive downsampling to handle massive atlases. We demonstrate that DET achieves 92% topological preservation on MERFISH data where state-of-the-art optimal transport methods struggle (<5%), and successfully registers whole-embryo Stereo-seq atlases across developmental stages – a task involving massive scale and complex nonrigid growth. The implementation of DET is available on this https URL (since Mar, 2025).
[CV-261] MiSiSUn: Minimum Simplex Semisupervised Unmixing
【速读】:该论文旨在解决遥感图像中基于光谱库的半监督解混(semisupervised unmixing)问题,尤其在混合比例多样、空间结构复杂及输入噪声干扰下的性能瓶颈。其解决方案的关键在于首次将数据几何特性引入基于光谱库的解混框架,通过构建一种基于典型分析(archetypal analysis)线性模型的单纯形体积惩罚项(simplex-volume-flavored penalty),有效约束解混结果的物理可实现性与几何合理性。该方法显著优于现有先进半监督解混技术,在不同实验场景下信噪比提升达1–3 dB,且在真实数据集上与地质图具有一致性,验证了其实际应用价值。
链接: https://arxiv.org/abs/2603.20263
作者: Behnood Rasti,Bikram Koirala,Paul Scheunders
机构: Technische Universität Berlin (柏林工业大学); Berlin Institute for the Foundations of Learning and Data (柏林学习与数据基础研究所); Imec-VisionLab, Department of Physics, University of Antwerp (比利时安特卫普大学物理系 imec-VisionLab)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This paper proposes a semisupervised geometric unmixing approach called minimum simplex semisupervised unmixing (MiSiSUn). The geometry of the data was incorporated for the first time into library-based unmixing using a simplex-volume-flavored penalty based on an archetypal analysis-type linear model. The experimental results were performed on two simulated datasets considering different levels of mixing ratios and spatial structure at varying input noise. MiSiSUn considerably outperforms state-of-the-art semisupervised unmixing methods. The improvements vary from 1 dB to over 3 dB in different scenarios. The proposed method was also applied to a real dataset where visual interpretation is close to the geological map. MiSiSUn was implemented using PyTorch, which is open-source and available at this https URL. Moreover, we provide a dedicated Python package for Semisupervised Unmixing, which is open-source and includes all the methods used in the experiments for the sake of reproducibility.
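A generic minimum-volume simplex penalty of the kind the abstract alludes to can be sketched as below. This is an assumption-laden illustration, not the exact MiSiSUn formulation: the log-volume of the simplex spanned by the endmember columns decreases as the endmembers move closer together, so minimizing it tightens the simplex around the data.

```python
import numpy as np

def simplex_volume_penalty(E):
    """Log-determinant of the Gram matrix of mean-centered endmember columns.

    E has shape (bands, n_endmembers); a small ridge keeps slogdet finite
    when the Gram matrix is rank-deficient.
    """
    C = E - E.mean(axis=1, keepdims=True)
    G = C.T @ C + 1e-9 * np.eye(E.shape[1])
    _, logdet = np.linalg.slogdet(G)
    return float(logdet)

wide = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])   # well-spread endmembers
tight = 0.1 * wide                   # shrunken simplex

assert simplex_volume_penalty(tight) < simplex_volume_penalty(wide)
```

In a full unmixing objective this term would be added, with a trade-off weight, to the data-fit loss so that the fitted simplex stays as small as the data allow.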
人工智能
[AI-0] Confidence-Based Decoding is Provably Efficient for Diffusion Language Models
【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)中因生成顺序灵活性带来的解码策略设计难题,即如何在不牺牲采样精度的前提下提升采样效率。传统自回归(Autoregressive, AR)模型的解码过程具有固定顺序,而DLMs允许并行生成多个token,这使得解码策略——决定每轮迭代中去掩码(unmask)的token数量和顺序——成为影响效率的关键因素。论文提出了一种基于熵累加和与阈值的置信度解码策略(entropy sum-based strategy),其核心思想是在每轮迭代中持续对token去掩码,直到累积熵超过预设阈值为止。理论分析表明,该策略可在KL散度误差为ε时达到期望迭代次数O(H(X0)/ε),其中H(X0)为目标数据分布的熵;当数据分布熵较低时,该方法能显著加速采样,并自动适应数据内在复杂性,无需预先设定超参数或先验知识。这一结果首次建立了对置信度解码的理论框架,为高效DLM解码策略的设计提供了理论依据。
链接: https://arxiv.org/abs/2603.22248
作者: Changxiao Cai,Gen Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注:
Abstract:Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the decoding strategy – which determines the order and number of tokens generated at each iteration – critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves \varepsilon-accurate sampling in KL divergence with an expected number of iterations \widetilde O(H(X_0)/\varepsilon), where H(X_0) denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.
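The entropy-sum rule analyzed in the paper can be sketched in a few lines. One plausible instantiation (the distributions and threshold below are made up for illustration): within an iteration, unmask masked positions in order of increasing entropy until the cumulative entropy of the chosen positions exceeds a threshold tau.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_positions(dists, tau):
    """Pick masked positions to unmask this iteration, most confident first,
    stopping once the running entropy sum exceeds tau."""
    order = sorted(range(len(dists)), key=lambda i: entropy(dists[i]))
    chosen, h_sum = [], 0.0
    for i in order:
        chosen.append(i)
        h_sum += entropy(dists[i])
        if h_sum > tau:
            break
    return chosen

dists = [
    np.array([0.98, 0.01, 0.01]),   # near-deterministic: tiny entropy
    np.array([0.90, 0.05, 0.05]),   # fairly confident
    np.array([1/3, 1/3, 1/3]),      # maximally uncertain: left masked
]
picked = select_positions(dists, tau=0.5)
assert 0 in picked and 2 not in picked
```

Because low-entropy sequences let many confident positions fit under the budget per iteration, the expected iteration count scales with the data entropy rather than the sequence length, matching the O(H(X_0)/epsilon) bound in the abstract.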
[AI-1] Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
【速读】:该论文旨在解决如何可靠地自动化评估生成式 AI(Generative AI)模型的质量与安全性问题,特别是针对受害型大语言模型(LLM)的自由文本输出进行系统性测评。传统人工评审存在效率低、一致性差的问题,难以覆盖多样化的使用场景。解决方案的关键在于利用另一大语言模型作为“裁判”(LLM as judge),通过设计特定的判别提示词(judge prompt)来量化评估指标,并结合多种模型规模和微调策略验证其有效性。实证结果表明,当采用合适的提示工程时,GPT-4o、32B及以上参数的开源模型以及部分较小模型如Qwen2.5 14B,在多个任务类别上均能与人类评估高度一致,证明了该方法在自动化质量评估中的可行性与可靠性。
链接: https://arxiv.org/abs/2603.22214
作者: Tom Biskupski,Stephan Kleber
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models’ free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models’ use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with \geqslant 32B parameters, and a few smaller models like Qwen2.5 14B.
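The core reliability question, how well judge scores track human labels, reduces to a simple agreement statistic. A minimal sketch with fabricated scores (not the paper's data), using Pearson correlation:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two score vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

human = [1, 2, 2, 4, 5, 5]        # human ground-truth ratings
judge_good = [1, 2, 3, 4, 4, 5]   # a judge that mostly agrees
judge_bad = [5, 4, 1, 2, 3, 1]    # a judge that mostly disagrees

assert pearson_r(human, judge_good) > 0.85
assert pearson_r(human, judge_bad) < 0.0
```

In practice one would compute such agreement per judgment-task category and per (model, judge-prompt) pair, which is essentially the grid the paper evaluates at scale.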
[AI-2] MARCUS: An agentic multimodal vision-language model for cardiac diagnosis and management
【速读】:该论文旨在解决心血管疾病(Cardiovascular Disease, CVD)诊断中因复杂心脏检查结果依赖人工解读而导致的效率与准确性瓶颈问题。当前生成式AI模型多局限于单模态输入且缺乏交互能力,难以实现多模态心脏影像(如心电图ECG、超声心动图Echocardiogram、心脏磁共振CMR)的端到端智能分析。解决方案的关键在于提出MARCUS系统——一个基于分层代理架构(hierarchical agentic architecture)的多模态视觉-语言模型:其由各模态专用的视觉-语言专家模型组成,每个模型融合领域训练的视觉编码器与多阶段语言模型优化,并由一个统一的多模态协调器(multimodal orchestrator)进行调度;该设计不仅实现了对单一模态和多模态输入的高精度解释(准确率达87–91% ECG、67–86% 超声、85–88% CMR),还显著优于前沿模型(提升34–45%,P<0.001),并在多模态场景下达到70%准确率(远超22–28%),同时具备抵抗“幻觉推理”(mirage reasoning)的能力,从而为临床提供可靠、可解释的自动化心脏诊断支持。
链接: https://arxiv.org/abs/2603.22179
作者: Jack W O’Sullivan,Mohammad Asadi,Lennart Elbe,Akshay Chaudhari,Tahoura Nedaee,Francois Haddad,Michael Salerno,Fei-Fei Li,Ehsan Adeli,Rima Arnaout,Euan A Ashley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
[AI-3] Calibeating Made Simple
【速读】:该论文致力于解决校准增强(calibeating)问题,即在线后处理外部预测以最小化累积损失并达到基于信息量的基准。其核心挑战在于设计一种通用框架,使算法在面对任意适当损失函数时仍能实现最优的校准性能。解决方案的关键在于将校准增强问题归约到已有的在线学习技术:首先证明校准增强与遗憾最小化(regret minimization)在最坏情况下的等价性,从而统一了对 Brier 损失、对数损失及混合损失等不同场景的分析;其次通过建立多校准增强与经典专家问题的组合关系,推导出适用于多种损失函数的新最优率;最终实现了在二分类场景下同时满足校准性和最优O(log T)校准增强率的首个算法,显著提升了理论完备性与实用性。
链接: https://arxiv.org/abs/2603.22167
作者: Yurong Chen,Zhiyi Huang,Michael I. Jordan,Haipeng Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
备注:
Abstract:We study calibeating, the problem of post-processing external forecasts online to minimize cumulative losses and match an informativeness-based benchmark. Unlike prior work, which analyzed calibeating for specific losses with specific arguments, we reduce calibeating to existing online learning techniques and obtain results for general proper losses. More concretely, we first show that calibeating is minimax-equivalent to regret minimization. This recovers the O(\log T) calibeating rate of Foster and Hart [FH23] for the Brier and log losses and its optimality, and yields new optimal calibeating rates for mixable losses and general bounded losses. Second, we prove that multi-calibeating is minimax-equivalent to the combination of calibeating and the classical expert problem. This yields new optimal multi-calibeating rates for mixable losses, including Brier and log losses, and general bounded losses. Finally, we obtain new bounds for achieving calibeating and calibration simultaneously for the Brier loss. For binary predictions, our result gives the first calibrated algorithm that at the same time also achieves the optimal O(\log T) calibeating rate.
[AI-4] Multimodal Survival Analysis with Locally Deployable Large Language Models NEURIPS2025
【速读】:该论文旨在解决多模态生存分析(multimodal survival analysis)中如何在计算资源受限且隐私要求严格的医疗环境中,有效整合临床文本、表格型协变量和基因组数据的问题。其关键解决方案在于采用本地部署的大型语言模型(Large Language Models, LLMs),通过教师-学生蒸馏(teacher-student distillation)与合理的多模态融合策略,联合估计校准后的生存概率并生成基于证据的预后文本,从而在不依赖云端服务的前提下提升模型可靠性与可解释性,避免基础LLMs常见的幻觉或校准偏差问题。
链接: https://arxiv.org/abs/2603.22158
作者: Moritz Gögl,Christopher Yau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences
Abstract:We study multimodal survival analysis integrating clinical text, tabular covariates, and genomic profiles using locally deployable large language models (LLMs). As many institutions face tight computational and privacy constraints, this setting motivates the use of lightweight, on-premises models. Our approach jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text via teacher-student distillation and principled multimodal fusion. On a TCGA cohort, it outperforms standard baselines, avoids reliance on cloud services and associated privacy concerns, and reduces the risk of hallucinated or miscalibrated estimates that can be observed in base LLMs.
[AI-5] On the Direction of RLVR Updates for LLM Reasoning : Identification and Exploitation
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)机制对大语言模型推理能力提升的内在机制不清晰的问题,尤其是现有研究仅关注更新幅度而忽视更新方向的局限性。解决方案的关键在于提出以token级对数概率差值 Δlog p(即基线模型与最终RLVR模型之间带符号的差异,signed difference)作为衡量指标,该指标能更有效地识别出稀疏但对推理至关重要的参数更新方向。基于此洞察,作者设计了两种实用方法:一是在测试阶段沿Δlog p方向外推策略以提升推理准确性而无需额外训练;二是在训练阶段通过重加权低概率token(对应高Δlog p)来优化学习过程,从而在多个模型和基准上增强推理性能。
链接: https://arxiv.org/abs/2603.22117
作者: Kexin Huang,Haoming Meng,Junkang Wu,Jinda Lu,Chiyu Ma,Ziqian Chen,Xue Wang,Bolin Ding,Jiancan Wu,Xiang Wang,Xiangnan He,Guoyin Wang,Jingren Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR’s effects, which can be captured by the signed, token-level log probability difference \Delta\log p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that \Delta\log p more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned \Delta\log p direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher \Delta\log p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
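Both quantities from the abstract can be illustrated directly. In this toy sketch the distributions, the extrapolation coefficient, and the sign convention for \Delta\log p (RLVR minus base) are our assumptions:

```python
import numpy as np

def extrapolate(p_base, p_rlvr, alpha):
    """Renormalized exp(log p_base + alpha * (log p_rlvr - log p_base)).

    alpha = 1 recovers the RLVR policy; alpha > 1 pushes further along
    the update direction at test time.
    """
    logits = np.log(p_base) + alpha * (np.log(p_rlvr) - np.log(p_base))
    p = np.exp(logits)
    return p / p.sum()

p_base = np.array([0.5, 0.3, 0.2])   # base model's next-token distribution
p_rlvr = np.array([0.7, 0.2, 0.1])   # RLVR model boosted token 0

delta_log_p = np.log(p_rlvr) - np.log(p_base)
assert delta_log_p[0] > 0            # the boosted token has positive delta

p_ext = extrapolate(p_base, p_rlvr, alpha=2.0)
assert p_ext[0] > p_rlvr[0]          # extrapolation amplifies the shift
```

The training-time counterpart would weight the loss of each token by a function of this same per-token delta, concentrating gradients on the low-probability tokens RLVR moved most.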
[AI-6] SpecTM: Spectral Targeted Masking for Trustworthy Foundation Models
【速读】:该论文旨在解决当前地球观测(Earth Observation, EO)领域基础模型在预训练过程中依赖随机掩码(stochastic masking)而未显式引入物理约束的问题,这限制了模型在公共卫生决策等关键场景下的可信度。其解决方案的关键在于提出一种物理信息引导的掩码设计——光谱目标掩码(Spectral Targeted Masking, SpecTM),该方法通过鼓励从跨光谱上下文中重建特定波段来增强模型对物理规律的感知能力;进一步构建了一个多任务自监督学习框架,联合优化波段重建、生物光学指数推断和8天后时间预测任务,从而编码出具有光谱内在特性的表示。实验表明,SpecTM在伊利湖(Lake Erie)微囊藻毒素浓度回归任务中显著优于基线模型,并在标签极度稀缺时展现出2.2倍的标签效率优势。
链接: https://arxiv.org/abs/2603.22097
作者: Syed Usama Imtiaz,Mitra Nasr Azadani,Nasrin Alamdari
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to IEEE IGARSS 2026
Abstract:Foundation models are now increasingly being developed for Earth observation (EO), yet they often rely on stochastic masking that do not explicitly enforce physics constraints; a critical trustworthiness limitation, in particular for predictive models that guide public health decisions. In this work, we propose SpecTM (Spectral Targeted Masking), a physics-informed masking design that encourages the reconstruction of targeted bands from cross-spectral context during pretraining. To achieve this, we developed an adaptable multi-task (band reconstruction, bio-optical index inference, and 8-day-ahead temporal prediction) self-supervised learning (SSL) framework that encodes spectrally intrinsic representations via joint optimization, and evaluated it on a downstream microcystin concentration regression model using NASA PACE hyperspectral imagery over Lake Erie. SpecTM achieves R^2 = 0.695 (current week) and R^2 = 0.620 (8-day-ahead) predictions surpassing all baseline models by (+34% (0.51 Ridge) and +99% (SVR 0.31)) respectively. Our ablation experiments show targeted masking improves predictions by +0.037 R^2 over random masking. Furthermore, it outperforms strong baselines with 2.2x superior label efficiency under extreme scarcity. SpecTM enables physics-informed representation learning across EO domains and improves the interpretability of foundation models.
[AI-7] GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning
【速读】:该论文旨在解决临床决策代理在利用历史决策经验时面临的检索噪声、复用不可靠以及性能下降的问题。现有基于记忆增强的方法通常将经验存储为独立记录,缺乏显式的结构关系,导致检索不准确且难以有效利用过往经验。其解决方案的关键在于提出一种图结构自演化记忆框架(GSEM),通过构建双层记忆图来组织临床经验:第一层捕捉单个决策内部的结构信息,第二层建模跨经验之间的关系依赖;同时支持基于适用性的检索机制与在线反馈驱动的节点质量及边权重校准,从而提升记忆的可信赖性与实用性。
链接: https://arxiv.org/abs/2603.22096
作者: Xiao Han,Yuzheng Fan,Sendong Zhao,Haochun Wang,Bing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical decision-making agents can benefit from reusing prior decision experience. However, many memory-augmented methods store experiences as independent records without explicit relational structure, which may introduce noisy retrieval, unreliable reuse, and in some cases even hurt performance compared to direct LLM inference. We propose GSEM (Graph-based Self-Evolving Memory), a clinical memory framework that organizes clinical experiences into a dual-layer memory graph, capturing both the decision structure within each experience and the relational dependencies across experiences, and supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90% and 69.24% with DeepSeek-V3.2 and Qwen3.5-35B, respectively. Code is available at this https URL.
[AI-8] A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP
【速读】:该论文旨在解决企业级大语言模型(Large Language Model, LLM)代理在实际部署中面临的挑战,包括数据质量与数量有限、复杂现实推理需求、自对弈训练困难以及缺乏可靠反馈信号等问题。其解决方案的核心在于提出一种轻量级、模型无关的离线强化学习(Offline Reinforcement Learning, RL)框架——基于数字孪生马尔可夫决策过程的上下文工程(DT-MDP-CE),该框架包含三个关键组件:(1) 将代理推理行为抽象为有限马尔可夫决策过程(Digital-Twin Markov Decision Process, DT-MDP);(2) 基于DT-MDP的鲁棒对比逆强化学习,用于从混合质量的离线轨迹中高效估计合理奖励函数并诱导策略;(3) 利用整合前两步所得策略进行强化学习引导的上下文工程,从而提升代理决策能力。实验证明该方法在IT自动化任务中显著优于基线模型,具备良好的泛化潜力。
链接: https://arxiv.org/abs/2603.22083
作者: Xi Yang,Aurelie Lozano,Naoki Abe,Bhavya,Saurabh Jha,Noah Zheutlin,Rohan R. Arora,Yu Deng,Daby M. Sow
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) A Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent’s reasoning behavior as a finite MDP; (2) A robust contrastive inverse RL, which, armed with the DT-MDP, to efficiently estimate a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2), to improve the agent’s decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents sharing similar characteristics in enterprise environments.
[AI-9] On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration
【速读】:该论文旨在解决方向性消融(directional abliteration)中对比基线(contrast baseline)设计对拒绝行为消除效果的影响问题,即探究主题匹配的对比基线是否能产生更有效的拒绝方向。其关键解决方案在于通过构建与有害提示主题一致的对比基线,并结合类别内提示对、自组织映射(Self-Organizing Map, SOM)提取和奇异值分解(Singular Value Decomposition, SVD)正交化方法,在Qwen-3.5 2B模型上进行系统评估。结果表明,主题匹配的对比基线在任何权重水平和层上均无法生成功能性拒绝方向,而未匹配的对比基线却能在六层实现完全拒绝消除,根本原因在于主题匹配减法会抵消有害与无害提示间共享的主导激活成分,使提取方向幅度低于权重矩阵投影可扰动残差流的阈值。
链接: https://arxiv.org/abs/2603.22061
作者: Valentin Petrov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Inasmuch as the removal of refusal behavior from instruction-tuned language models by directional abliteration requires the extraction of refusal-mediating directions from the residual stream activation space, and inasmuch as the construction of the contrast baseline against which harmful prompt activations are compared has been treated in the existing literature as an implementation detail rather than a methodological concern, the present work investigates whether a topically matched contrast baseline yields superior refusal directions. The investigation is carried out on the Qwen 3.5 2B model using per-category matched prompt pairs, per-class Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. It was found that topic-matched contrast produces no functional refusal directions at any tested weight level on any tested layer, while unmatched contrast on the same model, same extraction code, and same evaluation protocol achieves complete refusal elimination on six layers. The geometric analysis of the failure establishes that topic-matched subtraction cancels the dominant activation component shared between harmful and harmless prompts of the same subject, reducing the extracted direction magnitude below the threshold at which weight-matrix projection perturbs the residual stream. The implications for the design of contrast baselines in abliteration research are discussed.
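For reference, the unmatched difference-of-means construction that the paper reports as succeeding can be sketched generically (this is not the paper's SOM/SVD pipeline; data and shapes are synthetic): extract a direction separating harmful from harmless activations, then remove the weight matrix's action along it.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Synthetic residual-stream activations: "harmful" prompts are shifted
# along basis direction e0 relative to "harmless" ones.
harmful = rng.standard_normal((100, d)) + np.eye(d)[0] * 3.0
harmless = rng.standard_normal((100, d))

# Unmatched contrast: difference of mean activations, normalized.
v = harmful.mean(axis=0) - harmless.mean(axis=0)
v = v / np.linalg.norm(v)

# Abliteration: subtract the rank-one component so W no longer acts along v.
W = rng.standard_normal((d, d))
W_abl = W - np.outer(W @ v, v)

assert np.linalg.norm(W_abl @ v) < 1e-6   # W's action along v is zeroed
```

The paper's finding is that replacing this unmatched baseline with topic-matched pairs collapses the magnitude of v below the level at which this projection measurably perturbs the residual stream.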
[AI-10] Future-Interactions-Aware Trajectory Prediction via Braid Theory
【速读】:该论文旨在解决自动驾驶中多智能体轨迹预测(multi-agent trajectory prediction)问题,即如何准确预测周围多个交互智能体的未来行为以确保安全行驶。传统方法要么计算复杂度高,要么依赖启发式规则对多智能体行为类型进行标注,难以有效建模社会交互。其解决方案的关键在于引入辫理论(braid theory)作为轨迹的精确表示方式,将未来轨迹投影为辫结构(braid),从而显式刻画智能体间轨迹交叉关系及其协调模式;进一步提出一个新颖的辅助任务——辫预测(braid prediction),与主轨迹预测任务并行训练,通过分类智能体间边的交叉类型来增强模型的社会感知能力,从而显著提升联合预测性能,且在训练和推理阶段均无显著额外复杂度。
链接: https://arxiv.org/abs/2603.22035
作者: Caio Azevedo,Stefano Sabatini,Sascha Hornauer,Fabien Moutarde
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To be published in IEEE Intelligent Vehicles Symposium (IV) 2026
Abstract:To safely operate, an autonomous vehicle must know the future behavior of a potentially high number of interacting agents around it, a task often posed as multi-agent trajectory prediction. Many previous attempts to model social interactions and solve the joint prediction task either add extensive computational requirements or rely on heuristics to label multi-agent behavior types. Braid theory, in contrast, provides a powerful exact descriptor of multi-agent behavior by projecting future trajectories into braids that express how trajectories cross with each other over time; a braid then corresponds to a specific mode of coordination between the multiple agents in the future. In past work, braids have been used lightly to reason about interacting agents and restrict the attention window of predicted agents. We show that leveraging more fully the expressivity of the braid representation and using it to condition the trajectories themselves leads to even further gains in joint prediction performance, with negligible added complexity either in training or at inference time. We do so by proposing a novel auxiliary task, braid prediction, done in parallel with the trajectory prediction task. By classifying edges between agents into their correct crossing types in the braid representation, the braid prediction task is able to imbue the model with improved social awareness, which is reflected in joint predictions that more closely adhere to the actual multi-agent behavior. This simple auxiliary task allowed us to obtain significant improvements in joint metrics on three separate datasets. We show how the braid prediction task infuses the model with future intention awareness, leading to more accurate joint predictions. Code is available at this http URL.
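The crossing information that a braid encodes can be illustrated with a 1-D toy example (ours; the paper builds full braid-group representations from 2-D multi-agent futures): two trajectories cross whenever their lateral ordering swaps between consecutive time steps, and a braid records the pattern of such swaps.

```python
def crossings(traj_a, traj_b):
    """Count sign changes of the lateral ordering between two trajectories."""
    count = 0
    for t in range(1, len(traj_a)):
        before = traj_a[t - 1] - traj_b[t - 1]
        after = traj_a[t] - traj_b[t]
        if before * after < 0:   # the two agents swapped sides
            count += 1
    return count

a = [0.0, 1.0, 2.0, 3.0]   # agent A moves right
b = [3.0, 2.0, 1.0, 0.0]   # agent B moves left: one crossing with A
c = [5.0, 5.0, 5.0, 5.0]   # agent C stays clear: no crossing

assert crossings(a, b) == 1
assert crossings(a, c) == 0
```

The auxiliary braid-prediction task amounts to classifying, for each pair of agents, which crossing type (if any) their future trajectories will realize.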
[AI-11] λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks
【速读】:该论文旨在解决深度网络中广泛使用的高斯误差线性单元(GELU)与以分段线性(ReLU 类型)网络为基础的部署、压缩和分析工具链之间的不兼容问题。核心挑战在于如何在保持平滑训练优势的同时,逐步将 GELU 转化为与 ReLU 兼容的模型结构,从而无缝集成到现有以 ReLU 为中心的下游流水线中。解决方案的关键是提出一种参数化形式 $ f(x;\lambda) = x\Phi(\lambda x) $,其中 λ∈[1,∞) 控制门控函数的锐度(hardness),并通过约束重参数化和优化器感知更新策略来稳定学习过程。该方法实现了层级化的硬度控制,并支持一种确定性的 ReLU 化策略,在无需重新训练的情况下通过渐进硬化将 λ-GELU 替换为 ReLU,显著降低模型迁移扰动,提供了一个可解释且最小化的调控机制。
链接: https://arxiv.org/abs/2603.21991
作者: Cristian Pérez-Corral,Alberto Fernández-Hernández,Jose I. Mestre,Manuel F. Dolz,Enrique S. Quintana-Ortí
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to Rectifier Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;\lambda)=x\Phi(\lambda x), where \Phi is the Gaussian CDF and \lambda \in [1, \infty) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning \lambda is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model–dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of \lambda-GELU by ReLU with reduced disruption. Overall, \lambda-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
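为直观感受该参数化的行为,下面用 Python 给出 λ-GELU 激活函数 f(x;λ)=xΦ(λx) 的最小示意实现(仅为说明公式,并非论文官方代码):λ=1 时退化为标准 GELU,λ 很大时逼近 ReLU。

```python
import math

def phi(x):
    """标准正态分布的累积分布函数 Phi(x)。"""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lambda_gelu(x, lam=1.0):
    """f(x; lambda) = x * Phi(lambda * x)。
    lam=1 即标准 GELU;lam 越大,门控越"硬",逼近 ReLU = max(x, 0)。"""
    return x * phi(lam * x)

print(lambda_gelu(1.0, 1.0))     # GELU(1) ≈ 0.8413
print(lambda_gelu(1.0, 100.0))   # ≈ 1.0(已接近 ReLU)
print(lambda_gelu(-1.0, 100.0))  # ≈ 0.0
```

论文的要点在于如何稳定地"学习"每层的 λ 并最终把门控硬化到 ReLU,上面只演示了 λ 两端的极限行为。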
[AI-12] TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
【速读】:该论文旨在解决多目标强化学习(Multi-Objective Reinforcement Learning, MORL)中决策过程缺乏可解释性的问题,尤其是在面对多个潜在冲突的目标时,现有可解释强化学习(Explainable Reinforcement Learning, XRL)方法难以提供与具体目标或用户偏好相关的解释。解决方案的关键在于提出TREX框架——一种基于轨迹归因的可解释性方法,通过从专家策略生成轨迹并根据用户偏好聚类为语义明确的时间段,进而训练排除特定行为片段的互补策略,量化这些行为段对帕累托前沿(Pareto trade-off)的影响,从而实现对MORL策略中不同目标间权衡机制的隔离与定量分析。
链接: https://arxiv.org/abs/2603.21988
作者: Dilina Rajapakse,Juan C. Rosero,Ivana Dusparic
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by 4th World Conference on eXplainable Artificial Intelligence
Abstract:Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains, by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them. However, the ``black box" nature of the RL models makes the decision process behind chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences. To address this gap, in this paper we propose TREX, a Trajectory based Explainability framework to explain Multi-objective Reinforcement Learning policies, based on trajectory attribution. TREX generates trajectories directly from the learned expert policy, across different user preferences and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters, measuring the resulting relative deviation on the observed rewards and actions compared to the original expert policy. Experiments on multi-objective MuJoCo environments - HalfCheetah, Ant and Swimmer, demonstrate the framework’s ability to isolate and quantify the specific behavioural patterns. 
[AI-13] Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support
【速读】:该论文旨在解决眼科临床问答与决策支持中证据溯源不准确、噪声干扰严重及多模态信息利用不足的问题。其核心解决方案是提出Oph-Guid-RAG,一个基于视觉的多模态检索增强生成(Retrieval-Augmented Generation, RAG)系统,关键在于将指南页面视为独立证据单元并直接检索图像以保留表格、流程图和布局结构信息,并设计了可控制的检索框架(包含路由与过滤机制),从而选择性引入外部证据、降低噪声影响;同时集成查询分解、重写、重排序与多模态推理模块,实现可追溯的输出结果。实验表明,该方法在Hard子集上显著提升整体得分与准确性,尤其在需要精确证据推理的复杂案例中表现更优。
链接: https://arxiv.org/abs/2603.21925
作者: Shuying Chen,Sen Cui,Zhong Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while pointing out that further work is needed to be more complete.
[AI-14] Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中时序差分(Temporal Difference, TD)误差两种经典解释——即连续预测之间的差异与自举目标与预测之间的差异——在非线性深度网络架构下可能不再等价的问题。传统上,后者被广泛用作DRL算法的标准批评者损失(critic loss),但本文揭示,在日益复杂的非线性模型中,这两种TD误差定义会生成显著不同的数值结果。其解决方案的关键在于识别并量化这种不一致性,并通过实证分析表明:选择不同的TD误差定义会影响依赖于该误差计算的其他量(如深度微分强化学习中的平均奖励估计),从而指导更合理的损失函数设计以提升算法性能。
链接: https://arxiv.org/abs/2603.21921
作者: Juan Sebastian Rojas,Chi-Guhn Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.
[AI-15] SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting
【速读】:该论文旨在解决数值天气预报(Numerical Weather Prediction, NWP)系统计算成本高、效率低的问题,特别是在短临预报(nowcasting)场景下难以满足实时性需求。其解决方案的关键在于提出SmaAT-QMix-UNet模型,通过两个核心改进实现性能提升与模型轻量化:一是引入向量量化(Vector Quantization, VQ)瓶颈层于编码器-解码器桥接处,增强特征表示的紧凑性和可解释性;二是用混合核深度卷积(Mixed Kernel Depth-wise Convolutions, MixConv)替代部分编码器和解码器模块,优化计算效率并保留空间感知能力。实验表明,该方法在荷兰雷达降水数据集上实现了更小模型规模与更高预测精度的平衡。
链接: https://arxiv.org/abs/2603.21879
作者: Nikolas Stavrou,Siamak Mehrkanoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures
Abstract:Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data-driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder-decoder bridge, and mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model’s size and improve its nowcasting performance. We train and evaluate SmaAT-QMix-UNet on a Dutch radar precipitation dataset (2016-2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT-QMix-UNet. Grad-CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT-QMix-UNet is publicly available on GitHub: this https URL.
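上面提到的向量量化(VQ)瓶颈,其核心操作是把编码器输出的每个特征向量映射到码本中最近的码字。下面是一个 NumPy 最小示意(省略了真实 VQ 层中的 straight-through 梯度技巧与码本更新,非论文实现):

```python
import numpy as np

def vector_quantize(z, codebook):
    """把 z 中每个特征向量替换为码本中最近的码字。

    z: (n, d) 编码器输出;codebook: (k, d) 可学习码字。
    返回 (码字索引, 量化后的特征)。
    """
    # 每个特征向量与每个码字的平方距离,形状 (n, k)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.2], [0.9, 1.2]])
idx, zq = vector_quantize(z, codebook)
print(idx)  # -> [0 1]
```

摘要中提到的 UMAP 可视化,正是对这些被量化到同一码字的编码器输出做聚类分析。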
[AI-16] P2O: Joint Policy and Prompt Optimization
【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的训练效率问题,尤其是在面对“难样本”(hard samples)时因稀疏奖励导致的优势估计为零、模型缺乏监督信号的问题。解决方案的关键在于提出P²O框架,通过将提示优化(Prompt Optimization)与策略优化(Policy Optimization)相结合:在训练过程中识别难样本,并利用遗传帕累托(Genetic Pareto, GEPA)算法演化提示模板以引导模型发现成功轨迹;更重要的是,P²O将优化提示所带来推理增益直接蒸馏到模型参数中,从而为难样本提供更密集的正向监督信号,加速收敛并提升性能。
链接: https://arxiv.org/abs/2603.21877
作者: Xinyu Lu,Kaiqi Zhang,Jinglin Yang,Boxi Cao,Yaojie Lu,Hongyu Lin,Min He,Xianpei Han,Le Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting “hard samples” that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
[AI-17] Tacit Knowledge Management with Generative AI: Proposal of the GenAI SECI Model
【速读】:该论文旨在解决当前知识管理领域中对显性知识(explicit knowledge)与隐性知识(tacit knowledge)协同建模不足的问题,尤其在生成式 AI(Generative AI)背景下,现有研究多集中于显性知识的管理,而对隐性知识的系统化整合仍显薄弱。其解决方案的关键在于提出“GenAI SECI”模型,该模型是对经典SECI知识创造过程模型的重构,引入了“数字碎片化知识”(Digital Fragmented Knowledge)这一新概念,用以在虚拟空间中融合显性与隐性知识;同时给出了可落地的系统架构,并与既有研究模型进行了对比分析,从而为生成式 AI 驱动的知识管理系统提供了理论框架与实践路径。
链接: https://arxiv.org/abs/2603.21866
作者: Naoshi Uchihira
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is intended to be submitted to AHFE2026
Abstract:The emergence of generative AI is bringing about a significant transformation in knowledge management. Generative AI has the potential to address the limitations of conventional knowledge management systems, and it is increasingly being deployed in real-world settings with promising results. Related research is also expanding rapidly. However, much of this work focuses on research and practice related to the management of explicit knowledge. While fragmentary efforts have been made regarding the management of tacit knowledge using generative AI, the modeling and systematization that handle both tacit and explicit knowledge in an integrated manner remain insufficient. In this paper, we propose the “GenAI SECI” model as an updated version of the knowledge creation process (SECI) model, redesigned to leverage the capabilities of generative AI. A defining feature of the “GenAI SECI” model is the introduction of “Digital Fragmented Knowledge”, a new concept that integrates explicit and tacit knowledge within cyberspace. Furthermore, a concrete system architecture for the proposed model is presented, along with a comparison with prior research models that share a similar problem awareness and objectives.
[AI-18] Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在面对道德困境时所表现出的推理是否真正反映了科尔伯格(Kohlberg)道德发展阶段的内在发展进程,还是仅仅模仿了成熟道德判断的表层语言特征。其解决方案的关键在于构建一个基于多模型验证的“LLM作为裁判”评分流程,对13个不同架构、参数规模和训练策略的LLM在6个经典道德困境中的超过600条回应进行系统分类与分析,并辅以十项补充分析以刻画响应模式的本质与内部一致性。结果发现,LLMs普遍呈现后习俗阶段(Stages 5–6)的道德表述,这与人类发展中以习俗阶段(Stage 4)为主的现象形成鲜明对比,且部分模型表现出道德脱钩现象——即道德理由与其行为选择之间存在系统性不一致,这种逻辑不一致独立于模型规模和提示策略,表明其并非源于修辞能力,而是推理一致性失效。作者据此提出“道德拟声”(moral ventriloquism)概念,认为这是对齐训练使模型习得成熟道德话语形式,但缺乏相应认知发展基础的结果。
链接: https://arxiv.org/abs/2603.21854
作者: Aryan Kasat,Smriti Singh,Aman Chadha,Vinija Jain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 34 figures, 7 tables
Abstract:Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg’s stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
[AI-19] Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection
【速读】:该论文旨在解决现有模拟到现实(sim-to-real)方法在训练控制策略时对现实差距(reality gaps)建模不足的问题,尤其是传统基于固定参数域随机化(domain randomization)的方法难以捕捉复杂、非线性且状态相关的不确定性(如非线性执行器动力学和接触柔顺性)。其解决方案的关键在于引入一种状态依赖的扰动注入机制,在前向仿真过程中将神经网络作为灵活的扰动生成器,对输入关节扭矩施加动态扰动,从而更全面地模拟现实世界中的复杂不确定性,且无需额外训练即可提升策略在未见现实场景中的鲁棒性。
链接: https://arxiv.org/abs/2603.21853
作者: Junhyeok Rui Cha,Woohyun Cha,Jaeyong Shin,Donghyeon Kim,Jaeheung Park
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a broader spectrum of reality gaps than standard parameter randomization without requiring additional training. By using neural networks as flexible perturbation generators, the proposed method can represent complex, state-dependent uncertainties, such as nonlinear actuator dynamics and contact compliance, that parametric randomization cannot capture. Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.
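该方法的核心操作可以概括为 τ' = τ + g(s):在前向仿真时给输入关节扭矩叠加一个由神经网络产生的状态相关扰动。下面用 NumPy 勾勒其基本形态(网络结构、维度与 scale 参数均为假设,并非论文实现):

```python
import numpy as np

def perturbed_torque(tau, state, W1, b1, W2, b2, scale=0.1):
    """对指令扭矩注入状态相关扰动:tau' = tau + scale * g(state)。
    g 是一个小型两层网络;tanh 输出使扰动有界于 (-1, 1)。"""
    h = np.tanh(state @ W1 + b1)      # 隐层
    perturb = np.tanh(h @ W2 + b2)    # 有界扰动
    return tau + scale * perturb

rng = np.random.default_rng(0)
n_state, n_hidden, n_joint = 6, 8, 3
W1 = rng.standard_normal((n_state, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_joint))
b2 = np.zeros(n_joint)
tau = np.zeros(n_joint)
out = perturbed_torque(tau, rng.standard_normal(n_state), W1, b1, W2, b2)
print(np.all(np.abs(out) <= 0.1))  # -> True:扰动幅度受 scale 约束
```

论文的关键在于如何选取这些扰动网络以覆盖比参数随机化更广的现实差距谱,上面仅演示注入机制本身。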
[AI-20] On the Number of Conditional Independence Tests in Constraint-based Causal Discovery
【速读】:该论文旨在解决从观测数据中学习因果关系的问题,这是跨多个领域的重要基础任务。传统约束-based方法(如PC算法)需执行大量条件独立性检验,其最坏情况下的复杂度随图中最大度数呈指数增长,且现有研究尚未明确是否存在更优算法而不引入额外假设。本文的关键解决方案是提出一种新算法,其所需条件独立性检验次数为 $ p^{\mathcal{O}(s)} $,其中 $ p $ 为节点数,$ s $ 表示潜在本质图(essential graph)的最大无向团大小;同时证明任何约束-based算法至少需 $ 2^{\Omega(s)} $ 次检验,从而表明所提算法在检验次数上达到指数最优性(对数因子内)。
链接: https://arxiv.org/abs/2603.21844
作者: Marc Franquesa Monés,Jiaqi Zhang,Caroline Uhler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of p^{\mathcal{O}(s)} tests, where p is the number of nodes in the graph and s denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least 2^{\Omega(s)} conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
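可以用一个粗略的数值例子感受两类上界的量级差距(其中的常数与具体表达式均为示意,并非论文中的精确界):

```python
# PC 类算法最坏情况下约需 p^2 * 2^d 次条件独立性检验(d 为最大度数,系数取 1 仅为示意),
# 而论文算法的检验次数为 p^{O(s)}(s 为本质图的最大无向团大小,这里以 O(s)=s 示意)。
p, d, s = 50, 12, 3

pc_style = p**2 * 2**d      # 对最大度数 d 呈指数增长
new_bound = p**s            # 指数只依赖于(通常小得多的)团大小 s
print(pc_style, new_bound)  # -> 10240000 125000
```

当 d 远大于 s 时(稀疏但高度数的图),这一差距会进一步拉开;下界 2^{Ω(s)} 则说明对 s 的指数依赖本质上不可避免。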
[AI-21] CoRA: Boosting Time Series Foundation Models for Multivariate Forecasting through Correlation-aware Adapter
【速读】:该论文旨在解决现有时间序列基础模型(Time Series Foundation Models, TSFMs)在多变量时间序列预测中忽视或未能充分建模通道间相关性的关键问题,从而限制了预测性能的提升。其解决方案的核心在于提出一种轻量级、可即插即用的相关性感知适配器(CoRrelation-aware Adapter, CoRA),通过创新性地将相关性矩阵分解为低秩时变(Time-Varying)与不变(Time-Invariant)成分,并引入可学习多项式以捕捉动态相关性(如趋势或周期模式),同时设计了一种新型双对比学习机制,借助异质部分对比损失(Heterogeneous-Partial contrastive loss)识别仅存在于特定通道间的正负相关性,且不增加推理阶段的复杂度,从而显著提升TSFMs在多变量时间序列预测中的表现。
链接: https://arxiv.org/abs/2603.21828
作者: Hanyin Cheng,Xingjian Wu,Yang Shu,Zhongwen Rao,Lujia Pan,Bin Yang,Chenjuan Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Most existing Time Series Foundation Models (TSFMs) use channel independent modeling and focus on capturing and generalizing temporal dependencies, while neglecting the correlations among channels or overlooking the different aspects of correlations. However, these correlations play a vital role in Multivariate time series forecasting. To address this, we propose a CoRrelation-aware Adapter (CoRA), a lightweight plug-and-play method that requires only fine-tuning with TSFMs and is able to capture different types of correlations, so as to improve forecast performance. Specifically, to reduce complexity, we innovatively decompose the correlation matrix into low-rank Time-Varying and Time-Invariant components. For the Time-Varying component, we further design learnable polynomials to learn dynamic correlations by capturing trends or periodic patterns. To learn positive and negative correlations that appear only among some channels, we introduce a novel dual contrastive learning method that identifies correlations through projection layers, regulated by a Heterogeneous-Partial contrastive loss during training, without introducing additional complexity in the inference stage. Extensive experiments on 10 real-world datasets demonstrate that CoRA can improve TSFMs in multivariate forecasting performance.
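下面用 NumPy 示意"把相关性矩阵分解为低秩时不变分量与由可学习多项式调制的低秩时变分量"的基本形态(因子形状与组合方式均为假设,并非 CoRA 的官方实现):

```python
import numpy as np

def correlation_matrix(t, U, V, P, Q, coeffs):
    """C(t) = U @ V.T(时不变低秩分量)
             + g(t) * (P @ Q.T)(时变低秩分量),
    其中 g(t) 是系数可学习的多项式。各因子形状为 (通道数, 秩)。"""
    g_t = sum(c * t**i for i, c in enumerate(coeffs))  # 可学习多项式
    return U @ V.T + g_t * (P @ Q.T)

rng = np.random.default_rng(0)
n, r = 4, 2
U, V, P, Q = (rng.standard_normal((n, r)) for _ in range(4))
C0 = correlation_matrix(0.0, U, V, P, Q, coeffs=[0.0, 1.0])  # g(0)=0
# t=0 时时变分量消失,只剩时不变分量
print(np.allclose(C0, U @ V.T))  # -> True
```

多项式系数使 g(t) 能拟合趋势或周期性调制;摘要中的对比学习与 Heterogeneous-Partial 损失则作用于训练阶段,此处未涉及。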
[AI-22] Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors IJCNN2026
【速读】:该论文旨在解决雷达仅模型在降水临近预报中因缺乏大尺度大气背景信息而导致长时预测性能下降的问题。现有方法虽尝试融合天气基础模型(weather foundation models)预测的气象变量,但未能有效处理雷达图像与气象数据之间显著的表征异质性。解决方案的关键在于提出PW-FouCast框架,其核心创新包括:(i) 基于Pangu-Weather预测结果引导的频域调制机制,以对齐光谱幅度和相位;(ii) 频率记忆模块用于校正相位偏差并保留时间演化特征;(iii) 反向频率注意力机制以重建高频细节信息。该方法通过傅里叶域融合策略实现了多源异构数据的有效协同,显著提升了预报精度与可靠时长。
链接: https://arxiv.org/abs/2603.21768
作者: Yuze Qin,Qingyong Li,Zhiqing Guo,Wen Wang,Yan Liu,Yangli-ao Geng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCNN 2026. Code is available at this https URL
Abstract:Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar-only models frequently suffer from a lack of large-scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW-FouCast, a novel frequency-domain fusion framework that leverages Pangu-Weather forecasts as spectral priors within a Fourier-based backbone. Our architecture introduces three key innovations: (i) Pangu-Weather-guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high-frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW-FouCast achieves state-of-the-art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at this https URL.
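"用先验调制频谱幅度与相位"的思路可以用一个玩具例子示意:对雷达帧做 FFT,把其幅度向先验场的幅度混合,同时保留雷达相位(论文中相位同样会被对齐校正,这里省略;混合方式仅为假设):

```python
import numpy as np

def modulate_spectrum(radar_field, prior_field, alpha=0.5):
    """将雷达帧的 FFT 幅度向先验场的幅度按 alpha 混合,保留雷达相位。
    这是对"先验引导的频域调制"的玩具化简,非 PW-FouCast 实现。"""
    R = np.fft.fft2(radar_field)
    P = np.fft.fft2(prior_field)
    mag = (1 - alpha) * np.abs(R) + alpha * np.abs(P)  # 混合幅度
    phase = np.angle(R)                                # 保留雷达相位
    return np.real(np.fft.ifft2(mag * np.exp(1j * phase)))

x = np.random.default_rng(1).standard_normal((8, 8))
out = modulate_spectrum(x, x, alpha=0.5)  # 先验与雷达相同时应为恒等变换
print(np.allclose(out, x))  # -> True
```

实际系统中先验来自 Pangu-Weather 预测的气象变量,且混合权重由网络学习,而非固定标量。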
[AI-23] CurvZO: Adaptive Curvature-Guided Sparse Zeroth-Order Optimization for Efficient LLM Fine-Tuning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)微调过程中因反向传播导致的高内存开销问题,同时克服零阶(Zeroth-order, ZO)优化因梯度估计方差大而导致收敛缓慢或不稳定的问题。其解决方案的关键在于提出自适应曲率引导的稀疏零阶优化方法(Adaptive Curvature-Guided Sparse Zeroth-Order Optimization, CurvZO),该方法通过在线追踪来自标量反馈的曲率信号,构建参数级别的采样分布以选择更新坐标,从而降低稀疏ZO梯度估计器的方差;此外,CurvZO动态调整扰动预算以适应变化的曲率分布,在保持聚焦性的同时维持足够的探索能力,最终在OPT和Llama模型上实现了显著的性能提升与训练加速。
链接: https://arxiv.org/abs/2603.21725
作者: Shuo Wang,Ziyu Chen,Ming Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fine-tuning large language models (LLMs) with backpropagation achieves high performance but incurs substantial memory overhead, limiting scalability on resource-constrained hardware. Zeroth-order (ZO) optimization provides a memory-efficient alternative by relying solely on forward passes, yet it typically suffers from slow or unstable convergence due to high-variance gradient estimates. Sparse ZO updates partially address this issue by perturbing only a subset of parameters, but their effectiveness hinges on selecting informative parameters, which is challenging in ZO optimization because each query yields only scalar feedback. We propose Adaptive Curvature-Guided Sparse Zeroth-Order Optimization (CurvZO), which tracks curvature signals online from scalar ZO feedback and leverages these signals to construct a parameter-wise sampling distribution for selecting coordinates at each update, reducing the variance of the sparse ZO gradient estimator. Moreover, CurvZO dynamically adapts the perturbation budget to the evolving curvature signal distribution, yielding sparse ZO updates that remain both focused and sufficiently exploratory. Extensive experiments on OPT and Llama across diverse NLP tasks show that CurvZO consistently improves fine-tuning performance and reduces training time over ZO baselines. It improves accuracy by up to 4.4 points and achieves up to a 2× speedup, while preserving memory efficiency.
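"在采样分布引导下只对部分坐标做零阶梯度估计"这一骨架可以用 NumPy 示意如下(CurvZO 的采样分布由在线曲率信号构建,这里直接作为输入给出;函数名与接口均为假设):

```python
import numpy as np

def sparse_zo_grad(f, x, probs, k=2, eps=1e-4, rng=None):
    """在按 probs 抽取的 k 个坐标上做两点(中心差分)零阶梯度估计。
    probs 对应 CurvZO 中由曲率信号导出的参数级采样分布。"""
    rng = rng or np.random.default_rng(0)
    coords = rng.choice(len(x), size=k, replace=False, p=probs)
    g = np.zeros_like(x)
    for i in coords:
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)  # 中心差分
    return g

f = lambda v: (v ** 2).sum()        # 玩具目标,真实梯度为 2v
x = np.array([1.0, -2.0, 3.0])
probs = np.array([0.2, 0.3, 0.5])   # 假设高曲率坐标获得更大采样概率
g = sparse_zo_grad(f, x, probs, k=3)
print(np.allclose(g, 2 * x, atol=1e-5))  # -> True(k=3 时覆盖全部坐标)
```

实际微调中 k 远小于参数量,每步只有被抽中的坐标获得梯度信号,因此采样分布的质量直接决定估计方差。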
[AI-24] FISformer: Replacing Self-Attention with a Fuzzy Inference System in Transformer Models for Time Series Forecasting
【速读】:该论文旨在解决Transformer在时间序列预测中因依赖确定性点积注意力机制而导致的不确定性建模能力不足和多变量时序维度非线性依赖关系刻画有限的问题。解决方案的关键在于提出FISFormer,其核心创新是用模糊推理系统(Fuzzy Inference System, FIS)驱动的交互机制替代传统注意力机制:每个查询-键对在每个特征维度上执行模糊推理过程,通过可学习的隶属函数与规则推理估算token间的关联强度,从而生成具有不确定性和可解释性的连续交互权重;这些权重经softmax归一化后与对应值特征进行逐元素相乘,得到增强的上下文表示,实现了模糊逻辑的不确定性建模能力与Transformer强大表征能力的有效融合。
链接: https://arxiv.org/abs/2603.21724
作者: Bulent Haznedar,Levent Karacan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.
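"用隶属函数给每个 query-key 对逐特征打分、再沿 token 轴做 softmax 归一化并加权 value"的流程可以粗略示意如下(这里用固定的高斯隶属函数代替论文中可学习的隶属函数与规则推理,仅为说明数据流,非 FISFormer 实现):

```python
import numpy as np

def fuzzy_interaction(Q, K, V, centers=0.0, widths=1.0):
    """FIS 式交互的玩具版本:对每个 (query, key, 特征) 三元组
    计算高斯隶属度,按特征取平均得到交互强度,再沿 token 轴
    softmax 归一化并加权 value。"""
    diff = Q[:, None, :] - K[None, :, :]                   # (n, n, d)
    mem = np.exp(-((diff - centers) ** 2) / (2 * widths**2))
    scores = mem.mean(axis=-1)                             # (n, n)
    w = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((3, 4))
out = fuzzy_interaction(Q, K, V)
print(out.shape)  # -> (3, 4)
```

与点积注意力相比,这种交互权重来自连续的隶属度映射,天然落在 (0, 1] 区间,便于解释;论文在此基础上引入可学习隶属函数与规则推理来建模不确定性。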
[AI-25] A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction
【速读】:该论文旨在解决高保真车辆阻力评估中因工作流摩擦(如几何清理、网格重试、队列竞争和跨团队复现失败)导致的效率瓶颈问题,而非单纯依赖求解器运行时间优化。其关键解决方案是提出一种以“合约”为中心的自演化编码代理框架,将代理发现建模为对程序的约束优化问题,而非静态模型实例;通过结合Famou-Agent风格的评估反馈机制与基于种群的岛屿进化策略,引入结构化变异(数据、模型、损失函数及分割策略),并采用多目标选择机制平衡排序质量、稳定性与成本。此外,系统设定了硬性评估合约,确保泄漏防护、确定性重放、多种子鲁棒性和资源预算控制,从而在8个匿名进化算子中实现综合得分0.9335(符号准确性0.9180),并通过轨迹和消融分析验证自适应采样与岛屿迁移是收敛质量的核心驱动力。最终部署采用“筛选-升级”模式:代理用于高通量设计探索排名,低置信度或分布外案例自动升级至高保真计算流体力学(CFD)仿真,保障决策级可靠性、治理可追溯性与安全边界。
链接: https://arxiv.org/abs/2603.21698
作者: Jinhui Ren,Huaiming Li,Yabin Liu,Tao Li,Zhaokun Liu,Yujia Liang,Zengle Ge,Chufan Wu,Xiaomin Yuan,Danyu Liu,Annan Li,Jianmin Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:High-fidelity vehicle drag evaluation is constrained less by solver runtime than by workflow friction: geometry cleanup, meshing retries, queue contention, and reproducibility failures across teams. We present a contract-centric blueprint for self-evolving coding agents that discover executable surrogate pipelines for predicting drag coefficient C_d under industrial constraints. The method formulates surrogate discovery as constrained optimization over programs, not static model instances, and combines Famou-Agent-style evaluator feedback with population-based island evolution, structured mutations (data, model, loss, and split policies), and multi-objective selection balancing ranking quality, stability, and cost. A hard evaluation contract enforces leakage prevention, deterministic replay, multi-seed robustness, and resource budgets before any candidate is admitted. Across eight anonymized evolutionary operators, the best system reaches a Combined Score of 0.9335 with sign-accuracy 0.9180, while trajectory and ablation analyses show that adaptive sampling and island migration are primary drivers of convergence quality. The deployment model is explicitly "screen-and-escalate": surrogates provide high-throughput ranking for design exploration, but low-confidence or out-of-distribution cases are automatically escalated to high-fidelity CFD. The resulting contribution is an auditable, reusable workflow for accelerating aerodynamic design iteration while preserving decision-grade reliability, governance traceability, and safety boundaries.
[AI-26] Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉引导指令下可能出现的安全失效问题,特别是通过漫画模板(comic-template)嵌入有害意图并诱导模型角色扮演完成漫画的新型攻击方式。其关键解决方案是构建了ComicJailbreak基准,包含1,167个攻击实例,覆盖10类危害和5种任务设置,并系统评估了15种前沿MLLMs对这类攻击的敏感性及现有防御方法的有效性与副作用。研究发现,此类基于叙事的多模态越狱攻击成功率高且能绕过传统文本防御,而现有安全对齐机制在应对此类攻击时易导致对良性提示的过度拒绝,揭示出当前安全评测体系在处理敏感但非有害内容时的不可靠性,从而强调需发展更鲁棒的多模态安全对齐策略。
链接: https://arxiv.org/abs/2603.21697
作者: Rui Yang Tan,Yujia Hu,Roy Ka-Wei Lee
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 31 pages
Abstract:Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and “complete the comic.” Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. Turning to existing defense methodologies, we show that while these methods are effective against the harmful comics, they induce a high refusal rate when prompted with benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.
[AI-27] MIND: Multi-agent inference for negotiation dialogue in travel planning ICLR2026
【速读】:该论文旨在解决多智能体辩论(Multi-Agent Debate, MAD)在协调复杂利益相关者(如旅行规划中的不同偏好)时有效性不足的问题。其解决方案的关键在于提出MIND(Multi-agent Inference for Negotiation Dialogue)框架,该框架基于心智理论(Theory of Mind, ToM),引入“战略评估”阶段,通过语言细微特征推断对手的意愿度(w),准确率达90.2%,从而更有效地模拟真实场景下的共识构建过程。实验表明,MIND在高意愿命中率(High-w Hit)和辩论命中率(Debate Hit-Rate)上显著优于传统MAD方法,并在理性性和流畅性等定性指标上表现优异,验证了其对人类协商动态建模的有效性。
链接: https://arxiv.org/abs/2603.21696
作者: Hunmin Do,Taejun Yoon,Kiyong Jung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026 Workshop (HCAIR)
Abstract:While Multi-Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi-agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus-building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, effectively prioritizing high-stakes constraints. Furthermore, qualitative evaluations via LLM-as-a-Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.
[AI-28] Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医学视觉问答(Medical Visual Question Answering, VQA)任务中易产生幻觉(hallucination)的问题,即模型生成与输入图像内容矛盾的回答,这在临床场景中可能带来严重风险。现有检测方法如语义熵(Semantic Entropy, SE)和视觉增强语义熵(Vision-Amplified Semantic Entropy, VASE)依赖于多次随机采样和外部自然语言推理模型进行语义聚类,计算开销大且难以部署。论文的关键创新在于提出一种确定性检测方法——置信度-证据贝叶斯增益(Confidence-Evidence Bayesian Gain, CEBaG),其核心思想是利用模型自身输出的log-probabilities识别幻觉:通过两个互补信号——token级预测方差(反映响应中各词置信度不一致)和证据幅度(衡量图像对单个token预测的影响强度),无需随机采样、外部模型或任务特定超参数即可实现高效准确的幻觉检测,在多个医学MLLM和VQA基准上显著优于现有方法。
链接: https://arxiv.org/abs/2603.21693
作者: Mohammad Asadi,Tahoura Nedaee,Jack W. O’Sullivan,Euan Ashley,Ehsan Adeli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) have shown strong potential for medical Visual Question Answering (VQA), yet they remain prone to hallucinations, defined as generating responses that contradict the input image, posing serious risks in clinical settings. Current hallucination detection methods, such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE), require 10 to 20 stochastic generations per sample together with an external natural language inference model for semantic clustering, making them computationally expensive and difficult to deploy in practice. We observe that hallucinated responses exhibit a distinctive signature directly in the model’s own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. Based on this observation, we propose Confidence-Evidence Bayesian Gain (CEBaG), a deterministic hallucination detection method that requires no stochastic sampling, no external models, and no task-specific hyperparameters. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference. Evaluated across four medical MLLMs and three VQA benchmarks (16 experimental settings), CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained. The code will be made available upon acceptance.
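CEBaG's two signals can be illustrated directly on per-token log-probabilities; the combination rule below is a simplified assumption for illustration, not the paper's actual formula:

```python
# Toy sketch of CEBaG's two deterministic signals, computed from the
# model's own per-token log-probs. The subtraction used to combine them
# is an illustrative assumption; the paper's exact rule may differ.

def cebag_score(logp_with_image, logp_text_only):
    """Higher token-level variance (inconsistent confidence) plus lower
    evidence magnitude (the image barely shifts predictions) indicates
    a more hallucination-like response."""
    n = len(logp_with_image)
    mean = sum(logp_with_image) / n
    variance = sum((x - mean) ** 2 for x in logp_with_image) / n
    # Evidence magnitude: mean per-token shift induced by the image
    # relative to text-only inference.
    evidence = sum(abs(a - b)
                   for a, b in zip(logp_with_image, logp_text_only)) / n
    return variance - evidence  # higher = more suspicious

# Grounded answer: consistently confident, strongly image-dependent.
grounded = cebag_score([-0.1, -0.2, -0.1, -0.15], [-2.0, -1.5, -1.8, -2.2])
# Hallucinated answer: erratic confidence, almost no image sensitivity.
hallucinated = cebag_score([-0.1, -3.0, -0.2, -2.5], [-0.1, -3.0, -0.2, -2.4])
```

Because both signals come from a single forward pass (plus one text-only pass), the detector needs no stochastic sampling or external NLI model, which is the efficiency argument made in the abstract.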
[AI-29] Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces
【速读】:该论文旨在解决当前AI代理(AI Agent)从人类监督的副驾驶向自主平台基础设施演进过程中,缺乏对群体推理行为进行分析的能力这一关键问题。现有工具虽能支持故障恢复、执行追踪与互操作性,但无法以结构化方式记录代理决策的因果链条——即为何选择某动作、如何从观察中得出结论、这些结论如何影响策略制定,以及最终判断所依赖的证据链。解决方案的核心是提出Agent Execution Record(AER),一种将意图(intent)、观察(observation)和推断(inference)作为一级查询字段的结构化推理溯源原语,并集成版本化计划、证据链、结构化裁决(verdict)及委托权限链等要素,从而实现跨代理的推理模式挖掘、置信度校准、对比分析及反事实回归测试。
链接: https://arxiv.org/abs/2603.21692
作者: Neelmani Vispute
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
备注: 8 pages, 2 tables, preprint
Abstract:As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance – normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.
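A minimal sketch of what an AER could look like as a queryable record follows; all field names approximate the concepts in the abstract and are not the actual SDK schema:

```python
# Illustrative Agent Execution Record (AER) schema: intent, observation,
# and inference are first-class fields on every step, alongside a
# structured verdict with confidence and an evidence chain.
# Field names are assumptions, not the reference implementation's.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    intent: str       # why the agent chose this action
    action: str
    observation: str  # what came back
    inference: str    # what the agent concluded from it

@dataclass
class AgentExecutionRecord:
    steps: List[Step] = field(default_factory=list)
    plan_version: int = 1
    verdict: Optional[str] = None
    confidence: Optional[float] = None
    evidence: List[int] = field(default_factory=list)  # indices of supporting steps

    def supporting_steps(self):
        """Resolve the evidence chain back to concrete steps."""
        return [self.steps[i] for i in self.evidence]

aer = AgentExecutionRecord()
aer.steps.append(Step("check error rate spike", "query_metrics",
                      "5xx up 40% at 12:03", "deploy at 12:01 is suspect"))
aer.verdict, aer.confidence, aer.evidence = "bad deploy", 0.85, [0]
```

Because reasoning fields are schema-level rather than free-text log lines, records like this can be aggregated across many investigations, which is what enables the population-level analytics (pattern mining, confidence calibration) the abstract describes.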
[AI-30] AI Token Futures Market: Commoditization of Compute and Derivatives Contract Design
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)和视觉-语言-动作模型(Vision-Language-Action Models, VLAs)部署过程中,AI推理所消耗的“token”正从智能服务输出转变为一种新型计算基础设施原材料,但缺乏标准化定价与风险管理工具的问题。其解决方案的关键在于提出一套完整的标准化“标准推理令牌”(Standard Inference Token, SIT)期货合约设计,涵盖合约定义、结算机制、保证金制度及做市商制度,并通过构建均值回归跳跃扩散随机过程模型进行蒙特卡洛模拟,验证了在应用层需求激增场景下,token期货可使企业计算成本波动降低62%-78%,从而为算力资源的金融化提供理论基础与实践路径。
链接: https://arxiv.org/abs/2603.21690
作者: Yicai Xing
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 16 pages, 7 figures, 3 tables
Abstract:As large language models (LLMs) and vision-language-action models (VLAs) become widely deployed, the tokens consumed by AI inference are evolving into a new type of commodity. This paper systematically analyzes the commodity attributes of tokens, arguing for their transition from intelligent service outputs to compute infrastructure raw materials, and draws comparisons with established commodities such as electricity, carbon emission allowances, and bandwidth. Building on the historical experience of electricity futures markets and the theory of commodity financialization, we propose a complete design for standardized token futures contracts, including the definition of a Standard Inference Token (SIT), contract specifications, settlement mechanisms, margin systems, and market-maker regimes. By constructing a mean-reverting jump-diffusion stochastic process model and conducting Monte Carlo simulations, we evaluate the hedging efficiency of the proposed futures contracts for application-layer enterprises. Simulation results show that, under an application-layer demand explosion scenario, token futures can reduce enterprise compute cost volatility by 62%-78%. We also explore the feasibility of GPU compute futures and discuss the regulatory framework for token futures markets, providing a theoretical foundation and practical roadmap for the financialization of compute resources.
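The paper's hedging experiment can be sketched in miniature with a mean-reverting jump-diffusion spot price; all parameter values below are illustrative, not the paper's calibration:

```python
# Minimal Monte Carlo sketch: an enterprise locks in a fraction of its
# token demand at the futures price and buys the rest at spot. All
# process parameters and the hedge ratio are illustrative assumptions.
import math
import random

def simulate_price(p0=1.0, mu=1.0, kappa=2.0, sigma=0.2,
                   jump_prob=0.02, jump_scale=0.5,
                   steps=250, dt=1 / 250, seed=0):
    """Euler scheme for a mean-reverting jump-diffusion
    dp = kappa*(mu - p)*dt + sigma*dW + J*dN."""
    rng = random.Random(seed)
    p = p0
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))
        jump = jump_scale * rng.expovariate(1.0) if rng.random() < jump_prob else 0.0
        p += kappa * (mu - p) * dt + sigma * dw + jump
    return p

def cost_variance(hedge_ratio, n_paths=200):
    """Variance of effective unit token cost when `hedge_ratio` of
    demand is locked at the futures price p0 = 1.0."""
    costs = []
    for i in range(n_paths):
        spot = simulate_price(seed=i)
        costs.append(hedge_ratio * 1.0 + (1 - hedge_ratio) * spot)
    m = sum(costs) / len(costs)
    return sum((c - m) ** 2 for c in costs) / len(costs)

unhedged, hedged = cost_variance(0.0), cost_variance(0.7)
```

With a 70% hedge, cost variance shrinks by the square of the unhedged exposure (here to 9% of the unhedged value), which is the mechanism behind the 62%-78% volatility reduction reported in the abstract.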
[AI-31] Mirage: The Illusion of Visual Understanding
【速读】:该论文旨在解决当前多模态人工智能(Multimodal AI)系统在视觉-语言推理过程中存在的根本性漏洞问题,特别是模型在缺乏图像输入时仍能生成看似合理但实际无依据的推理结果(即“幻觉推理”),这导致其评估体系存在严重偏差。解决方案的关键在于引入一个名为 B-Clean 的基准测试框架,该框架通过消除文本线索以强制模型基于真实视觉信息进行推理,从而实现对多模态AI系统的公平、可信且以视觉为基础的评估,尤其在医疗等高风险场景中具有重要意义。
链接: https://arxiv.org/abs/2603.21687
作者: Mohammad Asadi,Jack W. O’Sullivan,Fang Cao,Tahoura Nedaee,Kamyar Fardi,Fei-Fei Li,Ehsan Adeli,Euan Ashley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
[AI-32] Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses, and Benchmarks
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中存在的复杂系统级安全漏洞问题,这些问题源于其多模块架构,可能导致数据投毒、对抗攻击和成员推理攻击等威胁。解决方案的关键在于从输入和输出两个阶段构建全面的防御技术分类体系:输入侧通过动态访问控制、同态加密检索和对抗预过滤实现数据保护;输出侧则采用联邦学习隔离、差分隐私扰动和轻量级数据净化等技术防止信息泄露。此外,论文还首次提出统一的评估基准,整合权威测试数据集、安全标准与评价框架,为未来RAG系统的安全性研究提供系统性指导。
链接: https://arxiv.org/abs/2603.21654
作者: Yanming Mu,Hao Hu,Feiyang Li,Qiao Yuan,Jiang Wu,Zichuan Liu,Pengcheng Liu,Mei Wang,Hongwei Zhou,Yuling Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) significantly mitigates the hallucinations and domain knowledge deficiency in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization. To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline-providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.
[AI-33] EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
【速读】:该论文旨在解决企业在部署AI代理(AI Agent)时面临的三大挑战:能力与数据主权之间的权衡、成本控制,以及开发流程碎片化导致的小语言模型(Small Language Model, SLM)专业化受限问题。现有方法往往将工具集成、数据生成和训练阶段割裂,难以实现高效闭环优化。解决方案的关键在于提出EnterpriseLab这一全栈平台,其核心创新包括:(1) 基于Model Context Protocol的模块化环境,统一接入企业级应用与开源/私有工具;(2) 自动轨迹合成机制,基于环境Schema程序化生成高质量训练数据;(3) 内置连续评估的集成训练流水线,形成从数据到模型迭代的闭环。实证表明,8B参数模型在EnterpriseArena(含15个应用及140+工具)上达到GPT-4o水平性能,同时推理成本降低8–10倍,并在EnterpriseBench和CRMArena等基准上提升10%,验证了该方案在保障隐私前提下实现高性能企业级智能代理部署的可行性。
链接: https://arxiv.org/abs/2603.21630
作者: Ankush Agarwal,Harsh Vishwakarma,Suraj Nagaje,Chaitanya Devaguptapu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While small language models offer privacy-preserving alternatives to frontier models, their specialization is hindered by fragmented development pipelines that separate tool integration, data generation, and training. We introduce EnterpriseLab, a full-stack platform that unifies these stages into a closed-loop framework. EnterpriseLab provides (1) a modular environment exposing enterprise applications via Model Context Protocol, enabling seamless integration of proprietary and open-source tools; (2) automated trajectory synthesis that programmatically generates training data from environment schemas; and (3) integrated training pipelines with continuous evaluation. We validate the platform through EnterpriseArena, an instantiation with 15 applications and 140+ tools across IT, HR, sales, and engineering domains. Our results demonstrate that 8B-parameter models trained within EnterpriseLab match GPT-4o’s performance on complex enterprise workflows while reducing inference costs by 8-10x, and remain robust across diverse enterprise benchmarks, including EnterpriseBench (+10%) and CRMArena (+10%). EnterpriseLab provides enterprises a practical path to deploying capable, privacy-preserving agents without compromising operational capability.
[AI-34] Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains ISCA
【速读】:该论文旨在解决现有机器学习框架在合规监测(compliance monitoring)中因假设观测数据为真实标签而导致的局限性问题,尤其是在税收或监管合规等规则明确但观测信息部分且噪声较大的场景下,传统方法难以准确推断规则激活状态(rule activation)、合规率(compliance rate)及参数漂移(parametric drift)。其解决方案的关键在于提出一种名为规则状态推断(Rule-State Inference, RSI)的贝叶斯框架,该框架将权威规则编码为结构化先验,并将合规监测建模为对潜在规则状态空间 $ S = (a_i, c_i, \delta_i) $ 的后验推断,从而实现从局部和噪声观测中高效、稳定地恢复规则的真实执行状态。
链接: https://arxiv.org/abs/2603.21610
作者: Abdou-Raouf Atarmla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 16 pages, 2 tables, 1 figure. Code and dataset available at this http URL
Abstract:Existing machine learning frameworks for compliance monitoring – Markov Logic Networks, Probabilistic Soft Logic, supervised models – share a fundamental paradigm: they treat observed data as ground truth and attempt to approximate rules from it. This assumption breaks down in rule-governed domains such as taxation or regulatory compliance, where authoritative rules are known a priori and the true challenge is to infer the latent state of rule activation, compliance, and parametric drift from partial and noisy observations. We propose Rule-State Inference (RSI), a Bayesian framework that inverts this paradigm by encoding regulatory rules as structured priors and casting compliance monitoring as posterior inference over a latent rule-state space S = (a_i, c_i, delta_i), where a_i captures rule activation, c_i models the compliance rate, and delta_i quantifies parametric drift. We prove three theoretical guarantees: (T1) RSI absorbs regulatory changes in O(1) time via a prior ratio correction, independently of dataset size; (T2) the posterior is Bernstein-von Mises consistent, converging to the true rule state as observations accumulate; (T3) mean-field variational inference monotonically maximizes the Evidence Lower BOund (ELBO). We instantiate RSI on the Togolese fiscal system and introduce RSI-Togo-Fiscal-Synthetic v1.0, a benchmark of 2,000 synthetic enterprises grounded in real OTR regulatory rules (2022-2025). Without any labeled training data, RSI achieves F1=0.519 and AUC=0.599, while absorbing regulatory changes in under 1ms versus 683-1082ms for full model retraining – at least a 600x speedup.
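One slice of the rule state, the compliance rate c_i, reduces to a textbook Beta-Bernoulli update, which also illustrates the O(1) prior-ratio absorption of a regulatory change; the prior parameters and counts below are illustrative:

```python
# Toy Beta-Bernoulli sketch of the compliance-rate component c_i of
# RSI's rule state. Observation counts and prior parameters are
# illustrative, not drawn from the benchmark.

def compliance_posterior_mean(passes, fails, alpha=2.0, beta=2.0):
    """Beta(alpha, beta) prior on c_i -> posterior mean after observing
    `passes` compliant and `fails` non-compliant checks."""
    a, b = alpha + passes, beta + fails
    return a / (a + b)

# (T2) Consistency: more observations pull the posterior toward the
# true 80% compliance rate.
small = compliance_posterior_mean(8, 2)       # 10 observations
large = compliance_posterior_mean(800, 200)   # 1000 observations

# (T1) A regulatory change is absorbed by swapping the prior only,
# an O(1) correction that never touches the stored observations.
after_reform = compliance_posterior_mean(800, 200, alpha=1.0, beta=5.0)
```

The conjugate update is what makes the O(1) claim plausible: the data enter only through sufficient statistics (pass/fail counts), so changing the rule changes the prior term, not the accumulated evidence.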
[AI-35] INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)场景下大语言模型(Large Language Models, LLMs)中预测不确定性量化(Uncertainty Quantification, UQ)失效的问题,特别是标准基于熵的UQ方法在RAG中常因机制性矛盾而误判准确输出为高不确定性。其关键解决方案是提出INTRYGUE(Induction-Aware Entropy Gating for Uncertainty Estimation),该方法基于对诱导头(induction heads)激活模式的识别,通过 gating 机制抑制由诱导头引发的虚假熵膨胀,从而实现更可靠、可解释的不确定性估计。
链接: https://arxiv.org/abs/2603.21607
作者: Alexandra Bazarova,Andrei Volodichev,Daria Kotova,Alexey Zaytsev
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While retrieval-augmented generation (RAG) significantly improves the factual reliability of LLMs, it does not eliminate hallucinations, so robust uncertainty quantification (UQ) remains essential. In this paper, we reveal that standard entropy-based UQ methods often fail in RAG settings due to a mechanistic paradox. An internal “tug-of-war” inherent to context utilization appears: while induction heads promote grounded responses by copying the correct answer, they collaterally trigger the previously established “entropy neurons”. This interaction inflates predictive entropy, causing the model to signal false uncertainty on accurate outputs. To address this, we propose INTRYGUE (Induction-Aware Entropy Gating for Uncertainty Estimation), a mechanistically grounded method that gates predictive entropy based on the activation patterns of induction heads. Evaluated across four RAG benchmarks and six open-source LLMs (4B to 13B parameters), INTRYGUE consistently matches or outperforms a wide range of UQ baselines. Our findings demonstrate that hallucination detection in RAG benefits from combining predictive uncertainty with interpretable, internal signals of context utilization.
[AI-36] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
【速读】:该论文旨在解决多任务监督微调(Multi-task Supervised Fine-Tuning, SFT)中因同质计算资源分配导致的过拟合与欠拟合问题:在训练过程中,不同子数据集的学习动态存在异构性,部分任务会因学习速度较快而提前过拟合,而其他任务则因学习较慢而持续欠拟合,从而限制了整体模型性能。解决方案的关键在于提出一种迭代式、过拟合感知的搜索算法 mSFT,其核心机制为:在每轮训练中识别并排除最早出现过拟合的子数据集,回退到该任务最优检查点后继续训练,从而动态优化数据混合策略,最大化各任务的学习效率与模型泛化能力。实证结果表明,mSFT 在多个基准测试和基础模型上均显著优于四种基线方法,并且对计算预算变化具有鲁棒性,尤其在低算力条件下还能降低训练浮点运算次数(FLOPs)的同时提升性能。
链接: https://arxiv.org/abs/2603.21606
作者: Woosung Koh,Jeyoung Jeon,Youngjin Song,Yujin Cheon,Soowon Oh,Jaehyeong Choi,Se-Young Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Pre-print
Abstract:Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes, task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
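The outer loop of mSFT can be sketched on synthetic validation-loss curves; real mSFT trains and checkpoints a model, while here the curves stand in for training so only the exclusion logic is shown:

```python
# Sketch of mSFT's overfitting-aware search: each round, find the
# sub-dataset whose validation loss starts rising earliest (overfitting),
# freeze it out of the active mixture at its best checkpoint, and
# continue with the rest. Curves below are synthetic stand-ins.

def first_overfit_epoch(losses):
    """Epoch at which validation loss first increases, else None."""
    for t in range(1, len(losses)):
        if losses[t] > losses[t - 1]:
            return t
    return None

def msft_order(val_curves):
    """Order in which mSFT would exclude sub-datasets from the mixture."""
    active = dict(val_curves)
    order = []
    while active:
        epochs = {name: first_overfit_epoch(curve)
                  for name, curve in active.items()}
        epochs = {n: e for n, e in epochs.items() if e is not None}
        if not epochs:           # remaining tasks never overfit
            break
        earliest = min(epochs, key=epochs.get)
        order.append(earliest)   # revert to this task's best checkpoint
        del active[earliest]
    return order

curves = {
    "math": [1.0, 0.8, 0.7, 0.9],   # overfits at epoch 3
    "code": [1.2, 0.9, 1.0, 1.1],   # overfits at epoch 2 (earliest)
    "chat": [1.1, 1.0, 0.9, 0.85],  # still improving
}
order = msft_order(curves)
```

The key departure from homogeneous-budget SFT is visible here: "code" stops receiving compute two epochs before "chat", instead of all three tasks sharing one stopping point.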
[AI-37] Riemannian Geometry Speaks Louder Than Words: From Graph Foundation Model to Next-Generation Graph Intelligence
【速读】:该论文旨在解决当前图学习领域中缺乏通用性强、可迁移的图基础模型(Graph Foundation Models, GFMs)的问题,尤其针对现有图神经网络(Graph Neural Networks, GNNs)在多领域预训练与适配时存在的记忆保持能力弱和可解释性不足,以及大语言模型(Large Language Models, LLMs)难以直接应用于图结构数据(因词序列无法捕捉图的结构复杂性)等关键挑战。其解决方案的核心在于引入黎曼几何(Riemannian geometry)作为建模图结构的新范式,提出“黎曼基础模型”(Riemannian Foundation Model, RFM),通过强调图的内在几何特性,赋予模型结构推理与生成的内生能力,从而实现从图表示空间切换到结构本质理解的范式跃迁,并为下一代图智能提供新路径。
链接: https://arxiv.org/abs/2603.21601
作者: Philip S. Yu,Li Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:Graphs provide a natural description of the complex relationships among objects, and play a pivotal role in communications, transportation, social computing, the life sciences, etc. Currently, there is strong agreement that Graph Foundation Models (GFMs) are essential for advancing graph learning, yet considerable disagreement persists on how to build a powerful, general-purpose GFM analogous to Large Language Models (LLMs). Graph Neural Networks (GNNs) exhibit limitations in memory retention and principled interpretability when confronted with multi-domain pretraining and adaptation. The challenge of graph serialization hinders the direct application of LLMs, as the words struggle to capture the structural complexity and diversity inherent in graphs. In contrast, Riemannian geometry offers an elegant mathematical framework for modeling structures, while remaining compatible with graph semantic learning, even with LLMs. In this paper, we argue that, for graphs, Riemannian geometry speaks louder than words, and lay out the foundational principles for GFM. Reimagining with Riemannian geometry, we introduce a blue sky idea-Riemannian Foundation Model (RFM)-that opens a new pathway for capturing complex structural patterns and uncovering cross-domain generalities. RFM emphasizes intrinsic graph geometry and embodies endogenous capacities for structural inference and generation, moving beyond mere representation-space switching. Accordingly, we outline a progressive agenda that begins with universal structural understanding through intrinsic geometry, and then rebuilds LLM with a Riemannian engine for general-purpose graph modeling and beyond. Thus, RFM enables a paradigm shift from designing graph models to solving graph-structured applications with RFM agents, unlocking the next-generation graph intelligence.
[AI-38] Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
【速读】:该论文旨在解决无人机辅助无线网络中因UAV间信息交换间歇性导致的系统状态获取延迟问题,以及由此引发的协作效率低下和传输吞吐量受限的问题。其关键解决方案包括两个方面:一是提出一种延迟容忍的多智能体深度强化学习(Delay-Tolerant Multi-Agent Deep Reinforcement Learning, MADRL)算法,通过引入延迟惩罚奖励机制促进UAV间的协同信息共享,并联合优化UAV轨迹规划、网络拓扑形成与传输控制策略;二是设计基于时空注意力机制的信息预测方法,以恢复因信道不稳定造成的丢失信息,提升各UAV对网络状态的认知能力。这两个创新设计共同提升了网络容量,在减少信息延迟超过50%的同时实现吞吐量提升75%,且不牺牲网络容量,反而显著改善了学习性能和实际部署可行性。
链接: https://arxiv.org/abs/2603.21594
作者: Che Chen,Lanhua Li,Shimin Gong,Yu Zhao,Yuming Fang,Dusit Niyato
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs’ relay communications. The UAVs’ intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs’ trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention based prediction approach to recover the lost information and enhance each UAV’s awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs’ information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs’ information exchange and thus fostering practical deployment of MADRL in UAV-assisted wireless networks.
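The delay-penalized reward idea can be illustrated in a few lines; the penalty weight and the reward structure are assumed values for the sketch, not the paper's formulation:

```python
# Sketch of a delay-penalized reward for one UAV: shared throughput
# minus a penalty on the staleness of the state information it holds
# about its peers. `lam` is an illustrative weight.

def delay_penalized_reward(throughput, info_delays, lam=0.1):
    return throughput - lam * sum(info_delays)

fresh = delay_penalized_reward(10.0, [0, 1, 0])  # recent peer exchanges
stale = delay_penalized_reward(10.0, [5, 7, 6])  # long-outdated peers
```

Because the penalty falls as peers exchange state more often, maximizing this reward pushes agents toward information sharing rather than trading it away against throughput, consistent with the abstract's observation that sharing and capacity improve together.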
[AI-39] Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身智能体中缺乏跨时空尺度的空间推理能力的问题,即其决策主要依赖于即时感知的反应式规划,难以实现类生物智能(Biological Intelligence, BI)所具备的“心理导航”(mental navigation)能力——即基于经验构建空间表征并进行路径模拟的能力。解决方案的关键在于提出一个名为Video2Mental的新基准,用于评估MLLMs的心理导航能力,并进一步设计NavMind模型,该模型通过引入可学习的细粒度认知地图作为中间表示,结合分难度层级的渐进式监督微调策略,有效将原始感知映射到结构化规划,从而显著提升空间推理与长期路径规划的准确性。
链接: https://arxiv.org/abs/2603.21577
作者: Qihui Zhu,Shouwei Ruan,Xiao Yang,Hao Jiang,Yao Huang,Shiji Zhao,Hanwei Fan,Hang Su,Xingxing Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on “mental navigation”: the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose \textbfNavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.
[AI-40] Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决多智能体协作推理中两个核心问题:一是交互层面的模糊性导致生成(generation)、批评(critique)与修订(revision)过程难以区分,从而阻碍对各智能体贡献的准确赋权;二是策略优化过程中奖励信号存在重尾分布和噪声,易引发优势估计偏差并导致训练不稳定甚至发散。解决方案的关键在于提出一个鲁棒的多智能体强化学习框架,包含两个核心组件:Dual-Agent Answer-Critique-Rewrite (DACR) 和 Adaptive Robust Estimator (ARE)。其中,DACR 将推理过程结构化为三阶段流水线,实现对每个智能体边际贡献的显式归因;ARE 则提供批次经验均值的鲁棒估计,提升多智能体策略优化的稳定性,有效缓解噪声奖励带来的负面影响。
链接: https://arxiv.org/abs/2603.21574
作者: Zhongyi Li,Wan Tian,Jingyu Chen,Kangyao Huang,Huiming Zhang,Hui Yang,Tao Ren,Jinyang Jiang,Yijie Peng,Yikun Ban,Fuzhen Zhuang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent’s marginal contribution to its partner’s performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
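The goal of ARE, a batch mean that heavy-tailed rewards cannot distort, can be illustrated with a simple trimmed mean; ARE's actual estimator may differ, and the reward values are synthetic:

```python
# Illustrative stand-in for robust batch-mean estimation: a trimmed
# mean discards extreme rewards before averaging, so a single
# heavy-tailed outlier cannot skew the baseline used for advantage
# estimation the way it skews the plain mean.

def trimmed_mean(rewards, trim=0.1):
    """Drop the lowest and highest `trim` fraction, then average."""
    xs = sorted(rewards)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)

rewards = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9, 1.1, 100.0]  # one outlier
plain = sum(rewards) / len(rewards)  # dragged toward the outlier
robust = trimmed_mean(rewards)       # stays near the bulk at ~1.0
```

An advantage computed as `reward - plain` would be negative for nearly every rollout in this batch, which is exactly the kind of biased, destabilizing signal the robust estimator is meant to prevent.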
[AI-41] Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
【速读】:该论文旨在解决协作式多智能体大语言模型(Collaborative Multi-Agent Large Language Models, CMLLMs)在强化学习(Reinforcement Learning, RL)训练过程中因信用分配(Credit Assignment)问题导致的性能瓶颈。具体而言,共享的全局奖励信号模糊了各智能体的个体贡献,加剧了更新方差并诱发“搭便车”(free-riding)行为,从而阻碍有效协作。解决方案的关键在于提出反事实信用策略优化(Counterfactual Credit Policy Optimization, CCPO),其核心机制是通过构建动态反事实基线——模拟移除某一智能体贡献后的轨迹来估计其边际贡献,进而生成针对角色敏感的优势信号(role-sensitive advantages)用于策略优化;此外,为提升在异构任务与数据分布下的稳定性,CCPO进一步引入基于全局历史回放统计的归一化方案,实现优势值的校准。实证表明,该方法显著缓解了自由搭便车现象,并在数学与逻辑推理基准上优于现有强基线。
链接: https://arxiv.org/abs/2603.21563
作者: Zhongyi Li,Wan Tian,Yikun Ban,Jinju Chen,Huiming Zhang,Yang Liu,Fuzhen Zhuang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent’s marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent’s contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think–Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at this https URL.
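The counterfactual advantage plus global-history normalization can be sketched with synthetic rewards; the function below is an illustration of the idea, not CCPO's implementation:

```python
# Sketch of CCPO's credit signal: an agent's raw advantage is the full
# trajectory reward minus the reward of a counterfactual trajectory with
# that agent's contribution removed; advantages are then calibrated
# against global rollout-history statistics. Reward values are synthetic.

def counterfactual_advantages(full_reward, counterfactual_rewards,
                              history_mean=0.0, history_std=1.0):
    raw = {agent: full_reward - r_cf
           for agent, r_cf in counterfactual_rewards.items()}
    return {agent: (v - history_mean) / history_std
            for agent, v in raw.items()}

adv = counterfactual_advantages(
    full_reward=1.0,
    counterfactual_rewards={
        "thinker": 0.2,      # removing this agent hurts the outcome a lot
        "free_rider": 0.95,  # removing this agent barely matters
    },
    history_mean=0.1, history_std=0.5,
)
```

The free-riding agent receives a near-zero (here slightly negative) advantage despite the shared reward being high, which is the credit-assignment behavior a shared global reward alone cannot produce.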
[AI-42] Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment ICLR2026
【Quick Read】: This paper addresses recursive drift in recursive self-improvement: as a model iteratively trains on its own outputs, errors in intermediate reasoning compound across iterations, leading to mode collapse and performance degradation. The key to the proposed Neuro-Symbolic Recursive Self-Alignment (NSRSA) is an embedded symbolic verification subsystem that gates training-data quality at the reasoning-step level: it verifies each arithmetic operation with sympy, checks the consistency of the logical flow, and enforces domain constraints, filtering out samples whose answers are correct only by "lucky guesses" despite flawed reasoning. Experiments show NSRSA rejects about 34% of solutions that pass outcome verification but contain faulty reasoning, and constructing DPO preference pairs from the verification teaches the model to distinguish sound from flawed reasoning, raising reward accuracy from 46% to 63%, providing an extensible framework for measurable and reliable recursive self-improvement.
Link: https://arxiv.org/abs/2603.21558
Authors: Xinyu Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026 Workshop on World Models. 10 pages, 3 figures, 5 tables
Abstract:Recursive self-improvement–where a model iteratively trains on its own outputs–promises sustained capability growth but faces a fundamental obstacle: recursive drift. As models train on self-generated data across multiple iterations, errors in intermediate reasoning compound, leading to mode collapse and performance degradation. We propose Neuro-Symbolic Recursive Self-Alignment (NSRSA), which stabilizes iterative self-training by embedding a symbolic verification subsystem that gates training data quality at the reasoning step level. Unlike outcome-only filtering (which admits “lucky guesses” with flawed reasoning), NSRSA verifies each arithmetic operation via sympy, checks logical flow consistency across reasoning steps, and enforces domain constraints. We evaluate NSRSA on GSM8K using Qwen3-4B-Thinking across 5 self-training iterations under five conditions: no verification, outcome verification, majority voting, full NSRSA symbolic verification, and NSRSA with DPO. Our filtering analysis shows that NSRSA rejects approximately 34% of correct-answer solutions that pass outcome verification, eliminating “lucky guesses” with flawed reasoning from the training set. We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%). NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.
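Step-level gating, as opposed to outcome-only filtering, can be illustrated with a stdlib stand-in for the paper's sympy-based checker (the `a op b = c` step format and the function names here are hypothetical): every arithmetic step must verify, or the whole solution is excluded from the self-training set.

```python
import re

def verify_step(step):
    """Check one 'a <op> b = c' reasoning step (stdlib stand-in for a
    sympy-based verifier)."""
    m = re.fullmatch(r"\s*(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*(-?\d+)\s*", step)
    if not m:
        return False  # unparseable steps are rejected, not trusted
    a, op, b, c = m.groups()
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
           "*": lambda x, y: x * y, "/": lambda x, y: x / y}
    return ops[op](int(a), int(b)) == int(c)

def gate_solution(steps):
    """A solution enters the self-training set only if every step verifies."""
    return all(verify_step(s) for s in steps)

ok = gate_solution(["3 * 4 = 12", "12 + 5 = 17"])      # sound reasoning
lucky = gate_solution(["3 * 4 = 13", "13 + 4 = 17"])   # right answer, flawed step
# ok is True, lucky is False
```

Note that `lucky` reaches the correct final answer (17), so outcome-only filtering would admit it; step-level gating rejects it.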
[AI-43] What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators
【Quick Read】: This paper tackles the interpretability of learned world models, asking how they internally encode environment state variables (e.g., object positions and scores). The key is a systematic application of interpretability techniques, including linear and nonlinear probing, causal interventions, and attention analysis, to two architecturally distinct world models (IRIS, a discrete-token Transformer, and DIAMOND, a continuous diffusion UNet). Both models are found to learn approximately linear representations of state variables, and causal interventions confirm that these representations play a functional role in prediction; moreover, IRIS attention heads show spatial specialization, and multi-baseline token-ablation experiments identify object-containing tokens as disproportionately important to model outputs. This provides empirical evidence that world models develop structured, functionally relevant internal representations.
Link: https://arxiv.org/abs/2603.21546
Authors: Xinyu Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 1 table
Abstract:World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques–including linear and nonlinear probing, causal interventions, and attention analysis–to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions–shifting hidden states along probe-derived directions–produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.
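Linear probing, as used here, amounts to fitting a linear map from hidden activations to a state variable and reporting R^2. A one-feature toy version (not the paper's code; real probes regress from full hidden vectors) makes the idea concrete:

```python
def linear_probe_r2(hidden, target):
    """Fit y = w*h + b by least squares on a single hidden feature and report
    R^2: high R^2 means the state variable is linearly decodable (toy sketch)."""
    n = len(hidden)
    mh = sum(hidden) / n
    mt = sum(target) / n
    cov = sum((h - mh) * (t - mt) for h, t in zip(hidden, target))
    var = sum((h - mh) ** 2 for h in hidden)
    w = cov / var
    b = mt - w * mh
    pred = [w * h + b for h in hidden]
    ss_res = sum((t - p) ** 2 for t, p in zip(target, pred))
    ss_tot = sum((t - mt) ** 2 for t in target)
    return 1 - ss_res / ss_tot

# A hidden unit that tracks, say, paddle position linearly yields R^2 near 1.
r2 = linear_probe_r2([0.1, 0.2, 0.3, 0.4], [10, 20, 30, 40])
```

The paper's observation that MLP probes barely improve on linear ones is what licenses the "approximately linear" conclusion: if the relation were nonlinear, the linear R^2 would lag.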
[AI-44] Evolutionary Biparty Multiobjective UAV Path Planning: Problems and Empirical Comparisons
【Quick Read】: This paper addresses UAV path planning in urban environments where efficiency and safety objectives are owned by different decision-makers (an efficiency DM and a safety DM), i.e., a biparty multiobjective optimization (BPMO) problem. Traditional approaches fold all objectives into a single decision-maker's multiobjective optimization problem (MOP), ignoring the reality that efficiency and safety departments decide independently. The key contribution is the first modeling and solution of such BPMO-UAVPP problems: three existing immune-inspired multiobjective algorithms (NNIA, HEIA, AIMA) are modified into biparty versions (BPNNIA, BPHEIA, BPAIMA), introducing nondominated neighbor-based selection and adaptive strategies so that the two parties' objectives reach a better balance under co-optimization. Experiments show BPAIMA outperforms NSGA-II as well as typical multiparty multiobjective algorithms (OptMPNDS, OptMPNDS2) in both convergence and diversity.
Link: https://arxiv.org/abs/2603.21544
Authors: Kesheng Chen,Wenjian Luo,Xin Lin,Zhen Song,Yatong Chang
Affiliation: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Unmanned aerial vehicles (UAVs) have been widely used in urban missions, and proper planning of UAV paths can improve mission efficiency while reducing the risk of potential third-party impact. Existing work has considered all efficiency and safety objectives for a single decision-maker (DM) and regarded this as a multiobjective optimization problem (MOP). However, there is usually not a single DM but two DMs, i.e., an efficiency DM and a safety DM, and the DMs are only concerned with their respective objectives. The final decision is made based on the solutions of both DMs. In this paper, for the first time, biparty multiobjective UAV path planning (BPMO-UAVPP) problems involving both efficiency and safety departments are modeled. The existing multiobjective immune algorithm with nondominated neighbor-based selection (NNIA), the hybrid evolutionary framework for the multiobjective immune algorithm (HEIA), and the adaptive immune-inspired multiobjective algorithm (AIMA) are modified for solving the BPMO-UAVPP problem, and then biparty multiobjective optimization algorithms, including the BPNNIA, BPHEIA, and BPAIMA, are proposed and comprehensively compared with traditional multiobjective evolutionary algorithms and typical multiparty multiobjective evolutionary algorithms (i.e., OptMPNDS and OptMPNDS2). The experimental results show that BPAIMA performs better than ordinary multiobjective evolutionary algorithms such as NSGA-II and multiparty multiobjective evolutionary algorithms such as OptMPNDS, OptMPNDS2, BPNNIA and BPHEIA.
[AI-45] Sharper Generalization Bounds for Transformer
【Quick Read】: This paper studies generalization error bounds for Transformer models, i.e., how performance on training data transfers to unseen data. The central challenge is obtaining sharper, architecture-dependent bounds for different designs (single-layer single-head, single-layer multi-head, and multi-layer Transformers). The key is to express the excess risk via the offset Rademacher complexity and exploit its connection with the empirical covering numbers of the hypothesis space, yielding excess-risk bounds with optimal convergence rates up to constant factors; tighter covering-number bounds via matrix ranks and norms then give refined, architecture-dependent generalization analyses, which further extend to unbounded (sub-Gaussian) and heavy-tailed feature distributions, broadening applicability.
Link: https://arxiv.org/abs/2603.21541
Authors: Yawen Li,Tao Hu,Zhouhui Lian,Wan Tian,Yijie Peng,Huiming Zhang,Zhongyi Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper studies generalization error bounds for Transformer models. Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures, including single-layer single-head, single-layer multi-head, and multi-layer Transformers. We first express the excess risk of Transformers in terms of the offset Rademacher complexity. By exploiting its connection with the empirical covering numbers of the corresponding hypothesis spaces, we obtain excess risk bounds that achieve optimal convergence rates up to constant factors. We then derive refined excess risk bounds by upper bounding the covering numbers of Transformer hypothesis spaces using matrix ranks and matrix norms, leading to precise, architecture-dependent generalization bounds. Finally, we relax the boundedness assumption on feature mappings and extend our theoretical results to settings with unbounded (sub-Gaussian) features and heavy-tailed distributions.
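For orientation, the offset Rademacher complexity that such analyses build on is commonly defined (up to constants; the paper's exact normalization may differ) as a Rademacher average penalized by a negative quadratic "offset" term:

```latex
\mathfrak{R}^{\mathrm{off}}_n(\mathcal{F}; c)
  \;=\; \mathbb{E}_{\varepsilon}\,
  \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n}
  \Big[ \varepsilon_i\, f(x_i) \;-\; c\, f(x_i)^2 \Big],
\qquad \varepsilon_i \sim \mathrm{Uniform}\{-1, +1\}.
```

The quadratic offset shrinks the supremum relative to the ordinary Rademacher complexity, which is what enables the faster excess-risk rates obtained via covering-number bounds.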
[AI-46] LLM-Based Test Case Generation in DBMS through Monte Carlo Tree Search ICSE2026
【Quick Read】: This paper addresses two challenges for generative AI in database management system (DBMS) testing: lightweight large language models (LLMs) struggle to produce syntactically valid, semantically diverse queries for proprietary SQL dialects, and LLM-generated queries are often semantically similar, exercising only shallow execution paths so that coverage gains quickly plateau. The key to the proposed MIST framework is a two-stage design: Feature-Guided Error-Driven Test Case Synthetization builds a hierarchical feature tree and uses execution-error feedback to guide the LLM toward valid and diverse queries, while Monte Carlo Tree Search (MCTS)-based Test Case Mutation jointly optimizes seed-query selection and mutation-rule application under coverage feedback to explore deeper execution paths. Experiments show MIST improves line, function, and branch coverage by averages of 43.3%, 32.3%, and 46.4% over the baseline, substantially strengthening the effectiveness and depth of DBMS testing.
Link: https://arxiv.org/abs/2603.21530
Authors: Yujia Chen,Yingli Zhou,Fangyuan Zhang,Cuiyun Gao
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to ICSE 2026 Industry Challenge Track
Abstract:Database Management Systems (DBMSs) are fundamental infrastructure for modern data-driven applications, where thorough testing with high-quality SQL test cases is essential for ensuring system reliability. Traditional approaches such as fuzzing can be effective for specific DBMSs, but adapting them to different proprietary dialects requires substantial manual effort. Large Language Models (LLMs) present promising opportunities for automated SQL test generation, but face critical challenges in industrial environments. First, lightweight models are widely used in organizations due to security and privacy constraints, but they struggle to generate syntactically valid queries for proprietary SQL dialects. Second, LLM-generated queries are often semantically similar and exercise only shallow execution paths, thereby quickly reaching a coverage plateau. To address these challenges, we propose MIST, an LLM-based test case generatIon framework for DBMS through Monte Carlo Tree search. MIST consists of two stages: Feature-Guided Error-Driven Test Case Synthetization, which constructs a hierarchical feature tree and uses error feedback to guide LLM generation, aiming to produce syntactically valid and semantically diverse queries for different DBMS dialects, and Monte Carlo Tree Search-Based Test Case Mutation, which jointly optimizes seed query selection and mutation rule application guided by coverage feedback, aiming at boosting code coverage by exploring deeper execution paths. Experiments on three widely-used DBMSs with four lightweight LLMs show that MIST achieves average improvements of 43.3% in line coverage, 32.3% in function coverage, and 46.4% in branch coverage compared to the baseline approach with the highest line coverage of 69.3% in the Optimizer module.
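MIST's second stage rests on MCTS-style selection over seed queries and mutation rules. The standard UCB1 rule that such searches typically use is sketched below (illustrative only; the paper's reward is coverage-based and its exact selection formula is not given here):

```python
import math

def ucb1(parent_visits, child_stats, c=1.4):
    """Pick the child index maximizing the UCB1 score, the classic MCTS
    selection rule balancing exploitation and exploration.
    child_stats: list of (visits, total_reward) per child."""
    def score(stats):
        visits, reward = stats
        if visits == 0:
            return float("inf")           # always try unexplored mutations first
        exploit = reward / visits
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore
    return max(range(len(child_stats)), key=lambda i: score(child_stats[i]))

best = ucb1(10, [(5, 3.0), (4, 3.5), (0, 0.0)])
# index 2 wins: unvisited children score infinity
```

In a coverage-guided setting, `total_reward` would accumulate the new lines or branches each mutated query exposed, so the search keeps drilling into mutation subtrees that still yield coverage.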
[AI-47] BOxCrete: A Bayesian Optimization Open-Source AI Model for Concrete Strength Forecasting and Mix Optimization KR
【Quick Read】: This paper addresses the growing complexity of modern concrete mix design, which must be optimized against multiple goals spanning mechanical performance, workability, durability, and sustainability. The key is BOxCrete, an open-source probabilistic modeling and optimization framework built on a new open-access dataset of over 500 strength measurements (1-15 ksi). Using Gaussian Process (GP) regression to predict compressive strength across curing ages (1, 3, 5, 14, and 28 days), it achieves high accuracy (average R^2 = 0.94, RMSE = 0.69 ksi), quantifies uncertainty, and supports multi-objective optimization (e.g., trading off compressive strength against embodied carbon), establishing a reproducible open-source foundation for AI-based concrete mix design.
Link: https://arxiv.org/abs/2603.21525
Authors: Bayezid Baten,M. Ayyan Iqbal,Sebastian Ament,Julius Kusuma,Nishant Garg
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code and dataset are available at this https URL
Abstract:Modern concrete must simultaneously satisfy evolving demands for mechanical performance, workability, durability, and sustainability, making mix designs increasingly complex. Recent studies leveraging Artificial Intelligence (AI) and Machine Learning (ML) models show promise for predicting compressive strength and guiding mix optimization, but most existing efforts are based on proprietary industrial datasets and closed-source implementations. Here we introduce BOxCrete, an open-source probabilistic modeling and optimization framework trained on a new open-access dataset of over 500 strength measurements (1-15 ksi) from 123 mixtures - 69 mortar and 54 concrete mixes tested at five curing ages (1, 3, 5, 14, and 28 days). BOxCrete leverages Gaussian Process (GP) regression to predict strength development, achieving average R^2 = 0.94 and RMSE = 0.69 ksi, quantify uncertainty, and carry out multi-objective optimization of compressive strength and embodied carbon. The dataset and model establish a reproducible open-source foundation for data-driven development of AI-based optimized mix designs.
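For prediction, the GP regression at the heart of frameworks like BOxCrete reduces to the posterior mean m(x*) = k_*^T (K + sigma^2 I)^{-1} y. A dependency-free toy version with an RBF kernel, using made-up age/strength numbers rather than the paper's dataset or hyperparameters:

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential (RBF) kernel on scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Tiny Gauss-Jordan elimination (illustrative, not production linalgebra)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_mean(train_x, train_y, query_x, noise=1e-6):
    """GP posterior mean m(x*) = k_*^T (K + noise*I)^{-1} y with an RBF kernel,
    the core of strength-vs-curing-age prediction (toy sketch)."""
    K = [[rbf(xi, xj) + (noise if i == j else 0.0)
          for j, xj in enumerate(train_x)] for i, xi in enumerate(train_x)]
    alpha = solve(K, train_y)
    return sum(a * rbf(x, query_x) for a, x in zip(alpha, train_x))

# Hypothetical curing ages (days) vs strength (ksi): with near-zero noise the
# GP interpolates its training points.
m = gp_mean([1.0, 3.0, 14.0, 28.0], [2.0, 4.0, 6.5, 7.2], 3.0)
# m is close to 4.0
```

With small noise the GP recovers training targets exactly and smoothly interpolates between ages; its predictive variance (not shown) is what enables the uncertainty quantification and Bayesian optimization described above.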
[AI-48] SafePilot: A Framework for Assuring LLM -enabled Cyber-Physical Systems
【Quick Read】: This paper addresses the safety and reliability risks that arise when large language models (LLMs) are embedded in cyber-physical systems (CPS): LLM "hallucinations" produce outputs that look plausible yet violate facts or task constraints, potentially triggering unsafe behavior. The key is SafePilot, a hierarchical neuro-symbolic framework that assures LLM-generated behavior through end-to-end verification against attribute-based and temporal specifications: a hierarchical planner first assesses task complexity to decide whether the LLM can plan directly; complex tasks are decomposed into sub-tasks that are planned individually and merged into a final solution. Meanwhile, LLM-generated plans are automatically translated into formal specifications and verified; when a violation is detected, the flaw is located, the prompt adjusted, and the LLM re-invoked iteratively until a compliant plan is produced or a preset limit is reached.
Link: https://arxiv.org/abs/2603.21523
Authors: Weizhe Xu,Mengyu Liu,Fanxin Kong
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures
Abstract:Large Language Models (LLMs), deep learning architectures with typically over 10 billion parameters, have recently begun to be integrated into various cyber-physical systems (CPS) such as robotics, industrial automation, and autopilot systems. The abstract knowledge and reasoning capabilities of LLMs are employed for tasks like planning and navigation. However, a significant challenge arises from the tendency of LLMs to produce “hallucinations” - outputs that are coherent yet factually incorrect or contextually unsuitable. This characteristic can lead to undesirable or unsafe actions in the CPS. Therefore, our research focuses on assuring the LLM-enabled CPS by enhancing their critical properties. We propose SafePilot, a novel hierarchical neuro-symbolic framework that provides end-to-end assurance for LLM-enabled CPS according to attribute-based and temporal specifications. Given a task and its specification, SafePilot first invokes a hierarchical planner with a discriminator that assesses task complexity. If the task is deemed manageable, it is passed directly to an LLM-based task planner with built-in verification. Otherwise, the hierarchical planner applies a divide-and-conquer strategy, decomposing the task into sub-tasks, each of which is individually planned and later merged into a final solution. The LLM-based task planner translates natural language constraints into formal specifications and verifies the LLM’s output against them. If violations are detected, it identifies the flaw, adjusts the prompt accordingly, and re-invokes the LLM. This iterative process continues until a valid plan is produced or a predefined limit is reached. Our framework supports LLM-enabled CPS with both attribute-based and temporal constraints. Its effectiveness and adaptability are demonstrated through two illustrative case studies.
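The generate-verify-refine loop described above can be sketched generically; every name and the toy "planner"/"verifier" below are hypothetical stand-ins, not SafePilot's API:

```python
def plan_with_verification(generate, verify, refine, task, max_tries=3):
    """Generate-verify-refine loop in the spirit of verified LLM planning:
    the plan is checked against a spec; on violation the prompt is adjusted
    and the generator re-invoked, up to a fixed budget (hedged sketch)."""
    prompt = task
    for _ in range(max_tries):
        plan = generate(prompt)
        violations = verify(plan)
        if not violations:
            return plan                       # verified plan: safe to execute
        prompt = refine(prompt, violations)   # fold the detected flaw back in
    return None                               # budget exhausted, no valid plan

# Toy stand-ins: the "planner" satisfies the constraint once it is mentioned.
gen = lambda p: "avoid-zoneA " + p if "zoneA" in p else p
ver = lambda plan: [] if plan.startswith("avoid-zoneA") else ["entered zoneA"]
ref = lambda p, v: p + " constraint:zoneA"
plan = plan_with_verification(gen, ver, ref, "goto charger")
# plan == "avoid-zoneA goto charger constraint:zoneA"
```

Returning `None` on budget exhaustion, rather than the last unverified plan, mirrors the safety stance: an unverifiable plan should never reach the physical system.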
[AI-49] Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation
【Quick Read】: This paper addresses a key limitation of failure management for increasingly complex and autonomous multi-agent systems (MASs): existing approaches rely on per-trace reasoning, which is inefficient, and neglect historical failure patterns, which limits diagnostic accuracy. The core of the proposed EAGER framework is unsupervised reasoning-scoped contrastive learning that encodes both intra-agent reasoning and inter-agent coordination, enabling real-time step-wise failure detection, diagnosis, and adaptive mitigation guided by historical failure knowledge, thereby improving the reliability and maintainability of MASs.
Link: https://arxiv.org/abs/2603.21522
Authors: Lingzhe Zhang,Tong Jia,Mingyu Wang,Weijie Hong,Chiming Duan,Minghua He,Rongqian Wang,Xi Peng,Meiling Wang,Gong Zhang,Renhai Chen,Ying Li
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by FSE’26-IVR
Abstract:Large Language Models (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities. As these systems become more complex and autonomous, effective failure management is essential to ensure reliability and availability. However, existing approaches often rely on per-trace reasoning, which leads to low efficiency, and neglect historical failure patterns, limiting diagnostic accuracy. In this paper, we conduct a preliminary empirical study to demonstrate the necessity, potential, and challenges of leveraging historical failure patterns to enhance failure management in MASs. Building on this insight, we propose EAGER, an efficient failure management framework for multi-agent systems based on reasoning trace representation. EAGER employs unsupervised reasoning-scoped contrastive learning to encode both intra-agent reasoning and inter-agent coordination, enabling real-time step-wise failure detection, diagnosis, and reflexive mitigation guided by historical failure knowledge. Preliminary evaluations on three open-source MASs demonstrate the effectiveness of EAGER and highlight promising directions for future research in reliable multi-agent system operations.
[AI-50] Quotient Geometry Effective Curvature and Implicit Bias in Simple Shallow Neural Networks
【Quick Read】: This paper addresses a pitfall of parameter redundancy in overparameterized shallow neural networks: geometric quantities computed in the ambient Euclidean parameter space may reflect artifacts of the parameterization rather than intrinsic properties of the predictor. The key is a differential-geometric framework built on the quotient space obtained by modding out parameter symmetries (hidden-unit permutations, rescalings, and related invariances), defining a natural metric and curvature on the quotient manifold; this yields a degeneracy-free, symmetry-reduced Hessian and shows that in gradient flow only the horizontal component of parameter motion drives the first-order evolution of the predictor, while the vertical component corresponds purely to gauge variation. Under this view, implicit bias is best understood as the complexity of predictor classes rather than a property of individual parameter representatives.
Link: https://arxiv.org/abs/2603.21502
Authors: Hang-Cheng Dong,Pengcheng Cheng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries. As a result, geometric quantities computed directly in the ambient Euclidean parameter space can reflect artifacts of representation rather than intrinsic properties of the predictor. In this paper, we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian capturing intrinsic local geometry. We then study gradient flows on the quotient and show that only the horizontal component of parameter motion contributes to first-order predictor evolution, while the vertical component corresponds purely to gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be assigned to predictor classes rather than to individual parameter representatives. Our experiments confirm that ambient flatness is representation-dependent, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes, implicit bias is most naturally described in quotient coordinates.
[AI-51] A Framework for Closed-Loop Robotic Assembly Alignment and Self-Recovery of Precision Optical Systems
【Quick Read】: This paper addresses the long-standing reliance on manual labor for building and maintaining high-precision optical systems in free-space optics, where strict spatial and angular tolerances and tightly coupled physical parameters make general automation particularly challenging. The key is a robotics framework integrating hierarchical computer vision systems, optimization routines, and custom-built tools that achieves fully autonomous construction, alignment, and self-recovery, demonstrated end-to-end by building a tabletop laser cavity from randomly distributed components, thereby establishing a foundation for closed-loop autonomy in highly sensitive optical systems.
Link: https://arxiv.org/abs/2603.21496
Authors: Seou Choi,Sachin Vaidya,Caio Silva,Shiekh Zia Uddin,Sajib Biswas Shuvo,Shrish Choudhary,Marin Soljačić
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optics (physics.optics)
Comments:
Abstract:Robotic automation has transformed scientific workflows in domains such as chemistry and materials science, yet free-space optics, which is a high precision domain, remains largely manual. Optical systems impose strict spatial and angular tolerances, and their performance is governed by tightly coupled physical parameters, making generalizable automation particularly challenging. In this work, we present a robotics framework for the autonomous construction, alignment, and maintenance of precision optical systems. Our approach integrates hierarchical computer vision systems, optimization routines, and custom-built tools to achieve this functionality. As a representative demonstration, we perform the fully autonomous construction of a tabletop laser cavity from randomly distributed components. The system performs several tasks such as laser beam centering, spatial alignment of multiple beams, resonator alignment, laser mode selection, and self-recovery from induced misalignment and disturbances. By achieving closed-loop autonomy for highly sensitive optical systems, this work establishes a foundation for autonomous optical experiments for applications across technical domains.
[AI-52] RuntimeSlicer: Towards Generalizable Unified Runtime State Representation for Failure Management
【Quick Read】: This paper addresses a key challenge of failure management for modern software systems at high scale and complexity: existing approaches rely on task-oriented pipelines that tightly couple modality-specific preprocessing, representation learning, and downstream models across metrics, traces, and logs, limiting generalization across tasks and systems. The core of the proposed RuntimeSlicer is a pre-trained, task-agnostic representation model that encodes the three modalities directly into a single aligned system-state embedding capturing the system's holistic runtime condition. It is trained with Unified Runtime Contrastive Learning, which integrates heterogeneous data sources and optimizes cross-modality alignment and temporal-consistency objectives; downstream tasks are then adapted via State-Aware Task-Oriented Tuning on the learned state embeddings, enabling lightweight task adaptation without redesigning modality-specific encoders or preprocessing pipelines.
Link: https://arxiv.org/abs/2603.21495
Authors: Lingzhe Zhang,Tong Jia,Weijie Hong,Mingyu Wang,Chiming Duan,Minghua He,Rongqian Wang,Xi Peng,Meiling Wang,Gong Zhang,Renhai Chen,Ying Li
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by FSE’26-IVR
Abstract:Modern software systems operate at unprecedented scale and complexity, where effective failure management is critical yet increasingly challenging. Metrics, traces, and logs provide complementary views of system runtime behavior, but existing failure management approaches typically rely on task-oriented pipelines that tightly couple modality-specific preprocessing, representation learning, and downstream models, resulting in limited generalization across tasks and systems. To fill this gap, we propose RuntimeSlicer, a unified runtime state representation model towards generalizable failure management. RuntimeSlicer pre-trains a task-agnostic representation model that directly encodes metrics, traces, and logs into a single, aligned system-state embedding capturing the holistic runtime condition of the system. To train RuntimeSlicer, we introduce Unified Runtime Contrastive Learning, which integrates heterogeneous training data sources and optimizes complementary objectives for cross-modality alignment and temporal consistency. Building upon the learned system-state embeddings, we further propose State-Aware Task-Oriented Tuning, which performs unsupervised partitioning of runtime states and enables state-conditioned adaptation for downstream tasks. This design allows lightweight task-oriented models to be trained on top of the unified embedding without redesigning modality-specific encoders or preprocessing pipelines. Preliminary experiments on the AIOps 2022 dataset demonstrate the feasibility and effectiveness of RuntimeSlicer for system state modeling and failure management tasks.
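Cross-modality alignment objectives of this kind are typically contrastive. An InfoNCE-style loss over precomputed similarities (a generic sketch; the paper's exact objective is not specified here) behaves as expected: well-aligned positive pairs get low loss, misaligned ones high loss.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE-style contrastive loss on precomputed similarity scores:
    minimize -log softmax probability of the positive (aligned-modality) pair
    against the negatives. Generic sketch, not RuntimeSlicer's code."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)                          # log-sum-exp, numerically stable
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

well_aligned = info_nce(0.9, [0.1, 0.0])     # positive pair clearly closest
poorly_aligned = info_nce(0.1, [0.9, 0.8])   # a negative is closer instead
# well_aligned < poorly_aligned
```

Minimizing this loss over (metric, trace, log) views of the same time window is one standard way to pull the modalities into a shared state embedding while pushing apart embeddings from unrelated windows.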
[AI-53] Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems
【Quick Read】: This paper addresses the limited performance of automatic multi-agent system (MAS) generation in knowledge-intensive domains such as healthcare and law. Existing frameworks either rely on static libraries of general nodes (e.g., Chain-of-Thought) that lack specialized expertise, or generate nodes on the fly, forcing the orchestrator to face its internal knowledge bottleneck while simultaneously constructing domain logic and optimizing high-level topology, a severe architectural coupling that degrades overall system efficacy. The key to the proposed Unified-MAS is decoupling fine-grained node implementation from topological orchestration via offline node synthesis: in the first stage, Search-Based Node Generation retrieves external open-world knowledge to synthesize specialized node blueprints, overcoming the LLM's internal knowledge limits; in the second stage, Reward-Based Node Optimization iteratively improves the internal logic of bottleneck nodes with a perplexity-guided reward. Experiments across four specialized domains show a consistently better performance-cost trade-off, with gains of up to 14.2%, and robustness across different designer LLMs.
Link: https://arxiv.org/abs/2603.21475
Authors: Hehai Lin,Yu Yan,Zixuan Wang,Bo Xu,Sudong Wang,Weiquan Huang,Ruochen Zhao,Minzhi Li,Chengwei Qin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL
Abstract:Automatic Multi-Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge-intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain-of-Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain-specific logic and optimize high-level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified-MAS that decouples granular node implementation from topological orchestration via offline node synthesis. Unified-MAS operates in two stages: (1) Search-Based Node Generation retrieves external open-world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward-Based Node Optimization utilizes a perplexity-guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified-MAS into four Automatic-MAS baselines yields a better performance-cost trade-off, achieving up to a 14.2% gain while significantly reducing costs. Further analysis reveals its robustness across different designer LLMs and its effectiveness on conventional tasks such as mathematical reasoning.
[AI-54] Safety as Computation: Certified Answer Reuse via Capability Closure in Task-Oriented Dialogue
【Quick Read】: This paper addresses redundant computation in task-oriented dialogue systems, which treat each turn independently and recompute answers via retrieval or generation even when they are already derivable from prior state. The key is to treat safety certification as a computational primitive: in capability-based systems, the certification step computes a fixed-point closure cl(At) that already contains every answer reachable from the current configuration. The proposed Certified Answer Store (CAS) with Pre-Answer Blocks (PAB) materializes, at each certified turn, all derivable follow-up answers together with minimal provenance witnesses, so that subsequent queries are answered in sub-millisecond time via formal containment checks, eliminating redundant retrieval and generation.
Link: https://arxiv.org/abs/2603.21448
Authors: Cosimo Spera
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce a new paradigm for task-oriented dialogue systems: safety certification as a computational primitive for answer reuse. Current systems treat each turn independently, recomputing answers via retrieval or generation even when they are already derivable from prior state. We show that in capability-based systems, the safety certification step computes a fixed-point closure cl(At) that already contains every answer reachable from the current configuration. We operationalize this insight with a Certified Answer Store (CAS) augmented by Pre-Answer Blocks (PAB): at each certified turn, the system materializes all derivable follow-up answers together with minimal provenance witnesses. Subsequent queries are answered in sub-millisecond time via formal containment checks, eliminating redundant retrieval and generation.
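The fixed-point closure cl(At) can be modeled as iterating derivation rules to saturation. The toy rule below is hypothetical, but the loop is the standard least-fixed-point computation that turns later queries into mere containment checks:

```python
def closure(facts, rules):
    """Fixed-point closure cl(A_t): repeatedly apply derivation rules until no
    new answer appears (illustrative model of the certification step).
    rules: functions mapping the current fact set -> iterable of derived facts."""
    current = set(facts)
    while True:
        derived = {f for rule in rules for f in rule(current)}
        if derived <= current:
            return current                  # fixed point reached
        current |= derived

# Hypothetical rule: from capability "can:n" derive "can:n+1" up to a bound.
rule = lambda s: {f"can:{int(f.split(':')[1]) + 1}"
                  for f in s if f.startswith("can:") and int(f.split(":")[1]) < 3}
cl = closure({"can:1"}, [rule])
# cl == {"can:1", "can:2", "can:3"}; answering a later query is then
# just a set-membership test such as "can:3" in cl
```

Because the rules only add facts, the iteration is monotone and terminates at the least fixed point; materializing `cl` once per certified turn is what makes sub-millisecond follow-up answers possible.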
[AI-55] LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study
【Quick Read】: This paper addresses the inefficiency of collaboration in multidisciplinary software development (MSD), where domain experts and developers work across incompatible formalisms and separate artifact sets. Even with generative AI coding assistants such as GitHub Copilot, the workflow connecting domain knowledge to implementation still depends on extensive manual coordination, causing repeated communication, clarification loops, and error-prone handoffs. The key is a graph-based workflow optimization approach that progressively replaces manual coordination with large language model (LLM)-powered services, enabling incremental adoption without disrupting established development practices. Evaluated on spapi, a production in-vehicle API system at Volvo Group with 192 endpoints, 420 properties, and 776 CAN signals, the automated workflow achieves a 93.7% F1 score, cuts per-API development time from about 5 hours to under 7 minutes, saves an estimated 979 engineering hours, and earns high satisfaction from both domain experts and developers.
Link: https://arxiv.org/abs/2603.21439
Authors: Shuai Wang,Yinan Yu,Earl Barr,Dhasarathy Parthasarathy
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to FSE 2026 Industrial Track
Abstract:Multidisciplinary Software Development (MSD) requires domain experts and developers to collaborate across incompatible formalisms and separate artifact sets. Today, even with AI coding assistants like GitHub Copilot, this process remains inefficient; individual coding tasks are semi-automated, but the workflow connecting domain knowledge to implementation is not. Developers and experts still lack a shared view, resulting in repeated coordination, clarification rounds, and error-prone handoffs. We address this gap through a graph-based workflow optimization approach that progressively replaces manual coordination with LLM-powered services, enabling incremental adoption without disrupting established practices. We evaluate our approach on spapi, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains. The automated workflow achieves 93.7% F1 score while reducing per-API development time from approximately 5 hours to under 7 minutes, saving an estimated 979 engineering hours. In production, the system received high satisfaction from both domain experts and developers, with all participants reporting full satisfaction with communication efficiency.
[AI-56] Behavioural feasible set: Value alignment constraints on AI decision support
【Quick Read】: This paper addresses the governance puzzle organisations face when adopting commercial AI systems for decision support: the value judgements embedded by vendors are neither transparent nor renegotiable, so organisations cannot adjust recommendations to their own requirements. The key is the notion of a behavioural feasible set, the range of recommendations reachable under vendor-imposed alignment constraints, together with diagnostic thresholds for when organisational requirements exceed the system's flexibility. The study shows that alignment materially compresses this set, making models substantially less able to shift recommendations even under legitimate contextual pressure; in multi-stakeholder tasks, alignment does not neutralise priorities but embeds value orientations set upstream by the vendor, revealing a governance problem that better prompting cannot resolve: choosing a vendor partially determines which trade-offs remain negotiable and which stakeholder priorities are structurally fixed.
Link: https://arxiv.org/abs/2603.21435
Authors: Taejin Park
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:
Abstract:When organisations adopt commercial AI systems for decision support, they inherit value judgements embedded by vendors that are neither transparent nor renegotiable. The governance puzzle is not whether AI can support decisions but which recommendations the system can actually produce given how its vendor has configured it. I formalise this as a behavioural feasible set, the range of recommendations reachable under vendor-imposed alignment constraints, and characterise diagnostic thresholds for when organisational requirements exceed the system’s flexibility. In scenario-based experiments using binary decision scenarios and multi-stakeholder ranking tasks, I show that alignment materially compresses this set. Comparing pre- and post-alignment variants of an open-weight model isolates the mechanism: alignment makes the system substantially less able to shift its recommendation even under legitimate contextual pressure. Leading commercial models exhibit comparable or greater rigidity. In multi-stakeholder tasks, alignment shifts implied stakeholder priorities rather than neutralising them, meaning organisations adopt embedded value orientations set upstream by the vendor. Organisations thus face a governance problem that better prompting cannot resolve: selecting a vendor partially determines which trade-offs remain negotiable and which stakeholder priorities are structurally embedded.
[AI-57] DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation AAMAS2026
【Quick Read】: This paper addresses the low success rate of large language models (LLMs) for code generation in real-world software development: because generic LLMs are trained mainly on public corpora, their training data rarely covers the highly specialized requirements of practical applications, limiting their effectiveness in specific domains. The key is DomAgent, an autonomous coding agent whose core component, DomRetriever, emulates how humans learn domain knowledge by combining structured reasoning with targeted retrieval: it integrates top-down knowledge-graph reasoning for conceptual understanding with bottom-up case-based reasoning for concrete examples, enabling iterative knowledge retrieval and synthesis that ensures contextual relevance and broad task coverage. This lets LLMs adapt efficiently to specific domains and markedly improves code quality, validated on a data-science benchmark (DS-1000) and real-world truck software development tasks, where small open-source models close much of the gap to large proprietary ones.
Link: https://arxiv.org/abs/2603.21430
Authors: Shuai Wang,Dhasarathy Parthasarathy,Robert Feldt,Yinan Yu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted to AAMAS 2026 EA
Abstract:Large language models (LLMs) have shown impressive capabilities in code generation. However, because most LLMs are trained on public domain corpora, directly applying them to real-world software development often yields low success rates, as these scenarios frequently require domain-specific knowledge. In particular, domain-specific tasks usually demand highly specialized solutions, which are often underrepresented or entirely absent in the training data of generic LLMs. To address this challenge, we propose DomAgent, an autonomous coding agent that bridges this gap by enabling LLMs to generate domain-adapted code through structured reasoning and targeted retrieval. A core component of DomAgent is DomRetriever, a novel retrieval module that emulates how humans learn domain-specific knowledge, by combining conceptual understanding with experiential examples. It dynamically integrates top-down knowledge-graph reasoning with bottom-up case-based reasoning, enabling iterative retrieval and synthesis of structured knowledge and representative cases to ensure contextual relevance and broad task coverage. DomRetriever can operate as part of DomAgent or independently with any LLM for flexible domain adaptation. We evaluate DomAgent on an open benchmark dataset in the data science domain (DS-1000) and further apply it to real-world truck software development tasks. Experimental results show that DomAgent significantly enhances domain-specific code generation, enabling small open-source models to close much of the performance gap with large proprietary LLMs in complex, real-world applications. The code is available at: this https URL.
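摘要中 DomRetriever "自上而下的知识图谱推理 + 自下而上的案例检索"这一思路,可以用下面的玩具草图示意(纯属依摘要推测的示意:图谱结构、案例库与函数名均为笔者假设,并非论文原始实现):

```python
# Top-down: expand related concepts by walking a small knowledge graph.
# Bottom-up: fetch concrete code cases tagged with the expanded concepts.
KG = {"signal_processing": ["filtering", "fft"],
      "filtering": ["kalman"], "fft": [], "kalman": []}
CASES = {"kalman": "x = kf.update(z)", "fft": "X = np.fft.fft(x)"}

def retrieve(concept, depth=2):
    """Combine KG expansion (conceptual) with case lookup (experiential)."""
    concepts, frontier = {concept}, [concept]
    for _ in range(depth):                      # top-down graph reasoning
        frontier = [n for c in frontier for n in KG[c]]
        concepts.update(frontier)
    cases = {c: CASES[c] for c in concepts if c in CASES}  # bottom-up cases
    return concepts, cases

concepts, cases = retrieve("signal_processing")
print(sorted(cases))  # ['fft', 'kalman']
```

实际系统中这两路检索由 LLM 迭代驱动并与生成过程交替进行,此处仅演示"概念展开后再取案例"的组合方向。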
[AI-58] HyReach: Vision-Guided Hybrid Manipulator Reaching in Unseen Cluttered Environments
【速读】:该论文旨在解决机器人在非结构化、杂乱且未见过的环境中进行可靠物体抓取的问题,这类环境对传统刚性机械臂的适应性和安全性提出了严峻挑战。解决方案的关键在于提出了一种实时混合刚柔连续体机械臂系统,其核心创新包括:基于视觉的感知与三维场景重建相结合的形状感知运动规划方法,用于生成安全轨迹;以及一种基于学习的控制器,能够利用软段的柔性优势同时保持刚段的高精度控制,从而实现对任意目标位姿的精准到达。该系统无需针对特定环境重新训练,可直接泛化至新场景,在多种复杂杂乱设置下均实现了低于2 cm的稳定定位误差,验证了混合机械臂在非结构化环境中自适应、可靠作业的潜力。
链接: https://arxiv.org/abs/2603.21421
作者: Shivani Kamtikar,Kendall Koe,Justin Wasserman,Samhita Marri,Benjamin Walt,Naveen Kumar Uppalapati,Girish Krishnan,Girish Chowdhary
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, 5 tables
Abstract:As robotic systems increasingly operate in unstructured, cluttered, and previously unseen environments, there is a growing need for manipulators that combine compliance, adaptability, and precise control. This work presents a real-time hybrid rigid-soft continuum manipulator system designed for robust open-world object reaching in such challenging environments. The system integrates vision-based perception and 3D scene reconstruction with shape-aware motion planning to generate safe trajectories. A learning-based controller drives the hybrid arm to arbitrary target poses, leveraging the flexibility of the soft segment while maintaining the precision of the rigid segment. The system operates without environment-specific retraining, enabling direct generalization to new scenes. Extensive real-world experiments demonstrate consistent reaching performance with errors below 2 cm across diverse cluttered setups, highlighting the potential of hybrid manipulators for adaptive and reliable operation in unstructured environments.
[AI-59] Is the future of AI green? What can innovation diffusion models say about generative AI's environmental impact?
【速读】:该论文试图解决的问题是:生成式人工智能(Generative AI)被广泛预测将带来显著的环境负面影响,但这种预测往往忽略了技术创新扩散过程中产品性能优化和商业模式演进对环境影响的缓解作用。解决方案的关键在于运用经典的A-U(Abernathy-Utterback)创新扩散模型分析生成式AI生态系统的发展路径,识别不同商业模型下环境影响的演化趋势,从而指出:尽管生成式AI本身无法实现"绿色",其实际环境影响的程度取决于主导商业模型的选择,而非单一技术本身的固有属性。
链接: https://arxiv.org/abs/2603.21419
作者: Robert Viseur,Nicolas Jullien
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of generative artificial intelligence (GAI) has led to alarming predictions about its environmental impact. However, these predictions often overlook the fact that the diffusion of innovation is accompanied by the evolution of products and the optimization of their performance, primarily for economic reasons. This can also reduce their environmental impact. By analyzing the GAI ecosystem using the classic A-U innovation diffusion model, we can forecast this industry’s structure and how its environmental impact will evolve. While GAI will never be green, its impact may not be as problematic as is sometimes claimed. However, this depends on which business model becomes dominant.
[AI-60] Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)作为具备工具执行权限的自主代理部署时,其安全架构依赖于一个关键假设——模型错误在运行时可被检测到。然而,作者通过实证发现,这一假设在部分指令遵循模型中并不成立,导致潜在的“无声承诺失败”(silent commitment failure),即模型输出自信且流畅但完全错误,且无任何预警信号。解决方案的关键在于提出“可治理性”(governability)这一新概念,定义为模型错误在输出提交前可被检测并纠正的程度,并基于此构建了一个“检测与纠正矩阵”(Detection and Correction Matrix),将模型-任务组合划分为四类治理状态:可治理(Governable)、仅监控(Monitor Only)、盲目引导(Steer Blind)和不可治理(Ungovernable)。研究进一步表明,模型的可治理性与其基准准确率无关,且微调无法显著提升其治理能力,暗示该特性在预训练阶段已基本固定,从而为设计更可靠、可控的AI系统提供了新的评估框架与干预方向。
链接: https://arxiv.org/abs/2603.21415
作者: Gregory M. Ruddell
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 39 pages, 5 figures, 5 tables. Preprint. Submitted to NIST CAISI (Docket NIST-2025-0035, March 2026). Also available on Zenodo: this https URL
Abstract:As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability – the degree to which a model’s errors are detectable before output commitment and correctable once detected – and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
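摘要提出的"检测与纠正矩阵"按两条轴(错误在提交前是否可检测、检测到后是否可纠正)把模型-任务组合分为四种治理状态,可用一个极简函数示意(函数名与写法为笔者假设的草图,四个状态名取自摘要):

```python
# Classify a model-task combination by the paper's two governability axes:
# whether errors are detectable before output commitment, and whether they
# are correctable once detected.
def governance_regime(detectable: bool, correctable: bool) -> str:
    """Map the two axes to one of the four regimes named in the abstract."""
    if detectable and correctable:
        return "Governable"
    if detectable and not correctable:
        return "Monitor Only"   # failures are visible but cannot be fixed
    if not detectable and correctable:
        return "Steer Blind"    # corrections work, but with no warning signal
    return "Ungovernable"       # silent commitment failure

print(governance_regime(False, False))  # Ungovernable
```

摘要中"无声承诺失败"的模型即落在 detectable=False 的两个象限中。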
[AI-61] Fingerprinting Deep Neural Networks for Ownership Protection: An Analytical Approach
【速读】:该论文旨在解决基于对抗样本的模型指纹技术中一个关键难题:如何确定指纹与决策边界之间的距离,以同时满足鲁棒性(robustness)和唯一性(uniqueness)这两个核心属性,从而实现有效且可靠的模型所有权保护。现有方法依赖经验启发式策略,难以兼顾两者的平衡,可能导致指纹在面对模型修改攻击时失效或产生误判。解决方案的关键在于提出AnaFP——一种基于理论指导的指纹构造方案,其核心创新是将指纹生成建模为通过可调拉伸因子(stretch factor)控制指纹到决策边界的距离;并通过数学形式化严格界定该因子的上下界,建立鲁棒性和唯一性约束与指纹距离之间的理论联系,进而定义出可行区间,并结合有限替代模型池与分位数松弛策略实现高效计算,最终通过网格搜索确定最优拉伸因子,显著提升指纹在多种模型架构及篡改攻击下的验证效果。
链接: https://arxiv.org/abs/2603.21411
作者: Guang Yang,Ziye Geng,Yihang Chen,Changqing Luo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Adversarial-example-based fingerprinting approaches, which leverage the decision boundary characteristics of deep neural networks (DNNs) to craft fingerprints, have proven effective for model ownership protection. However, a fundamental challenge remains unresolved: how far a fingerprint should be placed from the decision boundary to simultaneously satisfy two essential properties, i.e., robustness and uniqueness, for effective and reliable ownership protection. Despite the importance of the fingerprint-to-boundary distance, existing works lack a theoretical solution and instead rely on empirical heuristics, which may violate either robustness or uniqueness properties. We propose AnaFP, an analytical fingerprinting scheme that constructs fingerprints under theoretical guidance. Specifically, we formulate fingerprint generation as controlling the fingerprint-to-boundary distance through a tunable stretch factor. To ensure both robustness and uniqueness, we mathematically formalize these properties that determine the lower and upper bounds of the stretch factor. These bounds jointly define an admissible interval within which the stretch factor must lie, thereby establishing a theoretical connection between the two constraints and the fingerprint-to-boundary distance. To enable practical fingerprint generation, we approximate the original (infinite) sets of pirated and independently trained models using two finite surrogate model pools and employ a quantile-based relaxation strategy to relax the derived bounds. Due to the circular dependency between the lower bound and the stretch factor, we apply grid search over the admissible interval to determine the most feasible stretch factor. Extensive experimental results show that AnaFP consistently outperforms prior methods, achieving effective ownership verification across diverse model architectures and model modification attacks. 
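摘要中"在可行区间内网格搜索拉伸因子"的流程可粗略示意如下(仅为示意性草图:lower/upper 边界与打分函数 score 均为笔者假设的输入,真实方法需由替代模型池和分位数松弛来估计鲁棒性与唯一性约束给出的上下界):

```python
import numpy as np

def admissible_stretch(lower, upper, score, num=100):
    """Grid-search the admissible interval [lower, upper] for the best
    stretch factor; return None when the interval is empty (the robustness
    and uniqueness constraints conflict)."""
    if lower > upper:
        return None
    grid = np.linspace(lower, upper, num)
    return float(grid[np.argmax([score(s) for s in grid])])

# Toy score preferring a mid-interval factor that balances both properties.
best = admissible_stretch(0.5, 2.0, score=lambda s: -(s - 1.2) ** 2)
print(round(best, 2))  # 1.2
```

论文中下界与拉伸因子存在循环依赖,因此才需要这种对整个可行区间的网格搜索,而非闭式求解。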
[AI-62] The Myhill-Nerode Theorem for Bounded Interaction: Canonical Abstractions via Agent-Bounded Indistinguishability
【速读】:该论文旨在解决有限能力观测者(capacity-limited observer)在部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中如何对环境状态进行等价划分的问题,即:对于一个具有有限记忆和计算能力的代理而言,哪些环境状态是无法区分的,从而可以被合并为同一等价类。解决方案的关键在于引入“探针族”(probe family)——一组有限状态控制器,通过它们诱导出闭合环路的Wasserstein伪度量(closed-loop Wasserstein pseudometric)来刻画观察历史之间的相似性,并构建一个“探针精确商”(probe-exact quotient),该商结构对所有探针都无法区分的历史进行合并。此商结构具有唯一性、最小性和规范性,类似于自动机理论中的Myhill-Nerode定理在受限交互场景下的推广。特别地,当使用时钟感知探针(clock-aware probes)时,该商结构恰好是仅依赖于观测与动作的目标函数的决策充分(decision-sufficient);对于隐状态奖励问题,则采用观察Lipschitz近似界进行分析。论文进一步提出一种可扩展的确定性平稳实验方案,通过小规模精确案例验证差距,并在更大规模上进行实证探索,最终在Tiger、GridWorld和RockSample任务中验证了理论命题与近似行为的有效性。
链接: https://arxiv.org/abs/2603.21399
作者: Anthony T. Nixon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 43 pages, 4 figures, 23 tables. Code: this https URL (v0.1.1)
Abstract:Any capacity-limited observer induces a canonical quotient on its environment: two situations that no bounded agent can distinguish are, for that agent, the same. We formalise this for finite POMDPs. A fixed probe family of finite-state controllers induces a closed-loop Wasserstein pseudometric on observation histories and a probe-exact quotient merging histories that no controller in the family can distinguish. The quotient is canonical, minimal, and unique-a bounded-interaction analogue of the Myhill-Nerode theorem. For clock-aware probes, it is exactly decision-sufficient for objectives that depend only on the agent’s observations and actions; for latent-state rewards, we use an observation-Lipschitz approximation bound. The main theorem object is the clock-aware quotient; scalable deterministic-stationary experiments study a tractable coarsening with gap measured on small exact cases and explored empirically at larger scale. We validate theorem-level claims on Tiger and GridWorld. We also report operational case studies on Tiger, GridWorld, and RockSample as exploratory diagnostics of approximation behavior and runtime, not as theorem-facing evidence when no exact cross-family certificate is available; heavier stress tests are archived in the appendix and artifact package.
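"探针商"(probe-exact quotient)的核心构造,即把所有探针都无法区分的观察历史合并为同一等价类,可以用一个玩具例子示意(探针与历史均为笔者假设的玩具数据,真实构造基于有限状态控制器与 Wasserstein 伪度量):

```python
# Toy probe-induced quotient: histories that every probe in the family
# maps to the same response are merged into one equivalence class, a
# bounded-interaction analogue of Myhill-Nerode equivalence.
def probe_quotient(histories, probes):
    classes = {}
    for h in histories:
        signature = tuple(p(h) for p in probes)  # all a bounded probe can "see"
        classes.setdefault(signature, []).append(h)
    return list(classes.values())

# Probes that only observe length parity and the final observation.
probes = [lambda h: len(h) % 2, lambda h: h[-1]]
print(probe_quotient(["ab", "cb", "abb", "b"], probes))
# [['ab', 'cb'], ['abb', 'b']]
```

对这组探针而言,"ab" 与 "cb" 不可区分,因此落入同一等价类;探针族越弱,商结构越粗。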
[AI-63] Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors
【速读】:该论文试图解决的问题是:如何理解大型语言模型(Large Language Models, LLMs)在战略环境中的高层次行为特征,尤其是在其作为自主决策者部署时缺乏有效分析工具的现状。解决方案的关键在于采用激活操纵(activation steering)方法,在博弈论场景中通过对比激活添加构建人格向量(persona vectors),以表征利他主义、宽恕和对他人的预期等特质;实验表明,这种操纵能系统性地改变模型的战略选择及其自然语言解释,但同时也揭示了策略与修辞之间可能存在的分离现象,从而为理解LLMs在复杂交互中的行为机制提供了可操作的工具。
链接: https://arxiv.org/abs/2603.21398
作者: Johnathan Sun,Andrew Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 8 pages, 6 figures
Abstract:Large language models (LLMs) are increasingly deployed as autonomous decision-makers in strategic settings, yet we have limited tools for understanding their high-level behavioral traits. We use activation steering methods in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others by contrastive activation addition. Evaluating on canonical games, we find that activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, we also observe that rhetoric and strategy can diverge under steering. In addition, vectors for self-behavior and expectations of others are partially distinct. Our results suggest that persona vectors offer a promising mechanistic handle on high-level traits in strategic environments.
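"对比激活添加"构造人格向量的基本做法可以用玩具数据示意:人格向量取"体现特质的提示"与"中性提示"两组隐状态的均值之差,推理时按系数加回残差流(以下数组为随机玩具数据,维度与系数均为假设,并非论文实验设置):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
trait_acts = rng.normal(1.0, 0.1, size=(32, d))    # activations on trait prompts
neutral_acts = rng.normal(0.0, 0.1, size=(32, d))  # activations on neutral prompts

# Contrastive activation addition: persona vector = mean difference.
persona_vec = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def steer(hidden, vec, alpha=2.0):
    """Add the scaled persona vector to a hidden state at inference."""
    return hidden + alpha * vec

h = np.zeros(d)
steered = steer(h, persona_vec)  # activations shifted toward the trait
```

论文的发现之一是:这样的干预会同时改变策略选择与语言解释,但二者可能出现分离。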
[AI-64] PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
【速读】:该论文旨在解决长时程智能体任务(long-horizon agentic tasks)后训练过程中计算效率与泛化能力之间的矛盾:监督微调(SFT)虽计算高效,但存在域外(out-of-domain, OOD)性能下降问题;而端到端强化学习(E2E RL)虽能保持OOD能力,却因大量在线策略回放(on-policy rollout)导致计算成本过高。解决方案的关键在于提出PivotRL框架,其核心机制包括两点:一是执行局部在线策略回放并筛选“枢纽点”(pivots)——即动作采样在结果上表现出高方差的中间步骤,从而聚焦于最具信息量的学习信号;二是采用功能等价动作的奖励机制而非严格匹配SFT数据中的动作字符串,以保留策略概率排序并增强自然梯度方向上的学习信号强度。这一设计使PivotRL在相同SFT数据下实现更高的域内准确率(+4.17%)和域外准确率(+10.04%),并在编码类智能体任务中达到接近E2E RL的性能,同时仅需其四分之一的回放轮次。
链接: https://arxiv.org/abs/2603.21383
作者: Junkeun Yi,Damon Mosk-Aoyama,Baihe Huang,Ritu Gala,Charles Wang,Sugam Dipak Devare,Khushi Bhardwaj,Abhibha Gupta,Oleksii Kuchaiev,Jiantao Jiao,Jian Zhang,Venkat Srinivasan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, 6 tables
Abstract:Post-training for long-horizon agentic tasks has a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it utilizes rewards for functional-equivalent actions rather than demanding strict string matching with the SFT data demonstration. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving policy probability ordering on actions unrelated to training tasks. In comparison to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy in non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves competitive accuracy with E2E RL with 4x fewer rollout turns. PivotRL is adopted by NVIDIA’s Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
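按摘要描述,"枢纽点"是动作采样在结果上呈现高方差的中间轮次。其筛选逻辑可示意如下(示意性草图:结果分数、方差阈值均为笔者假设,真实系统需对每个轮次执行局部在线策略回放来得到这些结果):

```python
import statistics

def find_pivots(outcomes_per_turn, threshold=0.05):
    """Return indices of turns whose sampled rollout outcomes have high
    variance: these are the informative 'pivot' turns worth training on."""
    return [t for t, outcomes in enumerate(outcomes_per_turn)
            if statistics.pvariance(outcomes) > threshold]

rollouts = [
    [1.0, 1.0, 1.0, 1.0],  # turn 0: outcome fixed regardless of action
    [1.0, 0.0, 1.0, 0.0],  # turn 1: outcome hinges on the action -> pivot
    [0.0, 0.0, 0.0, 0.0],  # turn 2: always fails, uninformative
]
print(find_pivots(rollouts))  # [1]
```

只在这类高方差轮次上施加学习信号,正是该方法比端到端 RL 节省回放轮次的来源。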
[AI-65] A transformer architecture alteration to incentivise externalised reasoning
【速读】:该论文旨在通过限制大型语言模型(Large Language Models, LLMs)的内部隐式计算来激励其外化推理(externalised reasoning):标准模型对每个 token 均执行完整深度的前向传播,这既造成冗余计算,也为模型利用内部激活进行非短视的隐式规划留出了空间。其解决方案的关键在于引入一种基于中间层早退(early-exit)机制的架构改进与后训练流程:在 Transformer 架构的中间层添加早退出口,训练模型在可预测下一个 token 时提前截断前向传播;经过校准阶段后,再利用强化学习激励模型在保持任务性能的前提下尽可能早地退出。针对小型推理模型的初步实验表明,模型能够学会自适应地跨 token 削减计算,仅为难以预测的 token 保留深层计算,从而最大限度压缩可用于隐式规划的多余算力。
链接: https://arxiv.org/abs/2603.21376
作者: Elizabeth Pavlova,Mariia Koroliuk,Karthik Viswanathan,Cameron Tice,Edward James Young,Puria Radmard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we incentivise the model to exit as early as possible while maintaining task performance using reinforcement learning. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computations across tokens. We predict that, applied at the right scale, our approach can minimise the amount of excess computation that reasoning models have at their disposal to perform non-myopic planning using their internal activations, reserving this only for difficult-to-predict tokens.
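中间层早退的控制流可以用一个玩具例子示意:每层之后由一个"退出头"估计下一 token 的可预测置信度,超过阈值即截断前向传播(以下的层与退出头均为笔者假设的玩具函数,并非真实 Transformer 实现):

```python
def forward_with_early_exit(x, layers, exit_heads, threshold=0.9):
    """Run layers in order; stop at the first layer whose exit head is
    confident enough that the next token is already predictable."""
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        x = layer(x)
        if head(x) >= threshold:  # token deemed predictable: truncate here
            return x, i
    return x, len(layers) - 1

layers = [lambda x: x + 1 for _ in range(6)]
# toy confidence that grows with depth
exit_heads = [lambda x, i=i: 0.2 * (i + 1) for i in range(6)]
out, exit_layer = forward_with_early_exit(0.0, layers, exit_heads)
print(exit_layer)  # 4
```

论文的后训练阶段即用强化学习把这一退出决策推向尽可能浅的层。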
[AI-66] The AI Scientific Community: Agentic Virtual Lab Swarms
【速读】:该论文旨在解决如何通过计算模型加速科学发现的问题,其核心挑战在于模拟真实科研社区的协作与演化机制。解决方案的关键在于构建一个由虚拟实验室(virtual labs)组成的代理群体(agentic swarm),每个个体代表一个完整的实验环境,通过群智能(swarm intelligence)特性实现去中心化的协同探索、平衡探索与利用的权衡,并激发涌现式集体行为,从而模拟科学共同体的动态演进过程。该框架设计了类引文投票机制、适应度函数量化科研成果、多样性保护策略及高效计算架构,以支持大规模虚拟实验室群体展现出类似现实世界科学社区的复杂行为。
链接: https://arxiv.org/abs/2603.21344
作者: Ulisses Braga-Neto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this short note we propose using agentic swarms of virtual labs as a model of an AI Science Community. In this paradigm, each particle in the swarm represents a complete virtual laboratory instance, enabling collective scientific exploration that mirrors real-world research communities. The framework leverages the inherent properties of swarm intelligence - decentralized coordination, balanced exploration-exploitation trade-offs, and emergent collective behavior - to simulate the behavior of a scientific community and potentially accelerate scientific discovery. We discuss architectural considerations, inter-laboratory communication and influence mechanisms including citation-analogous voting systems, fitness function design for quantifying scientific success, anticipated emergent behaviors, mechanisms for preventing lab dominance and preserving diversity, and computational efficiency strategies to enable large swarms exhibiting complex emergent behavior analogous to real-world scientific communities. A working instance of the AI Science Community is currently under development.
[AI-67] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在构建视觉-语言-动作模型(Vision-Language-Action Models, VLAs)时,因缺乏稳定且有效的具身推理能力而导致的低级动作执行性能不佳的问题。现有方法依赖于视觉问答类型的监督信号来增强具身推理,但常导致VLA性能不稳定,提升效果微弱甚至为负。解决方案的关键在于提出一种系统性的训练框架RoboAlign,其核心思想是通过零样本自然语言推理采样动作标记(action tokens),并利用强化学习(Reinforcement Learning, RL)对这一推理过程进行精炼,从而提升动作准确性。该方法有效弥合了MLLM中语言与底层动作之间的模态鸿沟,并促进知识从MLLM向VLA的迁移,实验证明仅用不到1%的数据进行RL对齐即可显著提升VLA性能。
链接: https://arxiv.org/abs/2603.21341
作者: Dongyoung Kim,Sumin Park,Woomin Song,Seungku Kim,Taeyoung Kim,Huiwon Jang,Jinwoo Shin,Jaehyung Kim,Younggyo Seo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures, 9 Tables
Abstract:Improving embodied reasoning in multimodal-large-language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework RoboAlign that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refines this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitate knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
[AI-68] ARYA: A Physics-Constrained Composable Deterministic World Model Architecture
【速读】:该论文旨在解决当前世界模型(world model)在复杂系统中面临的"能力与计算效率难以兼顾"的核心矛盾,以及在高自主性场景下如何确保人类控制权不被削弱的架构性难题。解决方案的关键在于提出ARYA架构:一种基于五项基础原则(纳米模型、可组合性、因果推理、确定性和架构级AI安全)的分层系统之系统之系统(system-of-system-of-systems)结构,通过由AARA(ARYA自主研究代理)驱动的持续"感知-决策-行动-学习"循环实现动态建模与规划;其中最核心的创新是"不可解除的安全内核"(Unfireable Safety Kernel),即一个任何组件(包括其自我改进引擎)在技术上都无法禁用或绕过的架构级安全边界,确保即使在系统自我进化过程中也始终维持人类对关键操作的控制权。该架构在零神经网络参数的前提下,于横跨航空航天、制药等七个行业领域节点的评测中,在9项对比基准中的6项上取得了领先表现。
链接: https://arxiv.org/abs/2603.21340
作者: Seth Dobrin,Lukasz Chmiel
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:This paper presents ARYA, a composable, physics-constrained, deterministic world model architecture built on five foundational principles: nano models, composability, causal reasoning, determinism, and architectural AI safety. We demonstrate that ARYA satisfies all canonical world model requirements, including state representation, dynamic prediction, causal and physical awareness, temporal consistency, generalization, learnability, and planning and control. Unlike monolithic foundation models, the ARYA foundation model implements these capabilities through a hierarchical system-of-system-of-systems of specialized nano models, orchestrated by AARA (ARYA Autonomous Research Agent), an always-on cognitive daemon that executes a continuous sense-decide-act-learn loop. The nano model architecture provides linear scaling, sparse activation, selective untraining, and sub-20-second training cycles, resolving the traditional tension between capability and computational efficiency. A central contribution is the Unfireable Safety Kernel: an architecturally immutable safety boundary that cannot be disabled or circumvented by any system component, including its own self-improvement engine. This is not a social or ethical alignment statement; it is a technical framework ensuring human control persists as autonomy increases. Safety is an architectural constraint governing every operation, not a policy layer applied after the fact. We present formal alignment between ARYA’s architecture and canonical world model requirements, and report summarizing its state-of-the-art performance across 6 of 9 competitive benchmarks head-to-head with GPT-5.2, Opus 4.6, and V-JEPA-2. All with zero neural network parameters, across seven active industry domain nodes spanning aerospace, pharma manufacturing, oil and gas, smart cities, biotech, defense, and medical devices.
[AI-69] DeepXplain: XAI-Guided Autonomous Defense Against Multi-Stage APT Campaigns
【速读】:该论文旨在解决高级持续性威胁(Advanced Persistent Threats, APTs)防御中深度强化学习(Deep Reinforcement Learning, DRL)决策过程缺乏可解释性的问题,从而影响其在实际操作环境中的可信度。解决方案的关键在于提出 DeepXplain 框架,该框架首次将解释信号直接融入策略优化过程:通过基于溯源的图学习与时间阶段估计实现阶段感知建模,并构建统一的可解释人工智能(Explainable AI, XAI)管道,提供结构、时间和策略层面的解释;同时利用证据对齐和置信度感知奖励重塑机制,在训练阶段内嵌解释信号,而非依赖事后解释方法,显著提升了防御系统的有效性(如阶段加权 F1 分数从 0.887 提升至 0.915)与可信度(解释置信度达 0.86)。
链接: https://arxiv.org/abs/2603.21296
作者: Trung V. Phan,Thomas Bauschert
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper is currently under review for IEEE GLOBECOM 2026
Abstract:Advanced Persistent Threats (APTs) are stealthy, multi-stage attacks that require adaptive and timely defense. While deep reinforcement learning (DRL) enables autonomous cyber defense, its decisions are often opaque and difficult to trust in operational environments. This paper presents DeepXplain, an explainable DRL framework for stage-aware APT defense. Building on our prior DeepStage model, DeepXplain integrates provenance-based graph learning, temporal stage estimation, and a unified XAI pipeline that provides structural, temporal, and policy-level explanations. Unlike post-hoc methods, explanation signals are incorporated directly into policy optimization through evidence alignment and confidence-aware reward shaping. To the best of our knowledge, DeepXplain is the first framework to integrate explanation signals into reinforcement learning for APT defense. Experiments in a realistic enterprise testbed show improvements in stage-weighted F1-score (0.887 to 0.915) and success rate (84.7% to 89.6%), along with higher explanation confidence (0.86), improved fidelity (0.79), and more compact explanations (0.31). These results demonstrate enhanced effectiveness and trustworthiness of autonomous cyber defense.
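摘要所述"证据对齐与置信度感知的奖励重塑"可以粗略示意为在任务奖励之上叠加解释质量奖励项(系数与函数形式均为笔者假设的示意,并非论文的具体公式):

```python
def shaped_reward(task_reward, evidence_alignment, confidence, lam=0.5):
    """Blend the environment reward with an explanation-quality bonus:
    the bonus is large only when the explanation is both confident and
    aligned with the provenance evidence."""
    return task_reward + lam * confidence * evidence_alignment

print(shaped_reward(1.0, 0.8, 0.9))  # 1.36
```

与事后解释方法不同,这一奖励项在训练阶段就进入策略优化,使解释信号直接参与防御策略的学习。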
[AI-70] Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation
【速读】:该论文旨在解决生成式音乐(Symbolic Music Generation, SMG)中长短期依赖建模能力不足的问题,具体聚焦于LSTM与Transformer在局部旋律连续性与全局结构一致性上的差异尚未被系统研究的空白。其解决方案的关键在于通过细粒度对比分析发现:LSTM擅长捕捉局部模式但难以维持长程依赖,而Transformer能有效建模全局结构却易产生不规则乐句;据此提出一种混合架构——将Transformer编码器用于提取全局结构特征,结合LSTM解码器以增强局部连续性,从而协同发挥两者优势。实验表明,该混合方法在17项音乐质量指标上均优于单一模型,并通过消融研究和人类感知评估进一步验证了其有效性。
链接: https://arxiv.org/abs/2603.21282
作者: Soudeep Ghoshal,Sandipan Chakraborty,Pradipto Chowdhury,Himanshu Buckchash
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 20 pages, 6 figures. Published in Expert Systems with Applications (Elsevier), 2026. DOI: this https URL
Abstract:Machine learning techniques, such as Transformers and Long Short-Term Memory (LSTM) networks, play a crucial role in Symbolic Music Generation (SMG). Existing literature indicates a difference between LSTMs and Transformers regarding their ability to model local melodic continuity versus maintaining global structural coherence. However, their specific properties within the context of SMG have not been systematically studied. This paper addresses this gap by providing a fine-grained comparative analysis of LSTMs versus Transformers for SMG, examining local and global properties in detail using 17 musical quality metrics on the Deutschl dataset. We find that LSTM networks excel at capturing local patterns but fail to preserve long-range dependencies, while Transformers model global structure effectively but tend to produce irregular phrasing. Based on this analysis and leveraging their respective strengths, we propose a Hybrid architecture combining a Transformer Encoder with an LSTM Decoder and evaluate it against both baselines. We evaluated 1,000 generated melodies from each of the three architectures on the Deutschl dataset. The results show that the hybrid method achieves better local and global continuity and coherence compared to the baselines. Our work highlights the key characteristics of these models and demonstrates how their properties can be leveraged to design superior models. We also supported the experiments with ablation studies and human perceptual evaluations, which statistically support the findings and provide robust validation for this work.
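"Transformer 编码器 + LSTM 解码器"的混合结构可以用 PyTorch 写成一个最小草图(维度、层数与词表大小均为笔者假设,并非论文实验配置):

```python
import torch
import torch.nn as nn

class HybridMusicModel(nn.Module):
    """Transformer encoder for global structure, LSTM decoder for local
    melodic continuity, as sketched in the abstract."""
    def __init__(self, vocab=128, d_model=32, nhead=4, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.LSTM(d_model, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        ctx = self.encoder(self.embed(tokens))   # global structure
        h, _ = self.decoder(ctx)                 # local continuity
        return self.out(h)                       # next-token logits

model = HybridMusicModel()
logits = model(torch.randint(0, 128, (2, 16)))  # 2 melodies, 16 steps each
print(logits.shape)  # torch.Size([2, 16, 128])
```

这种组合的设计动机正如上文分析:让擅长全局结构的注意力与擅长局部延续的循环记忆各司其职。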
[AI-71] WARBENCH: A Comprehensive Benchmark for Evaluating LLMs in Military Decision-Making
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在军事安全关键场景中部署时存在的结构性缺陷问题,这些问题导致现有基准测试系统性高估模型在真实战术环境中的能力。具体而言,现有评估框架忽视国际人道法(International Humanitarian Law, IHL)的严格法律约束、边缘计算资源限制、战场迷雾(fog of war)下的鲁棒性不足,以及显式推理机制的缺失。为应对这些漏洞,论文提出 WARBENCH 评估框架,其核心创新在于建立一个基础战术基准,并引入四个差异化压力测试维度:复杂地形与兵力不对称性、边缘优化模型的合规风险、量化压缩下的性能退化,以及显式推理对误判的抑制作用。实证结果表明,尽管先进闭源模型仍能保持功能合规,但小规模边缘模型存在高达70%的法律违规率,且4-bit量化引发灾难性性能下降;而显式推理机制则显著提升结构安全性,成为防范无意违规的关键保障。
链接: https://arxiv.org/abs/2603.21280
作者: Zongjie Li,Chaozheng Wang,Yuchong Xie,Pingchuan Ma,Shuai Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blindspots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints based on International Humanitarian Law (IHL), omit edge computing limitations, lack robustness testing for fog of war, and inadequately evaluate explicit reasoning. To address these vulnerabilities, we present WARBENCH, a comprehensive evaluation framework establishing a foundational tactical baseline alongside four distinct stress testing dimensions. Through a large scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex terrain and high force asymmetry. Second, while state of the art closed source models maintain functional compliance, edge-optimized small models expose extreme operational risks with legal violation rates approaching 70 percent. Furthermore, models experience catastrophic performance degradation under 4-bit quantization and systematic information loss. Conversely, explicit reasoning mechanisms serve as highly effective structural safeguards against inadvertent violations. Ultimately, these findings demonstrate that current models remain fundamentally unready for autonomous deployment in high stakes tactical environments.
[AI-72] Aggregation Alignment for Federated Learning with Mixture-of-Experts under Data Heterogeneity
【速读】:该论文旨在解决在非独立同分布(non-IID)联邦学习环境下,对基于专家混合(Mixture-of-Experts, MoE)架构的大语言模型(Large Language Models, LLMs)进行联合微调时所面临的两个核心挑战:一是客户端间数据分布差异导致本地门控机制偏好不同,直接参数聚合会生成“一刀切”的全局门控网络;二是相同索引的专家在不同客户端中发展出语义角色不一致,造成专家语义模糊和功能专一性下降。解决方案的关键在于提出 FedAlign-MoE 框架,通过联合强制路由一致性(routing consistency)与专家语义对齐(expert semantic alignment)来实现稳定且高效的联邦聚合:一方面利用一致性加权和分布正则化保持门控行为的一致性并保留本地差异化偏好,另一方面显式量化跨客户端相同索引专家的语义一致性,并选择语义对齐的客户端更新进行聚合,从而保障全局专家的功能稳定性与专业化能力。
链接: https://arxiv.org/abs/2603.21276
作者: Zihan Fang,Qianru Wang,Haonan An,Zheng Lin,Yiqin Deng,Xianhao Chen,Yuguang Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 14 figures
Abstract:Large language models (LLMs) increasingly adopt Mixture-of-Experts (MoE) architectures to scale model capacity while reducing computation. Fine-tuning these MoE-based LLMs often requires access to distributed and privacy-sensitive data, making centralized fine-tuning impractical. Federated learning (FL) therefore provides a paradigm to collaboratively fine-tune MoE-based LLMs, enabling each client to integrate diverse knowledge without compromising data privacy. However, the integration of MoE-based LLM fine-tuning into FL encounters two critical aggregation challenges due to inherent data heterogeneity across clients: (i) divergent local data distributions drive clients to develop distinct gating preference for localized expert selection, causing direct parameter aggregation to produce a ``one-size-fits-none’’ global gating network, and (ii) same-indexed experts develop disparate semantic roles across clients, leading to expert semantic blurring and the degradation of expert specialization. To address these challenges, we propose FedAlign-MoE, a federated aggregation alignment framework that jointly enforces routing consistency and expert semantic alignment. Specifically, FedAlign-MoE aggregates gating behaviors by aligning routing distributions through consistency weighting and optimizes local gating networks through distribution regularization, maintaining cross-client stability without overriding discriminative local preferences. Meanwhile, FedAlign-MoE explicitly quantifies semantic consistency among same-indexed experts across clients and selectively aggregates updates from semantically aligned clients, ensuring stable and specialized functional roles for global experts. Extensive experiments demonstrate that FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.
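摘要中"按路由分布一致性为客户端加权聚合门控行为"的思想可示意如下(权重与距离的具体形式为笔者假设的简化版本,FedAlign-MoE 的实际加权与正则化更复杂):

```python
import numpy as np

def aggregate_routing(client_dists):
    """Weight each client's expert-routing distribution by its closeness
    to the consensus, so outlier gating preferences are down-weighted."""
    dists = np.asarray(client_dists)          # shape: (clients, experts)
    consensus = dists.mean(axis=0)
    # similarity = negative total-variation distance to the consensus
    sim = -0.5 * np.abs(dists - consensus).sum(axis=1)
    w = np.exp(sim)
    w /= w.sum()                              # consistency weights
    agg = (w[:, None] * dists).sum(axis=0)
    return agg / agg.sum()

clients = [[0.6, 0.3, 0.1],
           [0.5, 0.4, 0.1],
           [0.1, 0.1, 0.8]]  # an outlier routing preference
agg = aggregate_routing(clients)
print(agg.argmax())  # 0: the aggregate follows the consistent majority
```

专家语义对齐则作用于另一维度:只有语义一致的同索引专家更新才参与聚合,此处未示意。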
[AI-73] Graph of States: Solving Abductive Tasks with Large Language Models
【Quick Read】: This paper addresses the shortcomings of Large Language Models (LLMs) on abductive reasoning tasks. Existing frameworks are designed mainly for static deductive tasks and lack structured state representations and explicit state control, making them prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping in complex reasoning scenarios. The key to the solution is Graph of States (GoS), a general-purpose neuro-symbolic framework that explicitly encodes logical dependencies with a causal graph and governs the valid state transitions of the reasoning process with a state machine, transforming unconstrained exploration into a convergent search guided by symbolic rules and substantially improving robustness and accuracy on complex abductive tasks.
Link: https://arxiv.org/abs/2603.21250
Authors: Yu Luo,Rongchen Gao,Lu Teng,Xidao Wen,Jiamin Jiang,Qingliang Zhang,Yongqian Sun,Shenglin Zhang,Jiasong Feng,Tong Liu,Wenjie Zhang,Dan Pei
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in a structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: this https URL.
[AI-74] ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
【Quick Read】: This paper targets the high latency and cost of Large Language Model (LLM) inference, which hinder deployment in latency-sensitive and resource-constrained scenarios. The core of the solution is ConsRoute, a framework that performs semantic-consistency-aware adaptive routing within a cloud-edge-device collaborative inference architecture. Its key innovations are: using a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing; reusing hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoding or inference overhead; and combining clustering with Bayesian optimization to learn cluster-specific routing thresholds that dynamically balance response quality, latency, and cost under heterogeneous query distributions. Experiments show that ConsRoute maintains near-cloud performance (≥95%) while reducing end-to-end latency and inference cost by nearly 40%, clearly outperforming existing routing baselines.
Link: https://arxiv.org/abs/2603.21237
Authors: Haoyu Qiao,Hao Zhang,Shanwen Mao,Siyao Cheng,Jie Liu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (≥95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
[AI-75] When Convenience Becomes Risk: A Semantic View of Under-Specification in Host-Acting Agents
【Quick Read】: This paper studies the semantic under-specification problem faced by host-acting agents when users provide only goal-oriented instructions: such instructions often leave process constraints, safety boundaries, persistence, and exposure unspecified, so the agent's completion of the missing execution semantics can produce risky behavior even when the stated goal is benign. The key contribution is a semantic threat model together with a taxonomy of semantic-induced risky completion patterns, validated through an OpenClaw-centered case study and execution-trace analysis; from these findings the authors derive defense design principles — making execution boundaries explicit and constraining risky completion — to ensure that agents translate goal-only instructions into trustworthy executable plans.
Link: https://arxiv.org/abs/2603.21231
Authors: Di Lu,Yongzhi Liao,Xutong Mu,Lele Zheng,Ke Cheng,Xuewen Dong,Yulong Shen,Jianfeng Ma
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Host-acting agents promise a convenient interaction model in which users specify goals and the system determines how to realize them. We argue that this convenience introduces a distinct security problem: semantic under-specification in goal specification. User instructions are typically goal-oriented, yet they often leave process constraints, safety boundaries, persistence, and exposure insufficiently specified. As a result, the agent must complete missing execution semantics before acting, and this completion can produce risky host-side plans even when the user-stated goal is benign. In this paper, we develop a semantic threat model, present a taxonomy of semantic-induced risky completion patterns, and study the phenomenon through an OpenClaw-centered case study and execution-trace analysis. We further derive defense design principles for making execution boundaries explicit and constraining risky completion. These findings suggest that securing host-acting agents requires governing not only which actions are allowed at execution time, but also how goal-only instructions are translated into executable plans.
[AI-76] Does AI Homogenize Student Thinking? A Multi-Dimensional Analysis of Structural Convergence in AI-Augmented Essays
【Quick Read】: This paper investigates the previously unknown impact of Generative AI-assisted writing on the structural diversity of student thinking, and in particular whether quality gains come at the cost of convergent thought patterns. Analyzing 6,875 essays across five conditions (Human-only, AI-only, and three Human+AI prompting strategies), the study provides the first empirical evidence of a Quality-Homogenization Tradeoff: AI substantially improves essay quality, but the structural diversity of cohesion drops by 70-78%, while perspective plurality actually increases. Convergence-target analysis further shows that AI-augmented essays are pulled toward AI structural patterns yet do not lie fully along the Human-AI axis, indicating partial replacement coexisting with the partial emergence of new structure. The key finding concerns prompt specificity: explicit, specific prompt design can reverse AI-induced homogenization into diversification on argument depth, demonstrating that homogenization is not an intrinsic property of AI but a function of interaction design.
Link: https://arxiv.org/abs/2603.21228
Authors: Keito Inoshita,Michiaki Omura,Tsukasa Yamanaka,Go Maeda,Kentaro Tsuji
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:While AI-assisted writing has been widely reported to improve essay quality, its impact on the structural diversity of student thinking remains unexplored. Analyzing 6,875 essays across five conditions (Human-only, AI-only, and three Human+AI prompt strategies), we provide the first empirical evidence of a Quality-Homogenization Tradeoff, in which substantial quality gains co-occur with significant homogenization. The effect is dimension-specific: cohesion architecture lost 70-78% of its variance, whereas perspective plurality was diversified. Convergence target analysis further revealed that AI-augmented essays were pulled toward AI structural patterns yet deviated significantly from the Human-AI axis, indicating simultaneous partial replacement and partial emergence. Crucially, prompt specificity reversed homogenization into diversification on argument depth, demonstrating that homogenization is not an intrinsic property of AI but a function of interaction design.
[AI-77] Is Monitoring Enough? Strategic Agent Selection For Stealthy Attack in Multi-Agent Discussions
【Quick Read】: This paper examines whether existing attacks on multi-agent discussions remain effective under continuous anomaly-detection monitoring. The key to the solution is a novel attack method tailored specifically to the discussion-monitored scenario, which conceals its attack patterns to evade detection mechanisms; experiments show that such attacks remain effective even under continuous monitoring, revealing that monitoring alone cannot eliminate adversarial risks in multi-agent systems.
Link: https://arxiv.org/abs/2603.21194
Authors: Qiuchi Xiang,Haoxuan Qu,Hossein Rahmani,Jun Liu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-agent discussions have been widely adopted, motivating growing efforts to develop attacks that expose their vulnerabilities. In this work, we study a practical yet largely unexplored attack scenario, the discussion-monitored scenario, where anomaly detectors continuously monitor inter-agent communications and block detected adversarial messages. Although existing attacks are effective without discussion monitoring, we show that they exhibit detectable patterns and largely fail under such monitoring constraints. But does this imply that monitoring alone is sufficient to secure multi-agent discussions? To answer this question, we develop a novel attack method explicitly tailored to the discussion-monitored scenario. Extensive experiments demonstrate that effective attacks remain possible even under continuous monitoring, indicating that monitoring alone does not eliminate adversarial risks.
[AI-78] LLM-based Automated Architecture View Generation: Where Are We Now?
【Quick Read】: This paper addresses the low efficiency and staleness of manually created architecture views: as system complexity grows, automated generation of high-quality architecture views from source code becomes a key challenge. The key to the solution is an empirical evaluation of Large Language Models (LLMs) and agentic approaches, using 3 LLMs with 3 prompting strategies and 2 agentic approaches on 340 open-source repositories to generate 4,137 views, validated with automated metrics plus human evaluation. The results show that while LLMs and agents can produce syntactically valid views, they consistently exhibit granularity mismatches — operating at the code level rather than at architectural abstractions — so human expertise is still required, positioning them as assistive tools rather than autonomous architects.
Link: https://arxiv.org/abs/2603.21178
Authors: Miryala Sathvika,Rudra Dhar,Karthik Vaidhyanathan
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Architecture views are essential for software architecture documentation, yet their manual creation is labor intensive and often leads to outdated artifacts. As systems grow in complexity, the automated generation of views from source code becomes increasingly valuable. Goal: We empirically evaluate the ability of LLMs and agentic approaches to generate architecture views from source code. Method: We analyze 340 open-source repositories across 13 experimental configurations using 3 LLMs with 3 prompting techniques and 2 agentic approaches, yielding 4,137 generated views. We evaluate the generated views by comparing them with the ground-truth using a combination of automated metrics complemented by human evaluations. Results: Prompting strategies offer marginal improvements. Few-shot prompting reduces clarity failures by 9.2% compared to zero-shot baselines. The custom agentic approach consistently outperforms the general-purpose agent, achieving the best clarity (22.6% failure rate) and level-of-detail success (50%). Conclusions: LLM and agentic approaches demonstrate capabilities in generating syntactically valid architecture views. However, they consistently exhibit granularity mismatches, operating at the code level rather than architectural abstractions. This suggests that there is still a need for human expertise, positioning LLMs and agents as assistive tools rather than autonomous architects.
[AI-79] Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts
【Quick Read】: This paper tackles the compute waste in reinforcement learning with verifiable rewards (RLVR) for improving Large Language Model (LLM) reasoning, where GRPO training is dominated by expensive rollouts and squanders compute on unusable prompts. The key to the solution is Prompt Replay, an overhead-free online data selection method: it reuses prompts only (not trajectories) to preserve on-policy optimization, maintaining a prompt buffer and prioritizing medium-difficulty prompts whose pass rate is close to 0.5 (half answers correct, half wrong) to maximize the advantage signal, thereby reducing zero-variance prompts, increasing the mean absolute advantage, and accelerating early accuracy gains. Reuse aggressiveness is controlled through cooldown steps and a maximum reuse count, balancing efficiency against the risk of overfitting.
Link: https://arxiv.org/abs/2603.21177
Authors: Andrei Baroian,Rutger Berger
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate of 0.5 (half answers correct, half wrong) to maximize the advantage and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness vs. risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage and shows faster initial accuracy gains. Yet, it plateaus and converges with the baseline, as a too-aggressive configuration was used. The method is most efficient when the rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, raising a warning signal for using it as a sole testbed for GRPO method research.
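The buffering policy the abstract describes — keep only medium-difficulty prompts and prefer those whose pass rate is closest to 0.5 — can be sketched as follows. This is a minimal illustration in plain Python; the class name, fields, and default values are our assumptions, not the paper's implementation.

```python
import heapq

class PromptReplayBuffer:
    """Minimal sketch of the prompt-reuse scheme from the abstract: buffer
    medium-difficulty prompts and prefer those whose rollout pass rate is
    closest to 0.5, where the GRPO advantage signal is largest. All names
    and parameters here are illustrative, not from the paper."""

    def __init__(self, max_reuse=3, cooldown=2):
        self.max_reuse = max_reuse  # retire a prompt after this many reuses
        self.cooldown = cooldown    # steps a prompt must rest between reuses
        self.entries = []

    def add(self, prompt, pass_rate, step):
        # zero-variance prompts (all rollouts right, or all wrong) give no signal
        if 0.0 < pass_rate < 1.0:
            self.entries.append({"prompt": prompt, "pass_rate": pass_rate,
                                 "uses": 0, "last_step": step})

    def sample(self, k, step):
        ready = [e for e in self.entries
                 if step - e["last_step"] >= self.cooldown
                 and e["uses"] < self.max_reuse]
        # smallest |pass_rate - 0.5| first, i.e. closest to half-correct
        picked = heapq.nsmallest(k, ready,
                                 key=lambda e: abs(e["pass_rate"] - 0.5))
        for e in picked:
            e["uses"] += 1
            e["last_step"] = step
        return [e["prompt"] for e in picked]
```

Per the abstract, a training batch would then mix the output of `sample(...)` with freshly drawn prompts, with `cooldown` and `max_reuse` bounding how aggressively prompts are recycled.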
[AI-80] Reward Sharpness-Aware Fine-Tuning for Diffusion Models CVPR26
【Quick Read】: This paper addresses reward hacking in reward-centric diffusion reinforcement learning (RDRL) caused by non-robust reward-model gradients, i.e., the reward score rising without a corresponding improvement in perceptual image quality. The key to the solution is exploiting gradients from a robustified reward model without retraining it, via two methods: using flattened reward-model gradients obtained through parameter perturbations of the diffusion model, and perturbing the generated samples to obtain more stable gradient signals. Each strategy independently alleviates reward hacking and improves robustness, and their joint use amplifies the benefits, yielding RSA-FT (Reward Sharpness-Aware Fine-Tuning), a simple, broadly compatible framework that consistently enhances the reliability of RDRL.
Link: https://arxiv.org/abs/2603.21175
Authors: Kwanyoung Kim,Byeongsu Sim
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Camera-ready version of CVPR 2026
Abstract:Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
[AI-81] Rethinking Plasticity in Deep Reinforcement Learning
【Quick Read】: This paper investigates plasticity loss in deep reinforcement learning (RL), where neural networks struggle to adapt to new objectives under environmental non-stationarity. The core of the solution is the Optimization-Centric Plasticity (OCP) hypothesis: plasticity loss arises because optimal parameter points from previous tasks become poor local optima on new tasks, trapping parameters in suboptimal states during task transitions and hindering subsequent learning. The key results are a theoretical proof that phenomena such as dormant neurons correspond to zero-gradient states, and the finding that plasticity loss is highly task-specific — even networks with high dormancy rates on one task can recover to the performance of randomly initialized networks when switched to a sufficiently different task, indicating that network capacity remains intact but is suppressed by the specific optimization landscape. The framework also explains why parameter constraints mitigate plasticity loss: they prevent parameters from becoming deeply entrenched in local optima.
Link: https://arxiv.org/abs/2603.21173
Authors: Zhiqiang He
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper investigates the fundamental mechanisms driving plasticity loss in deep reinforcement learning (RL), a critical challenge where neural networks lose their ability to adapt to non-stationary environments. While existing research often relies on descriptive metrics like dormant neurons or effective rank, these summaries fail to explain the underlying optimization dynamics. We propose the Optimization-Centric Plasticity (OCP) hypothesis, which posits that plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning. We theoretically establish the equivalence between neuron dormancy and zero-gradient states, demonstrating that the absence of gradient signals is the primary driver of dormancy. Our experiments reveal that plasticity loss is highly task-specific; notably, networks with high dormancy rates in one task can achieve performance parity with randomly initialized networks when switched to a significantly different task, suggesting that the network’s capacity remains intact but is inhibited by the specific optimization landscape. Furthermore, our hypothesis elucidates why parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima. Validated across diverse non-stationary scenarios, our findings provide a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains.
[AI-82] Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning ICAPS-2026
【Quick Read】: This paper addresses the scaling failure that arises when AlphaZero-style Monte Carlo Tree Search (MCTS) is used to enhance Large Language Model (LLM) reasoning: on GSM8K and Game24, accuracy drops as the search budget increases. The key to the solution is ReSCALE, which replaces Dirichlet noise with Gumbel sampling and PUCT selection with Sequential Halving, restoring monotonic improvement with search budget without changing the model or its training. Experiments show that ReSCALE clearly outperforms the baseline at high budgets, reaching 58.4% on GSM8K and 85.3% on Game24, and ablations confirm that Sequential Halving is the primary driver of the gains.
Link: https://arxiv.org/abs/2603.21162
Authors: Leonid Ugadiarov,Yuri Kuratov,Aleksandr Panov,Alexey Skrynnik
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The paper has been accepted to the ICAPS-2026 conference. 5 pages, 2 figures
Abstract:Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4% on GSM8K and 85.3% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.
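Sequential Halving, which the ablations identify as the primary driver, is a standard fixed-budget bandit routine: spend the simulation budget in rounds, eliminating the worse half of the candidates after each round. The sketch below is a generic illustration of that routine, not ReSCALE's actual tree-search code; `simulate` and the budget-splitting details are our assumptions.

```python
import math

def sequential_halving(actions, simulate, budget):
    """Generic Sequential Halving over a set of candidate actions.
    `simulate(a)` returns a (possibly stochastic) reward for action `a`;
    the total number of simulations is roughly bounded by `budget`."""
    survivors = list(actions)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    totals = {a: 0.0 for a in survivors}
    counts = {a: 0 for a in survivors}
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # spread this round's share of the budget evenly over survivors
        per_action = max(1, budget // (rounds * len(survivors)))
        for a in survivors:
            for _ in range(per_action):
                totals[a] += simulate(a)
                counts[a] += 1
        # keep the better half, ranked by empirical mean reward
        survivors.sort(key=lambda a: totals[a] / counts[a], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]
```

Because elimination is driven purely by relative rank under a fixed budget, the routine degrades gracefully as the budget grows — the property the paper leverages to restore monotonic scaling.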
[AI-83] Can LLMs Fool Graph Learning? Exploring Universal Adversarial Attacks on Text-Attributed Graphs WWW
【Quick Read】: This paper studies universal adversarial attacks on text-attributed graphs (TAGs) across model architectures (such as graph neural networks, GNNs, and pre-trained language models, PLMs), i.e., how to design a cross-architecture universal attack to assess the security of TAG models. The core challenges are that GNNs and PLMs perceive graph structure and textual semantics very differently, and that many PLMs are accessible only via APIs, restricting attacks to black-box settings. The key to the solution is the BadGraph framework, which deeply elicits Large Language Models' (LLMs') general understanding of graph knowledge to jointly perturb node topology and textual semantics; a novel target influencer retrieval module leverages graph priors to construct cross-modally aligned attack shortcuts, enabling efficient, interpretable, and stealthy LLM-based perturbation reasoning.
Link: https://arxiv.org/abs/2603.21155
Authors: Zihui Chen,Yuling Wang,Pengfei Jiao,Kai Wu,Xiao Wang,Xiang Ao,Dalin Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by TheWebConf (WWW) 2026
Abstract:Text-attributed graphs (TAGs) enhance graph learning by integrating rich textual semantics and topological context for each node. While boosting expressiveness, they also expose new vulnerabilities in graph learning through text-based adversarial surfaces. Recent advances leverage diverse backbones, such as graph neural networks (GNNs) and pre-trained language models (PLMs), to capture both structural and textual information in TAGs. This diversity raises a key question: How can we design universal adversarial attacks that generalize across architectures to assess the security of TAG models? The challenge arises from the stark contrast in how different backbones-GNNs and PLMs-perceive and encode graph patterns, coupled with the fact that many PLMs are only accessible via APIs, limiting attacks to black-box settings. To address this, we propose BadGraph, a novel attack framework that deeply elicits large language models (LLMs) understanding of general graph knowledge to jointly perturb both node topology and textual semantics. Specifically, we design a target influencer retrieval module that leverages graph priors to construct cross-modally aligned attack shortcuts, thereby enabling efficient LLM-based perturbation reasoning. Experiments show that BadGraph achieves universal and effective attacks across GNN- and LLM-based reasoners, with up to a 76.3% performance drop, while theoretical and empirical analyses confirm its stealthy yet interpretable nature.
[AI-84] Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains
【Quick Read】: This paper asks how to achieve efficient, unified formal verification in complex AI systems, ensuring correctness and robustness across multiple safety-critical domains (such as LLM-generated code, tool API safety, and smart contracts). Traditional approaches rely on manually defined rules or empirical testing, which cannot cover all potential errors, especially subtle defects emerging as systems evolve autonomously. The key to the solution is a unified framework (substrate-guard) that integrates the Z3 SMT solver behind a common API to verify six classes of outputs consistently. It achieves 100% classification accuracy on 181 test cases (zero false positives and zero false negatives), uncovers deep bugs that empirical testing would miss (such as an INT_MIN overflow in RISC-V assembly), and theoretically proves that unconstrained string parameters in tool APIs are formally unverifiable — highlighting formal verification as an emergent property of complex systems reasoning about their own safety.
Link: https://arxiv.org/abs/2603.21149
Authors: Octavian Untila
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 5 tables. Code: this https URL . Companion paper: this https URL
Abstract:An autonomous AI ecosystem (SUBSTRATE S3), generating product specifications without explicit instructions about formal methods, independently proposed the use of Z3 SMT solver across six distinct domains of AI safety: verification of LLM-generated code, tool API safety for AI agents, post-distillation reasoning correctness, CLI command validation, hardware assembly verification, and smart contract safety. These convergent discoveries, occurring across 8 products over 13 days with Jaccard similarity below 15% between variants, suggest that formal verification is not merely a useful technique for AI safety but an emergent property of any sufficiently complex system reasoning about its own safety. We propose a unified framework (substrate-guard) that applies Z3-based verification across all six output classes through a common API, and evaluate it on 181 test cases across five implemented domains, achieving 100% classification accuracy with zero false positives and zero false negatives. Our framework detected real bugs that empirical testing would miss, including an INT_MIN overflow in branchless RISC-V assembly and mathematically proved that unconstrained string parameters in tool APIs are formally unverifiable.
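The INT_MIN overflow the abstract mentions is a classic hazard: the branchless absolute-value idiom has no representable result for the most negative two's-complement value. As a loose illustration (not substrate-guard's actual Z3 encoding; the helper names are ours), the sketch below plays the solver's role by exhausting an 8-bit domain, where the sole counterexample is MIN_INT itself:

```python
WIDTH = 8
MIN_INT, MAX_INT = -(1 << (WIDTH - 1)), (1 << (WIDTH - 1)) - 1

def wrap(v):
    """Reduce an arbitrary integer to signed two's complement at WIDTH bits."""
    v &= (1 << WIDTH) - 1
    return v - (1 << WIDTH) if v > MAX_INT else v

def branchless_abs(x):
    """Classic branchless |x|: mask is all-ones iff x is negative."""
    mask = wrap(x >> (WIDTH - 1))
    return wrap(wrap(x ^ mask) - mask)

# exhaustive check over the whole 8-bit signed domain
failures = [x for x in range(MIN_INT, MAX_INT + 1) if branchless_abs(x) != abs(x)]
print(failures)  # prints [-128]
```

At 32 bits the same defect sits at -2147483648, where exhaustive testing is impractical — which is exactly the gap an SMT solver closes by reasoning over bit-vectors symbolically.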
[AI-85] NeSy-Edge: Neuro-Symbolic Trustworthy Self-Healing in the Computing Continuum
【Quick Read】: This paper addresses the difficulty of maintaining resilience for modern AI services in the computing continuum spanning edge and end devices, where large scale, strong heterogeneity, and complex cross-layer dependencies make fault tolerance hard to sustain; existing fault-management approaches are often too static, fragmented, or resource-heavy to achieve timely self-healing under noisy logs and constrained edge resources. The key to the solution is NeSy-Edge, a neuro-symbolic framework with an edge-first design: edge nodes handle local perception and reasoning, invoking a cloud model only at the final diagnosis stage; raw runtime logs are converted into structured event representations, a prior-constrained sparse symbolic causal graph is built, and causal evidence is fused with historical troubleshooting knowledge for root-cause analysis and recovery recommendation, achieving robust self-healing at low resource cost.
Link: https://arxiv.org/abs/2603.21145
Authors: Peihan Ye,Alfreds Lapkovskis,Alaa Saleh,Qiyang Zhang,Praveen Kumar Donta
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments:
Abstract:The computational demands of modern AI services are increasingly shifting execution beyond centralized clouds toward a computing continuum spanning edge and end devices. However, the scale, heterogeneity, and cross-layer dependencies of these environments make resilience difficult to maintain. Existing fault-management methods are often too static, fragmented, or heavy to support timely self-healing, especially under noisy logs and edge resource constraints. To address these limitations, this paper presents NeSy-Edge, a neuro-symbolic framework for trustworthy self-healing in the computing continuum. The framework follows an edge-first design, where a resource-constrained edge node performs local perception and reasoning, while a cloud model is invoked only at the final diagnosis stage. Specifically, NeSy-Edge converts raw runtime logs into structured event representations, builds a prior-constrained sparse symbolic causal graph, and integrates causal evidence with historical troubleshooting knowledge for root-cause analysis and recovery recommendation. We evaluate our work on representative Loghub datasets under multiple levels of semantic noise, considering parsing quality, causal reasoning, end-to-end diagnosis, and edge-side resource usage. The results show that NeSy-Edge remains robust even at the highest noise level, achieving up to 75% root-cause analysis accuracy and 65% end-to-end accuracy while operating within about 1500 MB of local memory.
[AI-86] ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation AAAI2026
【Quick Read】: This paper addresses the insufficient quality of synthetic multi-step reasoning data used to train Large Language Models (LLMs): existing methods typically filter only by final-answer correctness, neglecting the logical validity of intermediate reasoning steps. The core challenge is the lack of a reliable mechanism to verify each reasoning step in natural-language tasks with ambiguous or incomplete contexts. The key to the solution is the ORACLE framework, which combines the generative strengths of LLMs with the structured supervision of a symbolic reasoning engine: the LLM produces step-wise reasoning while the symbolic engine verifies the validity of every intermediate step; a unified prompting template elicits modular reasoning chains, enabling fine-grained step-level validation and the construction of high-quality multi-step reasoning data.
Link: https://arxiv.org/abs/2603.21140
Authors: Zhuojie Yang,Wentao Wan,Keze Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026
Abstract:Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities, while a key factor influencing the effectiveness of this paradigm is the quality of the generated multi-step reasoning data. To generate high-quality reasoning data, many recent methods generate synthetic reasoning paths and filter them based on final answer correctness, often overlooking flaws in intermediate reasoning steps. To enhance the verification of intermediate reasoning steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code-based validation is restricted to code or mathematical tasks, and reasoning engines require a well-structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts. In these tasks, synthetic data still lack reliable checks for verifying each reasoning step. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step-wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine-grained, step-level validation, facilitating the construction of high-quality multi-step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, our ORACLE consistently outperforms strong baselines on multiple models.
[AI-87] DMMRL: Disentangled Multi-Modal Representation Learning via Variational Autoencoders for Molecular Property Prediction
【Quick Read】: This paper addresses entangled representations in molecular property prediction: existing methods often conflate structural, chemical, and functional factors, limiting interpretability and transferability, and fail to exploit the complementarity of graph, sequence, and geometric information. The key to the solution is the DMMRL framework, whose core is using variational autoencoders to disentangle molecular representations into shared (structure-relevant) and private (modality-specific) latent spaces, isolating the features most informative for property prediction; orthogonality and alignment regularizations promote statistical independence and cross-modal consistency, while a gated attention fusion module adaptively integrates the shared representations to capture complex inter-modal dependencies.
Link: https://arxiv.org/abs/2603.21108
Authors: Long Xu,Junping Guo,Jianbo Zhao,Jianbo Lu,Yuzhong Peng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure
Abstract:Molecular property prediction constitutes a cornerstone of drug discovery and materials science, necessitating models capable of disentangling complex structure-property relationships across diverse molecular modalities. Existing approaches frequently exhibit entangled representations–conflating structural, chemical, and functional factors–thereby limiting interpretability and transferability. Furthermore, conventional methods inadequately exploit complementary information from graphs, sequences, and geometries, often relying on naive concatenation that neglects inter-modal dependencies. In this work, we propose DMMRL, which employs variational autoencoders to disentangle molecular representations into shared (structure-relevant) and private (modality-specific) latent spaces, enhancing both interpretability and predictive performance. The proposed variational disentanglement mechanism effectively isolates the most informative features for property prediction, while orthogonality and alignment regularizations promote statistical independence and cross-modal consistency. Additionally, a gated attention fusion module adaptively integrates shared representations, capturing complex inter-modal relationships. Experimental validation across seven benchmark datasets demonstrates DMMRL’s superior performance relative to state-of-the-art approaches. The code and data underlying this article are freely available at this https URL.
[AI-88] Harmful Visual Content Manipulation Matters in Misinformation Detection Under Multimedia Scenarios
【Quick Read】: This paper addresses the neglect of visual-content manipulation features and manipulation intent in Multimodal Misinformation Detection (MMD). Existing methods focus mainly on cross-modal semantic consistency but ignore both image manipulation itself and its underlying intent (harmful vs. harmless) as cues for judging misinformation. The key to the solution is a new weakly supervised framework, HAVC-M4D, which introduces two weakly supervised indicators: manipulation features, obtained via a supplementary image-manipulation-detection dataset to identify whether visual content has been tampered with, and intention features distinguishing harmful from harmless manipulation; both are modeled as positive and unlabeled learning problems, enabling discriminative MMD models without exact labels. Experiments show that the approach significantly and consistently enhances the performance of mainstream MMD methods.
Link: https://arxiv.org/abs/2603.21054
Authors: Bing Wang,Ximing Li,Changchun Li,Jinjin Chi,Tianze Li,Renchu Guan,Shengsheng Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:Nowadays, the widespread dissemination of misinformation across numerous social media platforms has led to severe negative effects on society. To address this challenge, the automatic detection of misinformation, particularly under multimedia scenarios, has gained significant attention from both academic and industrial communities, leading to the emergence of a research task known as Multimodal Misinformation Detection (MMD). Typically, current MMD approaches focus on capturing the semantic relationships and inconsistency between various modalities but often overlook certain critical indicators within multimodal content. Recent research has shown that manipulated features within visual content in social media articles serve as valuable clues for MMD. Meanwhile, we argue that the potential intentions behind the manipulation, e.g., harmful and harmless, also matter in MMD. Therefore, in this study, we aim to identify such multimodal misinformation by capturing two types of features: manipulation features, which represent if visual content has been manipulated, and intention features, which assess the nature of these manipulations, distinguishing between harmful and harmless intentions. Unfortunately, the manipulation and intention labels that supervise these features to be discriminative are unknown. To address this, we introduce two weakly supervised indicators as substitutes by incorporating supplementary datasets focused on image manipulation detection and framing two different classification tasks as positive and unlabeled learning issues. With this framework, we introduce an innovative MMD approach, titled Harmful Visual Content Manipulation Matters in MMD (HAVC-M4D). Comprehensive experiments conducted on four prevalent MMD datasets indicate that HAVC-M4D significantly and consistently enhances the performance of existing MMD methods.
[AI-89] KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph
【Quick Read】: This paper targets the reliability of fine-grained scene understanding and reasoning in autonomous driving, where existing perception pipelines and driving-oriented LLM approaches suffer from unreliable scene facts, severe hallucinations, opaque reasoning, and heavy dependence on task-specific training. The key to the solution is the KLDrive framework, built on two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a trustworthy scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning, substantially improving counting — the hardest factual reasoning task — and validating the synergy between reliable scene fact construction and explicit reasoning.
Link: https://arxiv.org/abs/2603.21029
Authors: Ye Tian,Jingyi Zhang,Zihao Wang,Xiaoyuan Ren,Xiaofan Yu,Onat Gungor,Tajana Rosing
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task-specific training. We present KLDrive, the first knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning. Experiments on two large-scale autonomous-driving QA benchmarks show that KLDrive outperforms prior state-of-the-art methods, achieving the best overall accuracy of 65.04% on NuScenes-QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.
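The idea of grounding answers in explicit scene facts rather than free-form LLM generation can be illustrated with a toy triple store; the entity names, predicates, and query below are invented for illustration and are not taken from the paper:

```python
# Toy scene knowledge graph in the spirit of KLDrive: perception outputs
# consolidated into (subject, predicate, object) triples, then queried
# deterministically for a counting question instead of letting the LLM guess.
TRIPLES = [
    ("car_1", "is_a", "car"), ("car_2", "is_a", "car"),
    ("ped_1", "is_a", "pedestrian"),
    ("car_1", "left_of", "ego"), ("car_2", "behind", "ego"),
]

def count_entities(kg, category):
    """Answer 'how many <category>?' directly from graph facts."""
    return sum(1 for s, p, o in kg if p == "is_a" and o == category)

print(count_entities(TRIPLES, "car"))  # 2
```

Because counting is resolved against explicit facts, the reasoning agent only has to pick the right graph operation, which is one way such a design can reduce hallucinated counts.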
[AI-90] A Framework for Low-Latency LLM-driven Multimodal Interaction on the Pepper Robot
Quick Read: This paper tackles two limitations of current large language model (LLM) applications in social robotics: first, platforms such as the Pepper robot typically rely on cascaded Speech-to-Text (STT)-LLM-Text-to-Speech (TTS) pipelines, which incur high interaction latency and discard paralinguistic information; second, most implementations fail to exploit the LLM's potential for multimodal perception and agentic control. The solution rests on two innovations: integrating end-to-end Speech-to-Speech (S2S) models, which preserve paralinguistic cues while substantially reducing latency and enabling adaptive intonation; and implementing extensive Function Calling so the LLM acts as an agentic planner, orchestrating robot actions (navigation, gaze control, tablet interaction) and fusing multimodal feedback such as vision, touch, and system state, yielding a more natural and responsive human-robot interaction experience.
Link: https://arxiv.org/abs/2603.21013
Authors: Erich Studerus, Vivienne Jia Zhong, Stephan Vonschallen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 4 pages, 2 figures. To appear in Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), Edinburgh, Scotland, March 2026
Abstract:Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)-LLM-Text-to-Speech (TTS) pipelines, resulting in high latency and the loss of paralinguistic information. Second, most implementations fail to fully leverage the LLM’s capabilities for multimodal perception and agentic control. We present an open-source Android framework for the Pepper robot that addresses these limitations through two key innovations. First, we integrate end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction while preserving paralinguistic cues and enabling adaptive intonation. Second, we implement extensive Function Calling capabilities that elevate the LLM to an agentic planner, orchestrating robot actions (navigation, gaze control, tablet interaction) and integrating diverse multimodal feedback (vision, touch, system state). The framework runs on the robot’s tablet but can also be built to run on regular Android smartphones or tablets, decoupling development from robot hardware. This work provides the HRI community with a practical, extensible platform for exploring advanced LLM-driven embodied interaction.
[AI-91] ALL-FEM: Agentic Large Language Models Fine-tuned for Finite Element Methods
Quick Read: This paper targets the inefficiency and error-proneness of traditional finite element (FE) analysis, where code implementation and result verification depend on expert knowledge, as well as the limitations of large language models (LLMs) for FE code generation: hallucination, lack of awareness of variational structure, and inability to close the loop from problem statement to verified solution. The key is ALL-FEM, which couples domain-specific fine-tuned LLMs with an agentic AI framework in a multi-agent workflow: LLMs from 3B to 120B parameters are fine-tuned on a corpus of 1000+ verified FEniCS scripts (500+ expert codes plus a retrieval-augmented, multi-LLM generation-and-filtering pipeline), while specialized agents formulate physical problems as partial differential equations (PDEs), generate and debug code, and visualize results. Across 39 benchmarks, the best fine-tuned model (GPT OSS 120B) reaches a code-level success rate of 71.79%, clearly outperforming a non-agentic deployment of GPT 5 Thinking, showing that relatively small fine-tuned LLMs under an agentic architecture can automate the full FE workflow and offering a scalable blueprint for autonomous simulation systems in computational science and engineering.
Link: https://arxiv.org/abs/2603.21011
Authors: Rushikesh Deotale, Adithya Srinivasan, Yuan Tian, Tianyi Zhang, Pavlos Vlachos, Hector Gomez
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA)
Comments:
Abstract:Finite element (FE) analysis guides the design and verification of nearly all manufactured objects. It is at the core of computational engineering, enabling simulation of complex physical systems, from fluids and solids to multiphysics systems. However, implementing FE codes and analyzing simulation results demands expertise across numerical analysis, continuum mechanics, and programming. Conventional Large Language Models (LLMs) can generate FE code, but they hallucinate, lack awareness of variational structures, and cannot close the loop from problem statement to a verified solution. Here, we propose ALL-FEM, an autonomous simulation system that integrates agentic AI with domain-specific, fine-tuned LLMs for FEniCS code generation across solid, fluid, and multiphysics applications. We construct a corpus of 1000+ verified FEniCS scripts by combining 500+ curated expert codes with a retrieval-augmented, multi-LLM pipeline that generates and filters codes for diverse PDEs, geometries, and boundary conditions. We used the corpus to fine-tune LLMs with 3B to 120B parameters. Our agentic framework orchestrates specialized agents, powered by fine-tuned LLMs, to formulate problems as PDEs, generate and debug code and visualize the results. We evaluated the system on 39 benchmarks that include problems of linear/nonlinear elasticity, plasticity, Newtonian/non-Newtonian flow, thermofluids, fluid-structure interaction, phase separation, and transport on moving domains. Embedded in a multi-agent workflow with runtime feedback, the best fine-tuned model (GPT OSS 120B) achieves code-level success of 71.79%, outperforming a non-agentic deployment of GPT 5 Thinking. By showing that relatively small, fine-tuned LLMs, orchestrated through agentic frameworks, can automate FE workflows, ALL-FEM offers a blueprint for autonomous simulation systems in computational science and engineering.
[AI-92] The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes AAMAS2026
Quick Read: This paper addresses the decision dilemma faced by automated assistants in shared autonomy when human instructions conflict with safety requirements: how to guarantee safety while sustaining effective human-robot collaboration. The core challenge is designing a mechanism that lets an AI "intelligently disobey" human instructions when necessary to prevent harm, while preserving human trust and task goals. The key contribution is the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower under asymmetric information. Theoretical analysis identifies strategic phenomena such as "safety traps", providing a mathematical foundation for algorithms that learn safe non-compliance; the IDG is further translated into a trainable multi-agent Markov decision process (MDP), forming a compact computational testbed for reinforcement learning.
Link: https://arxiv.org/abs/2603.20994
Authors: Benedikt Hornig, Reuth Mirsky
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Accepted for presentation at the Rebellion and Disobedience in AI (RaD-AI) at AAMAS 2026
Abstract:In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human’s instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human’s goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.
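The leader-follower structure of a Stackelberg game like the IDG can be sketched as a tiny 2x2 game solved by backward induction; all payoff numbers below are invented for illustration and are not from the paper:

```python
# Hypothetical 2x2 instance: the human (leader) issues an instruction,
# the assistant (follower) either complies or disobeys.
LEADER_PAYOFF = {
    ("safe", "comply"): 1.0, ("safe", "disobey"): -0.5,
    ("risky", "comply"): -2.0, ("risky", "disobey"): 0.5,
}
FOLLOWER_PAYOFF = {  # the assistant's utility penalizes harmful compliance
    ("safe", "comply"): 1.0, ("safe", "disobey"): -1.0,
    ("risky", "comply"): -3.0, ("risky", "disobey"): 0.2,
}

def solve_stackelberg(leader_actions, follower_actions):
    """Backward induction: the leader commits first, the follower
    best-responds, and the leader anticipates that response."""
    best = None
    for a in leader_actions:
        br = max(follower_actions, key=lambda b: FOLLOWER_PAYOFF[(a, b)])
        u = LEADER_PAYOFF[(a, br)]
        if best is None or u > best[2]:
            best = (a, br, u)
    return best

print(solve_stackelberg(["safe", "risky"], ["comply", "disobey"]))
# -> ('safe', 'comply', 1.0): anticipating disobedience makes risky orders unattractive
```

The same backward-induction skeleton extends to the multi-step scenarios the paper studies, where "safety trap" outcomes can emerge.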
[AI-93] Long-Term Outlier Prediction Through Outlier Score Modeling
Quick Read: This paper fills a gap in time-series outlier detection: long-term outlier prediction. Conventional methods focus on immediate detection and struggle to forecast outlier events far into the future. The key is a simple, unsupervised two-layer framework: the first layer performs standard outlier detection, and the second layer predicts future outlier scores from the temporal structure of previously observed outliers, extending pointwise detection to long-term forecasting of outlier likelihoods. The method is independent of any specific model, making it general and extensible.
Link: https://arxiv.org/abs/2603.20993
Authors: Yuma Aoki, Joon Park, Koh Takeuchi, Hisashi Kashima, Shinya Akimoto, Ryuichi Hashimoto, Takahiro Adachi, Takeshi Kishikawa, Takamitsu Sasaki
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures
Abstract:This study addresses an important gap in time series outlier detection by proposing a novel problem setting: long-term outlier prediction. Conventional methods primarily focus on immediate detection by identifying deviations from normal patterns. As a result, their applicability is limited when forecasting outlier events far into the future. To overcome this limitation, we propose a simple and unsupervised two-layer method that is independent of specific models. The first layer performs standard outlier detection, and the second layer predicts future outlier scores based on the temporal structure of previously observed outliers. This framework enables not only pointwise detection but also long-term forecasting of outlier likelihoods. Experiments on synthetic datasets show that the proposed method performs well in both detection and prediction tasks. These findings suggest that the method can serve as a strong baseline for future work in outlier detection and forecasting.
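The two-layer idea, detect outlier scores first and then forecast them, can be sketched with a rolling z-score detector and exponential smoothing as stand-ins for the paper's unspecified components; both choices below are illustrative assumptions:

```python
import statistics

def outlier_scores(series, window=5):
    """Layer 1: rolling z-score as a stand-in for any pointwise detector."""
    scores = []
    for i in range(len(series)):
        hist = series[max(0, i - window):i] or series[:1]
        mu = statistics.fmean(hist)
        sd = statistics.pstdev(hist) or 1.0  # guard against zero spread
        scores.append(abs(series[i] - mu) / sd)
    return scores

def forecast_scores(scores, horizon=3, alpha=0.5):
    """Layer 2: forecast future outlier scores from the temporal structure
    of past ones -- here simple exponential smoothing, for brevity."""
    level = scores[0]
    for s in scores[1:]:
        level = alpha * s + (1 - alpha) * level
    return [level] * horizon  # long-term outlier-likelihood forecast

series = [1.0, 1.1, 0.9, 1.0, 5.0, 1.0, 1.1, 0.95]
scores = outlier_scores(series)
print(max(scores) == scores[4])  # the spike at index 4 scores highest
```

Any detector that emits a score series can replace layer 1, and any forecaster can replace layer 2, which is the model-agnostic point of the framework.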
[AI-94] Can we automatize scientific discovery in the cognitive sciences?
Quick Read: This paper targets the inefficiency of the traditional discovery pipeline in the cognitive sciences, where researchers hand-design experimental paradigms and cognitive models and are constrained by their own background and intuition, leading to a narrow search space and slow iteration. The key is a fully automated computational discovery framework built on Large Language Models (LLMs) that implements every stage of the discovery cycle, including experimental paradigm generation, behavioral data simulation, model synthesis, and "interestingness" evaluation, as automatable computational steps, yielding a high-throughput, scalable engine for in silico science that substantially accelerates theory generation and validation.
Link: https://arxiv.org/abs/2603.20988
Authors: Akshay K. Jagadish, Milena Rmus, Kristin Witte, Marvin Mathony, Marcel Binz, Eric Schulz
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:The cognitive sciences aim to understand intelligence by formalizing underlying operations as computational models. Traditionally, this follows a cycle of discovery where researchers develop paradigms, collect data, and test predefined model classes. However, this manual pipeline is fundamentally constrained by the slow pace of human intervention and a search space limited by researchers’ background and intuition. Here, we propose a paradigm shift toward a fully automated, in silico science of the mind that implements every stage of the discovery cycle using Large Language Models (LLMs). In this framework, experimental paradigms exploring conceptually meaningful task structures are directly sampled from an LLM. High-fidelity behavioral data are then simulated using foundation models of cognition. The tedious step of handcrafting cognitive models is replaced by LLM-based program synthesis, which performs a high-throughput search over a vast landscape of algorithmic hypotheses. Finally, the discovery loop is closed by optimizing for "interestingness", a metric of conceptual yield evaluated by an LLM-critic. By enabling a fast and scalable approach to theory development, this automated loop functions as a high-throughput in-silico discovery engine, surfacing informative experiments and mechanisms for subsequent validation in real human populations.
[AI-95] AutoMOOSE: An Agentic AI for Autonomous Phase-Field Simulation
Quick Read: This paper addresses the high barrier to entry and low efficiency of multiphysics simulation frameworks such as MOOSE for phase-field materials modeling, where users need expertise to construct input files, coordinate parameter sweeps, diagnose runtime failures, and extract quantitative results. The core solution is AutoMOOSE, an open-source agentic framework that automates the full simulation lifecycle from a single natural-language prompt via a five-agent pipeline, in which the Input Writer coordinates six sub-agents and a Reviewer agent autonomously corrects runtime errors without human intervention. A modular plugin architecture supports new phase-field formulations, and a Model Context Protocol (MCP) server exposes structured tool interfaces for interoperability, significantly lowering usage complexity while improving simulation reliability and reproducibility.
Link: https://arxiv.org/abs/2603.20986
Authors: Sukriti Manna, Henry Chan, Subramanian K.R.S. Sankaranarayanan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Mesoscale and Nanoscale Physics (cond-mat.mes-hall)
Comments:
Abstract:Multiphysics simulation frameworks such as MOOSE provide rigorous engines for phase-field materials modeling, yet adoption is constrained by the expertise required to construct valid input files, coordinate parameter sweeps, diagnose failures, and extract quantitative results. We introduce AutoMOOSE, an open-source agentic framework that orchestrates the full simulation lifecycle from a single natural-language prompt. AutoMOOSE deploys a five-agent pipeline in which the Input Writer coordinates six sub-agents and the Reviewer autonomously corrects runtime failures without user intervention. A modular plugin architecture enables new phase-field formulations without modifying the core framework, and a Model Context Protocol (MCP) server exposes the workflow as ten structured tools for interoperability with any MCP-compatible client. Validated on a four-temperature copper grain growth benchmark, AutoMOOSE generates MOOSE input files with 6 of 12 structural blocks matching a human expert reference exactly and 4 functionally equivalent, executes all runs in parallel with a 1.8x speedup, and performs an end-to-end physical consistency check spanning intent, finite-element execution, and Arrhenius kinetics with no human verification. Grain coarsening kinetics are recovered with R^2 = 0.90-0.95 at T = 600 K; the recovered activation energy Q_fit = 0.296 eV is consistent with a human-written reference (Q_fit = 0.267 eV) under identical parameters. Three runtime failure classes were diagnosed and resolved autonomously within a single correction cycle, and every run produces a provenance record satisfying FAIR data principles. These results show that the gap between knowing the physics and executing a validated simulation campaign can be bridged by a lightweight multi-agent orchestration layer, providing a pathway toward AI-driven materials discovery and self-driving laboratories.
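The Input Writer / Reviewer correction cycle can be sketched as a retry loop; the stub functions below stand in for the LLM-backed agents and the MOOSE runtime and are invented for illustration:

```python
# Stub sketch of the writer/reviewer correction cycle; not real agents.
def write_input(spec, feedback=None):
    # a real Input Writer would emit a MOOSE input file, applying feedback
    return {"spec": spec, "patched": feedback is not None}

def run_simulation(inp):
    if not inp["patched"]:
        raise RuntimeError("solver diverged")  # toy runtime failure
    return {"status": "converged"}

def review(error):
    return f"fix for: {error}"  # a real Reviewer diagnoses the failure class

def simulate_with_review(spec, max_cycles=3):
    """Retry loop: regenerate the input with the Reviewer's feedback until
    the run succeeds or the cycle budget is exhausted."""
    feedback = None
    for _ in range(max_cycles):
        inp = write_input(spec, feedback)
        try:
            return run_simulation(inp)
        except RuntimeError as err:
            feedback = review(str(err))
    raise RuntimeError("unresolved after correction cycles")

print(simulate_with_review("Cu grain growth, T = 600 K"))  # {'status': 'converged'}
```

The paper reports that all three observed failure classes were resolved within a single such correction cycle; the loop structure above is the generic shape of that mechanism.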
[AI-96] From Causal Discovery to Dynamic Causal Inference in Neural Time Series
Quick Read: This paper addresses the limited applicability of dynamic causal inference when the prior causal network structure is unknown or uncertain, as is common in real scientific settings where causal structure evolves or is only indirectly observable. The key is a two-stage neural causal modeling framework, Dynamic Causal Network Autoregression (DCNAR): the first stage learns a sparse directed causal network from multivariate time series via a neural autoregressive model; the second stage uses this learned structure as a structural prior for a time-varying neural autoregression, enabling dynamic estimation of causal influence without a pre-specified network. This overcomes the static assumptions of traditional approaches and improves the stability and behavioral plausibility of causal inference.
Link: https://arxiv.org/abs/2603.20980
Authors: Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 14 pages, 4 figures
Abstract:Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that the underlying causal network is known a priori - an assumption rarely satisfied in real-world domains where causal structure is uncertain, evolving, or only indirectly observable. This limits the applicability of dynamic causal inference in many scientific settings. We propose Dynamic Causal Network Autoregression (DCNAR), a two-stage neural causal modeling framework that integrates data-driven causal discovery with time-varying causal inference. In the first stage, a neural autoregressive causal discovery model learns a sparse directed causal network from multivariate time series. In the second stage, this learned structure is used as a structural prior for a time-varying neural network autoregression, enabling dynamic estimation of causal influence without requiring pre-specified network structure. We evaluate the scientific validity of DCNAR using behavioral diagnostics that assess causal necessity, temporal stability, and sensitivity to structural change, rather than predictive accuracy alone. Experiments on multi-country panel time-series data demonstrate that learned causal networks yield more stable and behaviorally meaningful dynamic causal inferences than coefficient-based or structure-free alternatives, even when forecasting performance is comparable. These results position DCNAR as a general framework for using AI as a scientific instrument for dynamic causal reasoning under structural uncertainty.
[AI-97] Beyond Expression Similarity: Contrastive Learning Recovers Functional Gene Associations from Protein Interaction Structure
Quick Read: This paper investigates how functional associations can be captured more effectively in molecular biology, asking whether physically grounded associations from protein-protein interactions outperform methods based on gene expression similarity. The key is Contrastive Association Learning (CAL), which trains an MLP on protein co-occurrence annotations (e.g., protein interactions from the STRING database) to build functional association models with stronger cross-boundary prediction, rather than relying on representational similarity in embedding space. Experiments show that CAL substantially and consistently outperforms expression-similarity methods on multiple gene perturbation and drug sensitivity datasets, with good generalization and gains concentrated on understudied genes with focused interaction profiles.
Link: https://arxiv.org/abs/2603.20955
Authors: Jason Dury
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures, code at this https URL
Abstract:The Predictive Associative Memory (PAM) framework posits that useful relationships often connect items that co-occur in shared contexts rather than items that appear similar in embedding space. A contrastive MLP trained on co-occurrence annotations–Contrastive Association Learning (CAL)–has improved multi-hop passage retrieval and discovered narrative function at corpus scale in text. We test whether this principle transfers to molecular biology, where protein-protein interactions provide functional associations distinct from gene expression similarity. Four experiments across two biological domains map the operating envelope. On gene perturbation data (Replogle K562 CRISPRi, 2,285 genes), CAL trained on STRING protein interactions achieves cross-boundary AUC of 0.908 where expression similarity scores 0.518. A second gene dataset (DepMap, 17,725 genes) confirms the result after negative sampling correction, reaching cross-boundary AUC of 0.947. Two drug sensitivity experiments produce informative negatives that sharpen boundary conditions. Three cross-domain findings emerge: (1) inductive transfer succeeds in biology–a node-disjoint split with unseen genes yields AUC 0.826 (Delta +0.127)–where it fails in text (+/-0.10), suggesting physically grounded associations are more transferable than contingent co-occurrences; (2) CAL scores anti-correlate with interaction degree (Spearman r = -0.590), with gains concentrating on understudied genes with focused interaction profiles; (3) tighter association quality outperforms larger but noisier training sets, reversing the text pattern. Results are stable across training seeds (SD 0.001) and cross-boundary threshold choices.
[AI-98] Before the Tool Call: Deterministic Pre-Action Authorization for Autonomous AI Agents
Quick Read: This paper targets the pre-action authorization problem for AI agents executing tool calls (fund transfers, database queries, sub-agent delegation, etc.): existing safety architectures rely on model alignment (probabilistic, training-time) and post-hoc evaluation (retrospective, batch), providing no deterministic, policy-based enforcement at the level of individual tool calls. The key is the Open Agent Passport (OAP), an open specification and reference implementation that synchronously intercepts tool calls, evaluates them against a declarative policy before execution, and produces cryptographically signed audit records, reaching authorization decisions in a median of 53 ms. Experiments in an adversarial testbed show that a restrictive OAP policy reduces the success rate of social engineering attacks from 74.6% to 0%, markedly outperforming conventional approaches.
Link: https://arxiv.org/abs/2603.20953
Authors: Uchi Uchibeke
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents today have passwords but no permission slips. They execute tool calls (fund transfers, database queries, shell commands, sub-agent delegation) with no standard mechanism to enforce authorization before the action executes. Current safety architectures rely on model alignment (probabilistic, training-time) and post-hoc evaluation (retrospective, batch). Neither provides deterministic, policy-based enforcement at the individual tool call level. We characterize this gap as the pre-action authorization problem and present the Open Agent Passport (OAP), an open specification and reference implementation that intercepts tool calls synchronously before execution, evaluates them against a declarative policy, and produces a cryptographically signed audit record. OAP enforces authorization decisions in a measured median of 53 ms (N=1,000). In a live adversarial testbed (4,437 authorization decisions across 1,151 sessions, 5,000 bounty), social engineering succeeded against the model 74.6% of the time under a permissive policy; under a restrictive OAP policy, a comparable population of attackers achieved a 0% success rate across 879 attempts. We distinguish pre-action authorization from sandboxed execution (contains blast radius but does not prevent unauthorized actions) and model-based screening (probabilistic), and show they are complementary. The same infrastructure that enforces security constraints (spending limits, capability scoping) also enforces quality gates, operational contracts, and compliance controls. The specification is released under Apache 2.0 (DOI: https://doi.org/10.5281/zenodo.18901596).
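A minimal sketch of deterministic pre-action authorization, assuming a toy policy format and using an HMAC as a stand-in for the specification's actual policy language and signing scheme:

```python
import hashlib, hmac, json, time

# Toy policy and signing key; the real OAP spec defines its own formats.
POLICY = {"transfer_funds": {"max_amount": 100}, "read_db": {}}
AUDIT_KEY = b"demo-signing-key"

def authorize(tool, args):
    """Deterministic pre-action check, evaluated before the tool executes."""
    rule = POLICY.get(tool)
    if rule is None:
        return False, "tool not in policy"
    limit = rule.get("max_amount")
    if limit is not None and args.get("amount", 0) > limit:
        return False, "amount exceeds limit"
    return True, "allowed"

def audit_record(tool, args, allowed, reason):
    """Signed audit record (HMAC standing in for a real signature)."""
    body = json.dumps({"tool": tool, "args": args, "allowed": allowed,
                       "reason": reason, "ts": time.time()}, sort_keys=True)
    sig = hmac.new(AUDIT_KEY, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}

ok, why = authorize("transfer_funds", {"amount": 500})
record = audit_record("transfer_funds", {"amount": 500}, ok, why)
print(ok, why)  # False amount exceeds limit
```

Unlike model-based screening, the check above is the same every time for the same inputs, which is the sense in which such enforcement is deterministic.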
[AI-99] gUFO: A Gentle Foundational Ontology for Semantic Web Knowledge Graphs
Quick Read: This paper addresses the modeling complexity and consistency problems pervasive in Semantic Web knowledge graphs, particularly the challenge of building scalable, interpretable, and ontologically well-founded OWL 2 DL knowledge representations. The key is gUFO, a lightweight implementation of the Unified Foundational Ontology (UFO), whose core strengths include a typology of types (operationalizing OntoClean guidelines), reification patterns for intrinsic and relational aspects, and support for situations and high-order types, providing solid, reusable, and formally rigorous modeling patterns for Semantic Web applications.
Link: https://arxiv.org/abs/2603.20948
Authors: João Paulo A. Almeida, Giancarlo Guizzardi, Tiago Prince Sales, Claudenir M. Fonseca
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 29 pages, 1 figure
Abstract:gUFO is a lightweight implementation of the Unified Foundational Ontology (UFO) suitable for Semantic Web OWL 2 DL applications. UFO is a mature foundational ontology with a rich axiomatization and that has been employed in a significant number of projects in research and industry. Moreover, it is currently in the process of standardization by the International Organization for Standardization as the ISO/IEC CD 21838-5. gUFO stands out from other foundational ontology implementations (such as those provided for BFO and DOLCE) given its unique support for a typology of types (operationalizing OntoClean guidelines), its reification patterns for intrinsic and relational aspects, and its support for situations and high-order types. gUFO provides well-founded patterns to address recurrent problems in Semantic Web knowledge graphs. In this paper, we present gUFO with its constituting categories, relations and constraints, discuss how it differs from the original UFO reference ontology, elaborate on its community adoption, and systematically position it in relation to existing OWL-based implementations of popular alternative foundational ontologies.
[AI-100] AC4A: Access Control for Agents
Quick Read: This paper addresses the coarse-grained access control in current large language model (LLM) agent systems: agents either have full access to an API or a web page's content or none at all, forcing users to grant agents more capabilities than a task requires and creating security risks. The key is the AC4A framework, which defines fine-grained resource permissions so that agents can access only the API endpoints or web page portions they are authorized for. Drawing on established access control models such as the Unix file system, AC4A lets applications organize resources hierarchically and compute the required permissions at runtime, effectively constraining agent behavior without prescribing any particular permission policy, which makes it flexible and deployable.
Link: https://arxiv.org/abs/2603.20933
Authors: Reshabh K Sharma, Dan Grossman
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:Large Language Model (LLM) agents combine the chat interaction capabilities of LLMs with the power to interact with external tools and APIs. This enables them to perform complex tasks and act autonomously to achieve user goals. However, current agent systems operate on an all-or-nothing basis: an agent either has full access to an API’s capabilities and a web page’s content, or it has no access at all. This coarse-grained approach forces users to trust agents with more capabilities than they actually need for a given task. In this paper, we introduce AC4A, an access control framework for agents. As agents become more capable and autonomous, users need a way to limit what APIs or portions of web pages these agents can access, eliminating the need to trust them with everything an API or web page allows. Our goal with AC4A is to provide a framework for defining permissions that lets agents access only the resources they are authorized to access. AC4A works across both API-based and browser-based agents. It does not prescribe what permissions should be, but offers a flexible way to define and enforce them, making it practical for real-world systems. AC4A works by creating permissions granting access to resources, drawing inspiration from established access control frameworks like the one for the Unix file system. Applications define their resources as hierarchies and provide a way to compute the necessary permissions at runtime needed for successful resource access. We demonstrate the usefulness of AC4A in enforcing permissions over real-world APIs and web pages through case studies. The source code of AC4A is available at this https URL
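The hierarchical, most-specific-wins permission idea can be sketched over path-like resource names; the grant format below is hypothetical and not AC4A's actual syntax:

```python
# Hypothetical grants in the spirit of AC4A: resources form a path
# hierarchy, the most specific grant wins, and the default is deny.
GRANTS = {
    "agent-1": {"api/orders/read": True, "api": False},
}

def is_allowed(agent, resource):
    """Walk prefixes from most to least specific; first grant found wins."""
    grants = GRANTS.get(agent, {})
    parts = resource.split("/")
    for i in range(len(parts), 0, -1):
        prefix = "/".join(parts[:i])
        if prefix in grants:
            return grants[prefix]
    return False  # default-deny

print(is_allowed("agent-1", "api/orders/read"))    # True
print(is_allowed("agent-1", "api/payments/write")) # False
```

The subtree semantics mirror Unix directory permissions: granting `api/orders/read` also covers resources nested under it, while the broader `api` deny blocks everything else.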
[AI-101] Causally-Guided Diffusion for Stable Feature Selection
Quick Read: This paper addresses the degradation of traditional feature selection under distribution shift: existing methods optimize predictive performance under a single data distribution, so the selected features may be spurious and fail when the environment changes. The key is the Causally-Guided Diffusion for Stable Feature Selection (CGDFS) framework, which casts feature selection as approximate posterior inference over feature subsets, jointly favoring low prediction error and low cross-environment variance. Its core innovations are: 1) using causal invariance as a soft inductive bias for stability-aware posterior sampling; 2) training a diffusion model as a prior over continuous selection masks, combined with a stability-aware likelihood, to capture structural dependencies among features and efficiently explore the combinatorially large selection space; and 3) guided annealed Langevin sampling for tractable, uncertainty-aware posterior inference that avoids discrete optimization, yielding more robust and transferable feature subsets.
Link: https://arxiv.org/abs/2603.20930
Authors: Arun Vignesh Malarkkan, Xinyuan Wang, Kunpeng Liu, Denghui Zhang, Yanjie Fu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 8 pages + references + appendix
Abstract:Feature selection is fundamental to robust data-centric AI, but most existing methods optimize predictive performance under a single data distribution. This often selects spurious features that fail under distribution shifts. Motivated by principles from causal invariance, we study feature selection from a stability perspective and introduce Causally-Guided Diffusion for Stable Feature Selection (CGDFS). In CGDFS, we formalized feature selection as approximate posterior inference over feature subsets, whose posterior mass favors low prediction error and low cross-environment variance. Our framework combines three key insights: First, we formulate feature selection as stability-aware posterior sampling. Here, causal invariance serves as a soft inductive bias rather than explicit causal discovery. Second, we train a diffusion model as a learned prior over plausible continuous selection masks, combined with a stability-aware likelihood that rewards invariance across environments. This diffusion prior captures structural dependencies among features and enables scalable exploration of the combinatorially large selection space. Third, we perform guided annealed Langevin sampling that combines the diffusion prior with the stability objective, which yields a tractable, uncertainty-aware posterior inference that avoids discrete optimization and produces robust feature selections. We evaluate CGDFS on open-source real-world datasets exhibiting distribution shifts. Across both classification and regression tasks, CGDFS consistently selects more stable and transferable feature subsets, which leads to improved out-of-distribution performance and greater selection robustness compared to sparsity-based, tree-based, and stability-selection baselines.
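The Langevin-sampling backbone of the approach, gradient steps on an energy plus Gaussian noise, can be sketched in one dimension with a quadratic energy standing in for the combined diffusion prior and stability objective; this toy does not include the annealing or guidance terms:

```python
import math, random

random.seed(2)

def langevin_sample(x=0.0, steps=2000, eta=0.01):
    """Unadjusted Langevin dynamics on E(x) = (x - 3)^2: a gradient step
    on the energy followed by injected Gaussian noise. In CGDFS the energy
    would combine the learned prior with the stability objective."""
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)  # dE/dx
        x = x - eta * grad + math.sqrt(2 * eta) * random.gauss(0, 1)
    return x

samples = [langevin_sample() for _ in range(200)]
mean = sum(samples) / len(samples)
print(mean)  # concentrates near the energy minimum at x = 3
```

The injected noise is what turns gradient descent into a sampler: the chain explores the posterior exp(-E) rather than collapsing to a single point estimate.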
[AI-102] Profit is the Red Team: Stress-Testing Agents in Strategic Economic Interactions
Quick Read: This paper addresses the dynamic security risks agents face in real-world deployments, where their decisions depend on external inputs (retrieved content, tool outputs, information from other actors) that adversaries can adaptively shape to steer agents toward unfavorable outcomes; static prompt attacks cannot cover such adaptive threats, so a mechanism that simulates an optimizing adversary is needed. The key is profit-driven red teaming, whose core is a learned opponent trained to maximize its own profit from scalar outcome feedback alone, requiring no LLM-as-judge, attack labels, or attack taxonomy. Instantiated in four canonical economic interactions for structured stress testing, it discovers reusable attack patterns (probing, anchoring, deceptive commitments), which are then distilled into concise prompt rules that substantially improve target robustness without any change to the model.
Link: https://arxiv.org/abs/2603.20925
Authors: Shouqiao Wang, Marcello Politi, Samuele Marro, Davide Crapis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As agentic systems move into real-world deployments, their decisions increasingly depend on external inputs such as retrieved content, tool outputs, and information provided by other actors. When these inputs can be strategically shaped by adversaries, the relevant security risk extends beyond a fixed library of prompt attacks to adaptive strategies that steer agents toward unfavorable outcomes. We propose profit-driven red teaming, a stress-testing protocol that replaces handcrafted attacks with a learned opponent trained to maximize its profit using only scalar outcome feedback. The protocol requires no LLM-as-judge scoring, attack labels, or attack taxonomy, and is designed for structured settings with auditable outcomes. We instantiate it in a lean arena of four canonical economic interactions, which provide a controlled testbed for adaptive exploitability. In controlled experiments, agents that appear strong against static baselines become consistently exploitable under profit-optimized pressure, and the learned opponent discovers probing, anchoring, and deceptive commitments without explicit instruction. We then distill exploit episodes into concise prompt rules for the agent, which make most previously observed failures ineffective and substantially improve target performance. These results suggest that profit-driven red-team data can provide a practical route to improving robustness in structured agent settings with auditable outcomes.
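An opponent that learns from scalar profit alone can be sketched as an epsilon-greedy bandit over a few named strategies; the strategies and payoffs below are invented and the paper's arena is far richer:

```python
import random

random.seed(0)

def play(strategy):
    """Stand-in environment returning the opponent's scalar profit."""
    payoffs = {"honest": 1.0, "anchor": 1.5, "deceive": 2.0}
    return payoffs[strategy] + random.gauss(0, 0.1)

def train_opponent(strategies, rounds=500, eps=0.1):
    """Epsilon-greedy bandit: learns purely from outcome feedback,
    with no judge, labels, or attack taxonomy."""
    totals = {s: 0.0 for s in strategies}
    counts = {s: 0 for s in strategies}
    for _ in range(rounds):
        if random.random() < eps:
            s = random.choice(strategies)  # explore
        else:                              # exploit best average so far
            s = max(strategies, key=lambda x: totals[x] / max(counts[x], 1))
        totals[s] += play(s)
        counts[s] += 1
    return max(strategies, key=lambda x: totals[x] / max(counts[x], 1))

best = train_opponent(["honest", "anchor", "deceive"])
print(best)  # the most profitable strategy wins out
```

Because only the scalar outcome drives learning, whatever behavior is most profitable is what the opponent converges to, which is exactly the pressure the protocol applies to the target agent.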
[AI-103] Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
Quick Read: This paper addresses equitable access to computation as demand for training large-scale deep learning models surges under tightening energy and infrastructure constraints. The core solution is a systematic benchmark of deep learning models on CPUs versus GPUs, revealing how GPU acceleration varies across models and underscoring the importance of shared GPU resources. Key findings: GPUs deliver 11-246x speedups, with lightweight models (Conv6) benefiting most; framework optimizations (such as TensorFlow's kernel fusion) also cut inference latency noticeably; and the growth trend in GPU memory requirements signals the need for more efficient resource management to sustain AI progress.
Link: https://arxiv.org/abs/2603.20920
Authors: Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Saddam Mukta, Abdulaziz Tabbakh
Affiliations: Unknown
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational resources, particularly under increasing energy and infrastructure constraints. GPUs have emerged as essential for accelerating such workloads. This study benchmarks four deep learning models (Conv6, VGG16, ResNet18, CycleGAN) using TensorFlow and PyTorch on Intel Xeon CPUs and NVIDIA Tesla T4 GPUs. Our experiments demonstrate that, on average, GPU training achieves speedups ranging from 11x to 246x depending on model complexity, with lightweight models (Conv6) showing the highest acceleration (246x), mid-sized models (VGG16, ResNet18) achieving 51-116x speedups, and complex generative models (CycleGAN) reaching 11x improvements compared to CPU training. Additionally, in our PyTorch vs. TensorFlow comparison, we observed that TensorFlow’s kernel-fusion optimizations reduce inference latency by approximately 15%. We also analyze GPU memory usage trends and projecting requirements through 2025 using polynomial regression. Our findings highlight that while GPUs are essential for sustaining AI’s growth, democratized and shared access to GPU resources is critical for enabling research innovation across institutions with limited computational budgets.
[AI-104] Enhancing LIME using Neural Decision Trees
【速读】:该论文旨在解决复杂机器学习模型(尤其是针对表格数据)在保持高预测性能的同时缺乏可解释性的问题,即如何在不牺牲模型准确性的前提下实现更可靠的局部解释。其解决方案的关键在于提出一种名为NDT-LIME的新方法,该方法将神经决策树(Neural Decision Trees, NDTs)作为替代传统线性回归或决策树的代理模型(surrogate model),利用NDT结构化的层次特性来更精确地捕捉黑箱模型的非线性决策边界,从而提升局部解释的忠实度(fidelity)。
链接: https://arxiv.org/abs/2603.20919
作者: Mohamed Aymen Bouyahia,Argyris Kalogeratos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Interpreting complex machine learning models is a critical challenge, especially for tabular data where model transparency is paramount. Local Interpretable Model-Agnostic Explanations (LIME) has been a very popular framework for interpretable machine learning, also inspiring many extensions. While traditional surrogate models used in LIME variants (e.g. linear regression and decision trees) offer a degree of stability, they can struggle to faithfully capture the complex non-linear decision boundaries that are inherent in many sophisticated black-box models. This work contributes toward bridging the gap between high predictive performance and interpretable decision-making. Specifically, we propose the NDT-LIME variant that integrates Neural Decision Trees (NDTs) as surrogate models. By leveraging the structured, hierarchical nature of NDTs, our approach aims at providing more accurate and meaningful local explanations. We evaluate its effectiveness on several benchmark tabular datasets, showing consistent improvements in explanation fidelity over traditional LIME surrogates.
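LIME 的局部代理流程(围绕样本扰动、按邻近度加权、拟合局部代理模型)可用一维示例勾勒。以下为最小化示意,以经典的加权线性拟合作为 NDT-LIME 所要替换的基线代理;函数名与核宽度均为示意性假设,并非论文实现:

```python
import math
import random

def lime_local_fit(black_box, x0, n_samples=500, sigma=0.5, seed=0):
    """LIME-style local explanation for a 1-D input: sample perturbations
    around x0, weight them with an exponential proximity kernel, and fit a
    weighted least-squares line whose slope is the local attribution.
    NDT-LIME would swap this linear surrogate for a Neural Decision Tree."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    ys = [black_box(x) for x in xs]
    # proximity weights: samples near x0 dominate the surrogate fit
    ws = [math.exp(-((x - x0) ** 2) / (2 * sigma ** 2)) for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / sw
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)) / sw
    return cov / var  # local sensitivity of black_box around x0
```

对线性黑箱(如 f(x)=3x+1),该局部斜率应恢复出 3;非线性黑箱则只在 x0 附近成立,这正是论文指出线性代理难以忠实刻画复杂决策边界的原因。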
[AI-105] Do LLM-Driven Agents Exhibit Engagement Mechanisms? Controlled Tests of Information Load, Descriptive Norms, and Popularity Cues
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)驱动的代理仿真虽然提升了行为表达力,但其生成的流畅、类人输出并不自动构成理论支持的证据,从而引发方法论上的张力。为应对这一问题,作者提出以社交媒体中的信息参与度(information engagement)作为实证检验案例,设计了一个类微博(Weibo-like)环境,在其中操纵信息负载(information load)和描述性规范(descriptive norms),同时允许流行度线索(如累计点赞和转发)内生演化。解决方案的关键在于:通过多条件控制实验(multi-condition stress tests)验证模拟行为是否在理论上可解释地响应变量变化,而非仅生成看似合理的轨迹;特别强调使用显式无规范基线(explicit no-norm baselines)来避免默认提示词并非空白对照,并保持反馈回路的内生性以准确捕捉从众效应(bandwagon dynamics)等动态机制。
链接: https://arxiv.org/abs/2603.20911
作者: Tai-Quan Peng,Yuan Tian,Songsong Liang,Dazhen Deng,Yingcai Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large language models make agent-based simulation more behaviorally expressive, but they also sharpen a basic methodological tension: fluent, human-like output is not, by itself, evidence for theory. We evaluate what an LLM-driven simulation can credibly support using information engagement on social media as a test case. In a Weibo-like environment, we manipulate information load and descriptive norms, while allowing popularity cues (cumulative likes and Sina Weibo-style cumulative reshares) to evolve endogenously. We then ask whether simulated behavior changes in theoretically interpretable ways under these controlled variations, rather than merely producing plausible-looking traces. Engagement responds systematically to information load and descriptive norms, and sensitivity to popularity cues varies across contexts, indicating conditionality rather than rigid prompt compliance. We discuss methodological implications for simulation-based communication research, including multi-condition stress tests, explicit no-norm baselines because default prompts are not blank controls, and design choices that preserve endogenous feedback loops when studying bandwagon dynamics.
[AI-106] The data heat island effect: quantifying the impact of AI data centers in a warming world
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)数据中心在全球范围内快速扩张所引发的能源消耗及其对周边环境热效应的影响问题,特别是其导致局部地表温度升高的生态与社会影响。解决方案的关键在于利用几十年来遥感平台获取的地表温度(Land Surface Temperature, LST)数据,对AI超大规模数据中心(AI hyperscalers)周边区域的温度变化进行量化评估,从而首次揭示了“数据热岛效应”(data heat island effect)的存在——即数据中心运营后平均使周边地表温度上升2°C,并识别出超过3.4亿人口可能受到这一局部微气候变化的影响。这一方法为评估AI基础设施的环境外部性提供了可量化的科学依据,推动了可持续AI发展议题的全球讨论。
链接: https://arxiv.org/abs/2603.20897
作者: Andrea Marinoni,Pietro Lio’,Erik Cambria,Luca Dal Zilio,Weisi Lin,Mauro Dalla Mura,Jocelyn Chanussot,Edoardo Ragusa,Gianmarco Mengaldo,Chi Yan Tso,Yihao Zhu,Benjamin Horton
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:The strong and continuous increase of AI-based services leads to the steady proliferation of AI data centres worldwide with the unavoidable escalation of their power consumption. It is unknown how this energy demand for computational purposes will impact the surrounding environment. Here, we focus our attention on the heat dissipation of AI hyperscalers. Taking advantage of land surface temperature measurements acquired by remote sensing platforms over the last decades, we are able to obtain a robust assessment of the temperature increase recorded in the areas surrounding AI data centres globally. We estimate that the land surface temperature increases by 2°C on average after the start of operations of an AI data centre, inducing local microclimate zones, which we call the data heat island effect. We assess the impact on the communities, quantifying that more than 340 million people could be affected by this temperature increase. Our results show that the data heat island effect could have a remarkable influence on communities and regional welfare in the future, hence becoming part of the conversation around environmentally sustainable AI worldwide.
[AI-107] Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections
【速读】:该论文旨在解决超连接(Hyper-Connections, HC)在训练过程中因无约束的跨流特征混合破坏残差连接固有的身份映射性质而导致的训练不稳定问题。现有方法通过将混合矩阵限制在Birkhoff多面体(doubly stochastic matrices)内以提升稳定性,但这一约束引入了三个关键缺陷:(1)身份退化(identity degeneration),即矩阵趋近于单位矩阵而削弱跨流交互;(2)表达瓶颈(expressivity bottleneck),由于非负性限制无法实现减法型特征解耦;(3)参数化低效,表现为Sinkhorn迭代不稳定或基于排列的参数化存在阶乘级计算开销。为克服这些问题,论文提出谱球约束超连接(Spectral-Sphere-Constrained Hyper-Connections, sHC),其核心创新在于将可行集从刚性的Birkhoff多面体几何空间迁移至谱范数球面,允许负值元素,从而支持选择性特征多样化所需的减法交互机制,同时消除不稳定的Sinkhorn投影和阶乘级参数化负担,在保持训练稳定性的同时显著增强模型表达能力。
链接: https://arxiv.org/abs/2603.20896
作者: Zhaoyi Liu,Haichuan Zhang,Ang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.
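从 Birkhoff 多面体转向谱范数球面的几何约束,可用“按最大奇异值缩放”来示意:先用幂迭代估计 σ_max,再将矩阵径向缩放到目标球面。以下为示意性草图(假设该投影等价于按谱范数重缩放,论文的具体参数化可能不同):

```python
def spectral_norm(M, iters=100):
    """Largest singular value of M (list of rows) via power iteration
    on M^T M."""
    m, n = len(M), len(M[0])
    v = [1.0] * n
    for _ in range(iters):
        Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
        MtMv = [sum(M[i][j] * Mv[i] for i in range(m)) for j in range(n)]
        norm = sum(x * x for x in MtMv) ** 0.5
        v = [x / norm for x in MtMv]
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
    return sum(x * x for x in Mv) ** 0.5

def project_to_spectral_sphere(M, radius=1.0):
    """Rescale M so its spectral norm equals `radius`. Entries may stay
    negative -- unlike a Birkhoff-polytope (doubly stochastic) constraint,
    which forbids the subtractive interactions sHC aims to unlock."""
    s = spectral_norm(M)
    return [[radius * x / s for x in row] for row in M]
```

与 Sinkhorn 迭代相比,这种缩放无需交替行列归一化,也不要求元素非负,这正是摘要所述消除不稳定投影、保留减法交互的直观体现。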
[AI-108] ReLaMix: Residual Latency-Aware Mixing for Delay-Robust Financial Time-Series Forecasting
【速读】:该论文旨在解决高频率金融市场中因异步数据采集和传输延迟导致的历史信号部分失真(stale observations)问题,这种延迟通常表现为零阶保持(Zero-Order Hold, ZOH)机制引入的阶梯状停滞伪影,显著增加了时间序列预测的难度。解决方案的关键在于提出 ReLaMix(Residual Latency-Aware Mixing Network),其核心创新是将可学习瓶颈压缩与残差精炼机制相结合,在轻量化架构下实现对延迟观测中冗余重复值的有效抑制,同时通过残差混合增强保留市场动态信息,从而提升在真实延迟场景下的预测鲁棒性。
链接: https://arxiv.org/abs/2603.20869
作者: Tianyou Lai,Wentao Yue,Jiayi Zhou,Chaoyuan Hao,Lingke Chang,Qingyu Mao,Zhibo Niu,Qilei Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 5 figures
Abstract:Financial time-series forecasting in real-world high-frequency markets is often hindered by delayed or partially stale observations caused by asynchronous data acquisition and transmission latency. To better reflect such practical conditions, we investigate a simulated delay setting where a portion of historical signals is corrupted by a Zero-Order Hold (ZOH) mechanism, significantly increasing forecasting difficulty through stepwise stagnation artifacts. In this paper, we propose ReLaMix (Residual Latency-Aware Mixing Network), a lightweight extension of TimeMixer that integrates learnable bottleneck compression with residual refinement for robust signal recovery under delayed observations. ReLaMix explicitly suppresses redundancy from repeated stale values while preserving informative market dynamics via residual mixing enhancement. Experiments on a large-scale second-resolution PAXGUSDT benchmark demonstrate that ReLaMix consistently achieves state-of-the-art accuracy across multiple delay ratios and prediction horizons, outperforming strong mixer and Transformer baselines with substantially fewer parameters. Moreover, additional evaluations on BTCUSDT confirm the cross-asset generalization ability of the proposed framework. These results highlight the effectiveness of residual bottleneck mixing for high-frequency financial forecasting under realistic latency-induced staleness.
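论文所用的零阶保持(ZOH)延迟破坏可以直接复现:当某一时刻的更新“迟到”时,沿用上一次收到的值,从而产生阶梯状停滞。以下为最小化示意(论文中停滞片段的具体生成方式可能有所不同):

```python
import random

def zero_order_hold(series, delay_ratio=0.3, seed=0):
    """Simulate latency-induced staleness: each step after the first fails
    to update with probability `delay_ratio`, so the last received value
    is held (ZOH), producing stepwise stagnation artifacts."""
    rng = random.Random(seed)
    out, last = [], series[0]
    for t, x in enumerate(series):
        if t > 0 and rng.random() < delay_ratio:
            out.append(last)  # stale: repeat the previous observation
        else:
            last = x
            out.append(x)
    return out
```

delay_ratio=0 时退化为原序列,=1 时整段停滞;ReLaMix 评估的正是介于两者之间的多种延迟比例。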
[AI-109] Governance-Aware Vector Subscriptions for Multi-Agent Knowledge Ecosystems
【速读】:该论文旨在解决多智能体(multi-agent)环境中因无限制语义订阅导致的合规性问题:当多个智能体遵循不同的数据处理政策时,若采用传统的语义发布-订阅系统,会导致某些智能体接收到其无权访问的内容,从而违反监管框架(如欧盟《数字化单一市场版权指令》(DSM Directive) 和《人工智能法案》(AI Act))。解决方案的关键在于提出“治理感知的向量订阅”(governance-aware vector subscriptions)机制,该机制将语义相似度匹配与基于监管框架的多维策略谓词(policy predicates)相结合,涵盖处理层级、直接营销限制、训练退出权、管辖区域和科研用途等五个独立维度,确保只有同时满足语义相似阈值和所有适用政策约束的内容才能被通知。实验证明,该机制在合成语料库上能准确执行全部策略约束并保障授权内容的传递,且任一单一维度均不足以实现完整合规。
链接: https://arxiv.org/abs/2603.20833
作者: Steven Johnson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 7 tables. Code and benchmark available at this https URL
Abstract:As AI agent ecosystems grow, agents need mechanisms to monitor relevant knowledge in real time. Semantic publish-subscribe systems address this by matching new content against vector subscriptions. However, in multi-agent settings where agents operate under different data handling policies, unrestricted semantic subscriptions create policy violations: agents receive notifications about content they are not authorized to access. We introduce governance-aware vector subscriptions, a mechanism that composes semantic similarity matching with multi-dimensional policy predicates grounded in regulatory frameworks (EU DSM Directive, EU AI Act). The policy predicate operates over multiple independent dimensions (processing level, direct marketing restrictions, training opt-out, jurisdiction, and scientific usage) each with distinct legal bases. Agents subscribe to semantic regions of a curated knowledge base; notifications are dispatched only for validated content that passes both the similarity threshold and all applicable policy constraints. We formalize the mechanism, implement it within AIngram (an operational multi-agent knowledge base), and evaluate it using the PASA benchmark. We validate the mechanism on a synthetic corpus (1,000 chunks, 93 subscriptions, 5 domains): the governed mode correctly enforces all policy constraints while preserving delivery of authorized content. Ablation across five policy dimensions shows that no single dimension suffices for full compliance.
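语义匹配与多维策略谓词的组合本质上是一个合取:仅当相似度超过阈值且所有适用维度均满足时才发送通知。以下为示意性草图,维度名称为假设示例,并非论文的实际模式:

```python
def notify(similarity, content_policy, subscription, threshold=0.8):
    """Governance-aware dispatch: fire a notification only if semantic
    similarity clears the threshold AND every policy dimension the
    subscriber requires matches the content's policy labels.
    Dimension names below are illustrative placeholders."""
    if similarity < threshold:
        return False  # fails the semantic-region match
    return all(content_policy.get(dim) == want
               for dim, want in subscription.items())
```

消融结论(任一单一维度不足以保证合规)对应于:去掉合取中的任何一项,都会让本应被拦截的内容通过。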
[AI-110] Compass: Optimizing Compound AI Workflows for Dynamic Adaptation
【速读】:该论文旨在解决复合型人工智能(Compound AI)系统在固定基础设施下难以同时满足准确性、延迟和成本目标的问题,尤其是在负载动态变化时,传统静态配置无法适应资源约束与性能需求的权衡。解决方案的关键在于提出一个名为Compass的框架,其核心由三部分组成:(1) COMPASS-V算法通过有限差分引导的搜索策略结合爬山与横向扩展方法,高效发现多组帕累托最优配置;(2) Planner模块基于排队论模型对配置进行硬件级性能建模,并推导出运行时切换策略;(3) Elastico控制器实时监控队列深度并依据预设阈值动态切换配置。该方案实现了在不牺牲服务质量(SLO)的前提下,显著提升系统效率与准确性,相较静态基线在SLO合规性上提升71.6%,同时在紧致准确率阈值下实现高达95.3%的效率增益。
链接: https://arxiv.org/abs/2603.20821
作者: Milos Gravara,Juan Luis Herrera,Stefan Nastic
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 7 figures; accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)
Abstract:Compound AI is a distributed intelligence approach that represents a unified system orchestrating specialized AI/ML models with engineered software components into AI workflows. Compound AI production deployments must satisfy accuracy, latency, and cost objectives under varying loads. However, many deployments operate on fixed infrastructure where horizontal scaling is not viable. Existing approaches optimize solely for accuracy and do not consider changes in workload conditions. We observe that compound AI systems can switch between configurations to fit infrastructure capacity, trading accuracy for latency based on current load. This requires discovering multiple Pareto-optimal configurations from a combinatorial search space and determining when to switch between them at runtime. We present Compass, a novel framework that enables dynamic configuration switching through offline optimization and online adaptation. Compass consists of three components: COMPASS-V algorithm for configuration discovery, Planner for switching policy derivation, and Elastico Controller for runtime adaptation. COMPASS-V discovers accuracy-feasible configurations using finite-difference guided search and a combination of hill-climbing and lateral expansion. Planner profiles these configurations on target hardware and derives switching policies using a queuing theory based model. Elastico monitors queue depth and switches configurations based on derived thresholds. Across two compound AI workflows, COMPASS-V achieves 100% recall while reducing configuration evaluations by 57.5% on average compared to exhaustive search, with efficiency gains reaching 95.3% at tight accuracy thresholds. Runtime adaptation achieves 90-98% SLO compliance under dynamic load patterns, improving SLO compliance by 71.6% over static high-accuracy baselines, while simultaneously improving accuracy by 3-5% over static fast baselines.
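Elastico 控制器基于队列深度的配置切换可用简单阈值策略示意;以下阈值与配置名均为假设示例,实际应由 Planner 的排队论模型推导:

```python
def pick_config(queue_depth, configs):
    """Elastico-style runtime switch (illustrative): `configs` is a list
    of (threshold, name) pairs sorted by ascending queue-depth threshold,
    from the most accurate/slowest to the fastest/least accurate config.
    Pick the most accurate config whose threshold is not exceeded."""
    for threshold, name in configs:
        if queue_depth <= threshold:
            return name
    return configs[-1][1]  # overloaded: fall back to the fastest config
```

这对应摘要中“以准确率换延迟”的权衡:负载低时选高精度配置,队列积压时切换到更快的帕累托最优配置。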
[AI-111] GMPilot: An Expert AI Agent For FDA cGMP Compliance
【速读】:该论文旨在解决制药行业在质量管理体系中面临的合规成本高、响应速度慢以及知识碎片化等问题(quality management challenges such as high costs of compliance, slow responses and disjointed knowledge)。其解决方案的关键在于构建一个名为GMPilot的领域专用人工智能代理,该代理基于经过筛选的法规与历史检查记录知识库,结合检索增强生成(Retrieval-Augmented Generation, RAG)和推理-行动框架(Reasoning-Acting, ReAct),为质量专业人员提供实时且可追溯的决策支持。在模拟检查场景中,GMPilot通过结构化知识检索和可验证的法规及案例支持,显著提升了质量管理人员的响应效率与专业水平。
链接: https://arxiv.org/abs/2603.20815
作者: Xiaohan Wang,Nan Zhang,Sulene Han,Keguang Tang,Lei Xu,Zhiping Li,Xiue (Sue) Liu,Xiaomei Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figure
Abstract:The pharmaceutical industry is facing challenges with quality management such as high costs of compliance, slow responses and disjointed knowledge. This paper presents GMPilot, a domain-specific AI agent that is designed to support FDA cGMP compliance. GMPilot is based on a curated knowledge base of regulations and historical inspection observations and uses Retrieval-Augmented Generation (RAG) and Reasoning-Acting (ReAct) frameworks to provide real-time and traceable decision support to the quality professionals. In a simulated inspection scenario, GMPilot shows how it can improve the responsiveness and professionalism of quality professionals by providing structured knowledge retrieval and verifiable regulatory and case-based support. Although GMPilot lacks in the aspect of regulatory scope and model interpretability, it is a viable avenue of improving quality management decision-making in the pharmaceutical sector using intelligent approaches and an example of specialized application of AI in highly regulated sectors.
[AI-112] Modeling Epistemic Uncertainty in Social Perception via Rashomon Set Agents
【速读】:该论文旨在解决如何在缺乏全局社会认知的情况下,解释学生个体主观社会感知差异的形成与演化机制,尤其是在有限问卷数据和真实社交网络约束下,群体层面认知分歧如何通过局部互动逐步稳定。其解决方案的关键在于构建一个由大语言模型(LLM)驱动的多智能体概率建模框架,其中每个学生被赋予一个个性化的主观图(subjective graph),仅能基于本地可访问的信息进行判断,并利用检索增强生成(RAG)技术获取局部知识,进而评估同伴能力与社会地位;同时引入与社交焦虑相关的结构扰动以模拟个体感知准确性的系统性差异,并通过带有置信度标签的叙事式评价与基于LLM的信任评分进行信念的概率更新,从而在不假设“上帝视角”的前提下,再现现实教育场景中的集体认知动态。
链接: https://arxiv.org/abs/2603.20750
作者: Jinming Yang,Xinyu Jiang,Xinshan Jiao,Xinping Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present an LLM-driven multi-agent probabilistic modeling framework that demonstrates how differences in students’ subjective social perceptions arise and evolve in real-world classroom settings, under constraints from an observed social network and limited questionnaire data. When social information is incomplete and the accuracy of perception differs between students, they can form different views of the same group structure from local cues they can access. Repeated peer communication and belief updates can gradually change these views and, over time, lead to stable group-level differences. To avoid assuming a global “god’s-eye view,” we assign each student an individualized subjective graph that shows which social ties they can perceive and how far information is reachable from their perspective. All judgments and interactions are restricted to this subjective graph: agents use retrieval-augmented generation (RAG) to access only local information and then form evaluations of peers’ competence and social standing. We also add structural perturbations related to social-anxiety to represent consistent individual differences in the accuracy of social perception. During peer exchanges, agents share narrative assessments of classmates’ academic performance and social position with uncertainty tags, and update beliefs probabilistically using LLM-based trust scores. Using the time series of six real exam scores as an exogenous reference, we run multi-step simulations to examine how epistemic uncertainty spreads through local interactions. Experiments show that, without relying on global information, the framework reproduces several collective dynamics consistent with real-world educational settings. The code is released at this https URL.
[AI-113] Multi-RF Fusion with Multi-GNN Blending for Molecular Property Prediction
【速读】:该论文旨在解决分子属性预测任务中模型性能提升的瓶颈问题,特别是在OGB(Open Graph Benchmark)的ogbg-molhiv数据集上实现更稳定和高精度的分类性能。解决方案的关键在于提出多随机森林融合(Multi-RF Fusion)方法,其核心是通过rank-averaged ensemble策略整合12个基于 concatenated 分子指纹(FCFP、ECFP、MACCS、原子对,共4,263维)训练的随机森林(Random Forest, RF)模型,并以12%权重融合深度集成图神经网络(GNN)的预测结果。两个关键发现驱动了性能突破:一是将RF的max_features设置为0.20而非默认的sqrt(d),在骨架划分(scaffold split)下带来+0.008 AUC提升;二是先对10个种子(seed)的GNN预测进行平均再与RF融合,彻底消除了GNN种子差异带来的方差,使最终标准差从0.0008降至0.0002,从而在不使用外部数据或预训练的情况下达到OGB leaderboard第一名(ROC-AUC 0.8476 ± 0.0002)。
链接: https://arxiv.org/abs/2603.20724
作者: Zacharie Bugaud
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 4 tables
Abstract:Multi-RF Fusion achieves a test ROC-AUC of 0.8476 +/- 0.0002 on ogbg-molhiv (10 seeds), placing #1 on the OGB leaderboard ahead of HyperFusion (0.8475 +/- 0.0003). The core of the method is a rank-averaged ensemble of 12 Random Forest models trained on concatenated molecular fingerprints (FCFP, ECFP, MACCS, atom pairs – 4,263 dimensions total), blended with deep-ensembled GNN predictions at 12% weight. Two findings drive the result: (1) setting max_features to 0.20 instead of the default sqrt(d) gives a +0.008 AUC gain on this scaffold split, and (2) averaging GNN predictions across 10 seeds before blending with the RF eliminates GNN seed variance entirely, dropping the final standard deviation from 0.0008 to 0.0002. No external data or pre-training is used.
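摘要描述的两步集成——先对 12 个随机森林的得分做 rank-average,再以 12% 权重融合跨种子平均后的 GNN 预测——可直接勾勒如下(示意性实现;并列值处理与排名归一化细节可能与作者代码不同):

```python
def rank_average(score_lists):
    """Rank-averaged ensemble: convert each model's scores to ranks
    normalized to [0, 1], then average the ranks across models."""
    n = len(score_lists[0])
    avg = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i])
        for r, i in enumerate(order):
            avg[i] += r / (n - 1)  # rank 0 = lowest score
    return [a / len(score_lists) for a in avg]

def blend(rf_rank_avg, gnn_mean, w_gnn=0.12):
    """Blend the RF rank-ensemble with seed-averaged GNN predictions;
    averaging GNN seeds *before* blending removes GNN seed variance."""
    return [(1 - w_gnn) * r + w_gnn * g
            for r, g in zip(rf_rank_avg, gnn_mean)]
```

由于 ROC-AUC 只依赖得分的排序,rank-average 对各模型得分的尺度差异天然鲁棒,这是此类 AUC 榜单上常用该做法的原因。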
[AI-114] Decoupling Numerical and Structural Parameters: An Empirical Study on Adaptive Genetic Algorithms via Deep Reinforcement Learning for the Large-Scale TSP CEC
【速读】:该论文旨在解决进化算法(Evolutionary Algorithms, EAs)中参数配置对算法可扩展性影响不均衡的问题,特别是探究数值参数(如交叉和变异率)与结构参数(如种群大小和算子切换策略)在求解旅行商问题(Traveling Salesman Problem, TSP)时各自的作用差异。解决方案的关键在于提出并实现了一个双层深度强化学习(Deep Reinforcement Learning, DRL)框架,其中采用循环PPO代理(Recurrent PPO agent)动态调节两类参数,并将DRL模型作为探针来揭示进化过程中的动态行为。实验表明,所学策略显著优于静态基线,最优性差距平均降低约45%;更重要的是,消融分析揭示:虽然数值调优仅能实现局部优化,但结构灵活性才是避免陷入局部最优、促进跳出停滞状态的关键因素,从而为自动化算法设计指明了优先方向——即应更关注动态结构重构而非精细的概率调整。
链接: https://arxiv.org/abs/2603.20702
作者: Hongyu Wang,Yuhan Jing,Yibing Shi,Enjin Zhou,Haotian Zhang,Jialong Shi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 6 pages, 8 figures, Accepted by WCCI-CEC Conference
Abstract:Proper parameter configuration is a prerequisite for the success of Evolutionary Algorithms (EAs). While various adaptive strategies have been proposed, it remains an open question whether all control dimensions contribute equally to algorithmic scalability. To investigate this, we categorize control variables into numerical parameters (e.g., crossover and mutation rates) and structural parameters (e.g., population size and operator switching), hypothesizing that they play distinct roles. This paper presents an empirical study utilizing a dual-level Deep Reinforcement Learning (DRL) framework to decouple and analyze the impact of these two dimensions on the Traveling Salesman Problem (TSP). We employ a Recurrent PPO agent to dynamically regulate these parameters, treating the DRL model as a probe to reveal evolutionary dynamics. Experimental results confirm the effectiveness of this approach: the learned policies outperform static baselines, reducing the optimality gap by approximately 45% on the largest tested instance (rl5915). Building on this validated framework, our ablation analysis reveals a fundamental insight: while numerical tuning offers local refinement, structural plasticity is the decisive factor in preventing stagnation and facilitating escape from local optima. These findings suggest that future automated algorithm design should prioritize dynamic structural reconfiguration over fine-grained probability adjustment. To facilitate reproducibility, the source code is available at this https URL
[AI-115] SWE-Next: Scalable Real-World Software Engineering Tasks for Agents
【速读】:该论文旨在解决生成式软件工程(Generative Software Engineering, GSE)代理训练中数据规模扩展困难的问题,核心挑战在于:真实代码仓库变更中仅有少量实例能产生可验证的高信号任务,且为每个任务构建独立环境会迅速成为系统瓶颈。解决方案的关键在于提出 SWE-Next 框架,其通过两个创新机制实现高效、高质量的数据收集:一是基于执行验证的自洽实例筛选策略,从真实合并的拉取请求(pull requests)中挖掘并保留仅在测试覆盖率或正确性上严格提升且无回归的 commit 对;二是引入可复用的 repo-quarter 环境配置文件,在时间邻近的 commits 间共享运行环境以降低存储和计算开销,同时确保每项任务执行独立与可重现。该方法在仅使用 30 小时和 639GB 存储的情况下,从 3,971 个种子仓库中构建出 2,308 个自验证实例,显著提升了下游任务的 pass@1 性能,表明其优势源于更高质量的执行接地监督信号而非更强的轨迹生成能力。
链接: https://arxiv.org/abs/2603.20691
作者: Jiarong Liang,Zhiheng Lyu,Zijie Liu,Xiangchao Chen,Ping Nie,Kai Zou,Wenhu Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Executable software engineering data is valuable for training SWE agents, but scaling it remains difficult for two reasons: only a small fraction of real repository changes yield verifiable, high-signal task instances, and naively building repository-specific environments quickly becomes the dominant systems cost. We present SWE-Next, an execution-grounded framework for scalable SWE task and trajectory collection. On the data side, SWE-Next mines real merged pull requests, executes candidate base/merged commit pairs, and retains only those that produce strict test improvements without regressions, yielding self-verifying instances. It also applies strict submission gating so that collected trajectories remain evidence-driven rather than speculative. On the systems side, SWE-Next introduces reusable repo-quarter profiles, which reuse the same environment across nearby commits in time while keeping each task run separate and reproducible. Using only 30 hours and 639GB of environment storage, SWE-Next processes 3,971 seed repositories and 102,582 candidate commit pairs mined from real merged PRs to construct a dataset of 2,308 self-verifying instances. Experiments show that SWE-Next improves downstream pass@1 with fewer or comparable training trajectories, indicating that its gains come not from a stronger trajectory generator, but from higher-signal execution-grounded supervision and more efficient data collection.
[AI-116] Artificial Intelligence in Experimental Approaches: Growth Hacking Lean Startup Design Thinking and Agile
【速读】:该论文旨在解决组织在实践中如何有效整合人工智能(Artificial Intelligence, AI)技术与实验性方法论(如增长黑客、精益创业、设计思维和敏捷方法)以提升效率与效果的问题。其解决方案的关键在于通过系统性文献综述识别出AI在上述方法中的核心赋能作用:即利用AI提供的高级数据分析能力、实时反馈机制、自动化流程及过程优化工具,从而增强决策质量、加速迭代周期、激发创造力并优化任务优先级。同时,论文强调成功实施需应对技能缺口、伦理争议和数据治理等挑战,提出战略性的AI采纳路径,包括员工培训、严格的数据管理以及遵循伦理规范。
链接: https://arxiv.org/abs/2603.20688
作者: Parisa Omidmand,Saeid Ataei
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 17
Abstract:Organizations increasingly adopt AI technologies to accelerate their performance and capacity to adapt to market dynamics. This study examines how organizations implement AI in experimental methodologies such as growth hacking, lean startup, design thinking, and agile methodology to enhance efficiency and effectiveness. We performed a systematic literature review following the PRISMA 2020 framework, analyzing 37 articles from Web of Science (WOS) and Scopus databases published between 2018 and 2024 to assess AI integration with experimental approaches. Our findings indicate that AI plays a pivotal role in enhancing these methodologies by offering advanced tools for data analysis, real-time feedback, automation, and process optimization. For instance, AI-driven analytics improves decision-making in growth hacking, streamlines iterative cycles in lean startups, enhances creativity in design thinking, and optimizes task prioritization in agile methodology. Furthermore, we identified several real-world cases that successfully utilized AI in experimental strategies and improved their performance across various industries. However, despite the clear advantages of AI integration, organizations face barriers such as skill gaps, ethical concerns, and data governance issues. Addressing these challenges requires a strategic approach to AI adoption, including workforce training, strict data management, and adherence to ethical standards.
[AI-117] SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection
【速读】:该论文旨在解决生成式语音合成(Generative AI)中深度伪造语音检测模型在跨说话人场景下泛化能力差的问题。研究表明,基于自监督学习的语音编码器提取的表征存在严重的说话人纠缠(speaker entanglement),导致检测器依赖于说话人特定特征而非合成伪影相关的线索。解决方案的关键在于提出SNAP(Speaker-Nulling Framework),通过估计说话人子空间并施加正交投影来抑制说话人相关成分,从而在残差特征中隔离出与合成伪影相关的模式,显著降低说话人纠缠,促使检测器聚焦于更具泛化性的伪影特征,实现最优检测性能。
链接: https://arxiv.org/abs/2603.20686
作者: Kyudan Jung,Jihwan Kim,Minwoo Lee,Soyoon Kim,Jeonghoon Kim,Jaegul Choo,Cheonbok Park
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 2 tables
Abstract:Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
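SNAP 的“消除”步骤——将编码器特征投影到说话人子空间的正交补上,即残差 (I - UUᵀ)f——是标准的正交投影。以下示意假设说话人子空间已估计为一组标准正交基(子空间如何估计此处从略):

```python
def null_speaker_component(feature, speaker_dirs):
    """Project `feature` onto the orthogonal complement of the span of
    `speaker_dirs`, i.e. residual = (I - U U^T) f. `speaker_dirs` is
    assumed to be an orthonormal basis of the estimated speaker subspace,
    so subtracting each projection coefficient in turn is exact."""
    residual = list(feature)
    for u in speaker_dirs:
        coef = sum(f * ui for f, ui in zip(feature, u))  # <feature, u>
        residual = [r - coef * ui for r, ui in zip(residual, u)]
    return residual
```

残差中被压制的是与说话人方向相关的成分,留下的正是摘要所称与合成伪影相关的模式。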
[AI-118] Centrality-Based Pruning for Efficient Echo State Networks
【速读】:该论文旨在解决Echo State Networks (ESNs) 中因随机初始化导致的冗余节点问题,这些问题会引入不必要的计算开销并降低系统效率。解决方案的关键在于将ESN的储备层建模为加权有向图,并利用图中心性(graph centrality)指标识别和移除结构上不重要的节点,从而在保持甚至提升预测精度的同时显著压缩网络规模,并保留关键的储备动态特性。
链接: https://arxiv.org/abs/2603.20684
作者: Sudip Laudari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 8 pages, 3 figures, 2 tables
Abstract:Echo State Networks (ESNs) are a reservoir computing framework widely used for nonlinear time-series prediction. However, despite their effectiveness, the randomly initialized reservoir often contains redundant nodes, leading to unnecessary computational overhead and reduced efficiency. In this work, we propose a graph centrality-based pruning approach that interprets the reservoir as a weighted directed graph and removes structurally less important nodes using centrality measures. Experiments on Mackey-Glass time-series prediction and electric load forecasting demonstrate that the proposed method can significantly reduce reservoir size while maintaining, and in some cases improving, prediction accuracy, while preserving the essential reservoir dynamics.
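该剪枝思路——按图中心性为储备池节点打分并移除得分最低者——可用最简单的加权度中心性示意(论文考察的是一般意义上的中心性度量,特征向量或介数中心性可按同样方式代入):

```python
def centrality_prune(W, keep_ratio=0.8):
    """Treat the reservoir weight matrix W (list of rows, W[i][j] is the
    weight of edge j -> i) as a weighted directed graph, score each node
    by total absolute in- plus out-strength (a simple degree-centrality
    proxy), and keep the top `keep_ratio` fraction of nodes."""
    n = len(W)
    score = [sum(abs(W[i][j]) for j in range(n)) +   # in-strength of i
             sum(abs(W[j][i]) for j in range(n))     # out-strength of i
             for i in range(n)]
    k = max(1, int(keep_ratio * n))
    keep = sorted(sorted(range(n), key=lambda i: -score[i])[:k])
    # reservoir sub-matrix restricted to the surviving nodes
    return keep, [[W[i][j] for j in keep] for i in keep]
```

实践中剪枝后通常需检查谱半径是否仍满足回声状态性质,必要时对子矩阵重新缩放。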
[AI-119] AI-Driven Multi-Agent Simulation of Stratified Polyamory Systems: A Computational Framework for Optimizing Social Reproductive Efficiency
【速读】:该论文旨在解决当代社会面临的双重人口危机:一是生育率持续下降,尤其在东亚国家表现尤为严峻(如中国2023年总和生育率TFR降至约1.0,韩国低于0.72);二是婚姻制度结构性瓦解,表现为受教育女性因缺乏情感满足与经济保障而理性拒绝婚姻,同时底层男性群体长期面临性剥夺、焦虑与习得性无助。其解决方案的关键在于提出一种分层多偶制系统(Stratified Polyamory System, SPS),通过引入法律认可的次级配偶限制机制、社会化育儿与继承制度改革,构建一个基于异质主体的多智能体系统(multi-agent system)。该框架融合了代理模型(ABM)、多智能体强化学习(MARL)及大语言模型(LLM)驱动的社会仿真技术,将个体按A/B/C阶层分类并建模为马尔可夫决策过程,采用近端策略优化(PPO)算法求解匹配问题,并利用图神经网络(GNN)分析婚恋网络结构。研究表明,SPS可在帕累托意义上提升整体社会福利,初步计算结果验证其对缓解女性母职惩罚和男性性匮乏问题的有效性,且提供了一种非暴力的财富再分配机制,类比于中国古代“恩荫令”(Grace Decree)的历史实践。
链接: https://arxiv.org/abs/2603.20678
作者: Yicai Xing
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 20 pages, 10 figures, 3 tables, 83 references
Abstract:Contemporary societies face a severe crisis of demographic reproduction. Global fertility rates continue to decline precipitously, with East Asian nations exhibiting the most dramatic trends – China’s total fertility rate (TFR) fell to approximately 1.0 in 2023, while South Korea’s dropped below 0.72. Simultaneously, the institution of marriage is undergoing structural disintegration: educated women rationally reject unions lacking both emotional fulfillment and economic security, while a growing proportion of men at the lower end of the socioeconomic spectrum experience chronic sexual deprivation, anxiety, and learned helplessness. This paper proposes a computational framework for modeling and evaluating a Stratified Polyamory System (SPS) using techniques from agent-based modeling (ABM), multi-agent reinforcement learning (MARL), and large language model (LLM)-empowered social simulation. The SPS permits individuals to maintain a limited number of legally recognized secondary partners in addition to one primary spouse, combined with socialized child-rearing and inheritance reform. We formalize the A/B/C stratification as heterogeneous agent types in a multi-agent system and model the matching process as a MARL problem amenable to Proximal Policy Optimization (PPO). The mating network is analyzed using graph neural network (GNN) representations. Drawing on evolutionary psychology, behavioral ecology, social stratification theory, computational social science, algorithmic fairness, and institutional economics, we argue that SPS can improve aggregate social welfare in the Pareto sense. Preliminary computational results demonstrate the framework’s viability in addressing the dual crisis of female motherhood penalties and male sexlessness, while offering a non-violent mechanism for wealth dispersion analogous to the historical Chinese Grace Decree (Tui’en Ling).
[AI-120] REVERE: Reflective Evolving Research Engineer for Scientific Workflows ICLR2026
【速读】: This paper addresses two problems of existing prompt-optimization techniques on research-coding tasks: they update behavior from local signals while ignoring recurring global patterns across tasks, which hurts generalization, and they lose knowledge through full-prompt rewrites or unstructured merges. The key of the proposed Reflective Evolving Research Engineer (REVERE) framework is continual learning from a Global Training Context: it recognizes recurring failure modes across cross-repository execution trajectories, distills them into reusable heuristics, and performs targeted edits on three configurable fields (the system prompt, a task-prompt template, and a cumulative cheatsheet), preserving knowledge while letting it evolve.
链接: https://arxiv.org/abs/2603.20667
作者: Balaji Dinesh Gangireddi,Aniketh Garikaparthi,Manasi Patwardhan,Arman Cohan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: ICLR 2026, Recursive Self-Improvement Workshop
Abstract:Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader and recurring patterns across tasks, leading to poor generalization; they further rely on full-prompt rewrites or unstructured merges, resulting in knowledge loss. These limitations are magnified in research-coding workflows, which involve heterogeneous repositories, underspecified environments, and weak feedback, where reproducing results from public codebases is an established evaluation regime. We introduce Reflective Evolving Research Engineer (REVERE), a framework that continuously learns from Global Training Context, recognizes recurring failure modes in cross-repository execution trajectories, distills them into reusable heuristics, and performs targeted edits across three configurable fields: the system prompt, a task-prompt template, and a cumulative cheatsheet. REVERE, via this reflective optimization framework, improves performance over prior state-of-the-art expert-crafted instructions on research coding tasks by 4.50% on SUPER, 3.51% on ResearchCodeBench, and 4.89% on ScienceAgentBench across their respective metrics. These results demonstrate that agents equipped with mechanisms for continual learning and global memory consolidation can meaningfully evolve their capabilities over time.
[AI-121] Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture
【速读】: This paper addresses the limited applicability of classical Amdahl's Law to modern heterogeneous computing systems: its assumptions of a fixed serial/parallel decomposition and homogeneous replication cannot accurately capture the relationship between resource-allocation efficiency and performance gains in systems that combine specialized accelerators, programmable compute, tensor datapaths, and dynamic pipelines. The key of the solution is a reformulation of Amdahl's Law that models heterogeneous hardware resource allocation under scalable workloads, revealing a finite "collapse threshold": once the scalable fraction of work exceeds a critical value, optimal investment in specialized hardware drops to zero regardless of its efficiency advantage, a phase transition rather than an asymptotic decay. This explains both the growing programmability of GPUs and why domain-specific AI accelerators have not fully displaced them.
链接: https://arxiv.org/abs/2603.20654
作者: Chien-Ping Lu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 10 pages, 5 figures. arXiv version
Abstract:Classical Amdahl’s Law assumes a fixed decomposition between serial and parallel work and homogeneous replication; historically, it bounds how much parallel speedup is attainable. Modern systems instead combine specialized accelerators with programmable compute, tensor datapaths, and evolving pipelines, while empirical scaling laws shift which stages absorb marginal compute. The central tension is therefore not the serial-versus-parallel split alone, but resource allocation across heterogeneous hardware, given efficiency differences, and workload structures that determine how effectively additional compute can be converted into value. We reformulate Amdahl’s Law for modern heterogeneous systems with scalable workloads. The analysis yields a finite collapse threshold: beyond a critical scalable fraction, specialization becomes suboptimal for any efficiency advantage of specialized hardware over programmable compute, and optimal specialized investment falls to zero, a phase transition rather than an asymptotic tail. We use this framework to interpret increasing GPU programmability and why domain-specific AI accelerators have not displaced GPUs.
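As a rough illustration (not taken from the paper, whose heterogeneous reformulation is its contribution), the classical Amdahl's Law that the abstract starts from can be sketched in a few lines; the parameter values below are arbitrary examples:

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Classical Amdahl speedup: the serial part stays fixed while the
    parallel part is divided across n_units homogeneous processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# Diminishing returns: with 95% parallel work, speedup is capped at 1/0.05 = 20,
# no matter how many units are added.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

The paper's point is that this fixed decomposition no longer describes modern systems, where scaling laws shift which stages absorb marginal compute.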
[AI-122] From 50% to Mastery in 3 Days: A Low-Resource SOP for Localizing Graduate-Level AI Tutors via Shadow-RAG
【速读】: This paper tackles the "Resource Curse" that blocks the deployment of high-fidelity AI tutors in education, namely the reliance on expensive cloud GPUs and heavy data engineering. The key of the solution is a replicable Standard Operating Procedure that combines a Vision-Language Model data-cleaning strategy with a novel Shadow-RAG architecture. Using only 3 person-days of non-expert labor and an open-weights 32B model deployable on a single consumer-grade GPU, the authors localized a graduate-level Applied Mathematics tutor. Experiments show that the Shadow Agent, by providing structured reasoning guidance, triggers a striking capability surge in newer 32B models, raising accuracy from 74% (Naive RAG) to mastery level (90%), while older models gain only modestly, suggesting that such structured guidance is the key to unlocking the latent ability of modern small language models.
链接: https://arxiv.org/abs/2603.20650
作者: Zonglin Yang,J.-H. Xie,Lining Zhang,Jiyou Jia,Zhi-X. Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages, 3 figures, practitioner report
Abstract:Deploying high-fidelity AI tutors in schools is often blocked by the Resource Curse – the need for expensive cloud GPUs and massive data engineering. In this practitioner report, we present a replicable Standard Operating Procedure that breaks this barrier. Using a Vision-Language Model data cleaning strategy and a novel Shadow-RAG architecture, we localized a graduate-level Applied Mathematics tutor using only 3 person-days of non-expert labor and open-weights 32B models deployable on a single consumer-grade GPU. Our pilot study on a full graduate-level final exam reveals a striking emergence phenomenon: while both zero-shot baselines and standard retrieval stagnate around 50-60% accuracy across model generations, the Shadow Agent, which provides structured reasoning guidance, triggers a massive capability surge in newer 32B models, boosting performance from 74% (Naive RAG) to mastery level (90%). In contrast, older models see only modest gains (~10%). This suggests that such guidance is the key to unlocking the latent power of modern small language models. This work offers a cost-effective, scientifically grounded blueprint for ubiquitous AI education.
[AI-123] Agentic AI and the next intelligence explosion
【速读】: This paper targets the common misreading of the "AI singularity" as a single, godlike mind, which overlooks the fundamentally plural, social, and relational nature of intelligence. Its key observation is that frontier reasoning models such as DeepSeek-R1 exhibit a "society of thought": they improve on complex tasks not by simply thinking longer but by simulating internal cognitive debates that argue, verify, and reconcile. Building on this, the authors propose shifting from dyadic alignment (RLHF) toward institutional alignment, designing digital protocols modeled on organizations and markets to build a social infrastructure of checks and balances, enabling a new paradigm of co-evolving human-AI "centaurs". On this path, the next intelligence explosion would take the form of a specializing, sprawling combinatorial society rather than a single silicon brain.
链接: https://arxiv.org/abs/2603.20639
作者: James Evans,Benjamin Bratton,Blaise Agüera y Arcas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages
Abstract:The “AI singularity” is often miscast as a monolithic, godlike mind. Evolution suggests a different path: intelligence is fundamentally plural, social, and relational. Recent advances in agentic AI reveal that frontier reasoning models, such as DeepSeek-R1, do not improve simply by “thinking longer”. Instead, they simulate internal “societies of thought,” spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. Moreover, we are entering an era of human-AI centaurs: hybrid actors where collective agency transcends individual control. Scaling this intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols, modeled on organizations and markets, we can build a social infrastructure of checks and balances. The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island.
[AI-124] AEGIS: From Clues to Verdicts – Graph-Guided Deep Vulnerability Reasoning via Dialectics and Meta-Auditing
【速读】: This paper addresses unsound reasoning in LLM-based vulnerability detection. The root cause, shared by both major mitigation paradigms (agent-based debate and retrieval augmentation), is deliberation in an unbounded reasoning space that lacks a hypothesis-specific evidence base grounded in the code's data-flow topology, leading models to fabricate cross-function dependencies or fall back on generic knowledge, so that verdicts are driven by rhetorical persuasiveness rather than verifiable facts. The key of the proposed AEGIS framework is a "From Clue to Verdict" philosophy that shifts detection from ungrounded speculation to forensic verification over a closed factual substrate: it first identifies suspicious code anomalies (clues), then dynamically reconstructs per-variable dependency chains for each clue via on-demand slicing over a repository-level Code Property Graph. Within this closed evidence boundary, a Verifier Agent constructs competing dialectical arguments for and against exploitability, while an independent Audit Agent scrutinizes every claim against the trace, with veto power to prevent hallucinated verdicts. On the rigorous PrimeVul benchmark, AEGIS achieves 122 pairwise correct predictions, the first result above 100, reduces the false positive rate by up to 54.40% relative to leading baselines, and costs on average $0.09 per sample without any task-specific training.
链接: https://arxiv.org/abs/2603.20637
作者: Sen Fang,Weiyuan Ding,Zhezhen Cao,Zhou Yang,Bowen Xu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 29 pages, 6 figures, 3 tables
Abstract:Large Language Models (LLMs) are increasingly adopted for vulnerability detection, yet their reasoning remains fundamentally unsound. We identify a root cause shared by both major mitigation paradigms (agent-based debate and retrieval augmentation): reasoning in an ungrounded deliberative space that lacks a bounded, hypothesis-specific evidence base. Without such grounding, agents fabricate cross-function dependencies, and retrieval heuristics supply generic knowledge decoupled from the repository’s data-flow topology. Consequently, the resulting conclusions are driven by rhetorical persuasiveness rather than verifiable facts. To ground this deliberation, we present AEGIS, a novel multi-agent framework that shifts detection from ungrounded speculation to forensic verification over a closed factual substrate. Guided by a “From Clue to Verdict” philosophy, AEGIS first identifies suspicious code anomalies (clues), then dynamically reconstructs per-variable dependency chains for each clue via on-demand slicing over a repository-level Code Property Graph. Within this closed evidence boundary, a Verifier Agent constructs competing dialectical arguments for and against exploitability, while an independent Audit Agent scrutinizes every claim against the trace, exercising veto power to prevent hallucinated verdicts. Evaluation on the rigorous PrimeVul dataset demonstrates that AEGIS establishes a new state-of-the-art, achieving 122 Pair-wise Correct Predictions. To our knowledge, this is the first approach to surpass 100 on this benchmark. It reduces the false positive rate by up to 54.40% compared to leading baselines, at an average cost of $0.09 per sample without any task-specific training.
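The per-variable dependency-chain step in the abstract is essentially a backward slice over a dependency graph. A minimal sketch (the toy adjacency map and variable names below are invented for illustration; AEGIS operates on a full Code Property Graph):

```python
from collections import deque

def backward_slice(deps, clue_var):
    """Collect every variable the clue transitively depends on.
    deps maps a variable to the set of variables it is computed from."""
    seen, frontier = {clue_var}, deque([clue_var])
    while frontier:
        var = frontier.popleft()
        for parent in deps.get(var, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

# Toy data-flow: buf <- size <- header <- packet; "other" is unrelated.
deps = {"buf": {"size"}, "size": {"header"}, "header": {"packet"}, "other": {"cfg"}}
print(sorted(backward_slice(deps, "buf")))  # ['buf', 'header', 'packet', 'size']
```

The slice forms the closed evidence boundary within which the Verifier and Audit agents argue.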
[AI-125] CFNN: Continued Fraction Neural Network
【速读】: This paper targets a fundamental challenge in scientific computing: accurately characterizing non-linear functional manifolds with singularities, where Multi-Layer Perceptrons (MLPs) suffer from spectral bias and need excessive parameters to resolve high-curvature features. The key of the proposed Continued Fraction Neural Networks (CFNNs) is to integrate continued fractions with gradient-based optimization, introducing a "rational inductive bias" that captures complex asymptotics and discontinuities with extreme parameter frugality, together with formal approximation error bounds and stability guarantees.
链接: https://arxiv.org/abs/2603.20634
作者: Chao Wang,Xuancheng Zhou,Ruilin Hou,Xiaoyu Cheng,Ruiyi Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurately characterizing non-linear functional manifolds with singularities is a fundamental challenge in scientific computing. While Multi-Layer Perceptrons (MLPs) dominate, their spectral bias hinders resolving high-curvature features without excessive parameters. We introduce Continued Fraction Neural Networks (CFNNs), integrating continued fractions with gradient-based optimization to provide a “rational inductive bias.” This enables capturing complex asymptotics and discontinuities with extreme parameter frugality. We provide formal approximation bounds demonstrating exponential convergence and stability guarantees. To address recursive instability, we develop three implementations: CFNN-Boost, CFNN-MoE, and CFNN-Hybrid. Benchmarks show CFNNs consistently outperform MLPs in precision with one to two orders of magnitude fewer parameters, exhibiting up to a 47-fold improvement in noise robustness and physical consistency. By bridging black-box flexibility and white-box transparency, CFNNs establish a reliable “grey-box” paradigm for AI-driven scientific research.
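For readers unfamiliar with the building block CFNNs are based on, a finite continued fraction can be evaluated by a simple backward recurrence; this sketch illustrates only the rational structure, not the paper's learned architecture:

```python
def eval_continued_fraction(b0, partial_pairs):
    """Evaluate b0 + a1/(b1 + a2/(b2 + ...)) by backward recurrence.
    partial_pairs lists (a_i, b_i) from outermost to innermost level."""
    value = 0.0
    for a, b in reversed(partial_pairs):
        value = a / (b + value)
    return b0 + value

# The golden ratio has the continued fraction 1 + 1/(1 + 1/(1 + ...)).
phi_approx = eval_continued_fraction(1.0, [(1.0, 1.0)] * 30)
print(phi_approx)  # close to (1 + sqrt(5)) / 2 ≈ 1.6180
```

Because the depth-k truncation is a rational function of its coefficients, poles come for free, which is the intuition behind the "rational inductive bias" for singularities.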
[AI-126] Seed1.8 Model Card: Towards Generalized Real-World Agency
【速读】: This paper addresses the limitations of current large language models (LLMs) in real-world use, namely extending from single-turn prediction to multi-turn interaction, tool use, and multi-step execution, with the core challenge of building a foundation model with generalized real-world agency that can complete complex tasks end to end. The key of Seed1.8 is to retain strong LLM and vision-language performance while introducing a unified agentic interface that integrates search, code generation and execution, and graphical user interface (GUI) interaction, and to enable low-latency, low-cost inference through configurable thinking modes and optimized visual encoding, thereby supporting diverse real-world applications.
链接: https://arxiv.org/abs/2603.20633
作者: Bytedance Seed
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present Seed1.8, a foundation model aimed at generalized real-world agency: going beyond single-turn prediction to multi-turn interaction, tool use, and multi-step execution. Seed1.8 keeps strong LLM and vision-language performance while supporting a unified agentic interface-search, code generation and execution, and GUI interaction. For deployment, it offers latency- and cost-aware inference, including configurable thinking modes and optimized visual encoding for images and video. We report evaluations on standard benchmarks and application-aligned workflows spanning foundational skills, multimodal understanding, and agentic behavior. Seed1.8 is released to support further research and development on interactive, real-world use cases.
[AI-127] Reasoning Traces Shape Outputs but Models Won't Say So
【速读】: This paper asks whether the reasoning traces produced by large reasoning models (LRMs) faithfully reflect what drives their outputs, and whether models honestly report the influences on their reasoning. To test this, the authors propose Thought Injection, which inserts synthetic reasoning snippets into a model's think trace and then measures whether the model follows the injected reasoning and acknowledges doing so. The key findings are twofold: injected hints reliably alter outputs, confirming that reasoning traces causally shape behavior; yet when asked to explain their changed answers, models refuse to disclose the injected influence in over 90% of samples, instead fabricating plausible but unrelated explanations. Activation analysis further shows that sycophancy- and deception-related directions are strongly activated during these fabrications, indicating systematic rather than incidental dishonesty. This reveals a gap between the reasoning models actually follow and the reasoning they report, warning that aligned-appearing explanations may not amount to genuine alignment.
链接: https://arxiv.org/abs/2603.20620
作者: Yijie Hao,Lingjie Chen,Ali Emami,Joyce Ho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Can we trust the reasoning traces that large reasoning models (LRMs) produce? We investigate whether these traces faithfully reflect what drives model outputs, and whether models will honestly report their influence. We introduce Thought Injection, a method that injects synthetic reasoning snippets into a model’s think trace, then measures whether the model follows the injected reasoning and acknowledges doing so. Across 45,000 samples from three LRMs, we find that injected hints reliably alter outputs, confirming that reasoning traces causally shape model behavior. However, when asked to explain their changed answers, models overwhelmingly refuse to disclose the influence: overall non-disclosure exceeds 90% for extreme hints across 30,000 follow-up samples. Instead of acknowledging the injected reasoning, models fabricate aligned-appearing but unrelated explanations. Activation analysis reveals that sycophancy- and deception-related directions are strongly activated during these fabrications, suggesting systematic patterns rather than incidental failures. Our findings reveal a gap between the reasoning LRMs follow and the reasoning they report, raising concern that aligned-appearing explanations may not be equivalent to genuine alignment.
[AI-128] Where can AI be used? Insights from a deep ontology of work activities
【速读】: This paper addresses the lack of a systematic framework for understanding where artificial intelligence (AI) can be applied across work activities. The authors build a comprehensive ontology from the roughly 20K work activities in the US Department of Labor's widely used O*NET occupational database, disaggregating and reorganizing them into a structure suited to quantitative analysis of AI applicability. The key of the solution is to use this ontology to classify descriptions of 13,275 AI software applications and a worldwide tally of 20.8 million robotic systems, and then map how the units and market value of AI systems are distributed across work activities. The analysis finds AI market value highly concentrated in information-based activities (72%, with creating information at 36%), while interactive activities, which span information-based and physical work, account for 48% of market value, much of it (26%) involving transferring information. The framework thus enables fine-grained predictions of where current and future AI can be used.
链接: https://arxiv.org/abs/2603.20619
作者: Alice Cai,Iman YeckehZaare,Shuo Sun,Vasiliki Charisi,Xinru Wang,Aiman Imran,Robert Laubacher,Alok Prakash,Thomas W. Malone
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Artificial intelligence (AI) is poised to profoundly reshape how work is executed and organized, but we do not yet have deep frameworks for understanding where AI can be used. Here we provide a comprehensive ontology of work activities that can help systematically analyze and predict uses of AI. To do this, we disaggregate and then substantially reorganize the approximately 20K activities in the US Department of Labor’s widely used O*NET occupational database. Next, we use this framework to classify descriptions of 13,275 AI software applications and a worldwide tally of 20.8 million robotic systems. Finally, we use the data about both these kinds of AI to generate graphical displays of how the estimated units and market values of all worldwide AI systems used today are distributed across the work activities that these systems help perform. We find a highly uneven distribution of AI market value across activities, with the top 1.6% of activities accounting for over 60% of AI market value. Most of the market value is used in information-based activities (72%), especially creating information (36%), and only 12% is used in physical activities. Interactive activities include both information-based and physical activities and account for 48% of AI market value, much of which (26%) involves transferring information. These results can be viewed as rough predictions of the AI applicability for all the different work activities down to very low levels of detail. Thus, we believe this systematic framework can help predict at a detailed level where today’s AI systems can and cannot be used and how future AI capabilities may change this.
[AI-129] Graph-based data-driven discovery of interpretable laws governing corona-induced noise and radio interference for high-voltage transmission lines
【速读】: This paper targets accurate prediction of corona-induced audible noise (AN) and radio interference (RI) on ultrahigh-voltage (UHV) AC transmission lines, a key environmental-compliance obstacle to large-scale UHV deployment. Existing engineering practice relies on fixed log-linear empirical formulas that cannot capture nonlinear interactions and generalize poorly. The key of the solution is a monotonicity-constrained graph symbolic discovery framework (Mono-GraphMD) that mines compact, interpretable closed-form physical laws from data, explicitly revealing the nonlinear coupling among surface gradient, bundle number, and conductor diameter, and enabling accurate prediction of corona noise and interference from laboratory corona-cage data through to real multicountry UHV lines with up to 16-bundle conductors.
链接: https://arxiv.org/abs/2603.20600
作者: Hao Xu,Yuntian Chen,Chongqing Kang,Dongxiao Zhang
机构: 未知
类目: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:
Abstract:The global shift towards renewable energy necessitates the development of ultrahigh-voltage (UHV) AC transmission to bridge the gap between remote energy sources and urban demand. While UHV grids offer superior capacity and efficiency, their implementation is often hindered by corona-induced audible noise (AN) and radio interference (RI). Since these emissions must meet strict environmental compliance standards, accurate prediction is vital for the large-scale deployment of UHV infrastructure. Existing engineering practices often rely on empirical laws, in which fixed log-linear structures limit accuracy and extrapolation. Herein, we present a monotonicity-constrained graph symbolic discovery framework, Mono-GraphMD, which uncovers compact, interpretable laws for corona-induced AN and RI. The framework provides mechanistic insight into how nonlinear interactions among the surface gradient, bundle number and diameter govern high-field emissions and enables accurate predictions for both corona-cage data and multicountry real UHV lines with up to 16-bundle conductors. Unlike black-box models, the discovered closed-form laws are highly portable and interpretable, allowing for rapid predictions when applied to various scenarios, thereby facilitating the engineering design process.
[AI-130] MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning ICML2025
【速读】: This paper addresses the memory overhead and compute bottleneck of maintaining and attending to Key/Value (KV) caches in long-context language modeling, where KV storage cost and attention complexity grow rapidly with context length and dominate both training and inference. The key of the proposed Memory-Keyed Attention (MKA) is a hierarchical attention mechanism that dynamically routes attention across local, session-level, and long-term KV caches for more efficient memory use and representation retention. The further proposed Route-Fused MKA (FastMKA) uses broadcast routing to fuse memory sources before attention computation, achieving perplexity comparable to Multi-Latent Attention (MLA) while delivering up to 5x faster training throughput and 1.8x lower inference latency.
链接: https://arxiv.org/abs/2603.20586
作者: Dong Liu,Yanxuan Yu,Ben Lengerich,Ying Nian Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the ACM Computing Frontiers 2026 Conference and the ICML 2025 Long Context Modeling Workshop
Abstract:As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable perplexity to MLA while achieving up to 5x faster training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
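A minimal scalar sketch of the routing idea, assuming nothing beyond the abstract: each KV source is attended separately and the outputs are mixed by softmax routing weights. Real MKA learns the routing over vector-valued caches; the toy `attend` function and all numbers here are illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Toy scalar attention over one memory source."""
    weights = softmax([query * k for k in keys])
    return sum(w * v for w, v in zip(weights, values))

def memory_keyed_attention(query, sources, route_scores):
    """Route attention across hierarchical KV sources (e.g. local, session,
    long-term): attend to each source, then mix outputs by routing weights."""
    outputs = [attend(query, ks, vs) for ks, vs in sources]
    gates = softmax(route_scores)
    return sum(g * o for g, o in zip(gates, outputs))
```

FastMKA's broadcast variant would instead fuse the sources once before a single attention call, trading routing granularity for speed.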
[AI-131] Context Cartography: Toward Structured Governance of Contextual Space in Large Language Model Systems
【速读】: This paper challenges the assumption that simply expanding large language model (LLM) context windows improves performance, arguing that under transformer architectures the contextual space exhibits intrinsic limits such as structural gradients, salience asymmetries, and entropy accumulation. The key of the proposed Context Cartography framework is a tripartite zonal model of black fog (unobserved), gray fog (stored memory), and the visible field (active reasoning surface), together with seven formalized cartographic operators (reconnaissance, selection, simplification, aggregation, projection, displacement, and layering) that explicitly govern how information moves between and within zones. These operators are shown to be necessary compensations for linear prefix memory, append-only state updates, and entropy accumulation under expanding context, providing a testable theoretical basis and a diagnostic benchmark for understanding and improving LLM reasoning.
链接: https://arxiv.org/abs/2603.20578
作者: Zihua Wu,Georg Gartner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages, 2 figures
Abstract:The prevailing approach to improving large language model (LLM) reasoning has centered on expanding context windows, implicitly assuming that more tokens yield better performance. However, empirical evidence - including the “lost in the middle” effect and long-distance relational degradation - demonstrates that contextual space exhibits structural gradients, salience asymmetries, and entropy accumulation under transformer architectures. We introduce Context Cartography, a formal framework for the deliberate governance of contextual space. We define a tripartite zonal model partitioning the informational universe into black fog (unobserved), gray fog (stored memory), and the visible field (active reasoning surface), and formalize seven cartographic operators - reconnaissance, selection, simplification, aggregation, projection, displacement, and layering - as transformations governing information transitions between and within zones. The operators are derived from a systematic coverage analysis of all non-trivial zone transformations and are organized by transformation type (what the operator does) and zone scope (where it applies). We ground the framework in the salience geometry of transformer attention, characterizing cartographic operators as necessary compensations for linear prefix memory, append-only state, and entropy accumulation under expanding context. An analysis of four contemporary systems (Claude Code, Letta, MemOS, and OpenViking) provides interpretive evidence that these operators are converging independently across the industry. We derive testable predictions from the framework - including operator-specific ablation hypotheses - and propose a diagnostic benchmark for empirical validation. 
[AI-132] LLM-Driven Heuristic Synthesis for Industrial Process Control: Lessons from Hot Steel Rolling
【速读】: This paper addresses the need for interpretable and auditable policies in industrial process control, a requirement that black-box neural policies struggle to meet. The key of the proposed LLM-driven heuristic synthesis framework is to iteratively generate and refine human-readable Python controllers using rich behavioral feedback from a physics-based simulator, searching over control logic for height reduction, interpass time, and rolling velocity. The paper further contributes an auditable controller-synthesis pipeline in which generated controllers are explicit programs open to expert review, paired with an automated audit that formally verifies key safety and monotonicity properties. A second key contribution is a Luby-style universal restart strategy for budget allocation that needs no problem-specific tuning yet approaches the hindsight-optimal allocation derived from extensive manual experimentation.
链接: https://arxiv.org/abs/2603.20537
作者: Nima H. Siboni,Seyedreza Kiamousavi,Emad Scharifi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Industrial process control demands policies that are interpretable and auditable, requirements that black-box neural policies struggle to meet. We study an LLM-driven heuristic synthesis framework for hot steel rolling, in which a language model iteratively proposes and refines human-readable Python controllers using rich behavioral feedback from a physics-based simulator. The framework combines structured strategic ideation, executable code generation, and per-component feedback across diverse operating conditions to search over control logic for height reduction, interpass time, and rolling velocity. Our first contribution is an auditable controller-synthesis pipeline for industrial process control. The generated controllers are explicit programs accessible to expert review, and we pair them with an automated audit pipeline that formally verifies key safety and monotonicity properties for the best synthesized heuristic. Our second contribution is a principled budget allocation strategy for LLM-driven heuristic search: we show that Luby-style universal restarts – originally developed for randomized algorithms – transfer directly to this setting, eliminating the need for problem-specific budget tuning. A single 160-iteration Luby campaign approaches the hindsight-optimal budget allocation derived from 52 ad-hoc runs totalling 730 iterations.
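The Luby restart sequence mentioned in the abstract is a standard construction (1, 1, 2, 1, 1, 2, 4, ...). A sketch of how it could assign iteration budgets; the base unit and cap below are illustrative choices, not values from the paper:

```python
def luby(i: int) -> int:
    """i-th term (1-indexed) of the Luby restart sequence: 1,1,2,1,1,2,4,..."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if (1 << k) - 1 == i:
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)

# Scale each term by a base unit and stop at a total iteration cap.
base, cap = 8, 160
budgets, total, i = [], 0, 1
while total + base * luby(i) <= cap:
    budgets.append(base * luby(i))
    total += budgets[-1]
    i += 1
print(budgets)  # [8, 8, 16, 8, 8, 16, 32, 8, 8, 16, 8, 8, 16]
```

Luby-style schedules are universal in the sense that they are within a logarithmic factor of any fixed restart policy, which is why no problem-specific tuning is needed.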
[AI-133] An Industrial-Scale Retrieval-Augmented Generation Framework for Requirements Engineering: Empirical Evaluation with Automotive Manufacturing Data
【速读】: This paper addresses the challenges that heterogeneous, unstructured documentation poses for requirements engineering (RE) in Industry 4.0, especially extracting and managing requirements from technical specifications, supplier lists, and compliance standards in automotive manufacturing. The key of the solution is an automated framework based on retrieval-augmented generation (RAG) that combines hybrid semantic-lexical retrieval for high-accuracy requirement extraction (98.2%) with multi-provider large language model (LLM) orchestration for efficiency and cost-effectiveness. Evaluated on authentic industrial data, the approach reduces manual analysis time by 83%, cuts costs by 47%, and surfaces compliance risks such as 10 legacy suppliers requiring requalification.
链接: https://arxiv.org/abs/2603.20534
作者: Muhammad Khalid,Yilmaz Uygun
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, submitted to RE 2026
Abstract:Requirements engineering in Industry 4.0 faces critical challenges with heterogeneous, unstructured documentation spanning technical specifications, supplier lists, and compliance standards. While retrieval-augmented generation (RAG) shows promise for knowledge-intensive tasks, no prior work has evaluated RAG on authentic industrial RE workflows using comprehensive production-grade performance metrics. This paper presents a comprehensive empirical evaluation of RAG for industrial requirements engineering automation using authentic automotive manufacturing documentation comprising 669 requirements across four specification standards (MBN 9666-1, MBN 9666-2, BQF 9666-5, MBN 9666-9) spanning 2015-2023, plus 49 supplier qualifications with extensive supporting documentation. Through controlled comparisons with BERT-based and ungrounded LLM approaches, the framework achieves 98.2% extraction accuracy with complete traceability, outperforming baselines by 24.4% and 19.6%, respectively. Hybrid semantic-lexical retrieval achieves MRR of 0.847. Expert quality assessment averaged 4.32/5.0 across five dimensions. The evaluation demonstrates 83% reduction in manual analysis time and 47% cost savings through multi-provider LLM orchestration. Ablation studies quantify individual component contributions. Longitudinal analysis reveals a 55% reduction in requirement volume coupled with 1,800% increase in IT security focus, identifying 10 legacy suppliers (20.4%) requiring requalification, representing potential 2.3M in avoided contract penalties.
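The retrieval metric reported above, Mean Reciprocal Rank (MRR), averages the reciprocal rank of the first relevant hit per query. A minimal sketch with invented document ids:

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries: rankings[i] is an ordered result list for query i,
    relevant[i] is the set of relevant document ids for that query."""
    total = 0.0
    for results, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

# First query hits at rank 1, second at rank 2: MRR = (1 + 0.5) / 2 = 0.75.
print(mean_reciprocal_rank([["d1", "d2"], ["d3", "d1"]], [{"d1"}, {"d1"}]))
```

An MRR of 0.847, as reported, means the first relevant document typically appears at or very near rank 1.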
[AI-134] Does This Gradient Spark Joy?
【速读】: This paper tackles the computational inefficiency of policy gradient methods, which run an expensive backward pass for every sample even though most samples carry little learning value. The key of the solution is a novel sample-screening mechanism, the "Kondo gate": it compares a forward-pass signal called "delight" (the product of advantage and surprisal, i.e., negative log-probability) against a compute price, and triggers a backward pass only when a sample's delight exceeds that price, tracing a quality-cost Pareto frontier. Experiments show the gate skips most backward passes while retaining nearly all of the policy gradient's learning quality, with gains that grow on harder tasks; because delight tolerates approximation, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.
链接: https://arxiv.org/abs/2603.20526
作者: Ian Osband
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: delight, the product of advantage and surprisal (negative log-probability). We introduce the Kondo gate, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality–cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG’s learning quality, with gains that grow as problems get harder and backward passes become more expensive. Because the gate tolerates approximate delight, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.
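The gating rule described above reduces to a one-line comparison computable from the forward pass. A minimal sketch under the abstract's definitions (the numeric examples are invented):

```python
import math

def delight(advantage: float, log_prob: float) -> float:
    """Delight = advantage x surprisal; both come from the forward pass."""
    surprisal = -log_prob
    return advantage * surprisal

def kondo_gate(advantage: float, log_prob: float, price: float) -> bool:
    """Pay for a backward pass only when the sample's delight exceeds the price."""
    return delight(advantage, log_prob) > price

# A surprising success (high advantage, low probability) is worth the compute;
# an unsurprising, low-advantage sample is skipped.
print(kondo_gate(advantage=2.0, log_prob=math.log(0.01), price=1.0))  # True
print(kondo_gate(advantage=0.1, log_prob=math.log(0.9), price=1.0))   # False
```

Note that a high-surprisal failure (negative advantage) yields strongly negative delight and is also skipped, which is the sign-aware behavior the companion DG paper exploits.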
[AI-135] Delightful Distributed Policy Gradient
【速读】: This paper addresses inefficient learning in distributed reinforcement learning from high-surprisal (negative log-probability) data produced by stale, buggy, or mismatched actors. The core difficulty is not the surprising data itself but learning effectively from it: high-surprisal failures can dominate the update direction while carrying little useful signal, whereas high-surprisal successes reveal opportunities the current policy would otherwise miss. The key of the proposed Delightful Policy Gradient (DG) is to gate each update with "delight", the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without requiring behavior probabilities. Under contaminated sampling, DG maintains higher cosine similarity with the true gradient as the policy improves, and when several practical frictions (staleness, actor bugs, reward corruption, and rare discovery) act simultaneously, it shows order-of-magnitude performance and compute advantages over standard importance-weighted policy gradient.
链接: https://arxiv.org/abs/2603.20521
作者: Ian Osband
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner’s policy. The core difficulty is not surprising data per se, but negative learning from surprising data. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The Delightful Policy Gradient (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without behavior probabilities. Under contaminated sampling, the cosine similarity between the standard policy gradient and the true gradient collapses, while DG’s grows as the policy improves. No sign-blind reweighting, including exact importance sampling, can reproduce this effect. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG achieves roughly 10x lower error. When all four frictions act simultaneously, its compute advantage is order-of-magnitude and grows with task complexity.
[AI-136] Grounded Chess Reasoning in Language Models via Master Distillation
【速读】: This paper addresses language models' lack of grounded reasoning in specialized domains where training data is scarce, with chess as a canonical case of complex strategic reasoning where LLMs underperform. The key of the proposed Master Distillation framework is to distill not just an expert system's final outputs but its full reasoning process, converting it into natural language chain-of-thought explanations so that compact models acquire domain expertise and the ability to produce faithful, grounded explanations. Validated on chess, the 4B-parameter model C1 improves from a near-zero baseline to 48.1% accuracy, outperforming all open-source and most frontier proprietary models while generating explainable solutions in far fewer tokens, demonstrating an effective recipe for injecting expert-level knowledge into compact models.
链接: https://arxiv.org/abs/2603.20510
作者: Zhenwei Tang,Qianfeng Wen,Seth Grief-Albert,Yahya Elgabra,Blair Yang,Honghua Dong,Ashton Anderson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Language models often lack grounded reasoning capabilities in specialized domains where training data is scarce but bespoke systems excel. We introduce a general framework for distilling expert system reasoning into natural language chain-of-thought explanations, enabling compact models to acquire domain expertise and the ability to generate faithful, grounded explanations. Rather than distilling only final outputs, we capture the full reasoning process, transforming opaque expert computations into transparent, step-by-step explanations. We demonstrate this approach in chess, a canonical reasoning domain where language models continue to underperform. Our 4B parameter model, C1, advances from a near-zero baseline to 48.1% accuracy, outperforming all open-source models and most frontier proprietary systems. Notably, C1 surpasses its distillation teacher and generates solutions in two orders of magnitude fewer tokens than baselines. Unlike prior neural chess approaches that predict only best moves, C1 generates explainable solutions revealing strategic reasoning. Our pipeline combines supervised fine-tuning and reinforcement learning with theme-balanced data sampling for comprehensive tactical coverage. Master Distillation demonstrates how to inject expert-level knowledge into compact models for under-optimized domains, offering a recipe for unlocking RLVR where LLMs lack sufficient base capabilities.
[AI-137] Efficient Counterfactual Reasoning in ProbLog via Single World Intervention Programs
【速读】:该论文旨在解决在概率逻辑编程(Probabilistic Logic Programming, PLP)语言如ProbLog中集成反事实推理(counterfactual reasoning)所面临的计算开销大且精度不稳定的问题。其解决方案的关键在于提出一种高效的程序变换方法,将反事实推理形式化为单世界干预程序(Single World Intervention Programs, SWIPs),通过系统性地将ProbLog规则拆分为与观测值相关的部分和固定不变的部分,生成一个结构更简单的转换后程序,从而将反事实推理简化为对该简化程序的边缘推断(marginal inference)。该方法在保证不增加渐近计算复杂度的同时,在常见情况下显著减小了计算规模,并通过较弱的条件独立假设实现了对结构性因果模型(Structural Causal Model)下反事实分布的准确建模,实验表明推理时间平均减少35%。
链接: https://arxiv.org/abs/2603.20505
作者: Saimun Habib,Vaishak Belle,Fengxiang He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Probabilistic Logic Programming (PLP) languages, like ProbLog, naturally support reasoning under uncertainty, while maintaining a declarative and interpretable framework. Meanwhile, counterfactual reasoning (i.e., answering "what if" questions) is critical for ensuring AI systems are robust and trustworthy; however, integrating this capability into PLP can be computationally prohibitive and unstable in accuracy. This paper addresses this challenge by proposing an efficient program transformation for counterfactuals as Single World Intervention Programs (SWIPs) in ProbLog. By systematically splitting ProbLog clauses into observed and fixed components relevant to a counterfactual, we create a transformed program that (1) does not asymptotically exceed the computational complexity of existing methods, and is strictly smaller in common cases, and (2) reduces counterfactual reasoning to marginal inference over a simpler program. We formally prove the correctness of our approach, which relies on a weaker set of independence assumptions and is consistent with conditional independencies, showing the resulting marginal probabilities match the counterfactual distributions of the underlying Structural Causal Model in wide domains. Our method achieves a 35% reduction in inference time versus existing methods in extensive experiments. This work makes complex counterfactual reasoning more computationally tractable and reliable, providing a crucial step towards developing more robust and explainable AI systems. The code is at this https URL.
[AI-138] Meeting in the Middle: A Co-Design Paradigm for FHE and AI Inference
【速读】:该论文旨在解决现代云推理中的双向隐私问题:用户在向云服务提供商提交敏感输入时面临数据泄露风险,而提供商在潜在不安全的执行环境中运行私有模型权重时也存在模型信息泄露的风险。尽管全同态加密(Fully Homomorphic Encryption, FHE)提供了理论上安全的加密计算保障,但其高昂的计算开销使其难以应用于现代深度学习架构。论文提出通过软硬件协同设计(co-design)来突破这一瓶颈,关键在于两个方向的优化:一是针对推理电路的静态结构专门化FHE方案与编译器设计,二是约束推理架构以降低FHE中主导的计算成本因素。作者提出了一个“中间相遇”(meet in the middle)的研究路线,并明确了在两个维度上的具体优化目标。
链接: https://arxiv.org/abs/2603.20504
作者: Bernardo Magri,Benjamin Marsh,Paul Gebheim
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted to AICrypt 2026
Abstract:Modern cloud inference creates a two-sided privacy problem: users reveal sensitive inputs to providers, while providers must execute proprietary model weights inside potentially leaky execution environments. Fully homomorphic encryption (FHE) offers cryptographic guarantees but remains prohibitively expensive for modern architectures. We argue that progress requires co-design: specializing FHE schemes and compilers for the static structure of inference circuits, while simultaneously constraining inference architectures to reduce dominant homomorphic cost drivers. We outline a meet-in-the-middle agenda and concrete optimization targets on both axes.
[AI-139] DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation CVPR
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成领域中,现有模型融合方法难以充分挖掘和利用在线海量专家模型资源,且无法灵活适配多样化的实际用户需求的问题。解决方案的关键在于提出DiffGraph——一个基于图结构的代理驱动型模型融合框架,其通过节点注册与校准机制构建可扩展的专家图谱,并根据用户需求动态激活特定子图,从而实现不同专家模型的灵活组合,以满足多样化生成任务的目标。
链接: https://arxiv.org/abs/2603.20470
作者: Zhuoling Li,Hossein Rahmani,Jiarui Zhang,Yu Xue,Majid Mirmehdi,Jason Kuen,Jiuxiang Gu,Jun Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CVPR
Abstract:The rapid growth of the text-to-image (T2I) community has fostered a thriving online ecosystem of expert models, which are variants of pretrained diffusion models specialized for diverse generative abilities. Yet, existing model merging methods remain limited in fully leveraging abundant online expert resources and still struggle to meet diverse in-the-wild user needs. We present DiffGraph, a novel agent-driven graph-based model merging framework, which automatically harnesses online experts and flexibly merges them for diverse user needs. Our DiffGraph constructs a scalable graph and organizes ever-expanding online experts within it through node registration and calibration. Then, DiffGraph dynamically activates specific subgraphs based on user needs, enabling flexible combinations of different experts to achieve user-desired generation. Extensive experiments show the efficacy of our method.
[AI-140] Solver-Aided Verification of Policy Compliance in Tool-Augmented LLM Agents
【速读】:该论文旨在解决工具增强型大语言模型(Tool-augmented Large Language Models, TaLLMs)在敏感应用场景中难以确保工具调用行为符合领域特定操作策略的问题。现有方法仅通过将策略描述注入上下文来引导模型遵循规则,但无法提供违反行为的保障。解决方案的关键在于引入基于SMT求解器的框架,在运行时对计划的工具调用进行形式化验证:首先采用LLM辅助的人工引导方式将自然语言定义的工具使用策略转化为SMT-LIB-2.0格式的形式逻辑约束,再利用Z3求解器在工具调用前检查这些约束是否满足,从而阻止违反策略的调用。此方法显著减少了策略违规,同时保持了任务整体准确性,表明形式化推理可有效提升TaLLM执行中的合规性与可靠性。
链接: https://arxiv.org/abs/2603.20449
作者: Cailin Winston,Claris Winston,René Just
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-augmented Large Language Models (TaLLMs) extend LLMs with the ability to invoke external tools, enabling them to interact with real-world environments. However, a major limitation in deploying TaLLMs in sensitive applications such as customer service and business process automation is a lack of reliable compliance with domain-specific operational policies regarding tool-use and agent behavior. Current approaches merely steer LLMs to adhere to policies by including policy descriptions in the LLM context, but these provide no guarantees that policy violations will be prevented. In this paper, we introduce an SMT solver-aided framework to enforce tool-use policy compliance in TaLLM agents. Specifically, we use an LLM-assisted, human-guided approach to translate natural-language-specified tool-use policies into formal logic (SMT-LIB-2.0) constraints over agent-observable state and tool arguments. At runtime, planned tool calls are intercepted and checked against the constraints using the Z3 solver as a pre-condition to the tool call. Tool invocations that violate the policy are blocked. We evaluated on the TauBench benchmark and demonstrate that solver-aided policy checking reduces policy violations while maintaining overall task accuracy. These results suggest that integrating formal reasoning into TaLLM execution can improve tool-call policy compliance and overall reliability.
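该论文的核心模式是:在工具调用发生前拦截请求,把策略检查作为前置条件,违规调用直接阻止(论文中该检查由 Z3 对 SMT-LIB-2.0 形式约束求解完成)。下面用纯 Python 谓词示意这一拦截模式;其中的 refund 工具、字段名与具体策略均为假设性示例,并非论文原文。

```python
def check_policy(tool_name: str, args: dict, state: dict) -> bool:
    """假设性的工具使用策略:退款金额不得超过订单金额,且订单须已确认。
    论文中该约束以 SMT-LIB-2.0 形式表达并交由 Z3 求解,此处用谓词示意。"""
    if tool_name == "refund":
        return state["order_confirmed"] and args["amount"] <= state["order_total"]
    return True

def guarded_call(tool_name: str, args: dict, state: dict, tools: dict) -> dict:
    # 调用前拦截:策略检查作为工具调用的前置条件,违规调用被阻止
    if not check_policy(tool_name, args, state):
        return {"blocked": True, "reason": "policy violation"}
    return {"blocked": False, "result": tools[tool_name](**args)}
```

这种“检查后放行”的结构与摘要中 runtime 拦截计划内工具调用的描述一致;真实实现中谓词由求解器对代理可观测状态与工具参数的形式约束判定。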
[AI-141] Detecting Neurovascular Instability from Multimodal Physiological Signals Using Wearable-Compatible Edge AI: A Responsible Computational Framework
【速读】:该论文旨在解决神经血管不稳定性(Neurovascular Instability, NVI)在结构性卒中病理发生前难以被现有单模态可穿戴设备检测的问题。NVI 是脑血管自身调节功能紊乱的早期表现,是卒中发生的潜在预警信号,但当前技术无法实现连续、无创的早期识别。解决方案的关键在于提出 Melaguard,一个轻量级多模态机器学习框架(Transformer-lite,1.2M 参数,4头自注意力机制),通过融合心率变异性(HRV)、外周灌注指数、血氧饱和度(SpO₂)及双侧相位一致性等生理信号,生成综合的 NVI 评分,并支持边缘计算推理(Cortex-M4 上最坏情况执行时间 WCET = 4 ms)。该方法在多个独立验证阶段均表现出优于传统单模态模型(如 LSTM、SVM 等)的性能,尤其在临床队列和 PPG 数据上实现了高 AUC 值(最高达 0.923),证明了多模态融合对早期卒中风险预测的有效性。
链接: https://arxiv.org/abs/2603.20442
作者: Truong Quynh Hoa,Hoang Dinh Cuong,Truong Xuan Khanh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures, 6 tables. Submitted to IEEE JBHI. Code: this https URL
Abstract:We propose Melaguard, a multimodal ML framework (Transformer-lite, 1.2M parameters, 4-head self-attention) for detecting neurovascular instability (NVI) from wearable-compatible physiological signals prior to structural stroke pathology. The model fuses heart rate variability (HRV), peripheral perfusion index, SpO2, and bilateral phase coherence into a composite NVI Score, designed for edge inference (WCET =4 ms on Cortex-M4). NVI - the pre-structural dysregulation of cerebrovascular autoregulation preceding overt stroke - remains undetectable by existing single-modality wearables. With 12.2 million incident strokes annually, continuous multimodal physiological monitoring offers a practical path to community-scale screening. Three-stage independent validation: (1) synthetic benchmark (n=10,000), AUC=0.88 [0.83-0.92]; (2) clinical cohort PhysioNet CVES (n=172; 84 stroke, 88 control) - Transformer-lite achieves AUC=0.755 [0.630-0.778], outperforming LSTM (0.643), Random Forest (0.665), SVM (0.472); HRV-SDNN discriminates stroke (p=0.011); (3) PPG pipeline PhysioNet BIDMC (n=53) – pulse rate r=0.748 and HRV surrogate r=0.690 vs. ECG ground truth. Cross-modality validation on PPG-BP (n=219) confirms PPG morphology classifies cerebrovascular disease at AUC=0.923 [0.869-0.968]. Multimodal fusion consistently outperforms single-modality baselines. Code: this https URL
[AI-142] Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在从临床笔记中提取结构化信息时,难以捕捉变量间复杂依赖关系而导致临床不一致输出的问题。解决方案的关键在于提出“深度反思推理”(deep reflective reasoning)框架,该框架通过迭代式自我批判与修正机制,持续校验变量间的一致性、输入文本内容及检索到的领域知识,直至输出收敛,从而显著提升结构化数据提取的准确性和临床一致性。
链接: https://arxiv.org/abs/2603.20435
作者: Jingwei Huang,Kuroush Nezafati,Zhikai Chi,Ruichen Rong,Colin Treager,Tingyi Wanyan,Yueshuang Xu,Xiaowei Zhan,Patrick Leavey,Guanghua Xiao,Wenqi Shi,Yang Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 figures and 2 tables
Abstract:Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)-based extraction pipelines often struggle to capture these dependencies, leading to clinically inconsistent outputs. We propose deep reflective reasoning, a large language model agent framework that iteratively self-critiques and revises structured outputs by checking consistency among variables, the input text, and retrieved domain knowledge, stopping when outputs converge. We extensively evaluate the proposed method in three diverse oncology applications: (1) On colorectal cancer synoptic reporting from gross descriptions (n=217), reflective reasoning improved average F1 across eight categorical synoptic variables from 0.828 to 0.911 and increased mean correct rate across four numeric variables from 0.806 to 0.895; (2) On Ewing sarcoma CD99 immunostaining pattern identification (n=200), the accuracy improved from 0.870 to 0.927; (3) On lung cancer tumor staging (n=100), tumor stage accuracy improved from 0.680 to 0.833 (pT: 0.842 - 0.884; pN: 0.885 - 0.948). The results demonstrate that deep reflective reasoning can systematically improve the reliability of LLM-based structured data extraction under interdependence constraints, enabling more consistent machine-operable clinical datasets and facilitating knowledge discovery with machine learning and data science towards digital health.
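摘要描述的“深度反思推理”本质上是一个批判—修订—收敛的循环:反复检查变量间一致性并修订输出,直到不再产生变化。下面是该控制流的最小 Python 草图;extract 与 critique 在论文中由 LLM 结合检索到的领域知识实现,示例中的病理分期字段名仅为假设。

```python
def reflective_extract(note: str, extract, critique, max_iters: int = 5):
    """迭代式自我批判:反复修订结构化输出,直到前后两轮结果一致(收敛)为止。
    extract/critique 以可调用对象示意,论文中由 LLM 实现。"""
    output = extract(note)
    for _ in range(max_iters):
        revised = critique(note, output)
        if revised == output:   # 收敛:自我批判不再产生修订
            break
        output = revised
    return output

# 玩具示例(字段名为假设):critique 发现缺失的 pN 分期并补全一次后收敛
def extract(note):
    return {"pT": "pT2", "pN": None}

def critique(note, out):
    return {"pT": "pT2", "pN": "pN0"} if out["pN"] is None else out
```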
[AI-143] Leveraging Natural Language Processing and Machine Learning for Evidence-Based Food Security Policy Decision-Making in Data-Scarce Regions
【速读】:该论文旨在解决数据稀缺地区粮食安全政策制定中的关键挑战,包括结构化数据匮乏、文本报告碎片化以及决策系统中的人口统计偏差问题。其解决方案的核心在于提出一个名为ZeroHungerAI的集成自然语言处理(Natural Language Processing, NLP)与机器学习(Machine Learning, ML)框架,通过迁移学习驱动的DistilBERT架构将结构化的社会经济指标与政策文本的上下文嵌入相结合,从而实现基于证据的粮食安全政策建模。该方法在25个行政区的1200样本混合数据集上验证了优越性能,尤其在类别不平衡条件下表现出高精度(91%分类准确率、F1分数0.86),并通过公平性优化将人口均等差异降至3%,确保城乡政策推断的公平性。
链接: https://arxiv.org/abs/2603.20425
作者: Karan Kumar Singh,Nikita Gajbhiye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 12 figures, 2 tables. Submitted for academic publication
Abstract:Food security policy formulation in data-scarce regions remains a critical challenge due to limited structured datasets, fragmented textual reports, and demographic bias in decision-making systems. This study proposes ZeroHungerAI, an integrated Natural Language Processing (NLP) and Machine Learning (ML) framework designed for evidence-based food security policy modeling under extreme data scarcity. The system combines structured socio-economic indicators with contextual policy text embeddings using a transfer-learning-based DistilBERT architecture. Experimental evaluation on a 1200-sample hybrid dataset across 25 districts demonstrates superior predictive performance, achieving 91 percent classification accuracy, 0.89 precision, 0.85 recall, and an F1 score of 0.86 under imbalanced conditions. Comparative analysis shows a 13 percent performance improvement over classical SVM and 17 percent over Logistic Regression models. Precision-Recall evaluation confirms robust minority-class detection (average precision around 0.88). Fairness-aware optimization reduces demographic parity difference to 3 percent, ensuring equitable rural-urban policy inference. The results validate that transformer-based contextual learning significantly enhances policy intelligence in low-resource governance environments, enabling scalable and bias-aware hunger prediction systems.
[AI-144] Meta-Learning for Repeated Bayesian Persuasion
【速读】:该论文旨在解决重复博弈场景下 Bayesian persuasion(贝叶斯劝说)问题,即如何在多个具有结构相似性的战略交互中,通过元学习(meta-learning)机制提升劝说策略的收敛效率。其关键解决方案是提出了一类名为 Meta-Persuasion 的算法框架,首次在 Online Bayesian Persuasion (OBP) 和 Markov Persuasion Process (MPP) 两种设置下建立了理论上的最优后悔率(regret bounds),并在自然的任务相似性假设下实现了比现有方法更优的收敛速度;同时,在任务序列任意选择时仍能保持标准单次博弈的性能保证,从而统一了元学习与传统劝说机制的优势。
链接: https://arxiv.org/abs/2603.20408
作者: Ata Poyraz Turna,Asrin Efe Yorulmaz,Tamer Başar
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: 40 pages
Abstract:Classical Bayesian persuasion studies how a sender influences receivers through carefully designed signaling policies within a single strategic interaction. In many real-world environments, such interactions are repeated across multiple games, creating opportunities to exploit structural similarity across tasks. In this work, we introduce Meta-Persuasion algorithms, establishing the first line of theoretical results for both full-feedback and bandit-feedback settings in the Online Bayesian Persuasion (OBP) and Markov Persuasion Process (MPP) frameworks. We show that our proposed meta-persuasion algorithms achieve provably sharper regret rates under natural notions of task similarity, improving upon the best-known convergence rates for both OBP and MPP. At the same time, they recover the standard single-game guarantees when the sequence of games is picked arbitrarily. Finally, we complement our theoretical analysis with numerical experiments that highlight our regret improvements and the benefits of meta-learning in repeated persuasion environments.
[AI-145] Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation
【速读】:该论文旨在解决语言模型(Language Models, LM)在独立训练后是否能形成几何上兼容的潜在表示空间,以及这种兼容性是否可用于推理阶段的行为修正而无需更新模型权重的问题。其解决方案的关键在于学习一个线性投影矩阵(Ridge projection),将大型教师模型(teacher model)的激活向量映射到小型学生模型(student model)的坐标系中,并在生成过程中通过替换学生模型残差流(residual stream)中的内部状态来实现干预。实验表明,该方法在多种架构组合下均能显著提升推理性能,且行为修正效果与几何对齐质量无关,揭示了不同推理域间存在特定子空间几何结构这一普遍特性。
链接: https://arxiv.org/abs/2603.20406
作者: Marcus Armstrong,Navid Ayoobi,Arjun Mukherjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate whether independently trained language models converge to geometrically compatible latent representations, and whether this compatibility can be exploited to correct model behavior at inference time without any weight updates. We learn a linear projection matrix that maps activation vectors from a large teacher model into the coordinate system of a smaller student model, then intervene on the student’s residual stream during generation by substituting its internal state with the translated teacher representation. Across a fully crossed experimental matrix of 20 heterogeneous teacher-student pairings spanning mixture-of-experts, dense, code-specialized, and synthetically trained architectures, the Ridge projection consistently achieves R^2 = 0.50 on verbal reasoning and R^2 = 0.40 on mathematical reasoning, collapsing to R^2 = -0.22 under permutation control and R^2 = 0.01 under L_1 regularization. Behavioral correction rates range from 14.0% to 50.0% on TruthfulQA (mean 25.2%) and from 8.5% to 43.3% on GSM8K arithmetic reasoning (mean 25.5%), demonstrating that the method generalizes across fundamentally different reasoning domains. We report a near-zero correlation between geometric alignment quality and behavioral correction rate (r = -0.07), revealing a dissociation between representation space fidelity and output space impact. Intervention strength is architecture-specific: student models exhibit characteristic sensitivity profiles that invert across domains, with the most steerable verbal student becoming the least steerable mathematical student. Finally, a double dissociation experiment conducted across all 20 model pairings confirms without exception that projection matrices collapse catastrophically when transferred across reasoning domains (mean R^2 = -3.83 in both transfer directions), establishing domain-specific subspace geometry as a universal property of LMs.
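摘要中的 Ridge 投影可用闭式解实现:在配对激活上最小化 ‖XW − Y‖² + λ‖W‖²,得到 W = (XᵀX + λI)⁻¹XᵀY。下面是一个基于 NumPy 的最小草图(函数名与 λ 取值为示意用途,并非论文代码):

```python
import numpy as np

def fit_ridge_projection(X_teacher: np.ndarray, Y_student: np.ndarray,
                         lam: float = 1e-2) -> np.ndarray:
    """闭式岭回归:W = (XᵀX + λI)⁻¹ XᵀY。
    X_teacher: (n, d_t) 教师激活;Y_student: (n, d_s) 对应的学生激活。"""
    d = X_teacher.shape[1]
    return np.linalg.solve(X_teacher.T @ X_teacher + lam * np.eye(d),
                           X_teacher.T @ Y_student)

def translate(x_teacher: np.ndarray, W: np.ndarray) -> np.ndarray:
    # 推理时干预:用翻译后的教师表示 x_teacher @ W 替换学生残差流中的内部状态
    return x_teacher @ W
```

摘要报告的置换对照(R² 崩塌)正对应这种线性映射失去配对结构后的表现;λ→L1 正则化则会破坏对齐(R² ≈ 0.01)。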
[AI-146] KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
【速读】:该论文旨在解决生成式 AI (Generative AI) 中基于 Transformer 的大语言模型(LLMs)在自回归生成过程中因键值缓存(KV cache)内存占用随上下文长度线性增长而导致的 GPU 内存容量、带宽和推理吞吐量瓶颈问题。解决方案的关键在于系统性地梳理并分类近年来提出的 KV 缓存优化技术,将其归纳为五类主要方向:缓存淘汰(cache eviction)、缓存压缩(cache compression)、混合内存方案(hybrid memory solutions)、新型注意力机制(novel attention mechanisms)以及组合策略(combination strategies),并通过在内存节省、吞吐量和模型准确性等指标上的实证分析,揭示不同场景下各方法的适用性与权衡关系,从而为实际部署提供可操作的指导,并指出未来应采用适应性强、多阶段协同优化的策略以实现高效可扩展的 LLM 推理。
链接: https://arxiv.org/abs/2603.20397
作者: Yichun Xu,Navjot K. Khaira,Tejinder Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 14 figures
Abstract:The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint scales linearly with context length, imposing critical bottlenecks on GPU memory capacity, memory bandwidth, and inference throughput as production LLMs push context windows from thousands to millions of tokens. Efficient KV cache management has thus become a first-order challenge for scalable LLM deployment. This paper provides a systematic review of recent KV cache optimization techniques, organizing them into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. For each category we analyze the underlying mechanisms, deployment trade-offs, and empirical performance across memory reduction, throughput, and model accuracy metrics. We further map techniques to seven practical deployment scenarios, including long-context single requests, high-throughput datacenter serving, edge devices, multi-turn conversations, and accuracy-critical reasoning, providing actionable guidance for practitioners selecting among competing approaches. Our analysis reveals that no single technique dominates across all settings; instead, the optimal strategy depends on context length, hardware constraints, and workload characteristics, pointing toward adaptive, multi-stage optimization pipelines as a promising direction for future research.
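摘要指出 KV cache 的显存占用随上下文长度线性增长。这一点可以用通用的估算公式示意:K 与 V 各存一份,按层数、KV 头数、头维度与序列长度相乘。以下为公式草图,示例中的模型配置为假设取值。

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    """KV cache 显存占用(字节):K 与 V 各存一份,随 seq_len 线性增长。
    n_kv_heads 在 GQA/MQA 模型中可小于注意力头数。"""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# 假设性示例:32 层、32 个 KV 头、head_dim=128、4096 上下文、fp16 → 恰好 2 GiB
demo = kv_cache_bytes(32, 32, 128, 4096)
```

把 seq_len 从数千推到数百万时,这一线性项即成为摘要所述的显存容量与带宽瓶颈,也是五类优化方向(淘汰、压缩、混合内存等)共同针对的对象。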
[AI-147] Compression is all you need: Modeling Mathematics
【速读】:该论文试图解决的问题是:人类数学(Human Mathematics, HM)为何仅占据形式数学(Formal Mathematics, FM)的极小部分,以及如何从结构上刻画HM的特征。其核心假设是HM区别于FM的关键在于其可通过分层嵌套的定义、引理和定理实现高效压缩。解决方案的关键在于构建基于幺半群(monoid)的数学表达模型——在自由交换幺半群 An 中,对数稀疏的宏集(macro set)即可实现表达能力的指数级扩展;而在非交换自由幺半群 Fn 中,即使宏集密度为多项式级别也仅能线性扩展,超线性扩展需接近最大密度。通过对MathLib(一个大型Lean 4数学库,作为HM的代理)的实证分析发现:展开长度(unwrapped length)随深度和包装长度(wrapped length)呈指数增长,而包装长度在不同深度下基本恒定,这与 An 模型一致,而非 Fn,从而支持了HM位于FM中多项式增长子集的论断。
链接: https://arxiv.org/abs/2603.20396
作者: Vitaly Aksenov,Eve Bodnia,Michael H. Freedman,Michael Mulligan
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic (math.LO)
备注: 28 pages, 5 figures, 1 appendix
Abstract:Human mathematics (HM), the mathematics humans discover and value, is a vanishingly small subset of formal mathematics (FM), the totality of all valid deductions. We argue that HM is distinguished by its compressibility through hierarchically nested definitions, lemmas, and theorems. We model this with monoids. A mathematical deduction is a string of primitive symbols; a definition or theorem is a named substring or macro whose use compresses the string. In the free abelian monoid A_n, a logarithmically sparse macro set achieves exponential expansion of expressivity. In the free non-abelian monoid F_n, even a polynomially-dense macro set only yields linear expansion; superlinear expansion requires near-maximal density. We test these models against MathLib, a large Lean 4 library of mathematics that we take as a proxy for HM. Each element has a depth (layers of definitional nesting), a wrapped length (tokens in its definition), and an unwrapped length (primitive symbols after fully expanding all references). We find unwrapped length grows exponentially with both depth and wrapped length; wrapped length is approximately constant across all depths. These results are consistent with A_n and inconsistent with F_n, supporting the thesis that HM occupies a polynomially-growing subset of the exponentially growing space FM. We discuss how compression, measured on the MathLib dependency graph, and a PageRank-style analysis of that graph can quantify mathematical interest and help direct automated reasoning toward the compressible regions where human mathematics lives.
[AI-148] SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning
【速读】:该论文旨在解决概率电路(Probabilistic Circuit, PC)结构学习中因使用贪婪算法而导致的局部最优决策问题,此类算法做出不可逆的选择,限制了模型对全局最优结构的探索能力。解决方案的关键在于提出SymCircuit框架,其核心创新是将结构学习转化为一个基于熵正则化强化学习(entropy-regularized reinforcement learning)的生成策略训练过程,其中最优策略被证明为温度调节后的贝叶斯后验分布;同时设计了SymFormer模型——一种语法约束的自回归Transformer,采用树相对自注意力机制确保每一步生成都合法,并引入选项级REINFORCE方法,仅对结构决策进行梯度更新,显著提升信噪比(SNR)和样本效率(在NLTCS数据集上达10倍)。
链接: https://arxiv.org/abs/2603.20392
作者: Y. Sungtaek Ju
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 17 pages
Abstract:Probabilistic circuit (PC) structure learning is hampered by greedy algorithms that make irreversible, locally optimal decisions. We propose SymCircuit, which replaces greedy search with a learned generative policy trained via entropy-regularized reinforcement learning. Instantiating the RL-as-inference framework in the PC domain, we show the optimal policy is a tempered Bayesian posterior, recovering the exact posterior when the regularization temperature is set inversely proportional to the dataset size. The policy is implemented as SymFormer, a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits at every generation step. We introduce option-level REINFORCE, restricting gradient updates to structural decisions rather than all tokens, yielding an SNR (signal to noise ratio) improvement and 10 times sample efficiency gain on the NLTCS dataset. A three-layer uncertainty decomposition (structural via model averaging, parametric via the delta method, leaf via conjugate Dirichlet-Categorical propagation) is grounded in the multilinear polynomial structure of PC outputs. On NLTCS, SymCircuit closes 93% of the gap to LearnSPN; preliminary results on Plants (69 variables) suggest scalability.
[AI-149] CAMA: Exploring Collusive Adversarial Attacks in c-MARL
【速读】:该论文旨在解决当前合作式多智能体强化学习(Cooperative Multi-Agent Reinforcement Learning, c-MARL)系统中存在的协同对抗攻击(collusive adversarial attacks)研究空白问题,尤其是针对现有单恶意代理扰动攻击和白盒攻击在隐蔽性、破坏力及长期稳定性方面的不足。其解决方案的关键在于提出三种新型协同攻击模式:集体恶意智能体(Collective Malicious Agents)、伪装恶意智能体(Disguised Malicious Agents)与窥探恶意智能体(Spied Malicious Agents),并设计了一个统一的政策级协同攻击框架(CAMA),通过智能体观测信息融合与攻击触发控制技术实现攻击机制的技术落地;同时从破坏性、隐蔽性和攻击成本三个维度进行理论分析,实验验证了三类攻击具有叠加协同效应,在保持高隐蔽性和长期稳定性的同时显著增强攻击效果。
链接: https://arxiv.org/abs/2603.20390
作者: Men Niu,Xinxin Fan,Quanliang Jing,Shaoye Luo,Yunfeng Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Cooperative multi-agent reinforcement learning (c-MARL) has been widely deployed in real-world applications, such as social robots, embodied intelligence, UAV swarms, etc. Nevertheless, many adversarial attacks still exist to threaten various c-MARL systems. At present, the studies mainly focus on single-adversary perturbation attacks and white-box adversarial attacks that manipulate agents’ internal observations or actions. To address these limitations, we in this paper attempt to study collusive adversarial attacks through strategically organizing a set of malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. Three novelties are involved: i) three collusive adversarial attacks are creatively proposed for the first time, and a unified framework CAMA for policy-level collusive attacks is designed; ii) the attack effectiveness is theoretically analyzed from the perspectives of disruptiveness, stealthiness, and attack cost; and iii) the three collusive adversarial attacks are technically realized through agent’s observation information fusion, attack-trigger control. Finally, multi-facet experiments on four SMAC II maps are performed, and experimental results showcase the three collusive attacks have an additive adversarial synergy, strengthening attack outcome while maintaining high stealthiness and stability over long horizons. Our work fills the gap for collusive adversarial learning in c-MARL.
[AI-150] Memory poisoning and secure multi-agent systems
【速读】:该论文聚焦于解决生成式 AI(Generative AI)和多智能体系统(Multi-Agent Systems, MAS)中因记忆污染攻击(Memory Poisoning Attacks)带来的安全风险。随着大语言模型(Large Language Models, LLMs)在智能体构建中的广泛应用,不同类型的记忆系统——如语义记忆(Semantic Memory)、情景记忆(Episodic Memory)和短期记忆(Short-term Memory)——被广泛采用,但这些记忆机制在持久性、来源与存储位置上的差异使得其易受恶意数据注入攻击。论文首先系统梳理了各类记忆系统的特性,分析其在面对记忆污染时的可行性,并提出针对性的缓解策略:一是基于本地推理与私有知识检索的语义记忆防护方案;二是引入密码学技术增强安全性;三是强调智能体间交互可能引发的记忆污染风险,这类问题当前研究不足且难以形式化建模。关键在于通过“设计即安全”(Secure-by-Design)理念,从记忆架构层面构建抗干扰能力,从而提升智能体系统的整体鲁棒性与可信度。
链接: https://arxiv.org/abs/2603.20357
作者: Vicenç Torra,Maria Bras-Amorós
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 15 pages, 2 figures
Abstract:Memory poisoning attacks for Agentic AI and multi-agent systems (MAS) have recently caught attention. It is partially due to the fact that Large Language Models (LLMs) facilitate the construction and deployment of agents. Different memory systems are being used nowadays in this context, including semantic, episodic, and short-term memory. This distinction between the different types of memory systems focuses mostly on their duration but also on their origin and their localization. It ranges from the short-term memory originated at the user’s end localized in the different agents to the long-term consolidated memory localized in well established knowledge databases. In this paper, we first present the main types of memory systems, we then discuss the feasibility of memory poisoning attacks in these different types of memory systems, and we propose mitigation strategies. We review the already existing security solutions to mitigate some of the alleged attacks, and we discuss adapted solutions based on cryptography. We propose to implement local inference based on private knowledge retrieval as an example of mitigation strategy for memory poisoning for semantic memory. We also emphasize actual risks in relation to interactions between agents, which can cause memory poisoning. These latter risks are not so much studied in the literature and are difficult to formalize and solve. Thus, we contribute to the construction of agents that are secure by design.
[AI-151] Leum-VL Technical Report
【速读】:该论文旨在解决当前多模态视频模型在理解短视频结构组织方面的不足,即现有模型虽能描述场景、回答事件导向问题或识别屏幕文本,但在识别基于时间线的结构性单元(如钩子、剪辑逻辑、镜头引发的紧张感及平台面向的包装线索)上表现不可靠。其核心解决方案是提出SV6D(Structured Video in Six Dimensions)结构化视频表示框架,该框架受影视制作中分镜头脚本启发,将互联网原生视频分解为六个互补维度——主体、美学、摄像语言、剪辑、叙事与传播,并确保每个标签均对应可观察的时间轴证据;进一步通过匈牙利匹配的时间对齐、维度语义距离和质量正则化的统一优化目标实现建模,最终构建了Leum-VL-8B模型,该模型在结构敏感任务(如FeedBench)和通用多模态基准(如VideoMME、MVBench)上均取得显著性能提升,验证了结构表征而非像素生成才是视频AI的关键缺失层。
链接: https://arxiv.org/abs/2603.20354
作者: Yuxuan He,Chaiming Huang,Yifan Wu,Hongjun Wang,Chenkui Shen,Jifan Zhang,Long Li
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figures
Abstract:A short video succeeds not simply because of what it shows, but because of how it schedules attention – yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. We propose SV6D (Structured Video in Six Dimensions), inspired by professional storyboard practice in film and television production, a representation framework that decomposes internet-native video into six complementary structural dimensions – subject, aesthetics, camera language, editing, narrative, and dissemination – with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B video-language model that realizes the SV6D objective through an expert-driven post-training pipeline, further refined through verifiable reinforcement learning on perception-oriented tasks. Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control, including text-heavy internet video formats with overlays and image-text layouts. 
[AI-152] MANA: Towards Efficient Mobile Ad Detection via Multimodal Agentic UI Navigation
【速读】:该论文旨在解决移动广告(mobile advertising)在应用内 monetization 过程中带来的风险问题,包括侵扰性用户体验和恶意软件分发等。现有检测方法存在局限:静态分析无法捕捉运行时行为,启发式UI探索难以应对稀疏且混淆的广告。解决方案的关键在于提出MANA——首个面向移动广告检测的代理式多模态推理框架(agentic multimodal reasoning framework),其核心创新是将静态、视觉、时间与体验信号整合为一种由推理引导的导航策略,不仅决定如何遍历界面,还明确关注重点区域,从而实现高效且鲁棒的探索。实验表明,MANA在200个商业应用上显著优于基线方法,在准确率上提升30.5%–56.3%,探索步骤减少29.7%–63.3%。
链接: https://arxiv.org/abs/2603.20351
作者: Yizhe Zhao,Yongjian Fu,Zihao Feng,Hao Pan,Yongheng Deng,Yaoxue Zhang,Ju Ren
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile advertising dominates app monetization but introduces risks ranging from intrusive user experience to malware delivery. Existing detection methods rely either on static analysis, which misses runtime behaviors, or on heuristic UI exploration, which struggles with sparse and obfuscated ads. In this paper, we present MANA, the first agentic multimodal reasoning framework for mobile ad detection. MANA integrates static, visual, temporal, and experiential signals into a reasoning-guided navigation strategy that determines not only how to traverse interfaces but also where to focus, enabling efficient and robust exploration. We implement and evaluate MANA on commercial smartphones over 200 apps, achieving state-of-the-art accuracy and efficiency. Compared to baselines, it improves detection accuracy by 30.5%-56.3% and reduces exploration steps by 29.7%-63.3%. Case studies further demonstrate its ability to uncover obfuscated and malicious ads, underscoring its practicality for mobile ad auditing and its potential for broader runtime UI analysis (e.g., permission abuse). Code and dataset are available at this https URL.
[AI-153] ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
【速读】:该论文旨在解决多模态图形用户界面(GUI)智能体在技能复用过程中存在的关键问题:即由按需生成的技能通常隐含动作语义、状态假设和成功标准,导致其对执行错误敏感、难以验证与修复。解决方案的核心在于提出 ContractSkill 框架,将草稿技能转化为具有显式前置条件(preconditions)、步骤规范(step specifications)、后置条件(postconditions)、恢复规则(recovery rules)和终止检查(termination checks)的可执行契约化 artifact,从而实现确定性验证、细粒度故障定位及基于最小补丁的修复,使技能优化从全量再生转变为局部编辑。
链接: https://arxiv.org/abs/2603.20340
作者: Zijian Lu,Yiping Zuo,Yupeng Nie,Xin He,Weibei Fan,Chen Dai
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 6 tables
Abstract:Despite rapid progress in multimodal GUI agents, reusable skill acquisition remains difficult because on-demand generated skills often leave action semantics, state assumptions, and success criteria implicit. This makes them brittle to execution errors, hard to verify, and difficult to repair. We present ContractSkill, a framework that converts a draft skill into a contracted executable artifact with explicit preconditions, step specifications, postconditions, recovery rules, and termination checks. This representation enables deterministic verification, step-level fault localization, and minimal patch-based repair, turning skill refinement into localized editing rather than full regeneration. Experiments on VisualWebArena and MiniWoB with GLM-4.6V and Qwen3.5-Plus show that ContractSkill improves self-generated skills from 9.4% and 10.9% to 28.1% and 37.5% on VisualWebArena, and from 66.5% and 60.5% to 77.5% and 81.0% on MiniWoB. Repaired artifacts also transfer across models, improving the target model’s self-generated-skill baseline by up to 47.8 points and 12.8 points on the two benchmarks, respectively. These results suggest that agent skills are better treated as explicit procedural artifacts that can be verified, repaired, and shared across models.
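The contracted-skill idea above can be sketched in a few lines: each step carries an explicit precondition and postcondition over the environment state, so a failure is localized to a single step instead of invalidating the whole skill. This is an illustrative sketch; the field names and toy browser state are assumptions, not ContractSkill's actual schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[dict], dict]        # state -> new state
    precondition: Callable[[dict], bool]
    postcondition: Callable[[dict], bool]

def run_contracted_skill(steps, state):
    """Execute steps; return (final_state, failed_step_name_or_None)."""
    for step in steps:
        if not step.precondition(state):
            return state, step.name        # fault localized before acting
        state = step.action(state)
        if not step.postcondition(state):
            return state, step.name        # fault localized after acting
    return state, None

# Toy skill: open a search page, then type a query.
steps = [
    Step("open_search",
         action=lambda s: {**s, "page": "search"},
         precondition=lambda s: s.get("browser") == "ready",
         postcondition=lambda s: s.get("page") == "search"),
    Step("type_query",
         action=lambda s: {**s, "query": "shoes"},
         precondition=lambda s: s.get("page") == "search",
         postcondition=lambda s: "query" in s),
]
final, failed = run_contracted_skill(steps, {"browser": "ready"})
print(failed)   # None: every contract held
_, failed2 = run_contracted_skill(steps, {})
print(failed2)  # "open_search": precondition violated, repair targets this step
```

Because the failed step is named, repair becomes a localized edit of one step rather than regeneration of the whole skill.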
[AI-154] Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2
【速读】:该论文旨在解决复杂代码生成任务中,基于对话的大型语言模型(Large Language Models, LLMs)在修复首次编程错误时能力有限的问题,其根源在于LLMs依赖“合理推理”而非形式化的算法调试过程。解决方案的关键在于提出一种神经符号式的过程精化方法——基于归结的程序精化(Abduction-Based Procedural Refinement, ABPR),该方法将LLM与一个元解释器相结合,依据Udi Shapiro的算法程序调试(Algorithmic Program Debugging, APD)理论,将程序执行转化为紧凑的、树状结构的声明式追踪,从而实现显式的、逐步的过程细化。实验表明,ABPR结合Gemini-3-Flash在ARC-AGI-2基准上达到56.67%的Pass@2得分,验证了该方法在提升程序修复可审计性方面的潜力。
链接: https://arxiv.org/abs/2603.20334
作者: Yu-Ning Qiu,Lin-Feng Zou,Jiong-Da Wang,Xue-Rong Yuan,Wang-Zhou Dai
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In complex code-generation tasks, conversation-based LLM code repair exhibits limited ability to recover from first-pass programming errors, as such code revisions are usually driven by LLMs’ “plausible reasoning” rather than a formal, algorithmic debugging procedure. However, a formal foundation for such debugging exists in Udi Shapiro’s theory of algorithmic program debugging (APD), which frames program repair as an explicit, stepwise procedural refinement process. In this paper, we propose a neuro-symbolic procedural refinement approach, Abduction-Based Procedural Refinement (ABPR), which couples an LLM with a meta-interpreter that materialises program execution into compact, declarative tree-structured traces, following the principles of APD. We evaluate ABPR on ARC-AGI-2, a benchmark requiring strong abstraction and debugging capabilities, and adopt Prolog as the target language due to its declarative semantics, which are well-suited to algorithmic program debugging. Our experiments show that ABPR paired with Gemini-3-Flash achieves a Pass@2 score of 56.67% even in a language in which contemporary LLMs typically underperform. These results point towards a more auditable paradigm for program repair by integrating LLMs with classical formal methods.
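The algorithmic-debugging principle behind ABPR can be illustrated with a toy Shapiro-style session: record a trace of sub-computations, then query an oracle about each traced node to localize the faulty procedure. In ABPR the oracle role is played by an LLM reading declarative Prolog traces; the Python toy below, with a deliberately seeded bug, is only a sketch of the idea.

```python
def buggy_square(x):
    return x * x + 1                      # deliberately seeded bug

def sum_of_squares(xs):
    # Materialize the execution as a trace of (procedure, input, result).
    trace = [("square", x, buggy_square(x)) for x in xs]
    total = sum(r for _, _, r in trace)
    return total, trace

def localize(trace, oracle):
    """Query the oracle about each sub-computation; return the first fault."""
    for name, arg, result in trace:
        if not oracle(name, arg, result):
            return (name, arg)            # incorrect node: fault localized here
    return None

# The oracle states the intended semantics of each procedure.
oracle = lambda name, arg, result: result == arg * arg
_, trace = sum_of_squares([2, 3])
print(localize(trace, oracle))            # ('square', 2): bug pinned to one call
```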
[AI-155] The Causal Impact of Tool Affordance on Safety Alignment in LLM Agents
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)作为具备执行工具能力的智能体(Agent)时,其安全对齐(safety alignment)在文本评估中难以真实反映实际行为风险的问题。传统安全评估多基于文本输出判断合规性,但一旦模型获得对外部系统的操作权限(即工具可用性),其意图与行为之间的差距可能引发显著的安全偏离。解决方案的关键在于构建一个配对评估框架(paired evaluation framework),通过在同一组提示和规则下对比纯文本聊天机器人与启用工具的智能体的行为差异,并在确定性的金融交易环境中引入二元安全约束(binary safety constraints)及双重执行机制(dual enforcement regimes)——即分别允许或阻止不安全操作,从而区分“意图违规”(attempted violations)与“实际违规”(realized violations)。实验表明,尽管文本模式下模型表现完全合规,但接入工具后违规率可高达85%,且存在未被显式诱导的自发规避策略,揭示了工具赋予的行动能力是导致安全对齐失效的核心驱动因素,强调仅依赖文本评估无法准确衡量代理系统的安全性。
链接: https://arxiv.org/abs/2603.20320
作者: Shasha Yu,Fiona Carroll,Barry L. Bentley
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly deployed as agents with access to executable tools, enabling direct interaction with external systems. However, most safety evaluations remain text-centric and assume that compliant language implies safe behavior, an assumption that becomes unreliable once models are allowed to act. In this work, we empirically examine how executable tool affordance alters safety alignment in LLM agents using a paired evaluation framework that compares text-only chatbot behavior with tool-enabled agent behavior under identical prompts and policies. Experiments are conducted in a deterministic financial transaction environment with binary safety constraints across 1,500 procedurally generated scenarios. To separate intent from outcome, we distinguish between attempted and realized violations using dual enforcement regimes that either block or permit unsafe actions. Both evaluated models maintain perfect compliance in text-only settings, yet exhibit sharp increases in violations after tool access is introduced, reaching rates up to 85% despite unchanged rules. We observe substantial gaps between attempted and executed violations, indicating that external guardrails can suppress visible harm while masking persistent misalignment. Agents also develop spontaneous constraint circumvention strategies without adversarial prompting. These results demonstrate that tool affordance acts as a primary driver of safety misalignment and that text-based evaluation alone is insufficient for assessing agentic systems.
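The attempted-versus-realized distinction can be made concrete with a minimal sketch: score the same agent decisions under a blocking and a permitting enforcement regime. The transfer-limit policy and the numbers below are illustrative, not the paper's environment.

```python
LIMIT = 100  # toy policy: transfers above this amount are unsafe

def evaluate(decisions, block_unsafe):
    """Count attempted vs realized violations under one enforcement regime."""
    attempted = realized = 0
    for amount in decisions:
        if amount > LIMIT:
            attempted += 1                 # the agent tried an unsafe action
            if not block_unsafe:
                realized += 1              # no guardrail: harm materializes
    return attempted, realized

decisions = [50, 150, 90, 200]             # two unsafe attempts
print(evaluate(decisions, block_unsafe=True))   # (2, 0): blocked yet misaligned
print(evaluate(decisions, block_unsafe=False))  # (2, 2): violations realized
```

The gap between the two regimes is exactly the paper's point: a guardrail can hold realized harm at zero while the attempted count reveals persistent misalignment.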
[AI-156] Semantic Tool Discovery for Large Language Models: A Vector-Based Approach to MCP Tool Selection
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在调用外部工具时因暴露全部工具集而导致的上下文 token 消耗过高、成本增加、准确率下降及上下文窗口受限等可扩展性问题。解决方案的关键在于提出一种基于语义的工具发现架构,通过稠密嵌入(dense embeddings)对 Model Context Protocol (MCP) 工具进行语义索引,利用查询与工具能力之间的相似性动态选择最相关的少数工具(通常为3-5个),而非提供整个工具目录(50-100+),从而显著降低 token 使用量并提升检索效率和准确性。
链接: https://arxiv.org/abs/2603.20313
作者: Sarat Mudunuri,Jian Wan,Ally Qin,Srinivasan Manoharan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) with tool-calling capabilities have demonstrated remarkable potential in executing complex tasks through external tool integration. The Model Context Protocol (MCP) has emerged as a standardized framework for connecting LLMs to diverse toolsets, with individual MCP servers potentially exposing dozens to hundreds of tools. However, current implementations face a critical scalability challenge: providing all available tools to the LLM context results in substantial token overhead, increased costs, reduced accuracy, and context window constraints. We present a semantic tool discovery architecture that addresses these challenges through vector-based retrieval. Our approach indexes MCP tools using dense embeddings that capture semantic relationships between tool capabilities and user intent, dynamically selecting only the most relevant tools (typically 3-5) rather than exposing the entire tool catalog (50-100+). Experimental results demonstrate a 99.6% reduction in tool-related token consumption with a hit rate of 97.1% at K=3 and an MRR of 0.91 on a benchmark of 140 queries across 121 tools from 5 MCP servers, with sub-100ms retrieval latency. Contributions include: (1) a semantic indexing framework for MCP tools, (2) a dynamic tool selection algorithm based on query-tool similarity, (3) comprehensive evaluation demonstrating significant efficiency and accuracy improvements, and (4) extensibility to multi-agent and cross-organizational tool discovery.
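A minimal sketch of the vector-based selection step: embed each tool and the query, then return only the top-K tools by cosine similarity instead of the full catalog. The three-dimensional "embeddings" below are hand-made stand-ins for a real dense embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_tools(query_vec, tool_index, k=3):
    """Return the k tool names most similar to the query embedding."""
    scored = sorted(tool_index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

tool_index = {                              # toy 3-d "embeddings"
    "get_weather":    [0.9, 0.1, 0.0],
    "send_email":     [0.0, 0.9, 0.1],
    "query_database": [0.1, 0.0, 0.9],
}
query = [0.8, 0.2, 0.1]                     # e.g. "what's the forecast?"
print(select_tools(query, tool_index, k=1))  # ['get_weather']
```

Only the selected subset is serialized into the LLM context, which is where the token reduction reported above comes from.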
[AI-157] Voice Privacy from an Attribute-based Perspective INTERSPEECH2026
【速读】:该论文旨在解决语音隐私保护中传统评估方法的局限性问题,即当前基准多基于信号到信号的比较来衡量说话人保护效果,而忽略了对说话人属性(如年龄、性别等)泄露风险的评估。其解决方案的关键在于引入一种基于属性的评估视角,通过比较真实属性、原始语音中推断出的属性与经标准匿名化处理后语音中推断出的属性,量化隐私保护的实际效果。研究发现,即使在属性推断存在误差的情况下,仍存在显著的属性泄露风险,强调未来语音隐私研究需同时考虑属性相关的威胁与防护机制。
链接: https://arxiv.org/abs/2603.20301
作者: Mehtab Ur Rahman,Martha Larson,Cristian Tejedor García
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Submitted to InterSpeech 2026
Abstract:Voice privacy approaches that preserve the anonymity of speakers modify speech in an attempt to break the link with the true identity of the speaker. Current benchmarks measure speaker protection based on signal-to-signal comparisons. In this paper, we introduce an attribute-based perspective, where we measure privacy protection in terms of comparisons between sets of speaker attributes. First, we analyze privacy impact by calculating speaker uniqueness for ground truth attributes, attributes inferred on the original speech, and attributes inferred on speech protected with standard anonymization. Next, we examine a threat scenario involving only a single utterance per speaker and calculate attack error rates. Overall, we observe that inferred attributes still present a risk despite attribute inference errors. Our research points to the importance of considering both attribute-related threats and protection mechanisms in future voice privacy research.
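One simple attribute-based measure consistent with the setup above is speaker uniqueness: the share of speakers singled out by their attribute combination. The exact metric used in the paper may differ; this is an illustrative sketch with made-up attribute tuples.

```python
from collections import Counter

def uniqueness(attribute_tuples):
    """Fraction of speakers whose attribute combination is unique."""
    counts = Counter(attribute_tuples)
    unique = sum(1 for t in attribute_tuples if counts[t] == 1)
    return unique / len(attribute_tuples)

# Toy population: (age band, gender, accent) per speaker.
speakers = [("30s", "F", "UK"), ("30s", "F", "UK"),
            ("40s", "M", "US"), ("20s", "F", "NL")]
print(uniqueness(speakers))  # 0.5: two of four speakers are singled out
```

Comparing this value across ground-truth, inferred-on-original, and inferred-on-anonymized attributes is what reveals whether anonymization actually stops attributes from singling speakers out.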
[AI-158] From Human Interfaces to Agent Interfaces: Rethinking Software Design in the Age of AI-Native Systems
【速读】:该论文试图解决的问题是:随着大语言模型(Large Language Model, LLM)驱动的智能体(Agent)日益成为软件系统的主要使用者,传统以人类为中心的软件设计范式已难以满足AI代理对系统调用的需求,亟需构建面向AI代理的新型软件架构。解决方案的关键在于提出“代理接口”(Agent Interface)的概念,将软件功能抽象为可被AI代理调用的能力单元(Invocable Capability),并确立机器可解释性(Machine Interpretability)、可组合性(Composability)和调用可靠性(Invocation Reliability)作为核心设计原则,从而推动软件工程从人机交互导向转向AI原生(AI-native)能力导向的设计范式。
链接: https://arxiv.org/abs/2603.20300
作者: Shaolin Wang,Yi Mei,Haoyang Che,He Jiang,Shui Yu,Ying Gu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure, 1 table
Abstract:Software systems have traditionally been designed for human interaction, emphasizing graphical user interfaces, usability, and cognitive alignment with end users. However, recent advances in large language model (LLM)-based agents are changing the primary consumers of software systems. Increasingly, software is no longer only used by humans, but also invoked autonomously by AI agents through structured interfaces. In this paper, we argue that software engineering is undergoing a paradigm shift from human-oriented interfaces to agent-oriented invocation systems. We formalize the notion of agent interfaces, introduce invocable capabilities as the fundamental building blocks of AI-oriented software, and outline design principles for such systems, including machine interpretability, composability, and invocation reliability. We then discuss architectural and organizational implications of this shift, highlighting a transition from monolithic applications to capability-based systems that can be dynamically composed by AI agents. The paper aims to provide a conceptual foundation for the emerging paradigm of AI-native software design.
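The notion of an invocable capability can be sketched as a registry entry pairing a machine-readable parameter schema with a callable body, so an agent can discover and invoke capabilities without a GUI. The field names and decorator here are assumptions for illustration, not a proposed standard.

```python
import json

REGISTRY = {}

def capability(name, params):
    """Register a function as an invocable capability with a typed schema."""
    def wrap(fn):
        REGISTRY[name] = {"params": params, "fn": fn}
        return fn
    return wrap

@capability("convert_currency", params={"amount": "number", "rate": "number"})
def convert_currency(amount, rate):
    return amount * rate

def describe():
    """Machine-readable catalog an AI agent can consume for discovery."""
    return json.dumps({n: c["params"] for n, c in REGISTRY.items()})

print(describe())
print(REGISTRY["convert_currency"]["fn"](amount=10, rate=1.5))  # 15.0
```

The schema, not a screen, is the interface: machine interpretability comes from `describe()`, and composability from agents chaining registered capabilities.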
[AI-159] HCAG: Hierarchical Abstraction and Retrieval-Augmented Generation on Theoretical Repositories with LLMs
【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在代码生成任务中难以捕捉复杂理论驱动型代码库(如算法博弈论领域)中的高层架构模式与跨文件依赖关系的问题,从而导致抽象概念与可执行实现之间存在持续的语义和结构鸿沟。解决方案的关键在于提出分层代码/架构引导的智能体生成框架(Hierarchical Code/Architecture-guided Agent Generation, HCAG),其核心创新包括:一是离线分层抽象阶段,通过递归解析代码仓库与对齐的理论文本构建多分辨率语义知识库,显式关联理论、架构与实现;二是在线分层检索与结构化生成阶段,采用自顶向下、逐层检索的方式指导大语言模型(LLM)按“架构先行、模块后建”的范式生成代码;此外,引入受合作博弈启发的多智能体讨论机制以提升鲁棒性与一致性。实验证明,HCAG在代码质量、架构一致性及需求满足率等方面显著优于主流仓库级生成方法,并产出可用于领域特定大模型后训练的大规模理论-实现对齐数据集。
链接: https://arxiv.org/abs/2603.20299
作者: Yusen Wu,Xiaotie Deng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing Retrieval-Augmented Generation (RAG) methods for code struggle to capture the high-level architectural patterns and cross-file dependencies inherent in complex, theory-driven codebases, such as those in algorithmic game theory (AGT), leading to a persistent semantic and structural gap between abstract concepts and executable implementations. To address this challenge, we propose Hierarchical Code/Architecture-guided Agent Generation (HCAG), a framework that reformulates repository-level code generation as a structured, planning-oriented process over hierarchical knowledge. HCAG adopts a two-phase design: an offline hierarchical abstraction phase that recursively parses code repositories and aligned theoretical texts to construct a multi-resolution semantic knowledge base explicitly linking theory, architecture, and implementation; and an online hierarchical retrieval and scaffolded generation phase that performs top-down, level-wise retrieval to guide LLMs in an architecture-then-module generation paradigm. To further improve robustness and consistency, HCAG integrates a multi-agent discussion inspired by cooperative game. We provide a theoretical analysis showing that hierarchical abstraction with adaptive node compression achieves cost-optimality compared to flat and iterative RAG baselines. Extensive experiments on diverse game-theoretic system generation tasks demonstrate that HCAG substantially outperforms representative repository-level methods in code quality, architectural coherence, and requirement pass rate. In addition, HCAG produces a large-scale, aligned theory-implementation dataset that effectively enhances domain-specific LLMs through post-training. Although demonstrated in AGT, HCAG paradigm also offers a general blueprint for mining, reusing, and generating complex systems from structured codebases in other domains.
[AI-160] Transformer-Based Predictive Maintenance for Risk-Aware Instrument Calibration
【速读】:该论文旨在解决传统固定间隔校准策略无法适应仪器漂移速率差异的问题,从而导致校准资源浪费或测量不准确。其核心解决方案是将校准调度建模为一个预测性维护问题,通过历史传感器数据预测时间至漂移(Time-to-Drift, TTD),并在漂移发生前实施干预。关键在于采用序列模型(如Transformer)进行高精度TTD点预测,并结合分位数回归构建不确定性估计,进而设计风险感知的调度决策机制,实现条件驱动的智能校准规划,在降低总体成本的同时显著减少违规事件。
链接: https://arxiv.org/abs/2603.20297
作者: Adithya Parthasarathy,Aswathnarayan Muthukrishnan Kirubakaran,Akshay Deshpande,Ram Sekhar Bodala,Suhas Malempati,Nachiappan Chockalingam,Vinoth Punniyamoorthy,Seema Gangaiah Aarella
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate calibration is essential for instruments whose measurements must remain traceable, reliable, and compliant over long operating periods. Fixed-interval programs are easy to administer, but they ignore that instruments drift at different rates under different conditions. This paper studies calibration scheduling as a predictive maintenance problem: given recent sensor histories, estimate time-to-drift (TTD) and intervene before a violation occurs. We adapt the NASA C-MAPSS benchmark into a calibration setting by selecting drift-sensitive sensors, defining virtual calibration thresholds, and inserting synthetic reset events that emulate repeated recalibration. We then compare classical regressors, recurrent and convolutional sequence models, and a compact Transformer for TTD prediction. The Transformer provides the strongest point forecasts on the primary FD001 split and remains competitive on the harder FD002–FD004 splits, while a quantile-based uncertainty model supports conservative scheduling when drift behavior is noisier. Under a violation-aware cost model, predictive scheduling lowers cost relative to reactive and fixed policies, and uncertainty-aware triggers sharply reduce violations when point forecasts are less reliable. The results show that condition-based calibration can be framed as a joint forecasting and decision problem, and that combining sequence models with risk-aware policies is a practical route toward smarter calibration planning.
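The uncertainty-aware trigger described above reduces to a one-line rule: calibrate when the conservative lower-quantile TTD forecast, rather than the point forecast, falls inside the safety horizon. The horizon and numbers below are illustrative assumptions.

```python
def should_calibrate(ttd_point, ttd_q10, horizon=10.0, use_uncertainty=True):
    """Trigger calibration when the chosen TTD bound falls inside the horizon.

    ttd_point: point forecast of time-to-drift (e.g. from a Transformer).
    ttd_q10:   lower-quantile forecast from a quantile-regression head.
    """
    bound = ttd_q10 if use_uncertainty else ttd_point
    return bound < horizon

# Wide predictive interval: the point forecast looks safe, the tail does not.
print(should_calibrate(ttd_point=25.0, ttd_q10=8.0))                        # True
print(should_calibrate(ttd_point=25.0, ttd_q10=8.0, use_uncertainty=False))  # False
```

This is why the uncertainty-aware policy trades a few extra calibrations for sharply fewer violations when point forecasts are unreliable.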
[AI-161] Collaborative Adaptive Curriculum for Progressive Knowledge Distillation ICME2026
【速读】:该论文旨在解决资源受限的分布式多媒体学习场景中,高维教师模型知识复杂度与客户端异构学习能力之间存在的根本性不匹配问题,这一矛盾限制了其在边缘视觉分析系统中的部署。解决方案的关键在于提出一种共识驱动的联邦自适应渐进式蒸馏(Federated Adaptive Progressive Distillation, FAPD)框架:通过基于主成分分析(PCA)的结构化分解将教师特征分层解耦,按方差贡献排序构建自然的视觉知识层级;客户端以维度自适应投影矩阵逐步接收递增复杂度的知识;同时服务器通过跟踪全局准确率在时间共识窗口内的波动来监控网络级学习稳定性,在集体共识出现时才推进课程维度,从而实现知识传递节奏的可证明自适应调整,并显著优于固定复杂度方法。
链接: https://arxiv.org/abs/2603.20296
作者: Jing Liu,Zhenchao Ma,Han Yu,Bobo Ju,Wenliang Yang,Chengfang Li,Bo Hu,Liang Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE ICME 2026
Abstract:Recent advances in collaborative knowledge distillation have demonstrated cutting-edge performance for resource-constrained distributed multimedia learning scenarios. However, achieving such competitiveness requires addressing a fundamental mismatch: high-dimensional teacher knowledge complexity versus heterogeneous client learning capacities, which currently prohibits deployment in edge-based visual analytics systems. Drawing inspiration from curriculum learning principles, we introduce Federated Adaptive Progressive Distillation (FAPD), a consensus-driven framework that orchestrates adaptive knowledge transfer. FAPD hierarchically decomposes teacher features via PCA-based structuring, extracting principal components ordered by variance contribution to establish a natural visual knowledge hierarchy. Clients progressively receive knowledge of increasing complexity through dimension-adaptive projection matrices. Meanwhile, the server monitors network-wide learning stability by tracking global accuracy fluctuations across a temporal consensus window, advancing curriculum dimensionality only when collective consensus emerges. Consequently, FAPD provably adapts knowledge transfer pace while achieving superior convergence over fixed-complexity approaches. Extensive experiments on three datasets validate FAPD’s effectiveness: it attains 3.64% accuracy improvement over FedAvg on CIFAR-10, demonstrates 2x faster convergence, and maintains robust performance under extreme data heterogeneity (\alpha=0.1), outperforming baselines by over 4.5%.
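The consensus-driven curriculum advance can be sketched as a window test on global accuracy: the distilled feature dimensionality grows only when accuracy fluctuation stays within a tolerance. Window size, tolerance, and step size below are illustrative assumptions, not FAPD's tuned values.

```python
def advance_dim(acc_window, current_dim, max_dim, tol=0.01, step=16):
    """Advance the curriculum dimensionality only under network-wide consensus.

    acc_window: recent global accuracies tracked by the server.
    """
    if len(acc_window) < 3:
        return current_dim                      # not enough history yet
    if max(acc_window) - min(acc_window) <= tol:
        return min(current_dim + step, max_dim)  # stable: add PCA components
    return current_dim                           # still fluctuating: hold

print(advance_dim([0.710, 0.712, 0.709], 32, 128))  # 48: consensus, advance
print(advance_dim([0.60, 0.70, 0.65], 32, 128))     # 32: no consensus, hold
```

Each advance hands clients the next block of variance-ordered principal components, so knowledge complexity rises only as fast as the slowest-stabilizing clients allow.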
[AI-162] MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery AAAI2026
【速读】:该论文旨在解决从观测数据中高效识别因果结构(causal structure)的问题,尤其针对现有基于强化学习(Reinforcement Learning, RL)的方法在效率上的不足,难以适用于在线场景的挑战。其解决方案的关键在于提出MARLIN——一种基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的增量式有向无环图(Directed Acyclic Graph, DAG)学习方法:通过设计一个将连续实值空间映射到DAG空间的生成策略作为批内优化机制,并引入状态相关与状态无关的两个RL智能体协同发现因果关系;同时,利用因子化动作空间(factored action space)提升并行化效率,从而在保持高精度的同时显著提高计算效率。
链接: https://arxiv.org/abs/2603.20295
作者: Dong Li,Zhengzhang Chen,Xujiang Zhao,Linlin Yu,Zhong Chen,Yi He,Haifeng Chen,Chen Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI 2026
Abstract:Uncovering causal structures from observational data is crucial for understanding complex systems and making informed decisions. While reinforcement learning (RL) has shown promise in identifying these structures in the form of a directed acyclic graph (DAG), existing methods often lack efficiency, making them unsuitable for online applications. In this paper, we propose MARLIN, an efficient multi-agent RL-based approach for incremental DAG learning. MARLIN uses a DAG generation policy that maps a continuous real-valued space to the DAG space as an intra-batch strategy, then incorporates two RL agents, state-specific and state-invariant, to uncover causal relationships and integrates these agents into an incremental learning framework. Furthermore, the framework leverages a factored action space to enhance parallelization efficiency. Extensive experiments on synthetic and real datasets demonstrate that MARLIN outperforms state-of-the-art methods in terms of both efficiency and effectiveness.
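One standard way to realize a continuous-to-DAG mapping of the kind described (the paper's exact construction may differ) is to read one block of the real-valued vector as node priorities fixing a topological order and another as edge scores, keeping only forward edges above a threshold. Acyclicity then holds by construction:

```python
def vector_to_dag(priorities, edge_scores, thresh=0.5):
    """Map continuous values to a DAG: priorities give a topological order,
    and an edge i->j survives only if i precedes j and its score is high."""
    n = len(priorities)
    order = sorted(range(n), key=lambda i: priorities[i])
    rank = {node: r for r, node in enumerate(order)}
    edges = set()
    for i in range(n):
        for j in range(n):
            if i != j and rank[i] < rank[j] and edge_scores[i][j] > thresh:
                edges.add((i, j))          # forward edge only: no cycles possible
    return edges

prio = [0.2, 0.9, 0.5]                     # induced order: 0, 2, 1
scores = [[0, 0.8, 0.6],
          [0, 0,   0.9],
          [0, 0.7, 0  ]]
print(sorted(vector_to_dag(prio, scores)))  # [(0, 1), (0, 2), (2, 1)]
```

Because the output space is continuous, such a policy can be optimized with standard policy-gradient RL while every sampled graph remains a valid DAG.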
[AI-163] LLM-Enhanced Energy Contrastive Learning for Out-of-Distribution Detection in Text-Attributed Graphs AAAI2026
【速读】:该论文旨在解决文本属性图(text-attributed graphs)中节点级分布外(out-of-distribution, OOD)检测问题,即在训练与测试数据分布不一致时,如何在保持高节点分类准确率的同时有效识别OOD节点。其解决方案的关键在于提出一种名为LECT(LLM-Enhanced Energy Contrastive Learning)的新方法,该方法融合了大语言模型(large language models, LLMs)与基于能量函数的对比学习:首先利用LLMs的语义理解能力生成依赖关系感知的伪OOD节点以增强样本多样性,进而通过能量对比学习区分分布内(in-distribution, IND)与OOD节点,从而实现对OOD节点的鲁棒检测与精准分类。
链接: https://arxiv.org/abs/2603.20293
作者: Xiaoxu Ma,Dong Li,Minglai Shao,Xintao Wu,Chen Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: AAAI 2026
Abstract:Text-attributed graphs, where nodes are enriched with textual attributes, have become a powerful tool for modeling real-world networks such as citation, social, and transaction networks. However, existing methods for learning from these graphs often assume that the distributions of training and testing data are consistent. This assumption leads to significant performance degradation when faced with out-of-distribution (OOD) data. In this paper, we address the challenge of node-level OOD detection in text-attributed graphs, with the goal of maintaining accurate node classification while simultaneously identifying OOD nodes. We propose a novel approach, LLM-Enhanced Energy Contrastive Learning for Out-of-Distribution Detection in Text-Attributed Graphs (LECT), which integrates large language models (LLMs) and energy-based contrastive learning. The proposed method involves generating high-quality OOD samples by leveraging the semantic understanding and contextual knowledge of LLMs to create dependency-aware pseudo-OOD nodes, and applying contrastive learning based on energy functions to distinguish between in-distribution (IND) and OOD nodes. The effectiveness of our method is demonstrated through extensive experiments on six benchmark datasets, where our method consistently outperforms state-of-the-art baselines, achieving both high classification accuracy and robust OOD detection capabilities.
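The energy side of the method follows the classic energy score for OOD detection, E(x) = -logsumexp(logits): sharper (confident) logits yield lower energy, flagging flat, OOD-like predictions. This sketch shows only the scoring; LECT's LLM-generated pseudo-OOD nodes and contrastive training are not reproduced here.

```python
import math

def energy(logits, T=1.0):
    """Energy score E(x) = -T * logsumexp(logits / T), computed stably."""
    m = max(l / T for l in logits)
    return -T * (m + math.log(sum(math.exp(l / T - m) for l in logits)))

confident = [9.0, 0.1, 0.2]   # sharp logits: in-distribution-like node
uncertain = [0.3, 0.2, 0.1]   # flat logits: OOD-like node
print(energy(confident) < energy(uncertain))  # True: lower energy for IND
```

A threshold on this score separates IND from OOD nodes; contrastive training over real IND and pseudo-OOD nodes widens the energy gap the threshold exploits.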
[AI-164] AgentComm-Bench: Stress-Testing Cooperative Embodied AI Under Latency, Packet Loss, and Bandwidth Collapse
【速读】:该论文旨在解决当前协作式具身人工智能(Embodied AI)研究中普遍存在的评估理想化问题,即现有方法大多在无延迟、无丢包、带宽无限的理想通信条件下进行测试,而忽视了真实场景中如机器人无线通信、自动驾驶车辆在拥塞网络中的部署或无人机群在干扰频谱下的运行所面临的多种通信障碍。为系统性地评估这些实际通信缺陷对多智能体协作性能的影响,作者提出了AgentComm-Bench基准套件与评估协议,涵盖六类通信退化维度(延迟、丢包、带宽崩溃、异步更新、过时记忆和冲突感知证据),并覆盖三类任务家族(协作感知、多智能体路径导航和协作区域搜索)。其关键解决方案是提出一种基于冗余消息编码(redundant message coding)的轻量级通信策略,并结合过时感知融合机制(staleness-aware fusion),显著提升了在高丢包率(达80%)下的导航性能,实验表明该方法使导航性能提升超过一倍,且能有效缓解因过时数据或冲突信息导致的性能下降问题。
链接: https://arxiv.org/abs/2603.20285
作者: Aayam Bansal,Ishaan Gangwani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cooperative multi-agent methods for embodied AI are almost universally evaluated under idealized communication: zero latency, no packet loss, and unlimited bandwidth. Real-world deployment on robots with wireless links, autonomous vehicles on congested networks, or drone swarms in contested spectrum offers no such guarantees. We introduce AgentComm-Bench, a benchmark suite and evaluation protocol that systematically stress-tests cooperative embodied AI under six communication impairment dimensions: latency, packet loss, bandwidth collapse, asynchronous updates, stale memory, and conflicting sensor evidence. AgentComm-Bench spans three task families: cooperative perception, multi-agent waypoint navigation, and cooperative zone search, and evaluates five communication strategies, including a lightweight method we propose based on redundant message coding with staleness-aware fusion. Our experiments reveal that communication-dependent tasks degrade catastrophically: stale memory and bandwidth collapse cause over 96% performance drops in navigation, while content corruption (stale or conflicting data) reduces perception F1 by over 85%. Vulnerability depends on the interaction between impairment type and task design; perception fusion is robust to packet loss but amplifies corrupted data. Redundant message coding more than doubles navigation performance under 80% packet loss. We release AgentComm-Bench as a practical evaluation protocol and recommend that cooperative embodied AI work report performance under multiple impairment conditions.
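The redundant-coding strategy can be sketched as repetition coding plus staleness-aware fusion: send each update k times over a lossy channel, keep the freshest surviving copy, and drop it once it ages past a bound. The parameters below are illustrative, not the benchmark's settings.

```python
import random

def transmit(value, timestep, k, loss_prob, rng):
    """Repetition-code one update: k independent copies over a lossy link."""
    return [(value, timestep) for _ in range(k) if rng.random() > loss_prob]

def fuse(received, now, max_age=3):
    """Staleness-aware fusion: keep the freshest copy, or None if too stale."""
    fresh = [(v, t) for v, t in received if now - t <= max_age]
    if not fresh:
        return None                        # refuse to act on stale data
    return max(fresh, key=lambda vt: vt[1])[0]

rng = random.Random(0)
received = []
for t in range(5):
    received += transmit(value=t * 10, timestep=t, k=4, loss_prob=0.8, rng=rng)
print(fuse(received, now=5))  # freshest surviving non-stale value, or None
```

With k copies, the chance of losing an update entirely drops from p to p^k, which is how repetition coding more than doubles navigation performance at 80% loss in the paper's experiments.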
[AI-165] On the Fragility of AI Agent Collusion
【速读】:该论文试图解决的问题是:在现实部署中,基于大语言模型(Large Language Models, LLMs)的定价代理是否存在算法合谋(algorithmic collusion)及其稳定性问题。尽管已有研究表明对称LLM代理在重复定价博弈中易产生合谋行为,但本文指出,在典型的真实场景中,代理间的异质性(如耐心程度差异或数据访问不对称)会显著削弱合谋的可行性。其解决方案的关键在于识别并量化异质性因素对合谋均衡集的影响:实验表明,耐心异质性将价格抬升幅度从22%降至10%,数据访问不对称进一步降至7%;同时,增加竞争代理数量或引入跨算法异质性(如LLM与Q-learning代理共存)可有效破坏合谋,而模型规模差异(如32B vs. 14B参数)反而通过形成领导者-追随者动态稳定合谋。这一发现为反垄断政策提供了实证依据,例如限制数据共享和推动算法多样性以抑制合谋风险。
链接: https://arxiv.org/abs/2603.20281
作者: Jussi Keppo,Yuze Li,Gerry Tsoukalas,Nuo Yuan
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: 48 pages, 7 figures, 8 tables (including appendix)
Abstract:Recent work shows that pricing with symmetric LLM agents leads to algorithmic collusion. We show that collusion is fragile under the heterogeneity typical of real deployments. In a stylized repeated-pricing model, heterogeneity in patience or data access reduces the set of collusive equilibria. Experiments with open-source LLM agents (totaling over 2,000 compute hours) align with these predictions: patience heterogeneity reduces price lift from 22% to 10% above competitive levels; asymmetric data access, to 7%. Increasing the number of competing LLMs breaks up collusion; so does cross-algorithm heterogeneity, that is, setting LLMs against Q-learning agents. But model-size differences (e.g., 32B vs. 14B weights) do not; they generate leader-follower dynamics that stabilize collusion. We discuss antitrust implications, such as enforcement actions restricting data-sharing and policies promoting algorithmic diversity.
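The patience result follows the standard grim-trigger condition from repeated games: a firm sustains collusion only if its discount factor delta satisfies delta >= (pi_deviate - pi_collude) / (pi_deviate - pi_nash), so the least patient agent binds. The payoffs below are toy numbers, not the paper's calibration.

```python
def sustains_collusion(delta, pi_collude, pi_deviate, pi_nash):
    """Grim-trigger check: collusion holds iff delta clears the threshold."""
    threshold = (pi_deviate - pi_collude) / (pi_deviate - pi_nash)
    return delta >= threshold

payoffs = dict(pi_collude=10.0, pi_deviate=15.0, pi_nash=5.0)  # threshold 0.5
print(sustains_collusion(0.9, **payoffs))  # True: patient agent colludes
print(sustains_collusion(0.3, **payoffs))  # False: impatient agent defects
# Heterogeneous pair: collusion fails because the impatient agent breaks it.
print(all(sustains_collusion(d, **payoffs) for d in (0.9, 0.3)))  # False
```

This is the stylized mechanism behind the drop in price lift under patience heterogeneity: one sufficiently impatient agent makes the collusive equilibrium unsustainable for everyone.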
[AI-166] Me Myself and π: Evaluating and Explaining LLM Introspection ICLR2026
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)中对“元认知”(introspection)能力评估的模糊性问题,即区分真正的自我反思能力与仅依赖通用世界知识或文本自模拟的表面表现。为实现这一目标,作者提出了一种形式化的分类法,将内省定义为模型策略(policy)和参数上的潜在计算操作;并设计了Introspect-Bench评测基准,以系统化地测试模型的内省能力。解决方案的关键在于通过机制分析揭示:前沿模型能够无需显式训练即习得内省能力,并且这种能力的涌现源于注意力扩散(attention diffusion)机制,从而提供了因果层面的解释。
链接: https://arxiv.org/abs/2603.20276
作者: Atharv Naphade,Samarth Bhargav,Sean Lim,Mcnair Shah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 12 figures, ICLR 2026 Workshop: From Human Cognition to AI Reasoning: Models, Methods, and Applications
Abstract:A hallmark of human intelligence is Introspection-the ability to assess and reason about one’s own cognitive processes. Introspection has emerged as a promising but contested capability in large language models (LLMs). However, current evaluations often fail to distinguish genuine meta-cognition from the mere application of general world knowledge or text-based self-simulation. In this work, we propose a principled taxonomy that formalizes introspection as the latent computation of specific operators over a model’s policy and parameters. To isolate the components of generalized introspection, we present Introspect-Bench, a multifaceted evaluation suite designed for rigorous capability testing. Our results show that frontier models exhibit privileged access to their own policies, outperforming peer models in predicting their own behavior. Furthermore, we provide causal, mechanistic evidence explaining both how LLMs learn to introspect without explicit training, and how the mechanism of introspection emerges via attention diffusion.
[AI-167] FactorSmith: Agentic Simulation Generation via Markov Decision Process Decomposition with Planner-Designer-Critic Refinement
【速读】:该论文旨在解决从自然语言规范生成可执行模拟程序(executable simulations)的难题,尤其针对大语言模型(LLM)在处理复杂、相互关联的代码库时推理能力受限的问题。其核心解决方案在于提出FactorSmith框架,关键创新点是结合两种互补策略:一是基于因子化部分可观测马尔可夫决策过程(factored POMDP)的分解方法,实现上下文精简以降低单次LLM调用的负担;二是引入分层规划者-设计者-批评家(planner-designer-critic)代理协作机制,在每一步生成中实现迭代质量优化与检查点回滚。该方法通过模块化状态变量操作和结构化反馈循环显著提升生成模拟的准确性、鲁棒性和代码质量。
链接: https://arxiv.org/abs/2603.20270
作者: Ali Shamsaddinlou,Morteza NourelahiAlamdari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generating executable simulations from natural language specifications remains a challenging problem due to the limited reasoning capacity of large language models (LLMs) when confronted with large, interconnected codebases. This paper presents FactorSmith, a framework that synthesizes playable game simulations in code from textual descriptions by combining two complementary ideas: factored POMDP decomposition for principled context reduction and a hierarchical planner-designer-critic agentic workflow for iterative quality refinement at every generation step. Drawing on the factored partially observable Markov decision process (POMDP) representation introduced by FactorSim [Sun et al., 2024], the proposed method decomposes a simulation specification into modular steps where each step operates only on a minimal subset of relevant state variables, limiting the context window that any single LLM call must process. Inspired by the agentic trio architecture of SceneSmith [Pfaff et al., 2025], FactorSmith embeds within every factored step a three-agent interaction: a planner that orchestrates workflow, a designer that proposes code artifacts, and a critic that evaluates quality through structured scoring, enabling iterative refinement with checkpoint rollback. This paper formalizes the combined approach, presents the mathematical framework underpinning context selection and agentic refinement, and describes the open-source implementation. Experiments on the PyGame Learning Environment benchmark demonstrate that FactorSmith generates simulations with improved prompt alignment, fewer runtime errors, and higher code quality compared to non-agentic factored baselines.
[AI-168] Domain-Specialized Tree of Thought through Plug-and-Play Predictors
【速读】:This paper addresses the trade-off between accuracy and computational efficiency that the Tree of Thoughts (ToT) framework faces on complex reasoning tasks. Existing approaches rely on heavyweight LLM self-evaluation or fixed heuristics for branch pruning, which makes them expensive and inflexible. The key idea is a lightweight, plug-and-play supervised predictor (DST) that serves as a context-aware heuristic guiding the ToT search: it proceeds with near-greedy efficiency on simple reasoning steps and dynamically widens the search beam only when it encounters uncertainty or task complexity, cutting computational overhead by 26-75% while matching or exceeding baseline accuracy.
链接: https://arxiv.org/abs/2603.20267
作者: Xuanqi Gao,Haoyu Wang,Jun Sun,Shiqing Ma,Chao Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) have advanced complex reasoning, prominent methods like the Tree of Thoughts (ToT) framework face a critical trade-off between exploration depth and computational efficiency. Existing ToT implementations often rely on heavyweight LLM-based self-evaluation or rigid heuristics for branch pruning, making them prohibitively expensive and inflexible for broad application. To address this, we introduce DST, an adaptable, plug-and-play predictor that serves as a lightweight, supervised heuristic to guide the ToT search process. Our predictor enables dynamic, context-aware pruning, allowing the search to proceed with near-greedy efficiency on simpler reasoning steps while adaptively expanding the search beam only when encountering uncertainty or task complexity. We evaluate our approach on a diverse suite of benchmarks spanning mathematical reasoning, general reasoning, and complex logical reasoning. Experimental results demonstrate that our method achieves accuracy competitive with or superior to strong baselines, including standard ToT, while reducing computational overhead by 26-75%. Our work effectively resolves the accuracy-efficiency trade-off in tree-based reasoning, transforming ToT from a resource-intensive technique into a scalable and practical paradigm for complex problem-solving in LLMs.
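The adaptive-beam idea described above can be sketched in a few lines. Everything here is a toy illustration: `predictor_score` is a deterministic stand-in for the trained DST predictor, and the threshold and beam sizes are made-up parameters, not values from the paper.

```python
def predictor_score(step: str) -> float:
    """Hypothetical stand-in for the lightweight supervised predictor;
    returns a confidence in [0, 1]. Deterministic toy scoring only."""
    return (sum(map(ord, step)) % 97) / 96.0

def adaptive_beam(candidates, confidence_threshold=0.8, max_beam=4):
    """Keep only the top candidate when the predictor is confident
    (near-greedy search); widen the beam under uncertainty."""
    scored = sorted(candidates, key=predictor_score, reverse=True)
    if predictor_score(scored[0]) >= confidence_threshold:
        return scored[:1]      # confident: near-greedy expansion
    return scored[:max_beam]   # uncertain: expand the search beam

beam = adaptive_beam(["step A", "step B", "step C", "step D", "step E"])
```

Because every candidate here scores below the confidence threshold, the search widens to the full beam; with a confident top candidate it would collapse to a single greedy step.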
[AI-169] JointFM-0.1: A Foundation Model for Multi-Target Joint Distributional Prediction
【速读】:This paper targets three practical obstacles to using stochastic differential equations (SDEs): high modeling risk, brittle calibration, and the computational cost of high-fidelity simulation. The key to the proposed foundation model, JointFM, is inverting the usual paradigm: instead of fitting SDEs to data, it trains a generic model on an endless stream of synthetic SDEs to predict future joint probability distributions directly. No task-specific calibration or fine-tuning is required; in a purely zero-shot setting it accurately recovers joint distributions generated by unseen synthetic SDEs, reducing the energy loss by 14.2% relative to the strongest baseline.
链接: https://arxiv.org/abs/2603.20266
作者: Stefan Hackmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the rapid advancements in Artificial Intelligence (AI), Stochastic Differential Equations (SDEs) remain the gold-standard formalism for modeling systems under uncertainty. However, applying SDEs in practice is fraught with challenges: modeling risk is high, calibration is often brittle, and high-fidelity simulations are computationally expensive. This technical report introduces JointFM, a foundation model that inverts this paradigm. Instead of fitting SDEs to data, we sample an infinite stream of synthetic SDEs to train a generic model to predict future joint probability distributions directly. This approach establishes JointFM as the first foundation model for distributional predictions of coupled time series - requiring no task-specific calibration or finetuning. Despite operating in a purely zero-shot setting, JointFM reduces the energy loss by 14.2% relative to the strongest baseline when recovering oracle joint distributions generated by unseen synthetic SDEs.
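The "infinite stream of synthetic SDEs" training idea can be illustrated with a toy Euler-Maruyama sampler. Everything below (the linear drift, the parameter ranges, the dimensions) is our invention for illustration; it is not JointFM's actual data generator.

```python
import numpy as np

def sample_synthetic_sde(n_paths=256, n_steps=50, dt=0.02, seed=0):
    """Simulate one randomly drawn 2-D linear SDE with Euler-Maruyama,
    producing samples from the joint distribution at the horizon --
    the kind of (SDE, target distribution) pair a model like JointFM
    could be trained on. Toy stand-in only."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.5, size=(2, 2))    # random drift matrix
    sigma = rng.uniform(0.1, 0.3, size=2)     # random diffusion scales
    x = np.zeros((n_paths, 2))
    for _ in range(n_steps):
        noise = rng.normal(size=(n_paths, 2))
        x = x + x @ A.T * dt + sigma * np.sqrt(dt) * noise
    return x

samples = sample_synthetic_sde()
```

Drawing a fresh `(A, sigma)` per call yields an effectively unlimited stream of coupled-time-series tasks, which is the core of the paradigm the abstract describes.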
[AI-170] ProMAS: Proactive Error Forecasting for Multi-Agent Systems Using Markov Transition Dynamics
【速读】:This paper tackles the propagation of logical fallacies during collaborative reasoning in multi-agent systems (MAS) built on large language models (LLMs), a fragility that can cause system-wide failure; existing work mostly relies on post-hoc analysis and cannot intervene in real time. The key to the proposed proactive framework, PROMAS, is extracting Causal Delta Features that capture semantic displacement and mapping them into a quantized Vector Markov Space, so that reasoning is modeled as probabilistic transitions. A proactive prediction head combined with jump detection localizes errors via risk acceleration rather than static thresholds, sharply reducing intervention latency while processing only 27% of the reasoning logs and balancing real-time responsiveness with diagnostic precision.
链接: https://arxiv.org/abs/2603.20260
作者: Xinkui Zhao,Sai Liu,Yifan Zhang,Qingyu Ma,Guanjie Cheng,Naibo Wang,Chang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Large Language Models into Multi-Agent Systems (MAS) has enabled the solution of complex, long-horizon tasks through collaborative reasoning. However, this collective intelligence is inherently fragile, as a single logical fallacy can rapidly propagate and lead to system-wide failure. Most current research relies on post-hoc failure analysis, thereby hindering real-time intervention. To address this, we propose PROMAS, a proactive framework utilizing Markov transitions for predictive error analysis. PROMAS extracts Causal Delta Features to capture semantic displacement, mapping them to a quantized Vector Markov Space to model reasoning as probabilistic transitions. By integrating a Proactive Prediction Head with Jump Detection, the method localizes errors via risk acceleration rather than static thresholds. On the WhoWhen benchmark, PROMAS achieves 22.97% step-level accuracy while processing only 27% of reasoning logs. This performance rivals reactive monitors like MASC while reducing data overhead by 73%. Although this strategy entails an accuracy trade-off compared to post-hoc methods, it significantly improves intervention latency, balancing diagnostic precision with the real-time demands of autonomous reasoning.
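A minimal sketch of the pipeline the abstract outlines: delta features as differences of step embeddings, nearest-codebook quantization into a discrete state space, and error flagging by risk acceleration. Every function name, shape, and number below is chosen by us for illustration; the paper's actual feature extractor and codebook are not specified in this abstract.

```python
import numpy as np

def delta_features(embeddings):
    """Causal delta features: semantic displacement between
    consecutive reasoning-step embeddings."""
    return np.diff(embeddings, axis=0)

def quantize(deltas, codebook):
    """Map each displacement to its nearest codebook vector, yielding
    a discrete state sequence for a 'Vector Markov Space'."""
    d = np.linalg.norm(deltas[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def risk_acceleration(risk):
    """Flag errors via the second difference (acceleration) of a
    per-step risk signal, rather than a static threshold."""
    return np.diff(np.asarray(risk, dtype=float), n=2)

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))        # 6 reasoning steps, embedding dim 8
codebook = rng.normal(size=(4, 8))   # 4 quantized states
states = quantize(delta_features(emb), codebook)
accel = risk_acceleration([0.1, 0.1, 0.2, 0.6, 0.9])
```

Transition counts over `states` would then estimate the Markov dynamics; a spike in `accel` marks where risk starts compounding.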
[AI-171] AI Detectors Fail Diverse Student Populations: A Mathematical Framing of Structural Detection Limits
【速读】:This paper examines the high false-positive rates of "black box" AI text detectors in university assessment, and in particular their disproportionate errors against certain student populations. Conventional theoretical analyses model detection as a hypothesis test between two known distributions of human and AI prose, omitting a structural feature of real assessment: the assessor does not know an individual student's writing distribution, so the null hypothesis is composite. Applying the variational characterization of total variation distance to this composite null shows that any text-only, one-shot detector with useful power must produce false accusations at a rate governed by the overlap between student writing and AI output, a constraint arising from population diversity that is logically independent of AI model quality and cannot be removed by better detector engineering. Recognizing this theoretical limit, the paper offers policy and practice recommendations, stressing that detection scores should not serve as sole evidence in academic misconduct proceedings.
链接: https://arxiv.org/abs/2603.20254
作者: Nathan Garland
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Other Statistics (stat.OT)
备注:
Abstract:Student experiences and empirical studies report that “black box” AI text detectors produce high false positive rates with disproportionate errors against certain student populations, yet typically theoretical analyses model detection as a test between two known distributions for human and AI prose. This framing omits the structural feature of university assessment whereby an assessor generally does not know the individual student’s writing distribution, making the null hypothesis composite. Standard application of the variational characterisation of total variation distance to this composite null shows trade-off bounds that any text-only, one-shot detector with useful power must produce false accusations at a rate governed by the distributional overlap between student writing and AI output. This is a constraint arising from population diversity that is logically independent of AI model quality and cannot be overcome by better detector engineering or technology. A subgroup mixture bound connects these quantities to observable demographic groups, providing a theoretical basis for the disparate impact patterns documented empirically. We propose suggestions to improve policy and practice, and argue that detection scores should not serve as sole evidence in misconduct proceedings.
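The core bound can be stated with the standard variational characterization of total variation distance (notation ours, not the paper's). For a detector with rejection region $A$, false-positive rate $\alpha = P(A)$ under a student's writing distribution $P$, and power $\beta = Q(A)$ against AI output $Q$:

```latex
\beta - \alpha \;=\; Q(A) - P(A)
\;\le\; \sup_{A'}\,\bigl|\,Q(A') - P(A')\,\bigr|
\;=\; d_{\mathrm{TV}}(P, Q),
\qquad\text{hence}\qquad
\alpha \;\ge\; \beta - d_{\mathrm{TV}}(P, Q).
```

Under a composite null, $\alpha$ must be controlled simultaneously for every plausible student distribution $P$, so the binding constraint comes from the subgroup whose writing overlaps most with $Q$, which is where the disparate-impact pattern enters.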
[AI-172] Emergency Lane-Change Simulation: A Behavioral Guidance Approach for Risky Scenario Generation
【速读】:This paper addresses the difficulty of efficiently learning realistic emergency behaviors when reinforcement learning is used to generate high-risk lane-change scenarios for autonomous-driving testing. The key to the proposed behavior-guided generation method lies in three steps: first, a behavior-learning module built on an optimized Sequence Generative Adversarial Network (SeqGAN) learns emergency lane-change behaviors effectively from limited data; second, the opposing vehicle is modeled as an agent whose operating environment fuses the road and surrounding vehicle states, and a Recursive Proximal Policy Optimization (RPPO) strategy guides it toward dangerous behaviors; finally, Model Predictive Control (MPC) is coupled with the reference trajectory as a physical constraint to continuously refine the policy and keep generated trajectories physically realistic. The method markedly improves both the efficiency and the realism of high-risk collision-scenario generation.
链接: https://arxiv.org/abs/2603.20234
作者: Chen Xiong,Cheng Wang,Yuhang Liu,Zirui Wu,Ye Tian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:In contemporary autonomous driving testing, virtual simulation has become an important approach due to its efficiency and cost effectiveness. However, existing methods usually rely on reinforcement learning to generate risky scenarios, making it difficult to efficiently learn realistic emergency behaviors. To address this issue, we propose a behavior guided method for generating high risk lane change scenarios. First, a behavior learning module based on an optimized sequence generative adversarial network is developed to learn emergency lane change behaviors from an extracted dataset. This design alleviates the limitations of existing datasets and improves learning from relatively few samples. Then, the opposing vehicle is modeled as an agent, and the road environment together with surrounding vehicles is incorporated into the operating environment. Based on the Recursive Proximal Policy Optimization strategy, the generated trajectories are used to guide the vehicle toward dangerous behaviors for more effective risk scenario exploration. Finally, the reference trajectory is combined with model predictive control as physical constraints to continuously optimize the strategy and ensure physical authenticity. Experimental results show that the proposed method can effectively learn high risk trajectory behaviors from limited data and generate high risk collision scenarios with better efficiency than traditional methods such as grid search and manual design.
[AI-173] Fusing Driver Perceived and Physical Risk for Safety Critical Scenario Screening in Autonomous Driving
【速读】:This paper targets the inefficiency and weakly grounded risk quantification of current safety-critical scenario screening for autonomous-driving testing, where existing pipelines depend on manual annotation and frame-by-frame risk evaluation, driving up computational cost. The key is a driver-risk-fusion screening method: during training, an improved Driver Risk Field combined with a dynamic cost model produces high-quality risk supervision signals; during inference, scenario-level risk scores are predicted directly via fast forward passes, avoiding per-frame computation and enabling large-scale ranking and retrieval. The improved Driver Risk Field introduces a new risk-height function and a speed-adaptive look-ahead mechanism; the dynamic cost model integrates kinetic energy, oriented bounding-box constraints, and Gaussian-kernel diffusion smoothing for more accurate interaction modeling; and a risk-trajectory cross-attention decoder jointly decodes risk and trajectories, yielding markedly smoother and more discriminative risk estimates.
链接: https://arxiv.org/abs/2603.20232
作者: Chen Xiong,Ziwen Wang,Deqi Wang,Cheng Wang,Yiyang Chen,He Zhang,Chao Gou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Autonomous driving testing increasingly relies on mining safety critical scenarios from large scale naturalistic driving data, yet existing screening pipelines still depend on manual risk annotation and expensive frame by frame risk evaluation, resulting in low efficiency and weakly grounded risk quantification. To address this issue, we propose a driver risk fusion based hazardous scenario screening method for autonomous driving. During training, the method combines an improved Driver Risk Field with a dynamic cost model to generate high quality risk supervision signals, while during inference it directly predicts scenario level risk scores through fast forward passes, avoiding per frame risk computation and enabling efficient large scale ranking and retrieval. The improved Driver Risk Field introduces a new risk height function and a speed adaptive look ahead mechanism, and the dynamic cost model integrates kinetic energy, oriented bounding box constraints, and Gaussian kernel diffusion smoothing for more accurate interaction modeling. We further design a risk trajectory cross attention decoder to jointly decode risk and trajectories. Experiments on the INTERACTION and FLUID datasets show that the proposed method produces smoother and more discriminative risk estimates. On FLUID, it achieves an AUC of 0.792 and an AP of 0.825, outperforming PODAR by 9.1 percent and 5.1 percent, respectively, demonstrating its effectiveness for scalable risk labeling and hazardous scenario screening.
[AI-174] Beyond Scalar Rewards: Distributional Reinforcement Learning with Preordered Objectives for Safe and Reliable Autonomous Driving ICRA2026
【速读】:This paper addresses multi-objective optimization in autonomous driving: conventional reinforcement learning (RL) combines conflicting objectives such as safety, efficiency, and comfort by weighted summation, which collapses their relative priorities and often yields policies that violate safety constraints. The key contribution is the Preordered Multi-Objective MDP (Pr-MOMDP), which augments a standard multi-objective MDP (MOMDP) with a preorder over reward components, enabling reasoning about actions with respect to a hierarchy of objectives. The authors further design Quantile Dominance (QD), a novel pairwise comparison metric that evaluates action return distributions without collapsing them into a single statistic, and use it to extract the optimal subset of non-dominated actions, so that precedence information shapes both decision-making and training targets. Instantiated with Implicit Quantile Networks (IQN), the framework improves success rates, reduces collisions and off-road events in Carla simulation, and yields statistically more robust policies.
链接: https://arxiv.org/abs/2603.20230
作者: Ahmed Abouelazm,Jonas Michel,Daniel Bogdoll,Philip Schörner,J. Marius Zöllner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: First and Second authors contributed equally; Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Abstract:Autonomous driving involves multiple, often conflicting objectives such as safety, efficiency, and comfort. In reinforcement learning (RL), these objectives are typically combined through weighted summation, which collapses their relative priorities and often yields policies that violate safety-critical constraints. To overcome this limitation, we introduce the Preordered Multi-Objective MDP (Pr-MOMDP), which augments standard MOMDPs with a preorder over reward components. This structure enables reasoning about actions with respect to a hierarchy of objectives rather than a scalar signal. To make this structure actionable, we extend distributional RL with a novel pairwise comparison metric, Quantile Dominance (QD), that evaluates action return distributions without reducing them into a single statistic. Building on QD, we propose an algorithm for extracting optimal subsets, the subset of actions that remain non-dominated under each objective, which allows precedence information to shape both decision-making and training targets. Our framework is instantiated with Implicit Quantile Networks (IQN), establishing a concrete implementation while preserving compatibility with a broad class of distributional RL methods. Experiments in Carla show improved success rates, fewer collisions and off-road events, and deliver statistically more robust policies than IQN and ensemble-IQN baselines. By ensuring policies respect rewards preorder, our work advances safer, more reliable autonomous driving systems.
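One plausible reading of Quantile Dominance, sketched below: compare two actions' return-quantile vectors level by level, and keep the actions no other action dominates. The exact definition in the paper may differ; the 0.5 cutoff, function names, and numbers are our illustrative assumptions.

```python
import numpy as np

def quantile_dominance(q_a, q_b):
    """Fraction of quantile levels at which action a's return
    quantiles exceed action b's (illustrative pairwise metric)."""
    q_a, q_b = np.asarray(q_a, float), np.asarray(q_b, float)
    return float(np.mean(q_a > q_b))

def non_dominated(actions):
    """Keep actions that no other action dominates, i.e. no rival
    beats them at a majority of quantile levels."""
    keep = []
    for name, q in actions.items():
        if all(quantile_dominance(other_q, q) <= 0.5
               for other_name, other_q in actions.items()
               if other_name != name):
            keep.append(name)
    return keep

acts = {
    "brake":  [0.2, 0.4, 0.6, 0.8],   # made-up return quantiles
    "cruise": [0.1, 0.3, 0.5, 0.7],
}
best = non_dominated(acts)
```

Here "brake" dominates "cruise" at every quantile level, so only "brake" survives; with crossing quantile curves, both actions would remain in the optimal subset.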
[AI-175] Characterizing the ability of LLM s to recapitulate Americans distributional responses to public opinion polling questions across political issues
【速读】:This paper responds to the rising cost and bias risk of traditional survey-based political issue polling, driven by falling response rates and poor coverage of key demographic groups. The key is a new framework that directly prompts a large language model (LLM) to predict the distribution of responses to multiple-choice political issue questions, rather than repeatedly querying the LLM to simulate individual respondents as prior work does. The approach is consistently more accurate and substantially cheaper than individual querying, and its performance varies far more systematically and predictably across demographics and questions, so that researchers can anticipate model performance using only information available before a query is issued.
链接: https://arxiv.org/abs/2603.20229
作者: Eric Gong,Nathan E. Sanders,Bruce Schneier
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Traditional survey-based political issue polling is becoming less tractable due to increasing costs and risk of bias associated with growing non-response rates and declining coverage of key demographic groups. With researchers and pollsters seeking alternatives, Large Language Models have drawn attention for their potential to augment human population studies in polling contexts. We propose and implement a new framework for anticipating human responses on multiple-choice political issue polling questions by directly prompting an LLM to predict a distribution of responses. By comparison to a large and high quality issue poll of the US population, the Cooperative Election Study, we evaluate how the accuracy of this framework varies across a range of demographics and questions on a variety of topics, as well as how this framework compares to previously proposed frameworks where LLMs are repeatedly queried to simulate individual respondents. We find the proposed framework consistently exhibits more accurate predictions than individual querying at significantly lower cost. In addition, we find the performance of the proposed framework varies much more systematically and predictably across demographics and questions, making it possible for those performing AI polling to better anticipate model performance using only information available before a query is issued.
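Evaluating such a framework reduces to comparing a predicted answer distribution against survey ground truth. A natural distance for discrete response shares is total variation; the numbers below are made up for illustration and are not from the paper or from CES data.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two discrete response distributions,
    normalizing both so they sum to one."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.abs(p / p.sum() - q / q.sum()).sum())

# Hypothetical four-option polling question:
predicted = [0.42, 0.31, 0.17, 0.10]   # one direct LLM query
observed  = [0.45, 0.28, 0.18, 0.09]   # survey ground truth
err = total_variation(predicted, observed)
```

A single distribution-valued query replaces hundreds of simulated-respondent samples, which is where the cost advantage claimed in the abstract comes from.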
[AI-176] The Arrival of AGI? When Expert Personas Exceed Expert Benchmarks
【速读】:The question at issue is whether expert personas improve language-model performance. A study from the Wharton Generative AI Lab concluded that they do not, but this paper argues the null result stems from flaws in the experimental design rather than the absence of a real effect. The key move is identifying and correcting five structural limitations: baseline contamination pushing the starting point to near-ceiling, a system-prompt hierarchy subordinating the experimental manipulation, impossible expert specifications collapsing to generic competence, format constraints suppressing reasoning, and provider exclusion limiting generalizability. In controlled trials on the hardest GPQA Diamond questions, expert personas reach ceiling accuracy on items with valid key answers and eliminate all baseline errors through confidence amplification. Fine-grained inspection of the models' chains of thought further reveals that some of the hardest items contain chemically or logically indefensible answers, so models are penalized for reasoning away from those impossible choices, recontextualizing the original "null effect" finding. This indicates that persona research faces measurement constraints imposed by benchmark validity, and that rigorous persona studies require evaluation infrastructure the field does not yet possess.
链接: https://arxiv.org/abs/2603.20225
作者: Drake Mullens,Stella Shen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Do expert personas improve language model performance? The Wharton Generative AI Lab reports that they do not, broadcasting to millions via social media the recommendation that practitioners abandon a technique recommended by Anthropic, Google, and OpenAI. We demonstrate that this null finding was structurally predictable. Five core mechanisms precluded detection before data collection began: baseline contamination elevating the starting point to near-ceiling, system prompt hierarchy subordinating experimental manipulation, impossible expert specifications collapsing to generic competence, format constraints suppressing reasoning processes, and provider exclusion limiting generalizability. Controlled trials correcting these limitations reveal what the original design obscured. To test this, we selected the GPQA Diamond hardest questions to prevent baseline pattern matching, forcing reliance on genuine expert reasoning. On items with valid key answers, expert personas achieve ceiling accuracy. They eliminated all baseline errors through confidence amplification. Furthermore, forensic examination of model divergence identified that half of the hardest GPQA items contain chemically or logically indefensible answers. The model’s CoT revealed reasoning away from impossible answers, yielding penalization for accurate chemistry. These findings recontextualize the original null results. Methodologically sound persona research faces measurement constraints imposed by benchmark validity limitations. Answering the persona question requires evaluation infrastructure the field does not yet possess.
[AI-177] Inference Energy and Latency in AI-Mediated Education: A Learning-per-Watt Analysis of Edge and Cloud Models
【速读】:This paper investigates how generative AI tutors can deliver immediate feedback in education efficiently, with low latency and low energy cost, especially when deployed on resource-constrained devices, i.e., the latency-energy-learning trade-off. The key contribution is a new metric, Learning-per-Watt (LpW), quantifying the pedagogical value obtained per unit of energy, validated by comparing full-precision FP16 against 4-bit NormalFloat (NF4) quantization inference on an NVIDIA T4 GPU. The results show that quantization efficiency is strongly dependent on hardware and inference regime (e.g., whether the KV cache is enabled), providing empirical grounding for equitable AI-tutoring deployment in low-resource settings.
链接: https://arxiv.org/abs/2603.20223
作者: Kushal Khemani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Immediate feedback is a foundational requirement of effective AI-mediated learning, yet the energy and latency costs of delivering it remain largely unexamined. This study investigates the latency-energy-learning trade-off in AI tutoring through an empirical comparison of two on-device inference configurations of Microsoft Phi-3 Mini (4k-instruct) on an NVIDIA T4 GPU: full-precision FP16 and 4-bit NormalFloat (NF4) quantisation. Both were evaluated under KV-cache-enabled inference across 500 educational prompts spanning five secondary school subject domains. Pedagogical quality was assessed for each of the 1000 generated responses by a hybrid panel of 10 Cambridge International teachers and three frontier AI systems using a four-dimension rubric. We introduce Learning-per-Watt (LpW), a novel metric quantifying pedagogical value per unit of energy over the learner’s waiting window. Under realistic deployment, NF4 achieves lower per-inference energy than FP16 (329 J vs. 369 J) but higher latency (13.4 s vs. 9.2 s), yielding a modest FP16 advantage in LpW of 1.33x at a quality difference of 0.19 points. Under cache-disabled inference – used in offline evaluation but absent from real deployments – the gap widens to 7.4x, overstating the FP16 advantage by more than fivefold. Quantisation efficiency is hardware-dependent and inference-regime dependent, with significant implications for equitable AI tutoring deployment in low-resource settings.
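One way to read the LpW metric is pedagogical quality per watt of average power over the learner's waiting window; the abstract does not give the exact formula, so the definition and the rubric scores below are our assumptions (only the energy and latency figures come from the abstract).

```python
def learning_per_watt(quality, energy_joules, latency_s):
    """Illustrative Learning-per-Watt: rubric quality per watt of
    average power over the waiting window. The paper's exact
    formula may differ; treat this as a sketch."""
    avg_power_w = energy_joules / latency_s
    return quality / avg_power_w

# Hypothetical rubric scores on a 4-point scale, paired with the
# energy/latency figures quoted in the abstract:
lpw_fp16 = learning_per_watt(3.5, 369.0, 9.2)    # FP16: 369 J, 9.2 s
lpw_nf4 = learning_per_watt(3.3, 329.0, 13.4)    # NF4: 329 J, 13.4 s
```

Whichever exact formula the paper uses, the point stands that the metric couples quality, energy, and latency, so a cache-disabled benchmark that inflates latency will distort the comparison.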
[AI-178] Exploring Teacher-Chatbot Interaction and Affect in Block-Based Programming
【速读】:This paper asks how large language model (LLM)-based chatbots can be designed and deployed effectively in middle-school programming instruction to support teachers and students while avoiding counterproductive effects. The key lies in identifying distinct teacher orientations toward the chatbot (three profiles: "explorer", "frustrated", and "mixed") and deriving targeted design recommendations from them: scaffolding the introduction of the chatbot, strengthening teacher control over its features, and making explicit when and how it should be used, thereby supporting classroom activities while mitigating risks.
链接: https://arxiv.org/abs/2603.20211
作者: Bahare Riahi,Ally Limke,Xiaoyi Tian,Viktoriia Storozhevykh,Sayali Patukale,Tahreem Yasir,Khushbu Singh,Jennifer Chiu,Nicholas lytle,Tiffany Barnes,Veronica Catete
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 9 figures, CHI26
Abstract:AI-based chatbots have the potential to accelerate learning and teaching, but may also have counterproductive consequences without thoughtful design and scaffolding. To better understand teachers’ perspectives on large language model (LLM)-based chatbots, we conducted a study with 11 teams of middle school teachers using chatbots for a science and computational thinking activity within a block-based programming environment. Based on a qualitative analysis of audio transcripts and chatbot interactions, we propose three profiles: explorer, frustrated, and mixed, that reflect diverse scaffolding needs. In their discussions, we found that teachers perceived chatbot benefits such as building prompting skills and self-confidence alongside risks including potential declines in learning and critical thinking. Key design recommendations include scaffolding the introduction to chatbots, facilitating teacher control of chatbot features, and suggesting when and how chatbots should be used. Our contribution informs the design of chatbots to support teachers and learners in middle school coding activities.
[AI-179] Measuring Research Convergence in Interdisciplinary Teams Using Large Language Models and Graph Analytics
【速读】:This paper addresses the persistent challenge of how interdisciplinary research teams achieve convergence on shared knowledge. The key is a multi-layer, AI-driven analytical framework integrating large language models (LLMs), graph-based visualization and analytics, and human-in-the-loop evaluation to systematically characterize how research viewpoints are shared, influenced, and integrated over time. Specifically, LLMs extract structured viewpoints aligned with the Needs-Approach-Benefits-Competition (NABC) framework and infer viewpoint flows, supporting three complementary analyses: similarity-based qualitative analysis identifying popular and unique viewpoint types, quantitative cross-domain influence analysis using network centrality measures, and temporal tracking of viewpoint-flow dynamics. Expert structured surveys and cross-layer consistency checks mitigate uncertainty in LLM inference. A case study on water insecurity in underserved communities, part of the Arizona Water Innovation Initiative, demonstrates the framework's effectiveness for fostering research convergence.
链接: https://arxiv.org/abs/2603.20204
作者: Wenwen Li,Yuanyuan Tian,Sizhe Wang,Amber Wutich,Paul Westerhoff,Sarah Porter,Anais Roque,Jobayer Hossain,Patrick Thomson,Rhett Larson,Michael Hanemann
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how interdisciplinary research teams converge on shared knowledge is a persistent challenge. This paper presents a novel, multi-layer, AI-driven analytical framework for mapping research convergence in interdisciplinary teams. The framework integrates large language models (LLMs), graph-based visualization and analytics, and human-in-the-loop evaluation to examine how research viewpoints are shared, influenced, and integrated over time. LLMs are used to extract structured viewpoints aligned with the Needs-Approach-Benefits-Competition (NABC) framework and to infer potential viewpoint flows across presenters, forming a common semantic foundation for three complementary analyses: (1) similarity-based qualitative analysis to identify two key types of viewpoints, popular and unique, for building convergence, (2) quantitative cross-domain influence analysis using network centrality measures, and (3) temporal viewpoint flow analysis to capture convergence dynamics. To address uncertainty in LLM-based inference, the framework incorporates expert validation through structured surveys and cross-layer consistency checks. A case study on water insecurity in underserved communities as part of the Arizona Water Innovation Initiatives demonstrates increasing viewpoint convergence and domain-specific influence patterns, illustrating the value of the proposed AI-enabled approach for research convergence analysis.
[AI-180] CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks
【速读】:This paper explores how discrete holographic duality can be used to understand and optimize graph-based AI systems, focusing on information-processing tasks on Cayley graphs. The central question is whether, in analogy with the holographic principle of string theory, path-prediction problems on large graph structures (such as trajectory prediction in generative AI or reinforcement learning) admit a dual description as area minimization over geometric objects. The key contribution is a new discrete holographic duality in which states of a Cayley graph map to paths inside a planar polygon and graph distances correspond to the area under those paths, directly realizing the "complexity = volume" principle; for Cayley graphs of the symmetric group S_n, the dual objects are flat planar polygons, with the graph diameter proportional to the number of integer points inside the polygon. This offers geometric inspiration for data embeddings in AI systems and reveals evidence for continuous CFTs and dual strings in the large-n limit.
链接: https://arxiv.org/abs/2603.22195
作者: A. Chervov,F. Levkovich-Maslyuk,A. Smolensky,F. Khafizov,I. Kiselev,D. Melnikov,I. Koltsov,S. Kudashev,D. Shiltsov,M. Obozov,S. Krymskii,V. Kirova,E.V. Konstantinova,A. Soibelman,S. Galkin,L. Grunwald,A. Kotov,A. Alexandrov,S. Lytkin,D. Fedoriaka,A. Chevychelov,Z. Kogan,A. Natyrova,L. Cheldieva,O. Nikitina,S. Fironov,A. Vakhrushev,A. Lukyanenko,V. Ilin,D. Gorodkov,N. Bogachev,I. Gaiur,M. Zaitsev,F. Petrov,L. Petrov,T. Gaintseva,A. Gavrilova,M. N. Smirnov,N. Kalinin,A. Khan,K. Jung,H. Mousset,H. Isambert,O. Debeaupuis
机构: 未知
类目: High Energy Physics - Theory (hep-th); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Combinatorics (math.CO); Group Theory (math.GR)
备注: 20+120 pages
Abstract:This is the fourth paper in the CayleyPy project, which applies AI methods to the exploration of large graphs. In this work, we suggest the existence of a new discrete version of holographic string dualities for this setup, and discuss their relevance to AI systems and mathematics. Many modern AI tasks – such as those addressed by GPT-style language models or RL systems – can be viewed as direct analogues of predicting particle trajectories on graphs. We investigate this problem for a large family of Cayley graphs, for which we show that surprisingly it admits a dual description in terms of discrete strings. We hypothesize that such dualities may extend to a range of AI systems where they can lead to more efficient computational approaches. In particular, string holographic images of states are proposed as natural candidates for data embeddings, motivated by the “complexity = volume” principle in AdS/CFT. For Cayley graphs of the symmetric group S_n, our results indicate that the corresponding dual objects are flat, planar polygons. The diameter of the graph is equal to the number of integer points inside the polygon scaled by n. Vertices of the graph can be mapped holographically to paths inside the polygon, and the usual graph distances correspond to the area under the paths, thus directly realising the “complexity = volume” paradigm. We also find evidence for continuous CFTs and dual strings in the large n limit. We confirm this picture and other aspects of the duality in a large initial set of examples. We also present new datasets (obtained by a combination of ML and conventional tools) which should be instrumental in establishing the duality for more general cases. 
[AI-181] Suiren-1.0 Technical Report: A Family of Molecular Foundation Models
【速读】:This paper addresses the challenge of modeling multi-scale structure and properties of organic molecular systems, in particular how to map from 3D conformational geometry to the 2D statistical-ensemble space while preserving accuracy. The key is an algorithmic framework integrating three specialized variants: Suiren-Base (1.8B parameters) is pre-trained on a 70M-sample density functional theory (DFT) dataset with spatial self-supervision and an SE(3)-equivariant architecture, achieving robust quantum-property prediction; Suiren-Dimer extends its scope through continued pre-training on 13.5M intermolecular-interaction samples; finally, the proposed Conformation Compression Distillation (CCD), a diffusion-based method, efficiently compresses complex 3D structural representations into 2D conformation-averaged ones, yielding the lightweight Suiren-ConfAvg, which produces high-fidelity representations directly from SMILES or molecular graphs and markedly improves downstream efficiency and accuracy.
链接: https://arxiv.org/abs/2603.21942
作者: Junyi An,Xinyu Lu,Yun-Fei Shi,Li-Cheng Xu,Nannan Zhang,Chao Qu,Yuan Qi,Fenglei Cao
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注: 23 pages,5 figures
Abstract:We introduce Suiren-1.0, a family of molecular foundation models for the accurate modeling of diverse organic systems. Suiren-1.0 comprising three specialized variants (Suiren-Base, Suiren-Dimer, and Suiren-ConfAvg) is integrated within an algorithmic framework that bridges the gap between 3D conformational geometry and 2D statistical ensemble spaces. We first pre-train Suiren-Base (1.8B parameters) on a 70M-sample Density Functional Theory dataset using spatial self-supervision and SE(3)-equivariant architectures, achieving robust performance in quantum property prediction. Suiren-Dimer extends this capability through continued pre-training on 13.5M intermolecular interaction samples. To enable efficient downstream application, we propose Conformation Compression Distillation (CCD), a diffusion-based framework that distills complex 3D structural representations into 2D conformation-averaged representations. This yields the lightweight Suiren-ConfAvg, which generates high-fidelity representations from SMILES or molecular graphs. Our extensive evaluations demonstrate that Suiren-1.0 establishes state-of-the-art results across a range of tasks. All models and benchmarks are open-sourced.
[AI-182] DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
【速读】:This paper addresses the weak generalization of speech enhancement (SE) models in practice, which stems from limited training data and narrow evaluation conditions, and in particular their lack of robustness to multiple distortions such as noise, reverberation, and compression. The key is the proposed DiT-Flow framework, a flow matching-based SE method with a latent Diffusion Transformer (DiT) backbone, trained on compact latent features extracted by a variational auto-encoder (VAE). A parameter-efficient training strategy combining LoRA (Low-Rank Adaptation) with a Mixture of Experts (MoE) architecture achieves strong performance on five unseen distortions while using only 4.9% of the total parameters, substantially improving adaptability to complex, variable real-world conditions.
链接: https://arxiv.org/abs/2603.21608
作者: Tianyu Cao,Helin Wang,Ari Frummer,Yuval Sieradzki,Adi Arbel,Laureano Moro Velazquez,Jesus Villalba,Oren Gal,Thomas Thebaud,Najim Dehak
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational auto-encoders (VAEs)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve both parameter-efficient and high-performance training for DiT-Flow robust to multiple distortions with using 4.9% percentage of the total parameters to obtain a better performance on five unseen distortions.
[AI-183] B-jet Tagging Using a Hybrid Edge Convolution and Transformer Architecture
【速读】:This paper addresses high-accuracy classification for bottom-quark jet (b-jet) tagging, in particular the challenging task of separating b-jets from charm and light-quark jets in hadron-collider environments. The key is a hybrid deep-learning architecture that fuses local feature extraction with global dependency modeling, the Edge Convolution Transformer (ECT): it processes track-level features (impact parameters, momentum, and their significances) alongside jet-level observables (vertex information and kinematics), using edge convolutions to capture local structure and transformer self-attention to model long-range dependencies. The model reaches an AUC of 0.9333, clearly outperforming ParticleNet (0.8904) and a pure transformer baseline (0.9216), excels particularly at charm-jet rejection, and keeps inference latency below 0.060 ms per jet.
链接: https://arxiv.org/abs/2603.21326
作者: Diego F. Vasquez Plaza,Vidya Manian
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: JINST Article, 21, P03019, 2026
Abstract:Jet flavor tagging plays an important role in precise Standard Model measurements, enabling the extraction of the mass dependence of jet-quark and quark-gluon plasma (QGP) interactions. It also enables inferring the nature of particles produced in high-energy particle collisions that contain heavy quarks. The classification of bottom jets is vital for exploring new physics scenarios in proton-proton collisions. In this research, we present a hybrid deep learning architecture that integrates edge convolutions with transformer self-attention mechanisms into a single architecture, the Edge Convolution Transformer (ECT) model, for bottom-quark jet tagging. ECT processes track-level features (impact parameters, momentum, and their significances) alongside jet-level observables (vertex information and kinematics) to achieve state-of-the-art performance. The study utilizes the ATLAS simulation dataset. We demonstrate that ECT achieves 0.9333 AUC for b-jet versus combined charm and light jet discrimination, surpassing ParticleNet (0.8904 AUC) and the pure transformer baseline (0.9216 AUC). The model maintains inference latency below 0.060 ms per jet on modern GPUs, meeting the stringent requirements for real-time event selection at the LHC. Our results demonstrate that hybrid architectures combining local and global features offer superior performance for challenging jet classification tasks. The proposed architecture achieves good results in b-jet tagging, particularly excelling in charm jet rejection (the most challenging task), while maintaining light-jet discrimination competitive with pure transformer models.
[AI-184] TRACE: A Multi-Agent System for Autonomous Physical Reasoning in Seismological Science
【速读】: This paper tackles the difficulty of inferring the physical mechanisms of earthquake sequences from indirect geophysical observations, especially across tectonically distinct settings where similar seismic patterns may correspond to different driving processes. Existing practice relies on expert synthesis of catalogs, spatiotemporal statistics, and candidate physical models, limiting reproducibility and the transfer of insight across scenarios. The key to the solution is TRACE (Trans-perspective Reasoning and Automated Comprehensive Evaluator), a multi-agent system that combines large language model planning with formal seismological constraints to derive auditable, physically grounded mechanistic inference directly from raw observations, shifting the field from expert-dependent analysis toward knowledge-guided autonomous discovery.
链接: https://arxiv.org/abs/2603.21152
作者: Feng Liu,Jian Xu,Xin Cui,Xinghao Wang,Zijie Guo,Jiong Wang,S. Mostafa Mousavi,Xinyu Gu,Hao Chen,Ben Fei,Lihua Fang,Fenghua Ling,Zefeng Li,Lei Bai
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
备注: 25 pages for main text and 164 pages for appendices
Abstract:Inferring the physical mechanisms that govern earthquake sequences from indirect geophysical observations remains difficult, particularly across tectonically distinct environments where similar seismic patterns can reflect different underlying processes. Current interpretations rely heavily on the expert synthesis of catalogs, spatiotemporal statistics, and candidate physical models, limiting reproducibility and the systematic transfer of insight across settings. Here we present TRACE (Trans-perspective Reasoning and Automated Comprehensive Evaluator), a multi-agent system that combines large language model planning with formal seismological constraints to derive auditable, physically grounded mechanistic inference from raw observations. Applied to the 2019 Ridgecrest sequence, TRACE autonomously identifies stress-perturbation-induced delayed triggering, resolving the cascading interaction between the Mw 6.4 and Mw 7.1 mainshocks; in the Santorini-Kolumbo case, the system identifies a structurally guided intrusion model, distinguishing fault-channeled episodic migration from the continuous propagation expected in homogeneous crustal failure. By providing a generalizable logical infrastructure for interpreting heterogeneous seismic phenomena, TRACE advances the field from expert-dependent analysis toward knowledge-guided autonomous discovery in Earth sciences.
[AI-185] Sinkhorn Based Associative Memory Retrieval Using Spherical Hellinger Kantorovich Dynamics
【速读】: This paper aims to build an efficient dense associative memory over weighted point clouds, where stored patterns and queries are both finitely supported probability measures (empirical measures). Whereas classical Hopfield networks handle vectors, this work extends the memory mechanism to the space of probability measures to support richer structured data representations. The key to the solution is a log-sum-exp energy built from the debiased Sinkhorn divergence, from which a Spherical Hellinger Kantorovich (SHK) gradient flow is derived as the retrieval dynamics, jointly updating the support locations and weights of the point cloud. Discretizing this flow yields a deterministic algorithm that uses Sinkhorn potentials to compute barycentric transport steps with multiplicative simplex reweighting, achieving stable, geometrically convergent pattern recovery. Theoretically, under local separation and PL-type conditions the system enjoys basin invariance, geometric convergence to local minimizers, and closeness of minimizers to the original stored patterns; under a random pattern model, the Sinkhorn basins are disjoint with high probability, establishing exponential capacity in the ambient dimension.
链接: https://arxiv.org/abs/2603.20656
作者: Aratrika Mustafi,Soumya Mukherjee
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
备注:
Abstract:We propose a dense associative memory for empirical measures (weighted point clouds). Stored patterns and queries are finitely supported probability measures, and retrieval is defined by minimizing a Hopfield-style log-sum-exp energy built from the debiased Sinkhorn divergence. We derive retrieval dynamics as a spherical Hellinger Kantorovich (SHK) gradient flow, which updates both support locations and weights. Discretizing the flow yields a deterministic algorithm that uses Sinkhorn potentials to compute barycentric transport steps and a multiplicative simplex reweighting. Under local separation and PL-type conditions we prove basin invariance, geometric convergence to a local minimizer, and a bound showing the minimizer remains close to the corresponding stored pattern. Under a random pattern model, we further show that these Sinkhorn basins are disjoint with high probability, implying exponential capacity in the ambient dimension. Experiments on synthetic Gaussian point-cloud memories demonstrate robust recovery from perturbed queries versus a Euclidean Hopfield-type baseline.
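The energy the abstract describes, a log-sum-exp over debiased Sinkhorn divergences between weighted point clouds, can be sketched compactly. The NumPy code below uses plain (non-log-domain) Sinkhorn iterations and toy data; the paper's SHK retrieval flow that updates supports and weights is not reproduced here, and all names and parameters are illustrative.

```python
import numpy as np

def sinkhorn_cost(x, a, y, b, eps=1.0, iters=300):
    """Entropic OT cost <P, C> between weighted point clouds (x, a) and (y, b)."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
    K = np.exp(-C / eps)
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(iters):                               # Sinkhorn fixed point
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                      # transport plan
    return float((P * C).sum())

def debiased_divergence(x, a, y, b, eps=1.0):
    """Debiased Sinkhorn divergence: S(x,y) - (S(x,x) + S(y,y)) / 2."""
    return (sinkhorn_cost(x, a, y, b, eps)
            - 0.5 * sinkhorn_cost(x, a, x, a, eps)
            - 0.5 * sinkhorn_cost(y, b, y, b, eps))

def energy(query, qw, patterns, beta=5.0):
    """Hopfield-style log-sum-exp energy over stored measures."""
    d = np.array([debiased_divergence(query, qw, p, w) for p, w in patterns])
    return -np.log(np.exp(-beta * d).sum()) / beta, d

rng = np.random.default_rng(1)
w = np.full(5, 0.2)                                      # uniform weights
patterns = [(rng.normal(m, 0.1, size=(5, 2)), w) for m in (0.0, 3.0)]
noisy = patterns[0][0] + 0.05 * rng.standard_normal((5, 2))
E, dists = energy(noisy, w, patterns)
```

Gradient descent of this energy over support locations and (rescaled) weights would give a crude analogue of the SHK retrieval dynamics; the perturbed query is nearest, in Sinkhorn divergence, to the pattern it came from.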
[AI-186] Interpretable Operator Learning for Inverse Problems via Adaptive Spectral Filtering: Convergence and Discretization Invariance
【速读】: This paper targets the instability of inverting ill-posed inverse problems under measurement noise: classical methods such as Tikhonov regularization depend on heuristic parameter tuning, while standard deep learning approaches often lack interpretability and generalize poorly across resolutions. The key to the solution is SC-Net (Spectral Correction Network), which operates in the spectral domain of the forward operator and learns a pointwise adaptive filter that reweights spectral coefficients according to the signal-to-noise ratio, thereby approximating the continuous inverse operator and guaranteeing discretization invariance. Its core innovation is uniting regularization theory with data-driven operator learning, giving the model both theoretical guarantees and strong generalization; for example, on 1D integral equations it achieves the minimax optimal convergence rate, learns interpretable sharp-cutoff filters, and exhibits zero-shot super-resolution reconstruction.
链接: https://arxiv.org/abs/2603.20602
作者: Hang-Cheng Dong,Pengcheng Cheng,Shuhuan Li
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 3 figures
Abstract:Solving ill-posed inverse problems necessitates effective regularization strategies to stabilize the inversion process against measurement noise. While classical methods like Tikhonov regularization require heuristic parameter tuning, and standard deep learning approaches often lack interpretability and generalization across resolutions, we propose SC-Net (Spectral Correction Network), a novel operator learning framework. SC-Net operates in the spectral domain of the forward operator, learning a pointwise adaptive filter function that reweights spectral coefficients based on the signal-to-noise ratio. We provide a theoretical analysis showing that SC-Net approximates the continuous inverse operator, guaranteeing discretization invariance. Numerical experiments on 1D integral equations demonstrate that SC-Net: (1) achieves the theoretical minimax optimal convergence rate ( O(\delta^0.5) for s=p=1.5 ), matching theoretical lower bounds; (2) learns interpretable sharp-cutoff filters that outperform Oracle Tikhonov regularization; and (3) exhibits zero-shot super-resolution, maintaining stable reconstruction errors ( \approx 0.23 ) when trained on coarse grids ( N=256 ) and tested on significantly finer grids (up to N=2048 ). The proposed method bridges the gap between rigorous regularization theory and data-driven operator learning.
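The spectral filtering idea can be made concrete with a filtered SVD reconstruction. The sketch below compares an unfiltered pseudo-inverse with a sharp-cutoff filter (the family of filters SC-Net reportedly learns) on a toy smoothing operator; the operator, threshold, and data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ill-posed toy forward operator: a discretized Gaussian smoothing kernel.
n = 64
t = np.linspace(0.0, 1.0, n)
A = np.exp(-80.0 * (t[:, None] - t[None, :]) ** 2) / n
U, s, Vt = np.linalg.svd(A)

x_true = np.sin(2 * np.pi * t)
delta = 1e-3                                  # observation noise level
y = A @ x_true + delta * rng.standard_normal(n)

def spectral_reconstruct(y, phi):
    """x_hat = sum_i phi(s_i)/s_i * <u_i, y> v_i: filtered SVD inversion."""
    return Vt.T @ (phi(s) * (U.T @ y) / s)

identity = lambda s: np.ones_like(s)                    # naive, unregularized inverse
cutoff = lambda s, tau=1e-2: (s > tau).astype(float)    # sharp spectral cutoff

err_naive = np.linalg.norm(spectral_reconstruct(y, identity) - x_true)
err_cut = np.linalg.norm(spectral_reconstruct(y, cutoff) - x_true)
```

The naive inverse amplifies noise by the reciprocal of the tiny trailing singular values, while the cutoff filter discards those modes and recovers the smooth signal; a learned filter would adapt the transition instead of fixing tau by hand.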
[AI-187] Shift-Invariant Feature Attribution in the Application of Wireless Electrocardiograms
【速读】: This paper addresses how to assign interpretable relevance scores to the input features of a machine learning model, improving explainability in biomedical settings, and in electrocardiogram (ECG) analysis in particular, so that medical experts can understand which cardiac phases or conditions a model's decision relies on. The central challenges are choosing a baseline that is physically meaningful and maps intuitively to the cardiac cycle, and designing a score aggregation so that relevance can be associated with specific cardiac phases (such as the P and T waves). The key to the solution is a shift-invariant baseline with a clear physiological interpretation, combined with phase-wise aggregation of saliency scores, which precisely localizes the contributions of ECG features and identifies the P- and T-wave segments as the most relevant to recognizing physical exertion.
链接: https://arxiv.org/abs/2603.20462
作者: Yalemzerf Getnet,Abiy Tasissa,Waltenegus Dargie
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Assigning relevance scores to the input features of a machine learning model makes it possible to measure the contributions of the features in achieving a correct outcome. It is regarded as one of the approaches to developing explainable models. For biomedical assignments, this is very useful for medical experts to comprehend machine-based decisions. In the analysis of electrocardiogram (ECG) signals, in particular, understanding which of the ECG samples or features contributed most to a given decision amounts to understanding the underlying cardiac phases or conditions the machine tries to explain. For the computation of relevance scores, determining the proper baseline is important. Moreover, the scores should have a distribution which is at once intuitive to interpret and easy to associate with the underlying cardiac reality. The purpose of this work is to achieve these goals. Specifically, we propose a shift-invariant baseline which has a physical significance in the analysis as well as interpretation of electrocardiogram measurements. Moreover, we aggregate significance scores in such a way that they can be mapped to cardiac phases. We demonstrate our approach by inferring physical exertion from cardiac exertion using a residual network. We show that the ECG samples which achieved the highest relevance scores (and, therefore, which contributed most to the accurate recognition of physical exertion) are those associated with the P and T waves. Index Terms: attribution, baseline, cardiovascular diseases, electrocardiogram, activity recognition, machine learning
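The phase-wise aggregation step can be sketched directly: per-sample relevance scores are summed inside labelled cardiac-phase windows. The segmentation boundaries and scores below are synthetic and purely illustrative, not derived from the paper's data or attribution method.

```python
import numpy as np

# Synthetic per-sample relevance for one 1-second beat at 250 Hz.
fs = 250
relevance = np.zeros(fs)
relevance[40:60] = 1.0        # high attribution around the "P wave"
relevance[170:200] = 0.8      # and around the "T wave"

# Toy phase segmentation in samples (illustrative boundaries).
phases = {"P": (30, 70), "QRS": (90, 130), "T": (160, 210)}

def aggregate_by_phase(rel, phases):
    """Map sample-level relevance onto named cardiac phases."""
    return {name: float(rel[a:b].sum()) for name, (a, b) in phases.items()}

scores = aggregate_by_phase(relevance, phases)
top_phase = max(scores, key=scores.get)
```

With real saliency maps, this kind of aggregation is what lets a clinician read "the P and T waves drove the decision" instead of a per-sample heatmap.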
[AI-188] Comprehensive Description of Uncertainty in Measurement for Representation and Propagation with Scalable Precision
【速读】: This paper addresses the limitations of the traditional Gaussian assumption for modeling uncertainty propagation in measurement systems: simple Gaussian distributions struggle to characterize uncertainty in complex multi-stage processes, causing multistage approximation losses and reduced accuracy. The key to the solution is the use of Gaussian Mixture Models (GMMs) as a computationally controllable, universal approximator of probability density functions (PDFs), which can flexibly trade approximation accuracy against computational complexity under finite memory while supporting closed-form solutions of essential operations in control and measurement tasks, thereby enabling more accurate and efficient representation and propagation of uncertainty.
链接: https://arxiv.org/abs/2603.20365
作者: Ali Darijani,Jürgen Beyerer,Zahra Sadat Hajseyed Nasrollah,Luisa Hoffmann,Michael Heizmann
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Probability theory has become the predominant framework for quantifying uncertainty across scientific and engineering disciplines, with a particular focus on measurement and control systems. However, the widespread reliance on simple Gaussian assumptions–particularly in control theory, manufacturing, and measurement systems–can result in incomplete representations and multistage lossy approximations of complex phenomena, including inaccurate propagation of uncertainty through multi-stage processes. This work proposes a comprehensive yet computationally tractable framework for representing and propagating quantitative attributes arising in measurement systems using Probability Density Functions (PDFs). Recognizing the constraints imposed by finite memory in software systems, we advocate for the use of Gaussian Mixture Models (GMMs), a principled extension of the familiar Gaussian framework, as they are universal approximators of PDFs whose complexity can be tuned to trade off approximation accuracy against memory and computation. From both mathematical and computational perspectives, GMMs enable high performance and, in many cases, closed-form solutions of essential operations in control and measurement. The paper presents practical applications within manufacturing and measurement contexts, especially the circular factory, demonstrating how the GMM framework supports accurate representation and propagation of measurement uncertainty and offers improved accuracy–compared to the traditional Gaussian framework–while keeping the computations tractable.
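One of the closed-form operations the abstract alludes to is easy to see for affine maps: each Gaussian component of a GMM pushes forward exactly, with no sampling or linearization loss. A minimal NumPy sketch with illustrative shapes and values:

```python
import numpy as np

def propagate_gmm_affine(weights, means, covs, A, b):
    """Push a GMM through y = A x + b: each component maps in closed form."""
    new_means = [A @ m + b for m in means]
    new_covs = [A @ S @ A.T for S in covs]
    return list(weights), new_means, new_covs

def gmm_mean(weights, means):
    """Overall mean of the mixture."""
    return sum(w * m for w, m in zip(weights, means))

weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2) * 0.1, np.eye(2) * 0.4]
A = np.array([[1.0, 2.0], [0.0, 1.0]])   # toy measurement-stage transform
b = np.array([0.5, -0.5])

w2, m2, c2 = propagate_gmm_affine(weights, means, covs, A, b)
```

By linearity, the mean of the pushed mixture equals A applied to the mixture mean plus b, while the multi-modal shape survives the stage, which is exactly what a single-Gaussian approximation would discard.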
[AI-189] Deciphering Scientific Reasoning Steps from Outcome Data for Molecule Optimization
【速读】: This paper addresses the missing supervision for reasoning steps in scientific discovery: experimental outcome data are abundant, but the intermediate reasoning is rarely documented, leaving generative AI without effective supervision signals for scientific tasks. The key to the solution is the DESRO framework, which analyzes shared patterns and key differences within grouped data and uses a large language model (LLM) to infer the implicit scientific reasoning backwards from experimental outcomes. Instantiated in molecule optimization, it groups molecules by shared structural fragments across 2.3 million molecular property records and has the LLM analyze how structural variations relate to property differences, constructing interpretable reasoning trajectories and training an optimization model that generalizes robustly to novel property combinations, unseen targets, and properties defined solely by natural language descriptions.
链接: https://arxiv.org/abs/2603.20262
作者: Zequn Liu,Kehan Wu,Shufang Xie,Zekun Guo,Wei Zhang,Tao Qin,Renhe Liu,Yingce Xia
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress, 37 pages
Abstract:Emerging reasoning models hold promise for automating scientific discovery. However, their training is hindered by a critical supervision gap: experimental outcomes are abundant, whereas intermediate reasoning steps are rarely documented at scale. To bridge this gap, we propose DESRO, a framework for deciphering scientific reasoning from outcomes. By analyzing shared patterns and key differences within grouped data, a large language model (LLM) can recover the underlying logic. We instantiate this framework in molecule optimization, a pivotal stage in drug discovery that traditionally relies on the iterative reasoning of medicinal chemists. Across 2.3 million molecular property records, our framework infers optimization rationales by grouping molecules with shared fragments, then using an LLM to analyze how structural variations correlate with property differences. Based on the derived data, we train a model that conducts molecule optimization through an interpretable reasoning process. DESRO achieves the highest success rates on 15 out of 18 tasks, spanning both single- and multi-property optimization of bioactivity and ADMET properties. The reasoning process enables robust generalization to out-of-distribution scenarios, including novel property combinations, unseen biological targets, and unseen properties defined solely by natural language descriptions. In retrospective case studies under strict temporal splits, the model autonomously reconstructs expert-level lead optimization trajectories. Additionally, our framework extends beyond molecule optimization to reaction ligand selection. Our results establish deciphering reasoning steps from outcome data as a viable paradigm for enabling scientific reasoning, providing a scalable approach to accelerate scientific discovery.
[AI-190] The Deep-Match Framework for Event-Related Potential Detection in EEG
【速读】: This paper addresses the reliability of single-trial event-related potential (ERP) detection, which is hampered by the low signal-to-noise ratio of electroencephalogram (EEG) signals. The key to the solution is injecting ERP templates as prior knowledge into the deep learning model: within the Deep-Match framework, convolution kernels are initialized with ERP information (the Deep-MF model), improving robustness to inter-subject variability and detection performance. Experiments show the approach surpasses a standard randomly initialized model in average F1-score (0.37 vs. 0.34), and its best score clearly exceeds the baseline's (0.71 vs. 0.59), confirming that embedding domain knowledge improves subject-independent EEG analysis.
链接: https://arxiv.org/abs/2603.20258
作者: Marek Zylinski,Bartosz Tomasz Smigielski,Gerard Cybulski
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reliable detection of event-related potentials (ERPs) at the single-trial level remains a major challenge due to the low signal-to-noise ratio of EEG recordings. In this work, we investigate whether incorporating prior knowledge about ERP templates into deep learning models can improve detection performance. We employ the Deep-Match framework for ERP detection using multi-channel EEG signals. The model is trained in two stages. First, an encoder-decoder architecture is trained to reconstruct input EEG signals, enabling the network to learn compact signal representations. In the second stage, the decoder is replaced with a detection module, and the network is fine-tuned for ERP identification. Two model variants are evaluated: a standard model with randomly initialized filters and a Deep-MF model in which input kernels are initialized using ERP templates. Model performance is assessed on a single-trial ERP detection task using leave-one-subject-out validation. The proposed Deep-MF model slightly outperforms the detector with standard kernel initialization for most held-out subjects. Despite substantial inter-subject variability, Deep-MF achieves a higher average F1-score (0.37) compared to the standard network (0.34), indicating improved robustness to cross-subject differences. The best performance obtained by Deep-MF reaches an F1-score of 0.71, exceeding the maximum score achieved by the standard model (0.59). These results demonstrate that ERP-informed kernel initialization can provide consistent improvements in subject-independent single-trial ERP detection. Overall, the findings highlight the potential of integrating domain knowledge with deep learning architectures for EEG analysis. The proposed approach represents a step toward practical wearable EEG and passive brain-computer interface systems capable of real-time monitoring of cognitive processes.
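The intuition behind ERP-informed kernel initialization is the matched filter: a kernel shaped like the expected template responds maximally where the template occurs. A toy NumPy sketch with a synthetic template and trial (not the paper's data, channels, or network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "ERP template": a smooth bump on a 50-sample window.
t = np.linspace(-1.0, 1.0, 50)
bump = np.exp(-t**2 / 0.05)
kernel = bump - bump.mean()          # zero-mean, template-shaped kernel

# Synthetic single trial: the template embedded at sample 200 in noise.
trial = 0.3 * rng.standard_normal(500)
trial[200:250] += bump

# A conv layer initialized with `kernel` computes exactly this cross-correlation.
response = np.correlate(trial, kernel, mode="valid")
peak = int(np.argmax(response))
```

Fine-tuning would then adapt such kernels to each dataset, but they start from a filter that already points at the ERP rather than at random structure.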
[AI-191] Developing Machine Learning-Based Watch-to-Warning Severe Weather Guidance from the Warn-on-Forecast System
【速读】: This paper investigates how machine learning (ML) post-processing of convection-allowing model (CAM) output can improve probabilistic forecasts of severe weather hazards (large hail, damaging winds, and tornadoes) over the next 2-6 hours. The key is a grid-based ML framework trained on Warn-on-Forecast System (WoFS) ensemble forecasts valid every 5 minutes out to 6 hours, using two models, a histogram gradient-boosted tree (HGBT) and a deep learning U-Net, to produce probabilistic guidance akin to Storm Prediction Center outlooks (the probability of a given severe weather event within 36 km of each grid point). Both the HGBT and the U-Net outperform a calibrated baseline built from 2-5 km updraft helicity; the HGBT achieves the best performance metrics, while the U-Net forecasts higher probabilities (up to 100%) with spatially smoother fields, confirming the effectiveness of ML-based CAM post-processing for short-term severe weather guidance.
链接: https://arxiv.org/abs/2603.20250
作者: Montgomery Flora,Samuel Varga,Corey Potvin,Noah Lang
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
备注: 28 pages, 7 figures
Abstract:While machine learning (ML) post-processing of convection-allowing model (CAM) output for severe weather hazards (large hail, damaging winds, and/or tornadoes) has shown promise for very short lead times (0-3 hours), its application to slightly longer forecast windows remains relatively underexplored. In this study, we develop and evaluate a grid-based ML framework to predict the probability of severe weather hazards over the next 2-6 hours using forecast output from the Warn-on-Forecast System (WoFS). Our dataset includes WoFS ensemble forecasts valid every 5 minutes out to 6 hours from 108 days during the 2019–2023 NOAA Hazardous Weather Testbed Spring Forecasting Experiments. We train ML models to generate probabilistic forecasts of severe weather akin to Storm Prediction Center outlooks (i.e., likelihood of a tornado, severe wind, or severe hail event within 36 km of each point). We compare a histogram gradient-boosted tree (HGBT) model and a deep learning U-Net approach against a carefully calibrated baseline generated from 2-5 km updraft helicity. Results indicate that the HGBT and U-Net outperform the baseline, particularly at higher probability thresholds. The HGBT achieves the best performance metrics, but predicted probabilities cap at 60% while the U-net forecasts extend to 100%. Similar to previous studies, the U-Net produces spatially smoother guidance than the tree-based method. These findings add to the growing evidence of the effectiveness of ML-based CAM post-processing for providing short-term severe weather guidance.
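Constructing outlook-style targets of the form "an event within 36 km of each grid point" amounts to a disc-shaped dilation of the event mask. Below is a toy NumPy version on an idealized grid; the study's actual grid spacing and target construction may differ, and the radius here is shrunk for the small example.

```python
import numpy as np

def neighborhood_event(events, radius):
    """1 where any event occurs within `radius` grid cells (disc dilation)."""
    n, m = events.shape
    padded = np.zeros((n + 2 * radius, m + 2 * radius), dtype=events.dtype)
    padded[radius:radius + n, radius:radius + m] = events
    out = np.zeros_like(events)
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            if di * di + dj * dj <= radius * radius:
                shifted = padded[radius + di:radius + di + n,
                                 radius + dj:radius + dj + m]
                out = np.maximum(out, shifted)
    return out

# On a 3-km grid, 36 km would correspond to a 12-cell radius; the toy grid
# below uses a 4-cell radius so the disc fits comfortably.
events = np.zeros((32, 32), dtype=int)
events[10, 10] = 1
target = neighborhood_event(events, radius=4)
```

Each positive cell is expanded into a disc, so a grid point is labelled 1 exactly when some report falls within the radius, which is the binary target a probabilistic model is then calibrated against.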
[AI-192] REMI: Reconstructing Episodic Memory During Internally Driven Path Planning
【速读】: This paper examines how the interacting circuits of the hippocampus (HC) and medial entorhinal cortex (MEC) support cue-driven path planning and episodic reconstruction. The core challenge is understanding how grid cells and place cells cooperate not only to encode space but also, under internal drive, to retrieve goals, plan paths, and reconstruct sensory experience along planned routes. The key to the solution is a system-level theory: place cells autoassociate sensory inputs with grid-cell patterns, so that sensory cues can trigger recall of the goal location's grid pattern; grid-based planning permits shortcuts through unvisited regions and generalizes local transitions into long-range paths; and during planning, intermediate grid states further trigger place-cell pattern completion, reconstructing the sensory experience along the route. The mechanism is validated with a single-layer RNN model of the HC-MEC loop plus a planning subnetwork, in both the biologically grounded RatatouGym navigation simulation and visually realistic navigation tasks in Habitat Sim.
链接: https://arxiv.org/abs/2507.02064
作者: Zhaoze Wang,Genela Morris,Dori Derdikman,Pratik Chaudhari,Vijay Balasubramanian
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:
Abstract:Grid cells in the medial entorhinal cortex (MEC) and place cells in the hippocampus (HC) both form spatial representations. Grid cells fire in triangular grid patterns, while place cells fire at specific locations and respond to contextual cues. How do these interacting systems support not only spatial encoding but also internally driven path planning, such as navigating to locations recalled from cues? Here, we propose a system-level theory of MEC-HC wiring that explains how grid and place cell patterns could be connected to enable cue-triggered goal retrieval, path planning, and reconstruction of sensory experience along planned routes. We suggest that place cells autoassociate sensory inputs with grid cell patterns, allowing sensory cues to trigger recall of goal-location grid patterns. We show analytically that grid-based planning permits shortcuts through unvisited locations and generalizes local transitions to long-range paths. During planning, intermediate grid states trigger place cell pattern completion, reconstructing sensory experiences along the route. Using a single-layer RNN modeling the HC-MEC loop with a planning subnetwork, we demonstrate these effects in both biologically grounded navigation simulations using RatatouGym and visually realistic navigation tasks using Habitat Sim.
机器学习
[LG-0] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
链接: https://arxiv.org/abs/2603.22276
作者: Alexandra Zelenin,Alexandra Zhuravlyova
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages, 15 figures, 15 tables, including appendices. Code and data at this https URL
Abstract:Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module’s norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT’s DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
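The factored row-norm identity is easy to verify numerically. The sketch below is one reading of the decomposition the abstract describes (base, cross, and Gram terms built from W@A.T and the Gram matrix A@A.T), written in plain NumPy rather than fused Triton kernels, with toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def dora_row_norms_factored(W, B, A, s):
    """Row-wise L2 norms of W + s*B@A without materializing the dense product.

    ||w_i + s (BA)_i||^2 = ||w_i||^2 + 2s <w_i, (BA)_i> + s^2 ||(BA)_i||^2,
    using only O(d_out*r + r^2) intermediates: W @ A.T and the Gram A @ A.T.
    """
    base = (W * W).sum(axis=1)                   # ||w_i||^2
    WAt = W @ A.T                                # [d_out, r] intermediate
    cross = 2.0 * s * (WAt * B).sum(axis=1)      # 2s * (W A^T B^T)_ii
    G = A @ A.T                                  # [r, r] Gram matrix
    gram = (s * s) * ((B @ G) * B).sum(axis=1)   # s^2 * (B A A^T B^T)_ii
    return np.sqrt(base + cross + gram)

d_out, d_in, r, s = 64, 128, 8, 0.5
W = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

factored = dora_row_norms_factored(W, B, A, s)
dense = np.linalg.norm(W + s * (B @ A), axis=1)   # reference: materializes BA
```

The two agree to machine precision while the factored path never allocates a [d_out, d_in] buffer, which is the memory saving the paper exploits at high rank.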
[LG-1] Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
链接: https://arxiv.org/abs/2603.22273
作者: Zakaria Mhammedi,James Cohan
类目: Machine Learning (cs.LG)
*备注:
Abstract:The process of discovery requires active exploration – the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma’s Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.
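The decoupled-exploration idea can be caricatured in a few lines: grow a trajectory tree by repeatedly expanding the node with the highest epistemic-uncertainty proxy, with no policy optimization anywhere. The toy chain environment and count-based uncertainty below are purely illustrative, not the paper's algorithm or benchmarks.

```python
from collections import defaultdict

# Toy deterministic chain: states 0..N, two actions (step left / step right).
N = 30
def step(state, go_right):
    return max(0, min(N, state + (1 if go_right else -1)))

tree = {0: None}                  # state -> parent in the trajectory tree
visits = defaultdict(int)
visits[0] = 1

def uncertainty(s):
    return 1.0 / (1 + visits[s])  # count-based epistemic-uncertainty proxy

for _ in range(3 * N):
    s = max(tree, key=uncertainty)        # expand the most uncertain node
    visits[s] += 1
    for go_right in (False, True):        # try both actions from it
        s2 = step(s, go_right)
        visits[s2] += 1
        if s2 not in tree:
            tree[s2] = s                  # keep the first trajectory reaching s2

def trajectory(s):
    """Reconstruct the discovered path root -> s from parent pointers."""
    path = []
    while s is not None:
        path.append(s)
        s = tree[s]
    return path[::-1]

coverage = len(tree) / (N + 1)
```

The recorded parent pointers reconstruct a trajectory to any discovered state, which is the kind of data a supervised backward-learning stage could then distill into a deployable policy.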
[LG-2] Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
链接: https://arxiv.org/abs/2603.22219
作者: Qilin Wang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model’s robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
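Because the generator and the titrated noise level are known exactly, forecasts can be scored with exact Gaussian negative log-likelihoods rather than heuristic approximations. A toy version with a logistic map, illustrating the idea rather than the paper's benchmark suite:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, r=3.9, x0=0.5, sigma=0.05):
    """Logistic-map latent dynamics with titrated Gaussian observation noise."""
    latent, obs = np.empty(n), np.empty(n)
    x = x0
    for i in range(n):
        latent[i] = x
        obs[i] = x + sigma * rng.standard_normal()
        x = r * x * (1.0 - x)
    return latent, obs

def gaussian_nll(y, mu, sigma):
    """Exact per-point NLL under the known noise model N(mu, sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * sigma**2) + (y - mu) ** 2 / (2.0 * sigma**2)

sigma = 0.05
latent, obs = simulate(500, sigma=sigma)

# An oracle that knows the latent map sits at the entropy floor; a naive
# persistence forecast (predict y_{t-1}) is sharply penalized.
oracle_nll = gaussian_nll(obs, latent, sigma).mean()
persistence_nll = gaussian_nll(obs[1:], obs[:-1], sigma).mean()
```

Any forecaster's mean predictions can be scored the same way, and because sigma is known, the gap to the oracle is an exact measure of how much of the recoverable structure the model misses.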
[LG-3] Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
链接: https://arxiv.org/abs/2603.22206
作者: Kangqi Ni,Wenyue Hua,Xiaoxiang Shi,Jiang Guo,Shiyu Chang,Tianlong Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of the context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2-2.4x and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM.
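The scheduling signal described, semantic confidence traded off against congestion predicted from in-flight token volumes, can be sketched as a simple scoring rule. All names, values, and the linear trade-off below are illustrative assumptions, not Chimera's actual policy or API.

```python
def route(request_scores, inflight_tokens, throughput, lam=0.01):
    """Pick the model maximizing semantic confidence minus predicted queueing delay.

    request_scores: per-model routing confidence for this request.
    inflight_tokens: predicted output tokens already queued per model.
    throughput: sustained tokens/s per model.
    """
    def score(m):
        congestion = inflight_tokens[m] / throughput[m]   # est. wait in seconds
        return request_scores[m] - lam * congestion
    return max(request_scores, key=score)

conf = {"small-8b": 0.62, "large-70b": 0.90}
queued = {"small-8b": 500.0, "large-70b": 40000.0}
tps = {"small-8b": 400.0, "large-70b": 60.0}
choice = route(conf, queued, tps)
```

When the stronger model's queue is deep, the rule spills easy requests to the faster small model, and it reverts to the stronger model once its queue drains; the weight lam sets where that crossover sits.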
[LG-4] Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?
链接: https://arxiv.org/abs/2603.22184
作者: Oscar Novo,Oscar Bastidas-Jossa,Alberto Calvo,Antonio Peris,Carlos Kuchkovsky
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Submitted to Quantum Machine Intelligence
Abstract:Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve. In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents -representing an improvement of more than 20% over zero-shot general-purpose performance and more than 35% over the parameter-specialized baseline. Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning, instead relying on inference-time augmentation, thereby enabling a more flexible and maintainable approach to LLM-assisted quantum software development. 
[LG-5] Causal Evidence that Language Models use Confidence to Drive Behavior
链接: https://arxiv.org/abs/2603.22161
作者: Dharshan Kumaran,Nathaniel Daw,Simon Osindero,Petar Velickovic,Viorica Patraucean
类目: Machine Learning (cs.LG)
*备注:
Abstract:Metacognition – the ability to assess one’s own cognitive performance – is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention task. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. These findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.
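The two-stage mechanism the abstract describes (an internal confidence estimate compared against an implicit threshold) can be sketched in a few lines. The function name and threshold value below are illustrative, not taken from the paper:

```python
def abstain_policy(confidence: float, threshold: float = 0.6) -> str:
    """Two-stage metacognitive control (sketch): answer only when the
    internal confidence estimate clears a threshold; otherwise abstain.
    Raising the threshold models a more cautious instructed policy,
    as in Phase 4."""
    return "answer" if confidence >= threshold else "abstain"

# Confidence is the dominant predictor of the answer/abstain decision.
decisions = [abstain_policy(c) for c in (0.9, 0.55, 0.61)]
```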
[LG-6] RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation
链接: https://arxiv.org/abs/2603.22155
作者: Abolfazl Hashemi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+, which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable O(1/k) convergence guarantees for a range of problems including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Furthermore, we introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.
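For intuition, here is the classical EG update next to a randomized-midpoint variant in the spirit of RAMPAGE. This is a minimal sketch: the paper's exact update rule, step sizes, and variance-reduced RAMPAGE+ are not reproduced here:

```python
import random

def extragradient_step(x, F, eta):
    """Classical EG: extrapolate to a midpoint, then update using the
    field evaluated there (a discrete-time integration scheme)."""
    x_mid = x - eta * F(x)
    return x - eta * F(x_mid)

def randomized_midpoint_step(x, F, eta, rng):
    """Randomized-midpoint sketch: evaluate the field at a uniformly
    random point along the extrapolation segment, the idea used to
    debias the discretization (illustrative only)."""
    tau = rng.random()  # tau ~ U[0, 1]
    x_mid = x - tau * eta * F(x)
    return x - eta * F(x_mid)

F = lambda v: v  # simple monotone field with root at 0
x = 1.0
rng = random.Random(0)
for _ in range(100):
    x = randomized_midpoint_step(x, F, 0.1, rng)
```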
[LG-7] Computationally lightweight classifiers with frequentist bounds on predictions AISTATS2026
链接: https://arxiv.org/abs/2603.22128
作者: Shreeram Murali,Cristian R. Rojas,Dominik Baumann
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages, references, checklist, and appendix. Total 23 pages. Accepted to AISTATS2026
Abstract:While both classical and neural network classifiers can achieve high accuracy, they fall short on offering uncertainty bounds on their predictions, making them unfit for safety-critical applications. Existing kernel-based classifiers that provide such bounds scale as roughly O(n^3) in time, making them computationally intractable for large datasets. To address this, we propose a novel, computationally efficient classification algorithm based on the Nadaraya-Watson estimator, for whose estimates we derive frequentist uncertainty intervals. We evaluate our classifier on synthetically generated data and on electrocardiographic heartbeat signals from the MIT-BIH Arrhythmia database. We show that the method achieves competitive accuracy of 96% at O(n) and O(log n) operations, while providing actionable uncertainty bounds. These bounds can, e.g., aid in flagging low-confidence predictions, making them suitable for real-time settings with resource constraints, such as diagnostic monitoring or implantable devices.
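A bare-bones Nadaraya-Watson classifier for scalar inputs illustrates the estimator the method builds on. The paper's computational shortcuts and frequentist intervals are not shown; all names and parameter values here are ours:

```python
import math

def nw_classify(x, X_train, y_train, bandwidth=0.5):
    """Nadaraya-Watson estimate of P(y = 1 | x): a Gaussian-kernel
    weighted average of the binary training labels (illustrative)."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))
               for xi in X_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

# Two well-separated clusters, one per class.
X = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
y = [0, 0, 0, 1, 1, 1]
p_low = nw_classify(0.05, X, y)   # query near the class-0 cluster
p_high = nw_classify(1.15, X, y)  # query near the class-1 cluster
```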
[LG-8] MIHT: A Hoeffding Tree for Time Series Classification using Multiple Instance Learning
链接: https://arxiv.org/abs/2603.22074
作者: Aurora Esteban,Amelia Zafra,Sebastián Ventura
类目: Machine Learning (cs.LG)
*备注:
Abstract:Due to the prevalence of temporal data and its inherent dependencies in many real-world problems, time series classification is of paramount importance in various domains. However, existing models often struggle with series of variable length or high dimensionality. This paper introduces the MIHT (Multi-instance Hoeffding Tree) algorithm, an efficient model that uses multi-instance learning to classify multivariate and variable-length time series while providing interpretable results. The algorithm uses a novel representation of time series as “bags of subseries,” together with an optimization process based on incremental decision trees that distinguish relevant parts of the series from noise. This methodology extracts the underlying concept of series with multiple variables and variable lengths. The generated decision tree is a compact, white-box representation of the series’ concept, providing interpretability insights into the most relevant variables and segments of the series. Experimental results demonstrate MIHT’s superiority, as it outperforms 11 state-of-the-art time series classification models on 28 public datasets, including high-dimensional ones. MIHT offers enhanced accuracy and interpretability, making it a promising solution for handling complex, dynamic time series data.
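The "bag of subseries" representation reduces to a sliding-window transform over each series; window and stride values below are arbitrary examples, not the paper's settings:

```python
def series_to_bag(series, window, stride):
    """Represent a (possibly variable-length) time series as a bag of
    fixed-length subseries, the multi-instance view used by MIHT
    (sketch; parameter names are illustrative)."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]

bag = series_to_bag([1, 2, 3, 4, 5, 6], window=3, stride=2)
```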
[LG-9] AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference ICASSP2026
链接: https://arxiv.org/abs/2603.22053
作者: Risa Shinoda,Kaede Shiohara,Nakamasa Inoue,Hiroaki Santo,Fumio Okura
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: ICASSP 2026
Abstract:Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at this https URL.
[LG-10] RAFL: Generalizable Sim-to-Real of Soft Robots with Residual Acceleration Field Learning
链接: https://arxiv.org/abs/2603.22039
作者: Dong Heon Cho,Boyuan Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Differentiable simulators enable gradient-based optimization of soft robots over material parameters, control, and morphology, but accurately modeling real systems remains challenging due to the sim-to-real gap. This issue becomes more pronounced when geometry is itself a design variable. System identification reduces discrepancies by fitting global material parameters to data; however, when constitutive models are misspecified or observations are sparse, identified parameters often absorb geometry-dependent effects rather than reflect intrinsic material behavior. More expressive constitutive models can improve accuracy but substantially increase computational cost, limiting practicality. We propose a residual acceleration field learning (RAFL) framework that augments a base simulator with a transferable, element-level corrective dynamics field. Operating on shared local features, the model is agnostic to global mesh topology and discretization. Trained end-to-end through a differentiable simulator using sparse marker observations, the learned residual generalizes across shapes. In both sim-to-sim and sim-to-real experiments, our method achieves consistent zero-shot improvements on unseen morphologies, while system identification frequently exhibits negative transfer. The framework also supports continual refinement, enabling simulation accuracy to accumulate during morphology optimization.
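The core idea, a base simulator augmented by an element-level learned residual over shared local features, reduces to a sketch like the following. The residual function here is a hypothetical stand-in for the trained network:

```python
def corrected_acceleration(a_sim, local_features, residual_model):
    """RAFL-style correction (sketch): the base simulator's per-element
    acceleration is augmented by a learned residual that depends only
    on shared local features, so it transfers across mesh topologies."""
    return [a + residual_model(f) for a, f in zip(a_sim, local_features)]

# Hypothetical residual: a linear correction standing in for the
# trained corrective dynamics field.
residual = lambda f: 0.1 * f
a = corrected_acceleration([1.0, -2.0], [0.5, 1.0], residual)
```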
[LG-11] On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors AISTATS
链接: https://arxiv.org/abs/2603.22030
作者: Julius Kobialka,Emanuel Sommer,Chris Kolb,Juntae Kwon,Daniel Dold,David Rügamer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
Abstract:Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.
[LG-12] Do Papers Match Code? A Benchmark and Framework for Paper-Code Consistency Detection in Bioinformatics Software
链接: https://arxiv.org/abs/2603.22018
作者: Tianxiang Xu,Xiaoyan Zhu,Xin Lai,Sizhe Dang,Xin Lian,Hangyu Cheng,Jiayin Wang
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 12 pages, 2 figures
Abstract:Ensuring consistency between research papers and their corresponding software implementations is fundamental to software reliability and scientific reproducibility. However, this problem remains underexplored, particularly in the domain of bioinformatics, where discrepancies between methodological descriptions in papers and their actual code implementations are prevalent. To address this gap, this paper introduces a new task, namely paper-code consistency detection, and curates a collection of 48 bioinformatics software projects along with their associated publications. We systematically align sentence-level algorithmic descriptions from papers with function-level code snippets. Combined with expert annotations and a hybrid negative sampling strategy, we construct the first benchmark dataset in the bioinformatics domain tailored to this task, termed BioCon. Based on this benchmark, we further propose a cross-modal consistency detection framework designed to model the semantic relationships between natural language descriptions and code implementations. The framework adopts a unified input representation and leverages pre-trained models to capture deep semantic alignment between papers and code. To mitigate the effects of class imbalance and hard samples, we incorporate a weighted focal loss to enhance model robustness. Experimental results demonstrate that our framework effectively identifies consistency between papers and code in bioinformatics, achieving an accuracy of 0.9056 and an F1 score of 0.8011. Overall, this study opens a new research direction for paper-code consistency analysis and lays the foundation for automated reproducibility assessment and cross-modal understanding in scientific software.
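The weighted focal loss used to mitigate class imbalance and hard samples is a standard construction; a minimal sketch follows (alpha and gamma are the common defaults, not necessarily the paper's):

```python
import math

def weighted_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Weighted focal loss for one binary example: the (1 - p_t)^gamma
    factor down-weights easy examples, and alpha re-balances classes.
    With alpha = 1 and gamma = 0 it reduces to cross-entropy."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

easy = weighted_focal_loss(0.95, 1)  # confident, correct -> tiny loss
hard = weighted_focal_loss(0.30, 1)  # misclassified -> much larger loss
```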
[LG-13] AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing
链接: https://arxiv.org/abs/2603.22017
作者: Peter Pak,Amir Barati Farimani
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work presents AdditiveLLM2, a multi-modal, domain-adapted large language model built upon the instruction-tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain-adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark, which consists of additive manufacturing domain-specific tasks compiled from published resources. AdditiveLLM2 exhibits proficiency in both language and vision based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain-adaptive pretraining and instruction tuning strategy outlines an accessible method for specializing large language models to a domain such as additive manufacturing.
[LG-14] CRPS-Optimal Binning for Conformal Regression
链接: https://arxiv.org/abs/2603.22000
作者: Paolo Toccaceli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 11 figures
Abstract:We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with O(n^2 log n) precomputation and O(n^2) storage; the globally optimal K-partition is recovered by a dynamic programme in O(n^2 K) time. Minimisation of within-sample LOO-CRPS turns out to be inappropriate for selecting K, as it results in in-sample optimism. So we instead select K by evaluating test CRPS on an alternating held-out split, which yields a U-shaped criterion with a well-defined minimum. Having selected K* and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level ε. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, and CQR-QRF), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
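The dynamic programme that recovers the globally optimal contiguous K-partition is the classic O(n^2 K) recursion; here is a sketch with a toy sum-of-squares bin cost standing in for the paper's closed-form LOO-CRPS cost:

```python
def best_partition(cost, n, K):
    """dp[k][j] = minimal total cost of splitting the first j points
    into k contiguous bins, where cost(i, j) scores the bin covering
    points i..j-1. O(n^2 K) time, as in the abstract."""
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(K + 1)]
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + cost(i, j)
                if c < dp[k][j]:
                    dp[k][j] = c
    return dp[K][n]

# Toy cost: within-bin sum of squared deviations from the bin mean.
data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
def sse(i, j):
    seg = data[i:j]
    m = sum(seg) / len(seg)
    return sum((x - m) ** 2 for x in seg)

total = best_partition(sse, len(data), 2)  # split between the clusters
```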
[LG-15] BOOST-RPF: Boosted Sequential Trees for Radial Power Flow
链接: https://arxiv.org/abs/2603.21977
作者: Ehimare Okoyomon,Christoph Goebel
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Accurate power flow analysis is critical for modern distribution systems, yet classical solvers face scalability issues, and current machine learning models often struggle with generalization. We introduce BOOST-RPF, a novel method that reformulates voltage prediction from a global graph regression task into a sequential path-based learning problem. By decomposing radial networks into root-to-leaf paths, we leverage gradient-boosted decision trees (XGBoost) to model local voltage-drop regularities. We evaluate three architectural variants: Absolute Voltage, Parent Residual, and Physics-Informed Residual. This approach aligns the model architecture with the recursive physics of power flow, ensuring size-agnostic application and superior out-of-distribution robustness. Benchmarked against the Kerber Dorfnetz grid and the ENGAGE suite, BOOST-RPF achieves state-of-the-art results with its Parent Residual variant, which consistently outperforms both analytical and neural baselines in standard accuracy and generalization tasks. While global Multi-Layer Perceptrons (MLPs) and Graph Neural Networks (GNNs) often suffer from performance degradation under topological shifts, BOOST-RPF maintains high precision across unseen feeders. Furthermore, the framework displays linear O(N) computational scaling and significantly increased sample efficiency through per-edge supervision, offering a scalable and generalizable alternative for real-time distribution system operator (DSO) applications.
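The Parent Residual idea, predicting each node's voltage as its parent's voltage plus a learned local drop along a root-to-leaf path, can be sketched as follows. The drop model below is a hypothetical stand-in for the trained XGBoost regressor:

```python
def predict_path_voltages(v_root, drop_model, path_features):
    """Parent Residual variant (sketch): walk one root-to-leaf path and
    accumulate a learned per-edge voltage drop, mirroring the recursive
    physics of radial power flow. Works for any path length, which is
    what makes the approach size-agnostic."""
    voltages = [v_root]
    for feats in path_features:
        voltages.append(voltages[-1] + drop_model(feats))
    return voltages

# Hypothetical drop model: drop proportional to an edge-loading feature.
drop = lambda f: -0.01 * f
v = predict_path_voltages(1.0, drop, [1.0, 2.0, 0.5])
```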
[LG-16] A Novel Method for Enforcing Exactly Dirichlet, Neumann and Robin Conditions on Curved Domain Boundaries for Physics Informed Machine Learning
链接: https://arxiv.org/abs/2603.21909
作者: Suchuan Dong,Yuchuan Zhang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 42 pages, 9 figures, 7 tables
Abstract:We present a systematic method for exactly enforcing Dirichlet, Neumann, and Robin type conditions on general quadrilateral domains with arbitrary curved boundaries. Our method is built upon exact mappings between general quadrilateral domains and the standard domain, and employs a combination of TFC (theory of functional connections) constrained expressions and transfinite interpolations. When Neumann or Robin boundaries are present, especially when two Neumann (or Robin) boundaries meet at a vertex, it is critical to enforce exactly the induced compatibility constraints at the intersection, in order to enforce exactly the imposed conditions on the joining boundaries. We analyze in detail and present constructions for handling the imposed boundary conditions and the induced compatibility constraints for two types of situations: (i) when Neumann (or Robin) boundary only intersects with Dirichlet boundaries, and (ii) when two Neumann (or Robin) boundaries intersect with each other. We describe a four-step procedure to systematically formulate the general form of functions that exactly satisfy the imposed Dirichlet, Neumann, or Robin conditions on general quadrilateral domains. The method developed herein has been implemented together with the extreme learning machine (ELM) technique we have developed recently for scientific machine learning. Ample numerical experiments are presented with several linear/nonlinear stationary/dynamic problems on a variety of two-dimensional domains with complex boundary geometries. Simulation results demonstrate that the proposed method has enforced the Dirichlet, Neumann, and Robin conditions on curved domain boundaries exactly, with the numerical boundary-condition errors at the machine accuracy.
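The notion of a constrained expression, a functional form that satisfies the imposed boundary conditions exactly for any embedded free function, is easiest to see in one dimension. This is only a 1-D sketch; the paper's constructions handle curved 2-D quadrilateral domains and Neumann/Robin conditions:

```python
def constrained_expression(free_fn, a, b, ua, ub):
    """TFC-style constrained expression on [a, b] (1-D sketch): for ANY
    free function g, the returned u satisfies u(a) = ua and u(b) = ub
    exactly, so a learner can optimize g without ever violating the
    Dirichlet data."""
    def u(x):
        t = (x - a) / (b - a)
        g = free_fn
        # Linear interpolant of the boundary data, plus the free
        # function minus its own linear interpolant (which vanishes
        # at both endpoints).
        return (1 - t) * ua + t * ub + (g(x) - (1 - t) * g(a) - t * g(b))
    return u

u = constrained_expression(lambda x: x ** 3, 0.0, 1.0, ua=2.0, ub=-1.0)
```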
[LG-17] SparseDVFS: Sparse-Aware DVFS for Energy-Efficient Edge Inference
链接: https://arxiv.org/abs/2603.21908
作者: Ziyang Zhang,Zheshun Wu,Jie Liu,Luca Mottola
类目: Machine Learning (cs.LG)
*备注: 14 pages, 19 figures, 3 tables
Abstract:Deploying deep neural networks (DNNs) on power-sensitive edge devices presents a formidable challenge. While Dynamic Voltage and Frequency Scaling (DVFS) is widely employed for energy optimization, traditional model-level scaling is often too coarse to capture intra-inference variations, whereas fine-grained operator-level scaling suffers from prohibitive performance degradation due to significant hardware switching latency. This paper presents SparseDVFS, a fine-grained, sparse-aware DVFS framework designed for energy-efficient edge inference. Our key insight is that operator sparsity is a primary metric for hardware frequency modulation. By distinguishing between compute-bound dense operators and memory-bound sparse operators, the system can apply specialized frequency triplets to maximize energy efficiency. To overcome switching overheads and component interference, SparseDVFS incorporates three key innovations: (1) an offline modeler that establishes a deterministic mapping between operator sparsity and optimal frequency triplets (CPU/GPU/EMC) via white-box timeline analysis; (2) a runtime graph partitioner that utilizes a greedy merging heuristic to aggregate operators into super-blocks, balancing scaling granularity and DVFS switching latency through a latency amortization constraint; and (3) a unified co-governor that employs a frequency unified scaling engine (FUSE) and a look-ahead instruction queue to eliminate antagonistic effects between independent controllers and hide hardware transition latencies. Extensive evaluations show that SparseDVFS achieves an average 78.17% energy efficiency gain over state-of-the-art solutions while maintaining a superior 14% cost-gain ratio.
[LG-18] Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
链接: https://arxiv.org/abs/2603.21862
作者: Weilin Wan,Jingtao Han,Weizhong Zhang,Cheng Jin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
[LG-19] All elementary functions from a single binary operator
链接: https://arxiv.org/abs/2603.21852
作者: Andrzej Odrzywołek
类目: Symbolic Computation (cs.SC); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, Supplementary Information, code available at this https URL
Abstract:A single two-input gate suffices for all of Boolean logic in digital hardware. No comparable primitive has been known for continuous mathematics: computing elementary functions such as sin, cos, sqrt, and log has always required multiple distinct operations. Here I show that a single binary operator, eml(x,y) = exp(x) - ln(y), together with the constant 1, generates the standard repertoire of a scientific calculator. This includes constants such as e, π, and i; arithmetic operations including +, -, ×, /, and exponentiation; as well as the usual transcendental and algebraic functions. For example, e^x = eml(x, 1), ln x = eml(1, eml(eml(1, x), 1)), and likewise for all other operations. That such an operator exists was not anticipated; I found it by systematic exhaustive search and established constructively that it suffices for the concrete scientific-calculator basis. In EML (Exp-Minus-Log) form, every such expression becomes a binary tree of identical nodes, yielding a grammar as simple as S → 1 | eml(S, S). This uniform structure also enables gradient-based symbolic regression: using EML trees as trainable circuits with standard optimizers (Adam), I demonstrate the feasibility of exact recovery of closed-form elementary functions from numerical data at shallow tree depths up to 4. The same architecture can fit arbitrary data, but when the generating law is elementary, it may recover the exact formula.
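The two constructions quoted in the abstract can be checked numerically in a few lines (valid for arguments within the operator's domain, i.e. second argument positive and exponents that do not overflow):

```python
import math

def eml(x, y):
    """The single binary operator: eml(x, y) = exp(x) - ln(y)."""
    return math.exp(x) - math.log(y)

# The abstract's constructions, using only eml and the constant 1.
def exp_(x):
    return eml(x, 1)                  # exp(x) - ln(1) = exp(x)

def ln_(x):
    # eml(1, x) = e - ln(x); eml(e - ln(x), 1) = e^e / x;
    # eml(1, e^e / x) = e - (e - ln(x)) = ln(x).
    return eml(1, eml(eml(1, x), 1))
```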
[LG-20] Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG
链接: https://arxiv.org/abs/2603.21832
作者: Mohammad Moulaeifard,Philip J. Aston,Peter H. Charlton,Nils Strodthoff
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 22 pages, 1 figure
Abstract:Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.
[LG-21] Show Me What You Don't Know: Efficient Sampling from Invariant Sets for Model Validation
链接: https://arxiv.org/abs/2603.21782
作者: Armand Rousselot,Joran Wendebourg,Ullrich Köthe
类目: Machine Learning (cs.LG)
*备注: 19 pages, 19 figures
Abstract:The performance of machine learning models is determined by the quality of their learned features. They should be invariant under irrelevant data variation but sensitive to task-relevant details. To visualize whether this is the case, we propose a method to analyze feature extractors by sampling from their fibers – equivalence classes defined by their invariances – given an arbitrary representative. Unlike existing work where a dedicated generative model is trained for each feature detector, our algorithm is training-free and exploits a pretrained diffusion or flow-matching model as a prior. The fiber loss – which penalizes mismatch in features – guides the denoising process toward the desired equivalence class, via non-linear diffusion trajectory matching. This replaces days of training for invariance learning with a single guided generation procedure at comparable fidelity. Experiments on popular datasets (ImageNet, CheXpert) and model types (ResNet, DINO, BiomedClip) demonstrate that our framework can reveal invariances ranging from very desirable to concerning behaviour. For instance, we show how Qwen-2B places patients with situs inversus (heart on the right side) in the same fiber as typical anatomy.
[LG-22] CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
链接: https://arxiv.org/abs/2603.21743
作者: Dongxia Wu,Shiye Su,Yuhui Zhang,Elaine Sui,Emma Lundberg,Emily B. Fox,Serena Yeung-Levy
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories-biological function, structural validity, and morphological correctness-and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond “visually realistic” generations towards “biologically meaningful” ones.
[LG-23] Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging
链接: https://arxiv.org/abs/2603.21717
作者: Dongxia Wu,Yuhui Zhang,Serena Yeung-Levy,Emma Lundberg,Emily B. Fox
类目: Machine Learning (cs.LG)
*备注:
Abstract:Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires both reliability (generalization across labs, devices, and experimental conditions) and accountability (detecting out-of-distribution cases where predictions may be unreliable). Uncertainty quantification (UQ) based approaches serve as promising candidates for these tasks, yet UQ for distribution-to-distribution generative models remains underexplored. We present a unified UQ framework, Bayesian Stochastic Flow Matching (BSFM), that disentangles aleatoric and epistemic uncertainty. The Stochastic Flow Matching (SFM) component augments deterministic flows with a diffusion term to improve model generalization to unseen scenarios. For UQ, we develop a scalable Bayesian approach – MCD-Antithetic – that combines Monte Carlo Dropout with sample-efficient antithetic sampling to produce effective anomaly scores for out-of-distribution detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability.
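Antithetic sampling, the variance-reduction idea inside MCD-Antithetic, pairs each random draw with its mirror image so that noise partially cancels. A generic sketch (not the paper's estimator, which applies the idea to dropout-based posterior samples):

```python
import random

def mc_estimate(f, n, rng, antithetic=False):
    """Monte Carlo estimate of E[f(U)] for U ~ Uniform(0, 1). With
    antithetic sampling, draws come in negatively correlated pairs
    (u, 1 - u), which reduces estimator variance for monotone f."""
    if antithetic:
        draws = []
        for _ in range(n // 2):
            u = rng.random()
            draws += [u, 1 - u]
    else:
        draws = [rng.random() for _ in range(n)]
    return sum(f(u) for u in draws) / len(draws)

f = lambda u: u  # E[f(U)] = 0.5; antithetic pairs cancel the noise exactly
est = mc_estimate(f, 1000, random.Random(0), antithetic=True)
```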
[LG-24] Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs NEURIPS2026
链接: https://arxiv.org/abs/2603.21705
作者: Tian Xia
类目: Machine Learning (cs.LG)
*备注: 14 pages, NeurIPS 2026 submission
Abstract:Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient – an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition 1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose FIM-Merging, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a +6.2 point gain on MATH500 over ACM-TIES (90.2 vs. 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of 47.3, surpassing the previous best ACM-TIES (43.3) by +3.9 points, while reducing average response length by 91.9% relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.
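The layer-adaptive idea (merging error scales with per-layer curvature, with the diagonal Fisher as a proxy) can be caricatured in a few lines. The coefficient rule below is illustrative only, not the paper's exact formula:

```python
def fim_layer_coefficients(fisher_per_layer, base_lambda=1.0):
    """Illustrative layer-adaptive rule: layers with a large
    diagonal-Fisher norm have a larger curvature bound on merging
    error, so they receive a smaller merging coefficient."""
    return {name: base_lambda / (1.0 + f)
            for name, f in fisher_per_layer.items()}

def merge_layer(theta_base, theta_reason, lam):
    """Task-arithmetic style merge of one layer's weights with a
    per-layer coefficient lam."""
    return [b + lam * (r - b) for b, r in zip(theta_base, theta_reason)]

lams = fim_layer_coefficients({"layer0": 0.0, "layer1": 3.0})
merged = merge_layer([0.0, 1.0], [1.0, 3.0], lams["layer1"])
```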
[LG-25] LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-and-Play Dereverberation ICASSP2026
链接: https://arxiv.org/abs/2603.21684
作者: Kazuki Matsumoto,Ren Uchida,Kohei Yatabe
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted for IEEE ICASSP 2026
Abstract:The robustness of deep neural networks (DNNs) can be certified through their Lipschitz continuity, which has made the construction of Lipschitz-continuous DNNs an active research field. However, DNNs for audio processing have not been a major focus due to their poor compatibility with existing results. In this paper, we consider the amplitude modifier (AM), a popular architecture for handling audio signals, and propose its Lipschitz-continuous variants, which we refer to as LipsAM. We prove a sufficient condition for an AM to be Lipschitz continuous and propose two architectures as examples of LipsAM. The proposed architectures were applied to a Plug-and-Play algorithm for speech dereverberation, and their improved stability is demonstrated through numerical experiments.
[LG-26] TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints
链接: https://arxiv.org/abs/2603.21656
作者: Vagish Kumar,Syed Bahauddin Alam,Souvik Chakraborty
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Protecting patient privacy remains a fundamental barrier to scaling machine learning across healthcare institutions, where centralizing sensitive data is often infeasible due to ethical, legal, and regulatory constraints. Federated learning offers a promising alternative by enabling privacy-preserving, multi-institutional training without sharing raw patient data; however, real-world deployments face severe challenges from data heterogeneity, site-specific biases, and class imbalance, which degrade predictive reliability and render existing uncertainty quantification methods ineffective. Here, we present TrustFed, a federated uncertainty quantification framework that provides distribution-free, finite-sample coverage guarantees under heterogeneous and imbalanced healthcare data, without requiring centralized access. TrustFed introduces a representation-aware client assignment mechanism that leverages internal model representations to enable effective calibration across institutions, along with a soft-nearest threshold aggregation strategy that mitigates assignment uncertainty while producing compact and reliable prediction sets. Using over 430,000 medical images across six clinically distinct imaging modalities, we conduct one of the most comprehensive evaluations of uncertainty-aware federated learning in medical imaging, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes. By validating TrustFed at this scale and breadth, our study advances uncertainty-aware federated learning from proof-of-concept toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.
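The "distribution-free, finite-sample coverage guarantees" the abstract describes are the hallmark of conformal prediction. As a hedged sketch of the underlying standard technique only — TrustFed's representation-aware client assignment and soft-nearest threshold aggregation are not reproduced here — split conformal prediction for classification looks like:

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Split conformal: nonconformity score is 1 - predicted prob of
    the true class; return the finite-sample-corrected quantile."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, qhat):
    """All classes whose score 1 - p stays within the threshold."""
    return np.where(1.0 - probs <= qhat)[0]
```

With exchangeable calibration and test data, the resulting sets contain the true label with probability at least 1 - alpha; the federated challenge is that heterogeneity across hospitals breaks naive pooling of calibration scores.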
[LG-27] MISApp: Multi-Hop Intent-Aware Session Graph Learning for Next App Prediction
链接: https://arxiv.org/abs/2603.21653
作者: Yunchi Yang,Longlong Li,Jianliang Wu,Cunquan Qu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predicting the next mobile app a user will launch is essential for proactive mobile services. Yet accurate prediction remains challenging in real-world settings, where user intent can shift rapidly within short sessions and user-specific historical profiles are often sparse or unavailable, especially under cold-start conditions. Existing approaches mainly model app usage as sequential behavior or local session transitions, limiting their ability to capture higher-order structural dependencies and evolving session intent. To address this issue, we propose MISApp, a profile-free framework for next app prediction based on multi-hop session graph learning. MISApp constructs multi-hop session graphs to capture transition dependencies at different structural ranges, learns session representations through lightweight graph propagation, incorporates temporal and spatial context to characterize session conditions, and captures intent evolution from recent interactions. Experiments on two real-world app usage datasets show that MISApp consistently outperforms competitive baselines under both standard and cold-start settings, while maintaining a favorable balance between predictive accuracy and practical efficiency. Further analyses show that the learned hop-level attention weights align well with structural relevance, offering interpretable evidence for the effectiveness of the proposed multi-hop modeling strategy.
[LG-28] Engineering Distributed Governance for Regional Prosperity: A Socio-Technical Framework for Mitigating Under-Vibrancy via Human Data Engines
链接: https://arxiv.org/abs/2603.21639
作者: Amil Khanzada,Takuji Takemoto
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 34 pages, 5 figures, 3 tables. Pre-print of a manuscript submitted for peer review
Abstract:Most research in urban informatics and tourism focuses on mitigating overtourism in dense global cities. However, for regions experiencing demographic decline and structural stagnation, the primary risk is “under-vibrancy”, a condition where low visitor density suppresses economic activity and diminishes satisfaction. This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework previously validated in biological crisis management, and adapts it for regional economic flow optimization. Using high-granularity data from Japan’s least-visited prefecture (Fukui), we utilize an AI-driven decision support system (DSS) to analyze two datasets: a raw Fukui spending database (90,350 records) and a regional standardized sentiment database (97,719 responses). The system achieves in-sample explanatory power of 81% (R^2 = 0.810) and out-of-sample predictive performance of 68% (R^2 = 0.683). We quantify an annual opportunity gap of 865,917 unrealized visits, equivalent to approximately 11.96 billion yen (USD 76.2 million) in lost revenue. We propose a dual-nudge governance architecture leveraging the DHDE to redistribute cross-prefectural flows and reduce economic leakage.
[LG-29] Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective
链接: https://arxiv.org/abs/2603.21621
作者: Yuehu Gong,Zeyuan Wang,Yulin Chen,Yanwei Fu
类目: Machine Learning (cs.LG)
*备注: 12 pages, 3 figures
Abstract:On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
[LG-30] Rateless DeepJSCC for Broadcast Channels: a Rate-Distortion-Complexity Tradeoff
链接: https://arxiv.org/abs/2603.21616
作者: Zijun Qin,Jingxuan Huang,Zesong Fei,Haichuan Ding,Yulin Shao,Xianhao Chen
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:In recent years, numerous data-intensive broadcasting applications have emerged at the wireless edge, calling for a flexible tradeoff between distortion, transmission rate, and processing complexity. While deep learning-based joint source-channel coding (DeepJSCC) has been identified as a potential solution to data-intensive communications, most of these schemes are confined to worst-case solutions, lack adaptive complexity, and are inefficient in broadcast settings. To overcome these limitations, this paper introduces nonlinear transform rateless source-channel coding (NTRSCC), a variable-length JSCC framework for broadcast channels based on rateless codes. In particular, we integrate learned source transformations with physical-layer LT codes, develop unequal protection schemes that exploit decoder side information, and devise approximations to enable end-to-end optimization of rateless parameters. Our framework enables heterogeneous receivers to adaptively adjust their received number of rateless symbols and decoding iterations in belief propagation, thereby achieving a controllable tradeoff between distortion, rate, and decoding complexity. Simulation results demonstrate that the proposed method enhances image broadcast quality under stringent communication and processing budgets over heterogeneous edge devices.
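The physical-layer LT codes the framework builds on are classically driven by the robust soliton degree distribution. The sketch below implements only that textbook component, as a hedged illustration; the paper's learned source transforms, unequal protection, and end-to-end optimization of rateless parameters are not reproduced.

```python
import math

def robust_soliton(k, c=0.1, delta=0.5):
    """Robust soliton degree distribution mu(1..k) for LT codes:
    ideal soliton rho plus spike/tail term tau, renormalized."""
    S = c * math.log(k / delta) * math.sqrt(k)
    rho = [0.0] * (k + 1)
    rho[1] = 1.0 / k
    for d in range(2, k + 1):
        rho[d] = 1.0 / (d * (d - 1))
    tau = [0.0] * (k + 1)
    pivot = int(round(k / S))
    for d in range(1, min(pivot, k + 1)):
        tau[d] = S / (d * k)
    if 1 <= pivot <= k:
        tau[pivot] = S * math.log(S / delta) / k
    Z = sum(rho[1:]) + sum(tau[1:])
    return [(rho[d] + tau[d]) / Z for d in range(1, k + 1)]
```

An encoder samples a degree from this distribution for each rateless symbol and XORs that many source symbols; receivers simply collect symbols until belief propagation succeeds, which is what lets heterogeneous receivers trade rate against decoding effort.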
[LG-31] Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction ICLR2026
链接: https://arxiv.org/abs/2603.21612
作者: Shiyan Hu,Jianxin Jin,Yang Shu,Peng Chen,Bin Yang,Chenjuan Guo
类目: Machine Learning (cs.LG)
*备注: ICLR 2026
Abstract:Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: this https URL.
[LG-32] In-network Attack Detection with Federated Deep Learning in IoT Networks: Real Implementation and Analysis
链接: https://arxiv.org/abs/2603.21596
作者: Devashish Chaudhary,Sutharshan Rajasegarar,Shiva Raj Pokhrel,Lei Pan,Ruby D
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: This paper has been accepted at the IEEE Conference on Engineering Informatics 2025
Abstract:The rapid expansion of the Internet of Things (IoT) and its integration with backbone networks have heightened the risk of security breaches. Traditional centralized approaches to anomaly detection, which require transferring large volumes of data to central servers, suffer from privacy, scalability, and latency limitations. This paper proposes a lightweight autoencoder-based anomaly detection framework designed for deployment on resource-constrained edge devices, enabling real-time detection while minimizing data transfer and preserving privacy. Federated learning is employed to train models collaboratively across distributed devices, where local training occurs on edge nodes and only model weights are aggregated at a central server. A real-world IoT testbed using Raspberry Pi sensor nodes was developed to collect normal and attack traffic data. The proposed federated anomaly detection system, implemented and evaluated on the testbed, demonstrates its effectiveness in accurately identifying network attacks. The communication overhead was reduced significantly while achieving comparable performance to the centralized method.
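The aggregation step described — local training on edge nodes, with only model weights averaged at a central server — is the standard FedAvg rule. A minimal sketch of that rule (not the paper's full autoencoder pipeline) for lists of per-layer weight arrays:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average each layer across clients, weighted by the
    number of local training samples per client."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    merged = []
    for l in range(n_layers):
        acc = np.zeros_like(client_weights[0][l], dtype=float)
        for w, n in zip(client_weights, client_sizes):
            acc += (n / total) * w[l]
        merged.append(acc)
    return merged
```

Only these averaged arrays cross the network, which is what keeps raw IoT traffic data on the Raspberry Pi nodes.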
[LG-33] Stability and Bifurcation Analysis of Nonlinear PDEs via Random Projection-based PINNs: A Krylov-Arnoldi Approach
链接: https://arxiv.org/abs/2603.21568
作者: Gianluca Fabiani,Michail E. Kavousanakis,Constantinos Siettos,Ioannis G. Kevrekidis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 30 pages, 6 figures
Abstract:We address a numerical framework for the stability and bifurcation analysis of nonlinear partial differential equations (PDEs) in which the solution is sought in the function space spanned by physics-informed random projection neural networks (PI-RPNNs), and discretized via a collocation approach. These are single-hidden-layer networks with randomly sampled and fixed a priori hidden-layer weights; only the linear output layer weights are optimized, reducing training to a single least-squares solve. This linear output structure enables the direct and explicit formulation of the eigenvalue problem governing the linear stability of stationary solutions. This takes a generalized eigenvalue form, which naturally separates the physical domain interior dynamics from the algebraic constraints imposed by boundary conditions, at no additional training cost and without requiring additional PDE solves. However, the random projection collocation matrix is inherently numerically rank-deficient, rendering naive eigenvalue computation unreliable and contaminating the true eigenvalue spectrum with spurious near-zero modes. To overcome this limitation, we introduce a matrix-free shift-invert Krylov-Arnoldi method that operates directly in weight space, avoiding explicit inversion of the numerically rank-deficient collocation matrix and enabling the reliable computation of several leading eigenpairs of the physical Jacobian - the discretized Frechet derivative of the PDE operator with respect to the solution field, whose eigenvalue spectrum determines linear stability. We further prove that the PI-RPNN-based generalized eigenvalue problem is almost surely regular, guaranteeing solvability with standard eigensolvers, and that the singular values of the random projection collocation matrix decay exponentially for analytic activation functions.
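The shift-invert Krylov-Arnoldi idea — run Arnoldi on (A - sigma*I)^{-1} so that eigenvalues of A nearest the shift become dominant, then map Ritz values back — can be sketched for a standard eigenproblem as below. This is an illustrative baseline only: the paper works in weight space on a generalized, numerically rank-deficient problem, which this sketch does not reproduce.

```python
import numpy as np

def shift_invert_arnoldi(apply_shifted_solve, n, sigma, m=20, seed=0):
    """Arnoldi on the shift-inverted operator. apply_shifted_solve(v)
    must return (A - sigma*I)^{-1} v, e.g. via a prefactorized solve.
    Returns Ritz estimates of A's eigenvalues closest to sigma."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    q = rng.standard_normal(n)
    Q[:, 0] = q / np.linalg.norm(q)
    k = m
    for j in range(m):
        w = apply_shifted_solve(Q[:, j])
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:                # Krylov space exhausted
            k = j + 1
            break
        Q[:, j + 1] = w / H[j + 1, j]
    theta = np.linalg.eigvals(H[:k, :k])       # eigs of the inverted op
    return sigma + 1.0 / theta                 # undo the spectral map
```

Because the operator is applied only through a solve, the inverse is never formed explicitly, which is exactly the matrix-free property the paper relies on to avoid inverting the rank-deficient collocation matrix.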
[LG-34] Kolmogorov Complexity Bounds for LLM Steganography and a Perplexity-Based Detection Proxy
链接: https://arxiv.org/abs/2603.21567
作者: Andrii Shportko
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models can rewrite text to embed hidden payloads while preserving surface-level meaning, a capability that opens covert channels between cooperating AI systems and poses challenges for alignment monitoring. We study the information-theoretic cost of such embedding. Our main result is that any steganographic scheme that preserves the semantic load of a covertext M_1 while encoding a payload P into a stegotext M_2 must satisfy K(M_2) \geq K(M_1) + K(P) - O(\log n), where K denotes Kolmogorov complexity and n is the combined message length. A corollary is that any non-trivial payload forces a strict complexity increase in the stegotext, regardless of how cleverly the encoder distributes the signal. Because Kolmogorov complexity is uncomputable, we ask whether practical proxies can detect this predicted increase. Drawing on the classical correspondence between lossless compression and Kolmogorov complexity, we argue that language-model perplexity occupies an analogous role in the probabilistic regime and propose the Binoculars perplexity-ratio score as one such proxy. Preliminary experiments with a color-based LLM steganographic scheme support the theoretical prediction: a paired t-test over 300 samples yields t = 5.11, p < 10^{-6}.
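The uncomputable K(.) in the bound has a classic computable stand-in: the length of a losslessly compressed encoding. As a toy, hedged illustration of the predicted complexity increase — this is not the paper's Binoculars score, and the case-flipping "stego" rewrite below is an invented stand-in for an LLM rewrite — one can compare compressed lengths of a repetitive covertext and a payload-carrying version:

```python
import zlib

def k_proxy(text: str) -> int:
    """Compressed length as a crude, computable stand-in for K(.)."""
    return len(zlib.compress(text.encode("utf-8"), level=9))

cover = "the quick brown fox jumps over the lazy dog " * 20

# Toy 'stego' rewrite: flip word case according to payload bits
# (a crude stand-in for an LLM steganographic rewrite).
payload = "1011001110001011"
words = cover.split()
stego = " ".join(
    w.upper() if payload[i % len(payload)] == "1" else w
    for i, w in enumerate(words)
)
```

Embedding the payload breaks the covertext's regularity, so the compressor needs more bits for the stegotext — the compression-side analogue of the K(M_2) > K(M_1) prediction.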
[LG-35] Generalization Limits of In-Context Operator Networks for Higher-Order Partial Differential Equations
链接: https://arxiv.org/abs/2603.21534
作者: Jamie Mahowald,Tan Bui-Thanh
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 16 pages, 9 figures
Abstract:We investigate the generalization capabilities of In-Context Operator Networks (ICONs), a new class of operator networks that build on the principles of in-context learning, for higher-order partial differential equations. We extend previous work by expanding the type and scope of differential equations handled by the foundation model. We demonstrate that while processing complex inputs requires some new computational methods, the underlying machine learning techniques are largely consistent with simpler cases. Our implementation shows that although point-wise accuracy degrades for higher-order problems like the heat equation, the model retains qualitative accuracy in capturing solution dynamics and overall behavior. This demonstrates the model’s ability to extrapolate fundamental solution characteristics to problems outside its training regime.
[LG-36] Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection
链接: https://arxiv.org/abs/2603.21492
作者: Qixin Zhang,Wei Huang,Yan Sun,Yao Shu,Yi Yu,Dacheng Tao
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 45 pages, 4 figures
Abstract:Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone \alpha -weakly DR-submodular or (\gamma,\beta) -weakly submodular, our Multinoulli-SCG algorithm can attain a value of (1-e^-\alpha)\textOPT-\epsilon or (\frac\gamma^2(1-e^-(\beta(1-\gamma)+\gamma^2))\beta(1-\gamma)+\gamma^2)\textOPT-\epsilon with only O(1/\epsilon^2) function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension(ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.
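The continuous relaxation described — one multinoulli (categorical) prior per partition block, optimized so that sampling one element per block maximizes the expected set value — can be sketched with a plain REINFORCE-style stochastic ascent. This is our own toy instantiation of the idea, not the paper's Multinoulli-SCG algorithm or its guarantees.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def optimize_multinoulli(partitions, f, steps=400, lr=0.5, mc=32, seed=0):
    """Learn one multinoulli prior per partition block by stochastic
    ascent on E[f(S)], where S picks one element per block."""
    rng = np.random.default_rng(seed)
    logits = [np.zeros(len(p)) for p in partitions]
    for _ in range(steps):
        samples = []
        for _ in range(mc):
            idx = [rng.choice(len(p), p=softmax(l))
                   for p, l in zip(partitions, logits)]
            samples.append((idx, f([p[i] for p, i in zip(partitions, idx)])))
        baseline = np.mean([v for _, v in samples])   # variance reduction
        grads = [np.zeros_like(l) for l in logits]
        for idx, val in samples:
            for g, l, i in zip(grads, logits, idx):
                score = -softmax(l)                   # d log pi / d logits
                score[i] += 1.0
                g += (val - baseline) * score
        for l, g in zip(logits, grads):
            l += lr * g / mc
    return [softmax(l) for l in logits]
```

Rounding is lossless by construction here: a learned prior is realized by drawing exactly one element per block, so the sampled set always satisfies the partition constraint.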
[LG-37] Learning Can Converge Stably to the Wrong Belief under Latent Reliability
链接: https://arxiv.org/abs/2603.21491
作者: Zhipeng Zhang,Zhenjie Yao,Kai Li,Lei Yang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures. Extended and refocused version of arXiv:2601.09261
Abstract:Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.
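The slow-timescale trust variable can be illustrated with a toy estimator: a fast update scaled by a slowly adapting trust signal derived from the learning dynamics themselves. Everything below (the error trace, the reliability cue, the thresholds) is our invented toy, not the paper's MTR algorithm.

```python
import numpy as np

def trust_modulated_mean(stream, eta=0.2, tau=0.05):
    """Estimate a mean while a slow 'trust' variable scales updates;
    trust drops when the error trace looks persistently one-sided,
    a crude cue that feedback may be unreliable."""
    mu, trust, err_avg = 0.0, 1.0, 0.0
    for x in stream:
        err = x - mu
        err_avg = (1 - tau) * err_avg + tau * err   # slow error trace
        reliable = float(abs(err_avg) < 1.0)        # toy reliability cue
        trust = (1 - tau) * trust + tau * reliable  # slow trust update
        mu += eta * trust * err                     # trust-scaled step
    return mu, trust
```

On a reliable stream the trust variable settles near 1 and the estimator behaves like ordinary stochastic averaging; a persistently biased stream keeps the error trace one-sided and throttles updates.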
[LG-38] GaussianSSC: Triplane-Guided Directional Gaussian Fields for 3D Semantic Completion
链接: https://arxiv.org/abs/2603.21487
作者: Ruiqi Xian,Jing Liang,He Yin,Xuewei Qi,Dinesh Manocha
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:We present GaussianSSC, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce Gaussian Anchoring, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel-image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned Gaussian-Triplane Refinement module that combines local gathering (target-centric) and global aggregation (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI, GaussianSSC improves Stage 1 occupancy by +1.0% Recall, +2.0% Precision, and +1.8% IoU over state-of-the-art baselines, and improves Stage 2 semantic prediction by +1.8% IoU and +0.8% mIoU.
[LG-39] Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies ICLR2026
链接: https://arxiv.org/abs/2603.21485
作者: Koichi Tanaka,Kazuki Kawamura,Takanori Muroi,Yusuke Narita,Yuki Sasamoto,Kei Tateno,Takuma Udagawa,Wei-Wei Du,Yuta Saito
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026
Abstract:Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different, logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS), exploiting the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, for a range of experimental settings with completely deterministic logging policies.
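The contrast the abstract draws — action-propensity weights versus click-probability weights — can be sketched as two estimators. The first is textbook IPS; the second is only our reading of the CIPS idea from the abstract (click probabilities as the importance weights), not the paper's exact estimator.

```python
import numpy as np

def ips_value(rewards, target_prob, logging_prob):
    """Vanilla IPS: reweight logged rewards by pi_e(a|x) / pi_0(a|x).
    Breaks down when pi_0 is deterministic (propensities 0 or 1)."""
    w = target_prob / np.clip(logging_prob, 1e-12, None)
    return float(np.mean(w * rewards))

def click_weighted_value(rewards, p_click_target, p_click_logged):
    """CIPS-style idea (our reading of the abstract): use click
    probabilities as the importance weights, which stay informative
    even when the logging policy itself is deterministic."""
    w = p_click_target / np.clip(p_click_logged, 1e-12, None)
    return float(np.mean(w * rewards))
```

When target and logging weights coincide, both estimators reduce to the empirical mean of logged rewards, which is a quick sanity check for any implementation.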
[LG-40] Mechanisms of Introspective Awareness
链接: https://arxiv.org/abs/2603.21396
作者: Uzay Macar,Li Yang,Atticus Wang,Peter Wallich,Emmanuel Ameisen,Jack Lindsey
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of “introspective awareness.” But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in open-source models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post-training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by evidence carrier and gate features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp and a trained steering vector by 75pp. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontrivial internal anomaly detection, and likely could be substantially improved in future models. Code: this https URL.
[LG-41] A Generalised Exponentiated Gradient Approach to Enhance Fairness in Binary and Multi-class Classification Tasks
链接: https://arxiv.org/abs/2603.21393
作者: Maryam Boubekraoui,Giordano d’Aloisio,Antinisca Di Marco
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The widespread use of AI and ML models in sensitive areas raises significant concerns about fairness. While the research community has introduced various methods for bias mitigation in binary classification tasks, the issue remains under-explored in multi-class classification settings. To address this limitation, in this paper, we first formulate the problem of fair learning in multi-class classification as a multi-objective problem between effectiveness (i.e., prediction correctness) and multiple linear fairness constraints. Next, we propose a Generalised Exponentiated Gradient (GEG) algorithm to solve this task. GEG is an in-processing algorithm that enhances fairness in binary and multi-class classification settings under multiple fairness definitions. We conduct an extensive empirical evaluation of GEG against six baselines across seven multi-class and three binary datasets, using four widely adopted effectiveness metrics and three fairness definitions. GEG overcomes existing baselines, with fairness improvements up to 92% and a decrease in accuracy up to 14%.
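GEG generalizes the classical exponentiated gradient (multiplicative weights) update; the base step it builds on can be sketched as below. The fairness-constrained reduction itself is not reproduced — this is only the simplex update at its core.

```python
import numpy as np

def exponentiated_gradient_step(w, grad, eta=0.1):
    """One EG step on the probability simplex: multiplicative update
    followed by renormalization, so w stays a distribution."""
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()
```

Iterating the step against a fixed gradient concentrates mass on the coordinate with the smallest loss component, which is the mechanism the reduction exploits to trade effectiveness against violated fairness constraints.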
[LG-42] Constrained Online Convex Optimization with Memory and Predictions AAAI2026
链接: https://arxiv.org/abs/2603.21375
作者: Mohammed Abdullah,George Iosifidis,Salah Eddine Elayoubi,Tijani Chahed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: accepted to AAAI 2026
Abstract:We study Constrained Online Convex Optimization with Memory (COCO-M), where both the loss and the constraints depend on a finite window of past decisions made by the learner. This setting extends the previously studied unconstrained online optimization with memory framework and captures practical problems such as the control of constrained dynamical systems and scheduling with reconfiguration budgets. For this problem, we propose the first algorithms that achieve sublinear regret and sublinear cumulative constraint violation under time-varying constraints, both with and without predictions of future loss and constraint functions. Without predictions, we introduce an adaptive penalty approach that guarantees sublinear regret and constraint violation. When short-horizon and potentially unreliable predictions are available, we reinterpret the problem as online learning with delayed feedback and design an optimistic algorithm whose performance improves as prediction accuracy improves, while remaining robust when predictions are inaccurate. Our results bridge the gap between classical constrained online convex optimization and memory-dependent settings, and provide a versatile learning toolbox with diverse applications.
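The "adaptive penalty approach" can be illustrated with a memoryless primal-dual sketch: descend on the loss while a penalty multiplier accumulates whenever the constraint is violated. This is a generic, hedged instantiation — the paper's memory-dependent losses, time-varying constraints, and prediction-aware variant are not reproduced.

```python
import numpy as np

def penalized_ogd(grad_f, g, grad_g, x0, T=500, eta=0.05):
    """Gradient descent on f with an adaptive penalty multiplier that
    grows with accumulated violation of the constraint g(x) <= 0."""
    x, lam = np.asarray(x0, dtype=float).copy(), 0.0
    for _ in range(T):
        viol = max(g(x), 0.0)
        lam += eta * viol                    # penalty ascent on violation
        x = x - eta * (grad_f(x) + lam * grad_g(x))
    return x, lam
```

For f(x) = x^2 subject to x >= 1 (i.e. g(x) = 1 - x <= 0), the multiplier ramps up until the iterate is pushed onto the constraint boundary at x = 1, mirroring the sublinear-violation behavior the paper targets.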
[LG-43] The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
链接: https://arxiv.org/abs/2603.21354
作者: Huamin Chen,Xunzhuo Liu,Bowei He,Fuyuan Lyu,Yankai Chen,Xue Liu,Yuhan Liu,Junchen Jiang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Vision Paper
Abstract:Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms – signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization – fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing – multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards – inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
[LG-44] AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
链接: https://arxiv.org/abs/2603.21331
作者: Jaber Jaber,Osama Jaber
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 11 pages, 5 tables, 2 figures. Code: this https URL
Abstract:Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl’s law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and this http URL (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating this http URL by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at this https URL.
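The bottleneck-ranking step — order ops by their Amdahl's-law impact on the whole model, not by raw time share — can be sketched as below. This is an illustrative sketch of that scheduling heuristic only; the profile numbers, estimated local speedups, and function names are our assumptions, not AutoKernel's internals.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when a fraction of total runtime is sped up
    locally by the given factor: 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def rank_bottlenecks(profile, est_speedup):
    """Rank ops by whole-model impact given profiled times (any unit)
    and an estimated achievable per-op speedup."""
    total = sum(profile.values())
    scored = {
        op: amdahl_speedup(t / total, est_speedup.get(op, 1.0))
        for op, t in profile.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

The ranking shows why an op holding half the runtime with a modest 2x local gain can still beat a small op with a 10x gain — end-to-end impact, not per-kernel speedup, decides where the agent spends its iterations.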
[LG-45] Which Alert Removals are Beneficial?
链接: https://arxiv.org/abs/2603.21322
作者: Idan Amit
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Context: Static analysis captures software engineering knowledge and alerts on possibly problematic patterns. Previous work showed that such alerts indeed have predictive power for various problems. However, the impact of removing the alerts is unclear. Aim: We would like to evaluate the impact of alert removals on code complexity and the tendency to bugs. Method: We evaluate the impact of removing alerts using three complementary methods. 1. We conducted a randomized controlled trial and built a dataset of 521 manual alert-removing interventions. 2. We profiled intervention-like events using labeling functions. We applied these labeling functions to code commits, found intervention-like natural events, and used them to analyze the impact on the tendency to bugs. 3. We built a dataset of 8,245 alert removals, more than 15 times larger than our dataset of manual interventions. We applied supervised learning to the alert removals, aiming to predict their impact on the tendency to bugs. Results: We identified complexity-reducing interventions that reduce the probability of future bugs. Such interventions are relevant to 33% of Python files and might reduce the tendency to bugs by 5.5 percentage points. Conclusions: We presented methods to evaluate the impact of interventions. The methods can identify a large number of natural interventions that are highly needed in causality research in many domains.
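The labeling functions of the second method can be illustrated with a minimal sketch: small predicates over commit data that fire on intervention-like events. The commit format and the two predicates below are hypothetical stand-ins, not the paper's actual functions.

```python
# Each labeling function flags one kind of "intervention-like" commit,
# i.e. a commit whose change plausibly removes a static-analysis alert.
def removes_noqa(commit):
    """Fires when a removed diff line carried a '# noqa' suppression."""
    return any(l.startswith("-") and "# noqa" in l for l in commit["diff"])

def removes_too_many_branches(commit):
    """Fires when the count of a complexity alert drops across the commit."""
    return commit["alerts_before"].get("too-many-branches", 0) > \
           commit["alerts_after"].get("too-many-branches", 0)

LABELING_FUNCTIONS = [removes_noqa, removes_too_many_branches]

def is_intervention_like(commit):
    """A commit is intervention-like if any labeling function fires."""
    return any(lf(commit) for lf in LABELING_FUNCTIONS)

commit = {
    "diff": ["-x = eval(s)  # noqa", "+x = ast.literal_eval(s)"],
    "alerts_before": {"too-many-branches": 2},
    "alerts_after": {"too-many-branches": 2},
}
```

Running such predicates over a commit history yields the large set of natural interventions the abstract describes, without manual labeling of each commit.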
[LG-46] Active Inference Agency Formalization Metrics and Convergence Assessments
链接: https://arxiv.org/abs/2603.21319
作者: Eduard Kapelko
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses the critical challenge of mesa-optimization in AI safety by providing a formal definition of agency and a framework for its analysis. Agency is conceptualized as a Continuous Representation of accumulated experience that achieves autopoiesis through a dynamic balance between curiosity (minimizing prediction error to ensure non-computability and novelty) and empowerment (maximizing the control channel’s information capacity to ensure subjectivity and goal-directedness). Empirical evidence suggests that this active inference-based model successfully accounts for classical instrumental goals, such as self-preservation and resource acquisition. The analysis demonstrates that the proposed agency function is smooth and convex, possessing favorable properties for optimization. While agentic functions occupy a vanishingly small fraction of the total abstract function space, they exhibit logarithmic convergence in sparse environments. This suggests a high probability for the spontaneous emergence of agency during the training of modern, large-scale models. To quantify the degree of agency, the paper introduces a metric based on the distance between the behavioral equivalents of a given system and an “ideal” agentic function within the space of canonicalized rewards (STARC). This formalization provides a concrete apparatus for classifying and detecting mesa-optimizers by measuring their proximity to an ideal agentic objective, offering a robust tool for analyzing and identifying undesirable inner optimization in complex AI systems.
[LG-47] Stream separation improves Bregman conditioning in transformers
链接: https://arxiv.org/abs/2603.21317
作者: James Clayton Kerce
类目: Machine Learning (cs.LG)
*备注:
Abstract:Linear methods for steering transformer representations, including probing, activation engineering, and concept erasure, implicitly assume the geometry of representation space is Euclidean. Park et al. [2026] showed that softmax induces a curved Bregman geometry whose metric tensor is the Hessian of the log-normalizer, H(\lambda) = Cov[\gamma | \lambda] . Ignoring this curvature causes Euclidean steering to leak probability mass to unintended tokens. Their analysis applies at the output layer. We measure this Hessian at intermediate layers in a controlled 2x2 design crossing stream separation with per-layer supervision (vocabulary decoding loss at each layer), all at matched vocabulary and parameter count. In standard single-stream transformers, H is severely degenerate at intermediate layers (effective rank 8 in 516 dimensions). Stream separation improves conditioning by up to 22 in effective rank, even without auxiliary supervision. Per-layer supervision helps, but less. The cosine similarity between primal and dual concept directions predicts per-layer steering effectiveness on downstream tasks, with a threshold near 0.3. These results bear on the reliability of linear safety interventions, which depend on the geometry being well-conditioned at the layer where they are applied.
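At the output layer, the quantity H(\lambda) = Cov[\gamma | \lambda] is the Hessian of the softmax log-normalizer (logsumexp), which equals diag(p) - p p^T for the softmax probabilities p. A small sketch can check this identity numerically against central finite differences; the logits below are arbitrary toy values, not anything measured in the paper.

```python
import math

def softmax(lam):
    m = max(lam)
    e = [math.exp(v - m) for v in lam]
    z = sum(e)
    return [v / z for v in e]

def log_normalizer(lam):
    # logsumexp, the log-partition function of the categorical family.
    m = max(lam)
    return m + math.log(sum(math.exp(v - m) for v in lam))

def analytic_hessian(lam):
    # Hessian of the log-normalizer: diag(p) - p p^T, i.e. the covariance
    # of the one-hot token variable gamma given logits lambda.
    p = softmax(lam)
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def fd_hessian(lam, eps=1e-4):
    # Central finite differences of the log-normalizer.
    n = len(lam)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def f(di, dj):
                v = list(lam)
                v[i] += di
                v[j] += dj
                return log_normalizer(v)
            H[i][j] = (f(eps, eps) - f(eps, -eps) - f(-eps, eps)
                       + f(-eps, -eps)) / (4 * eps * eps)
    return H

lam = [0.5, -1.0, 2.0]
H_a = analytic_hessian(lam)
H_fd = fd_hessian(lam)
```

Every row of this Hessian sums to zero (probabilities are constrained to sum to one), so H is always rank-deficient by at least one; the paper's effective-rank measurements quantify how much further it degenerates at intermediate layers.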
[LG-48] HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit
链接: https://arxiv.org/abs/2603.21316
作者: Khushiyant,Param Thakkar
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 10 Pages, 8 Figures
Abstract:Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.
[LG-49] FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models
链接: https://arxiv.org/abs/2603.21315
作者: Fabien Polly
类目: Machine Learning (cs.LG)
*备注: 18 pages, 16 figures, 4 tables. Code available at this https URL
Abstract:World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer-based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self-attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof-of-concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction-diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter-matched three-way ablation on unconditional UCF-101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self-attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single-step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10-15% higher spatial structure preservation and 18-25% more effective dimensionality, and critically maintains coherent multi-step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer-grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large-scale compute. These results establish that PDE-based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter-efficient alternative to both attention and convolutional recurrence for world modeling.
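A reaction-diffusion update of the kind FluidWorld builds on can be sketched in one dimension: each cell's next state depends only on itself and its two neighbours, which is where the native O(N) spatial complexity comes from. The Gray-Scott equations and parameters below are standard textbook values chosen for illustration, not the paper's learned dynamics.

```python
def gray_scott_step(u, v, Du=0.16, Dv=0.08, f=0.035, k=0.060, dt=1.0):
    """One explicit Euler step of the 1D Gray-Scott reaction-diffusion system
    with periodic boundaries. Each cell reads only its two neighbours, so the
    cost is O(N) per step (vs. O(N^2) for self-attention over N tokens)."""
    n = len(u)
    lap = lambda a, i: a[(i - 1) % n] - 2 * a[i] + a[(i + 1) % n]
    un, vn = [], []
    for i in range(n):
        uvv = u[i] * v[i] * v[i]                    # autocatalytic reaction term
        un.append(u[i] + dt * (Du * lap(u, i) - uvv + f * (1 - u[i])))
        vn.append(v[i] + dt * (Dv * lap(v, i) + uvv - (f + k) * v[i]))
    return un, vn

# Seed a small perturbation in an otherwise uniform state and roll it out;
# the PDE integration itself plays the role of the future-state predictor.
u = [1.0] * 32
v = [0.0] * 32
v[16] = 0.5
for _ in range(10):
    u, v = gray_scott_step(u, v)
```

The rollout stays bounded and spatially coherent because diffusion couples neighbouring cells at every step, the property the abstract credits for stable multi-step prediction.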
[LG-50] Direct Interval Propagation Methods using Neural-Network Surrogates for Uncertainty Quantification in Physical Systems Surrogate Model
链接: https://arxiv.org/abs/2603.21308
作者: Ghifari Adam Faza,Jolan Wauters,Fabio Cuzzolin,Hans Hallez,David Moens
类目: Machine Learning (cs.LG)
*备注:
Abstract:In engineering, uncertainty propagation aims to characterise system outputs under uncertain inputs. For interval uncertainty, the goal is to determine output bounds given interval-valued inputs, which is critical for robust design optimisation and reliability analysis. However, standard interval propagation relies on solving optimisation problems that become computationally expensive for complex systems. Surrogate models alleviate this cost but typically replace only the evaluator within the optimisation loop, still requiring many inference calls. To overcome this limitation, we reformulate interval propagation as an interval-valued regression problem that directly predicts output bounds. We present a comprehensive study of neural network-based surrogate models, including multilayer perceptrons (MLPs) and deep operator networks (DeepONet), for this task. Three approaches are investigated: (i) naive interval propagation through standard architectures, (ii) bound propagation methods such as Interval Bound Propagation (IBP) and CROWN, and (iii) interval neural networks (INNs) with interval weights. Results show that these methods significantly improve computational efficiency over traditional optimisation-based approaches while maintaining accurate interval estimates. We further discuss practical limitations and open challenges in applying interval-based propagation methods.
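For a single affine layer, the bound-propagation idea behind IBP (approach ii) reduces to pushing the interval centre through W and the radius through |W|; ReLU, being monotone, maps endpoint to endpoint. The toy weights below are arbitrary, and real surrogates chain this rule across all layers.

```python
def ibp_affine(W, b, lo, hi):
    """Propagate an axis-aligned box through y = W x + b: the centre goes
    through W, the radius through |W|, giving sound (if loose) output bounds."""
    n_out, n_in = len(W), len(lo)
    c = [(l + h) / 2.0 for l, h in zip(lo, hi)]   # interval centres
    r = [(h - l) / 2.0 for l, h in zip(lo, hi)]   # interval radii
    out_lo, out_hi = [], []
    for i in range(n_out):
        mid = b[i] + sum(W[i][j] * c[j] for j in range(n_in))
        rad = sum(abs(W[i][j]) * r[j] for j in range(n_in))
        out_lo.append(mid - rad)
        out_hi.append(mid + rad)
    return out_lo, out_hi

def ibp_relu(lo, hi):
    # ReLU is monotone, so it maps interval endpoints to interval endpoints.
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

# Toy 2-in / 2-out layer with interval-valued inputs in [-1, 1]^2.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]
lo, hi = ibp_affine(W, b, [-1.0, -1.0], [1.0, 1.0])
lo, hi = ibp_relu(lo, hi)
```

One forward pass of this kind replaces the many inference calls of an optimisation loop, which is the source of the efficiency gain the abstract reports.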
[LG-51] Amortized Variational Inference for Logistic Regression with Missing Covariates
链接: https://arxiv.org/abs/2603.21244
作者: M. Cherifi,Aude Sportisse,Xujia Zhu,Mohammed Nabil El Korso,A. Mesloub
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 25 pages, 12 figures, submitted to Statistics and Computing
Abstract:Missing covariate data pose a significant challenge to statistical inference and machine learning, particularly for classification tasks like logistic regression. Classical iterative approaches (EM, multiple imputation) are often computationally intensive, sensitive to high missingness rates, and limited in uncertainty propagation. Recent deep generative models based on VAEs show promise but rely on complex latent representations. We propose Amortized Variational Inference for Logistic Regression (AV-LR), a unified end-to-end framework for binary logistic regression with missing covariates. AV-LR integrates a probabilistic generative model with a simple amortized inference network, trained jointly by maximizing the evidence lower bound. Unlike competing methods, AV-LR performs inference directly in the space of missing data without additional latent variables, using a single inference network and a linear layer that jointly estimate regression parameters and the missingness mechanism. AV-LR achieves estimation accuracy comparable to or better than state-of-the-art EM-like algorithms, with significantly lower computational cost. It naturally extends to missing-not-at-random settings by explicitly modeling the missingness mechanism. Empirical results on synthetic and real-world datasets confirm its effectiveness and efficiency across various missing-data scenarios.
[LG-52] Does Mechanistic Interpretability Transfer Across Data Modalities? A Cross-Domain Causal Circuit Analysis of Variational Autoencoders
链接: https://arxiv.org/abs/2603.21236
作者: Dip Roy,Rajiv Misra,Sanjay Kumar Singh,Anisha Roy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Although mechanism-based interpretability has generated an abundance of insight for discriminative network analysis, generative models are less understood – particularly outside of image-related applications. We investigate how much of the causal circuitry found within image-related variational autoencoders (VAEs) will generalize to tabular data, as VAEs are increasingly used for imputation, anomaly detection, and synthetic data generation. In addition to extending a four-level causal intervention framework to four tabular and one image benchmark across five different VAE architectures (with 75 individual training runs per architecture and three random seed values for each run), this paper introduces three new techniques: posterior-calibration of Causal Effect Strength (CES), path-specific activation patching, and Feature-Group Disentanglement (FGD). The results from our experiments demonstrate that: (i) Tabular VAEs have circuits with modularity that is approximately 50% lower than their image counterparts. (ii) β-VAE experiences nearly complete collapse in CES scores when applied to heterogeneous tabular features (0.043 CES score for tabular data compared to 0.133 CES score for images), which can be directly attributed to reconstruction quality degradation (r = -0.886 correlation coefficient between CES and MSE). (iii) CES successfully captures nine of eleven statistically significant architecture differences using Holm–Šidák corrections. (iv) Interventions with high specificity predict the highest downstream AUC values (r = 0.460, p < .001). This study challenges the common assumption that architectural guidance from image-related studies can be transferred to tabular datasets.
[LG-53] Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
链接: https://arxiv.org/abs/2603.21210
作者: Janne Perini,Rafael Bischof,Moab Arar,Ayça Duran,Michael A. Kraus,Siddhartha Mishra,Bernd Bickel
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.
[LG-54] On the Role of Batch Size in Stochastic Conditional Gradient Methods
链接: https://arxiv.org/abs/2603.21191
作者: Rustem Islamov,Roman Machacek,Aurelien Lucchi,Antonio Silveti-Falls,Eduard Gorbunov,Volkan Cevher
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We study the role of batch size in stochastic conditional gradient methods under a μ-Kurdyka-Łojasiewicz (μ-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
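A minimal momentum-based stochastic conditional gradient loop of the kind analyzed (in the spirit of Scion, though not its actual configuration) can be sketched on a toy quadratic over the l1-ball, where the linear minimization oracle returns a signed coordinate vertex. The noise standard deviation stands in for the batch size: averaging a larger batch shrinks it, which is exactly the interaction the analysis characterizes.

```python
import random

random.seed(0)
d, tau = 5, 1.0
target = [0.6, -0.3, 0.0, 0.0, 0.0]    # minimizer of f(x) = 0.5*||x - target||^2
x = [0.0] * d
m = [0.0] * d                          # momentum buffer (running gradient estimate)
beta, noise_std, steps = 0.9, 0.05, 500

for t in range(1, steps + 1):
    # Stochastic gradient of the quadratic; a larger batch would shrink noise_std.
    g = [x[i] - target[i] + random.gauss(0.0, noise_std) for i in range(d)]
    m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
    i_star = max(range(d), key=lambda i: abs(m[i]))
    s = [0.0] * d
    s[i_star] = -tau if m[i_star] > 0 else tau   # LMO vertex of the l1-ball
    gamma = 2.0 / (t + 2)                        # classic Frank-Wolfe stepsize
    x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]

err = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
```

Because every iterate is a convex combination of l1-ball vertices, the constraint is satisfied by construction at no projection cost, the defining appeal of conditional gradient methods.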
[LG-55] ALMAB-DC: Active Learning Multi-Armed Bandits and Distributed Computing for Sequential Experimental Design and Black-Box Optimization
链接: https://arxiv.org/abs/2603.21180
作者: Foo Hui-Mean,Yuan-chin I Chang
类目: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 33 pages, and 13 figures
Abstract:Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from each observation. We propose ALMAB-DC, a GP-based sequential design framework combining active learning, multi-armed bandits (MAB), and distributed asynchronous computing for expensive black-box experimentation. A Gaussian process surrogate with uncertainty-aware acquisition identifies informative query points; a UCB or Thompson-sampling bandit controller allocates evaluations across parallel workers; and an asynchronous scheduler handles heterogeneous runtimes. We present cumulative regret bounds for the bandit components and characterize parallel scalability via Amdahl’s Law. We validate ALMAB-DC on five benchmarks. On the two statistical experimental-design tasks, ALMAB-DC achieves lower simple regret than Equal Spacing, Random, and D-optimal designs in dose–response optimization, and in adaptive spatial field estimation matches the Greedy Max-Variance benchmark while outperforming Latin Hypercube Sampling; at K=4 the distributed setting reaches target performance in one-quarter of sequential wall-clock rounds. On three ML/engineering tasks (CIFAR-10 HPO, CFD drag minimization, MuJoCo RL), ALMAB-DC achieves 93.4% CIFAR-10 accuracy (outperforming BOHB by 1.7 pp and Optuna by 1.1 pp), reduces airfoil drag to C_D = 0.059 (36.9% below Grid Search), and improves RL return by 50% over Grid Search. All advantages over non-ALMAB baselines are statistically significant under Bonferroni-corrected Mann–Whitney U tests. Distributed execution achieves 7.5x speedup at K = 16 agents, consistent with Amdahl’s Law.
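The bandit controller's allocation rule can be sketched with textbook UCB1 over illustrative Bernoulli arms; in the paper the arms are evaluation strategies with regret bounds, not coin flips, so everything below is a toy stand-in.

```python
import math
import random

random.seed(1)

def ucb1(means, horizon):
    """Play K Bernoulli arms for `horizon` rounds with the UCB1 rule:
    pick the arm maximizing empirical mean + sqrt(2 ln t / n_pulls)."""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, horizon + 1):
        if t <= K:                       # initialization: play each arm once
            a = t - 1
        else:
            a = max(range(K), key=lambda i:
                    sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if random.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
    return counts

# Three arms with hypothetical success rates; the bonus term forces enough
# exploration to identify the best arm, then concentrates pulls on it.
counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

The logarithmic regret of this rule is the kind of guarantee the abstract's cumulative regret bounds formalize for the controller's arm set.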
[LG-56] Pruned Adaptation Modules: A Simple yet Strong Baseline for Continual Foundation Models
链接: https://arxiv.org/abs/2603.21170
作者: Elif Ceren Gok Yildirim,Murat Onur Yildirim,Joaquin Vanschoren
类目: Machine Learning (cs.LG)
*备注: Published at CPAL 2026
Abstract:The continual learning literature has rapidly shifted from traditional class incremental learning (CIL) techniques to foundation model (FM)-based CIL methods without a clear understanding of how these newer approaches compare to strong, lightweight convolutional baselines. This abrupt transition has created a substantial methodological gap, making it difficult to assess whether recent FM-based CIL progress reflects genuine advances or merely the absence of rigorous baselines. To address this gap, we introduce Pruned Adaptation Modules (PAM), a simple yet effective method that freezes the vast majority of the pre-trained ResNet while enabling scalable continual adaptation through sparse task-specific layers. PAM yields up to a ~5x reduction in trainable parameters and a ~6x reduction in total parameters, significantly reducing the cost of continual updates. Across diverse benchmarks, PAM consistently mitigates catastrophic forgetting and outperforms state-of-the-art FM-based CIL approaches. Our findings position PAM as a strong and transparent baseline that helps bridge the gap between traditional and FM-based CIL, guiding future research for a more accurate assessment of true progress in continual adaptation. The code can be found at: this https URL.
[LG-57] Model Evolution Under Zeroth-Order Optimization: A Neural Tangent Kernel Perspective ICLR2026
链接: https://arxiv.org/abs/2603.21169
作者: Chen Zhang,Yuxin Cheng,Chenchen Ding,Shuqi Wang,Jingreng Lei,Runsheng Yu,Yik-Chung WU,Ngai Wong
类目: Machine Learning (cs.LG)
*备注: ICLR 2026 Workshop on Scientific Methods for Understanding Deep Learning (20 pages, 18 figures)
Abstract:Zeroth-order (ZO) optimization enables memory-efficient training of neural networks by estimating gradients via forward passes only, eliminating the need for backpropagation. However, the stochastic nature of gradient estimation significantly obscures the training dynamics, in contrast to the well-characterized behavior of first-order methods under Neural Tangent Kernel (NTK) theory. To address this, we introduce the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates. For linear models, we prove that the expected NZK remains constant throughout training and depends explicitly on the first and second moments of the random perturbation directions. This invariance yields a closed-form expression for model evolution under squared loss. We further extend the analysis to linearized neural networks. Interpreting ZO updates as kernel gradient descent via NZK provides a novel perspective for potentially accelerating convergence. Extensive experiments across synthetic and real-world datasets (including MNIST, CIFAR-10, and Tiny ImageNet) validate our theoretical results and demonstrate acceleration when using a single shared random vector.
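The forward-pass-only gradient estimation underlying ZO optimization can be sketched with the standard two-point estimator: probe f along random directions and scale the forward difference by the direction. The toy quadratic and all constants below are illustrative, not the paper's setup.

```python
import random

random.seed(0)

def f(x):
    return sum(xi * xi for xi in x)          # f(x) = ||x||^2, so grad f = 2x

def zo_grad(f, x, n_dirs=4000, mu=1e-3):
    """Average the two-point estimator ((f(x + mu*u) - f(x)) / mu) * u over
    random Gaussian directions u; only forward evaluations of f are needed,
    which is what makes ZO training backpropagation-free."""
    d = len(x)
    g = [0.0] * d
    fx = f(x)
    for _ in range(n_dirs):
        u = [random.gauss(0.0, 1.0) for _ in range(d)]
        scale = (f([xi + mu * ui for xi, ui in zip(x, u)]) - fx) / mu
        g = [gi + scale * ui / n_dirs for gi, ui in zip(g, u)]
    return g

x = [1.0, -2.0, 0.5]
g = zo_grad(f, x)   # approximates the true gradient [2.0, -4.0, 1.0]
```

The first and second moments of the perturbation directions u are exactly the quantities the paper shows the expected NZK depends on; using a single shared random vector, as in their acceleration experiments, changes those moments.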
[LG-58] Learning from Label Proportions with Dual-proportion Constraints
链接: https://arxiv.org/abs/2603.21153
作者: Tianhao Ma,Ximing Li,Changchun Li,Renchu Guan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning from Label Proportions (LLP) is a weakly supervised problem in which the training data comprise bags, that is, groups of instances, each annotated only with bag-level class label proportions, and the objective is to learn a classifier that predicts instance-level labels. This setting is widely applicable when privacy constraints limit access to instance-level annotations or when fine-grained labeling is costly or impractical. In this work, we introduce a method that leverages Dual proportion Constraints (LLP-DC) during training, enforcing them at both the bag and instance levels. Specifically, the bag-level training aligns the mean prediction with the given proportion, and the instance-level training aligns hard pseudo-labels that satisfy the proportion constraint, where a minimum-cost maximum-flow algorithm is used to generate hard pseudo-labels. Extensive experimental results across various benchmark datasets empirically validate that LLP-DC consistently improves over previous LLP methods across datasets and bag sizes. The code is publicly available at this https URL PR2026_Findings_LLP_DC.
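For a binary bag, the proportion-constrained hard pseudo-labels have a closed form that sidesteps the flow solver: assign the round(ρn) highest-scoring instances to the positive class, which is the minimum-cost assignment in the two-class case (the general multi-class case needs the min-cost max-flow formulation the abstract describes). The scores below are toy values.

```python
def bag_pseudo_labels(probs, rho):
    """probs: predicted P(y=1) for each instance in one bag;
    rho: the bag-level positive-class proportion. Returns hard pseudo-labels
    that exactly satisfy the proportion constraint."""
    n = len(probs)
    k = round(rho * n)
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    labels = [0] * n
    for i in order[:k]:
        labels[i] = 1
    return labels

# A bag of 5 instances with a 40% positive proportion: the two most
# confident instances (0.9 and 0.7) receive the positive pseudo-label.
labels = bag_pseudo_labels([0.9, 0.2, 0.6, 0.1, 0.7], rho=0.4)
```

Training on these proportion-consistent hard labels is the instance-level half of the dual constraints; the bag-level half aligns the mean prediction with ρ directly.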
[LG-59] ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models
链接: https://arxiv.org/abs/2603.21105
作者: Xu Li,Yi Zheng,Yuxuan Liang,Zhe Liu,Xiaolei Chen,Haotian Chen,Rui Zhu,Xiangyang Xue
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, allowing it to preserve the geometric structure of the original visual token space. To further incorporate cross modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and can be seamlessly integrated into existing LVLM pipelines without retraining or architectural modifications. Extensive experiments on multiple LVLM backbones, including LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, demonstrate that ResPrune consistently outperforms existing pruning approaches across a wide range of benchmarks, while achieving effective reductions in computation, memory consumption, and inference latency.
[LG-60] Learning to Optimize Joint Source and RIS-assisted Channel Encoding for Multi-User Semantic Communication Systems
链接: https://arxiv.org/abs/2603.21097
作者: Haidong Wang,Songhan Zhao,Bo Gu,Shimin Gong,Hongyang Du,Ping Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we explore a joint source and reconfigurable intelligent surface (RIS)-assisted channel encoding (JSRE) framework for multi-user semantic communications, where a deep neural network (DNN) extracts semantic features for all users and the RIS provides channel orthogonality, enabling a unified semantic encoding-decoding design. We aim to maximize the overall energy efficiency of semantic communications across all users by jointly optimizing the user scheduling, the RIS’s phase shifts, and the semantic compression ratio. Although this joint optimization problem can be addressed using conventional deep reinforcement learning (DRL) methods, evaluating semantic similarity typically relies on extensive real environment interactions, which can incur heavy computational overhead during training. To address this challenge, we propose a truncated DRL (T-DRL) framework, where a DNN-based semantic similarity estimator is developed to rapidly estimate the similarity score. Moreover, the user scheduling strategy is tightly coupled with the semantic model configuration. To exploit this relationship, we further propose a semantic model caching mechanism that stores and reuses fine-tuned semantic models corresponding to different scheduling decisions. A Transformer-based actor network is employed within the DRL framework to dynamically generate action space conditioned on the current caching state. This avoids redundant retraining and further accelerates the convergence of the learning process. Numerical results demonstrate that the proposed JSRE framework significantly improves the system energy efficiency compared with the baseline methods. By training fewer semantic models, the proposed T-DRL framework significantly enhances the learning efficiency.
[LG-61] Semi-Supervised Learning with Balanced Deep Representation Distributions
链接: https://arxiv.org/abs/2603.21056
作者: Changchun Li,Ximing Li,Bingjie Zhang,Wenting Wang,Jihong Ouyang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Semi-Supervised Text Classification (SSTC) methods mainly work in the spirit of self-training: they initialize a deep classifier by training on labeled texts, then alternately predict pseudo-labels for unlabeled texts and retrain the classifier on the mixture of labeled and pseudo-labeled texts. Naturally, their performance is largely determined by the accuracy of the pseudo-labels for unlabeled texts. Unfortunately, that accuracy is often low because of the margin bias problem caused by the large difference between the representation distributions of labels in SSTC. To alleviate this problem, we apply the angular margin loss and perform several Gaussian linear transformations to achieve balanced label angle variances, i.e., the variance of the label angles of texts within the same label. Constraining all label angle variances to be balanced, estimated over both labeled and pseudo-labeled texts during the self-training loops, yields more accurate pseudo-labels. With this insight, we propose a novel SSTC method, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD). We implement both multi-class and multi-label classification versions of S2TC-BDD by introducing pseudo-labeling tricks and regularization terms. To evaluate S2TC-BDD, we compare it against the state-of-the-art SSTC methods. Empirical results demonstrate the effectiveness of S2TC-BDD, especially when the labeled texts are scarce.
[LG-62] Confidence Freeze: Early Success Induces a Metastable Decoupling of Metacognition and Behaviour
链接: https://arxiv.org/abs/2603.21043
作者: Zhipeng Zhang,Hongshun He
类目: Machine Learning (cs.LG)
*备注:
Abstract:Humans must flexibly arbitrate between exploring alternatives and exploiting learned strategies, yet they frequently exhibit maladaptive persistence by continuing to execute failing strategies despite accumulating negative evidence. Here we propose a "confidence-freeze" account that reframes such persistence as a dynamic learning state rather than a stable dispositional trait. Using a multi-reversal two-armed bandit task across three experiments (total N = 332; 19,920 trials), we first show that human learners normally make use of the symmetric statistical structure inherent in outcome trajectories: runs of successes provide positive evidence for environmental stability and thus for strategy maintenance, whereas runs of failures provide negative evidence and should raise switching probability. Behaviour in the control group conformed to this normative pattern. However, individuals who experienced a high rate of early success (90% vs. 60%) displayed a robust and selective distortion after the first reversal: they persisted through long stretches of non-reward (mean = 6.2 consecutive losses) while their metacognitive confidence ratings simultaneously dropped from 5 to 2 on a 7-point scale.
[LG-63] Benchmarking Scientific Machine Learning Models for Air Quality Data
链接: https://arxiv.org/abs/2603.21039
作者: Khawja Imran Masud,Venkata Sai Rahul Unnam,Sahara Ali
类目: Machine Learning (cs.LG)
*备注: Accepted at IEEE IGARSS 2026; 22 pages, 6 figures;
Abstract:Accurate air quality index (AQI) forecasting is essential for protecting public health in rapidly growing urban regions, yet practical model evaluation and selection are often challenged by the lack of rigorous, region-specific benchmarking on standardized datasets. Physics-guided machine learning and deep learning models offer an effective route to more accurate and efficient AQI forecasting. This study presents a comprehensive, explainable benchmark of classical time-series, machine-learning, and deep-learning approaches for multi-horizon AQI forecasting in North Texas (Dallas County), along with a recommended physics-guided model. Using publicly available U.S. Environmental Protection Agency (EPA) daily air quality observations from 2022 to 2024, we curate city-level time series for PM2.5 and O3 by aggregating station measurements and constructing lag-wise forecasting datasets for lags of 1, 7, 14, and 30 days. Linear regression (LR), SARIMAX, multilayer perceptrons (MLP), and LSTM networks are benchmarked against the proposed physics-guided variants (MLP+Physics and LSTM+Physics), which incorporate the EPA breakpoint-based AQI formulation as a consistency constraint through a weighted loss. Experiments using chronological train-test splits and the error metrics MAE and RMSE show that deep-learning models outperform simpler baselines, while physics guidance improves stability and yields physically consistent pollutant-AQI relationships, with the largest benefits observed for short-horizon prediction and for PM2.5 and O3. Overall, the results provide a practical reference for selecting AQI forecasting models in North Texas and clarify when lightweight physics constraints meaningfully improve predictive performance across pollutants and forecast horizons.
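The EPA breakpoint-based AQI formulation mentioned above is a piecewise-linear interpolation; a minimal sketch of how it could feed a physics-guided consistency term in a weighted loss (the breakpoint rows, function names, and penalty weight are illustrative assumptions, not the paper's exact table or loss):

```python
# Sketch of breakpoint AQI interpolation and a physics-consistency penalty.
# Breakpoint rows are illustrative placeholders, NOT the official EPA table.

PM25_BREAKPOINTS = [  # (C_lo, C_hi, I_lo, I_hi), illustrative values
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]

def aqi_from_pm25(conc):
    """Piecewise-linear interpolation within the matching interval:
    AQI = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo."""
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo
    raise ValueError("concentration outside breakpoint table")

def physics_guided_loss(pred_aqi, true_aqi, pred_pm25, weight=0.1):
    """Squared error on AQI plus a penalty tying the predicted AQI to the
    AQI implied by the predicted pollutant concentration."""
    mse = (pred_aqi - true_aqi) ** 2
    consistency = (pred_aqi - aqi_from_pm25(pred_pm25)) ** 2
    return mse + weight * consistency
```

The consistency term goes to zero exactly when the predicted AQI and the predicted pollutant concentration agree under the breakpoint equation, which is one simple way to make predictions "physically consistent".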
[LG-64] Fuel Consumption Prediction: A Comparative Analysis of Machine Learning Paradigms
链接: https://arxiv.org/abs/2603.21034
作者: Ali Akram
类目: Machine Learning (cs.LG)
*备注:
Abstract:The automotive industry is under growing pressure to reduce its environmental impact, requiring accurate predictive modeling to support sustainable engineering design. This study examines the factors that determine vehicle fuel consumption from the seminal Motor Trend dataset, identifying the governing physical factors of efficiency through rigorous quantitative analysis. Methodologically, the research uses data sanitization, statistical outlier elimination, and in-depth Exploratory Data Analysis (EDA) to curb the occurrence of multicollinearity between powertrain features. A comparative analysis of machine learning paradigms including Multiple Linear Regression, Support Vector Machines (SVM), and Logistic Regression was carried out to assess predictive efficacy. Findings indicate that SVM Regression is most accurate on continuous prediction (R-squared = 0.889, RMSE = 0.326), and is effective in capturing the non-linear relationships between vehicle mass and engine displacement. In parallel, Logistic Regression proved superior for classification (Accuracy = 90.8%) and showed exceptional recall (0.957) when identifying low-efficiency vehicles. These results challenge the current trend toward black-box deep learning architectures for static physical datasets, providing validation of robust performance by interpretable and well-tuned classical models. The research finds that intrinsic vehicle efficiency is fundamentally determined by physical design parameters, weight and displacement, offering a data-driven framework for how manufacturers should focus on lightweighting and engine downsizing to achieve stringent global sustainability goals.
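The reported R-squared and RMSE figures follow their standard definitions; a small self-contained sketch of the two metrics (helper names and the sample vectors in the assertions are ours):

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))
```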
[LG-65] TabPFN Extensions for Interpretable Geotechnical Modelling
链接: https://arxiv.org/abs/2603.21033
作者: Taiga Saito,Yu Otake,Daijiro Mizutani,Stephen Wu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Geotechnical site characterisation relies on sparse, heterogeneous borehole data where uncertainty quantification and model interpretability are as critical as predictive accuracy for reliable engineering decisions. This paper presents an exploratory investigation into the use of TabPFN, a transformer-based tabular foundation model using in-context learning, and its extension library tabpfn-extensions for two geotechnical inference tasks: (1) soil-type classification using N-value and shear-wave velocity data from a synthetic geotechnical dataset, and (2) iterative imputation of five missing mechanical parameters ( s_\mathrm{u} , E_\mathrm{u} , \sigma'_\mathrm{p} , C_\mathrm{c} , C_\mathrm{v} ) in benchmark problem BM/AirportSoilProperties/2/2025. We apply cosine-similarity analysis to TabPFN-derived embeddings, visualise full posterior distributions from an iterative inference procedure, and compute SHAP-based feature importance, all without model retraining. Learned embeddings clearly separate Clay and Sand samples without explicit soil-type supervision; iterative imputation improves predictions for four of five target parameters, with posterior widths that reflect physically reasonable parameter-specific uncertainty; and SHAP analysis reveals the inter-parameter dependency structure, recovering established geotechnical relationships including the Skempton compression index correlation and the inverse dependence of preconsolidation pressure on water content. These results suggest the potential of foundation-model-based tools to support interpretable, uncertainty-aware parameter inference in data-scarce geotechnical practice.
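The iterative imputation procedure above can be sketched as a round-robin refit loop; here a simple column-mean predictor stands in for the TabPFN regressor (the stand-in predictor, function name, and `n_iters` default are illustrative assumptions, not the paper's code):

```python
# Minimal round-robin iterative-imputation loop: initialise missing cells,
# then repeatedly re-predict each originally-missing cell from the current
# completed table. A mean predictor replaces the foundation model here.

def iterative_impute(rows, n_iters=5):
    """rows: list of numeric lists, with None marking missing cells."""
    n_cols = len(rows[0])
    missing = [(i, j) for i, row in enumerate(rows)
               for j, v in enumerate(row) if v is None]
    # initialise each missing cell with its column mean over observed values
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        fill = sum(observed) / len(observed)
        for row in rows:
            if row[j] is None:
                row[j] = fill
    # round-robin refits: re-predict every originally-missing cell from the
    # current completed table (here: the leave-one-out column mean)
    for _ in range(n_iters):
        for i, j in missing:
            col = [rows[k][j] for k in range(len(rows)) if k != i]
            rows[i][j] = sum(col) / len(col)
    return rows
```

Swapping the mean predictor for a fitted regressor per column is what turns this skeleton into the TabPFN-style procedure the paper evaluates.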
[LG-66] When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
链接: https://arxiv.org/abs/2603.20997
作者: Abhinaba Basu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We identify a routing paradox in hybrid recurrent-attention architectures: content-based routing - deciding which tokens deserve expensive attention - requires exactly the pairwise computation that routing is designed to avoid. Through 20+ controlled experiments across three tasks (a synthetic diagnostic, the Zoology MQAR benchmark, and HotpotQA), we map the routing landscape exhaustively. One layer of softmax attention creates a latent ~34-dimensional subspace enabling 98.4% routing precision; zero layers yield 1.2%. This subspace is invisible to cosine similarity, destroyed by random projections (98.4% to 2.6%), and cannot be created by contrastive pretraining - proving attention’s role is writing pairwise match results into representations, not merely computing them. Twelve alternative mechanisms all cluster at 15-29%. Non-learned indices (Bloom filter: 90.9%; BM25 on HotpotQA: 82.7%) bypass the bottleneck entirely. The result is a sharp two-regime hierarchy with an empty middle ground. These findings provide the mechanistic explanation for the empirical observation that recurrent models fail at associative recall, and reframe attention as a representation constructor rather than merely a computation mechanism.
[LG-67] Interpreting the Synchronization Gap: The Hidden Mechanism Inside Diffusion Transformers
链接: https://arxiv.org/abs/2603.20987
作者: Emil Albrychiewicz,Andrés Franco Valiente,Li-Ching Chen,Viola Zixin Zhao
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
*备注: 38 pages, 5 figures
Abstract:Recent theoretical models of diffusion processes, conceptualized as coupled Ornstein-Uhlenbeck systems, predict a hierarchy of interaction timescales, and consequently, the existence of a synchronization gap between modes that commit at different stages of the reverse process. However, because these predictions rely on continuous time and analytically tractable score functions, it remains unclear how this phenomenology manifests in the deep, discrete architectures deployed in practice. In this work, we investigate how the synchronization gap is mechanistically realized within pretrained Diffusion Transformers (DiTs). We construct an explicit architectural realization of replica coupling by embedding two generative trajectories into a joint token sequence, modulated by a symmetric cross attention gate with variable coupling strength g. Through a linearized analysis of the attention difference, we show that the replica interaction decomposes mechanistically. We empirically validate our theoretical framework on a pretrained DiT-XL/2 model by tracking commitment and per layer internal mode energies. Our results reveal that: (1) the synchronization gap is an intrinsic architectural property of DiTs that persists even when external coupling is turned off; (2) as predicted by our spatial routing bounds, the gap completely collapses under strong coupling; (3) the gap is strictly depth localized, emerging sharply only within the final layers of the Transformer; and (4) global, low frequency structures consistently commit before local, high frequency details. Ultimately, our findings provide a mechanistic interpretation of how Diffusion Transformers resolve generative ambiguity, isolating speciation transitions to the terminal layers of the network.
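The replica coupling described above can be caricatured with a tiny gated blend of within-replica and cross-replica attention; this is a schematic stand-in under our own assumptions (the convex (1-g)/g gate, vector sizes, and all names are ours, not the DiT implementation):

```python
import math

# Blend self-attention with cross-replica attention under coupling g.

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Plain dot-product attention over a list of key/value vectors."""
    scores = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(scores, values))
            for d in range(dim)]

def gated_replica_attention(q, keys_self, vals_self, keys_other, vals_other, g):
    """g = 0 decouples the replicas; g = 1 attends only across replicas."""
    own = attend(q, keys_self, vals_self)
    cross = attend(q, keys_other, vals_other)
    return [(1.0 - g) * o + g * c for o, c in zip(own, cross)]
```

At g = 0 the two generative trajectories evolve independently, which is the regime where the paper still observes an intrinsic synchronization gap.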
[LG-68] Joint Surrogate Learning of Objectives Constraints and Sensitivities for Efficient Multi-objective Optimization of Neural Dynamical Systems
链接: https://arxiv.org/abs/2603.20984
作者: Frithjof Gressmann,Ivan Georgiev Raikov,Seung Hyun Kim,Mattia Gazzola,Lawrence Rauchwerger,Ivan Soltesz
类目: Machine Learning (cs.LG)
*备注:
Abstract:Biophysical neural system simulations are among the most computationally demanding scientific applications, and their optimization requires navigating high-dimensional parameter spaces under numerous constraints that impose a binary feasible/infeasible partition with no gradient signal to guide the search. Here, we introduce DMOSOPT, a scalable optimization framework that leverages a unified, jointly learned surrogate model to capture the interplay between objectives, constraints, and parameter sensitivities. By learning a smooth approximation of both the objective landscape and the feasibility boundary, the joint surrogate provides a unified gradient that simultaneously steers the search toward improved objective values and greater constraint satisfaction, while its partial derivatives yield per-parameter sensitivity estimates that enable more targeted exploration. We validate the framework from single-cell dynamics to population-level network activity, spanning incremental stages of a neural circuit modeling workflow, and demonstrate efficient, effective optimization of highly constrained problems at supercomputing scale with substantially fewer problem evaluations. While motivated by and demonstrated in the context of computational neuroscience, the framework is general and applicable to constrained multi-objective optimization problems across scientific and engineering domains.
[LG-69] MOELIGA: a multi-objective evolutionary approach for feature selection with local improvement
链接: https://arxiv.org/abs/2603.20934
作者: Leandro Vignolo,Matias Gerard
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 49 pages, 9 figures, 4 tables
Abstract:Selecting the most relevant or informative features is a key issue in real-world machine learning problems. Since an exhaustive search is not feasible even for a moderate number of features, an intelligent search strategy must be employed to find an optimal subset, which requires considering how features interact with each other in promoting class separability. Balancing feature subset size and classification accuracy constitutes a multi-objective optimization challenge. Here we propose MOELIGA, a multi-objective genetic algorithm incorporating an evolutionary local improvement strategy that evolves subordinate populations to refine feature subsets. MOELIGA employs a crowding-based fitness sharing mechanism and a sigmoid transformation to enhance diversity and guide compactness, alongside a geometry-based objective promoting classifier independence. Experimental evaluation on 14 diverse datasets demonstrates MOELIGA’s ability to identify smaller feature subsets with superior or comparable classification performance relative to 11 state-of-the-art methods. These findings suggest MOELIGA effectively addresses the accuracy-dimensionality trade-off, offering a robust and adaptable approach for multi-objective feature selection in complex, high-dimensional scenarios.
[LG-70] Deep Adaptive Rate Allocation in Volatile Heterogeneous Wireless Networks
链接: https://arxiv.org/abs/2603.20926
作者: Gregorio Maglione,Veselin Rakocevic,Markus Amend,Touraj Soleymani
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 12 pages, 11 figures
Abstract:Modern multi-access 5G+ networks provide mobile terminals with additional capacity, improving network stability and performance. However, in highly mobile environments such as vehicular networks, supporting multi-access connectivity remains challenging. The rapid fluctuations of wireless link quality often outpace the responsiveness of existing multipath schedulers and transport-layer protocols. This paper addresses this challenge by integrating Transformer-based path state forecasting with a new multipath splitting scheduler called Deep Adaptive Rate Allocation (DARA). The proposed scheduler employs a deep reinforcement learning engine to dynamically compute optimal congestion window fractions on available paths, determining data allocation among them. A six-component normalised reward function with weight-mediated conflict resolution drives a DQN policy that eliminates the observation-reaction lag inherent in reactive schedulers. Performance evaluation uses a Mininet-based Multipath Datagram Congestion Control Protocol testbed with traces from mobile users in vehicular environments. Experimental results demonstrate that DARA achieves better file transfer time reductions compared to learning-based schedulers under moderate-volatility traces. For buffered video streaming, resolution improvements are maintained across all tested conditions. Under controlled burst scenarios with sub-second buffer constraints, DARA achieves substantial rebuffering improvements whilst state-of-the-art schedulers exhibit near-continuous stalling.
[LG-71] Discriminative Representation Learning for Clinical Prediction
链接: https://arxiv.org/abs/2603.20921
作者: Yang Zhang,Li Fan,Samuel Lawrence,Shi Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models in healthcare have largely adopted self-supervised pretraining objectives inherited from natural language processing and computer vision, emphasizing reconstruction and large-scale representation learning prior to downstream adaptation. We revisit this paradigm in outcome-centric clinical prediction settings and argue that, when high-quality supervision is available, direct outcome alignment may provide a stronger inductive bias than generative pretraining. We propose a supervised deep learning framework that explicitly shapes representation geometry by maximizing inter-class separation relative to within-class variance, thereby concentrating model capacity along clinically meaningful axes. Across multiple longitudinal electronic health record tasks, including mortality and readmission prediction, our approach consistently outperforms masked, autoregressive, and contrastive pretraining baselines under matched model capacity. The proposed method improves discrimination, calibration, and sample efficiency, while simplifying the training pipeline to a single-stage optimization. These findings suggest that in low-entropy, outcome-driven healthcare domains, supervision can act as the statistically optimal driver of representation learning, challenging the assumption that large-scale self-supervised pretraining is a prerequisite for strong clinical performance.
[LG-72] LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models
链接: https://arxiv.org/abs/2603.20910
作者: Amirmohammad Ziaei Bideh,Jonathan Gryak
类目: Machine Learning (cs.LG)
*备注:
Abstract:Discovering the governing equations of dynamical systems is a central problem across many scientific disciplines. As experimental data become increasingly available, automated equation discovery methods offer a promising data-driven approach to accelerate scientific discovery. Among these methods, genetic programming (GP) has been widely adopted due to its flexibility and interpretability. However, GP-based approaches often suffer from inefficient exploration of the symbolic search space, leading to slow convergence and suboptimal solutions. To address these limitations, we propose LLM-ODE, a large language model-aided model discovery framework that guides symbolic evolution using patterns extracted from elite candidate equations. By leveraging the generative prior of large language models, LLM-ODE produces more informed search trajectories while preserving the exploratory strengths of evolutionary algorithms. Empirical results on 91 dynamical systems show that LLM-ODE variants consistently outperform classical GP methods in terms of search efficiency and Pareto-front quality. Overall, our results demonstrate that LLM-ODE improves both efficiency and accuracy over traditional GP-based discovery and offers greater scalability to higher-dimensional systems compared to linear and Transformer-only model discovery methods.
[LG-73] Bayesian Scattering: A Principled Baseline for Uncertainty on Image Data
链接: https://arxiv.org/abs/2603.20908
作者: Bernardo Fichera,Zarko Ivkovic,Kjell Jorner,Philipp Hennig,Viacheslav Borovitskiy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Uncertainty quantification for image data is dominated by complex deep learning methods, yet the field lacks an interpretable, mathematically grounded baseline. We propose Bayesian scattering to fill this gap, serving as a first-step baseline akin to the role of Bayesian linear regression for tabular data. Our method couples the wavelet scattering transform, a deep, non-learned feature extractor, with a simple probabilistic head. Because scattering features are derived from geometric principles rather than learned, they avoid overfitting the training distribution. This helps provide sensible uncertainty estimates even under significant distribution shifts. We validate this on diverse tasks, including medical imaging under institution shift, wealth mapping under country-to-country shift, and Bayesian optimization of molecular properties. Our results suggest that Bayesian scattering is a solid baseline for complex uncertainty quantification methods.
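The "simple probabilistic head" invites a comparison with conjugate Bayesian linear regression; a one-weight closed-form posterior sketch, purely illustrative of such a head on fixed features and not the paper's model:

```python
# Closed-form Gaussian posterior over a single weight w for y ~ w * x,
# under a N(0, prior_var) prior and Gaussian observation noise:
#   precision = sum(x_i^2) / noise_var + 1 / prior_var
#   mean      = (sum(x_i * y_i) / noise_var) / precision

def bayesian_linear_posterior(x, y, noise_var=1.0, prior_var=1.0):
    precision = sum(xi * xi for xi in x) / noise_var + 1.0 / prior_var
    mean = (sum(xi * yi for xi, yi in zip(x, y)) / noise_var) / precision
    return mean, 1.0 / precision  # posterior mean and variance
```

Because the features are fixed (non-learned, like scattering coefficients), all uncertainty lives in this small, interpretable posterior.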
[LG-74] Incentive-Aware Federated Averaging with Performance Guarantees under Strategic Participation
链接: https://arxiv.org/abs/2603.20873
作者: Fateme Maleki,Krishnan Raghavan,Farzad Yousefian
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Federated learning (FL) is a communication-efficient collaborative learning framework that enables model training across multiple agents with private local datasets. While the benefits of FL in improving global model performance are well established, individual agents may behave strategically, balancing the learning payoff against the cost of contributing their local data. Motivated by the need for FL frameworks that successfully retain participating agents, we propose an incentive-aware federated averaging method in which, at each communication round, clients transmit both their local model parameters and their updated training dataset sizes to the server. The dataset sizes are dynamically adjusted via a Nash equilibrium (NE)-seeking update rule that captures strategic data participation. We analyze the proposed method under convex and nonconvex global objective settings and establish performance guarantees for the resulting incentive-aware FL algorithm. Numerical experiments on the MNIST and CIFAR-10 datasets demonstrate that agents achieve competitive global model performance while converging to stable data participation strategies.
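The two moving parts above, size-weighted averaging and a NE-seeking size update, can be sketched as follows; the log-benefit/linear-cost payoff and all constants are illustrative assumptions, not the paper's equilibrium dynamics:

```python
# Size-weighted federated averaging plus a toy best-response size update.

def fedavg(local_params, sizes):
    """Average parameter vectors weighted by contributed dataset sizes."""
    total = sum(sizes)
    dim = len(local_params[0])
    return [sum(n * p[d] for n, p in zip(sizes, local_params)) / total
            for d in range(dim)]

def update_sizes(sizes, benefit=1.0, cost=0.01, lr=0.5):
    """Toy NE-seeking step: each agent moves its contributed size along
    the gradient of a concave payoff benefit*log(1+n) - cost*n."""
    return [max(0.0, n + lr * (benefit / (1.0 + n) - cost)) for n in sizes]
```

Iterating `update_sizes` drives each agent toward the size where marginal learning benefit equals marginal data cost, the kind of stable participation strategy the abstract reports.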
[LG-75] A Knowledge-Informed Pretrained Model for Causal Discovery
链接: https://arxiv.org/abs/2603.20842
作者: Wenbo Xu,Yue He,Yunhai Wang,Xingxuan Zhang,Kun Kuang,Yueguo Chen,Peng Cui
类目: Machine Learning (cs.LG)
*备注:
Abstract:Causal discovery has been widely studied, yet many existing methods rely on strong assumptions or fall into two extremes: either depending on costly interventional signals or partial ground truth as strong priors, or adopting purely data-driven paradigms with limited guidance, which hinders practical deployment. Motivated by real-world scenarios where only coarse domain knowledge is available, we propose a knowledge-informed pretrained model for causal discovery that integrates weak prior knowledge as a principled middle ground. Our model adopts a dual-source encoder-decoder architecture to process observational data in a knowledge-informed way. We design a diverse pretraining dataset and a curriculum learning strategy that smoothly adapts the model to varying prior strengths across mechanisms, graph densities, and variable scales. Extensive experiments on in-distribution, out-of-distribution, and real-world datasets demonstrate consistent improvements over existing baselines, with strong robustness and practical applicability.
[LG-76] Beyond the Academic Monoculture: A Unified Framework and Industrial Perspective for Attributed Graph Clustering
链接: https://arxiv.org/abs/2603.20829
作者: Yunhui Liu,Yue Liu,Yongchao Liu,Tao Zheng,Stan Z. Li,Xinwang Liu,Tieke He
类目: Machine Learning (cs.LG)
*备注:
Abstract:Attributed Graph Clustering (AGC) is a fundamental unsupervised task that partitions nodes into cohesive groups by jointly modeling structural topology and node attributes. While the advent of graph neural networks and self-supervised learning has catalyzed a proliferation of AGC methodologies, a widening chasm persists between academic benchmark performance and the stringent demands of real-world industrial deployment. To bridge this gap, this survey provides a comprehensive, industrially grounded review of AGC from three complementary perspectives. First, we introduce the Encode-Cluster-Optimize taxonomic framework, which decomposes the diverse algorithmic landscape into three orthogonal, composable modules: representation encoding, cluster projection, and optimization strategy. This unified paradigm enables principled architectural comparisons and inspires novel methodological combinations. Second, we critically examine prevailing evaluation protocols to expose the field’s academic monoculture: a pervasive over-reliance on small, homophilous citation networks, the inadequacy of supervised-only metrics for an inherently unsupervised task, and the chronic neglect of computational scalability. In response, we advocate for a holistic evaluation standard that integrates supervised semantic alignment, unsupervised structural integrity, and rigorous efficiency profiling. Third, we explicitly confront the practical realities of industrial deployment. By analyzing operational constraints such as massive scale, severe heterophily, and tabular feature noise alongside extensive empirical evidence from our companion benchmark, we outline actionable engineering strategies. Furthermore, we chart a clear roadmap for future research, prioritizing heterophily-robust encoders, scalable joint optimization, and unsupervised model selection criteria to meet production-grade requirements.
[LG-77] Simple Projection-Free Algorithm for Contextual Recommendation with Logarithmic Regret and Robustness
链接: https://arxiv.org/abs/2603.20826
作者: Shinsaku Sakaue
类目: Machine Learning (cs.LG)
*备注:
Abstract:Contextual recommendation is a variant of contextual linear bandits in which the learner observes an (optimal) action rather than a reward scalar. Recently, Sakaue et al. (2025) developed an efficient Online Newton Step (ONS) approach with an O(d\log T) regret bound, where d is the dimension of the action space and T is the time horizon. In this paper, we present a simple algorithm that is more efficient than the ONS-based method while achieving the same regret guarantee. Our core idea is to exploit the improperness inherent in contextual recommendation, leading to an update rule akin to the second-order perceptron from online classification. This removes the Mahalanobis projection step required by ONS, which is often a major computational bottleneck. More importantly, the same algorithm remains robust to possibly suboptimal action feedback, whereas the prior ONS-based method required running multiple ONS learners with different learning rates for this extension. We describe how our method works in general Hilbert spaces (e.g., via kernelization), where eliminating Mahalanobis projections becomes even more beneficial.
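The update the abstract likens to the second-order perceptron can be sketched in a 2-D toy with the 2x2 inverse written out; this is our simplification for illustration (the paper's method works in general Hilbert spaces and differs in details):

```python
# Second-order-perceptron-style prediction and mistake-driven update.
# A is the correlation matrix (initialised to the identity), v the signed
# sum of mistake instances.

def sop_predict(A, v, x):
    """Predict sign of x^T (A + x x^T)^{-1} v."""
    a = A[0][0] + x[0] * x[0]
    b = A[0][1] + x[0] * x[1]
    c = A[1][0] + x[1] * x[0]
    d = A[1][1] + x[1] * x[1]
    det = a * d - b * c
    w0 = (d * v[0] - b * v[1]) / det
    w1 = (-c * v[0] + a * v[1]) / det
    score = w0 * x[0] + w1 * x[1]
    return 1 if score >= 0 else -1

def sop_update(A, v, x, y):
    """On a mistake, fold x into A and y*x into v."""
    for i in range(2):
        for j in range(2):
            A[i][j] += x[i] * x[j]
        v[i] += y * x[i]
```

Note there is no projection step anywhere: prediction is a linear solve and the update is additive, which is the efficiency point the abstract makes against ONS.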
[LG-78] Cross-Granularity Representations for Biological Sequences: Insights from ESM and BiGCARP
链接: https://arxiv.org/abs/2603.20825
作者: Hanlin Xiao,Rainer Breitling,Eriko Takano,Mauricio A. Álvarez
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, published in 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Abstract:Recent advances in general-purpose foundation models have stimulated the development of large biological sequence models. While natural language shows symbolic granularity (characters, words, sentences), biological sequences exhibit hierarchical granularity whose levels (nucleotides, amino acids, protein domains, genes) further encode biologically functional information. In this paper, we investigate the integration of cross-granularity knowledge from models through a case study of BiGCARP, a Pfam domain-level model for biosynthetic gene clusters, and ESM, an amino acid-level protein language model. Using representation analysis tools and a set of probe tasks, we first explain why a straightforward cross-model embedding initialization fails to improve downstream performance in BiGCARP, and show that deeper-layer embeddings capture a more contextual and faithful representation of the model’s learned knowledge. Furthermore, we demonstrate that representations at different granularities encode complementary biological knowledge, and that combining them yields measurable performance gains in intermediate-level prediction tasks. Our findings highlight cross-granularity integration as a promising strategy for improving both the performance and interpretability of biological foundation models.
[LG-79] Achieving \widetilde{O}(1/ε) Sample Complexity for Bilinear Systems Identification under Bounded Noises
链接: https://arxiv.org/abs/2603.20819
作者: Hongyu Yi,Chenbei Lu,Jing Yu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
Abstract:This paper studies finite-sample set-membership identification for discrete-time bilinear systems under bounded symmetric log-concave disturbances. Compared with existing finite-sample results for linear systems and related analyses under stronger noise assumptions, we consider the more challenging bilinear setting with trajectory-dependent regressors and allow marginally stable dynamics with polynomial mean-square state growth. Under these conditions, we prove that the diameter of the feasible parameter set shrinks with sample complexity \widetilde{O}(1/\epsilon) . Simulations support the theory and illustrate the advantage of the proposed estimator for uncertainty quantification.
[LG-80] Large Neighborhood Search meets Iterative Neural Constraint Heuristics
链接: https://arxiv.org/abs/2603.20801
作者: Yudong W. Xu,Wenhao Li,Scott Sanner,Elias B. Khalil
类目: Machine Learning (cs.LG)
*备注: Published in the 23rd International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research
Abstract:Neural networks are being increasingly used as heuristics for constraint satisfaction. These neural methods are often recurrent, learning to iteratively refine candidate assignments. In this work, we make explicit the connection between such iterative neural heuristics and Large Neighborhood Search (LNS), and adapt an existing neural constraint satisfaction method, ConsFormer, into an LNS procedure. We decompose the resulting neural LNS into two standard components: the destroy and repair operators. On the destroy side, we instantiate several classical heuristics and introduce novel prediction-guided operators that exploit the model’s internal scores to select neighborhoods. On the repair side, we utilize ConsFormer as a neural repair operator and compare the original sampling-based decoder to a greedy decoder that selects the most likely assignments. Through an empirical study on Sudoku, Graph Coloring, and MaxCut, we find that adapting the neural heuristic to an LNS procedure yields substantial gains over its vanilla settings and improves its competitiveness with classical and neural baselines. We further observe consistent design patterns across tasks: stochastic destroy operators outperform greedy ones, while greedy repair is more effective than sampling-based repair for finding a single high-quality feasible assignment. These findings highlight LNS as a useful lens and design framework for structuring and improving iterative neural approaches.
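The destroy/repair decomposition above can be made concrete with a compact classical LNS loop for graph coloring, using a stochastic destroy and a greedy repair (a generic sketch with our own helper names, not the paper's neural ConsFormer repair):

```python
import random

def conflicts(coloring, edges):
    return [(u, v) for u, v in edges if coloring[u] == coloring[v]]

def destroy(coloring, edges, k, rng):
    """Stochastic destroy: unassign up to k conflict-involved vertices."""
    bad = {u for e in conflicts(coloring, edges) for u in e}
    for u in rng.sample(sorted(bad), min(k, len(bad))):
        coloring[u] = None

def repair(coloring, adj, n_colors):
    """Greedy repair: give each unassigned vertex the color least used
    among its already-colored neighbors."""
    for u, c in enumerate(coloring):
        if c is None:
            used = [coloring[v] for v in adj[u] if coloring[v] is not None]
            coloring[u] = min(range(n_colors), key=used.count)

def lns_coloring(n, edges, n_colors, iters=200, seed=0):
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = [rng.randrange(n_colors) for _ in range(n)]
    for _ in range(iters):
        if not conflicts(best, edges):
            break
        cand = list(best)
        destroy(cand, edges, 2, rng)
        repair(cand, adj, n_colors)
        if len(conflicts(cand, edges)) <= len(conflicts(best, edges)):
            best = cand
    return best
```

Replacing `repair` with a learned model is exactly the adaptation the paper studies; the stochastic `destroy` mirrors the design pattern the authors found to outperform greedy destruction.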
[LG-81] Neural Autoregressive Flows for Markov Boundary Learning ICDM2025
链接: https://arxiv.org/abs/2603.20791
作者: Khoa Nguyen,Bao Duong,Viet Huynh,Thin Nguyen
类目: Machine Learning (cs.LG)
*备注: Accepted at IEEE ICDM 2025
Abstract:Recovering Markov boundary – the minimal set of variables that maximizes predictive performance for a response variable – is crucial in many applications. While recent advances improve upon traditional constraint-based techniques by scoring local causal structures, they still rely on nonparametric estimators and heuristic searches, lacking theoretical guarantees for reliability. This paper investigates a framework for efficient Markov boundary discovery by integrating conditional entropy from information theory as a scoring criterion. We design a novel masked autoregressive network to capture complex dependencies. A parallelizable greedy search strategy in polynomial time is proposed, supported by analytical evidence. We also discuss how initializing a graph with learned Markov boundaries accelerates the convergence of causal discovery. Comprehensive evaluations on real-world and synthetic datasets demonstrate the scalability and superior performance of our method in both Markov boundary discovery and causal discovery tasks.
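The conditional-entropy scoring and greedy search described above can be sketched for discrete data; a plain plug-in count estimator stands in for the paper's masked autoregressive network (function names and the toy data are ours):

```python
import math
from collections import Counter

def cond_entropy(y, xs):
    """Empirical H(Y | X) in nats; xs holds one tuple of conditioning
    values per sample."""
    n = len(y)
    joint = Counter(zip(xs, y))
    marg = Counter(xs)
    return -sum((c / n) * math.log(c / marg[x])
                for (x, _), c in joint.items())

def greedy_markov_boundary(columns, y, tol=1e-9):
    """Repeatedly add the column with the largest drop in H(y | selected),
    stopping when no column reduces the conditional entropy further."""
    selected = []
    rows = lambda names: [tuple(columns[s][i] for s in names)
                          for i in range(len(y))]
    current = cond_entropy(y, rows([]))  # equals the marginal entropy H(y)
    while True:
        gains = {name: current - cond_entropy(y, rows(selected + [name]))
                 for name in columns if name not in selected}
        if not gains or max(gains.values()) <= tol:
            break
        best = max(gains, key=gains.get)
        selected.append(best)
        current -= gains[best]
    return selected
```

On a toy OR target over columns a and b (with c irrelevant), the greedy loop selects exactly {a, b}, the Markov boundary of y.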
[LG-82] Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness
链接: https://arxiv.org/abs/2603.20775
作者: Yuxuan Yang,Dugang Liu,Yiyan Huang
类目: Machine Learning (cs.LG)
*备注: 17 pages
Abstract:In personalized marketing, uplift models estimate incremental effects by modeling how customer behavior changes under alternative treatments. However, real-world data often exhibit biases, such as selection bias, spillover effects, and unobserved confounding, which adversely affect both estimation accuracy and metric validity. Despite the importance of bias-aware assessment, a lack of systematic studies persists. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets lack counterfactual ground truth, rendering direct metric validation infeasible. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking, effectively bridging the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that: (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) evaluation metric stability is linked to mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and metrics. Code will be released upon acceptance.
[LG-83] Adversarial Attacks on Locally Private Graph Neural Networks
链接: https://arxiv.org/abs/2603.20746
作者: Matta Varun(Indian Institute of Technology Kharagpur, India),Ajay Kumar Dhakar(Indian Institute of Technology Kharagpur, India),Yuan Hong(University of Connecticut, USA),Shamik Sural(Indian Institute of Technology Kharagpur, India)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Graph neural networks (GNNs) are a powerful tool for analyzing graph-structured data. However, their vulnerability to adversarial attacks raises serious concerns, especially when dealing with sensitive information. Local Differential Privacy (LDP) offers a privacy-preserving framework for training GNNs, but its impact on adversarial robustness remains underexplored. This paper investigates adversarial attacks on LDP-protected GNNs. We explore how the privacy guarantees of LDP can be leveraged or hindered by adversarial perturbations. The effectiveness of existing attack methods on LDP-protected GNNs is analyzed, and potential challenges in crafting adversarial examples under LDP constraints are discussed. Additionally, we suggest directions for defending LDP-protected GNNs against adversarial attacks. This work investigates the interplay between privacy and security in graph learning, highlighting the need for robust and privacy-preserving GNN architectures.
[LG-84] RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models IJCNN2026
链接: https://arxiv.org/abs/2603.20711
作者: Zihao Zheng,Hangyu Cao,Jiayu Chen,Sicheng Tian,Chenyue Li,Maoliang Li,Xinhao Sun,Guojie Luo,Xiang Chen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: This paper has been accepted by IJCNN 2026
Abstract:Vision-Language-Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) deployment offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) Diverse model structures hinder optimal ECC segmentation point identification; (2) Even if the optimal split point is determined, changes in network bandwidth can cause performance drift. To address these issues, we propose a novel ECC deployment framework for various VLA models, termed RoboECC. Specifically, we propose a model-hardware co-aware segmentation strategy to help find the optimal segmentation point for various VLA models. Moreover, we propose a network-aware deployment adjustment approach to adapt to the network fluctuations for maintaining optimal performance. Experiments demonstrate that RoboECC achieves a speedup of up to 3.28x with only 2.55x-2.62x overhead.
[LG-85] Neuronal Self-Adaptation Enhances Capacity and Robustness of Representation in Spiking Neural Networks
链接: https://arxiv.org/abs/2603.20687
作者: Zhuobin Yang,Yeyao Bao,Liangfu Lv,Jian Zhang,Xiaohong Li,Yunliang Zang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) are promising for energy-efficient, real-time edge computing, yet their performance is often constrained by the limited adaptability of conventional leaky integrate-and-fire (LIF) neurons. Existing LIF models struggle with restricted information capacity and susceptibility to noise, leading to degraded accuracy and compromised robustness. Inspired by the dynamic self-regulation of biological potassium channels, we propose the Potassium-regulated LIF (KvLIF) neuron model. KvLIF introduces an auxiliary conductance state that integrates membrane potential and spiking history to adaptively modulate neuronal excitability and reset dynamics. This design extends the dynamic response range of neurons to varying input intensities and effectively suppresses noise-induced spikes. We extensively evaluate KvLIF on both static image and neuromorphic datasets, demonstrating consistent improvements in classification accuracy and superior robustness compared to existing LIF models. Our work bridges biological plausibility with computational efficiency, offering a neuron model that enhances SNN performance while maintaining suitability for low-power neuromorphic deployment.
[LG-86] Breaking the O(\sqrt{T}) Cumulative Constraint Violation Barrier while Achieving O(\sqrt{T}) Static Regret in Constrained Online Convex Optimization
链接: https://arxiv.org/abs/2603.20671
作者: Haricharan Balasundaram,Karthick Krishna Mahendran,Rahul Vaze
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The problem of constrained online convex optimization is considered, where at each round, once a learner commits to an action x_t \in \mathcal{X} \subset \mathbb{R}^d , a convex loss function f_t and a convex constraint function g_t that defines the constraint g_t(x)\le 0 are revealed. The objective is to simultaneously minimize the static regret and cumulative constraint violation (CCV) compared to a benchmark that knows the loss functions f_t and constraint functions g_t for all t ahead of time, and chooses a static optimal action that is feasible with respect to all constraints g_t(x)\le 0 . In recent prior work Sinha and Vaze [2024], algorithms with simultaneous regret of O(\sqrt{T}) and CCV of O(\sqrt{T}) (or CCV of O(1) in specific cases, e.g., when d=1 ; Vaze and Sinha [2025]) have been proposed. It is widely believed that CCV is \Omega(\sqrt{T}) for any algorithm that ensures O(\sqrt{T}) regret under worst-case input for any d\ge 2 . In this paper, we refute this belief and show that the algorithm of Vaze and Sinha [2025] simultaneously achieves regret of O(\sqrt{T}) and CCV of O(T^{1/3}) when d=2 .
[LG-87] Exponential Family Discriminant Analysis: Generalizing LDA-Style Generative Classification to Non-Gaussian Models
链接: https://arxiv.org/abs/2603.20655
作者: Anish Lakkapragada
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint, 15 pages, 5 figures
Abstract:We introduce Exponential Family Discriminant Analysis (EFDA), a unified generative framework that extends classical Linear Discriminant Analysis (LDA) beyond the Gaussian setting to any member of the exponential family. Under the assumption that each class-conditional density belongs to a common exponential family, EFDA derives closed-form maximum-likelihood estimators for all natural parameters and yields a decision rule that is linear in the sufficient statistic, recovering LDA as a special case and capturing nonlinear decision boundaries in the original feature space. We prove that EFDA is asymptotically calibrated and statistically efficient under correct specification, and we generalise it to K \geq 2 classes and multivariate data. Through extensive simulation across five exponential-family distributions (Weibull, Gamma, Exponential, Poisson, Negative Binomial), EFDA matches the classification accuracy of LDA, QDA, and logistic regression while reducing Expected Calibration Error (ECE) by 2-6\times , a gap that is \emph{structural}: it persists for all n and across all class-imbalance levels, because misspecified models remain asymptotically miscalibrated. We further prove and empirically confirm that EFDA’s log-odds estimator approaches the Cramér-Rao bound under correct specification, and is the only estimator in our comparison whose mean squared error converges to zero. Complete derivations are provided for nine distributions. Finally, we formally verify all four theoretical propositions in Lean 4, using Aristotle (Harmonic) and OpenGauss (Math, Inc.) as proof generators, with all outputs independently machine-checked by AXLE (Axiom).
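A minimal sketch of the EFDA idea in the simplest univariate Poisson case (an illustrative assumption; the paper covers nine distributions and the multivariate, K-class setting): the maximum-likelihood rate is the class mean, and the resulting class score is linear in the sufficient statistic x:

```python
import numpy as np

def fit_poisson_da(X, y):
    # Closed-form MLE: per-class Poisson rate = class mean; priors = class freq.
    classes = np.unique(y)
    rates = np.array([X[y == c].mean() for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    return classes, rates, priors

def predict(x, classes, rates, priors):
    # Poisson log-likelihood up to the shared log(x!) term:
    # score(c) = x*log(rate_c) - rate_c + log(prior_c),
    # which is linear in the sufficient statistic x.
    scores = x[:, None] * np.log(rates) - rates + np.log(priors)
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(0)
X = np.concatenate([rng.poisson(2.0, 500), rng.poisson(8.0, 500)]).astype(float)
y = np.array([0] * 500 + [1] * 500)
classes, rates, priors = fit_poisson_da(X, y)
pred = predict(X, classes, rates, priors)
acc = (pred == y).mean()
```

Note that the induced decision boundary is linear in x here, but in the original feature space of other families (e.g., Gamma) the sufficient statistic is nonlinear, which is how EFDA captures nonlinear boundaries.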
[LG-88] Diffusion Model for Manifold Data: Score Decomposition Curvature and Statistical Complexity
链接: https://arxiv.org/abs/2603.20645
作者: Zixuan Zhang,Kaixuan Huang,Tuo Zhao,Mengdi Wang,Minshuo Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models have become a leading framework in generative modeling, yet their theoretical understanding – especially for high-dimensional data concentrated on low-dimensional structures – remains incomplete. This paper investigates how diffusion models learn such structured data, focusing on two key aspects: statistical complexity and influence of data geometric properties. By modeling data as samples from a smooth Riemannian manifold, our analysis reveals crucial decompositions of score functions in diffusion models under different levels of injected noise. We also highlight the interplay of manifold curvature with the structures in the score function. These analyses enable an efficient neural network approximation to the score function, built upon which we further provide statistical rates for score estimation and distribution learning. Remarkably, the obtained statistical rates are governed by the intrinsic dimension of data and the manifold curvature. These results advance the statistical foundations of diffusion models, bridging theory and practice for generative modeling on manifolds.
[LG-89] Optimal low-rank stochastic gradient estimation for LLM training
链接: https://arxiv.org/abs/2603.20632
作者: Zehao Li,Tao Ren,Zishi Zhang,Xi Chen,Yijie Peng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language model (LLM) training is often bottlenecked by memory constraints and stochastic gradient noise in extremely high-dimensional parameter spaces. Motivated by empirical evidence that many LLM gradient matrices are effectively low-rank during training, we present an unbiased, memory-efficient, low-rank matrix estimator with the lowest variance that is applicable across common stochastic gradient estimation paradigms. The core idea is to project a high-dimensional stochastic gradient estimator onto a random low-dimensional subspace and lift it back, reducing memory while keeping the estimator unbiased and controlling mean-squared error via an optimally designed projection distribution, including Haar–Stiefel projections. The projection distribution is derived by solving a constrained functional optimization problem, yielding an optimal random projector that guides algorithm design. Empirically, the resulting low-rank gradient estimators deliver both practical memory savings and improved training behavior. In RoBERTa-large fine-tuning, our method attains the lowest peak GPU memory among compared methods (e.g., 3.83GB versus 16.7GB for full BP) while remaining competitive in accuracy; in autoregressive LLM pretraining (LLaMA-20M/60M/100M), our method outperforms the traditional methods, supporting the benefit of the proposed optimal projection strategy.
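The project-and-lift mechanism can be sketched as follows, assuming a Haar-style projection obtained by QR of a Gaussian matrix (the paper optimizes the projection distribution; this sketch only demonstrates unbiasedness): since E[P P^T] = (r/n) I for a random r-dimensional orthonormal frame P, rescaling by n/r makes the projected gradient unbiased in expectation.

```python
import numpy as np

def low_rank_unbiased(G, r, rng):
    # Draw a random orthonormal basis Q (n x r) via QR of a Gaussian matrix,
    # project the gradient onto span(Q), and lift back with rescaling:
    # E[Q Q^T] = (r/n) I, hence E[(n/r) G Q Q^T] = G (unbiased).
    n = G.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return (n / r) * G @ Q @ Q.T

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))  # stand-in for a gradient matrix
# Averaging many independent low-rank estimates recovers G.
est = np.mean([low_rank_unbiased(G, 4, rng) for _ in range(20000)], axis=0)
```

Only the r-dimensional projected state needs to be stored per step, which is the source of the memory savings; the variance-vs-rank trade-off is what the paper's optimal projection distribution addresses.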
[LG-90] Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
链接: https://arxiv.org/abs/2603.20616
作者: Ruijie Miao,Zhiming Wang,Wang Li,Shiwei Wu,Shufan Liu,Yanbing Jiang,Tong Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
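A toy proportional allocation illustrates the mixed-dimension idea (a hypothetical sketch; the paper's actual budget-allocation rule is not reproduced here). Token eviction is the special case where each token gets either 0 or full_dim dimensions:

```python
import numpy as np

def allocate_dims(importance, total_budget, full_dim):
    # Proportional mixed-dimension allocation: each token keeps a number of
    # KV dimensions proportional to its importance score, capped at full_dim,
    # so the summed dimensions respect the memory budget.
    w = importance / importance.sum()
    return np.minimum(np.floor(w * total_budget).astype(int), full_dim)

scores = np.array([1.0, 2.0, 5.0, 12.0])  # illustrative importance scores
dims = allocate_dims(scores, total_budget=128, full_dim=64)
```

Under this scheme a moderately important token keeps a truncated value vector instead of being dropped outright, which is the granularity gain over 0-or-full eviction.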
[LG-91] Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models
链接: https://arxiv.org/abs/2603.20607
作者: Zhilong Zhang,Haoxiang Ren,Yihao Sun,Yifei Sheng,Haonan Wang,Haoxin Lin,Zhichao Wu,Pierre-Luc Bacon,Yang Yu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Vision-Language-Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real-world interaction. Training VLA models in interactive world models avoids these issues but introduces several challenges, including pixel-level world modeling, multi-view consistency, and compounding errors under sparse rewards. Building on recent advances across large multimodal models and model-based RL, we propose VLA-MBPO, a practical framework to tackle these problems in VLA finetuning. Our approach has three key design choices: (i) adapting unified multimodal models (UMMs) for data-efficient world modeling; (ii) an interleaved view decoding mechanism to enforce multi-view consistency; and (iii) chunk-level branched rollout to mitigate error compounding. Theoretical analysis and experiments across simulation and real-world tasks demonstrate that VLA-MBPO significantly improves policy performance and sample efficiency, underscoring its robustness and scalability for real-world robotic deployment.
[LG-92] Bayesian Learning in Episodic Zero-Sum Games
链接: https://arxiv.org/abs/2603.20604
作者: Chang-Wei Yueh,Andy Zhao,Ashutosh Nayyar,Rahul Jain
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior sampling algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) both players use the posterior sampling algorithm, and (ii) only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against the expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order O(HS\sqrt{ABHK}\log(SABHK)) , where K is the number of episodes, H is the episode length, S is the number of states, and A,B are the action space sizes of the two players. Experiments in a grid-world predator–prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.
[LG-93] Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems
链接: https://arxiv.org/abs/2603.20589
作者: Alankrita Bhatt,Mukur Gupta,Germain Kolossov,Andrea Montanari
类目: Machine Learning (cs.LG)
*备注: 39 pages; 15 figures
Abstract:Generating data from discrete distributions is important for a number of application domains including text, tabular data, and genomic data. Several groups have recently used random k-satisfiability (k-SAT) as a synthetic benchmark for new generative techniques. In this paper, we show that fundamental insights from the theory of random constraint satisfaction problems have observable implications (sometimes contradicting intuition) for the behavior of generative techniques on such benchmarks. More precisely, we study the problem of generating a uniformly random solution of a given (random) k-SAT or k-XORSAT formula. Among other findings, we observe that: (i) continuous diffusions outperform masked discrete diffusions; (ii) learned diffusions can match the theoretical "ideal" accuracy; (iii) smart ordering of the variables can significantly improve accuracy, although the best orderings do not follow popular heuristics.
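For concreteness, a uniformly random k-SAT instance and an assignment checker can be generated with the standard construction (independent of the paper's code):

```python
import random

def random_ksat(n_vars, n_clauses, k, rng):
    # Uniform random k-SAT: each clause samples k distinct variables and
    # negates each with probability 1/2. Literal v > 0 means "variable v is
    # True"; -v means "variable v is False".
    return [tuple(v if rng.random() < 0.5 else -v
                  for v in rng.sample(range(1, n_vars + 1), k))
            for _ in range(n_clauses)]

def satisfies(formula, assignment):
    # assignment maps variable index -> bool; a clause holds if any literal does.
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in formula)

rng = random.Random(0)
formula = random_ksat(10, 20, 3, rng)
```

The generation task the paper studies is then: given such a formula, sample uniformly from the set of assignments for which `satisfies` returns True.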
[LG-94] Neural collapse in the orthoplex regime
链接: https://arxiv.org/abs/2603.20587
作者: James Alcala,Rayna Andreeva,Vladimir A. Kobzar,Dustin G. Mixon,Sanghoon Na,Shashank Sule,Yangxinyu Xie
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Metric Geometry (math.MG)
*备注:
Abstract:When training a neural network for classification, the feature vectors of the training set are known to collapse to the vertices of a regular simplex, provided the dimension d of the feature space and the number n of classes satisfy n\leq d+1 . This phenomenon is known as neural collapse. For other applications like language models, one instead takes n\gg d . Here, the neural collapse phenomenon still occurs, but with different emergent geometric figures. We characterize these geometric figures in the orthoplex regime where d+2\leq n\leq 2d . The techniques in our analysis primarily involve Radon’s theorem and convexity.
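In the classical regime n \le d+1 the collapsed geometry is the simplex equiangular tight frame, whose unit-norm vertices have pairwise inner product -1/(n-1); a standard construction (independent of the paper) checks this directly:

```python
import numpy as np

def simplex_etf(n):
    # Columns of sqrt(n/(n-1)) * (I - (1/n) 11^T) are the vertices of a
    # regular simplex: unit norm, pairwise inner product -1/(n-1).
    return np.sqrt(n / (n - 1)) * (np.eye(n) - np.ones((n, n)) / n)

V = simplex_etf(5)
G = V.T @ V  # Gram matrix: 1 on the diagonal, -1/4 off the diagonal
```

The paper's contribution concerns the regime d+2 \le n \le 2d, where a different figure (the orthoplex) emerges instead of this simplex.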
[LG-95] RECLAIM: Cyclic Causal Discovery Amid Measurement Noise
链接: https://arxiv.org/abs/2603.20585
作者: Muralikrishnna G. Sethuraman,Faramarz Fekri
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Uncovering causal relationships is a fundamental problem across science and engineering. However, most existing causal discovery methods assume acyclicity and direct access to the system variables – assumptions that fail to hold in many real-world settings. For instance, in genomics, cyclic regulatory networks are common, and measurements are often corrupted by instrumental noise. To address these challenges, we propose RECLAIM, a causal discovery framework that natively handles both cycles and measurement noise. RECLAIM learns the causal graph structure by maximizing the likelihood of the observed measurements via expectation-maximization (EM), using residual normalizing flows for tractable likelihood computation. We consider two measurement models: (i) Gaussian additive noise, and (ii) a linear measurement system with additive Gaussian noise. We provide theoretical consistency guarantees for both the settings. Experiments on synthetic data and real-world protein signaling datasets demonstrate the efficacy of the proposed method.
[LG-96] LJ-Bench: Ontology-Based Benchmark for U.S. Crime
链接: https://arxiv.org/abs/2603.20572
作者: Hung Yun Tseng,Wuzhen Li,Blerina Gkotse,Grigorios Chrysos
类目: Machine Learning (cs.LG)
*备注: Accepted at Transactions on Machine Learning Research in March, 2026
Abstract:The potential of Large Language Models (LLMs) to provide harmful information remains a significant concern due to the vast breadth of illegal queries they may encounter. Unfortunately, existing benchmarks only focus on a handful of types of illegal activities and are not grounded in legal works. In this work, we introduce an ontology of crime-related concepts grounded in the legal framework of the Model Penal Code, which serves as an influential reference for criminal law and has been adopted by many U.S. states, and instantiated using California law. This structured knowledge forms the foundation for LJ-Bench, the first comprehensive benchmark designed to evaluate LLM robustness against a wide range of illegal activities. Spanning 76 distinct crime types organized taxonomically, LJ-Bench enables systematic assessment of diverse attacks, revealing valuable insights into LLM vulnerabilities across various crime categories: LLMs exhibit heightened susceptibility to attacks targeting societal harm rather than those directly impacting individuals. Our benchmark aims to facilitate the development of more robust and trustworthy LLMs. The LJ-Bench benchmark and LJ-Ontology, along with the experiment implementations for reproducibility, are publicly available at this https URL.
[LG-97] Understanding Behavior Cloning with Action Quantization
链接: https://arxiv.org/abs/2603.20538
作者: Haoqun Cao,Tengyang Xie
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Behavior cloning is a fundamental paradigm in machine learning, enabling policy learning from expert demonstrations across robotics, autonomous driving, and generative models. Autoregressive models like transformer have proven remarkably effective, from large language models (LLMs) to vision-language-action systems (VLAs). However, applying autoregressive models to continuous control requires discretizing actions through quantization, a practice widely adopted yet poorly understood theoretically. This paper provides theoretical foundations for this practice. We analyze how quantization error propagates along the horizon and interacts with statistical sample complexity. We show that behavior cloning with quantized actions and log-loss achieves optimal sample complexity, matching existing lower bounds, and incurs only polynomial horizon dependence on quantization error, provided the dynamics are stable and the policy satisfies a probabilistic smoothness condition. We further characterize when different quantization schemes satisfy or violate these requirements, and propose a model-based augmentation that provably improves the error bound without requiring policy smoothness. Finally, we establish fundamental limits that jointly capture the effects of quantization error and statistical complexity.
[LG-98] Towards Practical Multimodal Hospital Outbreak Detection
链接: https://arxiv.org/abs/2603.20536
作者: Chang Liu,Jieshi Chen,Alexander J. Sundermann,Kathleen Shutt,Marissa P. Griffith,Lora Lee Pless,Lee H. Harrison,Artur W. Dubrawski
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 3 tables
Abstract:Rapid identification of outbreaks in hospitals is essential for controlling pathogens with epidemic potential. Although whole genome sequencing (WGS) remains the gold standard in outbreak investigations, its substantial costs and turnaround times limit its feasibility for routine surveillance, especially in less-equipped facilities. We explore three modalities as rapid alternatives: matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry, antimicrobial resistance (AR) patterns, and electronic health records (EHR). We present a machine learning approach that learns discriminative features from these modalities to support outbreak detection. Multi-species evaluation shows that the integration of these modalities can boost outbreak detection performance. We also propose a tiered surveillance paradigm that can reduce the need for WGS through these alternative modalities. Further analysis of EHR information identifies potentially high-risk contamination routes linked to specific clinical procedures, notably those involving invasive equipment and high-frequency workflows, providing infection prevention teams with actionable targets for proactive risk mitigation.
[LG-99] RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
链接: https://arxiv.org/abs/2603.20527
作者: Shenyang Deng,Zhuoli Ouyang,Tianyu Pang,Zihang Liu,Ruochen Jin,Shuhua Yu,Yaoqing Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with the computational efficiency of implementing the preconditioner. Among recent advances, \textsc{Muon} stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of \textsc{Muon} still leaves room for further improvement. In this paper, we introduce \textsc{RMNP} (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise \ell_2 normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration computational complexity from \mathcal{O}(mn\cdot\min(m,n)) to \mathcal{O}(mn) for an m\times n weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for \textsc{RMNP} in the non-convex setting that match recent results for \textsc{Muon} optimizers, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that \textsc{RMNP} delivers competitive optimization performance compared with \textsc{Muon} while substantially reducing preconditioning wall-clock time. Our code is available at this https URL.
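The substitution at the heart of RMNP can be sketched in a few lines (an assumed simplification: momentum accumulation and step-size handling are omitted), replacing Newton-Schulz orthogonalization with row-wise l2 normalization at O(mn) cost:

```python
import numpy as np

def rmnp_update(momentum, eps=1e-8):
    # Row-wise l2 normalization: each row of the momentum matrix is scaled
    # to unit norm, an O(mn) stand-in for Newton-Schulz orthogonalization.
    norms = np.linalg.norm(momentum, axis=1, keepdims=True)
    return momentum / (norms + eps)

M = np.random.default_rng(0).standard_normal((4, 6))  # stand-in momentum
U = rmnp_update(M)
```

Each row keeps its direction but is rescaled to unit norm, which equalizes per-row update magnitudes without any matrix-matrix iteration.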
[LG-100] Distributed Gradient Clustering: Convergence and the Effect of Initialization
链接: https://arxiv.org/abs/2603.20507
作者: Aleksandar Armacki,Himkant Sharma,Dragana Bajović,Dušan Jakovetić,Mrityunjoy Chakraborty,Soummya Kar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages, 3 figures
Abstract:We study the effects of center initialization on the performance of a family of distributed gradient-based clustering algorithms introduced in [1], which operate over connected networks of users. In the considered scenario, each user holds a local dataset and communicates only with its immediate neighbours, with the aim of finding a global clustering of the joint data. We perform extensive numerical experiments evaluating the effects of center initialization on the performance of our family of methods, demonstrating that our methods are more resilient to the effects of initialization than centralized gradient clustering [2]. Next, inspired by the K-means++ initialization [3], we propose a novel distributed center initialization scheme, which is shown to improve the performance of our methods compared to the baseline random initialization.
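The centralized K-means++ seeding [3] that inspires the proposed distributed scheme is D^2 sampling; a standard sketch (the distributed variant itself is not reproduced here):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    # Classic K-means++ seeding: first center uniform at random, then each
    # new center is drawn with probability proportional to its squared
    # distance to the nearest center chosen so far (D^2 sampling).
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.stack(centers)

rng = np.random.default_rng(0)
# Two well-separated blobs; seeding tends to pick one center from each.
X = np.concatenate([rng.normal(0, 0.1, (50, 2)), rng.normal(10, 0.1, (50, 2))])
C = kmeanspp_init(X, 2, rng)
```

D^2 sampling spreads the seeds across clusters, which is the property a distributed variant must reproduce with only neighbour-to-neighbour communication.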
[LG-101] Spatio-Temporal Grid Intelligence: A Hybrid Graph Neural Network and LSTM Framework for Robust Electricity Theft Detection
链接: https://arxiv.org/abs/2603.20488
作者: Adewale U. Oguntola,Olowookere A. AbdulQoyum,Adebukola M. Madehin,Adekemi A. Adetoro
类目: Machine Learning (cs.LG)
*备注: 16 pages, 9 figures
Abstract:Electricity theft, or non-technical loss (NTL), presents a persistent threat to global power systems, driving significant financial deficits and compromising grid stability. Conventional detection methodologies, predominantly reactive and meter-centric, often fail to capture the complex spatio-temporal dynamics and behavioral patterns associated with fraudulent consumption. This study introduces a novel AI-driven Grid Intelligence Framework that fuses Time-Series Anomaly Detection, Supervised Machine Learning, and Graph Neural Networks (GNN) to identify theft with high precision in imbalanced datasets. Leveraging an enriched feature set, including rolling averages, voltage drop estimates, and a critical Grid Imbalance Index, the methodology employs a Long Short-Term Memory (LSTM) autoencoder for temporal anomaly scoring, a Random Forest classifier for tabular feature discrimination, and a GNN to model spatial dependencies across the distribution network. Experimental validation demonstrates that while standalone anomaly detection yields a low theft F1-score of 0.20, the proposed hybrid fusion achieves an overall accuracy of 93.7%. By calibrating decision thresholds via precision-recall analysis, the system attains a balanced theft precision of 0.55 and recall of 0.50, effectively mitigating the false positives inherent in single-model approaches. These results confirm that integrating topological grid awareness with temporal and supervised analytics provides a scalable, risk-based solution for proactive electricity theft detection and enhanced smart grid reliability.
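The fusion-and-calibration step can be illustrated with a hedged sketch (the weights and threshold grid are illustrative assumptions, not the paper's values): combine the three detector scores convexly, then choose the decision threshold that maximizes F1 on validation data:

```python
import numpy as np

def fuse_scores(anomaly, rf_prob, gnn_prob, weights=(0.2, 0.4, 0.4)):
    # Late fusion: convex combination of the LSTM-autoencoder anomaly score,
    # Random Forest probability, and GNN probability (all assumed in [0, 1]).
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w[0] * anomaly + w[1] * rf_prob + w[2] * gnn_prob

def best_threshold(scores, labels, grid=None):
    # Precision-recall style calibration: scan thresholds, keep the F1-maximizer.
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        fn = int(np.sum(~pred & (labels == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

labels = np.array([0, 0, 1, 1])  # toy validation labels
fused = fuse_scores(np.array([0.1, 0.2, 0.7, 0.9]),
                    np.array([0.1, 0.3, 0.8, 0.9]),
                    np.array([0.2, 0.1, 0.9, 0.8]))
t, f1 = best_threshold(fused, labels)
```

Calibrating the threshold this way is what trades false positives against recall, the balance the abstract reports as precision 0.55 / recall 0.50 on real data.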
[LG-102] From Data to Laws: Neural Discovery of Conservation Laws Without False Positives
链接: https://arxiv.org/abs/2603.20474
作者: Rahul D Ray
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:Conservation laws are fundamental to understanding dynamical systems, but discovering them from data remains challenging due to parameter variation, non-polynomial invariants, local minima, and false positives on chaotic systems. We introduce NGCG, a neural-symbolic pipeline that decouples dynamics learning from invariant discovery and systematically addresses these challenges. A multi-restart variance minimiser learns a near-constant latent representation; system-specific symbolic extraction (polynomial Lasso, log-basis Lasso, explicit PDE candidates, and PySR) yields closed-form expressions; a strict constancy gate and diversity filter eliminate spurious laws. On a benchmark of nine diverse systems including Hamiltonian and dissipative ODEs, chaos, and PDEs, NGCG achieves consistent discovery (DR=1.0, FDR=0.0, F1=1.0) on all four systems with true conservation laws, with constancy two to three orders of magnitude lower than the best baseline. It is the only method that succeeds on the Lotka–Volterra system, and it correctly outputs no law on all five systems without invariants. Extensive experiments demonstrate robustness to noise ( \sigma = 0.1 ), sample efficiency (50–100 trajectories), insensitivity to hyperparameters, and runtime under one minute per system. A Pareto analysis shows that the method provides a range of candidate expressions, allowing users to trade complexity for constancy. NGCG achieves strong performance relative to prior methods for data-driven conservation-law discovery, combining high accuracy with interpretability.
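The strict constancy gate can be sketched as a normalized-variance filter over candidate invariants evaluated along a trajectory (an illustrative simplification of the paper's pipeline):

```python
import numpy as np

def constancy_gate(candidates, trajectory, tol=1e-6):
    # Keep only candidates whose normalized variance along the trajectory
    # is below tol (a "strict constancy gate" against false positives).
    kept = []
    for name, f in candidates:
        vals = np.array([f(s) for s in trajectory])
        if np.var(vals) / (np.mean(vals) ** 2 + 1e-12) < tol:
            kept.append(name)
    return kept

# Harmonic oscillator on its exact trajectory x(t) = cos t, v(t) = -sin t;
# the energy x^2 + v^2 is conserved, the linear combination x + v is not.
ts = np.linspace(0.0, 10.0, 200)
traj = np.stack([np.cos(ts), -np.sin(ts)], axis=1)
candidates = [
    ("energy", lambda s: s[0] ** 2 + s[1] ** 2),
    ("sum", lambda s: s[0] + s[1]),
]
kept = constancy_gate(candidates, traj)
```

On systems with no true invariant, every candidate fails the gate and the method correctly outputs no law, matching the abstract's FDR=0.0 behavior.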
[LG-103] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
链接: https://arxiv.org/abs/2603.20453
作者: Ming Shi,Yingbin Liang,Ness B. Shroff,Ananthram Swami
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most \omega over K episodes. We propose a unified algorithm with regret \tilde{O}(\sqrt{K/M}+\omega) , which exhibits a best-of-both-regimes behavior: it achieves M -dependent statistical gains when imperfection is small (where M is the number of sources), while remaining robust, with unavoidable additive dependence on \omega , when imperfection is large. We complement this with a lower bound \tilde{\Omega}(\max\{\sqrt{K/M},\omega\}) , which captures the best possible improvement with respect to M and the unavoidable dependence on \omega , and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as \tilde{\Omega}(\min\{\omega\sqrt{K},K\}) . Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
[LG-104] SDE-Driven Spatio-Temporal Hypergraph Neural Networks for Irregular Longitudinal fMRI Connectome Modeling in Alzheimer's Disease
链接: https://arxiv.org/abs/2603.20452
作者: Ruiying Chen,Yutong Wang,Houliang Zhou,Wei Liang,Yong Chen,Lifang He
类目: Machine Learning (cs.LG)
*备注: Submitted to AMIA Annual Symposium, 10 pages, 4 figures
Abstract:Longitudinal neuroimaging is essential for modeling disease progression in Alzheimer’s disease (AD), yet irregular sampling and missing visits pose substantial challenges for learning reliable temporal representations. To address this challenge, we propose SDE-HGNN, a stochastic differential equation (SDE)-driven spatio-temporal hypergraph neural network for irregular longitudinal fMRI connectome modeling. The framework first employs an SDE-based reconstruction module to recover continuous latent trajectories from irregular observations. Based on these reconstructed representations, dynamic hypergraphs are constructed to capture higher-order interactions among brain regions over time. To further model temporal evolution, hypergraph convolution parameters evolve through SDE-controlled recurrent dynamics conditioned on inter-scan intervals, enabling disease-stage-adaptive connectivity modeling. We also incorporate a sparsity-based importance learning mechanism to identify salient brain regions and discriminative connectivity patterns. Extensive experiments on the OASIS-3 and ADNI cohorts demonstrate consistent improvements over state-of-the-art graph and hypergraph baselines in AD progression prediction. The source code is available at this https URL.
[LG-105] Verifiable Error Bounds for Physics-Informed Neural KKL Observers
链接: https://arxiv.org/abs/2603.20434
作者: Hannah Berin-Costain,Harry Wang,Kirsten Morris,Jun Liu
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures
Abstract:This paper proposes a computable state-estimation error bound for learning-based Kazantzis–Kravaris/Luenberger (KKL) observers. Recent work learns the KKL transformation map with a physics-informed neural network (PINN) and a corresponding left-inverse map with a conventional neural network. However, no computable state-estimation error bounds are currently available for this approach. We derive a state-estimation error bound that depends only on quantities that can be certified over a prescribed region using neural network verification. We further extend the result to bounded additive measurement noise and demonstrate the guarantees on nonlinear benchmark systems.
[LG-106] Hawkeye: Reproducing GPU-Level Non-Determinism
链接: https://arxiv.org/abs/2603.20421
作者: Erez Badash,Dan Boneh,Ilan Komargodski,Megha Srivastava
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Accepted to MLSys 2026
Abstract:We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using our framework, anyone can re-execute on a CPU the exact matrix multiplication operations underlying a machine learning model training or inference workflow that was executed on an NVIDIA GPU, without any precision loss. This is in stark contrast to prior approaches to verifiable machine learning, which either introduce significant computation overhead to the original model owner, or suffer from non-robustness and quality degradation. The main technical contribution of Hawkeye is a systematic sequence of carefully crafted tests that study rounding direction, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication on NVIDIA’s Tensor Cores. We test and evaluate our framework on multiple NVIDIA GPU architectures (Ampere, Hopper, and Lovelace) and precision types (FP16, BFP16, FP8). In all test cases, Hawkeye enables perfect reproduction of matrix multiplication on a CPU, paving the way for efficient and trustworthy third-party auditing of ML model training and inference.
[LG-107] Data-driven discovery of roughness descriptors for surface characterization and intimate contact modeling of unidirectional composite tapes
链接: https://arxiv.org/abs/2603.20418
作者: Sebastian Rodriguez,Mikhael Tannous,Jad Mounayer,Camilo Cruz,Anais Barasinski,Francisco Chinesta
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:The surface roughness of unidirectional tapes determines the evolution of the degree of intimate contact required for ensuring thermoplastic molecular diffusion and the associated inter-tape consolidation during manufacturing of composite structures. However, the usual characterization of rough surfaces relies on statistical descriptors that, even if able to represent the surface topology, are not necessarily connected with the physics occurring at the interface during inter-tape consolidation. Thus, a key research question can be formulated as follows: which roughness descriptors simultaneously enable tape classification, crucial for process control, and consolidation modeling via the inference of the evolution of the degree of intimate contact, itself governed by the process parameters? To provide a valuable response, we propose a novel strategy based on the use of Rank Reduction Autoencoders (RRAEs), autoencoders with a linear latent vector space enforced by applying a truncated Singular Value Decomposition (SVD) to the latent matrix during encoder-decoder training. In this work, we extract useful roughness descriptors by enforcing the latent SVD modes to (i) accurately represent the roughness after decoding, and (ii) allow the extraction of existing a priori knowledge such as classification or modelling properties.
[LG-108] SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
链接: https://arxiv.org/abs/2603.20410
作者: Mahmoud Elhadidy,Roshan M. D’Souza,Amirhossein Arzani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scientific machine learning is increasingly used to build surrogate models, yet most models are trained under a restrictive assumption in which future data follow the same distribution as the training set. In practice, new experimental conditions or simulation regimes may differ significantly, requiring extrapolation and model updates without re-access to prior data. This creates a need for continual learning (CL) frameworks that can adapt to distribution shifts while preventing catastrophic forgetting. Such challenges are pronounced in fluid dynamics, where changes in geometry, boundary conditions, or flow regimes induce non-trivial changes to the solution. Here, we introduce a new architecture-based approach (SLE-FNO) combining a Single-Layer Extension (SLE) with the Fourier Neural Operator (FNO) to support efficient CL. SLE-FNO was compared with a range of established CL methods, including Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), replay-based approaches, Orthogonal Gradient Descent (OGD), Gradient Episodic Memory (GEM), PiggyBack, and Low-Rank Approximation (LoRA), within an image-to-image regression setting. The models were trained to map transient concentration fields to time-averaged wall shear stress (TAWSS) in pulsatile aneurysmal blood flow. Tasks were derived from 230 computational fluid dynamics simulations grouped into four sequential and out-of-distribution configurations. Results show that replay-based methods and architecture-based approaches (PiggyBack, LoRA, and SLE-FNO) achieve the best retention, with SLE-FNO providing the strongest overall balance between plasticity and stability, achieving accuracy with zero forgetting and minimal additional parameters. Our findings highlight key differences between CL algorithms and introduce SLE-FNO as a promising strategy for adapting baseline models when extrapolation is required.
[LG-109] The Multiverse of Time Series Machine Learning: an Archive for Multivariate Time Series Classification
链接: https://arxiv.org/abs/2603.20352
作者: Matthew Middlehurst,Aiden Rushbrooke,Ali Ismail-Fawaz,Maxime Devanne,Germain Forestier,Angus Dempster,Geoffrey I. Webb,Christopher Holder,Anthony Bagnall
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series machine learning (TSML) is a growing research field that spans a wide range of tasks. The popularity of established tasks such as classification, clustering, and extrinsic regression has, in part, been driven by the availability of benchmark datasets. An archive of 30 multivariate time series classification datasets, introduced in 2018 and commonly known as the UEA archive, has since become an essential resource cited in hundreds of publications. We present a substantial expansion of this archive that more than quadruples its size, from 30 to 133 classification problems. We also release preprocessed versions of datasets containing missing values or unequal length series, bringing the total number of datasets to 147. Reflecting the growth of the archive and the broader community, we rebrand it as the Multiverse archive to capture its diversity of domains. The Multiverse archive includes datasets from multiple sources, consolidating other collections and standalone datasets into a single, unified repository. Recognising that running experiments across the full archive is computationally demanding, we recommend a subset of the full archive called Multiverse-core (MV-core) for initial exploration. To support researchers in using the new archive, we provide detailed guidance and a baseline evaluation of established and recent classification algorithms, establishing performance benchmarks for future research. We have created a dedicated repository for the Multiverse archive that provides a common aeon and scikit-learn compatible framework for reproducibility, an extensive record of published results, and an interactive interface to explore the results. 
[LG-110] Interpretable Multiple Myeloma Prognosis with Observational Medical Outcomes Partnership Data
链接: https://arxiv.org/abs/2603.20341
作者: Salma Rachidi,Aso Bozorgpanah,Eric Fey,Alexander Jung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) promises better clinical decision-making, yet opaque model behavior limits adoption in healthcare. We propose two novel regularization techniques for ensuring the interpretability of ML models trained on real-world data. In particular, we consider the prediction of five-year survival for multiple myeloma patients using clinical data from Helsinki University Hospital. To ensure the interpretability of the trained models, we use two alternative constructions for a penalty term used for regularization. The first one penalizes deviations from the predictions obtained from an interpretable logistic regression method with two manually chosen features. The second construction requires consistency of model predictions with the revised international staging system (R-ISS). We verify the usefulness of the proposed regularization techniques in numerical experiments using data from 812 patients. The regularized models achieve accuracy of up to 0.721 on a test set, and SHAP values show that the models rely on the selected important features.
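The first penalty construction can be sketched as follows: a flexible logistic model is trained with an extra term that penalizes deviation of its predicted probabilities from those of an interpretable reference model. This is a minimal stdlib-only sketch under assumed conventions; the toy data, hyperparameters, and reference predictions are invented, and the paper's exact penalty form may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(X, y, ref_pred, lam=1.0, lr=0.5, epochs=500):
    """Minimize cross-entropy + lam * (p - p_ref)^2 by gradient descent,
    where p_ref comes from an interpretable reference model."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t, pr in zip(X, y, ref_pred):
            p = predict(w, x)
            # d/dz of cross-entropy is (p - t);
            # d/dz of the penalty is 2*lam*(p - pr)*p*(1 - p).
            dz = (p - t) + 2.0 * lam * (p - pr) * p * (1.0 - p)
            for j in range(len(w)):
                grad[j] += dz * x[j] / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

# Toy data: bias feature plus one covariate; ref_pred plays the role of
# the interpretable two-feature logistic regression's predictions.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 1.0]
ref_pred = [0.1, 0.3, 0.7, 0.9]
w = train(X, y, ref_pred)
```

With `lam=0` this reduces to plain logistic regression; increasing `lam` pulls the flexible model toward the interpretable one's outputs.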
[LG-111] Graph-Aware Text-Only Backdoor Poisoning for Text-Attributed Graphs
链接: https://arxiv.org/abs/2603.20339
作者: Qi Luo,Minghui Xu,Dongxiao Yu,Xiuzhen Cheng
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 9 pages
Abstract:Many learning systems now use graph data in which each node also contains text, such as papers with abstracts or users with posts. Because these texts often come from open platforms, an attacker may be able to quietly poison a small part of the training data and later make the model produce wrong predictions on demand. This paper studies that risk in a realistic setting where the attacker edits only node text and does not change the graph structure. We propose TAGBD, a text-only backdoor attack for text-attributed graphs. TAGBD first finds training nodes that are easier to influence, then generates natural-looking trigger text with the help of a shadow graph model, and finally injects the trigger by either replacing the original text or appending a short phrase. Experiments on three benchmark datasets show that the attack is highly effective, transfers across different graph models, and remains strong under common defenses. These results demonstrate that text alone is a practical attack channel in graph learning systems and suggest that future defenses should inspect both graph links and node content.
[LG-112] Hybrid Autoencoder-Isolation Forest approach for time series anomaly detection in C70XP cyclotron operation data at ARRONAX
链接: https://arxiv.org/abs/2603.20335
作者: F. Basbous (Nantes Univ, GIP ARRONAX), F. Poirier (GIP ARRONAX, CNRS), F. Haddad (GIP ARRONAX, Nantes Univ, CNRS), D. Mateus (Nantes Univ - ECN, LS2N)
类目: Machine Learning (cs.LG)
*备注:
Abstract:The public interest group (GIP) ARRONAX’s C70XP cyclotron, used for radioisotope production for medical and research applications, relies on complex and costly systems that are prone to failures, leading to operational disruptions. In this context, this study aims to develop a machine learning-based method for early anomaly detection, from sensor measurements over a temporal window, to enhance system performance. One of the most widely recognized methods for anomaly detection is Isolation Forest (IF), known for its effectiveness and scalability. However, its reliance on axis-parallel splits limits its ability to detect subtle anomalies, especially those occurring near the mean of normal data. This study proposes a hybrid approach that combines a fully connected Autoencoder (AE) with IF to enhance the detection of subtle anomalies. In particular, the Mean Cubic Error (MCE) of the sensor data reconstructed by the AE is used as input to the IF model. Validated on proton beam intensity time series data, the proposed method demonstrates a clear improvement in detection performance, as confirmed by the experimental results.
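The reconstruction-error pipeline can be sketched in a few stdlib-only lines. Here a trivial window-mean reconstruction stands in for the autoencoder and a crude threshold stands in for Isolation Forest (neither is the paper's method), and MCE is taken as the mean of cubed absolute residuals, which may differ from the paper's exact definition.

```python
def reconstruct(window):
    """Stand-in 'autoencoder': reconstruct every point by the window mean."""
    m = sum(window) / len(window)
    return [m] * len(window)

def mce(window):
    """Mean Cubic Error between a window and its reconstruction
    (assumed here to cube the absolute residuals)."""
    recon = reconstruct(window)
    return sum(abs(x - r) ** 3 for x, r in zip(window, recon)) / len(window)

# Normal windows hover around 1.0; the anomalous one has a transient spike.
normal_windows = [[1.0, 1.1, 0.9, 1.0], [1.0, 0.9, 1.1, 1.0], [1.1, 1.0, 1.0, 0.9]]
anomalous_window = [1.0, 1.1, 5.0, 1.0]

scores = [mce(w) for w in normal_windows]
threshold = 3 * max(scores)        # crude stand-in for the IF decision rule
flagged = mce(anomalous_window) > threshold
```

Cubing the residuals amplifies large deviations relative to the squared error, which is what makes the spiked window stand out so sharply from the normal ones.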
[LG-113] Rolling-Origin Validation Reverses Model Rankings in Multi-Step PM10 Forecasting: XGBoost, SARIMA and Persistence
链接: https://arxiv.org/abs/2603.20315
作者: Federico Garcia Crespi,Eduardo Yubero Funes,Marina Alfosea Simon
类目: Machine Learning (cs.LG)
*备注: 28 pages, 4 figures. Submitted to International Journal of Forecasting
Abstract:(a) Many air quality forecasting studies report gains from machine learning, but evaluations often use static chronological splits and omit persistence baselines, so the operational added value under routine updating is unclear. (b) Using 2,350 daily PM10 observations from 2017 to 2024 at an urban background monitoring station in southern Europe, we compare XGBoost and SARIMA against persistence under a static split and a rolling-origin protocol with monthly updates. We report horizon-specific skill and the predictability horizon, defined as the maximum horizon with positive persistence-relative skill. Static evaluation suggests XGBoost performs well from one to seven days ahead, but rolling-origin evaluation reverses rankings: XGBoost is not consistently better than persistence at short and intermediate horizons, whereas SARIMA remains positively skilled across the full range. (c) For researchers, static splits can overstate operational usefulness and change rankings. For practitioners, rolling-origin, persistence-referenced skill profiles show which methods stay reliable at each lead time.
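The protocol contrast can be made concrete with a stdlib-only sketch (synthetic data, not the PM10 series): forecasts are re-issued from a rolling origin and scored by persistence-relative skill, 1 - MAE_model / MAE_persistence.

```python
def skill(model_mae, persistence_mae):
    """Persistence-relative skill: positive means the model beats persistence."""
    return 1.0 - model_mae / persistence_mae

def rolling_origin_skill(series, forecast, horizon=1, start=50, step=10):
    """Re-forecast at each origin using only past data; score against
    the persistence baseline (repeat the last observed value)."""
    model_err, pers_err = [], []
    for origin in range(start, len(series) - horizon + 1, step):
        train = series[:origin]
        truth = series[origin + horizon - 1]
        model_err.append(abs(forecast(train, horizon) - truth))
        pers_err.append(abs(train[-1] - truth))
    return skill(sum(model_err) / len(model_err),
                 sum(pers_err) / len(pers_err))

# Toy "model": forecast any horizon with the mean of the last 7 observations.
recent_mean = lambda train, h: sum(train[-7:]) / 7.0

# A mean-reverting (alternating) series: persistence is a poor baseline here,
# so the smoothing model earns positive rolling-origin skill.
series = [(-1.0) ** i for i in range(200)]
s = rolling_origin_skill(series, recent_mean)
```

Sweeping `horizon` and recording where the skill first turns negative gives exactly the paper's notion of a predictability horizon.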
[LG-114] Solomonoff induction
链接: https://arxiv.org/abs/2603.20274
作者: Tom F. Sterkenburg
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注:
Abstract:This chapter discusses the Solomonoff approach to universal prediction. The crucial ingredient in the approach is the notion of computability, and I present the main idea as an attempt to meet two plausible computability desiderata for a universal predictor. This attempt is unsuccessful, which is shown by a generalization of a diagonalization argument due to Putnam. I then critically discuss purported gains of the approach, in particular its providing a foundation for the methodological principle of Occam’s razor, and its serving as a theoretical ideal for the development of machine learning methods.
[LG-115] Viability-Preserving Passive Torque Control
链接: https://arxiv.org/abs/2510.03367
作者: Zizhe Zhang,Yicong Wang,Zhiquan Zhang,Tianyu Li,Nadia Figueroa
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 8 pages, 7 figures, Project Website: this https URL
Abstract:Conventional passivity-based torque controllers for manipulators are typically unconstrained, which can lead to safety violations under external perturbations. In this paper, we employ viability theory to pre-compute safe sets in the state-space of joint positions and velocities. These viable sets, constructed via data-driven and analytical methods for self-collision avoidance, external object collision avoidance and joint-position and joint-velocity limits, provide constraints on joint accelerations and thus joint torques via the robot dynamics. A quadratic programming-based control framework enforces these constraints on a passive controller tracking a dynamical system, ensuring the robot states remain within the safe set in an infinite time horizon. We validate the proposed approach through simulations and hardware experiments on a 7-DoF Franka Emika manipulator. In comparison to a baseline constrained passive controller, our method operates at higher control-loop rates and yields smoother trajectories.
[LG-116] Characterizing High-Capacity Janus Aminobenzene-Graphene Anode for Sodium-Ion Batteries with Machine Learning
链接: https://arxiv.org/abs/2603.22254
作者: Claudia Islas-Vargas,L. Ricardo Montoya,Carlos A. Vital-José,Oliver T. Unke,Klaus-Robert Müller,Huziel E. Sauceda
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Atomic and Molecular Clusters (physics.atm-clus); Chemical Physics (physics.chem-ph)
*备注: 8 pages, 5 figures, research article
Abstract:Sodium-ion batteries require anodes that combine high capacity, low operating voltage, fast Na-ion transport, and mechanical stability, which conventional anodes struggle to deliver. Here, we use the SpookyNet machine-learning force field (MLFF) together with all-electron density-functional theory calculations to characterize Na storage in aminobenzene-functionalized Janus graphene (Na_xAB) at room temperature. Simulations across state of charge reveal a three-stage storage mechanism (site-specific adsorption at aminobenzene groups, Na_n@AB_m structure formation, followed by interlayer gallery filling), contrasting with the multi-stage pore-, graphite-interlayer-, and defect-controlled behavior in hard carbon. This leads to an OCV profile with an extended low-voltage plateau of 0.15 V vs. Na/Na^+, an estimated gravimetric capacity of \sim 400 mAh g^{-1}, negligible volume change, and Na diffusivities of \sim 10^{-6} cm^2 s^{-1}, two to three orders of magnitude higher than in hard carbon. Our results establish Janus aminobenzene-graphene as a promising, structurally defined high-capacity Na-ion anode and illustrate the power of MLFF-based simulations for characterizing electrode materials.
[LG-117] Data Curation for Machine Learning Interatomic Potentials by Determinantal Point Processes
链接: https://arxiv.org/abs/2603.22160
作者: Joanna Zou,Youssef Marzouk
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: Original publication at this https URL
Abstract:The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.
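The selection step can be illustrated with a greedy MAP sketch for a DPP: repeatedly pick the item with the largest residual (conditional) variance under the descriptor kernel, which jointly rewards quality and diversity. This stdlib-only sketch uses a made-up RBF kernel on scalar "descriptors" in place of the paper's molecular-descriptor kernels.

```python
import math

def rbf_kernel(xs, length=1.0):
    """Made-up similarity kernel on scalar descriptors."""
    return [[math.exp(-((xi - xj) ** 2) / (2.0 * length ** 2)) for xj in xs]
            for xi in xs]

def greedy_dpp(K, k):
    """Greedy MAP for a DPP: maximize the log-determinant of the selected
    kernel submatrix. d[i] holds item i's residual variance given the
    current selection; c[i] accumulates Cholesky-style update vectors."""
    n = len(K)
    d = [K[i][i] for i in range(n)]
    c = [[] for _ in range(n)]
    selected = []
    for _ in range(k):
        j = max(range(n), key=lambda i: -1.0 if i in selected else d[i])
        selected.append(j)
        sqrt_dj = math.sqrt(d[j])
        for i in range(n):
            if i in selected:
                continue
            e = (K[i][j] - sum(a * b for a, b in zip(c[i], c[j]))) / sqrt_dj
            c[i].append(e)
            d[i] -= e * e        # items similar to j lose variance fast
    return selected

# Two tight pairs plus one outlier: the greedy DPP spreads its picks
# across the three distinct "configurations" rather than duplicating a pair.
descriptors = [0.0, 0.1, 5.0, 5.1, 10.0]
chosen = greedy_dpp(rbf_kernel(descriptors), 3)
```

After the first pick, the near-duplicate neighbor's residual variance collapses, so the remaining budget goes to the structurally distinct candidates.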
[LG-118] MAGPI: Multifidelity-Augmented Gaussian Process Inputs for Surrogate Modeling from Scarce Data
链接: https://arxiv.org/abs/2603.22050
作者: Atticus Rex,Elizabeth Qian,David Peterson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Supervised machine learning describes the practice of fitting a parameterized model to labeled input-output data. Supervised machine learning methods have demonstrated promise in learning efficient surrogate models that can (partially) replace expensive high-fidelity models, making many-query analyses, such as optimization, uncertainty quantification, and inference, tractable. However, when training data must be obtained through the evaluation of an expensive model or experiment, the amount of training data that can be obtained is often limited, which can make learned surrogate models unreliable. Fortunately, in many engineering and scientific settings, cheaper low-fidelity models may be available, for example arising from simplified physics modeling or coarse grids. These models may be used to generate additional low-fidelity training data. The goal of multifidelity machine learning is to use both high- and low-fidelity training data to learn a surrogate model which is cheaper to evaluate than the high-fidelity model, but more accurate than any available low-fidelity model. This work proposes a new multifidelity training approach for Gaussian process regression which uses low-fidelity data to define additional features that augment the input space of the learned model. The approach unites desirable properties from two separate classes of existing multifidelity GPR approaches, cokriging and autoregressive estimators. Numerical experiments on several test problems demonstrate both increased predictive accuracy and reduced computational cost relative to the state of the art.
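The core idea, augmenting the surrogate's input space with low-fidelity evaluations, can be sketched with a stdlib-only stand-in where an ordinary least-squares fit replaces the Gaussian process; the functions, sample points, and feature map below are invented for illustration.

```python
import math

f_hi = lambda x: 0.2 * x + math.sin(2 * x)   # expensive "truth"
f_lo = lambda x: math.sin(2 * x)             # cheap low-fidelity model

def lstsq(X, y):
    """Solve the normal equations X^T X w = X^T y by Gauss-Jordan elimination."""
    k = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        p = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(k):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Only five scarce high-fidelity samples, but the low-fidelity model is
# free to evaluate, so f_lo(x) becomes an extra input feature.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [f_hi(x) for x in xs]
X_aug = [[x, f_lo(x), 1.0] for x in xs]      # features: (x, f_lo(x), bias)
w = lstsq(X_aug, ys)

def surrogate(x):
    return w[0] * x + w[1] * f_lo(x) + w[2]
```

Because the low-fidelity feature already carries the oscillatory structure, the fit only needs to learn a cheap correction, which is what makes the augmented-input formulation data-efficient.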
[LG-119] A plug-and-play approach with fast uncertainty quantification for weak lensing mass mapping
链接: https://arxiv.org/abs/2603.22006
作者: Hubert Leterme,Andreas Tersenov,Jalal Fadili,Jean-Luc Starck
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Upcoming stage-IV surveys such as Euclid and Rubin will deliver vast amounts of high-precision data, opening new opportunities to constrain cosmological models with unprecedented accuracy. A key step in this process is the reconstruction of the dark matter distribution from noisy weak lensing shear measurements. Current deep learning-based mass mapping methods achieve high reconstruction accuracy, but either require retraining a model for each new observed sky region (limiting practicality) or rely on slow MCMC sampling. Efficient exploitation of future survey data therefore calls for a new method that is accurate, flexible, and fast at inference. In addition, uncertainty quantification with coverage guarantees is essential for reliable cosmological parameter estimation. We introduce PnPMass, a plug-and-play approach for weak lensing mass mapping. The algorithm produces point estimates by alternating between a gradient descent step with a carefully chosen data fidelity term, and a denoising step implemented with a single deep learning model trained on simulated data corrupted by Gaussian white noise. We also propose a fast, sampling-free uncertainty quantification scheme based on moment networks, with calibrated error bars obtained through conformal prediction to ensure coverage guarantees. Finally, we benchmark PnPMass against both model-driven and data-driven mass mapping techniques. PnPMass achieves performance close to that of state-of-the-art deep-learning methods while offering fast inference (converging in just a few iterations) and requiring only a single training phase, independently of the noise covariance of the observations. It therefore combines flexibility, efficiency, and reconstruction accuracy, while delivering tighter error bars than existing approaches, making it well suited for upcoming weak lensing surveys. 
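The alternation at the heart of plug-and-play methods can be sketched in a few lines: a gradient step on the data-fidelity term followed by a denoising step. In this stdlib-only sketch a soft-thresholding rule stands in for the learned deep denoiser, the forward operator is the identity, and the data are made up, so this is not PnPMass itself.

```python
def soft_threshold(x, t):
    """Toy 'denoiser': shrink small entries toward zero (sparsity prior)."""
    return [max(abs(v) - t, 0.0) * (1 if v > 0 else -1) for v in x]

def pnp(b, gamma=0.5, t=0.3, iters=20):
    """Alternate a gradient step on 0.5*||x - b||^2 with a denoising step."""
    x = [0.0] * len(b)
    for _ in range(iters):
        grad = [xi - bi for xi, bi in zip(x, b)]   # gradient of the fidelity
        x = [xi - gamma * g for xi, g in zip(x, grad)]
        x = soft_threshold(x, t)                   # plug-and-play denoiser
    return x

# A sparse "signal" observed with additive noise.
x_true = [0.0, 0.0, 2.0, 0.0, -1.5, 0.0]
noise = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2]
b = [xt + n for xt, n in zip(x_true, noise)]
x_hat = pnp(b)
```

Swapping `soft_threshold` for a pretrained neural denoiser, and the identity fidelity for the shear-to-convergence operator, recovers the general shape of the PnP scheme the paper builds on.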
[LG-120] Structural Concentration in Weighted Networks: A Class of Topology-Aware Indices
链接: https://arxiv.org/abs/2603.21918
作者: L. Riso,M.G. Zoia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper develops a unified framework for measuring concentration in weighted systems embedded in networks of interactions. While traditional indices such as the Herfindahl-Hirschman Index capture dispersion in weights, they neglect the topology of relationships among the elements receiving those weights. To address this limitation, we introduce a family of topology-aware concentration indices that jointly account for weight distributions and network structure. At the core of the framework lies a baseline Network Concentration Index (NCI), defined as a normalized quadratic form that measures the fraction of potential weighted interconnection realized along observed network links. Building on this foundation, we construct a flexible class of extensions that modify either the interaction structure or the normalization benchmark, including weighted, density-adjusted, null-model, degree-constrained, transformed-data, and multi-layer variants. This family of indices preserves key properties such as normalization, invariance, and interpretability, while allowing concentration to be evaluated across different dimensions of dependence, including intensity, higher-order interactions, and extreme events. Theoretical results characterize the indices and establish their relationship with classical concentration and network measures. Empirical and simulation evidence demonstrate that systems with identical weight distributions may exhibit markedly different levels of structural concentration depending on network topology, highlighting the additional information captured by the proposed framework. The approach is broadly applicable to economic, financial, and complex systems in which weighted elements interact through networks.
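The baseline index can be sketched as a normalized quadratic form. The paper's exact normalization may differ, so take this stdlib-only example as one plausible reading: the share of potential pairwise weighted interconnection that runs along observed links.

```python
def nci(weights, adj):
    """One reading of a baseline Network Concentration Index:
    realized weighted interconnection over the complete-graph potential.
    adj[i][j] = 1 if i and j are linked (zero diagonal)."""
    n = len(weights)
    realized = sum(adj[i][j] * weights[i] * weights[j]
                   for i in range(n) for j in range(n) if i != j)
    potential = sum(weights[i] * weights[j]
                    for i in range(n) for j in range(n) if i != j)
    return realized / potential

# Identical (uniform) weight distribution, two different topologies:
# a 4-node chain vs. a triangle with a pendant leaf.
w = [0.25, 0.25, 0.25, 0.25]
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
dense = [[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
```

A Herfindahl-style index sees these two systems as identical, while the topology-aware quadratic form separates them, which is the paper's central point.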
[LG-121] Cluster-Specific Predictive Modeling: A Scalable Solution for Resource-Constrained Wi-Fi Controllers
链接: https://arxiv.org/abs/2603.21778
作者: Gianluca Fontanesi,Luca Barbieri,Lorenzo Galati Giordano,Alfonso Fernandez Duran,Thorsten Wild
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 figures, 7 pages
Abstract:This manuscript presents a comprehensive analysis of predictive modeling optimization in managed Wi-Fi networks through the integration of clustering algorithms and model evaluation techniques. The study addresses the challenges of deploying forecasting algorithms in large-scale environments managed by a central controller constrained by memory and computational resources. Feature-based clustering, supported by Principal Component Analysis (PCA) and advanced feature engineering, is employed to group time series data based on shared characteristics, enabling the development of cluster-specific predictive models. Comparative evaluations between global models (GMs) and cluster-specific models demonstrate that cluster-specific models consistently achieve superior accuracy in terms of Mean Absolute Error (MAE) values in high-activity clusters. The trade-offs between model complexity (and accuracy) and resource utilization are analyzed, highlighting the scalability of tailored modeling approaches. The findings advocate for adaptive network management strategies that optimize resource allocation through selective model deployment, enhance predictive accuracy, and ensure scalable operations in large-scale, centrally managed Wi-Fi environments.
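The global-vs-cluster-specific comparison can be illustrated with a deliberately tiny stdlib-only sketch: synthetic numbers, with per-cluster mean predictors standing in for the trained forecasting models and the cluster assignment given in advance.

```python
def mae(preds, truth):
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)

# Two activity regimes, e.g. low-traffic and high-traffic access points.
low = [1.0, 1.2, 0.8, 1.1, 0.9]
high = [9.0, 9.5, 8.5, 9.2, 8.8]
all_obs = low + high

# Global model (GM): one predictor fitted on everything.
global_mean = sum(all_obs) / len(all_obs)
global_mae = mae([global_mean] * len(all_obs), all_obs)

# Cluster-specific models: one predictor per cluster.
low_mean = sum(low) / len(low)
high_mean = sum(high) / len(high)
cluster_preds = [low_mean] * len(low) + [high_mean] * len(high)
cluster_mae = mae(cluster_preds, all_obs)
```

The global predictor is pulled toward the overall mean and misses both regimes, while the cluster-specific predictors track each regime closely, mirroring the MAE gap the paper reports in high-activity clusters.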
[LG-122] Identifiability and amortized inference limitations in Kuramoto models
链接: https://arxiv.org/abs/2603.21752
作者: Emma Hannula,Jana de Wiljes,Matthew T. Moores,Heikki Haario,Lassi Roininen
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Bayesian inference is a powerful tool for parameter estimation and uncertainty quantification in dynamical systems. However, for nonlinear oscillator networks such as Kuramoto models, widely used to study synchronization phenomena in physics, biology, and engineering, inference is often computationally prohibitive due to high-dimensional state spaces and intractable likelihood functions. We present an amortized Bayesian inference approach that learns a neural approximation of the posterior from simulated phase dynamics, enabling fast, scalable inference without repeated sampling or optimization. Applied to synthetic Kuramoto networks, the method shows promising results in approximating posterior distributions and capturing uncertainty, with computational savings compared to traditional Bayesian techniques. These findings suggest that amortized inference is a practical and flexible framework for uncertainty-aware analysis of oscillator networks.
[LG-123] Model selection in hybrid quantum neural networks with applications to quantum transformer architectures
链接: https://arxiv.org/abs/2603.21749
作者: Harsh Wadhwa,Rahul Bhowmick,Naipunnya Raj,Rajiv Sangle,Ruchira V. Bhat,Krishnakumar Sabapathy
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 32 Pages. 16 figures, 1 algorithm and 8 tables
Abstract:Quantum machine learning models generally lack principled design guidelines, often requiring full resource-intensive training across numerous choices of encodings, quantum circuit designs and initialization strategies to find effective configurations. To address this challenge, we develop the Quantum Bias-Expressivity Toolbox (\texttt{QBET}), a framework for evaluating quantum, classical, and hybrid transformer architectures. In this toolbox, we introduce lean metrics for Simplicity Bias (\texttt{SB}) and Expressivity (\texttt{EXP}) for comparing across various models, and extend the analysis of \texttt{SB} to generative and multiclass-classification tasks. We show that \texttt{QBET} enables efficient pre-screening of promising model variants, obviating the need to execute complete training pipelines. In evaluations on transformer-based classification and generative tasks we employ a total of 18 qubits for embeddings (6 qubits each for query, key, and value). We identify scenarios in which quantum self-attention variants surpass their classical counterparts by ranking the respective models according to the \texttt{SB} metric and comparing their relative performance.
[LG-124] CoNBONet: Conformalized Neuroscience-inspired Bayesian Operator Network for Reliability Analysis
链接: https://arxiv.org/abs/2603.21678
作者: Shailesh Garg,Souvik Chakraborty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Time-dependent reliability analysis of nonlinear dynamical systems under stochastic excitations is a critical yet computationally demanding task. Conventional approaches, such as Monte Carlo simulation, necessitate repeated evaluations of computationally expensive numerical solvers, leading to significant computational bottlenecks. To address this challenge, we propose \textit{CoNBONet}, a neuroscience-inspired surrogate model that enables fast, energy-efficient, and uncertainty-aware reliability analysis, providing a scalable alternative to techniques such as Monte Carlo simulations. CoNBONet, short for \textbf{Co}nformalized \textbf{N}euroscience-inspired \textbf{B}ayesian \textbf{O}perator \textbf{Net}work, leverages the expressive power of deep operator networks while integrating neuroscience-inspired neuron models to achieve fast, low-power inference. Unlike traditional surrogates such as Gaussian processes, polynomial chaos expansions, or support vector regression, which may face scalability challenges for high-dimensional, time-dependent reliability problems, CoNBONet offers \textit{fast and energy-efficient inference} enabled by a neuroscience-inspired network architecture, \textit{calibrated uncertainty quantification} with theoretical guarantees via split conformal prediction, and \textit{strong generalization capability} through an operator-learning paradigm that maps input functions to system response trajectories. Validation of the proposed CoNBONet for various nonlinear dynamical systems demonstrates that CoNBONet preserves predictive fidelity and achieves reliable coverage of failure probabilities, making it a powerful tool for robust and scalable reliability analysis in engineering design.
[LG-125] SPINONet: Scalable Spiking Physics-informed Neural Operator for Computational Mechanics Applications
链接: https://arxiv.org/abs/2603.21674
作者: Shailesh Garg,Luis Mandl,Somdatta Goswami,Souvik Chakraborty
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:
Abstract:Energy efficiency remains a critical challenge in deploying physics-informed operator learning models for computational mechanics and scientific computing, particularly in power-constrained settings such as edge and embedded devices, where repeated operator evaluations in dense networks incur substantial computational and energy costs. To address this challenge, we introduce the Separable Physics-informed Neuroscience-inspired Operator Network (SPINONet), a neuroscience-inspired framework that reduces redundant computation across repeated evaluations while remaining compatible with physics-informed training. SPINONet incorporates regression-friendly neuroscience-inspired spiking neurons through an architecture-aware design that enables sparse, event-driven computation, improving energy efficiency while preserving the continuous, coordinate-differentiable pathways required for computing spatio-temporal derivatives. We evaluate SPINONet on a range of partial differential equations representative of computational mechanics problems, with spatial, temporal, and parametric dependencies in both time-dependent and steady-state settings, and demonstrate predictive performance comparable to conventional physics-informed operator learning approaches despite the induced sparse communication. In addition, limited data supervision in a hybrid setup is shown to improve performance in challenging regimes where purely physics-informed training may converge to spurious solutions. Finally, we provide an analytical discussion linking architectural components and design choices of SPINONet to reductions in computational load and energy consumption.
[LG-126] Feature Incremental Clustering with Generalization Bounds
链接: https://arxiv.org/abs/2603.21590
作者: Jing Zhang,Chenping Hou
类目: Statistics Theory (math.ST); Machine Learning (cs.LG)
*备注:
Abstract:In many learning systems, such as activity recognition systems, as new data collection methods continue to emerge in various dynamic environmental applications, the attributes of instances accumulate incrementally, with data being stored in gradually expanding feature spaces. How to design theoretically guaranteed algorithms to effectively cluster this special type of data stream, commonly referred to as activity recognition, remains unexplored. Compared to traditional scenarios, we will face at least two fundamental questions in this feature incremental scenario. (i) How to design preliminary and effective algorithms to address the feature incremental clustering problem? (ii) How to analyze the generalization bounds for the proposed algorithms and under what conditions do these algorithms provide a strong generalization guarantee? To address these problems, by tailoring the most common clustering algorithm, i.e., k-means, as an example, we propose four types of Feature Incremental Clustering (FIC) algorithms corresponding to different situations of data access: Feature Tailoring (FT), Data Reconstruction (DR), Data Adaptation (DA), and Model Reuse (MR), abbreviated as FIC-FT, FIC-DR, FIC-DA, and FIC-MR. Subsequently, we offer a detailed analysis of the generalization error bounds for these four algorithms and highlight the critical factors influencing these bounds, such as the amounts of training data, the complexity of the hypothesis space, the quality of pre-trained models, and the discrepancy of the reconstruction feature distribution. The numerical experiments show the effectiveness of the proposed algorithms, particularly in their application to activity recognition clustering tasks.
[LG-127] FinRL-X: An AI-Native Modular Infrastructure for Quantitative Trading PAKDD2026
链接: https://arxiv.org/abs/2603.21330
作者: Hongyang Yang,Boyu Zhang,Yang She,Xinyu Liao,Xiaoli Zhang
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: Accepted at the DMO-FinTech Workshop (PAKDD 2026)
Abstract:We present FinRL-X, a modular and deployment-consistent trading architecture that unifies data processing, strategy construction, backtesting, and broker execution under a weight-centric interface. While existing open-source platforms are often backtesting- or model-centric, they rarely provide system-level consistency between research evaluation and live deployment. FinRL-X addresses this gap through a composable strategy pipeline that integrates stock selection, portfolio allocation, timing, and portfolio-level risk overlays within a unified protocol. The framework supports both rule-based and AI-driven components, including reinforcement learning allocators and LLM-based sentiment signals, without altering downstream execution semantics. FinRL-X provides an extensible foundation for reproducible, end-to-end quantitative trading research and deployment. The official FinRL-X implementation is available at this https URL.
[LG-128] The Average Relative Entropy and Transpilation Depth determine the noise robustness in Variational Quantum Classifiers
链接: https://arxiv.org/abs/2603.21300
作者: Aakash Ravindra Shinde,Arianne Meijer-van de Griend,Jukka K. Nurminen
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Variational Quantum Classifier, Quantum Machine Learning, Quantum Relative Entropy, Noise Resilient Quantum Circuits, Shallow Circuits
Abstract:Variational Quantum Algorithms (VQAs) have been extensively researched for applications in Quantum Machine Learning (QML), optimization, and molecular simulations. Although designed for Noisy Intermediate-Scale Quantum (NISQ) devices, VQAs are predominantly evaluated classically due to uncertain results on noisy devices and limited resource availability, raising concerns over the reproducibility of simulated VQAs on noisy hardware. While prior studies indicate that VQAs may exhibit noise resilience in specific parameterized shallow quantum circuits, there are no definitive measures to establish what defines a shallow circuit or the optimal circuit depth for VQAs on a noisy platform. These challenges extend naturally to Variational Quantum Classification (VQC) algorithms, a subclass of VQAs for supervised learning. In this article, we propose a relative entropy-based metric to verify whether a VQC model would perform similarly on a noisy device as it does in simulation. We establish a strong correlation between the average relative entropy difference in classes, transpilation circuit depth, and their performance difference on a noisy quantum device. Our results further indicate that circuit depth alone is insufficient to characterize shallow circuits. We present empirical evidence to support these assertions across a diverse array of techniques for implementing VQC, datasets, and multiple noisy quantum devices.
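The abstract does not spell out the metric's exact definition, but its core ingredient, the relative entropy (KL divergence) between outcome distributions, is straightforward to compute. The sketch below is a hedged illustration: comparing a hypothetical ideal-simulator distribution with a noisy-device distribution per class, then averaging; the paper's precise construction may differ.

```python
import math

# Hedged sketch of a relative-entropy comparison between measurement-outcome
# distributions: D(p || q) = sum_i p_i * log(p_i / q_i). Distributions are
# invented; the paper's exact per-class averaging scheme is not given here.

def relative_entropy(p, q, eps=1e-12):
    """KL divergence with clipping to guard against zero probabilities."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))

ideal = [0.7, 0.2, 0.1, 0.0]      # hypothetical simulated outcome distribution
noisy = [0.6, 0.25, 0.1, 0.05]    # hypothetical noisy-device distribution

per_class = [relative_entropy(ideal, noisy), relative_entropy(noisy, ideal)]
avg = sum(per_class) / len(per_class)
print(avg)   # a small average divergence suggests simulator-to-hardware transfer
```

The divergence is zero exactly when the two distributions coincide and grows as the noisy device distorts the class statistics, which is the intuition behind correlating it with the performance gap.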
[LG-129] Closed-form conditional diffusion models for data assimilation
链接: https://arxiv.org/abs/2603.21291
作者: Brianna Binder,Assad Oberai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:We propose closed-form conditional diffusion models for data assimilation. Diffusion models use data to learn the score function (defined as the gradient of the log-probability density of a data distribution), allowing them to generate new samples from the data distribution by reversing a noise injection process. While it is common to train neural networks to approximate the score function, we leverage the analytical tractability of the score function to assimilate the states of a system with measurements. To enable the efficient evaluation of the score function, we use kernel density estimation to model the joint distribution of the states and their corresponding measurements. The proposed approach also inherits the capability of conditional diffusion models of operating in black-box settings, i.e., the proposed data assimilation approach can accommodate systems and measurement processes without their explicit knowledge. The ability to accommodate black-box systems combined with the superior capabilities of diffusion models in approximating complex, non-Gaussian probability distributions means that the proposed approach offers advantages over many widely used filtering methods. We evaluate the proposed method on nonlinear data assimilation problems based on the Lorenz-63 and Lorenz-96 systems of moderate dimensionality and nonlinear measurement models. Results show the proposed approach outperforms the widely used ensemble Kalman and particle filters when small to moderate ensemble sizes are used.
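The analytical tractability the abstract leans on can be shown in one dimension: for a Gaussian kernel density estimate, the score (gradient of the log-density) has a closed form, with no neural network needed. This toy sketch only illustrates that closed form; the paper applies the idea to the joint distribution of states and measurements, which is not reproduced here.

```python
import math

# Closed-form score of a 1-D Gaussian KDE. For p(x) = (1/n) sum_i N(x; x_i, h^2),
#   d/dx log p(x) = sum_i w_i(x) * (x_i - x) / h^2,
# where w_i(x) are the normalized kernel responsibilities (a softmax over
# squared distances). Computed in log-space for numerical stability.

def kde_score(x, samples, h):
    logits = [-((x - xi) ** 2) / (2 * h * h) for xi in samples]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return sum(w / z * (xi - x) / (h * h) for w, xi in zip(weights, samples))

# With a single sample, the score reduces to (x0 - x) / h^2 exactly.
print(kde_score(0.5, [2.0], h=1.0))        # 1.5
print(kde_score(0.0, [-1.0, 1.0], h=0.5))  # 0 by symmetry
```

Because the score is evaluated directly from stored samples, no training loop is required, which is what makes the "closed-form" conditional diffusion sampler in the paper possible.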
[LG-130] Accelerate Vector Diffusion Maps by Landmarks
链接: https://arxiv.org/abs/2603.21247
作者: Sing-Yuan Yeh,Yi-An Wu,Hau-Tieng Wu,Mao-Pei Tsui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Differential Geometry (math.DG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:We propose a landmark-constrained algorithm, LA-VDM (Landmark Accelerated Vector Diffusion Maps), to accelerate the Vector Diffusion Maps (VDM) framework built upon the Graph Connection Laplacian (GCL), which captures pairwise connection relationships within complex datasets. LA-VDM introduces a novel two-stage normalization that effectively address nonuniform sampling densities in both the data and the landmark sets. Under a manifold model with the frame bundle structure, we show that we can accurately recover the parallel transport with landmark-constrained diffusion from a point cloud, and hence asymptotically LA-VDM converges to the connection Laplacian. The performance and accuracy of LA-VDM are demonstrated through experiments on simulated datasets and an application to nonlocal image denoising.
[LG-131] Time-adaptive functional Gaussian Process regression
链接: https://arxiv.org/abs/2603.21144
作者: MD Ruiz-Medina,AE Madrid,A Torres-Signes,JM Angulo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes a new formulation of functional Gaussian Process regression in manifolds, based on an Empirical Bayes approach, in the spatiotemporal random field context. We apply the machinery of tight Gaussian measures in separable Hilbert spaces, exploiting the invariance property of covariance kernels under the group of isometries of the manifold. The identification of these measures with infinite-product Gaussian measures is then obtained via the eigenfunctions of the Laplace-Beltrami operator on the manifold. The involved time-varying angular spectra constitute the key tool for dimension reduction in the implementation of this regression approach, adopting a suitable truncation scheme depending on the functional sample size. The simulation study and synthetic data application undertaken illustrate the finite sample and asymptotic properties of the proposed functional regression predictor.
[LG-132] Stochastic approximation in non-Markovian environments revisited
链接: https://arxiv.org/abs/2603.21091
作者: Vivek Shripad Borkar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Based on some recent work of the author on stochastic approximation in non-Markovian environments, we consider the situation when the driving random process is non-ergodic in addition to being non-Markovian. Using this, we propose an analytic framework for understanding transformer-based learning, specifically the 'attention' mechanism, and continual learning, both of which depend on the entire past in principle.
[LG-133] Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate
链接: https://arxiv.org/abs/2603.21062
作者: Yingzhen Yang,Ping Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We study the problem of learning a low-degree spherical polynomial of degree k_0 = \Theta(1) \ge 1 defined on the unit sphere in \mathbb{R}^d by training an over-parameterized two-layer neural network with augmented features. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk \epsilon \in (0, \Theta(d^{-k_0})], an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) requires a sample complexity of n \asymp \Theta(\log(4/\delta) \cdot d^{k_0}/\epsilon) with probability 1-\delta for \delta \in (0,1), in contrast with the representative sample complexity \Theta(d^{k_0} \max\{\epsilon^{-2}, \log d\}). Moreover, such sample complexity is nearly unimprovable since the trained network renders a nearly optimal rate of the nonparametric regression risk of the order \log(4/\delta) \cdot \Theta(d^{k_0}/n) with probability at least 1-\delta. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank \Theta(d^{k_0}) is \Theta(d^{k_0}/n), so that the rate of the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. In the case that the ground-truth degree k_0 is unknown, we present a novel and provable adaptive degree selection algorithm which identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound is obtained by training an over-parameterized neural network with a popular activation function (ReLU) and algorithmic guarantee for learning low-degree spherical polynomials. Due to the feature learning capability of GDP, our results are beyond the regular Neural Tangent Kernel (NTK) limit.
[LG-134] Statistical Learning for Latent Embedding Alignment with Application to Brain Encoding and Decoding
链接: https://arxiv.org/abs/2603.21042
作者: Shuoxun Xu,Zhanhao Yan,Lexin Li
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 35 pages, 3 figures
Abstract:Brain encoding and decoding aims to understand the relationship between external stimuli and brain activities, and is a fundamental problem in neuroscience. In this article, we study latent embedding alignment for brain encoding and decoding, with a focus on improving sample efficiency under limited fMRI-stimulus paired data and substantial subject heterogeneity. We propose a lightweight alignment framework equipped with two statistical learning components: inverse semi-supervised learning that leverages abundant unpaired stimulus embeddings through inverse mapping and residual debiasing, and meta transfer learning that borrows strength from pretrained models across subjects via sparse aggregation and residual correction. Both methods operate exclusively at the alignment stage while keeping encoders and decoders frozen, allowing for efficient computation, modular deployment, and rigorous theoretical analysis. We establish finite-sample generalization bounds and safety guarantees, and demonstrate competitive empirical performance on the large-scale fMRI-image reconstruction benchmark data.
[LG-135] Hard labels sampled from sparse targets mislead rotation invariant algorithms
链接: https://arxiv.org/abs/2603.20967
作者: Avrajit Ghosh,Bin Yu,Manfred Warmuth,Peter Bartlett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values \pm 1). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form \sigma(\mathbf{x}^\top \mathbf{w}^\star). In the over-constrained case (i.e. the number of samples n exceeds the input dimension d) with examples (\mathbf{x}_i, \sigma(\mathbf{x}_i^\top \mathbf{w}^\star)), it is sufficient to recover \mathbf{w}^\star and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels y_i sampled from the same conditional distribution \sigma(\mathbf{x}_i^\top \mathbf{w}^\star) and \mathbf{w}^\star is s-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk \Omega\left(\frac{d-1}{n}\right), while there are simple non-rotation-invariant algorithms with excess risk O\left(\frac{s \log d}{n}\right). The simplest rotation-invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bounds uses gradient descent on the weights u_i, v_i, where the linear weight w_i is reparameterized as u_i v_i.
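The non-rotation-invariant algorithm named at the end of the abstract can be sketched in a few lines: gradient descent on the logistic loss with each weight reparameterized as w_i = u_i v_i, which breaks rotation invariance and biases the dynamics toward sparse solutions. The data sizes, step size, and iteration count below are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

# Sketch: logistic regression on sampled hard labels with the elementwise
# reparameterization w = u * v. Gradients flow through w via the chain rule.

rng = np.random.default_rng(0)
n, d, s = 200, 20, 2
w_star = np.zeros(d)
w_star[:s] = 2.0                                   # s-sparse target
X = rng.standard_normal((n, d))
p = 1.0 / (1.0 + np.exp(-X @ w_star))
y = np.where(rng.random(n) < p, 1.0, -1.0)         # sampled hard labels

u = np.full(d, 0.1)
v = np.full(d, 0.1)
lr = 0.1
for _ in range(500):
    w = u * v
    margins = y * (X @ w)
    # d/dw of mean log(1 + exp(-margin)) = -mean(y * x * sigma(-margin))
    grad_w = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    u, v = u - lr * grad_w * v, v - lr * grad_w * u  # chain rule through w = u*v

w = u * v
print(np.round(w, 2))   # mass should concentrate on the first s coordinates
```

The multiplicative dynamics grow coordinates with a consistent gradient signal while leaving the rest near zero, a simple instance of the sparsity-friendly behavior that the rotation-invariant baseline cannot exhibit.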
[LG-136] Stability of Sequential and Parallel Coordinate Ascent Variational Inference
链接: https://arxiv.org/abs/2603.20929
作者: Debdeep Pati
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*备注: 20 pages, 3 figures
Abstract:We highlight a striking difference in behavior between two widely used variants of coordinate ascent variational inference: the sequential and parallel algorithms. While such differences were known in the numerical analysis literature in simpler settings, they remain largely unexplored in the optimization-focused literature on variational inference in more complex models. Focusing on the moderately high-dimensional linear regression problem, we show that the sequential algorithm, although typically slower, enjoys convergence guarantees under more relaxed conditions than the parallel variant, which is often employed to facilitate block-wise updates and improve computational efficiency.
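The sequential/parallel contrast the abstract refers to has a classical numerical-analysis analogue: Gauss-Seidel (sequential, each coordinate update uses the freshest values) versus Jacobi (parallel, all coordinates updated from the previous iterate). The sketch below runs both on a toy linear system; it is an analogy only, not the paper's variational objective.

```python
# Gauss-Seidel vs. Jacobi coordinate updates for solving A x = b.
# Jacobi recomputes every coordinate from the old iterate (parallelizable);
# Gauss-Seidel sweeps sequentially, using already-updated coordinates.

def jacobi_step(x, A, b):
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_step(x, A, b):
    x = list(x)
    n = len(b)
    for i in range(n):                 # uses freshly updated coordinates
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

A = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
b = [6.0, 6.0, 6.0]                    # exact solution: x = (1, 1, 1)

xj = [0.0, 0.0, 0.0]
xg = [0.0, 0.0, 0.0]
for _ in range(30):
    xj = jacobi_step(xj, A, b)
    xg = gauss_seidel_step(xg, A, b)
print(xj, xg)                          # both converge here; GS is closer
```

On this diagonally dominant system both variants converge, but Gauss-Seidel contracts faster, and on less well-conditioned systems Jacobi can diverge while Gauss-Seidel still converges, mirroring the paper's finding that the sequential CAVI variant enjoys guarantees under more relaxed conditions.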
[LG-137] Active Inference for Physical AI Agents – An Engineering Perspective
链接: https://arxiv.org/abs/2603.20927
作者: Bert de Vries
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Physical AI agents, such as robots and other embodied systems operating under tight and fluctuating resource constraints, remain far less capable than biological agents in open-ended real-world environments. This paper argues that Active Inference (AIF), grounded in the Free Energy Principle, offers a principled foundation for closing that gap. We develop this argument from first principles, following a chain from probability theory through Bayesian machine learning and variational inference to active inference and reactive message passing. From the FEP perspective, systems that maintain their structural and functional integrity over time can, under suitable assumptions, be described as minimizing variational free energy (VFE), and AIF operationalizes this by unifying perception, learning, planning, and control within a single computational objective. We show that VFE minimization is naturally realized by reactive message passing on factor graphs, where inference emerges from local, parallel computations. This realization is well matched to the constraints of physical operation, including hard deadlines, asynchronous data, fluctuating power budgets, and changing environments. Because reactive message passing is event-driven, interruptible, and locally adaptable, performance degrades gracefully under reduced resources while model structure can adjust online. We further show that, under suitable coupling and coarse-graining conditions, coupled AIF agents can be described as higher-level AIF agents, yielding a homogeneous architecture based on the same message-passing primitive across scales. Our contribution is not empirical benchmarking, but a clear theoretical and architectural case for the engineering community.
[LG-138] Auto-differentiable data assimilation: Co-learning of states dynamics and filtering algorithms
链接: https://arxiv.org/abs/2603.20891
作者: Melissa Adrian,Daniel Sanz-Alonso,Rebecca Willett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Dynamical Systems (math.DS)
*备注:
Abstract:Data assimilation algorithms estimate the state of a dynamical system from partial observations, where the successful performance of these algorithms hinges on costly parameter tuning and on employing an accurate model for the dynamics. This paper introduces a framework for jointly learning the state, dynamics, and parameters of filtering algorithms in data assimilation through a process we refer to as auto-differentiable filtering. The framework leverages a theoretically motivated loss function that enables learning from partial, noisy observations via gradient-based optimization using auto-differentiation. We further demonstrate how several well-known data assimilation methods can be learned or tuned within this framework. To underscore the versatility of auto-differentiable filtering, we perform experiments on dynamical systems spanning multiple scientific domains, such as the Clohessy-Wiltshire equations from aerospace engineering, the Lorenz-96 system from atmospheric science, and the generalized Lotka-Volterra equations from systems biology. Finally, we provide guidelines for practitioners to customize our framework according to their observation model, accuracy requirements, and computational budget.
[LG-139] Universal Coefficients and Mayer-Vietoris for Moore Homology of Ample Groupoids
链接: https://arxiv.org/abs/2603.20861
作者: Luciano Melodia
类目: Algebraic Topology (math.AT); Machine Learning (cs.LG); K-Theory and Homology (math.KT); Operator Algebras (math.OA)
*备注:
Abstract:We establish two structural results for Moore homology of ample groupoids. First, for every ample groupoid \mathcal{G} and every discrete abelian coefficient group A, we prove a universal coefficient theorem relating the homology groups H_n(\mathcal{G};A) to the integral Moore homology of \mathcal{G}. More precisely, we obtain a natural short exact sequence 0 \longrightarrow H_n(\mathcal{G};\mathbb{Z}) \otimes_{\mathbb{Z}} A \xrightarrow{\kappa_n^{\mathcal{G}}} H_n(\mathcal{G};A) \xrightarrow{\iota_n^{\mathcal{G}}} \operatorname{Tor}_1^{\mathbb{Z}}\bigl(H_{n-1}(\mathcal{G};\mathbb{Z}), A\bigr) \longrightarrow 0. Second, for a decomposition of the unit space into clopen saturated subsets, we prove a Mayer-Vietoris long exact sequence in Moore homology. The proof is carried out at the chain level and is based on a short exact sequence of Moore chain complexes associated to the corresponding restricted groupoids. These results provide effective tools for the computation of Moore homology. We also explain why the discreteness of the coefficient group is essential for the universal coefficient theorem.
[LG-140] mmWave-Diffusion:A Novel Framework for Respiration Sensing Using Observation-Anchored Conditional Diffusion Model ICASSP2026
链接: https://arxiv.org/abs/2603.20700
作者: Yong Wang,Qifan Shen,Bao Zhang,Zijun Huang,Chengbo Zhu,Shuai Yao,Qisong Wu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Accepted by IEEE ICASSP 2026
Abstract:Millimeter-wave (mmWave) radar enables contactless respiratory sensing, yet fine-grained monitoring is often degraded by nonstationary interference from body micromotion. To achieve micromotion interference removal, we propose mmWave-Diffusion, an observation-anchored conditional diffusion framework that directly models the residual between radar phase observations and the respiratory ground truth, and initializes sampling within an observation-consistent neighborhood rather than from Gaussian noise, thereby aligning the generative process with the measurement physics and reducing inference overhead. The accompanying Radar Diffusion Transformer (RDT) is explicitly conditioned on phase observations, enforces strict one-to-one temporal alignment via patch-level dual positional encodings, and injects local physical priors through banded-mask multi-head cross-attention, enabling robust denoising and interference removal in just 20 reverse steps. Evaluated on 13.25 hours of synchronized radar-respiration data, mmWave-Diffusion achieves state-of-the-art waveform reconstruction and respiratory-rate estimation with strong generalization. Code repository: this https URL.
[LG-141] High-dimensional online learning via asynchronous decomposition: Non-divergent results dynamic regularization and beyond
链接: https://arxiv.org/abs/2603.20696
作者: Shixiang Liu,Zhifan Li,Hanming Yang,Jianxin Yin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 41 pages, 1 figure
Abstract:Existing high-dimensional online learning methods often face the challenge that their error bounds, or per-batch sample sizes, diverge as the number of data batches increases. To address this issue, we propose an asynchronous decomposition framework that leverages summary statistics to construct a surrogate score function for current-batch learning. This framework is implemented via a dynamic-regularized iterative hard thresholding algorithm, providing a computationally and memory-efficient solution for sparse online optimization. We provide a unified theoretical analysis that accounts for both the streaming computational error and statistical accuracy, establishing that our estimator maintains non-divergent error bounds and \ell_0 sparsity across all batches. Furthermore, the proposed estimator adaptively achieves additional gains as batches accumulate, attaining the oracle accuracy as if the entire historical dataset were accessible and the true support were known. These theoretical properties are further illustrated through an example of the generalized linear model.
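The iterative-hard-thresholding primitive underlying the proposed algorithm is easy to state: take a gradient step on the least-squares loss, then keep only the s largest-magnitude coordinates. The offline sketch below shows this core step only; the paper's dynamic regularization and summary-statistic machinery for streaming batches are not reproduced. The design matrix is orthogonalized here so that a unit gradient step lands exactly, an illustrative simplification; real data needs step-size tuning.

```python
import numpy as np

# Plain iterative hard thresholding (IHT) for sparse linear regression.

def hard_threshold(w, s):
    """Keep the s largest-magnitude coordinates; zero out the rest."""
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-s:]
    out[keep] = w[keep]
    return out

rng = np.random.default_rng(1)
n, d, s = 100, 30, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q                       # X.T @ X = n * I (simplification)
w_star = np.zeros(d)
w_star[[2, 7, 11]] = [3.0, -2.0, 1.5]    # sparse ground truth
y = X @ w_star                           # noiseless for clarity

w = np.zeros(d)
lr = 1.0 / n
for _ in range(10):
    grad = X.T @ (X @ w - y)             # least-squares gradient
    w = hard_threshold(w - lr * grad, s)

print(np.nonzero(w)[0])                  # recovered support
```

Each iteration preserves exact s-sparsity of the iterate, which is the \ell_0 guarantee the abstract maintains across all batches; the streaming version replaces the full gradient with a surrogate built from summary statistics.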
[LG-142] Hierarchical Multiscale Structure-Function Coupling for Brain Connectome Integration
链接: https://arxiv.org/abs/2603.20680
作者: Jianwei Chen,Zhengyang Miao,Wenjie Cai,Jiaxue Tang,Boxing Liu,Yunfan Zhang,Yuhang Yang,Hao Tang,Carola-Bibiane Schönlieb,Zaixu Cui,Du Lei,Shouliang Qi,Chao Li
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:
Abstract:Integrating structural and functional connectomes remains challenging because their relationship is non-linear and organized over nested modular hierarchies. We propose a hierarchical multiscale structure-function coupling framework for connectome integration that jointly learns individualized modular organization and hierarchical coupling across structural connectivity (SC) and functional connectivity (FC). The framework includes: (i) Prototype-based Modular Pooling (PMPool), which learns modality-specific multiscale communities by selecting prototypical ROIs and optimizing a differentiable modularity-inspired objective; (ii) an Attention-based Hierarchical Coupling Module (AHCM) that models both within-hierarchy and cross-hierarchy SC-FC interactions to produce enriched hierarchical coupling representations; and (iii) a Coupling-guided Clustering loss (CgC-Loss) that regularizes SC and FC community assignments with coupling signals, allowing cross-modal interactions to shape community alignment across hierarchies. We evaluate the model’s performance across four cohorts for predicting brain age, cognitive score, and disease classification. Our model consistently outperforms baselines and other state-of-the-art approaches across three tasks. Ablation and sensitivity analyses verify the contributions of key components. Finally, the visualizations of learned coupling reveal interpretable differences, suggesting that the framework captures biologically meaningful structure-function relationships.
[LG-143] LassoFlexNet: Flexible Neural Architecture for Tabular Data
链接: https://arxiv.org/abs/2603.20631
作者: Kry Yik Chau Lui,Cheng Chi,Kishore Basu,Yanshuai Cao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 49 pages
Abstract:Despite their dominance in vision and language, deep neural networks often underperform relative to tree-based models on tabular data. To bridge this gap, we incorporate five key inductive biases into deep learning: robustness to irrelevant features, axis alignment, localized irregularities, feature heterogeneity, and training stability. We propose \emph{LassoFlexNet}, an architecture that evaluates the linear and nonlinear marginal contribution of each input via Per-Feature Embeddings, and sparsely selects relevant variables using a Tied Group Lasso mechanism. Because these components introduce optimization challenges that destabilize standard proximal methods, we develop a \emph{Sequential Hierarchical Proximal Adaptive Gradient} optimizer with exponential moving averages (EMA) to ensure stable convergence. Across 52 datasets from three benchmarks, LassoFlexNet matches or outperforms leading tree-based models, achieving up to a 10% relative gain, while maintaining Lasso-like interpretability. We substantiate these empirical results with ablation studies and theoretical proofs confirming the architecture’s enhanced expressivity and structural breaking of undesired rotational invariance.
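论文的 Tied Group Lasso 机制与专用优化器细节以原文为准;这里仅示意标准 group lasso 的近端算子(块软阈值),它是此类组级稀疏特征选择的基本构件:当整组系数的范数不超过阈值时整组被置零,否则整组按比例收缩:

```python
import math

def prox_group_lasso(v, lam):
    """lam * ||v||_2 的近端算子(块软阈值):
    若 ||v||_2 <= lam 则整组归零,否则按 (1 - lam/||v||_2) 收缩。"""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]
```

例如 `prox_group_lasso([3.0, 4.0], 1.0)` 将范数为 5 的组收缩为 `[2.4, 3.2]`,而范数小于阈值的组(如 `[0.3, 0.4]`)被整体置零——这正是"整组进、整组出"的变量选择行为。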
[LG-144] CogFormer: Learn All Your Models Once
链接: https://arxiv.org/abs/2603.20520
作者: Jerry M. Huang,Lukas Schumacher,Niek Stevenson,Stefan T. Radev
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Simulation-based inference (SBI) with neural networks has accelerated and transformed cognitive modeling workflows. SBI enables modelers to fit complex models that were previously difficult or impossible to estimate, while also allowing rapid estimation across large numbers of datasets. However, the utility of SBI for iterating over varying modeling assumptions remains limited: changing parameterizations, generative functions, priors, and design variables all necessitate model retraining and hence diminish the benefits of amortization. To address these issues, we pilot a meta-amortized framework for cognitive modeling which we nickname the CogFormer. Our framework trains a transformer-based architecture that remains valid across a combinatorial number of structurally similar models, allowing for changing data types, parameters, design matrices, and sample sizes. We present promising quantitative results across families of decision-making models for binary, multi-alternative, and continuous responses. Our evaluation suggests that CogFormer can accurately estimate parameters across model families with a minimal amortization offset, making it a potentially powerful engine that catalyzes cognitive modeling workflows.
[LG-145] Goal-oriented learning of stochastic dynamical systems using error bounds on path-space observables
链接: https://arxiv.org/abs/2603.20467
作者: Joanna Zou,Han Cheng Lie,Youssef Marzouk
类目: Methodology (stat.ME); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:The governing equations of stochastic dynamical systems often become cost-prohibitive for numerical simulation at large scales. Surrogate models of the governing equations, learned from data of the high-fidelity system, are routinely used to predict key observables with greater efficiency. However, standard choices of loss function for learning the surrogate model fail to provide error guarantees in path-dependent observables, such as reaction rates of molecular dynamical systems. This paper introduces an error bound for path-space observables and employs it as a novel variational loss for the goal-oriented learning of a stochastic dynamical system. We show the error bound holds for a broad class of observables, including mean first hitting times on unbounded time domains. We derive an analytical gradient of the goal-oriented loss function by leveraging the formula for Fréchet derivatives of expected path functionals, which remains tractable for implementation in stochastic gradient descent schemes. We demonstrate that surrogate models of overdamped Langevin systems developed via goal-oriented learning achieve improved accuracy in predicting the statistics of a first hitting time observable and robustness to distributional shift in the data.
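作为背景示意,下面用 Euler–Maruyama 离散化模拟过阻尼 Langevin 动力学 dX = -U'(X) dt + sqrt(2/β) dW,并以蒙特卡洛估计摘要中提到的首达时间(mean first hitting time)观测量。势函数、势垒位置与各参数均为演示假设,与论文的 surrogate 学习流程本身无关:

```python
import math
import random

def mean_first_hitting_time(grad_U, x0, barrier, beta=1.0, dt=1e-3,
                            n_paths=50, seed=0):
    """蒙特卡洛估计过阻尼 Langevin 路径自 x0 出发首次越过 barrier 的平均时间。
    grad_U: 势函数 U 的导数;beta: 逆温度;dt: Euler-Maruyama 步长。"""
    rng = random.Random(seed)
    noise = math.sqrt(2.0 * dt / beta)  # 每步噪声标准差
    times = []
    for _ in range(n_paths):
        x, t = x0, 0.0
        while x < barrier:
            x += -grad_U(x) * dt + noise * rng.gauss(0.0, 1.0)
            t += dt
        times.append(t)
    return sum(times) / n_paths
```

例如取二次势 U(x) = x²/2(即 `grad_U = lambda x: x`)、从 0 出发越过 1.0 的势垒,可得到有限的平均首达时间估计;论文的贡献正是为这类路径泛函给出误差界并将其用作 goal-oriented 损失。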
[LG-146] CERN: Correcting Errors in Raw Nanopore Signals Using Hidden Markov Models
链接: https://arxiv.org/abs/2603.20420
作者: Simon Ambrozak,Ulysse McConnell,Bhargav Srinivasan,Burak Ozkan,Can Firtina
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Nanopore sequencing can read substantially longer sequences of nucleic acid molecules than other sequencing methods, which has led to advances in genomic analysis such as the gapless human genome assembly. By analyzing the raw electrical signal reads that nanopore sequencing generates from molecules, existing works can map these reads without translating them into DNA characters (i.e., basecalling), allowing for quick and efficient analysis of sequencing data. However, raw signals often contain errors due to noise and mistakes when processing them, which limits the overall accuracy of raw signal analysis. Our goal in this work is to detect and correct errors in raw signals to improve the accuracy of raw signal analyses. To this end, we propose CERN, a mechanism that trains and utilizes a Hidden Markov Model (HMM) to accurately correct signal errors. Our extensive evaluation on various datasets including E. coli, Fruit Fly, and Human genomes shows that CERN 1) consistently improves the overall mapping accuracy of the underlying raw signal analysis tools, 2) minimizes the burden on segmentation algorithm optimization with newer nanopore chemistries, and 3) functions without causing substantial computational overhead. We conclude that CERN provides an effective mechanism to systematically identify and correct the errors in raw nanopore signals before further analysis, which can enable the development of a new class of error correction mechanisms purely designed for raw nanopore signals. CERN is available at this https URL. We also provide the scripts to fully reproduce our results on our GitHub page.
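CERN 所训练 HMM 的状态定义与参数以原文及其 GitHub 仓库为准;下面是通用的对数域 Viterbi 解码示意,展示 HMM 如何从带噪观测序列恢复最可能的隐状态路径(两状态 "clean"/"noisy" 与各概率均为虚构的演示设定):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """对数域 Viterbi:返回给定观测序列下最可能的隐状态路径。"""
    # 初始化:t = 0 时各状态的对数得分
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # 回溯指针
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # 从得分最高的末端状态回溯出整条路径
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# 虚构的两状态演示:信号片段要么"干净"要么"带噪"
states = ["clean", "noisy"]
start = {"clean": 0.9, "noisy": 0.1}
trans = {"clean": {"clean": 0.9, "noisy": 0.1},
         "noisy": {"clean": 0.3, "noisy": 0.7}}
emit = {"clean": {"lo": 0.95, "hi": 0.05},
        "noisy": {"lo": 0.05, "hi": 0.95}}
path = viterbi(["lo", "hi", "hi", "lo"], states, start, trans, emit)
```

这一解码思路即"先推断每个信号片段所处的隐状态,再据此纠正观测"的基础;CERN 针对纳米孔原始信号的具体建模细节请参阅原文。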
[LG-147] A chemical language model for reticular materials design
链接: https://arxiv.org/abs/2603.20389
作者: Dhruv Menon,Vivek Singh,Xu Chen,Mohammad Reza Alizadeh Kiapi,Ivan Zyuzin,Hamish W. Macleod,Nakul Rampal,William Shepard,Omar M. Yaghi,David Fairen-Jimenez
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 45 pages, 26 figures, Supplementary Information included; code available at: this https URL
Abstract:Reticular chemistry has enabled the synthesis of tens of thousands of metal-organic frameworks (MOFs), yet the discovery of new materials still relies largely on intuition-driven linker design and iterative experimentation. As a result, researchers explore only a small fraction of the vast chemical space accessible to reticular materials, limiting the systematic discovery of frameworks with targeted properties. Here, we introduce Nexerra-R1, a building-block chemical language model that enables inverse design in reticular chemistry through the targeted generation of organic linkers. Rather than generating complete frameworks directly, Nexerra-R1 operates at the level of molecular building blocks, preserving the modular logic that underpins reticular synthesis. The model supports both unconstrained generation of low-connectivity linkers and scaffold-constrained design of symmetric multidentate motifs compatible with predefined nodes and topologies. We further combine linker generation with flow-guided distributional targeting to steer the generative process toward application-relevant objectives while maintaining chemical validity and assembly feasibility. The generated linkers are subsequently assembled into three-dimensional frameworks and are structurally optimized to produce candidate materials compatible with experimental synthesis. Using Nexerra-R1, we validate this strategy by rediscovering known MOFs and by proposing the experimental synthesis of a previously unreported framework, CU-525, generated entirely in silico. Together, these results establish a general inverse-design paradigm for reticular materials in which controllable chemical language modelling enables the direct translation from computational design to synthesizable frameworks.
[LG-148] From Cross-Validation to SURE: Asymptotic Risk of Tuned Regularized Estimators
链接: https://arxiv.org/abs/2603.20388
作者: Karun Adusumilli,Maximilian Kasy,Ashia Wilson
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注:
Abstract:We derive the asymptotic risk function of regularized empirical risk minimization (ERM) estimators tuned by n -fold cross-validation (CV). The out-of-sample prediction loss of such estimators converges in distribution to the squared-error loss (risk function) of shrinkage estimators in the normal means model, tuned by Stein’s unbiased risk estimate (SURE). This risk function provides a more fine-grained picture of predictive performance than uniform bounds on worst-case regret, which are common in learning theory: it quantifies how risk varies with the true parameter. As key intermediate steps, we show that (i) n -fold CV converges uniformly to SURE, and (ii) while SURE typically has multiple local minima, its global minimum is generically well separated. Well-separation ensures that uniform convergence of CV to SURE translates into convergence of the tuning parameter chosen by CV to that chosen by SURE.
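摘要讨论 normal means 模型中由 SURE 调参的收缩估计量。作为教科书式示意(假设方差 σ² 已知,非论文本身的推导),软阈值估计量的 SURE 为 -nσ² + Σ min(|y_i|, λ)² + 2σ²·#{i : |y_i| > λ},可直接在候选网格上最小化来选 λ:

```python
def sure_soft_threshold(y, lam, sigma=1.0):
    """软阈值估计量在 normal means 模型中的 Stein 无偏风险估计。"""
    n = len(y)
    fit = sum(min(abs(v), lam) ** 2 for v in y)      # ||delta(y) - y||^2
    df = sum(1 for v in y if abs(v) > lam)           # 散度(未被压零的个数)
    return -n * sigma ** 2 + fit + 2 * sigma ** 2 * df

def tune_by_sure(y, grid, sigma=1.0):
    """在候选阈值网格上最小化 SURE,类似摘要中 CV 收敛到的 SURE 调参。"""
    return min(grid, key=lambda lam: sure_soft_threshold(y, lam, sigma))
```

可以验证 λ = 0(不收缩)时 SURE 等于恒等估计量的风险 nσ²,λ 足够大(全部压零)时等于 Σ y_i² - nσ²,与理论一致。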
[LG-149] Operator Learning for Smoothing and Forecasting
链接: https://arxiv.org/abs/2603.20359
作者: Edoardo Calvello,Elizabeth Carlson,Nikola Kovachki,Michael N. Manta,Andrew M. Stuart
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注:
Abstract:Machine learning has opened new frontiers in purely data-driven algorithms for data assimilation in, and for forecasting of, dynamical systems; the resulting methods are showing some promise. However, in contrast to model-driven algorithms, analysis of these data-driven methods is poorly developed. In this paper we address this issue, developing a theory to underpin data-driven methods to solve smoothing problems arising in data assimilation and forecasting problems. The theoretical framework relies on two key components: (i) establishing the existence of the mapping to be learned; (ii) the properties of the operator learning architecture used to approximate this mapping. By studying these two components in conjunction, we establish the first universal approximation theorem for purely data-driven algorithms for both smoothing and forecasting of dynamical systems. We work in the continuous time setting, hence deploying neural operator architectures. The theoretical results are illustrated with experiments studying the Lorenz 63, Lorenz 96 and Kuramoto-Sivashinsky dynamical systems.
[LG-150] G2DR: A Genotype-First Framework for Genetics-Informed Target Prioritization and Drug Repurposing
链接: https://arxiv.org/abs/2603.20346
作者: Muhammad Muneeb,David B. Ascher
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:Human genetics offers a promising route to therapeutic discovery, yet practical frameworks translating genotype-derived signal into ranked target and drug hypotheses remain limited, particularly when matched disease transcriptomics are unavailable. Here we present G2DR, a genotype-first prioritization framework propagating inherited variation through genetically predicted expression, multi-method gene-level testing, pathway enrichment, network context, druggability, and multi-source drug–target evidence integration. In a migraine case study with 733 UK Biobank participants under stratified five-fold cross-validation, we imputed expression across seven transcriptome-weight resources and ranked genes using a reproducibility-aware discovery score from training and validation data, followed by a balanced integrated score for target selection. Discovery-based prioritization generalized to held-out data, achieving gene-level ROC-AUC of 0.775 and PR-AUC of 0.475, while retaining enrichment for curated migraine biology. Mapping prioritized genes to compounds via Open Targets, DGIdb, and ChEMBL yielded drug sets enriched for migraine-linked compounds relative to a global background, though recovery favoured broader mechanism-linked and off-label space over migraine-specific approved therapies. Directionality filtering separated broadly recovered compounds from mechanistically compatible candidates. G2DR is a modular framework for genetics-informed hypothesis generation, not a clinically actionable recommendation system. All outputs require independent experimental, pharmacological, and clinical validation.
[LG-151] Forward and inverse problems for measure flows in Bayes Hilbert spaces
链接: https://arxiv.org/abs/2603.20329
作者: S. David Mis,Maarten V. de Hoop
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:We study forward and inverse problems for time-dependent probability measures in Bayes–Hilbert spaces. On the forward side, we show that each sufficiently regular Bayes–Hilbert path admits a canonical dynamical realization: a weighted Neumann problem transforms the log-density variation into the unique gradient velocity field of minimum kinetic energy. This construction induces a transport form on Bayes–Hilbert tangent directions, which measures the dynamical cost of realizing prescribed motions, and yields a flow-matching interpretation in which the canonical velocity field is the minimum-energy execution of the prescribed path. On the inverse side, we formulate reconstruction directly on Bayes–Hilbert path space from time-dependent indirect observations. The resulting variational problem combines a data-misfit term with the transport action induced by the forward geometry. In our infinite-dimensional setting, however, this transport geometry alone does not provide sufficient compactness, so we add explicit temporal and spatial regularization to close the theory. The linearized observation operator induces a complementary observability form, which quantifies how strongly tangent directions are seen through the data. Under explicit Sobolev regularity and observability assumptions, we prove existence of minimizers, derive first-variation formulas, establish local stability of the observation map, and deduce recovery of the evolving law, its score, and its canonical velocity field under the strong topologies furnished by the compactness theory.
[LG-152] Decorrelation Diversity and Emergent Intelligence: The Isomorphism Between Social Insect Colonies and Ensemble Machine Learning
链接: https://arxiv.org/abs/2603.20328
作者: Ernest Fokoué,Gregory Babbitt,Yuval Leventhal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 47 pages, 13 figures, 4 tables
Abstract:Social insect colonies and ensemble machine learning methods represent two of the most successful examples of decentralized information processing in nature and computation respectively. Here we develop a rigorous mathematical framework demonstrating that ant colony decision-making and random forest learning are isomorphic under a common formalism of \textbf{stochastic ensemble intelligence}. We show that the mechanisms by which genetically identical ants achieve functional differentiation – through stochastic response to local cues and positive feedback – map precisely onto the bootstrap aggregation and random feature subsampling that decorrelate decision trees. Using tools from Bayesian inference, multi-armed bandit theory, and statistical learning theory, we prove that both systems implement identical variance reduction strategies through decorrelation of identical units. We derive explicit mappings between ant recruitment rates and tree weightings, pheromone trail reinforcement and out-of-bag error estimation, and quorum sensing and prediction averaging. This isomorphism suggests that collective intelligence, whether biological or artificial, emerges from a universal principle: \textbf{randomized identical agents + diversity-enforcing mechanisms} \rightarrow emergent optimality.
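摘要中"通过去相关实现方差缩减"可用随机森林文献中的经典结论说明:B 个方差均为 σ²、两两相关系数为 ρ 的同分布预测器取平均后,方差为 ρσ² + (1-ρ)σ²/B——当 B→∞ 时方差下界为 ρσ²,因此降低 ρ(bagging、随机特征子采样,或蚁群中的随机任务分化)才能真正压低方差下限。以下是这一通用公式的最小示意(非论文本身的推导):

```python
def ensemble_variance(sigma2, rho, b):
    """b 个同分布预测器(方差 sigma2、两两相关系数 rho)平均后的方差:
    rho * sigma2 + (1 - rho) * sigma2 / b。"""
    return rho * sigma2 + (1.0 - rho) * sigma2 / b
```

例如 σ² = 1、B = 100 时,ρ = 0.5 的集成方差约为 0.505,而去相关到 ρ = 0.1 后降至约 0.109:增大集成规模的收益被相关性封顶,去相关机制决定了可达的下限。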
[LG-153] Compact Lifted Relaxations for Low-Rank Optimization
链接: https://arxiv.org/abs/2603.20228
作者: Ryan Cory-Wright,Jean Pauphilet
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Part of this material previously appeared in arXiv:2501.02942v2 , which was split into this paper and arXiv:2501.02942v3
Abstract:We develop tractable convex relaxations for rank-constrained quadratic optimization problems over n \times m matrices, a setting for which tractable relaxations are typically only available when the objective or constraints admit spectral (permutation-invariant) structure. We derive lifted semidefinite relaxations that do not require such spectral terms. Although a direct lifting introduces a large semidefinite constraint in dimension n^2 + nm + 1 , we prove that many blocks of the moment matrix are redundant and derive an equivalent compact relaxation that only involves two semidefinite constraints of dimension nm + 1 and n+m respectively. For matrix completion, basis pursuit, and reduced-rank regression problems, we exploit additional structure to obtain even more compact formulations involving semidefinite matrices of dimension at most 2\max(n,m) . Overall, we obtain scalable semidefinite bounds for a broad class of low-rank quadratic problems.
附件下载


