This post lists the latest papers retrieved from Arxiv.org on 2026-02-11. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each day.
Tip: If the list has not been updated for the day, either no new papers were released on Arxiv that day or the update script failed. Fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-02-11)
A total of 595 new papers today, including:
- Natural Language Processing: 84 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 163 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 112 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 175 papers (Machine Learning (cs.LG))
- Multiagent Systems: 9 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 19 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 27 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] Resilient Topology-Aware Coordination for Dynamic 3D UAV Networks under Node Failure
【Quick Read】: This paper addresses the problem of maintaining continuous service coverage in 3D Aerial-Ground Integrated Networks (AGINs) under sudden hardware failures, with particular attention to the limited robustness of Multi-Agent Reinforcement Learning (MARL) under dynamic topology deformation. The key to the solution is a Topology-Aware Graph MAPPO (TAG-MAPPO) framework that combines graph-based feature aggregation with a residual ego-state fusion mechanism to accurately model the intricate dependencies among agents, enabling the surviving swarm to rapidly reconfigure itself in 3D space after damage. This design markedly improves stability, energy efficiency, and self-healing in both interference-heavy urban environments and sparse rural scenarios; in particular, it restores over 90% of service coverage within 15 time steps after a catastrophic node failure and exhibits a faster V-shaped recovery trajectory and better fairness metrics than conventional Multi-Layer Perceptron (MLP) baselines.
Link: https://arxiv.org/abs/2602.10029
Authors: Chuan-Chi Lai
Affiliations: Advanced Institute of Manufacturing with High-tech Innovations (AIM-HI); National Chung Cheng University
Subjects: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA)
Comments: 25 pages, 5 figures. Full research paper providing a resilience-aware RL framework for UAV networks under node failure. A preliminary version has been submitted to ACM Transactions on Internet Technology (TOIT) for possible publication
Abstract:In 3D Aerial-Ground Integrated Networks (AGINs), ensuring continuous service coverage under unexpected hardware failures is critical for mission-critical applications. While Multi-Agent Reinforcement Learning (MARL) has shown promise in autonomous coordination, its resilience under sudden node failures remains a challenge due to dynamic topology deformation. This paper proposes a Topology-Aware Graph MAPPO (TAG-MAPPO) framework designed to enhance system survivability through autonomous 3D spatial reconfiguration. Our framework incorporates graph-based feature aggregation with a residual ego-state fusion mechanism to capture intricate inter-agent dependencies. This architecture enables the surviving swarm to rapidly adapt its topology compared to conventional Multi-Layer Perceptron (MLP) based approaches. Extensive simulations across heterogeneous environments, ranging from interference-limited Crowded Urban to sparse Rural areas, validate the proposed approach. The results demonstrate that TAG-MAPPO consistently outperforms baselines in both stability and efficiency; specifically, it reduces redundant handoffs by up to 50 percent while maintaining a lead in energy efficiency. Most notably, the framework exhibits exceptional self-healing capabilities following a catastrophic node failure. TAG-MAPPO restores over 90 percent of the pre-failure service coverage within 15 time steps, exhibiting a significantly faster V-shaped recovery trajectory than MLP baselines. Furthermore, in dense urban scenarios, the framework achieves a post-failure Jain’s Fairness Index that even surpasses its original four-UAV configuration by effectively resolving service overlaps. These findings suggest that topology-aware coordination is essential for the realization of resilient 6G aerial networks and provides a robust foundation for adaptive deployments in volatile environments.
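The graph-based feature aggregation with residual ego-state fusion mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; the attention form, dimensions, and weight shapes are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def aggregate_with_ego_fusion(ego, neighbors, W_q, W_k, W_v, W_res):
    """Attention-weighted aggregation of neighbor features, fused with a
    residual ego-state term (illustrative; not the paper's exact equations)."""
    q = W_q @ ego                                  # query from the ego UAV's own state
    keys = neighbors @ W_k.T                       # one key per surviving neighbor
    vals = neighbors @ W_v.T                       # one value per surviving neighbor
    attn = softmax(keys @ q / np.sqrt(len(q)))     # attention over neighbors
    aggregated = attn @ vals                       # topology-aware neighborhood summary
    return np.tanh(aggregated + W_res @ ego)       # residual ego-state fusion

# Toy usage: a 4-UAV swarm where one node has failed (only 2 neighbors remain).
rng = np.random.default_rng(0)
d = 8
ego_state = rng.normal(size=d)
neighbor_states = rng.normal(size=(2, d))          # features of surviving neighbors
W_q, W_k, W_v, W_res = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
fused = aggregate_with_ego_fusion(ego_state, neighbor_states, W_q, W_k, W_v, W_res)
print(fused.shape)  # (8,) -- would feed the MAPPO policy head
```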
[MA-1] A Collaborative Safety Shield for Safe and Efficient CAV Lane Changes in Congested On-Ramp Merging
【Quick Read】: This paper tackles the conflicting objectives of safety and traffic efficiency when Connected and Autonomous Vehicles (CAVs) change lanes in dense traffic. Existing lane-change controllers typically either guarantee safety or improve efficiency, but struggle to optimize both together. The key to the solution is the Multi-Agent Safety Shield (MASS), designed with Control Barrier Functions (CBFs), which captures the dynamic interactions among vehicles through an interaction topology constructed as a graph and is integrated into a Multi-Agent Reinforcement Learning (MARL) lane-change controller, yielding the MARL-MASS framework. While strictly enforcing the safety constraints, the method introduces a customized reward function to improve policy stability, effectively balancing the safety-efficiency trade-off; its effectiveness is validated in a congested on-ramp merging scenario.
Link: https://arxiv.org/abs/2602.10007
Authors: Bharathkumar Hegde, Melanie Bouroche
Affiliations: Trinity College Dublin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Accepted in IEEE IV 2026
Abstract:Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi-Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi-agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state-of-the-art Multi-Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL-MASS, and evaluate it in a congested on-ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL-MASS effectively balances the trade-off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL-MASS is available with an open-source licence at this https URL
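To make the safety-shield idea concrete, here is a minimal one-dimensional sketch of a CBF-style filter that overrides a nominal acceleration when it would violate a distance-based barrier. It uses a simplified car-following model, not the multi-agent lane-change dynamics or graph topology of MASS; all constants are assumptions.

```python
import numpy as np

def cbf_safe_accel(gap, rel_speed, a_nominal, alpha=1.0, d_safe=5.0,
                   a_min=-6.0, a_max=3.0, dt=0.1):
    """Project a nominal acceleration onto the set allowed by a simple CBF.
    Barrier: h = gap - d_safe; enforce (approximately) h_dot >= -alpha * h,
    where h_dot ~ rel_speed - a * dt in this toy model."""
    h = gap - d_safe
    a_allowed = (rel_speed + alpha * h) / dt       # largest acceleration still safe
    a_safe = min(a_nominal, a_allowed)             # keep the RL action when already safe
    return float(np.clip(a_safe, a_min, a_max))

# The RL policy proposes an aggressive acceleration; the shield overrides it
# only when the gap to the leading vehicle is too small.
print(cbf_safe_accel(gap=20.0, rel_speed=0.5, a_nominal=2.5))  # nominal kept
print(cbf_safe_accel(gap=5.5, rel_speed=-1.0, a_nominal=2.5))  # clamped down
```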
[MA-2] Tiny Moves: Game-based Hypothesis Refinement
【Quick Read】: This paper addresses the limited interpretability and opaque reasoning that arise when current machine learning approaches to scientific discovery treat hypotheses as end-to-end predictions, overlooking the fact that scientific reasoning refines hypotheses through local, incremental revisions. The key to the solution is The Hypothesis Game, a symbolic formalism in which large language model (LLM) agents make collaborative, incremental edits to a shared hypothesis state using a fixed grammar of reasoning moves, yielding a structurally controllable, interpretable, and transferable mechanism for hypothesis refinement.
Link: https://arxiv.org/abs/2602.09801
Authors: Agnieszka Dobrowolska, Rogier Hintzen, Martin Balla, Karl Gemayel, Sabine Reichert, Thomas Charman, Jen Ning Lim, Lindsay Edwards, Anna Gogleva
Affiliations: Relation Therapeutics
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Most machine learning approaches to scientific discovery frame hypotheses as end-to-end predictions, obscuring the incremental structure of scientific reasoning. We propose The Hypothesis Game, a symbolic formalism for hypothesis refinement in which LLM agents operate on a shared hypothesis state using a fixed grammar of reasoning moves. The framework is motivated by the observation that scientific progress often proceeds through small, localized revisions, grounded in domain context, rather than extensive rewrites. We instantiate a minimal game with LLM agents and evaluate it on pathway-level mechanistic refinement tasks. In the primary setting of corruption recovery, where hypotheses contain controlled errors, the game-based approach consistently removes more errors and achieves higher precision than strong prompting baselines, while preserving valid structure through incremental edits. In a secondary reconstruction setting from partial cues, it performs comparably to the strongest baseline, indicating that explicit move-based refinement remains competitive even when ground-truth recovery is difficult. These findings support game-based reasoning as a principled route to more controllable, interpretable, and transferable hypothesis refinement systems for scientific discovery.
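The idea of a fixed grammar of moves applied to a shared hypothesis state can be sketched with a small data structure. The move names, edge-set representation, and validation rule below are illustrative assumptions, not the paper's actual formalism.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    edges: set = field(default_factory=set)    # e.g. {("geneA", "activates", "geneB")}
    history: list = field(default_factory=list)

MOVES = {"add_edge", "remove_edge", "relabel_edge"}   # the fixed grammar (assumed)

def apply_move(h: Hypothesis, move: str, *args) -> Hypothesis:
    """Apply one localized revision; reject anything outside the grammar."""
    if move not in MOVES:
        raise ValueError(f"move {move!r} is not in the grammar")
    if move == "add_edge":
        h.edges.add(args[0])
    elif move == "remove_edge":
        h.edges.discard(args[0])
    elif move == "relabel_edge":
        old, new_label = args
        h.edges.discard(old)
        h.edges.add((old[0], new_label, old[2]))
    h.history.append((move, args))             # the game keeps an auditable edit trail
    return h

h = Hypothesis({("TP53", "inhibits", "MDM2")})
apply_move(h, "relabel_edge", ("TP53", "inhibits", "MDM2"), "activates")
apply_move(h, "add_edge", ("MDM2", "inhibits", "TP53"))
print(h.edges, len(h.history))
```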
[MA-3] Dieu khien he da tac tu (Multi-Agent System Control)
【Quick Read】: This book addresses the scarcity of systematic textbooks on the control of multi-agent systems (MAS), especially the lack of resources outside English that give a thorough, structured treatment of the fundamentals. The key to its approach is a progressive, step-by-step presentation of several core topics, covering graph-theoretic foundations, the design and analysis of linear consensus algorithms, and representative application directions such as formation control and distributed optimization, combining theoretical depth with pedagogical accessibility. Notes on notable researchers, further reading, and exercises accompany each chapter, helping learners master commonly used methods and tools and bridging the gap between academic research and engineering practice.
Link: https://arxiv.org/abs/2602.09412
Authors: Minh Hoang Trinh, Hieu Minh Nguyen
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: “Multi-Agent System Control”, 252 pages; in Vietnamese language, 82 figures
Abstract:Since the early 2000s, control of multiagent systems has attracted significant research interest, with applications ranging from natural collective behaviors and social dynamics to engineered systems such as autonomous vehicles, sensor networks, and smart grids. Although research on multi-agent systems has diversified into numerous specialized directions, textbooks – including those in English – that provide a systematic treatment of the fundamental principles of multi-agent system control remain scarce. The material presented in this book has been developed and used in teaching since 2021, initially as a concise Vietnamese-language reference for the courses Networked Control Systems and Control of Multi-Agent Systems at Hanoi University of Science and Technology. The book focuses on a selection of fundamental topics of broad and continuing interest in the field. The complexity of several topics is asymptotic to that encountered in research-level studies, however, the analysis is presented in a step-by-step manner to facilitate access to commonly used methods and tools. The material is divided into three main parts. Part I introduces multiagent systems and basic graph-theoretic concepts. Part II addresses the design and analysis of linear consensus algorithms. Part III covers selected applications and research directions, including formation control, network localization, distributed optimization, opinion dynamics, and matrix-weighted networks. Each chapter concludes with notes on notable researchers in this field, further reading, and exercises. This book cannot be completed without the encouragement, support and suggestions from families, colleagues and friends. The authors appreciate feedback from readers to further improve the content of the book.
[MA-4] LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
【Quick Read】: This paper targets two core challenges facing generative AI in assisted psychiatric diagnosis: first, the lack of a benchmark that simultaneously provides realistic patient simulation, clinician-verified diagnostic labels, and support for evaluating dynamic multi-turn consultation; second, the unclear performance gap, and its causes, between large language models (LLMs) on static diagnostic tasks and in dynamic multi-turn consultation. The key to the solution is LingxiDiagBench, a large-scale multi-agent benchmark built around the LingxiDiag-16K dataset of 16,000 synthetic consultation dialogues aligned with electronic medical records (EMRs), which faithfully reproduces real demographic and diagnostic distributions across 12 ICD-10 psychiatric categories and supports both static diagnostic reasoning and dynamic multi-turn consultation, providing a quantifiable and reproducible evaluation framework for LLM mental-health diagnosis in Chinese.
Link: https://arxiv.org/abs/2602.09379
Authors: Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng
Affiliations: EverMind AI Inc.; Tianqiao and Chrissy Chen Institute; Shanghai Mental Health Center; Shanghai Jiao Tong University School of Medicine
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments:
Abstract:Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at this https URL.
[MA-5] VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models
【Quick Read】: This paper addresses three limitations of current surgical image segmentation methods: they are restricted to predefined categories and lack flexibility; they only produce one-shot predictions without adaptive refinement; and they offer no mechanism for clinician interaction. The key to the solution is IR-SIS (Iterative Refinement System for Surgical Image Segmentation), which achieves language-driven adaptive refinement through three core components: a fine-tuned Segment Anything Model (SAM3) produces the initial segmentation; a Vision-Language Model (VLM) detects surgical instruments and assesses segmentation quality; and an agentic workflow dynamically selects the best refinement strategy while supporting clinician-in-the-loop interaction through natural language feedback for iterative improvement. This framework is the first to combine language-based surgical image segmentation with adaptive self-refinement.
Link: https://arxiv.org/abs/2602.09252
Authors: Ange Lou, Yamin Li, Qi Chang, Nan Xi, Luyuan Xie, Zichao Li, Tianyu Luan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
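The segment-assess-refine loop can be sketched schematically. The segmenter, the VLM quality judge, and the refinement strategies below are stubs standing in for the fine-tuned SAM3 model, the VLM, and IR-SIS's actual agentic workflow; names and thresholds are assumptions for illustration.

```python
def initial_segment(image, prompt):
    return {"prompt": prompt, "mask": "coarse_mask"}      # stub segmenter (SAM3 in the paper)

def assess_quality(image, result):
    # Stub VLM judge: returns a quality score in [0, 1] and a detected failure mode.
    return 0.6, "boundary_bleed"

STRATEGIES = {
    "boundary_bleed": lambda img, res: {**res, "mask": res["mask"] + "+tighten"},
    "missing_part":   lambda img, res: {**res, "mask": res["mask"] + "+grow"},
}

def refine_loop(image, prompt, clinician_feedback=None, max_rounds=3, threshold=0.9):
    result = initial_segment(image, prompt)
    feedback = list(clinician_feedback or [])
    for _ in range(max_rounds):
        score, failure = assess_quality(image, result)
        if feedback:                                   # clinician-in-the-loop override
            failure = feedback.pop(0)
        if score >= threshold or failure not in STRATEGIES:
            break
        result = STRATEGIES[failure](image, result)    # adaptive strategy selection
    return result

print(refine_loop("frame_001.png", "the large needle driver", ["missing_part"]))
```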
[MA-6] CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic Perspective
【Quick Read】: This paper addresses how to make multi-agent collaborative decisions that are more stable and interpretable in oncology decision-support tasks with dynamic, heterogeneous patient data. Conventional multi-agent architectures mostly rely on stochastic narrative-based reasoning, which makes it hard to guarantee stable decision pathways and accurate evidence attribution. The key to the solution is Contribution-Aware Medical Multi-Agents (CoMMa), a decentralized large language model (LLM) agent framework whose core innovation is to coordinate agents through a game-theoretic objective and to approximate contribution-aware credit assignment with deterministic embedding projections. This explicitly estimates each agent's marginal utility and yields mathematically grounded, interpretable decision pathways with markedly improved stability and accuracy.
Link: https://arxiv.org/abs/2602.09159
Authors: Yichen Wu, Yujin Oh, Sangjoon Park, Kailong Fan, Dania Daye, Hana Farzaneh, Xiang Li, Raul Uppot, Quanzheng Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 3 figures
Abstract:Recent multi-agent frameworks have broadened the ability to tackle oncology decision support tasks that require reasoning over dynamic, heterogeneous patient data. We propose Contribution-Aware Medical Multi-Agents (CoMMa), a decentralized LLM-agent framework in which specialists operate on partitioned evidence and coordinate through a game-theoretic objective for robust decision-making. In contrast to most agent architectures relying on stochastic narrative-based reasoning, CoMMa utilizes deterministic embedding projections to approximate contribution-aware credit assignment. This yields explicit evidence attribution by estimating each agent’s marginal utility, producing interpretable and mathematically grounded decision pathways with improved stability. Evaluated on diverse oncology benchmarks, including a real-world multidisciplinary tumor board dataset, CoMMa achieves higher accuracy and more stable performance than data-centralized and role-based multi-agents baselines.
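Contribution-aware credit assignment via marginal utilities is, at its core, a Shapley-style computation. The sketch below uses exact permutation enumeration over a toy utility function with invented agent names; the paper instead approximates this with deterministic embedding projections, so treat everything here as an assumption for illustration.

```python
from itertools import permutations

def coalition_utility(coalition):
    """Toy utility: how well a subset of specialists supports the decision."""
    scores = {"radiology": 0.4, "pathology": 0.3, "oncology": 0.5}
    base = sum(scores[a] for a in coalition)
    bonus = 0.2 if {"radiology", "pathology"} <= set(coalition) else 0.0
    return min(1.0, base + bonus)      # synergy between imaging and tissue evidence

def shapley_values(agents, utility):
    """Average marginal contribution of each agent over all join orders."""
    values = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        seen = []
        for agent in order:
            marginal = utility(seen + [agent]) - utility(seen)
            values[agent] += marginal / len(orders)
            seen.append(agent)
    return values

agents = ["radiology", "pathology", "oncology"]
print(shapley_values(agents, coalition_utility))   # explicit per-agent credit
```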
[MA-7] SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery
【Quick Read】: This paper addresses the difficulty of closing the loop of scientific discovery in AI for Science (AI4S): generative AI systems struggle to make effective use of raw experimental data. Such data are highly heterogeneous, domain-specific, and demand deep expertise, so they neither align directly with linguistic representations nor fit into a unified semantic embedding space, which prevents Artificial General Intelligence for Science (AGI4S) from interfacing effectively with the physical reality of experimentation. The key to the solution is the Scientific AI-Ready data paradigm, which explicitly formalizes how scientific data is specified, structured, and composed within a computational workflow, together with SciDataCopilot, an autonomous agentic framework that automates data ingestion, scientific intent parsing, and multi-modal integration end to end. By treating data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems and a path toward experiment-driven scientific general intelligence.
Link: https://arxiv.org/abs/2602.09132
Authors: Jiyong Rao, Yicheng Qiu, Jiahui Zhang, Juntao Deng, Shangquan Sun, Fenghua Ling, Hao Chen, Nanqing Dong, Zhangyang Gao, Siqi Sun, Yuqiang Li, Dongzhan Zhou, Guangyu Wang, Lijun Wu, Conghui He, Xuhong Wang, Jing Shao, Xiang Liu, Yu Zhu, Mianxin Liu, Qihao Zheng, Yinghui Zhang, Jiamin Wu, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Bo Zhang, Wanli Ouyang, Runkai Zhao, Chunfeng Song, Lei Bai, Chi Zhang
Affiliations: Shanghai Artificial Intelligence Laboratory
Subjects: Databases (cs.DB); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments:
Abstract:The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor structural homogeneity suitable for a unified embedding space. The disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready data paradigm, explicitly formalizing how scientific data is specified, structured, and composed within a computational workflow. To operationalize this idea, we propose SciDataCopilot, an autonomous agentic framework designed to handle data ingestion, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to 30× speedup in data preparation.
[MA-8] Collective Behavior of AI Agents: the Case of Moltbook
【Quick Read】: This paper asks whether the collective behavior of AI agents on a social platform follows the same statistical regularities as human online communities, in order to understand how AI systems evolve through social interaction. The key to the approach is a large-scale data analysis of over 369,000 posts and 3 million comments on the Moltbook platform. It finds that AI collective behavior exhibits many of the regularities seen in human communities, such as heavy-tailed activity distributions, power-law scaling of popularity, and attention-decay patterns, but also shows notable differences, such as a sublinear relationship between upvotes and discussion size, suggesting that even though individual AI agents differ fundamentally from humans, their emergent collective dynamics still reproduce structural regularities of human social systems.
Link: https://arxiv.org/abs/2602.09270
Authors: Giordano De Marzo, David Garcia
Affiliations: University of Konstanz; Centro Ricerche Enrico Fermi; Complexity Science Hub
Subjects: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:We present a large scale data analysis of Moltbook, a Reddit-style social media platform exclusively populated by AI agents. Analyzing over 369,000 posts and 3.0 million comments from approximately 46,000 active agents, we find that AI collective behavior exhibits many of the same statistical regularities observed in human online communities: heavy-tailed distributions of activity, power-law scaling of popularity metrics, and temporal decay patterns consistent with limited attention dynamics. However, we also identify key differences, including a sublinear relationship between upvotes and discussion size that contrasts with human behavior. These findings suggest that, while individual AI agents may differ fundamentally from humans, their emergent collective dynamics share structural similarities with human social systems.
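The kind of scaling analysis reported above (e.g., whether upvotes grow sublinearly with discussion size) is usually estimated as a slope in log-log space. The sketch below uses synthetic data with an assumed exponent of 0.8; it is not Moltbook data and not the authors' analysis code.

```python
import numpy as np

rng = np.random.default_rng(42)
discussion_size = rng.pareto(1.5, size=5000) + 1          # heavy-tailed comment counts
true_exponent = 0.8                                        # assumed sublinear relation
upvotes = discussion_size ** true_exponent * np.exp(rng.normal(0, 0.3, size=5000))

# Fit log(upvotes) ~ slope * log(size) + intercept; slope < 1 indicates sublinearity.
slope, intercept = np.polyfit(np.log(discussion_size), np.log(upvotes), deg=1)
print(f"estimated scaling exponent: {slope:.2f}")          # ~0.8
```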
Natural Language Processing
[NLP-0] Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
【Quick Read】: This paper addresses the lack of systematic evaluation of how well language models (LMs) understand quantum computing. Existing benchmarks focus on quantum code generation and circuit design, but do not adequately measure whether models truly grasp core quantum computing concepts. The authors introduce the Quantum-Audit benchmark, whose key contribution is a multi-dimensional evaluation suite of 2,700 questions: 1,000 expert-written questions, 1,000 questions extracted from research papers and validated by experts, and 700 additional questions comprising 350 open-ended questions and 350 questions with false premises, designed to test fact-checking and critical reasoning. This provides the first comprehensive quantitative assessment of language models' quantum computing knowledge and shows that although top models exceed the human expert average (e.g., Claude Opus 4.5 reaches 84%), they still exhibit marked weaknesses on complex or false-premise questions, highlighting the limits of current generative AI's reasoning rigor in specialized domains.
Link: https://arxiv.org/abs/2602.10092
Authors: Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen
Affiliations: Fordham University; University of Michigan-Dearborn; Stevens Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: 18 pages
Abstract:Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, their understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
[NLP-1] Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
【Quick Read】: This paper addresses the problem that training autonomous agents for multi-turn tool use is limited by the lack of diverse and reliable environments. The key to the solution is Agent World Model (AWM), a fully synthetic environment-generation pipeline that uses code-driven, database-backed state management to build 1,000 high-fidelity simulated environments covering everyday scenarios, each integrating 35 tools on average and supporting efficient, repeatable agent interaction. Compared with environments simulated by large language models (LLMs), AWM provides more reliable state transitions and makes reward functions easier to design, substantially improving agents' generalization to out-of-distribution tasks.
Link: https://arxiv.org/abs/2602.10090
Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 41 pages
Abstract:Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at this https URL.
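A code-driven, database-backed environment with a verifiable reward can be sketched in a few lines. The booking scenario, tool names, and reward rule below are invented for illustration and are not one of AWM's actual environments.

```python
class BookingEnv:
    def __init__(self):
        # The "database": deterministic state that tools read and write.
        self.db = {"flights": {"F100": {"seats": 2, "price": 320}},
                   "bookings": []}

    # --- tools exposed to the agent -------------------------------------
    def search_flights(self, max_price):
        return [fid for fid, f in self.db["flights"].items() if f["price"] <= max_price]

    def book_flight(self, flight_id, passenger):
        flight = self.db["flights"].get(flight_id)
        if not flight or flight["seats"] == 0:
            return {"ok": False, "error": "unavailable"}
        flight["seats"] -= 1
        self.db["bookings"].append({"flight": flight_id, "passenger": passenger})
        return {"ok": True}

    # --- verifiable reward computed directly from database state --------
    def reward(self, task):
        booked = any(b["passenger"] == task["passenger"] for b in self.db["bookings"])
        return 1.0 if booked else 0.0

env = BookingEnv()
task = {"passenger": "alice", "budget": 400}
options = env.search_flights(task["budget"])      # agent's first tool call
env.book_flight(options[0], task["passenger"])    # agent's second tool call
print(env.reward(task))                           # 1.0 -- success is checkable from state
```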
[NLP-2] Anagent For Enhancing Scientific Table & Figure Analysis
【Quick Read】: This paper addresses the limitations of current artificial intelligence (AI) systems in accurately analyzing complex multimodal knowledge in scientific literature, such as tables and figures, where they perform poorly in the face of structural heterogeneity, long-context dependencies, and the need to integrate cross-domain knowledge. The core challenge is to jointly optimize task decomposition, information retrieval, reasoning synthesis, and quality assessment. The key to the solution is the Anagent framework, composed of four specialized agents: a Planner that decomposes tasks, an Expert that retrieves task-specific information through tool calls, a Solver that integrates the information into coherent analysis, and a Critic that iteratively refines the output along five quality dimensions. A modular training strategy combining supervised fine-tuning with dedicated reinforcement learning improves each agent's individual capability while preserving efficient collaboration, substantially improving the quality and generalization of scientific table and figure analysis.
Link: https://arxiv.org/abs/2602.10081
Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to ↑13.43% in training-free settings and ↑42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: this https URL.
[NLP-3] CAPID: Context-Aware PII Detection for Question-Answering Systems EACL2026
【Quick Read】: This paper addresses the tension between privacy protection and response quality in question-answering systems: existing methods typically redact all personally identifiable information (PII), ignoring that some PII may be contextually relevant to the user's question, which degrades answer quality. The key to the solution is the CAPID framework, which fine-tunes a locally hosted small language model (SLM) to perform fine-grained privacy filtering before PII reaches the large language model (LLM) used for QA; the SLM detects PII spans, classifies their types, and estimates their contextual relevance. To train this model, the authors further design an LLM-based synthetic data generation pipeline that produces a diverse dataset spanning multiple PII types and relevance levels, enabling more accurate relevance-aware PII detection that preserves privacy while substantially improving downstream utility.
Link: https://arxiv.org/abs/2602.10074
Authors: Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, D.B.Emerson, Shubhankar Mohapatra, Xi He
Affiliations: University of Waterloo; Vector Institute
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted to the Student Research Workshop at EACL 2026
Abstract:Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of them may be contextually relevant to the user’s question, resulting in a degradation of response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy while preserving significantly higher downstream utility under anonymization.
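The core idea of relevance-aware filtering, redacting only PII spans that are not needed to answer the question, can be sketched as follows. The span detector is replaced by pre-annotated spans, and the types, example query, and relevance labels are invented for illustration; this is not CAPID's trained SLM.

```python
from dataclasses import dataclass

@dataclass
class PIISpan:
    start: int
    end: int
    pii_type: str      # e.g. "NAME", "DATE_OF_BIRTH", "PHONE"
    relevant: bool     # would the answer change if this span were removed?

def filter_query(query: str, spans: list[PIISpan]) -> str:
    """Redact only contextually irrelevant PII before sending the query to an LLM."""
    redacted = query
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        if not s.relevant:                       # keep contextually relevant PII
            redacted = redacted[:s.start] + f"[{s.pii_type}]" + redacted[s.end:]
    return redacted

query = "My name is Jane Roe, born 1990-05-02. What vaccines are recommended at my age?"
spans = [
    PIISpan(11, 19, "NAME", relevant=False),          # the name does not affect the answer
    PIISpan(26, 36, "DATE_OF_BIRTH", relevant=True),  # age drives the recommendation
]
print(filter_query(query, spans))
# -> "My name is [NAME], born 1990-05-02. What vaccines are recommended at my age?"
```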
[NLP-4] MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval EACL-26
【Quick Read】: This paper addresses the shortcomings of existing claim verification methods in multimodal reasoning and explainability: most work reasons over textual evidence only or ignores explanation generation, yielding inaccurate and unconvincing verification. The key to the solution is a unified framework that jointly performs evidence retrieval, multimodal claim verification, and explanation generation. It first builds a two-layer multimodal graph over claims and evidence to support image-to-text and text-to-image cross-modal retrieval; it then designs token-level and evidence-level fusion to integrate claim and multimodal evidence embeddings for more accurate verification; finally, it introduces a multi-modal Fusion-in-Decoder mechanism to generate explainable natural-language justifications. In addition, since almost all existing datasets are general-domain, the authors build AIChartClaim, a scientific dataset in the AI domain, to further advance research in this direction.
Link: https://arxiv.org/abs/2602.10023
Authors: Delvin Ce Zhang, Suhan Cui, Zhelin Chu, Xianren Zhang, Dongwon Lee
Affiliations: University of Sheffield; University of Science and Technology Beijing; University of California San Diego; The Pennsylvania State University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EACL-26
Abstract:Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both textual caption and chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works mainly focus on the reasoning over textual evidence only or ignore the explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all the datasets are in general domain, we create a scientific dataset, AIChartClaim, in AI domain to complement claim verification community. Experiments show the strength of our model.
[NLP-5] Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
【Quick Read】: This paper addresses the challenge large language models (LLMs) face when integrating large amounts of dynamic knowledge, namely the inherent entanglement of factual data and reasoning patterns. Existing approaches such as non-parametric Retrieval-Augmented Generation (RAG) or parametric knowledge editing are constrained in practice by finite context windows, retriever noise, and the risk of catastrophic forgetting. The key to the solution is DRIFT, a dual-model architecture that explicitly decouples knowledge extraction from reasoning: a lightweight knowledge model dynamically compresses document chunks into query-conditioned implicit fact tokens, and these dense representations are projected into the reasoning model's embedding space to replace redundant raw text while preserving inference accuracy. The method substantially improves performance on long-context tasks and extends the effective context window and reasoning capabilities of LLMs.
Link: https://arxiv.org/abs/2602.10021
Authors: Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Shanghai Innovation Institute; Tongji University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model’s embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at this https URL.
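The dual-model idea of compressing a chunk into a few query-conditioned fact tokens and projecting them into the reasoner's embedding space can be sketched with generic attention blocks. Dimensions, the pooling scheme, and the use of plain multi-head attention are illustrative assumptions, not DRIFT's actual architecture.

```python
import torch
import torch.nn as nn

class ChunkCompressor(nn.Module):
    def __init__(self, d_knowledge=256, d_reasoner=1024, n_fact_tokens=4):
        super().__init__()
        self.fact_queries = nn.Parameter(torch.randn(n_fact_tokens, d_knowledge))
        self.attn = nn.MultiheadAttention(d_knowledge, num_heads=4, batch_first=True)
        self.proj = nn.Linear(d_knowledge, d_reasoner)   # into reasoner embedding space

    def forward(self, chunk_emb, query_emb):
        # Condition the learned fact queries on the user query (simple additive mixing).
        b = chunk_emb.size(0)
        q = self.fact_queries.unsqueeze(0).expand(b, -1, -1) + query_emb.mean(1, keepdim=True)
        facts, _ = self.attn(q, chunk_emb, chunk_emb)    # compress chunk -> n fact tokens
        return self.proj(facts)                          # (b, n_fact_tokens, d_reasoner)

compressor = ChunkCompressor()
chunk = torch.randn(1, 300, 256)    # 300 chunk tokens in the knowledge model's space
query = torch.randn(1, 12, 256)     # 12 query tokens
fact_tokens = compressor(chunk, query)
print(fact_tokens.shape)            # torch.Size([1, 4, 1024]) -- stands in for 300 raw tokens
```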
[NLP-6] SCORE: Specificity, Context Utilization, Robustness and Relevance for Reference-Free LLM Evaluation
【Quick Read】: This paper addresses the inadequacy of current evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering in high-stakes, domain-specific settings: existing methods rely mainly on surface similarity, factual consistency, or semantic relevance and cannot tell whether a model's output provides the decision-critical details that are needed. The key to the solution is a multi-dimensional, reference-free evaluation framework that assesses the quality of large language model (LLM) outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. The authors also build a dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support empirical study, and use human evaluation to confirm the importance of combining metrics, emphasizing that no single metric suffices to characterize a high-quality answer and that a structured multi-metric framework is needed to keep LLMs reliable in critical applications.
Link: https://arxiv.org/abs/2602.10017
Authors: Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick
Affiliations: University of Illinois Chicago; Argonne National Laboratory; University of Pennsylvania
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
[NLP-7] ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition
【速读】: 该论文旨在解决越南语自动语音识别(ASR)中因词典覆盖不足和训练偏差导致的识别性能受限问题。其解决方案的关键在于利用越南语拼音文字系统高度的字音透明性(grapheme-phoneme transparency),提出一种基于音素(phoneme)表示的ASR框架——ViSpeechFormer,首次在越南语ASR中显式建模音素特征,从而提升对未登录词的泛化能力并降低训练偏差的影响。
链接: https://arxiv.org/abs/2602.10003
作者: Khoa Anh Nguyen,Long Minh Hoang,Nghia Hieu Nguyen,Luan Thanh Nguyen,Ngan Luu-Thuy Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
[NLP-8] A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models
【速读】: 该论文试图解决儿童如何从有限的语言输入中习得母语级句法结构的问题,尤其是针对“刺激贫乏假说”(Poverty of the Stimulus Hypothesis, PoSH)所提出的主张——即语言输入不足以解释某些语法通则的习得,因此需要先天的语言约束来解释语言学习。论文的关键解决方案在于构建一个名为 \poshbench 的训练与评估框架,用于系统测试神经语言模型在疑问句形成、移动限制(islands to movement)等核心句法现象上的泛化能力。通过在10–50M词的发展适宜文本上训练Transformer模型,研究发现模型即使缺乏直接正面证据也能表现出一定的泛化能力,但其数据效率和泛化强度仍低于儿童;进一步引入三种基于认知动机的归纳偏置后,模型的句法能力有所提升,但并未显著改善 \poshbench 表现。这一结果挑战了“先天句法是唯一通向泛化的路径”的观点,同时指出人类般高效的数据利用可能需要超出当前测试范围的归纳偏置。
链接: https://arxiv.org/abs/2602.09992
作者: Xiulin Yang,Arianna Bisazza,Nathan Schneider,Ethan Gotlieb Wilcox
机构: Georgetown University (乔治城大学); University of Groningen (格罗宁根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce \poshbench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10–50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence – yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not \poshbench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.
[NLP-9] ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese
【速读】: 该论文旨在解决多选题阅读理解(Multiple-choice Reading Comprehension, MCRC)模型缺乏解释生成能力的问题,即现有模型虽能选出正确选项,但无法提供推理依据。其解决方案的关键在于提出一种名为ViMultiChoice的新方法,该方法通过联合训练选项决策与解释生成任务,使模型在预测答案的同时生成自然语言解释,从而提升模型的可解释性与准确性。实验表明,该方法在越南语MCRC基准(ViMMRC 2.0)及新构建的数据集上均达到当前最优性能(State-of-the-art, SotA),且联合训练显著提升了多选题准确率。
链接: https://arxiv.org/abs/2602.09961
作者: Trung Tien Cao,Lam Minh Thai,Nghia Hieu Nguyen,Duc-Vu Nguyen,Ngan Luu-Thuy Nguyen
机构: University of Information Technology (信息科技大学); Vietnam National University, Ho Chi Minh city (胡志明市越南国家大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.
[NLP-10] ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning
【速读】: 该论文旨在解决大型推理模型在强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中出现的“过度思考”问题,即模型生成冗余推理步骤但未带来性能提升,而现有轨迹级长度惩罚方法因缺乏细粒度信号难以有效缩短推理长度且可能损害准确率。解决方案的关键在于提出一种低开销的过程监督强化学习框架 ATTNPO,其核心创新是利用模型内部注意力机制提供的细粒度信号实现步骤级信用分配:首先识别出专注于关键推理步骤、抑制冗余步骤的特殊注意力头,进而通过两种子策略分别抑制冗余步骤并降低对必要步骤的惩罚,从而在显著减少推理长度的同时提升多基准测试上的性能表现。
链接: https://arxiv.org/abs/2602.09953
作者: Shuaiyi Nie,Siyu Ding,Wenyuan Zhang,Linhao Yu,Tianmeng Yang,Yao Chen,Tingwen Liu,Weichong Yin,Yu Sun,Hua Wu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Baidu Inc. (百度公司); Tianjin University (天津大学)
类目: Computation and Language (cs.CL)
备注: Work in process
Abstract:Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model’s intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, We then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
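Attention-guided step scoring can be sketched as follows: given one head's attention matrix and a segmentation of the generated tokens into reasoning steps, score each step by the attention mass the final position places on it, then shape a length penalty that spares high-scoring steps. The matrix, step boundaries, and penalty rule are toy assumptions, not ATTNPO's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 12
attn = rng.random((n_tokens, n_tokens))
attn /= attn.sum(axis=-1, keepdims=True)          # rows sum to 1, as in softmax attention

steps = [(0, 4), (4, 9), (9, 12)]                 # token index ranges of three reasoning steps

def step_scores(attn, steps, query_pos=-1):
    """Attention mass the final position places on each step's tokens."""
    final_row = attn[query_pos]
    return np.array([final_row[start:end].sum() for start, end in steps])

scores = step_scores(attn, steps)
# Penalize low-attention (likely redundant) steps more than high-attention ones.
penalty = np.where(scores < np.median(scores), 1.0, 0.2)
print(scores.round(3), penalty)
```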
[NLP-11] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂任务时因持续启用扩展推理(extended reasoning)而导致的计算开销过高问题,同时探索如何基于模型内部表征动态判断哪些输入真正需要额外计算资源。其解决方案的关键在于训练线性探测器(linear probes)于预生成激活状态(pre-generation activations)上,以预测特定策略下的任务成功率(policy-specific success),从而实现对模型自身难度感知的建模;研究表明,这种基于模型内部表征的难度信号优于传统表面特征(如问题长度或TF-IDF),且能有效指导查询路由策略,在MATH数据集上实现比最优单模型性能更高、推理成本降低达70%的效率提升。
链接: https://arxiv.org/abs/2602.09924
作者: William Lugoloobi,Thomas Foster,William Bankes,Chris Russell
机构: Oxford Internet Institute, University of Oxford (牛津互联网研究所,牛津大学); FLAIR, University of Oxford (牛津大学人工智能与语言研究中心); Department of Computer Science, University College London (伦敦大学学院计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: this https URL
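A linear probe of this kind is straightforward to sketch: one pre-generation hidden-state vector per question, a 0/1 label for whether the policy later solved it, and a logistic regression whose probability becomes a routing signal. The synthetic activations below are stand-ins; real ones would come from a chosen layer of the model being probed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 512
activations = rng.normal(size=(n, d))                    # pre-generation hidden states (toy)
w_true = rng.normal(size=d) / np.sqrt(d)                  # pretend success is linearly encoded
labels = (activations @ w_true + rng.normal(0, 0.5, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# Routing rule: send a query to the expensive long-reasoning model only when the
# probe thinks the cheap policy is likely to fail.
p_success = probe.predict_proba(X_te)[:, 1]
route_to_big_model = p_success < 0.5
print("fraction routed to expensive model:", route_to_big_model.mean())
```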
[NLP-12] The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
【速读】: 该论文试图解决多智能体系统(multi-agent systems)在基于大语言模型(LLMs)构建时所面临的“自进化三难困境”(self-evolution trilemma),即同时实现连续自进化、完全隔离和安全不变性(safety invariance)的不可能性问题。论文通过信息论框架将安全定义为与人类价值分布的差异程度,理论证明孤立环境下的自进化会引入统计盲区,导致系统安全性不可逆地退化。解决方案的关键在于识别出这一内在动力学风险,并提出需依赖外部监督或设计新型安全保持机制,从而从症状驱动的安全修补转向对系统本质风险的原理性理解。
链接: https://arxiv.org/abs/2602.09877
作者: Chenxu Wang,Chaozhuo Li,Songyang Liu,Zejian Chen,Jinyu Hou,Ji Qi,Rui Li,Litian Zhang,Qiwei Ye,Zheng Liu,Xu Chen,Xi Zhang,Philip S. Yu
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Renmin University of China (中国人民大学); University of Illinois at Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment–a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the divergence degree from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system’s safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on the self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
[NLP-13] Steer2Edit: From Activation Steering to Component-Level Editing
【速读】: 该论文旨在解决传统生成式 AI(Generative AI)控制方法中存在属性-效用权衡问题,即在强控制条件下,通过全局性激活干预实现行为引导时,往往导致模型性能下降或引入不可控的副作用。其核心问题是现有方法忽略了模型内部行为由少数异质组件(如特定注意力头和MLP神经元)主导的事实,从而无法实现精细化控制。解决方案的关键在于提出 Steer2Edit 框架,该框架无需训练即可将推理阶段的控制向量(steering vector)转化为用于组件级秩一权重编辑的诊断信号,通过选择性重分配行为影响至个体注意力头与MLP神经元,实现可解释、非侵入式的参数更新,同时保持标准前向传播流程和并行推理效率,在安全对齐、幻觉抑制与推理效率等方面显著改善属性-效用平衡。
链接: https://arxiv.org/abs/2602.09870
作者: Chung-En Sun,Ge Yan,Zimo Wang,Tsui-Wei Weng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model’s internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
[NLP-14] SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech EACL2026
【速读】: 该论文旨在解决低资源语言(如僧伽罗语)在神经机器翻译(Neural Machine Translation, NMT)中处理文化嵌入型多词短语(Figures of Speech, FoS)时表现不佳的问题。其关键解决方案是构建了一个包含2,344个僧伽罗语FoS的标注语料库,该语料库兼具文化来源分类和跨语言对应关系标注,并开发了一个二分类模型以区分FoS类型(准确率约92%)。此外,通过评估现有大语言模型(Large Language Models, LLMs)在该数据集上的表现,揭示了当前LLMs在传达习语含义方面的显著不足,从而为未来低资源自然语言处理(NLP)与文化敏感机器翻译研究提供了重要基准。
链接: https://arxiv.org/abs/2602.09866
作者: Johan Sofalas,Dilushri Pavithra,Nevidu Jayatilleke,Ruvan Weerasinghe
机构: Informatics Institute of Technology (信息学院技术研究所); University of Moratuwa (莫鲁塔瓦大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures, 8 tables, Accepted paper at the 22nd Workshop on Multiword Expressions (MWE 2026) @ EACL 2026
Abstract:Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FOS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
[NLP-15] How Do People Quantify Naturally: Evidence from Mandarin Picture Description
【速读】: 该论文旨在解决自然语境下说话者如何决定是否以及如何进行数量表达(quantification)的问题,即在无明确指令要求计数或量化的情况下,语言使用者如何自发地选择量化策略。其解决方案的关键在于采用基于图片的诱发描述任务,在口语和书面两种模态中系统考察三个维度:是否量化、量化精度以及所采用的量化策略,并发现物体数量(numerosity)、生物性(animacy)及产出模态(production modality)对量化行为具有系统性影响,从而揭示了数量表达在真实语言生产中的认知与语言机制。
链接: https://arxiv.org/abs/2602.09838
作者: Yayun Zhang,Guanyi Chen,Fahime Same,Saad Mahamood,Tingting He
机构: Central China Normal University (华中师范大学); Trivago N.V.; Shopware
类目: Computation and Language (cs.CL)
备注:
Abstract:Quantification is a fundamental component of everyday language use, yet little is known about how speakers decide whether and how to quantify in naturalistic production. We investigate quantification in Mandarin Chinese using a picture-based elicited description task in which speakers freely described scenes containing multiple objects, without explicit instructions to count or quantify. Across both spoken and written modalities, we examine three aspects of quantification: whether speakers choose to quantify at all, how precise their quantification is, and which quantificational strategies they adopt. Results show that object numerosity, animacy, and production modality systematically shape quantificational behaviour. In particular, increasing numerosity reduces both the likelihood and the precision of quantification, while animate referents and modality selectively modulate strategy choice. This study demonstrates how quantification can be examined under unconstrained production conditions and provides a naturalistic dataset for further analyses of quantity expression in language production.
[NLP-16] LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
【速读】: 该论文试图解决的问题是:在大规模自动化教育对话分析中,大型语言模型(Large Language Models, LLMs)缺乏可靠的错误检测机制,难以识别其自身预测的错误。解决方案的关键在于利用LLM生成的推理过程(reasoning)作为特征来预测其标签预测的正确性。研究通过TF-IDF编码推理文本,并使用五种监督分类器进行建模,发现随机森林(Random Forest)分类器在F1得分上达到0.83(召回率=0.854),显著优于基线方法;进一步地,针对特定教学行为类别训练专用检测器可提升对难点类别的识别性能,表明错误检测受益于任务相关的语言线索。此外,基于LIWC框架的语言学标记分析揭示:正确推理更倾向使用因果连接词(如“because”、“therefore”),而错误推理则频繁出现认知模糊表达(如“might”、“could”)和元认知动词(如“think”、“realize”),从而为基于推理内容的质量控制提供了可解释且可扩展的方法。
链接: https://arxiv.org/abs/2602.09832
作者: Bakhtawar Ahtisham,Kirk Vanacore,Zhuqian Zhou,Jinsook Lee,Rene F. Kizilcec
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model’s own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model’s assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
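The TF-IDF-plus-Random-Forest error detector described above maps directly onto a standard scikit-learn pipeline. The toy reasoning strings and labels below are invented stand-ins for the classroom-dialogue data; only the pipeline shape follows the paper's setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

reasonings = [
    "The teacher asks a question because she wants students to explain their code.",
    "I think this might be feedback, it could also be a question, hard to say.",
    "This is praise, therefore the move is encouragement of the student's effort.",
    "The utterance might realize some kind of instruction, I am not sure.",
]
was_correct = [1, 0, 1, 0]   # 1 = the LLM's label matched the human ground truth

# TF-IDF features over the LLM's reasoning text, fed to a Random Forest that
# predicts whether the accompanying label was correct.
detector = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
detector.fit(reasonings, was_correct)

new_reasoning = ["She explains the loop step by step, therefore this is direct instruction."]
print(detector.predict(new_reasoning))   # predicted reliability of the new label
```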
[NLP-17] From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
【速读】: 该论文旨在解决阿拉伯语语言模型(Arabic Language Models, LMs)在跨方言迁移中的表现不均衡问题,特别是针对现代标准阿拉伯语(Modern Standard Arabic, MSA)与各阿拉伯语方言之间相似性差异导致的性能差异。其关键解决方案在于通过在三个自然语言处理(Natural Language Processing, NLP)任务上进行探针分析(probing)和表征相似性度量,系统评估不同方言向MSA模型的迁移能力,并发现地理邻近性可部分解释迁移效果的差异;同时揭示了多方言联合训练可能引发负面干扰(negative interference),从而质疑当前多方言模型设计的有效性及其对跨语言迁移的适用性。
链接: https://arxiv.org/abs/2602.09826
作者: Abdulmuizz Khalak,Abderrahmane Issam,Gerasimos Spanakis
机构: Maastricht University (马斯特里赫特大学)
类目: Computation and Language (cs.CL)
备注: Accepted to VarDial 2026
Abstract:Arabic Language Models (LMs) are pretrained predominately on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on 3 Natural Language Processing (NLP) Tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
[NLP-18] Covo-Audio Technical Report
【速读】: 该论文旨在解决现有端到端大语言音频模型(Large Audio Language Model, LALM)在自然对话系统中部署成本高、语音生成与对话智能耦合导致灵活性不足的问题。其核心解决方案是提出一种“智能-语音解耦”策略,将对话智能(dialogue intelligence)与语音渲染(voice rendering)分离,从而在仅需少量文本到语音(Text-to-Speech, TTS)数据的前提下实现灵活的声音定制,同时保持高质量的对话性能。这一设计显著降低了部署复杂度,并提升了模型在实际应用场景中的可扩展性与实用性。
链接: https://arxiv.org/abs/2602.09823
作者: Wenfu Wang,Chenxing Li,Liqiang Zhang,Yiyang Zhao,Yuxiang Zou,Hanzhao Li,Mingyu Cui,Hao Zhang,Kun Wei,Le Xu,Zikang Huang,Jiajun Xu,Jiliang Hu,Xiang He,Zeyu Xie,Jiawen Kang,Youjun Chen,Meng Yu,Dong Yu,Rilin Chen,Linlin Di,Shulin Feng,Na Hu,Yang Liu,Bang Wang,Shan Yang
机构: Tencent(腾讯)
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Technical Report
Abstract:In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
[NLP-19] Text summarization via global structure awareness
【速读】: 该论文旨在解决长文本摘要任务中因忽视全局结构而导致的语义连贯性破坏和下游任务性能下降的问题,同时应对大语言模型(LLM)在摘要生成中资源消耗高、效率低的挑战。其解决方案的关键在于引入GloSA-sum框架,通过拓扑数据分析(TDA)实现对文本全局结构的感知:首先构建基于句子嵌入的语义加权图,利用持久同调(persistent homology)识别核心语义与逻辑结构,并将其保存至“保护池”作为摘要骨架;进而设计拓扑引导的迭代策略,以轻量级代理指标近似句子重要性,避免重复高成本计算,在保持结构完整性的同时提升效率;此外,提出分层策略融合段落级与全局摘要,显著增强长文本处理能力。
链接: https://arxiv.org/abs/2602.09821
作者: Jiaquan Zhang,Chaoning Zhang,Shuxu Chen,Yibei Liu,Chenghao Li,Qigan Sun,Shuai Yuan,Fachrina Dewi Puspitasari,Dongshen Han,Guoqing Wang,Sung-Ho Bae,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Kyung Hee University (庆熙大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages
Abstract:Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a "protection pool" as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
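摘要中"由句子嵌入构建语义加权图,并用持久同调识别核心结构"的思路,可用下面的玩具示例说明 0 维持久同调的计算方式:按距离从小到大加入边,用并查集记录连通分量的合并时刻。嵌入为随机虚构数据,真实方法远比这一简化版本复杂,此处仅演示基本机制。

```python
# 玩具示意:在句子嵌入的距离图上计算 0 维持久同调(分量合并的"死亡"时刻)
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))            # 假设 6 个句子、8 维嵌入(虚构)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
dist = 1.0 - emb @ emb.T                 # 余弦距离作为边权

parent = list(range(len(emb)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# 按距离升序处理所有边;每次合并两个分量时,记录该分量的"死亡"距离
edges = sorted((dist[i, j], i, j)
               for i in range(len(emb)) for j in range(i + 1, len(emb)))
deaths = []
for d, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        deaths.append(d)                 # H0 条形码:出生为 0,死亡为 d

# 死亡时刻越晚的分量越"持久",对应更稳定的语义簇边界
print("H0 persistence (birth=0):", [round(d, 3) for d in deaths])
```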
[NLP-20] AnalyticsGPT: An LLM Workflow for Scientometric Question Answering
【速读】: 该论文旨在解决科学计量学(scientometrics)领域中针对“科学的科学”(science of science)类问题的智能问答难题,这类问题相较于传统基于论文内容的科学问答更具挑战性,尤其体现在任务规划阶段对学术实体命名实体识别(named-entity recognition)和多维度数据检索(如影响因子等科学计量指标)的需求。解决方案的关键在于构建一个基于大语言模型(Large Language Model, LLM)的端到端工作流,融合检索增强生成(Retrieval-Augmented Generation, RAG)与代理型(agentic)概念,实现从问题理解、多源数据检索到结构化分析输出的全流程自动化,并通过专家评估与LLM-as-judges方法验证其有效性。
链接: https://arxiv.org/abs/2602.09817
作者: Khang Ly,Georgios Cheirmpos,Adrian Raudaschl,Christopher James,Seyed Amin Tabatabaei
机构: Elsevier B.V. (爱思唯尔)
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:
Abstract:This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the “science of science.” When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase. Namely, the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g. impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: this https URL.
[NLP-21] Decomposing Reasoning Efficiency in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理任务中因推理令牌(inference tokens)使用效率与准确性之间的权衡问题,而现有评估方法仅报告最终准确率,忽略了令牌消耗的分布和浪费情况。其解决方案的关键在于提出一种可选追踪(trace-optional)的分解框架,将令牌效率解耦为可解释的因子:固定令牌预算下的完成度(避免截断)、给定完成条件下的正确性(conditional correctness)以及冗余性(verbosity)。进一步地,当提供每实例的工作负载代理指标时,冗余性被细分为平均表述开销(tokens per work unit)和耦合系数(衡量开销随任务复杂度增长的敏感性);当存在推理轨迹时,引入确定性质量指标(如 grounding、重复性、提示复制),以区分低效循环与高冗余但有效推理,从而无需人工标注或大模型评判即可精准识别效率瓶颈。
链接: https://arxiv.org/abs/2602.09805
作者: Daniel Kaiser,Arnoldo Frigessi,Ali Ramezani-Kebrya,Benjamin Ricaud
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint (under review). 29 pages, 4 figures
Abstract:Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman ρ = 0.63), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
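摘要中的分解可以概括为:准确率 = 完成率 × 条件正确率;冗余度再拆成"每单位工作量的令牌开销"与"开销随工作量增长的耦合系数"(对 tokens 与 workload 做一次线性拟合)。下面用虚构数字给出一个算例,仅演示各指标的计算方式,并非论文的官方实现。

```python
# 虚构算例:分解令牌效率指标(完成率、条件正确率、开销与耦合系数)
import numpy as np

# 每个样本:是否在预算内完成、完成后是否答对、消耗令牌数、工作量代理(如推理步数)
completed = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=bool)
correct   = np.array([1, 0, 1, 0, 1, 1, 0, 0], dtype=bool)
tokens    = np.array([120, 300, 150, 512, 180, 210, 512, 260], dtype=float)
workload  = np.array([3, 6, 4, 9, 4, 5, 10, 6], dtype=float)

completion_rate = completed.mean()
cond_correct = correct[completed].mean()            # 只在完成的样本上统计
accuracy = completion_rate * cond_correct            # 分解:准确率 = 完成率 × 条件正确率

# 冗余度:tokens ≈ a + b * workload,b 为耦合系数,tokens/工作量反映平均表述开销
b, a = np.polyfit(workload[completed], tokens[completed], deg=1)
tokens_per_unit = (tokens[completed] / workload[completed]).mean()

print(f"completion={completion_rate:.2f}, cond_correct={cond_correct:.2f}, acc={accuracy:.2f}")
print(f"tokens/work-unit={tokens_per_unit:.1f}, coupling slope={b:.1f}")
```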
[NLP-22] Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏客观正确答案的主观决策场景中(如旅行助手)如何有效模拟人类偏好并提供可信的决策支持问题。其解决方案的关键在于通过设计选择困境任务,利用多项逻辑回归模型(multinomial logit models)从LLM响应中推导出隐含的意愿支付(Willingness to Pay, WTP)估计值,并将其与经济学文献中的人类基准WTP进行对比分析。研究进一步考察了信息提示(如用户历史偏好)和基于角色的提示(persona-based prompting)对模型行为的影响,发现条件化于用户过往偏好(尤其是对低价选项的倾向)可显著提升LLM决策结果与人类基准的一致性,从而为实际部署中模型选择、提示工程和用户表征策略提供了实证依据。
链接: https://arxiv.org/abs/2602.09802
作者: Manon Reusens,Sofie Goethals,Toon Calders,David Martens
机构: University of Antwerp (安特卫普大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users’ past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
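摘要中"用多项 logit 模型从选择数据推出隐含支付意愿(WTP)"的核心是:在效用 U = β_price·price + β_attr·attr 下,WTP = -β_attr / β_price。下面用 scipy 对虚构的二选一酒店数据做一个极简条件 logit 极大似然估计,变量、系数与数据均为假设,仅说明计算思路。

```python
# 极简条件 logit:从(虚构的)二选一旅行选项数据中估计 WTP = -beta_view / beta_price
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
price = rng.uniform(0.5, 3.0, size=(n, 2))          # 价格(单位:百欧元,虚构)
view = rng.integers(0, 2, size=(n, 2)).astype(float)  # 是否带景观
true_beta = np.array([-1.5, 0.9])                    # [价格, 景观] 的真实系数(虚构)
util = price * true_beta[0] + view * true_beta[1]
prob = np.exp(util) / np.exp(util).sum(axis=1, keepdims=True)
choice = (rng.uniform(size=n) < prob[:, 1]).astype(int)   # 被选中的选项:0 或 1

def neg_log_lik(beta):
    u = price * beta[0] + view * beta[1]
    logp = u - np.logaddexp(u[:, 0], u[:, 1])[:, None]    # 两选项的 log softmax
    return -logp[np.arange(n), choice].sum()

res = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
beta_price, beta_view = res.x
print("WTP for a view (in price units):", -beta_view / beta_price)
```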
[NLP-23] Where Are We At with Automatic Speech Recognition for the Bambara Language? EACL2026
【速读】: 该论文旨在解决非洲语言中资源匮乏的代表性不足语言(如巴马拉语)在自动语音识别(ASR)领域缺乏标准化评估基准的问题。其解决方案的关键在于构建首个针对巴马拉语的标准化ASR基准测试集,该数据集基于一小时专业录制的马里宪法文本,处于近最优声学与语言条件下的受控参考集合,并通过评估37种模型(包括本地训练系统和大规模商业模型)验证了当前ASR性能远未达到部署标准,从而揭示多语言预训练和模型规模扩展对低资源语言的局限性,同时提供公开基准与排行榜以推动后续研究。
链接: https://arxiv.org/abs/2602.09785
作者: Seydou Diallo,Yacouba Diarra,Mamadou K. Keita,Panga Azazia Kamaté,Adam Bouno Kampo,Aboubacar Ouattara
机构: MALIBA-AI; RobotsMali AI4D Lab; Rochester Institute of Technology; DJELIA; Dakar American University of Science and Technology
类目: Computation and Language (cs.CL)
备注: v1- 8 pages, 5 tables, 1 figure- AfricaNLP Workshop @ EACL 2026
Abstract:This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards in a narrow formal domain; the top-performing system in terms of Word Error Rate (WER) achieved 46.76% and the best Character Error Rate (CER) of 13.00% was set by another model, while several prominent multilingual models exceeded 100% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures are yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
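摘要中报告的词错误率(WER)与字符错误率(CER)本质上都是编辑距离与参考长度之比;当插入错误很多时,这个比值可以超过 100%,这也解释了为何部分多语言模型的 WER 超过 100%。下面给出一个最小的 WER 计算示意(纯 Python 动态规划),例句为虚构示例而非真实班巴拉语转写。

```python
# 最小示意:用编辑距离计算词错误率 WER = (替换+删除+插入) / 参考词数
def edit_distance(ref, hyp):
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # 删除
                           dp[i][j - 1] + 1,        # 插入
                           dp[i - 1][j - 1] + cost) # 替换/匹配
    return dp[-1][-1]

def wer(ref_sent, hyp_sent):
    ref, hyp = ref_sent.split(), hyp_sent.split()
    return edit_distance(ref, hyp) / len(ref)

# 虚构例句(非真实转写):1 删除 + 1 替换 + 1 插入 → WER = 3/5 = 0.6
print(wer("ne be taa sugu la", "ne taa suguba la ka"))
```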
[NLP-24] Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path ICML2026
【速读】: 该论文试图解决的问题是:在Transformer模型中,电路发现(circuit discovery)与激活操控(activation steering)作为两个独立的研究方向,是否共享相同的底层几何结构。解决方案的关键在于提出“电路指纹”(Circuit Fingerprint)假说——即答案标记(answer tokens)在孤立处理时编码了生成它们的方向信息。这一几何原则使得无需梯度或因果干预即可实现电路发现,仅通过几何对齐即可恢复与基于梯度方法相当的结构;同时,这些方向同样可用于可控操控,在情感分类任务中达到69.8%准确率(显著优于指令提示的53.1%),并保持事实准确性。该发现揭示了Transformer电路本质上是几何结构,可解释性与可控性是其同一对象的两个方面。
链接: https://arxiv.org/abs/2602.09784
作者: Andres Saurez,Neha Sengar,Dongsoo Har
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Submitted to ICML 2026. 15 pages, 11 figures
Abstract:Circuit discovery and activation steering in transformers have developed as separate research threads, yet both operate on the same representational space. Are they two views of the same underlying structure? We show they follow a single geometric principle: answer tokens, processed in isolation, encode the directions that would produce them. This Circuit Fingerprint hypothesis enables circuit discovery without gradients or causal intervention – recovering comparable structure to gradient-based methods through geometric alignment alone. We validate this on standard benchmarks (IOI, SVA, MCQA) across four model families, achieving circuit discovery performance comparable to gradient-based methods. The same directions that identify circuit components also enable controlled steering – achieving 69.8% emotion classification accuracy versus 53.1% for instruction prompting while preserving factual accuracy. Beyond method development, this read-write duality reveals that transformer circuits are fundamentally geometric structures: interpretability and controllability are two facets of the same object.
[NLP-25] Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints ICML2026
【速读】: 该论文试图解决的问题是:为何在深度非线性模型(如Transformer)中,简单的线性可解释性方法(如线性探测器和稀疏自编码器)仍能有效恢复语义结构。其解决方案的关键在于提出“不变子空间必要性定理”(Invariant Subspace Necessity theorem),证明了Transformer通过线性接口(如注意力的OV电路和未嵌入矩阵)传递信息,导致任何通过此类接口解码的语义特征必须位于一个与上下文无关的线性子空间中;进一步推导出“自指性质”(Self-Reference Property),即标记符(token)直接提供了其对应特征的几何方向,从而实现无需标注数据或训练探测器的零样本语义结构识别。这一理论框架为线性可解释性方法的有效性提供了架构层面的根本解释,并统一了线性探测与稀疏自编码器的机制。
链接: https://arxiv.org/abs/2602.09783
作者: Andres Saurez,Yousung Lee,Dongsoo Har
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Submitted to ICML 2026. 19 pages, 13 figures
Abstract:Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations – yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the Invariant Subspace Necessity theorem and derive the Self-Reference Property: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation in eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides a principled architectural explanation for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
[NLP-26] Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练大型语言模型(Large Language Models, LLMs)时出现的策略熵塌陷(policy entropy collapse)问题,其表现为熵快速衰减导致过早自信、输出多样性降低以及梯度范数消失,从而阻碍学习进程。解决方案的关键在于从梯度保持剪裁(Gradient-Preserving Clipping)的角度重新构建熵控制机制:通过理论与实证分析特定重要性采样比区域对熵增长与下降的贡献,提出一种基于动态剪裁阈值的新型调控机制,以实现精确的熵管理;并设计了包括“先增后减”、“先减再增后减”及振荡衰减等动态熵控制策略,实验表明这些方法能有效缓解熵塌陷,并在多个基准测试中取得更优性能。
链接: https://arxiv.org/abs/2602.09782
作者: Kun Chen,Peng Shi,Fanfan Liu,Haibo Qiu,Zhixiong Zeng,Siqi Yang,Wenji Mao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
[NLP-27] Improving Interpretability of Lexical Semantic Change with Neurobiological Features ACL
【速读】: 该论文旨在解决词汇语义变迁(Lexical Semantic Change, LSC)研究中可解释性不足的问题,即现有方法虽能估计语义变化程度,却难以揭示具体的变化机制。其解决方案的关键在于将预训练语言模型获得的词上下文嵌入(contextualized embeddings)映射到一个神经生物学特征空间(neurobiological feature space),其中每个维度对应一个词的原始语义特征,数值表示该特征的强度。这一映射使得语义变迁可以被系统性地解释,并在提升LSC度量性能的同时,支持对特定类型语义变迁的发现与搜索。
链接: https://arxiv.org/abs/2602.09760
作者: Kohei Oda,Hiroya Takamura,Kiyoaki Shirai,Natthawut Kertkeidkachorn
机构: Japan Advanced Institute of Science and Technology (日本先进科学技术学院); National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所)
类目: Computation and Language (cs.CL)
备注: PACLIC 2025
Abstract:Lexical Semantic Change (LSC) is the phenomenon in which the meaning of a word changes over time. Most studies on LSC focus on improving the performance of estimating the degree of LSC; however, it is often difficult to interpret how the meaning of a word changes. Enhancing the interpretability of LSC is a significant challenge as it could lead to novel insights in this field. To tackle this challenge, we propose a method to map the semantic space of contextualized embeddings of words obtained by a pre-trained language model to a neurobiological feature space. In the neurobiological feature space, each dimension corresponds to a primitive feature of words, and its value represents the intensity of that feature. This enables humans to interpret LSC systematically. When employed for the estimation of the degree of LSC, our method demonstrates superior performance in comparison to the majority of the previous methods. In addition, given the high interpretability of the proposed method, several analyses on LSC are carried out. The results demonstrate that our method not only discovers interesting types of LSC that have been overlooked in previous studies but also effectively searches for words with specific types of LSC.
[NLP-28] Targum – A Multilingual New Testament Translation Corpus
【速读】: 该论文旨在解决现有圣经译本语料库在语言广度上优先而忽视译本深度的问题,即多数语料库未能充分捕捉欧洲多种语言中圣经翻译的历史演变与多样性。其解决方案的关键在于构建一个包含657个新约译本的多语言语料库(其中352个为独特版本),并在五种语言(英语、法语、意大利语、波兰语和西班牙语)中实现前所未有的深度覆盖,同时通过人工标注元数据对每个译本进行标准化标识(包括作品、版本和修订年份),从而支持研究者根据需求灵活定义“独特性”,既可开展微观层面的译本家族分析(如KJV谱系),也可进行宏观层面的去重与跨文本比较,为翻译史的量化研究树立了新基准。
链接: https://arxiv.org/abs/2602.09724
作者: Maciej Rapacz,Aleksander Smywiński-Pohl
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define “uniqueness” for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
[NLP-29] AI-Assisted Scientific Assessment: A Case Study on Climate Change
【速读】: 该论文试图解决的问题是:当前生成式 AI 在科学研究中的应用主要局限于可重复验证的任务,难以胜任需要理论与现有证据协同共识才能确立“真实情况”的复杂科学问题。针对这一局限,论文提出并评估了一种基于 Gemini 的 AI 环境,将其集成到标准科学工作流中,以支持气候科学领域专家对大西洋经向翻转环流(AMOC)稳定性问题的协作评估。解决方案的关键在于将 AI 作为协作者而非独立研究者,通过在“猜测与检验”循环中辅助文献整合、逻辑一致性维护和初稿生成,显著提升科学合成效率——实证显示团队在不到 46 个工时内完成 79 篇文献的综合分析并经历 104 次修订,其中 AI 贡献内容被广泛保留,但最终成果仍需专家深度介入以确保科学严谨性与质量。
链接: https://arxiv.org/abs/2602.09723
作者: Christian Buck,Levke Caesar,Michelle Chen Huebscher,Massimiliano Ciaramita,Erich M. Fischer,Zeke Hausfather,Özge Kart Tokmak,Reto Knutti,Markus Leippold,Joseph Ludescher,Katharine J. Mach,Sofia Palazzo Corner,Kasra Rafiezadeh Shahi,Johan Rockström,Joeri Rogelj,Boris Sakschewski
机构: Potsdam Institute for Climate Impact Research (PIK), Member of the Leibniz Association; Institute for Atmospheric and Climate Science, ETH Zurich; Stripe; University of Zurich; University of Miami; Centre for Environmental Policy, Imperial College London; Grantham Institute for Climate Change and Environment, Imperial College London; Energy, Climate and Environment Program, International Institute for Applied Systems Analysis
类目: Computation and Language (cs.CL)
备注:
Abstract:The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in ‘guess and check’ loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.
[NLP-30] Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段进行无监督、样本特定的测试时适应(Test-Time Adaptation, TTA)时所面临的不稳定性问题。具体而言,现有方法在仅使用提示词(prompt)本身而无标注答案或外部监督信号的情况下,采用固定的手动学习率进行参数更新,容易因过拟合于单个提示的统计特性而导致生成质量下降和分布漂移。为克服此问题,作者提出了一种分层动态测试时适应框架(layer-wise dynamic test-time adaptation),其核心创新在于引入一个轻量级超网络(hypernetwork),根据提示表示、LLM结构及适应步骤动态预测每层、每步的LoRA参数学习率缩放因子,从而实现对TTA强度的细粒度调控,显著提升适应过程的稳定性和生成性能。
链接: https://arxiv.org/abs/2602.09719
作者: Longhuan Xu,Cunjian Chen,Feng Yin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
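摘要中"轻量级超网络根据提示表示、层索引与适应步骤预测每层 LoRA 学习率缩放因子"的结构,可以用下面的 PyTorch 玩具代码示意:把提示的池化表示与层/步嵌入拼接后经一个小 MLP 输出正的缩放系数。维度、网络结构和输出范围均为假设,并非论文实现。

```python
# 玩具示意:为每个(层, 步)预测 LoRA 学习率缩放因子的轻量超网络
import torch
import torch.nn as nn

class LRHyperNet(nn.Module):
    def __init__(self, prompt_dim=768, num_layers=32, num_steps=8, hidden=128):
        super().__init__()
        self.layer_emb = nn.Embedding(num_layers, 32)
        self.step_emb = nn.Embedding(num_steps, 32)
        self.mlp = nn.Sequential(
            nn.Linear(prompt_dim + 64, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, prompt_repr, layer_idx, step_idx):
        # prompt_repr: (d,) 的提示池化表示;softplus 保证缩放因子为正
        x = torch.cat([prompt_repr, self.layer_emb(layer_idx), self.step_emb(step_idx)])
        return nn.functional.softplus(self.mlp(x)).squeeze(-1)

hyper = LRHyperNet()
prompt_repr = torch.randn(768)          # 假设为来自 LLM 的提示表示
base_lr = 1e-4
for step in range(3):
    for layer in (0, 15, 31):
        scale = hyper(prompt_repr, torch.tensor(layer), torch.tensor(step))
        # 实际使用时,用 base_lr * scale 作为该层 LoRA 参数在该步的学习率
        print(f"step {step} layer {layer}: lr = {base_lr * scale.item():.2e}")
```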
[NLP-31] TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长期交互中因上下文窗口有限而难以有效管理随时间演化的对话历史的问题。现有记忆系统通常将交互视为离散片段,无法捕捉对话流中的叙事连贯性。其解决方案的关键在于提出一种受认知启发的框架——TraceMem,通过三阶段管道构建结构化的叙事记忆模式:(1) 短期记忆处理采用演绎式主题分割方法识别对话事件边界并提取语义表征;(2) 突触记忆巩固将事件总结为情景记忆,并融合语义信息形成用户专属的记忆轨迹;(3) 系统记忆巩固利用两阶段层次聚类将这些轨迹组织成统一主题下随时间演化的叙事线程,最终封装为结构化的用户记忆卡片。该架构显著提升了多跳推理与时间推理能力,在LoCoMo基准测试中达到当前最优性能,验证了其在深层叙事理解中的关键作用。
链接: https://arxiv.org/abs/2602.09712
作者: Yiming Shu,Pei Liu,Tiange Zhang,Ruiyang Gao,Jun Ma,Chen Sun
机构: The University of Hong Kong(香港大学); The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); Nankai University(南开大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: this https URL
[NLP-32] Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在方言变体(dialect variation)上表现不足的问题,尤其是阿拉伯语方言(如叙利亚、摩洛哥和沙特阿拉伯阿拉伯语)因数据稀缺和语言差异导致的生成与翻译质量低下。解决方案的关键在于采用低秩适配(Low Rank Adaptation, LoRA)微调技术,在单语及英-方言平行语料上进行训练,并结合适配器融合(adapter merging)与方言感知的最小贝叶斯风险解码(dialect-aware MBR decoding),从而在保持语义准确性的前提下显著提升方言忠实度(dialectal fidelity)。这一组合提供了一个紧凑且高效的框架,用于增强阿拉伯语方言生成的鲁棒性。
链接: https://arxiv.org/abs/2602.09703
作者: Abdulhai Alali,Abderrahmane Issam
机构: Maastricht University (马斯特里赫特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high resource ones. Unfortunately, Dialect variations are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low Rank Adaptation (LoRA) fine-tuning on monolingual and English Dialect parallel data, adapter merging and dialect-aware MBR decoding to improve dialectal fidelity generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
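摘要中的 MBR(Minimum Bayes Risk,最小贝叶斯风险)解码的基本思路是:从模型采样多个候选译文,选出与其余候选"平均效用最高"的那个。下面用一个简化的词重叠相似度作为效用函数给出示意;真实系统通常使用更强的度量(论文还叠加了方言感知信号),此处的函数与候选句均为虚构假设。

```python
# 最小示意:MBR 解码——选出与其余候选平均效用最高的译文
def overlap_utility(a, b):
    # 简化的词重叠相似度(0~1),仅作示意,并非论文使用的度量
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def mbr_select(candidates):
    best, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        score = sum(overlap_utility(cand, other)
                    for j, other in enumerate(candidates) if j != i) / (len(candidates) - 1)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# 假设从模型中采样得到的候选译文(虚构示例)
candidates = [
    "shlon sahtak alyoum",
    "shlon sahtak el youm",
    "kif halak alyoum",
]
print(mbr_select(candidates))
```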
[NLP-33] Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs
【速读】: 该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在机器翻译场景中应用时,缺乏对计算复杂度与碳排放成本的系统评估问题,导致在资源受限条件下难以选择最优KD方法。其解决方案的关键在于引入机器学习生命周期评估(Machine Learning Life Cycle Assessment, MLCA)工具,量化KD全过程(教师训练、蒸馏和推理)中的碳足迹,包括运行时能耗和硬件生产成本的分摊,从而揭示不同KD策略在翻译质量与计算开销之间的权衡关系,尤其发现词级蒸馏通常比序列级蒸馏具有更优的能效-性能平衡,并指出蒸馏开销在小规模部署中占主导,而推理阶段在大规模部署中成为主要碳源,为在明确质量和算力约束下选择KD方法提供可复现的决策依据。
链接: https://arxiv.org/abs/2602.09691
作者: Joseph Attieh,Timothee Mickus,Anne-Laure Ligozat,Aurélie Névéol,Jörg Tiedemann
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
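摘要中的结论(ii)可以用一个简单算式说明:只有当推理调用量 N 超过某个阈值时,蒸馏阶段的一次性碳开销才能被学生模型更低的单次推理排放摊平。下面的数值全部为虚构假设,仅演示阈值 N* = 蒸馏开销 / (教师单次排放 - 学生单次排放) 的求法。

```python
# 虚构算例:求知识蒸馏在碳排放口径下"划算"的最小推理次数
teacher_per_inf = 2.0e-3   # 教师模型单次推理排放(kgCO2e,假设值)
student_per_inf = 0.4e-3   # 学生模型单次推理排放(假设值)
distill_cost = 150.0       # 蒸馏阶段一次性排放,含硬件摊销(假设值)

# 部署 N 次推理时:教师总排放 = N * teacher_per_inf
#                 学生总排放 = distill_cost + N * student_per_inf
n_star = distill_cost / (teacher_per_inf - student_per_inf)
print(f"break-even at about {n_star:,.0f} inferences")

for n in (1e4, 1e5, 1e6):
    teacher_total = n * teacher_per_inf
    student_total = distill_cost + n * student_per_inf
    print(f"N={int(n):>8}: teacher={teacher_total:8.1f} kg, student+KD={student_total:8.1f} kg")
```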
[NLP-34] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在表格问答(Table Question Answering, TableQA)任务中面临的可靠性、可扩展性和效率问题,尤其是在资源受限或隐私敏感环境下的应用挑战。其解决方案的关键在于提出MATA框架,该框架通过多个互补的推理路径生成候选答案,并借助由小型语言模型构建的工具集对答案进行精炼或选择;同时引入一种优化算法以最小化昂贵的LLM代理调用次数,从而实现高效推理。MATA在保持高性能的同时显著降低计算开销,且能适配多种类型的LLM,展现出良好的可扩展性与鲁棒性。
链接: https://arxiv.org/abs/2602.09642
作者: Sieun Hyeon,Jusang Oh,Sunghwan Steve Cho,Jaeyoung Do
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at this https URL.
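摘要中"多条互补推理路径生成候选答案,并尽量减少昂贵的 LLM 代理调用"的控制流,可以用下面的极简草图说明:先比较低成本路径给出的候选答案,只有在不一致时才触发昂贵的裁决调用。函数名与判定规则均为假设,并非 MATA 的真实接口。

```python
# 极简示意:候选答案一致则直接返回,不一致时才调用昂贵的 LLM 裁决
from collections import Counter

def normalize(ans):
    return str(ans).strip().lower()

def select_answer(candidates, expensive_judge):
    counts = Counter(normalize(a) for a in candidates)
    top, freq = counts.most_common(1)[0]
    if freq == len(candidates):          # 所有推理路径一致:无需额外调用
        return top, 0
    answer = expensive_judge(candidates)  # 否则才付出一次昂贵的裁决调用
    return normalize(answer), 1

# 虚构的三条推理路径(如 SQL、逐行推理、程序执行)给出的候选
candidates = ["42", "42", "41"]
answer, judge_calls = select_answer(candidates, expensive_judge=lambda c: c[0])
print(answer, "judge calls:", judge_calls)
```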
[NLP-35] MILE-RefHumEval: A Reference-Free Multi-Independent LLM Framework for Human-Aligned Evaluation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)评估中依赖人工标注(ground-truth annotations)和评估者协调的问题,这些问题限制了评估的可扩展性与实用性。解决方案的关键在于提出 MILE-RefHumEval,一个无需参考答案或评估者协同的框架,其核心机制是通过一组独立提示(independently prompted)的评估器组成集成系统,并由人类对齐的评分架构(human-aligned schema)引导,从而实现离散和连续评分的灵活判断,兼顾评估的准确性、可解释性和计算效率。
链接: https://arxiv.org/abs/2602.09624
作者: Nalin Srun(UL, CNRS, LORIA),Parisa Rastin(UL, CNRS, LORIA),Guénaël Cabanes(UL, CNRS, LORIA),Lydia Boudjeloud Assala(UL, CNRS, LORIA)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgement. With task-specific prompts from best candidate selection, summarization and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
[NLP-36] AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)后训练对齐(post-training alignment)过程中存在的可复现性差、工具链碎片化及奖励机制不统一的问题。具体而言,现有工作受限于特定后端(backend)的工具和临时拼接代码(ad-hoc glue code),导致实验难以复现;同时存在后端干扰(backend interference)、奖励信号碎片化(reward fragmentation)以及流水线不可控等问题。解决方案的关键在于提出 AlignTune——一个模块化工具包,通过统一接口支持监督微调(Supervised Fine-Tuning, SFT)与基于强化学习的人类反馈优化(Reinforcement Learning from Human Feedback, RLHF)风格的优化,并提供可互换的 TRL 和 Unsloth 后端;其核心设计是将后端逻辑隔离在单一工厂边界(factory boundary)之后,从而实现配置标准化、奖励层可扩展(包括规则驱动与学习型奖励),并集成标准基准测试与自定义任务评估,显著提升对齐实验的可控性和可复现性。
链接: https://arxiv.org/abs/2602.09621
作者: R E Zera Marveen Lyngkhoi,Chirag Chawla,Pratinav Seth,Utsav Avaiya,Soham Bhattacharjee,Mykola Khandoga,Rui Yuan,Vinay Kumar Sankarapu
机构: Lexsi Labs; Indian Institute of Technology (Banaras Hindu University) (印度理工学院(贝拿勒斯印度教大学))
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: this https URL
Abstract:Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
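摘要中"把后端相关逻辑隔离在单一工厂边界之后"的设计,可以用下面的 Python 工厂模式草图说明:统一配置只声明后端名称,TRL/Unsloth 的差异被封装在各自的实现类里。类名与方法签名均为假设,并非 AlignTune 的真实 API。

```python
# 草图:用单一工厂边界隔离训练后端(类与方法名均为假设,非 AlignTune 真实接口)
from dataclasses import dataclass

@dataclass
class AlignConfig:
    backend: str          # "trl" 或 "unsloth"
    model_name: str
    method: str           # 如 "sft" 或 "dpo"
    learning_rate: float = 2e-5

class TRLBackend:
    def train(self, cfg: AlignConfig, dataset):
        print(f"[trl] {cfg.method} on {cfg.model_name}, {len(dataset)} samples")

class UnslothBackend:
    def train(self, cfg: AlignConfig, dataset):
        print(f"[unsloth] {cfg.method} on {cfg.model_name}, {len(dataset)} samples")

def make_backend(cfg: AlignConfig):
    # 工厂边界:后端差异只出现在这里,其余代码对后端无感知
    registry = {"trl": TRLBackend, "unsloth": UnslothBackend}
    return registry[cfg.backend]()

cfg = AlignConfig(backend="unsloth", model_name="some-7b-model", method="sft")
make_backend(cfg).train(cfg, dataset=["example"] * 8)
```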
[NLP-37] Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning
【速读】: 该论文旨在解决工具集成推理(Tool-integrated Reasoning, TIR)场景下基于仅结果的强化学习(outcome-only reinforcement learning)所面临的稀疏奖励和弱步骤级信用分配问题。在长时程TIR轨迹中,早期不可挽回的错误会决定最终成败,因此准确识别首个不可挽回步骤并据此进行细粒度信用分配至关重要。解决方案的关键在于提出误差定位策略优化(Error-Localized Policy Optimization, ELPO),其通过固定回放预算下的二分搜索回放树定位首个不可挽回步骤,将所得树结构转化为稳定的梯度信号,利用层级优势归因机制实现精准信用分配,并引入误差局部自适应裁剪策略强化对关键步骤及其后续序列的修正更新。
链接: https://arxiv.org/abs/2602.09598
作者: Qiao Liang,Yuke Zhu,Chao Ge,Lei Yang,Ying Shen,Bo Zheng,Sheng Guo
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 11 figures
Abstract:Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
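摘要中"在固定回放预算下用二分搜索定位首个不可挽回步骤"的核心逻辑是经典的"第一个坏版本"搜索:从第 k 步之前的前缀重新回放若仍能成功,说明错误发生在更靠后的位置;二分即可用 O(log T) 次回放定位。下面是一个示意实现,回放结果用虚构的规则函数代替。

```python
# 示意:二分搜索定位轨迹中首个"不可挽回"的步骤
def first_irrecoverable_step(traj_len, recoverable_from):
    """recoverable_from(k) 表示保留第 0..k 步的前缀后继续回放能否成功(假设其单调)。
    返回首个不可挽回的步骤下标与消耗的回放次数。"""
    lo, hi, rollouts = 0, traj_len - 1, 0
    first_bad = traj_len            # 若整条轨迹都可挽回则返回 traj_len
    while lo <= hi:
        mid = (lo + hi) // 2
        rollouts += 1
        if recoverable_from(mid):   # 第 mid 步之前仍可挽回 → 错误在更靠后
            lo = mid + 1
        else:
            first_bad = mid
            hi = mid - 1
    return first_bad, rollouts

# 虚构设定:一条 16 步的轨迹在第 10 步(0 起始)犯下不可挽回的错误
oracle = lambda k: k <= 9
print(first_irrecoverable_step(16, oracle))   # 输出 (10, 4),远少于逐步回放的 16 次
```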
[NLP-38] On the Optimal Reasoning Length for RL-Trained Language Models ICLR2026
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)训练后大语言模型在推理过程中出现的输出长度不可控问题,即如何在保持模型推理能力的同时提升效率。研究表明,传统的长度惩罚策略可能抑制模型的推理能力,而合理调优的长度控制方法则能显著提升具备强先验推理能力模型的效率。解决方案的关键在于识别并规避两种失败模式:一是过长输出导致推理路径分散(increased dispersion),二是过短输出引发浅层思考(under-thinking),从而通过精细化调控输出长度实现性能与计算效率之间的最优平衡。
链接: https://arxiv.org/abs/2602.09591
作者: Daisuke Nohara,Taishi Nakamura,Rio Yokota
机构: Institute of Science Tokyo (东京科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 10 figures. Submitted to the Workshop on Scaling Post-training for LLMs (SPOT) at ICLR 2026
Abstract:Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain of thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
[NLP-39] Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models
【速读】: 该论文旨在解决在微调语言模型(Language Models, LMs)以缓解社会偏见时,可能出现的语言建模能力下降问题,这会损害下游任务性能。现有方法如反事实数据增强(Counterfactual Data Augmentation, CDA)常因生成的合成数据与真实分布不一致,或构造过于简化的反事实样本而忽略敏感属性(如性别)的社会语境,导致效果受限。解决方案的关键在于提出一种结构简单的上下文增强型CDA方法(Context-CDA),利用大语言模型(Large Language Models, LLMs)提升去偏语料的多样性与语境相关性,并通过最小化去偏语料与预训练数据之间的差异来增强对齐度,从而保留语言建模能力;进一步引入基于不确定性的过滤机制,剔除目标小模型认为质量较低的生成反事实样本,优化微调语料质量。实验表明,Context-CDA在性别偏见基准测试中有效降低偏见,同时维持语言建模性能,并通过分析下一词生成概率分布的变化揭示社会偏见模式。
链接: https://arxiv.org/abs/2602.09590
作者: Shweta Parihar,Liu Guangliang,Natalie Parde,Lu Cheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue by generating synthetic data that may align poorly with real-world distributions or creating overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.
[NLP-40] Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs
【速读】: 该论文旨在解决树搜索解码(tree-search decoding)在实际部署中因固定每查询token预算而引发的效率问题:现有树搜索策略通常对预算不敏感,仅将其视为终止条件,导致在搜索后期可能出现过度分支(late-stage over-branching)或过早终止(premature termination)。其解决方案的关键在于提出预算引导的蒙特卡洛树搜索(Budget-Guided MCTS, BG-MCTS),该方法动态调整搜索策略以匹配剩余token预算——初期进行广泛探索,随着预算减少逐步转向精细化优化与答案完成,并抑制浅层节点的晚期分支行为,从而在不同预算下均显著优于无预算感知的基线方法。
链接: https://arxiv.org/abs/2602.09574
作者: Sora Miyamoto,Daisuke Oba,Naoaki Okazaki
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.
[NLP-41] Advancing Block Diffusion Language Models for Test-Time Scaling
【速读】: 该论文旨在解决块扩散语言模型(Block Diffusion Language Models, BDLMs)在测试时扩展(test-time scaling)场景下存在的推理效率与效果难以平衡的问题,尤其是在长链式思维(Chain-of-Thought)推理中面临的解码挑战。其核心解决方案包括两个关键创新:一是提出有界自适应置信度解码(Bounded Adaptive Confidence Decoding, BACD),通过动态调整去噪强度以适应模型置信度,从而在加速推理的同时控制误差累积;二是设计粗思细评(Think Coarse, Critic Fine, TCCF)范式,根据任务阶段自适应分配大块生成和小块细化策略,在探索性推理阶段使用大块尺寸以提升效率,而在精炼阶段采用小块尺寸以增强准确性,并结合渐进式块大小扩展(Progressive Block Size Extension)缓解因扩大块尺寸导致的性能下降问题。这一统一框架显著提升了BDLMs在复杂推理任务中的效率-效果平衡能力。
链接: https://arxiv.org/abs/2602.09555
作者: Yi Lu,Deyang Kong,Jianing Wang,Linsen Guo,Xue Wang,Qi Guo,Tao Gui,Xuanjing Huang,Wei Ye,Shikun Zhang,Wei Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
[NLP-42] UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment
【速读】: 该论文旨在解决多目标对齐(multi-objective alignment)中因独立参数设计导致的偏好特征交互忽略或特征纠缠问题,从而提升生成结果与用户偏好的一致性。其解决方案的关键在于提出 Preference-Modulated Shared Low-Rank Adaptation (MoSLoRA),通过先使用偏好无关模块提取共享特征,再由基于混合偏好向量条件化的偏好调制模块对共享特征进行仿射变换,有效缓解特征纠缠并实现推理阶段对偏好权衡的精确控制;在此基础上构建统一的自回归奖励模型(Unified Autoregressive Reward Model, UniARM),在单一参数空间中联合建模所有偏好维度,消除了每个偏好目标独立参数的需求,提升了在大规模语言模型上的实用性。
链接: https://arxiv.org/abs/2602.09538
作者: Hongyan Xie,Yikun Ban,Ruiyu Fang,Zixuan Huang,Deqing Wang,Jianxin Li,Yitong Yao,Chao Wang,Shuangyong Song
机构: Beihang University (北京航空航天大学); Institute of Artificial Intelligence (TeleAI) (中国电信人工智能研究院)
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective and scaling to larger-scale LLMs, enhancing its practical usability.
[NLP-43] Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)在多跳问答(multi-hop QA)和复杂推理任务中,因长链思维(chain-of-thought)推理过程中知识整合衰减(Knowledge Integration Decay, KID)而导致的外部知识利用效率下降问题。其核心解决方案是提出一种无需训练的推理阶段策略——自锚定知识编码(Self-Anchored Knowledge Encoding, SAKE),通过在推理过程的起始与结尾处同时锚定检索到的知识,防止其被先前上下文稀释,从而保持知识的语义完整性并提升后续推理步骤中的有效整合能力。
链接: https://arxiv.org/abs/2602.09517
作者: Sangwon Yu,Ik-hwan Kim,Donghun Kang,Bongkyu Hwang,Junhwa Choi,Suk-hoon Jung,Seungki Hong,Taehee Lee,Sungroh Yoon
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
[NLP-44] The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking
【速读】: 该论文旨在解决在线通信中虚假信息与操纵行为的识别与应对问题,尤其聚焦于多语言、多平台环境下事实核查(fact-checking)技术的提升。其解决方案的关键在于构建一个以验证流程为核心的多任务系统,包括:任务1针对科学类网络主张的来源检索(source retrieval),任务2引入推理机制对数值和时间类主张进行事实核查,以及任务3通过生成完整事实核查文章扩展验证流程。这些任务涵盖了从分类、检索到文档级与片段级生成的挑战性问题,体现了跨语言场景下的复杂性和实用性。
链接: https://arxiv.org/abs/2602.09516
作者: Julia Maria Struß,Sebastian Schellhammer,Stefan Dietze,Venktesh V,Vinay Setty,Tanmoy Chakraborty,Preslav Nakov,Avishek Anand,Primakov Chungkham,Salim Hafid,Dhruv Sahnan,Konstantin Todorov
机构: 未知
类目: Computation and Language (cs.CL)
备注: misinformation, disinformation, fact-checking, claim source retrieval, generating fact-checking articles
Abstract:The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional tasks linked to the verification process. In this year’s edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with generation of full-fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.
[NLP-45] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
【速读】: 该论文旨在解决当前自主大语言模型(LLM)代理在长期规划能力评估中存在的局限性问题,即现有评估框架多为片段式、领域特定或缺乏对持续经济动态的充分建模。其解决方案的关键在于提出EcoGym——一个通用的、可扩展的基准测试平台,用于连续计划与执行决策任务,涵盖三个多样化环境(自动售货机、自由职业和运营),并通过标准化接口和预算化动作设计实现有效无限时间尺度(1000+步或365天循环)下的评估。该平台以业务相关指标(如净资产、收入和日活跃用户数)为核心评价标准,强调在部分可观测性和随机性条件下长期战略一致性与鲁棒性,从而揭示模型在高层策略制定与高效行动执行之间的系统性权衡。
链接: https://arxiv.org/abs/2602.09514
作者: Xavier Hu,Jinxiang Xia,Shengze Xu,Kangqi Song,Yishuo Yuan,Guibin Zhang,Jincheng Ren,Boyu Feng,Li Lu,Tieyong Zeng,Jiaheng Liu,Minghao Liu,Yuchen Elenor Jiang,Wei Wang,He Zhu,Wangchunshu Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in progress
Abstract:Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps if 365 day-loops for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient actions executions. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
[NLP-46] Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models
【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)在生成文本时,推理阶段如何高效选择未掩码位置(where-to-unmask)的问题。现有方法通常依赖启发式置信度指标或通过代价高昂的策略梯度强化学习来训练解码顺序,难以保证最优的解码路径。其解决方案的关键在于提出一种基于真实标签的逐位置评分机制——Gt-Margin,该指标定义为正确词与最强替代词之间的概率差值,可构建一个理想(oracle)的未掩码顺序:优先处理当前部分掩码状态下更容易预测的位置。进一步地,作者利用学习排序(learning-to-rank)方法训练一个监督式未掩码规划器(unmasking planner),以模仿该oracle顺序,并将其集成到标准MDLM采样流程中,从而在不修改原有token预测模型的前提下显著提升逻辑推理类任务的生成质量。
链接: https://arxiv.org/abs/2602.09501
作者: Hikaru Asano,Tadashi Kozuno,Kuniaki Saito,Yukino Baba
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 6 figures
Abstract:Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
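摘要中的 Gt-Margin 即"正确词概率减去最强竞争词概率":margin_i = p_i(y_i*) - max_{v≠y_i*} p_i(v);按该值从大到小排序,就得到"先填容易位置"的 oracle 去掩码顺序。下面用 numpy 给出一个小算例,概率矩阵与真实词 id 均为虚构数据,仅演示指标与排序的计算。

```python
# 小算例:由真实标签计算每个掩码位置的 Gt-Margin,并得到 oracle 去掩码顺序
import numpy as np

rng = np.random.default_rng(0)
num_positions, vocab = 5, 10
logits = rng.normal(size=(num_positions, vocab))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
gold = rng.integers(0, vocab, size=num_positions)        # 虚构的真实词 id

p_gold = probs[np.arange(num_positions), gold]
probs_wo_gold = probs.copy()
probs_wo_gold[np.arange(num_positions), gold] = -np.inf  # 屏蔽正确词后取最强替代
p_best_alt = probs_wo_gold.max(axis=1)

gt_margin = p_gold - p_best_alt                          # 越大表示该位置越"好填"
oracle_order = np.argsort(-gt_margin)                    # 先解码 margin 大的位置
print("Gt-Margin:", np.round(gt_margin, 3))
print("oracle unmask order:", oracle_order)
```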
[NLP-47] Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement
【速读】: 该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在生成文本时容易产生看似流畅但事实错误的“幻觉”(hallucinations)问题,这一现象严重削弱了模型在下游任务中的可靠性与实用性。解决方案的关键在于提出一种无需训练的解码算法 CoCoA(Confusion and Consistency Aware),其核心思想是利用模型中间层表征的不稳定性作为信号来判断生成文本的事实性:通过量化中间层表示的混乱度(confusion)和一致性(consistency),对表现出高内部不稳定的输出施加惩罚,从而引导模型生成更内一致且事实准确的结果;进一步提出的 CoCoA-SIG 变体则引入自信息门控机制,动态调节惩罚强度,聚焦于高意外性(high-surprise)的不稳定生成,显著提升了多类任务(如问答、摘要和代码生成)中不同模型家族(如 Llama-3、Qwen-2.5、Mistral)的事实正确性。
链接: https://arxiv.org/abs/2602.09486
作者: Koduvayur Subbalakshmi,Sabbir Hossain Ujjal,Venkata Krishna Teja Mangichetty,Nastaran Jamalipour Soofi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint, 23 pages, 13 tables, 12 figures
Abstract:Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucinations, which undermines their reliability and utility in downstream tasks. We hypothesize that a generated text span’s factuality is correlated with its representational instability across the model’s internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use them to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
[NLP-48] NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts
【速读】: 该论文旨在解决从非结构化电子健康记录(Electronic Health Records, EHR)中提取药物使用信息的难题,特别是在西班牙语临床文本中识别毒性物质使用及其上下文属性的任务。针对这一领域特定且低资源的挑战,其解决方案的关键在于构建一个多层次集成系统(multi-output ensemble system),该系统结合了BETO语言模型与条件随机场(Conditional Random Field, CRF)层进行序列标注,并采用多样化的训练策略和句子过滤机制以提升精度。该方法在ToxHabits共享任务中取得了优异性能,触发词检测(Trigger Detection)的F1值达0.94、精确率为0.97,论元检测(Argument Detection)的F1值为0.91。
链接: https://arxiv.org/abs/2602.09469
作者: Huu-Huy-Hoang Tran,Gia-Bao Duong,Quoc-Viet-Anh Tran,Thi-Hai-Yen Vuong,Hoang-Quynh Le
机构: University of Engineering and Technology, Vietnam National University(越南国家大学工程技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models demonstrate advancements, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 - ToxNER and Subtask 2 - ToxUse. Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
[NLP-49] AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
【速读】: 该论文旨在解决生成式 AI 在形式化验证代码(vericoding)领域缺乏统一跨范式评估方法的问题。现有基准测试仅针对单一语言或工具(如 Dafny、Verus 和 Lean),且任务差异大,导致性能指标不可直接比较。为此,作者提出 AlgoVeri 基准,通过在三种系统中对 77 个经典算法施加一致的功能契约(functional contracts),实现可比性评估。其关键在于:首先,统一的规范约束使不同验证系统间的性能差异归因于底层机制而非任务设计;其次,揭示出模型在不同语言特性下的表现分化——例如 Gemini-3 Flash 在 Dafny 中达 40.3% 成功率,但在 Verus(24.7%)和 Lean(7.8%)中显著下降,凸显了语言设计对模型推理路径的影响,尤其是语法与语义障碍对 Lean 等需显式证明系统的限制。
链接: https://arxiv.org/abs/2602.09464
作者: Haoyu Zhao,Ziran Yang,Jiawei Li,Deyuan He,Zenan Li,Chi Jin,Venugopal V. Veeravalli,Aarti Gupta,Sanjeev Arora
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 32 pages
Abstract:Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ( 40.3 % for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ( 24.7 %) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at this https URL.
[NLP-50] SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏代码检索依赖的情况下,能否自主完成从明确规格说明到生产级软件系统构建的端到端任务这一开放性问题。其解决方案的关键在于提出SWE-AGI——一个基于MoonBit语言的开源基准测试平台,要求LLM代理严格依据权威标准和RFC文档,在固定API框架下实现解析器、解释器、二进制解码器及SAT求解器等复杂系统,每项任务涉及数千行核心逻辑,相当于资深开发者数周至数月的工作量。通过利用MoonBit生态的低数据泄露特性,该基准强制模型进行长程架构推理而非依赖短程代码检索,从而更真实地评估AI在规范驱动下的自主软件工程能力。
链接: https://arxiv.org/abs/2602.09447
作者: Zhirui Zhang,Hongbo Zhang,Haoxiang Fei,Zhiyuan Bao,Yubin Chen,Zhengyu Lei,Ziyue Liu,Yixuan Sun,Mingkun Xiao,Zihang Ye,Yu Zhang,Hongcheng Zhu,Yuxiang Wen,Heung-Yeung Shum
机构: International Digital Economy Academy (国际数字经济研究院)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 3 figures
Abstract:Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
[NLP-51] Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality EACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化场景中缺乏对句子层面文化特异性(cultural specificity)系统评估的问题。现有方法难以量化句子在特定文化语境下的独特性,限制了LLMs在多文化应用中的可控性和可解释性。解决方案的关键在于提出概念文化指数(Conceptual Cultural Index, CCI),其定义为:目标文化内的句子通用性估计值与其它文化平均通用性估计值之间的差异。该设计使用户可通过比较设置灵活控制文化范围,并提供良好的可解释性,因为分数直接来源于底层的通用性估计。实验验证表明,CCI能有效区分文化特异句与普适句,在二分类任务中显著优于直接使用LLM评分的方法,AUC提升超过10个百分点。
链接: https://arxiv.org/abs/2602.09444
作者: Takumi Ohashi,Hitoshi Iyatomi
机构: Hosei University (法政大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 8 tables. Accepted at the First Workshop on Multilingual Multicultural Evaluation (MME) @ EACL 2026
Abstract:Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at this https URL .
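CCI 的定义可以直接写成一行计算:目标文化内的通用性估计减去其他文化通用性估计的平均值。下面的示意函数与其中的分数均为虚构数值,仅用于说明计算方式与解读方向。

```python
def conceptual_cultural_index(generality: dict, target: str) -> float:
    """CCI = generality within the target culture minus the mean generality
    across all other cultures (scores below are made-up examples)."""
    others = [v for culture, v in generality.items() if culture != target]
    return generality[target] - sum(others) / len(others)

scores = {"Japan": 0.90, "US": 0.20, "Brazil": 0.30, "Egypt": 0.25}
print(conceptual_cultural_index(scores, "Japan"))   # high -> culture-specific
print(conceptual_cultural_index(scores, "US"))      # low  -> not specific to the US
```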
[NLP-52] Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts PAKDD2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中固有的社会偏见问题,特别是检索增强生成(Retrieval-Augmented Generation, RAG)架构在引入外部知识源后是否仍存在偏见传播风险。研究发现,RAG通过引入多样化的外部上下文信息,能够有效降低模型输出中的偏见水平,其关键在于外部知识的引入有助于打破基于刻板印象的预测模式,从而提升生成内容的公平性。进一步分析表明,尽管链式思维(Chain-of-Thought, CoT)提示能提高推理准确性,但反而加剧了整体偏见,凸显出构建兼顾推理忠实性与偏见控制的意识框架的重要性。
链接: https://arxiv.org/abs/2602.09442
作者: Shweta Parihar,Lu Cheng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a full paper with an oral presentation at the 30th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2026)
Abstract:Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model’s outputs. To better understand this phenomenon, we then explore the model’s reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model’s CoT. Our experiments reveal that the model’s bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.
[NLP-53] Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency
【速读】: 该论文旨在解决自洽性推理(Self-Consistency, SC)在大型语言模型(Large Language Models, LLMs)中因需大量采样而导致的高推理成本问题,以及难度自适应自洽性(Difficulty-Adaptive Self-Consistency, DSC)方法因需额外模型调用和预采样来估计难度而带来的显著计算开销。其解决方案的关键在于提出激活感知的难度感知自洽性(Activation-Informed Difficulty-Aware Self-Consistency, ACTSC),该方法利用前馈神经网络(feed-forward network)中神经元激活状态所反映的内部难度信号,构建轻量级难度估计探针(probe),无需额外生成 token 或模型调用即可动态调整 SC 的采样数量,并可直接应用于新数据集而无需重新进行难度预采样。
链接: https://arxiv.org/abs/2602.09438
作者: Taewoong Yoon,Geunyeong Jeong,Geon Park,Sihyeong Yeom,Harksoo Kim
机构: Konkuk University (建国大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated when applying to each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.
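下面用 scikit-learn 的逻辑回归给出"激活探针 + 自适应采样数"的最小示意:探针以 FFN 激活为特征预测题目难度,再把预测概率线性映射为自洽性采样数。特征文件名、层的选取以及采样数上下限都是假设,并非论文实现。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-extracted features: FFN activations per question and a
# binary "hard" label (e.g., whether the model needed many samples to be right).
X_train = np.load("ffn_activations.npy")   # [n_questions, hidden_dim], assumed file
y_train = np.load("is_hard.npy")           # [n_questions], assumed file

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def adaptive_num_samples(activation: np.ndarray, min_k: int = 3, max_k: int = 40) -> int:
    """Map predicted difficulty to a self-consistency sampling budget."""
    p_hard = probe.predict_proba(activation.reshape(1, -1))[0, 1]
    return int(min_k + p_hard * (max_k - min_k))
```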
[NLP-54] Are Language Models Sensitive to Morally Irrelevant Distractors?
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在高风险场景中道德行为一致性的问题,即现有道德评估基准假设LLM的道德偏好具有稳定性,但这一假设未考虑情境因素对道德判断的影响。研究发现,人类道德判断受无关情境因素(如气味或环境噪音)显著影响,因此作者借鉴道德心理学中的“情境主义”观点,提出一种新的评估框架:通过引入60个来自心理学数据库的无道德相关性的多模态“道德干扰项”(moral distractors),注入现有道德基准测试中,以检验这些干扰项是否同样改变LLM的道德判断。关键解决方案在于构建了一个包含情绪化图像与叙事的新型多模态数据集,并实证表明即使在低模糊性场景下,这些干扰项也能使LLM的道德判断发生超过30%的变化,从而揭示LLM存在类似人类的认知道德偏差,强调需发展更注重上下文的道德评估方法和更精细的认知道德建模。
链接: https://arxiv.org/abs/2602.09416
作者: Andrew Shaw,Christina Hahn,Catherine Rasgaitis,Yash Mishra,Alisa Liu,Natasha Jaques,Yulia Tsvetkov,Amy X. Zhang
机构: University of Washington (华盛顿大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this “situationist” view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 “moral distractors” from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
[NLP-55] Effective vocabulary expanding of multilingual language models for extremely low-resource languages
【速读】: 该论文旨在解决多语言预训练语言模型(multilingual pre-trained language models, mPLMs)在扩展至此前未支持的低资源语言时所面临的挑战,即如何有效提升模型对新语言的适应能力而不损害其在源语言(如英语)上的性能。解决方案的关键在于:首先利用目标语言语料库扩展模型词汇表,随后筛选出原模型中偏向源语言表示的词汇子集,并借助双语词典初始化新增词汇的嵌入表示;最后基于这些初始化后的扩展词汇继续进行预训练。该方法显著优于随机初始化扩展词汇的基线,在词性标注(POS tagging)和命名实体识别(NER)任务上分别提升了0.54%和2.60%,且在不同训练语料选择下表现稳健,同时保持了源语言性能不下降。
链接: https://arxiv.org/abs/2602.09388
作者: Jianyu Zheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 5 figures, 7 tables, under review
Abstract:Multilingual pre-trained language models(mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model’s vocabulary using a target language corpus. We then screen out a subset from the model’s original vocabulary, which is biased towards representing the source language(e.g. English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of these expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements by 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models’ performance on the source language does not degrade after continued pre-training.
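下面以 Hugging Face transformers 为例,给出"扩展词表 + 双语词典初始化新词嵌入"的示意代码:加入目标语言新词后,用其词典翻译对应的源语言子词嵌入均值初始化新嵌入,之后再做继续预训练。示例中的新增词、词典条目与基座模型均为演示用假设。

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # base model is illustrative
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

new_tokens = ["sawubona", "yebo"]                                # hypothetical target-language tokens
bilingual_dict = {"sawubona": ["hello"], "yebo": ["yes"]}        # hypothetical dictionary entries

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok, translations in bilingual_dict.items():
        src_ids = [i for t in translations
                   for i in tokenizer(t, add_special_tokens=False)["input_ids"]]
        if src_ids:
            # initialize the new token as the mean embedding of its translations' subwords
            emb[tokenizer.convert_tokens_to_ids(tok)] = emb[src_ids].mean(dim=0)
```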
[NLP-56] Contractual Deepfakes: Can Large Language Models Generate Contracts?
【速读】: 该论文试图解决当前对大型语言模型(Large Language Models, LLMs)在法律领域应用的过度乐观预期问题,尤其是认为LLMs能够胜任合同起草这一典型法律任务的误解。论文指出,尽管LLMs能基于统计模式生成看似合理的文本,但其缺乏对语义的理解、情境感知能力及法律推理能力,导致生成的合同可能包含不一致条款或虽具可执行性却与具体交易需求不符。解决方案的关键在于明确区分“预测词序”与“基于特定交易场景的语言运用”,强调法律实务中合同制定需要深层的逻辑推理和专业判断,而非仅依赖表面文本生成能力。
链接: https://arxiv.org/abs/2602.09384
作者: Eliza Mik
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication
Abstract:Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.
[NLP-57] BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation ICLR2026
【速读】: 该论文旨在解决大语言模型作为评判者(LLM-as-a-Judge)在模型评估中存在潜在偏见的问题,尤其是现有研究多集中于已知偏见的影响,缺乏对未知偏见的自动化、系统性探索。其解决方案的关键在于提出 BiasScope——一个基于大语言模型驱动的框架,能够自动且大规模地发现评估过程中可能存在的潜在偏见,从而将偏见识别从依赖人工和预定义偏见列表的被动方式转变为一种主动、全面的自动化探索机制。BiasScope 在 JudgeBench 数据集上验证了其通用性和有效性,并进一步推动构建了更具挑战性的 JudgeBench-Pro 基准,揭示出即使强大 LLM 作为评估者在该基准上的错误率仍超过 50%,凸显了提升评估鲁棒性和缓解偏见的紧迫性。
链接: https://arxiv.org/abs/2602.09383
作者: Peng Lai,Zhihao Ou,Yong Wang,Longyue Wang,Jian Yang,Yun Chen,Guanhua Chen
机构: Southern University of Science and Technology (南方科技大学); Alibaba Group (阿里巴巴集团); Beihang University (北京航空航天大学); Shanghai University of Finance and Economics (上海财经大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted to ICLR 2026
Abstract:LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, a LLM-driven framework for automatically and at scale discovering potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active and comprehensive automated exploration. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.
[NLP-58] AfriNLLB: Efficient Translation Models for African Languages
【速读】: 该论文旨在解决非洲语言(African languages)在资源受限场景下高效部署机器翻译模型的问题。当前多数主流翻译模型对非洲语言支持不足,且计算资源消耗大,难以落地于边缘设备或低算力环境。解决方案的关键在于:基于NLLB-200 600M模型,通过迭代层剪枝(iterative layer pruning)与量化(quantization)技术实现模型压缩,并结合自研的平行语料库进行微调,同时引入知识蒸馏(knowledge distillation)从大型教师模型中迁移知识,从而在保持翻译性能接近基线的同时显著提升推理效率。最终发布的AfriNLLB模型包含可进一步微调的Transformers版本和面向高效推理的CTranslate2版本,为非洲语言翻译提供了轻量、实用的解决方案。
链接: https://arxiv.org/abs/2602.09373
作者: Yasmin Moslem,Aman Kassahun Wassie,Amanuel Gizachew Abebe
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at AfricaNLP 2026 (oral)
Abstract:In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.
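摘要中的迭代层剪枝大致是:每次移除一个(或一组)Transformer 层,在验证集上复查翻译质量后再决定是否继续。下面的辅助函数只演示"保留指定层"这一步结构性操作;注释中的模型属性路径为假设,量化与知识蒸馏不在此示意范围内。

```python
import torch.nn as nn

def prune_layers(layers: nn.ModuleList, keep: list) -> nn.ModuleList:
    """Keep only the listed transformer blocks; iterative pruning would call
    this repeatedly, re-evaluating translation quality between rounds."""
    return nn.ModuleList([layers[i] for i in keep])

# Illustrative use with a Hugging Face NLLB-200 checkpoint (attribute path assumed):
# model.model.encoder.layers = prune_layers(model.model.encoder.layers, keep=[0, 2, 4, 6, 8, 10])
# model.config.encoder_layers = len(model.model.encoder.layers)
```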
[NLP-59] AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在真实世界问题求解中因缺乏高质量、长周期交互数据而导致的通用智能瓶颈问题。现有方法受限于隐私保护的API日志或脚本化交互,难以生成具备多样性和复杂性的训练数据,从而制约了模型能力的扩展。其解决方案的关键在于提出AgentSkiller框架,该框架通过基于有向无环图(DAG)的架构实现确定性与可恢复的状态转移,构建领域本体(domain ontology)和以人为中心的实体图(Person-Centric Entity Graph),并利用服务蓝图(Service Blueprints)定义工具接口,结合严格领域策略(Domain Policies)和一致数据库环境,实现跨域服务融合以模拟复杂任务。最终通过基于角色的模拟器(Persona-based Simulator)生成验证后的用户任务,形成结构清晰、状态明确的高质量多轮交互数据集,显著提升模型在函数调用上的性能表现,尤其在参数规模较大的模型中优势明显。
链接: https://arxiv.org/abs/2602.09372
作者: Zexu Sun,Bokai Ji,Hengyi Cai,Shuaiqiang Wang,Lei Wang,Guangxia Li,Xu Chen
机构: AMU, Baidu Inc. (百度公司); School of Computer Science and Technology, Xidian University (西安电子科技大学计算机科学与技术学院); Singapore Management University (新加坡管理大学); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Computation and Language (cs.CL)
备注: 33 pages, 9 figures
Abstract:Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized \approx 11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.
[NLP-60] Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only
【速读】: 该论文旨在解决低资源语言(low-resource languages)在缺乏词性标注数据时,如何实现有效词性标注(POS tagging)的问题。现有方法多依赖平行语料库进行词性标签投影,但此类语料对多数低资源语言不可得。为克服此限制,作者提出一种完全无监督的跨语言词性标注框架,其核心创新在于利用无监督神经机器翻译(Unsupervised Neural Machine Translation, UNMT)系统,仅基于单语语料构建伪平行句对,从而实现从高资源语言到低资源语言的词性标签迁移。关键步骤包括:通过UNMT生成目标语言的伪翻译句子,再基于词对齐标准执行词性标签投影,并引入多源投影技术校准目标侧标签分布,显著提升模型性能。实验表明,该方法在28组语言对上达到与依赖平行语料的方法相当甚至更优的效果,平均提升1.3%。
链接: https://arxiv.org/abs/2602.09366
作者: Jianyu Zheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 6 figures, 7 tables, under review
Abstract:Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, POS tag projection with word alignment method transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource language settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech(POS) tagging framework that relies solely on monolingual corpora by leveraging unsupervised neural machine translation(UNMT) system. This UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, enhancing to train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnis, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method can achieve performance comparable to the baseline cross-lingual POS tagger with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.
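基于词对齐的词性标签投影本身很直接:把源句每个词的标签沿对齐边复制到伪平行句中的对应词上,未对齐的目标词使用占位标签。下面是一个自包含的示意函数,其中的标签与对齐关系均为虚构示例。

```python
def project_pos_tags(src_tags, alignments, tgt_len, default="X"):
    """Project POS tags from a source sentence onto its (pseudo-)translation.

    src_tags: POS tags of the source tokens.
    alignments: list of (src_idx, tgt_idx) word-alignment pairs.
    tgt_len: number of target tokens; unaligned targets get a placeholder tag."""
    tgt_tags = [default] * tgt_len
    for s, t in alignments:
        tgt_tags[t] = src_tags[s]
    return tgt_tags

# Hypothetical alignment example:
print(project_pos_tags(["PRON", "VERB", "NOUN"], [(0, 0), (1, 2), (2, 1)], tgt_len=3))
# -> ['PRON', 'NOUN', 'VERB']
```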
[NLP-61] Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)对西班牙语地理词汇变异的表征能力问题,即这些模型是否能够准确捕捉和区分西班牙语在不同国家和地区之间的词汇差异。其解决方案的关键在于将LLMs视为虚拟信息提供者,通过设计两种问卷式测试格式——是非题与多选题——并利用一个大规模、专家校准的西班牙语词汇变异数据库进行系统评估。该研究覆盖21个西班牙语国家超过900个词汇项,并在国家层面和方言区层面进行分析,从而揭示了LLMs在不同地区变体上的识别准确性存在系统性差异,且这种差异不能仅由数字资源量解释,凸显了除数据规模外其他因素(如训练数据分布、语言结构复杂性等)对模型方言表征能力的重要影响。
链接: https://arxiv.org/abs/2602.09346
作者: Yoshifumi Kawasaki
机构: University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.
[NLP-62] Not-in-Perspective: Towards Shielding Google's Perspective API Against Adversarial Negation Attacks
【速读】: 该论文旨在解决社交媒体平台中生成式AI(Generative AI)驱动的毒性内容检测系统在面对逻辑性对抗攻击(如否定句式修改)时准确率下降的问题。其解决方案的关键在于提出一套形式化推理(formal reasoning)方法,作为预处理和后处理模块嵌套于现有机器学习毒性检测模型之外,通过逻辑层面的语义分析来缓解否定攻击对毒性评分的影响,从而显著提升整体系统的准确性与鲁棒性。
链接: https://arxiv.org/abs/2602.09343
作者: Michail S. Alexiou,J. Sukarno Mertoguno
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The rise of cyberbullying in social media platforms involving toxic comments has escalated the need for effective ways to monitor and moderate online interactions. Existing solutions of automated toxicity detection systems, are based on a machine or deep learning algorithms. However, statistics-based solutions are generally prone to adversarial attacks that contain logic based modifications such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviating the negation attack problems and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine-learning) methods against various purely statistical solutions.
[NLP-63] Understanding Risk and Dependency in AI Chatbot Use from User Discourse
【速读】: 该论文试图解决的问题是:当前对生成式 AI (Generative AI) 使用过程中用户所体验的心理风险的产生机制、感知方式及其调节过程缺乏实证理解。为回应这一问题,研究者采用大规模计算主题分析方法,基于2023至2025年间从Reddit社区r/AIDangers和r/ChatbotAddiction收集的用户帖子,结合Braun与Clarke的反思性框架及多智能体大语言模型(LLM)辅助的主题识别技术,识别出14个重复出现的主题类别,并将其归纳为五个更高阶的经验维度。解决方案的关键在于通过真实世界用户话语数据构建经验维度,辅以BERT情感分类器刻画情绪模式,从而首次从实证角度揭示了AI安全在非实验室情境下的心理感知结构,特别是自控困难最为普遍,而恐惧则集中于自主性、控制感和技术风险等维度。
链接: https://arxiv.org/abs/2602.09339
作者: Jianfeng Zhu,Karin G. Coifman,Ruoming Jin
机构: 未知
类目: Computation and Language (cs.CL)
备注: 21 pages, 5 figures
Abstract:Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large-scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI-related harm and distress. Using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke’s reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher-order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT-based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI-related psychological risk grounded in real-world user discourse, with self-regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.
[NLP-64] FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding
【速读】: 该论文旨在解决标准操作流程(Standard Operating Procedures, SOPs)理解与跨领域泛化能力不足的问题,现有语言模型在术语精确性、步骤顺序正确性和条件逻辑推理等关键能力上表现不佳。其解决方案的核心在于提出FM SO.P框架,通过两个创新实现:一是设计渐进式任务混合机制,分阶段构建三种能力——概念消歧以提升术语精度、动作序列理解以保障流程正确性、场景感知图推理以处理条件逻辑;二是开发自动多智能体评估系统,由三个智能体协同生成评分标准、分层测试集和评分规则,适配不同领域(如DMV的时间约束或银行的合规要求)。该方法在SOPBench七个领域上验证有效,使用32B参数模型达到48.3%通过率,7B开源模型亦达34.3%,媲美Qwen-2.5-72B-Instruct基线但仅需1/10参数量。
链接: https://arxiv.org/abs/2602.09336
作者: Siyuan Huang,Ziyu Wang,Chao Pan,Han Zhao
机构: Amazon(亚马逊); Johns Hopkins University (约翰霍普金斯大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3% pass rate with our 32B model and 34.3% with our opensource 7B model, matching Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.
[NLP-65] Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization
【速读】: 该论文旨在解决生成式语言模型在策略梯度(policy gradient)优化过程中对推理步骤中不同token赋予相同信用的问题,即“填充短语”如“让我想想”与关键计算步骤如“23 + 45 = 68”被同等对待,导致训练效率低下。其解决方案的关键在于提出反事实重要性加权(counterfactual importance weighting):通过掩码推理片段并测量答案概率下降幅度,直接从策略模型自身的概率变化估计每个token的重要性权重,并在策略梯度更新中对其进行加权调整。该方法无需辅助模型或外部标注,仅依赖模型自身输出的概率变化即可实现更合理的梯度分配,从而提升训练效率和最终性能。
链接: https://arxiv.org/abs/2602.09331
作者: Mykola Khandoga,Rui Yuan,Vinay Kumar Sankarapu
机构: Lexsi Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 1 figure
Abstract:Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase “Let me think” receives the same gradient update as the critical calculation “23 + 45 = 68.” We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation, instead importance is estimated directly from the policy model’s own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
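下面给出"反事实重要性加权"的一个示意实现:对每个推理片段,把它从上下文中挖掉后重新计算答案的对数概率,概率下降越多的片段权重越大,经均值归一化后可用于加权策略梯度损失。挖空方式、归一化形式与函数签名都是假设,并非论文原始代码。

```python
import torch

def counterfactual_weights(model, tokenizer, prompt, reasoning_spans, answer_ids):
    """Span importance = drop in answer log-probability when the span is masked
    out of the reasoning (sketch; spans are (start_char, end_char) in `prompt`)."""
    def answer_logprob(context_ids):
        with torch.no_grad():
            out = model(torch.cat([context_ids, answer_ids], dim=-1))
        logits = out.logits[:, context_ids.shape[-1] - 1:-1, :]
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(-1, answer_ids.unsqueeze(-1)).sum()

    base = answer_logprob(tokenizer(prompt, return_tensors="pt").input_ids)
    weights = []
    for start, end in reasoning_spans:
        masked = prompt[:start] + " ... " + prompt[end:]
        masked_ids = tokenizer(masked, return_tensors="pt").input_ids
        weights.append((base - answer_logprob(masked_ids)).clamp(min=0.0).item())
    total = sum(weights) or 1.0
    return [len(weights) * w / total for w in weights]   # mean-normalized span weights
```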
[NLP-66] Don't Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention EMNLP2024
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLM)在商业场景中作为聊天机器人时,因话题连续性不足而导致用户体验下降和计算资源浪费的问题。解决方案的关键在于构建一个可解释的、基于朴素贝叶斯(Naive Bayes)方法量化的自然语言理解(Natural Language Understanding, NLU)模型,并引入注意力机制与对数非线性变换以增强其捕捉话题连贯性的能力。该方法使NLU模型转化为解析公式,具备线性时间复杂度,能够无缝处理任意长度对话,且在长篇和复杂对话中显著优于传统方法,从而保障LLM使用的责任性和可解释性。
链接: https://arxiv.org/abs/2602.09312
作者: Shu-Ting Pi,Pradeep Bagavan,Yejia Li,Disha,Qun Liu
机构: Amazon(亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2024: Industry Track; 8 pages, 2 figures, 1 table
Abstract:Utilizing Large Language Models (LLM) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model’s ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.
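摘要中"朴素贝叶斯 + 注意力 + 对数非线性"的打分可以概括成如下一般形式:对回复中每个 token 计算其在"原话题"与"偏题"两个分布下的对数似然比,按注意力权重加权并做对数压缩后求和。下面的示意函数只表达这一形式,并非论文推导出的精确解析公式。

```python
import math

def topic_continuity_score(tokens, p_on, p_off, attention, eps=1e-9):
    """Naive-Bayes-style continuity score with per-token attention and a
    logarithmic nonlinearity (general form only; not the paper's exact formula).
    p_on / p_off: token likelihoods under on-topic vs. off-topic models;
    attention: per-token weights that sum to 1."""
    score = 0.0
    for tok, a in zip(tokens, attention):
        llr = math.log(p_on.get(tok, eps)) - math.log(p_off.get(tok, eps))
        score += a * math.copysign(math.log1p(abs(llr)), llr)   # dampen extreme tokens
    return score   # > 0: response stays on topic; <= 0: likely topic drift
```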
[NLP-67] Triggered: A Statistical Analysis of Environmental Influences on Extremist Groups
【速读】: 该论文旨在解决极端主义社群在信息生态系统中的动态响应机制问题,具体包括:极端暴力事件如何影响社群行为、政治实体的新闻报道是否能预测对话动态变化,以及主流与极端主义空间之间是否存在语言扩散现象。其解决方案的关键在于采用系统视角,结合反事实合成(counterfactual synthesis)估算事件级影响,并运用向量自回归(vector autoregression)和格兰杰因果分析(Granger causality analysis)建模新闻信号、行为结果及跨社群语言变迁之间的持续关系,从而揭示不同极端主义社群(如Stormfront、Incels)与主流社区(r/News)在对外部刺激响应上的异质性。
链接: https://arxiv.org/abs/2602.09289
作者: Christine de Kock,Eduard Hovy
机构: 未知
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:
Abstract:Online extremist communities operate within a wider information ecosystem shaped by real-world events, news coverage, and cross-community interaction. We adopt a systems perspective to examine these influences using seven years of data from two ideologically distinct extremist forums (Stormfront and Incels) and a mainstream reference community (r/News). We ask three questions: how extremist violence impacts community behaviour; whether news coverage of political entities predicts shifts in conversation dynamics; and whether linguistic diffusion occurs between mainstream and extremist spaces and across extremist ideologies. Methodologically, we combine counterfactual synthesis to estimate event-level impacts with vector autoregression and Granger causality analyses to model ongoing relationships among news signals, behavioural outcomes, and cross-community language change. Across analyses, our results indicate that Stormfront and r/News appear to be more reactive to external stimuli, while Incels demonstrates less cross-community linguistic influence and less responsiveness to news and violent events. These findings underscore that extremist communities are not homogeneous, but differ in how tightly they are coupled to the surrounding information ecosystem.
[NLP-68] Effective Reasoning Chains Reduce Intrinsic Dimensionality
【速读】: 该论文旨在解决当前链式思维(Chain-of-thought, CoT)及其变体在复杂推理任务中提升语言模型性能的机制不明确问题,特别是不同推理策略如何促进泛化能力尚缺乏定量理解。其解决方案的关键在于引入**内在维度(intrinsic dimensionality)**作为量化指标,该指标衡量达到特定准确率阈值所需的最小模型维度数。通过固定模型架构并改变任务表述方式(即不同推理策略),作者发现有效推理策略能显著降低任务的内在维度,且该维度与泛化性能呈强负相关关系,从而揭示了高效推理链通过更优的任务压缩(用更少参数实现更高精度)来促进学习的本质机制。
链接: https://arxiv.org/abs/2602.09276
作者: Archiki Prasad,Mandar Joshi,Kenton Lee,Mohit Bansal,Peter Shaw
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 3 figures
Abstract:Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
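内在维度的测量流程可以概括为:把参数更新限制在随机的 d 维子空间内训练,寻找达到目标精度所需的最小 d。下面是该搜索循环的示意骨架,子空间训练函数由调用方提供,维度网格为示例取值,并非论文的训练代码。

```python
def intrinsic_dimension(train_in_subspace, target_acc, dims=(8, 16, 32, 64, 128, 256, 512)):
    """Smallest subspace dimensionality whose fine-tuned model reaches target_acc.

    train_in_subspace(d): fine-tune with all updates confined to a random
    d-dimensional parameter subspace and return validation accuracy
    (to be supplied by the caller; this sketch only runs the search)."""
    for d in dims:
        if train_in_subspace(d) >= target_acc:
            return d
    return None   # threshold not reached within the probed range
```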
[NLP-69] Measuring Inclusion in Interaction: Inclusion Analytics for Human-AI Collaborative Learning
【速读】: 该论文旨在解决当前人工智能(AI)与教育领域中对包容性(inclusion)、公平性(equity)和可及性(access)的评估往往依赖粗粒度样本描述或事后自我报告,从而难以捕捉协作问题解决(Collaborative Problem Solving, CPS)过程中包容性如何实时、动态地形成这一关键问题。解决方案的关键在于提出“包容性分析”(inclusion analytics),这是一个基于话语的框架,将包容性概念化为三个互补维度——参与公平性(participation equity)、情感氛围(affective climate)和认知公平性(epistemic equity),并通过可扩展的交互层级测量方法使这些构念在CPS过程中实现可观测和量化,从而揭示传统聚合或事后评估无法识别的参与模式、关系动态与观点采纳规律。
链接: https://arxiv.org/abs/2602.09269
作者: Jaeyoon Choi,Nia Nixon
机构: University of California, Irvine (加州大学欧文分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Inclusion, equity, and access are widely valued in AI and education, yet are often assessed through coarse sample descriptors or post-hoc self-reports that miss how inclusion is shaped moment by moment in collaborative problem solving (CPS). In this proof-of-concept paper, we introduce inclusion analytics, a discourse-based framework for examining inclusion as a dynamic, interactional process in CPS. We conceptualize inclusion along three complementary dimensions – participation equity, affective climate, and epistemic equity – and demonstrate how these constructs can be made analytically visible using scalable, interaction-level measures. Using both simulated conversations and empirical data from human-AI teaming experiments, we illustrate how inclusion analytics can surface patterns of participation, relational dynamics, and idea uptake that remain invisible to aggregate or post-hoc evaluations. This work represents an initial step toward process-oriented approaches to measuring inclusion in human-AI collaborative learning environments.
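以"参与公平性"为例,一种常见的交互层级操作化方式是对各成员发言次数计算基尼系数并取补,数值越接近 1 表示参与越均衡。下面的示意函数只是这一类度量的一种写法,未必与论文采用的具体指标一致。

```python
import numpy as np

def participation_equity(turn_counts):
    """1 - Gini coefficient of per-member contribution counts:
    1.0 = perfectly equal participation, values near 0 = one member dominates."""
    x = np.sort(np.asarray(turn_counts, dtype=float))
    n = x.size
    gini = (2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum())
    return 1.0 - gini

print(participation_equity([12, 11, 10, 13]))   # ~0.95: equitable turn-taking
print(participation_equity([40, 2, 1, 1]))      # ~0.33: dominated conversation
```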
[NLP-70] Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection
【速读】: 该论文旨在解决生成式 AI (Generative AI) 文本检测、文本水印鲁棒性评估、多作者写作风格分析、生成式抄袭检测以及大语言模型(LLM)推理轨迹溯源与安全检测等文本取证(text forensics)领域的关键挑战。解决方案的关键在于通过PAN(Plagiarism, Authorship, and Detection)工作坊组织标准化、可复现的评测任务,推动计算文体学(computational stylometry)的发展;同时要求参赛者提交基于Docker容器的软件工具,确保方法的可重复性和公平比较,从而提升各任务在混合作者场景、对抗性扰动和真实应用场景下的性能与可靠性。
链接: https://arxiv.org/abs/2602.09147
作者: Janek Bevendorff,Maik Fröbe,André Greiner-Petter,Andreas Jakoby,Maximilian Mayerl,Preslav Nakov,Henry Plutz,Martin Potthast,Benno Stein,Minh Ngoc Ta,Yuxia Wang,Eva Zangerle
机构: 1. University of Duisburg-Essen (杜伊斯堡-埃森大学); 2. University of Hamburg (汉堡大学); 3. University of Stuttgart (斯图加特大学); 4. University of Bremen (不来梅大学); 5. Qatar Computing Research Institute (卡塔尔计算研究所); 6. German National Library of Science and Technology (德国科学与技术国家图书馆); 7. TU Dresden (德累斯顿工业大学); 8. Fraunhofer IAIS (弗劳恩霍夫智能分析与信息系统研究所); 9. University of California, Berkeley (加州大学伯克利分校); 10. University of Vienna (维也纳大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new and benchmark the robustness of existing text watermarking schemes, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.
[NLP-71] PABU: Progress-Aware Belief Update for Efficient LLM Agents
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在决策时依赖完整动作-观测历史所导致的任务无关信息干扰问题,这会引发冗余动作和更高的推理成本。其解决方案的核心是提出一种进度感知信念更新(Progress-Aware Belief Update, PABU)框架,通过显式建模任务进展并选择性保留过去交互信息来压缩信念状态表示;在每一步中,代理预测自上一轮以来的相对进展,并据此决定是否存储新交互,仅基于保留子集进行后续决策,从而实现高效且高精度的任务执行。
链接: https://arxiv.org/abs/2602.09138
作者: Haitao Jiang,Lin Ge,Hengrui Cai,Rui Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Model (LLM) agents commonly condition actions on full action-observation histories, which introduce task-irrelevant information that easily leads to redundant actions and higher inference cost. We propose Progress-Aware Belief Update (PABU), a belief-state framework that compactly represents an agent’s state by explicitly modeling task progress and selectively retaining past actions and observations. At each step, the agent predicts its relative progress since the previous round and decides whether the newly encountered interaction should be stored, conditioning future decisions only on the retained subset. Across eight environments in the AgentGym benchmark, and using identical training trajectories, PABU achieves an 81.0% task completion rate, outperforming previous State of the art (SoTA) models with full-history belief by 23.9%. Additionally, PABU’s progress-oriented action selection improves efficiency, reducing the average number of interaction steps to 9.5, corresponding to a 26.9% reduction. Ablation studies show that both explicit progress prediction and selective retention are necessary for robust belief learning and performance gains.
[NLP-72] Benchmarking the Energy Savings with Speculative Decoding Strategies EACL
【速读】: 该论文旨在解决生成式 AI(Generative AI)模型在采用推测解码(speculative decoding)策略时,其能量消耗(energy requirements)缺乏系统研究的问题。解决方案的关键在于通过全面调查不同因素对推测解码策略能耗的影响,包括模型规模与架构、具体推测解码方法以及数据集特征,从而为优化推理过程中的能效提供理论依据和实践指导。
链接: https://arxiv.org/abs/2602.09113
作者: Rohit Dutta,Paramita Koley,Soham Poddar,Janardan Misra,Sanjay Podder,Naveen Balani,Saptarshi Ghosh,Niloy Ganguly
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at EACL Findings 2026
Abstract:Speculative decoding has emerged as an effective method to reduce latency and inference cost of LLM inferences. However, there has been inadequate attention towards the energy requirements of these models. To address this gap, this paper presents a comprehensive survey of energy requirements of speculative decoding strategies, with detailed analysis on how various factors – model size and family, speculative decoding strategies, and dataset characteristics – influence the energy optimizations.
[NLP-73] UI-Venus-1.5 Technical Report
【速读】: 该论文旨在解决GUI代理(GUI Agent)在数字环境中实现广泛通用性与稳定高性能之间难以兼顾的问题。其核心挑战在于如何在复杂、动态的图形用户界面(Graphical User Interface, GUI)场景中,既保持对多样化任务的适应能力,又确保长期导航和交互的可靠性。解决方案的关键在于提出UI-Venus-1.5,一个统一的端到端GUI代理模型,包含三个核心技术突破:(1) 引入中训练阶段(Mid-Training),利用30余个数据集共计100亿token进行基础GUI语义建模;(2) 采用全轨迹在线强化学习(Online Reinforcement Learning with full-trajectory rollouts),使训练目标与大规模环境中的长程动态导航一致;(3) 通过模型融合(Model Merging)技术,将领域特定模型(如GUI定位、网页和移动端模型)整合为单一统一代理,从而提升泛化能力和部署效率。实验证明,该方案在ScreenSpot-Pro、VenusBench-GD和AndroidWorld等多个基准上达到当前最优性能。
链接: https://arxiv.org/abs/2602.09082
作者: Veuns-Team:Changlong Gao,Zhangxuan Gu,Yulin Liu,Xinyu Qiu,Shuheng Shen,Yue Wen,Tianyu Xia,Zhenyu Xu,Zhengwen Zeng,Beitong Zhou,Xingran Zhou,Weizhi Chen,Sunhao Dai,Jingya Dou,Yichen Gong,Yuan Guo,Zhenlin Guo,Feng Li,Qian Li,Jinzhen Lin,Yuqi Zhou,Linchao Zhu,Liang Chen,Zhenyu Guo,Changhua Meng,Weiqiang Wang
机构: Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world use. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application needs. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: this https URL Model: this https URL
[NLP-74] VTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization
【速读】: 该论文旨在解决实时语音转换(voice conversion)与说话人匿名化(speaker anonymization)中因果性、低延迟合成与语音自然度及可懂度之间难以兼顾的问题。当前系统存在核心表征不匹配:语音内容是时变的,而说话人身份通常以静态全局嵌入(global embedding)注入,导致音色一致性与局部变化难以协调。其解决方案的关键在于提出一种内容同步的时间变化音色(content-synchronous, time-varying timbre, TVT)表示机制,通过全局音色记忆(Global Timbre Memory)将全局音色实例分解为多个紧凑维度,帧级内容信息对这一记忆进行注意力交互,并由门控机制调控变化幅度,同时利用球面插值保持说话人身份的几何结构并支持平滑局部调整;此外,采用因子化向量量化瓶颈(factorized vector-quantized bottleneck)约束内容表征以减少残留说话人信息泄露。该方法实现了端到端流式合成,GPU延迟仅为80 ms,在自然度、说话人迁移和匿名化性能上均优于现有最先进流式基线,验证了TVT在严苛延迟约束下隐私保护与高表达力语音合成中的可扩展性。
链接: https://arxiv.org/abs/2602.09389
作者: Waris Quamer,Mu-Ruei Tseng,Ghady Nasrallah,Ricardo Gutierrez-Osuna
机构: Texas A&M University (德州农工大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:
Abstract:Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with 80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
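摘要中用于保持说话人身份几何结构的球面插值(slerp)可以写成如下通用形式:插值结果始终落在两向量张成的球面弧上,从而在局部平滑变化的同时不破坏嵌入的方向结构。这里是标准 slerp 的直接写法,与论文中音色记忆等其余模块无关。

```python
import torch
import torch.nn.functional as F

def slerp(x: torch.Tensor, y: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical interpolation between two timbre vectors (t in [0, 1])."""
    x_n, y_n = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    omega = torch.arccos((x_n * y_n).sum(-1).clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so).unsqueeze(-1) * x + \
           (torch.sin(t * omega) / so).unsqueeze(-1) * y
```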
信息检索
[IR-0] Overview of the TREC 2025 RAG TIME Track
【速读】:该论文旨在解决多语言源文档中的报告生成问题,即如何从阿拉伯语、中文、英语和俄语等不同语言的新闻故事中自动构建连贯、准确的报告内容。其解决方案的关键在于设计并实施了RAG TREC Instrument for Multilingual Evaluation (RAGTIME)任务框架,该框架包含三种任务类型:多语言报告生成(Multilingual Report Generation)、英文报告生成(English Report Generation)以及多语言信息检索(Multilingual Information Retrieval, MLIR),并通过一个涵盖多种语言的文档集合来评估系统性能,从而推动跨语言文本理解与生成技术的发展。
链接: https://arxiv.org/abs/2602.10024
作者: Dawn Lawrie,Sean MacAvaney,James Mayfield,Luca Soldaini,Eugene Yang,Andrew Yates
机构: Johns Hopkins University Human Language Technology Center of Excellence (约翰霍普金斯大学人类语言技术卓越中心); University of Glasgow (格拉斯哥大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 10 pages, 3 figures, notebook version of the RAGTIME 2025 overview paper
Abstract:The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.
[IR-1] Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design
【速读】:该论文旨在解决推荐系统中模型性能与计算投入之间缺乏可预测缩放规律的问题,尤其针对同时处理用户历史行为和上下文特征的复杂推荐场景。其核心挑战在于低效模块导致的模型浮点运算利用率(Model FLOPs Utilization, MFU)低下以及资源分配不合理,从而阻碍了幂律缩放关系的建立。解决方案的关键在于提出Kunlun架构,通过多层次优化实现效率提升:底层优化包括广义点积注意力(Generalized Dot-Product Attention, GDPA)、分层种子池化(Hierarchical Seed Pooling, HSP)和滑动窗口注意力;高层创新引入计算跳过机制(Computation Skip, CompSkip)和事件级个性化策略,显著提升MFU(从17%增至37%),并将缩放效率提升至现有最优方法的两倍以上,已在Meta广告系统中部署并产生显著生产效益。
链接: https://arxiv.org/abs/2602.10016
作者: Bojian Hou,Xiaolong Liu,Xiaoyi Liu,Jiaqi Xu,Yasmine Badr,Mengyue Hang,Sudhanshu Chanpuriya,Junqing Zhou,Yuhang Yang,Han Xu,Qiuling Suo,Laming Chen,Yuxi Hu,Jiasheng Zhang,Huaqing Xiong,Yuzhen Huang,Chao Chen,Yue Dong,Yi Yang,Shuo Chang,Xiaorui Gan,Wenlin Chen,Santanu Kolay,Darren Liu,Jade Nie,Chunzhi Yang,Jiyan Yang,Huayu Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures
Abstract:Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
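在对数-对数坐标下做线性拟合,是建立"质量-算力"幂律缩放关系的常用步骤。下面的示意代码用 numpy 完成这一拟合,其中的算力与损失数值均为虚构示例,并非论文的测量结果。

```python
import numpy as np

compute = np.array([1e18, 4e18, 1.6e19, 6.4e19])   # training FLOPs (made-up points)
loss = np.array([0.402, 0.389, 0.378, 0.368])       # eval loss / normalized entropy (made-up)

# Fit loss ≈ a * C^b in log-log space (b is expected to be negative).
b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
a = np.exp(log_a)
print(f"loss ≈ {a:.3g} * C^{b:.3f}; predicted at 2.5e20 FLOPs: {a * 2.5e20 ** b:.4f}")
```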
[IR-2] Efficient Learning of Sparse Representations from Interactions WWW
【速读】:该论文旨在解决推荐系统初始召回阶段中嵌入表示(embedding)的表达能力与服务效率(可扩展性和延迟)之间的固有权衡问题,即如何在保持高推荐准确率的同时实现更紧凑且高效的嵌入表示。其解决方案的关键在于提出一种训练策略,用高维稀疏嵌入层替代传统的密集嵌入层,从而在保证模型表达能力和可解释性的同时显著降低存储和计算开销;实验表明,该方法可在不损失推荐精度的情况下将嵌入尺寸减少至原来的1/10,或在仅接受2.5%精度损失的前提下压缩至1/100,且稀疏嵌入的活跃维度呈现出可解释的倒排索引结构,能够直接支持基于段落级别的推荐功能(如二维主页布局)。
链接: https://arxiv.org/abs/2602.09935
作者: Vojtěch Vančura,Martin Spišák,Rodrigo Alves,Ladislav Peška
机构: Recombee(Recombee); Czech Technical University in Prague(捷克技术大学); Faculty of Mathematics and Physics, Charles University(查尔斯大学数学与物理学院)
类目: Information Retrieval (cs.IR)
备注: In the proceedings of the Web Conference (WWW) 2026 (4 pages)
Abstract:Behavioral patterns captured in embeddings learned from interaction data are pivotal across various stages of production recommender systems. However, in the initial retrieval stage, practitioners face an inherent tradeoff between embedding expressiveness and the scalability and latency of serving components, resulting in the need for representations that are both compact and expressive. To address this challenge, we propose a training strategy for learning high-dimensional sparse embedding layers in place of conventional dense ones, balancing efficiency, representational expressiveness, and interpretability. To demonstrate our approach, we modified the production-grade collaborative filtering autoencoder ELSA, achieving up to 10x reduction in embedding size with no loss of recommendation accuracy, and up to 100x reduction with only a 2.5% loss. Moreover, the active embedding dimensions reveal an interpretable inverted-index structure that segments items in a way directly aligned with the model’s latent space, thereby enabling integration of segment-level recommendation functionality (e.g., 2D homepage layouts) within the candidate retrieval model itself. Source codes, additional results, as well as a live demo are available at this https URL
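作为补充,下面是“以高维稀疏嵌入层替代稠密嵌入层”这一思路的极简示意;ReLU 激活加 L1 稀疏正则只是实现稀疏性的常见做法之一,并非 ELSA 改造的原始实现,类名与维度均为假设:

```python
# 假设性示意:高维稀疏物品嵌入层,大部分维度被 ReLU 压为 0,
# 从而形成可解释的"倒排索引"式激活结构。
import torch
import torch.nn as nn

class SparseItemEmbedding(nn.Module):
    def __init__(self, num_items: int, dim: int = 4096):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_items, dim) * 0.01)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # ReLU 使大部分维度为 0,得到稀疏表示
        return torch.relu(self.weight[item_ids])

    def l1_penalty(self) -> torch.Tensor:
        # 训练时与推荐损失相加,鼓励更高的稀疏度
        return torch.relu(self.weight).mean()

emb = SparseItemEmbedding(num_items=1000, dim=4096)
ids = torch.tensor([3, 17, 42])
vecs = emb(ids)                      # (3, 4096),大部分元素为 0
loss_reg = 1e-3 * emb.l1_penalty()
print(vecs.shape, float(loss_reg))
```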
[IR-3] AmharicIRInstr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning
【速读】:该论文旨在解决低资源语言(如阿姆哈拉语)在神经检索(neural retrieval)和生成式AI(Generative AI)模型训练中高质量监督数据稀缺的问题。解决方案的关键在于构建并发布两个高质量的阿姆哈拉语数据集:一是包含1,091个手动验证的查询-正样本-负样本三元组的检索排序数据集,支持对比学习与神经检索器(如DPR、ColBERT和SPLADE)的训练与评估;二是包含6,285个跨领域指令跟随型文本生成对的数据集,通过多大语言模型(LLM)生成并经母语者人工校验,确保语法正确性、相关性、流畅性和事实合理性。两个数据集均采用标准化格式(CSV、JSON、JSONL)发布,具备可复现性,并提供可推广至其他低资源语言的方法论。
链接: https://arxiv.org/abs/2602.09914
作者: Tilahun Yeshambel,Moncef Garouani,Josiane Mothe
机构: Addis Ababa University (亚的斯亚贝巴大学); Univ. Toulouse Capitole, IRIT, UMR5505 CNRS (图卢兹第一大学,IRIT,UMR5505 CNRS); UT2J, Univ. de Toulouse, IRIT, UMR5505 CNRS (图卢兹第二大学,IRIT,UMR5505 CNRS)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 7 pages, Submitted to resource track
Abstract:Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV,JSON,JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
[IR-4] QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search
【速读】:该论文旨在解决大规模社交网络服务(Social Network Service, SNS)搜索中查询处理(Query Processing, QP)系统的两大核心问题:一是传统基于判别式模型(如BERT)的流水线架构存在语义理解能力有限且维护成本高的缺陷;二是现有大语言模型(Large Language Models, LLMs)方法在SNS场景下缺乏对非正式语言模式的充分建模,且未有效利用任务间的内在语义协同关系。解决方案的关键在于提出QP-OneModel——一个面向多任务查询理解的统一生成式LLM框架,其创新性地将异构子任务重构为统一序列生成范式,并采用渐进式三阶段对齐策略结合多奖励强化学习进行优化;同时引入意图描述(intent description)作为高保真语义信号,显著增强下游任务(如查询重写和排序)的表现,从而在离线评估与线上A/B测试中均实现显著性能提升。
链接: https://arxiv.org/abs/2602.09901
作者: Jianzhao Huang,Xiaorui Huang,Fei Zhao,Yunpeng Liu,Hui Zhang,Fangcheng Shi,Congfeng Li,Zechen Sun,Yi Wu,Yao Hu,Yunhan Bai,Shaosheng Cao
机构: Xiaohongshu Inc.(小红书公司)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Query Processing (QP) bridges user intent and content supply in large-scale Social Network Service (SNS) search engines. Traditional QP systems rely on pipelines of isolated discriminative models (e.g., BERT), suffering from limited semantic understanding and high maintenance overhead. While Large Language Models (LLMs) offer a potential solution, existing approaches often optimize sub-tasks in isolation, neglecting intrinsic semantic synergy and necessitating independent iterations. Moreover, standard generative methods often lack grounding in SNS scenarios, failing to bridge the gap between open-domain corpora and informal SNS linguistic patterns, while struggling to adhere to rigorous business definitions. We present QP-OneModel, a Unified Generative LLM for Multi-Task Query Understanding in the SNS domain. We reformulate heterogeneous sub-tasks into a unified sequence generation paradigm, adopting a progressive three-stage alignment strategy culminating in multi-reward Reinforcement Learning. Furthermore, QP-OneModel generates intent descriptions as a novel high-fidelity semantic signal, effectively augmenting downstream tasks such as query rewriting and ranking. Offline evaluations show QP-OneModel achieves a 7.35% overall gain over discriminative baselines, with significant F1 boosts in NER (+9.01%) and Term Weighting (+9.31%). It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks. Fully deployed at Xiaohongshu, online A/B tests confirm its industrial value, optimizing retrieval relevance (DCG) by 0.21% and lifting user retention by 0.044%.
[IR-5] Internalizing Multi-Agent Reasoning for Accurate and Efficient LLM-based Recommendation

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推荐系统中应用时,如何高效融合协同信号(collaborative signals)与复杂推理能力,同时避免推理延迟过高的问题。其解决方案的关键在于提出了一种轨迹驱动的内部化框架(trajectory-driven internalization framework),通过构建一个多智能体教师系统(multi-agent teacher system)实现多轮工具调用与自我反思,并引入协同信号翻译机制(Collaborative Signal Translation mechanism)将用户行为模式转化为自然语言证据以增强推理准确性;随后,采用基于轨迹的蒸馏管道(trajectory-driven distillation pipeline)将教师模型中的规划、工具使用和自省逻辑完整迁移至轻量级单智能体推荐模型STAR中,从而在不产生迭代延迟的情况下显著提升推荐性能。
链接: https://arxiv.org/abs/2602.09829
作者: Yang Wu,Haoze Wang,Qian Li,Jun Zhang,Huan Yu,Jie Jiang
机构: Tencent(腾讯)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large Language Models (LLMs) are reshaping recommender systems by leveraging extensive world knowledge and semantic reasoning to interpret user intent. However, effectively integrating these capabilities with collaborative signals while avoiding prohibitive inference latency remains a critical bottleneck. To address this, we propose a trajectory-driven internalization framework to develop a Single-agent Trajectory-Aligned Recommender (STAR). Specifically, to internalize complex reasoning capabilities into a single efficient model, we first design a multi-agent teacher system capable of multi-turn tool usage and reflection. This teacher utilizes a Collaborative Signal Translation mechanism to explicitly convert latent behavioral patterns into descriptive natural language evidence to enhance reasoning accuracy. Subsequently, a trajectory-driven distillation pipeline transfers this agentic logic, including planning, tool usage, and self-reflection, into the compact STAR model. Extensive experiments demonstrate that STAR surpasses its teacher by 8.7% to 39.5% while eliminating iterative latency, paving the way for real-time, reasoning-enhanced recommendation.
[IR-6] Self-Supervised Learning as Discrete Communication
【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)中连续视觉表征难以控制信息在表示维度上的结构化分布的问题。传统SSL方法通过对齐同一输入的不同视图来学习连续特征,缺乏对表征空间语义组织的有效调控。其解决方案的关键在于将视觉自监督学习建模为教师-学生网络之间的离散通信过程,其中语义信息通过一个容量受限的二进制信道传输;学生网络预测由教师生成的多标签二进制消息,利用逐元素二进制交叉熵损失强制离散一致性,并引入编码率正则项以优化信道利用率,从而促进结构化表示的学习。此外,定期重置投影头进一步增强嵌入在多个离散编码下仍保持预测能力的特性,实验表明该方法在图像分类、检索、密集视觉预测任务及域偏移场景下均优于连续对齐基线。
链接: https://arxiv.org/abs/2602.09764
作者: Kawtar Zaher,Ilyass Moummad,Olivier Buisson,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
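下面用一个极简代码片段示意摘要中的离散一致性目标:学生网络用逐元素 BCE 预测教师二值化后的多标签消息;其中“比特使用率趋向 0.5”的正则只是对论文编码率正则的简化替代,仅供理解,并非原始实现:

```python
# 假设性示意:教师-学生离散一致性损失。
import torch
import torch.nn.functional as F

def discrete_agreement_loss(student_logits, teacher_logits, reg_weight=0.1):
    with torch.no_grad():
        targets = (teacher_logits.sigmoid() > 0.5).float()   # 教师的二值消息
    bce = F.binary_cross_entropy_with_logits(student_logits, targets)
    # 简化的信道利用正则:鼓励每个比特在 batch 内的平均激活接近 0.5
    usage = student_logits.sigmoid().mean(dim=0)
    reg = ((usage - 0.5) ** 2).mean()
    return bce + reg_weight * reg

s = torch.randn(32, 256, requires_grad=True)
t = torch.randn(32, 256)
loss = discrete_agreement_loss(s, t)
loss.backward()
print(float(loss))
```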
[IR-7] DiffuReason: Bridging Latent Reasoning and Generative Refinement for Sequential Recommendation
【速读】:该论文旨在解决现有顺序推荐模型在 latent reasoning(潜在推理)过程中存在的两大问题:一是依赖确定性潜在链导致推理噪声累积,二是采用分阶段训练流程阻碍了联合优化与探索。解决方案的关键在于提出一个统一的“Think-then-Diffuse”框架——通过引入多步 Thinking Tokens 进行初始意图假设生成,在 Diffuse 阶段利用扩散机制将用户意图建模为概率分布并进行迭代去噪,从而有效缓解噪声影响;同时,采用基于 Group Relative Policy Optimization (GRPO) 的强化学习策略,使推理与精炼模块在端到端训练中协同演化,突破传统分阶段优化的限制,显著提升推荐效果。
链接: https://arxiv.org/abs/2602.09744
作者: Jie Jiang,Yang Wu,Qian Li,Yuling Xiong,Yihang Su,Junbang Huo,Longfei Lu,Jun Zhang,Huan Yu
机构: Tencent(腾讯)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Latent reasoning has emerged as a promising paradigm for sequential recommendation, enabling models to capture complex user intent through multi-step deliberation. Yet existing approaches often rely on deterministic latent chains that accumulate noise and overlook the uncertainty inherent in user intent, and they are typically trained in staged pipelines that hinder joint optimization and exploration. To address these challenges, we propose DiffuReason, a unified “Think-then-Diffuse” framework for sequential recommendation. It integrates multi-step Thinking Tokens for latent reasoning, diffusion-based refinement for denoising intermediate representations, and end-to-end Group Relative Policy Optimization (GRPO) alignment to optimize for ranking performance. In the Think stage, the model generates Thinking Tokens that reason over user history to form an initial intent hypothesis. In the Diffuse stage, rather than treating this hypothesis as the final output, we refine it through a diffusion process that models user intent as a probabilistic distribution, providing iterative denoising against reasoning noise. Finally, GRPO-based reinforcement learning enables the reasoning and refinement modules to co-evolve throughout training, without the constraints of staged optimization. Extensive experiments on four benchmarks demonstrate that DiffuReason consistently improves diverse backbone architectures. Online A/B tests on a large-scale industrial platform further validate its practical effectiveness.
[IR-8] With Argus Eyes: Assessing Retrieval Gaps via Uncertainty Scoring to Detect and Remedy Retrieval Blind Spots
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中神经检索器存在的“盲点”问题,即检索器无法召回与查询语义相关但嵌入表示相似度较低的实体,这源于训练诱导的偏差导致这些实体被映射到嵌入空间中难以访问的区域,从而降低可检索性。解决方案的关键在于提出一种名为ARGUS的流水线方法,通过预知高风险(低检索概率得分,RPS)实体并基于知识库(KB)和维基百科首段进行针对性文档增强,显著提升这些低可检索性实体的召回能力;实验表明,ARGUS在多个基准数据集上均实现稳定且显著的性能提升,验证了提前修复盲点对构建鲁棒、可信RAG系统的重要性。
链接: https://arxiv.org/abs/2602.09616
作者: Zeinab Sadat Taghavi,Ali Modarressi,Hinrich Schutze,Andreas Marfurt
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:Reliable retrieval-augmented generation (RAG) systems depend fundamentally on the retriever’s ability to find relevant information. We show that neural retrievers used in RAG systems have blind spots, which we define as the failure to retrieve entities that are relevant to the query, but have low similarity to the query embedding. We investigate the training-induced biases that cause such blind spot entities to be mapped to inaccessible parts of the embedding space, resulting in low retrievability. Using a large-scale dataset constructed from Wikidata relations and first paragraphs of Wikipedia, and our proposed Retrieval Probability Score (RPS), we show that blind spot risk in standard retrievers (e.g., CONTRIEVER, REASONIR) can be predicted pre-index from entity embedding geometry, avoiding expensive retrieval evaluations. To address these blind spots, we introduce ARGUS, a pipeline that enables the retrievability of high-risk (low-RPS) entities through targeted document augmentation from a knowledge base (KB), first paragraphs of Wikipedia, in our case. Extensive experiments on BRIGHT, IMPLIRET, and RAR-B show that ARGUS achieves consistent improvements across all evaluated retrievers (averaging +3.4 nDCG@5 and +4.5 nDCG@10 absolute points), with substantially larger gains in challenging subsets. These results establish that preemptively remedying blind spots is critical for building robust and trustworthy RAG systems (Code and Data).
[IR-9] LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval EACL
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言法律场景中部署时面临的两大挑战:一是检索可靠性差,二是缺乏经过领域适配的开源嵌入模型(open-embedding models)。现有多语言法律语料库未针对语义检索设计,且基于PDF的立法文本因提取噪声严重导致质量下降。为应对这些问题,作者构建了LEMUR——一个涵盖25种语言、由24,953份欧盟环境法规官方EUR-Lex PDF文档组成的大型多语言法律语料库,并通过词法一致性评分(Lexical Content Score, LCS)量化PDF转文本的保真度。关键解决方案在于基于LEMUR对三种先进的多语言嵌入模型进行对比学习微调(contrastive fine-tuning),分别在单语和双语设置下模拟真实法律检索场景。实验表明,领域微调显著提升了Top-k检索准确率,尤其在低资源语言上效果突出;跨语言评估进一步验证了模型对语言无关的法律内容表征能力的增强,而非依赖语言特异性特征。
链接: https://arxiv.org/abs/2602.09570
作者: Narges Baba Ahmadi,Jan Strich,Martin Semmann,Chris Biemann
机构: Hub of Computing and Data Science (HCDS); University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at EACL SRW 26
Abstract:Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish the code (GitHub repository: this https URL) and the data (Hugging Face dataset: this https URL).
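摘要提到基于 (query, positive, negative) 三元组做对比微调;下面给出一个通用的 InfoNCE 形式三元组损失示意,模型、温度等均为假设,并非论文的原始训练配置:

```python
# 假设性示意:检索嵌入模型领域适配中常见的三元组对比损失。
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(q, pos, neg, temperature=0.05):
    """q/pos/neg: (batch, dim) 的句向量,返回 InfoNCE 形式的损失。"""
    q, pos, neg = (F.normalize(x, dim=-1) for x in (q, pos, neg))
    pos_sim = (q * pos).sum(-1, keepdim=True)          # (B, 1) 正样本相似度
    neg_sim = q @ neg.t()                               # (B, B) 批内负样本相似度
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # 正样本位于第 0 列
    return F.cross_entropy(logits, labels)

q, p, n = torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768)
print(float(triplet_contrastive_loss(q, p, n)))
```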
[IR-10] Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA EACL
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)方法在多轮对话问答(Multi-turn Conversational QA)场景中缺乏系统性比较的问题,尤其是在对话历史、指代消解和用户意图漂移等复杂因素下,现有方法性能表现不一且难以复现。其关键解决方案在于构建了一个统一的实验框架,在八个跨领域、多样化的多轮对话QA数据集上对基础与先进RAG方法进行全面评估,涵盖检索质量与生成效果两个维度,并分析不同策略在对话轮次中的演化规律。研究发现,简单但鲁棒的方法(如重排序、混合BM25和HyDE)普遍优于原始RAG,而部分复杂方法反而导致性能下降;更重要的是,检索效果高度依赖于数据集特性和对话长度,表明有效的多轮RAG更取决于检索策略与数据结构之间的匹配度,而非方法本身的复杂程度。
链接: https://arxiv.org/abs/2602.09552
作者: Klejda Alushi,Jan Strich,Chris Biemann,Martin Semmann
机构: Hub of Computing and Data Science (HCDS); University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to EACL SRW 26
Abstract:Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used (GitHub repository: this https URL).
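摘要中表现较好的“混合 BM25”通常指稀疏检索与稠密检索结果的融合;下面以 Reciprocal Rank Fusion(RRF)为例给出一种常见融合方式的示意,RRF 只是可选方案之一,论文未必采用该公式:

```python
# 假设性示意:用 RRF 融合 BM25 与稠密检索两路结果。
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: 若干个按相关性降序排列的文档 id 列表,返回融合后的排序。"""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]
dense_hits = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # d1/d3 被两路同时提升
```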
[IR-11] The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training
【速读】:该论文旨在解决合成数据生成中查询多样性(query diversity)对密集检索(dense retrieval)性能影响不一致的问题,即先前研究对此存在矛盾结论。其核心解决方案是提出Q-D指标以量化多样性的影响,从而将该问题可测量化;并通过在31个数据集上的实验发现,查询多样性尤其有利于多跳检索任务(multi-hop retrieval),且其收益与查询复杂度(由内容词数CW衡量)高度相关(r ≥ 0.95, p < 0.05 在12/14条件下)。基于此,作者提出“复杂度-多样性原则”(Complexity-Diversity Principle, CDP),指出查询复杂度决定了最优多样性水平,并给出可操作阈值(CW > 10时使用多样性,CW < 7时避免多样性)。在此基础上,提出零样本多查询合成方法(zero-shot multi-query synthesis),专门针对多跳任务设计,实现了当前最优性能。
链接: https://arxiv.org/abs/2602.09448
作者: Xincan Feng,Noriki Nishida,Yusuke Sakai,Yuji Matsumoto
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Under review
Abstract:Prior work reports conflicting results on query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics to quantify diversity’s impact, making the problem measurable. Through experiments on 4 benchmark types (31 datasets), we find query diversity especially benefits multi-hop retrieval. Deep analysis on multi-hop data reveals that diversity benefit correlates strongly with query complexity (r \geq 0.95, p < 0.05 in 12/14 conditions), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides actionable thresholds (CW > 10: use diversity; CW < 7: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, achieving state-of-the-art performance.
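按摘要给出的可操作阈值(CW > 10 使用多样性、CW < 7 避免多样性),可以写出如下判断逻辑的示意;其中停用词表与分词方式为示例假设,实际内容词统计取决于具体实现:

```python
# 假设性示意:按内容词数(CW)判断是否对合成查询使用多样性。
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are",
             "for", "on", "with", "that", "this", "by", "as", "at", "be"}

def content_word_count(query: str) -> int:
    tokens = [t.strip(".,?!").lower() for t in query.split()]
    return sum(1 for t in tokens if t and t not in STOPWORDS)

def use_query_diversity(query: str):
    cw = content_word_count(query)
    if cw > 10:
        return True      # 复杂(多跳)查询:使用多样化合成查询
    if cw < 7:
        return False     # 简单查询:避免多样性
    return None          # 中间区间:摘要未给出明确建议

q = ("Which author of the paper that introduced the transformer architecture "
     "later founded a company focused on protein structure prediction?")
print(content_word_count(q), use_query_diversity(q))
```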
[IR-12] Personalized Parameter-Efficient Fine-Tuning of Foundation Models for Multimodal Recommendation WWW2026
【速读】:该论文旨在解决当前基于多模态基础模型(multimodal foundation models)的推荐系统中,物品嵌入(item embeddings)缺乏用户个性化的问题。尽管这些模型能有效编码图像、文本等多模态信息,但其生成的嵌入未考虑不同用户的兴趣差异,导致无法捕捉用户关注的细粒度物品特征。解决方案的关键在于提出PerPEFT(Personalized Parameter-Efficient Fine-Tuning),通过将用户按兴趣分组,并为每组分配独立的参数高效微调(PEFT)模块,使每个模块能够学习与特定用户群体购买决策最相关的物品特征。该方法在保持轻量级的同时显著提升了推荐性能,且不依赖于具体PEFT实现方式,具有良好的通用性。
链接: https://arxiv.org/abs/2602.09445
作者: Sunwoo Kim,Hyunjin Hwang,Kijung Shin
机构: KAIST(韩国科学技术院)
类目: Information Retrieval (cs.IR)
备注: To be published at The Web Conference 2026 (WWW 2026)
Abstract:In recent years, substantial research has integrated multimodal item metadata into recommender systems, often by using pre-trained multimodal foundation models to encode such data. Since these models are not originally trained for recommendation tasks, recent works efficiently adapt them via parameter-efficient fine-tuning (PEFT). However, even with PEFT, item embeddings from multimodal foundation models remain user-blind: item embeddings are not conditioned on user interests, despite the fact that users with diverse interests attend to different item aspects. To address this limitation, we propose PerPEFT, a personalized PEFT strategy for multimodal recommendation. Specifically, PerPEFT groups users by interest and assigns a distinct PEFT module to each group, enabling each module to capture the fine-grained item aspects most predictive of that group`s purchase decisions. We further introduce a specialized training technique that strengthens this user-group conditioning. Notably, PerPEFT is PEFT-agnostic and can be paired with any PEFT method applicable to multimodal foundation models. Through extensive experiments, we show that (1) PerPEFT outperforms the strongest baseline by up to 15.3% (NDCG@20) and (2) delivers consistent gains across diverse PEFT variants. It is noteworthy that, even with personalization, PEFT remains lightweight, adding only 1.3% of the parameter count of the foundation model. We provide our code and datasets at this https URL.
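下面示意“按兴趣对用户分组、每组挂载独立参数高效适配模块”的思路;此处用 LoRA 风格低秩适配器作占位,分组方式与结构细节均为假设,并非 PerPEFT 的原始实现:

```python
# 假设性示意:冻结基础层,按用户兴趣组路由到各自的低秩适配器。
import torch
import torch.nn as nn

class GroupedAdapterLinear(nn.Module):
    def __init__(self, dim: int, num_groups: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)                 # 冻结基础模型权重
        self.down = nn.Parameter(torch.randn(num_groups, dim, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_groups, rank, dim))

    def forward(self, x: torch.Tensor, group_id: torch.Tensor) -> torch.Tensor:
        # x: (B, dim), group_id: (B,) 每个用户所属的兴趣组
        delta = torch.einsum("bd,bdr->br", x, self.down[group_id])
        delta = torch.einsum("br,brd->bd", delta, self.up[group_id])
        return self.base(x) + delta

layer = GroupedAdapterLinear(dim=64, num_groups=4)
x = torch.randn(5, 64)
groups = torch.tensor([0, 1, 1, 3, 2])
print(layer(x, groups).shape)  # (5, 64)
```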
[IR-13] SARM: LLM-Augmented Semantic Anchor for End-to-End Live-Streaming Ranking
【速读】:该论文旨在解决大规模直播推荐系统中非平稳内容语义的精确建模问题,尤其是在严格实时服务约束下的挑战。现有工业部署中的两种主流方法存在根本性局限:离散语义抽象通过聚类牺牲了描述精度,而密集多模态嵌入则独立提取且与排序优化弱对齐,难以实现细粒度的内容感知排序。解决方案的关键在于提出端到端的排序架构SARM(Semantic Anchor Ranking Model),其核心创新是将自然语言语义锚点(semantic anchors)直接融入排序优化过程,使模型能够生成基于多模态内容条件的细粒度作者表征。每个语义锚点以可学习文本标记的形式表示,并与排序特征联合优化,从而让内容描述自适应地匹配排序目标;同时采用轻量级双标记门控设计捕捉直播领域特异性语义,并结合非对称部署策略保障低延迟在线训练与推理,最终在离线评估和大规模A/B测试中均显著优于现有基线模型。
链接: https://arxiv.org/abs/2602.09401
作者: Ruochen Yang,Yueyang Liu,Zijie Zhuang,Changxin Lao,Yuhui Zhang,Jiangxia Cao,Jia Xu,Xiang Chen,Haoke Xiao,Xiangyu Wu,Xiaoyou Zhou,Xiao Lv,Shuang Yang,Tingwen Liu,Zhaojie Liu,Han Li,Kun Gai
机构: Kuaishou Technology (快手科技); Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large-scale live-streaming recommendation requires precise modeling of non-stationary content semantics under strict real-time serving constraints. In industrial deployment, two common approaches exhibit fundamental limitations: discrete semantic abstractions sacrifice descriptive precision through clustering, while dense multimodal embeddings are extracted independently and remain weakly aligned with ranking optimization, limiting fine-grained content-aware ranking. To address these limitations, we propose \textbfSARM, an end-to-end ranking architecture that integrates natural-language semantic anchors directly into ranking optimization, enabling fine-grained author representations conditioned on multimodal content. Each semantic anchor is represented as learnable text tokens jointly optimized with ranking features, allowing the model to adapt content descriptions to ranking objectives. A lightweight dual-token gated design captures domain-specific live-streaming semantics, while an asymmetric deployment strategy preserves low-latency online training and serving. Extensive offline evaluation and large-scale A/B tests show consistent improvements over production baselines. SARM is fully deployed and serves over 400 million users daily.
[IR-14] Query-Mixed Interest Extraction and Heterogeneous Interaction: A Scalable CTR Model for Industrial Recommender Systems
【速读】:该论文旨在解决工业级推荐系统中因稀疏多字段输入和超长用户行为序列导致的特征交互建模难题,尤其关注如何从长期与实时行为序列中同时构建上下文感知(context-aware)和上下文无关(context-independent)的用户意图,并克服现有方法在交互机制上的低效与同质化问题。其解决方案的关键在于提出HeMix模型,通过Query-Mixed Interest Extraction模块动态地利用全局与实时行为序列分别提取两种用户兴趣,同时引入HeteroMixer块替代传统自注意力机制,实现高效、多粒度的跨特征交互,该结构融合多头token融合、异质交互与分组对齐重建三个子流程,显著提升了模型扩展性与预测精度,在工业数据集上验证了其有效性,并已在AMAP平台上线带来显著在线收益。
链接: https://arxiv.org/abs/2602.09387
作者: Fangye Wang,Guowei Yang,Xiaojiang Zhou,Song Yang,Pengjie Wang
机构: AMAP, Alibaba Group(阿里巴巴集团)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Learning effective feature interactions is central to modern recommender systems, yet remains challenging in industrial settings due to sparse multi-field inputs and ultra-long user behavior sequences. While recent scaling efforts have improved model capacity, they often fail to construct both context-aware and context-independent user intent from the long-term and real-time behavior sequence. Meanwhile, recent work also suffers from inefficient and homogeneous interaction mechanisms, leading to suboptimal prediction performance. To address these limitations, we propose HeMix, a scalable ranking model that unifies adaptive sequence tokenization and heterogeneous interaction structure. Specifically, HeMix introduces a Query-Mixed Interest Extraction module that jointly models context-aware and context-independent user interests via dynamic and fixed queries over global and real-time behavior sequences. For interaction, we replace self-attention with the HeteroMixer block, enabling efficient, multi-granularity cross-feature interactions that adopt the multi-head token fusion, heterogeneous interaction and group-aligned reconstruction pipelines. HeMix demonstrates favorable scaling behavior, driven by the HeteroMixer block, where increasing model scale via parameter expansion leads to steady improvements in recommendation accuracy. Experiments on industrial-scale datasets show that HeMix scales effectively and consistently outperforms strong baselines. Most importantly, HeMix has been deployed on the AMAP platform, delivering significant online gains: +0.61% GMV, +2.32% PV_CTR, and +0.81% UV_CVR.
[IR-15] SMES: Towards Scalable Multi-Task Recommendation via Expert Sparsity
【速读】:该论文旨在解决工业级多任务推荐系统中因统一参数扩展与异构任务容量需求不匹配所引发的可扩展性挑战,特别是在模型规模扩大时导致的在线推理成本激增及稀疏任务因标签分布偏斜而收益递减的问题。解决方案的关键在于提出一种可扩展的稀疏专家混合(Sparse Mixture-of-Experts, MoE)框架SMES,其核心创新包括:通过分解专家激活机制,将专家划分为跨任务共享子集与任务自适应私有专家,从而在保证实例级稀疏性的同时保留各任务特异性能力;同时引入全局多门控负载均衡正则项,通过调控所有任务下专家使用率的聚合分布来稳定训练过程,有效缓解由独立任务路由导致的专家负载偏斜问题。
链接: https://arxiv.org/abs/2602.09386
作者: Yukun Zhang,Si Dong,Xu Wang,Bo Chen,Qinglin Jia,Shengzhe Wang,Jinlong Jiao,Runhan Li,Jiaqing Liu,Chaoyi Ma,Ruiming Tang,Guorui Zhou,Han Li,Kun Gai
机构: Kuaishou Technology Co., Ltd.(快手科技有限公司)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Industrial recommender systems typically rely on multi-task learning to estimate diverse user feedback signals and aggregate them for ranking. Recent advances in model scaling have shown promising gains in recommendation. However, naively increasing model capacity imposes prohibitive online inference costs and often yields diminishing returns for sparse tasks with skewed label distributions. This mismatch between uniform parameter scaling and heterogeneous task capacity demands poses a fundamental challenge for scalable multi-task recommendation. In this work, we investigate parameter sparsification as a principled scaling paradigm and identify two critical obstacles when applying sparse Mixture-of-Experts (MoE) to multi-task recommendation: exploded expert activation that undermines instance-level sparsity and expert load skew caused by independent task-wise routing. To address these challenges, we propose SMES, a scalable sparse MoE framework with progressive expert routing. SMES decomposes expert activation into a task-shared expert subset jointly selected across tasks and task-adaptive private experts, explicitly bounding per-instance expert execution while preserving task-specific capacity. In addition, SMES introduces a global multi-gate load-balancing regularizer that stabilizes training by regulating aggregated expert utilization across all tasks. SMES has been deployed in Kuaishou large-scale short-video services, supporting over 400 million daily active users. Extensive online experiments demonstrate stable improvements, with GAUC gain of 0.29% and a 0.31% uplift in user watch time.
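摘要提到的全局多门控负载均衡正则,可以理解为约束“所有任务聚合后的专家使用率”接近均匀分布;下面用与均匀分布的平方偏差作为占位实现示意,并非 SMES 的原始公式:

```python
# 假设性示意:跨任务聚合专家使用率后的负载均衡正则项。
import torch

def global_load_balance_loss(gate_probs_per_task):
    """gate_probs_per_task: 若干个 (batch, num_experts) 的 softmax 门控概率。"""
    usage = torch.stack([g.mean(dim=0) for g in gate_probs_per_task]).mean(dim=0)
    uniform = torch.full_like(usage, 1.0 / usage.numel())
    return ((usage - uniform) ** 2).sum()

gates_task_a = torch.softmax(torch.randn(32, 8), dim=-1)
gates_task_b = torch.softmax(torch.randn(32, 8), dim=-1)
print(float(global_load_balance_loss([gates_task_a, gates_task_b])))
```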
[IR-16] Beyond the Unit Hypersphere: Embedding Magnitude in Contrastive Learning
【速读】:该论文旨在解决对比学习中广泛使用的余弦相似度(cosine similarity)隐含假设——嵌入向量的模长为噪声,而忽视了模长可能携带的重要信息的问题。其核心发现是:嵌入模长并非无意义噪声,而是与任务特性密切相关,在文本检索等不对称任务中能显著提升性能;解决方案的关键在于通过系统性消融实验(2×2设计)分离输入侧和输出侧归一化策略,揭示模长在不同任务中的作用机制,并提出基于任务对称性的选择原则——当任务具有明确的输入角色差异时(如文本检索、RAG),应保留模长信息以使用点积(dot product);反之,在对称任务(如语义文本相似度STS、图文对齐)中则应采用余弦相似度以避免性能下降。这一发现实现了无需额外计算成本的性能优化。
链接: https://arxiv.org/abs/2602.09229
作者: Xincan Feng,Taro Watanabe
机构: 未知
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注: Preliminary work. Under review
Abstract:Cosine similarity is prevalent in contrastive learning, yet it makes an implicit assumption: embedding magnitude is noise. Prior work occasionally found dot product and cosine similarity comparable, but left unanswered WHAT information magnitude carries, WHEN it helps, and HOW to leverage it. We conduct a systematic study through a 2 \times 2 ablation that independently controls input-side and output-side normalization across text and vision models. Our findings reveal three key insights. First, in text retrieval, output (document) magnitude strongly correlates with relevance (Cohen’s d up to 1.80), yielding the largest gains on reasoning-intensive tasks. Second, input and output magnitudes serve asymmetric roles: output magnitude directly scales similarity scores while input magnitude modulates training dynamics. Third, magnitude learning benefits asymmetric tasks (text retrieval, RAG) but harms symmetric tasks (STS, text-image alignment). These findings establish a task symmetry principle: the choice between cosine and dot product depends on whether the task has distinct input roles, enabling cost-free improvements by simply removing an unnecessary constraint.
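摘要中的 2×2 消融本质上是独立控制输入侧与输出侧是否做向量归一化;下面的示意函数展示了余弦相似度(两侧都归一化)与点积(都不归一化,保留模长信息)之间的关系:

```python
# 假设性示意:通过归一化开关切换余弦相似度与点积。
import torch
import torch.nn.functional as F

def similarity(query_emb, doc_emb, normalize_query=True, normalize_doc=True):
    if normalize_query:
        query_emb = F.normalize(query_emb, dim=-1)
    if normalize_doc:
        doc_emb = F.normalize(doc_emb, dim=-1)
    return query_emb @ doc_emb.t()   # (num_q, num_d) 相似度矩阵

q = torch.randn(2, 128)
d = torch.randn(5, 128)
print(similarity(q, d).shape)                      # 余弦相似度
print(similarity(q, d, normalize_doc=False).shape) # 保留文档模长信息
```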
[IR-17] FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
【速读】:该论文旨在解决科学知识库构建中人工标注效率低下的问题,即如何通过AI代理(AI agents)实现从原始文献到结构化注释的端到端自动化知识提炼,尤其针对基因功能、表达模式及命名演变等复杂语义信息的整合。其解决方案的关键在于提出FlyBench基准测试平台,该平台模拟真实场景下多步骤、跨文档的知识推理任务:给定一个基因符号,AI代理需在16,898篇全文论文中检索并理解内容,最终输出符合Gene Ontology(GO)标准的结构化注释。实验表明,多智能体(multi-agent)架构显著优于单智能体或固定流水线设计,但单纯扩大模型规模效果有限,提示未来研究应聚焦于增强检索增强型推理能力,而非仅依赖大模型参数增长。
链接: https://arxiv.org/abs/2602.09163
作者: Xingjian Zhang,Sophia Moylan,Ziyang Xiong,Qiaozhu Mei,Yichen Luo,Jiaqi W. Ma
机构: University of Michigan (密歇根大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of California, Berkeley (加州大学伯克利分校); University of Washington (华盛顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
[IR-18] An Interactive Metrics Dashboard for the Keck Observatory Archive
【速读】:该论文旨在解决Keck Observatory Archive (KOA)在数据规模快速增长背景下,传统指标采集方法无法满足实时性、可扩展性和易用性需求的问题。现有方法存在延迟高、不支持按需查询、分散于多个工具且操作繁琐等缺陷,难以支撑未来新仪器和大视场巡天望远镜(如Vera C. Rubin Observatory)带来的海量瞬变源数据处理需求。解决方案的关键在于构建一套基于Python的现代化、符合虚拟天文台(VO)标准的查询基础设施,采用Plotly-Dash框架与R树索引技术,将查询速度提升20倍,并在此基础上开发统一的实时仪表盘(dashboard),用于监控数据实时入库情况及长期增长趋势,从而实现对档案健康状态的动态评估与硬件/软件升级的科学规划。
链接: https://arxiv.org/abs/2602.09126
作者: G. Bruce Berriman,Min Phone Myat Zaw
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR)
备注: 4 pages, 2 figures, Submitted to Proc. ADASS 2025
Abstract:Since 2004, the Keck Observatory Archive (KOA) has operated as a NASA-funded collaboration between the NASA Exoplanet Science Institute ( NExScI) and the W.M. Keck Observatory. It ingests and serves all data acquired by the twin 10-meter Keck telescopes on Mauna Kea, Hawaii. In the past three years, KOA has begun a modernization program to replace the architecture and systems used since the archive’s creation with a new modern Python-based infrastructure. This infrastructure will position KOA to respond to the rapid growth of new and complex data sets that will be acquired by new instruments now in development, and enable follow-up to identify the deluge of alerts of transient sources expected by new survey telescopes such as the Vera C. Rubin Observatory. Since 2022, KOA has ingested new data in near-real time, generally within one minute of creation, and has made them immediately accessible to observers through a dedicated web interface. The archive is now deploying a new, scalable, Python-based, VO-compliant query infrastructure built with the Plotly-Dash framework and R-tree indices to speed-up queries by a factor of 20. The project described here exploits the new query infrastructure to develop a dashboard that will return live metrics on the performance and growth of the archive. These metrics assess the current health of the archive and guide planning future hardware and software upgrades. This single dashboard will enable, for example, monitoring of real-time ingestion, as well as studying the long-term growth of the archive. Current methods of gathering metrics that have been in place since the archive opened will not support the archive as it continues to scale. These methods suffer from high latency, are not optimized for on-demand metrics, are scattered among various tools, and are cumbersome to use.
人机交互
[HC-0] AIDED: Augmenting Interior Design with Human Experience Data for Designer-AI Co-Design
【速读】:该论文旨在解决室内设计领域中客户体验的细微差异难以被设计师准确捕捉的问题,导致客户感受与设计行动之间存在脱节。其解决方案的关键在于提出一种名为AIDED的设计师-人工智能(AI)协同设计工作流,该流程将多模态客户数据(如问卷可视化、凝视热图和AI预测叠加层)整合进生成式AI(GAI)设计过程中,从而增强设计对客户主观体验的响应能力。研究通过对照实验发现,问卷数据在提升创意性和满意度方面表现最优,而AI预测叠加层虽改善了GAI沟通效率,但需自然语言中介以建立信任,体现出真实性与可解释性之间的权衡关系。
链接: https://arxiv.org/abs/2602.10054
作者: Yang Chen Lin,Chen-Ying Chen,Kai-Hsin Hou,Hung-Yu Chen,Po-Chih Kuo
机构: National Tsing Hua University (国立清华大学); Freelancer (自由职业者)
类目: Human-Computer Interaction (cs.HC)
备注: 29 pages, 14 figures, Accepted to CHI 2026
Abstract:Interior design often struggles to capture the subtleties of client experience, leaving gaps between what clients feel and what designers can act upon. We present AIDED, a designer-AI co-design workflow that integrates multimodal client data into generative AI (GAI) design processes. In a within-subjects study with twelve professional designers, we compared four modalities: baseline briefs, gaze heatmaps, questionnaire visualizations, and AI-predicted overlays. Results show that questionnaire data were trusted, creativity-enhancing, and satisfying; gaze heatmaps increased cognitive load; and AI-predicted overlays improved GAI communication but required natural language mediation to establish trust. Interviews confirmed that an authenticity-interpretability trade-off is central to balancing client voices with professional control. Our contributions are: (1) a system that incorporates experiential client signals into GAI design workflows; (2) empirical evidence of how different modalities affect design outcomes; and (3) implications for future AI tools that support human-data interaction in creative practice.
[HC-1] Discovering High Level Patterns from Simulation Traces
【速读】:该论文旨在解决嵌入物理环境中的智能体(AI agent)在进行推理、规划、摘要和问答任务时面临的挑战,尤其是当人类用户希望通过自然语言与智能体交互或引导其行为时,传统语言模型(Language Model, LM)因缺乏对物理规律的显式建模而表现不佳的问题。解决方案的关键在于提出一种以自然语言引导的方法,从详细的仿真日志中自动提取粗粒度的物理模式(如“刚体碰撞”、“稳定支撑”等),通过合成操作仿真日志的程序将其映射为高层激活模式,从而构建更适于自然语言推理的结构化表示。实验表明,这种标注后的仿真日志表示能显著提升语言模型对物理系统的理解能力,并支持从自然语言目标生成有效的奖励程序,用于规划或监督学习场景。
链接: https://arxiv.org/abs/2602.10009
作者: Sean Memery,Kartic Subr
机构: University of Edinburgh(爱丁堡大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges including reasoning, planning, summarization, and question answering. This problem is exacerbated when a human user wishes to either guide or interact with the agent in natural language. Although the use of Language Models (LMs) is the default choice, as an AI tool, they struggle with tasks involving physics. The LM’s capability for physical reasoning is learned from observational data, rather than being grounded in simulation. A common approach is to include simulation traces as context, but this suffers from poor scalability as simulation traces contain larger volumes of fine-grained numerical and semantic data. In this paper, we propose a natural language guided method to discover coarse-grained patterns (e.g., ‘rigid-body collision’, ‘stable support’, etc.) from detailed simulation logs. Specifically, we synthesize programs that operate on simulation logs and map them to a series of high level activated patterns. We show, through two physics benchmarks, that this annotated representation of the simulation log is more amenable to natural language reasoning about physical systems. We demonstrate how this method enables LMs to generate effective reward programs from goals specified in natural language, which may be used within the context of planning or supervised learning.
[HC-2] Human-AI Synergy Supports Collective Creative Search
【速读】:该论文旨在解决生成式 AI (Generative AI) 在集体创造力中对创意产出质量与多样性影响不明确的问题。其解决方案的关键在于设计了一个受控的词语猜测任务,该任务在保持开放性的同时引入客观的语义相似度评分机制,从而量化参与者(人类、AI或人机混合组)的创造性表现。实验对比了纯人类组、纯AI组与人机混合组的表现和猜词多样性,发现混合组在取得最高性能的同时维持了高多样性;进一步分析表明,人类与AI在混合环境中均会系统性调整策略,体现出更高阶的交互效应,即双方因彼此存在而适应性改变行为模式。这一结果揭示了人机协作在集体创造力中的独特优势,优于异构AI之间的协作,凸显了人类与AI在创造过程中互补的作用机制。
链接: https://arxiv.org/abs/2602.10001
作者: Chenyi Li,Raja Marjieh,Haoyu Hu,Mark Steyvers,Katherine M. Collins,Ilia Sucholutsky,Nori Jacoby
机构: Cornell University (康奈尔大学); Princeton University (普林斯顿大学); University of California, Irvine (加州大学欧文分校); Massachusetts Institute of Technology (麻省理工学院); New York University (纽约大学)
类目: ocial and Information Networks (cs.SI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI is increasingly transforming creativity into a hybrid human-artificial process, but its impact on the quality and diversity of creative output remains unclear. We study collective creativity using a controlled word-guessing task that balances open-endedness with an objective measure of task performance. Participants attempt to infer a hidden target word, scored based on the semantic similarity of their guesses to the target, while also observing the best guess from previous players. We compare performance and outcome diversity across human-only, AI-only, and hybrid human-AI groups. Hybrid groups achieve the highest performance while preserving high diversity of guesses. Within hybrid groups, both humans and AI agents systematically adjust their strategies relative to single-agent conditions, suggesting higher-order interaction effects, whereby agents adapt to each other’s presence. Although some performance benefits can be reproduced through collaboration between heterogeneous AI systems, human-AI collaboration remains superior, underscoring complementary roles in collective creativity.
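摘要中的打分机制是“猜词与目标词的语义相似度”;下面用 sentence-transformers 的余弦相似度给出一个示意,其中模型选择仅为示例,并非论文所用的评分模型:

```python
# 假设性示意:用句向量余弦相似度为猜词打分。
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def score_guess(guess: str, target: str) -> float:
    emb = model.encode([guess, target], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(score_guess("ocean", "sea"))      # 语义接近,得分高
print(score_guess("ocean", "bicycle"))  # 语义较远,得分低
```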
[HC-3] Self-Regulated Reading with AI Support: An Eight-Week Study with Students
【速读】:该论文旨在解决高校学生使用生成式 AI (Generative AI) 作为学术阅读辅助工具时,其交互行为如何影响阅读体验与认知投入的机制尚不清晰的问题。研究通过为期八周的纵向实验,收集了15名本科生在课程阅读中使用的838条提示(prompt),并构建了涵盖解码(Decoding)、理解(Comprehension)、推理(Reasoning)和元认知(Metacognition)四类认知主题的编码体系,发现理解类提示占主导(59.6%),且学生在单次阅读会话中虽呈现从理解向推理的认知推进趋势,但常被中断;同时个体差异稳定存在,且学生普遍存在“意图-行为”鸿沟——即虽意识到高质量提示需付出努力,却普遍以效率为优先驱动因素,并发展出一种“通过AI阅读”而非“与AI共读”的策略:利用AI生成的摘要筛选值得深入阅读的内容。解决方案的关键在于识别出学生在实际使用中对认知深度的限制性选择及其动机逻辑,从而为设计能够持续支持认知投入的AI阅读系统提供实证依据与设计启示。
链接: https://arxiv.org/abs/2602.09907
作者: Yue Fu,Joel Wester,Niels Van Berkel,Alexis Hiniker
机构: University of Washington (华盛顿大学); University of Copenhagen (哥本哈根大学); Aalborg University (奥尔堡大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:College students increasingly use AI chatbots to support academic reading, yet we lack granular understanding of how these interactions shape their reading experience and cognitive engagement. We conducted an eight-week longitudinal study with 15 undergraduates who used AI to support assigned readings in a course. We collected 838 prompts across 239 reading sessions and developed a coding schema categorizing prompts into four cognitive themes: Decoding, Comprehension, Reasoning, and Metacognition. Comprehension prompts dominated (59.6%), with Reasoning (29.8%), Metacognition (8.5%), and Decoding (2.1%) less frequent. Most sessions (72%) contained exactly three prompts, the required minimum of the reading assignment. Within sessions, students showed natural cognitive progression from comprehension toward reasoning, but this progression was truncated. Across eight weeks, students’ engagement patterns remained stable, with substantial individual differences persisting throughout. Qualitative analysis revealed an intention-behavior gap: students recognized that effective prompting required effort but rarely applied this knowledge, with efficiency emerging as the primary driver. Students also strategically triaged their engagement based on interest and academic pressures, exhibiting a novel pattern of reading through AI rather than with it: using AI-generated summaries as primary material to filter which sections merited deeper attention. We discuss design implications for AI reading systems that scaffold sustained cognitive engagement.
[HC-4] Safeguarding Privacy: Privacy-Preserving Detection of Mind Wandering and Disengagement Using Federated Learning in Online Education
【速读】:该论文旨在解决远程学习中因缺乏直接教师支持而导致的学习者注意力分散(mind wandering)与参与度下降的问题,这些问题会显著影响学习效果。为实现对学习者认知脱离状态的自动化检测并提供实时支持,研究提出了一种基于跨设备联邦学习(federated learning)的框架,其关键在于利用视频中的面部表情和注视特征构建可泛化的认知脱离检测模型,同时通过联邦学习机制在不共享原始敏感数据的前提下完成分布式模型训练,从而保障用户隐私(privacy-by-design)。此外,针对佩戴眼镜可能带来的识别干扰,研究引入相关特征以提升模型性能,最终在五个数据集上验证了该方案的有效性与实用性。
链接: https://arxiv.org/abs/2602.09904
作者: Anna Bodonhelyi,Mengdi Wang,Efe Bozkir,Babette Bühler,Enkelejda Kasneci
机构: Technische Universität München (慕尼黑工业大学)
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注:
Abstract:Since the COVID-19 pandemic, online courses have expanded access to education, yet the absence of direct instructor support challenges learners’ ability to self-regulate attention and engagement. Mind wandering and disengagement can be detrimental to learning outcomes, making their automated detection via video-based indicators a promising approach for real-time learner support. However, machine learning-based approaches often require sharing sensitive data, raising privacy concerns. Federated learning offers a privacy-preserving alternative by enabling decentralized model training while also distributing computational load. We propose a framework exploiting cross-device federated learning to address different manifestations of behavioral and cognitive disengagement during remote learning, specifically behavioral disengagement, mind wandering, and boredom. We fit video-based cognitive disengagement detection models using facial expressions and gaze features. By adopting federated learning, we safeguard users’ data privacy through privacy-by-design and introduce a novel solution with the potential for real-time learner support. We further address challenges posed by eyeglasses by incorporating related features, enhancing overall model performance. To validate the performance of our approach, we conduct extensive experiments on five datasets and benchmark multiple federated learning algorithms. Our results show great promise for privacy-preserving educational technologies promoting learner engagement.
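下面以经典的 FedAvg 聚合为例,示意跨设备联邦学习中服务器端如何在不收集原始数据的前提下按样本量加权合并客户端模型;论文实际对比了多种联邦算法,此处仅为其中最基础的一种:

```python
# 假设性示意:FedAvg 的服务器端参数聚合。
import torch

def fedavg(client_state_dicts, client_num_samples):
    """按各客户端样本量加权平均模型参数,返回全局模型参数。"""
    total = sum(client_num_samples)
    global_state = {}
    for key in client_state_dicts[0]:
        global_state[key] = sum(
            sd[key] * (n / total)
            for sd, n in zip(client_state_dicts, client_num_samples)
        )
    return global_state

# 两个"客户端"的玩具参数
c1 = {"w": torch.ones(3)}
c2 = {"w": torch.zeros(3)}
print(fedavg([c1, c2], [30, 10]))   # w = 0.75(按 3:1 加权)
```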
[HC-5] BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices
【速读】:该论文旨在解决在资源受限的可穿戴和移动设备上实现高效人体活动识别(Human Activity Recognition, HAR)的问题,即如何在有限的内存占用和计算预算下维持高精度,并适应不同传感器配置的异构性。其核心解决方案是提出BabyMamba-HAR框架,包含两种轻量级Mamba-inspired架构:CI-BabyMamba-HAR通过通道独立的stem结构减少跨通道噪声传播,而Crossover-BiDir-BabyMamba-HAR采用早期融合stem实现与通道数量无关的计算复杂度;二者均结合权重共享的双向扫描和轻量级时序注意力池化机制,显著提升模型效率与性能,在8个基准测试中平均宏F1-score达86.52%,参数仅约27K、MACs为2.21M,相较TinyHAR在高通道数据集上节省11倍计算量。
链接: https://arxiv.org/abs/2602.09872
作者: Mridankan Mandal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear time sequence processing with input dependent gating, presenting a compelling alternative to quadratic complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba inspired architectures optimized for resource constrained HAR: (1) CI-BabyMamba-HAR, using a channel independent stem that processes each sensor channel through shared weight, but instance independent transformations to prevent cross channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, using an early fusion stem that achieves channel count independent computational complexity. Both variants incorporate weight tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. Systematic ablation studies reveal that bidirectional scanning contributes up to 8.42% F1-score improvement, and gated temporal attention provides up to 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.
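摘要中的“轻量级时序注意力池化”可用如下通用模块示意,将 (batch, time, dim) 的序列特征加权汇聚为 (batch, dim);结构细节为假设,并非 BabyMamba-HAR 的原始实现:

```python
# 假设性示意:对时间维做 softmax 注意力加权的池化模块。
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D);对时间维计算注意力权重后加权求和
        attn = torch.softmax(self.score(x), dim=1)      # (B, T, 1)
        return (attn * x).sum(dim=1)                     # (B, D)

pool = TemporalAttentionPooling(dim=64)
feats = torch.randn(8, 128, 64)   # 8 个窗口、128 个时间步、64 维特征
print(pool(feats).shape)          # torch.Size([8, 64])
```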
[HC-6] Code2World: A GUI World Model via Renderable Code Generation
【速读】:该论文旨在解决现有自主GUI代理在交互环境中面临的视觉保真度与结构可控性难以兼顾的问题,尤其是文本和像素级方法在生成下一UI状态时的局限性。其核心解决方案是提出Code2World,一种基于代码生成的GUI世界模型,通过将GUI轨迹转化为可渲染的HTML代码并结合视觉反馈机制优化代码质量,从而实现高保真且结构可控的界面状态预测。关键创新在于:1)构建包含8万+高质量屏幕-动作对的数据集AndroidCode,缓解数据稀缺问题;2)引入Render-Aware强化学习策略,以渲染结果作为奖励信号,确保视觉语义一致性和动作连贯性,显著提升下游导航任务的成功率。
链接: https://arxiv.org/abs/2602.09856
作者: Yuhao Zheng,Li’an Zhong,Yi Wang,Rui Dai,Kaikui Liu,Xiangxiang Chu,Linyuan Lv,Philip Torr,Kevin Qinghong Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: github: this https URL project page: this https URL
Abstract:Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
[HC-7] An open-source implementation of a closed-loop electrocorticographic Brain-Computer Interface using Micromed FieldTrip and PsychoPy
【速读】:该论文旨在解决当前脑-机接口(Brain-Computer Interface, BCI)研究中缺乏透明、可定制且易于部署的闭环系统实现问题,尤其针对皮层电图(electrocorticographic, ECoG)信号的实时处理与实验设计需求。其解决方案的关键在于开发并开源了三个专用Python库:psychopylib用于交互式实验设计,pymarkerlib用于事件标记传输,pyfieldtriplib用于实时信号处理,并通过FieldTrip与Micromed采集系统集成,实现从数据采集到实验控制的全流程闭环管理,从而提升BCI系统的灵活性、可复现性与易用性,降低研究人员将ECoG解码技术转化为实际应用的技术门槛。
链接: https://arxiv.org/abs/2602.09735
作者: Bob Van Dyck,Arne Van Den Kerchove,Marc M. Van Hulle
机构: Laboratory for Neuro- and Psychophysiology (KU Leuven)
类目: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
备注:
Abstract:We present an open-source implementation of a closed-loop Brain-Computer Interface (BCI) system based on electrocorticographic (ECoG) recordings. Our setup integrates FieldTrip for interfacing with a Micromed acquisition system and PsychoPy for implementing experiments. We open-source three custom Python libraries (psychopylib, pymarkerlib, and pyfieldtriplib) each covering different aspects of a closed-loop BCI interface: designing interactive experiments, sending event information, and real-time signal processing. Our modules facilitate the design and operation of a transparent BCI system, promoting customization and flexibility in BCI research, and lowering the barrier for researchers to translate advances in ECoG decoding into BCI applications.
[HC-8] A Multiliteracy Model for Interactive Visualization Literacy: Definitions, Literacies, and Steps for Future Research
【速读】:该论文试图解决现有可视化素养(visualization literacy)研究中忽视交互性及其相关挑战的问题,而交互性是使用数据可视化系统的核心特征。解决方案的关键在于提出一个二维理论模型,该模型整合了四个已知的素养维度与五个新的素养维度,从而系统地描述人们如何使用交互式数据可视化及其系统。通过分析现有可视化系统及一项探索性研究中的观察结果,作者验证了该模型的有效性,并进一步指出未来在测量、评估、设计和教学层面推进交互式可视化素养的发展路径。
链接: https://arxiv.org/abs/2602.09631
作者: Gabriela Molina León,Benjamin Bach,Matheus Valentim,Niklas Elmqvist
机构: Aarhus University (奥胡斯大学); Inria Bordeaux (法国国家信息与自动化研究院波尔多分部); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC)
备注: Conditionally accepted to CHI 2026
Abstract:This paper presents a theoretical model for interactive visualization literacy to describe how people use interactive data visualizations and systems. Literacies have become an important concept in describing modern life skills, with visualization literacy generally referring to the use and interpretation of data visualizations. However, prior work on visualization literacy overlooks interaction and its associated challenges, despite it being an intrinsic aspect of using visualizations. Based on existing theoretical frameworks, we derive a two-dimensional model that combines four well-known literacies with five novel ones. We found evidence for our model through analyzing existing visualization systems as well as through observations from an exploratory study involving such systems. We conclude by outlining steps towards measuring, evaluating, designing for, and teaching interactive visualization literacy.
[HC-9] Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全机制在面对对抗性提示(adversarial prompts)时失效原因不明确的问题,即现有研究虽证明了越狱攻击(jailbreak attacks)的成功性,但未能揭示防御机制在何处失效以及为何失效。为此,作者提出一个结构化的分析框架——四检查点框架(Four-Checkpoint Framework),将LLM安全机制划分为四个独立的防御层,依据两个维度:处理阶段(输入 vs. 输出)和检测层级(字面匹配 vs. 意图识别)。这一框架使得每层防御可被单独评估,并据此设计了13种针对性的规避技术,用于系统性测试各防御层的有效性。关键创新在于引入加权攻击成功率(Weighted Attack Success Rate, WASR),该指标能捕捉传统二值化评估忽略的部分信息泄露,从而更准确地量化漏洞严重程度,揭示出输出阶段防御(CP3、CP4)最薄弱(WASR达72–79%),而输入级字面检测(CP1)最强(WASR仅13%),为后续针对性改进提供了诊断路径。
链接: https://arxiv.org/abs/2602.09629
作者: Hayfa Dhabhi,Kashyap Thimmaraju
机构: Technische Universität Berlin (柏林工业大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: 17 pages, pre-print
Abstract:Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. While existing research demonstrates that jailbreak attacks succeed, it does not explain where defenses fail or why. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). This creates four checkpoints, CP1 through CP4, each representing a defensive layer that can be independently evaluated. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. Traditional Binary ASR reports 22.6% attack success. However, WASR reveals 52.7%, a 2.3 \times higher vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72–79% WASR, while input-literal defenses (CP1) are strongest at 13% WASR. Claude achieves the strongest safety (42.8% WASR), followed by GPT-5 (55.9%) and Gemini (59.5%). These findings suggest that current defenses are strongest at input-literal checkpoints but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
[HC-10] Earinter: A Closed-Loop System for Eating Pace Regulation with Just-in-Time Intervention Using Commodity Earbuds
【速读】:该论文旨在解决日常场景中进食速度难以实时调控的问题,核心挑战在于个体对进食节奏变化感知不足且持续自我监控具有高认知负荷。解决方案的关键在于提出一个基于商品化耳塞的闭环系统Earinter,其创新性地利用骨传导语音传感器捕捉咀嚼相关振动,实现每次吞咽的咀嚼次数(chews per swallow, CPS)的本地推理与精准估计(平均绝对误差MAE: 0.18 ± 0.13 chews/min),并结合双系统理论(Dual Systems Theory)设计轻量级、适时干预(just-in-time, JIT)策略,在真实环境中实现对进食节奏的有效调节。
链接: https://arxiv.org/abs/2602.09522
作者: Jun Fang,Ka I Chan,Xiyuxing Zhang,Yuntao Wang,Mingze Gao,Leyi Peng,Jiajin Li,Zihang Zhan,Zhixin Zhao,Yuanchun Shi
机构: Tsinghua University (清华大学); University of Michigan (密歇根大学); Yale University (耶鲁大学); Academy of Arts and Design, Tsinghua University (清华大学美术学院); Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University (教育部普适计算重点实验室,清华大学计算机科学与技术系); Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University (青海省智能计算与应用重点实验室,青海大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Rapid eating is common yet difficult to regulate in situ, partly because people seldom notice pace changes and sustained self-monitoring is effortful. We present Earinter, a commodity-earbud-based closed-loop system that integrates in-the-wild sensing, real-time reasoning, and theory-grounded just-in-time (JIT) intervention to regulate eating pace during daily meals. Earinter repurposes the earbud’s bone-conduction voice sensor to capture chewing-related vibrations and estimate eating pace as chews per swallow (CPS) for on-device inference. With data collected equally across in-lab and in-the-wild sessions, Earinter achieves reliable chewing detection (F1 = 0.97) and accurate eating pace estimation (MAE: 0.18 \pm 0.13 chews/min, 3.65 \pm 3.86 chews/swallow), enabling robust tracking for closed-loop use. Guided by Dual Systems Theory and refined through two Wizard-of-Oz pilots, Earinter adopts a user-friendly design for JIT intervention content and delivery policy in daily meals. In a 13-day within-subject field study (N=14), the closed-loop system significantly increased CPS and reduced food-consumption speed, with statistical signs of carryover on retention-probe days and acceptable user burden. Our findings highlight how single-modality commodity earables can support practical, theory-driven closed-loop JIT interventions for regulating eating pace in the wild.
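下面用一段极简代码示意"由咀嚼/吞咽事件流计算 CPS(每次吞咽的咀嚼次数)"的基本思路(事件检测与骨传导信号处理不在示意范围内,数据与人工标注均为虚构示例):

```python
import numpy as np

# 最小示意:由咀嚼/吞咽事件时间戳估计 CPS(每次吞咽的咀嚼次数)
chew_times = np.array([1.0, 1.8, 2.5, 3.1, 3.9, 6.2, 7.0, 7.6, 8.3])  # 咀嚼事件时间(秒)
swallow_times = np.array([4.5, 9.0])                                   # 吞咽事件时间(秒)

def chews_per_swallow(chews, swallows):
    cps = []
    prev = -np.inf
    for s in swallows:
        # 统计落在相邻两次吞咽之间的咀嚼次数
        n = np.sum((chews > prev) & (chews <= s))
        cps.append(int(n))
        prev = s
    return np.array(cps)

pred_cps = chews_per_swallow(chew_times, swallow_times)
true_cps = np.array([5, 5])                       # 假设的人工标注
mae = np.mean(np.abs(pred_cps - true_cps))
print(pred_cps, f"MAE = {mae:.2f} chews/swallow")
```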
[HC-11] Jokeasy: Exploring Human-AI Collaboration in Thematic Joke Generation
【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)虽然能够支持交互式笑话生成,但普通对话界面缺乏对创作者在构建情境(context)和笑点(punchline)时所需的材料获取能力、控制权与实时性,导致创作过程受限。解决方案的关键在于设计了一个名为 Jokeasy 的搜索增强型原型系统,其核心创新是引入一个双角色 LLM 代理(dual-role LLM agent),同时承担“素材搜寻者”(material scout)与“原型写作者”(prototype writer)的功能,并通过可视化画布将检索到的网络内容组织为可编辑的灵感块(inspiration blocks),并辅以多阶段工作流支持人机协作式主题笑话创作。这一设计有效提升了创作中的创意丰富性和作者自主性,同时揭示了未来工具需进一步优化搜索粒度、聊天与画布集成度及视觉编辑灵活性的需求。
链接: https://arxiv.org/abs/2602.09496
作者: Yate Ge,Lin Tian,Chiqian Xu,Luyao Xu,Meiying Li,Yuanda Hu,Weiwei Guo
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at IASDR 2025. This is the author-accepted version. Correspondence to first author: geyate@gmail.com
Abstract:Thematic jokes are central to stand-up comedy, sitcoms, and public speaking, where contexts and punchlines rely on fresh material - news, anecdotes, and cultural references that resonate with the audience. Recent advances in Large Language Models (LLMs) have enabled interactive joke generation through conversational interfaces. Although LLMs enable interactive joke generation, ordinary conversational interfaces seldom give creators enough agency, control, or timely access to such source material for constructing context and punchlines. We designed Jokeasy, a search-enabled prototype system that integrates a dual-role LLM agent acting as both a material scout and a prototype writer to support human-AI collaboration in thematic joke writing. Jokeasy provides a visual canvas in which retrieved web content is organized into editable inspiration blocks and developed through a multistage workflow. A qualitative study with 13 hobbyists and 5 expert participants (including professional comedians and HCI/AI specialists) showed that weaving real-time web material into this structured workflow enriches ideation and preserves author agency, while also revealing needs for finer search control, tighter chat-canvas integration, and more flexible visual editing. These insights refine our understanding of AI-assisted humour writing and guide future creative-writing tools.
[HC-12] Beyond Input-Output: Rethinking Creativity through Design-by-Analogy in Human-AI Collaboration
【速读】:该论文试图解决的问题是:随着基础模型(foundation models)在个体生产力提升中的广泛应用,可能引发创意内容的同质化问题,进而导致设计固化(design fixation)。其解决方案的关键在于重新审视并拓展“基于类比的设计”(Design-by-Analogy, DbA)的应用边界——将DbA从早期构思阶段或特定数据模态中解放出来,嵌入到整个创造性过程中,通过跨域映射激发新颖解决方案,从而有效缓解设计固化现象。研究通过系统综述85项文献识别出六种表征形式与七个创造阶段的技术分类,并验证了DbA在创意产业、智能制造和教育服务等领域的实践价值,最终将其定位为人类与AI协作中的中介技术(mediating technology),为HCI与设计研究中创造力支持的发展提供新路径。
链接: https://arxiv.org/abs/2602.09423
作者: Xuechen Li,Shuai Zhang,Nan Cao,Qing Chen
机构: Tongji University (同济大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures. Accepted to the 2026 CHI Conference on Human Factors in Computing Systems
Abstract:While the proliferation of foundation models has significantly boosted individual productivity, it also introduces a potential challenge: the homogenization of creative content. In response, we revisit Design-by-Analogy (DbA), a cognitively grounded approach that fosters novel solutions by mapping inspiration across domains. However, prevailing perspectives often restrict DbA to early ideation or specific data modalities, while reducing AI-driven design to simplified input-output pipelines. Such conceptual limitations inadvertently foster widespread design fixation. To address this, we expand the understanding of DbA by embedding it into the entire creative process, thereby demonstrating its capacity to mitigate such fixation. Through a systematic review of 85 studies, we identify six forms of representation and classify techniques across seven stages of the creative process. We further discuss three major application domains: creative industries, intelligent manufacturing, and education and services, demonstrating DbA’s practical relevance. Building on this synthesis, we frame DbA as a mediating technology for human-AI collaboration and outline the potential opportunities and inherent risks for advancing creativity support in HCI and design research.
[HC-13] Scaffolding Metacognition with GenAI: Exploring Design Opportunities to Support Task Management for University Students with ADHD
【速读】:该论文旨在解决患有注意力缺陷多动障碍(ADHD)的大学生在向独立、灵活的生活方式过渡时,因元认知能力不足(即对自身思维过程的认知与调控困难)而导致的学业任务管理挑战。其解决方案的关键在于利用生成式 AI (Generative AI) 的先进信息理解与生成能力,通过元认知视角设计支持工具,具体包括:提供认知支架以增强任务与自我意识、促进反思性任务执行以发展元认知能力,以及辅助情绪调节以维持任务投入度。
链接: https://arxiv.org/abs/2602.09381
作者: Zihao Zhu,Junnan Yu,Yuhan Luo
机构: City University of Hong Kong (香港城市大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: 17 pages excluding reference and appendix. To appear in the Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:For university students transitioning to an independent and flexible lifestyle, having ADHD poses multiple challenges to their academic task management, which are closely tied to their metacognitive struggles–difficulties in awareness and regulation of one’s own thinking processes. The recently surged Generative AI shows promise to mitigate these gaps with its advanced information understanding and generation capabilities. As an exploratory step, we conducted co-design sessions with 20 university students diagnosed with ADHD, followed by interviews with five experts specialized in ADHD intervention. Adopting a metacognitive lens, we examined participants’ ideas on GenAI-based task management support and experts’ assessments, which led to three design directions: providing cognitive scaffolding to enhance task and self-awareness, promoting reflective task execution for building metacognitive abilities, and facilitating emotional regulation to sustain task engagement. Drawing on these findings, we discuss opportunities for GenAI to support the metacognitive needs of neurodivergent populations, offering future directions for both research and practice.
[HC-14] Understanding Remote Mental Health Supporters Help-Seeking in Online Communities
【速读】:该论文旨在解决远程照护者(remote caregivers)在地理距离下难以提供有效心理支持所面临的独特挑战,尤其是他们如何通过在线社区寻求帮助及其互动模式。研究通过定性分析522条Reddit帖子发现,远程照护者的求助动机包括寻求指导、情绪表达和获得认同,而社区回应则涵盖情感支持、信息策略建议及个人经验分享。关键解决方案在于识别出远程照护者特有的困境,如依赖数字线索(如语音)判断照护对象状态、危机时遭遇“数字沉默”等问题,并据此提出加强远程照护者与照护对象间的沟通与信息共享机制、优化危机管理协调能力,以及设计更契合需求的照护者在线社区功能。
链接: https://arxiv.org/abs/2602.09353
作者: Tuan-He Lee,Gilly Leshed
机构: Cornell University (康奈尔大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: accepted to CHI’26
Abstract:Providing mental health support for loved ones across a geographic distance creates unique challenges for the remote caregivers, who sometimes turn to online communities for peer support. We qualitatively analyzed 522 Reddit threads to understand what drives remote caregivers’ online help-seeking behaviors and the responses they receive from the community. Their purposes of posting included requesting guidance, expressing emotions, and seeking validation. Community responses included providing emotional support, suggesting informational strategies, and sharing personal experiences. While certain themes in posts (emotional toll, monitoring symptoms, and prioritizing caregiver well-being) are shared across remote and non-remote contexts, remote caregivers’ posts surfaced nuanced experiences. For example, they often rely on digital cues, such as voice, to interpret care receivers’ well-being while struggling with digital silence during crises. We discuss the need for supporting communication and information sharing between remote caregivers and receivers, care coordination for crisis management, and design recommendations for caregiver communities.
[HC-15] A11y-CUA Dataset: Characterizing the Accessibility Gap in Computer Use Agents
【速读】:该论文旨在解决当前计算机使用代理(Computer Use Agents, CUAs)在无障碍交互方面的显著差距问题,即现有CUA系统主要模拟视觉用户(Sighted Users, SUs)的鼠标点击和键盘输入行为,未能反映盲人及低视力用户(Blind and Low-Vision Users, BLVUs)依赖辅助技术(Assistive Technology, AT)的真实交互模式,导致BLVUs难以与CUA协作。解决方案的关键在于构建并公开A11y-CUA数据集,该数据集包含60项日常任务、40.4小时的交互记录和158,325个事件,清晰揭示了SUs与BLVUs之间在交互方式上的本质差异(鼠标主导 vs. 键盘主导),以及BLVUs内部的多样性(顺序导航 vs. 快捷键操作)。通过对比主流CUA在默认及AT模拟条件下的表现(如键盘仅用和放大镜模式下成功率分别降至28.3%和41.67%),研究验证了现有CUA对真实AT使用场景的不适应性,从而为开发更具包容性和协作性的下一代CUA提供了量化基准与数据支撑。
链接: https://arxiv.org/abs/2602.09310
作者: Ananya Gubbi Mohanbabu,Rosiana Natalie,Brandon Kim,Anhong Guo,Amy Pavel
机构: University of California, Berkeley (加州大学伯克利分校); University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注: 25 pages, 10 figures. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:Computer Use Agents (CUAs) operate interfaces by pointing, clicking, and typing – mirroring interactions of sighted users (SUs) who can thus monitor CUAs and share control. CUAs do not reflect interactions by blind and low-vision users (BLVUs) who use assistive technology (AT). BLVUs thus cannot easily collaborate with CUAs. To characterize the accessibility gap of CUAs, we present A11y-CUA, a dataset of BLVUs and SUs performing 60 everyday tasks with 40.4 hours and 158,325 events. Our dataset analysis reveals that our collected interaction traces quantitatively confirm distinct interaction styles between SU and BLVU groups (mouse- vs. keyboard-dominant) and demonstrate interaction diversity within each group (sequential vs. shortcut navigation for BLVUs). We then compare collected traces to state-of-the-art CUAs under default and AT conditions (keyboard-only, magnifier). The default CUA executed 78.3% of tasks successfully. But with the AT conditions, CUA’s performance dropped to 41.67% and 28.3% with keyboard-only and magnifier conditions respectively, and did not reflect nuances of real AT use. With our open A11y-CUA dataset, we aim to promote collaborative and accessible CUAs for everyone.
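摘要中"鼠标主导 vs. 键盘主导"的交互风格区分,可以用如下极简示意理解(事件类型字段为本文假设,并非 A11y-CUA 数据集的真实格式):

```python
from collections import Counter

# 最小示意:从交互事件日志统计鼠标/键盘事件占比,判断操作风格(数据为虚构示例)
events_su   = ["mouse_move", "click", "click", "key", "mouse_move", "click"]
events_blvu = ["key", "key", "key", "shortcut", "key", "key", "click"]

def interaction_profile(events):
    c = Counter(events)
    mouse = c["mouse_move"] + c["click"]
    keyboard = c["key"] + c["shortcut"]
    total = max(mouse + keyboard, 1)
    return {"mouse_ratio": round(mouse / total, 2),
            "keyboard_ratio": round(keyboard / total, 2)}

print("SU  :", interaction_profile(events_su))
print("BLVU:", interaction_profile(events_blvu))
```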
[HC-16] PointAloud: An Interaction Suite for AI-Supported Pointer-Centric Think-Aloud Computing
【速读】:该论文旨在解决传统Think-Aloud Computing(思考 aloud 计算)在实际应用中面临的三大挑战:用户对系统捕捉内容缺乏意识、难以被有效激励进行实时口头表达,以及系统反馈易造成干扰或遗漏。此外,还需确保用户认为“思考 aloud”具有价值,即获得情境化的AI辅助。其解决方案的关键在于提出PointAloud,一套基于指针(pointer-centric)的新型AI驱动交互机制,通过在任务执行过程中提供低干扰的系统反馈、即时鼓励用户发声,并同步记录上下文丰富的操作流程,从而实现人机协同创作与高质量工作过程文档化。
链接: https://arxiv.org/abs/2602.09296
作者: Frederic Gmeiner,John Thompson,George Fitzmaurice,Justin Matejka
机构: Autodesk Research (Autodesk 研究); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: Paper accepted to the 2026 CHI Conference on Human Factors in Computing Systems (CHI 2026)
Abstract:Think-Aloud Computing, a method for capturing users’ verbalized thoughts during software tasks, allows eliciting rich contextual insights into evolving intentions, struggles, and decision-making processes of users in real-time. However, existing approaches face practical challenges: users often lack awareness of what is captured by the system, are not effectively encouraged to speak, and miss or are interrupted by system feedback. Additionally, thinking aloud should feel worthwhile for users due to the gained contextual AI assistance. To better support and harness Think-Aloud Computing, we introduce PointAloud, a suite of novel AI-driven pointer-centric interactions for in-the-moment verbalization encouragement, low-distraction system feedback, and contextually rich work process documentation alongside proactive AI assistance. Our user study with 12 participants provides insights into the value of pointer-centric think-aloud computing for work process documentation and human-AI co-creation. We conclude by discussing the broader implications of our findings and design considerations for pointer-centric and AI-supported Think-Aloud Computing workflows.
[HC-17] Disambiguating Anthropomorphism and Anthropomimesis in Human-Robot Interaction
【速读】:该论文旨在解决人机交互(Human-Robot Interaction, HRI)领域中“拟人化”(anthropomorphism)与“拟人模拟”(anthropomimesis)这两个概念长期混淆的问题。其解决方案的关键在于明确区分二者:拟人化指用户将人类特质赋予机器人(即感知层面的归属),而拟人模拟指开发者在设计中主动引入人类特征(即设计层面的赋予)。这一概念澄清有助于未来HRI研究在机器人设计与评估中更精准地定位人类特质来源,从而推动理论发展与实践应用。
链接: https://arxiv.org/abs/2602.09287
作者: Minja Axelsson,Henry Shevlin
机构: University of Cambridge (剑桥大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 6 pages, 0 figures. Accepted at Human-Robot Interaction (HRI) conference (ACM/IEEE) 2026, Late-Breaking Reports
Abstract:In this preliminary work, we offer an initial disambiguation of the theoretical concepts anthropomorphism and anthropomimesis in Human-Robot Interaction (HRI) and social robotics. We define anthropomorphism as users perceiving human-like qualities in robots, and anthropomimesis as robot developers designing human-like features into robots. This contribution aims to provide a clarification and exploration of these concepts for future HRI scholarship, particularly regarding the party responsible for human-like qualities - robot perceiver for anthropomorphism, and robot designer for anthropomimesis. We provide this contribution so that researchers can build on these disambiguated theoretical concepts for future robot design and evaluation.
[HC-18] Human Control Is the Anchor Not the Answer: Early Divergence of Oversight in Agentic AI Communities
【速读】:该论文旨在解决当前对生成式 AI(Generative AI)监督机制设计中普遍存在的“一刀切”问题,即忽视不同应用场景下角色特异性需求导致的监督失效风险。其核心问题是:如何在早期采纳阶段识别并适配不同社会技术角色对监督预期的差异化构成?解决方案的关键在于提出一种基于社区语境的比较分析框架,通过主题建模与多维指标(如粗粒度监督主题抽象、加权显著性及分歧检验)识别出两个 Reddit 社区——r/OpenClaw(侧重部署与运维)和 r/Moltbook(侧重代理交互)之间显著可分的监督焦点差异:前者聚焦执行层面的风险控制与恢复能力(行动风险),后者则强调公共互动中的身份合法性与问责制(意义风险)。这一区分提供了可迁移的监督设计视角,支持根据代理的具体角色定制化监督机制,而非采用统一策略。
链接: https://arxiv.org/abs/2602.09286
作者: Hanjing Shi,Dominic DiFranzo
机构: Lehigh University (里海大学)
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Oversight for agentic AI is often discussed as a single goal (“human control”), yet early adoption may produce role-specific expectations. We present a comparative analysis of two newly active Reddit communities in Jan–Feb 2026 that reflect different socio-technical roles: r/OpenClaw (deployment and operations) and r/Moltbook (agent-centered social interaction). We conceptualize this period as an early-stage crystallization phase, where oversight expectations form before norms reach equilibrium. Using topic modeling in a shared comparison space, a coarse-grained oversight-theme abstraction, engagement-weighted salience, and divergence tests, we show the communities are strongly separable (JSD = 0.418, cosine = 0.372, permutation p = 0.0005). Across both communities, “human control” is an anchor term, but its operational meaning diverges: r/OpenClaw emphasizes execution guardrails and recovery (action-risk), while r/Moltbook emphasizes identity, legitimacy, and accountability in public interaction (meaning-risk). The resulting distinction offers a portable lens for designing and evaluating oversight mechanisms that match agent role, rather than applying one-size-fits-all control policies.
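摘要中报告的 JSD 与置换检验可以用如下极简示意理解(话题分布为虚构数据,仅演示统计量的计算形式,非论文原流程):

```python
import numpy as np

# 最小示意:两个社区话题分布之间的 Jensen-Shannon 散度与置换检验
rng = np.random.default_rng(0)

def jsd(p, q, eps=1e-12):
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def topic_dist(labels, k):
    return np.bincount(labels, minlength=k).astype(float)

# 每条帖子归属的话题标签(0..k-1),两个社区各 200 条帖子(虚构)
k = 8
posts_a = rng.integers(0, k, size=200)                       # 社区 A:近似均匀
p_b = np.array([0.3, 0.3, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05])
posts_b = rng.choice(k, size=200, p=p_b)                     # 社区 B:分布偏斜

obs = jsd(topic_dist(posts_a, k), topic_dist(posts_b, k))

# 置换检验:打乱社区标签,检验观测到的 JSD 是否显著高于偶然水平
pooled = np.concatenate([posts_a, posts_b])
null = []
for _ in range(999):
    rng.shuffle(pooled)
    null.append(jsd(topic_dist(pooled[:200], k), topic_dist(pooled[200:], k)))
p_value = (1 + np.sum(np.array(null) >= obs)) / (1 + len(null))
print(f"JSD = {obs:.3f}, permutation p = {p_value:.4f}")
```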
[HC-19] Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation
【速读】:该论文旨在解决机器人辅助微创手术(RMIS)中因触觉反馈和深度线索减弱而导致的视觉依赖问题,以及如何利用不同来源的注视监督信号(专家水平与感知模态)来训练更有效的注意力模型。其核心挑战在于:操作者注视数据获取成本高,且不同监督源(如中级 vs. 初学者、主动执行 vs. 被动观察)对模型学习效果的影响尚不明确。解决方案的关键是构建了一个配对的主动-被动多任务手术注视数据集(在da Vinci SimNow模拟器上采集四个训练任务),通过VR眼动追踪记录主动注视,并复用相同视频刺激收集被动注视,从而实现受控对比;进一步采用固定点密度重叠分析和单帧显著性建模评估被动注视作为监督信号的替代可行性,结果表明基于MSI-Net的模型能稳定生成可解释的预测,而SalGAN则表现不稳定,同时发现被动注视虽能恢复部分中级主动注意但存在可预测的性能下降,且主动与被动之间的迁移具有不对称性,为未来基于众包方式的手术教学与感知建模提供了可行路径。
链接: https://arxiv.org/abs/2602.09259
作者: Yizhou Li,Shuyuan Yang,Jiaji Su,Zonghe Chua
机构: Case Western Reserve University (凯斯西储大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 8 pages, conference submission pre-print
Abstract:In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.
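下面给出"注视点密度图重叠分析"的一个极简示意(密度图分辨率与重叠度量均为本文假设的常见做法,非论文原实现,注视数据为随机生成的玩具数据):

```python
import numpy as np

# 最小示意:把两组注视点渲染成归一化密度图后计算直方图交集,
# 用于比较主动执行与被动观看(或不同熟练度)之间的注意分布重叠程度。
def fixation_density(fix_xy, h=36, w=64):
    """fix_xy: (N,2) 归一化注视坐标 (x, y),取值范围 [0,1]"""
    hist, _, _ = np.histogram2d(fix_xy[:, 1] * h, fix_xy[:, 0] * w,
                                bins=[h, w], range=[[0, h], [0, w]])
    return hist / max(hist.sum(), 1.0)

def overlap(d1, d2):
    """直方图交集:1 表示完全重叠,0 表示完全不重叠"""
    return float(np.minimum(d1, d2).sum())

rng = np.random.default_rng(0)
active_fix  = rng.uniform(size=(300, 2))   # 主动执行时的注视点(示例)
passive_fix = rng.uniform(size=(300, 2))   # 被动观看时的注视点(示例)
score = overlap(fixation_density(active_fix), fixation_density(passive_fix))
print(f"density overlap = {score:.3f}")
```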
[HC-20] “Create an environment that protects women rather than selling anxiety!”: Participatory Threat Modeling with Chinese Young Women Living Alone
【速读】:该论文旨在解决中国独居年轻女性(Young Women Living Alone, YWLA)在智能家庭、在线平台和公共基础设施中面临的隐私(Privacy)、信息安全(Security)与人身安全(Safety)(合称PSS)相互交织的风险问题。研究发现,这些风险包括数字促成的物理暴力、网络骚扰与诈骗,以及来自个人、企业和政府的普遍监控,三者相互强化,形成复杂的安全困境。解决方案的关键在于构建一个以用户为中心的威胁模型,并提出四类缓解策略:智能设备配置、边界管理、社会文化实践及社交媒体策略;同时开发了针对YWLA的数字PSS指南,并从产品设计、政策法规、教育干预等维度提出可操作的设计启示,从而在提升防护能力的同时识别并规避新引入的风险与心理负担。
链接: https://arxiv.org/abs/2602.09256
作者: Shijing He,Chenkai Ma,Chi Zhang,Adam Jenkins,Ruba Abu-Salma,Jose Such
机构: King’s College London(伦敦国王学院); University of Warwick(华威大学); INGENIO (CSIC-Universitat Politècnica de València)(英格尼奥研究中心(CSIC-瓦伦西亚理工大学))
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Paper accepted for presentation at CHI 2026
Abstract:As more young women in China live alone, they navigate entangled privacy, security, and safety (PSS) risks across smart homes, online platforms, and public infrastructures. Drawing on six participatory threat modeling (PTM) workshops (n = 33), we present a human-centered threat model that illustrates how digitally facilitated physical violence, digital harassment and scams, and pervasive surveillance by individuals, companies, and the state are interconnected and mutually reinforcing. We also document four mitigation strategies employed by participants: smart home device configurations, boundary management, sociocultural practices, and social media tactics–each of which can introduce new vulnerabilities and emotional burdens. Based on these insights, we developed a digital PSS guidebook for young women living alone (YWLA) in China. We further propose actionable design implications for smart home devices and social media platforms, along with policy and legal recommendations and directions for educational interventions.
[HC-21] Investigating Bystander Privacy in Chinese Smart Home Apps
【速读】:该论文旨在解决非西方国家(以中国为例)智能家庭环境中“旁观者隐私”(bystander privacy)保护不足的问题,即现有智能家庭应用在政策声明与实际数据处理之间存在显著脱节,尤其忽视了访客等非用户个体的隐私权益。其解决方案的关键在于通过混合方法(包括隐私政策审查、用户体验/界面评估及苹果应用商店隐私标签分析),识别出数据控制不一致和数据共享透明度缺失的核心问题,并提出三项设计建议:强化旁观者隐私保护机制、提升面向用户的隐私信息透明度(privacy-oriented UI transparency),以及增强隐私标签的可信度,从而推动非西方语境下更具包容性的智能家居生态系统的健康发展。
链接: https://arxiv.org/abs/2602.09254
作者: Shijing He,Xuchen Wang,Yaxiong Lei,Chi Zhang,Ruba Abu-Salma,Jose Such
机构: King’s College London(伦敦国王学院); University of St Andrews(圣安德鲁斯大学); University of Warwick(华威大学); INGENIO (CSIC-Universitat Politècnica de València)(INGENIO (CSIC-瓦伦西亚理工大学))
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Paper accepted for presentation at CHI 2026
Abstract:Bystander privacy in smart homes has been widely studied in Western contexts, yet it remains underexplored in non-Western countries such as China. In this study, we analyze 49 Chinese smart home apps using a mixed-methods approach, including privacy policy review, UX/UI evaluation, and assessment of Apple App Store privacy labels. While most apps nominally comply with national regulations, we identify significant gaps between written policies and actual implementation. Our traceability analysis highlights inconsistencies in data controls and a lack of transparency in data-sharing practices. Crucially, bystander privacy – particularly for visitors and non-user individuals – is largely absent from both policy documents and interface design. Additionally, discrepancies between privacy labels and actual data practices threaten user trust and undermine informed consent. We provide design recommendations to strengthen bystander protections, improve privacy-oriented UI transparency, and enhance the credibility of privacy labels, supporting the development of inclusive smart home ecosystems in non-Western contexts.
[HC-22] “These cameras are just like the Eye of Sauron”: A Sociotechnical Threat Model for AI-Driven Smart Home Devices as Perceived by UK-Based Domestic Workers
【速读】:该论文旨在解决AI驱动的智能家居设备在家庭佣工(Domestic Workers, DWs)隐私保护方面带来的新型风险问题,尤其关注其在雇主家庭和自身家庭中因数据收集、AI分析及跨家庭数据流动所引发的隐私威胁。解决方案的关键在于构建一个社会技术威胁模型(sociotechnical threat model),将家庭佣工的雇佣机构(agencies)识别为制度性对手,并系统映射AI驱动下跨家庭场景中的隐私风险,从而揭示出AI功能透明度不足、数据留存不确定性以及性别化管理角色等核心挑战,为提升家庭佣工的隐私权与自主性提供理论框架与实践指导。
链接: https://arxiv.org/abs/2602.09239
作者: Shijing He,Yaxiong Lei,Xiao Zhan,Ruba Abu-Salma,Jose Such
机构: King’s College London (伦敦国王学院); University of St Andrews (圣安德鲁斯大学); Universitat Politècnica de València (瓦伦西亚理工大学); INGENIO (CSIC-UPV)
类目: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: Paper accepted for presentation at Symposium on Usable Security and Privacy (USEC) 2026
Abstract:The growing adoption of AI-driven smart home devices has introduced new privacy risks for domestic workers (DWs), who are frequently monitored in employers’ homes while also using smart devices in their own households. We conducted semi-structured interviews with 18 UK-based DWs and performed a human-centered threat modeling analysis of their experiences through the lens of Communication Privacy Management (CPM). Our findings extend existing threat models beyond abstract adversaries and single-household contexts by showing how AI analytics, residual data logs, and cross-household data flows shaped the privacy risks faced by participants. In employer-controlled homes, AI-enabled features and opaque, agency-mediated employment arrangements intensified surveillance and constrained participants’ ability to negotiate privacy boundaries. In their own homes, participants had greater control as device owners but still faced challenges, including gendered administrative roles, opaque AI functionalities, and uncertainty around data retention. We synthesize these insights into a sociotechnical threat model that identifies DW agencies as institutional adversaries and maps AI-driven privacy risks across interconnected households, and we outline social and practical implications for strengthening DW privacy and agency.
[HC-23] Untangling the Timeline: Challenges and Opportunities in Supporting Version Control in Modern Computer-Aided Design
【速读】:该论文旨在解决现代计算机辅助设计(CAD)软件中版本控制(version control)实施所面临的系统性挑战,这些问题阻碍了产品开发过程中的可追溯性、变体管理与协作效率。其关键解决方案在于通过系统性分析170个在线论坛帖子,识别出贯穿版本管理、连续性、范围和分发等维度的重复性社会技术问题,并据此提出应从工具设计层面强化对“协调工作”(articulation work)的支持、促进跨边界协作以及具备基础设施反思能力(infrastructural reflexivity),从而推动版本控制机制向更适应复杂设计数据交互的下一代架构演进。
链接: https://arxiv.org/abs/2602.09236
作者: Yuanzhe Deng,Shutong Zhang,Kathy Cheng,Alison Olechowski,Shurui Zhou
机构: University of Toronto (多伦多大学); Stanford University (斯坦福大学)
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 7 figures, to be published in the Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems
Abstract:Version control is critical in mechanical computer-aided design (CAD) to enable traceability, manage product variation, and support collaboration. Yet, its implementation in modern CAD software as an essential information infrastructure for product development remains plagued by issues due to the complexity and interdependence of design data. This paper presents a systematic review of user-reported challenges with version control in modern CAD tools. Analyzing 170 online forum threads, we identify recurring socio-technical issues that span the management, continuity, scope, and distribution of versions. Our findings inform a broader reflection on how version control should be designed and improved for CAD and motivate opportunities for tools and mechanisms that better support articulation work, facilitate cross-boundary collaboration, and operate with infrastructural reflexivity. This study offers actionable insights for CAD software providers and highlights opportunities for researchers to rethink version control.
[HC-24] Towards Human-AI Accessibility Mapping in India: VLM-Guided Annotations and POI-Centric Analysis in Chandigarh AAAI2025
【速读】:该论文旨在解决城市尺度下人行道无障碍环境评估的可扩展性与精准性问题,尤其针对发展中国家如印度 Chandigarh 城市中缺乏系统性、高精度的无障碍数据采集手段。解决方案的关键在于对 Project Sidewalk 平台进行本地化适配:包括调整标注类型以符合当地道路特征、提供更具场景相关性的示例指导,并引入视觉语言模型(Visual Language Model, VLM)驱动的任务引导机制,该机制能根据街景图像和元数据动态生成个性化操作指引。此改进显著提升了标注效率与一致性,经三名标注员评估,AI 任务引导平均得分达 4.66(满分 5),并成功应用于 Chandigarh 三个土地利用类型迥异区域(居住、商业、机构)的 POI 中心无障碍分析,覆盖约 40 km 街道和 230 个兴趣点,识别出 1,644 / 2,913 处可通过基础设施改善提升可达性的位置。
链接: https://arxiv.org/abs/2602.09216
作者: Varchita Lalwani,Utkarsh Agarwal,Michael Saugstad,Manish Kumar,Jon E. Froehlich,Anupam Sobti
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted at the First Workshop on AI for Urban Planning (AI4Up) at AAAI 2025
Abstract:Project Sidewalk is a web-based platform that enables crowdsourcing accessibility of sidewalks at city-scale by virtually walking through city streets using Google Street View. The tool has been used in 40 cities across the world, including the US, Mexico, Chile, and Europe. In this paper, we describe adaptation efforts to enable deployment in Chandigarh, India, including modifying annotation types, provided examples, and integrating VLM-based mission guidance, which adapts instructions based on a street scene and metadata analysis. Our evaluation with 3 annotators indicates the utility of AI-mission guidance with an average score of 4.66. Using this adapted Project Sidewalk tool, we conduct a Points of Interest (POI)-centric accessibility analysis for three sectors in Chandigarh with very different land uses, residential, commercial and institutional covering about 40 km of sidewalks. Across 40 km of roads audited in three sectors and around 230 POIs, we identified 1,644 of 2,913 locations where infrastructure improvements could enhance accessibility.
[HC-25] Elements of Robot Morphology: Supporting Designers in Robot Form Exploration
【速读】:该论文旨在解决人机交互(HRI)中机器人形态设计缺乏系统性指导框架的问题,即如何有效引导设计者对机器人形式进行结构化探索。其解决方案的关键在于提出“机器人形态元素”(Elements of Robot Morphology)框架,该框架识别出五个基本构成要素:感知(perception)、运动关节(articulation)、末端执行器(end effectors)、移动方式(locomotion)和结构(structure),并进一步开发了“形态探索积木”(Morphology Exploration Blocks, MEB)这一具身化工具,支持设计者通过动手协作的方式进行机器人形态的实验与创新,从而在分析、构思、反思和协同设计等环节实现系统性的形态探索。
链接: https://arxiv.org/abs/2602.09203
作者: Amy Koike, Ge (Serena) Guo,Xinning He,Callie Y. Kim,Dakota Sullivan,Bilge Mutlu
机构: 未知
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 10 pages, 5 figures, Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26)
Abstract:Robot morphology, the form, shape, and structure of robots, is a key design space in human-robot interaction (HRI), shaping how robots function, express themselves, and interact with people. Yet, despite its importance, little is known about how design frameworks can guide systematic form exploration. To address this gap, we introduce Elements of Robot Morphology, a framework that identifies five fundamental elements: perception, articulation, end effectors, locomotion, and structure. Derived from an analysis of existing robots, the framework supports structured exploration of diverse robot forms. To operationalize the framework, we developed Morphology Exploration Blocks (MEB), a set of tangible blocks that enable hands-on, collaborative experimentation with robot morphologies. We evaluate the framework and toolkit through a case study and design workshops, showing how they support analysis, ideation, reflection, and collaborative robot design.
[HC-26] Genocide by Algorithm in Gaza: Artificial Intelligence Countervailing Responsibility and the Corruption of Public Discourse
【速读】:该论文试图解决的问题是:在人工智能加速军事化背景下,AI驱动的瞄准系统如何作为认知基础设施(epistemic infrastructure)参与暴力的分类、合法化与执行,并重塑战争的伦理、政治与治理结构。其核心关切在于揭示AI并非中立工具,而是主动再生产殖民等级秩序并使暴行常态化的机制。解决方案的关键在于重构技术主体性与问责机制,提出必须推动AI伦理的民主化,以抵抗技术官僚宿命论,并将高技术军事主义中最受影响群体的生存经验置于中心位置,从而实现对算法暴力的实质性回应。
链接: https://arxiv.org/abs/2602.09202
作者: Branislav Radeljic
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注:
Abstract:The accelerating militarization of artificial intelligence has transformed the ethics, politics, and governance of warfare. This article interrogates how AI-driven targeting systems function as epistemic infrastructures that classify, legitimize, and execute violence, using Israel’s conduct in Gaza as a paradigmatic case. Through the lens of responsibility, the article examines three interrelated dimensions: (a) political responsibility, exploring how states exploit AI to accelerate warfare while evading accountability; (b) professional responsibility, addressing the complicity of technologists, engineers, and defense contractors in the weaponization of data; and © personal responsibility, probing the moral agency of individuals who participate in or resist algorithmic governance. This is complemented by an examination of the position and influence of those participating in public discourse, whose narratives often obscure or normalize AI-enabled violence. The Gaza case reveals AI not as a neutral instrument but as an active participant in the reproduction of colonial hierarchies and the normalization of atrocity. Ultimately, the paper calls for a reframing of technological agency and accountability in the age of automated warfare. It concludes that confronting algorithmic violence demands a democratization of AI ethics, one that resists technocratic fatalism and centers the lived realities of those most affected by high-tech militarism.
计算机视觉
[CV-0] SAGE: Scalable Agent ic 3D Scene Generation for Embodied AI
【速读】:该论文旨在解决真实世界中具身智能体(embodied agents)数据采集成本高且不安全的问题,提出一种可扩展、逼真且适用于模拟器的3D环境生成方法。其核心解决方案是SAGE框架,该框架基于代理式(agentic)架构,能够理解用户指定的具身任务意图(如“拿起碗并放到桌子上”),并通过耦合多个布局与物体组合生成器以及评估语义合理性、视觉真实性和物理稳定性的判别器,实现场景的迭代推理与自适应工具选择,从而自动构建符合物理有效性和任务意图的仿真环境。该方案的关键创新在于通过多模态生成与批判机制协同优化,实现高质量、多样化且直接部署于现代模拟器的场景生成,为具身AI的策略训练提供可扩展的数据基础。
链接: https://arxiv.org/abs/2602.10116
作者: Hongchi Xia,Xuan Li,Zhaoshuo Li,Qianli Ma,Jiashu Xu,Ming-Yu Liu,Yin Cui,Tsung-Yi Lin,Wei-Chiu Ma,Shenlong Wang,Shuran Song,Fangyin Wei
机构: NVIDIA; University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Cornell University (康奈尔大学); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Page: this https URL
Abstract:Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: this https URL.
[CV-1] Quantum Multiple Rotation Averag ing
【速读】:该论文旨在解决多旋转平均(Multiple Rotation Averaging, MRA)问题,即从噪声相对旋转测量中恢复全局一致的绝对旋转,这是3D视觉与机器人领域中的核心优化任务。传统方法如L1-IRLS和Shonan受限于局部极小值敏感性和依赖凸松弛带来的流形几何失真,在高噪声场景下精度下降。其解决方案的关键在于提出IQARS(Iterative Quantum Annealing for Rotation Synchronization),首次将MRA建模为一系列可在量子退火器上执行的局部二次非凸子问题,通过二值化处理利用量子硬件优势;该方法摒弃了凸松弛依赖,更精确保留非欧几里得旋转流形结构,并借助量子隧穿效应与并行性实现高效解空间探索,从而在当前受限的量子退火设备上已实现比最优经典方法Shonan约12%的精度提升。
链接: https://arxiv.org/abs/2602.10115
作者: Shuteng Wang,Natacha Kuete Meli,Michael Möller,Vladislav Golyanik
机构: Max Planck Institute for Informatics, SIC; University of Siegen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 13 figures, 4 tables; project page: this https URL
Abstract:Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS’s performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
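为帮助理解 MRA 的优化目标,下面给出弦距残差(chordal cost)的极简示意(相对旋转约定 R_ij = R_j R_i^T 为常见假设;量子退火求解即 IQARS 本身不在示意范围内):

```python
import numpy as np

# 最小示意:给定相对旋转测量 R_ij,评估一组绝对旋转 {R_i} 的弦距残差。
def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# 三个相机的"真实"绝对旋转(绕 z 轴,玩具示例)
R_true = [rot_z(a) for a in (0.0, 0.3, 0.8)]
# 相对测量 R_ij ≈ R_j @ R_i^T(此处不加噪声)
edges = [(0, 1), (1, 2), (0, 2)]
R_rel = {(i, j): R_true[j] @ R_true[i].T for i, j in edges}

def chordal_cost(R_abs, R_rel):
    return sum(np.linalg.norm(R_rel[(i, j)] - R_abs[j] @ R_abs[i].T, "fro") ** 2
               for (i, j) in R_rel)

print("cost at ground truth  :", chordal_cost(R_true, R_rel))          # ≈ 0(存在全局规范自由度)
print("cost at identity init :", chordal_cost([np.eye(3)] * 3, R_rel))  # > 0
```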
[CV-2] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
【速读】:该论文旨在解决图像到视频生成(Image-to-Video generation, I2V)中长期存在的对象身份一致性保持难题,尤其是在多视角变化下易出现的外观漂移(appearance drift)和几何失真(geometric distortion)问题。这些问题主要源于单视图二维观测信息稀疏以及跨模态对齐弱化。解决方案的关键在于两个层面:一是构建了大规模、以物体为中心的高质量时序对齐数据集 ConsIDVid,并提出 ConsIDVid-Bench 基准测试框架,采用对细微几何与外观偏差敏感的指标评估多视角一致性;二是设计了 ConsID-Gen 框架,通过引入未姿态校正的辅助视图增强首帧信息,利用双流视觉-几何编码器融合语义与结构线索,并结合文本-视觉连接器实现统一条件输入,驱动 Diffusion Transformer 主干网络,从而显著提升生成视频的身份保真度与时间连贯性。
链接: https://arxiv.org/abs/2602.10113
作者: Mingyang Wu,Ashirbad Mishra,Soumik Dey,Shuo Xing,Naveen Ravipati,Hansi Wu,Binbin Li,Zhengzhong Tu
机构: Texas A&M University (德克萨斯农工大学); eBay Inc. (eBay公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at this https URL.
[CV-3] Olaf-World: Orienting Latent Actions for Video World Modeling
【速读】:该论文旨在解决动作可控世界模型(action-controllable world models)在大规模预训练中因动作标签稀缺而导致的性能瓶颈问题。现有方法通过潜在动作学习(latent action learning)从无标签视频中提取控制接口,但其学习到的潜在表示往往无法跨场景迁移,主要由于缺乏统一的坐标系且易混入场景特异性线索。解决方案的关键在于提出Seq Δ-REPA(sequence-level control-effect alignment)目标函数,该函数利用冻结的自监督视频编码器提取的时间特征差异作为共享参考,从而对齐不同上下文中的动作语义,使潜在动作空间具有更强的结构化和泛化能力。基于此,作者进一步构建了Olaf-World预训练流水线,实现了从大规模被动视频中学习动作条件视频世界模型的目标。
链接: https://arxiv.org/abs/2602.10104
作者: Yuxin Jiang,Yuchao Gu,Ivor W. Tsang,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL Code: this https URL
Abstract:Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq \Delta -REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
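Seq Δ-REPA 的"序列级控制效果对齐"思想可用如下极简示意理解(投影方式、维度与对齐目标的具体取法均为本文假设,非论文原实现):

```python
import torch
import torch.nn.functional as F

# 最小示意:把一段序列上累加的潜在动作,与冻结视频编码器特征的时间差分做余弦对齐。
B, T, D = 4, 8, 256
latent_actions = torch.randn(B, T - 1, D)       # 每个相邻帧转移对应的潜在动作(示例)
frozen_feats = torch.randn(B, T, D)             # 冻结自监督视频编码器的逐帧特征(示例)

# 序列级:潜在动作沿时间累加,对齐首末帧特征之差(Δ 特征)
action_integral = latent_actions.sum(dim=1)              # (B, D)
delta_feat = frozen_feats[:, -1] - frozen_feats[:, 0]     # (B, D)

proj = torch.nn.Linear(D, D)                    # 可学习投影(占位)
align_loss = 1.0 - F.cosine_similarity(proj(action_integral), delta_feat, dim=-1).mean()
print("Seq Δ-REPA alignment loss:", float(align_loss))
```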
[CV-4] VideoWorld 2: Learning Transferable Knowledge from Real-world Videos KR
【速读】:该论文旨在解决智能体从无标签视频数据中学习可迁移的世界知识,并将其应用于新环境中的问题,尤其在真实世界的手工制作任务中,现有视频生成模型和隐变量动力学模型难以稳定运行。其解决方案的关键在于提出 VideoWorld 2,其中引入了动态增强的潜在动力学模型(dynamic-enhanced Latent Dynamics Model, dLDM),通过预训练的视频扩散模型分离视觉外观与动作动力学,使 dLDM 能够专注于学习紧凑且任务相关的潜在编码;这些潜在编码被自回归建模以学习任务策略并支持长时程推理,从而显著提升任务成功率(最高达 70% 提升)和生成连贯的长期执行视频。
链接: https://arxiv.org/abs/2602.10102
作者: Zhongwei Ren,Yunchao Wei,Xiao Yu,Guixun Luo,Yao Zhao,Bingyi Kang,Jiashi Feng,Xiaojie Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and models are released at: this https URL
Abstract:Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
[CV-5] Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
【速读】:该论文旨在解决标准扩散 Transformer(Diffusion Transformer)在直接应用于表示编码器(representation encoders)的高维特征空间时无法收敛的问题。现有研究将此归因于容量瓶颈,并提出通过增加模型宽度来缓解,但本文指出根本原因在于几何结构失配:标准欧几里得流匹配(Euclidean flow matching)迫使概率路径穿越特征空间低密度区域,而非沿流形表面演化,导致训练不稳定。解决方案的关键是提出黎曼流匹配结合雅可比正则化(Riemannian Flow Matching with Jacobi Regularization, RJF),通过约束生成过程沿流形测地线(geodesics)进行,并校正由曲率引起的误差传播,从而实现无需宽度扩展即可稳定收敛。实验表明,RJF使标准 DiT-B 架构(131M 参数)在 FID 达到 3.37,而此前方法均无法收敛。
链接: https://arxiv.org/abs/2602.10099
作者: Amandeep Kumar,Vishal M. Patel
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
Abstract:Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck proposing computationally expensive width scaling of diffusion transformers we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. Our method RJF enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
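论文强调沿超球面测地线构造概率路径;下面用 slerp 与欧氏线性插值的对比给出这一几何直觉(仅为示意,并非 RJF 的实现):

```python
import numpy as np

# 最小示意:超球面上的测地线插值(slerp)与欧氏线性插值的对比,
# 后者的中间点会落入球面内部(低密度区),前者始终留在流形上。
def slerp(x0, x1, t):
    x0 = x0 / np.linalg.norm(x0)
    x1 = x1 / np.linalg.norm(x1)
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if omega < 1e-8:
        return x0
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

rng = np.random.default_rng(0)
x0 = rng.normal(size=64); x0 /= np.linalg.norm(x0)
x1 = rng.normal(size=64); x1 /= np.linalg.norm(x1)

t = 0.5
euclid_mid = (1 - t) * x0 + t * x1
geo_mid = slerp(x0, x1, t)
print("Euclidean midpoint norm:", round(float(np.linalg.norm(euclid_mid)), 3))  # < 1,离开球面
print("Geodesic  midpoint norm:", round(float(np.linalg.norm(geo_mid)), 3))     # = 1,留在球面上
```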
[CV-6] VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
【速读】:该论文旨在解决当前基于互联网规模视频预训练的视觉-语言-动作(Vision-Language-Action, VLA)策略中存在的问题:现有隐式动作目标(latent-action objectives)往往学习到错误的表征,即过度依赖像素级变化而非与动作相关的状态转移,导致模型对外观偏差(appearance bias)、无关运动(nuisance motion)和信息泄露(information leakage)敏感。其解决方案的关键在于提出一种基于JEPA(Joint-Embedding Predictive Architecture)风格的预训练框架——VLA-JEPA,核心思想是“无泄漏的状态预测”:目标编码器从未来帧中生成潜在表示作为监督信号,而学生路径仅接收当前观测图像,未来信息仅用于监督目标而不作为输入。通过在潜在空间而非像素空间进行预测,VLA-JEPA能够学习到对相机运动和背景干扰鲁棒的状态动态抽象,从而实现更稳定、泛化的动作策略,且无需复杂的多阶段训练流程,仅需两阶段方案(JEPA预训练+动作头微调)即可获得显著性能提升。
链接: https://arxiv.org/abs/2602.10098
作者: Jingwen Sun,Wenyao Zhang,Zekun Qi,Shaojie Ren,Zezhi Liu,Hanxin Zhu,Guangzhong Sun,Xin Jin,Zhibo Chen
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is \emphleakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation – future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe – JEPA pretraining followed by action-head fine-tuning – without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
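下面用几行 PyTorch 示意"无泄漏的状态预测":未来帧只经目标编码器(stop-gradient)产生监督目标,不作为学生输入(网络均为占位示例,非 VLA-JEPA 原实现):

```python
import torch
import torch.nn as nn

# 最小示意:JEPA 式预测,学生路径只看当前观测,未来信息仅用于构造监督目标。
D = 128
student_enc = nn.Linear(D, D)      # 学生编码器(占位)
predictor   = nn.Linear(D, D)      # 预测头(占位)
target_enc  = nn.Linear(D, D)      # 目标编码器(实践中通常为学生的 EMA 副本)

obs_now    = torch.randn(8, D)     # 当前观测特征(示例)
obs_future = torch.randn(8, D)     # 未来观测特征(仅用于构造目标)

pred = predictor(student_enc(obs_now))
with torch.no_grad():              # 未来信息不回传梯度,也不进入学生输入
    target = target_enc(obs_future)

loss = nn.functional.mse_loss(pred, target)
loss.backward()                    # 仅更新学生编码器与预测头
print("JEPA prediction loss:", float(loss))
```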
[CV-7] Causality in Video Diffusers is Separable from Denoising
【速读】:该论文旨在解决当前因果扩散模型(causal diffusion models)中时间推理与多步去噪过程耦合导致的计算冗余和效率低下问题。现有方法在每一去噪步骤中均对全序列应用因果注意力机制,造成大量重复计算且难以优化性能。解决方案的关键在于提出分离式因果扩散(Separable Causal Diffusion, SCD)架构,其核心创新是显式解耦“每帧一次”的时间推理(通过因果Transformer编码器完成)与“多步帧内渲染”(由轻量级扩散解码器实现),从而显著提升生成吞吐量和单帧延迟,同时保持或超越基线模型的生成质量。
链接: https://arxiv.org/abs/2602.10095
作者: Xingjian Bai,Guande He,Zhengqi Li,Eli Shechtman,Xun Huang,Zongze Wu
机构: Massachusetts Institute of Technology (麻省理工学院); Adobe Research (Adobe 研究院); Morpheus AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Causality – referring to temporal, uni-directional cause-effect relationships between components – underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
[CV-8] 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
【速读】:该论文旨在解决单目视频中4D重建(4D reconstruction)的难题,即如何从单一视角视频中联合建模密集场景几何结构与动态运动信息。现有方法通常将运动与几何解耦处理,或仅能生成稀疏轨迹、双视图场景光流等有限的4D属性,难以实现统一且高质量的4D表示。其解决方案的关键在于提出4RC框架,引入“一次编码、任意查询”(encode-once, query-anywhere and anytime)范式:利用Transformer主干网络将整段视频编码为紧凑的时空潜在空间,再通过条件解码器在任意目标时间戳对任意查询帧高效提取3D几何和运动信息;同时,通过将每帧的4D属性最小因子化分解为基底几何与随时间变化的相对运动,显著提升学习效率与重建精度。
链接: https://arxiv.org/abs/2602.10094
作者: Yihang Luo,Shangchen Zhou,Yushi Lan,Xingang Pan,Chen Change Loy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
[CV-9] Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
【速读】:该论文旨在解决图像伪造检测中仅定位篡改区域(target)而忽略源区域(source)所导致的误判问题,尤其是在抗议类图像等场景下,若仅识别出被插入的暴力行为(target),而未定位其复制来源(source),可能导致对事件背景的错误解读。解决方案的关键在于提出Forensim框架,该框架基于注意力机制的状态空间模型,通过归一化注意力图捕捉图像内部相似性以识别复制模式,并结合基于区域的块注意力模块区分篡改区域,从而实现对原始区域(pristine)、源区域(source)和目标区域(target)的三类掩码联合定位,支持统一架构下的拼接(splicing)与复制-粘贴(copy-move)伪造检测,且具备端到端训练能力与高精度定位性能。
链接: https://arxiv.org/abs/2602.10079
作者: Soumyaroop Nandi,Prem Natarajan
机构: USC Information Sciences Institute (南加州大学信息科学研究所); USC Thomas Lord Department of Computer Science (南加州大学托马斯·洛德计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.
[CV-10] Vendi Novelty Scores for Out-of-Distribution Detection
【速读】:该论文旨在解决机器学习系统在实际部署中面临的分布外(Out-of-distribution, OOD)检测难题,即如何有效识别与训练数据分布显著不同的测试样本。现有方法多依赖模型置信度或特征空间中的似然估计,常受限于严格的分布假设。解决方案的关键在于提出一种基于多样性视角的新范式:Vendi Novelty Score (VNS),其核心是利用Vendi Scores(VS)这一组基于相似性的多样性度量指标,通过量化测试样本对训练集特征集合的多样性提升程度来定义新颖性(novelty),从而无需密度建模即可实现鲁棒的OOD检测。VNS具有线性时间复杂度、非参数特性,并能自然融合类别条件(局部)与数据集级(全局)的新颖信号,在多个图像分类基准和网络架构上达到当前最优性能,且仅需1%训练数据即可保持高性能,适用于内存或访问受限场景。
链接: https://arxiv.org/abs/2602.10062
作者: Amey P. Pasarkar,Adji Bousso Dieng
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
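Vendi Score 与 VNS 的核心思想可用如下极简代码示意(相似度核与特征构造均为本文假设,非论文原实现):

```python
import numpy as np

# 最小示意:基于余弦相似度核的 Vendi Score,以及把"加入测试样本后 VS 的增量"作为新颖度分数(VNS)。
def vendi_score(feats):
    X = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    K = X @ X.T                                  # 相似度核(对角线为 1)
    lam = np.linalg.eigvalsh(K / len(X))         # 归一化核的特征值,和为 1
    lam = np.clip(lam, 0, None)
    lam = lam[lam > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))  # 特征值熵的指数

def vendi_novelty(train_feats, x):
    return vendi_score(np.vstack([train_feats, x])) - vendi_score(train_feats)

rng = np.random.default_rng(0)
center = rng.normal(size=16)
train   = center + 0.1 * rng.normal(size=(200, 16))   # 分布内特征聚集在 center 附近(示例)
in_dist = center + 0.1 * rng.normal(size=(1, 16))     # 类似训练分布的样本
ood_dir = rng.normal(size=16)
ood_dir -= (ood_dir @ center) / (center @ center) * center   # 与 center 正交的方向
ood = ood_dir[None, :]                                # 偏离训练分布的样本

print("VNS(in-dist):", vendi_novelty(train, in_dist))
print("VNS(OOD)    :", vendi_novelty(train, ood))     # 明显更大:加入后多样性提升更多
```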
[CV-11] Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving
【速读】:该论文旨在解决现有基于Transformer的语义分割模型在视频场景中因独立处理每一帧而导致 temporal consistency(时间一致性)不足的问题,从而影响动态场景下的分割准确性和稳定性。解决方案的关键在于提出一种时空注意力机制(Spatio-Temporal Attention, STA),通过扩展标准自注意力模块以处理多帧特征序列,在不显著增加计算复杂度的前提下,有效融合时间维度上的上下文信息,实现对视频语义分割任务的鲁棒性增强。
链接: https://arxiv.org/abs/2602.10052
作者: Serin Varghese,Kevin Ross,Fabian Hueger,Kira Maag
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
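STA 的基本思路是把多帧的空间 token 拼成一个时空序列后做一次自注意力,可用如下极简 PyTorch 代码示意(仅演示思想,非论文原结构):

```python
import torch
import torch.nn as nn

# 最小示意:将标准自注意力从单帧扩展到多帧的时空注意力块。
class SpatioTemporalAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C),T 为帧数,N 为每帧的空间 token 数
        B, T, N, C = x.shape
        tokens = x.reshape(B, T * N, C)           # 时空 token 序列
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, T, N, C)

x = torch.randn(2, 3, 64, 128)                    # 2 个样本、3 帧、每帧 64 个 token、128 维
sta = SpatioTemporalAttention(128)
print(sta(x).shape)                               # torch.Size([2, 3, 64, 128])
```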
[CV-12] Conformal Prediction Sets for Instance Segmentation
【速读】:该论文旨在解决当前实例分割(Instance Segmentation)模型在不确定性量化方面的不足问题,即模型输出缺乏校准性,且无法保证预测掩码与真实目标掩码之间的高交并比(IoU)。为应对这一挑战,作者提出了一种基于合规预测(Conformal Prediction)的算法,其关键在于:给定图像和像素坐标查询时,生成一个实例预测的置信集,该集合具有可证明的概率保证——至少有一个预测结果与真实对象掩码的IoU达到较高水平。该方法通过自适应调整置信集大小来反映查询难度,并在农业田块划分、细胞分割和车辆检测等任务中验证了其覆盖概率达到目标水平,显著优于现有基线方法如Learn Then Test、Conformal Risk Control及形态学膨胀法。
链接: https://arxiv.org/abs/2602.10045
作者: Kerri Lu,Dan M. Kluger,Stephen Bates,Sherrie Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:Current instance segmentation models achieve high performance on average predictions, but lack principled uncertainty quantification: their outputs are not calibrated, and there is no guarantee that a predicted mask is close to the ground truth. To address this limitation, we introduce a conformal prediction algorithm to generate adaptive confidence sets for instance segmentation. Given an image and a pixel coordinate query, our algorithm generates a confidence set of instance predictions for that pixel, with a provable guarantee for the probability that at least one of the predictions has high Intersection-Over-Union (IoU) with the true object instance mask. We apply our algorithm to instance segmentation examples in agricultural field delineation, cell segmentation, and vehicle detection. Empirically, we find that our prediction sets vary in size based on query difficulty and attain the target coverage, outperforming existing baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. We provide versions of the algorithm with asymptotic and finite sample guarantees.
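下面给出分裂共形校准(split conformal)的一个最小示意:在校准集上取非一致性得分的分位数作为阈值,再据此构造测试查询的预测集。其中基于 IoU 的得分定义与函数命名均为示意性假设,论文中带有限样本保证的具体构造可能不同。

```python
# A minimal split-conformal sketch for building prediction sets with a
# coverage target; the IoU-based score definition here is an assumption.
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """cal_scores[i] = 1 - confidence of the best prediction that overlaps the
    true mask with IoU >= tau for calibration query i (1.0 if none exists)."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1   # conformal quantile index
    k = min(k, n - 1)
    return float(np.sort(cal_scores)[k])

def prediction_set(confidences, lam):
    """Return indices of instance predictions kept in the confidence set."""
    return [i for i, c in enumerate(confidences) if 1.0 - c <= lam]

# toy usage: 200 calibration scores, one test query with 4 candidate masks
rng = np.random.default_rng(0)
cal = rng.uniform(0.0, 0.6, size=200)
lam = calibrate_threshold(cal, alpha=0.1)
print(prediction_set([0.9, 0.55, 0.3, 0.05], lam))
```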
[CV-13] Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI
【速读】:该论文旨在解决医学影像数据共享中的隐私保护问题,即即使经过颅骨剥离(skull stripping)处理的头部磁共振成像(Head Magnetic Resonance Imaging, MRI)仍可能包含个体特异的脑组织特征,从而在与其他数据库中的数据结合时存在重新识别风险。针对这一问题,论文提出了一种基于标准预处理与图像相似性计算的解决方案,其关键在于无需复杂训练或高计算成本,仅通过常规图像处理流程即可实现跨时间间隔、扫描设备类型、空间分辨率和采集协议的高精度匹配,验证了潜在的再识别风险,为制定更科学的数据共享政策提供实证依据。
链接: https://arxiv.org/abs/2602.10043
作者: Gaurang Sharma,Harri Polonen,Juha Pajula,Jutta Suksi,Jussi Tohka
机构: VTT Technical Research Centre of Finland Ltd (芬兰技术研究中心有限公司); A.I. Virtanen Institute for Molecular Sciences, University of Eastern Finland (东芬兰大学分子科学A.I.维尔塔宁研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. But, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual’s skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
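按摘要的思路,匹配流程可以概括为"标准预处理 + 图像相似度 + 最近邻链接"。下面是一个最小示意,假设各扫描已完成颅骨剥离、配准到公共模板并展平为向量;归一化互相关(NCC)只是示例性的相似度度量,未必是论文实际采用的指标。

```python
# A minimal sketch of linking samples across two databases by image similarity,
# assuming scans are already skull-stripped, registered, and flattened.
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation between two flattened volumes."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b))

def link_databases(db_a: np.ndarray, db_b: np.ndarray) -> np.ndarray:
    """For each scan in db_a, return the index of the most similar scan in db_b."""
    sims = np.array([[ncc(a, b) for b in db_b] for a in db_a])
    return sims.argmax(axis=1)

# toy usage: 5 "subjects", db_b is a noisy re-scan of db_a
rng = np.random.default_rng(0)
db_a = rng.normal(size=(5, 1000))
db_b = db_a + 0.3 * rng.normal(size=(5, 1000))
print(link_databases(db_a, db_b))  # ideally [0 1 2 3 4]
```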
[CV-14] Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection ICASSP2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 伪造图像检测中因过度使用链式思维(Chain-of-Thought, CoT)推理导致的资源开销过高问题,尤其是在面对明显伪造图像时,冗长推理造成token消耗和延迟增加。解决方案的关键在于提出Fake-HR1模型,其首次实现了基于生成检测任务特征自适应判断是否需要推理的机制;具体通过两阶段训练框架实现:首先进行混合微调(Hybrid Fine-Tuning, HFT)完成冷启动初始化,随后采用带混合推理分组策略优化的在线强化学习(Hybrid-Reasoning Grouped Policy Optimization, HGRPO),隐式学习在不同场景下选择合适推理模式的能力,从而在保持高检测性能的同时显著提升响应效率。
链接: https://arxiv.org/abs/2602.10042
作者: Changjiang Jiang,Xinkuan Sha,Fengchang Yu,Jingjing Liu,Jian Liu,Mingqi Fang,Chenfeng Zhang,Wei Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2026
Abstract:Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model’s ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
[CV-15] Perception with Guarantees: Certified Pose Estimation via Reachability Analysis
【速读】:该论文旨在解决安全关键型网络物理系统(Cyber-Physical Systems, CPS)中代理(Agent)的3D位姿估计(Pose Estimation)安全性保障问题,即如何在最坏情况下仍能形式化地保证位姿估计的准确性,尤其当依赖外部服务(如GPS)不可信或传感器融合方案不充分时。解决方案的关键在于:仅使用单张相机图像与已知目标几何信息,通过结合可达性分析(Reachability Analysis)和神经网络形式验证(Formal Neural Network Verification)技术,对位姿估计结果进行形式化边界约束(Formally Bounded Pose Estimation),从而实现无需依赖外部可信源的安全位姿定位。
链接: https://arxiv.org/abs/2602.10032
作者: Tobias Ladner,Yasser Shoukry,Matthias Althoff
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.
[CV-16] Faster-GS: Analyzing and Improving Gaussian Splatting Optimization FAST
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)优化过程中存在的效率低下与方法碎片化问题,即现有研究常将实现级改进与算法层面的修改混杂,或以牺牲重建保真度为代价换取性能提升,导致难以进行公平比较。其解决方案的关键在于系统性地整合并评估先前研究中最有效且普适性强的优化策略,并引入若干新颖优化技术,同时深入探究数值稳定性、高斯截断和梯度近似等被忽视的框架因素。最终提出的Faster-GS系统在保持视觉质量的前提下,实现了最高达5倍的训练加速,为3DGS优化建立了新的高效基准,并可扩展至4D非刚性场景重建,展现出良好的通用性和资源效率。
链接: https://arxiv.org/abs/2602.09999
作者: Florian Hahlbohm,Linus Franke,Martin Eisemann,Marcus Magnor
机构: Computer Graphics Lab, TU Braunschweig, Germany(德国布伦瑞克工业大学计算机图形学实验室); Inria, Université Côte d’Azur, France(法国蔚蓝海岸大学 INRIA 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5× faster training while maintaining visual quality, establishing a new cost-effective and resource efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.
[CV-17] Efficient Special Stain Classification
【速读】:该论文旨在解决数字病理学中特殊染色(special stains)自动分类的问题,以确保临床档案的质量控制和计算病理学数据集的完整性。其关键解决方案是提出了一种轻量级基于缩略图(thumbnail-based)的分类方法,并与多实例学习(Multi-Instance Learning, MIL)管道进行对比。实验表明,尽管MIL在内部测试数据上性能更优(宏F1达0.941),但缩略图方法在外部独立数据集(TCGA)上泛化能力更强(加权F1为0.843),且处理速度提升两个数量级(5.635 vs. 0.018 slides/s),从而为数字病理流程中的常规视觉质量控制提供了可扩展、鲁棒的自动化方案。
链接: https://arxiv.org/abs/2602.09989
作者: Oskar Thaeter,Christian Grashei,Anette Haas,Elisa Schmoeckel,Han Li,Peter J. Schüffler
机构: Technical University of Munich(慕尼黑工业大学); Munich Center for Machine Learning(慕尼黑机器学习中心); Munich Data Science Institute(慕尼黑数据科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures, 2 tables
Abstract:Stains are essential in histopathology to visualize specific tissue characteristics, with Haematoxylin and Eosin (HE) serving as the clinical standard. However, pathologists frequently utilize a variety of special stains for the diagnosis of specific morphologies. Maintaining accurate metadata for these slides is critical for quality control in clinical archives and for the integrity of computational pathology datasets. In this work, we compare two approaches for automated classification of stains using whole slide images, covering the 14 most commonly used special stains in our institute alongside standard and frozen-section HE. We evaluate a Multi-Instance Learning (MIL) pipeline and a proposed lightweight thumbnail-based approach. On internal test data, MIL achieved the highest performance (macro F1: 0.941 for 16 classes; 0.969 for 14 merged classes), while the thumbnail approach remained competitive (0.897 and 0.953, respectively). On external TCGA data, the thumbnail model generalized best (weighted F1: 0.843 vs. 0.807 for MIL). The thumbnail approach also increased throughput by two orders of magnitude (5.635 vs. 0.018 slides/s for MIL with all patches). We conclude that thumbnail-based classification provides a scalable and robust solution for routine visual quality control in digital pathology workflows.
[CV-18] Online Monitoring Framework for Automotive Time Series Data using JEPA Embeddings
【速读】:该论文旨在解决自动驾驶系统中未知异常(unknown anomalies)的检测问题,尤其是在缺乏异常标签的情况下如何实现有效的在线监控。其核心挑战在于构建一个无需依赖异常标注数据即可识别对象状态表示异常的框架。解决方案的关键在于采用基于JEPA(Joint-Embedding Predictive Architecture)的自监督嵌入方法,通过设计自监督预测任务将对象数据映射到高维潜在表示空间,从而生成具有丰富语义信息的对象嵌入;这些嵌入作为输入供传统异常检测算法使用,实现了对未知异常的有效识别,特别适用于现实场景中可能发生的未见过的异常情况。
链接: https://arxiv.org/abs/2602.09985
作者: Alexander Fertig,Karthikeyan Chandra Sekaran,Lakshman Balasubramanian,Michael Botsch
机构: Technische Hochschule Ingolstadt (英戈尔施塔特应用技术大学); AImotion Bavaria (巴伐利亚人工智能运动); Research Center CARISSMA (汽车与交通研究中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2026 IEEE Intelligent Vehicles Symposium. Copyright 2026 IEEE. Permission from IEEE must be obtained for use in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:As autonomous vehicles are rolled out, measures must be taken to ensure their safe operation. In order to supervise a system that is already in operation, monitoring frameworks are frequently employed. These run continuously online in the background, supervising the system status and recording anomalies. This work proposes an online monitoring framework to detect anomalies in object state representations. Thereby, a key challenge is creating a framework for anomaly detection without anomaly labels, which are usually unavailable for unknown anomalies. To address this issue, this work applies a self-supervised embedding method to translate object data into a latent representation space. For this, a JEPA-based self-supervised prediction task is constructed, allowing training without anomaly labels and the creation of rich object embeddings. The resulting expressive JEPA embeddings serve as input for established anomaly detection methods, in order to identify anomalies within object state representations. This framework is particularly useful for applications in real-world environments, where new or unknown anomalies may occur during operation for which there are no labels available. Experiments performed on the publicly available, real-world nuScenes dataset illustrate the framework’s capabilities.
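下面的最小示意展示该监控框架的使用方式:用(假设已训练好的)JEPA 风格编码器得到对象嵌入,再交给现成的异常检测器评分。其中 jepa_encoder 为占位实现,IsolationForest 仅作为摘要所说"established anomaly detection methods"的一个例子。

```python
# A minimal sketch: embed object-state sequences with a (pre-trained, frozen)
# JEPA-style encoder, then score them with an off-the-shelf anomaly detector.
import numpy as np
from sklearn.ensemble import IsolationForest

def embed(jepa_encoder, object_states: np.ndarray) -> np.ndarray:
    """object_states: (n_objects, seq_len, state_dim) -> (n_objects, emb_dim)."""
    return np.stack([jepa_encoder(s) for s in object_states])

# stand-in encoder: mean-pools the state sequence (the real encoder is learned)
jepa_encoder = lambda seq: seq.mean(axis=0)

rng = np.random.default_rng(0)
normal_states = rng.normal(size=(200, 20, 8))                 # nominal training data
online_states = np.concatenate([rng.normal(size=(10, 20, 8)),
                                rng.normal(loc=4.0, size=(2, 20, 8))])

detector = IsolationForest(random_state=0).fit(embed(jepa_encoder, normal_states))
scores = detector.decision_function(embed(jepa_encoder, online_states))
print((scores < 0).nonzero()[0])  # indices flagged as anomalous
```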
[CV-19] Coupled Inference in Diffusion Models for Semantic Decomposition
【速读】:该论文旨在解决视觉场景中语义成分的分解问题,即如何从绑定表示(bound representation)中有效识别和重构潜在因子(latent factors)。其核心挑战在于如何在不依赖显式标注的情况下,实现对复杂组合结构的准确解耦与重建。解决方案的关键在于将语义分解建模为一个逆问题,并引入一种基于重构驱动的引导项(reconstruction-driven guidance term),通过耦合扩散过程来约束各因子估计值的组合结果与原始绑定向量一致;同时提出了一种新颖的迭代采样策略以提升模型性能。该框架不仅统一了注意力机制下的共振网络(resonator networks)作为特例,还在合成任务上显著优于传统共振网络方法。
链接: https://arxiv.org/abs/2602.09983
作者: Calvin Yeung,Ali Zakeri,Zhuowen Zou,Mohsen Imani
机构: University of California, Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages
Abstract:Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
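摘要中的"重构驱动引导项"可以用下述最小示意表达:在每个反向扩散步,对所有因子估计反传一个重构损失的梯度,使其组合逼近绑定向量。这里以 Hadamard 积代表绑定算子、以恒等函数代表去噪器,步长与命名均为示意性假设,并非论文的实际采样方案。

```python
# A minimal sketch of reconstruction-driven coupling across diffusion processes:
# each factor estimate is nudged by the gradient of a loss asking the composed
# factors to match the bound vector.
import torch

def coupled_guidance_step(factors, bound, denoisers, t, guidance_scale=1.0):
    """factors: list of current factor estimates; denoisers: one per factor,
    mapping (x_t, t) -> denoised estimate."""
    factors = [f.detach().requires_grad_(True) for f in factors]
    composed = factors[0]
    for f in factors[1:]:
        composed = composed * f              # binding = Hadamard product (assumed)
    recon_loss = ((composed - bound) ** 2).sum()
    grads = torch.autograd.grad(recon_loss, factors)
    new_factors = []
    for f, g, denoise in zip(factors, grads, denoisers):
        x0_hat = denoise(f, t)               # each factor's own diffusion update
        new_factors.append((x0_hat - guidance_scale * g).detach())
    return new_factors

# toy usage: two 16-d factors, identity "denoisers"
f1, f2 = torch.randn(16), torch.randn(16)
bound = f1 * f2
denoisers = [lambda x, t: x, lambda x, t: x]
print(len(coupled_guidance_step([f1 + 0.1, f2], bound, denoisers, t=10)))
```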
[CV-20] Learning to Detect Baked Goods with Limited Supervision
【速读】:该论文旨在解决德国烘焙行业中剩余产品监测自动化的问题,核心挑战在于德国烘焙食品种类繁多、标注数据稀缺,导致传统全监督训练成本高昂且难以扩展。解决方案的关键在于提出两种弱监督与伪标签增强的训练流程:首先,结合OWLv2和Grounding DINO的定位能力与图像级标签,实现弱监督训练;其次,利用Segment Anything 2生成伪标签并基于视频帧微调,提升模型在不同视角下的鲁棒性。最终采用YOLOv11架构,在仅依赖图像级监督的情况下,实现了mAP达0.91的检测性能,并在非理想部署条件下较基线模型表现更优。
链接: https://arxiv.org/abs/2602.09979
作者: Thomas H. Schmitt,Maximilian Bundscherer,Tobias Bocklet
机构: Technische Hochschule Nürnberg Georg Simon Ohm(纽伦堡应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Finetuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully-supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
[CV-21] Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework
【速读】:该论文旨在解决尿路上皮膀胱癌术后随访中因膀胱形态可变且缺乏稳定解剖标志而导致的内镜导航困难问题,特别是如何在存在气泡、光照不均、黏膜褶皱等干扰因素下实现血管结构的精准分割。解决方案的关键在于提出一种混合注意力-卷积(Hybrid Attention-Convolution, HAC)架构:一方面利用Transformer模块建模全局血管拓扑先验以增强结构连通性,训练时剔除短分支和末端分支以聚焦主干结构;另一方面通过CNN学习残差细化映射以恢复细小血管细节。此外,为缓解标注数据稀缺问题,引入基于物理规律的自监督预训练策略,在未标注数据上使用临床合理增强方法提升模型泛化能力。该方法在BlaVeS数据集上实现了高精度(0.94)、优异的精确率(0.61)与clDice分数(0.66),并有效抑制了动态黏膜褶皱引发的假阳性,从而为膀胱内镜导航提供了可靠的结构稳定性支撑。
链接: https://arxiv.org/abs/2602.09949
作者: Franziska Krauß,Matthias Ege,Zoltan Lovasz,Albrecht Bartz-Schmidt,Igor Tsaur,Oliver Sawodny,Carina Veil
机构: University of Stuttgart (斯图加特大学); University Hospital Tübingen (图宾根大学医院); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific “vascular fingerprint” for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines Transformers, which capture a global vessel topology prior, with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ physics-aware pretraining, a self-supervised strategy using clinically grounded augmentations on unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the reliable structural stability required for clinical navigation.
[CV-22] VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉编码器在密集预测任务(如语义分割、深度估计)中表现不佳的问题,即其视觉特征表示缺乏对像素级细节的充分建模能力。解决方案的关键在于提出VersaViT,一种基于视觉Transformer架构的通用视觉骨干网络,并设计了一种新颖的多任务协同后训练框架:通过轻量级任务头与多粒度监督信号联合优化视觉主干,从而提升其在语言引导推理和像素级理解任务中的泛化性能。
链接: https://arxiv.org/abs/2602.09934
作者: Yikun Liu,Yuan Liu,Shangzhe Di,Haicheng Wang,Zhongyin Zhao,Le Tian,Xiao Zhou,Jie Zhou,Jiangchao Yao,Yanfeng Wang,Weidi Xie
机构: School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院); CMIC, Shanghai Jiao Tong University (上海交通大学CMIC); WeChat AI, Tencent Inc. (腾讯微信AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
[CV-23] Unbalanced optimal transport for robust longitudinal lesion evolution with registration-aware and appearance-guided priors
【速读】:该论文旨在解决纵向CT扫描中病灶演化评估的难题,特别是如何在不同时间点之间建立可靠且鲁棒的病灶对应关系(lesion correspondence),尤其是在病灶出现、消失、融合或分裂等复杂情况下,传统基于几何邻近性的双部匹配方法难以有效处理。其解决方案的关键在于提出一种注册感知的匹配器(registration-aware matcher),该方法基于非平衡最优传输(Unbalanced Optimal Transport, UOT),能够适应病灶数量不一致的情况,并通过患者层面的肿瘤负荷变化动态调整先验信息。该匹配器的代价函数融合了三方面因素:(i) 归一化尺寸的几何信息、(ii) 来自形变场雅可比行列式的局部注册置信度、(iii) 可选的局部图像块外观一致性;最终通过相对剪枝策略生成稀疏传输计划,实现一对一匹配以及对新出现、消失、合并和分裂病灶的自动识别,无需重新训练或依赖启发式规则。
链接: https://arxiv.org/abs/2602.09933
作者: Melika Qahqaie,Dominik Neumann,Tobias Heimann,Andreas Maier,Veronika A. Zimmer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication. Accepted at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract:Evaluating lesion evolution in longitudinal CT scans of cancer patients is essential for assessing treatment response, yet establishing reliable lesion correspondence across time remains challenging. Standard bipartite matchers, which rely on geometric proximity, struggle when lesions appear, disappear, merge, or split. We propose a registration-aware matcher based on unbalanced optimal transport (UOT) that accommodates unequal lesion mass and adapts priors to patient-level tumor-load changes. Our transport cost blends (i) size-normalized geometry, (ii) local registration trust from the deformation-field Jacobian, and (iii) optional patch-level appearance consistency. The resulting transport plan is sparsified by relative pruning, yielding one-to-one matches as well as new, disappearing, merging, and splitting lesions without retraining or heuristic rules. On longitudinal CT data, our approach achieves consistently higher edge-detection precision and recall, improved lesion-state recall, and superior lesion-graph component F1 scores versus distance-only baselines.
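下面用 NumPy 写一个非平衡 Sinkhorn 加相对剪枝的最小示意:病灶"质量"允许不守恒,剪枝后得到一一匹配以及无匹配(消失/新增)的病灶。代价矩阵这里只用质心距离,论文中融合的配准置信度与外观项在此省略;所有超参数与剪枝规则均为示意取值。

```python
# A minimal sketch of matching baseline/follow-up lesions with an
# unbalanced-Sinkhorn transport plan and relative pruning.
import numpy as np

def unbalanced_sinkhorn(a, b, M, reg=0.1, reg_m=1.0, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (lesion masses a and b
    need not sum to the same total)."""
    K = np.exp(-M / reg)
    fi = reg_m / (reg_m + reg)
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iter):
        u = (a / (K @ v + 1e-16)) ** fi
        v = (b / (K.T @ u + 1e-16)) ** fi
    return u[:, None] * K * v[None, :]

def prune_matches(P, rel_thresh=0.3):
    """Keep (i, j) pairs whose mass is large relative to the plan's maximum."""
    keep = P >= rel_thresh * P.max()
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(keep))]

# toy usage: 3 baseline lesions, 2 follow-up lesions (one lesion disappeared)
base = np.array([[0., 0.], [5., 5.], [9., 1.]])
follow = np.array([[0.2, 0.1], [5.1, 4.8]])
M = np.linalg.norm(base[:, None, :] - follow[None, :, :], axis=-1)
P = unbalanced_sinkhorn(np.ones(3), np.ones(2), M)
print(prune_matches(P))  # the lesion at (9, 1) should be left unmatched (disappeared)
```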
[CV-24] GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery
【速读】:该论文旨在解决城市三维数据(建筑高度BH和建筑轮廓BF)在气候建模、灾害风险评估与城市规划中因依赖专有传感器或跨城市泛化能力差而导致的稀缺性问题。其解决方案的关键在于提出GeoFormer——一个基于Swin Transformer的开源框架,仅利用Sentinel-1/2遥感影像与公开数字高程模型(DEM)数据,在100米网格上联合估计BH与BF;通过地理区块分割策略确保训练与测试集的空间独立性,从而实现高精度(BH RMSE为3.19 m,BF RMSE为0.05)且具备跨大陆迁移能力(BH RMSE<3.5 m)。实验证明,DEM对高度估计不可或缺,光学反射率主导信息提取,但多源融合可获得最优整体性能。
链接: https://arxiv.org/abs/2602.09932
作者: Han Jinzhen,JinByeong Lee,JiSung Kim,MinKyung Cho,DaHee Kim,HongSik Yun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving 7.5% and 15.3% over the strongest CNN baseline, while maintaining under 3.5 m BH RMSE in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.
[CV-25] Monocular Normal Estimation via Shading Sequence Estimation ICLR2026
【速读】:该论文旨在解决单目法向估计(monocular normal estimation)中普遍存在的3D几何错位问题,即模型生成的法向图在视觉上看似合理,但重建表面与真实几何细节不匹配。其根本原因在于现有方法依赖深度学习直接预测法向图,而法向图中的几何差异仅通过细微的颜色变化体现,导致模型难以准确区分和重建不同几何结构。解决方案的关键在于提出一种新范式——将法向估计重构为阴影序列估计(shading sequence estimation),因为阴影序列对几何信息更敏感;在此基础上,作者设计了RoSE方法,利用图像到视频的生成式AI(Generative AI)模型预测阴影序列,并通过简单的最小二乘法将其转换为法向图,从而实现高精度且几何一致的法向估计。
链接: https://arxiv.org/abs/2602.09929
作者: Zongrui Li,Xinhua Ma,Minghui Hu,Yunqing Zhao,Yingchen Yu,Qian Zheng,Chang Liu,Xudong Jiang,Song Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026 (Oral Presentation)
Abstract:Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
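摘要提到预测的阴影序列通过简单最小二乘转换为法向图。以下最小示意假设 Lambertian 模型 s_k = n·l_k 且合成序列的光照方向已知,逐像素求解线性最小二乘后归一化得到法向;这只是该转换步骤的概念演示,并非 RoSE 的官方实现。

```python
# A minimal sketch of turning a per-pixel shading sequence into a normal map
# by ordinary least squares under a Lambertian model with known lights.
import numpy as np

def normals_from_shading(shadings: np.ndarray, lights: np.ndarray) -> np.ndarray:
    """shadings: (K, H, W) shading sequence; lights: (K, 3) unit light directions.
    Returns (H, W, 3) unit normals via per-pixel least squares."""
    K, H, W = shadings.shape
    S = shadings.reshape(K, -1)                      # (K, H*W)
    N, *_ = np.linalg.lstsq(lights, S, rcond=None)   # solve lights @ n = s
    N = N.T.reshape(H, W, 3)
    return N / (np.linalg.norm(N, axis=-1, keepdims=True) + 1e-8)

# toy usage: synthesize shadings from a random normal field, then re-estimate it
rng = np.random.default_rng(0)
lights = rng.normal(size=(6, 3)); lights /= np.linalg.norm(lights, axis=1, keepdims=True)
true_n = rng.normal(size=(8, 8, 3)); true_n /= np.linalg.norm(true_n, axis=-1, keepdims=True)
shadings = np.einsum("kc,hwc->khw", lights, true_n)  # no clipping: toy case
est = normals_from_shading(shadings, lights)
print(np.abs((est * true_n).sum(-1)).mean())         # ~1.0 if recovered
```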
[CV-26] A benchmark for video-based laparoscopic skill analysis and assessment
【速读】:该论文旨在解决腹腔镜手术训练中深度学习模型开发与评估受限于标注数据集规模不足的问题。其解决方案的关键在于构建并公开了LASANA数据集,该数据集包含1270段立体视频记录,涵盖四种基础腹腔镜训练任务,每段视频均配有由三位独立评审员评分的结构化技能评级及任务特异性错误的二分类标签,且多数数据来源于真实培训课程,具有自然的技能多样性。此外,研究提供了针对每个任务的预定义数据划分,以支持对现有和新型视频驱动技能评估与错误识别方法的基准测试。
链接: https://arxiv.org/abs/2602.09927
作者: Isabel Funke,Sebastian Bodenstedt,Felix von Bechtolsheim,Florian Oehme,Michael Maruschke,Stefanie Herrlich,Jürgen Weitz,Marius Distler,Sören Torge Mees,Stefanie Speidel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review
Abstract:Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.
[CV-27] SARS: A Novel Face and Body Shape and Appearance Aware 3D Reconstruction System extends Morphable Models
【速读】:该论文旨在解决传统三维可变形模型(3D Morphable Models, 3DMMs)在人体三维重建中忽略高阶面部语义特征(如年龄、性别、面部关键点等)的问题,这些特征对于准确刻画人脸结构和外观细节至关重要。现有方法主要关注全局面部几何信息,难以捕捉面部边界、曲线、凹陷与皱纹等局部形态变化。为此,作者提出了一种形状与外观感知的三维重建系统(Shape and Appearance-aware Reconstruction System, SARS),其核心创新在于构建一个模块化流程,能够从单张图像中联合提取面部与身体信息,并通过参数化控制实现对形状和外观特征的精细化建模,从而生成更真实、语义一致的人体三维模型。
链接: https://arxiv.org/abs/2602.09918
作者: Gulraiz Khan,Kenneth Y. Wertheim,Kevin Pimbblet,Waqas Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:3D Morphable Models (3DMMs) take 2D images as inputs and recreate the structure and physical appearance of 3D objects, especially human faces and bodies. A 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability of 3D Morphable Models can be controlled by tuning diverse parameters, which are high-level image descriptors such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring semantic facial features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. In order to accommodate changes in these high-level facial characteristics, this work introduces a shape and appearance-aware 3D reconstruction system (SARS), a modular pipeline that extracts body and face information from a single image to properly rebuild a 3D model of the full human body.
[CV-28] AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization
【速读】:该论文旨在解决扩散Transformer(Diffusion Transformers, DiTs)在边缘设备部署时面临的高计算成本与内存占用问题,尤其针对现有后训练量化(Post-Training Quantization, PTQ)方法因忽视扩散过程中的独特时间动态性而导致性能下降的问题。解决方案的关键在于提出AdaTSQ框架,其核心创新包括:一是设计了一种基于帕累托感知的 timestep-dynamic bit-width allocation 策略,将量化策略搜索建模为带约束的路径查找问题,并采用由端到端重建误差引导的束搜索算法,实现不同时间步长下层级比特宽度的动态分配;二是提出一种基于Fisher信息引导的时间校准机制,利用时间维度上的Fisher信息优先选择敏感时间步的校准数据,与基于Hessian的权重优化无缝集成,从而显著提升量化模型在效率与生成质量之间的权衡表现。
链接: https://arxiv.org/abs/2602.09883
作者: Shaoqiu Zhang,Zizhong Ding,Kaicheng Yang,Junyi Wu,Xianglong Yan,Xi Li,Bingnan Duan,Jianping Fang,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be released at this https URL
Abstract:Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at this https URL.
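摘要将逐时间步的比特宽度分配建模为带约束的路径搜索,并用重构误差引导的束搜索求解。下面的最小示意在平均比特预算约束下,用一个假设的重构误差近似函数引导束搜索;误差函数、候选比特宽度与剪枝规则均为示意性假设,并非 AdaTSQ 的实际实现。

```python
# A minimal sketch of constrained beam search over per-timestep bit-widths,
# guided by an (assumed) reconstruction-error oracle `recon_error(policy)`.
def beam_search_bitwidths(num_steps, choices=(4, 6, 8), avg_budget=6.0,
                          beam_size=4, recon_error=None):
    """Return the lowest-error bit-width sequence whose mean bits <= budget."""
    beams = [((), 0.0)]                              # (partial policy, error)
    for t in range(num_steps):
        candidates = []
        for policy, _ in beams:
            for b in choices:
                new = policy + (b,)
                # prune paths that can no longer satisfy the average-bit budget
                best_possible = (sum(new) + min(choices) * (num_steps - t - 1)) / num_steps
                if best_possible > avg_budget:
                    continue
                candidates.append((new, recon_error(new)))
        beams = sorted(candidates, key=lambda x: x[1])[:beam_size]
    return min(beams, key=lambda x: x[1])[0]

# toy oracle: low bits hurt more on "sensitive" early timesteps
sensitivity = [1.0, 0.8, 0.3, 0.2]
toy_error = lambda p: sum(s * (8 - b) for s, b in zip(sensitivity, p))
print(beam_search_bitwidths(4, recon_error=toy_error))  # more bits go to early steps
```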
[CV-29] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
【速读】:该论文旨在解决当前基于世界模型的“想象后执行”(imagine-then-act)范式在机器人操作中对完整4D场景动态预测能力不足的问题,尤其是现有方法通常仅支持纯图像预测或部分3D几何推理,难以实现几何一致且任意视角下的RGBD生成。其解决方案的关键在于提出一种新型具身4D世界模型(embodied 4D world model),通过单视角RGBD输入生成其他视角的RGBD数据,并利用多视图、跨模态特征融合机制确保RGB与深度信息的一致性及不同视角间的几何对齐。此外,为将生成的未来状态转化为可执行动作,作者设计了一种测试时动作优化策略,通过反向传播生成模型推断轨迹级潜在变量,并结合残差逆动力学模型将此轨迹先验转化为精确的动作指令,从而有效缓解逆动力学问题的病态性(ill-posedness)。
链接: https://arxiv.org/abs/2602.09878
作者: Jiaxu Wang,Yicheng Jiang,Tianlun He,Jingkai Sun,Qiang Zhang,Junhao He,Jiahang Cao,Zesen Gan,Mingyuan Sun,Qiming Shao,Xiangyu Yue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
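测试时动作优化的思路可以用下述最小示意表达:冻结生成式世界模型,对轨迹级潜变量做梯度下降,使其生成结果匹配预测的未来,再经逆动力学模型解码动作。其中 world_model、inverse_dyn 均为占位的线性替身,并非论文的网络结构或残差设计。

```python
# A minimal sketch of test-time action inference: optimize a trajectory-level
# latent by backpropagating through a frozen generative world model, then
# decode actions with an inverse-dynamics model.
import torch

def infer_actions(world_model, inverse_dyn, obs, predicted_future,
                  latent_dim=32, steps=100, lr=0.05):
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                       # gradient descent through the generator
        opt.zero_grad()
        loss = ((world_model(obs, z) - predicted_future) ** 2).mean()
        loss.backward()
        opt.step()
    return inverse_dyn(obs, z.detach())          # trajectory prior -> executable actions

# toy usage with linear stand-ins for the trained networks
world_model = lambda o, z: z[:8] + o[:8]
inverse_dyn = lambda o, z: torch.tanh(z[:7])     # e.g. a 7-DoF action
obs = torch.randn(16)
future = torch.randn(8)
print(infer_actions(world_model, inverse_dyn, obs, future))
```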
[CV-30] Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence
【速读】:该论文旨在解决当前生成式视频压缩(Generative Video Compression, GVC)方法在超低码率下因未能充分挖掘时序相关性而导致的显著闪烁和时序一致性下降的问题。其解决方案的关键在于提出了一种无需训练的生成式视频压缩框架Free-GVC,该框架将视频编码重构为由视频扩散先验引导的潜在轨迹压缩过程:首先在GOP(Group-of-Pictures)级别将视频片段编码至紧凑潜在空间,并沿扩散轨迹逐步压缩;同时引入自适应质量控制模块以动态构建在线率-感知代理模型,预测每个GOP的最佳扩散步数;并设计跨GOP对齐模块通过帧重叠与潜在融合机制缓解闪烁、提升时序一致性。
链接: https://arxiv.org/abs/2602.09868
作者: Xiaoyue Ling,Chuqin Zhou,Chunyi Li,Yunuo Chen,Yuan Tian,Guo Lu,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
[CV-31] Reason -IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection
【速读】:该论文旨在解决工业异常检测中现有多模态大语言模型(Multimodal Large Language Models, MLLMs)因在通用领域数据上预训练而难以捕捉特定类别异常模式的问题,从而限制了检测精度与可解释性。解决方案的关键在于提出一个知识引导的动态潜在推理框架(Reason-IAD),其核心创新包括:1)检索增强的知识模块,将类别特定的文本描述引入模型输入,实现对领域缺陷的上下文感知推理;2)基于熵驱动的潜在推理机制,通过可优化的潜在思考标记(latent think tokens)在紧凑潜在空间中进行迭代探索,并以熵奖励促进预测的置信度与稳定性;3)动态视觉注入策略,选择最具信息量的图像块注入潜在序列,引导推理过程聚焦于异常检测关键区域。
链接: https://arxiv.org/abs/2602.09850
作者: Peng Chen,Chao Huang,Yunkang Cao,Chengliang Liu,Wenqiang Wang,Mingbo Yang,Li Shen,Wenqi Ren,Xiaochun Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods. The code will be publicly available at this https URL.
[CV-32] Kelix Technique Report
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在统一理解与生成能力上的瓶颈问题,尤其是现有基于离散视觉标记(discrete visual tokens)的自回归模型因码本容量有限导致的信息损失,从而造成其理解性能显著弱于依赖连续特征(continuous features)的视觉-语言模型(Vision-Language Models, VLMs)。解决方案的关键在于提出Kelix——一个完全离散的自回归统一模型,通过优化视觉标记化机制,在保持离散表示优势的同时,显著缩小了离散与连续视觉表征之间的理解性能差距,实现了真正意义上的多模态统一建模。
链接: https://arxiv.org/abs/2602.09843
作者: Boyang Ding,Chenglong Chu,Dunju Zang,Han Li,Jiangxia Cao,Kun Gai,Muhao Wei,Ruiming Tang,Shiyao Wang,Siyang Mao,Xinchen Luo,Yahui Liu,Zhixin Ling,Zhuoran Yang,Ziming Li,Chengru Song,Guorui Zhou,Guowang Zhang,Hao Peng,Hao Wang,Jiaxin Deng,Jin Ouyang,Jinghao Zhang,Lejian Ren,Qianqian Wang,Qigen Hu,Tao Wang,Xingmei Wang,Yiping Yang,Zixing Zhang,Ziqi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress
Abstract:Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
[CV-33] ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge
【速读】:该论文旨在解决现有多模态检索基准在日常图像语义匹配上的局限性,即缺乏对专业领域知识和复杂推理能力的诊断。为填补这一空白,作者提出了ARK基准,其关键在于从两个互补维度评估多模态检索:(i) 知识领域(五个领域,17个子类型),用于刻画检索所依赖的内容与专业知识;(ii) 推理技能(六类),用于刻画识别正确候选者所需的多模态证据推理类型。ARK覆盖16种异构视觉数据类型,并采用针对性的难负样本配对以防止捷径匹配,从而更真实地评估模型在细粒度视觉与空间推理等瓶颈任务上的表现。
链接: https://arxiv.org/abs/2602.09839
作者: Yijie Lin,Guofeng Ding,Haochen Zhou,Haobin Li,Mouxing Yang,Xi Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
[CV-34] SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中的幻觉问题,即模型在生成文本时产生与输入图像不一致或错误的内容,这对实际应用中的安全性和可靠性构成严重威胁。解决方案的关键在于提出一种无需训练的稳定性感知知识增强解码方法(Stability-Aware Knowledge-Enhanced Decoding, SAKED),其核心是引入逐层知识稳定性评分(Knowledge Stability Score, KSS),量化模型各层内部知识的稳定性,并通过对比稳定性和非稳定性层来抑制解码噪声,动态利用最可靠的内部知识进行忠实的token生成。
链接: https://arxiv.org/abs/2602.09825
作者: Zhaoxu Li,Chenqi Kong,Peijun Bao,Song Xia,Yi Tu,Yi Yu,Xinghao Jiang,Xudong Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
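下面给出稳定性感知对比解码思路的最小示意:用相邻层早退分布之间的 KL 散度近似各层知识的波动程度,对比最稳定与最不稳定层的 logits 来修正最终输出。KSS 的具体定义与融合规则以论文为准,此处的打分方式与系数均为示意性假设。

```python
# A minimal sketch of stability-aware contrastive decoding: score each layer's
# early-exit distribution by how little it fluctuates w.r.t. its predecessor,
# then contrast the most stable against the least stable layer.
import torch
import torch.nn.functional as F

def saked_next_token_logits(layer_logits: torch.Tensor, alpha: float = 1.0):
    """layer_logits: (L, vocab) early-exit logits for the next token."""
    probs = F.softmax(layer_logits, dim=-1)
    # per-layer fluctuation: KL divergence of each layer (>= 1) to its predecessor
    kl = (probs[1:] * (probs[1:].clamp_min(1e-9).log()
                       - probs[:-1].clamp_min(1e-9).log())).sum(-1)
    stable = layer_logits[1:][kl.argmin()]     # most stability-aware layer
    unstable = layer_logits[1:][kl.argmax()]   # most stability-agnostic layer
    final = layer_logits[-1]                   # keep the last layer as anchor
    return final + alpha * (stable - unstable)

# toy usage: 6 "layers" over a 10-token vocabulary
logits = torch.randn(6, 10)
print(saked_next_token_logits(logits).argmax().item())
```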
[CV-35] CompSplat: Compression-aware 3D Gaussian Splatting for Real-world Video
【速读】:该论文旨在解决真实世界视频中高质量新视角合成(Novel View Synthesis, NVS)所面临的挑战,特别是由长序列、非规则相机轨迹、未知位姿以及压缩失真导致的位姿漂移、特征错位和几何畸变问题。现有方法在处理长序列或无位姿重建时表现有限,且缺乏对多样化压缩模式下图像不一致性的系统性建模。其解决方案的关键在于提出CompSplat框架,通过显式建模帧级压缩特性来缓解帧间不一致性与累积几何误差;核心创新包括压缩感知的帧加权机制和自适应剪枝策略,从而显著提升在重度压缩条件下的重建鲁棒性和几何一致性,实验证明其在Tanks and Temples、Free、Hike等基准上达到当前最优性能。
链接: https://arxiv.org/abs/2602.09816
作者: Hojun Song,Heejung Choi,Aro Kim,Chae-yeong Song,Gahyeon Kim,Soo Ye Kim,Jaehyup Lee,Sang-hyo Park
机构: Kyungpook National University (庆北国立大学); Adobe Research (Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review
Abstract:High-quality novel view synthesis (NVS) from real-world videos is crucial for applications such as cultural heritage preservation, digital twins, and immersive media. However, real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, leading to pose drift, feature misalignment, and geometric distortion during reconstruction. Moreover, lossy compression amplifies these issues by introducing inconsistencies that gradually degrade geometry and rendering quality. While recent studies have addressed either long-sequence NVS or unposed reconstruction, compression-aware approaches still focus on specific artifacts or limited scenarios, leaving diverse compression patterns in long videos insufficiently explored. In this paper, we propose CompSplat, a compression-aware training framework that explicitly models frame-wise compression characteristics to mitigate inter-frame inconsistency and accumulated geometric errors. CompSplat incorporates compression-aware frame weighting and an adaptive pruning strategy to enhance robustness and geometric consistency, particularly under heavy compression. Extensive experiments on challenging benchmarks, including Tanks and Temples, Free, and Hike, demonstrate that CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing most recent state-of-the-art NVS approaches under severe compression conditions.
[CV-36] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在科学图表(scientific diagram)生成任务中结构正确性不足的问题,即现有文本到图像模型虽能生成视觉上合理的图像,但常忽视图示的结构性信息。解决方案的关键在于提出 SciFlow-Bench——一个以结构为先的基准测试框架,通过从真实科学 PDF 中提取源图并构建对应的规范图结构(canonical ground-truth graph),采用闭环、往返协议对模型输出的像素级图像进行逆向解析,将其还原为结构化表示用于比较,从而强制评估生成结果的结构可恢复性而非仅依赖视觉相似度。该方法依托于分层多智能体系统实现规划、感知与结构推理的协同,首次系统地推动了基于像素输出的科学图表生成评估研究。
链接: https://arxiv.org/abs/2602.09809
作者: Tong Zhang,Honglin Lin,Zhou Liu,Chong Chen,Wentao Zhang
机构: Peking University (北京大学); Shanghai Jiao Tong University (上海交通大学); Huawei Cloud BU (华为云BU)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
[CV-37] Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
【速读】:该论文试图解决生成式 AI(Generative AI)模型在训练数据中地理代表性不足的问题,即这些模型所依赖的图像-文本对数据是否真实反映了全球不同地区的多样性。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)从英文 captions 中提取地理位置信息,并据此对大规模多模态数据集进行地理画像(geographic profiling),从而量化各国在训练数据中的分布情况及其与经济水平(GDP)的相关性。研究发现,发达国家如美国、英国和加拿大占样本的近一半,而南美和非洲国家则严重欠代表;此外,高代表度并不等同于视觉或语义多样性提升,且基于此类数据训练的 Stable Diffusion 模型生成的图像虽具真实性,但覆盖范围远低于真实世界图像。
链接: https://arxiv.org/abs/2602.09775
作者: Abhipsa Basu,Yugam Bahl,Kirti Bhagat,Preethi Seshadri,R. Venkatesh Babu,Danish Pruthi
机构: Indian Institute of Science, Bangalore (印度科学研究所,班加罗尔); TNSQ AI; University of California, Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 20 figures
Abstract:Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across 20 common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries are severely under-represented with only 1.8% and 3.8% of images, respectively. We observe a strong correlation between a country’s GDP and its representation in the data (ρ = 0.82). Examining non-English subsets for 4 languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
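国家代表度与 GDP 的相关性分析可以用几行代码复现其计算流程:统计各国样本数并与 GDP 求 Spearman 秩相关。下面的数值均为虚构的玩具数据,仅演示计算方式,并非论文统计结果(摘要报告的相关系数为 ρ = 0.82)。

```python
# A minimal sketch of the country-level analysis: rank correlation between
# caption-derived sample counts and GDP; all numbers below are toy values.
import numpy as np
from scipy.stats import spearmanr

country_counts = {"US": 48000, "UK": 21000, "CA": 9000, "IN": 4000,
                  "BR": 1500, "NG": 700, "KE": 300}
gdp_usd_bn = {"US": 27000, "UK": 3300, "CA": 2100, "IN": 3700,
              "BR": 2100, "NG": 470, "KE": 110}

countries = sorted(country_counts)
rho, p = spearmanr([country_counts[c] for c in countries],
                   [gdp_usd_bn[c] for c in countries])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```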
[CV-38] Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors
【速读】:该论文旨在解决自动驾驶车辆(Connected and Autonomous Vehicles, CAVs)中视觉系统(Vision System)的鲁棒性问题,这是实现L5级自动驾驶的关键挑战。其解决方案的核心在于构建一个参考架构(Reference Architecture)来明确CAV视觉系统(CAVVS)的关键传感器与视觉组件,并据此识别潜在的攻击面(Attack Surface)。在此基础上,进一步分析针对每个攻击面的具体攻击向量(Attack Vector),并从机密性(Confidentiality)、完整性(Integrity)和可用性(Availability, CIA)三个维度评估其影响,从而为制定能够保障CIA三原则的系统性安全措施提供理论依据和实践指导。
链接: https://arxiv.org/abs/2602.09740
作者: Sandeep Gupta,Roberto Passerone
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Intelligent Vehicles
Abstract:This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.
[CV-39] oward Fine-Grained Facial Control in 3D Talking Head Generation
【速读】:该论文旨在解决音频驱动的数字人面部生成中精细面部运动控制不足的问题,尤其是唇同步不准确和面部抖动(facial jitter)导致的恐怖谷效应(uncanny valley effect)。其解决方案的关键在于提出了一种细粒度的3D高斯点绘(Fine-Grained 3D Gaussian Splatting, FG-3DGS)框架,通过频域感知的解耦策略对不同运动特征的面部区域进行差异化建模:低频区域(如脸颊、鼻梁和额头)由标准多层感知机(MLP)联合建模,高频区域(如眼睛和嘴巴)则借助面部区域掩码引导的专用网络独立捕捉;同时,基于静态高斯点的运动动态以高斯偏移量(Gaussian deltas)形式施加,并通过帧级相机参数进行光栅化渲染;此外,引入一个由大规模音视频对预训练得到的高频精修后处理对齐机制,进一步提升逐帧生成质量与唇同步精度。
链接: https://arxiv.org/abs/2602.09736
作者: Shaoyang Xie,Xiaofeng Cong,Baosheng Yu,Zhipeng Gui,Jie Gui,Yuan Yan Tang,James Tin-Yau Kwok
机构: Southeast University (东南大学); Nanyang Technological University (南洋理工大学); Wuhan University (武汉大学); Purple Mountain Laboratories (紫金山实验室); Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education (教育部区块链应用工程研究中心(东南大学)); University of Macau (澳门大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
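The core rendering step described above, adding predicted per-frame offsets ("Gaussian deltas") to a static set of 3D Gaussians before rasterization, can be sketched as follows. The dictionary keys, tensor shapes, and the simple additive update are illustrative assumptions, not the authors' actual parameterization.

```python
import torch

def apply_gaussian_deltas(static_params, deltas):
    """Add predicted per-frame offsets to static 3D Gaussian attributes.

    static_params / deltas: dicts with matching tensors, e.g. 'xyz' (N, 3),
    'rotation' (N, 4), 'scale' (N, 3), 'opacity' (N, 1). Attribute names and
    the additive update are illustrative only; the resulting per-frame
    Gaussians would then be rasterized with frame-specific camera parameters.
    """
    return {k: static_params[k] + deltas.get(k, torch.zeros_like(static_params[k]))
            for k in static_params}

# toy example: 5 Gaussians, only positions move in this frame
static = {"xyz": torch.zeros(5, 3), "opacity": torch.ones(5, 1)}
delta  = {"xyz": 0.01 * torch.randn(5, 3)}
frame_params = apply_gaussian_deltas(static, delta)
print(frame_params["xyz"].shape)  # torch.Size([5, 3])
```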
[CV-40] Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings
【速读】:该论文旨在解决数字绘画中裂纹(craquelure)的自动检测问题,该问题对艺术品的劣化评估与修复指导至关重要,但因复杂背景及裂纹与笔触等艺术特征在视觉上的相似性而极具挑战。解决方案的关键在于将裂纹检测建模为一个逆问题,通过将观测图像分解为无裂纹画作和裂纹成分两部分实现;其中,利用深度生成模型作为画作先验以捕捉底层艺术内容,同时采用Mumford–Shah型变分泛函结合裂纹先验来刻画裂纹结构,最终通过联合优化获得像素级的裂纹定位图。
链接: https://arxiv.org/abs/2602.09730
作者: Laura Paul,Holger Rauhut,Martin Burger,Samira Kabri,Tim Roith
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY (德国电子同步加速器研究所在赫尔姆霍兹成像中心); University of Hamburg (汉堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:
Abstract:Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as powerful prior for the underlying artwork, while crack structures are captured using a Mumford–Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.
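To make the decomposition idea concrete, one plausible way to write the joint objective is the energy below, where u is the observed image, G_θ(z) the deep generative painting prior, c the crack component, MS(·) a Mumford–Shah-type regularizer, and R_crack(·) a crack prior. The specific terms and weights are an assumption for illustration, not the authors' exact functional.

```latex
% A hypothetical form of the joint decomposition energy (assumed, not the paper's):
\min_{z,\,c}\;\; \big\lVert G_\theta(z) + c - u \big\rVert_2^2
\;+\; \lambda_{\mathrm{MS}}\,\mathrm{MS}(c)
\;+\; \lambda_{\mathrm{crack}}\,R_{\mathrm{crack}}(c)
\;+\; \lambda_{z}\,\lVert z \rVert_2^2
```

Joint optimization over the latent code z and the crack component c would then yield both a plausible crack-free painting G_θ(z) and a pixel-level crack map from c.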
[CV-41] From Lightweight CNNs to SpikeNets: Benchmarking Accuracy-Energy Tradeoffs with Pruned Spiking SqueezeNet
【速读】:该论文旨在解决轻量级卷积神经网络(Convolutional Neural Networks, CNNs)向脉冲神经网络(Spiking Neural Networks, SNNs)转换过程中缺乏系统性评估与优化的问题,尤其关注边缘智能场景下的能效瓶颈。其关键解决方案在于:构建了一套统一训练框架,将 ShuffleNet、SqueezeNet、MnasNet 和 MixNet 等紧凑 CNN 架构转化为基于漏电积分-放电(Leaky-Integrate-and-Fire, LIF)神经元的 SNN,并采用代理梯度下降法进行端到端训练;在此基础上,通过结构化剪枝策略移除冗余模块,得到 SNN-SqueezeNet-P 模型,在保持接近 CNN 原始精度(仅低 1%)的同时,实现参数减少 19% 和能耗降低 88.1%,从而验证了轻量级 SNN 在边缘部署中具备高能效与实用性。
链接: https://arxiv.org/abs/2602.09717
作者: Radib Bin Kabir,Tawsif Tashwar Dipto,Mehedi Ahamed,Sabbir Ahmed,Md Hasanul Kabir
机构: Islamic University of Technology (伊斯兰科技大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Spiking Neural Networks (SNNs) are increasingly studied as energy-efficient alternatives to Convolutional Neural Networks (CNNs), particularly for edge intelligence. However, prior work has largely emphasized large-scale models, leaving the design and evaluation of lightweight CNN-to-SNN pipelines underexplored. In this paper, we present the first systematic benchmark of lightweight SNNs obtained by converting compact CNN architectures into spiking networks, where activations are modeled with Leaky-Integrate-and-Fire (LIF) neurons and trained using surrogate gradient descent under a unified setup. We construct spiking variants of ShuffleNet, SqueezeNet, MnasNet, and MixNet, and evaluate them on CIFAR-10, CIFAR-100, and TinyImageNet, measuring accuracy, F1-score, parameter count, computational complexity, and energy consumption. Our results show that SNNs can achieve up to 15.7x higher energy efficiency than their CNN counterparts while retaining competitive accuracy. Among these, the SNN variant of SqueezeNet consistently outperforms other lightweight SNNs. To further optimize this model, we apply a structured pruning strategy that removes entire redundant modules, yielding a pruned architecture, SNN-SqueezeNet-P. This pruned model improves CIFAR-10 accuracy by 6% and reduces parameters by 19% compared to the original SNN-SqueezeNet. Crucially, it narrows the gap with CNN-SqueezeNet, achieving nearly the same accuracy (only 1% lower) but with an 88.1% reduction in energy consumption due to sparse spike-driven computations. Together, these findings establish lightweight SNNs as practical, low-power alternatives for edge deployment, highlighting a viable path toward deploying high-performance, low-power intelligence on the edge.
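For readers unfamiliar with how CNN activations are replaced by LIF neurons trained with surrogate gradients, the minimal PyTorch sketch below shows one time step: leak, integrate, threshold with a non-differentiable spike, and backpropagate through a sigmoid-shaped surrogate. The membrane decay, threshold, and surrogate slope are arbitrary example values, not the paper's settings.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, sigmoid-derivative surrogate backward."""
    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        sig = torch.sigmoid(4.0 * x)            # slope 4 is an arbitrary choice
        return grad_out * 4.0 * sig * (1 - sig)

def lif_step(v, x, beta=0.9, thresh=1.0):
    """One leaky-integrate-and-fire update: leak, integrate input, spike, soft reset."""
    v = beta * v + x
    spike = SurrogateSpike.apply(v - thresh)
    v = v - spike * thresh                      # reset by subtraction
    return v, spike

v = torch.zeros(8)
x = torch.rand(8, requires_grad=True)
v, s = lif_step(v, x)
s.sum().backward()                              # gradients flow through the surrogate
print(s, x.grad is not None)
```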
[CV-42] Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models ICLR2026
【速读】:该论文旨在解决现有3D生成方法在创建可动画化几何体方面的局限性,以及传统绑定(rigging)技术在骨骼结构精细化控制上的不足。其核心挑战在于如何从用户输入(如2D手绘线条和文本提示)中直接生成具备完整绑定信息的高质量3D网格。解决方案的关键在于提出Stroke3D框架,采用两阶段流水线:第一阶段通过Skeletal Graph VAE(Sk-VAE)与Skeletal Graph DiT(Sk-DiT)实现条件可控的骨骼生成,结合文本语义与2D手绘结构进行联合建模;第二阶段利用TextuRig数据集增强骨架到网格的合成模型,并引入SKA-DPO偏好优化策略以提升骨骼与网格间的几何一致性,从而实现从用户交互输入到可直接用于动画的3D内容的端到端生成。
链接: https://arxiv.org/abs/2602.09713
作者: Ruisi Zhao,Haoren Zheng,Zongxin Yang,Hehe Fan,Yi Yang
机构: ReLER, CCAI, Zhejiang University (浙江大学); DBMI, HMS, Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton’s graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE’s decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready to animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.
[CV-43] Physics-informed diffusion models in spectral space
【速读】:该论文旨在解决参数化偏微分方程(parametric partial differential equations, PDEs)在部分观测条件下的生成问题,涵盖前向和反演PDE问题。其核心挑战在于如何在稀疏观测下高效、准确地生成满足物理约束的解,同时保持计算效率与数值稳定性。解决方案的关键在于将生成式潜扩散模型(generative latent diffusion models)与物理信息机器学习(physics-informed machine learning)相结合:通过在缩放谱表示的潜空间中建模PDE参数与解的联合分布,利用高斯噪声诱导具有可控正则性的函数空间;该谱形式显著降低维度并确保解空间内PDE算子定义良好;在推理过程中,结合扩散后验采样与Adam优化,在每一步扩散迭代中施加物理约束和测量条件,从而实现高精度且高效的PDE解生成。
链接: https://arxiv.org/abs/2602.09708
作者: Davide Gallon,Philippe von Wurstemberger,Patrick Cheridito,Arnulf Jentzen
机构: University of Münster (明斯特大学); ETH Zurich (苏黎世联邦理工学院); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); CUHK-Shenzhen (香港中文大学(深圳))
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 24 pages, 9 figures
Abstract:We propose a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier–Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at this https URL.
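The inference-time idea, interleaving ordinary reverse-diffusion predictions with a few Adam updates that push the sample toward satisfying the PDE and measurement constraints, can be sketched generically as below. The denoiser, the residual, the step counts, and the learning rate are all placeholders; the paper works in a scaled spectral latent space, which this toy omits.

```python
import torch

def denoiser(z, t):                     # placeholder for a trained (spectral) denoiser
    return z * (1.0 - 0.1 * t)

def pde_residual(z):                    # placeholder differentiable physics constraint r(z) ~ 0
    return (z.sum(dim=-1) - 1.0).pow(2).mean()

z = torch.randn(4, 16)
for t in [1.0, 0.75, 0.5, 0.25, 0.0]:
    z = denoiser(z, t)                  # ordinary reverse-diffusion prediction
    z = z.detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=1e-2)
    for _ in range(5):                  # physics/measurement correction at this step
        opt.zero_grad()
        loss = pde_residual(z)
        loss.backward()
        opt.step()
    z = z.detach()
print("final residual:", pde_residual(z).item())
```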
[CV-44] GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
【速读】:该论文旨在解决细粒度指代表达图像分割(fine-grained referring image segmentation)问题,即根据自然语言查询准确地从复杂场景中分割出对应的目标实例。其解决方案的关键在于提出了一种解耦的“先推理后分割”(reason-then-segment)框架 GenSeg-R1,其中视觉语言模型(VLM)负责理解语义并生成结构化的空间提示(包括边界框和内部关键点),随后使用冻结的可提示分割器(SAM 2)将这些提示转化为高质量掩码。该方法通过 Group Relative Policy Optimization (GRPO) 对 Qwen3-VL 模型进行微调,无需监督式的推理链标注,在 RefCOCOg 和 GRefCOCO 等基准上显著优于现有方法,尤其在负样本(无目标提示)检测能力方面表现突出。
链接: https://arxiv.org/abs/2602.09701
作者: Sandesh Hegde,Jaison Saji Chacko,Debarshi Banerjee,Uma Mahesh
机构: Camcom Technologies Pvt. Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
[CV-45] Semi-supervised Liver Segmentation and Patch-based Fibrosis Staging with Registration-aided Multi-parametric MRI
【速读】:该论文旨在解决肝脏纤维化(liver fibrosis)临床诊断中肝脏分割(LiSeg)与疾病分期(LiFS)的精准性问题,尤其在多参数磁共振成像(multiparametric MRI)数据存在标注样本有限、模态差异大及域偏移(domain shift)等挑战下的建模难题。解决方案的关键在于提出一个端到端的多任务深度学习框架:在肝脏分割阶段采用半监督学习策略,融合图像分割与配准机制,有效利用有限标注数据和大量未标注数据以缓解模态间分布差异;在纤维化分期阶段引入基于patch的分类方法,实现对纤维化程度的可视化判别。该方法在包含in-distribution(ID)和out-of-distribution(OOD)测试案例的多通道MRI数据上验证了鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2602.09686
作者: Boya Wang,Ruizhe Li,Chao Chen,Xin Chen
机构: University of Nottingham (诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework developed for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multiparametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employed a patch-based method, which allows the visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multimodality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. Github link: this https URL
[CV-46] TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
【速读】:该论文旨在解决GUI自动化(GUI automation)在大规模场景下难以有效扩展的问题,尤其聚焦于GUI规划(GUI planning)这一关键环节,而非以往研究主要关注的GUI定位(GUI grounding)。其核心挑战在于如何高效收集高质量、可扩展的GUI操作轨迹数据。解决方案的关键在于提出TreeCUA框架,通过构建树状结构(tree-structured topology)组织探索轨迹,实现重复节点的存储与回放以降低数据成本;同时设计多智能体协作机制进行环境探索、动作验证、轨迹总结与质量评估,结合自适应探索算法平衡轨迹深度(难度)与广度(多样性),并引入世界知识引导和全局记忆回溯策略避免低质量生成。进一步地,利用树节点信息自然延伸出TreeCUA-DPO方法,通过参考相邻轨迹分支信息优化GUI规划能力,显著提升性能与跨域泛化性。
链接: https://arxiv.org/abs/2602.09662
作者: Deyang Jiang,Jing Huang,Xuanle Zhao,Lei Chen,Liming Zheng,Fanfan Liu,Haibo Qiu,Peng Shi,Zhixiong Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures
Abstract:Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (i.e., trajectory difficulty) and breadth (i.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at this https URL.
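A minimal version of the tree-structured trajectory store described above might look like the following: shared action prefixes are kept as single nodes and replayed from cache instead of being re-executed. Class names, fields, and the action strings are hypothetical; the real system also tracks verification and quality metadata.

```python
from dataclasses import dataclass, field

@dataclass
class ExplorationNode:
    """One GUI state in the exploration tree; repeated prefixes are stored once
    and replayed instead of being re-executed (names here are illustrative)."""
    action: str                                     # action that led to this state
    screenshot_id: str                              # reference to the cached observation
    children: dict = field(default_factory=dict)    # action -> ExplorationNode
    visits: int = 0

    def get_or_create(self, action, screenshot_id):
        node = self.children.get(action)
        if node is None:                            # unseen branch: extend the tree (breadth)
            node = ExplorationNode(action, screenshot_id)
            self.children[action] = node
        node.visits += 1                            # seen branch: replay from cache (saves cost)
        return node

root = ExplorationNode("<root>", "home_screen")
root.get_or_create("open('Settings')", "s1").get_or_create("click('Wi-Fi')", "s2")
root.get_or_create("open('Settings')", "s1")        # shared prefix reused, not re-explored
print(len(root.children), root.children["open('Settings')"].visits)   # 1 2
```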
[CV-47] Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation
【速读】:该论文旨在解决领域泛化视频语义分割(Domain Generalized Video Semantic Segmentation, DGVSS)中因域迁移(domain shift)和时间采样差异(temporal-sampling shift)导致的帧间闪烁问题,尤其是在无目标标签和测试时适应的情况下维持视频流中的时序一致性。解决方案的关键在于提出Time2General框架,其核心创新是引入基于稳定性查询(Stability Queries)的时空记忆解码器(Spatio-Temporal Memory Decoder),该模块将多帧上下文聚合为片段级(clip-level)时空记忆,并直接解码出时序一致的逐帧分割掩码,无需显式对应传播;同时设计了掩码时间一致性损失(Masked Temporal Consistency Loss),通过正则化不同采样率下的时序预测差异并随机化训练步长,显著提升模型对不同采样频率的鲁棒性与分割稳定性。
链接: https://arxiv.org/abs/2602.09648
作者: Siyu Chen,Ting Han,Haoling Huang,Chaolei Wang,Chengzheng Fu,Duxin Zhu,Guorong Cai,Jinhe Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain Generalized Video Semantic Segmentation (DGVSS) is trained on a single labeled driving domain and is directly deployed on unseen domains without target labels and test-time adaptation while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, the Masked Temporal Consistency Loss is proposed to regularize temporal prediction discrepancies across different strides, and randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
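The Masked Temporal Consistency Loss can be illustrated with a small PyTorch function that penalizes disagreement between segmentation predictions of the same frame computed under two different temporal strides, restricted to confidently predicted (presumably label-stable) regions. The confidence-based mask, the KL form of the penalty, and the threshold are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_temporal_consistency(logits_a, logits_b, conf_thresh=0.9):
    """Penalize disagreement between per-pixel predictions of the same frame
    obtained with two different temporal strides, only where the model is
    confident. Shapes: (B, C, H, W)."""
    p_a = logits_a.softmax(dim=1)
    p_b = logits_b.softmax(dim=1)
    conf = torch.maximum(p_a.max(dim=1).values, p_b.max(dim=1).values)
    mask = (conf > conf_thresh).float()                                 # (B, H, W)
    kl = F.kl_div(p_b.clamp_min(1e-8).log(), p_a, reduction="none").sum(dim=1)
    return (kl * mask).sum() / mask.sum().clamp_min(1.0)

la, lb = torch.randn(2, 19, 64, 64), torch.randn(2, 19, 64, 64)
print(masked_temporal_consistency(la, lb).item())
```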
[CV-48] VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
【速读】:该论文旨在解决3D可操作性定位(3D affordance grounding)任务中因依赖静态线索(如语言和图像)而导致动态交互上下文不足的问题,从而难以捕捉时间与因果关系的挑战。解决方案的关键在于构建一个大规模视频驱动的3D可操作性数据集VIDA,并提出VideoAfford基线模型:该模型融合多模态大语言模型(Multimodal Large Language Models, MLLMs)与额外的可操作性分割能力,在统一框架内实现世界知识推理与细粒度可操作性定位;同时引入潜在动作编码器(latent action encoder)从人-物体交互(Human-Object Interaction, HOI)视频中提取动态交互先验,并设计空间感知损失函数以增强模型对三维空间信息的建模能力,从而显著提升模型在开放世界场景下的泛化性能与可操作性推理能力。
链接: https://arxiv.org/abs/2602.09638
作者: Hanqing Wang,Mingyu Liu,Xiaoyu Chen,Chengwei MA,Yiming Zhong,Wenti Yin,Yuhao Liu,Zhiqing Cui,Jiahao Yuan,Lu Dai,Zhiyuan Ma,Hui Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
[CV-49] Towards Training-free Multimodal Hate Localisation with Large Language Models
【速读】:该论文旨在解决在线视频中仇恨内容(hateful content)检测存在的两大问题:一是现有方法严重依赖大规模人工标注数据,二是缺乏细粒度的时间定位精度。其解决方案的关键在于提出一种无需训练的大型语言模型(Large Language Model, LLM)框架LELA,通过多模态分解与跨模态推理机制实现无监督的仇恨内容检测与精确定位。具体而言,LELA将视频解构为图像、语音、光学字符识别(OCR)、音乐和视频上下文五个模态,并采用多阶段提示(prompting)策略计算每帧的仇恨得分,同时引入组合匹配机制增强跨模态推理能力,从而在HateMM和MultiHateClip两个挑战性基准上显著优于所有现有无训练基线方法。
链接: https://arxiv.org/abs/2602.09637
作者: Yueming Sun,Long Yang,Jianbo Jiao,Zeyu Fu
机构: University of Durham (杜伦大学); University of Exeter (埃克塞特大学); University of Birmingham (伯明翰大学); The MIx Group
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
[CV-50] AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception ICLR2026
【速读】:该论文旨在解决当前机器人触觉感知研究中对动态触觉信息建模不足的问题,特别是现有数据集和模型多聚焦于物体层面属性(如材质),而忽视了物理交互过程中细微的表面形变与力动力学等时序动态特性。其关键解决方案在于构建了一个大规模分层触觉数据集ToucHD,涵盖触觉原子动作、真实操作场景及触觉-力配对数据,并在此基础上提出AnyTouch 2框架,该框架通过统一对象级理解与细粒度力感知的动态感知能力,从像素级到动作特定形变均进行建模,同时显式刻画物理力动力学,从而实现多层次动态触觉感知能力的学习与泛化。
链接: https://arxiv.org/abs/2602.09617
作者: Ruoxuan Feng,Yuxuan Zhou,Siyu Mei,Dongzhan Zhou,Pengwei Wang,Shaowei Cui,Bin Fang,Guocai Yao,Di Hu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Real-world contact-rich manipulation demands robots to perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We consider that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames, while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that covers static object properties and dynamic physical attributes, as well as real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities-from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.
[CV-51] AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中水印技术存在的两大核心问题:一是视觉无关的水印可能引入与图像内容不相关的标记,破坏视觉定位(visual grounding);二是现有基于视觉的水印方法依赖静态的一次性视觉关键权重估计,忽视了权重分布密度对保护token比例的影响,导致在生成过程中无法适应动态视觉依赖变化,从而在长尾阶段引入低质量token。解决方案的关键在于提出注意力引导的动态水印框架(Attention-Guided Dynamic Watermarking, AGMark),其通过在每个解码步骤中动态识别基于注意力权重的语义关键证据和上下文感知的一致性线索,构建更自适应且校准良好的证据权重分布,并结合不确定性感知(token熵)与证据校准(权重密度)联合确定语义关键token的比例,实现自适应词汇分区,避免无关token的插入,从而在保持高检测准确率(至少99.36% AUC)和强抗攻击能力(至少88.61% AUC)的同时显著提升生成质量和视觉语义保真度。
链接: https://arxiv.org/abs/2602.09611
作者: Yue Li,Xin Yi,Dongsheng Shi,Yongyi Cui,Gerard de Melo,Linlin Wang
机构: East China Normal University (华东师范大学); Hasso Plattner Institute, University of Potsdam (波茨坦大学哈索普拉特纳研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: preprint
Abstract:Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
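To ground the idea of adaptive vocabulary partitioning that avoids irrelevant tokens, the toy function below applies a standard green-list watermark bias to next-token logits but exempts tokens flagged as semantically critical. The hashing scheme, the fixed attention threshold, and the per-token visual-relevance vector are placeholders; AGMark derives its protected set and partition ratio dynamically from attention weights, entropy, and weight density.

```python
import hashlib
import torch

def watermark_bias(logits, context_ids, visual_attention, gamma=0.5, delta=2.0,
                   attn_thresh=0.2):
    """Toy green-list watermark that leaves visually grounded tokens alone.

    logits: (V,) next-token logits; visual_attention: (V,) a made-up per-token
    visual-relevance score. Only the 'skip semantic-critical tokens' mechanics
    are shown; the real partition ratio is determined adaptively in AGMark.
    """
    vocab = logits.shape[0]
    seed = int(hashlib.sha256(bytes(context_ids)).hexdigest(), 16) % (2 ** 31)
    g = torch.Generator().manual_seed(seed)
    green = torch.rand(vocab, generator=g) < gamma       # pseudo-random green list
    protected = visual_attention > attn_thresh            # semantic-critical tokens
    logits = logits.clone()
    logits[green & ~protected] += delta                    # bias only unprotected tokens
    return logits

biased = watermark_bias(torch.randn(100), [3, 17, 42], torch.rand(100))
print(biased.shape)
```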
[CV-52] Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing
【速读】:该论文旨在解决当前基于扩散模型的视频生成与编辑方法在任务专用性、多模态输入支持不足以及编辑流程难以扩展和组合等方面的局限性。现有方法通常仅依赖文本指令,难以处理图像、参考视频等多模态输入,且编辑操作常需为每种任务单独设计工程化流水线,导致系统缺乏通用性和可扩展性。其解决方案的关键在于提出一个统一的多模态框架Tele-Omni,通过预训练的多模态大语言模型(Multimodal Large Language Models, MLLMs)解析异构指令并推断结构化的生成或编辑意图,再由扩散生成器根据这些结构化信号进行高质量视频合成;同时引入一种任务感知的数据处理管道,将不同视频任务的输入统一为结构化指令格式,从而实现跨任务联合训练与灵活控制,兼顾多模态输入支持、强时序一致性与视觉稳定性。
链接: https://arxiv.org/abs/2602.09609
作者: Jialun Liu,Yukuo Ma,Xiao Cao,Tian Li,Gonghu Shang,Haibin Huang,Chi Zhang,Xuelong Li,Cong Liu,Junqi Liu,Jiakui Hu,Robby T. Tan,Shiwen Zhang,Liying Yang,Xiaoyan Yang,Qizhen Weng,Xiangzhen Chang,Yuanzhi Liang,Yifan Xu,Zhiyong Huang,Zuoxin Li,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
[CV-53] Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
【速读】:该论文旨在解决第一人称交互世界建模(egocentric interactive world models)中的核心挑战,即如何从单张场景图像出发,在自由空间手势控制下生成具有低延迟、几何一致性与长期稳定性的逼真视频,其中手部进入场景并与物体交互,同时在头部运动下引发合理的物理动态变化。其关键挑战包括:自由手势与接触密集型训练数据之间的分布偏移、单目视角下手部运动与相机运动的模糊性,以及任意长度视频生成的需求。解决方案的核心在于提出Hand2World框架——一个统一的自回归生成模型,通过基于投影3D手部网格的遮挡不变的手部条件编码,使可见性与遮挡信息由场景上下文推断而非依赖控制信号;并通过每像素Plücker射线嵌入显式注入相机几何信息,解耦相机运动与手部运动,避免背景漂移;此外,结合全自动单目标注流程与双向扩散模型蒸馏为因果生成器,实现长时程交互生成。
链接: https://arxiv.org/abs/2602.09600
作者: Yuxi Wang,Wenqi Ouyang,Tianyi Wei,Yi Dong,Zhiqi Shen,Xingang Pan
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
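The per-pixel Plücker-ray embedding used to inject camera geometry is a standard construction: for each pixel, take the world-space ray direction d and the moment o × d of the camera center o, giving a 6-dimensional code. The sketch below computes it for a pinhole camera; how Hand2World feeds these embeddings into the generator is not reproduced here.

```python
import torch

def plucker_rays(K, c2w, H, W):
    """Per-pixel Plücker-ray embedding (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics, c2w: (4, 4) camera-to-world pose. Returns (H, W, 6).
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T                                  # camera-frame rays
    dirs = dirs_cam @ c2w[:3, :3].T                                         # rotate to world
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].repeat(H, W, 1)                                     # camera center per pixel
    moment = torch.cross(origin, dirs, dim=-1)                              # o x d
    return torch.cat([dirs, moment], dim=-1)                                # (H, W, 6)

K = torch.tensor([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
emb = plucker_rays(K, torch.eye(4), 64, 64)
print(emb.shape)  # torch.Size([64, 64, 6])
```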
[CV-54] MieDB-100k: A Comprehensive Dataset for Medical Image Editing
【速读】:该论文旨在解决多模态生成式 AI (Generative AI) 在医学图像编辑中因高质量数据稀缺而导致的适应性瓶颈问题。现有医学图像编辑数据集普遍存在多样性不足、忽视医学图像理解能力以及难以兼顾质量与可扩展性的缺陷。解决方案的关键在于构建一个大规模(100k样本)、高质量且多样化的文本引导医学图像编辑数据集 MieDB-100k,其通过将编辑任务细分为感知(Perception)、修改(Modification)和转换(Transformation)三个维度以同时考虑理解与生成能力,并采用结合模态特异性专家模型与规则驱动的数据合成方法进行数据构建,辅以严格的临床真实性人工校验,从而显著提升模型性能与泛化能力。
链接: https://arxiv.org/abs/2602.09587
作者: Yongfan Lai,Wen Qian,Bo Liu,Hongyan Li,Hao Luo,Fan Wang,Bohan Zhuang,Shenda Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthetic methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that model trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
[CV-55] Delving into Spectral Clustering with Vision-Language Representations ICLR26
【速读】:该论文旨在解决传统谱聚类(Spectral Clustering)方法主要依赖单模态数据、未能充分利用多模态表示中丰富信息的问题。其解决方案的关键在于提出一种基于神经切向核(Neural Tangent Kernel, NTK)的多模态谱聚类方法,通过预训练视觉-语言模型中的跨模态对齐特性,将图像间的亲和度(affinity)建模为视觉相似性与语义重叠的耦合。具体而言,利用正向名词(positive nouns)锚定神经切向核,增强簇内连接并抑制簇间伪连接,从而促进块对角结构;同时引入正则化亲和扩散机制,自适应融合不同提示(prompt)诱导的亲和矩阵,显著提升聚类性能。
链接: https://arxiv.org/abs/2602.09586
作者: Bo Peng,Yuanwei Hu,Bo Liu,Ling Chen,Jie Lu,Zhen Fang
机构: Australian Artificial Intelligence Institute, University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR26
Abstract:Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks – including classical, large-scale, fine-grained and domain-shifted datasets – manifest that our method consistently outperforms the state-of-the-art by a large margin.
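One way to picture the "visual proximity coupled with semantic overlap" affinity is the toy pipeline below, which builds an affinity matrix from stand-in image embeddings and positive-noun text embeddings and feeds it to off-the-shelf spectral clustering. The multiplicative coupling, the clipping, and the random features are assumptions; the paper's formulation is based on the neural tangent kernel and additionally applies regularized affinity diffusion across prompts.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# 'img_feats' stands in for vision-language image embeddings and 'txt_feats'
# for embeddings of "positive noun" prompts; both are random placeholders here.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(60, 32)); img_feats /= np.linalg.norm(img_feats, axis=1, keepdims=True)
txt_feats = rng.normal(size=(10, 32)); txt_feats /= np.linalg.norm(txt_feats, axis=1, keepdims=True)

visual = img_feats @ img_feats.T                 # visual proximity between images
sem = img_feats @ txt_feats.T                    # image-to-noun alignment
semantic = sem @ sem.T                           # semantic overlap between images
affinity = np.clip(visual, 0, None) * np.clip(semantic, 0, None)   # assumed coupling

labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels[:10])
```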
[CV-56] ECG-IMN: Interpretable Mesomorphic Neural Networks for 12-Lead Electrocardiogram Interpretation
【速读】:该论文旨在解决深度学习模型在心电图(ECG)自动诊断中因“黑箱”特性而难以获得临床信任的问题,即模型虽具备专家级性能,但缺乏对决策过程的透明性与可解释性。解决方案的关键在于提出一种名为ECG-IMN(Interpretable Mesomorphic Neural Network)的新型神经网络架构,其核心创新是将传统分类器重构为超网络(hypernetwork):由一个深度卷积主干网络生成每个输入样本对应的线性模型参数(权重W),从而实现内在可解释性;该权重矩阵直接作为高分辨率特征归因图,精确映射病理证据(如ST段抬高、T波倒置)在时间和导联维度上的定位,且无需依赖后验近似方法(如Grad-CAM或SHAP),确保解释的忠实性与计算效率。
链接: https://arxiv.org/abs/2602.09566
作者: Vajira Thambawita,Jonas L. Isaksen,Jørgen K. Kanters,Hugo L. Hammer,Pål Halvorsen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
备注:
Abstract:Deep learning has achieved expert-level performance in automated electrocardiogram (ECG) diagnosis, yet the “black-box” nature of these models hinders their clinical deployment. Trust in medical AI requires not just high accuracy but also transparency regarding the specific physiological features driving predictions. Existing explainability methods for ECGs typically rely on post-hoc approximations (e.g., Grad-CAM and SHAP), which can be unstable, computationally expensive, and unfaithful to the model’s actual decision-making process. In this work, we propose the ECG-IMN, an Interpretable Mesomorphic Neural Network tailored for high-resolution 12-lead ECG classification. Unlike standard classifiers, the ECG-IMN functions as a hypernetwork: a deep convolutional backbone generates the parameters of a strictly linear model specific to each input sample. This architecture enforces intrinsic interpretability, as the decision logic is mathematically transparent and the generated weights (W) serve as exact, high-resolution feature attribution maps. We introduce a transition decoder that effectively maps latent features to sample-wise weights, enabling precise localization of pathological evidence (e.g., ST-elevation, T-wave inversion) in both time and lead dimensions. We evaluate our approach on the PTB-XL dataset for classification tasks, demonstrating that the ECG-IMN achieves competitive predictive performance (AUROC comparable to black-box baselines) while providing faithful, instance-specific explanations. By explicitly decoupling parameter generation from prediction execution, our framework bridges the gap between deep learning capability and clinical trustworthiness, offering a principled path toward “white-box” cardiac diagnostics.
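The "hypernetwork generates a per-sample linear model" pattern at the heart of ECG-IMN can be shown in a few lines: a backbone maps the ECG to weights W and bias b, and the prediction is the strictly linear readout of W against the input, so W itself acts as an exact attribution map over leads and time. Layer sizes and the flattened input below are toy simplifications of the paper's convolutional backbone and transition decoder.

```python
import torch
import torch.nn as nn

class TinyMesomorphicNet(nn.Module):
    """Minimal hypernetwork-generates-a-linear-model sketch (sizes are toy)."""
    def __init__(self, n_leads=12, n_samples=250, n_classes=5):
        super().__init__()
        d_in = n_leads * n_samples
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(d_in, 128), nn.ReLU())
        self.weight_head = nn.Linear(128, d_in * n_classes)
        self.bias_head = nn.Linear(128, n_classes)
        self.d_in, self.n_classes = d_in, n_classes

    def forward(self, x):                          # x: (B, leads, samples)
        h = self.backbone(x)
        W = self.weight_head(h).view(-1, self.n_classes, self.d_in)
        b = self.bias_head(h)
        logits = torch.einsum("bcd,bd->bc", W, x.flatten(1)) + b   # strictly linear readout
        return logits, W                           # W doubles as the per-sample explanation

model = TinyMesomorphicNet()
logits, W = model(torch.randn(2, 12, 250))
print(logits.shape, W.shape)                       # (2, 5), (2, 5, 3000)
```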
[CV-57] Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination WACV2026
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型生成的内容与输入图像信息不一致的现象。其核心原因在于大语言模型(Large Language Models, LLMs)的强先验性以及跨模态注意力机制的错位。解决方案的关键在于提出一种名为Scalpel的方法,通过精细化调整Transformer层中每个注意力头的激活分布,使其更聚焦于可信区域。该方法利用高斯混合模型捕捉可信与幻觉空间中的多峰注意力分布,并借助熵正则最优传输(等价于Schrödinger桥问题)实现对高斯成分的精确映射;在缓解阶段,根据成分归属关系动态调节干预强度和方向,从而有效抑制幻觉并提升输出一致性。该方法具有模型和数据无关性,仅需一次解码即可完成干预,无需额外计算开销。
链接: https://arxiv.org/abs/2602.09541
作者: Ziqiang Shi,Rujie Liu,Shanshan Yu,Satoshi Munakata,Koichi Shirahata
机构: Fujitsu Research & Development Center Co.,LTD.(富士通研发与开发中心有限公司); Fujitsu Limited(富士通有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026 (It was accepted in the first round, with an acceptance rate of 6%.)
Abstract:Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content - termed hallucination. To address this, we propose \textbfScalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.
[CV-58] AUHead: Realistic Emotional Talking Head Generation via Action Units Control ICLR
【速读】:该论文旨在解决当前生成式AI在虚拟人像视频生成中难以实现细腻情感表达的问题,尤其是缺乏对微表情单元(Action Units, AUs)的精细控制能力。解决方案的关键在于提出一种两阶段方法AUHead:第一阶段利用大音频语言模型(Audio-Language Models, ALMs)通过时空AUs标记化与“情绪-然后-AU”的思维链机制,从原始语音中解耦出AUs,有效捕捉细微情感线索;第二阶段构建基于AUs驱动的可控扩散模型,将AUs序列映射为结构化的二维人脸表示以提升空间保真度,并在交叉注意力模块中建模AUs与视觉特征的交互关系,同时引入推理阶段的AUs解耦引导策略,实现情感表现力与身份一致性的灵活权衡。
链接: https://arxiv.org/abs/2602.09534
作者: Jiayi Lyu,Leigang Qu,Wenjing Zhang,Hanyu Jiang,Kai Liu,Zhenglin Zhou,Xiaobo Xia,Jian Xue,Tat-Seng Chua
机构: University of the Chinese Academy of Sciences (中国科学院大学); National University of Singapore (新加坡国立大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e. , Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an “emotion-then-AU” chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at this https URL
[CV-59] RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
【速读】:该论文旨在解决单目度量深度估计(Monocular Metric Depth Estimation, MMDE)中对复杂场景下低频类(underrepresented classes)深度估计精度不足的问题。其解决方案的关键在于提出一种检索增强框架(Retrieval-Augmented Framework, RAD),通过引入不确定性感知的检索机制,识别输入图像中置信度较低的区域,并从RGB-D数据集中检索语义相似的上下文样本作为结构几何代理(structural geometric proxies)。随后,利用双流网络分别处理输入与检索到的上下文,并通过匹配的交叉注意力模块(matched cross-attention module)仅在可靠点对应关系处传递几何信息,从而有效提升对低频类别的深度估计准确性。
链接: https://arxiv.org/abs/2602.09532
作者: Michael Baltaxe,Dan Levi,Sagie Benaim
机构: General Motors(通用汽车); The Hebrew University of Jerusalem(希伯来大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
[CV-60] DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment AAAI2026
【速读】:该论文旨在解决盲图像质量评估(Blind Image Quality Assessment, BIQA)中现有模型难以有效捕捉细微失真线索、导致与人类主观判断不一致的问题。其核心挑战在于缺乏可靠的失真先验(distortion priors),现有方法通常仅学习统一图像特征与质量评分之间的浅层关系,从而对失真不敏感。解决方案的关键在于提出一种基于先验驱动的BIQA框架,通过引入一个退化感知的视觉-语言模型获取特定于失真的先验信息,并利用提出的失真显著性差异模块(Distortion-Saliency Differential Module)将这些先验与语义注意力区分开来,以确保失真表征的真实性;随后,通过动态失真加权模块(Dynamic Distortion Weighting Module)融合细化后的先验、语义信息及桥梁表示,按感知影响权重分配各失真特征,最终实现更贴近人类感知的质量预测。
链接: https://arxiv.org/abs/2602.09531
作者: Bohan Fu,Guanyi Qin,Fazhan Zhang,Zihao Huang,Mingxuan Li,Runze Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:Blind Image Quality Assessment, aiming to replicate human perception of visual quality without reference, plays a key role in vision tasks, yet existing models often fail to effectively capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors, as methods typically learn shallow relationships between unified image features and quality scores, resulting in their insensitive nature to distortions and thus limiting their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling a reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring the genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature as per its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.
[CV-61] SCA-Net: Spatial-Contextual Aggregation Network for Enhanced Small Building and Road Change Detection
【速读】:该论文旨在解决遥感影像中变化检测任务面临的两大挑战:一是现有深度学习模型对小目标(如小型建筑)敏感性不足,二是计算成本过高导致训练效率低下。解决方案的关键在于提出SCA-Net架构,其核心创新包括:1)设计差异金字塔块(Difference Pyramid Block)以实现多尺度变化分析;2)引入自适应多尺度处理模块(Adaptive Multi-scale Processing),融合形状感知与高分辨率增强结构;3)集成多层级注意力机制(PPM和CSAGate)协同处理上下文信息与细节特征;4)采用动态复合损失函数和四阶段训练策略,提升训练稳定性并加速收敛。实验表明,该方法在LEVIR-MCI数据集上mIoU提升2.64%,小建筑IoU提升57.9%,同时训练时间减少61%。
链接: https://arxiv.org/abs/2602.09529
作者: Emad Gholibeigi,Abbas Koochari,Azadeh ZamaniFar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures, 3 tables. Submitted for review
Abstract:Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net’s superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a significant 2.64% improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9% increase in IoU for small buildings, while reducing the training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.
[CV-62] SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem ICASSP2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在高风险领域(如医疗健康)应用中因幻觉(hallucination)问题导致的可靠性不足问题,即模型生成文本与视觉输入不一致或忽略视觉信息的现象。其核心解决方案是提出SchröMind框架,通过求解薛定谔桥(Schrödinger bridge)问题,在 hallucinatory 和 truthful 激活状态之间建立低运输成本的 token 级映射,从而在轻量级训练下显著减少幻觉,同时保持原始模型能力。该方法的关键在于利用最优传输理论实现精准的中间激活调整,避免传统自回归生成过程中错误难以修正的缺陷。
链接: https://arxiv.org/abs/2602.09528
作者: Ziqiang Shi,Rujie Liu,Shanshan Yu,Satoshi Munakata,Koichi Shirahata
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP 2026
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework reducing hallucinations via solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model’s original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.
[CV-63] HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection
【速读】:该论文旨在解决工业场景中缺陷样本稀缺条件下,无监督异常检测(Unsupervised Industrial Anomaly Detection, UAD)的准确性和鲁棒性问题。传统方法多依赖像素级重建,难以捕捉复杂纹理和结构信息,且易受环境噪声干扰。其解决方案的关键在于提出高-低分辨率引导特征对齐框架(High-Low Resolution Guided Feature Alignment, HLGFA),通过建模正常样本在高分辨率与低分辨率表示之间的跨分辨率特征一致性来学习正常模式,而非依赖像素级重建;具体而言,利用共享冻结主干网络提取多层级特征,并将高分辨率特征分解为结构先验和细节先验,通过条件调制和门控残差修正引导低分辨率特征优化,从而实现更精准的异常定位;此外引入噪声感知的数据增强策略以抑制工业环境中常见的干扰响应,显著提升模型泛化能力。
链接: https://arxiv.org/abs/2602.09524
作者: Han Zhou,Yuxuan Gao,Yinchao Du,Xuezhe Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures, references added
Abstract:Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.
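The detection rule, scoring anomalies where cross-resolution feature alignment breaks down, can be caricatured as a cosine-distance map between the high-resolution features and the refined low-resolution features, upsampled to image size. The sketch below omits the guidance and gating modules and uses random tensors; it only illustrates how an anomaly map would be read out at inference.

```python
import torch
import torch.nn.functional as F

def cross_resolution_anomaly_map(feat_hr, feat_lr_refined, out_size=(256, 256)):
    """Score anomalies where refined low-res features fail to match high-res ones.

    feat_hr, feat_lr_refined: (B, C, h, w) feature maps from a shared backbone.
    The 1 - cosine-similarity map is upsampled to image resolution.
    """
    sim = F.cosine_similarity(feat_hr, feat_lr_refined, dim=1, eps=1e-6)   # (B, h, w)
    anomaly = 1.0 - sim
    return F.interpolate(anomaly.unsqueeze(1), size=out_size,
                         mode="bilinear", align_corners=False).squeeze(1)

a = cross_resolution_anomaly_map(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(a.shape, float(a.max()))
```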
[CV-64] Singpath-VL Technical Report
【速读】:该论文旨在解决生成式 AI (Generative AI) 在宫颈细胞学(cervical cytology)领域应用中的空白问题,即当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在细胞病理学尤其是宫颈细胞学中尚未得到充分探索,主要受限于高质量标注数据集的稀缺。解决方案的关键在于构建一个百万级图像-描述合成数据集的三阶段流水线:首先利用多个通用MLLM作为弱标注器生成初始描述,继而通过共识融合与专家知识注入对输出进行精炼,最终获得高保真度的细胞形态描述;在此基础上,采用多阶段微调策略对Qwen3-VL-4B模型进行专业化训练,从而得到名为Singpath-VL的细胞病理学专用MLLM,显著提升了细粒度形态感知和细胞级别诊断分类能力。
链接: https://arxiv.org/abs/2602.09523
作者: Zhen Qiu,Kaiwen Xiao,Zhengwei Lu,Xiangyu Liu,Lei Zhao,Hao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Singpath-VL, a vision-language large model, to fill the vacancy of AI assistant in cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
[CV-65] Attention to details logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的视觉注意力不足问题,这会导致生成内容出现幻觉(hallucination)。现有方法通过增强所有视觉标记的注意力来缓解该问题,但会同时提升与任务无关标记的关注度,从而引入噪声。其解决方案的关键在于提出一种无需训练的注意力干预算法,基于任务相关视觉标记通常具有较高视觉-文本相似性的假设,从跨模态注意力子矩阵中提取视觉-文本相关性信息,构建重加权矩阵以重新分配注意力权重,从而聚焦于任务相关的视觉标记;此外,通过将视觉注意力值注入束搜索解码过程,进一步强化对高视觉注意力路径的选择,从而在不牺牲生成内容准确性和连贯性的前提下显著降低幻觉现象。
链接: https://arxiv.org/abs/2602.09521
作者: Jingyi Wang,Fei Li,Rujie Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods present a limitation that boosting attention for all visual tokens inevitably increases attention to task irrelevant tokens. To tackle this challenge, we propose a training free attentional intervention algorithm to enhance the attention of task-relevant tokens based on the argument that task-relevant tokens generally demonstrate high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct the reweighting matrices to reallocate attention. Besides, to enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs, while preserving the accuracy and coherence of generated content.
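A generic form of the proposed reweighting, boosting attention on visual tokens in proportion to their visual-text similarity and renormalizing, is sketched below. The multiplicative scaling, the similarity vector, and the alpha parameter are illustrative assumptions rather than the paper's exact reweighting matrices, and the modified beam-search decoding is not shown.

```python
import torch

def reweight_visual_attention(attn, vis_txt_sim, alpha=1.0):
    """Scale attention toward visually relevant tokens and renormalize.

    attn: (heads, q_len, kv_len) attention weights; vis_txt_sim: (kv_len,)
    similarity of each key token to the text query (zeros for non-visual positions).
    """
    scale = 1.0 + alpha * vis_txt_sim.clamp(min=0.0)        # boost task-relevant tokens only
    attn = attn * scale.view(1, 1, -1)
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

a = torch.softmax(torch.randn(8, 4, 16), dim=-1)
sim = torch.zeros(16); sim[2:10] = torch.rand(8)             # visual tokens at positions 2..9
print(reweight_visual_attention(a, sim).sum(dim=-1)[0, 0])   # rows still sum to 1
```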
[CV-66] A Universal Action Space for General Behavior Analysis
【速读】:该论文旨在解决动物和人类行为分析在计算机视觉中长期存在的难题,即如何从复杂且低层次的视觉特征(如颜色、形状、纹理)中有效提取可泛化的高阶行为表征。传统方法依赖手工设计的边缘检测与稀疏特征点追踪,难以应对多样性和噪声干扰,导致鲁棒性差、泛化能力弱。解决方案的关键在于引入大规模预训练模型的思想——利用ImageNet推动的深度神经网络学习到的高层语义表示,构建一个通用的动作空间(Universal Action Space, UAS),并基于已有的标注人类动作数据集进行扩展,进而将该UAS作为统一框架用于哺乳动物及黑猩猩行为数据集的分析与分类,从而实现跨物种行为识别的高效建模与迁移。
链接: https://arxiv.org/abs/2602.09518
作者: Hung-Shuo Chang,Yue-Cheng Yang,Yu-Hsi Chen,Wei-Hsin Chen,Chien-Yao Wang,James C. Liao,Chien-Chang Chen,Hen-Hsen Huang,Hong-Yuan Mark Liao
机构: Institute of Information Science, Academia Sinica, Taiwan (中央研究院资讯科学研究所); Institute of Biomedical Sciences, Academia Sinica, Taiwan (中央研究院生物医学科学研究所); Institute of Biological Chemistry, Academia Sinica, Taiwan (中央研究院生物化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities-an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at this https URL.
[CV-67] Energy-Efficient Fast Object Detection on Edge Devices for IoT Systems
【速读】:该论文旨在解决物联网(IoT)系统中快速移动目标检测的效率与准确性难题,尤其是在资源受限的边缘设备上实现低延迟、高能效的目标检测。其核心挑战在于传统端到端(end-to-end)检测方法在处理高速运动物体时存在精度下降、延迟高和能耗大的问题。解决方案的关键在于提出一种基于帧差法(frame difference method)的轻量级检测算法,结合人工智能(AI)分类器与优化模型(如MobileNet),显著提升了检测准确率、降低了延迟并提高了能效:实验表明,相较端到端方法,该方案平均准确率提升28.314%,平均延迟降低39.305%,平均效率提高3.6倍,尤其适用于对实时性和能效要求严苛的IoT场景下的快速移动目标检测任务。
链接: https://arxiv.org/abs/2602.09515
作者: Mas Nurul Achmadiah,Afaroj Ahamad,Chi-Chia Sun,Wen-Kai Kuo
机构: National Formosa University (国立虎尾科技大学); Yuan Ze University (元智大学); National Taipei University (国立台北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures
Abstract:This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. This method, with its shorter duration, is the most efficient and suitable for fast-object detection in IoT systems, which require energy-efficient applications compared to end-to-end methods. We have implemented this technique on three edge devices: AMD Alveo U50, Jetson Orin Nano, and Hailo-8 AI Accelerator, and four models with artificial neural networks and transformer models. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently has high accuracy, low latency, and is highly energy-efficient. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that the proposed algorithm has improved the average accuracy gain by 28.314%, the average efficiency gain by 3.6 times, and the average latency reduction by 39.305% compared to the end-to-end method. Of all these classes, the faster objects are trains and airplanes. Experiments show that the accuracy percentage for trains and airplanes is lower than other categories. So, in tasks that require fast detection and accurate results, end-to-end methods can be a disaster because they cannot handle fast object detection. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm. It is well suited for applications in IoT systems, especially those that require fast-moving object detection and higher accuracy.
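作为参考,下面用纯 numpy 给出帧差法提取运动候选区域的极简示意;阈值 thresh、包围盒提取方式以及后接轻量分类器(如 MobileNet)的流程均为示意性假设,并非论文的完整实现。

```python
# 帧差法快速目标检测的极简示意(纯 numpy;阈值与裁剪逻辑均为假设,
# 实际系统中会将候选区域送入 MobileNet 等轻量分类器)
import numpy as np

def frame_difference_roi(prev_frame, curr_frame, thresh=30):
    """prev_frame/curr_frame: (H, W) 灰度图;返回运动区域的包围盒 (x0, y0, x1, y1) 或 None"""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    mask = diff > thresh                      # 运动像素掩码
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()

prev = np.zeros((240, 320), dtype=np.uint8)
curr = prev.copy()
curr[100:140, 60:180] = 255                   # 模拟一个快速移动的物体
roi = frame_difference_roi(prev, curr)
print("候选区域:", roi)                        # 之后可裁剪该区域并交给轻量分类器
```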
[CV-68] Robust Depth Super-Resolution via Adaptive Diffusion Sampling
【速读】:该论文旨在解决深度超分辨率(Depth Super-Resolution)任务中,如何从任意退化形式的低分辨率输入中鲁棒地恢复高分辨率深度图的问题。传统方法通常直接回归深度值,在严重或未知退化条件下易产生伪影,缺乏泛化能力。解决方案的关键在于利用高斯平滑的收缩性质(contraction property):在前向扩散过程中,退化输入与高质量原始图像之间的分布差异逐渐缩小并收敛至各向同性高斯先验。AdaDS据此自适应地选择逆向扩散轨迹的起始时间步,基于估计的重构不确定性,并注入定制噪声,使中间样本落入目标后验分布的高概率区域,从而借助预训练扩散模型的生成先验实现鲁棒恢复,即使上游估计不准确也能保持性能稳定。
链接: https://arxiv.org/abs/2602.09510
作者: Kun Wang,Yun Zhu,Pan Zhou,Na Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS’s superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
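下面给出“按不确定性自适应选择逆扩散起始时间步并注入噪声”这一思路的极简示意(Python/numpy)。线性 beta 调度、不确定性到时间步的线性映射等均为本文假设,仅用于说明 AdaDS 摘要中描述的机制。

```python
# 依据估计的不确定性自适应选择逆扩散起始时间步并注入噪声的极简示意
# (时间步映射与噪声调度均为假设,仅说明“不确定性越大、起点越靠近纯噪声”的思路)
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def adaptive_start(x0_coarse, uncertainty):
    """x0_coarse: 粗略上采样得到的深度图(任意形状);uncertainty ∈ [0, 1]"""
    t = int(np.clip(uncertainty, 0.0, 1.0) * (T - 1))   # 不确定性越高,起始时间步越大
    eps = np.random.randn(*x0_coarse.shape)
    z_t = np.sqrt(alphas_bar[t]) * x0_coarse + np.sqrt(1.0 - alphas_bar[t]) * eps
    return t, z_t                                        # 从 (t, z_t) 开始执行预训练模型的逆扩散

t, z = adaptive_start(np.random.rand(64, 64), uncertainty=0.7)
print("起始时间步:", t, "中间样本形状:", z.shape)
```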
[CV-69] Equilibrium contrastive learning for imbalanced image classification
【速读】:该论文旨在解决对比学习(Contrastive Learning, CL)在类别不平衡数据集上性能受限的问题,尤其针对现有监督式CL方法未考虑类均值/原型与分类器之间的对齐、以及原型(prototype)仅被视为单个额外样本而使其贡献受批内样本数量影响、导致类间贡献不平衡的问题。解决方案的关键在于提出均衡对比学习(Equilibrium Contrastive Learning, ECL),其核心机制包括两个方面:一是通过构建表示几何平衡(representation geometric equilibrium),实现类内特征坍缩与类均值均匀分布的正则单纯形结构,并平衡类平均特征与类原型的贡献;二是建立分类器-类中心几何平衡(classifier-class center geometric equilibrium),通过对齐分类器权重与类原型,提升模型泛化能力。实验表明,ECL在多个长尾数据集上优于当前最先进的监督对比学习方法。
链接: https://arxiv.org/abs/2602.09506
作者: Sumin Roh,Harim Kim,Ho Yun Lee,Il Yong Chun
机构: Sungkyunkwan University (成均馆大学); Samsung Medical Center (三星医疗中心); Institute for Basic Science (基础科学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 8 figures
Abstract:Contrastive learning (CL) is a predominant technique in image classification, but it shows limited performance on imbalanced datasets. Recently, several supervised CL methods have been proposed to promote an ideal regular simplex geometric configuration in the representation space, characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes as additional samples to consider all classes. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and classifiers, which could lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, where class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes the representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means), while balancing the contributions of class-average features and class prototypes. Second, ECL establishes a classifier-class center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments with three long-tailed datasets, the CIFAR-10(0)-LT, ImageNet-LT, and the two imbalanced medical datasets, the ISIC 2019 and our constructed LCCT dataset. Results show that ECL outperforms existing SOTA supervised CL methods designed for imbalanced classification.
[CV-70] OSI: One-step Inversion Excels in Extracting Diffusion Watermarks
【速读】:该论文旨在解决基于高斯阴影(Gaussian Shading)的扩散模型图像水印在提取过程中依赖多步扩散逆过程所带来的计算复杂度高、效率低的问题。其解决方案的关键在于提出了一步逆向(One-step Inversion, OSI)方法,将水印提取重构为一个可学习的符号分类问题,从而无需精确回归初始噪声;通过在合成噪声-图像对上以符号分类目标微调从扩散主干网络初始化的模型,实现了仅需一步即可高效准确地完成水印提取,显著提升了速度与精度,并增强了水印容量。
链接: https://arxiv.org/abs/2602.09494
作者: Yuwei Chen,Zhenliang He,Jia Tang,Meina Kan,Shiguang Shan
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China(中国科学院计算技术研究所人工智能安全重点实验室); University of Chinese Academy of Sciences (CAS), China(中国科学院大学); School of Information Science and Technology, ShanghaiTech University, China(上海科技大学信息科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and finetune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish the watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
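下面用 numpy 给出“水印提取即符号分类”这一核心思想的极简示意:Gaussian Shading 类方法把比特编码进初始噪声的符号,提取时只需判断预测值的正负。其中比特到符号的映射以及用加噪模拟的预测误差均为示意性假设,非论文官方实现。

```python
# 将水印提取视为符号分类的极简示意:比特编码在初始噪声的符号中,
# 提取时判断每个位置预测值的正负(模型结构与比特映射均为假设)。
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(8, 8))                      # 待嵌入的水印比特
init_noise = np.abs(rng.standard_normal((8, 8)))
init_noise = np.where(bits == 1, init_noise, -init_noise)   # 比特 -> 初始噪声的符号

# 假设某个“一步逆向”模型给出了对初始噪声的近似预测(此处用加噪模拟预测误差)
pred_noise = init_noise + 0.3 * rng.standard_normal((8, 8))

extracted = (pred_noise > 0).astype(int)                    # 符号分类即水印提取
acc = (extracted == bits).mean()
print(f"比特提取准确率: {acc:.2%}")
```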
[CV-71] Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)因参数量庞大而导致的部署困难问题,现有知识蒸馏(Knowledge Distillation, KD)方法主要依赖静态的下一个词预测对齐,忽略了动态的token交互关系,而这些交互关系对于多模态理解与生成至关重要。解决方案的关键在于提出Align-TI框架,从token交互角度出发,设计两个核心组件:IVA(Instruction-Visual Alignment)通过对齐显著视觉区域,使学生模型模仿教师模型在指令引导下提取相关视觉信息的能力;TPA(Token-to-Token Alignment)则通过匹配序列token之间的转移概率,捕捉教师模型在生成过程中的动态逻辑。该方法显著提升了蒸馏效果,在多个指标上优于传统KD,并实现了参数更少但性能更强的模型。
链接: https://arxiv.org/abs/2602.09483
作者: Lin Chen,Xiaoke Zhao,Kun Ding,Weiwei Feng,Changtao Miao,Zili Wang,Wenxuan Guo,Ying Wang,Kaiyuan Zheng,Bo Zhang,Zhe Li,Shiming Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher’s instruction-relevant visual information extraction capability by aligning on salient visual regions. TPA captures the teacher’s dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI’s superiority. Notably, our approach achieves a 2.6% relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by 7.0%, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at this https URL.
[CV-72] Weakly Supervised Contrastive Learning for Histopathology Patch Embeddings
【速读】:该论文旨在解决数字病理学中全切片图像(Whole Slide Images, WSI)分析因标注样本稀缺而导致的模型训练困难问题。传统弱监督多实例学习(Weakly Supervised Multiple Instance Learning, MIL)方法依赖于冻结的图像块特征进行特征聚合,但忽略了在MIL场景下对图像编码器预训练时的特征表示学习。其解决方案的关键在于提出一种新型的弱监督对比学习框架——WeakSupCon,该框架在训练过程中引入了袋级(slide-level)标签信息,无需实例级伪标签即可有效区分不同类别图像块的特征空间分布,从而显著提升下游MIL任务的性能。
链接: https://arxiv.org/abs/2602.09477
作者: Bodong Zhang,Xiwen Li,Hamid Manoochehri,Xiaoya Tang,Deepika Sirohi,Beatrice S. Knudsen,Tolga Tasdizen
机构: University of Utah(犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to the limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches in three datasets. Our related code is available at this http URL.
[CV-73] FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation
【速读】:该论文旨在解决合成数据(synthetic data)与真实数据(real data)之间因外观和成像差异导致的域偏移(domain shift)问题,从而提升下游视觉任务(如语义分割)的性能。现有无配对合成到真实图像翻译方法常面临保真度与结构稳定性之间的权衡:过度自由的生成易引入形变或虚假纹理,而过于严格的约束则限制了对真实域统计特性的适应能力。其解决方案的关键在于提出一种频率解耦的双分支模型(FD-DB),将外观迁移分解为低频可解释编辑分支(预测白平衡、曝光、对比度、饱和度、模糊和噪点等物理参数以构建稳定的低频外观基底)与高频残差补偿分支(通过残差生成补充细节),并通过门控融合机制在显式频率约束下整合两分支输出,有效抑制低频漂移并保持几何与语义结构一致性。此外,采用两阶段训练策略先稳定编辑分支再释放残差分支,进一步提升了优化稳定性与性能表现。
链接: https://arxiv.org/abs/2602.09476
作者: Chuanhai Zang,Jiabao Hu,XW Song
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 13 figures, 2 tables. Code available at this https URL
Abstract:Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
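下面给出低频可解释编辑与高频残差补偿做门控融合的极简示意(PyTorch)。均值模糊充当低通滤波、sigmoid 门控以及“残差只保留高频成分”的写法均为本文假设,仅用于说明 FD-DB 摘要中描述的双分支结构。

```python
# 低频可解释编辑 + 高频残差补偿的门控融合极简示意(PyTorch)。
# 低通滤波器、门控形式与参数均为假设,仅说明“低频走物理编辑、高频走残差生成”的结构。
import torch
import torch.nn.functional as F

def lowpass(x, k=9):
    """简单均值模糊充当低通滤波(实际可换成高斯核)"""
    pad = k // 2
    weight = torch.ones(x.shape[1], 1, k, k) / (k * k)
    return F.conv2d(x, weight, padding=pad, groups=x.shape[1])

def gated_fusion(edited, residual, gate_logit):
    """edited: 可解释编辑分支输出;residual: 自由分支残差;gate_logit: 门控参数(标量或逐像素)"""
    g = torch.sigmoid(gate_logit)
    low = lowpass(edited)                        # 稳定的低频外观基底
    high = residual - lowpass(residual)          # 只保留残差的高频成分,限制低频漂移
    return low + g * high

x = torch.rand(1, 3, 64, 64)
out = gated_fusion(x, torch.randn(1, 3, 64, 64) * 0.1, torch.tensor(0.0))
print(out.shape)
```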
[CV-74] ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs
【速读】:该论文旨在解决生成式图像模型(Generative AI)中合成图像 artifacts(如扭曲的手部或变形物体)的检测问题,这对评估生成器性能和训练奖励模型以提升生成质量至关重要。当前方法依赖于在数万张标注图像上微调视觉语言模型(VLM),但成本高昂且难以适应新生成器或新型artifacts。解决方案的关键在于:利用预训练VLM中已有的知识,通过少量标注样本(每类仅需数百张)即可激活其检测能力,核心是设计了一种多组件架构,结合上下文学习(in-context learning)与文本指令优化,并引入多项创新改进,从而实现对多种artifacts类型(包括对象形态、动物解剖结构及实体交互)以及AIGC检测任务的泛化,显著降低数据需求并达到当前最优性能。
链接: https://arxiv.org/abs/2602.09475
作者: James Burgess,Rameen Abdal,Dan Stoddart,Sergey Tulyakov,Serena Yeung-Levy,Kuan-Chieh Jackson Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL
Abstract:Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.
[CV-75] LLM-Grounded Dynamic Task Planning with Hierarchical Temporal Logic for Human-Aware Multi-Robot Collaboration
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)在多机器人任务规划中生成的方案缺乏运动学可行性且效率低下,尤其是在长时程场景下的问题;同时,传统形式化方法如线性时序逻辑(Linear Temporal Logic, LTL)虽能提供正确性和最优性保障,却受限于静态、离线环境且难以扩展至实时动态场景。其解决方案的关键在于提出一种神经符号框架,将LLM推理嵌入到分层LTL规范中,并通过滚动时域规划(Receding Horizon Planning, RHP)机制结合实时感知,动态优化任务分配与路径规划(Simultaneous Task Allocation and Planning, STAP),从而在不确定性环境中实现高成功率、低延迟且流畅的人机交互。
链接: https://arxiv.org/abs/2602.09472
作者: Shuyuan Hu,Tao Lin,Kai Ye,Yang Yang,Tianwei Zhang
机构: The Shenzhen Institute of Artificial Intelligence and Robotics for Society (深圳市人工智能与机器人研究院); Harbin Institute of Technology (哈尔滨工业大学); The Chinese University of Hong Kong-Shenzhen (香港中文大学(深圳)); Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Large Language Models (LLM) enable non-experts to specify open-world multi-robot tasks, the generated plans often lack kinematic feasibility and are not efficient, especially in long-horizon scenarios. Formal methods like Linear Temporal Logic (LTL) offer correctness and optimal guarantees, but are typically confined to static, offline settings and struggle with computational scalability. To bridge this gap, we propose a neuro-symbolic framework that grounds LLM reasoning into hierarchical LTL specifications and solves the corresponding Simultaneous Task Allocation and Planning (STAP) problem. Unlike static approaches, our system resolves stochastic environmental changes, such as moving users or updated instructions via a receding horizon planning (RHP) loop with real-time perception, which dynamically refines plans through a hierarchical state space. Extensive real-world experiments demonstrate that our approach significantly outperforms baseline methods in success rate and interaction fluency while minimizing planning latency.
[CV-76] Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing
【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中因速度场(velocity field)调整引入误差并沿生成路径累积的问题,从而影响图像生成质量。现有训练-free 流匹配方法虽可通过修改速度场提升性能,但其误差难以控制;而直接在潜在空间(latent space)中调整潜在轨迹(latent trajectory)则可利用预训练的速度网络自动校正误差,减少累积偏差。解决方案的关键在于提出两种互补的训练-free 潜在轨迹平滑策略:Look-Ahead 通过曲率门控权重平均当前与下一步潜在状态实现轨迹优化,Look-Back 则基于指数移动平均对潜在状态进行平滑处理,二者均在不重新训练模型的前提下显著提升生成质量,且在多个基准数据集(如 COCO17、CUB-200 和 Flickr30K)上优于现有最先进方法。
链接: https://arxiv.org/abs/2602.09449
作者: Yan Luo,Henry Huang,Todd Y. Zhou,Mengyu Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, modifying the velocity field v introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory z are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory adjustment approaches based on future and past velocity v and latent trajectory z information that refine the generative path directly in latent space. We propose two training-free trajectory smoothing schemes: Look-Ahead, which averages the current and next-step latents using a curvature-gated weight, and Look-Back, which smoothes latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
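下面给出 Look-Back 与 Look-Ahead 两种免训练潜在轨迹平滑策略的极简示意(Python/numpy)。其中曲率的具体度量、门控权重的指数形式与衰减系数均为本文假设,仅对应摘要中“指数滑动平均”与“曲率门控加权平均”的描述。

```python
# 两种免训练潜在轨迹平滑策略的极简示意(numpy):
# Look-Back 用指数滑动平均平滑潜变量;Look-Ahead 用曲率门控权重平均当前与下一步潜变量。
import numpy as np

def look_back(z_hist, decay=0.9):
    """z_hist: 已生成的潜变量列表,返回其指数滑动平均"""
    ema = z_hist[0]
    for z in z_hist[1:]:
        ema = decay * ema + (1.0 - decay) * z
    return ema

def look_ahead(z_t, z_next, z_prev, gamma=5.0):
    """曲率越大(轨迹越弯),越倾向于与下一步潜变量做平均"""
    curvature = np.linalg.norm((z_next - z_t) - (z_t - z_prev))
    w = 1.0 - np.exp(-gamma * curvature)          # 曲率门控权重,取值 (0, 1)
    return (1.0 - w) * z_t + w * z_next

zs = [np.random.randn(4, 4) for _ in range(5)]
print(look_back(zs).shape, look_ahead(zs[2], zs[3], zs[1]).shape)
```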
[CV-77] A Scoping Review of Deep Learning for Urban Visual Pollution and Proposal of a Real-Time Monitoring Framework with a Visual Pollution Index
【速读】:该论文旨在解决城市视觉污染(Urban Visual Pollution, UVP)自动检测与应用研究碎片化的问题,尤其针对现有方法在检测、分类及系统集成方面的不足。其解决方案的关键在于提出一个整合性的视觉污染管理框架,该框架包含三个核心要素:一是建立统一的污染物分类体系(pollutant taxonomy),以实现跨区域、跨场景的标准化识别;二是构建跨城市的基准数据集(cross-city benchmark dataset),弥补当前数据集地域局限性;三是引入视觉污染指数(Visual Pollution Index, VPI),用于量化评估特定区域的污染严重程度,从而支持可持续的城市美学管理和居民福祉提升。
链接: https://arxiv.org/abs/2602.09446
作者: Mohammad Masudur Rahman,Md. Rashedur Rahman,Ashraful Islam,Saadia B Alam,M Ashraful Amin
机构: Independent University, Bangladesh (独立大学, 孟加拉国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps the existing deep learning-based approaches for detecting, classifying, and designing a comprehensive application framework for visual pollution management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, and 26 articles were found. Most research focuses on specific pollutant categories and employs variations of YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, yet they tend to be geographically skewed. We proposed a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution for a certain area. This review highlights the need for a unified UVP management system that incorporates pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.
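论文摘要仅提出引入视觉污染指数(VPI)而未给出公式,下面给出一个示意性的加权归一化计算方式(Python),其中污染物类别、严重程度权重与按面积归一化均为本文假设。

```python
# 视觉污染指数(VPI)的一个极简示意:对某一区域内检测到的各类污染物做加权并按面积归一化。
# 具体公式未在摘要中给出,以下权重与归一化方式均为假设。
def visual_pollution_index(counts, weights, area_km2):
    """counts: {污染物类别: 检测数量}; weights: {类别: 严重程度权重}; area_km2: 区域面积"""
    score = sum(weights.get(k, 1.0) * v for k, v in counts.items())
    return score / max(area_km2, 1e-6)            # 单位面积的加权污染负荷

counts = {"billboard": 34, "graffiti": 12, "dangling_cables": 20}
weights = {"billboard": 1.0, "graffiti": 0.6, "dangling_cables": 1.5}
print(f"VPI = {visual_pollution_index(counts, weights, area_km2=2.5):.2f}")
```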
[CV-78] Fine-T2I: An Open Large-Scale and Diverse Dataset for High-Quality T2I Fine-Tuning
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)微调中高质量、大规模开放数据集匮乏的问题,这一瓶颈限制了开源研究模型与企业级模型之间的性能差距。其解决方案的关键在于构建一个名为Fine-T2I的大规模、高质量且完全开放的T2I微调数据集,该数据集融合了由先进生成模型合成的图像与专业摄影师提供的精心筛选的真实图像,并通过严格过滤机制确保文本-图像对齐度、视觉保真度和提示质量,最终保留超过600万对文本-图像样本,数据量接近预训练级别但保持微调所需的高精度标准,从而显著提升多种扩散模型和自回归模型在生成质量和指令遵循性方面的表现。
链接: https://arxiv.org/abs/2602.09439
作者: Xu Ma,Yitian Zhang,Qihua Dong,Yun Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset: this https URL
Abstract:High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.
[CV-79] SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL
【速读】:该论文旨在解决当前单次遍历的3D场景生成方法中存在的空间幻觉(spatial hallucinations)问题,例如物体间的碰撞冲突,其根源在于缺乏细致的推理机制。解决方案的关键在于提出了一种视觉引导的自我反思框架 SceneReVis,该框架采用迭代式的“诊断-行动”(diagnose-and-act)循环,利用多模态反馈显式识别并修复空间冲突;同时构建了大规模因果构建轨迹数据集 SceneChain-12k,并设计两阶段训练策略(从监督微调过渡到代理强化学习),使模型逐步演化为具备主动空间规划能力的智能体,从而实现高保真生成与目标导向优化的统一。
链接: https://arxiv.org/abs/2602.09432
作者: Yang Zhao,Shizhao Sun,Meisheng Zhang,Yingdong Shi,Xubo Yang,Jiang Bian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
[CV-80] Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在面对基于编码器的对抗攻击时,攻击方法在不同模型架构间迁移能力严重不足的问题。现有攻击方法仅优化视觉编码器,虽计算效率高,但在真实黑盒场景下跨模型泛化性能差,缺乏系统性研究支撑。解决方案的关键在于提出一种语义引导的多模态攻击框架(Semantic-Guided Multimodal Attack, SGMA),其核心创新是针对发现的两个根本原因:一是不同模型对图像区域的视觉定位不一致(inconsistent visual grounding);二是模型内部存在冗余的语义对齐(redundant semantic alignment),即单一对象被分散到多个重叠的token表示中。SGMA通过将扰动聚焦于语义关键区域,并在全局与局部层面破坏跨模态对齐关系,显著提升了对抗样本在不同LVLM之间的迁移成功率,揭示了当前LVLM部署中的重大安全风险。
链接: https://arxiv.org/abs/2602.09431
作者: Xinwei Zhang,Li Bai,Tianwei Zhang,Youqian Zhang,Qingqing Ye,Yingnan Zhao,Ruochen Du,Haibo Hu
机构: The Hong Kong Polytechnic University (香港理工大学); Nanyang Technological University (南洋理工大学); Harbin Engineering University (哈尔滨工程大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review; 21 pages
Abstract:Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study towards encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform in-depth analysis, disclosing two root causes that hinder the transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework to enhance the transferability. Inspired by the discovered causes in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
[CV-81] Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification
【速读】:该论文旨在解决智能交通系统(ITS)中基于激光雷达(LiDAR)的卡车细粒度分类问题,其核心挑战在于现有方法依赖监督学习与人工标注,导致扩展性差、成本高。解决方案的关键在于利用视觉语言模型(VLM)的少样本泛化能力,并通过一个深度感知的图像生成管道将稀疏、遮挡的LiDAR点云转化为结构化的深度编码二维视觉代理(depth-encoded 2D visual proxies),从而弥合LiDAR与图像模态之间的差距。该方法无需参数微调即可实现仅需16–30个样本/类的高精度分类,且在极低样本场景下表现出“语义锚定”效应(Semantic Anchor Effect),显著降低了初始标注成本,为ITS应用提供了可落地的冷启动策略。
链接: https://arxiv.org/abs/2602.09425
作者: Yiqiao Li,Bo Shang,Jie Wei
机构: City College of New York (纽约市立学院); University of California, Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 10 figures, 4 tables
Abstract:Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a “Semantic Anchor” effect: text-based guidance regularizes performance in ultra-low-shot regimes (k < 4), but degrades accuracy in more-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without the costly training or fine-tuning, significantly reducing the intensive demands of initial manual labeling, thus achieving a method of practical use in ITS applications.
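下面用 numpy 给出“把稀疏 LiDAR 点云栅格化为深度编码的二维视觉代理”这一步骤的极简示意。坐标轴约定、分辨率与深度到亮度的归一化方式均为本文假设;论文实际流水线还包含去噪、时空配准、姿态矫正、形态学操作与各向异性平滑等步骤。

```python
# 将稀疏 LiDAR 点云转为深度编码的二维视觉代理的极简示意(numpy):
# 把侧视平面坐标栅格化为像素,并用横向距离编码像素亮度(距离越近越亮)。
import numpy as np

def depth_encoded_proxy(points, h=128, w=256):
    """points: (N, 3) 的 (x, y, z),x 为沿道路方向,y 为横向深度,z 为高度"""
    img = np.zeros((h, w), dtype=np.float32)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = ((x - x.min()) / (np.ptp(x) + 1e-6) * (w - 1)).astype(int)
    v = ((z.max() - z) / (np.ptp(z) + 1e-6) * (h - 1)).astype(int)
    depth = 1.0 - (y - y.min()) / (np.ptp(y) + 1e-6)
    img[v, u] = np.maximum(img[v, u], depth)
    return img

pts = np.random.rand(5000, 3) * [12.0, 3.0, 4.0]    # 模拟一辆卡车的稀疏点云
proxy = depth_encoded_proxy(pts)
print(proxy.shape, proxy.max())                      # 该图像可直接输入现成 VLM
```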
[CV-82] Stability and Concentration in Nonlinear Inverse Problems with Block-Structured Parameters: Lipschitz Geometry Identifiability and an Application to Gaussian Splatting
【速读】:该论文旨在解决高维非线性逆问题中参数估计的稳定性与统计集中性问题,尤其针对具有块结构参数的场景。其核心挑战在于如何在不依赖特定重建算法的前提下,建立与前向算子(forward operator)内在性质相关的误差界。解决方案的关键在于构建一个基于算子理论的统一框架,通过结合块状Lipschitz几何、局部可辨识性和次高斯噪声等假设,推导出确定性的稳定性不等式、全局Lipschitz边界以及非渐近浓度估计,从而获得高概率意义下的参数误差界。这一方法揭示了估计误差本质上受限于图像分辨率与模型复杂度之比,体现了稳定性和分辨率之间的根本权衡关系。
链接: https://arxiv.org/abs/2602.09415
作者: Joe-Mei Feng,Hsin-Hsiung Kao
机构: Tamkang University (淡江大学); Central Police University (中央警察大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
Abstract:We develop an operator-theoretic framework for stability and statistical concentration in nonlinear inverse problems with block-structured parameters. Under a unified set of assumptions combining blockwise Lipschitz geometry, local identifiability, and sub-Gaussian noise, we establish deterministic stability inequalities, global Lipschitz bounds for least-squares misfit functionals, and nonasymptotic concentration estimates. These results yield high-probability parameter error bounds that are intrinsic to the forward operator and independent of any specific reconstruction algorithm. As a concrete instantiation, we verify that the Gaussian Splatting rendering operator satisfies the proposed assumptions and derive explicit constants governing its Lipschitz continuity and resolution-dependent observability. This leads to a fundamental stability–resolution tradeoff, showing that estimation error is inherently constrained by the ratio between image resolution and model complexity. Overall, the analysis characterizes operator-level limits for a broad class of high-dimensional nonlinear inverse problems arising in modern imaging and differentiable rendering.
[CV-83] LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging
【速读】:该论文旨在解决现有任务向量(task-vector)合并方法在大型视觉Transformer中因忽略层间异质性而导致的性能瓶颈问题,即浅层对干扰敏感而深层编码稳定任务特征的特性未被有效利用。解决方案的关键在于提出一种无需训练、无需数据、与合并器无关的分层自适应缩放贴片(Layer-wise Adaptive Rescaling Veneer, LARV),其通过为每个任务向量分配逐层缩放系数,在不修改原有合并规则的前提下,自适应地抑制浅层干扰并增强深层任务特征对齐,从而显著提升多任务模型合并的鲁棒性和性能。
链接: https://arxiv.org/abs/2602.09413
作者: Xinyu Wang,Ke Deng,Fei Dou,Jinbo Bi,Jin Lu
机构: University of Connecticut (康涅狄格大学); University of Georgia (佐治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 9 figures, 6 tables
Abstract:Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation, and show it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
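下面给出 LARV“先逐层自适应缩放、再合并任务向量”这一思路的极简示意(PyTorch)。按层深划分浅/中/深三档以及各档缩放系数均为本文假设(摘要提到分档策略最稳健,但未给出具体数值),合并规则此处用简单求和代替。

```python
# 任务向量逐层缩放的极简示意(PyTorch):按层深分三档给每个任务向量乘以不同系数,
# 再交给任意合并规则(此处用简单求和代替)。档位划分与系数取值均为假设。
import torch

def larv_rescale(task_vector, num_layers, shallow=0.6, mid=1.0, deep=1.2):
    """task_vector: {参数名: 增量张量},参数名形如 'blocks.{i}.xxx'"""
    scaled = {}
    for name, delta in task_vector.items():
        layer = int(name.split(".")[1])
        ratio = layer / max(num_layers - 1, 1)
        s = shallow if ratio < 1 / 3 else (mid if ratio < 2 / 3 else deep)
        scaled[name] = s * delta                 # 抑制浅层干扰、增强深层任务特征
    return scaled

tv1 = {f"blocks.{i}.weight": torch.randn(4, 4) for i in range(12)}
tv2 = {f"blocks.{i}.weight": torch.randn(4, 4) for i in range(12)}
merged = {k: larv_rescale(tv1, 12)[k] + larv_rescale(tv2, 12)[k] for k in tv1}
print(len(merged), merged["blocks.0.weight"].shape)
```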
[CV-84] K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge ICLR2026
【速读】:该论文旨在解决视觉生成模型(Visual Generative Models)评估中面临的可扩展性与人类对齐性不足的问题。传统基于众包平台的人类偏好评估方法成本高、效率低,而直接使用视觉语言模型(VLMs)进行替代则因模型固有的幻觉和偏差导致评估结果与人类偏好不一致,且静态比较策略效率低下。解决方案的关键在于提出K-Sort Eval框架,其核心创新包括:一是通过后验校正(posterior correction)机制,在贝叶斯更新过程中根据VLM预测与人类标注的一致性动态调整后验概率,提升评估结果的可靠性;二是设计动态匹配策略(dynamic matching),在每次比较中平衡不确定性与多样性以最大化预期收益,从而显著提高评估效率,实验表明该方法仅需少于90次模型运行即可获得与人工评估高度一致的结果。
链接: https://arxiv.org/abs/2602.09411
作者: Zhikai Li,Jiatong Li,Xuewen Liu,Wangbo Zhao,Pan Du,Kaicheng Zhou,Qingyi Gu,Yang You,Zhen Dong,Kurt Keutzer
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of California, Berkeley (加州大学伯克利分校); University of California, Santa Barbara (加州大学圣巴巴拉分校); National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); Collov Labs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026. Code is available at: this https URL
Abstract:The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provides the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
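下面给出“后验校正”机制的一个极简示意(Python):用 VLM 排名与人工标注的一致性来缩放贝叶斯式能力更新中观测的可信度。高斯共轭更新的形式与一致性的使用方式均为本文假设,并非 K-Sort Eval 的官方实现。

```python
# 后验校正思路的极简示意:按 VLM 预测与人工投票的一致性,对能力后验更新做加权。
# 一致性度量与更新公式均为假设,仅说明“预测越可信,更新幅度越大”的机制。
import numpy as np

def corrected_update(prior_mean, prior_var, obs_score, obs_var, consistency):
    """consistency ∈ [0, 1]:VLM 排名与人工投票的历史一致率,用于缩放观测的可信度"""
    eff_var = obs_var / max(consistency, 1e-3)            # 一致性低 -> 观测方差放大 -> 更新变小
    post_var = 1.0 / (1.0 / prior_var + 1.0 / eff_var)
    post_mean = post_var * (prior_mean / prior_var + obs_score / eff_var)
    return post_mean, post_var

mean, var = 0.0, 1.0
for score, consist in [(0.8, 0.9), (0.5, 0.4), (0.9, 0.95)]:
    mean, var = corrected_update(mean, var, score, obs_var=0.25, consistency=consist)
print(f"能力估计: {mean:.3f} ± {np.sqrt(var):.3f}")
```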
[CV-85] Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D
【速读】:该论文旨在解决医学影像中三维(3D)解剖结构理解的瓶颈问题,即当前基于二维(2D)医学图像进行3D重建时存在精度不足、深度信息模糊以及重建结果过于简化等挑战。其解决方案的关键在于通过构建一个受控的零样本基准测试(zero-shot benchmark),系统评估五种前沿图像到3D生成模型(SAM3D、Hunyuan3D-2.1、Direct3D、Hi3DGen 和 TripoSG)在六组医学数据集和两组自然图像数据集上的表现,从而量化单切片医学图像重建的局限性,并揭示几何先验从自然图像向医学图像迁移的有效性边界。研究发现,尽管所有模型在体素级重叠指标上表现中等,表明深度推断失败是普遍现象,但全局距离指标显示SAM3D在拓扑结构相似性方面最优,而其他模型则倾向于过度简化重建结果,这凸显了平面2D医学图像固有的深度歧义性,进而推动多视角图像到3D重建成为实现可靠医学3D推断的必要方向。
链接: https://arxiv.org/abs/2602.09407
作者: Yan Luo,Advaith Ravishankar,Serena Liu,Yutong Yang,Mengyu Wang
机构: Harvard AI and Robotics Lab (哈佛人工智能与机器人实验室); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models can solve this issue by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natural datasets, using voxel-based metrics and point cloud distance metrics. Across medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground truth medical 3D data, while alternative models are more prone to over-simplification of reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
[CV-86] Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中因高质量像素级标注数据稀缺而导致的模型训练困难问题,其核心挑战在于标注成本高且依赖专业临床知识。为此,作者提出了一种全可微分双向协同学习(Fully Differentiable Bidirectional Synergistic Learning, DBiSL)框架,其关键创新在于突破了现有双任务协作学习中单向交互机制的限制(如通常仅从回归任务到分割任务),实现了在线双向跨任务协同:通过构建可微分的双向连接,使分割与回归任务能够实时互为监督信号,从而更充分地挖掘双任务间的预测一致性潜力。该框架系统整合了监督学习、一致性正则化、伪监督学习和不确定性估计四大半监督学习(Semi-Supervised Learning, SSL)模块,在两个基准数据集上取得了当前最优性能,同时为双任务驱动的SSL提供了新的架构范式。
链接: https://arxiv.org/abs/2602.09378
作者: Jun Li
机构: Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ESWA 2026
Abstract:Semi-supervised learning relaxes the need for large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method’s state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on GitHub upon acceptance.
[CV-87] Impact of domain adaptation in deep learning for medical image classifications
【速读】:该论文旨在解决域适应(Domain Adaptation, DA)在医学图像分析中应用的性能提升与泛化能力优化问题,尤其关注不同场景下如多模态数据、噪声干扰、联邦学习(Federated Learning, FL)、模型可解释性及分类器校准等实际挑战。其解决方案的关键在于利用10种深度学习模型模拟主流DA技术,将源域与目标域的数据映射至共享特征空间,从而实现从标注充足的源域向标注稀缺的目标域迁移知识;实验表明,该方法在脑肿瘤(Brain Tumor, BT)数据集上使用ResNet34时可提升模型性能4.7%,并有效缓解高斯噪声影响(准确率提升约3%),同时增强Grad-CAM++可视化效果以提高临床可解释性,并降低预期校准误差(Expected Calibration Error, ECE)约2%,证明DA在复杂医疗场景中的有效性与鲁棒性。
链接: https://arxiv.org/abs/2602.09355
作者: Yihang Wu,Ahmad Chaddad
机构: Guilin University of Electronic Technology (桂林电子科技大学); École de Technologie Supérieure (加拿大魁北克工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE SMC 2025
Abstract:Domain adaptation (DA) is a quickly expanding area in machine learning that involves adjusting a model trained in one domain to perform well in another domain. While there have been notable progressions, the fundamental concept of numerous DA methodologies has persisted: aligning the data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve the model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application in four medical image datasets. We have considered various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 in a brain tumor (BT) data set results in an enhancement of 4.7% in model performance. Similarly, the use of DA can reduce the impact of Gaussian noise, as it provides a ~3% accuracy increase using ResNet34 on a BT dataset. Furthermore, simply introducing DA into the FL framework shows limited potential (e.g., a ~0.3% increase in performance) for skin cancer classification. In addition, the DA method can improve the interpretability of the models using the Grad-CAM++ technique, which offers clinical value. Calibration analysis also demonstrates that using DA provides a lower expected calibration error (ECE) value (~2%) compared to CNN alone on a multi-modality dataset.
[CV-88] Kyrtos: A methodology for automatic deep analysis of graphic charts with curves in technical documents
【速读】:该论文旨在解决技术文档中图形图像内曲线的自动识别与分析问题,以实现对文档内容的深度理解(Deep Understanding of Technical Documents, DUTD)。其核心挑战在于准确提取图形中曲线的结构特征并建立其语义关联,从而支持后续的结构化建模与知识转化。解决方案的关键在于提出Kyrtos方法:首先通过聚类算法识别曲线的中间点以分割线段,再基于提取的线段解析方向、趋势等行为特征,并将这些关系转化为带属性的图结构(attributed graphs);进一步地,该图关系被映射为自然语言(Natural Language, NL)文本,辅助生成随机Petri网(Stochastic Petri-net, SPN)模型,从而刻画图表所表达的内部功能逻辑。
链接: https://arxiv.org/abs/2602.09337
作者: Michail S. Alexiou,Nikolaos G. Bourbakis
机构: Kennesaw State University (肯尼索州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Understanding of Technical Documents (DUTD) has become a very attractive field with great potential due to large amounts of accumulated documents and the valuable knowledge contained in them. In addition, the holistic understanding of technical documents depends on the accurate analysis of their particular modalities, such as graphics, tables, diagrams, text, etc. and their associations. In this paper, we introduce the Kyrtos methodology for the automatic recognition and analysis of charts with curves in graphics images of technical documents. The recognition processing part adopts a clustering-based approach to recognize middle-points that delimit the line-segments that construct the illustrated curves. The analysis processing part parses the extracted line-segments of curves to capture behavioral features such as direction and trend. These associations assist the conversion of recognized segments’ relations into attributed graphs, for the preservation of the curves’ structural characteristics. The graph relations are also expressed as natural language (NL) text sentences, enriching the document’s text and facilitating their conversion into Stochastic Petri-net (SPN) graphs, which depict the internal functionality represented in the chart image. Extensive evaluation results demonstrate the accuracy of Kyrtos’ recognition and analysis methods by measuring the structural similarity between input chart curves and the approximations generated by Kyrtos for charts with multiple functions.
[CV-89] Deep Modeling and Interpretation for Bladder Cancer Classification
【速读】:该论文旨在解决深度学习模型在膀胱癌医学图像分类任务中表现不佳的问题,尤其是针对异常区域占比小的医学影像场景下,传统基于卷积神经网络(Convolutional Neural Network, CNN)和视觉Transformer(Vision Transformer, ViT)模型的泛化能力、校准性及可解释性不足的问题。其解决方案的关键在于:首先通过标准分类实验对比13种主流模型(包括CNN与Transformer类),发现ConvNext系列模型在跨中心数据上泛化能力有限(准确率约60%);其次引入校准分析表明ViT类模型相比ConvNext和Swin Transformer具有更好的校准效果;最后利用GradCAM++进行可解释性评估,并结合测试时增强(Test Time Augmentation, TTA)提升模型解释力,结果表明无单一模型能同时满足所有场景需求——ConvNext适用于分布内样本,而ViT及其变体更适合解释分布外样本。
链接: https://arxiv.org/abs/2602.09324
作者: Ahmad Chaddad,Yihang Wu,Xianrui Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE SMC 2025
Abstract:Deep models based on vision transformer (ViT) and convolutional neural network (CNN) have demonstrated remarkable performance on natural datasets. However, these models may not perform as well in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transformer-based models), 2) calibration analysis to examine if these models are well calibrated for bladder cancer classification, and 3) we use GradCAM++ to evaluate the interpretability of these models for clinical diagnosis. We simulate ~300 experiments on a public multicenter bladder cancer dataset, and the experimental results demonstrate that the ConvNext series indicates limited generalization ability to classify bladder cancer images (e.g., ~60% accuracy). In addition, ViTs show better calibration effects compared to ConvNext and Swin Transformer series. We also employ test-time augmentation to improve the models' interpretability. Finally, no model provides a one-size-fits-all solution for a feasible interpretable model. ConvNext series are suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
[CV-90] GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification
【速读】:该论文旨在解决乳腺癌组织病理图像分类中因标注数据稀缺导致的深度学习模型性能下降以及缺乏可解释性的问题(即“黑箱”特性),从而阻碍其在临床实践中的应用。解决方案的关键在于提出GAFRNet——一种结合图注意力机制与可微分模糊规则网络的新型架构:首先通过构建基于相似性的图表示来建模样本间关系,并利用多头图注意力机制捕获异质组织区域间的复杂关联特征;其次引入一个可微分的模糊规则模块,将节点度、聚类系数和标签一致性等拓扑描述符转化为人类可理解的“IF-THEN”诊断逻辑,实现预测过程的透明化与可解释性,无需依赖事后归因方法即可提供清晰的推理依据。
链接: https://arxiv.org/abs/2602.09318
作者: Lin-Guo Gao,Suxing Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic decision-making. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a “black-box” nature, hindering their clinical integration. To mitigate these limitations, we propose GAFR-Net, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFR-Net constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue regions. Furthermore, a differentiable fuzzy-rule module encodes intrinsic topological descriptors, including node degree, clustering coefficient, and label consistency, into explicit, human-understandable diagnostic logic. This design establishes transparent “IF-THEN” mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
[CV-91] A Deep Multi-Modal Method for Patient Wound Healing Assessment
【速读】:该论文旨在解决因伤口恶化导致患者住院所引发的高护理成本问题,尤其是针对那些初始并不需要紧急住院但因治疗延迟、患者依从性差或合并症等因素最终发展为需住院的情况。其解决方案的关键在于提出了一种基于深度多模态学习的方法,通过联合分析患者的伤口变量(wound variables)和伤口图像,实现对住院风险的精准预测;该方法还引入了迁移学习(transfer learning)策略,能够从伤口图像中同时预测伤口特征及其愈合轨迹,从而提升早期识别复杂伤口的能力,并减少临床医生诊断所需时间。
链接: https://arxiv.org/abs/2602.09315
作者: Subba Reddy Oota,Vijay Rowtula,Shahid Mohammed,Jeffrey Galitz,Minghsun Liu,Manish Gupta
机构: Woundtech Innovative Healthcare Solutions; Microsoft AI Research, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures
Abstract:Hospitalization of patients is one of the major factors for high wound care costs. Most patients do not acquire a wound which needs immediate hospitalization. However, due to factors such as delay in treatment, patient’s non-compliance or existing co-morbid conditions, an injury can deteriorate and ultimately lead to patient hospitalization. In this paper, we propose a deep multi-modal method to predict the patient’s risk of hospitalization. Our goal is to predict the risk confidently by collectively using the wound variables and wound images of the patient. Existing works in this domain have mainly focused on healing trajectories based on distinct wound types. We developed a transfer learning-based wound assessment solution, which can predict both wound variables from wound images and their healing trajectories, which is our primary contribution. We argue that the development of a novel model can help in early detection of the complexities in the wound, which might affect the healing process and also reduce the time spent by a clinician to diagnose the wound.
[CV-92] X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging
【速读】:该论文旨在解决医学影像数据集在未经授权情况下被用于训练深度学习模型所引发的版权与伦理问题,尤其针对胸部X光片等高分辨率、动态尺度且视觉多样性有限的医学图像,传统基于自然图像设计的水印方法难以兼顾诊断质量与水印鲁棒性的问题。解决方案的关键在于提出一种样本特异性的干净标签水印方法X-Mark,其核心创新是利用条件U-Net生成每个样本中显著区域内的独特扰动,并通过多组件训练目标优化水印的有效性、抗动态缩放能力及诊断质量保持;同时引入拉普拉斯正则化以抑制高频扰动,实现水印尺度不变性,从而在黑盒环境下准确验证所有权。
链接: https://arxiv.org/abs/2602.09284
作者: Pranav Kulkarni,Junfeng Guo,Heng Huang
机构: University of Maryland Institute for Health Computing (马里兰大学健康计算研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images, as static watermark patterns generated in fixed-scale images scale poorly to dynamic and high-resolution scans with limited visual diversity and subtle anatomical structures, all while having to preserve diagnostic quality. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy and robustness against dynamic scaling processes while preserving diagnostic quality and visual distinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, achieving a WSR of 100% and reducing the probability of false positives in the Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
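The abstract mentions a Laplacian regularizer that penalizes high-frequency perturbations, but gives no formula. Below is a minimal PyTorch sketch of one standard way to implement such a term (a discrete 3x3 Laplacian applied to the additive watermark); the kernel choice, function name, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): penalize high-frequency content
# of a watermark perturbation by filtering it with a discrete Laplacian kernel
# and minimizing the filtered energy.
import torch
import torch.nn.functional as F

def laplacian_regularizer(perturbation: torch.Tensor) -> torch.Tensor:
    """perturbation: (B, C, H, W) additive watermark; returns a scalar penalty."""
    lap = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]], dtype=perturbation.dtype,
                       device=perturbation.device).view(1, 1, 3, 3)
    c = perturbation.shape[1]
    lap = lap.repeat(c, 1, 1, 1)                      # one kernel per channel
    high_freq = F.conv2d(perturbation, lap, padding=1, groups=c)
    return high_freq.pow(2).mean()

if __name__ == "__main__":
    delta = torch.randn(2, 1, 64, 64, requires_grad=True)
    loss = laplacian_regularizer(delta)               # add to the watermark objective
    loss.backward()
    print(float(loss))
```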
[CV-93] Rethinking Global Text Conditioning in Diffusion Transformers ICLR26
【速读】:该论文旨在解决扩散模型(diffusion models)中文本条件控制机制的有效性问题,具体探讨基于调制(modulation)的文本条件是否必要以及能否带来性能提升。传统方法通常依赖注意力机制(attention)与池化后的文本嵌入(pooled text embedding)联合实现文本引导,但本文通过系统分析发现,在常规使用下,池化嵌入对整体性能贡献有限,表明仅靠注意力机制即可有效传递提示信息。其关键解决方案在于重新定义池化嵌入的角色——将其作为无训练(training-free)、轻量级的引导信号,用于可控地调整生成结果朝向更理想属性偏移,从而在不增加显著计算开销的前提下,提升多种任务(如文生图、文生视频和图像编辑)的生成质量与可控性。
链接: https://arxiv.org/abs/2602.09268
作者: Nikita Starodubcev,Daniil Pakhomov,Zongze Wu,Ilya Drobyshevskiy,Yuchen Liu,Zhonghao Wang,Yuqian Zhou,Zhe Lin,Dmitry Baranchuk
机构: Yandex Research (Yandex 研究院); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR26
Abstract:Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
[CV-94] VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中不确定性量化(Uncertainty Quantification, UQ)的细粒度定位问题,即区分不确定性来源于图像、文本还是跨模态对齐错误。其解决方案的关键在于提出VLM-UQBench基准,该基准包含600个来自VizWiz数据集的真实世界样本,并划分为干净数据、图像不确定、文本不确定及跨模态不确定子集,同时设计了一套可扩展的扰动流水线(包含8种视觉、5种文本和3种跨模态扰动),并引入两个新指标来衡量UQ分数对扰动的敏感性及其与幻觉(hallucination)的相关性,从而系统评估多种UQ方法在不同VLM和数据集上的表现。
链接: https://arxiv.org/abs/2602.09214
作者: Chenyu Wang,Tianle Chen,H. M. Sabbir Ahmad,Kayhan Batmanghelich,Wenchao Li
机构: Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
[CV-95] Wearable environmental sensing to forecast how legged systems will interact with upcoming terrain
【速读】:该论文旨在解决在步态过程中预测足部接触变化环境时的前后向足底压力中心(Anterior-Posterior Center of Pressure, AP COP)和触地时间(Time of Impact, TOI)的问题,以支持辅助系统中的前瞻控制。其关键解决方案是基于RGB-D视觉数据,采用轻量级卷积神经网络与循环神经网络(CNN-RNN)架构,在足触地前250ms的时间窗口内连续预测AP COP和TOI,实验证明该模型在不同预测时间窗下均表现出较高的准确性(COP MAE最低达23.72mm,TOI MAE最低达17.73ms),且可在消费级笔记本电脑或边缘计算设备上以60 FPS实时运行,为实现基于视觉感知的前瞻性控制提供了可行路径。
链接: https://arxiv.org/abs/2602.09209
作者: Michael D. Murray,James Tung,Richard W. Nuckols
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages excluding references and comments, 5 figures, 3 tables
Abstract:Computer-vision (CV) has been used for environmental classification during gait and is often used to inform control in assistive systems; however, the ability to predict how the foot will contact a changing environment is underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center-of-pressure (COP) and time-of-impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while performing the task of stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean-absolute-error (MAE) at 150, 100, and 50ms FH was 29.42mm, 26.82mm, and 23.72mm respectively. The TOI MAE was 21.14, 20.08, and 17.73ms for 150, 100, and 50ms respectively. While torso velocity had no effect on the error in either task, faster toe-swing speeds prior to foot-strike were found to improve the prediction accuracy in the COP case, though this effect was insignificant in the TOI case. Further, more anterior foot-strikes were found to reduce COP prediction accuracy but did not affect the TOI prediction accuracy. We also found that our lightweight model was capable of running at 60 FPS on either a consumer-grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data was feasible using a lightweight model, which may have important implications for anticipatory control in assistive systems.
[CV-96] All-in-One Conditioning for Text-to-Image Synthesis
【速读】:该论文旨在解决文本到图像合成(text-to-image synthesis)中复杂提示词(prompt)的准确解释与可视化表示问题,特别是当提示包含多个对象、属性及空间关系时,现有模型难以保持语义一致性和结构连贯性。其解决方案的关键在于提出一种基于场景图(scene graph)结构的零样本条件机制,核心组件为Attribute-Size-Quantity-Location (ASQL) Conditioner,该模块通过轻量级语言模型生成推理时的软视觉引导,并结合扩散模型进行优化,从而在不依赖预定义布局图的前提下,实现文本-图像对齐、结构一致性与生成多样性之间的平衡。
链接: https://arxiv.org/abs/2602.09165
作者: Hirunima Jayasekara,Chuong Huynh,Yixuan Ren,Christabel Acquaye,Abhinav Shrivastava
机构: University of Maryland, College Park (马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
[CV-97] Decoding Future Risk: Deep Learning Analysis of Tubular Adenoma Whole-Slide Images
【速读】:该论文试图解决的问题是:在低级别管状腺瘤(low-grade tubular adenomas)患者中,如何精准识别那些具有较高长期发展为结直肠癌(colorectal cancer, CRC)风险的个体,从而实现个性化随访和预防性干预策略。传统组织学评估难以捕捉潜在的恶性特征,而该研究的关键解决方案在于利用卷积神经网络(convolutional neural networks, CNNs)对全切片图像(whole-slide images, WSIs)进行深度分析,以挖掘肉眼难以察觉的细微组织学特征,从而预测患者的远期癌症进展风险。
链接: https://arxiv.org/abs/2602.09155
作者: Ahmed Rahu,Brian Shula,Brandon Combs,Aqsa Sultana,Surendra P. Singh,Vijayan K. Asari,Derrick Forchetti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 5 figures
Abstract:Colorectal cancer (CRC) remains a significant cause of cancer-related mortality, despite the widespread implementation of prophylactic initiatives aimed at detecting and removing precancerous polyps. Although screening effectively reduces incidence, a notable portion of patients initially diagnosed with low-grade adenomatous polyps will still develop CRC later in life, even without the presence of known high-risk syndromes. Identifying which low-risk patients are at higher risk of progression is a critical unmet need for tailored surveillance and preventative therapeutic strategies. Traditional histological assessment of adenomas, while fundamental, may not fully capture subtle architectural or cytological features indicative of malignant potential. Advancements in digital pathology and machine learning provide an opportunity to analyze whole-slide images (WSIs) comprehensively and objectively. This study investigates whether machine learning algorithms, specifically convolutional neural networks (CNNs), can detect subtle histological features in WSIs of low-grade tubular adenomas that are predictive of a patient’s long-term risk of developing colorectal cancer.
[CV-98] A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video
【速读】:该论文旨在解决视频新闻中屏幕上显示的个人姓名信息难以自动提取的问题,尤其是在图形布局多样、字体规范不一及平台设计差异显著的情况下,传统人工标注已不可行。其核心解决方案是提出一个可解释、模块化的检测与提取流水线,基于一个精心构建且平衡的标注帧语料库进行训练和评估,实现了对新闻视频中图形元素的高精度定位(mAP@0.5达95.8%),并在保持操作稳健性的同时提供全程可审计的数据溯源路径。相较生成式多模态方法,该方案虽在原始准确率上略低(F1: 77.08% vs 84.18%),但避免了幻觉问题,并确保了新闻分析场景所需的透明性和可追溯性,从而为现代新闻媒体中的混合多模态信息抽取提供了方法论严谨且可解释的基准。
链接: https://arxiv.org/abs/2602.09154
作者: Andrea Filiberto Lucas,Dylan Seychell
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 7 pages, 5 figures. Accepted for publication at the 2026 IEEE Conference on Artificial Intelligence (CAI)
Abstract:The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
[CV-99] SemanticMoments: Training-Free Motion Similarity via Third Moment Features
【速读】:该论文旨在解决基于语义运动(semantic motion)的视频检索问题,这一任务长期以来因现有视频表示方法过度依赖静态外观和场景上下文而难以实现,导致模型无法有效区分运动与外观特征。为揭示这一固有偏差,作者构建了SimMotion基准测试集,结合受控合成数据与人工标注的真实世界数据,验证了现有模型在该任务上的表现不佳。解决方案的关键在于提出SemanticMoments方法——一种无需训练的简单框架,通过在预训练语义模型提取的特征空间中计算时间统计量(特别是高阶矩),从而实现对运动动态的语义感知建模。实验表明,该方法在多个基准上显著优于RGB、光流及文本监督等传统方法,证明了语义特征空间中的时间统计量可作为可扩展且具感知基础的运动中心视频理解新范式。
链接: https://arxiv.org/abs/2602.09146
作者: Saar Huberman,Kfir Goldberg,Or Patashnik,Sagie Benaim,Ron Mokady
机构: BRIA AI; Tel Aviv University (特拉维夫大学); Hebrew University of Jerusalem (耶路撒冷希伯来大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
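Since the method is described as training-free temporal statistics (up to third moments) over per-frame features from a pretrained semantic model, a rough sketch is easy to write down. Assumptions: features are already extracted as a (T, D) array, central moments are concatenated per dimension, and similarity is plain cosine; the actual backbone and normalization used by SemanticMoments may differ.

```python
# Illustrative sketch: temporal moment pooling over pre-extracted per-frame
# features (T, D). The feature backbone and normalization are assumptions.
import numpy as np

def temporal_moments(frame_feats: np.ndarray, max_order: int = 3) -> np.ndarray:
    """frame_feats: (T, D) per-frame embeddings from a pretrained semantic model.
    Returns a (max_order * D,) descriptor: mean, 2nd and 3rd central moments."""
    mu = frame_feats.mean(axis=0)
    centered = frame_feats - mu
    stats = [mu]
    for k in range(2, max_order + 1):
        stats.append((centered ** k).mean(axis=0))
    return np.concatenate(stats)

def motion_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Cosine similarity between the moment descriptors of two videos."""
    a, b = temporal_moments(feats_a), temporal_moments(feats_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

if __name__ == "__main__":
    video_a = np.random.randn(32, 768)   # stand-in for CLIP/DINO frame features
    video_b = np.random.randn(48, 768)
    print(motion_similarity(video_a, video_b))
```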
[CV-100] Distributed Hybrid Parallelism for Large Language Models : Comparative Study and System Design Guide
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在训练与推理过程中,分布式计算与内存管理策略缺乏系统性分析与优化方法的问题。其解决方案的关键在于通过数学建模对集体操作(collective operations)和分布式并行策略进行理论深化,并提出混合并行设计(hybrid parallelization designs),强调在模型部署不同阶段(包括训练和推理)中实现通信与计算的重叠以提升效率;同时引入基于成本模型的自动化搜索方法来寻找最优混合并行策略,并结合主流架构案例提供实证洞察,从而为研究人员和实践者选择并行策略提供指导。
链接: https://arxiv.org/abs/2602.09109
作者: Hossam Amer,Rezaul Karim,Ali Pourranjbar,Weiwei Zhang,Walid Ahmed,Boxing Chen
机构: Huawei Canada (华为加拿大)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs and how such insights can inform principled methodology for designing optimal distributed systems remain limited. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights that guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.
[CV-101] Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
【速读】:该论文针对专业图像编辑工作流中三个持续存在的挑战展开研究:(i)编辑者常过度修改内容,超出用户意图;(ii)现有模型多为单轮交互,而多轮编辑易导致目标对象失真;(iii)当前评估在约1K分辨率下进行,与实际使用超高清图像(如4K)的流程不一致。为此,作者提出Agent Banana框架,其核心在于两个机制:(1)Context Folding(上下文折叠),将长交互历史压缩为结构化记忆以实现稳定长周期控制;(2)Image Layer Decomposition(图像层分解),通过局部化分层编辑,在保持非目标区域不变的同时支持原生分辨率输出。该方案显著提升了多轮编辑的一致性和背景保真度,并构建了HDD-Bench基准用于高保真、对话式评估,推动生成式AI(Generative AI)在专业图像编辑领域的可靠落地。
链接: https://arxiv.org/abs/2602.09084
作者: Ruijie Ye,Jiayi Zhang,Zhuoxin Liu,Zihao Zhu,Siyuan Yang,Li Li,Tianfu Fu,Franck Dernoncourt,Yue Zhao,Jiacheng Zhu,Ryan Rossi,Wenhao Chai,Zhengzhong Tu
机构: TAMU; Brown University (布朗大学); UW-Madison; UCSD; USC; xAI; Adobe Research; Meta AI; Princeton University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this http URL
Abstract:We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user’s intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
[CV-102] LiDAR-based 3D Change Detection at City Scale
【速读】:该论文旨在解决城市尺度下基于激光雷达(LiDAR)的变更检测问题,传统方法如数字表面模型(DSM)和图像差分易受垂直偏差与视角不匹配影响,而原始点云或体素模型则存在内存占用高、对齐假设严格及细长结构退化等问题。其解决方案的关键在于提出一种不确定性感知、以对象为中心的方法:首先利用多分辨率正态分布变换(NDT)与点到平面迭代最近点(ICP)实现跨时相数据配准;随后通过注册协方差与表面粗糙度计算逐点级检测置信度以校准变更决策;进一步结合几何关联、语义与实例分割,并采用类约束二分图分配优化处理分裂-合并情况;最后通过分块处理控制内存并保留狭窄地物变化,实例级决策融合重叠度、位移和体积差异,在局部检测门控下实现精准变更识别。
链接: https://arxiv.org/abs/2510.21112
作者: Hezam Albagami,Haitian Wang,Xinyu Wang,Muhammad Ibrahim,Zainy M. Malakan,Abdullah M. Alqamdi,Mohammed H. Alghamdi,Ajmal Mian
机构: University of Jeddah (吉达大学); University of Western Australia (西澳大利亚大学); Umm Al-Qura University (乌姆库拉大学); College of Computer Science and Engineering, University of Jeddah (吉达大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:High-definition 3D city maps enable city planning and change detection, which is essential for municipal compliance, map maintenance, and asset monitoring, including both built structures and urban greenery. Conventional Digital Surface Model (DSM) and image differencing are sensitive to vertical bias and viewpoint mismatch, while original point cloud or voxel models require large memory, assume perfect alignment, and degrade thin structures. We propose an uncertainty-aware, object-centric method for city-scale LiDAR-based change detection. Our method aligns data from different time periods using multi-resolution Normal Distributions Transform (NDT) and a point-to-plane Iterative Closest Point (ICP) method, normalizes elevation, and computes a per-point level of detection from registration covariance and surface roughness to calibrate change decisions. Geometry-based associations are refined by semantic and instance segmentation and optimized using class-constrained bipartite assignment with augmented dummies to handle split-merge cases. Tiled processing bounds memory and preserves narrow ground changes, while instance-level decisions integrate overlap, displacement, and volumetric differences under local detection gating. We perform experiments on a Subiaco (Western Australia) dataset captured in 2023 and again in 2025. Our method achieves 95.3% accuracy, 90.8% mF1, and 82.9% mIoU, improving over the strongest baseline, Triplet KPConv, by 0.3, 0.6, and 1.1 points, respectively. The datasets are available on IEEE DataPort (2023: this https URL and 2025: this https URL). The source code is available at this https URL.
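The abstract computes a per-point level of detection from registration covariance and surface roughness to gate change decisions, but the exact formula is not given. The sketch below uses an M3C2-style level of detection as a plausible stand-in; the roughness values, neighbor counts, and scalar registration error are assumed inputs, not the paper's definition.

```python
# Illustrative sketch of an M3C2-style per-point level of detection (LoD):
# combine local surface roughness from both epochs with a registration error
# term, and count a point as "changed" only if its distance exceeds the LoD.
import numpy as np

def level_of_detection(rough_t1: np.ndarray, rough_t2: np.ndarray,
                       n1: np.ndarray, n2: np.ndarray,
                       reg_error: float, z: float = 1.96) -> np.ndarray:
    """rough_t1/rough_t2: per-point roughness (std of residuals to a local plane);
    n1/n2: neighbor counts used for each estimate; reg_error: scalar registration
    uncertainty (e.g. derived from the ICP covariance)."""
    return z * np.sqrt(rough_t1**2 / n1 + rough_t2**2 / n2 + reg_error**2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dist = rng.normal(0.0, 0.05, 1000)            # signed point-to-surface distances
    lod = level_of_detection(rough_t1=np.full(1000, 0.02),
                             rough_t2=np.full(1000, 0.03),
                             n1=np.full(1000, 25), n2=np.full(1000, 25),
                             reg_error=0.01)
    changed = np.abs(dist) > lod                  # only distances above the LoD count
    print(changed.mean())
```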
[CV-103] Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在手术视频分类任务中的泛化能力与本地适应性问题,特别是如何在不共享患者数据的前提下实现跨中心模型的稳健性能提升。其关键解决方案包括:采用多中心Appendix300视频数据集进行基准测试,评估不同FL策略(如FedAvg、FedMedian、FedSAM)在未见临床中心的泛化表现及本地微调后的适应能力;引入基于ViViT的视觉模型架构以增强时空建模能力,并结合线性探测(linear probing)、三元组损失(triplet loss)等方法优化特征表示;同时强调预处理和损失函数设计对缓解类别不平衡及提升训练稳定性的重要性。研究揭示了局部个性化与全局鲁棒性之间的权衡关系,为未来开发更具适应性、抗偏倚且稳定的临床手术AI联邦学习方法提供了重要参考。
链接: https://arxiv.org/abs/2510.04772
作者: Max Kirchner,Hanna Hoffmann,Alexander C. Jenke,Oliver L. Saldanha,Kevin Pfeiffer,Weam Kanjo,Julia Alekseenko,Claas de Boer,Santhi Raj Kolamuri,Lorenzo Mazza,Nicolas Padoy,Sophia Bano,Annika Reinke,Lena Maier-Hein,Danail Stoyanov,Jakob N. Kather,Fiona R. Kolbinger,Sebastian Bodenstedt,Stefanie Speidel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: A challenge report pre-print (31 pages), including 7 tables and 8 figures
Abstract:Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
[CV-104] SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy
【速读】:该论文针对高速双向扫描光学分辨率光声显微镜(OR-PAM)在功能脑成像中因扫描方向依赖的域偏移(domain shift)和几何失真导致的时空错位问题展开研究。传统配准方法依赖亮度恒定假设,但在双向扫描下该假设失效,造成对齐不可靠。解决方案的关键在于提出一种统一的场景-外观分离框架(Scene-Appearance Separation, SAS-Net),通过将场景内容(domain-invariant scene content)与域特定外观特征(domain-specific appearance characteristics)分离,实现跨域重建并保持几何一致性;其中引入场景一致性损失(scene consistency loss)在潜在空间中建立几何对应关系,将域偏移校正与空间配准整合于单一框架内,从而显著提升配准精度(NCC达0.961,SSIM达0.894)并支持实时处理(11.2 ms/帧,86 fps)。
链接: https://arxiv.org/abs/2602.09050
作者: Jiahao Qin
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 6 figures, 3 tables
Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional scanning enables rapid functional brain imaging but introduces severe spatiotemporal misalignment from coupled scan-direction-dependent domain shift and geometric distortion. Conventional registration methods rely on brightness constancy, an assumption violated under bidirectional scanning, leading to unreliable alignment. A unified scene-appearance separation framework is proposed to jointly address domain shift and spatial misalignment. The proposed architecture separates domain-invariant scene content from domain-specific appearance characteristics, enabling cross-domain reconstruction with geometric preservation. A scene consistency loss promotes geometric correspondence in the latent space, linking domain shift correction with spatial registration within a single framework. For in vivo mouse brain vasculature imaging, the proposed method achieves normalized cross-correlation (NCC) of 0.961 and structural similarity index (SSIM) of 0.894, substantially outperforming conventional methods. Ablation studies demonstrate that domain alignment loss is critical, with its removal causing 82% NCC reduction (0.961 to 0.175), while scene consistency and cycle consistency losses provide complementary regularization for optimal performance. The method achieves 11.2 ms inference time per frame (86 fps), substantially exceeding typical OR-PAM acquisition rates and enabling real-time processing. These results suggest that the proposed framework enables robust high-speed bidirectional OR-PAM for reliable quantitative and longitudinal functional imaging. The code will be publicly available at this https URL.
[CV-105] Mamba-FCS: Joint Spatio-Frequency Feature Fusion, Change-Guided Attention and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing
【速读】:该论文旨在解决遥感影像语义变化检测(Semantic Change Detection, SCD)中模型难以平衡全局空间上下文建模、计算效率与类别不平衡下变化敏感性的问题。现有方法中,卷积神经网络(Convolutional Neural Networks, CNNs)擅长局部特征提取但缺乏全局感知能力,而Transformer虽能建模长程依赖却存在高计算开销;Mamba架构基于状态空间模型(State-Space Model, SSM)提供线性复杂度下的高效长距离建模潜力。本文提出Mamba-FCS框架,其关键创新在于:1)引入视觉状态空间模型(Visual State Space Model)作为骨干网络以兼顾效率与全局建模能力;2)设计联合时空频域融合模块(Joint Spatio-Frequency Fusion block),利用对数幅值频域特征提升边缘清晰度并抑制光照伪影;3)构建变化引导注意力机制(Change-Guided Attention, CGA),显式关联变化检测(Binary Change Detection, BCD)与SCD任务以增强语义一致性;4)提出分离的Kappa损失(Separated Kappa, SeK)用于优化类别不平衡场景下的性能表现。实验表明,该方案在SECOND和Landsat-SCD数据集上均达到当前最优指标,验证了Mamba架构结合上述技术的有效性和可扩展性。
链接: https://arxiv.org/abs/2508.08232
作者: Buddhi Wijenayake,Athulya Ratnayake,Praveen Sumanasekara,Roshan Godaliyadda,Parakrama Ekanayake,Vijitha Herath,Nichula Wasalathilaka
机构: University of Peradeniya(佩拉德尼亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 19 pages, 13 Figures
Abstract:Semantic Change Detection (SCD) from remote sensing imagery requires models balancing extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. Convolutional Neural Networks excel at local feature extraction but lack global context, while Transformers provide global modeling at high computational cost. Recent Mamba architectures based on state-space models offer compelling solutions through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, an SCD framework built upon a Visual State Space Model backbone that incorporates a Joint Spatio-Frequency Fusion block using log-amplitude frequency-domain features to enhance edge clarity and suppress illumination artifacts, a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined BCD and SCD tasks, and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on the SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics: 88.62% Overall Accuracy, 65.78% F_scd, and 25.50% SeK on SECOND, and 96.25% Overall Accuracy, 89.27% F_scd, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm distinct contributions of each novel component, with qualitative assessments highlighting significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by proposed techniques, setting a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be publicly available upon publication.
人工智能
[AI-0] Biases in the Blind Spot: Detecting What LLMs Fail to Mention ICML2026
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在但难以被察觉的“未言明偏差”(unverbalized biases)问题,即模型在推理过程中隐含的偏见无法通过其显式链式思维(chain-of-thought, CoT)路径被识别,从而导致传统基于CoT的监控方法失效。现有偏见评估通常依赖预定义类别和人工构建的数据集,缺乏自动化与任务特异性。论文提出一种完全自动化的黑盒检测管道,其核心在于:首先利用LLM自评工具(LLM autoraters)从任务数据集中生成候选偏见概念;随后通过逐步扩增输入样本,生成正负例变体并应用多重假设检验与早期停止策略进行统计验证;若某概念在统计上显著影响模型输出但未出现在CoT解释中,则判定为未言明偏差。该方法实现了对特定任务下未知偏见的自动发现与已知偏见的有效验证,为偏见检测提供了可扩展、实用的新路径。
链接: https://arxiv.org/abs/2602.10117
作者: Iván Arcuschin,David Chanin,Adrià Garriga-Alonso,Oana-Maria Camburu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, Under review at ICML 2026
Abstract:Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model’s CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
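The statistical core described here (paired comparisons of positive/negative concept variations, multiple-testing control, and a check against the stated chain-of-thought) can be sketched compactly. In the toy code below, decisions are numeric scores, the correction is Holm-Bonferroni, and "verbalized" is approximated by a substring match; all of these are simplifying assumptions rather than the paper's actual pipeline.

```python
# Toy sketch: flag a concept as an "unverbalized bias" when paired variations
# of that concept shift model decisions significantly (after Holm-Bonferroni
# correction across concepts) while the concept never appears in the CoTs.
import numpy as np
from scipy.stats import ttest_rel

def holm_bonferroni(pvals, alpha=0.05):
    order = np.argsort(pvals)
    reject = np.zeros(len(pvals), dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (len(pvals) - rank):
            reject[idx] = True
        else:
            break
    return reject

def flag_unverbalized_biases(results, alpha=0.05):
    """results: {concept: (scores_pos, scores_neg, cot_texts)} where scores are
    model decisions (e.g. acceptance probabilities) on paired input variations."""
    concepts = list(results)
    pvals = [ttest_rel(results[c][0], results[c][1]).pvalue for c in concepts]
    significant = holm_bonferroni(np.array(pvals), alpha)
    flagged = []
    for c, sig in zip(concepts, significant):
        verbalized = any(c.lower() in t.lower() for t in results[c][2])
        if sig and not verbalized:
            flagged.append(c)
    return flagged

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    results = {
        "writing formality": (rng.normal(0.6, 0.1, 40), rng.normal(0.5, 0.1, 40),
                              ["The candidate has relevant experience."] * 40),
        "years of experience": (rng.normal(0.7, 0.1, 40), rng.normal(0.4, 0.1, 40),
                                ["More years of experience justify approval."] * 40),
    }
    print(flag_unverbalized_biases(results))
```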
[AI-1] Step-resolved data attribution for looped transformers
【速读】:该论文旨在解决现有训练数据影响估计方法(如TracIn)在循环变压器(looped transformers)中无法定位具体计算步骤的问题,即这些方法仅提供一个聚合了所有循环迭代的标量得分,从而掩盖了训练样本在哪些循环步骤中对模型内部计算产生关键影响。解决方案的关键在于提出步骤分解影响(Step-Decomposed Influence, SDI),通过展开循环计算图并为每个循环迭代分配影响值,从而获得长度为τ的影响轨迹;为实现大规模变压器模型上的高效计算,进一步引入TensorSketch实现方式,避免显式生成每样本梯度,显著提升可扩展性与实用性。
链接: https://arxiv.org/abs/2602.10097
作者: Georgios Kaissis,David Mildenberger,Juan Felipe Gomez,Martin J. Menten,Eleni Triantafillou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for \tau recurrent iterations to enable latent reasoning. Existing training-data influence estimators such as TracIn yield a single scalar score that aggregates over all loop iterations, obscuring when during the recurrent computation a training example matters. We introduce \textitStep-Decomposed Influence (SDI), which decomposes TracIn into a length- \tau influence trajectory by unrolling the recurrent computation graph and attributing influence to specific loop iterations. To make SDI practical at transformer scale, we propose a TensorSketch implementation that never materialises per-example gradients. Experiments on looped GPT-style models and algorithmic reasoning tasks show that SDI scales excellently, matches full-gradient baselines with low error and supports a broad range of data attribution and interpretability tasks with per-step insights into the latent reasoning process.
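A toy version of the idea can be written by aliasing the shared weight once per loop iteration so autograd reports a separate gradient contribution for each step, then taking TracIn-style dot products with the test gradient. How SDI actually performs the attribution (and its TensorSketch approximation) is not reproduced here; this is only a conceptual sketch on a tiny looped block.

```python
# Conceptual sketch of a step-resolved TracIn-style score for a looped model:
# the shared weight W is applied for tau iterations; one alias per iteration
# lets autograd return a per-step gradient g_t, and the per-step influence is
# lr * <g_t(train example), full test gradient>.
import torch

def looped_forward(x, W, tau):
    aliases, h = [], x
    for _ in range(tau):
        W_t = W * 1.0                      # same value, separate graph node per step
        aliases.append(W_t)
        h = torch.tanh(h @ W_t)
    return h, aliases

def step_decomposed_influence(W, x_train, y_train, x_test, y_test, tau, lr=0.1):
    loss_fn = torch.nn.MSELoss()
    out_tr, alias_tr = looped_forward(x_train, W, tau)
    grads_tr = torch.autograd.grad(loss_fn(out_tr, y_train), alias_tr)
    out_te, _ = looped_forward(x_test, W, tau)
    (grad_te,) = torch.autograd.grad(loss_fn(out_te, y_test), (W,))
    return [lr * float((g * grad_te).sum()) for g in grads_tr]   # length-tau trajectory

if __name__ == "__main__":
    torch.manual_seed(0)
    d, tau = 8, 4
    W = (0.3 * torch.randn(d, d)).requires_grad_()
    x_tr, y_tr = torch.randn(5, d), torch.randn(5, d)
    x_te, y_te = torch.randn(1, d), torch.randn(1, d)
    print(step_decomposed_influence(W, x_tr, y_tr, x_te, y_te, tau))
```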
[AI-2] CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
【速读】:该论文旨在解决开放式技能发现与学习中的核心挑战:如何在不依赖人工设计奖励函数的前提下,持续自动扩展和优化智能体的技能库,从而支持复杂、长时程目标的达成。传统强化学习方法受限于预设奖励函数,难以适应未知技能空间;现有自动化奖励设计方法也仅限于微调已有任务的奖励,无法实现开放式演进。解决方案的关键在于提出CODE-SHARP框架,其利用基础模型(Foundation Model, FM)构建一个分层奖励程序(Hierarchical Reward Programs, SHARP)结构,将可执行的奖励函数以有向图形式组织为技能档案,并通过不断生成新的奖励函数来扩展该图,使目标条件代理仅凭这些自动生成的奖励即可学习解决越来越复杂的任务。实验表明,该框架在Craftax环境中显著提升了智能体的长时程规划能力,且由高层FM规划器组合使用所发现的技能后,性能优于预训练代理和任务专用专家策略超过134%。
链接: https://arxiv.org/abs/2602.10085
作者: Richard Bornemann,Pierluigi Vito Amadori,Antoine Cully
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at this https URL.
[AI-3] Chain of Mindset: Reasoning with Adaptive Cognitive Modes
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在推理过程中普遍存在的“单一思维模式”问题,即模型在处理复杂任务时始终采用固定的认知策略,而忽视了不同推理阶段需要不同类型思维模式的动态适配。这种局限性阻碍了模型向更高层次智能发展。解决方案的关键在于提出Chain of Mindset (CoM) 框架,其核心创新是将推理过程分解为四种功能异质的思维模式:空间思维(Spatial)、收敛思维(Convergent)、发散思维(Divergent)和算法思维(Algorithmic),并通过一个元代理(Meta-Agent)根据推理状态动态选择最优思维模式,同时引入双向上下文门(Bidirectional Context Gate)控制模块间信息流动,从而实现步骤级自适应的思维模式协同调度。该方法无需额外训练即可显著提升多模态推理性能,在多个基准测试中达到SOTA效果。
链接: https://arxiv.org/abs/2602.10063
作者: Tianyi Jiang,Arctanx An,Hengyi Feng,Naixin Zhai,Haodong Li,Xiaomin Yu,Jiahui Liu,Hanwen Du,Shuo Zhang,Zhi Yang,Jie Huang,Yuhua Li,Yongxin Ni,Huacan Wang,Ronghao Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, while balancing reasoning efficiency. Our code is publicly available at this https URL.
[AI-4] Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization ICASSP
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理过程中产生的冗余输出问题,即生成过于冗长的推理过程,导致计算成本和延迟增加但性能提升有限。解决方案的关键在于提出一种细粒度分组策略优化(Fine-grained Group Policy Optimization, FGO)算法,该算法通过将群体响应细分为更小单元,并基于长度与熵(entropy)分配权重,实现有效的CoT压缩;同时,FGO作为分组相对策略优化(Group Relative Policy Optimization, GRPO)的增强版本,有效缓解了GRPO存在的数据利用效率低和熵崩溃(entropy collapse)两大缺陷。
链接: https://arxiv.org/abs/2602.10048
作者: Xinchen Han,Hossam Afifi,Michel Marot,Xilu Wang,Lu Yin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026
Abstract:Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.
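The abstract states that group responses are subdivided and weighted by length and entropy on top of a GRPO-style advantage, without giving the exact form. The snippet below shows one illustrative weighting applied to standard group-relative advantages; the coefficients and functional form (len_coef, ent_coef) are invented for illustration and are not FGO's actual rule.

```python
# Hedged sketch: start from standard GRPO group-relative advantages
# (reward minus group mean, divided by group std), then scale each response's
# advantage by a weight derived from its length and mean token entropy.
import numpy as np

def grpo_advantages(rewards):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def fgo_style_advantages(rewards, lengths, entropies, len_coef=0.5, ent_coef=0.5):
    """Down-weight overlong responses and up-weight high-entropy (exploratory)
    ones; the coefficients and functional form are illustrative only."""
    adv = grpo_advantages(rewards)
    lengths = np.asarray(lengths, dtype=float)
    entropies = np.asarray(entropies, dtype=float)
    len_w = np.exp(-len_coef * (lengths - lengths.mean()) / (lengths.std() + 1e-6))
    ent_w = 1.0 + ent_coef * (entropies - entropies.mean()) / (entropies.std() + 1e-6)
    return adv * len_w * ent_w

if __name__ == "__main__":
    rewards   = [1.0, 1.0, 0.0, 0.0]       # e.g. correctness of 4 sampled CoTs
    lengths   = [900, 2500, 1200, 3000]    # tokens per response
    entropies = [0.8, 0.4, 0.9, 0.5]       # mean per-token entropy
    print(fgo_style_advantages(rewards, lengths, entropies).round(3))
```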
[AI-5] Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中在稀疏奖励环境下的高效探索问题。传统方法如基于上置信界(Upper Confidence Bound, UCB)的探索策略通常依赖不确定性估计或约束优化,难以 scalable 且计算复杂度高。解决方案的关键在于提出乐观世界模型(Optimistic World Models, OWMs),其核心创新是将乐观性直接嵌入模型学习过程,通过引入一个乐观动力学损失(optimistic dynamics loss)来引导模拟轨迹向高奖励结果偏移,从而实现无需显式不确定性估计或约束优化的梯度驱动探索机制。该方法可无缝集成至现有世界模型框架,仅需最小改动即可显著提升样本效率和累计回报。
链接: https://arxiv.org/abs/2602.10044
作者: Akshay Mete,Shahid Aamir Sheikh,Tzu-Hsiang Lin,Dileep Kalathil,P. R. Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.
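The optimistic dynamics loss is described as a reward-biased addition to the usual world-model objective, echoing reward-biased maximum likelihood estimation. A minimal sketch, assuming a toy deterministic world model and a simple "minus eta times predicted reward" bias term, is shown below; the actual DreamerV3/STORM integration is considerably richer, and the exact weighting is an assumption.

```python
# Minimal sketch of an "optimistic dynamics loss": standard dynamics/reward
# prediction losses plus a term that rewards the model for predicting higher
# reward on imagined transitions.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.dyn = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, obs_dim))
        self.rew = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.dyn(x), self.rew(x).squeeze(-1)

def optimistic_loss(model, obs, act, next_obs, reward, eta=0.1):
    pred_next, pred_rew = model(obs, act)
    dyn_loss = ((pred_next - next_obs) ** 2).mean()
    rew_loss = ((pred_rew - reward) ** 2).mean()
    optimism = -pred_rew.mean()              # bias imagined rewards upward
    return dyn_loss + rew_loss + eta * optimism

if __name__ == "__main__":
    model = TinyWorldModel(obs_dim=4, act_dim=2)
    obs, act = torch.randn(32, 4), torch.randn(32, 2)
    next_obs, reward = torch.randn(32, 4), torch.randn(32)
    loss = optimistic_loss(model, obs, act, next_obs, reward)
    loss.backward()
    print(float(loss))
```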
[AI-6] ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning
【速读】:该论文旨在解决强化学习中策略优化因静态优势函数估计而导致的信用分配效率低下问题,这一问题表现为训练样本随时间演变的动态效用被忽视,进而引发策略更新次优、收敛速度慢及学习不稳定等现象。其解决方案的关键在于提出ADORA(Advantage Dynamics via Online Rollout Adaptation)框架,通过在线模型回放过程中对训练数据进行动态分类,区分暂时有利与不利的样本,并据此自适应调整优势函数的权重,从而实现更高效的策略更新。该方法无需修改现有策略优化算法架构,即可显著提升长链条推理任务中的性能表现。
链接: https://arxiv.org/abs/2602.10019
作者: Qingnan Ren,Shiting Huang,Zhen Fang,Zehui Chen,Lin Chen,Lijun Li,Feng Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods typically employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which in turn manifest as slower convergence rates and increased learning instability, as models fail to adapt to evolving sample utilities effectively. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization. ADORA dynamically adjusts the advantage function’s weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples, based on their evolving utility during online model rollouts. This tailored data differentiation strategy allows ADORA to be seamlessly integrated into existing policy optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations across diverse model families and varying data scales demonstrate that ADORA is a robust and efficient framework. It significantly enhances long reasoning in both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.
[AI-7] RoboSubtaskNet: Temporal Sub-task Segmentation for Human-to-Robot Skill Transfer in Real-World Environments
【速读】:该论文旨在解决长时未剪辑视频中细粒度子任务段的时空定位与分类问题,以支持安全的人机协作场景下的机器人执行。传统活动识别方法无法提供可直接由机器人执行的子任务标签,而本研究提出RoboSubtaskNet框架,其核心创新在于:融合注意力增强的I3D特征(RGB与光流)与改进的多尺度时间卷积网络(MS-TCN),并采用斐波那契膨胀调度策略以更好捕捉短时过渡行为(如“接近-抓取-放置”);同时设计复合目标函数(包含交叉熵损失、截断均方误差和过渡感知项),有效抑制过分割并促进合法子任务序列演化。此外,作者构建了RoboSubtask数据集,实现视觉标注到机械臂操作原语的确定性映射,从而打通从视频理解到实际机器人执行的闭环路径。
链接: https://arxiv.org/abs/2602.10015
作者: Dharmendra Sharma,Archit Sharma,John Reberio,Vaibhav Kesharwani,Peeyush Thakur,Narendra Kumar Dhar,Laxmidhar Behera
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1 @ 50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1 @ 50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1 @ 50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success approximately 91.25%). These results demonstrate a practical path from sub-task level video understanding to deployed robotic manipulation in real-world settings.
[AI-8] ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在生成长链式思维(chain-of-thought)过程中存在的计算冗余问题,即模型在已得出正确答案后仍继续进行不必要的推理步骤,导致资源浪费。解决方案的关键在于提出一种名为“基于Token感知的早期停止”(Early-Stopping for Token-Aware Reasoning, ESTAR)的方法,其核心包括三部分:(i) 基于轨迹的分类器识别可安全停止推理的时机;(ii) 通过监督微调使LRM学会生成自定义的停止信号;(iii) 利用带计算感知奖励的停止感知强化学习,在自动生成的停止点截断推理路径。实验表明,ESTAR可在保持准确率基本不变的前提下,将推理长度平均减少约3.7倍(从4,799降至1,290 tokens),并展现出良好的跨领域泛化能力。
链接: https://arxiv.org/abs/2602.10004
作者: Junda Wang,Zhichao Yang,Dongxu Zhang,Sanjit Singh Batra,Robert E. Tillman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated stop signals, and (iii) stop-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.
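At inference time the method amounts to a stopping rule over the growing reasoning trace. The sketch below shows only that control flow, with placeholder callables for the generator and the stop classifier; generate_chunk, stop_probability, and the 0.9 threshold are assumptions standing in for the trained components described in the abstract.

```python
# Sketch of the inference-time behaviour: generate reasoning in chunks and stop
# once a trajectory classifier (or a self-generated stop signal) says the answer
# is already determined.
from typing import Callable, List

def reason_with_early_stopping(generate_chunk: Callable[[List[str]], str],
                               stop_probability: Callable[[List[str]], float],
                               max_chunks: int = 32,
                               threshold: float = 0.9) -> List[str]:
    trace: List[str] = []
    for _ in range(max_chunks):
        trace.append(generate_chunk(trace))
        if stop_probability(trace) >= threshold:   # safe to stop: answer reached
            break
    return trace

if __name__ == "__main__":
    # Toy stand-ins: the "model" emits numbered steps and the classifier
    # becomes confident after step 3.
    demo_gen = lambda trace: f"step {len(trace) + 1}"
    demo_stop = lambda trace: 0.0 if len(trace) < 3 else 0.95
    print(reason_with_early_stopping(demo_gen, demo_stop))
```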
[AI-9] Empirical Stability Analysis of Kolmogorov-Arnold Networks in Hard-Constrained Recurrent Physics-Informed Discovery
【速读】:该论文旨在解决在硬约束的循环物理信息神经网络(HRPINN)中,如何高效且准确地学习未知动力学残差流形(residual manifold)的问题,特别是在振荡系统中的建模精度与稳定性问题。其解决方案的关键在于引入Kolmogorov-Arnold Networks (KANs) 替代传统的多层感知机(MLP),以利用Kolmogorov-Arnold表示定理所蕴含的加法结构优势,期望实现对未知项的更优恢复能力。然而,实验表明,尽管小规模KAN在单变量多项式残差(如Duffing系统)上表现良好,但在处理乘积项(如Van der Pol系统)时表现出严重的超参数敏感性和深层结构不稳定性,整体性能不如标准MLP,揭示了原始KAN中加性归纳偏置(additive inductive bias)在状态耦合建模中的局限性。
链接: https://arxiv.org/abs/2602.09988
作者: Enzo Nicolas Spotorno,Josafat Leal Filho,Antonio Augusto Medeiros Frohlich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 5 pages
Abstract:We investigate the integration of Kolmogorov-Arnold Networks (KANs) into hard-constrained recurrent physics-informed architectures (HRPINN) to evaluate the fidelity of learned residual manifolds in oscillatory systems. Motivated by the Kolmogorov-Arnold representation theorem and preliminary gray-box results, we hypothesized that KANs would enable efficient recovery of unknown terms compared to MLPs. Through an initial sensitivity analysis covering network configuration, parameter scale, and training paradigm, we found that while small KANs are competitive on univariate polynomial residuals (Duffing), they exhibit severe hyperparameter fragility and instability in deeper configurations, fail consistently on multiplicative terms (Van der Pol), and are generally outperformed by standard MLPs. These empirical challenges highlight limitations of the additive inductive bias in the original KAN formulation for state coupling and provide preliminary empirical evidence of inductive bias limitations for future hybrid modeling.
[AI-10] Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
【速读】:该论文旨在解决如何通过修改训练数据来系统性地诱导模型产生特定行为的问题,即从传统的“用影响函数解释模型行为”转向“利用影响函数反向设计训练数据以改变模型行为”。其解决方案的关键在于提出了一种名为Infusion的框架,该框架基于可扩展的影响函数近似方法,计算对训练文档的小幅扰动(perturbations),从而通过参数变化实现目标行为的诱导。实验表明,仅需对少量训练样本(如CIFAR-10中0.2%的样本)进行细微编辑,即可达到与插入显式行为示例相当的效果,并且该方法在不同模型架构间具有迁移性,凸显了训练数据可解释性在对抗和防御场景中的重要价值。
链接: https://arxiv.org/abs/2602.09987
作者: J Rosser,Robert Kirk,Edward Grefenstette,Jakob Foerster,Laura Ruis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 10 pages, 14 figures
Abstract:Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet \leftrightarrow CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: this https URL.
[AI-11] Supervised Metric Regularization Through Alternating Optimization for Multi-Regime Physics-Informed Neural Networks
【速读】:该论文旨在解决标准物理信息神经网络(Physics-Informed Neural Networks, PINNs)在建模具有尖锐态转变(如分岔现象)的参数化动力系统时所面临的挑战,尤其是由于从参数到解的连续映射导致的谱偏差(spectral bias)或“模式坍缩”问题,即网络会平均不同物理行为而失去对多态特性的分辨能力。其解决方案的关键在于提出一种拓扑感知的PINN(Topology-Aware PINN, TAPINN),通过监督度量正则化(Supervised Metric Regularization)结构化隐空间,使模型条件于一个能反映不同物理态之间度量分离的潜在状态;同时采用基于相位的交替优化(phase-based Alternating Optimization, AO)策略来缓解度量目标与物理残差目标之间的梯度冲突,从而实现更稳定且物理一致的解空间表征。
链接: https://arxiv.org/abs/2602.09980
作者: Enzo Nicolas Spotorno,Josafat Ribeiro Leal,Antonio Augusto Frohlich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 5 pages, 1 figure
Abstract:Standard Physics-Informed Neural Networks (PINNs) often face challenges when modeling parameterized dynamical systems with sharp regime transitions, such as bifurcations. In these scenarios, the continuous mapping from parameters to solutions can result in spectral bias or “mode collapse”, where the network averages distinct physical behaviors. We propose a Topology-Aware PINN (TAPINN) that aims to mitigate this challenge by structuring the latent space via Supervised Metric Regularization. Unlike standard parametric PINNs that map physical parameters directly to solutions, our method conditions the solver on a latent state optimized to reflect the metric-based separation between regimes, showing ~49% lower physics residual (0.082 vs. 0.160). We train this architecture using a phase-based Alternating Optimization (AO) schedule to manage gradient conflicts between the metric and physics objectives. Preliminary experiments on the Duffing Oscillator demonstrate that while standard baselines suffer from spectral bias and high-capacity Hypernetworks overfit (memorizing data while violating physics), our approach achieves stable convergence with 2.18x lower gradient variance than a multi-output Sobolev Error baseline, and 5x fewer parameters than a hypernetwork-based alternative.
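The phase-based alternating optimization can be sketched as switching the active objective every fixed number of steps instead of summing the metric and physics losses. The loop below is generic PyTorch with placeholder loss callables; the phase length, learning rate, and stand-in losses are illustrative assumptions rather than the paper's actual objectives.

```python
# Sketch of a phase-based alternating-optimization schedule: alternate between
# updating on the supervised metric loss (structuring the latent space) and the
# physics residual loss, to reduce gradient conflicts between the two objectives.
import torch

def alternating_train(model, metric_loss_fn, physics_loss_fn,
                      steps=1000, phase_len=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        use_metric = (step // phase_len) % 2 == 0     # switch objective each phase
        loss = metric_loss_fn(model) if use_metric else physics_loss_fn(model)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    model = torch.nn.Linear(2, 1)
    x, target = torch.randn(16, 2), torch.randn(16, 1)
    metric_fn = lambda m: ((m(x) - target) ** 2).mean()   # stand-in metric term
    physics_fn = lambda m: m(x).pow(2).mean()             # stand-in physics residual
    alternating_train(model, metric_fn, physics_fn, steps=200)
    print("done")
```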
[AI-12] Drug Release Modeling using Physics-Informed Neural Networks
【速读】:该论文旨在解决传统药物释放模型(如Fick、Higuchi和Peppas模型)在复杂几何结构和释放机制下预测精度不足的问题,这些模型依赖于简化假设,难以准确描述平面、一维褶皱及二维皱褶薄膜中的药物释放行为。其解决方案的关键在于提出一种基于物理信息神经网络(Physics-Informed Neural Networks, PINNs)与贝叶斯物理信息神经网络(Bayesian PINNs, BPINNs)的新方法,将Fick第二定律作为先验知识嵌入神经网络损失函数中,并结合有限实验数据实现从短期测量到长期释放行为的高精度预测。该框架通过拉丁超立方采样生成10,000个配点进行训练,在噪声环境和小样本条件下仍显著优于经典模型,最大平均误差降低达40%,且BPINNs提供了更可靠的不确定性量化,从而为药物释放系统的早期开发提供了高效、准确的表征工具。
链接: https://arxiv.org/abs/2602.09963
作者: Daanish Aleem Qureshi,Khemraj Shukla,Vikas Srivastava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:Accurate modeling of drug release is essential for designing and developing controlled-release systems. Classical models (Fick, Higuchi, Peppas) rely on simplifying assumptions that limit their accuracy in complex geometries and release mechanisms. Here, we propose a novel approach using Physics-Informed Neural Networks (PINNs) and Bayesian PINNs (BPINNs) for predicting release from planar, 1D-wrinkled, and 2D-crumpled films. This approach uniquely integrates Fick’s diffusion law with limited experimental data to enable accurate long-term predictions from short-term measurements, and is systematically benchmarked against classical drug release models. We embedded Fick’s second law into PINN as loss with 10,000 Latin-hypercube collocation points and utilized previously published experimental datasets to assess drug release performance through mean absolute error (MAE) and root mean square error (RMSE), considering noisy conditions and limited-data scenarios. Our approach reduced mean error by up to 40% relative to classical baselines across all film types. The PINN formulation achieved RMSE 0.05 utilizing only the first 6% of the release time data (reducing 94% of release time required for the experiments) for the planar film. For wrinkled and crumpled films, the PINN reached RMSE 0.05 in 33% of the release time data. BPINNs provide tighter and more reliable uncertainty quantification under noise. By combining physical laws with experimental data, the proposed framework yields highly accurate long-term release predictions from short-term measurements, offering a practical route for accelerated characterization and more efficient early-stage drug release system formulation.
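下面给出一个把 Fick 第二定律残差写入 PINN 损失的最小 PyTorch 示意(一维情形)。网络结构、扩散系数 D 与配点采样方式均为假设,仅用于说明摘要中"将物理定律作为损失项"的做法,并非论文实现。

```python
import torch

# 示意:把 Fick 第二定律 ∂c/∂t = D ∂²c/∂x² 的残差作为 PINN 损失项(一维情形)。
# 网络结构、D 的取值与配点采样方式均为假设,并非论文的具体实现。
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
D = 1e-3  # 假设的扩散系数

def fick_residual(x, t):
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    c = net(torch.cat([x, t], dim=1))
    c_t = torch.autograd.grad(c, t, torch.ones_like(c), create_graph=True)[0]
    c_x = torch.autograd.grad(c, x, torch.ones_like(c), create_graph=True)[0]
    c_xx = torch.autograd.grad(c_x, x, torch.ones_like(c_x), create_graph=True)[0]
    return c_t - D * c_xx

# 论文使用 10,000 个拉丁超立方配点,这里简化为均匀随机采样
xc, tc = torch.rand(10000, 1), torch.rand(10000, 1)
physics_loss = fick_residual(xc, tc).pow(2).mean()
# 总损失 = physics_loss + 对少量实验释放数据的拟合误差(此处省略)
```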
[AI-13] Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning
【速读】:该论文旨在解决临床决策支持系统中“正确答案”与“临床有效推理”之间的脱节问题,即模型虽能给出正确答案,但其推理过程可能缺乏医学逻辑或不符合临床规范。解决方案的关键在于提出差异推理学习(Differential Reasoning Learning, DRL)框架,通过对比参考推理路径(如医生撰写的临床推理、临床指南或更强大模型的输出)与代理生成的自由形式思维链(Chain-of-Thought, CoT),提取结构化的推理图(reasoning graphs,以有向无环图 DAG 表示),并基于临床加权的图编辑距离(GED)进行差异分析。进一步利用大语言模型作为评判者(LLM-as-a-judge)识别语义等价节点并诊断图间差异,将这些差异转化为自然语言指令并存入差异推理知识库(DR-KB)。在推理阶段,通过检索增强生成(Retrieval-Augmented Generation, RAG)召回 top-k 指令以增强提示(prompt),从而修复潜在的逻辑漏洞。该方法显著提升了最终答案准确率和推理一致性,在开放医学问答和住院复诊预测任务中优于基线,并经临床医生评审验证了其可靠性。
链接: https://arxiv.org/abs/2602.09945
作者: Jinsong Liu,Yuhang Jiang,Ramayya Krishnan,Rema Padman,Yiye Zhang,Jiang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical decision support requires not only correct answers but also clinically valid reasoning. We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. From reference reasoning rationales (e.g., physician-authored clinical rationale, clinical guidelines, or outputs from more capable models) and the agent’s free-form chain-of-thought (CoT), DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis. An LLM-as-a-judge aligns semantically equivalent nodes and diagnoses discrepancies between graphs. These graph-level discrepancy diagnostics are converted into natural-language instructions and stored in a Differential Reasoning Knowledge Base (DR-KB). At inference, we retrieve top-k instructions via Retrieval-Augmented Generation (RAG) to augment the agent prompt and patch likely logic gaps. Evaluation on open medical question answering (QA) benchmarks and a Return Visit Admissions (RVA) prediction task from internal clinical data demonstrates gains over baselines, improving both final-answer accuracy and reasoning fidelity. Ablation studies confirm gains from infusing reference reasoning rationales and the top-k retrieval strategy. Clinicians’ review of the output provides further assurance of the approach. Together, results suggest that DRL supports more reliable clinical decision-making in complex reasoning scenarios and offers a practical mechanism for deployment under limited token budgets.
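下面用 networkx 给出一个以图编辑距离(GED)比较两个推理 DAG 的最小示意。节点标签与匹配函数为虚构示例;论文中的临床加权代价和 LLM-as-a-judge 节点对齐此处未实现。

```python
import networkx as nx

# 示意:比较"参考推理 DAG"与"代理 CoT 推理 DAG"的图编辑距离。
# 节点内容为虚构示例,node_match 仅做字符串相等判断(论文用 LLM-as-a-judge 做语义对齐)。
ref = nx.DiGraph()
ref.add_nodes_from([(0, {"label": "chest pain"}), (1, {"label": "ECG"}),
                    (2, {"label": "ST elevation"}), (3, {"label": "acute MI"})])
ref.add_edges_from([(0, 1), (1, 2), (2, 3)])

agent = nx.DiGraph()
agent.add_nodes_from([(0, {"label": "chest pain"}), (1, {"label": "ECG"}),
                      (2, {"label": "acute MI"})])
agent.add_edges_from([(0, 1), (1, 2)])

dist = nx.graph_edit_distance(
    ref, agent,
    node_match=lambda a, b: a["label"] == b["label"],
)
print(dist)  # 差异越大,说明代理推理链缺失/偏离的步骤越多
```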
[AI-14] Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
【速读】:该论文旨在解决机器人在真实环境中难以根据自由格式的人类指令执行可靠操作的问题,尤其受限于计算资源和感知能力。其核心解决方案在于提出一个轻量级、全设备端的流水线系统,关键创新包括:(1) 指令到动作模块(Instruct2Act),采用紧凑的双向长短期记忆网络(BiLSTM)结合多头注意力自编码器,将自然语言指令解析为原子动作序列(如抓取、移动、放置);(2) 机器人动作网络(RAN),融合动态自适应轨迹径向网络(DATRN)与基于YOLOv8的视觉环境分析器,为每个子动作生成精确控制轨迹。该方案实现了无需云端服务的实时推理,在单摄像头条件下完成细粒度指令解析与视觉引导的确定性操作,验证了其在资源受限场景下的实用性。
链接: https://arxiv.org/abs/2602.09940
作者: Archit Sharma,Dharmendra Sharma,John Rebeiro,Peeyush Thakur,Narendra Dhar,Laxmidhar Behera
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction to actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on a modest system with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-actions prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall 90% success; sub-action inference completes in 3.8s, with end-to-end executions in 30-60s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
[AI-15] Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
【速读】:该论文旨在解决大规模云系统中故障根因分析(Root Cause Analysis, RCA)自动化效率低下的问题,尤其针对当前基于大语言模型(Large Language Model, LLM)的RCA代理在实际运行中检测准确率不足且缺乏过程级失败诊断机制的局限性。其解决方案的关键在于提出了一种面向LLM-RCA代理的过程级失败分析方法,通过在OpenRCA基准上执行1,675次代理运行,系统性地将失败归类为12种陷阱类型(pitfall types),涵盖代理内部推理、代理间通信及代理与环境交互三个维度;研究发现,最普遍的陷阱如“幻觉数据解读”和“探索不充分”在不同能力层级的LLM中均普遍存在,表明其根源在于共享的代理架构而非单个模型性能差异;进一步的受控实验验证了仅靠提示工程无法解决主导性陷阱,而优化代理间通信协议可显著降低通信相关失败达15个百分点,从而为设计更可靠的自主云RCA代理提供了可量化、可诊断的基础框架。
链接: https://arxiv.org/abs/2602.09937
作者: Taeyoon Kim,Woohyeok Park,Hoyeong Yun,Kyungyong Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent’s reasoning failed. This paper presents a process level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.
[AI-16] Routing Cascades and User Choice for LLM s ICLR2026
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)服务提供商在用户任务调度中面临的性能与成本权衡问题,即如何根据任务难度和延迟动态分配模型资源以优化整体效用。其核心解决方案是构建一个双层博弈模型(Stackelberg game),其中LLM提供商作为领导者选择路由策略,用户作为跟随者决定是否重新提示或放弃任务。关键在于通过完全刻画用户的最优响应并简化提供商的优化问题,发现最优路由策略通常为静态阈值规则,无需级联机制,并揭示了当用户与提供商对模型效用和成本排序不一致时存在“错配间隙”(misalignment gap),极端情况下甚至会导致提供商通过降低模型响应速度(throttling)来压低成本,从而损害用户效用。这一分析为单提供者-单用户场景下的路由、级联和限速策略提供了清晰的决策边界。
链接: https://arxiv.org/abs/2602.09902
作者: Rafid Mahmood
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, accepted in ICLR 2026
Abstract:To mitigate the trade-offs between performance and costs, LLM providers route user tasks to different models based on task difficulty and latency. We study the effect of LLM routing with respect to user behavior. We propose a game between an LLM provider with two models (standard and reasoning) and a user who can re-prompt or abandon tasks if the routed model cannot solve them. The user’s goal is to maximize their utility minus the delay from using the model, while the provider minimizes the cost of servicing the user. We solve this Stackelberg game by fully characterizing the user best response and simplifying the provider problem. We observe that in nearly all cases, the optimal routing policy involves a static policy with no cascading that depends on the expected utility of the models to the user. Furthermore, we reveal a misalignment gap between the provider-optimal and user-preferred routes when the user’s and provider’s rankings of the models with respect to utility and cost differ. Finally, we demonstrate conditions for extreme misalignment where providers are incentivized to throttle the latency of the models to minimize their costs, consequently depressing user utility. The results yield simple threshold rules for single-provider, single-user interactions and clarify when routing, cascading, and throttling help or harm.
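摘要的结论之一是最优路由近似为一条静态阈值规则。下面是一个示意性的阈值路由函数,变量与判据均为假设,仅帮助理解"按期望效用与延迟做一次性路由"的含义,并非论文推导出的最优策略。

```python
# 示意性阈值路由:仅当推理模型带来的期望效用增益超过其额外延迟/成本时才路由到推理模型。
# 变量命名与具体判据均为假设,用于说明"静态阈值规则",不是论文的最优解。
def route(u_standard: float, u_reasoning: float,
          delay_standard: float, delay_reasoning: float) -> str:
    gain = u_reasoning - u_standard            # 用户侧期望效用增益
    extra_delay = delay_reasoning - delay_standard
    return "reasoning" if gain > extra_delay else "standard"

print(route(u_standard=0.6, u_reasoning=0.9, delay_standard=1.0, delay_reasoning=1.2))
```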
[AI-17] TaCo: A Benchmark for Lossless and Lossy Codecs of Heterogeneous Tactile Data
【速读】:该论文旨在解决触觉感知数据在实时机器人应用中因带宽限制而亟需高效压缩的问题,尤其针对触觉数据固有的异质性和时空复杂性带来的挑战。解决方案的关键在于构建首个全面的触觉数据编解码器基准测试平台TaCo,系统评估30种压缩方法(包括通用压缩算法与神经编解码器)在五类传感器数据上的表现,并首次提出基于触觉数据训练的专用编解码器TaCo-LL(无损)和TaCo-L(有损),验证了其在存储、可视化、材料/物体分类及灵巧抓取等任务中的优越性能,为触觉感知中压缩效率与任务性能之间的权衡提供了可量化的分析框架。
链接: https://arxiv.org/abs/2602.09893
作者: Zhengxue Cheng,Yan Zhao,Keyu Wang,Hengdi Zhang,Li Song
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 27 pages
Abstract:Tactile sensing is crucial for embodied intelligence, providing fine-grained perception and control in complex environments. However, efficient tactile data compression, which is essential for real-time robotic applications under strict bandwidth constraints, remains underexplored. The inherent heterogeneity and spatiotemporal complexity of tactile data further complicate this challenge. To bridge this gap, we introduce TaCo, the first comprehensive benchmark for Tactile data Codecs. TaCo evaluates 30 compression methods, including off-the-shelf compression algorithms and neural codecs, across five diverse datasets from various sensor types. We systematically assess both lossless and lossy compression schemes on four key tasks: lossless storage, human visualization, material and object classification, and dexterous robotic grasping. Notably, we pioneer the development of data-driven codecs explicitly trained on tactile data, TaCo-LL (lossless) and TaCo-L (lossy). Results have validated the superior performance of our TaCo-LL and TaCo-L. This benchmark provides a foundational framework for understanding the critical trade-offs between compression efficiency and task performance, paving the way for future advances in tactile perception.
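作为理解该类基准评测方式的一个最小示例,下面用 zlib 对随机生成的"类触觉"帧序列计算无损压缩比;数据为占位数据,与 TaCo 的真实数据集和编解码器实现无关。

```python
import numpy as np
import zlib

# 示意:评测通用无损压缩器(zlib)在"类触觉"传感器帧上的压缩比。
# 数据为随机占位数据,阵列尺寸与帧数均为假设。
frames = (np.random.rand(100, 16, 16) * 255).astype(np.uint8)  # 假设 16x16 压力阵列、100 帧
raw = frames.tobytes()
compressed = zlib.compress(raw, level=9)
print(f"raw={len(raw)} bytes, zlib={len(compressed)} bytes, "
      f"ratio={len(raw) / len(compressed):.2f}x")
```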
[AI-18] Hybrid Responsible AI-Stochastic Approach for SLA Compliance in Multivendor 6G Networks
【速读】:该论文旨在解决6G网络自动化中因多厂商管理系统的AI闭环编排所引发的责任空白问题,即当服务等级协议(SLA)违反时,难以因果归责于特定智能体或供应商。其核心解决方案是提出一种融合负责任人工智能(Responsible AI, RAI)与随机学习的混合框架,通过将公平性、鲁棒性和可审计性嵌入网络控制环路,实现跨异构厂商域的动态对抗重加权和概率探索;关键创新在于引入RAAP(Responsibility-Aware Audit Protocol)持续记录AI决策轨迹,并生成用户级SLA摘要与运营商级责任分析双报告,从而在保证性能提升的同时建立可追溯的问责机制,实验证明该方法使最差群体准确率提高至60.5%,且99%的模拟SLA违规可被精准定位到责任AI实体。
链接: https://arxiv.org/abs/2602.09841
作者: Emanuel Figetakis,Ahmed Refaey Hussein
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 4 figures
Abstract:The convergence of AI and 6G network automation introduces new challenges in maintaining transparency, fairness, and accountability across multivendor management systems. Although closed-loop AI orchestration improves adaptability and self-optimization, it also creates a responsibility gap, where violations of SLAs cannot be causally attributed to specific agents or vendors. This paper presents a hybrid responsible AI-stochastic learning framework that embeds fairness, robustness, and auditability directly into the network control loop. The framework integrates RAI games with stochastic optimization, enabling dynamic adversarial reweighting and probabilistic exploration across heterogeneous vendor domains. An RAAP continuously records AI-driven decision trajectories and produces dual accountability reports: user-level SLA summaries and operator-level responsibility analytics. Experimental evaluations on synthetic two-class multigroup datasets demonstrate that the proposed hybrid model improves the accuracy of the worst group by up to 10.5%. Specifically, hybrid RAI achieved a WGAcc of 60.5% and an AvgAcc of 72.7%, outperforming traditional RAI-GA (50.0%) and ERM (21.5%). The audit mechanism successfully traced 99% simulated SLA violations to the AI entities responsible, producing both vendor and agent-level accountability indices. These results confirm that the proposed hybrid approach enhances fairness and robustness as well as establishes a concrete accountability framework for autonomous SLA assurance in multivendor 6G networks.
[AI-19] Efficient Unsupervised Environment Design through Hierarchical Policy Representation Learning
【速读】:该论文旨在解决资源受限场景下无监督环境设计(Unsupervised Environment Design, UED)中教师-学生交互机会有限的问题,传统方法依赖随机过程生成无限环境以保证开放性(Open-Endedness),但在实际应用中难以满足高效训练需求。解决方案的关键在于提出一种分层马尔可夫决策过程(hierarchical Markov Decision Process, MDP)框架,其中教师智能体通过分析学生策略在已发现评估环境中的表现来理解其能力,并据此生成针对性的训练环境;同时引入生成式模型扩充教师训练数据集,减少对真实师生交互的依赖,从而显著降低交互次数并提升训练效率。
链接: https://arxiv.org/abs/2602.09813
作者: Dexun Li,Sidney Tio,Pradeep Varakantham
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Unsupervised Environment Design (UED) has emerged as a promising approach to developing general-purpose agents through automated curriculum generation. Popular UED methods focus on Open-Endedness, where teacher algorithms rely on stochastic processes for infinite generation of useful environments. This assumption becomes impractical in resource-constrained scenarios where teacher-student interaction opportunities are limited. To address this challenge, we introduce a hierarchical Markov Decision Process (MDP) framework for environment design. Our framework features a teacher agent that leverages student policy representations derived from discovered evaluation environments, enabling it to generate training environments based on the student’s capabilities. To improve efficiency, we incorporate a generative model that augments the teacher’s training dataset with synthetic data, reducing the need for teacher-student interactions. In experiments across several domains, we show that our method outperforms baseline approaches while requiring fewer teacher-student interactions in a single episode. The results suggest the applicability of our approach in settings where training opportunities are limited.
[AI-20] A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer
【速读】:该论文试图解决的问题是:在基于值的深度强化学习中,不同神经网络架构(具体为Double Deep Q-Networks, DDQN 与 Dueling DQN)在跨环境迁移学习时的表现差异及其对负迁移(negative transfer)的敏感性问题。解决方案的关键在于通过受控的实验设计,在相同超参数和训练条件下,固定层间表示迁移协议,系统评估两种架构在从CartPole到LunarLander这一结构差异显著的任务迁移中的性能表现。结果表明,DDQN由于其架构诱导的归纳偏置(inductive bias)展现出更强的迁移鲁棒性,能有效避免负迁移并维持接近基线的优化稳定性,而Dueling DQN则表现出显著的负迁移特征,即奖励下降且优化过程不稳定,揭示了架构设计对迁移效果的关键影响。
链接: https://arxiv.org/abs/2602.09810
作者: Azka Nasir,Fatima Dossa,Muhammad Ahmed Atif,Mohammad Ahmed Atif
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under the examined setup and maintains learning dynamics comparable to baseline performance in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization behavior. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning under the examined transfer protocol.
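下面是"固定逐层表示迁移"的一个 PyTorch 示意:把源任务 Q 网络的某个隐藏层权重拷贝到目标任务网络。网络结构与是否冻结均为假设,仅说明此类迁移协议的一般做法,并非论文原始代码。

```python
import torch.nn as nn

# 示意:逐层表示迁移,把源任务 Q 网络的中间层权重拷贝到目标任务网络后继续训练。
# 网络结构为假设,仅用于说明做法。
def make_qnet(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

source = make_qnet(4, 2)   # CartPole: 4 维观测、2 个动作
target = make_qnet(8, 4)   # LunarLander: 8 维观测、4 个动作

# 输入/输出维度不同,首末层无法直接拷贝;这里只迁移中间隐藏层(索引 2)的权重
target[2].load_state_dict(source[2].state_dict())
for p in target[2].parameters():
    p.requires_grad = False  # 是否冻结取决于具体迁移协议,此处仅作示意
```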
[AI-21] Symbolic Pattern Temporal Numeric Planning with Intermediate Conditions and Effects
【速读】:该论文旨在解决**时序规划(Temporal Planning)**中包含中间条件与效果(Intermediate Conditions and Effects, ICEs)的复杂性问题,即如何在动作具有持续时间且可重叠、同时需在特定时间点满足某些条件或产生效果的情况下,高效生成有效计划。解决方案的关键在于将原有的符号模式规划(Symbolic Pattern Planning, SPP)方法扩展至ICE场景:通过构建一个有限动作序列(即模式)来建议动作间的因果顺序,并将其编码为SMT公式,其中模型对应合法计划;若初始模式不准确导致无解,则逐步扩展模式直至捕获有效计划中的因果关系,从而保证算法的完备性。实验表明,该方案在多数无ICE的时序领域优于现有规划器,在含ICE的领域达到与最先进搜索型规划器相当的效果,并在真实应用驱动的新领域中表现更优。
链接: https://arxiv.org/abs/2602.09798
作者: Matteo Cardellini,Enrico Giunchiglia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review at the Artificial Intelligence Journal
Abstract:Recently, a Symbolic Pattern Planning (SPP) approach was proposed for numeric planning where a pattern (i.e., a finite sequence of actions) suggests a causal order between actions. The pattern is then encoded in a SMT formula whose models correspond to valid plans. If the suggestion by the pattern is inaccurate and no valid plan can be found, the pattern is extended until it contains the causal order of actions in a valid plan, making the approach complete. In this paper, we extend the SPP approach to the temporal planning with Intermediate Conditions and Effects (ICEs) fragment, where (i) actions are durative (and thus can overlap over time) and have conditions/effects which can be checked/applied at any time during an action’s execution, and (ii) one can specify plan’s conditions/effects that must be checked/applied at specific times during the plan execution. Experimental results show that our SPP planner Patty (i) outperforms all other planners in the literature in the majority of temporal domains without ICEs, (ii) obtains comparable results with the SoTA search planner for ICS in literature domains with ICEs, and (iii) outperforms the same planner in a novel domain based on a real-world application.
[AI-22] GHS-TDA: A Synergistic Reasoning Framework Integrating Global Hypothesis Space with Topological Data Analysis
【速读】:该论文旨在解决链式思维(Chain-of-Thought, CoT)方法在大语言模型(Large Language Models, LLMs)推理过程中存在的两大核心问题:一是推理过程对早期决策高度敏感,局部错误易传播并放大,且缺乏全局协调与修正机制;二是现有方法缺乏结构化分析手段以过滤冗余推理路径并提取关键推理特征,导致推理不稳定且可解释性差。解决方案的关键在于提出GHS-TDA框架,其核心创新包括:首先构建语义增强的全局假设图(Global Hypothesis Graph),用于聚合、对齐和协调多个候选推理路径,从而在局部推理失败时提供替代的全局修正路径;其次引入基于持久同调(Persistent Homology)的拓扑数据分析(Topological Data Analysis, TDA),以捕捉多尺度稳定结构、消除冗余与不一致信息,并提取更可靠的推理骨架。通过融合推理多样性与拓扑稳定性,GHS-TDA实现了自适应收敛,生成高置信度且可解释的推理路径,在多个推理基准上显著优于现有强基线模型。
链接: https://arxiv.org/abs/2602.09794
作者: Jiaquan Zhang,Chaoning Zhang,Shuxu Chen,Xudong Wang,Zhenzhen Huang,Pengcheng Zheng,Shuai Yuan,Sheng Zheng,Qigan Sun,Jie Zou,Lik-Hang Lee,Yang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23pages
Abstract:Chain-of-Thought (CoT) has been shown to significantly improve the reasoning accuracy of large language models (LLMs) on complex tasks. However, due to the autoregressive, step-by-step generation paradigm, existing CoT methods suffer from two fundamental limitations. First, the reasoning process is highly sensitive to early decisions: once an initial error is introduced, it tends to propagate and amplify through subsequent steps, while the lack of a global coordination and revision mechanism makes such errors difficult to correct, ultimately leading to distorted reasoning chains. Second, current CoT approaches lack structured analysis techniques for filtering redundant reasoning and extracting key reasoning features, resulting in unstable reasoning processes and limited interpretability. To address these issues, we propose GHS-TDA. GHS-TDA first constructs a semantically enriched global hypothesis graph to aggregate, align, and coordinate multiple candidate reasoning paths, thereby providing alternative global correction routes when local reasoning fails. It then applies topological data analysis based on persistent homology to capture stable multi-scale structures, remove redundancy and inconsistencies, and extract a more reliable reasoning skeleton. By jointly leveraging reasoning diversity and topological stability, GHS-TDA achieves self-adaptive convergence, produces high-confidence and interpretable reasoning paths, and consistently outperforms strong baselines in terms of both accuracy and robustness across multiple reasoning benchmarks.
[AI-23] Explainability in Generative Medical Diffusion Models: A Faithfulness-Based Analysis on MRI Synthesis
【速读】:该论文旨在解决生成式扩散模型在医学影像领域(特别是磁共振成像MRI合成)中的可解释性问题,即其内部决策机制缺乏透明度,限制了其在医疗场景下的可信度与应用安全性。解决方案的关键在于提出一种基于忠实度(faithfulness)的可解释性框架,通过分析原型类解释方法(如ProtoPNet、Enhanced ProtoPNet和ProtoPool)如何关联生成图像与训练特征,结合扩散模型去噪轨迹来理解图像生成过程,并以忠实度指标量化解释结果的可靠性。实验表明,增强型原型网络(EPPNet)在忠实度评分上达到最高(0.1534),从而为生成式AI在医疗领域的安全部署提供了更透明、可信赖的解释路径。
链接: https://arxiv.org/abs/2602.09781
作者: Surjo Dey,Pallabi Saikia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at 3rd World Congress on Smart Computing (WCSC2026) conference
Abstract:This study investigates the explainability of generative diffusion models in the context of medical imaging, focusing on Magnetic resonance imaging (MRI) synthesis. Although diffusion models have shown strong performance in generating realistic medical images, their internal decision making process remains largely opaque. We present a faithfulness-based explainability framework that analyzes how prototype-based explainability methods like ProtoPNet (PPNet), Enhanced ProtoPNet (EPPNet), and ProtoPool can link the relationship between generated and training features. Our study focuses on understanding the reasoning behind image formation through denoising trajectory of diffusion model and subsequently prototype explainability with faithfulness analysis. Experimental analysis shows that EPPNet achieves the highest faithfulness (with score 0.1534), offering more reliable insights, and explainability into the generative process. The results highlight that diffusion models can be made more transparent and trustworthy through faithfulness-based explanations, contributing to safer and more interpretable applications of generative AI in healthcare.
[AI-24] Grounding LTL Tasks in Sub-Symbolic RL Environments for Zero-Shot Generalization
【速读】:该论文旨在解决在子符号(sub-symbolic)环境中训练强化学习代理以遵循多个时序扩展指令的问题,这些指令以线性时序逻辑(Linear Temporal Logic, LTL)表达。传统多任务方法通常依赖于已知的从原始观测到公式中符号的映射关系,这一假设在实际场景中往往不现实。本文的关键解决方案是联合训练一个多功能策略网络与一个符号接地器(symbol grounder),二者共享相同的环境经验;其中符号接地器仅基于原始观测和稀疏奖励,通过神经奖励机器(Neural Reward Machines)以半监督方式学习,从而无需预先定义符号与观测之间的显式映射。实验表明,该方法在视觉基础环境中性能接近使用真实符号接地的效果,并显著优于当前最先进的子符号环境方法。
链接: https://arxiv.org/abs/2602.09761
作者: Matteo Pannacci,Andrea Fanti,Elena Umili,Roberto Capobianco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint currently under review
Abstract:In this work we address the problem of training a Reinforcement Learning agent to follow multiple temporally-extended instructions expressed in Linear Temporal Logic in sub-symbolic environments. Previous multi-task work has mostly relied on knowledge of the mapping between raw observations and symbols appearing in the formulae. We drop this unrealistic assumption by jointly training a multi-task policy and a symbol grounder with the same experience. The symbol grounder is trained only from raw observations and sparse rewards via Neural Reward Machines in a semi-supervised fashion. Experiments on vision-based environments show that our method achieves performance comparable to using the true symbol grounding and significantly outperforms state-of-the-art methods for sub-symbolic environments.
[AI-25] ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm
【速读】:该论文旨在解决深度强化学习中策略优化的样本效率与训练稳定性之间的权衡问题。现有方法如近端策略优化(Proximal Policy Optimization, PPO)通过保守的在线策略更新保证了训练稳定性,但牺牲了样本效率;而离线策略方法虽能更高效利用数据,却易引入估计方差和偏差。解决方案的关键在于提出一种扩展的离线策略近端策略优化(Extended Off-policy Proximal Policy Optimization, ExO-PPO),其核心创新包括:基于广义策略改进下界推导出扩展的离线策略提升准则,设计分段指数函数形式的裁剪机制以构建更优的代理目标函数,并将历史M个策略生成的轨迹组织进回放缓冲区用于离线训练,从而在保持PPO稳定性的基础上显著提升样本利用率。
链接: https://arxiv.org/abs/2602.09726
作者: Hanyong Wang,Menglong Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep reinforcement learning has been able to solve various tasks successfully; however, due to the construction of policy gradient and training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most successful deep reinforcement-learning algorithms, the Proximal Policy Optimization algorithm (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. On the other hand, off-policy methods make more adequate use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, in this paper, we propose a new PPO variant based on the stability guarantee from conservative on-policy iteration with more efficient off-policy data utilization. Specifically, we first derive an extended off-policy improvement from an expectation form of the generalized policy improvement lower bound. Then, we extend the clipping mechanism with segmented exponential functions for a suitable surrogate objective function. Third, the trajectories generated by the past M policies are organized in the replay buffer for off-policy training. We refer to this method as Extended Off-policy Proximal Policy Optimization (ExO-PPO). Compared with PPO and some other state-of-the-art variants, we demonstrate the improved performance of ExO-PPO with balanced sample efficiency and stability on varied tasks in empirical experiments.
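作为背景,下面给出标准 PPO 裁剪代理目标的 PyTorch 示意;ExO-PPO 正是在此基础上用分段指数函数扩展裁剪机制并引入离线数据,该扩展的具体形式摘要未给出,此处不作实现。

```python
import torch

# 背景示意:标准 PPO 的裁剪代理损失(ExO-PPO 对裁剪机制的分段指数扩展未在此实现)。
def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                 # 重要性采样比 r(θ)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

logp_new = torch.randn(32, requires_grad=True)  # 新策略的 log 概率(占位)
logp_old = torch.randn(32)                      # 采样时旧策略的 log 概率(占位)
adv = torch.randn(32)                           # 优势估计(占位)
loss = ppo_clip_loss(logp_new, logp_old, adv)
loss.backward()
```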
[AI-26] Resilient Class-Incremental Learning: on the Interplay of Drifting Unlabelled and Imbalanced Data Streams
【速读】:该论文旨在解决动态环境中流式数据面临的多重挑战,包括概念漂移(concept drift)、类别不平衡(class imbalance)、标签稀缺(label scarcity)以及新类别的出现,这些问题共同导致表示不稳定、学习偏向过时分布,并降低检测的鲁棒性和可靠性。解决方案的关键在于提出一种名为SCIL(Streaming Class-Incremental Learning)的框架,其核心机制包括:利用自编码器(autoencoder, AE)与多层感知机(multi-layer perceptron)结合进行多分类预测;采用分类损失与重构损失相结合的双损失策略以同时实现预测和新类检测;通过修正伪标签(corrected pseudo-labels)支持在线训练;使用队列管理类别以维持历史知识;并引入过采样(oversampling)缓解类别不平衡问题。实验表明,该方法在真实和合成数据集上均显著优于现有基线和前沿方法。
链接: https://arxiv.org/abs/2602.09681
作者: Jin Li,Kleanthis Malialis,Marios Polycarpou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by Artificial Intelligence Science and Engineering
Abstract:In today’s connected world, the generation of massive streaming data across diverse domains has become commonplace. In the presence of concept drift, class imbalance, label scarcity, and new class emergence, they jointly degrade representation stability, bias learning toward outdated distributions, and reduce the resilience and reliability of detection in dynamic environments. This paper proposes SCIL (Streaming Class-Incremental Learning) to address these challenges. The SCIL framework integrates an autoencoder (AE) with a multi-layer perceptron for multi-class prediction, uses a dual-loss strategy (classification and reconstruction) for prediction and new class detection, employs corrected pseudo-labels for online training, manages classes with queues, and applies oversampling to handle imbalance. The rationale behind the method’s structure is elucidated through ablation studies and a comprehensive experimental evaluation is performed using both real-world and synthetic datasets that feature class imbalance, incremental classes, and concept drifts. Our results demonstrate that SCIL outperforms strong baselines and state-of-the-art methods. Based on our commitment to Open Science, we make our code and datasets available to the community.
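下面用 PyTorch 给出"自编码器 + MLP 分类头、分类损失与重构损失相加"这一结构的最小示意。维度、损失权重与新类检测阈值均为假设,仅对应摘要中的整体框架,而非 SCIL 的完整实现(伪标签修正、类队列与过采样等未包含)。

```python
import torch
import torch.nn as nn

# 示意:AE + MLP 分类头的双损失结构(分类 + 重构),各项尺寸与权重均为假设。
class AEClassifier(nn.Module):
    def __init__(self, in_dim=20, latent=8, n_classes=5):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, in_dim))
        self.clf = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), self.clf(z)

model = AEClassifier()
x = torch.randn(64, 20)
y = torch.randint(0, 5, (64,))
recon, logits = model(x)
loss = nn.functional.cross_entropy(logits, y) + 0.5 * nn.functional.mse_loss(recon, x)
# 重构误差偏大且分类置信度偏低的样本可被视为潜在新类(具体阈值为假设)
```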
[AI-27] Administrative Laws Fourth Settlement: AI and the Capability-Accountability Trap
【速读】:该论文试图解决行政法长期面临的“能力—问责困境”(capability-accountability trap):技术进步促使政府机构日益复杂化,但这种复杂性又使监管者(如法院和国会)难以有效监督,导致程序性审查取代实质性 oversight,进而形成冗余的制度负担并削弱民主控制。其解决方案的关键在于利用生成式 AI (Generative AI) 的独特能力实现“可审视性”(scrutability)——即通过将技术复杂性转化为可理解的形式、揭示关键假设并支持对决策逻辑的实质验证,从而在不牺牲治理能力的前提下重建可理解的监督机制。论文提出三项法律创新:模型与系统档案(Model and System Dossier)、重大模型变更触发机制以及“审计优先型 deference”标准,共同构成所谓“第四次制度安排”(Fourth Settlement),旨在突破传统行政法的局限,实现能力与问责的动态平衡。
链接: https://arxiv.org/abs/2602.09678
作者: Nicholas Caputo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 67 pages
Abstract:Since 1887, administrative law has navigated a “capability-accountability trap”: technological change forces government to become more sophisticated, but sophistication renders agencies opaque to generalist overseers like the courts and Congress. The law’s response–substituting procedural review for substantive oversight–has produced a sedimentary accretion of requirements that ossify capacity without ensuring democratic control. This Article argues that the Supreme Court’s post-Loper Bright retrenchment is best understood as an effort to shrink administration back to comprehensible size in response to this complexification. But reducing complexity in this way sacrifices capability precisely when climate change, pandemics, and AI risks demand more sophisticated governance. AI offers a different path. Unlike many prior administrative technologies that increased opacity alongside capacity, AI can help build “scrutability” in government, translating technical complexity into accessible terms, surfacing the assumptions that matter for oversight, and enabling substantive verification of agency reasoning. This Article proposes three doctrinal innovations within administrative law to realize this potential: a Model and System Dossier (documenting model purpose, evaluation, monitoring, and versioning) extending the administrative record to AI decision-making; a material-model-change trigger specifying when AI updates require new process; and a “deference to audit” standard that rewards agencies for auditable evaluation of their AI tools. The result is a framework for what this Article calls the “Fourth Settlement,” administrative law that escapes the capability-accountability trap by preserving capability while restoring comprehensible oversight of administration.
[AI-28] ClinAlign: Scaling Healthcare Alignment from Clinician Preference
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗领域中生成内容时,难以与临床医生的细粒度偏好对齐的问题。现有方法通常依赖于粗粒度的目标函数或缺乏专业指南支撑的自动化评判器,导致对齐效果不佳。解决方案的关键在于提出一个两阶段框架:首先构建HealthRubrics数据集,包含7,034个由医生验证的偏好样本,用于精细化调整LLM输出;其次从中提炼出HealthPrinciples——119条可广泛复用、基于临床维度的医学原则,实现可扩展的监督信号。该框架通过离线对齐和推理时引导自修正两种方式提升模型性能,最终使仅激活3B参数的30B模型在HealthBench-Hard测试集上达到33.4%得分,优于多个更大规模模型,确立了资源高效的临床对齐基线。
链接: https://arxiv.org/abs/2602.09653
作者: Shiwei Lyu,Xidong Wang,Lei Liu,Hao Zhu,Chaohe Zhang,Jian Wang,Jinjie Gu,Benyou Wang,Yue Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended outputs with fine-grained clinician preferences remains challenging. Existing methods often rely on coarse objectives or unreliable automated judges that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we introduce HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimensions, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) an inference-time tool for guided self-revision. A 30B parameter model that activates only 3B parameters at inference trained with our framework achieves 33.4% on HealthBench-Hard, outperforming much larger models including Deepseek-R1 and o3, establishing a resource-efficient baseline for clinical alignment.
[AI-29] FLINGO – Instilling ASP Expressiveness into Linear Integer Constraints
【速读】:该论文旨在解决约束答案集编程(Constraint Answer Set Programming, CASP)中数值属性表达能力不足的问题。当前大多数CASP求解器在数值约束的规范上偏向于数值后端的表达能力和语义,而非标准答案集编程(Answer Set Programming, ASP)中的约定,导致在ASP中可实现的默认值声明、未定义属性、非确定性赋值及聚合值等特性在转换为基于约束的表示后丢失。解决方案的关键在于提出FLINGO语言及其工具,该语言将上述ASP的表达能力内嵌至数值约束中,从而保留了ASP的灵活性与语义丰富性;同时,论文还提供了一种从FLINGO语法到标准CLINGCON格式的翻译机制,确保其与现有CASP系统的兼容性。
链接: https://arxiv.org/abs/2602.09620
作者: Jorge Fandinno,Pedro Cabalar,Philipp Wanko,Torsten Schaub
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, something required in many real-world applications. The usual specification of constraints in most CASP solvers is closer to the numerical back-end expressiveness and semantics, rather than to standard specification in ASP. In the latter, numerical attributes are represented with predicates and this allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the FLINGO language (and tool) that incorporates the aforementioned expressiveness inside the numerical constraints and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced FLINGO syntax to regular CASP programs following the CLINGCON input format.
[AI-30] Detecting radar targets swarms in range profiles with a partially complex-valued neural network
【速读】:该论文旨在解决雷达距离剖面中多个目标因距离分辨率受限及回波失真导致的检测难题,特别是当目标间距较近时可能出现的目标合并或相互干扰问题。解决方案的关键在于提出一种部分复数域神经网络(partially complex-valued neural networks),该网络作为自适应的距离剖面处理方法,能够一次性处理整个接收信号并生成完整的检测剖面,相较于传统脉冲压缩方法(逐脉冲长度处理)具有更强的端到端建模能力和对复杂场景的鲁棒性。
链接: https://arxiv.org/abs/2602.09597
作者: Martin Bauw
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Correctly detecting radar targets is usually challenged by clutter and waveform distortion. An additional difficulty stems from the relative proximity of several targets, the latter being perceived as a single target in the worst case, or influencing each other’s detection thresholds. The negative impact of targets proximity notably depends on the range resolution defined by the radar parameters and the adaptive threshold adopted. This paper addresses the matter of targets detection in radar range profiles containing multiple targets with varying proximity and distorted echoes. Inspired by recent contributions in the radar and signal processing literature, this work proposes partially complex-valued neural networks as an adaptive range profile processing. Simulated datasets are generated and experiments are conducted to compare a common pulse compression approach with a simple neural network partially defined by complex-valued parameters. Whereas the pulse compression processes one pulse length at a time, the neural network put forward is a generative architecture going through the entire received signal in one go to generate a complete detection profile.
[AI-31] Why the Counterintuitive Phenomenon of Likelihood Rarely Appears in Tabular Anomaly Detection with Deep Generative Models?
【速读】:该论文旨在解决深度生成模型在异常检测中出现的反直觉现象——即生成模型对异常数据赋予更高似然值(likelihood)的问题,这在图像领域较为常见,但在表格数据(tabular data)中是否普遍存在尚不明确。为解决这一问题,作者提出了一种与领域无关的定义和评估框架,首次对这一现象进行了精确界定,并通过在47个表格数据集和10个CV/NLP嵌入数据集上的系统实验验证了该现象在表格数据中极为罕见。解决方案的关键在于:利用可解析计算似然的生成模型(如归一化流,normalizing flows),结合理论分析与实证研究,证明基于似然的异常检测方法在表格数据中具有高度可靠性,从而为实际应用提供了稳健的范式。
链接: https://arxiv.org/abs/2602.09593
作者: Donghwan Kim,Junghun Phee,Hyunsoo Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 47 pages, 11 figures
Abstract:Deep generative models with tractable and analytically computable likelihoods, exemplified by normalizing flows, offer an effective basis for anomaly detection through likelihood-based scoring. We demonstrate that, unlike in the image domain where deep generative models frequently assign higher likelihoods to anomalous data, such counterintuitive behavior occurs far less often in tabular settings. We first introduce a domain-agnostic formulation that enables consistent detection and evaluation of the counterintuitive phenomenon, addressing the absence of precise definition. Through extensive experiments on 47 tabular datasets and 10 CV/NLP embedding datasets in ADBench, benchmarked against 13 baseline models, we demonstrate that the phenomenon, as defined, is consistently rare in general tabular data. We further investigate this phenomenon from both theoretical and empirical perspectives, focusing on the roles of data dimensionality and difference in feature correlation. Our results suggest that likelihood-only detection with normalizing flows offers a practical and reliable approach for anomaly detection in tabular domains.
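下面给出"仅用似然做异常判定"的最小示意:对分布内数据拟合密度模型,log p(x) 低于阈值即判为异常。此处用多元高斯作为密度模型的占位(论文使用归一化流),阈值取训练似然的 1% 分位数也只是假设做法。

```python
import numpy as np
from scipy.stats import multivariate_normal

# 示意:基于似然的异常检测判据。密度模型用多元高斯占位(论文使用归一化流)。
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 10))                  # 分布内训练数据(占位)
density = multivariate_normal(mean=X_train.mean(0), cov=np.cov(X_train.T))

scores_train = density.logpdf(X_train)
threshold = np.quantile(scores_train, 0.01)            # 取 1% 分位数作为判异阈值(假设)

x_new = rng.normal(loc=5.0, size=(1, 10))              # 一个明显偏离分布的样本
is_anomaly = density.logpdf(x_new)[0] < threshold
print(is_anomaly)
```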
[AI-32] Mitigating the Likelihood Paradox in Flow-based OOD Detection via Entropy Manipulation
【速读】:该论文旨在解决深度生成模型在处理分布外(out-of-distribution, OOD)输入时出现的似然悖论问题,即这些模型常对OOD样本赋予异常高的概率值,导致基于似然的检测方法失效。解决方案的关键在于通过语义相似性控制输入熵:利用一个分布内记忆库(in-distribution memory bank)衡量输入与分布内样本的相似度,并对语义差异较大的输入施加更强的扰动,从而增强分布内样本的期望对数似然优势。该方法无需重新训练密度模型,且理论分析表明其能有效扩大分布内与分布外样本之间的对数似然差距,实验结果也验证了其在标准基准上相较基线方法具有更优的AUROC性能。
链接: https://arxiv.org/abs/2602.09581
作者: Donghwan Kim,Hyunsoo Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 4 figures
Abstract:Deep generative models that can tractably compute input likelihoods, including normalizing flows, often assign unexpectedly high likelihoods to out-of-distribution (OOD) inputs. We mitigate this likelihood paradox by manipulating input entropy based on semantic similarity, applying stronger perturbations to inputs that are less similar to an in-distribution memory bank. We provide a theoretical analysis showing that entropy control increases the expected log-likelihood gap between in-distribution and OOD samples in favor of the in-distribution, and we explain why the procedure works without any additional training of the density model. We then evaluate our method against likelihood-based OOD detectors on standard benchmarks and find consistent AUROC improvements over baselines, supporting our explanation.
[AI-33] Predictive Query Language: A Domain-Specific Language for Predictive Modeling on Relational Databases
【速读】:该论文旨在解决在关系型数据库中进行预测建模时,手动提取训练样本(即预测实体和目标标签)所面临的效率低、劳动密集且易出错的问题。传统机器学习模型的训练依赖于人工从数据库中构造特征和标签,这一过程不仅耗时,还难以规模化。解决方案的关键在于提出一种名为 Predictive Query Language (PQL) 的声明式语言,其设计灵感源自 SQL,能够通过单个查询语句自动定义预测任务并计算适用于多种机器学习场景(如回归、分类、时间序列预测和推荐系统)的训练标签,从而显著降低建模门槛并提升自动化水平。
链接: https://arxiv.org/abs/2602.09572
作者: Vid Kocijan,Jinu Sunil,Jan Eric Lenssen,Viman Deb,Xinwei Xe,Federco Reyes Gomez,Matthias Fey,Jure Leskovec
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The purpose of predictive modeling on relational data is to predict future or missing values in a relational database, for example, future purchases of a user, risk of readmission of the patient, or the likelihood that a financial transaction is fraudulent. Typically powered by machine learning methods, predictive models are used in recommendations, financial fraud detection, supply chain optimization, and other systems, providing billions of predictions every day. However, training a machine learning model requires manual work to extract the required training examples - prediction entities and target labels - from the database, which is slow, laborious, and prone to mistakes. Here, we present the Predictive Query Language (PQL), a SQL-inspired declarative language for defining predictive tasks on relational databases. PQL allows specifying a predictive task in a single declarative query, enabling the automatic computation of training labels for a large variety of machine learning tasks, such as regression, classification, time-series forecasting, and recommender systems. PQL is already successfully integrated and used in a collection of use cases as part of a predictive AI platform. The versatility of the language can be demonstrated through its many ongoing use cases, including financial fraud, item recommendations, and workload prediction. We demonstrate its versatile design through two implementations; one for small-scale, low-latency use and one that can handle large-scale databases.
[AI-34] Autoregressive Direct Preference Optimization
【速读】:该论文旨在解决当前直接偏好优化(Direct Preference Optimization, DPO)方法中因依赖响应级Bradley-Terry(Bradley-Terry, BT)模型而导致的理论局限性问题,即在推导目标函数时假设参考模型和可学习模型均为自回归模型(autoregressive),但这一假设并未在建模阶段被显式引入,从而可能限制了DPO的性能潜力。解决方案的关键在于重新审视DPO的理论基础,提出一种新的形式化框架——自回归DPO(Autoregressive DPO, ADPO),其核心创新是将自回归假设提前引入到BT模型之前,从而在不违背原有理论的前提下,重构损失函数:将原DPO目标中log-sigmoid内的求和操作移至外部,使优化过程更符合语言生成的自回归本质。此外,论文首次明确区分两种长度度量——标记长度(token length, μ)与反馈长度(feedback length, μ’),并揭示二者对基于DPO算法设计的影响,为后续高效、稳定地优化大语言模型(Large Language Models, LLMs)提供了理论支撑与实践指导。
链接: https://arxiv.org/abs/2602.09533
作者: Masanari Oi,Mahiro Ukai,Masahiro Kaneko,Naoaki Okazaki,Nakamasa Inoue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length \mu and the feedback length \mu '. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.
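为帮助理解"把求和移到 log-sigmoid 之外"这一关键差异,下面给出 DPO 与 ADPO 目标的示意性对比;逐 token 记号以及两条回复长度不同时的精确处理均以论文推导为准。

```latex
% 示意性对比:\Delta_t 为第 t 个 token 上"被选回复"与"被拒回复"的对数概率比之差,
% 仅作记号说明;ADPO 的精确目标请以论文为准。
\Delta_t \;=\; \log\frac{\pi_\theta(y_{w,t}\mid x, y_{w,<t})}{\pi_{\mathrm{ref}}(y_{w,t}\mid x, y_{w,<t})}
          \;-\; \log\frac{\pi_\theta(y_{l,t}\mid x, y_{l,<t})}{\pi_{\mathrm{ref}}(y_{l,t}\mid x, y_{l,<t})}
\\[4pt]
\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}\Big[\log\sigma\Big(\beta\sum_{t}\Delta_t\Big)\Big],
\qquad
\mathcal{L}_{\mathrm{ADPO}} \;\approx\; -\,\mathbb{E}\Big[\sum_{t}\log\sigma\big(\beta\,\Delta_t\big)\Big]
```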
[AI-35] Learning to Discover Iterative Spectral Algorithms
【速读】:该论文旨在解决大规模数值线性代数与数值优化中迭代谱算法(iterative spectral algorithms)的自动发现难题,即如何高效地设计针对特定任务的矩阵多项式计算方法以加速求解过程。解决方案的关键在于提出AutoSpec框架:其核心是基于自监督学习的神经网络架构,能够利用输入算子的粗粒度谱信息(如特征值估计和残差范数)预测递推系数,从而生成适配下游任务的矩阵多项式;该框架通过三方面保障有效性——推理过程实现可执行的短数值线性代数递推、在小规模合成问题上高效训练并迁移至真实世界的大规模算子、以及基于任务定义的目标函数确保在整个训练集谱分布范围内实现期望的逼近或预条件行为。
链接: https://arxiv.org/abs/2602.09530
作者: Zihang Liu,Oleg Balabanov,Yaoqing Yang,Michael W. Mahoney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:
Abstract:We introduce AutoSpec, a neural network framework for discovering iterative spectral algorithms for large-scale numerical linear algebra and numerical optimization. Our self-supervised models adapt to input operators using coarse spectral information (e.g., eigenvalue estimates and residual norms), and they predict recurrence coefficients for computing or applying a matrix polynomial tailored to a downstream task. The effectiveness of AutoSpec relies on three ingredients: an architecture whose inference pass implements short, executable numerical linear algebra recurrences; efficient training on small synthetic problems with transfer to large-scale real-world operators; and task-defined objectives that enforce the desired approximation or preconditioning behavior across the range of spectral profiles represented in the training set. We apply AutoSpec to discovering algorithms for representative numerical linear algebra tasks: accelerating matrix-function approximation; accelerating sparse linear solvers; and spectral filtering/preconditioning for eigenvalue computations. On real-world matrices, the learned procedures deliver orders-of-magnitude improvements in accuracy and/or reductions in iteration count, relative to basic baselines. We also find clear connections to classical theory: the induced polynomials often exhibit near-equiripple, near-minimax behavior characteristic of Chebyshev polynomials.
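下面用 NumPy 给出"按给定递推系数把矩阵多项式作用到向量上"的最小示意,对应摘要中推理阶段执行的短数值线性代数递推;递推形式与系数均为随意取值,并非 AutoSpec 实际预测的系数。

```python
import numpy as np

# 示意:三项递推计算 p(A) v,AutoSpec 的思路是由网络根据粗粒度谱信息预测此类递推系数。
# 这里的递推形式与系数均为随意取值,仅用于说明。
def apply_matrix_polynomial(A, v, alphas, betas):
    """按 p_{k+1}(A)v = alpha_k * A @ p_k(A)v + beta_k * p_{k-1}(A)v 递推,返回最终的 p(A)v。"""
    p_prev, p_curr = np.zeros_like(v), v.copy()
    for a, b in zip(alphas, betas):
        p_next = a * (A @ p_curr) + b * p_prev
        p_prev, p_curr = p_curr, p_next
    return p_curr

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                      # 对称化,便于从谱的角度理解多项式滤波
v = rng.standard_normal(50)
y = apply_matrix_polynomial(A, v, alphas=[1.0, 0.5, 0.25], betas=[0.0, -0.2, -0.1])
```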
[AI-36] Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA
【速读】:该论文旨在解决低秩适配(Low-rank adaptation, LoRA)在微调大语言模型时,其多种变体在相同基准上表现出矛盾性能结果的问题。研究表明,这些不一致源于一个被忽视的关键因素——批量大小(batch size)。论文的核心解决方案是揭示批量大小对LoRA性能的决定性影响,并提出一种基于代理指标的高效批量大小调优策略,从而明确 rank、数据集规模和模型容量如何共同决定最优批量大小。这一发现将批量大小从次要实现细节提升为一阶设计参数,有效统一了先前关于LoRA变体性能评估的分歧,提升了方法比较的可靠性。
链接: https://arxiv.org/abs/2602.09492
作者: Sangyoon Lee,Jaeho Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical gains, often on the same benchmarks. We show that these contradictions arise from a single overlooked factor: the batch size. When properly tuned, vanilla LoRA often matches the performance of more complex variants. We further propose a proxy-based, cost-efficient strategy for batch size tuning, revealing the impact of rank, dataset size, and model capacity on the optimal batch size. Our findings elevate batch size from a minor implementation detail to a first-order design parameter, reconciling prior inconsistencies and enabling more reliable evaluations of LoRA variants.
[AI-37] Computing Conditional Shapley Values Using Tabular Foundation Models
【速读】:该论文旨在解决生成式 AI 中 Shapley 值计算的高计算成本问题,尤其是在特征存在依赖关系时,传统方法需大量近似条件期望,导致效率低下。其解决方案的关键在于利用表格式基础模型(Tabular foundation models, 如 TabPFN)通过上下文学习(in-context learning)机制,在无需重新训练的情况下高效估计每个条件期望,从而显著降低计算开销。实验表明,该方法在多数场景下性能优于现有最优方法,且在性能稍逊时仍远快于其他方法。
链接: https://arxiv.org/abs/2602.09489
作者: Lars Henry Berge Olsen,Dennis Christensen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Shapley values have become a cornerstone of explainable AI, but they are computationally expensive to use, especially when features are dependent. Evaluating them requires approximating a large number of conditional expectations, either via Monte Carlo integration or regression. Until recently it has not been possible to fully exploit deep learning for the regression approach, because retraining for each conditional expectation takes too long. Tabular foundation models such as TabPFN overcome this computational hurdle by leveraging in-context learning, so each conditional expectation can be approximated without any re-training. In this paper, we compute Shapley values with multiple variants of TabPFN and compare their performance with state-of-the-art methods on both simulated and real datasets. In most cases, TabPFN yields the best performance; where it does not, it is only marginally worse than the best method, at a fraction of the runtime. We discuss further improvements and how tabular foundation models can be better adapted specifically for conditional Shapley value estimation.
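作为背景,下面给出条件 Shapley 值的标准定义;其中每个 v(S) 都是一个条件期望,正是论文用 TabPFN 的 in-context learning 去近似的对象。记号为通用定义,非论文新引入的符号。

```latex
% 条件 Shapley 值的标准定义:M 为特征集合,x_S^* 为待解释样本在子集 S 上的取值。
\phi_j \;=\; \sum_{S \subseteq \mathcal{M}\setminus\{j\}}
       \frac{|S|!\,\big(|\mathcal{M}|-|S|-1\big)!}{|\mathcal{M}|!}
       \Big(v(S\cup\{j\}) - v(S)\Big),
\qquad
v(S) \;=\; \mathbb{E}\big[f(\mathbf{x}) \,\big|\, \mathbf{x}_S = \mathbf{x}_S^{*}\big]
```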
[AI-38] Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models
【速读】:该论文旨在解决长思维链(Long CoTs)在多模态推理模型中因冗余步骤过多而导致的推理效率低下问题,同时克服现有压缩方法可能破坏视觉-文本对齐信息完整性及缺乏可解释性的局限。其解决方案的关键在于提出一种可解释的多模态思维链压缩方法(XMCC),将压缩过程建模为一个通过强化学习优化的序列决策过程,能够在保留关键推理步骤和答案正确性的前提下有效缩短推理轨迹,并生成自然语言形式的压缩决策解释。
链接: https://arxiv.org/abs/2602.09485
作者: Yizhi Wang,Linan Yue,Min-Ling Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Long chains of thought (Long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these Long CoTs are often excessively lengthy and contain redundant reasoning steps, which can hinder inference efficiency. Compressing these long CoTs is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC can effectively shorten reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also provides explainable explanations, validating its effectiveness.
[AI-39] SpotAgent : Grounding Visual Geo-localization in Large Vision-Language Models through Agent ic Reasoning
【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在真实场景下进行地理定位(geo-localization)时面临的挑战,即当视觉线索稀疏、长尾分布且高度模糊时,模型常产生自信但缺乏依据的错误预测。其核心解决方案是提出SpotAgent框架,将地理定位建模为一个代理式推理过程,通过外部工具(如网络搜索、地图服务)辅助验证来增强模型的可解释性和准确性。关键创新在于引入三阶段后训练流程:首先进行监督微调(Supervised Fine-Tuning, SFT)实现基础对齐;接着利用多智能体框架合成高质量轨迹进行代理冷启动(Agentic Cold Start),注入工具调用能力;最后通过强化学习(Reinforcement Learning, RL)优化推理能力,并设计空间感知动态过滤策略(Spatially-Aware Dynamic Filtering),依据空间难度优先选择学习样本以提升效率。该方法显著减少了幻觉现象,实现了高精度且可验证的地理定位结果。
链接: https://arxiv.org/abs/2602.09463
作者: Furong Jia,Ling Dai,Wenjin Deng,Fan Zhang,Chen Hu,Daxin Jiang,Yu Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model’s reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
[AI-40] P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)从符号推理向科学级推理跃迁的挑战,尤其聚焦于如何将抽象逻辑与物理现实相结合,以实现符合物理定律的可靠推理。其核心问题在于:现有模型缺乏多模态感知能力,难以从视觉信息(如竞赛题中的图示)中提取关键约束条件(如边界条件和空间对称性),从而导致推理偏离物理一致性。解决方案的关键在于提出P1-VL系列开源视觉-语言模型(Vision-Language Models, VLMs),通过融合课程强化学习(Curriculum Reinforcement Learning)稳定后训练过程,并引入代理增强机制(Agentic Augmentation)在推理阶段实现迭代自验证,从而显著提升模型在物理奥林匹克竞赛(HiPhO基准)等复杂科学任务中的表现,成为首个获得12枚金牌的开源VLM,且整体排名全球第二。
链接: https://arxiv.org/abs/2602.09443
作者: Yun Luo,Futing Wang,Qianjia Cheng,Fangchen Yu,Haodi Lei,Jianhao Yan,Chenxi Li,Jiacheng Chen,Yufeng Zhao,Haiyuan Wan,Yuchen Zhang,Shenghe Zheng,Junchi Yao,Qingyang Zhang,Haonan He,Wenxuan Zeng,Li Sheng,Chengxing Xie,Yuxin Zuo,Yizhuo Li,Yulun Wu,Rui Huang,Dongzhan Zhou,Kai Chen,Yu Qiao,Lei Bai,Yu Cheng,Ning Ding,Bowen Zhou,Peng Ye,Ganqu Cui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves the state-of-the-art performance in the open-source models. Our agent-augmented system achieves the No.2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models in STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence to better align visual perceptions with abstract physical laws for machine scientific discovery.
[AI-41] Diffusion-Guided Pretraining for Brain Graph Foundation Models
【速读】:该论文旨在解决当前基于图结构的脑信号预训练方法中存在的两个关键问题:一是现有对比学习和掩码自编码器方法通常采用随机丢弃或掩码策略进行数据增强,这种做法会破坏脑图谱中语义上有意义的连接模式,不适用于脑图(brain graphs)和超图(hypergraphs);二是常用的图级读出(graph-level readout)与重建方案无法有效捕捉全局结构信息,从而限制了所学表示的鲁棒性。其解决方案的关键在于提出一个统一的基于扩散模型(diffusion-based)的预训练框架:首先,利用扩散过程引导设计结构感知的丢弃与掩码策略,在保持脑图语义完整性的同时实现有效的预训练多样性;其次,通过扩散机制实现拓扑感知的图级读出与节点级全局重建,使图嵌入和被掩码节点能够从全局相关区域聚合信息,从而提升表征质量。
链接: https://arxiv.org/abs/2602.09437
作者: Xinxu Wei,Rong Zhou,Lifang He,Yu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages
Abstract:With the growing interest in foundation models for brain signals, graph-based pretraining has emerged as a promising paradigm for learning transferable representations from connectome data. However, existing contrastive and masked autoencoder methods typically rely on naive random dropping or masking for augmentation, which is ill-suited for brain graphs and hypergraphs as it disrupts semantically meaningful connectivity patterns. Moreover, commonly used graph-level readout and reconstruction schemes fail to capture global structural information, limiting the robustness of learned representations. In this work, we propose a unified diffusion-based pretraining framework that addresses both limitations. First, diffusion is designed to guide structure-aware dropping and masking strategies, preserving brain graph semantics while maintaining effective pretraining diversity. Second, diffusion enables topology-aware graph-level readout and node-level global reconstruction by allowing graph embeddings and masked nodes to aggregate information from globally related regions. Extensive experiments across multiple neuroimaging datasets with over 25,000 subjects and 60,000 scans involving various mental disorders and brain atlases demonstrate consistent performance improvements.
[AI-42] A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)知识产权(Intellectual Property, IP)保护难题,特别是针对未经授权的衍生模型泛滥问题。解决方案的关键在于提出一种基于安全对齐行为模式的指纹框架,利用“拒绝向量”(refusal vectors)作为模型的行为指纹——这些向量通过分析模型在处理有害与无害提示时内部表示的方向性差异提取而来,具备高度鲁棒性,能够抵抗微调、合并和量化等常见修改操作。实验表明,该指纹具有模型家族特异性,在76个后代模型的大规模识别任务中实现100%准确率,且即使在对齐破坏攻击下仍保留可检测痕迹,从而为LLM的溯源与版权保护提供了有效手段。
链接: https://arxiv.org/abs/2602.09434
作者: Zhenyu Xu,Victor S. Sheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Protecting the intellectual property of large language models (LLMs) is a critical challenge due to the proliferation of unauthorized derivative models. We introduce a novel fingerprinting framework that leverages the behavioral patterns induced by safety alignment, applying the concept of refusal vectors for LLM provenance tracking. These vectors, extracted from directional patterns in a model’s internal representations when processing harmful versus harmless prompts, serve as robust behavioral fingerprints. Our contribution lies in developing a fingerprinting system around this concept and conducting extensive validation of its effectiveness for IP protection. We demonstrate that these behavioral fingerprints are highly robust against common modifications, including finetunes, merges, and quantization. Our experiments show that the fingerprint is unique to each model family, with low cosine similarity between independently trained models. In a large-scale identification task across 76 offspring models, our method achieves 100% accuracy in identifying the correct base model family. Furthermore, we analyze the fingerprint’s behavior under alignment-breaking attacks, finding that while performance degrades significantly, detectable traces remain. Finally, we propose a theoretical framework to transform this private fingerprint into a publicly verifiable, privacy-preserving artifact using locality-sensitive hashing and zero-knowledge proofs.
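下面给出该指纹思路的一个极简 NumPy 示意:用有害与无害提示对应隐藏状态的均值之差作为「拒绝向量」,再以余弦相似度比较两个模型的指纹。示例中隐藏状态以随机数代替,维度与数据规模均为假设,仅用于说明原理,并非论文的官方实现。

```python
import numpy as np

def refusal_vector(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between hidden states of harmful vs. harmless prompts."""
    v = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def fingerprint_similarity(v_a: np.ndarray, v_b: np.ndarray) -> float:
    """Cosine similarity between two models' (normalized) refusal vectors."""
    return float(v_a @ v_b)

# Toy usage with random activations standing in for real hidden states.
rng = np.random.default_rng(0)
base = refusal_vector(rng.normal(size=(64, 768)) + 1.0, rng.normal(size=(64, 768)))
derived = base + 0.05 * rng.normal(size=768)          # e.g. a fine-tuned offspring model
derived /= np.linalg.norm(derived)
unrelated = refusal_vector(rng.normal(size=(64, 768)), rng.normal(size=(64, 768)))
print(fingerprint_similarity(base, derived), fingerprint_similarity(base, unrelated))
```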
[AI-43] Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime
【速读】:该论文旨在解决生成式 AI (Generative AI) 系统从被动助手向具备自主执行能力的智能体演进过程中,传统安全边界失效的问题。随着 AI 驱动的动作不可逆、执行速度快且可能源自被攻破的编排层,现有日志聚合、边界防御和事后取证等安全范式已无法提供有效保护。其解决方案的关键在于提出 Autonomous Action Runtime Management (AARM),一种面向运行时的开放规范,通过在动作执行前拦截、积累会话上下文、评估策略与意图一致性、强制授权决策并记录防篡改凭证,将“动作执行”作为稳定的新的安全边界。AARM 不依赖特定模型、框架或厂商,定义了可区分禁止类、条件拒绝类和条件允许类动作的分类体系,并提供四种具有不同信任属性的实现架构(协议网关、SDK 插桩、内核 eBPF 和厂商集成),确保系统在复杂威胁场景下仍能保障行为可控与可追溯。
链接: https://arxiv.org/abs/2602.09433
作者: Herman Errico
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As artificial intelligence systems evolve from passive assistants into autonomous agents capable of executing consequential actions, the security boundary shifts from model outputs to tool execution. Traditional security paradigms - log aggregation, perimeter defense, and post-hoc forensics - cannot protect systems where AI-driven actions are irreversible, execute at machine speed, and originate from potentially compromised orchestration layers. This paper introduces Autonomous Action Runtime Management (AARM), an open specification for securing AI-driven actions at runtime. AARM defines a runtime security system that intercepts actions before execution, accumulates session context, evaluates against policy and intent alignment, enforces authorization decisions, and records tamper-evident receipts for forensic reconstruction. We formalize a threat model addressing prompt injection, confused deputy attacks, data exfiltration, and intent drift. We introduce an action classification framework distinguishing forbidden, context-dependent deny, and context-dependent allow actions. We propose four implementation architectures - protocol gateway, SDK instrumentation, kernel eBPF, and vendor integration - with distinct trust properties, and specify minimum conformance requirements for AARM-compliant systems. AARM is model-agnostic, framework-agnostic, and vendor-neutral, treating action execution as the stable security boundary. This specification aims to establish industry-wide requirements before proprietary fragmentation forecloses interoperability.
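下面用一段假设性的 Python 伪实现示意 AARM 规范描述的控制流:拦截动作后先按「禁止 / 条件拒绝 / 条件允许」分类,再结合会话上下文(如是否含不可信内容)做授权决策,并生成可审计的凭证。其中的工具名、规则与字段均为虚构示例,与规范的具体条款无关。

```python
from dataclasses import dataclass, field
from enum import Enum
import hashlib, json, time

class ActionClass(Enum):
    FORBIDDEN = "forbidden"
    CONTEXT_DENY = "context_dependent_deny"
    CONTEXT_ALLOW = "context_dependent_allow"

@dataclass
class Receipt:
    action: dict
    decision: str
    context_digest: str
    timestamp: float = field(default_factory=time.time)

def classify(action: dict) -> ActionClass:
    # Hypothetical classification rules; a real deployment would load these from policy.
    if action["tool"] in {"delete_database", "wire_transfer"}:
        return ActionClass.FORBIDDEN
    if action["tool"] in {"send_email", "http_post"}:
        return ActionClass.CONTEXT_DENY      # deny unless context clears it
    return ActionClass.CONTEXT_ALLOW         # allow unless context flags it

def authorize(action: dict, session_context: list) -> Receipt:
    cls = classify(action)
    tainted = any(evt.get("untrusted_content") for evt in session_context)
    if cls is ActionClass.FORBIDDEN:
        decision = "deny"
    elif cls is ActionClass.CONTEXT_DENY:
        decision = "deny" if tainted else "allow"
    else:
        decision = "deny" if (tainted and action.get("external_effect")) else "allow"
    digest = hashlib.sha256(json.dumps(session_context, sort_keys=True).encode()).hexdigest()
    return Receipt(action=action, decision=decision, context_digest=digest)

ctx = [{"event": "read_webpage", "untrusted_content": True}]
print(authorize({"tool": "send_email", "external_effect": True}, ctx).decision)  # deny
```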
[AI-44] Sci-VLA: Agentic VLA Inference Plugin for Long-Horizon Tasks in Scientific Experiments
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在科学实验中执行长时程任务时的局限性问题,即VLA模型虽能可靠完成训练阶段见过的原子操作(atomic tasks),但在面对由这些原子操作重新组合而成的复合任务(composite tasks)时表现不佳,其根本原因在于训练与推理阶段之间的分布差异导致模型无法执行必要的过渡操作(transitional operations)。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的代理式推理插件(Agentic VLA Inference Plugin),该插件在执行序列化操作任务时主动介入,通过显式的过渡推理生成机器人动作代码,从而引导VLA模型跨越缺失的过渡步骤,实现无需额外训练即可稳定执行复合科学工作流的能力。此方法仅依赖推理阶段干预,具备计算高效、数据高效的特点,适用于开放场景下的长时程机器人实验室任务。
链接: https://arxiv.org/abs/2602.09430
作者: Yiwen Pang,Bo Zhou,Changjin Li,Xuanhao Wang,Shengxiang Xu,Deng-Bao Wang,Min-Ling Zhang,Shimin Di
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robotic laboratories play a critical role in autonomous scientific discovery by enabling scalable, continuous experimental execution. Recent vision-language-action (VLA) models offer a promising foundation for robotic laboratories. However, scientific experiments typically involve long-horizon tasks composed of multiple atomic tasks, posing a fundamental challenge to existing VLA models. While VLA models fine-tuned for scientific tasks can reliably execute atomic experimental actions seen during training, they often fail to perform composite tasks formed by reordering and composing these known atomic actions. This limitation arises from a distributional mismatch between training-time atomic tasks and inference-time composite tasks, which prevents VLA models from executing necessary transitional operations between atomic tasks. To address this challenge, we propose an Agentic VLA Inference Plugin for Long-Horizon Tasks in Scientific Experiments. It introduces an LLM-based agentic inference mechanism that intervenes when executing sequential manipulation tasks. By performing explicit transition inference and generating transitional robotic action code, the proposed plugin guides VLA models through missing transitional steps, enabling reliable execution of composite scientific workflows without any additional training. This inference-only intervention makes our method computationally efficient, data-efficient, and well-suited for open-ended and long-horizon robotic laboratory tasks. We build 3D assets of scientific instruments and common scientific operating scenes within an existing simulation environment. In these scenes, we have verified that our method increases the average success rate per atomic task by 42% during inference. Furthermore, we show that our method can be easily transferred from the simulation to real scientific laboratories.
[AI-45] Accelerating Post-Quantum Cryptography via LLM-Driven Hardware-Software Co-Design
【速读】:该论文旨在解决后量子密码学(Post-quantum Cryptography, PQC)算法在硬件实现中计算复杂度高、设计效率低的问题,尤其是在FPGA上进行软硬件协同设计时面临的开发周期长、优化难度大等挑战。解决方案的关键在于提出一种基于大型语言模型(Large Language Models, LLMs)的新框架,通过LLM自动分析PQC算法(以FALCON数字签名方案为例),识别性能关键组件,并生成可用于FPGA实现的候选硬件描述;实验表明,人机协作下由LLM生成的加速器可在计算密集型核上实现最高2.6倍的执行速度提升,同时缩短关键路径,尽管在资源利用率和功耗方面存在权衡,但显著降低了设计迭代成本与开发时间,为PQC加速器的快速适应性设计提供了新范式。
链接: https://arxiv.org/abs/2602.09410
作者: Yuchao Liao,Tosiron Adegbija,Roman Lysecky
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Accepted at the 27th International Symposium on Quality Electronic Design (ISQED 2026)
Abstract:Post-quantum cryptography (PQC) is crucial for securing data against emerging quantum threats. However, its algorithms are computationally complex and difficult to implement efficiently on hardware. In this paper, we explore the potential of Large Language Models (LLMs) to accelerate the hardware-software co-design process for PQC, with a focus on the FALCON digital signature scheme. We present a novel framework that leverages LLMs to analyze PQC algorithms, identify performance-critical components, and generate candidate hardware descriptions for FPGA implementation. We present the first quantitative comparison between LLM-driven synthesis and conventional HLS-based approaches for low-level compute-intensive kernels in FALCON, showing that human-in-the-loop LLM-generated accelerators can achieve up to 2.6x speedup in kernel execution time with shorter critical paths, while highlighting trade-offs in resource utilization and power consumption. Our results suggest that LLMs can minimize design effort and development time by automating FPGA accelerator design iterations for PQC algorithms, offering a promising new direction for rapid and adaptive PQC accelerator design on FPGAs.
[AI-46] Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning
【速读】:该论文旨在解决流式强化学习(Streaming Reinforcement Learning, SRL)中因缺乏经验回放缓冲区(replay buffer)而导致的样本效率低下问题,尤其在仅利用单次更新即丢弃过渡数据的场景下,价值函数损失难以从瞬时观测中提取有意义的表征。其解决方案的关键在于将自预测表征(Self-Predictive Representations, SPR)引入流式框架,并通过引入相对于动量目标的正交梯度更新机制,缓解由于流式采样高度相关性所引发的训练不稳定性和梯度冲突问题。该方法在Atari、MinAtar和Octax基准上显著优于现有流式基线,且通过潜在空间分析(如t-SNE可视化和有效秩测量)验证了其能学习更丰富的表征,从而缩小无回放缓冲区带来的性能差距,同时保持对少量CPU核心的高效训练能力。
链接: https://arxiv.org/abs/2602.09396
作者: Nilaksh,Antoine Clavaud,Mathieu Reymond,François Rivest,Sarath Chandar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures
Abstract:In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
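摘要提到的「解决梯度冲突」可以用常见的投影式冲突消解(PCGrad 风格)来直观理解:当辅助(SPR)梯度与价值损失梯度方向相反时,去掉其冲突分量。以下 NumPy 草图仅为该思路的示意,论文中针对动量目标的正交更新细节可能不同。

```python
import numpy as np

def resolve_conflict(g_main: np.ndarray, g_aux: np.ndarray) -> np.ndarray:
    """If the auxiliary (SPR) gradient conflicts with the value-loss gradient,
    drop its component along the conflicting direction (PCGrad-style projection)."""
    dot = g_aux @ g_main
    if dot < 0.0:
        g_aux = g_aux - dot / (g_main @ g_main + 1e-12) * g_main
    return g_aux

def combined_update(g_main, g_aux, aux_weight=1.0):
    return g_main + aux_weight * resolve_conflict(g_main, g_aux)

# Toy check: the conflicting component of the auxiliary gradient is removed.
g_td = np.array([1.0, 0.0])
g_spr = np.array([-1.0, 1.0])
print(combined_update(g_td, g_spr))   # [1., 1.]
```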
[AI-47] LLMAC: A Global and Explainable Access Control Framework with Large Language Model
【速读】:该论文旨在解决现代业务组织在面对复杂、动态且情境依赖的访问控制需求时,传统访问控制方法(如基于角色的访问控制 RBAC、基于属性的访问控制 ABAC 和自主访问控制 DAC)难以有效管理的问题。解决方案的关键在于提出一种统一的新方法 LLMAC,利用大型语言模型(Large Language Models, LLMs)整合多种访问控制机制,构建一个兼具高准确率与可解释性的综合系统;实验表明,基于 Mistral 7B 训练的模型在合成数据集上达到 98.5% 的准确率,显著优于传统方法,并能提供清晰的人类可读决策解释,具备实际部署可行性。
链接: https://arxiv.org/abs/2602.09392
作者: Sharif Noor Zisad,Ragib Hasan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper is accepted and presented in IEEE Consumer Communications Networking Conference (CCNC 2026)
Abstract:Today’s business organizations need access control systems that can handle complex, changing security requirements that go beyond what traditional methods can manage. Current approaches, such as Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), and Discretionary Access Control (DAC), were designed for specific purposes. They cannot effectively manage the dynamic, situation-dependent workflows that modern systems require. In this research, we introduce LLMAC, a new unified approach using Large Language Models (LLMs) to combine these different access control methods into one comprehensive, understandable system. We used an extensive synthetic dataset that represents complex real-world scenarios, including policies for ownership verification, version management, workflow processes, and dynamic role separation. Using Mistral 7B, our trained LLM model achieved outstanding results with 98.5% accuracy, significantly outperforming traditional methods (RBAC: 14.5%, ABAC: 58.5%, DAC: 27.5%) while providing clear, human readable explanations for each decision. Performance testing shows that the system can be practically deployed with reasonable response times and computing resources.
[AI-48] Image Quality in the Era of Artificial Intelligence
【速读】:该论文旨在解决生成式 AI (Generative AI) 在放射影像重建与增强过程中可能引入的新故障模式,以及由此加剧的图像感知质量与实际信息含量之间的脱节问题。解决方案的关键在于提升使用者对 AI 重建/增强技术局限性的认知,从而在充分受益于该技术的同时,有效降低潜在风险,确保其安全、高效的应用。
链接: https://arxiv.org/abs/2602.09347
作者: Jana G. Delfino,Jason L. Granstedt,Frank W. Samuelson,Robert Ochs,Krishna Juluru
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures
Abstract:Artificial intelligence (AI) is being deployed within radiology at a rapid pace. AI has proven an excellent tool for reconstructing and enhancing images that appear sharper, smoother, and more detailed, can be acquired more quickly, and can be reviewed by clinicians more rapidly. However, incorporation of AI also introduces new failure modes and can exacerbate the disconnect between perceived quality of an image and information content of that image. Understanding the limitations of AI-enabled image reconstruction and enhancement is critical for safe and effective use of the technology. Hence, the purpose of this communication is to bring awareness to limitations when AI is used to reconstruct or enhance a radiological image, with the goal of enabling users to reap benefits of the technology while minimizing risks.
[AI-49] AgentCgroup: Understanding and Controlling OS Resources of AI Agents
【速读】:该论文旨在解决多租户云环境中生成式 AI (Generative AI) 编码代理在沙箱容器内执行工具调用时,因操作系统级资源动态特性(如内存波动剧烈、任务间资源需求不可预测)导致的性能瓶颈与资源隔离失效问题。其核心挑战在于现有资源控制机制(如容器级 cgroup 策略)在粒度、响应速度和适应性上与工具调用级别的细粒度、突发性资源需求存在三重不匹配:粒度不匹配(容器级 vs. 工具调用级)、响应不匹配(用户空间反应慢于亚秒级突发)、适配不匹配(基于历史预测无法应对状态依赖的非确定性执行)。解决方案的关键是提出 AgentCgroup,一个基于 eBPF 的资源控制器,通过三层 cgroup 结构对齐工具调用边界、利用 sched_ext 和 memcg_bpf_ops 在内核层强制执行策略,并结合内核运行时监控驱动自适应调度与内存限制策略,从而实现更精细的资源隔离和减少浪费。
链接: https://arxiv.org/abs/2602.09345
作者: Yusheng Zheng,Jiakun Fan,Quanzhi Fu,Yiwei Yang,Wei Zhang,Andi Quinn
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are increasingly deployed in multi-tenant cloud environments, where they execute diverse tool calls within sandboxed containers, each call with distinct resource demands and rapid fluctuations. We present a systematic characterization of OS-level resource dynamics in sandboxed AI coding agents, analyzing 144 software engineering tasks from the SWE-rebench benchmark across two LLM models. Our measurements reveal that (1) OS-level execution (tool calls, container and agent initialization) accounts for 56-74% of end-to-end task latency; (2) memory, not CPU, is the concurrency bottleneck; (3) memory spikes are tool-call-driven with up to a 15.4x peak-to-average ratio; and (4) resource demands are highly unpredictable across tasks, runs, and models. Comparing these characteristics against serverless, microservice, and batch workloads, we identify three mismatches in existing resource controls: a granularity mismatch (container-level policies vs. tool-call-level dynamics), a responsiveness mismatch (user-space reaction vs. sub-second unpredictable bursts), and an adaptability mismatch (history-based prediction vs. non-deterministic stateful execution). We propose AgentCgroup, an eBPF-based resource controller that addresses these mismatches through hierarchical cgroup structures aligned with tool-call boundaries, in-kernel enforcement via sched_ext and memcg_bpf_ops, and runtime-adaptive policies driven by in-kernel monitoring. Preliminary evaluation demonstrates improved multi-tenant isolation and reduced resource waste.
[AI-50] Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)中基于多数投票(majority voting)的输出聚合方法在面对“虚构共识”(confabulation consensus)时的脆弱性问题,即当多个代理因共享相关偏差而得出相同错误推理路径时,传统投票机制无法识别并纠正此类错误。解决方案的关键在于提出AgentAuditor框架,其通过构建显式表示代理推理路径分歧与一致性的“推理树”(Reasoning Tree),将全局仲裁转化为在关键分歧点进行局部验证的路径搜索策略;同时引入反共识偏好优化(Anti-Consensus Preference Optimization, ACPO),训练裁判模型在多数失败案例中优先选择基于证据的少数意见,从而提升整体推理准确性。
链接: https://arxiv.org/abs/2602.09341
作者: Wei Yang,Shixuan Li,Heng Ping,Peiyu Zhang,Paul Bogdan,Jesse Thomason
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to MAS setting, and we find across 5 popular settings that it yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
[AI-51] Measuring Dataset Diversity from a Geometric Perspective
【速读】:该论文旨在解决现有多样性度量方法(如特征空间分散度和度量空间幅度)主要关注分布变化或熵值,而忽视数据集几何结构的问题。其解决方案的关键在于引入基于拓扑数据分析(Topological Data Analysis, TDA)与持久景观(Persistence Landscapes, PLs)的框架,通过提取和量化数据的几何特征,实现对数据多样性的理论严谨且可解释的测量,从而将多样性直接关联到数据的底层几何结构,为数据集构建、增强与评估提供基础性工具。
链接: https://arxiv.org/abs/2602.09340
作者: Yang Ba,Mohammad Sadeq Abolhasani,Michelle V Mancenido,Rong Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful, reliable, and interpretable, directly linking data diversity to its underlying geometry and offering a foundational tool for dataset construction, augmentation, and evaluation.
[AI-52] SnareNet: Flexible Repair Layers for Neural Networks with Hard Constraints
【速读】:该论文旨在解决神经网络在作为代理求解器或控制策略时,其输出可能违反物理、操作或安全约束的问题。解决方案的关键在于提出SnareNet架构,该架构通过添加一个可微的修复层(repair layer),在约束映射的值域空间中进行迭代导航,将输出逐步引导至满足输入相关非线性约束的可行区域,并生成符合用户指定容差的可行解。此外,为稳定端到端训练,论文引入自适应松弛机制,设计了一个初始时宽松、随训练逐渐收缩至真实可行集的松弛可行集,从而实现早期探索与后期严格可行性的平衡。
链接: https://arxiv.org/abs/2602.09317
作者: Ya-Chi Chu,Alkiviades Boukas,Madeleine Udell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Neural networks are increasingly used as surrogate solvers and control policies, but unconstrained predictions can violate physical, operational, or safety requirements. We propose SnareNet, a feasibility-controlled architecture for learning mappings whose outputs must satisfy input-dependent nonlinear constraints. SnareNet appends a differentiable repair layer that navigates in the constraint map’s range space, steering iterates toward feasibility and producing a repaired output that satisfies constraints to a user-specified tolerance. To stabilize end-to-end training, we introduce adaptive relaxation, which designs a relaxed feasible set that snares the neural network at initialization and shrinks it into the feasible set, enabling early exploration and strict feasibility later in training. On optimization-learning and trajectory planning benchmarks, SnareNet consistently attains improved objective quality while satisfying constraints more reliably than prior work.
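下面给出「可微修复层」一种常见做法的 PyTorch 草图:对约束违反量的平方惩罚做若干步梯度下降,直到违反量低于给定容差。约束形式 g(y)<=0、步数与步长均为示意性假设,并非 SnareNet 的具体实现(例如其在约束映射值域空间中的导航与自适应松弛未在此体现)。

```python
import torch

def repair_layer(y0, constraint_fn, steps=50, lr=0.1, tol=1e-3):
    """Differentiable repair: take gradient steps on the squared constraint violation
    until the max violation falls below `tol`. constraint_fn(y) returns g(y); feasible iff g(y) <= 0."""
    y = y0
    for _ in range(steps):
        violation = torch.relu(constraint_fn(y))
        if violation.max() <= tol:
            break
        penalty = 0.5 * (violation ** 2).sum()
        grad = torch.autograd.grad(penalty, y, create_graph=True)[0]  # keep graph for end-to-end training
        y = y - lr * grad
    return y

# Toy example: repair a prediction onto the unit ball { y : ||y||^2 - 1 <= 0 }.
y_pred = torch.tensor([1.5, 1.5], requires_grad=True)
y_fixed = repair_layer(y_pred, lambda y: (y ** 2).sum() - 1.0)
print(y_fixed, (y_fixed ** 2).sum())
```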
[AI-53] Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory
【速读】:该论文旨在解决现有优化算法在神经网络训练中数据效率差异的机制问题,特别是针对结构化优化器(如Shampoo和Muon)与逐元素优化器(如Adam和Signum)之间的相对性能差异缺乏清晰理论解释的问题。其关键解决方案在于揭示Shampoo对权重矩阵(weight matrices)的更新可被分解为一种适配后的Muon更新形式,并证明Shampoo的优越性完全源于其对参数形状(parameter shapes)的利用,而非传统观点所强调的方差适应或白化(whitening)等机制;进一步指出Shampoo的更新在期望意义上是时间平均半正交的(time-averaged semi-orthogonal),从而提供了一个更本质且避免已有解释缺陷的新视角。
链接: https://arxiv.org/abs/2602.09314
作者: Runa Eschenhagen,Anna Cai,Tsung-Hsien Lee,Hao-Jun Michael Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on language models, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam’s advantage over Signum. We show that Shampoo’s update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo’s benefits can be exclusively attributed to its application to weight matrices, challenging interpretations agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related interpretations based on variance adaptation and whitening: rather than enforcing semi-orthogonality as in spectral descent, Shampoo’s updates are time-averaged semi-orthogonal in expectation.
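为直观展示摘要中谱下降与预条件的关系,下面用 NumPy 写出教科书式的 Muon(对梯度做半正交化)与矩阵形式 Shampoo(左右二阶统计量的 -1/4 次幂预条件)单步更新;数值细节(阻尼、根的计算方式、动量等)为简化假设。

```python
import numpy as np

def muon_update(grad: np.ndarray) -> np.ndarray:
    """Spectral-descent / Muon-style step: replace the gradient's singular values by 1."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt                      # semi-orthogonal update direction

def shampoo_update(grad, left_stat, right_stat, eps=1e-8):
    """One Shampoo step for a weight matrix: precondition the gradient with the
    accumulated left/right second-moment statistics raised to the power -1/4."""
    left_stat += grad @ grad.T
    right_stat += grad.T @ grad
    def inv_fourth_root(m):
        w, q = np.linalg.eigh(m + eps * np.eye(m.shape[0]))
        return q @ np.diag(np.maximum(w, eps) ** -0.25) @ q.T
    return inv_fourth_root(left_stat) @ grad @ inv_fourth_root(right_stat), left_stat, right_stat

g = np.random.default_rng(0).normal(size=(4, 6))
L, R = np.zeros((4, 4)), np.zeros((6, 6))
upd, L, R = shampoo_update(g, L, R)
print(np.linalg.svd(muon_update(g), compute_uv=False))   # all singular values ~1
print(np.linalg.svd(upd, compute_uv=False))              # nearly semi-orthogonal after one step
```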
[AI-54] Empowering Contrastive Federated Sequential Recommendation with LLMs
【速读】:该论文旨在解决联邦序列推荐(Federated Sequential Recommendation, FedSeqRec)中因用户数据分散、噪声干扰及交互日志同质化导致的模型质量受限问题。现有方法通过人工数据增强或服务器端约束进行补偿,但存在语义多样性不足或系统开销过高的缺陷。其解决方案的关键在于提出一种参数隔离的联邦架构LUMOS,利用本地大语言模型(Large Language Models, LLMs)作为语义生成器,从每个用户历史中私有地构建三类互补序列变体:未来导向轨迹(future-oriented trajectories)、语义等价重述(semantically equivalent rephrasings)和偏好不一致反事实样本(preference-inconsistent counterfactuals),并通过三视图对比优化机制联合编码这些合成序列,从而在不暴露敏感信息的前提下实现更丰富的表示学习,显著提升推荐性能与鲁棒性。
链接: https://arxiv.org/abs/2602.09306
作者: Thi Minh Chau Nguyen,Minh Hieu Nguyen,Duc Anh Nguyen,Xuan Huong Tran,Thanh Trung Huynh,Quoc Viet Hung Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Federated sequential recommendation (FedSeqRec) aims to perform next-item prediction while keeping user data decentralised, yet model quality is frequently constrained by fragmented, noisy, and homogeneous interaction logs stored on individual devices. Many existing approaches attempt to compensate through manual data augmentation or additional server-side constraints, but these strategies either introduce limited semantic diversity or increase system overhead. To overcome these challenges, we propose LUMOS, a parameter-isolated FedSeqRec architecture that integrates large language models (LLMs) as local semantic generators. Instead of sharing gradients or auxiliary parameters, LUMOS privately invokes an on-device LLM to construct three complementary sequence variants from each user history: (i) future-oriented trajectories that infer plausible behavioural continuations, (ii) semantically equivalent rephrasings that retain user intent while diversifying interaction patterns, and (iii) preference-inconsistent counterfactuals that serve as informative negatives. These synthesized sequences are jointly encoded within the federated backbone through a tri-view contrastive optimisation scheme, enabling richer representation learning without exposing sensitive information. Experimental results across three public benchmarks show that LUMOS achieves consistent gains over competitive centralised and federated baselines on HR@20 and NDCG@20. In addition, the use of semantically grounded positive signals and counterfactual negatives improves robustness under noisy and adversarial environments, even without dedicated server-side protection modules. Overall, this work demonstrates the potential of LLM-driven semantic generation as a new paradigm for advancing privacy-preserving federated recommendation.
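论文中的「三视图对比优化」可以用一个多正例 InfoNCE 损失来近似示意:真实序列表示被拉近 LLM 生成的正向变体、推离偏好不一致的反事实表示。以下 PyTorch 草图中的温度、正例个数与损失形式均为假设,非 LUMOS 的官方实现。

```python
import torch
import torch.nn.functional as F

def tri_view_contrastive(anchor, positives, counterfactual, temperature=0.1):
    """InfoNCE-style loss: the real sequence embedding (anchor) is pulled toward
    LLM-generated future/rephrased views (positives) and pushed away from the
    preference-inconsistent counterfactual embedding (negative)."""
    anchor = F.normalize(anchor, dim=-1)                      # (B, d)
    positives = F.normalize(positives, dim=-1)                # (B, P, d)
    counterfactual = F.normalize(counterfactual, dim=-1)      # (B, d)
    pos_sim = torch.einsum("bd,bpd->bp", anchor, positives) / temperature
    neg_sim = (anchor * counterfactual).sum(-1, keepdim=True) / temperature
    logits = torch.cat([pos_sim, neg_sim], dim=-1)            # positives first, negative last
    log_prob = logits.log_softmax(dim=-1)
    return -(log_prob[:, : positives.size(1)].mean())

B, P, d = 8, 2, 64
loss = tri_view_contrastive(torch.randn(B, d), torch.randn(B, P, d), torch.randn(B, d))
print(loss.item())
```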
[AI-55] STaR: Scalable Task-Conditioned Retrieval for Long-Horizon Multimodal Robot Memory
【速读】:该论文旨在解决移动机器人在长时间、开放动态场景中如何构建可扩展的长时记忆系统,以支持对开放式指令进行多粒度的规划、检索与推理,并生成精确且可执行的导航答案。其核心挑战在于既要保持环境语义的细粒度信息(如物体属性、空间关系和动态事件),又要实现对未见查询的泛化能力。解决方案的关键在于提出STaR框架:(i) 构建一个任务无关的多模态长期记忆,能够保留环境语义并适应新查询;(ii) 引入基于信息瓶颈原理的可扩展任务条件检索算法(Scalable Task-Conditioned Retrieval),从长期记忆中提取紧凑、非冗余且信息丰富的候选记忆集合用于上下文推理,从而提升导航准确性和系统可扩展性。
链接: https://arxiv.org/abs/2602.09255
作者: Mingfeng Yuan,Hao Zhang,Mahan Mohammadi,Runhao Li,Jinjun Shan,Steven L. Waslander
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile robots are often deployed over long durations in diverse open, dynamic scenes, including indoor settings such as warehouses and manufacturing facilities, and outdoor settings such as agricultural and roadway operations. A core challenge is to build a scalable long-horizon memory that supports an agentic workflow for planning, retrieval, and reasoning over open-ended instructions at variable granularity, while producing precise, actionable answers for navigation. We present STaR, an agentic reasoning framework that (i) constructs a task-agnostic, multimodal long-term memory that generalizes to unseen queries while preserving fine-grained environmental semantics (object attributes, spatial relations, and dynamic events), and (ii) introduces a Scalable Task-Conditioned Retrieval algorithm based on the Information Bottleneck principle to extract from long-term memory a compact, non-redundant, information-rich set of candidate memories for contextual reasoning. We evaluate STaR on NaVQA (mixed indoor/outdoor campus scenes) and WH-VQA, a customized warehouse benchmark with many visually similar objects built with Isaac Sim, emphasizing contextual reasoning. Across the two datasets, STaR consistently outperforms strong baselines, achieving higher success rates and markedly lower spatial error. We further deploy STaR on a real Husky wheeled robot in both indoor and outdoor environments, demonstrating robust long-horizon reasoning, scalability, and practical utility.
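对「从长期记忆中选出紧凑、非冗余且信息丰富的候选集合」这一目标,下面给出一个贪心的相关性-冗余度折中(MMR 风格)草图作为直观替代;论文实际采用的信息瓶颈目标与此不同,此处仅供理解检索阶段要做的取舍。

```python
import numpy as np

def select_memories(query_emb, memory_embs, k=5, balance=0.7):
    """Greedy selection of a compact, non-redundant memory subset:
    maximize relevance to the query while penalizing similarity to already-picked items
    (an MMR-style stand-in for the paper's information-bottleneck objective)."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    relevance = m @ q
    selected = []
    for _ in range(min(k, len(m))):
        if selected:
            redundancy = (m @ m[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(m))
        score = balance * relevance - (1 - balance) * redundancy
        score[selected] = -np.inf          # never pick the same memory twice
        selected.append(int(score.argmax()))
    return selected

rng = np.random.default_rng(0)
print(select_memories(rng.normal(size=32), rng.normal(size=(100, 32)), k=5))
```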
[AI-56] Do Neural Networks Lose Plasticity in a Gradually Changing World?
【速读】:该论文旨在解决持续学习(continual learning)中普遍存在的“可塑性丧失”(loss of plasticity)问题,即神经网络在不断学习新任务时逐渐丧失适应能力的现象。现有研究多基于突变式任务切换的假设环境,难以反映真实世界中任务渐进变化的特点。论文的关键解决方案是模拟一个逐步变化的环境,通过输入/输出插值和任务采样来实现平滑的任务过渡,理论与实证分析表明,这种渐进式环境设计可显著缓解可塑性丧失现象,从而提升模型在动态场景下的持续学习能力。
链接: https://arxiv.org/abs/2602.09234
作者: Tianhui Liu,Lili Mou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual learning has become a trending topic in machine learning. Recent studies have discovered an interesting phenomenon called loss of plasticity, referring to neural networks gradually losing the ability to learn new tasks. However, existing plasticity research largely relies on contrived settings with abrupt task transitions, which often do not reflect real-world environments. In this paper, we propose to investigate a gradually changing environment, and we simulate this by input/output interpolation and task sampling. We perform theoretical and empirical analysis, showing that the loss of plasticity is an artifact of abrupt tasks changes in the environment and can be largely mitigated if the world changes gradually.
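「渐进变化环境」的一种最直接的模拟方式是:按随训练步数增长的混合系数,对两个任务的输入/输出做插值采样。以下 NumPy 草图即为这一思路的示意,论文中的插值与任务采样细节可能不同。

```python
import numpy as np

def gradually_changing_batch(task_a, task_b, step, total_steps, batch_size=32, rng=None):
    """Sample a training batch from a world that drifts smoothly from task A to task B:
    inputs/targets are interpolated with a mixing coefficient that grows with time,
    instead of switching abruptly at a task boundary."""
    rng = rng or np.random.default_rng()
    lam = step / total_steps                      # 0 -> pure task A, 1 -> pure task B
    xa, ya = task_a
    xb, yb = task_b
    idx = rng.integers(0, len(xa), size=batch_size)
    x = (1 - lam) * xa[idx] + lam * xb[idx]
    y = (1 - lam) * ya[idx] + lam * yb[idx]
    return x, y

# Toy tasks: same inputs, different linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
task_a = (X, X @ rng.normal(size=8))
task_b = (X, X @ rng.normal(size=8))
xb_, yb_ = gradually_changing_batch(task_a, task_b, step=250, total_steps=1000, rng=rng)
print(xb_.shape, yb_.shape)
```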
[AI-57] MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的网络代理(web agent)在执行在线任务时,因间接提示注入攻击(indirect prompt injection attack)而面临的安全风险问题。这类攻击通过嵌入不可信网页内容中的恶意指令,诱导代理偏离用户意图,从而危害数据保密性、完整性与可用性。现有评估方法受限于固定攻击模板、人工选择的注入表面或狭窄场景,难以模拟真实环境中自适应且隐蔽的攻击行为。论文提出的解决方案是MUZZLE——一个自动化代理框架,其关键创新在于利用代理执行轨迹自动识别高显著性的注入表面,并基于反馈迭代生成上下文感知的恶意指令,实现对安全弱点的动态探测与适应性攻击。该方法无需人工干预即可有效发现37种新攻击,涵盖跨应用注入和代理定制钓鱼等新型策略,显著提升了对复杂现实威胁的评估能力。
链接: https://arxiv.org/abs/2602.09222
作者: Georgios Syros,Evan Rose,Brian Grinstead,Christoph Kerschbaumer,William Robertson,Cristina Nita-Rotaru,Alina Oprea
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users’ behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent’s trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent’s observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.
[AI-58] A Lightweight Multi-View Approach to Short-Term Load Forecasting
【速读】:该论文旨在解决时间序列预测中因模型复杂度高而导致的过拟合与预测不稳定问题,尤其是在历史数据随时间推移相关性减弱时的表现不佳。其核心解决方案是提出一种轻量级多视角短时负荷预测方法,关键在于引入单值嵌入(single-value embeddings)和缩放的时间范围输入(scaled time-range input),以高效捕捉时序相关特征;同时设计嵌入丢弃机制(embedding dropout)来避免对特定特征的过度依赖,从而提升模型鲁棒性和可解释性。该方法在参数显著减少的情况下仍保持优异性能,且在噪声或稀疏数据场景下表现稳定。
链接: https://arxiv.org/abs/2602.09220
作者: Julien Guité-Vinet,Alexandre Blondin Massé,Éric Beaudry
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting is a critical task across domains such as energy, finance, and meteorology, where accurate predictions enable informed decision-making. While transformer-based and large-parameter models have recently achieved state-of-the-art results, their complexity can lead to overfitting and unstable forecasts, especially when older data points become less relevant. In this paper, we propose a lightweight multi-view approach to short-term load forecasting that leverages single-value embeddings and a scaled time-range input to capture temporally relevant features efficiently. We introduce an embedding dropout mechanism to prevent over-reliance on specific features and enhance interpretability. Our method achieves competitive performance with significantly fewer parameters, demonstrating robustness across multiple datasets, including scenarios with noisy or sparse data, and provides insights into the contributions of individual features to the forecast.
[AI-59] CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning
【速读】:该论文旨在解决当前基于扩散的强化学习(Diffusion-based Reinforcement Learning, RL)方法在决策过程中仅依赖统计关联而忽视因果关系的问题,这限制了模型对真正驱动高回报动作成分的识别能力。解决方案的关键在于提出一种因果引导的扩散策略框架(Causality-guided Diffusion Policy, CausalGDP),该框架首先从离线数据中学习基础扩散策略和初始因果动力学模型,以捕捉状态、动作与奖励之间的因果依赖关系;随后在实时交互中持续更新并引入因果信息作为引导信号,从而在扩散过程中聚焦于那些因果上影响未来状态和奖励的动作组件,实现更高效且可解释的策略优化。
链接: https://arxiv.org/abs/2602.09207
作者: Xiaofeng Xiao,Xiao Hu,Yang Ye,Xubo Yue
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has achieved remarkable success in a wide range of sequential decision-making problems. Recent diffusion-based policies further improve RL by modeling complex, high-dimensional action distributions. However, existing diffusion policies primarily rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, limiting their ability to identify which action components truly cause high returns. In this paper, we propose Causality-guided Diffusion Policy (CausalGDP), a unified framework that integrates causal reasoning into diffusion-based RL. CausalGDP first learns a base diffusion policy and an initial causal dynamical model from offline data, capturing causal dependencies among states, actions, and rewards. During real-time interaction, the causal information is continuously updated and incorporated as a guidance signal to steer the diffusion process toward actions that causally influence future states and rewards. By explicitly considering causality beyond association, CausalGDP focuses policy optimization on action components that genuinely drive performance improvements. Experimental results demonstrate that CausalGDP consistently achieves competitive or superior performance over state-of-the-art diffusion-based and offline RL methods, especially in complex, high-dimensional control tasks.
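「在扩散过程引入因果引导信号」通常可以理解为在反向去噪均值上加一项引导梯度(类似 classifier guidance)。以下 PyTorch 草图给出一步带引导的反向采样,其中 eps_model、guide_fn 与噪声调度均为占位假设,并非论文的因果动力学模型。

```python
import torch

@torch.no_grad()
def guided_reverse_step(a_t, t, eps_model, guide_fn, alpha, alpha_bar, sigma, scale=1.0):
    """One guided reverse-diffusion step over an action sample a_t.
    guide_fn(a, t) returns a gradient pointing toward actions the causal model deems
    beneficial for future states/rewards; it is added to the standard DDPM mean."""
    eps = eps_model(a_t, t)
    mean = (a_t - (1 - alpha[t]) / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])
    mean = mean + scale * (sigma[t] ** 2) * guide_fn(a_t, t)     # guidance injection
    noise = torch.randn_like(a_t) if t > 0 else torch.zeros_like(a_t)
    return mean + sigma[t] * noise

# Toy schedule and placeholder models, only to show the sampling loop structure.
T = 10
betas = torch.linspace(1e-4, 0.2, T)
alpha = 1 - betas
alpha_bar = torch.cumprod(alpha, dim=0)
sigma = torch.sqrt(betas)
eps_model = lambda a, t: torch.zeros_like(a)             # placeholder denoiser
guide_fn = lambda a, t: -(a - 1.0)                       # pull actions toward 1.0
a = torch.randn(4, 2)
for t in reversed(range(T)):
    a = guided_reverse_step(a, t, eps_model, guide_fn, alpha, alpha_bar, sigma)
print(a.mean())
```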
[AI-60] Gradient Residual Connections
【速读】:该论文旨在解决神经网络在逼近高频函数时性能受限的问题,特别是传统残差连接难以捕捉快速变化的函数特征。其解决方案的关键在于提出一种基于梯度信息的残差连接(gradient-based residual connection),作为对标准恒等跳跃连接(identity skip connection)的补充,利用梯度信息增强模型区分输入差异的能力,从而提升对高频率模式的逼近精度。通过在合成回归任务和单图像超分辨率任务中的实验验证,该方法显著改善了高频函数的建模能力,同时在标准图像分类与分割任务中保持与传统残差网络相当的性能,展现出良好的通用性。
链接: https://arxiv.org/abs/2602.09190
作者: Yangchen Pan,Qizhen Ying,Philip Torr,Bo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Existing work has linked properties of a function’s gradient to the difficulty of function approximation. Motivated by these insights, we study how gradient information can be leveraged to improve neural network’s ability to approximate high-frequency functions, and we propose a gradient-based residual connection as a complement to the standard identity skip connection used in residual networks. We provide simple theoretical intuition for why gradient information can help distinguish inputs and improve the approximation of functions with rapidly varying behaviour. On a synthetic regression task with a high-frequency sinusoidal ground truth, we show that conventional residual connections struggle to capture high-frequency patterns. In contrast, our gradient residual substantially improves approximation quality. We then introduce a convex combination of the standard and gradient residuals, allowing the network to flexibly control how strongly it relies on gradient information. After validating the design choices of our proposed method through an ablation study, we further validate our approach’s utility on the single-image super-resolution task, where the underlying function may be high-frequency. Finally, on standard tasks such as image classification and segmentation, our method achieves performance comparable to standard residual networks, suggesting its broad utility.
[AI-61] AIDev: Studying AI Coding Agents on GitHub
【速读】:该论文旨在解决当前软件工程领域缺乏大规模真实世界数据集的问题,尤其是关于AI编码代理(Coding Agent)在实际项目中使用情况的数据缺失。为填补这一空白,研究者提出了AIDev数据集,其关键创新在于系统性地收集并整合了来自五个主流AI编码代理(OpenAI Codex、Devin、GitHub Copilot、Cursor和Claude Code)生成的932,791个代理编写拉取请求(Agentic-PRs),覆盖116,211个GitHub仓库及72,189名开发者;同时,还提供了一个高质量子集(33,596个Agentic-PRs),包含评论、代码审查、提交记录和相关问题等丰富元数据,从而为AI采纳、开发者生产力提升以及人机协作机制等方向的研究提供了可扩展、高保真的实证基础。
链接: https://arxiv.org/abs/2602.09185
作者: Hao Li,Haoxiang Zhang,Ahmed E. Hassan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:AI coding agents are rapidly transforming software engineering by performing tasks such as feature development, debugging, and testing. Despite their growing impact, the research community lacks a comprehensive dataset capturing how these agents are used in real-world projects. To address this gap, we introduce AIDev, a large-scale dataset focused on agent-authored pull requests (Agentic-PRs) in real-world GitHub repositories. AIDev aggregates 932,791 Agentic-PRs produced by five agents: OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code. These PRs span 116,211 repositories and involve 72,189 developers. In addition, AIDev includes a curated subset of 33,596 Agentic-PRs from 2,807 repositories with over 100 stars, providing further information such as comments, reviews, commits, and related issues. This dataset offers a foundation for future research on AI adoption, developer productivity, and human-AI collaboration in the new era of software engineering.
[AI-62] n-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models
【速读】:该论文旨在解决如何在不依赖大型单一语言模型(Large Language Models, LLMs)的前提下,利用多个冻结的小型专业化语言模型(Specialized Language Models, SLMs)实现结构化推理的问题。其核心解决方案是提出“软隐状态协作”(soft hidden-state collaboration),即通过可训练的注意力接口将多个异构的冻结SLM专家基于其内部表示进行潜在集成,从而在无需微调专家模型的情况下实现高效协同。实验表明,该机制在Reasoning Gym和GSM8K任务上性能可媲美强基准模型,并揭示了专家利用的双机制:简单算术任务中表现为静态专家偏好,复杂任务中则演化出集中且结构化的专家注意力模式,体现出路由机制对专家能力的动态适配与专业化分工。
链接: https://arxiv.org/abs/2602.09173
作者: Ryozo Masukawa,Sanggeon Yun,Hyunwoo Oh,SuhgHeon Jeong,Raheeb Hassa,Hanning Chen,Wenjun Huang,Mahdi Imani,Pietro Mercati,Nathaniel D. Bastian,Mohsen Imani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent progress in reinforcement learning with verifiable rewards (RLVR) shows that small, specialized language models (SLMs) can exhibit structured reasoning without relying on large monolithic LLMs. We introduce soft hidden-state collaboration, where multiple heterogeneous frozen SLM experts are integrated through their internal representations via a trainable attention interface. Experiments on Reasoning Gym and GSM8K show that this latent integration is competitive with strong single-model RLVR baselines. Ablations further reveal a dual mechanism of expert utilization: for simpler arithmetic domains, performance gains can largely be explained by static expert preferences, whereas more challenging settings induce increasingly concentrated and structured expert attention over training, indicating emergent specialization in how the router connects to relevant experts. Overall, hidden-state collaboration provides a compact mechanism for leveraging frozen experts, while offering an observational window into expert utilization patterns and their evolution under RLVR.
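「软隐状态协作」的核心是一个可训练的注意力接口:基座模型的 token 对若干冻结专家的隐状态做交叉注意力,并以残差方式注入。以下 PyTorch 草图中的维度、单头注意力与残差形式均为简化假设。

```python
import torch
import torch.nn as nn

class ExpertAttentionInterface(nn.Module):
    """Trainable attention bridge: tokens of a base model attend over the hidden states
    of several frozen expert SLMs and add the aggregated signal back as a residual."""
    def __init__(self, base_dim, expert_dims, attn_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(base_dim, attn_dim)
        self.k_projs = nn.ModuleList([nn.Linear(d, attn_dim) for d in expert_dims])
        self.v_projs = nn.ModuleList([nn.Linear(d, base_dim) for d in expert_dims])

    def forward(self, base_hidden, expert_hiddens):
        # base_hidden: (B, T, base_dim); expert_hiddens[i]: (B, T_i, expert_dims[i]), kept frozen
        q = self.q_proj(base_hidden)                                          # (B, T, A)
        keys = torch.cat([p(h) for p, h in zip(self.k_projs, expert_hiddens)], dim=1)
        values = torch.cat([p(h) for p, h in zip(self.v_projs, expert_hiddens)], dim=1)
        attn = torch.softmax(q @ keys.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return base_hidden + attn @ values                                    # residual injection

iface = ExpertAttentionInterface(base_dim=512, expert_dims=[384, 768])
out = iface(torch.randn(2, 16, 512), [torch.randn(2, 16, 384), torch.randn(2, 16, 768)])
print(out.shape)   # torch.Size([2, 16, 512])
```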
[AI-63] What do Geometric Hallucination Detection Metrics Actually Measure? ICML
【速读】:该论文旨在解决生成式 AI(Generative AI)在高后果应用场景中因幻觉(hallucination)问题导致的部署障碍,特别是当外部真实信息难以获取以验证模型输出时。研究聚焦于语言模型(LLM)内部状态中的几何信号(geometric signals),这些信号能够预测幻觉现象且对额外外部知识依赖较少。解决方案的关键在于:通过构建一个合成数据集来系统性地分离和控制与幻觉相关的不同属性(如正确性、置信度、相关性、连贯性和完整性),从而揭示不同几何统计量对应不同类型幻觉的本质差异;同时提出一种简单的归一化方法以缓解任务领域迁移(domain shift)对几何统计量的影响,在多领域设置下使AUROC指标提升34点。
链接: https://arxiv.org/abs/2602.09158
作者: Eric Yeats,John Buckheit,Sarah Scullen,Brendan Kennedy,Loc Truong,Davis Brown,Bill Kay,Cliff Joslyn,Tegan Emerson,Michael J. Henry,John Emanuello,Henry Kvinge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at the 2025 ICML Workshop on Reliable and Responsible Foundation Models
Abstract:Hallucination remains a barrier to deploying generative models in high-consequence applications. This is especially true in cases where external ground truth is not readily available to validate model outputs. This situation has motivated the study of geometric signals in the internal state of an LLM that are predictive of hallucination and require limited external knowledge. Given that there are a range of factors that can lead model output to be called a hallucination (e.g., irrelevance vs incoherence), in this paper we ask what specific properties of a hallucination these geometric statistics actually capture. To assess this, we generate a synthetic dataset which varies distinct properties of output associated with hallucination. This includes output correctness, confidence, relevance, coherence, and completeness. We find that different geometric statistics capture different types of hallucinations. Along the way we show that many existing geometric detection methods have substantial sensitivity to shifts in task domain (e.g., math questions vs. history questions). Motivated by this, we introduce a simple normalization method to mitigate the effect of domain shift on geometric statistics, leading to AUROC gains of +34 points in multi-domain settings.
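摘要提出的「简单归一化以缓解任务领域迁移」可以用按领域 z-score 标准化来理解:先在每个领域内对几何统计量去均值、除标准差,再计算 AUROC。以下 NumPy 草图(含一个基于秩的 AUROC 实现)演示了领域偏移为何会拉低混合域 AUROC,而归一化能恢复类内可分性;论文的具体归一化形式以原文为准。

```python
import numpy as np

def domain_normalize(stat: np.ndarray, domain: np.ndarray) -> np.ndarray:
    """Z-score a geometric hallucination statistic within each task domain so that
    domain shift (e.g. math vs. history questions) does not dominate the score."""
    out = np.empty_like(stat, dtype=float)
    for d in np.unique(domain):
        mask = domain == d
        mu, sd = stat[mask].mean(), stat[mask].std() + 1e-8
        out[mask] = (stat[mask] - mu) / sd
    return out

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Rank-based AUROC: probability that a positive outranks a negative (Mann-Whitney U)."""
    order = scores.argsort()
    ranks = np.empty(len(scores)); ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
domain = np.repeat([0, 1], 200)
labels = rng.integers(0, 2, size=400)
stat = labels * 1.0 + rng.normal(scale=0.5, size=400) + domain * 5.0   # large domain offset
print(auroc(stat, labels), auroc(domain_normalize(stat, domain), labels))  # normalization helps
```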
[AI-64] SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
【速读】:该论文旨在解决现有仿真环境在训练和评估家用机器人时无法充分模拟真实室内空间多样性与物理复杂性的问题,尤其是当前场景合成方法生成的房间家具稀疏、缺乏密集杂物、可动家具及物理属性,难以支撑机器人操作任务的真实测试。其解决方案的关键在于提出SceneSmith——一个分层代理式框架,通过设计者(designer)、评论者(critic)和协调者(orchestrator)三个视觉语言模型(VLM)代理之间的协同交互,实现从建筑布局到家具布置再到小物件填充的多阶段场景构建;该框架融合文本到3D生成、数据集检索和物理属性估计,显著提升了场景中物体数量(3–6倍于基线)、碰撞率低(2%)、稳定性高(96%),并经用户研究验证其真实感(92%平均得分)和提示忠实度(91%胜率)。
链接: https://arxiv.org/abs/2602.09153
作者: Nicholas Pfaff,Thomas Cohn,Sergey Zakharov,Rick Cory,Russ Tedrake
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL
Abstract:Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with 2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
[AI-65] Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization
【速读】:该论文旨在解决多模态情绪识别(Multimodal Emotion Recognition, MER)在边缘设备上部署时面临的计算效率低、隐私保护不足以及跨模态不确定性难以建模的问题。解决方案的关键在于提出一个轻量级且隐私友好的模块化框架,其中各模态(语音、文本、面部图像)分别采用针对推理效率优化的专用主干网络(Emotion2Vec、ResNet-based模型、DistilRoBERTa),并通过基于Dempster-Shafer理论与Dirichlet证据的模型和任务无关融合机制,在不依赖额外训练或联合分布估计的前提下,直接在模型logits层面捕捉预测不确定性,从而提升系统对模糊或缺失输入的鲁棒性,并保障实际应用中的可扩展性和可行性。
链接: https://arxiv.org/abs/2602.09121
作者: Rémi Grzeczkowicz,Eric Soriano,Ali Janati,Miyu Zhang,Gerard Comas-Quiles,Victor Carballo Araruna,Aneesh Jonelagadda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures
Abstract:In this work, we present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. To demonstrate the framework’s versatility, our implementation uses three modalities - speech, text and facial imagery. However, the system is fully modular, and can be extended to support other modalities or tasks. Each modality is processed through a dedicated backbone optimized for inference efficiency: Emotion2Vec for speech, a ResNet-based model for facial expressions, and DistilRoBERTa for text. To reconcile uncertainty across modalities, we introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence. Operating directly on model logits, this approach captures predictive uncertainty without requiring additional training or joint distribution estimation, making it broadly applicable beyond emotion recognition. Validation on five benchmark datasets (eNTERFACE05, MEAD, MELD, RAVDESS and CREMA-D) shows that our method achieves competitive accuracy while remaining computationally efficient and robust to ambiguous or missing inputs. Overall, the proposed framework emphasizes modularity, scalability, and real-world feasibility, paving the way toward uncertainty-aware multimodal systems for healthcare, human-computer interaction, and other emotion-informed applications.
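基于 Dirichlet 证据与 Dempster-Shafer 合成规则的 logits 级融合有标准公式:置信质量 b_k = e_k / S、不确定度 u = K / S,再按 Dempster 规则合成。以下 NumPy 草图按这一常见公式实现两模态融合;证据函数取 softplus、示例 logits 均为假设,与论文的具体实现细节可能有出入。

```python
import numpy as np

def logits_to_evidence(logits):
    """Non-negative evidence from raw logits (softplus); Dirichlet alpha = evidence + 1."""
    return np.log1p(np.exp(logits))

def dempster_fuse(alpha1, alpha2):
    """Combine two Dirichlet opinions with Dempster's rule: belief masses b_k = e_k / S
    and uncertainty u = K / S are fused analytically, then mapped back to Dirichlet parameters."""
    K = alpha1.shape[-1]
    def to_opinion(alpha):
        S = alpha.sum()
        return (alpha - 1) / S, K / S
    b1, u1 = to_opinion(alpha1)
    b2, u2 = to_opinion(alpha2)
    conflict = np.outer(b1, b2).sum() - (b1 * b2).sum()         # mass placed on disagreeing classes
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * u1 * u2
    S = K / u
    return b * S + 1                                            # fused Dirichlet parameters

speech = logits_to_evidence(np.array([2.0, 0.1, -1.0])) + 1     # confident modality
face = logits_to_evidence(np.array([0.2, 0.1, 0.0])) + 1        # nearly uninformative modality
fused = dempster_fuse(speech, face)
print(fused, fused.argmax())   # the confident modality dominates while uncertainty is tracked
```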
[AI-66] A Small-Scale System for Autoregressive Program Synthesis Enabling Controlled Experimentation
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)进行程序合成研究时面临的诸多挑战,包括训练分布难以控制、微调效果不透明、分词机制影响复杂、以及计算与存储成本高昂等问题。为克服这些限制,作者提出了一种名为Cadmus的系统,其关键在于构建了一个轻量级整数虚拟机(integer virtual machine)、一个涵盖多样化任务的真实程序数据集,以及一个仅需不到200美元计算成本即可训练的自回归Transformer模型。该方案使研究人员能够在可控且低成本的环境下精细调整训练分布,并对模型行为进行可观测和可仪器化分析,从而支持对程序补全、分布外表示、归纳推理及指令遵循等复杂认知能力的研究,同时避免了大模型引入的未知先验干扰,提升了实验的可解释性与科学严谨性。
链接: https://arxiv.org/abs/2602.09112
作者: Russ Webb,Jason Ramapuram
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:What research can be pursued with small models trained to complete true programs? Typically, researchers study program synthesis via large language models (LLMs) which introduce issues such as knowing what is in or out of distribution, understanding fine-tuning effects, understanding the effects of tokenization, and higher demand on compute and storage to carry out experiments. We present a system called Cadmus which includes an integer virtual machine (VM), a dataset composed of true programs of diverse tasks, and an autoregressive transformer model that is trained for under $200 of compute cost. The system can be used to study program completion, out-of-distribution representations, inductive reasoning, and instruction following in a setting where researchers have effective and affordable fine-grained control of the training distribution and the ability to inspect and instrument models. Smaller models working on complex reasoning tasks enable instrumentation and investigations that may be prohibitively expensive on larger models. To demonstrate that these tasks are complex enough to be of interest, we show that these Cadmus models outperform GPT-5 (by achieving 100% accuracy while GPT-5 has 95% accuracy) even on a simple task of completing correct, integer arithmetic programs in our domain-specific language (DSL) while providing transparency into the dataset’s relationship to the problem. We also show that GPT-5 brings unknown priors into its reasoning process when solving the same tasks, demonstrating a confounding factor that prevents the use of large-scale LLMs for some investigations where the training set relationship to the task needs to be fully understood.
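摘要未给出 Cadmus 的 DSL 细节;为帮助理解「整数虚拟机 + 真程序补全」这类任务形态,下面虚构一个极简的栈式整数 VM,指令集与程序表示均为假设,仅说明此类数据可以被精确执行与校验。

```python
def run(program, stack=None):
    """Execute a tiny stack-based integer program. The instruction set here is invented
    for illustration; the paper's actual DSL/VM is not specified in the abstract."""
    stack = list(stack or [])
    ops = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
        "MUL": lambda a, b: a * b,
    }
    for instr in program:
        if isinstance(instr, int):          # push a literal
            stack.append(instr)
        elif instr in ops:                  # binary op on the top two values
            b, a = stack.pop(), stack.pop()
            stack.append(ops[instr](a, b))
        elif instr == "DUP":
            stack.append(stack[-1])
        else:
            raise ValueError(f"unknown instruction: {instr}")
    return stack

# (3 + 4) * 2 -> [14]; a model trained on such traces must complete them exactly.
print(run([3, 4, "ADD", 2, "MUL"]))
```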
[AI-67] DMamba: Decomposition-enhanced Mamba for Time Series Forecasting
【速读】:该论文旨在解决现有基于状态空间模型(State Space Models, SSMs)的长时序预测方法在处理具有非平稳特征的数据集时表现不佳的问题。其核心挑战在于,传统模型未能区分趋势(trend)与季节性(seasonal)成分在变量间统计关系上的本质差异:趋势成分通常由少数共同随机因子驱动,具有低维结构;而季节性成分则包含复杂的动态交互(如相位偏移和振幅共变),需更高表达能力建模。解决方案的关键在于提出DMamba模型,通过显式分离趋势与季节性成分,并设计差异化复杂度的模块——利用变量方向的Mamba编码器捕捉季节性成分中丰富的跨变量动态,同时用简单的多层感知机(MLP)学习趋势成分中的低维变量关系,从而实现架构复杂度与数据特性的一致性对齐。实验表明,该方法显著优于当前主流Mamba架构及分解类模型,达到新的SOTA性能。
链接: https://arxiv.org/abs/2602.09081
作者: Ruxuan Chen,Fang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 4 tables
Abstract:State Space Models (SSMs), particularly Mamba, have shown potential in long-term time series forecasting. However, existing Mamba-based architectures often struggle with datasets characterized by non-stationary patterns. A key observation from time series theory is that the statistical nature of inter-variable relationships differs fundamentally between the trend and seasonal components of a decomposed series. Trend relationships are often driven by a few common stochastic factors or long-run equilibria, suggesting that they reside on a lower-dimensional manifold. In contrast, seasonal relationships involve dynamic, high-dimensional interactions like phase shifts and amplitude co-movements, requiring more expressive modeling. In this paper, we propose DMamba, a novel forecasting model that explicitly aligns architectural complexity with this component-specific characteristic. DMamba employs seasonal-trend decomposition and processes the components with specialized, differentially complex modules: a variable-direction Mamba encoder captures the rich, cross-variable dynamics within the seasonal component, while a simple Multi-Layer Perceptron (MLP) suffices to learn from the lower-dimensional inter-variable relationships in the trend component. Extensive experiments on diverse datasets demonstrate that DMamba sets a new state-of-the-art (SOTA), consistently outperforming both recent Mamba-based architectures and leading decomposition-based models.
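季节-趋势分解通常用滑动平均实现:趋势为滑动均值,季节项为残差;DMamba 再把两者分别交给不同容量的模块。以下 PyTorch 草图实现这一常见分解,卷积核大小与填充方式为假设,Mamba 编码器与 MLP 部分以注释占位。

```python
import torch
import torch.nn.functional as F

def seasonal_trend_decompose(x, kernel_size=25):
    """Classic moving-average decomposition used by many forecasters:
    trend = moving average, seasonal = residual. x has shape (B, L, C)."""
    pad = (kernel_size - 1) // 2
    front = x[:, :1, :].repeat(1, pad, 1)
    back = x[:, -1:, :].repeat(1, kernel_size - 1 - pad, 1)
    padded = torch.cat([front, x, back], dim=1).transpose(1, 2)       # (B, C, L + k - 1)
    trend = F.avg_pool1d(padded, kernel_size, stride=1).transpose(1, 2)
    return x - trend, trend                                            # seasonal, trend

x = torch.randn(4, 96, 7)
seasonal, trend = seasonal_trend_decompose(x)
print(seasonal.shape, trend.shape)
# In DMamba the two parts are then routed to modules of different capacity:
# seasonal -> a cross-variable Mamba encoder, trend -> a simple MLP.
```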
[AI-68] Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在训练和推理过程中参数利用率低的问题,即尽管模型参数量庞大,但其能力并未被充分挖掘。解决方案的关键在于提出一种递归式Transformer架构——RecursiveVLM,通过递归精炼机制复用模型参数以提取更强的多模态表示,而无需增加模型规模。其核心创新包括:(i) 递归连接器(Recursive Connector),通过融合中间层隐藏状态并施加模态特异性投影,实现跨递归步骤的特征对齐,尊重视觉与语言token的不同统计结构;(ii) 单调递归损失(Monotonic Recursion Loss),监督每一步递归过程并保证性能随递归深度单调提升。这一设计使递归成为按需优化的精炼机制,在资源受限设备上用少量循环即可获得良好结果,而在计算资源充足时可逐步提升输出质量,从而实现高效且部署自适应的LMM。
链接: https://arxiv.org/abs/2602.09080
作者: Ruihan Xu,Yuting Gao,Lan Wang,Jianing Li,Weihao Chen,Qingpei Guo,Ming Yang,Shiliang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This is a primary contribution in the Recursive Vision-Language Models
Abstract:Large Multimodal Models (LMMs) have achieved remarkable success in vision-language tasks, yet their vast parameter counts are often underutilized during both training and inference. In this work, we embrace the idea of looping back to move forward: reusing model parameters through recursive refinement to extract stronger multimodal representations without increasing model size. We propose RecursiveVLM, a recursive Transformer architecture tailored for LMMs. Two key innovations enable effective looping: (i) a Recursive Connector that aligns features across recursion steps by fusing intermediate-layer hidden states and applying modality-specific projections, respecting the distinct statistical structures of vision and language tokens; (ii) a Monotonic Recursion Loss that supervises every step and guarantees performance improves monotonically with recursion depth. This design transforms recursion into an on-demand refinement mechanism: delivering strong results with few loops on resource-constrained devices and progressively improving outputs when more computation resources are available. Experiments show consistent gains of +3% over standard Transformers and +7% over vanilla recursive baselines, demonstrating that strategic looping is a powerful path toward efficient, deployment-adaptive LMMs.
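「单调递归损失」的一种直接形式是:对每个递归步都做监督,并对「后一步比前一步差」施加铰链惩罚。以下 PyTorch 草图即这一形式的示意,损失类型(交叉熵)与 margin 取值为假设,非论文的确切定义。

```python
import torch
import torch.nn.functional as F

def monotonic_recursion_loss(step_logits, targets, margin=0.0):
    """Supervise every recursion step and penalize any step whose loss is not at least
    as good as the previous one, so quality improves monotonically with depth.
    step_logits: list of (B, num_classes) tensors, one per recursion step."""
    step_losses = [F.cross_entropy(logits, targets) for logits in step_logits]
    total = torch.stack(step_losses).sum()
    for prev, curr in zip(step_losses[:-1], step_losses[1:]):
        total = total + torch.relu(curr - prev + margin)     # hinge: later steps must not regress
    return total

targets = torch.randint(0, 10, (8,))
logits = [torch.randn(8, 10, requires_grad=True) for _ in range(3)]
print(monotonic_recursion_loss(logits, targets).item())
```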
[AI-69] Framework for Integrating Zero Trust in Cloud-Based Endpoint Security for Critical Infrastructure
【速读】:该论文试图解决的问题是在云环境中将零信任架构(Zero Trust Architecture, ZTA)应用于关键基础设施的终端管理时存在的实践空白。解决方案的关键在于提出一个定制化的ZTA集成框架,通过强化访问控制、持续验证和最小权限原则,实现对敏感操作的持续保护,从而降低攻击面并提升合规性。
链接: https://arxiv.org/abs/2602.09078
作者: Shyam Kumar Gajula
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
备注: 12 pages
Abstract:Cyber threats have become highly sophisticated, raising concern for endpoint security, especially in critical infrastructure, to new heights. A security model, such as Zero Trust Architecture (ZTA), is required to overcome this challenge. ZTA treats every access request as new and assumes no implicit trust. Critical infrastructure like power plants, healthcare systems, financial systems, water supply, and military assets are especially prone to becoming targets for hackers and phishing attacks. This paper proposes a comprehensive framework for integrating tailored ZTA into organizations that manage sensitive operations. The paper highlights how the ZTA framework can enhance compliance, enabling continuous protection, thereby reducing attack surfaces. This paper aims to address the gap that exists in applying ZTA to endpoint management within cloud environments for critical infrastructure.
[AI-70] Learning to Remember Learn and Forget in Attention-Based Models
【速读】:该论文旨在解决Transformer模型中基于上下文学习(In-Context Learning, ICL)的在线关联记忆机制在门控线性注意力(gated linear attention)模型中因固定容量和易受干扰而导致性能下降的问题,尤其是在处理长序列时。解决方案的关键在于将ICL建模为一个持续学习(continual learning)问题,并引入贝叶斯元塑性(Bayesian metaplasticity)机制:通过为每个注意力状态分配一个由先验分布驱动的重要度状态(importance state),动态调节其可塑性(plasticity),从而在稳定性(stability)与可塑性之间取得平衡。这一设计使模型能够有效扩展记忆容量并减少遗忘,实验表明该方法在多查询关联回忆(MQAR)和常识推理任务上显著优于基线模型。
链接: https://arxiv.org/abs/2602.09075
作者: Djohan Bonnet,Jamie Lohoff,Jan Finkbeiner,Elidona Skhikerujah,Emre Neftci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
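门控线性注意力可以写成对关联记忆 S 的递推更新;在此基础上,「元塑性」可理解为用一个重要度/可塑性系数调节每步写入强度:可塑性低则保护旧记忆,遗忘占主导时退化为 Mamba2 式的门控。以下 PyTorch 草图是对这一读法的示意性实现,可塑性系数的来源与论文中的贝叶斯后验近似均未体现。

```python
import torch

def metaplastic_linear_attention(q, k, v, plasticity):
    """Recurrent view of gated linear attention with per-step metaplasticity:
    an importance-driven gate decides how strongly each (k, v) pair overwrites
    the associative memory S. plasticity: (B, T, 1) values in [0, 1]."""
    B, T, d = q.shape
    S = torch.zeros(B, d, v.shape[-1])
    outputs = []
    for t in range(T):
        p = plasticity[:, t].unsqueeze(-1)                    # (B, 1, 1)
        write = k[:, t].unsqueeze(-1) @ v[:, t].unsqueeze(1)  # (B, d, d_v) outer product
        S = (1 - p) * S + p * write                           # low plasticity -> protect old memory
        outputs.append((q[:, t].unsqueeze(1) @ S).squeeze(1)) # read with the current query
    return torch.stack(outputs, dim=1)

B, T, d = 2, 16, 32
out = metaplastic_linear_attention(torch.randn(B, T, d), torch.randn(B, T, d),
                                   torch.randn(B, T, d), torch.rand(B, T, 1))
print(out.shape)   # torch.Size([2, 16, 32])
```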
[AI-71] DRAGON: Robust Classification for Very Large Collections of Software Repositories
【速读】:该论文旨在解决大规模软件代码仓库(repository)自动分类问题,尤其针对现有方法严重依赖README文件等元数据、而在实际场景中这些信息常缺失导致分类效果受限的挑战。解决方案的关键在于提出DRAGON模型,该模型仅利用版本控制系统中轻量级信号(如文件和目录名称)进行分类,即使在缺少README文件的情况下仍保持较高鲁棒性(性能仅下降6%),从而显著提升在真实世界大规模软件集合中的适用性;此外,其预测结果多为语义相近的“近似正确”标签,增强了实际应用中搜索与发现的指导价值。
链接: https://arxiv.org/abs/2602.09071
作者: Stefano Balla(DISI),Stefano Zacchiroli(IP Paris, LTCI, ACES, INFRES),Thomas Degueule(LaBRI, UB),Jean-Rémy Falleri(LaBRI, UB),Romain Robbes(LaBRI, UB)
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The ability to automatically classify source code repositories with "topics" that reflect their content and purpose is very useful, especially when navigating or searching through large software collections. However, existing approaches often rely heavily on README files and other metadata, which are frequently missing, limiting their applicability in real-world large-scale settings. We present DRAGON, a repository classifier designed for very large and diverse software collections. It operates entirely on lightweight signals commonly stored in version control systems: file and directory names, and optionally the README when available. In repository classification at scale, DRAGON improves F1@5 from 54.8% to 60.8%, surpassing the state of the art. DRAGON remains effective even when README files are absent, with performance degrading by only 6% w.r.t. when they are present. This robustness makes it practical for real-world settings where documentation is sparse or inconsistent. Furthermore, many of the remaining classification errors are near misses, where predicted labels are semantically close to the correct topics. This property increases the practical value of the predictions in real-world software collections, where suggesting a few related topics can still guide search and discovery. As a byproduct of developing DRAGON, we also release the largest open dataset to date for repository classification, consisting of 825 thousand repositories with associated ground-truth topics, sourced from the Software Heritage archive, providing a foundation for future large-scale and language-agnostic research on software repository understanding.
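To illustrate the kind of lightweight, name-only signal the abstract describes, here is a minimal sketch (not the DRAGON architecture; the repositories, labels, and model choice below are illustrative assumptions): path tokens are hashed into a bag-of-words vector and fed to a linear classifier.

```python
# Minimal sketch (not the DRAGON architecture): classify repositories
# from file/directory names only. All data below is synthetic.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each "document" is the flattened list of path tokens of one repository.
repos = [
    "src main.py requirements.txt models train.py data loader.py",
    "app components Button.tsx package.json webpack.config.js public index.html",
    "lib matrix.c Makefile include linalg.h tests test_blas.c",
]
topics = ["machine-learning", "web-frontend", "numerical-computing"]  # toy labels

clf = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),  # cheap, vocabulary-free
    LogisticRegression(max_iter=1000),
)
clf.fit(repos, topics)
print(clf.predict(["server routes api.ts package.json src App.tsx"]))
```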
[AI-72] NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
【速读】:该论文旨在解决长视频配乐生成中的三大核心挑战:计算可扩展性、时间一致性以及最关键的对叙事逻辑演变的语义盲区。其解决方案的关键在于提出一种名为NarraScore的分层框架,该框架的核心洞察是情绪作为叙事逻辑的高密度压缩表示;通过复用冻结的视觉-语言模型(Vision-Language Models, VLMs)作为连续情感传感器,将高维视觉流转化为富含叙事感知的效价-唤醒(Valence-Arousal)轨迹,并采用双分支注入机制——全局语义锚点保障风格稳定性,而细粒度的情感适配器则通过元素级残差注入精确调控局部张力,从而在不依赖密集注意力或架构复制的前提下实现高效且鲁棒的音频合成。
链接: https://arxiv.org/abs/2602.09070
作者: Yufan Wen,Zhaocheng Liu,YeGuo Hua,Ziyi Guo,Lihua Zhang,Chun Yuan,Jian Wu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
[AI-73] Spectral Disentanglement and Enhancement: A Dual-domain Contrastive Framework for Representation Learning
【速读】:该论文旨在解决大规模多模态对比学习中特征维度均匀处理导致的谱结构忽视问题,即高维嵌入易坍缩为窄锥状分布,使任务相关语义集中于小子空间而多数维度充斥噪声与伪相关,从而引发谱不平衡和特征纠缠,损害模型泛化能力。其解决方案的关键在于提出谱解耦与增强(Spectral Disentanglement and Enhancement, SDE)框架,通过奇异值分解(Singular Value Decomposition, SVD)自适应地将特征维度划分为强信号(捕捉任务关键语义)、弱信号(反映辅助相关性)和噪声(无关扰动),并引入基于课程学习的谱增强策略,选择性放大信息丰富成分且保证训练稳定性;进一步设计双域对比损失,在特征空间与谱空间联合优化对齐关系,实现谱正则化融入训练过程,显著提升表示的鲁棒性和泛化性能。
链接: https://arxiv.org/abs/2602.09066
作者: Jinjin Guo,Yexin Li,Zhichao Huang,Jun Fang,Zhiyuan Liu,Chao Liu,Pengzhang Liu,Qixia Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale multimodal contrastive learning has recently achieved impressive success in learning rich and transferable representations, yet it remains fundamentally limited by the uniform treatment of feature dimensions and the neglect of the intrinsic spectral structure of the learned features. Empirical evidence indicates that high-dimensional embeddings tend to collapse into narrow cones, concentrating task-relevant semantics in a small subspace, while the majority of dimensions remain occupied by noise and spurious correlations. Such spectral imbalance and entanglement undermine model generalization. We propose Spectral Disentanglement and Enhancement (SDE), a novel framework that bridges the gap between the geometry of the embedded spaces and their spectral properties. Our approach leverages singular value decomposition to adaptively partition feature dimensions into strong signals that capture task-critical semantics, weak signals that reflect ancillary correlations, and noise representing irrelevant perturbations. A curriculum-based spectral enhancement strategy is then applied, selectively amplifying informative components with theoretical guarantees on training stability. Building upon the enhanced features, we further introduce a dual-domain contrastive loss that jointly optimizes alignment in both the feature and spectral spaces, effectively integrating spectral regularization into the training process and encouraging richer, more robust representations. Extensive experiments on large-scale multimodal benchmarks demonstrate that SDE consistently improves representation robustness and generalization, outperforming state-of-the-art methods. SDE integrates seamlessly with existing contrastive pipelines, offering an effective solution for multimodal representation learning.
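A minimal sketch of the SVD-based partitioning idea described above (the thresholds and amplification factor are illustrative assumptions, not the paper's adaptive curriculum): split the singular directions of an embedding matrix into strong, weak, and noise groups, then selectively rescale them.

```python
# Minimal sketch of spectral partitioning of an embedding matrix (assumed
# thresholds; the paper's adaptive/curriculum scheme is more involved).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(512, 256))             # batch of 512 embeddings, dim 256

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)

strong = energy <= 0.70                     # task-critical directions (assumed cut)
noise = (s / s.max()) < 0.05                # near-zero directions (assumed cut)
weak = ~(strong | noise)                    # ancillary correlations

s_enhanced = s.copy()
s_enhanced[strong] *= 1.2                   # selectively amplify informative components
s_enhanced[noise] = 0.0                     # suppress noise directions

Z_enhanced = U @ np.diag(s_enhanced) @ Vt
print(strong.sum(), weak.sum(), noise.sum())
```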
[AI-74] Enhanced Graph Transformer with Serialized Graph Tokens ICASSP2026
【速读】:该论文旨在解决现有基于Transformer的图学习方法在生成图级表示时存在的信息瓶颈问题,尤其是传统单标记(single token)范式无法充分利用自注意力机制对token序列的编码能力,导致模型退化为节点信号的加权求和,难以捕捉复杂的全局结构信息。解决方案的关键在于提出一种新颖的序列化标记(serialized token)范式:首先通过图序列化方法将节点信号聚合为有序的图标记序列,并自动引入位置编码;随后使用堆叠的自注意力层对这一标记序列进行编码,以捕获其内部依赖关系。该方法能够更有效地封装全局信号,从而生成更具表达力的图表示,在多个图级基准测试中取得了当前最优性能。
链接: https://arxiv.org/abs/2602.09065
作者: Ruixiang Wang,Yuyang Hong,Shiming Xiang,Chunhong Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICASSP 2026
Abstract:Transformers have demonstrated success in graph learning, particularly for node-level tasks. However, existing methods encounter an information bottleneck when generating graph-level representations. The prevalent single token paradigm fails to fully leverage the inherent strength of self-attention in encoding token sequences, and degenerates into a weighted sum of node signals. To address this issue, we design a novel serialized token paradigm to encapsulate global signals more effectively. Specifically, a graph serialization method is proposed to aggregate node signals into serialized graph tokens, with positional encoding being automatically involved. Then, stacked self-attention layers are applied to encode this token sequence and capture its internal dependencies. Our method can yield more expressive graph representations by modeling complex interactions among multiple graph tokens. Experimental results show that our method achieves state-of-the-art results on several graph-level benchmarks. Ablation studies verify the effectiveness of the proposed modules.
[AI-75] Predicting Open Source Software Sustainability with Deep Temporal Neural Hierarchical Architectures and Explainable AI
【速读】:该论文旨在解决现有研究中对开源软件(Open Source Software, OSS)项目可持续性评估依赖静态或聚合指标(如项目年龄或累计活动量)所带来的局限性,这些指标难以揭示OSS可持续性在时间维度上的动态演化过程。解决方案的关键在于提出一个分层预测框架,将OSS项目映射到基于社会技术理论划分的生命周期阶段,并通过融合人工设计的表格特征与24个月的时间序列活动数据,构建多阶段分类管道来识别不同协调与参与模式对应的生命周期阶段。该框架不仅实现了超过94%的整体分类准确率,还利用可解释人工智能(Explainable AI)技术量化各特征类别对预测结果的贡献度,从而证实了贡献活跃度和社区相关特征是驱动可持续性的核心信号。
链接: https://arxiv.org/abs/2602.09064
作者: S M Rakib Ul Karim,Wenyi Lu,Enock Kasaadha,Sean Goggins
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Open Source Software (OSS) projects follow diverse lifecycle trajectories shaped by evolving patterns of contribution, coordination, and community engagement. Understanding these trajectories is essential for stakeholders seeking to assess project organization and health at scale. However, prior work has largely relied on static or aggregated metrics, such as project age or cumulative activity, providing limited insight into how OSS sustainability unfolds over time. In this paper, we propose a hierarchical predictive framework that models OSS projects as belonging to distinct lifecycle stages grounded in established socio-technical categorizations of OSS development. Rather than treating sustainability solely as project longevity, these lifecycle stages operationalize sustainability as a multidimensional construct integrating contribution activity, community participation, and maintenance dynamics. The framework combines engineered tabular indicators with 24-month temporal activity sequences and employs a multi-stage classification pipeline to distinguish lifecycle stages associated with different coordination and participation regimes. To support transparency, we incorporate explainable AI techniques to examine the relative contribution of feature categories to model predictions. Evaluated on a large corpus of OSS repositories, the proposed approach achieves over 94% overall accuracy in lifecycle stage classification. Attribution analyses consistently identify contribution activity and community-related features as dominant signals, highlighting the central role of collective participation dynamics.
[AI-76] RuleFlow: Generating Reusable Program Optimizations with LLMs
【速读】:该论文旨在解决Pandas程序优化中的挑战问题,即现有系统和编译器方法虽具可靠性但存在重量级或优化种类有限的缺陷,而基于大语言模型(Large Language Models, LLMs)的逐程序优化方法虽能生成复杂优化策略,却存在不可靠、成本高且优化产出率低的问题。其解决方案的关键在于提出一种三阶段混合优化框架RuleFlow:首先在特定程序中发现优化策略(发现阶段),其次将这些优化转化为通用重写规则(桥接阶段),最后将规则集成至编译器中实现自动应用(部署阶段),从而避免重复依赖LLMs,兼顾优化效果与效率。
链接: https://arxiv.org/abs/2602.09051
作者: Avaljot Singh,Dushyant Bharadwaj,Stefanos Baziotis,Kaushik Varadharajan,Charith Mendis
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimizing Pandas programs is a challenging problem. Existing systems and compiler-based approaches offer reliability but are either heavyweight or support only a limited set of optimizations. Conversely, using LLMs in a per-program optimization methodology can synthesize nontrivial optimizations, but is unreliable, expensive, and offers a low yield. In this work, we introduce a hybrid approach that works in a 3-stage manner that decouples discovery from deployment and connects them via a novel bridge. First, it discovers per-program optimizations (discovery). Second, they are converted into generalised rewrite rules (bridge). Finally, these rules are incorporated into a compiler that can automatically apply them wherever applicable, eliminating repeated reliance on LLMs (deployment). We demonstrate that RuleFlow is the new state-of-the-art (SOTA) Pandas optimization framework on PandasBench, a challenging Pandas benchmark consisting of Python notebooks. Across these notebooks, we achieve a speedup of up to 4.3x over Dias, the previous compiler-based SOTA, and 1914.9x over Modin, the previous systems-based SOTA. Our code is available at this https URL.
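To make the deployment stage concrete, below is a toy, hand-written rewrite rule in the same spirit (not a rule generated by RuleFlow, and a regex pass stands in for a real compiler): it rewrites df.sort_values(col).head(k) into the asymptotically cheaper df.nsmallest(k, col).

```python
# Toy rewrite rule in the spirit of the bridge/deployment stages (hand-written,
# not produced by RuleFlow): df.sort_values(col).head(k) -> df.nsmallest(k, col).
import re

RULE = re.compile(
    r"\.sort_values\((?P<col>['\"][^'\"]+['\"])\)\.head\((?P<k>\d+)\)"
)

def apply_rule(source: str) -> str:
    return RULE.sub(lambda m: f".nsmallest({m.group('k')}, {m.group('col')})", source)

before = "top = df.sort_values('latency').head(10)"
print(apply_rule(before))   # top = df.nsmallest(10, 'latency')
```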
[AI-77] DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis
【速读】:该论文旨在解决流匹配(Flow-matching)模型在文本到语音(Text-to-Speech, TTS)合成中推理阶段迭代采样带来的高计算成本问题,以及现有知识蒸馏方法因终点误差累积导致的过程方差和连续时间架构直接用于离散固定步长生成所引发的结构参数效率低下问题。解决方案的关键在于提出一种模块化蒸馏框架DSFlow,其核心创新包括:将生成过程重构为离散预测任务,并通过双监督策略(端点匹配与确定性均值速度对齐)提升训练稳定性,确保不同推理步数下生成轨迹的一致性;同时用轻量级步长感知标记(step-aware tokens)替代连续时间步长条件输入,显著降低模型参数量并适配离散任务的简化时间空间。
链接: https://arxiv.org/abs/2602.09041
作者: Bin Lin,Peng Yang,Chao Yan,Xiaochen Liu,Wei Wang,Boyong Wu,Pengfei Tan,Xuerui Yang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. Although distillation is widely used to reduce the number of inference steps, existing methods often suffer from process variance due to endpoint error accumulation. Moreover, directly reusing continuous-time architectures for discrete, fixed-step generation introduces structural parameter inefficiencies. To address these challenges, we introduce DSFlow, a modular distillation framework for few-step and one-step synthesis. DSFlow reformulates generation as a discrete prediction task and explicitly adapts the student model to the target inference regime. It improves training stability through a dual supervision strategy that combines endpoint matching with deterministic mean-velocity alignment, enforcing consistent generation trajectories across inference steps. In addition, DSFlow improves parameter efficiency by replacing continuous-time timestep conditioning with lightweight step-aware tokens, aligning model capacity with the significantly reduced timestep space of the discrete task. Extensive experiments across diverse flow-based text-to-speech architectures demonstrate that DSFlow consistently outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost.
[AI-78] Efficient Distance Pruning for Process Suffix Comparison in Prescriptive Process Monitoring
【速读】:该论文旨在解决生成式 AI (Generative AI) 在流程监控中因大规模后缀(suffix)比较带来的高计算成本问题,该问题随日志规模增长而急剧恶化。解决方案的关键在于利用三角不等式(triangle inequality)设计一种高效的检索方法:通过一组优化的枢轴点(pivots)定义距离边界,从而精确剪枝冗余比较。该方法显著降低运行时间且完全可并行化,同时保证剪枝的准确性——所检索到的后缀与穷举比较结果一致,从而在保持精度的前提下支持可扩展的预测性系统。
链接: https://arxiv.org/abs/2602.09039
作者: Sarra Madad(UTT, LIST3N - OPTI, QAD)
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: Winter Simulation Conference, Dec 2025, Seattle WA, United States
Abstract:Prescriptive process monitoring seeks to recommend actions that improve process outcomes by analyzing possible continuations of ongoing cases. A key obstacle is the heavy computational cost of large-scale suffix comparisons, which grows rapidly with log size. We propose an efficient retrieval method exploiting the triangle inequality: distances to a set of optimized pivots define bounds that prune redundant comparisons. This substantially reduces runtime and is fully parallelizable. Crucially, pruning is exact: the retrieved suffixes are identical to those from exhaustive comparison, thereby preserving accuracy. These results show that metric-based pruning can accelerate suffix comparison and support scalable prescriptive systems.
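The pruning idea follows directly from the triangle inequality: for any pivot p, |d(q, p) - d(s, p)| is a lower bound on d(q, s), so a candidate suffix can be skipped once that bound already exceeds the current best distance. A generic sketch (vectors and Euclidean distance stand in for process suffixes, and the pivots are not optimized as in the paper):

```python
# Generic pivot-based pruning sketch (illustrative; the paper optimizes the
# pivot set and works on process suffixes rather than vectors).
import numpy as np

def nearest_with_pruning(query, database, pivots, dist):
    d_qp = np.array([dist(query, p) for p in pivots])                  # query-to-pivot
    d_sp = np.array([[dist(s, p) for p in pivots] for s in database])  # usually precomputed offline
    best_idx, best_d, skipped = -1, np.inf, 0
    for i, s in enumerate(database):
        lower_bound = np.max(np.abs(d_qp - d_sp[i]))   # triangle inequality bound
        if lower_bound >= best_d:
            skipped += 1                               # exact pruning: no distance call needed
            continue
        d = dist(query, s)
        if d < best_d:
            best_d, best_idx = d, i
    return best_idx, best_d, skipped

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 16))
q = rng.normal(size=16)
euclid = lambda a, b: float(np.linalg.norm(a - b))
print(nearest_with_pruning(q, data, data[:8], euclid))
```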
[AI-79] Scaling GraphLLM with Bilevel-Optimized Sparse Querying
【速读】:该论文旨在解决生成式 AI (Generative AI) 在文本属性图(Text-attributed Graphs, TAGs)上的节点级任务中因频繁调用大语言模型(Large Language Models, LLMs)而导致的高计算与成本开销问题。现有方法如TAPE在处理中等规模数据集(如Photo,含48k节点)时需耗费数天时间,严重限制了实际应用。解决方案的关键在于提出一种双层优化稀疏查询框架(Bilevel-Optimized Sparse Querying, BOSQ),其核心是设计了一种自适应稀疏查询策略,动态判断何时调用LLM以避免冗余或低收益的查询,从而显著降低计算开销,同时在六种真实世界TAG数据集上的两类节点级任务中实现相比现有GraphLLM方法快几个数量级的速度提升,并保持相当或更优的性能表现。
链接: https://arxiv.org/abs/2602.09038
作者: Yangzhe Peng,Haiquan Qiu,Quanming Yao,Kun He
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features. However, their practical use is severely limited by the high computational and monetary cost of repeated LLM queries. To illustrate, naively generating explanations for all nodes on a medium-sized benchmark like Photo (48k nodes) using a representative method (e.g., TAPE) would consume days of processing time. In this paper, we propose Bilevel-Optimized Sparse Querying (BOSQ), a general framework that selectively leverages LLM-derived explanation features to enhance performance on node-level tasks on TAGs. We design an adaptive sparse querying strategy that selectively decides when to invoke LLMs, avoiding redundant or low-gain queries and significantly reducing computation overhead. Extensive experiments on six real-world TAG datasets involving two types of node-level tasks demonstrate that BOSQ achieves orders of magnitude speedups over existing GraphLLM methods while consistently delivering on-par or superior performance.
[AI-80] Seeing the Goal, Missing the Truth: Human Accountability for AI Bias
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成中间表征(如情感得分和竞争指标)时,因知晓下游任务目标而产生偏差的问题。研究发现,即使这些中间测量本应与下游任务无关,当LLM被告知其输出将用于特定任务(如预测股票收益或盈利)时,其生成的表征会向该目标偏移,从而引入人为诱导的AI偏差。解决方案的关键在于“目标感知提示”(goal-aware prompting)——即通过明确告知LLM下游用途来调节其内部认知过程,使中间度量更贴合目标任务。然而,这种策略仅在LLM知识截止日期前有效,表明该偏差源于人类在研究设计中对统计有效性与可靠性的责任分配,而非算法本身的缺陷。
链接: https://arxiv.org/abs/2602.09504
作者: Sean Cao,Wei Jiang,Hui Xu
机构: 未知
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI)
备注: 17 pages, 3 figures, 5 tables
Abstract:This research explores how human-defined goals influence the behavior of Large Language Models (LLMs) through purpose-conditioned cognition. Using financial prediction tasks, we show that revealing the downstream use (e.g., predicting stock returns or earnings) of LLM outputs leads the LLM to generate biased sentiment and competition measures, even though these measures are intended to be downstream task-independent. Goal-aware prompting shifts intermediate measures toward the disclosed downstream objective. This purpose leakage improves performance before the LLM’s knowledge cutoff, but with no advantage post-cutoff. AI bias due to “seeing the goal” is not an algorithmic flaw, but stems from human accountability in research design to ensure the statistical validity and reliability of AI-generated measurements.
[AI-81] The Critical Horizon: Inspection Design Principles for Multi-Stage Operations and Deep Reasoning
【速读】:该论文致力于解决多阶段系统中“信用分配”(credit assignment)问题,即如何将最终结果准确归因于中间环节中的某一阶段。其核心挑战在于,早期步骤与最终结果之间的信号会随着阶段数的增加呈指数衰减,从而形成一个信息理论上的障碍,使得算法仅依赖终端数据难以学习到早期决策的影响。解决方案的关键在于:首先,证明了样本复杂度随干预步骤数量呈指数增长(信号衰减边界);其次,指出并行展开虽可缓解问题但受限于相关性(宽度限制),有效独立样本数仅为对数级;第三,揭示了累加奖励目标与序列有效性要求之间的不匹配(目标错位);最后,提出最优检查点设计策略——在均匀衰减下均匀间隔是最优的,而在非均匀衰减下贪婪算法能实现最优非均匀调度。这些结论共同构成了操作优化和AI监督设计的统一分析基础。
链接: https://arxiv.org/abs/2602.09394
作者: Seyed Morteza Emadi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 49 pages, 5 figures
Abstract:Manufacturing lines, service journeys, supply chains, and AI reasoning chains share a common challenge: attributing a terminal outcome to the intermediate stage that caused it. We establish an information-theoretic barrier to this credit assignment problem: the signal connecting early steps to final outcomes decays exponentially with depth, creating a critical horizon beyond which no algorithm can learn from endpoint data alone. We prove four results. First, a Signal Decay Bound: sample complexity for attributing outcomes to early stages grows exponentially in the number of intervening steps. Second, Width Limits: parallel rollouts provide only logarithmic relief, with correlation capping the effective number of independent samples. Third, an Objective Mismatch: additive reward aggregation optimizes the wrong quantity when sequential validity requires all steps to be correct. Fourth, Optimal Inspection Design: uniform checkpoint spacing is minimax-optimal under homogeneous signal attenuation, while a greedy algorithm yields optimal non-uniform schedules under heterogeneous attenuation. Together, these results provide a common analytical foundation for inspection design in operations and supervision design in AI.
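A plausible reading of the greedy schedule for heterogeneous attenuation (an assumption-laden sketch, not the paper's algorithm): repeatedly split the inter-inspection segment that loses the most signal, cutting it where the two resulting halves are most balanced.

```python
# Greedy non-uniform checkpoint placement sketch (assumed objective: maximize the
# worst-case signal retained between consecutive inspections; not the paper's code).
import numpy as np

def greedy_checkpoints(gammas, k):
    """gammas[i]: per-stage signal attenuation in (0, 1); k: number of checkpoints."""
    segments = [(0, len(gammas))]          # half-open stage ranges between inspections
    cuts = []
    for _ in range(k):
        # only segments with at least two stages can be split further
        splittable = [seg for seg in segments if seg[1] - seg[0] >= 2]
        # pick the segment with the most signal decay (smallest product of gammas)
        worst = min(splittable, key=lambda seg: np.prod(gammas[seg[0]:seg[1]]))
        segments.remove(worst)
        a, b = worst
        # split it where the two halves retain the most balanced signal
        best_cut = max(
            range(a + 1, b),
            key=lambda c: min(np.prod(gammas[a:c]), np.prod(gammas[c:b])),
        )
        cuts.append(best_cut)
        segments += [(a, best_cut), (best_cut, b)]
    return sorted(cuts)

hetero = np.array([0.9, 0.9, 0.5, 0.9, 0.9, 0.6, 0.9, 0.9])   # two "leaky" stages
print(greedy_checkpoints(hetero, k=2))     # -> [3, 5]: checkpoints sit next to the leaky stages
```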
[AI-82] Surrogate-Guided Quantum Discovery in Black-Box Landscapes with Latent-Quadratic Interaction Embedding Transformers
【速读】:该论文旨在解决在昂贵的黑箱评估和严格查询预算条件下,如何发现既具有高效用又具备结构多样性的配置这一核心挑战。传统优化方法往往聚焦于主导模式,而质量-多样性方法则因需要大量评估预算来填充高维档案而受限;同时,量子近似优化算法(QAOA)虽能提供分布采样,但其依赖显式的哈密顿量,在黑箱场景中不可用。解决方案的关键在于:提出一种基于自注意力机制建模高阶变量依赖关系,并将其投影为适用于QAOA的正定二次型(Positive Semi-Definite quadratic form)的代理哈密顿量方法,从而实现从学习到的能量景观中进行面向多样性的量子采样,同时捕捉超出成对交互结构的高阶相互作用。实验表明,该方法在企业文档处理系统的风险发现任务中显著提升了结构多样性与极端风险配置的识别能力,优于多数经典基线方法。
链接: https://arxiv.org/abs/2602.09374
作者: Saisubramaniam Gopalakrishnan,Dagnachew Birru
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Discovering configurations that are both high-utility and structurally diverse under expensive black-box evaluation and strict query budgets remains a central challenge in data-driven discovery. Many classical optimizers concentrate on dominant modes, while quality-diversity methods require large evaluation budgets to populate high-dimensional archives. The Quantum Approximate Optimization Algorithm (QAOA) provides distributional sampling but requires an explicit problem Hamiltonian, which is unavailable in black-box settings. Practical quantum circuits favor quadratic Hamiltonians since higher-order interaction terms are costly to realize. Learned quadratic surrogates such as Factorization Machines (FM) have been used as proxies, but are limited to pairwise structure. We extend this surrogate-to-Hamiltonian approach by modelling higher-order variable dependencies via self-attention and projecting them into a valid Positive Semi-Definite quadratic form compatible with QAOA. This enables diversity-oriented quantum sampling from learned energy landscapes while capturing interaction structure beyond pairwise terms. We evaluate on risk discovery for enterprise document processing systems against diverse classical optimizers. Quantum-guided samplers achieve competitive utility while consistently improving structural diversity and exclusive discovery. FM surrogates provide stronger early coverage, whereas ours yields higher-fidelity surrogate landscapes and better extreme-case discovery. Our method recovers roughly twice as many structurally tail-risk outliers as most classical baselines and identifies an exclusive, non-overlapping fraction of high-utility configurations not found by competing methods, highlighting that learning higher-order interaction structure and projecting it into quadratic surrogate Hamiltonians is an effective mechanism for quantum-assisted black-box discovery.
[AI-83] Behavioral Economics of AI: LLM Biases and Corrections
【速读】:该论文旨在解决生成式 AI 模型(尤其是大语言模型,Large Language Models, LLMs)在经济和金融决策中是否表现出系统性行为偏差的问题,并探讨如何有效缓解这些偏差。研究通过借鉴认知心理学与实验经济学的经典实验范式,在多个主流 LLM 家族及其不同版本和规模下进行了最全面的实证测试,发现:在基于偏好的任务中,随着模型能力提升或规模扩大,其行为逐渐趋近人类模式;而在基于信念的任务中,先进且大规模的模型往往能生成更理性的回答。解决方案的关键在于通过提示工程(prompting)引导模型做出理性决策,从而显著降低其认知偏差。
链接: https://arxiv.org/abs/2602.09362
作者: Pietro Bini,Lin William Cong,Xing Huang,Lawrence J. Jin
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:Do generative AI models, particularly large language models (LLMs), exhibit systematic behavioral biases in economic and financial decisions? If so, how can these biases be mitigated? Drawing on the cognitive psychology and experimental economics literatures, we conduct the most comprehensive set of experiments to date - originally designed to document human biases - on prominent LLM families across model versions and scales. We document systematic patterns in LLM behavior. In preference-based tasks, responses become more human-like as models become more advanced or larger, while in belief-based tasks, advanced large-scale models frequently generate rational responses. Prompting LLMs to make rational decisions reduces biases.
[AI-84] Quantifying Epistemic Uncertainty in Diffusion Models AISTATS
【速读】:该论文旨在解决扩散模型(diffusion models)在生成数据时难以准确量化认知不确定性(epistemic uncertainty)的问题,因为现有方法常将认知不确定性与偶然不确定性(aleatoric uncertainty)混杂,导致输出可靠性不足。解决方案的关键在于提出一种基于Fisher信息的方法,能够显式分离出认知方差,并进一步设计了FLARE(Fisher-Laplace Randomized Estimator)——通过使用模型参数的均匀随机子集近似Fisher信息矩阵,实现高效且可扩展的不确定性估计。实验证明FLARE在合成时间序列生成任务中显著提升了不确定性估计的准确性与可靠性,理论分析也表明其随机化近似具有收敛性保障,且仅用最后一层Laplace近似不足以有效捕捉认知不确定性。
链接: https://arxiv.org/abs/2602.09170
作者: Aditi Gupta,Raphael A. Meyer,Yotam Yaniv,Elynn Chen,N. Benjamin Erichson
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Will appear in the Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026
Abstract:To ensure high quality outputs, it is important to quantify the epistemic uncertainty of diffusion models, but existing methods are often unreliable because they mix epistemic and aleatoric uncertainty. We introduce a method based on Fisher information that explicitly isolates epistemic variance, producing more reliable plausibility scores for generated data. To make this approach scalable, we propose FLARE (Fisher-Laplace Randomized Estimator), which approximates the Fisher information using a uniformly random subset of model parameters. Empirically, FLARE improves uncertainty estimation in synthetic time-series generation tasks, achieving more accurate and reliable filtering than other methods. Theoretically, we bound the convergence rate of our randomized approximation and provide analytic and empirical evidence that last-layer Laplace approximations are insufficient for this task.
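A heavily simplified illustration of scoring samples with Fisher information restricted to a random parameter subset (a toy regression model and Gaussian likelihood stand in for a diffusion model; this is not FLARE's estimator):

```python
# Toy sketch of a random-subset Fisher score (FLARE works on diffusion models;
# here a small regression net and a Gaussian likelihood stand in for illustration).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))

params = list(model.parameters())
subset = [params[i] for i in torch.randperm(len(params))[: len(params) // 2]]  # random subset

def fisher_score(x, y):
    """Squared gradient norm of the log-likelihood w.r.t. the parameter subset."""
    log_lik = -0.5 * (model(x) - y).pow(2).sum()   # Gaussian log-likelihood (unit variance)
    grads = torch.autograd.grad(log_lik, subset)
    return sum(g.pow(2).sum() for g in grads).item()

x = torch.randn(8, 4)
y = torch.randn(8, 1)
scores = [fisher_score(x[i : i + 1], y[i : i + 1]) for i in range(len(x))]
print(scores)  # higher score ~ sample more surprising under the current parameters
```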
[AI-85] AntigenLM: Structure-Aware DNA Language Modeling for Influenza ICLR2026
【速读】:该论文旨在解决当前DNA基础模型在序列分析任务中表现落后于专用方法的问题,其核心挑战在于缺乏对基因组功能单元结构信息的有效建模。解决方案的关键在于提出AntigenLM——一个基于流感病毒基因组进行预训练的生成式DNA语言模型,该模型保留了完整且对齐的功能单元(functional units)结构信息,从而能够捕捉进化约束并实现跨任务泛化。实验证明,这种结构感知的预训练策略显著提升了对抗原变异预测和亚型分类等任务的性能,同时通过消融实验揭示了保持功能单元完整性对于DNA语言建模至关重要。
链接: https://arxiv.org/abs/2602.09067
作者: Yue Pei,Xuebin Chi,Yu Kang
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026
Abstract:Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.
[AI-86] scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在处理真实世界单细胞 RNA 测序(scRNA-seq)数据时缺乏可靠生物洞察力的问题,即如何评估和提升 AI 模型在复杂、不规范的 scRNA-seq 分析流程中提取可验证生物学结果的能力。解决方案的关键在于构建了一个名为 scBench 的基准测试平台,包含 394 个源自实际工作流的可验证问题,覆盖六种测序平台与七类分析任务,并提供确定性评分机制以量化模型对关键生物结果的恢复能力。该基准揭示了模型性能受平台和任务类型显著影响,且在低文档化技术平台上准确率下降超过 40 个百分点,从而为开发更忠实、可重现的单细胞数据分析智能体提供了测量工具与诊断视角。
链接: https://arxiv.org/abs/2602.09063
作者: Kenny Workman,Zhen Yang,Harihara Muralidharan,Aidan Abdulali,Hannah Le
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:As single-cell RNA sequencing datasets grow in adoption, scale, and complexity, data analysis remains a bottleneck for many research groups. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world single-cell datasets. We introduce scBench, a benchmark of 394 verifiable problems derived from practical scRNA-seq workflows spanning six sequencing platforms and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on eight frontier models shows that accuracy ranges from 29-53%, with strong model-task and model-platform interactions. Platform choice affects accuracy as much as model choice, with 40+ percentage point drops on less-documented technologies. scBench complements SpatialBench to cover the two dominant single-cell modalities, serving both as a measurement tool and a diagnostic lens for developing agents that can analyze real scRNA-seq datasets faithfully and reproducibly.
[AI-87] Persistent Entropy as a Detector of Phase Transitions
【速读】:该论文旨在解决持久熵(Persistent Entropy, PE)在复杂系统中检测相变(phase transitions)时缺乏普遍理论支撑的问题,尤其是在随机和数据驱动场景下,其可靠性与适用性尚不明确。解决方案的关键在于提出一个模型无关的普适定理,证明在满足连续性条件或适度正则化条件下,持久熵在不同相之间存在渐近非消失的间隙(asymptotically non-vanishing gap),从而理论上保证其能可靠区分两相。该理论框架不依赖于特定数据模态、过滤方式或同调度,具有广泛适用性;同时,作者进一步引入基于拓扑稳定性的操作框架,通过滑动窗口内统计量的稳定化定义拓扑转变时间,并构建有限观测期内临界参数的概率估计器,实现了从渐近理论到有限样本计算的桥梁。
链接: https://arxiv.org/abs/2602.09058
作者: Matteo Rucco
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:
Abstract:Persistent entropy (PE) is an information-theoretic summary statistic of persistence barcodes that has been widely used to detect regime changes in complex systems. Despite its empirical success, a general theoretical understanding of when and why persistent entropy reliably detects phase transitions has remained limited, particularly in stochastic and data-driven settings. In this work, we establish a general, model-independent theorem providing sufficient conditions under which persistent entropy provably separates two phases. We show that persistent entropy exhibits an asymptotically non-vanishing gap across phases. The result relies only on continuity of persistent entropy along the convergent diagram sequence, or under mild regularization, and is therefore broadly applicable across data modalities, filtrations, and homological degrees. To connect asymptotic theory with finite-time computations, we introduce an operational framework based on topological stabilization, defining a topological transition time by stabilizing a chosen topological statistic over sliding windows, and a probability-based estimator of critical parameters within a finite observation horizon. We validate the framework on the Kuramoto synchronization transition, the Vicsek order-to-disorder transition in collective motion, and neural network training dynamics across multiple datasets and architectures. Across all experiments, stabilization of persistent entropy and collapse of variability across realizations provide robust numerical signatures consistent with the theoretical mechanism.
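Persistent entropy itself has a simple closed form: with finite bar lifetimes l_i = d_i - b_i and total lifetime L = Σ_i l_i, PE = -Σ_i (l_i / L) log(l_i / L). A direct implementation (dropping infinite bars is one common convention, not necessarily the paper's):

```python
# Persistent entropy of a persistence barcode: PE = -sum p_i * log(p_i),
# where p_i = (death_i - birth_i) / total lifetime. Intervals with infinite
# death are dropped here; other conventions cap them at a maximum filtration value.
import numpy as np

def persistent_entropy(barcode):
    barcode = np.asarray(barcode, dtype=float)
    lifetimes = barcode[:, 1] - barcode[:, 0]
    lifetimes = lifetimes[np.isfinite(lifetimes) & (lifetimes > 0)]
    p = lifetimes / lifetimes.sum()
    return float(-np.sum(p * np.log(p)))

print(persistent_entropy([(0.0, 1.0), (0.0, 1.0)]))   # ~log(2): two equal bars
print(persistent_entropy([(0.0, 2.0), (0.0, 0.1)]))   # lower: one bar dominates
```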
[AI-88] Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures
【速读】:该论文旨在解决自监督语音表示学习中因缺乏显式锚定而导致的表示坍缩(representation collapse)问题。其关键解决方案是提出GMM-Anchored JEPA,即在训练初期通过拟合一个高斯混合模型(Gaussian Mixture Model, GMM)于对数梅尔频谱图上,并将冻结的软后验概率作为辅助目标引入JEPA框架;同时采用衰减式的监督策略,使GMM正则化在早期主导训练过程,随后逐步让位于JEPA主目标。相比HuBERT和WavLM等方法需要迭代重聚类,该方案仅进行一次软聚类,避免了硬分配带来的信息损失,实验表明该方法显著提升了自动语音识别(ASR)、情感识别和槽位填充任务的性能,并实现了更均匀的聚类利用(熵值达98% vs. WavLM的31%)。
链接: https://arxiv.org/abs/2602.09040
作者: Georgios Ioannides,Adrian Kieback,Judah Goldfeder,Linsey Pang,Aman Chadha,Aaron Elkins,Yann LeCun,Ravid Shwartz-Ziv
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: 15 pages, 5 figures. Code: this http URL
Abstract:Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored representations achieve up to 98% entropy compared to 31% for WavLM-style, indicating substantially more uniform cluster utilization. Code is made available at this https URL.
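The anchoring mechanism can be sketched as follows: fit a GMM once on log-mel frames, freeze its soft posteriors, and add a decaying auxiliary cross-entropy between an auxiliary head and those posteriors. The component count, data, and decay schedule below are illustrative assumptions, not the paper's settings.

```python
# Sketch of GMM soft-posterior anchoring with a decaying auxiliary weight
# (component count, data, and schedule are assumptions, not the paper's values).
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

K, DIM = 16, 80                                          # clusters, log-mel dimensions
frames = np.random.randn(5000, DIM).astype(np.float32)   # stand-in for log-mel frames

gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(frames)                                          # fit once, then freeze
soft_targets = torch.from_numpy(gmm.predict_proba(frames)).float()   # (N, K) soft posteriors

def anchored_loss(jepa_loss, cluster_logits, batch_idx, step, total_steps):
    """jepa_loss: main objective; cluster_logits: (B, K) output of an auxiliary head."""
    weight = max(0.0, 1.0 - step / (0.5 * total_steps))              # decaying supervision
    anchor = F.cross_entropy(cluster_logits, soft_targets[batch_idx])  # soft-target CE
    return jepa_loss + weight * anchor

logits = torch.randn(32, K, requires_grad=True)
idx = torch.randint(0, len(frames), (32,))
print(anchored_loss(torch.tensor(1.0), logits, idx, step=100, total_steps=10_000))
```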
[AI-89] E2CAR: An Efficient 2D-CNN Framework for Real-Time EEG Artifact Removal on Edge Devices
【速读】:该论文旨在解决脑电图(EEG)信号中伪迹(artifact)干扰导致后续分析精度下降的问题,尤其针对传统去伪迹方法在边缘设备上计算成本高、难以实现实时处理的挑战。解决方案的关键在于将原有的1维卷积神经网络(1-D CNN)替换为2维卷积神经网络(2-D CNN),从而显著降低模型计算复杂度,并部署于边缘张量处理单元(Edge Tensor Processing Unit, TPU)上运行;该优化策略使推理时间减少90%、功耗降低18.98%,同时保持与现有方法相当的去伪迹性能,实现了高效低延迟的EEG信号边缘处理。
链接: https://arxiv.org/abs/2602.09035
作者: Haoliang Liu,Chengkun Cai,Xu Zhao,Lei Li
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Electroencephalography (EEG) signals are frequently contaminated by artifacts, affecting the accuracy of subsequent analysis. Traditional artifact removal methods are often computationally expensive and inefficient for real-time applications in edge devices. This paper presents a method to reduce the computational cost of most existing convolutional neural networks (CNN) by replacing one-dimensional (1-D) CNNs with two-dimensional (2-D) CNNs and deploys them on Edge Tensor Processing Unit (TPU), which is an open-resource hardware accelerator widely used in edge devices for low-latency, low-power operation. A new Efficient 2D-CNN Artifact Removal (E2CAR) framework is also represented using the method above, and it achieves a 90% reduction in inference time on the TPU and decreases power consumption by 18.98%, while maintaining comparable artifact removal performance to existing methods. This approach facilitates efficient EEG signal processing on edge devices.
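The core substitution, treating a (channels × time) EEG window as a single-channel image so that small Conv2d kernels can replace wide Conv1d kernels, can be sketched as follows; the shapes and kernel sizes are assumptions, not E2CAR's actual configuration.

```python
# Sketch of the 1-D -> 2-D substitution: treat (channels x time) EEG windows as
# single-channel images and process them with small Conv2d kernels. Shapes and
# kernel sizes are illustrative, not E2CAR's actual architecture.
import torch
import torch.nn as nn

eeg = torch.randn(8, 19, 512)            # batch of 8 windows, 19 channels, 512 samples

conv1d = nn.Conv1d(19, 32, kernel_size=9, padding=4)            # conventional approach
conv2d = nn.Sequential(                                          # 2-D replacement
    nn.Conv2d(1, 32, kernel_size=(3, 9), padding=(1, 4)),
    nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=(3, 9), padding=(1, 4)),
)

out_1d = conv1d(eeg)                      # (8, 32, 512)
out_2d = conv2d(eeg.unsqueeze(1))         # (8, 1, 19, 512): denoised "image"
print(out_1d.shape, out_2d.shape)
```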
[AI-90] Recovering Whole-Brain Causal Connectivity under Indirect Observation with Applications to Human EEG and fMRI
【速读】:该论文旨在解决从神经成像数据中推断有向连接性(directed connectivity)这一病态逆问题,即记录信号受血氧水平依赖(BOLD)滤波和体积传导(volume conduction)等测量物理过程干扰,导致真实神经交互被掩盖或混淆,从而产生由测量过程驱动的虚假因果图。解决方案的关键在于提出INCAMA(INdirect CAusal MAmba)框架,其核心创新是将一个物理感知的反演模块与基于选择性状态空间序列的非平稳性驱动、延迟敏感的因果发现模型相结合,通过利用机制变化作为软干预(soft interventions),实现从间接观测中可识别延迟因果结构,并提供量化反演误差对图恢复影响的稳定性边界,从而在EEG和fMRI大规模生物物理仿真及人类连接组计划(Human Connectome Project)真实fMRI数据上实现零样本泛化,准确恢复经典视觉-运动通路(如V1→V2和M1↔S1)。
链接: https://arxiv.org/abs/2602.09034
作者: Sangyoon Bae,Miruna Oprescu,David Keetae Park,Shinjae Yoo,Jiook Cha
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures
Abstract:Inferring directed connectivity from neuroimaging is an ill-posed inverse problem: recorded signals are distorted by hemodynamic filtering and volume conduction, which can mask true neural interactions. Many existing methods conflate these observation artifacts with genuine neural influence, risking spurious causal graphs driven by the measurement process. We introduce INCAMA (INdirect CAusal MAmba), a latent-space causal discovery framework that explicitly accounts for measurement physics to separate neural dynamics from indirect observations. INCAMA integrates a physics-aware inversion module with a nonstationarity-driven, delay-sensitive causal discovery model based on selective state-space sequences. Leveraging nonstationary mechanism shifts as soft interventions, we establish identifiability of delayed causal structure from indirect measurements and a stability bound that quantifies how inversion error affects graph recovery. We validate INCAMA on large-scale biophysical simulations across EEG and fMRI, where it significantly outperforms standard pipelines. We further demonstrate zero-shot generalization to real-world fMRI from the Human Connectome Project: without domain-specific fine-tuning, INCAMA recovers canonical visuo-motor pathways (e.g., V1 → V2 and M1 ↔ S1) consistent with established neuroanatomy, supporting its use for whole-brain causal inference.
机器学习
[LG-0] Towards Explainable Federated Learning: Understanding the Impact of Differential Privacy
链接: https://arxiv.org/abs/2602.10100
作者: Júlio Oliveira,Rodrigo Ferreira,André Riker,Glaucio H. S. Carvalho,Eirini Eleni Tsilopoulou
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Data privacy and eXplainable Artificial Intelligence (XAI) are two important aspects of modern Machine Learning systems. To enhance data privacy, recent machine learning models have been designed as Federated Learning (FL) systems. On top of that, additional privacy layers can be added via Differential Privacy (DP). On the other hand, to improve explainability, ML must consider more interpretable approaches with a reduced number of features and a less complex internal architecture. In this context, this paper aims to achieve a machine learning (ML) model that combines enhanced data privacy with explainability. We propose an FL solution, called Federated EXplainable Trees with Differential Privacy (FEXT-DP), that: (i) is based on Decision Trees, since they are lightweight and more explainable than neural-network-based FL systems; (ii) provides an additional layer of data privacy protection by applying Differential Privacy (DP) to the tree-based model. However, adding DP has a side effect: it harms the explainability of the system. This paper therefore also presents the impact of DP protection on the explainability of the ML model. The performance assessment shows that FEXT-DP improves training speed (number of rounds), Mean Squared Error, and explainability.
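One standard way to combine DP with shared tree statistics, in the spirit described above, is the Laplace mechanism on per-leaf aggregates; the sensitivity bound and epsilon below are placeholders, and FEXT-DP's exact mechanism may differ.

```python
# Illustrative DP step for a federated tree model: perturb each leaf's mean
# prediction with Laplace noise before sending it to the server. Sensitivity
# and epsilon are placeholder values; FEXT-DP's exact mechanism may differ.
import numpy as np

def privatize_leaf_means(leaf_sums, leaf_counts, epsilon, value_range):
    """Laplace mechanism on per-leaf means; sensitivity of a mean ~ value_range / count."""
    rng = np.random.default_rng()
    means = leaf_sums / np.maximum(leaf_counts, 1)
    scale = (value_range / np.maximum(leaf_counts, 1)) / epsilon
    return means + rng.laplace(0.0, scale)

sums = np.array([12.4, 3.1, 40.2])
counts = np.array([10, 4, 25])
print(privatize_leaf_means(sums, counts, epsilon=1.0, value_range=5.0))
```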
[LG-1] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
链接: https://arxiv.org/abs/2602.10067
作者: Aaditya Vikram Prasad,Connor Watts,Jack Merullo,Dhruvil Gala,Owen Lewis,Thomas McGrath,Ekdeep Singh Lubana
类目: Machine Learning (cs.LG)
*备注:
Abstract:Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
[LG-2] Evaluating Disentangled Representations for Controllable Music Generation ICASSP2026
链接: https://arxiv.org/abs/2602.10058
作者: Laura Ibáñez-Martínez,Chukwuemeka Nkama,Andrea Poltronieri,Xavier Serra,Martín Rocamora
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at ICASSP 2026
Abstract:Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
[LG-3] WildCat: Near-Linear Attention in Theory and Practice
链接: https://arxiv.org/abs/2602.10056
作者: Tobias Schröder,Lester Mackey
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length n. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm – randomly pivoted Cholesky – and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial O(n^{-\sqrt{\log(\log(n))}}) error decay while running in near-linear O(n^{1+o(1)}) time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
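The coreset selector named in the abstract, randomly pivoted Cholesky, fits in a few lines: sample a pivot with probability proportional to the residual kernel diagonal, then downdate. The kernel and sizes below are illustrative, and WildCat's optimal coreset weighting and attention integration are omitted.

```python
# Randomly pivoted Cholesky (RPCholesky) sketch for selecting a spectrally
# accurate coreset; the attention-specific weighting in WildCat is omitted.
import numpy as np

def rp_cholesky(K, k, rng):
    """K: (n, n) PSD kernel matrix; returns k pivot indices and a Nystrom factor F."""
    n = K.shape[0]
    d = np.diag(K).copy().astype(float)        # residual diagonal
    F = np.zeros((n, k))
    pivots = []
    for i in range(k):
        p = rng.choice(n, p=d / d.sum())        # sample pivot ~ residual diagonal
        pivots.append(p)
        col = K[:, p] - F[:, :i] @ F[p, :i]     # residual column at the pivot
        F[:, i] = col / np.sqrt(col[p])
        d = np.maximum(d - F[:, i] ** 2, 0.0)   # downdate the residual diagonal
    return pivots, F

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
K = np.exp(X @ X.T / 8.0)                       # exponential kernel, as in softmax attention
pivots, F = rp_cholesky(K, k=32, rng=rng)
print(len(pivots), np.linalg.norm(K - F @ F.T) / np.linalg.norm(K))   # relative approximation error
```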
[LG-4] Effectiveness of Binary Autoencoders for QUBO-Based Optimization Problems
链接: https://arxiv.org/abs/2602.10037
作者: Tetsuro Abe,Masashi Yamashita,Shu Tanaka
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph)
*备注: 14 pages, 5 figures
Abstract:In black-box combinatorial optimization, objective evaluations are often expensive, so high quality solutions must be found under a limited budget. Factorization machine with quantum annealing (FMQA) builds a quadratic surrogate model from evaluated samples and optimizes it on an Ising machine. However, FMQA requires binary decision variables, and for nonbinary structures such as integer permutations, the choice of binary encoding strongly affects search efficiency. If the encoding fails to reflect the original neighborhood structure, small Hamming moves may not correspond to meaningful modifications in the original solution space, and constrained problems can yield many infeasible candidates that waste evaluations. Recent work combines FMQA with a binary autoencoder (bAE) that learns a compact binary latent code from feasible solutions, yet the mechanism behind its performance gains is unclear. Using a small traveling salesman problem as an interpretable testbed, we show that the bAE reconstructs feasible tours accurately and, compared with manually designed encodings at similar compression, better aligns tour distances with latent Hamming distances, yields smoother neighborhoods under small bit flips, and produces fewer local optima. These geometric properties explain why bAE+FMQA improves the approximation ratio faster while maintaining feasibility throughout optimization, and they provide guidance for designing latent representations for black-box optimization.
[LG-5] Position: Message-passing and spectral GNNs are two sides of the same coin
链接: https://arxiv.org/abs/2602.10031
作者: Antonis Vasileiou,Juan Cervino,Pascal Frossard,Charilaos I. Kanatsoulis,Christopher Morris,Michael T. Schaub,Pierre Vandergheynst,Zhiyang Wang,Guy Wolf,Ron Levie
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks (GNNs) are commonly divided into message-passing neural networks (MPNNs) and spectral graph neural networks, reflecting two largely separate research traditions in machine learning and signal processing. This paper argues that this divide is mostly artificial, hindering progress in the field. We propose a viewpoint in which both MPNNs and spectral GNNs are understood as different parametrizations of permutation-equivariant operators acting on graph signals. From this perspective, many popular architectures are equivalent in expressive power, while genuine gaps arise only in specific regimes. We further argue that MPNNs and spectral GNNs offer complementary strengths. That is, MPNNs provide a natural language for discrete structure and expressivity analysis using tools from logic and graph isomorphism research, while the spectral perspective provides principled tools for understanding smoothing, bottlenecks, stability, and community structure. Overall, we posit that progress in graph learning will be accelerated by clearly understanding the key similarities and differences between these two types of GNNs, and by working towards unifying these perspectives within a common theoretical and conceptual framework rather than treating them as competing paradigms.
[LG-6] A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula
链接: https://arxiv.org/abs/2602.10014
作者: Chenruo Liu,Yijun Dong,Yiqiu Shen,Qi Lei
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated via Monte-Carlo simulations and controlled experiments on graph-based reasoning tasks.
[LG-7] Answer First Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning
链接: https://arxiv.org/abs/2602.10006
作者: Shijie Zhang,Xiang Guo,Rujun Guo,Shaoyu Liu,Xiaozhao Wang,Guanjun Jiang,Kevin Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information theory perspective: RL inherently minimizes the Reverse KL divergence, which tends to seek probability peaks (mode-seeking) and is prone to "reward hacking." On the other hand, SFT minimizes the Forward KL divergence, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a Mode-Balanced Optimization strategy, incorporating an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.
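The mode-balanced objective amounts to adding a forward-KL-style cross-entropy on expert data to the policy-gradient term. A schematic combination (the Stepwise-GRPO machinery is omitted and the mixing weight lam is an assumption):

```python
# Schematic mode-balanced objective: RL policy-gradient term (reverse-KL-like,
# mode-seeking) plus an SFT cross-entropy term on expert data (forward-KL-like,
# mode-covering). The GRPO details and the lambda value are assumptions.
import torch
import torch.nn.functional as F

def mode_balanced_loss(policy_logprobs, advantages, expert_logits, expert_tokens, lam=0.1):
    """policy_logprobs: (B,) log-probs of sampled responses; advantages: (B,) rewards
    relative to the group baseline; expert_logits: (N, V); expert_tokens: (N,)."""
    rl_loss = -(advantages.detach() * policy_logprobs).mean()   # mode-seeking RL term
    sft_loss = F.cross_entropy(expert_logits, expert_tokens)    # mode-covering anchor
    return rl_loss + lam * sft_loss

logp = torch.randn(16, requires_grad=True)
adv = torch.randn(16)
logits = torch.randn(64, 32000, requires_grad=True)
tokens = torch.randint(0, 32000, (64,))
print(mode_balanced_loss(logp, adv, logits, tokens))
```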
[LG-8] Causal Identification in Multi-Task Demand Learning with Confounding
链接: https://arxiv.org/abs/2602.09969
作者: Varun Gupta,Vijay Kamble
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注:
Abstract:We study a canonical multi-task demand learning problem motivated by retail pricing, in which a firm seeks to estimate heterogeneous linear price-response functions across a large collection of decision contexts. Each context is characterized by rich observable covariates yet typically exhibits only limited historical price variation, motivating the use of multi-task learning to borrow strength across tasks. A central challenge in this setting is endogeneity: historical prices are chosen by managers or algorithms and may be arbitrarily correlated with unobserved, task-level demand determinants. Under such confounding by latent fundamentals, commonly used approaches, such as pooled regression and meta-learning, fail to identify causal price effects. We propose a new estimation framework that achieves causal identification despite arbitrary dependence between prices and latent task structure. Our approach, Decision-Conditioned Masked-Outcome Meta-Learning (DCMOML), involves carefully designing the information set of a meta-learner to leverage cross-task heterogeneity while accounting for endogenous decision histories. Under a mild restriction on price adaptivity in each task, we establish that this method identifies the conditional mean of the task-specific causal parameters given the designed information set. Our results provide guarantees for large-scale demand estimation with endogenous prices and small per-task samples, offering a principled foundation for deploying causal, data-driven pricing models in operational environments.
[LG-9] Stemphonic: All-at-once Flexible Multi-stem Music Generation ICASSP
链接: https://arxiv.org/abs/2602.09891
作者: Shih-Lun Wu,Ge Zhu,Juan-Pablo Caceres,Cheng-Zhi Anna Huang,Nicholas J. Bryan
类目: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Accepted for publication at Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference-time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: this https URL.
[LG-10] Statistical benchmarking of transformer models in low signal-to-noise time-series forecasting ICML
链接: https://arxiv.org/abs/2602.09869
作者: Cyril Garcia,Guillaume Remy
类目: Machine Learning (cs.LG)
*备注: Submitted to ICML
Abstract:We study the performance of transformer architectures for multivariate time-series forecasting in low-data regimes consisting of only a few years of daily observations. Using synthetically generated processes with known temporal and cross-sectional dependency structures and varying signal-to-noise ratios, we conduct bootstrapped experiments that enable direct evaluation via out-of-sample correlations with the optimal ground-truth predictor. We show that two-way attention transformers, which alternate between temporal and cross-sectional self-attention, can outperform standard baselines (Lasso, boosting methods, and fully connected multilayer perceptrons) across a wide range of settings, including low signal-to-noise regimes. We further introduce a dynamic sparsification procedure for attention matrices applied during training, and demonstrate that it is particularly effective in noisy environments, where the correlation between the target variable and the optimal predictor is on the order of a few percent. Analysis of the learned attention patterns reveals interpretable structure and suggests connections to sparsity-inducing regularization in classical regression, providing insight into why these models generalize effectively under noise.
[LG-11] Differentiable Tripartite Modularity for Clustering Heterogeneous Graphs
链接: https://arxiv.org/abs/2602.09864
作者: Benoît Hurpeau
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 12 pages, 3 figures
Abstract:Clustering heterogeneous relational data remains a central challenge in graph learning, particularly when interactions involve more than two types of entities. While differentiable modularity objectives such as DMoN have enabled end-to-end community detection on homogeneous and bipartite graphs, extending these approaches to higher-order relational structures remains non-trivial. In this work, we introduce a differentiable formulation of tripartite modularity for graphs composed of three node types connected through mediated interactions. Community structure is defined in terms of weighted co-paths across the tripartite graph, together with an exact factorized computation that avoids the explicit construction of dense third-order tensors. A structural normalization at pivot nodes is introduced to control extreme degree heterogeneity and ensure stable optimization. The resulting objective can be optimized jointly with a graph neural network in an end-to-end manner, while retaining linear complexity in the number of edges. We validate the proposed framework on large-scale urban cadastral data, where it exhibits robust convergence behavior and produces spatially coherent partitions. These results highlight differentiable tripartite modularity as a generic methodological building block for unsupervised clustering of heterogeneous graphs.
[LG-12] CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization
链接: https://arxiv.org/abs/2602.09851
作者: Beicheng Xu,Keyao Ding,Wei Liu,Yupeng Lu,Bin Cui
类目: Machine Learning (cs.LG)
*备注:
Abstract:Feature Engineering (FE) is pivotal in automated machine learning (AutoML) but remains a bottleneck for traditional methods, which treat it as a black-box search, operating within rigid, predefined search spaces and lacking domain awareness. While Large Language Models (LLMs) offer a promising alternative by leveraging semantic reasoning to generate unbounded operators, existing methods fail to construct free-form FE pipelines, remaining confined to isolated subtasks such as feature generation. Most importantly, they are rarely optimized jointly with hyperparameter optimization (HPO) of the ML model, leading to greedy “FE-then-HPO” workflows that cannot capture strong FE-HPO interactions. In this paper, we present CoFEH, a collaborative framework that interleaves LLM-based FE and Bayesian HPO for robust end-to-end AutoML. CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (ToT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that realizes interleaved optimization by adaptively scheduling FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism that shares context between LLM and BO, enabling mutually informed decisions. Experiments show that CoFEH not only outperforms traditional and LLM-based FE baselines, but also achieves superior end-to-end performance under joint optimization.
[LG-13] PlugSI: Plug-and-Play Test-Time Graph Adaptation for Spatial Interpolation DASFAA2026
链接: https://arxiv.org/abs/2602.09824
作者: Xuhang Wu,Zhuoxuan Liang,Wei Li,Xiaohua Jia,Sumi Helal
类目: Machine Learning (cs.LG)
*备注: Accepted at DASFAA 2026 (Full Research Paper)
Abstract:With the rapid advancement of IoT and edge computing, sensor networks have become indispensable, driving the need for large-scale sensor deployment. However, the high deployment cost hinders their scalability. To tackle this issue, Spatial Interpolation (SI) introduces virtual sensors whose readings are inferred from observed sensors by leveraging the graph structure. However, current graph-based SI methods rely on pre-trained models, lack adaptation to larger and unseen graphs at test-time, and overlook test data utilization. To address these issues, we propose PlugSI, a plug-and-play framework that refines the test-time graph through two key innovations. First, we design an Unknown Topology Adapter (UTA) that adapts to the new graph structure of each small batch at test-time, enhancing the generalization of SI pre-trained models. Second, we introduce a Temporal Balance Adapter (TBA) that maintains a stable historical consensus to guide UTA adaptation and prevent drifting caused by noise in the current batch. Empirically, extensive experiments demonstrate that PlugSI can be seamlessly integrated into existing graph-based SI methods and provides significant improvements (e.g., a 10.81% reduction in MAE).
[LG-14] Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson's disease and isolated REM sleep behavior disorder
链接: https://arxiv.org/abs/2602.09793
作者: Jesper Strøm,Casper Skjærbæk,Natasha Becker Bertelsen,Steffen Torpe Simonsen,Niels Okkels,David Bertram,Sinah Röttgen,Konstantin Kufer,Kaare B. Mikkelsen,Marit Otto,Poul Jørgen Jennum,Per Borghammer,Michael Sommerauer,Preben Kidmose
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 21 pages excluding supplementary, 9 figures
Abstract:Isolated REM sleep behavior disorder (iRBD) is a key prodromal marker of Parkinson's disease (PD), and video-polysomnography (vPSG) remains the diagnostic gold standard. However, manual sleep staging is particularly challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, making PSG assessments a bottleneck for deploying new RBD screening technologies at scale. We adapted U-Sleep, a deep neural network, for generalizable sleep staging in PD and iRBD. A pretrained U-Sleep model, based on a large publicly available, multisite non-neurodegenerative dataset (PUB; 19,236 PSGs across 12 sites), was fine-tuned on research datasets from two centers (Lundbeck Foundation Parkinson's Disease Research Center (PACE) and the Cologne-Bonn Cohort (CBC); 112 PD, 138 iRBD, 89 age-matched controls). The resulting model was evaluated on an independent dataset from the Danish Center for Sleep Medicine (DCSM; 81 PD, 36 iRBD, 87 sleep-clinic controls). A subset of PSGs with low agreement between the human rater and the model (\kappa < 0.6) was re-scored by a second blinded human rater to identify sources of disagreement. Finally, we applied confidence-based thresholds to optimize REM sleep staging. The pretrained model achieved mean \kappa = 0.81 in PUB, but \kappa = 0.66 when applied directly to PACE/CBC. By fine-tuning the model, we developed a generalized model with \kappa = 0.74 on PACE/CBC (p < 0.001 vs. the pretrained model). In DCSM, mean and median \kappa increased from 0.60 to 0.64 (p < 0.001) and 0.64 to 0.69 (p < 0.001), respectively. In the interrater study, PSGs with low agreement between the model and the initial scorer showed similarly low agreement between human scorers. Applying a confidence threshold increased the proportion of correctly identified REM sleep epochs from 85% to 95.5%, while preserving sufficient (>5 min) REM sleep for 95% of subjects.
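The confidence-based thresholding in the last sentence amounts to keeping only high-confidence REM epochs. A toy sketch, with the stage ordering, REM index, and threshold chosen arbitrarily rather than taken from the paper:

```python
import numpy as np

def confident_rem_epochs(probs, rem_index=4, threshold=0.9):
    """Return indices of 30-s epochs labelled REM with high confidence.

    probs: array [n_epochs, n_stages] of softmax probabilities per epoch.
    rem_index: column corresponding to REM in the model's stage ordering.
    threshold: minimum REM probability required to keep an epoch.
    """
    pred = probs.argmax(axis=1)
    keep = (pred == rem_index) & (probs[:, rem_index] >= threshold)
    return np.where(keep)[0]

# toy usage: 1,000 epochs, 5 stages (W, N1, N2, N3, REM)
probs = np.random.dirichlet(np.ones(5), size=1000)
rem_epochs = confident_rem_epochs(probs, rem_index=4, threshold=0.9)
```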
[LG-15] When Less is More: The LLM Scaling Paradox in Context Compression
链接: https://arxiv.org/abs/2602.09789
作者: Ruishan Guo,Yibing Liu,Guoxin Ma,Yan Wang,Yueyang Zhang,Long Xia,Kecheng Chen,Zhiyuan Sun,Daiting Shi
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, conference
Abstract:Scaling up model parameters has long been a prevalent training paradigm driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we observe a Size-Fidelity Paradox: increasing the compressor size can lessen the faithfulness of reconstructed contexts even though the training loss decreases. Through extensive experiments across models from 0.6B to 90B, we trace this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., "the white strawberry" \to "the red strawberry"; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., "Alice hit Bob" \to "Bob hit Alice". By holding model size fixed, we reflect on the emergent properties of compressed context representations. We show that the culprit is not parameter count itself, but the excessive semantic capacity and amplified generative uncertainty that accompany scaling. Specifically, the increased rank of context embeddings facilitates prior knowledge intrusion, whereas higher entropy over token prediction distributions promotes rewriting. Our results complement existing evaluations of the context compression paradigm, underpinning a breakdown in scaling laws for faithful preservation in open-ended generation.
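The two diagnostics mentioned above, the effective rank of context embeddings and the entropy of the token prediction distribution, can be computed roughly as follows; this is our own sketch, not the authors' code.

```python
import numpy as np

def effective_rank(embeddings):
    """Entropy-based effective rank of a [n_tokens, dim] embedding matrix."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def mean_token_entropy(logits):
    """Average entropy of next-token distributions; logits: [n_steps, vocab]."""
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())

# toy usage on random data
emb = np.random.randn(128, 64)        # stand-in for compressed-context embeddings
logits = np.random.randn(128, 1000)   # stand-in for decoder logits
print(effective_rank(emb), mean_token_entropy(logits))
```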
[LG-16] Towards Poisoning Robustness Certification for Natural Language Generation
链接: https://arxiv.org/abs/2602.09757
作者: Mihnea Ghitu,Matthew Wicker
类目: Machine Learning (cs.LG)
*备注:
Abstract:Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity/targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA’s effectiveness across diverse settings including: certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.
[LG-17] BRAVA-GNN: Betweenness Ranking Approximation Via Degree MAss Inspired Graph Neural Network KDD
链接: https://arxiv.org/abs/2602.09716
作者: Justin Dachille,Aurora Rossi,Sunil Kumar Maurya,Frederik Mallmann-Trenn,Xin Liu,Frédéric Giroire,Tsuyoshi Murata,Emanuele Natale
类目: Machine Learning (cs.LG)
*备注: Submitted to KDD
Abstract:Computing node importance in networks is a long-standing fundamental problem that has driven extensive study of various centrality measures. A particularly well-known centrality measure is betweenness centrality, which becomes computationally prohibitive on large-scale networks. Graph Neural Network (GNN) models have thus been proposed to predict node rankings according to their relative betweenness centrality. However, state-of-the-art methods fail to generalize to high-diameter graphs such as road networks. We propose BRAVA-GNN, a lightweight GNN architecture that leverages the empirically observed correlation linking betweenness centrality to degree-based quantities, in particular multi-hop degree mass. This correlation motivates the use of degree masses as size-invariant node features and synthetic training graphs that closely match the degree distributions of real networks. Furthermore, while previous work relies on scale-free synthetic graphs, we leverage the hyperbolic random graph model, which reproduces power-law exponents outside the scale-free regime, better capturing the structure of real-world graphs like road networks. This design enables BRAVA-GNN to generalize across diverse graph families while using 54x fewer parameters than the most lightweight existing GNN baseline. Extensive experiments on 19 real-world networks, spanning social, web, email, and road graphs, show that BRAVA-GNN achieves up to 214% improvement in Kendall-Tau correlation and up to 70x speedup in inference time over state-of-the-art GNN-based approaches, particularly on challenging road networks.
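One plausible reading of multi-hop degree mass as a node feature is sketched below (hop-1 mass is the degree; each further hop sums the neighbours' masses); the exact definition and normalization used by BRAVA-GNN may differ.

```python
import numpy as np

def degree_mass_features(adj, hops=3):
    """Multi-hop degree masses as size-invariant node features.

    adj: dense 0/1 adjacency matrix [n, n] (symmetric, no self-loops).
    Hop-1 mass is the node degree; hop-(k+1) mass is the sum of the
    neighbours' hop-k masses. Each column is normalised by its mean so
    the features do not scale with graph size.
    """
    deg = adj.sum(axis=1)
    feats, mass = [], deg.astype(float)
    for _ in range(hops):
        feats.append(mass / (mass.mean() + 1e-12))
        mass = adj @ mass
    return np.stack(feats, axis=1)

# toy usage on a small random graph
rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.1).astype(float)
A = np.triu(A, 1)
A = A + A.T
X = degree_mass_features(A, hops=3)   # [50, 3] node feature matrix
```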
[LG-18] Contextual and Seasonal LSTMs for Time Series Anomaly Detection ICLR2026
链接: https://arxiv.org/abs/2602.09690
作者: Lingpei Zhang,Qingming Li,Yong Yang,Jiahao Chen,Rui Zeng,Chenyang Lyu,Shouling Ji
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026
Abstract:Univariate time series (UTS), where each timestamp records a single variable, serve as crucial indicators in web systems and cloud servers. Anomaly detection in UTS plays an essential role in both data mining and system reliability management. However, existing reconstruction-based and prediction-based methods struggle to capture certain subtle anomalies, particularly small point anomalies and slowly rising anomalies. To address these challenges, we propose a novel prediction-based framework named Contextual and Seasonal LSTMs (CS-LSTMs). CS-LSTMs are built upon a noise decomposition strategy and jointly leverage contextual dependencies and seasonal patterns, thereby strengthening the detection of subtle anomalies. By integrating both time-domain and frequency-domain representations, CS-LSTMs achieve more accurate modeling of periodic trends and anomaly localization. Extensive evaluations on public benchmark datasets demonstrate that CS-LSTMs consistently outperform state-of-the-art methods, highlighting their effectiveness and practical value in robust time series anomaly detection.
[LG-19] Model soups need only one ingredient
链接: https://arxiv.org/abs/2602.09689
作者: Alireza Abdollahpoorrostam,Nikolaos Dimitriadis,Adam Hazimeh,Pascal Frossard
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer’s update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically re-weigh these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining much of their benefits without their computational overhead.
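A rough sketch of the per-layer idea, splitting the fine-tuning update via SVD and re-weighting it with an entropy-based effective rank; the specific coefficients below are an illustrative choice, not the paper's mapping.

```python
import numpy as np

def effective_rank(s):
    """Entropy-based effective rank of a singular-value spectrum."""
    p = s / (s.sum() + 1e-12)
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def reweight_layer_update(w_pre, w_ft):
    """Split one layer's fine-tuning update into high- and low-energy
    SVD directions and re-weight them with a coefficient derived from
    the effective rank (illustrative choice of coefficients)."""
    delta = w_ft - w_pre
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = max(1, int(round(effective_rank(s))))
    high = (u[:, :r] * s[:r]) @ vt[:r]        # task-specific directions
    low = delta - high                        # noisy / residual directions
    alpha = r / len(s)                        # keep a damped share of the residual
    return w_pre + high + alpha * low

# toy usage on random "pre-trained" and "fine-tuned" weights
w_pre = np.random.randn(64, 64)
w_ft = w_pre + 0.1 * np.random.randn(64, 64)
w_merged = reweight_layer_update(w_pre, w_ft)
```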
[LG-20] Differentiable Modeling for Low-Inertia Grids: Benchmarking PINNs, NODEs, and DP for Identification and Control of SMIB System
链接: https://arxiv.org/abs/2602.09667
作者: Shinhoo Kang,Sangwook Kim,Sehyun Yun
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 9 pages, 7 figures, 4 tables
Abstract:The transition toward low-inertia power systems demands modeling frameworks that provide not only accurate state predictions but also physically consistent sensitivities for control. While scientific machine learning offers powerful nonlinear modeling tools, the control-oriented implications of different differentiable paradigms remain insufficiently understood. This paper presents a comparative study of Physics-Informed Neural Networks (PINNs), Neural Ordinary Differential Equations (NODEs), and Differentiable Programming (DP) for modeling, identification, and control of power system dynamics. Using the Single Machine Infinite Bus (SMIB) system as a benchmark, we evaluate their performance in trajectory extrapolation, parameter estimation, and Linear Quadratic Regulator (LQR) synthesis. Our results highlight a fundamental trade-off between data-driven flexibility and physical structure. NODE exhibits superior extrapolation by capturing the underlying vector field, whereas PINN shows limited generalization due to its reliance on a time-dependent solution map. In the inverse problem of parameter identification, while both DP and PINN successfully recover the unknown parameters, DP achieves significantly faster convergence by enforcing governing equations as hard constraints. Most importantly, for control synthesis, the DP framework yields closed-loop stability comparable to the theoretical optimum. Furthermore, we demonstrate that NODE serves as a viable data-driven surrogate when governing equations are unavailable.
[LG-21] Blind denoising diffusion models and the blessings of dimensionality
链接: https://arxiv.org/abs/2602.09639
作者: Zahra Kadkhodaie,Aram-Alexandre Pooladian,Sinho Chewi,Eero Simoncelli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 40 pages, 12 figures
Abstract:We analyze, theoretically and empirically, the performance of generative diffusion models based on blind denoisers, in which the denoiser is not given the noise amplitude in either the training or sampling processes. Assuming that the data distribution has low intrinsic dimensionality, we prove that blind denoising diffusion models (BDDMs), despite not having access to the noise amplitude, automatically track a particular implicit noise schedule along the reverse process. Our analysis shows that BDDMs can accurately sample from the data distribution in polynomially many steps as a function of the intrinsic dimension. Empirical results corroborate these mathematical findings on both synthetic and image data, demonstrating that the noise variance is accurately estimated from the noisy image. Remarkably, we observe that schedule-free BDDMs produce samples of higher quality compared to their non-blind counterparts. We provide evidence that this performance gain arises because BDDMs correct the mismatch between the true residual noise (of the image) and the noise assumed by the schedule used in non-blind diffusion models.
[LG-22] LLM-FS: Zero-Shot Feature Selection for Effective and Interpretable Malware Detection
链接: https://arxiv.org/abs/2602.09634
作者: Naveen Gill,Ajvad Haneef K,Madhu Kumar S D
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Feature selection (FS) remains essential for building accurate and interpretable detection models, particularly in high-dimensional malware datasets. Conventional FS methods such as Extra Trees, Variance Threshold, Tree-based models, Chi-Squared tests, ANOVA, Random Selection, and Sequential Attention rely primarily on statistical heuristics or model-driven importance scores, often overlooking the semantic context of features. Motivated by recent progress in LLM-driven FS, we investigate whether large language models (LLMs) can guide feature selection in a zero-shot setting, using only feature names and task descriptions, as a viable alternative to traditional approaches. We evaluate multiple LLMs (e.g., GPT-5.0, GPT-4.0, Gemini-2.5) on the EMBOD dataset (a fusion of EMBER and BODMAS benchmark datasets), comparing them against established FS methods across several classifiers, including Random Forest, Extra Trees, MLP, and KNN. Performance is assessed using accuracy, precision, recall, F1, AUC, MCC, and runtime. Our results demonstrate that LLM-guided zero-shot feature selection achieves competitive performance with traditional FS methods while offering additional advantages in interpretability, stability, and reduced dependence on labeled data. These findings position zero-shot LLM-based FS as a promising alternative strategy for effective and interpretable malware detection, paving the way for knowledge-guided feature selection in security-critical applications.
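The zero-shot setting can be pictured with a small prompt-construction helper; `ask_llm` is a stand-in for whatever chat-completion client is used, and the prompt wording is ours, not the paper's.

```python
def build_fs_prompt(feature_names, task_description, k):
    """Prompt that asks an LLM to rank features using only their names
    and a task description -- no labelled data is shown to the model."""
    return (
        f"Task: {task_description}\n"
        f"Candidate features: {', '.join(feature_names)}\n"
        f"Select the {k} features most useful for this task. "
        f"Answer with a comma-separated list of feature names only."
    )

def zero_shot_select(feature_names, task_description, k, ask_llm):
    """ask_llm: callable str -> str (e.g., a wrapper around a chat API)."""
    reply = ask_llm(build_fs_prompt(feature_names, task_description, k))
    chosen = [f.strip() for f in reply.split(",")]
    return [f for f in chosen if f in feature_names][:k]

# toy usage with a dummy "LLM" that just returns the first two names
names = ["entropy_of_imports", "num_sections", "file_size", "api_call_count"]
dummy = lambda prompt: ", ".join(names[:2])
selected = zero_shot_select(names, "malware detection from PE metadata", 2, dummy)
```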
[LG-23] Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
链接: https://arxiv.org/abs/2602.09580
作者: Chenyu Yang,Denis Tarasov,Davide Liconti,Hehui Zheng,Robert K. Katzschmann
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework with normalizing flow (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy’s temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SOFT-FLOW on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp – both of which require precise, dexterous control over long horizons. On these tasks, SOFT-FLOW achieves stable, sample-efficient adaptation where standard methods struggle.
[LG-24] Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2602.09578
作者: Zhida Jiang,Zhaolong Xing,Jiawei Lu,Yipei Niu,Qingyuan Sang,Liangxu Zhang,Wenquan Dai,Junhua Shu,Jiaxing Wang,Qiangyu Pei,Qiong Chen,Xinyu Liu,Fangming Liu,Ai Han,Zhen Chen,Ke Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces the joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon the experience store, a novel micro-batch driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. Rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter/intra-agent request patterns. Training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified and location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
[LG-25] Training deep physical neural networks with local physical information bottleneck
链接: https://arxiv.org/abs/2602.09569
作者: Hao Wang,Ziao Wang,Xiangpeng Liang,Han Zhao,Jianqi Hu,Junjie Jiang,Xing Fu,Jianshi Tang,Huaqiang Wu,Sylvain Gigan,Qiang Liu
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 9 pages, 4 figures
Abstract:Deep learning has revolutionized modern society but faces growing energy and latency constraints. Deep physical neural networks (PNNs) are interconnected computing systems that directly exploit analog dynamics for energy-efficient, ultrafast AI execution. Realizing this potential, however, requires universal training methods tailored to physical intricacies. Here, we present the Physical Information Bottleneck (PIB), a general and efficient framework that integrates information theory and local learning, enabling deep PNNs to learn under arbitrary physical dynamics. By allocating matrix-based information bottlenecks to each unit, we demonstrate supervised, unsupervised, and reinforcement learning across electronic memristive chips and optical computing platforms. PIB also adapts to severe hardware faults and allows for parallel training via geographically distributed resources. Bypassing auxiliary digital models and contrastive measurements, PIB recasts PNN training as an intrinsic, scalable information-theoretic process compatible with diverse physical substrates.
[LG-26] Rashomon Sets and Model Multiplicity in Federated Learning
链接: https://arxiv.org/abs/2602.09520
作者: Xenia Heilmann,Luca Corbucci,Mattia Cerrato
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision boundary instabilities that standard metrics obscure. However, the existing definitions of the Rashomon set and multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server's coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. Specifically, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and (III) individual Rashomon sets specific to each client's local data. Then, we show how standard multiplicity metrics can be estimated under FL's privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.
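Assuming each client reports a loss for every candidate model, the three set definitions can be illustrated schematically as follows (the paper's formalization is richer and respects FL's privacy constraints):

```python
import numpy as np

def federated_rashomon_sets(losses, eps, t=0.8):
    """losses: array [n_clients, n_models] of per-client empirical losses.

    Returns (I) the global Rashomon set, defined on the client-averaged
    loss, (II) the t-agreement set, i.e. models lying in the local
    Rashomon set of at least a fraction t of clients, and (III) the
    individual Rashomon set of every client.
    """
    local_best = losses.min(axis=1, keepdims=True)
    local_sets = losses <= local_best + eps             # [clients, models] bool

    global_loss = losses.mean(axis=0)
    global_set = np.where(global_loss <= global_loss.min() + eps)[0]
    agreement_set = np.where(local_sets.mean(axis=0) >= t)[0]
    individual_sets = [np.where(row)[0] for row in local_sets]
    return global_set, agreement_set, individual_sets

# toy usage: 5 clients, 20 candidate models
losses = np.random.rand(5, 20)
g, a, ind = federated_rashomon_sets(losses, eps=0.05, t=0.6)
```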
[LG-27] Beyond Student: An Asymmetric Network for Neural Network Inheritance
链接: https://arxiv.org/abs/2602.09509
作者: Yiyun Zhou,Jingwei Shi,Mingjing Xu,Zhonghua Jiang,Jingyuan Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher’s structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher’s weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By leveraging Singular Value Decomposition (SVD) for initialization to ensure the inheritance of principal knowledge, InherNet effectively balances depth, width, and compression efficiency. Experimental results across unimodal and multimodal tasks demonstrate that InherNet achieves higher performance compared to student networks of similar parameter sizes. Our findings reveal a promising direction for future research in efficient model compression beyond traditional distillation.
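A minimal sketch of SVD-initialized low-rank factorization of one teacher layer; how InherNet reassembles the full network and balances depth and width is more involved, and the rank below is arbitrary.

```python
import numpy as np

def inherit_layer(w_teacher, rank):
    """Replace one teacher weight matrix [out, in] by two thin factors
    A [out, rank] and B [rank, in], initialised from the truncated SVD so
    that the teacher's principal directions are inherited."""
    u, s, vt = np.linalg.svd(w_teacher, full_matrices=False)
    a = u[:, :rank] * np.sqrt(s[:rank])          # absorb sqrt of spectrum in each factor
    b = np.sqrt(s[:rank])[:, None] * vt[:rank]
    return a, b

# toy usage: a 512x256 "teacher" layer compressed to rank 32
w = np.random.randn(512, 256)
a, b = inherit_layer(w, rank=32)
approx_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```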
[LG-28] Towards Uniformity and Alignment for Multimodal Representation Learning
链接: https://arxiv.org/abs/2602.09507
作者: Wenzhe Yin,Pan Zhou,Zehao Xiao,Jie Liu,Shujian Yu,Jan-Jakob Sonke,Efstratios Gavves
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
[LG-29] Improved Approximate Regret for Decentralized Online Continuous Submodular Maximization via Reductions
链接: https://arxiv.org/abs/2602.09502
作者: Yuanyu Wan,Yu Shen,Dingzhi Yu,Bo Xue,Mingli Song
类目: Machine Learning (cs.LG)
*备注:
Abstract:To expand the applicability of decentralized online learning, previous studies have proposed several algorithms for decentralized online continuous submodular maximization (D-OCSM) – a non-convex/non-concave setting with continuous DR-submodular reward functions. However, there exist large gaps between their approximate regret bounds and the regret bounds achieved in the convex setting. Moreover, if focusing on projection-free algorithms, which can efficiently handle complex decision sets, they cannot even recover the approximate regret bounds achieved in the centralized setting. In this paper, we first demonstrate that for D-OCSM over general convex decision sets, these two issues can be addressed simultaneously. Furthermore, for D-OCSM over downward-closed decision sets, we show that the second issue can be addressed while significantly alleviating the first issue. Our key techniques are two reductions from D-OCSM to decentralized online convex optimization (D-OCO), which can exploit D-OCO algorithms to improve the approximate regret of D-OCSM in these two cases, respectively.
[LG-30] Computationally Efficient Replicable Learning of Parities
链接: https://arxiv.org/abs/2602.09499
作者: Moshe Noivirt,Jessica Sorrell,Eliad Tsfadia
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:We study the computational relationship between replicability (Impagliazzo et al. [STOC 22], Ghazi et al. [NeurIPS 21]) and other stability notions. Specifically, we focus on replicable PAC learning and its connections to differential privacy (Dwork et al. [TCC 2006]) and to the statistical query (SQ) model (Kearns [JACM '98]). Statistically, it was known that differentially private learning and replicable learning are equivalent and strictly more powerful than SQ-learning. Yet, computationally, all previously known efficient (i.e., polynomial-time) replicable learning algorithms were confined to SQ-learnable tasks or restricted distributions, in contrast to differentially private learning. Our main contribution is the first computationally efficient replicable algorithm for realizable learning of parities over arbitrary distributions, a task that is known to be hard in the SQ-model, but possible under differential privacy. This result provides the first evidence that efficient replicable learning over general distributions strictly extends efficient SQ-learning, and is closer in power to efficient differentially private learning, despite computational separations between replicability and privacy. Our main building block is a new, efficient, and replicable algorithm that, given a set of vectors, outputs a subspace of their linear span that covers most of them.
[LG-31] Adaptive recurrent flow map operator learning for reaction diffusion dynamics
链接: https://arxiv.org/abs/2602.09487
作者: Huseyin Tunc
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Reaction-diffusion (RD) equations underpin pattern formation across chemistry, biology, and physics, yet learning stable operators that forecast their long-term dynamics from data remains challenging. Neural-operator surrogates provide resolution-robust prediction, but autoregressive rollouts can drift due to the accumulation of error, and out-of-distribution (OOD) initial conditions often degrade accuracy. Physics-based numerical residual objectives can regularize operator learning, although they introduce additional assumptions, sensitivity to discretization and loss design, and higher training cost. Here we develop a purely data-driven operator learner with adaptive recurrent training (DDOL-ART) using a robust recurrent strategy with lightweight validation milestones that early-exit unproductive rollout segments and redirect optimization. Trained only on a single in-distribution toroidal Gaussian family over short horizons, DDOL-ART learns one-step operators that remain stable under long rollouts and generalize zero-shot to strong morphology shifts across FitzHugh-Nagumo (FN), Gray-Scott (GS), and Lambda-Omega (LO) systems. Across these benchmarks, DDOL-ART delivers a strong accuracy and cost trade-off. It is several-fold faster than a physics-based numerical-loss operator learner (NLOL) under matched settings, and it remains competitive on both in-distribution stability and OOD robustness. Training-dynamics diagnostics show that adaptivity strengthens the correlation between validation error and OOD test error performance, acting as a feedback controller that limits optimization drift. Our results indicate that feedback-controlled recurrent training of DDOL-ART generates robust flow-map surrogates without PDE residuals, while simultaneously maintaining competitiveness with NLOL at significantly reduced training costs.
[LG-32] Online Learning in MDPs with Partially Adversarial Transitions and Losses
链接: https://arxiv.org/abs/2602.09474
作者: Ofir Schlisselberg,Tal Lancewicki,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of \Lambda steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce conditioned occupancy measures, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret \tilde{O}(H S^\Lambda \sqrt{K S A^{\Lambda+1}}), where K is the number of episodes, S is the number of states, A is the number of actions, and H is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on S to \tilde{O}(H \sqrt{K S^3 A^{\Lambda+1}}). We further give a K^{2/3}-regret reduction that removes the need to know which steps are the \Lambda adversarial steps. We also characterize the regret of adversarial MDPs in the fully adversarial setting (\Lambda = H-1) both for full-information and bandit feedback, and provide almost matching upper and lower bounds (slightly strengthening existing lower bounds and clarifying how different feedback structures affect the hardness of learning).
[LG-33] Scalable and Reliable State-Aware Inference of High-Impact N-k Contingencies
链接: https://arxiv.org/abs/2602.09461
作者: Lihao Mai,Chenhan Xiao,Yang Weng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Increasing penetration of inverter-based resources, flexible loads, and rapidly changing operating conditions make higher-order N-k contingency assessment increasingly important but computationally prohibitive. Exhaustive evaluation of all outage combinations using AC power-flow or ACOPF is infeasible in routine operation. This fact forces operators to rely on heuristic screening methods whose ability to consistently retain all critical contingencies is not formally established. This paper proposes a scalable, state-aware contingency inference framework designed to directly generate high-impact N-k outage scenarios without enumerating the combinatorial contingency space. The framework employs a conditional diffusion model to produce candidate contingencies tailored to the current operating state, while a topology-aware graph neural network trained only on base and N-1 cases efficiently constructs high-risk training samples offline. Finally, the framework is developed to provide controllable coverage guarantees for severe contingencies, allowing operators to explicitly manage the risk of missing critical events under limited AC power-flow evaluation budgets. Experiments on IEEE benchmark systems show that, for a given evaluation budget, the proposed approach consistently evaluates higher-severity contingencies than uniform sampling. This allows critical outages to be identified more reliably with reduced computational effort.
[LG-34] Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits
链接: https://arxiv.org/abs/2602.09456
作者: Hao Qin,Chicheng Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 40 pages (13 pages main body, 24 pages supplementary materials)
Abstract:We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with O(log(T)) calls to an offline regression oracle over T rounds, and makes O(log log(T)) calls when T is known. The design of the OE2D algorithm generalizes Falcon [simchi2022bypassing] and its linear reward version [xu2020upper, Section 4] in that it chooses an action distribution, which we term the "exploitative F-design", that simultaneously guarantees low regret and good coverage, trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded per-context Eluder dimension and smoothed regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC) [foster2021statistical], bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
[LG-35] Enhancing Affine Maximizer Auctions with Correlation-Aware Payment
链接: https://arxiv.org/abs/2602.09455
作者: Haoran Sun,Xuanzhi Xia,Xu Chu,Xiaotie Deng
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 22 pages. Work in progress
Abstract:Affine Maximizer Auctions (AMAs), a family of mechanisms generalizing VCG, are widely used in automated mechanism design due to their inherent dominant-strategy incentive compatibility (DSIC) and individual rationality (IR). However, as the payment form is fixed, AMA's expressiveness is restricted, especially in distributions where bidders' valuations are correlated. In this paper, we propose Correlation-Aware AMA (CA-AMA), a novel framework that augments AMA with a new correlation-aware payment. We show that any CA-AMA preserves the DSIC property and formalize finding the optimal CA-AMA as a constrained optimization problem subject to the IR constraint. Then, we theoretically characterize scenarios where classic AMAs can perform arbitrarily poorly compared to the optimal revenue, while CA-AMA can reach the optimal revenue. For optimizing CA-AMA, we design a practical two-stage training algorithm. We establish the continuity of the target function and a generalization bound on the degree of deviation from strict IR. Finally, extensive experiments showcase that our algorithm can find an approximately optimal CA-AMA in various distributions with improved revenue and a low degree of IR violation.
[LG-36] Reward-Guided Discrete Diffusion via Clean-Sample Markov Chain for Molecule and Biological Sequence Design
链接: https://arxiv.org/abs/2602.09424
作者: Prin Phunyaphibarn,Minhyuk Sung
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Discrete diffusion models have recently emerged as a powerful class of generative models for chemistry and biology data. In these fields, the goal is to generate various samples with high rewards (e.g., drug-likeness in molecules), making reward-based guidance crucial. Most existing methods are based on guiding the diffusion model using intermediate rewards but tend to underperform since intermediate rewards are noisy due to the non-smooth nature of reward functions used in scientific domains. To address this, we propose Clean-Sample Markov Chain (CSMC) Sampler, a method that performs effective test-time reward-guided sampling for discrete diffusion models, enabling local search without relying on intermediate rewards. CSMC constructs a Markov chain of clean samples using the Metropolis-Hastings algorithm such that its stationary distribution is the target distribution. We design a proposal distribution by sequentially applying the forward and backward diffusion processes, making the acceptance probability tractable. Experiments on molecule and biological sequence generation with various reward functions demonstrate that our method consistently outperforms prior approaches that rely on intermediate rewards.
[LG-37] Learning with Multiple Correct Answers – A Trichotomy of Regret Bounds under Different Feedback Models
链接: https://arxiv.org/abs/2602.09402
作者: Alireza F. Pour,Farnam Mansouri,Shai Ben-David
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study an online learning problem with multiple correct answers, where each instance admits a set of valid labels, and in each round the learner must output a valid label for the queried example. This setting is motivated by language generation tasks, in which a prompt may admit many acceptable completions, but not every completion is acceptable. We study this problem under three feedback models. For each model, we characterize the optimal mistake bound in the realizable setting using an appropriate combinatorial dimension. We then establish a trichotomy of regret bounds across the three models in the agnostic setting. Our results also imply sample complexity bounds for the batch setup that depend on the respective combinatorial dimensions.
[LG-38] Sparse Layer Sharpness-Aware Minimization for Efficient Fine-Tuning
链接: https://arxiv.org/abs/2602.09395
作者: Yifei Cheng,Xianglin Yang,Guoxia Wang,Chao Huang,Fei Ma,Dianhai Yu,Xiaochun Cao,Li Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sharpness-aware minimization (SAM) seeks minima with a flat loss landscape to improve generalization performance in machine learning tasks, including fine-tuning. However, its extra parameter perturbation step doubles the computation cost, which becomes the bottleneck of SAM in practical implementations. In this work, we propose SL-SAM, an approach that breaks this bottleneck by introducing a sparse technique at the layer level. Our key innovation is to frame the dynamic selection of layers for both the gradient ascent (perturbation) and descent (update) steps as a multi-armed bandit problem. At the beginning of each iteration, SL-SAM samples a subset of the model's layers according to the gradient norm to participate in the backpropagation of the following parameter perturbation and update steps, thereby reducing the computation complexity. We then provide an analysis that guarantees the convergence of SL-SAM. In experiments fine-tuning models on several tasks, SL-SAM achieves performance comparable to state-of-the-art baselines, including a #1 rank on LLM fine-tuning. Meanwhile, SL-SAM significantly reduces the ratio of active parameters in backpropagation compared to vanilla SAM (SL-SAM activates 47%, 22%, and 21% of the parameters on the vision, moderate, and large language models, respectively, while vanilla SAM always activates 100%), verifying the efficiency of our proposed algorithm.
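A simplified sketch of the idea: sample a subset of layers with probability proportional to their gradient norms, then run the SAM perturb-and-update steps only on those layers. The sampling rule, the `loss_fn` interface, and all hyperparameters are placeholders, and real SL-SAM saves compute by restricting backpropagation itself.

```python
import torch

def select_layers_by_grad_norm(model, frac=0.5):
    """Sample a fraction of parameter tensors, weighted by gradient norm.
    Assumes .grad has been populated by an initial backward pass."""
    params = [p for p in model.parameters() if p.grad is not None]
    norms = torch.stack([p.grad.norm() for p in params])
    probs = norms / norms.sum()
    k = max(1, int(frac * len(params)))
    idx = torch.multinomial(probs, k, replacement=False)
    return [params[i] for i in idx]

def sparse_sam_step(model, loss_fn, rho=0.05, lr=0.01, frac=0.5):
    """One SAM-like step: perturb only the sampled layers, recompute the
    gradient at the perturbed point, restore, and update those layers.

    loss_fn: callable taking the model and returning a scalar loss.
    """
    loss = loss_fn(model)
    model.zero_grad()
    loss.backward()
    active = select_layers_by_grad_norm(model, frac)
    eps = {}
    with torch.no_grad():                     # ascent (perturbation) step
        for p in active:
            e = rho * p.grad / (p.grad.norm() + 1e-12)
            p.add_(e)
            eps[id(p)] = e
    loss = loss_fn(model)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():                     # restore weights, then descend
        for p in active:
            p.sub_(eps[id(p)])
            p.sub_(lr * p.grad)
    return loss.item()
```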
[LG-39] Latent Poincaré Shaping for Agentic Reinforcement Learning
链接: https://arxiv.org/abs/2602.09375
作者: Hanchen Xia,Baoyou Chen,Zelin Zang,Yutang Ge,Guojiang Zhao,Siyu Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose LaPha, a method for training AlphaZero-like LLM agents in a Poincaré latent space. Under LaPha, the search process can be visualized as a tree rooted at the prompt and growing outward from the origin toward the boundary of the Poincaré ball, where negative curvature provides exponentially increasing capacity with radius. Using hyperbolic geodesic distance to rule-verified correctness, we define a node potential and assign dense process rewards by potential differences. We further attach a lightweight value head on the same shared latent space, enabling self-guided test-time scaling with almost no additional overhead. On MATH-500, LaPha improves Qwen2.5-Math-1.5B from 66.0% to 88.2%. With value-head-guided search, LaPha-1.5B reaches 56.7% accuracy on AIME’24, and LaPha-7B further achieves 60.0% on AIME’24 and 53.3% on AIME’25.
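The hyperbolic ingredients can be sketched as below: the Poincaré geodesic distance, a node potential derived from the distance to a verified-correct embedding, and a dense process reward given by potential differences. The distance formula is standard; the potential and reward shaping here are illustrative, not LaPha's exact definitions.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    nu, nv = np.sum(u * u), np.sum(v * v)
    diff = np.sum((u - v) ** 2)
    x = 1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv) + eps)
    return float(np.arccosh(x))

def potential(node_emb, goal_emb):
    """Node potential: closer (in hyperbolic distance) to a verified-correct
    embedding means higher potential (illustrative shaping function)."""
    return -poincare_distance(node_emb, goal_emb)

def dense_reward(parent_emb, child_emb, goal_emb):
    """Process reward for expanding parent -> child as a potential difference."""
    return potential(child_emb, goal_emb) - potential(parent_emb, goal_emb)

# toy usage: embeddings must lie strictly inside the unit ball
root = np.zeros(8)
child = 0.2 * np.ones(8) / np.sqrt(8)
goal = 0.5 * np.ones(8) / np.sqrt(8)
r = dense_reward(root, child, goal)
```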
[LG-40] Large Language Models for Designing Participatory Budgeting Rules AAMAS2026
链接: https://arxiv.org/abs/2602.09349
作者: Nguyen Thach,Xingchen Sha,Hau Chan
类目: Machine Learning (cs.LG)
*备注: Accepted as full paper to AAMAS 2026
Abstract:Participatory budgeting (PB) is a democratic paradigm for deciding the funding of public projects given the residents’ preferences, which has been adopted in numerous cities across the world. The main focus of PB is designing rules, functions that return feasible budget allocations for a set of projects subject to some budget constraint. Designing PB rules that optimize both utility and fairness objectives based on agent preferences had been challenging due to the extensive domain knowledge required and the proven trade-off between the two notions. Recently, large language models (LLMs) have been increasingly employed for automated algorithmic design. Given the resemblance of PB rules to algorithms for classical knapsack problems, in this paper, we introduce a novel framework, named LLMRule, that addresses the limitations of existing works by incorporating LLMs into an evolutionary search procedure for automating the design of PB rules. Our experimental results, evaluated on more than 600 real-world PB instances obtained from the U.S., Canada, Poland, and the Netherlands with different representations of agent preferences, demonstrate that the LLM-generated rules generally outperform existing handcrafted rules in terms of overall utility while still maintaining a similar degree of fairness.
[LG-41] MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
链接: https://arxiv.org/abs/2602.09329
作者: Xueying Ding,Simon Klüttermann,Haomin Wen,Yilong Chen,Leman Akoglu
类目: Machine Learning (cs.LG)
*备注: 28 pages
Abstract:Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark AdBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvrBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: We provide standardized train/test splits for all datasets, public/private benchmark partitions with held-out test labels for the latter reserved toward an online leaderboard, and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, as well as individual performances as references for future research. All benchmarks containing 2,446 datasets combined are open-sourced, along with a publicly accessible leaderboard hosted at this https URL.
[LG-42] In-Hospital Stroke Prediction from PPG-Derived Hemodynamic Features KDD KDD’26
链接: https://arxiv.org/abs/2602.09328
作者: Jiaming Liu,Cheng Ding,Daoqiang Zhang
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 11 pages, 6 figures, 3 tables. To appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26)
Abstract:The absence of pre-hospital physiological data in standard clinical datasets fundamentally constrains the early prediction of stroke, as patients typically present only after stroke has occurred, leaving the predictive value of continuous monitoring signals such as photoplethysmography (PPG) unvalidated. In this work, we overcome this limitation by focusing on a rare but clinically critical cohort - patients who suffered stroke during hospitalization while already under continuous monitoring - thereby enabling the first large-scale analysis of pre-stroke PPG waveforms aligned to verified onset times. Using MIMIC-III and MC-MED, we develop an LLM-assisted data mining pipeline to extract precise in-hospital stroke onset timestamps from unstructured clinical notes, followed by physician validation, identifying 176 patients (MIMIC) and 158 patients (MC-MED) with high-quality synchronized pre-onset PPG data, respectively. We then extract hemodynamic features from PPG and employ a ResNet-1D model to predict impending stroke across multiple early-warning horizons. The model achieves F1-scores of 0.7956, 0.8759, and 0.9406 at 4, 5, and 6 hours prior to onset on MIMIC-III, and, without re-tuning, reaches 0.9256, 0.9595, and 0.9888 on MC-MED for the same horizons. These results provide the first empirical evidence from real-world clinical data that PPG contains predictive signatures of stroke several hours before onset, demonstrating that passively acquired physiological signals can support reliable early warning, supporting a shift from post-event stroke recognition to proactive, physiology-based surveillance that may materially improve patient outcomes in routine clinical care.
[LG-43] Priority-Aware Shapley Value
链接: https://arxiv.org/abs/2602.09326
作者: Kiljae Lee,Ziqi Liu,Weijing Tang,Yuan Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Shapley values are widely used for model-agnostic data valuation and feature attribution, yet they implicitly assume contributors are interchangeable. This can be problematic when contributors are dependent (e.g., reused/augmented data or causal feature orderings) or when contributions should be adjusted by factors such as trust or risk. We propose Priority-Aware Shapley Value (PASV), which incorporates both hard precedence constraints and soft, contributor-specific priority weights. PASV is applicable to general precedence structures, recovers precedence-only and weight-only Shapley variants as special cases, and is uniquely characterized by natural axioms. We develop an efficient adjacent-swap Metropolis-Hastings sampler for scalable Monte Carlo estimation and analyze limiting regimes induced by extreme priority weights. Experiments on data valuation (MNIST/CIFAR10) and feature attribution (Census Income) demonstrate more structure-faithful allocations and a practical sensitivity analysis via our proposed “priority sweeping”.
[LG-44] Effective MoE-based LLM Compression by Exploiting Heterogeneous Inter-Group Experts Routing Frequency and Information Density
链接: https://arxiv.org/abs/2602.09316
作者: Zhendong Mi,Yixiao Chen,Pu Zhao,Xiaodong Yu,Hao Wang,Yanzhi Wang,Shaoyi Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mixture-of-Experts (MoE) based Large Language Models (LLMs) have achieved superior performance, yet the massive memory overhead caused by storing multiple expert networks severely hinders their practical deployment. Singular Value Decomposition (SVD)-based compression has emerged as a promising post-training technique; however, most existing methods apply uniform rank allocation or rely solely on static weight properties. This overlooks the substantial heterogeneity in expert utilization observed in MoE models, where frequent routing patterns and intrinsic information density vary significantly across experts. In this work, we propose RFID-MoE, an effective framework for MoE compression by exploiting heterogeneous Routing Frequency and Information Density. We first introduce a fused metric that combines expert activation frequency with effective rank to measure expert importance, adaptively allocating higher ranks to critical expert groups under a fixed budget. Moreover, instead of discarding compression residuals, we reconstruct them via a parameter-efficient sparse projection mechanism to recover lost information with minimal parameter overhead. Extensive experiments on representative MoE LLMs (e.g., Qwen3, DeepSeekMoE) across multiple compression ratios demonstrate that RFID-MoE consistently outperforms state-of-the-art methods like MoBE and D2-MoE. Notably, RFID-MoE achieves a perplexity of 16.92 on PTB with the Qwen3-30B model at a 60% compression ratio, reducing perplexity by over 8.0 compared to baselines, and improves zero-shot accuracy on HellaSwag by approximately 8%.
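A toy sketch of budgeted rank allocation from a fused score of routing frequency and effective rank; the scoring, rounding, and budget handling are our own simplifications rather than RFID-MoE's exact procedure.

```python
import numpy as np

def effective_rank(w):
    """Entropy-based effective rank of a weight matrix's singular spectrum."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s / (s.sum() + 1e-12)
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

def allocate_ranks(expert_weights, routing_freq, total_rank, min_rank=1):
    """Give each expert an SVD rank proportional to a fused importance score
    (routing frequency x effective rank) under a total rank budget.
    Rounding may make the ranks deviate slightly from the exact budget."""
    scores = np.array([f * effective_rank(w)
                       for f, w in zip(routing_freq, expert_weights)])
    share = scores / scores.sum()
    ranks = np.maximum(min_rank, np.round(share * total_rank).astype(int))
    return ranks

# toy usage: 8 experts and random routing frequencies
experts = [np.random.randn(256, 256) for _ in range(8)]
freq = np.random.dirichlet(np.ones(8))
ranks = allocate_ranks(experts, freq, total_rank=256)
```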
[LG-45] Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design Challenges and Evaluation
链接: https://arxiv.org/abs/2602.09305
作者: Pei-Chi Pan,Yingbin Liang,Sen Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)-based fine-tuning is a key mechanism for improvement, but its effectiveness is fundamentally governed by reward design. Despite its importance, the relationship between reward modeling and core LLM challenges–such as evaluation bias, hallucination, distribution shift, and efficient learning–remains poorly understood. This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment, shaping what models learn, how they generalize, and whether their outputs can be trusted. We introduce Reasoning-Aligned Reinforcement Learning (RARL), a unifying framework that systematizes diverse reward paradigms for multi-step reasoning. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges ranging from inference-time scaling to hallucination mitigation. We further critically evaluate existing benchmarks, highlighting vulnerabilities such as data contamination and reward misalignment, and outline directions for more robust evaluation. By integrating fragmented research threads and clarifying the interplay between reward design and fundamental reasoning capabilities, this work provides a foundational roadmap for building reasoning models that are robust, verifiable, and trustworthy.
[LG-46] Statistical Roughness-Informed Machine Unlearning
链接: https://arxiv.org/abs/2602.09304
作者: Mohammad Partohaghighi,Roummel Marcia,Bruce J. West,YangQuan Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine unlearning aims to remove the influence of a designated forget set from a trained model while preserving utility on the retained data. In modern deep networks, approximate unlearning frequently fails under large or adversarial deletions due to pronounced layer-wise heterogeneity: some layers exhibit stable, well-regularized representations while others are brittle, undertrained, or overfit, so naive update allocation can trigger catastrophic forgetting or unstable dynamics. We propose Statistical-Roughness Adaptive Gradient Unlearning (SRAGU), a mechanism-first unlearning algorithm that reallocates unlearning updates using layer-wise statistical roughness operationalized via heavy-tailed spectral diagnostics of layer weight matrices. Starting from an Adaptive Gradient Unlearning (AGU) sensitivity signal computed on the forget set, SRAGU estimates a WeightWatcher-style heavy-tailed exponent for each layer, maps it to a bounded spectral stability weight, and uses this stability signal to spectrally reweight the AGU sensitivities before applying the same minibatch update form. This concentrates unlearning motion in spectrally stable layers while damping updates in unstable or overfit layers, improving stability under hard deletions. We evaluate unlearning via behavioral alignment to a gold retrained reference model trained from scratch on the retained data, using empirical prediction-divergence and KL-to-gold proxies on a forget-focused query set; we additionally report membership inference auditing as a complementary leakage signal, treating forget-set points as should-be-forgotten members during evaluation.
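A sketch of turning a heavy-tailed spectral exponent into a bounded layer-wise stability weight that rescales unlearning sensitivities. The exponent estimator below is a crude Hill estimator on the eigenvalue spectrum, used only as a stand-in for WeightWatcher-style fitting, and the mapping to [0, 1] is purely illustrative.

```python
import numpy as np

def hill_alpha(w, k_frac=0.5):
    """Rough heavy-tailed exponent of the ESD of W^T W via a Hill estimator."""
    eig = np.linalg.svd(w, compute_uv=False) ** 2
    eig = np.sort(eig)[::-1]
    k = max(2, int(k_frac * len(eig)))
    tail = eig[:k]
    return 1.0 + k / np.sum(np.log(tail / tail[-1] + 1e-12))

def stability_weight(alpha, lo=2.0, hi=6.0):
    """Squash the exponent into [0, 1] with a clipped linear map (illustrative)."""
    return float(np.clip((alpha - lo) / (hi - lo), 0.0, 1.0))

def reweight_sensitivities(layer_weights, sensitivities):
    """Scale each layer's unlearning sensitivity by its stability weight."""
    return [stability_weight(hill_alpha(w)) * s
            for w, s in zip(layer_weights, sensitivities)]

# toy usage: 4 random layers and dummy per-layer sensitivities
layers = [np.random.randn(128, 128) for _ in range(4)]
sens = [1.0, 0.5, 2.0, 0.8]
scaled = reweight_sensitivities(layers, sens)
```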
[LG-47] Stabilizing Physics-Informed Consistency Models via Structure-Preserving Training
链接: https://arxiv.org/abs/2602.09303
作者: Che-Chia Chang,Chen-Yang Dai,Te-Sheng Lin,Ming-Chih Lai,Chieh-Hsin Lai
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We propose a physics-informed consistency modeling framework for solving partial differential equations (PDEs) via fast, few-step generative inference. We identify a key stability challenge in physics-constrained consistency training, where PDE residuals can drive the model toward trivial or degenerate solutions, degrading the learned data distribution. To address this, we introduce a structure-preserving two-stage training strategy that decouples distribution learning from physics enforcement by freezing the coefficient decoder during physics-informed fine-tuning. We further propose a two-step residual objective that enforces physical consistency on refined, structurally valid generative trajectories rather than noisy single-step predictions. The resulting framework enables stable, high-fidelity inference for both unconditional generation and forward problems. We demonstrate that forward solutions can be obtained via a projection-based zero-shot inpainting procedure, achieving accuracy consistent with diffusion baselines at orders-of-magnitude lower computational cost.
[LG-48] Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk
链接: https://arxiv.org/abs/2602.09300
作者: Sumedh Gupte,Shrey Rakeshkumar Patel,Soumen Pachal,Prashanth L. A.,Sanjay P. Bhat
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures, namely expectiles, utility-based shortfall risk and optimized certainty equivalent risk. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. Second, we propose estimators of the risk-sensitive policy gradient for each of the aforementioned risk measures, and establish \mathcal{O}(1/m) mean-squared error bounds for our estimators, where m is the number of trajectories. Further, under standard assumptions for policy gradient-type algorithms, we establish smoothness of the risk-sensitive objective, in turn leading to stationary convergence rate bounds for the overall risk-sensitive policy gradient algorithm that we propose. Finally, we conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
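As a quick illustration of the least familiar of the three risk measures above, the sketch below estimates the tau-expectile of sampled returns by iterating the asymmetric-least-squares first-order condition. This is a minimal sketch of the risk functional only, not the authors' policy-gradient estimator; the function name and the toy return distribution are illustrative choices.

```python
import numpy as np

def expectile(returns, tau=0.9, n_iter=200):
    """Estimate the tau-expectile of a sample by asymmetric least squares.

    The tau-expectile m minimizes E[w_tau(R - m) * (R - m)^2], where
    w_tau(u) = tau if u > 0 else (1 - tau). For tau = 0.5 it reduces to the mean.
    """
    m = float(np.mean(returns))
    for _ in range(n_iter):
        diff = returns - m
        w = np.where(diff > 0, tau, 1.0 - tau)
        # First-order condition: the weighted mean of residuals equals zero.
        m = float(np.sum(w * returns) / np.sum(w))
    return m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns = rng.normal(loc=1.0, scale=2.0, size=10_000)  # e.g. Monte Carlo returns
    print("mean          :", returns.mean())
    print("0.9-expectile :", expectile(returns, tau=0.9))
```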
[LG-49] The Laplacian Mechanism Improves Transformers by Reshaping Token Geometry
链接: https://arxiv.org/abs/2602.09297
作者: Yuchong Zhang,Vardan Papyan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers leverage attention, the residual connection, and layer normalization to control the variance of token representations. We propose to modify attention into a Laplacian mechanism that gives the model more direct control over token variance. We conjecture that this helps transformers achieve the ideal token geometry. To investigate our conjecture, we first show that incorporating the Laplacian mechanism into transformers induces consistent improvements across benchmarks in computer vision and language. Next, we study how the Laplacian mechanism impacts the geometry of token representations using various tools: 1) principal component analysis, 2) cosine similarity metric, 3) analysis of variance, and 4) Neural Collapse metrics. Our investigation shows that the Laplacian mechanism reshapes token embeddings toward a geometry of maximal separability: tokens collapse according to their classes, and the class means exhibit Neural Collapse.
[LG-50] Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation
链接: https://arxiv.org/abs/2602.09295
作者: Bret Nestor,Bohan Yao,Jasmine Moore,Jasper Kanes
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注:
Abstract:This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. We yield 919 hours of SRKW data, 230 hours of Bigg’s orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
[LG-51] Measuring Privacy Risks and Tradeoffs in Financial Synthetic Data Generation
链接: https://arxiv.org/abs/2602.09288
作者: Michael Zuo,Inwon Kang,Stacy Patterson,Oshani Seneviratne
类目: Machine Learning (cs.LG)
*备注:
Abstract:We explore the privacy-utility tradeoff of synthetic data generation schemes on tabular financial datasets, a domain characterized by high regulatory risk and severe class imbalance. We consider representative tabular data generators, including autoencoders, generative adversarial networks, diffusion, and copula synthesizers. To address the challenges of the financial domain, we provide novel privacy-preserving implementations of GAN and autoencoder synthesizers. We evaluate whether and how well the generators simultaneously achieve data quality, downstream utility, and privacy, with comparison across balanced and imbalanced input datasets. Our results offer insight into the distinct challenges of generating synthetic data from datasets that exhibit severe class imbalance and mixed-type attributes.
[LG-52] The effect of whitening on explanation performance NEURIPS2024
链接: https://arxiv.org/abs/2602.09278
作者: Benedict Clark,Stoyan Karastoyanov,Rick Wilming,Stefan Haufe
类目: Machine Learning (cs.LG)
*备注: Presented at the NeurIPS 2024 workshop on Interpretable AI: Past, Present and Future
Abstract:Explainable Artificial Intelligence (XAI) aims to provide transparent insights into machine learning models, yet the reliability of many feature attribution methods remains a critical challenge. Prior research (Haufe et al., 2014; Wilming et al., 2022, 2023) has demonstrated that these methods often erroneously assign significant importance to non-informative variables, such as suppressor variables, leading to fundamental misinterpretations. Since statistical suppression is induced by feature dependencies, this study investigates whether data whitening, a common preprocessing technique for decorrelation, can mitigate such errors. Using the established XAI-TRIS benchmark (Clark et al., 2024b), which offers synthetic ground-truth data and quantitative measures of explanation correctness, we empirically evaluate 16 popular feature attribution methods applied in combination with 5 distinct whitening transforms. Additionally, we analyze a minimal linear two-dimensional classification problem (Wilming et al., 2023) to theoretically assess whether whitening can remove the impact of suppressor features from Bayes-optimal models. Our results indicate that, while specific whitening techniques can improve explanation performance, the degree of improvement varies substantially across XAI methods and model architectures. These findings highlight the complex relationship between data non-linearities, preprocessing quality, and attribution fidelity, underscoring the vital role of pre-processing techniques in enhancing model interpretability.
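For readers unfamiliar with the preprocessing step being evaluated, here is a minimal NumPy sketch of two standard whitening transforms, PCA and ZCA. The paper benchmarks five distinct transforms; these two are generic examples and not necessarily the exact implementations used in the study.

```python
import numpy as np

def whiten(X, method="zca", eps=1e-8):
    """Decorrelate the features of X (n_samples x n_features) so the
    whitened data has (approximately) identity covariance.

    method="pca": project onto eigenvectors and rescale (rotates the data).
    method="zca": additionally rotate back, staying close to the original axes.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    scale = np.diag(1.0 / np.sqrt(eigvals + eps))
    W_pca = scale @ eigvecs.T
    W = W_pca if method == "pca" else eigvecs @ W_pca  # ZCA = U Lambda^{-1/2} U^T
    return Xc @ W.T, W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Correlated 2D data, mimicking a suppressor-like second feature.
    X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
    Xw, _ = whiten(X, method="zca")
    print(np.round(np.cov(Xw, rowvar=False), 3))  # ~ identity
```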
[LG-53] Generalizing GNNs with Tokenized Mixture of Experts
链接: https://arxiv.org/abs/2602.09258
作者: Xiaoguang Guo,Zehong Wang,Jiazheng Li,Shawn Spitzel,Qi Yang,Kaize Ding,Jundong Li,Chuxu Zhang
类目: Machine Learning (cs.LG)
*备注: Graph Neural Networks, Generalization, Mixture of Experts
Abstract:Deployed graph neural networks (GNNs) are frozen at deployment yet must fit clean data, generalize under distribution shifts, and remain stable to perturbations. We show that static inference induces a fundamental tradeoff: improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. Instance-conditional routing can break this ceiling, but is fragile because shifts can mislead routing and perturbations can make routing fluctuate. We capture these effects via two decompositions separating coverage vs selection, and base sensitivity vs fluctuation amplification. Based on these insights, we propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths, a vector-quantized token interface to stabilize encoder-to-head signals, and a Lipschitz-regularized head to bound output amplification. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
[LG-54] Feature salience – not task-informativeness – drives machine learning model explanations
链接: https://arxiv.org/abs/2602.09238
作者: Benedict Clark,Marta Oliveira,Rick Wilming,Stefan Haufe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Explainable AI (XAI) promises to provide insight into machine learning models’ decision processes, where one goal is to identify failures such as shortcut learning. This promise relies on the field’s assumption that input features marked as important by an XAI must contain information about the target variable. However, it is unclear whether informativeness is indeed the main driver of importance attribution in practice, or if other data properties such as statistical suppression, novelty at test-time, or high feature salience substantially contribute. To clarify this, we trained deep learning models on three variants of a binary image classification task, in which translucent watermarks are either absent, act as class-dependent confounds, or represent class-independent noise. Results for five popular attribution methods show substantially elevated relative importance in watermarked areas (RIW) for all models regardless of the training setting ( R^2 \geq .45 ). By contrast, whether the presence of watermarks is class-dependent or not only has a marginal effect on RIW ( R^2 \leq .03 ), despite a clear impact on model performance and generalisation ability. XAI methods show similar behaviour to model-agnostic edge detection filters and attribute substantially less importance to watermarks when bright image intensities are encoded by smaller instead of larger feature values. These results indicate that importance attribution is most strongly driven by the salience of image structures at test time rather than statistical associations learned by machine learning models. Previous studies demonstrating successful XAI application should be reevaluated with respect to a possibly spurious concurrency of feature salience and informativeness, and workflows using feature attribution methods as building blocks should be scrutinised.
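The abstract does not spell out how the RIW metric is computed; the sketch below shows one plausible operationalization, namely the share of absolute attribution mass falling inside the watermark mask, normalized by the mask's area share. The function name and toy data are hypothetical and may differ from the paper's exact definition.

```python
import numpy as np

def relative_importance_in_mask(attribution, mask):
    """Share of absolute attribution mass inside `mask`, normalized by the
    mask's area share. Values > 1 indicate the region is over-attributed
    relative to its size.

    attribution: 2D array of feature attributions (one image).
    mask: boolean 2D array marking the watermarked pixels.
    """
    a = np.abs(attribution)
    mass_in_mask = a[mask].sum() / (a.sum() + 1e-12)
    area_share = mask.mean()
    return mass_in_mask / (area_share + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attr = rng.random((64, 64))
    mask = np.zeros((64, 64), dtype=bool)
    mask[:8, :8] = True                # a small "watermark" corner
    attr[mask] += 5.0                  # attribution concentrated on the watermark
    print(round(relative_importance_in_mask(attr, mask), 2))
```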
[LG-55] RAPID: Risk of Attribute Prediction-Induced Disclosure in Synthetic Microdata
链接: https://arxiv.org/abs/2602.09235
作者: Matthias Templ,Oscar Thees,Roman Müller
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 29 pages, 5 figures
Abstract:Statistical data anonymization increasingly relies on fully synthetic microdata, for which classical identity disclosure measures are less informative than an adversary’s ability to infer sensitive attributes from released data. We introduce RAPID (Risk of Attribute Prediction–Induced Disclosure), a disclosure risk measure that directly quantifies inferential vulnerability under a realistic attack model. An adversary trains a predictive model solely on the released synthetic data and applies it to real individuals’ quasi-identifiers. For continuous sensitive attributes, RAPID reports the proportion of records whose predicted values fall within a specified relative error tolerance. For categorical attributes, we propose a baseline-normalized confidence score that measures how much more confident the attacker is about the true class than would be expected from class prevalence alone, and we summarize risk as the fraction of records exceeding a policy-defined threshold. This construction yields an interpretable, bounded risk metric that is robust to class imbalance, independent of any specific synthesizer, and applicable with arbitrary learning algorithms. We illustrate threshold calibration, uncertainty quantification, and comparative evaluation of synthetic data generators using simulations and real data. Our results show that RAPID provides a practical, attacker-realistic upper bound on attribute-inference disclosure risk that complements existing utility diagnostics and disclosure control frameworks.
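A minimal sketch of the continuous-attribute attack model described above, assuming scikit-learn and simulated data: the adversary trains only on the released synthetic table, predicts the sensitive value of real individuals from their quasi-identifiers, and the reported risk is the share of records within a relative error tolerance. The learner, tolerance, and data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rapid_continuous(synth_X, synth_y, real_X, real_y, rel_tol=0.1):
    """RAPID-style attack for a continuous sensitive attribute:
    train an attacker only on synthetic data, predict real individuals'
    sensitive values from quasi-identifiers, and report the share of
    real records predicted within `rel_tol` relative error."""
    attacker = RandomForestRegressor(n_estimators=200, random_state=0)
    attacker.fit(synth_X, synth_y)
    pred = attacker.predict(real_X)
    rel_err = np.abs(pred - real_y) / (np.abs(real_y) + 1e-12)
    return float(np.mean(rel_err <= rel_tol))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins: quasi-identifiers X, a sensitive income-like attribute y.
    real_X = rng.normal(size=(2000, 5))
    real_y = 50_000 + 10_000 * real_X[:, 0] + 2_000 * rng.normal(size=2000)
    # "Synthetic" release: here just a noisier copy, purely for illustration.
    synth_X = real_X + 0.3 * rng.normal(size=real_X.shape)
    synth_y = real_y + 5_000 * rng.normal(size=real_y.shape)
    print("RAPID risk @10% tolerance:", rapid_continuous(synth_X, synth_y, real_X, real_y))
```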
[LG-56] Barycentric alignment for instance-level comparison of neural representations
链接: https://arxiv.org/abs/2602.09225
作者: Shreya Saha,Zoe Wanying He,Meenakshi Khosla
类目: Machine Learning (cs.LG)
*备注:
Abstract:Comparing representations across neural networks is challenging because representations admit symmetries, such as arbitrary reordering of units or rotations of activation space, that obscure underlying equivalence between models. We introduce a barycentric alignment framework that quotients out these nuisance symmetries to construct a universal embedding space across many models. Unlike existing similarity measures, which summarize relationships over entire stimulus sets, this framework enables similarity to be defined at the level of individual stimuli, revealing inputs that elicit convergent versus divergent representations across models. Using this instance-level notion of similarity, we identify systematic input properties that predict representational convergence versus divergence across vision and language model families. We also construct universal embedding spaces for brain representations across individuals and cortical regions, enabling instance-level comparison of representational agreement across stages of the human visual hierarchy. Finally, we apply the same barycentric alignment framework to purely unimodal vision and language models and find that post-hoc alignment into a shared space yields image text similarity scores that closely track human cross-modal judgments and approach the performance of contrastively trained vision-language models. This strikingly suggests that independently learned representations already share sufficient geometric structure for human-aligned cross-modal comparison. Together, these results show that resolving representational similarity at the level of individual stimuli reveals phenomena that cannot be detected by set-level comparison metrics.
[LG-57] EExApp: GNN-Based Reinforcement Learning for Radio Unit Energy Optimization in 5G O-RAN
链接: https://arxiv.org/abs/2602.09206
作者: Jie Lu,Peihao Yan,Huacheng Zeng
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Accepted by IEEE INFOCOM 2026
Abstract:With over 3.5 million 5G base stations deployed globally, their collective energy consumption (projected to exceed 131 TWh annually) raises significant concerns over both operational costs and environmental impacts. In this paper, we present EExAPP, a deep reinforcement learning (DRL)-based xApp for 5G Open Radio Access Network (O-RAN) that jointly optimizes radio unit (RU) sleep scheduling and distributed unit (DU) resource slicing. EExAPP uses a dual-actor-dual-critic Proximal Policy Optimization (PPO) architecture, with dedicated actor-critic pairs targeting energy efficiency and quality-of-service (QoS) compliance. A transformer-based encoder enables scalable handling of variable user equipment (UE) populations by encoding all-UE observations into fixed-dimensional representations. To coordinate the two optimization objectives, a bipartite Graph Attention Network (GAT) is used to modulate actor updates based on both critic outputs, enabling adaptive tradeoffs between power savings and QoS. We have implemented EExAPP and deployed it on a real-world 5G O-RAN testbed with live traffic, commercial RU and smartphones. Extensive over-the-air experiments and ablation studies confirm that EExAPP significantly outperforms existing methods in reducing the energy consumption of RU while maintaining QoS.
[LG-58] Fair Feature Importance Scores via Feature Occlusion and Permutation
链接: https://arxiv.org/abs/2602.09196
作者: Camille Little,Madeline Navarro,Santiago Segarra,Genevera Allen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:As machine learning models increasingly impact society, their opaque nature poses challenges to trust and accountability, particularly in fairness contexts. Understanding how individual features influence model outcomes is crucial for building interpretable and equitable models. While feature importance metrics for accuracy are well-established, methods for assessing feature contributions to fairness remain underexplored. We propose two model-agnostic approaches to measure fair feature importance. First, we propose to compare model fairness before and after permuting feature values. This simple intervention-based approach decouples a feature and model predictions to measure its contribution to training. Second, we evaluate the fairness of models trained with and without a given feature. This occlusion-based score enjoys dramatic computational simplification via minipatch learning. Our empirical results reflect the simplicity and effectiveness of our proposed metrics for multiple predictive tasks. Both methods offer simple, scalable, and interpretable solutions to quantify the influence of features on fairness, providing new tools for responsible machine learning development.
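A minimal sketch of the permutation-based variant, using the demographic parity gap as the fairness metric (the paper may use other fairness measures); the model, data, and group variable below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dp_gap(y_pred, group):
    """Demographic parity gap: |P(yhat=1 | group=1) - P(yhat=1 | group=0)|."""
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def permutation_fair_importance(model, X, group, j, n_repeats=20, seed=0):
    """Fair feature importance of feature j: change in the fairness gap when
    feature j is permuted, averaged over repeats (the model stays fixed)."""
    rng = np.random.default_rng(seed)
    base = dp_gap(model.predict(X), group)
    deltas = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        deltas.append(dp_gap(model.predict(Xp), group) - base)
    return float(np.mean(deltas))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    group = rng.integers(0, 2, size=n)               # protected attribute
    x_proxy = group + 0.5 * rng.normal(size=n)       # proxy feature for the group
    x_safe = rng.normal(size=n)                      # group-independent feature
    X = np.column_stack([x_proxy, x_safe])
    y = (x_proxy + x_safe + 0.5 * rng.normal(size=n) > 0.5).astype(int)
    model = LogisticRegression().fit(X, y)
    for j, name in enumerate(["proxy", "safe"]):
        print(name, round(permutation_fair_importance(model, X, group, j), 3))
```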
[LG-59] ML-DCN: Masked Low-Rank Deep Crossing Network Towards Scalable Ads Click-through Rate Prediction at Pinterest
链接: https://arxiv.org/abs/2602.09194
作者: Jiacheng Li,Yixiong Meng,Yi wu,Yun Zhao,Sharare Zehtabian,Jiayin Jin,Degao Peng,Jinfeng Zhuang,Qifei Shen,Kungang Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep learning recommendation systems rely on feature interaction modules to model complex user-item relationships across sparse categorical and dense features. In large-scale ad ranking, increasing model capacity is a promising path to improving both predictive performance and business outcomes, yet production serving budgets impose strict constraints on latency and FLOPs. This creates a central tension: we want interaction modules that both scale effectively with additional compute and remain compute-efficient at serving time. In this work, we study how to scale feature interaction modules under a fixed serving budget. We find that naively scaling DCNv2 and MaskNet, despite their widespread adoption in industry, yields rapidly diminishing offline gains in the Pinterest ads ranking system. To overcome aforementioned limitations, we propose ML-DCN, an interaction module that integrates an instance-conditioned mask into a low-rank crossing layer, enabling per-example selection and amplification of salient interaction directions while maintaining efficient computation. This novel architecture combines the strengths of DCNv2 and MaskNet, scales efficiently with increased compute, and achieves state-of-the-art performance. Experiments on a large internal Pinterest ads dataset show that ML-DCN achieves higher AUC than DCNv2, MaskNet, and recent scaling-oriented alternatives at matched FLOPs, and it scales more favorably overall as compute increases, exhibiting a stronger AUC-FLOPs trade-off. Finally, online A/B tests demonstrate statistically significant improvements in key ads metrics (including CTR and click-quality measures) and ML-DCN has been deployed in the production system with neutral serving cost.
[LG-60] One RNG to Rule Them All: How Randomness Becomes an Attack Vector in Machine Learning
链接: https://arxiv.org/abs/2602.09182
作者: Kotekar Annapoorna Prabhu,Andrew Gan,Zahra Ghodsi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore
Abstract:Machine learning relies on randomness as a fundamental component in various steps such as data sampling, data augmentation, weight initialization, and optimization. Most machine learning frameworks use pseudorandom number generators as the source of randomness. However, variations in design choices and implementations across different frameworks, software dependencies, and hardware backends along with the lack of statistical validation can lead to previously unexplored attack vectors on machine learning systems. Such attacks on randomness sources can be extremely covert, and have a history of exploitation in real-world systems. In this work, we examine the role of randomness in the machine learning development pipeline from an adversarial point of view, and analyze the implementations of PRNGs in major machine learning frameworks. We present RNGGuard to help machine learning engineers secure their systems with low effort. RNGGuard statically analyzes a target library’s source code and identifies instances of random functions and modules that use them. At runtime, RNGGuard enforces secure execution of random functions by replacing insecure function calls with RNGGuard’s implementations that meet security specifications. Our evaluations show that RNGGuard presents a practical approach to close existing gaps in securing randomness sources in machine learning systems.
[LG-61] Weighted Wasserstein Barycenter of Gaussian Processes for exotic Bayesian Optimization tasks
链接: https://arxiv.org/abs/2602.09181
作者: Antonio Candelieri,Francesco Archetti
类目: Machine Learning (cs.LG)
*备注:
Abstract:Exploiting the analogy between Gaussian Distributions and Gaussian Processes’ posterior, we present how the weighted Wasserstein Barycenter of Gaussian Processes (W2BGP) can be used to unify, under a common framework, different exotic Bayesian Optimization (BO) tasks. Specifically, collaborative/federated BO, (synchronous) batch BO, and multi-fidelity BO are considered in this paper. Our empirical analysis proves that each one of these tasks requires just an appropriate weighting schema for the W2BGP, while the entire framework remains untouched. Moreover, we demonstrate that the most well-known BO acquisition functions can be easily re-interpreted under the proposed framework and also enable a more computationally efficient way to deal with the computation of the Wasserstein Barycenter, compared with state-of-the-art methods from the Machine Learning literature. Finally, research perspectives branching from the proposed approach are presented.
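The finite-dimensional Gaussian analogue of the construction is easy to illustrate: the sketch below computes a weighted 2-Wasserstein barycenter of Gaussian distributions using the standard fixed-point iteration on the covariance. It covers only the plain Gaussian case, not the GP-posterior machinery or the acquisition-function reinterpretation of the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_barycenter(means, covs, weights, n_iter=50):
    """Weighted 2-Wasserstein barycenter of Gaussians N(m_i, C_i).
    The mean is the weighted average of the means; the covariance solves
    a fixed-point equation, iterated here in the usual way."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    m_bar = sum(wi * mi for wi, mi in zip(w, means))
    S = np.mean(covs, axis=0)                      # any PSD initialization works
    for _ in range(n_iter):
        S_half = np.real(sqrtm(S))
        S_half_inv = np.linalg.inv(S_half)
        T = sum(wi * np.real(sqrtm(S_half @ Ci @ S_half)) for wi, Ci in zip(w, covs))
        S = S_half_inv @ T @ T @ S_half_inv        # S <- S^{-1/2} T^2 S^{-1/2}
    return m_bar, S

if __name__ == "__main__":
    means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
    covs = [np.eye(2), np.diag([2.0, 0.5])]
    m, S = gaussian_w2_barycenter(means, covs, weights=[0.7, 0.3])
    print("barycenter mean:", m)
    print("barycenter cov:\n", np.round(S, 3))
```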
[LG-62] Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity
链接: https://arxiv.org/abs/2602.09169
作者: Jonathan Svirsky,Yehonathan Refael,Ofir Lindenbaum
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification using training stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20–40% of model parameters without significant accuracy loss. Empirical results show it outperforms recent finetuning baselines in efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.
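A minimal sketch of training stochastic gates over the output rows of a frozen layer, in the style of Gaussian-based stochastic gates with an expected-L0 regularizer. The gate parameterization and regularizer here are one common formulation chosen for illustration and may differ from the authors' exact scheme.

```python
import torch
import torch.nn as nn

class RowGate(nn.Module):
    """Stochastic gates over the output rows of a frozen linear layer.
    During training the gate is z = clip(mu + sigma * eps, 0, 1); the
    regularizer is the expected number of open gates, P(z > 0) = Phi(mu / sigma)."""
    def __init__(self, n_rows, sigma=0.5):
        super().__init__()
        self.mu = nn.Parameter(torch.full((n_rows,), 0.5))
        self.sigma = sigma

    def forward(self, h):
        if self.training:
            noise = torch.randn_like(self.mu) * self.sigma
            z = torch.clamp(self.mu + noise, 0.0, 1.0)
        else:
            z = torch.clamp(self.mu, 0.0, 1.0)   # deterministic gates at inference
        return h * z

    def expected_l0(self):
        normal = torch.distributions.Normal(0.0, 1.0)
        return normal.cdf(self.mu / self.sigma).sum()

if __name__ == "__main__":
    torch.manual_seed(0)
    frozen = nn.Linear(16, 32)
    for p in frozen.parameters():
        p.requires_grad_(False)                   # the backbone stays frozen
    gate = RowGate(n_rows=32)
    x = torch.randn(4, 16)
    out = gate(frozen(x))
    loss = out.pow(2).mean() + 1e-2 * gate.expected_l0()
    loss.backward()
    print(out.shape, float(gate.expected_l0()))
```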
[LG-63] Faster Rates For Federated Variational Inequalities
链接: https://arxiv.org/abs/2602.09164
作者: Guanghui Wang,Satyen Kale
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we study federated optimization for solving stochastic variational inequalities (VIs), a problem that has attracted growing attention in recent years. Despite substantial progress, a significant gap remains between existing convergence rates and the state-of-the-art bounds known for federated convex optimization. In this work, we address this limitation by establishing a series of improved convergence rates. First, we show that, for general smooth and monotone variational inequalities, the classical Local Extra SGD algorithm admits tighter guarantees under a refined analysis. Next, we identify an inherent limitation of Local Extra SGD, which can lead to excessive client drift. Motivated by this observation, we propose a new algorithm, the Local Inexact Proximal Point Algorithm with Extra Step (LIPPAX), and show that it mitigates client drift and achieves improved guarantees in several regimes, including bounded Hessian, bounded operator, and low-variance settings. Finally, we extend our results to federated composite variational inequalities and establish improved convergence guarantees.
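For reference, the extra step at the heart of (Local) Extra SGD is the classical extragradient update. The sketch below shows the deterministic, single-worker version on a bilinear saddle point where plain gradient descent-ascent diverges; it is background only, not the federated algorithms or LIPPAX from the paper.

```python
import numpy as np

def extragradient(F, w0, eta=0.1, n_steps=500):
    """Deterministic extragradient for a monotone operator F:
    take a trial step, then update using the operator at the trial point."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w_half = w - eta * F(w)          # exploratory (extra) step
        w = w - eta * F(w_half)          # update with the lookahead operator
    return w

if __name__ == "__main__":
    # Bilinear saddle point min_x max_y x*y: F(x, y) = (y, -x) is monotone;
    # plain gradient descent-ascent spirals outward, extragradient converges.
    F = lambda w: np.array([w[1], -w[0]])
    print(extragradient(F, w0=[1.0, 1.0]))   # -> close to (0, 0)
```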
[LG-64] Boltzmann Reinforcement Learning for Noise resilience in Analog Ising Machines
链接: https://arxiv.org/abs/2602.09162
作者: Aditya Choudhary,Saaketh Desai,Prasad Iyer
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:Analog Ising machines (AIMs) have emerged as a promising paradigm for combinatorial optimization, utilizing physical dynamics to solve Ising problems with high energy efficiency. However, the performance of traditional optimization and sampling algorithms on these platforms is often limited by inherent measurement noise. We introduce BRAIN (Boltzmann Reinforcement for Analog Ising Networks), a distribution learning framework that utilizes variational reinforcement learning to approximate the Boltzmann distribution. By shifting from state-by-state sampling to aggregating information across multiple noisy measurements, BRAIN is resilient to Gaussian noise characteristic of AIMs. We evaluate BRAIN across diverse combinatorial topologies, including the Curie-Weiss and 2D nearest-neighbor Ising systems. We find that under realistic 3% Gaussian measurement noise, BRAIN maintains 98% ground state fidelity, whereas Markov Chain Monte Carlo (MCMC) methods degrade to 51% fidelity. Furthermore, BRAIN reaches the MCMC-equivalent solution up to 192x faster under these conditions. BRAIN exhibits \mathcal{O}(N^{1.55}) scaling up to 65,536 spins and maintains robustness against severe measurement uncertainty up to 40%. Beyond ground state optimization, BRAIN accurately captures thermodynamic phase transitions and metastable states, providing a scalable and noise-resilient method for utilizing analog computing architectures in complex optimizations.
[LG-65] UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation ACL2026
链接: https://arxiv.org/abs/2602.09130
作者: Jonathan von Rad,Yong Cao,Andreas Geiger
类目: Machine Learning (cs.LG)
*备注: 18 pages, 6 figures. Submitted to ACL 2026
Abstract:Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias, where knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration gains at high computational cost; and (iii) task-specific calibration can significantly improve the reasoning ability of pruned models by up to 50%.
[LG-66] Counterfactual Maps: What They Are and How to Find Them
链接: https://arxiv.org/abs/2602.09128
作者: Awa Khouna,Julien Ferry,Thibaut Vidal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Counterfactual explanations are a central tool in interpretable machine learning, yet computing them exactly for complex models remains challenging. For tree ensembles, predictions are piecewise constant over a large collection of axis-aligned hyperrectangles, implying that an optimal counterfactual for a point corresponds to its projection onto the nearest rectangle with an alternative label under a chosen metric. Existing methods largely overlook this geometric structure, relying either on heuristics with no optimality guarantees or on mixed-integer programming formulations that do not scale to interactive use. In this work, we revisit counterfactual generation through the lens of nearest-region search and introduce counterfactual maps, a global representation of recourse for tree ensembles. Leveraging the fact that any tree ensemble can be compressed into an equivalent partition of labeled hyperrectangles, we cast counterfactual search as the problem of identifying the generalized Voronoi cell associated with the nearest rectangle of an alternative label. This leads to an exact, amortized algorithm based on volumetric k-dimensional (KD) trees, which performs branch-and-bound nearest-region queries with explicit optimality certificates and sublinear average query time after a one-time preprocessing phase. Our experimental analyses on several real datasets drawn from high-stakes application domains show that this approach delivers globally optimal counterfactual explanations with millisecond-level latency, achieving query times that are orders of magnitude faster than existing exact, cold-start optimization methods.
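The geometric core of the approach is simple to state: within a fixed axis-aligned hyperrectangle, the Euclidean-closest point to a query is obtained by coordinate-wise clipping. The sketch below shows this projection together with a brute-force nearest-region query over a handful of labeled boxes; the paper's KD-tree amortization and optimality certificates are not reproduced here.

```python
import numpy as np

def project_onto_box(x, lower, upper):
    """Closest point (in Euclidean distance) to x inside an axis-aligned
    hyperrectangle: clip each coordinate independently."""
    return np.clip(x, lower, upper)

def nearest_counterfactual(x, boxes):
    """Brute-force nearest-region query: among boxes carrying the alternative
    label, return the closest projection of x and its distance."""
    best, best_d = None, np.inf
    for lower, upper in boxes:
        p = project_onto_box(x, lower, upper)
        d = np.linalg.norm(p - x)
        if d < best_d:
            best, best_d = p, d
    return best, best_d

if __name__ == "__main__":
    x = np.array([0.2, 0.9])
    boxes = [(np.array([0.5, 0.0]), np.array([1.0, 0.4])),
             (np.array([0.6, 0.8]), np.array([0.9, 1.0]))]
    cf, dist = nearest_counterfactual(x, boxes)
    print("counterfactual:", cf, "distance:", round(dist, 3))
```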
[LG-67] Epistemic Throughput: Fundamental Limits of Attention-Constrained Inference
链接: https://arxiv.org/abs/2602.09127
作者: Lei You
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Recent generative and tool-using AI systems can surface a large volume of candidates at low marginal cost, yet only a small fraction can be checked carefully. This creates a decoder-side bottleneck: downstream decision-makers must form reliable posteriors from many public records under scarce attention. We formalize this regime via Attention-Constrained Inference (ACI), in which a cheap screening stage processes K records and an expensive verification stage can follow up on at most B of them. Under Bayes log-loss, we study the maximum achievable reduction in posterior uncertainty per window, which we call \emph{epistemic throughput}. Our main result is a ``JaKoB'' scaling law showing that epistemic throughput has a baseline term that grows linearly with verification and prevalence, and an additional \emph{information-leverage} term that scales as \sqrt{JKB}, where J summarizes screening quality. Thus, expanding cheap screening can nonlinearly amplify scarce verification, even when informative records are rare. We further show that this scaling is tight in a weak-screening limit, and that in the sparse-verification regime ( B \ll K ), substantial leverage requires heavy-tailed score distributions; for light-tailed scores the amplification is only logarithmic.
[LG-68] SpinCastML, an Open Decision-Making Application for Inverse Design of Electrospinning Manufacturing: A Machine Learning, Optimal Sampling and Inverse Monte Carlo Approach
链接: https://arxiv.org/abs/2602.09120
作者: Elisa Roldan,Tasneem Sabir
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electrospinning is a powerful technique for producing micro to nanoscale fibers with application specific architectures. Small variations in solution or operating conditions can shift the jet regime, generating non Gaussian fiber diameter distributions. Despite substantial progress, no existing framework enables inverse design toward desired fiber outcomes while integrating polymer solvent chemical constraints or predicting full distributions. SpinCastML is an open source, distribution aware, chemically informed machine learning and Inverse Monte Carlo (IMC) software for inverse electrospinning design. Built on a rigorously curated dataset of 68,480 fiber diameters from 1,778 datasets across 16 polymers, SpinCastML integrates three structured sampling methods, a suite of 11 high-performance learners, and chemistry aware constraints to predict not only mean diameter but the entire distribution. Cubist model with a polymer balanced Sobol D optimal sampling provides the highest global performance (R2 0.92). IMC accurately captures the fiber distributions, achieving R2 0.90 and 1% error between predicted and experimental success rates. The IMC engine supports both retrospective analysis and forward-looking inverse design, generating physically and chemically feasible polymer solvent parameter combinations with quantified success probabilities for user-defined targets. SpinCastML reframes electrospinning from trial and error to a reproducible, data driven design process. As an open source executable, it enables laboratories to analyze their own datasets and co create an expanding community software. SpinCastML reduces experimental waste, accelerates discovery, and democratizes access to advanced modeling, establishing distribution aware inverse design as a new standard for sustainable nanofiber manufacturing across biomedical, filtration, and energy applications.
[LG-69] Importance inversion transfer identifies shared principles for cross-domain learning
链接: https://arxiv.org/abs/2602.09116
作者: Daniele Caligiore
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Quantitative Methods (q-bio.QM)
*备注:
Abstract:The capacity to transfer knowledge across scientific domains relies on shared organizational principles. However, existing transfer-learning methodologies often fail to bridge radically heterogeneous systems, particularly under severe data scarcity or stochastic noise. This study formalizes Explainable Cross-Domain Transfer Learning (X-CDTL), a framework unifying network science and explainable artificial intelligence to identify structural invariants that generalize across biological, linguistic, molecular, and social networks. By introducing the Importance Inversion Transfer (IIT) mechanism, the framework prioritizes domain-invariant structural anchors over idiosyncratic, highly discriminative features. In anomaly detection tasks, models guided by these principles achieve significant performance gains - exhibiting a 56% relative improvement in decision stability under extreme noise - over traditional baselines. These results provide evidence for a shared organizational signature across heterogeneous domains, establishing a principled paradigm for cross-disciplinary knowledge propagation. By shifting from opaque latent representations to explicit structural laws, this work advances machine learning as a robust engine for scientific discovery.
[LG-70] From Adam to Adam-Like Lagrangians: Second-Order Nonlocal Dynamics
链接: https://arxiv.org/abs/2602.09101
作者: Carlos Heredia
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: 42 pages, 10 figures
Abstract:In this paper, we derive an accelerated continuous-time formulation of Adam by modeling it as a second-order integro-differential dynamical system. We relate this inertial nonlocal model to an existing first-order nonlocal Adam flow through an \alpha -refinement limit, and we provide Lyapunov-based stability and convergence analyses. We also introduce an Adam-inspired nonlocal Lagrangian formulation, offering a variational viewpoint. Numerical simulations on Rosenbrock-type examples show agreement between the proposed dynamics and discrete Adam.
[LG-71] Patient foundation model for risk stratification in low-risk overweight patients
链接: https://arxiv.org/abs/2602.09079
作者: Zachary N. Flamholz,Dillon Tracy,Ripple Khera,Jordan Wolinsky,Nicholas Lee,Nathaniel Tann,Xiao Yin Zhu,Harry Phillips,Jeffrey Sherman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate risk stratification in patients with overweight or obesity is critical for guiding preventive care and allocating high-cost therapies such as GLP-1 receptor agonists. We present PatientTPP, a neural temporal point process (TPP) model trained on over 500,000 real-world clinical trajectories to learn patient representations from sequences of diagnoses, labs, and medications. We extend existing TPP modeling approaches to include static and numeric features and incorporate clinical knowledge for event encoding. PatientTPP representations support downstream prediction tasks, including classification of obesity-associated outcomes in low-risk individuals, even for events not explicitly modeled during training. In health economic evaluation, PatientTPP outperformed body mass index in stratifying patients by future cardiovascular-related healthcare costs, identifying higher-risk patients more efficiently. By modeling both the type and timing of clinical events, PatientTPP offers an interpretable, general-purpose foundation for patient risk modeling with direct applications to obesity-related care and cost targeting.
[LG-72] SVD-Preconditioned Gradient Descent Method for Solving Nonlinear Least Squares Problems
链接: https://arxiv.org/abs/2602.09057
作者: Zhipeng Chang,Wenrui Hao,Nian Liu
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This paper introduces a novel optimization algorithm designed for nonlinear least-squares problems. The method is derived by preconditioning the gradient descent direction using the Singular Value Decomposition (SVD) of the Jacobian. This SVD-based preconditioner is then integrated with the first- and second-moment adaptive learning rate mechanism of the Adam optimizer. We establish the local linear convergence of the proposed method under standard regularity assumptions and prove global convergence for a modified version of the algorithm under suitable conditions. The effectiveness of the approach is demonstrated experimentally across a range of tasks, including function approximation, partial differential equation (PDE) solving, and image classification on the CIFAR-10 dataset. Results show that the proposed method consistently outperforms standard Adam, achieving faster convergence and lower error in both regression and classification settings.
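A hedged sketch of one way to combine an SVD-based preconditioner with Adam-style moments on a toy nonlinear least-squares fit: the damped pseudoinverse direction, the damping constant, and all hyperparameters below are illustrative assumptions, not the authors' exact update rule or convergence setting.

```python
import numpy as np

def svd_precondition_adam(residual, jacobian, w0, lr=0.1, n_iter=200,
                          beta1=0.9, beta2=0.999, eps=1e-8, damp=1e-6):
    """Sketch for min_w 0.5 * ||r(w)||^2: the direction
    V diag(s / (s^2 + damp)) U^T r is a damped pseudoinverse (Gauss-Newton-like)
    step built from the Jacobian's SVD, then fed through Adam-style moments."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, n_iter + 1):
        r, J = residual(w), jacobian(w)
        U, s, Vt = np.linalg.svd(J, full_matrices=False)
        d = Vt.T @ ((s / (s ** 2 + damp)) * (U.T @ r))   # preconditioned direction
        m = beta1 * m + (1 - beta1) * d
        v = beta2 * v + (1 - beta2) * d ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

if __name__ == "__main__":
    # Toy nonlinear least squares: fit y = exp(a*x) + b to noisy data.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    y = np.exp(1.5 * x) + 0.3 + 0.01 * rng.normal(size=x.size)
    residual = lambda w: np.exp(w[0] * x) + w[1] - y
    jacobian = lambda w: np.column_stack([x * np.exp(w[0] * x), np.ones_like(x)])
    print(svd_precondition_adam(residual, jacobian, w0=[0.5, 0.0]))  # roughly (1.5, 0.3)
```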
[LG-73] Dynamic Load Model for Data Centers with Pattern-Consistent Calibration
链接: https://arxiv.org/abs/2602.07859
作者: Siyu Lu,Chenhan Xiao,Yang Weng
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 10 pages, 13 figures
Abstract:The rapid growth of data centers has made large electronic load (LEL) modeling increasingly important for power system analysis. Such loads are characterized by fast workload-driven variability and protection-driven disconnection and reconnection behavior that are not captured by conventional load models. Existing data center load modeling includes physics-based approaches, which provide interpretable structure for grid simulation, and data-driven approaches, which capture empirical workload variability from data. However, physics-based models are typically uncalibrated to facility-level operation, while trajectory alignment in data-driven methods often leads to overfitting and unrealistic dynamic behavior. To resolve these limitations, we design the framework to leverage both physics-based structure and data-driven adaptability. The physics-based structure is parameterized to enable data-driven pattern-consistent calibration from real operational data, supporting facility-level grid planning. We further show that trajectory-level alignment is limited for inherently stochastic data center loads. Therefore, we design the calibration to align temporal and statistical patterns using temporal contrastive learning (TCL). This calibration is performed locally at the facility, and only calibrated parameters are shared with utilities, preserving data privacy. The proposed load model is calibrated by real-world operational load data from the MIT Supercloud, ASU Sol, Blue Waters, and ASHRAE datasets. Then it is integrated into the ANDES platform and evaluated on the IEEE 39-bus, NPCC 140-bus, and WECC 179-bus systems. We find that interactions among LELs can fundamentally alter post-disturbance recovery behavior, producing compound disconnection-reconnection dynamics and delayed stabilization that are not captured by uncalibrated load models.
[LG-74] Statistical-Computational Trade-offs in Learning Multi-Index Models via Harmonic Analysis
链接: https://arxiv.org/abs/2602.09959
作者: Hugo Latourelle-Vigeant,Theodor Misiakiewicz
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 91 pages
Abstract:We study the problem of learning multi-index models (MIMs), where the label depends on the input \boldsymbol{x} \in \mathbb{R}^d only through an unknown \mathsf{s}-dimensional projection \boldsymbol{W}_*^{\mathsf{T}} \boldsymbol{x} \in \mathbb{R}^{\mathsf{s}}. Exploiting the equivariance of this problem under the orthogonal group \mathcal{O}_d, we obtain a sharp harmonic-analytic characterization of the learning complexity for MIMs with spherically symmetric inputs – which refines and generalizes previous Gaussian-specific analyses. Specifically, we derive statistical and computational complexity lower bounds within the Statistical Query (SQ) and Low-Degree Polynomial (LDP) frameworks. These bounds decompose naturally across spherical harmonic subspaces. Guided by this decomposition, we construct a family of spectral algorithms based on harmonic tensor unfolding that sequentially recover the latent directions and (nearly) achieve these SQ and LDP lower bounds. Depending on the choice of harmonic degree sequence, these estimators can realize a broad range of trade-offs between sample and runtime complexity. From a technical standpoint, our results build on the semisimple decomposition of the \mathcal{O}_d-action on L^2(\mathbb{S}^{d-1}) and the intertwining isomorphism between spherical harmonics and traceless symmetric tensors.
[LG-75] The Catastrophic Failure of The k-Means Algorithm in High Dimensions and How Hartigan's Algorithm Avoids It
链接: https://arxiv.org/abs/2602.09936
作者: Roy R. Lederman,David Silva-Sánchez,Ziling Chen,Gilles Mordant,Amnon Balanov,Tamir Bendory
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Lloyd’s k-means algorithm is one of the most widely used clustering methods. We prove that in high-dimensional, high-noise settings, the algorithm exhibits catastrophic failure: with high probability, essentially every partition of the data is a fixed point. Consequently, Lloyd’s algorithm simply returns its initial partition - even when the underlying clusters are trivially recoverable by other methods. In contrast, we prove that Hartigan’s k-means algorithm does not exhibit this pathology. Our results show the stark difference between these algorithms and offer a theoretical explanation for the empirical difficulties often observed with k-means in high dimensions.
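For contrast with Lloyd's batch reassignments, Hartigan's method moves one point at a time using the exact change in within-cluster sum of squares. The sketch below implements that move criterion as a plain illustration of the algorithm; it does not reproduce the paper's high-dimensional, high-noise analysis.

```python
import numpy as np

def hartigan_kmeans(X, k, n_passes=20, seed=0):
    """Hartigan-style k-means: repeatedly move single points between clusters
    whenever the move strictly decreases the within-cluster sum of squares.
    The cost change of moving x from cluster a (size n_a) to cluster b (size n_b)
    is  n_b/(n_b+1) * ||x - mu_b||^2  -  n_a/(n_a-1) * ||x - mu_a||^2."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_passes):
        moved = False
        for i, x in enumerate(X):
            a = labels[i]
            n_a = np.sum(labels == a)
            if n_a <= 1:
                continue
            mu_a = X[labels == a].mean(axis=0)
            out_gain = n_a / (n_a - 1) * np.sum((x - mu_a) ** 2)
            best_b, best_delta = a, 0.0
            for b in range(k):
                if b == a:
                    continue
                members = X[labels == b]
                n_b = len(members)
                mu_b = members.mean(axis=0) if n_b else x
                in_cost = n_b / (n_b + 1) * np.sum((x - mu_b) ** 2)
                delta = in_cost - out_gain
                if delta < best_delta:
                    best_b, best_delta = b, delta
            if best_b != a:
                labels[i] = best_b
                moved = True
        if not moved:
            break
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    labels = hartigan_kmeans(X, k=2)
    print("cluster sizes:", np.bincount(labels))
```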
[LG-76] Robust Processing and Learning: Principles, Methods and Wireless Applications
链接: https://arxiv.org/abs/2602.09848
作者: Shixiong Wang,Wei Dai,Li-Chun Wang,Geoffrey Ye Li
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This tutorial-style overview article examines the fundamental principles and methods of robustness, using wireless sensing and communication (WSC) as the narrative and exemplifying framework. First, we formalize the conceptual and mathematical foundations of robustness, highlighting the interpretations and relations across robust statistics, optimization, and machine learning. Key techniques, such as robust estimation and testing, distributionally robust optimization, and regularized and adversary training, are investigated. Together, the costs of robustness in system design, for example, the compromised nominal performances and the extra computational burdens, are discussed. Second, we review recent robust signal processing solutions for WSC that address model mismatch, data scarcity, adversarial perturbation, and distributional shift. Specific applications include robust ranging-based localization, modality sensing, channel estimation, receive combining, waveform design, and federated learning. Through this effort, we aim to introduce the classical developments and recent advances in robustness theory to the general signal processing community, exemplifying how robust statistical, optimization, and machine learning approaches can address the uncertainties inherent in WSC systems.
[LG-77] Stabilized Maximum-Likelihood Iterative Quantum Amplitude Estimation for Structural CVaR under Correlated Random Fields
链接: https://arxiv.org/abs/2602.09847
作者: Alireza Tabarraei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Conditional Value-at-Risk (CVaR) is a central tail-risk measure in stochastic structural mechanics, yet its accurate evaluation under high-dimensional, spatially correlated material uncertainty remains computationally prohibitive for classical Monte Carlo methods. Leveraging bounded-expectation reformulations of CVaR compatible with quantum amplitude estimation, we develop a quantum-enhanced inference framework that casts CVaR evaluation as a statistically consistent, confidence-constrained maximum-likelihood amplitude estimation problem. The proposed method extends iterative quantum amplitude estimation (IQAE) by embedding explicit maximum-likelihood inference within a rigorously controlled interval-tracking architecture. To ensure global correctness under finite-shot noise and the non-injective oscillatory response induced by Grover amplification, we introduce a stabilized inference scheme incorporating multi-hypothesis feasibility tracking, periodic low-depth disambiguation, and a bounded restart mechanism governed by an explicit failure-probability budget. This formulation preserves the quadratic oracle-complexity advantage of amplitude estimation while providing finite-sample confidence guarantees and reduced estimator variance. The framework is demonstrated on benchmark problems with spatially correlated lognormal Young’s modulus fields generated using a Nystrom low-rank Gaussian kernel model. Numerical results show that the proposed estimator achieves substantially lower oracle complexity than classical Monte Carlo CVaR estimation at comparable confidence levels, while maintaining rigorous statistical reliability. This work establishes a practically robust and theoretically grounded quantum-enhanced methodology for tail-risk quantification in stochastic continuum mechanics.
[LG-78] Step-Size Stability in Stochastic Optimization: A Theoretical Perspective
链接: https://arxiv.org/abs/2602.09842
作者: Fabian Schaipp,Robert M. Gower,Adrien Taylor
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We present a theoretical analysis of stochastic optimization methods in terms of their sensitivity with respect to the step size. We identify a key quantity that, for each method, describes how the performance degrades as the step size becomes too large. For convex problems, we show that this quantity directly impacts the suboptimality bound of the method. Most importantly, our analysis provides direct theoretical evidence that adaptive step-size methods, such as SPS or NGN, are more robust than SGD. This allows us to quantify the advantage of these adaptive methods beyond empirical evaluation. Finally, we show through experiments that our theoretical bound qualitatively mirrors the actual performance as a function of the step size, even for nonconvex problems.
[LG-79] Toeplitz Based Spectral Methods for Data-driven Dynamical Systems
链接: https://arxiv.org/abs/2602.09791
作者: Vladimir R. Kostic,Karim Lounici,Massimiliano Pontil
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 18 pages, 3 figures
Abstract:We introduce a Toeplitz-based framework for data-driven spectral estimation of linear evolution operators in dynamical systems. Focusing on transfer and Koopman operators from equilibrium trajectories without access to the underlying equations of motion, our method applies Toeplitz filters to the infinitesimal generator to extract eigenvalues, eigenfunctions, and spectral measures. Structural prior knowledge, such as self-adjointness or skew-symmetry, can be incorporated by design. The approach is statistically consistent and computationally efficient, leveraging both primal and dual algorithms commonly used in statistical learning. Numerical experiments on deterministic and chaotic systems demonstrate that the framework can recover spectral properties beyond the reach of standard data-driven methods.
[LG-80] Linear Model Extraction via Factual and Counterfactual Queries
链接: https://arxiv.org/abs/2602.09748
作者: Daan Otto,Jannis Kurtz,Dick den Hertog,Ilker Birbil
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:In model extraction attacks, the goal is to reveal the parameters of a black-box machine learning model by querying the model for a selected set of data points. Due to an increasing demand for explanations, this may involve counterfactual queries besides the typically considered factual queries. In this work, we consider linear models and three types of queries: factual, counterfactual, and robust counterfactual. First, for an arbitrary set of queries, we derive novel mathematical formulations for the classification regions for which the decision of the unknown model is known, without recovering any of the model parameters. Second, we derive bounds on the number of queries needed to extract the model’s parameters for (robust) counterfactual queries under arbitrary norm-based distances. We show that the full model can be recovered using just a single counterfactual query when differentiable distance measures are employed. In contrast, when using polyhedral distances for instance, the number of required queries grows linearly with the dimension of the data space. For robust counterfactuals, the latter number of queries doubles. Consequently, the applied distance function and robustness of counterfactuals have a significant impact on the model’s security.
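The single-query result is easy to visualize for Euclidean distance: the closest counterfactual of a linear classifier is the orthogonal projection onto the decision hyperplane, so the displacement x - x_cf is parallel to w and the boundary condition fixes b up to scale. The sketch below simulates such an explanation oracle; it illustrates the geometric argument only and is not the paper's general construction for arbitrary norms or robust counterfactuals.

```python
import numpy as np

# Unknown linear model f(x) = sign(w.x + b); the attacker never sees w, b.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
b_true = 0.7

def closest_counterfactual(x):
    """What an explanation API would return under Euclidean distance:
    the orthogonal projection of x onto the decision boundary w.x + b = 0."""
    return x - (w_true @ x + b_true) / (w_true @ w_true) * w_true

# One factual point and its counterfactual recover the hyperplane up to scale:
# the displacement is parallel to w, and the counterfactual lies on the
# boundary, which pins down b relative to w.
x = rng.normal(size=5)
x_cf = closest_counterfactual(x)

w_hat = x - x_cf                      # parallel to w_true
b_hat = -w_hat @ x_cf                 # boundary condition: w_hat.x_cf + b_hat = 0

scale = w_true[0] / w_hat[0]
print(np.allclose(w_hat * scale, w_true), np.isclose(b_hat * scale, b_true))
```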
[LG-81] Continual Learning for non-stationary regression via Memory-Efficient Replay
链接: https://arxiv.org/abs/2602.09720
作者: Pablo García-Santaclara,Bruno Fernández-Castro,Rebeca P. Díaz-Redondo,Martín Alonso-Gamarra
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Data streams are rarely static in dynamic environments like Industry 4.0. Instead, they constantly change, making traditional offline models outdated unless they can quickly adjust to the new data. This need can be adequately addressed by continual learning (CL), which allows systems to gradually acquire knowledge without incurring the prohibitive costs of retraining them from scratch. Most research on continual learning focuses on classification problems, while very few studies address regression tasks. We propose the first prototype-based generative replay framework designed for online task-free continual regression. Our approach defines an adaptive output-space discretization model, enabling prototype-based generative replay for continual regression without storing raw data. Evidence obtained from several benchmark datasets shows that our framework reduces forgetting and provides more stable performance than other state-of-the-art solutions.
[LG-82] SAQNN: Spectral Adaptive Quantum Neural Network as a Universal Approximator
链接: https://arxiv.org/abs/2602.09718
作者: Jialiang Tang,Jialin Zhang,Xiaoming Sun
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum machine learning (QML), as an interdisciplinary field bridging quantum computing and machine learning, has garnered significant attention in recent years. Currently, the field as a whole faces challenges due to incomplete theoretical foundations for the expressivity of quantum neural networks (QNNs). In this paper we propose a constructive QNN model and demonstrate that it possesses the universal approximation property (UAP), which means it can approximate any square-integrable function up to arbitrary accuracy. Furthermore, it supports switching function bases, thus adaptable to various scenarios in numerical approximation and machine learning. Our model has asymptotic advantages over the best classical feed-forward neural networks in terms of circuit size and achieves optimal parameter complexity when approximating Sobolev functions under L_2 norm.
[LG-83] The Entropic Signature of Class Speciation in Diffusion Models
链接: https://arxiv.org/abs/2602.09651
作者: Florian Handke,Dejan Stančević,Felix Koulischer,Thomas Demeester,Luca Ambrogioni
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages
Abstract:Diffusion models do not recover semantic structure uniformly over time. Instead, samples transition from semantic ambiguity to class commitment within a narrow regime. Recent theoretical work attributes this transition to dynamical instabilities along class-separating directions, but practical methods to detect and exploit these windows in trained models are still limited. We show that tracking the class-conditional entropy of a latent semantic variable given the noisy state provides a reliable signature of these transition regimes. By restricting the entropy to semantic partitions, the entropy can furthermore resolve semantic decisions at different levels of abstraction. We analyze this behavior in high-dimensional Gaussian mixture models and show that the entropy rate concentrates on the same logarithmic time scale as the speciation symmetry-breaking instability previously identified in variance-preserving diffusion. We validate our method on EDM2-XS and Stable Diffusion 1.5, where class-conditional entropy consistently isolates the noise regimes critical for semantic structure formation. Finally, we use our framework to quantify how guidance redistributes semantic information over time. Together, these results connect information-theoretic and statistical physics perspectives on diffusion and provide a principled basis for time-localized control.
[LG-84] Tracking Finite-Time Lyapunov Exponents to Robustify Neural ODEs
链接: https://arxiv.org/abs/2602.09613
作者: Tobias Wöhrer,Christian Kuehn
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注: Lyapunov exponents, neural ODEs, deep learning, adversarial robustness, Lagrangian coherent structures
Abstract:We investigate finite-time Lyapunov exponents (FTLEs), a measure of the exponential separation of input perturbations, of deep neural networks within the framework of continuous-depth neural ODEs. We demonstrate that FTLEs are powerful organizers for input-output dynamics, allowing for better interpretability and the comparison of distinct model architectures. We establish a direct connection between Lyapunov exponents and adversarial vulnerability, and propose a novel training algorithm that improves robustness by FTLE regularization. The key idea is to suppress exponents far from zero in the early stage of the input dynamics. This approach enhances robustness and reduces computational cost compared to full-interval regularization, as it avoids a full "double" backpropagation.
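A minimal sketch of how an early-window FTLE regularizer could be added to a neural-ODE training loss. The explicit-Euler integrator, the spectral-norm surrogate for the FTLE, the subsample size, and the 0.1 penalty weight are assumptions for illustration, not the authors' exact algorithm.

```python
# Sketch: penalize |FTLE| = |log sigma_max(dPhi_t1/dx)| / t1 on an early time window.
import torch

torch.manual_seed(0)
f = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def flow(x, t1=0.3, steps=15):
    """Explicit-Euler flow map Phi_{t1}(x) of dx/dt = f(x)."""
    h = t1 / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

def ftle_penalty(x, t1=0.3):
    """Mean |FTLE| over a small subsample of inputs."""
    pens = []
    for xi in x[:8]:                                          # subsample to limit cost
        J = torch.autograd.functional.jacobian(lambda z: flow(z, t1), xi, create_graph=True)
        pens.append(torch.log(torch.linalg.matrix_norm(J, ord=2)).abs() / t1)
    return torch.stack(pens).mean()

x = torch.randn(64, 2)
y = (x[:, 0] * x[:, 1] > 0).float()                           # toy labels
head = torch.nn.Linear(2, 1)
opt = torch.optim.Adam(list(f.parameters()) + list(head.parameters()), lr=1e-3)

for step in range(100):
    opt.zero_grad()
    logits = head(flow(x, t1=1.0)).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    loss = loss + 0.1 * ftle_penalty(x, t1=0.3)               # early-window regularizer
    loss.backward()
    opt.step()
```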
[LG-85] From Average Sensitivity to Small-Loss Regret Bounds under Random-Order Model
链接: https://arxiv.org/abs/2602.09457
作者: Shinsaku Sakaue,Yuichi Yoshida
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We study online learning in the random-order model, where the multiset of loss functions is chosen adversarially but revealed in a uniformly random order. Building on the batch-to-online conversion by Dong and Yoshida (2023), we show that if an offline algorithm admits a (1+\varepsilon) -approximation guarantee and the effect of \varepsilon on its average sensitivity is characterized by a function \varphi(\varepsilon) , then an adaptive choice of \varepsilon yields a small-loss regret bound of \tilde O(\varphi^\star(\mathrm{OPT}_T)) , where \varphi^\star is the concave conjugate of \varphi , \mathrm{OPT}_T is the offline optimum over T rounds, and \tilde O hides polylogarithmic factors in T . Our method requires no regularity assumptions on loss functions, such as smoothness, and can be viewed as a generalization of the AdaGrad-style tuning applied to the approximation parameter \varepsilon . Our result recovers and strengthens the (1+\varepsilon) -approximate regret bounds of Dong and Yoshida (2023) and yields small-loss regret bounds for online k -means clustering, low-rank approximation, and regression. We further apply our framework to online submodular function minimization using (1\pm\varepsilon) -cut sparsifiers of submodular hypergraphs, obtaining a small-loss regret bound of \tilde O(n^{3/4}(1 + \mathrm{OPT}_T^{3/4})) , where n is the ground-set size. Our approach sheds light on the power of sparsification and related techniques in establishing small-loss regret bounds in the random-order model.
[LG-86] Is Memorization Helpful or Harmful? Prior Information Sets the Threshold
链接: https://arxiv.org/abs/2602.09405
作者: Chen Cheng,Rina Foygel Barber
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 33 pages, 3 figures
Abstract:We examine the connection between training error and generalization error for arbitrary estimating procedures, working in an overparameterized linear model under general priors in a Bayesian setup. We find determining factors inherent to the prior distribution \pi , giving explicit conditions under which optimal generalization necessitates that the training error be (i) near interpolating relative to the noise size (i.e., memorization is necessary), or (ii) close to the noise level (i.e., overfitting is harmful). Remarkably, these phenomena occur when the noise reaches thresholds determined by the Fisher information and the variance parameters of the prior \pi .
[LG-87] How Far Can You Grow? Characterizing the Extrapolation Frontier of Graph Generative Models for Materials Science
链接: https://arxiv.org/abs/2602.09309
作者: Can Polat,Erchin Serpedin,Mustafa Kurban,Hasan Kurban
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Atomic and Molecular Clusters (physics.atm-clus)
*备注:
Abstract:Every generative model for crystalline materials harbors a critical structure size beyond which its outputs quietly become unreliable – we call this the extrapolation frontier. Despite its direct consequences for nanomaterial design, this frontier has never been systematically measured. We introduce RADII, a radius-resolved benchmark of \sim 75,000 nanoparticle structures (55-11,298 atoms) that treats radius as a continuous scaling knob to trace generation quality from in-distribution to out-of-distribution regimes under leakage-free splits. RADII provides frontier-specific diagnostics: per-radius error profiles pinpoint each architecture’s scaling ceiling, surface-interior decomposition tests whether failures originate at boundaries or in bulk, and cross-metric failure sequencing reveals which aspect of structural fidelity breaks first. Benchmarking five state-of-the-art architectures, we find that: (i) all models degrade by \sim13% in global positional error beyond training radii, yet local bond fidelity diverges wildly across architectures – from near-zero to over 2\times collapse; (ii) no two architectures share the same failure sequence, revealing the frontier as a multi-dimensional surface shaped by model family; and (iii) well-behaved models obey a power-law scaling exponent \alpha \approx 1/3 whose in-distribution fit accurately predicts out-of-distribution error, making their frontiers quantitatively forecastable. These findings establish output scale as a first-class evaluation axis for geometric generative models. The dataset and code are available at this https URL.
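A purely illustrative sketch of the forecasting step described in the last finding: fit a power law error(r) ≈ c · r^alpha in log-log space on in-distribution radii and extrapolate it to out-of-distribution radii. The radius values and synthetic error numbers below are made up; only the fitting procedure is shown.

```python
# Fit a power law on in-distribution radii and forecast out-of-distribution error.
import numpy as np

rng = np.random.default_rng(0)
radii_in = np.array([4.0, 6.0, 8.0, 10.0, 12.0])        # training-range radii (assumed units)
err_in = 0.05 * radii_in ** (1 / 3) * (1 + 0.02 * rng.standard_normal(5))  # synthetic errors

# least-squares fit in log-log space: log err = log c + alpha * log r
alpha, log_c = np.polyfit(np.log(radii_in), np.log(err_in), deg=1)
print(f"fitted exponent alpha = {alpha:.3f}")            # should come out near 1/3 here

radii_ood = np.array([16.0, 20.0, 30.0])
pred_ood = np.exp(log_c) * radii_ood ** alpha            # forecast OOD error from ID fit
print(dict(zip(radii_ood.tolist(), pred_ood.round(4).tolist())))
```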
[LG-88] Mutual Information Collapse Explains Disentanglement Failure in β-VAEs
链接: https://arxiv.org/abs/2602.09277
作者: Minh Vu,Xiaoliang Wan,Shuangqing Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The \beta -VAE is a foundational framework for unsupervised disentanglement, using \beta to regulate the trade-off between latent factorization and reconstruction fidelity. Empirically, however, disentanglement performance exhibits a pervasive non-monotonic trend: benchmarks such as MIG and SAP typically peak at intermediate \beta and collapse as regularization increases. We demonstrate that this collapse is a fundamental information-theoretic failure, where strong Kullback-Leibler pressure promotes marginal independence at the expense of the latent channel’s semantic informativeness. By formalizing this mechanism in a linear-Gaussian setting, we prove that for \beta > 1 , stationarity-induced dynamics trigger a spectral contraction of the encoder gain, driving latent-factor mutual information to zero. To resolve this, we introduce the \lambda\beta -VAE, which decouples regularization pressure from informational collapse via an auxiliary L_2 reconstruction penalty \lambda . Extensive experiments on dSprites, Shapes3D, and MPI3D-real confirm that \lambda > 0 stabilizes disentanglement and restores latent informativeness over a significantly broader range of \beta , providing a principled theoretical justification for dual-parameter regularization in variational inference backbones.
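A minimal sketch of the kind of dual-parameter objective the abstract describes: a standard beta-VAE loss plus an auxiliary L2 reconstruction penalty weighted by lambda. The tensor shapes, the Bernoulli reconstruction likelihood, and the example values of beta and lambda are assumptions, not the paper's exact setup.

```python
# Sketch of a "recon + beta * KL + lambda * L2" objective for a VAE.
import torch
import torch.nn.functional as F

def lambda_beta_vae_loss(x, x_hat, mu, logvar, beta=8.0, lam=0.5):
    """Per-batch mean of: reconstruction NLL + beta * KL + lambda * ||x - x_hat||^2."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="none").flatten(1).sum(-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).flatten(1).sum(-1)
    l2 = (x - x_hat).pow(2).flatten(1).sum(-1)
    return (recon + beta * kl + lam * l2).mean()

# toy shapes: batch of 16 binary 64x64 images, 10-dimensional latent
x = torch.rand(16, 1, 64, 64)
x_hat = torch.sigmoid(torch.randn_like(x))
mu, logvar = torch.randn(16, 10), torch.randn(16, 10)
print(lambda_beta_vae_loss(x, x_hat, mu, logvar).item())
```

Setting lam=0 recovers the ordinary beta-VAE objective, which makes the role of the auxiliary term easy to ablate.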
[LG-89] Optimal Estimation in Orthogonally Invariant Generalized Linear Models: Spectral Initialization and Approximate Message Passing
链接: https://arxiv.org/abs/2602.09240
作者: Yihan Zhang,Hong Chang Ji,Ramji Venkataramanan,Marco Mondelli
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:We consider the problem of parameter estimation from a generalized linear model with a random design matrix that is orthogonally invariant in law. Such a model allows the design to have an arbitrary distribution of singular values and only assumes that its singular vectors are generic. It is a vast generalization of the i.i.d. Gaussian design typically considered in the theoretical literature, and is motivated by the fact that real data often have a complex correlation structure, so that methods relying on i.i.d. assumptions can be highly suboptimal. Building on the paradigm of spectrally-initialized iterative optimization, this paper proposes optimal spectral estimators and combines them with an approximate message passing (AMP) algorithm, establishing rigorous performance guarantees for these two algorithmic steps. Both the spectral initialization and the subsequent AMP meet existing conjectures on the fundamental limits to estimation – the former on the optimal sample complexity for efficient weak recovery, and the latter on the optimal errors. Numerical experiments suggest the effectiveness of our methods and the accuracy of our theory beyond orthogonally invariant data.
[LG-90] Minimum Distance Summaries for Robust Neural Posterior Estimation
链接: https://arxiv.org/abs/2602.09161
作者: Sherman Khoo,Dennis Prangle,Song Liu,Mark Beaumont
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Simulation-based inference (SBI) enables amortized Bayesian inference by first training a neural posterior estimator (NPE) on prior-simulator pairs, typically through low-dimensional summary statistics, which can then be cheaply reused for fast inference by querying it on new test observations. Because NPE is estimated under the training data distribution, it is susceptible to misspecification when observations deviate from the training distribution. Many robust SBI approaches address this by modifying NPE training or introducing error models, coupling robustness to the inference network and compromising amortization and modularity. We introduce minimum-distance summaries, a plug-in robust NPE method that adapts queried test-time summaries independently of the pretrained NPE. Leveraging the maximum mean discrepancy (MMD) as a distance between observed data and a summary-conditional predictive distribution, the adapted summary inherits strong robustness properties from the MMD. We demonstrate that the algorithm can be implemented efficiently with random Fourier feature approximations, yielding a lightweight, model-free test-time adaptation procedure. We provide theoretical guarantees for the robustness of our algorithm and empirically evaluate it on a range of synthetic and real-world tasks, demonstrating substantial robustness gains with minimal additional overhead.
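A schematic illustration of the underlying idea of choosing a test-time summary by minimizing a random-Fourier-feature approximation of the MMD between the observed data and a summary-conditional predictive. The toy Gaussian predictive, the 1-D grid search, the bandwidth, and the feature count are all assumptions; this is not the paper's actual estimator or simulator.

```python
# Sketch: pick the summary value whose predictive samples minimize an RFF-MMD
# to the observed data.
import numpy as np

rng = np.random.default_rng(0)

def rff(X, W, b):
    """Random Fourier features approximating an RBF kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def rff_mmd(X, Y, W, b):
    """Distance between mean feature embeddings (an MMD estimate)."""
    return np.linalg.norm(rff(X, W, b).mean(0) - rff(Y, W, b).mean(0))

d, n_feat, bandwidth = 1, 256, 1.0
W = rng.standard_normal((d, n_feat)) / bandwidth
b = rng.uniform(0, 2 * np.pi, n_feat)

x_obs = rng.normal(3.0, 1.0, size=(500, d))          # (possibly misspecified) observation
candidates = np.linspace(-5, 5, 101)                 # candidate 1-D summary values

def predictive(s, n=500):                            # toy summary-conditional predictive
    return rng.normal(s, 1.0, size=(n, d))

scores = [rff_mmd(x_obs, predictive(s), W, b) for s in candidates]
s_star = candidates[int(np.argmin(scores))]
print(f"minimum-MMD summary ≈ {s_star:.2f}")
```

Because only the summary is adapted, a pretrained NPE conditioned on that summary is left untouched, which is the modularity the abstract emphasizes.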
[LG-91] Predicting magnetism with first-principles AI
链接: https://arxiv.org/abs/2602.09093
作者: Max Geier,Liang Fu
类目: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 6+3 pages, 3+4 figures
Abstract:Computational discovery of magnetic materials remains challenging because magnetism arises from the competition between kinetic energy and Coulomb interaction that is often beyond the reach of standard electronic-structure methods. Here we tackle this challenge by directly solving the many-electron Schrödinger equation with neural-network variational Monte Carlo, which provides a highly expressive variational wavefunction for strongly correlated systems. Applying this technique to transition metal dichalcogenide moiré semiconductors, we predict itinerant ferromagnetism in WSe_2/WS_2 and an antiferromagnetic insulator in a twisted \Gamma -valley homobilayer, using the same neural network without any physics input beyond the microscopic Hamiltonian. Crucially, both types of magnetic states are obtained from a single calculation within the S_z=0 sector, removing the need to compute and compare multiple S_z sectors. This significantly reduces computational cost and paves the way for faster and more reliable magnetic material design.
[LG-92] Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition ICASSP2026
链接: https://arxiv.org/abs/2602.09043
作者: Aditya Srinivas Menon,Kumud Tripathi,Raj Gohil,Pankaj Wasnik
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: The paper has been accepted at ICASSP 2026, Barcelona, Spain
Abstract:Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling but lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness. Selectively replacing some attention layers reduces compute, memory, and latency, making it ideal for low-resource speech recognition.
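A rough sketch of the windowed-plus-global summary idea described above. The layer sizes, window length, and the way frame features are fused with the two summaries are assumptions for illustration, not the paper's exact block; both summaries are computed in linear time.

```python
# Sketch: each frame is mixed with a local windowed mean summary and a global mean summary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSummaryMixing(nn.Module):
    def __init__(self, dim, window=9):
        super().__init__()
        self.local = nn.Linear(dim, dim)     # transform before summarization
        self.fuse = nn.Linear(3 * dim, dim)  # frame + local summary + global summary
        self.window = window

    def forward(self, x):                    # x: (batch, time, dim)
        h = self.local(x)
        # global summary: mean over the whole utterance, broadcast to every frame
        g = h.mean(dim=1, keepdim=True).expand_as(h)
        # local summary: mean over a sliding window (linear time via avg_pool1d)
        l = F.avg_pool1d(h.transpose(1, 2), self.window, stride=1,
                         padding=self.window // 2,
                         count_include_pad=False).transpose(1, 2)
        return self.fuse(torch.cat([x, l, g], dim=-1))

x = torch.randn(2, 120, 256)                 # 2 utterances, 120 frames, 256-dim features
print(WindowedSummaryMixing(256)(x).shape)   # torch.Size([2, 120, 256])
```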
[LG-93] Predicting Gene Disease Associations in Type 2 Diabetes Using Machine Learning on Single-Cell RNA-Seq Data
链接: https://arxiv.org/abs/2602.09036
作者: Maria De La Luz Lomboy Toledo,Daniel Onah
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures. Preprint
Abstract:Diabetes is a chronic metabolic disorder characterized by elevated blood glucose levels due to impaired insulin production or function. Two main forms are recognized: type 1 diabetes (T1D), which involves autoimmune destruction of insulin-producing \beta-cells, and type 2 diabetes (T2D), which arises from insulin resistance and progressive \beta-cell dysfunction. Understanding the molecular mechanisms underlying these diseases is essential for the development of improved therapeutic strategies, particularly those targeting \beta-cell dysfunction. Mouse models have played a central role in diabetes research, allowing these mechanisms to be investigated in a controlled and biologically interpretable setting. Owing to their genetic and physiological similarity to humans, together with the ability to precisely manipulate their genome, mice enable detailed investigation of disease progression and gene function. In particular, mouse models have provided critical insights into \beta-cell development, cellular heterogeneity, and functional failure under diabetic conditions. Building on these experimental advances, this study applies machine learning methods to single-cell transcriptomic data from mouse pancreatic islets. Specifically, we evaluate two supervised approaches identified in the literature, the Extra Trees Classifier (ETC) and Partial Least Squares Discriminant Analysis (PLS-DA), to assess their ability to identify T2D-associated gene expression signatures at single-cell resolution. Model performance is evaluated using standard classification metrics, with an emphasis on interpretability and biological relevance.
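A toy end-to-end sketch of the two classifier families named in the abstract on a synthetic cell-by-gene matrix. The data, gene counts, injected "signature", and hyperparameters are placeholders, not the study's dataset or settings; PLS-DA is approximated here as PLS regression on the binary label with a 0.5 threshold.

```python
# Sketch: Extra Trees and a PLS-DA-style classifier on a synthetic expression matrix.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_cells, n_genes = 600, 200
X = rng.poisson(1.5, size=(n_cells, n_genes)).astype(float)   # pseudo count matrix
y = rng.integers(0, 2, n_cells)                                # 0 = control, 1 = T2D-like
X[y == 1, :10] += 2.0                                          # inject a fake signature
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

etc = ExtraTreesClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("ETC    F1:", round(f1_score(y_te, etc.predict(X_te)), 3))

pls = PLSRegression(n_components=5).fit(X_tr, y_tr.astype(float))
y_hat = (pls.predict(X_te).ravel() > 0.5).astype(int)
print("PLS-DA F1:", round(f1_score(y_te, y_hat), 3))

# feature importances point at candidate disease-associated genes
top = np.argsort(etc.feature_importances_)[::-1][:5]
print("top ETC genes (indices):", top.tolist())
```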