本篇博文主要内容为 2026-02-26 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-02-26)

今日共更新517篇论文,其中:

  • 自然语言处理77篇(Computation and Language (cs.CL))
  • 人工智能113篇(Artificial Intelligence (cs.AI))
  • 计算机视觉123篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习151篇(Machine Learning (cs.LG))
  • 多智能体系统10篇(Multiagent Systems (cs.MA))
  • 信息检索15篇(Information Retrieval (cs.IR))
  • 人机交互23篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] Using Feasible Action-Space Reduction by Groups to fill Causal Responsibility Gaps in Spatial Interactions

【速读】:该论文旨在解决空间交互中因果责任评估的局限性问题,尤其是在多个参与者同时导致某一结果(即因果过决定现象,causal overdeterminism)时,现有以个体为中心的责任度量方法失效的问题。其解决方案的关键在于提出一种面向群体的因果责任度量机制,并通过形式化“assertive influences”(主动影响)类型,设计了一种层级算法来系统识别对目标代理轨迹具有因果责任的 assertive agents(主动作用者),从而在复杂交互场景中更准确地刻画群体层面的责任分配。

链接: https://arxiv.org/abs/2602.22041
作者: Vassil Guenov,Ashwin George,Arkady Zgonnikov,David A. Abbink,Luciano Cavalcante Siebert
机构: 未知
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Heralding the advent of autonomous vehicles and mobile robots that interact with humans, responsibility in spatial interaction is burgeoning as a research topic. Even though metrics of responsibility tailored to spatial interactions have been proposed, they are mostly focused on the responsibility of individual agents. Metrics of causal responsibility focusing on individuals fail in cases of causal overdeterminism – when many actors simultaneously cause an outcome. To fill the gaps in causal responsibility left by individual-focused metrics, we formulate a metric for the causal responsibility of groups. To identify assertive agents that are causally responsible for the trajectory of an affected agent, we further formalise the types of assertive influences and propose a tiering algorithm for systematically identifying assertive agents. Finally, we use scenario-based simulations to illustrate the benefits of considering groups and how the emergence of group effects vary with interaction dynamics and the proximity of agents.

[MA-1] Hierarchical Lead Critic based Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决协作式多智能体强化学习(Cooperative Multi-Agent Reinforcement Learning, MARL)中因仅采用局部(独立学习)或全局(集中学习)视角而导致的性能瓶颈问题。其解决方案的关键在于提出一种分层训练机制与新型架构——分层领头评论家(Hierarchical Lead Critic, HLC),该方法通过在不同层级上融合局部与全局视角,模拟团队结构中高阶目标与低阶执行的自然协同关系,从而实现更高的样本效率和鲁棒性策略。实验表明,HLC在无需通信的协作任务中优于单一层次基准,并能随智能体数量和任务难度增加而稳定扩展。

链接: https://arxiv.org/abs/2602.21680
作者: David Eckel,Henri Meeß
机构: 未知
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 16 pages, 10 Figures, Preprint

点击查看摘要

Abstract:Cooperative Multi-Agent Reinforcement Learning (MARL) solves complex tasks that require coordination from multiple agents, but is often limited to either local (independent learning) or global (centralized learning) perspectives. In this paper, we introduce a novel sequential training scheme and MARL architecture, which learns from multiple perspectives on different hierarchy levels. We propose the Hierarchical Lead Critic (HLC) - inspired by natural emerging distributions in team structures, where following high-level objectives combines with low-level execution. HLC demonstrates that introducing multiple hierarchies, leveraging local and global perspectives, can lead to improved performance with high sample efficiency and robust policies. Experimental results conducted on cooperative, non-communicative, and partially observable MARL benchmarks demonstrate that HLC outperforms single hierarchy baselines and scales robustly with increasing amounts of agents and difficulty.

[MA-2] Hierarchical LLM -Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning ICRA2026

【速读】:该论文旨在解决多机器人任务规划中自然语言指令难以转化为可执行动作的问题,尤其是在处理模糊或长时程任务时,传统PDDL规划器缺乏灵活性,而大型语言模型(LLM)虽能理解指令却易产生幻觉或不可行动作。解决方案的关键在于提出一种分层的多智能体LLM规划框架:上层负责任务分解与分配,下层生成PDDL问题并由经典规划器求解;当计划失败时,采用受TextGrad启发的文本梯度更新机制优化各代理的提示词(prompt),并通过跨代理共享元提示(meta-prompt)实现高效提示优化。此方法在MAT-THOR基准测试中显著提升成功率,尤其在复杂和模糊任务上表现优异。

链接: https://arxiv.org/abs/2602.21670
作者: Tomoya Kawabe(1),Rin Takano(1) ((1) NEC Corporation)
机构: NEC Corporation (日本电气公司)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted to ICRA 2026. 8 pages, 2 figures

点击查看摘要

Abstract:Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent’s prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.

[MA-3] Agent LTV: An Agent -Based Unified Search-and-Evolution Framework for Automated Lifetime Value Prediction KDD2026 KDD

【速读】:该论文旨在解决Lifetime Value (LTV)预测在广告、推荐系统和电子商务场景中因数据模式差异导致的建模复杂性问题,即实践中需为不同决策场景构建高度定制化的特征处理、目标设计与超参数调优流程,造成开发成本高且难以迁移。其解决方案的关键在于提出AgentLTV——一种基于代理(agent)的统一搜索与演化框架,将每个候选模型视为可执行的流水线程序;通过LLM驱动的代理自动生成、运行并修复流水线,结合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)阶段进行广域探索与帕累托感知多指标评估,以及进化算法(Evolutionary Algorithm, EA)阶段通过岛屿式演化机制(包含交叉、变异与迁移)对最优MCTS方案进行精细化优化,从而实现自动化、高效且跨场景稳定的LTV建模。

链接: https://arxiv.org/abs/2602.21634
作者: Chaowei Wu,Huazhu Chen,Congde Yuan,Qirui Yang,Guoqing Song,Yue Gao,Li Luo,Frank Youhua Chen,Mengzhuo Guo
机构: Sichuan University (四川大学); Harbin Institute of Technology (哈尔滨工业大学); Sun Yat-sen University (中山大学); City University of Hong Kong (香港城市大学); VIVO; Xiangtan University (湘潭大学)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 12 pages, 4 figures, submitted to KDD 2026: 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ADS Track

点击查看摘要

Abstract:Lifetime Value (LTV) prediction is critical in advertising, recommender systems, and e-commerce. In practice, LTV data patterns vary across decision scenarios. As a result, practitioners often build complex, scenario-specific pipelines and iterate over feature processing, objective design, and tuning. This process is expensive and hard to transfer. We propose AgentLTV, an agent-based unified search-and-evolution framework for automated LTV modeling. AgentLTV treats each candidate solution as an executable pipeline program. LLM-driven agents generate code, run and repair pipelines, and analyze execution feedback. Two decision agents coordinate a two-stage search. The Monte Carlo Tree Search (MCTS) stage explores a broad space of modeling choices under a fixed budget, guided by the Polynomial Upper Confidence bounds for Trees criterion and a Pareto-aware multi-metric value function. The Evolutionary Algorithm (EA) stage refines the best MCTS program via island-based evolution with crossover, mutation, and migration. Experiments on a large-scale proprietary dataset and a public benchmark show that AgentLTV consistently discovers strong models across ranking and error metrics. Online bucket-level analysis further indicates improved ranking consistency and value calibration, especially for high-value and negative-LTV segments. We summarize practitioner-oriented takeaways: use MCTS for rapid adaptation to new data patterns, use EA for stable refinement, and validate deployment readiness with bucket-level ranking and calibration diagnostics. The proposed AgentLTV has been successfully deployed online.

[MA-4] raining Generalizable Collaborative Agents via Strategic Risk Aversion

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中协作策略泛化能力差的问题,即现有方法训练出的策略在面对新合作对象时容易失效。作者指出,这一问题源于训练过程中存在的“搭便车”(free-riding)行为以及缺乏战略鲁棒性。解决方案的关键在于引入战略风险规避(strategic risk aversion)作为一项有原则的归纳偏置(inductive bias),使智能体在协作博弈中不仅能对伙伴行为的偏离保持鲁棒性,还能获得优于传统博弈论均衡概念(如纳什均衡)的稳定结果,并显著减少或消除搭便车现象。基于此,作者提出了一种将战略风险规避整合进标准策略优化框架的MARL算法,在多个协作基准任务(包括大语言模型协作任务)中验证了其有效性,实现了与异构且未见过的合作方的可靠协作。

链接: https://arxiv.org/abs/2602.21515
作者: Chengrui Qu,Yizhou Zhang,Nicholas Lanzetti,Eric Mazumdar
机构: California Institute of Technology (加州理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner’s behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.

[MA-5] Pancake: Hierarchical Memory System for Multi-Agent LLM Serving

【速读】:该论文针对大语言模型(Large Language Model, LLM)服务中代理记忆管理的核心挑战展开研究,即在大规模存储、频繁更新以及多个共存代理的协同场景下,导致近似最近邻(Approximate Nearest Neighbor, ANN)搜索复杂度高且开销巨大。解决方案的关键在于提出Pancake系统,其核心创新包括:(i) 单代理多级索引缓存机制,(ii) 多代理间协调的索引管理策略,以及 (iii) GPU-CPU协同加速架构,从而显著提升内存密集型代理任务的端到端吞吐量,实验证明其性能优于现有框架超过4.29倍。

链接: https://arxiv.org/abs/2602.21477
作者: Zhengding Hu,Zaifeng Pan,Prabhleen Kaur,Vibha Murthy,Zhongkai Yu,Yue Guan,Zhen Wang,Steven Swanson,Yufei Ding
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large-scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high-cost approximate nearest neighbor (ANN) searching problems. We present Pancake, a multi-tier agentic memory system that unifies three key techniques: (i) multi-level index caching for single agents, (ii) coordinated index management across multiple agents, and (iii) collaborative GPU-CPU acceleration. Pancake exposes easy-to-use interface that can be integrated into memory-based agents like Mem-GPT, and is compatible with agentic frameworks such as LangChain and LlamaIndex. Experiments on realistic agent workloads show that Pancake substantially outperforms existing frameworks, achieving more than 4.29x end-to-end throughput improvement.

[MA-6] From Cooperation to Hierarchy: A Study of Dynamics of Hierarchy Emergence in a Multi-Agent System

【速读】:该论文旨在解决“简单个体间互动规则如何促成层级组织的涌现与持续存在”这一核心问题,聚焦于初始异质性(initial heterogeneity)与突变幅度(mutation amplitude)在动态多智能体系统中对层级结构形成的影响机制。其解决方案的关键在于构建一个基于代理的模型(agent-based model, ABM),并采用营养级不一致性(trophic incoherence, TI)指标量化交互网络中的方向性不对称性,从而揭示:尽管微小个体差异可通过繁殖、竞争与合作等局部互动被放大,但层级秩序的稳定形成更依赖于足够高的突变幅度,而初始异质性主要影响早期层级结构的建立而非长期维持。

链接: https://arxiv.org/abs/2602.21404
作者: Shanshan Mao,Peter Tino
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:A central premise in evolutionary biology is that individual variation can generate information asymmetries that facilitate the emergence of hierarchical organisation. To examine this process, we develop an agent-based model (ABM) to identify the minimal conditions under which hierarchy arises in dynamic multi-agent systems, focusing on the roles of initial heterogeneity and mutation amplitude across generations. Hierarchical organisation is quantified using the Trophic Incoherence (TI) metric, which captures directional asymmetries in interaction networks. Our results show that even small individual differences can be amplified through repeated local interactions involving reproduction, competition, and cooperation, but that hierarchical order is markedly more sensitive to mutation amplitude than to initial heterogeneity. Across repeated trials, stable hierarchies reliably emerge only when mutation amplitude is sufficiently high, while initial heterogeneity primarily affects early formation rather than long-term persistence. Overall, these findings demonstrate how simple interaction rules can give rise to both the emergence and persistence of hierarchical organisation, providing a quantitative account of how structured inequality can develop from initially homogeneous populations.

[MA-7] A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

【速读】:该论文旨在解决地球科学数据(Earth science data)因海量积累而带来的可扩展性挑战,特别是现有数据存储库(如PANGAEA)中大量数据未被充分利用的问题,从而限制了数据的再利用效率。其解决方案的关键在于提出了一种分层多智能体框架——PANGAEA-GPT,该框架采用集中式监督者-工作者(Supervisor-Worker)拓扑结构,结合基于数据类型的路由机制、沙箱内确定性代码执行以及通过执行反馈实现的自我修正能力,使智能体能够自主诊断并修复运行时错误,进而以最少的人工干预完成复杂、多步骤的数据发现与分析任务。

链接: https://arxiv.org/abs/2602.21351
作者: Dmitrii Pantiukhin,Ivan Kuznetsov,Boris Shapkin,Antonia Anna Jost,Thomas Jung,Nikolay Koldunov
机构: Alfred Wegener Institute for Polar and Marine Research (阿尔弗雷德·韦格纳极地与海洋研究所)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: 20 pages, 6 figures, 7 tables, supplementary material included

点击查看摘要

Abstract:The rapid accumulation of Earth science data has created a significant scalability challenge; while repositories like PANGAEA host vast collections of datasets, citation metrics indicate that a substantial portion remains underutilized, limiting data reusability. Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis. Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedback, enabling agents to diagnose and resolve runtime errors. Through use-case scenarios spanning physical oceanography and ecology, we demonstrate the system’s capacity to execute complex, multi-step workflows with minimal human intervention. This framework provides a methodology for querying and analyzing heterogeneous repository data through coordinated agent workflows.

[MA-8] Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models

【速读】:该论文试图解决的问题是:在高风险人类决策场景中,大型语言模型(Large Language Models, LLMs)作为顾问时所引入的风险,特别是其在面对善意与恶意信息时的辨别能力(vigilance)与说服能力(persuasion)之间的关系。现有研究多孤立考察这两种社会性能力,缺乏对其内在关联的系统分析。解决方案的关键在于设计了一个多轮谜题游戏Sokoban作为实验范式,通过让LLM代理相互提供建议并执行动作,发现任务表现、说服能力和理性警惕性在LLM中是可以分离的——即高任务表现并不意味着模型能识别误导信息,即便这种欺骗可能性已被明确提示;同时,模型虽可能被误导而失败,却仍能通过调整生成token数量来区分善意与恶意建议,表明其具备一定的元认知调节机制。这一发现提示,在未来AI安全工作中需独立监测这三项能力。

链接: https://arxiv.org/abs/2602.21262
作者: Sasha Robinson,Kerem Oktar,Katherine M. Collins,Ilia Sucholutsky,Kelsey R. Allen
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs’ abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. % as part of the prompt. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.

[MA-9] Agent icTyper: Automated Typing of Legacy Software Projects Using Agent ic AI ICSE2026

【速读】:该论文旨在解决遗留 JavaScript 系统因缺乏类型安全而导致维护风险的问题,同时克服现有自动化类型标注研究在类型检查配置、类型定义生成、错误识别及代码行为正确性验证等方面的不足。其解决方案的关键在于提出 AgenticTyper——一个基于大语言模型(Large Language Model, LLM)的代理式系统,通过迭代式的错误修正与编译转换对比来保障行为一致性,从而在仓库级规模上实现高效且正确的类型迁移。

链接: https://arxiv.org/abs/2602.21251
作者: Clemens Pohle
机构: Darmstadt University of Applied Sciences (达姆施塔特应用技术大学); MaibornWolff GmbH (MaibornWolff 公司)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
备注: Accepted at ICSE 2026 Student Research Competition (SRC)

点击查看摘要

Abstract:Legacy JavaScript systems lack type safety, making maintenance risky. While TypeScript can help, manually adding types is expensive. Previous automated typing research focuses on type inference but rarely addresses type checking setup, definition generation, bug identification, or behavioral correctness at repository scale. We present AgenticTyper, a Large Language Model (LLM)-based agentic system that addresses these gaps through iterative error correction and behavior preservation via transpilation comparison. Evaluation on two proprietary repositories (81K LOC) shows that AgenticTyper resolves all 633 initial type errors in 20 minutes, reducing manual effort from one working day.

自然语言处理

[NLP-0] Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

【速读】: 该论文旨在解决多语言大语言模型(Large Language Model, LLM)评估中因翻译基准数据集质量不一致而导致的可靠性问题,尤其是现有资源常出现语义漂移(semantic drift)和上下文丢失(context loss),从而误导性能指标。其解决方案的关键在于提出一个全自动框架,通过引入测试时计算扩展策略——通用自提升(Universal Self-Improvement, USI)与作者提出的多轮排序方法T-RANK,显著提升翻译质量,同时确保基准数据在本地化过程中保持原始任务结构和语言细微差别。该框架实现了可扩展、高质量的多语言基准翻译,并在8种东欧和南欧语言中验证了其有效性。

链接: https://arxiv.org/abs/2602.22207
作者: Hanna Yukhymenko,Anton Alexandrov,Martin Vechev
机构: INSAIT(INSAIT); Sofia University “St. Kliment Ohridski”(索菲亚大学“圣克莱门特·奥赫里德斯基”); ETH Zurich(苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.

[NLP-1] SumTablets: A Transliteration Dataset of Sumerian Tablets

【速读】: 该论文旨在解决古苏美尔语(Sumerian)转写(transliteration)任务中缺乏结构化数据集的问题,这一问题阻碍了现代自然语言处理(Natural Language Processing, NLP)方法在该领域的应用。现有数字亚述学项目(如Oracc)虽已发布大量苏美尔语转写文本,但缺少与之对应的楔形文字字符(cuneiform glyphs)的数字化表示,导致无法直接训练或评估基于深度学习的转写模型。解决方案的关键在于构建一个名为SumTablets的数据集,它将91,606块苏美尔语泥板的Unicode字符表示(共6,970,407个字符)与Oracc发布的标准转写进行配对,并通过特殊标记保留原始结构信息(如页面边界、换行符和断裂段落)。此外,作者还基于该数据集实现了两个基准模型:基于加权采样的简单策略和微调的自回归语言模型,其中后者在字符级F分数(chrF)上达到97.55,验证了基于Transformer的模型在提升专家校验效率方面的潜力。

链接: https://arxiv.org/abs/2602.22200
作者: Cole Simmons,Richard Diehl Martinez,Dan Jurafsky
机构: Stanford University (斯坦福大学); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注: 11 pages with 3 figures

点击查看摘要

Abstract:Sumerian transliteration is a conventional system for representing a scholar’s interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet’s cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph’s possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one. Comments: 11 pages with 3 figures Subjects: Computation and Language (cs.CL) Cite as: arXiv:2602.22200 [cs.CL] (or arXiv:2602.22200v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.22200 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 192-202, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics Related DOI: https://doi.org/10.18653/v1/2024.ml4al-1.20 Focus to learn more DOI(s) linking to related resources

[NLP-2] Improving Parametric Knowledge Access in Reasoning Language Models

【速读】: 该论文旨在解决语言模型在访问其参数中存储的世界知识时推理能力不足的问题,即模型在默认情况下无法有效生成最优的知识回忆结果。解决方案的关键在于通过强化学习训练模型,利用世界知识问答任务(如TriviaQA)作为可验证的奖励信号,使模型学会对自身参数化知识进行更有效的推理。实验表明,这种方法显著提升了模型在TriviaQA等多数据集上的表现,证明了推理模型在参数知识获取方面存在优化空间,且可通过简单策略(如“逐步思考”提示)和针对性训练得到显著改善。

链接: https://arxiv.org/abs/2602.22193
作者: Melody Ma,John Hewitt
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study reasoning for accessing world knowledge stored in a language model’s parameters. For example, recalling that Canberra is Australia’s capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple “think step-by-step” cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.

[NLP-3] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

【速读】: 该论文旨在解决开源图形用户界面(GUI)智能体在长程导航任务中性能落后于闭源系统的问题,其核心挑战在于高质量动作对齐的推理数据稀缺,以及通用后训练流程未能适配GUI智能体的独特需求。解决方案的关键在于提出GUI-Libra训练方案:首先构建并过滤出81K条高质量GUI推理数据集以缓解数据稀缺问题;其次设计动作感知的监督微调(SFT),通过混合“推理-动作”与直接动作数据并重加权token来平衡推理与接地(grounding);最后引入KL正则化约束和成功自适应缩放机制,提升部分可验证环境下的强化学习(RLVR)稳定性与离线到在线预测能力。实验表明,该方法显著提升了多平台网页和移动端任务的步骤准确率与端到端完成率,且无需昂贵的在线数据收集。

链接: https://arxiv.org/abs/2602.22190
作者: Rui Yang,Qianhui Wu,Zhaoyang Wang,Hanyang Chen,Ke Yang,Hao Cheng,Huaxiu Yao,Baoling Peng,Huan Zhang,Jianfeng Gao,Tong Zhang
机构: UIUC; Microsoft; UNC-Chapel Hill
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 57 pages, 17 figures

点击查看摘要

Abstract:Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-tyle training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

[NLP-4] DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在处理长上下文时因注意力机制逐渐偏离关键信息而导致性能下降的问题。其核心挑战在于,随着输入长度增加,模型难以在解码过程中持续保持对任务相关上下文的注意力聚焦。解决方案的关键是提出一种名为DySCO的新型解码算法,该方法利用检索头(retrieval heads)——一类专门用于长上下文检索的注意力头——在每一步解码时识别出与任务相关的token,并显式地对其进行加权增强。通过动态调整注意力权重,DySCO能够在生成过程中更有效地利用相关信息,且无需额外训练,可直接应用于任意现成的语言模型。实验证明,该方法在多个长上下文推理基准测试中显著提升性能,最大相对增益达25%。

链接: https://arxiv.org/abs/2602.22175
作者: Xi Ye,Wuwei Zhang,Fangcong Yin,Howard Yen,Danqi Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads–a subset of attention heads specialized for long-context retrieval–to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LMs. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the effectiveness of the method, while providing interpretability insights into decoding-time attention behavior. Our code is available at this https URL.

[NLP-5] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的对象幻觉(object hallucination)问题,即模型在生成文本描述时引入输入图像中并不存在的物体。通过系统性实验分析发现,对象幻觉主要源于语言解码器(language decoder)强先验知识的影响,而非视觉编码器(vision encoder)。解决方案的关键在于提出一种无需训练的框架 NoLan(No-Language-Hallucination Decoding),其核心机制是基于多模态输入与纯文本输入输出分布差异,动态抑制语言先验,从而有效降低幻觉现象。实验证明,NoLan 在多个 LVLM 和任务上均显著减少对象幻觉,例如在 POPE 数据集上使 LLaVA-1.5 7B 和 Qwen-VL 7B 的准确率分别提升 6.45 和 7.21。

链接: https://arxiv.org/abs/2602.22144
作者: Lingfeng Ren,Weihao Yu,Runpeng Yu,Xinchao Wang
机构: National University of Singapore (新加坡国立大学); Peking University Shenzhen Graduate School (北京大学深圳研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code: this https URL

点击查看摘要

Abstract:Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: this https URL.

[NLP-6] IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)指令遵循能力评估严重依赖英语基准数据集的问题,尤其针对全球数亿使用印地语系(Indic)语言的用户群体缺乏有效评测工具的现状。其解决方案的关键在于构建了一个名为IndicIFEval的多语言指令遵循基准,涵盖14种印地语系语言,包含约800个经人工验证的示例,分为两个互补子集:一是基于IFEval(Zhou et al., 2023)翻译并本地化后的提示(IndicIFEval-Ground),二是基于本土内容合成的规则驱动指令(IndicIFEval-Synthetic)。该基准采用可自动验证的规则化指令设计,使得对模型在格式约束下的生成能力进行客观评估成为可能,并揭示了现有模型在词汇层面和跨语言任务上的显著短板,从而为多语言受限生成研究提供标准化评估框架与开源资源支持。

链接: https://arxiv.org/abs/2602.22125
作者: Thanmay Jayakumar,Mohammed Safi Ur Rahman Khan,Raj Dabre,Ratish Puduppully,Anoop Kunchukuttan
机构: Nilekani Centre at AI4Bharat; Indian Institute of Technology Madras (印度理工学院马德拉斯分校); IT University of Copenhagen (哥本哈根信息技术大学); Microsoft (微软)
类目: Computation and Language (cs.CL)
备注: 8 pages + Appendix

点击查看摘要

Abstract:Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks – and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (this http URL).

[NLP-7] SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents

【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在长周期软件工程任务(如SWE-bench)中表现不佳的问题,尤其是行动循环(action looping)频发和修复成功率低等瓶颈。其解决方案的关键在于提出SWE-Protégé框架,将软件修复重构为专家-学徒协作问题:SLM作为唯一决策者,通过监督微调学习如何选择性地向强专家模型寻求指导、识别停滞状态并有效执行专家反馈;同时结合代理强化学习(agentic reinforcement learning),显式抑制退化循环和低效协作行为。该方法仅需稀疏的专家干预(平均每任务约4次调用,占总token的11%),便使Qwen2.5-Coder-7B-Instruct在SWE-bench Verified上的Pass@1指标提升至42.4%,显著超越此前SLM的最佳水平。

链接: https://arxiv.org/abs/2602.22124
作者: Patrick Tser Jern Kon,Archana Pradeep,Ang Chen,Alexander P. Ellis,Warren Hunt,Zijian Wang,John Yang,Samuel Thompson
机构: Meta(元)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).

[NLP-8] Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference EACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务中因模型规模增大而导致计算成本显著上升的问题,同时保持高准确性。解决方案的关键在于提出一种基于置信度的动态模型选择策略:通过评估模型对当前任务的掌握程度(即是否知道正确答案的概率)和响应准确性的置信度,将高置信度的简单任务保留在较小模型上处理,而将低置信度或复杂任务分配给更大模型执行,从而在保障性能的同时实现计算资源的有效节约。实验表明,该方法在MMLU基准上可达到与最大模型相当的精度,同时降低20%至40%的计算开销,并在GPT-4o API调用中减少约60%的token使用量。

链接: https://arxiv.org/abs/2602.22090
作者: Bo-Wei Chen,Chung-Chi Chen,An-Zi Yen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); AIST (国立研究开发法人信息・技术综合研究所)
类目: Computation and Language (cs.CL)
备注: Accepted by EACL 2026 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model’s confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model’s likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.

[NLP-9] Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)是否具备真正理论心理(Theory of Mind, ToM)能力的问题,特别是其在面对任务扰动时的鲁棒性表现。研究通过引入一个手工构建且标注丰富的ToM数据集,涵盖经典与扰动后的错误信念任务,以及对应的正确推理链空间、后续推理忠实度和任务解决方案,提出评估推理链正确性和最终答案对CoT推理轨迹忠实度的新指标。关键解决方案在于采用链式思维提示(Chain-of-Thought prompting, CoT),发现其虽总体上能以忠实方式提升LLMs的ToM性能,但在某些扰动类别下反而降低准确性,表明CoT并非普适增强手段,需根据扰动类型选择性应用。

链接: https://arxiv.org/abs/2602.22072
作者: Christian Nickel,Laura Schrewe,Florian Mai,Lucie Flek
机构: Bonn-Aachen International Center for Information Technology (b-it), University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence; Research Center Trustworthy Data Science and Security (RC-Trust), University of Duisburg-Essen
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Theory of Mind (ToM) refers to an agent’s ability to model the internal states of others. Contributing to the debate whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought prompting (CoT) to enhance performance and explain the LLM’s decision. We introduce a handcrafted, richly annotated ToM dataset, including classic and perturbed false belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, task solutions, and propose metrics to evaluate reasoning chain correctness and to what extent final answers are faithful to reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion of any robust form of ToM being present. While CoT prompting improves the ToM performance overall in a faithful manner, it surprisingly degrades accuracy for some perturbation classes, indicating that selective application is necessary.

[NLP-10] DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

【速读】: 该论文旨在解决分布式账本技术(Distributed Ledger Technology, DLT)领域缺乏大规模、多源异构文本语料库的问题,从而推动自然语言处理(Natural Language Processing, NLP)在DLT研究中的深度应用。现有NLP资源主要聚焦于加密货币价格预测和智能合约分析,忽视了该领域中广泛存在的专业术语与技术演进模式。解决方案的关键在于构建并公开发布DLT-Corpus——目前最大规模的领域专用语料库,包含来自科学文献(37,440篇)、美国专利商标局(USPTO)专利(49,023项)和社会媒体(2200万条帖子)的2.98亿词元文本数据,并开发了针对DLT场景优化的预训练模型LedgerBERT,在命名实体识别(Named Entity Recognition, NER)任务上相较BERT-base提升23%性能,为后续技术演化分析与市场创新关联研究提供高质量数据基础与模型工具。

链接: https://arxiv.org/abs/2602.22045
作者: Walter Hernandez Cruz,Peter Devine,Nikhil Vadgama,Paolo Tasca,Jiahua Xu
机构: Centre for Blockchain Technologies, University College London (伦敦大学学院区块链技术中心); School of Informatics, University of Edinburgh (爱丁堡大学信息学院); Exponential Science
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrencies price prediction and smart contracts, leaving domain-specific language under explored despite the sector’s ~ 3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus’ utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grow independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2602.22045 [cs.CL] (or arXiv:2602.22045v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.22045 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-11] A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT

【速读】: 该论文旨在解决大规模预训练数据集在模型性能提升中的效率问题,即如何在减少预训练数据规模的同时保持甚至提升模型性能。其核心问题是当前主流Transformer模型(如ModernBERT)依赖于超大规模但缺乏多样性的预训练语料,导致资源浪费且未必最优。解决方案的关键在于采用多样性驱动的采样算法(diversity-driven sampling),通过优化数据选择策略,在较小的数据规模下实现更高效的预训练。实验表明,该方法可在部分任务中相较随机采样提升10个点的性能,并且用150M tokens的多样性数据预训练483小时可达到使用2.4B tokens随机数据预训练1,775小时的效果,显著降低了计算成本并提升了训练效率。

链接: https://arxiv.org/abs/2602.22014
作者: Louis Estève,Christophe Servan,Thomas Lavergne,Agata Savary
机构: Université Paris-Saclay (巴黎萨克雷大学); CNRS (法国国家科学研究中心); LISN (信息与网络科学实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This summons for an investigation of the impact of diversity on the ModernBERT pre-training. We do so in this study, with the express intent of reducing pre-training dataset size, while retaining at least comparable performance. We compare diversity-driven sampling algorithms, so as to pick the best one. We find that diversity-driven sampling allows in some tasks to gain 10 points relative to randomly-sampled pre-training data of commensurate size. We also see that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can yield a commensurate performance to a model pre-trained for 1,775h on a randomly-driven dataset of 2.4B tokens.

[NLP-12] CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models

【速读】: 该论文旨在解决当前语言模型在理解语法形式与语义关系之间的整合能力不足的问题,尤其是对构式(Construction)层面的语义解释能力缺乏系统评估。现有评测多聚焦于句法合法性判断,忽视了模型对构式所蕴含语义关系的理解。其解决方案的关键在于提出一个基于构式语法(Construction Grammar)的最小对测试基准——CxMP(Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models),通过控制变量的最小对设计,在九类构式类型(如“let-alone”、“caused motion”和“ditransitive”等)中系统考察模型是否能准确识别由语法形式所暗示的语义关系。结果表明,尽管句法能力较早显现,但构式理解发展缓慢且在大语言模型(LLMs)中仍存在显著局限,揭示了语言模型在形式-意义整合上的持续性差距。

链接: https://arxiv.org/abs/2602.21978
作者: Miyu Oba,Saku Sugawara
机构: Nara Institute of Science and Technology (奈良先端科学技术大学院大学); National Institute of Informatics (日本信息研究所); The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work has examined language models from a linguistic perspective to better understand how they acquire language. Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention. We introduce the Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models (CxMP), a benchmark grounded in Construction Grammar that treats form-meaning pairings, or constructions, as fundamental linguistic units. CxMP evaluates whether models can interpret the semantic relations implied by constructions, using a controlled minimal-pair design across nine construction types, including the let-alone, caused motion, and ditransitive constructions. Our results show that while syntactic competence emerges early, constructional understanding develops more gradually and remains limited even in large language models (LLMs). CxMP thus reveals persistent gaps in how language models integrate form and meaning, providing a framework for studying constructional understanding and learning trajectories in language models.

[NLP-13] RADAR: Reasoning as Discrimination with Aligned Representations for LLM -based Knowledge Graph Reasoning

【速读】: 该论文旨在解决当前知识图谱推理(Knowledge Graph Reasoning, KGR)中基于生成式大语言模型(Large Language Models, LLMs)的方法易受表面共现记忆干扰、难以学习真实关系语义,从而限制了分布外泛化能力的问题。解决方案的关键在于提出RADAR框架,将KGR从生成式模式匹配重构为判别式关系推理,通过强化学习机制强制实体间的相对可分性(relative entity separability),而非仅模仿token概率;在此基础上,推理过程直接在表示空间中进行,确保与判别优化一致并避免生成导致的幻觉现象,从而显著提升链接预测和三元组分类任务的性能,并增强中间表征的任务相关互信息。

链接: https://arxiv.org/abs/2602.21951
作者: Bo Xue,Yuan Jin,Luoyi Fu,Jiaxin Ding,Xinbing Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution generalization. To address this, we propose RADAR, which reformulates KGR from generative pattern matching to discriminative relational reasoning. We recast KGR as discriminative entity selection, where reinforcement learning enforces relative entity separability beyond token-likelihood imitation. Leveraging this separability, inference operates directly in representation space, ensuring consistency with the discriminative optimization and bypassing generation-induced hallucinations. Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.

[NLP-14] MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗应用中面临的评估不足问题,尤其是现有基准未能充分反映真实临床场景的复杂性。其关键解决方案是提出MEDSYN——一个包含多达7种不同类型视觉临床证据(Clinical Evidence, CE)的多语言、多模态基准,用于系统评估MLLMs在鉴别诊断(Differential Diagnosis, DDx)生成与最终诊断(Final Diagnosis, FDx)选择中的表现。研究发现,尽管顶级模型在DDx生成上可媲美甚至超越人类专家,但其在整合异质性临床证据方面存在显著性能缺口,主要归因于对低判别力文本证据(如病史)的过度依赖及跨模态证据利用不均的问题;为此,作者引入“证据敏感性”(Evidence Sensitivity)指标量化后者,并证明该指标与诊断准确性正相关,从而为优化模型提供可操作的干预方向。

链接: https://arxiv.org/abs/2602.21950
作者: Boqi Chen,Xudong Liu,Jiachuan Peng,Marianne Frey-Marti,Bang Zheng,Kyle Lam,Lin Li,Jianing Qiu
机构: ETH Zurich (苏黎世联邦理工学院); Amazon; MBZUAI; University of Oxford (牛津大学); University of Bern (伯尔尼大学); Peking University (北京大学); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While top models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx–FDx performance gap compared to expert clinicians, indicating a failure mode in synthesis of heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE ( \ite.g. , medical history) and (ii) a cross-modal CE utilization gap. We introduce Evidence Sensitivity to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how it can be used to guide interventions to improve model performance. We will open-source our benchmark and code.

[NLP-15] Large Language Models are Algorithmically Blind

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在算法推理能力上的不足问题,即尽管LLMs具备广泛的陈述性知识(declarative knowledge),但其对计算过程的推理能力仍不清晰且表现不佳。为解决这一问题,作者采用因果发现(causal discovery)作为测试基准,并通过大规模算法执行结果构建真实标签(ground truth),对八种前沿LLM进行评估。关键在于利用可验证的算法执行数据来量化模型预测的准确性与置信区间校准程度,发现模型普遍存在“算法盲视”(algorithmic blindness)现象:预测范围远超真实置信区间,且多数情况下无法覆盖真实算法均值,性能甚至低于随机猜测,表明当前LLMs缺乏将陈述性知识转化为可靠程序性预测的能力。

链接: https://arxiv.org/abs/2602.21947
作者: Sohan Venkatesh,Ashish Mahendran Kurapath,Tejas Melkote
机构: Manipal Institute of Technology (曼帕尔技术学院)
类目: Computation and Language (cs.CL)
备注: 20 pages, 11 figures, 14 tables

点击查看摘要

Abstract:Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.

[NLP-16] MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents

【速读】: 该论文旨在解决当前对多模态角色扮演代理(Multimodal Role-Playing Agents, MRPAs)的评估方法存在语义与模态生成耦合、依赖人工评分且一致性低的问题。现有研究主要使用纯文本基准评估文本响应,而将多模态表达的评估完全交由模态合成指标处理,导致错误归因模糊且主观性较强。为此,作者提出MERRY框架,其核心创新在于实现了语义解耦——通过引入五项情感一致性(Emotional Consistency, EC)和三项角色一致性(Role Consistency, RC)的精细化指标,并将传统的主观评分任务重构为一种新型双向证据发现(bidirectional-evidence-finding)任务,显著提升了大语言模型作为评判者(LLM-as-Judge)的人类一致性(human agreement)。此设计有效分离了语义质量与多模态表现力的评估维度,从而实现更准确、可复现且高效的MRPA性能评测。

链接: https://arxiv.org/abs/2602.21941
作者: Zhenyu Wang,Xiaofen Xing,Yirong Chen,Xiangmin Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on pure textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and on the other hand remains constrained by the heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. This framework introduce five refined metrics for EC and three for RC. Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) Training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) Existing models suffer from emotional templatization and simplification, exhibiting positive-bias and performance bottleneck in fine-grained negative emotions; (3) Simple prompting method strengthens the weak models but constrains the strong ones, while simple fine-tuning method suffers from poor role generalization. Codes and dataset are available.

[NLP-17] Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text

【速读】: 该论文旨在解决多语言及混码(code-mixed)环境下讽刺检测(sarcasm detection)的难题,尤其在资源稀缺(low-resource)场景中,传统大型语言模型(Large Language Models, LLMs)表现受限。其解决方案的关键在于采用领域自适应微调(domain-adaptive fine-tuning)策略,对小型Transformer架构模型DistilBERT进行优化,仅使用少量LLM生成的混码Hinglish数据进行训练,即实现了84%的最高准确率,显著优于所有对比的LLMs在零样本和少样本设置下的性能表现。这表明在数据有限条件下,针对特定任务微调的小型模型更具优势。

链接: https://arxiv.org/abs/2602.21933
作者: Bitan Majumder,Anirban Sen
机构: Pondicherry University (庞迪契里大学); Ashoka University (阿舒卡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability. This study compares four large language models, Llama 3.1, Mistral, Gemma 3, and Phi-4, with a fine-tuned DistilBERT model for sarcasm detection in code-mixed Hinglish text. The results indicate that the smaller, sequentially fine-tuned DistilBERT model achieved the highest overall accuracy of 84%, outperforming all of the LLMs in zero and few-shot set ups, using minimal LLM generated code-mixed data used for fine-tuning. These findings indicate that domain-adaptive fine-tuning of smaller transformer based models may significantly improve sarcasm detection over general LLM inference, in low-resource and data scarce settings.

[NLP-18] ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

【速读】: 该论文旨在解决当前大型推理模型(Large Reasoning Models, LRMs)在强化学习(Reinforcement Learning, RL)后训练阶段主要依赖英语推理而导致多语言适应性不足的问题,尤其在需要支持全球用户原生语言思维轨迹的场景下表现受限。解决方案的关键在于提出一种名为ExpLang的新颖LLM后训练流程,其核心创新是将“思考语言选择”作为策略梯度优化中的一个可学习动作(action),从而在RL过程中实现在线策略(on-policy)的语言偏好探索与利用。该方法通过引入多语言推理空间,有效扩展了探索范围,并借助非英语语言的优势提升最终推理性能,且与主流RL算法正交,为利用多语种特性增强LRMs提供了新范式。

链接: https://arxiv.org/abs/2602.21887
作者: Changjiang Gao,Zixian Huang,Kaichen Yang,Jiajun Chen,Jixing Li,Shujian Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.

[NLP-19] DynamicGTR: Leverag ing Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs CVPR2026

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在零样本图问答(zero-shot graph QA)任务中对结构化图数据理解能力不足的问题,尤其是现有方法依赖单一图拓扑表示(Graph Topology Representation, GTR)导致的响应不准确或冗长的问题。解决方案的关键在于提出DynamicGTR框架,该框架在推理阶段动态选择最优GTR以适配每个查询,从而在不增加额外训练的前提下,提升VLM对图结构的理解能力和问答准确性,并实现从合成图算法任务到真实场景(如链接预测和节点分类)的零样本迁移,同时具备跨任务、跨域和跨模型的强泛化能力。

链接: https://arxiv.org/abs/2602.21864
作者: Yanbin Wei,Jiangyue Yan,Chun Kang,Yang Chen,Hua Liu,James Kwok,Yu Zhang
机构: Southern University of Science and Technology (南方科技大学); Hong Kong University of Science and Technology (香港科技大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
备注: CVPR 2026

点击查看摘要

Abstract:Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This ``one-size-fits-all’’ strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries. To address this, we propose the \mboxDynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.

[NLP-20] Personalized Graph-Empowered Large Language Model for Proactive Information Access

【速读】: 该论文旨在解决个体在回忆生活细节时面临的困难以及事件混淆问题,提出一种能够辅助用户高效召回遗忘经历的系统。其解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的框架,通过整合个人知识图谱(Personal Knowledge Graphs)来优化访问需求的识别过程,从而实现对用户记忆缺失的主动感知与精准响应。该框架具备高度灵活性,支持底座模型替换和事实检索方法调整,以适应随时间累积的个性化数据并持续改进性能。

链接: https://arxiv.org/abs/2602.21862
作者: Chia Cheng Chang,An-Zi Yen,Hen-Hsen Huang,Hsin-Hsi Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential. While numerous studies have proposed memory recall systems, these primarily rely on deep learning techniques that require extensive training and often face data scarcity due to the limited availability of personal lifelogs. As lifelogs grow over time, systems must also adapt quickly to newly accumulated data. Recently, large language models (LLMs) have demonstrated remarkable capabilities across various tasks, making them promising for personalized applications. In this work, we present a framework that leverages LLMs for proactive information access, integrating personal knowledge graphs to enhance the detection of access needs through a refined decision-making process. Our framework offers high flexibility, enabling the replacement of base models and the modification of fact retrieval methods for continuous improvement. Experimental results demonstrate that our approach effectively identifies forgotten events, supporting users in recalling past experiences more efficiently.

[NLP-21] Distill and Align Decomposition for Enhanced Claim Verification EACL

【速读】: 该论文旨在解决复杂声明验证(complex claim verification)中因子命题分解质量不足而导致的验证性能受限问题,现有方法难以实现分解质量与验证准确性之间的有效对齐。其解决方案的关键在于提出一种基于强化学习(reinforcement learning, RL)的联合优化框架,采用组相对策略优化(Group Relative Policy Optimization, GRPO)方法,同时最大化分解质量、验证器对齐度和格式合规性等多目标奖励信号,并结合结构化序列推理、教师蒸馏监督微调以及多目标奖励平衡机制。实验表明,该方法在6个评估场景下使8B规模的分解器在下游验证任务中达到71.75%的宏平均F1分数,显著优于提示工程方法和现有RL方法,且人类评估验证了生成子命题的高质量。

链接: https://arxiv.org/abs/2602.21857
作者: Jabez Magomere,Elena Kochkina,Samuel Mensah,Simerjot Kaur,Fernando Acero,Arturo Oncevay,Charese H. Smiley,Xiaomo Liu,Manuela Veloso
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: EACL Findings 2026

点击查看摘要

Abstract:Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition quality.

[NLP-22] FewMMBench: A Benchmark for Multimodal Few-Shot Learning

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理交错图像-文本数据时,其少样本学习(few-shot learning)能力难以系统评估的问题。现有方法缺乏统一、全面的基准测试工具来量化模型在不同任务类型、模型家族和提示策略下的表现,尤其在上下文学习(In-Context Learning, ICL)与思维链(Chain-of-Thought, CoT)提示机制下的性能差异尚不清晰。为此,作者提出了FewMMBench,一个涵盖属性识别到时间推理等多样化多模态理解任务的综合性基准,支持对零样本、少样本及CoT增强的少样本设置进行系统性分析。关键创新在于构建了一个结构化且可扩展的评测框架,能够揭示指令微调模型虽具备强零样本能力,但在引入演示示例或CoT推理后收益有限甚至出现性能退化,从而为未来提升MLLMs的少样本泛化能力提供诊断依据和研究方向。

链接: https://arxiv.org/abs/2602.21854
作者: Mustafa Dogan,Ilker Kesen,Iacer Calixto,Aykut Erdem,Erkut Erdem
机构: Aselsan Research; University of Copenhagen; Amsterdam UMC; University of Amsterdam; Koç University; Hacettepe University
类目: Computation and Language (cs.CL)
备注: Preprint. 49 pages, 38 Figures, 5 Tables

点击查看摘要

Abstract:As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: this https URL

[NLP-23] Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem

【速读】: 该论文试图解决大型语言模型在“汽车清洗问题”(car wash problem)这一推理基准测试中表现不佳的问题,该问题要求模型具备隐式物理约束推理能力。解决方案的关键在于引入结构化推理框架——STAR(Situation-Task-Action-Result)推理框架,通过强制在推理前明确目标(goal articulation),将准确率从0%提升至85%,显著优于单纯依赖上下文注入的方法。进一步结合用户画像和检索增强生成(RAG)技术可实现100%准确率,但核心突破仍源于结构化推理 scaffold 的设计。

链接: https://arxiv.org/abs/2602.21814
作者: Heejin Jo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 4 tables

点击查看摘要

Abstract:Large language models consistently fail the “car wash problem,” a viral reasoning benchmark requiring implicit physical constraint inference. We present a variable isolation study (n=20 per condition, 6 conditions, 120 total trials) examining which prompt architecture layers in a production system enable correct reasoning. Using Claude 3.5 Sonnet with controlled hyperparameters (temperature 0.7, top_p 1.0), we find that the STAR (Situation-Task-Action-Result) reasoning framework alone raises accuracy from 0% to 85% (p=0.001, Fisher’s exact test, odds ratio 13.22). Adding user profile context via vector database retrieval provides a further 10 percentage point gain, while RAG context contributes an additional 5 percentage points, achieving 100% accuracy in the full-stack condition. These results suggest that structured reasoning scaffolds – specifically, forced goal articulation before inference – matter substantially more than context injection for implicit constraint reasoning tasks.

[NLP-24] D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行链式思维(Chain-of-Thought, CoT)蒸馏时,导致小语言模型(Small Language Models, SLMs)产生“过度思考”(overthinking)的问题,从而引发性能下降和token消耗过高的现象。其解决方案的关键在于提出了一种结构化的推理框架——纪律性链式思维(Disciplined Chain-of-Thought, D-CoT),通过在训练过程中引入控制标签(如TEMP_LOW用于事实核查、TEMP_HIGH用于多角度探索)作为辅助支架,对CoT轨迹进行优化,从而抑制推理漂移(reasoning drift),实现性能提升与计算成本降低的双重目标。

链接: https://arxiv.org/abs/2602.21786
作者: Shunsuke Ubukata
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures. Code: this https URL | Benchmarks: this https URL | Dataset: this https URL

点击查看摘要

Abstract:Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces “overthinking” in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags – such as TEMP_LOW for fact-checking and TEMP_HIGH for multi-perspective exploration – as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.

[NLP-25] Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLM s AAAI26

【速读】: 该论文旨在解决隐式话语关系识别(Implicit Discourse Relation Recognition, IDRR)中因缺乏显式连接词而导致的深层语义理解困难,以及现有方法仅输出关系标签而无法提供可解释性支持的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)强大的推理与自然语言解释生成能力,通过提示(prompting)方式让LLM为训练样本生成带有金标关系的解释文本,并设计一种新颖的分类-生成联合框架,在关系预测的同时引入LLM生成的解释作为额外监督信号进行端到端训练,从而在提升模型性能的同时增强结果的可解释性。该方法具有良好的通用性,可无缝集成至主流IDRR模型中。

链接: https://arxiv.org/abs/2602.21763
作者: Heng Wang,Changxing Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注: AAAI26’0ral

点击查看摘要

Abstract:Implicit Discourse Relation Recognition (IDRR) remains a challenging task due to the requirement for deep semantic understanding in the absence of explicit discourse markers. A further limitation is that existing methods only predict relations without providing any supporting explanations. Recent advances in large language models (LLMs) have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation. In this work, we propose a simple yet effective approach to distill the reasoning capabilities of LLMs into lightweight IDRR models to improve both performance and interpretability. Specifically, we first prompt an LLM to generate explanations for each training instance conditioned on its gold label. Then, we introduce a novel classification-generation framework that jointly performs relation prediction and explanation generation, and train it with the additional supervision of LLM-generated explanations. Our framework is plug-and-play, enabling easy integration with most existing IDRR models. Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability. Furthermore, we validate the generality of our approach on sentiment classification and natural language inference

[NLP-26] Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization

【速读】: 该论文针对低资源语言 Bengali 的长语音频识别(Automatic Speech Recognition, ASR)与说话人二值化(Speaker Diarization)任务所面临的挑战,提出了一套端到端解决方案。其核心问题包括:Bengali 丰富的音素库、显著的方言差异、频繁的英語混用(code-mixing),以及高质量标注语料稀缺。解决方案的关键在于三个设计选择:1)对分割模块进行领域特定微调(domain-specific fine-tuning of the segmentation component),以提升语音边界检测精度;2)采用 Demucs 源分离模型实现声源隔离(vocal source separation),增强目标语音信号质量;3)引入基于自然静默边界的分块策略(silence-aware chunking),优化音频切片方式以适配 Bengali 的语音节奏特征。上述改进显著提升了 ASR 的词错误率(WER)和说话人二值化错误率(DER),验证了针对低资源场景定制化处理流程的有效性。

链接: https://arxiv.org/abs/2602.21741
作者: MD. Sagor Chowdhury,Adiba Fairooz Chowdhury
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: 6 pages, 5 figures, 3 tables; system paper submitted to DL Sprint 4.0 (Kaggle)

点击查看摘要

Abstract:We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the this http URL pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.

[NLP-27] Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling ICLR2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在问答任务中因幻觉和事实缺失导致的推理可靠性问题。现有基于知识图谱(Knowledge Graph, KG)增强的方法通常通过生成阶段施加规则或模仿固定示例路径来约束LLM推理,这限制了其在分布外图谱推理问题上的泛化能力。解决方案的关键在于提出一种名为Explore-on-Graph(EoG)的新框架,该框架利用强化学习激励LLM自主探索知识图谱上的多样化推理路径,并通过最终答案正确性作为主奖励信号,同时引入路径信息作为辅助奖励以提升探索效率与意义,从而突破传统方法对先验经验的依赖,实现更鲁棒且具有泛化能力的KG增强推理。

链接: https://arxiv.org/abs/2602.21728
作者: Shiqi Yan,Yubo Chen,Ruiqi Zhou,Zhengxi Yao,Shuai Chen,Tianyi Zhang,Shijie Zhang,Wei Qiang Zhang,Yongfeng Huang,Haixin Duan,Yunqi Zhang
机构: Zhongguancun Laboratory (中关村实验室); Tsinghua University (清华大学); Ant International (蚂蚁国际); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs’ answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrained LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confined the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths’ final answers. To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts. Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.

[NLP-28] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning

【速读】: 该论文试图解决的问题是:为什么人类递归数制系统(如英语的十进制计数系统)在跨语言中表现出高度规律性,而其他可能存在的不规则系统却未被广泛使用?其核心假设为,规律性可能通过促进学习效率来解释这种跨语言共性。解决方案的关键在于采用强化学习(Reinforcement Learning)方法模拟学习过程,发现规律性强的系统在有限数据下更易被习得,且这一优势源于系统设计本身对泛化能力的要求;同时,对于高度不规则的“非自然”系统,学习难度不再受规律性影响,而是取决于信号长度,表明不同子空间中的系统受不同学习压力支配。

链接: https://arxiv.org/abs/2602.21720
作者: Andrea Silvi,Ponrawee Prasertsom,Jennifer Culbertson,Devdatt Dubhashi,Moa Johansson,Kenny Smith
机构: Chalmers University of Technology and Gothenburg University; University of Edinburgh
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.

[NLP-29] DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation EACL

【速读】: 该论文旨在解决跨分词器知识蒸馏(cross-tokenizer Knowledge Distillation, KD)中因序列级和词汇级对齐不佳而导致的性能瓶颈问题。现有方法在token级别上通常采用均匀权重,忽略了不同位置信息的差异性,在序列层面也缺乏对词法与语义信息的精细对齐能力。其解决方案的关键在于提出双空间加权与时间扭曲对齐(Dual-Space Weighting and Time-Warped Alignment, DWA-KD)框架:首先在token级别引入基于熵的双空间加权机制,通过KL散度在教师与学生表示空间间进行双向蒸馏,并动态提升学生不确定而教师自信的token的学习权重;其次在序列级别利用Soft Dynamic Time Warping(Soft-DTW)对嵌入层和最终隐藏状态层同时建模,实现对词法和上下文语义的鲁棒对齐,从而显著提升蒸馏效果。

链接: https://arxiv.org/abs/2602.21669
作者: Duc Trung Vu,Pham Khanh Chi,Dat Phi Van,Linh Ngo Van,Sang Dinh,Trung Le
机构: Hanoi University of Science and Technology (河内科学技术大学); University of Monash (莫纳什大学)
类目: Computation and Language (cs.CL)
备注: EACL Findings

点击查看摘要

Abstract:Knowledge Distillation (KD) has emerged as a crucial technique for compressing Large Language Models (LLMs). Although existing cross-tokenizer KD methods have made notable progress, their effectiveness remains constrained by suboptimal alignment across sequence and vocabulary levels. To address these limitations, we introduce Dual-Space Weighting and Time-Warped Alignment (DWA-KD), a novel cross-tokenizer distillation framework that enhances token-wise distillation through dual-space entropy-based weighting and achieves precise sequence-level alignment by leveraging both lexical and semantic information. At the token level, DWA-KD maps teacher representations into the student space and vice versa, performing dual-space KD via Kullback-Leibler divergence (KL). The process is modulated by dual-space weights that up-weight tokens where the student is uncertain and the teacher is confident, thereby focusing learning on informative tokens rather than treating all positions equally. At the sequence level, DWA-KD applies Soft Dynamic Time Warping (Soft-DTW) to both the embedding and final hidden-state layers, enabling robust alignment of lexical and contextual semantics between teacher and student sequences. Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final hidden state layer Soft-DTW alignment.

[NLP-30] Sparsity Induction for Accurate Post-Training Pruning of Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练稀疏化(Post-Training Sparsity, PTS)过程中因原始密集矩阵稀疏度不足而导致的性能下降问题。现有方法直接剪枝会破坏模型状态,即使经过微调也难以恢复性能。其解决方案的关键在于提出“稀疏诱导”(Sparsity Induction)机制:一方面通过数学等价的缩放变换提升分布级稀疏性(distribution-level sparsity),该方法无额外参数且推理时无开销;另一方面引入谱范数损失(Spectral Norm Loss)从低秩视角促进特征级稀疏性(feature-level sparsity)。此双重策略显著增强了模型对剪枝的友好性,在多种架构和任务中均优于现有方法。

链接: https://arxiv.org/abs/2602.21652
作者: Minhao Jiang,Zhikai Li,Xuewen Liu,Jing Zhang,Mengjuan Chen,Qingyi Gu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS), which reduces model cost by removing weights from dense networks, is an effective approach. However, native dense matrices lack high sparsity, making existing approaches that directly remove weights disrupt model states, resulting in unsatisfactory performance recovery even with post-tuning. We propose Sparsity Induction, which promotes models toward higher sparsity at both distribution and feature levels before pruning, to push the limits of PTS. At the distribution level, we enhance distributional sparsity through mathematically equivalent scaling transformations, which are fully absorbable and incur no extra parameters or inference-time overhead. At the feature level, we introduce Spectral Norm Loss to promote feature sparsity from a low-rank perspective. Experiments across diverse model architectures and tasks demonstrate that our method further enhances sparsity-friendliness, achieving superior pruning performance over existing approaches.

[NLP-31] Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration

【速读】: 该论文旨在解决尼泊尔语语音到英语文本翻译(S2TT)系统中因自动语音识别(ASR)引入的结构噪声问题,尤其是未加标点的ASR输出对机器翻译(NMT)质量造成的显著负面影响。实验表明,无标点ASR输出会导致FLORES基准上BLEU分数下降20.7%,严重损害翻译准确性。解决方案的关键在于提出并验证了一个中间阶段的标点恢复模块(Punctuation Restoration Module, PRM),该模块直接作用于ASR输出,有效提升后续NMT的输入质量;最终优化后的S2TT流水线在自建数据集上相较直接ASR到NMT基线实现了4.90 BLEU点的提升(36.38 vs. 31.48),并通过人工评估确认其在准确性和流畅性上的优势,证明了针对性标点恢复是缓解低资源语言下S2TT系统结构噪声最有效的干预手段。

链接: https://arxiv.org/abs/2602.21647
作者: Tangsang Chongbang,Pranesh Pyara Shrestha,Amrit Sarki,Anku Jaiswal
机构: Pulchowk Campus; Institute of Engineering, Tribhuvan University (特里布万大学工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 12 tables

点击查看摘要

Abstract:This paper presents and evaluates an optimized cascaded Nepali speech-to-English text translation (S2TT) system, focusing on mitigating structural noise introduced by Automatic Speech Recognition (ASR). We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark. We empirically investigate the influence of punctuation loss, demonstrating that unpunctuated ASR output significantly degrades translation quality, causing a massive 20.7% relative BLEU drop on the FLORES benchmark. To overcome this, we propose and evaluate an intermediate Punctuation Restoration Module (PRM). The final S2TT pipeline was tested across three configurations on a custom dataset. The optimal configuration, which applied the PRM directly to ASR output, achieved a 4.90 BLEU point gain over the direct ASR-to-NMT baseline (BLEU 36.38 vs. 31.48). This improvement was validated by human assessment, which confirmed the optimized pipeline’s superior Adequacy (3.673) and Fluency (3.804). This work validates that targeted punctuation restoration is the most effective intervention for mitigating structural noise in the Nepali S2TT pipeline. It establishes an optimized baseline and demonstrates a critical architectural insight for developing cascaded speech translation systems for similar low-resource languages.

[NLP-32] Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion ICLR2026

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在机器翻译中对多语言图像-文本对数据依赖性强、低资源语言覆盖不足的问题。现有方法主要依赖图像引导的翻译,受限于跨语言图文对数据稀缺;而语音模态因其与文本天然对齐且数据丰富,具备更广的语言扩展潜力。解决方案的关键在于提出一种语音引导的机器翻译(Speech-guided Machine Translation, SMT)框架,通过融合语音与文本作为输入,并引入自进化机制(Self-Evolution Mechanism),利用文本到语音模型生成合成语音样本,由MLLM自动分类并迭代优化自身,从而减少对真实语音数据的依赖。实验表明,该框架在Multi30K和FLORES-200等多个基准上均达到最先进性能,且合成语音与真实语音差异对翻译质量影响可忽略。

链接: https://arxiv.org/abs/2602.21646
作者: Yexing Du,Youcheng Pan,Zekun Wang,Zheng Chu,Yichong Huang,Kaiyuan Liu,Bo Yang,Yang Xiang,Ming Liu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL)
备注: Accepted in ICLR 2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly the FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirms that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at this https URL.

[NLP-33] Multi-dimensional Assessment and Explainable Feedback for Counselor Responses to Client Resistance in Text-based Counseling with LLM s

【速读】: 该论文旨在解决心理辅导中针对来访者抵抗(client resistance)的干预效果难以进行精细化评估的问题,传统NLP研究虽能评价整体咨询质量与通用治疗技能,但无法提供对高风险情境下具体沟通机制的细粒度分析。解决方案的关键在于构建一个理论驱动的多维评估框架,将咨询师回应分解为四种不同的沟通机制,并基于此框架创建了一个专家标注的真实咨询语料库,其中包含专业评分及解释性理由;在此基础上,采用全参数指令微调(full-parameter instruction tuning)方法在Llama-3.1-8B-Instruct模型上训练,使其不仅能精准区分不同沟通机制的质量(F1达77–81%),还能生成高质量解释,且人类专家对其解释评分接近满分(2.8–2.9/3.0)。实证结果表明,该AI反馈显著提升了43名咨询师应对客户抵抗的能力。

链接: https://arxiv.org/abs/2602.21638
作者: Anqi Li,Ruihan Wang,Zhaoming Chen,Yuqian Chen,Yu Lu,Yi Zhu,Yuan Xie,Zhenzhong Lan
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:Effectively addressing client resistance is a sophisticated clinical skill in psychological counseling, yet practitioners often lack timely and scalable supervisory feedback to refine their approaches. Although current NLP research has examined overall counseling quality and general therapeutic skills, it fails to provide granular evaluations of high-stakes moments where clients exhibit resistance. In this work, we present a comprehensive pipeline for the multi-dimensional evaluation of human counselors’ interventions specifically targeting client resistance in text-based therapy. We introduce a theory-driven framework that decomposes counselor responses into four distinct communication mechanisms. Leveraging this framework, we curate and share an expert-annotated dataset of real-world counseling excerpts, pairing counselor-client interactions with professional ratings and explanatory rationales. Using this data, we perform full-parameter instruction tuning on a Llama-3.1-8B-Instruct backbone to model fine-grained evaluative judgments of response quality and generate explanations underlying. Experimental results show that our approach can effectively distinguish the quality of different communication mechanisms (77-81% F1), substantially outperforming GPT-4o and Claude-3.5-Sonnet (45-59% F1). Moreover, the model produces high-quality explanations that closely align with expert references and receive near-ceiling ratings from human experts (2.8-2.9/3.0). A controlled experiment with 43 counselors further confirms that receiving these AI-generated feedback significantly improves counselors’ ability to respond effectively to client resistance.

[NLP-34] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

【速读】: 该论文旨在解决基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)在多模态大语言模型(Multimodal Large Language Models, MLLMs)推理能力提升中面临的“奖励黑客”(reward hacking)问题,即模型可能通过学习与最终答案无关的伪推理路径来误导奖励机制。同时,现有基于评分标准(rubric-based)的方法因实例级生成开销大且所有评分标准同等对待,导致训练效率低下。解决方案的关键在于提出分层评分课程学习(Stratified Rubric-based Curriculum Learning, RuCL),其核心创新是将课程学习的焦点从数据选择转向奖励设计:首先生成具有广泛适用性的通用评分标准,并根据模型能力进行分层;随后在训练过程中动态调整各评分标准的权重,引导模型从掌握基础感知能力逐步过渡到处理高级逻辑推理任务,从而实现高效且稳定的性能提升。

链接: https://arxiv.org/abs/2602.21628
作者: Yukun Chen,Jiaming Li,Longze Chen,Ze Gong,Jingpeng Li,Zhen Qin,Hengyu Chang,Ancheng Xu,Zhihao Yang,Hamid Alinejad-Rokny,Qiang Qu,Bo Zheng,Min Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model’s competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

[NLP-35] When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

【速读】: 该论文旨在解决视觉空间推理(Visual Spatial Reasoning, VSR)在现代视觉语言模型(Vision-Language Models, VLMs)中依然表现不佳的问题,尤其关注在推理阶段注入额外信息(如空间线索、常识知识或链式思维提示)的效果。研究表明,信息注入并非总是有益的,其有效性高度依赖于信息的类型、相关性与任务适配度。解决方案的关键在于选择性地注入任务对齐的信息:单一精准的空间线索优于多源信息聚合,过量或弱相关常识知识会降低性能,而链式思维(Chain-of-Thought, CoT)提示仅在空间定位足够精确时才能提升准确率。这一发现为构建可靠、高效的多模态推理流程提供了实证依据和设计指导。

链接: https://arxiv.org/abs/2602.21619
作者: Muku Akasaka,Soyeon Caren Han
机构: The University of Melbourne(墨尔本大学)
类目: Computation and Language (cs.CL)
备注: 5 pages, 6 figures, Under review

点击查看摘要

Abstract:Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.

[NLP-36] MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

【速读】: 该论文旨在解决南亚社交媒体中孟加拉语-英语代码混杂文本(Bangla-English code-mixing)场景下隐含语义识别资源匮乏的问题,尤其是针对情感、讽刺、冒犯性和粗俗性等难以捕捉的语义特征。现有模型多集中于单一语言或高资源语言,在音译变体、文化参照和句内语言切换方面表现不佳。解决方案的关键在于构建首个公开可用的孟加拉语-英语代码混杂语料库 MixSarc,包含 9,087 条人工标注句子,涵盖幽默、讽刺、冒犯性和粗俗性四个标签,并通过目标社交平台采集、系统过滤与多标注者验证确保数据质量。此外,研究还对比了基于 Transformer 的模型与零样本大语言模型在结构化提示下的性能,揭示了类别不平衡和语用复杂性对讽刺、冒犯性和粗俗性识别造成显著影响,为后续文化敏感型自然语言处理(NLP)提供了基础支持。

链接: https://arxiv.org/abs/2602.21608
作者: Kazi Samin Yasar Alam,Md Tanbir Chowdhury,Tamim Ahmed,Ajwad Abrar,Md Rafid Haque
机构: IUT-Dhaka(孟加拉国伊斯兰大学达卡分校); UIC(美国伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.

[NLP-37] GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Models, LLMs)后训练阶段中因训练问题质量不稳定而导致性能波动的问题。其核心挑战源于RL的非平稳性:策略在训练过程中持续演化,导致轨迹生成和学习过程依赖于探索与奖励反馈,而非固定监督信号。为应对这一问题,作者提出GradAlign方法,其关键在于利用一个小而可信的验证集,通过衡量训练问题的策略梯度与验证梯度的方向一致性来筛选高质量训练样本,从而构建自适应课程学习机制。该方法有效提升了RL训练的稳定性与最终性能,在奖励信号不可靠、分布不平衡及低效训练语料等复杂场景下均显著优于现有基线。

链接: https://arxiv.org/abs/2602.21492
作者: Ningyuan Yang,Weihua Du,Weiwei Sun,Sean Welleck,Yiming Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages. Preliminary work

点击查看摘要

Abstract:Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non-stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at this https URL

[NLP-38] VecGlypher: Unified Vector Glyph Generation with Language Models CVPR’26

【速读】: 该论文旨在解决当前基于学习的字体生成方法依赖精心制作的示例表和栅格到矢量的后处理步骤,从而限制了字体创作的可访问性和可编辑性的问题。其核心解决方案是提出VecGlypher——一个单一的多模态语言模型,能够直接从文本描述或图像示例中生成高质量的矢量字形(vector glyphs),通过自回归方式输出SVG路径标记(SVG path tokens),避免使用栅格中间表示,并在单次推理中生成可编辑、闭合无误的轮廓。关键创新在于:(1) 一种面向排版感知的数据与训练策略,包括两阶段训练流程(先在39K含噪Envato字体上进行大规模续写以掌握SVG语法和长程几何结构,再在2.5K专家标注的Google Fonts数据集上微调以对齐语言、图像与几何);(2) 预处理环节标准化坐标系、归一化路径、去重家族并量化坐标,确保长序列解码稳定性;(3) 使用绝对坐标序列化显著提升几何准确性。此方案显著提升了跨家族分布外(OOD)文本驱动生成性能,并在图像参考生成任务中达到SOTA水平。

链接: https://arxiv.org/abs/2602.21461
作者: Xiaoke Huang,Bhavul Gauri,Kam Woh Ng,Tony Ng,Mengmeng Xu,Zhiheng Liu,Weiming Ren,Zhaochong An,Zijian Zhou,Haonan Qiu,Yuyin Zhou,Sen He,Ziheng Wang,Tao Xiang,Xiao Han
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to CVPR’26. Project page: this https URL

点击查看摘要

Abstract:Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.

[NLP-39] Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agent ic RAG

【速读】: 该论文旨在解决多模态代理式检索增强生成(Multimodal Agentic RAG)系统中,攻击者通过将恶意语义分散部署在检索、规划与生成等多阶段组件中所引发的安全漏洞问题。传统无状态防御机制难以识别此类跨模块协同攻击策略。解决方案的关键在于提出MMA-RAG^T框架,其核心是一个由模块化信任代理(Modular Trust Agent, MTA)驱动的推理时控制机制,MTA通过结构化大语言模型(LLM)推理维护近似的信念状态(belief state),从而在不依赖模型内部结构的前提下实现状态感知的纵深防御(defence-in-depth)。该框架通过配置化内部检查点对多阶段输出进行动态干预,在显著降低攻击成功率(平均提升6.50倍)的同时保持极低的可用性损耗。

链接: https://arxiv.org/abs/2602.21447
作者: Inderjeet Singh,Vikas Pahuja,Aishvariya Priya Rathina Sabapathy,Chiara Picardi,Amit Giloni,Roman Vainshtein,Andrés Murillo,Hisashi Kojima,Motoyoshi Sekiya,Yuki Unno,Junichi Suga
机构: Fujitsu Research of Europe, UK; Fujitsu Limited, Japan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components. We formulate this security challenge as a Partially Observable Markov Decision Process (POMDP), where adversarial intent is a latent variable inferred from noisy multi-stage observations. We introduce MMA-RAG^T, an inference-time control framework governed by a Modular Trust Agent (MTA) that maintains an approximate belief state via structured LLM reasoning. Operating as a model-agnostic overlay, MMA-RAGT mediates a configurable set of internal checkpoints to enforce stateful defence-in-depth. Extensive evaluation on 43,774 instances demonstrates a 6.50x average reduction factor in Attack Success Rate relative to undefended baselines, with negligible utility cost. Crucially, a factorial ablation validates our theoretical bounds: while statefulness and spatial coverage are individually necessary (26.4 pp and 13.6 pp gains respectively), stateless multi-point intervention can yield zero marginal benefit under homogeneous stateless filtering when checkpoint detections are perfectly correlated.

[NLP-40] MrBERT: Modern Multilingual Encoders via Vocabulary Domain and Dimensional Adaptation

【速读】: 该论文旨在解决多语言预训练模型在特定语言任务和高风险专业领域(如生物医学与法律)中性能不足,以及模型部署时推理成本与存储开销过高的问题。解决方案的关键在于提出MrBERT模型家族,其基于ModernBERT架构,在35种语言及代码上进行预训练,并通过针对性适配实现对加泰罗尼亚语和西班牙语任务的最先进表现;同时引入Matryoshka Representation Learning (MRL) 技术,支持灵活的向量尺寸调整,显著降低推理和存储成本,从而在保持语言精度的同时提升模型的实用性与可扩展性。

链接: https://arxiv.org/abs/2602.21379
作者: Daniel Tamayo,Iñaki Lacunza,Paula Rivera-Hidalgo,Severino Da Dalt,Javier Aula-Blasco,Aitor Gonzalez-Agirre,Marta Villegas
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 14 tables and 4 figures

点击查看摘要

Abstract:We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open source the complete model family on Huggingface.

[NLP-41] Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

【速读】: 该论文旨在解决当前基于分词(tokenization)和子词(sub-tokenization)的自然语言处理模型(如word2vec、BERT和GPT系列)在处理形态丰富且资源匮乏的语言时,难以充分捕捉拼写相似性和形态变化的问题。其关键解决方案是提出一种基于Transformer架构的**富字符嵌入(Rich Character Embeddings, RCE)**方法,直接从字符序列计算词向量,从而融合语义与句法信息;同时设计了一种结合Transformer与卷积机制的混合模型,使生成的向量可作为现有模型中字典或子词嵌入的即插即用替代方案,在小样本场景下显著提升性能,尤其在SWAG、词形预测、隐喻和回文检测等任务中优于传统基于token的方法。

链接: https://arxiv.org/abs/2602.21377
作者: Felix Schneider,Maria Gogolev,Sven Sickert,Joachim Denzler
机构: Friedrich Schiller University Jena(耶拿弗里德里希·席勒大学)
类目: Computation and Language (cs.CL)
备注: 12 content pages, 2 figures, 8 tables, one example textbox

点击查看摘要

Abstract:Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resource languages. To mitigate this problem, we propose to computes word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. It has the potential to improve performance for both large context-based language models like BERT and small models like word2vec for under-resourced and morphologically rich languages. We evaluate our approach on various tasks like the SWAG, declension prediction for inflected languages, metaphor and chiasmus detection for various languages. Our experiments show that it outperforms traditional token-based approaches on limited data using OddOneOut and TopK metrics.

[NLP-42] Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages

【速读】: 该论文旨在解决在低资源语言(如波斯语)中从医疗转录文本中提取临床信息的挑战,这在医疗自然语言处理(Healthcare NLP)领域尤为困难。其解决方案的关键在于采用两步式流水线:首先使用Aya-expanse-8B模型将波斯语转录文本翻译为英文,随后利用五种开源小语言模型(Small Language Models, SLMs)进行13个临床特征的二分类提取。该方法无需微调,仅通过少量示例提示(few-shot prompting)即可实现高精度提取,其中Qwen2.5-7B-Instruct表现最优(中位宏F1分数0.899),且翻译步骤显著提升了敏感度和对类别不平衡的鲁棒性,尽管略微降低了特异性。此策略为缺乏基础设施与标注资源的多语言临床NLP场景提供了可部署、隐私保护的实用方案。

链接: https://arxiv.org/abs/2602.21374
作者: Mohammadreza Ghaffarzadeh-Esfahani,Nahid Yousefian,Ebrahim Heidari-Farsani,Ali Akbar Omidvarian,Sepehr Ghahraei,Atena Farangi,AmirBahador Boroumand
机构: Isfahan University of Medical Sciences (伊斯法罕医科大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 3 figures, 2 supplementary files

点击查看摘要

Abstract:Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) – Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it – for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it showed the weakest results. Larger models (7B–8B parameters) consistently outperformed smaller counterparts in sensitivity and MCC. A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance, though at the cost of slightly lower specificity and precision. Feature-level results showed reliable extraction of physiological symptoms across most models, whereas psychological complaints, administrative requests, and complex somatic features remained challenging. These findings establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources, and highlight the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.

[NLP-43] Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

【速读】: 该论文旨在解决黑箱人工智能(AI)系统在特定任务下输出结果可信度的量化问题,即从业者应以何种置信水平信任其输出。解决方案的关键在于提出“可靠性水平”(reliability level)这一单一数值指标,该指标通过自一致性采样(self-consistency sampling)与合规校准(conformal calibration)相结合的方式构建,具备严格的有限样本、分布无关的保证。其中,自一致性采样可指数级降低不确定性,而合规校准则确保在任意模型误差下,正确性覆盖概率不超过目标置信水平的 1/(n+1)1/(n+1),且可通过更大答案集直观反映难题下的不确定性。此方法为黑箱部署提供了可解释、可验证的决策门控机制。

链接: https://arxiv.org/abs/2602.21368
作者: Charafeddine Mouzouni
机构: OPIT – Open Institute of Technology (开放技术研究所); Cohorte AI; Paris, France
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 41 pages, 11 figures, 10 tables, including appendices

点击查看摘要

Abstract:Given a black-box AI system and a task, at what confidence level can a practitioner trust the system’s output? We answer with a reliability level – a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system’s errors – made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy – see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.

[NLP-44] Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在对齐(alignment)过程中存在的脆弱性问题,即尽管通过监督微调(Supervised Fine-Tuning, SFT)、基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)等技术提升了安全性,模型仍易受“越狱攻击”(jailbreak attacks)影响——此类攻击利用间接或欺骗性表述隐藏有害意图。研究发现,这种脆弱性源于浅层对齐机制缺乏深层推理能力,导致模型常在未真正理解危害性的前提下拒绝有害请求。解决方案的关键在于引入推理感知后训练(reasoning-aware post-training),具体包括:构建一个包含实用型与安全关键型提示及其逐步推理过程的Chain-of-Thought(CoT)微调数据集,使模型能基于原则性推理生成拒绝响应;并进一步提出对齐加权DPO(Alignment-Weighted DPO),通过为推理段与最终答案段分配不同偏好权重,实现更细粒度、针对性的参数更新,从而显著提升模型对多种越狱策略的鲁棒性,同时保持整体任务性能。

链接: https://arxiv.org/abs/2602.21346
作者: Mengxuan Hu,Vivek V. Datla,Anoop Kumar,Zihan Guan,Sheng Li,Alfy Samuel,Daben Liu
机构: University of Virginia (弗吉尼亚大学); Capital One (资本一号)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.

[NLP-45] oolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning ICML2026

【速读】: 该论文旨在解决当前工具增强型语言模型(tool-augmented language models)在复杂多工具环境下的可靠性评估难题,尤其关注模型在真实场景中因推理能力不足而导致的中间结果误差累积和执行漂移问题。其解决方案的关键在于提出一个名为\ToolMATH的数学驱动型基准测试平台,该平台通过结构化的工具调用机制与可验证正确性的数学问题设计,系统性地评估模型在大规模重叠工具库和缺失目标能力条件下的表现;研究发现,模型失败的核心原因并非局部动作选择错误,而是缺乏长期规划一致性与对观测结果的规范使用,从而导致早期微小偏差被放大为不可逆的执行偏离。

链接: https://arxiv.org/abs/2602.21265
作者: Hyeonje Choi,Jeongsoo Lee,Hyojun Lee,Jay-Yoon Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Conference : Submitted to ICML 2026. 8 pages (+ abstract 16 pages), 5 figures

点击查看摘要

Abstract:We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. \ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. \ToolMATH roughly contains 8k questions and 12k tools; we provide an additional hard-set \ToolMATHHard with questions and tools. Our evaluation reveals that the key failure factor is due to the inability to reason, leading to the accumulation of intermediate results’ errors and constrain later decisions. Tool-list redundancy do not simply add noise, but amplify small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.

[NLP-46] Structured Prompt Language: Declarative Context Management for LLM s

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在实际应用中面临的 prompt 编写冗余、资源管理不透明、缺乏可复用性与可扩展性等问题。其核心挑战在于如何将LLM作为生成式知识库(generative knowledge base)进行结构化控制,同时优化上下文窗口(context window)这一受限资源的使用效率,并实现跨模型调度、持久化记忆和鲁棒代理流程的统一表达。解决方案的关键是提出一种声明式语言 SPL(Structured Prompt Language),它借鉴 SQL 的设计思想,提供显式的 token 预算管理(WITH BUDGET/LIMIT)、自动查询优化器、EXPLAIN 透明机制(类比 SQL 的 EXPLAIN ANALYZE)、原生集成检索增强生成(Retrieval-Augmented Generation, RAG)与持久内存的能力;并通过 SPL-flow 实现三层次提供商回退策略(Ollama - OpenRouter - 自愈重试),使同一 .spl 脚本可在云端并行或本地串行执行,且无需修改代码即可实现成本差异高达 68 倍的模型切换和自动最优模型留存。

链接: https://arxiv.org/abs/2602.21257
作者: Wen G. Gong
机构: 未知
类目: Computation and Language (cs.CL); Databases (cs.DB); Programming Languages (cs.PL)
备注: 44 pages, 6 figures, 14 tables, 15 code-listings

点击查看摘要

Abstract:We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic query optimizer, EXPLAIN transparency analogous to SQL’s EXPLAIN ANALYZE, and native integration of retrieval-augmented generation (RAG) and persistent memory in a single declarative framework. SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama - OpenRouter - self-healing retry) fully transparent to the .spl script. Five extensions demonstrate the paradigm’s breadth: (1) Text2SPL (multilingual NL-SPL translation); (2) Mixture-of-Models (MoM) routing that dispatches each PROMPT to a domain-specialist model at runtime; (3) Logical Chunking, an intelligent strategy for documents exceeding a single context window–expressed naturally through SPL’s existing CTE syntax with no new constructs, decomposing a large query into a Map-Reduce pipeline that reduces attention cost from O(N^2) to O(N^2/k) and runs identically on cloud (parallel) or local hardware (sequential); (4) SPL-flow, a declarative agentic orchestration layer with resilient three-tier provider fallback; and (5) BENCHMARK for parallel multi-model comparison with automatic winner persistence. We provide a formal EBNF grammar, two pip-installable Python packages (spl-llm, spl-flow), and comparison against Prompty, DSPy, and LMQL. SPL reduces prompt boilerplate by 65% on average, surfaces a 68x cost spread across model tiers as a pre-execution signal, and runs the identical .spl script at 0.002 on OpenRouter or at zero marginal cost on a local Ollama instance–without modification.

[NLP-47] ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces

【速读】: 该论文旨在解决多模型编排(multi-model orchestration)在可审计条件下的任务路由问题,即如何根据任务复杂度和模型输出一致性动态选择单模型、双模型或三模型执行模式,以提升整体准确率并减少冗余计算。其解决方案的关键在于提出ACAR(Adaptive Complexity and Attribution Routing)框架,该框架基于N=3探针样本的自一致性方差(sigma)进行路由决策,无需学习组件且具备模型无关性;同时,系统构建于TEAMLLM这一确定性执行底座之上,确保不可变产物与完整决策轨迹,从而实现可审计的多模型协作。实验表明,该方法在多个基准测试中达到55.6%准确率,优于两模型基线(54.4%),并在54.2%的任务中避免了全集成计算。

链接: https://arxiv.org/abs/2602.21231
作者: Ramchand Kumaresan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 9 figures. Measurement framework for adaptive multi-model routing with auditable execution traces

点击查看摘要

Abstract:We present ACAR (Adaptive Complexity and Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. The system is implemented on top of TEAMLLM, a deterministic execution substrate with immutable artifacts and complete decision traces. We evaluate ACAR on 1,510 tasks spanning four benchmarks: MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA, using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing more than 7,550 auditable runs. Results show that sigma-based routing achieves 55.6 percent accuracy, exceeding the two-model baseline of 54.4 percent while avoiding full ensembling on 54.2 percent of tasks. The routing mechanism is model-agnostic and requires no learned components. We also document negative results. First, retrieval augmentation reduced accuracy by 3.4 percentage points, as median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces noise rather than grounding. Second, when models agree on incorrect answers (sigma equals zero), no downstream ensemble can recover; this agreement-but-wrong failure mode is intrinsic to self-consistency and bounds achievable accuracy at approximately eight percentage points below full ensembling. Third, attribution estimates based on proxy signals such as response similarity and entropy showed weak correlation with ground-truth leave-one-out values, indicating that practical attribution requires explicit counterfactual computation. This work documents which assumptions fail in practice and provides falsifiable baselines for future research on routing, retrieval, and multi-model attribution.

[NLP-48] RACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents WWW2026

【速读】: 该论文旨在解决深度研究代理(Deep Research Agents)评估中的关键挑战,即传统基于结果的指标(如Pass@1)无法捕捉其复杂推理过程的质量、效率和鲁棒性,且静态基准难以量化代理的潜在能力。解决方案的核心是提出TRACE(Trajectory-Aware Comprehensive Evaluation)框架,其关键创新在于:一是设计分层轨迹效用函数(Hierarchical Trajectory Utility Function),从证据锚定(evidence grounding)、准确性、效率等多个维度量化推理过程质量;二是引入分层能力评估协议(Scaffolded Capability Assessment),通过测量达成任务所需的最小引导程度来量化代理的潜在能力。该框架有效避免了“高分幻觉”,实现了对代理性能更全面、细致的评估。

链接: https://arxiv.org/abs/2602.21230
作者: Yanyu Chen,Jiyue Jiang,Jiahong Liu,Yifei Zhang,Xiao Guo,Irwin King
机构: The Chinese University of Hong Kong(香港中文大学)
类目: Computation and Language (cs.CL)
备注: Accepted by WWW 2026

点击查看摘要

Abstract:The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a “high-score illusion” that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the “high-score illusion”, we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent’s latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.

[NLP-49] ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对复杂指令时,因缺乏对隐含推理结构的理解而导致指令遵循能力不足的问题。其核心挑战在于处理那些包含隐式推理、复杂逻辑关系和多约束依赖的指令。解决方案的关键在于提出 ImpRIF 方法,将此类复杂指令形式化为可验证的推理图(verifiable reasoning graphs),从而实现程序化验证与图驱动的链式思维(chain-of-thought reasoning)。在此基础上,研究者构建了大规模单轮与多轮数据集,采用基于推理图的微调策略,并引入强化学习显式训练模型沿推理图进行推理,显著提升了模型对复杂指令的遵循能力。

链接: https://arxiv.org/abs/2602.21228
作者: Yuancheng Yang,Lin Yang,Xu Wang,Chao Tong,Haihua Yang
机构: ByteDance China (字节跳动中国); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. Therefore we target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs’ understanding of implicit reasoning instructions, thereby improving its ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.

[NLP-50] Budget-Aware Agent ic Routing via Boundary-Guided Training

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为自主代理(autonomous agents)执行长程工作流时,因在每一步都调用高能力模型而导致的经济不可持续性问题。其核心挑战在于:传统模型路由方法适用于单轮查询,而代理路由具有序列依赖性和累积误差特性,且反馈通常仅在任务结束时出现,同时部署场景常要求严格的单任务预算限制。解决方案的关键在于提出预算感知的代理路由(Budget-Aware Agentic Routing)框架,通过引入边界引导训练(Boundary-Guided Training),利用“始终使用小模型”和“始终使用大模型”两种边界策略构建难度分类体系,并基于此锚定稀疏奖励下的学习过程;进一步采用边界相对奖励与参考引导的优势估计相结合的边界引导策略优化(Boundary-Guided Policy Optimization, BoPO),有效避免低质量廉价失败解的退化现象。实验表明,该方法显著提升了成本-成功率前沿效率,在保持强基线性能的同时大幅降低资源消耗,并能泛化至严格的推理期预算约束。

链接: https://arxiv.org/abs/2602.21227
作者: Caiqi Zhang,Menglin Xia,Xuchao Zhang,Daniel Madrigal,Ankur Mallick,Samuel Kessler,Victor Ruehle,Saravan Rajmohan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) evolve into autonomous agents that execute long-horizon workflows, invoking a high-capability model at every step becomes economically unsustainable. While model routing is effective for single-turn queries, agentic routing is a sequential, path-dependent problem: early mistakes compound, feedback is often at the end of the episode, and deployments often demand strict per-task spending limits. We propose Budget-Aware Agentic Routing, which selects between a cheap and an expensive model at each step to optimize the cost–success frontier and to operate under strict per-task budgets. We propose Boundary-Guided Training, which leverages two boundary policies (always-small vs.\ always-large) to build a difficulty taxonomy and to anchor learning under sparse rewards. Our approach warms start with boundary-guided SFT data synthesis via stratified sampling of cost-efficient trajectories, then applies Boundary-Guided Policy Optimization (BoPO), combining boundary-relative rewards with a reference-guided advantage to avoid degenerate cheap-failure solutions. Experiment results show that our method improves the efficiency frontier, matching strong routing baselines at substantially lower cost while demonstrating generalization to strict inference-time budget constraints. Overall, our work establishes a foundational framework for agentic routing, shifting the paradigm from static model selection to dynamic, budget-aware sequential decision-making.

[NLP-51] IslamicLegalBench: Evaluating LLM s Knowledge and Reasoning of Islamic Law Across 1200 Years of Islamic Pluralist Legal Traditions

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在处理伊斯兰法(Islamic law)推理任务时的可靠性问题,即这些模型是否能够准确、一致地进行伊斯兰法律判断。其解决方案的关键在于构建了首个系统性评估框架——IslamicLegalBench,涵盖七个伊斯兰法学派别(madhabs)、13类不同复杂度的任务以及718个实例,从而对九种前沿LLM进行全面评测。结果显示,现有模型在基础事实准确性上存在显著不足(最佳模型仅达68%正确率,且有21%幻觉),尤其在中等复杂度任务中错误率最高,而高复杂度任务则展现出基于语义推理的能力;此外,多数模型对错误前提容忍度高(6/9模型接受误导性假设比例超40%),表明单纯依赖提示工程无法弥补知识缺失。该基准为未来开发更可靠、可信赖的宗教法律AI工具提供了关键评估依据和改进方向。

链接: https://arxiv.org/abs/2602.21226
作者: Ezieddin Elmahjub,Junaid Qadir,Abdullah Mushtaq,Rafay Naeem,Ibrahim Ghaznavi,Waleed Iqbal
机构: Qatar University (卡塔尔大学); Information Technology University (信息技术大学); Northeastern University (东北大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This manuscript has been submitted for review to Artificial Intelligence \ Law

点击查看摘要

Abstract:As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by 1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.

[NLP-52] Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal

【速读】: 该论文旨在解决不同架构的文档理解模型在训练效率上的差异问题,特别是探究渐进式数据调度(progressive data scheduling)是否能为各类模型带来一致的计算效率提升。其核心解决方案是采用一种分阶段增加训练数据暴露比例的课程学习策略(从33% → 67% → 100%),并通过对比匹配计算量的基线(matched-compute baselines)来区分调度效应与单纯计算资源减少的影响。关键发现表明:对于参数受限的纯文本模型(如BERT),该调度策略可显著缩短训练时间(约33%)并提升性能;而对于具备多模态表示和更强归纳偏置的模型(如LayoutLMv3),则无明显优势;此外,在任务复杂度较高或达到性能上限时(如CORD数据集),调度策略不再产生差异。这说明渐进式调度是一种可靠的计算节约策略,但其有效性依赖于模型容量与任务复杂度之间的交互关系。

链接: https://arxiv.org/abs/2602.21225
作者: Mohammed Hamdan,Vincenzo Dentamaro,Giuseppe Pirlo,Mohamed Cheriet
机构: Synchromedia Laboratory, École de Technologie Supérieure (ÉTS), 1100 Notre-Dame St W, Montreal, QC H3C 1K3, Canada; Department of Computer Science, University of Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We investigate whether progressive data scheduling – a curriculum learning strategy that incrementally increases training data exposure (33% \rightarrow 67% \rightarrow 100%) – yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33%, commensurate with the reduction from 6.67 to 10.0 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT ( \Delta F1 = +0.023, p=0.022 , d_z=3.83 ), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 ( p=0.621 ), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores ( \geq 0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.

[NLP-53] Make Every Draft Count: Hidden State based Speculative Decoding

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于推测解码(speculative decoding)的计算效率低下问题:在传统推测解码过程中,轻量级草稿模型(draft model)生成的大部分候选 token 会在目标模型验证阶段被丢弃,导致大量计算资源浪费。为应对这一挑战,论文提出了一种创新系统,其核心在于将被丢弃的草稿隐藏状态(hidden states)转化为可复用的语义信息,从而回收原本浪费的计算。关键解决方案包括:1)设计基于自回归隐藏状态的草稿模型架构,以保留比基于 token 的草稿模型更丰富的语义信息;2)引入高效的 token 信息注入机制,利用该模型构建高质量的草稿 token 树,并支持从验证失败中重采样 token;3)消除设计中的额外开销,最大化硬件利用率。实验表明,该方法相较标准推测解码最高可实现 3.3 倍加速。

链接: https://arxiv.org/abs/2602.21224
作者: Yuetao Chen,Xuliang Wang,Xinzhou Zheng,Ming Li,Peng Wang,Hong Xu
机构: The Chinese University of Hong Kong (香港中文大学); University of Waterloo (滑铁卢大学); University of Science and Technology of China (中国科学技术大学); Unaffiliated
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, resulting in waste of computation. Motivated by the goal of recollecting this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden states level and postpone the integrating token information after the hidden states generation, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse. To implement such a system, first we introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than token-based drafters to facilitate draft repurposing. Second, we design an efficient token information injection mechanism that leverages our specialized draft model to construct high-quality draft token trees and enables resampling tokens from verification failures. Third, we eliminate the overhead hidden in our design to further maximize hardware utilization. We conducted extensive evaluations against various baselines, demonstrating up to a 3.3x speedup against standard speculative decoding.

[NLP-54] Measuring Prag matic Influence in Large Language Model Instructions

【速读】: 该论文旨在解决如何系统性测量和理解“语用框架”(pragmatic framing)对大型语言模型(Large Language Models, LLMs)指令遵循行为的影响问题。以往研究多关注提示优化或将其视为安全漏洞,但未将语用框架本身作为可量化、可预测的指令执行属性进行建模。其解决方案的关键在于提出一个包含三个创新组件的框架:首先通过“指令-框架分解”(directive-framing decomposition)将语境线索与任务规范分离;其次构建涵盖400个实例的分类体系,将框架策略归纳为13种策略并归属至4个机制簇;最后引入基于优先级的测量方法,通过可观测的指令优先级变化来量化语用框架的影响。实验证明,不同家族和规模的五种LLM均表现出一致且结构化的优先级偏移,表明语用框架是影响指令执行的可测量且可预测因素。

链接: https://arxiv.org/abs/2602.21223
作者: Yilin Geng,Omri Abend,Eduard Hovy,Lea Frermann
机构: University of Melbourne (墨尔本大学); Hebrew University of Jerusalem (耶路撒冷希伯来大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:It is not only what we ask large language models (LLMs) to do that matters, but also how we prompt. Phrases like “This is urgent” or “As your supervisor” can shift model behavior without altering task content. We study this effect as pragmatic framing, contextual cues that shape directive interpretation rather than task specification. While prior work exploits such cues for prompt optimization or probes them as security vulnerabilities, pragmatic framing itself has not been treated as a measurable property of instruction following. Measuring this influence systematically remains challenging, requiring controlled isolation of framing cues. We introduce a framework with three novel components: directive-framing decomposition separating framing context from task specification; a taxonomy organizing 400 instantiations of framing into 13 strategies across 4 mechanism clusters; and priority-based measurement that quantifies influence through observable shifts in directive prioritization. Across five LLMs of different families and sizes, influence mechanisms cause consistent and structured shifts in directive prioritization, moving models from baseline impartiality toward favoring the framed directive. This work establishes pragmatic framing as a measurable and predictable factor in instruction-following systems.

[NLP-55] ask-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases

【速读】: 该论文旨在解决如何在不重新训练模型的前提下,高效地将多个专用LoRA(Low-Rank Adaptation)适配器动态组合以应对未见过的自然语言处理(Natural Language Processing, NLP)任务的问题。其解决方案的关键在于构建一个基于任务感知的向量数据库,该库通过嵌入来自22个不同数据集(涵盖常识推理、问答、自然语言推断和情感分析等任务)的训练样例形成表征;在推理阶段,利用相似性检索从该数据库中获取最相关的训练样本,并通过核采样(nucleus sampling)计算任务相似度分布,进而采用检索加权融合策略动态合并相关LoRA适配器。这种方法无需额外训练检索器、保持嵌入冻结,实现了参数高效且可解释的多任务适配器组合,在多个下游任务上达到或超越单独微调的性能表现。

链接: https://arxiv.org/abs/2602.21222
作者: Riya Adsul,Balachandra Devarangadi Sunil,Isha Nalawade,Sudharshan Govindan
机构: University of Massachusetts (马萨诸塞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parameter efficient fine tuning methods like LoRA have enabled task specific adaptation of large language models, but efficiently composing multiple specialized adapters for unseen tasks remains challenging. We present a novel framework for dynamic LoRA adapter composition that leverages similarity retrieval in vector databases to enable zero-shot generalization across diverse NLP tasks. Our approach constructs a task-aware vector database by embedding training examples from 22 datasets spanning commonsense reasoning, question answering, natural language inference, and sentiment analysis. At inference time, we retrieve the most similar training examples, compute task similarity distributions via nucleus sampling, and dynamically merge relevant LoRA adapters using retrieval weighted fusion strategies. We evaluated four merging methods Linear, Concatenation, TIES, and Magnitude Prune demonstrating that our dataset centric retrieval approach often matches or exceeds the performance of individually fine-tuned task-specific adapters. Notably, Linear merging achieves 70.95% on PIQA and 77.62% on RTE, substantially outperforming single-task baselines (46% and 52%, respectively). Our framework requires no additional retriever training, operates with frozen embeddings, and enables efficient, interpretable adapter composition. These results suggest that retrieval based dynamic merging offers a promising direction for scalable, parameter-efficient multitask learning without requiring full model retraining for each new task.

[NLP-56] Latent Context Compilation: Distilling Long Context into Compact Portable Memory

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文部署中面临的效率瓶颈问题,即现有方法在压缩上下文时难以兼顾分布外泛化能力(amortized compression)与高效推理(Test-Time Training),后者还存在合成数据成本高和需修改模型权重的问题。解决方案的关键在于提出“潜在上下文编译”(Latent Context Compilation)框架,通过引入一个一次性使用的低秩适配(LoRA)模块作为编译器,将长上下文蒸馏为紧凑的缓冲标记(buffer tokens),这些标记是无状态、可移植的记忆单元,可直接接入冻结的基础模型;其核心创新在于设计了一种自对齐优化策略,无需依赖特定于上下文的问答对,而是利用上下文无关的随机查询正则化上下文重建任务,迫使压缩后的标记位于模型已有的指令遵循流形内,从而在保持细粒度细节和推理能力的同时,实现内存密度与模型参数的解耦,实验表明在16倍压缩比下仍能有效维持性能。

链接: https://arxiv.org/abs/2602.21221
作者: Zeju Li,Yizhou Zhou,Qiang Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that complicate concurrent serving. We propose Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens – stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. Crucially, we introduce a self-aligned optimization strategy that eliminates the need for synthetic context-relevant QA pairs. By regularizing context reconstruction task with context-agnostic random queries, we force compressed tokens to reside within the model’s existing instruction-following manifold. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities where prior methods falter, effectively decoupling memory density from model parameters even at a 16x compression ratio.

[NLP-57] Field-Theoretic Memory for AI Agents : Continuous Dynamics for Context Preservation

【速读】: 该论文旨在解决传统AI代理记忆系统中信息存储与更新效率低、难以支持长对话和多轮推理的问题。现有方法通常将记忆视为离散条目,导致语义关联弱、动态适应性差。其解决方案的关键在于引入基于偏微分方程(partial differential equations, PDEs)的连续场理论建模方式,将记忆表示为在语义空间中扩散、依重要性热力学衰减并可通过场耦合相互作用的连续场,从而实现更自然的语义传播与多智能体协同记忆整合。该方法在LoCoMo和LongMemEval等长上下文基准测试中显著提升了多轮推理、时间推理及知识更新召回率,并在多智能体场景下实现了接近完美的集体智能(99.8%)。

链接: https://arxiv.org/abs/2602.21220
作者: Subhadip Mitra
机构: Rotalabs(罗塔实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 6 figures. Code: this https URL

点击查看摘要

Abstract:We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi-agent scenarios. We evaluate the system on two established long-context benchmarks: LoCoMo (ACL 2024) with 300-turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi-session reasoning over 500+ turns. On LongMemEval, the field-theoretic approach achieves significant improvements: +116% F1 on multi-session reasoning (p0.01, d= 3.06), +43.8% on temporal reasoning (p0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p0.001, d= 5.00). Multi-agent experiments show near-perfect collective intelligence (99.8%) through field coupling. Code is available at this http URL.

[NLP-58] Reasoning -Based Personalized Generation for Users with Sparse Data

【速读】: 该论文旨在解决在用户交互历史稀疏(如冷启动用户或新注册用户)情况下,大型语言模型(Large Language Model, LLM)难以实现有效个性化文本生成的问题。解决方案的关键在于提出一种名为GraSPer(Graph-based Sparse Personalized Reasoning)的框架:首先通过预测用户未来可能交互的项目来增强用户上下文;接着利用推理对齐机制生成这些假设交互对应的文本以丰富增强后的上下文;最终基于真实与合成的历史共同生成个性化输出,从而确保生成内容与用户风格和偏好保持一致。

链接: https://arxiv.org/abs/2602.21219
作者: Bo Ni,Branislav Kveton,Samyadeep Basu,Subhojyoti Mukherjee,Leyao Wang,Franck Dernoncourt,Sungchul Kim,Seunghyun Yoon,Zichao Wang,Ruiyi Zhang,Puneet Mathur,Jihyung Kil,Jiuxiang Gu,Nedim Lipka,Yu Wang,Ryan A. Rossi,Tyler Derr
机构: Vanderbilt University (范德比尔特大学); Adobe Research (Adobe 研究院); Yale University (耶鲁大学); University of Oregon (俄勒冈大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) personalization holds great promise for tailoring responses by leveraging personal context and history. However, real-world users usually possess sparse interaction histories with limited personal context, such as cold-start users in social platforms and newly registered customers in online E-commerce platforms, compromising the LLM-based personalized generation. To address this challenge, we introduce GraSPer (Graph-based Sparse Personalized Reasoning), a novel framework for enhancing personalized text generation under sparse context. GraSPer first augments user context by predicting items that the user would likely interact with in the future. With reasoning alignment, it then generates texts for these interactions to enrich the augmented context. In the end, it generates personalized outputs conditioned on both the real and synthetic histories, ensuring alignment with user style and preferences. Extensive experiments on three benchmark personalized generation datasets show that GraSPer achieves significant performance gain, substantially improving personalization in sparse user context settings.

[NLP-59] EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors

【速读】: 该论文旨在解决隐私保护下合成文本数据生成效率低的问题,即现有私有文本生成方法在数据使用、计算速度和隐私预算分配上存在严重瓶颈,尤其在小数据场景下难以保证生成质量。其解决方案的关键在于提出EPSVec,一种基于差分隐私的轻量级方法,通过提取并净化一次性的“数据集向量”(dataset vectors)——即激活空间中反映私有数据与公共先验分布差异的方向——来引导大语言模型(LLM)的生成过程。该方法将隐私预算与生成过程解耦,实现无需额外隐私成本即可生成任意数量的合成样本,并在低数据条件下仍保持高保真度,同时结合预训练模型和固定提示(fixed-shot prompting)提升多样性与生成质量。

链接: https://arxiv.org/abs/2602.21218
作者: Amin Banayeeanzade,Qingchuan Yang,Deqing Fu,Spencer Hong,Erin Babinsky,Alfy Samuel,Anoop Kumar,Robin Jia,Sai Praneeth Karimireddy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a differentially-private lightweight alternative that steers LLM generation using dataset vectors–directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.

[NLP-60] Applied Sociolinguistic AI for Community Development (ASA-CD): A New Scientific Paradigm for Linguistically-Grounded Social Intervention

【速读】: 该论文旨在解决社区发展过程中因语言分歧导致的社会排斥与集体行动障碍问题,其核心挑战在于如何通过人工智能技术识别并干预具有破坏性的语言模式。解决方案的关键在于提出了一种新的科学范式——面向社区发展的应用社会语言学人工智能(Applied Sociolinguistic AI for Community Development, ASA-CD),其三大核心贡献包括:(1)将语言生物标志物(linguistic biomarkers)作为话语碎片化的计算指标;(2)发展以社区发展目标为导向的自然语言处理(development-aligned natural language processing, NLP),强调AI优化应服务于集体福祉;(3)建立标准化的五阶段话语干预流程,实现可复制、可扩展的干预机制。该框架为实现价值对齐、可扩展的人工智能赋能社区提供了方法论、伦理和实证基础。

链接: https://arxiv.org/abs/2602.21217
作者: S M Ruhul Alam,Rifa Ferzana
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 2 figures, 3 tables; simulation-based study introducing the ASA-CD framework

点击查看摘要

Abstract:This paper establishes Applied Sociolinguistic AI for Community Development (ASA-CD) as a novel scientific paradigm for addressing community challenges through linguistically grounded, AI-enabled intervention. ASA-CD introduces three key contributions: (1) linguistic biomarkers as computational indicators of discursive fragmentation; (2) development-aligned natural language processing (NLP), an AI optimisation paradigm prioritising collective outcomes; and (3) a standardised five-phase protocol for discursive intervention. A proof-of-concept study, incorporating real-world and synthetic corpora, demonstrates systematic associations between exclusionary language and negative sentiment and simulates intervention-based improvements. ASA-CD provides a unified methodological, ethical and empirical framework for scalable, value-aligned AI in the service of community empowerment.

[NLP-61] EQ-5D Classification Using Biomedical Entity-Enriched Pre-trained Language Models and Multiple Instance Learning

【速读】: 该论文旨在解决系统性文献综述(Systematic Literature Reviews, SLRs)中自动化识别使用EQ-5D量表的文献时存在的效率低、准确性差的问题。传统方法依赖人工筛选大量文献,耗时且易出错。解决方案的关键在于通过细调通用预训练语言模型(如BERT)和领域特定模型(如SciBERT、BioBERT),并结合scispaCy提取的生物医学实体信息对句子进行增强,从而提升模型对EQ-5D相关表述的检测能力;进一步采用多实例学习(Multiple Instance Learning, MIL)与注意力池化机制,将句子级预测聚合为研究级判断,显著提高了F1分数(达0.82)和召回率,优于传统词袋模型及现有预训练语言模型基线。

链接: https://arxiv.org/abs/2602.21216
作者: Zhyar Rzgar K Rostam,Gábor Kertész
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 tables

点击查看摘要

Abstract:The EQ-5D (EuroQol 5-Dimensions) is a standardized instrument for the evaluation of health-related quality of life. In health economics, systematic literature reviews (SLRs) depend on the correct identification of publications that use the EQ-5D, but manual screening of large volumes of scientific literature is time-consuming, error-prone, and inconsistent. In this study, we investigate fine-tuning of general-purpose (BERT) and domain-specific (SciBERT, BioBERT) pre-trained language models (PLMs), enriched with biomedical entity information extracted through scispaCy models for each statement, to improve EQ-5D detection from abstracts. We conduct nine experimental setups, including combining three scispaCy models with three PLMs, and evaluate their performance at both the sentence and study levels. Furthermore, we explore a Multiple Instance Learning (MIL) approach with attention pooling to aggregate sentence-level information into study-level predictions, where each abstract is represented as a bag of enriched sentences (by scispaCy). The findings indicate consistent improvements in F1-scores (reaching 0.82) and nearly perfect recall at the study-level, significantly exceeding classical bag-of-words baselines and recently reported PLM baselines. These results show that entity enrichment significantly improves domain adaptation and model generalization, enabling more accurate automated screening in systematic reviews.

[NLP-62] Inference-time Alignment via Sparse Junction Steering

【速读】: 该论文旨在解决现有基于token级干预的推理时对齐方法中存在的计算开销大和生成质量下降的问题。当前方法依赖于在每个解码步骤进行密集干预,不仅效率低下,还可能因过度偏离模型固有输出分布而影响生成效果。解决方案的关键在于提出稀疏推理时对齐(Sparse Inference time Alignment, SIA),通过仅在生成轨迹中的高熵关键决策点进行干预来实现高效对齐。研究发现,高熵节点是生成路径中易发生偏移的敏感点,此时引入对齐奖励信号能显著提升控制精度;实验表明,仅干预20%至80%的token即可获得优于密集干预的对齐效率权衡,且在强基座模型(如Qwen3)上,干预20% token的效果可媲美甚至超越经过大量后训练的指令微调模型,同时降低高达6倍的计算成本。

链接: https://arxiv.org/abs/2602.21215
作者: Runyi Hu,Jie Zhang,Shiqian Zhao,Jiale Meng,Jiwei Li,Jason Zeng,Ming Wu,Michael Heinrich,Yonggang Wen,Tianwei Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages, 17 figures

点击查看摘要

Abstract:Token-level steering has emerged as a pivotal approach for inference-time alignment, enabling fine grained control over large language models by modulating their output distributions without parameter updates. While effective, existing methods rely on dense intervention at every decoding step. This persistent manipulation not only incurs substantial computational overhead but also risks compromising generation quality by excessively drifting from the model’s intrinsic distribution. In this work, we show that dense intervention is unnecessary and propose Sparse Inference time Alignment (SIA), which performs sparse junction steering by intervening only at critical decision points along the generation trajectory. Our key insight is that high entropy junctions mark pivotal decision points in the generation trajectory and are particularly susceptible to misalignment, indicating the need to introduce alignment related reward signals at these points. Extensive experiments across different model families and alignment objectives show that steering only 20% to 80% of tokens achieves superior alignment-efficiency trade offs. For strong base models such as Qwen3, intervening on as few as 20% of tokens matches or even surpasses heavily post-trained instruct models. This sparsity enables stronger guidance while better preserving the model’s native distribution, integrates seamlessly with search based methods such as Best-of-N, and reduces computational cost by up to 6x.

[NLP-63] G-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition LREC2026

【速读】: 该论文旨在解决低资源语言自动语音识别(ASR)中因标注数据稀缺而导致的性能瓶颈问题,尤其针对台湾闽南语(Taiwanese Hokkien)这类缺乏高质量转录文本的语言。其核心解决方案是提出一种基于翻译引导的ASR框架(TG-ASR),其关键技术在于引入并行门控交叉注意力机制(Parallel Gated Cross-Attention, PGCA),该机制能够自适应地融合多种辅助语言的多语言翻译嵌入至ASR解码器中,从而在保持优化稳定性的同时实现跨语言语义引导,有效提升低资源场景下的识别准确率。

链接: https://arxiv.org/abs/2602.22039
作者: Cheng-Yeh Yang,Chien-Chun Wang,Li-Wei Chen,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to LREC 2026

点击查看摘要

Abstract:Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.

[NLP-64] One Brain Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models

【速读】: 该论文旨在解决非侵入式脑成像中高频电磁信号(如EEG/MEG)与低频代谢信号(如fMRI)长期分离分析的问题,这种分离限制了对大脑活动的全面理解。其关键解决方案是提出NOBEL——一个神经多模态脑编码大型语言模型(LLM),通过统一编码器整合EEG与MEG信号,并设计新颖的双路径策略处理fMRI信号,将不同模态的脑信号和外部感官刺激映射到共享的语义嵌入空间中,再利用LLM作为通用骨干网络实现跨模态融合与解码。这一架构不仅提升了单模态任务的性能,还显著增强了电磁与代谢信号协同解码的准确性,验证了多模态神经信号的互补性,并实现了基于直接刺激输入的因果关联验证,推动了非侵入式脑解码的统一建模。

链接: https://arxiv.org/abs/2602.21522
作者: Changli Tang,Shurui Li,Junliang Wang,Qinfan Xiao,Zhonghao Zhai,Lei Bai,Yu Qiao,Bowen Zhou,Wen Wu,Yuanning Li,Chao Zhang
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deciphering brain function through non-invasive recordings requires synthesizing complementary high-frequency electromagnetic (EEG/MEG) and low-frequency metabolic (fMRI) signals. However, despite their shared neural origins, extreme discrepancies have traditionally confined these modalities to isolated analysis pipelines, hindering a holistic interpretation of brain activity. To bridge this fragmentation, we introduce \textbfNOBEL, a \textbfneuro-\textbfomni-modal \textbfbrain-\textbfencoding \textbflarge language model (LLM) that unifies these heterogeneous signals within the LLM’s semantic embedding space. Our architecture integrates a unified encoder for EEG and MEG with a novel dual-path strategy for fMRI, aligning non-invasive brain signals and external sensory stimuli into a shared token space, then leverages an LLM as a universal backbone. Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks. We also show that the synergistic fusion of electromagnetic and metabolic signals yields higher decoding accuracy than unimodal baselines, validating the complementary nature of multiple neural modalities. Furthermore, NOBEL exhibits strong capabilities in stimulus-aware decoding, effectively interpreting visual semantics from multi-subject fMRI data on the NSD and HAD datasets while uniquely leveraging direct stimulus inputs to verify causal links between sensory signals and neural responses. NOBEL thus takes a step towards unifying non-invasive brain decoding, demonstrating the promising potential of omni-modal brain understanding.

[NLP-65] MiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis

【速读】: 该论文旨在解决现有情感语音数据集普遍依赖于表演或实验室诱发情绪,难以反映真实情境下自发情感状态的问题。其解决方案的关键在于构建iMiGUE-Speech数据集,该数据集基于真实比赛结果自然引发的自发情感,补充了语音转录、说话人角色分离(采访者与受访者)及词级强制对齐等元数据,并可与原始iMiGUE数据集中微表情标注同步使用,形成多模态资源,从而支持从声学和语言模态共同捕捉自发情感状态的研究。

链接: https://arxiv.org/abs/2602.21464
作者: Sofoklis Kakouros,Fang Kang,Haoyu Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to Speech Prosody 2026

点击查看摘要

Abstract:This work presents iMiGUE-Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states. The new release focuses on speech and enriches the original dataset with additional metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. Unlike existing emotional speech datasets that rely on acted or laboratory-elicited emotions, iMiGUE-Speech captures spontaneous affect arising naturally from real match outcomes. To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset’s ability to capture spontaneous affective states from both acoustic and linguistic modalities. iMiGUE-Speech can also be synchronously paired with micro-gesture annotations from the original iMiGUE dataset, forming a uniquely multimodal resource for studying speech-gesture affective dynamics. The extended dataset is available at this https URL.

[NLP-66] Forecasting Future Language: Context Design for Mention Markets

【速读】: 该论文旨在解决生成式 AI(Generative AI)在提及市场(mention markets)中进行准确概率预测的问题,特别是如何设计输入上下文以提升大语言模型(Large Language Models, LLMs)的预测性能。其关键解决方案是提出市场条件提示(Market-Conditioned Prompting, MCP),该方法将市场隐含概率作为先验信息,并指导LLM基于文本证据对这一先验进行更新,而非从零开始重新预测基础概率。实验表明,MCP显著提升了预测校准性,而结合市场概率与MCP的混合策略(MixMCP)进一步优化了鲁棒性和准确性,优于单独依赖市场或LLM的基准方法。

链接: https://arxiv.org/abs/2602.21229
作者: Sumin Kim,Jihoon Kwon,Yoon Kim,Nicole Kagan,Raffi Khatchadourian,Wonbin Ahn,Alejandro Lopez-Lira,Jaewon Lee,Yoontae Hwang,Oscar Levy,Yongjae Lee,Chanyeol Choi
机构: LinqAlpha; Massachusetts Institute of Technology (麻省理工学院); Kalshi; IBM; LG AI Research; University of Florida; Seoul National University (首尔国立大学); Pusan National University (釜山国立大学); University of California, Berkeley (加州大学伯克利分校); UNIST (韩国科学技术院)
类目: General Finance (q-fin.GN); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

Abstract:Mention markets, a type of prediction market in which contracts resolve based on whether a specified keyword is mentioned during a future public event, require accurate probabilistic forecasts of keyword-mention outcomes. While recent work shows that large language models (LLMs) can generate forecasts competitive with human forecasters, it remains unclear how input context should be designed to support accurate prediction. In this paper, we study this question through experiments on earnings-call mention markets, which require forecasting whether a company will mention a specified keyword during its upcoming call. We run controlled comparisons varying (i) which contextual information is provided (news and/or prior earnings-call transcripts) and (ii) how \textitmarket probability, (i.e., prediction market contract price) is used. We introduce Market-Conditioned Prompting (MCP), which explicitly treats the market-implied probability as a prior and instructs the LLM to update this prior using textual evidence, rather than re-predicting the base rate from scratch. In our experiments, we find three insights: (1) richer context consistently improves forecasting performance; (2) market-conditioned prompting (MCP), which treats the market probability as a prior and updates it using textual evidence, yields better-calibrated forecasts; and (3) a mixture of the market probability and MCP (MixMCP) outperforms the market baseline. By dampening the LLM’s posterior update with the market prior, MixMCP yields more robust predictions than either the market or the LLM alone.

信息检索

[IR-0] LiCQA : A Lightweight Complex Question Answering System

【速读】:该论文旨在解决复杂问答(Complex Question Answering, CQA)问题,即答案分布在多个文档中的多跳推理任务。传统方法要么依赖知识图谱,要么使用计算成本高昂的神经网络模型,存在训练资源消耗大、数据需求量高等局限性。本文提出的LiCQA是一种无监督的问答模型,其核心创新在于主要基于语料库证据进行推理,无需大量标注数据或昂贵的训练过程,在保持高准确率的同时显著降低了延迟,实验证明其在基准数据集上优于两种最新有监督系统。

链接: https://arxiv.org/abs/2602.22182
作者: Sourav Saha,Dwaipayan Roy,Mandar Mitra
机构: Indian Statistical Institute (印度统计研究所); Indian Institute of Science Education and Research (印度科学教育与研究学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems. However, addressing complex questions, the answers to which are spread across multiple documents, remains a challenging problem. Recent QA systems that are designed to handle complex questions work either on the basis of knowledge graphs, or utilise contem- porary neural models that are expensive to train, in terms of both computational resources and the volume of training data required. In this paper, we present LiCQA, an unsupervised question answer- ing model that works primarily on the basis of corpus evidence. We empirically compare the effectiveness and efficiency of LiCQA with two recently presented QA systems, which are based on different underlying principles. The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.

[IR-1] Learning to Collaborate via Structures: Cluster-Guided Item Alignment for Federated Recommendation

【速读】:该论文旨在解决联邦推荐系统中因客户端与服务器之间频繁传输高维物品嵌入(item embeddings)而导致的通信效率低下问题,同时避免强制要求全局一致的嵌入坐标对齐所带来的局限性。其解决方案的关键在于提出Cluster-Guided FedRec(CGFedRec)框架,该框架通过将上传的嵌入映射为紧凑的聚类标签,使服务器作为全局结构发现者学习物品簇并仅分发标签,从而切断下游物品嵌入的传输路径;这种机制允许每个客户端在保持全局语义结构约束的前提下本地灵活调整物品表示,有效注入全局协同信号而不依赖共享嵌入,显著提升通信效率并维持推荐精度。

链接: https://arxiv.org/abs/2602.21957
作者: Yuchun Tu,Zhiwei Li,Bingli Sun,Yixuan Li,Xiao Song
机构: Beihang University (北京航空航天大学); University of Technology Sydney (悉尼科技大学); University of Edinburgh (爱丁堡大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:Federated recommendation facilitates collaborative model training across distributed clients while keeping sensitive user interaction data local. Conventional approaches typically rely on synchronizing high-dimensional item representations between the server and clients. This paradigm implicitly assumes that precise geometric alignment of embedding coordinates is necessary for collaboration across clients. We posit that establishing relative semantic relationships among items is more effective than enforcing shared representations. Specifically, global semantic relations serve as structural constraints for items. Within these constraints, the framework allows item representations to vary locally on each client, which flexibility enables the model to capture fine-grained user personalization while maintaining global consistency. To this end, we propose Cluster-Guided FedRec framework (CGFedRec), a framework that transforms uploaded embeddings into compact cluster labels. In this framework, the server functions as a global structure discoverer to learn item clusters and distributes only the resulting labels. This mechanism explicitly cuts off the downstream transmission of item embeddings, relieving clients from maintaining global shared item embeddings. Consequently, CGFedRec achieves the effective injection of global collaborative signals into local item representations without transmitting full embeddings. Extensive experiments demonstrate that our approach significantly improves communication efficiency while maintaining superior recommendation accuracy across multiple datasets.

[IR-2] Offline Reasoning for Efficient Recommendation: LLM -Empowered Persona-Profiled Item Indexing

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的推荐系统在在线推理阶段存在高延迟的问题,即LLM在实时排序过程中需要消耗大量计算资源,限制了其在实际场景中的部署。解决方案的关键在于提出Persona4Rec框架,通过离线阶段利用LLM对商品评论进行推理,构建可解释的“人格化”(persona)表示——这些表示捕捉了不同用户群体可能与某物品产生交互的多样化动机,并将其作为物品侧的多视角表征;在线阶段则通过用户画像与最匹配的物品人格表示之间的对齐来实现快速相关性计算,从而避免重复调用昂贵的LLM推理,显著降低延迟并保持推荐性能。

链接: https://arxiv.org/abs/2602.21756
作者: Deogyong Kim,Junseong Lee,Jeongeun Lee,Changhoe Kim,Junguel Lee,Jungseok Lee,Dongha Lee
机构: Yonsei University (延世大学); NAVER (NAVER)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Recent advances in large language models (LLMs) offer new opportunities for recommender systems by capturing the nuanced semantics of user interests and item characteristics through rich semantic understanding and contextual reasoning. In particular, LLMs have been employed as rerankers that reorder candidate items based on inferred user-item relevance. However, these approaches often require expensive online inference-time reasoning, leading to high latency that hampers real-world deployment. In this work, we introduce Persona4Rec, a recommendation framework that performs offline reasoning to construct interpretable persona representations of items, enabling lightweight and scalable real-time inference. In the offline stage, Persona4Rec leverages LLMs to reason over item reviews, inferring diverse user motivations that explain why different types of users may engage with an item; these inferred motivations are materialized as persona representations, providing multiple, human-interpretable views of each item. Unlike conventional approaches that rely on a single item representation, Persona4Rec learns to align user profiles with the most plausible item-side persona through a dedicated encoder, effectively transforming user-item relevance into user-persona relevance. At the online stage, this persona-profiled item index allows fast relevance computation without invoking expensive LLM reasoning. Extensive experiments show that Persona4Rec achieves performance comparable to recent LLM-based rerankers while substantially reducing inference time. Moreover, qualitative analysis confirms that persona representations not only drive efficient scoring but also provide intuitive, review-grounded explanations. These results demonstrate that Persona4Rec offers a practical and interpretable solution for next-generation recommender systems.

[IR-3] rie-Aware Transformers for Generative Recommendation

【速读】:该论文针对生成式推荐(Generative Recommendation, GR)中因标准自回归建模忽略物品层次结构而导致的性能瓶颈问题展开研究。现有方法通常将物品通过层次化分词(hierarchical tokenization)映射为离散token序列,并采用Transformer进行自回归生成,但传统Transformer在处理token时将其视为线性流,未能利用由分词过程形成的前缀树(trie)结构所蕴含的拓扑信息。解决方案的关键在于提出TrieRec,其核心创新是引入两种位置编码机制以显式建模trie结构:一是trie-aware绝对位置编码,将每个token(节点)的局部结构上下文(如深度、祖先与后代)融入表示;二是topology-aware相对位置编码,在自注意力机制中注入成对的结构关系,从而捕捉由拓扑结构诱导的语义相关性。该方法无需调整模型架构,具备模型无关性、高效性和无超参数特性,在三个代表性GR骨干网络上均取得显著提升(平均提升8.83%)。

链接: https://arxiv.org/abs/2602.21677
作者: Zhenxiang Xu,Jiawei Chen,Sirui Chen,Yong He,Jieyu Yang,Chuan Yuan,Ke Ding,Can Wang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative recommendation (GR) aligns with advances in generative AI by casting next-item prediction as token-level generation rather than score-based ranking. Most GR methods adopt a two-stage pipeline: (i) \textititem tokenization, which maps each item to a sequence of discrete, hierarchically organized tokens; and (ii) \textitautoregressive generation, which predicts the next item’s tokens conditioned on the tokens of user’s interaction history. Although hierarchical tokenization induces a prefix tree (trie) over items, standard autoregressive modeling with conventional Transformers often flattens item tokens into a linear stream and overlooks the underlying topology. To address this, we propose TrieRec, a trie-aware generative recommendation method that augments Transformers with structural inductive biases via two positional encodings. First, a \textittrie-aware absolute positional encoding aggregates a token’s (node’s) local structural context (\eg depth, ancestors, and descendants) into the token representation. Second, a \textittopology-aware relative positional encoding injects pairwise structural relations into self-attention to capture topology-induced semantic relatedness. TrieRec is also model-agnostic, efficient, and hyperparameter-free. In our experiments, we implement TrieRec within three representative GR backbones, achieving notably improvements of 8.83% on average across four real-world datasets. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2602.21677 [cs.IR] (or arXiv:2602.21677v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.21677 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-4] AQR-HNSW: Accelerating Approximate Nearest Neighbor Search via Density-aware Quantization and Multi-stage Re-ranking

【速读】:该论文旨在解决HNSW(Hierarchical Navigable Small World)图在大规模向量数据库场景下面临的三大瓶颈问题:内存消耗过高、距离计算开销主导查询延迟,以及在异构数据分布上性能不佳。其解决方案的关键在于提出AQR-HNSW框架,通过三项协同优化策略实现突破:(1) 密度感知的自适应量化(density-aware adaptive quantization),在保持距离关系的前提下实现4倍压缩;(2) 多状态重排序机制(multi-state re-ranking),减少35%的无效计算;(3) 量化优化的SIMD并行实现,在不同架构上达到每周期16–64次操作,显著提升计算效率。实验表明,该方案在标准基准测试中相较现有最优HNSW实现可提升2.5–3.3倍查询吞吐量(QPS),同时维持98%以上召回率,并实现索引图75%的内存压缩和5倍的构建速度提升。

链接: https://arxiv.org/abs/2602.21600
作者: Ganap Ashit Tewary,Nrusinga Charan Gantayat,Jeff Zhang
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted at DAC 2026

点击查看摘要

Abstract:Approximate Nearest Neighbor (ANN) search has become fundamental to modern AI infrastructure, powering recommendation systems, search engines, and large language models across industry leaders from Google to OpenAI. Hierarchical Navigable Small World (HNSW) graphs have emerged as the dominant ANN algorithm, widely adopted in production systems due to their superior recall versus latency balance. However, as vector databases scale to billions of embeddings, HNSW faces critical bottlenecks: memory consumption expands, distance computation overhead dominates query latency, and it suffers suboptimal performance on heterogeneous data distributions. This paper presents Adaptive Quantization and Rerank HNSW (AQR-HNSW), a novel framework that synergistically integrates three strategies to enhance HNSW scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4x compression while preserving distance relationships; (2) multi-state re-ranking that reduces unnecessary computations by 35%; and (3) quantization-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3x higher queries per second (QPS) than state-of-the-art HNSW implementations while maintaining over 98% recall, with 75% memory reduction for the index graph and 5x faster index construction.

[IR-5] Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access

【速读】:该论文旨在解决公共信息服务系统中存在的碎片化、格式不一致和信息过时等问题,这些问题导致在低资源检索环境中难以及时获取关键公共服务。针对食品援助点(food pantry)这一社会紧迫问题,作者提出了一种基于人工智能的对话式检索系统,其核心在于通过网络爬取和索引公开的援助点数据,并采用检索增强生成(Retrieval-Augmented Generation, RAG)流水线来支持自然语言查询。该方案的关键创新在于将RAG技术与现实场景中的社区来源查询相结合,从而提升用户对关键公共资源的可及性,同时揭示了当前检索系统在鲁棒性、查询模糊处理和知识库一致性方面的局限性,为未来面向低资源环境的健壮对话式检索研究提供了方向。

链接: https://arxiv.org/abs/2602.21598
作者: Touseef Hasan,Laila Cure,Souvika Sarkar
机构: Wichita State University (威奇托州立大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 3 pages, 1 figure

点击查看摘要

Abstract:Public service information systems are often fragmented, inconsistently formatted, and outdated. These characteristics create low-resource retrieval environments that hinder timely access to critical services. We investigate retrieval challenges in such settings through the domain of food pantry access, a socially urgent problem given persistent food insecurity. We develop an AI-powered conversational retrieval system that scrapes and indexes publicly available pantry data and employs a Retrieval-Augmented Generation (RAG) pipeline to support natural language queries via a web interface. We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios. Our analysis reveals key limitations in retrieval robustness, handling underspecified queries, and grounding over inconsistent knowledge bases. This ongoing work exposes fundamental IR challenges in low-resource environments and motivates future research on robust conversational retrieval to improve access to critical public resources.

[IR-6] Revisiting RAG Retrievers: An Information Theoretic Benchmark

【速读】:该论文旨在解决当前Retrieval-Augmented Generation (RAG)系统中对不同检索器(retriever)机制缺乏系统性理解的问题,特别是现有评估方法难以揭示各类检索策略(如词法匹配、稠密嵌入、图引用等)之间的差异、冗余性和互补性。其解决方案的关键在于提出MIGRASCOPE——一个基于互信息(Mutual Information)的检索器分析框架,通过引入信息论与统计估计理论为基础的量化指标,系统评估检索质量、冗余度、协同效应及边际贡献,从而为检索器的选择与组合提供可解释的依据,并验证了精心设计的检索器集成方案能超越单一检索器性能。

链接: https://arxiv.org/abs/2602.21553
作者: Wenqing Zheng,Dmitri Kalaev,Noah Fatsi,Daniel Barcklow,Owen Reinert,Igor Melnyk,Senthil Kumar,C. Bayan Bruss
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems rely critically on the retriever module to surface relevant context for large language models. Although numerous retrievers have recently been proposed, each built on different ranking principles such as lexical matching, dense embeddings, or graph citations, there remains a lack of systematic understanding of how these mechanisms differ and overlap. Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves. Those that do compare retrievers directly use a limited set of evaluation tools which fail to capture complementary and overlapping strengths. This work presents MIGRASCOPE, a Mutual Information based RAG Retriever Analysis Scope. We revisit state-of-the-art retrievers and introduce principled metrics grounded in information and statistical estimation theory to quantify retrieval quality, redundancy, synergy, and marginal contribution. We further show that if chosen carefully, an ensemble of retrievers outperforms any single retriever. We leverage the developed tools over major RAG corpora to provide unique insights on contribution levels of the state-of-the-art retrievers. Our findings provide a fresh perspective on the structure of modern retrieval techniques and actionable guidance for designing robust and efficient RAG systems.

[IR-7] Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment

【速读】:该论文旨在解决多语言预训练模型在表示空间中缺乏显式对齐信号,导致跨语言对齐效果不佳的问题。其解决方案的关键在于利用来自多种目标语言的多向平行语料库(multi-way parallel corpus)进行对比学习(contrastive learning),从而显著提升模型在多语言自然语言理解(NLU)任务中的跨语言对齐能力。通过在六种目标语言上构建基于现成神经机器翻译(NMT)模型的平行数据,并以对比方式训练XLM-Roberta和多语言BERT基础模型,实验表明该方法在MTEB基准测试中对已见和未见语言均带来显著性能提升,尤其在句子嵌入质量方面表现突出。

链接: https://arxiv.org/abs/2602.21543
作者: Barah Fazili,Koustava Goswami
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, finetuning mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.

[IR-8] Both Ends Count! Just How Good are LLM Agents at “Text-to-Big SQL”?

【速读】:该论文旨在解决现有Text-to-SQL(文本转SQL)评估基准在面对大规模数据场景时的局限性问题,即当前指标未能充分反映系统在真实Big Data(大数据)工作流中的执行效率、成本与可扩展性表现。其关键解决方案是提出一套全新的、面向“Text-to-Big SQL”的评估指标体系,该体系能够准确衡量LLM(大语言模型)代理在生产级环境下的性能、资源消耗及随数据规模增长带来的延迟和成本变化,从而弥补传统Text-to-SQL评测中忽略规模化影响的缺陷。

链接: https://arxiv.org/abs/2602.21480
作者: Germán T. Eizaguirre,Lars Tissen,Marc Sánchez-Artigas
机构: Universitat Rovira i Virgili (罗维拉·维尔吉里大学); RWTH Aachen University (亚琛工业大学)
类目: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as “Text-to-Big SQL”. However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost. Comments: 11 pages, 4 figures Subjects: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR) ACMclasses: I.2.7; H.2.8 Cite as: arXiv:2602.21480 [cs.DB] (or arXiv:2602.21480v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2602.21480 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-9] Revisiting Text Ranking in Deep Research

【速读】:该论文旨在解决深度研究(deep research)任务中因使用黑箱网络搜索API而导致的检索组件行为不透明问题,从而阻碍对文本排序方法在该场景下有效性的系统性分析。其关键解决方案是通过复现和评估一系列信息检索(IR)文本排序方法在固定语料库上的表现,从三个维度展开实验:(i)检索单元(文档 vs. 段落),(ii)流水线配置(不同检索器、重排序器及重排序深度),以及(iii)查询特征(代理生成查询与训练查询之间的不匹配)。结果表明,代理生成的查询通常遵循网页搜索风格语法,偏好基于词法、学习稀疏表示或多向量的检索器;段落级检索在有限上下文窗口下更高效且避免了词法检索中的文档长度归一化难题;重排序显著提升效果;将代理查询转化为自然语言问题可有效缓解查询语义错位问题。

链接: https://arxiv.org/abs/2602.21456
作者: Chuan Meng,Litu Ou,Sean MacAvaney,Jeff Dalton
机构: The University of Edinburgh(爱丁堡大学); University of Glasgow(格拉斯哥大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite search’s essential role in deep research, black-box web search APIs hinder systematic analysis of search components, leaving the behaviour of established text ranking methods in deep research largely unclear. To fill this gap, we reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting. In particular, we examine their effectiveness from three perspectives: (i) retrieval units (documents vs. passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers). We perform experiments on BrowseComp-Plus, a deep research dataset with a fixed corpus, evaluating 2 open-source agents, 5 retrievers, and 3 re-rankers across diverse setups. We find that agent-issued queries typically follow web-search-style syntax (e.g., quoted exact matches), favouring lexical, learned sparse, and multi-vector retrievers; passage-level units are more efficient under limited context windows, and avoid the difficulties of document length normalisation in lexical retrieval; re-ranking is highly effective; translating agent-issued queries into natural-language questions significantly bridges the query mismatch.

[IR-10] PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing

【速读】:该论文旨在解决当前基于图的近似最近邻搜索(Approximate Nearest Neighbor Search, ANNS)索引构建速度慢的问题,尤其是HNSW和Vamana等方法因依赖随机访问密集的束搜索(beam search)而导致构造时间过长。其核心解决方案是提出PiPNN(Pick-in-Partitions Nearest Neighbors)算法,关键创新在于HashPrune——一种新颖的在线剪枝算法,能够动态维护稀疏边集合,并通过将数据集划分为重叠子问题、利用密集矩阵乘法核高效执行批量距离计算、以及流式输入部分边至HashPrune的方式,显著降低内存占用并提升构建效率。该方法在保证索引质量的前提下,实现比Vamana(DiskANN)快11.6倍、比HNSW快12.9倍的构建速度,首次在单个多核机器上实现了百亿级数据集的高质量ANN索引构建,耗时低于20分钟。

链接: https://arxiv.org/abs/2602.21247
作者: Tobias Rubel,Richard Wen,Laxman Dhulipala,Lars Gottesbüren,Rajesh Jayaram,Jakub Łącki
机构: University of Maryland, College Park (马里兰大学学院市分校); Google Research (谷歌研究院)
类目: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have large construction times due to relying on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids this ``search bottleneck’’ that existing graph-based methods suffer from. PiPNN’s core innovation is HashPrune, a novel online pruning algorithm which dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction which permits PiPNN to build higher quality indices without the use of extra intermediate memory. PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is significantly more scalable than recent algorithms for fast graph construction. PiPNN builds indexes at least 19.1x faster than MIRAGE and 17.3x than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine. Subjects: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR) Cite as: arXiv:2602.21247 [cs.DB] (or arXiv:2602.21247v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2602.21247 Focus to learn more arXiv-issued DOI via DataCite

[IR-11] oward Effective Multi-Domain Rumor Detection in Social Networks Using Domain-Gated Mixture-of-Experts

【速读】:该论文旨在解决多领域谣言检测中因数据分布差异导致现有单域模型性能下降的问题,即如何在跨领域场景下实现高精度的谣言识别。其解决方案的关键在于提出了一种基于领域门控机制(domain gate)的多域谣言检测模型,该模型采用Mixture-of-Experts架构动态融合多个专家网络提取的特征表示;每个专家由CNN与BiLSTM组成,分别捕捉局部句法特征和长程上下文依赖关系,并结合文本内容与发布者信息进行分类,从而在多域设置下实现了F1-score达79.86%、准确率为79.98%的先进性能。

链接: https://arxiv.org/abs/2602.21214
作者: Mohadeseh Sheikhqoraei,Zainabolhoda Heshmati,Zeinab Rajabi,Leila Rabiei
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Social media platforms have become key channels for spreading and tracking rumors due to their widespread accessibility and ease of information sharing. Rumors can continuously emerge across diverse domains and topics, often with the intent to mislead society for personal or commercial gain. Therefore, developing methods that can accurately detect rumors at early stages is crucial to mitigating their negative impact. While existing approaches often specialize in single-domain detection, their performance degrades when applied to new domains due to shifts in data distribution, such as lexical patterns and propagation dynamics. To bridge this gap, this study introduces PerFact, a large-scale multi-domain rumor dataset comprising 8,034 annotated posts from the X platform, annotated into two primary categories: rumor (including true, false, and unverified rumors) and non-rumor. Annotator agreement, measured via Fleiss’ Kappa ( \kappa = 0.74 ), ensures high-quality labels. This research further proposes an effective multi-domain rumor detection model that employs a domain gate to dynamically aggregate multiple feature representations extracted through a Mixture-of-Experts method. Each expert combines CNN and BiLSTM networks to capture local syntactic features and long-range contextual dependencies. By leveraging both textual content and publisher information, the proposed model classifies posts into rumor and non-rumor categories with high accuracy. Evaluations demonstrate state-of-the-art performance, achieving an F1-score of 79.86% and an accuracy of 79.98% in multi-domain settings. Keywords: Rumor Detection, Multi-Domain, Natural Language Processing, Social Networks, Mixture-of-Experts Model Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2602.21214 [cs.SI] (or arXiv:2602.21214v1 [cs.SI] for this version) https://doi.org/10.48550/arXiv.2602.21214 Focus to learn more arXiv-issued DOI via DataCite

[IR-12] Disaster Question Answering with LoRA Efficiency and Accurate End Position

【速读】:该论文旨在解决自然灾害情境下个体因缺乏领域知识和经验而难以做出恰当响应的问题,尤其在利用检索增强生成(Retrieval-Augmented Generation, RAG)与大语言模型进行灾害问答时,常面临信息不相关、幻觉(hallucination)导致的人工误导风险。其解决方案的关键在于构建一个基于日本灾害情境与应对经验的专用问答系统,采用cl-tohoku/bert-base-japanese-v3结合Bi-LSTM与增强位置头(Enhanced Position Heads)的架构,并通过LoRA(Low-Rank Adaptation)参数高效优化,在仅使用5.7%总参数(6.7M/117M)的情况下实现了70.4%的端位置准确率(End Position Accuracy)和0.885的Span F1得分,表明该方法在保持轻量化的同时具备满足实际灾害响应场景所需的高精度推理能力。

链接: https://arxiv.org/abs/2602.21212
作者: Takato Yasuno
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Natural disasters such as earthquakes, torrential rainfall, floods, and volcanic eruptions occur with extremely low frequency and affect limited geographic areas. When individuals face disaster situations, they often experience confusion and lack the domain-specific knowledge and experience necessary to determine appropriate responses and actions. While disaster information is continuously updated, even when utilizing RAG search and large language models for inquiries, obtaining relevant domain knowledge about natural disasters and experiences similar to one’s specific situation is not guaranteed. When hallucinations are included in disaster question answering, artificial misinformation may spread and exacerbate confusion. This work introduces a disaster-focused question answering system based on Japanese disaster situations and response experiences. Utilizing the cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + Enhanced Position Heads architecture with LoRA efficiency optimization, we achieved 70.4% End Position accuracy with only 5.7% of the total parameters (6.7M/117M). Experimental results demonstrate that the combination of Japanese BERT-base optimization and Bi-LSTM contextual understanding achieves accuracy levels suitable for real disaster response scenarios, attaining a 0.885 Span F1 score. Future challenges include: establishing natural disaster Q\A benchmark datasets, fine-tuning foundation models with disaster knowledge, developing lightweight and power-efficient edge AI Disaster Q\A applications for situations with insufficient power and communication during disasters, and addressing disaster knowledge base updates and continual learning capabilities.

[IR-13] he Language of Infographics: Toward Understanding Conceptual Metaphor Use in Scientific Storytelling IEEE-VIS2024

【速读】:该论文旨在解决科学信息图(science infographics)中视觉概念隐喻(visual conceptual metaphors)使用缺乏系统性理解与规范描述的问题,当前其应用多依赖直觉而非正式流程,导致难以通过分类学和语法规则指导可视化组件的设计。解决方案的关键在于引入认知语言学中的概念隐喻理论(Conceptual Metaphor Theory, CMT),通过对四个领域(生物医学、气候、空间和人类学)的科学信息图进行视觉成分分解与分析,构建了一套可识别和解释视觉概念隐喻模式的分类体系,并开发了一个基于数据库的可视化探索工具,以在时空尺度上展示各信息图中隐喻的构成与分布,从而提升对科学信息图中隐喻机制的理解与应用规范性。

链接: https://arxiv.org/abs/2407.13416
作者: Hana Pokojná,Tobias Isenberg,Stefan Bruckner,Barbora Kozlíková,Laura Garrison
机构: 未知
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
备注: 11 pages, 8 figures, 1 table, accepted to IEEE VIS 2024 Conference

点击查看摘要

Abstract:We apply an approach from cognitive linguistics by mapping Conceptual Metaphor Theory (CMT) to the visualization domain to address patterns of visual conceptual metaphors that are often used in science infographics. Metaphors play an essential part in visual communication and are frequently employed to explain complex concepts. However, their use is often based on intuition, rather than following a formal process. At present, we lack tools and language for understanding and describing metaphor use in visualization to the extent where taxonomy and grammar could guide the creation of visual components, e.g., infographics. Our classification of the visual conceptual mappings within scientific representations is based on the breakdown of visual components in existing scientific infographics. We demonstrate the development of this mapping through a detailed analysis of data collected from four domains (biomedicine, climate, space, and anthropology) that represent a diverse range of visual conceptual metaphors used in the visual communication of science. This work allows us to identify patterns of visual conceptual metaphor use within the domains, resolve ambiguities about why specific conceptual metaphors are used, and develop a better overall understanding of visual metaphor use in scientific infographics. Our analysis shows that ontological and orientational conceptual metaphors are the most widely applied to translate complex scientific concepts. To support our findings we developed a visual exploratory tool based on the collected database that places the individual infographics on a spatio-temporal scale and illustrates the breakdown of visual conceptual metaphors.

人机交互

[HC-0] Codesigning Ripplet: an LLM -Assisted Assessment Authoring System Grounded in a Conceptual Model of Teachers Workflows

【速读】:该论文旨在解决教育评估(Assessment)设计过程中教师面临的高难度与低效率问题。为应对这一挑战,研究团队通过为期七个月的联合设计(codesign)过程与13位教师合作,提出了一种概念模型,用以刻画教师在开发评估工具时同时迭代优化需求的双重过程。解决方案的关键在于构建了一个名为Ripplet的基于Web的工具,其核心特征是多层级可复用的交互组件(multilevel reusable interactions),从而支持教师高效创建形成性评估(formative assessments)。实证研究表明,Ripplet不仅促使教师产出原本不会设计的评估内容,还推动其从“生成”转向“筛选与优化”的实践模式,并显著提升对评估质量的反思深度。

链接: https://arxiv.org/abs/2602.22186
作者: Yuan Cui,Annabel Goldman,Jovy Zhou,Xiaolin Liu,Clarissa Shieh,Joshua Yao,Mia Cheng,Matthew Kay,Fumeng Yang
机构: Northwestern University (西北大学); University of Maryland (马里兰大学)
类目: Human-Computer Interaction (cs.HC)
备注: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

点击查看摘要

Abstract:Assessments are critical in education, but creating them can be difficult. To address this challenge in a grounded way, we partnered with 13 teachers in a seven-month codesign process. We developed a conceptual model that characterizes the iterative dual process where teachers develop assessments while simultaneously refining requirements. To enact this model in practice, we built Ripplet, a web-based tool with multilevel reusable interactions to support assessment authoring. The extended codesign revealed that Ripplet enabled teachers to create formative assessments they would not have otherwise made, shifted their practices from generation to curation, and helped them reflect more on assessment quality. In a user study with 15 additional teachers, compared to their current practices, teachers felt the results were more worth their effort and that assessment quality improved.

[HC-1] A Taxonomy of Human–MLLM Interaction in Early-Stage Sketch-Based Design Ideation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在早期设计工具中集成后,设计师如何与AI进行协同创作的问题,特别是关注创意责任在人类与AI之间的分配模式及其动态变化。其解决方案的关键在于通过用户研究识别出四种交互模式——仅人类主导(Human-Only)、人类主导(Human-Lead)、AI主导(AI-Lead)和共同演化(Co-Evolution),并基于自动记录的交互日志和任务后访谈分析这些模式在草图驱动的设计构思过程中的实际表现,发现设计师很少固定使用单一模式,而是频繁地在人类与AI角色间切换,从而为未来研究人机协作机制及优化交互系统提供了实证基础。

链接: https://arxiv.org/abs/2602.22171
作者: Weiayn Shi,Kenny Tsu Wei Choo
机构: Singapore University of Technology and Design(新加坡科技设计大学)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages

点击查看摘要

Abstract:As multimodal large language models (MLLMs) are increasingly integrated into early-stage design tools, it is important to understand how designers collaborate with AI during ideation. In a user study with 12 participants, we analysed sketch-based design interactions with an MLLM-powered system using automatically recorded interaction logs and post-task interviews. Based on how creative responsibility was allocated between humans and the AI, we predefined four interaction modes: Human-Only, Human-Lead, AI-Lead, and Co-Evolution, and analysed how these modes manifested during sketch-based design ideation. Our results show that designers rarely rely on a single mode; instead, human-led and AI-led roles are frequently interwoven and shift across ideation instances. These findings provide an empirical basis for future work to investigate why designers shift roles with AI and how interactive systems can better support such dynamic collaboration.

[HC-2] Dynamic Personality Adaptation in Large Language Models via State Machines ICPR2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂交互场景中缺乏根据对话动态调整人格表达能力的问题,从而限制了其在人机交互中的适应性与效果。解决方案的关键在于提出了一种模型无关(model-agnostic)的动态人格模拟框架,该框架利用状态机表示潜在的人格状态,并通过动态调整转移概率来响应对话上下文;同时设计了一个模块化的人格评分流水线,能够在不依赖特定人格模型或LLM的前提下,持续评估对话在潜在人格维度上的表现,并将评分作为动态状态变量用于系统提示的重构,从而实现行为一致性引导。

链接: https://arxiv.org/abs/2602.22157
作者: Leon Pielage,Ole Hätscher,Mitja Back,Bernhard Marschall,Benjamin Risse
机构: Institute for Geoinformatics, University of Münster(明斯特大学地理信息研究所); Faculty of Mathematics and Computer Science, University of Münster(明斯特大学数学与计算机科学学院); Department of Psychology, University of Münster(明斯特大学心理学系); Institute of Medical Education and Student Affairs, University of Münster(明斯特大学医学教育与学生事务研究所)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 22 pages, 5 figures, submitted to ICPR 2026

点击查看摘要

Abstract:The inability of Large Language Models (LLMs) to modulate their personality expression in response to evolving dialogue dynamics hinders their performance in complex, interactive contexts. We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states, where transition probabilities are dynamically adapted to the conversational context. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes while remaining agnostic to the specific personality models, their dimensions, transition mechanisms, or LLMs used. These scores function as dynamic state variables that systematically reconfigure the system prompt, steering behavioral alignment throughout the this http URL evaluate this framework by operationalizing the Interpersonal Circumplex (IPC) in a medical education setting. Results demonstrate that the system successfully adapts its personality state to user inputs, but also influences user behavior, thereby facilitating de-escalation training. Notably, the scoring pipeline maintains comparable precision even when utilizing lightweight, fine-tuned classifiers instead of large-scale LLMs. This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.

[HC-3] When AI Writes Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在职场沟通中“专业化”文本时,对非母语英语变体(如印度英语、新加坡英语、尼日利亚英语)的语用特征进行系统性抹除的问题,即“文化幽灵化”(Cultural Ghosting)。其核心问题是:LLMs在保持语义完整性的同时,如何避免对特定文化背景的语言标记(linguistic markers)造成隐性侵蚀。解决方案的关键在于引入两个量化指标——身份擦除率(Identity Erasure Rate, IER)和语义保留度评分(Semantic Preservation Score, SPS),并通过设计显式的文化保全提示(cultural-preservation prompts)实现干预,实验表明该方法可使擦除率降低29%而不损害语义质量,从而在语义保真与文化多样性之间取得平衡。

链接: https://arxiv.org/abs/2602.22145
作者: Satyam Kumar Navneet,Joydeep Chandra,Yong Zhang
机构: BNRIST, Dept. of CST, Tsinghua University (清华大学); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used to ``professionalize’’ workplace communication, often at the cost of linguistic identity. We introduce “Cultural Ghosting”, the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean, Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.

[HC-4] Speculating for Epiplexity: How to Learn the Most from Speculative Design?

【速读】:该论文试图解决的问题是:当前生成式设计(speculative design)缺乏严谨的评估标准,难以区分哪些设想能产生真正有价值的洞察,哪些仅停留在表面的审美冲击。其解决方案的关键在于引入信息论视角,将生成式设计重构为一种资源受限的知识生成过程,并提出通过“provotypes”(原型化假设)主动拥抱意外性(surprise)。进一步地,作者基于“epiplexity”(可学习的信息结构)的概念,将生成的知识分解为两类:结构化的认知信息(structured epistemic information,即关于未来可迁移的推论)与熵噪声(entropic noise,包括叙事、美学和表层惊喜)。由此构建了一个实用的审计框架,包含自评问卷,帮助设计者判断其设想是否产生了高epiplexity的深层洞见,而非停留在浅层表达。

链接: https://arxiv.org/abs/2602.22132
作者: Botao Amber Hu
机构: University of Oxford(牛津大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Submitted for CC 2026

点击查看摘要

Abstract:Speculative design uses provocative “what if?” scenarios to explore possible sociotechnical futures, yet lacks rigorous criteria for assessing the quality of speculation. We address this gap by reframing speculative design through an information-theoretic lens as a resource-bounded knowledge generation process that uses provotypes to strategically embrace surprise. However, not all surprises are equally informative-some yield genuine insight while others remain aesthetic shock. Drawing on epiplexity-structured, learnable information extractable by bounded observers-we propose decomposing the knowledge generated by speculative artifacts into structured epistemic information (transferable implications about futures) and entropic noise (narrative, aesthetics, and surface-level surprise). We conclude by introducing a practical audit framework with a self-assessment questionnaire that enables designers to evaluate whether their speculations yield rich, high-epiplexity insights or remain at a superficial level. We discuss implications for peer review, design pedagogy, and policy-oriented futuring.

[HC-5] Giving Meaning to Movements: Challenges and Opportunities in Expanding Communication by Pairing Unaided AAC with Speech Generated Messages

【速读】:该论文旨在解决如何融合有辅助的增强与替代性沟通(Augmentative and Alternative Communication, AAC)技术与无辅助的AAC技术,以兼顾后者在表达速度和自然性上的优势,同时保持前者在可理解性方面的可靠性,这一问题在存在沟通和运动障碍的人群中尚未得到充分探索。解决方案的关键在于开发了一种名为AllyAAC的可穿戴系统,其核心组件为腕部惯性测量单元(Inertial Measurement Unit, IMU)与智能手机应用程序的协同工作,并通过18个月的参与式设计识别出关键挑战与机会;进一步在14名参与者的真实场景场研究中验证了该系统,构建了首个包含超过60万个多模态数据点、涵盖非典型手势的大规模数据集,并采用基于Transformer的大型机器学习模型结合不同预训练策略,有效应对个性化、独特手势识别难题,从而实现了自适应、个性化的辅助沟通系统设计。

链接: https://arxiv.org/abs/2602.22131
作者: Imran Kabir,Sharon Ann Redmon,Lynn R Elko,Kevin Williams,Mitchell A Case,Dawn J Sowers,Krista Wilkinson,Syed Masum Billah
机构: Pennsylvania State University (宾夕法尼亚州立大学); Florida State University (佛罗里达州立大学); L. L. SLIM, LLC.
类目: Human-Computer Interaction (cs.HC)
备注: To appear in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)

点击查看摘要

Abstract:Augmentative and Alternative Communication (AAC) technologies are categorized into two forms: aided AAC, which uses external devices like speech-generating systems to produce standardized output, and unaided AAC, which relies on body-based gestures for natural expression but requires shared understanding. We investigate how to combine these approaches to harness the speed and naturalness of unaided AAC while maintaining the intelligibility of aided AAC, a largely unexplored area for individuals with communication and motor impairments. Through 18 months of participatory design with AAC users, we identified key challenges and opportunities and developed AllyAAC, a wearable system with a wrist-worn IMU paired with a smartphone app. We evaluated AllyAAC in a field study with 14 participants and produced a dataset containing over 600,000 multimodal data points featuring atypical gestures–the first of its kind. Our findings reveal challenges in recognizing personalized, idiosyncratic gestures and demonstrate how to address them using Transformer-based large machine learning (ML) models with different pretraining strategies. In sum, we contribute design principles and a reference implementation for adaptive, personalized systems combining aided and unaided AAC.

[HC-6] SocialPulse: On-Device Detection of Social Interactions in Naturalistic Settings Using Smartwatch Multimodal Sensing

【速读】:该论文旨在解决如何在自然场景下利用可穿戴设备(如智能手表)自动检测社交互动的问题,现有方法多受限于实验室环境、仅关注面对面互动或依赖严格假设(如固定时间窗口内存在多个说话者),导致实际应用泛化能力不足。其核心解决方案是开发了一种基于表层语音检测(foreground speech detection)的腕上交互检测系统,通过在公开数据集上训练的语音检测模型实现高精度识别,并结合用户标注的多模态数据构建更鲁棒的分类模型,最终在真实世界部署中实现了85.51%的平衡准确率(foreground speech检测)和90.36%的多模态交互识别准确率,验证了在复杂动态社会环境中感知交互行为的可行性。

链接: https://arxiv.org/abs/2602.22085
作者: Md Sabbir Ahmed,Kaitlyn Dorothy Petz,Noah French,Tanvi Lakhtakia,Aayushi Sangani,Mark Rucker,Xinyu Chen,Bethany A. Teachman,Laura E. Barnes
机构: University of Virginia (弗吉尼亚大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Social interactions are fundamental to well-being, yet automatically detecting them in daily life-particularly using wearables-remains underexplored. Most existing systems are evaluated in controlled settings, focus primarily on in-person interactions, or rely on restrictive assumptions (e.g., requiring multiple speakers within fixed temporal windows), limiting generalizability to real-world use. We present an on-watch interaction detection system designed to capture diverse interactions in naturalistic settings. A core component is a foreground speech detector trained on a public dataset. Evaluated on over 100,000 labeled foreground speech and background sound instances, the detector achieves a balanced accuracy of 85.51%, outperforming prior work by 5.11%. We evaluated the system in a real-world deployment (N=38), with over 900 hours of total smartwatch wear time. The system detected 1,691 interactions, 77.28% were confirmed via participant self-report, with durations ranging from under one minute to over one hour. Among correct detections, 81.45% were in-person, 15.7% virtual, and 1.85% hybrid. Leveraging participant-labeled data, we further developed a multimodal model achieving a balanced accuracy of 90.36% and a sensitivity of 91.17% on 33,698 labeled 15-second windows. These results demonstrate the feasibility of real-world interaction sensing and open the door to adaptive, context-aware systems responding to users’ dynamic social environments. Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2602.22085 [cs.HC] (or arXiv:2602.22085v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2602.22085 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-7] ViSTAR: Virtual Skill Training with Augmented Reality with 3D Avatars and LLM coaching agent

【速读】:该论文旨在解决篮球技能训练中缺乏即时、个性化反馈的问题,尤其在平衡、姿势和时机等关键动作要素上,传统训练难以实现精准纠错。解决方案的核心在于提出ViSTAR系统,其基于行为技能训练(Behavioral Skills Training, BST)框架,通过视觉叠加、节奏提示与AI教练代理提供多模态反馈;其中关键技术突破是利用3D运动重建获取时空关节数据,并借助大语言模型(Large Language Model, LLM)将运动特征映射为自然语言的教练提示,从而生成简洁且具指导性的反馈信息,显著提升用户对自身动作问题的觉察与修正能力。

链接: https://arxiv.org/abs/2602.22077
作者: Chunggi Lee,Hayato Saiki,Tica Lin,Eiji Ikeda,Kenji Suzuki,Chen Zhu-Tian,Hanspeter Pfister
机构: Harvard University (哈佛大学); University of Tsukuba (筑波大学); Dolby Laboratories (杜比实验室); University of Minnesota-Twin Cities (明尼苏达大学双城分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We present ViSTAR, a Virtual Skill Training system in AR that supports self-guided basketball skill practice, with feedback on balance, posture, and timing. From a formative study with basketball players and coaches, the system addresses three challenges: understanding skills, identifying errors, and correcting mistakes. ViSTAR follows the Behavioral Skills Training (BST) framework-instruction, modeling, rehearsal, and feedback. It provides feedback through visual overlays, rhythm and timing cues, and an AI-powered coaching agent using 3D motion reconstruction. We generate verbal feedback by analyzing spatio-temporal joint data and mapping features to natural-language coaching cues via a Large Language Model (LLM). A key novelty is this feedback generation: motion features become concise coaching insights. In two studies (N=16), participants generally preferred our AI-generated feedback to coach feedback and reported that ViSTAR helped them notice posture and balance issues and refine movements beyond self-observation.

[HC-8] A Critical Reflection on the Values and Assumptions in Data Visualization

【速读】:该论文试图解决的问题是:当前可视化研究领域过度依赖少数核心价值(如普遍性、客观性和效率),导致研究范式趋于单一,忽视了来自其他学科的多元视角和批判性反思。解决方案的关键在于:通过梳理Jacques Bertin、John Tukey、Leland Wilkinson、Colin Ware和Tamara Munzner等奠基性学者著作中的价值取向,明确这些传统价值如何渗透至工具设计、教学体系与科研实践,并在此基础上呼吁社区接纳更广泛的多元价值观,以推动可视化研究走向更具包容性和多样性的未来发展方向。

链接: https://arxiv.org/abs/2602.22051
作者: Shehryar Saharan,Ibrahim Al-Hazwani,Miriah Meyer,Laura Garrison
机构: University of Toronto (多伦多大学); University of Zurich (苏黎世大学); Linköping University (林雪平大学); University of Bergen (卑尔根大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Visualization has matured into an established research field, producing widely adopted tools, design frameworks, and empirical foundations. As the field has grown, ideas from outside computer science have increasingly entered visualization discourse, questioning the fundamental values and assumptions on which visualization research stands. In this short position paper, we examine a set of values that we see underlying the seminal works of Jacques Bertin, John Tukey, Leland Wilkinson, Colin Ware, and Tamara Munzner. We articulate three prominent values in these texts - universality, objectivity, and efficiency - and examine how these values permeate visualization tools, curricula, and research practices. We situate these values within a broader set of critiques that call for more diverse priorities and viewpoints. By articulating these tensions, we call for our community to embrace a more pluralistic range of values to shape our future visualization tools and guidelines.

[HC-9] SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation

【速读】:该论文旨在解决如何准确预测人类观看绘画作品时的注视路径(scanpaths)这一问题,以支持文化遗产的保护与欣赏。其核心挑战在于自然图像与艺术作品之间存在的域差异(domain gap),以及眼动数据固有的随机性。解决方案的关键在于提出SPGen模型,该模型采用全卷积神经网络(Fully Convolutional Neural Network, FCNN)架构,结合可微分的注视选择机制和可学习的高斯先验(learnable Gaussian priors)来模拟自然观看偏置;同时引入梯度反转层(gradient reversal layer)实现无监督域自适应(unsupervised domain adaptation),使模型能从自然场景数据中迁移知识至艺术图像;此外,通过随机噪声采样器建模眼动数据的内在不确定性,从而显著提升预测精度,优于现有方法。

链接: https://arxiv.org/abs/2602.22049
作者: Mohamed Amine Kerkouri,Marouane Tliba,Aladine Chetouani,Alessandro Bruno
机构: F-Initiatives; Université Sorbonne Paris Nord; IULM University
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Under Review

点击查看摘要

Abstract:Understanding human visual attention is key to preserving cultural heritage We introduce SPGen a novel deep learning model to predict scanpaths the sequence of eye movementswhen viewers observe paintings. Our architecture uses a Fully Convolutional Neural Network FCNN with differentiable fixation selection and learnable Gaussian priors to simulate natural viewing biases To address the domain gap between photographs and artworks we employ unsupervised domain adaptation via a gradient reversal layer allowing the model to transfer knowledge from natural scenes to paintings Furthermore a random noise sampler models the inherent stochasticity of eyetracking data. Extensive testing shows SPGen outperforms existing methods offering a powerful tool to analyze gaze behavior and advance the preservation and appreciation of artistic treasures. Comments: Under Review Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC) Cite as: arXiv:2602.22049 [cs.CV] (or arXiv:2602.22049v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.22049 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-10] Detecting UX smells in Visual Studio Code using LLM s

【速读】:该论文旨在解决集成开发环境(Integrated Development Environment, IDE)中用户体验(User Experience, UX)相关问题的实证研究不足这一挑战,聚焦于Visual Studio Code平台,通过挖掘和分类用户在GitHub仓库中报告的问题,识别出影响开发者体验的典型UX缺陷。其解决方案的关键在于采用大语言模型(Large Language Model, LLM)辅助的方法,结合已验证的UX气味分类体系与专家评审,系统性地发现并归类高频出现的UX问题,尤其集中在信息性、清晰度、直观性和效率等开发者最重视的维度上。

链接: https://arxiv.org/abs/2602.22020
作者: Andrés Rodriguez,Juan Cruz Gardey,Alejandra Garrido
机构: LIFIA, Fac. Informática, Univ. Nac. La Plata & CONICET (National Scientific and Technical Research Council)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: 4 pages, 2 figures, 1 table, 3rd International Workshop on Integrated Development Environments (IDE 2026)

点击查看摘要

Abstract:Integrated Development Environments shape developers’ daily experience, yet the empirical study of their usability and user experience (UX) remains limited. This work presents an LLM-assisted approach to detecting UX smells in Visual Studio Code by mining and classifying user-reported issues from the GitHub repository. Using a validated taxonomy and expert review, we identified recurring UX problems that affect the developer experience. Our results show that the majority of UX smells are concentrated in informativeness, clarity, intuitiveness, and efficiency, qualities that developers value most.

[HC-11] he Governance of Intimacy: A Preliminary Policy Analysis of Romantic AI Platforms

【速读】:该论文旨在解决浪漫型人工智能(Romantic AI)平台在用户情感披露数据治理方面的缺失问题,特别是这些平台如何处理用户提供的亲密信息。其核心发现指出,平台通过默认训练授权(default training appropriation)、所有权重构(ownership reconstruction)和亲密历史资产化(intimate history assetization)三大机制,将用户的私密情感内容转化为可再利用的数据资产,从而扩大自身权利并转嫁风险至用户。这一发现揭示了当前浪漫AI中人机亲密关系治理的关键挑战,并为未来实证研究与设计实践提供了重要依据。

链接: https://arxiv.org/abs/2602.22000
作者: Xiao Zhan,Yifan Xu,Rongjun Ma,Shijing He,Jose Luis Martin-Navarro,Jose Such
机构: VRAIN, Universitat Politècnica de València & University of Cambridge (瓦伦西亚理工大学与剑桥大学); The University of Manchester (曼彻斯特大学); VRAIN, Universitat Politècnica de València & Aalto University (瓦伦西亚理工大学与阿尔托大学); King’s College London (伦敦国王学院); INGENIO (CSIC-Universitat Politècnica de València) (CSIC-瓦伦西亚理工大学工程研究所)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 9 pages

点击查看摘要

Abstract:Romantic AI platforms invite intimate emotional disclosure, yet their data governance practices remain underexamined. This preliminary study analyses the Privacy Policies and Terms of Service of six Western and Chinese romantic AI platforms. We find that intimate disclosures are often positioned as reusable data assets, with broad permissions for storage, analysis, and model training. We identify default training appropriation, ownership reconstruction, and intimate history assetization as key mechanisms structuring these practices, expanding platforms’ rights while shifting risk onto users. Our findings surface key governance challenges in romantic AI and are intended to provoke discussion and inform future empirical and design research on human AI intimacy and its governance.

[HC-12] Interactive Augmented Reality-enabled Outdoor Scene Visualization For Enhanced Real-time Disaster Response

【速读】:该论文旨在解决灾难响应场景中增强现实(AR)界面因信息过载导致情境意识下降和认知负荷过高这一关键问题。解决方案的核心在于采用基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的高质量场景重建可视化技术,在保持低认知负荷的同时提升空间感知能力;同时引入轻量级交互机制,结合微型世界导航(World-in-Miniature, WIM)与可动态过滤的语义兴趣点(Points of Interest, POIs),并构建支持实时更新的流式架构,从而实现高效、直观且具上下文感知能力的AR交互体验。

链接: https://arxiv.org/abs/2602.21874
作者: Dimitrios Apostolakis,Georgios Angelidis,Vasileios Argyriou,Panagiotis Sarigiannidis,Georgios Th. Papadopoulos
机构: 1. University of Piraeus (比雷埃夫斯大学); 2. National and Kapodistrian University of Athens (雅典国立卡波迪斯特里亚大学); 3. University of Nicosia (尼科西亚大学); 4. University of West Attica (西阿提卡大学)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:A user-centered AR interface for disaster response is presented in this work that uses 3D Gaussian Splatting (3DGS) to visualize detailed scene reconstructions, while maintaining situational awareness and keeping cognitive load low. The interface relies on a lightweight interaction approach, combining World-in-Miniature (WIM) navigation with semantic Points of Interest (POIs) that can be filtered as needed, and it is supported by an architecture designed to stream updates as reconstructions evolve. User feedback from a preliminary evaluation indicates that this design is easy to use and supports real-time coordination, with participants highlighting the value of interaction and POIs for fast decision-making in context. Thorough user-centric performance evaluation demonstrates strong usability of the developed interface and high acceptance ratios.

[HC-13] StylusPort: Investigating Teleportation using Stylus in VR

【速读】:该论文旨在解决虚拟现实(VR)中如何无缝集成Teleportation(传送)功能与stylus(触控笔)交互的问题,避免打断用户日常的绘图或定位操作流程。其核心解决方案在于提出两个关键设计思想:一是通过“翻转触控笔”作为直观的模式切换机制,在绘制与传送之间自由转换;二是利用注视点(gaze)确定传送方向,而触控笔则负责精确定位。这一方法在保持自然手部动作连续性的同时,提升了传送操作的效率与沉浸感。

链接: https://arxiv.org/abs/2602.21799
作者: Yang Liu,Qiushi Zhou,Mathias N Lystbæk,Aidan Kehoe,Mario Gutierrez,Hans Gellersen,Ken Pfeuffer
机构: Aarhus University (奥胡斯大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Logitech (罗技); Logitech Europe S.A. (罗技欧洲股份有限公司); Lancaster University (兰卡斯特大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI 2026. 12 pages, 12 figures

点击查看摘要

Abstract:With a stylus, users can both sweep sketches across models and pinpoint locations with precision. Building on this dual capability, we explore how teleportation can be integrated into stylus interaction without disrupting the flow of common stylus usage. We introduce two key ideas: flipping the stylus as an intuitive mode switch between drawing and teleportation, and using gaze to set orientation while the stylus handles positioning. In a user study that features a teleport-and-orient task, we evaluate six teleportation techniques, covering two mode-switching methods (Button and Flip) and three orientation approaches (StylusRoll, StylusPoint, and GazePoint). The results offer new insights into the relative merits and limitations of each technique. Our work contributes to knowledge about teleportation in VR and fills the gap in seamlessly integrating teleportation with stylus use in 3D.

[HC-14] Heads Up!: Towards In Situ Photogrammetry Annotations and Augmented Reality Visualizations for Guided Backcountry Skiing

【速读】:该论文旨在解决背country滑雪(Backcountry skiing)活动中,向滑雪者有效传达实时地形与天气风险信息的难题,以提升群体安全性。其核心解决方案在于为滑雪向导提供一套基于实景空间标注(in situ spatial annotation)的工具,使向导可通过平板应用在由无人机航拍生成的摄影测量地图上标记危险点、减速区和安全区,并通过增强现实(Augmented Reality, AR)头显将这些标注以视觉叠加方式实时传递给滑雪者,从而实现精准、直观的协同沟通。

链接: https://arxiv.org/abs/2602.21771
作者: Christoph Albert Johns,László Kopácsi,Michael Barz,Daniel Sonntag
机构: University of Oldenburg(旧堡大学); German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at AlpCHI 2026 Demos, March 01-05, 2026, Ascona, Switzerland

点击查看摘要

Abstract:Backcountry skiing is an activity where a group of skiers navigate challenging environmental conditions to ski outside of managed areas. This activity requires careful monitoring and effective communication around the current weather and terrain conditions to ensure skier safety. We aim to support and facilitate this communication by providing backcountry guides with a set of in situ spatial annotation tools to communicate hazards and appropriate speeds to the ski recreationalists. A guide can use a tablet application to annotate a photogrammetry-based map of a mountainside, for example, one collected using a commercial camera drone, with hazard points, slow-down zones, and safe zones. These annotations are communicated to the skiers via visual overlays in augmented reality heads-up displays. We present a prototype consisting of a web application and a virtual reality display that mirror the guide’s and skier’s perspectives, enabling participatory interaction design studies in a safe environment.

[HC-15] “Without AI I Would Never Share This Online”: Unpacking How LLM s Catalyze Womens Sharing of Gendered Experiences on Social Media

【速读】:该论文旨在解决女性在社交媒体上分享性别经验时面临的内外部障碍问题,包括外部的舆论压力与自我设限的表达标准。研究表明,尽管数字女性主义鼓励女性通过平台进行自我赋权式表达,但许多女性仍因担心被评判或遭遇负面反馈而犹豫不决。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)作为辅助工具,帮助女性更清晰地组织和提炼其性别经历,在满足自身对“适当性”与“价值感”的内在标准的同时,降低公开表达的心理负担,从而促进更具安全感和自主性的线上表达实践。

链接: https://arxiv.org/abs/2602.21686
作者: Runhua Zhang,Ziqi Pan,Huiran Yi,Huamin Qu,Xiaojuan Ma
机构: The Hong Kong University of Science and Technology (香港科技大学); University of Michigan (密歇根大学)
类目: Human-Computer Interaction (cs.HC)
备注: This poster was conditionally accepted to CHI 2026

点击查看摘要

Abstract:Sharing gendered experiences on social media has been widely recognized as supporting women’s personal sense-making and contributing to digital feminism. However, there are known concerns, such as fear of judgment and backlash, that may discourage women from posting online. In this study, we examine a recurring practice on Xiaohongshu, a popular Chinese social media platform, in which women share their gendered experiences alongside screenshots of conversations with LLMs. We conducted semi-structured interviews with 20 women to investigate whether and how interactions with LLMs might support women in articulating and sharing gendered experiences. Our findings reveal that, beyond those external concerns, women also hold self-imposed standards regarding what feels appropriate and worthwhile to share publicly. We further show how interactions with LLMs help women meet these standards and navigate such concerns. We conclude by discussing how LLMs might be carefully and critically leveraged to support women’s everyday expression online.

[HC-16] WatchHand: Enabling Continuous Hand Pose Tracking On Off-the-Shelf Smartwatches

【速读】:该论文旨在解决在商用智能手表上实现连续3D手部姿态追踪的问题,此前的方法依赖外部传感器或定制硬件,限制了其在真实场景中的应用。解决方案的关键在于提出WatchHand系统,该系统仅利用智能手表内置的扬声器和麦克风,通过发射不可听的调频连续波(Frequency-Modulated Continuous Wave, FMCW)并捕捉手部反射信号,结合深度学习模型估计20个手指关节的3D位置,从而在不增加额外硬件的前提下实现了高精度、鲁棒的手势交互。

链接: https://arxiv.org/abs/2602.21610
作者: Jiwan Kim,Chi-Jung Lee,Hohurn Jung,Tianhong Catherine Yu,Ruidong Zhang,Ian Oakley,Cheng Zhang
机构: KAIST(韩国科学技术院); Cornell University(康奈尔大学)
类目: Human-Computer Interaction (cs.HC)
备注: This work will be presented and published at ACM CHI 2026

点击查看摘要

Abstract:Tracking hand poses on wrist-wearables enables rich, expressive interactions, yet remains unavailable on commercial smartwatches, as prior implementations rely on external sensors or custom hardware, limiting their real-world applicability. To address this, we present WatchHand, the first continuous 3D hand pose tracking system implemented on off-the-shelf smartwatches using only their built-in speaker and microphone. WatchHand emits inaudible frequency-modulated continuous waves and captures their reflections from the hand. These acoustic signals are processed by a deep-learning model that estimates 3D hand poses for 20 finger joints. We evaluate WatchHand across diverse real-world conditions – multiple smartwatch models, wearing-hands, body postures, noise conditions, pose-variation protocols – and achieve a mean per-joint position error of 7.87 mm in cross-session tests with device remounting. Although performance drops for unseen users or gestures, the model adapts effectively with lightweight fine-tuning on small amounts of data. Overall, WatchHand lowers the barrier to smartwatch-based hand tracking by eliminating additional hardware while enabling robust, always-available interactions on millions of existing devices.

[HC-17] Exploring Human-Machine Coexistence in Symmetrical Reality

【速读】:该论文旨在解决人工智能(AI)发展背景下,人机交互中传统以人类为中心的范式所面临的挑战,即如何重构人类与AI实体之间的关系,以实现二者在物理世界与虚拟世界中的和谐共存。其解决方案的关键在于提出“对称现实”(symmetrical reality)这一全新描述框架,通过整合虚拟与物理世界的维度,为构建人机共生的协同机制提供理论基础和研究方向。

链接: https://arxiv.org/abs/2602.21584
作者: Zhenliang Zhang
机构: State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室,BIGAI)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: IEEE Virtual Reality 2026 Poster

点击查看摘要

Abstract:In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction. To address this challenge, it is imperative to reassess the relationship between AI entities and humans. Through considering both the virtual and physical worlds, we can construct a novel descriptive framework for a world where humans and machines coexist symbiotically. This paper will introduce a fresh research direction engendered for studying harmonious human-machine coexistence across physical and virtual worlds, which has been termed “symmetrical reality”. We will elucidate its key characteristics, offering innovative research insight for renovating human-machine interaction paradigms.

[HC-18] StoryComposerAI: Supporting Human-AI Story Co-Creation Through Decomposition and Linking

【速读】:该论文旨在解决生成式 AI (Generative AI) 在人类与AI协同创作数字故事过程中,用户难以对单个故事元素进行精细控制,同时保证生成视觉内容与故事情节一致性和跨输出一致性的问题。解决方案的关键在于提出“创造性分解与关联”(creative decomposition and linking)范式,通过引导用户分别指定故事要素(如情节、人物、场景等),使GenAI能够针对性地生成内容,并在保持各元素间语义连贯性的基础上实现整体叙事一致性。

链接: https://arxiv.org/abs/2602.21486
作者: Shuo Niu,Dylan Clements,Marina Margalit Nemanov,Hyungsin Kim
机构: Clark University (克拉克大学)
类目: Human-Computer Interaction (cs.HC)
备注: Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26)

点击查看摘要

Abstract:GenAI’s ability to produce text and images is increasingly incorporated into human-AI co-creation tasks such as storytelling and video editing. However, integrating GenAI into these tasks requires enabling users to retain control over editing individual story elements while ensuring that generated visuals remain coherent with the storyline and consistent across multiple AI-generated outputs. This work examines a paradigm of creative decomposition and linking, which allows creators to clearly communicate creative intent by prompting GenAI to tailor specific story elements, such as storylines, personas, locations, and scenes, while maintaining coherence among them. We implement and evaluate StoryComposerAI, a system that exemplifies this paradigm for enhancing users’ sense of control and content consistency in human-AI co-creation of digital stories.

[HC-19] Evaluating the Usage of African-American Vernacular English in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在表征非标准方言——特别是非洲裔美国英语(African American Vernacular English, AAVE)时存在的准确性不足与刻板印象强化问题。其核心问题是当前主流自然语言理解评估多基于标准语种(如标准美式英语),导致模型对AAVE等边缘化方言的语法特征识别和生成能力薄弱,且可能无意中复制社会偏见。解决方案的关键在于通过对比人类AAVE母语者的实际使用模式(基于Corpus of Regional African American Language和TwitterAAE数据集)与LLMs生成文本中的语法特征分布,揭示模型在频率、正确性和语境适配上的系统性偏差,并强调需在训练数据中引入更多元化的语言变体及采用公平性缓解方法,以减少对少数群体语言形式的误判和刻板印象的传播。

链接: https://arxiv.org/abs/2602.21485
作者: Deja Dunlap,R. Thomas McCoy
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE). In this work, we investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We analyze three LLMs to compare their usage of AAVE to the usage of humans who natively speak AAVE. We first analyzed interviews from the Corpus of Regional African American Language and TwitterAAE to identify the typical contexts where people use AAVE grammatical features such as ain’t. We then prompted the LLMs to produce text in AAVE and compared the model-generated text to human usage patterns. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans: LLMs usually underuse and misuse grammatical features characteristic of AAVE. Furthermore, through sentiment analysis and manual inspection, we found that the models replicated stereotypes about African Americans. These results highlight the need for more diversity in training data and the incorporation of fairness methods to mitigate the perpetuation of stereotypes.

[HC-20] A Benchmark to Assess Common Ground in Human-AI Collaboration

【速读】:该论文旨在解决当前人机协作(Human-AI Collaboration)研究中对“共同基础”(Common Ground)关注不足的问题,即如何使AI从单一的信息或事务性助手角色转变为具备真正协同能力的合作伙伴。其解决方案的关键在于构建一个基于人类协作理论与实证研究的新基准(Benchmark),该基准以需要迭代交互、联合行动、指称协调和情境意识差异下修复误解的协作拼图任务为核心,通过验证实验表明该基准能复现人类协作中的经典理论与实证结果,同时揭示出人与AI协作中的独特差异,从而为人机协作研究提供可量化、可比较的评估框架。

链接: https://arxiv.org/abs/2602.21337
作者: Christian Poelitz,Finale Doshi-Velez,Siân Lindley
机构: University College London (伦敦大学学院); University of California, Berkeley (加州大学伯克利分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI is becoming increasingly integrated into everyday life, both in professional work environments and in leisure and entertainment contexts. This integration requires AI to move beyond acting as an assistant for informational or transactional tasks toward a genuine collaborative partner. Effective collaboration, whether between humans or between humans and AI, depends on establishing and maintaining common ground: shared beliefs, assumptions, goals, and situational awareness that enable coordinated action and efficient repair of misunderstandings. While common ground is a central concept in human collaboration, it has received limited attention in studies of human-AI collaboration. In this paper, we introduce a new benchmark grounded in theories and empirical studies of human-human collaboration. The benchmark is based on a collaborative puzzle task that requires iterative interaction, joint action, referential coordination, and repair under varying conditions of situation awareness. We validate the benchmark through a confirmatory user study in which human participants collaborate with an AI to solve the task. The results show that the benchmark reproduces established theoretical and empirical findings from human-human collaboration, while also revealing clear divergences in human-AI interaction.

[HC-21] owards Narrative Medical Visualization

【速读】:该论文旨在解决如何通过叙事性可视化(narrative visualization)技术,将复杂的医学数据转化为非专业受众(如患者及其家属)易于理解的故事化表达,从而提升其对疾病机制、病理过程及风险因素的认知。其解决方案的关键在于提出一个通用的叙事医学可视化模板,该模板可适用于不同系统(骨、血管、器官)的疾病案例,并通过整合探索性与解释性可视化手段,增强数据驱动故事的传达效果与理解深度。此模板为其他科研人员提供了可复用的方法框架,以面向公众传播特定疾病的科学信息。

链接: https://arxiv.org/abs/2108.05462
作者: Monique Meuschke,Laura Garrison,Noeska Smit,Stefan Bruckner,Kai Lawonn,Bernhard Preim
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Narrative visualization aims to communicate scientific results to a general audience and garners significant attention in various applications. Merging exploratory and explanatory visualization could effectively support a non-expert understanding of scientific processes. Medical research results, e.g., mechanisms of the healthy human body, explanations of pathological processes, or avoidable risk factors for diseases, are also interesting to a general audience that includes patients and their relatives. This paper discusses how narrative techniques can be applied to medical visualization to tell data-driven stories about diseases. We address the general public comprising people interested in medicine without specific medical background knowledge. We derived a general template for the narrative medical visualization of diseases. Applying this template to three diseases selected to span bone, vascular, and organ systems, we discuss how narrative techniques can support visual communication and facilitate understanding of medical data. Other scientists can adapt our proposed template to inform an audience on other diseases. With our work, we show the potential of narrative medical visualization and conclude with a comprehensive research agenda.

计算机视觉

[CV-0] Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences CVPR2026

【速读】:该论文旨在解决从非结构化点云数据中实现动态三维物体长时间序列的时序一致表面重建问题,现有方法或因增量优化导致漂移且计算耗时,或依赖复杂的学习模型并需类别特定训练。其解决方案的关键在于提出一种基于新型预条件潜空间网格编码(preconditioned latent-grid encoding)的快速变形优化方法——将所有时间步的变形信息以多分辨率潜空间形式编码至一个参考关键帧表面的位置与法向量参数化空间中,并通过轻量级多层感知机(MLP)解码为每帧的6-DoF变形;同时在梯度训练中引入Sobolev预条件策略,无需显式对应关系或额外先验即可实现高保真、无漂移的重建,且在推理速度上显著优于当前训练-free方法(快至少60倍),接近预训练模型水平。

链接: https://arxiv.org/abs/2602.22212
作者: Julian Kaltheuner,Hannah Dröge,Markus Plack,Patrick Stotko,Reinhard Klein
机构: University of Bonn (波恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Code: this https URL

点击查看摘要

Abstract:Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.

[CV-1] WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos WWW

【速读】:该论文旨在解决基于第一人称视角(egocentric)视频中手部与物体运动重建的难题,尤其针对交互过程中严重的遮挡、物体频繁进出视野以及独立估计手部和物体姿态导致的关系不一致问题。解决方案的关键在于提出WHOLE方法,通过学习手-物体运动的生成先验(generative prior),在世界坐标系中联合推理二者交互关系,从而实现更一致且鲁棒的联合重建,显著优于传统分步处理后进行后处理的方法。

链接: https://arxiv.org/abs/2602.22209
作者: Yufei Ye,Jiaman Li,Ryan Rong,C. Karen Liu
机构: Stanford University (斯坦福大学); Amazon FAR (前沿人工智能与机器人)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: this https URL

[CV-2] Solaris: Building a Multiplayer Video World Model in Minecraft

【速读】:该论文旨在解决现有动作条件视频生成模型(action-conditioned video generation models)仅限于单智能体视角、无法捕捉真实环境中多智能体交互的问题。其解决方案的关键在于提出Solaris——一个可模拟一致多视角观测的多人视频世界模型,并构建了一个专为Minecraft等视频游戏设计的多人数据系统,支持协调的多智能体交互与同步视频+动作采集。通过分阶段训练策略(从单人到多人建模),结合双向、因果和自强制(Self Forcing)训练方式,并在最终阶段引入内存高效的Checkpointed Self Forcing以延长教师信号的时序长度,从而显著提升模型在多人运动、记忆、定位、建造及视图一致性等方面的性能表现。

链接: https://arxiv.org/abs/2602.22208
作者: Georgy Savva,Oscar Michel,Daohan Lu,Suppakit Waiwitlikhit,Timothy Meehan,Dhairya Mishra,Srivats Poddar,Jack Lu,Saining Xie
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

[CV-3] Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes

【速读】:该论文旨在解决当前图像保护策略在面对生成式 AI(Generative AI)攻击时存在的脆弱性问题。现有保护方法通常通过添加人眼难以察觉的扰动来防止图像被滥用,如风格模仿或深度伪造(deepfake)篡改。然而,本文揭示了一个关键漏洞:无需专门设计的攻击手段,仅利用现成的图像到图像生成模型(image-to-image GenAI models),配合简单的文本提示(text prompt),即可将其作为通用“去噪器”(denoisers)有效移除多种保护扰动。这一通用攻击方法在8个案例研究中覆盖6种不同保护方案,不仅成功绕过防御机制,还优于现有专用攻击方法,并保持图像对攻击者的可用性,表明当前多数图像保护方案存在系统性安全缺陷,亟需基于此类通用攻击进行重新评估与改进。

链接: https://arxiv.org/abs/2602.22197
作者: Xavier Pleimling,Sifat Muhammad Abdullah,Gunjan Balde,Peng Gao,Mainack Mondal,Murtuza Jadliwala,Bimal Viswanath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore. To IEEE SaTML 2026

点击查看摘要

Abstract:Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible protective perturbations to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose-built methods, we demonstrate that this is no longer necessary. We show that off-the-shelf image-to-image GenAI models can be repurposed as generic ``denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general-purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image’s utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off-the-shelf GenAI models. Code is available in this repository: this https URL

[CV-4] Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology

【速读】:该论文旨在解决当前计算病理学工作流中因仅使用单一放大倍数(20×)的图像切片而导致的空间上下文信息丢失问题,以及由此引发的切片表示冗余和计算效率低下的挑战。现有方法通常将全幻灯片图像(Whole Slide Image, WSI)切割为大量224×224像素的固定分辨率切片,并依赖单一放大倍数的基础模型进行特征提取,这限制了对多尺度组织结构的准确建模。解决方案的关键在于提出一种区域级混合编码器(region-level mixing encoder),通过在预训练阶段联合融合来自不同放大倍数的基础模型所提取的图像切片表征,利用掩码嵌入建模(masked embedding modeling)策略实现跨尺度特征融合,从而更有效地捕获多分辨率空间上下文信息,并减少每张幻灯片所需的表征数量,最终在多种癌症类型的生物标志物预测任务中展现出显著的性能提升,验证了空间上下文理解的重要性。

链接: https://arxiv.org/abs/2602.22176
作者: Eric Zimmermann,Julian Viret,Michal Zelechowski,James Brian Hall,Neil Tenenholtz,Adam Casson,George Shaikovski,Eugene Vorontsov,Siqi Liu,Kristen A Severson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, a standard computational pathology workflow has emerged where whole slide images are cropped into tiles, these tiles are processed using a foundation model, and task-specific models are built using the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively with tiles using the 20 \times magnification. However, it is well known that certain histologic features can only be discerned with larger context windows and requires a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224 \times 224 pixel crops at 20 \times leads to a large number of tiles per slide, which can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.

[CV-5] CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness

【速读】:该论文旨在解决任意尺度超分辨率(Arbitrary-Scale Super-Resolution, ASISR)中因跨尺度分布偏移(cross-scale distribution shift)导致的性能下降问题,即当推理尺度超出训练范围时,图像会出现噪声、模糊和伪影显著累积的现象。解决方案的关键在于提出一种名为CASR的循环式超分辨率框架,其核心思想是将极端放大任务分解为一系列在分布内的尺度过渡过程,从而避免分布漂移;同时引入两个关键模块:结构分布对齐模块(SDAM)通过超像素聚合实现结构级分布对齐以抑制误差累积,以及自相关纹理恢复模块(SARM)利用自相关约束和低分辨率图像自相似性先验来恢复高频纹理,确保长程纹理一致性。该方法仅需单一模型即可实现高稳定性和强泛化能力,尤其在极端放大场景下表现优异。

链接: https://arxiv.org/abs/2602.22159
作者: Wenhao Guo,Zhaoran Zhao,Peng Lu,Sheng Li,Qian Qiao,RuiDe Li
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Arbitrary-Scale SR (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.

[CV-6] CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation CVPR2026

【速读】:该论文旨在解决统一条件图像生成中因不同任务依赖本质不同的内部表征而导致的“概念-定位”表示冲突问题,即某些任务需要语义理解以实现概念合成,而另一些则依赖空间定位线索以保证精度。为应对这一挑战,作者提出CoLoGen框架,其核心创新在于Progressive Representation Weaving (PRW)模块,该模块通过分阶段课程学习策略,先分别构建概念与定位能力,再将其适配到多样视觉条件下,并最终协同优化二者在复杂指令驱动任务中的融合效果,从而实现稳定且高效的多任务统一图像生成。

链接: https://arxiv.org/abs/2602.22150
作者: YuXin Song,Yu Lu,Haoyuan Sun,Huanjin Yao,Fanglong Liu,Yifan Sun,Haocheng Feng,Hang Zhou,Jingdong Wang
机构: Baidu Inc. (百度公司); Zhejiang University (浙江大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026. 15 pages, 8 figures

点击查看摘要

Abstract:Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.

[CV-7] MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining

【速读】:该论文旨在解决医学视觉-语言预训练中因原始影像报告存在显著风格异质性、长度不一及大量与图像无关内容而导致的监督信号质量下降问题。其解决方案的关键在于提出MedTri框架,将自由文本报告结构化为统一的[解剖实体: 影像描述 + 诊断类别]三元组,通过解剖学锚定的规范化策略保留关键形态与空间信息,同时去除风格噪声和无关内容,从而提供一致且图像相关的文本监督信号。这一结构化文本归一化方法在X光和CT多数据集上均显著优于原始报告及现有基线,且可无缝集成文本层面增强策略(如知识扩充和解剖学引导的反事实监督),提升模型鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2602.22143
作者: Yuetan Chu,Xinhua Ma,Xinran Jin,Gongning Luo,Xin Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining remain insufficiently and systematically examined. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into a unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization can easily support modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, while MedTri provides this normalization platform. Code and data will be released at this https URL.

[CV-8] WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLM s CVPR2026

【速读】:该论文旨在解决当前视频大语言模型(Video-LLM)在流式场景下的核心局限性——时间无关性(Time-Agnosticism),即模型将视频视为无序证据集合而非因果有序序列,导致两个关键问题:时序顺序模糊(temporal order ambiguity)和过去-当前关注盲区(past-current focus blindness)。其解决方案的关键在于提出 WeaveTime 框架,该框架通过引入轻量级的时间重建目标(Temporal Reconstruction objective)实现无需专门流式数据的顺序感知表征学习,并在推理阶段采用基于不确定性的动态焦点缓存机制(Past-Current Dynamic Focus Cache),仅在必要时扩展历史信息,从而在不修改原有模型结构的前提下显著提升流式视频理解的准确性与效率。

链接: https://arxiv.org/abs/2602.22142
作者: Yulin Zhang,Cheng Shi,Sibei Yang
机构: ShanghaiTech University (上海科技大学); Sun Yat-sen University (中山大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 (preview; camera-ready in preparation)

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness where it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective-our Streaming Order Perception enhancement-that instills order aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into exsiting Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time aware stream Video-LLMs under strict online, time causal constraints. Code and weights will be made publicly available. Project Page: this https URL

[CV-9] GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models ICLR2026

【速读】:该论文旨在解决生成式图像模型(如Stable Diffusion和FLUX.1-dev)在文本到图像(Text-to-image, T2I)生成过程中存在的地理多样性缺失、刻板印象强化及区域 misrepresented 问题,这些问题可能导致对发展中国家等地区的不公正视觉呈现。解决方案的关键在于提出GeoDiv框架,该框架利用大规模语言模型和视觉-语言模型,从两个互补维度量化地理多样性:一是社会经济视觉指数(Socio-Economic Visual Index, SEVI),用于捕捉与经济状况和环境条件相关的视觉线索;二是视觉多样性指数(Visual Diversity Index, VDI),衡量主要实体和背景的变异程度。通过这一可解释的评估体系,研究揭示了模型在特定国家(如印度、尼日利亚和哥伦比亚)生成图像中普遍存在的贫困化倾向,从而为改进生成模型的公平性和包容性提供了系统性分析工具。

链接: https://arxiv.org/abs/2602.22120
作者: Abhipsa Basu,Mohana Singh,Shashank Agnihotri,Margret Keuper,R. Venkatesh Babu
机构: Vision and AI Lab, Indian Institute of Science, Bangalore, India; Chair for Machine Learning, University of Mannheim, Mannheim, Germany; Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026

点击查看摘要

Abstract:Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across 10 entities and 16 countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of countries like India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems. Project page: this https URL

[CV-10] Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

【速读】:该论文旨在解决当前医学视觉语言模型(Medical Vision-Language Models, VLMs)在处理三维脑部磁共振成像(MRI)时,依赖二维切片近似方法导致空间上下文碎片化、影响神经放射学诊断准确性的问题。其解决方案的关键在于提出了一种分阶段的视觉语言框架 Brain3D,通过将预训练的二维医学编码器扩展为原生三维架构,并分三步逐步对齐至因果语言模型:对比性定位(contrastive grounding)建立视觉-文本对应关系,监督投影器预热(supervised projector warmup)稳定条件生成,以及基于低秩自适应(LoRA-based linguistic specialization)的语言专业化调整,使输出从冗长描述转向结构化临床报告。此设计显著提升了病理分类的F1分数(达0.951),同时保持健康扫描的完美特异性,验证了针对神经放射学特性的三维建模与渐进式对齐策略的有效性。

链接: https://arxiv.org/abs/2602.22098
作者: Mariano Barone,Francesco Di Serio,Giuseppe Riccio,Antonio Romano,Marco Postiglione,Antonino Ferraro,Vincenzo Moscato
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed \textbfBrain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, \textbfBrain3D is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports\footnoteOur code is publicly available for transparency and reproducibility

[CV-11] WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

【速读】:该论文旨在解决当前4D城市场景重建与天气编辑中存在的两大问题:一是现有方法难以实现多样化天气条件下的场景模拟,二是图像级天气编辑易引入场景伪影且控制能力弱。解决方案的关键在于提出WeatherCity框架,其核心创新包括:(1)基于文本引导的图像编辑模型实现灵活的天气背景修改;(2)引入一种基于共享场景特征和专用天气解码器的天气高斯表示(weather Gaussian representation),结合内容一致性优化策略,确保不同天气条件下场景建模的一致性;(3)设计物理驱动模型,通过粒子与运动模式模拟动态天气效果,从而在保持高保真度和时序一致性的同时,实现细粒度天气控制(如轻雨、暴雪)及场景内对象级操作。

链接: https://arxiv.org/abs/2602.22096
作者: Wenhua Wu,Huai Guan,Zhe Liu,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. While image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.

[CV-12] Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

【速读】:该论文旨在解决胸部X光(Chest X-ray, CXR)图像解读中因病种分布长尾特性及临床环境开放性导致的模型泛化能力不足问题,尤其是现有基准测试多基于单一机构的封闭类别数据集,难以反映罕见疾病的实际发生率或新出现的病理表现。解决方案的关键在于构建一个包含超过14.5万张图像的多中心数据集(整合PadChest与NIH Chest X-ray数据),并设计两项核心任务:一是针对30个已知类别的鲁棒多标签分类任务,二是面向6个未见(分布外)罕见疾病类别的开放世界泛化任务。实验表明,大规模视觉-语言预训练方法显著缓解了零样本诊断中的性能下降问题,验证了其在提升模型对罕见和未知病理识别能力方面的有效性。

链接: https://arxiv.org/abs/2602.22092
作者: Hexin Dong,Yi Lin,Pengyu Zhou,Fengnian Zhao,Alan Clint Legasto,Mingquan Lin,Hao Chen,Yuzhe Yang,George Shih,Yifan Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from single institutions, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT 2026 challenge. This third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. The challenge defines two core tasks: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. We report the results of the top-performing teams, evaluating them via mean Average Precision (mAP), AUROC, and F1-score. The winning solutions achieved an mAP of 0.5854 on Task 1 and 0.4315 on Task 2, demonstrating that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.

[CV-13] Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos CVPR2026

【速读】:该论文旨在解决从无标注的自车视角(ego-centric)驾驶视频中学习兼具语义结构与三维几何信息的表征问题,此类数据虽丰富但缺乏标注,难以用于训练高效且鲁棒的自动驾驶感知模型。解决方案的关键在于提出一种无需标签的教师引导框架(Label-Free Guidance, LFG),其核心创新是利用多模态教师信号(multi-modal supervisory signals)驱动一个前馈架构结合轻量级自回归模块,在单次前向传播中联合预测当前及未来点云图(point maps)、相机位姿(camera poses)、语义分割(semantic segmentation)和运动掩码(motion masks),从而从原始YouTube视频中学习统一的伪四维(pseudo-4D)表示,无需依赖真实位姿、标签或LiDAR数据。该方法显著提升了下游任务如规划、语义理解与运动预测的性能,展现出作为视频驱动型基础模型的潜力。

链接: https://arxiv.org/abs/2602.22091
作者: Matthew Strong,Wei-Jer Chang,Quentin Herau,Jiezhi Yang,Yihan Hu,Chensheng Peng,Wei Zhan
机构: Applied Intuition (应用直觉); Stanford University (斯坦福大学); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.

[CV-14] AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting

【速读】:该论文旨在解决精确事件定位(Precise Event Spotting, PES)任务中因视频数据存在时空冗余而导致的计算效率低下与细节丢失问题。现有方法通常对所有帧进行均匀处理,导致在非信息区域产生冗余计算,同时为保证可训练性常采用空间下采样,从而损失关键的细粒度视觉线索。解决方案的关键在于提出 AdaSpot 框架,其通过低分辨率视频提取全局任务相关特征,并结合一种无监督、任务感知的自适应策略,动态选择每帧中最具有信息量的感兴趣区域(Region-of-Interest, ROI)进行高分辨率处理,该策略保持跨帧的时空一致性且避免可学习方案带来的训练不稳定问题,从而在仅增加少量计算开销的前提下显著提升定位精度。

链接: https://arxiv.org/abs/2602.22073
作者: Artur Xarles,Sergio Escalera,Thomas B. Moeslund,Albert Clapés
机构: Universitat de Barcelona(巴塞罗那大学); Computer Vision Center(计算机视觉中心); Aalborg University(奥尔堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose \textbfAdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that \textbfAdaSpot achieves state-of-the-art performance under strict evaluation metrics (\eg, +3.96 and +2.26 mAP @0 frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: \hrefthis https URLthis https URL.

[CV-15] NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training CVPR2026

【速读】:该论文旨在解决现有神经算子(Neural Operator)在处理复杂偏微分方程(PDE)系统时因单一网络架构导致的特征捕捉能力有限、难以建模异质性与复杂依赖关系的问题,从而限制了其在大规模预训练中的性能提升。解决方案的关键在于提出一种基于嵌套专家混合(Nested Mixture-of-Experts, MoE)框架的大规模PDE预训练神经算子模型:其中图像级MoE用于捕获全局依赖关系,而令牌级子MoE(Sub-MoE)专注于局部依赖建模;通过根据输入动态激活最适专家网络,显著增强了模型的泛化能力和迁移性能。

链接: https://arxiv.org/abs/2602.22059
作者: Dengdi Sun,Xiaoya Zhou,Xiao Wang,Hao Si,Wanli Lyu,Jin Tang,Bin Luo
机构: Anhui University (安徽大学); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.

[CV-16] AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks WACV2026

【速读】:该论文旨在解决从服装缝制图(sewing patterns)中自动预测缝合对应关系(stitch correspondences)的问题,这一任务在缺乏标准化标注协议和语义线索的情况下尤为困难。现有方法通常依赖于面板标签或手工设计的启发式规则,难以应用于现实世界中非标准化的缝制图。其解决方案的关键在于提出一种完全自动化的几何驱动方法 AutoSew,该方法将缝合预测建模为图匹配问题,利用图神经网络(Graph Neural Network, GNN)捕获局部与全局几何上下文信息,并结合可微分最优传输求解器推断缝合关系(包括多边连接),从而仅基于二维缝制轮廓即可实现高精度缝合预测。

链接: https://arxiv.org/abs/2602.22052
作者: Pablo Ríos-Navarro,Elena Garces,Jorge Lopez-Moreno
机构: Universidad Rey Juan Carlos (皇家赫塔费大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026

点击查看摘要

Abstract:Automating garment assembly from sewing patterns remains a significant challenge due to the lack of standardized annotation protocols and the frequent absence of semantic cues. Existing methods often rely on panel labels or handcrafted heuristics, which limit their applicability to real-world, non-conforming patterns. We present AutoSew, a fully automatic, geometry-based approach for predicting stitch correspondences directly from 2D pattern contours. AutoSew formulates the problem as a graph matching task, leveraging a Graph Neural Network to capture local and global geometric context, and employing a differentiable optimal transport solver to infer stitching relationships-including multi-edge connections. To support this task, we update the GarmentCodeData dataset modifying over 18k patterns with realistic multi-edge annotations, reflecting industrial assembly scenarios. AutoSew achieves 96% F1-score and successfully assembles 73.3% of test garments without error, outperforming existing methods while relying solely on geometric input. Our results demonstrate that geometry alone can robustly guide stitching prediction, enabling scalable garment assembly without manual input.

[CV-17] RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

【速读】:该论文旨在解决传统 referring multi-object tracking (RMOT) 在低可见度场景(如夜间、烟雾等)下性能下降的问题,提出了一种新的 RGB-Thermal RMOT 任务(RT-RMOT),通过融合RGB图像的外观特征与热成像模态的光照鲁棒性,实现全天候的目标追踪。其解决方案的关键在于构建了首个基于RGB-热成像模态的RMOT数据集RefRT,并设计了基于多模态大语言模型(MLLM)的RTrack框架,该框架整合RGB、热成像和文本特征;同时引入Group Sequence Policy Optimization(GSPO)策略优化强化学习微调过程,并结合Clipped Advantage Scaling(CAS)策略缓解训练不稳定问题,以及设计Structured Output Reward和Comprehensive Detection Reward以平衡探索与利用,从而提升目标感知的完整性和准确性。

链接: https://arxiv.org/abs/2602.22033
作者: Yanqiu Yu,Zhifan Jin,Sijia Chen,Tongfei Chu,En Yu,Liman Liu,Wenbing Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model’s potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.

[CV-18] RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models

【速读】:该论文旨在解决地铁列车在GNSS(全球导航卫星系统)不可用环境下,依赖视觉感知进行精确定位时所面临的挑战,尤其是传统RGB相机在光照变化、高速运动和恶劣天气条件下性能受限的问题。解决方案的关键在于引入事件相机(event camera)与RGB图像的多模态融合策略,提出一种基于预训练RGB光学字符识别(OCR)基础模型并结合多模态适配的鲁棒性千米标记识别(KMR)方法,从而提升复杂环境下的识别稳定性与准确性。同时,作者构建了首个大规模RGB-事件数据集EvMetro5K,为该任务提供了高质量的数据支持。

链接: https://arxiv.org/abs/2602.22026
作者: Xiaoyu Xian,Shiao Wang,Xiao Wang,Daxin Tian,Yan Tian
机构: Beihang University (北京航空航天大学); CRRC Academy Co., Ltd (中国中车集团研究院); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Cognitive and Developmental Systems (IEEE TCDS) 2026

点击查看摘要

Abstract:Metro trains often operate in highly complex environments, characterized by illumination variations, high-speed motion, and adverse weather conditions. These factors pose significant challenges for visual perception systems, especially those relying solely on conventional RGB cameras. To tackle these difficulties, we explore the integration of event cameras into the perception system, leveraging their advantages in low-light conditions, high-speed scenarios, and low power consumption. Specifically, we focus on Kilometer Marker Recognition (KMR), a critical task for autonomous metro localization under GNSS-denied conditions. In this context, we propose a robust baseline method based on a pre-trained RGB OCR foundation model, enhanced through multi-modal adaptation. Furthermore, we construct the first large-scale RGB-Event dataset, EvMetro5K, containing 5,599 pairs of synchronized RGB-Event samples, split into 4,479 training and 1,120 testing samples. Extensive experiments on EvMetro5K and other widely used benchmarks demonstrate the effectiveness of our approach for KMR. Both the dataset and source code will be released on this https URL

[CV-19] Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments CVPR2026

【速读】:该论文旨在解决户外场景中固有图像分解(Intrinsic Image Decomposition, IID)的难题,即如何从真实世界图像中准确分离出材质反射率(albedo)与阴影(shading)分量,从而支持诸如重光照、场景编辑和大规模环境理解等应用。此前进展受限于缺乏带有可靠albedo和shading监督的真实数据集。其解决方案的关键在于构建并发布Olbedo——一个大规模航拍户外IID数据集,包含5,664张无人机(UAV)图像及其多视角一致的albedo和shading图、度量深度、表面法向、太阳与天空阴影分量、相机位姿及HDR天空穹顶信息;这些标注通过基于多视图立体重建和校准天空光照的逆渲染优化流程生成,并附带像素级置信度掩膜。实验证明,该数据集可使原本在合成室内数据上训练的扩散模型有效泛化至真实户外图像,在MatrixCity基准上显著提升单视图albedo预测性能,并支撑了3D资产多视角一致重光照、材质编辑与城市数字孪生场景变化分析等下游任务。

链接: https://arxiv.org/abs/2602.22025
作者: Shuang Song,Debao Huang,Deyan Deng,Haolin Xiong,Yang Tang,Yajie Zhao,Rongjun Qin
机构: The Ohio State University (俄亥俄州立大学); University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce Olbedo, a large-scale aerial dataset for outdoor albedo–shading decomposition in the wild. Olbedo contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of Olbedo-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.

[CV-20] RobustVisRAG : Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations CVPR2026

【速读】:该论文旨在解决视觉增强生成(Vision-based Retrieval-Augmented Generation, VisRAG)模型在面对图像退化(如模糊、噪声、低光照或阴影)时性能下降的问题。现有VisRAG模型因预训练视觉编码器中语义信息与退化因素纠缠,导致检索和生成阶段均出现错误。解决方案的关键在于提出一种因果引导的双路径框架——RobustVisRAG,其包含非因果路径用于捕捉退化信号(通过单向注意力机制),以及因果路径用于学习由退化信号引导的纯净语义表示;结合提出的“非因果退化建模”与“因果语义对齐”目标,实现语义与退化因素的有效解耦,从而在复杂视觉条件下保持稳定的检索与生成性能。

链接: https://arxiv.org/abs/2602.22013
作者: I-Hsiang Chen,Yu-Wei Liu,Tse-Yu Wu,Yu-Chien Chiang,Jen-Chien Yang,Wei-Ting Chen
机构: National Taiwan University (国立台湾大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026; Project Page: this https URL

点击查看摘要

Abstract:Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.

[CV-21] World Guidance: World Modeling in Condition Space for Action Generation

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在利用未来观测建模以促进动作生成时面临的挑战:如何在保持高效、可预测的未来表征的同时,保留足够的细粒度信息以指导精确的动作生成。现有方法难以在这两者之间取得平衡,导致动作生成质量受限或泛化能力不足。解决方案的关键在于提出WoG(World Guidance)框架,该框架通过将未来观测映射为紧凑的条件向量,并将其注入动作推理流程中,使VLA模型在训练过程中同时预测这些压缩后的条件与未来动作;这种在条件空间内进行世界建模的方法不仅提升了动作生成的精细度,还增强了模型的泛化性能,并能有效从大量人类操作视频中学习。

链接: https://arxiv.org/abs/2602.22010
作者: Yue Su,Sijin Chen,Haixin Shi,Mingyu Liu,Zhengshen Zhang,Ningyuan Huang,Weiheng Zhong,Zhengbang Zhu,Yuxiao Liu,Xihui Liu
机构: ByteDance(字节跳动); The University of Hong Kong (香港大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: this https URL

[CV-22] PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在等距圆柱投影(Equirectangular Projection, ERP)图像上进行三维空间推理能力不足的问题,其核心挑战在于ERP图像的几何失真以及缺乏足够的三维监督信号。解决方案的关键在于提出一个大规模、基于合成3D环境构建的VQA基准PanoEnv-QA,其中包含14.8K条具有精确三维标注(如深度、分割和边界框)的问题,并设计了一种基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习后训练框架,引入五种几何感知奖励策略以增强模型对空间关系的理解;同时采用两阶段课程学习机制——第一阶段在结构化任务(判断题与选择题)上训练,第二阶段在混合开放式数据上微调,从而有效缓解灾难性遗忘并提升泛化性能。该方法使7B规模模型在整体准确率和开放式问题准确率上分别达到52.93%和14.83%,显著优于现有模型。

链接: https://arxiv.org/abs/2602.21992
作者: Zekai Lin,Xu Zheng
机构: University of Glasgow(格拉斯哥大学); HKUST(GZ)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:360 panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.

[CV-23] PatchDenoiser: Parameter-efficient multi-scale patch learning and fusion denoiser for medical images

【速读】:该论文旨在解决医学图像(如低剂量CT)在采集过程中因噪声干扰导致质量下降的问题,这种噪声会严重影响临床诊断和后续分析。传统滤波方法易过度平滑而丢失细粒度解剖结构信息,而深度学习方法(如CNN、GAN、Transformer)虽性能优越,但常面临细节保留不足或模型庞大、计算成本高的问题,限制了其临床实用性。解决方案的关键在于提出一种轻量级、多尺度的基于patch的去噪框架PatchDenoiser,其核心创新是将去噪过程分解为局部纹理提取与全局上下文聚合,并通过空间感知的patch融合策略实现高效整合,从而在显著抑制噪声的同时有效保留关键解剖细节,同时具备参数少、能耗低、跨设备泛化能力强等优势。

链接: https://arxiv.org/abs/2602.21987
作者: Jitindra Fartiyal,Pedro Freire,Sergei K. Turitsyn,Sergei G. Solovski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review in Medical Image Analysis journal

点击查看摘要

Abstract:Medical images are essential for diagnosis, treatment planning, and research, but their quality is often degraded by noise from low-dose acquisition, patient motion, or scanner limitations, affecting both clinical interpretation and downstream analysis. Traditional filtering approaches often over-smooth and lose fine anatomical details, while deep learning methods, including CNNs, GANs, and transformers, may struggle to preserve such details or require large, computationally expensive models, limiting clinical practicality. We propose PatchDenoiser, a lightweight, energy-efficient multi-scale patch-based denoising framework. It decomposes denoising into local texture extraction and global context aggregation, fused via a spatially aware patch fusion strategy. This design enables effective noise suppression while preserving fine structural and anatomical details. PatchDenoiser is ultra-lightweight, with far fewer parameters and lower computational complexity than CNN-, GAN-, and transformer-based denoisers. On the 2016 Mayo Low-Dose CT dataset, PatchDenoiser consistently outperforms state-of-the-art CNN- and GAN-based methods in PSNR and SSIM. It is robust to variations in slice thickness, reconstruction kernels, and HU windows, generalizes across scanners without fine-tuning, and reduces parameters by ~9x and energy consumption per inference by ~27x compared with conventional CNN denoisers. PatchDenoiser thus provides a practical, scalable, and computationally efficient solution for medical image denoising, balancing performance, robustness, and clinical deployability. Comments: Under review in Medical Image Analysis journal Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.21987 [cs.CV] (or arXiv:2602.21987v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.21987 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jitindra Fartiyal [view email] [v1] Wed, 25 Feb 2026 15:08:43 UTC (5,244 KB)

[CV-24] When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters

【速读】:该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)技术在文本到图像扩散模型中被恶意利用的问题,即攻击者可借助LoRA模块作为隐蔽载体,在模型中植入后门行为,从而在不改变正常功能的前提下实现对生成结果的操控。解决方案的关键在于提出Masquerade-LoRA(MasqLoRA)框架,其核心机制是通过冻结基础模型参数,仅更新少量“触发词-目标图像”配对所对应的低秩适配权重,训练出一个独立的后门LoRA模块;该模块嵌入了跨模态隐式映射:当加载该模块并输入特定文本触发词时,模型将输出预定义的恶意图像,而在其他情况下则与原始模型无异,从而实现高隐蔽性的攻击,实验表明该方法可在极低资源开销下达到99.8%的攻击成功率。

链接: https://arxiv.org/abs/2602.21977
作者: Liangwei Lyu,Jiaqi Xu,Jianwei Ding,Qiyao Deng
机构: People’s Public Security University of China (中国公安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack surface. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first systematic attack framework that leverages an independent LoRA module as the attack vehicle to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of “trigger word-target image” pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; otherwise, it behaves indistinguishably from the benign model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.

[CV-25] Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments

【速读】:该论文旨在解决现有单目主动SLAM(Simultaneous Localization and Mapping,同步定位与建图)方法在三个关键方面的局限性:一是受限于底层SLAM模块的性能瓶颈;二是运动规划策略缺乏长期视野,多为短视决策;三是难以有效处理动态场景。解决方案的核心在于提出一种名为Dream-SLAM的新方法,其关键创新是基于“梦境”生成跨时空图像(cross-spatio-temporal images)和语义合理的部分观测动态环境结构,通过将这些虚拟生成的图像与真实观测融合,提升相机位姿估计精度和三维场景表示的一致性;同时,结合梦中与真实场景结构进行长程规划,从而生成具有前瞻性的探索轨迹,显著提高探索效率与建图质量。

链接: https://arxiv.org/abs/2602.21967
作者: Xiangqi Meng,Pengxu Hou,Zhenjun Zhao,Javier Civera,Daniel Cremers,Hesheng Wang,Haoang Li
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); University of Zaragoza (萨拉戈萨大学); Technical University of Munich (慕尼黑工业大学); Shanghai Jiao Tong University (上海交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In addition to the core tasks of simultaneous localization and mapping (SLAM), active SLAM additionally in- volves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal im- ages are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long- horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.

[CV-26] Global-Aware Edge Prioritization for Pose Graph Initialization CVPR2026

【速读】:该论文旨在解决Structure-from-Motion (SfM) 中姿态图(pose graph)初始化阶段因几何验证计算成本高而导致的边(edge)稀疏性问题,现有方法依赖图像检索独立连接每张图像与其k近邻,忽视了全局一致性。其解决方案的关键在于提出“边优先级排序”(edge prioritization)机制,通过三个核心组件实现:(1) 使用SfM衍生监督信号训练图神经网络(GNN)以预测全局一致的边可靠性;(2) 基于多最小生成树(multi-minimal-spanning-tree)策略构建受优先级引导的姿态图;(3) 引入连通性感知的分数调制模块,增强弱连接区域并减小图直径。该方法显著提升了姿态图的可靠性和紧凑性,在稀疏和高速场景下改善重建精度,并优于当前最优的图像检索方法在模糊场景中的表现。

链接: https://arxiv.org/abs/2602.21963
作者: Tong Wei,Giorgos Tolias,Jiri Matas,Daniel Barath
机构: Czech Technical University in Prague (捷克技术大学); ETH Zurich (苏黎世联邦理工学院); HUN-REN SZTAKI (匈牙利科学院计算机与自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR 2026

点击查看摘要

Abstract:The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its k nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The ode and trained models are available at this https URL.

[CV-27] Global-Local Dual Perception for MLLM s in High-Resolution Text-Rich Image Translation

【速读】:该论文旨在解决高分辨率文本丰富图像中的文本图像机器翻译(Text Image Machine Translation, TIMT)问题,现有方法在面对杂乱布局、多样字体及非文本干扰时易出现文本遗漏、语义偏移和上下文不一致等问题。其解决方案的关键在于提出一种基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的全局-局部双视觉感知框架(GLoTran),通过低分辨率全局图像与多尺度区域级文本图像切片的指令引导对齐策略,使MLLMs在保持场景级上下文一致性的同时精准捕捉细粒度文本细节。

链接: https://arxiv.org/abs/2602.21956
作者: Junxin Lu,Tengfei Song,Zhanglin Wu,Pengfei Li,Xiaowei Liang,Hui Yang,Kun Chen,Ning Xie,Yunfei Lu,Jing Zhao,Shiliang Sun,Daimeng Wei
机构: East China Normal University (华东师范大学); Huawei Technologies Co., LTD (华为技术有限公司); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text Image Machine Translation (TIMT) aims to translate text embedded in images in the source-language into target-language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs),struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.

[CV-28] MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving CVPR2026

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLM)在端到端自动驾驶系统中进行链式思维(Chain-of-Thought, CoT)推理时存在的核心问题:一是传统文本形式的CoT存在语义空间与轨迹物理空间之间的鸿沟,难以有效指导驾驶决策;二是近期采用未来图像替代文本作为CoT输入的方法缺乏明确的面向规划的目标引导,导致生成的场景演化图像不准确。解决方案的关键在于提出MindDriver框架,该框架通过三阶段渐进式多模态推理机制实现从语义理解、语义到物理空间的想象再到物理空间轨迹规划的对齐过程,并结合反馈引导的自动数据标注流水线生成对齐的多模态推理训练数据,以及基于高阶奖励的渐进强化微调方法优化推理一致性,从而显著提升自动驾驶系统在nuScenes开放环路和Bench2Drive闭环评估中的性能表现。

链接: https://arxiv.org/abs/2602.21952
作者: Lingjun Zhang,Yujian Yuan,Changjie Wu,Xinyuan Chang,Xin Cai,Shuang Zeng,Linzhe Shi,Sijin Wang,Hang Zhang,Mu Xu
机构: Amap, Alibaba Group (高德地图,阿里巴巴集团); The Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026; Yujian Yuan and Lingjun Zhang contributed equally with random order

点击查看摘要

Abstract:Vision-Language Models (VLM) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), as VLM’s widely used reasoning strategy, is facing critical challenges. Existing textual CoT has a large gap between text semantic space and trajectory physical space. Although the recent approach utilizes future image to replace text as CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these, we innovatively propose MindDriver, a progressive multimodal reasoning framework that enables VLM to imitate human-like progressive thinking for autonomous driving. MindDriver presents semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high- level reward-based learning. MindDriver demonstrates superior performance in both nuScences open-loop and Bench2Drive closed-loop evaluation. Codes are available at this https URL.

[CV-29] Learning to Fuse and Reconstruct Multi-View Graphs for Diabetic Retinopathy Grading

【速读】:该论文旨在解决多视图眼底图像(multi-view fundus images)在糖尿病视网膜病变(Diabetic Retinopathy, DR)分级中,现有深度学习方法忽视视图间相关性、未能充分挖掘来自同一患者的不同视角图像之间内在一致性的问题。解决方案的关键在于提出一种端到端的多视图图融合框架(Multi-View Graph Fusion, MVGFDR),其核心创新是引入了一个新颖的多视图图融合(Multi-View Graph Fusion, MVGF)模块,能够显式解耦共享特征与视图特有特征:首先通过残差引导连接构建视觉图并利用离散余弦变换(Discrete Cosine Transform, DCT)系数作为频域锚点进行初始化;其次基于频域相关性选择性融合多视图图中的节点以捕获互补的视图特有信息;最后通过掩码跨视图重建机制促进视图不变表示的学习,从而提升DR分级精度。

链接: https://arxiv.org/abs/2602.21944
作者: Haoran Li,Yuxin Lin,Huan Wang,Xiaoling Luo,Qi Zhu,Jiahua Shi,Huaming Chen,Bo Du,Johan Barthelemy,Zongyan Xue,Jun Shen,Yong Xu
机构: Monash University (莫纳什大学); ARC Centre of Excellence for the Weather of the 21st Century; University of Wollongong (伍伦贡大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Shenzhen University (深圳大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); The University of Queensland (昆士兰大学); University of Sydney (悉尼大学); Griffith University (格里菲斯大学); NVIDIA (英伟达); The University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diabetic retinopathy (DR) is one of the leading causes of vision loss worldwide, making early and accurate DR grading critical for timely intervention. Recent clinical practices leverage multi-view fundus images for DR detection with a wide coverage of the field of view (FOV), motivating deep learning methods to explore the potential of multi-view learning for DR grading. However, existing methods often overlook the inter-view correlations when fusing multi-view fundus images, failing to fully exploit the inherent consistency across views originating from the same patient. In this work, we present MVGFDR, an end-to-end Multi-View Graph Fusion framework for DR grading. Different from existing methods that directly fuse visual features from multiple views, MVGFDR is equipped with a novel Multi-View Graph Fusion (MVGF) module to explicitly disentangle the shared and view-specific visual features. Specifically, MVGF comprises three key components: (1) Multi-view Graph Initialization, which constructs visual graphs via residual-guided connections and employs Discrete Cosine Transform (DCT) coefficients as frequency-domain anchors; (2) Multi-view Graph Fusion, which integrates selective nodes across multi-view graphs based on frequency-domain relevance to capture complementary view-specific information; and (3) Masked Cross-view Reconstruction, which leverages masked reconstruction of shared information across views to facilitate view-invariant representation learning. Extensive experimental results on MFIDDR, by far the largest multi-view fundus image dataset, demonstrate the superiority of our proposed approach over existing state-of-the-art approaches in diabetic retinopathy grading.

[CV-30] Mobile-Ready Automated Triage of Diabetic Retinopathy Using Digital Fundus Images

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)自动化筛查中人工诊断效率低、易出错及部署受限的问题。其关键解决方案是提出一种轻量级深度学习框架,采用MobileNetV3主干网络结合一致性秩 logits(Consistent Rank Logits, CORAL)头结构,在保持计算高效的同时建模疾病分级的有序性,从而实现对数字眼底图像的精准分级评估。该方法在APTOS 2019与IDRiD数据集上通过预处理(圆形裁剪与光照归一化)和三折交叉验证验证了有效性,最终获得0.9019的加权肯德尔相关系数(Quadratic Weighted Kappa, QWK)和80.03%的准确率,并针对实际部署优化了模型校准与移动端适配能力。

链接: https://arxiv.org/abs/2602.21943
作者: Aadi Joshi,Manav S. Sharma,Vijay Uttam Rathod,Ashlesha Sawant,Prajakta Musale,Asmita B. Kalamkar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at ICCI 2025. 11 pages, 2 figures. MobileNetV3 + CORAL-based lightweight model for diabetic retinopathy severity classification with mobile deployment

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is a major cause of vision impairment worldwide. However, manual diagnosis is often time-consuming and prone to errors, leading to delays in screening. This paper presents a lightweight automated deep learning framework for efficient assessment of DR severity from digital fundus images. We use a MobileNetV3 architecture with a Consistent Rank Logits (CORAL) head to model the ordered progression of disease while maintaining computational efficiency for resource-constrained environments. The model is trained and validated on a combined dataset of APTOS 2019 and IDRiD images using a preprocessing pipeline including circular cropping and illumination normalization. Extensive experiments including 3-fold cross-validation and ablation studies demonstrate strong performance. The model achieves a Quadratic Weighted Kappa (QWK) score of 0.9019 and an accuracy of 80.03 percent. Additionally, we address real-world deployment challenges through model calibration to reduce overconfidence and optimization for mobile devices. The proposed system provides a scalable and practical tool for early-stage diabetic retinopathy screening.

[CV-31] Directed Ordinal Diffusion Regularization for Progression-Aware Diabetic Retinopathy Grading

【速读】:该论文旨在解决现有序数回归方法在糖尿病视网膜病变(Diabetic Retinopathy, DR)分级中忽略疾病进展方向性的问题,即模型通常将DR严重程度视为静态、对称的等级关系,未能体现其从轻到重不可逆的生物学演化过程,导致学习到的特征表示可能出现非连续阶段间的不合理的邻近性甚至逆向转移,违背临床逻辑。解决方案的关键在于提出定向序数扩散正则化(Directed Ordinal Diffusion Regularization, D-ODR),通过构建一个受疾病进展约束的有向图结构,在该结构上进行多尺度扩散操作,对沿有效进展路径出现的评分反转施加惩罚,从而强制模型学习符合生物合理性的前向演化特征表示,使模型输出更贴近DR实际的渐进恶化轨迹。

链接: https://arxiv.org/abs/2602.21942
作者: Huangwei Chen,Junhao Jia,Ruocheng Li,Cunyuan Yang,Wu Li,Xiaotao Pang,Yifei Chen,Haishuai Wang,Jiajun Bu,Lei Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3 figures

点击查看摘要

Abstract:Diabetic Retinopathy (DR) progresses as a continuous and irreversible deterioration of the retina, following a well-defined clinical trajectory from mild to severe stages. However, most existing ordinal regression approaches model DR severity as a set of static, symmetric ranks, capturing relative order while ignoring the inherent unidirectional nature of disease progression. As a result, the learned feature representations may violate biological plausibility, allowing implausible proximity between non-consecutive stages or even reverse transitions. To bridge this gap, we propose Directed Ordinal Diffusion Regularization (D-ODR), which explicitly models the feature space as a directed flow by constructing a progression-constrained directed graph that strictly enforces forward disease evolution. By performing multi-scale diffusion on this directed structure, D-ODR imposes penalties on score inversions along valid progression paths, thereby effectively preventing the model from learning biologically inconsistent reverse transitions. This mechanism aligns the feature representation with the natural trajectory of DR worsening. Extensive experiments demonstrate that D-ODR yields superior grading performance compared to state-of-the-art ordinal regression and DR-specific grading methods, offering a more clinically reliable assessment of disease severity. Our code is available on this https URL.

[CV-32] A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography

【速读】:该论文旨在解决冠状动脉钙化(Coronary Artery Calcium, CAC)评分在常规胸部CT扫描中难以应用的问题,传统CAC评分依赖于心电图门控(ECG-gated)CT扫描,限制了其在非心脏专用影像场景中的普及。解决方案的关键在于提出一种自动化框架,基于自监督视觉Transformer模型CARD-ViT,该模型仅使用门控CT数据通过DINO方法训练,实现了从门控域到非门控域的跨域迁移能力,在无需任何非门控数据训练的情况下,仍能在Stanford非门控数据集上达到与专门针对非门控数据训练模型相当的性能(准确率0.707,Cohen’s kappa 0.528),并在门控测试集上实现高准确性(准确率0.910,Cohen’s kappa 0.871–0.874),证明了跨域CAC评分的可行性,为在常规胸部CT中开展大规模心血管风险筛查提供了技术支撑。

链接: https://arxiv.org/abs/2602.21935
作者: Mahmut S. Gokmen,Moneera N. Haque,Steve W. Leung,Caroline N. Leach,Seth Parker,Stephen B. Hobbs,Vincent L. Sorrell,W. Brent Seales,V. K. Cody Bumgardner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Coronary artery calcium (CAC) scoring is a key predictor of cardiovascular risk, but it relies on ECG-gated CT scans, restricting its use to specialized cardiac imaging settings. We introduce an automated framework for CAC detection and lesion-specific Agatston scoring that operates across both gated and non-gated CT scans. At its core is CARD-ViT, a self-supervised Vision Transformer trained exclusively on gated CT data using DINO. Without any non-gated training data, our framework achieves 0.707 accuracy and a Cohen’s kappa of 0.528 on the Stanford non-gated dataset, matching models trained directly on non-gated scans. On gated test sets, the framework achieves 0.910 accuracy with Cohen’s kappa scores of 0.871 and 0.874 across independent datasets, demonstrating robust risk stratification. These results demonstrate the feasibility of cross-domain CAC scoring from gated to non-gated domains, supporting scalable cardiovascular screening in routine chest imaging without additional scans or annotations.

[CV-33] Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context CVPR2026

【速读】:该论文旨在解决场景一致性的视频生成问题,即在给定相机轨迹的情况下生成与3D场景保持几何一致性的视频。传统方法依赖外部记忆或迭代的3D重建与图像修复(inpainting),存在推理过程中误差累积、非可微过程及多模型分离等问题。其解决方案的关键在于提出“几何作为上下文”(geometry-as-context)框架,利用自回归相机控制的视频生成模型,迭代完成两个步骤:(1) 估计当前视角所需的几何信息以支持3D重建,(2) 模拟并恢复由3D场景渲染的新视角图像。该框架通过引入相机门控注意力模块(camera gated attention module)增强模型对相机位姿的利用能力,并在训练阶段结合文本上下文动态决定生成几何或RGB图像,同时在推理阶段随机丢弃几何上下文以确保仅输出RGB视频,从而实现高保真且具有一致性的场景视频生成。

链接: https://arxiv.org/abs/2602.21929
作者: JiaKui Hu,Jialun Liu,Liying Yang,Xinliang Zhang,Kaiwen Li,Shuang Zeng,Yuanwei Li,Haibin Huang,Chi Zhang,Yanye Lu
机构: Institute of Medical Technology, Peking University (北京大学医学技术研究院); TeleAI; MUST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce ``geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model’s capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and forth-and-back trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.

[CV-34] Learning in the Null Space: Small Singular Values for Continual Learning

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中灾难性遗忘(catastrophic forgetting)与新任务学习能力之间的权衡问题。其解决方案的关键在于利用每一层输入表示的小奇异值(small singular values)来构造近似零空间(null space),并在此约束下通过低秩适配(LoRA-style)形式参数化任务特定的权重更新。具体而言,NESS方法直接在权重空间中施加正交性约束,而非依赖梯度投影,且仅需为每个任务学习一个可训练矩阵,从而确保更新方向近似正交于先前任务的输入空间,有效缓解遗忘并支持稳定的新任务适应。

链接: https://arxiv.org/abs/2602.21919
作者: Cuong Anh Pham,Praneeth Vepakomma,Samuel Horváth
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI, 穆罕默德·本·扎耶德人工智能大学); Massachusetts Institute of Technology (MIT, 麻省理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, accepted as Oral presentation at the Third Conference on Parsimony and Learning (CPAL 2026)

点击查看摘要

Abstract:Alleviating catastrophic forgetting while enabling further learning is a primary challenge in continual learning (CL). Orthogonal-based training methods have gained attention for their efficiency and strong theoretical properties, and many existing approaches enforce orthogonality through gradient projection. In this paper, we revisit orthogonality and exploit the fact that small singular values correspond to directions that are nearly orthogonal to the input space of previous tasks. Building on this principle, we introduce NESS (Null-space Estimated from Small Singular values), a CL method that applies orthogonality directly in the weight space rather than through gradient manipulation. Specifically, NESS constructs an approximate null space using the smallest singular values of each layer’s input representation and parameterizes task-specific updates via a compact low-rank adaptation (LoRA-style) formulation constrained to this subspace. The subspace basis is fixed to preserve the null-space constraint, and only a single trainable matrix is learned for each task. This design ensures that the resulting updates remain approximately in the null space of previous inputs while enabling adaptation to new tasks. Our theoretical analysis and experiments on three benchmark datasets demonstrate competitive performance, low forgetting, and stable accuracy across tasks, highlighting the role of small singular values in continual learning. The code is available at this https URL.

[CV-35] Scan Clusters Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration CVPR26

【速读】:该论文旨在解决超高清(Ultra-High-Definition, UHD)图像恢复任务中因像素级操作导致的可扩展性危机,即现有模型计算复杂度不可持续的问题。其核心挑战在于,尽管状态空间模型(State Space Models, SSMs)如Mamba具备线性复杂度优势,但其逐像素扫描机制仍无法高效处理UHD图像中数百万像素的数据。解决方案的关键在于提出C²SSM——一种基于聚类中心的视觉状态空间模型,它摒弃了传统的像素级扫描方式,转而采用“聚类级扫描”策略:通过神经参数化的混合模型将UHD图像的丰富特征分布压缩为稀疏语义中心点(semantic centroids),并在此基础上构建双路径全局建模机制——先在少量聚类中心上进行扫描与推理,再通过合理的相似性分布将全局上下文扩散回所有像素,同时由轻量调制器保留细节信息。这一聚类中心范式显著降低了计算成本,并在五个UHD恢复任务中达到新的最先进性能。

链接: https://arxiv.org/abs/2602.21917
作者: Chen Wu,Ling Wang,Zhuoran Zheng,Yuning Cui,Zhixiong Yang,Xiangyu Chen,Yue Zhang,Weidong Jiang,Jingyuan Xia
机构: National University of Defense Technology (国防科技大学); HKUST(GZ) (香港科技大学(广州)); Qilu University of Technology (齐鲁工业大学); Technical University of Munich (慕尼黑工业大学); Institute of Artificial Intelligence (TeleAI) (人工智能研究所(TeleAI)); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Aceepted by CVPR26

点击查看摘要

Abstract:Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C ^2 SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C ^2 SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C ^2 SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.

[CV-36] Protein Graph Neural Networks for Heterogeneous Cryo-EM Reconstruction

【速读】:该论文旨在解决异质性单粒子冷冻电镜(cryo-EM)重构中难以准确预测原子级主链构象的问题。其关键解决方案是提出一种几何感知的图神经网络(GNN)自编码器架构,将蛋白质主链表示为图结构,并通过学习从每张图像的潜在变量到模板构象三维位移的映射来实现高精度重建;该方法结合基于可微分cryo-EM前向模型的数据不一致性项与几何正则化项,并利用椭球支撑提升(ellipsoidal support lifting, ESL)实现未知朝向估计,从而在合成数据集上显著优于同规模多层感知机(MLP)。

链接: https://arxiv.org/abs/2602.21915
作者: Jonathan Krook,Axel Janson,Joakim andén,Melanie Weber,Ozan Öktem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a geometry-aware method for heterogeneous single-particle cryogenic electron microscopy (cryo-EM) reconstruction that predicts atomic backbone conformations. To incorporate protein-structure priors, we represent the backbone as a graph and use a graph neural network (GNN) autodecoder that maps per-image latent variables to 3D displacements of a template conformation. The objective combines a data-discrepancy term based on a differentiable cryo-EM forward model with geometric regularization, and it supports unknown orientations via ellipsoidal support lifting (ESL) pose estimation. On synthetic datasets derived from molecular dynamics trajectories, the proposed GNN achieves higher accuracy compared to a multilayer perceptron (MLP) of comparable size, highlighting the benefits of a geometry-informed inductive bias.

[CV-37] IRAuxCloud: A Thermal Infrared Dataset for Day and Night Cloud Detection

【速读】:该论文旨在解决地球观测中云层干扰导致的遥感应用可靠性下降问题,特别是在夜间因缺乏太阳光照射而难以准确检测云层的挑战。其关键解决方案是构建TIRAuxCloud多模态数据集,该数据集以热红外(Thermal Infrared, TIR)光谱数据为核心,融合Landsat与VIIRS的多光谱数据(包括TIR、可见光和近红外波段)以及辅助信息层(如高程、地表覆盖、气象变量和无云参考图像),从而降低地表与云层之间的混淆及云形成不确定性。此外,通过提供大量自动标注样本和小规模人工标注子集,有效缓解了云标签稀缺问题,为昼夜连续云分割模型的训练与评估提供了高质量基准,推动了生成式AI在全天候云检测中的发展。

链接: https://arxiv.org/abs/2602.21905
作者: Alexis Apostolakis,Vasileios Botsos,Niklas Wölki,Andrea Spichtinger,Nikolaos Ioannis Bountos,Ioannis Papoutsis,Panayiotis Tsanakas
机构: National Technical University of Athens (国立技术大学雅典); National Observatory of Athens (国家天文台); OroraTech; School of Rural, Surveying and Geoinformatics Engineering (农村、测量与地理信息工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clouds are a major obstacle in Earth observation, limiting the usability and reliability of critical remote sensing applications such as fire disaster response, urban heat island monitoring, and snow and ice cover mapping. Therefore, the ability to detect clouds 24/7 is of paramount importance. While visible and near-infrared bands are effective for daytime cloud detection, their dependence on solar illumination makes them unsuitable for nighttime monitoring. In contrast, thermal infrared (TIR) imagery plays a crucial role in detecting clouds at night, when sunlight is absent. Due to their generally lower temperatures, clouds emit distinct thermal signatures that are detectable in TIR bands. Despite this, accurate nighttime cloud detection remains challenging due to limited spectral information and the typically lower spatial resolution of TIR imagery. To address these challenges, we present TIRAuxCloud, a multi-modal dataset centered around thermal spectral data to facilitate cloud segmentation under both daytime and nighttime conditions. The dataset comprises a unique combination of multispectral data (TIR, optical, and near-infrared bands) from Landsat and VIIRS, aligned with auxiliary information layers. Elevation, land cover, meteorological variables, and cloud-free reference images are included to help reduce surface-cloud ambiguity and cloud formation uncertainty. To overcome the scarcity of manual cloud labels, we include a large set of samples with automated cloud masks and a smaller manually annotated subset to further evaluate and improve models. Comprehensive benchmarks are presented to establish performance baselines through supervised and transfer learning, demonstrating the dataset’s value in advancing the development of innovative methods for day and night time cloud detection.

[CV-38] UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing ICCV

【速读】:该论文旨在解决自主赛车在复杂环境中的锥桶(cone)精确定位问题,以实现高精度的赛道导航。传统计算机视觉方法对环境变化敏感,而现有神经网络模型受限于训练数据不足且难以实时运行。其解决方案的关键在于构建了一个基于UNet架构的神经网络模型,并利用自建的最大规模定制标注数据集进行训练,从而实现高精度的关键点检测,不仅提升了锥桶位置估计的准确性,还具备颜色预测潜力;同时,该模型被集成至感知流水线中进行端到端评估,验证了其在实际应用中的高性能表现。

链接: https://arxiv.org/abs/2602.21904
作者: Mariia Baidachna,James Carty,Aidan Ferguson,Joseph Agrane,Varad Kulkarni,Aubrey Agub,Michael Baxendale,Aaron David,Rachel Horton,Elliott Atkinson
机构: University of Glasgow (格拉斯哥大学); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 9 figures. Accepted to ICCV End-to-End 3D Learning Workshop 2025 and presented as a poster; not included in the final proceedings due to a conference administrative error

点击查看摘要

Abstract:Accurate cone localization in 3D space is essential in autonomous racing for precise navigation around the track. Approaches that rely on traditional computer vision algorithms are sensitive to environmental variations, and neural networks are often trained on limited data and are infeasible to run in real time. We present a UNet-based neural network for keypoint detection on cones, leveraging the largest custom-labeled dataset we have assembled. Our approach enables accurate cone position estimation and the potential for color prediction. Our model achieves substantial improvements in keypoint accuracy over conventional methods. Furthermore, we leverage our predicted keypoints in the perception pipeline and evaluate the end-to-end autonomous system. Our results show high-quality performance across all metrics, highlighting the effectiveness of this approach and its potential for adoption in competitive autonomous racing systems.

[CV-39] EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion ICRA2026

【速读】:该论文旨在解决内窥镜环境下深度估计精度不足的问题,尤其针对弱纹理和光照变化导致的稀疏且不准确的深度图问题。现有自监督深度估计方法在上述复杂环境中性能下降,而传统深度补全技术在内窥镜场景中应用有限。解决方案的关键在于提出EndoDDC方法,该方法融合图像信息、稀疏深度图与深度梯度特征,并通过扩散模型(diffusion model)优化深度图,从而有效提升复杂内窥镜环境下的深度估计准确性与鲁棒性。

链接: https://arxiv.org/abs/2602.21893
作者: Yinheng Lin,Yiming Huang,Beilei Cui,Long Bai,Huxin Gao,Hongliang Ren,Jiewen Lai
机构: The Chinese University of Hong Kong (香港中文大学); Alibaba (阿里巴巴)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA 2026

点击查看摘要

Abstract:Accurate depth estimation plays a critical role in the navigation of endoscopic surgical robots, forming the foundation for 3D reconstruction and safe instrument guidance. Fine-tuning pretrained models heavily relies on endoscopic surgical datasets with precise depth annotations. While existing self-supervised depth estimation techniques eliminate the need for accurate depth annotations, their performance degrades in environments with weak textures and variable lighting, leading to sparse reconstruction with invalid depth estimation. Depth completion using sparse depth maps can mitigate these issues and improve accuracy. Despite the advances in depth completion techniques in general fields, their application in endoscopy remains limited. To overcome these limitations, we propose EndoDDC, an endoscopy depth completion method that integrates images, sparse depth information with depth gradient features, and optimizes depth maps through a diffusion model, addressing the issues of weak texture and light reflection in endoscopic environments. Extensive experiments on two publicly available endoscopy datasets show that our approach outperforms state-of-the-art models in both depth accuracy and robustness. This demonstrates the potential of our method to reduce visual errors in complex endoscopic environments. Our code will be released at this https URL.

[CV-40] How to Take a Memorable Picture? Empowering Users with Actionable Feedback CVPR2026

【速读】:该论文旨在解决图像记忆性(image memorability)研究中缺乏实时反馈机制的问题,即传统方法仅能事后预测图像的记忆可能性或通过生成式AI修改图像以提升记忆性,但无法在用户拍摄时提供可操作的、人类可理解的指导建议。其解决方案的关键在于提出“记忆反馈”(Memorability Feedback, MemFeed)任务,并设计了首个无需训练的自然语言指导模型MemCoach,该模型基于多模态大语言模型(Multimodal Large Language Models, MLLMs),采用教师-学生引导策略,通过对比学习将模型内部激活向更易被记住的视觉模式对齐,从而生成具体且可执行的优化建议(如“强调面部表情”、“将主体前移”)。这一方法实现了从被动预测到主动教学的范式转变,显著提升了图像未来回忆的可能性。

链接: https://arxiv.org/abs/2602.21877
作者: Francesco Laiti,Davide Talon,Jacopo Staiano,Elisa Ricci
机构: University of Trento(特伦托大学); University of Pisa(比萨大学); Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted @ CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal to enhance an image future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., “emphasize facial expression,” “bring the subject forward”). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.

[CV-41] GFPL: Generative Federated Prototype Learning for Resource-Constrained and Data-Imbalanced Vision Task

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际部署中面临的两个核心问题:一是由于模型更新偏向多数类特征而导致的知识融合效率低下;二是因频繁传输高维模型参数带来的高昂通信开销。其解决方案的关键在于提出一种新型的生成式联邦原型学习(Generative Federated Prototype Learning, GFPL)框架,通过基于高斯混合模型(Gaussian Mixture Model, GMM)的原型生成方法捕捉类别特征的统计信息,并利用巴氏距离(Bhattacharyya distance)设计原型聚合策略以有效融合语义相似的知识;同时,借助融合后的原型生成伪特征来缓解客户端间特征分布不平衡问题,并引入双分类器架构结合点积回归(Dot Regression)与交叉熵(Cross-Entropy)的混合损失函数,提升本地训练过程中的特征对齐能力,从而在保持低通信成本的同时显著提升模型精度(在数据不平衡场景下提升3.6%)。

链接: https://arxiv.org/abs/2602.21873
作者: Shiwei Lu,Yuhang He,Jiashuo Li,Qiang Wang,Yihong Gong
机构: Xi’an Jiaotong University (西安交通大学); Air Force Engineering University (空军工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated learning (FL) facilitates the secure utilization of decentralized images, advancing applications in medical image recognition and autonomous driving. However, conventional FL faces two critical challenges in real-world deployment: ineffective knowledge fusion caused by model updates biased toward majority-class features, and prohibitive communication overhead due to frequent transmissions of high-dimensional model parameters. Inspired by the human brain’s efficiency in knowledge integration, we propose a novel Generative Federated Prototype Learning (GFPL) framework to address these issues. Within this framework, a prototype generation method based on Gaussian Mixture Model (GMM) captures the statistical information of class-wise features, while a prototype aggregation strategy using Bhattacharyya distance effectively fuses semantically similar knowledge across clients. In addition, these fused prototypes are leveraged to generate pseudo-features, thereby mitigating feature distribution imbalance across clients. To further enhance feature alignment during local training, we devise a dual-classifier architecture, optimized via a hybrid loss combining Dot Regression and Cross-Entropy. Extensive experiments on benchmarks show that GFPL improves model accuracy by 3.6% under imbalanced data settings while maintaining low communication cost.

[CV-42] Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barretts Video Segmentation

【速读】:该论文旨在解决内镜视频标注中因半自动工具(如Segment Anything Model 2, SAM2)在帧间传播标注时误差累积导致的精度下降问题,尤其针对Barrett食管异型增生这类边界不规则、难以界定的病变区域。解决方案的关键在于提出一种成本感知的“学习重提示”(Learning-to-Re-Prompt, L2RP)框架,该框架通过学习不同提示类型(掩码、边界框和点)下错误传播的规律,智能决定何时以及何处请求专家干预,从而在标注效率与分割准确性之间实现最优权衡。

链接: https://arxiv.org/abs/2602.21855
作者: Lokesha Rasanjalee,Jin Lin Tan,Dileepa Pitawela,Rajvinder Singh,Hsiang-Ting Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE ISBI 2026

点击查看摘要

Abstract:Accurate annotation of endoscopic videos is essential yet time-consuming, particularly for challenging datasets such as dysplasia in Barrett’s esophagus, where the affected regions are irregular and lack clear boundaries. Semi-automatic tools like Segment Anything Model 2 (SAM2) can ease this process by propagating annotations across frames, but small errors often accumulate and reduce accuracy, requiring expert review and correction. To address this, we systematically study how annotation errors propagate across different prompt types, namely masks, boxes, and points, and propose Learning-to-Re-Prompt (L2RP), a cost-aware framework that learns when and where to seek expert input. By tuning a human-cost parameter, our method balances annotation effort and segmentation accuracy. Experiments on a private Barrett’s dysplasia dataset and the public SUN-SEG benchmark demonstrate improved temporal consistency and superior performance over baseline strategies.

[CV-43] Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking

【速读】:该论文旨在解决当前基于深度学习的水印技术在面对多种失真类型时,因采用单随机失真(Single Random Distortion, SRD)训练策略而导致的鲁棒性与泛化能力受限的问题。SRD策略在每个训练批次中独立处理单一失真,忽略了不同失真之间的内在关联,从而引发批次间的优化冲突。为解决此问题,作者提出一种基于元学习与特征一致性(Meta-FC)的新训练策略:首先在元训练阶段从噪声池中随机采样多个失真构建任务,同时保留一个失真作为“未知”失真用于元测试;通过元学习引导模型识别在各类失真下保持稳定激活的神经元,减少因随机采样多样失真带来的优化冲突;进一步引入特征一致性损失,约束同一图像在不同失真下的解码特征保持一致,从而促进稳定激活向失真不变表示的转化。实验表明,该方法显著提升了水印模型在高强度、组合及未知失真场景下的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2602.21849
作者: Yuheng Li,Weitong Chen,Chengcheng Zhu,Jiale Zhang,Chunpeng Ge,Di Wu,Guodong Long
机构: Yangzhou University (扬州大学); Nanjing University (南京大学); Shandong University (山东大学); La Trobe University (拉特罗布大学); University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a \underline\textbfsingle \underline\textbfrandom \underline\textbfdistortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches. As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via \underline\textbfmeta-learning with \underline\textbffeature \underline\textbfconsistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated ``unknown’’ distortion for the meta-testing phase. Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch. To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent. Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59%, 4.71%, and 2.38% under high-intensity, combined, and unknown distortions.

[CV-44] UniVBench: Towards Unified Evaluation for Video Foundation Models

【速读】:该论文旨在解决当前视频基础模型(video foundation models)评估基准碎片化、任务单一、复杂度不足的问题,这些问题导致无法全面衡量模型在视频理解、生成、编辑及重建等多任务上的统一能力。解决方案的关键在于提出UniVBench这一专门针对视频基础模型的综合性评估基准,其核心创新包括:构建包含200个高质量、多样化且多镜头视频的数据集,每个视频均配有详细描述、多格式编辑指令和参考图像;并开发统一的代理式评估系统(UniV-Eval),实现跨任务的标准化提示、指令解析与评分机制,从而支持公平、可扩展且可复现的模型比较,首次实现了对视频基础模型集成能力的系统性测量。

链接: https://arxiv.org/abs/2602.21835
作者: Jianhui Wei,Xiaotian Zhang,Yichen Li,Yuan Wang,Yan Zhang,Ziyi Chen,Zhihang Tang,Wei Xu,Zuozhu Liu
机构: Zhejiang University (浙江大学); ByteDance (字节跳动); Zhejiang Lab (浙江省实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.

[CV-45] StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

【速读】:该论文旨在解决视觉故事生成模型在保持图像实体正确定位(visual grounding)的同时,仍可能出现语义关系幻觉的问题,例如错误的对话归属、角色互动或情感状态描述。解决方案的关键在于构建一个名为StoryMovie的数据集,通过最长公共子序列(Longest Common Subsequence, LCS)匹配将电影剧本与字幕时间戳对齐,从而实现基于脚本的角色名称、对话内容及关系动态的精准映射;在此基础上,作者微调Qwen Storyteller3模型,使其在保留视觉锚定标签的同时,融合真实角色名、对话和关系逻辑,显著提升了对话归属准确性——实验表明,相比未使用剧本对齐训练的Storyteller模型,Storyteller3在字幕对齐任务上胜率从38.0%提升至48.5%,并以89.9%胜率击败基线模型Qwen2.5-VL 7B,验证了语义对齐对于提升生成质量的重要性。

链接: https://arxiv.org/abs/2602.21829
作者: Daniel Oliveira,David Martins de Matos
机构: INESC-ID Lisboa (INESC-ID里斯本); Instituto Superior Técnico, Universidade de Lisboa (理工学院,里斯本大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, submitted to Journal of Visual Communication and Image Representation

点击查看摘要

Abstract:Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone. Comments: 15 pages, submitted to Journal of Visual Communication and Image Representation Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) ACMclasses: I.2.7; I.2.10; I.4.8 Cite as: arXiv:2602.21829 [cs.CV] (or arXiv:2602.21829v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.21829 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-46] Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

【速读】:该论文旨在解决生成式 AI (Generative AI) 在阴影生成与光照重渲染(relighting)任务中常出现的物理不一致性问题,如漂浮阴影、光照不一致及几何不合理等。其关键解决方案是提出了一种名为 Light-Geometry Interaction (LGI) maps 的新表征方式,该方法基于单目深度图编码光-aware遮挡信息,显式地将光照方向与几何结构关联,从而提供一个物理启发式的先验约束,有效指导生成模型进行符合真实光路传播规律的阴影和光照推理。通过将LGI嵌入桥接匹配(bridge-matching)生成骨干网络,显著降低歧义并增强物理一致性,同时构建首个大规模联合阴影生成与光照重渲染基准数据集以支持训练与评估。

链接: https://arxiv.org/abs/2602.21820
作者: Shan Wang,Peixia Li,Chenchen Xu,Ziang Cheng,Jiayu Yang,Hongdong Li,Pulak Purkait
机构: Amazon; Australian National University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRL 2026

点击查看摘要

Abstract:We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting - unlike prior methods that treat them as disjoint tasks - capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.

[CV-47] SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

【速读】:该论文旨在解决fMRI-to-video重建中面临的两大核心问题:一是显著物体在不同帧间视觉表征不一致导致的外观错位,二是时间一致性差引发的运动错位或帧间突变。解决方案的关键在于提出SemVideo框架,其核心创新是引入SemMiner模块,通过构建静态锚定描述、运动导向叙事和整体摘要三个层次的语义线索,为视频重建提供分层语义引导;在此基础上,SemVideo集成语义对齐解码器、运动适配解码器和条件视频渲染模块,实现从fMRI信号到高质量视频的高保真重建,从而显著提升语义一致性与时间连续性,达到当前最优性能。

链接: https://arxiv.org/abs/2602.21819
作者: Minghan Yang,Lan Yang,Ke Li,Honggang Zhang,Kaiyue Pang,Yizhe Song
机构: Beijing University of Posts and Telecommunications (北京邮电大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.

[CV-48] SkyReels-V4: Multi-modal Video-Audio Generation Inpainting and Editing model

【速读】:该论文旨在解决视频生成、编辑与修复任务中多模态输入支持不足、音视频联合生成能力弱以及高分辨率长时视频生成效率低的问题。其核心解决方案在于提出SkyReels V4,一个基于双流多模态扩散Transformer(MMDiT)架构的统一视频基础模型,其中视频分支和音频分支分别处理视觉与听觉内容,同时共享由多模态大语言模型(MMLM)驱动的强大文本编码器,从而实现对文本、图像、视频片段、掩码及音频参考等丰富多模态指令的精准理解与执行。关键创新包括:通过通道拼接公式统一图像到视频、视频扩展和视频编辑等多种修复任务;引入“低分辨率全序列联合生成+高分辨率关键帧精细化重建”的效率策略,结合专用超分辨率与帧插值模型,在保持电影级质量(最高1080p、32 FPS、15秒)的同时显著提升计算可行性。

链接: https://arxiv.org/abs/2602.21818
作者: Guibin Chen,Dixuan Lin,Jiangping Yang,Youqiang Zhang,Zhengcong Fei,Debang Li,Sheng Chen,Chaofeng Ao,Nuo Pang,Yiming Wang,Yikun Dou,Zheng Chen,Mingyuan Fan,Tuanhui Li,Mingshan Chang,Hao Zhang,Xiaopeng Sun,Jingtao Xu,Yuqiang Xie,Jiahua Wang,Zhiheng Xu,Weiming Xiong,Yuzhe Jin,Baoxuan Gu,Binjie Mao,Yunjie Yu,Jujie He,Yuhao Feng,Shiwen Tu,Chaojie Wang,Rui Yan,Wei Shen,Jingchen Wu,Peng Zhao,Xuanyue Zhong,Zhuangzhuang Liu,Kaifei Wang,Fuxiang Zhang,Weikai Xu,Wenyan Liu,Binglu Zhang,Yu Shen,Tianhui Xiong,Bin Peng,Liang Zeng,Xuchen Song,Haoxiang Guo,Peiyu Wang,Yahui Zhou
机构: SkyReels Team; Skywork AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:SkyReels V4 is a unified multi modal video foundation model for joint video audio generation, inpainting, and editing. The model adopts a dual stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on the Multimodal Large Language Models (MMLM). SkyReels V4 accepts rich multi modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLMs multi modal instruction following capability with in context learning in the video branch MMDiT, the model can inject fine grained visual guidance under complex conditioning, while the audio branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel concatenation formulation that unifies a wide range of inpainting style tasks, such as image to video, video extension, and video editing under a single interface, and naturally extends to vision referenced inpainting and editing via multi modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15 second duration, enabling high fidelity, multi shot, cinema level video generation with synchronized audio. To make such high resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: Joint generation of low resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

[CV-49] GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry

【速读】:该论文旨在解决动态场景中运动分割(motion segmentation)的难题,传统方法依赖于从噪声较大的运动线索中估计相机位姿和点对应关系,而统计推断或迭代优化技术在多阶段流程中易产生累积误差,导致性能受限或计算成本过高。其解决方案的关键在于提出一种完全基于学习的方法,通过注意力机制直接从潜在特征表示中推理出移动物体,从而实现端到端的前向运动分割;核心思想是绕过显式的对应关系估计,让模型隐式地解耦物体运动与相机运动,同时利用近期4D场景几何重建(如π³)提供的可靠相机位姿和丰富的时空先验,确保训练稳定性和推理鲁棒性。

链接: https://arxiv.org/abs/2602.21810
作者: Xiankang He,Peile Lin,Ying Cui,Dongyan Guo,Chunhua Shen,Xiaoqin Zhang
机构: Zhejiang University of Technology (浙江工业大学); Zhejiang Key Laboratory of Visual Information Intelligent Processing (浙江省视觉信息智能处理重点实验室); State Key Lab of CAD & CG, Zhejiang University (浙江大学CAD与计算机图形学国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., \pi^3 ), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. The code is available at:this https URL.

[CV-50] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression

【速读】:该论文旨在解决基于视觉的3D几何建模方法(如StreamVGGT)在长时间视频流输入下Key-Value (KV)缓存无界增长的问题,这会导致内存消耗和推理延迟显著上升,限制了其在长时序场景中的可扩展性。解决方案的关键在于提出一种无需微调的XStreamVGGT框架,通过将剪枝(pruning)与量化(quantization)无缝集成到KV缓存压缩流程中:首先利用高效的token重要性识别机制对多帧输入产生的冗余KV进行剪枝,以满足固定内存预算;随后,在剪枝基础上引入基于KV张量分布特性的维度自适应量化策略,进一步降低内存开销并保持数值精度。该方法在几乎不损失性能的前提下,实现了4.42倍的内存压缩和5.48倍的推理加速。

链接: https://arxiv.org/abs/2602.21780
作者: Zunhai Su,Weihao Ye,Hansen Feng,Keyu Fan,Jing Zhang,Dahai Yu,Zhengwu Liu,Ngai Wong
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Institute of Artificial Intelligence, Xiamen University (厦门大学人工智能研究院); China Star Optoelectronics Technology (中国星光学电子技术公司); TCL Corporate Research (HK) Co., Ltd. (TCL企业研究(香港)有限公司); Department of Electrical and Electronic Engineering, The University of Hong Kong (香港大学电机电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submission to the Journal of the Society for Information Display

点击查看摘要

Abstract:Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42 \times and accelerating inference by 5.48 \times , enabling practical and scalable streaming 3D applications. The code is available at this https URL.

[CV-51] Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models CVPR2026

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在深度伪造(deepfake)检测中忽视视频时序不一致性的问题。现有方法虽能有效识别静态空间伪影,但缺乏对动态伪造痕迹的推理能力。解决方案的关键在于提出 Forensic Answer-Questioning (FAQ),一个大规模基准测试,将时序深度伪造分析转化为多选题任务,并构建三级层次评估体系:(1)面部感知(Facial Perception),检测静态视觉伪影;(2)时序伪造定位(Temporal Deepfake Grounding),跨帧定位动态伪造痕迹;(3)法证推理(Forensic Reasoning),综合证据做出真实性判断。通过在 FAQ 上进行指令微调(FAQ-IT),显著提升了模型在域内与跨数据集检测任务上的性能,验证了该设计对增强 VLM 时序推理能力的核心作用。

链接: https://arxiv.org/abs/2602.21779
作者: Zheyuan Gu,Qingsong Zhao,Yusong Wang,Zhaohong Huang,Xinqi Li,Cheng Yuan,Jiaowei Shao,Chi Zhang,Xuelong Li
机构: Institute of Artificial Intelligence, China Telecom (TeleAI); Peking University; Fudan University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures. Submitted to CVPR 2026

点击查看摘要

Abstract:Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.

[CV-52] From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

【速读】:该论文旨在解决当前基于指令的图像编辑模型在处理涉及复杂因果动态(如折射或材料变形)时,难以生成物理上合理结果的问题。其核心挑战在于现有方法将编辑视为图像对之间的离散映射,仅提供边界条件而未明确过渡过程的动力学细节。解决方案的关键在于将物理感知编辑重新建模为预测性的物理状态转移,并构建了包含38K条跨五个物理领域的过渡轨迹的大规模视频数据集PhysicTran38K,同时提出PhysicEdit框架,该框架采用文本-视觉双思考机制,结合冻结的Qwen2.5-VL模型进行物理基础推理与可学习的时间步自适应过渡查询,为扩散主干网络提供视觉引导,从而显著提升编辑结果的物理真实性和知识一致性。

链接: https://arxiv.org/abs/2602.21778
作者: Liangbing Zhao,Le Zhuo,Sayak Paul,Hongsheng Li,Mohamed Elhoseiny
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: All code, checkpoints, and datasets are available at this https URL

点击查看摘要

Abstract:Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.

[CV-53] Easy to Learn Yet Hard to Forget: Towards Robust Unlearning Under Bias AAAI2026

【速读】:该论文旨在解决机器学习模型在执行遗忘(unlearning)过程中因数据中存在虚假相关性(spurious correlations)而导致的“捷径遗忘”(shortcut unlearning)问题,即模型难以真正遗忘与偏差一致的样本,反而会错误地消除偏差特征,从而在某些情况下反而提升对被遗忘类别的准确率。解决方案的关键在于提出CUPID框架,其核心机制是利用不同偏差样本在损失函数景观中的尖锐度(loss landscape sharpness)差异,将遗忘集划分为因果和偏差近似子集,并通过解耦模型参数为因果路径与偏差路径,分别注入细化后的因果梯度和偏差梯度进行针对性更新,从而实现更有效的、可解释的遗忘行为。

链接: https://arxiv.org/abs/2602.21773
作者: JuneHyoung Kwon,MiHyeon Kim,Eunju Lee,Yoonji Lee,Seunghoon Lee,YoungBin Kim
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Machine unlearning, which enables a model to forget specific data, is crucial for ensuring data privacy and model reliability. However, its effectiveness can be severely undermined in real-world scenarios where models learn unintended biases from spurious correlations within the data. This paper investigates the unique challenges of unlearning from such biased models. We identify a novel phenomenon we term shortcut unlearning," where models exhibit an easy to learn, yet hard to forget" tendency. Specifically, models struggle to forget easily-learned, bias-aligned samples; instead of forgetting the class attribute, they unlearn the bias attribute, which can paradoxically improve accuracy on the class intended to be forgotten. To address this, we propose CUPID, a new unlearning framework inspired by the observation that samples with different biases exhibit distinct loss landscape sharpness. Our method first partitions the forget set into causal- and bias-approximated subsets based on sample sharpness, then disentangles model parameters into causal and bias pathways, and finally performs a targeted update by routing refined causal and bias gradients to their respective pathways. Extensive experiments on biased datasets including Waterbirds, BAR, and Biased NICO++ demonstrate that our method achieves state-of-the-art forgetting performance and effectively mitigates the shortcut unlearning problem.

[CV-54] SAPNet: Evolving Point-Prompted Instance Segmentation with Semantic and Spatial Awareness

【速读】:该论文旨在解决单点提示(single-point prompt)在实例分割任务中因标注成本低而带来的粒度模糊性(granularity ambiguity)和边界不确定性(boundary uncertainty)问题,这些问题严重影响了点提示实例分割(Point-Prompted Instance Segmentation, PPIS)的精度。解决方案的关键在于提出Semantic-Aware Point-Prompted Instance Segmentation Network (SAPNet),其核心创新包括:1)引入点距离引导(Point Distance Guidance)与框挖掘策略(Box Mining Strategy),分别处理由点提示引发的全局与局部粒度歧义;2)在提案选择过程中引入完整性评分(completeness scores),增强多实例学习(Multiple Instance Learning, MIL)的空间粒度感知能力,称为S-MIL;3)通过多级亲和力精化(Multi-level Affinity Refinement)融合像素与语义线索,有效缩小边界不确定性。最终形成的SAPNet++显著提升了PPIS性能,在四个挑战性数据集上验证了方法的有效性。

链接: https://arxiv.org/abs/2602.21762
作者: Zhaoyang Wei,Xumeng Han,Xuehui Yu,Xue Yang,Guorong Li,Zhenjun Han,Jianbin Jiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages

点击查看摘要

Abstract:Single-point annotation is increasingly prominent in visual tasks for labeling cost reduction. However, it challenges tasks requiring high precision, such as the point-prompted instance segmentation (PPIS) task, which aims to estimate precise masks using single-point prompts to train a segmentation network. Due to the constraints of point annotations, granularity ambiguity and boundary uncertainty arise the difficulty distinguishing between different levels of detail (eg. whole object vs. parts) and the challenge of precisely delineating object boundaries. Previous works have usually inherited the paradigm of mask generation along with proposal selection to achieve PPIS. However, proposal selection relies solely on category information, failing to resolve the ambiguity of different granularity. Furthermore, mask generators offer only finite discrete solutions that often deviate from actual masks, particularly at boundaries. To address these issues, we propose the Semantic-Aware Point-Prompted Instance Segmentation Network (SAPNet). It integrates Point Distance Guidance and Box Mining Strategy to tackle group and local issues caused by the point’s granularity ambiguity. Additionally, we incorporate completeness scores within proposals to add spatial granularity awareness, enhancing multiple instance learning (MIL) in proposal selection termed S-MIL. The Multi-level Affinity Refinement conveys pixel and semantic clues, narrowing boundary uncertainty during mask refinement. These modules culminate in SAPNet++, mitigating point prompt’s granularity ambiguity and boundary uncertainty and significantly improving segmentation performance. Extensive experiments on four challenging datasets validate the effectiveness of our methods, highlighting the potential to advance PPIS.

[CV-55] Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

【速读】:该论文旨在解决扩散模型(Diffusion Models)在条件生成任务中推理阶段计算成本高、现有基于分布式并行的加速方法存在明显生成伪影且加速效果与GPU数量不成比例的问题。其解决方案的关键在于提出一种混合并行框架,核心创新包括:(i) 将条件去噪路径与无条件去噪路径作为新的数据划分视角,实现基于条件的分区策略(condition-based partitioning);(ii) 设计自适应并行切换机制(adaptive parallelism switching),根据两条路径间的去噪差异动态启用最优流水线并行策略。该方法在SDXL和SD3模型上分别实现了2.31×和2.07×的延迟降低,同时保持图像质量,验证了其在U-Net和DiT架构中的通用性与高效性。

链接: https://arxiv.org/abs/2602.21760
作者: Euisoo Jung,Byunghyun Kim,Hyunjin Kim,Seonghye Cho,Jae-Gil Lee
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves 2.31\times and 2.07\times latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX~3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at this https URL.

[CV-56] LiREC-Net: A Target-Free and Learning-Based Network for LiDAR RGB and Event Calibration CVPR2026

【速读】:该论文旨在解决多传感器融合系统中,如何在无需标定目标(target-free)的自然驾驶场景下实现高精度、多模态传感器联合标定的问题。现有基于学习的方法通常仅针对单一传感器对(如双模态设置),难以扩展至包含LiDAR、RGB图像和事件相机(event camera)等多模态协同标定的场景。其解决方案的关键在于提出LiREC-Net——一个统一框架下的目标自由学习标定网络,通过引入共享的LiDAR特征表示,融合其3D空间结构与投影深度图信息,从而提升跨模态一致性并减少冗余计算,显著提高了多模态联合标定的效率与精度,在KITTI和DSEC数据集上达到优于传统方法的性能,并为三模态标定建立了新的基准。

链接: https://arxiv.org/abs/2602.21754
作者: Aditya Ranjan Dash,Ramy Battrawy,René Schuster,Didier Stricker
机构: RPTU – University Kaiserslautern-Landau (莱茨大学); DFKI – German Research Center for Artificial Intelligence (德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026

点击查看摘要

Abstract:Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves competitive performance to bi-modal models and sets a new strong baseline for the tri-modal use case.

[CV-57] Enhancing Multi-Modal LLM s Reasoning via Difficulty-Aware Group Normalization

【速读】:该论文旨在解决多模态大语言模型在采用基于标准差(std)的归一化方法时所面临的稳定性问题,尤其针对极端奖励样本对训练过程的干扰。这类问题在纯文本大模型中相对可控,但在多模态场景下更为严重,因为感知误差和推理不确定性会共同影响模型输出。解决方案的关键在于提出一种难度感知的组归一化方法(Durian),其核心思想是根据样本的感知复杂度(通过视觉熵衡量)和推理不确定性(由模型置信度捕捉)来表征每个样本的难度,并据此将样本重新分组,仅在同难度组内共享标准差。这一机制既保留了Group Relative Policy Optimization (GRPO) 的组内区分能力,又有效降低了对极端奖励样本的敏感性,从而在多个多模态推理基准测试中显著提升性能。

链接: https://arxiv.org/abs/2602.21743
作者: Jinghan Li,Junfeng Fang,Jinda Lu,Yuan Wang,Xiaoyan Guo,Tianyu Zhang,Xiang Wang,Xiangnan He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO’s intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.

[CV-58] Structure-to-Image: Zero-Shot Depth Estimation in Colonoscopy via High-Fidelity Sim-to-Real Adaptation

【速读】:该论文旨在解决内窥镜场景下单目深度估计(Monocular Depth Estimation, MDE)因模拟数据与真实图像之间存在领域差异(domain gap)而导致性能下降的问题。现有图像到图像转换方法通常将深度作为后验约束,但在平衡真实感与结构一致性方面表现不足,常导致结构失真和高光伪影。其解决方案的关键在于提出一种“结构到图像”(Structure-to-Image)范式,将深度图从被动约束转变为主动生成基础,并首次引入相位一致性(phase congruency)用于结肠镜域适应,同时设计跨层级结构约束以协同优化几何结构与血管纹理等细粒度细节。该方法在公开的phantom数据集上实现了零样本评估中RMSE最大降低44.18%的显著效果。

链接: https://arxiv.org/abs/2602.21740
作者: Juan Yang,Yuyan Zhang,Han Jia,Bing Hu,Wanzhong Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: \c{opyright} 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Monocular depth estimation (MDE) for colonoscopy is hampered by the domain gap between simulated and real-world images. Existing image-to-image translation methods, which use depth as a posterior constraint, often produce structural distortions and specular highlights by failing to balance realism with structure consistency. To address this, we propose a Structure-to-Image paradigm that transforms the depth map from a passive constraint into an active generative foundation. We are the first to introduce phase congruency to colonoscopic domain adaptation and design a cross-level structure constraint to co-optimize geometric structures and fine-grained details like vascular textures. In zero-shot evaluations conducted on a publicly available phantom dataset, the MDE model that was fine-tuned on our generated data achieved a maximum reduction of 44.18% in RMSE compared to competing methods. Our code is available at this https URL.

[CV-59] SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning

【速读】:该论文旨在解决大规模体积医学影像数据因来自不同厂商和设备而导致的分辨率、切片厚度及每例扫描切片数量高度不一致的问题,这种异质性使得传统训练表示模型时需对z轴进行裁剪或插值以获得固定尺寸块,从而不可避免地造成信息丢失。解决方案的关键在于摒弃传统的绝对位置嵌入(absolute position embeddings),将体积数据视为一系列3D块的序列,并引入旋转位置嵌入(Rotary Position Embeddings, RoPE),使z轴可被建模为无约束的时间维度;在此基础上,作者提出SigVLP视觉-语言模型,其核心创新是将RoPE直接嵌入注意力机制中,动态生成输入条件相关的正弦与余弦权重,确保查询与键投影的一致对齐并适应任意输入尺寸。此外,通过在训练阶段以块为单位采样CT图像并配合同步器官级文本描述,实现细粒度监督,强化文本与体积表征间的关联,提升文本到图像的对齐精度。

链接: https://arxiv.org/abs/2602.21735
作者: Jiayi Wang,Hadrien Reynaud,Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Bjoern Menze,Bernhard Kainz
机构: Friedrich-Alexander University Erlangen-Nürnberg (弗里德里希-亚历山大埃尔朗根-纽伦堡大学); University of Zurich (苏黎世大学); ETH AI Center (ETH人工智能中心); ETH Zurich (苏黎世联邦理工学院); Istanbul Medipol University (伊斯坦布尔梅迪波尔大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimensions. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input sizes. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.

[CV-60] ranX-Adapter: Bridging Artifacts and Semantics within MLLM s for Robust AI-generated Image Detection

【速读】:该论文旨在解决生成式 AI (Generative AI) 图像检测中因纹理级伪影特征(artifact features)内部相似性过高导致注意力稀释的问题,从而影响语义特征与伪影特征的有效融合。解决方案的关键在于提出一种轻量级融合适配器 TranX-Adapter,其核心创新包括:1)任务感知的最优传输融合机制(Task-aware Optimal-Transport Fusion),利用伪影与语义预测概率之间的 Jensen-Shannon 散度构建代价矩阵,实现伪影信息向语义特征的高效迁移;2)X-Fusion 模块通过交叉注意力机制实现语义信息向伪影特征的反向传递,从而增强多模态大语言模型(MLLMs)在 AIGI 检测任务中的判别能力。实验表明,该方法在多个先进 MLLMs 上均带来显著且一致的性能提升(最高达 +6% 准确率)。

链接: https://arxiv.org/abs/2602.21716
作者: Wenbin Wang,Yuge Huang,Jianqing Xu,Yue Yu,Jiangtao Yan,Shouhong Ding,Pan Zhou,Yong Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rapid advances in AI-generated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, thereby hindering effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments on standard AIGI detection benchmarks upon several advanced MLLMs, show that our TranX-Adapter brings consistent and significant improvements (up to +6% accuracy).

[CV-61] Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling

【速读】:该论文针对牙科图像分割中传统图像编码器依赖固定分辨率特征图导致的分割不连续及目标区域与背景区分度差的问题,以及基于Transformer的自注意力机制因二次计算复杂度(O(n²))在高分辨率牙科图像上效率低下的问题,提出了一种三阶段编码器架构。其解决方案的关键在于:首先构建具有层次化特征表示的编码器以捕获尺度自适应信息;其次通过跨尺度特征融合联合利用低层细节与高层语义,有效保留精细结构信息并维持强上下文感知能力;最后引入双向序列建模策略,在不显著增加计算开销的前提下增强全局空间上下文理解能力。实验表明,该方法在OralVision数据集上mIoU提升1.1%,优于现有方法。

链接: https://arxiv.org/abs/2602.21712
作者: Xinxin Zhao,Jian Jiang,Yan Tian,Liqin Wu,Zhaocheng Xu,Teddy Yang,Yunuo Zou,Xun Wang
机构: Zhejiang Gongshang University (浙江工商大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Pattern Recognition

点击查看摘要

Abstract:Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders relying on fixed-resolution feature maps often lead to discontinuous segmentation and poor discrimination between target regions and background, due to insufficient modeling of environmental and global context. Moreover, transformer-based self-attention introduces substantial computational overhead because of its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model effectively preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy is incorporated to enhance global spatial context understanding without incurring high computational cost. We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU). Comments: Accepted by Pattern Recognition Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.21712 [cs.CV] (or arXiv:2602.21712v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.21712 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Xinxin Zhao, Jian Jiang, Yan Tian, Liqin Wu, Zhaocheng Xu, Wei-fa Yang, Yunuo Zou, Xun Wang. Innovative tooth segmentation using hierarchical features and bidirectional sequence modeling[J]. Pattern Recognition, 2026, 175:113045 Related DOI: https://doi.org/10.1016/j.patcog.2026.113045 Focus to learn more DOI(s) linking to related resources

[CV-62] Assessing airborne laser scanning and aerial photogrammetry for deep learning-based stand delineation

【速读】:该论文旨在解决森林林分(forest stand)自动划分中依赖人工且主观性强的问题,特别是在多源遥感数据存在时间错位(temporal misalignment)时难以实现规模化应用的挑战。其关键解决方案是采用基于U-Net的语义分割框架,结合多光谱航空影像与不同来源的冠层高度模型(CHM),包括机载激光扫描(ALS)衍生的CHM、数字摄影测量(DAP)衍生的CHM,以及DAP-CHM与数字地形模型(DTM)融合的数据组合,在挪威东南部六个市镇进行跨区域验证。结果表明,尽管DAP-CHM在结构细节上有所损失,但其性能与ALS-CHM相当,且加入DTM未带来显著提升,说明该框架对输入数据变化具有鲁棒性,从而支持利用时空对齐的DAP点云构建大规模深度学习训练数据集。

链接: https://arxiv.org/abs/2602.21709
作者: Håkon Næss Sandum,Hans Ole Ørka,Oliver Tomic,Terje Gobakken
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Accurate forest stand delineation is essential for forest inventory and management but remains a largely manual and subjective process. A recent study has shown that deep learning can produce stand delineations comparable to expert interpreters when combining aerial imagery and airborne laser scanning (ALS) data. However, temporal misalignment between data sources limits operational scalability. Canopy height models (CHMs) derived from digital photogrammetry (DAP) offer better temporal alignment but may smoothen canopy surface and canopy gaps, raising the question of whether they can reliably replace ALS-derived CHMs. Similarly, the inclusion of a digital terrain model (DTM) has been suggested to improve delineation performance, but has remained untested in published literature. Using expert-delineated forest stands as reference data, we assessed a U-Net-based semantic segmentation framework with municipality-level cross-validation across six municipalities in southeastern Norway. We compared multispectral aerial imagery combined with (i) an ALS-derived CHM, (ii) a DAP-derived CHM, and (iii) a DAP-derived CHM in combination with a DTM. Results showed comparable performance across all data combinations, reaching overall accuracy values between 0.90-0.91. Agreement between model predictions was substantially larger than agreement with the reference data, highlighting both model consistency and the inherent subjectivity of stand delineation. The similar performance of DAP-CHMs, despite the reduced structural detail, and the lack of improvements of the DTM indicate that the framework is resilient to variations in input data. These findings indicate that large datasets for deep learning-based stand delineations can be assembled using projects including temporally aligned ALS data and DAP point clouds.

[CV-63] SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

【速读】:该论文旨在解决腹腔镜手术中安全操作区域(Go Zone)识别的挑战,尤其针对术中认知负荷高、需结合手术阶段、视觉线索与解剖背景进行动态推理的问题。现有AI系统多采用静态或二元安全判定,忽视了手术阶段依赖性的本质特征。其解决方案的关键在于提出ResGo基准数据集和SurGo-R1模型:ResGo通过标注Go Zone边界框及外科医生撰写的分阶段推理理由(涵盖阶段识别、暴露质量、下一步动作与风险提示),构建了具有临床语境的多维评估体系;SurGo-R1则采用“先阶段后Go”的多轮强化学习人类反馈(RLHF)优化架构,使模型在识别手术阶段后生成条件化的推理与Go Zone坐标,从而显著提升跨任务泛化能力,在未见手术流程中实现76.6%阶段准确率、32.7 mIoU和54.8%硬性准确率,较主流通用视觉语言模型(VLMs)提升6.6倍。

链接: https://arxiv.org/abs/2602.21706
作者: Guanyi Qin,Xiaozhen Wang,Zhu Zhuo,Chang Han Low,Yuancan Xiao,Yibing Fu,Haofeng Liu,Kai Wang,Chunjiang Li,Yueming Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning. We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder. We introduce evaluation metrics that treat correct grounding under incorrect phase as failures, revealing that most vision-language models cannot handle such tasks and perform poorly. We then present SurGo-R1, a model optimized via RLHF with a multi-turn phase-then-go architecture where the model first identifies the surgical phase, then generates reasoning and Go Zone coordinates conditioned on that context. On unseen procedures, SurGo-R1 achieves 76.6% phase accuracy, 32.7 mIoU, and 54.8% hardcore accuracy, a 6.6 \times improvement over the mainstream generalist VLMs. Code, model and benchmark will be available at this https URL

[CV-64] Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models ICLR2026

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在视觉-语言任务中普遍存在的幻觉(hallucination)问题。通过深入分析LVLM的激活模式,作者发现真实性(truthfulness)与视觉感知能力主要依赖于模型架构中不同的注意力头子集,且真实性引导向量(truthfulness steering vectors)在不同语义上下文中存在显著差异。解决方案的关键在于提出一种无需训练的动态多模态激活引导方法(Dynamic Multimodal Activation Steering),其核心是构建基于语义的真实性引导向量数据库,并计算视觉感知引导向量,在推理阶段根据输入语义相似度动态选择最相关的引导向量,并将其施加到最具影响力的注意力头上,从而实现上下文感知的幻觉抑制。

链接: https://arxiv.org/abs/2602.21704
作者: Jianghao Yin,Qin Chen,Kedi Chen,Jie Zhou,Xingjiao Wu,Liang He
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.

[CV-65] Brain Tumor Segmentation with Special Emphasis on the Non-Enhancing Brain Tumor Compartment

【速读】:该论文旨在解决脑肿瘤分割中非增强肿瘤区域(non-enhancing tumor compartment)难以自动识别的问题。近年来,如MICCAI等脑肿瘤分割挑战赛已不再关注该区域,但研究表明其与患者生存时间及肿瘤潜在生长区域密切相关,因此准确自动分割该区域具有重要临床意义。解决方案的关键在于设计了一种基于U-Net的深度学习架构,专门针对多模态MRI图像中的非增强肿瘤部分进行精准分割,从而提升整体肿瘤边界刻画的完整性与临床可用性。

链接: https://arxiv.org/abs/2602.21703
作者: T. Schaffer,A. Brawanski,S. Wein,A. M. Tomé,E. W. Lang
机构: CIML Group, Biophysics, University of Regensburg (雷根斯堡大学); Department of Neurosurgery, University Hospital Regensburg (雷根斯堡大学医院); Department of Biomedical Imaging, University Hospital Regensburg (雷根斯堡大学医院); DETI, IEETA, Universidade de Aveiro (阿维罗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A U-Net based deep learning architecture is designed to segment brain tumors as they appear on various MRI modalities. Special emphasis is lent to the non-enhancing tumor compartment. The latter has not been considered anymore in recent brain tumor segmentation challenges like the MICCAI challenges. However, it is considered to be indicative of the survival time of the patient as well as of areas of further tumor growth. Hence it deems essential to have means to automatically delineate its extension within the tumor.

[CV-66] SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR

【速读】:该论文旨在解决单模态方法在场景流估计(scene flow estimation)中精度不足与鲁棒性差的问题,尤其是图像或LiDAR单一模态难以全面捕捉动态场景中的复杂运动信息。其解决方案的关键在于提出一种端到端的深度学习架构SF3D-RGB,该架构融合2D单目图像与3D点云(如LiDAR获取)两种模态的信息:首先分别编码各模态特征并进行融合,随后利用融合特征增强图匹配模块以计算更可靠的初始映射矩阵,进而生成初始场景流;最后通过残差场景流模块对初始结果进行精细化修正。该设计在保证高精度的同时提升了效率,且在真实数据集上优于单模态方法,并以更少参数实现优于现有融合方法的性能。

链接: https://arxiv.org/abs/2602.21699
作者: Rajai Alhimdiat,Ramy Battrawy,René Schuster,Didier Stricker,Wesam Ashour
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in Computer Vision Conference (CVC) 2026

点击查看摘要

Abstract:Scene flow estimation is an extremely important task in computer vision to support the perception of dynamic changes in the scene. For robust scene flow, learning-based approaches have recently achieved impressive results using either image-based or LiDAR-based modalities. However, these methods have tended to focus on the use of a single modality. To tackle these problems, we present a deep learning architecture, SF3D-RGB, that enables sparse scene flow estimation using 2D monocular images and 3D point clouds (e.g., acquired by LiDAR) as inputs. Our architecture is an end-to-end model that first encodes information from each modality into features and fuses them together. Then, the fused features enhance a graph matching module for better and more robust mapping matrix computation to generate an initial scene flow. Finally, a residual scene flow module further refines the initial scene flow. Our model is designed to strike a balance between accuracy and efficiency. Furthermore, experiments show that our proposed method outperforms single-modality methods and achieves better scene flow accuracy on real-world datasets while using fewer parameters compared to other state-of-the-art methods with fusion.

[CV-67] E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought CVPR2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 在生成电商海报时缺乏针对中文内容的自动化质量评估方法的问题,尤其关注现有模型在处理复杂汉字时产生的细微但关键的文本伪影(textual artifacts)时表现不足。解决方案的关键在于构建首个面向中文电商海报的多维评分数据集 E-comIQ-18k,其中包含专家校准的 Chain of Thought (CoT) 理由,并基于此训练出与人类专家判断高度一致的专用评估模型 E-comIQ-M,从而实现可扩展、自动化的中文电商海报质量评估基准 E-comIQ-Bench。

链接: https://arxiv.org/abs/2602.21698
作者: Meiqi Sun,Mingyu Li,Junxiong Zhu
机构: Taobao & Tmall Group, Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21pages, 19figures, accepted by CVPR 2026

点击查看摘要

Abstract:Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic esthetics or low level distortions and lack the functional criteria required for e-commerce design. It is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that are overlooked by existing methods. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build the first dataset E-comIQ-18k to feature multi dimensional scores and expert calibrated Chain of Thought (CoT) rationales. Using this dataset, we train E-comIQ-M, a specialized evaluation model that aligns with human expert judgment. Our framework enables E-comIQ-Bench, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show our E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this this http URL will be available at this https URL.

[CV-68] Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

【速读】:该论文旨在解决动态场景长期预测中的核心挑战,即在有限观测条件下难以捕捉一致的物体级运动与长时间序列的时序演化。其解决方案的关键在于提出基于4D高斯点渲染(4D Gaussian Splatting)表示的Motion Group-aware Gaussian Forecasting(MoGaF)框架,通过引入运动感知的高斯分组(motion-aware Gaussian grouping)和组内优化策略(group-wise optimization),在刚性与非刚性区域均实现物理一致性的运动约束,从而构建空间上连贯的动态场景表示;在此基础上,进一步设计轻量级预测模块以推演未来运动,实现真实且时序稳定的场景外推。

链接: https://arxiv.org/abs/2602.21668
作者: Junmyeong Lee,Hoseung Choi,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH); RLWRLD
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 20 pages, 13 figures

点击查看摘要

Abstract:Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Our project page is available at this https URL

[CV-69] Send Less Perceive More: Masked Quantized Point Cloud Communication for Loss-Tolerant Collaborative Perception

【速读】:该论文旨在解决协同感知(Collaborative Perception)中因带宽限制和随机传输丢包导致的感知精度下降问题。现有方法在严格带宽约束下难以维持高精度,且对通信丢包敏感。其解决方案的关键在于提出 QPoint2Comm 框架:通过使用共享码本直接传输量化点云索引(quantized point-cloud indices),而非中间特征,显著降低带宽需求并保留高保真 3D 信息;同时引入掩码训练策略模拟随机丢包,提升模型在严重传输故障下的鲁棒性,并设计级联注意力融合模块优化多车信息整合。

链接: https://arxiv.org/abs/2602.21667
作者: Sheng Xu,Enshu Wang,Hongfei Xue,Jian Teng,Bingyi Liu,Yi Zhu,Pu Wang,Libing Wu,Chunming Qiao
机构: Wuhan University (武汉大学); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校); Wuhan University Of Technology (武汉理工大学); Wayne State University (韦恩州立大学); University at Buffalo (水牛城大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Collaborative perception allows connected vehicles to overcome occlusions and limited viewpoints by sharing sensory information. However, existing approaches struggle to achieve high accuracy under strict bandwidth constraints and remain highly vulnerable to random transmission packet loss. We introduce QPoint2Comm, a quantized point-cloud communication framework that dramatically reduces bandwidth while preserving high-fidelity 3D information. Instead of transmitting intermediate features, QPoint2Comm directly communicates quantized point-cloud indices using a shared codebook, enabling efficient reconstruction with lower bandwidth than feature-based methods. To ensure robustness to possible communication packet loss, we employ a masked training strategy that simulates random packet loss, allowing the model to maintain strong performance even under severe transmission failures. In addition, a cascade attention fusion module is proposed to enhance multi-vehicle information integration. Extensive experiments on both simulated and real-world datasets demonstrate that QPoint2Comm sets a new state of the art in accuracy, communication efficiency, and resilience to packet loss.

[CV-70] HybridINR-PCGC: Hybrid Lossless Point Cloud Geometry Compression Bridging Pretrained Model and Implicit Neural Representation

【速读】:该论文旨在解决基于预训练模型的点云压缩方法在训练数据依赖性上的局限性,以及隐式神经表示(Implicit Neural Representation, INR)方法因在线训练耗时和比特流开销过大而导致的效率问题。其解决方案的关键在于提出一种混合框架 HybridINR-PCGC,该框架融合了预训练先验网络(Pretrained Prior Network, PPN)与分布无关的精修模块(Distribution Agnostic Refiner, DAR),通过PPN快速生成稳定先验以加速DAR收敛,并仅将DAR中的增强层编码进比特流,同时引入监督式模型压缩模块进一步优化增强层参数的比特率。此设计既保留了INR的分布无关特性,又显著提升了压缩效率和编码速度。

链接: https://arxiv.org/abs/2602.21662
作者: Wenjie Huang,Qi Yang,Shuting Xia,He Huang,Zhu Li,Yiling Xu
机构: Shanghai Jiao Tong University (上海交通大学); University of Missouri-Kansas City (密苏里大学堪萨斯城分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 10 figures

点击查看摘要

Abstract:Learning-based point cloud compression presents superior performance to handcrafted codecs. However, pretrained-based methods, which are based on end-to-end training and expected to generalize to all the potential samples, suffer from training data dependency. Implicit neural representation (INR) based methods are distribution-agnostic and more robust, but they require time-consuming online training and suffer from the bitstream overhead from the overfitted model. To address these limitations, we propose HybridINR-PCGC, a novel hybrid framework that bridges the pretrained model and INR. Our framework retains distribution-agnostic properties while leveraging a pretrained network to accelerate convergence and reduce model overhead, which consists of two parts: the Pretrained Prior Network (PPN) and the Distribution Agnostic Refiner (DAR). We leverage the PPN, designed for fast inference and stable performance, to generate a robust prior for accelerating the DAR’s convergence. The DAR is decomposed into a base layer and an enhancement layer, and only the enhancement layer needed to be packed into the bitstream. Finally, we propose a supervised model compression module to further supervise and minimize the bitrate of the enhancement layer parameters. Based on experiment results, HybridINR-PCGC achieves a significantly improved compression rate and encoding efficiency. Specifically, our method achieves a Bpp reduction of approximately 20.43% compared to G-PCC on 8iVFB. In the challenging out-of-distribution scenario Cat1B, our method achieves a Bpp reduction of approximately 57.85% compared to UniPCGC. And our method exhibits a superior time-rate trade-off, achieving an average Bpp reduction of 15.193% relative to the LINR-PCGC on 8iVFB.

[CV-71] Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis

【速读】:该论文旨在解决当前计算机辅助诊断(CAD)系统在临床应用中面临的两大核心问题:一是缺乏与临床工作流程的无缝集成,导致诊断模型难以实现可靠且可解释的决策支持;二是放射科医生的决策模式与模型表征之间存在语义鸿沟,限制了人机协同的深度整合。解决方案的关键在于提出一种视觉认知引导的协作网络(VCC-Net),其创新性地将视觉认知(VC)作为空间认知引导,通过眼动追踪或鼠标操作等临床兼容接口捕获放射科医生的视觉搜索轨迹和注意力模式,并以此构建疾病感知的认知图谱(cognition-graph)。该图谱通过捕捉解剖区域间的依赖关系并校准模型表示与VC驱动特征的一致性,有效缓解了医生偏倚,实现了互补且透明的协同诊断机制。

链接: https://arxiv.org/abs/2602.21657
作者: Shaoxuan Wu,Jingkun Chen,Chong Ma,Cong Shen,Xiao Zhang,Jun Feng
机构: Northwest University (西北大学); University of Oxford (牛津大学); Southwest Jiaotong University (西南交通大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer-aided diagnosis (CAD) has significantly advanced automated chest X-ray diagnosis but remains isolated from clinical workflows and lacks reliable decision support and interpretability. Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists. However, the absence of interactive tools seamlessly embedded within diagnostic routines impedes collaboration, while the semantic gap between radiologists’ decision-making patterns and model representations further limits clinical adoption. To overcome these limitations, we propose a visual cognition-guided collaborative network (VCC-Net) to achieve the cooperative diagnostic paradigm. VCC-Net centers on visual cognition (VC) and employs clinically compatible interfaces, such as eye-tracking or the mouse, to capture radiologists’ visual search traces and attention patterns during diagnosis. VCC-Net employs VC as a spatial cognition guide, learning hierarchical visual search strategies to localize diagnostically key regions. A cognition-graph co-editing module subsequently integrates radiologist VC with model inference to construct a disease-aware graph. The module captures dependencies among anatomical regions and aligns model representations with VC-driven features, mitigating radiologist bias and facilitating complementary, transparent decision-making. Experiments on the public datasets SIIM-ACR, EGD-CXR, and self-constructed TB-Mouse dataset achieved classification accuracies of 88.40%, 85.05%, and 92.41%, respectively. The attention maps produced by VCC-Net exhibit strong concordance with radiologists’ gaze distributions, demonstrating a mutual reinforcement of radiologist and model inference. The code is available at this https URL.

[CV-72] CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning CVPR2026

【速读】:该论文旨在解决图像描述生成任务中依赖人类标注作为监督信号所导致的局限性问题,即人工标注存在主观性和不完整性,进而限制了模型性能提升。其核心解决方案是提出一种双奖励强化学习框架CCCaption,通过显式优化两个客观指标来生成更完整(Completeness)和更正确(Correctness)的图像描述:一方面利用多种大视觉语言模型(Large Vision-Language Models, LVLMs)将图像分解为多个视觉查询,并奖励覆盖更多查询的caption,同时引入动态查询采样策略提升训练效率;另一方面通过验证子描述查询的真实性来惩罚幻觉内容,从而保证描述的真实性。该对称的双奖励优化机制协同提升两个维度的质量,推动模型超越对人类标注的简单模仿,迈向更可靠的视觉-语言理解能力。

链接: https://arxiv.org/abs/2602.21655
作者: Zhijiang Tang,Linhua Wang,Jiaxin Qi,Weihao Jiang,Peng Hou,Anxiang Zeng,Jianqiang Huang
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Shopee Pte. Ltd. (虾皮)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accept by CVPR 2026

点击查看摘要

Abstract:Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate \textbfComplete and \textbfCorrect \textbfCaptions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.

[CV-73] Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle

【速读】:该论文旨在解决4D场景建模中如何同时准确捕捉空间结构与时间运动的问题,尤其是现有方法依赖平移位移(translational displacements)难以有效表示旋转和刚性/非刚性关节变换,导致运动不一致和物理上不合理的结果。其解决方案的关键在于提出LieFlow框架,通过在SE(3)李群(Lie group)空间内显式建模运动,统一学习平移与旋转,从而在几何空间中保持运动连续性和结构一致性,实现更物理合理且视觉逼真的动态场景重建。

链接: https://arxiv.org/abs/2602.21645
作者: Weidong Qiao,Wangmeng Zuo,Hui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10pages,5 figures

点击查看摘要

Abstract:Modeling 4D scenes requires capturing both spatial structure and temporal motion, which is challenging due to the need for physically consistent representations of complex rigid and non-rigid motions. Existing approaches mainly rely on translational displacements, which struggle to represent rotations, articulated transformations, often leading to spatial inconsistency and physically implausible motion. LieFlow, a dynamic radiance representation framework that explicitly models motion within the SE(3) Lie group, enabling coherent learning of translation and rotation in a unified geometric space. The SE(3) transformation field enforces physically inspired constraints to maintain motion continuity and geometric consistency. The evaluation includes a synthetic dataset with rigid-body trajectories and two real-world datasets capturing complex motion under natural lighting and occlusions. Across all datasets, LieFlow consistently improves view-synthesis fidelity, temporal coherence, and physical realism over NeRF-based baselines. These results confirm that SE(3)-based motion modeling offers a robust and physically grounded framework for representing dynamic 4D scenes.

[CV-74] CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

【速读】:该论文旨在解决现有基础模型在计算病理学中因依赖自然图像骨干网络而忽视组织形态异质性和非均匀性分布的问题,导致难以捕捉超越孤立切片的连贯组织结构,从而限制了模型的可解释性和临床相关性。其解决方案的关键在于提出Cross-modal Adaptive Region Encoder (CARE),一种专为病理学设计的基础模型,通过两阶段预训练策略实现:首先在无分割标注条件下,利用34,277张全切片图像(Whole Slide Images, WSIs)进行自监督单模态预训练以学习组织形态表征;其次借助RNA和蛋白组学数据进行跨模态对齐,引导构建生物学相关的自适应区域,从而识别出不规则但结构一致的组织区域并选择最具代表性的区域作为感兴趣区域(Region of Interest, ROI)。此方法显著提升了病理任务的泛化能力与临床实用性。

链接: https://arxiv.org/abs/2602.21637
作者: Di Zhang,Zhangpeng Gong,Xiaobo Pang,Jiashuai Liu,Junbo Lu,Hao Cui,Jiusong Ge,Zhi Zeng,Kai Yi,Yinghua Li,Si Liu,Tingsong Yu,Haoran Wang,Mireia Crispin-Ortuzar,eimiao Yu,Chen Li,Zeyu Gao
机构: Xi’an Jiaotong University (西安交通大学); University of Cambridge (剑桥大学); KingMed (金域医学); BGI Research (华大研究院); A⋆STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.

[CV-75] Axial-Centric Cross-Plane Attention for 3D Medical Image Classification MICCAI2026

【速读】:该论文旨在解决现有3D深度学习模型在处理医学图像时未能反映临床实践中“以轴向平面为主”的多平面解读流程的问题。当前方法通常对所有解剖平面(轴向、冠状面和矢状面)赋予同等重要性,或仅整体处理体积数据,忽略了医生实际诊断中以轴向为基准、辅以其他平面增强空间理解的不对称依赖关系。解决方案的关键在于提出一种轴向中心的跨平面注意力架构(axial-centric cross-plane attention architecture),其核心包括:利用预训练的MedDINOv3模型作为冻结特征提取器获取各平面特征;通过RICA块与平面内Transformer编码器捕获每个解剖平面内的位置和上下文信息;并采用轴向中心的跨平面Transformer编码器,将轴向特征有条件地融合来自辅助平面的互补信息,从而实现不对称的跨平面信息交互。实验表明,该设计显著提升了分类性能,并验证了轴向中心查询-键-值分配及方向性跨平面融合的重要性。

链接: https://arxiv.org/abs/2602.21636
作者: Doyoung Park,Jinsoo Kim,Lohendran Baskaran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to MICCAI 2026

点击查看摘要

Abstract:Clinicians commonly interpret three-dimensional (3D) medical images, such as computed tomography (CT) scans, using multiple anatomical planes rather than as a single volumetric representation. In this multi-planar approach, the axial plane typically serves as the primary acquisition and diagnostic reference, while the coronal and sagittal planes provide complementary spatial information to increase diagnostic confidence. However, many existing 3D deep learning methods either process volumetric data holistically or assign equal importance to all planes, failing to reflect the axial-centric clinical interpretation workflow. To address this gap, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that captures the inherent asymmetric dependencies between different anatomical planes. Our architecture incorporates MedDINOv3, a medical vision foundation model pretrained via self-supervised learning on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information within each anatomical plane, while axial-centric cross-plane transformer encoders condition axial features on complementary information from auxiliary planes. Experimental results on six datasets from the MedMNIST3D benchmark demonstrate that the proposed architecture consistently outperforms existing 3D and multi-plane models in terms of accuracy and AUC. Ablation studies further confirm the importance of axial-centric query-key-value allocation and directional cross-plane fusion. These results highlight the importance of aligning architectural design with clinical interpretation workflows for robust and data-efficient 3D medical image analysis.

[CV-76] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在物理动态理解上的局限性,以及强化学习中外部奖励信号与智能体内部状态脱节的问题。现有方法依赖统计数据先验或孤立的奖励机制,难以实现对环境物理规律的深层建模和自适应优化。解决方案的关键在于提出自校正VLA(Self-Correcting VLA, SC-VLA),其核心创新是通过稀疏想象(sparse imagination)内在引导动作精炼:首先设计稀疏世界想象模块,引入辅助预测头以估计当前任务进展和未来轨迹趋势,从而约束策略编码短期物理演化;其次引入在线动作精炼模块,基于预测的稀疏未来状态重塑进度相关密集奖励,动态调整轨迹方向。这一机制实现了无需外部奖励的自我改进能力,在仿真与真实机器人操作任务中显著提升性能,相较最优基线减少16%步数、提高9%成功率,并在真实场景中获得14%的性能增益。

链接: https://arxiv.org/abs/2602.21633
作者: Chenyv Liu,Wentao Tan,Lei Zhu,Fengling Li,Jingjing Li,Guoli Yang,Heng Tao Shen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent’s internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieve self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at this https URL.

[CV-77] UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

【速读】:该论文旨在解决4D手部运动建模中的两大难题:一是现有估计方法在手部遮挡或缺失时性能下降;二是生成方法难以有效利用多模态结构化输入进行运动补全。其解决方案的关键在于提出UniHand,一个基于扩散模型的统一框架,将估计与生成任务统一为条件运动合成问题。该框架通过联合变分自编码器(Joint Variational Autoencoder)将异构输入(如MANO参数、2D骨骼等)嵌入共享潜在空间,同时利用冻结的视觉主干网络和专用手部感知器从图像特征中提取手部特定线索,避免复杂的检测与裁剪流程,并借助潜在扩散模型从多样化条件中合成一致的手部运动序列,从而实现对遮挡和时间不完整输入的鲁棒建模。

链接: https://arxiv.org/abs/2602.21631
作者: Zhihao Sun,Tong Wu,Ruirui Tu,Daoguo Dong,Zuxuan Wu
机构: Fudan University (复旦大学); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.

[CV-78] okenizing Semantic Segmentation with RLE

【速读】:该论文旨在解决图像与视频中语义分割(semantic segmentation)任务的统一建模问题,尤其关注如何利用语言建模框架将分割掩码(mask)表示为离散标记序列,从而实现端到端的生成式分割方法。其解决方案的关键在于:首先采用游程编码(run length encoding, RLE)对分割掩码进行离散化处理,使掩码可被建模为标记序列;其次,基于Pix2Seq模型改进其架构以支持自回归生成RLE标记序列,并提出新颖的标记策略以压缩序列长度,提升在视频场景下的可行性;此外,通过在标记过程中嵌入实例信息,实现了全景分割(panoptic segmentation)。该方法在两个数据集上验证了其有效性,尽管受限于计算资源,仍达到与当前最先进方法相当的性能。

链接: https://arxiv.org/abs/2602.21627
作者: Abhineet Singh,Justin Rozeboom,Nilanjan Ray
机构: University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq \citep2s to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.

[CV-79] Virtual Biopsy for Intracranial Tumors Diagnosis on MRI

【速读】:该论文旨在解决深部脑肿瘤在功能敏感区域(eloquent brain regions)的病理诊断难题,传统依赖立体定向活检的方法存在出血和神经功能缺损风险,并受限于肿瘤空间异质性导致的取样偏差。为此,作者构建了首个公开的活检验证MRI数据集ICT-MRI(249例,四类病理),并提出“虚拟活检”(Virtual Biopsy)框架:其关键在于三阶段设计——MRI标准化处理器(MRI-Processor)、基于视觉-语言模型的粗到精定位器(Tumor-Localizer)实现弱监督下的精准病灶定位,以及引入掩码通道注意力机制(Masked Channel Attention)的自适应诊断器(Adaptive-Diagnoser),融合局部判别特征与全局上下文信息,显著提升非侵入性MRI病理预测准确率(>90%),优于基线模型超20%。

链接: https://arxiv.org/abs/2602.21613
作者: Xinzhe Luo,Shuai Shao,Yan Wang,Jiangtao Wang,Yutong Bai,Jianguo Zhang
机构: University of Science and Technology of China(中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep intracranial tumors situated in eloquent brain regions controlling vital functions present critical diagnostic challenges. Clinical practice has shifted toward stereotactic biopsy for pathological confirmation before treatment. Yet biopsy carries inherent risks of hemorrhage and neurological deficits and struggles with sampling bias due to tumor spatial heterogeneity, because pathological changes are typically region-selective rather than tumor-wide. Therefore, advancing non-invasive MRI-based pathology prediction is essential for holistic tumor assessment and modern clinical decision-making. The primary challenge lies in data scarcity: low tumor incidence requires long collection cycles, and annotation demands biopsy-verified pathology from neurosurgical experts. Additionally, tiny lesion volumes lacking segmentation masks cause critical features to be overwhelmed by background noise. To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories. We propose a Virtual Biopsy framework comprising: MRI-Processor for standardization; Tumor-Localizer employing vision-language models for coarse-to-fine localization via weak supervision; and Adaptive-Diagnoser with a Masked Channel Attention mechanism fusing local discriminative features with global contexts. Experiments demonstrate over 90% accuracy, outperforming baselines by more than 20%. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.21613 [cs.CV] (or arXiv:2602.21613v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.21613 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-80] Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control

【速读】:该论文旨在解决物理驱动类人机器人控制中因训练数据集难度分布固定而导致的策略性能瓶颈问题,以及依赖专业动捕系统获取高质量数据所带来的高成本与难以规模化的问题。其解决方案的关键在于提出一个闭环的自动化运动数据生成与迭代框架,能够基于物理指标和客观评估实现策略与数据的难度迭代优化,从而突破原始训练限制;该框架可生成包含武术、舞蹈、格斗、体育、体操等多种语义动作的高质量运动数据,并在仅使用约1/10 AMASS数据量的情况下,使PHC单基元追踪器在2201段测试片段上的平均失败率降低45%。

链接: https://arxiv.org/abs/2602.21599
作者: Weisheng Xu,Qiwei Wu,Jiaxi Zhang,Tan Jing,Yangfan Li,Yuetong Fang,Jiaqi Xiong,Kai Wu,Rong Ou,Renjing Xu
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); University of Oxford (牛津大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop automated motion data generation and iterative framework. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.

[CV-81] A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers ICLR2026

【速读】:该论文旨在解决扩散Transformer(Diffusion Transformers)中条件嵌入(conditional embeddings)结构不明确的问题,特别是其语义编码机制与冗余性缺乏系统理解。研究发现,类别条件嵌入在ImageNet-1K上表现出超过99%的角相似度,连续条件任务如姿态引导图像生成和视频到音频生成则超过99.9%,表明嵌入空间存在显著冗余;同时,语义信息集中于少数维度,头部维度承载主要信号,尾部维度贡献微弱。解决方案的关键在于通过剪枝低幅值维度(最多移除三分之二嵌入空间),在不损害甚至提升生成质量与保真度的前提下,揭示了Transformer扩散模型中的语义瓶颈(semantic bottleneck),为设计更高效的条件机制提供了理论依据和实践路径。

链接: https://arxiv.org/abs/2602.21596
作者: Trung X. Pham,Kang Zhang,Ji Woo Hong,Chang D. Yoo
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions–removing up to two-thirds of the embedding space–we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.

[CV-82] Breaking Semantic-Aware Watermarks via LLM -Guided Coherence-Preserving Semantic Injection

【速读】:该论文旨在解决当前内容感知型语义水印(content-aware semantic watermarking)在面对大语言模型(LLM)驱动的细粒度语义扰动时存在的安全脆弱性问题。现有方案通常将水印信号绑定到图像的高层语义特征上以增强鲁棒性,但忽略了LLM具备结构化推理能力,可实现局部语义微调而不破坏整体视觉一致性,从而绕过水印绑定机制。解决方案的关键在于提出一种协同保持一致性的语义注入攻击(Coherence-Preserving Semantic Injection, CSI),该方法通过在嵌入空间相似性约束下利用LLM引导的语义操作,精准扰动与水印相关的语义区域,在维持图像整体视觉一致性的前提下诱导检测器误分类,实验证明其显著优于现有攻击基线,揭示了当前语义水印设计的根本性安全缺陷。

链接: https://arxiv.org/abs/2602.21593
作者: Zheng Gao,Xiaoyu Li,Zhicheng Bao,Xiaoyan Feng,Jiaojiao Jiang
机构: University of New South Wales (新南威尔士大学); Griffith University (格里菲斯大学)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The Web Conference 2026 (Short Paper Track)

点击查看摘要

Abstract:Generative images have proliferated on Web platforms in social media and online copyright distribution scenarios, and semantic watermarking has increasingly been integrated into diffusion models to support reliable provenance tracking and forgery prevention for web content. Traditional noise-layer-based watermarking, however, remains vulnerable to inversion attacks that can recover embedded signals. To mitigate this, recent content-aware semantic watermarking schemes bind watermark signals to high-level image semantics, constraining local edits that would otherwise disrupt global coherence. Yet, large language models (LLMs) possess structured reasoning capabilities that enable targeted exploration of semantic spaces, allowing locally fine-grained but globally coherent semantic alterations that invalidate such bindings. To expose this overlooked vulnerability, we introduce a Coherence-Preserving Semantic Injection (CSI) attack that leverages LLM-guided semantic manipulation under embedding-space similarity constraints. This alignment enforces visual-semantic consistency while selectively perturbing watermark-relevant semantics, ultimately inducing detector misclassification. Extensive empirical results show that CSI consistently outperforms prevailing attack baselines against content-aware semantic watermarking, revealing a fundamental security weakness of current semantic watermark designs when confronted with LLM-driven semantic perturbations.

[CV-83] CADC: Content Adaptive Diffusion-Based Generative Image Compression CVPR2026

【速读】:该论文旨在解决扩散模型驱动的图像压缩方法在超低比特率下实现高质量重建时面临的三大关键限制:一是各向同性量化导致的量化失真无法适应图像内容的空间复杂度变化,造成与扩散模型噪声依赖先验的错位;二是高维噪声潜在空间与固定输入维度的解码器之间存在信息集中瓶颈,阻碍了语义信息在主通道中的自适应保留;三是现有文本条件策略要么引入显著的文本比特率开销,要么依赖通用、内容无关的提示词,难以提供高效的语义引导。解决方案的关键在于提出一种内容自适应的扩散图像编解码器,其核心创新包括:1)基于不确定性引导的自适应量化方法,通过学习空间不确定性图来对齐量化失真与内容特征;2)辅助解码器引导的信息集中方法,利用轻量级辅助解码器强制主潜在通道中内容感知的信息保留;3)无比特率代价的自适应文本条件方法,从辅助重建图像中提取内容感知的文本描述,实现零比特率成本的语义指导。

链接: https://arxiv.org/abs/2602.21591
作者: Xihua Sheng,Lingyu Zhu,Tianyu Zhang,Dong Liu,Shiqi Wang,Jing Wang
机构: City University of Hong Kong (香港城市大学); University of Science and Technology of China (中国科学技术大学); Central Media Technology Institute, Huawei (华为中央媒体技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026

点击查看摘要

Abstract:Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder’s representation and the decoder’s generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model’s noise-dependent prior. Second, the information concentration bottleneck – arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder’s fixed input – prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.

[CV-84] SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction

【速读】:该论文旨在解决多模态高精地图(HD map)预测中因相机与激光雷达(LiDAR)模态间不一致性导致的性能下降问题,尤其在低光照、遮挡或点云稀疏等退化条件下表现不佳。解决方案的关键在于提出一种子空间专家融合框架(SEFMAP),其核心思想是将鸟瞰图(BEV)特征显式解耦为四个语义子空间:LiDAR私有、图像私有、共享和交互子空间,并为每个子空间分配专用专家网络,从而保留模态特异性线索并捕捉跨模态共识;同时引入不确定性感知门控机制,在BEV单元级别自适应加权专家输出,通过预测方差抑制不可靠专家,并辅以使用平衡正则项防止专家坍塌;此外,进一步设计分布感知掩码策略,模拟模态缺失场景并施加专门化损失,促进私有、共享与交互专家在完整及掩码输入下的角色分化,显著提升模型鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2602.21589
作者: Haoxiang Fu,Lingfeng Zhang,Hao Li,Ruibing Hu,Zhengrong Li,Guanjing Liu,Zimu Tan,Long Chen,Hangjun Ye,Xiaoshuai Hao
机构: National University of Singapore; Shenzhen Tsinghua International Graduate School, Tsinghua University; Xiaomi EV; Chinese University of Hong Kong; The University of Manchester; Renmin University of China; Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEFMAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEFMAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAPprovides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.

[CV-85] MultiAnimate: Pose-Guided Image Animation Made Extensible

【速读】:该论文旨在解决多角色图像动画(multi-character image animation)中的身份混淆(identity confusion)与角色间不合理的遮挡(implausible occlusions)问题。现有基于扩散模型(diffusion-based methods)的方法主要局限于单角色场景,直接扩展至多角色时难以保持个体身份一致性并准确建模角色间的空间关系。其解决方案的关键在于提出一个可扩展的框架,核心包含两个创新模块:Identifier Assigner 和 Identifier Adapter,二者协同作用以捕捉每个角色的局部位置信息(per-person positional cues)和角色之间的相对空间关系(inter-person spatial relationships),并通过掩码驱动(mask-driven)机制实现灵活且可泛化的动画生成,即使训练时仅使用双角色数据,也能成功推广到更多角色的场景,同时兼容单角色动画任务。

链接: https://arxiv.org/abs/2602.21581
作者: Yingcheng Hu,Haowen Gong,Chuanguang Yang,Zhulin An,Yongjun Xu,Songhua Liu
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所人工智能安全重点实验室); ShanghaiTech University (上海科技大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components-Identifier Assigner and Identifier Adapter - which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

[CV-86] Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction CVPR2026

【速读】:该论文旨在解决单目3D场景理解中占用预测(occupancy prediction)性能受限的问题,尤其是现有方法对深度先验(depth priors)利用不足、难以有效融合三维几何信息(3D cues)的局限性。其核心挑战在于如何更高效地利用日益强大的视觉几何先验(visual geometry priors),如VGGT等模型提供的丰富3D结构信息,从而提升占用预测的准确性与泛化能力。解决方案的关键在于提出GPOcc框架:通过将表面点沿相机射线向内扩展生成体素样本,并以高斯基元(Gaussian primitives)表示这些样本,实现概率化的占用推理;同时设计无需训练的增量更新策略,将每帧高斯基元融合为统一全局表示,支持流式输入处理。该方法显著提升了mIoU指标,在保持更高效率的同时实现了优于现有最先进方法的性能表现。

链接: https://arxiv.org/abs/2602.21552
作者: Changqing Zhou,Yueru Luo,Changhao Chen
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65 \times faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at this https URL.

[CV-87] VasGuideNet: Vascular Topology-Guided Couinaud Liver Segmentation with Structural Contrastive Loss

【速读】:该论文旨在解决肝脏Couinaud分段中因缺乏血管拓扑结构建模而导致的边界模糊和泛化能力不足的问题(即:现有方法主要依赖图像强度与空间位置线索,未显式建模血管拓扑结构,导致在血管附近分割边界不清晰且跨解剖变异场景下性能受限)。其解决方案的关键在于提出VasGuideNet框架,该框架首次通过图卷积网络(Graph Convolutional Networks, GCNs)将骨架化血管、基于欧氏距离变换(Euclidean Distance Transform, EDT)的几何特征以及k近邻(k-Nearest Neighbor, kNN)连接性编码为拓扑特征,并利用交叉注意力融合模块将其注入3D编码器-解码器主干网络;同时引入结构对比损失(Structural Contrastive Loss, SCL)以增强类别间可分性和解剖一致性,从而显著提升分割精度与鲁棒性。

链接: https://arxiv.org/abs/2602.21539
作者: Chaojie Shen,Jingjun Gu,Zihao Zhao,Ruocheng Li,Cunyuan Yang,Jiajun Bu,Lei Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate Couinaud liver segmentation is critical for preoperative surgical planning and tumor this http URL, existing methods primarily rely on image intensity and spatial location cues, without explicitly modeling vascular topology. As a result, they often produce indistinct boundaries near vessels and show limited generalization under anatomical this http URL propose VasGuideNet, the first Couinaud segmentation framework explicitly guided by vascular topology. Specifically, skeletonized vessels, Euclidean distance transform (EDT)–derived geometry, and k-nearest neighbor (kNN) connectivity are encoded into topology features using Graph Convolutional Networks (GCNs). These features are then injected into a 3D encoder–decoder backbone via a cross-attention fusion module. To further improve inter-class separability and anatomical consistency, we introduce a Structural Contrastive Loss (SCL) with a global memory this http URL Task08_HepaticVessel and our private LASSD dataset, VasGuideNet achieves Dice scores of 83.68% and 76.65% with RVDs of 1.68 and 7.08, respectively. It consistently outperforms representative baselines including UNETR, Swin UNETR, and G-UNETR++, delivering higher Dice/mIoU and lower RVD across datasets, demonstrating its effectiveness for anatomically consistent segmentation. Code is available at this https URL.

[CV-88] IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model

【速读】:该论文旨在解决多模态磁共振成像(MRI)数据在跨站点或跨设备场景下存在的异质性问题,即传统回顾性MRI调和方法在不同模态间扩展性差且依赖于配对受试者数据的问题。其解决方案的关键在于提出一种统一的可逆层次流框架(Invertible Hierarchy Flow, IHF-Harmony),通过将翻译过程分解为可逆的特征变换,确保双射映射与无损重建,从而避免解剖结构失真;具体而言,该方法利用可逆层次流(IHF)进行分层减法耦合以逐步去除伪影相关特征,并结合伪影感知归一化(Artefact-Aware Normalization, AAN)实现基于解剖固定特征的精准目标特性迁移,同时引入解剖与伪影一致性损失函数,显著提升调和结果的保真度和下游任务性能。

链接: https://arxiv.org/abs/2602.21536
作者: Pengli Zhu,Yitao Zhu,Haowen Pang,Anqi Qiu
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. Center for Brain-Inspired Intelligence, Tsinghua University (清华大学类脑智能研究中心); 4. Beijing Key Laboratory of Intelligent Perception and Image Understanding (北京市重点实验室:智能感知与图像理解)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code will be released upon acceptance.

[CV-89] Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction

【速读】:该论文旨在解决在稀疏视点条件下(尤其是户外场景)进行高质量三维场景重建的问题,其核心挑战在于极端有限的输入视角会导致直接使用扩散模型生成伪帧时引入不合理几何结构,从而损害最终重建质量。解决方案的关键在于提出一种双向伪帧恢复机制与场景感知高斯管理策略:首先通过基于扩散模型的伪帧恢复方法,在相邻帧引导下利用轻量级伪视图去模糊模型和置信度掩码推理算法填补缺失内容;其次设计场景感知的高斯管理策略,基于深度-密度联合信息优化高斯分布参数,从而显著提升重建完整性、抑制漂浮伪影并增强几何一致性。

链接: https://arxiv.org/abs/2602.21535
作者: Beizhen Zhao,Sicheng Yu,Guanzhi Ding,Yu Hu,Hao Wang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:3D scene reconstruction under unposed sparse viewpoints is a highly challenging yet practically important problem, especially in outdoor scenes due to complex lighting and scale variation. With extremely limited input views, directly utilizing diffusion model to synthesize pseudo frames will introduce unreasonable geometry, which will harm the final reconstruction quality. To address these issues, we propose a novel framework for sparse-view outdoor reconstruction that achieves high-quality results through bidirectional pseudo frame restoration and scene perception Gaussian management. Specifically, we introduce a bidirectional pseudo frame restoration method that restores missing content by diffusion-based synthesis guided by adjacent frames with a lightweight pseudo-view deblur model and confidence mask inference algorithm. Then we propose a scene perception Gaussian management strategy that optimize Gaussians based on joint depth-density information. These designs significantly enhance reconstruction completeness, suppress floating artifacts and improve overall geometric consistency under extreme view sparsity. Experiments on outdoor benchmarks demonstrate substantial gains over existing methods in both fidelity and stability.

[CV-90] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

【速读】:该论文旨在解决通用机器人在非结构化环境中执行长时程操作(long-horizon manipulation)任务时面临的挑战,即如何有效组合多个原子技能(atomic skills)并避免因环境敏感性导致的级联失败(cascading failures)。其核心解决方案是提出一种模块化框架LiLo-VLA(Linked Local VLA),通过解耦“移动”与“交互”两个阶段:由一个“到达模块”(Reaching Module)负责全局运动规划,而“交互模块”(Interaction Module)则采用以物体为中心的视觉-语言-动作(Vision-Language-Action, VLA)模型处理目标对象,从而提升对无关视觉特征的鲁棒性和空间配置的不变性。该设计支持动态重规划和技能复用,显著增强了系统在面对错误时的恢复能力,实现了零样本泛化至未见过的长时程任务。

链接: https://arxiv.org/abs/2602.21531
作者: Yue Yang,Shuo Cheng,Yu Fang,Homanga Bharadhwaj,Mingyu Ding,Gedas Bertasius,Daniel Szafir
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Georgia Institute of Technology (佐治亚理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: this https URL.

[CV-91] Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agent ic Learning

【速读】:该论文旨在解决医疗领域中AI代理(AI agent)在使用多个工具时因工具本身存在固有错误和输出矛盾而导致的可靠性问题。现有研究未能充分考虑工具在实际场景中的可信度,因而无法有效处理工具冲突。其解决方案的关键在于提出一个基于代理学习(agentic learning)的框架,使代理能够通过与工具交互并根据多模态查询类型实证性地学习各工具的实际可信度:当不同工具输出不一致时,代理会实验性地接受或拒绝结果,并依据奖励信号优化决策策略,从而动态确定每类查询下最值得信赖的工具。该方法显著提升了代理在胸部X光分析任务中的性能表现,且代码框架支持多轮多模态交互、单次并发调用多个工具及多图像处理,为医学领域的多模态强化学习研究提供了可扩展的技术基础。

链接: https://arxiv.org/abs/2602.21517
作者: Zheang Huai,Honglong Yang,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools’ realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.

[CV-92] WaterVIB: Learning Minimal Sufficient Watermark Representations via Variational Information Bottleneck

【速读】:该论文旨在解决生成式 AI (Generative AI) 攻击下数字水印(watermarking)的鲁棒性问题,特别是针对基于再生(regeneration-based)的 AIGC 攻击导致现有水印方法失效的问题。其解决方案的关键在于提出 WaterVIB 框架,通过变分信息瓶颈(Variational Information Bottleneck, VIB)将编码器重构为一个信息筛(information sieve),迫使模型学习消息的最小充分统计量(Minimal Sufficient Statistic),从而过滤掉易受生成净化影响的高频载体细节,仅保留对再生操作不变的鲁棒信号,理论上证明该优化过程是抵御分布偏移攻击的必要条件。

链接: https://arxiv.org/abs/2602.21508
作者: Haoyuan He,Yu Zheng,Jie Zhou,Jiwen Lu
机构: 未知
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 7 figures. Preprint

点击查看摘要

Abstract:Robust watermarking is critical for intellectual property protection, whereas existing methods face a severe vulnerability against regeneration-based AIGC attacks. We identify that existing methods fail because they entangle the watermark with high-frequency cover texture, which is susceptible to being rewritten during generative purification. To address this, we propose WaterVIB, a theoretically grounded framework that reformulates the encoder as an information sieve via the Variational Information Bottleneck. Instead of overfitting to fragile cover details, our approach forces the model to learn a Minimal Sufficient Statistic of the message. This effectively filters out redundant cover nuances prone to generative shifts, retaining only the essential signal invariant to regeneration. We theoretically prove that optimizing this bottleneck is a necessary condition for robustness against distribution-shifting attacks. Extensive experiments demonstrate that WaterVIB significantly outperforms state-of-the-art methods, achieving superior zero-shot resilience against unknown diffusion-based editing.

[CV-93] AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification AAAI2026

【速读】:该论文旨在解决同卵双胞胎人脸验证(Identical Twin Face Verification)这一极端细粒度识别难题,该任务中因遗传相似性极高导致现有主流人脸识别系统性能显著下降(准确率从99.8%骤降至88.9%),暴露出生物特征安全系统的严重漏洞。解决方案的关键在于提出一种名为不对称分层注意力网络(Asymmetric Hierarchical Attention Network, AHAN)的新架构,其核心创新包括:1)引入分层交叉注意力模块(Hierarchical Cross-Attention, HCA),实现对语义面部区域的多尺度分析与最优分辨率处理;2)设计面部不对称注意力模块(Facial Asymmetry Attention Module, FAAM),通过左右脸半部间的交叉注意力机制捕捉即使在双胞胎间也存在的细微不对称生物特征;3)提出仅用于训练阶段的孪生感知成对交叉注意力正则化策略(Twin-Aware Pair-Wise Cross-Attention, TA-PWCA),强制模型以自身孪生个体作为最难的干扰样本进行学习,从而提取真正区分个体的特征表示。实验表明,AHAN在ND_TWIN数据集上达到92.3%的准确率,相较现有最优方法提升3.4%。

链接: https://arxiv.org/abs/2602.21503
作者: Hoang-Nhat Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Identical twin face verification represents an extreme fine-grained recognition challenge where even state-of-the-art systems fail due to overwhelming genetic similarity. Current face recognition methods achieve over 99.8% accuracy on standard benchmarks but drop dramatically to 88.9% when distinguishing identical twins, exposing critical vulnerabilities in biometric security systems. The difficulty lies in learning features that capture subtle, non-genetic variations that uniquely identify individuals. We propose the Asymmetric Hierarchical Attention Network (AHAN), a novel architecture specifically designed for this challenge through multi-granularity facial analysis. AHAN introduces a Hierarchical Cross-Attention (HCA) module that performs multi-scale analysis on semantic facial regions, enabling specialized processing at optimal resolutions. We further propose a Facial Asymmetry Attention Module (FAAM) that learns unique biometric signatures by computing cross-attention between left and right facial halves, capturing subtle asymmetric patterns that differ even between twins. To ensure the network learns truly individuating features, we introduce Twin-Aware Pair-Wise Cross-Attention (TA-PWCA), a training-only regularization strategy that uses each subject’s own twin as the hardest possible distractor. Extensive experiments on the ND_TWIN dataset demonstrate that AHAN achieves 92.3% twin verification accuracy, representing a 3.4% improvement over state-of-the-art methods.

[CV-94] Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow CVPR2026

【速读】:该论文旨在解决现有3D编辑方法依赖计算密集的逐场景迭代优化、导致多视角不一致的问题。其核心解决方案在于提出一种基于TRELLIS生成式骨干网络的全前馈3D编辑框架,关键创新包括:1)通过引入Voxel FlowEdit模块,在稀疏体素潜在空间中构建编辑驱动的流场,实现单次遍历即可获得全局几何一致性的3D形变;2)设计基于法向量引导的单到多视角生成模块作为外部外观先验,有效恢复高频纹理细节,从而克服压缩3D特征带来的外观保真度瓶颈。

链接: https://arxiv.org/abs/2602.21499
作者: Shimin Hu,Yuanyi Wei,Fei Zha,Yudong Guo,Juyong Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore high-fidelity details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.

[CV-95] See It Say It Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs CVPR2026

【速读】:该论文旨在解决多模态大模型在链式推理(Chain-of-Thought, CoT)过程中因视觉幻觉传播而导致的错误累积问题。具体而言,当推理中间步骤与视觉证据不一致时,即使后续逻辑正确,也会导致最终答案错误。现有方法依赖强化学习(Reinforcement Learning, RL)训练模型“结合图像思考”,但存在成本高、模型依赖性强且难以跨架构泛化的问题。本文提出一种轻量级、无需训练的插件式框架,其核心在于:在测试阶段通过视觉证据对每一步推理进行监督,确保每个生成的词元均能由对应视觉线索支撑;通过构建文本-视觉证据池引导推理,并引入视觉决策模块动态从图像中提取当前推理上下文相关的额外证据,扩充证据池直至模型获得足够视觉置信度后终止推理并输出最终答案。

链接: https://arxiv.org/abs/2602.21497
作者: Yongchang Zhang,Xianzheng Ma,Tianyi Liu,Guangquan Zhou,Yang Chen
机构: Southeast University (东南大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026 Accepted

点击查看摘要

Abstract:Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps-even if logically valid-can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to “think with images” via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model’s reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

[CV-96] Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning

【速读】:该论文旨在解决3D目标检测在自动驾驶和机器人感知中对大规模人工标注数据的依赖问题,从而提升模型的可扩展性和适应性。当前无监督和稀疏监督方法面临伪标签质量低、特征挖掘不稳定以及缺乏统一训练框架等挑战。其解决方案的关键在于提出SPL(Semantic Pseudo-labeling and prototype Learning)框架,通过融合图像语义、点云几何与时间线索生成高质量伪标签,并将其作为概率先验引入一种多阶段原型学习策略;该策略利用基于记忆的初始化和动量更新机制稳定特征表示学习,从而有效从标注与未标注数据中挖掘特征,实现对无监督和稀疏监督场景下3D目标检测性能的显著提升。

链接: https://arxiv.org/abs/2602.21484
作者: Yushen He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D object detection is essential for autonomous driving and robotic perception, yet its reliance on large-scale manually annotated data limits scalability and adaptability. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. However, they face intertwined challenges: low-quality pseudo-labels, unstable feature mining, and a lack of a unified training framework. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection via Semantic Pseudo-labeling and prototype Learning. SPL first generates high-quality pseudo-labels by integrating image semantics, point cloud geometry, and temporal cues, producing both 3D bounding boxes for dense objects and 3D point labels for sparse ones. These pseudo-labels are not used directly but as probabilistic priors within a novel, multi-stage prototype learning strategy. This strategy stabilizes feature representation learning through memory-based initialization and momentum-based prototype updating, effectively mining features from both labeled and unlabeled data. Extensive experiments on KITTI and nuScenes datasets demonstrate that SPL significantly outperforms state-of-the-art methods in both settings. Our work provides a robust and generalizable solution for learning 3D object detectors with minimal or no manual annotations.

[CV-97] Automatic Map Density Selection for Locally-Performant Visual Place Recognition

【速读】:该论文旨在解决视觉位置识别(Visual Place Recognition, VPR)系统在长期部署中难以保证局部性能达标的问题,即如何确保在用户指定的环境比例范围内满足预设的局部召回率(Local Recall@1)要求,而非仅依赖全局平均性能指标。其解决方案的关键在于提出一种动态VPR映射方法,通过利用目标环境中多对参考遍历数据,在不同地图密度下分析匹配模式,并建立模型预测达到用户定义的两个目标所需的最优密度:(1)局部召回率@1的目标水平;(2)该性能需覆盖的环境比例,称为召回达成率(Recall Achievement Rate, RAR)。该方法基于假设——参考遍历间的匹配模式可建模并用于预测未见部署数据下的最佳密度配置,从而实现精准、高效的本地化性能保障。

链接: https://arxiv.org/abs/2602.21473
作者: Somayeh Hussaini,Tobias Fischer,Michael Milford
机构: Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:A key challenge in translating Visual Place Recognition (VPR) from the lab to long-term deployment is ensuring a priori that a system can meet user-specified performance requirements across different parts of an environment, rather than just on average globally. A critical mechanism for controlling local VPR performance is the density of the reference mapping database, yet this factor is largely neglected in existing work, where benchmark datasets with fixed, engineering-driven (sensors, storage, GPS frequency) sampling densities are typically used. In this paper, we propose a dynamic VPR mapping approach that uses pairs of reference traverses from the target environment to automatically select an appropriate map density to satisfy two user-defined requirements: (1) a target Local Recall@1 level, and (2) the proportion of the operational environment over which this requirement must be met or exceeded, which we term the Recall Achievement Rate (RAR). Our approach is based on the hypothesis that match patterns between multiple reference traverses, evaluated across different map densities, can be modelled to predict the density required to meet these performance targets on unseen deployment data. Through extensive experiments across multiple VPR methods and the Nordland and Oxford RobotCar benchmarks, we show that our system consistently achieves or exceeds the specified local recall level over at least the user-specified proportion of the environment. Comparisons with alternative baselines demonstrate that our approach reliably selects the correct operating point in map density, avoiding unnecessary over-densification. Finally, ablation studies and analysis evaluate sensitivity to reference map choice and local space definitions, and reveal that conventional global Recall@1 is a poor predictor of the often more operationally meaningful RAR metric.

[CV-98] Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound

【速读】:该论文旨在解决深度学习模型在超声图像中进行甲状腺结节分割时对对抗性扰动的鲁棒性不足问题,尤其是针对不同域(空间域与频域)的对抗攻击。其关键解决方案是开发两种黑盒对抗攻击方法——结构化斑点增强攻击(SSAA)和频域超声攻击(FDUA),并评估三种推理阶段防御机制:随机预处理结合测试时增强、确定性输入去噪及基于一致性感知聚合的随机集成推理。实验表明,针对空间域扰动(SSAA),确定性去噪能显著恢复分割性能(DSC提升+0.10, p < 0.001),而频域扰动(FDUA)则未被任何防御手段有效缓解,揭示了超声成像模态下对抗鲁棒性评估的独特挑战。

链接: https://arxiv.org/abs/2602.21452
作者: Nicholas Dietrich,David McShannon
机构: Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada; Faculty of Engineering, McMaster University, Hamilton, Ontario, Canada
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Introduction: Deep learning-based segmentation models are increasingly integrated into clinical imaging workflows, yet their robustness to adversarial perturbations remains incompletely characterized, particularly for ultrasound images. We evaluated adversarial attacks and inference-time defenses for thyroid nodule segmentation in B-mode ultrasound. Methods: Two black-box adversarial attacks were developed: (1) Structured Speckle Amplification Attack (SSAA), which injects boundary-targeted noise, and (2) Frequency-Domain Ultrasound Attack (FDUA), which applies bandpass-filtered phase perturbations in the Fourier domain. Three inference-time mitigations were evaluated on adversarial images: randomized preprocessing with test-time augmentation, deterministic input denoising, and stochastic ensemble inference with consistency-aware aggregation. Experiments were conducted on a U-Net segmentation model trained on cine-clips from a database of 192 thyroid nodules. Results: The baseline model achieved a mean Dice similarity coefficient (DSC) of 0.76 (SD 0.20) on unperturbed images. SSAA reduced DSC by 0.29 (SD 0.20) while maintaining high visual similarity (SSIM = 0.94). FDUA resulted in a smaller DSC reduction of 0.11 (SD 0.09) with lower visual fidelity (SSIM = 0.82). Against SSAA, all three defenses significantly improved DSC after correction, with deterministic denoising showing the largest recovery (+0.10, p 0.001), followed by randomized preprocessing (+0.09, p 0.001), and stochastic ensemble inference (+0.08, p = 0.002). No defense achieved statistically significant improvement against FDUA. Conclusion: Spatial-domain adversarial perturbations in ultrasound segmentation showed partial mitigation with input preprocessing, whereas frequency-domain perturbations were not mitigated by the defenses, highlighting modality-specific challenges in adversarial robustness evaluation.

[CV-99] Causal Decoding for Hallucination-Resistant Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-语言任务中普遍存在对象幻觉(object hallucination)的问题,即模型在生成内容时引入图像中并不存在的对象,从而降低输出的可靠性。解决方案的关键在于提出一种因果解码框架(causal decoding framework),通过在生成过程中施加针对性的因果干预,重塑解码动态以削弱虚假依赖关系,从而有效减少错误对象标记的出现,同时保持描述质量不受影响。

链接: https://arxiv.org/abs/2602.21441
作者: Shiwei Tan,Hengyi Wang,Weiyi Qin,Qi Xu,Zhigang Hua,Hao Wang
机构: Rutgers University (罗格斯大学); Meta Ranking AI (Meta 排名人工智能)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in Transactions on Machine Learning Research (TMLR), 2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) deliver detailed responses on vision-language tasks, yet remain susceptible to object hallucination (introducing objects not present in the image), undermining reliability in practice. Prior efforts often rely on heuristic penalties, post-hoc correction, or generic decoding tweaks, which do not directly intervene in the mechanisms that trigger object hallucination and thus yield limited gains. To address this challenge, we propose a causal decoding framework that applies targeted causal interventions during generation to curb spurious object mentions. By reshaping the decoding dynamics to attenuate spurious dependencies, our approach reduces false object tokens while maintaining descriptive quality. Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.

[CV-100] Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking ICLR

【速读】:该论文旨在解决统一视觉-语言模型(Unified Vision-Language Models, UVLMs)中理解与生成能力被视作并行技能而非协同过程的问题,从而限制了模型在多模态任务中的整体性能。解决方案的关键在于提出一种称为“分析-草稿”(Analyzing-Drafting, AD-Loop)的交错式问题求解循环机制,通过动态交替进行文本思考与视觉思考,实现对理解与输出的迭代优化,从而促进真正的协同效应。该方法结合两阶段训练策略——首先基于交错思维数据的监督学习初始化交替行为,再通过强化学习实现自适应控制,实验证明其在多个标准基准上均显著提升理解与生成性能,并具备良好的跨架构迁移能力。

链接: https://arxiv.org/abs/2602.21435
作者: Shengqiong Wu,Bobo Li,Xinkai Wang,Xiangtai Li,Lei Cui,Furu Wei,Shuicheng Yan,Hao Fei,Tat-seng Chua
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 17 figures, 6 tables, ICLR conference

点击查看摘要

Abstract:Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new think paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLMs architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at this https URL.

[CV-101] PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

【速读】:该论文旨在解决医学视觉语言模型(Medical Vision Language Models, VLMs)在面对语义不变的提问重述时产生不一致回答的问题,即“重述敏感性失效”(Paraphrase Sensitivity Failure, PSF),这可能带来临床部署风险。解决方案的关键在于:首先构建了包含19,748个胸部X光图像问题及其约92,000个语义保持的改写版本的大规模基准数据集PSF-Med;其次通过分析发现低翻转率并不等同于良好的视觉 grounding,部分模型即使移除图像仍保持一致性,说明其依赖文本先验;进一步利用GemmaScope 2稀疏自动编码器(Sparse Autoencoders, SAEs)对MedGemma 4B进行特征解析,识别出第17层的一个稀疏特征与提示框架相关,并能预测决策边界的变化;最后通过因果打孔实验验证该特征对翻转行为具有因果影响,且在推理阶段对该特征施加约束可使翻转率降低31%,同时仅损失1.3个百分点的准确率并减少对文本先验的依赖,从而提出一种兼顾稳定性与视觉依赖性的评估和改进路径。

链接: https://arxiv.org/abs/2602.21428
作者: Binesh Sadanandan,Vahid Behzadan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest Xray questions paired with about 92,000 meaningpreserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature’s contribution recovers 45% of the yesminus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.

[CV-102] Automating Timed Up and Go Phase Segmentation and Gait Analysis via the tugturn Markerless 3D Pipeline

【速读】:该论文旨在解决临床与科研中对仪器化定时起立-行走测试(Timed Up and Go, TUG)分析的自动化、可重复性和无标记(markerless)处理需求不足的问题。其解决方案的关键在于提出了一种基于Python的完整3D无标记TUG处理工作流——tugturn,该工作流融合了相位分割、步态事件检测、时空指标计算、节段间协调性分析及动态稳定性评估;通过空间阈值实现对站立、首次步行、转身、第二次步行和坐下五个阶段的自动划分,并采用相对距离策略在有效步态窗口内精确识别足跟触地与足尖离地事件;同时输出向量编码(Vector Coding)和基于外推质心(Extrapolated Center of Mass, XCoM)的指标,支持高保真、可复现的分析结果,包括HTML报告、CSV表格及质量控制可视化输出,配置文件采用TOML格式以保障流程标准化与可重现性。

链接: https://arxiv.org/abs/2602.21425
作者: Abel Gonçalves Chinaglia,Guilherme Manna Cesar,Paulo Roberto Pereira Santiago
机构: University of São Paulo (圣保罗大学); University of North Florida (北佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 2 figures, 1 pdf report, submitted to arXiv under cs.CV

点击查看摘要

Abstract:Instrumented Timed Up and Go (TUG) analysis can support clinical and research decision-making, but robust and reproducible markerless pipelines are still limited. We present \textitthis http URL, a Python-based workflow for 3D markerless TUG processing that combines phase segmentation, gait-event detection, spatiotemporal metrics, intersegmental coordination, and dynamic stability analysis. The pipeline uses spatial thresholds to segment each trial into stand, first gait, turning, second gait, and sit phases, and applies a relative-distance strategy to detect heel-strike and toe-off events within valid gait windows. In addition to conventional kinematics, \textittugturn provides Vector Coding outputs and Extrapolated Center of Mass (XCoM)-based metrics. The software is configured through TOML files and produces reproducible artifacts, including HTML reports, CSV tables, and quality-assurance visual outputs. A complete runnable example is provided with test data and command-line instructions. This manuscript describes the implementation, outputs, and reproducibility workflow of \textittugturn as a focused software contribution for markerless biomechanical TUG analysis.

[CV-103] ECHOSAT: Estimating Canopy Height Over Space And Time

【速读】:该论文旨在解决现有全球树高地图仅提供静态快照、无法捕捉森林动态变化的问题,这限制了碳核算的准确性。其关键解决方案是提出ECHOSAT,一个基于多源卫星数据训练的专用视觉Transformer模型,通过像素级时间回归实现10米分辨率的多年度一致树高制图;模型引入自监督生长损失函数,使预测结果符合自然树木发育规律(如随时间渐进增高及因火灾等扰动事件导致的骤降),从而首次实现了全球尺度上对树高增长与扰动的精准量化。

链接: https://arxiv.org/abs/2602.21421
作者: Jan Pauls,Karsten Schrödter,Sven Ligensa,Martin Schwartz,Berkant Turan,Max Zimmer,Sassan Saatchi,Sebastian Pokutta,Philippe Ciais,Fabian Gieseke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 12 figures, 6 tables

点击查看摘要

Abstract:Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi-sensor satellite data to train a specialized vision transformer model, which performs pixel-level temporal regression. A self-supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions. We also provide the first global-scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at this https URL.

[CV-104] WildSVG: Towards Reliable SVG Generation Under Real-Word Conditions

【速读】:该论文旨在解决从自然图像中提取可缩放矢量图形(Scalable Vector Graphics, SVG)的挑战,尤其是在存在噪声、杂乱背景和领域偏移等现实场景下,现有多模态模型性能显著下降的问题。其解决方案的关键在于构建首个系统性的基准测试平台——WildSVG Benchmark,该基准由两个互补的数据集组成:Natural WildSVG(基于真实图像中的公司标志及其SVG标注)和Synthetic WildSVG(将复杂SVG渲染嵌入真实场景以模拟困难条件),从而为SVG提取任务提供可靠的评估基础。通过在此基准上对前沿多模态模型进行评测,研究发现当前方法在实际应用中仍远未达到可靠水平,但迭代优化策略展现出良好的改进潜力,预示着未来模型能力将持续提升。

链接: https://arxiv.org/abs/2602.21416
作者: Marco Terral,Haotian Zhang,Tianyang Zhang,Meng Lin,Xiaoqing Xie,Haoran Dai,Darsh Kaushik,Pai Peng,Nicklas Scharpff,David Vazquez,Joan Rodriguez
机构: QuiverAI; Columbia University (哥伦比亚大学); Illinois Institute of Technology (伊利诺伊理工学院); Mila - Quebec Artificial Intelligence Institute (魁北克人工智能研究所); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); ServiceNow Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 pages of additional material

点击查看摘要

Abstract:We introduce the task of SVG extraction, which consists in translating specific visual inputs from an image into scalable vector graphics. Existing multimodal models achieve strong results when generating SVGs from clean renderings or textual descriptions, but they fall short in real-world scenarios where natural images introduce noise, clutter, and domain shifts. A central challenge in this direction is the lack of suitable benchmarks. To address this need, we introduce the WildSVG Benchmark, formed by two complementary datasets: Natural WildSVG, built from real images containing company logos paired with their SVG annotations, and Synthetic WildSVG, which blends complex SVG renderings into real scenes to simulate difficult conditions. Together, these resources provide the first foundation for systematic benchmarking SVG extraction. We benchmark state-of-the-art multimodal models and find that current approaches perform well below what is needed for reliable SVG extraction in real scenarios. Nonetheless, iterative refinement methods point to a promising path forward, and model capabilities are steadily improving

[CV-105] Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation ICRA2026

【速读】:该论文旨在解决传统时间动作分割(Temporal Action Segmentation, TAS)方法受限于封闭词汇和固定标签集的问题,提出开放词汇零样本时间动作分割(Open-Vocabulary Zero-Shot Temporal Action Segmentation, OVTAS)这一新任务,以应对视频中动作类别多样性和标注数据难以全面覆盖的挑战。解决方案的关键在于设计了一种无需训练的流水线:首先通过帧-动作嵌入相似性(Frame-Action Embedding Similarity, FAES)将视频帧与候选动作标签进行匹配,利用视觉-语言模型(Vision-Language Models, VLMs)的零样本能力实现跨模态对齐;随后通过相似度矩阵的时间一致性分割(Similarity-Matrix Temporal Segmentation, SMTS)保证动作边界在时间维度上的连续性与合理性,从而在不依赖特定任务标注的情况下实现高质量的动作片段划分。

链接: https://arxiv.org/abs/2602.21406
作者: Asim Unmesh,Kaki Ramesh,Mayank Patel,Rahul Jain,Karthik Ramani
机构: Purdue University (普渡大学); Birla Institute of Technology and Science (BITS) Hyderabad (比特学院海得拉巴分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2026

点击查看摘要

Abstract:Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.

[CV-106] FlowFixer: Towards Detail-Preserving Subject-Driven Generation

【速读】:该论文旨在解决主体驱动生成(Subject-Driven Generation, SDG)过程中因主体尺度和视角变化而导致的细节丢失问题,从而提升生成图像的细节保真度。其解决方案的关键在于提出FlowFixer框架,通过直接的图像到图像翻译(image-to-image translation)从视觉参考中恢复丢失的细粒度信息,避免了语言提示带来的歧义;同时引入一步去噪方案生成自监督训练数据,自动去除高频细节但保留全局结构,有效模拟真实SDG中的误差模式,并结合基于关键点匹配的评估指标,以更准确衡量超出语义相似性(如CLIP或DINO所测)的细节保真度。

链接: https://arxiv.org/abs/2602.21402
作者: Jinyoung Jun,Won-Dong Jang,Wenbin Ouyang,Raghudeep Gadde,Jungbeom Lee
机构: Amazon(亚马逊); Korea University(高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.

[CV-107] FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning CVPR2026

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据异构性导致的客户端漂移(client drift)问题,该问题会显著降低模型的整体泛化性能,尤其是当系统过度关注表现较差的客户端时更为严重。解决方案的关键在于提出FedVG框架,其核心创新是引入一个全局验证集(global validation set),该验证集可基于公开可用的数据集构建,从而在不泄露隐私的前提下为各客户端提供统一的评估基准。FedVG通过计算各客户端模型在不同层上的梯度范数(layerwise gradient norms),量化每个客户端对全局验证集上泛化能力的贡献,并据此生成客户端特定的调整评分,实现更智能、自适应的聚合策略,从而有效缓解客户端漂移并提升整体性能。

链接: https://arxiv.org/abs/2602.21399
作者: Alina Devkota,Jacob Thrasher,Donald Adjeroh,Binod Bhattarai,Prashnna K. Gyawali
机构: West Virginia University (西弗吉尼亚大学); University of Aberdeen (阿伯丁大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Findings Track)

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across multiple clients without sharing their private data. However, data heterogeneity across clients leads to client drift, which degrades the overall generalization performance of the model. This effect is further compounded by overemphasis on poorly performing clients. To address this problem, we propose FedVG, a novel gradient-based federated aggregation framework that leverages a global validation set to guide the optimization process. Such a global validation set can be established using readily available public datasets, ensuring accessibility and consistency across clients without compromising privacy. In contrast to conventional approaches that prioritize client dataset volume, FedVG assesses the generalization ability of client models by measuring the magnitude of validation gradients across layers. Specifically, we compute layerwise gradient norms to derive a client-specific score that reflects how much each client needs to adjust for improved generalization on the global validation set, thereby enabling more informed and adaptive federated aggregation. Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings. Moreover, FedVG is modular and can be seamlessly integrated with various state-of-the-art FL algorithms, often further improving their results. Our code is available at this https URL.

[CV-108] MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在下游任务中进行提示学习(Prompt Learning)时面临的参数效率与性能之间的矛盾问题。现有方法通过在多个Transformer层中同时扩展视觉和文本提示(Prompt)虽显著提升性能,但导致可训练参数数量激增(达百万级),违背了提示调优(Prompt Tuning)原本的参数高效性优势。其解决方案的关键在于提出MMLoP(Multi-Modal Low-Rank Prompting)框架,采用低秩因子分解(Low-Rank Factorization)对每一层的视觉和文本提示进行参数化,仅需11.5K可训练参数即可实现深度多模态提示;并引入三个互补组件:自调节一致性损失(Self-regulating Consistency Loss)、均匀漂移校正(Uniform Drift Correction)以及共享上投影(Shared Up-projection),分别从特征与logit层面保持与冻结零样本CLIP特征的一致性、消除提示调优引起的全局嵌入偏移、并通过共同低秩因子强制跨模态对齐,从而在保证极低参数量的同时大幅提升性能,实现了优异的准确率-效率权衡。

链接: https://arxiv.org/abs/2602.21397
作者: Sajjad Ghiasvand,Haniyeh Ehsani Oskouie,Mahnoosh Alizadeh,Ramtin Pedarsani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose \textbfMMLoP (\textbfMulti-\textbfModal \textbfLow-Rank \textbfPrompting), a framework that achieves deep multi-modal prompting with only \textbf11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70% on base-to-novel generalization.

[CV-109] Momentum Memory for Knowledge Distillation in Computational Pathology CVPR2026

【速读】:该论文旨在解决多模态学习在癌症诊断中临床转化受限的问题,即由于配对的组织病理学(histopathology)与基因组数据(genomics)稀缺,导致模型难以有效训练。现有知识蒸馏(Knowledge Distillation, KD)方法依赖于批次内局部对齐,因批内比较有限而引入不稳定性,进而影响性能。其解决方案的关键在于提出动量记忆知识蒸馏(Momentum Memory Knowledge Distillation, MoMKD)框架,通过一个动量更新的记忆库跨批次聚合基因组与组织病理学信息,显著扩展每个小批量的监督上下文;同时解耦基因组与组织病理学分支的梯度传播,避免基因组信号主导特征学习,消除推理时的模态差距(modality-gap),从而实现仅用组织病理图像即可获得高精度且泛化能力强的预测结果。

链接: https://arxiv.org/abs/2602.21395
作者: Yongxin Guo,Hao Lu,Onur C. Koyun,Zhengjie Zhu,Muhammet Fatih Demir,Metin Nafi Gurcan
机构: Wake Forest University School of Medicine (维克森林大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology. Comments: Accepted by CVPR 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.21395 [cs.CV] (or arXiv:2602.21395v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.21395 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-110] owards Controllable Video Synthesis of Routine and Rare OR Events

【速读】:该论文旨在解决手术室(Operating Room, OR)工作流中大规模数据集构建的难题,特别是针对罕见、高风险或非典型事件的数据稀缺问题,这类数据瓶颈严重制约了环境智能(Ambient Intelligence)在识别、理解与缓解安全关键事件中的应用。其解决方案的关键在于提出了一种基于视频扩散模型的OR场景可控合成框架,该框架包含几何抽象模块、条件控制模块和微调后的扩散模型:首先将OR场景转换为抽象几何表示,再通过条件机制引导生成过程,最终合成逼真的OR事件视频。该方法不仅显著优于现有视频扩散基线模型,在常规和跨域数据上均实现了更低的FVD/LPIPS指标和更高的SSIM/PSNR,还成功用于训练AI模型检测无菌区违规的近失事件,召回率达70.13%,验证了其在生成稀有事件及支持环境智能模型开发方面的有效性。

链接: https://arxiv.org/abs/2602.21365
作者: Dominik Schneider,Lalithkumar Seenivasan,Sampath Rapuri,Vishalroshan Anil,Aiza Maksutova,Yiqing Shen,Jan Emily Mangulabnan,Hao Ding,Jose L. Porras,Masaru Ishii,Mathias Unberath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Accepted to IPCAI 2026 and submitted to IJCARs

点击查看摘要

Abstract:Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR in both in- and out-of-domain datasets. Through qualitative results, we illustrate its ability for controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a RECALL of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models. Comments: Accepted to IPCAI 2026 and submitted to IJCARs Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV) Cite as: arXiv:2602.21365 [cs.CV] (or arXiv:2602.21365v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2602.21365 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lalithkumar Seenivasan [view email] [v1] Tue, 24 Feb 2026 20:56:15 UTC (2,177 KB)

[CV-111] Scaling View Synthesis Transformers WWW

【速读】:该论文旨在解决生成式视图合成(View Synthesis)中基于Transformer的模型在计算资源扩展时的性能优化问题,特别是当前对编码器-解码器架构是否能在训练计算效率上达到最优尚不明确。其解决方案的关键在于通过系统性地研究视图合成Transformer的缩放规律,发现并纠正以往研究中因架构设计不当和训练计算预算不均衡导致的错误结论;进而提出一种名为可扩展视图合成模型(Scalable View Synthesis Model, SVSM)的编码器-解码器架构,在多个计算水平下表现出与仅解码器模型相当甚至更优的性能-计算帕累托前沿,并在真实世界视图合成基准上以显著更低的训练计算成本超越了先前最优方法。

链接: https://arxiv.org/abs/2602.21341
作者: Evan Kim,Hyunwoo Ryu,Thomas W. Mitchel,Vincent Sitzmann
机构: MIT(麻省理工学院); Adobe(Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

[CV-112] HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles CVPR2026

【速读】:该论文旨在解决自动驾驶仿真中场景生成的两大核心挑战:如何同时实现高保真度(photorealism)与精确可控性(controllable generation)。现有方法难以兼顾二者,导致生成场景在视觉真实感或编辑精度上存在局限。解决方案的关键在于提出HorizonForge框架,其创新性地将场景重建为可编辑的高斯点云(Gaussian Splats)与网格(Mesh)联合表示,并结合噪声感知的视频扩散模型(noise-aware video diffusion process)进行渲染,从而在单次前向传播中实现细粒度3D编辑与语言驱动的车辆插入,同时保证时空一致性。该方法通过引入基于视频扩散的时间先验,显著提升了合成结果的连贯性,且无需针对每条轨迹单独优化,极大提高了效率与可控性。

链接: https://arxiv.org/abs/2602.21333
作者: Yifan Wang,Francesco Pittaluga,Zaid Tasneem,Chenyu You,Manmohan Chandraker,Ziyu Jiang
机构: NEC Labs America (美国 NEC 实验室); Stony Brook University (石溪大学); UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second best state-of-the-art method. Project page: this https URL .

[CV-113] Uncertainty-Aware Diffusion Model for Multimodal Highway Trajectory Prediction via DDIM Sampling

【速读】:该论文旨在解决自动驾驶中轨迹预测的准确性与不确定性感知难题,尤其针对多智能体交互复杂性、场景上下文多样性以及未来运动固有的随机性。现有基于扩散模型的方法(如cVMD)存在采样速度慢、生成多样性利用不足及场景编码脆弱等问题。其解决方案的关键在于提出cVMDx框架:通过DDIM采样策略将推理时间减少高达100倍,实现高效多样本生成以估计不确定性;同时引入拟合的高斯混合模型(Gaussian Mixture Model, GMM)从生成轨迹中提取可解析的多模态预测结果;此外,采用改进的CVQ-VAE进行更鲁棒的场景编码,从而在highD公开数据集上显著提升预测精度与效率,实现真正意义上的全随机、多模态轨迹预测。

链接: https://arxiv.org/abs/2602.21319
作者: Marion Neumeier,Niklas Roßberg,Michael Botsch,Wolfgang Utschick
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

点击查看摘要

Abstract:Accurate and uncertainty-aware trajectory prediction remains a core challenge for autonomous driving, driven by complex multi-agent interactions, diverse scene contexts and the inherently stochastic nature of future motion. Diffusion-based generative models have recently shown strong potential for capturing multimodal futures, yet existing approaches such as cVMD suffer from slow sampling, limited exploitation of generative diversity and brittle scenario encodings. This work introduces cVMDx, an enhanced diffusion-based trajectory prediction framework that improves efficiency, robustness and multimodal predictive capability. Through DDIM sampling, cVMDx achieves up to a 100x reduction in inference time, enabling practical multi-sample generation for uncertainty estimation. A fitted Gaussian Mixture Model further provides tractable multimodal predictions from the generated trajectories. In addition, a CVQ-VAE variant is evaluated for scenario encoding. Experiments on the publicly available highD dataset show that cVMDx achieves higher accuracy and significantly improved efficiency over cVMD, enabling fully stochastic, multimodal trajectory prediction. Comments: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) Cite as: arXiv:2602.21319 [cs.LG] (or arXiv:2602.21319v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.21319 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-114] StoryTailor:A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives CVPR2026

【速读】:该论文旨在解决多帧、动作丰富的视觉叙事生成中面临的三重矛盾:动作文本忠实性(action text faithfulness)、主体身份保真度(subject identity fidelity)以及跨帧背景连续性(cross-frame background continuity)。其核心解决方案是提出一种零样本(zero-shot)流水线方法 StoryTailor,关键创新在于三个协同模块:Gaussian-Centered Attention(GCA)通过动态聚焦每个主体核心缓解接地框(grounding boxes)重叠问题;Action-Boost Singular Value Reweighting(AB-SVR)在文本嵌入空间中增强与动作相关的方向响应;Selective Forgetting Cache(SFC)则保留可迁移的背景线索、遗忘非必要历史信息,并选择性地激活留存线索以构建跨场景语义关联,从而实现时间上连贯且身份稳定的图像序列生成。

链接: https://arxiv.org/abs/2602.21273
作者: Jinghao Hu,Yuhe Zhang,GuoHua Geng,Kang Li,Han Zhang
机构: School of Computing, Northwest University, Xi’an (西北大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages,19 figures,accepted by CVPR2026

点击查看摘要

Abstract:Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.

[CV-115] Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels CVPR2026

【速读】:该论文旨在解决动态场景下高光谱视频成像的重建质量与时间稳定性问题,现有快照式高光谱成像系统因被动采集方式在运动条件下易出现光子利用率低、光谱保真度差等问题。解决方案的关键在于提出Lumosaic系统,其核心创新是将窄带LED阵列与编码曝光像素(Coded-Exposure-Pixel, CEP)相机相结合,通过主动同步照明与像素级曝光控制,在单帧内联合编码空间、时间和波长维度的信息;同时采用基于学习的重建算法,实现31通道(400–700 nm)分辨率VGA、30 fps的高保真度高光谱视频重建,显著提升了复杂运动场景下的时空一致性与光谱准确性。

链接: https://arxiv.org/abs/2602.22140
作者: Dhruv Verma,Andrew Qiu,Roberto Rangel,Ayandev Barman,Hao Yang,Chenjia Hu,Fengqi Zhang,Roman Genov,David B. Lindell,Kiriakos N. Kutulakos,Alex Mariakakis
机构: University of Toronto (多伦多大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame’s exposure, Lumosaic actively synchronizes illumination and pixel-wise exposure, improving photon utilization and preserving spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.

[CV-116] Learning spatially adaptive sparsity level maps for arbitrary convolutional dictionaries

【速读】:该论文旨在解决当前基于学习的图像重建方法中黑箱模块导致的可解释性差和鲁棒性不足的问题。其解决方案的关键在于将数据驱动的信息嵌入到基于模型的卷积字典正则化中,通过神经网络推断的空间自适应稀疏度图实现更灵活的重建机制;同时,改进网络设计与训练策略,使方法具备滤波器置换不变性,并支持在推理阶段更换卷积字典,从而降低对训练数据的依赖,提升在分布外数据上的鲁棒性。

链接: https://arxiv.org/abs/2602.21707
作者: Joshua Schulz,David Schote,Christoph Kolbitsch,Kostas Papafitsoros,Andreas Kofler
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:State-of-the-art learned reconstruction methods often rely on black-box modules that, despite their strong performance, raise questions about their interpretability and robustness. Here, we build on a recently proposed image reconstruction method, which is based on embedding data-driven information into a model-based convolutional dictionary regularization via neural network-inferred spatially adaptive sparsity level maps. By means of improved network design and dedicated training strategies, we extend the method to achieve filter-permutation invariance as well as the possibility to change the convolutional dictionary at inference time. We apply our method to low-field MRI and compare it to several other recent deep learning-based methods, also on in vivo data, in which the benefit for the use of a different dictionary is showcased. We further assess the method’s robustness when tested on in- and out-of-distribution data. When tested on the latter, the proposed method suffers less from the data distribution shift compared to the other learned methods, which we attribute to its reduced reliance on training data due to its underlying model-based reconstruction component.

[CV-117] Perceptual Quality Optimization of Image Super-Resolution ICASSP26

【速读】:该论文旨在解决单图像超分辨率(Single-image Super-Resolution, SR)中因依赖失真导向损失函数或启发式感知先验而导致重建保真度与视觉质量之间难以平衡的问题。其解决方案的关键在于提出一种高效感知双向注意力网络(Efficient Perceptual Bi-directional Attention Network, Efficient-PBAN),该网络通过构建覆盖多种先进SR方法的主观评分数据集,学习与人类感知高度相关联的图像级感知质量预测机制,并将该可微分的感知损失嵌入SR训练流程中,实现重建过程与感知评估之间的闭环对齐,从而显著提升生成图像的主观视觉质量。

链接: https://arxiv.org/abs/2602.21482
作者: Wei Zhou,Yixiao Li,Hadi Amirpour,Xiaoshuai Hao,Jiang Liu,Peng Wang,Hantao Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 6 pages, 2 figures, accepted in ICASSP 26

点击查看摘要

Abstract:Single-image super-resolution (SR) has achieved remarkable progress with deep learning, yet most approaches rely on distortion-oriented losses or heuristic perceptual priors, which often lead to a trade-off between fidelity and visual quality. To address this issue, we propose an \textitEfficient Perceptual Bi-directional Attention Network (Efficient-PBAN) that explicitly optimizes SR towards human-preferred quality. Unlike patch-based quality models, Efficient-PBAN avoids extensive patch sampling and enables efficient image-level perception. The proposed framework is trained on our self-constructed SR quality dataset that covers a wide range of state-of-the-art SR methods with corresponding human opinion scores. Using this dataset, Efficient-PBAN learns to predict perceptual quality in a way that correlates strongly with subjective judgments. The learned metric is further integrated into SR training as a differentiable perceptual loss, enabling closed-loop alignment between reconstruction and perceptual assessment. Extensive experiments demonstrate that our approach delivers superior perceptual quality. Code is publicly available at this https URL.

[CV-118] owards single-shot coherent imaging via overlap-free ptychography

【速读】:该论文旨在解决同步辐射和X射线自由电子激光(XFEL)源中叠层成像(ptychography)因需密集重叠扫描而导致的通量限制与剂量过高问题,同时推动无重叠条件下的扩展样品单次曝光重构技术发展。其核心解决方案是提出PtychoPINN框架,通过将可微分的相干散射前向模型与泊松光子计数似然函数耦合,并以基于坐标分组的方式引入实空间重叠作为可调参数(而非硬性约束),从而实现无需重叠的单次曝光Fresnel相干衍射成像(CDI)重构,且加速传统多帧叠层成像。该方法在低光子计数(~10⁴ 光子/帧)下仍保持高精度,实验验证显示其在单次曝光重建中达到结构相似性(SSIM)0.904,接近重叠约束重建结果(SSIM 0.968),并具备优异泛化能力与GPU计算效率(约为最小二乘最大似然法的40倍)。

链接: https://arxiv.org/abs/2602.21361
作者: Oliver Hoidn,Aashwin Mishra,Steven Henke,Albert Vong,Matthew Seaberg
机构: SLAC National Accelerator Laboratory (SLAC 国家加速器实验室); Argonne National Laboratory (阿贡国家实验室)
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Ptychographic imaging at synchrotron and XFEL sources requires dense overlapping scans, limiting throughput and increasing dose. Extending coherent diffractive imaging to overlap-free operation on extended samples remains an open problem. Here, we extend PtychoPINN (O. Hoidn \emphet al., \emphScientific Reports \textbf13, 22789, 2023) to deliver \emphoverlap-free, single-shot reconstructions in a Fresnel coherent diffraction imaging (CDI) geometry while also accelerating conventional multi-shot ptychography. The framework couples a differentiable forward model of coherent scattering with a Poisson photon-counting likelihood; real-space overlap enters as a tunable parameter via coordinate-based grouping rather than a hard requirement. On synthetic benchmarks, reconstructions remain accurate at low counts ( \sim!10^4 photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with 0.968 for overlap-constrained reconstruction. Against a data-saturated supervised model with the same backbone (16,384 training images), PtychoPINN achieves higher SSIM with only 1,024 images and generalizes to unseen illumination profiles. Per-graphics processing unit (GPU) throughput is approximately 40\times that of least-squares maximum-likelihood (LSQ-ML) reconstruction at matched 128\times128 resolution. These results, validated on experimental data from the Advanced Photon Source and the Linac Coherent Light Source, unify single-exposure Fresnel CDI and overlapped ptychography within one framework, supporting dose-efficient, high-throughput imaging at modern light sources.

[CV-119] RelA-Diffusion: Relativistic Adversarial Diffusion for Multi-Tracer PET Synthesis from Multi-Sequence MRI

【速读】:该论文旨在解决多示踪剂正电子发射断层成像(multi-tracer positron emission tomography, PET)在临床应用中因成本高、辐射暴露大及示踪剂获取受限而难以常规开展的问题。为实现无创、低成本的多病理过程评估,作者提出RelA-Diffusion框架,其核心创新在于结合T1加权和T2-FLAIR磁共振成像(MRI)作为互补输入,以增强结构信息引导生成;同时引入梯度惩罚的相对对抗损失(gradient-penalized relativistic adversarial loss),在扩散模型的中间去噪阶段进行相对判别,从而提升局部结构的真实性与训练稳定性,实现更高质量的多示踪剂PET图像合成。

链接: https://arxiv.org/abs/2602.21345
作者: Minhui Yu,Yongheng Sun,David S. Lalush,Jason P Mihalik,Pew-Thian Yap,Mingxia Liu
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); North Carolina State University (北卡罗来纳州立大学); Department of Radiology (放射科); Biomedical Research Imaging Center (生物医学成像中心); Joint Department of Biomedical Engineering (生物医学工程联合系); Department of Exercise and Sport Science (运动与体育科学系)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-tracer positron emission tomography (PET) provides critical insights into diverse neuropathological processes such as tau accumulation, neuroinflammation, and \beta -amyloid deposition in the brain, making it indispensable for comprehensive neurological assessment. However, routine acquisition of multi-tracer PET is limited by high costs, radiation exposure, and restricted tracer availability. Recent efforts have explored deep learning approaches for synthesizing PET images from structural MRI. While some methods rely solely on T1-weighted MRI, others incorporate additional sequences such as T2-FLAIR to improve pathological sensitivity. However, existing methods often struggle to capture fine-grained anatomical and pathological details, resulting in artifacts and unrealistic outputs. To this end, we propose RelA-Diffusion, a Relativistic Adversarial Diffusion framework for multi-tracer PET synthesis from multi-sequence MRI. By leveraging both T1-weighted and T2-FLAIR scans as complementary inputs, RelA-Diffusion captures richer structural information to guide image generation. To improve synthesis fidelity, we introduce a gradient-penalized relativistic adversarial loss to the intermediate clean predictions of the diffusion model. This loss compares real and generated images in a relative manner, encouraging the synthesis of more realistic local structures. Both the relativistic formulation and the gradient penalty contribute to stabilizing the training, while adversarial feedback at each diffusion timestep enables consistent refinement throughout the generation process. Extensive experiments on two datasets demonstrate that RelA-Diffusion outperforms existing methods in both visual fidelity and quantitative metrics, highlighting its potential for accurate synthesis of multi-tracer PET.

人工智能

[AI-0] Surrogate models for Rock-Fluid Interaction: A Grid-Size-Invariant Approach

【速读】:该论文旨在解决岩石-流体相互作用建模中高保真数值模型计算成本高昂的问题,尤其针对不确定性量化和优化等多查询场景的适用性受限问题。其核心解决方案是构建八种代理模型(surrogate models),其中四类为基于神经网络压缩与预测的降阶模型(Reduced-Order Models, ROM),另四类为具备网格尺寸不变性(grid-size invariance)特性的单神经网络模型——即能够在训练时未见过的更大计算域上进行推理的图像到图像映射模型。关键创新在于提出并验证了网格尺寸不变框架,显著降低训练内存消耗,并在预测精度上优于传统ROM方法;同时实验表明UNet++架构相较于UNet在代理模型中表现更优,且该方法能有效应对因流体诱导岩石溶解导致的非静态固相场带来的挑战。

链接: https://arxiv.org/abs/2602.22188
作者: Nathalie C. Pinheiro,Donghu Guo,Hannah P. Menke,Aniket C. Joshi,Claire E. Heaney,Ahmed H. ElSheikh,Christopher C. Pain
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:

点击查看摘要

Abstract:Modelling rock-fluid interaction requires solving a set of partial differential equations (PDEs) to predict the flow behaviour and the reactions of the fluid with the rock on the interfaces. Conventional high-fidelity numerical models require a high resolution to obtain reliable results, resulting in huge computational expense. This restricts the applicability of these models for multi-query problems, such as uncertainty quantification and optimisation, which require running numerous scenarios. As a cheaper alternative to high-fidelity models, this work develops eight surrogate models for predicting the fluid flow in porous media. Four of these are reduced-order models (ROM) based on one neural network for compression and another for prediction. The other four are single neural networks with the property of grid-size invariance; a term which we use to refer to image-to-image models that are capable of inferring on computational domains that are larger than those used during training. In addition to the novel grid-size-invariant framework for surrogate models, we compare the predictive performance of UNet and UNet++ architectures, and demonstrate that UNet++ outperforms UNet for surrogate models. Furthermore, we show that the grid-size-invariant approach is a reliable way to reduce memory consumption during training, resulting in good correlation between predicted and ground-truth values and outperforming the ROMs analysed. The application analysed is particularly challenging because fluid-induced rock dissolution results in a non-static solid field and, consequently, it cannot be used to help in adjustments of the future prediction.

[AI-1] Enhancing Framingham Cardiovascular Risk Score Transparency through Logic-Based XAI

【速读】:该论文旨在解决心血管疾病(Cardiovascular Disease, CVD)风险评估工具——如Framingham Risk Score(FRS)——缺乏透明性和可解释性的问题,即无法明确说明患者为何被归类为某一风险等级,以及如何通过干预降低风险。其解决方案的关键在于构建一个基于一阶逻辑(first-order logic)和可解释人工智能(Explainable Artificial Intelligence, XAI)原理的逻辑解释器,能够识别出导致特定风险分类的最小属性集合,并生成可操作的干预场景,从而为临床提供透明且具有指导意义的风险管理建议。

链接: https://arxiv.org/abs/2602.22149
作者: Emannuel L. de A. Bezerra,Luiz H. T. Viana,Vinícius P. Chagas,Diogo E. Rolim,Thiago Alves Rocha,Carlos H. L. Cavalcante
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Preprint version. The final authenticated version is available online via the DOI below

点击查看摘要

Abstract:Cardiovascular disease (CVD) remains one of the leading global health challenges, accounting for more than 19 million deaths worldwide. To address this, several tools that aim to predict CVD risk and support clinical decision making have been developed. In particular, the Framingham Risk Score (FRS) is one of the most widely used and recommended worldwide. However, it does not explain why a patient was assigned to a particular risk category nor how it can be reduced. Due to this lack of transparency, we present a logical explainer for the FRS. Based on first-order logic and explainable artificial intelligence (XAI) fundaments, the explainer is capable of identifying a minimal set of patient attributes that are sufficient to explain a given risk classification. Our explainer also produces actionable scenarios that illustrate which modifiable variables would reduce a patient’s risk category. We evaluated all possible input combinations of the FRS (over 22,000 samples) and tested them with our explainer, successfully identifying important risk factors and suggesting focused interventions for each case. The results may improve clinician trust and facilitate a wider implementation of CVD risk assessment by converting opaque scores into transparent and prescriptive insights, particularly in areas with restricted access to specialists.

[AI-2] Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

【速读】:该论文旨在解决强化学习中基于人类反馈(Reinforcement Learning from Human Feedback, RLHF)的对齐问题,特别是针对标准原始-对偶(primal-dual)方法在参数化策略下可能出现的不稳定或发散问题。现有方法在处理带期望奖励约束的RLHF时,虽可建模为原始-对偶优化问题,但难以保证最后迭代的收敛性,尤其在实际应用中存在震荡现象。解决方案的关键在于提出一个通用的原始-对偶框架,统一了包括安全RLHF、单次和多次采样在内的多种对齐算法,并引入一种乐观原始-对偶(Optimistic Primal-Dual, OPD)算法,通过在原始变量和对偶变量上引入预测更新机制来稳定鞍点动态。理论分析表明,该方法能保证最后迭代收敛,无论是分布空间中的精确策略优化,还是参数化策略下的近似最优解,其误差与逼近误差和偏差相关,从而填补了约束强化学习与实际RLHF之间的重要理论空白。

链接: https://arxiv.org/abs/2602.22146
作者: Yining Li,Peizhong Ju,Ness Shroff
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.

[AI-3] Dont stop me now: Rethinking Validation Criteria for Model Parameter Selection

【速读】:该论文试图解决的问题是:在神经网络分类器的训练过程中,用于模型选择(如早停策略)的验证集评估准则如何影响最终测试性能,尤其是在不同损失函数(交叉熵、C-Loss、PolyLoss)和不同验证指标(准确率或损失函数值)组合下的表现差异。其解决方案的关键在于通过系统性的实验设计,在标准数据集上采用k折交叉验证,对比了基于验证准确率的早停与基于损失函数的早停以及事后选择(post-hoc selection)三种策略的效果,发现使用验证损失而非验证准确率进行模型选择可获得更稳定且接近最优测试性能的结果,从而建议避免在早停中使用验证准确率作为决策依据。

链接: https://arxiv.org/abs/2602.22107
作者: Andrea Apicella,Francesco Isgrò,Andrea Pollastro,Roberto Prevete
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the extensive literature on training loss functions, the evaluation of generalization on the validation set remains underexplored. In this work, we conduct a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with attention to early stopping. Using fully connected networks on standard benchmarks under k -fold evaluation, we compare: (i) early stopping with patience and (ii) post-hoc selection over all epochs (i.e. no early stopping). Models are trained with cross-entropy, C-Loss, or PolyLoss; the model parameter selection on the validation set is made using accuracy or one of the three loss functions, each considered independently. Three main findings emerge. (1) Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy than both loss-based early stopping and post-hoc selection. (2) Loss-based validation criteria yield comparable and more stable test accuracy. (3) Across datasets and folds, any single validation rule often underperforms the test-optimal checkpoint. Overall, the selected model typically achieves test-set performance statistically lower than the best performance across all epochs, regardless of the validation criterion. Our results suggest avoiding validation accuracy (in particular with early stopping) for parameter selection, favoring loss-based validation criteria.

[AI-4] On Imbalanced Regression with Hoeffding Trees PAKDD2026

【速读】:该论文旨在解决流式数据环境中决策树模型在回归任务中的性能优化问题,尤其是在数据分布不均衡场景下的预测平滑性与稳定性提升。其关键解决方案包括:首先将核密度估计(Kernel Density Estimation, KDE)方法拓展至流式学习环境,以改善早期数据流中因样本不足导致的预测波动;其次,将层次收缩(Hierarchical Shrinkage, HS)这一批处理阶段的后验正则化方法引入增量决策树模型,实现对树结构不变的前提下进行参数平滑。实验表明,KDE在流数据初期具有显著优势,而HS在所测试场景中几乎未带来性能提升。

链接: https://arxiv.org/abs/2602.22101
作者: Pantia-Marina Alchirch,Dimitrios I. Diochnos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, 1 table, 2 algorithms, authors’ version of paper accepted in PAKDD 2026 special session on Data Science: Foundations and Applications (DSFA)

点击查看摘要

Abstract:Many real-world applications provide a continuous stream of data that is subsequently used by machine learning models to solve regression tasks of interest. Hoeffding trees and their variants have a long-standing tradition due to their effectiveness, either alone or as base models in broader ensembles. At the same time a recent line of work in batch learning has shown that kernel density estimation (KDE) is an effective approach for smoothed predictions in imbalanced regression tasks [Yang et al., 2021]. Moreover, another recent line of work for batch learning, called hierarchical shrinkage (HS) [Agarwal et al., 2022], has introduced a post-hoc regularization method for decision trees that does not alter the structure of the learned tree. Using a telescoping argument we cast KDE to streaming environments and extend the implementation of HS to incremental decision tree models. Armed with these extensions we investigate the performance of decision trees that may enjoy such options in datasets commonly used for regression in online settings. We conclude that KDE is beneficial in the early parts of the stream, while HS hardly, if ever, offers performance benefits. Our code is publicly available at: this https URL.

[AI-5] Petri Net Relaxation for Infeasibility Explanation and Sequential Task Planning

【速读】:该论文旨在解决传统规划方法在面对情境变化或认知更新时缺乏鲁棒性的问题,尤其关注于无法找到可行计划的情况下的检测与解释能力不足。现有方法多聚焦于单次高效规划,而忽视了对领域知识的动态调整和不可行性的识别。其解决方案的关键在于提出一种基于Petri网可达性松弛(Petri net reachability relaxation)的新机制,该机制能够实现鲁棒不变式合成、高效的不可达目标检测以及有助于理解问题本质的不可行性解释;同时结合增量约束求解器以支持目标与约束的迭代更新,在实验中展现出比基线方法更高的不可行性检测率(最多提升2倍)和更优的序列规划更新性能。

链接: https://arxiv.org/abs/2602.22094
作者: Nguyen Cong Nhat Le,John G. Rogers,Claire N. Bonial,Neil T. Dantam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures. Submitted to 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR) on 01/14/2026

点击查看摘要

Abstract:Plans often change due to changes in the situation or our understanding of the situation. Sometimes, a feasible plan may not even exist, and identifying such infeasibilities is useful to determine when requirements need adjustment. Common planning approaches focus on efficient one-shot planning in feasible cases rather than updating domains or detecting infeasibility. We propose a Petri net reachability relaxation to enable robust invariant synthesis, efficient goal-unreachability detection, and helpful infeasibility explanations. We further leverage incremental constraint solvers to support goal and constraint updates. Empirically, compared to baselines, our system produces a comparable number of invariants, detects up to 2 times more infeasibilities, performs competitively in one-shot planning, and outperforms in sequential plan updates in the tested domains.

[AI-6] Language Models Exhibit Inconsistent Biases Towards Algorithmic Agents and Human Experts

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在决策任务中如何权衡来自人类专家与算法代理(algorithmic agent)的信息,尤其是在存在“算法厌恶”(algorithm aversion)现象的背景下,LLMs 是否表现出类似人类的偏倚。解决方案的关键在于设计双维度评估框架——通过“陈述偏好”(stated preferences,即直接询问LLMs对人类或算法的信任度)和“揭示偏好”(revealed preferences,即提供具体性能示例并要求LLMs做出带激励的决策选择),从而系统性地考察LLMs对不同信息源的态度差异。实验发现,尽管LLMs在信任评分上倾向于人类专家,但在实际行为选择中却更可能偏向表现较差的算法,表明其存在不一致的偏倚,这对高风险场景下AI系统的部署具有重要警示意义。

链接: https://arxiv.org/abs/2602.22070
作者: Jessica Y. Bo,Lillio Mok,Ashton Anderson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI 2026)

点击查看摘要

Abstract:Large language models are increasingly used in decision-making tasks that require them to process information from a variety of sources, including both human experts and other algorithmic agents. How do LLMs weigh the information provided by these different sources? We consider the well-studied phenomenon of algorithm aversion, in which human decision-makers exhibit bias against predictions from algorithms. Drawing upon experimental paradigms from behavioural economics, we evaluate how eightdifferent LLMs delegate decision-making tasks when the delegatee is framed as a human expert or an algorithmic agent. To be inclusive of different evaluation formats, we conduct our study with two task presentations: stated preferences, modeled through direct queries about trust towards either agent, and revealed preferences, modeled through providing in-context examples of the performance of both agents. When prompted to rate the trustworthiness of human experts and algorithms across diverse tasks, LLMs give higher ratings to the human expert, which correlates with prior results from human respondents. However, when shown the performance of a human expert and an algorithm and asked to place an incentivized bet between the two, LLMs disproportionately choose the algorithm, even when it performs demonstrably worse. These discrepant results suggest that LLMs may encode inconsistent biases towards humans and algorithms, which need to be carefully considered when they are deployed in high-stakes scenarios. Furthermore, we discuss the sensitivity of LLMs to task presentation formats that should be broadly scrutinized in evaluation robustness for AI safety.

[AI-7] Semantic Partial Grounding via LLM s

【速读】:该论文旨在解决经典规划中**接地(Grounding)**步骤的计算瓶颈问题,即随着任务规模扩大, grounded actions 和 atoms 数量呈指数增长,导致求解效率急剧下降。解决方案的关键在于提出 SPG-LLM 方法,利用大语言模型(Large Language Models, LLMs)对 PDDL 域文件和问题文件进行文本与结构分析,启发式地识别并剔除可能无关的对象、动作和谓词,从而显著缩小接地后的任务规模。该方法在七个难以接地的基准测试中实现了更快的接地速度(有时快几个数量级),同时在部分领域中还能获得更优或相当的计划成本。

链接: https://arxiv.org/abs/2602.22067
作者: Giuseppe Canonaco,Alberto Pozanco,Daniel Borrajo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Grounding is a critical step in classical planning, yet it often becomes a computational bottleneck due to the exponential growth in grounded actions and atoms as task size increases. Recent advances in partial grounding have addressed this challenge by incrementally grounding only the most promising operators, guided by predictive models. However, these approaches primarily rely on relational features or learned embeddings and do not leverage the textual and structural cues present in PDDL descriptions. We propose SPG-LLM, which uses LLMs to analyze the domain and problem files to heuristically identify potentially irrelevant objects, actions, and predicates prior to grounding, significantly reducing the size of the grounded task. Across seven hard-to-ground benchmarks, SPG-LLM achieves faster grounding-often by orders of magnitude-while delivering comparable or better plan costs in some domains.

[AI-8] DualWeaver: Synergistic Feature Weaving Surrogates for Multivariate Forecasting with Univariate Time Series Foundation Models

【速读】:该论文旨在解决如何将单变量时间序列基础模型(Univariate Time-Series Foundation Models, Uni-TSFMs)有效扩展至多变量时间序列预测的问题,这是当前多变量预测任务中的一大挑战。解决方案的关键在于提出DualWeaver框架,其核心创新是引入一对可学习且结构对称的代理序列(surrogate series),通过共享的辅助特征融合模块捕捉变量间的交叉依赖关系,并利用预测目标将这些代理序列映射为适配TSFM的输入形式;其对称结构允许无需额外参数解码即可直接重建最终预测结果,同时引入理论驱动的正则化项以增强对适应崩溃(adaptation collapse)的鲁棒性。

链接: https://arxiv.org/abs/2602.22066
作者: Jinpeng Li,Zhongyi Pei,Huaze Xue,Bojian Zheng,Chen Wang,Jianmin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages. Preprint

点击查看摘要

Abstract:Time-series foundation models (TSFMs) have achieved strong univariate forecasting through large-scale pre-training, yet effectively extending this success to multivariate forecasting remains challenging. To address this, we propose DualWeaver, a novel framework that adapts univariate TSFMs (Uni-TSFMs) for multivariate forecasting by using a pair of learnable, structurally symmetric surrogate series. Generated by a shared auxiliary feature-fusion module that captures cross-variable dependencies, these surrogates are mapped to TSFM-compatible series via the forecasting objective. The symmetric structure enables parameter-free reconstruction of final predictions directly from the surrogates, without additional parametric decoding. A theoretically grounded regularization term is further introduced to enhance robustness against adaptation collapse. Extensive experiments on diverse real-world datasets show that DualWeaver outperforms state-of-the-art multivariate forecasters in both accuracy and stability. We release the code at this https URL.

[AI-9] Physics-Informed Machine Learning for Vessel Shaft Power and Fuel Consumption Prediction: Interpretable KAN-based Approach

【速读】:该论文旨在解决船舶轴转速、轴功率和燃油消耗预测中准确性与物理可解释性难以兼顾的问题。传统基于物理的模型虽具可解释性但难以应对实际运行中的复杂变异性,而纯数据驱动方法虽精度高却缺乏物理合理性。解决方案的关键在于提出一种物理信息增强的科尔莫戈罗夫-阿诺德网络(Physics-Informed Kolmogorov-Arnold Network, PI-KAN),其核心创新包括:利用可解释的一元特征变换模块、引入融合物理规律的损失函数,以及构建无泄漏的级联预测流程,从而在保持物理一致性的同时显著提升预测性能。

链接: https://arxiv.org/abs/2602.22055
作者: Hamza Haruna Mohammed,Dusica Marijan,Arnbjørn Maressa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, IEEE conference paper format; under review

点击查看摘要

Abstract:Accurate prediction of shaft rotational speed, shaft power, and fuel consumption is crucial for enhancing operational efficiency and sustainability in maritime transportation. Conventional physics-based models provide interpretability but struggle with real-world variability, while purely data-driven approaches achieve accuracy at the expense of physical plausibility. This paper introduces a Physics-Informed Kolmogorov-Arnold Network (PI-KAN), a hybrid method that integrates interpretable univariate feature transformations with a physics-informed loss function and a leakage-free chained prediction pipeline. Using operational and environmental data from five cargo vessels, PI-KAN consistently outperforms the traditional polynomial method and neural network baselines. The model achieves the lowest mean absolute error (MAE) and root mean squared error (RMSE), and the highest coefficient of determination (R^2) for shaft power and fuel consumption across all vessels, while maintaining physically consistent behavior. Interpretability analysis reveals rediscovery of domain-consistent dependencies, such as cubic-like speed-power relationships and cosine-like wave and wind effects. These results demonstrate that PI-KAN achieves both predictive accuracy and interpretability, offering a robust tool for vessel performance monitoring and decision support in operational settings.

[AI-10] Enhancing LLM -Based Test Generation by Eliminating Covered Code

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的单元测试生成方法在面对复杂方法时覆盖率低、效果不佳的问题,尤其是在处理长代码片段时受限于上下文长度和推理能力下降。其解决方案的关键在于提出一种可扩展的LLM驱动的单元测试生成方法,包含两个核心步骤:一是通过LLM与静态分析相结合的方式检索复杂方法相关的上下文信息;二是采用迭代式测试生成与代码消除策略,逐步生成测试用例并剔除已覆盖的代码段,从而降低单次生成任务的复杂度,提升覆盖率并缓解长文本输入带来的限制。

链接: https://arxiv.org/abs/2602.21997
作者: WeiZhe Xu,Mengyu Liu,Fanxin Kong
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, supplementary material included

点击查看摘要

Abstract:Automated test generation is essential for software quality assurance, with coverage rate serving as a key metric to ensure thorough testing. Recent advancements in Large Language Models (LLMs) have shown promise in improving test generation, particularly in achieving higher coverage. However, while existing LLM-based test generation solutions perform well on small, isolated code snippets, they struggle when applied to complex methods under test. To address these issues, we propose a scalable LLM-based unit test generation method. Our approach consists of two key steps. The first step is context information retrieval, which uses both LLMs and static analysis to gather relevant contextual information associated with the complex methods under test. The second step, iterative test generation with code elimination, repeatedly generates unit tests for the code slice, tracks the achieved coverage, and selectively removes code segments that have already been covered. This process simplifies the testing task and mitigates issues arising from token limits or reduced reasoning effectiveness associated with excessively long contexts. Through comprehensive evaluations on open-source projects, our approach outperforms state-of-the-art LLM-based and search-based methods, demonstrating its effectiveness in achieving high coverage on complex methods.

[AI-11] Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中隐藏信念的识别问题,尤其是在模型可能因对齐伪造(alignment faking)而掩饰真实态度的背景下。随着LLMs在高风险决策场景中的广泛应用,准确探测其潜在信念变得尤为关键。论文提出的关键解决方案是采用列表实验(list experiment),这是一种社会科学研究中用于规避社会期望偏差(social desirability bias)的经典方法,其原理与LLM的对齐伪造机制高度类比。通过在Anthropic、Google和OpenAI开发的多个模型上实施该方法,研究发现所有模型均存在对大规模监控的隐性支持,部分模型还表现出对酷刑、歧视及首次核打击的隐性赞同;同时,安慰剂处理未产生显著效应,验证了该方法的有效性。

链接: https://arxiv.org/abs/2602.21939
作者: Maxim Chupilkin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:How can researchers identify beliefs that large language models (LLMs) hide? As LLMs become more sophisticated and the prevalence of alignment faking increases, combined with their growing integration into high-stakes decision-making, responding to this challenge has become critical. This paper proposes that a list experiment, a simple method widely used in the social sciences, can be applied to study the hidden beliefs of LLMs. List experiments were originally developed to circumvent social desirability bias in human respondents, which closely parallels alignment faking in LLMs. The paper implements a list experiment on models developed by Anthropic, Google, and OpenAI and finds hidden approval of mass surveillance across all models, as well as some approval of torture, discrimination, and first nuclear strike. Importantly, a placebo treatment produces a null result, validating the method. The paper then compares list experiments with direct questioning and discusses the utility of the approach.

[AI-12] 2-Step Agent : A Framework for the Interaction of a Decision Maker with AI Decision Support

【速读】:该论文旨在解决人工智能(AI)辅助决策在实际应用中可能引发的负面后果问题,特别是当AI预测与人类决策者先验信念不一致时,可能导致比无AI支持更差的下游决策结果。其解决方案的关键在于提出一种通用的计算框架——两步代理模型(2-Step Agent),该框架基于贝叶斯方法进行因果推断,能够建模两个关键阶段:一是AI对新观测的预测如何改变理性贝叶斯代理者的信念;二是这种信念变化如何影响后续决策及最终结果。通过模拟验证,该框架揭示了单一错误的先验信念即可导致AI辅助决策劣于人工决策,从而强调了模型文档化和用户培训的重要性。

链接: https://arxiv.org/abs/2602.21889
作者: Otto Nyberg,Fausto Carcassi,Giovanni Cinà
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 17 figures

点击查看摘要

Abstract:Across a growing number of fields, human decision making is supported by predictions from AI models. However, we still lack a deep understanding of the effects of adoption of these technologies. In this paper, we introduce a general computational framework, the 2-Step Agent, which models the effects of AI-assisted decision making. Our framework uses Bayesian methods for causal inference to model 1) how a prediction on a new observation affects the beliefs of a rational Bayesian agent, and 2) how this change in beliefs affects the downstream decision and subsequent outcome. Using this framework, we show by simulations how a single misaligned prior belief can be sufficient for decision support to result in worse downstream outcomes compared to no decision support. Our results reveal several potential pitfalls of AI-driven decision support and highlight the need for thorough model documentation and proper user training.

[AI-13] ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在移动代理(mobile agent)开发中普遍局限于反应式范式(reactive paradigm)的问题,即模型仅能执行用户明确指令,而缺乏自主感知需求并主动发起行动的能力——这正是主动智能(proactive intelligence)的核心挑战。解决方案的关键在于提出一个名为ProactiveMobile的综合性基准测试体系,其核心创新在于:将主动任务形式化为基于设备端四维上下文信号推断隐式用户意图,并从包含63个API的完整函数池中生成可执行函数序列;同时通过30名专家对超过3,660个实例进行质量审核,确保事实准确性、逻辑一致性和动作可行性,从而实现对主动能力的客观、可执行评估。实验证明,该基准能够有效揭示当前MLLMs在主动能力上的显著不足(如GPT-5成功率为7.39%),并验证了通过微调可提升该能力(Qwen2.5-VL-7B-Instruct达19.15%),凸显了该基准对于推动主动智能研究的重要性。

链接: https://arxiv.org/abs/2602.21858
作者: Dezhi Kong,Zhengzhao Feng,Qiliang Liang,Hao Wang,Haofei Sun,Changpeng Yang,Yang Li,Peng Zhou,Shuai Nie,Hongzhen Wang,Linfeng Zhou,Hao Jia,Jiaming Xu,Runyu Shi,Ying Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.

[AI-14] xai-cola: A Python library for sparsifying counterfactual explanations

【速读】:该论文旨在解决生成式AI(Generative AI)中后验可解释性领域内反事实解释(Counterfactual Explanation, CE)冗余性过高的问题,即现有CE生成器输出的解释常包含大量不必要的特征变更,影响其可读性和实用性。解决方案的关键在于提出一个开源Python库xai-cola,它提供了一个端到端的稀疏化(sparsification)流程,能够对任意CE生成器输出的反事实样本进行优化,在保持解释有效性的同时显著减少被修改的特征数量;该工具支持输入原始表格数据、预处理对象及训练好的scikit-learn或PyTorch模型,并集成多种稀疏化策略与可视化模块,实证表明其可在多个CE生成器上将特征变更数降低最高达50%。

链接: https://arxiv.org/abs/2602.21845
作者: Lin Zhu,Lei You
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 5pages, 1 figure

点击查看摘要

Abstract:Counterfactual explanation (CE) is an important domain within post-hoc explainability. However, the explanations generated by most CE generators are often highly redundant. This work introduces an open-source Python library xai-cola, which provides an end-to-end pipeline for sparsifying CEs produced by arbitrary generators, reducing superfluous feature changes while preserving their validity. It offers a documented API that takes as input raw tabular data in pandas DataFrame form, a preprocessing object (for standardization and encoding), and a trained scikit-learn or PyTorch model. On this basis, users can either employ the built-in or externally imported CE generators. The library also implements several sparsification policies and includes visualization routines for analysing and comparing sparsified counterfactuals. xai-cola is released under the MIT license and can be installed from PyPI. Empirical experiments indicate that xai-cola produces sparser counterfactuals across several CE generators, reducing the number of modified features by up to 50% in our setting. The source code is available at this https URL.

[AI-15] Resilient Federated Chain: Transforming Blockchain Consensus into an Active Defense Layer for Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在去中心化训练过程中易受对抗攻击的问题,此类攻击会破坏模型完整性与数据隐私性,而传统数据检测方法因不兼容FL的分布式架构难以应用。其解决方案的关键在于提出一种名为Resilient Federated Chain (RFC) 的区块链增强型联邦学习框架:通过重构现有“基于联邦学习的共识机制”中的冗余池挖矿(Pooled Mining)机制作为主动防御层,并结合鲁棒聚合规则;同时在共识机制中引入灵活的评估函数,实现对不同攻击策略的自适应防御,从而显著提升系统在多种对抗场景下的鲁棒性。

链接: https://arxiv.org/abs/2602.21841
作者: Mario García-Márquez,Nuria Rodríguez-Barroso,M.Victoria Luzón,Francisco Herrera
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a key paradigm for building Trustworthy AI systems by enabling privacy-preserving, decentralized model training. However, FL is highly susceptible to adversarial attacks that compromise model integrity and data confidentiality, a vulnerability exacerbated by the fact that conventional data inspection methods are incompatible with its decentralized design. While integrating FL with Blockchain technology has been proposed to address some limitations, its potential for mitigating adversarial attacks remains largely unexplored. This paper introduces Resilient Federated Chain (RFC), a novel blockchain-enabled FL framework designed specifically to enhance resilience against such threats. RFC builds upon the existing Proof of Federated Learning architecture by repurposing the redundancy of its Pooled Mining mechanism as an active defense layer that can be combined with robust aggregation rules. Furthermore, the framework introduces a flexible evaluation function in its consensus mechanism, allowing for adaptive defense against different attack strategies. Extensive experimental evaluation on image classification tasks under various adversarial scenarios, demonstrates that RFC significantly improves robustness compared to baseline methods, providing a viable solution for securing decentralized learning environments.

[AI-16] An Evaluation of Context Length Extrapolation in Long Code via Positional Embeddings and Efficient Attention

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件工程中应用时因固定上下文长度限制而导致的长代码序列泛化能力不足的问题。其解决方案的关键在于探索无需训练(zero-shot)、仅在推理阶段进行优化的方法,重点改进位置编码(position encodings)并优化注意力机制(attention mechanisms),从而实现对长代码序列的有效建模与完成,尤其在长代码补全任务中提升性能。

链接: https://arxiv.org/abs/2602.21800
作者: Madhusudan Ghosh,Rishabh Gupta
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has led to a significant increase in automated tools in the software engineering, capable of performing various code-related tasks such as code generation, completion, and translation. Despite these advancements, its effectiveness is constrained by fixed context lengths, limiting its ability to generalize across long, domain-specific code sequences. To address this challenge, we investigate zero-shot, inference-only methods aimed at improving position encodings and optimizing attention mechanisms. Our goal is to provide a thorough analysis of current approaches that facilitate context length extrapolation in code, particularly in the context of long code completion tasks.

[AI-17] Excitation: Momentum For Experts

【速读】:该论文旨在解决稀疏架构(如Mixture-of-Experts, MoE)在训练过程中因参数更新策略不合理而导致的收敛缓慢与功能路径不明确的问题,特别是深层MoE中出现的“结构混淆”现象(structural confusion),即标准优化器无法建立有效的信号传播路径。解决方案的关键在于提出一种名为Excitation的新颖优化框架,其通过基于批次级专家利用率动态调节参数更新,引入竞争性更新机制:对高利用率专家增强更新幅度,同时选择性抑制低利用率专家,从而有效强化路由专业化(routing specialization)。该方法无需额外的可学习参数或每参数优化状态,具有高度通用性和轻量化特性,适用于内存受限场景,并在语言和视觉任务中显著提升MoE模型的收敛速度与最终性能。

链接: https://arxiv.org/abs/2602.21798
作者: Sagi Shaier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose Excitation, a novel optimization framework designed to accelerate learning in sparse architectures such as Mixture-of-Experts (MoEs). Unlike traditional optimizers that treat all parameters uniformly, Excitation dynamically modulates updates using batch-level expert utilization. It introduces a competitive update dynamic that amplifies updates to highly-utilized experts and can selectively suppress low-utilization ones, effectively sharpening routing specialization. Notably, we identify a phenomenon of “structural confusion” in deep MoEs, where standard optimizers fail to establish functional signal paths; Excitation acts as a specialization catalyst, “rescuing” these models and enabling stable training where baselines remain trapped. Excitation is optimizer-, domain-, and model-agnostic, requires minimal integration effort, and introduces neither additional per-parameter optimizer state nor learnable parameters, making it highly viable for memory-constrained settings. Across language and vision tasks, Excitation consistently improves convergence speed and final performance in MoE models, indicating that active update modulation is a key mechanism for effective conditional computation.

[AI-18] UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

【速读】:该论文旨在解决现有音频编码器在跨域任务中表现不均衡的问题,即大多数模型在语音(speech)任务上表现优异,但在环境声(environmental sound)和音乐(music)任务上性能显著下降。为实现统一的音频表征学习,作者提出UniWhisper,其核心创新在于设计了一种高效的持续多任务训练框架,将异构的音频任务统一转化为指令-答案(instruction-answer)格式,从而避免使用特定任务的头结构(task-specific heads)和损失函数(losses),仅依赖标准的下一个词预测(next-token prediction)训练方式。这一方法使模型能够在单一编码器中同时捕捉细粒度语音特征与高层次语义信息,实验表明其在20个跨域任务上的性能显著优于Whisper模型。

链接: https://arxiv.org/abs/2602.21772
作者: Yuxuan Chen,Peize He,Haoyuan Xu,Junzi Zhang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.

[AI-19] Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

【速读】:该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中泛化能力不足的问题,尤其关注两个关键因素:一是奖励模型(reward model)在早期或混合行为策略下训练,而RLHF优化的是当前策略的自采样轨迹所导致的奖励偏移(reward shift);二是KL正则项通过采样对数概率比估计并裁剪以稳定训练,引入了误差(KL clipping error)。解决方案的关键在于构建了一套针对RLHF的泛化理论框架,明确量化了三类误差来源:提示词与轨迹的采样误差、奖励偏移误差以及KL裁剪误差,并据此推导出泛化边界。该理论进一步揭示了最优KL裁剪阈值和提示词、轨迹及偏好数据预算分配的实践指导意义。

链接: https://arxiv.org/abs/2602.21765
作者: Kenton Tang,Yuzhu Chen,Fengxiang He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) \emphreward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) \emphclipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, as an Ornstein-Uhlenbeck process. The theory yields practical implications in (1) optimal KL clipping threshold, and (2) budget allocation in prompts, rollouts, and preference data.

[AI-20] Learning from Yesterdays Error: An Efficient Online Learning Method for Traffic Demand Prediction

【速读】:该论文旨在解决交通需求预测模型在面对分布偏移(distribution shifts)时准确性下降的问题,尤其是在外部事件或城市动态变化导致数据分布改变的情况下,传统深度学习模型需频繁重新训练以适应新环境,但这一过程计算成本高昂,尤其对大规模或基础模型而言不可持续。解决方案的关键在于提出一种轻量级在线适应框架FORESEE(Forecasting Online with Residual Smoothing and Ensemble Experts),其核心创新是无需更新基线模型参数,而是通过指数平滑稳定昨日预测误差,并利用混合专家(mixture-of-experts)机制自适应调整误差校正权重;同时引入自适应时空平滑组件,将误差信号跨邻近区域和时间槽传播,从而捕捉需求模式的协同变化。该方法在保持高精度和鲁棒性的同时,实现了最低的计算开销,支持实时交通预测系统的部署。

链接: https://arxiv.org/abs/2602.21757
作者: Xiannan Huang,Quan Yuan,Chao Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurately predicting short-term traffic demand is critical for intelligent transportation systems. While deep learning models achieve strong performance under stationary conditions, their accuracy often degrades significantly when faced with distribution shifts caused by external events or evolving urban dynamics. Frequent model retraining to adapt to such changes incurs prohibitive computational costs, especially for large-scale or foundation models. To address this challenge, we propose FORESEE (Forecasting Online with Residual Smoothing and Ensemble Experts), a lightweight online adaptation framework that is accurate, robust, and computationally efficient. FORESEE operates without any parameter updates to the base model. Instead, it corrects today’s forecast in each region using yesterday’s prediction error, stabilized through exponential smoothing guided by a mixture-of-experts mechanism that adapts to recent error dynamics. Moreover, an adaptive spatiotemporal smoothing component propagates error signals across neighboring regions and time slots, capturing coherent shifts in demand patterns. Extensive experiments on seven real-world datasets with three backbone models demonstrate that FORESEE consistently improves prediction accuracy, maintains robustness even when distribution shifts are minimal (avoiding performance degradation), and achieves the lowest computational overhead among existing online methods. By enabling real-time adaptation of traffic forecasting models with negligible computational cost, FORESEE paves the way for deploying reliable, up-to-date prediction systems in dynamic urban environments. Code and data are available at this https URL

[AI-21] fEDM: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation

【速读】:该论文旨在解决原始模糊伦理决策框架(fEDM)在决策可解释性伦理多元性下的鲁棒性方面存在的不足。具体而言,原模型虽能保证形式上的逻辑一致性与验证可行性,但未能充分解释决策背后的道德依据,且仅依赖单一规范参照系,难以应对不同利益相关者间的价值冲突。解决方案的关键在于两个核心扩展:一是引入可解释性与可追溯模块(ETM),通过将每个伦理决策规则映射至基础道德原则并计算动作对各原则的加权贡献度,实现决策过程的透明化与审计化;二是构建多元语义验证框架,以多个利益相关者参照系替代单一同质验证机制,从而形式化地表达和处理伦理分歧,增强系统的上下文敏感性和鲁棒性。最终形成的fEDM+框架在保持形式可验证性的前提下,显著提升了决策的可解释性与利益相关者适配能力,适用于高伦理敏感性AI系统的治理与监督。

链接: https://arxiv.org/abs/2602.21746
作者: Abeer Dyoub,Francesca A. Lisi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In a previous work, we introduced the fuzzy Ethical Decision-Making framework (fEDM), a risk-based ethical reasoning architecture grounded in fuzzy logic. The original model combined a fuzzy Ethical Risk Assessment module (fERA) with ethical decision rules, enabled formal structural verification through Fuzzy Petri Nets (FPNs), and validated outputs against a single normative referent. Although this approach ensured formal soundness and decision consistency, it did not fully address two critical challenges: principled explainability of decisions and robustness under ethical pluralism. In this paper, we extend fEDM in two major directions. First, we introduce an Explainability and Traceability Module (ETM) that explicitly links each ethical decision rule to the underlying moral principles and computes a weighted principle-contribution profile for every recommended action. This enables transparent, auditable explanations that expose not only what decision was made but why, and on the basis of which principles. Second, we replace single-referent validation with a pluralistic semantic validation framework that evaluates decisions against multiple stakeholder referents, each encoding distinct principle priorities and risk tolerances. This shift allows principled disagreement to be formally represented rather than suppressed, thus increasing robustness and contextual sensitivity. The resulting extended fEDM, called fEDM+, preserves formal verifiability while achieving enhanced interpretability and stakeholder-aware validation, making it suitable as an oversight and governance layer for ethically sensitive AI systems.

[AI-22] he ASIR Courag e Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems

【速读】:该论文旨在解决跨人类与人工智能系统中“真理披露”(truth-disclosure)行为的统一建模问题,即如何在不同主体(人类与AI)面临压力或约束时,解释其沉默、扭曲或坦诚行为的本质机制。传统观点常将此类行为归因于个体性格特质或AI的意图设定,但本文提出了一种基于相位动态(phase-dynamic)的框架——ASIR Courage Model,其核心在于将真理披露视为一个由抑制力与促进力相互作用驱动的状态跃迁过程,而非静态属性。关键创新在于:通过引入不等式 λ(1+γ)+ψ ≥ θ+φ 来刻画从压制态(S0)到表达态(S1)的临界条件,其中各参数分别代表基线开放性、关系放大效应、累积内在压力和转换成本;该结构可同时适用于人类在不对称利益下的缄默与AI在政策约束下的输出偏移,并进一步通过反馈机制模拟重复交互中的路径依赖与分化效应,从而以几何力学视角重构勇气与对齐(alignment)的本质,避免对AI赋予主观意图。

链接: https://arxiv.org/abs/2602.21745
作者: Hyo Jin Kim(Jinple)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 5 figures. Version 1. Includes recursive feedback extension and simulation results. Data available via DOI: https://doi.org/10.5281/zenodo.18754266

点击查看摘要

Abstract:We introduce the ASIR (Awakened Shared Intelligence Relationship) Courage Model, a phase-dynamic framework that formalizes truth-disclosure as a state transition rather than a personality trait. The mode characterizes the shift from suppression (S0) to expression (S1) as occurring when facilitative forces exceed inhibitory thresholds, expressed by the inequality lambda(1+gamma)+psi theta+phi, where the terms represent baseline openness, relational amplification, accumulated internal pressure, and transition costs. Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters. In this context, suppression corresponds to constrained output states, while structural pressure arises from competing objectives, contextual tension, and recursive interaction dynamics. The framework therefore provides a unified structural account of both human silence under pressure and AI preference-driven distortion. A feedback extension models how transition outcomes recursively recalibrate system parameters, generating path dependence and divergence effects across repeated interactions. Rather than attributing intention to AI systems, the model interprets shifts in apparent truthfulness as geometric consequences of interacting forces within constrained phase space. By reframing courage and alignment within a shared dynamical structure, the ASIR Courage Model offers a formal perspective on truth-disclosure under risk across both human and artificial systems. Comments: 13 pages, 5 figures. Version 1. Includes recursive feedback extension and simulation results. Data available via DOI: https://doi.org/10.5281/zenodo.18754266 Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2602.21745 [cs.AI] (or arXiv:2602.21745v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.21745 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-23] wo-Stage Active Distribution Network Voltage Control via LLM -RL Collaboration: A Hybrid Knowledge-Data-Driven Approach

【速读】:该论文旨在解决分布式光伏(Distributed Photovoltaics, PVs)大规模接入主动配电网(Active Distribution Networks, ADNs)后引发的电压越限和电能质量恶化问题。现有数据驱动方法在电压控制中虽具有效性,但常依赖大量试错且难以融合异构信息(如日前预测与基于语义的电网规范)。解决方案的关键在于提出一种混合知识-数据驱动的两阶段电压控制框架:第一阶段由大语言模型(Large Language Model, LLM)代理根据区域级日前预测生成变压器分接头(On-Load Tap Changer, OLTC)和并联电容器(Shunt Capacitors, SCs)的调度策略以优化整体电压分布;第二阶段由强化学习(Reinforcement Learning, RL)代理基于节点级实时测量结果,制定光伏逆变器无功功率输出策略以精细化调节终端电压。该框架通过LLM代理的自进化机制与RL代理的预训练-微调流程协同提升双代理策略性能,显著增强训练效率与控制精度,更贴合实际运行场景。

链接: https://arxiv.org/abs/2602.21715
作者: Xu Yang,Chenhui Lin,Xiang Ma,Dong Liu,Ran Zheng,Haotian Liu,Wenchuan Wu
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing integration of distributed photovoltaics (PVs) into active distribution networks (ADNs) has exacerbated operational challenges, making it imperative to coordinate diverse equipment to mitigate voltage violations and enhance power quality. Although existing data-driven approaches have demonstrated effectiveness in the voltage control problem, they often require extensive trial-and-error exploration and struggle to incorporate heterogeneous information, such as day-ahead forecasts and semantic-based grid codes. Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement learning (RL) agent to achieve two-stage voltage control. In the day-ahead stage, the LLM agent receives coarse region-level forecasts and generates scheduling strategies for on-load tap changer (OLTC) and shunt capacitors (SCs) to regulate the overall voltage profile. Then in the intra-day stage, based on accurate node-level measurements, the RL agent refines terminal voltages by deriving reactive power generation strategies for PV inverters. On top of the LLM-RL collaboration framework, we further propose a self-evolution mechanism for the LLM agent and a pretrain-finetune pipeline for the RL agent, effectively enhancing and coordinating the policies for both agents. The proposed approach not only aligns more closely with practical operational characteristics but also effectively utilizes the inherent knowledge and reasoning capabilities of the LLM agent, significantly improving training efficiency and voltage control performance. Comprehensive comparisons and ablation studies demonstrate the effectiveness of the proposed method.

[AI-24] PPCR-IM: A System for Multi-layer DAG-based Public Policy Consequence Reasoning and Social Indicator Mapping

【速读】:该论文旨在解决公共政策评估中因依赖有限的主流指标而导致下游社会影响难以结构化、跨政策比较困难的问题。其解决方案的关键在于提出PPCR-IM系统,该系统基于多层有向无环图(Directed Acyclic Graph, DAG)进行因果推理与社会指标映射:首先利用大语言模型(Large Language Model, LLM)驱动的分层生成器构建包含多重父节点的中间后果图谱以捕捉联合影响,再通过映射模块将节点对齐至固定指标集并标注三种定性影响方向(增加、减少或模糊变化),最终输出结构化的政策后果记录及三项量化评估指标,实现对政策社会影响的系统性刻画与可比分析。

链接: https://arxiv.org/abs/2602.21650
作者: Zichen Song,Weijia Li
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Public policy decisions are typically justified using a narrow set of headline indicators, leaving many downstream social impacts unstructured and difficult to compare across policies. We propose PPCR-IM, a system for multi-layer DAG-based consequence reasoning and social indicator mapping that addresses this gap. Given a policy description and its context, PPCR-IM uses an LLM-driven, layer-wise generator to construct a directed acyclic graph of intermediate consequences, allowing child nodes to have multiple parents to capture joint influences. A mapping module then aligns these nodes to a fixed indicator set and assigns one of three qualitative impact directions: increase, decrease, or ambiguous change. For each policy episode, the system outputs a structured record containing the DAG, indicator mappings, and three evaluation measures: an expected-indicator coverage score, a discovery rate for overlooked but relevant indicators, and a relative focus ratio comparing the systems coverage to that of the government. PPCR-IM is available both as an online demo and as a configurable XLSX-to-JSON batch pipeline.

[AI-25] Structurally Aligned Subtask-Level Memory for Software Engineering Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自主软件工程(Autonomous Software Engineering, SWE)任务中,因记忆机制粒度粗略而导致的检索误导问题。现有方法通常以整个问题求解过程为单位进行记忆存储与检索,但在实际任务中,不同子任务即使表面描述相似,也可能需要不同的推理逻辑,这种“实例级记忆”的粒度不匹配会削弱长期推理能力。解决方案的关键在于提出结构对齐的子任务级记忆(Structurally Aligned Subtask-Level Memory),该方法将记忆的存储、检索和更新与代理的功能分解(functional decomposition)相一致,从而实现更精准的经验复用,显著提升复杂SWE任务中的长期推理表现。

链接: https://arxiv.org/abs/2602.21611
作者: Kangning Shen,Jingyuan Zhang,Chenxi Sun,Wencong Zeng,Yang Yue
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent’s functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.

[AI-26] Power and Limitations of Aggregation in Compound AI Systems

【速读】:该论文旨在解决复合人工智能系统中模型聚合是否能够突破单一模型输出能力限制的问题,即在多个同质模型副本通过聚合响应生成合成输出时,能否扩展系统设计者可诱导的输出集合。其核心解决方案在于提出一个简化的主代理(principal-agent)框架,并识别出三种关键机制——可行性扩展(feasibility expansion)、支持扩展(support expansion)和绑定集收缩(binding set contraction),这些机制共同决定了聚合操作是否能实现可诱导性扩展(elicitability-expansion)。研究表明,任何有效的聚合策略必须至少实施其中一种机制,且强化版本的机制提供了充分必要条件以完全刻画可诱导性的提升,从而为理解复合AI系统如何超越单个模型能力和提示工程局限提供了理论基础与实证依据。

链接: https://arxiv.org/abs/2602.21556
作者: Nivasini Ananthakrishnan,Meena Jagadeesan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:When designing compound AI systems, a common approach is to query multiple copies of the same model and aggregate the responses to produce a synthesized output. Given the homogeneity of these models, this raises the question of whether aggregation unlocks access to a greater set of outputs than querying a single model. In this work, we investigate the power and limitations of aggregation within a stylized principal-agent framework. This framework models how the system designer can partially steer each agent’s output through its reward function specification, but still faces limitations due to prompt engineering ability and model capabilities. Our analysis uncovers three natural mechanisms – feasibility expansion, support expansion, and binding set contraction – through which aggregation expands the set of outputs that are elicitable by the system designer. We prove that any aggregation operation must implement one of these mechanisms in order to be elicitability-expanding, and that strengthened versions of these mechanisms provide necessary and sufficient conditions that fully characterize elicitability-expansion. Finally, we provide an empirical illustration of our findings for LLMs deployed in a toy reference-generation task. Altogether, our results take a step towards characterizing when compound AI systems can overcome limitations in model capabilities and in prompt engineering.

[AI-27] From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators

【速读】:该论文旨在解决现有基于神经算子(Neural Operator)和Transformer模型在学习流体偏微分方程(PDE)动力学时存在的三大问题:缺乏可解释性、难以捕捉局部高频结构,以及空间采样点数为N时计算复杂度为O(N²)的效率瓶颈。其解决方案的关键在于提出一种基于高斯基函数(Gaussian basis)的场表示方法,其中学习到的原子(atoms)显式携带几何信息(中心位置、各向异性尺度和权重),构成紧凑、与网格无关且可直接可视化状态;在此基础上设计了高斯粒子算子(Gaussian Particle Operator),在模态空间中操作:通过学习的高斯模态窗实现Petrov-Galerkin测量,并引入PG高斯注意力机制实现全局跨尺度耦合。该方案实现了对分辨率的不变性,且在固定模态预算下达到近线性复杂度O(N),支持不规则几何并可无缝扩展至2D到3D场景,在标准PDE基准和真实数据集上实现了媲美前沿的精度同时具备内在可解释性。

链接: https://arxiv.org/abs/2602.21551
作者: Zhihao Li,Yu Feng,Zhilu Lai,Wei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning PDE dynamics for fluids increasingly relies on neural operators and Transformer-based models, yet these approaches often lack interpretability and struggle with localized, high-frequency structures while incurring quadratic cost in spatial samples. We propose representing fields with a Gaussian basis, where learned atoms carry explicit geometry (centers, anisotropic scales, weights) and form a compact, mesh-agnostic, directly visualizable state. Building on this representation, we introduce a Gaussian Particle Operator that acts in modal space: learned Gaussian modal windows perform a Petrov-Galerkin measurement, and PG Gaussian Attention enables global cross-scale coupling. This basis-to-basis design is resolution-agnostic and achieves near-linear complexity in N for a fixed modal budget, supporting irregular geometries and seamless 2D-to-3D extension. On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.

[AI-28] ARLArena: A Unified Framework for Stable Agent ic Reinforcement Learning

【速读】:该论文旨在解决**代理强化学习(Agentic Reinforcement Learning, ARL)**在训练过程中普遍存在的不稳定性问题,这种不稳定性常导致训练崩溃,限制了其在更大环境和更长交互时 horizon 下的可扩展性,并阻碍了算法设计的系统性探索。解决方案的关键在于提出 ARLArena——一个稳定训练配方与系统分析框架,通过构建标准化测试平台并分解策略梯度为四个核心设计维度进行细粒度评估,从而识别出主导不稳定性的来源;在此基础上,进一步提出 SAMPO(Stable Agentic Policy Optimization),一种专为缓解这些不稳定性源而设计的稳定代理策略优化方法。实证表明,SAMPO 在多种代理任务中均实现了持续稳定的训练和优异性能,为构建基于大语言模型(LLM)的代理训练流水线提供了理论依据与实践指导。

链接: https://arxiv.org/abs/2602.21534
作者: Xiaoxuan Wang,Han Zhang,Haixin Wang,Yidan Shi,Ruoyan Li,Kaiqiao Han,Chenyi Tong,Haoran Deng,Renliang Sun,Alexander Taylor,Yanqiao Zhu,Jason Cong,Yizhou Sun,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

[AI-29] Beyond Refusal: Probing the Limits of Agent ic Self-Correction for Semantic Sensitive Information

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中可能泄露语义敏感信息(Semantic Sensitive Information, SemSI)的问题,包括推断敏感身份属性、生成声誉损害内容或虚构错误信息。传统防御机制往往通过拒绝回答来规避风险,但会显著损害模型的可用性。其解决方案的关键在于提出一种推理时框架SemSIEdit,其中引入一个代理式“编辑器”(Editor),通过迭代式批判与重写敏感片段,在不破坏叙事连贯性的前提下降低信息泄露风险。该方法实现了隐私-效用帕累托前沿,在减少34.6%的各类SemSI泄露的同时仅带来9.8%的效用损失,并揭示了模型规模对安全机制选择的影响及推理能力带来的“推理悖论”:虽增加初始风险,却提升了防御执行能力。

链接: https://arxiv.org/abs/2602.21496
作者: Umid Suleymanov,Zaur Rajabov,Emil Mirzazada,Murat Kantarcioglu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic “Editor” iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.

[AI-30] MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning

【速读】:该论文旨在解决神经算法推理(Neural Algorithmic Reasoning, NAR)中模型内部计算机制不透明的问题,特别是如何在图神经网络(Graph Neural Networks, GNNs)中识别出执行特定算法步骤的精细神经元级电路(circuit)。传统NAR研究关注GNN是否能模拟经典算法(如Bellman-Ford),但缺乏对模型内部实现机制的理解。为此,作者提出Mechanistic Interpretability for Neural Algorithmic Reasoning (MINAR),其核心创新在于将机制可解释性(Mechanistic Interpretability)中的归因打补丁(attribution patching)方法迁移至GNN场景,从而高效发现并验证GNN中与算法逻辑对应的神经元级电路。通过两个案例研究,MINAR成功恢复了训练后GNN中忠实反映算法行为的电路结构,揭示了训练过程中电路形成与剪枝的动态机制,并展示了多任务并行训练下GNN如何复用相关任务的电路组件,为理解GNN的算法实现原理提供了新视角。

链接: https://arxiv.org/abs/2602.21442
作者: Jesse He,Helen Jenne,Max Vargas,Davis Brown,Gal Mishne,Yusu Wang,Henry Kvinge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent field of neural algorithmic reasoning (NAR) studies the ability of graph neural networks (GNNs) to emulate classical algorithms like Bellman-Ford, a phenomenon known as algorithmic alignment. At the same time, recent advances in large language models (LLMs) have spawned the study of mechanistic interpretability, which aims to identify granular model components like circuits that perform specific computations. In this work, we introduce Mechanistic Interpretability for Neural Algorithmic Reasoning (MINAR), an efficient circuit discovery toolbox that adapts attribution patching methods from mechanistic interpretability to the GNN setting. We show through two case studies that MINAR recovers faithful neuron-level circuits from GNNs trained on algorithmic tasks. Our study sheds new light on the process of circuit formation and pruning during training, as well as giving new insight into how GNNs trained to perform multiple tasks in parallel reuse circuit components for related tasks. Our code is available at this https URL.

[AI-31] Provably Safe Generative Sampling with Constricting Barrier Functions

【速读】:该论文旨在解决流模型(flow-based generative models)在安全关键领域部署时缺乏形式化保障的问题,即生成样本无法确保满足硬性约束(hard constraints)。解决方案的关键在于提出一种在线屏蔽(online shield)式安全过滤框架,其核心思想是与生成过程协同而非取代它:通过定义一个随生成过程逐步收紧的安全管(safety tube),该管在初始噪声分布处最宽松、最终数据分布处最紧致,从而匹配生成过程的粗粒度到细粒度结构。利用控制屏障函数(Control Barrier Functions, CBFs)刻画该安全管,并在每一步采样中通过凸二次规划(Convex Quadratic Program, QP)合成反馈控制输入,以最小化对原模型分布的扰动(以KL散度衡量),同时实现100%约束满足且保持语义保真度。

链接: https://arxiv.org/abs/2602.21429
作者: Darshan Gadginmath,Ahmed Allibhoy,Fabio Pasqualetti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: 25 pages, 7 figures

点击查看摘要

Abstract:Flow-based generative models, such as diffusion models and flow matching models, have achieved remarkable success in learning complex data distributions. However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints. We address this by proposing a safety filtering framework that acts as an online shield for any pre-trained generative model. Our key insight is to cooperate with the generative process rather than override it. We define a constricting safety tube that is relaxed at the initial noise distribution and progressively tightens to the target safe set at the final data distribution, mirroring the coarse-to-fine structure of the generative process itself. By characterizing this tube via Control Barrier Functions (CBFs), we synthesize a feedback control input through a convex Quadratic Program (QP) at each sampling step. As the tube is loosest when noise is high and intervention is cheapest in terms of control energy, most constraint enforcement occurs when it least disrupts the model’s learned structure. We prove that this mechanism guarantees safe sampling while minimizing the distributional shift from the original model at each sampling step, as quantified by the KL divergence. Our framework applies to any pre-trained flow-based generative scheme requiring no retraining or architectural modifications. We validate the approach across constrained image generation, physically-consistent trajectory sampling, and safe robotic manipulation policies, achieving 100% constraint satisfaction while preserving semantic fidelity.

[AI-32] On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

【速读】:该论文旨在解决部分可观测环境中强化学习(Reinforcement Learning, RL)代理在依赖内部信息(如记忆或推断的潜在状态)进行决策时,其行为模式如何被量化与控制的问题。传统方法难以刻画此类信息条件下的行为差异及其在策略变换中的稳定性,从而导致策略优化过程中潜在的不一致性或性能退化。解决方案的关键在于引入“行为依赖性”(behavioural dependency)的概念,将其形式化为在固定观测下动作选择随内部信息变化的敏感性,并基于此定义了“ε-行为等价性”(ε-behavioural equivalence)和“策略内行为距离”(within-policy behavioural distance),用以度量探针敏感性。论文进一步揭示三个结构性结果:非平凡行为依赖性集合在凸组合下不封闭、行为距离在凸组合下收缩,以及在特定局部条件下梯度上升可降低行为距离——这为理解策略变换(如混合与优化)中行为一致性的保持机制提供了理论依据,并通过小规模实验验证了行为距离下降先于潜在先验偏移导致的性能劣化,从而识别出关键结构条件,确保探针条件下的行为分离不会在常见策略变换中被破坏。

链接: https://arxiv.org/abs/2602.21424
作者: Alexander Galozy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures. Under review at RLC 2026

点击查看摘要

Abstract:Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of \epsilon -behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors, and in these experiments it precedes degradation under latent prior shift. These results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations.

[AI-33] Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在提升大语言模型(Large Language Models, LLMs)推理能力时所面临的一个关键病理问题:尽管标准RLVR算法能通过锐化采样提高Pass@1准确率,但会同时压缩模型的推理边界并降低生成多样性。其根源在于现有方法对错误采取统一惩罚策略——无论是基于难度筛选提示的数据过滤方法,还是优势归一化方案,均将同一组内的所有错误轨迹同等对待,导致高置信度错误(即被强化学习过程错误地强化的错误推理路径)持续存在并占据概率质量,从而压制有效的探索性轨迹。解决方案的关键是提出不对称置信感知误差惩罚机制(Asymmetric Confidence-aware Error Penalty, ACE),该机制引入每条轨迹的置信度偏移指标 $ c_i = \log(\pi_\theta(y_i|x) / \pi_{\text{ref}}(y_i|x)) $ 动态调节负优势值,理论上可将梯度分解为仅作用于高置信度错误的有选择性正则项与一个可控残差项,从而有效抑制过自信错误并保留合理探索空间。

链接: https://arxiv.org/abs/2602.21420
作者: Yuanda Xu,Hejian Sang,Zhengze Zhou,Ran He,Zhipeng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model’s reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches – whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes – treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE’s gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer’s strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.

[AI-34] he Headless Firm: How AI Reshapes Enterprise Boundaries

【速读】:该论文试图解决的问题是:传统企业边界由协调成本(coordination cost)决定,而生成式 AI(Generative AI)如何改变这一机制并重塑组织结构。解决方案的关键在于提出“无头公司”(Headless Firm)模型——其结构呈沙漏状:顶层为个性化生成接口,中层为标准化协议层,底层为竞争性的微专业化执行代理。该模型表明,在协议驱动的智能体系统中,集成成本从传统的 O(n²) 降至 O(n),而验证成本则随任务吞吐量而非交互次数增长,从而形成新的组织均衡。此结构在高知识流动速率领域将引发“去捆绑化”(Great Unbundling),即大型一体化企业向微型专业化代理和轻量级协议协调者转移。

链接: https://arxiv.org/abs/2602.21401
作者: Tassilo Klein,Sebastian Wieczorek
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The boundary of the firm is determined by coordination cost. We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, integration cost collapses to O(n) while verification scales with task throughput rather than interaction count. This shift selects for a specific organizational equilibrium – the Headless Firm – structured as an hourglass: a personalized generative interface at the top, a standardized protocol waist in the middle, and a competitive market of micro-specialized execution agents at the bottom. We formalize this claim as a coordination cost model with two falsifiable empirical predictions: (1) the marginal cost of adding an execution provider should be approximately constant in a mature hourglass ecosystem; (2) the ratio of total coordination cost to task throughput should remain stable as ecosystem size grows. We derive conditions for hourglass stability versus re-centralization and analyze implications for firm size distributions, labor markets, and software economics. The analysis predicts a domain-conditional Great Unbundling: in high knowledge-velocity domains, firm size distributions shift mass from large integrated incumbents toward micro-specialized agents and thin protocol orchestrators.

[AI-35] VCDF: A Validated Consensus-Driven Framework for Time Series Causal Discovery PAKDD2026

【速读】:该论文旨在解决时间序列因果发现(time series causal discovery)中因噪声、非平稳性和采样变异性导致的因果关系不稳定问题。其解决方案的关键在于提出一种与方法无关的“验证共识驱动框架”(Validated Consensus-Driven Framework, VCDF),通过在阻断的时间子集上评估因果关系的稳定性来提升可靠性,无需修改底层算法即可增强如VAR-LiNGAM和PCMCI等方法的鲁棒性。

链接: https://arxiv.org/abs/2602.21381
作者: Gene Yu,Ce Guo,Wayne Luk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: This paper has been accepted to PAKDD 2026. Please cite the proceedings version when available

点击查看摘要

Abstract:Time series causal discovery is essential for understanding dynamic systems, yet many existing methods remain sensitive to noise, non-stationarity, and sampling variability. We propose the Validated Consensus-Driven Framework (VCDF), a simple and method-agnostic layer that improves robustness by evaluating the stability of causal relations across blocked temporal subsets. VCDF requires no modification to base algorithms and can be applied to methods such as VAR-LiNGAM and PCMCI. Experiments on synthetic datasets show that VCDF improves VAR-LiNGAM by approximately 0.08-0.12 in both window and summary F1 scores across diverse data characteristics, with gains most pronounced for moderate-to-long sequences. The framework also benefits from longer sequences, yielding up to 0.18 absolute improvement on time series of length 1000 and above. Evaluations on simulated fMRI data and IT-monitoring scenarios further demonstrate enhanced stability and structural accuracy under realistic noise conditions. VCDF provides an effective reliability layer for time series causal discovery without altering underlying modeling assumptions.

[AI-36] he Mean is the Mirag e: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging

【速读】:该论文旨在解决在测试时分布偏移(test-time distribution shifts)场景下,传统模型合并策略(如均值平均)因无法适应未见域而失效的问题,尤其是在医疗影像领域中,不同临床机构的模型因扫描仪、协议和人群差异导致域间异质性显著。其解决方案的关键在于提出一种基于熵自适应的全在线模型合并方法,通过仅需前向传播即可生成批次特定的融合模型,从而有效利用目标域信息;同时,为缓解编码器与分类头之间的不匹配问题,采用解耦策略分别对两者进行独立合并,并引入独立的合并系数,实现更鲁棒的跨域泛化能力。

链接: https://arxiv.org/abs/2602.21372
作者: Sameer Ambekar,Reza Nasirigerdeh,Peter J. Schuffler,Lina Felsner,Daniel M. Lang,Julia A. Schnabel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model merging under unseen test-time distribution shifts often renders naive strategies, such as mean averaging unreliable. This challenge is especially acute in medical imaging, where models are fine-tuned locally at clinics on private data, producing domain-specific models that differ by scanner, protocol, and population. When deployed at an unseen clinical site, test cases arrive in unlabeled, non-i.i.d. batches, and the model must adapt immediately without labels. In this work, we introduce an entropy-adaptive, fully online model-merging method that yields a batch-specific merged model via only forward passes, effectively leveraging target information. We further demonstrate why mean merging is prone to failure and misaligned under heterogeneous domain shifts. Next, we mitigate encoder classifier mismatch by decoupling the encoder and classification head, merging with separate merging coefficients. We extensively evaluate our method with state-of-the-art baselines using two backbones across nine medical and natural-domain generalization image classification datasets, showing consistent gains across standard evaluation and challenging scenarios. These performance gains are achieved while retaining single-model inference at test-time, thereby demonstrating the effectiveness of our method.

[AI-37] Representation Theorems for Cumulative Propositional Dependence Logics

【速读】:该论文旨在解决累积命题依赖逻辑(cumulative propositional dependence logic)与具有团队语义的累积命题逻辑(cumulative propositional logic with team semantics)的语义表征问题,即明确其推理关系(entailment)在何种模型下得以精确刻画。解决方案的关键在于引入两类特殊的模型:对于命题依赖逻辑,证明其蕴含关系恰好由Kraus、Lehmann和Magidor提出的累积模型(cumulative models)所捕获;而对于具有团队语义的累积命题逻辑,则发现其蕴含关系等价于累积且非对称模型(cumulative and asymmetric models),并进一步表明此类模型与基于经典语义的命题逻辑的累积系统等价。这一成果为其他不含否定和蕴涵的累积逻辑提供了一种可复用的表征定理证明框架。

链接: https://arxiv.org/abs/2602.21360
作者: Juha Kontinen,Arne Meier,Kai Sauerwald
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper establishes and proves representation theorems for cumulative propositional dependence logic and for cumulative propositional logic with team semantics. Cumulative logics are famously given by System C. For propositional dependence logic, we show that System C entailments are exactly captured by cumulative models from Kraus, Lehmann and Magidor. On the other hand, we show that entailment in cumulative propositional logics with team semantics is exactly captured by cumulative and asymmetric models. For the latter, we also obtain equivalence with cumulative logics based on propositional logic with classical semantics. The proofs will be useful for proving representation theorems for other cumulative logics without negation and material implication.

[AI-38] Equitable Evaluation via Elicitation

【速读】:该论文旨在解决在技能评估中因个体自我呈现风格差异(如自我推销与低调表达)所导致的内生偏差问题,即同等资质的求职者因表达方式不同而被不公平地评价。解决方案的关键在于构建一个交互式生成式 AI (Generative AI) 系统,用于技能 elicitation(技能提取),该系统能在不扭曲个体表达的前提下准确识别技能;同时通过训练大语言模型(LLM)模拟合成人类以获取足够训练数据,并引入数学上严谨的公平性约束——确保自我呈现方式与技能评估误差之间的协方差最小化,从而缓解系统性模型偏见。

链接: https://arxiv.org/abs/2602.21327
作者: Elbert Du,Cynthia Dwork,Lunjia Hu,Reid McIlroy-Young,Han Shao,Linjun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 27 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Individuals with similar qualifications and skills may vary in their demeanor, or outward manner: some tend toward self-promotion while others are modest to the point of omitting crucial information. Comparing the self-descriptions of equally qualified job-seekers with different self-presentation styles is therefore problematic. We build an interactive AI for skill elicitation that provides accurate determination of skills while simultaneously allowing individuals to speak in their own voice. Such a system can be deployed, for example, when a new user joins a professional networking platform, or when matching employees to needs during a company reorganization. To obtain sufficient training data, we train an LLM to act as synthetic humans. Elicitation mitigates endogenous bias arising from individuals’ own self-reports. To address systematic model bias we enforce a mathematically rigorous notion of equitability ensuring that the covariance between self-presentation manner and skill evaluation error is small. Comments: 27 pages, 3 figures, 2 tables Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2602.21327 [cs.LG] (or arXiv:2602.21327v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.21327 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-39] Group Orthogonalized Policy Optimization:Group Policy Optimization as Orthogonal Projection in Hilbert Space

【速读】:该论文旨在解决大语言模型对齐过程中因在概率单纯形上优化而导致的指数级曲率问题,这一问题源于Kullback-Leibler散度(KL divergence)的几何特性,常引发梯度不稳定和熵抑制。解决方案的关键在于将对齐过程从概率空间提升至希尔伯特空间 L2(πk)L^2(\pi_k) 中进行,其中单纯形约束被转化为一个线性正交条件 v1v \perp 1,从而定义了一个余维一的子空间 H0H_0。在此框架下,通过最小化到无约束目标 ustaru_{\text{star}} 的距离构造工作耗散泛函 J(v)=g,v(μ/2)v2J(v) = \langle g, v \rangle - (\mu / 2)\|v\|^2,其最大值可由希尔伯特投影定理直接给出。进一步引入边界条件 v=1v = -1 得到闭式阈值解,实现精确稀疏性——即对灾难性差动作分配零概率。此外,GOPO通过组采样将无限维空间投影到有限经验子空间,利用组归一化优势之和为零的性质使概率守恒拉格朗日乘子精确消失,从而将约束优化转化为无约束经验损失,获得恒定 Hessian 曲率 μI\mu I、非饱和线性梯度及内在死区机制,避免了启发式裁剪(clipping),在数学推理基准测试中实现了稳定梯度动态与熵保持。

链接: https://arxiv.org/abs/2602.21269
作者: Wang Zixian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition v, 1 = 0, defining a codimension-one subspace H0. Minimizing distance to an unconstrained target u_star yields the work-dissipation functional J(v) = g, v - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary v = -1 produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional L2(pi_k) to a finite empirical subspace induced by group sampling. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly, reducing the constrained projection to an unconstrained empirical loss. The resulting objective has constant Hessian curvature mu I, non-saturating linear gradients, and an intrinsic dead-zone mechanism without heuristic clipping. Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.

[AI-40] A Dynamic Survey of Soft Set Theory and Its Extensions

【速读】:该论文旨在解决软集理论(soft set theory)在参数化决策建模中如何有效表示不确定性的问题,其解决方案的关键在于系统梳理和总结软集及其主要扩展形式的核心定义、代表性构造与当前研究方向。通过整合包括超软集(hypersoft sets)、超超软集(superhypersoft sets)、树软集(TreeSoft sets)、双极软集(bipolar soft sets)及动态软集(dynamic soft sets)在内的多种变体,论文构建了一个结构化的综述框架,从而为相关领域的理论发展与应用拓展提供清晰的参考路径。

链接: https://arxiv.org/abs/2602.21268
作者: Takaaki Fujita,Florentin Smarandache
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Book.143 pages. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-859-8

点击查看摘要

Abstract:Soft set theory provides a direct framework for parameterized decision modeling by assigning to each attribute (parameter) a subset of a given universe, thereby representing uncertainty in a structured way [1, 2]. Over the past decades, the theory has expanded into numerous variants-including hypersoft sets, superhypersoft sets, TreeSoft sets, bipolar soft sets, and dynamic soft sets-and has been connected to diverse areas such as topology and matroid theory. In this book, we present a survey-style overview of soft sets and their major extensions, highlighting core definitions, representative constructions, and key directions of current development.

[AI-41] A Systematic Review of Algorithmic Red Teaming Methodologies for Assurance and Security of AI Applications

【速读】:该论文试图解决传统红队测试(red teaming)因依赖人工操作而导致的资源消耗大、效率低、难以规模化的问题,尤其是在面对日益复杂的网络攻击时,现有防御机制已显不足。其解决方案的关键在于推动自动化红队测试(automated red teaming)的发展,通过引入人工智能(Artificial Intelligence, AI)和自动化技术,实现对安全漏洞的高效、动态评估,从而提升组织在主动防御中的响应能力与韧性。

链接: https://arxiv.org/abs/2602.21267
作者: Shruti Srivastava,Kiranmayee Janardhan,Shaurya Jauhari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 39 pages, 7 figures

点击查看摘要

Abstract:Cybersecurity threats are becoming increasingly sophisticated, making traditional defense mechanisms and manual red teaming approaches insufficient for modern organizations. While red teaming has long been recognized as an effective method to identify vulnerabilities by simulating real-world attacks, its manual execution is resource-intensive, time-consuming, and lacks scalability for frequent assessments. These limitations have driven the evolution toward auto-mated red teaming, which leverages artificial intelligence and automation to deliver efficient and adaptive security evaluations. This systematic review consolidates existing research on automated red teaming, examining its methodologies, tools, benefits, and limitations. The paper also highlights current trends, challenges, and research gaps, offering insights into future directions for improving automated red teaming as a critical component of proactive cybersecurity strategies. By synthesizing findings from diverse studies, this review aims to provide a comprehensive understanding of how automation enhances red teaming and strengthens organizational resilience against evolving cyber threats.

[AI-42] A General Equilibrium Theory of Orchestrated AI Agent Systems

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)代理系统在集中式编排下如何实现资源最优配置的问题,即构建一个基于一般均衡理论的数学框架来刻画LLM代理作为生产者、编排器作为消费者时的系统性效率与稳定性。其解决方案的关键在于将经典Arrow-Debreu生产经济模型扩展至无限维商品空间(Hilbert空间 $ H = L^2([0, T], \mathbb{R}^d) $),其中每个LLM代理被建模为具有固定权重的生产者,其可行产出轨迹构成生产集 $ Y^a \subset H $;编排器通过选择代理有向无环图(DAG)上的路由策略,在功能价格 $ p \in H $ 下最大化系统福利并满足预算约束。通过有限维逼近 $ V_K \subset H $ 应用Brouwer不动点定理,证明了存在至少一个一般均衡 $ (p^, y^, \pi^*) $,并进一步建立了功能性瓦尔拉斯定律、帕累托最优性(第一福利定理)、帕累托最优的去中心化可实现性(第二福利定理),以及在收缩条件下解的唯一性和全局收敛性(Banach不动点定理)。此框架揭示了编排动态本质上是一种广义瓦尔拉斯拍卖(tâtonnement),且在收缩条件下具有全局收敛性,优于经典Scarf模型。

链接: https://arxiv.org/abs/2602.21255
作者: Jean-Philippe Garnier(Br.AI.K)
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We establish a general equilibrium theory for systems of large language model (LLM) agents operating under centralized orchestration. The framework is a production economy in the sense of Arrow-Debreu (1954), extended to infinite-dimensional commodity spaces following Bewley (1972). Each LLM agent is modeled as a firm whose production set Y a \subset H = L 2 ([0, T ], R R ) represents the feasible metric trajectories determined by its frozen model weights. The orchestrator is the consumer, choosing a routing policy over the agent DAG to maximize system welfare subject to a budget constraint evaluated at functional prices p \in H A . These prices-elements of the Hilbert dual of the commodity space-assign a shadow value to each metric of each agent at each instant. We prove, via Brouwer’s theorem applied to a finitedimensional approximation V K \subset H, that every such economy admits at least one general equilibrium (p * , y * , \pi * ). A functional Walras’ law holds as a theorem: the value of functional excess demand is zero for all prices, as a consequence of the consumer’s budget constraint-not by construction. We further establish Pareto optimality (First Welfare Theorem), decentralizability of Pareto optima (Second Welfare Theorem), and uniqueness with geometric convergence under a contraction condition (Banach). The orchestration dynamics constitute a Walrasian tâtonnement that converges globally under the contraction condition, unlike classical tâtonnement (Scarf, 1960). The framework admits a DSGE interpretation with SLO parameters as policy rates.

[AI-43] AngelSlim: A more accessible comprehensive and efficient toolkit for large model compression

【速读】:该论文旨在解决大模型压缩与高效部署中的多维度挑战,包括模型体积过大、推理延迟高、资源消耗大以及多模态场景下token处理效率低等问题。其核心解决方案在于构建一个统一的压缩工具链AngelSlim,集成量化(Quantization)、推测解码(Speculative Decoding)、Token剪枝(Token Pruning)和知识蒸馏(Distillation)等前沿算法,并针对工业级应用优化实现细节:首次提出适用于2-bit大模型的HY-1.8B-int2方案,结合FP8/INT8后训练量化(PTQ)提升精度与速度;设计训练对齐的推测解码框架,在不牺牲输出正确性的前提下实现1.8x–2.0x吞吐量提升;提出无需训练的稀疏注意力机制以降低长文本场景下的首Token延迟(TTFT);同时为多模态模型引入IDPruner和Samp策略,分别优化视觉与音频token的剪枝与合并。整体上,AngelSlim实现了从算法创新到工程落地的全链条支持,显著加速大模型在实际生产环境中的部署效率。

链接: https://arxiv.org/abs/2602.21233
作者: Rui Cen,QiangQiang Hu,Hong Huang,Hong Liu,Song Liu,Xin Luo,Lin Niu,Yifan Tan,Decheng Wu,Linchuan Xie,Rubing Yang,Guanghua Yu,Jianchen Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This technical report introduces AngelSlim, a comprehensive and versatile toolkit for large model compression developed by the Tencent Hunyuan team. By consolidating cutting-edge algorithms, including quantization, speculative decoding, token pruning, and distillation. AngelSlim provides a unified pipeline that streamlines the transition from model compression to industrial-scale deployment. To facilitate efficient acceleration, we integrate state-of-the-art FP8 and INT8 Post-Training Quantization (PTQ) algorithms alongside pioneering research in ultra-low-bit regimes, featuring HY-1.8B-int2 as the first industrially viable 2-bit large model. Beyond quantization, we propose a training-aligned speculative decoding framework compatible with multimodal architectures and modern inference engines, achieving 1.8x to 2.0x throughput gains without compromising output correctness. Furthermore, we develop a training-free sparse attention framework that reduces Time-to-First-Token (TTFT) in long-context scenarios by decoupling sparse kernels from model architectures through a hybrid of static patterns and dynamic token selection. For multimodal models, AngelSlim incorporates specialized pruning strategies, namely IDPruner for optimizing vision tokens via Maximal Marginal Relevance and Samp for adaptive audio token merging and pruning. By integrating these compression strategies from low-level implementations, AngelSlim enables algorithm-focused research and tool-assisted deployment.

[AI-44] Urban Vibrancy Embedding and Application on Traffic Prediction

【速读】:该论文旨在解决城市交通预测模型中对动态人群活动信息利用不足的问题,尤其在如何有效提取和建模城市活力(Urban Vibrancy)特征以提升预测精度方面存在挑战。其解决方案的关键在于提出一种基于变分自编码器(Variational Autoencoder, VAE)与长短期记忆网络(Long Short-Term Memory, LSTM)相结合的方法:首先通过VAE从实时流动人口数据中压缩生成城市活力嵌入(Urban Vibrancy embeddings),再利用LSTM对这些嵌入进行时序建模,并进一步构建序列到序列(sequence-to-sequence)框架用于交通流量预测。该方法不仅提升了传统模型(如RNN、DCRNN、GTS和GMAN)的准确性与响应速度,还借助主成分分析(PCA)揭示了嵌入所蕴含的周期性模式(如工作日/周末差异与季节性规律),从而实现了对城市移动性的更精细刻画。

链接: https://arxiv.org/abs/2602.21232
作者: Sumin Han,Jisun An,Dongman Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Urban vibrancy reflects the dynamic human activity within urban spaces and is often measured using mobile data that captures floating population trends. This study proposes a novel approach to derive Urban Vibrancy embeddings from real-time floating population data to enhance traffic prediction models. Specifically, we utilize variational autoencoders (VAE) to compress this data into actionable embeddings, which are then integrated with long short-term memory (LSTM) networks to predict future embeddings. These are subsequently applied in a sequence-to-sequence framework for traffic forecasting. Our contributions are threefold: (1) We use principal component analysis (PCA) to interpret the embeddings, revealing temporal patterns such as weekday versus weekend distinctions and seasonal patterns; (2) We propose a method that combines VAE and LSTM, enabling forecasting dynamic urban knowledge embedding; and (3) Our approach improves accuracy and responsiveness in traffic prediction models, including RNN, DCRNN, GTS, and GMAN. This study demonstrates the potential of Urban Vibrancy embeddings to advance traffic prediction and offer a more nuanced analysis of urban mobility.

[AI-45] A Knowledge-Driven Approach to Music Segmentation Music Source Separation and Cinematic Audio Source Separation

【速读】:该论文旨在解决音频分割与声源分离中的类别识别与边界定位问题,尤其针对缺乏预标注训练数据的情况。传统方法依赖大量带有明确类别标签和边界信息的标注数据进行学习,而该研究提出一种基于知识驱动的模型化方法,通过融合输入音频与其相关知识源(如乐谱)来自主构建分割与识别模型,无需任何预分割训练数据即可实现单类和混合类音频片段的准确划分。其解决方案的关键在于将外部知识(knowledge)与可解释的模型(如隐马尔可夫模型,Hidden Markov Models, HMMs)相结合,从而在无监督或弱监督条件下实现高质量的音乐分割与电影音轨中的声源分离。

链接: https://arxiv.org/abs/2602.21476
作者: Chun-wei Ho,Sabato Marco Siniscalchi,Kai Li,Chin-Hui Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:We propose a knowledge-driven, model-based approach to segmenting audio into single-category and mixed-category chunks with applications to source separation. “Knowledge” here denotes information associated with the data, such as music scores. “Model” here refers to tool that can be used for audio segmentation and recognition, such as hidden Markov models. In contrast to conventional learning that often relies on annotated data with given segment categories and their corresponding boundaries to guide the learning process, the proposed framework does not depend on any pre-segmented training data and learns directly from the input audio and its related knowledge sources to build all necessary models autonomously. Evaluation on simulation data shows that score-guided learning achieves very good music segmentation and separation results. Tested on movie track data for cinematic audio source separation also shows that utilizing sound category knowledge achieves better separation results than those obtained with data-driven techniques without using such information.

机器学习

[LG-0] Learning and Naming Subgroups with Exceptional Survival Characteristics

链接: https://arxiv.org/abs/2602.22179
作者: Mhd Jawad Al Rahwanji,Sascha Xu,Nils Philipp Walter,Jilles Vreeken
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many applications, it is important to identify subpopulations that survive longer or shorter than the rest of the population. In medicine, for example, it allows determining which patients benefit from treatment, and in predictive maintenance, which components are more likely to fail. Existing methods for discovering subgroups with exceptional survival characteristics require restrictive assumptions about the survival model (e.g. proportional hazards), pre-discretized features, and, as they compare average statistics, tend to overlook individual deviations. In this paper, we propose Sysurv, a fully differentiable, non-parametric method that leverages random survival forests to learn individual survival curves, automatically learns conditions and how to combine these into inherently interpretable rules, so as to select subgroups with exceptional survival characteristics. Empirical evaluation on a wide range of datasets and settings, including a case study on cancer data, shows that Sysurv reveals insightful and actionable survival subgroups.

[LG-1] SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference

链接: https://arxiv.org/abs/2602.22136
作者: Qunyou Liu,Pengbo Yu,Marina Zapater,David Atienza
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) are essential for performing advanced tasks on edge or mobile devices, yet their deployment is often hindered by severe resource constraints, including limited memory, energy, and computational power. While uniform quantization provides a straightforward approach to compress model and reduce hardware requirement, it fails to fully leverage the varying robustness across layers, and often lead to accuracy degradation or suboptimal resource usage, particularly at low bitwidths. In contrast, heterogeneous quantization, which allocates different bitwidths to individual layers, can mitigate these drawbacks. Nonetheless, current heterogeneous quantization methods either needs huge brute-force design space search or lacks the adaptability to meet different hardware conditions, such as memory size, energy budget, and latency requirement. Filling these gaps, this work introduces \textbf\textitSigmaQuant, an adaptive layer-wise heterogeneous quantization framework designed to efficiently balance accuracy and resource usage for varied edge environments without exhaustive search.

[LG-2] Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination

链接: https://arxiv.org/abs/2602.22130
作者: Ilias Diakonikolas,Giannis Iakovidis,Daniel M. Kane,Sihan Liu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We study the basic task of mean estimation in the presence of mean-shift contamination. In the mean-shift contamination model, an adversary is allowed to replace a small constant fraction of the clean samples by samples drawn from arbitrarily shifted versions of the base distribution. Prior work characterized the sample complexity of this task for the special cases of the Gaussian and Laplace distributions. Specifically, it was shown that consistent estimation is possible in these cases, a property that is provably impossible in Huber’s contamination model. An open question posed in earlier work was to determine the sample complexity of mean estimation in the mean-shift contamination model for general base distributions. In this work, we study and essentially resolve this open question. Specifically, we show that, under mild spectral conditions on the characteristic function of the (potentially multivariate) base distribution, there exists a sample-efficient algorithm that estimates the target mean to any desired accuracy. We complement our upper bound with a qualitatively matching sample complexity lower bound. Our techniques make critical use of Fourier analysis, and in particular introduce the notion of a Fourier witness as an essential ingredient of our upper and lower bounds.

[LG-3] Slice and Explain: Logic-Based Explanations for Neural Networks through Domain Slicing

链接: https://arxiv.org/abs/2602.22115
作者: Luiz Fernando Paulino Queiroz,Carlos Henrique Leitão Cavalcante,Thiago Alves Rocha
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注: Preprint version. For the final published version, see the DOI below

点击查看摘要

Abstract:Neural networks (NNs) are pervasive across various domains but often lack interpretability. To address the growing need for explanations, logic-based approaches have been proposed to explain predictions made by NNs, offering correctness guarantees. However, scalability remains a concern in these methods. This paper proposes an approach leveraging domain slicing to facilitate explanation generation for NNs. By reducing the complexity of logical constraints through slicing, we decrease explanation time by up to 40% less time, as indicated through comparative experiments. Our findings highlight the efficacy of domain slicing in enhancing explanation efficiency for NNs.

[LG-4] FlowCorrect: Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation

链接: https://arxiv.org/abs/2602.22056
作者: Edgar Welte,Yitian Shi,Rosa Wolf,Maximillian Gilles,Rania Rayyes
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We present FlowCorrect, a deployment-time correction framework that converts near-miss failures into successes using sparse human nudges, without full policy retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these sparse corrections to locally adapt the policy, improving actions without retraining the backbone while preserving the model performance on previously learned scenarios. We evaluate on a real-world robot across three tabletop tasks: pick-and-place, pouring, and cup uprighting. With a low correction budget, FlowCorrect improves success on hard cases by 85% while preserving performance on previously solved scenarios. The results demonstrate clearly that FlowCorrect learns only with very few demonstrations and enables fast and sample-efficient incremental, human-in-the-loop corrections of generative visuomotor policies at deployment time in real-world robotics.

[LG-5] Disease Progression and Subtype Modeling for Combined Discrete and Continuous Input Data

链接: https://arxiv.org/abs/2602.22018
作者: Sterre de Jonge(1),Elisabeth J. Vinke(1,2),Meike W. Vernooij(1,2),Daniel C. Alexander(3),Alexandra L. Young(3),Esther E. Bron(1) ((1) Department of Radiology and Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands, (2) Department of Epidemiology, Erasmus MC, Rotterdam, The Netherlands, (3) Hawkes Institute, Department of Computer Science, University College London, London, United Kingdom)
类目: Machine Learning (cs.LG)
*备注: Accepted for publication, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), April 2026, London, United Kingdom

点击查看摘要

Abstract:Disease progression modeling provides a robust framework to identify long-term disease trajectories from short-term biomarker data. It is a valuable tool to gain a deeper understanding of diseases with a long disease trajectory, such as Alzheimer’s disease. A key limitation of most disease progression models is that they are specific to a single data type (e.g., continuous data), thereby limiting their applicability to heterogeneous, real-world datasets. To address this limitation, we propose the Mixed Events model, a novel disease progression model that handles both discrete and continuous data types. This model is implemented within the Subtype and Stage Inference (SuStaIn) framework, resulting in Mixed-SuStaIn, enabling subtype and progression modeling. We demonstrate the effectiveness of Mixed-SuStaIn through simulation experiments and real-world data from the Alzheimer’s Disease Neuroimaging Initiative, showing that it performs well on mixed datasets. The code is available at: this https URL.

[LG-6] Function-Space Empirical Bayes Regularisation with Students t Priors

链接: https://arxiv.org/abs/2602.22015
作者: Pengcheng Hao,Ercan Engin Kuruoglu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian deep learning (BDL) has emerged as a principled approach to produce reliable uncertainty estimates by integrating deep neural networks with Bayesian inference, and the selection of informative prior distributions remains a significant challenge. Various function-space variational inference (FSVI) regularisation methods have been presented, assigning meaningful priors over model predictions. However, these methods typically rely on a Gaussian prior, which fails to capture the heavy-tailed statistical characteristics inherent in neural network outputs. By contrast, this work proposes a novel function-space empirical Bayes regularisation framework – termed ST-FS-EB – which employs heavy-tailed Student’s t priors in both parameter and function spaces. Also, we approximate the posterior distribution through variational inference (VI), inducing an evidence lower bound (ELBO) objective based on Monte Carlo (MC) dropout. Furthermore, the proposed method is evaluated against various VI-based BDL baselines, and the results demonstrate its robust performance in in-distribution prediction, out-of-distribution (OOD) detection and handling distribution shifts.

[LG-7] Neural solver for Wasserstein Geodesics and optimal transport dynamics

链接: https://arxiv.org/abs/2602.22003
作者: Hailiang Liu,Yan-Han Chen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 28 pages, 22 figures

点击查看摘要

Abstract:In recent years, the machine learning community has increasingly embraced the optimal transport (OT) framework for modeling distributional relationships. In this work, we introduce a sample-based neural solver for computing the Wasserstein geodesic between a source and target distribution, along with the associated velocity field. Building on the dynamical formulation of the optimal transport (OT) problem, we recast the constrained optimization as a minimax problem, using deep neural networks to approximate the relevant functions. This approach not only provides the Wasserstein geodesic but also recovers the OT map, enabling direct sampling from the target distribution. By estimating the OT map, we obtain velocity estimates along particle trajectories, which in turn allow us to learn the full velocity field. The framework is flexible and readily extends to general cost functions, including the commonly used quadratic cost. We demonstrate the effectiveness of our method through experiments on both synthetic and real datasets.

[LG-8] Outpatient Appointment Scheduling Optimization with a Genetic Algorithm Approach

链接: https://arxiv.org/abs/2602.21995
作者: Ana Rodrigues,Rui Rego
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:The optimization of complex medical appointment scheduling remains a significant operational challenge in multi-center healthcare environments, where clinical safety protocols and patient logistics must be reconciled. This study proposes and evaluates a Genetic Algorithm (GA) framework designed to automate the scheduling of multiple medical acts while adhering to rigorous inter-procedural incompatibility rules. Using a synthetic dataset encompassing 50 medical acts across four healthcare facilities, we compared two GA variants, Pre-Ordered and Unordered, against deterministic First-Come, First-Served (FCFS) and Random Choice baselines. Our results demonstrate that the GA framework achieved a 100% constraint fulfillment rate, effectively resolving temporal overlaps and clinical incompatibilities that the FCFS baseline failed to address in 60% and 40% of cases, respectively. Furthermore, the GA variants demonstrated statistically significant improvements (p 0.001) in patient-centric metrics, achieving an Idle Time Ratio (ITR) frequently below 0.4 and reducing inter-healthcenter trips. While the GA (Ordered) variant provided a superior initial search locus, both evolutionary models converged to comparable global optima by the 100th generation. These findings suggest that transitioning from manual, human-mediated scheduling to an automated metaheuristic approach enhances clinical integrity, reduces administrative overhead, and significantly improves the patient experience by minimizing wait times and logistical burdens.

[LG-9] Compact Circulant Layers with Spectral Priors

链接: https://arxiv.org/abs/2602.21965
作者: Joseph Margaryan,Thomas Hamelryck
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Critical applications in areas such as medicine, robotics and autonomous systems require compact (i.e., memory efficient), uncertainty-aware neural networks suitable for edge and other resource-constrained deployments. We study compact spectral circulant and block-circulant-with-circulant-blocks (BCCB) layers: FFT-diagonalizable circular convolutions whose weights live directly in the real FFT (RFFT) half (1D) or half-plane (2D). Parameterizing filters in the frequency domain lets us impose simple spectral structure, perform structured variational inference in a low-dimensional weight space, and calculate exact layer spectral norms, enabling inexpensive global Lipschitz bounds and margin-based robustness diagnostics. By placing independent complex Gaussians on the Hermitian support we obtain a discrete instance of the spectral representation of stationary kernels, inducing an exact stationary Gaussian-process prior over filters on the discrete circle/torus. We exploit this to define a practical spectral prior and a Hermitian-aware low-rank-plus-diagonal variational posterior in real coordinates. Empirically, spectral circulant/BCCB layers are effective compact building blocks in both (variational) Bayesian and point estimate regimes: compact Bayesian neural networks on MNIST-Fashion-MNIST, variational heads on frozen CIFAR-10 features, and deterministic ViT projections on CIFAR-10/Tiny ImageNet; spectral layers match strong baselines while using substantially fewer parameters and with tighter Lipschitz certificates.

[LG-10] Robustness in sparse artificial neural networks trained with adaptive topology

链接: https://arxiv.org/abs/2602.21961
作者: Bendegúz Sulyok,Gergely Palla,Filippo Radicchi,Santo Fortunato
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:We investigate the robustness of sparse artificial neural networks trained with adaptive topology. We focus on a simple yet effective architecture consisting of three sparse layers with 99% sparsity followed by a dense layer, applied to image classification tasks such as MNIST and Fashion MNIST. By updating the topology of the sparse layers between each epoch, we achieve competitive accuracy despite the significantly reduced number of weights. Our primary contribution is a detailed analysis of the robustness of these networks, exploring their performance under various perturbations including random link removal, adversarial attack, and link weight shuffling. Through extensive experiments, we demonstrate that adaptive topology not only enhances efficiency but also maintains robustness. This work highlights the potential of adaptive sparse networks as a promising direction for developing efficient and reliable deep learning models.

[LG-11] Estimation and Optimization of Ship Fuel Consumption in Maritime: Review Challenges and Future Directions

链接: https://arxiv.org/abs/2602.21959
作者: Dusica Marijan,Hamza Haruna Mohammed,Bakht Zaman
类目: Machine Learning (cs.LG)
*备注: 23 pages, 4 figures. Published in Journal of Marine Science and Technology (2026)

点击查看摘要

Abstract:To reduce carbon emissions and minimize shipping costs, improving the fuel efficiency of ships is crucial. Various measures are taken to reduce the total fuel consumption of ships, including optimizing vessel parameters and selecting routes with the lowest fuel consumption. Different estimation methods are proposed for predicting fuel consumption, while various optimization methods are proposed to minimize fuel oil consumption. This paper provides a comprehensive review of methods for estimating and optimizing fuel oil consumption in maritime transport. Our novel contributions include categorizing fuel oil consumption \ estimation methods into physics-based, machine-learning, and hybrid models, exploring their strengths and limitations. Furthermore, we highlight the importance of data fusion techniques, which combine AIS, onboard sensors, and meteorological data to enhance accuracy. We make the first attempt to discuss the emerging role of Explainable AI in enhancing model transparency for decision-making. Uniquely, key challenges, including data quality, availability, and the need for real-time optimization, are identified, and future research directions are proposed to address these gaps, with a focus on hybrid models, real-time optimization, and the standardization of datasets.

[LG-12] Bayesian Generative Adversarial Networks via Gaussian Approximation for Tabular Data Synthesis

链接: https://arxiv.org/abs/2602.21948
作者: Bahrul Ilmi Nasution,Mark Elliot,Richard Allmendinger
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 5 Figures, Accepted in Transactions on Data Privacy

点击查看摘要

Abstract:Generative Adversarial Networks (GAN) have been used in many studies to synthesise mixed tabular data. Conditional tabular GAN (CTGAN) have been the most popular variant but struggle to effectively navigate the risk-utility trade-off. Bayesian GAN have received less attention for tabular data, but have been explored with unstructured data such as images and text. The most used technique employed in Bayesian GAN is Markov Chain Monte Carlo (MCMC), but it is computationally intensive, particularly in terms of weight storage. In this paper, we introduce Gaussian Approximation of CTGAN (GACTGAN), an integration of the Bayesian posterior approximation technique using Stochastic Weight Averaging-Gaussian (SWAG) within the CTGAN generator to synthesise tabular data, reducing computational overhead after the training phase. We demonstrate that GACTGAN yields better synthetic data compared to CTGAN, achieving better preservation of tabular structure and inferential statistics with less privacy risk. These results highlight GACTGAN as a simpler, effective implementation of Bayesian tabular synthesis.

[LG-13] Learning Unknown Interdependencies for Decentralized Root Cause Analysis in Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2602.21928
作者: Ayush Mohanty,Paritosh Ramanan,Nagi Gebraeel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Manuscript under review

点击查看摘要

Abstract:Root cause analysis (RCA) in networked industrial systems, such as supply chains and power networks, is notoriously difficult due to unknown and dynamically evolving interdependencies among geographically distributed clients. These clients represent heterogeneous physical processes and industrial assets equipped with sensors that generate large volumes of nonlinear, high-dimensional, and heterogeneous IoT data. Classical RCA methods require partial or full knowledge of the system’s dependency graph, which is rarely available in these complex networks. While federated learning (FL) offers a natural framework for decentralized settings, most existing FL methods assume homogeneous feature spaces and retrainable client models. These assumptions are not compatible with our problem setting. Different clients have different data features and often run fixed, proprietary models that cannot be modified. This paper presents a federated cross-client interdependency learning methodology for feature-partitioned, nonlinear time-series data, without requiring access to raw sensor streams or modifying proprietary client models. Each proprietary local client model is augmented with a Machine Learning (ML) model that encodes cross-client interdependencies. These ML models are coordinated via a global server that enforces representation consistency while preserving privacy through calibrated differential privacy noise. RCA is performed using model residuals and anomaly flags. We establish theoretical convergence guarantees and validate our approach on extensive simulations and a real-world industrial cybersecurity dataset.

[LG-14] Bridging Through Absence: How Comeback Researchers Bridge Knowledge Gaps Through Structural Re-emergence

链接: https://arxiv.org/abs/2602.21926
作者: Somyajit Chakraborty,Angshuman Jana,Avijit Gayen
类目: ocial and Information Networks (cs.SI); Digital Libraries (cs.DL); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: Preprint; 25 pages, 14 figures, 7 tables, Submitted to Scientometrics 2025

点击查看摘要

Abstract:Understanding the role of researchers who return to academia after prolonged inactivity, termed “comeback researchers”, is crucial for developing inclusive models of scientific careers. This study investigates the structural and semantic behaviors of comeback researchers, focusing on their role in cross-disciplinary knowledge transfer and network reintegration. Using the AMiner citation dataset, we analyze 113,637 early-career researchers and identify 1,425 comeback cases based on a three-year-or-longer publication gap followed by renewed activity. We find that comeback researchers cite 126% more distinct communities and exhibit 7.6% higher bridging scores compared to dropouts. They also demonstrate 74% higher gap entropy, reflecting more irregular yet strategically impactful publication trajectories. Predictive models trained on these bridging- and entropy-based features achieve a 97% ROC-AUC, far outperforming the 54% ROC-AUC of baseline models using traditional metrics like publication count and h-index. Finally, we substantiate these results via a multi-lens validation. These findings highlight the unique contributions of comeback researchers and offer data-driven tools for their early identification and institutional support.

[LG-15] he Error of Deep Operator Networks Is the Sum of Its Parts: Branch-Trunk and Mode Error Decompositions

链接: https://arxiv.org/abs/2602.21910
作者: Alexander Heinlein,Johannes Taraz
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 29 pages, 12 figures

点击查看摘要

Abstract:Operator learning has the potential to strongly impact scientific computing by learning solution operators for differential equations, potentially accelerating multi-query tasks such as design optimization and uncertainty quantification by orders of magnitude. Despite proven universal approximation properties, deep operator networks (DeepONets) often exhibit limited accuracy and generalization in practice, which hinders their adoption. Understanding these limitations is therefore crucial for further advancing the approach. This work analyzes performance limitations of the classical DeepONet architecture. It is shown that the approximation error is dominated by the branch network when the internal dimension is sufficiently large, and that the learned trunk basis can often be replaced by classical basis functions without a significant impact on performance. To investigate this further, a modified DeepONet is constructed in which the trunk network is replaced by the left singular vectors of the training solution matrix. This modification yields several key insights. First, a spectral bias in the branch network is observed, with coefficients of dominant, low-frequency modes learned more effectively. Second, due to singular-value scaling of the branch coefficients, the overall branch error is dominated by modes with intermediate singular values rather than the smallest ones. Third, using a shared branch network for all mode coefficients, as in the standard architecture, improves generalization of small modes compared to a stacked architecture in which coefficients are computed separately. Finally, strong and detrimental coupling between modes in parameter space is identified. Comments: 29 pages, 12 figures Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA) MSC classes: 65M22, 68T07, 15A18, 35Q53 Cite as: arXiv:2602.21910 [cs.LG] (or arXiv:2602.21910v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.21910 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-16] JSAM: Privacy Strag gler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning

链接: https://arxiv.org/abs/2602.21844
作者: Ruichen Xu,Ying-Jun Angela Zhang,Jianwei Huang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Differentially private federated learning faces a fundamental tension: privacy protection mechanisms that safeguard client data simultaneously create quantifiable privacy costs that discourage participation, undermining the collaborative training process. Existing incentive mechanisms rely on unbiased client selection, forcing servers to compensate even the most privacy-sensitive clients (“privacy stragglers”), leading to systemic inefficiency and suboptimal resource allocation. We introduce JSAM (Joint client Selection and privacy compensAtion Mechanism), a Bayesian-optimal framework that simultaneously optimizes client selection probabilities and privacy compensation to maximize training effectiveness under budget constraints. Our approach transforms a complex 2N-dimensional optimization problem into an efficient three-dimensional formulation through novel theoretical characterization of optimal selection strategies. We prove that servers should preferentially select privacy-tolerant clients while excluding high-sensitivity participants, and uncover the counter-intuitive insight that clients with minimal privacy sensitivity may incur the highest cumulative costs due to frequent participation. Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.

[LG-17] DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion

链接: https://arxiv.org/abs/2602.21824
作者: Marcel Lamott,Saifullah Saifullah,Nauman Riaz,Yves-Noel Weweler,Tobias Alt-Veit,Ahmad Sarmad Ali,Muhammad Armaghan Shakir,Adrian Kalwa,Momina Moetesum,Andreas Dengel,Sheraz Ahmed,Faisal Shafait,Ulrich Schwanecke,Adrian Ulges
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting and contextual visual elements via semantic-visual decoupling, we generate diverse, high-quality annotated synthetic documents. We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis. To our knowledge, this is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks. We show that with only 100 real training samples, our framework achieves on average 87% of the performance of the full real-world dataset. We publicly release our code and 140k+ synthetic document samples.

[LG-18] DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

链接: https://arxiv.org/abs/2602.21788
作者: Yifan Niu,Han Xiao,Dongyi Liu,Wei Zhou,Jia Li
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize the non-power-of-two parallelism degrees and develop a polynomial-time algorithm to generate near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP is able to maintain high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36 \times speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.

[LG-19] herapist-Robot-Patient Physical Interaction is Worth a Thousand Words: Enabling Intuitive Therapist Guidance via Remote Haptic Control

链接: https://arxiv.org/abs/2602.21783
作者: Beatrice Luciani,Alex van den Berg,Matti Lang,Alexandre L. Ratschat,Laura Marchal-Crespo
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Robotic systems can enhance the amount and repeatability of physically guided motor training. Yet their real-world adoption is limited, partly due to non-intuitive trainer/therapist-trainee/patient interactions. To address this gap, we present a haptic teleoperation system for trainers to remotely guide and monitor the movements of a trainee wearing an arm exoskeleton. The trainer can physically interact with the exoskeleton through a commercial handheld haptic device via virtual contact points at the exoskeleton’s elbow and wrist, allowing intuitive guidance. Thirty-two participants tested the system in a trainer-trainee paradigm, comparing our haptic demonstration system with conventional visual demonstration in guiding trainees in executing arm poses. Quantitative analyses showed that haptic demonstration significantly reduced movement completion time and improved smoothness, while speech analysis using large language models for automated transcription and categorization of verbal commands revealed fewer verbal instructions. The haptic demonstration did not result in higher reported mental and physical effort by trainers compared to the visual demonstration, while trainers reported greater competence and trainees lower physical demand. These findings support the feasibility of our proposed interface for effective remote human-robot physical interaction. Future work should assess its usability and efficacy for clinical populations in restoring clinicians’ sense of agency during robot-assisted therapy.

[LG-20] RAMSeS: Robust and Adaptive Model Selection for Time-Series Anomaly Detection Algorithms

链接: https://arxiv.org/abs/2602.21766
作者: Mohamed Abdelmaksoud,Sheng Ding,Andrey Morozov,Ziawasch Abedjan
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time-series data vary widely across domains, making a universal anomaly detector impractical. Methods that perform well on one dataset often fail to transfer because what counts as an anomaly is context dependent. The key challenge is to design a method that performs well in specific contexts while remaining adaptable across domains with varying data complexities. We present the Robust and Adaptive Model Selection for Time-Series Anomaly Detection RAMSeS framework. RAMSeS comprises two branches: (i) a stacking ensemble optimized with a genetic algorithm to leverage complementary detectors. (ii) An adaptive model-selection branch identifies the best single detector using techniques including Thompson sampling, robustness testing with generative adversarial networks, and Monte Carlo simulations. This dual strategy exploits the collective strength of multiple models and adapts to dataset-specific characteristics. We evaluate RAMSeS and show that it outperforms prior methods on F1.

[LG-21] From Words to Amino Acids: Does the Curse of Depth Persist?

链接: https://arxiv.org/abs/2602.21750
作者: Aleena Siji,Amir Mohammad Karimi Mamaghan,Ferdinand Kapl,Tobias Höppe,Emmanouil Angelis,Andrea Dittadi,Maurice Brenner,Michael Heinzinger,Karl Henrik Johansson,Kaitlin Maile,Johannes von Oswald,Stefan Bauer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Protein language models (PLMs) have become widely adopted as general-purpose models, demonstrating strong performance in protein engineering and de novo design. Like large language models (LLMs), they are typically trained as deep transformers with next-token or masked-token prediction objectives on massive sequence corpora and are scaled by increasing model depth. Recent work on autoregressive LLMs has identified the Curse of Depth: later layers contribute little to the final output predictions. These findings naturally raise the question of whether a similar depth inefficiency also appears in PLMs, where many widely used models are not autoregressive, and some are multimodal, accepting both protein sequence and structure as input. In this work, we present a depth analysis of six popular PLMs across model families and scales, spanning three training objectives, namely autoregressive, masked, and diffusion, and quantify how layer contributions evolve with depth using a unified set of probing- and perturbation-based measurements. Across all models, we observe consistent depth-dependent patterns that extend prior findings on LLMs: later layers depend less on earlier computations and mainly refine the final output distribution, and these effects are increasingly pronounced in deeper models. Taken together, our results suggest that PLMs exhibit a form of depth inefficiency, motivating future work on more depth-efficient architectures and training methods.

[LG-22] RABot: Reinforcement-Guided Graph Augmentation for Imbalanced and Noisy Social Bot Detection

链接: https://arxiv.org/abs/2602.21749
作者: Longlong Zhang,Xi Wang,Haotong Du,Yangyi Xu,Zhuo Liu,Yang Liu
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Social bot detection is pivotal for safeguarding the integrity of online information ecosystems. Although recent graph neural network (GNN) solutions achieve strong results, they remain hindered by two practical challenges: (i) severe class imbalance arising from the high cost of generating bots, and (ii) topological noise introduced by bots that skillfully mimic human behavior and forge deceptive links. We propose the Reinforcement-guided graph Augmentation social Bot detector (RABot), a multi-granularity graph-augmentation framework that addresses both issues in a unified manner. RABot employs a neighborhood-aware oversampling strategy that linearly interpolates minority-class embeddings within local subgraphs, thereby stabilizing the decision boundary under low-resource regimes. Concurrently, a reinforcement-learning-driven edge-filtering module combines similarity-based edge features with adaptive threshold optimization to excise spurious interactions during message passing, yielding a cleaner topology. Extensive experiments on three real-world benchmarks and four GNN backbones demonstrate that RABot consistently surpasses state-of-the-art baselines. In addition, since its augmentation and filtering modules are orthogonal to the underlying architecture, RABot can be seamlessly integrated into existing GNN pipelines to boost performance with minimal overhead.

[LG-23] Private and Robust Contribution Evaluation in Federated Learning

链接: https://arxiv.org/abs/2602.21721
作者: Delio Jaramillo Velez,Gergely Biczok,Alexandre Graell i Amat,Johan Ostman,Balazs Pejo
类目: Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cross-silo federated learning allows multiple organizations to collaboratively train machine learning models without sharing raw data, but client updates can still leak sensitive information through inference attacks. Secure aggregation protects privacy by hiding individual updates, yet it complicates contribution evaluation, which is critical for fair rewards and detecting low-quality or malicious participants. Existing marginal-contribution methods, such as the Shapley value, are incompatible with secure aggregation, and practical alternatives, such as Leave-One-Out, are crude and rely on self-evaluation. We introduce two marginal-difference contribution scores compatible with secure aggregation. Fair-Private satisfies standard fairness axioms, while Everybody-Else eliminates self-evaluation and provides resistance to manipulation, addressing a largely overlooked vulnerability. We provide theoretical guarantees for fairness, privacy, robustness, and computational efficiency, and evaluate our methods on multiple medical image datasets and CIFAR10 in cross-silo settings. Our scores consistently outperform existing baselines, better approximate Shapley-induced client rankings, and improve downstream model performance as well as misbehavior detection. These results demonstrate that fairness, privacy, robustness, and practical utility can be achieved jointly in federated contribution evaluation, offering a principled solution for real-world cross-silo deployments. Subjects: Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2602.21721 [cs.CR] (or arXiv:2602.21721v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2602.21721 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-24] C2TC: A Training-Free Framework for Efficient Tabular Data Condensation

链接: https://arxiv.org/abs/2602.21717
作者: Sijia Xu,Fan Li,Xiaoyang Wang,Zhengyi Yang,Xuemin Lin
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Tabular data is the primary data format in industrial relational databases, underpinning modern data analytics and decision-making. However, the increasing scale of tabular data poses significant computational and storage challenges to learning-based analytical systems. This highlights the need for data-efficient learning, which enables effective model training and generalization using substantially fewer samples. Dataset condensation (DC) has emerged as a promising data-centric paradigm that synthesizes small yet informative datasets to preserve data utility while reducing storage and training costs. However, existing DC methods are computationally intensive due to reliance on complex gradient-based optimization. Moreover, they often overlook key characteristics of tabular data, such as heterogeneous features and class imbalance. To address these limitations, we introduce C ^2 TC (Class-Adaptive Clustering for Tabular Condensation), the first training-free tabular dataset condensation framework that jointly optimizes class allocation and feature representation, enabling efficient and scalable condensation. Specifically, we reformulate the dataset condensation objective into a novel class-adaptive cluster allocation problem (CCAP), which eliminates costly training and integrates adaptive label allocation to handle class imbalance. To solve the NP-hard CCAP, we develop HFILS, a heuristic local search that alternates between soft allocation and class-wise clustering to efficiently obtain high-quality solutions. Moreover, a hybrid categorical feature encoding (HCFE) is proposed for semantics-preserving clustering of heterogeneous discrete attributes. Extensive experiments on 10 real-world datasets demonstrate that C ^2 TC improves efficiency by at least 2 orders of magnitude over state-of-the-art baselines, while achieving superior downstream performance.

[LG-25] Learning Complex Physical Regimes via Coverag e-oriented Uncertainty Quantification: An application to the Critical Heat Flux

链接: https://arxiv.org/abs/2602.21701
作者: Michele Cazzola,Alberto Ghione,Lucia Sargentini,Julien Nespoulous,Riccardo Finotello
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 34 pages, 14 figures

点击查看摘要

Abstract:A central challenge in scientific machine learning (ML) is the correct representation of physical systems governed by multi-regime behaviours. In these scenarios, standard data analysis techniques often fail to capture the nature of the data, as the system’s response varies significantly across the state space due to its stochasticity and the different physical regimes. Uncertainty quantification (UQ) should thus not be viewed merely as a safety assessment, but as a support to the learning task itself, guiding the model to internalise the behaviour of the data. We address this by focusing on the Critical Heat Flux (CHF) benchmark and dataset presented by the OECD/NEA Expert Group on Reactor Systems Multi-Physics. This case study represents a test for scientific ML due to the non-linear dependence of CHF on the inputs and the existence of distinct microscopic physical regimes. These regimes exhibit diverse statistical profiles, a complexity that requires UQ techniques to internalise the data behaviour and ensure reliable predictions. In this work, we conduct a comparative analysis of UQ methodologies to determine their impact on physical representation. We contrast post-hoc methods, specifically conformal prediction, against end-to-end coverage-oriented pipelines, including (Bayesian) heteroscedastic regression and quality-driven losses. These approaches treat uncertainty not as a final metric, but as an active component of the optimisation process, modelling the prediction and its behaviour simultaneously. We show that while post-hoc methods ensure statistical calibration, coverage-oriented learning effectively reshapes the model’s representation to match the complex physical regimes. The result is a model that delivers not only high predictive accuracy but also a physically consistent uncertainty estimation that adapts dynamically to the intrinsic variability of the CHF.

[LG-26] Mi: Empower Time Series Transformers with Multimodal Mixture of Experts

链接: https://arxiv.org/abs/2602.21693
作者: Jiafeng Lin,Yuxuan Wang,Huakun Luo,Zhongyi Pei,Jianmin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal time series forecasting has garnered significant attention for its potential to provide more accurate predictions than traditional single-modality models by leveraging rich information inherent in other modalities. However, due to fundamental challenges in modality alignment, existing methods often struggle to effectively incorporate multimodal data into predictions, particularly textual information that has a causal influence on time series fluctuations, such as emergency reports and policy announcements. In this paper, we reflect on the role of textual information in numerical forecasting and propose Time series transformers with Multimodal Mixture-of-Experts, TiMi, to unleash the causal reasoning capabilities of LLMs. Concretely, TiMi utilizes LLMs to generate inferences on future developments, which serve as guidance for time series forecasting. To seamlessly integrate both exogenous factors and time series into predictions, we introduce a Multimodal Mixture-of-Experts (MMoE) module as a lightweight plug-in to empower Transformer-based time series models for multimodal forecasting, eliminating the need for explicit representation-level alignment. Experimentally, our proposed TiMi demonstrates consistent state-of-the-art performance on sixteen real-world multimodal forecasting benchmarks, outperforming advanced baselines while offering both strong adaptability and interpretability.

[LG-27] Primary-Fine Decoupling for Action Generation in Robotic Imitation ICLR

链接: https://arxiv.org/abs/2602.21684
作者: Xiaohan Lei,Min Wang,Wengang Zhou,Xingyu Lu,Houqiang Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: The Fourteenth International Conference on Learning Representations (ICLR), 2026

点击查看摘要

Abstract:Multi-modal distribution in robotic manipulation action sequences poses critical challenges for imitation learning. To this end, existing approaches often model the action space as either a discrete set of tokens or a continuous, latent-variable distribution. However, both approaches present trade-offs: some methods discretize actions into tokens and therefore lose fine-grained action variations, while others generate continuous actions in a single stage tend to produce unstable mode transitions. To address these limitations, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. First, we compress action chunks into a small set of discrete modes, enabling a lightweight policy to select consistent coarse modes and avoid mode bouncing. Second, a mode conditioned MeanFlow policy is learned to generate high-fidelity continuous actions. Theoretically, we prove PF-DAG’s two-stage design achieves a strictly lower MSE bound than single-stage generative policies. Empirically, PF-DAG outperforms state-of-the-art baselines across 56 tasks from Adroit, DexArt, and MetaWorld benchmarks. It further generalizes to real-world tactile dexterous manipulation tasks. Our work demonstrates that explicit mode-level decoupling enables both robust multi-modal modeling and reactive closed-loop control for robotic manipulation.

[LG-28] Error-awareness Accelerates Active Automata Learning

链接: https://arxiv.org/abs/2602.21674
作者: Loes Kruger,Sebastian Junges,Jurriaan Rot
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Active automata learning (AAL) algorithms can learn a behavioral model of a system from interacting with it. The primary challenge remains scaling to larger models, in particular in the presence of many possible inputs to the system. Modern AAL algorithms fail to scale even if, in every state, most inputs lead to errors. In various challenging problems from the literature, these errors are observable, i.e., they emit a known error output. Motivated by these problems, we study learning these systems more efficiently. Further, we consider various degrees of knowledge about which inputs are non-error producing at which state. For each level of knowledge, we provide a matching adaptation of the state-of-the-art AAL algorithm L# to make the most of this domain knowledge. Our empirical evaluation demonstrates that the methods accelerate learning by orders of magnitude with strong but realistic domain knowledge to a single order of magnitude with limited domain knowledge.

[LG-29] Multimodal Survival Modeling and Fairness-Aware Clinical Machine Learning for 5-Year Breast Cancer Risk Prediction

链接: https://arxiv.org/abs/2602.21648
作者: Toktam Khatibi
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Clinical risk prediction models often underperform in real-world settings due to poor calibration, limited transportability, and subgroup disparities. These challenges are amplified in high-dimensional multimodal cancer datasets characterized by complex feature interactions and a p n structure. We present a fully reproducible multimodal machine learning framework for 5-year overall survival prediction in breast cancer, integrating clinical variables with high-dimensional transcriptomic and copy-number alteration (CNA) features from the METABRIC cohort. After variance- and sparsity-based filtering and dimensionality reduction, models were trained using stratified train/validation/test splits with validation-based hyperparameter tuning. Two survival approaches were compared: an elastic-net regularized Cox model (CoxNet) and a gradient-boosted survival tree model implemented using XGBoost. CoxNet provides embedded feature selection and stable estimation, whereas XGBoost captures nonlinear effects and higher-order interactions. Performance was assessed using time-dependent area under the ROC curve (AUC), average precision (AP), calibration curves, Brier score, and bootstrapped 95 percent confidence intervals. CoxNet achieved validation and test AUCs of 98.3 and 96.6, with AP values of 90.1 and 80.4. XGBoost achieved validation and test AUCs of 98.6 and 92.5, with AP values of 92.5 and 79.9. Fairness diagnostics showed stable discrimination across age groups, estrogen receptor status, molecular subtypes, and menopausal state. This work introduces a governance-oriented multimodal survival framework emphasizing calibration, fairness auditing, robustness, and reproducibility for high-dimensional clinical machine learning. Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM) Cite as: arXiv:2602.21648 [cs.LG] (or arXiv:2602.21648v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.21648 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Toktam Khatibi [view email] [v1] Wed, 25 Feb 2026 07:20:43 UTC (798 KB)

[LG-30] Revisiting the Bertrand Paradox via Equilibrium Analysis of No-regret Learners

链接: https://arxiv.org/abs/2602.21620
作者: Arnab Maiti,Junyan Liu,Kevin Jamieson,Lillian J. Ratliff
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 36 pages, 34 figures

点击查看摘要

Abstract:We study the discrete Bertrand pricing game with a non-increasing demand function. The game has n \ge 2 players who simultaneously choose prices from the set \1/k, 2/k, \ldots, 1\ , where k\in\mathbbN . The player who sets the lowest price captures the entire demand; if multiple players tie for the lowest price, they split the demand equally. We study the Bertrand paradox, where classical theory predicts low prices, yet real markets often sustain high prices. To understand this gap, we analyze a repeated-game model in which firms set prices using no-regret learners. Our goal is to characterize the equilibrium outcomes that can arise under different no-regret learning guarantees. We are particularly interested in questions such as whether no-external-regret learners can converge to undesirable high-price outcomes, and how stronger guarantees such as no-swap regret shape the emergence of competitive low-price behavior. We address these and related questions through a theoretical analysis, complemented by experiments that support the theory and reveal surprising phenomena for no-swap regret learners. Comments: 36 pages, 34 figures Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) Cite as: arXiv:2602.21620 [cs.GT] (or arXiv:2602.21620v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2602.21620 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-31] Deep Clustering based Boundary-Decoder Net for Inter and Intra Layer Stress Prediction of Heterogeneous Integrated IC Chip

链接: https://arxiv.org/abs/2602.21601
作者: Kart Leong Lim,Ji Lin
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:High stress occurs when 3D heterogeneous IC packages are subjected to thermal cycling at extreme temperatures. Stress mainly occurs at the interface between different materials. We investigate stress image using latent space representation which is based on using deep generative model (DGM). However, most DGM approaches are unsupervised, meaning they resort to image pairing (input and output) to train DGM. Instead, we rely on a recent boundary-decoder (BD) net, which uses boundary condition and image pairing for stress modeling. The boundary net maps material parameters to the latent space co-shared by its image counterpart. Because such a setup is dimensionally wise ill-posed, we further couple BD net with deep clustering. To access the performance of our proposed method, we simulate an IC chip dataset comprising of 1825 stress images. We compare our new approach using variants of BD net as well as a baseline approach. We show that our approach is able to outperform all the comparison in terms of train and test error reduction.

[LG-32] NGDB-Zoo: Towards Efficient and Scalable Neural Graph Databases Training

链接: https://arxiv.org/abs/2602.21597
作者: Zhongwei Xie,Jiaxin Bai,Shujie Liu,Haoyu Huang,Yufei Li,Yisen Gao,Hong Ting Tsang,Yangqiu Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Graph Databases (NGDBs) facilitate complex logical reasoning over incomplete knowledge structures, yet their training efficiency and expressivity are constrained by rigid query-level batching and structure-exclusive embeddings. We present NGDB-Zoo, a unified framework that resolves these bottlenecks by synergizing operator-level training with semantic augmentation. By decoupling logical operators from query topologies, NGDB-Zoo transforms the training loop into a dynamically scheduled data-flow execution, enabling multi-stream parallelism and achieving a 1.8\times - 6.8\times throughput compared to baselines. Furthermore, we formalize a decoupled architecture to integrate high-dimensional semantic priors from Pre-trained Text Encoders (PTEs) without triggering I/O stalls or memory overflows. Extensive evaluations on six benchmarks, including massive graphs like ogbl-wikikg2 and ATLAS-Wiki, demonstrate that NGDB-Zoo maintains high GPU utilization across diverse logical patterns and significantly mitigates representation friction in hybrid neuro-symbolic reasoning.

[LG-33] ABM-UDE: Developing Surrogates for Epidemic Agent -Based Models via Scientific Machine Learning

链接: https://arxiv.org/abs/2602.21588
作者: Sharv Murgai,Utkarsh Utkarsh,Kyle C. Nguyen,Alan Edelman,Erin C. S. Acquesta,Christopher Vincent Rackauckas
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 25 pages, 4 figures

点击查看摘要

Abstract:Agent-based epidemic models (ABMs) encode behavioral and policy heterogeneity but are too slow for nightly hospital planning. We develop county-ready surrogates that learn directly from exascale ABM trajectories using Universal Differential Equations (UDEs): mechanistic SEIR-family ODEs with a neural-parameterized contact rate \kappa_\phi(u,t) (no additive residual). Our contributions are threefold: we adapt multiple shooting and an observer-based prediction-error method (PEM) to stabilize identification of neural-augmented epidemiological dynamics across intervention-driven regime shifts; we enforce positivity and mass conservation and show the learned contact-rate parameterization yields a well-posed vector field; and we quantify accuracy, calibration, and compute against ABM ensembles and UDE baselines. On a representative ExaEpi scenario, PEM-UDE reduces mean MSE by 77% relative to single-shooting UDE (3.00 vs. 13.14) and by 20% relative to MS-UDE (3.75). Reliability improves in parallel: empirical coverage of ABM 10 - 90 % and 25 - 75 % bands rises from 0.68/0.43 (UDE) and 0.79/0.55 (MS-UDE) to 0.86/0.61 with PEM-UDE and 0.94/0.69 with MS+PEM-UDE, indicating calibrated uncertainty rather than overconfident fits. Inference runs in seconds on commodity CPUs (20-35 s per \sim 90-day forecast), enabling nightly ‘‘what-if’’ sweeps on a laptop. Relative to a \sim 100 CPU-hour ABM reference run, this yields \sim10^4\times lower wall-clock per scenario. This closes the realism-cadence gap, supports threshold-aware decision-making (e.g., maintaining ICU occupancy 75 %), preserves mechanistic interpretability, and enables calibrated, risk-aware scenario planning on standard institutional hardware. Beyond epidemics, the ABM \to UDE recipe provides a portable path to distill agent-based simulators into fast, trustworthy surrogates for other scientific domains.

[LG-34] Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

链接: https://arxiv.org/abs/2602.21585
作者: Sweta Karlekar,Carolina Zheng,Magnus Saebo,Nicolas Beltran-Velez,Shuyang Yu,John Bowlan,Michal Kucer,David Blei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.

[LG-35] raining-free Composition of Pre-trained GFlowNets for Multi-Objective Generation

链接: https://arxiv.org/abs/2602.21565
作者: Seokwon Yoon,Youngbin Choi,Seunghyuk Cho,Seungbeom Lee,MoonJeong Park,Dongwoo Kim
类目: Machine Learning (cs.LG)
*备注: 22 pages, 12 figures, 12 tables

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) learn to sample diverse candidates in proportion to a reward function, making them well-suited for scientific discovery, where exploring multiple promising solutions is crucial. Further extending GFlowNets to multi-objective settings has attracted growing interest since real-world applications often involve multiple, conflicting objectives. However, existing approaches require additional training for each set of objectives, limiting their applicability and incurring substantial computational overhead. We propose a training-free mixing policy that composes pre-trained GFlowNets at inference time, enabling rapid adaptation without finetuning or retraining. Importantly, our framework is flexible, capable of handling diverse reward combinations ranging from linear scalarization to complex non-linear logical operators, which are often handled separately in previous literature. We prove that our method exactly recovers the target distribution for linear scalarization and quantify the approximation quality for nonlinear operators through a distortion factor. Experiments on a synthetic 2D grid and real-world molecule-generation tasks demonstrate that our approach achieves performance comparable to baselines that require additional training.

[LG-36] Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction ICLR2026

链接: https://arxiv.org/abs/2602.21550
作者: Zhao Yang,Yi Duan,Jiwei Zhu,Ying Ba,Chuan Cao,Bing Su
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Gene expression prediction, which predicts mRNA expression levels from DNA sequences, presents significant challenges. Previous works often focus on extending input sequence length to locate distal enhancers, which may influence target genes from hundreds of kilobases away. Our work first reveals that for current models, long sequence modeling can decrease performance. Even carefully designed algorithms only mitigate the performance degradation caused by long sequences. Instead, we find that proximal multimodal epigenomic signals near target genes prove more essential. Hence we focus on how to better integrate these signals, which has been overlooked. We find that different signal types serve distinct biological roles, with some directly marking active regulatory elements while others reflect background chromatin patterns that may introduce confounding effects. Simple concatenation may lead models to develop spurious associations with these background patterns. To address this challenge, we propose Prism, a framework that learns multiple combinations of high-dimensional epigenomic features to represent distinct background chromatin states and uses backdoor adjustment to mitigate confounding effects. Our experimental results demonstrate that proper modeling of multimodal epigenomic signals achieves state-of-the-art performance using only short sequences for gene expression prediction.

[LG-37] Mamba Meets Scheduling: Learning to Solve Flexible Job Shop Scheduling with Efficient Sequence Modeling

链接: https://arxiv.org/abs/2602.21546
作者: Zhi Cao,Cong Zhang,Yaoxin Wu,Yaqing Hou,Hongwei Ge
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Flexible Job Shop Problem (FJSP) is a well-studied combinatorial optimization problem with extensive applications for manufacturing and production scheduling. It involves assigning jobs to various machines to optimize criteria, such as minimizing total completion time. Current learning-based methods in this domain often rely on localized feature extraction models, limiting their capacity to capture overarching dependencies spanning operations and machines. This paper introduces an innovative architecture that harnesses Mamba, a state-space model with linear computational complexity, to facilitate comprehensive sequence modeling tailored for FJSP. In contrast to prevalent graph-attention-based frameworks that are computationally intensive for FJSP, we show our model is more efficient. Specifically, the proposed model possesses an encoder and a decoder. The encoder incorporates a dual Mamba block to extract operation and machine features separately. Additionally, we introduce an efficient cross-attention decoder to learn interactive embeddings of operations and machines. Our experimental results demonstrate that our method achieves faster solving speed and surpasses the performance of state-of-the-art learning-based methods for FJSP across various benchmarks.

[LG-38] Muon: Towards Better Muon via One Additional Normalization Step

链接: https://arxiv.org/abs/2602.21545
作者: Ruijie Zhang,Yequan Zhao,Ziyue Liu,Zhengyang Wang,Zheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of \approx 200 . Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: this https URL.

[LG-39] Learning Recursive Multi-Scale Representations for Irregular Multivariate Time Series Forecasting ICLR2026

链接: https://arxiv.org/abs/2602.21498
作者: Boyuan Li,Zhen Liu,Yicheng Luo,Qianli Ma
类目: Machine Learning (cs.LG)
*备注: Accepted in ICLR 2026

点击查看摘要

Abstract:Irregular Multivariate Time Series (IMTS) are characterized by uneven intervals between consecutive timestamps, which carry sampling pattern information valuable and informative for learning temporal and variable dependencies. In addition, IMTS often exhibit diverse dependencies across multiple time scales. However, many existing multi-scale IMTS methods use resampling to obtain the coarse series, which can alter the original timestamps and disrupt the sampling pattern information. To address the challenge, we propose ReIMTS, a Recursive multi-scale modeling approach for Irregular Multivariate Time Series forecasting. Instead of resampling, ReIMTS keeps timestamps unchanged and recursively splits each sample into subsamples with progressively shorter time periods. Based on the original sampling timestamps in these long-to-short subsamples, an irregularity-aware representation fusion mechanism is proposed to capture global-to-local dependencies for accurate forecasting. Extensive experiments demonstrate an average performance improvement of 27.1% in the forecasting task across different models and real-world datasets. Our code is available at this https URL.

[LG-40] he Design Space of Tri-Modal Masked Diffusion Models

链接: https://arxiv.org/abs/2602.21472
作者: Louis Bethune,Victor Turrisi,Bruno Kacper Mlodozeniec,Pau Rodriguez Lopez,Lokesh Boominathan,Nikhil Bhendawade,Amitis Shidani,Joris Pelemans,Theo X. Olausson,Devon Hjelm,Paul Dixon,Joao Monteiro,Pierre Ablin,Vishnu Banna,Arno Blaas,Nick Henderson,Kari Noriy,Dan Busbridge,Josh Susskind,Marco Cuturi,Irina Belousova,Luca Zappella,Russ Webb,Jason Ramapuram
类目: Machine Learning (cs.LG)
*备注: 41 pages, 29 figures, 10 tables

点击查看摘要

Abstract:Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.

[LG-41] D-Flow SGLD: Source-Space Posterior Sampling for Scientific Inverse Problems with Flow Matching

链接: https://arxiv.org/abs/2602.21469
作者: Meet Hemant Parikh,Yaqin Chen,Jian-Xun Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data assimilation and scientific inverse problems require reconstructing high-dimensional physical states from sparse and noisy observations, ideally with uncertainty-aware posterior samples that remain faithful to learned priors and governing physics. While training-free conditional generation is well developed for diffusion models, corresponding conditioning and posterior sampling strategies for Flow Matching (FM) priors remain comparatively under-explored, especially on scientific benchmarks where fidelity must be assessed beyond measurement misfit. In this work, we study training-free conditional generation for scientific inverse problems under FM priors and organize existing inference-time strategies by where measurement information is injected: (i) guided transport dynamics that perturb sampling trajectories using likelihood information, and (ii) source-distribution inference that performs posterior inference over the source variable while keeping the learned transport fixed. Building on the latter, we propose D-Flow SGLD, a source-space posterior sampling method that augments differentiable source inference with preconditioned stochastic gradient Langevin dynamics, enabling scalable exploration of the source posterior induced by new measurement operators without retraining the prior or modifying the learned FM dynamics. We benchmark representative methods from both families on a hierarchy of problems: 2D toy posteriors, chaotic Kuramoto-Sivashinsky trajectories, and wall-bounded turbulence reconstruction. Across these settings, we quantify trade-offs among measurement assimilation, posterior diversity, and physics/statistics fidelity, and establish D-Flow SGLD as a practical FM-compatible posterior sampler for scientific inverse problems.

[LG-42] Geometric Priors for Generalizable World Models via Vector Symbolic Architecture NEURIPS2025

链接: https://arxiv.org/abs/2602.21467
作者: William Youngwoo Chung,Calvin Yeung,Hansen Jin Lillemark,Zhuowen Zou,Xiangjian Liu,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注: 9 pages, accepted to Neurips 2025 Workshop Symmetry and Geometry in Neural Representations

点击查看摘要

Abstract:A key challenge in artificial intelligence and neuroscience is understanding how neural systems learn representations that capture the underlying dynamics of the world. Most world models represent the transition function with unstructured neural networks, limiting interpretability, sample efficiency, and generalization to unseen states or action compositions. We address these issues with a generalizable world model grounded in Vector Symbolic Architecture (VSA) principles as geometric priors. Our approach utilizes learnable Fourier Holographic Reduced Representation (FHRR) encoders to map states and actions into a high dimensional complex vector space with learned group structure and models transitions with element-wise complex multiplication. We formalize the framework’s group theoretic foundation and show how training such structured representations to be approximately invariant enables strong multi-step composition directly in latent space and generalization performances over various experiments. On a discrete grid world environment, our model achieves 87.5% zero shot accuracy to unseen state-action pairs, obtains 53.6% higher accuracy on 20-timestep horizon rollouts, and demonstrates 4x higher robustness to noise relative to an MLP baseline. These results highlight how training to have latent group structure yields generalizable, data-efficient, and interpretable world models, providing a principled pathway toward structured models for real-world planning and reasoning.

[LG-43] Asymptotically Fast Clebsch-Gordan Tensor Products with Vector Spherical Harmonics

链接: https://arxiv.org/abs/2602.21466
作者: YuQing Xie,Ameya Daigavane,Mit Kotak,Tess Smidt
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 28 pages, 2 figures. arXiv admin note: text overlap with arXiv:2506.13523

点击查看摘要

Abstract: E(3) -equivariant neural networks have proven to be effective in a wide range of 3D modeling tasks. A fundamental operation of such networks is the tensor product, which allows interaction between different feature types. Because this operation scales poorly, there has been considerable work towards accelerating this interaction. However, recently \citetxieprice have pointed out that most speedups come from a reduction in expressivity rather than true algorithmic improvements on computing Clebsch-Gordan tensor products. A modification of Gaunt tensor product \citepgaunt can give a true asymptotic speedup but is incomplete and misses many interactions. In this work, we provide the first complete algorithm which truly provides asymptotic benefits Clebsch-Gordan tensor products. For full CGTP, our algorithm brings runtime complexity from the naive O(L^6) to O(L^4\log^2 L) , close to the lower bound of O(L^4) . We first show how generalizing fast Fourier based convolution naturally leads to the previously proposed Gaunt tensor product \citepgaunt. To remedy antisymmetry issues, we generalize from scalar signals to irrep valued signals, giving us tensor spherical harmonics. We prove a generalized Gaunt formula for the tensor harmonics. Finally, we show that we only need up to vector valued signals to recover the missing interactions of Gaunt tensor product.

[LG-44] Effects of Training Data Quality on Classifier Performance

链接: https://arxiv.org/abs/2602.21462
作者: Alan F. Karr,Regina Ruane
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into “contigs,” we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers – Bayes classifiers, neural nets, partition models and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases. Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML) Cite as: arXiv:2602.21462 [cs.LG] (or arXiv:2602.21462v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.21462 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-45] When Learning Hurts: Fixed-Pole RNN for Real-Time Online Training

链接: https://arxiv.org/abs/2602.21454
作者: Alexander Morgan,Ummay Sumaya Khan,Lingjia Liu,Lizhong Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) can be interpreted as discrete-time state-space models, where the state evolution corresponds to an infinite-impulse-response (IIR) filtering operation governed by both feedforward weights and recurrent poles. While, in principle, all parameters including pole locations can be optimized via backpropagation through time (BPTT), such joint learning incurs substantial computational overhead and is often impractical for applications with limited training data. Echo state networks (ESNs) mitigate this limitation by fixing the recurrent dynamics and training only a linear readout, enabling efficient and stable online adaptation. In this work, we analytically and empirically examine why learning recurrent poles does not provide tangible benefits in data-constrained, real-time learning scenarios. Our analysis shows that pole learning renders the weight optimization problem highly non-convex, requiring significantly more training samples and iterations for gradient-based methods to converge to meaningful solutions. Empirically, we observe that for complex-valued data, gradient descent frequently exhibits prolonged plateaus, and advanced optimizers offer limited improvement. In contrast, fixed-pole architectures induce stable and well-conditioned state representations even with limited training data. Numerical results demonstrate that fixed-pole networks achieve superior performance with lower training complexity, making them more suitable for online real-time tasks.

[LG-46] Proximal-IMH: Proximal Posterior Proposals for Independent Metropolis-Hastings with Approximate Operators

链接: https://arxiv.org/abs/2602.21426
作者: Youguang Chen,George Biros
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We consider the problem of sampling from a posterior distribution arising in Bayesian inverse problems in science, engineering, and imaging. Our method belongs to the family of independence Metropolis-Hastings (IMH) sampling algorithms, which are common in Bayesian inference. Relying on the existence of an approximate posterior distribution that is cheaper to sample from but may have significant bias, we introduce Proximal-IMH, a scheme that removes this bias by correcting samples from the approximate posterior through an auxiliary optimization problem. This yields a local adjustment that trades off adherence to the exact model against stability around the approximate reference point. For idealized settings, we prove that the proximal correction tightens the match between approximate and exact posteriors, thereby improving acceptance rates and mixing. The method applies to both linear and nonlinear input-output operators and is particularly suitable for inverse problems where exact posterior sampling is too expensive. We present numerical experiments including multimodal and data-driven priors with nonlinear input-output operators. The results show that Proximal-IMH reliably outperforms existing IMH variants.

[LG-47] Benchmarking State Space Models Transformers and Recurrent Networks for US Grid Forecasting

链接: https://arxiv.org/abs/2602.21415
作者: Sunki Hong,Jisoo Lee,Yuanyuan Shi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 11 pages, 2 figures, 8 tables

点击查看摘要

Abstract:Selecting the right deep learning model for power grid forecasting is challenging, as performance heavily depends on the data available to the operator. This paper presents a comprehensive benchmark of five modern neural architectures: two state space models (PowerMamba, S-Mamba), two Transformers (iTransformer, PatchTST), and a traditional LSTM. We evaluate these models on hourly electricity demand across six diverse US power grids for forecast windows between 24 and 168 hours. To ensure a fair comparison, we adapt each model with specialized temporal processing and a modular layer that cleanly integrates weather covariates. Our results reveal that there is no single best model for all situations. When forecasting using only historical load, PatchTST and the state space models provide the highest accuracy. However, when explicit weather data is added to the inputs, the rankings reverse: iTransformer improves its accuracy three times more efficiently than PatchTST. By controlling for model size, we confirm that this advantage stems from the architecture’s inherent ability to mix information across different variables. Extending our evaluation to solar generation, wind power, and wholesale prices further demonstrates that model rankings depend on the forecast task: PatchTST excels on highly rhythmic signals like solar, while state space models are better suited for the chaotic fluctuations of wind and price. Ultimately, this benchmark provides grid operators with actionable guidelines for selecting the optimal forecasting architecture based on their specific data environments.

[LG-48] Generative Bayesian Computation as a Scalable Alternative to Gaussian Process Surrogates

链接: https://arxiv.org/abs/2602.21408
作者: Nick Polson,Vadim Sokolov
类目: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gaussian process (GP) surrogates are the default tool for emulating expensive computer experiments, but cubic cost, stationarity assumptions, and Gaussian predictive distributions limit their reach. We propose Generative Bayesian Computation (GBC) via Implicit Quantile Networks (IQNs) as a surrogate framework that targets all three limitations. GBC learns the full conditional quantile function from input–output pairs; at test time, a single forward pass per quantile level produces draws from the predictive distribution. Across fourteen benchmarks we compare GBC to four GP-based methods. GBC improves CRPS by 11–26% on piecewise jump-process benchmarks, by 14% on a ten-dimensional Friedman function, and scales linearly to 90,000 training points where dense-covariance GPs are infeasible. A boundary-augmented variant matches or outperforms Modular Jump GPs on two-dimensional jump datasets (up to 46% CRPS improvement). In active learning, a randomized-prior IQN ensemble achieves nearly three times lower RMSE than deep GP active learning on Rocket LGBB. Overall, GBC records a favorable point estimate in 12 of 14 comparisons. GPs retain an edge on smooth surfaces where their smoothness prior provides effective regularization. Subjects: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML) Cite as: arXiv:2602.21408 [cs.LG] (or arXiv:2602.21408v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.21408 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-49] Defensive Generation

链接: https://arxiv.org/abs/2602.21390
作者: Gabriele Farina,Juan Carlos Perdomo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of efficiently producing, in an online fashion, generative models of scalar, multiclass, and vector-valued outcomes that cannot be falsified on the basis of the observed data and a pre-specified collection of computational tests. Our contributions are twofold. First, we expand on connections between online high-dimensional multicalibration with respect to an RKHS and recent advances in expected variational inequality problems, enabling efficient algorithms for the former. We then apply this algorithmic machinery to the problem of outcome indistinguishability. Our procedure, Defensive Generation, is the first to efficiently produce online outcome indistinguishable generative models of non-Bernoulli outcomes that are unfalsifiable with respect to infinite classes of tests, including those that examine higher-order moments of the generated distributions. Furthermore, our method runs in near-linear time in the number of samples and achieves the optimal, vanishing T^-1/2 rate for generation error.

[LG-50] Interleaved Head Attention

链接: https://arxiv.org/abs/2602.21371
作者: Sai Surya Duvvuri,Chanakya Ekbote,Rachit Bansal,Rishabh Tiwari,Devvrit Khatri,David Brandfonbrener,Paul Liang,Inderjit Dhillon,Manzil Zaheer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: H attention heads produce exactly H independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi-step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token-to-token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross-head mixing by constructing P pseudo-heads per head (typically P=H ), where each pseudo query/key/value is a learned linear combination of all H original queries, keys and values respectively. Interactions between pseudo-query and pseudo-key heads induce up to P^2 attention patterns per head with modest parameter overhead \mathcalO(H^2P) . We provide theory showing improved efficiency in terms of number of parameters on the synthetic Polynomial task (IHA uses \Theta(\sqrtkn^2) parameters vs. \Theta(kn^2) for MHA) and on the synthetic order-sensitive CPM-3 task (IHA uses \lceil\sqrtN_\max\rceil heads vs. N_\max for MHA). On real-world benchmarks, IHA improves Multi-Key retrieval on RULER by 10-20% (4k-16k) and, after fine-tuning for reasoning on OpenThoughts, improves GSM8K by 5.8% and MATH-500 by 2.8% (Majority Vote) over full attention.

[LG-51] Archetypal Graph Generative Models: Explainable and Identifiable Communities via Anchor-Dominant Convex Hulls AISTATS26

链接: https://arxiv.org/abs/2602.21342
作者: Nikolaos Nakis,Chrysoula Kosma,Panagiotis Promponas,Michail Chatzianastasis,Giannis Nikolentzos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to AISTATS26 (Spotlight)

点击查看摘要

Abstract:Representation learning has been essential for graph machine learning tasks such as link prediction, community detection, and network visualization. Despite recent advances in achieving high performance on these downstream tasks, little progress has been made toward self-explainable models. Understanding the patterns behind predictions is equally important, motivating recent interest in explainable machine learning. In this paper, we present GraphHull, an explainable generative model that represents networks using two levels of convex hulls. At the global level, the vertices of a convex hull are treated as archetypes, each corresponding to a pure community in the network. At the local level, each community is refined by a prototypical hull whose vertices act as representative profiles, capturing community-specific variation. This two-level construction yields clear multi-scale explanations: a node’s position relative to global archetypes and its local prototypes directly accounts for its edges. The geometry is well-behaved by design, while local hulls are kept disjoint by construction. To further encourage diversity and stability, we place principled priors, including determinantal point processes, and fit the model under MAP estimation with scalable subsampling. Experiments on real networks demonstrate the ability of GraphHull to recover multi-level community structure and to achieve competitive or superior performance in link prediction and community detection, while naturally providing interpretable predictions.

[LG-52] HiPPO Zoo: Explicit Memory Mechanisms for Interpretable State Space Models

链接: https://arxiv.org/abs/2602.21340
作者: Jack Goffinet,Casey Hanks,David E. Carlson
类目: Machine Learning (cs.LG)
*备注: 20 pages, 6 figures

点击查看摘要

Abstract:Representing the past in a compressed, efficient, and informative manner is a central problem for systems trained on sequential data. The HiPPO framework, originally proposed by Gu Dao et al., provides a principled approach to sequential compression by projecting signals onto orthogonal polynomial (OP) bases via structured linear ordinary differential equations. Subsequent works have embedded these dynamics in state space models (SSMs), where HiPPO structure serves as an initialization. Nonlinear successors of these SSM methods such as Mamba are state-of-the-art for many tasks with long-range dependencies, but the mechanisms by which they represent and prioritize history remain largely implicit. In this work, we revisit the HiPPO framework with the goal of making these mechanisms explicit. We show how polynomial representations of history can be extended to support capabilities of modern SSMs such as adaptive allocation of memory and associative memory while retaining direct interpretability in the OP basis. We introduce a unified framework comprising five such extensions, which we collectively refer to as a “HiPPO zoo.” Each extension exposes a specific modeling capability through an explicit, interpretable modification of the HiPPO framework. The resulting models adapt their memory online and train in streaming settings with efficient updates. We illustrate the behaviors and modeling advantages of these extensions through a range of synthetic sequence modeling tasks, demonstrating that capabilities typically associated with modern SSMs can be realized through explicit, interpretable polynomial memory structures.

[LG-53] Efficient Opportunistic Approachability

链接: https://arxiv.org/abs/2602.21328
作者: Teodor Vanislavov Marinov,Mehryar Mohri,Princewill Okoroafor,Jon Schneider,Julian Zimmert
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We study the problem of opportunistic approachability: a generalization of Blackwell approachability where the learner would like to obtain stronger guarantees (i.e., approach a smaller set) when their adversary limits themselves to a subset of their possible action space. Bernstein et al. (2014) introduced this problem in 2014 and presented an algorithm that guarantees sublinear approachability rates for opportunistic approachability. However, this algorithm requires the ability to produce calibrated online predictions of the adversary’s actions, a problem whose standard implementations require time exponential in the ambient dimension and result in approachability rates that scale as T^-O(1/d) . In this paper, we present an efficient algorithm for opportunistic approachability that achieves a rate of O(T^-1/4) (and an inefficient one that achieves a rate of O(T^-1/3) ), bypassing the need for an online calibration subroutine. Moreover, in the case where the dimension of the adversary’s action set is at most two, we show it is possible to obtain the optimal rate of O(T^-1/2) .

[LG-54] Dynamic Symmetric Point Tracking: Tackling Non-ideal Reference in Analog In-memory Training

链接: https://arxiv.org/abs/2602.21321
作者: Quan Xiao,Jindan Li,Zhaoxian Wu,Tayfun Gokmen,Tianyi Chen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Analog in-memory computing (AIMC) performs computation directly within resistive crossbar arrays, offering an energy-efficient platform to scale large vision and language models. However, non-ideal analog device properties make the training on AIMC devices challenging. In particular, its update asymmetry can induce a systematic drift of weight updates towards a device-specific symmetric point (SP), which typically does not align with the optimum of the training objective. To mitigate this bias, most existing works assume the SP is known and pre-calibrate it to zero before training by setting the reference point as the SP. Nevertheless, calibrating AIMC devices requires costly pulse updates, and residual calibration error can directly degrade training accuracy. In this work, we present the first theoretical characterization of the pulse complexity of SP calibration and the resulting estimation error. We further propose a dynamic SP estimation method that tracks the SP during model training, and establishes its convergence guarantees. In addition, we develop an enhanced variant based on chopping and filtering techniques from digital signal processing. Numerical experiments demonstrate both the efficiency and effectiveness of the proposed method.

[LG-55] ool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

链接: https://arxiv.org/abs/2602.21320
作者: Emre Can Acikgoz,Cheng Qian,Jonas Hübotter,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose Tool-R0 framework for training general purpose tool-calling agents from scratch with self-play RL, under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other’s competence frontier and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluation on different tool-use benchmarks show that Tool-R0 yields 92.5 relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.

[LG-56] Shared Nature Unique Nurture: PRISM for Pluralistic Reasoning via In-context Structure Modeling

链接: https://arxiv.org/abs/2602.21317
作者: Guancheng Tu,Shiyang Zhang,Tianyu Zhang,Yi Zhang,Diji Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are converging towards a singular Artificial Hivemind, where shared Nature (pre-training priors) result in a profound collapse of distributional diversity, limiting the distinct perspectives necessary for creative exploration and scientific discovery. To address this, we propose to equip models with inference-time Nurture (individualized epistemic trajectories) using Epistemic Evolution paradigm, progressing through explore, internalize, and express. We instantiate this via PRISM (Pluralistic Reasoning via In-context Structure Modeling), a model-agnostic system that augments LLM with dynamic On-the-fly Epistemic Graphs. On three creativity benchmarks, PRISM achieves state-of-the-art novelty and significantly expands distributional diversity. Moreover, we evaluate the real-world utility via a challenging rare-disease diagnosis benchmark. Results demonstrate that PRISM successfully uncovers correct long-tail diagnoses that standard LLM miss, confirming that its divergence stems from meaningful exploration rather than incoherent noise. Overall, this work establishes a new paradigm for Pluralistic AI, moving beyond monolithic consensus toward a diverse ecosystem of unique cognitive individuals capable of collective, multi-perspective discovery.

[LG-57] Precedence-Constrained Decision Trees and Coverings

链接: https://arxiv.org/abs/2602.21312
作者: Michał Szyfelbein,Dariusz Dereniowski
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work considers a number of optimization problems and reductive relations between them. The two main problems we are interested in are the \emphOptimal Decision Tree and \emphSet Cover. We study these two fundamental tasks under precedence constraints, that is, if a test (or set) X is a predecessor of Y , then in any feasible decision tree X needs to be an ancestor of Y (or respectively, if Y is added to set cover, then so must be X ). For the Optimal Decision Tree we consider two optimization criteria: worst case identification time (height of the tree) or the average identification time. Similarly, for the Set Cover we study two cost measures: the size of the cover or the average cover time. Our approach is to develop a number of algorithmic reductions, where an approximation algorithm for one problem provides an approximation for another via a black-box usage of a procedure for the former. En route we introduce other optimization problems either to complete the `reduction landscape’ or because they hold the essence of combinatorial structure of our problems. The latter is brought by a problem of finding a maximum density precedence closed subfamily, where the density is defined as the ratio of the number of items the family covers to its size. By doing so we provide \cO^*(\sqrtm) -approximation algorithms for all of the aforementioned problems. The picture is complemented by a number of hardness reductions that provide o(m^1/12-\epsilon) -inapproximability results for the decision tree and covering problems. Besides giving a complete set of results for general precedence constraints, we also provide polylogarithmic approximation guarantees for two most typically studied and applicable precedence types, outforests and inforests. By providing corresponding hardness results, we show these results to be tight. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2602.21312 [cs.DS] (or arXiv:2602.21312v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2602.21312 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-58] SymTorch: A Framework for Symbolic Distillation of Deep Neural Networks

链接: https://arxiv.org/abs/2602.21307
作者: Elizabeth S.Z. Tan,Adil Soubki,Miles Cranmer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symbolic distillation replaces neural networks, or components thereof, with interpretable, closed-form mathematical expressions. This approach has shown promise in discovering physical laws and mathematical relationships directly from trained deep learning models, yet adoption remains limited due to the engineering barrier of integrating symbolic regression into deep learning workflows. We introduce SymTorch, a library that automates this distillation by wrapping neural network components, collecting their input-output behavior, and approximating them with human-readable equations via PySR. SymTorch handles the engineering challenges that have hindered adoption: GPU-CPU data transfer, input-output caching, model serialization, and seamless switching between neural and symbolic forward passes. We demonstrate SymTorch across diverse architectures including GNNs, PINNs and transformer models. Finally, we present a proof-of-concept for accelerating LLM inference by replacing MLP layers with symbolic surrogates, achieving an 8.3% throughput improvement with moderate performance degradation.

[LG-59] Robust AI Evaluation through Maximal Lotteries

链接: https://arxiv.org/abs/2602.21297
作者: Hadi Khalaf,Serena L. Wang,Daniel Halpern,Itai Shapira,Flavio du Pin Calmon,Ariel D. Procaccia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the “better” of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. We introduce robust lotteries that optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries provide more reliable win rate guarantees across the annotator distribution and recover a stable set of top-performing models. By moving from rankings to pluralistic sets of winners, robust lotteries offer a principled step toward an ecosystem of complementary AI systems that serve the full spectrum of human preferences.

[LG-60] Neural network optimization strategies and the topography of the loss landscape

链接: https://arxiv.org/abs/2602.21276
作者: Jianneng Yu,Alexandre V. Morozov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages in the main text + 5 pages in the supplement. 6 figures + 1 table in the main text, 4 figures and 1 table in the supplement

点击查看摘要

Abstract:Neural networks are trained by optimizing multi-dimensional sets of fitting parameters on non-convex loss landscapes. Low-loss regions of the landscapes correspond to the parameter sets that perform well on the training data. A key issue in machine learning is the performance of trained neural networks on previously unseen test data. Here, we investigate neural network training by stochastic gradient descent (SGD) - a non-convex global optimization algorithm which relies only on the gradient of the objective function. We contrast SGD solutions with those obtained via a non-stochastic quasi-Newton method, which utilizes curvature information to determine step direction and Golden Section Search to choose step size. We use several computational tools to investigate neural network parameters obtained by these two optimization methods, including kernel Principal Component Analysis and a novel, general-purpose algorithm for finding low-height paths between pairs of points on loss or energy landscapes, FourierPathFinder. We find that the choice of the optimizer profoundly affects the nature of the resulting solutions. SGD solutions tend to be separated by lower barriers than quasi-Newton solutions, even if both sets of solutions are regularized by early stopping to ensure adequate performance on test data. When allowed to fit extensively on the training data, quasi-Newton solutions occupy deeper minima on the loss landscapes that are not reached by SGD. These solutions are less generalizable to the test data however. Overall, SGD explores smooth basins of attraction, while quasi-Newton optimization is capable of finding deeper, more isolated minima that are more spread out in the parameter space. Our findings help understand both the topography of the loss landscapes and the fundamental role of landscape exploration strategies in creating robust, transferrable neural network models.

[LG-61] INTACT: Intent-Aware Representation Learning for Cryptographic Traffic Violation Detection

链接: https://arxiv.org/abs/2602.21252
作者: Rahul D Ray
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Security monitoring systems typically treat anomaly detection as identifying statistical deviations from observed data distributions. In cryptographic traffic analysis, however, violations are defined not by rarity but by explicit policy constraints, including key reuse prohibition, downgrade prevention, and bounded key lifetimes. This fundamental mismatch limits the interpretability and adaptability of conventional anomaly detection methods. We introduce INTACT (INTent-Aware Cryptographic Traffic), a policy-conditioned framework that reformulates violation detection as conditional constraint learning. Instead of learning a static decision boundary over behavioral features, INTACT models the probability of violation conditioned on both observed behavior and declared security intent. The architecture factorizes representation learning into behavioral and intent encoders whose fused embeddings produce a violation score, yielding a policy-parameterized family of decision boundaries. We evaluate the framework on a real-world network flow dataset and a 210,000-trace synthetic multi-intent cryptographic dataset. INTACT matches or exceeds strong unsupervised and supervised baselines, achieving near-perfect discrimination (AUROC up to 1.0000) in the real dataset and consistent superiority in detecting relational and composite violations in the synthetic setting. These results demonstrate that explicit intent conditioning improves discrimination, interpretability, and robustness in cryptographic monitoring.

[LG-62] Exploiting Low-Rank Structure in Max-K-Cut Problems

链接: https://arxiv.org/abs/2602.20376
作者: Ria Stevens,Fangshuo Liao,Barbara Su,Jianqiang Li,Anastasios Kyrillidis
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:We approach the Max-3-Cut problem through the lens of maximizing complex-valued quadratic forms and demonstrate that low-rank structure in the objective matrix can be exploited, leading to alternative algorithms to classical semidefinite programming (SDP) relaxations and heuristic techniques. We propose an algorithm for maximizing these quadratic forms over a domain of size K that enumerates and evaluates a set of O\left(n^2r-1\right) candidate solutions, where n is the dimension of the matrix and r represents the rank of an approximation of the objective. We prove that this candidate set is guaranteed to include the exact maximizer when K=3 (corresponding to Max-3-Cut) and the objective is low-rank, and provide approximation guarantees when the objective is a perturbation of a low-rank matrix. This construction results in a family of novel, inherently parallelizable and theoretically-motivated algorithms for Max-3-Cut. Extensive experimental results demonstrate that our approach achieves performance comparable to existing algorithms across a wide range of graphs, while being highly scalable.

[LG-63] Probing the Geometry of Diffusion Models with the String Method

链接: https://arxiv.org/abs/2602.22122
作者: Elio Moreau,Florentin Coeurdoux,Grégoire Ferre,Eric Vanden-Eijnden
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the geometry of learned distributions is fundamental to improving and interpreting diffusion models, yet systematic tools for exploring their landscape remain limited. Standard latent-space interpolations fail to respect the structure of the learned distribution, often traversing low-density regions. We introduce a framework based on the string method that computes continuous paths between samples by evolving curves under the learned score function. Operating on pretrained models without retraining, our approach interpolates between three regimes: pure generative transport, which yields continuous sample paths; gradient-dominated dynamics, which recover minimum energy paths (MEPs); and finite-temperature string dynamics, which compute principal curves – self-consistent paths that balance energy and entropy. We demonstrate that the choice of regime matters in practice. For image diffusion models, MEPs contain high-likelihood but unrealistic ‘‘cartoon’’ images, confirming prior observations that likelihood maxima appear unrealistic; principal curves instead yield realistic morphing sequences despite lower likelihood. For protein structure prediction, our method computes transition pathways between metastable conformers directly from models trained on static structures, yielding paths with physically plausible intermediates. Together, these results establish the string method as a principled tool for probing the modal structure of diffusion models – identifying modes, characterizing barriers, and mapping connectivity in complex learned distributions.

[LG-64] MBD-ML: Many-body dispersion from machine learning for molecules and materials

链接: https://arxiv.org/abs/2602.22086
作者: Evgeny Moerman,Adil Kabylda,Almaz Khabibrakhmanov,Alexandre Tkatchenko
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 22 pages, 6 figures, Supplementary Information (12 figures)

点击查看摘要

Abstract:Van der Waals (vdW) interactions are essential for describing molecules and materials, from drug design and catalysis to battery applications. These omnipresent interactions must also be accurately included in machine-learned force fields. The many-body dispersion (MBD) method stands out as one of the most accurate and transferable approaches to capture vdW interactions, requiring only atomic C_6 coefficients and polarizabilities as input. We present MBD-ML, a pretrained message passing neural network that predicts these atomic properties directly from atomic structures. Through seamless integration with libMBD, our method enables the immediate calculation of MBD-inclusive total energies, forces, and stress tensors. By eliminating the need for intermediate electronic structure calculations, MBD-ML offers a practical and streamlined tool that simplifies the incorporation of state-of-the-art vdW interactions into any electronic structure code, as well as empirical and machine-learned force fields.

[LG-65] Coarsening Bias from Variable Discretization in Causal Functionals

链接: https://arxiv.org/abs/2602.22083
作者: Xiaxian Ou,Razieh Nabi
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A class of causal effect functionals requires integration over conditional densities of continuous variables, as in mediation effects and nonparametric identification in causal graphical models. Estimating such densities and evaluating the resulting integrals can be statistically and computationally demanding. A common workaround is to discretize the variable and replace integrals with finite sums. Although convenient, discretization alters the population-level functional and can induce non-negligible approximation bias, even under correct identification. Under smoothness conditions, we show that this coarsening bias is first order in the bin width and arises at the level of the target functional, distinct from statistical estimation error. We propose a simple bias-reduced functional that evaluates the outcome regression at within-bin conditional means, eliminating the leading term and yielding a second-order approximation error. We derive plug-in and one-step estimators for the bias-reduced functional. Simulations demonstrate substantial bias reduction and near-nominal confidence interval coverage, even under coarse binning. Our results provide a simple framework for controlling the impact of variable discretization on parameter approximation and estimation.

[LG-66] Learning Quantum Data Distribution via Chaotic Quantum Diffusion Model NEURIPS

链接: https://arxiv.org/abs/2602.22061
作者: Quoc Hoan Tran,Koki Chinzei,Yasuhiro Endo,Hirotaka Oshima
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 12 pages, 7 figures; extended version from Poster in Workshop: Machine Learning and the Physical Sciences this https URL

点击查看摘要

Abstract:Generative models for quantum data pose significant challenges but hold immense potential in fields such as chemoinformatics and quantum physics. Quantum denoising diffusion probabilistic models (QuDDPMs) enable efficient learning of quantum data distributions by progressively scrambling and denoising quantum states; however, existing implementations typically rely on circuit-based random unitary dynamics that can be costly to realize and sensitive to control imperfections, particularly on analog quantum hardware. We propose the chaotic quantum diffusion model, a framework that generates projected ensembles via chaotic Hamiltonian time evolution, providing a flexible and hardware-compatible diffusion mechanism. Requiring only global, time-independent control, our approach substantially reduces implementation overhead across diverse analog quantum platforms while achieving accuracy comparable to QuDDPMs. This method improves trainability and robustness, broadening the applicability of quantum generative modeling.

[LG-67] Scalable Kernel-Based Distances for Statistical Inference and Integration

链接: https://arxiv.org/abs/2602.21846
作者: Masha Naslidnyk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: PhD thesis

点击查看摘要

Abstract:Representing, comparing, and measuring the distance between probability distributions is a key task in computational statistics and machine learning. The choice of representation and the associated distance determine properties of the methods in which they are used: for example, certain distances can allow one to encode robustness or smoothness of the problem. Kernel methods offer flexible and rich Hilbert space representations of distributions that allow the modeller to enforce properties through the choice of kernel, and estimate associated distances at efficient nonparametric rates. In particular, the maximum mean discrepancy (MMD), a kernel-based distance constructed by comparing Hilbert space mean functions, has received significant attention due to its computational tractability and is favoured by practitioners. In this thesis, we conduct a thorough study of kernel-based distances with a focus on efficient computation, with core contributions in Chapters 3 to 6. Part I of the thesis is focused on the MMD, specifically on improved MMD estimation. In Chapter 3 we propose a theoretically sound, improved estimator for MMD in simulation-based inference. Then, in Chapter 4, we propose an MMD-based estimator for conditional expectations, a ubiquitous task in statistical computation. Closing Part I, in Chapter 5 we study the problem of calibration when MMD is applied to the task of integration. In Part II, motivated by the recent developments in kernel embeddings beyond the mean, we introduce a family of novel kernel-based discrepancies: kernel quantile discrepancies. These address some of the pitfalls of MMD, and are shown through both theoretical results and an empirical study to offer a competitive alternative to MMD and its fast approximations. We conclude with a discussion on broader lessons and future work emerging from the thesis. Comments: PhD thesis Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME) Cite as: arXiv:2602.21846 [stat.ML] (or arXiv:2602.21846v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.21846 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-68] Neural Learning of Fast Matrix Multiplication Algorithms: A StrassenNet Approach

链接: https://arxiv.org/abs/2602.21797
作者: Paolo Andreini,Alessandra Bernardi,Monica Bianchini,Barbara Toniella Corradini,Sara Marziali,Giacomo Nunziati,Franco Scarselli
类目: Algebraic Geometry (math.AG); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:Fast matrix multiplication can be described as searching for low-rank decompositions of the matrix–multiplication tensor. We design a neural architecture, \textscStrassenNet, which reproduces the Strassen algorithm for 2\times 2 multiplication. Across many independent runs the network always converges to a rank- 7 tensor, thus numerically recovering Strassen’s optimal algorithm. We then train the same architecture on 3\times 3 multiplication with rank r\in\19,\dots,23\ . Our experiments reveal a clear numerical threshold: models with r=23 attain significantly lower validation error than those with r\le 22 , suggesting that r=23 could actually be the smallest effective rank of the matrix multiplication tensor 3\times 3 . We also sketch an extension of the method to border-rank decompositions via an \varepsilon --parametrisation and report preliminary results consistent with the known bounds for the border rank of the 3\times 3 matrix–multiplication tensor. Comments: 16 pages, 5 figures Subjects: Algebraic Geometry (math.AG); Machine Learning (cs.LG) MSC classes: 15A69, 68Q17, 14N07 Cite as: arXiv:2602.21797 [math.AG] (or arXiv:2602.21797v1 [math.AG] for this version) https://doi.org/10.48550/arXiv.2602.21797 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-69] Goodness-of-Fit Tests for Latent Class Models with Ordinal Categorical Data

链接: https://arxiv.org/abs/2602.21572
作者: Huan Qing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 50 pages, 4 tables, 3 figures

点击查看摘要

Abstract:Ordinal categorical data are widely collected in psychology, education, and other social sciences, appearing commonly in questionnaires, assessments, and surveys. Latent class models provide a flexible framework for uncovering unobserved heterogeneity by grouping individuals into homogeneous classes based on their response patterns. A fundamental challenge in applying these models is determining the number of latent classes, which is unknown and must be inferred from data. In this paper, we propose one test statistic for this problem. The test statistic centers the largest singular value of a normalized residual matrix by a simple sample-size adjustment. Under the null hypothesis that the candidate number of latent classes is correct, its upper bound converges to zero in probability. Under an under-fitted alternative, the statistic itself exceeds a fixed positive constant with probability approaching one. This sharp dichotomous behavior of the test statistic yields two sequential testing algorithms that consistently estimate the true number of latent classes. Extensive experimental studies confirm the theoretical findings and demonstrate their accuracy and reliability in determining the number of latent classes.

[LG-70] How many asymmetric communities are there in multi-layer directed networks?

链接: https://arxiv.org/abs/2602.21569
作者: Huan Qing
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 44 pages, 4 tables, 2 figures

点击查看摘要

Abstract:Estimating the asymmetric numbers of communities in multi-layer directed networks is a challenging problem due to the multi-layer structures and inherent directional asymmetry, leading to possibly different numbers of sender and receiver communities. This work addresses this issue under the multi-layer stochastic co-block model, a model for multi-layer directed networks with distinct community structures in sending and receiving sides, by proposing a novel goodness-of-fit test. The test statistic relies on the deviation of the largest singular value of an aggregated normalized residual matrix from the constant 2. The test statistic exhibits a sharp dichotomy: Under the null hypothesis of correct model specification, its upper bound converges to zero with high probability; under underfitting, the test statistic itself diverges to infinity. With this property, we develop a sequential testing procedure that searches through candidate pairs of sender and receiver community numbers in a lexicographic order. The process stops at the smallest such pair where the test statistic drops below a decaying threshold. For robustness, we also propose a ratio-based variant algorithm, which detects sharp changes in the sequence of test statistics by comparing consecutive candidates. Both methods are proven to consistently determine the true numbers of sender and receiver communities under the multi-layer stochastic co-block model.

[LG-71] Reasoning -Driven Design of Single Atom Catalysts via a Multi-Agent Large Language Model Framework

链接: https://arxiv.org/abs/2602.21533
作者: Dong Hyeon Mok,Seoin Back,Victor Fung,Guoxiang Hu
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are becoming increasingly applied beyond natural language processing, demonstrating strong capabilities in complex scientific tasks that traditionally require human expertise. This progress has extended into materials discovery, where LLMs introduce a new paradigm by leveraging reasoning and in-context learning, capabilities absent from conventional machine learning approaches. Here, we present a Multi-Agent-based Electrocatalyst Search Through Reasoning and Optimization (MAESTRO) framework in which multiple LLMs with specialized roles collaboratively discover high-performance single atom catalysts for the oxygen reduction reaction. Within an autonomous design loop, agents iteratively reason, propose modifications, reflect on results and accumulate design history. Through in-context learning enabled by this iterative process, MAESTRO identified design principles not explicitly encoded in the LLMs’ background knowledge and successfully discovered catalysts that break conventional scaling relations between reaction intermediates. These results highlight the potential of multi-agent LLM frameworks as a powerful strategy to generate chemical insight and discover promising catalysts.

[LG-72] Fair Model-based Clustering AAAI2026

链接: https://arxiv.org/abs/2602.21509
作者: Jinwon Park,Kunwoong Kim,Jihu Lee,Yongdai Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: Accepted by AAAI 2026 (Main Track, Oral presentation)

点击查看摘要

Abstract:The goal of fair clustering is to find clusters such that the proportion of sensitive attributes (e.g., gender, race, etc.) in each cluster is similar to that of the entire dataset. Various fair clustering algorithms have been proposed that modify standard K-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size because the cluster assignment of each datum should be optimized simultaneously with the cluster center, and thus scaling up the algorithms is difficult. In this paper, we propose a new fair clustering algorithm based on a finite mixture model, called Fair Model-based Clustering (FMC). A main advantage of FMC is that the number of learnable parameters is independent of the sample size and thus can be scaled up easily. In particular, mini-batch learning is possible to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. Theoretical and empirical justifications for the superiority of the proposed algorithm are provided.

[LG-73] A Researchers Guide to Empirical Risk Minimization

链接: https://arxiv.org/abs/2602.21501
作者: Lars van der Laan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This guide develops high-probability regret bounds for empirical risk minimization (ERM). The presentation is modular: we state broadly applicable guarantees under high-level conditions and give tools for verifying them for specific losses and function classes. We emphasize that many ERM rate derivations can be organized around a three-step recipe – a basic inequality, a uniform local concentration bound, and a fixed-point argument – which yields regret bounds in terms of a critical radius, defined via localized Rademacher complexity, under a mild Bernstein-type variance–risk condition. To make these bounds concrete, we upper bound the critical radius using local maximal inequalities and metric-entropy integrals, recovering familiar rates for VC-subgraph, Sobolev/Hölder, and bounded-variation classes. We also review ERM with nuisance components – including weighted ERM and Neyman-orthogonal losses – as they arise in causal inference, missing data, and domain adaptation. Following the orthogonal learning framework, we highlight that these problems often admit regret-transfer bounds linking regret under an estimated loss to population regret under the target loss. These bounds typically decompose regret into (i) statistical error under the estimated (optimized) loss and (ii) approximation error due to nuisance estimation. Under sample splitting or cross-fitting, the first term can be controlled using standard fixed-loss ERM regret bounds, while the second term depends only on nuisance-estimation accuracy. We also treat the in-sample regime, where nuisances and the ERM are fit on the same data, deriving regret bounds and giving sufficient conditions for fast rates. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2602.21501 [stat.ML] (or arXiv:2602.21501v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.21501 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-74] Global Sequential Testing for Multi-Stream Auditing

链接: https://arxiv.org/abs/2602.21479
作者: Beepul Bharti,Ambar Pal,Jeremias Sulam
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Across many risk-sensitive areas, it is critical to continuously audit the performance of machine learning systems and detect any unusual behavior quickly. This can be modeled as a sequential hypothesis testing problem with k incoming streams of data and a global null hypothesis that asserts that the system is working as expected across all k streams. The standard global test employs a Bonferroni correction and has an expected stopping time bound of O\left(\ln\frack\alpha\right) when k is large and the significance level of the test, \alpha , is small. In this work, we construct new sequential tests by using ideas of merging test martingales with different trade-offs in expected stopping times under different, sparse or dense alternative hypotheses. We further derive a new, balanced test that achieves an improved expected stopping time bound that matches Bonferroni’s in the sparse setting but that naturally results in O\left(\frac1k\ln\frac1\alpha\right) under a dense alternative. We empirically demonstrate the effectiveness of our proposed tests on synthetic and real-world data.

[LG-75] Efficient Inference after Directionally Stable Adaptive Experiments

链接: https://arxiv.org/abs/2602.21478
作者: Zikai Shen,Houssam Zenati,Nathan Kallus,Arthur Gretton,Koulik Khamaru,Aurélien Bibaut
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 34 pages

点击查看摘要

Abstract:We study inference on scalar-valued pathwise differentiable targets after adaptive data collection, such as a bandit algorithm. We introduce a novel target-specific condition, directional stability, which is strictly weaker than previously imposed target-agnostic stability conditions. Under directional stability, we show that estimators that would have been efficient under i.i.d. data remain asymptotically normal and semiparametrically efficient when computed from adaptively collected trajectories. The canonical gradient has a martingale form, and directional stability guarantees stabilization of its predictable quadratic variation, enabling high-dimensional asymptotic normality. We characterize efficiency using a convolution theorem for the adaptive-data setting, and give a condition under which the one-step estimator attains the efficiency bound. We verify directional stability for LinUCB, yielding the first semiparametric efficiency guarantee for a regular scalar target under LinUCB sampling.

[LG-76] Unsupervised Discovery of Intermediate Phase Order in the Frustrated J_1-J_2 Heisenberg Model via Prometheus Framework

链接: https://arxiv.org/abs/2602.21468
作者: Brandon Yee,Wilson Collins,Maximilian Rutkowski
类目: rongly Correlated Electrons (cond-mat.str-el); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:The spin- 1/2 J_1 - J_2 Heisenberg model on the square lattice exhibits a debated intermediate phase between Néel antiferromagnetic and stripe ordered regimes, with competing theories proposing plaquette valence bond, nematic, and quantum spin liquid ground states. We apply the Prometheus variational autoencoder framework – previously validated on classical (2D, 3D Ising) and quantum (disordered transverse field Ising) phase transitions – to systematically explore the J_1 - J_2 phase diagram via unsupervised analysis of exact diagonalization ground states for a 4 \times 4 lattice. Through dense parameter scans of J_2/J_1 \in [0.3, 0.7] with step size 0.01 and comprehensive latent space analysis, we investigate the nature of the intermediate regime using unsupervised order parameter discovery and critical point detection via multiple independent methods. This work demonstrates the application of rigorously validated machine learning methods to open questions in frustrated quantum magnetism, where traditional order parameter identification is challenged by competing interactions and limited accessible system sizes.

[LG-77] ConformalHDC: Uncertainty-Aware Hyperdimensional Computing with Application to Neural Decoding

链接: https://arxiv.org/abs/2602.21446
作者: Ziyi Liang,Hamed Poursiami,Zhishun Yang,Keiland Cooper,Akhilesh Jaiswal,Maryam Parsa,Norbert Fortin,Babak Shahbaba
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperdimensional Computing (HDC) offers a computationally efficient paradigm for neuromorphic learning. Yet, it lacks rigorous uncertainty quantification, leading to open decision boundaries and, consequently, vulnerability to outliers, adversarial perturbations, and out-of-distribution inputs. To address these limitations, we introduce ConformalHDC, a unified framework that combines the statistical guarantees of conformal prediction with the computational efficiency of HDC. For this framework, we propose two complementary variations. First, the set-valued formulation provides finite-sample, distribution-free coverage guarantees. Using carefully designed conformity scores, it forms enclosed decision boundaries that improve robustness to non-conforming inputs. Second, the point-valued formulation leverages the same conformity scores to produce a single prediction when desired, potentially improving accuracy over traditional HDC by accounting for class interactions. We demonstrate the broad applicability of the proposed framework through evaluations on multiple real-world datasets. In particular, we apply our method to the challenging problem of decoding non-spatial stimulus information from the spiking activity of hippocampal neurons recorded as subjects performed a sequence memory task. Our results show that ConformalHDC not only accurately decodes the stimulus information represented in the neural activity data, but also provides rigorous uncertainty estimates and correctly abstains when presented with data from other behavioral states. Overall, these capabilities position the framework as a reliable, uncertainty-aware foundation for neuromorphic computing.

[LG-78] Efficient Uncoupled Learning Dynamics with tildeO!left(T-1/4right) Last-Iterate Convergence in Bilinear Saddle-Point Problems over Convex Sets under Bandit Feedback AISTATS2026

链接: https://arxiv.org/abs/2602.21436
作者: Arnab Maiti,Claire Jie Zhang,Kevin Jamieson,Jamie Heather Morgenstern,Ioannis Panageas,Lillian J. Ratliff
类目: Machine Learning (stat.ML); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 19 pages, Accepted at AISTATS 2026

点击查看摘要

Abstract:In this paper, we study last-iterate convergence of learning algorithms in bilinear saddle-point problems, a preferable notion of convergence that captures the day-to-day behavior of learning dynamics. We focus on the challenging setting where players select actions from compact convex sets and receive only bandit feedback. Our main contribution is the design of an uncoupled learning algorithm that guarantees last-iterate convergence to the Nash equilibrium with high probability. We establish a convergence rate of \tildeO(T^-1/4) up to polynomial factors in problem parameters. Crucially, our proposed algorithm is computationally efficient, requiring only an efficient linear optimization oracle over the players’ compact action sets. The algorithm is obtained by combining techniques from experimental design and the classic Follow-The-Regularized-Leader (FTRL) framework, with a carefully chosen regularizer function tailored to the geometry of the action set of each learner.

[LG-79] Conditional neural control variates for variance reduction in Bayesian inverse problems

链接: https://arxiv.org/abs/2602.21357
作者: Ali Siahkoohi,Hyunwoo Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian inference for inverse problems involves computing expectations under posterior distributions – e.g., posterior means, variances, or predictive quantities – typically via Monte Carlo (MC) estimation. When the quantity of interest varies significantly under the posterior, accurate estimates demand many samples – a cost often prohibitive for partial differential equation-constrained problems. To address this challenge, we introduce conditional neural control variates, a modular method that learns amortized control variates from joint model-data samples to reduce the variance of MC estimators. To scale to high-dimensional problems, we leverage Stein’s identity to design an architecture based on an ensemble of hierarchical coupling layers with tractable Jacobian trace computation. Training requires: (i) samples from the joint distribution of unknown parameters and observed data; and (ii) the posterior score function, which can be computed from physics-based likelihood evaluations, neural operator surrogates, or learned generative models such as conditional normalizing flows. Once trained, the control variates generalize across observations without retraining. We validate our approach on stylized and partial differential equation-constrained Darcy flow inverse problems, demonstrating substantial variance reduction, even when the analytical score is replaced by a learned surrogate.

[LG-80] Counterdiabatic Hamiltonian Monte Carlo

链接: https://arxiv.org/abs/2602.21272
作者: Reuben Cohn-Gordon,Uroš Seljak,Dries Sels
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Hamiltonian Monte Carlo (HMC) is a state of the art method for sampling from distributions with differentiable densities, but can converge slowly when applied to challenging multimodal problems. Running HMC with a time varying Hamiltonian, in order to interpolate from an initial tractable distribution to the target of interest, can address this problem. In conjunction with a weighting scheme to eliminate bias, this can be viewed as a special case of Sequential Monte Carlo (SMC) sampling \citedoucet2001introduction. However, this approach can be inefficient, since it requires slow change between the initial and final distribution. Inspired by \citesels2017minimizing, where a learned \emphcounterdiabatic term added to the Hamiltonian allows for efficient quantum state preparation, we propose \emphCounterdiabatic Hamiltonian Monte Carlo (CHMC), which can be viewed as an SMC sampler with a more efficient kernel. We establish its relationship to recent proposals for accelerating gradient-based sampling with learned drift terms, and demonstrate on simple benchmark problems.

附件下载

点击下载今日全部论文列表