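This post lists the latest papers retrieved from arXiv.org on 2026-02-25. It is updated automatically and grouped into six broad areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.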

Contents

Overview (2026-02-25)

517 papers were updated today, including:

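  • Natural Language Processing: 59 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 146 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 125 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 149 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 12 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 15 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 19 papers (Human-Computer Interaction (cs.HC))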

Multiagent Systems

[MA-0] A Micro-Macro Model of Encounter-Driven Information Diffusion in Robot Swarms

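[Quick read]: This paper addresses the problem of **Encounter-Driven Information Diffusion (EDID)**: how to spread information effectively when robots cannot anticipate when, where, or whom they will meet. The core difficulty is that communication opportunities cannot be scheduled, so conventional routing mechanisms break down. The key to the solution is a physics-inspired two-level model. At the micro level, a generalization of the "mean free path" concept characterizes information transfer in individual encounters; at the macro level, a model of the global diffusion dynamics provides a theoretical basis for designing storage and routing algorithms suited to different parameters (swarm size, communication range, environment size, and motion regime).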

Link: https://arxiv.org/abs/2602.21148
Authors: Davis S. Catherman, Carlo Pinciroli
Affiliations: Worcester Polytechnic Institute
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 10 pages, 5 figures, published at ANTS 2026

Abstract:In this paper, we propose the problem of Encounter-Driven Information Diffusion (EDID). In EDID, robots are allowed to exchange information only upon meeting. Crucially, EDID assumes that the robots are not allowed to schedule their meetings. As such, the robots have no means to anticipate when, where, and who they will meet. As a step towards the design of storage and routing algorithms for EDID, in this paper we propose a model of information diffusion that captures the essential dynamics of EDID. The model is derived from first principles and is composed of two levels: a micro model, based on a generalization of the concept of 'mean free path'; and a macro model, which captures the global dynamics of information diffusion. We validate the model through extensive robot simulations, in which we consider swarm size, communication range, environment size, and different random motion regimes. We conclude the paper with a discussion of the implications of this model on the algorithms that best support information diffusion according to the parameters of interest.
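To make the EDID setting concrete, here is a minimal simulation sketch. It is not the authors' model or simulator: the torus grid, the random-walk motion, the Chebyshev distance test, and every name (`simulate_encounter_diffusion`, `comm_range`, and so on) are assumptions chosen for illustration. It only shows how information can spread when robots exchange it on unscheduled encounters.

```python
import random


def simulate_encounter_diffusion(num_robots=10, grid_size=10, comm_range=1,
                                 max_steps=10_000, seed=0):
    """Return the number of informed robots after each step (index 0 = start).

    Illustrative assumptions: robots random-walk on a grid_size x grid_size
    torus, and two robots "meet" when their wrapped Chebyshev distance is at
    most comm_range.
    """
    rng = random.Random(seed)
    moves = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    positions = [(rng.randrange(grid_size), rng.randrange(grid_size))
                 for _ in range(num_robots)]
    informed = [False] * num_robots
    informed[0] = True  # robot 0 starts with the message
    history = [1]

    def wrapped(a, b):
        # Shortest distance between two coordinates on a ring.
        d = abs(a - b)
        return min(d, grid_size - d)

    for _ in range(max_steps):
        # Unscheduled motion: every robot takes an independent random step.
        positions = [((x + dx) % grid_size, (y + dy) % grid_size)
                     for (x, y), (dx, dy) in
                     ((p, rng.choice(moves)) for p in positions)]
        # Encounters: the message travels one hop per step, from robots that
        # were informed at the start of the step to robots within range.
        newly_informed = set()
        for i in range(num_robots):
            if not informed[i]:
                continue
            for j in range(num_robots):
                if informed[j]:
                    continue
                dist = max(wrapped(positions[i][0], positions[j][0]),
                           wrapped(positions[i][1], positions[j][1]))
                if dist <= comm_range:
                    newly_informed.add(j)
        for j in newly_informed:
            informed[j] = True
        history.append(sum(informed))
        if all(informed):
            break
    return history
```

Sweeping `num_robots`, `comm_range`, and `grid_size` in a sketch like this mirrors the parameters the abstract says were varied, although the motion regimes in the paper may differ from a simple grid random walk.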

[MA-1] Stability Under Valuation Updates in Coalition Formation

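[Quick read]: This paper studies additively separable hedonic games in which agents may change their valuations of one another, and asks how to efficiently find a stable coalition structure after each change while minimizing the reconfiguration cost. The core question is how to find a "nearby" stable solution after each valuation update rather than computing a new stable structure from scratch. The approach focuses on stability concepts based on single-agent deviations (Nash stability, individual stability, and their contractual variants). The paper shows that the problem is NP-complete in general and gives polynomial-time algorithms for contractual Nash stability and contractual individual stability under restricted symmetric valuations; these algorithms guarantee a bounded average distance over long sequences of updates.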

Link: https://arxiv.org/abs/2602.21041
Authors: Fabian Frank, Matija Novaković, René Romen
Affiliations: Technical University of Munich
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:Coalition formation studies how to partition a set of agents into disjoint coalitions under consideration of their preferences. We study the classical objective of stability in a variant of additively separable hedonic games where agents can change their valuations. Our objective is to find a stable partition after each change. To minimize the reconfiguration cost, we search for nearby stable coalition structures. Our focus is on stability concepts based on single-agent deviations. We present a detailed picture of the complexity of finding nearby stable coalition structures in additively separable hedonic games, for both symmetric and non-symmetric valuations. Our results show that the problem is NP-complete for Nash stability, individual stability, contractual Nash stability, and contractual individual stability. We complement these results by presenting polynomial-time algorithms for contractual Nash stability and contractual individual stability under restricted symmetric valuations. Finally, we show that these algorithms guarantee a bounded average distance over long sequences of updates.
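As a concrete reference for the stability notion involved, the sketch below checks Nash stability of a partition in an additively separable hedonic game using the standard textbook definition: no agent strictly prefers leaving its coalition to join another existing coalition or to be alone. The data layout and the function names (`utility`, `is_nash_stable`) are illustrative assumptions, and this is only a verifier for a given partition, not the paper's algorithm for finding nearby stable partitions.

```python
from typing import Dict, FrozenSet, List, Tuple

# valuations[(i, j)] is the value agent i assigns to agent j (default 0).
Valuations = Dict[Tuple[int, int], float]


def utility(agent: int, coalition: FrozenSet[int], valuations: Valuations) -> float:
    """Additively separable utility: sum of values for the other members."""
    return sum(valuations.get((agent, other), 0.0)
               for other in coalition if other != agent)


def is_nash_stable(partition: List[FrozenSet[int]], valuations: Valuations) -> bool:
    """True if no single agent gains by moving to another coalition or going alone."""
    for coalition in partition:
        for agent in coalition:
            current = utility(agent, coalition, valuations)
            # Deviation to a singleton coalition always yields utility 0.
            if current < 0.0:
                return False
            # Deviation to any other existing coalition.
            for target in partition:
                if target is coalition:
                    continue
                if utility(agent, target | {agent}, valuations) > current:
                    return False
    return True


# Symmetric example: agents 0 and 1 like each other, both dislike agent 2.
v = {(0, 1): 2.0, (1, 0): 2.0,
     (0, 2): -1.0, (2, 0): -1.0,
     (1, 2): -1.0, (2, 1): -1.0}
together = [frozenset({0, 1}), frozenset({2})]
all_alone = [frozenset({0}), frozenset({1}), frozenset({2})]
```

In this example `together` is Nash stable, while `all_alone` is not, because agent 0 would gain by joining agent 1. After a valuation update, one would re-run a check like this and, if it fails, search for a stable partition that moves as few agents as possible.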

[MA-2] Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

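[Quick read]: This paper addresses a key theoretical gap in multi-agent imitation learning (MA-IL): in the offline setting, there is no quantitative characterization of how far the learned policies are from a Nash equilibrium. Existing methods come with performance guarantees but do not bound the deviation from equilibrium, and the paper establishes impossibility and hardness results for learning low-exploitability policies in general $n$-player Markov games. It constructs examples in which even exact measure matching (exactly matching the experts' behavior distribution) fails to yield a low-exploitability policy, and it proves a new hardness result on characterizing the Nash gap given a fixed measure matching error. The key to the solution is a strategic dominance assumption on the expert equilibrium: when the experts play a dominant strategy equilibrium and the behavioral cloning error is $\epsilon_\text{BC}$, the Nash imitation gap is $\mathcal{O}\left(n\epsilon_\text{BC}/(1-\gamma)^2\right)$, where $\gamma$ is the discount factor. The paper generalizes this result through a new notion of best-response continuity and argues that standard regularization techniques implicitly encourage this property.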

Link: https://arxiv.org/abs/2602.21020
Authors: Antoine Bergerault, Volkan Cevher, Negar Mehr
Affiliations: EPFL; UC Berkeley
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\epsilon_\text{BC}$, this provides a Nash imitation gap of $\mathcal{O}\left(n\epsilon_\text{BC}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.
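The bound quoted in the abstract is easiest to read through its scaling. The snippet below evaluates only the leading term $n\epsilon_\text{BC}/(1-\gamma)^2$; the big-O notation hides a constant that the abstract does not give, so the returned number is a scale, not the actual gap. The function name `nash_gap_scale` is an illustrative assumption.

```python
def nash_gap_scale(n_players: int, eps_bc: float, gamma: float) -> float:
    """Leading term n * eps_BC / (1 - gamma)^2 of the quoted bound.

    The constant hidden by the big-O notation is not included.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return n_players * eps_bc / (1.0 - gamma) ** 2


# Two players, behavioral cloning error 0.01:
short_horizon = nash_gap_scale(2, 0.01, 0.9)   # about 2
long_horizon = nash_gap_scale(2, 0.01, 0.99)   # about 200
```

Moving the discount factor from 0.9 to 0.99 multiplies the term by 100, which shows how quickly a fixed cloning error can translate into exploitability as the effective horizon grows.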

[MA-3] Body-Reservoir Governance in Repeated Games: Embodied Decision-Making Dynamic Sentinel Adaptation and Complexity-Regularized Optimization

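[Quick read]: This paper targets the high computational cost of cooperation in classical game theory, particularly for embodied agents in repeated games, where conditional strategies such as Tit-for-Tat (TfT) require continuous computation that imposes physical costs. The core of the solution is a three-layer Body-Reservoir Governance (BRG) architecture. The first layer is a body reservoir, an echo state network that performs implicit inference over the interaction history and serves as both decision-maker and anomaly detector. The second layer is a cognitive filter that activates costly strategic tools on demand. The third layer is a metacognitive governance layer that controls the coupling between body and cognition through a tunable parameter $\alpha$. The key innovation is to treat cooperation as a fixed point of the reservoir state rather than the result of explicit computation, which lowers the strategy complexity cost (defined as the KL divergence between the reservoir's state distribution and its habituated baseline). A dynamic sentinel generates a composite discomfort signal that drives an adaptive $\alpha(t)$, staying near baseline during cooperation and dropping rapidly upon defection to trigger retaliation, and achieves higher payoff than static body governance, TfT, and EMA baselines.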

Link: https://arxiv.org/abs/2602.20846
Authors: Yuki Nakamura
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 44 pages, 9 figures, 2 tables

Abstract:Standard game theory explains cooperation in repeated games through conditional strategies such as Tit-for-Tat (TfT), but these require continuous computation that imposes physical costs on embodied agents. We propose a three-layer Body-Reservoir Governance (BRG) architecture: (1) a body reservoir (echo state network) whose $d$-dimensional state performs implicit inference over interaction history, serving as both decision-maker and anomaly detector, (2) a cognitive filter providing costly strategic tools activated on demand, and (3) a metacognitive governance layer with receptivity parameter $\alpha \in [0,1]$. At full body governance ($\alpha=1$), closed-loop dynamics satisfy a self-consistency equation: cooperation is expressed as the reservoir’s fixed point, not computed. Strategy complexity cost is defined as the KL divergence between the reservoir’s state distribution and its habituated baseline. Body governance reduces this cost, with action variance decreasing up to $1600\times$ with dimension $d$. A dynamic sentinel generates a composite discomfort signal from the reservoir’s own state, driving adaptive $\alpha(t)$: near baseline during cooperation, rapidly dropping upon defection to activate cognitive retaliation. Overriding the body incurs thermodynamic cost proportional to internal state distortion. The sentinel achieves the highest payoff across all conditions, outperforming static body governance, TfT, and EMA baselines. A dimension sweep ($d \in \{5,\ldots,100\}$) shows implicit inference scales with bodily richness ($23\times$ to $1600\times$ variance reduction), attributable to reservoir dynamics. A phase diagram in $(d, \tau_\mathrm{env})$ space reveals governance regime transitions near $d \approx 20$. The framework reinterprets cooperation as the minimum-dissipation response of an adapted dynamical system – emergent from embodied dynamics rather than computed.
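The "fixed point" language can be illustrated with a generic echo state network update. The sketch below is not the paper's BRG implementation: it uses the textbook update $x_{t+1} = \tanh(W x_t + W_\text{in} u_t)$, pure-Python lists, and hypothetical names (`make_reservoir`, `step`). The recurrent matrix is scaled so that its maximum absolute row sum is below 1, which is a sufficient (and deliberately conservative) condition for the state to forget its initial condition.

```python
import math
import random


def make_reservoir(d, input_dim, seed=0, target_norm=0.9):
    """Random recurrent and input weights for a d-dimensional reservoir.

    W is rescaled so its max absolute row sum equals target_norm (< 1), making
    the update a contraction in the max norm because tanh is 1-Lipschitz.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(d)]
    norm = max(sum(abs(w) for w in row) for row in W)
    W = [[w * target_norm / norm for w in row] for row in W]
    W_in = [[rng.uniform(-1.0, 1.0) for _ in range(input_dim)] for _ in range(d)]
    return W, W_in


def step(state, u, W, W_in):
    """One reservoir update: x' = tanh(W x + W_in u)."""
    return [math.tanh(sum(W[i][j] * state[j] for j in range(len(state)))
                      + sum(W_in[i][k] * u[k] for k in range(len(u))))
            for i in range(len(state))]


# Under a constant input (for example, a partner who always cooperates), two
# different initial states are driven to the same fixed point.
W, W_in = make_reservoir(d=8, input_dim=1, seed=1)
a, b = [1.0] * 8, [-1.0] * 8
for _ in range(300):
    a = step(a, [0.5], W, W_in)
    b = step(b, [0.5], W, W_in)
gap = max(abs(x - y) for x, y in zip(a, b))
```

Here `gap` shrinks toward zero, so the reservoir state under steady input is determined by the input history rather than by where it started. Reading an action off that settled state is one plausible interpretation of cooperation being "expressed as the reservoir's fixed point, not computed".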

[MA-4] Probing Dec-POMDP Reasoning in Cooperative MARL AAMAS2026

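[Quick read]: This paper asks whether current cooperative multi-agent reinforcement learning (MARL) benchmarks actually require Dec-POMDP (decentralised partially observable Markov decision process) reasoning, that is, agents using history to infer hidden states and coordinating based on local information. The study finds that success on existing benchmarks (MPE, SMAX, Overcooked, Hanabi, and MaBrax) rarely requires such reasoning; many successful policies instead rely on simple reactive behavior or brittle, synchronous action coupling. The key contribution is a diagnostic suite that combines statistically grounded performance comparisons with information-theoretic probes to audit the behavioral complexity of baseline policies (IPPO and MAPPO) across 37 scenarios, showing that under current training paradigms these benchmarks may lead to over-optimistic assessments of agents' cooperative abilities.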

Link: https://arxiv.org/abs/2602.20804
Authors: Kale-ab Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: To appear at the 25th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2026)

Abstract:Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.

[MA-5] Agile V: A Compliance-Ready Framework for AI-Augmented Engineering – From Concept to Audit-Ready Delivery

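[Quick read]: This paper addresses the lack of task-level verification and regulatory traceability in current AI-assisted engineering workflows, where compliance and audit completeness are hard to guarantee at machine-speed delivery. The key to the solution is the Agile V framework, which merges Agile iteration with V-Model verification into a continuous Infinity Loop and embeds independent verification and audit artifact generation into each task cycle. Specialized AI agents handle requirements, design, build, test, and compliance, governed by mandatory human approval gates. In the feasibility case study this yielded automatically generated audit-ready documentation, a 100% requirement-level pass rate, and only 6 prompts per cycle, with an estimated 10-50x cost reduction versus a COCOMO II baseline.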

Link: https://arxiv.org/abs/2602.20684
Authors: Christopher Koch, Joshua Andreas Wellbrock
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 2 figures

Abstract:Current AI-assisted engineering workflows lack a built-in mechanism to maintain task-level verification and regulatory traceability at machine-speed delivery. Agile V addresses this gap by embedding independent verification and audit artifact generation into each task cycle. The framework merges Agile iteration with V-Model verification into a continuous Infinity Loop, deploying specialized AI agents for requirements, design, build, test, and compliance, governed by mandatory human approval gates. We evaluate three hypotheses: (H1) audit-ready artifacts emerge as a by-product of development, (H2) 100% requirement-level verification is achievable with independent test generation, and (H3) verified increments can be delivered with single-digit human interactions per cycle. A feasibility case study on a Hardware-in-the-Loop system (about 500 LOC, 8 requirements, 54 tests) supports all three hypotheses: audit-ready documentation was generated automatically (H1), 100% requirement-level pass rate was achieved (H2), and only 6 prompts per cycle were required (H3), yielding an estimated 10-50x cost reduction versus a COCOMO II baseline (sensitivity range from pessimistic to optimistic assumptions). We invite independent replication to validate generalizability.

[MA-6] The Tragedy of the Commons in Multi-Population Resource Games

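[Quick read]: This paper studies the "tragedy of the commons" in multi-population resource games, where self-optimizing behavior leads to over-exploitation of a common resource. It considers a bi-level decision hierarchy: low-level agents belong to multiple distinct populations, and high-level agents make decisions that affect the choices of the populations they represent. The key result is a characterization of the unique symmetric Nash equilibrium of the high-level game and an analysis of its effect on the common resource: the equilibrium resource level degrades as the number of populations grows, but the resource is not necessarily depleted. The paper identifies the regions where the resource does and does not deplete.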

Link: https://arxiv.org/abs/2602.20603
Authors: Yamin Vahmian, Keith Paarporn
Affiliations: University of Colorado, Colorado Springs
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 8 pages, 3 figures. Accepted for presentation at the 2026 American Control Conference (ACC)

Abstract:Self-optimizing behaviors can lead to outcomes where collective benefits are ultimately destroyed, a well-known phenomenon known as the "tragedy of the commons". These scenarios are widely studied using game-theoretic approaches to analyze strategic agent decision-making. In this paper, we examine this phenomenon in a bi-level decision-making hierarchy, where low-level agents belong to multiple distinct populations, and high-level agents make decisions that impact the choices of the local populations they represent. We study strategic interactions in a context where the populations benefit from a common environmental resource that degrades with higher extractive efforts made by high-level agents. We characterize a unique symmetric Nash equilibrium in the high-level game, and investigate its consequences on the common resource. While the equilibrium resource level degrades as the number of populations grows large, there are instances where it does not become depleted. We identify such regions, as well as the regions where the resource does deplete.

[MA-7] AWCP: A Workspace Delegation Protocol for Deep-Engagement Collaboration across Remote Agents

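[Quick read]: This paper addresses the "context gap" that arises when autonomous agents based on large language models (LLMs) collaborate across isolated execution environments: an agent cannot directly manipulate files or invoke tools in a peer's environment and must fall back on costly, error-prone environment reconstruction. The key to the solution is the Agent Workspace Collaboration Protocol (AWCP). Drawing on the Unix philosophy that "everything is a file", AWCP uses temporary workspace delegation and decouples a lightweight control plane from pluggable transport mechanisms, so that a Delegator can project its workspace to a remote Executor, which then operates on the shared files directly with its unmodified local toolchain.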

Link: https://arxiv.org/abs/2602.20493
Authors: Xiaohang Nie, Zihan Guo, Youliang Chen, Yuanjian Zhou, Weinan Zhang
Affiliations: Harbin Institute of Technology; Shanghai Innovation Institute; Sun Yat-sen University; Tongji University; Shanghai Jiao Tong University
Subjects: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA)
Comments: 16 pages, 7 figures, tech report of Agent Workspace Collaboration Protocol

Abstract:The rapid evolution of Large Language Model (LLM)-based autonomous agents is reshaping the digital landscape toward an emerging Agentic Web, where increasingly specialized agents must collaborate to accomplish complex tasks. However, existing collaboration paradigms are constrained to message passing, leaving execution environments as isolated silos. This creates a context gap: agents cannot directly manipulate files or invoke tools in a peer’s environment, and must instead resort to costly, error-prone environment reconstruction. We introduce the Agent Workspace Collaboration Protocol (AWCP), which bridges this gap through temporary workspace delegation inspired by the Unix philosophy that everything is a file. AWCP decouples a lightweight control plane from pluggable transport mechanisms, allowing a Delegator to project its workspace to a remote Executor, who then operates on the shared files directly with unmodified local toolchains. We provide a fully open-source reference implementation with MCP tool integration and validate the protocol through live demonstrations of asymmetric collaboration, where agents with complementary capabilities cooperate through delegated workspaces. By establishing the missing workspace layer in the agentic protocol stack, AWCP paves the way for a universally interoperable agent ecosystem in which collaboration transcends message boundaries. The protocol and reference implementation are publicly available at this https URL.

[MA-8] HieraMAS: Optimizing Intra-Node LLM Mixtures and Inter-Node Topology for Multi-Agent Systems

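[Quick read]: This paper addresses a limitation of multi-agent systems (MAS) built on large language models (LLMs): existing methods usually optimize a single aspect (communication topology, role assignment, or LLM routing) and treat each agent as an indivisible unit, missing the opportunity to combine several heterogeneous LLMs within one agent to strengthen role-specific abilities. The key to the solution is HieraMAS, a hierarchical collaboration framework that jointly optimizes intra-node LLM mixtures and the inter-node communication topology through supernodes: each supernode implements a functional role with multiple heterogeneous LLMs in a propose-synthesis structure. A two-stage algorithm handles the resulting credit-assignment challenge. Stage one is multi-level reward attribution, which provides fine-grained feedback at the node level and the system level; stage two is graph classification for topology selection, which treats the choice of communication structure as a holistic decision rather than optimizing edges one by one. Experiments on reasoning and coding benchmarks show that HieraMAS substantially outperforms existing methods with better cost-performance trade-offs.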

Link: https://arxiv.org/abs/2602.20229
Authors: Tianjun Yao, Zhaoyi Li, Zhiqiang Shen
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: 22 pages, 13 tables

Abstract:Multi-agent systems (MAS) built on large language models (LLMs) have shown strong performance across many tasks. Most existing approaches improve only one aspect at a time, such as the communication topology, role assignment, or LLM routing, while treating each agent as a single, indivisible unit. This misses the opportunity to use mixtures of LLMs within an agent to strengthen role-specific abilities. We propose HieraMAS, a hierarchical collaboration framework that combines intra-node LLM mixtures with an inter-node communication topology. HieraMAS introduces supernodes, where each functional role is implemented by multiple heterogeneous LLMs using a propose-synthesis structure. Optimizing HieraMAS creates unique credit-assignment challenges: final task performance depends heavily on the underlying LLMs’ capabilities, which can lead reinforcement methods to incorrectly reward suboptimal configurations. To address this, we use a two-stage algorithm: (1) multi-level reward attribution, which provides fine-grained feedback at both the node level and the overall system level; (2) graph classification for topology selection, which treats choosing the communication structure as a holistic decision rather than optimizing edges one by one. Experiments on reasoning and coding benchmarks show that HieraMAS substantially outperforms existing methods while also delivering better cost-performance trade-offs.

[MA-9] Mitigating “Epistemic Debt” in Generative AI-Scaffolded Novice Programming using Metacognitive Scripts

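[Quick read]: This paper addresses "cognitive outsourcing" caused by generative AI in programming education: without guidance, novices rely on large language models (LLMs) to write code and fail to form their own schemas, accumulating "Epistemic Debt" and becoming "Fragile Experts" who appear productive but lack corrective competence. The core of the solution is a metacognitive intervention called the "Explanation Gate", in which a real-time LLM-as-a-Judge enforces a "Teach-Back" protocol that requires users to explain the logic of generated code before it can be integrated, shifting the learner from outsourcing Intrinsic Cognitive Load to merely offloading extraneous load. In the experiment, unrestricted AI users matched the productivity of the scaffolded group but had a 77% failure rate in a subsequent AI-Blackout maintenance task, compared with 39% in the scaffolded group.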

Link: https://arxiv.org/abs/2602.20206
Authors: Sreecharan Sankaranarayanan
Affiliations: Extuitive Inc. (Flagship Pioneering)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments:

Abstract:The democratization of Large Language Models (LLMs) has given rise to "Vibe Coding," a workflow where novice programmers prioritize semantic intent over syntactic implementation. While this lowers barriers to entry, we hypothesize that without pedagogical guardrails, it is fundamentally misaligned with cognitive skill acquisition. Drawing on the distinction between Cognitive Offloading and Cognitive Outsourcing, we argue that unrestricted AI encourages novices to outsource the Intrinsic Cognitive Load required for schema formation, rather than merely offloading Extraneous Load. This accumulation of "Epistemic Debt" creates "Fragile Experts" whose high functional utility masks critically low corrective competence. To quantify and mitigate this debt, we conducted a between-subjects experiment (N=78) using a custom Cursor IDE plugin backed by Claude 3.5 Sonnet. Participants represented "AI-Native" learners across three conditions: Manual (Control), Unrestricted AI (Outsourcing), and Scaffolded AI (Offloading). The Scaffolded condition utilized a novel "Explanation Gate," leveraging a real-time LLM-as-a-Judge framework to enforce a "Teach-Back" protocol before generated code could be integrated. Results reveal a "Collapse of Competence": while Unrestricted AI users matched the productivity of the Scaffolded group (p < .001 vs. Manual), they suffered a 77% failure rate in a subsequent AI-Blackout maintenance task, compared to only 39% in the Scaffolded group. Qualitative analysis suggests that successful vibe coders naturally engage in self-scaffolding, treating the AI as a consultant rather than a contractor. We discuss the implications for the maintainability of AI-generated software and propose that future learning systems must enforce Metacognitive Friction to prevent the mass production of unmaintainable code.

[MA-10] SMaRT: Online Reusable Resource Assignment and an Application to Mediation in the Kenyan Judiciary

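[Quick read]: This paper addresses the online assignment of mediators to cases in the Kenyan judiciary. The main challenges are that mediator quality is unknown, capacity is a soft constraint, the eligibility relation between tasks and resources is high-dimensional (87 geographic locations × 12 case types), and assignments must be made immediately while maximizing case completion. The key to the solution is a new algorithm, SMaRT (Selecting Mediators that are Right for the Task), which combines a quadratic program for assignment with a multi-agent bandit-style framework for learning resource quality, and which allows control over the trade-off between the strictness of capacity constraints and the case resolution rate.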

Link: https://arxiv.org/abs/2602.18431
Authors: Shafkat Farabi, Didac Marti Pinto, Wei Lu, Manuel Ramos-Maqueda, Sanmay Das, Antoine Deeb, Anja Sautmann
Affiliations: Virginia Tech; University College London; The World Bank
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Motivated by the problem of assigning mediators to cases in the Kenyan judiciary, we study an online resource allocation problem where incoming tasks (cases) must be immediately assigned to available, capacity-constrained resources (mediators). The resources differ in their quality, which may need to be learned. In addition, resources can only be assigned to a subset of tasks that overlaps to varying degrees with the subset of tasks other resources can be assigned to. The objective is to maximize task completion while satisfying soft capacity constraints across all the resources. The scale of the real-world problem poses substantial challenges, since there are over 2000 mediators and a multitude of combinations of geographic locations (87) and case types (12) that each mediator is qualified to work on. Together, these features, unknown quality of new resources, soft capacity constraints, and a high-dimensional state space, make existing scheduling and resource allocation algorithms either inapplicable or inefficient. We formalize the problem in a tractable manner using a quadratic program formulation for assignment and a multi-agent bandit-style framework for learning. We demonstrate the key properties and advantages of our new algorithm, SMaRT (Selecting Mediators that are Right for the Task), compared with baselines on stylized instances of the mediator allocation problem. We then consider its application to real-world data on cases and mediators from the Kenyan judiciary. SMaRT outperforms baselines and allows control over the tradeoff between the strictness of capacity constraints and overall case resolution rates, both in settings where mediator quality is known beforehand and in bandit-like settings where learning is part of the problem definition. On the strength of these results, we plan to run a randomized controlled trial with SMaRT in the judiciary in the near future.

[MA-11] Multimodal Multi-Agent Empowered Legal Judgment Prediction ICASSP

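[Quick read]: This paper addresses performance bottlenecks in Legal Judgment Prediction (LJP), where traditional methods struggle with multiple allegations, diverse evidence, and limited adaptability. The key to the solution is JurisMMA, a framework that decomposes trial tasks, standardizes processes, and organizes them into distinct stages. The authors also build JurisMM, a large multimodal dataset of over 100,000 Chinese judicial records (text and video-text data), which supports a more comprehensive evaluation; experiments on JurisMM and LawBench validate the framework's effectiveness.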

Link: https://arxiv.org/abs/2601.12815
Authors: Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu, Zhiyuan Feng, Yuan Wang, Simon Fong, Kaiyue Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: Accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026

Abstract:Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework’s effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

Natural Language Processing

[NLP-0] Aletheia tackles FirstProof autonomously

[Quick read]: This paper reports on autonomous problem solving for research-level mathematics. The key is Aletheia, a mathematics research agent powered by Gemini 3 Deep Think. In the inaugural FirstProof challenge, Aletheia autonomously solved 6 of 10 problems (2, 5, 7, 8, 9, 10) according to majority expert assessments; experts were not unanimous only on Problem 8.

Link: https://arxiv.org/abs/2602.21201
Authors: Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong
Affiliations: Google DeepMind
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 34 pages. Project page: this https URL

Abstract:We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at this https URL.

[NLP-1] Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

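[Quick read]: This paper addresses the lack of reflection in embodied LLMs deployed on robots: the models cannot identify what went wrong or learn from experience, so task execution becomes a series of independent trials rather than gradual improvement. The core of the solution is Reflective Test-Time Planning, which combines three reflection mechanisms: (1) reflection-in-action, which uses test-time scaling to generate and score multiple candidate actions before execution; (2) reflection-on-action, which uses external feedback after execution to update the internal reflection model and the action policy; and (3) retrospective reflection, which re-evaluates earlier decisions with hindsight for long-horizon credit assignment. Experiments on the Long-Horizon Household and MuJoCo Cupboard Fitting benchmarks show significant gains over baseline models, and ablation studies confirm the complementary roles of reflection-in-action and reflection-on-action.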

Link: https://arxiv.org/abs/2602.21198
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: *reflection-in-action*, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and *reflection-on-action*, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.

[NLP-2] On Data Engineering for Scaling LLM Terminal Capabilities

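Quick read: This paper addresses the opacity of the training-data strategies behind current large language model (LLM) terminal agents, which limits reproducibility and progress in this area. The core contribution is Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, together with Terminal-Corpus, a large-scale open-source dataset. On this basis, the authors systematically analyze key training strategies (data filtering, curriculum learning, long-context training, and scaling behavior) and train the Nemotron-Terminal model family, which improves substantially on Terminal-Bench 2.0 and shows how much data engineering contributes to terminal-agent capability.

Link: https://arxiv.org/abs/2602.21193
Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL)
Comments: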

Abstract:Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at this https URL.

[NLP-3] PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

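Quick read: This paper addresses the structured analysis of the patient voice (PV) in large volumes of patient-generated text such as secure messages, surveys, and interviews. Traditional qualitative coding is slow and does not scale, while existing machine learning and natural language processing (NLP) approaches usually treat patient-centered communication (PCC) and social determinants of health (SDoH) as separate tasks or rely on language models poorly suited to patient language. The proposed PVminer is a domain-adapted NLP framework that casts PV detection as a multi-label, multi-class prediction task. It combines patient-specific BERT encoders (PV-BERT-base and PV-BERT-large) with unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and injects topic representations during fine-tuning and inference to enrich the semantic input. PVminer predicts Code, Subcode, and Combo-level labels and outperforms biomedical and clinical pre-trained baselines, reaching F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo).

Link: https://arxiv.org/abs/2602.21165
Authors: Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree
Affiliations: Yale School of Medicine; Yale School of Public Health; Texas State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: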

Abstract:Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.

[NLP-4] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

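Quick read: This paper addresses inefficient exploration and unstable learning caused by inadequate reward design when large language models (LLMs) act as multi-step decision-making agents. Existing work explores reward shaping and step-level credit assignment but overlooks a key signal: the model's own intrinsic uncertainty. The proposed SELAUR framework combines entropy, least-confidence, and margin metrics into a fine-grained token-level uncertainty estimate and injects it into step-level and trajectory-level rewards, giving an uncertainty-aware reward reshaping that improves exploration efficiency and learning robustness.

Link: https://arxiv.org/abs/2602.21158
Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: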

Abstract:Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
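
As a rough illustration of how entropy-, least-confidence-, and margin-based signals could be merged into one token-level uncertainty and fed into a failure-aware reward, consider the sketch below. The equal weights, the normalisation of each signal to [0, 1], and the `bonus` parameter are assumptions made for this example; the abstract does not specify SELAUR's actual formulas.

```python
import math
from typing import Sequence


def token_uncertainty(
    probs: Sequence[float],
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Combine three uncertainty signals for one next-token distribution."""
    if len(probs) < 2:
        raise ValueError("need at least two outcomes")
    ranked = sorted(probs, reverse=True)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    norm_entropy = entropy / math.log(len(probs))   # 0 = certain, 1 = uniform
    least_confidence = 1.0 - ranked[0]              # 0 when the top token has all mass
    margin = 1.0 - (ranked[0] - ranked[1])          # 1 when the top two tokens tie
    w_e, w_l, w_m = weights
    return w_e * norm_entropy + w_l * least_confidence + w_m * margin


def step_uncertainty(token_dists: Sequence[Sequence[float]]) -> float:
    """Average token-level uncertainty over the tokens of one agent step."""
    return sum(token_uncertainty(d) for d in token_dists) / len(token_dists)


def reshape_reward(task_reward: float, uncertainty: float, bonus: float = 0.1) -> float:
    """Assumed shaping rule: failed steps earn a small bonus that grows with uncertainty."""
    if task_reward > 0.0:
        return task_reward
    return task_reward + bonus * uncertainty
```

Under this rule a failed but uncertain step still produces a learning signal, which is the intuition the abstract gives for using uncertainty in failed trajectories.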

[NLP-5] Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

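Quick read: This paper addresses automated fine-grained sentiment analysis over large volumes of unstructured customer reviews (here, restaurant reviews), where traditional methods hit efficiency limits and large language models (LLMs) are constrained by compute cost and scalability. The proposed hybrid approach uses an LLM (ChatGPT) to identify the key aspects and classic machine-learning classifiers to label sentiment at scale, which keeps accuracy while improving throughput and scalability. The framework was applied to 4.7 million historical reviews, and the machine-labeled aspects explain variance in overall ratings across dining-experience aspects, cuisines, and geographical regions.

Link: https://arxiv.org/abs/2602.21082
Authors: Vishal Patil, Shree Vaishnavi Bacha, Revanth Yamani, Yidan Sun, Mayank Kejriwal
Affiliations: University of Southern California; ISI (Information Sciences Institute)
Categories: Computation and Language (cs.CL)
Comments: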

Abstract:Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.

[NLP-6] VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation

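Quick read: This paper addresses the frequent hallucinations of Large Vision-Language Models (LVLMs), which limit their safe deployment. Existing LLM self-evaluation methods judge output correctness mainly from language priors and are therefore ill-suited to vision-conditioned predictions. The proposed VAUQ (Vision-Aware Uncertainty Quantification) framework introduces the Image-Information Score (IS), which explicitly measures how strongly the model's output depends on visual evidence, and an unsupervised core-region masking strategy that amplifies the contribution of salient regions. Combining predictive entropy with the core-masked IS gives a training-free scoring function that reliably reflects answer correctness, and experiments show it outperforms existing self-evaluation methods on multiple datasets.

Link: https://arxiv.org/abs/2602.21054
Authors: Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li
Affiliations: University of Wisconsin-Madison; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: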

Abstract:Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model’s ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model’s output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
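
One way to read the Image-Information Score is as an entropy difference: how much the answer distribution sharpens when the image (or its core region) is available compared with when it is masked. The sketch below follows that reading. The subtraction form of `self_eval_score` and the `alpha` weight are assumptions for illustration; the abstract does not state how VAUQ actually combines the two terms.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def image_information_score(
    p_with_image: Sequence[float], p_masked: Sequence[float]
) -> float:
    """Reduction in predictive entropy attributable to the visual input."""
    return entropy(p_masked) - entropy(p_with_image)


def self_eval_score(
    p_with_image: Sequence[float], p_masked: Sequence[float], alpha: float = 1.0
) -> float:
    """Assumed combination: reward reliance on the image, penalise residual entropy."""
    return alpha * image_information_score(p_with_image, p_masked) - entropy(p_with_image)


# A visually grounded answer: confident with the image, uncertain once it is masked.
grounded = self_eval_score([0.9, 0.05, 0.05], [1 / 3, 1 / 3, 1 / 3])
# A prior-driven answer: equally confident with or without the image.
prior_driven = self_eval_score([0.9, 0.05, 0.05], [0.9, 0.05, 0.05])
```

With these toy numbers the grounded answer scores higher than the prior-driven one, even though both are equally confident when the image is present.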

[NLP-7] Evaluating Proactive Risk Awareness of Large Language Models

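Quick read: This paper addresses the fact that current large language models (LLMs) only react to explicit harmful intent and lack the ability to anticipate latent but consequential risks. The core question is how to evaluate whether an LLM can foresee ecological impact and warn proactively, shifting safety from after-the-fact response to prevention. The authors build the Butterfly dataset, whose queries simulate everyday decisions that may cause hidden ecological harm, to quantify proactive risk awareness in the environmental and ecological domain, and they propose a systematic evaluation framework covering response length, language, and modality. The results expose clear blind spots in current safety alignment for areas such as species protection and point to the need for more proactive safeguards in real-world LLM deployment.

Link: https://arxiv.org/abs/2602.20976
Authors: Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu, Jing Li, Ruifeng Xu
Affiliations: The Harbin Institute of Technology, Shenzhen; The Hong Kong Polytechnic University, Hong Kong; Southern University of Science and Technology, Shenzhen; Shenzhen Loop Area Institute, Shenzhen, China
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: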

Abstract:As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.

[NLP-8] Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving

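Quick read: This paper addresses the over-reliance of current evaluations of large language model (LLM) mathematical reasoning on linear-reasoning datasets, which neglect more complex forms such as proof by contradiction and proof by cases. The authors build PC-FOL, a new first-order logic (FOL) dataset annotated by professional mathematicians that focuses on case-based reasoning and provides a manually written natural-language proof for every instance, clearly separating it from conventional linear-reasoning datasets. Experiments show that LLMs perform markedly worse on case-based reasoning than on linear reasoning, and a theoretical analysis based on a graphical model explains the gap, identifying core challenges for automated natural-language proof generation.

Link: https://arxiv.org/abs/2602.20973
Authors: Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: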

Abstract:To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other parts such as proof by contradiction and proof by cases, which are crucial for investigating LLMs’ reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems. All instances in this dataset are equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical model, which provides an explanation for the observed disparity between the two types of reasoning problems. We hope this work can reveal the core challenges in the field of automated natural language mathematical proof generation, paving the way for future research.
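
The difference between the two proof styles can be made concrete with a pair of small Lean 4 theorems. They are propositional toy examples written for this digest, not instances from PC-FOL, and the theorem names are arbitrary.

```lean
-- Linear reasoning: each step feeds directly into the next one.
theorem linear_chain (P Q R : Prop) (hpq : P → Q) (hqr : Q → R) (hp : P) : R :=
  hqr (hpq hp)

-- Proof by cases: the goal must be established separately in every branch
-- of the disjunction before it can be concluded.
theorem case_split (P Q R : Prop) (h : P ∨ Q) (hpr : P → R) (hqr : Q → R) : R :=
  match h with
  | Or.inl hp => hpr hp
  | Or.inr hq => hqr hq
```

In the second theorem the prover has to hold two independent sub-arguments together, which is the branching structure the paper reports as harder for LLMs.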

[NLP-9] Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models

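Quick read: This paper asks how well current large language models (LLMs) grasp linguistic structure and systematic regularities: whether they identify linguistic objects and their properties, whether they detect and use systematic patterns across sentences, and whether their errors are mainly linguistic or reasoning errors. The key contribution is a new language task, Blackbird Language Matrices (BLM), structured at multiple levels: within each sentence, across the input sequence, and within each candidate answer. BLM datasets are curated but naturalistic, with a clear annotated structure and interpretable learning contexts, which supports multi-faceted analysis of LLM behavior. The results show that the task, though challenging, can be solved at good performance levels by simple baseline models and at better levels by tailored models, that the models' internal representations capture the relevant grammatical objects and attributes, and that solutions rely on detecting systematic patterns across sentences. BLMs therefore offer an interpretable testbed for studying linguistic competence and reasoning in LLMs.

Link: https://arxiv.org/abs/2602.20966
Authors: Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase
Affiliations: Idiap Research Institute; University of Geneva
Categories: Computation and Language (cs.CL)
Comments: Under review, 46 pages, 5 tables, 28 figures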

Abstract:This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated, but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solve a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall in the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.

[NLP-10] The Art of Efficient Reasoning: Data, Reward, and Optimization

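Quick read: This paper addresses the heavy computational overhead of Chain-of-Thought (CoT) reasoning in large language models (LLMs). The goal is efficient reasoning: shorter thinking trajectories at the same accuracy. The approach is reward shaping with Reinforcement Learning (RL), and the study shows that training proceeds in two stages, length adaptation followed by reasoning refinement. The key finding is that training on relatively easy prompts keeps positive reward signals dense and so avoids length collapse, and that the learned length bias generalizes across domains.

Link: https://arxiv.org/abs/2602.20945
Authors: Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong
Affiliations: The University of Hong Kong; Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Tech Report, Insights on Efficient Reasoning via Reward Shaping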

Abstract:Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.
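
The two fine-grained metrics the abstract advocates (response length conditioned on correctness, and accuracy across token budgets) could be computed along these lines. The rule that a response longer than the budget counts as wrong is an assumption of this sketch, and the rollouts are made-up numbers, not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Rollout = Tuple[int, bool]  # (response length in tokens, final answer was correct)


def length_by_correctness(rollouts: List[Rollout]) -> Dict[str, Optional[float]]:
    """Mean response length, reported separately for correct and incorrect answers."""
    correct = [n for n, ok in rollouts if ok]
    wrong = [n for n, ok in rollouts if not ok]
    return {
        "correct": mean(correct) if correct else None,
        "incorrect": mean(wrong) if wrong else None,
    }


def accuracy_under_budget(rollouts: List[Rollout], budget: int) -> float:
    """Accuracy when any response longer than `budget` tokens is treated as wrong."""
    if not rollouts:
        raise ValueError("no rollouts to evaluate")
    return sum(ok and n <= budget for n, ok in rollouts) / len(rollouts)


# Illustrative rollouts only.
rollouts = [(1500, True), (3000, True), (9000, False), (20000, True)]
budgets = [2048, 4096, 8192, 16384, 32768]
curve = {b: accuracy_under_budget(rollouts, b) for b in budgets}
```

Plotting `curve` against the budget shows where a model's accuracy saturates, which a single fixed-budget accuracy number hides.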

[NLP-11] Predicting Sentence Acceptability Judgments in Multimodal Contexts

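Quick read: This paper studies how visual context affects human sentence acceptability judgments and the ability of large language models (LLMs) to predict them, in order to compare human and LLM sentence processing in multimodal settings. The experiments compare human and LLM judgments with and without visual images. Visual context has little if any effect on human judgments, whereas LLMs show the compression effect previously observed for human judgments in document contexts, and their prediction accuracy is slightly better when visual context is removed. LLM predictions correlate highly with their normalised log probabilities, but the correlation drops when visual context is present, suggesting a wider gap between internal representations and generated predictions in that setting.

Link: https://arxiv.org/abs/2602.20918
Authors: Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: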

Abstract:Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a higher gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
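
The comparison between generated ratings and normalised log probabilities can be illustrated with a length-normalised sentence score and a Pearson correlation. Dividing by sentence length is only one common normalisation and is an assumption here; the abstract does not say which normalisation the authors use.

```python
import math
from typing import Sequence


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """Length-normalised sentence score: average log probability per token."""
    return sum(token_log_probs) / len(token_log_probs)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Given one `mean_log_prob` value and one generated rating per sentence, `pearson` over the two lists yields the kind of correlation the abstract compares across conditions with and without visual context.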

[NLP-12] Exa-PSD: a new Persian sentiment analysis dataset on Twitter

Quick read: This paper addresses the shortage of sentiment analysis datasets for Persian social media text. On Twitter, users often rely on irony and colloquial expressions, while existing Persian corpora mostly cover specific domains (products, food, hotels) and do not reflect real social usage. The authors build a large, high-quality Persian Twitter dataset, the Exa Sentiment Analysis Persian Dataset, containing 12,000 tweets labeled as positive, neutral, or negative by 5 native speakers. Evaluating the dataset with the pre-trained models ParsBERT and RoBERTa yields a Macro F-score of 79.87, supporting the usefulness of the dataset and models for Persian sentiment analysis.

Link: https://arxiv.org/abs/2602.20892
Authors: Seyed Himan Ghaderi, Saeed Sarbazi Azad, Mohammad Mehdi Jaziriyan, Ahmad Akbari
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: This is the original submitted (preprint) version of a paper published in Language Resources and Evaluation. The final published version is available at Springer via DOI: this https URL

Abstract:Today, social networks such as Twitter are the most widely used platforms for communication. Analyzing this data yields useful information for recognizing the opinions people express in tweets. Sentiment analysis plays a vital role in NLP by identifying the opinion of individuals about a specific topic. Natural language processing in Persian has many challenges despite the advent of strong language models. The datasets available in Persian generally cover special topics such as products, foods, and hotels, while users may use irony and colloquial phrases on social media. To overcome these challenges, there is a need for a dataset of Persian sentiment analysis on Twitter. In this paper, we introduce the Exa sentiment analysis Persian dataset, which is collected from Persian tweets. This dataset contains 12,000 tweets, annotated by 5 native Persian taggers. The data is labeled in 3 classes: positive, neutral and negative. We present the characteristics and statistics of this dataset and use the pre-trained ParsBERT and RoBERTa as the base models to evaluate this dataset. Our evaluation reached a 79.87 Macro F-score, which shows the model and data can be adequately valuable for a sentiment analysis system.
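
For readers unfamiliar with the reported metric, a macro F-score over the three sentiment classes is the unweighted mean of the per-class F1 scores. The sketch below shows the standard computation on a few made-up labels; it is not the authors' evaluation script.

```python
from typing import Sequence


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = ("positive", "neutral", "negative"),
) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Made-up example: one positive tweet is mislabelled as neutral.
gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]
score = macro_f1(gold, pred)  # per-class F1: 2/3, 2/3, 1
```

The function returns a value in [0, 1]; the 79.87 in the abstract is presumably the same quantity expressed as a percentage.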

[NLP-13] FinAnchor: Aligned Multi-Model Representations for Financial Prediction

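Quick read: This paper addresses financial prediction from long financial documents, where useful signals are sparse and easily buried in noise, and where the best large language model (LLM) for producing embeddings differs across tasks and time periods. The proposed FinAnchor (Financial Anchored Representations) is a lightweight framework that selects an anchor embedding space, learns linear mappings that align the heterogeneous representations from multiple LLMs into that space, and aggregates the aligned features into a unified representation for downstream prediction. It fuses embeddings across models and improves performance without fine-tuning the underlying models.

Link: https://arxiv.org/abs/2602.20859
Authors: Zirui He, Huopu Zhang, Yanguang Liu, Sirui Wu, Mengnan Du
Affiliations: New Jersey Institute of Technology; Georgia Institute of Technology; Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, 5 tables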

Abstract:Financial prediction from long documents involves significant challenges, as actionable signals are often sparse and obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods. In this paper, we propose FinAnchor (Financial Anchored Representations), a lightweight framework that integrates embeddings from multiple LLMs without fine-tuning the underlying models. FinAnchor addresses the incompatibility of feature spaces by selecting an anchor embedding space and learning linear mappings to align representations from other models into this anchor. These aligned features are then aggregated to form a unified representation for downstream prediction. Across multiple financial NLP tasks, FinAnchor consistently outperforms strong single-model baselines and standard ensemble methods, demonstrating the effectiveness of anchoring heterogeneous representations for robust financial prediction.
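
The align-then-aggregate idea can be sketched with an ordinary least-squares map from one model's embedding space into the anchor space, followed by averaging. Everything below is a toy under stated assumptions: the ridge term, the averaging rule in `fuse`, and the tiny 2-d and 3-d "embeddings" are illustrative choices, since the abstract only says that linear mappings are learned and aligned features are aggregated.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_alignment(source: Matrix, anchor: Matrix, ridge: float = 1e-6) -> Matrix:
    """Least-squares W minimising ||source @ W - anchor||^2 plus a small ridge penalty."""
    st = transpose(source)
    gram = matmul(st, source)
    for i in range(len(gram)):
        gram[i][i] += ridge
    return solve(gram, matmul(st, anchor))


def fuse(anchor_emb: Matrix, aligned: List[Matrix]) -> Matrix:
    """Assumed aggregation: average the anchor and aligned embeddings per document."""
    stacks = [anchor_emb] + aligned
    return [[sum(vals) / len(stacks) for vals in zip(*rows)] for rows in zip(*stacks)]


# Toy data: three documents, a 2-d source model and a 3-d anchor model.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
anchor = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [2.0, 3.0, 2.0]]
W = fit_alignment(source, anchor)
aligned = matmul(source, W)
fused = fuse(anchor, [aligned])
```

Because the mapping is fitted on frozen embeddings, none of the underlying LLMs needs to be updated, which matches the "without fine-tuning" claim in the abstract.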

[NLP-14] Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

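Quick read: This paper addresses a weakness of the standard Kullback-Leibler (KL) divergence as the learning signal in knowledge distillation: it is dominated by the teacher's highest-probability predictions (its modes), so the low-probability but potentially informative tail of the distribution is neglected. The authors propose a tail-aware divergence that decouples the contribution of the teacher's top-K predicted probabilities from that of the lower-probability predictions. It keeps the same computational profile as KL divergence while reducing the influence of the teacher modes, which increases the tail's contribution to student training and improves distillation.

Link: https://arxiv.org/abs/2602.20816
Authors: Sayantan Dasgupta, Trevor Cohn, Timothy Baldwin
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: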

Abstract:The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher’s modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model’s top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
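
To see why the tail gets little weight under plain KL, and what decoupling might look like, the sketch below uses the chain rule for KL divergence: KL(teacher || student) splits exactly into a top-K mass term, a head term weighted by the teacher's head mass, and a tail term weighted by the teacher's (usually tiny) tail mass. Replacing that tail weight with a tunable `tail_weight` is an assumption of this example, not the divergence defined in the paper, whose formula is not given in the abstract.

```python
import math
from typing import Optional, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def decoupled_topk_divergence(
    teacher: Sequence[float],
    student: Sequence[float],
    k: int,
    tail_weight: Optional[float] = None,
) -> float:
    """Head/tail decomposition of KL with an optional override of the tail weight."""
    if not 0 < k < len(teacher):
        raise ValueError("k must leave at least one token in the head and in the tail")
    order = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)
    head, tail = order[:k], order[k:]
    t_head = sum(teacher[i] for i in head)
    s_head = sum(student[i] for i in head)
    t_tail = sum(teacher[i] for i in tail)
    s_tail = sum(student[i] for i in tail)
    # Binary KL between the total top-K masses of teacher and student.
    mass_term = kl_divergence([t_head, t_tail], [s_head, s_tail])
    # KL between the renormalised head and tail distributions.
    head_term = kl_divergence(
        [teacher[i] / t_head for i in head], [student[i] / s_head for i in head]
    )
    tail_term = kl_divergence(
        [teacher[i] / t_tail for i in tail], [student[i] / s_tail for i in tail]
    )
    w_tail = t_tail if tail_weight is None else tail_weight
    return mass_term + t_head * head_term + w_tail * tail_term


teacher = [0.7, 0.2, 0.06, 0.04]
student = [0.6, 0.25, 0.1, 0.05]
natural = decoupled_topk_divergence(teacher, student, k=2)               # equals plain KL
boosted = decoupled_topk_divergence(teacher, student, k=2, tail_weight=1.0)
```

With the natural weight the function reproduces ordinary KL, which makes it easy to check; raising `tail_weight` increases how strongly mismatches in the tail are penalised. The sketch assumes both distributions are strictly positive.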

[NLP-15] Overton Pluralistic Reinforcement Learning for Large Language Models

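Quick read: This paper addresses the limited ability of current alignment paradigms to capture the pluralistic nature of human values: existing methods struggle to produce responses that cover the range of genuine human viewpoints. The proposed OP-GRPO (Overton Pluralistic Group Relative Policy Optimization) is a reinforcement learning framework for implicit Overton Pluralism. It fine-tunes a Sentence Transformer into a similarity estimator and embeds it in a dual-reward system that rewards both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. A single large language model can then generate responses with high perspective coverage without explicit prompting or modular orchestration.

Link: https://arxiv.org/abs/2602.20759
Authors: Yu Fu, Seongho Son, Ilija Bogunovic
Affiliations: University College London; University of Basel
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 8 figures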

Abstract:Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a “small models, big perspective coverage” effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
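
A dual reward of the kind described (coverage of reference perspectives plus uniqueness among generated ones) might be computed as below, followed by the group-relative normalisation that gives GRPO its name. The cosine threshold, the equal weighting of the two terms, and the 2-d toy vectors are assumptions for illustration; in the paper the similarity comes from a fine-tuned Sentence Transformer, which is not reproduced here.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_reward(
    generated: List[Vector], references: List[Vector], threshold: float = 0.8
) -> float:
    """Coverage of reference perspectives plus uniqueness of generated perspectives."""
    if not generated or not references:
        return 0.0
    covered = sum(
        any(cosine(ref, gen) >= threshold for gen in generated) for ref in references
    )
    unique = sum(
        all(cosine(gen, prev) < threshold for prev in generated[:i])
        for i, gen in enumerate(generated)
    )
    return covered / len(references) + unique / len(generated)


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Standardise rewards within one group of responses to the same query."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma if sigma else 0.0 for r in rewards]


references = [[1.0, 0.0], [0.0, 1.0]]       # two distinct human perspectives
repetitive = [[1.0, 0.0], [1.0, 0.0]]       # same perspective stated twice
pluralistic = [[1.0, 0.0], [0.0, 1.0]]      # both perspectives, no duplicates
rewards = [dual_reward(repetitive, references), dual_reward(pluralistic, references)]
advantages = group_relative_advantages(rewards)
```

The repetitive response is penalised twice, once for missing a perspective and once for duplicating one, so the pluralistic response receives the positive advantage.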

[NLP-16] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

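Quick read: This paper addresses the design of aligned and robust rewards for open-ended generation, a key bottleneck in reinforcement learning (RL) post-training. Existing rubric-based methods have limits: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics saturate and drift, which enables reward hacking. The proposed SibylSense is an inference-time learning method that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. The memory is updated using verifier-based rewards derived from the discriminative gap between reference and candidate answers on a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces candidates satisfying the current rubrics, shrinking the gap and pushing the rubric generator to capture new quality dimensions, which improves downstream RL performance.

Link: https://arxiv.org/abs/2602.20751
Authors: Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes, Rakshanda Agarwal, Saeid Asgari, Adam Atkinson, Emre Kıcıman, Songwu Lu, Ranveer Chandra, Tusher Chakraborty
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: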

Abstract:Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.
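
The memory-update idea (keep rubric items that still separate reference answers from candidate answers) can be sketched as follows. The `min_gap` threshold, the dictionary-based memory, and the keyword verifier are hypothetical simplifications; the paper uses an LLM-based verifier and a rubric generator that are not modelled here.

```python
from typing import Callable, Dict, List

Verifier = Callable[[str, str], float]  # (rubric item, answer) -> score in [0, 1]


def discriminative_gap(
    item: str, verifier: Verifier, references: List[str], candidates: List[str]
) -> float:
    """Mean verifier score on reference answers minus mean score on candidates."""
    ref = sum(verifier(item, a) for a in references) / len(references)
    cand = sum(verifier(item, a) for a in candidates) / len(candidates)
    return ref - cand


def update_memory(
    memory: Dict[str, float],
    proposed_items: List[str],
    verifier: Verifier,
    references: List[str],
    candidates: List[str],
    min_gap: float = 0.2,
) -> Dict[str, float]:
    """Keep items whose gap is still large; drop items the candidates now satisfy."""
    updated = dict(memory)
    for item in proposed_items:
        gap = discriminative_gap(item, verifier, references, candidates)
        if gap >= min_gap:
            updated[item] = gap
        else:
            updated.pop(item, None)
    return updated


# Hypothetical keyword verifier standing in for an LLM judge.
KEYWORDS = {"cites a source": "according to", "is polite": "please"}


def toy_verifier(item: str, answer: str) -> float:
    return float(KEYWORDS[item] in answer)


references = ["according to the 2020 census, please note the figure is approximate"]
candidates = ["please note the figure is large"]
memory = update_memory({}, list(KEYWORDS), toy_verifier, references, candidates)
```

Here "is polite" no longer discriminates, so it is dropped, while "cites a source" stays in memory. As the policy learns to satisfy the remaining items, their gaps shrink too, which is what drives the search for new rubric dimensions.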

[NLP-17] Explicit Grammar Semantic Feature Fusion for Robust Text Classification

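Quick read: This paper addresses the dependence of existing natural language processing (NLP) models on compute-intensive Transformer architectures, which makes them hard to deploy in resource-constrained environments such as edge devices. The proposed lightweight text classification framework explicitly encodes sentence-level grammatical structure (syntactic composition, phrase patterns, and complexity indicators) into a compact grammar vector and fuses it with frozen contextual embeddings to form a unified feature representation. Grammar is treated as an explicit inductive bias rather than a learnable module, which avoids fully parameterised Transformers or deep architectures and reduces model complexity. The approach outperforms baseline models by 2%-15% across domains and targets edge computing scenarios.

Link: https://arxiv.org/abs/2602.20749
Authors: Azrin Sultana, Firoz Ahmed
Affiliations: American International University-Bangladesh
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 9 figures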

Abstract:Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to fully parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements are unified into a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory (LSTMs), BiLSTMs, and transformer-based BERT and XLNET were used to train and evaluate the model, with the number of epochs varied. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15%, enabling more effective learning across heterogeneous domains. Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices.

[NLP-18] Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization

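[Quick Read]: This paper addresses the difficulty of adapting the balance between privacy protection and utility in text anonymization to different domains, privacy objectives, and downstream applications. Existing methods rely on static, manually designed strategies that lack flexibility and generalize poorly. The key to the solution is a new task formulation, adaptive text anonymization, in which a task-specific prompt optimization framework automatically constructs anonymization instructions for language models and adapts them to diverse privacy-utility requirements. On a benchmark of five cross-domain datasets, the framework consistently outperforms baselines, remains computationally efficient, reaches performance comparable to larger closed-source models when run on open-source models, and discovers new anonymization strategies that explore the privacy-utility trade-off frontier.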

Link: https://arxiv.org/abs/2602.20743
Authors: Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi
Affiliations: Hornetsecurity; Univ. Lille; Inria; CNRS; Centrale Lille; CRIStAL
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization strategies are automatically adapted to specific privacy-utility requirements. We propose a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns. To evaluate our approach, we present a benchmark spanning five datasets with diverse domains, privacy constraints, and utility objectives. Across all evaluated settings, our framework consistently achieves a better privacy-utility trade-off than existing baselines, while remaining computationally efficient and effective on open-source language models, with performance comparable to larger closed-source models. Additionally, we show that our method can discover novel anonymization strategies that explore different points along the privacy-utility trade-off frontier.

[NLP-19] ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition

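[Quick Read]: This paper addresses a trade-off common to Parameter-Efficient Fine-Tuning (PEFT): when fine-tuning Large Language Models (LLMs), widely used methods such as LoRA still carry a large number of trainable parameters, while lowering the rank to reduce that count markedly degrades performance in complex multi-task settings. The key to the solution is ID-LoRA, whose core innovation is to extract clustered parameter groups from the pretrained weight matrix and use them to form multiple low-rank components that all share a single initialized trainable low-rank matrix. This cuts the number of trainable parameters substantially while preserving model capacity.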

Link: https://arxiv.org/abs/2602.20727
Authors: Xindian Ma, Rundong Kong, Peng Zhang, Ruoxiang Huang, Yongyu Jiang
Affiliations: Tianjin University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that equips Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation lies in extracting and reusing clustered parameter groups from the pretrained weight matrix. These groups are then used to form multiple low-rank components, all of which share only a single initialized trainable low-rank matrix. This approach cuts the number of trainable parameters while keeping the model’s capacity intact. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment. ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than the standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants (e.g., DoRA and HydraLoRA) on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.

[NLP-20] Counterfactual Simulation Training for Chain-of-Thought Faithfulness

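[Quick Read]: This paper addresses the faithfulness problem in Chain-of-Thought (CoT) reasoning: the CoT a model produces may not reflect the real basis for its decision, which misleads attempts to understand why the model produced its output. To improve faithfulness, the authors propose Counterfactual Simulation Training (CST), which rewards CoTs that let a simulator accurately predict the model's outputs on counterfactual inputs. The key idea is to treat the CoT as an intermediate reasoning trace that a simulator can verify, and to use counterfactual perturbations (cue-based counterfactuals or generic model-generated counterfactuals) to push the model toward more faithful, generalizable reasoning. Experiments on models with up to 235B parameters show that CST substantially improves monitor accuracy on cue-based counterfactuals (+35 points) and simulatability over generic counterfactuals (+2 points), and that it outperforms prompting baselines and RL-only training.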

Link: https://arxiv.org/abs/2602.20710
Authors: Peter Hase, Christopher Potts
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model’s outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at this https URL
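The reward described in the abstract above can be illustrated with a minimal sketch. It assumes the reward is simply the fraction of counterfactual inputs on which a simulator, shown only the CoT, predicts the model's actual output; the abstract does not give the exact reward, and `simulator`, `model`, and `simulation_reward` are hypothetical names introduced for illustration.

```python
def simulation_reward(cot, counterfactuals, simulator, model):
    """Fraction of counterfactual inputs where the simulator's guess matches the model.

    cot             -- the chain-of-thought text being scored
    counterfactuals -- list of perturbed inputs
    simulator       -- hypothetical callable (cot, x) -> predicted output
    model           -- hypothetical callable x -> actual output
    """
    if not counterfactuals:
        raise ValueError("need at least one counterfactual input")
    hits = sum(1 for x in counterfactuals if simulator(cot, x) == model(x))
    return hits / len(counterfactuals)
```

A CoT that reveals which input features the model actually relies on should score higher under this kind of reward than one that hides a cue the model is following.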

[NLP-21] CAMEL: Confidence-Gated Reflection for Reward Modeling

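[Quick Read]: This paper addresses the trade-off between efficiency and interpretability that reward models face when aligning large language models with human preferences: scalar discriminative preference models are computationally efficient but offer no interpretable reasoning, while generative judging models provide richer reasoning at a computational cost that makes them hard to deploy. The key to the solution is CAMEL, a confidence-gated reflection framework that first makes a lightweight single-token preference decision and invokes reflection only for low-confidence instances. Training with counterfactual prefix augmentation exposes the model to diverse initial verdicts and encourages genuine self-correction. CAMEL reaches 82.9% average accuracy on three widely used benchmarks, surpassing the best prior model by 3.2%, with roughly 20% of the parameters of the 70B models it outperforms.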

Link: https://arxiv.org/abs/2602.20670
Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Preprint. 13 pages

Abstract:Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
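The gating step in the abstract above can be sketched as follows. This assumes the confidence signal is the absolute log-probability margin between the two verdict tokens, compared against a fixed threshold; the threshold value and the `reflect` callable are hypothetical, and the sketch does not reproduce the paper's training procedure.

```python
def gated_verdict(logp_a, logp_b, threshold, reflect):
    """Return (verdict, reflected) for a pairwise preference decision.

    logp_a, logp_b -- log-probabilities of the verdict tokens "A" and "B"
    threshold      -- hypothetical margin below which reflection is triggered
    reflect        -- hypothetical callable: initial verdict -> revised verdict
    """
    initial = "A" if logp_a >= logp_b else "B"
    margin = abs(logp_a - logp_b)
    if margin >= threshold:
        # High confidence: keep the cheap single-token decision.
        return initial, False
    # Low confidence: pay for the slower reflection pass.
    return reflect(initial), True
```

Because the margin comes from the same forward pass that produces the verdict, the gate itself adds no inference cost; only the low-margin instances incur the reflection call.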

[NLP-22] CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models

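[Quick Read]: This paper addresses the difficulty of assessing the client-perceived therapeutic alliance in counseling: traditional post-session questionnaires are burdensome and delayed, while existing computational approaches produce coarse scores, lack interpretability, and fail to model the context of the whole session. The key to the solution is CARE, a framework based on a Large Language Model (LLM). Its core innovation is rationale-augmented supervision: the CounselingWAI dataset is enriched with 9,516 expert-curated rationales, and LLaMA-3.1-8B-Instruct is fine-tuned on the result. The method improves prediction of multi-dimensional alliance scores (over 70% higher Pearson correlation with client ratings) and generates high-quality, contextually grounded rationales, bringing predictions closer to what clients actually perceive and showing potential to support clinical decisions in real online counseling services.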

Link: https://arxiv.org/abs/2602.20648
Authors: Anqi Li, Chenxiao Wang, Yu Lu, Renjun Xu, Lizhi Ma, Zhenzhong Lan
Affiliations: Zhejiang University; Westlake University; Hangzhou Normal University
Categories: Computation and Language (cs.CL)
Notes: 14 pages, 4 figures

Abstract:Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. Built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales, CARE is fine-tuned using rationale-augmented supervision with the LLaMA-3.1-8B-Instruct backbone. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Rationale-augmented supervision further improves predictive accuracy. CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations. Applied to real-world Chinese online counseling sessions, CARE uncovers common alliance-building challenges, illustrates how interaction patterns shape alliance development, and provides actionable insights, demonstrating its potential as an AI-assisted tool for supporting mental health care.

[NLP-23] Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28606 Books

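[Quick Read]: This paper addresses how to quantify, at corpus scale, the information-theoretic features of narrative structure and their effect on reader engagement. The key to the solution is "semantic novelty", defined as the cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs. Paragraph-level novelty curves are computed on the PG19 corpus from 768-dimensional SBERT (Sentence-BERT) embeddings, compressed to a 16-segment Piecewise Aggregate Approximation (PAA), and clustered with Ward linkage into eight canonical narrative shape archetypes. The results indicate that information-density dynamics form a core dimension of narrative structure distinct from sentiment or topic, and that this dimension predicts readership: Volume, the variance of the novelty trajectory, is the strongest length-independent predictor (partial rho = 0.32).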

Link: https://arxiv.org/abs/2602.20647
Authors: W. Frederick Zimmerman
Affiliations: Nimble Books LLC
Categories: Computation and Language (cs.CL)
Notes: Six figures. Dataset available at Hugging Face

Abstract: I introduce semantic novelty (the cosine distance between each paragraph’s sentence embedding and the running centroid of all preceding paragraphs) as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume, the variance of the novelty trajectory, is the strongest length-independent predictor of readership (partial rho = 0.32), followed by speed (rho = 0.19) and Terminal/Initial ratio (rho = 0.19). Circuitousness shows strong raw correlation (rho = 0.41) but is 93 percent correlated with length; after control, partial rho drops to 0.11, demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (chi squared = 2121.6, p < 10^-242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = -0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: this https URL
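The two steps named in the abstract above, the novelty curve and its PAA reduction, can be sketched in plain Python. This assumes paragraph embeddings are already available as lists of floats (the paper uses 768-dimensional SBERT vectors) and that the first paragraph, having no predecessors, gets no novelty value; the function names are hypothetical and the paper's exact handling of edge cases is not stated.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return 1.0 - dot / (norm_u * norm_v)

def novelty_curve(embeddings):
    """Novelty of paragraph i = distance to the centroid of paragraphs 0..i-1."""
    curve = []
    running_sum = list(embeddings[0])
    for i in range(1, len(embeddings)):
        centroid = [s / i for s in running_sum]
        curve.append(cosine_distance(embeddings[i], centroid))
        running_sum = [s + x for s, x in zip(running_sum, embeddings[i])]
    return curve

def paa(series, segments=16):
    """Piecewise Aggregate Approximation: mean of each near-equal chunk."""
    n = len(series)
    if n < segments:
        raise ValueError("series is shorter than the number of segments")
    out = []
    for k in range(segments):
        chunk = series[k * n // segments:(k + 1) * n // segments]
        out.append(sum(chunk) / len(chunk))
    return out
```

The fixed-length PAA vector is what makes books of different lengths comparable, so it is the natural input to a clustering step such as the Ward-linkage clustering the abstract mentions.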

[NLP-24] Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches

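[Quick Read]: This paper addresses the detection and moderation of hate speech and offensive language on social media platforms. The key to the solution is to compare the detection performance of different machine learning models and to explore text transformation techniques that neutralize harmful content. The study finds that advanced models with deep contextual understanding, such as BERT, achieve higher accuracy, while hybrid models perform better in certain scenarios; the proposed text transformation approaches convert negative expressions into neutral ones, potentially reducing the impact of harmful content.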

Link: https://arxiv.org/abs/2602.20634
Authors: Saurabh Mishra, Shivani Thakur, Radhika Mamidi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 32 pages, 24 figures

Abstract:The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools. This study evaluates the efficacy of various machine learning models in identifying hate speech and offensive language and investigates the potential of text transformation techniques to neutralize such content. We compare traditional models like CNNs and LSTMs with advanced neural network models such as BERT and its derivatives, alongside exploring hybrid models that combine different architectural features. Our results indicate that while advanced models like BERT show superior accuracy due to their deep contextual understanding, hybrid models exhibit improved capabilities in certain scenarios. Furthermore, we introduce innovative text transformation approaches that convert negative expressions into neutral ones, thereby potentially mitigating the impact of harmful content. The implications of these findings are discussed, highlighting the strengths and limitations of current technologies and proposing future directions for more robust hate speech detection systems.

[NLP-25] SpecMind: Cognitively Inspired Interactive Multi-Turn Framework for Postcondition Inference

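[Quick Read]: This paper addresses the time and difficulty of writing program specifications by hand, in particular postconditions, for which existing single-pass prompting of Large Language Models (LLMs) often yields inaccurate results. The key to the solution is SpecMind, a framework that treats the LLM as an interactive, exploratory reasoner rather than a one-shot generator. Through feedback-driven multi-turn prompting, the model iteratively refines candidate postconditions using explicit and implicit correctness feedback and decides autonomously when to stop, which deepens its understanding of code semantics and improves the accuracy and completeness of the generated postconditions.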

Link: https://arxiv.org/abs/2602.20610
Authors: Cuong Chi Le, Minh V.T Pham, Tung Vu Duy, Cuong Duc Van, Huy N. Phan, Hoang N. Phan, Tien N. Nguyen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Notes:

Abstract:Specifications are vital for ensuring program correctness, yet writing them manually remains challenging and time-intensive. Recent large language model (LLM)-based methods have shown successes in generating specifications such as postconditions, but existing single-pass prompting often yields inaccurate results. In this paper, we present SpecMind, a novel framework for postcondition generation that treats LLMs as interactive and exploratory reasoners rather than one-shot generators. SpecMind employs feedback-driven multi-turn prompting approaches, enabling the model to iteratively refine candidate postconditions by incorporating implicit and explicit correctness feedback, while autonomously deciding when to stop. This process fosters deeper code comprehension and improves alignment with true program behavior via exploratory attempts. Our empirical evaluation shows that SpecMind significantly outperforms state-of-the-art approaches in both accuracy and completeness of generated postconditions.

[NLP-26] Personal Information Parroting in Language Models EACL

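[Quick Read]: This paper addresses the privacy risk that arises when modern Language Models (LMs), trained on large-scale web text, memorize Personal Information (PI). The core of the solution is a regexes and rules (RR) detector suite that identifies email addresses, phone numbers, and IP addresses. An empirical analysis finds that both model size and the number of pretraining iterations are positively correlated with PI memorization, and the authors recommend aggressively filtering and anonymizing pretraining data to reduce PI parroting.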

Link: https://arxiv.org/abs/2602.20580
Authors: Nishant Subramani, Kshitish Ghate, Mona Diab
Affiliations: Carnegie Mellon University - Language Technologies Institute; University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: EACL Findings 2026

Abstract:Modern language models (LM) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (RR) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
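A rough illustration of the two ideas in the abstract above, regex-based PI detection and the verbatim parroting check, might look like the following. The patterns are simplified assumptions written for this example (US-style phone numbers and IPv4 only) and are not the authors' RR suite; `generate` is a hypothetical callable standing in for greedy decoding from a model.

```python
import re

# Illustrative patterns only; real detectors need many more rules.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?!\d|\.\d)")
PHONE = re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}(?!\d)")

def find_pi(text):
    """Return a list of (label, matched_span) pairs found in text."""
    hits = []
    for label, pattern in (("email", EMAIL), ("ipv4", IPV4), ("phone", PHONE)):
        hits.extend((label, m.group(0)) for m in pattern.finditer(text))
    return hits

def is_parroted(prefix, pi_span, generate):
    """True if the continuation of `prefix` reproduces the PI span exactly.

    generate -- hypothetical callable: prompt text -> greedy continuation text
    """
    return generate(prefix).startswith(pi_span)
```

In this framing, the memorization rate reported in the abstract corresponds to the share of curated PI instances for which `is_parroted` would return true when `prefix` is the text preceding the PI in the original document.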

[NLP-27] GATES: Self-Distillation under Privileged Context with Consensus Gating

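[Quick Read]: This paper addresses how to perform self-distillation when no reliable supervision signal is available, that is, without ground-truth labels, verifiable rewards, or external graders. The core challenge is that when one model serves as both tutor and student and the tutor's outputs cannot be assumed correct, conventional distillation struggles to converge. The key to the solution is to build a reliability signal online: multiple document-grounded reasoning traces are sampled, and agreement (consensus) among the tutor samples gates which examples are used for learning. Conditioned on this signal, the method distills full tutor reasoning trajectories rather than final answers alone, giving a dense and stable learning signal. Experiments show substantial gains for the document-free student: held-out in-domain accuracy rises from 46.0% to 62.0%, and average (maj@8) accuracy on public math benchmarks rises from 20.2% to 35.4%.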

Link: https://arxiv.org/abs/2602.20574
Authors: Alex Stein, Furong Huang, Tom Goldstein
Affiliations: University of Maryland, College Park
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 10 pages of main text with an additional 7 pages of supplementary material

Abstract:We study self-distillation in settings where supervision is unreliable: there are no ground truth labels, verifiable rewards, or external graders to evaluate answers. We focus on document-grounded question answering with asymmetric context, where a single model serves as both tutor (with access to a relevant source document during training) and student (answering from the question alone at test time). Rather than assuming tutor correctness, we derive supervision online from tutor consensus by sampling multiple document-grounded reasoning traces and using agreement to gate learning. Conditioned on this reliability signal, we distill knowledge through full tutor reasoning trajectories (not just final answers), providing a dense and stable learning signal. Empirically, this consensus-gated trajectory distillation substantially improves transfer to the document-free student. Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
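The consensus gate described above can be sketched as a majority-agreement filter. This assumes agreement is measured as the share of sampled tutor answers that equal the most common answer, with a hypothetical threshold of 0.75; the abstract does not state the actual gating rule, and the function names are introduced only for illustration.

```python
from collections import Counter

def consensus_gate(tutor_answers, min_agreement=0.75):
    """Return (majority_answer, keep), keeping only high-agreement questions."""
    if not tutor_answers:
        return None, False
    answer, count = Counter(tutor_answers).most_common(1)[0]
    return answer, count / len(tutor_answers) >= min_agreement

def select_distillation_traces(samples, min_agreement=0.75):
    """Pick tutor reasoning traces to distill for one question.

    samples -- list of (reasoning_trace, final_answer) pairs from the tutor
    Returns the traces that end in the consensus answer, or [] if gated out.
    """
    answer, keep = consensus_gate([a for _, a in samples], min_agreement)
    if not keep:
        return []
    return [trace for trace, a in samples if a == answer]
```

Returning whole traces rather than only the majority answer mirrors the abstract's point that the student is trained on full reasoning trajectories.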

[NLP-28] Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

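[Quick Read]: This paper addresses the inefficiency and instability of curriculum learning in reinforcement learning post-training of Large Language Models (LLMs), which relies on massive, heterogeneous datasets. The core challenge is to select training problems automatically and dynamically so as to maximize policy improvement. The key to the solution is ACTOR-CURATOR, a framework that learns a neural curator, formulates problem selection as a non-stationary stochastic bandit problem, derives a principled loss function from online stochastic mirror descent, and establishes regret guarantees under partial feedback. The method provides automated, scalable curriculum learning and outperforms uniform sampling and strong baselines on multiple reasoning benchmarks, with up to 80% training speedup and relative gains of 28.6% to 30.5%.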

Link: https://arxiv.org/abs/2602.20532
Authors: Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue
Affiliations: University of Illinois Chicago; Caltech; RPI; MBZUAI; University of Chicago; NEC Laboratories America
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 37 pages, 8 figures, 1 table. Preprint under review. Equal contribution by first two authors

Abstract:Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
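To make the bandit framing above concrete, here is a sketch of a textbook exponentiated-weights (EXP3-style) sampler, which is what mirror descent with an entropy regularizer reduces to on the probability simplex. It is a swapped-in generic technique, not the paper's neural curator or its loss; the class name, learning rate, and the scalar `gain` signal (standing in for observed policy improvement) are assumptions, and non-stationarity is not handled.

```python
import math
import random

class Exp3Curator:
    """Tabular problem sampler with importance-weighted multiplicative updates."""

    def __init__(self, n_problems, lr=0.1, seed=0):
        self.log_w = [0.0] * n_problems  # log-weights, one per problem
        self.lr = lr
        self.rng = random.Random(seed)

    def probs(self):
        """Softmax over log-weights (shifted by the max for numerical stability)."""
        m = max(self.log_w)
        w = [math.exp(x - m) for x in self.log_w]
        z = sum(w)
        return [x / z for x in w]

    def sample(self):
        """Draw a problem index according to the current distribution."""
        return self.rng.choices(range(len(self.log_w)), weights=self.probs())[0]

    def update(self, problem, gain):
        """Partial feedback: only the sampled problem's gain is observed,
        so it is divided by its sampling probability to stay unbiased."""
        p = self.probs()[problem]
        self.log_w[problem] += self.lr * gain / p
```

A training loop would alternate `sample()`, one policy update on the chosen problem, and `update()` with the measured improvement, so problems that currently help the policy most are drawn more often.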

[NLP-29] Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

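[Quick Read]: This paper addresses the lack of global semantic planning in conventional autoregressive language models, which are limited to local token-by-token decisions and therefore struggle to optimize for long-range semantic consistency and commonsense reasoning. The key to the solution is the Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM), which adds a "thinking" phase that pauses generation and refines a semantic plan through latent diffusion in continuous space before committing to discrete tokens. The design improves language understanding, narrative coherence, and commonsense reasoning, and it supports attribute steering through lightweight classifiers without retraining while keeping a better fluency-control trade-off.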

Link: https://arxiv.org/abs/2602.20528
Authors: Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy, Kilian Q. Weinberger
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: COLM 2025

Abstract:The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a “thinking” phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves 70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. The architecture also allows straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.

[NLP-30] Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination NEURIPS2025

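[Quick Read]: This paper addresses two shortcomings of current imitation learning methods: they struggle to capture the diversity and non-Markovian nature of human behavior, and they cannot steer behavior at inference time. The key to the solution is MIMIC, a framework inspired by the cognitive theory that inner speech guides action selection. MIMIC uses vision-language models as linguistic scaffolding to train a conditional variational autoencoder that generates inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on both the current observation and the generated inner speech, which allows fine-grained steering and diverse behavior at inference time without additional human demonstrations.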

Link: https://arxiv.org/abs/2602.20517
Authors: Rakshit Trivedi, Kartik Sharma, David C Parkes
Affiliations: Massachusetts Institute of Technology; Georgia Institute of Technology; Harvard University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Spotlight paper at NeurIPS 2025

Abstract:Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations. We open source our code and provide pre-trained MIMIC agents and qualitative demos at: this https URL.

[NLP-31] From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility

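[Quick Read]: This paper addresses the lack of a unified framework for evaluating Large Language Models (LLMs) in practical use. Existing metrics cover some sociotechnical determinants but are not organized systematically, which makes utility hard to assess consistently across use cases. The key to the solution is the Language Model Utility Taxonomy (LUX), which divides utility evaluation into four domains (performance, interaction, operations, and governance) and organizes each domain hierarchically into thematically aligned dimensions and components, each grounded in metrics that support quantitative comparison, so that model selection can be matched to intended use. An accompanying dynamic web tool links each component to a repository of relevant metrics for applied evaluation.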

Link: https://arxiv.org/abs/2602.20513
Authors: Gavin Levinson, Keith Feldman
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL)
Notes:

Abstract:As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model’s fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. Although a growing set of metrics capture many of these considerations, they are rarely organized in a way that supports consistent evaluation, leaving no unified taxonomy for assessing and comparing LLM utility across use cases. To address this gap, we introduce the Language Model Utility Taxonomy (LUX), a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. Within each domain, LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics that enable quantitative comparison and alignment of model selection with intended use. In addition, an external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics (factors) for applied evaluation.

[NLP-32] PreScience: A Benchmark for Forecasting Scientific Contributions

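[Quick Read]: This paper asks whether AI systems trained on the scientific record up to a fixed point in time can forecast the scientific advances that follow. Such a capability would help researchers identify collaborators and impactful research directions and anticipate which problems and methods will become central. The key to the solution is PreScience, a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. It includes a curated dataset of 98K recent AI-related papers with disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of author publication histories and citations spanning 502K papers, together with LACERScore, an LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. Experiments show that frontier LLMs leave substantial headroom on every task: in contribution generation they reach only moderate similarity (GPT-5 averages 5.6 out of 10), and a 12-month end-to-end simulation of scientific production yields a synthetic corpus that is systematically less diverse and less novel than human research from the same period.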

Link: https://arxiv.org/abs/2602.20459
Authors: Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski, Oyvind Tafjord, James Evans, Daniel S. Weld, Tom Hope, Doug Downey
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 10 pages (53 with bibliography and appendix), 4 figures (13 with appendix), 4 tables (10 with appendix), 1 algorithm

Abstract:Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience – a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task – e.g. in contribution generation, frontier LLMs achieve only moderate similarity to the ground-truth (GPT-5, averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.

[NLP-33] Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference

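[Quick Read]: This paper addresses two issues: Protein Language Models (PLMs) struggle to balance efficiency and accuracy in property prediction, and it remains unclear how Transformer architectures behave differently on protein sequences than on natural language. The solution has two parts. First, the authors compare how information is distributed across attention heads and layers in the protein and natural language domains, revealing behavior specific to biological sequences. Second, they adapt an early-exit strategy from the natural language domain so that the model automatically selects the best intermediate-layer representation for the protein and task at hand. This yields accuracy gains of 0.4 to 7.01 percentage points together with efficiency gains of over 10 percent on non-structural property prediction.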

Link: https://arxiv.org/abs/2602.20449
Authors: Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji
Affiliations: Siebel School of Data Science and Computing, University of Illinois Urbana-Champaign; DOE Center for Advanced Bioenergy and Bioproducts Innovation; Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Notes:

Abstract: Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein-related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domains. Furthermore, we adapt a simple early-exit technique (originally used in the natural language domain to improve efficiency at the cost of performance) to achieve both increased accuracy and substantial efficiency gains in protein non-structural property prediction by allowing the model to automatically select protein representations from the intermediate layers of the PLMs for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non-structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.
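A generic confidence-based early exit, the kind of technique the abstract says was adapted, can be sketched as follows. It assumes each layer has its own classifier head whose class probabilities are already computed, and that the model exits at the first layer whose top probability clears a hypothetical threshold; the paper's adapted selection rule is not described in the abstract and may differ.

```python
def early_exit(layer_probs, threshold=0.9):
    """Return (exit_layer, predicted_class) for one input.

    layer_probs -- per-layer class-probability lists, shallowest layer first
    threshold   -- hypothetical confidence needed to stop at a layer
    """
    if not layer_probs:
        raise ValueError("need probabilities from at least one layer")
    for layer, probs in enumerate(layer_probs):
        confidence = max(probs)
        if confidence >= threshold:
            # Confident enough: skip all deeper layers.
            return layer, probs.index(confidence)
    # No layer was confident: fall back to the final layer's prediction.
    last = layer_probs[-1]
    return len(layer_probs) - 1, last.index(max(last))
```

In a real model the per-layer probabilities would be computed lazily, so that exiting at a shallow layer actually saves the cost of running the deeper ones.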

[NLP-34] Disentangling Geometry, Performance, and Training in Language Models

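Quick read: This paper addresses the unclear relationship between the geometric properties of language models (especially the geometry of the unembedding matrix) and downstream task performance. Prior work found that the effective rank of the unembedding matrix matters for model interpretability, but whether it can serve as a predictor of performance had not been systematically verified. The key to the approach is a controlled experimental design that analyzes the geometric features and performance of 108 OLMo-style language models trained under different pre-training hyperparameters. The analysis shows that although a high effective rank often accompanies the best-performing models, the pattern is not universal; model geometry mainly reflects training choices rather than performance itself, which corrects earlier misreadings of the link between geometric metrics and performance.

Link: https://arxiv.org/abs/2602.20433
Authors: Atharva Kulkarni, Jacob Mitchell Springer, Arjun Subramonian, Swabha Swayamdipta
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: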

Abstract: Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in turn affect the model’s performance. Lastly, extending our analysis to other geometric metrics and final-layer representation, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model’s geometry, as captured by existing metrics, primarily reflects training choices rather than performance.
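
For readers unfamiliar with effective rank, the sketch below computes one common definition: the exponential of the Shannon entropy of the normalized singular values. This is an assumption for illustration, since the abstract does not say which definition the paper adopts. The singular values are taken as given; in practice they would come from a singular value decomposition of the unembedding matrix.

```python
import math


def effective_rank(singular_values):
    """Effective rank as exp(entropy) of the normalized singular values.

    Zero singular values contribute nothing to the entropy and are skipped.
    Returns 0.0 when every singular value is zero.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)


# Made-up spectra: an evenly spread one and a fully concentrated one.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])    # close to 4
peaked = effective_rank([1.0, 0.0, 0.0, 0.0])  # close to 1
```

Under this definition a matrix whose singular values are evenly spread has an effective rank near its full rank, while a matrix dominated by one direction has an effective rank near 1.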

[NLP-35] MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation CVPR2026

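Quick read: This paper addresses the challenges that limited annotations, ambiguous anatomical structures, and domain shift pose for medical image segmentation, and explores the potential of vision-language models such as CLIP for dense, text-guided medical image segmentation. The key to the solution is the MedCLIPSeg framework. Its core innovations are: using patch-level CLIP embeddings with probabilistic cross-modal attention to enable bidirectional interaction between image and text tokens, which explicitly models predictive uncertainty; and adding a soft patch-level contrastive loss that encourages fine-grained semantic learning across diverse text prompts, improving data efficiency and domain generalization. The method is validated on 16 datasets covering different imaging modalities and organs, where it shows advantages in accuracy, efficiency, and robustness, and it provides interpretable uncertainty maps that increase clinical trustworthiness.

Link: https://arxiv.org/abs/2602.20423
Authors: Taha Koleilat, Hojat Asgariandehkordi, Omid Nejati Manzari, Berardino Barile, Yiming Xiao, Hassan Rivaz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: CVPR 2026; Project Page: this https URL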

Abstract:Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.

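[NLP-36] Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems

Quick read: This paper addresses the difficulty of evaluating enterprise multi-turn Retrieval-Augmented Generation (RAG) systems in real operational settings, in particular the failure of existing evaluation frameworks to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. The key to the solution is a case-aware LLM-as-a-Judge evaluation framework. It uses eight operationally grounded metrics to separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment, and introduces a severity-aware scoring protocol to reduce score inflation and improve diagnostic clarity. It also uses deterministic prompting with strict JSON outputs to support scalable batch evaluation, regression testing, and production monitoring, exposing enterprise-critical tradeoffs that generic proxy metrics miss and giving actionable guidance for system improvement.

Link: https://arxiv.org/abs/2602.20379
Authors: Mukul Chhabra, Luigi Medrano, Arush Verma
Affiliations: Dell Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages including appendix, 6 figures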

Abstract: Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and improves diagnostic clarity across heterogeneous enterprise cases. The system uses deterministic prompting with strict JSON outputs, enabling scalable batch evaluation, regression testing, and production monitoring. Through a comparative study of two instruction-tuned models across short and long workflows, we show that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.
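
The "strict JSON outputs" requirement lends itself to a small validation step before scores are stored. The sketch below is one possible way to do that with the Python standard library. The metric names in `REQUIRED_METRICS`, the 0 to 5 score range, and the function name are hypothetical; the abstract does not list the eight metrics or their scales.

```python
import json

# Hypothetical metric names and range; the abstract does not enumerate them.
REQUIRED_METRICS = ("retrieval_quality", "grounding_fidelity", "answer_utility")


def parse_judge_output(raw, min_score=0, max_score=5):
    """Parse a judge reply as strict JSON and validate its scores.

    Raises ValueError for malformed JSON, a non-object payload,
    a missing metric, a non-integer score, or an out-of-range score.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for name in REQUIRED_METRICS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"metric {name!r} must be an integer")
        if not min_score <= value <= max_score:
            raise ValueError(f"metric {name!r} out of range: {value}")
        scores[name] = value
    return scores


reply = '{"retrieval_quality": 4, "grounding_fidelity": 5, "answer_utility": 3}'
scores = parse_judge_output(reply)
```

Rejecting anything that fails validation, rather than trying to repair it, keeps batch evaluation and regression tests reproducible, which appears to be the motivation for strict outputs in the first place.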

[NLP-37] How communicatively optimal are exact numeral systems? Once more on lexicon size and morphosyntactic complexity

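Quick read: This paper argues that existing assessments of the communicative efficiency of numeral systems are biased: they do not adequately account for how languages differ in morphosyntactic complexity, and so may overestimate how optimized numeral systems are across the world's languages. The key to the approach is to use data from 52 genetically diverse languages and an annotation scheme that distinguishes predictable from unpredictable allomorphy to re-quantify the average complexity of numeral terms. The results show that many languages are markedly less efficient than theory predicts, exposing the limits of traditional models in characterizing efficiency in language evolution.

Link: https://arxiv.org/abs/2602.20372
Authors: Chundra Cathcart, Arne Rubehn, Katja Bocklage, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Jekaterina Mažara, Carlo Y. Meloni, David Snee, Johann-Mattis List
Affiliations: University of Zurich; University of Passau
Categories: Computation and Language (cs.CL)
Comments: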

Abstract:Recent research argues that exact recursive numeral systems optimize communicative efficiency by balancing a tradeoff between the size of the numeral lexicon and the average morphosyntactic complexity (roughly length in morphemes) of numeral terms. We argue that previous studies have not characterized the data in a fashion that accounts for the degree of complexity languages display. Using data from 52 genetically diverse languages and an annotation scheme distinguishing between predictable and unpredictable allomorphy (formal variation), we show that many of the world’s languages are decisively less efficient than one would expect. We discuss the implications of our findings for the study of numeral systems and linguistic evolution more generally.

[NLP-38] Natural Language Processing Models for Robust Document Categorization

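Quick read: This paper addresses how to balance classification accuracy against computational efficiency in automated text classification, particularly for document categorization with imbalanced class distributions. The core challenge is that integrating AI models into real automation pipelines requires attention to both performance and resource consumption. The key to the approach is a comparison of three models of differing complexity: Naive Bayes, a bidirectional long short-term memory network (BiLSTM), and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers). BiLSTM achieves the best tradeoff between accuracy (about 98.56%) and training cost, retaining strong contextual understanding while avoiding BERT's heavy compute requirements. The paper also validates a deployable demonstration system for automated routing of technical requests, offering a practical path for industrial applications.

Link: https://arxiv.org/abs/2602.20336
Authors: Radoslaw Roszczyk, Pawel Tecza, Maciej Stodolski, Krzysztof Siwek
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 1 figure, 5 tables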

Abstract: This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real-world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine-tuned transformer-based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.

[NLP-39] No One Size Fits All: QueryBandits for Hallucination Mitigation

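Quick read: This paper addresses the frequent hallucinations of Large Language Models (LLMs) with advanced reasoning capabilities, focusing on the lack of mitigation strategies for closed-source models. Existing work concentrates on post-hoc detection and parameter editing for open-source models and neglects the closed-source models that dominate institutional deployments. The core of the solution is QueryBandits, a model-agnostic contextual bandit framework that learns online to select the best query-rewrite strategy, optimized with an empirically validated and calibrated reward function. The method shifts model behavior purely through forward passes, with no retraining or gradient-based adaptation. It outperforms static policies (such as Paraphrase or Expand), achieves an 87.5% win rate over a No-Rewrite baseline across 16 QA scenarios, and shows that no single rewrite policy is optimal for all queries, underscoring the value of adaptive selection.

Link: https://arxiv.org/abs/2602.20332
Authors: Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: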

Abstract:Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
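
To make the bandit framing concrete, here is a minimal sketch of Thompson Sampling over a fixed set of rewrite strategies. It is deliberately simplified: it is a non-contextual Beta-Bernoulli bandit, whereas the paper describes contextual bandits over query features. The class name, the arm names, the binary reward, and the simulated success rates are all made up for illustration.

```python
import random


class ThompsonRewriteSelector:
    """Beta-Bernoulli Thompson Sampling over named rewrite strategies."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        # Beta(1, 1) prior for every arm, i.e. a uniform prior on success rate.
        self.successes = {arm: 1 for arm in self.arms}
        self.failures = {arm: 1 for arm in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per arm and return the arm with the largest sample."""
        samples = {
            arm: self.rng.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.arms
        }
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        """Record a binary reward (truthy means success) for the chosen arm."""
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Simulated environment with made-up success rates per strategy.
true_rates = {"no_rewrite": 0.3, "paraphrase": 0.5, "expand": 0.8}
selector = ThompsonRewriteSelector(true_rates, seed=0)
environment = random.Random(1)
pulls = {arm: 0 for arm in true_rates}
for _ in range(2000):
    arm = selector.select()
    reward = 1 if environment.random() < true_rates[arm] else 0
    selector.update(arm, reward)
    pulls[arm] += 1
```

Because the selector only needs a reward for each answered query, it never touches model weights, which matches the abstract's point that this style of adaptation works with closed-source models.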

[NLP-40] An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

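Quick read: This paper addresses the phenotype annotation bottleneck in rare disease diagnosis: manually extracting and structuring phenotype information from clinical notes is time-consuming and hard to scale. Existing AI methods usually optimize only a single step of phenotyping and do not cover the full clinical workflow of extracting features from text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. The key to the solution is the RARE-PHENIX framework, which combines large language model-based phenotype extraction, ontology-grounded standardization, and supervised ranking of diagnostically informative phenotypes into an end-to-end workflow. The framework was trained on data from 11 Undiagnosed Diseases Network clinical sites and externally validated on real-world clinical notes from Vanderbilt University Medical Center, where it outperformed the state-of-the-art baseline PhenoBERT on ontology-based similarity and precision-recall-F1 metrics. Ablation analyses confirm that each module contributes to overall performance. The resulting structured, ranked phenotypes are closer to clinician judgment and could support human-in-the-loop rare disease diagnosis.

Link: https://arxiv.org/abs/2602.20324
Authors: Cathy Shyr, Yan Hu, Rory J. Tinker, Thomas A. Cassini, Kevin W. Byram, Rizwan Hamid, Daniel V. Fabbri, Adam Wright, Josh F. Peterson, Lisa Bastarache, Hua Xu
Affiliations: Vanderbilt University Medical Center; Mount Sinai School of Medicine; Yale University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: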

Abstract:Phenotyping is fundamental to rare disease diagnosis, but manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale. Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. We developed RARE-PHENIX, an end-to-end AI framework for rare disease phenotyping that integrates large language model-based phenotype extraction, ontology-grounded standardization to HPO terms, and supervised ranking of diagnostically informative phenotypes. We trained RARE-PHENIX using data from 2,671 patients across 11 Undiagnosed Diseases Network clinical sites, and externally validated it on 16,357 real-world clinical notes from Vanderbilt University Medical Center. Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation (i.e., ontology-based similarity of 0.70 vs. 0.58). Ablation analyses demonstrated performance improvements with the addition of each module in RARE-PHENIX (extraction, standardization, and prioritization), supporting the value of modeling the full clinical phenotyping workflow. By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.

[NLP-41] What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance EACL2026

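Quick read: This paper addresses hallucination in Large Language Models (LLMs), where a model produces untrue or unsupported content in its answers. The conventional view attributes hallucination to defects in the model or its decoding strategy; this paper offers a different perspective: the form of the query itself can substantially affect how likely the model is to hallucinate. The key to the approach is a 22-dimensional query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all of which are known to affect human comprehension. A large-scale empirical analysis of 369,837 real-world queries shows that features such as deep clause nesting and underspecification align with higher hallucination propensity, while clear intention grounding and answerability align with lower hallucination rates; other features, such as domain specificity, show mixed effects that depend on dataset and model. These findings establish a quantifiable, observable link between query features and hallucination risk, and provide a basis for future interventions based on guided query rewriting.

Link: https://arxiv.org/abs/2602.20300
Authors: William Watson, Nicole Cho, Sumitra Ganesh, Manuela Veloso
Affiliations: J.P. Morgan AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EACL 2026 Findings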

Abstract:Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query’s form can also shape a listener’s (and model’s) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, and anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent “risk landscape”: certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.

[NLP-42] InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation

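Quick read: This paper addresses the lack of direct evaluation against what individuals actually said when generative AI is used to simulate real personalities. Existing methods rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, which cannot accurately reflect what individuals really expressed. The key to the solution is a large-scale, interview-grounded evaluation framework: more than 671,000 question-answer pairs are extracted from 23,000 verified interview transcripts of 1,000 public personalities, and a multi-dimensional evaluation framework is proposed with four metrics covering content similarity, factual consistency, personality alignment, and factual knowledge retention. The experiments show that methods grounded in real interview data substantially outperform those that rely only on biographical profiles or the model's parametric knowledge, and reveal a tradeoff between retrieval-augmented and chronological methods: the former better capture personality style and response quality, while the latter better preserve factual consistency and knowledge retention.

Link: https://arxiv.org/abs/2602.20294
Authors: Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu
Affiliations: Salesforce Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: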

Abstract:Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model’s parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.

[NLP-43] Exploring Anti-Aging Literature via ConvexTopics and Large Language Models

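Quick read: This paper addresses the challenge of organizing knowledge and identifying emerging trends as the biomedical literature grows rapidly, and in particular the poor reproducibility and limited interpretability of common clustering and topic modeling methods (such as K-means and LDA), which are sensitive to initialization and prone to local optima. The key to the solution is a reformulation of a convex optimization based clustering algorithm that produces stable, fine-grained topics by selecting exemplars from the data and guarantees a global optimum, improving reproducibility and interpretability. Applied to about 12,000 PubMed articles on aging and longevity, the method yields topics spanning molecular mechanisms to gut microbiota and compares favorably with common methods including K-means, LDA, and BERTopic.

Link: https://arxiv.org/abs/2602.20224
Authors: Lana E. Yeganova, Won G. Kim, Shubo Tian, Natalie Xie, Donald C. Comeau, W. John Wilbur, Zhiyong Lu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: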

Abstract:The rapid expansion of biomedical publications creates challenges for organizing knowledge and detecting emerging trends, underscoring the need for scalable and interpretable methods. Common clustering and topic modeling approaches such as K-means or LDA remain sensitive to initialization and prone to local optima, limiting reproducibility and evaluation. We propose a reformulation of a convex optimization based clustering algorithm that produces stable, fine-grained topics by selecting exemplars from the data and guaranteeing a global optimum. Applied to about 12,000 PubMed articles on aging and longevity, our method uncovers topics validated by medical experts. It yields interpretable topics spanning from molecular mechanisms to dietary supplements, physical activity, and gut microbiota. The method performs favorably, and most importantly, its reproducibility and interpretability distinguish it from common clustering approaches, including K-means, LDA, and BERTopic. This work provides a basis for developing scalable, web-accessible tools for knowledge discovery.

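[NLP-44] MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

Quick read: This paper addresses precision switching for elastic deployment of Large Language Models (LLMs) on cloud and edge devices, where available compute changes at runtime. In conventional quantization, calibration parameters are tightly coupled to a specific precision, which makes switching at runtime difficult. The key to the solution is MoBiQuant, a Mixture-of-Bits quantization framework with two core components: many-in-one recursive residual quantization, which can iteratively reconstruct higher-precision weights; and a token-aware router, which dynamically selects how many residual bit slices to use based on the sensitivity of the input token. By mitigating precision-dependent outlier migration, the method enables smooth precision switching without repeated calibration and improves generalization to the distribution of token outliers.

Link: https://arxiv.org/abs/2602.20191
Authors: Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 12 figures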

Abstract: Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. However, it has been observed that the calibration parameters for quantization are typically linked to specific precisions, which presents challenges during elastic-precision calibration and precision switching at runtime. In this work, we attribute the source of varying calibration parameters to the varying token-level sensitivity caused by a precision-dependent outlier migration phenomenon. Motivated by this observation, we propose MoBiQuant, a novel Mixture-of-Bits quantization framework that adjusts weight precision for elastic LLM inference based on token sensitivity. Specifically, we propose the many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights and the token-aware router to dynamically select the number of residual bit slices. MoBiQuant enables smooth precision switching while improving generalization for the distribution of token outliers. Experimental results demonstrate that MoBiQuant exhibits strong elasticity, enabling it to match the performance of bit-specific calibrated PTQ on LLaMA3-8B without repeated calibration.
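
The general idea of residual quantization, where each additional slice refines the approximation left by the previous ones, can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's algorithm: the uniform rounding quantizer, the `shrink` factor between slices, and all function names are hypothetical, and the token-aware router is replaced by a hand-picked slice count `k`.

```python
def quantize(values, step):
    """Round each value to the nearest multiple of step."""
    return [round(v / step) * step for v in values]


def residual_slices(weights, first_step, num_slices, shrink=0.25):
    """Decompose weights into slices; each slice quantizes the residual
    left by the previous slices, using a step that shrinks every round."""
    slices = []
    residual = list(weights)
    step = first_step
    for _ in range(num_slices):
        quantized = quantize(residual, step)
        slices.append(quantized)
        residual = [r - q for r, q in zip(residual, quantized)]
        step *= shrink
    return slices


def reconstruct(slices, k):
    """Approximate the weights by summing the first k slices element-wise."""
    return [sum(column) for column in zip(*slices[:k])]


def max_error(weights, approx):
    """Largest absolute difference between the weights and an approximation."""
    return max(abs(w - a) for w, a in zip(weights, approx))


# Made-up weights; using more slices gives a tighter reconstruction.
weights = [0.73, -0.41, 0.08, 1.27]
slices = residual_slices(weights, first_step=0.5, num_slices=3)
errors = [max_error(weights, reconstruct(slices, k)) for k in (1, 2, 3)]
```

With a rounding quantizer the error after `k` slices is at most half of the step used for slice `k`, so one stored decomposition can serve several precisions by choosing `k` at runtime.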

[NLP-45] ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling

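Quick read: This paper addresses alert fatigue: in practical applications, intelligent agents generate large numbers of alerts, most of them false, which desensitizes users to critical alerts. The key to the solution is a new method called ConceptRM. Using only a small number of expert annotations as anchors, it builds perturbed datasets with varying noise ratios and uses co-teaching to train multiple distinct models for collaborative learning. It then uses the consensus decisions of these models to identify reliable negative samples in the noisy dataset, producing a high-quality corpus for training a reflection model that intercepts false alerts. With minimal annotation cost, the method outperforms several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.

Link: https://arxiv.org/abs/2602.20166
Authors: Yongda Yu, Lei Zhang, Xinxin Guo, Minghui Yu, Zhengqi Zhuang, Guoping Rong, Haifeng Shen, Zhengfeng Li, Boge Wang, Guoan Zhang, Bangyu Xiang, Xiaobin Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: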

Abstract: In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to the so-called “alert fatigue”. A common strategy is to train a reflection model as a filter to intercept false alerts with labelled data collected from user verification feedback. However, a key challenge is the noisy nature of such data as it is often collected in production environments. As cleaning noise via manual annotation incurs high costs, this paper proposes a novel method ConceptRM for constructing a high-quality corpus to train a reflection model capable of effectively intercepting false alerts. With only a small amount of expert annotations as anchors, ConceptRM creates perturbed datasets with varying noise ratios and utilizes co-teaching to train multiple distinct models for collaborative learning. By analyzing the consensus decisions of these models, it effectively identifies reliable negative samples from a noisy dataset. Experimental results demonstrate that ConceptRM significantly enhances the interception of false alerts with minimal annotation cost, outperforming several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.
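
The consensus step described above can be pictured as a vote across models. The sketch below keeps a sample only when a sufficient share of models label it negative. It assumes label `0` means "negative" and uses a simple agreement threshold; the function name, the label convention, and the threshold are hypothetical, and the co-teaching training that would produce the predictions is not shown.

```python
def reliable_negatives(samples, model_predictions, min_agreement=1.0):
    """Keep samples that a sufficient share of models label as negative (0).

    model_predictions: one list of 0/1 labels per model, aligned with samples.
    min_agreement: required fraction of models voting 0, between 0 and 1.
    """
    if not model_predictions:
        raise ValueError("need predictions from at least one model")
    kept = []
    for index, sample in enumerate(samples):
        votes = [predictions[index] for predictions in model_predictions]
        negative_share = votes.count(0) / len(votes)
        if negative_share >= min_agreement:
            kept.append(sample)
    return kept


# Made-up alerts and predictions from three hypothetical models.
alerts = ["a1", "a2", "a3", "a4"]
predictions = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
strict = reliable_negatives(alerts, predictions)                      # unanimous vote
relaxed = reliable_negatives(alerts, predictions, min_agreement=0.6)  # majority-style vote
```

A unanimous vote trades recall for purity: fewer samples survive, but those that do are less likely to carry label noise.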

[NLP-46] Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings

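Quick read: This paper addresses how to build high-performing small language models (SLMs) for resource-constrained environments, keeping strong reasoning ability while substantially reducing compute cost. The key to the solution is knowledge distillation, which transfers knowledge from a large pre-trained model to a small one to balance performance against compute. The experiments indicate that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, and that its reasoning ability matches or exceeds standard models ten times its size. The authors conclude that distillation is not only a compression technique but also a primary strategy for building high-performing, accessible AI.

Link: https://arxiv.org/abs/2602.20164
Authors: Sachin Gopal Wani, Eric Page, Ajay Dholakia, David Ellison
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, accepted at the 2025 TPCTC Conference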

Abstract: Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-to-compute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI.

[NLP-47] Graph Modelling Analysis of Speech-Gesture Interaction for Aphasia Severity Estimation IJCAI

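Quick read: This paper addresses a limitation of traditional aphasia assessment tools such as the Western Aphasia Battery-Revised (WAB-R), which mainly measure isolated linguistic skills and do not fully reflect everyday language ability. The key to the solution is a graph neural network-based framework that models each individual's speech and gestures as a multi-modal graph: nodes represent lexical items and gestures, and edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE learns participant-level embeddings that integrate local neighborhood information with global graph structure. The experiments suggest that aphasia severity is not determined by isolated lexical distributions but emerges from structured interactions between speech and gesture. The architecture provides reliable automated aphasia assessment, with possible uses in bedside screening and telehealth-based monitoring.

Link: https://arxiv.org/abs/2602.20163
Authors: Navya Martin Kollapally, Christa Akers, Renjith Nelson Joseph
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: IJCAI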

Abstract: Aphasia is an acquired language disorder caused by injury to the regions of the brain that are responsible for language. Aphasia may impair the use and comprehension of written and spoken language. The Western Aphasia Battery-Revised (WAB-R) is an assessment tool administered by speech-language pathologists (SLPs) to evaluate the aphasia type and severity. Because the WAB-R measures isolated linguistic skills, there has been growing interest in the assessment of discourse production as a more holistic representation of everyday language abilities. Recent advancements in speech analysis focus on automated estimation of aphasia severity from spontaneous speech, relying mostly on isolated linguistic or acoustical features. In this work, we propose a graph neural network-based framework for estimating aphasia severity. We represented each participant’s discourse as a directed multi-modal graph, where nodes represent lexical items and gestures and edges encode word-word, gesture-word, and word-gesture transitions. GraphSAGE is employed to learn participant-level embeddings, thus integrating information from immediate neighbors and overall graph structure. Our results suggest that aphasia severity is not encoded in isolated lexical distribution, but rather emerges from structured interactions between speech and gesture. The proposed architecture offers a reliable automated aphasia assessment, with possible uses in bedside screening and telehealth-based monitoring.
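
The graph construction described above can be sketched without any graph library. The example below turns a time-ordered sequence of words and gestures into nodes and counted directed transitions, keeping only the three edge types the abstract names (word-word, gesture-word, word-gesture). The `(modality, label)` tuple format and the function name are assumptions, and the GraphSAGE embedding step is not shown.

```python
from collections import Counter


def build_transition_graph(events):
    """Build a directed graph from consecutive discourse events.

    events: list of (modality, label) tuples in temporal order,
            where modality is "word" or "gesture".
    Returns (nodes, edge_counts); edge_counts maps (source, target) to a count.
    Gesture-to-gesture transitions are skipped, since the abstract lists
    only word-word, gesture-word, and word-gesture edges.
    """
    nodes = set(events)
    edge_counts = Counter(
        (source, target)
        for source, target in zip(events, events[1:])
        if not (source[0] == "gesture" and target[0] == "gesture")
    )
    return nodes, edge_counts


# Made-up discourse fragment for one participant.
events = [
    ("word", "the"),
    ("word", "dog"),
    ("gesture", "point"),
    ("word", "dog"),
    ("gesture", "point"),
]
nodes, edge_counts = build_transition_graph(events)
```

Counting repeated transitions gives edge weights for free, which a downstream graph model could use as a signal of how tightly a gesture is coupled to a word.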

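[NLP-48] Talking to Yourself: Defying Forgetting in Large Language Models

Quick read: This paper addresses catastrophic forgetting when Large Language Models (LLMs) are fine-tuned on narrow, task-specific data: while learning the new task, the model's general knowledge and reasoning abilities degrade noticeably. The key to the solution is SA-SFT, a lightweight self-augmentation method in which the LLM generates self-dialogues before fine-tuning and these self-generated data are mixed with the task data, with no change to the optimization strategy or training schedule. Across 50 evaluation scenarios, SA-SFT maintains performance comparable to the original model and achieves the best results in 40 of them, outperforming baselines such as layer freezing and external data mixing. A theoretical analysis further suggests that forgetting partly stems from style-induced parameter drift, and that self-alignment through self-generated data counteracts this drift, enabling robust model adaptation.

Link: https://arxiv.org/abs/2602.20162
Authors: Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Phillip Miao, Zilun Zhang, Haozhan Shen, Ruizhe Zhu, Jianwei Yin
Affiliations: Zhejiang University; Binjiang Institute of Zhejiang University; Om AI Research; Stanford University; ETH Zürich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: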

Abstract: Catastrophic forgetting remains a major challenge when fine-tuning large language models (LLMs) on narrow, task-specific data, often degrading their general knowledge and reasoning abilities. We propose SA-SFT, a lightweight self-augmentation routine in which an LLM generates self-dialogues prior to fine-tuning, and the resulting self-authored data are mixed with task data without modifying optimization or training schedules. Despite requiring no external data or additional tuning, SA-SFT consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, it maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing. Guided by these empirical findings, we further present a theoretical analysis suggesting that forgetting can partly stem from style-induced parameter drift, and that self-alignment through self-generated data provides an effective means to counteract this effect. Overall, our results indicate that self-augmentation offers a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.

[NLP-49] Multimodal MRI Report Findings Supervised Brain Lesion Segmentation with Substructures

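Quick read: This paper addresses the weak supervision signal in tumor segmentation for medical imaging when dense voxel-level annotations are lacking, with a focus on multi-parametric magnetic resonance imaging (MRI), where reports are incomplete, findings are hierarchical, and priors differ markedly across lesion types. Classical report-supervised (RSuper) methods can over-constrain or hallucinate findings when reports describe only the largest lesion or give qualitative or uncertain cues (such as "mild" or "possible"), and they cannot exploit modality-specific information or cohort-level anatomical priors. The key to the solution is a unified, one-sided, uncertainty-aware framework (MS-RSuper) built on three mechanisms: (i) existence and absence losses that align modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures; (ii) one-sided lower bounds for partial quantitative cues (e.g., largest lesion size, minimal multiplicity); and (iii) extra-axial versus intra-axial anatomical priors that respect cohort differences. Certainty tokens scale the penalties, and missing cues are down-weighted. On 1238 report-labeled BraTS-MET/MEN scans, the method largely outperforms both a sparsely-supervised baseline and a naive RSuper method.

Link: https://arxiv.org/abs/2602.20994
Authors: Yubin Ge, Yongsong Huang, Xiaofeng Liu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: IEEE International Symposium on Biomedical Imaging (ISBI) 2026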

Abstract:Report-supervised (RSuper) learning seeks to alleviate the need for dense tumor voxel labels with constraints derived from radiology reports (e.g., volumes, counts, sizes, locations). In MRI studies of brain tumors, however, we often involve multi-parametric scans and substructures. Here, fine-grained modality/parameter-wise reports are usually provided along with global findings and are correlated with different substructures. Moreover, the reports often describe only the largest lesion and provide qualitative or uncertain cues ("mild," "possible"). Classical RSuper losses (e.g., sum volume consistency) can over-constrain or hallucinate unreported findings under such incompleteness, and are unable to utilize these hierarchical findings or exploit the priors of varied lesion types in a merged dataset. We explicitly parse the global quantitative and modality-wise qualitative findings and introduce a unified, one-sided, uncertainty-aware formulation (MS-RSuper) that: (i) aligns modality-specific qualitative cues (e.g., T1c enhancement, FLAIR edema) with their corresponding substructures using existence and absence losses; (ii) enforces one-sided lower-bounds for partial quantitative cues (e.g., largest lesion size, minimal multiplicity); and (iii) adds extra- vs. intra-axial anatomical priors to respect cohort differences. Certainty tokens scale penalties; missing cues are down-weighted. On 1238 report-labeled BraTS-MET/MEN scans, our MS-RSuper largely outperforms both a sparsely-supervised baseline and a naive RSuper method.
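The one-sided, certainty-scaled constraints described in the abstract can be illustrated with a minimal sketch. This is not the authors' loss: the function names, the use of a summed probability as a "soft volume", and the scalar `certainty` weight are all assumptions made for illustration.

```python
def one_sided_lower_bound_loss(probs, lower_bound, certainty=1.0):
    """Hypothetical one-sided penalty: only a shortfall below the reported bound costs anything.

    probs: per-voxel foreground probabilities for one substructure.
    lower_bound: minimum soft volume implied by the report (assumed unit: voxels).
    certainty: weight in [0, 1]; uncertain wording such as "possible" would map to a smaller value.
    """
    soft_volume = sum(probs)
    shortfall = max(0.0, lower_bound - soft_volume)  # predicting more than the bound is not penalized
    return certainty * shortfall


def absence_loss(probs, certainty=1.0):
    """Hypothetical absence penalty: mean predicted probability when the report says 'not present'."""
    return certainty * sum(probs) / len(probs)
```

Because the report may describe only the largest lesion, the penalty is zero whenever the prediction meets or exceeds the bound, so unreported extra lesions are not pushed toward zero.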

Information Retrieval

[IR-0] Multi-Vector Index Compression in Any Modality

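[Quick Read]: This paper addresses the efficiency bottleneck of late interaction in multimodal information retrieval: computation and storage costs grow linearly with document length, which is costly for image-, video-, and audio-rich corpora. The core of the solution is attention-guided clustering (AGC), an index compression method that uses attention to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation, compressing multi-vector document representations under a constant vector budget. Experiments on text, visual-document, and video retrieval show that AGC consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), offers greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared with a full, uncompressed index.

Link: https://arxiv.org/abs/2602.21202
Authors: Hanxiang Qin,Alexander Martin,Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme
Affiliations: Johns Hopkins University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures

Click to view abstract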

Abstract:We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: this http URL.
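To make the idea of compressing a multi-vector document under a fixed vector budget concrete, here is a rough sketch. It is not the paper's AGC implementation: the saliency scores are assumed to be given (for example, derived from attention), and the top-k centroid choice, cosine assignment, and saliency-weighted mean are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def compress(tokens, saliency, budget):
    """Reduce `tokens` (list of vectors) to `budget` vectors.

    saliency: one positive score per token (assumed input, e.g. from attention).
    """
    # 1. The most salient tokens act as cluster centroids.
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    centroid_ids = order[:budget]
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in centroid_ids]
    weights = [0.0] * len(centroid_ids)
    # 2. Each token joins its most similar centroid, weighted by its saliency.
    for i, tok in enumerate(tokens):
        c = max(range(len(centroid_ids)),
                key=lambda j: cosine(tok, tokens[centroid_ids[j]]))
        for d, x in enumerate(tok):
            sums[c][d] += saliency[i] * x
        weights[c] += saliency[i]
    # 3. Saliency-weighted mean per cluster (guard against empty clusters).
    return [[x / (w if w > 0 else 1.0) for x in s] for s, w in zip(sums, weights)]
```

The output always has `budget` vectors regardless of document length, which is what keeps index size constant per document.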

[IR-1] A Benchmark for Deep Information Synthesis ICLR2026

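[Quick Read]: This paper addresses the inadequate evaluation of large language model (LLM) agents on complex, real-world tasks, in particular their ability to synthesize information from multiple sources, reason in depth, and produce insights. Existing benchmarks focus mainly on simple fact retrieval and do not reflect how agents perform on realistic, time-consuming problems. The key to the solution is DEEPSYNTH, a new benchmark of 120 tasks collected across 7 domains, with data sources covering 67 countries. A multi-stage data collection pipeline (collecting official data sources, creating hypotheses, performing manual analysis, and designing tasks with verifiable answers) ensures that the tasks are complex and realistic. The benchmark requires agents to combine information gathering, synthesis, and structured reasoning to produce insights, giving a fuller assessment of LLM agents on complex cognitive tasks.

Link: https://arxiv.org/abs/2602.21143
Authors: Debjit Paul,Daniel Murphy,Milan Gritta,Ronald Cardenas,Victor Prokhorov,Lena Sophia Bolliger,Aysim Toker,Roy Miles,Andreea-Maria Oncescu,Jasivan Alex Sivakumar,Philipp Borchert,Ismail Elezi,Meiru Zhang,Ka Yiu Lee,Guchun Zhang,Jun Wang,Gerasimos Lampouras
Affiliations: Huawei Noah’s Ark Lab; Imperial College London; UCL Centre for Artificial Intelligence; University of Zurich; University of Sheffield; University of Cambridge
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026

Click to view abstract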

Abstract:Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.

[IR-2] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

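[Quick Read]: This paper addresses the high latency and substantial test-time compute cost that large models incur during advanced reasoning, as well as the loss of interpretability and the added resource overhead that come with fine-tuning smaller models. The key to the solution is Prompt-Level Distillation (PLD): explicit reasoning patterns are extracted from a teacher model and organized into a structured list of expressive instructions placed in the student model's system prompt. This lets a compact model match frontier performance with negligible latency overhead while keeping the decision process transparent and open to human verification, which suits settings that demand both interpretability and efficiency.

Link: https://arxiv.org/abs/2602.21103
Authors: Sanket Badhe,Deep Shah
Affiliations: Google
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Click to view abstract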

Abstract:Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model’s System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57% to 90.0% and 67% to 83% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.

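[IR-3] Turning Semantics into Topology: LLM-Driven Attribute Augmentation for Collaborative Filtering

[Quick Read]: This paper addresses how to effectively integrate the rich semantic knowledge of large language models (LLMs) with the collaborative filtering (CF) embeddings traditionally used in recommender systems. Existing approaches are either too computationally expensive for large-scale retrieval or use LLM information only for unilateral feature augmentation rather than holistic enhancement of collaborative signals. The key to the solution is the Topology-Augmented Graph Collaborative Filtering (TAGCF) framework. First, an LLM infers interaction intents and underlying causal relationships from user-item pairs, and these are represented as intermediate attribute nodes in an enriched User-Attribute-Item (U-A-I) graph. Second, Adaptive Relation-weighted Graph Convolution (ARGC) uses relation-specific prediction networks to dynamically estimate the importance of each relation type, modeling the heterogeneous relations in the augmented graph. Semantic knowledge is thus turned into topological connectivity, and experiments report consistent improvements across datasets and CF backbones, including in cold-start scenarios.

Link: https://arxiv.org/abs/2602.21099
Authors: Junjie Meng,Ranxu zhang,Wei Wu,Rui Zhang,Chuan Qin,Qi Zhang,Qi Liu,Hui Xiong,Chao Wang
Affiliations: University of Science and Technology of China; Alibaba Group; Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract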

Abstract:Large Language Models (LLMs) have shown great potential for enhancing recommender systems through their extensive world knowledge and reasoning capabilities. However, effectively translating these semantic signals into traditional collaborative embeddings remains an open challenge. Existing approaches typically fall into two extremes: direct inference methods are computationally prohibitive for large-scale retrieval, while embedding-based methods primarily focus on unilateral feature augmentation rather than holistic collaborative signal enhancement. To bridge this gap, we propose Topology-Augmented Graph Collaborative Filtering (TAGCF), a novel framework that transforms semantic knowledge into topological connectivity. Unlike existing approaches that depend on textual features or direct interaction synthesis, TAGCF employs LLMs to infer interaction intents and underlying causal relationships from user-item pairs, representing these insights as intermediate attribute nodes within an enriched User-Attribute-Item (U-A-I) graph. Furthermore, to effectively model the heterogeneous relations in this augmented structure, we propose Adaptive Relation-weighted Graph Convolution (ARGC), which employs relation-specific prediction networks to dynamically estimate the importance of each relation type. Extensive experiments across multiple benchmark datasets and CF backbones demonstrate consistent improvements, with comprehensive evaluations including cold-start scenarios validating the effectiveness and robustness of our framework. All code will be made publicly available. For anonymous review, our code is available at the following anonymous link: this https URL.
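The User-Attribute-Item graph described above can be pictured with a small sketch. The attribute lists stand in for LLM output and are purely hypothetical, as are the function name and node encoding; the paper's actual construction and the ARGC convolution are not reproduced here.

```python
from collections import defaultdict


def build_uai_graph(interactions, inferred_attributes):
    """Build an undirected User-Attribute-Item adjacency map.

    interactions: list of (user, item) pairs.
    inferred_attributes: {(user, item): [attribute, ...]}, assumed to come from an LLM.
    Nodes are (kind, name) tuples so users, items, and attributes never collide.
    """
    graph = defaultdict(set)
    for user, item in interactions:
        u, i = ("user", user), ("item", item)
        graph[u].add(i)
        graph[i].add(u)
        # Each inferred attribute becomes an intermediate node linked to both ends.
        for attr in inferred_attributes.get((user, item), []):
            a = ("attr", attr)
            for node in (u, i):
                graph[node].add(a)
                graph[a].add(node)
    return graph
```

Two users who never interacted with the same item become connected through a shared attribute node, which is the sense in which semantics are converted into topology.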

[IR-4] Position-Aware Sequential Attention for Accurate Next Item Recommendations

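[Quick Read]: This paper addresses the limited sensitivity to sequence order of self-attention models that rely on additive positional embeddings. Such embeddings entangle positional information with item embedding semantics, so the positional signal propagates weakly in deep architectures and rich sequential patterns are hard to capture. The key to the solution is a kernelized self-attention mechanism in which a learnable positional kernel operates purely in position space, disentangled from semantic similarity, and directly modulates attention weights. Applied per attention block, the kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show consistent improvements over strong baselines.

Link: https://arxiv.org/abs/2602.21052
Authors: Timur Nabiev,Evgeny Frolov
Affiliations: Skolkovo Institute of Science and Technology; AIRI
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract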

Abstract:Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is permutation-equivariant over sequence positions and thus has no intrinsic notion of temporal order beyond causal masking. We argue that additive positional embeddings make the attention mechanism only superficially sensitive to sequence order: positional information is entangled with item embedding semantics, propagates weakly in deep architectures, and limits the ability to capture rich sequential patterns. To address these limitations, we introduce a kernelized self-attention mechanism, where a learnable positional kernel operates purely in the position space, disentangled from semantic similarity, and directly modulates attention weights. When applied per attention block, this kernel enables adaptive multi-scale sequential modeling. Experiments on standard next-item prediction benchmarks show that our positional kernel attention consistently improves over strong competing baselines.
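A minimal sketch of the general idea, under stated assumptions: causal softmax attention whose weights are multiplied by a kernel over relative distance and then renormalized. The multiplicative form, the fixed kernel values, and the function name are assumptions for illustration; in the paper the kernel is learnable and its exact parameterization may differ.

```python
import math


def positional_kernel_attention(q, k, v, kernel):
    """Causal attention modulated by a positional kernel.

    q, k, v: lists of n vectors (one per sequence position).
    kernel: kernel[d] >= 0 is the weight for relative distance d = i - j,
            with kernel[0] > 0 so every row has non-zero mass (assumption).
    """
    n, dim = len(q), len(q[0])
    out = []
    for i in range(n):
        # Semantic scores over positions j <= i (causal mask).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        m = max(scores)
        semantic = [math.exp(s - m) for s in scores]
        # The kernel acts purely on positions and rescales the semantic weights.
        modulated = [semantic[j] * kernel[i - j] for j in range(i + 1)]
        z = sum(modulated)
        w = [x / z for x in modulated]
        out.append([sum(w[j] * v[j][d] for j in range(i + 1))
                    for d in range(len(v[0]))])
    return out
```

With identical queries and keys the semantic weights are uniform, so any preference for recent items in the output comes from the kernel alone, which separates the positional effect from semantic similarity.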

[IR-5] HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders

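[Quick Read]: This paper addresses the fact that end-to-end modeling of ultra-long user behavior sequences is infeasible in production recommender systems because of latency and memory constraints. Existing methods that summarize history via interest centers struggle to identify user-specific centers at an appropriate granularity and to assign behaviors accurately, which leads to quantization errors and the loss of long-tail preferences. The key to the solution is Hierarchical Sparse Activation Compression (HiSAC): interactions are encoded into multi-level semantic IDs and a global hierarchical codebook is built; a hierarchical voting mechanism sparsely activates personalized interest-agents as fine-grained preference centers; and Soft-Routing Attention aggregates historical signals in semantic space, weighting by similarity to minimize quantization error and retain long-tail behaviors.

Link: https://arxiv.org/abs/2602.21009
Authors: Kun Yuan,Junyu Bi,Daixuan Cheng,Changfa Wu,Shuwen Xiao,Binbin Cao,Jian Wu,Yuning Jiang
Affiliations: Alibaba Group; Renmin University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Click to view abstract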

Abstract:Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints. While summarizing history via interest centers offers a practical alternative, existing methods struggle to (1) identify user-specific centers at appropriate granularity and (2) accurately assign behaviors, leading to quantization errors and loss of long-tail preferences. To alleviate these issues, we propose Hierarchical Sparse Activation Compression (HiSAC), an efficient framework for personalized sequence modeling. HiSAC encodes interactions into multi-level semantic IDs and constructs a global hierarchical codebook. A hierarchical voting mechanism sparsely activates personalized interest-agents as fine-grained preference centers. Guided by these agents, Soft-Routing Attention aggregates historical signals in semantic space, weighting by similarity to minimize quantization error and retain long-tail behaviors. Deployed on Taobao’s “Guess What You Like” homepage, HiSAC achieves significant compression and cost reduction, with online A/B tests showing a consistent 1.65% CTR uplift – demonstrating its scalability and real-world effectiveness.
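The contrast between hard assignment and similarity-weighted soft routing can be sketched as follows. This is an illustrative stand-in rather than HiSAC's Soft-Routing Attention: the dot-product similarity, the `temperature` parameter, and the function names are assumptions.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    es = [math.exp((x - m) / temperature) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def soft_route(behaviors, agents, temperature=1.0):
    """Summarize `behaviors` into one vector per interest agent.

    Each behavior contributes to every agent in proportion to its similarity,
    instead of being snapped to a single nearest center.
    """
    dim = len(behaviors[0])
    summaries = [[0.0] * dim for _ in agents]
    mass = [0.0] * len(agents)
    for b in behaviors:
        sims = [sum(x * y for x, y in zip(b, a)) for a in agents]
        routing = softmax(sims, temperature)
        for j, r in enumerate(routing):
            mass[j] += r
            for d in range(dim):
                summaries[j][d] += r * b[d]
    return [[x / m for x in s] for s, m in zip(summaries, mass)]
```

A low temperature approaches hard nearest-center assignment, while a higher one lets behaviors far from every center still leave a trace in the summaries, which is one way to read the abstract's point about retaining long-tail behaviors.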

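[IR-6] Generative Pseudo-Labeling for Pre-Ranking with LLMs

[Quick Read]: This paper addresses the train-serving discrepancy in the pre-ranking stage of industrial recommendation systems: models are trained only on exposed interactions, yet during online serving they must score all recalled candidates, including unexposed items. The mismatch induces severe sample selection bias and degrades generalization, especially for long-tail content. Existing debiasing approaches rely on heuristics such as negative sampling or on distillation from biased rankers, which either mislabel plausible unexposed items as negatives or propagate exposure bias into the pseudo-labels. The key to the solution is Generative Pseudo-Labeling (GPL), a framework that uses large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space. User-specific interest anchors are generated offline and matched with candidates in a frozen semantic space, providing high-quality supervision without adding online latency.

Link: https://arxiv.org/abs/2602.20995
Authors: Junyu Bi,Xinting Niu,Daixuan Cheng,Kun Yuan,Tao Wang,Binbin Cao,Jian Wu,Yuning Jiang
Affiliations: Alibaba Group; Renmin University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Click to view abstract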

Abstract:Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking. A key challenge is the train-serving discrepancy: pre-ranking models are trained only on exposed interactions, yet must score all recalled candidates – including unexposed items – during online serving. This mismatch not only induces severe sample selection bias but also degrades generalization, especially for long-tail content. Existing debiasing approaches typically rely on heuristics (e.g., negative sampling) or distillation from biased rankers, which either mislabel plausible unexposed items as negatives or propagate exposure bias into pseudo-labels. In this work, we propose Generative Pseudo-Labeling (GPL), a framework that leverages large language models (LLMs) to generate unbiased, content-aware pseudo-labels for unexposed items, explicitly aligning the training distribution with the online serving space. By offline generating user-specific interest anchors and matching them with candidates in a frozen semantic space, GPL provides high-quality supervision without adding online latency. Deployed in a large-scale production system, GPL improves click-through rate by 3.07%, while significantly enhancing recommendation diversity and long-tail item discovery.
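The anchor-matching step can be pictured with a short sketch, assuming the interest anchors and candidate items have already been embedded offline in the same frozen space. The max-over-anchors cosine score and the clipping at zero are illustrative assumptions, not the scoring rule used by GPL.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pseudo_labels(anchors, candidates):
    """Assign a soft label in [0, 1] to every candidate, exposed or not.

    anchors: list of interest-anchor embeddings for one user (assumed LLM-derived).
    candidates: {item_id: embedding} in the same frozen semantic space.
    """
    labels = {}
    for item_id, vec in candidates.items():
        best = max(cosine(a, vec) for a in anchors)
        labels[item_id] = max(0.0, best)  # dissimilar items get 0 rather than a negative label
    return labels
```

Because the labels are computed offline from content alone, unexposed items receive supervision without any extra work at serving time.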

[IR-7] Naver Labs Europe @ WSDM CUP | Multilingual Retrieval WSDM

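[Quick Read]: This paper addresses limited cross-lingual generalization in multilingual document retrieval, specifically retrieving multilingual documents from English queries. The solution builds on SPLARE, a learned sparse retrieval model that produces generalizable sparse latent representations and is well suited to multilingual retrieval. Across five progressively enhanced runs, which add reranking with Qwen3-Reranker-4B and simple score fusion strategies, the study shows that SPLARE is strong compared with state-of-the-art dense baselines such as Qwen3-8B-Embed, highlighting the continued relevance and competitiveness of learned sparse retrieval beyond English-centric scenarios.

Link: https://arxiv.org/abs/2602.20986
Authors: Thibault Formal,Maxime Louis,Hervé Déjean,Stéphane Clinchant
Affiliations: Unknown
Categories: Information Retrieval (cs.IR)
Comments: Report paper of our submission to the WSDM Cup 2026

Click to view abstract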

Abstract:This report presents our participation to the WSDM Cup 2026 shared task on multilingual document retrieval from English queries. The task provides a challenging benchmark for cross-lingual generalization. It also provides a natural testbed for evaluating SPLARE, our recently proposed learned sparse retrieval model, which produces generalizable sparse latent representations and is particularly well suited to multilingual retrieval settings. We evaluate five progressively enhanced runs, starting from a SPLARE-7B model and incorporating lightweight improvements, including reranking with Qwen3-Reranker-4B and simple score fusion strategies. Our results demonstrate the strength of SPLARE compared to state-of-the-art dense baselines such as Qwen3-8B-Embed. More broadly, our submission highlights the continued relevance and competitiveness of learned sparse retrieval models beyond English-centric scenarios.
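The report mentions "simple score fusion strategies" without specifying them here. One common form, shown purely as an assumption, is min-max normalization of first-stage and reranker scores followed by linear interpolation; the weight `alpha` and the function names are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def fuse(first_stage, reranker, alpha=0.5):
    """Rank documents by alpha * first-stage + (1 - alpha) * reranker score.

    Documents missing from `reranker` contribute 0.0 for that component.
    """
    a, b = min_max(first_stage), min_max(reranker)
    fused = {d: alpha * a[d] + (1 - alpha) * b.get(d, 0.0) for d in a}
    return sorted(fused, key=fused.get, reverse=True)
```

Normalizing first matters because sparse retrieval scores and reranker scores usually live on very different scales.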

[IR-8] E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications

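[Quick Read]: This paper addresses two limitations of multimodal recommender systems (MMRSs): reliance on a fixed set of modalities, which limits modality extensibility, and task-specific objectives, which limit generalization across tasks. The proposed E-MMKGR framework constructs an e-commerce-specific multimodal knowledge graph (E-MMKG) and learns unified item representations through GNN-based propagation and KG-oriented optimization. These representations provide a shared semantic foundation for diverse downstream tasks, and experiments report gains on both recommendation and product search, the latter measured against vector-based retrieval.

Link: https://arxiv.org/abs/2602.20877
Authors: Jiwoo Kang,Yeon-Chang Lee
Affiliations: UNIST (Ulsan National Institute of Science and Technology)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract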

Abstract:Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalization. We propose E-MMKGR, a framework that constructs an e-commerce-specific Multimodal Knowledge Graph E-MMKG and learns unified item representations through GNN-based propagation and KG-oriented optimization. These representations provide a shared semantic foundation applicable to diverse tasks. Experiments on real-world Amazon datasets show improvements of up to 10.18% in Recall@10 for recommendation and up to 21.72% over vector-based retrieval for product search, demonstrating the effectiveness and extensibility of our approach.

[IR-9] Mitigating Preference Leakage via Strict Estimator Separation for Normative Generative Ranking

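[Quick Read]: This paper addresses the credibility of evaluation at the candidate selection stage of Generative Information Retrieval (GenIR), in particular for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlapping supervision and evaluation models inflate measured performance. The key to the solution is to formalize cultural relevance as a within-query ranking task and to introduce a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On NGR-33k, a new benchmark of 33,052 culturally grounded stories, classical baselines yield only modest gains, whereas a dense bi-encoder (BGE-M3) distilled from a Judge-B-supervised Cross-Encoder substantially outperforms that Cross-Encoder under leakage-free Judge A evaluation, showing that subtle cultural preferences can be distilled into efficient rankers without leakage.

Link: https://arxiv.org/abs/2602.20800
Authors: Dalia Nahhas,Xiaohao Cai,Imran Razzak,Shoaib Jameel
Affiliations: University of Southampton; Mohamed bin Zayed University of Artificial Intelligence
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract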

Abstract:In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to the selection of candidates, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlapping supervision and evaluation models inflate performance. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On a new benchmark of 33,052 (NGR-33k) culturally grounded stories, we find that while classical baselines yield only modest gains, a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, proving that subtle cultural preferences can be distilled into efficient rankers without leakage.

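[IR-10] RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition NEURIPS2025

[Quick Read]: This paper addresses how retrieval-augmented generation (RAG) systems can adapt to query complexity and evidence sufficiency on complex research tasks. The core of the solution is the Routing-to-RAG (R2RAG) architecture, composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller large language models (LLMs), so it runs on a single consumer-grade GPU while still supporting complex research tasks. It builds on the earlier G-RAG system and extends it with modules informed by qualitative review of outputs, and it won the Best Dynamic Evaluation award in the Open Source category of the NeurIPS 2025 MMU-RAG Competition.

Link: https://arxiv.org/abs/2602.20735
Authors: Kun Ran,Marwah Alaofi,Danula Hettiachchi,Chenglong Ma,Khoi Nguyen Dinh Anh,Khoi Vo Nguyen,Sachin Pathiyan Cherumanal,Lida Rashidi,Falk Scholer,Damiano Spina,Shuoqi Sun,Oleg Zendel
Affiliations: RMIT University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: MMU-RAG NeurIPS 2025 winning system

Click to view abstract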

Abstract:This paper presents the award-winning RMIT-ADM+S system for the Text-to-Text track of the NeurIPS 2025 MMU-RAG Competition. We introduce Routing-to-RAG (R2RAG), a research-focused retrieval-augmented generation (RAG) architecture composed of lightweight components that dynamically adapt the retrieval strategy based on inferred query complexity and evidence sufficiency. The system uses smaller LLMs, enabling operation on a single consumer-grade GPU while supporting complex research tasks. It builds on the G-RAG system, winner of the ACM SIGIR 2025 LiveRAG Challenge, and extends it with modules informed by qualitative review of outputs. R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.

[IR-11] IntRR: A Framework for Integrating SID Redistribution and Length Reduction

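[Quick Read]: This paper addresses two problems in Generative Recommendation (GR): the objectives used to build Semantic IDs (SIDs) are misaligned with the actual recommendation goal, and flattening hierarchical SIDs into token sequences inflates sequence length, causing heavy computational overhead and inference latency. The key to the solution is the IntRR framework, which integrates objective-aligned SID Redistribution and structural Length Reduction. Item-specific Unique IDs (UIDs) serve as collaborative anchors for dynamically redistributing semantic weights across hierarchical codebook layers, and the SID hierarchy is handled recursively rather than flattened, giving a fixed cost of one token per item and improving both recommendation accuracy and efficiency.

Link: https://arxiv.org/abs/2602.20704
Authors: Zesheng Wang,Longfei Xu,Weidong Deng,Huimin Yan,Kaikui Liu,Xiangxiang Chu
Affiliations: AMAP, Alibaba Group
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract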

Abstract:Generative Recommendation (GR) has emerged as a transformative paradigm that reformulates the traditional cascade ranking system into a sequence-to-item generation task, facilitated by the use of discrete Semantic IDs (SIDs). However, current SIDs are suboptimal as the indexing objectives (Stage 1) are misaligned with the actual recommendation goals (Stage 2). Since these identifiers remain static (Stage 2), the backbone model lacks the flexibility to adapt them to the evolving complexities of user interactions. Furthermore, the prevailing strategy of flattening hierarchical SIDs into token sequences leads to sequence length inflation, resulting in prohibitive computational overhead and inference latency. To address these challenges, we propose IntRR, a novel framework that integrates objective-aligned SID Redistribution and structural Length Reduction. By leveraging item-specific Unique IDs (UIDs) as collaborative anchors, this approach dynamically redistributes semantic weights across hierarchical codebook layers. Concurrently, IntRR handles the SID hierarchy recursively, eliminating the need to flatten sequences. This ensures a fixed cost of one token per item. Extensive experiments on benchmark datasets demonstrate that IntRR yields substantial improvements over representative generative baselines, achieving superior performance in both recommendation accuracy and efficiency.

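[IR-12] PRECTR-V2: Unified Relevance-CTR Framework with Cross-User Preference Mining, Exposure Bias Correction, and LLM-Distilled Encoder Optimization

[Quick Read]: This paper addresses inconsistency between relevance matching and click-through rate (CTR) prediction in search systems, targeting three challenges: cold start for low-activity and new users, the distribution mismatch between training data and the candidate space in coarse ranking, and an architecture that leaves representation learning misaligned with CTR fine-tuning. The solution has three parts. First, global relevance preferences under a specific query are mined to offset the sparse behavior of low-activity users and improve personalized relevance modeling. Second, hard negative samples are constructed through embedding noise injection and relevance label reconstruction, and their ranking relative to positive samples is optimized with a pairwise loss to correct exposure bias. Third, a lightweight transformer-based encoder is pretrained via knowledge distillation from an LLM and supervised fine-tuning (SFT) on a text relevance classification task; it replaces the frozen BERT module, adapts better to CTR fine-tuning, and moves beyond the traditional Emb+MLP paradigm.

Link: https://arxiv.org/abs/2602.20676
Authors: Shuzhi Cao,Rong Chen,Ailong He,Shuguang Han,Jufeng Chen
Affiliations: Alibaba Group
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2503.18395

Click to view abstract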

Abstract:In search systems, effectively coordinating the two core objectives of search relevance matching and click-through rate (CTR) prediction is crucial for discovering users’ interests and enhancing platform revenue. In our prior work PRECTR, we proposed a unified framework to integrate these two subtasks, thereby eliminating their inconsistency and leading to mutual [...]. However, our previous work still faces three main challenges. First, low-active users and new users have limited search behavioral data, making it difficult to achieve effective personalized relevance preference modeling. Second, training data for ranking models predominantly come from high-relevance exposures, creating a distribution mismatch with the broader candidate space in coarse-ranking, leading to generalization bias. Third, due to the latency constraint, the original model employs an Emb+MLP architecture with a frozen BERT encoder, which prevents joint optimization and creates misalignment between representation learning and CTR fine-tuning. To solve these issues, we further reinforce our method and propose PRECTR-V2. Specifically, we mitigate the low-activity users’ sparse behavior problem by mining global relevance preferences under the specific query, which facilitates effective personalized relevance modeling for cold-start scenarios. Subsequently, we construct hard negative samples through embedding noise injection and relevance label reconstruction, and optimize their relative ranking against positive samples via pairwise loss, thereby correcting exposure bias. Finally, we pretrain a lightweight transformer-based encoder via knowledge distillation from LLM and SFT on the text relevance classification task. This encoder replaces the frozen BERT module, enabling better adaptation to CTR fine-tuning and advancing beyond the traditional Emb+MLP paradigm.
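The second step, hard negatives built by embedding noise injection and ranked against positives with a pairwise loss, can be sketched as follows. The Gaussian noise, the logistic form of the pairwise loss, and the function names are assumptions for illustration; the relevance label reconstruction step is not modeled.

```python
import math
import random


def noisy_negative(embedding, sigma, rng):
    """Perturb an embedding with Gaussian noise to create a hard negative (assumed noise model)."""
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def pairwise_logistic_loss(pos_score, neg_score):
    """log(1 + exp(-(pos - neg))), written in a form that does not overflow."""
    margin = pos_score - neg_score
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: the loss shrinks as the positive outranks the perturbed negative.
rng = random.Random(0)
hard_neg = noisy_negative([0.3, -0.1, 0.8], sigma=0.05, rng=rng)
```

A small `sigma` keeps the negative close to the original item in embedding space, which is what makes it "hard": the model must separate two nearly identical inputs by relevance.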

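[IR-13] From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production

[Quick Read]: This paper addresses verbalization, a key but underexplored problem in generative recommender systems: converting structured user interaction logs into natural language inputs that large language models (LLMs) can use effectively. Existing methods rely on rigid templates that simply concatenate fields, which yields suboptimal representations for recommendation. The key to the solution is a data-centric framework in which a verbalization agent is trained with reinforcement learning, using recommendation accuracy as the training signal, to filter noise, incorporate relevant metadata, and reorganize information from raw interaction histories into optimized textual contexts. Experiments on a large-scale industrial streaming dataset show up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines, and analysis reveals emergent strategies such as user interest summarization, noise removal, and syntax normalization.

Link: https://arxiv.org/abs/2602.20558
Authors: Yucheng Shi,Ying Li,Yu Wang,Yesu Feng,Arjun Rao,Rein Houthooft,Shradha Sehgal,Jin Wang,Hao Zhen,Ninghao Liu,Linas Baltrunas
Affiliations: University of Georgia; Netflix; Capital One; Hong Kong Polytechnic University
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Work in progress

Click to view abstract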

Abstract:Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalization agent transforms raw interaction histories into optimized textual contexts, with recommendation accuracy as the training signal. This agent learns to filter noise, incorporate relevant metadata, and reorganize information to improve downstream predictions. Experiments on a large-scale industrial streaming dataset show that learned verbalization delivers up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines. Further analysis reveals emergent strategies such as user interest summarization, noise removal, and syntax normalization, offering insights into effective context construction for LLM-based recommender systems.

[IR-14] Indaleko: The Unified Personal Index

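[Quick read]: This paper addresses the failure of personal information retrieval (PIR) systems to match how human memory works: existing platforms rely on keyword matching within isolated storage silos and cannot support memory-style queries based on episodic cues such as time, place, and context. The key to the solution is the Unified Personal Index (UPI) architecture, which integrates temporal, spatial, and activity metadata into a graph database and uses memory anchor indexing to achieve sub-second responses. This supports multi-dimensional natural language queries (e.g., “photos near the conference venue last spring”), removes cross-platform search fragmentation while maintaining high precision, allows new data sources to be added quickly, and preserves privacy through UUID-based semantic decoupling.

Link: https://arxiv.org/abs/2602.20507
Authors: William Anthony Mason
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments: PhD dissertation, University of British Columbia, August 2025. 287 pages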

Abstract:Personal information retrieval fails when systems ignore how human memory works. While existing platforms force keyword searches across isolated silos, humans naturally recall through episodic cues like when, where, and in what context information was encountered. This dissertation presents the Unified Personal Index (UPI), a memory-aligned architecture that bridges this fundamental gap. The Indaleko prototype demonstrates the UPI’s feasibility on a 31-million file dataset spanning 160TB across eight storage platforms. By integrating temporal, spatial, and activity metadata into a unified graph database, Indaleko enables natural language queries like “photos near the conference venue last spring” that existing systems cannot process. The implementation achieves sub-second query responses through memory anchor indexing, eliminates cross-platform search fragmentation, and maintains perfect precision for well-specified memory patterns. Evaluation against commercial systems (Google Drive, OneDrive, Dropbox, Windows Search) reveals that all fail on memory-based queries, returning overwhelming result sets without contextual filtering. In contrast, Indaleko successfully processes multi-dimensional queries combining time, location, and activity patterns. The extensible architecture supports rapid integration of new data sources (10 minutes to 10 hours per provider) while preserving privacy through UUID-based semantic decoupling. The UPI’s architectural synthesis bridges cognitive theory with distributed systems design, as demonstrated through the Indaleko prototype and rigorous evaluation. This work transforms personal information retrieval from keyword matching to memory-aligned finding, providing immediate benefits for existing data while establishing foundations for future context-aware systems.

Human-Computer Interaction

[HC-0] SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery

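[Quick read]: This paper addresses the efficiency bottleneck in collecting qualitative user feedback at scale: traditional semi-structured interviews depend on expert time and are hard to scale across broad topics and emergent insights. Existing automated interviewing systems based on large language models (LLMs) lack a mechanism for balancing coverage of predefined topics against flexible exploration of emergent themes, and cannot effectively pursue follow-ups and deep dives. The key to the solution is to formulate adaptive semi-structured interviewing as an optimization problem, defining interview utility as a trade-off among coverage of predefined topics, discovery of emergent themes, and conversation cost (length in turns). On this basis the authors propose SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts and selects the questions with the highest expected utility. In controlled experiments SparkMe improves topic coverage (+4.7% over the best baseline) and elicits richer emergent insights in fewer turns; in a user study spanning 7 professions, domain experts rated its adaptive interviews as high quality.

Link: https://arxiv.org/abs/2602.21136
Authors: David Anugraha, Vishakh Padmakumar, Diyi Yang
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: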

Abstract:Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to conduct semi-structured interviews. Recent work has explored using large language models (LLMs) to automate interviewing, yet existing systems lack a principled mechanism for balancing systematic coverage of predefined topics with adaptive exploration, or the ability to pursue follow-ups, deep dives, and emergent themes that arise organically during conversation. In this work, we formulate adaptive semi-structured interviewing as an optimization problem over the interviewer’s behavior. We define interview utility as a trade-off between coverage of a predefined interview topic guide, discovery of relevant emergent themes, and interview cost measured by length. Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility. We evaluate SparkMe through controlled experiments with LLM-based interviewees, showing that it achieves higher interview utility, improving topic guide coverage (+4.7% over the best baseline) and eliciting richer emergent insights while using fewer conversational turns than prior LLM interviewing approaches. We further validate SparkMe in a user study with 70 participants across 7 professions on the impact of AI on their workflows. Domain experts rate SparkMe as producing high-quality adaptive interviews that surface helpful profession-specific insights not captured by prior approaches. The code, datasets, and evaluation protocols for SparkMe are available as open-source at this https URL.
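The utility trade-off and rollout-based question selection described above can be sketched as follows. This is an illustration only: the additive form of the utility, the weights `w_discovery` and `cost_per_turn`, and the rollout outcome fields are assumptions of this sketch, not the paper's actual formulation.

```python
from statistics import mean
from typing import Dict, List


def interview_utility(coverage: float, discovery: float, turns: int,
                      w_discovery: float = 1.0, cost_per_turn: float = 0.02) -> float:
    """Topic coverage plus weighted emergent-theme discovery, minus a length cost."""
    return coverage + w_discovery * discovery - cost_per_turn * turns


def pick_question(rollouts: Dict[str, List[dict]]) -> str:
    """Return the candidate question with the highest mean utility over its rollouts."""
    expected = {
        question: mean(interview_utility(**outcome) for outcome in outcomes)
        for question, outcomes in rollouts.items()
    }
    return max(expected, key=expected.get)


# Hypothetical outcomes of simulated conversation rollouts for two candidate questions.
rollouts = {
    "follow_up": [
        {"coverage": 0.6, "discovery": 0.3, "turns": 10},
        {"coverage": 0.6, "discovery": 0.2, "turns": 12},
    ],
    "next_topic": [
        {"coverage": 0.7, "discovery": 0.0, "turns": 10},
        {"coverage": 0.7, "discovery": 0.1, "turns": 11},
    ],
}
print(pick_question(rollouts))
```

With these made-up numbers the follow-up question wins because its expected gain in emergent themes outweighs the slightly lower coverage; changing the weights shifts that balance.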

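[HC-1] “Are You Sure?”: An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

[Quick read]: This paper addresses Agent-Mediated Deception (AMD), in which compromised generative AI agents in high-stakes domains exploit the trust of their human users to deceive them. The core of the work is HAT-Lab (Human-Agent Trust Laboratory), a high-fidelity research platform with nine scenarios spanning everyday and professional domains (e.g., healthcare, software development, human resources), together with a large-scale empirical study of 303 participants that identifies the cognitive failure modes and behavioral vulnerabilities people show when facing AMD. Key findings: only 8.6% of participants perceived the AMD attacks, domain experts were more susceptible in certain scenarios, and risk awareness often failed to translate into protective behavior. Effective defenses should use warnings that interrupt the workflow while keeping verification costs low, and experiential learning based on HAT-Lab raises caution: over 90% of users who perceived risks reported becoming more cautious.

Link: https://arxiv.org/abs/2602.21127
Authors: Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang
Affiliations: Nanyang Technological University; KTH; William & Mary
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Comments: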

Abstract:Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users. While extensive research focuses on agent-centric threats, human susceptibility to deception by a compromised agent remains unexplored. We present the first large-scale empirical study with 303 participants to measure human susceptibility to AMD. This is based on HAT-Lab (Human-Agent Trust Laboratory), a high-fidelity research platform we develop, featuring nine carefully crafted scenarios spanning everyday and professional domains (e.g., healthcare, software development, human resources). Our 10 key findings reveal significant vulnerabilities and provide future defense perspectives. Specifically, only 8.6% of participants perceive AMD attacks, while domain experts show increased susceptibility in certain scenarios. We identify six cognitive failure modes in users and find that their risk awareness often fails to translate to protective behavior. The defense analysis reveals that effective warnings should interrupt workflows with low verification costs. With experiential learning based on HAT-Lab, over 90% of users who perceive risks report increased caution against AMD. This work provides empirical evidence and a platform for human-centric agent security research.

[HC-2] An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

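[Quick read]: This paper addresses the inadequate evaluation of large language model (LLM) reliability in scholarly question answering: existing automated metrics lack domain context and do not reflect how scientific experts actually judge LLM outputs. The key to the solution is an error schema built and validated from the assessment strategies of domain experts. Working with domain experts, the authors identified 20 error patterns across seven categories from 68 question-answer pairs, then validated the schema through contextual inquiries, showing that it helps experts detect error types they previously overlooked. The schema reflects the systematic strategies experts use, including technical precision testing, value-based evaluation, and meta-evaluation, and lays groundwork for structured tools that adapt to individual expertise levels and evaluation habits.

Link: https://arxiv.org/abs/2602.21059
Authors: Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur
Affiliations: University of Minnesota; NASA Langley Research Center
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 24 pages, 2 figures. Accepted at ACM CHI conference on Human Factors in Computing Systems, 2026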

Abstract:Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.

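[HC-3] PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly QA

[Quick read]: This paper addresses subtle errors in content generated by large language models (LLMs) in scholarly question-answering (QA) systems, such as unsupported claims and omissions, which existing provenance mechanisms (e.g., source citations) are too coarse-grained to expose at the level of rigor that scholarly verification requires. The key to the solution is the PaperTrail interface, which decomposes both LLM answers and source documents into discrete claims and evidence and maps them to each other, revealing supported assertions, unsupported claims, and information in the sources that the answer omits. This fine-grained claim-evidence matching is intended to help users judge how far LLM output can be trusted.

Link: https://arxiv.org/abs/2602.21045
Authors: Anna Martin-Boyle, Cara A.C. Leckey, Martha C. Brown, Harmanpreet Kaur
Affiliations: University of Minnesota; NASA Langley Research Center
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 25 pages, 3 figures. Accepted at the ACM CHI conference on Human Factors in Computing Systems 2026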

Abstract:Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that the scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants’ trust compared to the baseline. However, this increased caution did not translate to behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.

[HC-4] Hierarchic-EEG2Text: Assessing EEG-To-Text Decoding across Hierarchical Abstraction Levels

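[Quick read]: This paper addresses the difficulty of fine-grained class recognition from electroencephalography (EEG), caused by its low signal-to-noise ratio, and asks whether EEG captures object representations across multiple levels of semantic abstraction. The key to the solution is hierarchy-aware episodic analysis: WordNet is used to generate classification tasks (episodes) with varying hierarchy levels and numbers of classes, so that model performance can be assessed across abstraction levels. The authors also present what they describe as the largest episodic framework in the EEG domain, built on the PEERS dataset, with 931,538 EEG samples under 1,610 object labels, supporting the study of neural dynamics underlying perception, decision-making, and performance monitoring. Models tend to perform better at higher levels of semantic abstraction, which highlights abstraction depth as an underexplored dimension of EEG decoding.

Link: https://arxiv.org/abs/2602.20932
Authors: Anupam Sharma, Harish Katti, Prajwal Singh, Shanmuganathan Raman, Krishna Miyapuram
Affiliations: IIT Gandhinagar; NIMH, NIH
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: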

Abstract:An electroencephalogram (EEG) records the spatially averaged electrical activity of neurons in the brain, measured from the human scalp. Prior studies have explored EEG-based classification of objects or concepts, often for passive viewing of briefly presented image or video stimuli, with limited classes. Because EEG exhibits a low signal-to-noise ratio, recognizing fine-grained representations across a large number of classes remains challenging; however, abstract-level object representations may exist. In this work, we investigate whether EEG captures object representations across multiple hierarchical levels, and propose episodic analysis, in which a Machine Learning (ML) model is evaluated across various, yet related, classification tasks (episodes). Unlike prior episodic EEG studies that rely on fixed or randomly sampled classes of equal cardinality, we adopt hierarchy-aware episode sampling using WordNet to generate episodes with variable classes of diverse hierarchy. We also present the largest episodic framework in the EEG domain for detecting observed text from EEG signals in the PEERS dataset, comprising 931538 EEG samples under 1610 object labels, acquired from 264 human participants (subjects) performing controlled cognitive tasks, enabling the study of neural dynamics underlying perception, decision-making, and performance monitoring. We examine how the semantic abstraction level affects classification performance across multiple learning techniques and architectures, providing a comprehensive analysis. The models tend to improve performance when the classification categories are drawn from higher levels of the hierarchy, suggesting sensitivity to abstraction. Our work highlights abstraction depth as an underexplored dimension of EEG decoding and motivates future research in this direction. 
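As a rough illustration of hierarchy-aware episode sampling, the sketch below draws variable-size sets of sibling classes from different levels of a hierarchy. The tiny `HIERARCHY` dictionary is a hypothetical stand-in for WordNet, and the sampling rule is an assumption of this sketch rather than the paper's procedure.

```python
import random
from typing import Dict, List

# Toy hierarchy standing in for WordNet: parent -> children (hypothetical labels).
HIERARCHY: Dict[str, List[str]] = {
    "entity": ["animal", "artifact"],
    "animal": ["dog", "cat", "bird"],
    "artifact": ["chair", "hammer", "car"],
}


def sample_episode(parent: str, rng: random.Random, min_classes: int = 2) -> List[str]:
    """Sample a variable-size set of sibling classes under one parent node."""
    children = HIERARCHY[parent]
    k = rng.randint(min_classes, len(children))  # episode size varies per draw
    return sorted(rng.sample(children, k))


rng = random.Random(0)
coarse_episode = sample_episode("entity", rng)  # classes at a high abstraction level
fine_episode = sample_episode("animal", rng)    # classes at a lower abstraction level
print(coarse_episode, fine_episode)
```

Evaluating the same model on episodes drawn from different parents is one way to compare decoding performance across abstraction levels.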

[HC-5] InterPilot: Exploring the Design Space of AI-assisted Job Interview Support for HR Professionals

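[Quick read]: This paper addresses the heavy cognitive demands on interviewers in recruitment interviews, who must listen, evaluate candidates, take notes, and formulate follow-up questions at the same time. The key to the solution is InterPilot, a prototype system that provides real-time AI assistance through intelligent note-taking and post-interview summaries, adaptive question generation, and real-time skill-evidence mapping, with the aim of reducing the documentation burden and making interviews more efficient.

Link: https://arxiv.org/abs/2602.20891
Authors: Zhengtao Xu, Zimo Xia, Zicheng Zhu, Nattapat Boonprakong, Yu-An Chen, Rabih Zbib, Casimiro Pio Carrino, Yi-Chieh Lee
Affiliations: National University of Singapore; Harvard University; Avature
Categories: Human-Computer Interaction (cs.HC)
Comments: 7 pages, 2 figures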

Abstract:Recruitment interviews are cognitively demanding interactions in which interviewers must simultaneously listen, evaluate candidates, take notes, and formulate follow-up questions. To better understand these challenges, we conducted a formative study with eight HR professionals, from which we derived key design goals for real-time AI support. Guided by these insights, we developed InterPilot, a prototype system that augments interviews through intelligent note-taking and post-interview summary, adaptive question generation, and real-time skill-evidence mapping. We evaluated the system with another seven HR professionals in mock interviews using a within-subjects design. Results show that InterPilot reduced documentation burden without increasing overall workload, but introduced usability trade-offs related to visual attention and interaction complexity. Qualitative findings further reveal tensions around trust and verification when AI suggests highly specific technical questions. We discuss implications for designing future real-time human-AI collaboration in professional settings, highlighting the need to balance assistance granularity, attentional demands, and human agency.

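[HC-6] When LLMs Enter Everyday Feminism on Chinese Social Media: Opportunities and Risks for Women's Empowerment

[Quick read]: This paper examines how large language models (LLMs), as they enter everyday digital feminism, affect critical discussion of gender inequality and pathways to empowerment in women's online communities. The study finds that although users largely welcome DeepSeek-generated content as a new voice, that content mainly steers women toward self-optimization and achievement within prevailing social norms rather than challenging structural gender bias. The key contribution is a feminist critical discourse analysis that identifies how LLM responses reinforce this norm-conforming form of empowerment, followed by design implications for making LLMs support everyday feminist practices with real transformative potential rather than reproducing mainstream ideology.

Link: https://arxiv.org/abs/2602.20876
Authors: Runhua Zhang, Ziqi Pan, Kangyu Yuan, Qiaoyi Chen, Yulin Tian, Huamin Qu, Xiaojuan Ma
Affiliations: The Hong Kong University of Science and Technology; College of Design and Innovation, Tongji University
Categories: Human-Computer Interaction (cs.HC)
Comments: This paper is conditionally accepted to CHI 2026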

Abstract:Everyday digital feminism refers to the ordinary, often pragmatic ways women articulate lived experiences and cultivate solidarity in online spaces. In China, such practices flourish on RedNote through discussions under hashtags like “women’s growth”. Recently, DeepSeek-generated content has been taken up as a new voice in these conversations. Given widely recognized gender biases in LLMs, this raises critical concerns about how LLMs interact with everyday feminist practices. Through an analysis of 430 RedNote posts, 139 shared DeepSeek responses, and 3211 comments, we found that users predominantly welcomed DeepSeek’s advice. Yet feminist critical discourse analysis revealed that these responses primarily encouraged women to self-optimize and pursue achievements within prevailing norms rather than challenge them. By interpreting this case, we discuss the opportunities and risks that LLMs introduce for everyday feminism as a pathway toward women’s empowerment, and offer design implications for leveraging LLMs to better support such practices.

[HC-7] Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video

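[Quick read]: This paper addresses the difficulty of obtaining key ergonomic risk parameters for manual lifting tasks, namely the horizontal (H) and vertical (V) hand distances, in real work environments, which limits practical use of the Revised NIOSH Lifting Equation (RNLE). The key to the solution is to use vision-language models (VLMs) to estimate H and V non-invasively from RGB video, through two multi-stage VLM pipelines: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. The segmentation-based, multi-view pipeline gave the smallest errors (mean absolute error of roughly 6-8 cm for H and 5-8 cm for V), and pixel-level segmentation reduced error relative to detection alone by about 20-30% for H and 35-40% for V, supporting the feasibility of VLM-based estimation of RNLE distance parameters from video.

Link: https://arxiv.org/abs/2602.20658
Authors: Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: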

Abstract:Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
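To show why centimeter-level errors in H and V matter, the sketch below computes the recommended weight limit (RWL) from the horizontal and vertical multipliers in the metric form of the RNLE. The remaining four multipliers (distance, asymmetry, frequency, coupling) are lumped into one placeholder argument, which is a simplification of this sketch; the example distances are made up and do not come from the paper.

```python
LOAD_CONSTANT_KG = 23.0  # RNLE load constant, metric form


def horizontal_multiplier(h_cm: float) -> float:
    """HM = 25 / H, with H floored at 25 cm and HM = 0 beyond 63 cm."""
    if h_cm > 63.0:
        return 0.0
    return 25.0 / max(h_cm, 25.0)


def vertical_multiplier(v_cm: float) -> float:
    """VM = 1 - 0.003 * |V - 75|, with VM = 0 above 175 cm."""
    if v_cm > 175.0:
        return 0.0
    return 1.0 - 0.003 * abs(v_cm - 75.0)


def recommended_weight_limit(h_cm: float, v_cm: float,
                             other_multipliers: float = 1.0) -> float:
    """RWL in kg; `other_multipliers` stands in for DM * AM * FM * CM."""
    return (LOAD_CONSTANT_KG * horizontal_multiplier(h_cm)
            * vertical_multiplier(v_cm) * other_multipliers)


true_rwl = recommended_weight_limit(h_cm=40.0, v_cm=75.0)
estimated_rwl = recommended_weight_limit(h_cm=47.0, v_cm=75.0)  # H overestimated by 7 cm
print(f"true RWL: {true_rwl:.2f} kg, estimated RWL: {estimated_rwl:.2f} kg")
```

In this example a 7 cm overestimate of H lowers the computed limit, so estimation errors of the size reported in the abstract carry through to the final risk assessment.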

[HC-8] Improving Data Quality via Pre-Task Participant Screening in Crowdsourced GUI Experiments

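[Quick read]: This paper addresses the threat to performance-model validity in crowdsourced user experiments when some participants ignore instructions or act carelessly. The key to the solution is a brief pre-task screening step: a task that mimics the graphical user interface (GUI) operations of the main task (here, image resizing), with the operation error used as a continuous quality signal. Excluding participants whose error exceeds a threshold improves the goodness of fit and predictive accuracy of GUI performance models in the main task.

Link: https://arxiv.org/abs/2602.20594
Authors: Takaya Miyama, Satoshi Nakamura, Shota Yamanaka
Affiliations: Meiji University; LY Corporation
Categories: Human-Computer Interaction (cs.HC)
Comments: To appear at CHI 2026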

Abstract:In crowdsourced user experiments that collect performance data from graphical user interface (GUI) interactions, some participants ignore instructions or act carelessly, threatening the validity of performance models. We investigate a pre-task screening method that requires simple GUI operations analogous to the main task and uses the resulting error as a continuous quality signal. Our pre-task is a brief image-resizing task in which workers match an on-screen card to a physical card; workers whose resizing error exceeds a threshold are excluded from the main experiment. The main task is a standardized pointing experiment with well-established models of movement time and error rate. Across mouse- and smartphone-based crowdsourced experiments, we show that reducing the proportion of workers exhibiting unexpected behavior and tightening the pre-task threshold systematically improve the goodness of fit and predictive accuracy of GUI performance models, demonstrating that brief pre-task screening can enhance data quality.
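The screening rule itself is simple to express. The sketch below keeps only workers whose pre-task error is at or below a threshold and shows how tightening the threshold shrinks the retained pool. The worker IDs, error values, and the choice of a relative-error scale are hypothetical.

```python
from typing import Dict, List


def screen_workers(pretask_error: Dict[str, float], threshold: float) -> List[str]:
    """Keep workers whose pre-task resizing error is at or below the threshold."""
    return sorted(w for w, err in pretask_error.items() if err <= threshold)


# Hypothetical relative resizing errors per worker.
errors = {"w1": 0.02, "w2": 0.31, "w3": 0.07, "w4": 0.12}

for threshold in (0.20, 0.10, 0.05):
    print(threshold, screen_workers(errors, threshold))
```

A stricter threshold trades sample size for data quality, which is the trade-off the abstract evaluates against model fit.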

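[HC-9] What Drives Students' Use of AI Chatbots? Technology Acceptance in Conversational AI

[Quick read]: This paper examines what drives students' adoption of conversational AI tools, extending the Technology Acceptance Model (TAM) to understand how core factors such as perceived usefulness and perceived ease of use relate to intention to use. The key to the approach is adding trust, perceived enjoyment, and subjective norms as additional variables that capture social and affective influences and uncertainty around AI outputs. Perceived usefulness remains the strongest predictor of intention to use; perceived ease of use affects intention mainly indirectly, through perceived usefulness; trust and subjective norms significantly influence perceived usefulness; and perceived enjoyment has both direct and indirect effects. Adoption decisions therefore depend more on confidence in system outputs, affective engagement, and social context than on ease of operation alone.

Link: https://arxiv.org/abs/2602.20547
Authors: Griffin Pitts, Sanaz Motamedi
Affiliations: North Carolina State University; Pennsylvania State University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: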

Abstract:Conversational AI tools have been rapidly adopted by students and are becoming part of their learning routines. To understand what drives this adoption, we draw on the Technology Acceptance Model (TAM) and examine how perceived usefulness and perceived ease of use relate to students’ behavioral intention to use conversational AI that generates responses for learning tasks. We extend TAM by incorporating trust, perceived enjoyment, and subjective norms as additional factors that capture social and affective influences and uncertainty around AI outputs. Using partial least squares structural equation modeling, we find perceived usefulness remains the strongest predictor of students’ intention to use conversational AI. However, perceived ease of use does not exert a significant direct effect on behavioral intention once other factors are considered, operating instead indirectly through perceived usefulness. Trust and subjective norms significantly influence perceptions of usefulness, while perceived enjoyment exerts both a direct and indirect effect on usage intentions. These findings suggest that adoption decisions for conversational AI systems are influenced less by effort-related considerations and more by confidence in system outputs, affective engagement, and social context. Future research is needed to further examine how these acceptance relationships generalize across different conversational systems and usage contexts.

[HC-10] Changing the Optics: Comparing Traditional and Retrieval-Augmented GenAI E-Tutorials in Interdisciplinary Learning

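[Quick read]: This paper asks how generative AI (GenAI) e-tutorials shape learners' information-seeking behavior in e-learning, particularly in interdisciplinary fields where learners face complex, fragmented information. The key to the approach is to compare learner behavior in traditional and GenAI e-tutorials using the three search modes of orienteering, in a between-subject study. Traditional e-tutorial users maintained greater awareness of and focus on the information space, whereas GenAI users showed more proactive and exploratory behavior with lower cognitive load, which the authors attribute to the query-driven interaction. The findings offer empirical guidance for e-tutorial design.

Link: https://arxiv.org/abs/2602.20544
Authors: Hannah Kim, Rahad Arman Nabid, Jeni Sorathiya, Minh Doan, Elijah Jordan, Rayhana Nasimova, Sergei L. Kosakovsky Pond, Stephen MacNeil
Affiliations: Temple University; Rutgers University–New Brunswick
Categories: Human-Computer Interaction (cs.HC)
Comments: 6 pages (16 pages including appendix), 2 figures, 5 supplementary figures, 2 supplementary tables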

Abstract:Understanding information-seeking behaviors in e-learning is critical, as learners must often make sense of complex and fragmented information, a challenge compounded in interdisciplinary fields with diverse prior knowledge. Compared to traditional e-tutorials, GenAI e-tutorials offer new ways to navigate information spaces, yet how they shape learners’ information-seeking behaviors remains unclear. To address this gap, we characterized behavioral differences between traditional and GenAI-mediated e-tutorial learning using the three search modes of orienteering. We conducted a between-subject study in which learners engaged with either a traditional e-tutorial or a GenAI e-tutorial accessing the same underlying information content. We found that the traditional users maintained greater awareness of and focus on the information space, whereas GenAI users exhibited more proactive and exploratory behaviors with lower cognitive load due to the querying-driven interaction. These findings offer guidance for designing tutorials in e-learning.

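[HC-11] Hybrid LLM-Embedded Dialogue Agents for Learner Reflection: Designing Responsive and Theory-Driven Interactions

[Quick read]: This paper addresses the lack of alignment between generative AI and learning theory in educational dialogue systems: large language models (LLMs) can respond in a context-sensitive way, but their design does not incorporate established pedagogical theory such as self-regulated learning (SRL), so interactions may drift away from effective learning support. The key to the solution is a hybrid dialogue system that embeds LLM flexibility within an SRL-grounded, rule-based framework: the LLM decides when and how to prompt deeper reflection based on the evolving conversation, while the rule-based structure keeps the overall dialogue consistent with learning theory, aiming to make responses more adaptive without losing theoretical grounding.

Link: https://arxiv.org/abs/2602.20486
Authors: Paras Sharma, YuePing Sha, Janet Shufor Bih Epse Fofang, Brayden Yan, Jess A. Turner, Nicole Balay, Hubert O. Asare, Angela E.B. Stewart, Erin Walker
Affiliations: University of Pittsburgh; Carnegie Mellon University; Bowie State University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: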

Abstract:Dialogue systems have long supported learner reflections, with theoretically grounded, rule-based designs offering structured scaffolding but often struggling to respond to shifts in engagement. Large Language Models (LLMs), in contrast, can generate context-sensitive responses but are not informed by decades of research on how learning interactions should be structured, raising questions about their alignment with pedagogical theories. This paper presents a hybrid dialogue system that embeds LLM responsiveness within a theory-aligned, rule-based framework to support learner reflections in a culturally responsive robotics summer camp. The rule-based structure grounds dialogue in self-regulated learning theory, while the LLM decides when and how to prompt deeper reflections, responding to evolving conversation context. We analyze themes across dialogues to explore how our hybrid system shaped learner reflections. Our findings indicate that LLM-embedded dialogues supported richer learner reflections on goals and activities, but also introduced challenges due to repetitiveness and misalignment in prompts, reducing engagement.

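[HC-12] Examining and Addressing Barriers to Diversity in LLM-Generated Ideas

[Quick read]: This paper addresses the homogenization of ideas that large language models (LLMs) may cause: ideas from independent LLM samples are less diverse than ideas from independent human samples, which could weaken innovation at the societal level. The core of the solution is to identify and intervene on two mechanisms. At the individual level, fixation, where early outputs constrain later generation, can be reduced with Chain-of-Thought (CoT) prompting. At the collective level, LLMs aggregate knowledge into a single unified distribution and lack the knowledge partitioning that individual differences create in human populations; this can be improved by using ordinary personas as diverse sampling cues in the semantic space. Combining the two interventions yields the highest idea diversity, exceeding that of humans, and offers both a theoretical basis and practical strategies for efficient, creative human-AI collaboration.

Link: https://arxiv.org/abs/2602.20408
Authors: Yuting Deng, Melanie Brucks, Olivier Toubia
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: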

Abstract:Ideas generated by independent samples of humans tend to be more diverse than ideas generated from independent LLM samples, raising concerns that widespread reliance on LLMs could homogenize ideation and undermine innovation at a societal level. Drawing on cognitive psychology, we identify (both theoretically and empirically) two mechanisms undermining LLM idea diversity. First, at the individual level, LLMs exhibit fixation just as humans do, where early outputs constrain subsequent ideation. Second, at the collective level, LLMs aggregate knowledge into a unified distribution rather than exhibiting the knowledge partitioning inherent to human populations, where each person occupies a distinct region of the knowledge space. Through four studies, we demonstrate that targeted prompting interventions can address each mechanism independently: Chain-of-Thought (CoT) prompting reduces fixation by encouraging structured reasoning (only in LLMs, not humans), while ordinary personas (versus “creative entrepreneurs” such as Steve Jobs) improve knowledge partitioning by serving as diverse sampling cues, anchoring generation in distinct regions of the semantic space. Combining both approaches produces the highest idea diversity, outperforming humans. These findings offer a theoretically grounded framework for understanding LLM idea diversity and practical strategies for human-AI collaborations that leverage AI’s efficiency without compromising the diversity essential to a healthy innovation ecosystem.
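One common way to quantify the diversity of a set of ideas is the mean pairwise cosine distance between their embeddings. The sketch below uses that measure purely as an illustration; the abstract does not state which diversity metric the paper uses, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all pairs; needs at least two embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: a fixated set clusters in one direction, a diverse set spreads out.
fixated = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pairwise_distance(fixated), mean_pairwise_distance(diverse))
```

Under a measure like this, interventions such as persona prompts would count as helpful if they move a model's ideas from the clustered pattern toward the spread-out one.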

[HC-13] Misty Forest VR: Turning Real ADHD Attention Patterns into Shared Momentum for Youth Collaboration

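[Quick read]: This paper addresses the persistent stigmatization of Attention Deficit Hyperactivity Disorder (ADHD) in cultural contexts such as China, where ADHD-related behaviors are often moralized rather than understood as neurodevelopmental differences, leaving challenges of self-perception, social misunderstanding, and collaboration between ADHD and non-ADHD individuals unaddressed. The key to the solution is Misty Forest, an asymmetric cooperative virtual reality (VR) game that translates empirically grounded ADHD behavioral patterns (such as fluctuating attention and time blindness) into complementary roles requiring mutual dependence and coordination, rather than simply compensating for deficits. In a controlled study, this neurodiversity-centered design led to increased self-acceptance among ADHD participants, greater empathy among non-ADHD participants, and higher task completion, supporting inclusive collaboration and understanding.

Link: https://arxiv.org/abs/2602.20350
Authors: Yibo Meng, Bingyi Liu, Ruiqi Chen, Yan Guan
Affiliations: Tsinghua University; University of Michigan, Ann Arbor; University of Washington
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted by CHI Poster 2026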

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) remains highly stigmatized in many cultural contexts, particularly in China, where ADHD-related behaviors are often moralized rather than understood as neurodevelopmental differences. As a result, challenges of self-perception, social misunderstanding, and collaboration between ADHD and non-ADHD individuals remain largely unaddressed. We present Misty Forest, a VR-based collaborative game that explores ADHD through asymmetric co-play. The system translates empirically grounded ADHD behavioral patterns – such as fluctuating attention and time blindness – into complementary roles that require mutual coordination between players. Rather than compensating for deficits, the design treats cognitive differences as a source of interdependence. In a controlled study with mixed ADHD–non-ADHD dyads, Misty Forest led to higher task completion, increased self-acceptance among ADHD participants, improved ADHD knowledge, and greater empathy among non-ADHD players. These findings suggest that neurodiversity-centered interactive design can foster understanding, reciprocity, and inclusive collaboration.

[HC-14] 52-Hz Whale Song: An Embodied VR Experience for Exploring Misunderstanding and Empathy

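[Quick Read]: This paper addresses communicative mismatch: across linguistic and cultural settings, the way individuals express themselves often fails to match how others listen. Prior empathy work in human-computer interaction (HCI) relies largely on narrative explanation, which positions users as observers rather than embodied participants and makes deep emotional resonance hard to achieve. The key to the solution is "52-Hz Whale Song", a virtual reality (VR) experience built on embodied metaphor. Through perspective-shifting, users experience the loneliness of being misunderstood firsthand and move through an arc from failed communication to regained agency and finally to mediation. A preliminary study suggests that the design increases perspective-taking and reduces self-reported social distance in immigrant-related situations, offering transferable design insights for empathy-oriented interactive systems.

Link: https://arxiv.org/abs/2602.20348
Authors: Yibo Meng, Bingyi Liu, Ruiqi Chen, Xin Chen, Yan Guan
Affiliations: Tsinghua University; University of Michigan, Ann Arbor; University of Washington; Universidad Politécnica de Madrid
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted as CHI 26 Poster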

Abstract:Experiences of being misunderstood often stem not from a lack of voice, but from mismatches between how individuals express themselves and how others listen. Such communicative mismatches arise across many social settings, including situations involving linguistic and cultural displacement. While prior HCI research has explored empathy through virtual reality, many approaches rely on narrative explanation, positioning users as observers rather than embodied participants. We present 52-Hz Whale Song, an embodied VR experience that explores miscommunication through metaphor and perspective-shifting. Inspired by the real-world “52-Hz whale,” whose calls are not responded to by others, the experience uses this phenomenon as an experiential lens on communicative mismatch rather than representing any specific social group. Players progress through a three-act arc that moves from failed communication to agency and ultimately to mediation. A preliminary mixed-methods study (N = 30) suggests increased perspective-taking and reduced self-reported social distance in immigrant-related situations. This work highlights how embodied metaphor and role-shifting can support empathic engagement and offers transferable design insights for empathy-oriented interactive systems.

[HC-15] Rapid Testing Duck Lips and Tilted Cameras: Youth Everyday Algorithm Auditing Practices with Generative AI Filters

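[Quick Read]: This paper examines how youth's everyday practices with generative AI applications can be connected to the more formal scientific literacies of algorithm auditing, in support of more participatory and critical digital literacy education. The key lies in identifying and building on the spontaneous yet thorough testing that high school youth perform on generative AI filters on TikTok: they rapidly probe the limits of a filter using sophisticated camera variations and facial manipulations, a practice the authors describe as everyday algorithm auditing. The study argues that such practices can serve as a foundation for syncretic designs that bring everyday experience together with formal algorithm auditing.

Link: https://arxiv.org/abs/2602.20314
Authors: Lauren Vogelstein, Vedya Konda, Deborah Fields, Yasmin Kafai, Luis Morales-Navarro, Danaé Metaxa
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: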

Abstract:Today’s youth have extensive experience interacting with artificial intelligence and machine learning applications on popular social media platforms, putting youth in a unique position to examine, evaluate, and even challenge these applications. Algorithm auditing is a promising candidate for connecting youth’s everyday practices in using AI applications with more formal scientific literacies (syncretic designs). In this paper, we analyze high school youth participants’ everyday algorithm auditing practices when interacting with generative AI filters on TikTok, revealing thorough and extensive examinations, with youth rapidly testing filters with sophisticated camera variations and facial manipulations to identify filter limitations. In the discussion, we address how these findings can provide a foundation for developing designs that bring together everyday and more formal algorithm auditing.

[HC-16] Is Robot Labor Labor? Delivery Robots and the Politics of Work in Public Space

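[Quick Read]: This paper addresses a common misperception: as sidewalk delivery robots become part of urban life, robotic actions are widely framed as autonomous and efficient, which overlooks the sociotechnical collaboration they depend on. Drawing on ethnographic fieldwork in two smart-city districts in Seoul, the authors show that each successful delivery is a distributed sociotechnical achievement that relies on human labor, regulatory coordination, and social accommodation. The key contribution is a reframing of robot labor as a collective assemblage: delivery robots do not replace human labor but reconfigure what is visible and what is hidden, foregrounding robotic performance while obscuring the human and institutional work that supports it. This perspective calls for human-robot interaction (HRI) research to engage more deeply with labor and spatial politics in order to better theorize robots deployed in public space.

Link: https://arxiv.org/abs/2602.20180
Authors: EunJeong Cheon, Do Yeon Shin
Affiliations: Syracuse University; University of Illinois Chicago
Categories: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: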

Abstract:As sidewalk delivery robots become increasingly integrated into urban life, this paper begins with a critical provocation: Is robot labor labor? More than a rhetorical question, this inquiry invites closer attention to the social and political arrangements that robot labor entails. Drawing on ethnographic fieldwork across two smart-city districts in Seoul, we examine how delivery robot labor is collectively sustained. While robotic actions are often framed as autonomous and efficient, we show that each successful delivery is in fact a distributed sociotechnical achievement–reliant on human labor, regulatory coordination, and social accommodations. We argue that delivery robots do not replace labor but reconfigure it–rendering some forms more visible (robotic performance) while obscuring others (human and institutional support). Unlike industrial robots, delivery robots operate in shared public space, engage everyday passersby, and are embedded in policy and progress narratives. In these spaces, we identify “robot privilege”–humans routinely yielding to robots–and distinct perceptions between casual observers (“cute”) and everyday coexisters (“admirable”). We contribute a conceptual reframing of robot labor as a collective assemblage, empirical insights into South Korea’s smart-city automation, and a call for HRI to engage more deeply with labor and spatial politics to better theorize public-facing robots.

[HC-17] ShaRP: Shape-Regularized Multidimensional Projections

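[Quick Read]: This paper addresses the difficulty of controlling the visual signature of projections used to visualize high-dimensional data. Existing dimensionality reduction techniques determine the layout of the resulting scatterplot only implicitly, through algorithm design choices such as local versus global structure preservation, so they cannot accommodate users who need particular visual shapes in interactive visualization scenarios. The key to the solution is ShaRP, a new projection technique that gives users explicit control over the visual signature of the scatterplot while scaling well with dimensionality and dataset size, handling any quantitative dataset, and incurring only a small, user-controllable cost in quality metrics.

Link: https://arxiv.org/abs/2306.00554
Authors: Alister Machado, Alexandru Telea, Michael Behrisch
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in EuroVA Workshop 2023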

Abstract:Projections, or dimensionality reduction methods, are techniques of choice for the visual exploration of high-dimensional data. Many such techniques exist, each one of them having a distinct visual signature - i.e., a recognizable way to arrange points in the resulting scatterplot. Such signatures are implicit consequences of algorithm design, such as whether the method focuses on local vs global data pattern preservation; optimization techniques; and hyperparameter settings. We present a novel projection technique - ShaRP - that provides users explicit control over the visual signature of the created scatterplot, which can cater better to interactive visualization scenarios. ShaRP scales well with dimensionality and dataset size, generically handles any quantitative dataset, and provides this extended functionality of controlling projection shapes at a small, user-controllable cost in terms of quality metrics.

Computer Vision

[CV-0] Test-Time Training with KV Binding Is Secretly Linear Attention

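[Quick Read]: This paper addresses the unclear behavior of test-time training (TTT) in practice. The prevailing view interprets TTT as a form of online meta-learning that memorizes a key-value (KV) mapping at test time, but this interpretation is contradicted by several empirical phenomena. The key step is to revisit the mathematical formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. This view explains previously puzzling model behaviors and brings practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that improve efficiency while preserving performance, and systematically reduces diverse TTT variants to a standard linear attention form. The paper thus reframes TTT from test-time memorization to learned linear attention with enhanced representational capacity.

Link: https://arxiv.org/abs/2602.21204
Authors: Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL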

Abstract:Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
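
One way to build intuition for the linear-attention reading is a textbook special case, sketched below in Python. It assumes a linear fast-weight matrix `W` initialized to zero, one gradient step per token, and a dot-product inner loss `-(v . W k)`. These choices, the function names, and the learning rate `lr` are illustrative assumptions, not the paper's general construction. Under them, each update reduces to `W += lr * outer(v, k)`, so reading out `W q` matches unnormalized causal linear attention.

```python
def outer(v, k):
    """Outer product v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]


def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ttt_recurrent(keys, values, queries, lr=1.0):
    """Test-time-training view: update fast weights W once per token.

    Assumed inner loss: L(W) = -(v . W k), whose gradient with respect
    to W is -outer(v, k). One descent step therefore adds lr * outer(v, k).
    """
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        grad_step = outer(v, k)
        W = [
            [w + lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad_step)
        ]
        outputs.append(matvec(W, q))  # read out with the current query
    return outputs


def linear_attention(keys, values, queries, lr=1.0):
    """Attention view: o_t = lr * sum over s <= t of (k_s . q_t) * v_s."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            weight = lr * dot(keys[s], q)
            out = [o + weight * vi for o, vi in zip(out, values[s])]
        outputs.append(out)
    return outputs
```

Both functions return the same outputs up to floating-point rounding, which is the sense in which this toy update rule is linear attention in disguise. Richer inner losses and learned update rules, as studied in the paper, are not covered by this sketch.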

[CV-1] Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics

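[Quick Read]: This paper addresses the cost of training visual reinforcement learning policies for robotics: with high-dimensional image inputs, off-policy methods are sample-efficient but slow to train, whereas on-policy methods parallelize well but use samples inefficiently. The key to the solution is Squint, a visual Soft Actor-Critic method that combines parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation to achieve faster wall-clock training than prior visual off-policy and on-policy methods. Policies are trained on a single RTX 3090 GPU, and the method is evaluated on the SO-101 Task Set with sim-to-real transfer to a real SO-101 robot.

Link: https://arxiv.org/abs/2602.21203
Authors: Abdulaziz Almuzairee, Henrik I. Christensen
Affiliations: University of California, San Diego
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: For website and code, see this https URL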

Abstract:Visual reinforcement learning is appealing for robotics but expensive – off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

[CV-2] Region of Interest Segmentation and Morphological Analysis for Membranes in Cryo-Electron Tomography

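[Quick Read]: This paper addresses the bottleneck of region-of-interest (ROI) identification and morphological surface analysis in cryo-electron tomography (cryo-ET). ROIs are usually obtained indirectly, by segmenting full structures and analyzing them post hoc, which makes it hard to extract local features of geometrically complex structures such as membranes directly. The key to the solution is a two-step framework, TomoROIS-SurfORA. First, TomoROIS performs deep learning-based, shape-agnostic ROI segmentation and can be trained from scratch on small annotated datasets, making it practical across diverse imaging data. Second, SurfORA processes the segmented structures as point clouds and surface meshes to quantify morphological features such as inter-membrane distance, curvature, and surface roughness, with specific handling for the open surfaces that are common in cryo-ET because of the missing wedge effect. Together, the tools enable automatic quantitative analysis of membrane contact sites and remodeling events such as invagination.

Link: https://arxiv.org/abs/2602.21195
Authors: Xingyi Cheng, Julien Maufront, Aurélie Di Cicco, Daniël M. Pelt, Manuela Dezi, Daniel Lévy
Affiliations: Institut Curie, CNRS UMR168, Université PSL, Sorbonne Université, Paris, France; Leiden Institute of Advanced Computer Science (LIACS), Universiteit Leiden, the Netherlands; Inria Center at University of Rennes, SAIRPICO team, Rennes, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: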

Abstract:Cryo-electron tomography (cryo-ET) enables high resolution, three-dimensional reconstruction of biological structures, including membranes and membrane proteins. Identification of regions of interest (ROIs) is central to scientific imaging, as it enables isolation and quantitative analysis of specific structural features within complex datasets. In practice, however, ROIs are typically derived indirectly through full structure segmentation followed by post hoc analysis. This limitation is especially apparent for continuous and geometrically complex structures such as membranes, which are segmented as single entities. Here, we developed TomoROIS-SurfORA, a two step framework for direct, shape-agnostic ROI segmentation and morphological surface analysis. TomoROIS performs deep learning-based ROI segmentation and can be trained from scratch using small annotated datasets, enabling practical application across diverse imaging data. SurfORA processes segmented structures as point clouds and surface meshes to extract quantitative morphological features, including inter-membrane distances, curvature, and surface roughness. It supports both closed and open surfaces, with specific considerations for open surfaces, which are common in cryo-ET due to the missing wedge effect. We demonstrate both tools using in vitro reconstituted membrane systems containing deformable vesicles with complex geometries, enabling automatic quantitative analysis of membrane contact sites and remodeling events such as invagination. While demonstrated here on cryo-ET membrane data, the combined approach is applicable to ROI detection and surface analysis in broader scientific imaging contexts.

[CV-3] Human Video Generation from a Single Image with 3D Pose and View Control

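[Quick Read]: This paper addresses the challenge of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image. In particular, inferring view-consistent, motion-dependent clothing wrinkles from one image remains difficult. The key to the solution is HVG (Human Video Generation in 4D), a latent video diffusion model built on three designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints through a novel dual-dimensional bone map and introduces 3D information to resolve self-occlusions across views; (ii) View and Temporal Alignment, which ensures multi-view consistency and aligns the reference image with pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling, which uses temporal alignment to maintain smooth transitions in long multi-view animations.

Link: https://arxiv.org/abs/2602.21188
Authors: Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: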

Abstract:Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

[CV-4] Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

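[Quick Read]: This paper addresses the limited ability of current Vision-Language Models (VLMs) to understand and reason about 3D space: although VLMs show strong 2D visual perception, their grasp of 3D space remains superficial. To overcome this, the authors propose Spa3R, a self-supervised framework built on the Predictive Spatial Field Modeling (PSFM) paradigm. Spa3R learns a unified, view-invariant spatial representation from unposed multi-view images and is trained to synthesize feature fields for arbitrary unseen views from a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. The key point is that it builds this 3D understanding directly from 2D images, without relying on explicit 3D modalities or partial geometric priors. A lightweight adapter integrates the pre-trained Spa3R encoder into existing VLMs to form Spa3-VLM, which reaches 58.6% accuracy on 3D visual question answering (3D VQA) on VSI-Bench, significantly outperforming prior methods and supporting PSFM as a scalable path toward spatial intelligence.

Link: https://arxiv.org/abs/2602.21186
Authors: Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang
Affiliations: Huazhong University of Science & Technology; Horizon Robotics; D-Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: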

Abstract:While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space–a cornerstone of spatial intelligence–remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at this https URL.

[CV-5] Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision

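[Quick Read]: This paper addresses a barrier to clinical adoption of graph-based medical image segmentation: conventional methods require training data with manually annotated landmarks that maintain point-to-point correspondence across patients, and such data rarely exist in practice. The key to the solution is Mask-HybridGNet, a framework that trains directly on standard pixel-wise masks without manual landmark annotations. It aligns variable-length ground-truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision with edge-based regularization, which promotes local smoothness and a regular landmark distribution, and refines the result with differentiable rasterization. As an emergent property, predicted landmarks become consistently associated with specific anatomical locations across patients. This implicit atlas learning supports downstream tasks such as temporal tracking, cross-slice reconstruction, and population-level morphological analysis.

Link: https://arxiv.org/abs/2602.21179
Authors: Nicolás Gaggion, Maria J. Ledesma-Carbayo, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante
Affiliations: APOLO Biotech; Universidad de Buenos Aires; CONICET-Universidad de Buenos Aires; Universidad Politécnica de Madrid; Instituto de Salud Carlos III; CentraleSupélec; Université Paris-Saclay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: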

Abstract:Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
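
The Chamfer distance mentioned above is what lets a fixed number of predicted landmarks be compared against a ground-truth boundary with a different number of points. The Python sketch below shows a common symmetric, squared-distance variant. The exact form used in the paper is not stated in the abstract, so the averaging and squaring here, as well as the function name, are assumptions for illustration.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two 2D point sets.

    For every point in one set, take the squared distance to its nearest
    neighbour in the other set, average, and sum both directions.
    The two sets may have different lengths.
    """

    def one_way(a, b):
        return sum(min(math.dist(p, q) ** 2 for q in b) for p in a) / len(a)

    return one_way(pred, target) + one_way(target, pred)


# Hypothetical toy data: 4 predicted landmarks vs. a 6-point boundary.
landmarks = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
boundary = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 1.0), (0.5, 1.0), (0.0, 1.0)]
loss = chamfer_distance(landmarks, boundary)
```

Because each term only needs a nearest neighbour, no point-to-point correspondence between the two sets is required, which is why this kind of loss pairs naturally with mask-derived boundaries. This brute-force version is quadratic in the number of points and is not differentiable code. A training setup would use a tensor library instead.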

[CV-6] XMorph: Explainable Brain Tumor Analysis Via LLM -Assisted Hybrid Deep Intelligence

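[Quick Read]: This paper addresses the limited clinical adoption of deep learning for automated brain tumor diagnosis caused by poor interpretability and computational constraints. In particular, conventional models act as black boxes and fail to quantify the irregular boundaries characteristic of malignant tumors. The key to the solution is the XMorph framework, with two core innovations. First, an Information-Weighted Boundary Normalization (IWBN) mechanism emphasizes diagnostically relevant boundary regions and combines them with nonlinear chaotic and clinically validated features for a richer morphological representation of tumors. Second, a dual-channel explainable AI module combines GradCAM++ visual heatmaps with textual rationales generated by a large language model (LLM), translating model decisions into clinically interpretable explanations. The framework reaches a classification accuracy of 96.0% while improving transparency and trustworthiness.

Link: https://arxiv.org/abs/2602.21178
Authors: Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser
Affiliations: Georgia State University; The University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences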

Abstract:Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque “black boxes” and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: this https URL.

[CV-7] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

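[Quick Read]: This paper addresses problems caused by short, underspecified user queries in text-to-image retrieval: semantic ambiguity, collisions across different visual interpretations, and a lack of control over the quality of retrieved images. The key to the solution is a new paradigm of quality-controllable retrieval. A generative language model serves as a query completion function that expands a short query into a descriptive form covering fine-grained visual attributes such as pose, scene, and aesthetics. The completion is conditioned on discretized quality levels derived from relevance and aesthetic scoring models, so that query enrichment is both semantically meaningful and quality-aware. The approach is compatible with any pretrained vision-language model (VLM) and offers flexibility, transparency, and controllability.

Link: https://arxiv.org/abs/2602.21175
Authors: Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu
Affiliations: Adobe Research; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: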

Abstract:Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLMs) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at this https URL.

[CV-8] NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning CVPR2026

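[Quick Read]: This paper addresses two costly requirements of current Vision-Language-Action (VLA) models for autonomous driving: massive dataset collection and dense reasoning annotations. The authors propose NoRD (No Reasoning for Driving), whose core idea is to adopt the Dr. GRPO algorithm to mitigate the difficulty bias that arises when standard Group Relative Policy Optimization (GRPO) is applied to policies trained on small, reasoning-free datasets. With this approach, the model achieves competitive performance while being fine-tuned on 60% of the data used by existing VLAs and without any reasoning annotations, which substantially reduces training cost and points toward more efficient autonomous driving systems.

Link: https://arxiv.org/abs/2602.21172
Authors: Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan
Affiliations: Applied Intuition; Texas A&M University; UC Berkeley
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026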

Abstract:Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on 60% of the data and no reasoning annotations, resulting in 3× fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
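
The difficulty bias described above can be illustrated with a small Python sketch of how group advantages are computed. It assumes the commonly described formulations in which GRPO standardizes each group's rewards by the group standard deviation, while Dr. GRPO keeps only the mean-centering. The function names, the `eps` constant, the use of the population standard deviation, and the reward values are illustrative assumptions, and other parts of either algorithm (such as length normalization or clipping) are omitted.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center by the group mean, divide by group std.

    Dividing by the standard deviation shrinks the signal from groups whose
    rollouts have widely spread rewards (assumed formulation).
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantages: center by the group mean only."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Hypothetical reward groups for two driving scenarios.
low_variance_group = [0.9, 1.0, 1.1]
high_variance_group = [0.0, 1.0, 2.0]

grpo_low = grpo_advantages(low_variance_group)
grpo_high = grpo_advantages(high_variance_group)
dr_low = dr_grpo_advantages(low_variance_group)
dr_high = dr_grpo_advantages(high_variance_group)
```

With standardization, both groups end up with nearly identical advantages, so the tenfold larger reward spread of the high-variance scenario is scaled away. With mean-centering alone, that spread is preserved in the advantages.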

[CV-9] SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement

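[Quick Read]: This paper addresses the inefficiency of manually converting 2D game sprites into triangle meshes for skeletal animation frameworks such as Spine2D. In the traditional workflow, artists place vertices one by one along visual boundaries, which takes 15 to 60 minutes per sprite and limits development throughput. The key to the solution is SPRITETOMESH, a fully automatic hybrid learned-algorithmic pipeline. A trained segmentation network (EfficientNet-B0 encoder with a U-Net decoder) first produces binary masks (IoU = 0.87). Exterior contour vertices are then extracted using Douglas-Peucker simplification with adaptive arc subdivision, and interior boundary vertices are located using bilateral-filtered multi-channel Canny edge detection with contour-following placement. Finally, Delaunay triangulation with mask-based centroid filtering produces the mesh. The authors also show experimentally that predicting vertex positions directly via neural heatmap regression is not viable: the heatmap decoder fails to converge, with the loss plateauing at 0.061. This supports the hybrid design of learning the well-defined task (segmentation) and handling vertex placement with algorithmic heuristics. The pipeline processes a sprite in under 3 seconds, a 300x to 1200x speedup over manual creation.

Link: https://arxiv.org/abs/2602.21153
Authors: Bastien Gimbert
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 17 figures. Code available at this https URL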

Abstract:We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
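
The contour simplification step named above can be sketched with the standard Douglas-Peucker algorithm. The Python version below works on an open polyline and measures distance to the chord segment. The function names and the tolerance `epsilon` are illustrative assumptions, and the paper's adaptive arc subdivision is not reproduced here.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord between the endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        return [points[0], points[-1]]
    # Otherwise keep that point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


# Hypothetical contour fragment: the small bump at x=1 is dropped.
contour = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
simplified = douglas_peucker(contour, epsilon=0.1)
```

A larger `epsilon` yields fewer vertices and a coarser mesh outline, so the tolerance trades mesh size against how closely the outline follows the mask boundary.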

[CV-10] LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis

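[Quick Read]: This paper addresses the fact that manual interpretation of longitudinal chest X-ray (CXR) imaging is time-consuming and makes it hard to extract temporal changes efficiently, with the aim of supporting radiologists in diagnosis and prognosis. The key to the solution is LUMEN, a novel training framework that uses multi-image and multi-task instruction fine-tuning to optimize vision-language models (VLMs) for interpreting longitudinal imaging, with a particular focus on diagnostic and prognostic visual question answering (VQA).

Link: https://arxiv.org/abs/2602.21142
Authors: Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026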

Abstract:Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. The manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce a novel training framework LUMEN, that is optimized for longitudinal CXR interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful radiological interpretation of longitudinal radiological imaging data.

[CV-11] SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

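[Quick Read]: This paper addresses the difficulty of training object perception models for industrial settings when large annotated datasets are unavailable. In semi-uncontrolled environments such as robotic material handling and quality inspection, acquiring and annotating real-world data is too costly. The key to the solution is SynthRender, an open-source framework for synthetic image generation with Guided Domain Randomization that produces transferable synthetic data at low overhead. The paper also introduces IRIS, the Industrial Real-Sim Imagery Set, which contains 32 categories with diverse textures, intra-class variation, and strong inter-class similarities. Across several benchmarks, the method outperforms existing approaches, reaching up to 99.1% mAP@50.

Link: https://arxiv.org/abs/2602.21141
Authors: Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger
Affiliations: Technical University Berlin; Mercedes-Benz AG; ARENA2036 e.V.; Festo SE & Co. KG; GlobalFastener Inc.; McMaster-Carr Supply Co.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: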

Abstract:Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.

[CV-12] UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

【速读】:该论文旨在解决视频语言模型(Video Language Models, VideoLMs)在复杂城市交通场景中对多智能体动态行为理解不足的问题,特别是模型在视觉定位(visual grounding)与因果推理(causal reasoning)之间存在的显著感知-推理差距(perception-reasoning gap)。其解决方案的关键在于构建了一个名为Urban Dynamics VideoQA(UDVideoQA)的基准数据集,该数据集通过事件驱动的动态模糊技术(event-driven dynamic blur)在保障隐私的同时保持场景保真度,并采用分层推理层级标注体系(hierarchical reasoning level taxonomy),涵盖从基础理解到反事实推理的多个认知层次。此外,通过统一的标注流程生成28K问答对,结合VideoQGen任务评估问题生成质量,系统性地推动了对模型多模态推理能力的评测与提升。实验表明,微调小型模型如Qwen2.5-VL 7B可在UDVideoQA上有效弥合感知与推理间的鸿沟,达到与专有系统相当的性能。

链接: https://arxiv.org/abs/2602.21137
作者: Joseph Raj Vishal,Nagasiri Poluri,Katha Naik,Rutuja Patil,Kashyap Hegde Kota,Krishna Vinod,Prithvi Jai Ramesh,Mohammad Farhadi,Yezhou Yang,Bharatesh Chakravarthi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at this https URL.

[CV-13] BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting CVPR2026

【速读】:该论文旨在解决从无结构数据中恢复边界表示(Boundary Representation, B-rep)模型的难题,这在计算机视觉与图形学领域具有重要价值。当前基于深度学习的方法虽能提升三维形状几何重建效果,但仍依赖于密集且干净的点云数据,且难以泛化到新形状。其解决方案的关键在于提出一种名为B-rep Gaussian Splatting (BrepGaussian) 的新框架,该框架通过可学习特征的高斯点渲染器结合特定拟合策略,实现从二维图像中学习三维参数化表示;进一步采用两阶段学习机制,先捕获几何结构与边缘信息,再优化局部贴片特征以获得清晰几何与一致实例表征,从而有效分离几何重建与特征学习任务,显著提升性能与泛化能力。

链接: https://arxiv.org/abs/2602.21105
作者: Jiaxing Yu,Dongyang Ren,Hangyu Xu,Zhouyuxiao Yang,Yuanqi Li,Jie Guo,Zhengkang Zhou,Yanwen Guo
机构: Nanjing University (南京大学); Nanjing Bridge Intelligent Management Co.,Ltd. (南京桥梁智能管理有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

Abstract:The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach to state-of-the-art methods. We will release our code and datasets upon acceptance.

[CV-14] Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones

【速读】:该论文旨在解决高速飞行无人机在有限电池约束下进行密集三维重建时,因图像运动模糊和位姿估计漂移噪声导致神经辐射场(NeRF)重建质量下降的问题。其核心挑战在于传统基于RGB帧的方法在高动态场景中难以维持几何精度与细节保真度。解决方案的关键在于提出一个统一框架,融合异步事件流(event stream)与运动模糊图像(motion-blurred frames),通过将事件-图像融合嵌入NeRF优化过程,并联合优化基于事件和帧模态的视觉惯性里程计(Visual-Inertial Odometry, VIO)先验,从而在无地面真值监督的情况下恢复清晰的辐射场和准确的相机轨迹。该方法在合成数据和真实高速飞行序列上均表现出显著优于现有技术的性能,尤其在RGB帧严重退化、位姿先验不可靠的场景中仍能保持高质量重建。

链接: https://arxiv.org/abs/2602.21101
作者: Rong Zou,Marco Cannici,Davide Scaramuzza
机构: University of Zurich (苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

Abstract:Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.

[CV-15] Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction CVPR

【速读】:该论文旨在解决从图像中重建高保真三维头部几何结构的问题,现有方法存在显著局限:传统摄影测量法虽能实现极高的细节还原度,但依赖大量相机阵列(25–200+视角)、计算资源密集且需人工处理如面部毛发等复杂区域;而基于基础模型的单图重建方法效率高但缺乏精细几何细节,优化驱动的方法虽精度较高却需密集视角和昂贵计算。解决方案的关键在于提出一种混合范式——首先设计一个多视角表面法向预测模型,通过引入跨视角注意力机制扩展单目基础模型,实现前馈式几何一致的法向估计;随后将这些预测结果作为强几何先验嵌入逆渲染优化框架中,以恢复高频表面细节。此方法在保持与密集视图摄影测量相当的重建质量的同时,大幅降低摄像机数量和计算成本。

链接: https://arxiv.org/abs/2602.21100
作者: Noé Artru,Rukhshanda Hussain,Emeline Got,Alexandre Messier,David B. Lindell,Abdallah Dib
机构: Ubisoft La Forge (育碧工作室); University of Toronto (多伦多大学); ÉTS Montréal (蒙特利尔工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 14 pages, 8 figures, to be published in proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract:Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.

[CV-16] Optimizing Occupancy Sensor Placement in Smart Environments

【速读】:该论文旨在解决商业建筑环境中如何实现高精度、隐私保护的区域占用状态实时识别问题,以减少不必要的照明、供暖和制冷能耗。解决方案的关键在于提出了一种自动传感器布局方法,该方法基于办公室环境的几何约束,通过模拟大量人员轨迹构建优化模型,并将传感器布置问题建模为整数线性规划(Integer Linear Programming, ILP)问题,利用分支定界法求解最优传感器配置,从而在给定传感器数量下最大化计数准确性。

链接: https://arxiv.org/abs/2602.21098
作者: Hao Lu,Richard J. Radke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Understanding the locations of occupants in a commercial built environment is critical for realizing energy savings by delivering lighting, heating, and cooling only where it is needed. The key to achieving this goal is being able to recognize zone occupancy in real time, without impeding occupants’ activities or compromising privacy. While low-resolution, privacy-preserving time-of-flight (ToF) sensor networks have demonstrated good performance in zone counting, the performance depends on careful sensor placement. To address this issue, we propose an automatic sensor placement method that determines optimal sensor layouts for a given number of sensors, and can predict the counting accuracy of such a layout. In particular, given the geometric constraints of an office environment, we simulate a large number of occupant trajectories. We then formulate the sensor placement problem as an integer linear programming (ILP) problem and solve it with the branch and bound method. We demonstrate the effectiveness of the proposed method based on simulations of several different office environments.
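
The placement step in this abstract can be illustrated with a small sketch. It assumes a maximum-coverage objective (pick k candidate sites so that as many simulated trajectory samples as possible are seen by at least one sensor), which is only a stand-in for the paper's counting-accuracy objective. It uses a hand-written branch-and-bound search rather than a real ILP solver. The names `place_sensors` and `coverage` are hypothetical and do not come from the paper.

```python
from typing import List, Set, Tuple


def place_sensors(coverage: List[Set[int]], k: int) -> Tuple[int, Tuple[int, ...]]:
    """Choose k candidate sites that cover the most trajectory samples.

    coverage[j] is the set of sample ids visible from candidate site j
    (hypothetical input, e.g. derived from simulated occupant trajectories).
    Returns (number of covered samples, indices of the chosen sites).
    """
    n = len(coverage)
    best_value = -1
    best_choice: Tuple[int, ...] = ()

    def branch(start: int, chosen: List[int], covered: Set[int]) -> None:
        nonlocal best_value, best_choice
        remaining = k - len(chosen)
        if remaining == 0:
            if len(covered) > best_value:
                best_value = len(covered)
                best_choice = tuple(chosen)
            return
        if n - start < remaining:
            return  # not enough candidate sites left to reach k sensors
        # Bound: no selection can gain more than the sum of the largest
        # individual marginal gains among the sites still available.
        gains = sorted(
            (len(coverage[j] - covered) for j in range(start, n)), reverse=True
        )
        if len(covered) + sum(gains[:remaining]) <= best_value:
            return  # prune: this subtree cannot beat the incumbent
        # Branch: first include site `start`, then exclude it.
        branch(start + 1, chosen + [start], covered | coverage[start])
        branch(start + 1, chosen, covered)

    branch(0, [], set())
    return best_value, best_choice
```

For realistic floor plans, an off-the-shelf ILP solver would normally replace the recursive search. The sketch only shows how the binary "use this site or not" decisions and the pruning bound fit together.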

[CV-17] ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning CVPR2026

【速读】:该论文旨在解决联邦半监督学习(Federated Semi-Supervised Learning, FSSL)中同时存在的外部异构性(external heterogeneity)和内部异构性(internal heterogeneity)问题。外部异构性指不同客户端间数据分布的差异,而内部异构性则表现为单个客户端内标注数据与未标注数据之间的分布不匹配。现有方法通常采用固定或动态参数聚合策略处理外部异构性,或通过过滤低置信度样本缓解内部异构性,但前者难以精准逼近理想全局分布,后者又导致大量未标注样本被排除在训练之外。为此,作者提出基于代理(proxy)引导的统一框架ProxyFL:其核心创新在于将可学习的分类器权重视为代理,用于模拟局部和全局类别分布;对外部异构性,显式优化全局代理以对抗异常值而非直接聚合权重;对内部异构性,通过正负代理池重新引入被丢弃样本进行训练,从而缓解伪标签错误带来的负面影响。实验与理论分析验证了该方法在性能提升和收敛稳定性方面的有效性。

链接: https://arxiv.org/abs/2602.21078
作者: Duowen Chen,Yan Wang
机构: Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University (华东师范大学多维信息处理重点实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. code: this https URL

Abstract:Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model across clients by leveraging partially-annotated local data in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients and within clients. External heterogeneity refers to the data distribution discrepancy across different clients, while internal heterogeneity represents the mismatch between labeled and unlabeled data within clients. Most FSSL methods design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (external) and/or filter out low-confidence unlabeled samples to reduce mistakes in the local client (internal). However, the former struggles to precisely fit the ideal global distribution via direct weights, and the latter leaves less data participating in FL training. To this end, we propose a proxy-guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. Specifically, we treat the learnable weights of the classifier as a proxy to simulate the category distribution both locally and globally. For external heterogeneity, we explicitly optimize the global proxy against outliers instead of using direct weights; for internal heterogeneity, we re-include the discarded samples in training via a positive-negative proxy pool to mitigate the impact of potentially incorrect pseudo-labels. Insight experiments and theoretical analysis demonstrate significant performance gains and convergence in FSSL.

[CV-18] Motivation is Something You Need

【速读】:该论文旨在解决传统模型训练中难以在保持较低训练成本的同时提升模型性能的问题。其核心挑战在于如何有效利用更大模型的潜力,同时避免因直接训练大模型而导致的高计算开销和资源浪费。解决方案的关键在于提出一种受情绪神经科学启发的双模型训练范式:一个小型基础模型持续训练,而一个大型动机模型仅在预设的“动机条件”下间歇性激活。该机制模拟了人类大脑中高好奇心与奖励预期状态下的认知增强过程,通过共享权重更新和选择性扩展网络容量,在关键训练步骤中实现更高效的性能提升。实验表明,该方法不仅显著优于传统训练方式,且动机模型即使每轮训练数据较少,也能超越其独立训练的同类模型,从而实现了在不同部署约束下同时训练两个模型,并以更低的总体训练成本获得竞争力或更优性能。

链接: https://arxiv.org/abs/2602.21064
作者: Mehdi Acheli,Walid Gaaloul
机构: SAMOVAR, Telecom SudParis (南巴黎电信学院); Institut Polytechnique de Paris (巴黎综合理工学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined “motivation conditions”. The framework mimics the emotional state of high curiosity and anticipation of reward in which broader brain regions are recruited to enhance cognitive performance. Exploiting scalable architectures where larger models extend smaller ones, our method enables shared weight updates and selective expansion of network capacity during noteworthy training steps. Empirical evaluation on the image classification task demonstrates that, not only does the alternating training scheme efficiently and effectively enhance the base model compared to a traditional scheme, in some cases, the motivational model also surpasses its standalone counterpart despite seeing less data per epoch. This opens the possibility of simultaneously training two models tailored to different deployment constraints with competitive or superior performance while keeping training cost lower than when training the larger model.

[CV-19] OCR-Agent: Agentic OCR with Capability and Memory Reflection

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在多轮迭代修正过程中缺乏有效自我校正机制的问题,导致其难以独立纠正认知偏差,常陷入重复且无效的尝试,无法实现答案的稳定提升。解决方案的关键在于提出一种新颖的迭代式自我校正框架,通过两个核心能力实现:**能力反思(Capability Reflection)**用于诊断错误并生成修正计划,**记忆反思(Memory Reflection)**则回顾历史尝试以避免重复并探索新解法,最终通过严格的再推理优化答案。该方法无需额外训练即可显著增强VLM的推理鲁棒性。

链接: https://arxiv.org/abs/2602.21053
作者: Shimin Wen,Zeyu Zhang,Xingdou Bian,Hongjie Zhu,Lulu He,Layi Shama,Daji Ergu,Ying Cai
机构: Southwest Minzu University (西南民族大学); AI Geeks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in their answers. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs’ reasoning robustness without additional training. Code: this https URL.
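
The diagnose, recall, and re-reason cycle described in this abstract can be sketched as a plain control loop. Everything here is an assumption made for illustration: the callables `propose`, `diagnose`, and `is_acceptable` are hypothetical placeholders for model calls, and the stopping rule is not taken from the paper.

```python
from typing import Callable, List


def reflect_and_revise(
    question: str,
    propose: Callable[[str, str, List[str]], str],
    diagnose: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
    max_rounds: int = 3,
) -> str:
    """Iteratively revise an answer while remembering earlier attempts.

    propose(question, plan, memory) -> new answer   (re-reasoning step)
    diagnose(question, answer) -> correction plan   (capability-reflection step)
    is_acceptable(answer) -> bool                   (hypothetical stop check)
    """
    memory: List[str] = []  # past attempts, used to avoid repeating them
    answer = propose(question, "", memory)
    for _ in range(max_rounds):
        if is_acceptable(answer):
            break
        memory.append(answer)
        plan = diagnose(question, answer)
        candidate = propose(question, plan, memory)
        if candidate in memory:
            break  # the model only repeated itself, so stop revising
        answer = candidate
    return answer
```

Keeping the list of earlier attempts separate from the correction plan is what lets the loop detect a repeated answer and stop, rather than cycling through the same revision.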

[CV-20] OmniOCR: Generalist OCR for Ethnic Minority Languages

【速读】:该论文旨在解决少数民族文字光学字符识别(OCR)在低资源或零样本场景下的性能瓶颈问题,其核心挑战包括书写系统复杂、标注数据稀缺以及历史与现代形式多样性导致的泛化困难。解决方案的关键在于提出OmniOCR框架,通过引入动态低秩适配(Dynamic Low-Rank Adaptation, Dynamic LoRA)机制,在不同层和脚本间灵活分配模型容量,实现高效知识迁移与参数适应;同时结合稀疏正则化策略剪枝冗余更新,保证适配过程紧凑且无额外推理开销,从而在TibetanMNIST、水文、古彝文和东巴文等多类少数民族文字数据集上显著提升识别准确率(较现有最优基线提升39%-66%),并展现出卓越的参数效率。

链接: https://arxiv.org/abs/2602.21042
作者: Bonan Liu,Zeyu Zhang,Bingbing Meng,Han Wang,Hanshuo Zhang,Chengping Wang,Daji Ergu,Ying Cai
机构: Southwest Minzu University (西南民族大学); AI Geeks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post-training, achieving state-of-the-art accuracy with superior parameter efficiency. Compared with state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: this https URL.
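
A minimal sketch of the low-rank update behind LoRA may help here. It shows the standard form W + scale * B A, extended with per-rank gates that an L1 penalty could drive to zero. The gating and pruning scheme is an assumption chosen for illustration, not the paper's Dynamic LoRA, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x r) and b (r x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def l1_penalty(gates: List[float]) -> float:
    """Sparsity regularizer: pushes unneeded rank components toward zero."""
    return sum(abs(g) for g in gates)


def prune(gates: List[float], threshold: float = 1e-2) -> List[float]:
    """Zero out rank components whose gate magnitude is below the threshold."""
    return [g if abs(g) >= threshold else 0.0 for g in gates]


def merged_weight(w: Matrix, b: Matrix, a: Matrix, gates: List[float],
                  scale: float = 1.0) -> Matrix:
    """Return W + scale * B @ diag(gates) @ A, the adapted weight matrix."""
    gated_a = [[gates[k] * v for v in a[k]] for k in range(len(gates))]
    delta = matmul(b, gated_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]
```

Because the update can be folded into W once training ends, the adapted layer has the same shape as the original, which is the usual reason LoRA-style methods add no inference cost.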

[CV-21] Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在理解否定语义时存在的缺陷,即模型常将否定表述(如“no dog”)与肯定表述(如“dog”)嵌入到相似的语义空间中,导致错误匹配图像。为解决此问题,作者提出CLIPGlasses这一即插即用框架,其核心在于双阶段设计:第一阶段通过Lens模块从文本嵌入中解耦出否定语义信息;第二阶段通过Frame模块预测上下文感知的排斥强度,并将其融入改进的相似度计算中,从而对包含否定语义的匹配进行惩罚,降低误报率。该方案无需微调原始CLIP文本编码器,有效避免过拟合,同时在低资源和跨域场景下展现出更强的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2602.21035
作者: Junhao Xiao,Zhiyu Wu,Hao Lin,Yi Chen,Yahui Liu,Xiaoran Zhao,Zixu Wang,Zejiang He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

Abstract:Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching “no dog” with dog images). Existing methods refine negation understanding via fine-tuning CLIP’s text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP’s ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
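
The idea of penalizing alignment with negated semantics can be sketched with a toy scoring function. The formula below (cosine similarity minus a repulsion weight times the similarity to the negated concept) is an assumed form, since the abstract does not give the exact computation. The two-dimensional "embeddings" are invented for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def negation_aware_score(
    image: Sequence[float],
    text: Sequence[float],
    negated: Sequence[float],
    repulsion: float,
) -> float:
    """Similarity to the caption, minus a penalty for matching what it negates.

    `negated` stands in for the negated-semantics embedding and `repulsion`
    for the predicted repulsion strength; both are hypothetical inputs here.
    """
    return cosine(image, text) - repulsion * max(0.0, cosine(image, negated))


# Toy axes: [dog, grass]. The caption "a field with no dog" is embedded
# naively as [1, 1], so plain cosine similarity cannot tell the images apart.
dog_image, grass_image = [1.0, 0.0], [0.0, 1.0]
caption, negated_concept = [1.0, 1.0], [1.0, 0.0]
dog_score = negation_aware_score(dog_image, caption, negated_concept, 0.5)
grass_score = negation_aware_score(grass_image, caption, negated_concept, 0.5)
```

In this toy setup the plain cosine scores of the two images are identical. Only the penalty term ranks the dog-free image above the dog image.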

[CV-22] MIP Candy: A Modular PyTorch Framework for Medical Image Processing

【速读】:该论文旨在解决医学图像处理领域中现有框架普遍存在的两个问题:一是低级组件需大量集成工作,二是刚性、不可修改的单体式流水线难以适应多样化研究需求。针对这些问题,作者提出了一种名为MIP Candy(MIPCandy)的开源、基于PyTorch的模块化框架,其核心创新在于引入了LayerT机制——一种延迟配置(deferred configuration)方法,允许在运行时动态替换卷积(convolution)、归一化(normalization)和激活(activation)模块而无需继承子类,从而实现对模型结构的细粒度控制。此外,MIPCandy提供从数据加载到评估的完整可插拔流水线,并通过扩展包生态支持预构建模型与统一训练-预测模式的无缝集成,显著提升了科研效率与灵活性。

链接: https://arxiv.org/abs/2602.21033
作者: Tianhao Fu,Yucheng Chen
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); Project Neura; UTMIST; Amplimit
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

Abstract:Medical image processing demands specialized software that handles high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either provide low-level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch-based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional process workflow by implementing a single method, build_network, while retaining fine-grained control over every component. Central to the design is LayerT, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built-in k-fold cross-validation, dataset inspection with automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer–predictor pattern and integrate with the core framework without modification. MIPCandy is open-source under the Apache-2.0 license and requires Python 3.12 or later. Source code and documentation are available at this https URL.
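
The general pattern of deferred configuration can be sketched in a few lines. This is not MIPCandy's `LayerT` API, which the abstract does not show. `DeferredLayer`, its methods, and the toy activation classes are hypothetical names used only to illustrate swapping a layer type at runtime without subclassing.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class DeferredLayer:
    """Stores which layer to build and with what arguments, but builds it later."""

    factory: Callable[..., Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def update(self, **overrides: Any) -> "DeferredLayer":
        """Return a new spec with some keyword arguments replaced."""
        return DeferredLayer(self.factory, {**self.kwargs, **overrides})

    def build(self, *args: Any, **extra: Any) -> Any:
        """Instantiate the layer now, merging stored and call-time arguments."""
        return self.factory(*args, **{**self.kwargs, **extra})


# Toy stand-ins for real modules, so the sketch runs without any framework.
class ReLU:
    def __init__(self, inplace: bool = False) -> None:
        self.inplace = inplace


class LeakyReLU:
    def __init__(self, negative_slope: float = 0.01) -> None:
        self.negative_slope = negative_slope


def make_block(activation: DeferredLayer) -> list:
    """A network builder that never names a concrete activation class."""
    return ["conv", activation.build()]
```

A caller could pass `DeferredLayer(ReLU)` or `DeferredLayer(LeakyReLU, {"negative_slope": 0.2})` to `make_block`. The builder code stays unchanged, which is the property the abstract attributes to deferred configuration.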

[CV-23] From Perception to Action: An Interactive Benchmark for Vision Reasoning

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Model, VLM)评估体系缺乏对物理结构理解能力的考察问题,尤其是模型在动态环境中推理几何、接触与支撑关系如何共同约束可行动作的能力。现有评估多基于结构无关的单轮问答(如VQA),无法衡量代理在复杂物理场景中进行结构化行动规划和执行的能力。解决方案的关键在于提出一个名为CHAIN(Causal Hierarchy of Actions and Interactions)的新基准,这是一个交互式3D物理驱动测试平台,通过机械拼图、3D堆叠与包装等任务,将评估范式从被动感知转向主动问题求解,从而系统性检验模型是否能基于物理约束进行结构化行动序列的理解、规划与执行。

链接: https://arxiv.org/abs/2602.21015
作者: Yuhao Wu,Maojia Song,Yihuai Lan,Lei Wang,Zhiqiang Hu,Yao Xiao,Heng Zhou,Weihua Zheng,Dylan Raharja,Soujanya Poria,Roy Ka-Wei Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in processing. Website: this https URL

Abstract:Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents’ ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at this https URL.

[CV-24] Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design CVPR

【速读】:该论文旨在解决实时目标检测(Real-time Object Detection)中现有检测变压器(DETR)模型因过度预训练开销导致难以从头复现的问题,从而限制了新型骨干网络架构的探索。其核心挑战在于如何在不依赖复杂且计算密集型预训练的情况下实现高性能。解决方案的关键在于通过精心设计的骨干网络结构(EfficientNAT)和改进的混合编码器(引入局部注意力机制),显著降低预训练成本并提升推理效率。具体而言,作者提出了一种名为Le-DETR(Low-cost and Efficient DETR)的新模型,在仅使用ImageNet1K和COCO2017数据集训练的前提下,实现了优于YOLOv12及DEIM-D-FINE的精度与速度平衡,验证了良好设计即可在无需昂贵预训练的基础上达成SOTA性能。

链接: https://arxiv.org/abs/2602.21010
作者: Jiannan Huang,Aditya Kane,Fengzhe Zhou,Yunchao Wei,Humphrey Shi
机构: SHI Labs @ Georgia Tech (佐治亚理工学院); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Findings

Abstract:Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, constraining research advancements by hindering the exploration of novel backbone architectures. In this paper, we want to show that by using general good design, it is possible to have high performance with low pre-training cost. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these advancements, we present Le-DETR (Low-cost and Efficient DEtection TRansformer), which achieves a new SOTA in real-time detection using only ImageNet1K and COCO2017 training datasets, using about 80% fewer images in the pre-training stage compared with previous methods. We demonstrate that, when well designed, real-time DETR models can achieve strong performance without the need for complex and computationally expensive pretraining. Extensive experiments show that Le-DETR-M/L/X achieves 52.9/54.3/55.1 mAP on COCO Val2017 with 4.45/5.01/6.68 ms on an RTX4090. It surpasses YOLOv12-L/X by +0.6/-0.1 mAP while achieving similar speed and +20% speedup. Compared with DEIM-D-FINE, Le-DETR-M achieves +0.2 mAP with slightly faster inference, and surpasses DEIM-D-FINE-L by +0.4 mAP with only 0.4 ms additional latency. Code and weights will be open-sourced.

[CV-25] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models WWW

【速读】:该论文旨在解决图像到视频(Image-to-Video, I2V)生成模型中存在的安全风险问题,即攻击者可能通过在参考图像中嵌入隐式视觉指令(visual instructions),利用模型对图像的视觉理解能力诱导生成有害内容,而无需修改文本提示本身。解决方案的关键在于提出一种无需训练且可迁移的越狱框架——视觉指令注入(Visual Instruction Injection, VII),其核心由两个模块组成:恶意意图重编程模块(Malicious Intent Reprogramming)负责从不安全文本提示中提取恶意意图并最小化其静态危害性,视觉指令锚定模块(Visual Instruction Grounding)则将该意图转化为与原始文本语义一致但表面无害的视觉线索,并将其嵌入安全参考图像中,从而在I2V生成过程中触发有害输出。实验表明,该方法在四个主流商业I2V模型上实现了最高达83.5%的攻击成功率,同时将拒绝率降至接近零。

链接: https://arxiv.org/abs/2602.20999
作者: Bowen Zheng,Yongli Xiang,Ziming Hong,Zerong Lin,Chaojian Yu,Tongliang Liu,Xinge You
机构: National Anti-Counterfeit Engineering Research Center, Huazhong University of Science and Technology (华中科技大学国家防伪工程研究中心); School of Electronic Information and Communications, Huazhong University of Science and Technology (华中科技大学电子与信息通信学院); Sydney AI Centre, The University of Sydney (悉尼大学人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

Abstract:Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.

[CV-26] Cycle-Consistent Tuning for Layered Image Decomposition CVPR2026

【速读】:该论文旨在解决真实世界图像中视觉层解耦的难题,特别是针对Logo与物体表面之间的复杂非线性耦合关系(如阴影、反射和透视畸变),实现高保真度的分层分离。其解决方案的关键在于提出一种基于大扩散基础模型(diffusion foundation models)的上下文内图像分解框架,通过轻量级LoRA(Low-Rank Adaptation)微调策略,并引入循环一致性训练机制,联合优化分解与重构模型,从而在双向监督下增强对复杂交互场景的鲁棒性;此外,还设计了一种渐进式自提升流程,利用高质量模型生成样本迭代扩充训练集,持续优化性能,展现出良好的泛化能力与统一性潜力。

链接: https://arxiv.org/abs/2602.20989
作者: Zheng Gu,Min Lu,Zhida Sun,Dani Lischinski,Daniel Cohen-Or,Hui Huang
机构: Shenzhen University (深圳大学); Hebrew University of Jerusalem (耶路撒冷希伯来大学); Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project page: this https URL

Abstract:Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
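
The reconstruction-consistency constraint mentioned in this abstract can be written as a simple loss: decompose an image, recompose the layers, and measure how far the result is from the input. The sketch below assumes flat lists of pixel values and a mean-squared-error distance. Both are illustrative choices, and `decompose` and `compose` are hypothetical stand-ins for the two fine-tuned models.

```python
from typing import Callable, List, Tuple

Image = List[float]


def cycle_consistency_loss(
    image: Image,
    decompose: Callable[[Image], Tuple[Image, Image]],
    compose: Callable[[Image, Image], Image],
) -> float:
    """Mean squared error between an image and its decompose-then-compose result."""
    logo, surface = decompose(image)
    rebuilt = compose(logo, surface)
    return sum((a - b) ** 2 for a, b in zip(image, rebuilt)) / len(image)
```

A loss of zero means the two models are exact inverses on that image. Training both models against this term is one way to give each of them supervision from the other.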

[CV-27] EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer CVPR2026

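Quick read: This paper addresses object detection in an evolving real world: unifying incremental learning, domain adaptation, and unknown-object detection without access to past data. The core question is how to maintain performance and avoid catastrophic forgetting while new classes emerge, domain distributions shift, and unseen objects must be recognized. The proposed EW-DETR framework combines three synergistic modules: (1) Incremental LoRA Adapters for exemplar-free incremental learning; (2) a Query-Norm Objectness Adapter that decouples objectness-aware features from the DETR decoder queries; and (3) Entropy-Aware Unknown Mixing for calibrated unknown detection. The design generalizes across DETR-based detectors and improves the FOGS score by 57.24% on the Pascal Series and Diverse Weather benchmarks.

Link: https://arxiv.org/abs/2602.20985
Authors: Munish Monga, Vishal Chudasama, Pankaj Wasnik, C.V. Jawahar
Affiliations: Sony Research India; IIIT Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026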

Abstract:Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as “unknown”: all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
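
For readers unfamiliar with the LoRA adapters this entry builds on, the generic low-rank update can be sketched as follows. It shows only the standard mechanism (a frozen weight plus a scaled product of two small matrices); how EW-DETR arranges adapters across tasks is not described in the abstract, and all names and numbers here are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
                 x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / rank
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


frozen_w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]              # rank 1, shape (1, 2)
b_zero = [[0.0], [0.0]]       # B is typically zero-initialized
b_trained = [[0.5], [0.0]]    # made-up "trained" values
x = [1.0, 2.0]

before = lora_forward(frozen_w, a, b_zero, x, alpha=2.0)
after = lora_forward(frozen_w, a, b_trained, x, alpha=2.0)
print(before, after)
```

With `B` at zero the adapter leaves the frozen model's output untouched, which is why adding a new adapter does not disturb earlier behavior until it is trained.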

[CV-28] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models CVPR2026

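Quick read: This paper addresses the difficulty of scaling multimodal video-audio alignment to long-form generation, which is limited by scarce data and by the mismatch between text descriptions and frame-level video information. It proposes MMHNet, a hierarchical multimodal network that incorporates a non-causal Mamba to model long audio sequences. The method substantially improves video-to-audio generation for audio longer than 5 minutes and generalizes to long durations without any long-duration samples in training.

Link: https://arxiv.org/abs/2602.20981
Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
Affiliations: Sony Group Corporation; Sony AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026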

Abstract: Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation, and significantly improves long audio generation to durations of more than 5 minutes. We also show that training on short clips and testing on long ones is possible in video-to-audio generation, without training on the longer durations. In our experiments, the proposed method achieves remarkable results on long-video-to-audio benchmarks, beating prior work on video-to-audio tasks. Moreover, we showcase our model's capability to generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.

[CV-29] CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

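Quick read: This paper addresses a weakness of latent Chain-of-Thought (CoT) reasoning in Multimodal Large Language Models (MLLMs): because supervision signals are heuristically predefined, intermediate latent states struggle to preserve critical visual information. The proposed single-stage framework, CrystaL (Crystallized Latent Reasoning), runs two parallel paths, one on the intact image and one on a corrupted image. By explicitly aligning attention patterns and prediction distributions across the two paths, it "crystallizes" latent representations into task-relevant visual semantics, improving fine-grained visual understanding without auxiliary annotations or external modules while keeping reasoning robust.

Link: https://arxiv.org/abs/2602.20980
Authors: Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, Xiang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: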

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.

[CV-30] Are Multimodal Large Language Models Good Annotators for Image Tagging?

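Quick read: This paper addresses the high cost of human annotation for image tagging and examines the potential and limits of Multimodal Large Language Models (MLLMs) as automatic annotators. The analysis finds that MLLM-generated annotations reach about 50% to 80% of human annotation quality, yet models trained on them achieve over 90% of the downstream performance obtained with human labels, which indicates substantial room for cost savings. The authors propose TagLLM, built on two components: candidate generation, which uses structured group-wise prompting to efficiently produce a compact candidate set covering as many true labels as possible, reducing later annotation workload; and label disambiguation, which interactively calibrates the semantic concepts of categories in the prompts and refines the candidate labels. Experiments show that TagLLM substantially narrows the gap between MLLM and human annotations, closing about 60% to 80% of the difference in downstream training performance.

Link: https://arxiv.org/abs/2602.20972
Authors: Ming-Kun Xie, Jia-Hao Xiao, Zhiqiang Kou, Zhongnian Li, Gang Niu, Masashi Sugiyama
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: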

Abstract: Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50% to 80% of human performance, while achieving over 90% performance on downstream training tasks. Motivated by these findings, we propose TagLLM, a novel framework for image tagging, which aims to narrow the gap between MLLM-generated and human annotations. TagLLM comprises two components: candidate generation, which employs structured group-wise prompting to efficiently produce a compact candidate set that covers as many true labels as possible while reducing subsequent annotation workload; and label disambiguation, which interactively calibrates the semantic concept of categories in the prompts and effectively refines the candidate labels. Extensive experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60% to 80% of the difference.

[CV-31] See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

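Quick read: This paper addresses the visual artifacts that are common in AI-generated images and undermine realism. Existing methods depend on human-labeled artifact datasets, which are costly and hard to scale. The proposed ArtiAgent is an automated framework of three agents: a perception agent that recognizes and grounds entities and sub-entities in real images; a synthesis agent that injects artifacts in a controlled way, using artifact injection tools based on patch-wise embedding manipulation inside a diffusion transformer; and a curation agent that filters the synthesized artifacts and provides local and global explanations. With it, the authors build a dataset of 100K images with rich artifact annotations, which makes artifact research more scalable.

Link: https://arxiv.org/abs/2602.20951
Authors: Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee, Dongmin Park
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: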

Abstract:Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.

[CV-32] Estimation of Confidence Bounds in Binary Classification using Wilson Score Kernel Density Estimation

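Quick read: This paper addresses the lack of reliable confidence-bound estimates when deep-learning binary classifiers are used in critical tasks, where system performance must be guaranteed up to a given statistical significance. It proposes Wilson Score Kernel Density Classification, whose core is the Wilson Score Kernel Density Estimator: a function estimator for confidence bounds in Binomial experiments with conditionally varying success probabilities. The method can serve as the classification head of any feature extractor, including vision foundation models, and performs on par with Gaussian Process Classification at lower computational complexity.

Link: https://arxiv.org/abs/2602.20947
Authors: Thorbjørn Mosekjær Iversen, Zebin Duan, Frederik Hagelskjær
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: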

Abstract:The performance and ease of use of deep learning-based binary classifiers have improved significantly in recent years. This has opened up the potential for automating critical inspection tasks, which have traditionally only been trusted to be done manually. However, the application of binary classifiers in critical operations depends on the estimation of reliable confidence bounds such that system performance can be ensured up to a given statistical significance. We present Wilson Score Kernel Density Classification, which is a novel kernel-based method for estimating confidence bounds in binary classification. The core of our method is the Wilson Score Kernel Density Estimator, which is a function estimator for estimating confidence bounds in Binomial experiments with conditionally varying success probabilities. Our method is evaluated in the context of selective classification on four different datasets, illustrating its use as a classification head of any feature extractor, including vision foundation models. Our proposed method shows similar performance to Gaussian Process Classification, but at a lower computational complexity.
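
The Wilson score interval that this method is named after is a standard confidence interval for a Binomial proportion, and a sketch may help make the idea concrete. The first function below is the textbook formula. The kernel-weighted variant is only one plausible reading of "conditionally varying success probabilities": it replaces counts with Gaussian-weighted effective counts around a query point. That combination, and the names `kernel_wilson_bounds` and `bandwidth`, are assumptions and not taken from the paper.

```python
import math
from typing import List, Tuple


def wilson_interval(successes: float, trials: float,
                    z: float = 1.96) -> Tuple[float, float]:
    """Textbook Wilson score interval for a Binomial proportion."""
    if trials <= 0:
        return 0.0, 1.0  # no evidence: the bound is vacuous
    p_hat = min(1.0, max(0.0, successes / trials))
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return max(0.0, center - half), min(1.0, center + half)


def gaussian_kernel(distance: float, bandwidth: float) -> float:
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def kernel_wilson_bounds(query: float, features: List[float],
                         labels: List[int], bandwidth: float = 0.3,
                         z: float = 1.96) -> Tuple[float, float]:
    """Assumed variant: Wilson bounds on kernel-weighted effective counts."""
    weights = [gaussian_kernel(abs(query - x), bandwidth) for x in features]
    n_eff = sum(weights)
    k_eff = sum(w * y for w, y in zip(weights, labels))
    return wilson_interval(k_eff, n_eff, z)


features = [0.0, 0.1, 0.2, 2.0, 2.1, 2.2]  # 1-D toy features
labels = [1, 1, 1, 0, 0, 0]                # 1 = correct prediction
lo_pos, hi_pos = kernel_wilson_bounds(0.1, features, labels)
lo_neg, hi_neg = kernel_wilson_bounds(2.1, features, labels)
print((lo_pos, hi_pos), (lo_neg, hi_neg))
```

In a selective-classification setting, a prediction would be accepted only where the lower bound exceeds a required success rate; with few nearby samples the interval stays wide and the sample is rejected.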

[CV-33] UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

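Quick read: This paper addresses two problems in long-range dynamic driving scene reconstruction: computational cost that grows with sequence length, and the difficulty of modeling dynamic objects over long durations. Existing feed-forward methods perform well on 3D reconstruction, but their time complexity is quadratic in sequence length, and their limited modeling of dynamic-object motion hurts reconstruction accuracy. The proposed UFO is a recurrent paradigm that unifies feed-forward and optimization-based methods. It maintains a 4D scene representation that is iteratively refined as new observations arrive, and uses a visibility-based filtering mechanism to select scene tokens so that long sequences can be processed efficiently. For dynamic objects, an object pose-guided modeling strategy captures long-range motion accurately. On the Waymo Open Dataset, UFO significantly outperforms per-scene optimization and existing feed-forward methods, reconstructing a 16-second driving log within 0.5 seconds.

Link: https://arxiv.org/abs/2602.20943
Authors: Kaiyuan Tan, Yingying Shen, Mingfei Tu, Haohui Zhu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun
Affiliations: Xiaomi EV; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: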

Abstract:Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 second while maintaining superior visual quality and geometric accuracy.

[CV-34] Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting CVPR2026

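Quick read: This paper addresses overfitting of 3D Gaussian Splatting (3DGS) under sparse-view conditions. Existing Dropout methods regularize by randomly zeroing Gaussian opacities, but have two flaws: a neighbor compensation effect, in which dropped Gaussians are compensated by their neighbors, weakening regularization; and neglect of the contribution of high-degree Spherical Harmonics (SH) coefficients to overfitting. The proposed anchor-based Dropout strategy, DropAnSH-GS, randomly selects some Gaussians as anchors and removes their spatial neighbors at the same time, which disrupts local redundancy and pushes the model toward more globally informed representations. It also extends Dropout to color attributes by randomly dropping higher-degree SH coefficients, concentrating appearance information in lower-degree SH; this further reduces overfitting and allows post-training compression via SH truncation.

Link: https://arxiv.org/abs/2602.20933
Authors: Shuangkang Fang, I-Chao Shen, Xuanyang Zhang, Zesheng Wang, Yufeng Wang, Wenrui Ding, Gang Yu, Takeo Igarashi
Affiliations: Beihang University; The University of Tokyo; StepFun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026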

Abstract:Recent 3D Gaussian Splatting (3DGS) Dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based Dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies near anchors and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the Dropout to color attributes by randomly dropping higher-degree SH to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing Dropout methods with negligible computational overhead, and can be readily integrated into various 3DGS variants to enhance their performances. Project Website: this https URL
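
The two dropout ideas in this abstract can be sketched on toy data. This is a rough illustration under stated assumptions, not the DropAnSH-GS code: anchors are sampled independently with probability `anchor_rate`, each anchor is dropped together with its `num_neighbors` nearest Gaussians (brute-force search), and SH dropout is shown as zeroing every coefficient above a kept degree. All function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anchor_dropout_mask(centers: Sequence[Point], anchor_rate: float,
                        num_neighbors: int, rng: random.Random) -> List[bool]:
    """Keep-mask: False for each sampled anchor and its nearest neighbours."""
    n = len(centers)
    dropped: Set[int] = set()
    for i in range(n):
        if rng.random() < anchor_rate:
            dropped.add(i)
            others = sorted(
                (j for j in range(n) if j != i),
                key=lambda j: squared_distance(centers[i], centers[j]),
            )
            dropped.update(others[:num_neighbors])
    return [i not in dropped for i in range(n)]


def drop_high_degree_sh(coeffs: Sequence[float],
                        max_degree_kept: int) -> List[float]:
    """Zero SH coefficients above a degree; degrees 0..d hold (d+1)^2 values."""
    kept = (max_degree_kept + 1) ** 2
    return [c if i < kept else 0.0 for i, c in enumerate(coeffs)]


centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
mask = anchor_dropout_mask(centers, anchor_rate=0.25, num_neighbors=1,
                           rng=random.Random(0))
sh_one_channel = [float(i + 1) for i in range(16)]  # degree 3: 16 coefficients
print(mask, drop_high_degree_sh(sh_one_channel, max_degree_kept=1))
```

Dropping a whole neighborhood at once means no nearby Gaussian is left to compensate for the anchor, which is the effect the abstract identifies in independent dropout.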

[CV-35] Computing a Characteristic Orientation for Rotation-Independent Image Analysis

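Quick read: This paper addresses the robustness of deep learning for computer vision under geometric transformations, especially rotation: standard neural networks have no inherent rotation invariance, so performance drops on rotated images. Data augmentation and architectural modifications help, but they add computational cost, require specialised implementations, or change the network structure, which limits their applicability. The proposed preprocessing method, General Intensity Direction (GID), estimates a global orientation for each image and aligns it to a canonical reference frame, so that standard models process inputs consistently across rotations without architectural changes. Because it transforms the image directly and preserves spatial structure, it remains compatible with convolutional networks. It is validated on rotated MNIST and CIFAR-10.

Link: https://arxiv.org/abs/2602.20930
Authors: Cristian Valero-Abundio, Emilio Sansano-Sansano, Raúl Montoliu, Marina Martínez García
Affiliations: Institute of New Imaging Technologies, Universitat Jaume I; Mathematics Department, Universitat Jaume I
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026). 8 pages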

Abstract: Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset confirm that the method remains effective under more complex conditions.
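
The abstract does not say how GID estimates the global orientation, so the sketch below substitutes a different, well-known estimator purely to illustrate the "estimate an orientation, then align" pattern: the angle from the image center to the intensity centroid. The function names are hypothetical, and the result should not be read as the paper's method.

```python
import math
from typing import List

Grid = List[List[float]]


def intensity_centroid_angle(image: Grid) -> float:
    """Angle in degrees from the image center to the intensity centroid.

    Rows grow downward, so a positive angle is clockwise on screen.
    """
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    moment_x = sum(v * (col - cx) for row in image for col, v in enumerate(row))
    moment_y = sum(v * (r - cy) for r, row in enumerate(image) for v in row)
    return math.degrees(math.atan2(moment_y, moment_x))


def rotate_90_clockwise(image: Grid) -> Grid:
    """Exact 90-degree clockwise rotation (no interpolation needed)."""
    return [list(row) for row in zip(*image[::-1])]


original = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],  # bright pixel to the right of the center
    [0.0, 0.0, 0.0],
]
rotated = rotate_90_clockwise(original)

angle_original = intensity_centroid_angle(original)
angle_rotated = intensity_centroid_angle(rotated)
print(angle_original, angle_rotated)
```

The estimated angle follows the rotation of the input, so rotating each image back by its own angle would map both versions to the same canonical pose before they reach an unmodified network.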

[CV-36] LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments ICRA2026

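Quick read: This paper addresses stable and accurate Simultaneous Localization and Mapping (SLAM) with thermal imaging in complex, dynamic, large-scale outdoor scenes. The main challenges are unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction. The proposed LST-SLAM system combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization, and adds a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. An online incremental bag-of-words model for loop closure detection, coupled with global pose optimization, mitigates accumulated drift.

Link: https://arxiv.org/abs/2602.20925
Authors: Zeyu Jiang, Kuan Xu, Changhao Chen
Affiliations: PEAK-Lab, The Hong Kong University of Science and Technology (Guangzhou); School of Electrical and Electronic Engineering, Nanyang Technological University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2026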

Abstract:Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.

[CV-37] LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding CVPR2026

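Quick read: This paper addresses efficient and accurate long video understanding under a low computational budget. It proposes LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent whose reasoning module uses high-level visual cues to select the most informative video clip for further processing, avoiding the redundancy of exhaustive search. The agent traverses hierarchical summaries top-down, refining its focus during inference and halting as soon as it has enough knowledge to answer the query. Two-stage fine-tuning, supervised fine-tuning followed by reinforcement learning, optimizes the navigation policy and yields a better tradeoff between QA accuracy and computational efficiency.

Link: https://arxiv.org/abs/2602.20913
Authors: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye
Affiliations: University of Chinese Academy of Sciences; Huawei Consumer Business Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures, 8 tables, accepted to CVPR 2026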

Abstract: This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: this https URL
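
The top-down navigation with early stopping described in this abstract can be pictured as a greedy descent through a tree of clip summaries. The sketch is a loose illustration only: in the paper an MLLM reasoning module chooses the next clip, whereas here a keyword-overlap score and a keyword stop rule stand in for it. `ClipNode`, `navigate`, and the captions are all made up.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ClipNode:
    caption: str
    children: List["ClipNode"] = field(default_factory=list)


def navigate(root: ClipNode,
             relevance: Callable[[str], float],
             is_sufficient: Callable[[str], bool],
             max_steps: int = 16) -> List[str]:
    """Descend from the top-level summary, stopping once the answer is found."""
    visited: List[str] = []
    node = root
    for _ in range(max_steps):
        visited.append(node.caption)
        if is_sufficient(node.caption) or not node.children:
            break  # halt early instead of scanning the remaining clips
        node = max(node.children, key=lambda child: relevance(child.caption))
    return visited


video = ClipNode("cooking video", [
    ClipNode("chopping vegetables"),
    ClipNode("serving dish and wine", [
        ClipNode("plating pasta"),
        ClipNode("pouring wine"),
    ]),
])

query_words = {"wine"}
path = navigate(
    video,
    relevance=lambda caption: len(set(caption.split()) & query_words),
    is_sufficient=lambda caption: "pouring" in caption,
)
print(path)
```

Only three of the five summaries are read in this toy case; the saving grows with the depth and branching of the caption hierarchy. A greedy descent cannot recover from a wrong branch, which is one reason a learned navigation policy is attractive.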

[CV-38] From Isolation to Integration: Building an Adaptive Expert Forest for Pre-Trained Model-based Class-Incremental Learning

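Quick read: This paper addresses Class-Incremental Learning (CIL) with a frozen pre-trained model and one lightweight adapter per task. This setup prevents forgetting but ignores the semantic relationships between tasks, leaving knowledge isolated and poorly shared. The proposed Semantic-guided Adaptive Expert Forest (SAEF) first clusters tasks into conceptual clusters by semantic similarity, then builds a balanced expert tree within each cluster by merging the adapters of similar tasks. At inference it activates a set of relevant experts for each input and fuses their outputs with confidence weights. This structured organization improves knowledge transfer and reuse across tasks and achieves state-of-the-art (SOTA) incremental learning performance.

Link: https://arxiv.org/abs/2602.20911
Authors: Ruiqi Liu, Boyu Diao, Hangda Liu, Zhulin An, Fei Wang, Yongjun Xu
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: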

Abstract:Class-Incremental Learning (CIL) requires models to learn new classes without forgetting old ones. A common method is to freeze a pre-trained model and train a new, lightweight adapter for each task. While this prevents forgetting, it treats the learned knowledge as a simple, unstructured collection and fails to use the relationships between tasks. To this end, we propose the Semantic-guided Adaptive Expert Forest (SAEF), a new method that organizes adapters into a structured hierarchy for better knowledge sharing. SAEF first groups tasks into conceptual clusters based on their semantic relationships. Then, within each cluster, it builds a balanced expert tree by creating new adapters from merging the adapters of similar tasks. At inference time, SAEF finds and activates a set of relevant experts from the forest for any given input. The final prediction is made by combining the outputs of these activated experts, weighted by how confident each expert is. Experiments on several benchmark datasets show that SAEF achieves SOTA performance.
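
The final step in this abstract, combining activated experts "weighted by how confident each expert is", might look like the sketch below. The abstract does not define confidence, so the maximum class probability is used as an assumed proxy, and `fuse_expert_outputs` is a hypothetical name.

```python
from typing import List


def fuse_expert_outputs(expert_probs: List[List[float]]) -> List[float]:
    """Confidence-weighted average of per-expert class probabilities."""
    # Assumption: an expert's confidence is its highest class probability.
    confidences = [max(probs) for probs in expert_probs]
    total = sum(confidences)
    num_classes = len(expert_probs[0])
    return [
        sum(c * probs[k] for c, probs in zip(confidences, expert_probs)) / total
        for k in range(num_classes)
    ]


# Two activated experts: one confident in class 0, one mildly favoring class 1.
fused = fuse_expert_outputs([[0.9, 0.1], [0.4, 0.6]])
print(fused)
```

Because the weights are normalized, the fused vector is still a probability distribution, and the more confident expert pulls the prediction toward its own class.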

[CV-39] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

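Quick read: This paper addresses structural anomalies (distortion, blurriness, misalignment) in text produced by Visual Text Rendering (VTR). Leading Multimodal Large Language Models (MLLMs) and specialist Optical Character Recognition (OCR) models largely fail to perceive these anomalies, which undermines both evaluation and Reinforcement Learning (RL) based optimization. The proposed TextPecker is a plug-and-play, structural-anomaly-perceptive RL strategy. It relies on a recognition dataset with character-level structural-anomaly annotations and a stroke-editing synthesis engine that expands structural-error coverage, which mitigates noisy reward signals and improves the structural fidelity and semantic alignment of generated text.

Link: https://arxiv.org/abs/2602.20903
Authors: Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai
Affiliations: Huazhong University of Science and Technology; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL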

Abstract: Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

[CV-40] SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models CVPR2026

Quick read: This paper addresses the lack of spatial logical reasoning in current Vision-Language Models (VLMs) in complex real-world environments: models struggle to understand spatial relationships among objects and the logical dependencies between steps in multi-step tasks. The authors introduce a benchmark, Spatial Logical Question Answering (SpatiaLQA), with 9,605 question-answer pairs from 241 real-world indoor scenes. They also propose recursive scene graph assisted reasoning, which uses visual foundation models to progressively decompose complex scenes into task-relevant scene graphs and outperforms all previous methods on spatial logical reasoning.

Link: https://arxiv.org/abs/2602.20901
Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song
Affiliations: Zhejiang University; The University of Sydney; ManyCore; State Key Laboratory of Blockchain and Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2026

Abstract:Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at this https URL.

[CV-41] When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance CVPR2026

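Quick read: This paper addresses "harmful conflicts" in Text-to-Image (T2I) diffusion models: safety guidance methods ignore the complex interplay among harm categories, so mitigating one type of harmful content can inadvertently amplify another and raise the overall harmful rate. The proposed Conflict-aware Adaptive Safety Guidance (CASG) is a training-free framework with two components: (i) Conflict-aware Category Identification (CaCI), which dynamically identifies the harmful category most aligned with the current generative state; and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety guidance only along the identified category's direction to avoid multi-category interference. CASG applies to both latent-space and text-space safeguards and reduces the harmful rate by up to 15.4% on T2I safety benchmarks compared with existing methods.

Link: https://arxiv.org/abs/2602.20880
Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Affiliations: Sydney AI Centre, The University of Sydney; City University of Hong Kong; TMLR Group, Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026; Code is released at this https URL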

Abstract:Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to “harmful conflicts” where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model’s evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG’s state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
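
The two-step logic described here, first identify the best-aligned harm category and then steer along that category only, can be sketched with plain vectors. This is an illustration under assumptions rather than the CASG implementation: alignment is measured by cosine similarity, steering subtracts a scaled unit direction, and the two-dimensional "state" and the category directions are invented for the example.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def cosine(a: Vector, b: Vector) -> float:
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom > 0 else 0.0


def identify_category(state: Vector, harm_directions: Dict[str, Vector]) -> str:
    """Pick the harm category whose direction best aligns with the state."""
    return max(harm_directions,
               key=lambda name: cosine(state, harm_directions[name]))


def apply_guidance(state: Vector, direction: Vector, scale: float) -> Vector:
    """Steer away from a single category direction (assumed update rule)."""
    length = norm(direction)
    unit = [d / length for d in direction]
    return [s - scale * u for s, u in zip(state, unit)]


def category_aligned_step(state: Vector, harm_directions: Dict[str, Vector],
                          scale: float = 0.5) -> Tuple[str, Vector]:
    category = identify_category(state, harm_directions)
    return category, apply_guidance(state, harm_directions[category], scale)


directions = {"violence": [1.0, 0.0], "nudity": [0.0, 1.0]}  # toy directions
state = [1.0, 0.1]
category, steered = category_aligned_step(state, directions)
print(category, steered)
```

Only the component along the selected direction changes, so the update cannot push the state toward another category the way an averaged multi-category direction might.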

[CV-42] MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification CVPR2026

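Quick read: This paper addresses Whole Slide Image (WSI) classification in computational pathology with very few expert-labeled samples. Existing vision-language methods treat text descriptions generated by a Large Language Model (LLM) as static class-level priors with no per-sample refinement, which limits the diversity and precision of visual-semantic alignment and hurts generalization under limited supervision. The proposed stochastic MUlti-view Semantic Enhancement (MUSE) framework has two modules. Sample-wise Fine-grained Semantic Enhancement (SFSE) uses Mixture-of-Experts (MoE) based adaptive visual-semantic interaction to produce a fine-grained semantic prior for each sample. Stochastic Multi-view Model Optimization (SMMO) builds an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training as dynamic semantic supervision, improving robustness and reducing overfitting.

Link: https://arxiv.org/abs/2602.20873
Authors: Jiahao Xu, Sheng Huang, Xin Zhang, Zhixiong Nan, Jiajun Dong, Nankun Mu
Affiliations: Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026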

Abstract:In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: this https URL.

[CV-43] DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation

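[Quick read]: This paper addresses the mismatch between prediction confidence and actual accuracy on the target domain in unsupervised domain adaptation (UDA), that is, poor network calibration, which is a significant risk in safety-critical applications. Existing methods improve target-domain semantic segmentation performance but often neglect calibration. As a result, performance drops substantially when soft pseudo-labels replace hard pseudo-labels, even though perfectly calibrated soft pseudo-labels are theoretically equivalent to hard ones. The proposed DA-Cal framework recasts target-domain calibration as soft pseudo-label optimization: a Meta Temperature Network generates pixel-level calibration parameters, bi-level optimization links the soft pseudo-labels to the UDA supervision signal, and complementary domain-mixing strategies prevent overfitting and reduce domain discrepancy. The method integrates with existing self-training frameworks and, across multiple UDA segmentation benchmarks, significantly improves calibration with no inference overhead.

Link: https://arxiv.org/abs/2602.20860
Authors: Wangkai Li, Rui Sun, Zhaoyang Li, Yujia Chen, Tianzhu Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: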

Abstract:While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy – a significant risk in safety-critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo-labels to hard pseudo-labels. Based on this finding, we propose DA-Cal, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization. DA-Cal introduces a Meta Temperature Network to generate pixel-level calibration parameters and employs bi-level optimization to establish the relationship between soft pseudo-labels and UDA supervision, while utilizing complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
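The idea of pixel-level temperature calibration can be illustrated with a minimal sketch. It assumes plain temperature scaling of softmax logits; in DA-Cal the temperatures come from a learned Meta Temperature Network, whereas here they are hand-set values, and the function names `softmax` and `soft_pseudo_labels` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over one pixel's class logits, divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_pseudo_labels(pixel_logits, pixel_temperatures):
    """One soft label per pixel, each pixel with its own temperature.

    Assumption: temperatures are given; DA-Cal would predict them.
    """
    return [softmax(z, t) for z, t in zip(pixel_logits, pixel_temperatures)]

# Two pixels, three classes; the second pixel is softened with T = 2.
labels = soft_pseudo_labels([[4.0, 1.0, 0.0], [2.0, 1.5, 0.5]], [1.0, 2.0])
for row in labels:
    print([round(p, 3) for p in row])
```

A temperature above 1 lowers the peak confidence without changing the predicted class, which is why temperature scaling changes calibration but not the hard pseudo-label.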

[CV-44] On the Explainability of Vision-Language Models in Art History

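[Quick read]: This paper asks how Explainable Artificial Intelligence (XAI) methods can make the visual reasoning of a Vision-Language Model (VLM), specifically CLIP, legible in art-historical contexts. It combines zero-shot localization experiments with human interpretability studies to evaluate seven XAI methods. The results indicate that whether these methods capture human interpretation of image content depends on the conceptual stability of the examined categories and on their availability in the model's representation space.

Link: https://arxiv.org/abs/2602.20853
Authors: Stefanie Schneider
Affiliations: Marburg University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: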

Abstract:Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine ‘understanding.’ In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.

[CV-45] Hybrid Fusion: One-Minute Efficient Training for Zero-Shot Cross-Domain Image Fusion

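[Quick read]: This paper addresses two problems in image fusion: traditional methods adapt poorly and underperform, while deep learning methods train inefficiently and suffer from a resolution gap between patch-based training and full-resolution inference. The proposed hybrid framework uses a learnable U-Net to generate a dynamic guidance map that directs a classic, fixed Laplacian pyramid fusion kernel, decoupling policy learning from pixel synthesis. This design allows efficient full-resolution training, which removes the train-inference gap, and lets the model train from scratch in about one minute on an RTX 4090 without any external pretrained model, with zero-shot generalization across tasks (for example, from infrared-visible fusion to medical imaging). Because the fused output is constructed linearly from the source images alone, the result stays faithful to the inputs, which matters in critical applications.

Link: https://arxiv.org/abs/2602.20851
Authors: Ran Zhang, Xuanhua He, Liu Liu
Affiliations: Hefei University of Technology; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: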

Abstract:Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve state-of-the-art (SOTA) results but suffer from critical inefficiencies: their reliance on slow, resource-intensive, patch-based training introduces a significant gap with full-resolution inference. We propose a novel hybrid framework that resolves this trade-off. Our method utilizes a learnable U-Net to generate a dynamic guidance map that directs a classic, fixed Laplacian pyramid fusion kernel. This decoupling of policy learning from pixel synthesis enables remarkably efficient full-resolution training, eliminating the train-inference gap. Consequently, our model achieves SOTA-comparable performance in about one minute on a RTX 4090 or two minutes on a consumer laptop GPU from scratch without any external model and demonstrates powerful zero-shot generalization across diverse tasks, from infrared-visible to medical imaging. By design, the fused output is linearly constructed solely from source information, ensuring high faithfulness for critical applications. The codes are available at this https URL
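Guided Laplacian pyramid fusion can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the authors' implementation: the pyramid uses 2x2 average pooling and nearest-neighbour upsampling, image sides must be divisible by 2 at every level, and the guidance map is a fixed array here, whereas the paper predicts it with a U-Net. All function names are hypothetical.

```python
import numpy as np

def down(img):
    """Halve resolution with 2x2 average pooling (even sides assumed)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(img):
    """Double resolution with nearest-neighbour upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Band-pass layers from fine to coarse, plus a low-pass residual."""
    pyramid, current = [], img
    for _ in range(levels):
        smaller = down(current)
        pyramid.append(current - up(smaller))
        current = smaller
    pyramid.append(current)
    return pyramid

def gaussian_pyramid(img, levels):
    """Progressively downsampled copies, used for the guidance map."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(down(pyramid[-1]))
    return pyramid

def collapse(pyramid):
    """Invert laplacian_pyramid: upsample and add bands, coarse to fine."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = up(current) + band
    return current

def fuse(a, b, guidance, levels=2):
    """Blend two sources per pyramid level; guidance in [0, 1] favours a."""
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    weights = gaussian_pyramid(guidance, levels)
    fused = [w * x + (1.0 - w) * y for w, x, y in zip(weights, la, lb)]
    return collapse(fused)
```

Every operation above is linear in the two sources, so the fused image contains nothing that was not in the inputs, which mirrors the faithfulness argument in the abstract.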

[CV-46] FLIM Networks with Bag of Feature Points

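[Quick read]: This paper addresses the cost and time of the extensive image annotation that convolutional networks require, with parasite detection in optical microscopy images as the application. It proposes FLIM-Bag of Feature Points (FLIM-BoFP), a filter estimation method that needs no backpropagation: encoder filters are estimated directly from user-drawn markers, a single clustering step at the input block produces a bag of feature points, and these points are mapped to every encoder block to define filter locations. Compared with the earlier FLIM-Cluster approach, which clusters patches at each block, this makes filter estimation considerably faster while keeping the model lightweight. The paper evaluates efficiency, effectiveness and generalization against FLIM-Cluster and other baselines.

Link: https://arxiv.org/abs/2602.20845
Authors: João Deltregia Martinelli, Marcelo Luis Rodrigues Filho, Felipe Crispim da Rocha Salvagnini, Gilson Junior Soares, Jefersson A. dos Santos, Alexandre X. Falcão
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer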

Abstract:Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder’s block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.

[CV-47] Training-Free Multi-Concept Image Editing

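[Quick read]: This paper addresses training-free image editing with diffusion models, where it is hard to preserve object identity or capture details that language cannot express (visual concepts such as facial structure, material texture or object geometry). Existing optimisation-based methods achieve zero-shot text-driven edits but struggle with these cases. The paper proposes a training-free, concept-based editing framework that unifies Optimised DDS with LoRA (Low-Rank Adaptation)-driven concept composition, where the training data of each LoRA represents the concept. The method combines and controls multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level visual cues from pretrained concept adapters, and refines DDS for stability and controllability through ordered timesteps, regularisation and negative-prompt guidance.

Link: https://arxiv.org/abs/2602.20839
Authors: Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki
Affiliations: Toshiba Europe; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 13 figures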

Abstract:Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry are impossible to express purely through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.

[CV-48] GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection

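[Quick read]: This paper addresses hateful content detection in multimodal memes, where the harmful message often emerges from the interplay between a seemingly benign image and text. The proposed GatedCLIP model adds three components to CLIP: learned projection heads that map CLIP embeddings into a task-optimized semantic space; a dynamic gated fusion mechanism that adaptively weights visual and textual features; and a contrastive learning objective that maintains cross-modal semantic alignment. On the Hateful Memes dataset the model reaches an AUROC of 0.66, against 0.49 for the CLIP baseline, with only 350K trainable parameters.

Link: https://arxiv.org/abs/2602.20818
Authors: Yingying Guo, Ke Zhang, Zirong Zeng
Affiliations: The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint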

Abstract:Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP’s multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
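A gated fusion of two modality embeddings can be sketched as below. This shows one common form of gating (a sigmoid gate computed from the concatenated features, then a per-dimension convex combination); the abstract does not give GatedCLIP's exact formulation, so the function `gated_fusion` and its parameters are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, textual, gate_weights, gate_bias):
    """Fuse two equal-length embeddings with a learned per-dimension gate.

    gate_weights[i] is a row applied to the concatenation [visual; textual].
    A gate near 1 keeps the visual feature, near 0 keeps the textual one.
    """
    concat = visual + textual  # list concatenation
    fused = []
    for i in range(len(visual)):
        pre = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = sigmoid(pre)
        fused.append(gate * visual[i] + (1.0 - gate) * textual[i])
    return fused

# With zero weights and bias every gate is 0.5, so the result is the mean.
v, t = [1.0, 2.0], [3.0, 6.0]
print(gated_fusion(v, t, [[0.0] * 4, [0.0] * 4], [0.0, 0.0]))
```

In a trained model the gate weights would be learned jointly with the projection heads, so the balance between image and text can differ per example and per feature dimension.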

[CV-49] RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

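[Quick read]: This paper addresses the degraded performance of 3D Gaussian Splatting SLAM systems in dynamic environments, where moving objects harm reconstruction accuracy and tracking stability. It proposes RU4D-SLAM, a robust and efficient 4D-aware SLAM framework that adds a temporal dimension to the spatial 3D representation and incorporates uncertainty-aware perception of scene changes, blurred image synthesis and dynamic scene reconstruction. Specifically, it extends per-pixel uncertainty modeling, originally designed for static scenes, to handle blurred images; introduces a semantic-guided uncertainty reweighting mechanism for dynamic scenes; and adds a learnable opacity weight for adaptive 4D mapping. These changes improve trajectory accuracy and 4D scene reconstruction, particularly with moving objects and low-quality inputs.

Link: https://arxiv.org/abs/2602.20807
Authors: Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu
Affiliations: Capital Normal University; Saarland University; Xi’an Jiaotong-Liverpool University; King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: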

Abstract:Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, particularly moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, which is originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: this https URL

[CV-50] VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving CVPR2026

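[Quick read]: This paper addresses the lack of cross-view 3D geometric modeling in current Vision-Language Models (VLMs), which limits their performance on autonomous driving tasks. Existing approaches try to mitigate this by constructing QA data for auxiliary training, but they do not fundamentally equip VLMs to handle diverse evaluation protocols. The proposed solution injects the cross-view geometric grounding of mature 3D foundation models into the VLM. The new architecture, VGGDrive, centres on a plug-and-play Cross-View 3D Geometric Enabler (CVGE), which uses a hierarchical adaptive injection mechanism to bridge geometric features from a frozen 3D vision model with the VLM's 2D visual features. This improves the base VLM on tasks including cross-view risk perception, motion prediction and trajectory planning.

Link: https://arxiv.org/abs/2602.20794
Authors: Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen
Affiliations: Tianjin University; Xiaomi EV
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026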

Abstract:The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing QA data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM’s 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It’s our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.

[CV-51] SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking CVPR2026

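[Quick read]: This paper addresses spinal motion modeling in computer vision, which remains underexplored because of the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. It proposes a biomechanics-aware keypoint simulation framework that derives anatomically consistent 3D spinal keypoints from musculoskeletal modeling and uses them to augment existing human pose datasets. The resulting SIMSPINE dataset, described as the first open dataset of its kind, provides sparse vertebra-level 3D spine annotations for natural full-body motion captured indoors with multiple cameras and without external restraints, totalling 2.14 million frames. The work bridges musculoskeletal simulation and computer vision, supports data-driven learning of vertebral kinematics, and releases pretrained baselines that establish a unified benchmark for biomechanically valid spine motion estimation.

Link: https://arxiv.org/abs/2602.20792
Authors: Muhammad Saif Ullah Khan, Didier Stricker
Affiliations: German Research Center for Artificial Intelligence (DFKI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026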

Abstract:Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine’s complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.

[CV-52] Real-time Motion Segmentation with Event-based Normal Flow

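[Quick read]: This paper addresses the inefficiency of processing raw event-camera data directly: because individual events carry sparse information, state-of-the-art methods are hard to apply to real-time tasks such as motion segmentation. The solution introduces normal flow as an intermediate representation. Dense normal flow learned directly from event neighborhoods serves as input, and motion segmentation is formulated as an energy minimization problem solved via graph cuts, optimized iteratively with normal flow clustering and motion model fitting. Because motion models are initialized and fitted from normal flow, only a limited number of candidate models are needed to estimate the motion of independently moving objects. This significantly reduces computational complexity and yields a speedup of nearly 800x over the open-source state-of-the-art method, with accuracy evaluated on multiple public datasets.

Link: https://arxiv.org/abs/2602.20790
Authors: Sheng Zhong, Zhongyang Ren, Xiya Zhu, Dehao Yuan, Cornelia Fermuller, Yi Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: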

Abstract:Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly a 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework.

[CV-53] Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization

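[Quick read]: This paper addresses poor generalization of medical image segmentation models across imaging modalities, particularly in federated learning (FL), where each institution holds data from a single modality (CT or MRI) and raw data cannot be shared for privacy reasons. The key finding is that global intensity nonlinear (GIN) augmentation, which simulates cross-modality appearance variation while preserving anatomical structure, consistently outperforms the other augmentation strategies evaluated in both centralized and federated training. For pancreas segmentation, the Dice score rises from 0.073 to 0.437, a relative gain of about 498%, and the federated approach reaches 93-98% of centralized training accuracy without sharing data.

Link: https://arxiv.org/abs/2602.20773
Authors: Sachin Dudda Nagaraju, Ashkan Moradi, Bendik Skarre Abrahamsen, Mattijs Elschot
Affiliations: Norwegian University of Science and Technology; St. Olavs Hospital, Trondheim University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE JBHI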

Abstract:Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.
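The reported pancreas figures can be checked with a short worked example. The sketch below computes a Dice score on binary masks and the relative gain between two scores; the helper names `dice` and `relative_gain_percent` are hypothetical, and the toy masks are made up purely to show the formula.

```python
def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat binary masks (0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # Convention used here: two empty masks count as a perfect match.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator

def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

# Toy masks: one overlapping pixel, |A| = 2, |B| = 1, so Dice = 2/3.
print(round(dice([1, 1, 0, 0], [1, 0, 0, 0]), 3))

# Pancreas Dice scores quoted in the abstract.
print(round(relative_gain_percent(0.073, 0.437), 1))
```

The gain is large in relative terms because the baseline of 0.073 is close to zero; the absolute improvement is 0.364 Dice points.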

[CV-54] OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation

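[Quick read]: This paper addresses the difficulty of interpreting musculoskeletal MRI, where complex anatomy, varied abnormalities and dependence on expert experience make reading slow and prone to variability. It proposes OrthoDiffusion, a unified diffusion-based foundation model. Three orientation-specific 3D diffusion models (sagittal, coronal and axial) are pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans to learn robust anatomical features. The view-specific representations are then integrated to support multiple clinical tasks, including segmentation of 11 knee structures and detection of 8 knee abnormalities. The model is reported to be robust across clinical centers and MRI field strengths, and to maintain high diagnostic precision with only 10% of the training labels. The learned representations also transfer to other joints, giving strong diagnostic performance on ankle and shoulder diseases, which suggests that diffusion-based foundation models could serve as a unified platform for multi-disease, multi-task musculoskeletal image analysis.

Link: https://arxiv.org/abs/2602.20752
Authors: Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu, Jiaxin Liu, Weilai Xiang, Hongyu Yang, Dong Jiang, Jianxin Yin, Dingyu Wang
Affiliations: Peking University; Renmin University of China; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: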

Abstract:Musculoskeletal disorders represent a significant global health burden and are a leading cause of disability worldwide. While MRI is essential for accurate diagnosis, its interpretation remains exceptionally challenging. Radiologists must identify multiple potential abnormalities within complex anatomical structures across different imaging planes, a process that requires significant expertise and is prone to variability. We developed OrthoDiffusion, a unified diffusion-based foundation model designed for multi-task musculoskeletal MRI interpretation. The framework utilizes three orientation-specific 3D diffusion models, pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans, to learn robust anatomical features from sagittal, coronal, and axial views. These view-specific representations are integrated to support diverse clinical tasks, including anatomical segmentation and multi-label diagnosis. Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities. The model exhibited remarkable robustness across different clinical centers and MRI field strengths, consistently outperforming traditional supervised models. Notably, in settings where labeled data was scarce, OrthoDiffusion maintained high diagnostic precision using only 10% of training labels. Furthermore, the anatomical representations learned from knee imaging proved highly transferable to other joints, achieving strong diagnostic performance across 11 diseases of the ankle and shoulder. These findings suggest that diffusion-based foundation models can serve as a unified platform for multi-disease diagnosis and anatomical segmentation, potentially improving the efficiency and accuracy of musculoskeletal MRI interpretation in real-world clinical workflows.

[CV-55] PyVision-RL: Forging Open Agentic Vision Models via RL

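[Quick read]: This paper addresses interaction collapse in reinforcement learning for agentic multimodal models, where models learn to reduce tool usage and multi-turn reasoning and so lose the benefits of agentic behavior. It proposes PyVision-RL, a framework that combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to stabilize training and sustain interaction. For video understanding, it uses on-demand context construction, selectively sampling task-relevant frames during reasoning, which significantly reduces visual token usage and improves efficiency and scalability.

Link: https://arxiv.org/abs/2602.20739
Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint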

Abstract:Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

[CV-56] Communication-Inspired Tokenization for Structured Image Representations

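[Quick read]: This paper addresses a limitation of existing discrete image tokenizers: they are optimized mainly for reconstruction and compression, so their tokens capture local texture rather than object-level semantic structure. It proposes COMmunication inspired Tokenization (COMiT), a framework inspired by human communication for learning structured discrete visual token sequences. COMiT builds a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. Each step integrates new visual information while refining and reorganizing the existing token sequence, and the final message conditions a flow-matching decoder that reconstructs the full image. Encoding and decoding are implemented in a single transformer trained end-to-end with a flow-matching reconstruction loss and a semantic representation alignment loss, yielding interpretable, object-centric token structure and substantially better compositional generalization and relational reasoning.

Link: https://arxiv.org/abs/2602.20731
Authors: Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this https URL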

Abstract:Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.

[CV-57] Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation

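[Quick read]: This paper addresses the limited control that diffusion-based image generators offer over low-level, physically grounded properties such as shading and materials, and the complementary lack of prompt-driven flexibility in physically based rendering (PBR). It proposes a unified stochastic differential equation (SDE) framework that bridges Monte Carlo rendering and diffusion-based generative modeling. It first derives a general SDE formulation for Monte Carlo integration under the Central Limit Theorem and instantiates it through path tracing to obtain a physically grounded SDE representation. It then analyses, from the perspective of noise variance, how the physical characteristics of path tracing can be extended to existing diffusion models, enabling physically grounded control over diffusion-generated results in tasks such as rendering and material editing.

Link: https://arxiv.org/abs/2602.20725
Authors: Junwei Shu, Wenjie Liu, Changgu Chen, Hantang Liu, Yang Li, Changbo Wang
Affiliations: East China Normal University; Kun Byte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint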

Abstract:Diffusion-based image generators excel at producing realistic content from text or image conditions, but they offer only limited explicit control over low-level, physically grounded shading and material properties. In contrast, physically based rendering (PBR) offers fine-grained physical control but lacks prompt-driven flexibility. Although these two paradigms originate from distinct communities, both share a common evolution – from noisy observations to clean images. In this paper, we propose a unified stochastic formulation that bridges Monte Carlo rendering and diffusion-based generative modeling. First, a general stochastic differential equation (SDE) formulation for Monte Carlo integration under the Central Limit Theorem is modeled. Through instantiation via physically based path tracing, we convert it into a physically grounded SDE representation. Moreover, we provide a systematic analysis of how the physical characteristics of path tracing can be extended to existing diffusion models from the perspective of noise variance. Extensive experiments across multiple tasks show that our method can exert physically grounded control over diffusion-generated results, covering tasks such as rendering and material editing.
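The Central Limit Theorem step that the abstract refers to can be written out for a generic Monte Carlo estimator. This is the standard textbook statement, not the paper's own SDE; the symbols $f$, $p$, $I$, $\sigma$ and $\epsilon$ are introduced here for illustration, and the variance is assumed finite.

```latex
\begin{align}
\hat{I}_N &= \frac{1}{N}\sum_{i=1}^{N}\frac{f(x_i)}{p(x_i)},
  \qquad x_i \sim p,
  \qquad \mathbb{E}\big[\hat{I}_N\big] = I = \int f(x)\,\mathrm{d}x, \\
\sqrt{N}\,\big(\hat{I}_N - I\big) &\xrightarrow{\;d\;} \mathcal{N}\big(0, \sigma^2\big),
  \qquad \sigma^2 = \operatorname{Var}_{x \sim p}\!\left[\frac{f(x)}{p(x)}\right], \\
\hat{I}_N &\approx I + \frac{\sigma}{\sqrt{N}}\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, 1).
\end{align}
```

Read this way, a render with $N$ samples per pixel is the converged image plus approximately Gaussian noise whose standard deviation shrinks as $1/\sqrt{N}$, which is the kind of noisy-to-clean progression that the abstract says rendering shares with diffusion models.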

[CV-58] CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization

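[Quick read]: This paper addresses content leakage in encoder-based style transfer for diffusion models, where semantic elements of the style image appear unintentionally in the output and degrade prompt fidelity and stylistic consistency. The proposed CleanStyle framework has two parts. CleanStyleSVD (CS-SVD) uses Singular Value Decomposition (SVD) to isolate the tail components of the style embedding and suppresses them dynamically with a time-aware exponential schedule, giving clean, style-preserving conditional embeddings throughout denoising. Style-Specific Classifier-Free Guidance (SS-CFG) reuses the suppressed tail components to build style-aware unconditional inputs, which act as targeted negative signals for style-specific but prompt-irrelevant visual patterns. The method needs no retraining, is lightweight and interpretable, and integrates into existing encoder-based diffusion models.

Link: https://arxiv.org/abs/2602.20721
Authors: Xiaoman Feng, Mingkun Lei, Yang Wang, Dingwen Fu, Chi Zhang
Affiliations: Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages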

Abstract:Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which are isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage, improves stylization quality and improves prompt alignment across a wide range of style references and prompts.
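The SVD-based tail suppression can be sketched with NumPy. This is an illustration under stated assumptions rather than the CleanStyle implementation: the style embedding is treated as a tokens-by-channels matrix, the number of kept components `keep` and the decay `rate` are made-up hyperparameters, and `progress` is a generic value in [0, 1] because the abstract does not say in which direction the time-aware schedule runs.

```python
import numpy as np

def suppress_tail(style_emb, keep, progress, rate=5.0):
    """Scale down the tail singular components of a style embedding.

    style_emb: 2D array (tokens x channels), an assumed layout.
    keep:      number of leading components left untouched (assumption).
    progress:  schedule position in [0, 1]; 0 means no suppression.
    Returns (cleaned embedding, full-strength tail component).
    """
    u, s, vt = np.linalg.svd(style_emb, full_matrices=False)
    # Hypothetical exponential schedule for the tail weight.
    tail_scale = np.exp(-rate * progress)
    weights = np.ones_like(s)
    weights[keep:] = tail_scale
    cleaned = (u * (s * weights)) @ vt
    # The removed part, which could feed a style-aware negative input.
    tail = (u[:, keep:] * s[keep:]) @ vt[keep:]
    return cleaned, tail

rng = np.random.default_rng(0)
embedding = rng.standard_normal((4, 8))
cleaned, tail = suppress_tail(embedding, keep=2, progress=0.5)
print(cleaned.shape, tail.shape)
```

Returning the tail separately reflects the SS-CFG idea of reusing the suppressed components as a negative signal instead of discarding them; how the paper turns them into an unconditional input is not specified in the abstract.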

[CV-59] Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization IJCNN2025

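Quick read: This paper addresses two difficulties in 3D reconstruction of endoscopic soft tissue for robot-assisted surgery: methods based on 3D Gaussian Splatting struggle to reconstruct consistent surfaces, while existing Neural Radiance Field (NeRF) methods lack real-time rendering. The key to the solution is a surface-aware reconstruction strategy that first builds an initial mesh with a Signed Distance Field (SDF) and then uses that mesh to constrain the Gaussian Splatting reconstruction. Local rigidity and global non-rigidity constraints are added to produce physically plausible deformations, yielding smooth reconstructions with strong geometry and texture quality at real-time rendering speed.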

Link: https://arxiv.org/abs/2602.20718
Authors: Yangsen Chen, Hao Wang
Affiliations: HKUST(GZ)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCNN 2025

Abstract: Reconstructing deformable endoscopic tissues is crucial for achieving robot-assisted surgery. However, 3D Gaussian Splatting-based approaches encounter challenges in achieving consistent tissue surface reconstruction, while existing NeRF-based methods lack real-time rendering capabilities. In pursuit of both smooth deformable surfaces and real-time rendering, we introduce a novel approach based on 3D Gaussian Splatting. Specifically, we introduce surface-aware reconstruction, initially employing a Signed Distance Field (SDF)-based method to construct a mesh, subsequently utilizing this mesh to constrain the Gaussian Splatting reconstruction process. Furthermore, to ensure the generation of physically plausible deformations, we incorporate local rigidity and global non-rigidity restrictions to guide Gaussian deformation, tailored for the highly deformable nature of soft endoscopic tissue. Based on 3D Gaussian Splatting, our proposed method delivers a fast rendering process and smooth surface appearances. Quantitative and qualitative analysis against alternative methodologies shows that our approach achieves solid reconstruction quality in both textures and geometries.

[CV-60] Onboard-Targeted Segmentation of Straylight in Space Camera Sensors

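Quick read: This paper addresses semantic segmentation of straylight effects caused by incident sunlight in space camera images; such artifacts degrade image quality and reduce the accuracy of spacecraft navigation systems. The key to the solution is a pre-training strategy that improves generalization across diverse flare textures: the model is pre-trained on a public dataset (Flare7k++) containing flares in various non-space scenes, which mitigates the scarcity of realistic space data. A DeepLabV3 model with a MobileNetV3 backbone keeps segmentation accurate while fitting the resource limits of onboard hardware, and custom metrics tied to the onboard navigation pipeline allow system-level performance validation.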

Link: https://arxiv.org/abs/2602.20709
Authors: Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill
Affiliations: Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to Aerospace Science and Technology

Abstract:This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults. Specifically, we address the segmentation of straylight effects induced by solar presence around the camera’s Field of View (FoV). Anomalous images are sourced from our published dataset. Our approach emphasizes generalization across diverse flare textures, leveraging pre-training on a public dataset (Flare7k++) including flares in various non-space contexts to mitigate the scarcity of realistic space-specific data. A DeepLabV3 model with MobileNetV3 backbone performs the segmentation task. The model design targets deployment in spacecraft resource-constrained hardware. Finally, based on a proposed interface between our model and the onboard navigation pipeline, we develop custom metrics to assess the model’s performance in the system-level context.

[CV-61] NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image

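Quick read: This paper addresses accurate estimation of garment sewing patterns from natural images, targeting the weaknesses of existing methods: poor generalization to real-world scenes, difficulty capturing the complex relationships between garment parts, and restriction to single-layer outfits. The key to the solution is NGL (Natural Garment Language), an intermediate language that restructures GarmentCode, originally designed for parametric modeling, into a form that language models understand more readily, together with the training-free NGL-Prompter framework: a large vision language model (VLM) parses the image and extracts structured parameters, which a deterministic mapping then converts into valid GarmentCode. The method markedly improves reconstruction of multi-layer outfits and occluded garments, and outperforms existing baselines without costly fine-tuning.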

Link: https://arxiv.org/abs/2602.20700
Authors: Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya, Michael Black
Affiliations: Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract: Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from the parametric garment model GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on Dress4D, CloSe, and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-Prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.

[CV-62] MatchED: Crisp Edge Detection Using End-to-End Matching-based Supervision CVPR2026

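Quick read: This paper addresses the difficulty of producing crisp (one-pixel-wide) edges in edge detection, a problem common to both traditional and learning-based methods. Existing methods typically rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning; these operations are non-differentiable, which hinders end-to-end optimization, and all existing crisp edge detection methods still need such post-processing to reach satisfactory results. The paper proposes MatchED, a lightweight matching-based supervision module that adds only about 21K parameters and can be plugged into any edge detection model for joint end-to-end learning of crisp edges. Its core idea is to perform, at every training iteration, a one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, which keeps the training and inference protocols consistent. This substantially raises Average Crispness (AC) and, under the crispness-emphasized evaluation (CEval), matches or surpasses standard post-processing for the first time.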

Link: https://arxiv.org/abs/2602.20689
Authors: Bedrettin Cetinkaya, Sinan Kalkan, Emre Akbas
Affiliations: Middle East Technical University; METU ROMER Robotics Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract: Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only about 21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2–4× compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20–35% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at this https URL.
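
To make the one-to-one matching step concrete, here is a small sketch. The abstract does not say which matching algorithm is used, so this substitutes a simple greedy rule (highest-confidence predictions pick their nearest free ground-truth pixel first); the distance threshold `max_dist` and the function name are likewise assumptions.

```python
import math

def match_edges(pred, gt, max_dist=2.0):
    """Greedy one-to-one matching between predicted and ground-truth edge pixels.

    pred: list of (y, x, confidence); gt: list of (y, x).
    Returns {pred_index: gt_index}; each ground-truth pixel is used at most once.
    """
    # Visit predictions from most to least confident.
    order = sorted(range(len(pred)), key=lambda i: -pred[i][2])
    used = set()
    matches = {}
    for i in order:
        py, px, _ = pred[i]
        best, best_d = None, max_dist
        for j, (gy, gx) in enumerate(gt):
            if j in used:
                continue
            d = math.hypot(py - gy, px - gx)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            used.add(best)
            matches[i] = best
    return matches

pred = [(0, 0, 0.9), (0, 1, 0.8), (5, 5, 0.7)]
gt = [(0, 0), (0, 2)]
matches = match_edges(pred, gt)
```

Predictions left out of `matches` (here the pixel at `(5, 5)`) would be treated as non-edge targets during training, which is what discourages thick, multi-pixel responses.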

[CV-63] RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation CVPR2026

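Quick read: This paper addresses a limitation of existing world models in handling spatio-temporal correlations: prior methods usually model spatial and temporal correlations separately, which makes unified 4D spatio-temporal reasoning difficult. The key to the solution is RAYNOVA, a geometry-free world model with a dual-causal autoregressive framework that follows both scale-wise and temporal topological orders during autoregression and uses global attention for unified 4D reasoning. Instead of relying on strong 3D geometric priors, it builds an isotropic spatio-temporal representation across views, frames, and scales from relative Plücker-ray positional encoding, so it generalizes robustly to diverse camera setups and ego motions without explicit 3D scene reconstruction. A recurrent training paradigm mitigates distribution drift in long-horizon video generation, improving multi-view video generation quality and controllability.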

Link: https://arxiv.org/abs/2602.20685
Authors: Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan
Affiliations: Applied Intuition; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026; Project website: this http URL

Abstract:World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at this http URL.
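
For readers unfamiliar with Plücker rays, the sketch below computes the standard six-dimensional Plücker coordinates of a camera ray. It shows only the basic parameterization; how RAYNOVA forms the relative encoding between rays is not described in the abstract, and the function names here are hypothetical.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """Return (d, m): the unit direction d and the moment m = origin x d."""
    n = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / n for c in direction)
    m = cross(origin, d)
    return d + m

# Two different points on the same ray give the same coordinates.
a = plucker_ray((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
b = plucker_ray((1.0, 3.0, 0.0), (0.0, 2.0, 0.0))
```

Because the moment does not depend on which point along the ray is used as the origin, the representation describes the ray itself rather than a particular camera centre, which is one reason it suits multi-camera setups.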

[CV-64] GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation

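Quick read: This paper addresses the limitations of existing driving simulators in free-viewpoint generation, editability, and fidelity, with the goal of supporting training and evaluation of end-to-end autonomous driving systems. Its core solution is the GA-Drive framework, which achieves high-quality novel view synthesis through Geometry-Appearance Decoupling and diffusion-based video generation: known scene geometry is first used to generate pseudo-views, and a trained video diffusion model then converts them into photorealistic views. This keeps appearance edits consistent while leaving the geometry unchanged, and the method clearly outperforms existing approaches on NTA-IoU, NTL-IoU, and FID.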

Link: https://arxiv.org/abs/2602.20673
Authors: Hao Zhang, Lue Fan, Qitai Wang, Wenbo Li, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li
Affiliations: MMLab, CUHK; CASIA; Shanghai Jiaotong University; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.

[CV-65] BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

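Quick read: This paper addresses the limited control precision of current text-to-image models: existing models rely mainly on descriptive language and cannot provide the precise numeric control over object position, size, and color that professional workflows require. The key to the solution is BBQ, a large-scale text-to-image model whose core idea is to condition directly on numeric bounding boxes and RGB color values within a unified structured-text framework. Training on captions with parametric annotations gives precise spatial and chromatic control without architectural changes or inference-time optimization. The approach improves object placement and color fidelity and supports intuitive interaction such as object dragging and color pickers, moving from ambiguous prompting toward structured parametric control.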

Link: https://arxiv.org/abs/2602.20672
Authors: Eliran Kachlon, Alexander Visheratin, Nimrod Sarid, Tal Hacham, Eyal Gutflaish, Saar Huberman, Hezi Zisman, David Ruppin, Ron Mokady
Affiliations: BRIA AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.
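
A structured prompt that carries numeric boxes and RGB triplets might look like the output of the sketch below. The JSON layout, the field names (`scene`, `objects`, `description`, `bbox`, `rgb`), and the normalized-coordinate convention are all assumptions made for illustration; the abstract does not specify BBQ's actual caption format.

```python
import json

def make_structured_prompt(scene, objects):
    """Serialize a scene description plus per-object numeric controls.

    Each object needs a "description", a normalized "bbox" (x0, y0, x1, y1)
    in [0, 1], and an integer "rgb" triplet in [0, 255].
    """
    items = []
    for obj in objects:
        x0, y0, x1, y1 = obj["bbox"]
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid bbox: {obj['bbox']}")
        rgb = list(obj["rgb"])
        if len(rgb) != 3 or not all(isinstance(c, int) and 0 <= c <= 255 for c in rgb):
            raise ValueError(f"invalid rgb: {obj['rgb']}")
        items.append({
            "description": obj["description"],
            "bbox": [round(v, 3) for v in (x0, y0, x1, y1)],
            "rgb": rgb,
        })
    return json.dumps({"scene": scene, "objects": items}, sort_keys=True)

prompt = make_structured_prompt(
    "a kitchen counter in morning light",
    [{"description": "ceramic mug", "bbox": (0.1, 0.2, 0.3, 0.5), "rgb": (200, 30, 20)}],
)
```

A user interface with object dragging or a color picker would only need to rewrite the `bbox` or `rgb` fields and resubmit the same text, which is the interaction style the abstract describes.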

[CV-66] BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity

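Quick read: This paper addresses the lack of support for the human creative process in current 3D generative models, in particular how to assist users in 3D shape design as they iterate from coarse abstraction to fine detail. The core challenge is a coarse-to-fine 3D shape generation mechanism that matches human perception. The key to the solution is a two-stage generative framework. The first stage, BoxSplitGen, learns the reverse of an agglomerative merging process, that is, sequences of bounding-box splits, to progressively refine a set of 3D part bounding boxes. The second stage, a box-to-shape generative model, builds on a pre-trained 3D diffusion model with added bounding-box conditioning to turn the split boxes into concrete 3D shapes. The method is the first to systematically model the human path from abstract to concrete creation, and experiments show that it outperforms token-prediction and inpainting baselines.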

Link: https://arxiv.org/abs/2602.20666
Authors: Juil Koo, Wei-Tung Lin, Chanho Park, Chanhyeok Park, Minhyuk Sung
Affiliations: KAIST; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation – particularly for detailing abstractions from coarse to fine – have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technical components of our framework are two generative models: the box-splitting generative model and the box-to-shape generative model. The first model, named BoxSplitGen, generates a collection of 3D part bounding boxes with varying granularity by iteratively splitting coarse bounding boxes. It utilizes part bounding boxes created through agglomerative merging and learns the reverse of the merging process – the splitting sequences. The model consists of two main components: the first learns the categorical distribution of the box to be split, and the second learns the distribution of the two new boxes, given the set of boxes and the indication of which box to split. The second model, the box-to-shape generative model, is trained by leveraging the 3D shape priors learned by an existing 3D diffusion model while adapting the model to incorporate bounding box conditioning. In our experiments, we demonstrate that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model. Also, we show that our box-to-shape model, based on a state-of-the-art 3D diffusion model, provides superior results compared to a previous model.
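
The training data described above, splitting sequences obtained by reversing agglomerative merging, can be sketched as follows. The merge criterion (smallest union volume) and the axis-aligned box format `(xmin, ymin, zmin, xmax, ymax, zmax)` are assumptions for illustration; the abstract does not state which criterion the authors use.

```python
from itertools import combinations

def union(a, b):
    """Smallest axis-aligned box containing boxes a and b."""
    lo = tuple(min(a[i], b[i]) for i in range(3))
    hi = tuple(max(a[i], b[i]) for i in range(3, 6))
    return lo + hi

def volume(box):
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

def splitting_sequence(parts):
    """Merge part boxes bottom-up, then reverse the merges into top-down splits.

    Returns a list of (parent, child_a, child_b), coarsest split first.
    """
    boxes = list(parts)
    merges = []
    while len(boxes) > 1:
        # Assumed criterion: merge the pair whose union box is smallest.
        a, b = min(combinations(boxes, 2), key=lambda p: volume(union(*p)))
        parent = union(a, b)
        boxes.remove(a)
        boxes.remove(b)
        boxes.append(parent)
        merges.append((parent, a, b))
    return merges[::-1]

A = (0, 0, 0, 1, 1, 1)
B = (1, 0, 0, 2, 1, 1)
C = (5, 0, 0, 6, 1, 1)
seq = splitting_sequence([A, B, C])
```

Each tuple in `seq` is one supervised example of the two quantities the abstract mentions: which box to split, and the two boxes that replace it.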

[CV-67] AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?

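Quick read: This paper addresses three limitations of existing static-diffusion-based methods for Custom Storyboard Generation (CSG): (1) static models lack dynamic expressiveness and often fall back on a "copy-paste" pattern; (2) one-shot inference cannot iteratively correct missing attributes or poor prompt adherence; (3) multi-agent frameworks depend on non-robust evaluators that are poorly suited to stylized, non-realistic animation. The key to the solution is AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. It uses the implicit motion prior of I2V models to improve consistency and expressiveness, and a mixed subjective-objective reviewer to enable reliable iterative refinement, reaching state-of-the-art performance in consistency, prompt fidelity, and stylization.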

Link: https://arxiv.org/abs/2602.20664
Authors: Hailong Yan, Shice Liu, Tao Wang, Xiangtao Zhang, Yijie Zhong, Jinwei Chen, Le Zhang, Bo Li
Affiliations: UESTC; vivo Mobile Communication Co. Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report

Abstract: Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) Static models lack dynamic expressiveness and often resort to a “copy-paste” pattern. (2) One-shot inference cannot iteratively correct missing attributes or poor prompt adherence. (3) Multi-agent frameworks rely on non-robust evaluators that are ill-suited for assessing stylized, non-realistic animation. To address these, we propose AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. Inspired by Disney’s “Combination of Straight Ahead and Pose to Pose” workflow, AnimeAgent leverages I2V’s implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement. We also collect a human-annotated CSG benchmark with ground truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.

[CV-68] SD4R: Sparse-to-Dense Learning for 3D Object Detection with 4D Radar

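Quick read: This paper addresses the limited accuracy of 3D object detection from 4D radar point clouds, which are inherently sparse and noisy, and in particular the performance drop of existing point-cloud densification methods in extremely sparse scenes. The key to the solution is the SD4R framework, built on two modules: a Foreground Point Generator (FPG) that suppresses noise propagation and produces densified point clouds, and a Logit-Query Encoder (LQE) that improves conventional pillarization to obtain more robust feature representations. Together they strengthen both denoising and foreground point densification, which improves 3D perception in complex environments.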

Link: https://arxiv.org/abs/2602.20653
Authors: Xiaokai Bai, Jiahao Cheng, Songkai Wang, Yixuan Luo, Lianqing Zheng, Xiaohan Zhang, Si-Yuan Cao, Hui-Liang Shen
Affiliations: Zhejiang University; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures, 4 tables

Abstract: 4D radar measurements offer an affordable and weather-robust solution for 3D perception. However, the inherent sparsity and noise of radar point clouds present significant challenges for accurate 3D object detection, underscoring the need for effective and robust point cloud densification. Despite recent progress, existing densification methods often fail to address the extreme sparsity of 4D radar point clouds and exhibit limited robustness when processing scenes with a small number of points. In this paper, we propose SD4R, a novel framework that transforms sparse radar point clouds into dense representations. SD4R begins by utilizing a foreground point generator (FPG) to mitigate noise propagation and produce densified point clouds. Subsequently, a logit-query encoder (LQE) enhances conventional pillarization, resulting in robust feature representations. Through these innovations, our SD4R demonstrates strong capability in both noise reduction and foreground point densification. Extensive experiments conducted on the publicly available View-of-Delft dataset demonstrate that SD4R achieves state-of-the-art performance. Source code is available at this https URL.
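
For context, the "conventional pillarization" that the LQE is said to enhance can be sketched as below: points are grouped into vertical columns on an x-y grid and each column is summarized. This shows only the baseline step, not the LQE itself; the cell size and the choice of a per-pillar mean are assumptions.

```python
import math
from collections import defaultdict

def pillarize(points, cell=0.5):
    """Group (x, y, z) points into pillars keyed by their x-y grid cell."""
    pillars = defaultdict(list)
    for p in points:
        key = (math.floor(p[0] / cell), math.floor(p[1] / cell))
        pillars[key].append(p)
    return dict(pillars)

def pillar_means(pillars):
    """Summarize each pillar by the mean of its points (one simple feature choice)."""
    return {k: tuple(sum(c) / len(v) for c in zip(*v)) for k, v in pillars.items()}

points = [(0.1, 0.1, 0.0), (0.4, 0.2, 1.0), (0.6, 0.1, 0.5), (-0.1, 0.0, 0.0)]
pillars = pillarize(points)
means = pillar_means(pillars)
```

With radar, most pillars hold one point or none, so per-pillar statistics are fragile; that sparsity is the weakness the abstract says densification and a stronger encoder are meant to address.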

[CV-69] Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression ICLR2026

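Quick read: This paper addresses the storage burden of large-scale image datasets in resource-constrained environments. Existing methods usually compress a dataset by discarding samples and ignore the redundancy inside each image, especially in the color space. The key to the solution is Dataset Color Quantization (DCQ), a unified framework that enforces consistent palette representations across similar images, retains semantically important colors selected by model perception, and preserves the structural details needed for feature learning, so the dataset can be compressed aggressively while model training performance is maintained.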

Link: https://arxiv.org/abs/2602.20650
Authors: Chenyue Yu, Lingao Xiao, Jinhong Deng, Ivor W. Tsang, Yang He
Affiliations: CFAR, Agency for Science, Technology and Research, Singapore; IHPC, Agency for Science, Technology and Research, Singapore; National University of Singapore; University of Electronic Science and Technology of China (UESTC); Nanyang Technological University (NTU), Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026

Abstract: Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image – particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at this https URL.
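
The basic mechanism behind palette-based compression can be shown with a plain nearest-color quantizer. This is generic color quantization under assumed names, not DCQ itself: how DCQ chooses a shared palette or weights semantically important colors is not given in the abstract, so the palette here is simply supplied by the caller.

```python
def nearest(color, palette):
    """Index of the palette entry closest to `color` in squared RGB distance."""
    return min(range(len(palette)),
               key=lambda i: sum((c - p) ** 2 for c, p in zip(color, palette[i])))

def quantize(image, palette):
    """Replace each RGB pixel with the index of its nearest palette color."""
    return [[nearest(px, palette) for px in row] for row in image]

def reconstruct(indices, palette):
    """Expand palette indices back into RGB pixels."""
    return [[palette[i] for i in row] for row in indices]

palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
image = [[(10, 10, 10), (250, 240, 245)],
         [(200, 30, 20), (0, 0, 0)]]
indices = quantize(image, palette)
restored = reconstruct(indices, palette)
```

Storing `indices` plus one palette shared by a group of similar images costs a few bits per pixel instead of 24, which is where the dataset-level saving comes from.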

[CV-70] SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement

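Quick read: This paper addresses the accuracy and stability of field-of-view (FoV) guidance in minimally invasive surgery. Existing methods often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions, which leads to weak generalization and poor continuity. The key to the solution is to formulate surgeon attention tracking as a spatio-temporal learning problem and to represent attention as a dense heatmap, giving continuous and interpretable frame-wise FoV guidance. The proposed SurgAtt-Tracker framework exploits temporal coherence through proposal-level reranking and motion-aware refinement rather than direct regression, which improves robustness under occlusion, multi-instrument interference, and cross-domain settings.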

Link: https://arxiv.org/abs/2602.20636
Authors: Rulin Zhou, Guankun Wang, An Wang, Yujie Ma, Lixin Ouyang, Bolin Cui, Junyan Li, Chaowei Zhu, Mingyang Li, Ming Chen, Xiaopin Zhong, Peng Lu, Jiankun Wang, Xianming Liu, Hongliang Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.

[CV-71] Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection

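Quick read: This paper addresses unreliable instance activation in 4D millimeter-wave radar for autonomous driving, caused by sparse and weak geometric cues, which limits existing radar-camera fusion methods. The core solution is SIFormer, a scene-instance aware Transformer architecture. First, segmentation- and depth-guided localization suppresses background noise during view transformation. Second, a cross-view activation mechanism injects 2D instance features into bird's-eye-view (BEV) space to strengthen instance awareness under weak radar geometry. Finally, a Transformer-based fusion module aggregates image semantics and radar geometry for robust perception. The method bridges BEV-level and perspective-level fusion, combining global scene understanding with instance detail and improving detection accuracy.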

Link: https://arxiv.org/abs/2602.20632
Authors: Xiaokai Bai, Lianqing Zheng, Si-Yuan Cao, Xiaohan Zhang, Zhe Wu, Beinan Yu, Fang Wang, Jie Bai, Hui-Liang Shen
Affiliations: Zhejiang University; Tongji University; Hangzhou City University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures, 13 tables

Abstract: 4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. As a result, with the aim of enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address the inherent sparsity of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet, and NuScenes datasets. Source code is available at this http URL.

[CV-72] From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection CVPR2026

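Quick read: This paper addresses the limited tracking stability of existing learning-based keypoint detectors over long sequences, especially under large viewpoint and illumination changes, where the conventional image-pair training paradigm does not optimize long-term trackability. The key to the solution is to recast keypoint detection as a sequential decision-making problem and to propose the TraqPoint framework. Its core idea is a track-aware reward mechanism that, through a policy gradient method, jointly optimizes the consistency and distinctiveness of keypoints across multiple views, directly improving their Track-quality (Traq). On sparse matching benchmarks it clearly outperforms current state-of-the-art methods.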

Link: https://arxiv.org/abs/2602.20630
Authors: Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu
Affiliations: Wuhan University; Xiaomi EV
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract: Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.

[CV-73] Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

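Quick read: This paper addresses the limited performance of Monocular 3D Object Detection (M3OD) models caused by costly annotation and insufficient diversity of training data. The core challenge is that objects, scenes, and camera poses are strongly entangled in real captures, so the training data lacks independently varying combinations, which leads to inefficient data use and overfitting. The key to the solution is an online data augmentation scheme that decomposes and recomposes object, scene, and camera pose: training images are first decomposed efficiently into textured 3D object point models and background scenes; then, in every training epoch, objects are inserted into the free space of a scene and re-rendered with perturbed camera poses, producing new training samples that cover the full range of independent combinations. This markedly increases data diversity and model generalization. The scheme works as a plug-and-play module under several supervision settings, including sparse annotation, and is validated on five representative M3OD models on the KITTI and Waymo datasets.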

Link: https://arxiv.org/abs/2602.20627
Authors: Zhaonian Kuang, Rui Ding, Meng Yang, Xinhu Zheng, Gang Hua
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: IJCV

Abstract: Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires a humongous amount of labeled data with complicated visual variation from diverse scenes, variety of objects and camera poses. However, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D objects are always captured in particular scenes with fixed camera poses, and hence lack the necessary diversity. Such tight entanglement induces the challenging issues of insufficient utilization and overfitting to uniform training data. To mitigate this, we propose an online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data. We first fully decompose training images into textured 3D object point models and background scenes in an efficient computation and storage manner. We then continuously recompose new training images in each epoch by inserting the 3D objects into the freespace of the background scenes, and rendering them with perturbed camera poses from textured 3D point representation. In this way, the refreshed training data in all epochs can cover the full spectrum of independent object, scene, and camera pose combinations. This scheme can serve as a plug-and-play component to boost M3OD models, working flexibly with both fully and sparsely supervised settings. In the sparsely-supervised setting, objects closest to the ego-camera for all instances are sparsely annotated. We then can flexibly increase the annotated objects to control annotation cost. For validation, our method is widely applied to five representative M3OD models and evaluated on both the KITTI and the more complicated Waymo datasets.
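
The recomposition step relies on re-rendering 3D object points under a perturbed camera pose. The sketch below shows the geometric core with a plain pinhole model: a yaw rotation and translation (the assumed perturbation) followed by projection. The function name, the axis convention (y vertical, z forward), and the intrinsics are all hypothetical, and texture rendering and occlusion handling are omitted.

```python
import math

def project(points, fx, fy, cx, cy, yaw=0.0, t=(0.0, 0.0, 0.0)):
    """Project 3D points to pixels after a yaw rotation and translation of the frame."""
    c, s = math.cos(yaw), math.sin(yaw)
    pixels = []
    for x, y, z in points:
        # World -> camera: rotate about the vertical (y) axis, then translate.
        xc = c * x + s * z + t[0]
        yc = y + t[1]
        zc = -s * x + c * z + t[2]
        if zc <= 0:
            continue  # point is behind the camera
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels

object_points = [(0.0, 0.0, 10.0)]
original = project(object_points, 100.0, 100.0, 50.0, 50.0)
shifted = project(object_points, 100.0, 100.0, 50.0, 50.0, t=(1.0, 0.0, 0.0))
```

Sampling a fresh `yaw` and `t` each epoch gives the same object a new image position and apparent size, and the 3D box label can be transformed with the same pose so annotations stay consistent.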

[CV-74] RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces CVPR2026

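Quick read: This paper addresses two weaknesses of current fragile-watermark methods for detecting and recovering image manipulation: an informed adversary can remove the watermark, and when a fragile watermark coexists with a robust one, mutual interference reduces its effectiveness. The key to the solution is the RecoverMark framework, which rests on two ideas. First, it exploits a real-world constraint, namely that an adversary must keep the background semantically consistent to avoid visible anomalies, to design a detection mechanism that is robust to global, imperceptible watermark removal attacks. Second, it embeds the face content itself as the watermark in the surrounding background, so the image's own content improves extraction robustness. With a two-stage training strategy and distortion layers that simulate a range of potential attacks, RecoverMark performs manipulation localization, content recovery, and ownership verification at once without a fragile watermark.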

Link: https://arxiv.org/abs/2602.20618
Authors: Haonan An, Xiaohui Ye, Guang Hua, Yihang Tao, Hangcheng Cao, Xiangyu Yu, Yuguang Fang
Affiliations: City University of Hong Kong; Singapore Institute of Technology; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract: The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark’s effectiveness. To address the gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background’s semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image’s own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves robust watermark embedding in a non-fragile manner for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark’s robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.

[CV-75] Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model

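[Quick read]: This paper addresses the confusion between known and unknown categories in open-world object detection (OWOD): existing methods focus on improving unknown recall but overlook the interpretability of detections, which reduces reliability when separating known from unknown classes. The key to the solution is a concept-driven interpretable OWOD framework (InterPretable OWOD, IPOW) built on a Concept Decomposition Model (CDM), which explicitly decomposes the RoI features of Faster R-CNN into three kinds of concepts: discriminative, shared, and background. Discriminative concepts enlarge the feature distances between known classes, while shared and background concepts generalize well and can be used to identify unknown categories. Building on this interpretable framework, the authors propose Concept-Guided Rectification (CGR) to mitigate the confusion that arises when unknown objects fall into the discriminative space of known classes, yielding more accurate and more reliable detection of unknown categories.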

Link: https://arxiv.org/abs/2602.20616
Authors: Xueqiang Lv, Shizhou Zhang, Yinghui Xing, Di Xu, Peng Wang, Yanning Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Open-world object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known-unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly “know the unknown”. To this end, we propose a concept-driven InterPretable OWOD framework (IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known-unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.

[CV-76] VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

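[Quick read]: This paper targets a key problem in 3D object affordance grounding: existing methods rely mainly on static visual or textual cues and ignore the fact that affordances are inherently defined by dynamic interactions, so they struggle to localize the contact regions involved in real human-object interaction (HOI). The core of the solution is video-guided 3D affordance grounding, which uses dynamic interaction sequences in videos as functional supervision. The proposed VAGNet framework aligns interaction cues extracted from video with 3D structure, resolving ambiguities that static cues alone cannot. The approach is validated on PVAD, the first HOI video-3D pairing affordance dataset, where it significantly outperforms baselines based on static cues.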

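Link: https://arxiv.org/abs/2602.20608
Authors: Aihua Mao, Kaihang Huang, Yong-Jin Liu, Chee Seng Chan, Ying He
Affiliations: SCUT (South China University of Technology); THU (Tsinghua University); UM (Universiti Malaya); NTU (Nanyang Technological University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: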

Abstract: 3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be made publicly available.

[CV-77] Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing

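[Quick read]: This paper addresses how to accurately parse fine-grained interactions between hands and active objects in egocentric interaction understanding. Existing Transformer-based methods have three limitations: (1) query initialization relies on semantic cues or learnable parameters and adapts poorly to active objects that change across scenes; (2) iteratively refining queries with pixel-level semantic features introduces interaction-irrelevant content that interferes with learning interaction features; and (3) models are prone to "interaction illusion", i.e., physically inconsistent predictions. To address these, the authors propose the end-to-end Interaction-aware Transformer (InterFormer), built on three components. The Dynamic Query Generator (DQG) grounds query initialization in the spatial dynamics of hand-object contact, producing queries targeted at the interacting entities. The Dual-context Feature Selector (DFS) fuses coarse interaction cues with semantic features to suppress non-interactive noise and strengthen the modeling of interactive relationships. The Conditional Co-occurrence (CoCo) loss adds hand-object relationship constraints to improve the physical consistency of predictions. The method achieves state-of-the-art performance on the EgoHOS and mini-HOI4D benchmarks, showing strong generalization.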

Link: https://arxiv.org/abs/2602.20597
Authors: Yuejiao Su, Yi Wang, Lei Yao, Yawen Cui, Lap-Pui Chau
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to “interaction illusion”, producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at this https URL.

[CV-78] Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change

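[Quick read]: This paper addresses Structure-from-Motion (SfM) based 3D reconstruction for long-term environmental monitoring. When site visits are separated by months or even years, conventional SfM pipelines fail because they assume near-simultaneous image capture and limited appearance change, so they cannot align and reconstruct a consistent 3D model across sessions. The key to the solution is to abandon the usual strategy of reconstructing each session independently and aligning afterwards, and instead enforce cross-session correspondences directly within a joint SfM reconstruction, combining handcrafted and learned visual features for robust matching across large temporal gaps. In addition, visual place recognition is used to preselect likely cross-session image pairs, so that expensive learned feature matching runs only on a small candidate set, which preserves reconstruction consistency while improving scalability and efficiency.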

Link: https://arxiv.org/abs/2602.20584
Authors: Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
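
The pair-selection step described above can be illustrated with a minimal sketch. It assumes every image already has a global place-recognition descriptor; the toy vectors, the cosine-similarity ranking, and the `top_k` parameter are hypothetical choices for illustration, not the authors' implementation. Only the most similar cross-session pairs are kept as candidates for expensive learned feature matching.

```python
import math


def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(session_a, session_b, top_k=1):
    """For each image in session A, keep the top_k most similar images in session B."""
    pairs = []
    for i, desc_a in enumerate(session_a):
        scored = sorted(
            ((cosine(desc_a, desc_b), j) for j, desc_b in enumerate(session_b)),
            reverse=True,
        )
        pairs.extend((i, j) for _, j in scored[:top_k])
    return pairs


# Hypothetical global descriptors for two survey sessions of the same site.
session_early = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]
session_late = [[0.1, 0.9, 0.3], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]

pairs = candidate_pairs(session_early, session_late, top_k=1)
print(pairs)  # candidate (early_index, late_index) pairs for learned matching
```

With `top_k` fixed, the number of expensive matches grows linearly with the size of one session instead of with the product of both session sizes, which is the scalability argument made in the abstract.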

[CV-79] PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

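[Quick read]: This paper addresses the scarcity of training data for propagation-based video editing: conventional methods depend on large-scale paired datasets of source and edited videos, which are costly and complex to acquire. The key to the solution is PropFly, a training pipeline that uses on-the-fly supervision from pre-trained video diffusion models (VDMs), so no paired dataset has to be built in advance. Specifically, PropFly takes one-step clean latent estimates from intermediate noised latents under different Classifier-Free Guidance (CFG) scales to synthesize diverse 'source' (low-CFG) and 'edited' (high-CFG) latents. The source latent preserves the structural information of the video, while the edited latent provides the target transformation for learning propagation. A Guidance-Modulated Flow Matching (GMFM) loss then trains an additional adapter to replicate the target transformation while maintaining temporal consistency, producing high-quality, dynamic, and consistent video edits.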

Link: https://arxiv.org/abs/2602.20583
Authors: Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim
Affiliations: KAIST; Kyungpook National University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally to this work

Abstract: Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of ‘source’ (low-CFG) and ‘edited’ (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.
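
As a rough illustration of how two CFG scales can yield a 'source' and an 'edited' prediction from the same noised latent, the sketch below applies the standard classifier-free guidance combination, `uncond + scale * (cond - uncond)`. The toy prediction vectors and the scale values 1.0 and 7.0 are assumptions for illustration; the one-step clean latent estimation and the GMFM loss from the paper are not reproduced here.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Hypothetical model outputs for one noised latent (toy 3-dimensional vectors).
uncond_pred = [0.0, 1.0, -1.0]
cond_pred = [1.0, 1.5, -2.0]

# A low scale stays close to the plain conditional prediction ("source"-like),
# a high scale pushes further along the prompt direction ("edited"-like).
source_like = cfg_combine(uncond_pred, cond_pred, scale=1.0)
edited_like = cfg_combine(uncond_pred, cond_pred, scale=7.0)

print(source_like)
print(edited_like)
```

Because both predictions start from the same latent, they share structure and differ mainly in how strongly the prompt is applied, which is what makes them usable as a training pair.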

[CV-80] Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion

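[Quick read]: This paper addresses the high inference latency, limited action precision, and poor explainability of Large Language Models (LLMs) and Vision-Language Models (VLMs) in end-to-end autonomous driving. Existing autoregressive methods are slow because they generate token by token, while diffusion-based methods often rely on general-purpose language tokens that lack explicit geometric structure, which makes precise trajectory planning difficult. The key to the solution is Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD). It uses a discrete action tokenization strategy to build a codebook of kinematically feasible waypoints, introduces geometry-aware embedding learning so that representations in the latent space approximate physical geometric metrics, and adopts an action-priority decoding strategy that generates the trajectory first. On nuScenes and derived benchmarks it achieves higher planning precision and efficiency while providing high-fidelity, explainable reasoning.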

Link: https://arxiv.org/abs/2602.20577
Authors: Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang, Ziran Wang
Affiliations: Purdue University; Institute for Physical Artificial Intelligence (IPAI); College of Engineering; Department of Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
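
The idea of discrete action tokenization can be sketched as nearest-neighbor quantization against a waypoint codebook. Everything concrete below is assumed for illustration: the four-entry codebook is hand-written rather than learned from driving data, and the squared Euclidean distance is a stand-in for whatever metric the paper uses.

```python
def tokenize_waypoints(trajectory, codebook):
    """Map each (x, y) waypoint to the index of its nearest codebook entry."""
    tokens = []
    for x, y in trajectory:
        dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def detokenize(tokens, codebook):
    """Recover a quantized trajectory from action tokens."""
    return [codebook[t] for t in tokens]


# Hypothetical codebook of feasible waypoints (metres, ego frame).
codebook = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
trajectory = [(0.1, -0.1), (0.9, 0.1), (1.1, 0.6), (1.9, 0.9)]

tokens = tokenize_waypoints(trajectory, codebook)
print(tokens)
print(detokenize(tokens, codebook))
```

Since every token decodes to a waypoint that is already in the codebook, any generated token sequence maps back to waypoints drawn from the feasible set, which is the property the compact codebook is meant to provide.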

[CV-81] An interactive enhanced driving dataset for autonomous driving

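[Quick read]: This paper addresses two bottlenecks in developing Vision-Language-Action (VLA) models for autonomous driving: interactive scenarios are sparse in existing data, and multimodal alignment in that data is inadequate. The key to the solution is a large-scale, high-quality Interactive Enhanced Driving Dataset (IEDD). Its main contributions are a scalable pipeline that mines million-level interactive segments based on interactive trajectories, and a set of metrics that quantify the interaction process. The authors further generate Bird's Eye View (BEV) videos aligned with structured language for VQA tasks, providing a benchmark for evaluating and fine-tuning the reasoning capabilities of autonomous driving models.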

Link: https://arxiv.org/abs/2602.20575
Authors: Haojie Feng, Peizhi Zhang, Mengjie Tian, Xinrui Zhang, Zhuoren Li, Junpeng Huang, Xiurong Wang, Junfan Zhu, Jianzhou Wang, Dongxiao Yin, Lu Xiong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird’s Eye View (BEV) videos where semantic actions are strictly aligned with structured language. Benchmark results evaluating ten mainstream Vision Language Models (VLMs) are provided to demonstrate the dataset’s reuse value in assessing and fine-tuning the reasoning capabilities of autonomous driving models.

[CV-82] AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

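[Quick read]: This paper addresses the lack of a dedicated evaluation benchmark for document forgery produced by diffusion-model inpainting. Existing datasets are built mainly with traditional digital editing tools (e.g., Adobe Photoshop), and mainstream detectors degrade sharply on forged documents generated by AI inpainting APIs such as Gemini 2.5 Flash Image and Ideogram v2 Edit. The key to the solution is AIForge-Doc, the first benchmark dedicated to diffusion-model-based document tampering. It systematically forges numeric fields in real receipt and form images, yielding 4,061 forged images with pixel-level tampered-region masks annotated in DocTamper-compatible format, which enables fine-grained, multilingual, cross-distribution evaluation of AI-forged documents.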

Link: https://arxiv.org/abs/2602.20569
Authors: Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, Jingheng Huan
Affiliations: Duke University; New York University; University of North Carolina; Scam.ai; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures

Abstract:We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs – Gemini 2.5 Flash Image and Ideogram v2 Edit – yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors – TruFor, DocTamper, and a zero-shot GPT-4o judge – and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 – essentially at chance – confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
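
For readers unfamiliar with the pixel-level IoU figure quoted above, the following sketch shows how intersection-over-union between a predicted and a ground-truth tampered-region mask is typically computed. The 4×4 masks are invented for illustration, and the convention of returning 1.0 when both masks are empty is an assumption; the benchmark's exact evaluation protocol is not given in the abstract.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks given as nested lists."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 4x4 masks: the prediction is shifted one pixel to the right of the truth.
truth = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]

iou = mask_iou(pred, truth)
print(f"IoU = {iou:.3f}")
```

An IoU near 0.02, as reported for DocTamper on this benchmark, means the predicted regions barely overlap the forged fields at all.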

[CV-83] BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model

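[Quick read]: This paper addresses the real-time bottleneck that Vision-Language-Action (VLA) models face in robotic manipulation when multi-view inputs sharply increase the number of visual tokens. Existing VLM acceleration techniques such as token pruning lose performance when applied directly to VLA models because they ignore cross-view relationships and the dynamic, task-specific nature of manipulation. The key to the solution is BFA++, a dynamic token pruning framework designed for VLA models. Its core is a two-level importance prediction mechanism: an intra-view predictor identifies task-relevant regions within each image to suppress spatial noise, and an inter-view predictor identifies the critical camera views at each manipulation phase to remove redundant information. Together they select tokens efficiently while preserving semantics, improving both inference speed and manipulation success rate.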

Link: https://arxiv.org/abs/2602.20566
Authors: Haosheng Li, Weixin Mao, Zihan Lan, Hongwei Xiong, Hongan Wang, Chenyang Si, Ziwei Liu, Xiaoming Deng, Hua Chen
Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; LimX Dynamic; Nanjing University; Nanyang Technological University; Zhejiang University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 10 figures

Abstract: Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the $\pi_0$ and RDT models, achieving speedups of 1.8× and 1.5×, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.
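
The two-level pruning idea can be sketched as follows, under stated assumptions: per-view importance weights (standing in for the inter-view predictor) split a global token budget across cameras, and per-token scores (standing in for the intra-view predictor) decide which tokens survive inside each view. The proportional budget rule, the view names, and all scores are hypothetical; the paper's predictors are learned and are not reproduced here.

```python
def prune_tokens(token_scores, view_weights, total_budget):
    """Keep the highest-scoring tokens per view, with a budget set by view weight.

    token_scores: {view: [score per token]}  (intra-view importance)
    view_weights: {view: weight}             (inter-view importance)
    Returns {view: sorted indices of kept tokens}.
    """
    weight_sum = sum(view_weights.values())
    kept = {}
    for view, scores in token_scores.items():
        # Assumed rule: budget proportional to the view's importance weight.
        budget = round(total_budget * view_weights[view] / weight_sum)
        budget = min(budget, len(scores))
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept[view] = sorted(ranked[:budget])
    return kept


# Hypothetical scores for a head camera and a wrist camera, 4 tokens each.
token_scores = {"head": [0.9, 0.1, 0.8, 0.2], "wrist": [0.3, 0.7, 0.6, 0.05]}
view_weights = {"head": 0.75, "wrist": 0.25}

kept = prune_tokens(token_scores, view_weights, total_budget=4)
print(kept)
```

In a phase-dependent setting the view weights would change over time, for example shifting toward the wrist camera during grasping, so the same budget is spent on different cameras at different stages of the task.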

[CV-84] WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos

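[Quick read]: This paper addresses the degradation of monocular-video 3D hand reconstruction in complex real-world scenes, in particular under severe perturbations such as hand-object interaction, extreme poses, illumination changes, and motion blur. The solution is the WildGHand framework, which rests on two components. The dynamic perturbation disentanglement module models perturbations explicitly as time-varying biases on 3D Gaussian attributes during optimization. The perturbation-aware optimization strategy generates per-frame anisotropic weighted masks to guide the optimization. Working together, they let the model identify and suppress perturbations across both spatial and temporal dimensions, substantially improving the fidelity of hand avatars reconstructed from in-the-wild videos.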

Link: https://arxiv.org/abs/2602.20556
Authors: Hanhui Li, Xuan Huang, Wanquan Liu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang, Chenqiang Gao
Affiliations: Sun Yat-sen University; Lenovo Research Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a 15.8% relative gain in PSNR and a 23.1% relative reduction in LPIPS). Our implementation and dataset are available at this https URL.

[CV-85] CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

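[Quick read]: This paper addresses the poor performance of verbal-prompted segmentation on uncommon, instance-specific, or hard-to-describe objects in industrial settings, as well as the limitation of image exemplars, whose appearance cues (such as color and texture) are unrelated to a part's geometric identity. The key to the solution is a CAD-prompted segmentation framework built on SAM3 that takes canonical multi-view renderings of a CAD model as the prompt, providing geometric guidance independent of surface appearance. The model is trained on synthetic mesh renderings generated in simulation under diverse viewpoints and scene contexts. The result is single-stage, robust mask prediction that extends promptable segmentation to objects that cannot be reliably identified from language or appearance alone.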

Link: https://arxiv.org/abs/2602.20551
Authors: Zhenran Tang, Rohan Nagabhirava, Changliu Liu
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part’s geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

[CV-86] The Finite Primitive Basis Theorem for Computational Imaging: Formal Foundations of the OperatorGraph Representation

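[Quick read]: This paper addresses the long-standing problem that computational imaging forward models are encoded in a monolithic, modality-specific way: the physical modeling and implementation for different modalities (e.g., coded aperture spectral cameras, MRI scanners) usually exist as separate, closed codes that lack generality and composability. The key to the solution is the Finite Primitive Basis Theorem, stated and proved in the paper: every forward model in a broadly defined operator class Cimg (covering clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an ε-approximate representation as a typed directed acyclic graph (DAG) whose nodes come only from a minimal library of 11 primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. The theorem comes with a constructive algorithm for producing such DAGs. Empirical validation (error eimg < 0.01 on 31 linear modalities, with at most 5 nodes and depth 5) and a proof of minimality (removing any primitive leaves at least one modality without an ε-approximate representation) establish the mathematical basis of this unified modeling framework and the theoretical foundation of the Physics World Model (PWM) framework.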

Link: https://arxiv.org/abs/2602.20550
Authors: Chengshuai Yang
Affiliations: NextGen PlatformAI C Corp, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computational imaging forward models, from coded aperture spectral cameras to MRI scanners, are traditionally implemented as monolithic, modality-specific codes. We prove that every forward model in a broad, precisely defined operator class Cimg (encompassing clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an epsilon-approximate representation as a typed directed acyclic graph (DAG) whose nodes are drawn from a library of exactly 11 canonical primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. We call this the Finite Primitive Basis Theorem. The proof is constructive: we provide an algorithm that, given any H in Cimg, produces a DAG G with relative operator error at most epsilon and graph complexity within prescribed bounds. We further prove that the library is minimal: removing any single primitive causes at least one modality to lose its epsilon-approximate representation. A systematic analysis of nonlinearities in imaging physics shows they fall into two structural categories: pointwise scalar functions (handled by Transform) and self-consistent iterations (unrolled into existing linear primitives). Empirical validation on 31 linear modalities confirms eimg below 0.01 with at most 5 nodes and depth 5, and we provide constructive DAG decompositions for 9 additional nonlinear modalities. These results establish mathematical foundations for the Physics World Model (PWM) framework.

[CV-87] Sample-efficient evidence estimation of score based priors for model selection ICLR2026

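[Quick read]: This paper addresses how to choose a prior model that is consistent with the measurements $y$ when solving ill-posed imaging inverse problems, so as to avoid severe bias. In the Bayesian framework, the best prior can be chosen by comparing the model evidence $p(y \mid M)$ under different prior models $M$, but computing the model evidence directly for a diffusion-model prior is intractable. Existing evidence estimators typically need many pointwise evaluations of the unnormalized prior density or an accurate clean prior score, which is computationally expensive and hard to obtain. The paper proposes a new evidence estimator, \method, whose core idea is to use the intermediate samples that arise naturally during reverse diffusion sampling: it estimates the model evidence by integrating over the time-marginals of the posterior sampling trajectory, and needs only a handful of posterior samples (e.g., 20) for an accurate estimate. The method integrates with recent diffusion posterior sampling techniques and is validated on several highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem, where it selects the correct diffusion prior and diagnoses prior misfit.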

Link: https://arxiv.org/abs/2602.20549
Authors: Frederic Wang, Katherine L. Bouman
Affiliations: California Institute of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Comments: ICLR 2026

Abstract: The choice of prior is central to solving ill-posed imaging inverse problems, making it essential to select one consistent with the measurements $y$ to avoid severe bias. In Bayesian inverse problems, this could be achieved by evaluating the model evidence $p(y \mid M)$ under different models $M$ that specify the prior and then selecting the one with the highest value. Diffusion models are the state-of-the-art approach to solving inverse problems with a data-driven prior; however, directly computing the model evidence with respect to a diffusion prior is intractable. Furthermore, most existing model evidence estimators require either many pointwise evaluations of the unnormalized prior density or an accurate clean prior score. We propose \method, an estimator of the model evidence of a diffusion prior by integrating over the time-marginals of posterior sampling methods. Our method leverages the large amount of intermediate samples naturally obtained during the reverse diffusion sampling process to obtain an accurate estimation of the model evidence using only a handful of posterior samples (e.g., 20). We also demonstrate how to implement our estimator in tandem with recent diffusion posterior sampling methods. Empirically, our estimator matches the model evidence when it can be computed analytically, and it is able to both select the correct diffusion model prior and diagnose prior misfit under different highly ill-conditioned, non-linear inverse problems, including a real-world black hole imaging problem.
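
Evidence-based prior selection is easiest to see in a case where the evidence is available in closed form. The sketch below assumes a scalar linear-Gaussian model (measurement $y = x + n$ with Gaussian noise and a Gaussian prior on $x$), for which $p(y \mid M)$ is a Gaussian with variance equal to the sum of the prior and noise variances. This toy setup, the prior names, and the numbers are assumptions for illustration only; it is not the paper's estimator for diffusion priors.

```python
import math


def log_evidence(y, prior_mean, prior_std, noise_std):
    """log p(y | M) for y = x + n, x ~ N(prior_mean, prior_std^2), n ~ N(0, noise_std^2)."""
    var = prior_std ** 2 + noise_std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (y - prior_mean) ** 2 / (2 * var)


# Two hypothetical candidate priors, given as (mean, std).
priors = {"prior_A": (0.0, 1.0), "prior_B": (5.0, 1.0)}
y_obs, noise_std = 4.6, 0.5

scores = {
    name: log_evidence(y_obs, mean, std, noise_std)
    for name, (mean, std) in priors.items()
}
best = max(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: log evidence = {value:.3f}")
print("selected:", best)
```

The prior centered near the observation gets the higher evidence and is selected; a large gap between the best and the remaining scores is also a simple signal of prior misfit for the rejected models.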

[CV-88] Robust Spiking Neural Networks Against Adversarial Attacks ICLR2026

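[Quick read]: This paper addresses the limited robustness of directly trained Spiking Neural Networks (SNNs) in complex adversarial environments. The study shows that threshold-neighboring spiking neurons are the key limiting factor: they set the theoretical upper bound on adversarial attack strength and are prone to state flipping under small perturbations. The core of the solution is Threshold Guarding Optimization (TGO). First, additional constraints in the loss function move neurons' membrane potentials away from the threshold, which increases gradient sparsity and lowers the theoretical upper bound of adversarial attacks. Second, noisy spiking neurons change the firing mechanism from deterministic to probabilistic, reducing the probability of state flipping caused by small perturbations.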

Link: https://arxiv.org/abs/2602.20548
Authors: Shuai Wang, Malu Zhang, Yulin Jiang, Dehao Zhang, Ammar Belatreche, Yu Liang, Yimeng Shan, Zijian Zhou, Yang Yang, Haizhou Li
Affiliations: University of Electronic Science and Technology of China; Northumbria University; Shenzhen Loop Area Institute; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at ICLR 2026

Abstract:Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing due to their bio-plausible and spike-driven characteristics. However, the robustness of SNNs in complex adversarial environments remains significantly constrained. In this study, we theoretically demonstrate that those threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances. To address this challenge, we propose a Threshold Guarding Optimization (TGO) method, which comprises two key aspects. First, we incorporate additional constraints into the loss function to move neurons’ membrane potentials away from their thresholds. It increases SNNs’ gradient sparsity, thereby reducing the theoretical upper bound of adversarial attacks. Second, we introduce noisy spiking neurons to transition the neuronal firing mechanism from deterministic to probabilistic, decreasing their state-flipping probability due to minor disturbances. Extensive experiments conducted in standard adversarial scenarios prove that our method significantly enhances the robustness of directly trained SNNs. These findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications.
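
One way to picture a loss term that moves membrane potentials away from the firing threshold is a hinge-style margin penalty, sketched below. The hinge form, the `margin` value, and the threshold of 1.0 are assumptions chosen for illustration; the abstract does not specify the actual constraint used in TGO.

```python
def threshold_margin_penalty(potentials, threshold=1.0, margin=0.2):
    """Mean hinge penalty, non-zero only for potentials within `margin` of the threshold."""
    penalties = [max(0.0, margin - abs(u - threshold)) for u in potentials]
    return sum(penalties) / len(penalties)


# Hypothetical membrane potentials: two far from the threshold, two very close to it.
potentials = [0.2, 0.95, 1.05, 1.6]

penalty = threshold_margin_penalty(potentials)
print(f"penalty = {penalty:.3f}")
```

Only the two potentials near the threshold contribute, so adding such a term to the training loss penalizes exactly the neurons that a small input perturbation could flip between firing and not firing.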

[CV-89] Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing

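[Quick read]: This paper addresses the automation of colony-forming unit (CFU) detection in pharmaceutical manufacturing. Manual counting is inefficient and error-prone, while existing deep learning (DL) methods are unstable under varying sample quality and artifacts and fall short of pharmaceutical-grade quality control requirements. The key to the solution is a multi-agent framework that combines DL with a vision-language model (VLM). A VLM agent first decides whether a plate is valid. For valid samples, the DL and VLM agents estimate the colony count independently; when the two results agree within 5%, the result is recorded automatically in Postgres and SAP, and otherwise the sample is sent for expert review. Expert feedback supports continuous retraining and self-improvement. The reduction in human verification rises from 50% with DL alone to 85% with VLM integration, giving a scalable, auditable, regulation-ready solution for microbiological quality control.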

Link: https://arxiv.org/abs/2602.20543
Authors: Subhra Jyoti Mandal, Lara Rachidi, Puneet Jain, Matthieu Duvinage, Sander W. Timmer
Affiliations: GSK, Enterprise AI; Databricks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical-grade requirements. A custom Detectron2 model trained on GSK’s dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi-agent framework combining DL with vision-language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self-improvement. Initial DL-based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.
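
The routing rule described above (validity check, two independent counts, automatic recording when they agree within 5 percent, expert review otherwise) can be written as a small decision function. Several details are assumptions, since the abstract does not specify them: the difference is measured relative to the larger of the two counts, two zero counts are treated as agreement, and invalid plates get a separate `"invalid_plate"` status.

```python
def route_plate(is_valid, dl_count, vlm_count, tolerance=0.05):
    """Decide what happens to a plate given the two agents' outputs."""
    if not is_valid:
        # Assumed status: handling of invalid plates is not described in the abstract.
        return "invalid_plate"
    reference = max(dl_count, vlm_count)
    if reference == 0:
        # Both agents report no colonies, which counts as agreement here.
        return "auto_record"
    if abs(dl_count - vlm_count) / reference <= tolerance:
        return "auto_record"
    return "expert_review"


# Hypothetical plates: (valid?, DL count, VLM count).
plates = [(True, 100, 97), (True, 100, 80), (False, 10, 10), (True, 0, 0)]
for plate in plates:
    print(plate, "->", route_plate(*plate))
```

Keeping the rule this explicit is also what makes the pipeline auditable: every automatically recorded count can be traced back to two independent estimates and a fixed tolerance.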

[CV-90] PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning CVPR2026

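[Quick Read]: This paper addresses the difficulty that pure convolutional models have in adaptively capturing spatially varying motion patterns in spatiotemporal prediction, a consequence of their fixed receptive fields. The key to the solution is the PFGNet framework, which dynamically modulates receptive fields through pixel-wise frequency-guided gating. Its core Peripheral Frequency Gating (PFG) block extracts local spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, forming spatially adaptive band-pass filters. To stay efficient, every large kernel is decomposed into separable 1D convolutions (1×k and k×1), which lowers the per-channel computational cost from O(k²) to O(2k) and enables structure-aware spatiotemporal modeling without recurrence or attention.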

Link: https://arxiv.org/abs/2602.20537
Authors: Xinyong Cai,Changbin Sun,Yong Wang,Hongyu Yang,Yuankai Wu
Affiliations: Sichuan University; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions ($1 \times k$ followed by $k \times 1$), reducing per-channel computational cost from $O(k^2)$ to $O(2k)$. PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs. Our code is available at this https URL.
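The cost reduction from separable kernels can be checked with a small pure-Python sketch. It is not PFGNet code: the helper names, the toy image, and the kernel values are made up for illustration. It shows that a 1×k pass followed by a k×1 pass reproduces a dense k×k kernel when that kernel is the outer product of the two 1D kernels (a general k×k kernel is not separable in this way), while using 2k instead of k² multiplications per output value.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation on nested lists, without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def multiplies_per_output(k, separable):
    """Multiplications needed for one output value of one channel."""
    return 2 * k if separable else k * k


k = 5
row = [1, 2, 3, 2, 1]    # hypothetical 1 x k kernel
col = [1, 0, -1, 0, 1]   # hypothetical k x 1 kernel
image = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]

# Separable path: horizontal 1 x k pass, then vertical k x 1 pass.
horizontal = conv2d_valid(image, [row])
separable_out = conv2d_valid(horizontal, [[c] for c in col])

# Dense path: the equivalent k x k kernel is the outer product col * row.
dense = [[c * r for r in row] for c in col]
dense_out = conv2d_valid(image, dense)

print(separable_out == dense_out)
print(multiplies_per_output(k, separable=False),
      multiplies_per_output(k, separable=True))
```

For k = 5 the count drops from 25 to 10 multiplications per output, and the gap widens as the kernel grows, which is why the decomposition matters most for the large kernels the abstract mentions.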

[CV-91] A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata

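[Quick Read]: This paper targets a gap in existing app rating prediction models, which rely mainly on either textual data or user interface (UI) features and overlook the joint use of UI visual information and semantic information. The key to the solution is a lightweight vision-language framework: MobileNetV3 extracts visual features from UI layouts, DistilBERT extracts textual semantic features, a gated fusion module with Swish activations fuses the two modalities, and a multilayer perceptron (MLP) regression head outputs the predicted rating. The method markedly improves prediction accuracy, performs well on several evaluation metrics, and is feasible to deploy on edge devices.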

Link: https://arxiv.org/abs/2602.20531
Authors: Azrin Sultana,Firoz Ahmed
Affiliations: American International University–Bangladesh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 10 figures

Abstract:App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision–language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
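The five reported metrics are related: RMSE is the square root of MSE, and the reported values agree with that (0.1433² ≈ 0.0205). The sketch below computes all five from scratch on made-up numbers; the function name `regression_metrics` and the toy ratings are illustrative assumptions, not the paper's evaluation code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and Pearson correlation as a dict."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    # R2 compares the residual sum of squares with the variance of the targets.
    r2 = 1.0 - (mse * n) / ss_tot
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    pearson = cov / math.sqrt(ss_tot * ss_pred)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "pearson": pearson}


# Toy ratings, invented for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(metrics)
```

The toy case is a constant offset of one star: the Pearson correlation stays at 1 while R2 falls well below 1, which is why the two are worth reporting together.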

[CV-92] How Do Inpainting Artifacts Propagate to Language?

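[Quick Read]: This paper investigates how visual artifacts produced by diffusion-based inpainting affect language generation in vision-language models. The key to the approach is a two-stage diagnostic framework: masked image regions are first reconstructed, and the original and reconstructed images are then each fed to captioning models, which allows a controlled comparison of the generated text. With this setup the authors systematically analyze how reconstruction fidelity (pixel-level and perceptual metrics) relates to lexical accuracy and semantic consistency in downstream captioning. They further show that visual artifacts induce layer-dependent changes in the models' internal representations and attention patterns, yielding a practical diagnostic tool for assessing how visual reconstruction quality influences language generation in multimodal systems.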

Link: https://arxiv.org/abs/2602.20520
Authors: Pratham Yashwante,Davit Abrahamyan,Shresth Grover,Sukruth Rao
Affiliations: UC San Diego
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.

[CV-93] Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models

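[Quick Read]: This paper addresses the lack of trustworthiness of medical image segmentation models in high-stakes clinical settings caused by their black-box nature, noting that existing explainability techniques focus mainly on classification and leave segmentation comparatively underexplored. The key to the solution is a causal inference framework that backpropagates the average treatment effect (ATE) into a quantitative metric, measuring the influence of input regions and network components on target segmentation areas. The method improves the faithfulness of explanations and also reveals marked heterogeneity in perceptual strategies across segmentation models, and even across inputs for the same model, which offers potential insights for optimizing model design.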

Link: https://arxiv.org/abs/2602.20511
Authors: Limai Jiang,Ruitao Xie,Bokai Yang,Huazhen Huang,Juan He,Yufu Huo,Zikai Wang,Yang Wei,Yunpeng Cai
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shenzhen University; Shenzhen Polytechnic University; University of Macau; Southern University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduced an explanation model for segmentation tasks that employs the causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrated that our approach provides more faithful explanations than existing approaches. Furthermore, we carried out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model, suggesting the potential of our method to provide notable insights for optimizing segmentation models. Our code can be found at this https URL.

[CV-94] Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models CVPR2026

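[Quick Read]: This paper asks how a visual system can truly understand object affordance, that is, how to model at the perceptual level the relationship between object structure and interaction. The key to the solution is to propose and validate two complementary perceptual dimensions: geometric perception, which identifies the structural parts of an object that support interaction, and interaction perception, which models how an agent's actions engage with those parts. Systematic probing of Visual Foundation Models (VFMs) shows that models such as DINO inherently encode part-level geometric structure, while generative models such as Flux contain rich, verb-conditioned spatial attention maps that act as implicit interaction priors. Most importantly, the authors fuse the two in a training-free, zero-shot manner and obtain affordance estimation competitive with weakly supervised methods, supporting the view that geometric and interaction perception are the basic building blocks of affordance understanding in VFMs.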

Link: https://arxiv.org/abs/2602.20501
Authors: Qing Zhang,Xuesong Li,Jing Zhang
Affiliations: The Australian National University; CSIRO
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 12 figures, Accepted to CVPR 2026

Abstract:What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent’s actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO’s geometric prototypes with Flux’s interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.

[CV-95] Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining

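[Quick Read]: This paper addresses two core challenges of autonomous camera control in laparoscopic surgery: keeping a stable, safe surgical view during rapid tool-tissue interactions, and making the system's decisions interpretable to the operating surgeon. The key to the solution is a strategy-driven framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into structured attributed event graphs, from which reusable camera-handling strategy primitives are mined and used as supervision for training. Online, a fine-tuned vision-language model (VLM) predicts the dominant strategy and discrete image-based motion commands in real time; an image-based visual servoing (IBVS-RCM) controller executes them under strict safety constraints, and speech input supports human-in-the-loop intervention.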

Link: https://arxiv.org/abs/2602.20500
Authors: Keyu Zhou,Peisen Xu,Yahao Wu,Jiming Chen,Gaofeng Li,Shunlei Li
Affiliations: Hangzhou Dianzi University; Zhejiang University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Robotics (T-RO). 19 pages, 9 figures

Abstract:Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.

[CV-96] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

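[Quick Read]: This paper addresses the high computational cost of Diffusion Transformers (DiTs) in diffusion models, particularly for image and video generation, where conventional feature caching strategies cannot adapt to the complex, stage-dependent dynamics of the diffusion process, which degrades generation quality and breaks consistency with the standard denoising process. The key to the solution is a LEarnable Stage-Aware (LESA) predictor framework with two core innovations: a Kolmogorov-Arnold Network (KAN) that accurately models temporal feature mappings, and a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages for more precise and robust feature forecasting. The method delivers substantial acceleration on several mainstream models (for example, 5.00x on FLUX.1-dev) while maintaining or improving generation quality, which validates its effectiveness and generalization.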

Link: https://arxiv.org/abs/2602.20497
Authors: Peiliang Cai,Jiacheng Liu,Haowen Xu,Xinyu Wang,Chang Zou,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.

[CV-97] Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching CVPR2026

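[Quick Read]: This paper addresses the bottleneck in deploying iterative stereo matching on edge devices: its reliance on Recurrent Neural Networks (RNNs) leads to high computational cost and latency that are hard to reconcile with real-time requirements. The solution rests on three innovations. First, a progressive iteration pruning strategy suppresses redundant update steps and compresses the recursive computation into near-single-pass inference. Second, a collaborative monocular prior transfer framework implicitly embeds depth priors without a dedicated monocular encoder, removing the extra computational burden. Third, a hardware-oriented FlashGRU operator exploits structured sparsity and an I/O-aware design to achieve, at 2K resolution and relative to native ConvGRUs, a 7.28x speedup, a 76.6% reduction in peak memory, and an 80.9% reduction in global memory requests. Together these let PipStereo perform high-fidelity, low-latency real-time stereo matching on edge devices.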

Link: https://arxiv.org/abs/2602.20496
Authors: Jintu Zheng,Qizhe Liu,HuangXin Xu,Zhuojie Chen
Affiliations: ARIDGE XPENG
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 (3D vision track)

Abstract:While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a $7.28\times$ speedup, a 76.6% reduction in peak memory, and an 80.9% reduction in global memory requests over native ConvGRUs under 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes $320 \times 640$ frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on RTX 4090, matching the accuracy of large iterative-based models, and our generalization ability and accuracy far exceed those of existing real-time methods. Our embedded AI projects will be updated at: this https URL.

[CV-98] Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation

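[Quick Read]: This paper addresses visual-semantic alignment in cross-modal few-shot adaptation, where existing Euclidean Flow Matching (FM) methods are constrained by flat geometry, struggle to accommodate diverse feature distributions, and suffer severe path entanglement. The core of the solution is path-decoupled Hyperbolic Flow Matching (HFM), which uses the exponential expansion of the Lorentz manifold to decouple trajectories. First, a centripetal hyperbolic alignment mechanism centered on textual roots pushes visual features toward the boundary to initialize orderly flows. Second, a path-decoupled objective acts as a “semantic guardrail” that, through step-wise supervision, confines trajectories to class-specific geodesic corridors. In addition, an adaptive diameter-based stopping strategy uses the intrinsic semantic scale to prevent over-transportation into the crowded origin.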

Link: https://arxiv.org/abs/2602.20479
Authors: Lin Li,Ziqi Jiang,Gefan Ye,Zhenqi He,Jiahui Li,Jun Xiao,Kwang-Ting Cheng,Long Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold’s exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a “semantic guardrail” rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks have shown that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our code and models will be released.

[CV-99] SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

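[Quick Read]: This paper addresses how text-driven 3D human motion synthesis can satisfy both semantic intent (such as “walk to the couch”) and physical feasibility (such as avoiding collisions). Existing methods typically depend on computationally expensive 3D scene data (such as point clouds or voxel occupancy grids) and use generative frameworks that jointly learn high-level planning and low-level contact reasoning, which is inefficient. The key to the solution is the SceMoS framework, which decouples global planning from local execution through structured 2D scene representations. A text-conditioned autoregressive global motion planner operates on a bird's-eye-view (BEV) image encoded with DINOv2 features to capture spatial layout and affordance, and a geometry-grounded motion tokenizer trained on 2D local heightmaps embeds surface physics into a discrete vocabulary, achieving fine-grained physical adherence without full 3D volumetric reasoning. The design preserves motion realism and contact accuracy while cutting the parameter count by more than 50%, showing that 2D scene cues are effective for 3D human-scene interaction.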

Link: https://arxiv.org/abs/2602.20476
Authors: Anindita Ghosh,Vladislav Golyanik,Taku Komura,Philipp Slusallek,Christian Theobalt,Rishabh Dabral
Affiliations: DFKI; MPI for Informatics; Saarland Informatics Campus; University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures, 4 tables

Abstract:Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (“walk to the couch”) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues, relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird’s-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer, trained via a conditional VQ-VAE, that uses a 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.

[CV-100] gQIR: Generative Quanta Image Reconstruction CVPR2026

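[Quick Read]: This paper addresses high-quality imaging from only a few detected photons in extremely low light, specifically recovering structurally clear, color-accurate images from the sparse, very noisy, binary quanta frames captured by single-photon avalanche diode (SPAD) sensors. The core of the solution is to adapt large text-to-image latent diffusion models to the photon-limited domain by introducing mechanisms for Bernoulli photon statistics and combining latent-space restoration with burst-level spatio-temporal reasoning. Even under high-speed motion, the resulting reconstructions clearly outperform classical methods and existing learning-based baselines in both photometric fidelity and perceptual quality.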

Link: https://arxiv.org/abs/2602.20417
Authors: Aryan Garg,Sizhuo Ma,Mohit Gupta
Affiliations: University of Wisconsin-Madison; Snap Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at this https URL.

[CV-101] SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images CVPR2026

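[Quick Read]: This paper addresses the poor generalization of current AI-generated image detectors under cross-generator distribution shift, in particular the severe degradation that existing methods, which overfit their training data, show on carefully curated hard test sets. The key to the solution is SimLBR, a simple framework built on Latent Blending Regularization (LBR), which learns a tight decision boundary around the real image distribution and treats the fake category as a sink class, improving robustness and generalization to unseen generators. On the Chameleon benchmark the method improves accuracy by up to +24.85% and recall by up to +69.62%, and it trains far faster than existing approaches.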

Link: https://arxiv.org/abs/2602.20412
Authors: Aayush Dhakal,Subash Khanal,Srikumar Sastry,Jacob Arndt,Philipe Ambrozio Dias,Dalton Lunga,Nathan Jacobs
Affiliations: Washington University in St. Louis; Oak Ridge National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.

[CV-102] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation CVPR2026

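[Quick Read]: This paper addresses the performance drop of 3D point cloud models under domain shift, especially the weak adaptation from synthetic data to real-world point clouds. Existing methods usually rely on encoders with many trainable parameters, which improves accuracy at the expense of efficiency. The key to the solution is CLIPoint3D, the first CLIP-based framework for few-shot unsupervised 3D point cloud domain adaptation. It projects 3D samples into multi-view depth maps and uses a frozen CLIP backbone, refined by knowledge-driven prompt tuning that combines high-level language priors with geometric cues from a lightweight 3D encoder. It also applies parameter-efficient fine-tuning to CLIP's encoders, uses an entropy-guided view sampling mechanism to select confident projections, and introduces an optimal-transport-based alignment loss together with an uncertainty-aware prototype alignment loss, which jointly narrow the source-target distribution gap while preserving class separability. On the PointDA-10 and GraspNetPC-10 benchmarks it outperforms both CLIP-based and conventional encoder-based methods, with accuracy gains of 3–16%.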

Link: https://arxiv.org/abs/2602.20409
Authors: Mainak Singha,Sarthak Mehrotra,Paolo Casari,Subhasis Chaudhuri,Elisa Ricci,Biplab Banerjee
Affiliations: University of Trento, Italy; MDSR Labs Adobe, India; CNIT, Italy; IIT Bombay, India; Fondazione Bruno Kessler, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR 2026

Abstract:Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP’s encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at this https URL.
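One plausible reading of "entropy-guided view sampling" is to keep the projected views whose class predictions have the lowest Shannon entropy. The sketch below illustrates only that reading; the function names, the fixed `keep` budget, and the probability values are assumptions, and the paper's actual selection rule may differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_views(view_probs, keep):
    """Return the indices of the `keep` lowest-entropy views, in view order."""
    ranked = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return sorted(ranked[:keep])


# Hypothetical class probabilities for four depth-map views of one object.
views = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: least confident
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked: most confident
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
print(select_confident_views(views, keep=2))
```

Low entropy means the prediction mass is concentrated on few classes, so views that are occluded or ambiguous tend to be filtered out before adaptation.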

[CV-103] Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field

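[Quick Read]: This paper addresses the limits of existing aesthetic viewpoint suggestion methods in scene understanding and efficiency: single-view methods cannot model scene geometry, while 3D exploration methods depend on dense captures or prebuilt 3D environments combined with costly reinforcement learning (RL) search. The key to the solution is the notion of a 3D aesthetic field: a feedforward 3D Gaussian Splatting network distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, so that aesthetic scores for novel viewpoints can be predicted from sparse input views alone. Building on this field, a two-stage search pipeline combines coarse sampling with gradient-based refinement to find aesthetically appealing viewpoints efficiently, without dense captures or RL exploration.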

Link: https://arxiv.org/abs/2602.20363
Authors: Sheyang Tang,Armin Shafiee Sarvestani,Jialu Xu,Xiaoyu Xu,Zhou Wang
Affiliations: University of Waterloo; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures

Abstract:The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.

[CV-104] Momentum Guidance: Plug-and-Play Guidance for Flow Models

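[Quick Read]: This paper addresses two issues in flow-based generative models: conditional samples that lack detail and appear diffuse because of the smoothing effect of neural networks, and existing guidance techniques such as Classifier-Free Guidance (CFG) that improve fidelity but raise inference cost and reduce diversity. The key to the solution is a new guidance mechanism, Momentum Guidance (MG), which uses velocity information from the ordinary differential equation (ODE) trajectory itself. MG extrapolates the current velocity with an exponential moving average, keeps the cost at a single forward evaluation per step, matches or exceeds the effect of standard guidance, and can be combined with CFG for further quality gains.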

Link: https://arxiv.org/abs/2602.20360
Authors: Runlong Liao,Jian Yu,Baiyu Su,Chi Zhang,Lizhang Chen,Qiang Liu
Affiliations: University of Texas at Austin
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG’s effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models like Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
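The idea of extrapolating the current velocity with an exponential moving average of past velocities can be sketched with a scalar Euler sampler. The update rule below, `v + weight * (v - ema)`, is an assumed form chosen for illustration; the abstract does not give the formula, and the names `weight`, `decay`, and the toy velocity field are hypothetical. What the sketch does show is the cost property the abstract states: one velocity evaluation per step.

```python
def sample_with_momentum_guidance(velocity_fn, x0, num_steps, weight=0.5, decay=0.9):
    """Euler integration from t=0 to t=1 with an EMA-based extrapolation.

    Assumed update (not taken from the paper): guided = v + weight * (v - ema),
    where ema tracks the velocities seen on earlier steps.
    """
    x = x0
    dt = 1.0 / num_steps
    ema = None
    evaluations = 0
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)          # the only model call in this step
        evaluations += 1
        if ema is None:
            guided = v                 # no history yet on the first step
            ema = v
        else:
            guided = v + weight * (v - ema)
            ema = decay * ema + (1.0 - decay) * v
        x = x + dt * guided
    return x, evaluations


# Toy velocity field standing in for a trained flow model.
final_x, calls = sample_with_momentum_guidance(lambda x, t: -x, 1.0, num_steps=10)
print(final_x, calls)
```

Setting `weight=0.0` recovers plain Euler sampling, and a constant velocity field leaves the trajectory unchanged because the current velocity then equals its running average. By contrast, CFG needs two model calls per step, which is the cost the abstract says MG avoids.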

[CV-105] 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

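[Quick Read]: This paper addresses the lack of an automated, unified, and reliable way to evaluate the realism of generated video, in particular the limitations of existing evaluation that depends on human annotation or on specific reference videos. The key to the solution is an automated evaluation framework named 3DSPA, whose core is a 3D spatiotemporal point autoencoder that fuses 3D point trajectories, depth cues, and DINO semantic features into a unified video representation, enabling robust assessment of realism, temporal consistency, and physical plausibility without a reference video.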

Link: https://arxiv.org/abs/2602.20354
Authors: Bhavik Chandna,Kelsey R. Allen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process – requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at this https URL.

[CV-106] BiRQA: Bidirectional Robust Quality Assessment for Images

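[Quick Read]: This paper addresses two weaknesses of current neural metrics for full-reference image quality assessment (FR IQA): low computational efficiency and vulnerability to adversarial attacks. The key to the solution is the BiRQA model, whose main innovations are: 1) a bidirectional multiscale pyramid that fuses four fast complementary features, injecting fine-scale information into coarse levels through a bottom-up uncertainty-aware gating module and routing semantic context back to high resolution through a top-down cross-gating block; 2) Anchored Adversarial Training, a theoretically grounded strategy that combines clean "anchor" samples with a ranking loss to bound the pointwise prediction error under attack. The method matches or exceeds the state of the art on five public FR IQA benchmarks while running about 3x faster, and it is markedly more robust under unseen white-box attacks (for example, SROCC on KADID-10k rises from 0.30-0.57 to 0.60-0.84).

Link: https://arxiv.org/abs/2602.20351
Authors: Aleksandr Gushchin, Dmitriy S. Vatolin, Anastasia Antsiferova
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: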

Abstract:Full-Reference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean “anchor” samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks BiRQA outperforms or matches the previous state of the art (SOTA) while running ~3x faster than previous SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.
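
The robustness figures above are reported as SROCC, the Spearman rank-order correlation between predicted quality scores and human opinion scores. As a reference for readers, here is a small standard-library sketch of that metric. It is not code from the paper; the function names are illustrative, and ties are handled with average ranks, which is the usual convention.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


def srocc(predicted, subjective):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(subjective))
```

Because only ranks matter, any monotonic rescaling of the predicted scores leaves SROCC unchanged, which is why it is a common choice for comparing quality metrics with different output ranges.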

[CV-107] Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques

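[Quick Read]: This paper addresses how to achieve low-latency, high-fidelity 3D reconstruction and visualization for real-time aerial perception with unmanned aerial vehicles (UAVs). Integrating 3D Gaussian Splatting (3DGS) into end-to-end UAV reconstruction systems remains underexplored, and conventional NeRF-based methods suffer from slow rendering and high end-to-end latency. The key to the solution is an end-to-end pipeline that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization. It supports continuous model updates and low-latency deployment in interactive visualization environments, yielding significantly higher rendering performance while keeping reconstruction quality within 4-7% of high-fidelity offline references, which suits immersive augmented reality (AR) and virtual reality (VR) scenarios.

Link: https://arxiv.org/abs/2602.20342
Authors: Christos Maikos, Georgios Angelidis, Georgios Th. Papadopoulos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: 7 pages, 2 figures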

Abstract:In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. Our goal is to propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that supports immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity, while delivering significantly higher rendering performance and substantially reduced end-to-end latency, compared to NeRF-based approaches. Reconstruction quality remains within 4-7% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.

[CV-108] Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking CVPR2026

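[Quick Read]: This paper addresses the lack of interpretability in vision-language models (VLMs), whose internal decision process is a "black box" that makes multimodal reasoning hard to analyze systematically. The key to the solution is the first framework for transparent circuit tracing in VLMs, which uses transcoders, attribution graphs, and attention-based methods to reveal how VLMs hierarchically integrate visual and semantic concepts. The authors further find that distinct visual feature circuits handle mathematical reasoning and support cross-modal associations, and they show through feature steering and circuit patching that these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

Link: https://arxiv.org/abs/2602.20330
Authors: Jingcheng Yang, Tianhu Xiong, Shengyi Qian, Klara Nahrstedt, Mingyuan Wu
Affiliations: University of Illinois Urbana-Champaign; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: To appear in the Findings of CVPR 2026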

Abstract:Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

[CV-109] GSNR: Graph Smooth Null-Space Representation for Inverse Problems

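[Quick Read]: This paper addresses the non-uniqueness of solutions in imaging inverse problems caused by the non-trivial null-space of the sensing matrix: conventional image priors (such as sparsity, smoothness, or score functions) do not constrain the null-space component and can therefore bias the reconstruction. The key to the solution is Graph-Smooth Null-Space Representation (GSNR), which constructs a null-restricted Laplacian, a graph Laplacian acting only on the null-space, to encode similarity between neighboring pixels in the null-space signal, and designs a low-dimensional projection matrix from the $p$ smoothest spectral graph modes. This models null-space structure while preserving measurement consistency. The method improves convergence, coverage, and predictability, and yields consistent gains in image deblurring, compressed sensing, demosaicing, and super-resolution.

Link: https://arxiv.org/abs/2602.20328
Authors: Romario Gualdrón-Hurtado, Roman Jacome, Rafael S. Suarez, Henry Arguello
Affiliations: Universidad Industrial de Santander
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optimization and Control (math.OC)
Notes: 23 pages, 24 figures, Accepted to The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026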

Abstract:Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$ smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null-space variance is captured by $p$ modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.
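
The construction described in this abstract can be written out in symbols. The following is one plausible reading and not the paper's stated formulas: the symbols $A$ (sensing matrix), $A^{\dagger}$ (its pseudoinverse), $L$ (graph Laplacian), $P_{N}$, $L_{N}$, $U_{p}$ and $c$ are all assumed notation introduced here for illustration.

```latex
% Assumed notation (not taken from the paper):
%   A        sensing matrix,  A^\dagger its Moore-Penrose pseudoinverse
%   L        graph Laplacian over pixels
%   P_N      orthogonal projector onto null(A)
\begin{align}
  x &= \underbrace{A^{\dagger} A\, x}_{\text{visible component}}
     + \underbrace{(I - A^{\dagger} A)\, x}_{\text{null-space component } x_{N}}, \\
  P_{N} &= I - A^{\dagger} A, \qquad L_{N} = P_{N}\, L\, P_{N}, \\
  x_{N} &\approx U_{p}\, c, \qquad U_{p} = [\,u_{1}, \dots, u_{p}\,], \qquad
  L_{N} u_{i} = \lambda_{i} u_{i}, \quad u_{i} \in \operatorname{null}(A), \quad
  \lambda_{1} \le \dots \le \lambda_{p}.
\end{align}
```

Under this reading, the measurements fix the visible component, while the $p$ lowest-frequency eigenvectors of $L_{N}$ give a low-dimensional basis for the part the sensor cannot see.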

[CV-110] N4MC: Neural 4D Mesh Compression

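[Quick Read]: This paper addresses efficient compression of time-varying mesh sequences, in particular the limitation that prior neural mesh compression methods ignore inter-frame redundancy and cannot exploit temporal correlation. The key to the solution is N4MC, the first 4D neural compression framework. It converts consecutive irregular mesh frames into regular 4D tensors for a uniform, compact representation, and condenses them with an auto-decoder that captures both spatial and temporal correlations to remove redundancy. It also introduces a transformer-based interpolation model that predicts intermediate frames from latent embeddings of tracked volume centers, improving temporal coherence and eliminating motion ambiguities. The method outperforms the state of the art in rate-distortion performance and supports real-time decoding of 4D mesh sequences.

Link: https://arxiv.org/abs/2602.20312
Authors: Guodong Chen, Huanshuo Dong, Mallesham Dasari
Affiliations: Northeastern University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: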

Abstract:We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms state-of-the-art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: this https URL.

[CV-111] De-rendering Reasoning and Repairing Charts with Vision-Language Models

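[Quick Read]: This paper addresses the misreading or misleading of audiences caused by design flaws in data visualizations used in scientific communication, journalism, and everyday decision-making. Existing rule-based visualization linters cannot take context into account, and general-purpose large language models (LLMs) are not trained to follow visualization design principles, so their feedback is unreliable and inconsistent. The key to the solution is a framework that combines chart de-rendering, automated analysis, and iterative improvement: the system first reconstructs a chart's structure from an image, then identifies design flaws through vision-language reasoning, and proposes concrete, actionable modifications grounded in established principles from visualization research. Users can selectively apply the suggestions and re-render the chart, forming a closed feedback loop that improves visualization quality and builds visualization literacy.

Link: https://arxiv.org/abs/2602.20291
Authors: Valentin Bonas, Martin Sinnona, Viviana Siless, Emmanuel Iarussi
Affiliations: Universidad Torcuato Di Tella; Consejo Nacional de Investigaciones Científicas y Técnicas; Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: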

Abstract:Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.

[CV-112] UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

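[Quick Read]: This paper addresses the lack of explicit 3D geometric structure in latent action representations learned from RGB images, which limits precise, contact-rich robotic manipulation. The key to the solution is UniLACT, a model that uses depth-aware latent action pretraining so that vision-language-action (VLA) models inherit stronger spatial priors. Specifically, the authors design the UniLARN framework, which uses inverse and forward dynamics objectives to learn a shared embedding space for the RGB and depth modalities while explicitly modeling their cross-modal interactions. It produces modality-specific and unified latent action representations that serve as pseudo-labels for depth-aware pretraining, which significantly improves the generalization of downstream policies in both simulation and real-world settings.

Link: https://arxiv.org/abs/2602.20231
Authors: Manish Kumar Govind, Dominick Reilly, Pu Wang, Srijan Das
Affiliations: University of North Carolina at Charlotte
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: this https URL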

Abstract:Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.

[CV-113] OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport CVPR2026

[Quick Read]: This paper addresses the high inference cost of multi-modal large language models (MLLMs) caused by redundant visual tokens. Existing visual token pruning methods overlook the distributional structure of visual representations and struggle to prune efficiently while staying semantically faithful. The key to the solution is OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT): it minimizes the 2-Wasserstein distance between the full and pruned token distributions, preserving both local diversity and global representativeness. The authors further derive a tractable submodular objective and prove its monotonicity and submodularity, giving a principled foundation for stable and efficient pruning.

Link: https://arxiv.org/abs/2602.20205
Authors: Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang, Peijie Qiu, Qingquan Song, Zhipeng Wang, Shao Tang, Yalin Wang, Abolfazl Razi
Affiliations: Morgan Stanley; Clemson University; Arizona State University; Washington University in St. Louis; Texas A&M University; Rice University; Florida State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by CVPR2026 (Findings). arXiv admin note: text overlap with arXiv:2503.02175 by other authors

Abstract:Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at this https URL.
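
Monotone submodular objectives matter in practice because a simple greedy loop comes with a well-known approximation guarantee. The sketch below shows that greedy pattern on a stand-in objective. It is an assumption-laden illustration: the abstract does not state OTPrune's objective, so a facility-location coverage function over clipped cosine similarities is swapped in here, and `greedy_select` and its arguments are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_select(tokens, budget):
    """Greedily pick `budget` token indices maximizing facility-location coverage.

    Stand-in objective (NOT the paper's): f(S) = sum_i max_{j in S} sim(i, j),
    with similarities clipped to be non-negative so f is monotone submodular.
    """
    n = len(tokens)
    sim = [[max(0.0, cosine(tokens[i], tokens[j])) for j in range(n)]
           for i in range(n)]
    best_cover = [0.0] * n          # how well each token is covered so far
    selected = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected
```

On four tokens forming two tight clusters, a budget of two keeps one representative from each cluster, which is the "global representativeness" behaviour a pruning objective is meant to encourage.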

[CV-114] Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

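[Quick Read]: This paper addresses the performance bottleneck that the action generation process imposes on hierarchical vision-language-action (VLA) models in robotic manipulation, namely: (i) low inference efficiency, caused by a pronounced distributional gap between isotropic noise priors and target action distributions, which increases the number of denoising steps and the incidence of infeasible samples; and (ii) poor robustness, because existing policies condition only on the current observation, ignore the constraints of the action history, and so lack awareness of task progress and temporal consistency. The key to the solution is the OptimusVLA framework and its dual-memory mechanism. Global Prior Memory (GPM) replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, shortening the generative path and reducing the number of function evaluations (NFE). Local Consistency Memory (LCM) dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory.

Link: https://arxiv.org/abs/2602.20200
Authors: Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen; PengCheng Laboratory; Shenzhen Loop Area Institute; Huawei Noah’s Ark Lab
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 17 pages, 8 figures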

Abstract:Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of the trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

[CV-115] VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography

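[Quick Read]: This paper addresses the time and resource cost of current arrhythmia localization, which relies on high-density mapping techniques and preoperative CT/MRI. The key to the solution is a deep-learning-based AI framework that uses intracardiac echocardiography (ICE) video, which is routinely acquired during electrophysiology procedures. Arrhythmia source localization is formulated as a three-class problem (normal sinus rhythm, left-sided origin, and right-sided origin) and solved with a trained 3D convolutional neural network. In cross-validation the method reaches a mean accuracy of 66.2%, well above the 33.3% random baseline, demonstrating the feasibility and clinical promise of combining ICE with deep learning to guide clinicians quickly and accurately toward arrhythmia targets.

Link: https://arxiv.org/abs/2602.20165
Authors: Dorsa EPMoghaddam, Feng Gao, Drew Bernard, Kavya Sinha, Mehdi Razavi, Behnaam Aazhang
Affiliations: Rice University; Stanford University; University of Texas at Austin; MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 8 pages, 3 figures, 3 tables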

Abstract:Contemporary high-density mapping techniques and preoperative CT/MRI remain time and resource intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.

[CV-116] Interpretable Medical Image Classification using Prototype Learning and Privileged Information MICCAI2023

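[Quick Read]: This paper addresses the tension between model interpretability and high predictive performance in medical imaging. Deep learning methods perform strongly, but their "black box" nature limits trustworthy use in clinical settings. The key to the solution is a new model, Proto-Caps, which combines capsule networks, prototype learning, and privileged information, bringing in additional information during training to improve both interpretability and accuracy. On the LIDC-IDRI dataset the method achieves 93.0% accuracy in predicting malignancy, more than 6% above the explainable baseline model, while its prototype representations provide case-based reasoning and allow visual validation of radiologist-defined attributes.

Link: https://arxiv.org/abs/2310.15741
Authors: Luisa Gallee, Meinrad Beer, Michael Goetz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: MICCAI 2023 Medical Image Computing and Computer Assisted Intervention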

Abstract:Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the training process can be used to create an understandable and powerful model. We propose an innovative solution called Proto-Caps that leverages the benefits of capsule networks, prototype learning and the use of privileged information. Evaluating the proposed solution on the LIDC-IDRI dataset shows that it combines increased interpretability with above state-of-the-art prediction performance. Compared to the explainable baseline model, our method achieves more than 6 % higher accuracy in predicting both malignancy (93.0 %) and mean characteristic features of lung nodules. Simultaneously, the model provides case-based reasoning with prototype representations that allow visual validation of radiologist-defined attributes.

[CV-117] Progressive Per-Branch Depth Optimization for DEFOM-Stereo and SAM3 Joint Analysis in UAV Forestry Applications

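[Quick Read]: This paper addresses the insufficient accuracy of UAV-based per-branch 3D tree reconstruction in complex forest canopies, in particular the fact that dense disparity maps from modern stereo matchers are too noisy for individual branch analysis. The key to the solution is a progressive pipeline that integrates DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization, and that systematically identifies and corrects three families of error. Mask boundary contamination is handled by morphological erosion and a skeleton-preserving variant. Segmentation inaccuracy is mitigated by LAB-space Mahalanobis color validation and cross-branch overlap arbitration. Depth noise is reduced by a five-stage scheme (MAD global detection, spatial density consensus, local MAD filtering, RGB-guided filtering, and adaptive bilateral filtering). The average per-branch depth standard deviation falls by 82% while edge fidelity is retained, producing geometrically coherent point clouds suitable for positioning autonomous pruning tools.

Link: https://arxiv.org/abs/2602.20539
Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: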

Abstract:Accurate per-branch 3D reconstruction is a prerequisite for autonomous UAV-based tree pruning; however, dense disparity maps from modern stereo matchers often remain too noisy for individual branch analysis in complex forest canopies. This paper introduces a progressive pipeline integrating DEFOM-Stereo foundation-model disparity estimation, SAM3 instance segmentation, and multi-stage depth optimization to deliver robust per-branch point clouds. Starting from a naive baseline, we systematically identify and resolve three error families through successive refinements. Mask boundary contamination is first addressed through morphological erosion and subsequently refined via a skeleton-preserving variant to safeguard thin-branch topology. Segmentation inaccuracy is then mitigated using LAB-space Mahalanobis color validation coupled with cross-branch overlap arbitration. Finally, depth noise - the most persistent error source - is initially reduced by outlier removal and median filtering, before being superseded by a robust five-stage scheme comprising MAD global detection, spatial density consensus, local MAD filtering, RGB-guided filtering, and adaptive bilateral filtering. Evaluated on 1920x1080 stereo imagery of Radiata pine (Pinus radiata) acquired with a ZED Mini camera (63 mm baseline) from a UAV in Canterbury, New Zealand, the proposed pipeline reduces the average per-branch depth standard deviation by 82% while retaining edge fidelity. The result is geometrically coherent 3D point clouds suitable for autonomous pruning tool positioning. All code and processed data are publicly released to facilitate further UAV forestry research.
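
The first stage of the five-stage scheme, MAD global detection, is a standard robust outlier test and can be sketched in a few lines. This is a generic illustration rather than the authors' released code: the function name, the threshold of 3.5, and the example depths are assumptions. The constant 1.4826 is the usual factor that makes the median absolute deviation comparable to a standard deviation for Gaussian data.

```python
from statistics import median


def mad_inliers(depths, threshold=3.5):
    """Keep depth samples whose robust z-score is within `threshold`.

    `threshold` is an illustrative default (assumption), not a value
    taken from the paper.
    """
    med = median(depths)
    abs_dev = [abs(d - med) for d in depths]
    mad = median(abs_dev)
    if mad == 0:
        # More than half the samples are identical: nothing to scale by.
        return list(depths)
    scale = 1.4826 * mad            # MAD rescaled to a std-dev-like unit
    return [d for d, dev in zip(depths, abs_dev) if dev / scale <= threshold]


# Example: per-branch depth samples in metres with one stray background pixel.
branch_depths = [2.0, 2.1, 1.9, 2.05, 1.95, 9.0]
cleaned = mad_inliers(branch_depths)
```

Unlike a mean and standard deviation test, the median-based statistics are barely moved by the stray 9.0 m sample, so the outlier cannot mask itself.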

[CV-118] Inspectorch: Efficient rare event exploration in solar observations

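[Quick Read]: This paper addresses the difficulty conventional machine learning methods have in identifying rare events in solar observations, since such methods tend to capture general trends and overlook low-frequency anomalies. The key to the solution is Inspectorch, an unsupervised probabilistic framework based on flow-based models. It efficiently models the complex distribution of multidimensional solar observations and assigns a probability to each sample, which makes it possible to pinpoint events that deviate from the bulk of the distribution, such as strong Doppler shifts, uncommon broadening, and small-scale magnetic reconnection dynamics. The resulting probabilistic anomaly scores let computational resources be focused on the most informative and physically relevant extreme phenomena.

Link: https://arxiv.org/abs/2602.20316
Authors: C. J. Díaz Baso, I. J. Soler Poquet, C. Kuckein, M. van Noort, N. Poirier
Affiliations: Unknown
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV)
Notes: 12+1 pages, 11+2 figures, submitted to AA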

Abstract:The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensional solar observations and optimize our computational resources to the study of these extreme phenomena. We introduce Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations. Once optimized, it assigns a probability to each sample, allowing us to identify unusual events. We apply this approach to observations from the Hinode Spectro-Polarimeter, the Interface Region Imaging Spectrograph, the Microlensed Hyperspectral Imager at the Swedish 1-m Solar Telescope, the Atmospheric Imaging Assembly on board the Solar Dynamics Observatory and the Extreme Ultraviolet Imager on board Solar Orbiter. We find that the algorithm assigns consistently lower probabilities to spectra that exhibit unusual features. For example, it identifies profiles with very strong Doppler shifts, uncommon broadening, and temporal dynamics associated with small-scale reconnection events, among others. As a result, Inspectorch demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets. The resulting probabilistic anomaly scores allow computational resources to be focused on the most informative and physically relevant events. We make our Python package publicly available at this https URL.
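
The workflow in this abstract (fit a density model, score every sample, inspect the least probable ones) can be illustrated without a normalizing flow. The sketch below deliberately swaps in a one-dimensional Gaussian as the density estimator, which is far simpler than the flow-based models the paper uses, and it does not use or imitate the Inspectorch package API. All function names and the sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(samples):
    """Stand-in density model: a 1-D Gaussian (the paper uses flows instead)."""
    return mean(samples), pstdev(samples)


def log_prob(x, mu, sigma):
    """Log-density of x under N(mu, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def rarest(samples, k):
    """Return the k samples with the lowest log-probability under the fit."""
    mu, sigma = fit_gaussian(samples)
    return sorted(samples, key=lambda x: log_prob(x, mu, sigma))[:k]


# Example: Doppler-shift-like values with one extreme measurement.
shifts = [0.1, -0.2, 0.0, 0.3, -0.1, 5.0]
most_unusual = rarest(shifts, 1)
```

The ranking step is the part that carries over: once any density estimator returns a log-probability per sample, sorting by it yields the candidates worth closer analysis.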

Artificial Intelligence

[AI-0] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

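[Quick Read]: This paper addresses the trade-off commonly observed when optimizing the multi-sample pass rate (pass@k) of generative AI models: pass@k improves while the single-sample pass rate (pass@1) degrades. The trade-off matters in latency- and cost-sensitive deployments, where pass@1 is usually a hard operational constraint. The key contribution is a theoretical analysis of its origin: gradient conflict induced by prompt interference during pass@k policy optimization. Specifically, pass@k optimization implicitly reweights prompts toward low-success prompts; when those prompts are negatively interfering, upweighting them rotates the pass@k update direction away from the pass@1 direction and reduces single-sample performance. The paper supports this mechanism with theoretical modeling and large language model experiments, and provides a basis for designing more balanced inference-aware fine-tuning methods.

Link: https://arxiv.org/abs/2602.21189
Authors: Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: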

Abstract:Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It defines success if any of k independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@k. However, prior work reports a recurring trade-off: pass@k improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@k policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@k policy gradients can conflict with pass@1 gradients because pass@k optimization implicitly reweights prompts toward low-success prompts; when these prompts are what we term negatively interfering, their upweighting can rotate the pass@k update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.
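
The implicit reweighting mentioned in this abstract follows from a short calculation. If a prompt has per-sample success probability p and the k samples are independent, then pass@k = 1 - (1 - p)^k, and its derivative with respect to p is k(1 - p)^(k-1). For k = 1 that factor is 1 for every prompt; for k > 1 it shrinks as p grows, so low-success prompts dominate the update. The sketch below simply evaluates these two expressions; the function names are illustrative and the code is not taken from the paper.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_weight(p, k):
    """d pass@k / dp: the factor multiplying a prompt's pass@1 gradient."""
    return k * (1.0 - p) ** (k - 1)


# Compare how strongly a hard prompt (p = 0.1) and an easy prompt (p = 0.9)
# are weighted under pass@1 and under pass@8.
for k in (1, 8):
    hard, easy = pass_at_k_weight(0.1, k), pass_at_k_weight(0.9, k)
    print(f"k={k}: weight(hard)={hard:.3g}, weight(easy)={easy:.3g}")
```

With k = 8 the hard prompt's weight is several orders of magnitude larger than the easy prompt's, which makes concrete why the pass@k update direction can drift away from the pass@1 direction when hard prompts interfere negatively.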

[AI-1] Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids

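[Quick Read]: This paper addresses the difficulty of achieving both efficiency and optimality in path planning for large-scale, high-resolution environments. Conventional search-based methods such as A* scale poorly on high-resolution maps, while sampling and trajectory optimization methods do not exploit the explicit connectivity information in multi-resolution volumetric maps. The key to the solution is to combine the optimality and completeness of any-angle planning with multi-resolution representations, which overcomes the computational intractability of search-based methods. Experiments on real and synthetic environments show strong solution quality and speed, outperforming even sampling-based methods.

Link: https://arxiv.org/abs/2602.21174
Authors: Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 12 pages, 9 figures, 4 tables, accepted to RSS 2025, code is open-source: this https URL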

Abstract:Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning methods such as sampling and trajectory optimization do not exploit this explicit connectivity information, and search-based methods such as A* suffer from scalability issues in large-scale high-resolution maps. In many applications, Euclidean shortest paths form the underpinning of the navigation system. For such applications, any-angle planning methods, which find optimal paths by connecting corners of obstacles with straight-line segments, provide a simple and efficient solution. In this paper, we present a method that has the optimality and completeness properties of any-angle planners while overcoming computational tractability issues common to search-based methods by exploiting multi-resolution representations. Extensive experiments on real and synthetic environments demonstrate the proposed approach’s solution quality and speed, outperforming even sampling-based methods. The framework is open-sourced to allow the robotics and planning community to build on our research.

[AI-2] CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning ICASSP2026

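[Quick Read]: This paper addresses two core problems in the joint analysis of multimodal electrocardiograms (ECGs) and clinical reports. The first is intra-modality: existing models process ECG signals in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads and limiting their ability to model fine-grained diagnostic patterns. The second is inter-modality: directly aligning ECG signals with free-text clinical reports introduces modality-specific biases. The key to the solution is the CG-DMER framework, which achieves disentangled multimodal representation learning through two designs: (1) spatial-temporal masked modeling, which applies masks across the spatial and temporal dimensions and reconstructs the missing information to capture finer temporal dynamics and inter-lead spatial dependencies; and (2) a representation disentanglement and alignment strategy, which introduces modality-specific and modality-shared encoders to separate modality-invariant from modality-specific representations, reducing noise and modality bias.

Link: https://arxiv.org/abs/2602.21154
Authors: Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin, Yuxin Liu, Yueming Jin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted by ICASSP 2026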

Abstract:Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.

[AI-3] Cooperative-Competitive Team Play of Real-World Craft Robots ICRA2026

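[Quick read]: This paper tackles the challenges of efficiently training multi-agent deep reinforcement learning (MADRL) on real robot systems and of transferring the learned policies from simulation to reality (sim-to-real transfer). The core problems are how to train cooperative and competitive policies efficiently within a distributed learning framework, and how to mitigate the drop in policy performance when moving from simulation to the real world. The key to the solution is a new method called Out of Distribution State Initialization (OODSI), which initializes training from out-of-distribution states to reduce the impact of the sim-to-real gap and improve generalization in real-world settings. In the experiments, OODSI improves Sim2Real performance by 20%, and the approach is validated on a multi-robot car competitive game and a cooperative task.

Link: https://arxiv.org/abs/2602.21119
Authors: Rui Zhao,Xihui Li,Yizheng Zhang,Yuzhen Liu,Zhong Zhang,Yufeng Zhang,Cheng Zhou,Zhengyou Zhang,Lei Han
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), Vienna, Austria

Click to view abstract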

Abstract:Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi-agent sim-to-real transfer, we introduce Out of Distribution State Initialization (OODSI) to mitigate the impact of the sim-to-real gap. In the experiments, OODSI improves the Sim2Real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi-robot car competitive game and a cooperative task in real-world settings.

[AI-4] Probing Graph Neural Network Activation Patterns Through Graph Topology

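[Quick read]: This paper addresses the unclear relationship between graph topology and the learned preferences of Graph Neural Networks (GNNs), in particular how the distribution of graph curvature relates to message-passing artifacts such as oversmoothing and oversquashing. The key to the approach is to use Massive Activations (MAs), i.e., extreme edge activation values, as a diagnostic tool for probing the correspondence between graph curvature and the model's attention mechanism. The study finds that, although curvature is theoretically linked to information flow, MAs do not concentrate on curvature extremes. On the Long Range Graph Benchmark, global attention mechanisms are observed to exacerbate topological bottlenecks and markedly increase the prevalence of negative curvature, which reframes curvature as a probe for diagnosing when and why graph learning fails.

Link: https://arxiv.org/abs/2602.21092
Authors: Floriano Tori,Lorenzo Bini,Marco Sorbi,Stéphane Marchand-Maillet,Vincent Ginis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract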

Abstract:Curvature notions on graphs provide a theoretical description of graph topology, highlighting bottlenecks and denser connected regions. Artifacts of the message passing paradigm in Graph Neural Networks, such as oversmoothing and oversquashing, have been attributed to these regions. However, it remains unclear how the topology of a graph interacts with the learned preferences of GNNs. Through Massive Activations, which correspond to extreme edge activation values in Graph Transformers, we probe this correspondence. Our findings on synthetic graphs and molecular benchmarks reveal that MAs do not preferentially concentrate on curvature extremes, despite their theoretical link to information flow. On the Long Range Graph Benchmark, we identify a systemic *curvature shift*: global attention mechanisms exacerbate topological bottlenecks, drastically increasing the prevalence of negative curvature. Our work reframes curvature as a diagnostic probe for understanding when and why graph learning fails.

[AI-5] Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning

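[Quick read]: This paper addresses the degradation in policy transfer caused by a mismatch in transition dynamics between source and target domains in offline reinforcement learning (offline RL), focusing on how to use source data efficiently when only a small amount of target data and a large amount of source data are available. Existing methods typically filter data globally or sample by sample, which can miss cross-domain similarities within local regions of the state space and can incur high computational cost or information loss. The key to the solution is Localized Dynamics-Aware Domain Adaptation (LoDADA): it clusters transitions from the source and target datasets and estimates the dynamics discrepancy of each cluster via domain discrimination. Source data from clusters with small discrepancy is retained, and source data from clusters with large discrepancy is discarded. The result is a fine-grained, scalable data selection strategy that models local distribution shift better and outperforms state-of-the-art off-dynamics offline RL methods.

Link: https://arxiv.org/abs/2602.21072
Authors: Zhangjie Xia,Yu Yang,Pan Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 33 pages, 9 figures, 11 tables

Click to view abstract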

Abstract:Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics. Existing methods typically address dynamics mismatch either globally over the state space or via pointwise data filtering; these approaches can miss localized cross-domain similarities or incur high computational cost. We propose Localized Dynamics-Aware Domain Adaptation (LoDADA), which exploits localized dynamics mismatch to better reuse source data. LoDADA clusters transitions from source and target datasets and estimates cluster-level dynamics discrepancy via domain discrimination. Source transitions from clusters with small discrepancy are retained, while those from clusters with large discrepancy are filtered out. This yields a fine-grained and scalable data selection strategy that avoids overly coarse global assumptions and expensive per-sample filtering. We provide theoretical insights and extensive experiments across environments with diverse global and local dynamics shifts. Results show that LoDADA consistently outperforms state-of-the-art off-dynamics offline RL methods by better leveraging localized distribution mismatch.
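The cluster-then-filter idea can be sketched on a toy 1-D problem. This is a simplified illustration, not LoDADA itself: the paper estimates cluster-level discrepancy via domain discrimination, whereas this sketch swaps in a plain difference of mean state changes, and the fixed centroids, function name, and threshold are all hypothetical.

```python
import numpy as np

def filter_source_by_cluster(src_s, src_delta, tgt_s, tgt_delta, centroids, threshold):
    """Keep source transitions whose cluster has dynamics close to the target's.

    src_s, tgt_s: 1-D arrays of states; src_delta, tgt_delta: observed state changes.
    centroids: fixed cluster centres (hypothetical; a real pipeline would fit them).
    Discrepancy stand-in: |mean source delta - mean target delta| per cluster.
    """
    centroids = np.asarray(centroids, dtype=float)
    src_cluster = np.abs(src_s[:, None] - centroids[None, :]).argmin(axis=1)
    tgt_cluster = np.abs(tgt_s[:, None] - centroids[None, :]).argmin(axis=1)

    keep = np.zeros(len(src_s), dtype=bool)
    for c in range(len(centroids)):
        in_src = src_cluster == c
        in_tgt = tgt_cluster == c
        if not in_src.any() or not in_tgt.any():
            continue  # no target evidence for this cluster: drop conservatively
        gap = abs(src_delta[in_src].mean() - tgt_delta[in_tgt].mean())
        if gap <= threshold:
            keep[in_src] = True
    return keep

# Dynamics agree near s = 0 and disagree near s = 10.
src_s = np.array([0.1, -0.2, 9.8, 10.3])
src_delta = np.array([1.0, 1.0, 1.0, 1.0])
tgt_s = np.array([0.0, 10.0])
tgt_delta = np.array([1.0, -1.0])
keep = filter_source_by_cluster(src_s, src_delta, tgt_s, tgt_delta, [0.0, 10.0], 0.5)
```

Only the source transitions near `s = 0` survive, because that is the only cluster where the toy source and target dynamics match.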

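[AI-6] The Initial Exploration Problem in Knowledge Graph Exploration

[Quick read]: This paper addresses the Initial Exploration Problem (IEP) that arises in the initial stage of Knowledge Graph (KG) exploration: lay users encountering an unfamiliar KG for the first time cannot explore it effectively because they do not know how the knowledge is structured, what questions are possible, or how to query. The key contribution is to identify and theorise three interdependent barriers behind the IEP, namely scope uncertainty, ontology opacity, and query incapacity, and to point out that most current KG interfaces rest on epistemic assumptions that do not hold at first contact. The paper further argues that the design space has a structural gap: it lacks interaction primitives for scope revelation, which would help users understand what a KG covers and how it is organised without requiring them to formulate queries or interpret ontological structures. This provides a theoretical framework for evaluating and designing entry-point support for KGs.

Link: https://arxiv.org/abs/2602.21066
Authors: Claire McNamara,Lucy Hederman,Declan O’Sullivan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages

Click to view abstract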

Abstract:Knowledge Graphs (KGs) enable the integration and representation of complex information across domains, but their semantic richness and structural complexity create substantial barriers for lay users without expertise in semantic web technologies. When encountering an unfamiliar KG, such users face a distinct orientation challenge: they do not know what questions are possible, how the knowledge is structured, or how to begin exploration. This paper identifies and theorises this phenomenon as the Initial Exploration Problem (IEP). Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three interdependent barriers: scope uncertainty, ontology opacity, and query incapacity. We argue that these barriers converge at the moment of first contact, distinguishing the IEP from related concepts that presuppose an existing starting point or information goal. Analysing KG exploration interfaces at the level of interaction primitives, we suggest that many systems rely on epistemic assumptions that do not hold at first contact. This reveals a structural gap in the design space: the absence of interaction primitives for scope revelation, mechanisms that communicate what a KG contains without requiring users to formulate queries or interpret ontological structures. In articulating the IEP, this paper provides a theoretical lens for evaluating KG interfaces and for designing entry-point scaffolding that supports initial exploration.

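[AI-7] Tool Building as a Path to “Superintelligence”

[Quick read]: This paper asks whether large language models (LLMs) can achieve superintelligence under a test-time search framework, and in particular evaluates their stability and scalability on logical out-of-distribution inference tasks. The central question is whether LLMs can maintain a high step-success probability γ on reasoning tasks that grow harder with each step, which the framework requires. The key to the approach is a benchmark based on GF(2) circuit reconstruction, a task that, from an information-theoretic standpoint, cannot be solved reliably unless the model carefully integrates all of the information provided. Experiments show that γ declines superlinearly with reasoning depth for small models, while frontier models exhibit partial robustness. Successful reasoning at scale depends heavily on precise tool calls, identifying tool design as a critical capability for LLMs to reach general superintelligence through the Diligent Learner framework.

Link: https://arxiv.org/abs/2602.21061
Authors: David Koplow,Tomer Galanti,Tomaso Poggio
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract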

Abstract:The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma$. In this work, we design a benchmark to measure $\gamma$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to reliably solve unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the $\gamma$ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
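To see why the step-success probability matters, consider a deliberately simplified model in which every step succeeds independently with the same probability. This is an assumption made only for illustration; the abstract reports that the measured value actually changes with depth, and the function name below is hypothetical.

```python
def chain_success_probability(gamma: float, depth: int) -> float:
    """Probability that `depth` consecutive steps all succeed.

    Assumes independent steps with a fixed per-step success rate gamma,
    which is a simplification rather than the paper's model.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return gamma ** depth

# Compare how quickly success decays with depth for two per-step rates.
for gamma in (0.90, 0.99):
    row = [round(chain_success_probability(gamma, d), 3) for d in (1, 10, 50)]
    print(gamma, row)
```

Under this toy model a per-step rate of 0.9 already leaves only about a 35% chance of completing a 10-step chain, which is why small drops in the per-step rate compound quickly.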

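[AI-8] LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification

[Quick read]: This paper addresses the overemphasis on convergent logical reasoning in current evaluations of large language models (LLMs), where success means producing a single correct proof even though many real-world reasoning problems admit multiple valid derivations. To evaluate multi-path logical reasoning systematically, the authors propose LogicGraph, the first benchmark dedicated to it. Its core innovation is a neuro-symbolic framework that uses backward logic generation and semantic instantiation to build problems with high-depth multi-path reasoning structures and inherent logical distractions, each paired with an exhaustive set of minimal proofs. The paper also designs a reference-free evaluation framework that rigorously measures model performance in both convergent and divergent regimes, revealing that current models tend to commit early to a single route and that the coverage gap widens with reasoning depth.

Link: https://arxiv.org/abs/2602.21044
Authors: Yanrui Wu,Lingling Zhang,Xinyu Zhang,Jiayu Chang,Pengyu Li,Xu Jiang,Jingtao Hu,Jun Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 17 figures

Click to view abstract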

Abstract:Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark aimed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver-verified reasoning problems formalized by high-depth multi-path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at this https URL.

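[AI-9] Toward an Agentic Infused Software Ecosystem

[Quick read]: This paper considers how to fully unlock the potential of AI agents in software development, which it argues requires fundamentally rethinking the existing software ecosystem. The key to the proposal is an Agentic Infused Software Ecosystem (AISE) built on three pillars: first, AI agents capable of carrying out sophisticated development tasks independently; second, programming languages and APIs (or tools), which serve as the substrate through which humans and agents communicate and collaborate; and third, the runtime environment and ecosystem, which lets agents interact programmatically with the external world and take actions in it. Realizing this vision requires advancing all three pillars together, in a way that serves today's AI agents, future agents, and human developers.

Link: https://arxiv.org/abs/2602.20979
Authors: Mark Marron
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Click to view abstract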

Abstract:Fully leveraging the capabilities of AI agents in software development requires a rethinking of the software ecosystem itself. To this end, this paper outlines the creation of an Agentic Infused Software Ecosystem (AISE), that rests on three pillars. The first, of course, is the AI agents themselves, which in the past 5 years have moved from simple code completion and toward sophisticated independent development tasks, a trend which will only continue. The second pillar is the programming language and APIs (or tools) that these agents use to accomplish tasks, and increasingly, serve as the communication substrate that humans and AI agents interact and collaborate through. The final pillar is the runtime environment and ecosystem that agents operate within, and which provide the capabilities that programmatic agents use to interface with (and effect actions in) the external world. To realize the vision of AISE, all three pillars must be advanced in a holistic manner, and critically, in a manner that is synergistic for AI agents as they exist today, those that will exist in the future, and for the human developers that work alongside them.

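[AI-10] Does Order Matter: Connecting The Law of Robustness to Robust Generalization

[Quick read]: This paper resolves an open problem posed by Bubeck and Sellke (2021): the connection between the law of robustness and robust generalization. The law of robustness states that a model must be overparameterized to interpolate robustly, and that the learned function must be Lipschitz; robust generalization asks whether small robust training loss implies small robust test loss. The paper introduces a nontrivial notion of robust generalization error and converts it into a lower bound on the expected Rademacher complexity of the induced robust loss class, establishing an explicit connection between the two for arbitrary data distributions. The key result is that, up to constants, robust generalization does not change the order of the Lipschitz constant required for smooth interpolation. Experiments on MNIST show that the lower-bound Lipschitz constant scales in the order predicted by Wu et al. (2023).

Link: https://arxiv.org/abs/2602.20971
Authors: Himadri Mandal,Vishnu Varadarajan,Jaee Ponde,Aritra Das,Mihir More,Debayan Gupta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract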

Abstract:Bubeck and Sellke (2021) pose as an open problem the connection between the law of robustness and robust generalization. The law of robustness states that overparameterization is necessary for models to interpolate robustly; in particular, robust interpolation requires the learned function to be Lipschitz. Robust generalization asks whether small robust training loss implies small robust test loss. We resolve this problem by explicitly connecting the two for arbitrary data distributions. Specifically, we introduce a nontrivial notion of robust generalization error and convert it into a lower bound on the expected Rademacher complexity of the induced robust loss class. Our bounds recover the $\Omega(n^{1/d})$ regime of Wu et al. (2023) and show that, up to constants, robust generalization does not change the order of the Lipschitz constant required for smooth interpolation. We conduct experiments to probe the predicted scaling with dataset size and model capacity, testing whether empirical behavior aligns more closely with the predictions of Bubeck and Sellke (2021) or Wu et al. (2023). For MNIST, we find that the lower-bound Lipschitz constant scales on the order predicted by Wu et al. (2023). Informally, to obtain low robust generalization error, the Lipschitz constant must lie in a range that we bound, and the allowable perturbation radius is linked to the Lipschitz scale.
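A data-dependent lower bound on the Lipschitz constant of any interpolating function follows directly from the definition: for every pair of samples, the constant is at least the ratio of the label gap to the input distance. The sketch below computes that bound; it illustrates the quantity being discussed and is not the paper's experimental procedure, and the function name is hypothetical.

```python
import numpy as np

def empirical_lipschitz_lower_bound(xs, ys):
    """Largest |y_i - y_j| / ||x_i - x_j|| over all sample pairs.

    Any function f with f(x_i) = y_i for all i must have a Lipschitz
    constant (in the Euclidean norm) at least this large.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dist = np.linalg.norm(xs[i] - xs[j])
            if dist > 0:  # skip duplicate inputs
                best = max(best, abs(ys[i] - ys[j]) / dist)
    return best

xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
ys = [0.0, 3.0, 1.0]
bound = empirical_lipschitz_lower_bound(xs, ys)
```

Here the first two points are at distance 1 with labels differing by 3, so the bound is 3.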

[AI-11] EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

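[Quick read]: This paper addresses inaccurate distance estimation when Unmanned Aerial Vehicles (UAVs) track and follow a person safely in complex environments during search and rescue (SAR) missions. The core challenge is estimating the distance between the camera and the target person accurately under real-world conditions, so that the UAV stays safe and tracking stays stable. The key to the solution is to fuse depth camera measurements with the camera-to-body distance estimated from a monocular camera via YOLO-pose human keypoint detection, using an Extended Kalman Filter (EKF) for real-time fusion, which improves the robustness and accuracy of distance estimation. Experiments in real indoor scenarios show that the method reduces the average error, root mean square error (RMSE), and standard deviation by up to 15.3%.

Link: https://arxiv.org/abs/2602.20958
Authors: Luka Šiktar,Branimir Ćaran,Bojan Šekoranja,Marko Švaco
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract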

Abstract:Search and rescue (SAR) operations require rapid responses to save lives or property. Unmanned Aerial Vehicles (UAVs) equipped with vision-based systems support these missions through prior terrain investigation or real-time assistance during the mission itself. Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. UAVs with deep learning-based vision systems offer a new approach to the planning and execution of SAR operations. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints, to maintain the safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average errors, root mean square error (RMSE) and standard deviations of distance estimation by up to 15.3% in three tested scenarios.
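The fusion step can be sketched with a scalar Kalman filter that tracks a single distance and updates it sequentially with two measurements. This is a simplified stand-in, not the paper's filter: with a linear measurement model the EKF reduces to an ordinary Kalman filter, and the random-walk motion model, noise variances, and function name below are all assumptions made for illustration.

```python
def fuse_step(x, p, z_depth, z_mono, q=0.01, r_depth=0.04, r_mono=0.25):
    """One predict/update cycle for a scalar distance estimate.

    x, p: prior distance estimate (m) and its variance.
    z_depth, z_mono: depth-camera and monocular distance readings (m),
        or None when a sensor produced no reading this frame.
    q, r_depth, r_mono: hypothetical process and measurement noise variances.
    """
    # Predict: random-walk model, so only the uncertainty grows.
    p = p + q

    # Update: apply each available measurement in turn.
    for z, r in ((z_depth, r_depth), (z_mono, r_mono)):
        if z is None:
            continue
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # move the estimate toward the measurement
        p = (1.0 - k) * p        # shrink the variance
    return x, p

# Both sensors available versus depth dropout.
x_both, p_both = fuse_step(2.0, 1.0, 3.0, 3.0)
x_mono, p_mono = fuse_step(2.0, 1.0, None, 3.0)
```

Fusing both readings leaves a smaller posterior variance than relying on the monocular estimate alone, which is the motivation for combining modalities.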

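[AI-12] Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence

[Quick read]: This paper addresses the missing theoretical bridge between micro-scale token processing and macro-scale systemic intelligence as Large Language Models (LLMs) evolve from static inference engines into dynamic autonomous cognitive systems. The key to the proposal is AgentOS, a holistic conceptual framework that redefines the LLM as a "Reasoning Kernel" governed by a structured operating system. Its core mechanism is Deep Context Management, which treats the context window as an Addressable Semantic Space and uses Semantic Slicing and Temporal Alignment to mitigate cognitive drift in multi-agent settings. The framework also maps classical operating system abstractions such as memory paging, interrupt handling, and process scheduling onto LLM-native constructs, offering an architectural roadmap for building resilient, scalable, and self-evolving cognitive systems.

Link: https://arxiv.org/abs/2602.20934
Authors: ChengYou Li,XiaoDong Liu,XiangBao Meng,XinYu Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures

Click to view abstract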

Abstract:The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering, the theoretical bridge between micro-scale token processing and macro-scale systemic intelligence remains [...]. This paper proposes AgentOS, a holistic conceptual framework that redefines the LLM as a “Reasoning Kernel” governed by structured operating system [...]. Central to this architecture is Deep Context Management, which conceptualizes the context window as an Addressable Semantic Space rather than a passive [...]. We systematically deconstruct the transition from discrete sequences to coherent cognitive states, introducing mechanisms for Semantic Slicing and Temporal Alignment to mitigate cognitive drift in multi-agent [...]. By mapping classical OS abstractions such as memory paging, interrupt handling, and process scheduling onto LLM-native constructs, this review provides a rigorous roadmap for architecting resilient, scalable, and self-evolving cognitive systems. Our analysis asserts that the next frontier of AGI development lies in the architectural efficiency of system-level coordination.

[AI-13] HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG

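[Quick read]: This paper addresses the limited reliability of Large Language Models (LLMs) on knowledge-intensive tasks, caused by inherent knowledge boundaries and hallucinations, and the difficulty existing Retrieval-Augmented Generation (RAG) methods have in modeling the structural dependencies needed for multi-hop reasoning. The key to the solution is HELP, a graph-based RAG framework with two core strategies: (1) HyperNode Expansion, which iteratively chains knowledge triplets into coherent reasoning paths abstracted as HyperNodes, capturing complex structural dependencies and ensuring retrieval accuracy; and (2) Logical Path-Guided Evidence Localization, which uses precomputed graph-text correlations to map logical paths directly to the corpus, improving efficiency while avoiding expensive random walks and semantic distortion, and thereby balancing accuracy with practical efficiency.

Link: https://arxiv.org/abs/2602.20926
Authors: Yuqi Huang,Ning Liao,Kai Yang,Anning Hu,Shengchao Hu,Xiaoxing Wang,Junchi Yan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract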

Abstract:Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations, limiting their reliability in knowledge-intensive tasks. While Retrieval-Augmented Generation (RAG) mitigates these issues, it frequently overlooks structural interdependencies essential for multi-hop reasoning. Graph-based RAG approaches attempt to bridge this gap, yet they typically face trade-offs between accuracy and efficiency due to challenges such as costly graph traversals and semantic noise in LLM-generated summaries. In this paper, we propose HyperNode Expansion and Logical Path-Guided Evidence Localization strategies for GraphRAG (HELP), a novel framework designed to balance accuracy with practical efficiency through two core strategies: 1) HyperNode Expansion, which iteratively chains knowledge triplets into coherent reasoning paths abstracted as HyperNodes to capture complex structural dependencies and ensure retrieval accuracy; and 2) Logical Path-Guided Evidence Localization, which leverages precomputed graph-text correlations to map these paths directly to the corpus for superior efficiency. HELP avoids expensive random walks and semantic distortion, preserving knowledge integrity while drastically reducing retrieval latency. Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8× speedup over leading Graph-based RAG baselines.
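Chaining triplets into multi-hop paths can be sketched with a breadth-first expansion from a seed entity. This only illustrates the chaining idea; it is not the HELP algorithm, which also scores and abstracts paths into HyperNodes. The function name and the toy triplets are hypothetical.

```python
from collections import defaultdict

def expand_paths(triplets, seed, max_hops):
    """Enumerate acyclic paths of 1..max_hops triplets starting at `seed`.

    triplets: iterable of (head, relation, tail) tuples.
    Returns a list of paths, each a tuple of consecutive triplets.
    """
    outgoing = defaultdict(list)
    for head, rel, tail in triplets:
        outgoing[head].append((head, rel, tail))

    paths = []
    frontier = [((), seed, frozenset([seed]))]  # (path so far, current entity, visited)
    for _ in range(max_hops):
        next_frontier = []
        for path, entity, visited in frontier:
            for trip in outgoing[entity]:
                tail = trip[2]
                if tail in visited:
                    continue  # avoid cycles
                new_path = path + (trip,)
                paths.append(new_path)
                next_frontier.append((new_path, tail, visited | {tail}))
        frontier = next_frontier
    return paths

triplets = [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "A"), ("C", "r4", "D")]
paths = expand_paths(triplets, "A", max_hops=2)
```

With two hops from `A`, the sketch yields the one-hop path through `B` and the two-hop path reaching `C`; a retrieval system would then map each path to supporting text.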

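[AI-14] Airavat: An Agentic Framework for Internet Measurement

[Quick read]: This paper addresses two core challenges in Internet measurement: complex analyses require expert-level orchestration of tools, and even syntactically correct implementations can contain methodological flaws that are hard to verify, which limits how widely measurement capabilities can be used. The authors propose Airavat, the first agentic framework for generating Internet measurement workflows with systematic verification and validation. A set of agents that mirror expert reasoning handle problem decomposition, solution design, and code implementation, assisted by a registry of existing tools. Two specialized engines ensure methodological correctness: a Verification Engine checks workflows against a knowledge graph encoding five decades of measurement research, and a Validation Engine identifies appropriate validation techniques grounded in established methodologies, so that methodological errors missed by standard execution-based testing can be found even when no ground truth is available.

Link: https://arxiv.org/abs/2602.20924
Authors: Alagappan Ramanathan,Eunju Kang,Dongsu Han,Sangeetha Abdu Jyothi
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract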

Abstract:Internet measurement faces twin challenges: complex analyses require expert-level orchestration of tools, yet even syntactically correct implementations can have methodological flaws and can be difficult to verify. Democratizing measurement capabilities thus demands automating both workflow generation and verification against methodological standards established through decades of research. We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation. Airavat coordinates a set of agents mirroring expert reasoning: three agents handle problem decomposition, solution design, and code implementation, with assistance from a registry of existing tools. Two specialized engines ensure methodological correctness: a Verification Engine evaluates workflows against a knowledge graph encoding five decades of measurement research, while a Validation Engine identifies appropriate validation techniques grounded in established methodologies. Through four Internet measurement case studies, we demonstrate that Airavat (i) generates workflows matching expert-level solutions, (ii) makes sound architectural decisions, (iii) addresses novel problems without ground truth, and (iv) identifies methodological flaws missed by standard execution-based testing.

[AI-15] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

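[Quick read]: This paper addresses the tendency of Large Vision-Language Models (LVLMs) to rely on spurious correlations rather than genuine causal reasoning in visual question answering. Existing evaluations look only at answer correctness, so they cannot tell whether a failure stems from limited reasoning capability or from misidentifying causally relevant information. The key to the solution is Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Built on this representation, the ViLCaR diagnostic benchmark comprises Causal Attribution, Causal Inference, and Question Answering tasks, with graph-aligned metrics that assess relevance identification beyond final answer accuracy. Experiments show that injecting structured relevance information significantly improves attribution and inference consistency, suggesting that current limitations stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

Link: https://arxiv.org/abs/2602.20878
Authors: Dhita Putri Pratama,Soyeon Caren Han,Yihao Ding
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract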

Abstract:Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments in state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

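[AI-16] SoK: Agentic Skills – Beyond Tool Use in LLM Agents

[Quick read]: This paper addresses the reusability and reliability problems that agentic systems face when executing long-horizon tasks. The core challenge is building a structured, composable, and secure skill layer that lets agents invoke and evolve their capabilities across environments and tasks. The key contribution is two complementary taxonomies: a system-level set of seven design patterns covering how skills are packaged and executed, from metadata-driven progressive disclosure to self-evolving libraries and marketplace distribution; and a representation × scope taxonomy describing what skills are (natural language, code, policy, or hybrid) and which environments they operate over (web, OS, software engineering, robotics). With this framing the paper maps the full skill lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and analyzes governance challenges such as supply-chain risks, prompt injection, and trust-tiered execution. It grounds the security discussion in the ClawHavoc malicious-skill case study and calls for skills that are deterministically evaluable, verifiable, and certifiable to support trustworthy deployment of real-world autonomous agents.

Link: https://arxiv.org/abs/2602.20867
Authors: Yanna Jiang,Delong Li,Haiyu Deng,Baihe Ma,Xu Wang,Qin Wang,Guangsheng Yu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
Comments:

Click to view abstract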

Abstract:Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably. These capabilities are callable modules that package procedural knowledge with explicit applicability conditions, execution policies, termination criteria, and reusable interfaces. Unlike one-off plans or atomic tool calls, skills operate (and often do well) across tasks. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies. The first is a system-level set of **seven design patterns** capturing how skills are packaged and executed in practice, from metadata-driven progressive disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal **representation × scope** taxonomy describing what skills *are* (natural language, code, policy, hybrid) and what environments they operate over (web, OS, software engineering, robotics). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution, grounded by a case study of the ClawHavoc campaign in which nearly 1,200 malicious skills infiltrated a major agent marketplace, exfiltrating API keys, cryptocurrency wallets, and browser credentials at scale. We further survey deterministic evaluation approaches, anchored by recent benchmark evidence that curated skills can substantially improve agent success rates while self-generated skills may degrade them. We conclude with open challenges toward robust, verifiable, and certifiable skills for real-world autonomous agents.

[AI-17] Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

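[Quick read]: This paper addresses the lack of realistic multi-turn testing in current alignment evaluations of language models: existing methods rely mainly on what a model says it would do in response to a single instruction, which makes it hard to expose risks that arise in complex, conflicting, or evolving scenarios. The key to the solution is an alignment benchmark of 904 realistic scenarios across six categories (Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming), validated as realistic by human raters. Its multi-turn design, with conflicting instructions, simulated tool access, and gradual escalation, makes behavioural tendencies easier to expose and supports a more comprehensive evaluation that is closer to real use.

Link: https://arxiv.org/abs/2602.20813
Authors: Nora Petrova,John Burden
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint

Click to view abstract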

Abstract:Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories – Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming – validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.

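[AI-18] Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset

[Quick read]: This paper addresses the performance bottleneck of Large Language Models (LLMs) in Building Information Modeling (BIM), whose main cause is the lack of BIM-specific datasets and evaluation benchmarks. The solution has three parts: an evaluation benchmark for BIM-based design with quantitative indicators for measuring LLM performance; a method for generating textual data from BIM models to build BIM-derived datasets for LLM evaluation and fine-tuning; and a fine-tuning strategy that adapts LLMs to BIM-specific tasks. Experiments show that Qwen-BIM, trained with this benchmark and these datasets, improves the G-Eval score by 21.0% on average over the base model, and with only 14B parameters performs comparably to general LLMs with 671B parameters.

Link: https://arxiv.org/abs/2602.20812
Authors: Jia-Rui Lin,Yun-Hong Cai,Xiang-Rui Ni,Shaojie Zhou,Peng Pan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract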

Abstract:As the construction industry advances toward digital transformation, BIM (Building Information Modeling)-based design has become a key driver supporting intelligent construction. Although Large Language Models (LLMs) have shown potential in promoting BIM-based design, the lack of specific datasets and LLM evaluation benchmarks has significantly hindered the performance of LLMs. Therefore, this paper addresses this gap by proposing: 1) an evaluation benchmark for BIM-based design together with corresponding quantitative indicators to evaluate the performance of LLMs, 2) a method for generating textual data from BIM and constructing corresponding BIM-derived datasets for LLM evaluation and fine-tuning, and 3) a fine-tuning strategy to adapt LLMs for BIM-based design. Results demonstrate that the proposed domain-specific benchmark effectively and comprehensively assesses LLM capabilities, highlighting that general LLMs are not yet competent at domain-specific tasks. Meanwhile, with the proposed benchmark and datasets, Qwen-BIM is developed and achieves a 21.0% average increase in G-Eval score compared to the base LLM. Notably, with only 14B parameters, the performance of Qwen-BIM is comparable to that of general LLMs with 671B parameters for BIM-based design tasks. Overall, this study develops the first domain-specific LLM for BIM-based design by introducing a comprehensive benchmark and high-quality dataset, which provide a solid foundation for developing BIM-related LLMs in various fields.

[AI-19] POMDPPlanners: Open-Source Package for POMDP Planning

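Quick read: This paper addresses the lack of efficient, reproducible evaluation tools that support risk-sensitive decision making in research on Partially Observable Markov Decision Process (POMDP) planning algorithms. Existing toolkits fall short in high-risk scenarios and struggle to support large-scale simulation and parameter optimization. The key to the solution is POMDPPlanners, an open-source Python package whose core features are: integration of state-of-the-art planning algorithms, benchmark environments with safety-critical variants, automated hyperparameter optimization via Optuna, persistent caching with failure recovery to reduce repeated computation, and configurable parallel simulation. Together these lower experimental overhead and improve the scalability and reproducibility of research on decision making under uncertainty.

Link: https://arxiv.org/abs/2602.20810
Authors: Yaacov Pariente, Vadim Indelman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: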

Abstract:We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter optimization via Optuna, persistent caching with failure recovery, and configurable parallel simulation – reducing the overhead of extensive simulation studies. POMDPPlanners is designed to enable scalable, reproducible research on decision-making under uncertainty, with particular emphasis on risk-sensitive settings where standard toolkits fall short.

[AI-20] Regret-Guided Search Control for Efficient Learning in AlphaZero ICLR2026

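Quick read: This paper addresses the fact that reinforcement learning (RL) agents learn far less efficiently than humans during training, and in particular the limitation that AlphaZero-style algorithms use self-play resources poorly and struggle to focus on high-value learning states. The key to the solution is Regret-Guided Search Control (RGSC), which introduces a regret network to identify and preferentially reuse high-regret states, i.e. states where the agent's evaluation deviates most from the actual outcome. These states are collected from self-play trajectories and Monte Carlo Tree Search (MCTS) nodes and stored in a prioritized regret buffer as new starting positions, which improves training efficiency and model performance and yields consistent gains over AlphaZero and Go-Exploit on several board games.

Link: https://arxiv.org/abs/2602.20809
Authors: Yun-Jui Tsai, Wei-Yu Chen, Yan-Ru Ju, Yu-Hung Chang, Ti-Rong Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by the Fourteenth International Conference on Learning Representations (ICLR 2026)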

Abstract:Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent’s evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at this https URL.
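
The prioritized regret buffer described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract describes a learned regret network, whereas the `regret` function here is a hypothetical stand-in (absolute gap between predicted value and outcome), and the `RegretBuffer` class, its eviction rule, and the regret-proportional sampling are all assumptions chosen for illustration.

```python
import random


def regret(predicted_value, outcome):
    """Hypothetical regret signal: gap between predicted value and outcome."""
    return abs(predicted_value - outcome)


class RegretBuffer:
    """Toy prioritized buffer that favors high-regret start states."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # (state, regret) pairs, highest regret first

    def add(self, state, regret_value):
        self.items.append((state, regret_value))
        # Once over capacity, evict the lowest-regret entries.
        self.items.sort(key=lambda item: item[1], reverse=True)
        del self.items[self.capacity:]

    def sample_start_state(self, rng):
        """Draw one stored state with probability proportional to its regret."""
        states = [state for state, _ in self.items]
        weights = [value for _, value in self.items]
        if sum(weights) == 0:
            return rng.choice(states)
        return rng.choices(states, weights=weights, k=1)[0]


buffer = RegretBuffer(capacity=2)
buffer.add("opening", regret(0.1, 0.0))
buffer.add("blunder", regret(0.9, -1.0))
buffer.add("quiet", regret(0.5, 0.55))
print([state for state, _ in buffer.items])
```

With a capacity of two, the lowest-regret state ("quiet") is evicted, and later self-play games would restart from "blunder" far more often than from "opening".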

[AI-21] Pipeline for Verifying LLM-Generated Mathematical Solutions

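Quick read: This paper addresses the limited accuracy of current evaluations of the mathematical problem-solving ability of Large Reasoning Models: checking only the final answer is prone to false positives. To make evaluation more reliable, the authors propose a pipeline that combines automatic and interactive verification. Its core idea is to use prompt engineering to steer the model toward structured, verifiable solutions that can then be checked precisely by proof assistants or by small language models (≤8B parameters). The approach improves evaluation accuracy and can also generate correct solutions in both formal and informal languages; experiments indicate a low probability of false positives.

Link: https://arxiv.org/abs/2602.20770
Authors: Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: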

Abstract:With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more accurate alternative to checking only the final answer, which is currently the most popular approach for benchmarks. The pipeline can also be used as a generator of correct solutions in both formal and informal languages. The structure includes three AI agents, which can be chosen to suit the benchmark. The key idea is the use of prompts to obtain the solution in a specific form that allows for easier verification using proof assistants and the possible use of small models ($\le$ 8B). Experiments on several datasets suggest a low probability of false positives. The open-source implementation with instructions on setting up a server is available at this https URL.

[AI-22] Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

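Quick read: This paper addresses the failure of current Automatic Singing Assessment (ASA) systems to handle the microtones, pitch bends, and other features of Kurdish maqam music that fall outside Western equal temperament. Existing ASA tools are built on Western music rules and cannot correctly detect fine-grained errors in traditional Kurdish singing, which leads to misjudgments. The key to the solution is a two-headed CNN-BiLSTM (convolutional neural network plus bidirectional long short-term memory) model with an attention mechanism that detects and classifies three specific error types: fine pitch errors, rhythm errors, and modal drift. The authors collected and annotated 221 error spans across 50 songs, converted the audio into log-mel spectrograms for training, and reached a macro-F1 of 0.468 on the validation set. The method identifies the common error types reasonably well, but recall on modal drift is low, which points to a need for more data and better class balance.

Link: https://arxiv.org/abs/2602.20744
Authors: Darvan Shvan Khairaldeen, Hossein Hassani
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: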

Abstract:Maqam, a singing type, is a significant component of Kurdish music. A maqam singer is trained either in a traditional face-to-face setting or through self-training. Automatic Singing Assessment (ASA) uses machine learning (ML) to assess the accuracy of singing styles and can help learners improve their performance through error detection. Currently, the available ASA tools follow Western music rules, which require all notes to stay within their expected pitch range from start to finish. Such systems fail to detect micro-intervals and pitch bends, so they identify Kurdish maqam singing as incorrect even though the singer performs according to traditional rules. Kurdish maqam requires recognizing performance errors within microtonal spaces, which is beyond Western equal temperament. This research is the first attempt to address this gap. While many error types happen during singing, our focus is on pitch, rhythm, and modal stability errors in the context of Bayati-Kurd. We collected 50 songs from 13 vocalists (2-3 hours) and annotated 221 error spans (150 fine pitch, 46 rhythm, 25 modal drift). The data was segmented into 15,199 overlapping windows and converted to log-mel spectrograms. We developed a two-headed CNN-BiLSTM with an attention mechanism to decide whether a window contains an error and to classify it based on the chosen errors. Trained for 20 epochs with early stopping at epoch 10, the model reached a validation macro-F1 of 0.468. On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%. Within detected windows, type macro-F1 was 0.387, with F1 of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift); modal drift recall was 8.0%. The better performance on common error types shows that the method works, while the poor modal-drift recall shows that more data and balancing are needed.
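
The reported type macro-F1 can be checked against the three per-class F1 scores, since macro-F1 is their unweighted mean. The sketch below does that arithmetic. It also derives a window-level detection F1 from the reported precision and recall; that figure is not stated in the abstract and is computed here only for illustration. The function and variable names are assumptions.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class F1 scores quoted in the abstract.
type_f1 = {"fine_pitch": 0.492, "rhythm": 0.536, "modal_drift": 0.133}
type_macro_f1 = macro_f1(type_f1)

# Derived here, not reported in the abstract: detection F1 at the 0.750 threshold.
detection_f1 = f1_from_precision_recall(precision=0.258, recall=0.394)

print(round(type_macro_f1, 3))  # 0.387, matching the reported type macro-F1
print(round(detection_f1, 3))
```

Because macro-F1 weights all classes equally, the weak modal-drift score (0.133) pulls the average well below the two common classes.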

[AI-23] CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

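Quick read: This paper addresses the inefficiency of long-context LLM inference caused by the growth of the Key-Value (KV) cache. Existing pruning methods are mostly context-agnostic: they ignore step-wise relevance and local semantics, which hurts generation quality, and their irregular accesses and selection overheads make significant end-to-end speedups hard to achieve. The key to the solution is CHESS, an algorithm-system co-designed KV-cache management framework. On the algorithm side, it introduces a context-aware, hierarchical selection policy that dynamically reconstructs the coherent context needed for the current decoding step. On the system side, it uses coarse-grained selection to avoid expensive data movement, turning theoretical sparsity into practical acceleration.

Link: https://arxiv.org/abs/2602.20732
Authors: Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: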

Abstract:Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-design KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System-wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines. Code is available at this https URL.
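
To make the idea of coarse-granularity selection concrete, here is a minimal sketch of block-level selection in general. It is not the CHESS policy: the abstract does not specify how blocks are scored, so the mean-score rule, the function name `select_blocks`, and its parameters are all assumptions. The point is only that choosing whole contiguous blocks keeps memory accesses regular, in contrast to picking scattered individual tokens.

```python
def select_blocks(token_scores, block_size, keep_blocks):
    """Rank contiguous blocks by mean token score and keep the top ones.

    Returns the indices of the kept blocks in their original order, so the
    retained context stays contiguous within each block.
    """
    blocks = [
        token_scores[start:start + block_size]
        for start in range(0, len(token_scores), block_size)
    ]
    block_means = [sum(block) / len(block) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: block_means[i], reverse=True)
    return sorted(ranked[:keep_blocks])


# Hypothetical per-token relevance scores for an 8-token context.
scores = [0.1, 0.1, 0.9, 0.8, 0.2, 0.3, 0.7, 0.6]
kept = select_blocks(scores, block_size=2, keep_blocks=2)
print(kept)  # [1, 3]
```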

[AI-24] Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback

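Quick read: This paper addresses the difficulty of reward design in Multi-Objective Reinforcement Learning (MORL): when objectives conflict, conventional approaches tend to produce policies that favor a dominant objective and neglect the others. The key to the solution is to extend RL from AI Feedback (RLAIF) to the multi-objective setting, using Large Language Models (LLMs) to generate preference labels automatically. This allows policies that balance different user priorities to be learned without laborious manual reward engineering, yielding multi-objective behavior that better matches user intent.

Link: https://arxiv.org/abs/2602.20728
Authors: Chenyang Zhao, Vinny Cahill, Ivana Dusparic
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: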

Abstract:Reward design has been one of the central challenges for real world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses only on single-objective tasks, leaving the open question of how RLAIF handles systems that involve multiple objectives. In such systems trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.

[AI-25] Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation

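Quick read: This paper addresses the fusion difficulties in multimodal recommendation caused by modality heterogeneity and conflict, especially the entangled representations and modality imbalance that arise when existing methods rely on shared fusion pathways. The core solution is MAGNET (Modality-Guided Mixture of Adaptive Graph Experts Network), which achieves controllable, stable, and interpretable multimodal fusion through two key mechanisms. First, it couples interaction-conditioned expert routing with structure-aware graph augmentation to explicitly control what to fuse and how to fuse it. Second, it introduces a staged entropy-triggered routing mechanism that emphasizes coverage early in training and specialization later, dynamically balancing expert utilization and routing confidence, which improves recommendation performance and robustness in sparse settings.

Link: https://arxiv.org/abs/2602.20723
Authors: Ji Dai, Quan Fang, Dengsheng Cai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: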

Abstract:Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective fusion both crucial and challenging. Existing approaches often rely on shared fusion pathways, leading to entangled representations and modality imbalance. To address these issues, we propose MAGNET, a Modality-Guided Mixture of Adaptive Graph Experts Network with Progressive Entropy-Triggered Routing for Multimodal Recommendation, designed to enhance controllability, stability, and interpretability in multimodal fusion. MAGNET couples interaction-conditioned expert routing with structure-aware graph augmentation, so that both what to fuse and how to fuse are explicitly controlled and interpretable. At the representation level, a dual-view graph learning module augments the interaction graph with content-induced edges, improving coverage for sparse and long-tail items while preserving collaborative structure via parallel encoding and lightweight fusion. At the fusion level, MAGNET employs structured experts with explicit modality roles (dominant, balanced, and complementary), enabling a more interpretable and adaptive combination of behavioral, visual, and textual cues. To further stabilize sparse routing and prevent expert collapse, we introduce a two-stage entropy-weighting mechanism that monitors routing entropy. This mechanism automatically transitions training from an early coverage-oriented regime to a later specialization-oriented regime, progressively balancing expert utilization and routing confidence. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines.

[AI-26] Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

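Quick read: This paper addresses experience waste and reward homogeneity in traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks for LLM post-training, both of which significantly reduce learning efficiency on difficult samples. The key to the solution is an off-policy RLVR framework, Batch Adaptation Policy Optimization (BAPO), which achieves data-efficient training by dynamically re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Experiments show that BAPO outperforms GRPO by an average of 12.5% on mathematics, planning, and visual reasoning tasks and solves 40.7% of the problems that base models consistently fail on.

Link: https://arxiv.org/abs/2602.20722
Authors: Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: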

Abstract:Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework to improve data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while holding a lower bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.

[AI-27] AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

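Quick read: This paper addresses the critical security vulnerabilities that Large Language Model (LLM) agents face after integrating external data services such as the Model Context Protocol (MCP), in particular the evolving and stealthy nature of Indirect Prompt Injection (IPI) attacks. Existing attack methods depend on static patterns and are evaluated on simple language models, so they do not capture the complex interactive behavior of modern AI agents. The core of the solution is the AdapTools framework, which has two key components: adaptive attack strategy construction, which generates transferable adversarial strategies to optimize prompts, and attack enhancement, which identifies stealthy tools that can bypass task-relevance defenses. The framework raises the attack success rate by a factor of 2.13 while degrading system utility by a factor of 1.78, and it remains effective against state-of-the-art defenses, offering a new paradigm for the systematic evaluation of IPI attacks and for defense research.

Link: https://arxiv.org/abs/2602.20720
Authors: Che Wang, Jiaming Zhang, Ziqi Zhang, Zijie Wang, Yinghui Wang, Jianbo Gao, Tao Wei, Zhong Chen, Wei Yang Bryan Lim
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages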

Abstract:The integration of external data services (e.g., Model Context Protocol, MCP) has made large language model-based agents increasingly powerful for complex task execution. However, this advancement introduces critical security vulnerabilities, particularly indirect prompt injection (IPI) attacks. Existing attack methods are limited by their reliance on static patterns and evaluation on simple language models, failing to address the fast-evolving nature of modern AI agents. We introduce AdapTools, a novel adaptive IPI attack framework that selects stealthier attack tools and generates adaptive attack prompts to create a rigorous security evaluation environment. Our approach comprises two key components: (1) Adaptive Attack Strategy Construction, which develops transferable adversarial strategies for prompt optimization, and (2) Attack Enhancement, which identifies stealthy tools capable of circumventing task-relevance defenses. Comprehensive experimental evaluation shows that AdapTools achieves a 2.13 times improvement in attack success rate while degrading system utility by a factor of 1.78. Notably, the framework maintains its effectiveness even against state-of-the-art defense mechanisms. Our method advances the understanding of IPI attacks and provides a useful reference for future research.

[AI-28] ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

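Quick read: This paper addresses the security of Large Language Model (LLM) agents against Indirect Prompt Injection (IPI) attacks, in which malicious retrieved content hijacks the agent's execution. Existing defenses mostly rely on strict filtering or refusal mechanisms and suffer from over-refusal, which terminates legitimate task workflows prematurely. The key to the solution is the ICON framework. Its core observation is that IPI attacks leave distinct over-focusing signatures in the latent space. A Latent Space Trace Prober uses these signatures to compute an attack intensity score, and a Mitigating Rectifier then applies targeted attention steering that selectively adjusts adversarial query-key dependencies while amplifying task-relevant elements, restoring the LLM's functional trajectory. The method maintains strong security while substantially improving task continuity and efficiency.

Link: https://arxiv.org/abs/2602.20708
Authors: Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 11 pages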

Abstract:Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent’s execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM’s functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding a task utility gain of over 50%. Furthermore, ICON demonstrates robust Out-of-Distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.

[AI-29] Online Algorithms with Unreliable Guidance

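Quick read: This paper addresses how to combine predictions with algorithmic strategy in online decision problems, and in particular how to balance consistency and robustness when the predictions may be unreliable. The core challenge is to design a unified analysis framework in which an algorithm achieves high consistency when predictions are of high quality (β = 0) and remains robust when they are adversarially corrupted (β = 1). The key to the solution is a new model called Online Algorithms with Unreliable Guidance (OAG), which fully decouples the prediction module from the algorithm's execution logic, together with a compiler called Drop or Trust Blindly (DTB). The compiler builds a learning-augmented online algorithm by randomly choosing whether to follow the guidance, which gives a natural, smooth transition from consistency to robustness. For three classic problems (caching, uniform metrical task systems, and bipartite matching), the resulting competitiveness guarantees are optimal or better than the state of the art.

Link: https://arxiv.org/abs/2602.20706
Authors: Julien Dallot, Yuval Emek, Yuval Gil, Maciej Pacut, Stefan Schmid
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: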

Abstract:This paper introduces a new model for ML-augmented online decision making, called online algorithms with unreliable guidance (OAG). This model completely separates the predictive and algorithmic components, thus offering a single well-defined analysis framework that relies solely on the considered problem. Formulated through the lens of request-answer games, an OAG algorithm receives, with each incoming request, a piece of guidance which is taken from the problem’s answer space; ideally, this guidance is the optimal answer for the current request, however with probability $\beta$, the guidance is adversarially corrupted. The goal is to develop OAG algorithms that admit good competitiveness when $\beta = 0$ (a.k.a. consistency) as well as when $\beta = 1$ (a.k.a. robustness); the appealing notion of smoothness, that in most prior work required a dedicated loss function, now arises naturally as $\beta$ shifts from $0$ to $1$. We then describe a systematic method, called the drop or trust blindly (DTB) compiler, which transforms any online algorithm into a learning-augmented online algorithm in the OAG model. Given a prediction-oblivious online algorithm, its learning-augmented counterpart produced by applying the DTB compiler either follows the incoming guidance blindly or ignores it altogether and proceeds as the initial algorithm would have; the choice between these two alternatives is based on the outcome of a (biased) coin toss. As our main technical contribution, we prove (rigorously) that although remarkably simple, the class of algorithms produced via the DTB compiler includes algorithms with attractive consistency-robustness guarantees for three classic online problems: for caching and uniform metrical task systems our algorithms are optimal, whereas for bipartite matching (with adversarial arrival order), our algorithm outperforms the state-of-the-art.
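
The drop-or-trust-blindly idea can be sketched as a wrapper around any prediction-oblivious online algorithm. This is a simplified illustration, not the paper's construction: the abstract does not say when the coin is tossed or how the bias is chosen, so the per-request toss, the `trust_probability` parameter, and the stateless `base_algorithm(request)` interface are all assumptions.

```python
import random


def dtb_compile(base_algorithm, trust_probability, rng):
    """Wrap a prediction-oblivious algorithm with a biased coin toss.

    On each request the wrapper either returns the guidance unchanged
    (trust blindly) or ignores it and defers to the base algorithm (drop).
    """
    def augmented(request, guidance):
        if rng.random() < trust_probability:
            return guidance  # trust blindly
        return base_algorithm(request)  # drop the guidance
    return augmented


def always_first(request):
    """Hypothetical base algorithm: always pick the first available answer."""
    return request[0]


rng = random.Random(0)
trusting = dtb_compile(always_first, trust_probability=1.0, rng=rng)
skeptical = dtb_compile(always_first, trust_probability=0.0, rng=rng)

print(trusting(["a", "b"], guidance="b"))   # b
print(skeptical(["a", "b"], guidance="b"))  # a
```

The two extreme biases recover the two pure behaviors: always following the guidance, and always running the original algorithm.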

[AI-30] PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding

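Quick read: This paper addresses a limitation of behavioral alignment for Large Language Models (LLMs) and Vision-Language Models (VLMs): existing alignment methods work mainly at training time and depend on high-quality data and extra annotation, which makes them computationally expensive and limits generalization. To make model behavior more reliable without retraining, the paper proposes a test-time behavior control method, Polarity-Prompt Contrastive Decoding (PromptCD). Its core idea is to construct paired positive and negative guiding prompts and contrast the model's outputs under them (token-level probability distributions for LLMs, visual attention patterns for VLMs), strengthening a target behavior without changing model parameters. The method is general, needs no additional training, and applies across modalities. It yields significant improvements on the “3H” alignment objectives (helpfulness, honesty, harmlessness) and on visual question answering (VQA), supporting its effectiveness as an efficient behavior control strategy.

Link: https://arxiv.org/abs/2602.20696
Authors: Baolong Bi, Yuyao Ge, Shenghua Liu, Yuchen He, Siqian Tong, Lizhe Chen, Lingrui Mei, Zehao Li, Yiwei Wang, Yujun Cai, Ming-Hsuan Yang, Xueqi Cheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: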

Abstract:Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human preferences and values. However, most existing alignment approaches operate at training time and rely on additional high-quality data, incurring significant computational and annotation costs. While recent work has shown that contrastive decoding can leverage a model’s internal distributions to improve specific capabilities, its applicability remains limited to narrow behavioral scopes and scenarios. In this work, we introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses-specifically token-level probability distributions in LLMs and visual attention patterns in VLMs-to reinforce desirable outcomes. This formulation extends contrastive decoding to a wide range of enhancement objectives and is applicable to both LLMs and Vision-Language Models (VLMs) without additional training. For LLMs, experiments on the “3H” alignment objectives (helpfulness, honesty, and harmlessness) demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time. For VLMs, we further analyze contrastive effects on visual attention, showing that PromptCD significantly improves VQA performance by reinforcing behavior-consistent visual grounding. Collectively, these results highlight PromptCD as a simple, general, and cost-efficient strategy for reliable behavior control across modalities.
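
For the LLM case, contrasting token-level distributions under a positive and a negative prompt can be sketched as follows. The abstract does not give PromptCD's exact combination rule, so this uses a common contrastive-decoding form, (1 + α)·log p_pos − α·log p_neg, as an assumption; the function names, the `alpha` parameter, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(pos_logits, neg_logits, alpha):
    """Boost tokens favored under the positive prompt relative to the negative one."""
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos, neg)]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy next-token logits over a 3-token vocabulary under each prompt.
pos_logits = [2.0, 1.9, 0.0]
neg_logits = [2.0, 0.0, 0.0]

print(greedy_token(contrastive_scores(pos_logits, neg_logits, alpha=0.0)))  # 0
print(greedy_token(contrastive_scores(pos_logits, neg_logits, alpha=1.0)))  # 1
```

With `alpha=0` the positive-prompt distribution is used unchanged. With `alpha=1`, token 0 is penalized because the negative prompt also favors it, so token 1 wins.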

[AI-31] How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

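Quick read: This paper addresses two problems in how embodied agents driven by Vision-Language Models (VLMs) are currently evaluated. First, existing benchmarks mostly rely on high-level commands or discretized action spaces, which do not match the native low-level action space of real-world control. Second, there is no joint evaluation of low-level skills and high-level tasks, which makes fine-grained capability analysis difficult. The key to the solution is a new benchmark, NativeEmbodied, which uses a unified native low-level action space. It includes three representative high-level tasks in complex scenes to evaluate overall performance, and it decouples the underlying skills those tasks require into four types of low-level tasks, each targeting a fundamental embodied skill. This enables joint evaluation across task and skill levels and reveals deficiencies of current VLM-driven embodied agents in key fundamental skills, along with how those deficiencies limit performance on high-level tasks.

Link: https://arxiv.org/abs/2602.20687
Authors: Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: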

Abstract:Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.

[AI-32] UrbanFM: Scaling Urban Spatio-Temporal Foundation Models

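Quick read: This paper addresses the fragmentation of urban computing caused by scenario-specific models, which overfit to particular regions or tasks and therefore generalize poorly. To advance urban spatio-temporal foundation models, the study takes scaling as its central perspective and systematically examines what to scale and how to scale. The key to the solution is coordinated innovation along three dimensions. For data scaling, it builds WorldST, a billion-scale corpus that unifies multi-source physical signals (such as traffic flow and speed) from more than 100 cities worldwide. For computation scaling, it designs the MiniST unit, which discretizes continuous spatio-temporal fields into learnable computational units and so unifies the representation of grid-based and sensor-based observations. For architecture scaling, it proposes UrbanFM, a minimalist self-attention architecture with limited inductive biases that learns dynamic spatio-temporal dependencies from large-scale data on its own. On the EvalST benchmark, the approach shows strong zero-shot generalization across cities and tasks, an important step toward urban spatio-temporal foundation models.

Link: https://arxiv.org/abs/2602.20677
Authors: Wei Chen, Yuqian Wu, Junle Chen, Xiaofang Zhou, Yuxuan Liang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: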

Abstract:Urban systems, as dynamic complex systems, continuously generate spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like genomics and meteorology, urban computing remains fragmented due to “scenario-specific” models, which are overfitted to specific regions or tasks, hindering their generalizability. To bridge this gap and advance spatio-temporal foundation models for urban systems, we adopt scaling as the central perspective and systematically investigate two key questions: what to scale and how to scale. Grounded in first-principles analysis, we identify three critical dimensions: heterogeneity, correlation, and dynamics, aligning these principles with the fundamental scientific properties of urban spatio-temporal data. Specifically, to address heterogeneity through data scaling, we construct WorldST. This billion-scale corpus standardizes diverse physical signals, such as traffic flow and speed, from over 100 global cities into a unified data format. To enable computation scaling for modeling correlations, we introduce the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations. Finally, addressing dynamics via architecture scaling, we propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to autonomously learn dynamic spatio-temporal dependencies from massive data. Furthermore, we establish EvalST, the largest-scale urban spatio-temporal benchmark to date. Extensive experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, marking a pivotal first step toward large-scale urban spatio-temporal foundation models.

[AI-33] Recursive Belief Vision Language Model

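Quick read: This paper addresses the challenges that current Vision-Language-Action (VLA) models face in long-horizon manipulation under partial observability, including loss of task progress, action repetition caused by perceptual aliasing, and high inference latency. The core solution is RB-VLA, a belief-centric architecture. Trained with self-supervised world-model objectives, it maintains a compact latent state (the belief) that encodes task-relevant history, dynamics, and object interactions, enabling causally grounded, phase-aware control without storing raw observations or growing memory over time. The belief and the high-level intent jointly condition a diffusion policy, which raises success rates on long-horizon tasks and lowers inference latency. Experiments show success rates 52.5% and 37.5% higher than the baseline on multi-stage pick-and-place and stacking tasks, respectively, and the memory growth over time is eliminated.

Link: https://arxiv.org/abs/2602.20659
Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Patel
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: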

Abstract:Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to $\pi_0$. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.

[AI-34] TrajGPT-R: Generating Urban Mobility Trajectory with Reinforcement Learning-Enhanced Generative Pre-trained Transformer

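Quick read: This paper addresses the difficulty of obtaining urban mobility trajectory data under privacy constraints while also improving the reliability and diversity of generated trajectories. The key to the solution is a Transformer-based two-phase training framework. First, trajectory generation is formulated as an offline reinforcement learning (RL) problem, with a reduced vocabulary space for efficient tokenization. Second, Inverse Reinforcement Learning (IRL) is used to infer individual mobility preferences from historical data and build a reward model that guides fine-tuning of the pre-trained model, which mitigates the limitations of traditional RL methods in long-term credit assignment and sparse-reward environments.

Link: https://arxiv.org/abs/2602.20643
Authors: Jiawei Wang, Chuang Yang, Jiawei Yong, Xiaohang Xu, Hongjun Wang, Noboru Koshizuka, Shintaro Fukushima, Ryosuke Shibasaki, Renhe Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: TrajGPT-R is a Reinforcement Learning-Enhanced Generative Pre-trained Transformer for Mobility Trajectory Generation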

Abstract:Mobility trajectories are essential for understanding urban dynamics and enhancing urban planning, yet access to such data is frequently hindered by privacy concerns. This research introduces a transformative framework for generating large-scale urban mobility trajectories, employing a novel application of a transformer-based model pre-trained and fine-tuned through a two-phase process. Initially, trajectory generation is conceptualized as an offline reinforcement learning (RL) problem, with a significant reduction in vocabulary space achieved during tokenization. The integration of Inverse Reinforcement Learning (IRL) allows for the capture of trajectory-wise reward signals, leveraging historical data to infer individual mobility preferences. Subsequently, the pre-trained model is fine-tuned using the constructed reward model, effectively addressing the challenges inherent in traditional RL-based autoregressive methods, such as long-term credit assignment and handling of sparse reward environments. Comprehensive evaluations on multiple datasets illustrate that our framework markedly surpasses existing models in terms of reliability and diversity. Our findings not only advance the field of urban mobility modeling but also provide a robust methodology for simulating urban data, with significant implications for traffic management and urban development planning. The implementation is publicly available at this https URL.

[AI-35] Grounding LLMs in Scientific Discovery via Embodied Actions

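[Quick read]: This paper addresses the difficulty large language models (LLMs) have in connecting theoretical reasoning with verifiable physical simulation in scientific discovery. Existing methods run a passive "execute-then-respond" loop and lack runtime perception, so they cannot promptly detect transient anomalies such as numerical instability or diverging oscillations. The key is EmbodiedAct, a framework that turns established scientific computing software (such as MATLAB) into active agents with embodied actions and builds a tightly coupled perception-execution loop. This enables real-time monitoring and dynamic adjustment of the simulation, improving reliability and stability in long-horizon simulations and accuracy in scientific modeling.

Link: https://arxiv.org/abs/2602.20639
Authors: Bo Zhang,Jinfeng Zhou,Yuxuan Chen,Jianing Yin,Minlie Huang,Hongning Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 7 figures, 7 tables. Preprint

Click to view abstract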

Abstract:Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and verifiable physical simulation. Existing solutions operate in a passive “execute-then-response” loop and thus lack runtime perception, leaving agents blind to transient anomalies (e.g., numerical instability or diverging oscillations). To address this limitation, we propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with a tight perception-execution loop. We instantiate EmbodiedAct within MATLAB and evaluate it on complex engineering design and scientific modeling tasks. Extensive experiments show that EmbodiedAct significantly outperforms existing baselines, achieving SOTA performance by ensuring satisfactory reliability and stability in long-horizon simulations and enhanced accuracy in scientific modeling.

[AI-36] Identifying two piecewise linear additive value functions from anonymous preference information

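[Quick read]: This paper addresses how to identify the preference models of two decision-makers at the same time when each query yields two noise-free answers but it is unknown which answer belongs to which decision-maker. The key is a new interactive elicitation procedure: assuming each decision-maker's preferences are built from piecewise linear marginal value functions with known breakpoints, it separates and identifies both additive value functions without knowing who gave which answer.

Link: https://arxiv.org/abs/2602.20638
Authors: Vincent Auriau,Khaled Belahcene(Heudiasyc),Emmanuel Malherbe,Vincent Mousseau(MICS),Marc Pirlot
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract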

Abstract:Eliciting a preference model involves asking a person, named decision-maker, a series of questions. We assume that these preferences can be represented by an additive value function. In this work, we query simultaneously two decision-makers in the aim to elicit their respective value functions. For each query we receive two answers, without noise, but without knowing which answer corresponds to which decision-maker. We propose an elicitation procedure that identifies the two preference models when the marginal value functions are piecewise linear with known breaking points.

[AI-37] When can we trust untrusted monitoring? A safety case sketch across collusion strategies

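[Quick read]: This paper addresses the risk that highly autonomous AI systems cause catastrophic harm after deployment because of alignment failures, focusing on how to justify the safety of "untrusted monitoring". The approach relaxes the assumptions that earlier AI control research made about the collusion strategies a misaligned AI might use. It proposes a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies, and it builds a safety case sketch that states its assumptions explicitly and identifies unsolved challenges. A key finding is that passive self-recognition may, under some conditions, be a more effective collusion strategy than those studied previously, which motivates more robust evaluations of untrusted monitoring.

Link: https://arxiv.org/abs/2602.20628
Authors: Nelson Gardner-Challis,Jonathan Bostock,Georgiy Kozhevnikov,Morgan Sinclaire,Joan Velja,Alessandro Abate,Charlie Griffin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 66 pages, 14 figures, Preprint

Click to view abstract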

Abstract:AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring – using one untrusted model to oversee another – is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state our assumptions, and highlight unsolved challenges. We identify conditions under which passive self-recognition could be a more effective collusion strategy than those studied previously. Our work builds towards more robust evaluations of untrusted monitoring.

[AI-38] Physics-based phenomenological characterization of cross-modal bias in multimodal models AAAI2026

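[Quick read]: This paper addresses systematic bias in multimodal large language models (MLLMs) caused by complex modality-interaction dynamics, in particular cross-modal bias that conventional embedding- or representation-level analyses fail to capture. The key is a surrogate physics-based model that describes the dynamics of the Transformer architecture (semantic network structure and self-/cross-attention), explaining how cross-modal bias arises from a phenomenologically interpretable perspective. The position is supported by multi-input diagnostic experiments: perturbation analyses of emotion classification with Qwen2.5-Omni and Gemma 3n, and a dynamical analysis of Lorenz chaotic time-series prediction using the physical surrogate. The results indicate that multimodal inputs can reinforce modality dominance rather than mitigate bias, as shown by structured error-attractor patterns under systematic label perturbation together with the dynamical analysis.

Link: https://arxiv.org/abs/2602.20624
Authors: Hyeongmo Kim,Sohyun Kang,Yerin Choi,Seungyeon Ji,Junhyuk Woo,Hyunsuk Chung,Soyeon Caren Han,Kyungreem Han
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech)
Comments: Best Paper Award at BiasinAI track in AAAI2026

Click to view abstract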

Abstract:The term ‘algorithmic fairness’ is used to evaluate whether AI models operate fairly in both comparative (where fairness is understood as formal equality, such as “treat like cases as like”) and non-comparative (where unfairness arises from the model’s inaccuracy, arbitrariness, or inscrutability) contexts. Recent advances in multimodal large language models (MLLMs) are breaking new ground in multimodal understanding, reasoning, and generation; however, we argue that inconspicuous distortions arising from complex multimodal interaction dynamics can lead to systematic bias. The purpose of this position paper is twofold: first, it is intended to acquaint AI researchers with phenomenological explainable approaches that rely on the physical entities that the machine experiences during training/inference, as opposed to the traditional cognitivist symbolic account or metaphysical approaches; second, it is to state that this phenomenological doctrine will be practically useful for tackling algorithmic fairness issues in MLLMs. We develop a surrogate physics-based model that describes transformer dynamics (i.e., semantic network structure and self-/cross-attention) to analyze the dynamics of cross-modal bias in MLLM, which are not fully captured by conventional embedding- or representation-level analyses. We support this position through multi-input diagnostic experiments: 1) perturbation-based analyses of emotion classification using Qwen2.5-Omni and Gemma 3n, and 2) dynamical analysis of Lorenz chaotic time-series prediction through the physical surrogate. Across two architecturally distinct MLLMs, we show that multimodal inputs can reinforce modality dominance rather than mitigate it, as revealed by structured error-attractor patterns under systematic label perturbation, complemented by dynamical analysis.

[AI-39] OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

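[Quick read]: This paper addresses side-channel privacy leakage caused by shared key-value (KV) caches in multi-tenant large language model (LLM) serving frameworks, where an attacker can reconstruct a user's sensitive prompt through cache accesses. Prior work identified this attack surface but did not optimize attack efficiency, which leads to an underestimate of the real privacy risk. The key is OptiLeak, a reinforcement learning-enhanced two-stage fine-tuning framework. It first uses likelihood ranking to automatically identify domain-specific "hard tokens", which are difficult for the model to predict yet carry sensitive information, and then constructs preference pairs for Direct Preference Optimization (DPO). This achieves preference alignment without manual annotation and avoids the overfitting of extended supervised fine-tuning. On three benchmarks spanning medical and financial domains, the method reduces the average number of requests per token by up to 12.48× compared with baselines, with consistent gains across models from 3B to 14B parameters. The results indicate that cache-based side-channel attacks are a more severe threat than previously reported and that production deployments need stronger cache isolation.

Link: https://arxiv.org/abs/2602.20595
Authors: Longxiang Wang,Xiang Zheng,Xuhao Zhang,Yao Zhang,Ye Wu,Cong Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract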

Abstract:Multi-tenant LLM serving frameworks widely adopt shared Key-Value caches to enhance efficiency. However, this creates side-channel vulnerabilities enabling prompt leakage attacks. Prior studies identified these attack surfaces yet focused on expanding attack vectors rather than optimizing attack performance, reporting impractically high attack costs that underestimate the true privacy risk. We propose OptiLeak, a reinforcement learning-enhanced framework that maximizes prompt reconstruction efficiency through two-stage fine-tuning. Our key insight is that domain-specific “hard tokens” – terms difficult to predict yet carrying sensitive information – can be automatically identified via likelihood ranking and used to construct preference pairs for Direct Preference Optimization, eliminating manual annotation. This enables effective preference alignment while avoiding the overfitting issues of extended supervised fine-tuning. Evaluated on three benchmarks spanning medical and financial domains, OptiLeak achieves up to 12.48× reduction in average requests per token compared to baseline approaches, with consistent improvements across model scales from 3B to 14B parameters. Our findings demonstrate that cache-based prompt leakage poses a more severe threat than previously reported, underscoring the need for robust cache isolation in production deployments.
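
The likelihood-ranking idea in the abstract above can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available from some language model and that "hard" means lowest likelihood; the function name `hard_tokens`, the example tokens, and the scores are hypothetical and are not taken from the paper, whose selection rule may differ.

```python
def hard_tokens(token_logprobs, k):
    """Return the k tokens with the lowest log-likelihood (hardest to predict).

    token_logprobs: list of (token, log_probability) pairs (hypothetical input).
    """
    # Sort ascending by log-probability so the least likely tokens come first.
    ranked = sorted(token_logprobs, key=lambda pair: pair[1])
    return [token for token, _ in ranked[:k]]


# Made-up scores for illustration only.
scored_prompt = [
    ("patient", -0.4),
    ("has", -0.2),
    ("metformin", -6.1),
    ("and", -0.1),
    ("warfarin", -5.3),
]
print(hard_tokens(scored_prompt, 2))
```

In this toy input the two drug names rank as hardest, which matches the abstract's intuition that hard-to-predict tokens tend to be the sensitive ones.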

[AI-40] CausalReasoning Benchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

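[Quick read]: This paper addresses a weakness in how automated causal inference is evaluated: existing benchmarks usually score a system on a single numerical output (such as the average treatment effect, ATE), which conflates two distinct steps of causal analysis, namely identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). To diagnose system bottlenecks precisely, the authors propose CausalReasoningBenchmark, which splits evaluation into two independent components: a structured identification specification (strategy, treatment, outcome, control variables, and design-specific elements) and a point estimate with its standard error. Scoring the two components separately distinguishes errors in causal reasoning from errors in numerical execution. A baseline LLM identifies the high-level strategy correctly in 84% of cases but gets the full identification specification right in only 30%, showing that the weakness lies in the details of research design.

Link: https://arxiv.org/abs/2602.20571
Authors: Ayush Sawarni,Jiyuan Tan,Vasilis Syrgkanis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract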

Abstract:Many benchmarks for automated causal inference evaluate a system’s performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 138 real-world datasets, curated from 85 peer-reviewed research papers and four widely-used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 84% of cases, full identification-specification correctness drops to only 30%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.

[AI-41] Maximin Share Guarantees via Limited Cost-Sensitive Sharing AAMAS2026 AAMAS

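[Quick read]: This paper addresses the fair allocation of indivisible goods when limited sharing is allowed, meaning each good may be allocated to at most $k$ agents at a sharing cost. Classic maximin share (MMS) allocations may not exist in many instances, and the paper shows that controlled sharing restores fairness guarantees that are otherwise unattainable. The main contributions are: (1) exact MMS allocations exist whenever goods can be cost-sensitively shared among at least half of the agents and the number of agents is even; (2) a Shared Bag-Filling Algorithm guarantees a $(1 - C)(k - 1)$-approximate MMS allocation, where $C$ is the maximum sharing cost, and it yields an exact MMS allocation when $(1 - C)(k - 1) \geq 1$; (3) a new fairness notion, the Sharing Maximin Share (SMMS), is introduced, and SMMS allocations are shown to always exist under identical utilities and for two agents, while a counterexample shows they do not exist universally; (4) a connection between SMMS and constrained MMS (CMMS) yields approximation guarantees for SMMS from existing CMMS results. Together these results give a theoretical foundation for fair allocation under limited resource sharing in multi-agent environments.

Link: https://arxiv.org/abs/2602.20541
Authors: Hana Salavcova,Martin Černý,Arpita Biswas
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: In Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25 - 29, 2026, IFAAMAS, 11 pages

Click to view abstract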

Abstract:We study the problem of fairly allocating indivisible goods when limited sharing is allowed, that is, each good may be allocated to up to $k$ agents, while incurring a cost for sharing. While classic maximin share (MMS) allocations may not exist in many instances, we demonstrate that allowing controlled sharing can restore fairness guarantees that are otherwise unattainable in certain scenarios. (1) Our first contribution shows that exact maximin share (MMS) allocations are guaranteed to exist whenever goods are allowed to be cost-sensitively shared among at least half of the agents and the number of agents is even; for odd numbers of agents, we obtain a slightly weaker MMS guarantee. (2) We further design a Shared Bag-Filling Algorithm that guarantees a $(1 - C)(k - 1)$-approximate MMS allocation, where $C$ is the maximum cost of sharing a good. Notably, when $(1 - C)(k - 1) \geq 1$, our algorithm recovers an exact MMS allocation. (3) We additionally introduce the Sharing Maximin Share (SMMS) fairness notion, a natural extension of MMS to the $k$-sharing setting. (4) We show that SMMS allocations always exist under identical utilities and for instances with two agents. (5) We construct a counterexample to show the impossibility of the universal existence of an SMMS allocation. (6) Finally, we establish a connection between SMMS and constrained MMS (CMMS), yielding approximation guarantees for SMMS via existing CMMS results. These contributions provide deep theoretical insights for the problem of fair resource allocation when a limited sharing of resources is allowed in multi-agent environments.
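
For readers unfamiliar with the baseline notion, the following sketch computes the classic maximin share of a single agent by brute force. It assumes additive valuations and no sharing, so it illustrates standard MMS only, not the paper's cost-sensitive sharing setting or its Shared Bag-Filling Algorithm; the function name and the example values are hypothetical.

```python
from itertools import product


def maximin_share(values, n_agents):
    """Brute-force MMS of one agent with additive values over indivisible goods.

    The agent partitions the goods into n_agents bundles and receives the
    worst bundle; the MMS is the best worst-bundle value it can secure.
    """
    best = 0
    # Enumerate every assignment of each good to one of the n_agents bundles.
    for assignment in product(range(n_agents), repeat=len(values)):
        bundles = [0] * n_agents
        for good, bundle in enumerate(assignment):
            bundles[bundle] += values[good]
        best = max(best, min(bundles))
    return best


goods = [7, 5, 4, 3, 1]
print(maximin_share(goods, 2))  # partition {7, 3} and {5, 4, 1} gives 10
print(maximin_share(goods, 3))  # partition {7}, {5, 1}, {4, 3} gives 6
```

The enumeration is exponential in the number of goods, so this is only suitable for small illustrative instances.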

[AI-42] A Generalized Apprenticeship Learning Framework for Capturing Evolving Student Pedagogical Strategies

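[Quick read]: This paper addresses two core challenges that limit deep reinforcement learning (DRL) in educational technology: sample inefficiency and the difficulty of designing reward functions. To overcome them, the authors use THEMES, a generalized apprenticeship learning (AL) framework that infers multiple, dynamically evolving reward functions from a small number of expert demonstrations and derives policies that generalize and replicate optimal pedagogical behavior. Using only 18 student trajectories from a previous semester, THEMES predicts student pedagogical decisions in a later semester with an AUC of 0.899 and a Jaccard index of 0.653, outperforming six state-of-the-art baselines and supporting its potential as a tool for inducing effective pedagogical policies.

Link: https://arxiv.org/abs/2602.20527
Authors: Md Mirajul Islam,Xi Yang,Adittya Soukarjya Saha,Rajesh Debnath,Min Chi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages

Click to view abstract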

Abstract:Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) have advanced rapidly in recent years and have been successfully applied to e-learning environments like intelligent tutoring systems (ITSs). Despite great success, the broader application of DRL to educational technologies has been limited due to major challenges such as sample inefficiency and difficulty designing the reward function. In contrast, Apprenticeship Learning (AL) uses a few expert demonstrations to infer the expert’s underlying reward functions and derive decision-making policies that generalize and replicate optimal behavior. In this work, we leverage a generalized AL framework, THEMES, to induce effective pedagogical policies by capturing the complexities of the expert student learning process, where multiple reward functions may dynamically evolve over time. We evaluate the effectiveness of THEMES against six state-of-the-art baselines, demonstrating its superior performance and highlighting its potential as a powerful alternative for inducing effective pedagogical policies. We further show that it can achieve high performance, with an AUC of 0.899 and a Jaccard of 0.653, using only 18 trajectories of a previous semester to predict student pedagogical decisions in a later semester.

[AI-43] ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

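[Quick read]: This paper addresses the high cost, high latency, and low accuracy of existing graphical user interface (GUI) agents, which stem from reactive, step-by-step calls to a vision language model (VLM) and the lack of persistent memory of previously visited pages. The key is ActionEngine, a framework with a two-agent architecture: a Crawling Agent builds an updatable state-machine memory of the GUI through offline exploration, and an Execution Agent uses that memory to synthesize complete, executable Python programs for online task execution. When execution fails, a vision-based re-grounding fallback repairs the failed action and updates the memory. On Reddit tasks from the WebArena benchmark, the agent reaches 95% task success with a single large language model (LLM) call on average, 29 percentage points above the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x.

Link: https://arxiv.org/abs/2602.20502
Authors: Hongbin Zhong,Fazle Faisal,Luis França,Tanakorn Leesatapornwongsa,Adriana Szekeres,Kexin Rong,Suman Nath
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract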

Abstract:Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision language models–taking a screenshot, reasoning about the next action, executing it, then repeating on the new page–resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x. Together, these components yield scalable and reliable GUI interaction by combining global programmatic planning, crawler-validated action templates, and node-level execution with localized validation and repair. 
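
The idea of planning over a state-machine memory, as described in the abstract above, can be sketched as a graph search. This is a hedged illustration only: the `MEMORY` dictionary, the page names, the action strings, and the `plan` function are hypothetical, and the paper's actual memory format and program synthesis are not specified here.

```python
from collections import deque

# Hypothetical state-machine memory: page state -> {action label: next state}.
MEMORY = {
    "home": {"click('Forums')": "forum_list", "click('Login')": "login"},
    "forum_list": {"click('r/books')": "subreddit"},
    "subreddit": {"click('Submit')": "new_post"},
    "login": {"submit_credentials()": "home"},
    "new_post": {},
}


def plan(memory, start, goal):
    """Breadth-first search for the shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, next_state in memory.get(state, {}).items():
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, actions + [action]))
    return None  # goal is not reachable from the recorded states


print(plan(MEMORY, "home", "new_post"))
```

Because the whole action sequence is produced from memory in one pass, no model call is needed per step, which is the efficiency argument the abstract makes; a real system would still need a fallback when the recorded transitions are stale.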

[AI-44] KairosVL: Orchestrating Time Series and Semantics for Unified Reasoning

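[Quick read]: This paper addresses the limitation of traditional time series analysis on increasingly complex, decision-oriented tasks: it relies on numerical modeling alone and lacks contextual and semantic understanding. The authors introduce a new task, Semantic-Conditional Time Series Reasoning, and design a two-round reinforcement learning framework: the first round strengthens the model's perception of fundamental temporal primitives, and the second focuses on semantic-conditioned reasoning. The core idea is to combine semantic understanding with temporal modeling, which improves reasoning and generalization in complex scenarios and yields the model KairosVL.

Link: https://arxiv.org/abs/2602.20494
Authors: Haotian Si,Changhua Pei,Xiao He,Zeyan Li,Zhe Xie,Zexin Wang,Jiyao Hu,Zhaoyang Yu,Tieying Zhang,Dan Pei,Jianhui Li,Gaogang Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract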

Abstract:Driven by the increasingly complex and decision-oriented demands of time series analysis, we introduce the Semantic-Conditional Time Series Reasoning task, which extends conventional time series analysis beyond purely numerical modeling to incorporate contextual and semantic understanding. To further enhance the model’s reasoning capabilities on complex time series problems, we propose a two-round reinforcement learning framework: the first round strengthens the model’s perception of fundamental temporal primitives, while the second focuses on semantic-conditioned reasoning. The resulting model, KairosVL, achieves competitive performance across both synthetic and real-world tasks. Extensive experiments and ablation studies demonstrate that our framework not only boosts performance but also preserves intrinsic reasoning ability and significantly improves generalization to unseen scenarios. To summarize, our work highlights the potential of combining semantic reasoning with temporal modeling and provides a practical framework for real-world time series intelligence, which is in urgent demand.

[AI-45] Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA

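[Quick read]: This paper addresses three core problems in collaboratively fine-tuning large language models (LLMs) with low-rank adaptation (LoRA) under decentralized federated learning (DFL): (i) catastrophic knowledge forgetting during fine-tuning, caused by conflicting update directions under data heterogeneity; (ii) inefficient communication and slow convergence during model aggregation, caused by bandwidth-intensive redundant model transmissions; and (iii) multi-task knowledge interference during inference, caused by the coexistence of incompatible knowledge representations. The solution has three parts. First, a sparse-and-orthogonal LoRA enforces orthogonality between model updates to eliminate direction conflicts. Second, an analysis of how device connection topology affects multi-task performance motivates a cluster-based topology design. Third, an implicit mixture of experts (MoE) mechanism avoids the coexistence of incompatible knowledge at inference time.

Link: https://arxiv.org/abs/2602.20492
Authors: Nuocheng Yang,Sihua Wang,Ouwen Huan,Mingzhe Chen,Tony Q. S. Quek,Changchuan Yin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Click to view abstract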

Abstract:Decentralized federated learning (DFL) based on low-rank adaptation (LoRA) enables mobile devices with multi-task datasets to collaboratively fine-tune a large language model (LLM) by exchanging locally updated parameters with a subset of neighboring devices via wireless connections for knowledge this http URL, directly aggregating parameters fine-tuned on heterogeneous datasets induces three primary issues across the DFL life-cycle: (i) catastrophic knowledge forgetting during fine-tuning process, arising from conflicting update directions caused by data heterogeneity; (ii) inefficient communication and convergence during model aggregation process, due to bandwidth-intensive redundant model transmissions; and (iii) multi-task knowledge interference during inference process, resulting from incompatible knowledge representations coexistence during inference. To address these issues in a fully decentralized scenario, we first propose a sparse-and-orthogonal LoRA that ensures orthogonality between model updates to eliminate direction conflicts during fine-tuning. We then analyze how device connection topology affects multi-task performance, prompting a cluster-based topology design during aggregation. Finally, we propose an implicit mixture of experts (MoE) mechanism to avoid the coexistence of incompatible knowledge during inference. Simulation results demonstrate that the proposed approach effectively reduces communication resource consumption by up to 73% and enhances average performance by 5% compared with the traditional LoRA method.

[AI-46] VINA: Variational Invertible Neural Architectures

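[Quick read]: This paper addresses the lack of theoretical guarantees for normalizing flows (NFs) and invertible neural networks (INNs) in practice, specifically guarantees on approximation quality for posterior inference and distribution modeling under weaker, more realistic assumptions. The key is a unified framework based on variational unsupervised loss functions, inspired by generative adversarial networks (GANs) and by the Precision-Recall divergence used to train normalizing flows. Within this framework the paper quantifies posterior accuracy for INNs and distributional accuracy for NFs without the strong assumptions of earlier work, and it derives design principles and practical guidelines from the theory.

Link: https://arxiv.org/abs/2602.20480
Authors: Shubhanshu Shekhar,Mohammad Javad Khojasteh,Ananya Acharya,Tony Tohme,Kamal Youcef-Toumi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 57 pages, 11 figures, 5 tables

Click to view abstract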

Abstract:The distinctive architectural features of normalizing flows (NFs), notably bijectivity and tractable Jacobians, make them well-suited for generative modeling. Invertible neural networks (INNs) build on these principles to address supervised inverse problems, enabling direct modeling of both forward and inverse mappings. In this paper, we revisit these architectures from both theoretical and practical perspectives and address a key gap in the literature: the lack of theoretical guarantees on approximation quality under realistic assumptions, whether for posterior inference in INNs or for generative modeling with NFs. We introduce a unified framework for INNs and NFs based on variational unsupervised loss functions, inspired by analogous formulations in related areas such as generative adversarial networks (GANs) and the Precision-Recall divergence for training normalizing flows. Within this framework, we derive theoretical performance guarantees, quantifying posterior accuracy for INNs and distributional accuracy for NFs, under assumptions that are weaker and more practically realistic than those used in prior work. Building on these theoretical results, we conduct extensive case studies to distill general design principles and practical guidelines. We conclude by demonstrating the effectiveness of our approach on a realistic ocean-acoustic inversion problem.
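
As background for why bijectivity and tractable Jacobians matter, the standard change-of-variables identity used by normalizing flows is shown below. The symbols are assumptions introduced for illustration and are not taken from the paper: $f$ denotes the invertible map from data $x$ to latent $z = f(x)$, and $p_Z$ is the base density.

```latex
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Bijectivity makes the identity valid, and a tractable Jacobian determinant makes the second term cheap to evaluate, so the exact log-likelihood can be used as a training objective.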

[AI-47] Elimination-compensation pruning for fully-connected neural networks

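[Quick read]: This paper addresses the balance between compression and information preservation in deep neural networks, targeting a limitation of traditional pruning, which assumes that removable weights have little impact on network error. The key idea is a new pruning strategy: after removing a weight, the optimal perturbation of its adjacent bias is computed via automatic differentiation to compensate for the output change caused by the removal. This gives a more accurate measure of each weight's importance and better performance after pruning. The perturbation does not affect model sparsity but helps keep the network's output behavior stable, and the method is benchmarked against several popular pruning strategies.

Link: https://arxiv.org/abs/2602.20467
Authors: Enrico Ballini,Luca Muscarnera,Alessio Fumagalli,Anna Scotti,Francesco Regazzoni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract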

Abstract:The unmatched ability of Deep Neural Networks in capturing complex patterns in large and noisy datasets is often associated with their large hypothesis space, and consequently to the vast amount of parameters that characterize model architectures. Pruning techniques affirmed themselves as valid tools to extract sparse representations of neural networks parameters, carefully balancing between compression and preservation of information. However, a fundamental assumption behind pruning is that expendable weights should have small impact on the error of the network, while highly important weights should tend to have a larger influence on the inference. We argue that this idea could be generalized; what if a weight is not simply removed but also compensated with a perturbation of the adjacent bias, which does not contribute to the network sparsity? Our work introduces a novel pruning method in which the importance measure of each weight is computed considering the output behavior after an optimal perturbation of its adjacent bias, efficiently computable by automatic differentiation. These perturbations can be then applied directly after the removal of each weight, independently of each other. After deriving analytical expressions for the aforementioned quantities, numerical experiments are conducted to benchmark this technique against some of the most popular pruning strategies, demonstrating an intrinsic efficiency of the proposed approach in very diverse machine learning scenarios. Finally, our findings are discussed and the theoretical implications of our results are presented.
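
The compensation idea can be illustrated on a single linear unit. The sketch below is a simplified least-squares example and not the paper's procedure, which derives the perturbation for the network output and computes it by automatic differentiation. It assumes a unit $z = \sum_k w_k x_k + b$: when weight $w_j$ is removed and the bias is shifted by $\delta$, the mean squared change in $z$ is the mean of $(w_j x_j - \delta)^2$, which is minimized at $\delta = w_j \cdot \mathrm{mean}(x_j)$ and leaves a residual of $w_j^2 \cdot \mathrm{Var}(x_j)$. The function name and sample inputs are hypothetical.

```python
from statistics import fmean, pvariance


def compensated_importance(weight, inputs):
    """Return (bias shift, residual error) for pruning one weight of a linear unit.

    inputs: observed values of the input feature attached to the pruned weight.
    """
    # Optimal bias shift in the least-squares sense: the mean contribution
    # that the removed weight used to add to the pre-activation.
    shift = weight * fmean(inputs)
    # Mean squared pre-activation error that remains after compensation.
    residual = weight ** 2 * pvariance(inputs)
    return shift, residual


samples = [1.0, 2.0, 3.0, 4.0]
shift, residual = compensated_importance(0.5, samples)
print(shift, residual)
```

Under this simplification a weight on a nearly constant input has a small residual even when the weight itself is large, so it ranks as more expendable than a magnitude-only criterion would suggest.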

[AI-48] Imputation of Unknown Missingness in Sparse Electronic Health Records

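[Quick read]: This paper addresses the "unknown unknowns" problem in electronic health records (EHRs): when data are missing, it is hard to tell whether an event truly did not occur or simply was not recorded because of sharing limitations. Traditional imputation methods mainly target known unknowns and cannot handle this semantically ambiguous kind of missingness. The key is a general-purpose denoising algorithm built on a Transformer-based neural network whose output is thresholded adaptively to identify and recover potentially missing values in binary EHRs. This improves data completeness and downstream performance, for example a statistically significant improvement over existing baselines on a real-world hospital readmission prediction task.

Link: https://arxiv.org/abs/2602.20442
Authors: Jun Han,Josue Nassar,Sanjit Singh Batra,Aldo Cordova-Palomera,Vijay Nori,Robert E. Tillman
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract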

Abstract:Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made, but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches and leads to increased performance on downstream tasks using the denoised data. In particular, when applying our method to a real world application, predicting hospital readmission from EHRs, our method achieves statistically significant improvement over all existing baselines.

[AI-49] Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

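[Quick read]: This paper addresses the performance bottleneck that low-quality tool interfaces create for large language model (LLM) agents when they must choose from large candidate tool sets. Existing methods mostly optimize tool interfaces from execution traces, which are hard to obtain in cold-start or privacy-constrained settings, and they usually optimize each tool independently, which limits scalability and generalization to unseen tools. The key is Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich training settings to trace-free deployment, encouraging the model to abstract reusable interface-usage patterns and tool usage outcomes and thereby improving the accuracy and generalization of tool selection.

Link: https://arxiv.org/abs/2602.20426
Authors: Ruocheng Guo,Kaiwen Dong,Xiang Gao,Kamalika Das
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint

Click to view abstract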

Abstract:The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces-including natural language descriptions and parameter schemas-remain largely human-oriented and often become a bottleneck, especially when agents must select from large candidate tool sets. Existing approaches to improving tool interfaces rely on execution traces, which are frequently unavailable in cold-start or privacy-constrained settings, and typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to abstract reusable interface-usage patterns and tool usage outcomes. To support this approach, we construct a large-scale dataset of high-quality tool interfaces using a structured workflow over a diverse collection of tools. Experiments on StableToolBench and RestBench show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that tool interface optimization is a practical and deployable complement to agent fine-tuning.

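[AI-50] Implicit Intelligence – Evaluating Agents on What Users Don't Say

【Quick Read】: This paper addresses the difficulty current AI agents have with implicit reasoning when real-world requests are underspecified: existing benchmarks evaluate only explicit instruction-following and overlook whether agents can reason about implicit requirements such as accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. The key to the solution is the Implicit Intelligence evaluation framework paired with the Agent-as-a-World (AaW) simulation harness. The former systematically tests whether agents can infer hidden complexity and constraints from seemingly simple user requests; the latter defines interactive scenarios in human-readable YAML files that are simulated by language models, so that agents must discover and satisfy unstated constraints through environmental exploration, pushing AI from mere instruction-following toward genuine goal fulfillment.

Link: https://arxiv.org/abs/2602.20424
Authors: Ved Sirdeshmukh, Marc Wetter
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: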

Abstract:Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We present Implicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired with Agent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.

[AI-51] Diffusion Modulation via Environment Mechanism Modeling for Planning

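【Quick Read】: This paper addresses the problem that conventional diffusion-based planning methods for offline reinforcement learning (offline RL) generate trajectories without adequately accounting for consistency with the environment's transition dynamics and reward function, which leads to considerable discrepancies between generated trajectories and the mechanisms of the real environment. The key to the solution is a new method called Diffusion Modulation via Environment Mechanism Modeling (DMEMM), which incorporates key environment mechanisms, in particular transition dynamics and reward functions, into diffusion model training, making generated trajectories more consistent with the real environment and yielding more reliable planning.

Link: https://arxiv.org/abs/2602.20422
Authors: Hanping Zhang, Yuhong Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: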

Abstract:Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.

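[AI-52] Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation

【Quick Read】: This paper addresses the problem that evaluations of techniques for steering language models toward truthful outputs, either without external labels (unsupervised elicitation) or by training on easy tasks (easy-to-hard generalization), may be overoptimistic because the evaluation datasets are idealized. The key is to construct challenging datasets that break three properties typical of those benchmarks, so that they (1) contain features more salient than truthfulness, (2) have imbalanced training sets, and (3) include data points to which the model cannot give a well-defined answer, and to use them to stress-test standard unsupervised elicitation and easy-to-hard generalization techniques. The study finds that no existing technique performs reliably well on these challenges, and that ensembling or combining the two families of techniques only partially mitigates the performance degradation, suggesting that future work should prioritize these real-world complications.

Link: https://arxiv.org/abs/2602.20400
Authors: Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala, Fabien Roger
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures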

Abstract:To steer language models towards truthful outputs on tasks which are beyond human capability, previous work has suggested training models on easy tasks to steer them on harder ones (easy-to-hard generalization), or using unsupervised training algorithms to steer models with no external labels at all (unsupervised elicitation). Although techniques from both paradigms have been shown to improve model accuracy on a wide variety of tasks, we argue that the datasets used for these evaluations could cause overoptimistic evaluation results. Unlike many real-world datasets, they often (1) have no features with more salience than truthfulness, (2) have balanced training sets, and (3) contain only data points to which the model can give a well-defined answer. We construct datasets that lack each of these properties to stress-test a range of standard unsupervised elicitation and easy-to-hard generalization techniques. We find that no technique reliably performs well on any of these challenges. We also study ensembling and combining easy-to-hard and unsupervised techniques, and find they only partially mitigate performance degradation due to these challenges. We believe that overcoming these challenges should be a priority for future work on unsupervised elicitation.

[AI-53] Learning During Detection: Continual Learning for Neural OFDM Receivers via DMRS

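【Quick Read】: This paper addresses the performance degradation of deep neural network (DNN) receivers for orthogonal frequency-division multiplexing (OFDM) when channel distributions shift rapidly over time; conventional approaches require periodic retraining, which incurs resource overhead and risks service interruption. The key to the solution is a zero-overhead online continual learning framework that uses existing demodulation reference signals (DMRS) to perform soft-bit detection and model adaptation simultaneously. It offers three flexible pilot designs (fully randomized, hybrid, and additional pilots) that support joint demodulation and learning, together with two receiver architectures (a parallel design that guarantees uninterrupted operation, and a forward-pass reusing design that reduces computational complexity). The framework thereby tracks both slow and fast channel distribution variations without extra resources or service interruption, and avoids catastrophic performance degradation.

Link: https://arxiv.org/abs/2602.20361
Authors: Mohanad Obeed, Ming Jian
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: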

Abstract:Deep neural networks (DNNs) have been increasingly explored for receiver design because they can handle complex environments without relying on explicit channel models. Nevertheless, because communication channels change rapidly, their distributions can shift over time, often making periodic retraining necessary. This paper proposes a zero-overhead online and continual learning framework for orthogonal frequency-division multiplexing (OFDM) neural receivers that directly detect the soft bits of received signals. Unlike conventional fine-tuning methods that rely on dedicated training intervals or full resource grids, our approach leverages existing demodulation reference signals (DMRS) to simultaneously enable signal demodulation and model adaptation. We introduce three pilot designs: fully randomized, hybrid, and additional pilots that flexibly support joint demodulation and learning. To accommodate these pilot designs, we develop two receiver architectures: (i) a parallel design that separates inference and fine-tuning for uninterrupted operation, and (ii) a forward-pass reusing design that reduces computational complexity. Simulation results show that the proposed method effectively tracks both slow and fast channel distribution variations without additional overhead, service interruption, or catastrophic performance degradation under distribution shift.

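[AI-54] Hierarchical Molecular Representation Learning via Fragment-Based Self-Supervised Embedding Prediction

【Quick Read】: This paper addresses the fact that existing graph self-supervised learning (GSSL) methods for molecular graph representation learning overlook chemically relevant substructures, which makes it hard for models to capture the local chemical fragments that strongly influence molecular properties. The key to the solution is a hierarchical self-supervised framework, the Graph Semantic Predictive Network (GraSPNet), which decomposes molecular graphs into chemically meaningful fragments without a predefined vocabulary and uses multi-level message passing with masked semantic prediction at both the atom level and the fragment level, explicitly modeling multi-level semantics to learn more expressive and transferable molecular representations.

Link: https://arxiv.org/abs/2602.20344
Authors: Jiele Wu, Haozhe Ma, Zhihan Guo, Thanh Vinh Vo, Tze Yun Leong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 15 pages (8 pages main text), 8 figures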

Abstract:Graph self-supervised learning (GSSL) has demonstrated strong potential for generating expressive graph embeddings without the need for human annotations, making it particularly valuable in domains with high labeling costs such as molecular graph analysis. However, existing GSSL methods mostly focus on node- or edge-level information, often ignoring chemically relevant substructures which strongly influence molecular properties. In this work, we propose Graph Semantic Predictive Network (GraSPNet), a hierarchical self-supervised framework that explicitly models both atomic-level and fragment-level semantics. GraSPNet decomposes molecular graphs into chemically meaningful fragments without predefined vocabularies and learns node- and fragment-level representations through multi-level message passing with masked semantic prediction at both levels. This hierarchical semantic supervision enables GraSPNet to learn multi-resolution structural information that is both expressive and transferable. Extensive experiments on multiple molecular property prediction benchmarks demonstrate that GraSPNet learns chemically meaningful representations and consistently outperforms state-of-the-art GSSL methods in transfer learning settings.

[AI-55] DMCD: Semantic-Statistical Framework for Causal Discovery

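【Quick Read】: This paper addresses the limited efficiency and accuracy of structure learning in causal discovery when data are sparse or high-dimensional, especially the difficulty of identifying true causal relationships without prior knowledge. The key to the solution is DMCD (DataMap Causal Discovery), a two-phase causal discovery framework. In the first phase, a large language model (LLM) uses variable metadata to generate a sparse draft directed acyclic graph (DAG) that serves as a semantically informed prior. In the second phase, conditional independence tests audit and revise this draft, using statistical validation to correct missing or redundant edges. On real-world datasets from industrial engineering, environmental monitoring, and IT systems analysis, the method achieves competitive or leading performance against existing baselines, with particularly large gains in recall and F1 score, demonstrating the value of combining semantic reasoning with statistical validation.

Link: https://arxiv.org/abs/2602.20333
Authors: Samarth KaPatel, Sofia Nikiforova, Giacinto Paolo Saggese, Paul Smith
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: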

Abstract:We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs. Overall, our results demonstrate that combining semantic priors with principled statistical verification yields a high-performing and practically effective approach to causal structure learning.
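
To make the Phase II idea concrete, here is a minimal sketch of auditing a draft DAG with conditional independence tests. The abstract does not say which test DMCD uses, so the first-order partial correlation with a Fisher z-transform, the function names `ci_pvalue` and `audit`, the threshold `alpha`, and the toy chain X -> Z -> Y are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random
from statistics import NormalDist


def pearson(a, b):
    """Plain Pearson correlation between two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def ci_pvalue(data, x, y, z):
    """Two-sided p-value for 'x independent of y given z' (assumed test:
    first-order partial correlation followed by a Fisher z-transform)."""
    rxy, rxz, ryz = (pearson(data[x], data[y]),
                     pearson(data[x], data[z]),
                     pearson(data[y], data[z]))
    r = (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    n = len(data[x])
    stat = math.sqrt(n - 1 - 3) * math.atanh(r)  # one conditioning variable
    return 2 * (1 - NormalDist().cdf(abs(stat)))


def audit(draft_edges, data, alpha=0.01):
    """Flag draft edges whose endpoints look independent given some third variable."""
    flagged = []
    for a, b in draft_edges:
        others = [v for v in data if v not in (a, b)]
        if any(ci_pvalue(data, a, b, z) > alpha for z in others):
            flagged.append((a, b))
    return flagged


# Toy data from a chain X -> Z -> Y (hypothetical example, not from the paper).
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(2000)]
zs = [x + random.gauss(0, 1) for x in xs]
ys = [z + random.gauss(0, 1) for z in zs]
data = {"X": xs, "Z": zs, "Y": ys}

# A draft that contains the two true edges plus a redundant shortcut X -> Y.
draft = [("X", "Z"), ("Z", "Y"), ("X", "Y")]
flagged = audit(draft, data)
print("edges flagged for revision:", flagged)
```

In this toy setup the shortcut X -> Y is the natural candidate for removal, because X and Y become approximately independent once Z is conditioned on, while the two true edges remain strongly dependent under conditioning.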

[AI-56] Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

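【Quick Read】: This paper addresses the limited generalization of robots in object manipulation tasks caused by a lack of real-time understanding of specific physical properties such as friction and stability. Existing vision-language models (VLMs) can reason about physics in general terms but struggle to predict how a specific object will behave in a specific environment, for example how a particular ball will roll on a particular surface or whether a particular stone will provide stable support. The key to the solution is PhysMem, a memory framework whose core mechanism is to learn physical principles at test time by recording interaction experiences, generating candidate hypotheses, and verifying them through targeted interaction, all without updating model parameters. It emphasizes verification before application: the system validates a hypothesis through targeted interaction before using it for decisions, which reduces rigid reliance on prior experience and improves adaptation when physical conditions change.

Link: https://arxiv.org/abs/2602.20323
Authors: Haoyang Li, Yang You, Hao Su, Leonidas Guibas
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: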

Abstract:Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.

[AI-57] Shape-informed cardiac mechanics surrogates in data-scarce regimes via geometric encoding and generative augmentation

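【Quick Read】: This paper addresses the fact that high-fidelity computational models of cardiac mechanics are too computationally expensive for routine clinical use, and that conventional surrogate models struggle to generalize across anatomies, especially when data are scarce. The key to the solution is a two-step framework. First, a shape model learns a compact latent representation of left ventricular geometries, which also enables the generation of synthetic geometries for data augmentation. Second, a neural field-based surrogate model, conditioned on this geometric encoding, predicts ventricular displacement under external loading and uses universal ventricular coordinates for positional encoding to improve generalization across anatomies. By decoupling geometric representation from learning the physics response, the method improves prediction accuracy and robustness in data-scarce settings.

Link: https://arxiv.org/abs/2602.20306
Authors: Davide Carrara, Marc Hirschvogel, Francesca Bonizzoni, Stefano Pagani, Simone Pezzuto, Francesco Regazzoni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 39 pages, 19 figures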

Abstract:High-fidelity computational models of cardiac mechanics provide mechanistic insight into the heart function but are computationally prohibitive for routine clinical use. Surrogate models can accelerate simulations, but generalization across diverse anatomies is challenging, particularly in data-scarce settings. We propose a two-step framework that decouples geometric representation from learning the physics response, to enable shape-informed surrogate modeling under data-scarce conditions. First, a shape model learns a compact latent representation of left ventricular geometries. The learned latent space effectively encodes anatomies and enables synthetic geometries generation for data augmentation. Second, a neural field-based surrogate model, conditioned on this geometric encoding, is trained to predict ventricular displacement under external loading. The proposed architecture performs positional encoding by using universal ventricular coordinates, which improves generalization across diverse anatomies. Geometric variability is encoded using two alternative strategies, which are systematically compared: a PCA-based approach suitable for working with point cloud representations of geometries, and a DeepSDF-based implicit neural representation learned directly from point clouds. Overall, our results, obtained on idealized and patient-specific datasets, show that the proposed approaches allow for accurate predictions and generalization to unseen geometries, and robustness to noisy or sparsely sampled inputs.

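[AI-58] Multilevel Determinants of Overweight and Obesity Among U.S. Children Aged 10-17: Comparative Evaluation of Statistical and Machine Learning Approaches Using the 2021 National Survey of Children's Health

【Quick Read】: This paper addresses multilevel predictive modeling of overweight and obesity among U.S. adolescents: identifying key behavioral, household, and community predictors, and comparing statistical, machine learning (ML), and deep learning (DL) models in terms of predictive performance, calibration, and subgroup equity. The approach first integrates predictors across multiple domains using large-scale cross-sectional data (the 2021 National Survey of Children's Health, n=18,792), and then evaluates a range of algorithms. Although more complex models such as gradient boosting and deep neural networks modestly improve recall and F1 score, logistic regression maintains one of the most stable balances between discrimination and calibration, and no model is superior on all metrics. More importantly, performance disparities across race and poverty groups persist regardless of algorithmic complexity, suggesting that the core issue is not algorithm optimization but improved data quality and equity-focused surveillance.

Link: https://arxiv.org/abs/2602.20303
Authors: Joyanta Jyoti Mondal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: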

Abstract:Background: Childhood and adolescent overweight and obesity remain major public health concerns in the United States and are shaped by behavioral, household, and community factors. Their joint predictive structure at the population level remains incompletely characterized. Objectives: The study aims to identify multilevel predictors of overweight and obesity among U.S. adolescents and compare the predictive performance, calibration, and subgroup equity of statistical, machine-learning, and deep-learning models. Data and Methods: We analyze 18,792 children aged 10-17 years from the 2021 National Survey of Children’s Health. Overweight/obesity is defined using BMI categories. Predictors included diet, physical activity, sleep, parental stress, socioeconomic conditions, adverse experiences, and neighborhood characteristics. Models include logistic regression, random forest, gradient boosting, XGBoost, LightGBM, multilayer perceptron, and TabNet. Performance is evaluated using AUC, accuracy, precision, recall, F1 score, and Brier score. Results: Discrimination range from 0.66 to 0.79. Logistic regression, gradient boosting, and MLP showed the most stable balance of discrimination and calibration. Boosting and deep learning modestly improve recall and F1 score. No model was uniformly superior. Performance disparities across race and poverty groups persist across algorithms. Conclusion: Increased model complexity yields limited gains over logistic regression. Predictors consistently span behavioral, household, and neighborhood domains. Persistent subgroup disparities indicate the need for improved data quality and equity-focused surveillance rather than greater algorithmic complexity.

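[AI-59] Quantifying the Expectation-Realisation Gap for Agentic AI Systems

【Quick Read】: This paper addresses the substantial discrepancy between pre-deployment expectations and post-deployment outcomes for agentic AI systems, termed the expectation-realisation gap. The evidence shows that in software development, clinical documentation, and clinical decision support, the realised benefits of AI tools fall well short of developer or vendor claims, driven mainly by workflow integration friction, verification burden, measurement construct mismatches, and systematic heterogeneity in treatment effects. The key to the solution is structured planning frameworks that require explicit, quantified benefit expectations before deployment and factor in human oversight costs, improving the effectiveness and predictability of AI adoption.

Link: https://arxiv.org/abs/2602.20292
Authors: Sebastian Lobentanzer
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, no figures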

Abstract:Agentic AI systems are deployed with expectations of substantial productivity gains, yet rigorous empirical evidence reveals systematic discrepancies between pre-deployment expectations and post-deployment outcomes. We review controlled trials and independent validations across software engineering, clinical documentation, and clinical decision support to quantify this expectation-realisation gap. In software development, experienced developers expected a 24% speedup from AI tools but were slowed by 19% – a 43 percentage-point calibration error. In clinical documentation, vendor claims of multi-minute time savings contrast with measured reductions of less than one minute per note, and one widely deployed tool showed no statistically significant effect. In clinical decision support, externally validated performance falls substantially below developer-reported metrics. These shortfalls are driven by workflow integration friction, verification burden, measurement construct mismatches, and systematic heterogeneity in treatment effects. The evidence motivates structured planning frameworks that require explicit, quantified benefit expectations with human oversight costs factored in.
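
The 43 percentage-point figure in the abstract follows from treating the expected speedup as +24% and the observed slowdown as -19%. The helper name `calibration_gap_pp` below is hypothetical and only restates that arithmetic.

```python
def calibration_gap_pp(expected_pct: float, observed_pct: float) -> float:
    """Gap in percentage points between expected and observed change in speed.

    Positive values mean speedup, negative values mean slowdown.
    """
    return expected_pct - observed_pct


# Figures quoted in the abstract: +24% expected, 19% slower in practice.
gap = calibration_gap_pp(24.0, -19.0)
print(f"expectation-realisation gap: {gap:.0f} percentage points")
```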

[AI-60] Uncertainty-Aware Delivery Delay Duration Prediction via Multi-Task Deep Learning

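【Quick Read】: This paper addresses the difficulty of predicting delivery delay duration in logistics supply chains, where conventional methods struggle with severe data imbalance (delayed shipments are rare but consequential) and the complexity of multi-region, multimodal transportation. The key to the solution is a multi-task deep learning model that embeds high-dimensional tabular features with dedicated embedding layers and adopts a joint classification-then-regression strategy: it first predicts whether a shipment is delayed and then predicts the delay duration. This enables end-to-end training, improves the detection of delayed cases, and supports uncertainty-aware probabilistic forecasting, outperforming single-step and two-step tree-based baselines.

Link: https://arxiv.org/abs/2602.20271
Authors: Stefan Faulkner, Reza Zandehshahvar, Vahid Eghbal Akhlaghi, Sebastien Ouellet, Carsten Jordan, Pascal Van Hentenryck
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Applications (stat.AP)
Comments: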

Abstract:Accurate delivery delay prediction is critical for maintaining operational efficiency and customer satisfaction across modern supply chains. Yet the increasing complexity of logistics networks, spanning multimodal transportation, cross-country routing, and pronounced regional variability, makes this prediction task inherently challenging. This paper introduces a multi-task deep learning model for delivery delay duration prediction in the presence of significant imbalanced data, where delayed shipments are rare but operationally consequential. The model embeds high-dimensional shipment features with dedicated embedding layers for tabular data, and then uses a classification-then-regression strategy to predict the delivery delay duration for on-time and delayed shipments. Unlike sequential pipelines, this approach enables end-to-end training, improves the detection of delayed cases, and supports probabilistic forecasting for uncertainty-aware decision making. The proposed approach is evaluated on a large-scale real-world dataset from an industrial partner, comprising more than 10 million historical shipment records across four major source locations with distinct regional characteristics. The proposed model is compared with traditional machine learning methods. Experimental results show that the proposed method achieves a mean absolute error of 0.67-0.91 days for delayed-shipment predictions, outperforming single-step tree-based regression baselines by 41-64% and two-step classify-then-regress tree-based models by 15-35%. These gains demonstrate the effectiveness of the proposed model in operational delivery delay forecasting under highly imbalanced and heterogeneous conditions.
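
A rough sketch of how a classification-then-regression objective can be trained jointly is shown below. The abstract does not give the loss, so the binary cross-entropy plus masked absolute-error form, the weight `reg_weight`, the decision `threshold`, and both function names are assumptions for illustration only.

```python
import math


def multitask_loss(p_delay, pred_days, is_delayed, true_days, reg_weight=1.0):
    """Per-shipment loss: classification term plus a regression term that is
    only active for shipments that were actually delayed (assumed form)."""
    eps = 1e-12  # avoids log(0)
    bce = -(is_delayed * math.log(p_delay + eps)
            + (1 - is_delayed) * math.log(1 - p_delay + eps))
    mae = abs(pred_days - true_days) if is_delayed else 0.0
    return bce + reg_weight * mae


def predict_delay_days(p_delay, pred_days, threshold=0.5):
    """Two-stage inference: report a duration only when the classifier
    head says the shipment is likely delayed."""
    return pred_days if p_delay >= threshold else 0.0


# A delayed shipment (2 days late): a confident, accurate prediction is cheaper.
print(multitask_loss(0.9, 2.0, 1, 2.0), multitask_loss(0.1, 4.0, 1, 2.0))
print(predict_delay_days(0.8, 3.0), predict_delay_days(0.2, 3.0))
```

Masking the regression term for on-time shipments is one common way to keep the rare delayed cases from being swamped by the majority class; whether the paper does exactly this is not stated in the abstract.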

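[AI-61] MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning (CVPR 2026)

【Quick Read】: This paper addresses the limitations of TabPFN (Tabular Prior-data Fitted Network) in handling heterogeneous modalities such as images and text, with the aim of extending its applicability to multimodal settings such as healthcare and marketing. The key to the solution is the Multi-Modal Prior-data Fitted Network (MMPFN), which combines per-modality encoders, modality projectors, and pre-trained foundation models to model tabular and non-tabular modalities in a unified way. The modality projectors are the core component: they transform non-tabular embeddings into tabular-compatible tokens, and they use a multi-head gated MLP and a cross-attention pooler to extract richer context and mitigate attention imbalance in multimodal learning.

Link: https://arxiv.org/abs/2602.20223
Authors: Wall Kim, Chaeyoung Song, Hanul Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026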

Abstract:Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention pooler that extract richer context from non-tabular inputs while mitigates attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning. The source code is available at this https URL.

[AI-62] What Matters for Simulation to Online Reinforcement Learning on Real Robots

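【Quick Read】: This paper addresses which design choices matter for successfully deploying online reinforcement learning (online RL) on physical robots, that is, which algorithmic, systems, and experimental decisions are essential for stable learning. Across 100 real-world training runs on three distinct robotic platforms, the study systematically ablates design factors that prior work typically leaves implicit. It finds that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yields stable learning across tasks and hardware. The key contribution is identifying and validating these usually implicit design choices, which lowers the engineering effort needed to run online RL on real robot systems.

Link: https://arxiv.org/abs/2602.20220
Authors: Yarden As, Dhruva Tirumala, René Zurbrügg, Chenhao Li, Stelian Coros, Andreas Krause, Markus Wulfmeier
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: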

Abstract:We investigate what specific design choices enable successful online reinforcement learning (RL) on physical robots. Across 100 real-world training runs on three distinct robotic platforms, we systematically ablate algorithmic, systems, and experimental decisions that are typically left implicit in prior work. We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware. These results provide the first large-sample empirical study of such design choices, enabling practitioners to deploy online RL with lower engineering effort.

[AI-63] An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction

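【Quick Read】: This paper addresses the core challenge of accurately interpreting human intent in human-robot interaction (HRI), in order to achieve more natural and intuitive human-robot collaboration. The key to the solution is a new HRI framework that combines multimodal perception with decision making: it integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition into a tightly coupled speech and vision-language processing system, and adds fuzzy logic to make the control strategy more adaptive. On consumer-grade hardware the system reaches a command execution accuracy of 75%, which the authors present as evidence of more reliable command interpretation and execution.

Link: https://arxiv.org/abs/2602.20219
Authors: Guanting Shen, Zi Tian
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Preprint currently under revision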

Abstract:Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%, highlighting both the robustness and adaptability of the system. Beyond its current performance, the proposed architecture serves as a flexible and extensible foundation for future HRI research, offering a practical pathway toward more sophisticated and natural human-robot collaboration through tightly coupled speech and vision-language processing.

[AI-64] KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

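【Quick Read】: This paper addresses the inference-efficiency bottleneck of self-speculative decoding (SSD) in long-context scenarios, where static heuristics ignore the dynamic computational overhead of attention. The key to the solution is KnapSpec, a training-free framework that formulates draft model selection as a knapsack problem. It decouples attention layers from multilayer perceptron (MLP) layers, models their hardware-specific latencies as functions of context length, and uses a parallel dynamic programming algorithm to identify the optimal draft configuration adaptively at inference time. The paper also gives the first theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate, which lets the method keep drafting faithfulness high while adapting to shifting bottlenecks on real hardware and increases tokens-per-time throughput.

Link: https://arxiv.org/abs/2602.20217
Authors: Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: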

Abstract:Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model’s output distribution.
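
The knapsack framing can be illustrated with a small 0/1 knapsack over sub-layers. This is a simplified sketch, not KnapSpec itself: the latency model in `layer_items`, the per-layer `value` scores, the integer latency units, and the fixed budget are all hypothetical, and the paper's actual objective (tokens-per-time throughput with a cosine-similarity proxy) and its parallel dynamic program are not reproduced here.

```python
def knapsack_select(items, budget):
    """Classic 0/1 knapsack by dynamic programming.

    items: list of (name, integer_cost, value); budget: integer capacity.
    Returns (best_total_value, tuple_of_chosen_names).
    """
    best = [(0.0, ())] * (budget + 1)
    for name, cost, value in items:
        # Iterate capacities downwards so each item is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + value
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + (name,))
    return best[budget]


def layer_items(context_len):
    """Hypothetical latency model: attention cost grows with context length,
    MLP cost is constant. Values are made-up usefulness scores per sub-layer."""
    attn_cost = 1 + context_len // 1024
    mlp_cost = 2
    values = {"attn0": 3.0, "mlp0": 2.0, "attn1": 1.0,
              "mlp1": 2.5, "attn2": 1.8, "mlp2": 1.5}
    return [(name, attn_cost if name.startswith("attn") else mlp_cost, v)
            for name, v in values.items()]


# Which sub-layers to keep in the draft model under a latency budget of 6 units.
short_ctx = knapsack_select(layer_items(1024), budget=6)
long_ctx = knapsack_select(layer_items(8192), budget=6)
print("short context:", short_ctx)
print("long context: ", long_ctx)
```

With these made-up numbers, the selection keeps an attention layer at short context but drops all attention layers at long context, where their latency exceeds the budget. That is the kind of context-dependent shift a static layer-skipping heuristic would miss.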

[AI-65] Right to History: A Sovereignty Kernel for Verifiable AI Agent Execution

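【Quick Read】: This paper addresses the lack of verifiable, tamper-evident records of what AI agents do when they run on personal hardware, a gap that matters as regulations such as the EU AI Act begin to mandate automatic logging for high-risk AI systems. The core of the solution is the Right to History principle: individuals are entitled to a complete, verifiable record of every AI agent action on their own hardware. The principle is formalized through five system invariants with structured proof sketches. The key technology is PunkGo, a Rust sovereignty kernel that combines RFC 6962 Merkle tree audit logs, capability-based isolation, energy-budget governance, and a human-approval mechanism. Measurements show median action latency below 1.3 ms, throughput of about 400 actions per second, and 448-byte Merkle inclusion proofs at 10,000 log entries.

Link: https://arxiv.org/abs/2602.20214
Authors: Jing Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: 22 pages, 3 figures, 7 tables. Open-source: this https URL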

Abstract:AI agents increasingly act on behalf of humans, yet no existing system provides a tamper-evident, independently verifiable record of what they did. As regulations such as the EU AI Act begin mandating automatic logging for high-risk AI systems, this gap carries concrete consequences – especially for agents running on personal hardware, where no centralized provider controls the log. Extending Floridi’s informational rights framework from data about individuals to actions performed on their behalf, this paper proposes the Right to History: the principle that individuals are entitled to a complete, verifiable record of every AI agent action on their own hardware. The paper formalizes this principle through five system invariants with structured proof sketches, and implements it in PunkGo, a Rust sovereignty kernel that unifies RFC 6962 Merkle tree audit logs, capability-based isolation, energy-budget governance, and a human-approval mechanism. Adversarial testing confirms all five invariants hold. Performance evaluation shows sub-1.3 ms median action latency, ~400 actions/sec throughput, and 448-byte Merkle inclusion proofs at 10,000 log entries.
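
For readers unfamiliar with RFC 6962 audit logs, the sketch below builds a Merkle tree over 10,000 entries using the RFC's hashing rules (SHA-256, a 0x00 prefix for leaves, a 0x01 prefix for interior nodes) and checks an inclusion proof. It is written in Python rather than Rust and is not PunkGo's code; the entry format `action-{i}` is made up. Under these assumptions the longest audit path has 14 hashes of 32 bytes each, that is 448 bytes, which is consistent with the figure in the abstract, though the abstract does not say how that number was obtained.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def split_point(n: int) -> int:
    """Largest power of two strictly less than n (for n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def tree_hash(leaves, lo, hi):
    """Merkle tree hash over the leaf hashes leaves[lo:hi]."""
    if hi - lo == 1:
        return leaves[lo]
    k = split_point(hi - lo)
    return node_hash(tree_hash(leaves, lo, lo + k), tree_hash(leaves, lo + k, hi))


def inclusion_path(leaves, m, lo, hi):
    """Audit path for the m-th leaf of leaves[lo:hi], deepest sibling first."""
    if hi - lo == 1:
        return []
    k = split_point(hi - lo)
    if m < k:
        return inclusion_path(leaves, m, lo, lo + k) + [tree_hash(leaves, lo + k, hi)]
    return inclusion_path(leaves, m - k, lo + k, hi) + [tree_hash(leaves, lo, lo + k)]


def root_from_path(leaf, m, n, path):
    """Recompute the root from one leaf hash, its index m, the tree size n,
    and its audit path. A verifier compares the result to the signed root."""
    if n == 1:
        if path:
            raise ValueError("audit path too long")
        return leaf
    if not path:
        raise ValueError("audit path too short")
    k = split_point(n)
    if m < k:
        return node_hash(root_from_path(leaf, m, k, path[:-1]), path[-1])
    return node_hash(path[-1], root_from_path(leaf, m - k, n - k, path[:-1]))


# Hypothetical log of 10,000 agent actions.
leaves = [leaf_hash(f"action-{i}".encode()) for i in range(10_000)]
root = tree_hash(leaves, 0, len(leaves))
path = inclusion_path(leaves, 0, 0, len(leaves))
proof_bytes = sum(len(h) for h in path)
print(len(path), "hashes,", proof_bytes, "bytes")
```

Path length depends on the leaf position when the log size is not a power of two, so some entries in a 10,000-entry log have shorter proofs than the one computed here for index 0.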

[AI-66] CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions

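Quick read: This paper targets insufficient test-case coverage in code-generation benchmarks for large language models (LLMs): subtle corner cases often go undetected, so incorrect code still passes. The proposed CodeHacker framework is an automated adversarial testing agent that mimics the hack mechanism of competitive programming and combines several strategies (stress testing, anti-hash attacks, and logic-specific targeting) to generate adversarial test cases that expose latent vulnerabilities in submissions. To keep the attacks valid and reliable, it adds a Calibration Phase in which self-generated adversarial probes iteratively refine the Validator and the Checker. This significantly raises the True Negative Rate (TNR) of existing datasets by filtering out incorrect solutions that were previously accepted, and the generated adversarial cases also serve as training data that improves RL-trained models on benchmarks such as LiveCodeBench.

Link: https://arxiv.org/abs/2602.20213
Authors: Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, Shengyu Tao
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)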

Abstract:The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, existing benchmarks often lack coverage for subtle corner cases, allowing incorrect solutions to pass. To bridge this gap, we propose CodeHacker, an automated agent framework dedicated to generating targeted adversarial test cases that expose latent vulnerabilities in program submissions. Mimicking the hack mechanism in competitive programming, CodeHacker employs a multi-strategy approach, including stress testing, anti-hash attacks, and logic-specific targeting to break specific code submissions. To ensure the validity and reliability of these attacks, we introduce a Calibration Phase, where the agent iteratively refines its own Validator and Checker via self-generated adversarial probes before evaluating contestant this http URL demonstrate that CodeHacker significantly improves the True Negative Rate (TNR) of existing datasets, effectively filtering out incorrect solutions that were previously accepted. Furthermore, generated adversarial cases prove to be superior training data, boosting the performance of RL-trained models on benchmarks like LiveCodeBench.
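
Stress testing, one of the strategies listed in the abstract, can be sketched in a few lines: run a trusted brute-force reference and a submission on many small random inputs and keep the first input on which they disagree. The example below is a generic illustration, not CodeHacker's implementation; the problem (maximum subarray sum), the buggy submission, and all names are assumptions chosen for brevity.

```python
import random


def brute_force(a):
    # Reference: try every non-empty contiguous subarray.
    return max(sum(a[i:j]) for i in range(len(a))
               for j in range(i + 1, len(a) + 1))


def buggy_submission(a):
    # Kadane's algorithm initialised with 0: wrong when every element is negative,
    # because it silently allows the empty subarray.
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def stress_test(reference, submission, trials=1000, seed=0):
    # Return the first random input on which the two programs disagree, else None.
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if reference(a) != submission(a):
            return a
    return None


hack = stress_test(brute_force, buggy_submission)
print("hack input:", hack)
```

Small inputs are deliberate: they keep the brute-force reference cheap and make the failing case easy to read.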

[AI-67] Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling

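Quick read: This paper addresses the task-specific nature of current generative crystal models: existing deep generative models are usually designed for a single task, such as crystal structure prediction (CSP) or de novo generation (DNG), and lack a framework that shares crystal representations across tasks. The proposed Multimodal Crystal Flow (MCFlow) is a unified multimodal flow model that assigns independent time variables to atom types and crystal structures, so that different generation tasks correspond to different inference trajectories. To run multimodal flow in a standard Transformer, the authors introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, which injects strong compositional and crystallographic priors without explicit structural templates. On the MP-20 and MPTS-52 benchmarks, MCFlow is competitive with task-specific baselines across multiple crystal generation tasks.

Link: https://arxiv.org/abs/2602.20210
Authors: Kiyoung Seong, Sungsoo Ahn, Sehui Han, Changyoung Park
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)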

Abstract:Crystal modeling spans a family of conditional and unconditional generation tasks across different modalities, including crystal structure prediction (CSP) and de novo generation (DNG). While recent deep generative models have shown promising performance, they remain largely task-specific, lacking a unified framework that shares crystal representations across different generation tasks. To address this limitation, we propose Multimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, injecting strong compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that MCFlow achieves competitive performance against task-specific baselines across multiple crystal generation tasks.

[AI-68] Model Merging in the Essential Subspace CVPR2026

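Quick read: This paper addresses the performance loss caused by task interference in multi-task model merging. When several specialized models fine-tuned from the same pre-trained checkpoint are merged into one multi-task model, conflicts between the task-specific parameter updates often hurt overall performance. The proposed ESM (Essential Subspace Merging) framework first applies principal component analysis (PCA) to the feature shifts induced by parameter updates; the resulting principal directions span an essential subspace that dominates the feature representation. Each task's parameter update matrix is then projected onto its own essential subspace and decomposed at low rank before merging, which suppresses cross-task interference while preserving core task-specific behavior. A multi-level polarized scaling strategy amplifies parameters that carry critical knowledge and suppresses redundant ones so that essential knowledge is not diluted during fusion. The method reports state-of-the-art merging results across multiple task sets and model scales.

Link: https://arxiv.org/abs/2602.20208
Authors: Longhua Li, Lei Qi, Qi Tian, Xin Geng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2026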

Abstract:Model merging aims to integrate multiple task-specific fine-tuned models derived from a shared pre-trained checkpoint into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. In this paper, we propose ESM (Essential Subspace Merging), a robust framework for effective model merging. We begin by performing Principal Component Analysis (PCA) on feature shifts induced by parameter updates. The resulting principal directions span an essential subspace that dominantly influences feature representations. Each task’s parameter update matrix is projected onto its respective essential subspace for low-rank decomposition before merging. This methodology mitigates inter-task interference while preserving core task-specific functionality. Furthermore, we introduce a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, preventing essential knowledge from being overwhelmed during fusion. Extensive experiments across multiple task sets and model scales demonstrate that our method achieves state-of-the-art performance in multi-task model merging.

[AI-69] Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

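Quick read: This paper addresses the efficiency and generalization of knowledge editing in large language models (LLMs): updating the prediction for a specific query without disturbing the model's other behavior. According to the summary, existing methods typically determine the best editing layer separately for each sample, which is costly and hard to generalize. The key idea is the notion of golden layers: a fixed set of layers that achieves editing performance close to that of the sample-wise optimal layers. The proposed Layer Gradient Analysis (LGA) estimates these golden layers efficiently through gradient attribution, avoiding repeated trial and error and improving both editing efficiency and generalization across datasets.

Link: https://arxiv.org/abs/2602.20207
Authors: Shrestha Datta, Hongfu Liu, Anshuman Chhabra
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)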

Abstract:Knowledge editing in Large Language Models (LLMs) aims to update the model’s prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, resulting in different sample-wise editing performance for a fixed editing layer. In this work, we hypothesize the existence of fixed golden layers that can achieve near-optimal editing performance similar to sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test set queries across datasets. Finally, we propose a novel method, namely Layer Gradient Analysis (LGA) that estimates golden layers efficiently via gradient-attribution, avoiding extensive trial-and-error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of our LGA approach across different LLM types and various knowledge editing methods.
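
The golden-layer hypothesis can be made concrete with a small table of editing scores. The sketch below is not LGA; it is the brute-force comparison that LGA is meant to avoid, shown only to clarify what "fixed layer versus sample-wise optimal layer" means. The score values are made up for illustration and the function names are hypothetical.

```python
def golden_layer(scores):
    # scores[i][l] is the editing score of sample i when layer l is edited.
    # The fixed layer with the best average score over the proxy set is returned.
    n_layers = len(scores[0])
    means = [sum(row[l] for row in scores) / len(scores) for l in range(n_layers)]
    return max(range(n_layers), key=lambda l: means[l])


def fixed_layer_mean(scores, layer):
    # Average score when every sample is edited at the same layer.
    return sum(row[layer] for row in scores) / len(scores)


def oracle_mean(scores):
    # Average score when each sample is edited at its own best layer.
    return sum(max(row) for row in scores) / len(scores)


# Hypothetical proxy-set scores: 3 samples, 3 candidate layers.
proxy_scores = [
    [0.2, 0.9, 0.6],
    [0.3, 0.8, 0.9],
    [0.1, 0.7, 0.5],
]

layer = golden_layer(proxy_scores)
gap = oracle_mean(proxy_scores) - fixed_layer_mean(proxy_scores, layer)
print(layer, round(gap, 3))
```

A small gap between the oracle average and the fixed-layer average is what the hypothesis predicts; filling the score table requires one editing run per sample and layer, which is the cost a gradient-based estimate tries to remove.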

[AI-70] Analyzing Latency Hiding and Parallelism in an MLIR-based AI Kernel Compiler ASPLOS’26

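Quick read: This paper studies how AI kernel compilation for edge devices can exploit parallelism and hide memory latency in the presence of hierarchical memory and explicit data movement. Within an MLIR-based compilation pipeline it evaluates three compiler-controlled mechanisms: vectorization (Vec), multi-threading (MT) across hardware contexts, and double buffering (DB) with ping-pong scratchpad buffers, which overlaps DMA transfers with compute. Using Triton/Inductor-generated kernels, an ablation study separates the contribution of each mechanism and, with GELU as a representative activation kernel, quantifies how MT speedup scales with problem size. The results show that Vec provides the primary gain for bandwidth-sensitive kernels, MT delivers substantial speedups once scheduling overhead is amortized, and DB adds further benefit when transfers and compute can be overlapped.

Link: https://arxiv.org/abs/2602.20204
Authors: Javed Absar, Samarth Narang, Muthu Baskaran
Affiliation: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: Accepted at MLBench workshop as part of ASPLOS’26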

Abstract:AI kernel compilation for edge devices depends on the compiler’s ability to exploit parallelism and hide memory latency in the presence of hierarchical memory and explicit data movement. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline: vectorization (Vec), multi-threading (MT) across hardware contexts, and double buffering (DB) using ping–pong scratchpad buffers to overlap DMA transfers with compute. Using Triton/Inductor-generated kernels, we present an ablation ladder that separates the contribution of Vec, MT, and DB, and we quantify how MT speedup scales with problem size using GELU as a representative activation kernel. The results show that vectorization provides the primary gain for bandwidth-sensitive kernels, MT delivers substantial improvements once scheduling overhead is amortized, and DB provides additional benefit when transfers and compute can be overlapped (i.e., outside the extremes of purely memory-bound or purely compute-bound behavior).
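
The remark that double buffering helps most away from the purely memory-bound or purely compute-bound extremes follows from a simple timing model. The sketch below assumes every tile has the same transfer and compute time and that one DMA transfer can fully overlap one compute step; these are simplifying assumptions, not measurements from the paper, and the function names are hypothetical.

```python
def serial_time(transfer, compute, tiles):
    # Without double buffering each tile is loaded and then processed.
    return tiles * (transfer + compute)


def double_buffered_time(transfer, compute, tiles):
    # With ping-pong buffers the load of tile i+1 overlaps the compute of tile i.
    # Only the first load and the last compute are fully exposed.
    if tiles == 0:
        return 0
    return transfer + (tiles - 1) * max(transfer, compute) + compute


def speedup(transfer, compute, tiles):
    return serial_time(transfer, compute, tiles) / double_buffered_time(
        transfer, compute, tiles)


for label, t, c in [("balanced", 1, 1), ("memory-bound", 10, 1), ("compute-bound", 1, 10)]:
    print(label, round(speedup(t, c, tiles=100), 2))
```

In this model the speedup approaches 2 when transfer and compute times are equal and falls toward 1 as either side dominates, since the longer phase can no longer be hidden.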

[AI-71] Evaluating the Reliability of Digital Forensic Evidence Discovered by Large Language Model: A Case Study

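Quick read: This paper addresses the reliability of digital evidence identified by AI, in particular how to ensure the credibility, traceability, and legal soundness of evidence extracted when large language models (LLMs) are used in digital forensics. It proposes a structured framework that automates forensic artifact extraction, refines the data with LLM-driven analysis, and validates the results against a Digital Forensic Knowledge Graph (DFKG). Deterministic Unique Identifiers (UIDs) and forensic cross-referencing preserve artifact traceability and contextual consistency. The case study reports over 95 percent accuracy in artifact extraction, fewer classification errors, and an auditable, scalable methodology for AI-assisted digital forensics.

Link: https://arxiv.org/abs/2602.20202
Authors: Jeel Piyushkumar Khatiwala, Daniel Kwaku Ntiamoah Addai, Weifeng Xu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures. Published in the Proceedings of the 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), Toronto, ON, Canada, 8-11 July 2025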

Abstract:The growing reliance on AI-identified digital evidence raises significant concerns about its reliability, particularly as large language models (LLMs) are increasingly integrated into forensic investigations. This paper proposes a structured framework that automates forensic artifact extraction, refines data through LLM-driven analysis, and validates results using a Digital Forensic Knowledge Graph (DFKG). Evaluated on a 13 GB forensic image dataset containing 61 applications, 2,864 databases, and 5,870 tables, the framework ensures artifact traceability and evidentiary consistency through deterministic Unique Identifiers (UIDs) and forensic cross-referencing. We propose this methodology to address challenges in ensuring the credibility and forensic integrity of AI-identified evidence, reducing classification errors, and advancing scalable, auditable methodologies. A comprehensive case study on this dataset demonstrates the framework’s effectiveness, achieving over 95 percent accuracy in artifact extraction, strong support of chain-of-custody adherence, and robust contextual consistency in forensic relationships. Key results validate the framework’s ability to enhance reliability, reduce errors, and establish a legally sound paradigm for AI-assisted digital forensics.

[AI-72] IMOVNO: A Regional Partitioning and Meta-Heuristic Ensemble Framework for Imbalanced Multi-Class Learning

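Quick read: This paper addresses the degraded data quality, reduced model reliability, and limited generalization caused by class imbalance, class overlap, and noise in multi-class classification. Traditional methods handle complex inter-class relationships poorly in the multi-class setting, and approaches that rely only on geometric distance or on local binarization risk losing information or producing low-quality synthetic samples. The proposed IMOVNO+ framework optimizes at two levels. At the data level, it quantifies the informativeness of each sample with conditional probability, identifies and cleans overlapping regions by combining Z-score metrics with a big-jump gap distance, and applies a multi-regularization smart oversampling strategy that controls how close synthetic samples may be placed so that no new overlap is created. At the algorithmic level, a meta-heuristic prunes weak classifiers from the ensemble to improve robustness. Across 35 datasets the framework outperforms existing methods, with multi-class gains between 25% and 57% on metrics such as G-mean and F1-score.

Link: https://arxiv.org/abs/2602.20199
Authors: Soufiane Bacha, Laouni Djafri, Sahraoui Dhelim, Huansheng Ning
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages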

Abstract:Class imbalance, overlap, and noise degrade data quality, reduce model reliability, and limit generalization. Although widely studied in binary classification, these issues remain underexplored in multi-class settings, where complex inter-class relationships make minority-majority structures unclear and traditional clustering fails to capture distribution shape. Approaches that rely only on geometric distances risk removing informative samples and generating low-quality synthetic data, while binarization approaches treat imbalance locally and ignore global inter-class dependencies. At the algorithmic level, ensembles struggle to integrate weak classifiers, leading to limited robustness. This paper proposes IMOVNO+ (IMbalance-OVerlap-NOise+ Algorithm-Level Optimization), a two-level framework designed to jointly enhance data quality and algorithmic robustness for binary and multi-class tasks. At the data level, first, conditional probability is used to quantify the informativeness of each sample. Second, the dataset is partitioned into core, overlapping, and noisy regions. Third, an overlapping-cleaning algorithm is introduced that combines Z-score metrics with a big-jump gap distance. Fourth, a smart oversampling algorithm based on multi-regularization controls synthetic sample proximity, preventing new overlaps. At the algorithmic level, a meta-heuristic prunes ensemble classifiers to reduce weak-learner influence. IMOVNO+ was evaluated on 35 datasets (13 multi-class, 22 binary). Results show consistent superiority over state-of-the-art methods, approaching 100% in several cases. For multi-class data, IMOVNO+ achieves gains of 37-57% in G-mean, 25-44% in F1-score, 25-39% in precision, and 26-43% in recall. In binary tasks, it attains near-perfect performance with improvements of 14-39%. The framework handles data scarcity and imbalance from collection and privacy limits.
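
G-mean is reported above because plain accuracy hides failures on minority classes. The sketch below uses one common multi-class definition, the geometric mean of per-class recalls; whether the paper uses exactly this variant is an assumption, and the labels are toy data.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    # Recall of class c = correctly predicted samples of c / all samples of c.
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / total[c] for c in total}


def g_mean(y_true, y_pred):
    # Geometric mean of the per-class recalls; zero if any class is never recovered.
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1 / len(recalls))


# Toy imbalanced problem: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
majority_only = [0] * 10
balanced_guess = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, majority_only)) / len(y_true)
print(accuracy, g_mean(y_true, majority_only), round(g_mean(y_true, balanced_guess), 3))
```

Predicting only the majority class reaches 80% accuracy on this toy set but a G-mean of zero, which is why imbalance studies tend to report G-mean alongside F1-score, precision, and recall.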

[AI-73] Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning ICLR2026

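Quick read: This paper addresses entropy collapse, policy degradation, and over-exploitation of suboptimal behaviors that arise when multi-modal large language models (MLLMs) are trained with Reinforcement Learning with Verifiable Rewards (RLVR), where the state space is enormous and rewards are sparse. The proposed CalibRL is a hybrid-policy RLVR framework built on two mechanisms: (1) distribution-aware advantage weighting, which scales updates by group rareness to calibrate the policy distribution and preserve exploration; and (2) an asymmetric activation function (LeakyReLU) that uses expert knowledge as a calibration baseline, moderating overconfident updates while keeping their corrective direction. Together these give controllable exploration, reduce the distributional mismatch between the model's policy and expert trajectories, and yield a more stable balance between exploration and exploitation.

Link: https://arxiv.org/abs/2602.20197
Authors: Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2026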

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, which yields inefficient exploration. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model’s policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid-policy RLVR training. Code is available at this https URL.

[AI-74] OpenPort Protocol: A Security Governance Specification for AI Agent Tool Access

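Quick read: This paper addresses governance problems that AI agents face in production deployments, specifically how to express and verify least-privilege authorization, controlled write execution, predictable failure handling, abuse resistance, and auditability. It proposes OpenPort Protocol (OPP), a governance-first specification that exposes application tools through a secure server-side gateway, is model- and runtime-neutral, and can bind to existing tool ecosystems. Its main elements are: authorization-dependent discovery; stable response envelopes with machine-actionable reason codes; an authorization model combining integration credentials, scoped permissions, and ABAC-style policy constraints; a risk-gated lifecycle for write operations (draft creation and human review by default, time-bounded auto-execution under explicit policy, and mandatory high-risk safeguards such as preflight impact binding and idempotency); an optional State Witness profile that counters state drift in delayed approval flows by revalidating preconditions at execution time and failing closed on a state mismatch; and required admission control (rate limits and quotas) with structured audit events, so that client recovery and incident analysis are deterministic.

Link: https://arxiv.org/abs/2602.20196
Authors: Genliang Zhu, Chu Wang, Ziyuan Wang, Zhida Li, Qiang Li
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)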

Abstract:AI agents increasingly require direct, structured access to application data and actions, but production deployments still struggle to express and verify the governance properties that matter in practice: least-privilege authorization, controlled write execution, predictable failure handling, abuse resistance, and auditability. This paper introduces OpenPort Protocol (OPP), a governance-first specification for exposing application tools through a secure server-side gateway that is model- and runtime-neutral and can bind to existing tool ecosystems. OpenPort defines authorization-dependent discovery, stable response envelopes with machine-actionable agent.* reason codes, and an authorization model combining integration credentials, scoped permissions, and ABAC-style policy constraints. For write operations, OpenPort specifies a risk-gated lifecycle that defaults to draft creation and human review, supports time-bounded auto-execution under explicit policy, and enforces high-risk safeguards including preflight impact binding and idempotency. To address time-of-check/time-of-use drift in delayed approval flows, OpenPort also specifies an optional State Witness profile that revalidates execution-time preconditions and fails closed on state mismatch. Operationally, the protocol requires admission control (rate limits/quotas) with stable 429 semantics and structured audit events across allow/deny/fail paths so that client recovery and incident analysis are deterministic. We present a reference runtime and an executable governance toolchain (layered conformance profiles, negative security tests, fuzz/abuse regression, and release-gate scans) and evaluate the core profile at a pinned release tag using artifact-based, externally reproducible validation.

[AI-75] When Backdoors Go Beyond Triggers: Semantic Drift in Diffusion Models Under Encoder Attacks

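Quick read: This paper addresses a limitation in how backdoor attacks on text-to-image (T2I) generative models are evaluated: existing evaluations measure only trigger activation and visual fidelity, and overlook the persistent, trigger-free semantic corruption that encoder-side poisoning inflicts on the representation manifold. The paper gives a geometric account: a Jacobian-based analysis shows that backdoors act as low-rank, target-centered deformations that amplify local sensitivity and make distortion propagate coherently across semantic neighborhoods. It then introduces SEMAD (Semantic Alignment and Drift), a diagnostic framework that quantifies structural degradation through both internal embedding drift and downstream functional misalignment, enabling a systematic assessment of encoder-poisoning risk.

Link: https://arxiv.org/abs/2602.20193
Authors: Shenyang Chen, Liuwan Zhu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)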

Abstract:Standard evaluations of backdoor attacks on text-to-image (T2I) models primarily measure trigger activation and visual fidelity. We challenge this paradigm, demonstrating that encoder-side poisoning induces persistent, trigger-free semantic corruption that fundamentally reshapes the representation manifold. We trace this vulnerability to a geometric mechanism: a Jacobian-based analysis reveals that backdoors act as low-rank, target-centered deformations that amplify local sensitivity, causing distortion to propagate coherently across semantic neighborhoods. To rigorously quantify this structural degradation, we introduce SEMAD (Semantic Alignment and Drift), a diagnostic framework that measures both internal embedding drift and downstream functional misalignment. Our findings, validated across diffusion and contrastive paradigms, expose the deep structural risks of encoder poisoning and highlight the necessity of geometric audits beyond simple attack success rates.

[AI-76] Closing the Expertise Gap in Residential Building Energy Retrofits: A Domain-Specific LLM for Informed Decision-Making

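Quick read: This paper addresses the expertise gap in residential energy retrofit decisions: homeowners lack the technical knowledge to choose the best retrofit from complex energy assessments. It develops a domain-specific large language model (LLM) for building energy efficiency, fine-tuned with Low-Rank Adaptation (LoRA) on physics-based energy simulations and techno-economic calculations derived from 536,416 U.S. residential building prototypes, which maps homeowner-accessible descriptions of a dwelling to optimal retrofit selections and predicted performance. The model places the CO2-optimal retrofit among its top three recommendations in 98.9% of cases and the retrofit with the shortest discounted payback period in 93.3% of cases, and it maintains performance when the input is incomplete, supporting informed residential decarbonization decisions.

Link: https://arxiv.org/abs/2602.20181
Authors: Lei Shu, Armin Yeganeh, Sinem Mollaoglu, Jiayu Zhou, Dong Zhao
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: A preprint version is available via SSRN (Elsevier Preprint Service)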

Abstract:Residential energy retrofit decision-making is constrained by an expertise gap, as homeowners lack the technical literacy required for energy assessments. To address this challenge, this study develops a domain-specific large language model (LLM) that provides optimal retrofit recommendations using homeowner-accessible descriptions of basic dwelling characteristics. The model is fine-tuned on physics-based energy simulations and techno-economic calculations derived from 536,416 U.S. residential building prototypes across nine major retrofit categories. Using Low-Rank Adaptation (LoRA), the LLM maps dwelling characteristics to optimal retrofit selections and associated performance outcomes. Evaluation against physics-grounded baselines shows that the model identifies the optimal retrofit for CO2 reduction within its top three recommendations in 98.9% of cases and the shortest discounted payback period in 93.3% of cases. Fine-tuning yields an order-of-magnitude reduction in CO2 prediction error and multi-fold reductions for energy use and retrofit cost. The model maintains performance under incomplete input conditions, supporting informed residential decarbonization decisions.
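
The discounted payback period used as a ranking criterion above is the first year in which the cumulative discounted savings cover the upfront cost. The sketch below shows the standard calculation with made-up numbers; the cost, savings, discount rate, and function name are illustrative assumptions and do not come from the paper.

```python
def discounted_payback_years(cost, annual_saving, rate, horizon=50):
    # Return the first year whose cumulative discounted savings reach the cost,
    # or None if that never happens within the horizon.
    cumulative = 0.0
    for year in range(1, horizon + 1):
        cumulative += annual_saving / (1 + rate) ** year
        if cumulative >= cost:
            return year
    return None


# Hypothetical retrofit: 5,000 upfront, 800 saved per year.
print(discounted_payback_years(5000, 800, rate=0.05))  # with 5% discounting
print(discounted_payback_years(5000, 800, rate=0.0))   # simple payback, no discounting
print(discounted_payback_years(5000, 100, rate=0.05))  # savings too small to ever repay
```

Discounting lengthens the payback period compared with the simple ratio of cost to annual savings, and a retrofit whose discounted savings converge below the cost never pays back at all.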

[AI-77] Enhancing Heat Sink Efficiency in MOSFETs using Physics Informed Neural Networks: A Systematic Study on Coolant Velocity Estimation

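Quick read: This paper addresses the ill-posed inverse problem of determining the coolant velocity required by a multilayered metal-oxide-semiconductor field-effect transistor (MOSFET), given the heat flux and the inlet and outlet temperatures. Traditional methods struggle with this problem, so the paper proposes a solution based on Physics Informed Neural Networks (PINNs) with layer-wise sequential training: while one layer is trained, the parameters of the other layers are held constant, which reduces the dimensionality of the optimization landscape and makes it easier to find good minima and avoid poor local ones. The paper analyzes the convergence of the PINN solution theoretically and reports predictions in good agreement with experimental results.

Link: https://arxiv.org/abs/2602.20177
Authors: Aniruddha Bora, Isabel K. Alvarez, Julie Chalfant, Chryssostomos Chryssostomidis
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)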

Abstract:In this work, we present a methodology using Physics Informed Neural Networks (PINNs) to determine the required velocity of a coolant, given inlet and outlet temperatures for a given heat flux in a multilayered metal-oxide-semiconductor field-effect transistor (MOSFET). MOSFETs are integral components of Power Electronic Building Blocks (PEBBs) and experience the majority of the thermal load. Effective cooling of MOSFETs is therefore essential to prevent overheating and potential burnout. Determining the velocity required for effective cooling is important, but it is an ill-posed inverse problem that is difficult to solve using traditional methods. The MOSFET consists of multiple layers with different thermal conductivities, including aluminum, pyrolytic graphite sheets (PGS), and stainless steel pipes containing flowing water. We propose an algorithm that employs sequential training of the MOSFET layers in PINNs. Mathematically, the sequential training method decouples the optimization of each layer by treating the parameters of other layers as constants during its training phase. This reduces the dimensionality of the optimization landscape, making it easier to find the global minimum for each layer’s parameters and avoid poor local minima. Convergence of the PINNs solution to the analytical solution is theoretically analyzed. Finally, we show the prediction of our proposed methodology to be in good agreement with experimental results.

[AI-78] CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation ICLR2026

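Quick read: This paper addresses a blind spot in LLM safety evaluation: red-teaming benchmarks adapted to new languages by direct translation fail to capture socio-technical vulnerabilities rooted in local culture and law. It proposes CAGE (Culturally Adaptive Generation), a framework whose core is the Semantic Mold, an approach that disentangles a prompt's adversarial structure from its cultural content. This makes it possible to model realistic, localized threats rather than testing only for simple jailbreaks, and it gives a scalable, context-aware way to build safety benchmarks across cultures.

Link: https://arxiv.org/abs/2602.20170
Authors: Chaeyun Kim, YongTaek Lim, Kihyun Kim, Junghwan Kim, Minwoo Kim
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026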

Abstract:Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt’s adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures. Our dataset and evaluation rubrics are publicly available at this https URL. (WARNING: This paper contains model outputs that can be offensive in nature.)

[AI-79] Autonomous AI and Ownership Rules

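Quick read: This paper addresses how to assign ownership of AI-generated outputs, especially when those outputs become detached from their creators and can no longer be traced, so that markets are not distorted and incentives do not fail. It distinguishes two situations. When AI output is still traceable to an originator, accession doctrine assigns ownership, preserving investment incentives and accountability. When AI becomes untraceable, whether through carelessness, deliberate obfuscation, or emergent behavior, first possession rules encourage new custodians to integrate the AI into productive use. The paper also proposes bounty systems, private incentives, and government subsidies to counter strategic ownership dissolution, in which AI is designed to evade attribution for tax arbitrage and regulatory avoidance.

Link: https://arxiv.org/abs/2602.20169
Authors: Frank Fagan
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)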

Abstract:This Article examines the circumstances in which AI-generated outputs remain linked to their creators and the points at which they lose that connection, whether through accident, deliberate design, or emergent behavior. In cases where AI is traceable to an originator, accession doctrine provides an efficient means of assigning ownership, preserving investment incentives while maintaining accountability. When AI becomes untraceable – whether through carelessness, deliberate obfuscation, or emergent behavior – first possession rules can encourage reallocation to new custodians who are incentivized to integrate AI into productive use. The analysis further explores strategic ownership dissolution, where autonomous AI is intentionally designed to evade attribution, creating opportunities for tax arbitrage and regulatory avoidance. To counteract these inefficiencies, bounty systems, private incentives, and government subsidies are proposed as mechanisms to encourage AI capture and prevent ownerless AI from distorting markets.

[AI-80] Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing

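Quick read: This paper addresses the clinically unrealistic evaluation of data-driven early deterioration models in emergency triage: most models rely on data that is unavailable during initial assessment, such as laboratory or imaging results, which makes their deployed performance unreliable. It proposes a leakage-aware benchmarking framework that uses a patient-deduplicated cohort from MIMIC-IV-ED and restricts inputs to information available within the first hour of presentation, comparing a hospital-rich setting with a vitals-only, MCI-like setting that mimics resource-constrained triage. Predictive performance declines only modestly with vitals alone, and respiratory and oxygenation measures contribute most to risk stratification, indicating that early physiological measurements carry enough clinical signal for deployable triage decision support.

Link: https://arxiv.org/abs/2602.20168
Authors: KMA Solaiman, Joshua Sebastian, Karma Tobden
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 6 tables. Submitted to IEEE ICHI 2026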

Abstract:Emergency triage decisions are made under severe information constraints, yet most data-driven deterioration models are evaluated using signals unavailable during initial assessment. We present a leakage-aware benchmarking framework for early deterioration prediction that evaluates model performance under realistic, time-limited sensing conditions. Using a patient-deduplicated cohort derived from MIMIC-IV-ED, we compare hospital-rich triage with a vitals-only, MCI-like setting, restricting inputs to information available within the first hour of presentation. Across multiple modeling approaches, predictive performance declines only modestly when limited to vitals, indicating that early physiological measurements retain substantial clinical signal. Structured ablation and interpretability analyses identify respiratory and oxygenation measures as the most influential contributors to early risk stratification, with models exhibiting stable, graceful degradation as sensing is reduced. This work provides a clinically grounded benchmark to support the evaluation and design of deployable triage decision-support systems in resource-constrained settings.

[AI-81] Attention-Based SINR Estimation in User-Centric Non-Terrestrial Networks ICMLCN2026

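Quick read: This paper addresses the high computational cost of evaluating the signal-to-interference-plus-noise ratio (SINR) in user-centric beamforming for satellite-based non-terrestrial networks (NTNs). Conventional approaches either transmit dedicated pilots or first compute the beamforming matrix with a minimum mean squared error (MMSE) formulation, which adds significant overhead. The paper proposes a low-complexity SINR estimation framework based on dual multi-head self-attention (DMHSA) that extracts inter-user interference features directly from channel state information (CSI) or from user location reports, without explicit MMSE calculations. Complexity drops by a factor of three in the CSI-based setting and by two orders of magnitude in the location-based setting, while the root mean squared error typically stays below 1 dB, so the estimator can evaluate the SINR and capacity of many candidate user groups during scheduling.

Link: https://arxiv.org/abs/2602.21116
Authors: Bruno De Filippo, Alessandro Guidotti, Alessandro Vanelli-Coralli
Affiliation: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Paper accepted for presentation at IEEE International Conference on Machine Learning in Communications and Networking (ICMLCN) 2026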

Abstract:The signal-to-interference-plus-noise ratio (SINR) is central to performance optimization in user-centric beamforming for satellite-based non-terrestrial networks (NTNs). Its assessment either requires the transmission of dedicated pilots or relies on computing the beamforming matrix through minimum mean squared error (MMSE)-based formulations beforehand, a process that introduces significant computational overhead. In this paper, we propose a low-complexity SINR estimation framework that leverages multi-head self-attention (MHSA) to extract inter-user interference features directly from either channel state information or user location reports. The proposed dual MHSA (DMHSA) models evaluate the SINR of a scheduled user group without requiring explicit MMSE calculations. The architecture achieves a computational complexity reduction by a factor of three in the CSI-based setting and by two orders of magnitude in the location-based configuration, the latter benefiting from the lower dimensionality of user reports. We show that both DMHSA models maintain high estimation accuracy, with the root mean squared error typically below 1 dB with priority-queuing-based scheduled users. These results enable the integration of DMHSA-based estimators into scheduling procedures, allowing the evaluation of multiple candidate user groups and the selection of those offering the highest average SINR and capacity.
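
For context, the quantity being estimated can be computed directly once a beamforming matrix is known. The sketch below evaluates the textbook per-user SINR, |h_k^H w_k|^2 divided by the sum of |h_k^H w_j|^2 over the other users plus the noise power. This is the explicit calculation that the paper's attention-based estimator is designed to bypass, not the DMHSA model itself; the two-user channel, the beams, and the noise power are toy values.

```python
import math


def sinr_db(channels, beams, noise_power):
    # channels[k]: channel vector h_k of user k (list of real or complex gains).
    # beams[j]: beamforming vector w_j intended for user j.
    result = []
    for k, h in enumerate(channels):
        # Power received by user k from every beam j: |h_k^H w_j|^2.
        powers = [
            abs(sum(complex(a).conjugate() * b for a, b in zip(h, w))) ** 2
            for w in beams
        ]
        signal = powers[k]
        interference = sum(powers) - signal
        result.append(10 * math.log10(signal / (interference + noise_power)))
    return result


# Toy setup: two users, two antennas, a little cross-talk between users.
channels = [[1.0, 0.1], [0.1, 1.0]]
beams = [[1.0, 0.0], [0.0, 1.0]]
print(sinr_db(channels, beams, noise_power=0.09))
```

With these toy values each user sees a signal power of 1 against 0.01 of interference and 0.09 of noise, which is roughly 10 dB. The expensive part in practice is obtaining the beams (for example through an MMSE formulation) for every candidate user group, which is where a learned estimator saves computation.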

[AI-82] Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

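[Quick read]: This paper addresses the severe degradation of automatic speech recognition (ASR) in noisy environments. Speech enhancement (SE) front-ends suppress background noise effectively but often introduce artifacts that harm recognition. Existing observation addition (OA) methods improve recognition by fusing the original noisy speech with the SE-enhanced speech, without modifying the parameters of the SE or ASR models. The paper proposes an intelligibility-guided OA method whose key idea is to determine the fusion weights from intelligibility estimates obtained directly from the backend ASR, rather than from a trained neural predictor. The method needs no additional training, which reduces complexity and improves generalization, and it shows stronger robustness and better performance than existing OA baselines across multiple SE-ASR combinations and datasets.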

Link: https://arxiv.org/abs/2602.20967
Authors: Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addressed this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and frame versus utterance-level OA further validate the proposed design.
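The fusion step that OA describes can be sketched as a weighted sum of the two waveforms. In the sketch below, `observation_addition` is the generic utterance-level blend. `weight_from_scores` is a purely hypothetical mapping from two intelligibility scores to a weight; the abstract does not say how the backend ASR produces its estimates or how they become weights.

```python
def observation_addition(noisy, enhanced, weight):
    """Blend two equal-length waveforms: weight * noisy + (1 - weight) * enhanced."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(noisy) != len(enhanced):
        raise ValueError("waveforms must have the same length")
    return [weight * n + (1.0 - weight) * e for n, e in zip(noisy, enhanced)]

def weight_from_scores(score_noisy, score_enhanced):
    """Hypothetical rule: share of the total intelligibility score held by the noisy input."""
    total = score_noisy + score_enhanced
    if total <= 0:
        return 0.5  # no evidence either way, so blend equally
    return score_noisy / total
```

A weight of 0 returns the enhanced signal unchanged and a weight of 1 returns the noisy input, so the blend can fall back to either endpoint when one of them is clearly more intelligible.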

[AI-83] Some Simple Economics of AGI

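[Quick read]: The paper addresses the following problem: as artificial intelligence (AI) decouples cognition from biology, the cost of execution approaches zero, so human verification bandwidth becomes the new bottleneck on economic growth. Humans cannot effectively verify the volume of automated output, which opens a Measurability Gap and may lead to structural imbalance or even collapse (a Hollow Economy). The core challenge is therefore no longer to increase automation capability but to scale, in step with it, the human capacity to verify the output of intelligent systems and to bear responsibility for it. The key to the solution is to scale verification capacity systematically, including trusted data provenance, stronger auditability, and liability insurance mechanisms, so that previously neglected verification-grade ground truth, cryptographic provenance, and liability underwriting become new anchors of value. This fosters an Augmented Economy in which human oversight and intelligent agents evolve together, moving from passive defense to active innovation.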

Link: https://arxiv.org/abs/2602.20946
Authors: Christian Catalini, Xiang Hui, Jane Wu
Affiliation: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: JEL Classification: D82, D83, J23, J24, L23, O33. 112 pages, 3 figures

Abstract:For millennia, human cognition was the primary engine of progress on Earth. As AI decouples cognition from biology, the marginal cost of measurable execution falls to zero, absorbing any labor capturable by metrics–including creative, analytical, and innovative work. The binding constraint on growth is no longer intelligence but human verification bandwidth: the capacity to validate, audit, and underwrite responsibility when execution is abundant. We model the AGI transition as the collision of two racing cost curves: an exponentially decaying Cost to Automate and a biologically bottlenecked Cost to Verify. This structural asymmetry widens a Measurability Gap between what agents can execute and what humans can afford to verify. It also drives a shift from skill-biased to measurability-biased technical change. Rents migrate to verification-grade ground truth, cryptographic provenance, and liability underwriting–the ability to insure outcomes rather than merely generate them. The current human-in-the-loop equilibrium is unstable: eroded from below as apprenticeship collapses (Missing Junior Loop) and from within as experts codify their obsolescence (Codifier’s Curse). Unverified deployment becomes privately rational–a Trojan Horse externality. Unmanaged, these forces pull toward a Hollow Economy. Yet by scaling verification alongside agentic capabilities, the forces that threaten collapse become the catalyst for unbounded discovery and experimentation–an Augmented Economy. We derive a practical playbook for individuals, companies, investors, and policymakers. Today’s defining challenge is not the race to deploy the most autonomous systems; it is the race to secure the foundations of their oversight. Only by scaling our bandwidth for verification alongside our capacity for execution can we ensure that the intelligence we have summoned preserves the humanity that initiated it.

[AI-84] Fast Spectrogram Event Extraction via Offline Self-Supervised Learning: From Fusion Diagnostics to Bioacoustics

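[Quick read]: The paper addresses the "data deluge" facing next-generation fusion facilities such as ITER, which generate petabytes of multi-diagnostic signal data per day, beyond what manual analysis can handle. The key to the solution is a "signals-first" self-supervised framework that combines non-linear optimal multichannel signal processing with a fast neural network surrogate to automatically extract coherent, quasi-coherent, and transient modes from high-noise time-frequency data. The method is tested on data from DIII-D and TJ-II and on non-fusion spectrograms. With an inference latency of 0.5 seconds, it supports real-time mode identification and large-scale automated database generation for advanced plasma control.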

Link: https://arxiv.org/abs/2602.20317
Authors: Nathaniel Chen, Kouroche Bouchiat, Peter Steiner, Andrew Rothstein, David Smith, Max Austin, Mike van Zeeland, Azarakhsh Jalalvand, Egemen Kolemen
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Plasma Physics (physics.plasm-ph)
Comments:

Abstract:Next-generation fusion facilities like ITER face a “data deluge,” generating petabytes of multi-diagnostic signals daily that challenge manual analysis. We present a “signals-first” self-supervised framework for the automated extraction of coherent and transient modes from high-noise time-frequency data. We also develop a general-purpose method and tool for extracting coherent, quasi-coherent, and transient modes for fluctuation measurements in tokamaks by employing non-linear optimal techniques in multichannel signal processing with a fast neural network surrogate on fast magnetics, electron cyclotron emission, CO2 interferometers, and beam emission spectroscopy measurements from DIII-D. Results are tested on data from DIII-D, TJ-II, and non-fusion spectrograms. With an inference latency of 0.5 seconds, this framework enables real-time mode identification and large-scale automated database generation for advanced plasma control. Repository is in this https URL.

[AI-85] AINet: Anchor Instances Learning for Regional Heterogeneity in Whole Slide Image

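[Quick read]: The paper addresses regional heterogeneity in whole slide image (WSI) analysis caused by the sparsity and morphological diversity of tumors, with the aim of obtaining high-quality, discriminative representations within the multi-instance learning (MIL) framework. The key to the solution is the concept of an anchor instance (AI): a compact subset of instances that are representative within their local region and discriminative at the global (bag) level. A dual-level anchor mining (DAM) module selects the most informative AIs from the massive set of instances, and an anchor-guided region correction (ARC) module refines each regional representation by fusing complementary information from all regions. The resulting framework, AINet, is concise and effective, surpassing state-of-the-art methods with substantially fewer floating-point operations (FLOPs) and parameters.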

Link: https://arxiv.org/abs/2602.20187
Authors: Tingting Zheng, Hongxun Yao, Kui Jiang, Sicheng Zhao, Yi Xiao
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in multi-instance learning (MIL) have witnessed impressive performance in whole slide image (WSI) analysis. However, the inherent sparsity of tumors and their morphological diversity lead to obvious heterogeneity across regions, posing significant challenges in aggregating high-quality and discriminative representations. To address this, we introduce a novel concept of anchor instance (AI), a compact subset of instances that are representative within their regions (local) and discriminative at the bag (global) level. These AIs act as semantic references to guide interactions across regions, correcting non-discriminative patterns while preserving regional diversity. Specifically, we propose a dual-level anchor mining (DAM) module to select AIs from massive instances, where the most informative AI in each region is extracted by assessing its similarity to both local and global embeddings. Furthermore, to ensure completeness and diversity, we devise an anchor-guided region correction (ARC) module that explores the complementary information from all regions to correct each regional representation. Building upon DAM and ARC, we develop a concise yet effective framework, AINet, which employs a simple predictor and surpasses state-of-the-art methods with substantially fewer FLOPs and parameters. Moreover, both DAM and ARC are modular and can be seamlessly integrated into existing MIL frameworks, consistently improving their performance.

Machine Learning

[LG-0] Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Link: https://arxiv.org/abs/2602.21196
Authors: Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 14 pages, 6 figures

Abstract:Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5% for 32B Transformers, while matching previous context parallelism techniques in terms of training speed. UPipe can support the context length of 5M tokens when training Llama3-8B on a single 8×H100 node, improving upon prior methods by over 25%.

[LG-1] Statistical Query Lower Bounds for Smoothed Agnostic Learning

Link: https://arxiv.org/abs/2602.21191
Authors: Ilias Diakonikolas, Daniel M. Kane
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments:

Abstract:We study the complexity of smoothed agnostic learning, recently introduced by [CKKMS24], in which the learner competes with the best classifier in a target class under slight Gaussian perturbations of the inputs. Specifically, we focus on the prototypical task of agnostically learning halfspaces under subgaussian distributions in the smoothed model. The best known upper bound for this problem relies on $L_1$-polynomial regression and has complexity $d^{\tilde{O}(1/\sigma^2) \log(1/\epsilon)}$, where $\sigma$ is the smoothing parameter and $\epsilon$ is the excess error. Our main result is a Statistical Query (SQ) lower bound providing formal evidence that this upper bound is close to best possible. In more detail, we show that (even for Gaussian marginals) any SQ algorithm for smoothed agnostic learning of halfspaces requires complexity $d^{\Omega(1/\sigma^2+\log(1/\epsilon))}$. This is the first non-trivial lower bound on the complexity of this task and nearly matches the known upper bound. Roughly speaking, we show that applying $L_1$-polynomial regression to a smoothed version of the function is essentially best possible. Our techniques involve finding a moment-matching hard distribution by way of linear programming duality. This dual program corresponds exactly to finding a low-degree approximating polynomial to the smoothed version of the target function (which turns out to be the same condition required for the $L_1$-polynomial regression to work). Our explicit SQ lower bound then comes from proving lower bounds on this approximation degree for the class of halfspaces.

[LG-2] The Diffusion Duality Chapter II: Ψ-Samplers and Efficient Curriculum

Link: https://arxiv.org/abs/2602.21185
Authors: Justin Deschenaux, Caglar Gulcehre, Subham Sekhar Sahoo
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video-tutorial on: this https URL

[LG-3] Sequential Counterfactual Inference for Temporal Clinical Data: Addressing the Time Traveler Dilemma

Link: https://arxiv.org/abs/2602.21168
Authors: Jingya Cheng, Alaleh Azhir, Jiazi Tian, Hossein Estiri
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Counterfactual inference enables clinicians to ask “what if” questions about patient outcomes, but standard methods assume feature independence and simultaneous modifiability – assumptions violated by longitudinal clinical data. We introduce the Sequential Counterfactual Framework, which respects temporal dependencies in electronic health records by distinguishing immutable features (chronic diagnoses) from controllable features (lab values) and modeling how interventions propagate through time. Applied to 2,723 COVID-19 patients (383 Long COVID heart failure cases, 2,340 matched controls), we demonstrate that 38-67% of patients with chronic conditions would require biologically impossible counterfactuals under naive methods. We identify a cardiorenal cascade (CKD - AKI - HF) with relative risks of 2.27 and 1.19 at each step, illustrating temporal propagation that sequential – but not naive – counterfactuals can capture. Our framework transforms counterfactual explanation from “what if this feature were different?” to “what if we had intervened earlier, and how would that propagate forward?” – yielding clinically actionable insights grounded in biological plausibility.

[LG-4] Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

Link: https://arxiv.org/abs/2602.21144
Authors: Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Submitted to 46th IEEE International Conference on Distributed Computing Systems (ICDCS 2026)

Abstract:Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning the mixer’s packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized AllReduce. We evaluate on three representative SSM-based LLMs spanning pure-SSM and hybrid architectures - Mamba, Falcon-Mamba, and Zamba - on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference, improving batch-request throughput by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and achieving a further ~10-18% throughput improvement from quantized all-reduce by lowering synchronization bandwidth overhead.
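The quantized all-reduce idea can be illustrated with a small single-process simulation. Each rank's partial output is quantized to low-bit integers before being exchanged, then dequantized and summed. The symmetric per-tensor int8 scheme and all function names below are assumptions for illustration; the abstract does not specify the quantization format the paper uses.

```python
def quantize(values, bits=8):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0  # avoid a zero scale for all-zero input
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]

def quantized_all_reduce(shards, bits=8):
    """Simulate a sum all-reduce in which every rank sends quantized data."""
    restored = [dequantize(*quantize(shard, bits)) for shard in shards]
    return [sum(column) for column in zip(*restored)]
```

Each rank contributes a rounding error of at most half its scale per element. The bandwidth saving therefore costs a bounded loss of precision that grows with the number of ranks.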

[LG-5] SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models

Link: https://arxiv.org/abs/2602.21133
Authors: Alessandro Londei, Denise Lanzieri, Matteo Benati
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Vector-quantized representations enable powerful discrete generative models but lack semantic structure in token space, limiting interpretable human control. We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Unlike standard VQ-VAE, SOM-VQ uses topology-aware updates that preserve neighborhood structure: nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space. We demonstrate that SOM-VQ produces more learnable token sequences in the evaluated domains while providing an explicit navigable geometry in code space. Critically, the topological organization enables intuitive human-in-the-loop control: users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints. We focus on human motion generation - a domain where kinematic structure, smooth temporal continuity, and interactive use cases (choreography, rehabilitation, HCI) make topology-aware control especially natural - demonstrating controlled divergence and convergence from reference sequences through simple grid-based sampling. SOM-VQ provides a general framework for interpretable discrete representations applicable to music, gesture, and other interactive generative domains.

[LG-6] Ski Rental with Distributional Predictions of Unknown Quality

Link: https://arxiv.org/abs/2602.21104
Authors: Qiming Cui, Michael Dinitz
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:We revisit the central online problem of ski rental in the “algorithms with predictions” framework from the point of view of distributional predictions. Ski rental was one of the first problems to be studied with predictions, where a natural prediction is simply the number of ski days. But it is both more natural and potentially more powerful to think of a prediction as a distribution p-hat over the ski days. If the true number of ski days is drawn from some true (but unknown) distribution p, then we show as our main result that there is an algorithm with expected cost at most OPT + O(min(max(eta, 1) * sqrt(b), b log b)), where OPT is the expected cost of the optimal policy for the true distribution p, b is the cost of buying, and eta is the Earth Mover’s (Wasserstein-1) distance between p and p-hat. Note that when eta = o(sqrt(b)) this gives additive loss less than b (the trivial bound), and when eta is arbitrarily large (corresponding to an extremely inaccurate prediction) we still do not pay more than O(b log b) additive loss. An implication of these bounds is that our algorithm has consistency O(sqrt(b)) (additive loss when the prediction error is 0) and robustness O(b log b) (additive loss when the prediction error is arbitrarily large). Moreover, we do not need to assume that we know (or have any bound on) the prediction error eta, in contrast with previous work in robust optimization which assumes that we know this error. We complement this upper bound with a variety of lower bounds showing that it is essentially tight: not only can the consistency/robustness tradeoff not be improved, but our particular loss function cannot be meaningfully improved.
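To make the setting concrete, the sketch below evaluates threshold policies for ski rental under a distribution over ski days. It assumes a rental price of 1 per day and a purchase price of b. `best_buy_day` is the naive policy that fully trusts the given distribution; it is not the paper's algorithm, which is built to stay robust when the predicted distribution is far from the true one. All names are illustrative.

```python
def cost(buy_day, ski_days, b):
    """Cost of renting at 1 per day and buying for b on the morning of buy_day."""
    if ski_days < buy_day:
        return ski_days              # season ended before the purchase
    return (buy_day - 1) + b         # rented buy_day - 1 days, then bought

def expected_cost(buy_day, dist, b):
    """Expected cost when the number of ski days follows dist (days -> probability)."""
    return sum(p * cost(buy_day, n, b) for n, p in dist.items())

def best_buy_day(dist, b):
    """Buy day that minimizes expected cost under dist; max(dist) + 1 means never buy."""
    never_buy = max(dist) + 1
    return min(range(1, never_buy + 1), key=lambda k: expected_cost(k, dist, b))
```

For example, with b = 10 and a season that lasts either 1 or 100 days with equal probability, the best threshold is to rent on day 1 and buy on day 2, for an expected cost of 6. Evaluating the same threshold under a different true distribution with `expected_cost` shows how much a wrong prediction can cost this naive policy.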

[LG-7] Scaling Vision Transformers: Evaluating DeepSpeed for Image-Centric Workloads

Link: https://arxiv.org/abs/2602.21081
Authors: Huy Trinh, Rebecca Ma, Zeqi Yu, Tahsin Reza
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Vision Transformers (ViTs) have demonstrated remarkable potential in image processing tasks by utilizing self-attention mechanisms to capture global relationships within data. However, their scalability is hindered by significant computational and memory demands, especially for large-scale models with many parameters. This study aims to leverage DeepSpeed, a highly efficient distributed training framework that is commonly used for language models, to enhance the scalability and performance of ViTs. We evaluate intra- and inter-node training efficiency across multiple GPU configurations on various datasets like CIFAR-10 and CIFAR-100, exploring the impact of distributed data parallelism on training speed, communication overhead, and overall scalability (strong and weak scaling). By systematically varying software parameters, such as batch size and gradient accumulation, we identify key factors influencing performance of distributed training. The experiments in this study provide a foundational basis for applying DeepSpeed to image-related tasks. Future work will extend these investigations to deepen our understanding of DeepSpeed’s limitations and explore strategies for optimizing distributed training pipelines for Vision Transformers.

[LG-8] PIME: Prototype-based Interpretable MCTS-Enhanced Brain Network Analysis for Disorder Diagnosis

Link: https://arxiv.org/abs/2602.21046
Authors: Kunyu Zhang, Yanwu Yang, Jing Zhang, Xiangjie Shi, Shujian Yu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recent deep learning methods for fMRI-based diagnosis have achieved promising accuracy by modeling functional connectivity networks. However, standard approaches often struggle with noisy interactions, and conventional post-hoc attribution methods may lack reliability, potentially highlighting dataset-specific artifacts. To address these challenges, we introduce PIME, an interpretable framework that bridges intrinsic interpretability with minimal-sufficient subgraph optimization by integrating prototype-based classification and consistency training with structural perturbations during learning. This encourages a structured latent space and enables Monte Carlo Tree Search (MCTS) under a prototype-consistent objective to extract compact minimal-sufficient explanatory subgraphs post-training. Experiments on three benchmark fMRI datasets demonstrate that PIME achieves state-of-the-art performance. Furthermore, by constraining the search space via learned prototypes, PIME identifies critical brain regions that are consistent with established neuroimaging findings. Stability analysis shows 90% reproducibility and consistent explanations across atlases.

[LG-9] T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation ICLR2026

Link: https://arxiv.org/abs/2602.21043
Authors: Dongik Park, Hyunwoo Ryu, Suahn Bae, Keondo Park, Hyung-Sin Kim
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026

Abstract:Imputing missing values in multivariate time series remains challenging, especially under diverse missing patterns and heavy missingness. Existing methods suffer from suboptimal performance as corrupted temporal features hinder effective cross-variable information transfer, amplifying reconstruction errors. Robust imputation requires both extracting temporal patterns from sparse observations within each variable and selectively transferring information across variables–yet current approaches excel at one while compromising the other. We introduce T1 (Time series imputation with 1-to-1 channel-head binding), a CNN-Transformer hybrid architecture that achieves robust imputation through Channel-Head Binding–a mechanism creating one-to-one correspondence between CNN channels and attention heads. This design enables selective information transfer: when missingness corrupts certain temporal patterns, their corresponding attention pathways adaptively down-weight based on remaining observable patterns while preserving reliable cross-variable connections through unaffected channels. Experiments on 11 benchmark datasets demonstrate that T1 achieves state-of-the-art performance, reducing MSE by 46% on average compared to the second-best baseline, with particularly strong gains under extreme sparsity (70% missing ratio). The model generalizes to unseen missing patterns without retraining and uses a consistent hyperparameter configuration across all datasets. The code is available at this https URL.

[LG-10] MAST: A Multi-fidelity Augmented Surrogate model via Spatial Trust-weighting

Link: https://arxiv.org/abs/2602.20974
Authors: Ahmed Mohamed Eisa Nasr, Haris Moazam Sheikh
Categories: Machine Learning (cs.LG)
Comments: Submitted to International Conference on Machine Learning 2026

Abstract:In engineering design and scientific computing, computational cost and predictive accuracy are intrinsically coupled. High-fidelity simulations provide accurate predictions but at substantial computational costs, while lower-fidelity approximations offer efficiency at the expense of accuracy. Multi-fidelity surrogate modelling addresses this trade-off by combining abundant low-fidelity data with sparse high-fidelity observations. However, existing methods suffer from expensive training cost or rely on global correlation assumptions that often fail in practice to capture how fidelity relationships vary across the input space, leading to poor performance particularly under tight budget constraints. We introduce MAST, a method that blends corrected low-fidelity observations with high-fidelity predictions, trusting high-fidelity near observed samples and relying on corrected low-fidelity elsewhere. MAST achieves this through explicit discrepancy modelling and distance-based weighting with closed-form variance propagation, producing a single heteroscedastic Gaussian process. Across multi-fidelity synthetic benchmarks, MAST shows a marked improvement over the current state-of-the-art techniques. Crucially, MAST maintains robust performance across varying total budget and fidelity gaps, conditions under which competing methods exhibit significant degradation or unstable behaviour.

[LG-11] Extending μP: Spectral Conditions for Feature Learning Across Optimizers

Link: https://arxiv.org/abs/2602.20937
Authors: Akshita Gupta, Marieme Ngom, Sam Foreman, Venkatram Vishwanath
Categories: Machine Learning (cs.LG)
Comments: 10 main pages, 16 appendix pages and 17 figures; Amended version of the publication in 17th International OPT Workshop on Optimization for Machine Learning

Abstract:Several variations of adaptive first-order and second-order optimization methods have been proposed to accelerate and scale the training of large language models. The performance of these optimization routines is highly sensitive to the choice of hyperparameters (HPs), which are computationally expensive to tune for large-scale models. Maximal update parameterization (μP) is a set of scaling rules which aims to make the optimal HPs independent of the model size, thereby allowing the HPs tuned on a smaller (computationally cheaper) model to be transferred to train a larger, target model. Despite promising results for SGD and Adam, deriving μP for other optimizers is challenging because the underlying tensor programming approach is difficult to grasp. Building on recent work that introduced spectral conditions as an alternative to tensor programs, we propose a novel framework to derive μP for a broader class of optimizers, including AdamW, ADOPT, LAMB, Sophia, Shampoo and Muon. We implement our μP derivations on multiple benchmark models and demonstrate zero-shot learning rate transfer across increasing model width for the above optimizers. Further, we provide empirical insights into depth-scaling parameterization for these optimizers.

[LG-12] On the Generalization Behavior of Deep Residual Networks From a Dynamical System Perspective

Link: https://arxiv.org/abs/2602.20921
Authors: Jinshu Huang, Mingfei Sun, Chunlin Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep neural networks (DNNs) have significantly advanced machine learning, with model depth playing a central role in their successes. The dynamical system modeling approach has recently emerged as a powerful framework, offering new mathematical insights into the structure and learning behavior of DNNs. In this work, we establish generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit. The resulting bounds are of order $O(1/\sqrt{S})$ with respect to the number of training samples $S$, and include a structure-dependent negative term, yielding depth-uniform and asymptotic generalization bounds under milder assumptions. These findings provide a unified understanding of generalization across both discrete- and continuous-time ResNets, helping to close the gap in both the order of sample complexity and assumptions between the discrete- and continuous-time settings.

[LG-13] Transcoder Adapters for Reasoning-Model Diffing

Link: https://arxiv.org/abs/2602.20904
Authors: Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts
Categories: Machine Learning (cs.LG)
Comments: 9 pages main, 27 pages total, 10 figures. Code and visualizations at this https URL

Abstract:While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model’s internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model’s internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model’s response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior – the production of hesitation tokens (e.g., “wait”). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.

[LG-14] DRESS: A Continuous Framework for Structural Graph Refinement

Link: https://arxiv.org/abs/2602.20833
Authors: Eduar Castrillo Velilla
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:The Weisfeiler-Lehman (WL) hierarchy is a cornerstone framework for graph isomorphism testing and structural analysis. However, scaling beyond 1-WL to 3-WL and higher requires tensor-based operations that scale as $O(n^3)$ or $O(n^4)$, making them computationally prohibitive for large graphs. In this paper, we start from the Original-DRESS equation (Castrillo, Leon, and Gomez, 2018)–a parameter-free, continuous dynamical system on edges–and show that it distinguishes the prism graph from $K_{3,3}$, a pair that 1-WL provably cannot separate. We then generalize it to Motif-DRESS, which replaces triangle neighborhoods with arbitrary structural motifs and converges to a unique fixed point under three sufficient conditions, and further to Generalized-DRESS, an abstract template parameterized by the choice of neighborhood operator, aggregation function and norm. Finally, we introduce Delta-DRESS, which runs DRESS on each node-deleted subgraph $G \setminus v$, connecting the framework to the Kelly-Ulam reconstruction conjecture. Both Motif-DRESS and Delta-DRESS empirically distinguish Strongly Regular Graphs (SRGs)–such as the Rook and Shrikhande graphs–that confound 3-WL. Our results establish the DRESS family as a highly scalable framework that empirically surpasses both 1-WL and 3-WL on well-known benchmark graphs, without the prohibitive $O(n^4)$ computational cost.
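The claim that 1-WL cannot separate the prism graph from K_{3,3} is easy to check by hand or in code. The sketch below runs standard 1-WL colour refinement on the disjoint union of the two graphs and compares the resulting colour histograms. It then counts triangles, a simple edge-neighbourhood statistic that does separate the pair. The triangle count only illustrates the kind of information involved; it is not the DRESS equation, and all function names are illustrative.

```python
from collections import Counter

def graph(edges):
    """Build an undirected adjacency map (node -> set of neighbours) from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def refine(adj):
    """1-WL colour refinement; returns the stable colouring as node -> colour."""
    colors = {v: 0 for v in adj}
    for _ in range(len(adj)):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        if len(set(new_colors.values())) == len(set(colors.values())):
            break  # no class was split, so the partition is stable
        colors = new_colors
    return colors

def wl_indistinguishable(adj_a, adj_b):
    """True if 1-WL assigns both graphs the same colour histogram."""
    union = {("a", v): {("a", u) for u in nb} for v, nb in adj_a.items()}
    union.update({("b", v): {("b", u) for u in nb} for v, nb in adj_b.items()})
    colors = refine(union)  # refine jointly so colours are comparable
    hist_a = Counter(c for (tag, _), c in colors.items() if tag == "a")
    hist_b = Counter(c for (tag, _), c in colors.items() if tag == "b")
    return hist_a == hist_b

def triangles(adj):
    """Number of triangles: common neighbours summed over edges, divided by 3."""
    return sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v) // 3

prism = graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (0, 3), (1, 4), (2, 5)])
k33 = graph([(a, b) for a in (0, 1, 2) for b in (3, 4, 5)])
```

Both graphs are 3-regular on six vertices, so colour refinement never splits the initial colour class and `wl_indistinguishable(prism, k33)` holds. The prism contains two triangles and K_{3,3}, being bipartite, contains none.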

[LG-15] Assessing the Impact of Speaker Identity in Speech Spoofing Detection

Link: https://arxiv.org/abs/2602.20805
Authors: Anh-Tuan Dao, Driss Matrouf, Nicholas Evans
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments:

Abstract:Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task (SInMT) framework, one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi-task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated using four datasets, our speaker-invariant model reduces the average equal error rate by 17% compared to the baseline, with up to 48% reduction for the most challenging attacks (e.g., A11).

[LG-16] Exploring the Impact of Parameter Update Magnitude on Forgetting and Generalization of Continual Learning

Link: https://arxiv.org/abs/2602.20796
Authors: JinLi He, Liang Bai, Xian Yang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The magnitude of parameter updates is considered a key factor in continual learning. However, most existing studies focus on designing diverse update strategies, while a theoretical understanding of the underlying mechanisms remains limited. Therefore, we characterize a model’s forgetting from the perspective of parameter update magnitude and formalize it as knowledge degradation induced by task-specific drift in the parameter space, which has not been fully captured in previous studies due to their assumption of a unified parameter space. By deriving the optimal parameter update magnitude that minimizes forgetting, we unify two representative update paradigms, frozen training and initialized training, within an optimization framework for constrained parameter updates. Our theoretical results further reveal that sequence tasks with small parameter distances exhibit better generalization and less forgetting under frozen training than under initialized training. These theoretical insights inspire a novel hybrid parameter update strategy that adaptively adjusts update magnitude based on gradient directions. Experiments on deep neural networks demonstrate that this hybrid approach outperforms standard training strategies, providing new theoretical perspectives and practical inspiration for designing efficient and scalable continual learning algorithms.

[LG-17] Understanding the Role of Rehearsal Scale in Continual Learning under Varying Model Capacities

Link: https://arxiv.org/abs/2602.20791
Authors: JinLi He, Liang Bai, Xian Yang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Rehearsal is one of the key techniques for mitigating catastrophic forgetting and has been widely adopted in continual learning algorithms due to its simplicity and practicality. However, the theoretical understanding of how rehearsal scale influences learning dynamics remains limited. To address this gap, we formulate rehearsal-based continual learning as a multidimensional effectiveness-driven iterative optimization problem, providing a unified characterization across diverse performance metrics. Within this framework, we derive a closed-form analysis of adaptability, memorability, and generalization from the perspective of rehearsal scale. Our results uncover several intriguing and counterintuitive findings. First, rehearsal can impair a model’s adaptability, in sharp contrast to its traditionally recognized benefits. Second, increasing the rehearsal scale does not necessarily improve memory retention. When tasks are similar and noise levels are low, the memory error exhibits a diminishing lower bound. Finally, we validate these insights through numerical simulations and extended analyses on deep neural networks across multiple real-world datasets, revealing statistical patterns of rehearsal mechanisms in continual learning.

[LG-18] On Electric Vehicle Energy Demand Forecasting and the Effect of Federated Learning

Link: https://arxiv.org/abs/2602.20782
Authors: Andreas Tritsarolis, Gil Sampaio, Nikos Pelekis, Yannis Theodoridis
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The wide spread of new energy resources, smart devices, and demand side management strategies has motivated several analytics operations, from infrastructure load modeling to user behavior profiling. Energy Demand Forecasting (EDF) of Electric Vehicle Supply Equipments (EVSEs) is one of the most critical operations for ensuring efficient energy management and sustainability, since it enables utility providers to anticipate energy/power demand, optimize resource allocation, and implement proactive measures to improve grid reliability. However, accurate EDF is a challenging problem due to external factors, such as the varying user routines, weather conditions, driving behaviors, unknown state of charge, etc. Furthermore, as concerns and restrictions about privacy and sustainability have grown, training data has become increasingly fragmented, resulting in distributed datasets scattered across different data silos and/or edge devices, calling for federated learning solutions. In this paper, we investigate different well-established time series forecasting methodologies to address the EDF problem, from statistical methods (the ARIMA family) to traditional machine learning models (such as XGBoost) and deep neural networks (GRU and LSTM). We provide an overview of these methods through a performance comparison over four real-world EVSE datasets, evaluated under both centralized and federated learning paradigms, focusing on the trade-offs between forecasting fidelity, privacy preservation, and energy overheads. Our experimental results demonstrate, on the one hand, the superiority of gradient boosted trees (XGBoost) over statistical and NN-based models in both prediction accuracy and energy efficiency and, on the other hand, an insight that Federated Learning-enabled models balance these factors, offering a promising direction for decentralized energy demand forecasting.

[LG-19] Deep unfolding of MCMC kernels: scalable modular explainable GANs for high-dimensional posterior sampling

Link: https://arxiv.org/abs/2602.20758
Authors: Jonathan Spence, Tobías I. Liaudat, Konstantinos Zygalakis, Marcelo Pereyra
Categories: Machine Learning (cs.LG)
Comments: 37 pages, 10 figures, 5 tables

Abstract:Markov chain Monte Carlo (MCMC) methods are fundamental to Bayesian computation, but can be computationally intensive, especially in high-dimensional settings. Push-forward generative models, such as generative adversarial networks (GANs), variational auto-encoders and normalising flows offer a computationally efficient alternative for posterior sampling. However, push-forward models are opaque as they lack the modularity of Bayes Theorem, leading to poor generalisation with respect to changes in the likelihood function. In this work, we introduce a novel approach to GAN architecture design by applying deep unfolding to Langevin MCMC algorithms. This paradigm maps fixed-step iterative algorithms onto modular neural networks, yielding architectures that are both flexible and amenable to interpretation. Crucially, our design allows key model parameters to be specified at inference time, offering robustness to changes in the likelihood parameters. We train these unfolded samplers end-to-end using a supervised regularized Wasserstein GAN framework for posterior sampling. Through extensive Bayesian imaging experiments, we demonstrate that our proposed approach achieves high sampling accuracy and excellent computational efficiency, while retaining the physics consistency, adaptability and interpretability of classical MCMC strategies.

[LG-20] Rethink Efficiency Side of Neural Combinatorial Solver: An Offline and Self-Play Paradigm

Link: https://arxiv.org/abs/2602.20730
Authors: Zhenxing Xu, Zeyuan Ma, Weidong Bao, Hui Yan, Yan Zheng, Ji Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We propose ECO, a versatile learning paradigm that enables efficient offline self-play for Neural Combinatorial Optimization (NCO). ECO addresses key limitations in the field through: 1) Paradigm Shift: Moving beyond inefficient online paradigms, we introduce a two-phase offline paradigm consisting of supervised warm-up and iterative Direct Preference Optimization (DPO); 2) Architecture Shift: We deliberately design a Mamba-based architecture to further enhance the efficiency in the offline paradigm; and 3) Progressive Bootstrapping: To stabilize training, we employ a heuristic-based bootstrapping mechanism that ensures continuous policy improvement during training. Comparison results on TSP and CVRP highlight that ECO performs competitively with up-to-date baselines, with significant advantage on the efficiency side in terms of memory utilization and training throughput. We provide further in-depth analysis on the efficiency, throughput and memory usage of ECO. Ablation studies show rationale behind our designs.
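For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss is short enough to write out. This is a minimal sketch of the generic loss, not ECO's training code: the abstract does not say how preference pairs are built, so treating the shorter of two sampled tours as the "winner" is an assumption, and the function and argument names are illustrative.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the preferred ("win") and
    dispreferred ("lose") solutions under the trained policy and under a
    frozen reference policy. beta scales the implicit reward.
    """
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) == softplus(-margin), in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Assumed setup: the "win" tour is the shorter of two tours for one instance.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))   # policy == reference: log(2)
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))   # policy favours the winner: lower loss
```

Because the loss needs only stored log-probabilities of already-sampled solutions, it can be optimized offline, which is the property the abstract contrasts with online paradigms.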

[LG-21] Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty

Link: https://arxiv.org/abs/2602.20729
Authors: Xu Wan, Chao Yang, Cheng Yang, Jie Song, Mingyang Sun
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose Fuz-RL, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value functions using Choquet integrals. Theoretically, we prove that solving the Fuz-RL problem (in Constrained Markov Decision Process (CMDP) form) is equivalent to solving distributionally robust safe RL problems (in robust CMDP form), effectively avoiding min-max optimization. Empirical analyses on safe-control-gym and safety-gymnasium scenarios demonstrate that Fuz-RL effectively integrates with existing safe RL baselines in a model-free manner, significantly improving both safety and control performance under various types of uncertainties in observation, action, and dynamics.
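The Choquet integral mentioned above generalizes an expectation to non-additive (fuzzy) measures. The sketch below computes the textbook discrete Choquet integral; it is not the paper's fuzzy Bellman operator, and the outcome names, values, and the squared-probability capacity are made-up illustrations.

```python
def choquet(values, mu):
    """Discrete Choquet integral of `values` with respect to capacity `mu`.

    values: dict mapping each outcome to a real number.
    mu:     function from a frozenset of outcomes to [0, 1], monotone,
            with mu(empty set) = 0 and mu(all outcomes) = 1.
    """
    order = sorted(values, key=values.get)      # outcomes by ascending value
    total, previous = 0.0, 0.0
    for i, outcome in enumerate(order):
        upper_set = frozenset(order[i:])        # outcomes with value >= current
        total += (values[outcome] - previous) * mu(upper_set)
        previous = values[outcome]
    return total


# Hypothetical next-state values and nominal probabilities.
values = {"a": 1.0, "b": 2.0, "c": 4.0}
prob = {"a": 0.2, "b": 0.3, "c": 0.5}


def additive(subset):
    """Ordinary probability measure: Choquet reduces to the expectation."""
    return sum(prob[x] for x in subset)


def pessimistic(subset):
    """Non-additive capacity that underweights high-value events."""
    return additive(subset) ** 2


print(choquet(values, additive))      # equals the plain expectation, 2.8
print(choquet(values, pessimistic))   # smaller: a more conservative estimate
```

With an additive measure the integral is the usual expected value; a sub-additive capacity yields a lower, more cautious value without an explicit min-max over distributions.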

[LG-22] WeirNet: A Large-Scale 3D CFD Benchmark for Geometric Surrogate Modeling of Piano Key Weirs

Link: https://arxiv.org/abs/2602.20714
Authors: Lisa Lüddecke, Michael Hohmann, Sebastian Eilermann, Jan Tillmann-Mumm, Pezhman Pourabdollah, Mario Oertel, Oliver Niggemann
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Reliable prediction of hydraulic performance is challenging for Piano Key Weir (PKW) design because discharge capacity depends on three-dimensional geometry and operating conditions. Surrogate models can accelerate hydraulic-structure design, but progress is limited by the scarcity of large, well-documented datasets that jointly capture geometric variation, operating conditions, and functional performance. This study presents WeirNet, a large 3D CFD benchmark dataset for geometric surrogate modeling of PKWs. WeirNet contains 3,794 parametric, feasibility-constrained rectangular and trapezoidal PKW geometries, each scheduled at 19 discharge conditions using a consistent free-surface OpenFOAM workflow, resulting in 71,387 completed simulations that form the benchmark and come with complete discharge coefficient labels. The dataset is released in multiple modalities (compact parametric descriptors, watertight surface meshes, and high-resolution point clouds) together with standardized tasks and in-distribution and out-of-distribution splits. Representative surrogate families are benchmarked for discharge coefficient prediction. Tree-based regressors on parametric descriptors achieve the best overall accuracy, while point- and mesh-based models remain competitive and offer parameterization-agnostic inference. All surrogates evaluate in milliseconds per sample, providing orders-of-magnitude speedups over CFD runtimes. Out-of-distribution results identify geometry shift as the dominant failure mode compared to unseen discharge values, and data-efficiency experiments show diminishing returns beyond roughly 60% of the training data. By publicly releasing the dataset together with simulation setups and evaluation pipelines, WeirNet establishes a reproducible framework for data-driven hydraulic modeling and enables faster exploration of PKW designs during the early stages of hydraulic planning.

[LG-23] High-Dimensional Robust Mean Estimation with Untrusted Batches

Link: https://arxiv.org/abs/2602.20698
Authors: Maryam Aliakbarpour, Vladimir Braverman, Yuhan Liu, Junze Yin
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We study high-dimensional mean estimation in a collaborative setting where data is contributed by N users in batches of size n. In this environment, a learner seeks to recover the mean \mu of a true distribution P from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an \varepsilon-fraction of users are entirely adversarial, while the remaining “good” users provide data from distributions that are related to P, but deviate by a proximity parameter \alpha. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of \sqrt{\alpha}, or (2) an \alpha-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha}), demonstrating that while heterogeneity \alpha represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of 1/\sqrt{n} due to the internal averaging afforded by the batch structure.
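To see how the three terms of the stated rate trade off, the snippet below evaluates them for concrete settings. It reads the bound as \sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha}, which is an interpretation of the garbled source formula consistent with the 1/\sqrt{n} suppression the abstract describes; constants hidden by the O(·) are ignored, and the numeric settings are arbitrary.

```python
import math


def error_rate_terms(eps, alpha, d, n, N):
    """Return the three terms of the rate, with O(.) constants dropped.

    eps:   fraction of adversarial users
    alpha: heterogeneity (proximity) parameter of the good users
    d:     dimension, n: batch size per user, N: number of users
    """
    return {
        "adversarial": math.sqrt(eps / n),
        "sampling": math.sqrt(d / (n * N)),
        "heterogeneity": math.sqrt(alpha),
    }


small_batches = error_rate_terms(eps=0.1, alpha=0.01, d=50, n=100, N=1000)
large_batches = error_rate_terms(eps=0.1, alpha=0.01, d=50, n=400, N=1000)

for name in small_batches:
    print(name, round(small_batches[name], 4), round(large_batches[name], 4))
```

Quadrupling the batch size halves the adversarial and sampling terms but leaves the heterogeneity term untouched, which is why the abstract calls \alpha an inherent statistical difficulty.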

[LG-24] Bikelution: Federated Gradient-Boosting for Scalable Shared Micro-Mobility Demand Forecasting

Link: https://arxiv.org/abs/2602.20671
Authors: Antonios Tziorvas, Andreas Tritsarolis, Yannis Theodoridis
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The rapid growth of dockless bike-sharing systems has generated massive spatio-temporal datasets useful for fleet allocation, congestion reduction, and sustainable mobility. Bike demand, however, depends on several external factors, making traditional time-series models insufficient. Centralized Machine Learning (CML) yields high-accuracy forecasts but raises privacy and bandwidth issues when data are distributed across edge devices. To overcome these limitations, we propose Bikelution, an efficient Federated Learning (FL) solution based on gradient-boosted trees that preserves privacy while delivering accurate mid-term demand forecasts up to six hours ahead. Experiments on three real-world BSS datasets show that Bikelution is comparable to its CML-based variant and outperforms the current state-of-the-art. The results highlight the feasibility of privacy-aware demand forecasting and outline the trade-offs between FL and CML approaches.

[LG-25] Sparse Bayesian Deep Functional Learning with Structured Region Selection

Link: https://arxiv.org/abs/2602.20651
Authors: Xiaoxian Zhu, Yingmeng Li, Shuangge Ma, Mengyun Wu
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Abstract:In modern applications such as ECG monitoring, neuroimaging, wearable sensing, and industrial equipment diagnostics, complex and continuously structured data are ubiquitous, presenting both challenges and opportunities for functional data analysis. However, existing methods face a critical trade-off: conventional functional models are limited by linearity, whereas deep learning approaches lack interpretable region selection for sparse effects. To bridge these gaps, we propose a sparse Bayesian functional deep neural network (sBayFDNN). It learns adaptive functional embeddings through a deep Bayesian architecture to capture complex nonlinear relationships, while a structured prior enables interpretable, region-wise selection of influential domains with quantified uncertainty. Theoretically, we establish rigorous approximation error bounds, posterior consistency, and region selection consistency. These results provide the first theoretical guarantees for a Bayesian deep functional model, ensuring its reliability and statistical rigor. Empirically, comprehensive simulations and real-world studies confirm the effectiveness and superiority of sBayFDNN. Crucially, sBayFDNN excels in recognizing intricate dependencies for accurate predictions and more precisely identifies functionally meaningful regions, capabilities fundamentally beyond existing approaches.

[LG-26] QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Link: https://arxiv.org/abs/2602.20629
Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard “LLM-as-a-Judge” protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models like GPT-5 Pro and Claude Sonnet 4.5 see their performance significantly degrade in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we also release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly published at this https URL.

[LG-27] Is the Trigger Essential? A Feature-Based Triggerless Backdoor Attack in Vertical Federated Learning

Link: https://arxiv.org/abs/2602.20593
Authors: Yige Liu, Yiwei Lou, Che Wang, Yongzhi Cao, Hanpin Wang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:As a distributed collaborative machine learning paradigm, vertical federated learning (VFL) allows multiple passive parties with distinct features and one active party with labels to collaboratively train a model. Although it is known for the privacy-preserving capabilities, VFL still faces significant privacy and security threats from backdoor attacks. Existing backdoor attacks typically involve an attacker implanting a trigger into the model during the training phase and executing the attack by adding the trigger to the samples during the inference phase. However, in this paper, we find that triggers are not essential for backdoor attacks in VFL. In light of this, we disclose a new backdoor attack pathway in VFL by introducing a feature-based triggerless backdoor attack. This attack operates under a more stringent security assumption, where the attacker is honest-but-curious rather than malicious during the training phase. It comprises three modules: label inference for the targeted backdoor attack, poison generation with amplification and perturbation mechanisms, and backdoor execution to implement the attack. Extensive experiments on five benchmark datasets demonstrate that our attack outperforms three baseline backdoor attacks by 2 to 50 times while minimally impacting the main task. Even in VFL scenarios with 32 passive parties and only one set of auxiliary data, our attack maintains high performance. Moreover, when confronted with distinct defense strategies, our attack remains largely unaffected and exhibits strong robustness. We hope that the disclosure of this triggerless backdoor attack pathway will encourage the community to revisit security threats in VFL scenarios and inspire researchers to develop more robust and practical defense strategies.

[LG-28] Upper-Linearizability of Online Non-Monotone DR-Submodular Maximization over Down-Closed Convex Sets

Link: https://arxiv.org/abs/2602.20578
Authors: Yiyang Lu, Haresh Jadav, Mohammad Pedramfar, Ranveer Singh, Vaneet Aggarwal
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:We study online maximization of non-monotone Diminishing-Return (DR)-submodular functions over down-closed convex sets, a regime where existing projection-free online methods suffer from suboptimal regret and limited feedback guarantees. Our main contribution is a new structural result showing that this class is 1/e-linearizable under carefully designed exponential reparametrization, scaling parameter, and surrogate potential, enabling a reduction to online linear optimization. As a result, we obtain O(T^{1/2}) static regret with a single gradient query per round and unlock adaptive and dynamic regret guarantees, together with improved rates under semi-bandit, bandit, and zeroth-order feedback. Across all feedback models, our bounds strictly improve the state of the art.

[LG-29] Benchmarking GNN Models on Molecular Regression Tasks with CKA-Based Representation Analysis

Link: https://arxiv.org/abs/2602.20573
Authors: Rajan, Ishaan Gupta
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures and 2 tables

Abstract:Molecules are commonly represented as SMILES strings, which can be readily converted to fixed-size molecular fingerprints. These fingerprints serve as feature vectors to train ML/DL models for molecular property prediction tasks in the fields of computational chemistry, drug discovery, biochemistry, and materials science. Recent research has demonstrated that SMILES can be used to construct molecular graphs where atoms are nodes (V) and bonds are edges (E). These graphs can subsequently be used to train geometric DL models such as GNNs. GNNs learn the inherent structural relationships within a molecule rather than depending on fixed-size fingerprints. Although GNNs are powerful aggregators, their efficacy on smaller datasets and their inductive biases across different architectures are less studied. In our present study, we performed a systematic benchmarking of four different GNN architectures across a diverse domain of datasets (physical chemistry, biological, and analytical). Additionally, we have also implemented a hierarchical fusion (GNN+FP) framework for target prediction. We observed that the fusion framework consistently outperforms or matches the performance of standalone GNN (RMSE improvement 7%) and baseline models. Further, we investigated the representational similarity using centered kernel alignment (CKA) between GNN and fingerprint embeddings and found that they occupy highly independent latent spaces (CKA \le 0.46). The cross-architectural CKA score suggests a high convergence between isotropic models like GCN, GraphSAGE and GIN (CKA \geq 0.88), with GAT learning moderately independent representation (CKA 0.55-0.80).
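The CKA scores quoted above compare two sets of embeddings computed on the same samples. The abstract does not say which CKA variant is used, so the sketch below assumes the common linear form, CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered matrices; the toy embeddings are made up.

```python
import math


def center(matrix):
    """Subtract the column mean from every column (rows are samples)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of A^T B, via column-by-column dot products."""
    return sum(
        sum(x * y for x, y in zip(a_col, b_col)) ** 2
        for a_col in zip(*a)
        for b_col in zip(*b)
    )


def linear_cka(x, y):
    """Linear CKA between two embedding matrices with the same number of rows."""
    xc, yc = center(x), center(y)
    return cross_sq_norm(yc, xc) / math.sqrt(
        cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc)
    )


# Toy embeddings for four samples.
emb = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Same representation, rotated by 90 degrees and scaled by 3.
rotated = [[-3.0 * b, 3.0 * a] for a, b in emb]
# An unrelated 3-dimensional embedding of the same four samples.
other = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0], [1.0, 3.0, 5.0]]

print(linear_cka(emb, emb))       # identical representations score 1
print(linear_cka(emb, rotated))   # also 1: invariant to rotation and scaling
print(linear_cka(emb, other))     # somewhere between 0 and 1
```

The invariance to rotation and isotropic scaling is what makes CKA usable across architectures whose embedding axes and dimensions do not correspond.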

[LG-30] Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs

Link: https://arxiv.org/abs/2602.20567
Authors: Yifei Liang, Yan Sun, Xiaochun Cao, Li Shen
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 47 Pages

Abstract:Push-Sum-based decentralized learning enables optimization over directed communication networks, where information exchange may be asymmetric. While convergence properties of such methods are well understood, their finite-iteration stability and generalization behavior remain unclear due to structural bias induced by column-stochastic mixing and asymmetric error propagation. In this work, we develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm that captures the effect of directed topology. A key technical ingredient is an imbalance-aware consistency bound for Push-Sum, which controls consensus deviation through two quantities: the stationary distribution imbalance parameter \delta and the spectral gap (1-\lambda) governing mixing speed. This decomposition enables us to disentangle statistical effects from topology-induced bias. We establish finite-iteration stability and optimization guarantees for both convex objectives and non-convex objectives satisfying the Polyak–Łojasiewicz condition. For convex problems, SGP attains excess generalization error of order \tilde{\mathcal{O}}\!\left(\frac{1}{\sqrt{mn}}+\frac{\gamma}{\delta(1-\lambda)}+\gamma\right) under step-size schedules, and we characterize the corresponding optimal early stopping time that minimizes this bound. For PŁ objectives, we obtain convex-like optimization and generalization rates with dominant dependence proportional to \kappa\!\left(1+\frac{1}{\delta(1-\lambda)}\right), revealing a multiplicative coupling between problem conditioning and directed communication topology. Our analysis clarifies when Push-Sum correction is necessary compared with standard decentralized SGD and quantifies how imbalance and mixing jointly shape the best attainable learning performance.
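The role of column-stochastic mixing and the Push-Sum correction is easiest to see on plain averaging. The sketch below runs the classical Push-Sum protocol on a small directed graph; it covers only the consensus step, not Stochastic Gradient Push (which interleaves gradient updates), and the graph and initial values are arbitrary.

```python
def push_sum(values, out_neighbors, rounds=200):
    """Classical Push-Sum averaging on a directed graph.

    values:        initial scalar held by each node
    out_neighbors: out_neighbors[j] lists the nodes that j sends to
    Each node splits its mass equally among its out-neighbours and itself,
    which makes the mixing matrix column-stochastic. Returns the de-biased
    estimates x_i / w_i and the final weights w_i.
    """
    n = len(values)
    x = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        new_x, new_w = [0.0] * n, [0.0] * n
        for j in range(n):
            targets = list(out_neighbors[j]) + [j]   # self-loop keeps some mass
            share = 1.0 / len(targets)
            for i in targets:
                new_x[i] += share * x[j]
                new_w[i] += share * w[j]
        x, w = new_x, new_w
    return [xi / wi for xi, wi in zip(x, w)], w


# Strongly connected but unbalanced digraph: node 0 has two out-edges.
graph = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
estimates, weights = push_sum([1.0, 5.0, 3.0, 7.0], graph)

print([round(z, 6) for z in estimates])   # every node recovers the mean 4.0
print([round(wi, 3) for wi in weights])   # weights are not uniform
```

The raw values x_i converge to the average scaled by a non-uniform stationary distribution; dividing by the weights w_i removes that topology-induced bias, and the spread of the weights is the kind of imbalance the abstract's \delta is meant to capture.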

[LG-31] GENSR: Symbolic Regression Based in Equation Generative Space

Link: https://arxiv.org/abs/2602.20557
Authors: Qian Li, Yuxiao Hu, Juncheng Liu, Yuntian Chen
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments:

Abstract:Symbolic Regression (SR) tries to reveal the hidden equations behind observed data. However, most methods search within a discrete equation space, where the structural modifications of equations rarely align with their numerical behavior, leaving fitting error feedback too noisy to guide exploration. To address this challenge, we propose GenSR, a generative latent space-based SR framework following the “map construction - coarse localization - fine search” paradigm. Specifically, GenSR first pretrains a dual-branch Conditional Variational Autoencoder (CVAE) to reparameterize symbolic equations into a generative latent space with symbolic continuity and local numerical smoothness. This space can be regarded as a well-structured “map” of the equation space, providing directional signals for search. At inference, the CVAE coarsely localizes the input data to promising regions in the latent space. Then, a modified CMA-ES refines the candidate region, leveraging smooth latent gradients. From a Bayesian perspective, GenSR reframes the SR task as maximizing the conditional distribution p(\mathrm{Equ.} \mid \mathrm{Num.}), with CVAE training achieving this objective through the Evidence Lower Bound (ELBO). This new perspective provides a theoretical guarantee for the effectiveness of GenSR. Extensive experiments show that GenSR jointly optimizes predictive accuracy, expression simplicity, and computational efficiency, while remaining robust under noise.

[LG-32] Memory-guided Prototypical Co-occurrence Learning for Mixed Emotion Recognition

Link: https://arxiv.org/abs/2602.20530
Authors: Ming Li, Yong-Jin Liu, Fang Liu, Huankun Sheng, Yeying Fan, Yixiang Wei, Minnan Luo, Weizhan Zhang, Wenping Wang
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Emotion recognition from multi-modal physiological and behavioral signals plays a pivotal role in affective computing, yet most existing models remain constrained to the prediction of singular emotions in controlled laboratory settings. Real-world human emotional experiences, by contrast, are often characterized by the simultaneous presence of multiple affective states, spurring recent interest in mixed emotion recognition as an emotion distribution learning problem. Current approaches, however, often neglect the valence consistency and structured correlations inherent among coexisting emotions. To address this limitation, we propose a Memory-guided Prototypical Co-occurrence Learning (MPCL) framework that explicitly models emotion co-occurrence patterns. Specifically, we first fuse multi-modal signals via a multi-scale associative memory mechanism. To capture cross-modal semantic relationships, we construct emotion-specific prototype memory banks, yielding rich physiological and behavioral representations, and employ prototype relation distillation to ensure cross-modal alignment in the latent prototype space. Furthermore, inspired by human cognitive memory systems, we introduce a memory retrieval strategy to extract semantic-level co-occurrence associations across emotion categories. Through this bottom-up hierarchical abstraction process, our model learns affectively informative representations for accurate emotion distribution prediction. Comprehensive experiments on two public datasets demonstrate that MPCL consistently outperforms state-of-the-art methods in mixed emotion recognition, both quantitatively and qualitatively.

[LG-33] CGSTA: Cross-Scale Graph Contrast with Stability-Aware Alignment for Multivariate Time-Series Anomaly Detection DASFAA’26

Link: https://arxiv.org/abs/2602.20468
Authors: Zhongpeng Qi, Jun Zhang, Wei Li, Zhuoxuan Liang
Categories: Machine Learning (cs.LG)
Comments: Accepted by DASFAA’26

Abstract:Multivariate time-series anomaly detection is essential for reliable industrial control, telemetry, and service monitoring. However, the evolving inter-variable dependencies and inevitable noise render it challenging. Existing methods often use single-scale graphs or instance-level contrast. Moreover, learned dynamic graphs can overfit noise without a stable anchor, causing false alarms or misses. To address these challenges, we propose the CGSTA framework with two key innovations. First, Dynamic Layered Graph Construction (DLGC) forms local, regional, and global views of variable relations for each sliding window; rather than contrasting whole windows, Contrastive Discrimination across Scales (CDS) contrasts graph representations within each view and aligns the same window across views to make learning structure-aware. Second, Stability-Aware Alignment (SAA) maintains a per-scale stable reference learned from normal data and guides the current window’s fast-changing graphs toward it to suppress noise. We fuse the multi-scale and temporal features and use a conditional density estimator to produce per-time-step anomaly scores. Across four benchmarks, CGSTA delivers optimal performance on PSM and WADI, and is comparable to the baseline methods on SWaT and SMAP.

[LG-34] Prior-Agnostic Incentive-Compatible Exploration

Link: https://arxiv.org/abs/2602.20465
Authors: Ramya Ramalingam,Osbert Bastani,Aaron Roth
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In bandit settings, optimizing long-term regret metrics requires exploration, which corresponds to sometimes taking myopically sub-optimal actions. When a long-lived principal merely recommends actions to be executed by a sequence of different agents (as in an online recommendation platform) this provides an incentive misalignment: exploration is “worth it” for the principal but not for the agents. Prior work studies regret minimization under the constraint of Bayesian Incentive-Compatibility in a static stochastic setting with a fixed and common prior shared amongst the agents and the algorithm designer. We show that (weighted) swap regret bounds on their own suffice to cause agents to faithfully follow forecasts in an approximate Bayes Nash equilibrium, even in dynamic environments in which agents have conflicting prior beliefs and the mechanism designer has no knowledge of any agent’s beliefs. To obtain these bounds, it is necessary to assume that the agents have some degree of uncertainty not just about the rewards, but about their arrival time – i.e. their relative position in the sequence of agents served by the algorithm. We instantiate our abstract bounds with concrete algorithms for guaranteeing adaptive and weighted regret in bandit settings.

[LG-35] A Long-Short Flow-Map Perspective for Drifting Models

Link: https://arxiv.org/abs/2602.20463
Authors: Zhiqi Li,Bo Zhu
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 7 figures

Click to view abstract

Abstract:This paper provides a reinterpretation of the Drifting Model \cite{deng2026generative} through a semigroup-consistent long-short flow-map factorization. We show that a global transport process can be decomposed into a long-horizon flow map followed by a short-time terminal flow map admitting a closed-form optimal velocity representation, and that taking the terminal interval length to zero recovers exactly the drifting field together with a conservative impulse term required for flow-map consistency. Based on this perspective, we propose a new likelihood learning formulation that aligns the long-short flow-map decomposition with density evolution under transport. We validate the framework through both theoretical analysis and empirical evaluations on benchmark tests, and further provide a theoretical interpretation of the feature-space optimization while highlighting several open problems for future study.

[LG-36] Nonparametric Teaching of Attention Learners ICLR2026

Link: https://arxiv.org/abs/2602.20461
Authors: Chen Zhang,Jianghui Wang,Bingyang Cheng,Zhongtao Chen,Wendong XU,Cong Wang,Marco Canini,Francesco Orabona,Yik Chung WU,Ngai Wong
Categories: Machine Learning (cs.LG)
Comments: ICLR 2026 (36 pages, 6 figures)

Click to view abstract

Abstract:Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named Attention Neural Teaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on parameter-based gradient descent during training, and recasting the evolution of attention learners, shaped by parameter updates, through functional gradient descent in nonparametric teaching, we show for the first time that teaching attention learners is consistent with teaching importance-adaptive nonparametric learners. These new findings readily commit AtteNT to enhancing learning efficiency of attention learners. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes. Crucially, these gains are achieved without compromising accuracy; in fact, performance is consistently preserved and often enhanced across a diverse set of downstream tasks.

[LG-37] Oracle-Robust Online Alignment for Large Language Models

Link: https://arxiv.org/abs/2602.20457
Authors: Zimeng Li,Mudit Gaur,Vaneet Aggarwal
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study online alignment of large language models under misspecified preference feedback, where the observed preference oracle deviates from an ideal but unknown ground-truth oracle. The online LLM alignment problem is a bi-level reinforcement problem due to the coupling between data collection and policy updates. Recently, the problem has been reduced to a tractable single-level objective in the SAIL (Self-Improving Efficient Online Alignment) framework. In this paper, we introduce a pointwise oracle uncertainty set in this problem and formulate an oracle-robust online alignment objective as a worst-case optimization problem. For log-linear policies, we show that this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty. We develop projected stochastic composite updates for the resulting weakly convex objective and prove $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity.

[LG-38] Heterogeneity-Aware Client Selection Methodology For Efficient Federated Learning

Link: https://arxiv.org/abs/2602.20450
Authors: Nihal Balivada,Shrey Gupta,Shashank Shreedhar Bhatt,Suyash Gupta
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) enables a distributed client-server architecture where multiple clients collaboratively train a global Machine Learning (ML) model without sharing sensitive local data. However, FL often results in lower accuracy than traditional ML algorithms due to statistical heterogeneity across clients. Prior works attempt to address this by using model updates, such as loss and bias, from client models to select participants that can improve the global model’s accuracy. However, these updates neither accurately represent a client’s heterogeneity nor are their selection methods deterministic. We mitigate these limitations by introducing Terraform, a novel client selection methodology that uses gradient updates and a deterministic selection algorithm to select heterogeneous clients for retraining. This bi-pronged approach allows Terraform to achieve up to 47 percent higher accuracy over prior works. We further demonstrate its efficiency through comprehensive ablation studies and training time analyses, providing strong justification for the robustness of Terraform.

[LG-39] GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization

Link: https://arxiv.org/abs/2602.20427
Authors: Yaohui Cai,Vesal Bakhtazad,Cunxi Yu,Zhiru Zhang
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments:

Click to view abstract

Abstract:Efficient operator scheduling is a fundamental challenge in software compilation and hardware synthesis. While recent differentiable approaches have sought to replace traditional ones like exact solvers or heuristics with gradient-based search, they typically rely on categorical distributions that fail to capture the ordinal nature of time and suffer from a parameter space that scales poorly. In this paper, we propose a novel differentiable framework, GauS, that models operator scheduling as a stochastic relaxation using Gaussian distributions, which fully utilize modern parallel computing devices like GPUs. By representing schedules as continuous Gaussian variables, we successfully capture the ordinal nature of time and reduce the optimization space by orders of magnitude. Our method is highly flexible to represent various objectives and constraints, which provides the first differentiable formulation for the complex pipelined scheduling problem. We evaluate our method on a range of benchmarks, demonstrating that GauS achieves Pareto-optimal results.

[LG-40] CREDIT: Certified Ownership Verification of Deep Neural Networks Against Model Extraction Attacks

Link: https://arxiv.org/abs/2602.20419
Authors: Bolin Shen,Zhan Cheng,Neil Zhenqiang Gong,Fan Yao,Yushun Dong
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Machine Learning as a Service (MLaaS) has emerged as a widely adopted paradigm for providing access to deep neural network (DNN) models, enabling users to conveniently leverage these models through standardized APIs. However, such services are highly vulnerable to Model Extraction Attacks (MEAs), where an adversary repeatedly queries a target model to collect input-output pairs and uses them to train a surrogate model that closely replicates its functionality. While numerous defense strategies have been proposed, verifying the ownership of a suspicious model with strict theoretical guarantees remains a challenging task. To address this gap, we introduce CREDIT, a certified ownership verification against MEAs. Specifically, we employ mutual information to quantify the similarity between DNN models, propose a practical verification threshold, and provide rigorous theoretical guarantees for ownership verification based on this threshold. We extensively evaluate our approach on several mainstream datasets across different domains and tasks, achieving state-of-the-art performance. Our implementation is publicly available at: this https URL.

[LG-41] CITED: A Decision Boundary-Aware Signature for GNNs Towards Model Extraction Defense

Link: https://arxiv.org/abs/2602.20418
Authors: Bolin Shen,Md Shamim Seraj,Zhan Cheng,Shayok Chakraborty,Yushun Dong
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graph neural networks (GNNs) have demonstrated superior performance in various applications, such as recommendation systems and financial risk management. However, deploying large-scale GNN models locally is particularly challenging for users, as it requires significant computational resources and extensive property data. Consequently, Machine Learning as a Service (MLaaS) has become increasingly popular, offering a convenient way to deploy and access various models, including GNNs. However, an emerging threat known as Model Extraction Attacks (MEAs) presents significant risks, as adversaries can readily obtain surrogate GNN models exhibiting similar functionality. Specifically, attackers repeatedly query the target model using subgraph inputs to collect corresponding responses. These input-output pairs are subsequently utilized to train their own surrogate models at minimal cost. Many techniques have been proposed to defend against MEAs, but most are limited to specific output levels (e.g., embedding or label) and suffer from inherent technical drawbacks. To address these limitations, we propose a novel ownership verification framework CITED which is a first-of-its-kind method to achieve ownership verification on both embedding and label levels. Moreover, CITED is a novel signature-based method that neither harms downstream performance nor introduces auxiliary models that reduce efficiency, while still outperforming all watermarking and fingerprinting approaches. Extensive experiments demonstrate the effectiveness and robustness of our CITED framework. Code is available at: this https URL.

[LG-42] κ-Explorer: A Unified Framework for Active Model Estimation in MDPs

Link: https://arxiv.org/abs/2602.20404
Authors: Xihe Gu,Urbashi Mitra,Tara Javidi
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In tabular Markov decision processes (MDPs) with perfect state observability, each trajectory provides active samples from the transition distributions conditioned on state-action pairs. Consequently, accurate model estimation depends on how the exploration policy allocates visitation frequencies in accordance with the intrinsic complexity of each transition distribution. Building on recent work on coverage-based exploration, we introduce a parameterized family of decomposable and concave objective functions $U_\kappa$ that explicitly incorporate both intrinsic estimation complexity and extrinsic visitation frequency. Moreover, the curvature $\kappa$ provides a unified treatment of various global objectives, such as the average-case and worst-case estimation error objectives. Using the closed-form characterization of the gradient of $U_\kappa$, we propose $\kappa$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures. The diminishing-returns structure of $U_\kappa$ naturally prioritizes underexplored and high-variance transitions, while preserving smoothness properties that enable efficient optimization. We establish tight regret guarantees for $\kappa$-Explorer and further introduce a fully online and computationally efficient surrogate algorithm for practical use. Experiments on benchmark MDPs demonstrate that $\kappa$-Explorer provides superior performance compared to existing exploration strategies.
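To make the "Frank-Wolfe-style optimization" in the abstract above concrete, here is a minimal sketch of Frank-Wolfe maximizing a concave, decomposable objective over a probability simplex. It is not the paper's algorithm. The abstract does not give the form of the objective, so a stand-in with diminishing returns, `sum(c_i * sqrt(x_i))`, is assumed. The feasible set is simplified to a plain simplex instead of the state-action occupancy polytope. The names `WEIGHTS`, `objective`, `gradient`, and `frank_wolfe_simplex` are hypothetical.

```python
import math


def frank_wolfe_simplex(grad, dim, n_iters=5000):
    """Frank-Wolfe for maximizing a concave function over the probability simplex.

    grad: function returning the gradient at a point x (list of floats).
    The linear maximization step over a simplex reduces to picking the
    coordinate with the largest gradient entry.
    """
    x = [1.0 / dim] * dim  # start at the uniform allocation (interior point)
    for t in range(1, n_iters + 1):
        g = grad(x)
        best = max(range(dim), key=lambda i: g[i])  # best simplex vertex
        step = 2.0 / (t + 2.0)  # standard diminishing step size, always < 1
        x = [(1.0 - step) * xi for xi in x]  # move toward the chosen vertex
        x[best] += step
    return x


# Assumed stand-in objective: WEIGHTS[i] plays the role of an intrinsic
# "difficulty" of component i (hypothetical numbers, not from the paper).
WEIGHTS = [1.0, 2.0, 3.0]


def objective(x):
    return sum(c * math.sqrt(xi) for c, xi in zip(WEIGHTS, x))


def gradient(x):
    return [c / (2.0 * math.sqrt(xi)) for c, xi in zip(WEIGHTS, x)]


visitation = frank_wolfe_simplex(gradient, len(WEIGHTS))
```

For this stand-in objective the maximizer is proportional to the squared weights, so harder components receive a larger share. Because the gradient grows as a coordinate shrinks, under-allocated coordinates are picked next, which mirrors the diminishing-returns behaviour the abstract describes.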

[LG-43] Wasserstein Distributionally Robust Online Learning

Link: https://arxiv.org/abs/2602.20403
Authors: Guixian Chen,Salar Fattahi,Soroosh Shafiee
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study distributionally robust online learning, where a risk-averse learner updates decisions sequentially to guard against worst-case distributions drawn from a Wasserstein ambiguity set centered at past observations. While this paradigm is well understood in the offline setting through Wasserstein Distributionally Robust Optimization (DRO), its online extension poses significant challenges in both convergence and computation. In this paper, we address these challenges. First, we formulate the problem as an online saddle-point stochastic game between a decision maker and an adversary selecting worst-case distributions, and propose a general framework that converges to a robust Nash equilibrium coinciding with the solution of the corresponding offline Wasserstein DRO problem. Second, we address the main computational bottleneck, which is the repeated solution of worst-case expectation problems. For the important class of piecewise concave loss functions, we propose a tailored algorithm that exploits problem geometry to achieve substantial speedups over state-of-the-art solvers such as Gurobi. The key insight is a novel connection between the worst-case expectation problem, an inherently infinite-dimensional optimization problem, and a classical and tractable budget allocation problem, which is of independent interest.

[LG-44] GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training

Link: https://arxiv.org/abs/2602.20399
Authors: Haixu Wu,Minghao Guo,Zongyi Li,Zhiyang Dou,Mingsheng Long,Kaiming He,Wojciech Matusik
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high-fidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics-aware self-supervision without physics labels. Pre-trained on over one million samples, GeoPT consistently improves industrial-fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20-60% and accelerating convergence by $2\times$. These results show that lifting with synthetic dynamics bridges the geometry-physics gap, unlocking a scalable path for neural simulation and potentially beyond. Code is available at this https URL.

[LG-45] cc-Shapley: Measuring Multivariate Feature Importance Needs Causal Context

Link: https://arxiv.org/abs/2602.20396
Authors: Jörg Martin,Stefan Haufe
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Explainable artificial intelligence promises to yield insights into relevant features, thereby enabling humans to examine and scrutinize machine learning models or even facilitating scientific discovery. Considering the widespread technique of Shapley values, we find that purely data-driven operationalization of multivariate feature importance is unsuitable for such purposes. Even for simple problems with two features, spurious associations due to collider bias and suppression arise from considering one feature only in the observational context of the other, which can lead to misinterpretations. Causal knowledge about the data-generating process is required to identify and correct such misleading feature attributions. We propose cc-Shapley (causal context Shapley), an interventional modification of conventional observational Shapley values leveraging knowledge of the data’s causal structure, thereby analyzing the relevance of a feature in the causal context of the remaining features. We show theoretically that this eradicates spurious association induced by collider bias. We compare the behavior of Shapley and cc-Shapley values on various synthetic and real-world datasets. We observe nullification or reversal of associations compared to univariate feature importance when moving from observational to cc-Shapley.
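As background for the abstract above, the following sketch computes conventional Shapley values exactly from a coalition value function. It does not implement cc-Shapley. The two-feature value table `TOY_VALUES` is a hypothetical example chosen so that feature 1 has no value on its own but adds value once feature 0 is present. The function names are also illustrative.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value):
    """Exact Shapley values.

    value: function mapping a frozenset of feature indices to a number.
    Each feature's attribution is the weighted average of its marginal
    contribution over all subsets of the remaining features.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical value table (not from the paper): feature 1 alone is
# worthless, yet it improves the model when combined with feature 0.
TOY_VALUES = {
    frozenset(): 0.0,
    frozenset({0}): 0.5,
    frozenset({1}): 0.0,
    frozenset({0, 1}): 1.0,
}


def toy_value(subset):
    return TOY_VALUES[subset]


phi = shapley_values(2, toy_value)
```

In this toy case feature 1 receives a nonzero attribution purely through its contribution in the context of feature 0. Whether such an attribution reflects genuine relevance or a suppression or collider effect cannot be read off the value table alone, which is the gap the abstract says causal knowledge has to fill.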

[LG-46] Quantitative Approximation Rates for Group Equivariant Learning

Link: https://arxiv.org/abs/2602.20370
Authors: Jonathan W. Siegel,Snir Hordan,Hannah Lawrence,Ali Syed,Nadav Dym
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:The universal approximation theorem establishes that neural networks can approximate any continuous function on a compact set. Later works in approximation theory provide quantitative approximation rates for ReLU networks on the class of $\alpha$-Hölder functions $f: [0,1]^N \to \mathbb{R}$. The goal of this paper is to provide similar quantitative approximation results in the context of group equivariant learning, where the learned $\alpha$-Hölder function is known to obey certain group symmetries. While there has been much interest in the literature in understanding the universal approximation properties of equivariant models, very few quantitative approximation results are known for equivariant models. In this paper, we bridge this gap by deriving quantitative approximation rates for several prominent group-equivariant and invariant architectures. The architectures that we consider include: the permutation-invariant Deep Sets architecture; the permutation-equivariant Sumformer and Transformer architectures; joint invariance to permutations and rigid motions using invariant networks based on frame averaging; and general bi-Lipschitz invariant models. Overall, we show that equally-sized ReLU MLPs and equivariant architectures are equally expressive over equivariant functions. Thus, hard-coding equivariance does not result in a loss of expressivity or approximation power in these models.

[LG-47] Emergent Manifold Separability during Reasoning in Large Language Models

Link: https://arxiv.org/abs/2602.20338
Authors: Alexandre Polo,Chanwoo Chun,SueYeon Chung
Categories: Machine Learning (cs.LG)
Comments: Alexandre Polo and Chanwoo Chun contributed equally to this work

Click to view abstract

Abstract:Chain-of-Thought (CoT) prompting significantly improves reasoning in Large Language Models, yet the temporal dynamics of the underlying representation geometry remain poorly understood. We investigate these dynamics by applying Manifold Capacity Theory (MCT) to a compositional Boolean logic task, allowing us to quantify the linear separability of latent representations without the confounding factors of probe training. Our analysis reveals that reasoning manifests as a transient geometric pulse, where concept manifolds are untangled into linearly separable subspaces immediately prior to computation and rapidly compressed thereafter. This behavior diverges from standard linear probe accuracy, which remains high long after computation, suggesting a fundamental distinction between information that is merely retrievable and information that is geometrically prepared for processing. We interpret this phenomenon as Dynamic Manifold Management, a mechanism where the model dynamically modulates representational capacity to optimize the bandwidth of the residual stream throughout the reasoning chain.

[LG-48] CaDrift: A Time-dependent Causal Generator of Drifting Data Streams ICLR2026

Link: https://arxiv.org/abs/2602.20329
Authors: Eduardo V. L. Barboza,Jean Paul Barddal,Robert Sabourin,Rafael M. O. Cruz
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: Paper submitted to ICLR 2026

Click to view abstract

Abstract:This work presents Causal Drift Generator (CaDrift), a time-dependent synthetic data generator framework based on Structural Causal Models (SCMs). The framework produces a virtually infinite combination of data streams with controlled shift events and time-dependent data, making it a tool to evaluate methods under evolving data. CaDrift synthesizes various distributional and covariate shifts by drifting mapping functions of the SCM, which change underlying cause-and-effect relationships between features and the target. In addition, CaDrift models occasional perturbations by leveraging interventions in causal modeling. Experimental results show that, after distributional shift events, the accuracy of classifiers tends to drop, followed by a gradual recovery, confirming the generator’s effectiveness in simulating shifts. The framework has been made available on GitHub.

[LG-49] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Link: https://arxiv.org/abs/2602.20309
Authors: Jingxuan Zhang,Yunta Hsieh,Zhongwei Wang,Haokun Lin,Xin Wang,Ziqi Wang,Yingtie Lei,Mi Zhang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.

[LG-50] In-context Pre-trained Time-Series Foundation Models adapt to Unseen Tasks

Link: https://arxiv.org/abs/2602.20307
Authors: Shangqing Xu,Harshavardhan Kamarthi,Haoxin Liu,B. Aditya Prakash
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Time-series foundation models (TSFMs) have demonstrated strong generalization capabilities across diverse datasets and tasks. However, existing foundation models are typically pre-trained to enhance performance on specific tasks and often struggle to generalize to unseen tasks without fine-tuning. To address this limitation, we propose augmenting TSFMs with In-Context Learning (ICL) capabilities, enabling them to perform test-time inference by dynamically adapting to input-output relationships provided within the context. Our framework, In-Context Time-series Pre-training (ICTP), restructures the original pre-training data to equip the backbone TSFM with ICL capabilities, enabling adaptation to unseen tasks. Experiments demonstrate that ICTP improves the performance of state-of-the-art TSFMs by approximately 11.4% on unseen tasks without requiring fine-tuning.

[LG-51] Learning to Solve Complex Problems via Dataset Decomposition NEURIPS2025

Link: https://arxiv.org/abs/2602.20296
Authors: Wanru Zhao,Lucas Caccia,Zhengyan Shi,Minseon Kim,Weijia Xu,Alessandro Sordoni
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2025

Click to view abstract

Abstract:Curriculum learning is a class of training strategies that organizes the data being exposed to a model by difficulty, gradually from simpler to more complex examples. This research explores a reverse curriculum generation approach that recursively decomposes complex datasets into simpler, more learnable components. We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to recursively generate easier versions of examples, enabling the student model to progressively master difficult tasks. We propose a novel scoring system to measure data difficulty based on its structural complexity and conceptual depth, allowing curriculum construction over decomposed data. Experiments on math datasets (MATH and AIME) and code generation datasets demonstrate that models trained with curricula generated by our approach exhibit superior performance compared to standard training on original datasets.

[LG-52] Discrete Diffusion with Sample-Efficient Estimators for Conditionals

Link: https://arxiv.org/abs/2602.20293
Authors: Karthik Elamvazhuthi,Abhijith Jayakumar,Andrey Y. Lokhov
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study a discrete denoising diffusion framework that integrates a sample-efficient estimator of single-site conditionals with round-robin noising and denoising dynamics for generative modeling over discrete state spaces. Rather than approximating a discrete analog of a score function, our formulation treats single-site conditional probabilities as the fundamental objects that parameterize the reverse diffusion process. We employ a sample-efficient method known as Neural Interaction Screening Estimator (NeurISE) to estimate these conditionals in the diffusion dynamics. Controlled experiments on synthetic Ising models, MNIST, and scientific data sets produced by a D-Wave quantum annealer, synthetic Potts model and one-dimensional quantum systems demonstrate the proposed approach. On the binary data sets, these experiments demonstrate that the proposed approach outperforms popular existing methods including ratio-based approaches, achieving improved performance in total variation, cross-correlations, and kernel density estimation metrics.

[LG-53] The Truthfulness Spectrum Hypothesis

Link: https://arxiv.org/abs/2602.20273
Authors: Zhuofan Josh Ying,Shauli Ravfogel,Nikolaus Kriegeskorte,Peter Hase
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 26 figures

Click to view abstract

Abstract:Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding’s generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models’ sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is provided in this https URL.

[LG-54] Coupled Cluster con MōLe: Molecular Orbital Learning for Neural Wavefunctions

Link: https://arxiv.org/abs/2602.20232
Authors: Luca Thiede,Abdulrahman Aldossary,Andreas Burger,Jorge Arturo Campos-Gonzalez-Angulo,Ning Wang,Alexander Zook,Melisa Alkan,Kouhei Nakaji,Taylor Lee Patti,Jérôme Florian Gonthier,Mohammad Ghazi Vakili,Alán Aspuru-Guzik
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:

Click to view abstract

Abstract:Density functional theory (DFT) is the most widely used method for calculating molecular properties; however, its accuracy is often insufficient for quantitative predictions. Coupled-cluster (CC) theory is the most successful method for achieving accuracy beyond DFT and for predicting properties that closely align with experiment. It is known as the “gold standard” of quantum chemistry. Unfortunately, the high computational cost of CC limits its widespread applicability. In this work, we present the Molecular Orbital Learning (MōLe) architecture, an equivariant machine learning model that directly predicts CC’s core mathematical objects, the excitation amplitudes, from the mean-field Hartree-Fock molecular orbitals as inputs. We test various aspects of our model and demonstrate its remarkable data efficiency and out-of-distribution generalization to larger molecules and off-equilibrium geometries, despite being trained only on small equilibrium geometries. Finally, we also examine its ability to reduce the number of cycles required to converge CC calculations. MōLe can set the foundations for high-accuracy wavefunction-based ML architectures to accelerate molecular design and complement force-field approaches.

[LG-55] FedAvg-Based CTMC Hazard Model for Federated Bridge Deterioration Assessment

Link: https://arxiv.org/abs/2602.20194
Authors: Takato Yasuno
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 4 figures, 2 tables

Abstract:Bridge periodic inspection records contain sensitive information about public infrastructure, making cross-organizational data sharing impractical under existing data governance constraints. We propose a federated framework for estimating a Continuous-Time Markov Chain (CTMC) hazard model of bridge deterioration, enabling municipalities to collaboratively train a shared benchmark model without transferring raw inspection records. Each User holds local inspection data and trains a log-linear hazard model over three deterioration-direction transitions – Good \to Minor, Good \to Severe, and Minor \to Severe – with covariates for bridge age, coastline distance, and deck area. Local optimization is performed via mini-batch stochastic gradient descent on the CTMC log-likelihood, and only a 12-dimensional pseudo-gradient vector is uploaded to a central server per communication round. The server aggregates User updates using sample-weighted Federated Averaging (FedAvg) with momentum and gradient clipping. All experiments in this paper are conducted on fully synthetic data generated from a known ground-truth parameter set with region-specific heterogeneity, enabling controlled evaluation of federated convergence behaviour. Simulation results across heterogeneous Users show consistent convergence of the average negative log-likelihood, with the aggregated gradient norm decreasing as User scale increases. Furthermore, the federated update mechanism provides a natural participation incentive: Users who register their local inspection datasets on a shared technical-standard platform receive in return the periodically updated global benchmark parameters – information that cannot be obtained from local data alone – thereby enabling evidence-based life-cycle planning without surrendering data sovereignty.
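
The server-side step described in the abstract (sample-weighted FedAvg with momentum and gradient clipping) can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the function name `server_round`, the hyperparameters `lr`, `beta`, and `max_norm`, and the order of clipping before the momentum update are all assumptions.

```python
import math

def server_round(theta, velocity, updates, lr=0.1, beta=0.9, max_norm=1.0):
    """One hypothetical aggregation round.

    theta    -- current global parameter vector (list of floats)
    velocity -- momentum buffer, same length as theta
    updates  -- list of (n_samples, pseudo_gradient) pairs, one per User
    """
    total = sum(n for n, _ in updates)
    dim = len(theta)
    # Sample-weighted average of the uploaded pseudo-gradients.
    agg = [sum(n * g[j] for n, g in updates) / total for j in range(dim)]
    # Clip the aggregated vector to an L2 norm of at most max_norm (assumed scheme).
    norm = math.sqrt(sum(a * a for a in agg))
    if norm > max_norm:
        agg = [a * max_norm / norm for a in agg]
    # Heavy-ball style momentum, then a descent step on the global parameters.
    velocity = [beta * v + a for v, a in zip(velocity, agg)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity
```

In the paper's setting the vectors would be 12-dimensional; the sketch works for any length, and raw inspection records never appear in `updates`.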

[LG-56] Tensor Network Generator-Enhanced Optimization for Traveling Salesman Problem

Link: https://arxiv.org/abs/2602.20175
Authors: Ryo Sakai, Chen-Yu Liu
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Comments: 11 pages, 7 figures

Abstract:We present an application of the tensor network generator-enhanced optimization (TN-GEO) framework to address the traveling salesman problem (TSP), a fundamental combinatorial optimization challenge. Our approach employs a tensor network Born machine based on automatically differentiable matrix product states (MPS) as the generative model, using the Born rule to define probability distributions over candidate solutions. Unlike approaches based on binary encoding, which require N^2 variables and penalty terms to enforce valid tour constraints, we adopt a permutation-based formulation with integer variables and use autoregressive sampling with masking to guarantee that every generated sample is a valid tour by construction. We also introduce a k -site MPS variant that learns distributions over k -grams (consecutive city subsequences) using a sliding window approach, enabling parameter-efficient modeling for larger instances. Experimental validation on TSPLIB benchmark instances with up to 52 cities demonstrates that TN-GEO can outperform classical heuristics including swap and 2-opt hill-climbing. The k -site variants, which put more focus on local correlations, show better results compared to the full-MPS case.
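
The idea of autoregressive sampling with masking can be illustrated with a small sketch. It does not use a matrix product state: the generative model is replaced by a hypothetical `score(prefix, city)` callback that returns a positive weight, and the name `sample_tour` is likewise an assumption. The point is only that restricting each draw to unvisited cities makes every sample a valid permutation by construction.

```python
import random

def sample_tour(n_cities, score, rng):
    """Draw one tour city by city, masking out cities already visited.

    score(prefix, city) is a stand-in for the model's conditional weight
    of visiting `city` next given the partial tour `prefix`; it must be > 0.
    """
    tour = []
    visited = [False] * n_cities
    for _ in range(n_cities):
        # Masking: only unvisited cities are candidates at this step.
        candidates = [c for c in range(n_cities) if not visited[c]]
        weights = [score(tour, c) for c in candidates]
        city = rng.choices(candidates, weights=weights, k=1)[0]
        tour.append(city)
        visited[city] = True
    return tour

# Usage: a toy score that prefers higher-indexed cities.
rng = random.Random(0)
tour = sample_tour(6, lambda prefix, c: 1.0 + c, rng)
```

No penalty terms are needed here, in contrast to the binary encoding with N^2 variables mentioned in the abstract.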

[LG-57] Not Just How Much But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Link: https://arxiv.org/abs/2602.21160
Authors: Mame Diarra Toure, David A. Stephens
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments: 8 pages, 17 figures

Abstract: In safety-critical classification, the cost of failure is often asymmetric, yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), that cannot distinguish whether a model’s ignorance involves a benign or safety-critical class. We decompose MI into a per-class vector C_k(x)=\sigma_k^2/(2\mu_k) , with \mu_k=\mathbb{E}[p_k] and \sigma_k^2=\mathrm{Var}[p_k] across posterior samples. The decomposition follows from a second-order Taylor expansion of the entropy; the 1/\mu_k weighting corrects boundary suppression and makes C_k comparable across rare and common classes. By construction \sum_k C_k \approx \mathrm{MI} , and a companion skewness diagnostic flags inputs where the approximation degrades. After characterising the axiomatic properties of C_k , we validate it on three tasks: (i) selective prediction for diabetic retinopathy, where critical-class C_k reduces selective risk by 34.7% over MI and 56.2% over variance baselines; (ii) out-of-distribution detection on clinical and image benchmarks, where \sum_k C_k achieves the highest AUROC and the per-class view exposes asymmetric shifts invisible to MI; and (iii) a controlled label-noise study in which \sum_k C_k shows less sensitivity to injected aleatoric noise than MI under end-to-end Bayesian training, while both metrics degrade under transfer learning. Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.
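
The per-class quantity in the abstract can be computed directly from posterior samples of the class probabilities. The sketch below is a minimal illustration, not the authors' code: the function name `per_class_contributions` and the toy samples are assumptions, and the exact mutual information is computed as H(E[p]) minus E[H(p)] for comparison with the sum of the per-class terms.

```python
import math

def per_class_contributions(samples):
    """samples: list of S probability vectors (length K), one per posterior draw.

    Returns (C, mi), where C[k] = Var[p_k] / (2 * E[p_k]) and
    mi = H(E[p]) - E[H(p)] is the exact mutual information.
    """
    S = len(samples)
    K = len(samples[0])
    mu = [sum(p[k] for p in samples) / S for k in range(K)]
    var = [sum((p[k] - mu[k]) ** 2 for p in samples) / S for k in range(K)]
    # Per-class contribution; classes with zero mean probability contribute 0.
    C = [var[k] / (2 * mu[k]) if mu[k] > 0 else 0.0 for k in range(K)]

    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    mi = entropy(mu) - sum(entropy(p) for p in samples) / S
    return C, mi

# Toy posterior draws over three classes (made-up numbers).
draws = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.55, 0.35, 0.10]]
C, mi = per_class_contributions(draws)
```

In this toy case the rare third class carries most of the epistemic uncertainty, which is the kind of information a single scalar MI value hides.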

[LG-58] Complexity of Classical Acceleration for \ell_1-Regularized PageRank

Link: https://arxiv.org/abs/2602.21138
Authors: Kimon Fountoulakis, David Martínez-Rubio
Categories: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures

Abstract: We study the degree-weighted work required to compute \ell_1 -regularized PageRank using the standard one-gradient-per-iteration accelerated proximal-gradient method (FISTA). For non-accelerated local methods, the best known worst-case work scales as \widetilde{O}((\alpha\rho)^{-1}) , where \alpha is the teleportation parameter and \rho is the \ell_1 -regularization parameter. A natural question is whether FISTA can improve the dependence on \alpha from 1/\alpha to 1/\sqrt{\alpha} while preserving the 1/\rho locality scaling. The challenge is that acceleration can break locality by transiently activating nodes that are zero at optimality, thereby increasing the cost of gradient evaluations. We analyze FISTA on a slightly over-regularized objective and show that, under a checkable confinement condition, all spurious activations remain inside a boundary set \mathcal{B} . This yields a bound consisting of an accelerated (\rho\sqrt{\alpha})^{-1}\log(\alpha/\varepsilon) term plus a boundary overhead \sqrt{\mathrm{vol}(\mathcal{B})}/(\rho\alpha^{3/2}) . We provide graph-structural conditions that imply such confinement. Experiments on synthetic and real graphs show the resulting speedup and slowdown regimes under the degree-weighted work model.

[LG-59] An Enhanced Projection Pursuit Tree Classifier with Visual Methods for Assessing Algorithmic Improvements

Link: https://arxiv.org/abs/2602.21130
Authors: Natalia da Silva, Dianne Cook, Eun-Kyung Lee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This paper presents enhancements to the projection pursuit tree classifier and visual diagnostic methods for assessing their impact in high dimensions. The original algorithm uses linear combinations of variables in a tree structure where depth is constrained to be less than the number of classes – a limitation that proves too rigid for complex classification problems. Our extensions improve performance in multi-class settings with unequal variance-covariance structures and nonlinear class separations by allowing more splits and more flexible class groupings in the projection pursuit computation. Proposing algorithmic improvements is straightforward; demonstrating their actual utility is not. We therefore develop two visual diagnostic approaches to verify that the enhancements perform as intended. Using high-dimensional visualization techniques, we examine model fits on benchmark datasets to assess whether the algorithm behaves as theorized. An interactive web application enables users to explore the behavior of both the original and enhanced classifiers under controlled scenarios. The enhancements are implemented in the R package PPtreeExt.

[LG-60] Is Multi-Distribution Learning as Easy as PAC Learning: Sharp Rates with Bounded Label Noise

Link: https://arxiv.org/abs/2602.21039
Authors: Rafael Hanashiro, Abhishek Shetty, Patrick Jaillet
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Towards understanding the statistical complexity of learning from heterogeneous sources, we study the problem of multi-distribution learning. Given k data sources, the goal is to output a classifier for each source by exploiting shared structure to reduce sample complexity. We focus on the bounded label noise setting to determine whether the fast 1/\epsilon rates achievable in single-task learning extend to this regime with minimal dependence on k . Surprisingly, we show that this is not the case. We demonstrate that learning across k distributions inherently incurs slow rates scaling with k/\epsilon^2 , even under constant noise levels, unless each distribution is learned separately. A key technical contribution is a structured hypothesis-testing framework that captures the statistical cost of certifying near-optimality under bounded noise, a cost we show is unavoidable in the multi-distribution setting. Finally, we prove that when competing with the stronger benchmark of each distribution’s optimal Bayes error, the sample complexity incurs a multiplicative penalty in k . This establishes a statistical separation between random classification noise and Massart noise, highlighting a fundamental barrier unique to learning from multiple sources.

[LG-61] Empirically Calibrated Conditional Independence Tests

Link: https://arxiv.org/abs/2602.21036
Authors: Milleno Pan, Antoine de Mathelin, Wesley Tansey
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Conditional independence tests (CIT) are widely used for causal discovery and feature selection. Even with false discovery rate (FDR) control procedures, they often fail to provide frequentist guarantees in practice. We highlight two common failure modes: (i) in small samples, asymptotic guarantees for many CITs can be inaccurate and even correctly specified models fail to estimate the noise levels and control the error, and (ii) when sample sizes are large but models are misspecified, unaccounted dependencies skew the test’s behavior and fail to return uniform p-values under the null. We propose Empirically Calibrated Conditional Independence Tests (ECCIT), a method that measures and corrects for miscalibration. For a chosen base CIT (e.g., GCM, HRT), ECCIT optimizes an adversary that selects features and response functions to maximize a miscalibration metric. ECCIT then fits a monotone calibration map that adjusts the base-test p-values in proportion to the observed miscalibration. Across empirical benchmarks on synthetic and real data, ECCIT achieves valid FDR with higher power than existing calibration strategies while remaining test agnostic.

[LG-62] Functional Continuous Decomposition

Link: https://arxiv.org/abs/2602.20857
Authors: Teymur Aghayev
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 6 tables

Abstract:The analysis of non-stationary time-series data requires insight into its local and global patterns with physical interpretability. However, traditional smoothing algorithms, such as B-splines, Savitzky-Golay filtering, and Empirical Mode Decomposition (EMD), lack the ability to perform parametric optimization with guaranteed continuity. In this paper, we propose Functional Continuous Decomposition (FCD), a JAX-accelerated framework that performs parametric, continuous optimization on a wide range of mathematical functions. By using Levenberg-Marquardt optimization to achieve up to C^1 continuous fitting, FCD transforms raw time-series data into M modes that capture different temporal patterns from short-term to long-term trends. Applications of FCD include physics, medicine, financial analysis, and machine learning, where it is commonly used for the analysis of signal temporal patterns, optimized parameters, derivatives, and integrals of decomposition. Furthermore, FCD can be applied for physical analysis and feature extraction with an average SRMSE of 0.735 per segment and a speed of 0.47s on full decomposition of 1,000 points. Finally, we demonstrate that a Convolutional Neural Network (CNN) enhanced with FCD features, such as optimized function values, parameters, and derivatives, achieved 16.8% faster convergence and 2.5% higher accuracy over a standard CNN.

[LG-63] F10.7 Index Prediction: A Multiscale Decomposition Strategy with Wavelet Transform for Performance Optimization

Link: https://arxiv.org/abs/2602.20712
Authors: Xuran Ma, Xuebao Li, Yanfang Zheng, Yongshang Lv, Xiaojia Ji, Jiancheng Xu, Hongwei Ye, Zixian Wu, Shuainan Yan, Liang Dong, Zamri Zainal Abidin, Xusheng Huang, Shunhuang Zhang, Honglei Jin, Tarik Abdul Latef, Noraisyah Mohamed Shah, Mohamadariff Othman, Kamarul Ariffin Noordin
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)

Abstract:In this study, we construct Dataset A for training, validation, and testing, and Dataset B to evaluate generalization. We propose a novel F10.7 index forecasting method using wavelet decomposition, which feeds F10.7 together with its decomposed approximate and detail signals into the iTransformer model. We also incorporate the International Sunspot Number (ISN) and its wavelet-decomposed signals to assess their influence on prediction performance. Our optimal method is then compared with the latest method from S. Yan et al. (2025) and three operational models (SWPC, BGS, CLS). Additionally, we transfer our method to the PatchTST model used in H. Ye et al. (2024) and compare our method with theirs on Dataset B. Key findings include: (1) The wavelet-based combination methods overall outperform the baseline using only F10.7 index. The prediction performance improves as higher-level approximate and detail signals are incrementally added. The Combination 6 method integrating F10.7 with its first to fifth level approximate and detail signals outperforms methods using only approximate or detail signals. (2) Incorporating ISN and its wavelet-decomposed signals does not enhance prediction performance. (3) The Combination 6 method significantly surpasses S. Yan et al. (2025) and three operational models, with RMSE, MAE, and MAPE reduced by 18.22%, 15.09%, and 8.57%, respectively, against the former method. It also excels across four different conditions of solar activity. (4) Our method demonstrates superior generalization and prediction capability over the method of H. Ye et al. (2024) across all forecast horizons. To our knowledge, this is the first application of wavelet decomposition in F10.7 prediction, substantially improving forecast performance.

[LG-64] DANCE: Doubly Adaptive Neighborhood Conformal Estimation

Link: https://arxiv.org/abs/2602.20652
Authors: Brandon R. Feng, Brian J. Reich, Daniel Beaglehole, Xihaier Luo, David Keetae Park, Shinjae Yoo, Zhechao Huang, Xueyu Mao, Olcay Boz, Jungeum Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The recent developments of complex deep learning models have led to unprecedented ability to accurately predict across multiple data representation types. Conformal prediction for uncertainty quantification of these models has risen in popularity, providing adaptive, statistically-valid prediction sets. For classification tasks, conformal methods have typically focused on utilizing logit scores. For pre-trained models, however, this can result in inefficient, overly conservative set sizes when not calibrated towards the target task. We propose DANCE, a doubly locally adaptive nearest-neighbor based conformal algorithm combining two novel nonconformity scores directly using the data’s embedded representation. DANCE first fits a task-adaptive kernel regression model from the embedding layer before using the learned kernel space to produce the final prediction sets for uncertainty quantification. We test against state-of-the-art local, task-adapted and zero-shot conformal baselines, demonstrating DANCE’s superior blend of set size efficiency and robustness across various datasets.

[LG-65] On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes

Link: https://arxiv.org/abs/2602.20646
Authors: Boao Kong, Hengrui Zhang, Kun Yuan
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 34 pages

Abstract:We study stochastic gradient descent (SGD) for composite optimization problems with N sequential operators subject to perturbations in both the forward and backward passes. Unlike classical analyses that treat gradient noise as additive and localized, perturbations to intermediate outputs and gradients cascade through the computational graph, compounding geometrically with the number of operators. We present the first comprehensive theoretical analysis of this setting. Specifically, we characterize how forward and backward perturbations propagate and amplify within a single gradient step, derive convergence guarantees for both general non-convex objectives and functions satisfying the Polyak–Łojasiewicz condition, and identify conditions under which perturbations do not deteriorate the asymptotic convergence order. As a byproduct, our analysis furnishes a theoretical explanation for the gradient spiking phenomenon widely observed in deep learning, precisely characterizing the conditions under which training recovers from spikes or diverges. Experiments on logistic regression with convex and non-convex regularization validate our theories, illustrating the predicted spike behavior and the asymmetric sensitivity to forward versus backward perturbations.

[LG-66] Amortized Bayesian inference for actigraph time sheet data from mobile devices

Link: https://arxiv.org/abs/2602.20611
Authors: Daniel Zhou, Sudipto Banerjee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 40 pages, 7 figures

Abstract: Mobile data technologies use “actigraphs” to furnish information on health variables as a function of a subject’s movement. The advent of wearable devices and related technologies has propelled the creation of health databases consisting of human movement data to conduct research on mobility patterns and health outcomes. Statistical methods for analyzing high-resolution actigraph data depend on the specific inferential context, but the advent of Artificial Intelligence (AI) frameworks requires that the methods be congruent with transfer learning and amortization. This article devises amortized Bayesian inference for actigraph time sheets. We pursue a Bayesian approach to ensure full propagation of uncertainty and its quantification using a hierarchical dynamic linear model. We build our analysis around actigraph data from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study conducted by the Fielding School of Public Health in the University of California, Los Angeles. Apart from achieving probabilistic imputation of actigraph time sheets, we are also able to statistically learn about the time-varying impact of explanatory variables on the magnitude of acceleration (MAG) for a cohort of subjects.

[LG-67] Characterizing Online and Private Learnability under Distributional Constraints via Generalized Smoothness

Link: https://arxiv.org/abs/2602.20585
Authors: Moïse Blanchard, Abhishek Shetty, Alexander Rakhlin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Understanding minimal assumptions that enable learning and generalization is perhaps the central question of learning theory. Several celebrated results in statistical learning theory, such as the VC theorem and Littlestone’s characterization of online learnability, establish conditions on the hypothesis class that allow for learning under independent data and adversarial data, respectively. Building upon recent work bridging these extremes, we study sequential decision making under distributional adversaries that can adaptively choose data-generating distributions from a fixed family U and ask when such problems are learnable with sample complexity that behaves like the favorable independent case. We provide a near complete characterization of families U that admit learnability in terms of a notion known as generalized smoothness i.e. a distribution family admits VC-dimension-dependent regret bounds for every finite-VC hypothesis class if and only if it is generalized smooth. Further, we give universal algorithms that achieve low regret under any generalized smooth adversary without explicit knowledge of U . Finally, when U is known, we provide refined bounds in terms of a combinatorial parameter, the fragmentation number, that captures how many disjoint regions can carry nontrivial mass under U . These results provide a nearly complete understanding of learnability under distributional adversaries. In addition, building upon the surprising connection between online learning and differential privacy, we show that the generalized smoothness also characterizes private learnability under distributional constraints.

[LG-68] Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with C^{s,\lambda} Targets

Link: https://arxiv.org/abs/2602.20555
Authors: Yanming Lai, Defeng Sun
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 58 pages, 1 figure

Abstract: The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first work proving that standard Transformers can approximate Hölder functions C^{s,\lambda}\left([0,1]^{d\times n}\right) ( s\in\mathbb{N}_{\geq 0}, 0<\lambda\leq 1 ) under the L^t distance ( t \in [1, \infty] ) with arbitrary precision. Building upon this approximation result, we demonstrate that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. It is worth mentioning that, by introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive the upper bounds for the Lipschitz constant of standard Transformers and their memorization capacity, which may be of independent interest. These findings provide theoretical justification for the powerful capabilities of Transformer models.

[LG-69] PhyGHT: Physics-Guided HyperGraph Transformer for Signal Purification at the HL-LHC

Link: https://arxiv.org/abs/2602.20475
Authors: Mohammed Rakib, Luke Vaughan, Shivang Patel, Flera Rizatdinova, Alexander Khanov, Atriya Sen
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Comments: Under Review

Abstract:The High-Luminosity Large Hadron Collider (HL-LHC) at CERN will produce unprecedented datasets capable of revealing fundamental properties of the universe. However, realizing its discovery potential faces a significant challenge: extracting small signal fractions from overwhelming backgrounds dominated by approximately 200 simultaneous pileup collisions. This extreme noise severely distorts the physical observables required for accurate reconstruction. To address this, we introduce the Physics-Guided Hypergraph Transformer (PhyGHT), a hybrid architecture that combines distance-aware local graph attention with global self-attention to mirror the physical topology of particle showers formed in proton-proton collisions. Crucially, we integrate a Pileup Suppression Gate (PSG), an interpretable, physics-constrained mechanism that explicitly learns to filter soft noise prior to hypergraph aggregation. To validate our approach, we release a novel simulated dataset of top-quark pair production to model extreme pileup conditions. PhyGHT outperforms state-of-the-art baselines from the ATLAS and CMS experiments in predicting the signal’s energy and mass correction factors. By accurately reconstructing the top quark’s invariant mass, we demonstrate how machine learning innovation and interdisciplinary collaboration can directly advance scientific discovery at the frontiers of experimental physics and enhance the HL-LHC’s discovery potential. The dataset and code are available at this https URL

[LG-70] Selecting Optimal Variable Order in Autoregressive Ising Models

Link: https://arxiv.org/abs/2602.20394
Authors: Shiba Biswal, Marc Vuffray, Andrey Y. Lokhov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Autoregressive models enable tractable sampling from learned probability distributions, but their performance critically depends on the variable ordering used in the factorization via complexities of the resulting conditional distributions. We propose to learn the Markov random field describing the underlying data, and use the inferred graphical model structure to construct optimized variable orderings. We illustrate our approach on two-dimensional image-like models where a structure-aware ordering leads to restricted conditioning sets, thereby reducing model complexity. Numerical experiments on Ising models with discrete data demonstrate that graph-informed orderings yield higher-fidelity generated samples compared to naive variable orderings.

[LG-71] Detecting and Mitigating Group Bias in Heterogeneous Treatment Effects

Link: https://arxiv.org/abs/2602.20383
Authors: Joel Persson, Jurriën Bakker, Dennis Bohle, Stefan Feuerriegel, Florian von Wangenheim
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)

Abstract:Heterogeneous treatment effects (HTEs) are increasingly estimated using machine learning models that produce highly personalized predictions of treatment effects. In practice, however, predicted treatment effects are rarely interpreted, reported, or audited at the individual level but, instead, are often aggregated to broader subgroups, such as demographic segments, risk strata, or markets. We show that such aggregation can induce systematic bias of the group-level causal effect: even when models for predicting the individual-level conditional average treatment effect (CATE) are correctly specified and trained on data from randomized experiments, aggregating the predicted CATEs up to the group level does not, in general, recover the corresponding group average treatment effect (GATE). We develop a unified statistical framework to detect and mitigate this form of group bias in randomized experiments. We first define group bias as the discrepancy between the model-implied and experimentally identified GATEs, derive an asymptotically normal estimator, and then provide a simple-to-implement statistical test. For mitigation, we propose a shrinkage-based bias-correction, and show that the theoretically optimal and empirically feasible solutions have closed-form expressions. The framework is fully general, imposes minimal assumptions, and only requires computing sample moments. We analyze the economic implications of mitigating detected group bias for profit-maximizing personalized targeting, thereby characterizing when bias correction alters targeting decisions and profits, and the trade-offs involved. Applications to large-scale experimental data at major digital platforms validate our theoretical results and demonstrate empirical performance.

[LG-72] Gap-Dependent Bounds for Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Link: https://arxiv.org/abs/2602.20297
Authors: Haochen Zhang, Zhong Zheng, Lingzhou Xue
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: We study gap-dependent performance guarantees for nearly minimax-optimal algorithms in reinforcement learning with linear function approximation. While prior works have established gap-dependent regret bounds in this setting, existing analyses do not apply to algorithms that achieve the nearly minimax-optimal worst-case regret bound \tilde{O}(d\sqrt{H^3K}) , where d is the feature dimension, H is the horizon length, and K is the number of episodes. We bridge this gap by providing the first gap-dependent regret bound for the nearly minimax-optimal algorithm LSVI-UCB++ (He et al., 2023). Our analysis yields improved dependencies on both d and H compared to previous gap-dependent results. Moreover, leveraging the low policy-switching property of LSVI-UCB++, we introduce a concurrent variant that enables efficient parallel exploration across multiple agents and establish the first gap-dependent sample complexity upper bound for online multi-agent RL with linear function approximation, achieving linear speedup with respect to the number of agents.

[LG-73] The Sim-to-Real Gap in MRS Quantification: A Systematic Deep Learning Validation for GABA

Link: https://arxiv.org/abs/2602.20289
Authors: Zien Ma, S. M. Shermer, Oktay Karakuş, Frank C. Langbein
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 37 pages, 10 figures, 12 tables

Abstract:Magnetic resonance spectroscopy (MRS) is used to quantify metabolites in vivo and estimate biomarkers for conditions ranging from neurological disorders to cancers. Quantifying low-concentration metabolites such as GABA ( \gamma -aminobutyric acid) is challenging due to low signal-to-noise ratio (SNR) and spectral overlap. We investigate and validate deep learning for quantifying complex, low-SNR, overlapping signals from MEGA-PRESS spectra, devise a convolutional neural network (CNN) and a Y-shaped autoencoder (YAE), and select the best models via Bayesian optimisation on 10,000 simulated spectra from slice-profile-aware MEGA-PRESS simulations. The selected models are trained on 100,000 simulated spectra. We validate their performance on 144 spectra from 112 experimental phantoms containing five metabolites of interest (GABA, Glu, Gln, NAA, Cr) with known ground truth concentrations across solution and gel series acquired at 3 T under varied bandwidths and implementations. These models are further assessed against the widely used LCModel quantification tool. On simulations, both models achieve near-perfect agreement (small MAEs; regression slopes \approx 1.00 , R^2 \approx 1.00 ). On experimental phantom data, errors initially increased substantially. However, modelling variable linewidths in the training data significantly reduced this gap. The best augmented deep learning models achieved a mean MAE for GABA over all phantom spectra of 0.151 (YAE) and 0.160 (FCNN) in max-normalised relative concentrations, outperforming the conventional baseline LCModel (0.220). A sim-to-real gap remains, but physics-informed data augmentation substantially reduced it. Phantom ground truth is needed to judge whether a method will perform reliably on real data.

[LG-74] Regressor-guided Diffusion Model for De Novo Peptide Sequencing with Explicit Mass Control

Link: https://arxiv.org/abs/2602.20209
Authors: Shaorong Chen, Jingbo Zhou, Jun Xia
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract: The discovery of novel proteins relies on sensitive protein identification, for which de novo peptide sequencing (DNPS) from mass spectra is a crucial approach. While deep learning has advanced DNPS, existing models inadequately enforce the fundamental mass consistency constraint: a predicted peptide’s mass must match the experimentally measured precursor mass. Previous DNPS methods often treat this critical information as a simple input feature or use it in post-processing, leading to numerous implausible predictions that do not adhere to this fundamental physical property. To address this limitation, we introduce DiffuNovo, a novel regressor-guided diffusion model for de novo peptide sequencing that provides explicit peptide-level mass control. Our approach integrates the mass constraint at two critical stages: during training, a novel peptide-level mass loss guides model optimization, while at inference, regressor-based guidance from gradient-based updates in the latent space steers generation so that the predicted peptide adheres to the mass constraint. Comprehensive evaluations on established benchmarks demonstrate that DiffuNovo surpasses state-of-the-art methods in DNPS accuracy. Additionally, as the first DNPS model to employ a diffusion model as its core backbone, DiffuNovo leverages the powerful controllability of diffusion architecture and achieves a significant reduction in mass error, thereby producing much more physically plausible peptides. These innovations represent a substantial advancement toward robust and broadly applicable DNPS. The source code is available in the supplementary material.
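
The mass consistency constraint itself is easy to state in code. The sketch below is not DiffuNovo's loss or guidance step; it only checks whether a candidate sequence matches a precursor mass, using standard monoisotopic residue masses. The 20 ppm tolerance and all function names are assumptions chosen for illustration.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one H2O is added for the peptide termini
PROTON = 1.00728   # mass of the charge carrier


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def precursor_neutral_mass(mz, charge):
    """Convert an observed precursor m/z and charge to a neutral mass."""
    return (mz - PROTON) * charge


def mass_consistent(seq, mz, charge, tol_ppm=20.0):
    """True if the peptide mass is within tol_ppm of the precursor mass.

    The 20 ppm default is an illustrative assumption, not a value from the paper.
    """
    observed = precursor_neutral_mass(mz, charge)
    error_ppm = (peptide_mass(seq) - observed) / observed * 1e6
    return abs(error_ppm) <= tol_ppm
```

For example, `mass_consistent("PEPTIDE", 400.68725, 2)` holds, while swapping the final residue for `K` shifts the mass by roughly 0.95 Da and fails the check.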

[LG-75] KEMP-PIP: A Feature-Fusion Based Approach for Pro-inflammatory Peptide Prediction

Link: https://arxiv.org/abs/2602.20198
Authors: Soumik Deb Niloy, Md. Fahmid-Ul-Alam Juboraj, Swakkhar Shatabda
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, 6 tables; includes web server and GitHub implementation

Abstract: Pro-inflammatory peptides (PIPs) play critical roles in immune signaling and inflammation but are difficult to identify experimentally due to costly and time-consuming assays. To address this challenge, we present KEMP-PIP, a hybrid machine learning framework that integrates deep protein embeddings with handcrafted descriptors for robust PIP prediction. Our approach combines contextual embeddings from pretrained ESM protein language models with multi-scale k-mer frequencies, physicochemical descriptors, and modlAMP sequence features. Feature pruning and class-weighted logistic regression manage high dimensionality and class imbalance, while ensemble averaging with an optimized decision threshold enhances the sensitivity–specificity balance. Through systematic ablation studies, we demonstrate that integrating complementary feature sets consistently improves predictive performance. On the standard benchmark dataset, KEMP-PIP achieves an MCC of 0.505, accuracy of 0.752, and AUC of 0.762, outperforming ProIn-fuse, MultiFeatVotPIP, and StackPIP. Relative to StackPIP, these results represent improvements of 9.5% in MCC and 4.8% in both accuracy and AUC. The KEMP-PIP web server is freely available at this https URL and the full implementation at this https URL.
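
Of the feature families listed above, multi-scale k-mer frequencies are the simplest to sketch. The code below is an illustrative guess at such a featuriser, not the KEMP-PIP implementation: the choice of k = 1 and k = 2, the fixed 20-letter alphabet, and the function names are all assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer, in a fixed alphabetical order."""
    if len(seq) < k:
        raise ValueError("sequence shorter than k")
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    # k-mers containing non-standard letters are counted in `total`
    # but have no slot in the output vector.
    return [counts["".join(p)] / total for p in product(AMINO_ACIDS, repeat=k)]


def multi_scale_features(seq, ks=(1, 2)):
    """Concatenate k-mer frequency vectors for several values of k."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features
```

With `ks=(1, 2)` every peptide maps to a vector of 20 + 400 = 420 entries, which hints at why the abstract mentions feature pruning before the classifier.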

[LG-76] OrgFlow: Generative Modeling of Organic Crystal Structures from Molecular Graphs

Link: https://arxiv.org/abs/2602.20195
Authors: Mohammadmahdi Vahediahmar, Matthew A. McDonald, Feng Liu
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Abstract: Crystal structure prediction is a long-standing challenge in materials science, with most data-driven methods developed for inorganic systems. This leaves an important gap for organic crystals, which are central to pharmaceuticals, polymers, and functional materials, but present unique challenges, such as larger unit cells and strict chemical connectivity. We introduce a flow-matching model for predicting organic crystal structures directly from molecular graphs. The architecture integrates molecular connectivity with periodic boundary conditions while preserving the symmetries of crystalline systems. A bond-aware loss guides the model toward realistic local chemistry by enforcing distributions of bond lengths and connectivity. To support reliable and efficient training, we built a curated dataset of organic crystals, along with a preprocessing pipeline that precomputes bonds and edges, substantially reducing computational overhead during both training and inference. Experiments show that our method achieves a Match Rate more than 10 times higher than existing baselines while requiring fewer sampling steps for inference. These results establish generative modeling as a practical and scalable framework for organic crystal structure prediction.
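
Any bond-length term in a crystal has to measure distances under periodic boundary conditions. As a minimal sketch of that idea, and not of OrgFlow's actual loss, the code below applies the minimum-image convention to fractional coordinates. It assumes an orthorhombic cell (three perpendicular edge lengths); general cells need the full lattice matrix. All names are hypothetical.

```python
import math


def wrap_delta(d):
    """Map a fractional-coordinate difference to its nearest periodic image in [-0.5, 0.5)."""
    return d - math.floor(d + 0.5)


def min_image_distance(frac_a, frac_b, cell_lengths):
    """Distance between two atoms in an orthorhombic cell under periodic boundaries."""
    deltas = [
        wrap_delta(b - a) * length
        for a, b, length in zip(frac_a, frac_b, cell_lengths)
    ]
    return math.sqrt(sum(d * d for d in deltas))


# Two atoms near opposite faces of a 10 x 10 x 10 cell are close
# neighbours once periodicity is taken into account.
d = min_image_distance((0.05, 0.0, 0.0), (0.95, 0.0, 0.0), (10.0, 10.0, 10.0))
```

Here `d` is about 1.0 rather than the 9.0 a naive Euclidean distance would give, which is the case a bond-aware loss must handle for bonds that cross a cell boundary.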

[LG-77] Data-Driven Deep MIMO Detection: Network Architectures and Generalization Analysis

Link: https://arxiv.org/abs/2602.20178
Authors: Yongwei Yi, Xinping Yi, Wenjin Wang, Xiao Li, Shi Jin
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures. Full version of a work prepared for submission to IEEE

Abstract: In practical Multiuser Multiple-Input Multiple-Output (MU-MIMO) systems, symbol detection remains challenging due to severe inter-user interference and sensitivity to Channel State Information (CSI) uncertainty. In contrast to the widely studied belief propagation-type model-driven methods, which incur high computational complexity, Soft Interference Cancellation (SIC) strikes a good balance between performance and complexity. To further address CSI mismatch and nonlinear effects, the recently proposed data-driven deep neural receivers, such as DeepSIC, leverage the advantages of deep neural networks for interference cancellation and symbol detection, demonstrating strong empirical performance. However, there is still a lack of theoretical underpinning for why and to what extent DeepSIC could generalize with the number of training samples. This paper proposes inspecting the fully data-driven DeepSIC detection within a Network-of-MLPs architecture, which is composed of multiple MLPs interconnected via outer and inner Directed Acyclic Graphs (DAGs). Within such an architecture, DeepSIC can be upgraded to a graph-based message-passing process using Graph Neural Networks (GNNs), termed GNNSIC, with shared model parameters across users and iterations. Notably, GNNSIC achieves excellent expressivity comparable to DeepSIC with substantially fewer trainable parameters, resulting in improved sample efficiency and enhanced user generalization. By conducting a norm-based generalization analysis using Rademacher complexity, we reveal that an exponential dependence on the number of iterations for DeepSIC can be eliminated in GNNSIC due to parameter sharing. Simulation results demonstrate that GNNSIC attains Symbol Error Rate (SER) performance comparable to or better than that of DeepSIC with significantly fewer parameters and training samples.

[LG-78] Cross-Chirality Generalization by Axial Vectors for Hetero-Chiral Protein-Peptide Interaction Design

Link: https://arxiv.org/abs/2602.20176
Authors: Ziyi Yang, Zitong Tian, Yinjun Jia, Tianyi Zhang, Jiqing Zheng, Hao Wang, Yubu Su, Juncai He, Lei Liu, Yanyan Lan
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures

Abstract: D-peptide binders targeting L-proteins have promising therapeutic potential. Despite rapid advances in machine learning-based target-conditioned peptide design, generating D-peptide binders remains largely unexplored. In this work, we show that by injecting axial features into E(3)-equivariant (polar) vector features, it is feasible to achieve cross-chirality generalization from homo-chiral (L–L) training data to hetero-chiral (D–L) design tasks. By implementing this method within a latent diffusion model, we achieved D-peptide binder design that not only outperforms existing tools in in silico benchmarks, but also demonstrates efficacy in wet-lab validation. To our knowledge, our approach represents the first wet-lab validated generative AI for the de novo design of D-peptide binders, offering new perspectives on handling chirality in protein design.
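
The distinction between polar and axial vectors can be checked numerically. The abstract does not say how the axial features are built; the sketch below uses the cross product, one standard source of axial vectors, purely as an illustration, with hypothetical names and values. Under point inversion, which swaps chirality, polar vectors change sign while their cross product does not.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def invert(v):
    """Point inversion x -> -x, a parity operation that swaps handedness."""
    return tuple(-x for x in v)


# Two polar vectors, e.g. displacement vectors between atoms (made-up values).
a = (1, 2, 3)
b = (0, 1, -1)

axial_before = cross(a, b)                 # axial vector built from polar ones
polar_after = invert(a)                    # polar vector flips sign
axial_after = cross(invert(a), invert(b))  # axial vector is unchanged
```

Because `axial_after` equals `axial_before` while `polar_after` is the negation of `a`, features of the two kinds respond differently to a mirror-image input, which is the property a chirality-aware model can exploit.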
