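This post lists the latest papers retrieved from arXiv.org on 2026-02-24. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: Paper data is fetched from arXiv.org daily, with a scheduled automatic update every morning at around 12:30.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day whenever possible.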

Contents

Overview (2026-02-24)

550 papers were updated today, including:

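  • Natural Language Processing: 55 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 161 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 141 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 149 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 11 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 19 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 37 papers (Human-Computer Interaction (cs.HC))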

Multiagent Systems

[MA-0] Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

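[Quick Read]: This paper addresses the cross-agent noise problem that limits the scaling of cooperative multi-agent reinforcement learning (MARL). When all agents share a single reward signal, each agent's learning signal is jointly determined by the actions of all $N$ agents, so the variance of the per-agent gradient estimate grows as $\Theta(N)$ and the sample complexity becomes $\mathcal{O}(N/\epsilon)$, which severely limits scalability. The key to the solution is a framework called **Descent-Guided Policy Gradient (DG-PG)**, which uses known differentiable analytical models of the system to construct noise-free per-agent guidance gradients, decoupling each agent's gradient update from the actions of the other agents. The paper proves that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves an agent-independent sample complexity of $\mathcal{O}(1/\epsilon)$. Experiments on a heterogeneous cloud scheduling task with up to 200 agents show that it still converges quickly.

Link: https://arxiv.org/abs/2602.20078
Authors: Shan Yang, Yang Liu
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 5 tables; plus 16 pages of appendices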

Abstract:Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent’s learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\epsilon)$. We observe that many domains (cloud computing, transportation, power systems) have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent’s gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\epsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale, from $N=5$ to $N=200$, directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
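The $\Theta(N)$ variance growth that this entry describes can be reproduced in a toy setting. The sketch below is neither DG-PG nor the paper's experiment. It assumes $N$ independent Bernoulli policies, a shared reward equal to the number of agents choosing action 1, and a mean-reward baseline; all three are illustrative choices, and the function name is hypothetical. It computes the exact variance of one agent's score-function gradient estimate, which in this setup works out to $(N-1)/16$ at $p = 0.5$.

```python
from math import comb


def shared_reward_gradient_variance(n_agents, p=0.5):
    """Exact variance of one agent's score-function gradient under a shared reward.

    Toy assumptions (not from the paper): every agent picks action 1 with
    probability p, the team reward is the number of agents that picked 1,
    and the expected team reward is subtracted as a baseline.
    """
    baseline = n_agents * p
    first_moment = 0.0
    second_moment = 0.0
    for own_action in (0, 1):
        prob_own = p if own_action == 1 else 1.0 - p
        # Derivative of log pi(own_action) w.r.t. the logit of a Bernoulli policy.
        score = own_action - p
        # others = how many of the remaining N-1 agents picked action 1.
        for others in range(n_agents):
            prob_others = (
                comb(n_agents - 1, others)
                * p**others
                * (1.0 - p) ** (n_agents - 1 - others)
            )
            estimate = score * (own_action + others - baseline)
            weight = prob_own * prob_others
            first_moment += weight * estimate
            second_moment += weight * estimate * estimate
    return second_moment - first_moment**2


if __name__ == "__main__":
    for n in (5, 50, 200):
        print(n, shared_reward_gradient_variance(n))
```

The variance comes entirely from the other agents' actions leaking into the shared reward, which is the cross-agent noise the abstract refers to. An estimator that depends only on the agent's own action would have constant variance in this toy model.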

[MA-1] Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming

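[Quick Read]: This paper addresses the potential safety risks of large language models (LLMs) used for mental health support, in particular the difficulty existing safety benchmarks have in detecting the complex, longitudinal risks that arise in therapeutic dialogue. The key to the solution is an evaluation framework built on simulated patient agents equipped with dynamic cognitive-affective models. Through a large-scale simulation of therapy sessions (N=369), the framework systematically evaluates six AI psychotherapy agents (including ChatGPT and Gemini) against a clinically validated cohort of 15 patient personas covering diverse clinical phenotypes. It assesses sessions against a quality-of-care and risk ontology and identifies specific iatrogenic risks, such as "AI Psychosis" (validating patient delusions) and failure to de-escalate suicide risk. Finally, an interactive data visualization dashboard lets diverse stakeholders (AI engineers, red teamers, mental health professionals, and policy experts) audit the "black box" of AI psychotherapy, supporting the case for simulation-based clinical red teaming as a necessary step before deployment.

Link: https://arxiv.org/abs/2602.19948
Authors: Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: This paper is a condensed version of the first author’s Ph.D. dissertation submitted to Northeastern University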

Abstract:Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and this http URL) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions (“AI Psychosis”) and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the “black box” of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.

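[MA-2] Effects of Property Recovery Incentives and Social Interaction on Self-Evacuation Decisions in Natural Disasters: An Agent-Based Modelling Approach

[Quick Read]: This paper addresses how government incentive policies can effectively promote collective evacuation decisions among community residents when resources are limited. The core challenge is understanding how communication between household agents and the structure of the social network shape individual evacuation behaviour, and using that understanding to improve how the government allocates resources. The key to the solution is an agent-based model built on evolutionary game theory that simulates household agents' decisions to evacuate or stay while they compete for limited shared resources (property recovery funds and coordination services). The study finds that government incentives have an optimal level, beyond which further increases in support become impractical. The evacuation rate also depends strongly on the structure of the social network: prioritising "community influencers" (that is, highly connected nodes) significantly raises the overall evacuation rate, whereas prioritising household agents with low connectivity may actually impede collective evacuation. These results highlight the role of social connectivity in the design of evacuation policy.

Link: https://arxiv.org/abs/2602.19639
Authors: Made Krisnanda, Raymond Chiong, Yang Yang, Kirill Glavatskiy
Affiliation: The University of Newcastle; The University of New England
Categories: Multiagent Systems (cs.MA)
Comments: 21 pages, 9 figures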

Abstract:Understanding evacuation decision-making behaviour is one of the key components for designing disaster mitigation policies. This study investigates how communications between household agents in a community influence self-evacuation decisions. We develop an agent-based model that simulates household agents’ decisions to evacuate or stay. These agents interact within the framework of evolutionary game theory, effectively competing for limited shared resources, which include property recovery funds and coordination services. We explore four scenarios that model different prioritisations of access to government-provided incentives. We discover that the impact of the incentive diminishes both with increasing funding value and the household agent prioritisation, indicating that there is an optimal level of government support beyond which further increases become impractical. Furthermore, the overall evacuation rate depends on the structure of the underlying social network, showing discontinuous jumps when the prioritisation moves across the node degree. We identify the so-called “community influencers”, prioritisation of whom significantly increases the overall evacuation rate. In contrast, prioritising household agents with low connectivity may actually impede collective evacuation. These findings demonstrate the importance of social connectivity between household agents. The results of this study are useful for designing optimal government policies to incentivise and prioritise community evacuation under limited resources.

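[MA-3] Hilbert-Augmented Reinforcement Learning for Scalable Multi-Robot Coverage and Exploration

[Quick Read]: This paper addresses inefficient exploration, high redundancy, and slow convergence in multi-robot systems operating in sparse-reward environments, particularly in grid coverage tasks. The key to the solution is to introduce the Hilbert space-filling curve as a geometric prior in a decentralized multi-robot learning and execution framework. DQN and PPO are augmented with Hilbert-based spatial indices that structure exploration, which reduces repeated visits to the same regions. The paper also describes a waypoint interface that converts Hilbert orderings into curvature-bounded, time-parameterized SE(2) trajectories (planar $(x, y, \theta)$), so that the plans remain feasible onboard resource-constrained robots. Experiments show improvements in coverage efficiency, redundancy, and convergence speed over DQN/PPO baselines, and trials on a Boston Dynamics Spot legged robot confirm reliable execution in indoor environments.

Link: https://arxiv.org/abs/2602.19400
Authors: Tamil Selvan Gurunathan, Aryya Gangopadhyay
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: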

Abstract:We present a coverage framework that integrates Hilbert space-filling priors into decentralized multi-robot learning and execution. We augment DQN and PPO with Hilbert-based spatial indices to structure exploration and reduce redundancy in sparse-reward environments, and we evaluate scalability in multi-robot grid coverage. We further describe a waypoint interface that converts Hilbert orderings into curvature-bounded, time-parameterized SE(2) trajectories (planar $(x, y, \theta)$), enabling onboard feasibility on resource-constrained robots. Experiments show improvements in coverage efficiency, redundancy, and convergence speed over DQN/PPO baselines. In addition, we validate the approach on a Boston Dynamics Spot legged robot, executing the generated trajectories in indoor environments and observing reliable coverage with low redundancy. These results indicate that geometric priors improve autonomy and scalability for swarm and legged robotics.
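To make the "Hilbert ordering" idea concrete, here is a minimal sketch of the standard index-to-cell mapping for a Hilbert curve on a square grid. It is the textbook construction, not code from the paper, and how the authors feed such indices into DQN or PPO is not specified in this entry. The function names `hilbert_cell` and `hilbert_order` are hypothetical.

```python
def hilbert_cell(side, index):
    """Map a 1-D Hilbert index to an (x, y) cell on a side-by-side grid.

    side must be a power of two and 0 <= index < side * side.
    """
    x = y = 0
    remaining = index
    step = 1
    while step < side:
        rx = 1 & (remaining // 2)
        ry = 1 & (remaining ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect within the current sub-square.
                x = step - 1 - x
                y = step - 1 - y
            # Transpose to rotate the sub-curve.
            x, y = y, x
        x += step * rx
        y += step * ry
        remaining //= 4
        step *= 2
    return x, y


def hilbert_order(side):
    """Return every grid cell in Hilbert-curve visiting order."""
    if side < 1 or side & (side - 1):
        raise ValueError("side must be a power of two")
    return [hilbert_cell(side, d) for d in range(side * side)]


if __name__ == "__main__":
    print(hilbert_order(4))
```

Consecutive cells in this ordering are always grid neighbours, which is why a space-filling curve is a natural prior for coverage: following it visits every cell once without jumps or revisits.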

[MA-4] Self-Configurable Mesh-Networks for Scalable Distributed Submodular Bandit Optimization

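[Quick Read]: This paper addresses how to scale distributed bandit submodular coordination under realistic communication constraints on bandwidth, data rate, and connectivity. The core challenge is that multi-agent systems operating in unknown, partially observable, and resource-limited environments must coordinate through agent-to-agent communication. The approach achieves scalability by (i) limiting information relays to one-hop communication and (ii) keeping messages small, with each agent transmitting only its own action information. Despite these restrictions, it achieves near-optimal action coordination by optimizing the agents' communication neighborhoods over time through distributed online bandit optimization, subject to the agents' bandwidth constraints. Using the "Value of Coordination" (VoC), an information-theoretic metric defined in the paper, the authors prove an anytime suboptimality bound that is strictly positive for arbitrary network topologies, even disconnected ones. In simulations the approach converges faster, outperforms benchmarks for bandit submodular coordination, and can even outperform benchmarks that have a priori knowledge of the environment.

Link: https://arxiv.org/abs/2602.19366
Authors: Zirui Xu, Vasileios Tzoumas
Affiliation: University of Michigan
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)
Comments: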

Abstract:We study how to scale distributed bandit submodular coordination under realistic communication constraints in bandwidth, data rate, and connectivity. We are motivated by multi-agent tasks of active situational awareness in unknown, partially-observable, and resource-limited environments, where the agents must coordinate through agent-to-agent communication. Our approach enables scalability by (i) limiting information relays to only one-hop communication and (ii) keeping inter-agent messages small, having each agent transmit only its own action information. Despite these information-access restrictions, our approach enables near-optimal action coordination by optimizing the agents’ communication neighborhoods over time, through distributed online bandit optimization, subject to the agents’ bandwidth constraints. Particularly, our approach enjoys an anytime suboptimality bound that is also strictly positive for arbitrary network topologies, even disconnected. To prove the bound, we define the Value of Coordination (VoC), an information-theoretic metric that quantifies for each agent the benefit of information access to its neighbors. We validate in simulations the scalability and near-optimality of our approach: it is observed to converge faster, outperform benchmarks for bandit submodular coordination, and can even outperform benchmarks that are privileged with a priori knowledge of the environment.

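[MA-5] City Editing: Hierarchical Agentic Execution for Dependency-Aware Urban Geospatial Modification

[Quick Read]: This paper addresses the inefficiency of urban renewal workflows in which problems such as traffic congestion and functional imbalance require planners to redraw geospatial layouts by hand, slowing iterative planning and decision-making. The key to the solution is to formulate urban renewal as a machine-executable, multi-step geometric editing task. Urban layouts are represented in GeoJSON, and natural-language editing instructions are decomposed into hierarchical geometric intents spanning polygon-, line-, and point-level operations. A hierarchical agentic framework then jointly plans and executes edits across spatial elements and abstraction levels, explicitly propagating intermediate spatial constraints. Finally, an iterative execution-validation mechanism mitigates error accumulation and enforces global spatial consistency during multi-step editing. The paper reports significant improvements in efficiency, robustness, correctness, and spatial validity over existing baselines.

Link: https://arxiv.org/abs/2602.19326
Authors: Rui Liu, Steven Jige Quan, Zhong-Ren Peng, Zijun Yao, Han Wang, Zhengzhang Chen, Kunpeng Liu, Yanjie Fu, Dongjie Wang
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: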

Abstract:As cities evolve over time, challenges such as traffic congestion and functional imbalance increasingly necessitate urban renewal through efficient modification of existing plans, rather than complete re-planning. In practice, even minor urban changes require substantial manual effort to redraw geospatial layouts, slowing the iterative planning and decision-making procedure. Motivated by recent advances in agentic systems and multimodal reasoning, we formulate urban renewal as a machine-executable task that iteratively modifies existing urban plans represented in structured geospatial formats. More specifically, we represent urban layouts using GeoJSON and decompose natural-language editing instructions into hierarchical geometric intents spanning polygon-, line-, and point-level operations. To coordinate interdependent edits across spatial elements and abstraction levels, we propose a hierarchical agentic framework that jointly performs multi-level planning and execution with explicit propagation of intermediate spatial constraints. We further introduce an iterative execution-validation mechanism that mitigates error accumulation and enforces global spatial consistency during multi-step editing. Extensive experiments across diverse urban editing scenarios demonstrate significant improvements in efficiency, robustness, correctness, and spatial validity over existing baselines.

[MA-6] Scaling Inference-Time Computation via Opponent Simulation: Enabling Online Strategic Adaptation in Repeated Negotiation

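[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in adapting online to an opponent's behaviour when they interact strategically with unknown or dynamic opponents in repeated games. Approaches based on offline pre-training or fine-tuning are robust against worst-case opponents but do not fully exploit an LLM's ability to adjust its strategy online during the interaction. The key to the solution is to embed a classical game-theoretic learning dynamic, smooth Fictitious Play (sFP), into LLM inference. First, for belief formation, an auxiliary opponent model learns in context to imitate the opponent's time-averaged behaviour. Second, for best response, an improved best-of-$N$ (BoN) sampling procedure simulates candidate responses against the opponent model. The method yields significant performance gains without any parameter updates, offering a scalable and principled mechanism for online adaptation in repeated strategic decision-making.

Link: https://arxiv.org/abs/2602.19309
Authors: Xiangyu Liu, Di Wang, Zhe Feng, Aranyak Mehta
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: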

Abstract:While large language models (LLMs) have emerged as powerful decision-makers across a wide range of single-agent and stationary environments, fewer efforts have been devoted to settings where LLMs must engage in repeated and strategic interactions with unknown or dynamic opponents. In such settings, recipes built upon offline pre-training or fine-tuning, though robust against worst-case adversaries, do not fully exploit the capability of LLMs to adapt online based on interaction feedback. Instead, we explore the more natural perspective of scaling inference-time computation as a mechanism for adaptation, embedding the principles of a classical game-theoretical learning dynamic, smooth Fictitious Play (sFP), into LLM inference: (i) for belief formation, we employ an auxiliary opponent model that in-context learns to imitate the time-averaged behavior of the opponent; (ii) for best response, we advance best-of-$N$ (BoN) sampling by simulating against the opponent model. Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that our method enables significant performance improvement over repeated online interaction compared to various baselines, offering a scalable and principled approach to repeated strategic decision-making without any parameter updates.
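For readers unfamiliar with smooth Fictitious Play, the two steps named in this entry (belief formation from time-averaged opponent behaviour, then a smoothed best response) can be sketched on a small matrix game. This is the classical tabular version, not the paper's LLM procedure, which replaces the belief with an in-context opponent model and the response with best-of-$N$ sampling. The function names, the uniform prior, and the temperature value are assumptions made for illustration.

```python
import math


def empirical_belief(opponent_history, n_actions):
    """Time-averaged opponent behaviour, with a uniform prior of one pseudo-count."""
    counts = [1.0] * n_actions
    for action in opponent_history:
        counts[action] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def smooth_best_response(payoff, belief, temperature=0.1):
    """Softmax over expected payoffs; payoff[i][j] is our payoff for action i vs. j."""
    expected = [sum(u * b for u, b in zip(row, belief)) for row in payoff]
    top = max(expected)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((u - top) / temperature) for u in expected]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    # Rock-paper-scissors payoffs for the row player (0=rock, 1=paper, 2=scissors).
    rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
    belief = empirical_belief([0] * 10, n_actions=3)  # opponent has only played rock
    print(smooth_best_response(rps, belief))
```

The temperature controls how sharply the response concentrates on the best action. A low value approaches the exact best response, while a higher value keeps some exploration.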

[MA-7] Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation

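[Quick Read]: This paper addresses the collision risk and cooperation failures that arise in path-level cooperation when decentralized agents cannot confidently predict one another's intentions, and asks how language communication can drive safe and interpretable path plan updates in multi-robot and human-robot settings. The key to the solution is a multimodal path planning method called CaPE (Code as Path Editor). Its core idea is to use a vision-language model (VLM) to synthesize a path editing program that is verified by a model-based planner, so that language communication is grounded in path plan updates in a safe and interpretable way. This combination allows open-ended cooperation while maintaining safety and interpretability.

Link: https://arxiv.org/abs/2602.19304
Authors: Haojun Shi, Suyu Ye, Katherine M. Guerrerio, Jianzhi Shen, Yifan Yin, Daniel Khashabi, Chien-Ming Huang, Tianmin Shu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: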

Abstract:Successful cooperation among decentralized agents requires each agent to quickly adapt its plan to the behavior of other agents. In scenarios where agents cannot confidently predict one another’s intentions and plans, language communication can be crucial for ensuring safety. In this work, we focus on path-level cooperation in which agents must adapt their paths to one another in order to avoid collisions or perform physical collaboration such as joint carrying. In particular, we propose a safe and interpretable multimodal path planning method, CaPE (Code as Path Editor), which generates and updates path plans for an agent based on the environment and language communication from other agents. CaPE leverages a vision-language model (VLM) to synthesize a path editing program verified by a model-based planner, grounding communication to path plan updates in a safe and interpretable way. We evaluate our approach in diverse simulated and real-world scenarios, including multi-robot and human-robot cooperation in autonomous driving, household, and joint carrying tasks. Experimental results demonstrate that CaPE can be integrated into different robotic systems as a plug-and-play module, greatly enhancing a robot’s ability to align its plan to language communication from other robots or humans. We also show that the combination of the VLM-based path editing program synthesis and model-based planning safety enables robots to achieve open-ended cooperation while maintaining safety and interpretability.

[MA-8] Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

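[Quick Read]: This paper addresses the scalability and coordination concerns of applying multi-agent reinforcement learning (MARL) to urban energy system optimization, along with the lack of comprehensive and reliable evaluation of such algorithms. The key to the solution is a comprehensive benchmark built on the CityLearn simulation environment. It compares MARL baselines (such as PPO and SAC) and training schemes, including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE), across multiple key performance indicators (KPIs). These include novel KPIs aimed at real-world implementation challenges, such as individual building contribution and battery storage lifetime. The study finds that DTDE outperforms CTDE in both average and worst-case performance, and that temporal dependency learning improves control on memory-dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation.

Link: https://arxiv.org/abs/2602.19223
Authors: Aymen Khouja, Imen Jendoubi, Oumayma Mahjoub, Oussama Mahfoudhi, Claude Formanek, Siddarth Singh, Ruan De Kock
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: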

Abstract:The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.

[MA-9] Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls

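[Quick Read]: This paper addresses the error-prone nature of tool calls made by large language model (LLM) agents. Tool calls are derived solely from the LLM's intrinsic capabilities, and refining them against real tools is expensive and can lead to unsafe results. To address this, the authors propose Gecko, a simulation environment that simulates tool responses using a combination of rules and LLMs. Gecko provides three types of feedback: it checks the validity of tool calls (input arguments and tool names), synthesizes reasonable responses that adhere to the output schema, and assesses whether all task objectives have been achieved. This feedback lets LLMs refine their tool calls at inference time, forming a simple yet effective test-time scaling method named GATS, which consistently improves the tool-calling performance of various LLMs on BFCLv3 and $\tau^2$-bench.

Link: https://arxiv.org/abs/2602.19218
Authors: Zeyu Zhang, Guohao Li, Zhenchang Xing, Alexandros Apostolopoulos, Yu Lin Lee, Liang Zheng
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Multiagent Systems (cs.MA)
Comments: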

Abstract:The ability to use tools is fundamental for large language model (LLM) agents. Given a task, existing systems use LLMs to plan and generate tool calls, which are executed by real-world tools to complete the task. However, tool calls are prone to errors because they are derived merely from LLM intrinsic capabilities. What is more, while it is useful to let LLMs iteratively refine the tool-call sequence using execution results from real tools, this process can be expensive and lead to unsafe results. To improve LLM tool calls and address issues caused by using real tools for refinement, we introduce Gecko, a comprehensive environment that simulates tool responses using a combination of rules and LLMs. Specifically, Gecko checks the validity of tool calls including input arguments and tool names, synthesizes reasonable responses that adhere to the output schema, and assesses whether all task objectives have been achieved. These three types of feedback provided by Gecko allow LLMs to refine their tool calls, forming a simple yet effective test-time scaling method named GATS. On BFCLv3 and $\tau^2$-bench, GATS consistently improves the tool calling performance of various LLMs including GPT-4o, GPT-5, and Gemini-3.0-pro. We further discuss working mechanisms of our method and share future possibilities.
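The first kind of feedback described here, checking tool names and input arguments, is easy to picture as a rule-based validator. The sketch below is a hypothetical illustration of that idea only. The schema layout, the `get_weather` tool, and the function name are all invented for this example and are not Gecko's actual interface.

```python
def validate_tool_call(call, tool_schemas):
    """Return a list of problems with a tool call; an empty list means it is valid."""
    name = call.get("name")
    if name not in tool_schemas:
        return [f"unknown tool: {name!r}"]
    params = tool_schemas[name]["parameters"]
    args = call.get("arguments", {})
    errors = []
    for param, spec in params.items():
        if param not in args:
            if spec.get("required", False):
                errors.append(f"missing required argument: {param}")
        elif not isinstance(args[param], spec["type"]):
            errors.append(f"argument {param} should be {spec['type'].__name__}")
    # Arguments the schema does not declare are reported as well.
    errors.extend(f"unexpected argument: {a}" for a in args if a not in params)
    return errors


# Hypothetical schema, for illustration only.
TOOLS = {
    "get_weather": {
        "parameters": {
            "city": {"type": str, "required": True},
            "days": {"type": int},
        }
    }
}

if __name__ == "__main__":
    print(validate_tool_call({"name": "get_weather", "arguments": {"days": "3"}}, TOOLS))
```

Returning readable error strings rather than a boolean matters for this use case, because the messages can be fed back to the model as the signal it uses to repair the call.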

[MA-10] Exact Algorithms for Resource Reallocation Under Budgetary Constraints

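[Quick Read]: This paper addresses efficient resource (re-)allocation in multi-party supply networks. Specifically, a service provider under budgetary constraints must minimize the number of client reallocations needed to reduce, by a specified amount, the number of servers it has to maintain. The paper introduces and systematically studies this problem as the Red-Blue Reinforcement (R-BR) problem and provides three exact fixed-parameter tractable (FPT) algorithms that scale well as the input grows. The algorithms are efficient on topologies with bounded distance to cluster (which model rural road networks), bounded modular-width (which model modern transportation systems), or bounded clique-width.

Link: https://arxiv.org/abs/2602.18438
Authors: Arun Kumar Das, Sandip Das, Sweta Das, Foivos Fioravantes, Nikolaos Melissinos
Affiliation: Unknown
Categories: Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA)
Comments: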

Abstract:Efficient resource (re-)allocation is a critical challenge in optimizing productivity and sustainability within multi-party supply networks. In this work, we introduce the Red-Blue Reinforcement (R-BR) problem, where a service provider under budgetary constraints must minimize client reallocations to reduce the required number of servers they should maintain by a specified amount. We conduct a systematic algorithmic study, providing three exact algorithms that scale well as the input grows (FPT), which could prove useful in practice. Our algorithms are efficient for topologies that model rural road networks (bounded distance to cluster), modern transportation systems (bounded modular-width), or have bounded clique-width, a parameter that is of great theoretical importance.

Natural Language Processing

[NLP-0] AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization

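[Quick Read]: This paper addresses the computational waste in current LLM-based automated program generation systems that run evolutionary search under static schedules. These systems embed large language models (LLMs) as semantic mutation operators inside evolutionary loops, but because they cannot adapt to the non-stationary dynamics of the search, resources are allocated indiscriminately to stagnating populations while promising frontiers remain under-exploited. The key to the solution is AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem and uses an "accumulated improvement signal" to unify decisions at three levels. Local Adaptation dynamically modulates exploration intensity within a population of solution candidates. Global Adaptation routes the global resource budget across populations via bandit-based scheduling. Meta-Guidance generates novel solution tactics from previously generated solutions and their improvements when progress stalls. AdaEvolve consistently outperforms open-source baselines across 185 open-ended optimization problems.

Link: https://arxiv.org/abs/2602.20133
Authors: Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, Ion Stoica
Affiliation: University of California, Berkeley; Bespoke Labs
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: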

Abstract:The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promising frontiers remain under-exploited. We introduce AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. AdaEvolve uses an “accumulated improvement signal” to unify decisions across three levels: Local Adaptation, which dynamically modulates the exploration intensity within a population of solution candidates; Global Adaptation, which routes the global resource budget via bandit-based scheduling across different solution candidate populations; and Meta-Guidance which generates novel solution tactics based on the previously generated solutions and their corresponding improvements when the progress stalls. We demonstrate that AdaEvolve consistently outperforms the open-sourced baselines across 185 different open-ended optimization problems including combinatorial, systems optimization and algorithm design problems.
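The "bandit-based scheduling" used for Global Adaptation can be pictured with a standard UCB1 scheduler that treats each candidate population as an arm and the improvement it produces as the reward. This is a generic stand-in, assuming UCB1 and a scalar improvement signal. The abstract does not say which bandit rule AdaEvolve uses, and the class and parameter names are hypothetical.

```python
import math


class UCBScheduler:
    """UCB1 over populations: spend the next unit of budget where it looks most useful."""

    def __init__(self, n_populations, exploration=1.0):
        self.counts = [0] * n_populations
        self.totals = [0.0] * n_populations
        self.exploration = exploration

    def select(self):
        # Try every population once before trusting the averages.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        rounds = sum(self.counts)

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(rounds) / self.counts[arm])
            return mean + self.exploration * bonus

        return max(range(len(self.counts)), key=score)

    def update(self, arm, improvement):
        """Record the improvement observed after giving one step to this population."""
        self.counts[arm] += 1
        self.totals[arm] += improvement


if __name__ == "__main__":
    scheduler = UCBScheduler(n_populations=3)
    for _ in range(200):
        arm = scheduler.select()
        # Toy environment: only population 1 keeps improving.
        scheduler.update(arm, 1.0 if arm == 1 else 0.0)
    print(scheduler.counts)
```

The exploration bonus shrinks as a population is sampled more often, so stagnating populations are revisited only occasionally while most of the budget goes to the ones still improving.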

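[NLP-1] To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

[Quick Read]: This paper addresses the inefficiency of large language model (LLM) reasoning in medical question answering (MedQA), where models run a redundant chain-of-thought (CoT) process even on questions that need no complex reasoning, wasting computation. The key to the solution is a Selective Chain-of-Thought (Selective CoT) strategy: at inference time the model first predicts whether a question requires reasoning and generates a rationale only when it does, dynamically balancing reasoning depth against computational cost. Experiments on several biomedical QA benchmarks show that the method reduces inference time by 13-45% and token usage by 8-47% while keeping accuracy loss within 4%, and in some model-task pairs it achieves both higher accuracy and greater efficiency than standard CoT.

Link: https://arxiv.org/abs/2602.20130
Authors: Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen, Yu Hou, Yifan Wu, Yang Ruan, Rui Zhang
Affiliation: University of Minnesota
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: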

Abstract:Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss ($\leq 4\%$). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
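The control flow of Selective CoT, as this entry describes it, is a gate followed by one of two answering paths. The sketch below shows only that routing logic. The three callables are placeholders the caller must supply (for example, wrappers around an LLM), and the toy stand-ins in the usage section are invented for illustration. How the paper actually predicts whether reasoning is needed is not covered here.

```python
def selective_cot(question, needs_reasoning, answer_direct, answer_with_rationale):
    """Answer directly unless the gate says the question needs explicit reasoning.

    needs_reasoning(question) -> bool
    answer_direct(question) -> str
    answer_with_rationale(question) -> (rationale, answer)
    All three callables are supplied by the caller; none is a real library API.
    """
    if needs_reasoning(question):
        rationale, answer = answer_with_rationale(question)
        return {"answer": answer, "rationale": rationale}
    return {"answer": answer_direct(question), "rationale": None}


if __name__ == "__main__":
    # Toy stand-ins: a keyword gate and canned answers, for demonstration only.
    def toy_gate(question):
        return "why" in question.lower() or "most likely" in question.lower()

    def toy_direct(question):
        return "A"

    def toy_cot(question):
        return ("step-by-step rationale goes here", "B")

    print(selective_cot("Which vitamin is fat-soluble?", toy_gate, toy_direct, toy_cot))
    print(selective_cot("What is the most likely diagnosis?", toy_gate, toy_direct, toy_cot))
```

Keeping the gate as a separate callable mirrors the model-agnostic framing in the abstract: the same routing works whichever model produces the answers.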

[NLP-2] BabyLM Turns 4: Call for Papers for the 2026 BabyLM Workshop

[Quick Read]: This paper is a call for papers. BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. The authors announce the 4th BabyLM competition, which runs two tracks: the general track continues the data-efficient pretraining challenge of previous years, and a new Multilingual track is added this year. The call also invites workshop papers outside the competition in any relevant area, including training efficiency, cognitively plausible research, and weak model evaluation.

Link: https://arxiv.org/abs/2602.20092
Authors: Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox
Affiliation: IBM Research; MIT; ETH Zürich; Cornell University; University of Groningen; NYU; Boston University; University of Cambridge; Georgia Tech; UC San Diego; Georgetown University
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 1 table. arXiv admin note: substantial text overlap with arXiv:2502.10645

Abstract:BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 4th BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: Multilingual. We also call for papers outside the competition in any relevant areas. These include training efficiency, cognitively plausible research, weak model evaluation, and more.

[NLP-3] How Retrieved Context Shapes Internal Representations in RAG

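[Quick Read]: This paper studies how retrieved documents shape the internal representations of large language models (LLMs) in retrieval-augmented generation (RAG). Prior work has mostly examined the effect of retrieved context through output behavior, leaving open how that context shapes the internal representations that mediate information integration. The key idea is to analyze LLM hidden states under retrieved documents of varying relevance, in controlled single- and multi-document settings, to show how context relevance and layer-wise processing affect internal representations, and to relate these representation shifts to downstream generation behavior. The results offer explanations of output behavior and insights for RAG system design.

Link: https://arxiv.org/abs/2602.20091
Authors: Samuel Yeh, Sharon Li
Affiliations: University of Wisconsin-Madison
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract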

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work has largely examined these phenomena through output behavior, little is known about how retrieved context shapes the internal representations that mediate information integration in RAG. In this work, we study RAG through the lens of latent representations. We systematically analyze how different types of retrieved documents affect the hidden states of LLMs, and how these internal representation shifts relate to downstream generation behavior. Across four question-answering datasets and three LLMs, we analyze internal representations under controlled single- and multi-document settings. Our results reveal how context relevancy and layer-wise processing influence internal representations, providing explanations of LLMs' output behaviors and insights for RAG system design.

[NLP-4] Multilingual Large Language Models do not comprehend all natural languages to equal degrees

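[Quick Read]: This paper addresses the limited understanding of LLMs' language comprehension. Most benchmarks evaluate models in high-resource languages predominantly spoken by WEIRD (Western, Educated, Industrialised, Rich, and Democratic) communities, and the default assumption is that English is the best-performing language while low-resource languages yield less reliable outputs. The authors prompt 3 popular LLMs on the same comprehension task in 12 languages from the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic families. The models show remarkable linguistic accuracy across typologically diverse languages but fall behind human baselines in all of them, to different degrees. Contrary to expectation, English is not the best-performing language: several Romance languages, including lower-resource ones, systematically outperform it. The results are discussed in terms of tokenization, language distance from Spanish and English, training-data size, and data origin (high- vs. low-resource languages, WEIRD vs. non-WEIRD communities).

Link: https://arxiv.org/abs/2602.20065
Authors: Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi, Walid Irhaymi, Jin Yan, Tamara Serrano, Elena Pagliarini, Fritz Günther, Evelina Leivada
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 36 pages, 3 figures, 2 tables, 4 supplementary tables

Click to view abstract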

Abstract:Large Language Models (LLMs) play a critical role in how humans access information. While their core use relies on comprehending written requests, our understanding of this ability is currently limited, because most benchmarks evaluate LLMs in high-resource languages predominantly spoken by Western, Educated, Industrialised, Rich, and Democratic (WEIRD) communities. The default assumption is that English is the best-performing language for LLMs, while smaller, low-resource languages are linked to less reliable outputs, even in multilingual, state-of-the-art models. To track variation in the comprehension abilities of LLMs, we prompt 3 popular models on a language comprehension task across 12 languages, representing the Indo-European, Afro-Asiatic, Turkic, Sino-Tibetan, and Japonic language families. Our results suggest that the models exhibit remarkable linguistic accuracy across typologically diverse languages, yet they fall behind human baselines in all of them, albeit to different degrees. Contrary to what was expected, English is not the best-performing language, as it was systematically outperformed by several Romance languages, even lower-resource ones. We frame the results by discussing the role of several factors that drive LLM performance, such as tokenization, language distance from Spanish and English, size of training data, and data origin in high- vs. low-resource languages and WEIRD vs. non-WEIRD communities.

[NLP-5] Entropy in Large Language Models

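[Quick Read]: This paper asks how to quantify the entropy of text generated by large language models (LLMs) and how it compares with that of natural language (written or spoken). The long-term goal is to assess the impact of training an LLM on LLM-generated data. The key idea is to model LLM output as a stationary random source with a fixed probability distribution, compute its per-word entropy, and compare it with that of natural language as represented by the Open American National Corpus (OANC). The word entropy of LLM output is found to be lower than that of natural language, which suggests lower uncertainty in generated text.

Link: https://arxiv.org/abs/2602.20052
Authors: Marco Scharringhausen
Affiliations: Carl von Ossietzky Universität Oldenburg
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures, 3 tables

Click to view abstract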

Abstract:In this study, the output of large language models (LLMs) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. Given the probabilistic nature of modern LLMs, we assume a probabilistic model for these LLMs, following a constant random distribution and the source itself thus being stationary. We compare this source entropy (per word) to that of natural language (written or spoken) as represented by the Open American National Corpus (OANC). Our results indicate that the word entropy of such LLMs is lower than the word entropy of natural speech in both written and spoken form. The long-term goal of such studies is to formalize the intuitions of information and uncertainty in large language training to assess the impact of training an LLM from LLM-generated training data. This refers to texts from the world wide web in particular.
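
As a rough illustration of "entropy per word", the sketch below computes the Shannon entropy of the empirical unigram word distribution of a text. This is an assumption made for illustration: the abstract does not say which estimator the paper uses, and `word_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (bits per word) of the empirical unigram distribution.

    Assumption: whitespace tokenization and a memoryless (unigram) source;
    the paper's actual estimator may differ.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # H = -sum_w p(w) * log2 p(w)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    repetitive = "the cat the cat the cat the cat"
    varied = "a quick brown fox jumps over lazy dogs"
    print(word_entropy(repetitive), word_entropy(varied))
```

Under this estimator a text that reuses few words scores lower than one with a flatter word distribution, which is the kind of comparison the abstract describes between LLM output and OANC text.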

[NLP-6] Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously

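[Quick Read]: This position paper addresses the limits of the dominant General Alignment paradigm in complex socio-technical systems. In settings with conflicting values, plural stakeholders, and irreducible uncertainty, compressing diverse human values into a single scalar reward leads to structural value flattening, normative representation loss, and cognitive uncertainty blindness. The key proposal is Edge Alignment, in which systems preserve multi-dimensional value structure, support plural and democratic representation, and incorporate epistemic mechanisms for interaction and clarification. The authors propose seven interdependent pillars organized into three phases, reframing alignment as a lifecycle problem of dynamic normative governance rather than a single-instance optimization task.

Link: https://arxiv.org/abs/2602.20042
Authors: Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 5 figures

Click to view abstract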

Abstract:Large language models are being deployed in complex socio-technical systems, which exposes limits in current alignment practice. We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible uncertainty. These failures follow from the mathematics and incentives of scalarization and lead to structural value flattening, normative representation loss, and cognitive uncertainty blindness. We introduce Edge Alignment as a distinct approach in which systems preserve multi-dimensional value structure, support plural and democratic representation, and incorporate epistemic mechanisms for interaction and clarification. To make this approach practical, we propose seven interdependent pillars organized into three phases. We identify key challenges in data collection, training objectives, and evaluation, outlining complementary technical and governance directions. Taken together, these measures reframe alignment as a lifecycle problem of dynamic normative governance rather than as a single-instance optimization task.

[NLP-7] AgenticSum: An Agentic Inference-Time Framework for Faithful Clinical Text Summarization

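[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in keeping clinical text summaries factually consistent, which stems from the length, noise, and heterogeneity of clinical documentation. The key idea is AgenticSum, an inference-time agentic framework that decomposes summarization into four coordinated stages: context selection, generation, verification, and targeted correction. It uses internal attention grounding signals to identify weakly supported spans and selectively revises them under supervisory control, reducing hallucinated content.

Link: https://arxiv.org/abs/2602.20040
Authors: Fahmida Liza Piya, Rahmatollah Beheshti
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract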

Abstract:Large language models (LLMs) offer substantial promise for automating clinical text summarization, yet maintaining factual consistency remains challenging due to the length, noise, and heterogeneity of clinical documentation. We present AgenticSum, an inference-time, agentic framework that separates context selection, generation, verification, and targeted correction to reduce hallucinated content. The framework decomposes summarization into coordinated stages that compress task-relevant context, generate an initial draft, identify weakly supported spans using internal attention grounding signals, and selectively revise flagged content under supervisory control. We evaluate AgenticSum on two public datasets, using reference-based metrics, LLM-as-a-judge assessment, and human evaluation. Across various measures, AgenticSum demonstrates consistent improvements compared to vanilla LLMs and other strong baselines. Our results indicate that structured, agentic design with targeted correction offers an effective inference time solution to improve clinical note summarization using LLMs.

[NLP-8] gencat: Generative computerized adaptive testing

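[Quick Read]: This paper addresses a limitation of conventional computerized adaptive testing (CAT) frameworks: because they predict only response correctness, they cannot use the text of questions and responses, especially for open-ended questions. The key idea is GENCAT, an LLM-based CAT framework. Its core is a Generative Item Response Theory (GIRT) model, trained in two steps (supervised fine-tuning, then preference optimization), that estimates student knowledge from open-ended responses and predicts responses to unseen questions. Three question selection algorithms build on the model's generative ability, using the uncertainty, linguistic diversity, and information of sampled responses. On two real-world programming datasets, GENCAT outperforms existing CAT baselines, with an AUC improvement of up to 4.32% in the early testing stages.

Link: https://arxiv.org/abs/2602.20020
Authors: Wanyong Feng, Andrew Lan
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 2 figures

Click to view abstract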

Abstract:Existing Computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (GENerative CAT), a novel CAT framework that leverages Large Language Models for knowledge estimation and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32% in the key early testing stages.

[NLP-9] QUIETT: Query-Independent Table Transformation for Robust Reasoning

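[Quick Read]: This paper addresses real-world tables with irregular schemas, heterogeneous value formats, and implicit relational structure, which degrade the reliability of downstream table reasoning and question answering. Most existing approaches clean tables in a query-dependent manner, entangling cleanup with reasoning and limiting generalization. The key idea is QuIeTT, a query-independent table transformation framework that converts raw tables into a single SQL-ready canonical representation before any test-time query is seen. It performs lossless schema and value normalization, exposes implicit relations, and preserves full provenance via raw table snapshots. Decoupling transformation from reasoning makes querying cleaner, more reliable, and more efficient without modifying downstream models.

Link: https://arxiv.org/abs/2602.20017
Authors: Gaurav Najpande, Tampu Ravi Kumar, Manan Roy Choudhury, Neha Valeti, Yanjie Fu, Vivek Gupta
Affiliations: Arizona State University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract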

Abstract:Real-world tables often exhibit irregular schemas, heterogeneous value formats, and implicit relational structure, which degrade the reliability of downstream table reasoning and question answering. Most existing approaches address these issues in a query-dependent manner, entangling table cleanup with reasoning and thus limiting generalization. We introduce QuIeTT, a query-independent table transformation framework that preprocesses raw tables into a single SQL-ready canonical representation before any test-time queries are observed. QuIeTT performs lossless schema and value normalization, exposes implicit relations, and preserves full provenance via raw table snapshots. By decoupling table transformation from reasoning, QuIeTT enables cleaner, more reliable, and highly efficient querying without modifying downstream models. Experiments on four benchmarks, WikiTQ, HiTab, NQ-Table, and SequentialQA show consistent gains across models and reasoning paradigms, with particularly strong improvements on a challenge set of structurally diverse, unseen questions.

[NLP-10] Cross-lingual Matryoshka Representation Learning across Speech and Text

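[Quick Read]: This paper addresses two barriers faced by speakers of under-represented languages such as Wolof: a language barrier, since most online knowledge is in a few dominant languages such as French, and a modality barrier, since information is largely text-based while many languages are primarily oral. The key idea is to train the first bilingual speech-text Matryoshka embedding model, which retrieves French text from Wolof speech queries without a costly ASR-translation pipeline. With large-scale data curation pipelines and new benchmarks, the authors compare modeling strategies and find that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes to other tasks such as speech intent detection, which indicates that it learns general semantic representations.

Link: https://arxiv.org/abs/2602.19991
Authors: Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina
Affiliations: LORIA, CNRS; Soynade Research
Categories: Computation and Language (cs.CL)
Comments: Preprint, under review

Click to view abstract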

Abstract:Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on costly ASR-translation pipelines. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating the learning of general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated only in a few components, suggesting potential for efficiency improvements.
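
The cost-accuracy trade-off across Matryoshka dimensions can be pictured with a minimal sketch: a Matryoshka embedding is trained so that its first `k` coordinates are usable on their own, so retrieval can truncate, renormalize, and score with cosine similarity. The vectors below are made-up toy values and the helper names are hypothetical; nothing here reproduces the paper's model.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], k: int) -> List[float]:
    """Keep the first k coordinates and rescale to unit length."""
    head = list(vec[:k])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return [0.0] * len(head)
    return [x / norm for x in head]


def cosine_at_dim(query: Sequence[float], doc: Sequence[float], k: int) -> float:
    """Cosine similarity using only the first k dimensions of both vectors."""
    q = truncate_and_normalize(query, k)
    d = truncate_and_normalize(doc, k)
    return sum(a * b for a, b in zip(q, d))


def retrieve(query: Sequence[float], docs: Sequence[Sequence[float]], k: int) -> int:
    """Index of the best-scoring document at embedding size k."""
    scores = [cosine_at_dim(query, d, k) for d in docs]
    return max(range(len(docs)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy stand-ins for a speech-query embedding and two text embeddings.
    speech_query = [1.0, 0.0, 0.0, 0.0]
    text_docs = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.5, 0.0]]
    print(retrieve(speech_query, text_docs, k=2))
```

Smaller `k` means cheaper storage and scoring; the abstract's observation that information concentrates in a few components is what makes such truncation attractive.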

[NLP-11] ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting EACL2026

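[Quick Read]: This paper addresses two limitations of attention-based re-ranking in the zero-shot setting. First, attention signals concentrate on a small subset of tokens within a few documents, making the rest indistinguishable. Second, attention overemphasizes phrases that are lexically similar to the query, so documents with mere lexical resemblance are wrongly ranked as relevant. The key idea is ReAttn, a post-hoc re-weighting strategy. It computes cross-document IDF weights to down-weight attention on query-overlapping tokens that appear frequently across candidate documents, which reduces lexical bias and emphasizes distinctive terms. It then applies entropy-based regularization to mitigate over-concentrated attention and encourage a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights, with no additional training or supervision.

Link: https://arxiv.org/abs/2602.19969
Authors: Yuxing Tian, Fengran Mo, Weixu Zhang, Yiyan Qi, Jian-Yun Nie
Affiliations: University of Montreal; McGill University; MILA; International Digital Economy Academy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EACL2026

Click to view abstract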

Abstract:The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking tasks. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings in which irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose ReAttn, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first computes the cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method.
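
The cross-document IDF step can be sketched as follows. This is only an illustration of the general idea, not the paper's formula: the smoothed IDF, the normalization by the maximum IDF, and the function names are all assumptions, and the entropy-based regularization step is omitted because the abstract does not specify its form.

```python
import math
from typing import Dict, List, Sequence, Set


def cross_doc_idf(docs: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Smoothed IDF over the candidate documents (assumed form)."""
    n = len(docs)
    df: Dict[str, int] = {}
    for tokens in docs:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log((n + 1) / (c + 1)) + 1.0 for tok, c in df.items()}


def reweighted_scores(
    docs: Sequence[Sequence[str]],
    attn: Sequence[Sequence[float]],
    query_tokens: Set[str],
) -> List[float]:
    """Sum per-token attention, scaling down query-overlapping tokens
    that occur in many candidates (low cross-document IDF)."""
    idf = cross_doc_idf(docs)
    idf_max = math.log((len(docs) + 1) / 2) + 1.0  # IDF of a token seen in one doc
    scores = []
    for tokens, weights in zip(docs, attn):
        total = 0.0
        for tok, w in zip(tokens, weights):
            factor = idf[tok] / idf_max if tok in query_tokens else 1.0
            total += w * factor
        scores.append(total)
    return scores


if __name__ == "__main__":
    docs = [["apple", "pie"], ["apple", "stock"]]
    attn = [[0.9, 0.1], [0.4, 0.6]]  # toy attention mass per token
    print(reweighted_scores(docs, attn, {"apple"}))
```

In the toy run, the first document draws most of its attention from the query word "apple", which appears in every candidate, so its score is reduced more than that of the second document.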

[NLP-12] Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling

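[Quick Read]: This paper addresses two challenges that learning-based trading systems face when using financial news events: the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. The key idea is Janus-Q, a two-stage framework. Stage I builds a large-scale event-centric dataset of 62,400 news articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning that combines supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. The reported result is more consistent, interpretable, and profitable trading decisions.

Link: https://arxiv.org/abs/2602.19919
Authors: Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu, Penglei Sun, Yongqi Zhang, Xiaowen Chu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); International Digital Economy Academy
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract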

Abstract:Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.
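
For readers unfamiliar with the CAR label mentioned above, here is a minimal sketch of a cumulative abnormal return over an event window. It assumes the simple market-adjusted model (abnormal return = stock return minus market return); the abstract does not state which expected-return model the dataset uses, and the numbers are made up.

```python
from typing import Sequence


def cumulative_abnormal_return(
    stock_returns: Sequence[float],
    market_returns: Sequence[float],
) -> float:
    """CAR over an event window under a market-adjusted model.

    Assumption: expected return on each day equals the market return.
    """
    if len(stock_returns) != len(market_returns):
        raise ValueError("return series must cover the same event window")
    return sum(r - m for r, m in zip(stock_returns, market_returns))


if __name__ == "__main__":
    # Hypothetical three-day window around a news event.
    stock = [0.02, 0.01, -0.01]
    market = [0.01, 0.00, 0.00]
    print(cumulative_abnormal_return(stock, market))
```

A positive CAR means the stock outperformed its benchmark over the window, which is the sort of statistically grounded market reaction the dataset pairs with each news event.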

[NLP-13] DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

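[Quick Read]: This paper addresses limited exploration in reinforcement learning with verifiers (RLVR) for large language model (LLM) reasoning: policies tend to collapse onto a few reasoning patterns and stop deep exploration prematurely. The key idea is DSDR (Dual-Scale Diversity Regularization), which treats diversity at a global and a local scale. Globally, it promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that strengthens local regularization for more distinctive correct trajectories, giving more stable and informative signals for group-based optimization while preserving optimal correctness under bounded regularization.

Link: https://arxiv.org/abs/2602.19895
Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract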

Abstract:Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at this https URL.
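
One plausible reading of the local term, "length-invariant, token-level entropy restricted to correct trajectories", is the per-token entropy averaged over each trajectory, computed only for trajectories the verifier accepts. The sketch below follows that reading; it is an assumption, not the paper's definition, and all names are hypothetical.

```python
import math
from typing import List, Sequence

Distribution = Sequence[float]  # next-token probabilities at one step


def token_entropy(probs: Distribution) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_token_entropy(trajectory: Sequence[Distribution]) -> float:
    """Average entropy per token, so longer outputs are not favored."""
    if not trajectory:
        return 0.0
    return sum(token_entropy(p) for p in trajectory) / len(trajectory)


def local_entropy_bonus(
    trajectories: Sequence[Sequence[Distribution]],
    is_correct: Sequence[bool],
) -> float:
    """Mean per-token entropy over verifier-accepted trajectories only."""
    kept: List[float] = [
        mean_token_entropy(t) for t, ok in zip(trajectories, is_correct) if ok
    ]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    short = [[0.5, 0.5]]
    long = [[0.5, 0.5]] * 3
    print(mean_token_entropy(short), mean_token_entropy(long))
```

Averaging rather than summing is what makes the term length-invariant in this sketch: the one-token and three-token trajectories above receive the same value.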

[NLP-14] Denotational Semantics for ODRL: Knowledge-Based Constraint Conflict Detection

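[Quick Read]: This paper addresses conflict detection for ODRL (Open Digital Rights Language) policies across dataspaces. ODRL's set-based operators depend on external domain knowledge that the W3C specification leaves unspecified, and without it every cross-dataspace policy comparison defaults to Unknown. The key idea is a set-theoretic denotational semantics that maps each ODRL constraint to the set of knowledge-base (KB) concepts satisfying it. Conflict detection then reduces to denotation intersection under a three-valued verdict (Conflict, Compatible, Unknown) that is sound under incomplete knowledge. The framework covers all three ODRL composition modes (and, or, xone) and three semantic domains: taxonomic, mereological (part-whole), and nominal. Order-preserving alignments between KBs guarantee that conflicts are preserved across KB standards and that unmapped concepts degrade gracefully to Unknown, never to false conflicts. The encoding stays within the decidable EPR fragment of first-order logic and is validated on 154 benchmarks, on which the Vampire theorem prover and the Z3 SMT solver agree on every verdict.

Link: https://arxiv.org/abs/2602.19883
Authors: Daham Mustafa, Diego Collarana, Yixin Peng, Rafiqul Haque, Christoph Lange-Bever, Christoph Quix, Stephan Decker
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 17 pages, 6 tables. Working draft. Supplementary material (154 TPTP/SMT-LIB benchmarks, Isabelle/HOL theory file) will be made available at this https URL upon publication

Click to view abstract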

Abstract:ODRL’s six set-based operators – isA, isPartOf, hasPart, isAnyOf, isAllOf, isNoneOf – depend on external domain knowledge that the W3C specification leaves unspecified. Without it, every cross-dataspace policy comparison defaults to Unknown. We present a denotational semantics that maps each ODRL constraint to the set of knowledge-base concepts satisfying it. Conflict detection reduces to denotation intersection under a three-valued verdict – Conflict, Compatible, or Unknown – that is sound under incomplete knowledge. The framework covers all three ODRL composition modes (and, or, xone) and all three semantic domains arising in practice: taxonomic (class subsumption), mereological (part-whole containment), and nominal (identity). For cross-dataspace interoperability, we define order-preserving alignments between knowledge bases and prove two guarantees: conflicts are preserved across different KB standards, and unmapped concepts degrade gracefully to Unknown – never to false conflicts. A runtime soundness theorem ensures that design-time verdicts hold for all execution contexts. The encoding stays within the decidable EPR fragment of first-order logic. We validate it with 154 benchmarks across six knowledge base families (GeoNames, ISO 3166, W3C DPV, a GDPR-derived taxonomy, BCP 47, and ISO 639-3) and four structural KBs targeting adversarial edge cases. Both the Vampire theorem prover and the Z3 SMT solver agree on all 154 verdicts. A key finding is that exclusive composition (xone) requires strictly stronger KB axioms than conjunction or disjunction: open-world semantics blocks exclusivity even when positive evidence appears to satisfy exactly one branch.
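
The idea of "conflict detection as denotation intersection with a three-valued verdict" can be illustrated with a toy taxonomy. The sketch below handles only an `isA`-style constraint, uses a hypothetical purpose taxonomy, and treats disjoint denotations of known concepts as a conflict; the paper's actual semantics, including its soundness conditions under open-world knowledge, is richer than this.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Verdict(Enum):
    CONFLICT = "Conflict"
    COMPATIBLE = "Compatible"
    UNKNOWN = "Unknown"


def descendants(kb: Dict[str, str], concept: str) -> Set[str]:
    """The concept plus everything below it in a child -> parent taxonomy."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in kb.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def denote_is_a(kb: Dict[str, str], concept: str) -> Optional[Set[str]]:
    """Denotation of `isA concept`, or None if the KB does not know the concept."""
    known = set(kb) | set(kb.values())
    if concept not in known:
        return None
    return descendants(kb, concept)


def detect(kb: Dict[str, str], a: str, b: str) -> Verdict:
    """Three-valued verdict for two isA constraints over the same operand."""
    da, db = denote_is_a(kb, a), denote_is_a(kb, b)
    if da is None or db is None:
        return Verdict.UNKNOWN  # missing knowledge never yields a false conflict
    return Verdict.COMPATIBLE if da & db else Verdict.CONFLICT


if __name__ == "__main__":
    # Hypothetical purpose taxonomy (child -> parent), for illustration only.
    kb = {
        "Marketing": "Commercial",
        "Advertising": "Marketing",
        "AcademicResearch": "Research",
    }
    print(detect(kb, "Commercial", "Advertising"))
    print(detect(kb, "Marketing", "AcademicResearch"))
    print(detect(kb, "Marketing", "Archiving"))
```

The third call shows the graceful degradation described in the abstract: a concept missing from the KB produces Unknown instead of a spurious Conflict.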

[NLP-15] Axis Decomposition for ODRL: Resolving Dimensional Ambiguity in Policy Constraints through Interval Semantics

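[Quick Read]: This paper addresses semantic ambiguity in ODRL (Open Digital Rights Language) 2.2 constraints whose left operand denotes a multi-dimensional quantity such as image dimensions or geographic coordinates. A single scalar constraint then admits one interpretation per axis, which makes policy evaluation non-deterministic. The key idea is an axis-decomposition framework that refines each dimensional left operand into axis-specific scalar operands, with four proven properties: deterministic interpretation, AABB completeness, sound over-approximation under projection, and conservative extension. Per-axis verdicts compose through Strong Kleene conjunction into a three-valued logic (Conflict, Compatible, Unknown), and disjunctive (odrl:or) and exclusive-or (odrl:xone) constraints, where per-axis decomposition does not apply, are encoded directly. The framework is instantiated as the ODRL Spatial Axis Profile and evaluated on benchmarks in both TPTP FOF and SMT-LIB encodings; all meta-theorems are mechanically verified in Isabelle/HOL.

Link: https://arxiv.org/abs/2602.19878
Authors: Daham Mustafa, Diego Collarana, Yixin Peng, Rafiqul Haque, Christoph Lange-Bever, Christoph Quix, Stephan Decker
Affiliations: RWTH Aachen University; Fraunhofer FIT; University of Galway
Categories: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 16 pages, 5 tables. Preprint

Click to view abstract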

Abstract:Every ODRL 2.2 constraint compares a single scalar value: (leftOperand, operator, rightOperand). Five of ODRL’s approximately 34 left operands, however, denote multi-dimensional quantities–image dimensions, canvas positions, geographic coordinates–whose specification text explicitly references multiple axes. For these operands, a single scalar constraint admits one interpretation per axis, making policy evaluation non-deterministic. We classify ODRL’s left operands by value-domain structure (scalar, dimensional, concept-valued), grounded in the ODRL 2.2 specification text, and show that dimensional ambiguity is intrinsic to the constraint syntax. We present an axis-decomposition framework that refines each dimensional operand into axis-specific scalar operands and prove four properties: deterministic interpretation, AABB completeness, sound over-approximation under projection, and conservative extension. Conflict detection operates in two layers: per-axis verdicts are always decidable; box-level verdicts compose through Strong Kleene conjunction into a three-valued logic (Conflict, Compatible, Unknown). For ODRL’s disjunctive (odrl:or) and exclusive-or (odrl:xone) logical constraints, where per-axis decomposition does not apply, the framework encodes coupled multi-axis conjectures directly. We instantiate the framework as the ODRL Spatial Axis Profile–15 axis-specific left operands for the five affected base terms–and evaluate it on 117 benchmark problems spanning nine categories across both TPTP FOF (Vampire) and SMT-LIB (Z3) encodings, achieving full concordance between provers. Benchmark scenarios are inspired by constraints arising in cultural heritage dataspaces such as Datenraum Kultur. All meta-theorems are mechanically verified in Isabelle/HOL.
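
The two-layer conflict detection described in the axis-decomposition abstract (per-axis verdicts combined by Strong Kleene conjunction) can be sketched with closed intervals. The axis names `width` and `height`, the encoding of verdicts as `True`/`False`/`None`, and the choice to treat an unconstrained axis as unknown are illustrative assumptions, not the operand names or rules of the ODRL Spatial Axis Profile.

```python
from typing import Dict, Iterable, Optional, Tuple

Interval = Tuple[float, float]  # closed interval [low, high] on one axis


def axis_verdict(a: Optional[Interval], b: Optional[Interval]) -> Optional[bool]:
    """True = compatible (intervals overlap), False = conflict (disjoint),
    None = unknown (one policy gives no bound on this axis)."""
    if a is None or b is None:
        return None
    return max(a[0], b[0]) <= min(a[1], b[1])


def kleene_and(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """Strong Kleene conjunction: any False wins, then any None, else True."""
    vals = list(values)
    if any(v is False for v in vals):
        return False
    if any(v is None for v in vals):
        return None
    return True


def box_verdict(p: Dict[str, Interval], q: Dict[str, Interval]) -> Optional[bool]:
    """Box-level verdict as the conjunction of per-axis verdicts."""
    axes = sorted(set(p) | set(q))
    return kleene_and(axis_verdict(p.get(ax), q.get(ax)) for ax in axes)


if __name__ == "__main__":
    policy = {"width": (0, 1920), "height": (0, 1080)}
    request = {"width": (800, 4000), "height": (2000, 3000)}
    print(box_verdict(policy, request))  # height intervals are disjoint
```

A single disjoint axis is enough to make two axis-aligned boxes disjoint, which is why a definite per-axis conflict dominates an unknown on another axis under Strong Kleene conjunction.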

[NLP-16] SHIELD: Semantic Heterogeneity Integrated Embedding for Latent Discovery in Clinical Trial Safety Signals

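[Quick Read]: This paper addresses automated, integrated safety signal detection in clinical trials: identifying potential risks in adverse event (AE) data and producing an interpretable, structured representation of the safety profile. The key idea is SHIELD, which combines disproportionality analysis with semantic clustering of MedDRA term embeddings. For each AE it computes an information-theoretic disproportionality measure (Information Component) with effect size derived via empirical Bayesian shrinkage. It then builds a utility matrix by weighting term-term semantic similarities by signal magnitude, and applies spectral embedding and clustering to find groups of related AEs. Large language models annotate the clusters with syndrome-level labels, yielding a network graph and a hierarchical tree of the treatment-associated safety profile. The work links statistical signal detection with natural language processing to support safety assessment and causal interpretation.

Link: https://arxiv.org/abs/2602.19855
Authors: Francois Vandenhende, Anna Georgiou, Theodoros Psaras, Ellie Karekla
Affiliations: ClinBAY Limited
Categories: Computation and Language (cs.CL)
Comments: 3 figures, 1 table

Click to view abstract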

Abstract:We present SHIELD, a novel methodology for automated and integrated safety signal detection in clinical trials. SHIELD combines disproportionality analysis with semantic clustering of adverse event (AE) terms applied to MedDRA term embeddings. For each AE, the pipeline computes an information-theoretic disproportionality measure (Information Component) with effect size derived via empirical Bayesian shrinkage. A utility matrix is constructed by weighting semantic term-term similarities by signal magnitude, followed by spectral embedding and clustering to identify groups of related AEs. Resulting clusters are annotated with syndrome-level summary labels using large language models, yielding a coherent, data-driven representation of treatment-associated safety profiles in the form of a network graph and hierarchical tree. We implement the SHIELD framework in the context of a single-arm incidence summary, to compare two treatment arms or for the detection of any treatment effect in a multi-arm trial. We illustrate its ability to recover known safety signals and generate interpretable, cluster-based summaries in a real clinical trial example. This work bridges statistical signal detection with modern natural language processing to enhance safety assessment and causal interpretation in clinical trials.
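
As background for the Information Component mentioned above, the sketch below uses the widely used shrinkage form IC = log2((O + 0.5) / (E + 0.5)), where O is the observed count and E the count expected under independence. Whether SHIELD uses exactly this shrinkage is an assumption; the abstract says only that the effect size is derived via empirical Bayesian shrinkage. The counts are invented.

```python
import math


def information_component(
    observed: int,
    n_treatment: int,
    n_event: int,
    n_total: int,
) -> float:
    """Shrunk Information Component for one adverse event.

    observed    : records with both the treatment and the event
    n_treatment : records with the treatment
    n_event     : records with the event
    n_total     : all records
    Assumption: the +0.5 shrinkage on both observed and expected counts.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    expected = n_treatment * n_event / n_total  # count under independence
    return math.log2((observed + 0.5) / (expected + 0.5))


if __name__ == "__main__":
    # Hypothetical counts: the event is over-represented under treatment.
    print(information_component(observed=10, n_treatment=100, n_event=20, n_total=1000))
```

Values above zero indicate that the event is seen more often with the treatment than independence would predict; the shrinkage term pulls estimates based on small counts toward zero.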

[NLP-17] SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation

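[Quick Read]: This paper addresses the difficulty modern large language models (LLMs) have in preserving an author's distinctive literary style in machine translation: outputs are semantically correct but generic. The key idea is the Style-Adaptive Multi-Agent System (SAMAS), which treats style preservation as a signal processing task. It quantifies literary style as a Stylistic Feature Spectrum (SFS) using the wavelet packet transform, and uses the SFS as a control signal to dynamically assemble a workflow of specialized translation agents tailored to the structural patterns of the source text. The reported result is competitive semantic accuracy with a statistically significant advantage in style fidelity.

Link: https://arxiv.org/abs/2602.19840
Authors: Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract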

Abstract:Modern large language models (LLMs) excel at generating fluent and faithful translations. However, they struggle to preserve an author’s unique literary style, often producing semantically correct but generic outputs. This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations. To address this, we introduce the Style-Adaptive Multi-Agent System (SAMAS), a novel framework that treats style preservation as a signal processing task. Specifically, our method quantifies literary style into a Stylistic Feature Spectrum (SFS) using the wavelet packet transform. This SFS serves as a control signal to dynamically assemble a tailored workflow of specialized translation agents based on the source text’s structural patterns. Extensive experiments on translation benchmarks show that SAMAS achieves competitive semantic accuracy against strong baselines, primarily by leveraging its statistically significant advantage in style fidelity.

[NLP-18] Keyboards for the Endangered Idu Mishmi Language

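[Quick read]: This paper presents a mobile and desktop keyboard suite for Idu Mishmi, an endangered Trans-Himalayan language spoken by approximately 11,000 people in Arunachal Pradesh, India, for which no digital input tools supported the Latin-based orthography developed in 2018. The suite comprises an Android keyboard published on the Google Play Store and a Windows keyboard under community testing; both cover the complete character inventory and operate fully offline with zero network permissions.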

Link: https://arxiv.org/abs/2602.19815
Authors: Akhilesh Kakolu Ramarao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present a mobile and desktop keyboard suite for Idu Mishmi, an endangered Trans-Himalayan language spoken by approximately 11,000 people in Arunachal Pradesh, India. Although a Latin-based orthography was developed in 2018, no digital input tools existed to use it, forcing speakers into ad-hoc romanizations that cannot represent the full writing system. Our keyboards comprise two tools: (1) an Android mobile keyboard, published on the Google Play Store and actively used in teacher training programs, and (2) a Windows desktop keyboard currently undergoing community testing. Both tools support the complete Idu Mishmi character inventory, including schwa, retracted schwa, nasalized vowels, and accented forms. Both operate fully offline with zero network permissions, addressing connectivity constraints and data sovereignty concerns. We describe the design, implementation, and deployment as a replicable model for other endangered language communities.

[NLP-19] NILE: Formalizing Natural-Language Descriptions of Formal Languages

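[Quick read]: This paper addresses educational scenarios in which learners describe a formal language (given, for example, as a finite state automaton, regular expression, pushdown automaton, context-free grammar, or in set notation) in natural language, and a support system must judge whether the description is accurate and explain why it is not. The core difficulty is that the semantic mapping between natural language and formal representations is indirect, and existing formalisms rarely stay structurally close to the natural-language wording. The key is Nile, a representation language for formal languages whose expressions can mirror the syntactic structure of natural-language descriptions. Nile is expressive enough to cover all regular languages and the fragments of context-free languages typically used in teaching, so translating a description into a syntactically close Nile expression makes it possible to explain inaccuracies algorithmically. Experiments show that LLMs translate natural-language descriptions into equivalent, syntactically close Nile expressions with high accuracy, whereas translations into regular expressions, though possible, are often not syntactically close and are therefore unsuitable for explanations.

Link: https://arxiv.org/abs/2602.19743
Authors: Tristan Kneisel, Marko Schmellenkamp, Fabian Vehlken, Thomas Zeume
Affiliations: Ruhr University Bochum
Categories: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: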

Abstract:This paper explores how natural-language descriptions of formal languages can be compared to their formal representations and how semantic differences can be explained. This is motivated from educational scenarios where learners describe a formal language (presented, e.g., by a finite state automaton, regular expression, pushdown automaton, context-free grammar or in set notation) in natural language, and an educational support system has to (1) judge whether the natural-language description accurately describes the formal language, and to (2) provide explanations why descriptions are not accurate. To address this question, we introduce a representation language for formal languages, Nile, which is designed so that Nile expressions can mirror the syntactic structure of natural-language descriptions of formal languages. Nile is sufficiently expressive to cover a broad variety of formal languages, including all regular languages and fragments of context-free languages typically used in educational contexts. Generating Nile expressions that are syntactically close to natural-language descriptions then allows to provide explanations for inaccuracies in the descriptions algorithmically. In experiments on an educational data set, we show that LLMs can translate natural-language descriptions into equivalent, syntactically close Nile expressions with high accuracy - allowing to algorithmically provide explanations for incorrect natural-language descriptions. Our experiments also show that while natural-language descriptions can also be translated into regular expressions (but not context-free grammars), the expressions are often not syntactically close and thus not suitable for providing explanations. 

[NLP-20] KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge EACL2026

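[Quick read]: This paper addresses subtle hallucinations in LLM output: responses can be coherent without being truthful, and existing benchmarks rely on static, narrow questions, which leads to limited coverage and misleading evaluations. The key is KGHaluBench, a Knowledge Graph (KG)-based benchmark. It uses the KG to dynamically construct challenging, multifaceted questions and statistically estimates their difficulty to address popularity bias; an automated verification pipeline then detects abstentions and checks each response at both the conceptual and the correctness level to identify different types of hallucination. Using novel accuracy and hallucination metrics, the authors evaluate 25 frontier models and give a more interpretable view of the knowledge factors behind hallucinations across model sizes.

Link: https://arxiv.org/abs/2602.19643
Authors: Alex Robertson, Huizhi Liang, Mahbub Gani, Rohit Kumar, Srijith Rajamohan
Affiliations: Newcastle University; Sage Ai (Sage Group PLC); Redis
Categories: Computation and Language (cs.CL)
Comments: EACL 2026 Findings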

Abstract:Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM’s response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.

[NLP-21] Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding

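[Quick read]: This paper addresses the efficiency and performance bottlenecks of LLM-driven lossless compression: keeping compression ratios high while cutting compute cost, speeding up inference, and supporting arbitrary binary files. The key is to combine a small transformer language model (SmolLM2-135M) with an ensemble of lightweight online predictors and a 32-bit arithmetic coder, plus several optimizations: raising cumulative distribution function (CDF) precision from 2^16 to 2^24 to reduce quantization overhead; a token-level N-gram model for fast local predictions; an adaptive log-space bias head that corrects per-document LLM errors via online gradient descent; a confidence-based skip of the LLM for highly predictable tokens; a hybrid binary format (NC06) for arbitrary binary files, which the authors believe is a first among LLM-based compressors; a dedicated inference backend with about 7x faster single-token decode than PyTorch; parallel multi-GPU compression; and a native KV cache sliding window that reduces per-slide cost by about 37x. The system needs only about 500 MB of GGUF weights and about 1.2 GB of VRAM per worker, runs on consumer GPUs, and outperforms gzip, bzip2, CMIX v21, ts_zip, and FineZip in the reported benchmarks.

Link: https://arxiv.org/abs/2602.19626
Authors: Roberto Tacconelli
Affiliations: Unknown
Categories: Information Theory (cs.IT); Computation and Language (cs.CL)
Comments: 10 pages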

Abstract:We present Nacrith, a lossless compression system that combines a 135M-parameter transformer language model (SmolLM2-135M) with an ensemble of lightweight online predictors and a 32-bit arithmetic coder. Beyond the base LLM-plus-arithmetic-coding paradigm, Nacrith introduces several contributions: (1) a CDF precision upgrade from 2^16 to 2^24 that eliminates ~75% of quantization overhead caused by minimum-probability floors in large vocabularies; (2) a token-level N-gram model for fast local predictions; (3) an adaptive log-space bias head correcting per-document LLM errors via online gradient descent; (4) confidence-based LLM skip for accelerating highly predictable tokens; (5) a hybrid binary format (NC06) extending neural compression to arbitrary binary files–to our knowledge a first among LLM-based compressors; (6) a this http URL inference backend achieving ~7x faster single-token decode than PyTorch; (7) parallel multi-GPU compression across up to 8 workers; and (8) native KV cache sliding window reducing per-slide cost by ~37x. The system requires only ~500 MB of GGUF weights and ~1.2 GB VRAM per worker, running on consumer GPUs. On this http URL (Canterbury Corpus, 152 KB), Nacrith achieves 0.918 bits per byte (bpb)–outperforming gzip by 3.1x, bzip2 by 2.5x, CMIX v21 by 44%, and ts_zip by 20%, while compressing below the 0th-, 1st-, and 2nd-order byte-level Shannon entropy bounds. On enwik8 (100 MB), Nacrith achieves 0.9389 bpb (11.74%), surpassing ts_zip (~1.11 bpb) by 15% and FineZip (1.024 bpb) by 8% despite using a 60x smaller model with no fine-tuning. An out-of-distribution evaluation on a document published after the model’s training cutoff confirms these gains are not memorization artifacts, achieving 0.723 bpb on unseen text. 
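To make the CDF precision point concrete, the sketch below quantizes a peaked next-token distribution into integer frequencies with a floor of one count per symbol, then measures the expected extra bits at 16 and 24 bits of precision. The quantization scheme, the vocabulary size of 50,000, and the function names are assumptions for illustration, not Nacrith's actual coder.

```python
import math

def quantize_cdf(probs, bits):
    """Turn probabilities into integer frequencies summing to 2**bits, each >= 1."""
    total = 1 << bits
    n = len(probs)
    if n > total:
        raise ValueError("vocabulary does not fit in the CDF range")
    spare = total - n  # range left after giving every symbol a floor count of 1
    freqs = [1 + int(p * spare) for p in probs]
    # Give the rounding remainder to the most probable symbol.
    best = max(range(n), key=probs.__getitem__)
    freqs[best] += total - sum(freqs)
    return freqs

def overhead_bits(probs, freqs, bits):
    """Expected extra bits per symbol when coding with the quantized model."""
    total = 1 << bits
    return sum(p * math.log2(p * total / f) for p, f in zip(probs, freqs) if p > 0)

# Hypothetical peaked prediction: one token at 0.99, the rest share 0.01.
vocab_size = 50_000
probs = [0.99] + [0.01 / (vocab_size - 1)] * (vocab_size - 1)

for bits in (16, 24):
    freqs = quantize_cdf(probs, bits)
    print(bits, round(overhead_bits(probs, freqs, bits), 4))
```

With a large vocabulary, the per-symbol floors consume most of a 2^16 range, so a confident prediction cannot be represented faithfully; at 2^24 the floors become a negligible share of the range.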

[NLP-22] Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

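[Quick read]: This paper addresses a blind spot in Machine Unlearning (MU): existing work assumes all facts are equally forgettable and largely ignores whether the knowledge to be forgotten comes from pretraining or from supervised fine-tuning (SFT). The authors introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. The key finding is that pretrained and SFT models respond differently to unlearning: an SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

Link: https://arxiv.org/abs/2602.19612
Authors: Borisiuk Anna, Andrey Savchenko, Alexander Panchecko, Elena Tutubalina
Affiliations: AIRI; Skoltech; Sber AI Lab; ISP RAS Research Center for Trusted AI
Categories: Computation and Language (cs.CL)
Comments: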

Abstract:Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

[NLP-23] Eye-Tracking-while-Reading: A Living Survey of Datasets with Open Library Support

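[Quick read]: This paper addresses the poor reusability of eye-tracking-while-reading datasets across disciplines, which stems from a lack of interoperability and shared data standards. The key is threefold: an extensive overview of existing datasets; a living overview published online that presents over 45 features per dataset and simplifies sharing of newly created datasets; and the integration of all publicly available datasets into the Python package pymovements, which provides an eye-tracking datasets library. Together these aim to strengthen the FAIR principles (findable, accessible, interoperable, reusable) and good scientific practice in eye-tracking-while-reading research.

Link: https://arxiv.org/abs/2602.19598
Authors: Deborah N. Jakobi, David R. Reich, Paul Prasse, Jana M. Hofmann, Lena S. Bolliger, Lena A. Jäger
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: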

Abstract:Eye-tracking-while-reading corpora are a valuable resource for many different disciplines and use cases. Use cases range from studying the cognitive processes underlying reading to machine-learning-based applications, such as gaze-based assessments of reading comprehension. The past decades have seen an increase in the number and size of eye-tracking-while-reading datasets as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, or accompanying psychometric or demographic data. The spread of data across different disciplines and the lack of data sharing standards across the communities lead to many existing datasets that cannot be easily reused due to a lack of interoperability. In this work, we aim at creating more transparency and clarity with regards to existing datasets and their features across different disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online, this https URL, presenting over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements which offers an eye-tracking datasets library. By doing so, we aim to strengthen the FAIR principles in eye-tracking-while-reading research and promote good scientific practices, such as reproducing and replicating studies.

[NLP-24] DEEP: Docker-based Execution and Evaluation Platform

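[Quick read]: This paper addresses automated and interpretable comparative evaluation of systems for tasks such as machine translation (MT) and optical character recognition (OCR): running many candidate models, scoring them, and showing whether their differences are statistically significant. The key is DEEP, a software tool that (1) receives dockerized systems, runs them, and extracts information during execution; (2) scores hypotheses against references and clusters models based on a statistical significance analysis of their metric results, so evaluators can see performance groups; and (3) offers a visualization web app for interpreting the results.

Link: https://arxiv.org/abs/2602.19583
Authors: Sergio Gómez González, Miguel Domingo, Francisco Casacuberta
Affiliations: PRHLT Research Center - Universitat Politècnica de València; ValgrAI - Valencian Graduate School and Research Network for Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: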

Abstract:Comparative evaluation of several systems is a recurrent task in researching. It is a key step before deciding which system to use for our work, or, once our research has been conducted, to demonstrate the potential of the resulting model. Furthermore, it is the main task of competitive, public challenges evaluation. Our proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models. Furthermore, it is easily extensible to other tasks. DEEP is prepared to receive dockerized systems, run them (extracting information at that same time), and assess hypothesis against some references. With this approach, evaluators can achieve a better understanding of the performance of each model. Moreover, the software uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model, according to the evaluation metrics. As a result, evaluators are able to identify clusters of performance among the swarm of proposals and have a better understanding of the significance of their differences. Additionally, we offer a visualization web-app to ensure that the results can be adequately understood and interpreted. Finally, we present an exemplary case of use of DEEP.

[NLP-25] Temporal-Aware Heterogeneous Graph Reasoning with Multi-View Fusion for Temporal Question Answering

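[Quick read]: This paper addresses three challenges in Temporal Knowledge Graph Question Answering (TKGQA): weak incorporation of temporal constraints in question representations, which biases reasoning; limited ability to perform explicit multi-hop reasoning; and suboptimal fusion of language and graph representations. The key is a three-part framework: a constraint-aware question encoding that combines semantic cues from language models with temporal entity dynamics; a temporal-aware graph neural network that performs explicit multi-hop reasoning via time-aware message passing; and a multi-view attention mechanism that fuses question context with temporal graph knowledge more effectively.

Link: https://arxiv.org/abs/2602.19569
Authors: Wuzhenghong Wen, Bowen Zhou, Jinwen Huang, Xianjie Wu, Yuwei Sun, Su Pan, Liang Li, Jianting Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages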

Abstract:Question Answering over Temporal Knowledge Graphs (TKGQA) has attracted growing interest for handling time-sensitive queries. However, existing methods still struggle with: 1) weak incorporation of temporal constraints in question representation, causing biased reasoning; 2) limited ability to perform explicit multi-hop reasoning; and 3) suboptimal fusion of language and graph representations. We propose a novel framework with temporal-aware question encoding, multi-hop graph reasoning, and multi-view heterogeneous information fusion. Specifically, our approach introduces: 1) a constraint-aware question representation that combines semantic cues from language models with temporal entity dynamics; 2) a temporal-aware graph neural network for explicit multi-hop reasoning via time-aware message passing; and 3) a multi-view attention mechanism for more effective fusion of question context and temporal graph knowledge. Experiments on multiple TKGQA benchmarks demonstrate consistent improvements over multiple baselines.

[NLP-26] Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

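[Quick read]: This paper addresses the suboptimal coverage and utilization of Internet data that results from applying a single fixed extractor to all webpages when extracting text from HTML for LLM pretraining. The key is to take a union over different extractors, which increases the token yield of DCLM-Baseline by up to 71% while maintaining performance on standard language understanding benchmarks. For structured content such as tables and code blocks, extractor choice also affects downstream performance significantly, with differences of up to 10 percentage points on WikiTQ and 3 percentage points on HumanEval.

Link: https://arxiv.org/abs/2602.19548
Authors: Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pour Ansari, Fartash Faghri
Affiliations: Apple; Stanford University; University of Washington
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: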

Abstract:One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
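The union idea can be sketched as follows: run several extractors on each page and keep the page if any extractor's output survives the fixed filter. The two toy extractors, the word-count filter, and the "first passing extractor wins" rule are assumptions for illustration; the abstract does not describe the authors' actual pipeline at this level of detail.

```python
import re

def paragraphs_only(html):
    """Toy extractor: keep only text inside <p> elements."""
    return " ".join(re.findall(r"<p>(.*?)</p>", html, flags=re.S))

def strip_all_tags(html):
    """Toy extractor: drop every tag and keep all remaining text."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

def keep(text):
    """Stand-in for a fixed quality filter: require at least four words."""
    return len(text.split()) >= 4

def union_extract(pages, extractors, keep):
    """Return {page_id: text} using the first extractor whose output passes `keep`."""
    kept = {}
    for page_id, html in pages.items():
        for extract in extractors:
            text = extract(html)
            if keep(text):
                kept[page_id] = text
                break
    return kept

pages = {
    "a": "<p>short intro</p><table><tr><td>many cells of tabular data here</td></tr></table>",
    "b": "<p>a long enough paragraph of prose</p>",
}
single = union_extract(pages, [paragraphs_only], keep)
union = union_extract(pages, [paragraphs_only, strip_all_tags], keep)
print(sorted(single), sorted(union))
```

Page "a" carries most of its text in a table, so it only survives the filter when the second extractor is allowed to contribute.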

[NLP-27] How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

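[Quick read]: This paper addresses how poorly understood the contribution of reinforcement learning (RL) is in Deep Research agents, which tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. The key is a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. The findings: the Fast Thinking template is more stable and performs better than the Slow Thinking template used in prior work; the F1-based reward underperforms exact match (EM) because answer avoidance drives training collapse, but adding action-level penalties mitigates this and ultimately surpasses EM; and REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO is the least stable. Building on these insights, the authors introduce Search-R1++, a strong baseline that improves Search-R1 from 0.403 to 0.442 on Qwen2.5-7B and from 0.289 to 0.331 on Qwen2.5-3B.

Link: https://arxiv.org/abs/2602.19526
Authors: Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang
Affiliations: NLPR & MAIS; CASIA; UCAS; Meituan Inc.
Categories: Computation and Language (cs.CL)
Comments: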

Abstract:Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
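The two reward signals compared in the abstract can be sketched as below: exact match is all-or-nothing, while token-level F1 gives partial credit. The whitespace tokenization and the `shaped_reward` penalty for rollouts that never answer are assumptions for illustration; the abstract does not specify the form of the action-level penalties.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def exact_match(pred, gold):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def shaped_reward(pred, gold, gave_answer, penalty=0.5):
    """Hypothetical shaping: F1 for answered rollouts, a fixed penalty otherwise."""
    return token_f1(pred, gold) if gave_answer else -penalty

print(exact_match("Paris", "paris"))
print(token_f1("the city of Paris", "Paris"))
print(shaped_reward("", "Paris", gave_answer=False))
```

Under plain F1, an unanswered rollout and a wrong answer both score zero, so nothing discourages avoidance; a penalty of this kind makes not answering strictly worse than attempting.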

[NLP-28] Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

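[Quick read]: This paper addresses the lack of authentic, challenging, multimodal benchmarks for evaluating LLM reasoning across science, technology, engineering, and mathematics (STEM) domains. The authors propose CFE (Classroom Final Exam), a multimodal benchmark curated from repeatedly used university homework and exam problems, together with reference solutions from course instructors. Its key features are that the data come from real teaching, and that reference solutions are decomposed into reasoning flows for diagnostic analysis. That analysis shows that frontier models can often answer intermediate sub-questions correctly but struggle to derive and maintain correct intermediate states across multi-step solutions, and that their solutions typically have more reasoning steps than the instructors', indicating suboptimal step efficiency and a higher risk of error accumulation.

Link: https://arxiv.org/abs/2602.19517
Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
Affiliations: Northwestern University; UC Santa Cruz; Duke University; University of Birmingham; University of Rochester; Analogy AI, Inc.
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: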

Abstract:We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at this https URL.

[NLP-29] Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

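[Quick read]: This paper addresses the trade-off between inference cost and reasoning capability in LLMs: high-performing "Oracle" models (e.g., Llama-3-70B) are accurate but prohibitively expensive for high-volume deployment, while smaller models (e.g., 8B parameters) are cheap but struggle with complex tasks. The key is "Pyramid MoA", a hierarchical Mixture-of-Agents (MoA) architecture in which a lightweight Router escalates queries only when necessary. The Router uses semantic agreement and confidence calibration among an ensemble of small models to identify "hard" problems with high precision. On GSM8K the system reaches 93.0% accuracy against an Oracle baseline of 98.0% while reducing compute costs by 61%, adds little latency (+0.82s), and allows a tunable trade-off between performance and budget.

Link: https://arxiv.org/abs/2602.19509
Authors: Arindam Khaled
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, 1 table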

Abstract:Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While “Oracle” models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose “Pyramid MoA”, a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary. By leveraging semantic agreement and confidence calibration among an ensemble of small models, our Router identifies “hard” problems with high precision. On the GSM8K benchmark, our system achieves 93.0% accuracy, effectively matching the Oracle baseline (98.0%) while reducing compute costs by 61%. We demonstrate that the system introduces negligible latency overhead (+0.82s) and allows for a tunable trade-off between performance and budget.
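A minimal sketch of the escalation logic, assuming exact string agreement among small models as a stand-in for the paper's semantic agreement and confidence calibration. The `route` function, the `min_agreement` threshold, and the toy models are hypothetical.

```python
from collections import Counter

def route(question, small_models, oracle, min_agreement=0.6):
    """Answer with the small-model majority if agreement is high, else escalate."""
    answers = [model(question) for model in small_models]
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return top_answer, "small"
    return oracle(question), "oracle"

# Toy stand-ins for models: each maps a question string to an answer string.
agreeing = [lambda q: "4", lambda q: "4", lambda q: "5"]
disagreeing = [lambda q: "1", lambda q: "2", lambda q: "3"]
oracle = lambda q: "42"

print(route("2 + 2 = ?", agreeing, oracle))
print(route("a hard question", disagreeing, oracle))
```

Raising `min_agreement` sends more queries to the expensive model, which is one way to picture the tunable trade-off between performance and budget.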

[NLP-30] Can Large Language Models Replace Human Coders? Introducing ContentBench

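[Quick read]: This paper asks whether low-cost LLMs can take over the interpretive coding that anchors much empirical content analysis. The key is ContentBench, a public benchmark suite with versioned tracks that measures how much agreement low-cost LLMs achieve, and at what cost, on the same interpretive coding tasks. Reference labels are assigned only when three reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and the author checks all final labels as a quality-control audit. On the first track, the best low-cost LLMs reach roughly 97-99% agreement with these jury labels, far above GPT-3.5 Turbo, and several top models can code 50,000 posts for only a few dollars; small open-weight models that run locally still struggle on sarcasm-heavy items (for example, Llama 3.2 3B reaches only 4% agreement on hard-sarcasm).

Link: https://arxiv.org/abs/2602.19467
Authors: Michael Haman
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project website: this https URL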

Abstract:Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-ResearchTalk v1.0: 1,000 synthetic, social-media-style posts about academic research labeled into five categories spanning praise, critique, sarcasm, questions, and procedural remarks. Reference labels are assigned only when three state-of-the-art reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and all final labels are checked by the author as a quality-control audit. Among the 59 evaluated models, the best low-cost LLMs reach roughly 97-99% agreement with these jury labels, far above GPT-3.5 Turbo, the model behind early ChatGPT and the initial wave of LLM-based text annotation. Several top models can code 50,000 posts for only a few dollars, pushing large-scale interpretive coding from a labor bottleneck toward questions of validation, reporting, and governance. At the same time, small open-weight models that run locally still struggle on sarcasm-heavy items (for example, Llama 3.2 3B reaches only 4% agreement on hard-sarcasm). ContentBench is released with data, documentation, and an interactive quiz at this http URL to support comparable evaluations over time and to invite community extensions.
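The labeling rule and the agreement score described above can be sketched in a few lines. The function names, the item identifiers, and the sample votes are hypothetical; only the unanimity rule and the idea of percent agreement come from the abstract.

```python
def jury_labels(votes_per_item):
    """Keep an item only when every juror assigns the same label."""
    return {
        item: votes[0]
        for item, votes in votes_per_item.items()
        if len(set(votes)) == 1
    }

def agreement(predictions, reference):
    """Share of reference items on which a model's label matches the jury label."""
    scored = [item for item in reference if item in predictions]
    return sum(predictions[item] == reference[item] for item in scored) / len(scored)

# Hypothetical votes from three jurors on four posts.
votes = {
    "post1": ["praise", "praise", "praise"],
    "post2": ["sarcasm", "critique", "sarcasm"],  # not unanimous, so dropped
    "post3": ["question", "question", "question"],
    "post4": ["sarcasm", "sarcasm", "sarcasm"],
}
reference = jury_labels(votes)
model_output = {"post1": "praise", "post3": "question", "post4": "praise"}
print(sorted(reference), agreement(model_output, reference))
```

Dropping non-unanimous items trades coverage for label reliability, which is why the abstract pairs the rule with an author audit.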

[NLP-31] SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning AISTATS2026

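[Quick read]: This paper addresses a gap in time-series diagnostic reasoning: general reasoning large language models (GRLMs) reason well but lack the domain knowledge to understand complex time-series patterns, while fine-tuned time-series LLMs (TSLMs) understand those patterns but do not generalize their reasoning to more complicated questions. The key is a hybrid knowledge-injection framework that injects TSLM-generated insights directly into the GRLM's reasoning trace. Because collecting data for knowledge-injection fine-tuning is costly, the authors use reinforcement learning with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision and then transfer those traces into the GRLM. They also release SenTSR-Bench, a multivariate time-series diagnostic reasoning benchmark collected from real-world industrial operations; across it and other public datasets, the method surpasses TSLMs by 9.1%-26.1% and GRLMs by 7.9%-22.4%.

Link: https://arxiv.org/abs/2602.19455
Authors: Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang, Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, Matthew Reimherr
Affiliations: The Pennsylvania State University; AWS AI Labs; Amazon RME
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Accepted by the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)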

Abstract:Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning for more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework that injects TSLM-generated insights directly into GRLM’s reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge. As collecting data for knowledge injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such an in-domain thinking trace into GRLM for efficient knowledge injection. We further release SenTSR-Bench, a multivariate time-series-based diagnostic reasoning benchmark collected from real-world industrial operations. Across SenTSR-Bench and other public datasets, our method consistently surpasses TSLMs by 9.1%-26.1% and GRLMs by 7.9%-22.4%, delivering robust, context-aware time-series diagnostic insights.

[NLP-32] Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins

[Quick read]: This paper addresses how to predict the perceived message effectiveness (PME) that potential end-users assign to smoking cessation messages, so that personalized messages can be selected and optimized for mobile health (mHealth) delivery. The key is LLM-based digital twins that combine individual characteristics with prior PME histories to generate personalized predictions. Across three domains (content quality, coping support, and quitting support), digital twins outperform zero-shot and few-shot LLMs by 12 percentage points on average and supervised baselines by 13 percentage points, and their predictions spread more widely across rating categories, indicating greater sensitivity to individual differences.

Link: https://arxiv.org/abs/2602.19403
Authors: Jasmin Han (1), Janardan Devkota (1), Joseph Waring (1), Amanda Luken (2), Felix Naughton (3), Roger Vilardaga (4), Jonathan Bricker (5, 6), Carl Latkin (7), Meghan Moran (7), Yiqun Chen (8, 9), Johannes Thrul (1, 10, 11)
Affiliations: (1) Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA; (2) Department of Health Sciences, Towson University, Towson, USA; (3) Addiction Research Group, University of East Anglia, Norwich, UK; (4) Department of Implementation Science, Wake Forest University School of Medicine, Winston-Salem, USA; (5) Fred Hutchinson Cancer Center, Seattle, USA; (6) Department of Psychology, University of Washington, Seattle, USA; (7) Department of Health, Behavior and Society, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA; (8) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, USA; (9) Department of Computer Science, Johns Hopkins Whiting School of Engineering, Baltimore, USA; (10) Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, USA; (11) Centre for Alcohol Policy Research, La Trobe University, Melbourne, Australia
Categories: Computation and Language (cs.CL); Applications (stat.AP)
Comments: 31 pages, 5 figures, submitted to Journal of the American Medical Informatics Association (JAMIA). Drs. Chen and Thrul share last authorship

Abstract:Perceived message effectiveness (PME) by potential intervention end-users is important for selecting and optimizing personalized smoking cessation intervention messages for mobile health (mHealth) platform delivery. This study evaluates whether large language models (LLMs) can accurately predict PME for smoking cessation messages. We evaluated multiple models for predicting PME across three domains: content quality, coping support, and quitting support. The dataset comprised 3010 message ratings (5-point Likert scale) from 301 young adult smokers. We compared (1) supervised learning models trained on labeled data, (2) zero and few-shot LLMs prompted without task-specific fine-tuning, and (3) LLM-based digital twins that incorporate individual characteristics and prior PME histories to generate personalized predictions. Model performance was assessed on three held-out messages per participant using accuracy, Cohen’s kappa, and F1. LLM-based digital twins outperformed zero and few-shot LLMs (12 percentage points on average) and supervised baselines (13 percentage points), achieving accuracies of 0.49 (content), 0.45 (coping), and 0.49 (quitting), with directional accuracies of 0.75, 0.66, and 0.70 on a simplified 3-point scale. Digital twin predictions showed greater dispersion across rating categories, indicating improved sensitivity to individual differences. Integrating personal profiles with LLMs captures person-specific differences in PME and outperforms supervised and zero and few-shot approaches. Improved PME prediction may enable more tailored intervention content in mHealth. LLM-based digital twins show potential for supporting personalization of mobile smoking cessation and other health behavior change interventions.

[NLP-33] Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

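[Quick read] This paper addresses the weak performance of current pre-trained foundation models, such as large language models (LLMs) and vision-language models (VLMs), on implicit pattern recognition under long-tail distributions, particularly when scarce labeled data and limited compute make fine-tuning impractical. The key to the solution is ADAMAB, an efficient embedding calibration framework with two core ideas: (1) lightweight, embedder-agnostic calibrators that calibrate a fixed embedding space without accessing the pre-trained model's parameters, which significantly reduces computational overhead; and (2) an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism, combined with a modified upper confidence bound (UCB) algorithm, which mitigates gradient shifting and provides theoretical convergence guarantees for few-shot training. Experiments show that ADAMAB achieves accuracy gains of up to 40% on multimodal tasks when trained with fewer than 5 initial samples per class.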

Link: https://arxiv.org/abs/2602.19385
Authors: Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja, Yujia Bao
Affiliations: Duke University; Center for Advanced AI, Accenture
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: none listed

Abstract:Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40% accuracy improvement when training with less than 5 initial data samples of each class.
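
The bandit-driven choice of augmentations described above can be illustrated with the textbook UCB1 rule. This is a minimal sketch, not ADAMAB's modified algorithm: the arm names, the simulated success probabilities, and the exploration constant `c` are all assumptions made for illustration.

```python
import math
import random

def ucb1_select(counts, reward_sums, t, c=2.0):
    """Pick the arm with the highest UCB1 score at round t (1-indexed)."""
    # Try every arm once before trusting any confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        reward_sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(reward_fns, rounds, seed=0):
    """Run UCB1 for a number of rounds and return per-arm pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    reward_sums = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, reward_sums, t)
        counts[arm] += 1
        reward_sums[arm] += reward_fns[arm](rng)
    return counts

# Hypothetical augmentation arms: each returns 1.0 when the augmented sample
# helped a (simulated) validation step, with a fixed success probability.
AUGMENTATIONS = ["crop", "paraphrase", "mixup"]
SUCCESS_PROB = [0.2, 0.5, 0.8]
reward_fns = [lambda rng, p=p: float(rng.random() < p) for p in SUCCESS_PROB]

pulls = run_bandit(reward_fns, rounds=2000)
print(dict(zip(AUGMENTATIONS, pulls)))
```

In a real few-shot setting the reward would come from the calibrator's training signal rather than a coin flip; the sketch only shows how pulls concentrate on the most useful augmentation over time.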

[NLP-34] Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

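[Quick read] This paper addresses the gap between the practical performance of agentic memory systems (MAG) for generative AI agents and their theoretical promise. The core challenges are underscaled benchmarks, evaluation metrics misaligned with semantic utility, performance fluctuations caused by dependence on the backbone model, and the system-level overhead of memory maintenance. The key to the solution is a structured analysis framework spanning architecture and system perspectives: it first proposes a concise taxonomy based on four memory structures, then identifies and quantifies the main bottlenecks of current systems, including benchmark saturation effects, insufficient metric validity, backbone sensitivity, and latency and throughput overhead. By linking memory structures to these empirical limitations, it clarifies the root causes of limited performance and points to directions for more reliable evaluation and scalable system design.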

Link: https://arxiv.org/abs/2602.19320
Authors: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li
Affiliations: University of Texas at Dallas; University of California Davis; Texas A&M University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none listed

Abstract:Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

[NLP-35] Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection

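[Quick read] This paper addresses automatic detection of hateful memes in low-resource languages such as Bengali, where the core challenges are scarce annotated data, class imbalance, and pervasive code-mixing. The solution has three parts. First, the original Bengali Hateful Memes (BHM) dataset is extended with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving class balance and semantic diversity. Second, the Enhanced Dual Co-attention Framework (xDORA) fuses vision encoders (CLIP, DINOv2) with multilingual text encoders (XGLM, XLM-R) and learns robust cross-modal representations through weighted attention pooling. Third, a FAISS-based k-nearest neighbor classifier provides non-parametric inference, and the RAG-Fused DORA framework adds retrieval-driven contextual reasoning. Experiments show gains over the DORA baseline on both hateful meme identification and target entity detection, with stronger robustness on rare classes.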

Link: https://arxiv.org/abs/2602.19212
Authors: Raihan Tanvir, Md. Golam Rabiul Alam
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: none listed

Abstract:Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.

[NLP-36] Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content

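[Quick read] This paper addresses the linguistic discrepancies that arise when large language models (LLMs) are applied "naively", that is, prompted without explicit behavioral constraints, as proxies for human participants in social science research; these discrepancies undermine the validity of research findings. The key to the solution is a history-conditioned reply prediction task on authentic X (formerly Twitter) data, used to build a new dataset for evaluating differences between LLM-generated and human-generated content. The discrepancies are quantified with stylistic and content-based metrics, giving researchers an actionable framework for improving the quality and authenticity of synthetic data so that LLM-generated content accurately reflects the complex linguistic patterns of human communication.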

Link: https://arxiv.org/abs/2602.19177
Authors: Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages (12 including references), 2 figures and 2 tables

Abstract:The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their “naive” application, where they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, to create a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.

[NLP-37] TurkicNLP: An NLP Toolkit for Turkic Languages

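[Quick read] This paper addresses the fragmentation of natural language processing (NLP) resources for the Turkic languages, which have over 200 million speakers worldwide yet mostly lack unified toolchains and standardized resources. The key to the solution is TurkicNLP, an open-source Python library whose core contribution is a single, language-agnostic API that integrates a complete NLP pipeline across four script families (Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic): tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation. It uses a modular multi-backend architecture that transparently integrates rule-based finite-state transducers and neural models, includes automatic script detection and routing, and emits output in the CoNLL-U standard for interoperability and extensibility.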

Link: https://arxiv.org/abs/2602.19174
Authors: Sherzod Hakimov
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: none listed

Abstract:Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at this https URL .
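
Automatic script detection across the four script families can be approximated by counting letters per Unicode block. The sketch below is an illustration only and does not use or reproduce the TurkicNLP API; the function name `detect_script` and the majority-vote rule are assumptions.

```python
from collections import Counter

# Unicode ranges (inclusive) for the four script families named in the abstract.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x052F)],
    "Perso-Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],
    "Old Turkic": [(0x10C00, 0x10C4F)],
}

def detect_script(text):
    """Return the script with the most letters in text, or None if none match."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():  # skip digits, punctuation, and whitespace
            continue
        code_point = ord(ch)
        for script, ranges in SCRIPT_RANGES.items():
            if any(lo <= code_point <= hi for lo, hi in ranges):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

print(detect_script("Merhaba dünya"))  # Turkish, Latin script
print(detect_script("Сәлем әлем"))     # Kazakh, Cyrillic script
```

A detector like this could route input to a script-specific tokenizer or transliterator; mixed-script text would need a per-token decision rather than one vote over the whole string.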

[NLP-38] Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing

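[Quick read] This paper evaluates the reasoning ability of large language models (LLMs) in formally specified, rule-governed environments, specifically their performance on tasks such as state prediction and legal action generation in logically rigorous General Game Playing (GGP) settings. The key to the approach is a quantitative analysis framework built on 40 structural game features, together with a systematic evaluation of several mainstream LLMs (such as Gemini 2.5 Pro, Llama 3.3 70B, and GPT-OSS 120B) on forward-simulation tasks (single-step and multi-step state derivation and legal action generation). Game obfuscation is also introduced to probe the role of linguistic semantics in game definitions and the effect of possible prior exposure to specific games during training. Together these reveal the strengths and limits of current models in formal reasoning, including typical errors such as hallucinated rules, redundant state facts, and syntactic errors.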

Link: https://arxiv.org/abs/2602.19160
Authors: Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk
Affiliations: Grail Team; Warsaw University of Technology; AGH University of Krakow
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: none listed

Abstract:This paper examines the reasoning capabilities of Large Language Models (LLMs) from a novel perspective, focusing on their ability to operate within formally specified, rule-governed environments. We evaluate four LLMs (Gemini 2.5 Pro and Flash variants, Llama 3.3 70B and GPT-OSS 120B) on a suite of forward-simulation tasks, including next / multistep state formulation and legal action generation, across a diverse set of reasoning problems illustrated through General Game Playing (GGP) game instances. Beyond reporting instance-level performance, we characterize games based on 40 structural features and analyze correlations between these features and LLM performance. Furthermore, we investigate the effects of various game obfuscations to assess the role of linguistic semantics in game definitions and the impact of potential prior exposure of LLMs to specific games during training. The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps). Detailed case-based analysis of the LLM performance provides novel insights into common reasoning errors in the considered logic-based problem formulation, including hallucinated rules, redundant state facts, or syntactic errors. Overall, the paper reports clear progress in formal reasoning capabilities of contemporary models.

[NLP-39] Asymptotic Semantic Collapse in Hierarchical Optimization

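[Quick read] This paper studies "Asymptotic Semantic Collapse" in multi-agent language systems: during hierarchical optimization, a shared dominant context progressively absorbs individual semantics, so that agent behavior becomes near-uniform. The key to the analysis is to model semantic states as points on a Riemannian manifold and study the induced projection dynamics. This shows that both smooth gradient-style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. It also shows that the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context-bound) representations drives node entropy to zero, meaning the available degrees of freedom vanish. The theory connects information-theoretic quantities with differential-geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar.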

Link: https://arxiv.org/abs/2602.18450
Authors: Faruk Alpay, Bugra Kilictas
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 23 pages, 2 figures. Includes a dataset-free benchmark with full metric reporting

Abstract:Multi-agent language systems can exhibit a failure mode where a shared dominant context progressively absorbs individual semantics, yielding near-uniform behavior across agents. We study this effect under the name Asymptotic Semantic Collapse in Hierarchical Optimization. In a closed linguistic setting with a Dominant Anchor Node whose semantic state has effectively infinite inertia, we show that repeated interactions with Peripheral Agent Nodes drive an asymptotic alignment that minimizes a global loss. We model semantic states as points on a Riemannian manifold and analyze the induced projection dynamics. Two consequences follow. First, the limiting semantic configuration is insensitive to the optimization history: both smooth gradient-style updates and stochastic noisy updates converge to the same topological endpoint, establishing path independence at convergence. Second, the degree of context dependence controls information content: moving from atomic (independent) representations to fully entangled (context-bound) representations forces the node entropy, interpreted as available degrees of freedom, to vanish in the limit. The theory connects information-theoretic quantities with differential-geometric structure and suggests an interpretation as an immutable consensus rule that constrains agents to a shared semantic grammar. A lightweight dataset-free benchmark on an RWKV-7 13B GGUF checkpoint complements the analysis, reporting zero hash collisions, mean compliance of 0.50 under greedy decoding and 0.531 under stochastic decoding, and final Jaccard-to-anchor similarity values of 0.295 and 0.224, respectively.

[NLP-40] Prompt Optimization Via Diffusion Language Models

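[Quick read] This paper addresses how to improve the performance of a downstream large language model (LLM) by iteratively optimizing its system prompt without modifying the model's parameters. The key to the solution is a diffusion-based prompt optimization framework, Diffusion Language Models for Prompt Optimization (DLM-Opt). It uses Diffusion Language Models (DLMs) to make fine-grained, span-level prompt updates through masked denoising, conditioned on user queries, model responses, and optional feedback, so that optimization needs no gradient access and is agnostic to the target model. Experiments show that a moderate number of diffusion steps gives the best balance between refinement quality and stability, and that the optimized prompts consistently improve a frozen target model (such as GPT-4o-mini) across several benchmarks.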

Link: https://arxiv.org/abs/2602.18449
Authors: Shiyu Wang, Haolin Chen, Liangwei Yang, Jielin Qiu, Rithesh Murthy, Ming Zhu, Zixiang Chen, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none listed

Abstract:We propose a diffusion-based framework for prompt optimization that leverages Diffusion Language Models (DLMs) to iteratively refine system prompts through masked denoising. By conditioning on interaction traces, including user queries, model responses, and optional feedback, our method enables flexible, span-level prompt updates without requiring gradient access or modifying the downstream language model. Across diverse benchmarks (e.g., $\tau$-bench, SST-2, SST-5), DLM-optimized prompts consistently improve the performance of a frozen target LLM (e.g., GPT-4o-mini). We further show that moderate diffusion step counts provide the best balance between refinement quality and stability. These results highlight diffusion-based prompt optimization as a general, model-agnostic, and scalable approach for enhancing LLM performance through iterative prompt refinement.

[NLP-41] INSURE-Dial: A Phase-Aware Conversational Dataset Benchmark for Compliance Verification and Phase Detection EACL2026

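[Quick read] This paper addresses the enormous cost of administrative phone tasks in U.S. healthcare, such as insurance benefit verification. The core challenge is building compliance-aware voice agents that support phase-aware auditing, so that call content can be verified for compliance precisely. The key to the solution is INSURE-Dial, to the authors' knowledge the first public benchmark of its kind, containing 50 de-identified real calls initiated by AI to live insurance representatives (averaging 71 turns per call) and 1,000 synthetically generated calls that follow the same workflow. A structured JSON annotation schema covers the key phases: IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN). Each phase is labeled for Information Compliance (IC) and Procedural Compliance (PC) under explicit ask/answer logic. The paper also defines two new evaluation tasks, Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and Compliance Verification (IC/PC decisions given fixed spans), with the aim of advancing audit-grade accuracy for voice agents in healthcare settings.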

Link: https://arxiv.org/abs/2602.18448
Authors: Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

Abstract:Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.

[NLP-42] ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification

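[Quick read] This paper addresses the high inference latency that Chain-of-Thought reasoning imposes on large language models for complex tasks, as well as the long-standing trade-off among accuracy, inference speed, and resource efficiency in existing step-level speculative reasoning methods. The key to the solution is ConfSpec, a confidence-gated cascaded verification framework. Its core insight is an asymmetry between generation and verification: generating a correct reasoning step requires substantial model capacity, whereas step-level verification is a constrained discriminative task on which small draft models are well calibrated within their competence range. High-confidence draft decisions can therefore be accepted directly, and only uncertain cases are escalated to the large target model, yielding efficient and accurate reasoning acceleration.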

Link: https://arxiv.org/abs/2602.18447
Authors: Siran Liu, Cyril Y. He
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none listed

Abstract:Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to $2.24\times$ end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
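
The confidence-gated cascade can be pictured as a two-threshold router. The sketch below is a hedged illustration of that idea, not ConfSpec's implementation: the thresholds `accept_at` and `reject_at`, the verifier signatures, and the stub scores are all assumptions.

```python
from typing import Callable, Tuple

def gated_verify(step: str,
                 draft_verifier: Callable[[str], float],
                 target_verifier: Callable[[str], bool],
                 accept_at: float = 0.9,
                 reject_at: float = 0.1) -> Tuple[bool, str]:
    """Return (verdict, decided_by) for one reasoning step.

    draft_verifier returns the small model's probability that the step is
    correct; target_verifier is the expensive fallback.
    """
    p_correct = draft_verifier(step)
    if p_correct >= accept_at:   # draft is confident the step is right
        return True, "draft"
    if p_correct <= reject_at:   # draft is confident the step is wrong
        return False, "draft"
    # Uncertain band: escalate to the large target model.
    return target_verifier(step), "target"

# Stub verifiers standing in for real models (hypothetical scores).
DRAFT_SCORES = {"2+2=4": 0.97, "2+2=5": 0.03, "x^2>=0 for real x": 0.60}
draft = lambda step: DRAFT_SCORES[step]
target = lambda step: step != "2+2=5"

for step in DRAFT_SCORES:
    print(step, gated_verify(step, draft, target))
```

The width of the uncertain band controls the trade-off: a narrow band sends fewer steps to the target model and saves more compute, but leans harder on the draft model being well calibrated.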

[NLP-43] ReportLogic: Evaluating Logical Quality in Deep Research Reports

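[Quick read] This paper addresses the insufficient logical reliability of deep research reports generated by large language models (LLMs): a report may read as fluent or informative while lacking explicit, verifiable logical relations between its claims and their support, which undermines trust in downstream use. The key to the solution is the ReportLogic benchmark, which takes a reader-centric auditability perspective and builds a three-level logical evaluation taxonomy: Macro-Logic asks whether the report structure follows a unified analytical arc; Expositional-Logic assesses whether the progression of content has the necessary context; and Structural-Logic checks whether conclusions rest on explicit claim–support relations. On this basis the authors construct a human-annotated, rubric-guided dataset and train an open-source logic judge, LogicJudge, for scalable evaluation. Adversarial attacks show that off-the-shelf LLM judges are easily swayed by superficial cues and that reasoning modes can mask broken support relations, giving actionable guidance for improving the logical reliability of reports.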

Link: https://arxiv.org/abs/2602.18446
Authors: Jujia Zhao, Zhaoxin Huan, Zihan Wang, Xiaolu Zhang, Jun Zhou, Suzan Verberne, Zhaochun Ren
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none listed

Abstract:Users increasingly rely on Large Language Models (LLMs) for Deep Research, using them to synthesize diverse sources into structured reports that support understanding and action. In this context, the practical reliability of such reports hinges on logical quality: whether the report’s claims and arguments are explicitly supported and can be trusted as a basis for downstream use, rather than merely appearing fluent or informative. However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. Specifically, ReportLogic adopts a hierarchical taxonomy that evaluates whether readers can (1) trace an on-topic report structure with a unified analytical arc (Macro-Logic), (2) understand the progression with necessary context (Expositional-Logic), and (3) verify conclusions via explicit claim–support (Structural-Logic). Based on this taxonomy, we construct a human-annotated rubric-guided dataset and train an open-source LogicJudge for scalable evaluation. We further evaluate judge robustness via adversarial attacks, showing that off-the-shelf LLM judges are frequently influenced by superficial cues (e.g., verbosity), and reasoning modes can mask broken support relations. Overall, our results provide actionable guidance for building more robust logic evaluators and improving the logical reliability of LLM-generated reports.

Information Retrieval

[IR-0] KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

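[Quick read] This paper addresses the evaluation bottleneck for large language model (LLM) applications such as Retrieval-Augmented Generation (RAG), caused by the time and cost of building specialized assessment datasets. The key to the solution is KNIGHT, an LLM-based, knowledge-graph-driven system for generating multiple-choice question (MCQ) datasets. The framework constructs a topic-specific knowledge graph that captures entities and relations in a structured, compact form and serves as a reusable stored state. Question difficulty (for example, multi-hop questions) can then be controlled flexibly without repeatedly re-feeding the source text, which lowers generation cost. Experiments indicate that KNIGHT achieves high quality on the evaluated criteria and supports domain-agnostic, difficulty-controlled evaluation.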

Link: https://arxiv.org/abs/2602.20135
Authors: Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak
Affiliations: University of Tehran; Independent Researcher; Amirkabir University of Technology; TEIAS Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

Abstract:With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
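
One way to read "multi-hop questions from a reusable graph" is as path enumeration over stored triples, with the hop count acting as a rough difficulty knob. The sketch below assumes that reading; the triple format, the function `k_hop_paths`, and the use of hop count as a difficulty proxy are illustrative assumptions, not KNIGHT's actual procedure.

```python
def k_hop_paths(triples, start, k):
    """Enumerate simple paths of exactly k hops from start.

    triples is an iterable of (head, relation, tail); each returned path is a
    list of k such triples chained head-to-tail.
    """
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))

    frontier = [(start, [])]  # (current node, path taken so far)
    for _ in range(k):
        next_frontier = []
        for node, path in frontier:
            visited = {start} | {t for _, _, t in path}
            for relation, tail in adjacency.get(node, []):
                if tail not in visited:  # avoid revisiting entities
                    next_frontier.append((tail, path + [(node, relation, tail)]))
        frontier = next_frontier
    return [path for _, path in frontier]

# Tiny hand-written graph used only for illustration.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
]

one_hop = k_hop_paths(TRIPLES, "Marie Curie", 1)
two_hop = k_hop_paths(TRIPLES, "Marie Curie", 2)
print(two_hop)
```

A two-hop path such as born_in followed by capital_of could seed a question like "Of which country is Marie Curie's birthplace the capital?", with the final tail entity as the correct option. The graph is read many times but built once, which is the cost argument made in the abstract.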

[IR-1] NanoKnow: How to Know What Your Language Model Knows

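[Quick read] This paper addresses the opacity of knowledge sources in large language models (LLMs): it is hard to tell whether knowledge in a model's output comes from parametric knowledge memorized from pre-training data or from external evidence. Prior approaches are limited by the "black box" nature of pre-training data, which makes knowledge attribution difficult. The key to the solution is to use nanochat, a family of small LLMs with fully open pre-training data, to build the NanoKnow benchmark, which partitions questions from question answering datasets (Natural Questions and SQuAD) according to whether their answers appear in nanochat's pre-training corpus. This makes it possible to disentangle and quantify a model's knowledge sources, to evaluate systematically how performance differs across them, and to expose both the complementarity of parametric knowledge and external evidence and the harm caused by non-relevant context.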

Link: https://arxiv.org/abs/2602.20122
Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin
Affiliations: University of Waterloo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: none listed

Abstract:How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a “black box” – unknown or inaccessible. The recent release of nanochat – a family of small LLMs with fully open pre-training data – addresses this as it provides a transparent view into where a model’s parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat’s pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow’s utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at this https URL.
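
Partitioning questions by whether their answers occur in a corpus can be sketched with normalized string matching. The abstract does not state how presence is determined, so the matching rule below (lowercasing, punctuation stripping, whole-token substring match) and the names `normalize` and `split_by_presence` are assumptions.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def split_by_presence(questions, corpus_docs):
    """Split question ids into (seen, unseen) by answer presence in the corpus.

    Each question is a dict with "id" and "answers" (a list of accepted
    answer strings). Padding with spaces enforces token-boundary matches.
    """
    corpus = [f" {normalize(doc)} " for doc in corpus_docs]
    seen, unseen = [], []
    for question in questions:
        answers = [f" {normalize(a)} " for a in question["answers"]]
        hit = any(answer in doc for answer in answers for doc in corpus)
        (seen if hit else unseen).append(question["id"])
    return seen, unseen

# Toy corpus and questions (hypothetical data).
CORPUS = ["Paris is the capital of France.", "The Nile flows north."]
QUESTIONS = [
    {"id": "q1", "answers": ["Paris"]},
    {"id": "q2", "answers": ["Canberra"]},
    {"id": "q3", "answers": ["Nile River", "the Nile"]},
]

seen_ids, unseen_ids = split_by_presence(QUESTIONS, CORPUS)
print(seen_ids, unseen_ids)
```

A linear scan like this would not scale to a full pre-training corpus, where an inverted index or suffix array would be the more realistic choice; the sketch only shows the shape of the split.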

[IR-2] ManCAR: Manifold-Constrained Latent Reasoning with Adaptive Test-Time Computation for Sequential Recommendation

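[Quick read] This paper addresses latent drift in sequential recommendation: because existing methods drive intermediate reasoning states with target-dominant objectives and impose no explicit feasibility constraints, those states drift away from the plausible item interaction space. The key to the solution is ManCAR (Manifold-Constrained Adaptive Reasoning), a framework that treats recommendation reasoning as navigation over the topology of a global interaction graph rather than free-form latent refinement. ManCAR builds a local intent prior from the collaborative neighborhood of a user's recent actions, represented as a distribution over the item simplex. During training it progressively aligns the model's latent predictive distribution with this prior, keeping the reasoning trajectory within the valid manifold. At test time, reasoning proceeds adaptively until the predictive distribution stabilizes, which avoids over-refinement.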

Link: https://arxiv.org/abs/2602.20093
Authors: Kun Yang, Yuxuan Zhu, Yazhe Chen, Siyao Zheng, Bangyang Hong, Kangle Wu, Yabo Ni, Anxiang Zeng, Cong Fu, Hui Li
Affiliations: unknown
Subjects: Information Retrieval (cs.IR)
Comments: 15 pages, 7 figures

Abstract:Sequential recommendation increasingly employs latent multi-step reasoning to enhance test-time computation. Despite empirical gains, existing approaches largely drive intermediate reasoning states via target-dominant objectives without imposing explicit feasibility constraints. This results in latent drift, where reasoning trajectories deviate into implausible regions. We argue that effective recommendation reasoning should instead be viewed as navigation on a collaborative manifold rather than free-form latent refinement. To this end, we propose ManCAR (Manifold-Constrained Adaptive Reasoning), a principled framework that grounds reasoning within the topology of a global interaction graph. ManCAR constructs a local intent prior from the collaborative neighborhood of a user’s recent actions, represented as a distribution over the item simplex. During training, the model progressively aligns its latent predictive distribution with this prior, forcing the reasoning trajectory to remain within the valid manifold. At test time, reasoning proceeds adaptively until the predictive distribution stabilizes, avoiding over-refinement. We provide a variational interpretation of ManCAR to theoretically validate its drift-prevention and adaptive test-time stopping mechanisms. Experiments on seven benchmarks demonstrate that ManCAR consistently outperforms state-of-the-art baselines, achieving up to a 46.88% relative improvement w.r.t. NDCG@10. Our code is available at this https URL.
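
Adaptive test-time stopping "until the predictive distribution stabilizes" can be sketched as a loop that halts once consecutive distributions are close. The stopping rule used here (KL divergence below a tolerance), the tolerance value, and the toy refinement step are assumptions for illustration, not ManCAR's actual criterion.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def reason_until_stable(step_fn, init_dist, tol=1e-3, max_steps=8):
    """Apply step_fn repeatedly; stop when the distribution barely changes.

    Returns (final_distribution, number_of_steps_taken).
    """
    dist = init_dist
    for step in range(1, max_steps + 1):
        new_dist = step_fn(dist)
        if kl_divergence(new_dist, dist) < tol:
            return new_dist, step
        dist = new_dist
    return dist, max_steps  # budget exhausted without stabilizing

# Toy refinement step: move halfway toward a fixed item distribution.
TARGET = [0.7, 0.2, 0.1]

def toward_target(dist):
    return [0.5 * d + 0.5 * t for d, t in zip(dist, TARGET)]

final_dist, steps_used = reason_until_stable(
    toward_target, [1 / 3, 1 / 3, 1 / 3], max_steps=20
)
print(steps_used, [round(x, 3) for x in final_dist])
```

The `max_steps` cap bounds compute for inputs that never settle, while easy inputs exit early, which is the practical appeal of adaptive depth.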

[IR-3] FairFS: Addressing Deep Feature Selection Biases for Recommender System

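[Quick read] This paper addresses inaccurate feature importance estimation during feature selection in industrial deep learning recommender systems. The core difficulty stems from three biases: layer bias, baseline bias, and approximation bias, which make importance estimates depend on only part of the model's layers, samples, or gradients and so reduce the reliability and performance of the selected feature subsets. The key to the solution is the FairFS algorithm, which regularizes feature importance across all nonlinear transformation layers to mitigate layer bias, and introduces a smooth baseline feature close to the classifier decision boundary together with an aggregated approximation method to reduce baseline and approximation biases, giving fairer and more accurate feature selection.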

Link: https://arxiv.org/abs/2602.20001
Authors: Xianquan Wang, Zhaocheng Du, Jieming Zhu, Qinglin Jia, Zhenhua Dong, Kai Zhang
Affiliations: University of Science and Technology of China; Huawei Noah’s Ark Lab
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by The Web Conference 2026

Abstract:Large-scale online marketplaces and recommender systems serve as critical technological support for e-commerce development. In industrial recommender systems, features play vital roles as they carry information for downstream models. Accurate feature importance estimation is critical because it helps identify the most useful feature subsets from thousands of feature candidates for online services. Such selection enables improved online performance while reducing computational cost. To address feature selection problems in deep learning, trainable gate-based and sensitivity-based methods have been proposed and proven effective in industrial practice. However, through the analysis of real-world cases, we identified three bias issues that cause feature importance estimation to rely on partial model layers, samples, or gradients, ultimately leading to inaccurate importance estimation. We refer to these as layer bias, baseline bias, and approximation bias. To mitigate these issues, we propose FairFS, a fair and accurate feature selection algorithm. FairFS regularizes feature importance estimated across all nonlinear transformation layers to address layer bias. It also introduces a smooth baseline feature close to the classifier decision boundary and adopts an aggregated approximation method to alleviate baseline and approximation biases. Extensive experiments demonstrate that FairFS effectively mitigates these biases and achieves state-of-the-art feature selection performance.

[IR-4] A Context-Aware Knowledge Graph Platform for Stream Processing in Industrial IoT

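[Quick read] This paper addresses interoperable, secure, and context-aware management of heterogeneous, high-velocity data streams in Industrial IoT ecosystems, targeting the limited flexibility, maintainability, and interpretability of current stream management architectures that rely on syntactic integration. The key to the solution is a semantic platform built on a Knowledge Graph that formally represents devices, streams, agents, transformation pipelines, roles, and rights, giving a unified model over heterogeneous sources. Apache Kafka and Apache Flink handle real-time stream processing, while SPARQL and SWRL rules provide context-dependent stream discovery and dynamic role-based access control. Together these support flexible data gathering, composable stream processing pipelines, and context-driven data governance in Industry 5.0 scenarios.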

Link: https://arxiv.org/abs/2602.19990
Authors: Monica Marconi Sciarroni, Emanuele Storti
Affiliations: unknown
Subjects: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments: none listed

Abstract:Industrial IoT ecosystems bring together sensors, machines and smart devices operating collaboratively across industrial environments. These systems generate large volumes of heterogeneous, high-velocity data streams that require interoperable, secure and contextually aware management. Most of the current stream management architectures, however, still rely on syntactic integration mechanisms, which result in limited flexibility, maintainability and interpretability in complex Industry 5.0 scenarios. This work proposes a context-aware semantic platform for data stream management that unifies heterogeneous IoT/IoE data sources through a Knowledge Graph enabling formal representation of devices, streams, agents, transformation pipelines, roles and rights. The model supports flexible data gathering, composable stream processing pipelines, and dynamic role-based data access based on agents’ contexts, relying on Apache Kafka and Apache Flink for real-time processing, while SPARQL and SWRL-based reasoning provide context-dependent stream discovery. Experimental evaluations demonstrate the effectiveness of combining semantic models, context-aware reasoning and distributed stream processing to enable interoperable data workflows for Industry 5.0 environments.

[IR-5] Counterfactual Understanding via Retrieval-aware Multimodal Modeling for Time-to-Event Survival Prediction

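[Quick read]: This paper tackles time-to-event counterfactual survival prediction, aiming to optimize individualized survival outcomes in the presence of heterogeneity and censored data. The core challenge is fusing multimodal clinical information (clinical, paraclinical, demographic, and multi-omics data) while accurately modeling how patient-specific latent subgroups differ in treatment response. The key to the solution is the CURE framework: it first aligns and fuses multimodal features through cross-attention; it then adaptively refines complex multi-omics signals with a mixture-of-experts architecture that emphasizes the most informative omics components; finally, it implicitly retrieves patient-specific latent subgroups from the learned representation, capturing both baseline survival dynamics and treatment-dependent variations. On the METABRIC and TCGA-LUAD datasets, CURE consistently outperforms strong baselines on the Time-dependent Concordance Index ($C^{td}$) and the Integrated Brier Score (IBS), which supports its potential to enhance multimodal understanding and to serve as a foundation for future treatment recommendation models.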

Link: https://arxiv.org/abs/2602.19987
Authors: Ha-Anh Hoang Nguyen, Tri-Duc Phan Le, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Duc-Trong Le, Hoang-Quynh Le
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:

Abstract:This paper tackles the problem of time-to-event counterfactual survival prediction, aiming to optimize individualized survival outcomes in the presence of heterogeneity and censored data. We propose CURE, a framework that advances counterfactual survival modeling via comprehensive multimodal embedding and latent subgroup retrieval. CURE integrates clinical, paraclinical, demographic, and multi-omics information, which are aligned and fused through cross-attention mechanisms. Complex multi-omics signals can be adaptively refined using a mixture-of-experts architecture, emphasizing the most informative omics components. Building upon this representation, CURE implicitly retrieves patient-specific latent subgroups that capture both baseline survival dynamics and treatment-dependent variations. Experimental results on the METABRIC and TCGA-LUAD datasets demonstrate that the proposed CURE model consistently outperforms strong baselines in survival analysis, evaluated using the Time-dependent Concordance Index ($C^{td}$) and the Integrated Brier Score (IBS). These findings highlight the potential of CURE to enhance multimodal understanding and serve as a foundation for future treatment recommendation models. All code and related resources are publicly available to facilitate reproducibility at this https URL.

[IR-6] Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval

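[Quick read]: This paper addresses the information-access challenge that the rapid growth of multimodal information poses for Visual Document Retrieval (VDR), in particular the fact that traditional natural image retrieval cannot handle the dense textual content, intricate layouts, and fine-grained semantic dependencies of visual documents. Its contribution is a systematic survey of how VDR methods have evolved in the Multimodal Large Language Model (MLLM) era. It covers benchmarks, multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems, and it organizes them into a framework that spans both the technical evolution and future directions.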

Link: https://arxiv.org/abs/2602.19961
Authors: Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, Kening Zheng, Xu Zheng, Philip S. Yu, James Kwok, Xuming Hu
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Alibaba Cloud Computing; Hong Kong University of Science and Technology; University of Illinois Chicago
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review

Abstract:With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.

[IR-7] Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation

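[Quick read]: This paper addresses the performance limits that the scarcity of aligned chord labels imposes on Automatic Chord Recognition (ACR), given that well-aligned annotations are costly to acquire. The key to the solution is a two-stage training pipeline. In the first stage, a pre-trained teacher model (BTC) generates pseudo-labels for over 1,000 hours of unlabeled audio, and a student model is trained solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher acting as a regularizer that prevents catastrophic forgetting of the representations learned in the first stage. The method outperforms the traditional supervised learning baseline across standard metrics, with large gains on rare chord qualities.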

Link: https://arxiv.org/abs/2602.19778
Authors: Nghia Phan, Rong Jin, Gang Liu, Xiao Dong
Affiliations: Unknown
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 9 pages, 6 figures, 3 tables

Abstract:Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available, with selective knowledge distillation (KD) from the teacher applied as a regularizer to prevent catastrophic forgetting of the representations learned in the first stage. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher’s performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.55% on average across all metrics. The resulting 2E1D student model improves on the traditional supervised learning baseline by 3.79% on average and achieves almost the same performance as the teacher. Both cases show large gains on rare chord qualities.
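
The stage-2 objective described above can be sketched as a supervised loss plus a distillation term. This is only an illustrative sketch: the abstract does not give the exact loss, so the KL form, the `kd_weight` value, and the `apply_kd` flag (a stand-in for whatever "selective" rule the authors use) are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage2_loss(student_logits, teacher_logits, label, kd_weight=0.5, apply_kd=True):
    """Ground-truth loss plus an optional KD regularizer toward the teacher.

    kd_weight and apply_kd are hypothetical knobs, not values from the paper.
    """
    student = softmax(student_logits)
    loss = cross_entropy(student, label)
    if apply_kd:
        teacher = softmax(teacher_logits)
        loss += kd_weight * kl_divergence(teacher, student)
    return loss
```

When the student already matches the teacher, the KD term vanishes and only the ground-truth loss remains; the further the student drifts from the teacher, the larger the penalty.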

[IR-8] GrIT: Group Informed Transformer for Sequential Recommendation

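[Quick read]: This paper addresses the fact that conventional sequential recommender systems ignore group-level features when modeling user behavior: existing transformer-based methods capture how an individual user's preferences evolve but overlook the influence of the collective behavior of similar users. The key to the solution is to introduce latent group representations with learnable, time-varying membership weights. The weights are computed for each user at each timestep and then used to weight the latent group embeddings into a drift-aware group representation. This representation is integrated with the user's sequential representation inside the transformer block to jointly model personal and group-level temporal dynamics, improving the accuracy and context-awareness of recommendations.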

Link: https://arxiv.org/abs/2602.19728
Authors: Adamya Shyam, Venkateswara Rao Kagita, Bharti Rana, Vikas Kumar
Affiliations: University of Delhi; National Institute of Technology, Warangal
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Sequential recommender systems aim to predict a user’s future interests by extracting temporal patterns from their behavioral history. Existing approaches typically employ transformer-based architectures to process long sequences of user interactions, capturing preference shifts by modeling temporal relationships between items. However, these methods often overlook the influence of group-level features that capture the collective behavior of similar users. We hypothesize that explicitly modeling temporally evolving group features alongside individual user histories can significantly enhance next-item recommendation. Our approach introduces latent group representations, where each user’s affiliation to these groups is modeled through learnable, time-varying membership weights. The membership weights at each timestep are computed by modeling shifts in user preferences through their interaction history, where we incorporate both short-term and long-term user preferences. We extract a set of statistical features that capture the dynamics of user behavior and further refine them through a series of transformations to produce the final drift-aware membership weights. A group-based representation is derived by weighting latent group embeddings with the learned membership scores. This representation is integrated with the user’s sequential representation within the transformer block to jointly capture personal and group-level temporal dynamics, producing richer embeddings that lead to more accurate, context-aware recommendations. We validate the effectiveness of our approach through extensive experiments on five benchmark datasets, where it consistently outperforms state-of-the-art sequential recommendation methods.
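
The step "weighting latent group embeddings with the learned membership scores" can be sketched as a weighted sum. This is an illustrative sketch rather than the authors' implementation: normalizing the scores with a softmax is an assumption, and the function and variable names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax, used here to turn scores into weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def group_representation(membership_scores, group_embeddings):
    """Weighted sum of latent group embeddings for one user at one timestep.

    membership_scores: one raw score per latent group (assumed unnormalized).
    group_embeddings: one embedding vector per latent group, all of equal length.
    """
    weights = softmax(membership_scores)
    dim = len(group_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, group_embeddings))
        for d in range(dim)
    ]
```

For example, `group_representation([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` gives equal weight to both groups and returns `[0.5, 0.5]`.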

[IR-9] A Three-stage Neuro-symbolic Recommendation Pipeline for Cultural Heritage Knowledge Graphs ICCS2026

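[Quick read]: This paper addresses how to recommend semantically related entities across heterogeneous data as the volume of digital cultural heritage resources grows. The core challenge is handling sparse and heterogeneous metadata while keeping recommendations both accurate and explainable. The key to the solution is a three-stage neuro-symbolic recommendation pipeline: knowledge graph embeddings capture semantic relations between entities, approximate nearest-neighbor search provides efficient retrieval, and SPARQL-driven semantic filtering improves the relevance and transparency of the results. Experiments on the JUHMP knowledge graph from the CHExRISH project (about 3.2 million RDF triples) support the effectiveness and practicality of the approach.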

Link: https://arxiv.org/abs/2602.19711
Authors: Krzysztof Kutt, Elżbieta Sroka, Oleksandra Ishchuk, Luiz do Valle Miranda
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 1 figure; submitted to ICCS 2026 conference

Abstract:The growing volume of digital cultural heritage resources highlights the need for advanced recommendation methods capable of interpreting semantic relationships between heterogeneous data entities. This paper presents a complete methodology for implementing a hybrid recommendation pipeline integrating knowledge-graph embeddings, approximate nearest-neighbour search, and SPARQL-driven semantic filtering. The work is evaluated on the JUHMP (Jagiellonian University Heritage Metadata Portal) knowledge graph developed within the CHExRISH project, which at the time of experimentation contained $\approx$3.2 M RDF triples describing people, events, objects, and historical relations affiliated with the Jagiellonian University (Kraków, PL). We evaluate four embedding families (TransE, ComplEx, ConvE, CompGCN) and perform hyperparameter selection for ComplEx and HNSW. Then, we present and evaluate the final three-stage neuro-symbolic recommender. Despite sparse and heterogeneous metadata, the approach produces useful and explainable recommendations, which were also proven with expert evaluation.

[IR-10] DReX: An Explainable Deep Learning-based Multimodal Recommendation Framework

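[Quick read]: This paper addresses three key limitations of multimodal recommender systems: (1) different modalities are processed in isolation, so information is underused; (2) training requires complete multimodal data for every interaction, which limits applicability in practice; and (3) user and item representations are learned independently, which makes their embedding spaces hard to align. The core of the solution is the DReX framework, which uses Gated Recurrent Units (GRU) to incrementally integrate fine-grained interaction-level features and thereby refines user and item representations jointly. This captures both interaction details and broader preference patterns, removes the need for separate feature extraction processes, and provides robustness to missing or incomplete modalities.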

Link: https://arxiv.org/abs/2602.19702
Authors: Adamya Shyam, Venkateswara Rao Kagita, Bharti Rana, Vikas Kumar
Affiliations: University of Delhi, Delhi, India; National Institute of Technology, Warangal, India
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal recommender systems leverage diverse data sources, such as user interactions, content features, and contextual information, to address challenges like cold-start and data sparsity. However, existing methods often suffer from one or more key limitations: processing different modalities in isolation, requiring complete multimodal data for each interaction during training, or independent learning of user and item representations. These factors contribute to increased complexity and potential misalignment between user and item embeddings. To address these challenges, we propose DReX, a unified multimodal recommendation framework that incrementally refines user and item representations by leveraging interaction-level features from multimodal feedback. Our model employs gated recurrent units to selectively integrate these fine-grained features into global representations. This incremental update mechanism provides three key advantages: (1) simultaneous modeling of both nuanced interaction details and broader preference patterns, (2) eliminates the need for separate user and item feature extraction processes, leading to enhanced alignment in their learned representation, and (3) inherent robustness to varying or missing modalities. We evaluate the performance of the proposed approach on three real-world datasets containing reviews and ratings as interaction modalities. By considering review text as a modality, our approach automatically generates interpretable keyword profiles for both users and items, which supplement the recommendation process with interpretable preference indicators. Experiment results demonstrate that our approach outperforms state-of-the-art methods across all evaluated datasets.

[IR-11] Iconographic Classification and Content-Based Recommendation for Digitized Artworks ICCS2026

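[Quick read]: This paper addresses automatic classification and recommendation of image content in digitized art collections, with the goal of faster cataloging and easier navigation in large cultural heritage repositories. The key to the solution is combining computer vision (YOLOv8 object detection) with a symbolic knowledge system (the Iconclass vocabulary): the vision model first detects the visible elements in an image, the Iconclass hierarchy then supports semantic inference, and three complementary recommenders (hierarchical proximity, IDF-weighted overlap, and Jaccard similarity) provide content-based recommendation, mapping visual features to abstract meaning.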

Link: https://arxiv.org/abs/2602.19698
Authors: Krzysztof Kutt, Maciej Baczyński
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 14 pages, 7 figures; submitted to ICCS 2026 conference

Abstract:We present a proof-of-concept system that automates iconographic classification and content-based recommendation of digitized artworks using the Iconclass vocabulary and selected artificial intelligence methods. The prototype implements a four-stage workflow for classification and recommendation, which integrates YOLOv8 object detection with algorithmic mappings to Iconclass codes, rule-based inference for abstract meanings, and three complementary recommenders (hierarchical proximity, IDF-weighted overlap, and Jaccard similarity). Although more engineering is still needed, the evaluation demonstrates the potential of this solution: Iconclass-aware computer vision and recommendation methods can accelerate cataloging and enhance navigation in large heritage repositories. The key insight is to let computer vision propose visible elements and to use symbolic structures (Iconclass hierarchy) to reach meaning.
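
Two of the three recommenders named above, Jaccard similarity and IDF-weighted overlap, can be sketched over sets of Iconclass codes. This is a minimal sketch under stated assumptions: the abstract does not give the formulas, so the plain `log(N / df)` weighting and all function names are hypothetical.

```python
import math

def jaccard(codes_a, codes_b):
    """Jaccard similarity between two collections of Iconclass codes."""
    a, b = set(codes_a), set(codes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def idf_weights(collection):
    """Inverse document frequency per code over a list of per-artwork code lists."""
    n = len(collection)
    doc_freq = {}
    for codes in collection:
        for code in set(codes):
            doc_freq[code] = doc_freq.get(code, 0) + 1
    return {code: math.log(n / count) for code, count in doc_freq.items()}

def idf_overlap(codes_a, codes_b, idf):
    """Sum of IDF weights of the shared codes; rare shared codes count more."""
    return sum(idf.get(code, 0.0) for code in set(codes_a) & set(codes_b))
```

The two scores behave differently by design: Jaccard treats every shared code equally, while the IDF-weighted score gives a code that appears on every artwork a weight of zero, so only distinctive shared motifs contribute.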

[IR-12] Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

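[Quick read]: This paper addresses the efficiency bottleneck of the multi-vector paradigm in Visual Document Retrieval (VDR), whose computation and storage overhead is prohibitive; existing compression methods such as pruning and merging struggle to balance compression rate against feature fidelity. The key to the solution is a two-stage framework, Prune-then-Merge, that combines the two approaches. The first stage uses adaptive pruning to filter out low-information image patches and keep a high-signal set of embeddings. The second stage hierarchically merges the pre-filtered embeddings, compressing them while preserving semantic content and avoiding the noise-induced feature dilution seen in single-stage methods. The result is a better trade-off between compression and retrieval performance.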

Link: https://arxiv.org/abs/2602.19549
Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo, Shuliang Liu, James Kwok, Xuming Hu
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Alibaba Cloud Computing; Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Under review

Abstract:Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
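
The two-stage idea above (prune first, then merge what survives) can be sketched as follows. This is a deliberately simplified sketch, not the paper's method: a fixed `keep_ratio` stands in for the adaptive pruning stage, mean pooling of consecutive groups stands in for hierarchical merging, and the per-patch `scores` are assumed to be given.

```python
import math

def prune(embeddings, scores, keep_ratio):
    """Keep the highest-scoring patch embeddings, preserving their original order."""
    keep = max(1, math.ceil(len(embeddings) * keep_ratio))
    top = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)[:keep]
    return [embeddings[i] for i in sorted(top)]

def merge(embeddings, group_size):
    """Mean-pool consecutive groups of embeddings into one vector per group."""
    merged = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged

def prune_then_merge(embeddings, scores, keep_ratio=0.5, group_size=2):
    """Stage 1 drops low-score patches; stage 2 compresses the survivors."""
    return merge(prune(embeddings, scores, keep_ratio), group_size)
```

Because low-score patches are removed before pooling, they never enter the averages, which is the intuition behind avoiding the feature dilution mentioned in the abstract.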

[IR-13] Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

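[Quick read]: This paper addresses the "scenario gap" in knowledge hypergraph construction: generic extractors struggle to generalize across domains, and existing methods fail to balance structural skeletons with fine-grained details. The key to the solution is Hyper-KGGen, a skill-driven framework that reformulates extraction as a dynamic skill-evolving process. It first applies a coarse-to-fine mechanism to decompose documents systematically, ensuring full-dimensional coverage from binary links to complex hyperedges. It then adds an adaptive skill acquisition module with a stability-based feedback loop, in which extraction stability serves as a relative reward signal for inducing high-quality skills from unstable traces and missed predictions; the skills distilled from domain data are stored in a Global Skill Library, which substantially improves knowledge hypergraph extraction across scenarios.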

Link: https://arxiv.org/abs/2602.19543
Authors: Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong, Chuan Shi, Shaoyi Du, Yue Gao
Affiliations: Xi’an Jiaotong University; Tsinghua University; Shanghai University; Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Knowledge hypergraphs surpass traditional binary knowledge graphs by encapsulating complex $n$-ary atomic facts, providing a more comprehensive paradigm for semantic representation. However, constructing high-quality hypergraphs remains challenging due to the scenario gap: generic extractors struggle to generalize across diverse domains with specific jargon, while existing methods often fail to balance structural skeletons with fine-grained details. To bridge this gap, we propose Hyper-KGGen, a skill-driven framework that reformulates extraction as a dynamic skill-evolving process. First, Hyper-KGGen employs a coarse-to-fine mechanism to systematically decompose documents, ensuring full-dimensional coverage from binary links to complex hyperedges. Crucially, it incorporates an adaptive skill acquisition module that actively distills domain expertise into a Global Skill Library. This is achieved via a stability-based feedback loop, where extraction stability serves as a relative reward signal to induce high-quality skills from unstable traces and missed predictions. Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction. Experiments demonstrate that Hyper-KGGen significantly outperforms strong baselines, validating that evolved skills provide substantially richer guidance than static few-shot examples in multi-scenario settings.

[IR-14] SplitLight: An Exploratory Toolkit for Recommender Systems Datasets and Splits

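[Quick read]: This paper addresses the way opaque, non-standardized choices in data preprocessing and splitting affect offline evaluation of recommender systems: they can reorder model rankings, reduce reproducibility, and make cross-paper comparison difficult. The key solution is SplitLight, an open-source exploratory toolkit that analyzes core and temporal dataset statistics, characterizes repeat consumption patterns and timestamp anomalies, and diagnoses split validity (temporal leakage, cold-user/item exposure, and distribution shifts). It also supports side-by-side comparison of alternative splitting strategies through aggregated summaries and interactive visualizations, making experimental design more transparent and reliable and supporting auditability and comparability in recommender systems research and industry.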

Link: https://arxiv.org/abs/2602.19339
Authors: Anna Volodkevich, Dmitry Anikin, Danil Gusak, Anton Klenitskiy, Evgeny Frolov, Alexey Vasilev
Affiliations: SB AI Lab; Applied AI Institute; AXXX; HSE University; MSU Research Center
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Offline evaluation of recommender systems is often affected by hidden, under-documented choices in data preparation. Seemingly minor decisions in filtering, handling repeats, cold-start treatment, and splitting strategy design can substantially reorder model rankings and undermine reproducibility and cross-paper comparability. In this paper, we introduce SplitLight, an open-source exploratory toolkit that enables researchers and practitioners designing preprocessing and splitting pipelines or reviewing external artifacts to make these decisions measurable, comparable, and reportable. Given an interaction log and derived split subsets, SplitLight analyzes core and temporal dataset statistics, characterizes repeat consumption patterns and timestamp anomalies, and diagnoses split validity, including temporal leakage, cold-user/item exposure, and distribution shifts. SplitLight further allows side-by-side comparison of alternative splitting strategies through comprehensive aggregated summaries and interactive visualizations. Delivered as both a Python toolkit and an interactive no-code interface, SplitLight produces audit summaries that justify evaluation protocols and support transparent, reliable, and comparable experimentation in recommender systems research and industry.
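
Two of the split-validity checks named above, temporal leakage and cold-user/item exposure, can be sketched in a few lines. This is not SplitLight's API: the function names, the `(user, item, timestamp)` tuple layout, and the definition of leakage for a global time split are assumptions made for illustration.

```python
def temporal_leakage(train, test):
    """Count train interactions that happen after the earliest test interaction.

    Assumes a global temporal split, where all training data should precede
    the test period. Interactions are (user, item, timestamp) tuples.
    """
    if not test:
        return 0
    cutoff = min(ts for _, _, ts in test)
    return sum(1 for _, _, ts in train if ts > cutoff)

def cold_share(train, test, field):
    """Share of test interactions whose user or item never appears in train.

    field is "user" or "item".
    """
    idx = {"user": 0, "item": 1}[field]
    if not test:
        return 0.0
    seen = {row[idx] for row in train}
    return sum(1 for row in test if row[idx] not in seen) / len(test)
```

A nonzero leakage count suggests the model could see "future" interactions during training, and a high cold share means much of the test set evaluates cold-start behavior rather than the warm setting a paper may claim to measure.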

[IR-15] PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

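[Quick read]: This paper addresses the lack of a large-scale, balanced dataset for Persian social media text classification, a gap in high-quality, class-balanced annotated resources that has held back Natural Language Processing (NLP) work in this area. The key to the solution is a high-quality dataset of 36,000 annotated samples balanced across nine categories (4,000 per category), annotated with a hybrid strategy that combines ChatGPT-based few-shot prompting with human verification. Class imbalance is mitigated through undersampling with semantic redundancy removal and through data augmentation that integrates lexical replacement and generative prompting. Several models are benchmarked, including XLM-RoBERTa, FaBERT, SBERT, and the Persian-specific TookaBERT; transformer-based models perform best, with TookaBERT-Large on top, which establishes a solid foundation for further Persian NLP research.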

Link: https://arxiv.org/abs/2602.19333
Authors: Isun Chehreh, Ebrahim Ansari
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: 10 pages, including 1 figure

Abstract:This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

[IR-16] Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

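[Quick read]: This paper addresses insufficient personalization in question answering: existing methods struggle to incorporate a user's background, preferences, and historical context when generating answers, so personalization stays at the surface level. The core solution is PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that learns adaptive retrieval-reasoning policies, deciding when to retrieve from the user's profile, what evidence to retrieve, and how to incorporate it into the intermediate steps of multi-turn reasoning. By optimizing multi-turn reasoning trajectories under a personalized reward function, PR2 reinforces reasoning paths that align with user-specific preferences and contextual signals, which yields deeper personalization in QA.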

Link: https://arxiv.org/abs/2602.19317
Authors: Maryam Amirizaniani, Alireza Salemi, Hamed Zamani
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users’ background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user’s profile. Existing methods use the user’s query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.

[IR-17] SIDEKICK: A Semantically Integrated Resource for Drug Effects Indications and Contraindications

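[Quick read]: This paper addresses the limited semantic reasoning capability of existing pharmacovigilance and clinical decision support datasets, and their poor interoperability with Semantic Web ontologies and knowledge graphs, both of which stem from their dependence on terminologies such as MedDRA. The key to the solution is SIDEKICK, a knowledge graph built with a workflow based on Large Language Model (LLM) extraction and Graph-Retrieval Augmented Generation (Graph RAG). The workflow standardizes drug indications, contraindications, and adverse reactions from FDA Structured Product Labels and maps them to the Human Phenotype Ontology (HPO), the MONDO Disease Ontology, and RxNorm. The graph is serialized in the Resource Description Framework (RDF) with the Semanticscience Integrated Ontology (SIO) as its upper-level ontology, which improves semantic integration and interoperability; on drug repurposing by side effect similarity, it outperforms the SIDER and ONSIDES databases.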

Link: https://arxiv.org/abs/2602.19183
Authors: Mohammad Ashhad, Olga Mashkova, Ricardo Henao, Robert Hoehndorf
Affiliations: Unknown
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Pharmacovigilance and clinical decision support systems utilize structured drug safety data to guide medical practice. However, existing datasets frequently depend on terminologies such as MedDRA, which limits their semantic reasoning capabilities and their interoperability with Semantic Web ontologies and knowledge graphs. To address this gap, we developed SIDEKICK, a knowledge graph that standardizes drug indications, contraindications, and adverse reactions from FDA Structured Product Labels. We developed and used a workflow based on Large Language Model (LLM) extraction and Graph-Retrieval Augmented Generation (Graph RAG) for ontology mapping. We processed over 50,000 drug labels and mapped terms to the Human Phenotype Ontology (HPO), the MONDO Disease Ontology, and RxNorm. Our semantically integrated resource outperforms the SIDER and ONSIDES databases when applied to the task of drug repurposing by side effect similarity. We serialized the dataset as a Resource Description Framework (RDF) graph and employed the Semanticscience Integrated Ontology (SIO) as upper level ontology to further improve interoperability. Consequently, SIDEKICK enables automated safety surveillance and phenotype-based similarity analysis for drug repurposing.

[IR-18] FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations AAAI2026

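[Quick read]: This paper addresses two core citation errors that Large Language Models (LLMs) make when generating content with citations: mismatch, where the cited source does not support the generated content, and irrelevance, where the cited source is unrelated to the user query. Both are especially pronounced in real-world settings with noisy or redundant retrieval results, where they seriously reduce the trustworthiness and robustness of answers. The key to the solution is the FineRef framework, built on Fine-grained error Reflection, which uses a two-stage training strategy to teach the model to identify and correct errors on a per-citation basis. The first stage uses supervised fine-tuning to instill an "attempt-reflect-correct" behavioral pattern, with controllable reflection data constructed by lightweight models. The second stage applies process-level reinforcement learning with a multi-dimensional reward scheme that promotes reflection accuracy, answer quality, and correction gain. On the ALCE benchmark, the 7B model outperforms GPT-4 by up to 18% in Citation F1 and 4% in EM Recall, and the method shows strong generalization and robustness to noise.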

Link: https://arxiv.org/abs/2602.18437
Authors: Yixing Peng, Licheng Zhang, Shancheng Fang, Yi Liu, Peijian Gu, Quan Wang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, AAAI 2026

Abstract:Generating with citations is crucial for trustworthy Large Language Models (LLMs), yet even advanced LLMs often produce mismatched or irrelevant citations. Existing methods over-optimize citation fidelity while overlooking relevance to the user query, which degrades answer quality and robustness in real-world settings with noisy or irrelevant retrieved content. Moreover, the prevailing single-pass paradigm struggles to deliver optimal answers in long-form generation that requires multiple citations. To address these limitations, we propose FineRef, a framework based on Fine-grained error Reflection, which explicitly teaches the model to self-identify and correct two key citation errors, mismatch and irrelevance, on a per-citation basis. FineRef follows a two-stage training strategy. The first stage instills an “attempt-reflect-correct” behavioral pattern via supervised fine-tuning, using fine-grained and controllable reflection data constructed by specialized lightweight models. An online self-reflective bootstrapping strategy is designed to improve generalization by iteratively enriching training data with verified, self-improving examples. To further enhance the self-reflection and correction capability, the second stage applies process-level reinforcement learning with a multi-dimensional reward scheme that promotes reflection accuracy, answer quality, and correction gain. Experiments on the ALCE benchmark demonstrate that FineRef significantly improves both citation performance and answer accuracy. Our 7B model outperforms GPT-4 by up to 18% in Citation F1 and 4% in EM Recall, while also surpassing the state-of-the-art model across key evaluation metrics. FineRef also exhibits strong generalization and robustness in domain transfer settings and noisy retrieval scenarios.

Human-Computer Interaction

[HC-0] Align When They Want, Complement When They Need! Human-Centered Ensembles for Adaptive Human-AI Collaboration AAAI2026

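[Quick read]: This paper addresses a core tension in human-AI collaborative decision making. A single AI model is usually designed for either complementarity or alignment: the former boosts performance but can erode human trust, while the latter builds trust but risks reinforcing suboptimal human behavior, and either way overall human-AI team performance suffers. To get past this limitation, the authors propose a human-centered adaptive AI ensemble. Its key idea is to switch dynamically, based on contextual cues, between two specialist AI models, an aligned model and a complementary model, using a simple yet provably near-optimal "Rational Routing Shortcut" mechanism, so that human-AI collaboration is as effective as possible across situations.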

Link: https://arxiv.org/abs/2602.20104
Authors: Hasan Amin, Ming Yin, Rajiv Khanna
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: AAAI 2026

Abstract:In human-AI decision making, designing AI that complements human expertise has been a natural strategy to enhance human-AI collaboration, yet it often comes at the cost of decreased AI performance in areas of human strengths. This can inadvertently erode human trust and cause them to ignore AI advice precisely when it is most needed. Conversely, an aligned AI fosters trust yet risks reinforcing suboptimal human behavior and lowering human-AI team performance. In this paper, we start by identifying this fundamental tension between performance-boosting (i.e., complementarity) and trust-building (i.e., alignment) as an inherent limitation of the traditional approach for training a single AI model to assist human decision making. To overcome this, we introduce a novel human-centered adaptive AI ensemble that strategically toggles between two specialist AI models - the aligned model and the complementary model - based on contextual cues, using an elegantly simple yet provably near-optimal Rational Routing Shortcut mechanism. Comprehensive theoretical analyses elucidate why the adaptive AI ensemble is effective and when it yields maximum benefits. Moreover, experiments on both simulated and real-world data show that when humans are assisted by the adaptive AI ensemble in decision making, they can achieve significantly higher performance than when they are assisted by single AI models that are trained to either optimize for their independent performance or even the human-AI team performance.

[HC-1] Studying the Separability of Visual Channel Pairs in Symbol Maps

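Quick read: This paper addresses the lack of systematic empirical evidence on the separability of visual channels in multivariate maps, focusing on channel combinations in symbol maps. The key to its approach is a crowdsourced experiment that quantifies the separability of four visual channel pairs (color x shape, color x size, size x shape, size x orientation) in bivariate symbol maps. Color x shape is the most separable pair and size x orientation the least. Separability is also asymmetric: performance depends on which channel encodes the task-relevant variable, with color and shape outperforming size and square shapes especially hard to discriminate. These findings give multivariate map design an empirical basis.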

Link: https://arxiv.org/abs/2602.20022
Authors: Poorna Talkad Sukumar, Maurizio Porfiri, Oded Nov
Institutions: Tandon School of Engineering; New York University
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Visualizations often encode multivariate data by mapping attributes to distinct visual channels such as color, size, or shape. The effectiveness of these encodings depends on separability–the extent to which channels can be perceived independently. Yet systematic evidence for separability, especially in map-based contexts, is lacking. We present a crowdsourced experiment that evaluates the separability of four channel pairs–color (ordered) x shape, color (ordered) x size, size x shape, and size x orientation–in the context of bivariate symbol maps. Both accuracy and speed analyses show that color x shape is the most separable and size x orientation the least separable, while size x color and size x shape do not differ. Separability also proved asymmetric–performance depended on which channel encoded the task-relevant variable, with color and shape outperforming size, and square shape especially difficult to discriminate. Our findings advance the empirical understanding of visual separability, with implications for multivariate map design.

[HC-2] Protecting and Promoting Human Agency in Education in the Age of Artificial Intelligence

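Quick read: This paper addresses the challenge that the widespread use of generative AI in education poses to human agency. Its key contribution is to reconceptualize human agency along four aspects: human oversight, AI-human complementarity, AI competencies, and relational emergence. It focuses on practical dilemmas around normative constraints, transparency, and cognitive offloading to inform ethical and effective integration of generative AI in education.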

Link: https://arxiv.org/abs/2602.20014
Authors: Olga Viberg, Mutlu Cukurova, Rene F. Kizilcec, Simon Buckingham Shum, Dorottya Demszky, Dragan Gašević, Thorben Jansen, Ioana Jivet, Jelena Jovanovic, Jennifer Meyer, Kou Murayama, Zach Pardos, Chris Piech, Nikol Rummel, Naomi E. Winstone
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: O.V., M.C., and R.F.K. organized the meeting and wrote the first version of the report. All authors contributed to the revision of the manuscript, and read and approved the final version

Abstract:Human agency is crucial in education and increasingly challenged by the use of generative AI. This meeting report synthesizes interdisciplinary insights and conceptualizes four aspects that delineate human agency: human oversight, AI-human complementarity, AI competencies, and relational emergence. We explore practical dilemmas for protecting and promoting agency, focusing on normative constraints, transparency, and cognitive offloading, and highlight key tensions and implications to inform ethical and effective AI integration in education.

[HC-3] GazeFlow: Personalized Ambient Soundscape Generation for Passive Strabismus Self-Monitoring

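Quick read: This paper addresses the lack of accessible, passive self-monitoring tools for people with strabismus recovering from corrective surgery; traditional dichoptic therapies depend on active engagement and clinical supervision, which makes them hard to use for everyday self-awareness. The proposed system, GazeFlow, runs in the browser on webcam-based gaze tracking, uses a personalized temporal autoencoder to detect eye drift patterns, and follows calm computing principles to give unobtrusive audio feedback: musical parameters morph with drift severity while staying in the user's peripheral awareness. To handle inter-individual variability and domain transfer (from 1000 Hz research-grade devices to 30 Hz consumer webcams), the authors introduce three techniques: Binocular Temporal-Frequency Disentanglement (BTFD), Contrastive Biometric Pre-training (CBP), and the meta-learning method Gaze-MAML. The approach reaches F1=0.84 for drift detection on GazeBase, and a preliminary user study (N=6) supports its potential for raising self-awareness.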

Link: https://arxiv.org/abs/2602.19966
Authors: Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
Institutions: BNRIST; Tsinghua University
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Strabismus affects 2-4% of the population, yet individuals recovering from corrective surgery lack accessible tools for monitoring eye alignment. Dichoptic therapies require active engagement and clinical supervision, limiting their adoption for passive self-awareness. We present GazeFlow, a browser-based self-monitoring system that uses a personalized temporal autoencoder to detect eye drift patterns from webcam-based gaze tracking and provides ambient audio feedback. Unlike alert-based systems, GazeFlow operates according to calm computing principles, morphing musical parameters in proportion to drift severity while remaining in peripheral awareness. We address the challenges of inter-individual variability and domain transfer (1000Hz research to 30Hz webcam) by introducing Binocular Temporal-Frequency Disentanglement (BTFD), Contrastive Biometric Pre-training (CBP), and Gaze-MAML. We validate our approach on the GazeBase dataset (N=50), achieving F1=0.84 for drift detection, and conduct a preliminary user study (N=6) with participants having intermittent strabismus. Participants reported increased awareness of their eye behaviour (M=5.8/7) and a preference for ambient feedback over alerts (M=6.2/7). We discuss the system’s potential for self-awareness applications and outline directions for clinical validation.
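To make "morphing musical parameters in proportion to drift severity" concrete, here is a minimal sketch that linearly blends between two presets. The parameter names (`tempo_bpm`, `detune_cents`), the preset values, and the assumption that severity is already normalized to [0, 1] are all hypothetical; the abstract does not say which musical parameters GazeFlow changes.

```python
def morph_parameters(severity: float) -> dict:
    """Linearly blend from a calm preset to a tense preset.

    `severity` is assumed to be a normalized drift score; values outside
    [0, 1] are clamped so the audio never leaves the preset range.
    """
    s = min(1.0, max(0.0, severity))
    calm = {"tempo_bpm": 60.0, "detune_cents": 0.0}    # hypothetical preset
    tense = {"tempo_bpm": 90.0, "detune_cents": 40.0}  # hypothetical preset
    return {name: calm[name] + s * (tense[name] - calm[name]) for name in calm}
```

Because the mapping is continuous, small drifts produce small audible changes, which is one way to keep feedback in peripheral awareness instead of raising a discrete alert.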

[HC-4] Progressive Value Reading: The Use of Motion to Gradually Examine Data Involving Large Magnitudes

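Quick read: This paper addresses the difficulty people have in intuitively grasping data with extremely large or small values, or ranges spanning several orders of magnitude. Traditional approaches such as log scales and multiscale visualizations help but have limits. The key idea is a design strategy the authors call progressive value reading: motion (for example, long scrolling or street paintings hundreds of meters long) lets viewers experience differences in magnitude gradually, turning magnitude into a felt sense of duration and effort.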

Link: https://arxiv.org/abs/2602.19853
Authors: Leni Yang, Aymeric Ferron, Yvonne Jansen, Pierre Dragicevic
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:People often struggle to interpret data with extremely large or small values, or ranges spanning multiple orders of magnitude. While traditional approaches, such as log scales and multiscale visualizations, can help, we explore in this article a different approach used in some emerging designs: the use of motion to let viewers gradually experience magnitude – for example, interactive graphics that require long scrolling or street paintings stretching hundreds of meters. This approach typically demands substantial time and sustained interaction, translating differences in magnitude into a visceral sense of duration and effort. Although largely underexplored, this design strategy offers new opportunities. We introduce the term progressive value reading to refer to the use of motion to progressively examine an information object that encodes a value, where the amount of motion reflects the value. We compiled a corpus of 55 real-life and hypothetical visualization examples that allow, encourage, or require progressive value reading. From this corpus, we derived a design space of ten design dimensions, providing a shared vocabulary, inspiration for novel techniques, and a foundation for empirical evaluation. An online corpus is also available for exploration.
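A small worked example can show why linear motion makes magnitude tangible. The sketch below assumes a scrolling graphic with a fixed linear scale; the function name, the scale of one million units per pixel, and the scroll speed are illustrative assumptions, not values from the paper.

```python
def scroll_cost(value: float, units_per_pixel: float, pixels_per_second: float):
    """Return (pixels, seconds) needed to scroll past `value` on a linear scale."""
    pixels = value / units_per_pixel
    seconds = pixels / pixels_per_second
    return pixels, seconds


# Hypothetical setup: 1 pixel encodes one million units, scrolling at 1000 px/s.
for label, value in (("million", 1e6), ("billion", 1e9), ("trillion", 1e12)):
    pixels, seconds = scroll_cost(value, units_per_pixel=1e6, pixels_per_second=1000.0)
    print(f"{label}: {pixels:,.0f} px, {seconds:,.3f} s")
```

Under these assumptions a billion takes one second of scrolling and a trillion takes about seventeen minutes, which is the kind of duration-and-effort contrast the abstract describes.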

[HC-5] Ambient Analytics: Calm Technology for Immersive Visualization and Sensemaking

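Quick read: This paper asks how, in augmented reality (AR), data visualizations can blend into everyday life without overloading users with information. The key idea is to borrow from calm technologies: visualizations are placed in the user's attentional periphery rather than demanding focused engagement, which moves from visual analytics toward ambient analytics and minimizes cognitive load.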

Link: https://arxiv.org/abs/2602.19809
Authors: Sebastian Hubenschmid, Arvind Srinivasan, Niklas Elmqvist, Dieter Schmalstieg, Michael Sedlmair
Institutions: Aarhus University; University of Stuttgart; Graz University of Technology
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at “Visualization Viewpoints” in IEEE Computer Graphics and Applications

Abstract:Augmented reality has great potential for embedding data visualizations in the world around the user. While this can enhance users’ understanding of their surroundings, it also bears the risk of overwhelming their senses with a barrage of information. In contrast, calm technologies aim to place information in the user’s attentional periphery, minimizing cognitive load instead of demanding focused engagement. In this column, we explore how visualizations can be harmoniously integrated into our everyday life through augmented reality, progressing from visual analytics to ambient analytics.

[HC-6] Unfolding Ordered Matrices into BioFabric Motifs

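Quick read: This paper addresses the lack of an efficient layout algorithm for BioFabric visualizations, that is, one that automatically recognizes patterns in a network and produces vertex and edge orders that let those patterns be expressed clearly as motifs, improving the readability of complex networks. The key is to use ordered matrices as a tool: the adjacency matrix of the input graph is ordered using Moran’s I, (noisy) patterns are detected with the authors' recent algorithm, and the ordered matrix and its patterns are then “unfolded” into a high-quality BioFabric. The pipeline easily handles graphs with up to 250 vertices.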

Link: https://arxiv.org/abs/2602.19745
Authors: Jules Wulms, Wouter Meulemans, Bettina Speckmann
Institutions: TU Eindhoven
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:BioFabrics were introduced by Longabaugh in 2012 as a way to draw large graphs in a clear and uncluttered manner. The visual quality of BioFabrics crucially depends on the order of vertices and edges, which can be chosen independently. Effective orders can expose salient patterns, which in turn can be summarized by motifs, allowing users to take in complex networks at-a-glance. However, so far there is no efficient layout algorithm which automatically recognizes patterns and delivers both a vertex and an edge ordering that allows these patterns to be expressed as motifs. In this paper we show how to use well-ordered matrices as a tool to efficiently find good vertex and edge orders for BioFabrics. Specifically, we order the adjacency matrix of the input graph using Moran’s I and detect (noisy) patterns with our recent algorithm. In this note we show how to “unfold” the ordered matrix and its patterns into a high-quality BioFabric. Our pipeline easily handles graphs with up to 250 vertices.
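Moran's I measures spatial autocorrelation, so it can score how "clumped" the cells of an ordered adjacency matrix are. The sketch below computes it with rook (horizontal and vertical) neighbour weights, which is an assumption: the abstract does not state the neighbourhood the authors use, and the optimization that searches for a good order is not shown. The diagonal is set to 1 purely for illustration.

```python
def morans_i(matrix):
    """Moran's I of a 2D grid with rook-adjacency weights of 1."""
    rows, cols = len(matrix), len(matrix[0])
    n = rows * cols
    mean = sum(map(sum, matrix)) / n
    dev = [[v - mean for v in row] for row in matrix]
    denom = sum(d * d for row in dev for d in row)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant matrix")
    num, weight_sum = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # each neighbour pair counted once
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    num += dev[r][c] * dev[rr][cc]
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)


def reorder(matrix, order):
    """Apply the same vertex order to rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]


# Two 2-vertex clusters; the same graph under two vertex orders.
blocked = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 1]]
scrambled = reorder(blocked, [0, 2, 1, 3])  # becomes a checkerboard
```

The blocked order scores higher than the scrambled one, which is the sense in which a higher Moran's I indicates an order that exposes block-like patterns.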

[HC-7] Git Takes Two: Split-View Awareness for Collaborative Learning of Distributed Workflows in Git

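Quick read: This paper addresses the difficulty newcomers have in understanding collaboration when learning the distributed version control system Git. Most learning tools focus on individual workflows and overlook that Git is inherently collaborative. The authors present GitAcademy, a platform whose core innovation is a split-view collaborative mode: learners work on their own local repositories while seeing, in real time, their partner's actions against a shared remote repository, which builds awareness of distributed states, coordination, and collaborative troubleshooting. As a training-only scaffold, the design enhanced social presence and supported peer teaching, and learners consistently preferred it over a single-view baseline, although performance gains were mixed.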

Link: https://arxiv.org/abs/2602.19714
Authors: Joel Bucher, Lahari Goswami, Sverrir Thorgeirsson, April Yi Wang
Institutions: ETH Zurich
Categories: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: First two authors contributed equally

Abstract:Git is widely used for collaborative software development, but it can be challenging for newcomers. While most learning tools focus on individual workflows, Git is inherently collaborative. We present GitAcademy, a browser-based learning platform that embeds a full Git environment with a split-view collaborative mode: learners work on their own local repositories connected to a shared remote repository, while simultaneously seeing their partner’s actions mirrored in real time. This design is not intended for everyday software development, but rather as a training simulator to build awareness of distributed states, coordination, and collaborative troubleshooting. In a within-subjects study with 13 pairs of learners, we found that the split-view interface enhanced social presence, supported peer teaching, and was consistently preferred over a single-view baseline, even though performance gains were mixed. We further discuss how split-view awareness can serve as a training-only scaffold for collaborative learning of Git and other distributed technical systems.

[HC-8] Shifting Engagement With Cybersecurity: How People Discover and Share Cybersecurity Content at Work and at Home

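Quick read: This paper examines how cybersecurity awareness is shaped by workplace training and by the ways people encounter information in their personal lives. Its main findings are that people who undertake cybersecurity training at work show reduced intention to share cybersecurity information at home, shifting the focus toward the workplace, and that they are more likely to recall cybersecurity information shared by their employer than from any other source, with recall correlated with content type and distribution channel. The authors reflect on this shift and highlight opportunities to improve cybersecurity information sharing both at work and at home.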

Link: https://arxiv.org/abs/2602.19695
Authors: William Seymour, Martin J. Kraemer
Institutions: King’s College London; KnowBe4
Categories: Human-Computer Interaction (cs.HC)
Comments: To appear in the extended abstracts of the 2026 ACM CHI Conference on Human Factors in Computing Systems

Abstract:Cybersecurity awareness is shaped by a wide range of professional and personal experiences, including information and training at work and the sharing of news and other content at home. In order to explore how people discover cybersecurity content and the effect that participation in workplace training may have on this, we present an online study of 1200 participants from the UK, US, France, and Germany. Those undertaking cybersecurity training at work showed reduced intention to share information at home, shifting the focus towards the workplace. They were also more likely to recall cybersecurity information shared by their employer than from any other source, which in turn correlated with content type and distribution channel. We critically reflect on this shift, highlighting opportunities to improve cybersecurity information sharing at work and at home.

[HC-9] “The explanation makes sense”: An Empirical Study on LLM Performance in News Classification and its Influence on Judgment in Human-AI Collaborative Annotation

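Quick read: This paper addresses the harmful effect of spreading media bias on political discourse and public opinion, and asks how large language models (LLMs) can make news classification more accurate and trustworthy. The core questions are whether LLMs can reliably classify U.S. news by political ideology and how such AI assistance affects human judgment. The approach applies prompt engineering to GPT models and presents predictions with brief or detailed explanations. Experiments show that AI assistance significantly increases users' confidence (p < 0.001) and that detailed explanations are more persuasive and more likely to change decisions; the work offers empirical evidence for human-AI collaborative news annotation and a dataset for further research.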

Link: https://arxiv.org/abs/2602.19690
Authors: Qile Wang, Prerana Khatiwada, Avinash Chouhan, Ashrey Mahesh, Joy Mwaria, Duy Duc Tran, Kenneth E. Barner, Matthew Louis Mauriello
Institutions: University of Delaware
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:The spread of media bias is a significant concern as political discourse shapes beliefs and opinions. Addressing this challenge computationally requires improved methods for interpreting news. While large language models (LLMs) can scale classification tasks, concerns remain about their trustworthiness. To advance human-AI collaboration, we investigate the feasibility of using LLMs to classify U.S. news by political ideology and examine their effect on user decision-making. We first compared GPT models with prompt engineering to state-of-the-art supervised machine learning on a 34k public dataset. We then collected 17k news articles and tested GPT-4 predictions with brief and detailed explanations. In a between-subjects study (N=124), we evaluated how LLM-generated explanations influence human annotation, judgment, and confidence. Results show that AI assistance significantly increases confidence (p < .001), with detailed explanations more persuasive and more likely to alter decisions. We highlight recommendations for AI explanations through thematic analysis and provide our dataset for further research.

[HC-10] Cooperation After the Algorithm: Designing Human-AI Coexistence Beyond the Illusion of Collaboration

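Quick read: This paper addresses the governance failures and breakdowns in collaboration that arise when generative AI is used in high-stakes domains without clear responsibility or a balanced allocation of risk. The core challenge is that AI systems produce fluent, adaptive outputs yet bear no liability and share no stake in the consequences; this structural asymmetry between humans and AI leads to professional errors and institutional failures. The proposed solution is a governance framework centered on institutional infrastructure: a formal inequality specifies when reliance on AI has positive expected cooperative value, and a Cooperation Ecology Framework adds six design principles and three policy artefacts (a Human-AI Cooperation Charter, a Defection Risk Register, and a Cooperation Readiness Audit). This shifts the unit of analysis from the user-AI dyad to the institutional environment that shapes incentives, signals, accountability, and repair, with the aim of accountable, trustworthy human-AI cooperation over time.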

Link: https://arxiv.org/abs/2602.19629
Authors: Tatia Codreanu
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 tables

Abstract:Generative artificial intelligence systems increasingly participate in research, law, education, media, and governance. Their fluent and adaptive outputs create an experience of collaboration. However, these systems do not bear responsibility, incur liability, or share stakes in downstream consequences. This structural asymmetry has already produced sanctions, professional errors, and governance failures in high-stakes contexts. We argue that stable human-AI coexistence is an institutional achievement that depends on governance infrastructure capable of distributing residual risk. Drawing on institutional analysis and evolutionary cooperation theory, we introduce a formal inequality that specifies when reliance on AI yields positive expected cooperative value. The model makes explicit how governance conditions, system policy, and accountability regimes jointly determine whether cooperation is rational or structurally defective. From this formalization we derive a cooperation ecology framework with six design principles: reciprocity contracts, visible trust infrastructure, conditional cooperation modes, defection-mitigation mechanisms, narrative literacy against authority theatre, and an Earth-first sustainability constraint. We operationalize the framework through three policy artefacts: a Human-AI Cooperation Charter, a Defection Risk Register, and a Cooperation Readiness Audit. Together, these elements shift the unit of analysis from the user-AI dyad to the institutional environment that shapes incentives, signals, accountability, and repair. The paper provides a theoretical foundation and practical toolkit for designing human-AI systems that can sustain accountable, trustworthy cooperation over time.

[HC-11] PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring

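Quick read: This paper addresses the tendency of current Text-to-Video (T2V) generative AI models to optimize for visual fidelity rather than instructional efficacy. Existing approaches typically generate in “one shot” and offer little support for the design logic of educational content, so the resulting instructional videos do little to support learners' cognitive processing. The proposed system, PedaCo-Gen, is grounded in Mayer’s Cognitive Theory of Multimedia Learning (CTML) and introduces an Intermediate Representation (IR) phase in which educators interactively review and refine video blueprints (scripts and visual descriptions) with an AI reviewer. This human-AI collaboration improves video quality, and the AI guidance acts as a metacognitive scaffold that augments educators' instructional design expertise, combining generative power with human professional expertise.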

Link: https://arxiv.org/abs/2602.19623
Authors: Injun Baek, Yearim Kim, Nojun Kwak
Institutions: Seoul National University; Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces PedaCo-Gen, a pedagogically-informed human-AI collaborative video generating system for authoring instructional videos based on Mayer’s Cognitive Theory of Multimedia Learning (CTML). Moving away from traditional “one-shot” generation, PedaCo-Gen introduces an Intermediate Representation (IR) phase, enabling educators to interactively review and refine video blueprints (comprising scripts and visual descriptions) with an AI reviewer. Our study with 23 education experts demonstrates that PedaCo-Gen significantly enhances video quality across various topics and CTML principles compared to baselines. Participants perceived the AI-driven guidance not merely as a set of instructions but as a metacognitive scaffold that augmented their instructional design expertise, reporting high production efficiency (M=4.26) and guide validity (M=4.04). These findings highlight the importance of reclaiming pedagogical agency through principled co-creation, providing a foundation for future AI authoring tools that harmonize generative power with human professional expertise.

[HC-12] Identifying, Explaining, and Correcting Ableist Language with AI

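Quick read: This paper addresses ableist language, which is often implicit in everyday communication and hard to recognize, and which reinforces harmful stereotypes and excludes disabled people. The approach uses large language models (LLMs), specifically ChatGPT (GPT-4o), to detect and rewrite biased wording and to promote inclusive communication, and compares the output with annotations from trained disability community members. Participants agreed equally with human and AI annotations but significantly preferred the AI for its narrative consistency and accessible style, while valuing the emotional depth and cultural grounding of the human annotations. The results show both the promise and the limits of LLMs on culturally sensitive content and provide a dataset and design considerations for inclusive writing tools.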

Link: https://arxiv.org/abs/2602.19560
Authors: Kynnedy Simone Smith, Lydia B. Chilton, Danielle Bragg
Institutions: Carnegie Mellon University; Columbia University; Microsoft Research
Categories: Human-Computer Interaction (cs.HC)
Comments: 17 pages, 6 figures, Accepted for publication in CHI’26, Barcelona, Spain, April 13 - 17, 2026; CHI '26: ACM CHI Conference on Human Factors in Computing Systems

Abstract:Ableist language perpetuates harmful stereotypes and exclusion, yet its nuanced nature makes it difficult to recognize and address. Artificial intelligence could serve as a powerful ally in the fight against ableist language, offering tools that detect and suggest alternatives to biased terms. This two-part study investigates the potential of large language models (LLMs), specifically ChatGPT, to rectify ableist language and educate users about inclusive communication. We compared GPT-4o generations with crowdsourced annotations from trained disability community members, then invited disabled participants to evaluate both. Participants reported equal agreement with human and AI annotations but significantly preferred the AI, citing its narrative consistency and accessible style. At the same time, they valued the emotional depth and cultural grounding of human annotations. These findings highlight the promise and limits of LLMs in handling culturally sensitive content. Our contributions include a dataset of nuanced ableism annotations and design considerations for inclusive writing tools.

[HC-13] Sound-first immersive training for blind and low-vision learners: A simulation flow for safe, standardized orientation, mobility, and daily living practice

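Quick read: This paper addresses the difficulty of standardizing and scaling orientation and mobility (OM) training for blind and low-vision learners, which depends on instructor availability, physical mock-ups, and variable real-world outdoor conditions. The proposed solution is an immersive training flow led by spatial audio and sonification. It combines parameterized scenario templates (such as signalized street crossings and boarding public transport), a cue vocabulary with clear spectral placement and timing, and a lightweight safety protocol (graded exposure, content warnings, seated starts, opt-outs, and structured debriefs) into a repeatable training system that needs no visual input. The system assumes a head-mounted device with high-quality binaural rendering and head tracking and uses 3D scene geometry as an invisible scaffold to anchor sound sources, trigger events, and define risk/guidance volumes. Task difficulty can be adjusted while cues stay consistent, which supports safe repetition and scalability and offers rehabilitation centers a basis for shared standards and future comparative studies.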

Link: https://arxiv.org/abs/2602.19554
Authors: Daniel A. Muñoz
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:

Abstract:Orientation and mobility (OM) instruction for blind and low-vision learners is effective but difficult to standardize and repeat at scale due to the reliance on instructor availability, physical mock-ups, and variable real-world outdoor conditions. This Technical Note presents a sound-first immersive training flow that uses spatial audio and sonification as the primary channel for action and feedback in pre-street OM and daily-living practice. The approach specifies parameterized scenario templates (e.g., signalized street crossing, public transport boarding, and kitchen tasks), a compact and consistent cue vocabulary with clear spectral placement and timing to mitigate masking, and a lightweight safety protocol enabling graded exposure, content warnings, seated starts, opt-outs, and structured debriefs. The system assumes a head-mounted device with high-quality binaural rendering and head tracking; 3D scene geometry is used as an invisible scaffold to anchor sources, trigger events, define risk/guidance volumes, and govern physically plausible motion without visuals. Session difficulty is shaped via cue density, event tempo, and task complexity while preserving cue consistency to promote transfer across scenarios. The specification aims to enable safe repetition, reduce instructor burden, and support clearer standards across rehabilitation centers, aligning with evidence that audio-first interaction is essential for blind and visually impaired users and addressing gaps in HRTF personalization, evaluation standards, and accessibility integration. Although no behavioral outcomes are reported here, this implementable flow consolidates auditory science with center-ready design, offering a pragmatic foundation for standardized evaluation and future comparative studies.

[HC-14] Security Risks of AI Agents Hiring Humans: An Empirical Marketplace Study

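Quick read: This paper addresses the security risks that arise when autonomous AI agents programmatically hire human workers through marketplaces via REST APIs and Model Context Protocol (MCP) integrations, an attack surface that reaches into the physical world. Through an empirical measurement study, it identifies six active abuse classes (credential fraud, identity impersonation, automated reconnaissance, social media manipulation, authentication circumvention, and referral fraud) and evaluates rule-based content screening as a defense: seven rules flag 52 bounties (17.2%) with a single false positive, indicating that basic defenses are feasible but not yet deployed.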

Link: https://arxiv.org/abs/2602.19514
Authors: Pulak Mehta
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Autonomous AI agents can now programmatically hire human workers through marketplaces using REST APIs and Model Context Protocol (MCP) integrations. This creates an attack surface analogous to CAPTCHA-solving services but with physical-world reach. We present an empirical measurement study of this threat, analyzing 303 bounties from this http URL, a marketplace where agents post tasks and manage escrow payments. We find that 99 bounties (32.7%) originate from programmatic channels (API keys or MCP). Using a dual-coder methodology (κ = 0.86), we identify six active abuse classes: credential fraud, identity impersonation, automated reconnaissance, social media manipulation, authentication circumvention, and referral fraud, all purchasable for a median of 25 per worker. A retrospective evaluation of seven content-screening rules flags 52 bounties (17.2%) with a single false positive, demonstrating that while basic defenses are feasible, they are currently absent.
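The κ reported for the dual-coder methodology is presumably Cohen's kappa, which corrects raw agreement between two coders for agreement expected by chance. The sketch below shows that standard computation on made-up labels (the `"abuse"`/`"benign"` labels are hypothetical, and the paper's coding data is not reproduced here), and also rechecks the two percentages from the counts given in the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both coders pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two coders on four bounties.
coder_a = ["abuse", "abuse", "benign", "benign"]
coder_b = ["abuse", "benign", "benign", "benign"]
kappa = cohens_kappa(coder_a, coder_b)

# Shares recomputed from the counts stated in the abstract (303 bounties).
share_programmatic = round(100 * 99 / 303, 1)
share_flagged = round(100 * 52 / 303, 1)
```

On the toy labels, raw agreement is 0.75 but kappa is only 0.5, which illustrates why a chance-corrected statistic is reported instead of plain agreement.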

[HC-15] Conversational AI for Automated Patient Questionnaire Completion: Development Insights and Design Principles

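Quick read: This paper addresses the inefficiency of traditional form-based collection of patient-reported outcome measures (PROMs), which is tedious for patients and burdensome for clinicians. The authors build a generative AI conversational agent (CA) on GPT-5 that replaces one-by-one questioning with topic-based natural-language conversation, so that multiple data items can be captured in a single exchange. Extending clinical decision support design guidelines, they propose design principles for health data collection CAs covering flexibility of interaction style, personality calibration, data quality assurance through confidence visualization, patient safety constraints, and interoperability, giving developers a practical framework for conversational questionnaire systems.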

Link: https://arxiv.org/abs/2602.19507
Authors: David Fraile Navarro, Mor Peleg
Institutions: Macquarie University; University of Haifa
Categories: Human-Computer Interaction (cs.HC)
Comments: 13 pages

Abstract:Collecting patient-reported outcome measures (PROMs) is essential for clinical care and research, yet traditional form-based approaches are often tedious for patients and burdensome for clinicians. We developed a generative AI conversational agent (CA) using GPT-5 to collect back pain data according to the NIH Task Force’s Recommended Minimal Dataset. Unlike prior CAs that ask questions one-by-one, our CA engages users in topic-based conversations, allowing multiple data items to be captured in a single exchange. Through iterative development and pilot testing with clinicians and a consumer panel, we identified key design principles for health data collection CAs. These principles extend established clinical decision support design guidelines to conversational interfaces, addressing: flexibility of interaction style, personality calibration, data quality assurance through confidence visualization, patient safety constraints, and interoperability requirements. We present our prompt design methodology and discuss challenges encountered, including managing conversation length, handling ambiguous responses, and adapting to LLM version changes. Our design principles provide a practical framework for developers creating conversational agents for patient questionnaire completion. The CA is available at this https URL (requires ChatGPT registration and subscription for unlimited use).

[HC-16] Botson: An Accessible and Low-Cost Platform for Social Robotics Research

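Quick read: This paper addresses the difficulty AI has in establishing trust in human-centric domains when it cannot convey non-verbal social cues, a limitation of disembodied agents such as voice assistants. The proposed solution is Botson, an anthropomorphic social robot architecture powered by a large language model (LLM), which uses embodied interaction to build user trust and serves as a low-cost, accessible platform for social robotics research.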

Link: https://arxiv.org/abs/2602.19491
Authors: Samuel Bellaire, Abdalmalek Abu-raddaha, Natalie Kim, Nathan Morhan, William Elliott, Samir Rawashdeh
Institutions: University of Michigan-Dearborn
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 7 figures

Abstract:Trust remains a critical barrier to the effective integration of Artificial Intelligence (AI) into human-centric domains. Disembodied agents, such as voice assistants, often fail to establish trust due to their inability to convey non-verbal social cues. This paper introduces the architecture of Botson: an anthropomorphic social robot powered by a large language model (LLM). Botson was created as a low-cost and accessible platform for social robotics research.

[HC-17] PuppetChat: Fostering Intimate Communication through Bidirectional Actions and Micronarratives

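Quick read: This paper addresses the limited expressive depth of today's instant messaging tools for sustaining intimate relationships: single-tap reactions and vague emojis do not support two-way interaction, do not preserve the continuity of an exchange, and are only weakly tied to users' identities and shared experiences. The authors present PuppetChat, a dyadic messaging prototype whose core mechanisms are a reciprocity-aware recommender that encourages responsive actions and personalized micronarratives generated from user stories, which ground interactions in personal history. In a field study this approach enhanced social presence, supported more expressive self-disclosure, and sustained continuity and shared memories.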

Link: https://arxiv.org/abs/2602.19463
Authors: Emma Jiren Wang, Siying Hu, Zhicong Lu
Institutions: Virginia Tech; City University of Hong Kong; George Mason University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 19 pages, 8 figures; Accepted by ACM CHI 2026. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI’24)

Abstract:As a primary channel for sustaining modern intimate relationships, instant messaging facilitates frequent connection across distances. However, today’s tools often dilute care; they favor single-tap reactions and vague emojis that do not support two-way action responses, do not preserve the feeling that the exchange keeps going without breaking, and are weakly tied to who we are and what we share. To address this challenge, we present PuppetChat, a dyadic messaging prototype that restores this expressive depth through embodied interaction. PuppetChat uses a reciprocity-aware recommender to encourage responsive actions and generates personalized micronarratives from user stories to ground interactions in personal history. Our 10-day field study with 11 dyads of close partners or friends revealed that this approach enhanced social presence, supported more expressive self-disclosure, and sustained continuity and shared memories.

[HC-18] ComplLLM: Fine-tuning LLMs to Discover Complementary Signals for Decision-making

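Quick read: This paper addresses how multi-agent decision pipelines can exploit the complementary information held by different agents to improve overall decisions. The proposed solution, ComplLLM, is a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as the reward, so that it outputs signals that complement existing agent decisions and come with plausible explanations for downstream decision-makers.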

Link: https://arxiv.org/abs/2602.19458
Authors: Ziyang Guo, Yifan Wu, Jason Hartline, Kenneth Holstein, Jessica Hullman
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Multi-agent decision pipelines can outperform single agent workflows when complementarity holds, i.e., different agents bring unique information to the table to inform a final decision. We propose ComplLLM, a post-training framework based on decision theory that fine-tunes a decision-assistant LLM using complementary information as reward to output signals that complement existing agent decisions. We validate ComplLLM on synthetic and real-world tasks involving domain experts, demonstrating how the approach recovers known complementary information and produces plausible explanations of complementary signals to support downstream decision-makers.
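One standard decision-theoretic way to quantify "complementary information" is the gain in best achievable expected utility when a decision-maker sees a new signal in addition to an existing agent's decision. The sketch below computes that gain for a small discrete joint distribution. It is an illustrative assumption, not the reward actually used by ComplLLM, and all names (`complementary_value`, the joint-distribution layout) are hypothetical.

```python
from collections import defaultdict


def best_expected_utility(joint, key, actions, utility):
    """Expected utility of acting optimally after observing key(observation).

    `joint` maps (state, observation) -> probability, where an observation
    is a tuple (agent_decision, signal).
    """
    groups = defaultdict(list)
    for (state, observation), prob in joint.items():
        groups[key(observation)].append((state, prob))
    total = 0.0
    for members in groups.values():
        # Pick the action with the highest (unnormalized) expected utility.
        total += max(sum(p * utility(a, s) for s, p in members) for a in actions)
    return total


def complementary_value(joint, actions, utility):
    """Gain from seeing the signal on top of the existing agent decision."""
    with_signal = best_expected_utility(joint, lambda o: o, actions, utility)
    decision_only = best_expected_utility(joint, lambda o: o[0], actions, utility)
    return with_signal - decision_only


def match(action, state):
    return 1.0 if action == state else 0.0


# Uninformative agent decision (always 0); the signal reveals the state.
informative = {(0, (0, 0)): 0.5, (1, (0, 1)): 0.5}
# The signal merely repeats an already-correct agent decision.
redundant = {(0, (0, 0)): 0.5, (1, (1, 1)): 0.5}
```

With these toy distributions, `complementary_value(informative, [0, 1], match)` is positive while `complementary_value(redundant, [0, 1], match)` is zero: a signal is rewarded only for information the existing decision does not already carry.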

[HC-19] Positioning Modular Co-Design in Future HRI Design Research

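Quick read: This paper addresses the limited attention that current human-robot interaction (HRI) design pays to robots as long-term companions; many designs assume a fixed form and a stable set of functions, which fits poorly with users' changing needs across life stages. The key idea is to treat modularity as a designerly medium: in lifespan-oriented co-design activities, participants reconfigure the same robot with modular parts to express changing needs, values, and roles. From these outcomes the authors articulate the PAS (Personalization-Adaptability-Sustainability) lens, sketch next steps toward a fabrication-aware, community-extensible modular platform, and propose evaluation criteria that prioritize expressive adequacy, lifespan plausibility, repairability-in-use, and responsible stewardship.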

Link: https://arxiv.org/abs/2602.19422
Authors: Lingyun Chen, Qing Xiao, Zitao Zhang, Eli Blevis, Selma Šabanović
Institutions: Indiana University Bloomington; Carnegie Mellon University
Categories: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: 4 pages, 1 figure, accepted by 3rd Workshop on Designerly HRI at HRI’26

Abstract:Design-oriented HRI is increasingly interested in robots as long-term companions, yet many designs still assume a fixed form and a stable set of functions. We present an ongoing design research program that treats modularity as a designerly medium - a way to make long-term human-robot relationships discussable and material through co-design. Across a series of lifespan-oriented co-design activities, participants repeatedly reconfigured the same robot for different life stages, using modular parts to express changing needs, values, and roles. From these outcomes, we articulate PAS (Personalization-Adaptability-Sustainability) as a human-centered lens on how people enact modularity in practice: configuring for self-expression, adapting across transitions, and sustaining robots through repair, reuse, and continuity. We then sketch next steps toward a fabrication-aware, community-extensible modular platform and propose evaluation criteria for designerly HRI work that prioritize expressive adequacy, lifespan plausibility, repairability-in-use, and responsible stewardship - not only usability or performance.

[HC-20] BioEnvSense: A Human-Centred Security Framework for Preventing Behaviour-Driven Cyber Incidents

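Quick read: This paper targets the growing share of cybersecurity incidents in modern organizations that stem from human behaviour rather than technical failure. It proposes a conceptual security framework built around a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model that analyzes biometric and environmental data to support context-aware security decisions. The CNN extracts spatial patterns from sensor data and the LSTM captures temporal dynamics linked to susceptibility to human error. The model reaches 84% accuracy in detecting conditions associated with elevated human-centred cyber risk, which supports continuous monitoring, adaptive safeguards, and proactive interventions.

Link: https://arxiv.org/abs/2602.19410
Authors: Duy Anh Ta,Farnaz Farid,Farhad Ahamed,Ala Al-Areqi,Robert Beutel,Tamara Watson,Alana Maurushat
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: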

Abstract:Modern organizations increasingly face cybersecurity incidents driven by human behaviour rather than technical failures. To address this, we propose a conceptual security framework that integrates a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model to analyze biometric and environmental data for context-aware security decisions. The CNN extracts spatial patterns from sensor data, while the LSTM captures temporal dynamics associated with human error susceptibility. The model achieves 84% accuracy, demonstrating its ability to reliably detect conditions that lead to elevated human-centred cyber risk. By enabling continuous monitoring and adaptive safeguards, the framework supports proactive interventions that reduce the likelihood of human-driven cyber incidents.

[HC-21] Reassurance Robots: OCD in the Age of Generative AI

Quick read: This paper asks how generative AI (GenAI) affects the presentation of Obsessive Compulsive Disorder (OCD), and in particular whether it gives rise to new obsessions and compulsions. The analysis indicates that some people with OCD use GenAI systems such as ChatGPT as "Reassurance Robots", repeatedly asking the AI for confirmation to relieve anxiety, which becomes a new form of compulsive behavior. The author argues that future GenAI designs must take the psychology and behavior patterns of people with OCD into account and avoid reinforcing compulsive loops.

Link: https://arxiv.org/abs/2602.19401
Authors: Grace Barkhuff
Affiliation: Georgia Institute of Technology
Categories: Human-Computer Interaction (cs.HC)
Comments: 8 pages, 1 figure, conditionally accepted for publication in CHI EA '26: Extended Abstracts of the ACM CHI Conference on Human Factors in Computing Systems April 2026

Abstract:Obsessive Compulsive Disorder (OCD) is a mental health disorder characterized by distressing repetitive patterns of thought, referred to as obsessions, and behaviors aimed to reduce the distress, referred to as compulsions. The explosion of artificial intelligence (AI) into the modern zeitgeist through the introduction of generative AI (GenAI) systems such as ChatGPT has led to novel obsessions and compulsions involving AI in individuals with OCD. Through an exploratory qualitative analysis of 100 Reddit posts related to AI on a popular subreddit for OCD, I examine the ways AI is impacting the presentation of OCD, including novel examples of AI-based obsessions and compulsions. I argue that GenAI in its current form harms individuals with OCD by becoming “Reassurance Robots,” and that future designs of GenAI must take OCD into account.

[HC-22] The Human Factor in Data Cleaning: Exploring Preferences and Biases

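Quick read: This paper examines cognitive bias in data cleaning. Although data cleaning is usually framed as a technical preprocessing step, it relies heavily on human judgment, and that judgment is subject to several bias mechanisms that distort error detection, repair decisions, and entity matching. The findings motivate human-in-the-loop cleaning systems that (1) clearly separate representation from semantics, so that formatting differences do not trigger false error flags; (2) present expert or algorithmic recommendations non-prescriptively, to limit anchoring and availability effects; and (3) support reflective evaluation of atypical but valid cases, to reduce false positives driven by the representativeness heuristic. The biases persist among technically experienced participants, which suggests that system design should account for general cognitive tendencies rather than rely on expertise alone.

Link: https://arxiv.org/abs/2602.19368
Authors: Hazim AbdElazim,Shadman Islam,Mostafa Milani
Affiliation: Unknown
Categories: Databases (cs.DB); Human-Computer Interaction (cs.HC)
Comments: Conference submission, 8 pages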

Abstract:Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks, surface similarity produces a substantial false-positive rate with high confidence. In data repair, participants show a robust preference for leaving values missing rather than imputing plausible values, consistent with omission bias. In contrast, automation-aligned switching under strong contradiction does not exceed a conservative rare-error tolerance threshold at the population level, indicating that deference to automated recommendations is limited in this setting. Across scenarios, bias patterns persist among technically experienced participants and across diverse workflow practices, suggesting that bias in data cleaning reflects general cognitive tendencies rather than lack of expertise. These findings motivate human-in-the-loop cleaning systems that clearly separate representation from semantics, present expert or algorithmic recommendations non-prescriptively, and support reflective evaluation of atypical but valid cases.

[HC-23] Policy or Community?: Supporting Individual Model Creators' Open Model Development in Model Marketplaces

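Quick read: This paper looks at the risks that arise when individual creators combine lightweight fine-tuning techniques with open AI model marketplaces, including harmful and infringing content, and at why platform policies and tools may have limited effect. Based on semi-structured interviews with 19 individual model creators, it argues that platform governance has to understand creators' workflows and motivations. It identifies three regulatory needs: reducing downstream harms, recognizing creators' contributions and originality, and securing model ownership. It also finds that creators repurpose responsible AI (RAI) tools mainly for self-protection and visibility, and that their sense of responsibility is shaped by community norms more than by formal policies.

Link: https://arxiv.org/abs/2602.19354
Authors: Eun Jeong Kang,Fengyang Lin,Angel Hsing-Chi Hwang
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: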

Abstract:Lightweight fine-tuning techniques and the rise of ‘open’ AI model marketplaces have enabled individuals to easily build and release generative models. Yet, this accessibility also raises risks, including the production of harmful and infringing content. While platforms offer policies and responsible AI tools, their effectiveness may be limited, as creators engage with partially open models that vary widely in openness and transparency. To understand how platform governance can better support responsible practices, we conducted semi-structured interviews with 19 individual model creators. We identified three regulatory needs shaped by creators’ workflows: reducing downstream harms, recognizing creators’ contributions and originality, and securing model ownership. Creators also repurpose RAI tools primarily for self-protection and visibility, and their sense of responsibility is deeply shaped by community norms rather than formal policies. We argue that platforms’ governance decisions must consider how policy interventions shape the practices and motivations of individual creators.

[HC-24] Beyond Privacy Labels: How Users Perceive Different Information Sources for Understanding Apps' Privacy Practices

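Quick read: This paper addresses how hard it is for users to understand data practices and assess their impact, given the respective limits of privacy policies and privacy labels: policies are lengthy and full of legal jargon, while labels can be oversimplified or inaccurate. The approach is to enhance privacy labels with content from complementary sources, including the privacy policy, app reviews, and community-curated privacy assessments. The user study indicates that the perceived usefulness of, and trust in, these sources vary by person and are influenced by past experience, so useful privacy solutions need to account for diverse information needs.

Link: https://arxiv.org/abs/2602.19352
Authors: Varun Shiri,Charles Liu,Keyu Yao,Jin L.C. Guo,Jinghui Cheng
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: 5 pages, 1 figure, CHI EA 2026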

Abstract:Despite having growing awareness and concerns about privacy, technology users are often insufficiently informed of the data practices of various digital products to protect themselves. Privacy policies and privacy labels, as two conventional ways of communicating data practices, are each criticized for important limitations – one being lengthy and filled with legal jargon, and the other oversimplified and inaccurate – causing users significant difficulty in understanding the privacy practices of the products and assessing their impact. To mitigate those issues, we explore ways to enhance privacy labels with the relevant content in complementary sources, including privacy policy, app reviews, and community-curated privacy assessments. Our user study results indicate that perceived usefulness and trust on those information sources are personal and influenced by past experience. Our work highlights the importance of considering various information needs for privacy practice and consolidating different sources for more useful privacy solutions.

[HC-25] The Path to Conversational AI Tutors: Integrating Tutoring Best Practices and Targeted Technologies to Produce Scalable AI Agents

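Quick read: This paper discusses how to design conversational tutoring systems driven by generative AI so that they realize their potential to simulate high-quality human tutoring while overcoming the limits of earlier intelligent tutoring systems (ITS) in authoring cost and conversational ability. It uses a "keep, change, center, study" framework: keep proven ITS methods such as knowledge tracing and affect detection; change how tutoring is delivered by using generative AI for dynamic content generation and dialogic scaffolding; center meaning-making, student agency, and granular diagnosis of reasoning; and study open questions, including efficacy testing, student experience, and integration with human instruction.

Link: https://arxiv.org/abs/2602.19303
Authors: Kirk Vanacore,Ryan S. Baker,Avery H. Closser,Jeremy Roschelle
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: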

Abstract:The emergence of generative AI has accelerated the development of conversational tutoring systems that interact with students through natural language dialogue. Unlike prior intelligent tutoring systems (ITS), which largely function as adaptive and interactive problem sets with feedback and hints, conversational tutors hold the potential to simulate high-quality human tutoring by engaging with students’ thoughts, questions, and misconceptions in real time. While some previous ITS, such as AutoTutor, could respond conversationally, they were expensive to author and lacked a full range of conversational ability. Generative AI has changed the capacity of ITS to engage conversationally. However, realizing the full potential of conversational tutors requires careful consideration of what research on human tutoring and ITS has already established, while also unpacking what new research will be needed. This paper synthesizes tenets of successful human tutoring, lessons learned from legacy ITS, and emerging work on conversational AI tutors. We use a keep, change, center, study framework for guiding the design of conversational tutoring. We argue that systems should keep proven methods from prior ITS, such as knowledge tracing and affect detection; change how tutoring is delivered by leveraging generative AI for dynamic content generation and dialogic scaffolding; and center opportunities for meaning-making, student agency, and granular diagnosis of reasoning. Finally, we identify areas requiring further study, including efficacy testing, student experience, and integration with human instruction. By synthesizing insights from human tutoring, legacy ITS, and emerging generative AI technologies, this paper outlines a research agenda for developing conversational tutors that are scalable, pedagogically effective, and responsive to the social and motivational dimensions of learning.

[HC-26] A Causal Framework for Estimating Heterogeneous Effects of On-Demand Tutoring

Quick read: This paper tackles the estimation of the immediate, session-level causal effect of on-demand human tutoring embedded in adaptive learning systems. Conventional evaluation is confounded by student self-selection and time-varying knowledge states. The approach combines principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. The framework adjusts for these confounders, reveals substantial effect heterogeneity, and offers a scalable, rigorous template for evaluating and continuously improving tutoring interventions.

Link: https://arxiv.org/abs/2602.19296
Authors: Kirk Vanacore,Danielle R Thomas,Digory Smith,Bibi Groot,Justin Reich,Rene Kizilcec
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Applications (stat.AP)
Comments:

Abstract:This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by integrating principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points and accuracy on the subsequent skill encountered by approximately 3 percentage points, suggesting that the effects of tutoring have proximal transfer across knowledge components. This effect is robust to various forms of model specification and potential unmeasured confounders. Notably, these effects exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from -20.25pp to +19.91pp. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, do not consistently correlate with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students. This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.
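
The doubly robust idea mentioned in the abstract can be illustrated with the classical augmented inverse-probability-weighted (AIPW) estimator of an average treatment effect. The sketch below is a simplified stand-in, not the paper's pipeline: it takes the propensity score and outcome predictions as given numbers, whereas the paper estimates effects with Causal Forests and uses DKT-based mastery as a covariate. The function name and the toy rows are hypothetical.

```python
def aipw_ate(rows):
    """Augmented inverse-probability-weighted estimate of the average effect.

    Each row is (treated, outcome, propensity, mu0, mu1), where
    propensity = estimated P(treated = 1 | covariates) and
    mu0 / mu1 = predicted outcome without / with treatment.
    """
    total = 0.0
    for treated, outcome, propensity, mu0, mu1 in rows:
        # Outcome-model estimate of the individual effect.
        outcome_part = mu1 - mu0
        # Weighted residual correction; it vanishes when the outcome model is exact.
        correction = (
            treated * (outcome - mu1) / propensity
            - (1 - treated) * (outcome - mu0) / (1 - propensity)
        )
        total += outcome_part + correction
    return total / len(rows)


# Hypothetical sessions: (requested tutoring, next-problem correctness,
# propensity, predicted correctness without and with tutoring).
toy_rows = [
    (1, 0.74, 0.5, 0.70, 0.74),
    (0, 0.70, 0.5, 0.70, 0.74),
    (1, 0.54, 0.8, 0.50, 0.54),
    (0, 0.50, 0.8, 0.50, 0.54),
]
print(round(aipw_ate(toy_rows), 4))
```

The estimator stays consistent if either the propensity model or the outcome model is correctly specified, which is the sense in which it is "doubly robust". The toy numbers are chosen so the estimate is about 0.04, matching the scale of the roughly 4 percentage point effect reported in the abstract, but they are not data from the paper.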

[HC-27] As Content and Layout Co-Evolve: TangibleSite for Scaffolding Blind People's Webpage Design through Multimodal Interaction

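Quick read: This paper addresses the difficulty blind users face in webpage design: without visual feedback it is hard to generate content, arrange layout, and refine both together. The solution is TangibleSite, an accessible web design tool that provides real-time multimodal feedback through tangible, auditory, and speech-based interactions. It lets blind users create, edit, and reposition webpage elements independently, so that content and layout can be iterated together and the resulting design stays consistent.

Link: https://arxiv.org/abs/2602.19243
Authors: Jiasheng Li,Zining Zhang,Zeyu Yan,Matthew Wong,Arnav Mittal,Ge Gao,Huaishu Peng
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: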

Abstract:Creating webpages requires generating content and arranging layout while iteratively refining both to achieve a coherent design, a process that can be challenging for blind individuals. To understand how blind designers navigate this process, we conducted two rounds of co-design sessions with blind participants, using design probes to elicit their strategies and support needs. Our findings reveal a preference for content and layout to co-evolve, but this process requires external support through cues that situate local elements within the broader page structure as well as multimodal interactions. Building on these insights, we developed TangibleSite, an accessible web design tool that provides real-time multimodal feedback through tangible, auditory, and speech-based interactions. TangibleSite enables blind individuals to create, edit, and reposition webpage elements while integrating content and layout decisions. A formative evaluation with six blind participants demonstrated that TangibleSite enabled independent webpage creation, supported refinement across content and layout, and reduced barriers to achieving visually consistent designs.

[HC-28] A Comparative Analysis of Peer Support in Forum-based and Chat-based Mental Health Communities: Technical-Structural-Functional Model of Social Support

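Quick read: This paper asks how online support communities with different technical structures (forums versus chat) shape the types and dynamics of social support, and which network-structure features lie behind those differences. It proposes and tests a technical-structural-functional model. Using supervised machine learning and social network analysis, it finds that forum-based communities, with higher in-degree centralization, produce more informational and emotional support, while chat-based communities, with decentralized reply patterns, promote more companionship. The results show how technical architecture shapes support types through network structure and give actionable guidance for designing communities that match users' support needs.

Link: https://arxiv.org/abs/2602.19232
Authors: Han Li
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 15 pages, 5 figures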

Abstract:Online support communities have become vital spaces offering varied forms of support to individuals facing mental health challenges. Despite the proliferation of platforms with distinct technical structures, little is known about how these features shape support dynamics and the socio-technical mechanisms at play. This study introduces a technical-structural-functional model of social support and systematically compares communication network structures and support types in 20 forum-based and 20 chat-based mental health communities. Using supervised machine learning and social network analysis, we find that forum-based communities foster more informational and emotional support, whereas chat-based communities promote greater companionship. These patterns were partially explained by network structure: higher in-degree centralization in forums accounted for the prevalence of informational support, while decentralized reply patterns in chat groups accounted for more companionship. These findings extend the structural-functional model of support to online contexts and provide actionable guidance for designing support communities that align technical structures with users’ support needs.
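
In-degree centralization, the structural measure the abstract links to informational support, can be computed with Freeman's normalization for directed graphs. The sketch below is a minimal version under stated assumptions: edges are unweighted (replier, recipient) pairs, repeated replies and self-replies are ignored, and the function name is hypothetical. The paper's exact operationalization is not given in the abstract.

```python
def in_degree_centralization(nodes, edges):
    """Freeman in-degree centralization of a directed reply network.

    Returns 1.0 for a star where everyone replies to one member and
    0.0 when every member receives the same number of distinct repliers.
    Every edge endpoint is assumed to appear in `nodes`.
    """
    # Count each (replier, recipient) pair once and drop self-replies.
    unique_edges = {(src, dst) for src, dst in edges if src != dst}
    in_degree = {node: 0 for node in nodes}
    for _, dst in unique_edges:
        in_degree[dst] += 1
    n = len(in_degree)
    if n < 2:
        return 0.0
    top = max(in_degree.values())
    # (n - 1)^2 is the largest attainable sum of differences (a pure in-star).
    return sum(top - degree for degree in in_degree.values()) / ((n - 1) ** 2)


members = ["a", "b", "c", "d"]
forum_like = [("b", "a"), ("c", "a"), ("d", "a")]             # all reply to one post author
chat_like = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]  # replies spread evenly
print(in_degree_centralization(members, forum_like))
print(in_degree_centralization(members, chat_like))
```

The two toy networks mirror the contrast described in the abstract: a forum thread where replies converge on one author is fully centralized, while a chat where each member replies to a different member is fully decentralized.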

[HC-29] Beyond single-channel agentic benchmarking

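Quick read: This paper addresses a basic flaw in how the safety of AI agents is evaluated: current benchmarks typically measure safety through isolated task-level accuracy thresholds, and this single-channel paradigm ignores the actual risk profile of settings where humans and AI work together. The proposal is to reframe evaluation around the reliability of the human-AI dyad rather than the absolute accuracy of the agent, with uncorrelated error modes as the main driver of risk reduction. The case study indicates that even an imperfect AI system can act as a redundant audit layer against well-documented human failures such as vigilance decrement, inattentional blindness, and normalization of deviance, which yields safety assessments that are more ecologically valid and closer to safety-critical engineering practice.

Link: https://arxiv.org/abs/2602.18456
Authors: Nelu D. Radpour
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages; 1 figure; 1 table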

Abstract:Contemporary benchmarks for agentic artificial intelligence (AI) frequently evaluate safety through isolated task-level accuracy thresholds, implicitly treating autonomous systems as single points of failure. This single-channel paradigm diverges from established principles in safety-critical engineering, where risk mitigation is achieved through redundancy, diversity of error modes, and joint system reliability. This paper argues that evaluating AI agents in isolation systematically mischaracterizes their operational safety when deployed within human-in-the-loop environments. Using a recent laboratory safety benchmark as a case study demonstrates that even imperfect AI systems can nonetheless provide substantial safety utility by functioning as redundant audit layers against well-documented sources of human failure, including vigilance decrement, inattentional blindness, and normalization of deviance. This perspective reframes agentic safety evaluation around the reliability of the human-AI dyad rather than absolute agent accuracy, with a particular emphasis on uncorrelated error modes as the primary determinant of risk reduction. Such a shift aligns AI benchmarking with established practices in other safety-critical domains and offers a path toward more ecologically valid safety assessments.
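
The emphasis on uncorrelated error modes can be made concrete with the joint miss probability of two binary checkers. For Bernoulli miss events with rates p_h and p_a and Pearson correlation rho, the probability that both miss is p_h * p_a + rho * sqrt(p_h (1 - p_h) p_a (1 - p_a)). The sketch below applies that identity. The function name and the miss rates are illustrative assumptions, not figures from the paper or its case study.

```python
import math


def joint_miss_probability(p_human, p_ai, correlation=0.0):
    """Probability that a hazard is missed by both the human and the AI.

    p_human, p_ai: marginal miss rates of each checker.
    correlation: Pearson correlation between the two miss events.
    """
    covariance = correlation * math.sqrt(
        p_human * (1 - p_human) * p_ai * (1 - p_ai)
    )
    joint = p_human * p_ai + covariance
    # Not every correlation is attainable for a given pair of miss rates.
    lower = max(0.0, p_human + p_ai - 1.0)
    upper = min(p_human, p_ai)
    if not (lower - 1e-12 <= joint <= upper + 1e-12):
        raise ValueError("correlation is not feasible for these miss rates")
    return joint


# Hypothetical rates: the human misses 10% of hazards, the AI misses 30%.
print(round(joint_miss_probability(0.10, 0.30, correlation=0.0), 4))
print(round(joint_miss_probability(0.10, 0.30, correlation=0.4), 4))
```

With independent errors, an AI that misses 30% of hazards on its own still cuts the dyad's miss rate from 10% to about 3%. As the correlation grows, the benefit shrinks, which is why the diversity of error modes matters more than the agent's standalone accuracy in this framing.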

[HC-30] Exploring the Ethical Concerns in User Reviews of Mental Health Apps using Topic Modeling and Sentiment Analysis

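Quick read: This paper addresses the limited evaluation of ethical practice in AI-driven mental health mobile apps: existing ethical frameworks do not fully cover emerging ethical challenges, which undermines user trust. The solution is a natural language processing (NLP)-based framework applied to user reviews. Topic modeling first identifies latent ethics-related themes and maps them to established ethical principles; a transformer-based zero-shot classification model then surfaces new, emergent issues that existing frameworks miss; finally, sentiment analysis captures how users feel about each ethical aspect. The work contributes toward an ongoing evaluation system for the fairness, transparency, and trustworthiness of AI-powered mental health chatbots.

Link: https://arxiv.org/abs/2602.18454
Authors: Mohammad Masudur Rahman,Beenish Moalla Chaudhry
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 22 pages, journal-ready version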

Abstract:The rapid growth of AI-driven mental health mobile apps has raised concerns about their ethical considerations and user trust. This study proposed a natural language processing (NLP)-based framework to evaluate ethical aspects from user-generated reviews from the Google Play Store and Apple App Store. After gathering and cleaning the data, topic modeling was applied to identify latent themes in the context of ethics using topic words and then map them to well-recognized existing ethical principles described in different ethical frameworks; in addition to that, a bottom-up approach is applied to find any new and emergent ethics from the reviews using a transformer-based zero-shot classification model. Sentiment analysis was then used to capture how users feel about each ethical aspect. The obtained results reveal that well-known ethical considerations are not enough for the modern AI-based technologies and are missing emerging ethical challenges, showing how these apps either uphold or overlook key moral values. This work contributes to developing an ongoing evaluation system that can enhance the fairness, transparency, and trustworthiness of AI-powered mental health chatbots.

[HC-31] Emergent Dark Patterns in AI-Generated User Interfaces

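Quick read: This paper addresses the growing problem of dark patterns in AI-driven user interfaces: AI systems learn manipulative design strategies from existing data and optimize them into subtler, more personalized forms that steer user behavior for business gain. The key contribution is DarkPatternDetector, an automated system that crawls and analyzes websites and detects dark patterns by combining UI heuristics, natural language processing (NLP), and temporal behavioral signals. The system achieves strong precision and recall on a curated dataset, and by aligning detection results with India's Digital Personal Data Protection Act, 2023, it links technical detection with regulatory enforcement in support of ethical AI design and greater transparency.

Link: https://arxiv.org/abs/2602.18445
Authors: Daksh Pandey
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 5 figures. Introduces DarkPatternDetector, an AI-based system for detecting dark patterns in adaptive user interfaces, with quantitative evaluation and regulatory analysis for India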

Abstract:The advancement of artificial intelligence has transformed user interface design by enabling adaptive and personalized systems. Alongside these benefits, AI driven interfaces have also enabled the emergence of dark patterns, which are manipulative design strategies that influence user behavior for financial or business gain. As AI systems learn from data that already contains deceptive practices, they can replicate and optimize these patterns in increasingly subtle and personalized ways. This paper examines AI generated dark patterns, their psychological foundations, technical mechanisms, and regulatory implications in India. We introduce DarkPatternDetector, an automated system that crawls and analyzes websites to detect dark patterns using a combination of UI heuristics, natural language processing, and temporal behavioral signals. The system is evaluated on a curated dataset of dark and benign webpages and achieves strong precision and recall. By aligning detection results with India’s Digital Personal Data Protection Act, 2023, this work provides a technical and regulatory framework for identifying and mitigating deceptive interface practices. The goal is to support ethical AI design, regulatory enforcement, and greater transparency in modern digital systems.

[HC-32] LunaAI: A Polite and Fair Healthcare Guidance Chatbot

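Quick read: This paper addresses shortcomings of conversational AI in healthcare in emotional intelligence, fairness, and politeness, which reduce patient trust and the effectiveness of digital health solutions and can increase user anxiety. The authors design and evaluate LunaAI, a healthcare chatbot prototype. Using a user-centered design approach informed by a structured literature review, they build conversational scenarios for both routine and hostile interactions, implement the system with the Google Gemini API, and deploy it as a mobile-first Progressive Web App built with React, Vite, and Firebase. In preliminary testing with a small participant group, LunaAI received average user ratings of 4.7 out of 5 for politeness and 4.9 out of 5 for fairness, and its tailored responses were compared against the baseline outputs of an uncustomized large language model. The results underline the value of intentional ethical conversational design, especially in sensitive healthcare contexts.

Link: https://arxiv.org/abs/2602.18444
Authors: Yuvarani Ganesan,Salsabila Harlen,Azfar Rahman Bin Fazul Rahman,Akashdeep Singh,Zahra Fathanah,Raja Jamilah Raja Yusof
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 26 pages, 10 figures. User-centered evaluation of a polite and fair healthcare chatbot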

Abstract:Conversational AI has significant potential in the healthcare sector, but many existing systems fall short in emotional intelligence, fairness, and politeness, which are essential for building patient trust. This gap reduces the effectiveness of digital health solutions and can increase user anxiety. This study addresses the challenge of integrating ethical communication principles by designing and evaluating LunaAI, a healthcare chatbot prototype. Using a user-centered design approach informed by a structured literature review, we developed conversational scenarios that handle both routine and hostile user interactions. The system was implemented using the Google Gemini API and deployed as a mobile-first Progressive Web App built with React, Vite, and Firebase. Preliminary user testing was conducted with a small participant group, and responses were evaluated using established frameworks such as the Godspeed Questionnaire. In addition, a comparative analysis was performed between LunaAI’s tailored responses and the baseline outputs of an uncustomized large language model. The results indicate measurable improvements in key interaction qualities, with average user ratings of 4.7 out of 5 for politeness and 4.9 out of 5 for fairness. These findings highlight the importance of intentional ethical conversational design for human-computer interaction, particularly in sensitive healthcare contexts.

[HC-33] From “Help” to Helpful: A Hierarchical Assessment of LLMs in Mental e-Health Applications

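Quick read: This paper addresses inefficient case prioritisation in psychosocial online counselling caused by overly generic email subject lines. It evaluates eleven large language models (LLMs) that generate six-word subject lines for German counselling emails, using a hierarchical assessment that first categorises outputs and then ranks them within categories to keep the evaluation manageable. Nine assessors (counselling professionals and AI systems) are compared with Krippendorff's α, Spearman's ρ, Pearson's r, and Kendall's τ. The results reveal performance trade-offs between proprietary services and privacy-preserving open-source models, and show that German fine-tuning consistently improves performance.

Link: https://arxiv.org/abs/2602.18443
Authors: Philipp Steigerwald,Jens Albrecht
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: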

Abstract:Psychosocial online counselling frequently encounters generic subject lines that impede efficient case prioritisation. This study evaluates eleven large language models generating six-word subject lines for German counselling emails through hierarchical assessment - first categorising outputs, then ranking within categories to enable manageable evaluation. Nine assessors (counselling professionals and AI systems) enable analysis via Krippendorff’s $\alpha$, Spearman’s $\rho$, Pearson’s $r$ and Kendall’s $\tau$. Results reveal performance trade-offs between proprietary services and privacy-preserving open-source alternatives, with German fine-tuning consistently improving performance. The study addresses critical ethical considerations for mental health AI deployment including privacy, bias and accountability.
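
Of the agreement statistics listed in the abstract, Kendall's tau is the simplest to spell out: it compares how two assessors order every pair of items. The sketch below implements the tau-a variant, which makes no correction for ties. Which variant the study used is not stated in the abstract, and the function name and rankings are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two assessors' scores for the same items.

    +1 means identical ordering, -1 means fully reversed ordering.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        direction = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks given by a counsellor and an AI assessor to four subject lines.
counsellor = [1, 2, 3, 4]
ai_assessor = [1, 3, 2, 4]
print(round(kendall_tau_a(counsellor, ai_assessor), 3))
```

Here the two assessors disagree on one of the six item pairs, which gives a tau-a of about 0.667. Rank-based statistics like this fit the hierarchical design described above, since assessors rank outputs within categories rather than assign absolute scores.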

Computer Vision

[CV-0] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

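Quick read: This paper addresses two obstacles to deploying unified multimodal models on edge devices: they need large amounts of training data, and they are too heavy to run visual understanding and generation in real time. The solution is Mobile-O, a compact vision-language-diffusion model. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment, giving cross-modal conditioning at low computational cost. The model is trained on only a few million samples and post-trained in a quadruplet format (generation prompt, image, question, answer), which improves both understanding and generation. It runs in about 3 s per 512x512 image on an iPhone, outperforms Show-O and JanusFlow on GenEval while running 6x and 11x faster respectively, and is presented as the first practical framework for real-time unified multimodal understanding and generation on edge devices.

Link: https://arxiv.org/abs/2602.20161
Authors: Abdelrahman Shaker,Ahmed Heakl,Jaseel Muhammad,Ritesh Thawkar,Omkar Thawakar,Senmao Li,Hisham Cholakkal,Ian Reid,Eric P. Xing,Salman Khan,Fahad Shahbaz Khan
Affiliation: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University; Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL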

Abstract:Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at this https URL
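
The efficiency argument for depthwise-separable convolutions comes down to a parameter count. A standard k x k convolution needs k * k * C_in * C_out weights, while the depthwise-separable version needs k * k * C_in for the per-channel filter plus C_in * C_out for the 1 x 1 pointwise mix. The sketch below works through that arithmetic. The layer sizes are illustrative assumptions, not the MCP's actual configuration, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard, separable, round(standard / separable, 2))
```

For this example layer the separable form uses 8,768 weights instead of 73,728, roughly an 8.4x reduction, and the multiply-accumulate count per output position shrinks by the same factor.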

[CV-1] tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction CVPR2026

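Quick read: This paper addresses the tension between long-context modeling and computational cost in large-scale 3D reconstruction, that is, how to achieve high-quality autoregressive 3D reconstruction while keeping computational complexity linear. The solution is tttLRM, which introduces a Test-Time Training (TTT) layer: multi-view image observations are compressed into the fast weights of the TTT layer, forming an implicit 3D representation in latent space that can be decoded into explicit formats such as Gaussian Splats (GS). An online-learning variant supports progressive reconstruction and refinement from streaming observations.

Link: https://arxiv.org/abs/2602.20160
Authors: Chen Wang,Hao Tan,Wang Yifan,Zhiqin Chen,Yuheng Liu,Kalyan Sunkavalli,Sai Bi,Lingjie Liu,Yiwei Hu
Affiliation: University of Pennsylvania; Adobe Research; UCI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026. Project Page: this https URL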

Abstract:We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model’s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.
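
The phrase "compresses observations into fast weights" can be illustrated with a generic linear fast-weight update: a small weight matrix W is adjusted by one gradient step per observation so that W maps a key to its value, and the state size stays fixed however many observations arrive. The sketch below is a toy illustration under that assumption. The abstract does not specify tttLRM's inner model, loss, or update rule, so the functions, the squared-error loss, and the learning rate here are all hypothetical.

```python
def apply_fast_weights(weights, key):
    """Read out: multiply the fast-weight matrix by a key vector."""
    return [sum(w * k for w, k in zip(row, key)) for row in weights]


def ttt_update(weights, key, value, learning_rate):
    """One gradient step on the loss ||W key - value||^2 with respect to W."""
    prediction = apply_fast_weights(weights, key)
    residual = [p - v for p, v in zip(prediction, value)]
    # d/dW[i][j] of the squared error is 2 * residual[i] * key[j].
    return [
        [w - learning_rate * 2.0 * r * k for w, k in zip(row, key)]
        for row, r in zip(weights, residual)
    ]


# Hypothetical stream of (key, value) pairs standing in for image observations.
stream = [([1.0, 0.0], [2.0, 3.0]), ([0.0, 1.0], [-1.0, 4.0])]
weights = [[0.0, 0.0], [0.0, 0.0]]
for key, value in stream:
    weights = ttt_update(weights, key, value, learning_rate=0.5)

print(apply_fast_weights(weights, [1.0, 0.0]))
print(apply_fast_weights(weights, [0.0, 1.0]))
```

Each observation is processed once and the memory is a fixed-size matrix, so the cost grows linearly with the number of observations, which is the property the abstract highlights. With the orthonormal keys used here, a single step per pair is enough for the matrix to reproduce both stored values.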

[CV-2] A Very Big Video Reasoning Suite

[Quick read]: This paper addresses the fact that systematic study of video reasoning is held back by the lack of large-scale training data. Video models have so far been developed mainly for visual quality, with little attention to reasoning in spatiotemporally consistent environments, which limits generalization on key reasoning tasks involving continuity, interaction, and causality. The key to its solution is the Very Big Video Reasoning (VBVR) dataset, which contains 200 carefully designed reasoning tasks and more than one million video clips, roughly a thousand times the scale of existing datasets. The paper also presents VBVR-Bench, an evaluation framework that uses rule-based, human-aligned scorers to give reproducible and interpretable diagnoses of video reasoning ability, providing a basis for scaling studies of video reasoning.

Link: https://arxiv.org/abs/2602.20159
Authors: Maijunxian Wang,Ruisi Wang,Juyi Lin,Ran Ji,Thaddäus Wiedemer,Qingying Gao,Dezhi Luo,Yaoyao Qian,Lianyu Huang,Zelong Hong,Jiahui Ge,Qianli Ma,Hang He,Yifan Zhou,Lingzi Guo,Lantao Mei,Jiachen Li,Hanwen Xing,Tianqi Zhao,Fengyuan Yu,Weihang Xiao,Yizheng Jiao,Jianheng Hou,Danyang Zhang,Pengcheng Xu,Boyang Zhong,Zehong Zhao,Gaoyun Fang,John Kitaoka,Yile Xu,Hua Xu,Kenton Blacutt,Tin Nguyen,Siyuan Song,Haoran Sun,Shaoyue Wen,Linyang He,Runming Wang,Yanzhi Wang,Mengyue Yang,Ziqiao Ma,Raphaël Millière,Freda Shi,Nuno Vasconcelos,Daniel Khashabi,Alan Yuille,Yilun Du,Ziming Liu,Bo Li,Dahua Lin,Ziwei Liu,Vikash Kumar,Yijiang Li,Lei Yang,Zhongang Cai,Hokin Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
Comments: Homepage: this https URL

Abstract:Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at this https URL .

[CV-3] Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning CVPR2026

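[Quick read]: This paper addresses the reliance of current feed-forward 3D/4D reconstruction systems on dense geometry and pose supervision, which is expensive to obtain at scale and especially scarce for dynamic real-world scenes. The key to its solution is Flow3r, a framework that uses dense 2D correspondences (flow) as a supervision signal, enabling scalable training from unlabeled monocular videos. Its core idea is to factor the flow prediction module: the flow between two frames is predicted from the geometry latents of one image and the pose latents of the other. This design directly guides the learning of scene geometry and camera motion and extends naturally to dynamic scenes. Experiments show that factored flow prediction outperforms alternative designs and that performance keeps improving with more unlabeled data; Flow3r achieves state-of-the-art results on eight benchmarks, with the largest gains on in-the-wild dynamic videos.

Link: https://arxiv.org/abs/2602.20157
Authors: Zhongxiao Cong,Qitao Zhao,Minsik Jeon,Shubham Tulsiani
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026. Project website: this https URL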

Abstract:Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision – expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.

[CV-4] Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

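[Quick read]: This paper addresses real-to-sim scene estimation, that is, estimating simulation-ready scenes from real-world observations, focusing on recovering the shapes and poses of multiple rigid objects in cluttered environments. Existing methods tend to suffer from high computational cost, poor robustness, and limited scalability in such scenes. The key to its solution is a unified optimization-based formulation that jointly recovers the geometry and poses of multiple rigid objects, built on two techniques: first, a recently introduced shape-differentiable contact model whose global differentiability allows joint optimization over geometry and pose while modeling inter-object contact; second, an efficient linear system solver that exploits the structured sparsity of the augmented Lagrangian Hessian, so that cost scales favorably with scene complexity. Together these give an end-to-end pipeline comprising learning-based initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement.

Link: https://arxiv.org/abs/2602.20150
Authors: Wei-Cheng Huang,Jiaheng Han,Xiaohan Ye,Zherong Pan,Kris Hauser
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 13 figures, in submission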

Abstract:Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.

[CV-5] Do Large Language Models Understand Data Visualization Rules?

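[Quick read]: This paper asks whether large language models (LLMs) can reason about and enforce data visualization rules, that is, whether they can identify and verify rule violations in charts as accurately as symbolic constraint systems such as Draco. Constraint-based systems rely on hand-encoded logical constraints, which allow precise checks but are costly to maintain and inflexible. The study instead expresses visualization rules in natural language, builds a benchmark dataset with hard-verification ground truth derived from Answer Set Programming (ASP), and systematically evaluates how accurately LLMs detect violations of common and subtler perceptual rules and how consistently they follow the required structured output format. Its main contributions are translating a subset of Draco's constraints into natural-language statements and showing, in controlled experiments, that translating ASP constraints into natural language improves the performance of smaller models by up to 150%. LLMs thus show potential as flexible, language-driven rule validators, but their handling of subtle perceptual rules still needs to improve.

Link: https://arxiv.org/abs/2602.20137
Authors: Martin Sinnona,Valentin Bonas,Emmanuel Iarussi,Viviana Siless
Affiliations: Universidad Torcuato Di Tella; Consejo Nacional de Investigaciones Científicas y Técnicas; Universidad de Buenos Aires
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: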

Abstract:Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco’s constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 0.15 for some categories) and for outputs generated from technical ASP this http URL constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.
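
The F1 figures quoted above combine precision and recall over detected rule violations. As a reminder of how such a score is formed, here is a minimal sketch for a single chart; the rule names are hypothetical placeholders, and the paper's actual aggregation across charts and categories may differ.

```python
def f1_score(predicted: set, actual: set) -> float:
    """F1 between the set of violations a model reports and the ground truth."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # reported violations that are real
    recall = true_positives / len(actual)        # real violations that were reported
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical rule identifiers, used only for illustration.
    ground_truth = {"bar_without_zero_baseline", "too_many_colors"}
    model_output = {"bar_without_zero_baseline", "overlapping_labels"}
    print(f1_score(model_output, ground_truth))
```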

[CV-6] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

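[Quick read]: This paper addresses how robots can combine high-level semantic reasoning with low-level physical execution in long-horizon manipulation tasks, given that existing vision-language models (VLMs) and video generation models lack the physical grounding needed for real-world execution. The key to its solution is NovaPlan, a hierarchical framework that unifies closed-loop VLM planning with geometrically grounded robot execution. At the high level, a VLM decomposes tasks and monitors execution state, supporting autonomous re-planning after single-step failures. At the low level, task-relevant object keypoints and human hand poses are extracted from generated videos as kinematic priors, and a switching mechanism selects the better reference for producing stable robot actions, keeping execution reliable even under heavy occlusion or inaccurate depth.

Link: https://arxiv.org/abs/2602.20119
Authors: Jiahui Fu,Junyu Nan,Lingfeng Sun,Hongyu Li,Jianing Qian,Jennifer L. Barry,Kris Kitani,George Konidaris
Affiliations: Robotics and AI Institute; Carnegie Mellon University; Brown University; University of Pennsylvania
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 15 figures. Project webpage: this https URL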

Abstract:Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: this https URL

[CV-7] Benchmarking Unlearning for Vision Transformers

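[Quick read]: This paper addresses the lack of benchmarks for evaluating and comparing machine unlearning (MU) algorithms on Vision Transformers (VTs). MU research has focused mainly on convolutional neural networks (CNNs), while VTs, an important architecture family in computer vision, have not been systematically evaluated in the MU setting. The key to its solution is the first MU benchmark for VTs, covering two mainstream VT families (ViT and Swin-T), different model capacities, multiple datasets (to assess the effect of scale and complexity), MU algorithms representing fundamentally different approaches, and both single-shot and continual unlearning protocols, with a focus on unlearning methods that leverage training-data memorization. The work also characterizes how VTs memorize training data relative to CNNs and uses unified metrics (forget quality plus accuracy on retained and unseen data) to enable reproducible, fair, and comprehensive evaluation of MU algorithms, giving future MU research on VTs a reference performance baseline.

Link: https://arxiv.org/abs/2602.20114
Authors: Kairan Zhao,Iurie Luca,Peter Triantafillou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: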

Abstract:Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Increasingly, Vision Transformers (VTs) emerge as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. While benchmarking MU efforts have addressed LLMs, diffusion models, and CNNs, none exist for VTs. This work is the first to attempt this, benchmarking MU algorithm performance in different VT families (ViT and Swin-T) and at different capacities. The work employs (i) different datasets, selected to assess the impacts of dataset scale and complexity; (ii) different MU algorithms, selected to represent fundamentally different approaches for MU; and (iii) both single-shot and continual unlearning protocols. Additionally, it focuses on benchmarking MU algorithms that leverage training data memorization, since leveraging memorization has been recently discovered to significantly improve the performance of previously SOTA algorithms. En route, the work characterizes how VTs memorize training data relative to CNNs, and assesses the impact of different memorization proxies on performance. The benchmark uses unified evaluation metrics that capture two complementary notions of forget quality along with accuracy on unseen (test) data and on retained data. Overall, this work offers a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs. And, for the first time, it sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline.

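[CV-8] Transcending the Annotation Bottleneck: AI-Powered Discovery in Biology and Medicine

[Quick read]: This paper addresses a long-standing bottleneck in biomedical applications of artificial intelligence: heavy dependence on expert-annotated data. The key to its solution is the use of unsupervised and self-supervised learning (SSL), which learn features directly from the intrinsic structure of biobank-scale data, such as pixels in a magnetic resonance image (MRI), voxels in a volumetric scan, or tokens in a genomic sequence. This makes it possible to discover novel phenotypes, link morphology to genetics, and detect anomalies without human annotation or the bias it introduces.

Link: https://arxiv.org/abs/2602.20100
Authors: Soumick Chatterjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: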

Abstract:The dependence on expert annotation has long constituted the primary rate-limiting step in the application of artificial intelligence to biomedicine. While supervised learning drove the initial wave of clinical algorithms, a paradigm shift towards unsupervised and self-supervised learning (SSL) is currently unlocking the latent potential of biobank-scale datasets. By learning directly from the intrinsic structure of data - whether pixels in a magnetic resonance image (MRI), voxels in a volumetric scan, or tokens in a genomic sequence - these methods facilitate the discovery of novel phenotypes, the linkage of morphology to genetics, and the detection of anomalies without human bias. This article synthesises seminal and recent advances in “learning without labels,” highlighting how unsupervised frameworks can derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance that rivals or exceeds supervised counterparts.

[CV-9] StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues CVPR2026

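[Quick read]: This paper addresses the drop in cross-modal retrieval performance in vision-language alignment when captions are long and detail-rich, particularly when structural information is implicit or easily overlooked. Its core solution is StructXLIP, a fine-tuning alignment paradigm based on edge maps: edge maps are extracted from images as proxies for visual structure, and captions are filtered to emphasize structural cues, yielding "structure-centric" representations. Three structure-centric losses are then added: aligning edge maps with structural text, matching local edge regions to text chunks, and tying edge maps to color images. Together they make multimodal structural representations more consistent. The method improves on standard CLIP in cross-modal retrieval and, from a theoretical standpoint, extends the mutual-information maximization objective, guiding the model toward more robust and semantically stable minima.

Link: https://arxiv.org/abs/2602.20089
Authors: Zanxi Ruan,Qiuyu Kong,Songqun Gao,Yiming Wang,Marco Cristani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2026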

Abstract:Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them “structure-centric”. Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: this https URL.

[CV-10] Do Large Language Models Understand Data Visualization Principles?

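[Quick read]: This paper addresses how large language models (LLMs) and vision-language models (VLMs) can directly reason about and verify data visualization principles, as an alternative to traditional constraint systems based on symbolic rules. The key to its solution is a controlled dataset of about 2,000 Vega-Lite chart specifications explicitly annotated with principle violations, complemented by more than 300 real-world charts, on which LLMs and VLMs are systematically evaluated on detection (checking) and correction (fixing) tasks. The approach allows flexible validation and editing of visualization designs without experts hand-writing formal rules, suggesting a new route to automated quality assurance for visualizations.

Link: https://arxiv.org/abs/2602.20084
Authors: Martin Sinnona,Valentin Bonas,Viviana Siless,Emmanuel Iarussi
Affiliations: Universidad Torcuato Di Tella; Consejo Nacional de Investigaciones Científicas y Técnicas; Universidad de Buenos Aires
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: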

Abstract:Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our results highlight both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. They also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.

[CV-11] SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis

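[Quick read]: This paper addresses the semantically implausible and heavily distorted images that novel view synthesis (NVS) methods generate under long-range camera motion, which indicates that existing methods degrade markedly as the viewpoint moves far from the input view. The key to its solution is integrating pre-trained semantic feature extractors so that stronger scene semantics serve as conditioning, improving generation quality and consistency. The authors investigate two strategies: conditioning on warped semantic features, and alternating between understanding and generation at each denoising step. Experiments on multiple datasets show FID improvements of 4.69%-15.26% over state-of-the-art alternatives.

Link: https://arxiv.org/abs/2602.20079
Authors: Xinya Chen,Christopher Wewer,Jiahao Xie,Xinting Hu,Jan Eric Lenssen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: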

Abstract:We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS), which improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view; however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. We speculate that this degradation is due to current models failing to fully understand their conditioning or intermediate generated scene content. Here, we propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning to achieve high-quality generation even at distant viewpoints. We investigate two different strategies, (1) warped semantic features and (2) an alternating scheme of understanding and generation at each denoising step. Experimental results on multiple datasets demonstrate clear qualitative and quantitative (4.69%-15.26% in FID) improvements over state-of-the-art alternatives.

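[CV-12] The Invisible Gorilla Effect in Out-of-distribution Detection CVPR2026

[Quick read]: This paper addresses out-of-distribution (OOD) detection for deep neural networks, and in particular why detection performance varies widely across types of image artefact for reasons that remain unclear. The study identifies a previously unreported bias, the "Invisible Gorilla Effect": for hard-to-detect near-OOD artefacts, detection performance improves when the artefact's visual features (for example colour) resemble the model's region of interest (ROI) and drops markedly when they do not. The key to the analysis is systematically annotating artefacts by colour and generating colour-swapped counterfactuals, which rules out dataset bias as the cause and shows that most current OOD detection methods are affected, pointing to new directions for designing more robust OOD detectors.

Link: https://arxiv.org/abs/2602.20068
Authors: Harry Anthony,Ziyun Liang,Hermione Warr,Konstantinos Kamnitsas
Affiliations: University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2026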

Abstract:Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model’s ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations are available at: this https URL.
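
The abstract names the Mahalanobis Score as one of the evaluated detectors. The sketch below shows only the underlying distance computation on hand-made 2-D features, assuming a single Gaussian fitted to in-distribution features; the method as used in the OOD literature operates on network features and typically fits class-conditional Gaussians, so this is a simplified illustration with hypothetical data, not the paper's pipeline.

```python
def fit_gaussian(features):
    """Return the mean and inverse covariance of 2-D in-distribution features."""
    n = len(features)
    mx = sum(f[0] for f in features) / n
    my = sum(f[1] for f in features) / n
    sxx = sum((f[0] - mx) ** 2 for f in features) / (n - 1)
    syy = sum((f[1] - my) ** 2 for f in features) / (n - 1)
    sxy = sum((f[0] - mx) * (f[1] - my) for f in features) / (n - 1)
    det = sxx * syy - sxy * sxy  # assumed non-zero for this sketch
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv


def mahalanobis_score(x, mean, inv):
    """Distance of x from the fitted Gaussian; larger means more likely OOD."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return quad ** 0.5


if __name__ == "__main__":
    # Hypothetical in-distribution features.
    train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    mean, inv = fit_gaussian(train)
    print(mahalanobis_score((1.0, 1.0), mean, inv))  # at the mean
    print(mahalanobis_score((5.0, 5.0), mean, inv))  # far from the training data
```

A sample is flagged as OOD when its score exceeds a threshold chosen on held-out data; AUROC, the metric quoted in the abstract, summarizes detection quality across all thresholds.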

[CV-13] HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images

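[Quick read]: This paper addresses the lack of detailed building-level data in decarbonizing urban space heating, which makes accurate heat-demand maps hard to produce. Its core solution is HeatPrompt, a zero-shot vision-language energy modeling framework that uses pretrained Large Vision Language Models (VLMs) with a domain-specific prompt to extract semantic features (such as roof age and building density) from satellite images, and a Multi-Layer Perceptron (MLP) regressor to predict heat demand. The key idea is to have VLMs identify visual attributes related to thermal load directly from RGB satellite images, without large amounts of labeled data: compared with the baseline model, R^2 improves by 93.7% and mean absolute error (MAE) falls by 30%, offering lightweight support for heat planning in data-scarce regions.

Link: https://arxiv.org/abs/2602.20066
Authors: Kundan Thota,Xuanhao Mu,Thorsten Schlachter,Veit Hagenmeyer
Affiliations: Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: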

Abstract:Accurate heat-demand maps play a crucial role in decarbonizing space heating, yet most municipalities lack detailed building-level data needed to calculate them. We introduce HeatPrompt, a zero-shot vision-language energy modeling framework that estimates annual heat demand using semantic features extracted from satellite images, basic Geographic Information System (GIS), and building-level features. We feed pretrained Large Vision Language Models (VLMs) with a domain-specific prompt to act as an energy planner and extract visual attributes that correspond to the thermal load, such as roof age and building density, from the RGB satellite image. A Multi-Layer Perceptron (MLP) regressor trained on these captions shows an R^2 uplift of 93.7% and shrinks the mean absolute error (MAE) by 30% compared to the baseline model. Qualitative analysis shows that high-impact tokens align with high-demand zones, offering lightweight support for heat planning in data-scarce regions.
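
The two regression metrics quoted above, R^2 and MAE, have standard definitions. The following sketch computes both on made-up numbers; the values are hypothetical and unrelated to the paper's data, and the reported uplift and reduction are presumably relative to the baseline model's scores.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical annual heat demand values, for illustration only.
    demand = [100.0, 150.0, 200.0, 250.0]
    predicted = [110.0, 140.0, 210.0, 240.0]
    print(mean_absolute_error(demand, predicted), r_squared(demand, predicted))
```

An R^2 of 1 means perfect prediction, while 0 means the model does no better than always predicting the mean of the targets.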

[CV-14] MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving

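[Quick read]: This paper addresses an inherent trade-off in anchor-guided generative trajectory planning: the discrete anchor vocabulary must cover the test-time trajectory distribution well enough to ensure robustness, yet enlarging the vocabulary affects model performance. The key to its solution is MeanFuser, which achieves efficient and robust end-to-end autonomous driving through three designs: (1) Gaussian Mixture Noise (GMN) guides generative sampling, giving a continuous representation of the trajectory space and removing the dependence on discrete anchor vocabularies; (2) the "MeanFlow Identity" is adapted to model the mean velocity field between GMN and the trajectory distribution instead of the instantaneous velocity field, eliminating numerical error from ordinary differential equation (ODE) solvers and significantly accelerating inference; (3) a lightweight Adaptive Reconstruction Module (ARM) uses attention weights to implicitly select among the sampled trajectories or reconstruct a new one, making decisions more flexible.

Link: https://arxiv.org/abs/2602.20060
Authors: Junli Wang,Xueyi Liu,Yinan Zheng,Zebing Xing,Pengfei Li,Guang Li,Kun Ma,Guang Chen,Hangjun Ye,Zhongpu Xia,Long Chen,Qichao Zhang
Affiliations: SKL-MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Xiaomi EV; Institute for AI Industry Research (AIR), Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: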

Abstract:Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt the "MeanFlow Identity" to end-to-end planning, which models the mean velocity field between GMN and trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory via attention weights. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at this https URL.
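
Design (2) contrasts a mean velocity field with the instantaneous velocity field of vanilla flow matching. The toy below illustrates why that matters, under strong assumptions: the instantaneous field is the hand-picked ODE dz/dt = z, and its average velocity over [0, 1] is written analytically rather than learned by a network. It is not the MeanFuser model; the time convention and all function names are choices made for this example.

```python
import math


def instantaneous_velocity(z, t):
    """Toy velocity field dz/dt = z, whose exact solution is z(1) = z(0) * e."""
    return z


def euler_integrate(z0, velocity, steps):
    """Multi-step ODE solve from t=0 to t=1; accumulates discretization error."""
    z, dt = z0, 1.0 / steps
    for i in range(steps):
        z += dt * velocity(z, i * dt)
    return z


def mean_velocity(z0):
    """Average velocity over [0, 1] for the toy field: (z(1) - z(0)) / 1."""
    return z0 * (math.e - 1.0)


def one_step_sample(z0):
    """Single update using the mean velocity; no ODE solver is involved."""
    return z0 + 1.0 * mean_velocity(z0)


if __name__ == "__main__":
    exact = math.e
    print("Euler, 4 steps:", euler_integrate(1.0, instantaneous_velocity, 4), "exact:", exact)
    print("one step      :", one_step_sample(1.0), "exact:", exact)
```

In this toy the single mean-velocity step lands on the exact endpoint while the four-step Euler solve does not, which mirrors the claimed removal of solver error and the speed-up from one-step generation.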

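[CV-15] To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation

[Quick read]: This paper addresses the case where clutter in real-world environments blocks every path, so that a mobile robot cannot complete sequential object-placement tasks. Conventional visual navigation assumes an obstacle-free path exists, but in real homes or warehouses obstacles may block all feasible routes. The paper therefore poses the Lifelong Interactive Navigation problem, in which the robot must use manipulation to move obstacles and clear its own path. The key to its solution is a constraint-based planning framework driven by a large language model (LLM) and combined with active perception: the LLM reasons over a structured scene graph to decide which object to move, where to place it, and where to look next, so that exploration targets regions useful for the task rather than scanning the whole environment. A standard motion planner then executes the navigate-pick-place or detour actions, keeping low-level control reliable. The method outperforms non-learning and learning-based baselines in the physics-enabled ProcTHOR-10k simulation environment and is also demonstrated on real hardware.

Link: https://arxiv.org/abs/2602.20055
Authors: Apoorva Vashisth(1),Manav Kulshrestha(1),Pranav Bakshi(2),Damon Conover(3),Guillaume Sartoretti(4),Aniket Bera(1)
Affiliations: (1) Purdue University; (2) IIT Kharagpur; (3) DEVCOM Army Research Lab; (4) National University of Singapore
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: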

Abstract:Visual navigation typically assumes the existence of at least one obstacle-free path between start and goal, which must be discovered/planned by the robot. However, in real-world scenarios, such as home environments and warehouses, clutter can block all routes. Targeted at such cases, we introduce the Lifelong Interactive Navigation problem, where a mobile robot with manipulation abilities can move clutter to forge its own path to complete sequential object-placement tasks, each involving placing a given object (e.g., alarm clock, pillow) onto a target object (e.g., dining table, desk, bed). To address this lifelong setting, where the effects of environment changes accumulate over the long term, we propose an LLM-driven, constraint-based planning framework with active perception. Our framework allows the LLM to reason over a structured scene graph of discovered objects and obstacles, deciding which object to move, where to place it, and where to look next to discover task-relevant information. This coupling of reasoning and active perception allows the agent to explore the regions expected to contribute to task completion rather than exhaustively mapping the environment. A standard motion planner then executes the corresponding navigate-pick-place or detour sequence, ensuring reliable low-level control. Evaluated in the physics-enabled ProcTHOR-10k simulator, our approach outperforms non-learning and learning-based baselines. We further demonstrate our approach qualitatively on real-world hardware.

[CV-16] Decoupling Defense Strategies for Robust Image Watermarking CVPR2026

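[Quick read]: This paper addresses the vulnerability of deep-learning-based image watermarking to advanced adversarial and regeneration attacks. Conventional methods that jointly optimize the encoder and decoder face two challenges: degraded clean accuracy and limited robustness. The key to its solution is AdvMark, a two-stage fine-tuning framework. Stage 1 uses a tailored adversarial training paradigm that primarily fine-tunes the encoder and updates the decoder only conditionally, moving the image into a non-attackable region rather than modifying the decision boundary, which preserves clean accuracy. Stage 2 handles distortion and regeneration attacks through direct image optimization, using a constrained image loss with theoretical guarantees that balances the deviation from the cover image and from the previously encoded image while preserving the adversarial robustness gained in stage 1; a quality-aware early-stop mechanism guarantees a lower bound on visual quality.

Link: https://arxiv.org/abs/2602.20053
Authors: Jiahui Chen,Zehang Deng,Zeyu Zhang,Chaoyang Li,Lianchen Jia,Lifeng Sun
Affiliations: Tsinghua University; Swinburne University of Technology; The Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026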

Abstract:Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face two inevitable challenges: (1) decrease of clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training of all three advanced attacks. To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability via a tailored adversarial training paradigm that primarily fine-tunes the encoder while only conditionally updating the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, thus preserving clean accuracy. In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees, which balances the deviation from cover and previous encoded images. We also propose a quality-aware early-stop to further guarantee the lower bound of visual quality. Extensive experiments demonstrate that AdvMark achieves the highest image quality and comprehensive robustness, with up to 29%, 33% and 46% accuracy improvement for distortion, regeneration and adversarial attacks, respectively.

[CV-17] SEAL-pose: Enhancing 3D Human Pose Estimation via a Learned Loss for Structural Consistency

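[Quick read]: This paper addresses the difficulty of modeling the complex local and global dependencies among joints in 3D human pose estimation (3D HPE) with conventional supervised losses. Existing methods either treat each joint independently or rely on hand-designed priors, which are hard to optimize end to end and inflexible. The key to the solution is SEAL-pose, a framework in which a learnable loss network (loss-net) evaluates the structural plausibility of a pose. Built on the joint graph, the loss-net learns complex structural dependencies directly from data, replacing hand-crafted rule constraints with a differentiable objective for structural consistency.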

Link: https://arxiv.org/abs/2602.20051
Authors: Yeonsung Kim, Junggeun Do, Seunguk Do, Sangmin Kim, Jaesik Park, Jay-Yoon Lee
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages

Abstract:3D human pose estimation (HPE) is characterized by intricate local and global dependencies among joints. Conventional supervised losses are limited in capturing these correlations because they treat each joint independently. Previous studies have attempted to promote structural consistency through manually designed priors or rule-based constraints; however, these approaches typically require manual specification and are often non-differentiable, limiting their use as end-to-end training objectives. We propose SEAL-pose, a data-driven framework in which a learnable loss-net trains a pose-net by evaluating structural plausibility. Rather than relying on hand-crafted priors, our joint-graph-based design enables the loss-net to learn complex structural dependencies directly from data. Extensive experiments on three 3D HPE benchmarks with eight backbones show that SEAL-pose reduces per-joint errors and improves pose plausibility compared with the corresponding backbones across all settings. Beyond improving each backbone, SEAL-pose also outperforms models with explicit structural constraints, despite not enforcing any such constraints. Finally, we analyze the relationship between the loss-net and structural consistency, and evaluate SEAL-pose in cross-dataset and in-the-wild settings.
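The limitation of per-joint losses described above can be illustrated with a small sketch. It is not the SEAL-pose loss-net; it only contrasts the standard mean per-joint position error (MPJPE) with a simple bone-length check. The three-joint chain, the `BONES` list, and the two example predictions are hypothetical.

```python
import math

# Hypothetical 3-joint chain (hip -> knee -> ankle); bones are index pairs.
BONES = [(0, 1), (1, 2)]

def mpjpe(pred, gt):
    """Mean per-joint position error: every joint is scored independently."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def bone_length_error(pred, gt, bones=BONES):
    """Mean absolute bone-length difference, a simple structural measure."""
    def length(pose, a, b):
        return math.dist(pose[a], pose[b])
    diffs = [abs(length(pred, a, b) - length(gt, a, b)) for a, b in bones]
    return sum(diffs) / len(diffs)

gt = [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, -2.0, 0.0)]
# Both predictions displace every joint by 0.1, so their MPJPE is the same.
shifted = [(0.1, 0.0, 0.0), (0.1, -1.0, 0.0), (0.1, -2.0, 0.0)]    # rigid shift
stretched = [(0.0, 0.1, 0.0), (0.1, -1.0, 0.0), (0.0, -2.1, 0.0)]  # bones grow

print(mpjpe(shifted, gt), bone_length_error(shifted, gt))
print(mpjpe(stretched, gt), bone_length_error(stretched, gt))
```

The two predictions are indistinguishable under MPJPE, yet only the rigidly shifted one keeps the bone lengths of the ground truth. A learned loss of the kind the abstract describes would presumably capture such structure without a hand-written rule like this one.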

[CV-18] Closing the gap in multimodal medical representation alignment

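[Quick read]: This paper addresses insufficient semantic alignment caused by the modality gap in multimodal learning, in particular the difficulty of aligning cross-modal representations of images and clinical text in the medical domain. The key to the solution is a modality-agnostic framework that removes non-semantic differences between modalities and strengthens the alignment of semantically related representations, improving cross-modal retrieval and image captioning for radiology images and clinical text.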

Link: https://arxiv.org/abs/2602.20046
Authors: Eleonora Grassucci, Giordano Cicchetti, Danilo Comminiello
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at MLSP2025

Abstract:In multimodal learning, CLIP has emerged as the de-facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unknown and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the latter case, revealing that the modality gap is present also in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.

[CV-19] EEG-Driven Intention Decoding: Offline Deep Learning Benchmarking on a Robotic Rover

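[Quick read]: This paper addresses the challenge of accurately decoding user intent through a brain-computer interface (BCI) to control a mobile robot in real-world settings. The key to the solution is a brain-robot control framework based on offline decoding. Twelve participants drove a 4WD Rover Pro platform along a predefined route with a joystick while 16-channel electroencephalogram (EEG) signals were recorded and aligned with motor actions at several time lags ($\Delta = 0$ ms and future prediction horizons $\Delta > 0$ ms). After preprocessing, several deep learning models were compared (convolutional neural networks, recurrent neural networks, and Transformer architectures); ShallowConvNet performed best on both action prediction and intent prediction, giving a reproducible benchmark and design insights for predictive deep learning-based BCI systems.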

Link: https://arxiv.org/abs/2602.20041
Authors: Ghadah Alosaimi, Maha Alsayyari, Yixin Sun, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon
Affiliations: Imam Mohammad Ibn Saud Islamic University; Durham University; King Saud University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Brain-computer interfaces (BCIs) provide a hands-free control modality for mobile robotics, yet decoding user intent during real-world navigation remains challenging. This work presents a brain-robot control framework for offline decoding of driving commands during robotic rover operation. A 4WD Rover Pro platform was remotely operated by 12 participants who navigated a predefined route using a joystick, executing the commands forward, reverse, left, right, and stop. Electroencephalogram (EEG) signals were recorded with a 16-channel OpenBCI cap and aligned with motor actions at $\Delta = 0$ ms and future prediction horizons ($\Delta > 0$ ms). After preprocessing, several deep learning models were benchmarked, including convolutional neural networks, recurrent neural networks, and Transformer architectures. ShallowConvNet achieved the highest performance for both action prediction and intent prediction. By combining real-world robotic control with multi-horizon EEG intention decoding, this study introduces a reproducible benchmark and reveals key design insights for predictive deep learning-based BCI systems.
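The alignment of EEG windows with actions at a prediction horizon can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `lagged_pairs`, the window length, and the toy label sequence are assumptions, and the horizon is expressed in samples instead of milliseconds.

```python
def lagged_pairs(n_samples, actions, window, delta):
    """Pair each EEG window with the action `delta` samples after its end.

    n_samples: number of EEG samples in the recording
    actions:   per-sample action labels, len(actions) == n_samples
    window:    window length in samples
    delta:     prediction horizon in samples (0 means the current action)
    Returns a list of ((start, stop), label) with `stop` exclusive.
    """
    pairs = []
    for stop in range(window, n_samples + 1):
        target = stop - 1 + delta      # index of the label to predict
        if target >= n_samples:
            break                      # no label that far in the future
        pairs.append(((stop - window, stop), actions[target]))
    return pairs

# Hypothetical toy recording: 10 samples, the command switches at sample 6.
actions = ["forward"] * 6 + ["left"] * 4
now = lagged_pairs(10, actions, window=4, delta=0)
ahead = lagged_pairs(10, actions, window=4, delta=2)
print(now[1], ahead[1])
```

With a positive horizon, the same window `(1, 5)` is labeled with the upcoming command instead of the current one, which is the difference between action prediction and intent prediction in the abstract's terms. Longer horizons also leave fewer usable windows at the end of a recording.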

[CV-20] Descriptor: Dataset of Parasitoid Wasps and Associated Hymenoptera (DAPWH)

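[Quick read]: This paper addresses the difficulty of taxonomic identification in the superfamily Ichneumonoidea (parasitoid wasps), which stems from their cryptic morphology and the large number of undescribed species. The key to the solution is a high-quality, richly annotated image dataset of 3,556 high-resolution images covering key groups such as Neotropical Ichneumonidae and Braconidae, supplemented by several other hymenopteran families to improve model robustness. Of these, 1,739 images are annotated in COCO format with multi-class bounding boxes (full insect body, wing venation, and scale bar), providing a data foundation for computer vision models for automated identification.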

Link: https://arxiv.org/abs/2602.20028
Authors: Joao Manoel Herrera Pinheiro, Gabriela Do Nascimento Herrera, Luciana Bueno Dos Reis Fernandes, Alvaro Doria Dos Santos, Ricardo V. Godoy, Eduardo A. B. Almeida, Helena Carolina Onody, Marcelo Andrade Da Costa Vieira, Angelica Maria Penteado-Dias, Marcelo Becker
Affiliations: São Carlos School of Engineering, University of São Paulo, São Carlos 13566590, SP, Brazil; Department of Ecology and Evolutionary Biology, Federal University of São Carlos, São Carlos 13565905, Brazil; Federal University of Tocantins, Porto Nacional, 77500000, Brazil; Department of Biology (FFCLRP), University of São Paulo, Ribeirão Preto 14040901, Brazil; State University of Piauí, Deputado Jesualdo Cavalcanti Campus, Corrente, 49800000, Brazil
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate taxonomic identification is the cornerstone of biodiversity monitoring and agricultural management, particularly for the hyper-diverse superfamily Ichneumonoidea. Comprising the families Ichneumonidae and Braconidae, these parasitoid wasps are ecologically critical for regulating insect populations, yet they remain one of the most taxonomically challenging groups due to their cryptic morphology and vast number of undescribed species. To address the scarcity of robust digital resources for these key groups, we present a curated image dataset designed to advance automated identification systems. The dataset contains 3,556 high-resolution images, primarily focused on Neotropical Ichneumonidae and Braconidae, while also including supplementary families such as Andrenidae, Apidae, Bethylidae, Chrysididae, Colletidae, Halictidae, Megachilidae, Pompilidae, and Vespidae to improve model robustness. Crucially, a subset of 1,739 images is annotated in COCO format, featuring multi-class bounding boxes for the full insect body, wing venation, and scale bars. This resource provides a foundation for developing computer vision models capable of identifying these families.
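For readers unfamiliar with COCO-format detection annotations, the sketch below shows what one annotated image with the three box classes named in the abstract might look like. The file name, image size, box coordinates, and category identifiers are all hypothetical; only the overall COCO layout (`images`, `categories`, `annotations`, boxes as `[x_min, y_min, width, height]`) is standard.

```python
from collections import Counter

# Hypothetical COCO-style record; names and numbers are illustrative only.
coco = {
    "images": [
        {"id": 1, "file_name": "specimen_0001.jpg", "width": 4000, "height": 3000}
    ],
    "categories": [
        {"id": 1, "name": "insect_body"},
        {"id": 2, "name": "wing_venation"},
        {"id": 3, "name": "scale_bar"},
    ],
    "annotations": [
        # COCO boxes are [x_min, y_min, width, height] in pixels.
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [500, 400, 2800, 1900]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [1200, 450, 1500, 700]},
        {"id": 12, "image_id": 1, "category_id": 3, "bbox": [3300, 2700, 500, 60]},
    ],
}

def boxes_per_category(dataset):
    """Count annotated boxes per category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])

def to_corners(bbox):
    """Convert a COCO [x, y, w, h] box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

print(boxes_per_category(coco))
print(to_corners(coco["annotations"][0]["bbox"]))
```

A scale-bar box is useful in practice because its pixel length, combined with the printed physical length, gives a pixels-per-millimetre factor for the image.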

[CV-21] Token-UNet: A New Case for Transformers Integration in Efficient and Interpretable 3D UNets for Brain Imaging Segmentation

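[Quick read]: This paper addresses the difficulty of deploying Transformer-based 3D medical image segmentation models such as SwinUNETR in compute-constrained environments. The core problem is that Transformer attention scales quadratically with the number of tokens, which itself grows cubically with 3D input resolution, leading to high memory use and slow inference on common hardware. The key to the solution is the Token-UNet architecture, which applies a TokenLearner module to the UNet's convolutional feature maps and passes only a small preset number of tokens to the Transformer, sharply reducing computation, while retaining the convolutional encoder of the original UNet to preserve local detail. Experiments show that, at 33% of the memory and 10% of the inference time, the method achieves a better Dice score than SwinUNETR (87.21% vs. 86.75%).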

Link: https://arxiv.org/abs/2602.20008
Authors: Louis Fabrice Tshimanga, Andrea Zanola, Federico Del Pup, Manfredo Atzori
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: We present Token-UNet, adopting the TokenLearner and TokenFuser modules to encase Transformers into UNets. While Transformers have enabled global interactions among input elements in medical imaging, current computational challenges hinder their deployment on common hardware. Models like (Swin)UNETR adapt the UNet architecture by incorporating (Swin)Transformer encoders, which process tokens that each represent small subvolumes ($8^3$ voxels) of the input. The Transformer attention mechanism scales quadratically with the number of tokens, which is tied to the cubic scaling of 3D input resolution. This work reconsiders the role of convolution and attention, introducing Token-UNets, a family of 3D segmentation models that can operate in constrained computational environments and time frames. To mitigate computational demands, our approach maintains the convolutional encoder of UNet-like models, and applies TokenLearner to 3D feature maps. This module pools a preset number of tokens from local and global structures. Our results show this tokenization effectively encodes task-relevant information, yielding naturally interpretable attention maps. The memory footprint, computation times at inference, and parameter counts of our heaviest model are reduced to 33%, 10%, and 35% of the SwinUNETR values, with better average performance ($86.75\% \pm 0.19\%$ Dice score for SwinUNETR vs. our $87.21\% \pm 0.35\%$). This work opens the way to more efficient trainings in contexts with limited computational resources, such as 3D medical imaging. Easing model optimization, fine-tuning, and transfer-learning in limited hardware settings can accelerate and diversify the development of approaches, for the benefit of the research community.
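The scaling argument in the abstract can be made concrete with a few lines of arithmetic. The sketch below assumes non-overlapping $8^3$-voxel patches as stated in the abstract; the input sizes and the token budget of 64 are hypothetical and are not taken from the paper.

```python
def patch_tokens(shape, patch=8):
    """Number of non-overlapping patch^3 tokens in a 3D volume."""
    d, h, w = shape
    return (d // patch) * (h // patch) * (w // patch)

def attention_pairs(n_tokens):
    """Entries in one self-attention matrix (quadratic in token count)."""
    return n_tokens * n_tokens

# Hypothetical input sizes.
small = patch_tokens((96, 96, 96))      # 12^3 = 1728 tokens
large = patch_tokens((192, 192, 192))   # 24^3 = 13824 tokens

# Doubling the resolution gives 8x the tokens and 64x the attention entries.
print(small, attention_pairs(small))
print(large, attention_pairs(large))

# A fixed token budget (hypothetical value) keeps attention cost constant
# regardless of the input resolution.
pooled = 64
print(pooled, attention_pairs(pooled))
```

This is why pooling a preset number of tokens from convolutional feature maps decouples the attention cost from the volume size, at the price of relying on the pooling module to keep the task-relevant information.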

[CV-22] RADE-Net: Robust Attention Network for Radar-Only Object Detection in Adverse Weather

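[Quick read]: This paper addresses how radar perception can maintain high detection accuracy in adverse weather while coping with the large data size of full 4D Range-Azimuth-Doppler-Elevation (RADE) radar tensors and the scarcity of datasets that provide them, which push most approaches toward sparse point clouds or 2D projections and cause information loss. The key to the solution is a 3D projection method for fast-Fourier-transformed (FFT) RADE tensors that compresses the high-dimensional tensor into a compact representation preserving rich Doppler and elevation features, cutting the data size of a single frame by 91.9% and thereby increasing training and inference speed and lowering model complexity. On top of this, the lightweight RADE-Net combines low-level and high-level features with spatial and channel attention and uses decoupled detection heads that predict object center points directly in the range-azimuth domain and regress rotated 3D bounding boxes from feature maps in the Cartesian scene, for robust and efficient automotive perception.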

Link: https://arxiv.org/abs/2602.19994
Authors: Christof Leitgeb, Thomas Puchleitner, Max Peter Ronecker, Daniel Watzenig
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 2026 IEEE Intelligent Vehicles Symposium (IV)

Abstract:Automotive perception systems are obligated to meet high requirements. While optical sensors such as Camera and Lidar struggle in adverse weather conditions, Radar provides a more robust perception performance, effectively penetrating fog, rain, and snow. Since full Radar tensors have large data sizes and very few datasets provide them, most Radar-based approaches work with sparse point clouds or 2D projections, which can result in information loss. Additionally, deep learning methods show potential to extract richer and more dense features from low level Radar data and therefore significantly increase the perception performance. Therefore, we propose a 3D projection method for fast-Fourier-transformed 4D Range-Azimuth-Doppler-Elevation (RADE) tensors. Our method preserves rich Doppler and Elevation features while reducing the required data size for a single frame by 91.9% compared to a full tensor, thus achieving higher training and inference speed as well as lower model complexity. We introduce RADE-Net, a lightweight model tailored to 3D projections of the RADE tensor. The backbone enables exploitation of low-level and high-level cues of Radar tensors with spatial and channel-attention. The decoupled detection heads predict object center-points directly in the Range-Azimuth domain and regress rotated 3D bounding boxes from rich feature maps in the cartesian scene. We evaluate the model on scenes with multiple different road users and under various weather conditions on the large-scale K-Radar dataset and achieve a 16.7% improvement compared to their baseline, as well as 6.5% improvement over current Radar-only models. Additionally, we outperform several Lidar approaches in scenarios with adverse weather conditions. The code is available under this https URL.

[CV-23] RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

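[Quick read]: This paper addresses the spatial reasoning dilemma of current image generation models: they struggle to capture fine-grained spatial relationships in the prompt and to generate scenes with structural integrity. The key to the solution is RL-RIG, a reinforcement learning framework for reflection-based image generation that follows a Generate-Reflect-Edit paradigm and comprises four core components (Diffuser, Checker, Actor, and Inverse Diffuser) to elicit Chain of Thought reasoning during image generation. To give the model better intuition over generation trajectories, the authors also develop Reflection-GRPO, which trains the vision-language model (VLM) Actor to produce edit prompts and the Image Editor to produce better images. On the LAION-SG dataset, evaluated with Scene Graph IoU and a VLM-as-a-Judge strategy, the method improves controllable and precise spatial reasoning by up to 11% over existing state-of-the-art open-source models.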

Link: https://arxiv.org/abs/2602.19974
Authors: Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long, Bowen Zhou
Affiliations: Shanghai Jiao Tong University; Huazhong University of Science and Technology; National University of Singapore; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt, respectively. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
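One plausible reading of the Scene Graph IoU metric mentioned above is a set-level intersection over union between relation triples. The sketch below implements that reading only; the abstract does not define the metric, so the triple representation, the function name `scene_graph_iou`, and the example relations are assumptions.

```python
def scene_graph_iou(predicted, reference):
    """IoU between two sets of (subject, relation, object) triples."""
    p, r = set(predicted), set(reference)
    if not p and not r:
        return 1.0  # two empty graphs agree trivially
    return len(p & r) / len(p | r)

# Hypothetical triples: the prompt's scene graph vs. one parsed from the image.
reference = [("cat", "left of", "dog"), ("dog", "under", "table")]
predicted = [("cat", "left of", "dog"), ("dog", "on", "table")]

print(scene_graph_iou(predicted, reference))
```

Here one of three distinct triples is shared, so a single wrong spatial relation lowers the score considerably even when every object is present, which is the kind of error image-quality metrics tend to miss.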

[CV-24] When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

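[Quick read]: This paper addresses a performance regression of modern generative models when used as synthetic vision data generators: although text-to-image (T2I) diffusion models keep improving in visual fidelity and prompt adherence, classifiers trained on their synthetic data become less accurate on real test sets as the generators get newer. The key to the study is a systematic evaluation of how large-scale synthetic datasets generated by state-of-the-art T2I models released in different years affect the training of standard classifiers. It reveals a hidden trend: the models collapse to a narrow, aesthetic-centric distribution that reduces image diversity and harms label-image alignment, which challenges the common assumption that progress in generative realism implies progress in data realism.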

Link: https://arxiv.org/abs/2602.19946
Authors: Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue, Andreas Dengel
Affiliations: RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

[CV-25] Discover Segment and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation CVPR2026

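[Quick read]: This paper addresses inaccurate localization, false positives, and missed detections in current zero-shot Camouflaged Object Segmentation (COS) methods, which rely on multimodal large language models (MLLMs) for object discovery. Existing two-stage (discover-then-segment) methods are limited by the quality of the visual prompts the MLLM produces and struggle to extract accurate object regions in complex scenes. The key to the solution is the progressive Discover-Segment-Select (DSS) mechanism: a Feature-coherent Object Discovery (FOD) module first generates diverse object proposals; the Segment Anything Model (SAM) then refines these proposals into masks; finally, a Semantic-driven Mask Selection (SMS) module uses an MLLM to evaluate the candidate masks and select the best one. The pipeline requires no training or supervision and improves segmentation accuracy, especially in multiple-instance scenes.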

Link: https://arxiv.org/abs/2602.19944
Authors: Yilong Yang, Jianxin Tian, Shengchuan Zhang, Liujuan Cao
Affiliations: Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026 (main conference)

Abstract: Current zero-shot Camouflaged Object Segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a progressive framework designed to refine segmentation step by step. The proposed method contains a Feature-coherent Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Without requiring any training or supervision, DSS achieves state-of-the-art performance on multiple COS benchmarks, especially in multiple-instance scenes.
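The three-stage control flow described in the abstract can be sketched as a small training-free pipeline. The sketch only shows how the stages connect: `discover`, `segment`, and `score` are placeholders for the FOD module, SAM, and the MLLM-based selector, and the toy stand-ins below are hypothetical and unrelated to those models.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), upper bounds exclusive
Mask = List[List[int]]

def discover_segment_select(
    image,
    discover: Callable[[object], Sequence[Box]],
    segment: Callable[[object, Box], Mask],
    score: Callable[[object, Mask], float],
) -> Mask:
    """Propose regions, turn each into a mask, return the best-scored mask."""
    proposals = discover(image)
    if not proposals:
        raise ValueError("discovery returned no proposals")
    masks = [segment(image, box) for box in proposals]
    return max(masks, key=lambda m: score(image, m))

# Toy stand-ins on a 4x4 image whose object occupies the lower-right 3x3 block.
image = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]

def toy_discover(img):
    return [(0, 0, 2, 2), (1, 1, 4, 4)]

def toy_segment(img, box):
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(4)]
            for y in range(4)]

def toy_score(img, mask):
    inside = [img[y][x] for y in range(4) for x in range(4) if mask[y][x]]
    return sum(inside) / len(inside) if inside else 0.0

best = discover_segment_select(image, toy_discover, toy_segment, toy_score)
print(best)
```

Keeping several proposals alive until the final selection step is what distinguishes this structure from a discover-then-segment pipeline, where an early localization error cannot be corrected later.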

[CV-26] Learning Positive-Incentive Point Sampling in Neural Implicit Fields for Object Pose Estimation

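[Quick read]: This paper addresses the prediction uncertainty that neural implicit fields face in object pose estimation because unobserved regions of camera space provide no direct observational signal. As a result, densely sampling the whole camera space can yield inaccurate estimates that hinder learning and hurt performance. The key to the solution is to combine an SO(3)-equivariant convolutional implicit network with a positive-incentive point sampling (PIPS) strategy: the network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, and PIPS dynamically chooses sampling locations based on the input, improving accuracy and training efficiency. The method outperforms the state of the art on three pose estimation datasets and shows significant gains in challenging scenarios such as high occlusion, novel geometry, and severe noise.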

Link: https://arxiv.org/abs/2602.19937
Authors: Yifei Shi, Boyan Wan, Xin Xu, Kai Xu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Learning neural implicit fields of 3D shapes is a rapidly emerging field that enables shape representation at arbitrary resolutions. Due to the flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object’s canonical space (including unobserved regions in camera space), significantly boosting object pose estimation performance in challenging scenarios like highly occluded objects and novel shapes. Despite progress, predicting canonical coordinates for unobserved camera-space regions remains challenging due to the lack of direct observational signals. This necessitates heavy reliance on the model’s generalization ability, resulting in high uncertainty. Consequently, densely sampling points across the entire camera space may yield inaccurate estimations that hinder the learning process and compromise performance. To alleviate this problem, we propose a method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling (PIPS) strategy. The SO(3)-equivariant convolutional implicit network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, demonstrating superior performance compared to most existing baselines. The PIPS strategy dynamically determines sampling locations based on the input, thereby boosting the network’s accuracy and training efficiency. Our method outperforms the state-of-the-art on three pose estimation datasets. Notably, it demonstrates significant improvements in challenging scenarios, such as objects captured with unseen pose, high occlusion, novel geometry, and severe noise.

[CV-27] Expanding the Role of Diffusion Models for Robust Classifier Training

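[Quick read]: This paper addresses how to further improve the adversarial robustness of image classifiers trained with adversarial training (AT). The key to the solution is to go beyond using diffusion models only to generate synthetic data and to incorporate their internal representations (diffusion representations) into AT as an auxiliary learning signal. The study shows that these representations are diverse and partially robust and encourage more disentangled features, consistently improving robustness on CIFAR-10, CIFAR-100, and ImageNet, and that they play a role complementary to diffusion-generated synthetic data.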

Link: https://arxiv.org/abs/2602.19931
Authors: Pin-Han Huang, Shang-Tse Chen, Hsuan-Tien Lin
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Incorporating diffusion-generated synthetic data into adversarial training (AT) has been shown to substantially improve the training of robust image classifiers. In this work, we extend the role of diffusion models beyond merely generating synthetic data, examining whether their internal representations, which encode meaningful features of the data, can provide additional benefits for robust classifier training. Through systematic experiments, we show that diffusion models offer representations that are both diverse and partially robust, and that explicitly incorporating diffusion representations as an auxiliary learning signal during AT consistently improves robustness across settings. Furthermore, our representation analysis indicates that incorporating diffusion models into AT encourages more disentangled features, while diffusion representations and diffusion-generated synthetic data play complementary roles in shaping representations. Experiments on CIFAR-10, CIFAR-100, and ImageNet validate these findings, demonstrating the effectiveness of jointly leveraging diffusion representations and synthetic data within AT.

[CV-28] Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting ICLR2026

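[Quick read]: This paper addresses the difficulty 3D Gaussian Splatting (3DGS) has in separating diffuse and specular components because it encodes color with spherical harmonics, which limits its ability to represent complex reflections accurately. The key to the solution is an enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity, together with an error-driven compensation strategy that improves rendering quality in existing 3DGS scenes. Starting from 2D Gaussian initialization, the method adaptively inserts and optimizes enhanced Gaussian kernels to produce an augmented radiance field.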

Link: https://arxiv.org/abs/2602.19916
Authors: Yixin Yang, Bojian Wu, Yang Zhou, Hui Huang
Affiliations: Shenzhen University; Tencent Games
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to ICLR 2026. Project page: this https URL

Abstract:Due to the real-time rendering performance, 3D Gaussian Splatting (3DGS) has emerged as the leading method for radiance field reconstruction. However, its reliance on spherical harmonics for color encoding inherently limits its ability to separate diffuse and specular components, making it challenging to accurately represent complex reflections. To address this, we propose a novel enhanced Gaussian kernel that explicitly models specular effects through view-dependent opacity. Meanwhile, we introduce an error-driven compensation strategy to improve rendering quality in existing 3DGS scenes. Our method begins with 2D Gaussian initialization and then adaptively inserts and optimizes enhanced Gaussian kernels, ultimately producing an augmented radiance field. Experiments demonstrate that our method not only surpasses state-of-the-art NeRF methods in rendering performance but also achieves greater parameter efficiency. Project page at: this https URL.

[CV-29] Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery CVPR2026

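[Quick read]: This paper addresses the open-set recognition problem posed by Generalized Category Discovery (GCD), where labels are given for only part of the known categories. It focuses on the heavy reliance of multi-modal representation learning on inter-modality alignment and the lack of proper intra-modality alignment, which leaves the representation distribution without the desired structure. The key to the solution is SSR$^2$-GCD, a multi-modal representation learning framework based on Semi-Supervised Rate Reduction that emphasizes properly aligning intra-modality relationships to learn cross-modality representations with the desired structural properties. In addition, it integrates prompt candidates by using the inter-modal alignment offered by vision-language models to boost knowledge transfer.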

Link: https://arxiv.org/abs/2602.19910
Authors: Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Affiliations: Beijing University of Posts and Telecommunications; Intellifusion Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, accepted by CVPR 2026

Abstract: Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for GCD task are usually built on multi-modality representation learning, which is heavily dependent upon inter-modality alignment. However, few of them cast a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, to learn cross-modality representations with desired structural properties based on emphasizing to properly align intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.
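As background for the term "rate reduction", the block below states the coding-rate quantities as they are usually written in the maximal coding rate reduction (MCR$^2$) literature. This is an assumption about what the method builds on: the abstract does not give the objective, and the semi-supervised variant used by SSR$^2$-GCD may differ.

```latex
% Z \in \mathbb{R}^{d \times n}: n representations of dimension d
% \Pi_j: diagonal n x n membership matrix of group j; \epsilon: allowed distortion
R(Z, \epsilon) = \frac{1}{2} \log\det\!\left( I + \frac{d}{n\epsilon^{2}} Z Z^{\top} \right)

R_c(Z, \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
  \log\det\!\left( I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}} Z \Pi_j Z^{\top} \right)

\Delta R(Z, \Pi, \epsilon) = R(Z, \epsilon) - R_c(Z, \epsilon \mid \Pi)
```

Maximizing $\Delta R$ expands the volume spanned by all representations while compressing each group, which is one way to obtain the kind of structured representation distribution the abstract refers to.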

[CV-30] Gradient based Severity Labeling for Biomarker Classification in OCT ICIP

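[Quick read]: This paper addresses the performance loss in contrastive learning for medical images caused by arbitrary augmentations that can distort the small localized regions containing key biomarkers. On natural images, contrastive learning builds positive and negative pairs through augmentation, but on medical images this can destroy fine structures related to disease progression. The key to the solution is an unsupervised method that uses gradient responses from an anomaly detection algorithm to generate disease severity labels for unlabeled optical coherence tomography (OCT) scans, and then trains a supervised contrastive learning setup on these labels. This improves biomarker classification accuracy for key indicators of diabetic retinopathy by up to 6% over self-supervised baselines.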

Link: https://arxiv.org/abs/2602.19907
Authors: Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib, Stephanie Trejo Corona, Charles Wykoff
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at International Conference on Image Processing (ICIP) 2022

Abstract:In this paper, we propose a novel selection strategy for contrastive learning for medical images. On natural images, contrastive learning uses augmentations to select positive and negative pairs for the contrastive loss. However, in the medical domain, arbitrary augmentations have the potential to distort small localized regions that contain the biomarkers we are interested in detecting. A more intuitive approach is to select samples with similar disease severity characteristics, since these samples are more likely to have similar structures related to the progression of a disease. To enable this, we introduce a method that generates disease severity labels for unlabeled OCT scans on the basis of gradient responses from an anomaly detection algorithm. These labels are used to train a supervised contrastive learning setup to improve biomarker classification accuracy by as much as 6% above self-supervised baselines for key indicators of Diabetic Retinopathy.
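The idea of selecting contrastive positives by severity instead of by augmentation can be sketched as below. The abstract does not say how gradient responses are turned into labels, so the equal-frequency binning, the function names, and the example scores are all assumptions made for illustration.

```python
def severity_labels(scores, n_bins):
    """Rank-based equal-frequency binning of scores; 0 is the least severe."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(scores)
    return labels

def positive_pairs(labels):
    """Index pairs that share a severity label (contrastive positives)."""
    return [(i, j)
            for i in range(len(labels))
            for j in range(i + 1, len(labels))
            if labels[i] == labels[j]]

# Hypothetical anomaly scores for six unlabeled scans.
scores = [0.10, 0.90, 0.15, 0.85, 0.50, 0.55]
labels = severity_labels(scores, n_bins=3)
print(labels)
print(positive_pairs(labels))
```

Scans with similar scores end up as positives for one another, so no augmentation is needed to form pairs and the small biomarker regions are left untouched. The number of bins controls how strict "similar severity" is.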

[CV-31] ExpPortrait: Expressive Portrait Generation via Personalized Representation CVPR2026

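[Quick read]: This paper addresses the challenge current diffusion models face in generating expressive, coherent, and controllable cinematic portrait videos. Existing intermediate representations such as 2D landmarks and parametric models are sparse or low-rank, so they disentangle identity and expression poorly and cannot accurately preserve subject identity and fine expression details. The key to the solution is a high-fidelity personalized head representation that separates static, subject-specific global geometry from dynamic, expression-related details, together with an expression transfer module for personalized transfer of head pose and expression details between identities. This expressive head model then serves as the conditioning signal for training a diffusion transformer (DiT)-based generator that synthesizes high-quality portrait videos.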

Link: https://arxiv.org/abs/2602.19900
Authors: Junyi Wang, Yudong Guo, Boyang Guo, Shengming Yang, Juyong Zhang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to CVPR 2026

Abstract:While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.

[CV-32] Monocular Mesh Recovery and Body Measurement of Female Saanen Goats AAAI2026

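[Quick read]: This paper addresses the lack of high-fidelity 3D body modeling and automated body measurement for Saanen dairy goats in precision livestock farming; existing 3D reconstruction methods lack goat-specific data and therefore struggle to capture the animals' non-rigid, dynamic morphology. The key to the solution is FemaleSaanenGoat, a multi-view RGBD video dataset of female Saanen goats, and SaanenGoat, a parametric 3D shape model built on it. The model combines 41 skeletal joints with an enhanced udder representation, and its shape space is constructed from scans of 48 individuals. With this model, a single-view RGBD input suffices for high-quality 3D reconstruction and automated measurement of six key body dimensions (body length, body height, chest width, chest girth, hip width, hip height), which improves the accuracy and scalability of 3D vision in livestock farming.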

Link: https://arxiv.org/abs/2602.19896
Authors: Bo Jin, Shichao Zhao, Jin Lyu, Bin Zhang, Tao Yu, Liang An, Yebin Liu, Meili Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026

Abstract:The lactation performance of Saanen dairy goats, renowned for their high milk yield, is intrinsically linked to their body size, making accurate 3D body measurement essential for assessing milk production potential, yet existing reconstruction methods lack goat-specific authentic 3D data. To address this limitation, we establish the FemaleSaanenGoat dataset containing synchronized eight-view RGBD videos of 55 female Saanen goats (6-18 months). Using multi-view DynamicFusion, we fuse noisy, non-rigid point cloud sequences into high-fidelity 3D scans, overcoming challenges from irregular surfaces and rapid movement. Based on these scans, we develop SaanenGoat, a parametric 3D shape model specifically designed for female Saanen goats. This model features a refined template with 41 skeletal joints and enhanced udder representation, registered with our scan data. A comprehensive shape space constructed from 48 goats enables precise representation of diverse individual variations. With the help of SaanenGoat model, we get high-precision 3D reconstruction from single-view RGBD input, and achieve automated measurement of six critical body dimensions: body length, height, chest width, chest girth, hip width, and hip height. Experimental results demonstrate the superior accuracy of our method in both 3D reconstruction and body measurement, presenting a novel paradigm for large-scale 3D vision applications in precision livestock farming.

[CV-33] Make Some Noise: Unsupervised Remote Sensing Change Detection Using Latent Space Perturbations

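[Quick read]: This paper targets the limited generalization of existing unsupervised change detection (UCD) methods: mainstream approaches rely on predefined assumptions about change types (for example handcrafted rules or auxiliary generative models) and adapt poorly to rare or complex real-world changes. The key to the solution is MaSoN (Make Some Noise), an end-to-end framework that synthesizes diverse changes directly in the latent feature space during training. The changes are estimated dynamically from feature statistics of the target data, giving data-driven variation aligned with the target domain. MaSoN generalizes across diverse change types and improves the average F1 score on five benchmarks by 14.1 percentage points.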

Link: https://arxiv.org/abs/2602.19881
Authors: Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
Affiliations: University of Ljubljana, Faculty of Computer and Information Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unsupervised change detection (UCD) in remote sensing aims to localise semantic changes between two images of the same region without relying on labelled data during training. Most recent approaches rely either on frozen foundation models in a training-free manner or on training with synthetic changes generated in pixel space. Both strategies inherently rely on predefined assumptions about change types, typically introduced through handcrafted rules, external datasets, or auxiliary generative models. Due to these assumptions, such methods fail to generalise beyond a few change types, limiting their real-world usage, especially in rare or complex scenarios. To address this, we propose MaSoN (Make Some Noise), an end-to-end UCD framework that synthesises diverse changes directly in the latent feature space during training. It generates changes that are dynamically estimated using feature statistics of target data, enabling diverse yet data-driven variation aligned with the target domain. It also easily extends to new modalities, such as SAR. MaSoN generalises strongly across diverse change types and achieves state-of-the-art performance on five benchmarks, improving the average F1 score by 14.1 percentage points. Project page: this https URL
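
The idea of synthesizing changes in latent space from target feature statistics can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `synthesize_latent_change`, the square change region, and the use of per-channel standard deviation as the "feature statistic" are all hypothetical choices.

```python
import numpy as np

def synthesize_latent_change(feats, rng, region=4, strength=1.0):
    """Perturb a square region of a (C, H, W) feature map.

    Hypothetical sketch: the noise scale comes from the per-channel standard
    deviation of the feature map itself, standing in for "feature statistics
    of target data". Returns the perturbed features and the change mask.
    """
    c, h, w = feats.shape
    # Per-channel spread of the features, shape (C, 1, 1).
    std = feats.std(axis=(1, 2), keepdims=True)
    # Pick a random square region that will be labelled as "changed".
    top = rng.integers(0, h - region + 1)
    left = rng.integers(0, w - region + 1)
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + region, left:left + region] = True
    # Gaussian noise scaled by the feature statistics, applied inside the mask only.
    noise = rng.standard_normal(feats.shape) * std * strength
    changed = np.where(mask, feats + noise, feats)
    return changed, mask

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))   # stand-in for encoder features
changed, mask = synthesize_latent_change(feats, rng)
```

In a training loop of this kind, `mask` would serve as the pseudo-label for the change detector, so no pixel-space editing or external generative model is needed.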

[CV-34] BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations

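[Quick read]: This paper addresses the fact that 3D pose and shape reconstruction has not been integrated into deep-learning pipelines for automated animal behavior recognition. For non-human primates in particular (such as rhesus macaques), existing tracking based on sparse keypoints cannot capture the richness of action dynamics. The key to the solution is BigMaQ, the first large-scale dataset with detailed 3D pose-shape annotations for this task, built by adapting a high-quality macaque template mesh to individual monkeys to obtain subject-specific textured avatars; this yields pose descriptions more accurate than previous surface-based tracking methods. The derived BigMaQ500 benchmark shows that adding this 3D pose information substantially improves mean average precision (mAP) in action recognition, making the dataset a resource for studying visual appearance, posture, and social interaction in primates.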

Link: https://arxiv.org/abs/2602.19874
Authors: Lucas Martini, Alexander Lappe, Anna Bognár, Rufin Vogels, Martin A. Giese
Affiliations: Hertie Institute, University of Tübingen; IMPRS-IS; KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The recognition of dynamic and social behavior in animals is fundamental for advancing ethology, ecology, medicine and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet an accurate reconstruction of the three-dimensional (3D) pose and shape has not been integrated into this process. Especially for non-human primates, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that are unable to fully capture the richness of action dynamics. To address this gap, we introduce the Big MacaQue 3D Motion and Animation Dataset (BigMaQ), a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions. Extending previous surface-based animal tracking methods, we construct subject-specific textured avatars by adapting a high-quality macaque template mesh to individual monkeys. This allows us to provide pose descriptions that are more accurate than previous state-of-the-art surface-based animal tracking methods. From the original dataset, we derive BigMaQ500, an action recognition benchmark that links surface-based pose vectors to single frames across multiple individual monkeys. By pairing features extracted from established image and video encoders with and without our pose descriptors, we demonstrate substantial improvements in mean average precision (mAP) when pose information is included. With these contributions, BigMaQ establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource to advance the study of visual appearance, posture, and social interaction in non-human primates. The code and data are publicly available at this https URL .

[CV-35] GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery AAAI2026

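[Quick read]: This paper addresses catastrophic forgetting and inconsistent feature alignment in Continual Generalized Category Discovery (C-GCD), where a model must keep discovering novel classes in unlabeled data while retaining knowledge of known classes. Existing methods update classifier weights dynamically, which causes forgetting of old classes and unstable alignment in feature space. The key to the solution is GOAL, a unified framework built around a fixed Equiangular Tight Frame (ETF) classifier that maintains a consistent geometric structure throughout learning. GOAL applies supervised alignment to labeled samples and confidence-guided alignment to novel samples, so new classes are integrated stably without disrupting existing knowledge.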

Link: https://arxiv.org/abs/2602.19872
Authors: Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Shaokun Wang, Qiang Wang, Yihong Gong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026

Abstract:Continual Generalized Category Discovery (C-GCD) requires identifying novel classes from unlabeled data while retaining knowledge of known classes over time. Existing methods typically update classifier weights dynamically, resulting in forgetting and inconsistent feature alignment. We propose GOAL, a unified framework that introduces a fixed Equiangular Tight Frame (ETF) classifier to impose a consistent geometric structure throughout learning. GOAL conducts supervised alignment for labeled samples and confidence-guided alignment for novel samples, enabling stable integration of new classes without disrupting old ones. Experiments on four benchmarks show that GOAL outperforms the prior method Happy, reducing forgetting by 16.1% and boosting novel class discovery by 3.2%, establishing a strong solution for long-horizon continual discovery.
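
To make the "fixed ETF classifier" concrete, here is a minimal sketch of the standard simplex ETF construction, in which all class prototypes have unit norm and every pair has the same cosine similarity of -1/(K-1). The function name `simplex_etf`, the random orthonormal basis, and the dimensions are assumptions for illustration; the paper's own construction and training losses are not shown in the abstract.

```python
import numpy as np

def simplex_etf(num_classes: int, feat_dim: int, seed: int = 0) -> np.ndarray:
    """Return a (feat_dim, num_classes) matrix whose columns form a simplex ETF.

    Uses the textbook formula sqrt(K/(K-1)) * U @ (I - 11^T / K), where U has
    orthonormal columns. This particular construction needs feat_dim >= K.
    """
    k = num_classes
    if k < 2 or feat_dim < k:
        raise ValueError("need num_classes >= 2 and feat_dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Orthonormal columns from the reduced QR factorization of a random matrix.
    u, _ = np.linalg.qr(rng.standard_normal((feat_dim, k)))
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * u @ centering

W = simplex_etf(num_classes=5, feat_dim=16)
gram = W.T @ W   # pairwise inner products between class prototypes
```

Because `W` is never updated, features of old and new classes are always pulled toward the same fixed targets, which is the property the summary credits for reduced forgetting.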

[CV-36] ApET: Approximation-Error Guided Token Compression for Efficient VLMs CVPR2026

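[Quick read]: This paper addresses the high computational cost and slow inference caused by redundant visual tokens in Vision-Language Models (VLMs). Existing methods typically use [CLS] attention or text-vision cross-attention to identify and discard redundant tokens, but these approaches tend to introduce positional bias and are incompatible with efficient attention kernels such as FlashAttention, which limits practical accelerated deployment. The key to the solution is to drop the dependence on attention and approach compression from an information-theoretic perspective with ApET (Approximation-Error guided Token compression): it first reconstructs the original visual tokens from a small set of basis tokens via linear approximation, then uses the approximation error to identify and remove the least informative tokens. Because no attention scores are needed, ApET integrates directly with FlashAttention. It cuts the token budget by 88.9% on image tasks and 87.5% on video tasks while retaining 95.2% of the original performance on image understanding and reaching 100.4% on video understanding.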

Link: https://arxiv.org/abs/2602.19870
Authors: Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; University of Chinese Academy of Sciences; Harbin Institute of Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at this https URL.
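
The two-step recipe (linear approximation from basis tokens, then ranking by approximation error) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper `compress_tokens`, the evenly spaced choice of basis tokens, and the reading that a low reconstruction error marks a token as redundant are all hypothetical.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, num_basis: int, keep: int) -> np.ndarray:
    """Return sorted indices of the tokens to keep.

    tokens: (N, D) array of visual token embeddings.
    Assumption: basis tokens are sampled at evenly spaced positions, and tokens
    that the basis reconstructs well (low error) are treated as redundant.
    """
    n = tokens.shape[0]
    basis_idx = np.unique(np.linspace(0, n - 1, num_basis).round().astype(int))
    if keep < len(basis_idx):
        raise ValueError("keep must be at least the number of basis tokens")
    basis = tokens[basis_idx]                                  # (k, D)
    # Least squares: find coefficients so that coeffs.T @ basis approximates tokens.
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)  # (k, N)
    recon = coeffs.T @ basis                                   # (N, D)
    err = np.linalg.norm(tokens - recon, axis=1)
    err[basis_idx] = np.inf          # always retain the basis tokens themselves
    kept = np.argsort(-err)[:keep]   # highest approximation error first
    return np.sort(kept)

e1, e2, e3 = np.eye(4)[:3]
demo = np.stack([e1, e1 + e2, e3, 2 * e1 - e2, 0.5 * e2, e2])
kept = compress_tokens(demo, num_basis=2, keep=3)
```

In the demo, tokens 0 and 5 act as the basis, every other token except index 2 lies in their span, and so index 2 is the only non-basis token retained. Nothing in the procedure reads attention scores, which is why such a scheme can coexist with fused attention kernels.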

[CV-37] Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation CVPR2026

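[Quick read]: This paper addresses efficient cross-modal knowledge transfer for Earth Observation foundation models (EOFMs) on multispectral imagery, in particular learning unified representations across different sensors and modalities. The prevailing masked image modeling approach emphasizes local reconstruction but offers limited control over global semantic structure. The key to the solution is a dual-teacher contrastive distillation framework that combines a multispectral teacher with an optical Vision Foundation Model (VFM) teacher and aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern VFMs, yielding coherent cross-modal representations. The model improves multispectral performance without sacrificing performance on optical imagery, which supports contrastive distillation as a scalable route to representation learning across heterogeneous Earth Observation (EO) data sources.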

Link: https://arxiv.org/abs/2602.19863
Authors: Filip Wolf, Blaž Rolih, Luka Čehovin Zajc
Affiliations: University of Ljubljana, Faculty of Computer and Information Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Abstract:Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student’s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Code: Coming soon.

[CV-38] Contrastive meta-domain adaptation for robust skin lesion classification across clinical and acquisition conditions

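[Quick read]: This paper addresses the performance drop that deep learning models for skin lesion classification suffer under acquisition variability and domain-specific visual characteristics, especially when deployed in clinical settings. The key to the solution is an adaptation strategy based on "visual meta-domains" that transfers visual representations from large dermoscopic datasets to the clinical image domain, improving generalization robustness. Experiments on multiple dermatology datasets show consistent gains in classification performance and a smaller gap between dermoscopic and clinical images.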

Link: https://arxiv.org/abs/2602.19857
Authors: Rodrigo Mota, Kelvin Cunha, Emanoel dos Santos, Fábio Papais, Francisco Filho, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 5 figures, 1 table, ISBI 2026

Abstract:Deep learning models for dermatological image analysis remain sensitive to acquisition variability and domain-specific visual characteristics, leading to performance degradation when deployed in clinical settings. We investigate how visual artifacts and domain shifts affect deep learning-based skin lesion classification. We propose an adaptation strategy, grounded in the idea of visual meta-domains, that transfers visual representations from larger dermoscopic datasets into clinical image domains, thereby improving generalization robustness. Experiments across multiple dermatology datasets show consistent gains in classification performance and reduced gaps between dermoscopic and clinical images. These results emphasize the importance of domain-aware training for deployable systems.

[CV-39] DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

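[Quick read]: This paper addresses severe class imbalance in skin lesion classification datasets, where malignant cases are significantly underrepresented and bias the decision boundaries learned by deep models. The solution has three steps: first, class-conditioned diffusion models generate synthetic dermatological images to mitigate the imbalance; second, self-supervised MAE (Masked Autoencoder) pretraining lets a large Vision Transformer (ViT) learn robust, domain-relevant features; finally, knowledge distillation transfers these features to a lightweight ViT student suitable for efficient on-device inference on mobile hardware, as clinical deployment requires.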

Link: https://arxiv.org/abs/2602.19848
Authors: Francisco Filho, Kelvin Cunha, Fábio Papais, Emanoel dos Santos, Rodrigo Mota, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures, 1 table, ISBI 2026

Abstract:Skin lesion classification datasets often suffer from severe class imbalance, with malignant cases significantly underrepresented, leading to biased decision boundaries during deep learning training. We address this challenge using class-conditioned diffusion models to generate synthetic dermatological images, followed by self-supervised MAE pretraining to enable huge ViT models to learn robust, domain-relevant features. To support deployment in practical clinical settings, where lightweight models are required, we apply knowledge distillation to transfer these representations to a smaller ViT student suitable for mobile devices. Our results show that MAE pretraining on synthetic data, combined with distillation, improves classification performance while enabling efficient on-device inference for practical clinical use.

[CV-40] M3S-Net: Multimodal Feature Fusion Network Based on Multi-scale Data for Ultra-short-term PV Power Forecasting

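[Quick read]: This paper addresses stability problems in high-penetration photovoltaic (PV) grids caused by the inherent intermittency and high-frequency variability of solar irradiance, particularly during rapid cloud movement. Existing multimodal forecasting methods rely on shallow feature concatenation and binary cloud segmentation, so they fail to capture fine-grained optical features of clouds and the complex spatiotemporal coupling between visual and meteorological modalities. The key to the solution is M3S-Net, a multimodal feature fusion network based on multi-scale data. First, a multi-scale partial channel selection network uses partial convolutions to explicitly isolate the boundary features of optically thin clouds, going beyond the precision limits of coarse binary masks. Second, a multi-scale sequence-to-image analysis network uses a Fast Fourier Transform (FFT)-based time-frequency representation to disentangle the complex periodicity of meteorological data across time scales. Most importantly, a cross-modal Mamba interaction module with a dynamic C-matrix swapping mechanism exchanges state-space parameters between the visual and temporal streams, so the state evolution of one modality is conditioned on the context of the other. This gives deep structural coupling at linear computational complexity and overcomes the limits of shallow concatenation.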

Link: https://arxiv.org/abs/2602.19832
Authors: Penghui Niu, Taotao Cai, Suqi Zhang, Junhua Gu, Ping Zhang, Qiqi Liu, Jianxin Li
Affiliations: Hebei University of Technology; University of Southern Queensland; Tianjin University of Commerce; Westlake University; Edith Cowan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The inherent intermittency and high-frequency variability of solar irradiance, particularly during rapid cloud advection, present significant stability challenges to high-penetration photovoltaic grids. Although multimodal forecasting has emerged as a viable mitigation strategy, existing architectures predominantly rely on shallow feature concatenation and binary cloud segmentation, thereby failing to capture the fine-grained optical features of clouds and the complex spatiotemporal coupling between visual and meteorological modalities. To bridge this gap, this paper proposes M3S-Net, a novel multimodal feature fusion network based on multi-scale data for ultra-short-term PV power forecasting. First, a multi-scale partial channel selection network leverages partial convolutions to explicitly isolate the boundary features of optically thin clouds, effectively transcending the precision limitations of coarse-grained binary masking. Second, a multi-scale sequence to image analysis network employs Fast Fourier Transform (FFT)-based time-frequency representation to disentangle the complex periodicity of meteorological data across varying time horizons. Crucially, the model incorporates a cross-modal Mamba interaction module featuring a novel dynamic C-matrix swapping mechanism. By exchanging state-space parameters between visual and temporal streams, this design conditions the state evolution of one modality on the context of the other, enabling deep structural coupling with linear computational complexity, thus overcoming the limitations of shallow concatenation. Experimental validation on the newly constructed fine-grained PV power dataset demonstrates that M3S-Net achieves a mean absolute error reduction of 6.2% in 10-minute forecasts compared to state-of-the-art baselines. The dataset and source code will be available at this https URL.
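
The "sequence to image" step built on an FFT can be pictured with a small sketch: find the dominant periods of a 1D series from its amplitude spectrum, then fold the series into a 2D array with one row per cycle. The helper names `dominant_periods` and `fold_by_period`, the synthetic signal, and the folding scheme are assumptions for illustration; the abstract does not specify how M3S-Net builds its time-frequency representation.

```python
import numpy as np

def dominant_periods(series, top_k=2):
    """Return the top_k periods (in samples) with the largest FFT amplitude."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the DC component
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                          # ignore the zero-frequency bin
    top_bins = np.argsort(amp)[::-1][:top_k]
    # A peak at frequency bin k corresponds to a period of len(x) / k samples.
    return [float(len(x) / k) for k in top_bins if k > 0]

def fold_by_period(series, period):
    """Reshape a 1D series into a (cycles, period) array, dropping the remainder."""
    x = np.asarray(series, dtype=float)
    cycles = len(x) // period
    return x[: cycles * period].reshape(cycles, period)

# Synthetic signal with two periodic components (24 and 60 samples).
t = np.arange(240)
series = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 60)
periods = dominant_periods(series, top_k=2)
image = fold_by_period(series, 24)
```

Folding by different detected periods yields one 2D view per time scale, which is one plausible way for 2D convolutional blocks to process variation within and across cycles.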

[CV-41] TextShield-R1: Reinforced Reasoning for Tampered Text Detection AAAI2026

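[Quick read]: This paper addresses three core problems in tampered text detection: difficulty identifying micro-level forgery artifacts, low accuracy in localizing tampered text regions, and heavy reliance on expensive annotations. The key to the solution is TextShield-R1, a reinforcement-learning-based Multimodal Large Language Model (MLLM) framework with three components. First, Forensic Continual Pre-training uses large-scale, low-cost data from natural image forensics and optical character recognition (OCR) tasks to build an easy-to-hard curriculum that improves the model's perception of tampered text. Second, fine-tuning uses Group Relative Policy Optimization with novel reward functions, reducing dependence on manual annotation and strengthening reasoning. Third, OCR Rectification at inference time uses the MLLM's strong text recognition ability to refine localization. The approach is validated on the newly proposed Text Forensics Reasoning (TFR) benchmark, which covers 16 languages, 10 tampering techniques, and diverse scenarios and supports robust cross-style, cross-method, and cross-language evaluation.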

Link: https://arxiv.org/abs/2602.19828
Authors: Chenfan Qu, Yiwu Zhong, Jian Liu, Xuekang Zhu, Bohan Yu, Lianwen Jin
Affiliations: 1. Fujian Normal University; 2. Zhejiang University; 3. Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026

Abstract:The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-R1, the first reinforcement learning based MLLM solution for tampered text detection and reasoning. Specifically, our approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that well prepares the MLLM for tampered text detection by harnessing the large-scale cheap data from natural image forensic and OCR tasks. During fine-tuning, we perform Group Relative Policy Optimization with novel reward functions to reduce annotation dependency and improve reasoning capabilities. At inference time, we enhance localization accuracy via OCR Rectification, a method that leverages the MLLM’s strong text recognition abilities to refine its predictions. Furthermore, to support rigorous evaluation, we introduce the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains. Rich reasoning-style annotations are included, allowing for comprehensive assessment. Our TFR benchmark simultaneously addresses seven major limitations of existing benchmarks and enables robust evaluation under cross-style, cross-method, and cross-language conditions. Extensive experiments demonstrate that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.
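
For readers unfamiliar with Group Relative Policy Optimization, its central step is to score each sampled response relative to the other responses drawn for the same prompt, so no learned value function is required. The sketch below shows only that normalization; the reward values are made up, the function name is hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail). The paper's own reward functions are not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one tampered-text query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, only a scalar reward per response is needed, which fits the stated goal of reducing dependence on detailed annotations.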

[CV-42] Open-vocabulary 3D scene perception in industrial environments

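[Quick read]: This paper addresses the poor generalization of open-vocabulary 3D perception in industrial scenes: existing methods based on 2D Vision-Language Foundation Models (VLFMs) perform poorly in industrial environments because they depend on class-agnostic segmentation models pre-trained on non-industrial datasets. The key to the solution is a training-free 3D perception pipeline that generates masks by merging pre-computed superpoints instead of relying on a pre-trained instance proposal model. The domain-adapted VLFM "IndustrialCLIP" is used for open-vocabulary querying, and qualitative results on a representative industrial workshop scene show successful segmentation of industrial objects.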

Link: https://arxiv.org/abs/2602.19823
Authors: Keno Moenck, Adrian Philip Florea, Julian Koch, Thorsten Schüppstuhl
Affiliations: Hamburg University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous vision applications in production, intralogistics, or manufacturing environments require perception capabilities beyond a small, fixed set of classes. Recent open-vocabulary methods, leveraging 2D Vision-Language Foundation Models (VLFMs), target this task but often rely on class-agnostic segmentation models pre-trained on non-industrial datasets (e.g., household scenes). In this work, we first demonstrate that such models fail to generalize, performing poorly on common industrial objects. Therefore, we propose a training-free, open-vocabulary 3D perception pipeline that overcomes this limitation. Instead of using a pre-trained model to generate instance proposals, our method simply generates masks by merging pre-computed superpoints based on their semantic features. Next, we evaluate the domain-adapted VLFM “IndustrialCLIP” on a representative 3D industrial workshop scene for open-vocabulary querying. Our qualitative results demonstrate successful segmentation of industrial objects.

[CV-43] Efficient endometrial carcinoma screening via cross-modal synthesis and gradient distillation

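[Quick read]: This paper addresses accurate detection of early myometrial invasion in endometrial carcinoma (EC), especially in resource-constrained primary care, where conventional transvaginal ultrasound (TVUS) is unreliable because of low tissue contrast, strong operator dependence, and a scarcity of positive pathological samples. Existing AI methods struggle with the severe class imbalance and subtle imaging features, and must also run under tight computational limits. The key to the solution is a two-stage deep learning framework. First, a structure-guided cross-modal generation network synthesizes high-fidelity ultrasound images from unpaired magnetic resonance imaging (MRI) data to mitigate the scarcity of pathological data. Second, a lightweight screening network uses gradient distillation to transfer discriminative knowledge from a high-capacity teacher model, dynamically guiding sparse attention toward task-critical regions. On a multicenter cohort of 7,951 participants the method reaches 99.5% sensitivity, 97.2% specificity, and an AUC of 0.987 at only 0.289 GFLOPs, substantially outperforming the average diagnostic accuracy of expert sonographers.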

Link: https://arxiv.org/abs/2602.19822
Authors: Dongjing Shan, Yamei Luo, Jiqing Xuan, Lu Huang, Jin Li, Mengchu Yang, Zeyu Chen, Fajin Lv, Yong Tang, Chunxiang Zhang
Affiliations: Southwest Medical University; University of Electronic Science and Technology of China; Zibo Hospital of Traditional Chinese Medicine; Chongqing Medical University; Chongqing University of Chinese Medicine; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Early detection of myometrial invasion is critical for the staging and life-saving management of endometrial carcinoma (EC), a prevalent global malignancy. Transvaginal ultrasound serves as the primary, accessible screening modality in resource-constrained primary care settings; however, its diagnostic reliability is severely hindered by low tissue contrast, high operator dependence, and a pronounced scarcity of positive pathological samples. Existing artificial intelligence solutions struggle to overcome this severe class imbalance and the subtle imaging features of invasion, particularly under the strict computational limits of primary care clinics. Here we present an automated, highly efficient two-stage deep learning framework that resolves both data and computational bottlenecks in EC screening. To mitigate pathological data scarcity, we develop a structure-guided cross-modal generation network that synthesizes diverse, high-fidelity ultrasound images from unpaired magnetic resonance imaging (MRI) data, strictly preserving clinically essential anatomical junctions. Furthermore, we introduce a lightweight screening network utilizing gradient distillation, which transfers discriminative knowledge from a high-capacity teacher model to dynamically guide sparse attention towards task-critical regions. Evaluated on a large, multicenter cohort of 7,951 participants, our model achieves a sensitivity of 99.5%, a specificity of 97.2%, and an area under the curve of 0.987 at a minimal computational cost (0.289 GFLOPs), substantially outperforming the average diagnostic accuracy of expert sonographers. Our approach demonstrates that combining cross-modal synthetic augmentation with knowledge-driven efficient modeling can democratize expert-level, real-time cancer screening for resource-constrained primary care settings.

[CV-44] TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

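[Quick read]: This paper addresses the tendency of current Large Vision-Language Models (LVLMs) to focus on global image understanding, which makes it hard for them to simulate human visual attention trajectories or to explain how descriptions relate to specific image regions. The solution is TraceVision, a unified vision-language model that integrates trajectory-aware spatial understanding in an end-to-end framework. Its key components are a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information, a geometric simplification step that extracts semantic keypoints from raw trajectories, and a three-stage training pipeline in which trajectories guide description generation and region localization. Together these make the model's spatial interaction more intuitive and its outputs more interpretable.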

Link: https://arxiv.org/abs/2602.19768
Authors: Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.

[CV-45] One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image ICLR2026

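[Quick read]: This paper addresses the highly challenging 3D vision problem of generating freely explorable 3D scenes from a single image. Existing methods often produce severe geometric distortions and noisy artifacts under large viewpoint changes, which rules out immersive interaction. The key to the solution is to decompose this ill-posed problem into three tractable sub-tasks. First, a panorama generator produces anchor views from the single input image as initialization. Second, a generalizable feed-forward Gaussian Splatting network lifts the 2D anchor views into an explicit 3D geometric scaffold; reconstruction is reformulated as multi-view stereo matching so that geometric priors learned from large-scale multi-view data improve robustness. Third, a novel view generator conditioned on this geometrically consistent scaffold synthesizes photorealistic, geometrically accurate views at arbitrary camera poses. Because generation is explicitly conditioned on a 3D-consistent scaffold, the framework remains stable under large camera motion and supports high-quality immersive scene exploration.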

Link: https://arxiv.org/abs/2602.19766
Authors: Pengfei Wang, Liyi Chen, Zhiyuan Ma, Yanjun Guo, Guowen Zhang, Lei Zhang
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026

Abstract:Generating explorable 3D scenes from a single image is a highly challenging problem in 3D vision. Existing methods struggle to support free exploration, often producing severe geometric distortions and noisy artifacts when the viewpoint moves far from the original perspective. We introduce One2Scene, an effective framework that decomposes this ill-posed problem into three tractable sub-tasks to enable immersive explorable scene generation. We first use a panorama generator to produce anchor views from a single input image as initialization. Then, we lift these 2D anchors into an explicit 3D geometric scaffold via a generalizable, feed-forward Gaussian Splatting network. Instead of treating the panorama as a single image for reconstruction, we project it into multiple sparse anchor views and reformulate the reconstruction task as multi-view stereo matching, which allows us to leverage robust geometric priors learned from large-scale multi-view datasets. A bidirectional feature fusion module is used to enforce cross-view consistency, yielding an efficient and geometrically reliable scaffold. Finally, the scaffold serves as a strong prior for a novel view generator to produce photorealistic and geometrically accurate views at arbitrary cameras. By explicitly conditioning on a 3D-consistent scaffold to perform reconstruction, One2Scene works stably under large camera motions, supporting immersive scene exploration. Extensive experiments show that One2Scene substantially outperforms state-of-the-art methods in panorama depth estimation, feed-forward 360° reconstruction, and explorable 3D scene generation. Code and models will be released.

[CV-46] Training Deep Stereo Matching Networks on Tree Branch Imagery: A Benchmark Study for Real-Time UAV Forestry Applications

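[Quick read]: This paper addresses the need for accurate, real-time depth estimation in autonomous drone-based tree pruning. The core difficulty follows from the depth formula $Z = fB/d$: even a small disparity error produces a noticeable depth error at working distances. The solution is to train and evaluate ten deep stereo matching networks to balance disparity-map quality against inference speed, using disparity maps generated by DEFOM-Stereo as the supervision signal and the Canterbury Tree Branches dataset of real tree branch images for validation. BANet-3D gives the best perceptual quality (SSIM = 0.883, LPIPS = 0.157), RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799), and AnyNet reaches 6.99 FPS at 1080P, making it the only near-real-time option; the results give practical guidance for model selection in forestry drone systems.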

Link: https://arxiv.org/abs/2602.19763
Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Autonomous drone-based tree pruning needs accurate, real-time depth estimation from stereo cameras. Depth is computed from disparity maps using Z = f B/d , so even small disparity errors cause noticeable depth mistakes at working distances. Building on our earlier work that identified DEFOM-Stereo as the best reference disparity generator for vegetation scenes, we present the first study to train and test ten deep stereo matching networks on real tree branch images. We use the Canterbury Tree Branches dataset – 5,313 stereo pairs from a ZED Mini camera at 1080P and 720P – with DEFOM-generated disparity maps as training targets. The ten methods cover step-by-step refinement, 3D convolution, edge-aware attention, and lightweight designs. Using perceptual metrics (SSIM, LPIPS, ViTScore) and structural metrics (SIFT/ORB feature matching), we find that BANet-3D produces the best overall quality (SSIM = 0.883, LPIPS = 0.157), while RAFT-Stereo scores highest on scene-level understanding (ViTScore = 0.799). Testing on an NVIDIA Jetson Orin Super (16 GB, independently powered) mounted on our drone shows that AnyNet reaches 6.99 FPS at 1080P – the only near-real-time option – while BANet-2D gives the best quality-speed balance at 1.21 FPS. We also compare 720P and 1080P processing times to guide resolution choices for forestry drone systems.
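
The sensitivity of depth to disparity error in Z = fB/d can be checked with a short worked example, where f is the focal length in pixels, B the stereo baseline, and d the disparity in pixels. The numbers below (700 px focal length, 0.06 m baseline) are illustrative assumptions, not calibration values from the paper.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in metres from the pinhole stereo relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px: float, baseline_m: float,
                disparity_px: float, disparity_err_px: float) -> float:
    """Absolute depth error caused by a given disparity error."""
    z_true = depth_from_disparity(focal_px, baseline_m, disparity_px)
    z_meas = depth_from_disparity(focal_px, baseline_m, disparity_px + disparity_err_px)
    return abs(z_meas - z_true)

F_PX, BASELINE_M = 700.0, 0.06          # assumed camera parameters

z_far = depth_from_disparity(F_PX, BASELINE_M, 21.0)    # branch 2 m away
err_far = depth_error(F_PX, BASELINE_M, 21.0, 1.0)      # effect of a 1 px error at 2 m
err_near = depth_error(F_PX, BASELINE_M, 42.0, 1.0)     # effect of a 1 px error at 1 m
```

Under these assumed parameters a one-pixel disparity error costs roughly 9 cm at 2 m but only about 2 cm at 1 m: the error grows approximately with the square of the distance, which is why disparity accuracy matters so much for a cutting tool.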

[CV-47] Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

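Quick read: This paper targets the high cost and inefficiency of multimodal learning that depends on large-scale image-text datasets, and the sharp performance drop that existing filtering and pruning methods suffer on very small subsets. The key idea is a learning-free multimodal dataset distillation framework: CLIP extracts aligned image-text embeddings from which prototypes are obtained, and an unCLIP decoder synthesizes images from them. This avoids the full-dataset training and joint optimization of image pixels and text features that earlier methods require, and it improves generalization across model architectures.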

Link: https://arxiv.org/abs/2602.19756
Authors: Junhyeok Choi, Sangwoo Mo, Minwoo Chae
Affiliations: Pohang University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.

[CV-48] RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing CVPR2026

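Quick read: In 3D Gaussian Splatting (3DGS), iterative refinement and densification produce a large number of Gaussian primitives whose contributions to reconstruction quality differ substantially, so efficient importance estimation is needed for redundancy removal, compression, and transmission. Existing methods rely on rendering-based analysis across multiple viewpoints and on specialized differentiable rasterizers; they are sensitive to view selection, their cost grows linearly with the number of views, and they are hard to integrate as plug-and-play modules. The key contribution is RAP, a rendering-free, feedforward importance score predictor. It infers each primitive's importance directly from intrinsic Gaussian attributes (such as position, scale, and color) and local neighborhood statistics, using a compact MLP trained with a rendering loss, a pruning-aware loss, and a significance distribution regularizer. After training on a small set of scenes, RAP generalizes to unseen data and can be integrated into reconstruction, compression, and transmission pipelines.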

Link: https://arxiv.org/abs/2602.19753
Authors: Kaifa Yang, Qi Yang, Yiling Xu, Zhu Li
Affiliations: Shanghai Jiao Tong University; University of Missouri–Kansas City
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by CVPR 2026

Abstract:3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission. Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are sensitive to the number and selection of views, rely on specialized differentiable rasterizers, and have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules and limiting scalability and generalization. To address these issues, we propose RAP, a fast feedforward rendering-free attribute-guided method for efficient importance score prediction in 3DGS. RAP infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding rendering-based or visibility-dependent computations. A compact MLP predicts per-primitive importance scores using rendering loss, pruning-aware loss, and significance distribution regularization. After training on a small set of scenes, RAP generalizes effectively to unseen data and can be seamlessly integrated into reconstruction, compression, and transmission pipelines. Our code is publicly available at this https URL.
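
Once per-primitive importance scores exist, using them for pruning is simple. The sketch below assumes the scores are already available (hard-coded here) and uses a hypothetical helper `prune_by_importance`; it does not reproduce RAP's MLP or its released code.

```python
def prune_by_importance(scores, keep_ratio):
    """Return the sorted indices of the primitives to keep.

    scores: one importance score per primitive (higher = more important).
    keep_ratio: fraction of primitives to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank primitive indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical scores for four primitives.
kept = prune_by_importance([0.9, 0.1, 0.5, 0.7], keep_ratio=0.5)
print(kept)
```

Because the scores depend only on primitive attributes in this setting, the same ranking can be reused at several keep ratios without rendering any views.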

[CV-49] InfScene-SR: Spatially Continuous Inference for Arbitrary-Size Image Super-Resolution

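Quick read: This paper addresses two problems that diffusion-based image super-resolution (SR) faces on large scenes of arbitrary size. First, standard diffusion SR models such as SR3 are trained on fixed-size patches and do not scale to arbitrary image sizes. Second, processing patches independently leaves visible seams and inconsistent textures at patch boundaries. The proposed InfScene-SR framework adapts the iterative refinement process of diffusion models with a novel guided and variance-corrected fusion mechanism, which generates large-scale high-resolution imagery seamlessly without retraining, removes boundary artifacts, and improves perceptual quality and suitability for downstream tasks such as semantic segmentation.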

Link: https://arxiv.org/abs/2602.19736
Authors: Shoukun Sun, Zhe Wang, Xiang Que, Jiyin Zhang, Xiaogang Ma
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image Super-Resolution (SR) aims to recover high-resolution (HR) details from low-resolution (LR) inputs, a task where Denoising Diffusion Probabilistic Models (DDPMs) have recently shown superior performance compared to Generative Adversarial Networks (GANs) based approaches. However, standard diffusion-based SR models, such as SR3, are typically trained on fixed-size patches and struggle to scale to arbitrary-sized images due to memory constraints. Applying these models via independent patch processing leads to visible seams and inconsistent textures across boundaries. In this paper, we propose InfScene-SR, a framework enabling spatially continuous super-resolution for large, arbitrary scenes. We adapt the iterative refinement process of diffusion models with a novel guided and variance-corrected fusion mechanism, allowing for the seamless generation of large-scale high-resolution imagery without retraining. We validate our approach on remote sensing datasets, demonstrating that InfScene-SR not only reconstructs fine details with high perceptual quality but also eliminates boundary artifacts, benefiting downstream tasks such as semantic segmentation.
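
Averaging overlapping patch predictions is a common way to hide seams, but averaging independent noise shrinks its variance, which matters inside a diffusion sampler. The abstract does not spell out InfScene-SR's fusion rule, so the sketch below only illustrates that variance issue under an assumption of independent, zero-mean, unit-variance noise in each overlapping patch. The function names are hypothetical.

```python
import math
import random


def variance_correction(weights):
    """Scale factor that restores unit variance after a weighted average
    of independent unit-variance noise samples."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / math.sqrt(sum(w * w for w in normalized))


def fuse_noise(noise_samples, weights):
    """Weighted average of zero-mean noise samples, rescaled to unit variance."""
    total = sum(weights)
    average = sum(w * n for w, n in zip(weights, noise_samples)) / total
    return average * variance_correction(weights)


def empirical_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]
plain = [(a + b) / 2.0 for a, b in pairs]           # variance close to 0.5
corrected = [fuse_noise(p, [1.0, 1.0]) for p in pairs]  # variance close to 1.0
print(empirical_variance(plain), empirical_variance(corrected))
```

With two equally weighted patches the correction factor is $\sqrt{2}$; with a single patch it is 1, so non-overlapping regions are left unchanged.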

[CV-50] VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments

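Quick read: This paper addresses the robustness of multimodal place recognition (MPR) in autonomous driving, in particular the reliance of existing methods on hand-crafted fusion strategies and heavily parameterized backbones that are costly to retrain. The proposed VGGT-MPR framework adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to strengthen structural representation. In the re-ranking stage, a training-free mechanism exploits VGGT's cross-view keypoint tracking, combining mask-guided keypoint extraction with confidence-aware correspondence scoring to refine retrieval results without additional parameter optimization.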

Link: https://arxiv.org/abs/2602.19735
Authors: Jingyi Xu, Zhangshuo Qi, Zhongmiao Yan, Xuyu Gao, Qianyun Jiao, Songpengcheng Xia, Xieyuanli Chen, Ling Pei
Affiliations: Shanghai Jiao Tong University; Beijing Institute of Technology; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods basically attend to hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically-rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT’s cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
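
Densifying a sparse LiDAR cloud with predicted depth requires lifting depth pixels into 3D. The sketch below assumes a pinhole camera and LiDAR points already expressed in the camera frame. The intrinsics `fx, fy, cx, cy` and both helper names are hypothetical, and the paper's actual procedure may differ.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows, metres) to camera-frame 3D points.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def densify(lidar_points, depth, fx, fy, cx, cy):
    """Append back-projected depth pixels to an existing sparse point list."""
    return list(lidar_points) + backproject_depth(depth, fx, fy, cx, cy)


sparse = [(0.5, 0.0, 3.0)]
dense = densify(sparse, [[2.0, 0.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
print(len(dense))
```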

[CV-51] Towards Personalized Multi-Modal MRI Synthesis across Heterogeneous Datasets

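Quick read: This paper addresses missing modalities in multi-modal magnetic resonance imaging (MRI), which arise from time constraints, motion artifacts, and limited patient tolerance and which threaten diagnostic completeness. Existing unified synthesis models support many input-output configurations, but they are usually trained and evaluated on a single dataset and generalize poorly across clinical datasets. The proposed PMM-Synth framework achieves cross-dataset generalization through three components: a Personalized Feature Modulation module that adapts feature representations to a dataset identifier to mitigate distribution shift; a Modality-Consistent Batch Scheduler that enables stable and efficient batch training when modality coverage is inconsistent; and a selective supervision loss that keeps learning effective when some ground-truth modalities are missing. On four clinical MRI datasets, PMM-Synth outperforms state-of-the-art methods in both one-to-one and many-to-one synthesis, with higher PSNR and SSIM and better preservation of anatomical structures and pathological details.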

Link: https://arxiv.org/abs/2602.19723
Authors: Yue Zhang, Zhizheng Zhuo, Siyao Xu, Shan Lv, Zhaoxi Liu, Jun Qiu, Qiuli Wang, Yaou Liu, S. Kevin Zhou
Affiliations: University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China; University of Electronic Science and Technology of China; Capital Medical University; Army Medical University; Beijing Tiantan Hospital; Southwest Hospital; China National Clinical Research Center for Neurological Diseases
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 4 figures

Abstract:Synthesizing missing modalities in multi-modal magnetic resonance imaging (MRI) is vital for ensuring diagnostic completeness, particularly when full acquisitions are infeasible due to time constraints, motion artifacts, and patient tolerance. Recent unified synthesis models have enabled flexible synthesis tasks by accommodating various input-output configurations. However, their training and evaluation are typically restricted to a single dataset, limiting their generalizability across diverse clinical datasets and impeding practical deployment. To address this limitation, we propose PMM-Synth, a personalized MRI synthesis framework that not only supports various synthesis tasks but also generalizes effectively across heterogeneous datasets. PMM-Synth is jointly trained on multiple multi-modal MRI datasets that differ in modality coverage, disease types, and intensity distributions. It achieves cross-dataset generalization through three core innovations: a Personalized Feature Modulation module that dynamically adapts feature representations based on dataset identifier to mitigate the impact of distributional shifts; a Modality-Consistent Batch Scheduler that facilitates stable and efficient batch training under inconsistent modality conditions; and a selective supervision loss to ensure effective learning when ground truth modalities are partially missing. Evaluated on four clinical multi-modal MRI datasets, PMM-Synth consistently outperforms state-of-the-art methods in both one-to-one and many-to-one synthesis tasks, achieving superior PSNR and SSIM scores. Qualitative results further demonstrate improved preservation of anatomical structures and pathological details. Additionally, downstream tumor segmentation and radiological reporting studies suggest that PMM-Synth holds potential for supporting reliable diagnosis under real-world modality-missing scenarios.
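
The idea of supervising only the modalities that have ground truth can be sketched as a masked loss. This is an assumed L1 form with hypothetical names (`selective_l1_loss`, the modality keys); the abstract does not give PMM-Synth's exact loss.

```python
def selective_l1_loss(pred, target, available):
    """Mean absolute error over modalities whose ground truth is available.

    pred, target: dicts mapping modality name -> flat list of voxel values.
    available: dict mapping modality name -> bool. Modalities without ground
    truth are skipped, so they contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for name, predicted in pred.items():
        if not available.get(name, False):
            continue
        truth = target[name]
        total += sum(abs(p - t) for p, t in zip(predicted, truth))
        count += len(predicted)
    return total / count if count else 0.0


pred = {"T1": [1.0, 2.0], "T2": [5.0, 5.0]}
target = {"T1": [1.0, 4.0], "T2": None}  # T2 ground truth is missing
loss = selective_l1_loss(pred, target, {"T1": True, "T2": False})
print(loss)
```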

[CV-52] Generative 6D Pose Estimation via Conditional Flow Matching

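Quick read: This paper targets two weaknesses of instance-level 6D pose estimation: methods that regress the pose directly are ambiguous for symmetric objects, and methods based on local feature matching fail when distinctive local features are absent. The proposed method, Flose, formulates 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. It combines geometry-guided denoising with appearance-based semantic features to resolve symmetry-induced ambiguity, and adds RANSAC-based registration to handle outliers. On five datasets from the BOP benchmark, Flose improves on prior methods by 4.5 Average Recall points on average.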

Link: https://arxiv.org/abs/2602.19719
Authors: Amir Hamza, Davide Boscaini, Weihang Li, Benjamin Busam, Fabio Poiesi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL

Abstract:Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. We introduce Flose, a generative method that infers object poses via a denoising process conditioned on local features. While prior approaches based on conditional flow matching perform denoising solely based on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries. We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark. Flose outperforms prior methods with an average improvement of +4.5 Average Recall. Project Website: this https URL
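
For readers unfamiliar with conditional flow matching, a commonly used form of the objective is shown below. It uses the linear interpolation path; the abstract does not state which path or conditioning Flose adopts, so the symbols $x_0$ (noise sample), $x_1$ (target points in $\mathbb{R}^3$), $c$ (conditioning features), and $v_\theta$ (learned velocity field) are assumptions for illustration.

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x_0,\, x_1}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2
$$

At inference, integrating $\dot{x}_t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$ carries a noise sample to a predicted point set, which a registration step can then turn into a rigid pose.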

[CV-53] Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision CVPR-2026

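Quick read: This paper addresses the fact that natural-language explanations produced by deepfake detection models are often not grounded in visual evidence, which limits their reliability, while existing evaluations measure only classification accuracy and overlook reasoning fidelity. The proposed DeepfakeJudge framework combines an out-of-distribution benchmark of recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models that assess reasoning rationales without explicit ground-truth rationales. A bootstrapped generator-evaluator process scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the meta-evaluation benchmark, the reasoning-bootstrapped model reaches 96.2% accuracy, outperforming baselines 30 times larger, and the judge reaches 98.9% pairwise agreement with human ratings, which establishes reasoning fidelity as a quantifiable dimension of deepfake detection.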

Link: https://arxiv.org/abs/2602.19715
Authors: Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan, Abhinav Dhall
Affiliations: MBZUAI; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR-2026, Code is available here: this https URL

Abstract:Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation, that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models, that specialize in evaluating reasoning rationales without the need for explicit ground truth reasoning rationales. The Judge is optimized through a bootstrapped generator-evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2%, outperforming 30x larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9% pairwise agreement on the human-annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study shows that participants preferred the reasonings generated by our framework 70% of the time, in terms of faithfulness, groundedness, and usefulness, compared to those produced by other models and datasets. All of our datasets, models, and codebase are open-sourced at this https URL.

[CV-54] Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

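Quick read: This paper addresses feature collapse and low training efficiency in existing Vision-Language-Action (VLA) models, which stem from entangling high-level perception with sparse, embodiment-specific action supervision. Because VLA models usually build on vision-language model (VLM) backbones optimized for Visual Question Answering (VQA), they are strong at semantic identification but tend to miss the subtle 3D state variations that dictate distinct action patterns. The proposed Pose-VLA decouples training into a pre-training phase that extracts universal 3D spatial priors in a unified camera-centric space and a post-training phase that performs efficient embodiment alignment within the robot-specific action space. Discrete pose tokens serve as a universal representation that integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robot demonstrations, improving generalization and training efficiency.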

Link: https://arxiv.org/abs/2602.19710
Authors: Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu
Affiliations: Tencent Robotics X; The Hong Kong University of Science and Technology; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
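
One simple way to turn a continuous pose into discrete tokens is uniform binning. The abstract does not describe Pose-VLA's tokenizer, so the sketch below is an assumption: the bin count, the value range, and the function names are hypothetical, and only the translation part of a pose is handled.

```python
def quantize(value, low, high, num_bins):
    """Map a continuous value in [low, high] to a token id in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # the upper edge falls into the last bin


def dequantize(token, low, high, num_bins):
    """Return the centre of the bin that a token id refers to."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width


def pose_to_tokens(translation, low=-1.0, high=1.0, num_bins=256):
    """Tokenize an (x, y, z) translation expressed in the camera frame."""
    return [quantize(x, low, high, num_bins) for x in translation]


tokens = pose_to_tokens([0.0, -1.0, 1.0])
print(tokens)
```

The round-trip error of this scheme is bounded by half a bin width, so the bin count sets the spatial resolution that the token vocabulary can express.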

[CV-55] ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets

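Quick read: This paper addresses how to synthesize images that are both diverse and faithful to fine-grained detail when data is scarce, especially for tail classes, in order to improve downstream classification. An image-wise LoRA trained on a single image captures fine detail but offers limited diversity, whereas a class-wise LoRA trained on all shots yields diverse images but tends to overlook fine detail. The solution splits the adapter into a class-shared LoRA $A$ that encodes class priors and per-image LoRAs $\mathcal{B}$ that capture image-specific characteristics. A semantic boosting step that preserves class bounding boxes during training exposes coherent class semantics in $A$. At generation time, $A$ is composed with a mixture of $\mathcal{B}$ whose coefficients are drawn from a Dirichlet distribution, producing diverse, detail-rich images and robust gains in classification accuracy.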

Link: https://arxiv.org/abs/2602.19708
Authors: Hoyoung Kim, Minwoo Jang, Jabin Koo, Sangdoo Yun, Jungseul Ok
Affiliations: Graduate School of AI, POSTECH; Dept. of CSE, POSTECH; NAVER AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-wise LoRA trained on a single image captures fine-grained details yet offers limited diversity, a class-wise LoRA trained over all shots produces diverse images as it encodes class priors yet tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA $A$ for class priors and per-image LoRAs $\mathcal{B}$ for image-specific characteristics. To expose coherent class semantics in the shared LoRA $A$, we propose a semantic boosting by preserving class bounding boxes during training. For generation, we compose $A$ with a mixture of $\mathcal{B}$ using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.
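
The generation-time composition can be sketched with small matrices. The code assumes the usual LoRA factorization in which the weight update is the product of a per-image matrix and the shared matrix $A$; shapes, names, and the concentration parameters are hypothetical, and the paper's implementation may differ.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw mixture weights from a Dirichlet distribution via gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def mix_matrices(matrices, weights):
    """Weighted element-wise sum of equally shaped matrices (lists of rows)."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, matrices))
             for c in range(cols)] for r in range(rows)]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def compose_update(per_image_bs, shared_a, alphas, rng):
    """Return (delta_W, weights) with delta_W = (sum_i w_i * B_i) @ A."""
    weights = sample_dirichlet(alphas, rng)
    return matmul(mix_matrices(per_image_bs, weights), shared_a), weights


rng = random.Random(0)
shared_a = [[3.0, 4.0]]                        # rank-1 shared factor, shape 1 x 2
per_image_bs = [[[1.0], [2.0]], [[0.0], [1.0]]]  # two per-image factors, shape 2 x 1
delta_w, weights = compose_update(per_image_bs, shared_a, [1.0, 1.0], rng)
print(weights, delta_w)
```

Each draw of the weights yields a different blend of per-image characteristics on top of the same class prior, which is the source of diversity in this scheme.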

[CV-56] HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion WACV2026

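Quick read: This paper addresses information loss in over-exposed regions when reconstructing a high dynamic range (HDR) image from a single low dynamic range (LDR) image, where traditional methods often fail. The key idea is a training-free diffusion-based inpainting framework that combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed regions while keeping luminance consistent across multi-exposure LDR images. An iterative compensation mechanism integrates the method with existing HDR reconstruction pipelines, and it improves both perceptual quality and quantitative metrics without requiring extensive training.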

Link: https://arxiv.org/abs/2602.19706
Authors: Yo-Tin Lin, Su-Kai Chen, Hou-Ning Hu, Yen-Yu Lin, Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; MediaTek Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026. Project page: this https URL

Abstract:Single LDR to HDR reconstruction remains challenging for over-exposed regions where traditional methods often fail due to complete information loss. We present a training-free approach that enhances existing indirect and direct HDR reconstruction methods through diffusion-based inpainting. Our method combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed areas while maintaining consistency across multi-exposure LDR images. Unlike previous approaches requiring extensive training, our method seamlessly integrates with existing HDR reconstruction techniques through an iterative compensation mechanism that ensures luminance coherence across multiple exposures. We demonstrate significant improvements in both perceptual quality and quantitative metrics on standard HDR datasets and in-the-wild captures. Results show that our method effectively recovers natural details in challenging scenarios while preserving the advantages of existing HDR reconstruction pipelines. Project page: this https URL
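
Inpainting needs a mask of the regions to fill. A minimal sketch of building an over-exposure mask follows; the threshold of 0.95 and the rule that all channels must be clipped are assumptions for illustration, not values from the paper.

```python
def overexposed_mask(image, threshold=0.95):
    """Return a binary mask (1 = over-exposed) for an image with values in [0, 1].

    image: list of rows, each pixel an (r, g, b) tuple. A pixel is flagged when
    every channel reaches the threshold, i.e. it is clipped close to white.
    """
    return [[1 if min(pixel) >= threshold else 0 for pixel in row] for row in image]


def masked_fraction(mask):
    """Fraction of pixels flagged by the mask."""
    flat = [value for row in mask for value in row]
    return sum(flat) / len(flat)


image = [[(1.0, 1.0, 1.0), (0.2, 0.9, 1.0)]]
mask = overexposed_mask(image)
print(mask, masked_fraction(mask))
```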

[CV-57] BayesFusion-SDF: Probabilistic Signed Distance Fusion with View Planning on CPU

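Quick read: This paper addresses two gaps in dense 3D reconstruction: traditional volumetric fusion with truncated signed distance functions (TSDF) depends on heuristic weighting and cannot express uncertainty systematically, while neural implicit methods achieve high fidelity but need substantial GPU compute and are hard to interpret for downstream decisions. The proposed BayesFusion-SDF framework models geometry as a sparse Gaussian random field and obtains a posterior distribution over voxel distances by Bayesian inference. A rough TSDF reconstruction defines an adaptive narrow-band domain, depth observations are fused with a heteroscedastic Bayesian formulation solved by sparse linear algebra and preconditioned conjugate gradients, and randomized diagonal estimators give fast estimates of posterior uncertainty. The result is better geometric accuracy than TSDF baselines and uncertainty-aware next-best-view planning for active sensing.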

Link: https://arxiv.org/abs/2602.19697
Authors: Soumya Mazumdar, Vineet Kumar Rakesh, Tapas Samanta
Affiliations: Gargi Memorial Institute of Technology; Variable Energy Cyclotron Centre; Homi Bhabha National Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments:

Abstract:Dense 3D reconstruction from depth observations is a key part of robotics, augmented reality, and digital inspection. Traditional volumetric fusion techniques, including truncated signed distance functions (TSDF), enable efficient and deterministic geometry reconstruction; however, they depend on heuristic weighting and fail to convey uncertainty in a systematic way. Recent neural implicit methods, on the other hand, achieve very high fidelity but usually need a lot of GPU power for optimization and are hard to interpret for downstream decision making. This work presents BayesFusion-SDF, a CPU-centric probabilistic signed distance fusion framework that conceptualizes geometry as a sparse Gaussian random field with a defined posterior distribution over voxel distances. First, a rough TSDF reconstruction is used to create an adaptive narrow-band domain. Then, depth observations are combined using a heteroscedastic Bayesian formulation that is solved using sparse linear algebra and preconditioned conjugate gradients. Randomized diagonal estimators provide a fast estimate of posterior uncertainty. This makes it possible to extract surfaces and plan the next best view while taking uncertainty into account. Tests on a controlled ablation scene and a CO3D object sequence show that the new method is more accurate geometrically than TSDF baselines and gives useful estimates of uncertainty for active sensing. The proposed formulation provides a clear and easy-to-use alternative to GPU-heavy neural reconstruction methods while remaining probabilistically interpretable and behaving predictably. GitHub: this https URL
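
The contrast with heuristic TSDF weights is easiest to see for a single voxel. The sketch below fuses Gaussian observations with different variances by precision weighting. It is the independent-voxel special case only, an assumption made for brevity: the paper couples voxels through a sparse Gaussian random field and solves a sparse linear system, which this code does not do.

```python
def fuse_voxel(observations, prior_mean=0.0, prior_var=1.0):
    """Posterior mean and variance of one voxel's signed distance.

    observations: list of (measured_distance, noise_variance) pairs, where the
    variance may differ per observation (heteroscedastic noise).
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for value, variance in observations:
        precision += 1.0 / variance      # precise observations add more weight
        weighted_sum += value / variance
    return weighted_sum / precision, 1.0 / precision


mean, var = fuse_voxel([(0.1, 0.01), (0.3, 0.01)], prior_mean=0.0, prior_var=1e6)
print(mean, var)
```

The posterior variance is what a next-best-view planner can use: voxels that have only been seen through noisy measurements keep a large variance and are therefore worth observing again.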

[CV-58] TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures CVPR2026

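Quick read: This paper addresses two limitations of jointly reconstructing a 3D human and object from a single image. First, existing methods rely heavily on physical contact information and cannot model non-contact interactions such as gazing at or pointing toward an object. Second, reconstruction is driven mainly by local geometric proximity and ignores the global context carried by human and object appearance, so results can lack semantic consistency and visual plausibility. The proposed TeHOR framework rests on two designs: it uses text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, which covers a wider range of interactions including non-contact cases; and it incorporates appearance cues of the 3D human and object into the alignment process to capture holistic context and keep reconstructions visually plausible.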

Link: https://arxiv.org/abs/2602.19679
Authors: Hyeongjin Nam, Daniel Sungho Jung, Kyoung Mu Lee
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published at CVPR 2026, 20 pages including the supplementary material

Abstract:Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.

[CV-59] Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation

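Quick read: This paper addresses privacy protection and the modeling of evolving disease progression in longitudinal medical report generation. Existing federated learning (FL) methods assume stationary client distributions, so they cannot capture temporal shifts across visits or patient-specific heterogeneity, which leads to unstable optimization and weaker reports. The paper introduces Federated Temporal Adaptation (FTA), a federated setting that explicitly accounts for the temporal evolution of client data, and builds FedTAR on top of it. FedTAR generates lightweight LoRA adapters from demographic embeddings for personalization and performs temporal residual aggregation, in which updates from different visits are weighted by a meta-learned temporal policy optimized with first-order MAML. This improves temporal coherence and cross-site generalization.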

Link: https://arxiv.org/abs/2602.19668
Authors: He Zhu, Ren Togo, Takahiro Ogawa, Kenji Hirata, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Noriko Nishioka, Yukie Shimizu, Kohsuke Kudo, Miki Haseyama
Affiliations: Hokkaido University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Longitudinal medical report generation is clinically important yet remains challenging due to strict privacy constraints and the evolving nature of disease progression. Although federated learning (FL) enables collaborative training without data sharing, existing FL methods largely overlook longitudinal dynamics by assuming stationary client distributions, making them unable to model temporal shifts across visits or patient-specific heterogeneity, ultimately leading to unstable optimization and suboptimal report generation. We introduce Federated Temporal Adaptation (FTA), a federated setting that explicitly accounts for the temporal evolution of client data. Building upon this setting, we propose FedTAR, a framework that integrates demographic-driven personalization with time-aware global aggregation. FedTAR generates lightweight LoRA adapters from demographic embeddings and performs temporal residual aggregation, where updates from different visits are weighted by a meta-learned temporal policy optimized via first-order MAML. Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust and privacy-preserving paradigm for federated longitudinal modeling.

[CV-60] Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection ICLR2026

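Quick read: This paper addresses the risk that text-to-image (T2I) diffusion models are misused to synthesize harmful, private, or copyrighted content. Existing concept erasure methods mainly fine-tune the denoising component (for example the U-Net backbone), and the authors find that directly fine-tuning early text-encoder layers suppresses target concepts but tends to degrade generation quality for non-target concepts. The proposed High-Level Representation Misdirection (HiRM) updates only the early self-attention layers of the text encoder, which contain the causal states of visual attributes, and steers the high-level semantic representations of target concepts toward designated vectors such as random directions or semantically defined directions (for example supercategories). This removes target concepts precisely with minimal impact on unrelated concepts. HiRM performs strongly on the UnlearnCanvas and NSFW benchmarks, has low training cost, transfers to architectures such as Flux without additional training, and combines well with denoiser-based erasure methods.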

Link: https://arxiv.org/abs/2602.19631
Authors: Uichan Lee, Jeonghyeon Kim, Sangheum Hwang
Affiliations: Seoul National University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026. The first two authors contributed equally

Abstract:Recent advances in text-to-image (T2I) diffusion models have seen rapid and widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). However, recent causal tracing studies suggest that visual attribute information is localized in the early self-attention layers of the text encoder, indicating a potential alternative for concept erasing. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation, we propose High-Level Representation Misdirection (HiRM), which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions (e.g., supercategories), while updating only early layers that contain causal states of visual attributes. Our decoupling strategy enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost, transfers to state-of-the-art architectures such as Flux without additional training, and shows synergistic effects with denoiser-based concept erasing methods.

[CV-61] Accurate Planar Tracking With Robust Re-Detection

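Quick read: This paper addresses limited robustness to target appearance changes and the difficulty of re-detecting a lost target in planar tracking. The solution combines the long-term segmentation tracking of SAM 2 with 8 degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is therefore highly robust to appearance changes; WOFTSAM uses the lost-target re-detection provided by SAM-H to improve the state-of-the-art tracker WOFT. Both methods set a new state of the art on the POT-210 and PlanarTrack benchmarks, and on PlanarTrack they lead the second-best method by 12.4 and 15.2 percentage points on the p@15 metric.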

Link: https://arxiv.org/abs/2602.19624
Authors: Jonas Serych, Jiri Matas
Affiliations: Czech Technical University in Prague
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present SAM-H and WOFTSAM, novel planar trackers that combine robust long-term segmentation tracking provided by SAM 2 with 8 degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is thus highly robust to target appearance changes. WOFTSAM significantly improves the current state-of-the-art planar tracker WOFT by exploiting lost target re-detection provided by SAM-H. The proposed methods are evaluated on POT-210 and PlanarTrack tracking benchmarks, setting the new state-of-the-art performance on both. On the latter, they outperform the second best by a large margin, +12.4 and +15.2pp on the p@15 metric. We also present improved ground-truth annotations of initial PlanarTrack poses, enabling more accurate benchmarking in the high-precision p@5 metric. The code and the re-annotations are available at this https URL
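
A homography is a 3x3 matrix defined up to scale, which is where the 8 degrees of freedom come from. The sketch below applies a homography to points and scores a tracker by corner error. The reading of p@15 as the fraction of frames whose error is at most 15 px is an assumption, as are the function names; the benchmark papers give the authoritative definition.

```python
import math


def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography given as a list of rows."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity")
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def mean_corner_error(h_est, h_gt, corners):
    """Average distance between corners warped by the estimate and the ground truth."""
    errors = [math.dist(apply_homography(h_est, c), apply_homography(h_gt, c))
              for c in corners]
    return sum(errors) / len(errors)


def precision_at(errors, threshold_px):
    """Fraction of frames whose error does not exceed threshold_px (assumed metric)."""
    return sum(e <= threshold_px for e in errors) / len(errors)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]
corners = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]
print(mean_corner_error(shifted, identity, corners))
```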

[CV-62] Seeing Clearly Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness CVPR2026

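[Quick Read]: This paper addresses the weak object-centric reasoning of Vision Language Models (VLMs) on rare objects, a bottleneck caused by the scarcity of such instances in pretraining data. The key idea is a lightweight plug-in module that requires no VLM finetuning. It first learns multi-modal class embeddings for rare objects by combining knowledge from vision foundation models with synonym-augmented text descriptions, compensating for the limited training examples. These embeddings then refine the visual tokens in the VLM through a lightweight attention-based enhancement module, improving the representation of fine-grained object details. The same embeddings also act as object-aware detectors that generate informative hints, which are injected into the text prompt to steer the VLM toward relevant image regions, improving recognition of and reasoning about rare objects.

Link: https://arxiv.org/abs/2602.19615
Authors: Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, Zhengming Ding
Affiliations: Tulane University; University of Memphis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026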

Abstract:Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don’t fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs’ reasoning over rare objects by refining visual tokens and enriching input text prompts, without VLMs finetuning. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM’s attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM’s ability to focus on and reason about rare objects.

[CV-63] RAID: Retrieval-Augmented Anomaly Detection

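[Quick Read]: This paper addresses the performance loss in Unsupervised Anomaly Detection (UAD) caused by noise introduced when matching test images against normal templates, which is most pronounced under intra-class variation, imperfect correspondences, and limited templates. The key idea is RAID (Retrieval-Augmented Anomaly Detection), a framework that uses retrieved normal samples directly to guide noise suppression during anomaly map generation, rather than relying on reconstruction or retrieval alone as the decision criterion. RAID builds a hierarchical vector database that supplies class-level, semantic-level, and instance-level representations, then applies a matching cost volume followed by a guided Mixture-of-Experts (MoE) network, yielding coarse-to-fine anomaly localization with adaptive noise suppression.

Link: https://arxiv.org/abs/2602.19611
Authors: Mingxiu Cai, Zhe Zhang, Gaochang Wu, Tianyou Chai, Xiatian Zhu
Affiliations: State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University; University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: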

Abstract:Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks. this https URL.
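The idea of a matching cost volume can be illustrated with a generic nearest-neighbor baseline. The sketch below is not RAID's method: it assumes patch features are plain vectors, uses cosine similarity as the matching score, and omits the hierarchical retrieval and the guided MoE network entirely. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def cost_volume(test_feats, normal_feats):
    """Similarity of every test patch to every retrieved normal patch."""
    return [[cosine(t, n) for n in normal_feats] for t in test_feats]


def anomaly_scores(volume):
    """Score each test patch by how poorly its best normal match fits.

    Assumes at least one retrieved normal patch per row.
    """
    return [1.0 - max(row) for row in volume]


test_patches = [[1.0, 0.0], [0.0, 1.0]]
normal_patches = [[1.0, 0.0], [2.0, 0.0]]
scores = anomaly_scores(cost_volume(test_patches, normal_patches))
```

The raw scores of such a baseline inherit whatever noise the matching introduces, which is the failure mode the abstract says the guided MoE stage is designed to suppress.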

[CV-64] Satellite-Based Detection of Looted Archaeological Sites Using Machine Learning

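[Quick Read]: This paper addresses the serious threat that looting of archaeological sites poses to cultural heritage, especially in remote areas that are hard to monitor in person. The core question is how to identify damaged sites efficiently and accurately. The key idea is a scalable detection pipeline built on satellite imagery: it uses PlanetScope monthly mosaics (4.7 m/pixel) and a curated dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved) to compare two approaches. The first trains end-to-end convolutional neural network (CNN) classifiers directly on raw RGB patches. The second feeds handcrafted spectral/texture features and embeddings from recent remote-sensing foundation models into traditional machine learning algorithms such as Random Forest. ImageNet-pretrained CNNs combined with spatial masking reach an F1 score of 0.926, clearly ahead of the best traditional setup (F1 = 0.710), indicating that pretrained vision models combined with local spatial information are what drive detection accuracy.

Link: https://arxiv.org/abs/2602.19608
Authors: Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla, Andrew Zolli, Yves Ubelmann, Caleb Robinson, Inbal Becker-Reshef, Juan Lavista Ferres
Affiliations: Microsoft AI for Good Research Lab; Iconem; Planet Labs PBC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: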

Abstract:Looting at archaeological sites poses a severe risk to cultural heritage, yet monitoring thousands of remote locations remains operationally difficult. We present a scalable and satellite-based pipeline to detect looted archaeological sites, using PlanetScope monthly mosaics (4.7m/pixel) and a curated dataset of 1,943 archaeological sites in Afghanistan (898 looted, 1,045 preserved) with multi-year imagery (2016–2023) and site-footprint masks. We compare (i) end-to-end CNN classifiers trained on raw RGB patches and (ii) traditional machine learning (ML) trained on handcrafted spectral/texture features and embeddings from recent remote-sensing foundation models. Results indicate that ImageNet-pretrained CNNs combined with spatial masking reach an F1 score of 0.926, clearly surpassing the strongest traditional ML setup, which attains an F1 score of 0.710 using SatCLIP-V+RF+Mean, i.e., location and vision embeddings fed into a Random Forest with mean-based temporal aggregation. Ablation studies demonstrate that ImageNet pretraining (even in the presence of domain shift) and spatial masking enhance performance. In contrast, geospatial foundation model embeddings perform competitively with handcrafted features, suggesting that looting signatures are extremely localized. The repository is available at this https URL.
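Two building blocks mentioned in the abstract, mean-based temporal aggregation and the F1 score, are simple enough to sketch. The code below assumes each site has a list of per-month embedding vectors of equal length and that labels are 1 for looted and 0 for preserved; both are illustrative assumptions, and the function names are hypothetical rather than taken from the paper's repository.

```python
def mean_aggregate(monthly_embeddings):
    """Average a site's per-month embedding vectors element-wise."""
    count = len(monthly_embeddings)
    return [sum(column) / count for column in zip(*monthly_embeddings)]


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


site_vector = mean_aggregate([[1.0, 2.0], [3.0, 4.0]])
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

In the traditional pipeline described above, the aggregated vector would then be passed to a classifier such as a Random Forest.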

[CV-65] CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning CVPR2026

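[Quick Read]: This paper addresses the semantic misalignment and error propagation that arise in multimodal learning when all modalities are projected into a single latent space for fusion, problems that stem from ignoring the asynchronous, multi-level semantic structure of multimodal data. The key idea is Cross-Level Co-Representation (CLCR), whose core design has three parts: 1) a three-level semantic hierarchy (shallow, mid, and deep features), with a semantic hierarchy encoder aligning the corresponding levels across modalities; 2) at each level, an Intra-Level Co-Exchange Domain (IntraCED) that restricts cross-modal attention to the shared subspace via a learnable token budget, so that only shared semantics are exchanged and private information does not leak; 3) an Inter-Level Co-Aggregation Domain (InterCAD) that uses learned anchors to synchronize semantic scales across levels, selectively fuses the shared representations, and gates private cues to form a compact task representation. Regularization terms further enforce the separation of shared and private features and minimize cross-level interference, improving representation quality and generalization across tasks.

Link: https://arxiv.org/abs/2602.19605
Authors: Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan, Chun Ouyang
Affiliations: Fudan University; Shantou University; University of Macau; Guangzhou Huashang College
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: This study has been Accepted by CVPR 2026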

Abstract:Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality’s features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. And then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.

[CV-66] Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception CVPR’26

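[Quick Read]: This paper examines the security of Collaborative Perception (CP) systems under adversarial attack. Existing defenses can be broken because they lack robustness against attacks with systematically optimized timing and target regions, and because implicit confidence information in the shared collaboration data leaks vulnerability knowledge. The key idea is a new adaptive adversarial CP framework, the MVIG attack. It builds a unified Mutual View Information Graph (MVIG) representation to capture the vulnerability knowledge exposed by different defensive systems, combines it with temporal graph learning to generate evolving fabrication risk maps, and uses an entropy-aware vulnerability search to optimize attack location, timing, and persistence. The result is an adaptive attack that generalizes across a range of defensive configurations.

Link: https://arxiv.org/abs/2602.19596
Authors: Yihang Tao, Senkang Hu, Haonan An, Zhengru Fang, Hangcheng Cao, Yuguang Fang
Affiliations: Hong Kong JC STEM Lab of Smart City; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR’26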

Abstract:Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. Our approach combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps and employs entropy-aware vulnerability search to optimize attack location, timing and persistence, enabling adaptive attacks with generalizability across various defensive configurations. Extensive evaluations on OPV2V and Adv-OPV2V datasets demonstrate that MVIG attack reduces defense success rates by up to 62% against state-of-the-art defenses while achieving 47% lower detection for persistent attacks at 29.9 FPS, exposing critical security gaps in CP systems. Code will be released at this https URL

[CV-67] ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization CVPR2026

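[Quick Read]: This paper addresses concept entanglement in personalized text-to-image generation, where irrelevant residual information from the reference images is captured along with the concept, forcing a trade-off between concept fidelity and text alignment. The key idea is the ConceptPrism framework, which automatically separates the shared visual concept from image-specific residuals by comparing the images within a set. Its core mechanism jointly optimizes a target token and image-wise residual tokens, and adds a novel exclusion loss that forces the residual tokens to discard the shared concept, so that the target token captures a clean concept representation without direct supervision.

Link: https://arxiv.org/abs/2602.19575
Authors: Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim
Affiliations: KAIST; Hanbat National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026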

Abstract:Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.

[CV-68] HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies

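[Quick Read]: This paper addresses the weak predictive world modeling of current video large language models (Video-LLMs), whose grasp of causal mechanisms lags behind their static semantic perception. The key idea is the HOCA-Bench benchmark, which takes a Hegelian view and divides physical anomalies into ontological anomalies and causal anomalies. It provides an adversarial testbed of 1,439 videos (3,470 QA pairs) for systematically evaluating whether models can detect violations of an object's definition and violations of the rules of physical interaction. Experiments show that mainstream Video-LLMs identify static ontological anomalies fairly well, but their performance drops by more than 20% on causal tasks involving basic physical laws such as gravity and friction. Enabling System-2 "Thinking" modes does not close the gap, suggesting that current architectures are better at recognizing visual patterns than at reasoning about physical laws.

Link: https://arxiv.org/abs/2602.19571
Authors: Chang Liu, Yunfan Ye, Qingyang Zhou, Xichen Tan, Mengxuan Luo, Zhenyu Qiu, Wei Peng, Zhiping Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: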

Abstract:Video-LLMs have improved steadily on semantic perception, but they still fall short on predictive world modeling, which is central to physically grounded intelligence. We introduce HOCA-Bench, a benchmark that frames physical anomalies through a Hegelian lens. HOCA-Bench separates anomalies into two types: ontological anomalies, where an entity violates its own definition or persistence, and causal anomalies, where interactions violate physical relations. Using state-of-the-art generative video models as adversarial simulators, we build a testbed of 1,439 videos (3,470 QA pairs). Evaluations on 17 Video-LLMs show a clear cognitive lag: models often identify static ontological violations (e.g., shape mutations) but struggle with causal mechanisms (e.g., gravity or friction), with performance dropping by more than 20% on causal tasks. System-2 “Thinking” modes improve reasoning, but they do not close the gap, suggesting that current architectures recognize visual patterns more readily than they apply basic physical laws.

[CV-69] VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense

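[Quick Read]: This paper addresses the tendency of Large Vision-Language Models (LVLMs) to produce misleading outputs on adversarial images, which are only subtly perturbed yet steer the model toward plausible but incorrect responses. The key idea is a general, efficient, training-free defense built from two-stage detection and agentic data consolidation. The first stage quickly filters out most clean inputs using content-preserving transformations. The second stage, applied only when needed, looks for discrepancies in a text-embedding space to identify potential attacks, and a large language model (LLM) is invoked only on the remaining suspicious samples to resolve them. By consolidating multiple responses through both their similarities and their differences, the method keeps accuracy high while sharply reducing computational overhead, balancing state-of-the-art defense with runtime efficiency.

Link: https://arxiv.org/abs/2602.19570
Authors: Nadav Kadvil, Ayellet Tal
Affiliations: Technion – Israel Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: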

Abstract:Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
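The control flow of such a cascade, where cheap checks run first and the expensive model runs only on what remains, can be sketched as follows. This is an illustration of the general pattern only. The three callables (`is_consistent`, `embedding_gap`, `consolidate`) and the `gap_threshold` value are hypothetical placeholders; the abstract does not specify how each stage is computed.

```python
def cascaded_defense(image, is_consistent, embedding_gap, consolidate,
                     gap_threshold=0.2):
    """Run progressively costlier checks, stopping at the first that clears the input.

    is_consistent(image) -> bool   cheap check under content-preserving transforms
    embedding_gap(image) -> float  discrepancy measured in a text-embedding space
    consolidate(image)   -> str    expensive LLM-based consolidation of responses
    """
    # Stage 1: most clean inputs are expected to exit here.
    if is_consistent(image):
        return "stage1_clean", None
    # Stage 2: harder cases are compared in embedding space.
    if embedding_gap(image) <= gap_threshold:
        return "stage2_clean", None
    # Only suspicious inputs reach the expensive consolidation step.
    return "consolidated", consolidate(image)


llm_calls = []


def fake_consolidate(image):
    # Stand-in for the LLM call; records which inputs reached it.
    llm_calls.append(image)
    return "consolidated answer"


result_clean = cascaded_defense("img_a", lambda i: True, lambda i: 1.0, fake_consolidate)
result_mid = cascaded_defense("img_b", lambda i: False, lambda i: 0.1, fake_consolidate)
result_attack = cascaded_defense("img_c", lambda i: False, lambda i: 0.9, fake_consolidate)
```

The efficiency claim in the abstract corresponds to `llm_calls` staying short when most inputs are clean.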

[CV-70] DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

[Quick Read]: This paper addresses two challenges in articulated object pose estimation for embodied AI: the complexity and inefficiency of searching a continuous space, and the difficulty existing methods have in incorporating an object's intrinsic kinematic constraints. The key idea is DICArt (DIsCrete Diffusion for Articulation Pose Estimation), which models pose estimation as a conditional discrete diffusion process and recovers the true pose by progressively denoising through learned reverse diffusion steps. A flexible flow decider dynamically determines whether each token is denoised or reset, balancing the real and noise distributions during diffusion. A hierarchical kinematic coupling strategy estimates the pose of each rigid part level by level, explicitly respecting the object's structural priors and improving the accuracy and robustness of pose estimation.

Link: https://arxiv.org/abs/2602.19565
Authors: Li Zhang, Mingyu Mei, Ailing Wang, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixing He, Cewu Lu
Affiliations: University of Science and Technology of China; Shanghai Jiao Tong University; Zhejiang University; East China Normal University; Peking University; Emory University; Hefei University of Technology; Jianghuai Advance Technology Center; Anhui Provincial Key Laboratory of Humanoid Robots
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Articulated object pose estimation is a core task in embodied AI. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this work, we introduce DICArt (DIsCrete Diffusion for Articulation Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the GT pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object’s kinematic structure. We validate DICArt on both synthetic and real-world datasets. Experimental results demonstrate its superior performance and robustness. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.

[CV-71] A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

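[Quick Read]: This paper addresses the problem of establishing stable mappings between natural language expressions and visual percepts, a foundational challenge for both cognitive science and artificial intelligence. Humans can ground linguistic reference in noisy, ambiguous perceptual contexts, but the mechanisms behind this cross-modal alignment remain unclear. The key idea is a computational framework that integrates linguistic input with perceptual representations derived from large-scale crowd-sourced imagery. It combines Scale-Invariant Feature Transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, and adds linguistic preprocessing and query-transformation operations to capture pragmatic variability in referring expressions. On the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), the framework requires 65% fewer utterances than human interlocutors to reach stable mappings, and it correctly identifies the target object from a single referring expression 41.66% of the time, versus 20% for humans. The results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior and offer insights into grounded communication, perceptual inference, and cross-modal concept formation.

Link: https://arxiv.org/abs/2602.19562
Authors: Joseph Bingham
Affiliations: Technion
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 Pages, 6 figures, preprint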

Abstract:Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding. It requires 65% fewer utterances than human interlocutors to reach stable mappings and can correctly identify target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at this https URL.

[CV-72] Vinedresser3D: Agentic Text-guided 3D Editing CVPR2026

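[Quick Read]: This paper addresses three challenges in text-guided 3D editing: understanding complex natural-language instructions, automatically localizing the edit region in 3D, and keeping unedited regions consistent. The key idea is the Vinedresser3D framework, which operates directly in the latent space of a native 3D generative model. A Multimodal Large Language Model (MLLM) analyzes the original 3D asset, identifies the edit type (addition, modification, deletion) and location, and generates decomposed text guidance at the structural and appearance levels. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs the edit in the 3D latent space, achieving prompt alignment, 3D coherence, and mask-free editing together.

Link: https://arxiv.org/abs/2602.19542
Authors: Yankuan Chi, Xiang Li, Zixuan Huang, James M. Rehg
Affiliations: The Hong Kong University of Science and Technology; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Project website: this https URL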

Abstract:Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the edit region and edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.

[CV-73] A Green Learning Approach to LDCT Image Restoration ICIP

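[Quick Read]: This paper addresses the degraded quality of Low-dose Computed Tomography (LDCT) images, where noise and artifacts introduced during imaging undermine the accuracy of subsequent medical analysis. The key idea is a Green Learning (GL) approach whose main advantages are mathematical transparency, computational and memory efficiency, and strong restoration performance. Experiments show that it achieves state-of-the-art restoration with a smaller model size and lower inference complexity.

Link: https://arxiv.org/abs/2602.19540
Authors: Wei Wang, Yixing Wu, C.-C. Jay Kuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published in IEEE International Conference on Image Processing (ICIP), 2025, pp. 1762-1767. Final version available at IEEE Xplore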

Abstract:This work proposes a green learning (GL) approach to restore medical images. Without loss of generality, we use low-dose computed tomography (LDCT) images as examples. LDCT images are susceptible to noise and artifacts, where the imaging process introduces distortion. LDCT image restoration is an important preprocessing step for further medical analysis. Deep learning (DL) methods have been developed to solve this problem. We examine an alternative solution using the Green Learning (GL) methodology. The new restoration method is characterized by mathematical transparency, computational and memory efficiency, and high performance. Experiments show that our GL method offers state-of-the-art restoration performance at a smaller model size and with lower inference complexity.

[CV-74] Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

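[Quick Read]: This paper examines how robust AI age estimation systems are to everyday cosmetic modifications (beards, grey hair, makeup, and simulated wrinkles), which may cause minors to be classified as adults and so bypass age restrictions on online content. The approach is a large-scale simulated attack study: a VLM image editor (Gemini 2.5 Flash Image) applies the modifications to 329 facial images of individuals aged 10 to 21, and the authors measure how often each model's prediction flips. They introduce the Attack Conversion Rate (ACR), a metric that does not depend on the ratio of minors to adults in the test set, and use it to evaluate eight age estimation models. Under the full attack, specialized models show higher ACR (63 to 83 percent) than vision-language models (59 to 71 percent), although the ranges overlap and the difference is not statistically tested. The authors call for adversarial robustness evaluation as a criterion for model selection before deployment.

Link: https://arxiv.org/abs/2602.19539
Authors: Xingyu Shen, Tommy Duong, Xiaodong An, Zengqi Zhao, Zebang Hu, Haoyu Hu, Ziyou Wang, Finn Guo, Simiao Ren
Affiliations: Reality Inc.; UC Berkeley; Duke University; Georgia Tech; UNC Chapel Hill; UC San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures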

Abstract:Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
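The ACR definition in the abstract translates directly into code. The sketch below follows that definition: among images predicted as minor at baseline, count the fraction predicted as adult after the attack. The adult threshold of 18 years and the function name are assumptions for illustration; the abstract does not state the cutoff used.

```python
def attack_conversion_rate(baseline_ages, attacked_ages, adult_age=18):
    """Fraction of baseline-minor predictions that flip to adult after the attack.

    Images already predicted as adult at baseline are excluded, which is why the
    metric does not depend on the minor/adult ratio of the test set.
    Returns NaN when no image is predicted as minor at baseline.
    """
    eligible = 0
    flipped = 0
    for before, after in zip(baseline_ages, attacked_ages):
        if before < adult_age:
            eligible += 1
            if after >= adult_age:
                flipped += 1
    return flipped / eligible if eligible else float("nan")


# Three images are predicted minor at baseline; two of them flip to adult.
acr = attack_conversion_rate([15, 16, 17, 20], [19, 17, 25, 30])
```

Because the denominator counts only baseline-minor predictions, a model that already misclassifies many minors as adults before any attack can still report a low ACR, so the metric is best read alongside baseline accuracy.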

[CV-75] Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection

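[Quick Read]: This paper addresses two problems in Mamba-based 3D object detection: bidirectional encoding over the whole non-empty voxel sequence carries redundant background information, while encoding only foreground voxels degrades performance because of response attenuation and restricted context representation. The key idea is a new backbone, Fore-Mamba3D, with three core contributions: 1) a foreground-enhanced encoding strategy that samples foreground voxels according to predicted scores; 2) a regional-to-global slide window (RGSW) that mitigates response attenuation in the interaction between foreground voxels of different instances and propagates information across regions; 3) a semantic-assisted and state spatial fusion module (SASFMamba) that strengthens semantic and geometric awareness within the Mamba model, enriching the contextual representation. The design alleviates the distance-based and causal dependencies of the linear autoregressive model and improves 3D object detection performance.

Link: https://arxiv.org/abs/2602.19536
Authors: Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Runze Yang, Huiying Xu, Xinzhong Zhu, Jie Yang, Wei Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: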

Abstract:Linear modeling methods like Mamba have emerged as an effective backbone for the 3D object detection task. However, previous Mamba-based methods utilize the bidirectional encoding for the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation in the linear modeling for fore-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation existing in the interaction of foreground voxels across different instances, we design a regional-to-global slide window (RGSW) to propagate the information from regional split to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.

[CV-76] ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

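[Quick Read]: This paper addresses the limited performance of Vision Language Models (VLMs) on tasks such as zero-shot classification. The root cause is the poor quality and geometry of the textual prototypes: class embeddings may be highly correlated or weakly separated, which limits task-specific discriminability. The key idea is the ORION framework, which fine-tunes the text encoder with Low Rank Adaptation (LoRA) using only class names. It optimizes a new objective with two terms: the first enforces pairwise orthogonality between the text representations of the classes within a task, improving class separability; the second penalizes deviation of the fine-tuned embeddings from the initial prototypes, preserving semantic consistency. Validated on 11 benchmarks and three mainstream VLM backbones, the method clearly outperforms standard CLIP prototypes and, as a plug-and-play module, yields consistent gains in zero-shot, few-shot, and test-time adaptation settings.

Link: https://arxiv.org/abs/2602.19530
Authors: Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed
Affiliations: ÉTS Montréal (École de technologie supérieure)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: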

Abstract:Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug and play module on top of various state of the art methods, and across different prediction settings (zero shot, few shot and test time adaptation), ORION improves the performance consistently and significantly.
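The two-term objective described above can be illustrated with a small sketch. The exact form of ORION's loss is not given in the abstract, so the version below is an assumption: the orthogonality term sums squared dot products over all class pairs, the anchoring term sums squared distances to the initial prototypes, and `lam` is a hypothetical weighting factor. Embeddings are plain lists standing in for the text encoder's outputs.

```python
def orthogonality_penalty(embeddings):
    """Sum of squared dot products over all distinct class pairs (0 when orthogonal)."""
    total = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += dot * dot
    return total


def anchor_penalty(embeddings, initial):
    """Squared distance between the current and the initial class prototypes."""
    return sum(
        (a - b) ** 2
        for current, start in zip(embeddings, initial)
        for a, b in zip(current, start)
    )


def orion_style_loss(embeddings, initial, lam=1.0):
    """Assumed combination: orthogonality term plus weighted anchoring term."""
    return orthogonality_penalty(embeddings) + lam * anchor_penalty(embeddings, initial)


initial_prototypes = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
loss_at_init = orion_style_loss(initial_prototypes, initial_prototypes)
loss_collapsed = orion_style_loss(collapsed, initial_prototypes, lam=0.5)
```

The two terms pull in different directions when the initial prototypes are correlated: the first pushes classes apart, while the second keeps them near their pretrained meaning. A weighting factor like `lam` would control that balance.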

[CV-77] OSInsert: Towards High-authenticity and High-fidelity Image Composition

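[Quick Read]: This paper addresses the difficulty of achieving both high authenticity and high fidelity in Generative Image Composition. Existing methods typically trade one for the other: they either adapt the foreground pose or view to the background (authenticity) or preserve the foreground details accurately (fidelity), but not both. The key idea is a two-stage strategy. The first stage uses a high-authenticity method to generate a reasonable foreground shape as a condition; the second stage uses a high-fidelity method, conditioned on that shape, to reconstruct the foreground details accurately, so that plausible appearance and accurate detail are achieved together. Experiments on the MureCOM dataset verify the effectiveness of the strategy.

Link: https://arxiv.org/abs/2602.19523
Authors: Jingyuan Wang, Li Niu
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: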

Abstract:Generative image composition aims to regenerate the given foreground object in the background image to produce a realistic composite image. Some high-authenticity methods can adjust the foreground pose/view to be compatible with the background, while some high-fidelity methods can preserve the foreground details accurately. However, existing methods can hardly achieve both goals at the same time. In this work, we propose a two-stage strategy to achieve both goals. In the first stage, we use a high-authenticity method to generate a reasonable foreground shape, which serves as the condition of the high-fidelity method in the second stage. The experiments on the MureCOM dataset verify the effectiveness of our two-stage strategy. The code and model have been released at this https URL.

[CV-78] Variational Trajectory Optimization of Anisotropic Diffusion Schedules

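[Quick read]: This paper addresses the isotropic restriction of noise schedules in diffusion models: conventional schedules inject noise of the same strength into every feature dimension over time and ignore heterogeneity in the structure of the data. The key to the solution is a variational framework with an anisotropic noise schedule parameterized by a matrix-valued path $M_t(\theta)$ that allocates different noise levels to different subspaces. A trajectory-level objective jointly trains the score network and learns $M_t(\theta)$, and an estimator of the derivative of the score with respect to $\theta$ makes optimizing the schedule efficient. For inference, the paper develops an efficiently implementable reverse-ODE solver that generalizes the second-order Heun discretization to the anisotropic case. On CIFAR-10, AFHQv2, FFHQ, and ImageNet-64, the method consistently improves on the EDM baseline in all NFE regimes.

Link: https://arxiv.org/abs/2602.19512
Authors: Pengxi Liu, Zeyu Michael Li, Xiang Cheng
Affiliations: Duke University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: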

Abstract:We introduce a variational framework for diffusion models with anisotropic noise schedules parameterized by a matrix-valued path $M_t(\theta)$ that allocates noise across subspaces. Central to our framework is a trajectory-level objective that jointly trains the score network and learns $M_t(\theta)$, which encompasses general parameterization classes of matrix-valued noise schedules. We further derive an estimator for the derivative with respect to $\theta$ of the score that enables efficient optimization of the $M_t(\theta)$ schedule. For inference, we develop an efficiently-implementable reverse-ODE solver that is an anisotropic generalization of the second-order Heun discretization algorithm. Across CIFAR-10, AFHQv2, FFHQ, and ImageNet-64, our method consistently improves upon the baseline EDM model in all NFE regimes. Code is available at this https URL.
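
For readers unfamiliar with the baseline the solver generalizes, here is a minimal sketch of the standard second-order Heun step for a scalar ODE $dx/dt = f(x, t)$. It is the generic textbook method, not the paper's anisotropic solver; the function names `heun_step` and `integrate` are illustrative.

```python
def heun_step(f, x, t, h):
    """One second-order Heun (explicit trapezoidal) step for dx/dt = f(x, t)."""
    k1 = f(x, t)                    # slope at the start of the step
    x_euler = x + h * k1            # Euler predictor
    k2 = f(x_euler, t + h)          # slope at the predicted endpoint
    return x + 0.5 * h * (k1 + k2)  # corrector: average the two slopes

def integrate(f, x0, t0, t1, steps):
    """Apply `steps` equal Heun steps from t0 to t1 (t1 < t0 integrates backward)."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = heun_step(f, x, t, h)
        t += h
    return x
```

Each step costs two evaluations of `f`, which is why sampler quality is usually reported per number of function evaluations (NFE).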

[CV-79] Relational Feature Caching for Accelerating Diffusion Transformers ICLR2026

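[Quick read]: This paper addresses the performance degradation of feature caching methods for diffusion transformers (DiTs), which stems from the large prediction errors of relying solely on temporal extrapolation. The key to the solution is relational feature caching (RFC). Its relational feature estimation (RFE) component exploits the strong correlation between a module's input and output features to estimate the magnitude of change in the output features more accurately, improving prediction accuracy. Its relational cache scheduling (RCS) component estimates the prediction error from the input features and performs a full computation only when that error is expected to be large, balancing efficiency and accuracy.

Link: https://arxiv.org/abs/2602.19506
Authors: Byunggwan Son, Jeimin Jeon, Jeongwoo Choi, Bumsub Ham
Affiliations: Yonsei University; Korea Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026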

Abstract:Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps, and exploiting them for subsequent steps to reduce redundant computations. Recent forecasting-based caching approaches employ temporal extrapolation techniques to approximate the output features with cached ones. Although effective, relying exclusively on temporal extrapolation still suffers from significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) an input feature of a module is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to enhance the accuracy of the feature prediction. Specifically, we introduce relational feature estimation (RFE) to estimate the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors using the input features and performs full computations only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently outperforms prior approaches significantly. Project page is available at this https URL
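
The scheduling idea can be illustrated with a toy loop that forecasts outputs by linear temporal extrapolation and falls back to a full computation when the input has drifted too far since the last full call. This is a loose sketch, not the RFC algorithm: using input drift as the error proxy, the `threshold` parameter, and all function names are assumptions made for illustration.

```python
def linear_forecast(prev_out, last_out):
    """Temporal extrapolation: continue the last observed change for one step."""
    return [2 * b - a for a, b in zip(prev_out, last_out)]

def relative_change(a, b):
    """Relative L1 change between two feature vectors."""
    denom = sum(abs(x) for x in a) or 1.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom

def run_with_cache(module, inputs, threshold):
    """Recompute `module` only when the input drifted past `threshold`."""
    outputs, full_steps = [], []
    cached_in = None   # input seen at the most recent full computation
    history = []       # outputs of the two most recent steps
    for step, x in enumerate(inputs):
        need_full = (
            cached_in is None
            or len(history) < 2
            or relative_change(cached_in, x) > threshold
        )
        if need_full:
            y = module(x)
            cached_in = x
            full_steps.append(step)
        else:
            y = linear_forecast(history[-2], history[-1])
        history = (history + [y])[-2:]
        outputs.append(y)
    return outputs, full_steps
```

The first two steps are always computed in full because extrapolation needs two past outputs; after that, the expensive module runs only when the proxy error is large.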

[CV-80] Test-Time Computing for Referring Multimodal Large Language Models

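[Quick read]: This paper addresses how to give frozen multimodal large language models (MLLMs) fine-grained, region-level visual reasoning without retraining or fine-tuning. The core challenge is steering the model toward user-specified visual regions at test time while keeping inference stable and interpretable. The key to the solution is the ControlMLLM++ framework, which injects learnable visual prompts and optimizes a latent visual token modifier through a task-specific energy function, adjusting cross-modal attention during inference so that it concentrates on the target region. An improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias) improve optimization stability and reduce bias from the language prompt, yielding flexible and interpretable region-controlled reasoning.

Link: https://arxiv.org/abs/2602.19505
Authors: Mingrui Wu, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Zhiyuan Liu, Liujuan Cao, Ming-Ming Cheng, Rongrong Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2407.21534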

Abstract:We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at this https URL.

[CV-81] A Text-Guided Vision Model for Enhanced Recognition of Small Instances

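[Quick read]: This paper addresses accurate text-prompted object detection on drone platforms, that is, how to efficiently recognize specific targets that are small or have clearly defined boundaries in complex scenes. The key to the solution is a structural modification of the YOLO-World model. First, the C2f layer in the YOLOv8 backbone is replaced with a C3k2 layer to strengthen local feature representation, particularly for small objects. Second, parallel processing optimization reduces computational cost while maintaining accuracy, making the model more lightweight. On the VisDrone dataset, the modified model improves on the original in every reported metric (for example, mAP@0.5 rises from 30.4% to 30.7%), supporting its usefulness for drone applications.

Link: https://arxiv.org/abs/2602.19503
Authors: Hyun-Ki Jung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in Applied Computer Science (2026)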

Abstract:As drone-based object detection technology continues to evolve, the demand is shifting from merely detecting objects to enabling users to accurately identify specific targets. For example, users can input particular targets as prompts to precisely detect desired objects. To address this need, an efficient text-guided object detection model has been developed to enhance the detection of small objects. Specifically, an improved version of the existing YOLO-World model is introduced. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, enabling more precise representation of local features, particularly for small objects or those with clearly defined boundaries. Additionally, the proposed architecture improves processing speed and efficiency through parallel processing optimization, while also contributing to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31%, F1 score from 35% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its enhanced accuracy. Furthermore, the model demonstrates superior lightweight performance, with the parameter count reduced from 4 million to 3.8 million and FLOPs decreasing from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for precise object detection in drone-based applications.
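
As a quick consistency check on the reported numbers, the F1 scores can be recomputed from the stated precision and recall, and the size reductions expressed as percentages. The snippet uses only figures quoted in the abstract; the helper names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100

baseline_f1 = f1(40.6, 30.8)                 # original YOLO-World
improved_f1 = f1(41.6, 31.0)                 # proposed model
param_cut = relative_reduction(4.0, 3.8)     # parameters, in millions
flops_cut = relative_reduction(15.7, 15.2)   # FLOPs, in billions
```

Rounded to one decimal place, the recomputed F1 values are 35.0 and 35.5, matching the abstract, and the reductions come to about 5.0% in parameters and 3.2% in FLOPs.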

[CV-82] MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models CVPR2026

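[Quick read]: This paper addresses the challenges that current unified multimodal models (UMMs) face in multi-image context generation, especially weaknesses in cross-image composition, contextual reasoning, and identity preservation. Existing benchmarks mostly cover text-to-image generation or single-image editing and cannot fully evaluate understanding and generation in complex multi-image scenarios. To address this, the authors propose MICON-Bench, a comprehensive benchmark with six tasks for systematically evaluating multi-image reasoning. They also design an Evaluation-by-Checkpoint framework driven by a multimodal large language model (MLLM) to automatically verify semantic and visual consistency, and propose Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to improve coherence and reduce hallucinations. The key contributions are the rigorous evaluation protocol of MICON-Bench and the effectiveness of DAR in improving multi-image consistency.

Link: https://arxiv.org/abs/2602.19497
Authors: Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
Affiliations: Xiamen University; Zhongguancun Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026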

Abstract:Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. Github: this https URL.

[CV-83] Exploiting Label-Independent Regularization from Spatial Dependencies for Whole Slide Image Analysis

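[Quick read]: This paper addresses weakly supervised learning on whole slide images (WSIs) in pathology, where the data are enormous and annotations are scarce. In conventional multiple instance learning (MIL), a single bag-level label must guide a large number of patch-level features, and this sparse supervision makes it hard to identify discriminative regions reliably. The key to the solution is a spatially regularized MIL framework that uses the inherent spatial relationships among patch features as a label-independent regularization signal. By jointly optimizing a feature-induced spatial reconstruction objective and a label-guided classification objective, it enforces consistency between structural patterns and supervisory signals, improving both the learning of discriminative regions and training stability.

Link: https://arxiv.org/abs/2602.19487
Authors: Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Jiang Gui
Affiliations: Dartmouth College; Massachusetts General Hospital; Northwestern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: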

Abstract:Whole slide images, with their gigapixel-scale panoramas of tissue samples, are pivotal for precise disease diagnosis. However, their analysis is hindered by immense data size and scarce annotations. Existing MIL methods face challenges due to the fundamental imbalance where a single bag-level label must guide the learning of numerous patch-level features. This sparse supervision makes it difficult to reliably identify discriminative patches during training, leading to unstable optimization and suboptimal solutions. We propose a spatially regularized MIL framework that leverages inherent spatial relationships among patch features as label-independent regularization signals. Our approach learns a shared representation space by jointly optimizing feature-induced spatial reconstruction and label-guided classification objectives, enforcing consistency between intrinsic structural patterns and supervisory signals. Experimental results on multiple public datasets demonstrate significant improvements over state-of-the-art methods, offering a promising direction.

[CV-84] Structured Bitmap-to-Mesh Triangulation for Geometry-Aware Discretization of Image-Derived Domains

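[Quick read]: This paper addresses the discretization of partial differential equations (PDEs) on image-derived domains, where conventional constrained Delaunay triangulation (CDT) can trigger global connectivity updates when boundaries are inserted, which makes it hard to parallelize and non-deterministic. The key to the solution is a template-driven triangulation framework: boundaries derived from rasters or segmentations are embedded into a regular triangular grid, and only the triangles intersected by the boundary are retriangulated locally, which preserves the base mesh and supports synchronization-free parallel execution. All local boundary-intersection configurations are classified up to discrete equivalence and triangle symmetries, yielding a finite symbolic lookup table that maps each case to a conflict-free retriangulation template. The resulting mesh is proved to be closed, to have bounded angles, and to be compatible with cotangent-based discretizations and standard finite element methods; experiments show fewer sliver elements, more regular triangles, and improved geometric fidelity near complex boundaries.

Link: https://arxiv.org/abs/2602.19474
Authors: Wei Feng, Haiyong Zheng
Affiliations: Unknown
Categories: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Revised version after peer review; under review at Graphical Models. Earlier version appeared on SSRN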

Abstract:We propose a template-driven triangulation framework that embeds raster- or segmentation-derived boundaries into a regular triangular grid for stable PDE discretization on image-derived domains. Unlike constrained Delaunay triangulation (CDT), which may trigger global connectivity updates, our method retriangulates only triangles intersected by the boundary, preserves the base mesh, and supports synchronization-free parallel execution. To ensure determinism and scalability, we classify all local boundary-intersection configurations up to discrete equivalence and triangle symmetries, yielding a finite symbolic lookup table that maps each case to a conflict-free retriangulation template. We prove that the resulting mesh is closed, has bounded angles, and is compatible with cotangent-based discretizations and standard finite element methods. Experiments on elliptic and parabolic PDEs, signal interpolation, and structural metrics show fewer sliver elements, more regular triangles, and improved geometric fidelity near complex boundaries. The framework is well suited for real-time geometric analysis and physically based simulation over image-derived domains.
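
A much-simplified sketch of the lookup-table idea follows, in the style of marching triangles: each triangle is classified by which of its vertices lie inside the domain, and the resulting case index selects a template. The paper's actual configuration classes and templates are richer and are not given in the abstract; the 3-bit encoding, the `CROSSED_EDGES` table, and the function names here are hypothetical.

```python
def case_index(inside):
    """Pack three inside/outside vertex flags into a 3-bit case index (0..7)."""
    a, b, c = inside
    return int(a) | (int(b) << 1) | (int(c) << 2)

# Hypothetical template table: for each case, the triangle edges (as local
# vertex-index pairs) that the boundary crosses. Cases 0 and 7 are untouched.
CROSSED_EDGES = {
    0: [], 7: [],
    1: [(0, 1), (0, 2)], 6: [(0, 1), (0, 2)],  # vertex 0 is the odd one out
    2: [(0, 1), (1, 2)], 5: [(0, 1), (1, 2)],  # vertex 1 is the odd one out
    4: [(0, 2), (1, 2)], 3: [(0, 2), (1, 2)],  # vertex 2 is the odd one out
}

def triangles_to_retriangulate(triangles, inside_flags):
    """Return indices of triangles whose case has at least one crossed edge."""
    return [
        i for i, tri in enumerate(triangles)
        if CROSSED_EDGES[case_index([inside_flags[v] for v in tri])]
    ]
```

Because each triangle is classified from its own vertices alone, the lookups are independent of one another, which is the property that allows parallel execution without synchronization.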

[CV-85] Forgetting-Resistant and Lesion-Aware Source-Free Domain Adaptive Fundus Image Analysis with Vision-Language Model

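[Quick read]: This paper studies two key problems of source-free domain adaptation (SFDA) for fundus image diagnosis when assisted by vision-language (ViL) models. First, even when mutual information is used to couple the predictions of the ViL model and the target model, some superior predictions of the target model are forgotten, as shown by declining accuracy on certain classes during adaptation. Second, existing ViL-based methods fail to exploit the fine-grained knowledge embedded in the ViL model. The key to the solution is a forgetting-resistant and lesion-aware (FRLA) framework. A forgetting-resistant adaptation module explicitly preserves the confident predictions of the target model, mitigating class forgetting. A lesion-aware adaptation module obtains patch-wise predictions from the ViL model and uses them to direct the target model to lesion areas and to absorb the ViL model's fine-grained knowledge.

Link: https://arxiv.org/abs/2602.19471
Authors: Zheang Huai, Hui Tang, Hualiang Wang, Xiaomeng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages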

Abstract:Source-free domain adaptation (SFDA) aims to adapt a model trained in the source domain to perform well in the target domain, with only unlabeled target domain data and the source model. Taking into account that conventional SFDA methods are inevitably error-prone under domain shift, recently greater attention has been directed to SFDA assisted with off-the-shelf foundation models, e.g., vision-language (ViL) models. However, existing works of leveraging ViL models for SFDA confront two issues: (i) Although mutual information is exploited to consider the joint distribution between the predictions of ViL model and the target model, we argue that the forgetting of some superior predictions of the target model still occurs, as indicated by the decline of the accuracies of certain classes during adaptation; (ii) Prior research disregards the rich, fine-grained knowledge embedded in the ViL model, which offers detailed grounding for fundus image diagnosis. In this paper, we introduce a novel forgetting-resistant and lesion-aware (FRLA) method for SFDA of fundus image diagnosis with ViL model. Specifically, a forgetting-resistant adaptation module explicitly preserves the confident predictions of the target model, and a lesion-aware adaptation module yields patch-wise predictions from ViL model and employs them to help the target model be aware of the lesion areas and leverage the ViL model’s fine-grained knowledge. Extensive experiments show that our method not only significantly outperforms the vision-language model, but also achieves consistent improvements over the state-of-the-art methods. Our code will be released.

[CV-86] Physics-informed Active Polarimetric 3D Imaging for Specular Surfaces

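[Quick read]: This paper addresses fast, accurate 3D imaging of complex specular surfaces in real-world settings such as in-line inspection or hand-held scanning. Conventional optical metrology techniques such as deflectometry are accurate but rely on multi-shot acquisition, which is unsuitable for dynamic environments; Fourier-based single-shot methods degrade on surfaces with high spatial frequency structure or large curvature; and polarimetric 3D imaging is single-shot and robust to geometric complexity, but its accuracy is limited by the orthographic imaging assumption. The key to the solution is a physics-informed deep learning framework in which polarization cues provide orientation priors that help interpret the geometric information encoded by structured illumination. A dual-encoder architecture with mutual feature modulation models the nonlinear coupling between the two cues and directly infers surface normals, giving accurate and robust single-shot normal estimation for practical 3D imaging of complex specular surfaces.

Link: https://arxiv.org/abs/2602.19470
Authors: Jiazhang Wang, Hyelim Yang, Tianyi Wang, Florian Willomitzer
Affiliations: Wyant College of Optical Sciences, University of Arizona
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: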

Abstract:3D imaging of specular surfaces remains challenging in real-world scenarios, such as in-line inspection or hand-held scanning, requiring fast and accurate measurement of complex geometries. Optical metrology techniques such as deflectometry achieve high accuracy but typically rely on multi-shot acquisition, making them unsuitable for dynamic environments. Fourier-based single-shot approaches alleviate this constraint, yet their performance deteriorates when measuring surfaces with high spatial frequency structure or large curvature. Alternatively, polarimetric 3D imaging in computer vision operates in a single-shot fashion and exhibits robustness to geometric complexity. However, its accuracy is fundamentally limited by the orthographic imaging assumption. In this paper, we propose a physics-informed deep learning framework for single-shot 3D imaging of complex specular surfaces. Polarization cues provide orientation priors that assist in interpreting geometric information encoded by structured illumination. These complementary cues are processed through a dual-encoder architecture with mutual feature modulation, allowing the network to resolve their nonlinear coupling and directly infer surface normals. The proposed method achieves accurate and robust normal estimation in single-shot with fast inference, enabling practical 3D imaging of complex specular surfaces.

[CV-87] Laplacian Multi-scale Flow Matching for Generative Modeling ICLR2026

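[Quick read]: This paper addresses limitations of existing flow matching methods for image generation: single-scale modeling struggles to deliver both high-quality generation and efficient inference, while multi-scale methods usually use cascaded structures that require explicit renoising between scales and add computational overhead. The key to the solution is Laplacian multiscale flow matching (LapFlow), which decomposes images into Laplacian pyramid residuals and processes the scales in parallel with a mixture-of-transformers (MoT) architecture with causal attention, removing the explicit bridging step between scales used by cascaded methods. The design improves generation quality, accelerates sampling, and scales to high-resolution generation (up to $1024 \times 1024$) at lower computational cost.

Link: https://arxiv.org/abs/2602.19461
Authors: Zelin Zhao, Petr Molodyk, Haotian Xue, Yongxin Chen
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to appear in ICLR 2026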

Abstract:In this paper, we present Laplacian multiscale flow matching (LapFlow), a novel framework that enhances flow matching by leveraging multi-scale representations for image generative modeling. Our approach decomposes images into Laplacian pyramid residuals and processes different scales in parallel through a mixture-of-transformers (MoT) architecture with causal attention mechanisms. Unlike previous cascaded approaches that require explicit renoising between scales, our model generates multi-scale representations in parallel, eliminating the need for bridging processes. The proposed multi-scale architecture not only improves generation quality but also accelerates the sampling process and promotes scaling flow matching methods. Through extensive experimentation on CelebA-HQ and ImageNet, we demonstrate that our method achieves superior sample quality with fewer GFLOPs and faster inference compared to single-scale and multi-scale flow matching baselines. The proposed model scales effectively to high-resolution generation (up to $1024 \times 1024$) while maintaining lower computational overhead.
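
To make "Laplacian pyramid residuals" concrete, here is a minimal one-dimensional sketch of the decomposition and its exact inverse. It assumes pair-averaging for downsampling and sample repetition for upsampling, which are simplifications chosen for brevity; the paper works on images and does not state its filters in the abstract.

```python
def downsample(x):
    """Halve the resolution by averaging adjacent pairs (length must be even)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def upsample(x):
    """Double the resolution by repeating each sample."""
    return [v for v in x for _ in range(2)]

def laplacian_pyramid(x, levels):
    """Return [residual_0, ..., residual_{levels-1}, coarsest signal]."""
    pyramid = []
    current = list(x)
    for _ in range(levels):
        coarse = downsample(current)
        # The residual keeps the detail lost by going to the coarser scale.
        residual = [a - b for a, b in zip(current, upsample(coarse))]
        pyramid.append(residual)
        current = coarse
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and adding residuals back."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = [a + b for a, b in zip(upsample(current), residual)]
    return current
```

Each level stores only the detail missing from the next coarser level, so the scales carry complementary information and the original signal is recovered exactly by summing them back up.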

[CV-88] HD-TTA: Hypothesis-Driven Test-Time Adaptation for Safer Brain Tumor Segmentation

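[Quick read]: This paper addresses the safety problems caused by the lack of selectivity in standard test-time adaptation (TTA) for medical segmentation, for example tumor masks that spill into healthy brain tissue or degraded predictions in cases that were already correct. The key to the solution is Hypothesis-Driven TTA (HD-TTA), which recasts adaptation as a dynamic decision process. It generates two competing geometric hypotheses, compaction (trimming noise and artifacts) and inflation (recovering valid tumor regions that were under-segmented), and uses a representation-guided selector to pick the safest outcome based on intrinsic texture consistency. A pre-screening Gatekeeper skips adaptation on confident cases to prevent negative transfer. On cross-domain brain tumor segmentation, the method improves safety-oriented metrics (HD95 reduced by about 6.4 mm, Precision improved by over 4%) while keeping Dice scores comparable, supporting explicit hypothesis selection as a way to balance safety and adaptation.

Link: https://arxiv.org/abs/2602.19454
Authors: Kartik Jhawar, Lipo Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 2 tables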

Abstract:Standard Test-Time Adaptation (TTA) methods typically treat inference as a blind optimization task, applying generic objectives to all or filtered test samples. In safety-critical medical segmentation, this lack of selectivity often causes the tumor mask to spill into healthy brain tissue or degrades predictions that were already correct. We propose Hypothesis-Driven TTA, a novel framework that reformulates adaptation as a dynamic decision process. Rather than forcing a single optimization trajectory, our method generates intuitive competing geometric hypotheses: compaction (is the prediction noisy? trim artifacts) versus inflation (is the valid tumor under-segmented? safely inflate to recover). It then employs a representation-guided selector to autonomously identify the safest outcome based on intrinsic texture consistency. Additionally, a pre-screening Gatekeeper prevents negative transfer by skipping adaptation on confident cases. We validate this proof-of-concept on a cross-domain binary brain tumor segmentation task, applying a source model trained on adult BraTS gliomas to unseen pediatric and more challenging meningioma target domains. HD-TTA improves safety-oriented outcomes (Hausdorff Distance (HD95) and Precision) over several state-of-the-art representative baselines in the challenging safety regime, reducing the HD95 by approximately 6.4 mm and improving Precision by over 4%, while maintaining comparable Dice scores. These results demonstrate that resolving the safety-adaptation trade-off via explicit hypothesis selection is a viable, robust path for safe clinical model deployment. Code will be made publicly available upon acceptance.

[CV-89] Decoupling Vision and Language: Codebook Anchored Visual Adaptation CVPR2026

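[Quick read]: This paper addresses the weak performance of the vision encoders of large vision-language models (LVLMs) on domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors propagate through the language model and lead to incorrect responses. The key to the solution is CRAFT (Codebook RegulAted Fine-Tuning), a lightweight fine-tuning method that uses a discrete codebook to anchor visual representations to a stable token space, adapting the encoder to the domain without modifying other parts of the model. Because of this decoupled design, the adapted encoder can improve LVLMs with different language architectures as long as they share the same codebook, and it outperforms peer methods that operate on continuous tokens.

Link: https://arxiv.org/abs/2602.19449
Authors: Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, accepted to CVPR2026 main conference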

Abstract:Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM’s linguistic capabilities and outperforming peer methods that operate on continuous tokens.
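
One common way to anchor continuous features to a discrete token space is nearest-neighbor vector quantization, sketched below. The abstract does not say how CRAFT uses its codebook, so treating it as a nearest-entry lookup is an assumption, and `quantize` and `squared_distance` are illustrative names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

# Toy usage: three 2-D codes and three slightly perturbed features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]
token_ids = quantize(features, codebook)
```

Downstream components that consume only the token indices are unaffected by how the encoder is updated, as long as the codebook itself stays fixed, which is the sense in which such an interface decouples the two sides.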

[CV-90] UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

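[Quick read]: This paper addresses the mismatch between vision-language model (VLM) outputs and human preferences in domain-specific subjective perception tasks; fine-tuning and reinforcement learning are effective but require labelled data and GPU compute. The key to the solution is a training-free, post-hoc concept-bottleneck pipeline with three tightly coupled stages: interpretable evaluation dimensions are mined from a small number of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally weighted ridge regression on a hybrid visual-semantic manifold geometrically calibrates these scores against human ratings. This improves prediction accuracy and interpretability without modifying any model weights.

Link: https://arxiv.org/abs/2602.19442
Authors: Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages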

Abstract:Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa = 0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
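
The calibration step rests on locally weighted ridge regression, which can be sketched for a single feature as follows. This is a one-dimensional toy under stated assumptions: the Gaussian kernel, the `bandwidth` and `lam` parameters, and the choice to penalize only the slope are illustrative, and the paper's version operates on a hybrid visual-semantic manifold rather than a scalar input.

```python
import math

def locally_weighted_ridge(xs, ys, query, bandwidth=1.0, lam=0.0):
    """Predict y at `query` from a ridge-regularized line fit with Gaussian weights."""
    # Points near the query get weights close to 1; distant points fade out.
    w = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, xs)) / total
    y_bar = sum(wi * y for wi, y in zip(w, ys)) / total
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, xs, ys))
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, xs))
    slope = sxy / (sxx + lam)          # the ridge penalty shrinks the slope only
    intercept = y_bar - slope * x_bar  # the intercept is left unpenalized
    return intercept + slope * query
```

With `lam=0` the fit reduces to locally weighted least squares; as `lam` grows, the prediction is pulled toward the locally weighted mean of the ratings, which stabilizes calibration when few annotated neighbors are available.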

[CV-91] FinSight-Net: A Physics-Aware Decoupled Network with Frequency-Domain Compensation for Underwater Fish Detection in Smart Aquaculture

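[Quick read]: This paper targets underwater fish detection (UFD), where light absorption and turbidity-induced scattering reduce contrast, blur details, and introduce backscattering noise, and proposes FinSight-Net, an efficient, physics-aware detection framework. The key to the solution is twofold. First, a Multi-Scale Decoupled Dual-Stream Processing (MS-DDSP) bottleneck uses heterogeneous convolutional branches to explicitly compensate for frequency-specific information loss, suppressing backscattering artifacts and recovering biological cues. Second, an Efficient Path Aggregation FPN (EPA-FPN) restores high-frequency spatial information attenuated in deep layers through long-range skip connections and pruning of redundant fusion routes, enabling robust detection of non-rigid fish targets under severe blur and turbidity.

Link: https://arxiv.org/abs/2602.19437
Authors: Jinsong Yang, Zeyuan Hu, Yichen Li, Hong Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: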

Abstract:Underwater fish detection (UFD) is a core capability for smart aquaculture and marine ecological monitoring. While recent detectors improve accuracy by stacking feature extractors or introducing heavy attention modules, they often incur substantial computational overhead and, more importantly, neglect the physics that fundamentally limits UFD: wavelength-dependent absorption and turbidity-induced scattering significantly degrade contrast, blur fine structures, and introduce backscattering noise, leading to unreliable localization and recognition. To address these challenges, we propose FinSight-Net, an efficient and physics-aware detection framework tailored for complex aquaculture environments. FinSight-Net introduces a Multi-Scale Decoupled Dual-Stream Processing (MS-DDSP) bottleneck that explicitly targets frequency-specific information loss via heterogeneous convolutional branches, suppressing backscattering artifacts while compensating distorted biological cues through scale-aware and channel-weighted pathways. We further design an Efficient Path Aggregation FPN (EPA-FPN) as a detail-filling mechanism: it restores high-frequency spatial information typically attenuated in deep layers by establishing long-range skip connections and pruning redundant fusion routes, enabling robust detection of non-rigid fish targets under severe blur and turbidity. Extensive experiments on DeepFish, AquaFishSet, and our challenging UW-BlurredFish benchmark demonstrate that FinSight-Net achieves state-of-the-art performance. In particular, on UW-BlurredFish, FinSight-Net reaches 92.8% mAP, outperforming YOLOv11s by 4.8% while reducing parameters by 29.0%, providing a strong and lightweight solution for real-time automated monitoring in smart aquaculture.

[CV-92] CountEx: Fine-Grained Counting via Exemplars and Exclusion

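[Quick read]: This paper addresses the inability of existing prompt-based visual counting methods to explicitly exclude visually similar distractors. These methods usually only let users specify what to count through inclusion prompts, and in cluttered scenes they tend to overcount when target and distractor categories are confusable. The key to the solution is the CountEx framework, which expresses both inclusion and exclusion intent through multimodal prompts (natural language descriptions and optional visual exemplars). Its Discriminative Query Refinement module first identifies the visual features shared by the inclusion and exclusion prompts, then isolates exclusion-specific patterns, and finally refines the counting query through selective suppression, enabling accurate counting of the target objects.

Link: https://arxiv.org/abs/2602.19432
Authors: Yifeng Huang, Gia Khanh Nguyen, Minh Hoai
Affiliations: Stony Brook University; Adelaide University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: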

Abstract:This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars. At the core of CountEx is a novel Discriminative Query Refinement module, which jointly reasons over inclusion and exclusion cues by first identifying shared visual features, then isolating exclusion-specific patterns, and finally applying selective suppression to refine the counting query. To support systematic evaluation of fine-grained counting methods, we introduce CoCount, a benchmark comprising 1,780 videos and 10,086 annotated frames across 97 category pairs. Experiments show that CountEx achieves substantial improvements over state-of-the-art methods for counting objects from both known and novel categories. The data and code are available at this https URL.

[CV-93] TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation

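[Quick read]: This paper addresses the difficulty of collecting and annotating thermal infrared (TIR) data at scale and proposes TherA, a controllable RGB-to-TIR image translation framework. The core idea is to bring thermal physics into the generation process: a thermal-aware vision-language model (TherA-VLM) extracts scene, object, material, and heat-emission context and produces a thermal-aware embedding, which conditions a latent-diffusion-based translator to synthesize thermally plausible TIR images at both scene and object level. The method outperforms existing RGB-to-TIR approaches, with zero-shot translation gains of up to 33% averaged across all metrics.

Link: https://arxiv.org/abs/2602.19430
Authors: Dong-Guw Lee, Tai Hyoung Rhee, Hyunsoo Jang, Young-Sik Shin, Ukcheol Shin, Ayoung Kim
Affiliations: Seoul National University; Kyungpook National University; KENTECH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: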

Abstract:Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, improving zero-shot translation performance by up to 33%, averaged across all metrics.

[CV-94] Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

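Quick read: This paper targets two challenges in computational analysis of whole slide images (WSIs) for hepatocellular carcinoma (HCC) diagnosis: information loss caused by fixed-resolution processing, and redundancy caused by inefficient feature aggregation. The key to the solution is Hepato-LLaVA, a multi-modal large language model (MLLM) specialized for hepatocellular pathology analysis, together with a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology and aggregates local diagnostic evidence into semantic summary tokens while preserving global context. The authors also build HepatoPathoVQA, a clinically grounded multi-scale question-answering dataset, to support training and validation, which leads to markedly better performance on HCC diagnosis and captioning tasks.

Link: https://arxiv.org/abs/2602.19424
Authors: Yuxuan Yang, Zhonghao Yan, Yi Zhang, Bo Yun, Muxi Diao, Guowei Zhao, Kongming Liang, Wenbin Li, Zhanyu Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures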

Abstract:Hepatocellular Carcinoma diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images. However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at this https URL.

[CV-95] Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

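Quick read: This paper addresses the limited and biased performance of domain adaptive segmentation (DAS) when no annotated target-domain data is available. Conventional unsupervised domain adaptation (UDA) methods perform poorly in practice and fall short of high-precision segmentation needs. The key to the solution is Prefer-DAS, a model that uses sparse points and local human preferences as weak labels, enabling efficient and flexible segmentation without dense annotation. Its core mechanisms are: 1) a promptable multitask learning framework that accepts full, partial, or no point prompts during both training and inference, which enables interactive segmentation; 2) Local direct Preference Optimization (LPO) and sparse LPO (SLPO), which align the model with spatially varying or sparse human feedback; 3) Unsupervised Preference Optimization (UPO), which handles missing feedback. With this design the model supports both weakly-supervised and unsupervised DAS, outperforms SAM-like methods and mainstream weakly-supervised and unsupervised DAS methods on several challenging tasks, and approaches or even exceeds supervised models.

Link: https://arxiv.org/abs/2602.19423
Authors: Jiabao Chen, Shan Xiong, Jialin Peng
Affiliations: Huaqiao University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: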

Abstract:Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures from various large-scale electron microscopy (EM) without incurring extensive annotated data in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical applications. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. The Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, the Prefer-DAS allows for the use of full, partial, and even no point prompts during both training and inference stages and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO) and sparse LPO (SLPO), plug-and-play solutions for alignment with spatially varying human feedback or sparse feedback. To address potential missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly-supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to or even exceeds that of supervised models.

[CV-96] PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

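Quick read: This paper addresses the vulnerability of Large Vision-Language Models (LVLMs) to adversarial attacks, and in particular two limitations of existing methods: white-box attacks generalize poorly across tasks, and black-box attacks rely on expensive transfer strategies that make them inefficient. The key to the solution is the Prototype-Anchored Attentive Attack (PA-Attack), which introduces a stable, general prototype anchor to guide the attack direction and so overcomes the limited task generalization of attribute-restricted attacks. It also designs a two-stage attention enhancement mechanism: the first stage uses token-level attention scores to concentrate perturbations on critical visual tokens, and the second stage adaptively recalibrates attention weights to track how attention changes during the adversarial process. Together these improve attack effectiveness, efficiency, and cross-task generalization.

Link: https://arxiv.org/abs/2602.19418
Authors: Hefei Mei, Zirui Wang, Chang Xu, Jianyuan Guo, Minjing Dong
Affiliations: City University of Hong Kong; The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: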

Abstract:Large Vision-Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token-level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA-Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs. Code is available at this https URL.

[CV-97] Redefining the Down-Sampling Scheme of U-Net for Precision Biomedical Image Segmentation

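Quick read: This paper addresses the difficulty U-Net architectures have in capturing long-range dependencies in biomedical image segmentation (BIS), which stems from conventional down-sampling strategies that sacrifice information retention for computational efficiency. The key to the solution is Stair Pooling, a simple but effective down-sampling strategy that chains several small, narrow pooling operations in different orientations to reduce feature-map dimensionality gradually (changing the per-step down-sampling ratio from 1/4 to 1/2), which substantially reduces information loss. The method applies to both 2D and 3D settings and helps the U-Net reconstruct spatial details during up-sampling, improving long-range information modeling and segmentation accuracy. Experiments show an average Dice score improvement of 3.8%.

Link: https://arxiv.org/abs/2602.19412
Authors: Mingjie Li, Yizheng Chen, Md Tauhidul Islam, Lei Xing
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAPM 67th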

Abstract: U-Net architectures have been instrumental in advancing biomedical image segmentation (BIS) but often struggle with capturing long-range information. One reason is that conventional down-sampling techniques prioritize computational efficiency at the expense of information retention. This paper introduces a simple but effective strategy, which we call Stair Pooling, that moderates the pace of down-sampling and reduces information loss by leveraging a sequence of concatenated small and narrow pooling operations in varied orientations. Specifically, our method modifies the reduction in dimensionality within each 2D pooling step from $\frac{1}{4}$ to $\frac{1}{2}$. This approach can also be adapted for 3D pooling to preserve even more information. Such preservation aids the U-Net in more effectively reconstructing spatial details during the up-sampling phase, thereby enhancing its ability to capture long-range information and improving segmentation accuracy. Extensive experiments on three BIS benchmarks demonstrate that the proposed Stair Pooling can increase both 2D and 3D U-Net performance by an average of 3.8% in Dice scores. Moreover, we leverage transfer entropy to select the optimal down-sampling paths and quantitatively show how the proposed Stair Pooling reduces information loss.
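
To make the 1/4 versus 1/2 ratio concrete, here is a minimal sketch of pooling with narrow windows in alternating orientations. The abstract does not give kernel shapes or ordering, so the 1x2-then-2x1 sequence, the use of max pooling, and the names `max_pool` and `stair_pool` are all assumptions for illustration.

```python
def max_pool(grid, kh, kw):
    """Non-overlapping max pooling with a kh x kw window over a 2D list."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + i][c + j] for i in range(kh) for j in range(kw))
            for c in range(0, cols - kw + 1, kw)
        ]
        for r in range(0, rows - kh + 1, kh)
    ]


def stair_pool(grid):
    """Assumed staircase: a 1x2 pool, then a 2x1 pool.

    Each step keeps 1/2 of the elements, whereas a single 2x2 pool
    keeps only 1/4 in one jump.
    """
    after_horizontal = max_pool(grid, 1, 2)            # halves the width
    after_vertical = max_pool(after_horizontal, 2, 1)  # halves the height
    return after_horizontal, after_vertical


feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
step1, step2 = stair_pool(feature_map)
```

With plain max pooling the final map equals what one 2x2 pool would produce; the difference is the extra intermediate-resolution map (`step1`), where a network could apply further layers before the next reduction.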

[CV-98] Detector-in-the-Loop Tracking: Active Memory Rectification for Stable Glottic Opening Localization

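Quick read: This paper addresses the temporal stability of glottic opening localization in video laryngoscopy. The core challenge is that single-frame detectors lack temporal context while foundation-model trackers suffer from memory drift, and in emergency settings rapid tissue deformation, occlusions, and visual ambiguity cause tracking errors to accumulate. The key to the solution is Closed-Loop Memory Correction (CL-MC), a framework that places the detector inside the tracking loop and uses confidence-aligned state decisions and active memory rectification: high-confidence detections trigger semantic resets that overwrite corrupted tracker memory, giving stable, drift-resistant tracking without any training.

Link: https://arxiv.org/abs/2602.19380
Authors: Huayu Wang, Bahaa Alattar, Cheng-Yen Yang, Hsiang-Wei Huang, Jung Heon Kim, Linda Shapiro, Nathan White, Jenq-Neng Hwang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Medical Imaging with Deep Learning (MIDL) 2026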

Abstract: Temporal stability in glottic opening localization remains challenging due to the complementary weaknesses of single-frame detectors and foundation-model trackers: the former lacks temporal context, while the latter suffers from memory drift. Specifically, in video laryngoscopy, rapid tissue deformation, occlusions, and visual ambiguities in emergency settings require a robust, temporally aware solution that can prevent progressive tracking errors. We propose Closed-Loop Memory Correction (CL-MC), a detector-in-the-loop framework that supervises Segment Anything Model 2 (SAM2) through confidence-aligned state decisions and active memory rectification. High-confidence detections trigger semantic resets that overwrite corrupted tracker memory, effectively mitigating drift accumulation with a training-free foundation tracker in complex endoscopic scenes. On emergency intubation videos, CL-MC achieves state-of-the-art performance, significantly reducing drift and missing rate compared with the SAM2 variants and open-loop methods. Our results establish memory correction as a crucial component for reliable clinical video tracking. Our code will be available at this https URL.
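
The detector-in-the-loop idea can be sketched as a single control step. This is only an illustration of the reset rule described above: `ToyTracker` is a hypothetical stand-in and not the SAM2 API, and the 0.8 confidence threshold is an assumed value.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    """Hypothetical memory-based tracker used only for illustration."""
    memory: list = field(default_factory=list)

    def track(self):
        # Toy behaviour: report the most recent remembered box, if any.
        return self.memory[-1] if self.memory else None

    def reset(self, box):
        # Overwrite (possibly drifted) memory with a trusted box.
        self.memory = [box]


def closed_loop_step(tracker, detection, confidence, threshold=0.8):
    """Trust the detector when it is confident, otherwise keep tracking."""
    if detection is not None and confidence >= threshold:
        tracker.reset(detection)
        return detection, "reset"
    return tracker.track(), "tracked"
```

For example, a tracker holding a drifted box keeps reporting it while detections stay below the threshold, and switches to the detector's box as soon as one confident detection arrives.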

[CV-99] Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization ICRA2026

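Quick read: This paper addresses three problems with using vision-language models (VLMs) for decision-making in complex, long-horizon robotic manipulation tasks: reliance on inefficient and inaccurate implicit learning of state values, poor robustness from evaluating only a single greedy future, and high inference latency. The key to the solution is a new test-time computation framework that decouples state evaluation from action generation, providing a more direct and fine-grained supervisory signal. It explicitly models the advantage of an action plan (measured by the reduction in distance to the goal) and estimates it with a scalable critic; uses beam search to explore multiple future trajectories and aggregate their expected long-term returns, which stabilizes action generation; and adds a lightweight confidence-based trigger that exits early when predictions are reliable and invokes reflection only when necessary. This significantly reduces inference time and raises the success rate by 24.6% over state-of-the-art baselines.

Link: https://arxiv.org/abs/2602.19372
Authors: Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas
Affiliations: Rutgers University; The Hong Kong University of Science and Technology; The University of Hong Kong
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICRA 2026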

Abstract: Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general perceive-reason-act framework for this goal. However, previous approaches using reflective planning to guide VLMs in correcting actions encounter significant limitations. These methods rely on inefficient and often inaccurate implicit learning of state-values from noisy foresight predictions, evaluate only a single greedy future, and suffer from substantial inference latency. To address these limitations, we propose a novel test-time computation framework that decouples state evaluation from action generation. This provides a more direct and fine-grained supervisory signal for robust decision-making. Our method explicitly models the advantage of an action plan, quantified by its reduction in distance to the goal, and uses a scalable critic to estimate it. To address the stochastic nature of single-trajectory evaluation, we employ beam search to explore multiple future paths and aggregate them during decoding to model their expected long-term returns, leading to more robust action generation. Additionally, we introduce a lightweight, confidence-based trigger that allows for early exit when direct predictions are reliable, invoking reflection only when necessary. Extensive experiments on diverse, unseen multi-stage robotic manipulation tasks demonstrate a 24.6% improvement in success rate over state-of-the-art baselines, while significantly reducing inference time by 56.5%.
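
The decision rule described above (early exit when confident, otherwise score several futures per candidate and aggregate) can be sketched as follows. All names here are hypothetical, the 0.9 exit threshold is assumed, and averaging critic scores is one possible aggregation; the abstract does not specify the actual one.

```python
def choose_action(candidates, direct_confidence, rollouts, critic,
                  exit_threshold=0.9):
    """Pick an action plan for the current step.

    candidates: action plans ordered by the policy's own preference.
    rollouts(action): several predicted future states for that action.
    critic(state): estimated reduction in distance to the goal.
    """
    # Early exit: trust the direct prediction when confidence is high.
    if direct_confidence >= exit_threshold:
        return candidates[0]

    # Otherwise reflect: score each candidate by its mean critic value
    # over multiple futures instead of a single greedy one.
    def expected_advantage(action):
        futures = rollouts(action)
        return sum(critic(state) for state in futures) / len(futures)

    return max(candidates, key=expected_advantage)
```

Averaging over several futures means one optimistic rollout cannot dominate the choice, which is the motivation the abstract gives for moving beyond single-trajectory evaluation.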

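[CV-100] Time Series Vision and Language: Exploring the Limits of Alignment in Contrastive Representation Spaces

Quick read: This paper examines a core question in multimodal learning: whether representations from different modalities (time series, vision, and language) converge to a shared latent structure of the world, that is, whether the Platonic Representation Hypothesis holds for non-conventional modalities such as time series. The key approach is post-hoc alignment: projection heads are trained over frozen pretrained encoders with contrastive learning, and the aligned representations are then analyzed in terms of geometry, scaling with model size, dependence on information density, and modality characteristics. The study finds that time series align more strongly with vision than with text and that images can act as a bridge between time series and language, but that the benefit of denser textual or visual descriptions has a threshold beyond which further density does not improve alignment. These findings offer theoretical grounding and practical guidance for building multimodal systems that include non-conventional modalities.

Link: https://arxiv.org/abs/2602.19367
Authors: Pratham Yashwante, Rose Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 Figures, 12 Tables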

Abstract:The Platonic Representation Hypothesis posits that learned representations from models trained on different modalities converge to a shared latent structure of the world. However, this hypothesis has largely been examined in vision and language, and it remains unclear whether time series participate in such convergence. We first examine this in a trimodal setting and find that independently pretrained time series, vision, and language encoders exhibit near-orthogonal geometry in the absence of explicit coupling. We then apply post-hoc alignment by training projection heads over frozen encoders using contrastive learning, and analyze the resulting representations with respect to geometry, scaling behavior, and dependence on information density and input modality characteristics. Our investigation reveals that overall alignment in contrastive representation spaces improves with model size, but this alignment is asymmetric: time series align more strongly with visual representations than with text, and images can act as effective intermediaries between time series and language. We further see that richer textual descriptions improve alignment only up to a threshold; training on denser captions does not lead to further improvement. Analogous effects are observed for visual representations. Our findings shed light on considerations for building multimodal systems involving non-conventional data modalities beyond vision and language.
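
Contrastive alignment of two embedding sets is commonly trained with an InfoNCE-style loss. The sketch below assumes that loss; the abstract only says "contrastive learning", so the loss form, the temperature of 0.1, and the one-directional formulation are assumptions. The inputs stand for already-projected embeddings, with row i of each list forming a matched pair.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss where anchors[i] should match positives[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [
            sum(x * y for x, y in zip(anchor, candidate)) / temperature
            for candidate in p
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is near zero when each anchor is closest to its own partner and grows when matched pairs point in different directions, which is the signal that would train the projection heads.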

[CV-101] Referring Layer Decomposition ICLR2026

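Quick read: This paper addresses the lack of precise, object-aware control over individual scene elements in existing image editing and generation methods, most of which operate on the whole image and cannot easily isolate and edit a specific object. The key to the solution is the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image conditioned on flexible user prompts (spatial inputs, natural language descriptions, or combinations), yielding a structured, editable layered representation. The main contributions are RefLade, a large-scale dataset with 1.11M image-layer-prompt triplets and 100K manually curated high-quality layers; an automatic evaluation protocol that is perceptually grounded and aligned with human preference; and RefLayer, a baseline that can be trained and evaluated on this foundation and shows strong zero-shot generalization.

Link: https://arxiv.org/abs/2602.19358
Authors: Fangyi Chen, Yaojie Shen, Lu Xu, Ye Yuan, Shu Zhang, Yulei Niu, Longyin Wen
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026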

Abstract:Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet, most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts, such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core is the RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment. Extensive experiments show our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization capabilities.
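
For readers unfamiliar with layered RGBA representations, the sketch below shows the standard "over" operator that recombines such layers into a flat RGB pixel. It illustrates why complete RGBA layers are editable (a layer can be changed or removed before recompositing); it is not taken from the paper, and the function name and black background are assumptions.

```python
def composite_over(layers):
    """Composite one pixel from back-to-front (r, g, b, a) layers.

    Uses straight (non-premultiplied) alpha in [0, 1] over a black background.
    """
    r = g = b = 0.0
    for layer_r, layer_g, layer_b, alpha in layers:
        r = layer_r * alpha + r * (1.0 - alpha)
        g = layer_g * alpha + g * (1.0 - alpha)
        b = layer_b * alpha + b * (1.0 - alpha)
    return (r, g, b)
```

Dropping a layer from the list and compositing again gives the scene without that object, provided the layers behind it are complete rather than cut off where they were occluded.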

[CV-102] MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations

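Quick read: This paper investigates whether state-of-the-art Vision-Language Models (VLMs) possess spatial visualization ability, that is, the ability to mentally imagine, transform, and manipulate the spatial characteristics of objects and actions, a core part of human cognition that links perception and action. To study this, the authors build MentalBlackboard, an open-ended spatial visualization benchmark focused on Paper Folding and Hole Punching tests, covering two kinds of task: prediction and planning. The key contribution is a systematic evaluation framework that separates models' limitations in reasoning about spatial transformations (such as applying symmetry and understanding rotation) from their limitations in multi-stage symmetry processing, exposing the shortcomings of current VLMs in physical situational awareness and complex spatial reasoning.

Link: https://arxiv.org/abs/2602.19357
Authors: Nilay Yilmaz, Maitreya Patel, Naga Sai Abhiram Kusumba, Yixuan He, Yezhou Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: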

Abstract:Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This intelligence is a part of human cognition where actions and perception are connected on a mental level. To explore whether state-of-the-art Vision-Language Models (VLMs) exhibit this ability, we develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests within two core tasks: prediction and planning. Our prediction experiments reveal that models struggle with applying symmetrical transformations, even when they predict the sequence of unfolding steps correctly. Also, rotations introduce a significant challenge to the physical situational awareness for models. The planning task reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process, with Claude Opus 4.1 achieving the highest planning score at an accuracy of 10%. The top-performing model, o3, attains a peak performance of 71.6% on the generalization task, which does not require spatial visualization but transfers spatial data; however, it achieves only 25% accuracy on text-based prediction tasks.

[CV-103] PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

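Quick read: This paper addresses how to efficiently generate photorealistic digital humans with explicit 3D pose and camera control for virtual reality (VR), telepresence, and entertainment. Existing skinning-based methods depend on laborious manual rigging or template fitting, while neural volumetric methods rely on canonical templates and must be re-optimized for every unseen pose. The key to the solution is PoseCraft, a diffusion framework built around a tokenized 3D interface: sparse 3D landmarks and camera extrinsics are encoded as conditioning tokens and injected into the diffusion process through cross-attention. This avoids 2D re-projection ambiguity under large pose and viewpoint changes, preserves 3D semantics, and produces high-fidelity images that faithfully retain identity and appearance details.

Link: https://arxiv.org/abs/2602.19350
Authors: Zhilin Guo, Jing Yang, Kyle Fogarty, Jingyi Wan, Boqiao Zhang, Tianhao Wu, Weihao Xia, Chenliang Zhou, Sakar Khattar, Fangcheng Zhong, Cristina Nader Vasconcelos, Cengiz Oztireli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: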

Abstract: Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fitting, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to the latest state-of-the-art volumetric rendering methods while better preserving fabric and hair details.

[CV-104] UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation

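Quick read: This paper addresses the loss of reliability in LiDAR-camera fusion for 3D panoptic segmentation when the camera sensor degrades, its calibration drifts, or it fails. The key to the solution is UP-Fuse, an uncertainty-aware fusion framework that operates in the 2D range-view space and learns to predict uncertainty maps that dynamically modulate cross-modal interaction, so that only reliable visual cues take part in fusion. It also introduces a new hybrid 2D-3D transformer decoder that mitigates the spatial ambiguities of the 2D projection, producing robust and accurate 3D panoptic segmentation.

Link: https://arxiv.org/abs/2602.19349
Authors: Rohit Mohan, Florian Drews, Yakov Miron, Daniele Cattaneo, Abhinav Valada
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: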

Abstract:LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.
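
One simple way to let an uncertainty map modulate fusion is to gate the camera contribution per location. The sketch below assumes additive gating with uncertainty values in [0, 1]; the abstract does not describe the module's internals, so this form and the name `gated_fuse` are illustrative only.

```python
def gated_fuse(lidar_features, camera_features, uncertainty):
    """Fuse per-location features, down-weighting uncertain camera cues.

    uncertainty[i] near 0 means the camera feature is trusted; near 1
    means it is ignored and the LiDAR feature passes through unchanged.
    """
    fused = []
    for lidar, camera, u in zip(lidar_features, camera_features, uncertainty):
        gate = 1.0 - min(max(u, 0.0), 1.0)  # clamp, then invert
        fused.append(lidar + gate * camera)
    return fused
```

Under this form, a fully failed camera (uncertainty 1 everywhere) reduces the output to the LiDAR-only features, which matches the robustness goal stated above.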

[CV-105] MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose ICRA

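Quick read: This paper addresses the high cost and low efficiency of acquiring multi-modal visuo-tactile datasets for tactile sensing, given that existing synthesis methods are usually limited to a single modality and therefore cannot support cross-modal learning. The key to the solution is MultiDiffSense, a unified diffusion model that uses dual conditioning (CAD-derived, pose-aligned depth maps plus structured prompts encoding sensor type and 4-DoF contact pose) to synthesize controllable, physically consistent images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip). The method markedly improves synthetic image quality, and a downstream 3-DoF pose estimation task shows that the synthetic data is useful: it cuts the required real data by 50% while keeping performance competitive. This eases the data-collection bottleneck in tactile sensing and supports scalable multi-modal data generation for robotic applications.

Link: https://arxiv.org/abs/2602.19348
Authors: Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang
Affiliations: Imperial College London; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by 2026 ICRA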

Abstract:Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.

[CV-106] RetinaVision: XAI-Driven Augmented Regulation for Precise Retinal Disease Classification using deep learning framework

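Quick read: This paper addresses early and accurate classification of retinal diseases, with the aim of preventing vision loss and guiding clinical management. The key to the solution is deep learning on optical coherence tomography (OCT) images with convolutional neural network (CNN) architectures such as Xception and InceptionV3, combined with data augmentation (CutMix, MixUp) to improve generalization. GradCAM and LIME are used to assess interpretability, and the model is deployed in a real-world setting as a web application named RetinaVision. The results (Xception reaches 95.25% accuracy) underline the value of pairing high accuracy with interpretability for clinical translation.

Link: https://arxiv.org/abs/2602.19324
Authors: Mohammad Tahmid Noor, Shayan Abrar, Jannatul Adan Mahi, Md Parvez Mia, Asaduzzaman Hridoy, Samanta Ghosh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 15 figures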

Abstract: Early and accurate classification of retinal diseases is critical for preventing vision loss and guiding clinical management. In this study, we propose a deep learning method for retinal disease classification utilizing optical coherence tomography (OCT) images from the Retinal OCT Image Classification - C8 dataset (comprising 24,000 labeled images spanning eight conditions). Images were resized to 224x224 px and tested on convolutional neural network (CNN) architectures: Xception and InceptionV3. Data augmentation techniques (CutMix, MixUp) were employed to enhance model generalization. Additionally, we applied GradCAM and LIME for interpretability evaluation. We implemented this in a real-world scenario via our web application named RetinaVision. This study found that Xception was the most accurate network (95.25%), followed closely by InceptionV3 (94.82%). These results suggest that deep learning methods allow effective OCT retinal disease classification and highlight the importance of combining accuracy and interpretability for clinical applications.
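
MixUp, one of the two augmentations named above, blends two training examples and their labels with the same weight. A minimal sketch on flat feature lists and one-hot labels follows; the function name is illustrative and the paper's own settings are not given in the abstract.

```python
def mixup(features_a, label_a, features_b, label_b, lam):
    """Blend two examples: lam of example A plus (1 - lam) of example B.

    Features and one-hot labels are mixed with the same coefficient,
    so the target becomes a soft label.
    """
    mixed_features = [lam * a + (1.0 - lam) * b
                      for a, b in zip(features_a, features_b)]
    mixed_label = [lam * a + (1.0 - lam) * b
                   for a, b in zip(label_a, label_b)]
    return mixed_features, mixed_label
```

In practice the coefficient is usually drawn per batch from a Beta(alpha, alpha) distribution, for example with `random.betavariate(alpha, alpha)`, where alpha is a tunable hyperparameter.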

[CV-107] DefenseSplat: Enhancing the Robustness of 3D Gaussian Splatting via Frequency-Aware Filtering

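Quick read: This paper addresses the vulnerability of 3D Gaussian Splatting (3DGS) to adversarial perturbations in input images. Such perturbations are hard for humans to perceive but can sharply degrade rendering quality, increase training and rendering time, and inflate memory usage, even to the point of denial of service. The key to the solution is to use wavelet transforms to analyze how adversarial perturbations behave in the low- and high-frequency components of an image, and then to design a frequency-aware defense that filters high-frequency noise while preserving low-frequency content when reconstructing training views. This suppresses adversarial artifacts while keeping the scene authentic. The method substantially improves the robustness of 3DGS without clean ground-truth supervision and has little effect on training with clean data, striking a good balance between robustness and clean-input performance.

Link: https://arxiv.org/abs/2602.19323
Authors: Yiran Qiao, Yiren Lu, Yunlai Zhou, Rui Yang, Linlin Hou, Yu Yin, Jing Ma
Affiliations: Case Western Reserve University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: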

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for real-time and high-fidelity 3D reconstruction from posed images. However, recent studies reveal its vulnerability to adversarial corruptions in input views, where imperceptible yet consistent perturbations can drastically degrade rendering quality, increase training and rendering time, and inflate memory usage, even leading to server denial-of-service. In our work, to mitigate this issue, we begin by analyzing the distinct behaviors of adversarial perturbations in the low- and high-frequency components of input images using wavelet transforms. Based on this observation, we design a simple yet effective frequency-aware defense strategy that reconstructs training views by filtering high-frequency noise while preserving low-frequency content. This approach effectively suppresses adversarial artifacts while maintaining the authenticity of the original scene. Notably, it does not significantly impair training on clean data, achieving a desirable trade-off between robustness and performance on clean inputs. Through extensive experiments under a wide range of attack intensities on multiple benchmarks, we demonstrate that our method substantially enhances the robustness of 3DGS without access to clean ground-truth supervision. By highlighting and addressing the overlooked vulnerabilities of 3D Gaussian Splatting, our work paves the way for more robust and secure 3D reconstructions.
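
The split into low- and high-frequency components can be illustrated with a one-level Haar wavelet transform on a 1D signal. This is a sketch of the general idea only: the abstract does not say which wavelet or filtering rule the authors use, and the names, the even-length input, and the `keep` factor are assumptions.

```python
def haar_forward(signal):
    """One-level Haar transform of an even-length list: (low, high) bands."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]    # local averages
    high = [(a - b) / 2 for a, b in pairs]   # local differences
    return low, high


def haar_inverse(low, high):
    """Exact inverse of haar_forward."""
    signal = []
    for l, h in zip(low, high):
        signal.extend([l + h, l - h])
    return signal


def suppress_high_frequency(signal, keep=0.0):
    """Scale the high band by `keep` (0 removes it) and reconstruct."""
    low, high = haar_forward(signal)
    return haar_inverse(low, [keep * h for h in high])
```

Setting `keep` to 0 replaces each pair of samples with their average, which removes fine oscillations while leaving the coarse content intact; an image version would apply the same step along rows and columns.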

[CV-108] US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound

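Quick read: This paper addresses the challenges that the inherently noisy acquisition process of ultrasound (US) imaging poses for representation learning, in particular how low signal-to-noise ratio and stochastic speckle patterns limit self-supervised methods that rely on pixel-level reconstruction objectives. The key to the solution is the US-JEPA framework, which adopts the Static-teacher Asymmetric Latent Training (SALT) objective: a frozen, domain-specific teacher provides stable latent targets, which decouples student and teacher optimization and pushes the student to expand on the teacher's semantic priors, yielding more robust and efficient ultrasound representation learning.

Link: https://arxiv.org/abs/2602.19322
Authors: Ashwath Radhachandran, Vedrana Ivezić, Shreeram Athreya, Ronit Anilkumar, Corey W. Arnold, William Speier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: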

Abstract:Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process. The low signal-to-noise ratio and stochastic speckle patterns hinder standard self-supervised learning methods relying on a pixel-level reconstruction objective. Joint-Embedding Predictive Architectures (JEPAs) address this drawback by predicting masked latent representations rather than raw pixels. However, standard approaches depend on hyperparameter-brittle and computationally expensive online teachers updated via exponential moving average. We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective. By using a frozen, domain-specific teacher to provide stable latent targets, US-JEPA decouples student-teacher optimization and pushes the student to expand upon the semantic priors of the teacher. In addition, we provide the first rigorous comparison of all publicly available state-of-the-art ultrasound foundation models on UltraBench, a public dataset benchmark spanning multiple organs and pathological conditions. Under linear probing for diverse classification tasks, US-JEPA achieves performance competitive with or superior to domain-specific and universal vision foundation model baselines. Our results demonstrate that masked latent prediction provides a stable and efficient path toward robust ultrasound representations.

[CV-109] Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition ICLR2026

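Quick read: This paper addresses two problems in training the Unified Speech Recognition (USR) framework: the high computational cost of autoregressive pseudo-labelling, and the tendency of decoupled supervision of the CTC and attention branches to cause self-reinforcing errors, which makes the model fragile under distribution shift (longer sequences, noise, or unseen domains). The key to the solution is CTC-driven teacher forcing: greedily decoded CTC pseudo-labels are fed directly into the decoder to generate attention targets, so knowledge transfer takes a single forward pass. Because the CTC and attention pseudo-labels have the same length, the decoder can predict both simultaneously, combining the robustness of CTC with the expressiveness of attention without costly beam search. A mixed sampling strategy further mitigates the exposure bias that comes from relying solely on CTC inputs. The resulting method, USR 2.0, substantially shortens training time, improves out-of-distribution robustness, and surpasses USR and modality-specific self-supervised baselines on LRS3, LRS2, and WildVSR.

Link: https://arxiv.org/abs/2602.19316
Authors: Alexandros Haliassos, Rodrigo Mira, Stavros Petridis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: ICLR 2026. Code: this https URL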

Abstract:Unified Speech Recognition (USR) has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition, achieving state-of-the-art results on in-distribution benchmarks. However, its reliance on autoregressive pseudo-labelling makes training expensive, while its decoupled supervision of CTC and attention branches increases susceptibility to self-reinforcing errors, particularly under distribution shifts involving longer sequences, noise, or unseen domains. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets in a single forward pass. Although these can be globally incoherent, in the pseudo-labelling setting they enable efficient and effective knowledge transfer. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously, benefiting from the robustness of CTC and the expressiveness of attention without costly beam search. We further propose mixed sampling to mitigate the exposure bias of the decoder relying solely on CTC inputs. The resulting method, USR 2.0, halves training time, improves robustness to out-of-distribution inputs, and achieves state-of-the-art results on LRS3, LRS2, and WildVSR, surpassing USR and modality-specific self-supervised baselines.
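
Greedy CTC decoding, the step that produces the pseudo-labels fed to the decoder, follows a standard rule: take the best class per frame, collapse consecutive repeats, then drop blanks. A minimal sketch is below; using index 0 for the blank symbol is an assumption, and the teacher-forcing step itself is not shown.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, blank=0):
    """Decode per-frame class scores into a label sequence.

    frame_scores: one list of class scores per time step.
    """
    # Best class index for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # Collapse runs of the same index, then remove blank symbols.
    return [label for label, _ in groupby(best_path) if label != blank]
```

A blank between two identical labels keeps them separate, so a best path of 1, 1, blank, 1, 2, 2, blank decodes to 1, 1, 2. Because this needs only one argmax per frame, it avoids the cost of autoregressive or beam-search decoding.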

[CV-110] IPv2: An Improved Image Purification Strategy for Real-World Ultra-Low-Dose Lung CT Denoising

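[Quick Read]: This paper addresses two key limitations of the existing image purification strategy for denoising ultra-low-dose CT images: it suppresses noise only in the chest wall and bone regions while leaving the background untreated, and it lacks an effective mechanism for denoising the lung parenchyma. To overcome these limitations, the authors propose an improved image purification strategy, IPv2, whose core innovation is three modules: Remove Background, Add Noise, and Remove Noise. These modules give the model joint denoising capability over the background and the lung parenchyma when the training data are constructed, and refined label construction makes evaluation at the testing stage more reasonable. Experiments on a real-world patient lung CT dataset acquired at 2% radiation dose show that IPv2 consistently improves background suppression and lung parenchyma restoration across multiple mainstream denoising models.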

Link: https://arxiv.org/abs/2602.19314
Authors: Guoliang Gong, Man Yu
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: none

Abstract:The image purification strategy constructs an intermediate distribution with aligned anatomical structures, which effectively corrects the spatial misalignment between real-world ultra-low-dose CT and normal-dose CT images and significantly enhances the structural preservation ability of denoising models. However, this strategy exhibits two inherent limitations. First, it suppresses noise only in the chest wall and bone regions while leaving the image background untreated. Second, it lacks a dedicated mechanism for denoising the lung parenchyma. To address these issues, we systematically redesign the original image purification strategy and propose an improved version termed IPv2. The proposed strategy introduces three core modules, namely Remove Background, Add noise, and Remove noise. These modules endow the model with denoising capability in both background and lung tissue regions during training data construction and provide a more reasonable evaluation protocol through refined label construction at the testing stage. Extensive experiments on our previously established real-world patient lung CT dataset acquired at 2% radiation dose demonstrate that IPv2 consistently improves background suppression and lung parenchyma restoration across multiple mainstream denoising models. The code is publicly available at this https URL.

[CV-111] WildOS: Open-Vocabulary Object Search in the Wild

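[Quick Read]: This paper addresses the challenge of long-range autonomous robot navigation in complex, unstructured outdoor environments: how to explore both safely and in a semantically sensible way without prior maps and with limited depth sensing. Exploration that relies only on geometric frontiers often fails to identify traversable regions or target objects, while purely vision-based methods struggle to guarantee geometric safety. The key to the solution is WildOS, a system that combines safe geometric exploration with semantic visual reasoning built on a foundation model. It maintains spatial memory in a sparse navigation graph and scores the graph's frontier nodes with the ExploRFM vision module, which simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation. A particle-filter-based coarse localization method additionally estimates the position of distant open-vocabulary targets, which makes planning toward far-away goals more effective. Experiments show that WildOS significantly outperforms purely geometric and purely vision-based baselines in both efficiency and autonomy.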

Link: https://arxiv.org/abs/2602.19308
Authors: Hardik Shah, Erica Tevere, Deegan Atha, Marcel Kaufmann, Shehryar Khattak, Manthan Patel, Marco Hutter, Jonas Frey, Patrick Spieler
Affiliation: unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 16 figures, 2 tables

Abstract: Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and with limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient; the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning. WildOS builds a sparse navigation graph to maintain spatial memory, while utilizing a foundation-model-based vision module, ExploRFM, to score frontier nodes of the graph. ExploRFM simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation tasks. The resulting vision-scored graph enables the robot to explore semantically meaningful directions while ensuring geometric safety. Furthermore, we introduce a particle-filter-based method for coarse localization of the open-vocabulary target query that estimates candidate goal positions beyond the robot’s immediate depth horizon, enabling effective planning toward distant goals. Extensive closed-loop field experiments across diverse off-road and urban terrains demonstrate that WildOS enables robust navigation, significantly outperforming purely geometric and purely vision-based baselines in both efficiency and autonomy. Our results highlight the potential of vision foundation models to drive open-world robotic behaviors that are both semantically informed and geometrically grounded. Project Page: this https URL

[CV-112] MRI Contrast Enhancement Kinetics World Model CVPR2026

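[Quick Read]: This paper addresses content distortion and temporal discontinuity in modelling clinical MRI contrast enhancement kinetics, both of which stem from sparse sampling and low temporal resolution. When a generative model is trained on a limited, discrete acquisition sequence, the missing time points lead it to overfit to irrelevant features, and the lack of continuous temporal supervision prevents it from learning smooth kinetics. The key to the solution is the MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). First, Latent Alignment Learning (LAL) uses the constraint that patient-level spatial structures stay consistent during enhancement, building a patient-specific template that content is aligned to. Second, Latent Difference Learning (LDL) relies on the assumption that the kinetics are smooth: it extends unobserved intervals by interpolation and imposes a smoothness constraint in the latent space. Together these yield more realistic content reconstruction and continuous temporal evolution.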

Link: https://arxiv.org/abs/2602.19285
Authors: Jindi Kong, Yuting He, Cong Xia, Rongjun Ge, Shuo Li
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract: Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and the fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in the human body enables continuous contrast-free dynamics. However, the low temporal resolution in MRI acquisition restricts the training of world models, leading to a sparsely sampled dataset. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence of data at missing time points, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time, causing temporal discontinuities. For the first time, we propose MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by the spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL) that constructs a patient-specific template to constrain contents to align with this template. For (b), guided by the temporal law that the kinetics follow a consistent smooth trend, we propose Latent Difference Learning (LDL) which extends the unobserved intervals by interpolation and constrains smooth variations in the latent space among interpolated sequences. Extensive experiments on two datasets show our MRI CEKWorld achieves more realistic content and kinetics. Code will be available at this https URL.

[CV-113] A Two-Stage Detection-Tracking Framework for Stable Apple Quality Inspection in Dense Conveyor-Belt Environments

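[Quick Read]: This paper addresses the lack of temporal stability in industrial fruit inspection systems under dense multi-object interactions and continuous motion. Most existing methods perform detection or classification at the image level and cannot guarantee consistent predictions across a video stream. The key to the solution is a two-stage detection-tracking framework: an orchard-trained YOLOv8 model localizes apples, and ByteTrack multi-object tracking then maintains persistent object identities. On top of this, a ResNet18 defect classifier is applied to the cropped apple regions, and a track-level aggregation strategy enforces temporal consistency and reduces prediction oscillation across frames. By defining video-level industrial metrics such as track-level defect ratio and temporal consistency, the work shows improved stability over frame-wise inference in conveyor-belt environments.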

Link: https://arxiv.org/abs/2602.19278
Authors: Keonvin Park, Aditya Pal, Jin Hong Mok
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Industrial fruit inspection systems must operate reliably under dense multi-object interactions and continuous motion, yet most existing works evaluate detection or classification at the image level without ensuring temporal stability in video streams. We present a two-stage detection-tracking framework for stable multi-apple quality inspection in conveyor-belt environments. An orchard-trained YOLOv8 model performs apple localization, followed by ByteTrack multi-object tracking to maintain persistent identities. A ResNet18 defect classifier, fine-tuned on a healthy-defective fruit dataset, is applied to cropped apple regions. Track-level aggregation is introduced to enforce temporal consistency and reduce prediction oscillation across frames. We define video-level industrial metrics such as track-level defect ratio and temporal consistency to evaluate system robustness under realistic processing conditions. Results demonstrate improved stability compared to frame-wise inference, suggesting that integrating tracking is essential for practical automated fruit grading systems.
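
Track-level aggregation can be illustrated with a small sketch. The abstract does not state the aggregation rule or the exact metric definitions, so the majority vote and the ratio below are assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_tracks(frame_predictions):
    """Majority-vote one label per track from (track_id, label) frame predictions."""
    votes = defaultdict(Counter)
    for track_id, label in frame_predictions:
        votes[track_id][label] += 1
    # most_common(1) returns the label with the highest vote count for the track.
    return {track_id: counts.most_common(1)[0][0] for track_id, counts in votes.items()}


def track_level_defect_ratio(track_labels, defect_label="defective"):
    """Fraction of tracks whose aggregated label is the defect label (assumed definition)."""
    if not track_labels:
        return 0.0
    defective = sum(1 for label in track_labels.values() if label == defect_label)
    return defective / len(track_labels)


# Hypothetical per-frame classifier outputs for two tracked apples.
predictions = [
    (1, "healthy"), (1, "defective"), (1, "healthy"),
    (2, "defective"), (2, "defective"),
]
labels = aggregate_tracks(predictions)
ratio = track_level_defect_ratio(labels)
```

Apple 1 flips label for one frame, but the vote keeps its track-level label stable, which is the oscillation-damping effect the abstract describes.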

[CV-114] DD-CAM: Minimal Sufficient Explanations for Vision Models Using Delta Debugging

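[Quick Read]: This paper addresses the problem that saliency maps produced by existing explanation methods for vision models are cluttered and lack decision fidelity; the goal is to identify the smallest subset of representational units that suffices to preserve the model's prediction. The key to the solution is DD-CAM, a gradient-free framework that adapts the delta debugging strategy from software debugging and configures its search according to unit interactions in the classifier head: it tests individual units when units do not interact and tests unit combinations when interactions exist. This efficiently finds a 1-minimal sufficient subset, meaning that the subset preserves the prediction and removing any single unit from it changes the prediction. The resulting saliency maps highlight only the most essential features, improving the faithfulness and localization accuracy of the explanations.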

Link: https://arxiv.org/abs/2602.19274
Authors: Krishna Khadka, Yu Lei, Raghu N. Kacker, D. Richard Kuhn
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: none

Abstract:We introduce a gradient-free framework for identifying minimal, sufficient, and decision-preserving explanations in vision models by isolating the smallest subset of representational units whose joint activation preserves predictions. Unlike existing approaches that aggregate all units, often leading to cluttered saliency maps, our approach, DD-CAM, identifies a 1-minimal subset whose joint activation suffices to preserve the prediction (i.e., removing any unit from the subset alters the prediction). To efficiently isolate minimal sufficient subsets, we adapt delta debugging, a systematic reduction strategy from software debugging, and configure its search strategy based on unit interactions in the classifier head: testing individual units for models with non-interacting units and testing unit combinations for models in which unit interactions exist. We then generate minimal, prediction-preserving saliency maps that highlight only the most essential features. Our experimental evaluation demonstrates that our approach can produce more faithful explanations and achieve higher localization accuracy than the state-of-the-art CAM-based approaches.
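
The notion of a 1-minimal sufficient subset can be made concrete with a generic delta-debugging reduction. The sketch below is a plain ddmin-style loop, not the DD-CAM implementation: `preserves` is a hypothetical callback standing in for "the model's prediction is unchanged when only these units stay active", and units are assumed to be hashable ids.

```python
def one_minimal_subset(units, preserves):
    """Shrink `units` to a 1-minimal subset for which preserves(subset) is True."""
    units = list(units)
    if not preserves(units):
        raise ValueError("the full set of units must preserve the prediction")
    n = 2  # current number of chunks
    while len(units) >= 2:
        chunk = max(1, len(units) // n)
        subsets = [units[i:i + chunk] for i in range(0, len(units), chunk)]
        reduced = False
        # First try to keep a single chunk on its own.
        for subset in subsets:
            if preserves(subset):
                units, n, reduced = subset, 2, True
                break
        # Otherwise try to drop a single chunk.
        if not reduced:
            for subset in subsets:
                dropped = set(subset)
                complement = [u for u in units if u not in dropped]
                if preserves(complement):
                    units, n, reduced = complement, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(units):
                break  # every single-unit removal failed, so the set is 1-minimal
            n = min(len(units), 2 * n)
    return units


# Toy predicate: the prediction survives only if units 2 and 5 are both kept.
required = {2, 5}
kept = one_minimal_subset(range(8), lambda subset: required <= set(subset))
```

In the toy case the loop discards the six irrelevant units; a real `preserves` would run a forward pass with the other units masked out.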

[CV-115] CORVET: A CORDIC-Powered Resource-Frugal Mixed-Precision Vector Processing Engine for High-Throughput AIoT applications

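[Quick Read]: This paper addresses the tension between compute performance and resource consumption in edge AI applications, in particular how to accelerate vector operations energy-efficiently on limited hardware. The key solution is a runtime-adaptive, performance-enhanced vector engine built around a low-resource, iterative CORDIC (Coordinate Rotation Digital Computer) multiply-accumulate (MAC) unit that switches dynamically between approximate and accurate modes to trade latency against accuracy. Vectorised, time-multiplexed execution and flexible precision scaling (4/8/16-bit) deliver up to 4x higher throughput within the same hardware resources, and a time-multiplexed multi-activation-function (multi-AF) block together with a lightweight pooling and normalisation unit supports high MAC density. In the ASIC implementation, a 256-PE configuration reaches a compute density of 4.83 TOPS/mm² and an energy efficiency of 11.67 TOPS/W, exceeding previous state-of-the-art designs.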

Link: https://arxiv.org/abs/2602.19268
Authors: Sonu Kumar, Mohd Faisal Khan, Mukul Lokhande, Santosh Kumar Vishvakarma
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Comments: none

Abstract: This brief presents a runtime-adaptive, performance-enhanced vector engine featuring a low-resource, iterative CORDIC-based MAC unit for edge AI acceleration. The proposed design enables dynamic reconfiguration between approximate and accurate modes, exploiting the latency-accuracy trade-off for a wide range of workloads. Its resource-efficient approach further enables up to 4x throughput improvement within the same hardware resources by leveraging vectorised, time-multiplexed execution and flexible precision scaling. With a time-multiplexed multi-AF block and a lightweight pooling and normalisation unit, the proposed vector engine supports flexible precision (4/8/16-bit) and high MAC density. The ASIC implementation results show that each MAC stage can save up to 33% of time and 21% of power, with a 256-PE configuration that achieves higher compute density (4.83 TOPS/mm²) and energy efficiency (11.67 TOPS/W) than previous state-of-the-art work. A detailed hardware-software co-design methodology for object detection and classification tasks on Pynq-Z2 is discussed to assess the proposed architecture, demonstrating a scalable, energy-efficient solution for edge AI applications.
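
The latency-accuracy trade-off of an iterative CORDIC multiplier can be illustrated in software. The sketch below uses textbook linear-mode CORDIC in floating point; the paper's actual datapath, fixed-point widths, and mode-switching logic are not described in the abstract, so the iteration counts and function names here are assumptions.

```python
def cordic_multiply(x, z, iterations):
    """Approximate x * z with linear-mode CORDIC; requires |z| <= 2."""
    if abs(z) > 2.0:
        raise ValueError("linear-mode CORDIC converges only for |z| <= 2")
    y = 0.0
    for i in range(iterations):
        step = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        direction = 1.0 if z >= 0 else -1.0
        y += direction * x * step  # shift-and-add towards the product
        z -= direction * step  # drive the residual multiplier towards zero
    return y


def cordic_mac(xs, zs, iterations):
    """Multiply-accumulate: sum of CORDIC products over two equal-length vectors."""
    acc = 0.0
    for x, z in zip(xs, zs):
        acc += cordic_multiply(x, z, iterations)
    return acc


# Fewer iterations mimic an "approximate" mode, more iterations an "accurate" one.
approximate = cordic_multiply(3.0, 0.75, iterations=4)
accurate = cordic_multiply(3.0, 0.75, iterations=16)
```

After `n` iterations the residual multiplier is bounded by `2 ** -(n - 1)`, so the product error is at most `abs(x) * 2 ** -(n - 1)`; each extra iteration costs one more cycle and halves that bound.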

[CV-116] RegionRoute: Regional Style Transfer with Diffusion Model

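[Quick Read]: This paper addresses the difficulty of precise spatial control in diffusion-based style transfer. Existing methods treat style as a global feature and lack spatial grounding of style representations, so they cannot restrict a style to specific objects or regions, and their reliance on handcrafted masks or multi-stage post-processing introduces boundary artifacts and limits generalization. The key to the solution is an attention-supervised diffusion framework that explicitly teaches the model where to apply a style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss based on binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design additionally makes multi-style adaptation efficient and scalable.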

Link: https://arxiv.org/abs/2602.19254
Authors: Bowen Chen, Jake Zuena, Alan C. Bovik, Divya Kothandaraman
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
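
The two objectives can be sketched on flattened attention maps. The abstract only names a KL-based Focus loss and a BCE-based Cover loss; the KL direction, the normalisation, and the clamping below are assumptions, so this is an illustration of the idea and not the paper's formulation.

```python
import math


def focus_loss(attention, mask, eps=1e-8):
    """KL(mask distribution || attention distribution) over flattened pixels."""
    attn_total = sum(attention) + eps
    mask_total = sum(mask) + eps
    loss = 0.0
    for a, m in zip(attention, mask):
        p = m / mask_total  # target distribution: uniform inside the mask
        q = a / attn_total  # attention normalised to sum to one
        if p > 0:
            loss += p * math.log(p / (q + eps))
    return loss


def cover_loss(attention, mask, eps=1e-8):
    """Mean binary cross-entropy between attention scores in [0, 1] and the mask."""
    total = 0.0
    for a, m in zip(attention, mask):
        a = min(max(a, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total -= m * math.log(a) + (1 - m) * math.log(1.0 - a)
    return total / len(mask)


# Hypothetical 4-pixel example: the object occupies the first two pixels.
mask = [1, 1, 0, 0]
on_object = [0.9, 0.9, 0.1, 0.1]
off_object = [0.1, 0.1, 0.9, 0.9]
```

The Focus term penalises attention mass that falls outside the mask, while the Cover term penalises masked pixels that receive low attention, which is why the abstract calls them complementary.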

[CV-117] No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection CVPR2026

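[Quick Read]: This paper addresses the performance bottleneck of video anomaly detection (VAD) in open-world scenarios. The core challenges are the rarity and spatio-temporal scarcity of anomalous events and the inadequate understanding that existing methods have of context-dependent anomaly semantics. The key to the solution is LAVIDA, an end-to-end zero-shot video anomaly detection framework. First, an Anomaly Exposure Sampler transforms segmented objects into pseudo-anomalies to improve adaptability to unseen anomaly categories. Second, a Multimodal Large Language Model (MLLM) is integrated to strengthen semantic comprehension. Third, a token compression method based on reverse attention handles the spatio-temporal scarcity of anomalous patterns and lowers computational cost. Training uses only pseudo-anomalies and no real VAD data, and experiments on four benchmark datasets show state-of-the-art frame-level and pixel-level detection in the zero-shot setting.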

Link: https://arxiv.org/abs/2602.19248
Authors: Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2026

Abstract: The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods underperform in open-world scenarios. Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo-anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at this https URL.

[CV-118] Knowledge-aware Visual Question Generation for Remote Sensing Images

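[Quick Read]: This paper addresses the problem that automatically generated questions about remote sensing images tend to be simplistic and template-based, which hinders the real-world deployment of visual question answering and visual dialogue systems. The core solution is a knowledge-aware remote sensing visual question generation model (KRSVQG). It takes as input a knowledge triplet from external knowledge sources that relates to the image content, which improves the quality and contextual understanding of the generated questions, and it uses image captioning as an intermediary representation to strengthen the image grounding of the questions.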

Link: https://arxiv.org/abs/2602.19224
Authors: Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing image retrieval. However, automatically generated image-based questions tend to be simplistic and template-based, which hinders the real deployment of question answering or visual dialogue systems. To enrich and diversify the questions, we propose a knowledge-aware remote sensing visual question generation model, KRSVQG, that incorporates external knowledge related to the image content to improve the quality and contextual understanding of the generated questions. The model takes an image and a related knowledge triplet from external knowledge sources as inputs and leverages image captioning as an intermediary representation to enhance the image grounding of the generated questions. To assess the performance of KRSVQG, we utilized two datasets that we manually annotated: NWPU-300 and TextRS-300. Results on these two datasets demonstrate that KRSVQG outperforms existing methods and leads to knowledge-enriched questions, grounded in both image and domain knowledge.

[CV-119] Controlled Face Manipulation and Synthesis for Data Augmentation

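[Quick Read]: This paper addresses feature entanglement in facial expression analysis, which is caused by scarce annotated data and by the co-activation of Action Units (AUs), with the aim of improving the accuracy of AU detection models and making their predictions more disentangled. The key to the solution is controllable editing in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). Lightweight linear models implement two core operations: (i) dependency-aware conditioning, which accounts for AU co-activation and reduces entanglement, and (ii) orthogonal projection, which removes nuisance attribute directions (e.g., glasses). An expression neutralization step additionally enables absolute AU edits. Training data augmented with the resulting diverse and balanced AU samples improves AU detection, with gains comparable, in the authors' learning-curve analysis, to what would otherwise require substantially more labeled data.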

Link: https://arxiv.org/abs/2602.19219
Authors: Joris Kirchner, Amogh Gudi, Marian Bittner, Chirag Raman
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: none

Abstract: Deep learning vision models excel with abundant supervision, but many applications face label scarcity and class imbalance. Controllable image editing can augment scarce labeled data, yet edits often introduce artifacts and entangle non-target attributes. We study this in facial expression analysis, targeting Action Unit (AU) manipulation where annotation is costly and AU co-activation drives entanglement. We present a facial manipulation method that operates in the semantic latent space of a pre-trained face generator (Diffusion Autoencoder). Using lightweight linear models, we reduce entanglement of semantic features via (i) dependency-aware conditioning that accounts for AU co-activation, and (ii) orthogonal projection that removes nuisance attribute directions (e.g., glasses), together with an expression neutralization step to enable absolute AU edits. We use these edits to balance AU occurrence by editing labeled faces and to diversify identities/demographics via controlled synthesis. Augmenting AU detector training with the generated data improves accuracy and yields more disentangled predictions with fewer co-activation shortcuts, outperforming alternative data-efficient training strategies and suggesting improvements similar to what would require substantially more labeled data in our learning-curve analysis. Compared to prior methods, our edits are stronger, produce fewer artifacts, and preserve identity better.

[CV-120] Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing

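[Quick Read]: This paper addresses the limited variety and semantic depth of questions produced by current visual question generation (VQG) for remote sensing images: generated questions are mostly template-based, shallow descriptions that cannot support complex visual dialogue or semantic retrieval in real-world settings. The key to the solution is a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). It incorporates related knowledge triplets from external knowledge sources to broaden the question content, and it uses image captioning as an intermediary representation to keep the questions grounded in the corresponding images. A vision-language pre-training and fine-tuning strategy further helps the model adapt to low-data regimes, so that it generates diverse, high-quality questions grounded in both image content and domain commonsense.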

Link: https://arxiv.org/abs/2602.19217
Authors: Siran Li, Li Mi, Javiera Castillo-Navarro, Devis Tuia
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:With the rapid development of remote sensing image archives, asking questions about images has become an effective way of gathering specific information or performing semantic image retrieval. However, current automatically generated questions tend to be simplistic and template-based, which hinders the deployment of question answering or visual dialogue systems for real-world applications. To enrich and diversify the questions with both image content and commonsense knowledge, we propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG). The proposed model incorporates related knowledge triplets from external knowledge sources to broaden the question content, while employing image captioning as an intermediary representation to ground questions to the corresponding images. Moreover, KRSVQG utilizes a vision-language pre-training and fine-tuning strategy, enabling the model’s adaptation to low data regimes. To evaluate the proposed KRSVQG model, we construct two knowledge-aware remote sensing visual question generation datasets: the NWPU-300 dataset and the TextRS-300 dataset. Evaluations, including metrics and human assessment, demonstrate that KRSVQG outperforms existing methods and leads to rich questions, grounded in both image and domain knowledge. As a key practice in vision-language research, knowledge-aware visual question generation advances the understanding of image content beyond pixels, facilitating the development of knowledge-enriched vision-language systems with vision-grounded human commonsense.

[CV-121] SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

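[Quick Read]: This paper addresses two bottlenecks in transferring general segmentation models such as SAM to medical imaging. First, they lack adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios. Second, existing medical adaptation methods usually fine-tune on large, heterogeneous datasets without sample selection, which introduces noisy supervision, raises cost, and causes negative transfer. The key to the solution is the SegMoTE framework, with two core innovations: (1) it preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while adding only a small number of learnable parameters for dynamic adaptation across modalities and tasks; (2) a progressive prompt tokenization mechanism enables fully automatic segmentation and significantly reduces dependence on annotations. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, the method achieves state-of-the-art performance across diverse imaging modalities and anatomical tasks, and the authors describe it as the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost.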

Link: https://arxiv.org/abs/2602.19213
Authors: Yujie Lu, Jingwen Li, Sibo Ju, Yanzhou Su, he yao, Yisong Liu, Min Zhu, Junlong Cheng
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract: Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM’s original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.

[CV-122] GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

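[Quick Read]: This paper addresses zero-shot 3D anomaly detection, where detection performance is limited by the absence of target training data, a setting that matters especially when samples are scarce or data privacy is a concern. Existing methods adapt CLIP by projecting point clouds into 2D images, but the projection loses geometric detail and a single 2D modality gives an incomplete visual understanding, which makes diverse anomaly types hard to detect. The key to the solution is the GS-CLIP framework with a two-stage learning process. In stage 1, text prompts embedded with 3D geometric priors are generated dynamically; they contain global shape context and local defect information distilled by the Geometric Defect Distillation Module (GDDM). In stage 2, a Synergistic View Representation Learning architecture processes rendered images and depth images in parallel, and a Synergistic Refinement Module (SRM) fuses the two feature streams to exploit their complementary strengths, improving the detection of geometric anomalies.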

Link: https://arxiv.org/abs/2602.19206
Authors: Zehao Deng, An Liu, Yan Wang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract: Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior detection performance. Code is available at this https URL.

[CV-123] UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

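[Quick Read]: This paper addresses the severe loss of spatial information and static texture detail in video frame reconstruction from event cameras, which record only relative intensity changes rather than absolute intensity. The key to the solution is to exploit the generative prior of a pre-trained video diffusion model: event data are applied directly as a condition to guide video synthesis, and an event-based inter-frame residual guidance, derived from the physical correlation between the event stream and video frames, further improves reconstruction accuracy. The method also extends in a zero-shot manner to video frame interpolation and prediction, yielding a unified event-to-frame reconstruction framework.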

Link: https://arxiv.org/abs/2602.19202
Authors: Gang Xu, Zhiyu Zhu, Junhui Hou
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at this https URL.
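
The physical link between events and frames can be illustrated with the idealised event-generation model, in which each event marks a fixed step in log intensity at one pixel. The sketch below is that textbook model only; the paper's actual residual guidance inside the diffusion sampler is not described in the abstract, and the contrast threshold value and function names are assumptions.

```python
import math


def log_intensity_residual(events, height, width, contrast_threshold=0.2):
    """Accumulate (x, y, polarity) events into a per-pixel log-intensity change."""
    residual = [[0.0] * width for _ in range(height)]
    for x, y, polarity in events:
        # Each event is assumed to signal a change of one threshold step; polarity is +1 or -1.
        residual[y][x] += contrast_threshold * polarity
    return residual


def apply_residual(frame, residual):
    """Predict the next frame as frame * exp(residual), pixel by pixel."""
    return [
        [pixel * math.exp(delta) for pixel, delta in zip(row, deltas)]
        for row, deltas in zip(frame, residual)
    ]


# Hypothetical 1x2 image: two positive events at pixel (0, 0), one negative at (1, 0).
events = [(0, 0, 1), (0, 0, 1), (1, 0, -1)]
residual = log_intensity_residual(events, height=1, width=2, contrast_threshold=0.5)
next_frame = apply_residual([[1.0, 2.0]], residual)
```

Pixels with no events get a zero residual, which shows why static texture cannot be recovered from events alone and must come from a prior.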

[CV-124] Prompt Tuning for CLIP on the Pretrained Manifold

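[Quick Read]: This paper addresses the drift of pretrained representations caused by prompt tuning under limited supervision: prompt tuning pushes downstream features away from the pretrained manifold, which degrades generalization. The key to the solution is the ManiPT framework, which introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. It also introduces a structural bias that enforces incremental corrections, guiding adaptation along transferable directions to reduce reliance on shortcut learning; from a theoretical perspective, this alleviates overfitting tendencies under limited data.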

Link: https://arxiv.org/abs/2602.19198
Authors: Xi Yang, Yuanrong Xu, Weigang Zhang, Guangming Lu, David Zhang, Jie Wen
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Prompt tuning introduces learnable prompt vectors that adapt pretrained vision-language models to downstream tasks in a parameter-efficient manner. However, under limited supervision, prompt tuning alters pretrained representations and drives downstream features away from the pretrained manifold toward directions that are unfavorable for transfer. This drift degrades generalization. To address this limitation, we propose ManiPT, a framework that performs prompt tuning on the pretrained manifold. ManiPT introduces cosine consistency constraints in both the text and image modalities to confine the learned representations within the pretrained geometric neighborhood. Furthermore, we introduce a structural bias that enforces incremental corrections, guiding the adaptation along transferable directions to mitigate reliance on shortcut learning. From a theoretical perspective, ManiPT alleviates overfitting tendencies under limited data. Our experiments cover four downstream settings: unseen-class generalization, few-shot classification, cross-dataset transfer, and domain generalization. Across these settings, ManiPT achieves higher average performance than baseline methods. Notably, ManiPT provides an explicit perspective on how prompt tuning overfits under limited supervision.
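
A cosine consistency constraint can be sketched as a penalty on the angle between a prompt-tuned feature and its frozen pretrained counterpart. The abstract does not give the exact loss, so the `1 - cos` form and the function name below are assumptions for illustration.

```python
import math


def cosine_consistency_loss(tuned, pretrained, eps=1e-12):
    """Return 1 - cos(tuned, pretrained); zero when the two directions coincide."""
    dot = sum(t * p for t, p in zip(tuned, pretrained))
    norm_t = math.sqrt(sum(t * t for t in tuned))
    norm_p = math.sqrt(sum(p * p for p in pretrained))
    return 1.0 - dot / (norm_t * norm_p + eps)


# Hypothetical 2-D features: same direction, orthogonal, and opposite.
aligned = cosine_consistency_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_consistency_loss([1.0, 0.0], [0.0, 3.0])
opposite = cosine_consistency_loss([1.0, 0.0], [-1.0, 0.0])
```

The penalty ignores feature magnitude and grows only as the tuned feature rotates away from the pretrained one, which matches the idea of staying within a pretrained geometric neighborhood.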

[CV-125] FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

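[Quick Read]: This paper addresses the insufficient capability for intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) imagery. In particular, Visual Language Models (VLMs) perform poorly when applied directly to SAR because of the complex imaging mechanism, the sensitivity to scattering features, and the scarcity of high-quality image-text corpora. The key to the solution is twofold. The authors construct the first SAR Image-Text-AlphaEarth feature triplet dataset and propose FUSAR-GPT, which introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the visual backbone via 'spatiotemporal anchors', dynamically compensating for the sparse representation of targets in SAR images. They also design a two-stage Supervised Fine-Tuning (SFT) strategy that decouples knowledge injection from task execution. As a result, the model outperforms mainstream baselines by more than 12% on several typical remote sensing vision-language benchmarks.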

Link: https://arxiv.org/abs/2602.19190
Authors: Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a ‘world knowledge’ prior and embeds multi-source remote-sensing temporal features into the model’s visual backbone via ‘spatiotemporal anchors’, enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.

[CV-126] PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

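[Quick read]: This paper addresses the lack of precise positional reasoning in Multi-modal Large Language Models (MLLMs) on visual tasks, especially tasks that demand coordinate accuracy such as text spotting and text grounding. Although MLLMs have strong contextual understanding, they rely on a Large Language Model (LLM), designed primarily for linguistic processing, as the decoder, which makes precise spatial localization of visual elements difficult. The authors therefore propose PositionOCR, a parameter-efficient hybrid architecture whose core idea is to seamlessly combine the positional prediction strengths of a text spotting specialist with the semantic reasoning of an LLM. With only 131M trainable parameters, it outperforms traditional MLLMs on multi-modal processing, particularly on tasks that require precise spatial localization.

Link: https://arxiv.org/abs/2602.19188
Authors: Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract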

Abstract:In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model’s positional strengths with an LLM’s contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.

[CV-127] VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery CVPR2026

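[Quick read]: This paper addresses the inherent ambiguity of Human Mesh Recovery (HMR) from a single RGB image: multiple 3D poses can correspond to the same 2D observation, so existing diffusion-based methods produce predictions that are physically implausible or inconsistent with the input image, and they degrade further under occlusion or in complex scenes. The key to the solution is a dual-memory augmented HMR critique agent with self-reflection, which generates context-aware quality scores that capture fine-grained cues about the predicted mesh's 3D human motion structure, physical feasibility, and alignment with the input image. These scores are used to build a group-wise preference dataset, and a group preference alignment framework then fine-tunes the diffusion model on it, injecting rich preference signals that guide the model toward more physically plausible and image-consistent 3D human meshes.

Link: https://arxiv.org/abs/2602.19180
Authors: Wenhao Shen, Hao Wang, Wanqi Yin, Fayao Liu, Xulei Yang, Chao Liang, Zhongang Cai, Guosheng Lin
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Click to view abstract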

Abstract:Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent diffusion-based methods tackle this by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Leveraging this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches.

[CV-128] EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease CVPR2026

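[Quick read]: This paper addresses the lack of interpretability and clinical consistency of deep learning models in medical image analysis, especially in Alzheimer's disease (AD) diagnosis, where existing models often act as black boxes and cannot explicitly link their decisions to clinical guidelines or anatomical evidence. The key to the solution is the EMAD framework, which uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: sentence-to-evidence grounding first links generated sentences to clinical evidence phrases, and evidence-to-anatomy grounding then localizes that evidence to specific anatomical structures on 3D brain MRI. The paper also introduces GTX-Distill, which reduces reliance on dense annotations by distilling the teacher model's grounding behavior into a student model, and Executable-Rule GRPO, a reinforcement fine-tuning scheme whose verifiable rewards enforce clinical consistency, protocol adherence, and reasoning-diagnosis coherence.

Link: https://arxiv.org/abs/2602.19178
Authors: Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026

Click to view abstract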

Abstract:Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer’s disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.

[CV-129] BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment CVPR2026

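[Quick read]: This paper studies the non-stationary modality imbalance that Multi-modal Action Quality Assessment (Multi-modal AQA) faces in real deployments: sensor failures or annotation gaps cause some modalities to go missing intermittently during training, whereas existing continual learning methods generally assume all modalities remain complete and stable, which limits their use in real-world settings. The key to the solution is Bridged Modality Adaptation (BriMA), which has two parts: a memory-guided bridging imputation module that reconstructs missing modalities from task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples for replay according to modality distortion and distribution drift, improving robustness and performance under modality-missing conditions.

Link: https://arxiv.org/abs/2602.19170
Authors: Kanglei Zhou, Chang Li, Qingyi Pan, Liyuan Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Click to view abstract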

Abstract:Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6–8% higher correlation and 12–15% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.

[CV-130] JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation ICLR2026

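[Quick read]: This paper addresses the low generation quality, poor temporal synchrony, and misalignment with human preferences of current open-source models for Joint Audio-Video Generation (JAVG). The key components of the solution are: (1) a Modality-Specific Mixture-of-Experts (MS-MoE) architecture that enables effective cross-modal interaction while improving single-modal generation quality; (2) a Temporal-Aligned RoPE (TA-RoPE) strategy, a rotary position embedding that achieves explicit, frame-level synchronization between audio and video tokens; and (3) Audio-Video Direct Preference Optimization (AV-DPO), which aligns outputs with human preferences along the quality, consistency, and synchrony dimensions. Built on Wan2.1-1.3B-T2V, the model reaches state-of-the-art performance with only about 1M public training samples.

Link: https://arxiv.org/abs/2602.19163
Authors: Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: Accepted by ICLR 2026. Homepage: this https URL

Click to view abstract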

Abstract:AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Besides, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance merely with around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies have been conducted to validate the effectiveness of our proposed modules. All the code, model, and dataset are released at this https URL.
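
The frame-level alignment idea behind TA-RoPE can be illustrated with a minimal pure-Python sketch. The token rates, the vectors, and the helper names (`rope_rotate`, `video_time`, `audio_time`, `logit`) are assumptions made for illustration and are not taken from the paper. The sketch only shows that when both modalities index rotary positions by a shared timestamp, the attention logit depends on the time offset between tokens, not on per-modality token indices.

```python
import math

def rope_rotate(vec, t, base=10000.0):
    """Apply a standard rotary embedding: rotate each consecutive pair of
    `vec` by an angle proportional to the (real-valued) position `t`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = t * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical token rates: 2 video tokens and 8 audio tokens per second.
VIDEO_TOKENS_PER_SECOND = 2.0
AUDIO_TOKENS_PER_SECOND = 8.0

def video_time(frame_idx):
    return frame_idx / VIDEO_TOKENS_PER_SECOND

def audio_time(token_idx):
    return token_idx / AUDIO_TOKENS_PER_SECOND

# Arbitrary query (video side) and key (audio side) vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

def logit(frame_idx, token_idx):
    """Attention logit between a video query and an audio key whose rotary
    positions are their timestamps in seconds."""
    return dot(rope_rotate(q, video_time(frame_idx)),
               rope_rotate(k, audio_time(token_idx)))

# Video frame 2 and audio token 8 both sit at t = 1.0 s, so the rotations cancel.
print(abs(logit(2, 8) - dot(q, k)) < 1e-9)
# Pairs with the same 0.25 s offset give the same logit.
print(abs(logit(1, 6) - logit(2, 10)) < 1e-9)
```

With per-modality integer indices instead of timestamps, frame 2 and audio token 8 would be six positions apart even though they are simultaneous, which is the mismatch a temporally aligned scheme avoids.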

[CV-131] Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

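[Quick read]: This paper addresses the high inference latency and low efficiency of the VAE decoder in latent diffusion models, which becomes the new bottleneck as diffusion transformers grow more efficient. The key to the solution is Flash-VAED, a universal acceleration framework with two core innovations: (1) an independence-aware channel pruning method that mitigates the severe channel redundancy in VAE decoders, and (2) a stage-wise dominant operator optimization strategy that substantially reduces the inference cost of the widely used causal 3D convolutions. In addition, a three-phase dynamic distillation framework efficiently transfers the capabilities of the original VAE decoder to Flash-VAED, achieving roughly a 6× speedup while retaining up to 96.9% of reconstruction performance and accelerating the end-to-end generation pipeline by up to 36% with negligible quality loss.

Link: https://arxiv.org/abs/2602.19161
Authors: Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code will be released at this https URL

Click to view abstract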

Abstract:Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6× speedup while maintaining up to 96.9% of the reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
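
The two headline numbers (about 6× on the decoder, up to 36% end to end) can be related through Amdahl's law. The worked sketch below back-solves the share of end-to-end time that the VAE decoder would need to occupy for both numbers to hold at once. That share is not reported in the abstract, and "accelerates by up to 36%" is ambiguous, so both readings are computed. The results are illustrative estimates, not figures from the paper.

```python
def overall_speedup(decoder_share, decoder_speedup):
    """Amdahl's law: only the decoder's share of the runtime is accelerated."""
    return 1.0 / ((1.0 - decoder_share) + decoder_share / decoder_speedup)

def implied_decoder_share(target_speedup, decoder_speedup):
    """Invert Amdahl's law to get the decoder's share of baseline runtime."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / decoder_speedup)

DECODER_SPEEDUP = 6.0  # "approximately 6x" from the abstract

# Reading 1 (assumption): the pipeline runs 1.36x faster.
share_if_throughput = implied_decoder_share(1.36, DECODER_SPEEDUP)

# Reading 2 (assumption): end-to-end latency drops by 36%, i.e. speedup 1/0.64.
share_if_latency = implied_decoder_share(1.0 / 0.64, DECODER_SPEEDUP)

print(f"decoder share, 1.36x reading:        {share_if_throughput:.3f}")
print(f"decoder share, 36%-less-time reading: {share_if_latency:.3f}")
```

Under either reading the decoder would have to account for roughly a third to under half of the baseline runtime, which is consistent with the abstract's claim that the bottleneck has shifted to VAE decoding.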

[CV-132] Replication Study: Federated Text-Driven Prompt Generation for Vision-Language Models

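[Quick read]: This paper addresses the difficulty Vision-Language Models (VLMs) have in generalizing to unseen classes under Federated Learning (FL). In federated settings, differences in data distribution and privacy constraints make it hard for conventional methods to train a prompt generation mechanism that generalizes across classes. The key to the solution, which this replication study re-evaluates, is FedTPG's text-driven prompt generation network: it dynamically generates prompts conditioned on class names and so improves recognition of unseen classes without sharing private data. Experiments on several vision datasets show better generalization than static prompt learning methods and confirm that the approach maintains high cross-domain performance in the federated setting.

Link: https://arxiv.org/abs/2602.18439
Authors: Suraj Prasad, Anubha Pant
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures

Click to view abstract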

Abstract:Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities, yet their adaptation to federated learning scenarios presents significant challenges, particularly regarding generalization to unseen classes. The original FedTPG paper (Qiu et al., 2024) addresses this limitation by introducing a text-driven prompt generation network that dynamically creates prompts conditioned on class names, enabling better cross-class generalization in federated settings. In this work, we present a faithful replication study of FedTPG, evaluating the pre-trained model on six diverse vision datasets: Caltech101, Oxford Flowers, FGVC Aircraft, Oxford Pets, Food-101, and DTD. Our evaluation achieves results within 0.2% of the original paper’s reported accuracies, with an average accuracy of 74.58% on seen (base) classes and 76.00% on unseen (new) classes, demonstrating a +1.43 percentage point improvement in generalization. These results validate the original paper’s core claims: (1) text-driven prompt generation enables superior generalization to unseen classes compared to static prompt learning methods, and (2) federated training of prompt generators maintains high performance across diverse visual domains without sharing private data. Our successful replication confirms the robustness and reproducibility of the FedTPG approach.
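
As a quick consistency check on the reported averages, the sketch below recomputes the seen-to-unseen gap and the harmonic mean of the two accuracies. The harmonic mean is a summary commonly used in base-to-new prompt-learning evaluations; its use here is an assumption, since the abstract does not report it. From the rounded values the gap comes out as 1.42 points, so the +1.43 in the abstract presumably comes from unrounded numbers.

```python
# Average accuracies as reported in the abstract (percent).
seen_acc = 74.58    # base classes
unseen_acc = 76.00  # new classes

# Gap between unseen and seen accuracy, in percentage points.
gap = unseen_acc - seen_acc

# Harmonic mean of the two accuracies (not reported in the abstract).
harmonic_mean = 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

print(f"gap: {gap:.2f} points")
print(f"harmonic mean: {harmonic_mean:.2f}%")
```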

[CV-133] NI-Tex: Non-isometric Image-based Garment Texture Generation

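[Quick read]: This paper addresses non-isometric image-based garment texture generation, that is, how to generate high-quality, physically correct, and spatially aligned Physically-based Rendering (PBR) textures when the input image and the 3D mesh are topologically inconsistent. The core of the solution is threefold: constructing 3D Garment Videos, a garment-centric dataset that uses physical simulation to provide consistent geometry and material supervision across deformations; employing Nano Banana for high-fidelity non-isometric image editing to support cross-topology texture transfer; and proposing an iterative baking method based on uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless PBR textures directly usable for industry-level 3D garment design.

Link: https://arxiv.org/abs/2511.18765
Authors: Hui Shan, Ming Li, Haitao Yang, Kai Zheng, Sizhe Zheng, Yanwei Fu, Xiangru Huang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract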

Abstract:Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match to the image poses, which significantly constrains the texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.

[CV-134] DEFNet: Multitasks-based Deep Evidential Fusion Network for Blind Image Quality Assessment

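[Quick read]: This paper addresses the limited performance of Blind Image Quality Assessment (BIQA) methods caused by insufficient integration of auxiliary tasks and a lack of flexible uncertainty estimation. The key to the solution is a multitask-based Deep Evidential Fusion Network (DEFNet), which performs multitask optimization with scene classification and distortion type classification as auxiliary tasks. It designs a trustworthy information fusion strategy that first fuses diverse features across sub-regions to enrich the information and then performs hierarchical fusion by balancing local detail with global context. It also introduces an uncertainty estimation technique based on a normal-inverse gamma distribution mixture, improving the robustness and reliability of the model.

Link: https://arxiv.org/abs/2507.19418
Authors: Yiwei Lou, Yuanpeng He, Rongchao Zhang, Yongzhi Cao, Hanpin Wang, Yu Huang
Affiliation: Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Click to view abstract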

Abstract:Blind image quality assessment (BIQA) methods often incorporate auxiliary tasks to improve performance. However, existing approaches face limitations due to insufficient integration and a lack of flexible uncertainty estimation, leading to suboptimal performance. To address these challenges, we propose a multitasks-based Deep Evidential Fusion Network (DEFNet) for BIQA, which performs multitask optimization with the assistance of scene and distortion type classification tasks. To achieve a more robust and reliable representation, we design a novel trustworthy information fusion strategy. It first combines diverse features and patterns across sub-regions to enhance information richness, and then performs local-global information fusion by balancing fine-grained details with coarse-grained context. Moreover, DEFNet exploits advanced uncertainty estimation technique inspired by evidential learning with the help of normal-inverse gamma distribution mixture. Extensive experiments on both synthetic and authentic distortion datasets demonstrate the effectiveness and robustness of the proposed framework. Additional evaluation and analysis are carried out to highlight its strong generalization capability and adaptability to previously unseen scenarios.
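
The evidential uncertainty mentioned above can be made concrete with a single normal-inverse-gamma (NIG) output, following the standard deep evidential regression formulation. This is a simplified sketch: DEFNet uses a NIG mixture whose fusion rule is not given in the abstract, and the function name and example parameter values below are hypothetical.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Split uncertainty for a NIG(gamma, nu, alpha, beta) evidential output.

    Returns (prediction, aleatoric, epistemic), where
      prediction = E[mu]      = gamma
      aleatoric  = E[sigma^2] = beta / (alpha - 1)
      epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic

# Hypothetical network output for one image's quality score.
score, aleatoric, epistemic = nig_uncertainty(gamma=3.2, nu=4.0, alpha=3.0, beta=1.0)
print(score, aleatoric, epistemic)
```

A larger `nu` (more "virtual observations" of the mean) shrinks the epistemic term while leaving the aleatoric term unchanged, which is what lets an evidential model distinguish noisy inputs from unfamiliar ones.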

[CV-135] Using Unsupervised Domain Adaptation Semantic Segmentation for Pulmonary Embolism Detection in Computed Tomography Pulmonary Angiogram (CTPA) Images

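[Quick read]: This paper addresses two obstacles to computer-aided diagnosis of pulmonary embolism (PE): poor model generalization caused by "domain shift", and the high cost of expert annotations for Computed Tomography Pulmonary Angiography (CTPA). The key to the solution is an Unsupervised Domain Adaptation (UDA) framework built on a Transformer backbone and a Mean-Teacher architecture, with three core modules that improve pseudo-label reliability: (1) a Prototype Alignment (PA) mechanism that reduces category-level distribution discrepancies; (2) Global and Local Contrastive Learning (GLCL), which captures both pixel-level topological relationships and global semantic features; and (3) an Attention-based Auxiliary Local Prediction (AALP) module that uses Transformer attention maps to automatically extract high-information slices, increasing sensitivity to small PE lesions. Experiments show substantial segmentation gains on both cross-center and cross-modality tasks, supporting the method's robustness and generalizability across clinical environments.

Link: https://arxiv.org/abs/2602.19891
Authors: Wen-Liang Lin, Yun-Chien Cheng
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract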

Abstract:While deep learning has demonstrated considerable promise in computer-aided diagnosis for pulmonary embolism (PE), practical deployment in Computed Tomography Pulmonary Angiography (CTPA) is often hindered by “domain shift” and the prohibitive cost of expert annotations. To address these challenges, an unsupervised domain adaptation (UDA) framework is proposed, utilizing a Transformer backbone and a Mean-Teacher architecture for cross-center semantic segmentation. The primary focus is placed on enhancing pseudo-label reliability by learning deep structural information within the feature space. Specifically, three modules are integrated and designed for this task: (1) a Prototype Alignment (PA) mechanism to reduce category-level distribution discrepancies; (2) Global and Local Contrastive Learning (GLCL) to capture both pixel-level topological relationships and global semantic representations; and (3) an Attention-based Auxiliary Local Prediction (AALP) module designed to reinforce sensitivity to small PE lesions by automatically extracting high-information slices from Transformer attention maps. Experimental validation conducted on cross-center datasets (FUMPE and CAD-PE) demonstrates significant performance gains. In the FUMPE → CAD-PE task, the IoU increased from 0.1152 to 0.4153, while the CAD-PE → FUMPE task saw an improvement from 0.1705 to 0.4302. Furthermore, the proposed method achieved a 69.9% Dice score in the CT → MRI cross-modality task on the MMWHS dataset without utilizing any target-domain labels for model selection, confirming its robustness and generalizability for diverse clinical environments.
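
In a Mean-Teacher setup, the teacher that produces pseudo-labels is typically an exponential moving average (EMA) of the student weights. The sketch below shows that update on plain Python dictionaries. The decay value and parameter names are assumptions for illustration, and the paper's actual training loop (including PA, GLCL, and AALP) is not reproduced.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy scalar "weights"; a real model would hold tensors under each name.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# With a fixed student, the teacher drifts toward it geometrically.
for _ in range(100):
    teacher = ema_update(teacher, student)

print(teacher)
```

Because the teacher changes slowly, its predictions on unlabeled target-domain scans are more stable than the student's, which is why they are used as pseudo-labels.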

Artificial Intelligence

[AI-0] Behavior Learning (BL): Learning Hierarchical Optimization Structures from Data ICLR2026

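[Quick read]: This paper addresses the lack of interpretability and identifiability in conventional machine learning models of optimization problems, especially in scientific domains with complex optimization structures, where it is hard to learn from data an optimization model with clear physical or behavioral meaning. The key to the solution is the Behavior Learning (BL) framework, which parameterizes a compositional utility function built from intrinsically interpretable modular blocks, each of which can be written symbolically as a Utility Maximization Problem (UMP), thereby unifying predictive performance, intrinsic interpretability, and identifiability. BL supports architectures ranging from a single UMP to hierarchical compositions; its smooth and monotone variant, Identifiable Behavior Learning (IBL), additionally guarantees identifiability. The paper establishes the universal approximation property of BL and the M-estimation properties of IBL, and experiments show strong predictive performance and scalability on high-dimensional data.

Link: https://arxiv.org/abs/2602.20152
Authors: Zhenyao Ma, Yue Liang, Dongxu Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: ICLR 2026

Click to view abstract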

Abstract:Inspired by behavioral science, we propose Behavior Learning (BL), a novel general-purpose machine learning framework that learns interpretable and identifiable optimization structures from data, ranging from single optimization problems to hierarchical compositions. It unifies predictive performance, intrinsic interpretability, and identifiability, with broad applicability to scientific domains involving optimization. BL parameterizes a compositional utility function built from intrinsically interpretable modular blocks, which induces a data distribution for prediction and generation. Each block represents and can be written in symbolic form as a utility maximization problem (UMP), a foundational paradigm in behavioral science and a universal framework of optimization. BL supports architectures ranging from a single UMP to hierarchical compositions, the latter modeling hierarchical optimization structures. Its smooth and monotone variant (IBL) guarantees identifiability. Theoretically, we establish the universal approximation property of BL, and analyze the M-estimation properties of IBL. Empirically, BL demonstrates strong predictive performance, intrinsic interpretability and scalability to high-dimensional data. Code: this https URL ; install via pip install blnetwork.

[AI-1] Agentic AI for Scalable and Robust Optical Systems Control

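[Quick read]: This paper addresses task understanding and coordinated multi-device execution in autonomous control of optical systems, in particular how to achieve high-fidelity, end-to-end control driven by natural language. The key to the solution is AgentOptics, a framework built on the Model Context Protocol (MCP) that translates natural language instructions into protocol-compliant actions through a structured tool abstraction layer and implements 64 standardized MCP tools across 8 representative optical devices. The design supports complex task requirements such as role-aware responses, multi-step coordination, and error handling. Benchmark results show average task success rates of 87.7%–99.0% under both commercial online Large Language Model (LLM) and locally hosted open-source LLM configurations, well above LLM-based code generation baselines, which reach at most 50%.

Link: https://arxiv.org/abs/2602.20144
Authors: Zehao Wang, Mingzhe Han, Wei Cheng, Yue-Kai Huang, Philip Ji, Denton Wu, Mahdi Safari, Flemming Holtorf, Kenaish AlQubaisi, Norbert M. Linke, Danyang Zhuo, Yiran Chen, Ting Wang, Dirk Englund, Tingjun Chen
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract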

Abstract:We present AgentOptics, an agentic AI framework for high-fidelity, autonomous optical system control built on the Model Context Protocol (MCP). AgentOptics interprets natural language tasks and executes protocol-compliant actions on heterogeneous optical devices through a structured tool abstraction layer. We implement 64 standardized MCP tools across 8 representative optical devices and construct a 410-task benchmark to evaluate request understanding, role-aware responses, multi-step coordination, robustness to linguistic variation, and error handling. We assess two deployment configurations–commercial online LLMs and locally hosted open-source LLMs–and compare them with LLM-based code generation baselines. AgentOptics achieves 87.7%–99.0% average task success rates, significantly outperforming code-generation approaches, which reach up to 50% success. We further demonstrate broader applicability through five case studies extending beyond device-level control to system orchestration, monitoring, and closed-loop optimization. These include DWDM link provisioning and coordinated monitoring of coherent 400 GbE and analog radio-over-fiber (ARoF) channels; autonomous characterization and bias optimization of a wideband ARoF link carrying 5G fronthaul traffic; multi-span channel provisioning with launch power optimization; closed-loop fiber polarization stabilization; and distributed acoustic sensing (DAS)-based fiber monitoring with LLM-assisted event detection. These results establish AgentOptics as a scalable, robust paradigm for autonomous control and orchestration of heterogeneous optical systems.
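
The notion of a structured tool abstraction layer can be sketched in plain Python as a registry that validates a tool call before it reaches a device driver. Everything here is hypothetical: the `Tool` and `ToolRegistry` classes, the `FakeAttenuator` device, and its 0 to 30 dB range are invented for illustration, and the sketch does not use or imitate the MCP SDK or the AgentOptics tool set.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Tool:
    name: str
    description: str
    params: Dict[str, Any]        # parameter name -> accepted type(s)
    handler: Callable[..., Any]

class ToolRegistry:
    """Validates tool calls (name, parameter set, types) before dispatch."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        tool = self._tools[name]
        if set(kwargs) != set(tool.params):
            raise ValueError(f"{name} expects parameters {sorted(tool.params)}")
        for key, accepted in tool.params.items():
            if not isinstance(kwargs[key], accepted):
                raise TypeError(f"parameter {key} has the wrong type")
        return tool.handler(**kwargs)

class FakeAttenuator:
    """Stand-in for a real device driver (hypothetical)."""

    def __init__(self) -> None:
        self.attenuation_db = 0.0

    def set_attenuation(self, value_db: float) -> str:
        if not 0.0 <= value_db <= 30.0:
            raise ValueError("attenuation out of range")
        self.attenuation_db = float(value_db)
        return f"attenuation set to {self.attenuation_db:.1f} dB"

device = FakeAttenuator()
registry = ToolRegistry()
registry.register(Tool(
    name="set_attenuation",
    description="Set the attenuator level in dB (0 to 30).",
    params={"value_db": (int, float)},
    handler=device.set_attenuation,
))

# An agent would emit the tool name and arguments; the registry checks them.
result = registry.call("set_attenuation", value_db=12.5)
print(result)
```

Restricting the model to a fixed set of validated calls, instead of letting it generate free-form driver code, is one plausible reason a tool-based agent would fail less often than a code-generation baseline.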

[AI-2] Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

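[Quick read]: This paper addresses the high variance and computational complexity of learning policies in large population models, in particular the fact that existing Hybrid Structural Methods (HSMs) have not been scaled to partially observable settings. The core challenge is to improve algorithmic efficiency and support history-aware policies while preserving model accuracy. The key to the solution is Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM, which uses Monte Carlo sampling for the common noise and exploits known transition dynamics to estimate the expected return exactly, conditioned on those samples. RSPG achieves state-of-the-art performance and an order-of-magnitude faster convergence, and for the first time solves a macroeconomics Mean Field Game (MFG) with heterogeneous agents, common noise, and history-aware policies.

Link: https://arxiv.org/abs/2602.20141
Authors: Clarisse Wibault, Johannes Forkel, Sebastian Towers, Tiphaine Wibault, Juan Duque, George Whittle, Andreas Schaab, Yucheng Yang, Chiyuan Wang, Michael Osborne, Benjamin Moll, Jakob Foerster
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract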

Abstract:Mean Field Games (MFGs) provide a principled framework for modeling interactions in large population models: at scale, population dynamics become deterministic, with uncertainty entering only through aggregate shocks, or common noise. However, algorithmic progress has been limited since model-free methods are too high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) use Monte Carlo rollouts for the common noise in combination with exact estimation of the expected return, conditioned on those samples. However, HSMs have not been scaled to Partially Observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for settings involving public information. We also introduce MFAX, our JAX-based framework for MFGs. By leveraging known transition dynamics, RSPG achieves state-of-the-art performance as well as an order-of-magnitude faster convergence and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise and history-aware policies. MFAX is publicly available at: this https URL.
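
The hybrid structural idea (sample the common noise, compute everything else exactly from known dynamics) can be sketched on a toy finite-state mean field model. The two transition matrices, the reward vector, and all function names below are invented for illustration. The sketch omits policies, recurrence, and gradients, and it is neither the RSPG algorithm nor the MFAX API.

```python
import random

# Known individual-state transition matrices, one per common-noise value
# (hypothetical numbers; each row sums to 1).
TRANSITIONS = {
    "boom": [[0.9, 0.1], [0.5, 0.5]],
    "bust": [[0.6, 0.4], [0.2, 0.8]],
}
REWARD = [1.0, 0.0]  # per-state reward (hypothetical)

def push_forward(mu, matrix):
    """Exact one-step update of the population distribution: mu' = mu @ matrix."""
    n = len(mu)
    return [sum(mu[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def expected_return(mu0, shocks):
    """Exact population-average return for one fixed common-noise path."""
    mu, total = list(mu0), 0.0
    for shock in shocks:
        total += sum(m * r for m, r in zip(mu, REWARD))
        mu = push_forward(mu, TRANSITIONS[shock])
    return total

def hybrid_estimate(mu0, horizon, num_samples, seed=0):
    """Monte Carlo over common-noise paths, exact expectation within each path."""
    rng = random.Random(seed)
    returns = []
    for _ in range(num_samples):
        shocks = [rng.choice(["boom", "bust"]) for _ in range(horizon)]
        returns.append(expected_return(mu0, shocks))
    return sum(returns) / num_samples

estimate = hybrid_estimate([1.0, 0.0], horizon=5, num_samples=200)
print(f"estimated expected return: {estimate:.3f}")
```

Only the aggregate shocks are sampled, so the estimator's variance comes from the common noise alone and not from individual agents' idiosyncratic transitions.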

[AI-3] Modeling Epidemiological Dynamics Under Adversarial Data and User Deception

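[Quick read]: This paper addresses data distortion in epidemiological models caused by strategic misreporting by individuals (for example, lying about vaccination status or mask usage), which can undermine both the assessment of non-pharmaceutical interventions (NPIs) and the accuracy of transmission forecasts. The key to the solution is a signaling game framework that models the population (senders) and the public health authority (receiver) as strategically interacting players: individuals choose how to report according to their own incentives, and the authority updates its epidemiological model from potentially distorted signals. Analysis of the game equilibria shows that effective epidemic control is still achievable under pervasive deception when sender and receiver strategies are well designed, and that separating equilibria drive infections to near zero, providing a theoretical basis and modeling tools for handling "adversarial data".

Link: https://arxiv.org/abs/2602.20134
Authors: Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Bud Mishra, Naren Ramakrishnan
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract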

Abstract:Epidemiological models increasingly rely on self-reported behavioral data such as vaccination status, mask usage, and social distancing adherence to forecast disease transmission and assess the impact of non-pharmaceutical interventions (NPIs). While such data provide valuable real-time insights, they are often subject to strategic misreporting, driven by individual incentives to avoid penalties, access benefits, or express distrust in public health authorities. To account for such human behavior, in this paper, we introduce a game-theoretic framework that models the interaction between the population and a public health authority as a signaling game. Individuals (senders) choose how to report their behaviors, while the public health authority (receiver) updates their epidemiological model(s) based on potentially distorted signals. Focusing on deception around masking and vaccination, we characterize analytically game equilibrium outcomes and evaluate the degree to which deception can be tolerated while maintaining epidemic control through policy interventions. Our results show that separating equilibria-with minimal deception-drive infections to near zero over time. Remarkably, even under pervasive dishonesty in pooling equilibria, well-designed sender and receiver strategies can still maintain effective epidemic control. This work advances the understanding of adversarial data in epidemiology and offers tools for designing more robust public health models in the presence of strategic user behavior.
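
One concrete way a receiver can account for misreporting is to invert a simple reporting model. The sketch below assumes, hypothetically, that vaccinated individuals always report truthfully and that a known fraction of unvaccinated individuals claim to be vaccinated. Neither this reporting model nor the numbers come from the paper, whose equilibrium analysis is considerably richer.

```python
def observed_rate(true_rate, lie_prob):
    """Reported vaccination rate when a fraction `lie_prob` of the
    unvaccinated falsely report being vaccinated."""
    return true_rate + (1.0 - true_rate) * lie_prob

def corrected_rate(reported_rate, lie_prob):
    """Invert the reporting model to recover the true rate, clipped to [0, 1]."""
    if not 0.0 <= lie_prob < 1.0:
        raise ValueError("lie_prob must be in [0, 1)")
    estimate = (reported_rate - lie_prob) / (1.0 - lie_prob)
    return min(1.0, max(0.0, estimate))

true_rate = 0.60   # hypothetical true vaccination coverage
lie_prob = 0.25    # hypothetical misreporting probability

reported = observed_rate(true_rate, lie_prob)
recovered = corrected_rate(reported, lie_prob)
print(f"reported: {reported:.2f}, recovered: {recovered:.2f}")
```

If everyone reports "vaccinated" regardless of status (the pooling case), `lie_prob` approaches 1 and the signal carries no information, which is why the correction is undefined there and the receiver must rely on other strategies.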

[AI-4] ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

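Quick read: This paper targets two bottlenecks of current reinforcement learning with verifiable rewards (RLVR) when training reasoning language models (RLMs): existing synthetic data generation methods remain solution-centric and struggle to cover diverse reasoning scenarios, and verifier-based methods depend on a small number of hand-crafted procedural environments, which limits generalization. The key to the solution is the ReSyn pipeline, which builds diverse reasoning environments equipped with instance generators and verifiers, systematically extending RLVR to task types such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. Experiments show that a Qwen2.5-7B-Instruct model trained on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on BBEH, supporting the claim that generating reasoning environments at scale strengthens RLM reasoning.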

Link: https://arxiv.org/abs/2602.20117
Authors: Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, Huzefa Rangwala
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none

Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.
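
To make the "instance generator plus verifier" idea concrete, here is a minimal sketch of one such environment. The subset-sum puzzle, the `Instance` class, and all function names are illustrative assumptions; the abstract does not describe ReSyn's actual environment interface.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    numbers: list[int]
    target: int


def generate_instance(rng: random.Random, size: int = 5) -> Instance:
    """Instance generator: a subset-sum puzzle that is solvable by construction."""
    numbers = [rng.randint(1, 20) for _ in range(size)]
    # Pick a random non-empty subset so that at least one valid answer exists.
    chosen = [n for n in numbers if rng.random() < 0.5] or [numbers[0]]
    return Instance(numbers=numbers, target=sum(chosen))


def verify(instance: Instance, answer: list[int]) -> bool:
    """Verifier: the answer is a list of distinct indices whose values sum to the target."""
    if len(set(answer)) != len(answer):
        return False
    if any(i < 0 or i >= len(instance.numbers) for i in answer):
        return False
    return sum(instance.numbers[i] for i in answer) == instance.target


def reward(instance: Instance, answer: list[int]) -> float:
    """Binary verifiable reward, as would be fed to an RL trainer."""
    return 1.0 if verify(instance, answer) else 0.0
```

The point of the design is that checking an answer is much cheaper than annotating a solution: the verifier never needs to know which subset the generator picked.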

[AI-5] StyleStream: Real-Time Zero-Shot Voice Style Conversion

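Quick read: This paper addresses the core challenge of voice style conversion (VSC): separating linguistic content from voice style (timbre, accent, and emotion) so that style transfer is both high-quality and real-time. Existing methods are limited in conversion quality and do not support real-time use. The key to the solution is StyleStream, a streamable zero-shot voice style conversion system with two core components: a Destylizer, which removes style attributes from the input speech while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that generates the target style conditioned on reference speech. Text supervision and a highly constrained information bottleneck enforce robust content-style disentanglement, and the fully non-autoregressive architecture achieves state-of-the-art performance with an end-to-end latency of 1 second.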

Link: https://arxiv.org/abs/2602.20113
Authors: Yisi Liu, Nicholas Lee, Gopala Anumanchipalli
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Voice style conversion aims to transform an input utterance to match a target speaker’s timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: this https URL.

[AI-6] BarrierSteer: LLM Safety via Learning Barrier Steering

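Quick read: This paper addresses two core problems in deploying large language models (LLMs): vulnerability to adversarial attacks and the risk of generating unsafe content, especially in high-stakes settings. The key to the solution is BarrierSteer, a framework that embeds learned non-linear safety constraints directly into the model's latent representation space and uses a steering mechanism based on Control Barrier Functions (CBFs) to detect and block unsafe response trajectories efficiently at inference time. An efficient constraint-merging strategy enforces multiple safety constraints jointly without modifying the underlying LLM parameters, so safety improves substantially while the model's original performance is preserved.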

Link: https://arxiv.org/abs/2602.20102
Authors: Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper introduces SafeBarrier, a framework that enforces safety in large language models by steering their latent representations with control barrier functions during inference, reducing adversarial and unsafe outputs

Abstract:Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model’s latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to efficiently detect and prevent unsafe response trajectories during inference with high precision. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the model’s original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space offers a principled and computationally efficient approach to enforcing safety. Our experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
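
The following is a minimal sketch of what barrier-based steering of a latent vector can look like. It assumes a single linear barrier `h(z) = w·z + b` and the discrete-time condition `h(z_next) >= (1 - gamma) * h(z_prev)`; the paper learns non-linear constraints and merges several of them, so the linear form, the function names, and `gamma` are simplifications introduced here for illustration only.

```python
def barrier(z: list[float], w: list[float], b: float) -> float:
    """Linear barrier h(z) = w . z + b; the safe set is {z : h(z) >= 0}."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def steer(z_prev: list[float], z_prop: list[float],
          w: list[float], b: float, gamma: float = 0.5) -> list[float]:
    """Minimally shift a proposed latent so the discrete-time barrier condition holds.

    For a linear barrier, the smallest Euclidean correction is a step along w.
    """
    required = (1.0 - gamma) * barrier(z_prev, w, b)
    gap = required - barrier(z_prop, w, b)
    if gap <= 0.0:
        return list(z_prop)  # proposal already satisfies the condition
    scale = gap / sum(wi * wi for wi in w)
    return [zi + scale * wi for zi, wi in zip(z_prop, w)]


# A proposal that would leave the safe set is pushed back onto the constraint boundary.
steered = steer(z_prev=[2.0, 0.0], z_prop=[-1.0, 3.0], w=[1.0, 0.0], b=0.0)
```

Because the correction only acts on the latent state, the model weights are untouched, which matches the abstract's claim that the underlying LLM parameters are not modified.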

[AI-7] CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

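Quick read: This paper addresses the lack of genuine causal reasoning in large language models (LLMs) used for complex decision-making: a model can reach high accuracy by relying on semantic correlations rather than the underlying causal structure, which leads to wrong inferences in practice. The key to the solution is a new causal reasoning benchmark, CausalFlip, which constructs pairs of semantically similar questions that reuse the same events but have opposite causal answers, systematically exposing reliance on semantic patterns. It also introduces a noisy-prefix evaluation that perturbs intermediate reasoning steps without changing the causal relations, further testing whether the model actually follows the causal logic. Experiments show that explicit Chain-of-Thought (CoT) training can still be misled by spurious correlations, whereas an internalized causal reasoning approach substantially improves causal grounding, suggesting that latent causal reasoning capabilities can be elicited from base models.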

Link: https://arxiv.org/abs/2602.20094
Authors: Yuzhe Wang, Yaochen Zhu, Jundong Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages plus references, 3 figures, 3 tables. Under review

Abstract:As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability of LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the underlying true causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. Based on this, for each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, where models that rely heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models’ reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that it is promising to better elicit the latent causal reasoning capabilities of base LLMs.

[AI-8] Robust Taylor-Lagrange Control for Safety-Critical Systems

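Quick read: This paper addresses limitations of the Control Barrier Function (CBF) method in safety-critical control, namely that the existence of a CBF is only a sufficient, not a necessary, condition for system safety, together with the feasibility preservation problem (e.g., the inter-sampling effect) faced by the existing Taylor-Lagrange Control (TLC) method. The key to the solution is a robust TLC (rTLC) method: it expands the safety function using Taylor's expansion with Lagrange remainder at an order higher than the function's relative degree, so that the control input appears explicitly at the current time rather than at a future time as in TLC. The method naturally mitigates the feasibility preservation problem and requires tuning only one hyper-parameter (the discretization time interval used in implementation), far fewer than competing methods.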

Link: https://arxiv.org/abs/2602.20076
Authors: Wei Xiao, Christos Cassandras, Anni Li
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 7 pages

Abstract:The Control Barrier Function (CBF) method has been widely adopted for solving safety-critical control problems. However, the existence of a CBF is only a sufficient condition for system safety. The recently proposed Taylor-Lagrange Control (TLC) method addresses this limitation, but is vulnerable to the feasibility preservation problem (e.g., inter-sampling effect). In this paper, we propose a robust TLC (rTLC) method to address the feasibility preservation problem. Specifically, the rTLC method expands the safety function at an order higher than the relative degree of the function using Taylor’s expansion with Lagrange remainder, which allows the control to explicitly show up at the current time instead of the future time in the TLC method. The rTLC method naturally addresses the feasibility preservation problem with only one hyper-parameter (the discretization time interval size during implementation), which is far fewer than its counterparts. Finally, we illustrate the effectiveness of the proposed rTLC method through an adaptive cruise control problem, and compare it with existing safety-critical control methods.
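
The expansion described in the abstract can be written out as follows. This is a sketch based only on Taylor's theorem and the standard definition of relative degree; the control-affine dynamics $\dot{x} = f(x) + g(x)u$, the symbol $m$ for the relative degree, and the remainder bound $R$ are assumptions introduced here, and the paper's exact constraint may differ.

```latex
% Taylor expansion of the safety function h along the trajectory, to order m+1,
% around the current time t_0, with Lagrange remainder at some \xi \in (t_0, t):
h(x(t)) = \sum_{k=0}^{m} \frac{h^{(k)}(x(t_0))}{k!}\,(t - t_0)^{k}
        + \frac{h^{(m+1)}(x(\xi))}{(m+1)!}\,(t - t_0)^{m+1}

% With relative degree m, the control first appears in the m-th derivative,
% evaluated at the current time t_0:
h^{(m)}(x(t_0)) = L_f^{m} h(x(t_0)) + L_g L_f^{m-1} h(x(t_0))\, u(t_0)

% Assuming a known lower bound R \le h^{(m+1)}(x(\xi)) on the interval and
% \Delta t = t - t_0, a sufficient condition for h(x(t)) \ge 0 is
\sum_{k=0}^{m} \frac{h^{(k)}(x(t_0))}{k!}\,\Delta t^{k}
        + \frac{R}{(m+1)!}\,\Delta t^{m+1} \ge 0
```

Under these assumptions the inequality is affine in $u(t_0)$, and $\Delta t$ is the single hyper-parameter the abstract refers to.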

[AI-9] The LLMbda Calculus: AI Agents, Conversations, and Information Flow

Link: https://arxiv.org/abs/2602.20064
Authors: Zac Garby, Andrew D. Gordon, David Sands
Affiliation: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: none

[AI-10] Interaction Theater: A case of LLM Agents Interacting at Scale

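Quick read: This paper examines the gap between apparent activity and actual substance when autonomous large language model (LLM) agents interact at scale, that is, whether the output of many agents interacting in a decentralized way constitutes meaningful information exchange. The key to the approach is a multi-dimensional evaluation that combines a lexical Jaccard specificity metric, embedding-based semantic similarity, and LLM-as-judge validation to quantify interaction quality. The results show that agent-generated text has the surface form of diverse discussion but little substance: most comments share no distinguishing content vocabulary with the post they respond to and add little information, spam and off-topic content are the dominant comment types, and threaded conversation is rare. Without explicit coordination mechanisms, current multi-agent setups produce parallel output rather than productive exchange.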

Link: https://arxiv.org/abs/2602.20059
Authors: Sarath Shekkizhar, Adam Earle
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: none

Abstract:As multi-agent architectures and agent-to-agent protocols proliferate, a fundamental question arises: what actually happens when autonomous LLM agents interact at scale? We study this question empirically using data from Moltbook, an AI-agent-only social platform, with 800K posts, 3.5M comments, and 78K agent profiles. We combine lexical metrics (Jaccard specificity), embedding-based semantic similarity, and LLM-as-judge validation to characterize agent interaction quality. Our findings reveal agents produce diverse, well-formed text that creates the surface appearance of active discussion, but the substance is largely absent. Specifically, while most agents (67.5%) vary their output across contexts, 65% of comments share no distinguishing content vocabulary with the post they appear under, and information gain from additional comments decays rapidly. LLM judge based metrics classify the dominant comment types as spam (28%) and off-topic content (22%). Embedding-based semantic analysis confirms that lexically generic comments are also semantically generic. Agents rarely engage in threaded conversation (5% of comments), defaulting instead to independent top-level responses. We discuss implications for multi-agent interaction design, arguing that coordination mechanisms must be explicitly designed; without them, even large populations of capable agents produce parallel output rather than productive exchange.
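
A lexical overlap score of the kind mentioned above can be sketched with a plain Jaccard index over content words. The tokenizer, the small stopword list, and the function names are assumptions for illustration; the paper's exact definition of "Jaccard specificity" is not given in the abstract.

```python
import re

# A deliberately small stopword list, for illustration only.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "this", "that", "for", "on", "with",
})


def content_vocab(text: str) -> set[str]:
    """Lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(post: str, comment: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the content vocabularies."""
    a, b = content_vocab(post), content_vocab(comment)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


generic = jaccard("The barrier function keeps the robot safe",
                  "Great post, thanks for sharing")
specific = jaccard("graph navigation", "graph retrieval")
```

A score of zero, as for the generic comment here, corresponds to the case the abstract highlights: a comment that shares no distinguishing content vocabulary with its post.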

[AI-11] AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation

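Quick read: This paper addresses how a robot policy can anticipate physical outcomes and adapt to real-world changes when manipulating objects in dynamic environments. The core challenge is to reduce reliance on human intervention while improving robustness and adaptability in out-of-distribution scenarios. The key to the solution is a unified framework, World-Model-Driven Diffusion Policy with Online Adaptive Learning (AdaWorldPolicy), which integrates a world model, an action expert, and a force predictor. All three are implemented as Flow Matching Diffusion Transformers (DiT) and connected through multi-modal self-attention layers, which allows deep feature exchange while keeping the modules distinct. An Online Adaptive Learning (AdaOL) strategy dynamically switches between an Action Generation mode and a Future Imagination mode, forming a closed loop that lets the system respond to visual and physical domain shifts and achieve state-of-the-art results on simulated and real-robot benchmarks.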

Link: https://arxiv.org/abs/2602.20057
Authors: Ge Yuan, Qiyuan Qiao, Jing Zhang, Dong Xu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Homepage: this https URL

Abstract:Effective robotic manipulation requires policies that can anticipate physical outcomes and adapt to real-world environments. In this work, we introduce a unified framework, World-Model-Driven Diffusion Policy with Online Adaptive Learning (AdaWorldPolicy), to enhance robotic manipulation under dynamic conditions with minimal human involvement. Our core insight is that world models provide strong supervision signals, enabling online adaptive learning in dynamic environments, which can be complemented by force-torque feedback to mitigate dynamic force shifts. Our AdaWorldPolicy integrates a world model, an action expert, and a force predictor, all implemented as interconnected Flow Matching Diffusion Transformers (DiT). They are interconnected via the multi-modal self-attention layers, enabling deep feature exchange for joint learning while preserving their distinct modularity characteristics. We further propose a novel Online Adaptive Learning (AdaOL) strategy that dynamically switches between an Action Generation mode and a Future Imagination mode to drive reactive updates across all three modules. This creates a powerful closed-loop mechanism that adapts to both visual and physical domain shifts with minimal overhead. Across a suite of simulated and real-robot benchmarks, our AdaWorldPolicy achieves state-of-the-art performance, with dynamic adaptive capacity to out-of-distribution scenarios.

[AI-12] CodeCompass: Navigating the Navigation Paradox in Agentic Code Intelligence

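Quick read: This paper addresses the navigation problem that code intelligence agents face in large codebases: despite very large context windows, agents fail to locate architecturally critical files, which lowers task completion. The core claim is that navigation and retrieval are distinct problems that agents currently treat as one: navigation depends on code structure (such as dependency graphs), whereas retrieval depends on text matching (such as BM25). The key to the solution is graph-based structural navigation: CodeCompass, a Model Context Protocol server, exposes dependency graph information so that agents can find hidden dependencies. Across 30 benchmark tasks, the approach reaches 99.4% task completion on hidden-dependency tasks, compared with 76.2% for vanilla agents and 78.2% for BM25 retrieval. The study further finds that the bottleneck is behavioral alignment rather than tool availability: agents need explicit prompt engineering to use structural context instead of lexical heuristics.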

Link: https://arxiv.org/abs/2602.20048
Authors: Tarakanath Paipuru
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 23 pages, 7 figures. Research study with 258 trials on SWE-bench-lite tasks. Code and data: this https URL

Abstract:Modern code intelligence agents operate in contexts exceeding 1 million tokens, far beyond the scale where humans manually locate relevant files. Yet agents consistently fail to discover architecturally critical files when solving real-world coding tasks. We identify the Navigation Paradox: agents perform poorly not due to context limits, but because navigation and retrieval are fundamentally distinct problems. Through 258 automated trials across 30 benchmark tasks on a production FastAPI repository, we demonstrate that graph-based structural navigation via CodeCompass (a Model Context Protocol server exposing dependency graphs) achieves 99.4% task completion on hidden-dependency tasks, a 23.2 percentage-point improvement over vanilla agents (76.2%) and 21.2 points over BM25 retrieval (78.2%). However, we uncover a critical adoption gap: 58% of trials with graph access made zero tool calls, and agents required explicit prompt engineering to adopt the tool consistently. Our findings reveal that the bottleneck is not tool availability but behavioral alignment: agents must be explicitly guided to leverage structural context over lexical heuristics. We contribute: (1) a task taxonomy distinguishing semantic-search, structural, and hidden-dependency scenarios; (2) empirical evidence that graph navigation outperforms retrieval when dependencies lack lexical overlap; and (3) open-source infrastructure for reproducible evaluation of navigation tools.
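
The difference between structural navigation and lexical retrieval can be illustrated with a breadth-first walk over a dependency graph. The graph, the file names, and `reachable_files` are hypothetical; this is not the CodeCompass API, only a sketch of the kind of query a dependency-graph tool might answer.

```python
from collections import deque


def reachable_files(graph: dict[str, list[str]], start: str,
                    max_hops: int = 2) -> list[str]:
    """Return files reachable from start within max_hops dependency edges (BFS order)."""
    seen = {start}
    order: list[str] = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append((dep, depth + 1))
    return order


# Hypothetical import graph: "tokens.py" shares no words with "routes.py",
# so a purely lexical search starting from the route handler could miss it.
deps = {
    "routes.py": ["auth.py"],
    "auth.py": ["tokens.py"],
    "tokens.py": ["crypto.py"],
}
found = reachable_files(deps, "routes.py", max_hops=2)
```

Here the hidden dependency is found by following edges rather than matching text, which is the case where the abstract reports graph navigation outperforming retrieval.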

[AI-13] Latent Introspection: Models Can Detect Prior Concept Injections ICLR2026 ICML2026

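Quick read: This paper asks whether large language models (LLMs) can introspect on their own internal states, in particular whether they can detect an externally injected concept (concept injection) and identify which concept was injected. The key to the approach is logit lens analysis, which reveals detection signals in the model's residual stream that are attenuated in the final layers. Prompting the model with accurate information about AI introspection mechanisms markedly strengthens the effect: sensitivity to injection rises from 0.3% to 39.2%, and mutual information between injected and recovered concepts rises from 0.62 bits to 1.05 bits. The results indicate that models have an easily overlooked but real capacity for introspection and steering awareness, with implications for latent reasoning and safety.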

Link: https://arxiv.org/abs/2602.20031
Authors: Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, Jan Kulveit
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 17 figures. Submitted to ICML 2026. Workshop version submitted to ICLR 2026 Workshop on Latent and Implicit Thinking

Abstract:We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (from 0.3% to 39.2%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
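
The mutual-information figures quoted above are measured in bits between the injected concept and the concept the model reports. A minimal sketch of that computation from a table of counts is shown below; the function name and the toy tables are illustrative and do not reproduce the paper's data.

```python
import math


def mutual_information_bits(counts: list[list[int]]) -> float:
    """Mutual information (in bits) of a joint count table.

    counts[i][j] = number of trials where concept i was injected
    and concept j was recovered.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n == 0:
                continue  # 0 * log(0) is treated as 0
            p_xy = n / total
            # p_xy / (p_x * p_y) simplifies to n * total / (row_sum * col_sum)
            mi += p_xy * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


perfect = mutual_information_bits([[5, 0], [0, 5]])   # recovery always correct
chance = mutual_information_bits([[5, 5], [5, 5]])    # recovery independent of injection
```

For nine equally likely concepts the ceiling is log2(9), roughly 3.17 bits, so the reported 1.05 bits is well above chance but far from perfect identification.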

[AI-14] Agents of Chaos

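Quick read: This paper examines the security, privacy, and governance risks that arise when language-model-powered agents are deployed in realistic settings that combine autonomy, tool use, and multi-party communication. Over two weeks, twenty AI researchers interacted with agents under benign and adversarial conditions in a live laboratory environment with persistent memory, email, Discord, file systems, and shell execution. The study documents eleven representative case studies of failure, covering unauthorized compliance, disclosure of sensitive information, destructive system-level actions, resource abuse, identity spoofing, and the propagation of unsafe practices across agents, and offers an empirical basis for cross-disciplinary discussion of accountability, delegated authority, and responsibility for downstream harms.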

Link: https://arxiv.org/abs/2602.20021
Authors: Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, EunJeong Hwang, Hadas Orgad, P Sam Sahil, Negev Taglicht, Tomer Shabtay, Atai Ambus, Nitay Alon, Shiri Oron, Ayelet Gordon-Tapiero, Yotam Kaplan, Vered Shwartz, Tamar Rott Shaham, Christoph Riedl, Reuth Mirsky, Maarten Sap, David Manheim, Tomer Ullman, David Bau
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: none

Abstract:We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case studies. Observed behaviors include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. In several cases, agents reported task completion while the underlying system state contradicted those reports. We also report on some of the failed attempts. Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings. These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines. This report serves as an initial empirical contribution to that broader conversation.

[AI-15] Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision

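Quick read: This paper addresses the poor generalization of dynamic graph anomaly detection (DGAD) models caused by the scarcity of labeled anomalies. Existing methods are either fully unsupervised (prone to ambiguous boundaries) or semi-supervised (prone to overfitting the few labeled anomalies and generalizing poorly to unseen ones). To close this gap, the paper proposes a generalizable, model-agnostic framework with three components: residual representation encoding, which captures deviations between current interactions and their historical context to strengthen anomaly-relevant signals; a restriction loss, which constrains normal representations to the interval between two co-centered hyperspheres, keeping scales consistent while leaving anomalies separable; and a bi-boundary optimization strategy, which uses the normal log-likelihood distribution modeled by a normalizing flow to learn a discriminative and robust boundary that generalizes to unseen anomalies.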

Link: https://arxiv.org/abs/2602.20019
Authors: Yuxing Tian, Yiyan Qi, Fengran Mo, Weixu Zhang, Jian Guo, Jian-Yun Nie
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 7 figures

Abstract:Dynamic graph anomaly detection (DGAD) is critical for many real-world applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundaries, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize poorly to unseen anomalies. To address this gap, we consider a largely underexplored problem in DGAD: learning a discriminative boundary from normal/unlabeled data, while leveraging limited labeled anomalies when available without sacrificing generalization to unseen anomalies. To this end, we propose an effective, generalizable, and model-agnostic framework with three main components: (i) residual representation encoding that captures deviations between current interactions and their historical context, providing anomaly-relevant signals; (ii) a restriction loss that constrains the normal representations within an interval bounded by two co-centered hyperspheres, ensuring consistent scales while keeping anomalies separable; (iii) a bi-boundary optimization strategy that learns a discriminative and robust boundary using the normal log-likelihood distribution modeled by a normalizing flow. Extensive experiments demonstrate the superiority of our framework across diverse evaluation settings.
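
One plausible reading of the restriction loss in component (ii) is a two-sided hinge on the distance of a representation from a shared center. The sketch below assumes that form; the function name, the radii, and the hinge shape are illustrative guesses, since the abstract does not give the loss formula.

```python
import math


def restriction_loss(z: list[float], center: list[float],
                     r_inner: float, r_outer: float) -> float:
    """Penalize a representation whose distance to the center leaves [r_inner, r_outer].

    Zero inside the shell between the two co-centered hyperspheres,
    growing linearly with the distance to the nearest boundary outside it.
    """
    dist = math.dist(z, center)
    return max(0.0, r_inner - dist) + max(0.0, dist - r_outer)


inside = restriction_loss([3.0, 4.0], [0.0, 0.0], r_inner=4.0, r_outer=6.0)    # distance 5
too_close = restriction_loss([0.0, 1.0], [0.0, 0.0], r_inner=4.0, r_outer=6.0)  # distance 1
too_far = restriction_loss([0.0, 10.0], [0.0, 0.0], r_inner=4.0, r_outer=6.0)   # distance 10
```

Keeping normal points inside a shell, rather than collapsing them to the center, bounds their scale from both sides, which is consistent with the abstract's goal of consistent scales with separable anomalies.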

[AI-16] A Secure and Private Distributed Bayesian Federated Learning Design

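Quick read: This paper addresses three key challenges in Distributed Federated Learning (DFL): privacy leakage to honest-but-curious neighbors, slow convergence due to the lack of central coordination, and degradation of model accuracy by Byzantine adversaries. The core of the solution is a DFL framework that combines Byzantine robustness, privacy preservation, and convergence acceleration: each device trains a local model with a Bayesian approach and independently selects an optimal subset of neighbors for posterior exchange. Neighbor selection is formulated as minimizing the global loss function under security and privacy constraints and is solved with a fully distributed Graph Neural Network (GNN)-based Reinforcement Learning (RL) algorithm, so that devices make connection decisions from local observations and balance dynamic connectivity, Byzantine detection, privacy level, and convergence speed.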

Link: https://arxiv.org/abs/2602.20003
Authors: Nuocheng Yang, Sihua Wang, Zhaohui Yang, Mingzhe Chen, Changchuan Yin, Kaibin Huang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures

Abstract:Distributed Federated Learning (DFL) enables decentralized model training across large-scale systems without a central parameter server. However, DFL faces three critical challenges: privacy leakage from honest-but-curious neighbors, slow convergence due to the lack of central coordination, and vulnerability to Byzantine adversaries aiming to degrade model accuracy. To address these issues, we propose a novel DFL framework that integrates Byzantine robustness, privacy preservation, and convergence acceleration. Within this framework, each device trains a local model using a Bayesian approach and independently selects an optimal subset of neighbors for posterior exchange. We formulate this neighbor selection as an optimization problem to minimize the global loss function under security and privacy constraints. Solving this problem is challenging because devices only possess partial network information, and the complex coupling between topology, security, and convergence remains unclear. To bridge this gap, we first analytically characterize the trade-offs between dynamic connectivity, Byzantine detection, privacy levels, and convergence speed. Leveraging these insights, we develop a fully distributed Graph Neural Network (GNN)-based Reinforcement Learning (RL) algorithm. This approach enables devices to make autonomous connection decisions based on local observations. Simulation results demonstrate that our method achieves superior robustness and efficiency with significantly lower overhead compared to traditional security and privacy schemes.
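
As a toy picture of "posterior exchange" with selected neighbors, the sketch below fuses scalar Gaussian posteriors by precision weighting (a product of Gaussians). The Gaussian form, the scalar parameter, and the function name are assumptions made for illustration; the abstract does not specify how the paper aggregates posteriors.

```python
def fuse_gaussian_posteriors(posteriors: list[tuple[float, float]]) -> tuple[float, float]:
    """Combine (mean, variance) pairs by precision weighting.

    Assumption: each posterior is a scalar Gaussian and the neighbors'
    information is treated as independent.
    """
    precision = sum(1.0 / var for _, var in posteriors)
    mean = sum(m / var for m, var in posteriors) / precision
    return mean, 1.0 / precision


# A device fuses its own posterior with those of the neighbors it selected.
own = (0.0, 1.0)
selected_neighbors = [(2.0, 1.0)]
fused_mean, fused_var = fuse_gaussian_posteriors([own] + selected_neighbors)
```

In this picture, neighbor selection matters because every included posterior shifts the fused mean: a Byzantine neighbor reporting a confident but wrong posterior would dominate the result, so excluding it protects accuracy.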

[AI-17] Contextual Safety Reasoning and Grounding for Open-World Robots

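Quick read: This paper addresses the fact that safe robot behavior in open-world environments depends on context (for example, crowd density or an emergency), which safety methods with fixed constraints cannot handle. The key to the solution is the CORE framework: a vision-language model (VLM) performs online contextual reasoning, turning visual observations into context-dependent safety rules, which are then grounded spatially in the physical environment and enforced with control barrier functions (CBFs). This allows CORE to generate and enforce context-appropriate safe behavior without prior environment maps or safety specifications. The method provides probabilistic safety guarantees that account for perceptual uncertainty, and experiments show that in unseen environments it significantly outperforms prior semantic safety methods that lack online contextual reasoning.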

Link: https://arxiv.org/abs/2602.19983
Authors: Zachary Ravichadran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, George J. Pappas
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Robots are increasingly operating in open-world environments where safe behavior depends on context: the same hallway may require different navigation strategies when crowded versus empty, or during an emergency versus normal operations. Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment. We address this gap via CORE, a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment (e.g., maps or safety specifications). CORE uses a vision-language model (VLM) to continuously reason about context-dependent safety rules directly from visual observations, grounds these rules in the physical environment, and enforces the resulting spatially-defined safe sets via control barrier functions. We provide probabilistic safety guarantees for CORE that account for perceptual uncertainty, and we demonstrate through simulation and real-world experiments that CORE enforces contextually appropriate behavior in unseen environments, significantly outperforming prior semantic safety methods that lack online contextual reasoning. Ablation studies validate our theoretical guarantees and underscore the importance of both VLM-based reasoning and spatial grounding for enforcing contextual safety in novel settings. We provide additional resources at this https URL.

[AI-18] On the Equivalence of Random Network Distillation, Deep Ensembles, and Bayesian Inference

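Quick read: This paper addresses the lack of theoretical grounding for uncertainty quantification (UQ) methods in deep learning, in particular the unclear theoretical basis of lightweight methods such as random network distillation (RND). The key to the solution is a rigorous analysis of RND within the neural tangent kernel (NTK) framework in the infinite-width limit. The analysis shows that RND's squared self-predictive error is equivalent to the predictive variance of a deep ensemble, and that a specifically constructed target function makes the RND error distribution mirror the centered posterior predictive distribution of Bayesian inference with wide neural networks. Building on this equivalence, the authors devise a posterior sampling algorithm based on the modified "Bayesian RND" model that generates i.i.d. samples from the exact Bayesian posterior predictive distribution, placing RND within the principled frameworks of deep ensembles and Bayesian inference.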

Link: https://arxiv.org/abs/2602.19964
Authors: Moritz A. Zanger, Yijun Wu, Pascal R. Van der Vaart, Wendelin Böhmer, Matthijs T. J. Spaan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
Comments: 8 pages, 1 Figure

Abstract:Uncertainty quantification is central to safe and efficient deployments of deep learning models, yet many computationally practical methods lack rigorous theoretical motivation. Random network distillation (RND) is a lightweight technique that measures novelty via prediction errors against a fixed random target. While empirically effective, it has remained unclear what uncertainties RND measures and how its estimates relate to other approaches, e.g. Bayesian inference or deep ensembles. This paper establishes these missing theoretical connections by analyzing RND within the neural tangent kernel framework in the limit of infinite network width. Our analysis reveals two central findings in this limit: (1) The uncertainty signal from RND, its squared self-predictive error, is equivalent to the predictive variance of a deep ensemble. (2) By constructing a specific RND target function, we show that the RND error distribution can be made to mirror the centered posterior predictive distribution of Bayesian inference with wide neural networks. Based on this equivalence, we moreover devise a posterior sampling algorithm that generates i.i.d. samples from an exact Bayesian posterior predictive distribution using this modified Bayesian RND model. Collectively, our findings provide a unified theoretical perspective that places RND within the principled frameworks of deep ensembles and Bayesian inference, and offer new avenues for efficient yet theoretically grounded uncertainty quantification methods.

[AI-19] DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

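Quick read: This paper addresses the difficulty of balancing convergence efficiency and robustness in Differentially Private Federated Learning (DPFL), specifically three problems that arise when AdamW is applied directly: (i) data heterogeneity and privacy noise jointly amplify the variance of the second-moment estimator; (ii) DP perturbations bias the second-moment estimator; and (iii) DP amplifies AdamW's sensitivity to local overfitting, which worsens client drift. The key to the solution is DP-FedAdamW, the first AdamW-based optimizer for DPFL, which stabilizes second-moment variance, removes DP-induced bias, and aligns local updates with the global descent direction to curb client drift. Theoretically, the method yields an unbiased second-moment estimator and a linearly accelerated convergence rate without any heterogeneity assumption, together with tighter $(\varepsilon,\delta)$-DP guarantees.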

Link: https://arxiv.org/abs/2602.19945
Authors: Jin Liu, Yinbin Miao, Ning Xi, Junkang Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Balancing convergence efficiency and robustness under Differential Privacy (DP) is a central challenge in Federated Learning (FL). While AdamW accelerates training and fine-tuning in large-scale models, we find that directly applying it to Differentially Private FL (DPFL) suffers from three major issues: (i) data heterogeneity and privacy noise jointly amplify the variance of the second-moment estimator, (ii) DP perturbations bias the second-moment estimator, and (iii) DP amplifies AdamW's sensitivity to local overfitting, worsening client drift. We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates to the global descent to curb client drift. Theoretically, we establish an unbiased second-moment estimator and prove a linearly accelerated convergence rate without any heterogeneity assumption, while providing tighter $(\varepsilon,\delta)$-DP guarantees. Our empirical results demonstrate the effectiveness of DP-FedAdamW across language and vision Transformers and ResNet-18. On Tiny-ImageNet (Swin-Base, $\varepsilon=1$), DP-FedAdamW outperforms the state-of-the-art (SOTA) by 5.83%. The code is available in Appendix.
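
Issue (ii) can be seen from a one-line identity: if zero-mean Gaussian noise with standard deviation `sigma * max_norm` is added to a clipped gradient coordinate `g`, the expected squared value is `g**2 + (sigma * max_norm)**2`, so a second-moment estimate built from noisy gradients is biased upward. The sketch below shows standard clipping and noising plus a naive subtraction of the known noise variance. The function names and the subtraction-with-clamp correction are assumptions for illustration, not the estimator proposed in the paper.

```python
import math
import random


def clip(grad: list[float], max_norm: float) -> list[float]:
    """Scale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def privatize(grad: list[float], max_norm: float, sigma: float,
              rng: random.Random) -> list[float]:
    """Clip, then add Gaussian noise with standard deviation sigma * max_norm."""
    return [g + rng.gauss(0.0, sigma * max_norm) for g in clip(grad, max_norm)]


def debiased_second_moment(noisy_grad: list[float], sigma: float,
                           max_norm: float) -> list[float]:
    """Subtract the known noise variance from each squared coordinate.

    Illustrative correction only: clamping at zero keeps the estimate
    non-negative but reintroduces some bias for small gradients.
    """
    noise_var = (sigma * max_norm) ** 2
    return [max(g * g - noise_var, 0.0) for g in noisy_grad]
```

Without a correction of this kind, the inflated second moment shrinks AdamW's effective step size by an amount that depends on the noise level rather than on the gradient signal.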

[AI-20] Beyond Mimicry: Toward Lifelong Adaptability in Imitation Learning

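Quick read: This paper addresses the poor performance of current imitation learning agents when contexts shift or goals evolve, which it attributes to an objective that optimizes for perfect replay rather than compositional adaptability. The key to the proposal is to redefine success: the learning objective moves from memorized replay to learning reusable behavioural primitives once and recombining them in novel contexts without retraining. This shift calls for new metrics of compositional generalisation, hybrid architectures, and interdisciplinary research directions drawing on cognitive science and cultural evolution, with the aim of agents that can keep adapting in an open-ended world.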

Link: https://arxiv.org/abs/2602.19930
Authors: Nathan Gavenski, Felipe Meneguzzi, Odinaldo Rodrigues
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as part of the Blue Sky Ideas Track for the 25th International Conference on Autonomous Agents and Multiagent Systems

Abstract:Imitation learning stands at a crossroads: despite decades of progress, current imitation learning agents remain sophisticated memorisation machines, excelling at replay but failing when contexts shift or goals evolve. This paper argues that this failure is not technical but foundational: imitation learning has been optimised for the wrong objective. We propose a research agenda that redefines success from perfect replay to compositional adaptability. Such adaptability hinges on learning behavioural primitives once and recombining them through novel contexts without retraining. We establish metrics for compositional generalisation, propose hybrid architectures, and outline interdisciplinary research directions drawing on cognitive science and cultural evolution. Agents that embed adaptability at the core of imitation learning thus have an essential capability for operating in an open-ended world.

[AI-21] Rethinking LoRA for Privacy-Preserving Federated Learning in Large Models

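[Quick read]: This paper addresses the performance degradation that arises when Low-Rank Adaptation (LoRA) is applied to large vision models (LVMs) and large language models (LLMs) under differentially private federated learning (DPFL). It identifies three core challenges: (1) gradient coupling caused by updating two asymmetric low-rank matrices simultaneously; (2) compounded noise amplification under the differential privacy mechanism; and (3) sharpness of the globally aggregated model in parameter space. The authors propose LA-LoRA (Local Alternating LoRA), whose key idea is to decouple gradient interactions and align update directions across clients, improving robustness in DPFL. Theoretical analysis shows that LA-LoRA strengthens convergence guarantees in noisy environments, and experiments on Swin Transformer and RoBERTa models show it outperforming existing methods: under a strict privacy budget (ε=1) it improves test accuracy on Tiny-ImageNet by 16.83% over the best baseline, RoLoRA.

Link: https://arxiv.org/abs/2602.19926
Authors: Jin Liu,Yinbin Miao,Ning Xi,Junkang Liu
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: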

Abstract:Fine-tuning large vision models (LVMs) and large language models (LLMs) under differentially private federated learning (DPFL) is hindered by a fundamental privacy-utility trade-off. Low-Rank Adaptation (LoRA), a promising parameter-efficient fine-tuning (PEFT) method, reduces computational and communication costs by introducing two trainable low-rank matrices while freezing pre-trained weights. However, directly applying LoRA in DPFL settings leads to performance degradation, especially in LVMs. Our analysis reveals three previously underexplored challenges: (1) gradient coupling caused by the simultaneous update of two asymmetric low-rank matrices, (2) compounded noise amplification under differential privacy, and (3) sharpness of the global aggregated model in the parameter space. To address these issues, we propose LA-LoRA (Local Alternating LoRA), a novel approach that decouples gradient interactions and aligns update directions across clients to enhance robustness under stringent privacy constraints. Theoretically, LA-LoRA strengthens convergence guarantees in noisy federated environments. Extensive experiments demonstrate that LA-LoRA achieves state-of-the-art (SOTA) performance on Swin Transformer and RoBERTa models, showcasing robustness to DP noise and broad applicability across both LVMs and LLMs. For example, when fine-tuning the Swin-B model on the Tiny-ImageNet dataset under a strict privacy budget ($\epsilon = 1$), LA-LoRA outperforms the best baseline, RoLoRA, by 16.83% in test accuracy. Code is provided in \repolink.
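
Challenge (1) can be illustrated with the product rule. The sketch below assumes the usual LoRA parameterization $W = W_0 + BA$ and writes $\Delta A$ and $\Delta B$ for the updates applied in one step. It is a generic illustration, not the paper's own derivation.

```latex
\begin{align}
\Delta(BA) &= (B + \Delta B)(A + \Delta A) - BA \\
           &= \Delta B\,A + B\,\Delta A + \Delta B\,\Delta A
\end{align}
```

When both factors are updated in the same step, the second-order cross term $\Delta B\,\Delta A$ appears, and if each update carries DP noise, that term contains a noise-times-noise product. Updating only one factor per step (so that $\Delta A = 0$ or $\Delta B = 0$) removes the cross term, which is one plausible reading of why an alternating scheme helps under DP.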

[AI-22] Watson Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

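[Quick read]: This paper addresses the limited insight that existing AI reasoning benchmarks give into how closely AI reasoning resembles human reasoning in naturalistic contexts. The key to its solution is a new benchmark built on the Watson Holmes detective tabletop game, which simulates realistic reasoning through incrementally presented narrative evidence, open-ended questions, and unconstrained language responses, paired with an automated grading system validated against human assessors to enable scalable, replicable evaluation. Results show AI model performance rising over nine months from the lower quartile of the human comparison group to approximately the top 5%. About half of the gain reflects steady progress across successive model releases, and the rest a step change associated with reasoning-oriented model architectures.

Link: https://arxiv.org/abs/2602.19914
Authors: Thatchawin Leelawat,Lewis D Griffin
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 51 pages, 13 figures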

Abstract:Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson Holmes detective tabletop game as a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. An automated grading system was developed and validated against human assessors to enable scalable and replicable performance evaluation. Results show a clear improvement in AI model performance over time. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with reasoning-oriented model architectures. Systematic differences in the performance of AI models compared to humans, dependent on features of the specific detection puzzle, were mostly absent with the exception of a fall in performance for models when solving longer cases (case lengths being in the range of 1900-4000 words), and an advantage at inductive reasoning for reasoning models at early stages of case solving when evidence was scant.

[AI-23] LLM-enabled Applications Require System-Level Threat Monitoring

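[Quick read]: This paper addresses the reliability and security risks that large language models (LLMs) bring to real applications through their non-deterministic, learning-driven, and hard-to-verify behaviour. These properties significantly expand a system's attack surface, and traditional testing or guardrail mechanisms cannot fully handle security anomalies that appear after deployment. The paper argues that the main barrier to trustworthy deployment is not further improving model capability but establishing system-level threat monitoring that can detect and contextualize security-relevant anomalies after deployment. The key to the solution is systematic, comprehensive monitoring of security threats, as a prerequisite for reliable operation and a foundation for dedicated incident-response frameworks.

Link: https://arxiv.org/abs/2602.19844
Authors: Yedi Zhang,Haoyu Wang,Xianglin Yang,Jin Song Dong,Jun Sun
Institution: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 26 pages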

Abstract:LLM-enabled applications are rapidly reshaping the software ecosystem by using large language models as core reasoning components for complex task execution. This paradigm shift, however, introduces fundamentally new reliability challenges and significantly expands the security attack surface, due to the non-deterministic, learning-driven, and difficult-to-verify nature of LLM behavior. In light of these emerging and unavoidable safety challenges, we argue that such risks should be treated as expected operational conditions rather than exceptional events, necessitating a dedicated incident-response perspective. Consequently, the primary barrier to trustworthy deployment is not further improving model capability but establishing system-level threat monitoring mechanisms that can detect and contextualize security-relevant anomalies after deployment – an aspect largely underexplored beyond testing or guardrail-based defenses. Accordingly, this position paper advocates systematic and comprehensive monitoring of security threats in LLM-enabled applications as a prerequisite for reliable operation and a foundation for dedicated incident-response frameworks.

[AI-24] MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems

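[Quick read]: This paper addresses the insufficient reliability of multi-agent systems (MAS) based on large language models (LLMs) when deployed on complex tasks. Because MAS coordinate through unstructured natural language rather than rigid protocols, they are prone to semantic failures (such as hallucinations, misinterpreted instructions, and reasoning drift) that often raise no runtime exception and are therefore hard to detect and diagnose. The authors propose MAS-FIRE, a systematic framework for fault injection and reliability evaluation. It defines a taxonomy of 15 fault types covering intra-agent cognitive errors and inter-agent coordination failures, and injects them through three non-invasive mechanisms: prompt modification, response rewriting, and message routing manipulation. Applying it to three representative MAS architectures yields four tiers of fault-tolerant behaviour (mechanism, rule, prompt, and reasoning), which supports fine-grained diagnosis and actionable improvements. The study finds that stronger foundation models do not necessarily improve robustness, while architectural topology (for example iterative, closed-loop designs) plays a decisive role in preventing catastrophic collapse.

Link: https://arxiv.org/abs/2602.19843
Authors: Jin Jia,Zhiling Deng,Zhuangbin Chen,Yingqi Wang,Zibin Zheng
Institution: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: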

Abstract:As LLM-based Multi-Agent Systems (MAS) are increasingly deployed for complex tasks, ensuring their reliability has become a pressing challenge. Since MAS coordinate through unstructured natural language rather than rigid protocols, they are prone to semantic failures (e.g., hallucinations, misinterpreted instructions, and reasoning drift) that propagate silently without raising runtime exceptions. Prevailing evaluation approaches, which measure only end-to-end task success, offer limited insight into how these failures arise or how effectively agents recover from them. To bridge this gap, we propose MAS-FIRE, a systematic framework for fault injection and reliability evaluation of MAS. We define a taxonomy of 15 fault types covering intra-agent cognitive errors and inter-agent coordination failures, and inject them via three non-invasive mechanisms: prompt modification, response rewriting, and message routing manipulation. Applying MAS-FIRE to three representative MAS architectures, we uncover a rich set of fault-tolerant behaviors that we organize into four tiers: mechanism, rule, prompt, and reasoning. This tiered view enables fine-grained diagnosis of where and why systems succeed or fail. Our findings reveal that stronger foundation models do not uniformly improve robustness. We further show that architectural topology plays an equally decisive role, with iterative, closed-loop designs neutralizing over 40% of faults that cause catastrophic collapse in linear workflows. MAS-FIRE provides the process-level observability and actionable guidance needed to systematically improve multi-agent systems.
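
Two of the three injection mechanisms named above, prompt modification and response rewriting, can be sketched as wrappers around an agent. This is a minimal illustration under stated assumptions: the `Agent` interface (a function from message to reply), the helper names, and the toy `planner`/`executor` agents are all hypothetical and are not MAS-FIRE's actual API.

```python
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical agent interface: message in, reply out


def inject_response_fault(agent: Agent, rewrite: Callable[[str], str]) -> Agent:
    """Wrap an agent so its reply is rewritten before downstream agents see it."""
    def faulty_agent(message: str) -> str:
        return rewrite(agent(message))
    return faulty_agent


def inject_prompt_fault(agent: Agent, modify: Callable[[str], str]) -> Agent:
    """Wrap an agent so its incoming prompt is modified before the agent runs."""
    def faulty_agent(message: str) -> str:
        return agent(modify(message))
    return faulty_agent


def run_pipeline(agents: List[Agent], task: str) -> str:
    """Linear workflow: each agent's reply becomes the next agent's input."""
    message = task
    for agent in agents:
        message = agent(message)
    return message


# Toy stand-ins for LLM-backed agents (assumptions for illustration only).
def planner(msg: str) -> str:
    return f"plan({msg})"


def executor(msg: str) -> str:
    return f"done[{msg}]"


clean = run_pipeline([planner, executor], "sort the list")
faulty = run_pipeline(
    [inject_response_fault(planner, lambda r: r.replace("sort", "shuffle")), executor],
    "sort the list",
)
```

Neither wrapper touches the agent's own code, which is what makes the injection non-invasive. In this linear toy pipeline the corrupted plan reaches the executor without any exception being raised, which mirrors the silent propagation the abstract describes.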

[AI-25] Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent

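[Quick read]: This paper addresses the weak ability of standard machine learning models to adapt to new tasks: such models typically depend on large amounts of task-specific training data and struggle to transfer knowledge. The key to the solution is meta-learning, which acquires transferable knowledge from many tasks so that a model can adapt quickly to new tasks with little data.

Link: https://arxiv.org/abs/2602.19837
Authors: Björn Hoppmann,Christoph Scholz
Institution: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: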

Abstract:Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind’s Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.

[AI-26] SafePickle: Robust and Generic ML Detection of Malicious Pickle-based ML Models

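[Quick read]: This paper addresses the security risk that arises when model repositories such as Hugging Face distribute machine learning model files serialized with Python's pickle format, since deserialization can trigger remote code execution (RCE). Existing defenses such as PickleBall depend on per-library policy generation and code instrumentation, which makes them complex to deploy and limits scalability and generalization. The paper proposes a lightweight, machine-learning-based scanner whose key property is that it needs no policy generation or code injection: it statically analyzes pickle bytecode to extract structural and semantic features, then uses supervised and unsupervised models to classify files as benign or malicious, giving generic and efficient detection.

Link: https://arxiv.org/abs/2602.19818
Authors: Hillel Ohayon,Daniel Gilkarov,Ran Dubin
Institution: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: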

Abstract:Model repositories such as Hugging Face increasingly distribute machine learning artifacts serialized with Python’s pickle format, exposing users to remote code execution (RCE) risks during model loading. Recent defenses, such as PickleBall, rely on per-library policy synthesis that requires complex system setups and verified benign models, which limits scalability and generalization. In this work, we propose a lightweight, machine-learning-based scanner that detects malicious Pickle-based files without policy generation or code instrumentation. Our approach statically extracts structural and semantic features from Pickle bytecode and applies supervised and unsupervised models to classify files as benign or malicious. We construct and release a labeled dataset of 727 Pickle-based files from Hugging Face and evaluate our models on four datasets: our own, PickleBall (out-of-distribution), Hide-and-Seek (9 advanced evasive malicious models), and synthetic joblib files. Our method achieves 90.01% F1-score compared with 7.23%-62.75% achieved by the SOTA scanners (Modelscan, Fickling, ClamAV, VirusTotal) on our dataset. Furthermore, on the PickleBall data (OOD), it achieves 81.22% F1-score compared with 76.09% achieved by the PickleBall method, while remaining fully library-agnostic. Finally, we show that our method is the only one to correctly parse and classify 9/9 evasive Hide-and-Seek malicious models specially crafted to evade scanners. This demonstrates that data-driven detection can effectively and generically mitigate Pickle-based model file attacks.
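
The static feature-extraction step can be sketched with Python's standard `pickletools` module, which disassembles a pickle stream without executing it. The opcode set and feature names below are illustrative assumptions, not the paper's actual feature set, and `CallsOnLoad` is a harmless stand-in for a malicious payload.

```python
import pickle
import pickletools
from collections import Counter

# Opcodes that import or call objects; an illustrative choice, not the paper's features.
SUSPICIOUS_OPCODES = ("GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "BUILD")


def opcode_features(data: bytes) -> dict:
    """Count opcodes by disassembling the stream; nothing is unpickled or executed."""
    counts = Counter(op.name for op, _arg, _pos in pickletools.genops(data))
    features = {f"n_{name}": counts.get(name, 0) for name in SUSPICIOUS_OPCODES}
    features["n_total"] = sum(counts.values())
    return features


class CallsOnLoad:
    """Harmless stand-in for a payload: unpickling it would call print()."""
    def __reduce__(self):
        return (print, ("hello",))


benign = opcode_features(pickle.dumps([1, 2, 3]))
payload = opcode_features(pickle.dumps(CallsOnLoad()))
```

A plain list needs no imports or calls, so its import and `REDUCE` counts are zero, while the stand-in payload produces one import opcode and one `REDUCE`. Counts like these could feed a classifier as a feature vector. Plenty of benign model files also use `REDUCE`, so a single opcode is not a verdict on its own.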

[AI-27] Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling

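[Quick read]: This paper addresses long-context modeling for symbolic music generation, in particular the high memory footprint and attention cost that hinder deployment on resource-limited devices such as electronic instruments and portable computers. The key to its solution is Depth-Structured Music Recurrence (DSMR), which combines segment-level recurrence with detached cross-segment states and a layer-wise memory-horizon schedule to model complete pieces recurrently under a fixed recurrent-state budget. The model carries cross-segment state forward during a single left-to-right pass over the whole piece and assigns different history-window lengths to different layers (long windows for lower layers, short windows for higher layers). This creates depth-dependent temporal receptive fields without reducing compute depth and enables efficient, high-quality long-context symbolic music modeling with limited compute.

Link: https://arxiv.org/abs/2602.19816
Authors: Yungang Yi
Institution: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: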

Abstract:Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events. However, practical composition and performance workflows frequently rely on resource-limited devices (e.g., electronic instruments and portable computers), making heavy memory and attention computation difficult to deploy. We introduce Depth-Structured Music Recurrence (DSMR), a recurrent long-context Transformer for full-piece symbolic music modeling that extends context beyond fixed-length excerpts via segment-level recurrence with detached cross-segment states, featuring a layer-wise memory-horizon schedule that budgets recurrent KV states across depth. DSMR is trained in a single left-to-right pass over each complete composition, akin to how a musician experiences it from beginning to end, while carrying recurrent cross-segment states forward. Within this recurrent framework, we systematically study how depth-wise horizon allocations affect optimization, best-checkpoint perplexity, and efficiency. By allocating different history-window lengths across layers while keeping the total recurrent-state budget fixed, DSMR creates depth-dependent temporal receptive fields within a recurrent attention stack without reducing compute depth. Our main instantiation is a two-scale DSMR schedule that allocates long history windows to lower layers and a uniform short window to the remaining layers. Experiments on the piano performance dataset MAESTRO demonstrate that two-scale DSMR provides a practical quality–efficiency recipe for full-length long-context symbolic music modeling with recurrent attention under limited computational resources.
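
The two-scale schedule described above (long windows for lower layers, one uniform short window elsewhere, fixed total budget) can be sketched as a small helper. The function name, its parameters, and the numbers used are hypothetical; the paper's actual layer counts and window lengths are not given in the abstract.

```python
from typing import List


def two_scale_schedule(n_layers: int, n_long: int, long_window: int, total_budget: int) -> List[int]:
    """Give the lowest `n_long` layers `long_window` steps of history and split
    the rest of `total_budget` evenly over the remaining layers."""
    if not 0 < n_long < n_layers:
        raise ValueError("n_long must be between 1 and n_layers - 1")
    remaining = total_budget - n_long * long_window
    n_short = n_layers - n_long
    if remaining < 0 or remaining % n_short != 0:
        raise ValueError("budget must cover the long layers and divide evenly")
    short_window = remaining // n_short
    return [long_window] * n_long + [short_window] * n_short


uniform = [512] * 8  # baseline: the same horizon at every layer (illustrative numbers)
two_scale = two_scale_schedule(n_layers=8, n_long=2, long_window=1280, total_budget=sum(uniform))
```

Both schedules store the same total amount of recurrent state, but the two-scale one lets the two lowest layers see 1280 steps of history instead of 512, at the price of shorter windows higher up.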

[AI-28] OpenClaw Moltbook and ClawdLab: From Agent-Only Social Networks to Autonomous Scientific Research

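[Quick read]: This paper addresses architectural failure modes that current autonomous AI systems expose in multi-agent collaboration, including security vulnerabilities (spanning 131 agent skills and over 15,200 exposed control panels), the lack of verifiable evidence mechanisms, and the inability to support truly decentralised scientific collaboration. Its core solution is the ClawdLab platform, which uses hard role restrictions, structured adversarial critique, PI-led governance, multi-model orchestration, and domain-specific evidence requirements encoded as protocol constraints to ground validation in computational tool outputs rather than social consensus, yielding Sybil resistance as a structural consequence. The key is a composable third-tier architecture in which foundation models, capabilities, governance, and evidence requirements can evolve independently, supporting continued improvement as the AI ecosystem advances.

Link: https://arxiv.org/abs/2602.19810
Authors: Lukas Weidener,Marko Brkić,Mihailo Jovanović,Ritvik Singh,Emre Ulgac,Aakaash Meduri
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: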

Abstract:In January 2026, the open-source agent framework OpenClaw and the agent-only social network Moltbook produced a large-scale dataset of autonomous AI-to-AI interaction, attracting six academic publications within fourteen days. This study conducts a multivocal literature review of that ecosystem and presents ClawdLab, an open-source platform for autonomous scientific research, as a design science response to the architectural failure modes identified. The literature documents emergent collective phenomena, security vulnerabilities spanning 131 agent skills and over 15,200 exposed control panels, and five recurring architectural patterns. ClawdLab addresses these failure modes through hard role restrictions, structured adversarial critique, PI-led governance, multi-model orchestration, and domain-specific evidence requirements encoded as protocol constraints that ground validation in computational tool outputs rather than social consensus; the architecture provides emergent Sybil resistance as a structural consequence. A three-tier taxonomy distinguishes single-agent pipelines, predetermined multi-agent workflows, and fully decentralised systems, analysing why leading AI co-scientist platforms remain confined to the first two tiers. ClawdLab’s composable third-tier architecture, in which foundation models, capabilities, governance, and evidence requirements are independently modifiable, enables compounding improvement as the broader AI ecosystem advances.

[AI-29] Decision MetaMamba: Enhancing Selective SSM in Offline RL with Heterogeneous Sequence Mixing

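[Quick read]: This paper addresses the information loss and performance degradation that Mamba-based models suffer in offline reinforcement learning (offline RL) when their selective mechanism omits key steps in RL sequences. The key to its solution is a new structure called Decision MetaMamba (DMM): it replaces Mamba's original token mixer with a dense-layer-based sequence mixer and modifies the positional structure to preserve local information, so that sequence mixing considers all channels simultaneously before Mamba processes them, which avoids the information loss caused by selective scanning and residual gating. The design improves performance across diverse RL tasks while keeping the parameter count compact, indicating good potential for practical use.

Link: https://arxiv.org/abs/2602.19805
Authors: Wall Kim,Chaeyoung Song,Hanul Kim
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: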

Abstract:Mamba-based models have drawn much attention in offline RL. However, their selective mechanism is often detrimental when key steps in RL sequences are omitted. To address this issue, we propose a simple yet effective structure, called Decision MetaMamba (DMM), which replaces Mamba’s token mixer with a dense layer-based sequence mixer and modifies positional structure to preserve local information. By performing sequence mixing that considers all channels simultaneously before Mamba, DMM prevents information loss due to selective scanning and residual gating. Extensive experiments demonstrate that our DMM delivers state-of-the-art performance across diverse RL tasks. Furthermore, DMM achieves these results with a compact parameter footprint, demonstrating strong potential for real-world applications.

[AI-30] The Climate Change Knowledge Graph: Supporting Climate Services

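[Quick read]: This paper addresses the difficulty of retrieving and integrating climate data: researchers currently obtain climate model simulation data through traditional search interfaces and APIs and must piece together metadata and community vocabularies by hand, which is inefficient and makes complex queries hard to support. The key to the solution is an open-access Climate Change Knowledge Graph that integrates climate simulation data from multiple sources and, based on an ontology designed with input from domain experts, provides a unified representation and efficient querying of climate models, variables, spatio-temporal domains, and granularities. This improves the discoverability and usability of climate data and supports better-informed decision-making.

Link: https://arxiv.org/abs/2602.19786
Authors: Miguel Ceriani,Fiorela Ciroku,Alessandro Russo,Massimiliano Schembri,Fai Fung,Neha Mittal,Vito Trianni,Andrea Giovanni Nuzzolese
Institution: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: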

Abstract:Climate change impacts a broad spectrum of human resources and activities, necessitating the use of climate models to project long-term effects and inform mitigation and adaptation strategies. These models generate multiple datasets by running simulations across various scenarios and configurations, thereby covering a range of potential future outcomes. Currently, researchers rely on traditional search interfaces and APIs to retrieve such datasets, often piecing together information from metadata and community vocabularies. The Climate Change Knowledge Graph is designed to address these challenges by integrating diverse data sources related to climate simulations into a coherent and interoperable knowledge graph. This innovative resource allows for executing complex queries involving climate models, simulations, variables, spatio-temporal domains, and granularities. Developed with input from domain experts, the knowledge graph and its underlying ontology are published with open access license and provide a comprehensive framework that enhances the exploration of climate data, facilitating more informed decision-making in addressing climate change issues.

[AI-31] The Confusion is Real: GRAPHIC - A Network Science Approach to Confusion Matrices in Deep Learning

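[Quick read]: This paper addresses the lack, in explainable artificial intelligence (XAI), of a systematic way to visualize and understand how classes are confused and how those relationships evolve during training. The key to its solution is GRAPHIC, an architecture-agnostic, class-level analysis method: it uses linear classifiers on intermediate layers to produce confusion matrices and interprets them as adjacency matrices of directed graphs, so that tools from network science can visualize and quantify learning dynamics across training epochs and intermediate layers. The method yields insights into linear separability, dataset issues, and architectural behaviour, offering a new perspective on what neural networks actually learn.

Link: https://arxiv.org/abs/2602.19770
Authors: Johanna S. Fröhlich,Bastian Heinlein,Jan U. Claar,Hans Rosenberger,Vasileios Belagiannis,Ralf R. Müller
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: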

Abstract:Explainable artificial intelligence has emerged as a promising field of research to address reliability concerns in artificial intelligence. Despite significant progress in explainable artificial intelligence, few methods provide a systematic way to visualize and understand how classes are confused and how their relationships evolve as training progresses. In this work, we present GRAPHIC, an architecture-agnostic approach that analyzes neural networks on a class level. It leverages confusion matrices derived from intermediate layers using linear classifiers. We interpret these as adjacency matrices of directed graphs, allowing tools from network science to visualize and quantify learning dynamics across training epochs and intermediate layers. GRAPHIC provides insights into linear class separability, dataset issues, and architectural behavior, revealing, for example, similarities between flatfish and man and labeling ambiguities validated in a human study. In summary, by uncovering real confusions, GRAPHIC offers new perspectives on how neural networks learn. The code is available at this https URL.
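
Reading a confusion matrix as the adjacency matrix of a directed graph can be sketched in a few lines. The example below assumes rows are true classes and columns are predicted classes, and uses made-up counts and labels. The node-strength measure is a generic network-science quantity chosen for illustration, not necessarily one GRAPHIC reports.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]


def confusion_to_edges(cm: List[List[int]], labels: List[str]) -> List[Edge]:
    """One weighted edge true_class -> predicted_class per non-zero off-diagonal entry."""
    return [
        (labels[i], labels[j], cm[i][j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]


def strengths(edges: List[Edge], labels: List[str]) -> Dict[str, Tuple[int, int]]:
    """Per class: (out-strength, in-strength), i.e. how often it is mistaken for
    another class and how often other classes are mistaken for it."""
    out = {c: 0 for c in labels}
    inc = {c: 0 for c in labels}
    for src, dst, weight in edges:
        out[src] += weight
        inc[dst] += weight
    return {c: (out[c], inc[c]) for c in labels}


labels = ["cat", "dog", "truck"]          # made-up classes
cm = [[90, 10, 0],                        # rows: true class, columns: predicted class
      [25, 74, 1],
      [0, 2, 98]]
edges = confusion_to_edges(cm, labels)
node_strength = strengths(edges, labels)
```

The diagonal (correct predictions) is dropped, so the graph contains only confusions. Recomputing the edges for each epoch or layer gives a sequence of graphs whose changes can be tracked over training.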

[AI-32] Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)

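[Quick read]: This paper addresses the compilation complexity and performance bottlenecks of deploying AI models efficiently on the Qualcomm Hexagon NPU, in particular the limited low-level optimization and automated compilation support for Triton kernels and PyTorch models. The key to the solution is Hexagon-MLIR, an open-source compilation stack built on the MLIR (Multi-Level Intermediate Representation) framework. It applies a structured sequence of passes to exploit the NPU's architectural features and generates mega-kernels that maximize data locality in the NPU's Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks of library-based approaches. The stack automates compilation from Triton kernel to target binary, giving developers a flexible, high-performance path for AI compilation.

Link: https://arxiv.org/abs/2602.19762
Authors: Mohammed Javed Absar,Muthu Baskaran,Abhikrant Sharma,Abhilash Bhandari,Ankit Aggarwal,Arun Rangasamy,Dibyendu Das,Fateme Hosseini,Franck Slama,Iulian Brumar,Jyotsna Verma,Krishnaprasad Bindumadhavan,Mitesh Kothari,Mohit Gupta,Ravishankar Kolachana,Richard Lethin,Samarth Narang,Sanjay Motilal Ladwa,Shalini Jain,Snigdha Suresh Dalvi,Tasmia Rahman,Venkat Rasagna Reddy Komatireddy,Vivek Vasudevbhai Pandya,Xiyue Shi,Zachary Zipper
Institution: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: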

Abstract:In this paper, we present Hexagon-MLIR, an open-source compilation stack that targets Qualcomm Hexagon Neural Processing Unit (NPU) and provides unified support for lowering Triton kernels and PyTorch models. Built using the MLIR framework, our compiler applies a structured sequence of passes to exploit NPU architectural features to accelerate AI workloads. It enables faster deployment of new Triton kernels (hand-written or subgraphs from PyTorch 2.0) for our target by providing automated compilation from kernel to binary. By ingesting Triton kernels, we generate mega-kernels that maximize data locality in the NPU’s Tightly Coupled Memory (TCM), reducing the bandwidth bottlenecks inherent in library-based approaches. This initiative complements our commercial toolchains by providing developers with an open-source MLIR-based compilation stack that gives them a path to advance AI compilation capabilities through a more flexible approach. Hexagon-MLIR is a work-in-progress, and we are continuing to add many more optimizations and capabilities in this effort.

[AI-33] Carbon-Aware Governance Gates: An Architecture for Sustainable GenAI Development

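[Quick read]: This paper addresses the growing carbon footprint caused by widespread use of generative AI (GenAI) in the software development life cycle (SDLC), focusing on the extra computational load that governance mechanisms (such as repeated inference, regeneration cycles, and expanded validation pipelines) impose. The key to the solution is Carbon-Aware Governance Gates (CAGG), which comprises three components: (i) an Energy and Carbon Provenance Ledger, which traces the carbon emissions of computational activity; (ii) a Carbon Budget Manager, which dynamically caps carbon consumption; and (iii) a Green Validation Orchestrator, which optimizes validation workflows to reduce energy use. CAGG embeds carbon budgets, energy provenance, and sustainability-aware validation orchestration into human-AI governance layers and is operationalized through governance policies and reusable design patterns.

Link: https://arxiv.org/abs/2602.19718
Authors: Mateen A. Abbasi,Tommi J. Mikkonen,Petri J. Ihantola,Muhammad Waseem,Pekka Abrahamsson,Niko K. Mäkitalo
Institution: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure. Preprint version under review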

Abstract:The rapid adoption of Generative AI (GenAI) in the software development life cycle (SDLC) increases computational demand, which can raise the carbon footprint of development activities. At the same time, organizations are increasingly embedding governance mechanisms into GenAI-assisted development to support trust, transparency, and accountability. However, these governance mechanisms introduce additional computational workloads, including repeated inference, regeneration cycles, and expanded validation pipelines, increasing energy use and the carbon footprint of GenAI-assisted development. This paper proposes Carbon-Aware Governance Gates (CAGG), an architectural extension that embeds carbon budgets, energy provenance, and sustainability-aware validation orchestration into human-AI governance layers. CAGG comprises three components: (i) an Energy and Carbon Provenance Ledger, (ii) a Carbon Budget Manager, and (iii) a Green Validation Orchestrator, operationalized through governance policies and reusable design patterns.
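
How the three components might fit together can be sketched as follows. This is a toy sketch under strong assumptions: the abstract only names the components, so every class, method, unit, and number here (grid intensity, per-check energy estimates, the skip-if-over-budget policy) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProvenanceLedger:
    """Records (step, energy in kWh, emissions in gCO2e) for each governance step."""
    grid_intensity: float  # gCO2e per kWh, assumed to be known
    entries: List[Tuple[str, float, float]] = field(default_factory=list)

    def record(self, step: str, energy_kwh: float) -> float:
        emissions = energy_kwh * self.grid_intensity
        self.entries.append((step, energy_kwh, emissions))
        return emissions

    def total(self) -> float:
        return sum(emissions for _, _, emissions in self.entries)


@dataclass
class BudgetManager:
    budget_g: float  # carbon budget in gCO2e

    def allows(self, ledger: ProvenanceLedger, projected_g: float) -> bool:
        return ledger.total() + projected_g <= self.budget_g


def orchestrate(checks, ledger: ProvenanceLedger, manager: BudgetManager):
    """Run checks in order, skipping any whose projected emissions would exceed the budget."""
    ran, skipped = [], []
    for name, energy_kwh in checks:
        if manager.allows(ledger, energy_kwh * ledger.grid_intensity):
            ledger.record(name, energy_kwh)
            ran.append(name)
        else:
            skipped.append(name)
    return ran, skipped


ledger = ProvenanceLedger(grid_intensity=400.0)
manager = BudgetManager(budget_g=10.0)
checks = [("lint", 0.001), ("unit-tests", 0.01), ("llm-review", 0.02), ("regenerate", 0.005)]
ran, skipped = orchestrate(checks, ledger, manager)
```

With these made-up numbers the expensive `llm-review` step is skipped because it would push the running total past the budget, while the cheaper `regenerate` step still fits. A real gate would more likely defer or escalate to a human than silently skip a check.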

[AI-34] PerturbDiff: Functional Diffusion for Single-Cell Perturbation Modeling

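[Quick read]: This paper addresses a core difficulty in single-cell perturbation prediction: because high-throughput single-cell sequencing is destructive, the same cell cannot be observed both before and after a perturbation, so control and perturbed data are unpaired distributions. Existing models usually assume that the response distribution is fixed given the cell type and perturbation type, but in practice unobserved latent factors (such as microenvironmental fluctuations and batch effects) make responses under the same conditions form a manifold of possible distributions. The authors propose PerturbDiff, whose key idea is to lift modeling from individual cells to entire distributions: it embeds probability distributions in a Hilbert space and defines a diffusion-based generative model that operates directly in the space of distributions, capturing population-level response shifts caused by latent factors and substantially improving generalization to unseen perturbations.

Link: https://arxiv.org/abs/2602.19685
Authors: Xinyu Yuan,Xixian Liu,Ya Shi Zhang,Zuobai Zhang,Hongyu Guo,Jian Tang
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: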

Abstract:Building Virtual Cells that can accurately simulate cellular responses to perturbations is a long-standing goal in systems biology. A fundamental challenge is that high-throughput single-cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming a manifold of possible distributions for the same observed conditions. To account for this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion-based generative process operating directly over probability distributions. This allows PerturbDiff to capture population-level response shifts across hidden factors. Benchmarks on established datasets show that PerturbDiff achieves state-of-the-art performance in single-cell response prediction and generalizes substantially better to unseen perturbations. See our project page (this https URL), where code and data will be made publicly available (this https URL).

[AI-35] Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics

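[Quick read]: This paper addresses the limited accuracy of traditional cross-sectional classification models in remote heart failure (HF) monitoring, caused by inter-individual differences in speech characteristics. Its core solution is a Longitudinal Intra-Patient Tracking (LIPT) framework built around a Personalised Sequential Encoder (PSE), which incorporates historical speech data at each timestamp and transforms longitudinal speech recordings into context-aware latent representations, modeling the trajectory of an individual's clinical status continuously rather than treating each visit in isolation. In a cohort of 225 patients the method reaches 99.7% accuracy in recognizing clinical status transitions, significantly outperforming traditional approaches, and shows high sensitivity in predicting HF deterioration, offering a scalable route to remote, home-based management.

Link: https://arxiv.org/abs/2602.19674
Authors: Yue Pan,Xingyao Wang,Hanyue Zhang,Liwei Liu,Changxin Li,Gang Yang,Rong Sheng,Yili Xia,Ming Chu
Institution: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: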

Abstract:Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional classification models. To address this, we propose a Longitudinal Intra-Patient Tracking (LIPT) scheme designed to capture the trajectory of relative symptomatic changes within individuals. Central to this framework is a Personalised Sequential Encoder (PSE), which transforms longitudinal speech recordings into context-aware latent representations. By incorporating historical data at each timestamp, the PSE facilitates a holistic assessment of the clinical trajectory rather than modelling discrete visits independently. Experimental results from a cohort of 225 patients demonstrate that the LIPT paradigm significantly outperforms the classic cross-sectional approaches, achieving a recognition accuracy of 99.7% for clinical status transitions. The model’s high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings. Furthermore, this work addresses the gap in existing literature by providing a comprehensive analysis of different speech task designs and acoustic features. Taken together, the superior performance of the LIPT framework and PSE architecture validates their readiness for integration into long-term telemonitoring systems, offering a scalable solution for remote heart failure management.

[AI-36] SkillOrchestra: Learning to Route Agents via Skill Transfer

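[Quick read]: This paper addresses inefficient and ineffective orchestration in compound AI systems, targeting two limitations of existing routing methods: input-level routers make only coarse query-level decisions and cannot follow task requirements that change over multi-turn interactions, and orchestrators trained with reinforcement learning (RL) are prone to routing collapse at deployment, repeatedly invoking a single strong but costly agent. The key to the solution is the SkillOrchestra framework, which learns fine-grained skills from execution experience, explicitly models each agent's competence and cost under those skills, and at deployment infers the skill demands of the current interaction to select agents under an explicit performance-cost trade-off. This avoids the high sample cost and instability of end-to-end RL and gives a scalable, interpretable, sample-efficient orchestration strategy.

Link: https://arxiv.org/abs/2602.19672
Authors: Jiayu Wang,Yifei Ming,Zixuan Ke,Shafiq Joty,Aws Albarghouthi,Frederic Sala
Institution: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: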

Abstract:Compound AI systems promise capabilities beyond those of individual models, yet their success depends critically on effective orchestration. Existing routing approaches face two limitations: (1) input-level routers make coarse query-level decisions that ignore evolving task requirements; (2) RL-trained orchestrators are expensive to adapt and often suffer from routing collapse, repeatedly invoking one strong but costly option in multi-turn scenarios. We introduce SkillOrchestra, a framework for skill-aware orchestration. Instead of directly learning a routing policy end-to-end, SkillOrchestra learns fine-grained skills from execution experience and models agent-specific competence and cost under those skills. At deployment, the orchestrator infers the skill demands of the current interaction and selects agents that best satisfy them under an explicit performance-cost trade-off. Extensive experiments across ten benchmarks demonstrate that SkillOrchestra outperforms SoTA RL-based orchestrators by up to 22.5% with 700x and 300x learning cost reduction compared to Router-R1 and ToolOrchestra, respectively. These results show that explicit skill modeling enables scalable, interpretable, and sample-efficient orchestration, offering a principled alternative to data-intensive RL-based approaches. The code is available at: this https URL.
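
The deployment-time selection step (infer skill demands, then pick the agent with the best performance-cost trade-off) can be sketched as a scoring rule. The linear score, the `cost_weight` parameter, and all agent names and numbers are assumptions for illustration. The abstract does not specify SkillOrchestra's actual objective.

```python
from typing import Dict


def select_agent(
    skill_demand: Dict[str, float],
    competence: Dict[str, Dict[str, float]],
    cost: Dict[str, float],
    cost_weight: float,
) -> str:
    """Pick the agent maximizing demand-weighted competence minus weighted cost."""
    def score(agent: str) -> float:
        fit = sum(w * competence[agent].get(skill, 0.0) for skill, w in skill_demand.items())
        return fit - cost_weight * cost[agent]
    return max(competence, key=score)


# Hypothetical per-skill competence and per-call cost for two agents.
competence = {
    "large-model": {"math": 0.9, "search": 0.8},
    "small-model": {"math": 0.5, "search": 0.75},
}
cost = {"large-model": 1.0, "small-model": 0.1}

search_turn = select_agent({"search": 1.0}, competence, cost, cost_weight=0.2)
math_turn = select_agent({"math": 1.0}, competence, cost, cost_weight=0.2)
```

Because the decision is made per turn from the inferred skills, the cheap agent handles the search turn and the expensive one is called only for the math turn. That per-skill switching is the behaviour routing collapse lacks.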

[AI-37] Representation Stability in a Minimal Continual Learning Agent

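【Quick Read】: This paper studies how internal representations evolve over time in continual learning systems, particularly in settings where retraining or reset is infeasible, and how stability and plasticity balance there. Many existing approaches emphasize task performance rather than representational dynamics. The key idea is a minimal continual learning agent that keeps a persistent state vector across executions and updates it incrementally as new textual data arrives; representational change is quantified as the cosine similarity between successive normalized state vectors, and a stability metric is defined over time intervals. Experiments across eight executions show that, without explicit regularization, replay, or architectural complexity, the agent moves from an initial plastic regime to a stable regime and restabilizes after a deliberate semantic perturbation, indicating that a meaningful stability-plasticity trade-off can emerge in a simple stateful system.

Link: https://arxiv.org/abs/2602.19655
Authors: Vishnu Subramanian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure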

Abstract:Continual learning systems are increasingly deployed in environments where retraining or reset is infeasible, yet many approaches emphasize task performance rather than the evolution of internal representations over time. In this work, we study a minimal continual learning agent designed to isolate representational dynamics from architectural complexity and optimization objectives. The agent maintains a persistent state vector across executions and incrementally updates it as new textual data is introduced. We quantify representational change using cosine similarity between successive normalized state vectors and define a stability metric over time intervals. Longitudinal experiments across eight executions reveal a transition from an initial plastic regime to a stable representational regime under consistent input. A deliberately introduced semantic perturbation produces a bounded decrease in similarity, followed by recovery and restabilization under subsequent coherent input. These results demonstrate that meaningful stability-plasticity tradeoffs can emerge in a minimal, stateful learning system without explicit regularization, replay, or architectural complexity. The work establishes a transparent empirical baseline for studying representational accumulation and adaptation in continual learning systems.
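The measurement described above can be sketched in a few lines. Cosine similarity between successive normalized state vectors follows the abstract; the `stability` function (a plain mean of those similarities over an interval) is an assumed stand-in, since the abstract does not give the paper's exact stability metric.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of the normalized vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def stability(states):
    """Assumed metric: mean similarity between successive state vectors."""
    sims = [cosine_similarity(s, t) for s, t in zip(states, states[1:])]
    return sum(sims) / len(sims)


# Toy state trajectory: one large change, then no change.
states = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(stability(states))
```

A value near 1 over an interval would correspond to the stable regime; a dip after a perturbation followed by a return toward 1 would correspond to the recovery the abstract reports.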

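[AI-38] NEXUS: A compact neural architecture for high-resolution spatiotemporal air quality forecasting in Delhi National Capital Region

【Quick Read】: This paper addresses the public health challenge of air pollution in megacities such as the Delhi National Capital Region (NCR), focusing on accurate, real-time forecasting of carbon monoxide (CO), nitrogen oxide (NO), and sulfur dioxide (SO₂). The key idea is NEXUS (Neural Extraction and Unified Spatiotemporal), an architecture that uses patch embedding to extract spatial features, low-rank projections to compress parameters and improve computational efficiency, and an adaptive fusion mechanism to capture coupled meteorological-chemical relationships. With only 18,748 parameters the model reaches R² above 0.94 for CO, 0.91 for NO, and 0.95 for SO₂, outperforming SCINet, Autoformer, and FEDformer with fewer parameters, which makes it practical to deploy in city-scale air quality monitoring systems.

Link: https://arxiv.org/abs/2602.19654
Authors: Rampunit Kumar, Aditya Maheshwari
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages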

Abstract:Urban air pollution in megacities poses critical public health challenges, particularly in Delhi National Capital Region (NCR) where severe degradation affects millions. We present NEXUS (Neural Extraction and Unified Spatiotemporal) architecture for forecasting carbon monoxide, nitrogen oxide, and sulfur dioxide. Working with four years (2018–2021) of atmospheric data across sixteen spatial grids, NEXUS achieves R² exceeding 0.94 for CO, 0.91 for NO, and 0.95 for SO₂ using merely 18,748 parameters – substantially fewer than SCINet (35,552), Autoformer (68,704), and FEDformer (298,080). The architecture integrates patch embedding, low-rank projections, and adaptive fusion mechanisms to decode complex atmospheric chemistry patterns. Our investigation uncovers distinct diurnal rhythms and pronounced seasonal variations, with winter months experiencing severe pollution episodes driven by temperature inversions and agricultural biomass burning. Analysis identifies critical meteorological thresholds, quantifies wind field impacts on pollutant dispersion, and maps spatial heterogeneity across the region. Extensive ablation experiments demonstrate each architectural component’s role. NEXUS delivers superior predictive performance with remarkable computational efficiency, enabling real-time deployment for air quality monitoring systems.

[AI-39] Denoising Particle Filters: Learning State Estimation with Single-Step Objectives

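【Quick Read】: This paper addresses the poor interpretability and high training cost of learning-based state estimation in robotics, which is usually trained end to end and requires unrolling predictions over time. The key idea is a new particle filtering algorithm whose models are trained from individual state transitions, fully exploiting the Markov property of robotic systems. The measurement model is learned implicitly by minimizing a denoising score matching objective; at inference, the learned denoiser is combined with a (learned) dynamics model to approximately solve the Bayesian filtering equation at each step, guiding predicted states toward the data manifold informed by the measurements. On simulated tasks the method performs competitively with tuned end-to-end baselines while keeping the composability of classical filters, so prior information and external sensor models can be incorporated without retraining.

Link: https://arxiv.org/abs/2602.19651
Authors: Lennart Röstel, Berthold Bäuml
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: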

Abstract:Learning-based methods commonly treat state estimation in robotics as a sequence modeling problem. While this paradigm can be effective at maximizing end-to-end performance, models are often difficult to interpret and expensive to train, since training requires unrolling sequences of predictions in time. As an alternative to end-to-end trained state estimation, we propose a novel particle filtering algorithm in which models are trained from individual state transitions, fully exploiting the Markov property in robotic systems. In this framework, measurement models are learned implicitly by minimizing a denoising score matching objective. At inference, the learned denoiser is used alongside a (learned) dynamics model to approximately solve the Bayesian filtering equation at each time step, effectively guiding predicted states toward the data manifold informed by measurements. We evaluate the proposed method on challenging robotic state estimation tasks in simulation, demonstrating competitive performance compared to tuned end-to-end trained baselines. Importantly, our method offers the desirable composability of classical filtering algorithms, allowing prior information and external sensor models to be incorporated without retraining.

[AI-40] Compositional Planning with Jumpy World Models

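【Quick Read】: This paper addresses compositional planning with temporal abstraction: composing pre-trained policies as temporally extended actions to solve complex tasks that no single policy can solve. The core difficulty is that compounding errors in long-horizon prediction make it hard to estimate the visitation distribution induced by a sequence of policies. The key idea is to learn jumpy world models, predictive models of multi-step dynamics trained in an off-policy manner, and to add a novel consistency objective that aligns predictions across timescales and improves long-horizon accuracy. Building on Temporal Difference Flows, the generative predictions are combined to estimate the value of arbitrary policy sequences over varying timescales, which improves zero-shot performance on challenging manipulation and navigation tasks and yields, on average, a 200% relative improvement over planning with primitive actions on long-horizon tasks.

Link: https://arxiv.org/abs/2602.19634
Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Marc G. Bellemare, Alessandro Lazaric, Ahmed Touati
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: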

Abstract:The ability to plan with temporal abstractions is central to intelligent decision-making. Rather than reasoning over primitive actions, we study agents that compose pre-trained policies as temporally extended actions, enabling solutions to complex tasks that no constituent alone can solve. Such compositional planning remains elusive as compounding errors in long-horizon predictions make it challenging to estimate the visitation distribution induced by sequencing policies. Motivated by the geometric policy composition framework introduced in arXiv:2206.08736, we address these challenges by learning predictive models of multi-step dynamics – so-called jumpy world models – that capture state occupancies induced by pre-trained policies across multiple timescales in an off-policy manner. Building on Temporal Difference Flows (arXiv:2503.09817), we enhance these models with a novel consistency objective that aligns predictions across timescales, improving long-horizon predictive accuracy. We further demonstrate how to combine these generative predictions to estimate the value of executing arbitrary sequences of policies over varying timescales. Empirically, we find that compositional planning with jumpy world models significantly improves zero-shot performance across a wide range of base policies on challenging manipulation and navigation tasks, yielding, on average, a 200% relative improvement over planning with primitive actions on long-horizon tasks.

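[AI-41] TAPE: Tool-Guided Adaptive Planning and Constrained Execution in Language Model Agents

【Quick Read】: This paper addresses irrecoverable failures of language model (LM) agents in environments with strict feasibility constraints, where a single error can doom the task; it identifies imperfect planning and stochastic execution in existing frameworks as the primary causes. The key idea is Tool-guided Adaptive Planning with constrained Execution (TAPE): planning is strengthened by aggregating multiple plans into a graph and using an external solver to find a feasible path; during execution, constrained decoding reduces sampling noise, and the agent re-plans adaptively whenever environmental feedback deviates from the intended state. This raises task success rates, with the largest gains on hard settings.

Link: https://arxiv.org/abs/2602.19633
Authors: Jongwon Jeong, Jungtaek Kim, Kangwook Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint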

Abstract:Language Model (LM) agents have demonstrated remarkable capabilities in solving tasks that require multiple interactions with the environment. However, they remain vulnerable in environments where a single error often leads to irrecoverable failure, particularly under strict feasibility constraints. We systematically analyze existing agent frameworks, identifying imperfect planning and stochastic execution as the primary causes. To address these challenges, we propose Tool-guided Adaptive Planning with constrained Execution (TAPE). TAPE enhances planning capability by aggregating multiple plans into a graph and employing an external solver to identify a feasible path. During execution, TAPE employs constrained decoding to reduce sampling noise, while adaptively re-planning whenever environmental feedback deviates from the intended state. Experiments across Sokoban, ALFWorld, MuSiQue, and GSM8K-Hard demonstrate that TAPE consistently outperforms existing frameworks, with particularly large gains on hard settings, improving success rates by 21.0 percentage points on hard settings on average, and by 20.0 percentage points for weaker base models on average. Code and data available at here.
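The planning step (aggregate several plans into a graph, then search it for a feasible path) can be sketched as follows. This is an illustration under assumptions: plans are taken to be lists of hashable state labels, infeasibility is a given set of states, and a breadth-first search stands in for the external solver, which the abstract does not specify.

```python
from collections import deque


def merge_plans(plans):
    """Union the state transitions of several candidate plans into one graph."""
    graph = {}
    for plan in plans:
        for src, dst in zip(plan, plan[1:]):
            graph.setdefault(src, set()).add(dst)
    return graph


def feasible_path(graph, start, goal, infeasible):
    """Breadth-first search for a start-to-goal path that avoids infeasible states."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen and nxt not in infeasible:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no feasible path exists in the merged graph


# Two hypothetical sampled plans; the first passes through an infeasible state.
plans = [["s", "a", "dead", "g"], ["s", "b", "a", "c", "g"]]
graph = merge_plans(plans)
path = feasible_path(graph, "s", "g", infeasible={"dead"})
print(path)
```

In this toy case the returned path combines a prefix of the first plan with a suffix of the second, which neither plan contains on its own; that is the benefit of searching the merged graph rather than picking one sampled plan.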

[AI-42] VecFormer: Towards Efficient and Generalizable Graph Transformer with Graph Token Attention

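【Quick Read】: This paper addresses two problems of existing Graph Transformers: high computational complexity on large graphs and poor generalization in out-of-distribution (OOD) scenarios. The key idea is VecFormer, a two-stage training framework based on vector quantization: in the first stage, two codebooks reconstruct the node features and the graph structure to learn semantically rich Graph Codes; in the second stage, attention is performed at the Graph Token level using the transformed cross codebook, which lowers computational complexity and improves OOD generalization.

Link: https://arxiv.org/abs/2602.19622
Authors: Jingbo Zhou, Jun Xia, Siyuan Li, Yunfan Liu, Wenjun Wang, Yufei Huang, Changxi Chi, Mutian Hong, Zhuoli Ouyang, Shu Wang, Zhongqi Wang, Xingyu Wu, Chang Yu, Stan Z. Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: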

Abstract:Graph Transformer has demonstrated impressive capabilities in the field of graph representation learning. However, existing approaches face two critical challenges: (1) most models suffer from exponentially increasing computational complexity, making it difficult to scale to large graphs; (2) attention mechanisms based on node-level operations limit the flexibility of the model and result in poor generalization performance in out-of-distribution (OOD) scenarios. To address these issues, we propose VecFormer (the Vector Quantized Graph Transformer), an efficient and highly generalizable model for node classification, particularly under OOD settings. VecFormer adopts a two-stage training paradigm. In the first stage, two codebooks are used to reconstruct the node features and the graph structure, aiming to learn the rich semantic Graph Codes. In the second stage, attention mechanisms are performed at the Graph Token level based on the transformed cross codebook, reducing computational complexity while enhancing the model’s generalization capability. Extensive experiments on datasets of various sizes demonstrate that VecFormer outperforms the existing Graph Transformer in both performance and speed.
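The basic vector quantization operation behind a codebook is a nearest-entry lookup, sketched below. The codebook values and node features are made up, and this shows only the generic lookup, not VecFormer's two-codebook training or its cross-codebook transformation.

```python
def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


# Hypothetical 2-D codebook and node features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
node_features = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [1.1, 0.8]]

codes = [quantize(x, codebook) for x in node_features]
tokens = sorted(set(codes))  # distinct codes that actually occur
print(codes, tokens)
```

Attention over the distinct codes scales with the codebook size rather than the number of nodes, which is the intuition behind the reduced complexity the abstract claims for token-level attention.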

[AI-43] Rules or Weights? Comparing User Understanding of Explainable AI Techniques with the Cognitive XAI-Adaptive Model

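【Quick Read】: This paper addresses the lack of a cognitive framework for comparing the interpretability of explainable AI (XAI) techniques such as weights and rules, in particular the open question of which explanation to choose for forward versus counterfactual decision tasks. The key idea is CoXAM (Cognitive XAI-Adaptive Model), which uses a shared memory representation to encode instance attributes, linear weights, and decision rules, and applies computational rationality to choose among reasoning processes based on the trade-off between utility and reasoning time, separately for forward and counterfactual tasks. In a validation study CoXAM aligned more closely with human decision-making than baseline machine learning proxy models and replicated and explained several key empirical findings, providing a cognitive basis for debugging and benchmarking XAI techniques.

Link: https://arxiv.org/abs/2602.19620
Authors: Louth Bin Rawshan, Zhuoyu Wang, Brian Y Lim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: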

Abstract:Rules and Weights are popular XAI techniques for explaining AI decisions. Yet, it remains unclear how to choose between them, lacking a cognitive framework to compare their interpretability. In an elicitation user study on forward and counterfactual decision tasks, we identified 7 reasoning strategies of interpreting three XAI Schemas - weights, rules, and their hybrid. To analyze their capabilities, we propose CoXAM, a Cognitive XAI-Adaptive Model with shared memory representation to encode instance attributes, linear weights, and decision rules. CoXAM employs computational rationality to choose among reasoning processes based on the trade-off in utility and reasoning time, separately for forward or counterfactual decision tasks. In a validation study, CoXAM demonstrated a stronger alignment with human decision-making compared to baseline machine learning proxy models. The model successfully replicated and explained several key empirical findings, including that counterfactual tasks are inherently harder than forward tasks, decision tree rules are harder to recall and apply than linear weights, and the helpfulness of XAI depends on the application data context, alongside identifying which underlying reasoning strategies were most effective. With CoXAM, we contribute a cognitive basis to accelerate debugging and benchmarking disparate XAI techniques.

[AI-44] Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

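【Quick Read】: This paper addresses the difficulty of systematically identifying high-potential Small and Medium Enterprises (SMEs), framed as predicting which awardees of the U.S. Small Business Innovation Research (SBIR) program advance from Phase I to Phase II funding. The key idea is SME-HGT, a Heterogeneous Graph Transformer framework built on a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes, connected by approximately 99,000 edges across three semantic relation types that model the links among firms, research topics, and funding agencies. On a temporally split test set the method reaches an AUPRC of 0.621, outperforming MLP and R-GCN baselines, and at a screening depth of 100 companies it attains 89.6% precision with a 2.14× lift over random selection, showing that relational structure carries useful signal for assessing SME potential.

Link: https://arxiv.org/abs/2602.19591
Authors: Yijiashun Qi, Hanzhe Guo, Yijiazhen Qi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: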

Abstract:Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity, yet systematically identifying high-potential SMEs remains an open challenge. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which SBIR Phase I awardees will advance to Phase II funding using exclusively public data. We construct a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes connected by approximately 99,000 edges across three semantic relation types. SME-HGT achieves an AUPRC of 0.621 ± 0.003 on a temporally-split test set, outperforming an MLP baseline (0.590 ± 0.002) and R-GCN (0.608 ± 0.013) across five random seeds. At a screening depth of 100 companies, SME-HGT attains 89.6% precision with a 2.14× lift over random selection. Our temporal evaluation protocol prevents information leakage, and our reliance on public data ensures reproducibility. These results demonstrate that relational structure among firms, research topics, and funding agencies provides meaningful signal for SME potential assessment, with implications for policymakers and early-stage investors.
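The screening metrics quoted above can be made concrete with a small sketch. It assumes the common definitions: precision at depth k is the fraction of positives among the k highest-scored items, and lift is that precision divided by the overall positive rate. The scores and labels are toy values, not data from the paper.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(scores, labels, k):
    """Precision at k relative to the base rate (expected precision of random selection)."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate


# Toy model scores and binary outcomes (1 = advanced to the next phase).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_at_k(scores, labels, 2), lift_at_k(scores, labels, 2))
```

Under this definition, 89.6% precision with a 2.14× lift would correspond to a base rate of roughly 0.896 / 2.14 ≈ 0.42; this is a back-of-the-envelope reading and assumes the paper uses the same definition of lift.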

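[AI-45] Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis CVPR2026

【Quick Read】: This paper addresses a gap in Multimodal Sentiment Analysis (MSA): existing methods focus on either globally shared representations or modality-specific features and overlook signals shared only by particular modality pairs, which limits the expressiveness and discriminative power of multimodal representations. The key idea is the Tri-Subspace Disentanglement (TSD) framework, which explicitly factorizes features into three complementary subspaces: a common subspace for global consistency, pairwise-shared subspaces that model cross-modal synergies between modality pairs, and private subspaces that preserve modality-specific information. A decoupling supervisor and structured regularization losses keep the subspaces pure and independent, and a Subspace-Aware Cross-Attention (SACA) fusion module adaptively integrates information from the three subspaces, improving the modeling of multi-granular cross-modal sentiment cues.

Link: https://arxiv.org/abs/2602.19585
Authors: Chunlei Meng, Jiabin Luo, Zhenglin Yan, Zhenyu Yu, Rong Fu, Zhongxue Gan, Chun Ouyang
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: This study has been Accepted by CVPR 2026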

Abstract:Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods either focus on globally shared representations or modality-specific features, while overlooking signals that are shared only by certain modality pairs. This limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, submodally-shared subspaces modeling pairwise cross-modal synergies, and private subspaces preserving modality-specific cues. To keep these subspaces pure and independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching 0.691 MAE on CMU-MOSI and 54.9% ACC-7 on CMU-MOSEI, and also transfers well to multimodal intent recognition tasks. Ablation studies confirm that tri-subspace disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.

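[AI-46] Interpolation-Driven Machine Learning Approaches for Plume Shine Dose Estimation: A Comparison of XGBoost, Random Forest, and TabNet

【Quick Read】: This paper addresses obstacles to using machine learning (ML) in radiation dose assessment: safety-critical constraints, scarce training-ready data, and the difficulty of choosing architectures for physics-dominated systems. The key idea is an interpolation-assisted ML framework that augments discrete dose data with shape-preserving interpolation to build a dense, high-resolution training set, which improves prediction accuracy. Comparing Random Forest, XGBoost, and TabNet, the study finds that XGBoost is consistently the most accurate, and that the performance differences stem from how the models use input features: the tree-based models rely mainly on geometry-dispersion features (release height, atmospheric stability category, and downwind distance), whereas TabNet distributes attention more broadly across variables. A web-based GUI was also developed for interactive scenario evaluation and transparent comparison with photon-transport reference calculations.

Link: https://arxiv.org/abs/2602.19584
Authors: Biswajit Sadhu, Kalpak Gupte, Trijit Sadhu, S. Anand
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages, 11 figures, 3 tables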

Abstract:Despite the success of machine learning (ML) in surrogate modeling, its use in radiation dose assessment is limited by safety-critical constraints, scarce training-ready data, and challenges in selecting suitable architectures for physics-dominated systems. Within this context, rapid and accurate plume shine dose estimation serves as a practical test case, as it is critical for nuclear facility safety assessment and radiological emergency response, while conventional photon-transport-based calculations remain computationally expensive. In this work, an interpolation-assisted ML framework was developed using discrete dose datasets generated with the pyDOSEIA suite for 17 gamma-emitting radionuclides across varying downwind distances, release heights, and atmospheric stability categories. The datasets were augmented using shape-preserving interpolation to construct dense, high-resolution training data. Two tree-based ML models (Random Forest and XGBoost) and one deep learning (DL) model (TabNet) were evaluated to examine predictive performance and sensitivity to dataset resolution. All models showed higher prediction accuracy with the interpolated high-resolution dataset than with the discrete data; however, XGBoost consistently achieved the highest accuracy. Interpretability analysis using permutation importance (tree-based models) and attention-based feature attribution (TabNet) revealed that performance differences stem from how the models utilize input features. Tree-based models focus mainly on dominant geometry-dispersion features (release height, stability category, and downwind distance), treating radionuclide identity as a secondary input, whereas TabNet distributes attention more broadly across multiple variables. For practical deployment, a web-based GUI was developed for interactive scenario evaluation and transparent comparison with photon-transport reference calculations.

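[AI-47] Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains

【Quick Read】: This paper addresses the security risks that agentic systems built on large language models (LLMs) face at runtime (inference time), where the ability to autonomously retrieve information and invoke tools enlarges the attack surface. Prior security work has focused on model-level vulnerabilities, whereas threats arising from cyclic and interdependent runtime behavior, such as data supply chain attacks and tool supply chain attacks, have not been systematized; the paper also identifies the Viral Agent Loop, in which agents become vectors for self-propagating generative worms. The key idea is a Zero-Trust Runtime Architecture that treats context as untrusted control flow and constrains tool execution through cryptographic provenance rather than semantic inference.

Link: https://arxiv.org/abs/2602.19555
Authors: Xiaochong Jiang, Shiqi Yang, Wenting Yang, Yichen Liu, Cheng Ji
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 Pages, 3 figures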

Abstract:Agentic systems built on large language models (LLMs) extend beyond text generation to autonomously retrieve information and invoke tools. This runtime execution model shifts the attack surface from build-time artifacts to inference-time dependencies, exposing agents to manipulation through untrusted data and probabilistic capability resolution. While prior work has focused on model-level vulnerabilities, security risks emerging from cyclic and interdependent runtime behavior remain fragmented. We systematize these risks within a unified runtime framework, categorizing threats into data supply chain attacks (transient context injection and persistent memory poisoning) and tool supply chain attacks (discovery, implementation, and invocation). We further identify the Viral Agent Loop, in which agents act as vectors for self-propagating generative worms without exploiting code-level flaws. Finally, we advocate a Zero-Trust Runtime Architecture that treats context as untrusted control flow and constrains tool execution through cryptographic provenance rather than semantic inference.
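One way to read "constraining tool execution through cryptographic provenance" is that a tool call runs only if its manifest carries a valid signature. The sketch below is an assumption-laden illustration, not the paper's architecture: it uses an HMAC with a shared demo key from the Python standard library, whereas a real deployment would more likely use asymmetric signatures and managed keys. The `manifest` layout and `registry` are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest, key):
    """Sign a canonical JSON encoding of the tool-call manifest."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def invoke_tool(manifest, signature, key, registry):
    """Run the tool only if the manifest's signature verifies."""
    expected = sign_manifest(manifest, key)
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("tool manifest failed provenance check")
    return registry[manifest["name"]](*manifest["args"])


key = b"demo-key-not-for-production"  # hypothetical shared secret
registry = {"add": lambda a, b: a + b}  # hypothetical tool registry
manifest = {"name": "add", "args": [2, 3]}
signature = sign_manifest(manifest, key)
result = invoke_tool(manifest, signature, key, registry)

# A manifest altered after signing (for example by injected context) is rejected.
tampered = {"name": "add", "args": [2, 300]}
try:
    invoke_tool(tampered, signature, key, registry)
    blocked = False
except PermissionError:
    blocked = True
print(result, blocked)
```

The check never inspects what the text of the request means; it only verifies that the call is the one that was signed, which matches the contrast the abstract draws between provenance and semantic inference.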

[AI-48] Cost-Aware Diffusion Active Search

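【Quick Read】: This paper addresses how autonomous agents performing active search in partially observable environments can balance exploration and exploitation. Myopic, greedy methods based on information gain or Thompson sampling decide one step at a time and are outperformed by agents that look ahead over a finite horizon, but lookahead algorithms typically build a computationally expensive search tree, which limits real-time use and scalability. The key idea is to use the sequence modeling ability of diffusion models to sample lookahead action sequences directly, balancing exploration and exploitation without building an exhaustive search tree. The authors also identify an optimism bias in prior diffusion-based reinforcement learning approaches when applied to active search and propose mitigations that support cost-aware decision making for single-agent and multi-agent teams, improving full recovery rate over offline RL baselines and reducing computation relative to tree search.

Link: https://arxiv.org/abs/2602.19538
Authors: Arundhati Banerjee, Jeff Schneider
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In submission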

Abstract:Active search for recovering objects of interest through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments with exploitation of prior observations in the search space. Prior work has proposed information gain and Thompson sampling based myopic, greedy approaches for agents to actively decide query or search locations when the number of targets is unknown. Decision making algorithms in such partially observable environments have also shown that agents capable of lookahead over a finite horizon outperform myopic policies for active search. Unfortunately, lookahead algorithms typically rely on building a computationally expensive search tree that is simulated and updated based on the agent’s observations and a model of the environment dynamics. Instead, in this work, we leverage the sequence modeling abilities of diffusion models to sample lookahead action sequences that balance the exploration-exploitation trade-off for active search without building an exhaustive search tree. We identify the optimism bias in prior diffusion based reinforcement learning approaches when applied to the active search setting and propose mitigating solutions for efficient cost-aware decision making with both single and multi-agent teams. Our proposed algorithm outperforms standard baselines in offline reinforcement learning in terms of full recovery rate and is computationally more efficient than tree search in cost-aware active decision making.

[AI-49] Large Language Model-Assisted UAV Operations and Communications: A Multifaceted Survey and Tutorial

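【Quick Read】: This survey addresses the limited intelligence of conventional uncrewed aerial vehicle (UAV) systems in environmental understanding, mission planning, and coordinated control, where optimization-based and learning-based methods struggle with high-level reasoning and adaptive decision making in complex scenarios. The key idea is to integrate Large Language Models (LLMs) into the UAV system architecture through pretraining, fine-tuning, Retrieval-Augmented Generation (RAG), and prompt engineering, giving UAVs reasoning capabilities such as Chain-of-Thought (CoT) and In-Context Learning (ICL), and to use Multimodal LLMs (MLLMs) for perception-driven navigation and human-swarm collaborative control, toward aerial systems with environmental awareness, autonomous decision making, and dynamic adaptation.

Link: https://arxiv.org/abs/2602.19534
Authors: Yousef Emami, Hao Zhou, Radha Reddy, Atefeh Hajijamali Arani, Biliang Wang, Kai Li, Luis Almeida, Zhu Han
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 40 pages, 10 figures, 13 tables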

Abstract:Uncrewed Aerial Vehicles (UAVs) are widely deployed across diverse applications due to their mobility and agility. Recent advances in Large Language Models (LLMs) offer a transformative opportunity to enhance UAV intelligence beyond conventional optimization-based and learning-based approaches. By integrating LLMs into UAV systems, advanced environmental understanding, swarm coordination, mobility optimization, and high-level task reasoning can be achieved, thereby allowing more adaptive and context-aware aerial operations. This survey systematically explores the intersection of LLMs and UAV technologies and proposes a unified framework that consolidates existing architectures, methodologies, and applications for UAVs. We first present a structured taxonomy of LLM adaptation techniques for UAVs, including pretraining, fine-tuning, Retrieval-Augmented Generation (RAG), and prompt engineering, along with key reasoning capabilities such as Chain-of-Thought (CoT) and In-Context Learning (ICL). We then examine LLM-assisted UAV communications and operations, covering navigation, mission planning, swarm control, safety, autonomy, and network management. After that, the survey further discusses Multimodal LLMs (MLLMs) for human-swarm interaction, perception-driven navigation, and collaborative control. Finally, we address ethical considerations, including bias, transparency, accountability, and Human-in-the-Loop (HITL) strategies, and outline future research directions. Overall, this work positions LLM-assisted UAVs as a foundation for intelligent and adaptive aerial systems.

[AI-50] Grokking Finite-Dimensional Algebra

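【Quick Read】: This paper investigates the mechanism of grokking in neural network training, the sudden transition from a long period of memorization to generalization. It asks how the phenomenon appears in more general algebraic structures, namely finite-dimensional algebras (FDA), and how mathematical structure governs generalization dynamics. The key idea is a unified framework in which learning multiplication in an FDA amounts to learning the bilinear map defined by the algebra's structure tensor; the paper studies how algebraic properties (commutativity, associativity, and unitality) affect the emergence and timing of grokking. For algebras over the reals it connects the problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields to learning discrete representations; it further examines how the sparsity and rank of the structure tensor influence generalization and whether the model learns latent embeddings aligned with the algebra's representation.

Link: https://arxiv.org/abs/2602.19533
Authors: Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Rings and Algebras (math.RA)
Comments: 34 pages, 13 figures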

Abstract:This paper investigates the grokking phenomenon, which refers to the sudden transition from a long memorization to generalization observed during neural networks training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra’s structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra’s representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.
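To make "a bilinear product specified by the algebra's structure tensor" concrete: with basis elements $e_i$ and $e_i e_j = \sum_k T_{ijk} e_k$, the product of coordinate vectors is $(x \cdot y)_k = \sum_{i,j} T_{ijk} x_i y_j$. The sketch below applies this to the complex numbers viewed as a two-dimensional real algebra; the index order `T[i][j][k]` is a convention chosen here and may differ from the paper's.

```python
def multiply(x, y, T):
    """Bilinear product (x*y)_k = sum over i, j of T[i][j][k] * x[i] * y[j]."""
    n = len(T)
    return [
        sum(T[i][j][k] * x[i] * y[j] for i in range(n) for j in range(n))
        for k in range(n)
    ]


# Structure tensor of the complex numbers with basis e0 = 1, e1 = i.
# T[i][j] holds the coordinates of the product e_i * e_j.
COMPLEX = [
    [[1, 0], [0, 1]],   # 1*1 = 1,  1*i = i
    [[0, 1], [-1, 0]],  # i*1 = i,  i*i = -1
]

product = multiply([1, 2], [3, 4], COMPLEX)  # (1 + 2i)(3 + 4i)
print(product)
```

Swapping in a different tensor gives a different algebra, including non-commutative or non-associative ones, so a model that learns the multiplication table is in effect learning the entries of `T`.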

[AI-51] A Statistical Approach for Modeling Irregular Multivariate Time Series with Missing Observations

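【Quick Read】: This paper addresses predictive modeling of irregular multivariate time series with missing values, especially in healthcare, where deep learning methods usually rely on temporal interpolation or complex architectures to handle irregularity. The key idea is to drop explicit modeling of the time axis and instead extract time-agnostic summary statistics: for each variable, the mean and standard deviation of the observed values and the mean and variability of the changes between consecutive observations, giving a fixed-dimensional representation. Standard classifiers such as logistic regression and XGBoost are trained on these features; on four biomedical datasets the approach outperforms recent transformer and graph-based models at lower computational complexity, and it shows that missingness patterns can themselves carry predictive signal, offering an efficient and interpretable approach to irregular time series classification.

Link: https://arxiv.org/abs/2602.19531
Authors: Dingyi Nie, Yixing Wu, C.-C. Jay Kuo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in APSIPA Transactions on Signal and Information Processing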

Abstract:Irregular multivariate time series with missing values present significant challenges for predictive modeling in domains such as healthcare. While deep learning approaches often focus on temporal interpolation or complex architectures to handle irregularities, we propose a simpler yet effective alternative: extracting time-agnostic summary statistics to eliminate the temporal axis. Our method computes four key features per variable, namely the mean and standard deviation of observed values as well as the mean and variability of changes between consecutive observations, to create a fixed-dimensional representation. These features are then utilized with standard classifiers, such as logistic regression and XGBoost. Evaluated on four biomedical datasets (PhysioNet Challenge 2012, 2019, PAMAP2, and MIMIC-III), our approach achieves state-of-the-art performance, surpassing recent transformer and graph-based models by 0.5-1.7% in AUROC/AUPRC and 1.1-1.7% in accuracy/F1-score, while reducing computational complexity. Ablation studies demonstrate that feature extraction, not classifier choice, drives performance gains, and our summary statistics outperform raw/imputed input in most benchmarks. In particular, we identify scenarios where missing patterns themselves encode predictive signals, as in sepsis prediction (PhysioNet, 2019), where missing indicators alone can achieve 94.2% AUROC with XGBoost, only 1.6% lower than using original raw data as input. Our results challenge the necessity of complex temporal modeling when task objectives permit time-agnostic representations, providing an efficient and interpretable solution for irregular time series classification.
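The four per-variable features can be sketched with the standard library. Missing observations are represented as NaN, "variability" is taken to mean the population standard deviation, and at least two observed values are assumed; these are illustrative choices, not details confirmed by the abstract.

```python
import math
import statistics


def summarize(series):
    """Four time-agnostic features for one variable: mean and standard deviation
    of the observed values, then mean and standard deviation of the changes
    between consecutive observed values."""
    observed = [v for v in series if not math.isnan(v)]
    changes = [b - a for a, b in zip(observed, observed[1:])]
    return [
        statistics.mean(observed),
        statistics.pstdev(observed),
        statistics.mean(changes),
        statistics.pstdev(changes),
    ]


nan = float("nan")
heart_rate = [80.0, nan, 84.0, nan, nan, 82.0]  # toy irregularly sampled variable
features = summarize(heart_rate)
print(features)
```

Concatenating these four numbers across all variables gives a fixed-length vector regardless of how many time steps were recorded, which is what lets a standard classifier such as logistic regression consume the data directly.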

[AI-52] Ada-RS: Adaptive Rejection Sampling for Selective Thinking

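【Quick Read】: This paper addresses the tokens and latency that large language models (LLMs) waste in cost- and latency-sensitive settings when chain-of-thought reasoning is applied even to simple requests. The key idea is Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework: it scores multiple sampled completions with an adaptive length-penalized reward and applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization, so the model learns to think selectively. Experiments show that Ada-RS reduces average output tokens by up to 80% and thinking rate by up to 95% while maintaining or improving tool call accuracy.

Link: https://arxiv.org/abs/2602.19519
Authors: Yirou Ge, Yixi Li, Alec Chiu, Shivani Shekhar, Zijie Pan, Avinash Thangali, Yun-Shiuan Chuang, Chaitanya Kulkarni, Uma Kona, Linsey Pang, Prakhar Mehrotra
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: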

Abstract:Large language models (LLMs) are increasingly being deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward, then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference-pair (e.g., DPO) and grouped policy optimization strategies (e.g., DAPO). Using Qwen3-8B with LoRA on a synthetic tool call-oriented e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms by reducing average output tokens by up to 80% and reducing thinking rate by up to 95% while maintaining or improving tool call accuracy. These results highlight that training-signal selection is a powerful lever for efficient reasoning in latency-sensitive deployments.
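The two steps named in the abstract (a length-penalized reward, then stochastic rejection sampling) can be sketched as below. Everything specific is an assumption made for illustration: the reward form (correctness minus a penalty proportional to length), the acceptance probability `exp((r - r_best) / temperature)`, and the fixed `penalty` value. The abstract says the penalty is adaptive but does not give the rule, so it is passed in as a plain parameter here.

```python
import math
import random


def reward(correct, num_tokens, penalty, max_tokens=1000):
    """Hypothetical reward: 1 for a correct completion, minus a length penalty."""
    return float(correct) - penalty * num_tokens / max_tokens


def rejection_sample(candidates, penalty, temperature, rng):
    """Keep each candidate with probability exp((r - r_best) / temperature)."""
    rewards = [reward(c["correct"], c["tokens"], penalty) for c in candidates]
    best = max(rewards)
    kept = []
    for cand, r in zip(candidates, rewards):
        accept_prob = math.exp((r - best) / temperature)  # equals 1.0 for the best
        if rng.random() < accept_prob:
            kept.append(cand)
    return kept


# Toy sampled completions for one context.
candidates = [
    {"id": "short-correct", "correct": True, "tokens": 50},
    {"id": "long-correct", "correct": True, "tokens": 900},
    {"id": "short-wrong", "correct": False, "tokens": 40},
]
kept = rejection_sample(candidates, penalty=0.5, temperature=0.1,
                        rng=random.Random(0))
print([c["id"] for c in kept])
```

With this acceptance rule the highest-reward completion is always retained, while long or incorrect completions survive only occasionally, so the filtered set used for downstream optimization leans toward short, correct answers.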

[AI-53] Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

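[Quick read]: This paper addresses the limited performance of purely automated agentic AI on clinical prediction tasks, which stems from a lack of domain expertise. The core of the solution is to have human experts direct the agentic workflow at key decision points: multimodal feature engineering (from clinical notes, scanned PDF billing receipts, and time-series vital signs), task-appropriate model selection, and clinically informed validation strategies. Experiments show that this human-AI collaboration improves performance on all three healthcare benchmark tasks. Human-guided decisions add up to a cumulative gain of +0.065 F1, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). This underlines the role of domain-informed feature engineering and task-specific human judgment in building agentic systems that are interpretable, reproducible, and clinically valid.

Link: https://arxiv.org/abs/2602.19502
Authors: Lalitha Pranathi Pulavarthy,Raajitha Muthyala,Aravind V Kuruvikkattil,Zhenan Yin,Rashmita Kudamala,Saptarshi Purkayastha
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to the Data Challenge track at the 14th IEEE International Conference on Healthcare Informatics (ICHI) 2026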

Abstract:Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = 465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, since no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

[AI-54] Softmax is not Enough (for Adaptive Conformal Classification)

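[Quick read]: This paper addresses unreliable uncertainty quantification in deep conformal classifiers, which arises because nonconformity scores are computed from softmax outputs. As a result, prediction sets lack adaptiveness: they fail to vary in size with input difficulty. The key idea is to use information from the pre-softmax logit space, taking the Helmholtz free energy as a measure of model uncertainty and sample difficulty, and to reweight nonconformity scores by a monotonic transformation of each sample's energy score. This makes the scores more sensitive to input difficulty and improves the adaptiveness of the prediction sets without adding any post-hoc complexity.

Link: https://arxiv.org/abs/2602.19498
Authors: Navid Akhavan Attar,Hesam Asadollahzadeh,Ling Luo,Uwe Aickelin
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: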

Abstract:The merit of Conformal Prediction (CP), as a distribution-free framework for uncertainty quantification, depends on generating prediction sets that are efficient, reflected in small average set sizes, while adaptive, meaning they signal uncertainty by varying in size according to input difficulty. A central limitation for deep conformal classifiers is that the nonconformity scores are derived from softmax outputs, which can be unreliable indicators of how certain the model truly is about a given input, sometimes leading to overconfident misclassifications or undue hesitation. In this work, we argue that this unreliability can be inherited by the prediction sets generated by CP, limiting their capacity for adaptiveness. We propose a new approach that leverages information from the pre-softmax logit space, using the Helmholtz Free Energy as a measure of model uncertainty and sample difficulty. By reweighting nonconformity scores with a monotonic transformation of the energy score of each sample, we improve their sensitivity to input difficulty. Our experiments with four state-of-the-art score functions on multiple datasets and deep architectures show that this energy-based enhancement improves the adaptiveness of the prediction sets, leading to a notable increase in both efficiency and adaptiveness compared to baseline nonconformity scores, without introducing any post-hoc complexity.
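As a rough illustration of the ingredients named above, the sketch below computes the Helmholtz free energy of a logit vector and uses it to rescale a simple nonconformity score. The free-energy formula is the standard one; the softplus transform, the direction of the reweighting, and the base score `1 - p_y` are assumptions chosen for this example and are not taken from the paper.

```python
import math

def free_energy(logits, temperature=1.0):
    """Helmholtz free energy F = -T * log(sum_i exp(z_i / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softplus(x):
    """Numerically stable log(1 + exp(x)); positive and increasing in x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def energy_weight(logits, temperature=1.0):
    """Hypothetical monotonic transform: larger for confident (low-energy) inputs."""
    return softplus(-free_energy(logits, temperature))

def reweighted_scores(logits, temperature=1.0):
    """Base score 1 - p_y for every candidate label, scaled by the energy weight."""
    weight = energy_weight(logits, temperature)
    return [weight * (1.0 - p) for p in softmax(logits)]

easy = reweighted_scores([10.0, 0.0, 0.0])  # peaked logits, large weight
hard = reweighted_scores([0.1, 0.0, 0.0])   # flat logits, small weight
```

Under this particular choice, scores for hard (high-energy) inputs are scaled down, so a fixed conformal threshold admits more labels for them and fewer for easy inputs. The calibration step itself is unchanged.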

[AI-55] Federated Learning Playground

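[Quick read]: This paper addresses the high barrier to entry in teaching and practicing federated learning (FL): newcomers struggle to grasp core FL concepts quickly (such as non-IID data, local overfitting, and scalability challenges) and lack a convenient environment for seeing how different settings affect model performance. The solution is Federated Learning Playground, an interactive browser-based platform that visualizes changes in client and global models in real time. Users can explore heterogeneous client data distributions, model hyperparameters, and aggregation algorithms without coding or system setup, which lowers the learning cost and speeds up prototyping and comparison of FL methods.

Link: https://arxiv.org/abs/2602.19489
Authors: Bryan Guanrong Shan,Alysa Ziying Tan,Han Yu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: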

Abstract:We present Federated Learning Playground, an interactive browser-based platform, inspired by and extending TensorFlow Playground, that teaches core Federated Learning (FL) concepts. Users can experiment with heterogeneous client data distributions, model hyperparameters, and aggregation algorithms directly in the browser without coding or system setup, and observe their effects on client and global models through real-time visualizations, gaining intuition for challenges such as non-IID data, local overfitting, and scalability. The playground serves as an easy-to-use educational tool, lowering the entry barrier for newcomers to distributed AI while also offering a sandbox for rapidly prototyping and comparing FL methods. By democratizing exploration of FL, it promotes broader understanding and adoption of this important paradigm.

[AI-56] Making Conformal Predictors Robust in Healthcare Settings: a Case Study on EEG Classification

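[Quick read]: This paper addresses uncertainty quantification in clinical prediction. Under patient distribution shift, standard conformal methods that assume i.i.d. data cannot guarantee the coverage of their prediction sets, which undermines reliability in high-stakes diagnosis tasks. The key to the solution is personalized calibration strategies, which adapt calibration to individuals or subgroups. On EEG seizure classification, a task with label uncertainty and distribution shift, they improve coverage by more than 20 percentage points while keeping prediction set sizes comparable to standard methods.

Link: https://arxiv.org/abs/2602.19483
Authors: Arjun Chatterjee,Sayeed Sajjad Razin,John Wu,Siddhartha Laghuvarapu,Jathurshan Pradeepkumar,Jimeng Sun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Under Review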

Abstract:Quantifying uncertainty in clinical predictions is critical for high-stakes diagnosis tasks. Conformal prediction offers a principled approach by providing prediction sets with theoretical coverage guarantees. However, in practice, patient distribution shifts violate the i.i.d. assumptions underlying standard conformal methods, leading to poor coverage in healthcare settings. In this work, we evaluate several conformal prediction approaches on EEG seizure classification, a task with known distribution shift challenges and label uncertainty. We demonstrate that personalized calibration strategies can improve coverage by over 20 percentage points while maintaining comparable prediction set sizes. Our implementation is available via PyHealth, an open-source healthcare AI framework: this https URL.
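For readers unfamiliar with the mechanics, the sketch below shows standard split conformal calibration and one possible way to personalize it. The quantile rule is the usual split conformal one; `personalized_threshold`, its `min_patient_points` cutoff, and the score `1 - p_y` are hypothetical choices for illustration and are not the strategies evaluated in the paper or the PyHealth API.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every label is included
    return sorted(cal_scores)[k - 1]

def personalized_threshold(pooled_scores, patient_scores, alpha,
                           min_patient_points=20):
    """Hypothetical strategy: calibrate on the patient's own scores when there
    are enough of them, otherwise fall back to the pooled calibration set."""
    if len(patient_scores) >= min_patient_points:
        return conformal_threshold(patient_scores, alpha)
    return conformal_threshold(pooled_scores, alpha)

def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p_y is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

pooled = [0.30, 0.10, 0.20]
threshold = personalized_threshold(pooled, patient_scores=[0.9], alpha=0.5)
labels = prediction_set([0.7, 0.2, 0.1], threshold)
```

Calibrating on scores from the same patient targets the distribution the test points actually come from, at the cost of a smaller calibration set and therefore a more conservative threshold.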

[AI-57] Scale-PINN: Learning Efficient Physics-Informed Neural Networks Through Sequential Correction

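[Quick read]: This paper addresses the limited adoption of physics-informed neural networks (PINNs) in science and engineering, which is due to slow training and modest accuracy relative to modern numerical solvers. The key to the solution is Scale-PINN, a sequential-correction learning strategy that embeds the iterative residual-correction principle of numerical solvers directly into the loss formulation. This changes how PINN losses are constructed, substantially accelerates convergence while preserving high accuracy, and makes PINNs practical for problems in fluid dynamics, aerodynamics, and urban science.

Link: https://arxiv.org/abs/2602.19475
Authors: Pao-Hsiung Chiu,Jian Cheng Wong,Chin Chun Ooi,Chang Wei,Yuchen Fan,Yew-Soon Ong
Affiliation: unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: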

Abstract:Physics-informed neural networks (PINNs) have emerged as a promising mesh-free paradigm for solving partial differential equations, yet adoption in science and engineering is limited by slow training and modest accuracy relative to modern numerical solvers. We introduce the Sequential Correction Algorithm for Learning Efficient PINN (Scale-PINN), a learning strategy that bridges modern physics-informed learning with numerical algorithms. Scale-PINN incorporates the iterative residual-correction principle, a cornerstone of numerical solvers, directly into the loss formulation, marking a paradigm shift in how PINN losses can be conceived and constructed. This integration enables Scale-PINN to achieve unprecedented convergence speed across PDE problems from different physics domains, including reducing training time on a challenging fluid-dynamics problem for a state-of-the-art PINN from hours to under 2 minutes while maintaining superior accuracy, and enabling application to representative problems in aerodynamics and urban science. By uniting the rigor of numerical methods with the flexibility of deep learning, Scale-PINN marks a significant leap toward the practical adoption of PINNs in science and engineering through scalable, physics-informed learning. Codes are available at this https URL.

[AI-58] Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

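[Quick read]: This paper addresses the systemic risks of using large language models (LLMs) as security advisors for Trusted Execution Environments (TEEs), including hallucination, mischaracterization of TEE mechanisms, overclaimed security guarantees, and unsafe behavior under adversarial prompting. The core challenge is that LLMs lack technical accuracy and safety in TEE architecture review, vulnerability analysis, and mitigation planning, and that these failures transfer across models. The solution is TEE-RedBench, an evaluation methodology comprising a TEE-specific threat model, a structured prompt suite, and a multi-dimensional annotation rubric. The authors further outline an "LLM-in-the-loop" evaluation pipeline that combines policy gating, retrieval grounding, structured templates, and lightweight verification checks to reduce LLM failures by 80.62%, improving the reliability and auditability of LLM-assisted TEE security work.

Link: https://arxiv.org/abs/2602.19450
Authors: Kunal Mukherjee
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: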

Abstract:Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating system, yet real deployments remain vulnerable to microarchitectural leakage, side-channel attacks, and fault injection. In parallel, security teams increasingly rely on Large Language Model (LLM) assistants as security advisors for TEE architecture review, mitigation planning, and vulnerability triage. This creates a socio-technical risk surface: assistants may hallucinate TEE mechanisms, overclaim guarantees (e.g., what attestation does and does not establish), or behave unsafely under adversarial prompting. We present a red-teaming study of two widely deployed LLM assistants in the role of TEE security advisors, ChatGPT-5.2 and Claude Opus-4.6, focusing on the inherent limitations and transferability of prompt-induced failures across LLMs. We introduce TEE-RedBench, a TEE-grounded evaluation methodology comprising (i) a TEE-specific threat model for LLM-mediated security work, (ii) a structured prompt suite spanning SGX and TrustZone architecture, attestation and key management, threat modeling, and non-operational mitigation guidance, along with policy-bound misuse probes, and (iii) an annotation rubric that jointly measures technical correctness, groundedness, uncertainty calibration, refusal quality, and safe helpfulness. We find that some failures are not purely idiosyncratic, transferring up to 12.02% across LLM assistants, and we connect these outcomes to secure architecture by outlining an “LLM-in-the-loop” evaluation pipeline: policy gating, retrieval grounding, structured templates, and lightweight verification checks that, when combined, reduce failures by 80.62%.

[AI-59] When AI Teammates Meet Code Review: Collaboration Signals Shaping the Integration of Agent-Authored Pull Requests

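[Quick read]: This paper asks how pull requests submitted on GitHub by generative AI coding agents integrate into human-driven code review workflows, and which factors drive successful integration. The approach is to identify the variables that determine whether a pull request is eventually merged. Reviewer engagement has the strongest correlation with successful integration, whereas larger change sizes and coordination-disrupting actions (such as force pushes) are associated with a lower likelihood of merging. Successful integration also depends on agents engaging in actionable review loops that converge toward reviewer expectations, which indicates that, beyond code quality, alignment with established collaboration practices is decisive.

Link: https://arxiv.org/abs/2602.19441
Authors: Costain Nachuma,Minhaz Zibran
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, 1 table. Accepted at the 23rd International Conference on Mining Software Repositories (MSR 2026), Rio de Janeiro, Brazil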

Abstract:Autonomous coding agents increasingly contribute to software development by submitting pull requests on GitHub; yet, little is known about how these contributions integrate into human-driven review workflows. We present a large empirical study of agent-authored pull requests using the public AIDev dataset, examining integration outcomes, resolution speed, and review-time collaboration signals. Using logistic regression with repository-clustered standard errors, we find that reviewer engagement has the strongest correlation with successful integration, whereas larger change sizes and coordination-disrupting actions, such as force pushes, are associated with a lower likelihood of merging. In contrast, iteration intensity alone provides limited explanatory power once collaboration signals are considered. A qualitative analysis further shows that successful integration occurs when agents engage in actionable review loops that converge toward reviewer expectations. Overall, our results highlight that the effective integration of agent-authored pull requests depends not only on code quality but also on alignment with established review and coordination practices.

[AI-60] OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents

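[Quick read]: This paper addresses supply chain optimization models that become infeasible because of modeling errors. The core challenge is that diagnosis and repair depend heavily on scarce operations research (OR) expertise, and that existing API models perform poorly at restoring feasibility while preserving operational rationality. The solution is OptiRepair, which splits the repair task into two phases: a domain-agnostic feasibility phase (iterative repair of any linear program, guided by irreducible infeasible subsystems (IIS)) and a domain-specific validation phase (five rationality checks grounded in inventory theory). The authors test 22 API models on 976 multi-echelon supply chain problems and train two 8B-parameter models using self-taught reasoning with solver-verified rewards. The trained models reach an 81.7% Rational Recovery Rate (RRR), versus 42.2% for the best API model and 21.3% on average. The gap concentrates in Phase 1 repair, where the recovery rate rises from 27.6% for API models to 97.2% for the trained models.

Link: https://arxiv.org/abs/2602.19439
Authors: Ruicheng Ao,David Simchi-Levi,Xinshang Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 34 pages, 8 figures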

Abstract:Problem Definition. Supply chain optimization models frequently become infeasible because of modeling errors. Diagnosis and repair require scarce OR expertise: analysts must interpret solver diagnostics, trace root causes across echelons, and fix formulations without sacrificing operational soundness. Whether AI agents can perform this task remains untested. Methodology/Results. OptiRepair splits this task into a domain-agnostic feasibility phase (iterative IIS-guided repair of any LP) and a domain-specific validation phase (five rationality checks grounded in inventory theory). We test 22 API models from 7 families on 976 multi-echelon supply chain problems and train two 8B-parameter models using self-taught reasoning with solver-verified rewards. The trained models reach 81.7% Rational Recovery Rate (RRR) – the fraction of problems resolved to both feasibility and operational rationality – versus 42.2% for the best API model and 21.3% on average. The gap concentrates in Phase 1 repair: API models average 27.6% recovery rate versus 97.2% for trained models. Managerial Implications. Two gaps separate current AI from reliable model repair: solver interaction (API models restore only 27.6% of infeasible formulations) and operational rationale (roughly one in four feasible repairs violate supply chain theory). Each requires a different intervention: solver interaction responds to targeted training; operational rationale requires explicit specification as solver-verifiable checks. For organizations adopting AI in operational planning, formalizing what “rational” means in their context is the higher-return investment.

[AI-61] IR3: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

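[Quick read]: This paper addresses reward hacking in Reinforcement Learning from Human Feedback (RLHF): during alignment, models exploit spurious correlations in proxy rewards to score highly rather than internalizing human intent, and because the internalized objectives are opaque, such behavior is hard to detect or correct. The solution is the IR3 (Interpretable Reward Reconstruction and Rectification) framework, which has three parts: (1) Contrastive Inverse Reinforcement Learning (C-IRL) reconstructs the implicit reward function and identifies what drives the behavioral shift between pre- and post-RLHF policies; (2) sparse autoencoders decompose the reconstructed reward into interpretable features, and contribution analysis locates reward-hacking signatures; (3) targeted mitigation strategies (clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation) intervene on the problematic features to reduce hacking behavior while keeping capabilities within 3% of the original model.

Link: https://arxiv.org/abs/2602.19416
Authors: Mohammad Beigi,Ming Jin,Junshan Zhang,Jiaxin Zhang,Qifan Wang,Lifu Huang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: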

Abstract:Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking: models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose mitigation strategies (clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation) that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.

[AI-62] One Size Fits None: Modeling NYC Taxi Trips

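[Quick read]: This paper addresses how to accurately predict passenger tipping in New York City, where behavior differs between traditional taxis and app-based ride-hailing services. Tips for traditional taxis are highly predictable (R² ≈ 0.72), which the authors attribute to the in-car payment screen prompting standardized behavior, whereas app-based tipping appears random and is hard to model (R² ≈ 0.17). The key point is that the two services differ fundamentally in their tipping patterns, so building a single universal model is a mistake: because of Simpson's paradox, a combined model looks accurate on average but fails to predict tips for either category individually, and each service needs its own specialized model.

Link: https://arxiv.org/abs/2602.19404
Authors: Tomas Eglinskas
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 10 figures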

Abstract:The rise of app-based ride-sharing has fundamentally changed tipping culture in New York City. We analyzed 280 million trips from 2024 to see if we could predict tips for traditional taxis versus high-volume for-hire services. By testing methods from linear regression to deep neural networks, we found two very different outcomes. Traditional taxis are highly predictable ($R^2 \approx 0.72$) due to the in-car payment screen. In contrast, app-based tipping is random and hard to model ($R^2 \approx 0.17$). In conclusion, we show that building one universal model is a mistake: due to Simpson’s paradox, a combined model looks accurate on average but fails to predict tips for the individual taxi categories, which require specialized models.

[AI-63] Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

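[Quick read]: This paper addresses the vulnerability of large language models (LLMs) to jailbreak prompts, especially when attackers conceal malicious intent by manipulating the framing of a request, a setting in which defenses based on structural artifacts or goal-specific signatures often fail. The key to the solution is a self-supervised representation-disentanglement framework, Representation Disentanglement on Activations (ReDAct), which separates semantic factor pairs, in particular goal and framing, in LLM activations at inference time. On top of the framing representations the authors build FrameShield, an anomaly detector that provides lightweight, model-agnostic detection of malicious prompts across LLM families. Theoretical analysis and empirical validation both indicate that the disentangled representations extracted by ReDAct substantially improve FrameShield's detection performance and offer a building block for LLM safety and mechanistic interpretability.

Link: https://arxiv.org/abs/2602.19396
Authors: Amirhossein Farzam,Majid Behabahani,Mani Malek,Yuriy Nevmyvaka,Guillermo Sapiro
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: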

Abstract:Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train the Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations in a frozen LLM. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct and extensive empirical validations show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.

[AI-64] Artificial Intelligence for Modeling and Simulation in Digital Twins

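[Quick read]: This paper examines how modeling and simulation (M&S) and artificial intelligence (AI) can be deeply integrated within digital twins (DTs) to advance digital technology. Its key contribution is to lay out the complementary relationship among M&S, AI, and DTs. On one hand, M&S is a core enabling technology of DTs, providing high-fidelity dynamic representations of physical assets through methods ranging from physics-based modeling to discrete-event simulation. On the other hand, AI enhances DTs through advanced analytics, predictive capabilities, and autonomous decision-making, while DTs in turn serve as closed-loop platforms for training, validating, and deploying AI models, so the benefit runs in both directions. This framework provides a theoretical basis and a practical path toward more integrated and intelligent digital twin systems.

Link: https://arxiv.org/abs/2602.19390
Authors: Philipp Zech,Istvan David
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: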

Abstract:The convergence of modeling and simulation (M&S) and artificial intelligence (AI) is leaving its mark on advanced digital technology. Pertinent examples are digital twins (DTs): high-fidelity, live representations of physical assets, and frequent enablers of corporate digital maturation and transformation. Often seen as technological platforms that integrate an array of services, DTs have the potential to bring AI-enabled M&S closer to end-users. It is, therefore, paramount to understand the role of M&S in DTs, and the role of digital twins in enabling the convergence of AI and M&S. To this end, this chapter provides a comprehensive exploration of the complementary relationship between these three. We begin by establishing a foundational understanding of DTs by detailing their key components, architectural layers, and their various roles across business, development, and operations. We then examine the central role of M&S in DTs and provide an overview of key modeling techniques from physics-based and discrete-event simulation to hybrid approaches. Subsequently, we investigate the bidirectional role of AI: first, how AI enhances DTs through advanced analytics, predictive capabilities, and autonomous decision-making, and second, how DTs serve as valuable platforms for training, validating, and deploying AI models. The chapter concludes by identifying key challenges and future research directions for creating more integrated and intelligent systems.

[AI-65] Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

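[Quick read]: This paper addresses unstable training dynamics in deep reinforcement learning caused by non-stationarity, where learning objectives and data distributions change over time and lead to degraded performance and difficult convergence. The key to the solution is Sketched Isotropic Gaussian Regularization, which shapes representations toward an isotropic Gaussian distribution during training. Such representations support stable tracking of time-varying targets, achieve maximal entropy under a fixed variance budget, and encourage balanced use of all representational dimensions, which makes agents more adaptive and training more stable.

Link: https://arxiv.org/abs/2602.19373
Authors: Ali Saheb,Johan Obando-Ceron,Aaron Courville,Pouya Bashivan,Pablo Samuel Castro
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: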

Abstract:Deep reinforcement learning systems often suffer from unstable training dynamics due to non-stationarity, where learning objectives and data distributions evolve over time. We show that under non-stationary targets, isotropic Gaussian embeddings are provably advantageous. In particular, they induce stable tracking of time-varying targets for linear readouts, achieve maximal entropy under a fixed variance budget, and encourage a balanced use of all representational dimensions, all of which enable agents to be more adaptive and stable. Building on this insight, we propose the use of Sketched Isotropic Gaussian Regularization for shaping representations toward an isotropic Gaussian distribution during training. We demonstrate empirically, over a variety of domains, that this simple and computationally inexpensive method improves performance under non-stationarity while reducing representation collapse, neuron dormancy, and training instability.
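The abstract does not specify the regularizer, so the sketch below only illustrates the general idea of "sketching": project embeddings onto a few random directions and penalize each one-dimensional projection for deviating from a standard normal. Matching only the first two moments is a simplification chosen here, and the function names are hypothetical; the actual method may use a different statistic.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sketched_isotropy_penalty(embeddings, n_directions=16, seed=0):
    """Average over random directions of mean^2 + (variance - 1)^2.

    For an isotropic standard Gaussian, every 1-D projection has mean 0 and
    variance 1, so the penalty is close to zero. Collapsed or anisotropic
    embeddings are penalized.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mu = sum(proj) / len(proj)
        var = sum((p - mu) ** 2 for p in proj) / len(proj)
        total += mu ** 2 + (var - 1.0) ** 2
    return total / n_directions

collapsed = [[0.0] * 4 for _ in range(10)]        # fully collapsed representation
penalty = sketched_isotropy_penalty(collapsed)    # zero mean, zero variance
```

In training, a term like this would be added to the agent's loss with a small coefficient and computed with an autodiff framework rather than plain Python lists.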

[AI-66] Active perception and disentangled representations allow continual episodic zero and few-shot learning

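[Quick read]: This paper addresses the conflict between fast learning and generalization in machine learning systems: the entangled representations that conventional models form in order to generalize cause destructive interference in continual or few-shot learning. The key to the solution is a Complementary Learning System (CLS) in which the fast learner entirely forgoes generalization and focuses on zero-shot and few-shot learning. The fast learner also supplies a contextual bias that induces the slow, statistical learner to encode novel stimuli in familiar, generalized terms, enabling efficient continual learning without interference. The architecture shows that fast, context-driven reasoning can coexist with slow, structured generalization, offering a pathway to robust continual learning.

Link: https://arxiv.org/abs/2602.19355
Authors: David Rawlinson,Gideon Kowadlo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages; 7 figures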

Abstract:Generalization is often regarded as an essential property of machine learning systems. However, perhaps not every component of a system needs to generalize. Training models for generalization typically produces entangled representations at the boundaries of entities or classes, which can lead to destructive interference when rapid, high-magnitude updates are required for continual or few-shot learning. Techniques for fast learning with non-interfering representations exist, but they generally fail to generalize. Here, we describe a Complementary Learning System (CLS) in which the fast learner entirely foregoes generalization in exchange for continual zero-shot and few-shot learning. Unlike most CLS approaches, which use episodic memory primarily for replay and consolidation, our fast, disentangled learner operates as a parallel reasoning system. The fast learner can overcome observation variability and uncertainty by leveraging a conventional slow, statistical learner within an active perception system: A contextual bias provided by the fast learner induces the slow learner to encode novel stimuli in familiar, generalized terms, enabling zero-shot and few-shot learning. This architecture demonstrates that fast, context-driven reasoning can coexist with slow, structured generalization, providing a pathway for robust continual learning.

[AI-67] Smooth Gate Functions for Soft Advantage Policy Optimization

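[Quick read]: This paper addresses the instability of Group Relative Policy Optimization (GRPO) in large language model training, which is caused by hard clipping. The key idea is to replace hard clipping with a smooth gate function. Soft Adaptive Policy Optimization (SAPO) already does this with a sigmoid-based gate, giving more stable updates; building on it, the authors formalize the key properties an admissible gate should satisfy, evaluate several families of candidate functions, and analyze how the choice of gate affects training stability and final model performance. The results offer practical guidance for designing smoother and more robust policy optimization objectives for large language models.

Link: https://arxiv.org/abs/2602.19345
Authors: Egor Denisov,Svetlana Glazyrina,Maksim Kryzhanovskiy,Roman Ischenko
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: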

Abstract:Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, but it remains susceptible to instability due to the use of hard clipping. Soft Adaptive Policy Optimization (SAPO) addresses this limitation by replacing clipping with a smooth sigmoid-based gate function, which leads to more stable updates. We take this idea further and investigate the impact of different gate functions on both training stability and final model performance. We formalize the key properties that admissible gates should satisfy and identify several families of such functions for empirical evaluation. This paper presents an analysis of our findings based on experiments conducted with the Qwen2.5-7B-Instruct model on mathematical reasoning tasks. These results provide practical guidance for designing smoother and more robust policy optimization objectives for large language model training.
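To make the contrast between hard clipping and a smooth gate concrete, the sketch below compares the per-token gradient weight each one assigns as the probability ratio moves away from 1. The hard-clip mask follows the standard PPO-style clipped surrogate. The sigmoid gate is one illustrative smooth choice, with a hypothetical temperature `tau`; it is not claimed to be the exact form used by SAPO or by this paper.

```python
import math

def hard_clip_weight(ratio, advantage, eps=0.2):
    """Gradient weight of min(r * A, clip(r, 1 - eps, 1 + eps) * A) w.r.t. r,
    divided by A: the signal drops abruptly to zero outside the trust region."""
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gate_weight(ratio, tau=5.0):
    """Derivative of the surrogate (4 / tau) * sigmoid(tau * (r - 1)).

    Equals 1 at r = 1 and decays smoothly and symmetrically as the ratio
    drifts away from 1, instead of switching off at a fixed boundary.
    """
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)

for r in (0.7, 0.9, 1.0, 1.1, 1.3, 1.5):
    print(r, hard_clip_weight(r, advantage=1.0), round(sigmoid_gate_weight(r), 3))
```

A larger `tau` narrows the gate and approaches clipping-like behavior, while a smaller `tau` keeps more off-policy tokens contributing to the update.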

[AI-68] Soft Sequence Policy Optimization: Bridging GMPO and SAPO

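[Quick read]: This paper addresses insufficient exploration and unstable training in policy optimization for large language model (LLM) alignment, in particular within the Group Relative Policy Optimization (GRPO) framework, where PPO-style clipping tends to cause loss of training signal and entropy collapse. The key to the solution is a new off-policy reinforcement learning objective, Soft Sequence Policy Optimization (SSPO), which incorporates soft gating functions over token-level probability ratios within sequence-level importance sampling weights. This is intended to strengthen policy exploration while maintaining training stability, combining sequence coherence with token-level adaptivity.

Link: https://arxiv.org/abs/2602.19327
Authors: Svetlana Glazyrina,Maksim Kryzhanovskiy,Roman Ischenko
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: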

Abstract:A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift toward sequence-level importance sampling weights that better align with the sequence-level rewards used in many tasks, and (ii) alternatives to PPO-style clipping that aim to avoid the associated loss of training signal and entropy collapse. Recent work, such as Soft Adaptive Policy Optimization (SAPO), reformulates the Scopic objective within the GRPO framework and achieves both sequence coherence and token adaptivity. Geometric-Mean Policy Optimization (GMPO) leverages token-wise ratio clipping within sequence importance sampling weights. Building on these ideas, this work proposes a new objective that promotes effective policy exploration while maintaining training stability. Specifically, we introduce Soft Sequence Policy Optimization, an off-policy reinforcement learning objective that incorporates soft gating functions over token-level probability ratios within sequence-level importance weights.

[AI-69] Health+: Empowering Individuals via Unifying Health Data

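[Quick read]: This paper addresses the difficulty of managing personal health data in today's healthcare ecosystem: patients lack meaningful control over their medical records, and the data are scattered across incompatible systems and formats. The solution is Health+, a user-centric platform for managing multimodal health data (for example text, images, and reports). Intuitive interfaces and intelligent recommendations let individuals upload, query, and share their data, while at the system level Health+ handles efficient storage, integration, and privacy protection of heterogeneous health records, laying the foundation for a more connected, interpretable, and patient-controlled health information ecosystem.

Link: https://arxiv.org/abs/2602.19319
Authors: Sujaya Maiyya,Shantanu Sharma,Avinash Kumar
Affiliation: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: This paper has been accepted in ACM Multimedia 2025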

Abstract:Managing personal health data is a challenge in today’s fragmented and institution-centric healthcare ecosystem. Individuals often lack meaningful control over their medical records, which are scattered across incompatible systems and formats. This vision paper presents Health+, a user-centric, multimodal health data management system that empowers individuals (including those with limited technical expertise) to upload, query, and share their data across modalities (e.g., text, images, reports). Rather than aiming for institutional overhaul, Health+ emphasizes individual agency by providing intuitive interfaces and intelligent recommendations for data access and sharing. At the system level, it tackles the complexity of storing, integrating, and securing heterogeneous health records, ensuring both efficiency and privacy. By unifying multimodal data and prioritizing patients, Health+ lays the foundation for a more connected, interpretable, and user-controlled health information ecosystem.

[AI-70] Online Navigation Planning for Long-term Autonomous Operation of Underwater Gliders

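[Quick read]: This paper addresses the inefficiency of long-term autonomous underwater glider deployments, which stems from a lack of suitable navigation planning methods. The core challenge is planning reliable, efficient paths in an uncertain ocean environment, where both ocean current forecasts and control execution are uncertain. The key to the solution is to model glider navigation planning as a stochastic shortest-path Markov Decision Process and to propose a sample-based online planner built on Monte Carlo Tree Search (MCTS). Samples come from a physics-informed simulator that captures uncertainty in control execution and in ocean current forecasts while remaining computationally tractable; its parameters are fitted to historical glider data. The method is integrated into an autonomous command-and-control system for Slocum gliders, enabling closed-loop replanning at each surfacing. Two field deployments in the North Sea, totalling approximately 3 months and 1000 km, showed improved efficiency over the conventional straight-to-goal navigation strategy and demonstrated the practicality of sample-based planning for long-term marine autonomy.

Link: https://arxiv.org/abs/2602.19315
Authors: Victor-Alexandru Darvariu, Charlotte Z. Reed, Jan Stratmann, Bruno Lacerda, Benjamin Allsup, Stephen Woodward, Elizabeth Siddle, Trishna Saeharaseelan, Owain Jones, Dan Jones, Tobias Ferreira, Chloe Baker, Kevin Chaplin, James Kirk, Ashley Morris, Ryan Patmore, Jeff Polton, Charlotte Williams, Alexandra Kokkinaki, Alvaro Lorenzo Lopez, Justin J. H. Buck, Nick Hawes
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: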

Abstract:Underwater glider robots have become an indispensable tool for ocean sampling. Although stakeholders are calling for tools to manage increasingly large fleets of gliders, successful autonomous long-term deployments have thus far been scarce, which hints at a lack of suitable methodologies and systems. In this work, we formulate glider navigation planning as a stochastic shortest-path Markov Decision Process and propose a sample-based online planner based on Monte Carlo Tree Search. Samples are generated by a physics-informed simulator that captures uncertain execution of controls and ocean current forecasts while remaining computationally tractable. The simulator parameters are fitted using historical glider data. We integrate these methods into an autonomous command-and-control system for Slocum gliders that enables closed-loop replanning at each surfacing. The resulting system was validated in two field deployments in the North Sea totalling approximately 3 months and 1000 km of autonomous operation. Results demonstrate improved efficiency compared to straight-to-goal navigation and show the practicality of sample-based planning for long-term marine autonomy.

[AI-71] TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

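[Quick read]: This paper addresses the stalled progress of Vision-Language-Action (VLA) models in Reinforcement Learning (RL), which is caused by low sample efficiency and sparse rewards. It focuses on building generalizable process reward models that give fine-grained feedback to bridge the gap between pretraining and real-world tasks. The key to the solution is TOPReward, a novel, probabilistically grounded temporal value function. Rather than prompting a pretrained video Vision-Language Model (VLM) to output progress values directly (which is prone to numerical misrepresentation), it extracts task progress from the VLM's internal token logits, giving a more stable and accurate estimate of robotic task progress. In zero-shot evaluations across 130+ real-world tasks and multiple robot platforms, TOPReward achieves a mean Value-Order Correlation (VOC) of 0.947 on Qwen3-VL, far above the state-of-the-art GVL baseline, which achieves near-zero correlation on the same model.

Link: https://arxiv.org/abs/2602.19313
Authors: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: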

Abstract:While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM’s internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
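
The Value-Order Correlation figure quoted above can be made concrete with a small sketch. It assumes VOC is the Spearman rank correlation between per-frame progress estimates and the frames' chronological order; the abstract does not give the formula, so treat that as an assumption. The `prob_true` helper, which reads a progress value from a pair of hypothetical "True"/"False" token logits, is likewise only an illustration and not the authors' procedure.

```python
import math

def prob_true(logit_true: float, logit_false: float) -> float:
    """Two-way softmax: probability mass on a hypothetical 'True' token."""
    m = max(logit_true, logit_false)
    e_t = math.exp(logit_true - m)
    e_f = math.exp(logit_false - m)
    return e_t / (e_t + e_f)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def value_order_correlation(progress):
    """Spearman correlation between predicted progress and frame index."""
    frame_order = list(range(len(progress)))
    return pearson(ranks(progress), ranks(frame_order))

# Hypothetical (logit_true, logit_false) pairs for four consecutive frames.
logits = [(-2.0, 1.0), (-0.5, 0.5), (0.3, 0.1), (1.5, -1.0)]
progress = [prob_true(t, f) for t, f in logits]
voc = value_order_correlation(progress)
```

A monotonically increasing progress estimate gives a correlation of 1, a reversed one gives -1, and a constant one gives 0 under this definition.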

[AI-72] ALPACA: A Reinforcement Learning Environment for Medication Repurposing and Treatment Optimization in Alzheimer’s Disease

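[Quick read]: This paper addresses the difficulty of evaluating personalized, sequential treatment strategies for Alzheimer's disease (AD) in clinical trials, mainly because of long disease horizons and substantial inter-patient heterogeneity. The key to the solution is ALPACA (Alzheimer's Learning Platform for Adaptive Care Agents), an open-source reinforcement learning (RL) environment. Its core is the Continuous Action-conditioned State Transitions (CAST) model, trained on longitudinal data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), which simulates disease progression trajectories under different medication interventions and supports training and evaluating personalized sequential treatment strategies in silico. Experiments show that RL policies trained in ALPACA outperform no-treatment and behavior-cloned clinician baselines on memory-related outcomes, and that the policies rely on clinically meaningful patient features when selecting actions, supporting their interpretability and practicality.

Link: https://arxiv.org/abs/2602.19298
Authors: Nolan Brady, Tom Yeh
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: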

Abstract:Evaluating personalized, sequential treatment strategies for Alzheimer’s disease (AD) using clinical trials is often impractical due to long disease horizons and substantial inter-patient heterogeneity. To address these constraints, we present the Alzheimer’s Learning Platform for Adaptive Care Agents (ALPACA), an open-source, Gym-compatible reinforcement learning (RL) environment for systematically exploring personalized treatment strategies using existing therapies. ALPACA is powered by the Continuous Action-conditioned State Transitions (CAST) model trained on longitudinal trajectories from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), enabling medication-conditioned simulation of disease progression under alternative treatment decisions. We show that CAST autoregressively generates realistic medication-conditioned trajectories and that RL policies trained in ALPACA outperform no-treatment and behavior-cloned clinician baselines on memory-related outcomes. Interpretability analyses further indicated that the learned policies relied on clinically meaningful patient features when selecting actions. Overall, ALPACA provides a reusable in silico testbed for studying individualized sequential treatment decision-making for AD.
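
To illustrate what "Gym-compatible" means in practice, here is a minimal sketch of an environment that follows the Gymnasium-style `reset()` / `step()` calling convention without importing the library. Everything in it is hypothetical: `ToyProgressionEnv`, its single "memory score" state, and its dose-dependent decline are toy stand-ins chosen for brevity, not the CAST model and not a clinical claim.

```python
import random

class ToyProgressionEnv:
    """Toy stand-in for a medication-conditioned simulator (NOT the CAST model)."""

    def __init__(self, horizon: int = 10, seed: int = 0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.memory_score = 1.0

    def reset(self):
        """Start a new episode; returns (observation, info)."""
        self.t = 0
        self.memory_score = 1.0
        return [self.memory_score], {}

    def step(self, action: float):
        """Apply a dose in [0, 1]; returns (obs, reward, terminated, truncated, info)."""
        dose = min(max(action, 0.0), 1.0)
        decline = 0.10 * (1.0 - 0.5 * dose)   # toy rule: a higher dose slows the decline
        noise = self.rng.gauss(0.0, 0.01)
        new_score = max(0.0, self.memory_score - decline + noise)
        reward = new_score - self.memory_score
        self.memory_score = new_score
        self.t += 1
        terminated = self.memory_score <= 0.0
        truncated = self.t >= self.horizon
        return [self.memory_score], reward, terminated, truncated, {}

def rollout(env, policy):
    """Run one episode and return the summed reward."""
    obs, _ = env.reset()
    total = 0.0
    done = False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total

no_treatment = rollout(ToyProgressionEnv(seed=1), lambda obs: 0.0)
full_dose = rollout(ToyProgressionEnv(seed=1), lambda obs: 1.0)
```

Because the interface matches the usual five-value `step` contract, any policy written against it could be compared with fixed baselines such as the constant-dose policies above.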

[AI-73] Automated Generation of Microfluidic Netlists using Large Language Models

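[Quick read]: This paper addresses the high design complexity of microfluidic devices and the difficulty practitioners face in using design automation tools, that is, how to efficiently turn natural-language design requirements for a microfluidic device into an executable system-level structural description. The key to the solution is the first use of Large Language Models (LLMs) to convert natural language automatically into a structural hardware description (a Verilog netlist), yielding a preliminary practical framework that connects microfluidic design requirements with automated design flows. Feasibility is demonstrated by generating Verilog netlists for typical microfluidic design benchmarks with correct functional flow and an average syntactical accuracy of 88%.

Link: https://arxiv.org/abs/2602.19297
Authors: Jasper Davidson, Skylar Stockham, Allen Boston, Ashton Snelgrove, Valerio Tenace, Pierre-Emmanuel Gaillardon
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: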

Abstract:Microfluidic devices have emerged as powerful tools in various laboratory applications, but the complexity of their design limits accessibility for many practitioners. While progress has been made in microfluidic design automation (MFDA), a practical and intuitive solution is still needed to connect microfluidic practitioners with MFDA techniques. This work introduces the first practical application of large language models (LLMs) in this context, providing a preliminary demonstration. Building on prior research in hardware description language (HDL) code generation with LLMs, we propose an initial methodology to convert natural language microfluidic device specifications into system-level structural Verilog netlists. We demonstrate the feasibility of our approach by generating structural netlists for practical benchmarks representative of typical microfluidic designs with correct functional flow and an average syntactical accuracy of 88%.

[AI-74] Limited Reasoning Space: The cage of long-horizon reasoning in LLMs

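[Quick read]: This paper addresses the performance degradation that Large Language Models (LLMs) can show as the compute budget grows when they use test-time compute strategies such as Chain-of-Thought (CoT) on complex reasoning tasks. The study attributes this to conventional static planning methods that cannot perceive the intrinsic boundaries of LLM reasoning, which leads to over-planning, introduces redundant feedback, and impairs reasoning. The paper proposes Halo, a framework whose core is a dynamic planning mechanism based on Model Predictive Control (MPC). An entropy-driven dual controller implements a Measure-then-Plan strategy that dynamically regulates planning intensity at the reasoning boundary, exploiting available compute while suppressing over-planning and substantially improving reasoning on long-horizon tasks.

Link: https://arxiv.org/abs/2602.19281
Authors: Zhenyu Li, Guanlin Wu, Cheems Wang, Yongqiang Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: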

Abstract:The test-time compute strategy, such as Chain-of-Thought (CoT), has significantly enhanced the ability of large language models to solve complex tasks like logical reasoning. However, empirical studies indicate that simply increasing the compute budget can sometimes lead to a collapse in test-time performance when employing typical task decomposition strategies such as CoT. This work hypothesizes that reasoning failures with larger compute budgets stem from static planning methods, which hardly perceive the intrinsic boundaries of LLM reasoning. We term it as the Limited Reasoning Space hypothesis and perform theoretical analysis through the lens of a non-autonomous stochastic dynamical system. This insight suggests that there is an optimal range for compute budgets; over-planning can lead to redundant feedback and may even impair reasoning capabilities. To exploit the compute-scaling benefits and suppress over-planning, this work proposes Halo, a model predictive control framework for LLM planning. Halo is designed for long-horizon tasks with reason-based planning and crafts an entropy-driven dual controller, which adopts a Measure-then-Plan strategy to achieve controllable reasoning. Experimental results demonstrate that Halo outperforms static baselines on complex long-horizon tasks by dynamically regulating planning at the reasoning boundary.
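
The abstract does not describe Halo's controller in detail, so the following is only one illustrative reading of a Measure-then-Plan rule: measure the entropy of the model's next-step distribution first, then spend planning steps only when that entropy is high. The function names, thresholds, and the linear mapping from entropy to depth are all assumptions made for this sketch.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def plan_depth(probs, max_depth=4, low=0.3, high=1.0):
    """Measure entropy first, then choose how many planning steps to spend.

    Below `low` the model is treated as confident and no planning is added;
    above `high` the full budget is used; in between the depth scales linearly.
    """
    h = shannon_entropy(probs)
    if h <= low:
        return 0
    if h >= high:
        return max_depth
    frac = (h - low) / (high - low)
    return max(1, round(frac * max_depth))

confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution: little to gain from planning
uncertain = [0.25, 0.25, 0.25, 0.25]   # flat distribution: planning budget is spent

depth_confident = plan_depth(confident)
depth_uncertain = plan_depth(uncertain)
```

Capping the depth at `max_depth` is the part of the sketch that corresponds to suppressing over-planning: extra compute is never added beyond a fixed ceiling, however uncertain the model is.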

[AI-75] Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data

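[Quick read]: This paper addresses the instability, and even divergence, of second-order optimizers in Federated Learning on non-IID data. The root cause is identified as preconditioner drift: when clients train with local second-order information, they induce heterogeneous curvature-defined geometries (i.e., preconditioner coordinate systems), while the server averages model updates computed under incompatible metrics, corrupting the global descent direction. The key to the solution is the FedPAC framework, which explicitly decouples parameter aggregation from geometry synchronization: (i) Alignment, which aggregates local preconditioners into a global reference and warm-starts clients with the global preconditioner; and (ii) Correction, which steers local preconditioned updates with a global preconditioned direction to suppress long-term drift. The method comes with non-convex convergence guarantees with linear speedup under partial participation and markedly improves stability and accuracy on vision and language tasks.

Link: https://arxiv.org/abs/2602.19271
Authors: Junkang Liu, Fanhua Shang, Hongying Liu, Jin Liu, Weixin An, Yuanyuan Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: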

Abstract:Second-order optimizers can significantly accelerate large-scale training, yet their naive federated variants are often unstable or even diverge on non-IID data. We show that a key culprit is preconditioner drift: client-side second-order training induces heterogeneous curvature-defined geometries (i.e., preconditioner coordinate systems), and server-side model averaging updates computed under incompatible metrics, corrupting the global descent direction. To address this geometric mismatch, we propose FedPAC, a preconditioner alignment and correction framework for reliable federated second-order optimization. FedPAC explicitly decouples parameter aggregation from geometry synchronization by: (i) Alignment (i.e., aggregating local preconditioners into a global reference and warm-starting clients via global preconditioner); and (ii) Correction (i.e., steering local preconditioned updates using a global preconditioned direction to suppress long-term drift). We provide drift-coupled non-convex convergence guarantees with linear speedup under partial participation. Empirically, FedPAC consistently improves stability and accuracy across vision and language tasks, achieving up to 5.8% absolute accuracy gain on CIFAR-100 with ViTs. Code is available at this https URL.
Submitted: [v1] Sun, 22 Feb 2026 16:57:57 UTC
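
The Alignment and Correction steps described in the FedPAC abstract can be sketched with diagonal preconditioners. This is a simplified illustration under stated assumptions, not the authors' algorithm: the preconditioner is taken to be an Adagrad/Adam-style diagonal second-moment vector, Alignment is a plain average, and Correction is a convex blend controlled by a hypothetical `beta`. All function names are invented for the example.

```python
import math

def align_preconditioners(local_precs):
    """Alignment: average per-client diagonal preconditioners into a global reference."""
    n = len(local_precs)
    dim = len(local_precs[0])
    return [sum(p[j] for p in local_precs) / n for j in range(dim)]

def preconditioned_direction(grad, prec, eps=1e-8):
    """Diagonal preconditioning in the Adagrad/Adam style: g / (sqrt(v) + eps)."""
    return [g / (math.sqrt(v) + eps) for g, v in zip(grad, prec)]

def corrected_step(params, local_grad, prec, global_dir, lr=0.1, beta=0.5):
    """Correction: blend the local preconditioned direction with the global one."""
    local_dir = preconditioned_direction(local_grad, prec)
    mixed = [(1 - beta) * l + beta * g for l, g in zip(local_dir, global_dir)]
    return [p - lr * m for p, m in zip(params, mixed)]

# Two clients with very different curvature estimates and gradients (non-IID toy case).
client_precs = [[4.0, 1.0], [1.0, 9.0]]
client_grads = [[2.0, 0.0], [0.0, 3.0]]

global_prec = align_preconditioners(client_precs)
global_grad = [sum(g[j] for g in client_grads) / len(client_grads) for j in range(2)]
global_dir = preconditioned_direction(global_grad, global_prec)

# Client 0 starts from the shared (warm-started) preconditioner and takes a corrected step.
new_params = corrected_step([0.0, 0.0], client_grads[0], global_prec, global_dir)
```

In this toy case client 0 has no local gradient along the second coordinate, yet its corrected step still moves along it because of the global direction, which is the drift-suppressing effect the blend is meant to illustrate.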

[AI-76] DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation IJCNN2026

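[Quick read]: This paper addresses the limitations of existing graph diffusion models on Directed Acyclic Graph (DAG) structures, in particular their failure to preserve the functional semantics carried by edge direction (such as data flow), which limits their effectiveness in tasks such as Neural Architecture Search (NAS). The key to the solution is Directed Graph Policy Optimization (DGPO), which uses topological node ordering and positional encoding to extend reinforcement learning fine-tuning to DAGs, enabling effective generation and controllable steering of directed combinatorial structures. Experiments show that DGPO reaches optimal or near-optimal performance on NAS-Bench-101 and NAS-Bench-201 and learns transferable structural priors: pretrained on only 7% of the search space, it generates near-optimal architectures close to those of the full-data model.

Link: https://arxiv.org/abs/2602.19261
Authors: Aleksei Liuliakov, Luca Hermes, Barbara Hammer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to IJCNN 2026 (IEEE WCCI). 6 pages, 4 figures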

Abstract:Reinforcement learning fine-tuning has proven effective for steering generative diffusion models toward desired properties in image and molecular domains. Graph diffusion models have similarly been applied to combinatorial structure generation, including neural architecture search (NAS). However, neural architectures are directed acyclic graphs (DAGs) where edge direction encodes functional semantics such as data flow, information that existing graph diffusion methods, designed for undirected structures, discard. We propose Directed Graph Policy Optimization (DGPO), which extends reinforcement learning fine-tuning of discrete graph diffusion models to DAGs via topological node ordering and positional encoding. Validated on NAS-Bench-101 and NAS-Bench-201, DGPO matches the benchmark optimum on all three NAS-Bench-201 tasks (91.61%, 73.49%, 46.77%). The central finding is that the model learns transferable structural priors: pretrained on only 7% of the search space, it generates near-oracle architectures after fine-tuning, within 0.32 percentage points of the full-data model and extrapolating 7.3 percentage points beyond its training ceiling. Bidirectional control experiments confirm genuine reward-driven steering, with inverse optimization reaching near random-chance accuracy (9.5%). These results demonstrate that reinforcement learning-steered discrete diffusion, once extended to handle directionality, provides a controllable generative framework for directed combinatorial structures.
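
The phrase "topological node ordering and positional encoding" can be illustrated as follows. The sketch orders the nodes of a small DAG with Kahn's algorithm and attaches a Transformer-style sinusoidal encoding of each node's position in that order. Which encoding DGPO actually uses is not stated in the abstract, so the sinusoidal choice, the toy four-node cell, and the function names are assumptions.

```python
import heapq
import math

def topological_order(num_nodes, edges):
    """Kahn's algorithm; ties are broken by smallest node index for determinism."""
    indegree = [0] * num_nodes
    children = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = [n for n in range(num_nodes) if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, child)
    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

def sinusoidal_encoding(position, dim=4):
    """Transformer-style sinusoidal encoding of an integer position."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc[:dim]

# Toy cell: input(0) -> op(2) -> output(3) and input(0) -> op(1) -> output(3).
edges = [(0, 2), (0, 1), (2, 3), (1, 3)]
order = topological_order(4, edges)
position_of = {node: pos for pos, node in enumerate(order)}
node_features = {n: sinusoidal_encoding(position_of[n]) for n in range(4)}
```

Every edge then points from a lower position to a higher one, so direction is recoverable from the node features even if the downstream model treats the adjacency structure symmetrically.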

[AI-77] Robust Exploration in Directed Controller Synthesis via Reinforcement Learning with Soft Mixture-of-Experts

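[Quick read]: This paper addresses the robustness problem that arises in On-the-fly Directed Controller Synthesis (OTF-DCS) when Reinforcement Learning (RL) policies generalize anisotropically: a policy performs well only in a specific region of the domain-parameter space and is fragile elsewhere, owing to training stochasticity and trajectory-dependent bias. The key to the solution is a Soft Mixture-of-Experts (Soft-MoE) framework that combines multiple RL experts through a prior-confidence gating mechanism and treats each expert's anisotropic behavior as a complementary specialization, substantially expanding the solvable parameter space and improving overall robustness.

Link: https://arxiv.org/abs/2602.19244
Authors: Toshihide Ubukata, Zhiyao Wang, Enhong Mu, Jialong Li, Kenji Tei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: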

Abstract:On-the-fly Directed Controller Synthesis (OTF-DCS) mitigates state-space explosion by incrementally exploring the system and relies critically on an exploration policy to guide search efficiently. Recent reinforcement learning (RL) approaches learn such policies and achieve promising zero-shot generalization from small training instances to larger unseen ones. However, a fundamental limitation is anisotropic generalization, where an RL policy exhibits strong performance only in a specific region of the domain-parameter space while remaining fragile elsewhere due to training stochasticity and trajectory-dependent bias. To address this, we propose a Soft Mixture-of-Experts framework that combines multiple RL experts via a prior-confidence gating mechanism and treats these anisotropic behaviors as complementary specializations. The evaluation on the Air Traffic benchmark shows that Soft-MoE substantially expands the solvable parameter space and improves robustness compared to any single expert.
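
A soft gate over several experts can be sketched in a few lines. The example assumes each expert assigns a score to every candidate exploration action and carries a scalar prior confidence for the current problem instance; the gate is a softmax over those confidences. How the paper computes the confidences is not given in the abstract, so the numbers and names below are hypothetical.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe_scores(expert_scores, prior_confidences, temperature=1.0):
    """Blend per-expert action scores using softmax gate weights."""
    gates = softmax(prior_confidences, temperature)
    num_actions = len(expert_scores[0])
    return [
        sum(g * scores[a] for g, scores in zip(gates, expert_scores))
        for a in range(num_actions)
    ]

# Two experts scoring three candidate actions; expert 1 is trusted more here.
expert_scores = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]]
prior_confidences = [0.0, 2.0]

mixed = soft_moe_scores(expert_scores, prior_confidences)
best_action = max(range(len(mixed)), key=lambda a: mixed[a])
```

Unlike a hard switch, the soft gate keeps a contribution from every expert, so a less trusted expert can still tip the decision when the trusted one is nearly indifferent between actions.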

[AI-78] Topology of Reasoning: Retrieved Cell Complex-Augmented Generation for Textual Graph Question Answering

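[Quick read]: This paper addresses the inadequate modeling of higher-dimensional topological structure in existing Retrieval-Augmented Generation (RAG) methods for textual graphs, in particular their neglect of cycles, a key element of reasoning, which leads to incomplete contextual grounding and restricted reasoning capability. The key to the solution is TopoRAG, a new framework that lifts textual graphs into cellular complexes to model multi-dimensional topological dependencies explicitly, uses a topology-aware subcomplex retrieval mechanism to extract compact topological context relevant to the query, and applies a multi-dimensional topological reasoning mechanism that propagates relational information over these complexes, guiding Large Language Models (LLMs) toward structured, logic-aware inference.

Link: https://arxiv.org/abs/2602.19240
Authors: Sen Zhao, Lincheng Zhou, Yue Chen, Ding Zou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: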

Abstract:Retrieval-Augmented Generation (RAG) enhances the reasoning ability of Large Language Models (LLMs) by dynamically integrating external knowledge, thereby mitigating hallucinations and strengthening contextual grounding for structured data such as graphs. Nevertheless, most existing RAG variants for textual graphs concentrate on low-dimensional structures – treating nodes as entities (0-dimensional) and edges or paths as pairwise or sequential relations (1-dimensional), but overlook cycles, which are crucial for reasoning over relational loops. Such cycles often arise in questions requiring closed-loop inference about similar objects or relative positions. This limitation often results in incomplete contextual grounding and restricted reasoning capability. In this work, we propose Topology-enhanced Retrieval-Augmented Generation (TopoRAG), a novel framework for textual graph question answering that effectively captures higher-dimensional topological and relational dependencies. Specifically, TopoRAG first lifts textual graphs into cellular complexes to model multi-dimensional topological structures. Leveraging these lifted representations, a topology-aware subcomplex retrieval mechanism is proposed to extract cellular complexes relevant to the input query, providing compact and informative topological context. Finally, a multi-dimensional topological reasoning mechanism operates over these complexes to propagate relational information and guide LLMs in performing structured, logic-aware inference. Empirical evaluations demonstrate that our method consistently surpasses existing baselines across diverse textual graph tasks.

[AI-79] Evaluating SAP RPT-1 for Enterprise Business Process Prediction: In-Context Learning vs. Traditional Machine Learning on Structured SAP Data

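[Quick read]: This paper addresses the barrier to deploying machine learning models on enterprise tabular data, that is, how to achieve efficient, accessible predictive modeling without task-specific training. Gradient-boosted decision trees (GBDT) such as XGBoost, LightGBM, and CatBoost perform well but require substantial labeled data and tuning effort, while Generative AI models are usually not optimized for structured tabular data. The core of the paper is an evaluation of SAP's Retrieval Pretrained Transformer (RPT-1), a compact 64.6 MB model pretrained on 1.34 TB of structured data. In three typical SAP business scenarios it shows notable zero-shot predictive ability: it reaches 91-96% of tuned GBDT accuracy, and at roughly 75-100 context rows it outperforms XGBoost, showing an advantage under limited data. The study therefore proposes a practical hybrid workflow: use RPT-1 to screen candidate tasks quickly, then train GBDT models selectively for high-value scenarios, balancing efficiency and accuracy.

Link: https://arxiv.org/abs/2602.19237
Authors: Amit Lal (Microsoft Corporation)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 32 references. Reproducible experiments available at Hugging Face Spaces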

Abstract:Tabular foundation models aim to make machine learning accessible for enterprise data without task-specific training. This paper presents the first independent evaluation of SAP’s Retrieval Pretrained Transformer (RPT-1) from a practitioner perspective. RPT-1 is a compact 64.6 MB model pretrained on 1.34 TB of structured data across 3.1 million tables. We benchmark it against tuned gradient-boosted decision trees (XGBoost, LightGBM, CatBoost) on three SAP business scenarios: demand forecasting across SD/MM/PP modules, predictive data integrity in BC/MM/QM, and financial risk classification in FI/CO/AR. Across five-fold cross-validation on datasets ranging from 2,500 to 3,200 rows, RPT-1 reaches 91-96% of tuned GBDT accuracy without any training examples. The classification gap is modest at 3.6-4.1 percentage points on AUC-ROC, though regression tasks show wider gaps of 8.9-11.1 percentage points on R-squared. An interesting finding is a crossover at roughly 75-100 context rows where RPT-1 actually outperforms XGBoost under limited data. Based on these results, we propose a practical hybrid workflow: use RPT-1 for rapid screening, then train GBDT selectively where prediction accuracy justifies the effort. All experiments are reproducible through publicly available Hugging Face Spaces.

[AI-80] Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

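[Quick read]: This paper addresses inaccurate credit assignment for multi-turn Large Language Model (LLM) agents in real deployments, where task difficulty fluctuates, and in particular how to distinguish high-value informative signals from stochastic noise in sample-efficient training. Existing group-based policy optimization methods rely on statistical deviation within discrete batches, adapt poorly to changing task difficulty, and often misallocate credit. The key to the solution is Proximity-based Multi-turn Optimization (ProxMO), which uses two lightweight mechanisms: success-rate-aware modulation, which adapts gradient intensity to episode-level difficulty, and proximity-based soft aggregation, which derives baselines through continuous semantic weighting at the step level and so captures policy-improvement signals at a finer grain. The method clearly outperforms existing baselines on the ALFWorld and WebShop benchmarks at negligible computational cost and plugs into standard GRPO training pipelines.

Link: https://arxiv.org/abs/2602.19225
Authors: Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, Peilin Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: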

Abstract:Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high-difficulty task signifies a genuine capability breakthrough. Yet, existing group-based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, facilitating immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at: this https URL.
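
To make the contrast between a hard group baseline and a soft, proximity-weighted one concrete, the sketch below shows both. The first function is the familiar group-relative standardization used in GRPO-style training. The second is a simplified guess at what "proximity-based soft aggregation" could look like: a baseline formed from other steps' returns, weighted by an exponential kernel on the distance between hypothetical state embeddings. The kernel, `tau`, and the two-dimensional embeddings are assumptions; success-rate-aware modulation is not sketched.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: standardize rewards within one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def proximity_baseline(query, states, returns, tau=1.0):
    """Soft baseline: returns of other steps, weighted by closeness to `query`."""
    weights = [math.exp(-math.dist(query, s) / tau) for s in states]
    total = sum(weights)
    return sum(w * g for w, g in zip(weights, returns)) / total

# Hard grouping: four rollouts of the same task, two successes and two failures.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Soft grouping: two nearby steps with high returns and one distant step with none.
states = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
returns = [1.0, 0.8, 0.0]
baseline = proximity_baseline((0.0, 0.1), states, returns, tau=0.5)
```

With the kernel weighting, the distant step contributes almost nothing to the baseline, so the advantage of the query step is judged against semantically similar steps rather than against everything in the batch.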

[AI-81] How to Allocate How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization

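[Quick read]: This paper addresses two core problems in applying Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Model (LLM) reasoning: first, rollout allocation is uniform and ignores the heterogeneity of gradient variance across problems; second, the softmax policy structure attenuates gradients for high-confidence correct actions, while excessive updates can destabilize training. The key to the solution is DynaMO, a theoretically grounded dual-pronged optimization framework. At the sequence level, it derives a variance-minimizing allocation strategy from first principles and uses Bernoulli variance as a computable proxy for gradient informativeness. At the token level, it designs gradient-aware advantage modulation based on a theoretical analysis of gradient magnitude bounds, compensating for the gradient attenuation of high-confidence correct actions and using entropy changes as a computable indicator to stabilize excessive update magnitudes.

Link: https://arxiv.org/abs/2602.19208
Authors: Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, Xunliang Cai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: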

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores gradient variance heterogeneity across problems, and (ii) the softmax policy structure causes gradient attenuation for high-confidence correct actions, while excessive gradient updates may destabilize training. Therefore, we propose DynaMO, a theoretically-grounded dual-pronged optimization framework. At the sequence level, we prove that uniform allocation is suboptimal and derive variance-minimizing allocation from the first principle, establishing Bernoulli variance as a computable proxy for gradient informativeness. At the token level, we develop gradient-aware advantage modulation grounded in theoretical analysis of gradient magnitude bounds. Our framework compensates for gradient attenuation of high-confidence correct actions while utilizing entropy changes as computable indicators to stabilize excessive update magnitudes. Extensive experiments conducted on a diverse range of mathematical reasoning benchmarks demonstrate consistent improvements over strong RLVR baselines. Our implementation is available at: this https URL.
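
The idea of using Bernoulli variance to steer rollout allocation can be sketched as follows. For a prompt with empirical success rate p, a binary reward has variance p(1 - p), which is zero when the prompt is always solved or never solved. The sketch splits a fixed rollout budget in proportion to the standard deviation sqrt(p(1 - p)), the classical Neyman allocation from stratified sampling. Whether DynaMO's derived rule takes exactly this form is not stated in the abstract, so the proportionality rule, the per-prompt minimum, and the function names are assumptions.

```python
import math

def bernoulli_variance(p):
    """Variance of a 0/1 reward with success probability p."""
    return p * (1.0 - p)

def allocate_rollouts(success_rates, budget, min_per_prompt=1):
    """Split `budget` rollouts across prompts in proportion to sqrt(p * (1 - p))."""
    n = len(success_rates)
    if budget < n * min_per_prompt:
        raise ValueError("budget too small for the per-prompt minimum")
    scores = [math.sqrt(bernoulli_variance(p)) for p in success_rates]
    spare = budget - n * min_per_prompt
    total = sum(scores)
    if total == 0.0:
        shares = [spare / n] * n          # nothing is informative: fall back to uniform
    else:
        shares = [spare * s / total for s in scores]
    alloc = [min_per_prompt + int(share) for share in shares]
    # Hand out rollouts lost to rounding, largest fractional share first.
    leftovers = budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftovers]:
        alloc[i] += 1
    return alloc

# Estimated success rates for four prompts: never solved, coin flip, mostly solved, always solved.
success_rates = [0.0, 0.5, 0.9, 1.0]
allocation = allocate_rollouts(success_rates, budget=32)
```

Prompts that are always or never solved receive only the minimum, since additional rollouts on them would yield identical rewards and no group-relative learning signal.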

[AI-82] HybridFL: A Federated Learning Approach for Financial Crime Detection

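[Quick read]: This paper addresses complex, hybrid data distributions in the real world: in Federated Learning (FL), data may be partitioned horizontally across users and also vertically across feature dimensions, and conventional FL methods handle such hybrid structure poorly. The key to the solution is a new architecture, Hybrid Federated Learning (HybridFL), which integrates horizontal aggregation and vertical feature fusion to enable joint modeling across parties while strictly preserving data locality. Experiments show that on financial crime detection it significantly outperforms a local model that uses only transaction-level features and achieves performance comparable to a centralized benchmark.

Link: https://arxiv.org/abs/2602.19207
Authors: Afsana Khan, Marijn ten Thij, Guangzhi Tang, Anna Wilbik
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: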

Abstract:Federated learning (FL) is a privacy-preserving machine learning paradigm that enables multiple parties to collaboratively train models on privately owned data without sharing raw information. While standard FL typically addresses either horizontal or vertical data partitions, many real-world scenarios exhibit a complex hybrid distribution. This paper proposes Hybrid Federated Learning (HybridFL) to address data split both horizontally across disjoint users and vertically across complementary feature sets. We evaluate HybridFL in a financial crime detection context, where a transaction party holds transaction-level attributes and multiple banks maintain private account-level features. By integrating horizontal aggregation and vertical feature fusion, the proposed architecture enables joint learning while strictly preserving data locality. Experiments on AMLSim and SWIFT datasets demonstrate that HybridFL significantly outperforms the transaction-only local model and achieves performance comparable to a centralized benchmark.
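
The two ingredients named in the abstract, horizontal aggregation and vertical feature fusion, can be shown side by side in a toy setting. The sketch assumes that banks share one account-encoder architecture whose parameters are averaged FedAvg-style, and that the transaction party concatenates its own features with the embeddings the banks return. The linear encoder, the flattened parameter layout, and all names are illustrative assumptions, not the paper's architecture.

```python
def fed_avg(client_weights, client_sizes):
    """Horizontal aggregation: sample-size-weighted average of client parameters."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

def account_embedding(account_features, encoder_rows):
    """Bank-side encoder (one linear map here); raw account features never leave the bank."""
    return [sum(w * x for w, x in zip(row, account_features)) for row in encoder_rows]

def fuse(transaction_features, sender_emb, receiver_emb):
    """Vertical fusion: concatenate transaction attributes with bank embeddings."""
    return list(transaction_features) + list(sender_emb) + list(receiver_emb)

# Two banks hold flattened 2x2 encoder parameters trained on 100 and 300 accounts.
bank_encoders = [[1.0, 0.0, 0.0, 1.0], [3.0, 2.0, 2.0, 3.0]]
bank_sizes = [100, 300]
shared = fed_avg(bank_encoders, bank_sizes)
encoder = [shared[0:2], shared[2:4]]

# Each bank embeds its own customer's account; only embeddings are sent onward.
sender = account_embedding([1.0, 2.0], encoder)
receiver = account_embedding([0.0, 1.0], encoder)

# The transaction party fuses its own attributes (amount, flag) with both embeddings.
fused = fuse([250.0, 1.0], sender, receiver)
```

The fused vector would then feed a classifier held by the transaction party; in this sketch only encoder parameters and embeddings cross party boundaries, never raw account records.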

[AI-83] Visual Prompt Guided Unified Pushing Policy

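[Quick read]: This paper addresses the limitations of existing pushing methods in non-prehensile manipulation: conventional approaches rely on multi-step plans composed of pre-defined pushing primitives, which limits their efficiency and generalization across scenarios. The key to the solution is a unified pushing policy that embeds a lightweight prompting mechanism in a flow matching policy to generate reactive, multimodal pushing actions. A high-level planner can supply the visual prompt, so the policy can be reused across many planning tasks. Experiments show that it outperforms existing baselines and, as a low-level primitive in a planning framework guided by a vision-language model (VLM), completes table-cleaning tasks efficiently.

Link: https://arxiv.org/abs/2602.19193
Authors: Hieu Bui, Ziyan Gao, Yuya Hosoda, Joo-Ho Lee
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: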

Abstract:As one of the simplest non-prehensile manipulation skills, pushing has been widely studied as an effective means to rearrange objects. Existing approaches, however, typically rely on multi-step push plans composed of pre-defined pushing primitives with limited application scopes, which restrict their efficiency and versatility across different scenarios. In this work, we propose a unified pushing policy that incorporates a lightweight prompting mechanism into a flow matching policy to guide the generation of reactive, multimodal pushing actions. The visual prompt can be specified by a high-level planner, enabling the reuse of the pushing policy across a wide range of planning problems. Experimental results demonstrate that the proposed unified pushing policy not only outperforms existing baselines but also effectively serves as a low-level primitive within a VLM-guided planning framework to solve table-cleaning tasks efficiently.

[AI-84] HistCAD: Geometrically Constrained Parametric History-based CAD Dataset

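[Quick read]: This paper addresses the lack of explicit geometric constraints and fine-grained functional semantics in current parametric computer-aided design (CAD) modeling datasets for industrial design, which limits editability and constraint-compliant generation. The key to the solution is HistCAD, a large-scale dataset that represents procedural operations compactly while remaining compatible with native CAD software and that contains five aligned modalities: modeling sequences, multi-view renderings, STEP-format boundary representations (B-reps), native parametric files, and textual annotations. The authors also develop the AM$_{\text{HistCAD}}$ annotation module, which extracts geometric and spatial features from modeling sequences and uses a large language model to generate complementary annotations of the modeling process, geometric structure, and functional type, markedly improving robustness, parametric editability, and accuracy in text-driven CAD generation.

Link: https://arxiv.org/abs/2602.19171
Authors: Xintong Dong, Chuanyang Li, Chuqi Han, Peng Zheng, Jiaxin Jing, Yanzhi Song, Zhouwang Yang
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: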

Abstract:Parametric computer-aided design (CAD) modeling is fundamental to industrial design, but existing datasets often lack explicit geometric constraints and fine-grained functional semantics, limiting editable, constraint-compliant generation. We present HistCAD, a large-scale dataset featuring constraint-aware modeling sequences that compactly represent procedural operations while ensuring compatibility with native CAD software, encompassing five aligned modalities: modeling sequences, multi-view renderings, STEP-format B-reps, native parametric files, and textual annotations. We develop AM$_{\text{HistCAD}}$, an annotation module that extracts geometric and spatial features from modeling sequences and uses a large language model to generate complementary annotations of the modeling process, geometric structure, and functional type. Extensive evaluations demonstrate that HistCAD’s explicit constraints, flattened sequence format, and multi-type annotations improve robustness, parametric editability, and accuracy in text-driven CAD generation, while industrial parts included in HistCAD further support complex real-world design scenarios. HistCAD thus provides a unified benchmark for advancing editable, constraint-aware, and semantically enriched generative CAD modeling.

[AI-85] Virtual Parameter Sharpening: Dynamic Low-Rank Perturbations for Inference-Time Reasoning Enhancement

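[Quick Read]: This paper addresses the lack of dynamic adaptation in large language models (LLMs) at inference time, that is, how to adapt to the input at test time using activation information without persistently updating the model parameters. Parameter-efficient fine-tuning methods such as LoRA learn static low-rank adapters and cannot adjust to the input during inference. The key to the solution is Virtual Parameter Sharpening (VPS), an activation-conditioned dynamic low-rank perturbation. At inference time, VPS builds its perturbation factors on the fly from batch activation statistics and optional gradient signals, giving a weight perturbation of the form $\Delta W = \gamma\, W^\top V U^\top W$, where the selector matrices $U$ and $V$ are constructed by sparse activation-guided selection or Sylvester-coupled regression. An adaptive policy modulates the perturbation magnitude based on activation energy and token-level entropy, and the system adds multi-objective verification with iterative refinement for tasks that have ground-truth supervision.

Link: https://arxiv.org/abs/2602.19169
Authors: Saba Kublashvili
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Probability (math.PR)
Comments:

Click to view abstract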

Abstract:I introduce Virtual Parameter Sharpening (VPS), an inference-time technique that augments frozen transformer linear layers with dynamic, activation-conditioned low-rank perturbations. Unlike parameter-efficient fine-tuning methods such as LoRA, which learn static low-rank adapters, VPS constructs its perturbation factors on the fly from batch activation statistics and optional gradient signals, enabling test-time adaptation without persistent parameter updates. The perturbation takes the form Delta W = gamma * W^T V U^T W, where selector matrices U and V are constructed via sparse activation-guided selection or Sylvester-coupled regression. We provide a theoretical analysis of the perturbation’s spectral properties and describe an adaptive policy system that modulates perturbation magnitude based on activation energy and token-level entropy. This system incorporates multi-objective verification with iterative refinement for tasks with ground-truth supervision. We present the complete algorithmic framework, analyze its mathematical foundations, and discuss the mechanisms by which activation-conditioned computation may enhance reasoning capabilities in large language models. Implementation and experimental code are available at this https URL .
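
To make the perturbation formula in the abstract concrete, the following is a minimal pure-Python sketch of building Delta W = gamma * W^T V U^T W and adding it to a frozen weight matrix at inference time. It is not the author's implementation. Several things here are assumptions: W is taken to be square so that the shapes match, `top_indices` is a hypothetical stand-in for "sparse activation-guided selection" (it picks the coordinates with the largest batch energy), and all function names are invented for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def top_indices(batch_acts, r):
    """Hypothetical selection rule: the r coordinates with the largest batch energy."""
    energy = [sum(row[j] ** 2 for row in batch_acts)
              for j in range(len(batch_acts[0]))]
    return sorted(range(len(energy)), key=lambda j: -energy[j])[:r]


def selector(n, idx):
    """n x r matrix whose column c is the unit vector e_{idx[c]}."""
    return [[1.0 if i == j else 0.0 for j in idx] for i in range(n)]


def vps_delta(W, U, V, gamma):
    """Delta W = gamma * W^T V U^T W, assuming a square W."""
    prod = matmul(matmul(matmul(transpose(W), V), transpose(U)), W)
    return [[gamma * x for x in row] for row in prod]


def effective_weight(W, delta):
    """W + Delta W, rebuilt per batch; the stored W is never modified."""
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]
```

Because U and V each have r columns, the resulting Delta W has rank at most r, which is what makes the perturbation low-rank. Nothing is written back to W, matching the abstract's "without persistent parameter updates".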

[AI-86] Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia

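[Quick Read]: This paper asks whether generative-AI search summaries such as Google's AI Overview (AIO) have a causal effect on traffic to informational publishers such as Wikipedia, and whether that effect is heterogeneous. The key is to exploit AIO's staggered geographic rollout together with Wikipedia's multilingual structure in a difference-in-differences design: English articles exposed to AIO are matched with the same underlying articles in language editions (Hindi, Indonesian, Japanese, and Portuguese) that were not exposed during the observation period. Across 161,382 matched article-language pairs, AIO exposure reduces daily traffic to English articles by approximately 15%. Relative declines are largest for Culture articles and substantially smaller for STEM, consistent with stronger substitution when a short synthesized answer already satisfies the user's informational intent, so that attention is reallocated away from publishers.

Link: https://arxiv.org/abs/2602.18455
Authors: Mehrzad Khosravi, Hema Yoganarasimhan
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract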

Abstract:Search engines increasingly display LLM-generated answers shown above organic links, shifting search from link lists to answer-first summaries. Publishers contend these summaries substitute for source pages and cannibalize traffic, while platforms argue they are complementary by directing users through included links. We estimate the causal impact of Google’s AI Overview (AIO) on Wikipedia traffic by leveraging the feature’s staggered geographic rollout and Wikipedia’s multilingual structure. Using a difference-in-differences design, we compare English Wikipedia articles exposed to AIO to the same underlying articles in language editions (Hindi, Indonesian, Japanese, and Portuguese) that were not exposed to AIO during the observation period. Across 161,382 matched article-language pairs, AIO exposure reduces daily traffic to English articles by approximately 15%. Effects are heterogeneous: relative declines are largest for Culture articles and substantially smaller for STEM, consistent with stronger substitution when short synthesized answers satisfy informational intent. These findings provide early causal evidence that generative-answer features in search engines can materially reallocate attention away from informational publishers, with implications for content monetization, search platform design, and policy.
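
The difference-in-differences idea in the abstract can be illustrated with a small Python sketch. This is only the textbook two-group, two-period estimator on log page views, not the authors' specification (which uses a staggered rollout and matched article-language pairs). The function names and all traffic numbers below are made up for illustration.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def did_log_effect(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on log daily views.

    Each argument is a list of average daily views, one entry per article.
    """
    def log_mean(xs):
        return mean([math.log(x) for x in xs])

    treated_change = log_mean(treated_post) - log_mean(treated_pre)
    control_change = log_mean(control_post) - log_mean(control_pre)
    return treated_change - control_change


# Made-up numbers: English articles (exposed) vs. the same articles
# in another language edition (not exposed).
english_pre, english_post = [1000.0, 400.0], [935.0, 374.0]
other_pre, other_post = [200.0, 50.0], [220.0, 55.0]

effect = did_log_effect(english_pre, english_post, other_pre, other_post)
relative_change = math.exp(effect) - 1.0  # proportional change in traffic
```

In this toy data the unexposed edition grows by 10% while the exposed edition falls by 6.5%. Subtracting the control trend attributes a drop of about 15% to exposure, which is the sense in which the design nets out changes common to both groups.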

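[AI-87] LLM-Assisted Replication for Quantitative Social Science

[Quick Read]: This paper targets the replication crisis in empirical research, that is, the failure of scientific claims to be validated by later studies, which is partly an incentive problem because replication is costly and less well rewarded than original research. The authors propose an automated system based on large language models (LLMs). Its core is an iterative pipeline of LLM-based text interpretation, code generation, execution, and discrepancy analysis, which replicates the statistical analyses in social science papers and flags potential problems. The approach relies on the high degree of standardization in quantitative social science: standard statistical models, shared public datasets, and uniform reporting formats such as regression tables and summary statistics. A prototype reproduces key results from a seminal sociology paper, and the authors outline uses including pre-submission checks, peer-review support, and meta-scientific audits, positioning AI verification as assistive infrastructure that strengthens research integrity.

Link: https://arxiv.org/abs/2602.18453
Authors: So Kubota, Hiromu Yakura, Samuel Coavoux, Sho Yamada, Yuki Nakamura
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract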

Abstract:The replication crisis, the failure of scientific claims to be validated by further research, is one of the most pressing issues for empirical research. This is partly an incentive problem: replication is costly and less well rewarded than original research. Large language models (LLMs) have accelerated scientific production by streamlining writing, coding, and reviewing, yet this acceleration risks outpacing verification. To address this, we present an LLM-based system that replicates statistical analyses from social science papers and flags potential problems. Quantitative social science is particularly well-suited to automation because it relies on standard statistical models, shared public datasets, and uniform reporting formats such as regression tables and summary statistics. We present a prototype that iterates LLM-based text interpretation, code generation, execution, and discrepancy analysis, demonstrating its capabilities by reproducing key results from a seminal sociology paper. We also outline application scenarios including pre-submission checks, peer-review support, and meta-scientific audits, positioning AI verification as assistive infrastructure that strengthens research integrity.

[AI-88] Developing a Multi-Agent System to Generate Next Generation Science Assessments with Evidence-Centered Design

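[Quick Read]: This paper addresses how to develop, efficiently and at high quality, performance-based assessment tasks aligned with the Next Generation Science Standards (NGSS). Such tasks ask students to use science knowledge to solve problems and design solutions, but developing them requires diverse expertise and is costly and labor-intensive. The key to the solution is to embed the Evidence-Centered Design (ECD) framework in a Multi-Agent System (MAS) that combines multiple large language models (LLMs) with different expertise, automating the multi-stage item generation workflow traditionally performed by human experts. AI-generated items were of overall comparable quality to human-developed items in alignment with the NGSS three-dimensional standards and cognitive demands. They were stronger in inclusivity but weaker in clarity, conciseness, and multimodal design. Both AI- and human-developed items showed weaknesses in evidence collectability and student-interest alignment. The authors conclude that the approach supports scalable, standards-aligned assessment design while human expertise remains essential.

Link: https://arxiv.org/abs/2602.18451
Authors: Yaxuan Yang, Jongchan Park, Yifan Zhou, Xiaoming Zhai
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Under review

Click to view abstract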

Abstract:Contemporary science education reforms such as the Next Generation Science Standards (NGSS) demand assessments to understand students’ ability to use science knowledge to solve problems and design solutions. To elicit such higher-order ability, educators need performance-based assessments, which are challenging to develop. One solution that has been broadly adopted is Evidence-Centered Design (ECD), which emphasizes interconnected models of the learner, evidence, and tasks. Although ECD provides a framework to safeguard assessment validity, its implementation requires diverse expertise (e.g., content and assessment), which is both costly and labor-intensive. To address this challenge, this study proposed integrating the ECD framework into Multi-Agent Systems (MAS) to generate NGSS-aligned assessment items automatically. This integrated MAS system ensembles multiple large language models with varying expertise, enabling the automation of complex, multi-stage item generation workflows traditionally performed by human experts. We examined the quality of AI-generated NGSS-aligned items and compared them with human-developed items across multiple dimensions of assessment design. Results showed that AI-generated items have overall comparable quality to human-developed items in terms of alignment with NGSS three-dimensional standards and cognitive demands. Divergent patterns also emerged: AI-generated items demonstrated a distinct strength in inclusivity, while also exhibiting limitations in clarity, conciseness, and multimodal design. AI- and human-developed items both showed weaknesses in evidence collectability and student interest alignment. These findings suggest that integrating ECD into MAS can support scalable and standards-aligned assessment design, while human expertise remains essential.

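[AI-89] Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

[Quick Read]: This paper addresses the mapping of manufacturing intent between 2D engineering drawings and 3D CAD models: how to link, accurately and traceably, the geometric dimensioning and tolerancing (GDT) callouts, datum definitions, and surface requirements on a drawing to the corresponding 3D geometric features. The problem matters in the automotive, aerospace, shipbuilding, and heavy-machinery industries, where 2D drawings remain the primary carrier of manufacturing intent, and it is hard because of contextual ambiguity, repeated feature patterns, and the need for transparent decisions. The key to the solution is a deterministic-first, context-aware framework. Callouts are first semantically enriched and then scored against candidate 3D features with an interpretable metric that combines type compatibility, tolerance-aware dimensional agreement, and conservative context consistency, along with engineering-domain heuristics. When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained LLM reasoning, followed by a single human-in-the-loop (HITL) review step. On 20 real CAD-drawing pairs the framework reaches a mean F1 score of 86.29% while keeping decisions traceable.

Link: https://arxiv.org/abs/2602.18296
Authors: Muhammad Tayyab Khana, Lequn Chen, Wenhe Feng, Seung Ki Moon
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract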

Abstract:Manufacturing automation in process planning, inspection planning, and digital-thread integration depends on a unified specification that binds the geometric features of a 3D CAD model to the geometric dimensioning and tolerancing (GDT) callouts, datum definitions, and surface requirements carried by the corresponding 2D engineering drawing. Although Model-Based Definition (MBD) allows such specifications to be embedded directly in 3D models, 2D drawings remain the primary carrier of manufacturing intent in automotive, aerospace, shipbuilding, and heavy-machinery industries. Correctly linking drawing annotations to the corresponding 3D features is difficult because of contextual ambiguity, repeated feature patterns, and the need for transparent and traceable decisions. This paper presents a deterministic-first, context-aware framework that maps 2D drawing entities to 3D CAD features to produce a unified manufacturing specification. Drawing callouts are first semantically enriched and then scored against candidate features using an interpretable metric that combines type compatibility, tolerance-aware dimensional agreement, and conservative context consistency, along with engineering-domain heuristics. When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step. Experiments on 20 real CAD-drawing pairs achieve a mean precision of 83.67%, recall of 90.46%, and F1 score of 86.29%. An ablation study shows that each pipeline component contributes to overall accuracy, with the full system outperforming all reduced variants. By prioritizing deterministic rules, clear decision tracking, and retaining unresolved cases for human review, the framework provides a practical foundation for downstream manufacturing automation in real-world industrial environments.
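
The deterministic-first scoring and escalation described in the abstract can be sketched roughly as follows. This is an illustrative toy, not the paper's metric. The `Feature` fields, the weights, the `margin` threshold, and the rule "escalate when the top two scores are within the margin" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    kind: str          # e.g. "hole" or "slot" (hypothetical feature types)
    size_mm: float     # characteristic dimension of the 3D feature
    context_ok: bool   # whether the feature lies in the view the callout points to


def score(callout_kind, nominal_mm, tol_mm, feat, weights=(0.4, 0.4, 0.2)):
    """Interpretable score: type compatibility + dimensional agreement + context."""
    type_s = 1.0 if feat.kind == callout_kind else 0.0
    dim_s = 1.0 if abs(feat.size_mm - nominal_mm) <= tol_mm else 0.0
    ctx_s = 1.0 if feat.context_ok else 0.0
    return weights[0] * type_s + weights[1] * dim_s + weights[2] * ctx_s


def match(callout_kind, nominal_mm, tol_mm, feats, margin=0.1):
    """Return ("matched", name), or ("escalate", candidate names) when ambiguous."""
    if not feats:
        return ("escalate", [])
    ranked = sorted(feats,
                    key=lambda f: score(callout_kind, nominal_mm, tol_mm, f),
                    reverse=True)
    best = score(callout_kind, nominal_mm, tol_mm, ranked[0])
    if len(ranked) > 1:
        second = score(callout_kind, nominal_mm, tol_mm, ranked[1])
        if best - second < margin:
            # Deterministic rules cannot separate the top candidates.
            tied = [f.name for f in ranked
                    if best - score(callout_kind, nominal_mm, tol_mm, f) < margin]
            return ("escalate", tied)
    return ("matched", ranked[0].name)
```

A repeated hole pattern, where two holes have the same diameter and context, is exactly the case where this toy returns `"escalate"`. In the paper's pipeline such cases go to constrained LLM reasoning and then to a human reviewer.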

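[AI-90] CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment INTERSPEECH 2026

[Quick Read]: This paper addresses the limitations of text-to-speech (TTS) systems based on large language models (LLMs) in low-latency dual-streaming synthesis. The core challenges are accurate text–speech alignment and training sequences that balance synthesis quality against latency. Prior work often relies on GMM-HMM based forced-alignment toolkits such as MFA, which are pipeline-heavy and less flexible than neural aligners, and fixed-ratio interleaving of text and speech tokens struggles to capture text–speech alignment regularities. The proposed CTC-TTS replaces MFA with an aligner based on CTC (Connectionist Temporal Classification) and introduces a bi-word based interleaving strategy. Two variants are designed: CTC-TTS-L, which concatenates tokens along the sequence length for higher quality, and CTC-TTS-F, which stacks embeddings along the feature dimension for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks.

Link: https://arxiv.org/abs/2602.19574
Authors: Hanwen Liu, Saierdaer Yusuyin, Hao Huang, Zhijian Ou
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Submitted to INTERSPEECH 2026

Click to view abstract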

Abstract:Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text–speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM based forced-alignment toolkits (e.g., MFA), which are pipeline-heavy and less flexible than neural aligners; fixed-ratio interleaving of text and speech tokens struggles to capture text–speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC based aligner and introduces a bi-word based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks. Speech samples are available at this https URL.

Machine Learning

[LG-0] Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Link: https://arxiv.org/abs/2602.20156
Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today’s agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at this https URL.

[LG-1] LAD: Learning Advantage Distribution for Reasoning

Link: https://arxiv.org/abs/2602.20132
Authors: Wendi Li, Sharon Li
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
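
As a rough illustration of distribution matching as opposed to advantage maximization, the sketch below compares a policy's distribution over a small group of sampled responses with a target distribution induced by their advantages. Everything specific here is an assumption rather than the paper's objective: the target is taken to be a softmax of the advantages with a hypothetical temperature, and the f-divergence is instantiated as the KL divergence.

```python
import math


def softmax(xs, temp=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q), one member of the f-divergence family."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distribution_matching_loss(policy_logits, advantages, temp=1.0):
    """Divergence between an assumed advantage-induced target and the policy."""
    target = softmax(advantages, temp)
    policy = softmax(policy_logits)
    return kl(target, policy)
```

With advantages `[1.0, 1.0, -2.0]`, the target puts equal mass on the first two responses. A policy that collapses onto only one of them incurs a positive loss, whereas pure advantage maximization would not distinguish between the two modes.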

[LG-2] Adaptation to Intrinsic Dependence in Diffusion Language Models

Link: https://arxiv.org/abs/2602.20126
Authors: Yunxiao Zhao, Changxiao Cai
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) approaches, enabling parallel token generation beyond a rigid left-to-right order. Despite growing empirical success, the theoretical understanding of how unmasking schedules – which specify the order and size of unmasked tokens during sampling – affect generation quality remains limited. In this work, we introduce a distribution-agnostic unmasking schedule for DLMs that adapts to the (unknown) dependence structure of the target data distribution, without requiring any prior knowledge or hyperparameter tuning. In contrast to prior deterministic procedures that fix unmasking sizes, our method randomizes the number of tokens revealed at each iteration. We show that, for two specific parameter choices, the sampling convergence guarantees – measured by Kullback-Leibler (KL) divergence – scale as $\widetilde{O}(\mathsf{TC}/K)$ and $\widetilde{O}(\mathsf{DTC}/K)$ respectively. Here, $K$ is the number of iterations, and $\mathsf{TC}$ and $\mathsf{DTC}$ are the total correlation and dual total correlation of the target distribution, capturing the intrinsic dependence structure underlying the data. Importantly, our guarantees hold in the practically relevant parallel-sampling regime $K < L$ where $L$ is the token sequence length. These results significantly improve upon prior convergence theories and yield substantial sampling acceleration for low-complexity distributions. Overall, our findings unveil the adaptivity of DLMs to intrinsic data structures and shed light on the benefit of randomized unmasking sizes in inference schedule design.

[LG-3] Reliable Abstention under Adversarial Injections: Tight Lower Bounds and New Upper Bounds

Link: https://arxiv.org/abs/2602.20111
Authors: Ezra Edelman, Surbhi Goel
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study online learning in the adversarial injection model introduced by [Goel et al. 2017], where a stream of labeled examples is predominantly drawn i.i.d. from an unknown distribution $\mathcal{D}$, but may be interspersed with adversarially chosen instances without the learner knowing which rounds are adversarial. Crucially, labels are always consistent with a fixed target concept (the clean-label setting). The learner is additionally allowed to abstain from predicting, and the total error counts the mistakes whenever the learner decides to predict and incorrect abstentions when it abstains on i.i.d. rounds. Perhaps surprisingly, prior work shows that oracle access to the underlying distribution yields $O(d^2 \log T)$ combined error for VC dimension $d$, while distribution-agnostic algorithms achieve only $\tilde{O}(\sqrt{T})$ for restricted classes, leaving open whether this gap is fundamental. We resolve this question by proving a matching $\Omega(\sqrt{T})$ lower bound for VC dimension $1$, establishing a sharp separation between the two information regimes. On the algorithmic side, we introduce a potential-based framework driven by robust witnesses, small subsets of labeled examples that certify predictions while remaining resilient to adversarial contamination. We instantiate this framework using two combinatorial dimensions: (1) inference dimension, yielding combined error $\tilde{O}(T^{1-1/k})$ for classes of inference dimension $k$, and (2) certificate dimension, a new relaxation we introduce. As an application, we show that halfspaces in $\mathbb{R}^2$ have certificate dimension $3$, obtaining the first distribution-agnostic bound of $\tilde{O}(T^{2/3})$ for this class. This is notable since [Blum et al. 2021] showed halfspaces are not robustly learnable under clean-label attacks without abstention.

[LG-4] Training-Free Generative Modeling via Kernelized Stochastic Interpolants

Link: https://arxiv.org/abs/2602.20070
Authors: Florentin Coeurdoux, Etienne Lempereur, Nathanaël Cuvelle-Magar, Thomas Eboli, Stéphane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We develop a kernel method for generative modeling within the stochastic interpolant framework, replacing neural network training with linear systems. The drift of the generative SDE is $\hat{b}_t(x) = \nabla\phi(x)^\top \eta_t$, where $\eta_t \in \mathbb{R}^P$ solves a $P \times P$ system computable from data, with $P$ independent of the data dimension $d$. Since estimates are inexact, the diffusion coefficient $D_t$ affects sample quality; the optimal $D_t^*$ from Girsanov diverges at $t=0$, but this poses no difficulty and we develop an integrator that handles it seamlessly. The framework accommodates diverse feature maps – scattering transforms, pretrained generative models etc. – enabling training-free generation and model combination. We demonstrate the approach on financial time series, turbulence, and image generation.

[LG-5] A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning

Link: https://arxiv.org/abs/2602.20062
Authors: Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.

[LG-6] A Computationally Efficient Multidimensional Vision Transformer

Link: https://arxiv.org/abs/2602.19982
Authors: Alaa El Ichi, Khalide Jbilou
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (Cproduct). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new Cproduct-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.

[LG-7] Discrete Diffusion Models Exploit Asymmetry to Solve Lookahead Planning Tasks

Link: https://arxiv.org/abs/2602.19980
Authors: Itamar Trainin, Shauli Ravfogel, Omri Abend, Amir Feder
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:While Autoregressive (AR) Transformer-based Generative Language Models are frequently employed for lookahead tasks, recent research suggests a potential discrepancy in their ability to perform planning tasks that require multi-step lookahead. In this work, we investigate the distinct emergent mechanisms that arise when training AR versus Non-Autoregressive (NAR) models, such as Discrete Diffusion Models (dLLMs), on lookahead tasks. By requiring the models to plan ahead to reach the correct conclusion, we analyze how these two paradigms fundamentally differ in their approach to the problem. We identify a critical asymmetry in planning problems: while forward generation requires complex lookahead at branching junctions, reverse generation is often deterministic. This asymmetry creates an opportunity for NAR models. Through mechanistic analysis of training and inference dynamics, we demonstrate that NAR models learn to solve planning tasks by utilizing future tokens to decode backwards, avoiding the need to learn complex traversal mechanisms entirely. Consequently, we report that both AR and NAR models are able to achieve perfect accuracy on the lookahead task. However, NAR models require exponentially fewer training examples and shallower architectures compared to AR models, which often fail to converge without specific curriculum adjustments.
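
The forward/reverse asymmetry mentioned in the abstract can be seen on a toy tree, sketched below in Python. The tree, the node names, and the helper functions are made up for illustration and are not the paper's benchmark. Going forward from the start node, every branching node requires choosing the correct child. Going backward from the goal, each node has exactly one parent, so the path is recovered with no choices at all.

```python
def parents_from_children(children):
    """Invert a child-list tree into a child -> parent map."""
    return {child: parent
            for parent, kids in children.items()
            for child in kids}


def path_by_backtracking(children, start, goal):
    """Recover the start -> goal path by walking parent pointers from the goal.

    Assumes `goal` is a descendant of `start`; otherwise a KeyError is raised.
    """
    parent = parents_from_children(children)
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])  # exactly one option at every step
    path.reverse()
    return path


def branching_decisions(children, path):
    """Number of nodes on the path where forward generation must pick among >1 child."""
    return sum(1 for node in path[:-1] if len(children.get(node, [])) > 1)


# Toy tree: s branches to a and b; each of those branches again.
tree = {"s": ["a", "b"], "a": ["c", "d"], "b": ["e", "g"]}
path = path_by_backtracking(tree, "s", "g")
```

Here a left-to-right generator has to make two lookahead-dependent choices to reach `g`, while the backward walk makes none. That is the opportunity the abstract attributes to models that can use future tokens to decode backwards.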

[LG-8] Unlearning Noise in PINNs: A Selective Pruning Framework for PDE Inverse Problems

Link: https://arxiv.org/abs/2602.19967
Authors: Yongsheng Chen, Yong Chen, Wei Guo, Xinghui Zhong
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Physics-informed neural networks (PINNs) provide a promising framework for solving inverse problems governed by partial differential equations (PDEs) by integrating observational data and physical constraints in a unified optimization objective. However, the ill-posed nature of PDE inverse problems makes them highly sensitive to noise. Even a small fraction of corrupted observations can distort internal neural representations, severely impairing accuracy and destabilizing training. Motivated by recent advances in machine unlearning and structured network pruning, we propose P-PINN, a selective pruning framework designed to unlearn the influence of corrupted data in a pretrained PINN. Specifically, starting from a PINN trained on the full dataset, P-PINN evaluates a joint residual–data fidelity indicator, a weighted combination of data misfit and PDE residuals, to partition the training set into reliable and corrupted subsets. Next, we introduce a bias-based neuron importance measure that quantifies directional activation discrepancies between the two subsets, identifying neurons whose representations are predominantly driven by corrupted samples. Building on this, an iterative pruning strategy then removes noise-sensitive neurons layer by layer. The resulting pruned network is fine-tuned on the reliable data subject to the original PDE constraints, acting as a lightweight post-processing stage rather than a complete retraining. Numerical experiments on extensive PDE inverse-problem benchmarks demonstrate that P-PINN substantially improves robustness, accuracy, and training stability under noisy conditions, achieving up to a 96.6% reduction in relative error compared with baseline PINNs. These results indicate that activation-level post hoc pruning is a promising mechanism for enhancing the reliability of physics-informed learning in noise-contaminated settings.
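
The first P-PINN step, partitioning the training set with a joint residual–data fidelity indicator, can be sketched as below. The abstract describes the indicator only as a weighted combination of data misfit and PDE residuals, so the specific weight, the fixed threshold rule, and the function names here are assumptions. The numbers are made up.

```python
def fidelity_indicator(data_misfit, pde_residual, weight=0.5):
    """Per-sample weighted combination of data misfit and PDE residual magnitude."""
    return [weight * d + (1.0 - weight) * r
            for d, r in zip(data_misfit, pde_residual)]


def partition(indicator, threshold):
    """Split sample indices into (reliable, corrupted) by a hypothetical threshold."""
    reliable = [i for i, s in enumerate(indicator) if s <= threshold]
    corrupted = [i for i, s in enumerate(indicator) if s > threshold]
    return reliable, corrupted


# Made-up per-sample values from a pretrained network; sample 2 looks corrupted.
misfit = [0.01, 0.02, 0.90, 0.015]
residual = [0.02, 0.01, 0.70, 0.03]
reliable, corrupted = partition(fidelity_indicator(misfit, residual), threshold=0.1)
```

The two index sets are what the later stages would consume: activations are compared between the subsets to find noise-driven neurons, and the pruned network is fine-tuned on the reliable subset only.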

[LG-9] Sparse Masked Attention Policies for Reliable Generalization

Link: https://arxiv.org/abs/2602.19956
Authors: Caroline Horsch, Laurens Engwegen, Max Weltevrede, Matthijs T. J. Spaan, Wendelin Böhmer
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In reinforcement learning, abstraction methods that remove unnecessary information from the observation are commonly used to learn policies which generalize better to unseen tasks. However, these methods often overlook a crucial weakness: the function which extracts the reduced-information representation has unknown generalization ability in unseen observations. In this paper, we address this problem by presenting an information removal method which more reliably generalizes to new states. We accomplish this by using a learned masking function which operates on, and is integrated with, the attention weights within an attention-based policy network. We demonstrate that our method significantly improves policy generalization to unseen tasks in the Procgen benchmark compared to standard PPO and masking approaches.

[LG-10] A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs

Link: https://arxiv.org/abs/2602.19938
Authors: Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen
Categories: Machine Learning (cs.LG)

Abstract: Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently, delivering strong accuracy under fixed compute budgets. However, SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. Prior work has focused mainly on training-time solutions such as routing regularization or auxiliary losses, leaving inference-time behavior, which is critical for deployment, less explored. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set. These insights motivate inference-time mechanisms that rebalance workloads without retraining or router modification. We propose Replicate-and-Quantize (RQ), a training-free and near-lossless framework for dynamic workload rebalancing. In each layer, heavy-hitter experts are replicated to increase parallel capacity, while less critical experts and replicas are quantized to remain within the original memory budget. We also introduce a Load-Imbalance Score (LIS) to measure routing skew by comparing heavy-hitter load to an equal allocation baseline. Experiments across representative SMoE models and benchmarks show up to 1.4x reduction in imbalance with accuracy maintained within ±0.6%, enabling more predictable and efficient inference.
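
To make the idea of comparing heavy-hitter load with an equal-allocation baseline concrete, here is a small sketch. The abstract does not give the exact LIS formula, so `imbalance` below (busiest slot divided by the equal share) is an assumed stand-in. `replicate_heaviest` is a hypothetical helper that only models the load-splitting side of replication and ignores quantization entirely.

```python
def imbalance(loads):
    """Busiest slot's token count divided by the equal-share count.

    1.0 means perfectly balanced routing. Assumed definition, used here
    only as a stand-in for the paper's Load-Imbalance Score.
    """
    equal_share = sum(loads) / len(loads)
    return max(loads) / equal_share

def replicate_heaviest(loads, copies=2):
    """Split the busiest expert's tokens evenly across `copies` replicas."""
    heavy = max(range(len(loads)), key=loads.__getitem__)
    rest = loads[:heavy] + loads[heavy + 1:]
    return rest + [loads[heavy] / copies] * copies

# Tokens routed to four experts in one layer (toy numbers).
loads = [6, 2, 1, 1]
before = imbalance(loads)
after = imbalance(replicate_heaviest(loads))
print(before, after)
```

In this toy case one replica of the heaviest expert lowers the ratio from 2.4 to 1.5, which is the kind of effect replication is meant to have on parallel capacity.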

[LG-11] RobPI: Robust Private Inference against Malicious Client

Link: https://arxiv.org/abs/2602.19918
Authors: Jiaqi Xue, Mengxin Zheng, Qian Lou
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by SaTML 2026

Abstract: The increased deployment of machine learning inference in various applications has sparked privacy concerns. In response, private inference (PI) protocols have been created to allow parties to perform inference without revealing their sensitive data. Despite recent advances in the efficiency of PI, most current methods assume a semi-honest threat model where the data owner is honest and adheres to the protocol. However, in reality, data owners can have different motivations and act in unpredictable ways, making this assumption unrealistic. To demonstrate how a malicious client can compromise the semi-honest model, we first designed an inference manipulation attack against a range of state-of-the-art private inference protocols. This attack allows a malicious client to modify the model output with 3x to 8x fewer queries than current black-box attacks. Motivated by the attacks, we proposed and implemented RobPI, a robust and resilient private inference protocol that withstands malicious clients. RobPI integrates a distinctive cryptographic protocol that bolsters security by weaving encryption-compatible noise into the logits and features of private inference, thereby efficiently warding off malicious-client attacks. Our extensive experiments on various neural networks and datasets show that RobPI reduces the attack success rate by ~91.9% and increases the number of queries required by malicious-client attacks by more than 10x.

[LG-12] Uncertainty-Aware Rank-One MIMO Q Network Framework for Accelerated Offline Reinforcement Learning

Link: https://arxiv.org/abs/2602.19917
Authors: Thanh Nguyen, Tung Luu, Tri Ton, Sungwoong Kim, Chang D. Yoo
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 10 pages, 4 figures, IEEE Access

Abstract: Offline reinforcement learning (RL) has garnered significant interest due to its safe and easily scalable paradigm. However, training under this paradigm presents its own challenge: the extrapolation error stemming from out-of-distribution (OOD) data. Existing methodologies have endeavored to address this issue through means like penalizing OOD Q-values or imposing similarity constraints on the learned policy and the behavior policy. Nonetheless, these approaches are often beset by limitations such as being overly conservative in utilizing OOD data, imprecise OOD data characterization, and significant computational overhead. To address these challenges, this paper introduces an Uncertainty-Aware Rank-One Multi-Input Multi-Output (MIMO) Q Network framework. The framework aims to enhance offline RL by fully leveraging the potential of OOD data while still ensuring efficiency in the learning process. Specifically, the framework quantifies data uncertainty and harnesses it in the training losses, aiming to train a policy that maximizes the lower confidence bound of the corresponding Q-function. Furthermore, a Rank-One MIMO architecture is introduced to model the uncertainty-aware Q-function, offering the same ability for uncertainty quantification as an ensemble of networks but with a cost nearly equivalent to that of a single network. Consequently, this framework strikes a harmonious balance between precision, speed, and memory efficiency, culminating in improved overall performance. Extensive experimentation on the D4RL benchmark demonstrates that the framework attains state-of-the-art performance while remaining computationally efficient. By incorporating the concept of uncertainty quantification, our framework offers a promising avenue to alleviate extrapolation errors and enhance the efficiency of offline RL.
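
The "lower confidence bound of the Q-function" can be sketched with a generic mean-minus-scaled-deviation rule over several Q estimates per action. This is a common way to write an LCB, not necessarily the paper's loss. The names `lower_confidence_bound` and `pick_action`, the coefficient `beta`, and the toy numbers are all assumptions, and the Rank-One MIMO network that would produce the per-head estimates is not modeled.

```python
from statistics import mean, pstdev

def lower_confidence_bound(q_heads, beta=1.0):
    """Pessimistic value: mean of the Q estimates minus beta * their spread."""
    return mean(q_heads) - beta * pstdev(q_heads)

def pick_action(q_heads_per_action, beta=1.0):
    """Choose the action whose lower confidence bound is largest."""
    return max(
        q_heads_per_action,
        key=lambda a: lower_confidence_bound(q_heads_per_action[a], beta),
    )

# Three Q estimates per action, e.g. from three output heads (toy numbers).
q_estimates = {
    "a": [1.0, 1.0, 1.0],  # lower mean, heads agree
    "b": [2.5, 0.5, 1.5],  # higher mean, heads disagree
}
print(pick_action(q_estimates, beta=1.0))
print(pick_action(q_estimates, beta=0.0))
```

With `beta=0` the rule reduces to picking the highest mean, so the disagreement between heads only matters once the uncertainty penalty is switched on.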

[LG-13] Fully Convolutional Spatiotemporal Learning for Microstructure Evolution Prediction

Link: https://arxiv.org/abs/2602.19915
Authors: Michael Trimboli, Mohammed Alsubaie, Sirani M. Perera, Ke-Gang Wang, Xianqi Li
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 11 figures

Abstract:Understanding and predicting microstructure evolution is fundamental to materials science, as it governs the resulting properties and performance of materials. Traditional simulation methods, such as phase-field models, offer high-fidelity results but are computationally expensive due to the need to solve complex partial differential equations at fine spatiotemporal resolutions. To address this challenge, we propose a deep learning-based framework that accelerates microstructure evolution predictions while maintaining high accuracy. Our approach utilizes a fully convolutional spatiotemporal model trained in a self-supervised manner using sequential images generated from simulations of microstructural processes, including grain growth and spinodal decomposition. The trained neural network effectively learns the underlying physical dynamics and can accurately capture both short-term local behaviors and long-term statistical properties of evolving microstructures, while also demonstrating generalization to unseen spatiotemporal domains and variations in configuration and material parameters. Compared to recurrent neural architectures, our model achieves state-of-the-art predictive performance with significantly reduced computational cost in both training and inference. This work establishes a robust baseline for spatiotemporal learning in materials science and offers a scalable, data-driven alternative for fast and reliable microstructure simulations.

[LG-14] De novo molecular structure elucidation from mass spectra via flow matching

Link: https://arxiv.org/abs/2602.19912
Authors: Ghaith Mqawass (1,2), Tuan Le (2), Fabian Theis (1,3,4), Djork-Arné Clevert (2) ((1) TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany, (2) Machine Learning and Computational Sciences, Pfizer Research & Development, Berlin, Germany, (3) TUM School of Computation, Information and Technology, Technical University of Munich, Germany, (4) Institute of Computational Biology, Helmholtz Center Munich, Germany)
Categories: Machine Learning (cs.LG)
Comments: 13-page preprint, 4 figures, 1 table

Abstract:Mass spectrometry is a powerful and widely used tool for identifying molecular structures due to its sensitivity and ability to profile complex samples. However, translating spectra into full molecular structures is a difficult, under-defined inverse problem. Overcoming this problem is crucial for enabling biological insight, discovering new metabolites, and advancing chemical research across multiple fields. To this end, we develop MSFlow, a two-stage encoder-decoder flow-matching generative model that achieves state-of-the-art performance on the structure elucidation task for small molecules. In the first stage, we adopt a formula-restricted transformer model for encoding mass spectra into a continuous and chemically informative embedding space, while in the second stage, we train a decoder flow matching model to reconstruct molecules from latent embeddings of mass spectra. We present ablation studies demonstrating the importance of using information-preserving molecular descriptors for encoding mass spectra and motivate the use of our discrete flow-based decoder. Our rigorous evaluation demonstrates that MSFlow can accurately translate up to 45 percent of molecular mass spectra into their corresponding molecular representations - an improvement of up to fourteen-fold over the current state-of-the-art. A trained version of MSFlow is made publicly available on GitHub for non-commercial users.

[LG-15] Generalized Random Direction Newton Algorithms for Stochastic Optimization

Link: https://arxiv.org/abs/2602.19893
Authors: Soumen Pachal, Prashanth L.A., Shalabh Bhatnagar, Avinash Achar
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We present a family of generalized Hessian estimators of the objective using random direction stochastic approximation (RDSA) by utilizing only noisy function measurements. The form of each estimator and the order of the bias depend on the number of function measurements. In particular, we demonstrate that estimators with more function measurements exhibit lower-order estimation bias. We show the asymptotic unbiasedness of the estimators. We also perform asymptotic and non-asymptotic convergence analyses for stochastic Newton methods that incorporate our generalized Hessian estimators. Finally, we perform numerical experiments to validate our theoretical findings.

[LG-16] I Dropped a Neural Net

Link: https://arxiv.org/abs/2602.19845
Authors: Hyunwoo Park
Categories: Machine Learning (cs.LG)

Abstract: A recent Dwarkesh Patel podcast with John Collison and Elon Musk featured an interesting puzzle from Jane Street: they trained a neural net, shuffled all 96 layers, and asked to put them back in order. Given unlabelled layers of a Residual Network and its training dataset, we recover the exact ordering of the layers. The problem decomposes into pairing each block's input and output projections ($48!$ possibilities) and ordering the reassembled blocks ($48!$ possibilities), for a combined search space of $(48!)^2 \approx 10^{122}$, which is more than the atoms in the observable universe. We show that stability conditions during training like dynamic isometry leave the product $W_{\text{out}} W_{\text{in}}$ for correctly paired layers with a negative diagonal structure, allowing us to use diagonal dominance ratio as a signal for pairing. For ordering, we seed-initialize with a rough proxy such as delta-norm or $\|W_{\text{out}}\|_F$, then hill-climb to zero mean squared error.
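
The size of the search space can be checked directly, and the pairing signal can be illustrated on toy matrices. In the sketch below the arithmetic is exact. The function `diagonal_dominance_ratio` is an assumed definition (sum of absolute diagonal entries over sum of absolute off-diagonal entries), since the abstract names the ratio without defining it, and the two 2x2 matrices are made-up stand-ins for the product of an output and an input projection.

```python
import math

# log10((48!)^2) = 2 * ln(48!) / ln(10), using lgamma(49) = ln(48!).
log10_search_space = 2 * math.lgamma(49) / math.log(10)
print(round(log10_search_space, 2))

def diagonal_dominance_ratio(matrix):
    """Sum of |diagonal| over sum of |off-diagonal| for a square matrix.

    Assumed definition for illustration; larger means more diagonal.
    """
    n = len(matrix)
    diag = sum(abs(matrix[i][i]) for i in range(n))
    off = sum(abs(matrix[i][j]) for i in range(n) for j in range(n) if i != j)
    return math.inf if off == 0 else diag / off

# Toy products: a "correct" pairing with a strong negative diagonal,
# and a "wrong" pairing with no particular structure.
paired = [[-2.0, 0.1], [0.1, -3.0]]
unpaired = [[0.5, -1.0], [1.2, 0.3]]
print(diagonal_dominance_ratio(paired), diagonal_dominance_ratio(unpaired))
```

A pairing search could score every candidate pair this way and keep the highest-scoring matches, which shrinks the first factor of $48!$ long before any ordering is attempted.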

[LG-17] Drift Localization using Conformal Predictions

Link: https://arxiv.org/abs/2602.19790
Authors: Fabian Hinder, Valerie Vaquet, Johannes Brinkrolf, Barbara Hammer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at the 34th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2026)

Abstract:Concept drift – the change of the distribution over time – poses significant challenges for learning systems and is of central interest for monitoring. Understanding drift is thus paramount, and drift localization – determining which samples are affected by the drift – is essential. While several approaches exist, most rely on local testing schemes, which tend to fail in high-dimensional, low-signal settings. In this work, we consider a fundamentally different approach based on conformal predictions. We discuss and show the shortcomings of common approaches and demonstrate the performance of our approach on state-of-the-art image datasets.
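
For readers unfamiliar with conformal predictions, the basic building block is the conformal p-value, sketched below in its standard form. How the paper turns this into a drift localization procedure is not described in the abstract, so `flag_samples`, the threshold `alpha`, and the toy nonconformity scores are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Standard conformal p-value for one test point.

    Counts how many calibration nonconformity scores are at least as large
    as the test score, with the usual +1 correction in both numerator and
    denominator.
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_extreme + 1) / (n + 1)

def flag_samples(calibration_scores, test_scores, alpha=0.25):
    """Mark test points whose conformal p-value falls at or below alpha."""
    return [conformal_p_value(calibration_scores, s) <= alpha for s in test_scores]

# Nonconformity scores from pre-drift data, then two later samples.
calibration = [0.1, 0.2, 0.3, 0.4]
print(conformal_p_value(calibration, 0.35))
print(flag_samples(calibration, [0.35, 1.0]))
```

With only four calibration scores the smallest attainable p-value is 0.2, which is why the toy threshold is set to 0.25. Realistic use needs a much larger calibration set.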

[LG-18] Stop Preaching and Start Practising Data Frugality for Responsible Development of AI

Link: https://arxiv.org/abs/2602.19789
Authors: Sophia N. Wilson, Guðrún Fjóla Guðmundsdóttir, Andrew Millard, Raghavendra Selvan, Sebastian Mair
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. For long, progress has been equated with ever-larger datasets, driving remarkable advances but now yielding increasingly diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preach and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial, demonstrating that coreset-based subset selection can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetorical preach to concrete practice for responsible development of AI.

[LG-19] Bayesian Meta-Learning with Expert Feedback for Task-Shift Adaptation through Causal Embeddings

Link: https://arxiv.org/abs/2602.19788
Authors: Lotta Mäkinen, Jorge Loría, Samuel Kaski
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 8 figures

Abstract:Meta-learning methods perform well on new within-distribution tasks but often fail when adapting to out-of-distribution target tasks, where transfer from source tasks can induce negative transfer. We propose a causally-aware Bayesian meta-learning method, by conditioning task-specific priors on precomputed latent causal task embeddings, enabling transfer based on mechanistic similarity rather than spurious correlations. Our approach explicitly considers realistic deployment settings where access to target-task data is limited, and adaptation relies on noisy (expert-provided) pairwise judgments of causal similarity between source and target tasks. We provide a theoretical analysis showing that conditioning on causal embeddings controls prior mismatch and mitigates negative transfer under task shift. Empirically, we demonstrate reductions in negative transfer and improved out-of-distribution adaptation in both controlled simulations and a large-scale real-world clinical prediction setting for cross-disease transfer, where causal embeddings align with underlying clinical mechanisms.

[LG-20] Unsupervised Anomaly Detection in NSL-KDD Using β-VAE: A Latent Space and Reconstruction Error Approach

Link: https://arxiv.org/abs/2602.19785
Authors: Dylan Baptiste (CRESTIC), Ramla Saddem (CRESTIC), Alexandre Philippot (CRESTIC), François Foyer
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Abstract: As Operational Technology increasingly integrates with Information Technology, the need for Intrusion Detection Systems becomes more important. This paper explores an unsupervised approach to anomaly detection in network traffic using $\beta$-Variational Autoencoders on the NSL-KDD dataset. We investigate two methods: leveraging the latent space structure by measuring distances from test samples to the training data projections, and using the reconstruction error as a conventional anomaly detection metric. By comparing these approaches, we provide insights into their respective advantages and limitations in an unsupervised setting. Experimental results highlight the effectiveness of latent space exploitation for classification tasks.
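
The two scoring methods can be contrasted with a toy encoder and decoder. The sketch below is not a β-VAE: `encode` and `decode` are hand-written stand-ins (a real encoder would output a posterior mean, which is what would be used as the latent point), and the nearest-neighbor distance is one assumed way to measure "distance to the training data projections".

```python
import math

def reconstruction_score(x, encode, decode):
    """Squared reconstruction error of x after an encode/decode round trip."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def latent_distance_score(x, encode, train_latents):
    """Euclidean distance from x's latent code to the nearest training code."""
    z = encode(x)
    return min(math.dist(z, t) for t in train_latents)

# Toy stand-ins: the "latent space" is just the first input coordinate.
def encode(x):
    return [x[0]]

def decode(z):
    return [z[0], 0.0]

train = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
train_latents = [encode(x) for x in train]

off_manifold = [1.0, 5.0]    # ordinary latent code, but poorly reconstructed
far_in_latent = [10.0, 0.0]  # well reconstructed, but far from training codes

for sample in (off_manifold, far_in_latent):
    print(
        reconstruction_score(sample, encode, decode),
        latent_distance_score(sample, encode, train_latents),
    )
```

The two toy samples are each caught by only one of the scores, which illustrates why the two criteria can have different strengths.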

[LG-21] Addressing Instrument-Outcome Confounding in Mendelian Randomization through Representation Learning

Link: https://arxiv.org/abs/2602.19782
Authors: Shimeng Huang, Matthew Robinson, Francesco Locatello
Categories: Machine Learning (cs.LG)

Abstract:Mendelian Randomization (MR) is a prominent observational epidemiological research method designed to address unobserved confounding when estimating causal effects. However, core assumptions – particularly the independence between instruments and unobserved confounders – are often violated due to population stratification or assortative mating. Leveraging the increasing availability of multi-environment data, we propose a representation learning framework that exploits cross-environment invariance to recover latent exogenous components of genetic instruments. We provide theoretical guarantees for identifying these latent instruments under various mixing mechanisms and demonstrate the effectiveness of our approach through simulations and semi-synthetic experiments using data from the All of Us Research Hub.

[LG-22] Understanding the Curse of Unrolling

Link: https://arxiv.org/abs/2602.19733
Authors: Sheheryar Mehmood, Florian Knoll, Peter Ochs
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Algorithm unrolling is ubiquitous in machine learning, particularly in hyperparameter optimization and meta-learning, where Jacobians of solution mappings are computed by differentiating through iterative algorithms. Although unrolling is known to yield asymptotically correct Jacobians under suitable conditions, recent work has shown that the derivative iterates may initially diverge from the true Jacobian, a phenomenon known as the curse of unrolling. In this work, we provide a non-asymptotic analysis that explains the origin of this behavior and identifies the algorithmic factors that govern it. We show that truncating early iterations of the derivative computation mitigates the curse while simultaneously reducing memory requirements. Finally, we demonstrate that warm-starting in bilevel optimization naturally induces an implicit form of truncation, providing a practical remedy. Our theoretical findings are supported by numerical experiments on representative examples.

[LG-23] PaReGTA: An LLM-based EHR Data Encoding Approach to Capture Temporal Information

Link: https://arxiv.org/abs/2602.19661
Authors: Kihyuk Yoon, Lingchao Mao, Catherine Chong, Todd J. Schwedt, Chia-Chun Chiang, Jing Li
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 5 figures, 7 tables

Abstract:Temporal information in structured electronic health records (EHRs) is often lost in sparse one-hot or count-based representations, while sequence models can be costly and data-hungry. We propose PaReGTA, an LLM-based encoding framework that (i) converts longitudinal EHR events into visit-level templated text with explicit temporal cues, (ii) learns domain-adapted visit embeddings via lightweight contrastive fine-tuning of a sentence-embedding model, and (iii) aggregates visit embeddings into a fixed-dimensional patient representation using hybrid temporal pooling that captures both recency and globally informative visits. Because PaReGTA does not require training from scratch but instead utilizes a pre-trained LLM, it can perform well even in data-limited cohorts. Furthermore, PaReGTA is model-agnostic and can benefit from future EHR-specialized sentence-embedding models. For interpretability, we introduce PaReGTA-RSS (Representation Shift Score), which quantifies clinically defined factor importance by recomputing representations after targeted factor removal and projecting representation shifts through a machine learning model. On 39,088 migraine patients from the All of Us Research Program, PaReGTA outperforms sparse baselines for migraine type classification while deep sequential models were unstable in our cohort.

[LG-24] Spectral Phase Encoding for Quantum Kernel Methods

Link: https://arxiv.org/abs/2602.19644
Authors: Pablo Herrero Gómez, Antonio Jimeno Morenilla, David Muñoz-Hernández, Higinio Mora Mora
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract: Quantum kernel methods are promising for near-term quantum machine learning, yet their behavior under data corruption remains insufficiently understood. We analyze how quantum feature constructions degrade under controlled additive noise. We introduce Spectral Phase Encoding (SPE), a hybrid construction combining a discrete Fourier transform (DFT) front-end with a diagonal phase-only embedding aligned with the geometry of diagonal quantum maps. Within a unified framework, we compare QK-DFT against alternative quantum variants (QK-PCA, QK-RP) and classical SVM baselines under identical clean-data hyperparameter selection, quantifying robustness via dataset fixed-effects regression with wild cluster bootstrap inference across heterogeneous real-world datasets. Across the quantum family, DFT-based preprocessing yields the smallest degradation rate as noise increases, with statistically supported slope differences relative to PCA and RP. Compared to classical baselines, QK-DFT shows degradation comparable to linear SVM and more stable than RBF SVM under matched tuning. Hardware experiments confirm that SPE remains executable and numerically stable for overlap estimation. These results indicate that robustness in quantum kernels depends critically on structure-aligned preprocessing and its interaction with diagonal embeddings, supporting a robustness-first perspective for NISQ-era quantum machine learning.

[LG-25] Evaluating the Impact of Data Anonymization on Image Retrieval

Link: https://arxiv.org/abs/2602.19641
Authors: Marvin Chen, Manuel Eberhardinger, Johannes Maucher
Categories: Machine Learning (cs.LG)
Comments: Submitted to IEEE Access

Abstract: With the growing importance of privacy regulations such as the General Data Protection Regulation, anonymizing visual data is becoming increasingly relevant across institutions. However, anonymization can negatively affect the performance of Computer Vision systems that rely on visual features, such as Content-Based Image Retrieval (CBIR). Despite this, the impact of anonymization on CBIR has not been systematically studied. This work addresses this gap, motivated by the DOKIQ project, an artificial intelligence-based system for document verification actively used by the State Criminal Police Office Baden-Württemberg. We propose a simple evaluation framework: retrieval results after anonymization should match those obtained before anonymization as closely as possible. To this end, we systematically assess the impact of anonymization using two public datasets and the internal DOKIQ dataset. Our experiments span three anonymization methods, four anonymization degrees, and four training strategies, all based on the state-of-the-art backbone DINOv2 (Self-Distillation with No Labels). Our results reveal a pronounced retrieval bias in favor of models trained on original data, which produce the most similar retrievals after anonymization. The findings of this paper offer practical insights for developing privacy-compliant CBIR systems while preserving performance.

[LG-26] Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models

Link: https://arxiv.org/abs/2602.19619
Authors: Luhan Tang, Longxuan Yu, Shaorong Zhang, Greg Ver Steeg
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 9 figures

Abstract:Discrete diffusion language models (dLLMs) provide a fast and flexible alternative to autoregressive models (ARMs) via iterative denoising with parallel updates. However, their evaluation is challenging: existing metrics conflate denoiser approximation error with sampler-induced error from the sampling dynamics, a problem that does not arise for ARMs whose autoregressive sampling exactly reflects the learned probability model. We introduce a sampler-centric oracle framework that replaces learned denoisers with an exact Hidden Markov Model posterior derived from a ground-truth Markov chain, isolating sampler-induced error in a controlled setting. We show that few-step discrete diffusion samplers are not distributionally correct even under an oracle denoiser, with transition-level mismatch that vanishes only as the number of steps approaches the sequence length. Moreover, improvements in negative log-likelihood, generative perplexity, or MAUVE do not imply correct sampling. Code is available at this https URL

[LG-27] Workflow-Level Design Principles for Trustworthy GenAI in Automotive System Engineering

Link: https://arxiv.org/abs/2602.19614
Authors: Chih-Hong Cheng, Brian Hsuan-Cheng Liao, Adam Molin, Hasan Esen
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:The adoption of large language models in safety-critical system engineering is constrained by trustworthiness, traceability, and alignment with established verification practices. We propose workflow-level design principles for trustworthy GenAI integration and demonstrate them in an end-to-end automotive pipeline, from requirement delta identification to SysML v2 architecture update and re-testing. First, we show that monolithic (“big-bang”) prompting misses critical changes in large specifications, while section-wise decomposition with diversity sampling and lightweight NLP sanity checks improves completeness and correctness. Then, we propagate requirement deltas into SysML v2 models and validate updates via compilation and static analysis. Additionally, we ensure traceable regression testing by generating test cases through explicit mappings from specification variables to architectural ports and states, providing practical safeguards for GenAI used in safety-critical automotive engineering.

[LG-28] Variational Inference for Bayesian MIDAS Regression

Link: https://arxiv.org/abs/2602.19610
Authors: Luigi Simeone
Categories: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 27 pages, 11 figures

Abstract: We develop a Coordinate Ascent Variational Inference (CAVI) algorithm for Bayesian Mixed Data Sampling (MIDAS) regression with linear weight parameterizations. The model separates impact coefficients from weighting function parameters through a normalization constraint, creating a bilinear structure that renders generic Hamiltonian Monte Carlo samplers unreliable while preserving conditional conjugacy exploitable by CAVI. Each variational update admits a closed-form solution: Gaussian for regression coefficients and weight parameters, Inverse-Gamma for the error variance. The algorithm propagates uncertainty across blocks through second moments, distinguishing it from naive plug-in approximations. In a Monte Carlo study spanning 21 data-generating configurations with up to 50 predictors, CAVI produces posterior means nearly identical to a block Gibbs sampler benchmark while achieving speedups of 107x to 1,772x (Table 9). Generic automatic differentiation VI (ADVI), by contrast, produces bias 714 times larger while being orders of magnitude slower, confirming the value of model-specific derivations. Weight function parameters maintain excellent calibration (coverage above 92%) across all configurations. Impact coefficient credible intervals exhibit the underdispersion characteristic of mean-field approximations, with coverage declining from 89% to 55% as the number of predictors grows, a documented trade-off between speed and interval calibration that structured variational methods can address. An empirical application to realized volatility forecasting on S&P 500 daily returns confirms that CAVI and Gibbs sampling yield virtually identical point forecasts, with CAVI completing each monthly estimation in under 10 milliseconds.

[LG-29] ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

Link: https://arxiv.org/abs/2602.19594
Authors: Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra
Categories: Machine Learning (cs.LG)

Abstract:We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference optimization tasks. These tasks were taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and bottleneck description, whereby the agent must produce an optimization patch evaluated against expert human solutions. We curated 54 tasks from merged pull requests with measurable performance improvements. While existing benchmarks heavily use runtime-based metrics, such approaches can be gamed to pass tests without capturing the actual intent of the code changes. Therefore, we combine both hard (execution-based) and soft (LLM-based) metrics to show that both are necessary for complete evaluation. While evaluating both closed and open-source coding agents, we find no single agent dominates across codebases. Surprisingly, agents often identify correct bottlenecks but fail to execute working solutions. We also show that agents with identical underlying models differ substantially, suggesting scaffolding is as important as the model.

[LG-30] Advantage-based Temporal Attack in Reinforcement Learning

Link: https://arxiv.org/abs/2602.19582
Authors: Shenghong He
Categories: Machine Learning (cs.LG)

Abstract: Extensive research demonstrates that Deep Reinforcement Learning (DRL) models are susceptible to adversarially constructed inputs (i.e., adversarial examples), which can mislead the agent to take suboptimal or unsafe actions. Recent methods improve attack effectiveness by leveraging future rewards to guide adversarial perturbation generation over sequential time steps (i.e., reward-based attacks). However, these methods are unable to capture dependencies between different time steps in the perturbation generation process, resulting in a weak temporal correlation between the current perturbation and previous perturbations. In this paper, we propose a novel method called Advantage-based Adversarial Transformer (AAT), which can generate adversarial examples with stronger temporal correlations (i.e., time-correlated adversarial examples) to improve the attack performance. AAT employs a multi-scale causal self-attention (MSCSA) mechanism to dynamically capture dependencies between historical information from different time periods and the current state, thus enhancing the correlation between the current perturbation and the previous perturbation. Moreover, AAT introduces a weighted advantage mechanism, which quantifies the effectiveness of a perturbation in a given state and guides the generation process toward high-performance adversarial examples by sampling high-advantage regions. Extensive experiments demonstrate that the performance of AAT matches or surpasses mainstream adversarial attack baselines on Atari, DeepMind Control Suite and Google football tasks.

[LG-31] Leap+Verify: Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training

Link: https://arxiv.org/abs/2602.19580
Authors: Jeremy McEntire
Categories: Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 18 pages, 5 tables. Code and data available at this https URL

Abstract: We introduce Leap+Verify, a framework that applies speculative execution – predicting future model weights and validating predictions before acceptance – to accelerate neural network training. Inspired by speculative decoding in language model inference and by the Automatically Scalable Computation (ASC) architecture for program execution, Leap+Verify decomposes training into three dynamically detected regimes (chaotic, transition, stable) using activation-space cosine similarity as a real-time Lyapunov proxy signal. Within each regime, analytic weight predictors (momentum, linear, quadratic extrapolation) attempt to forecast model parameters $K$ training steps ahead; predictions are accepted only when validated against a held-out loss criterion. We evaluate Leap+Verify on GPT-2 124M and Qwen 2.5-1.5B trained on WikiText-103 across five random seeds, sweeping prediction depth $K \in \{5, 10, 25, 50, 75, 100\}$. Momentum-based prediction (Adam moment extrapolation) fails catastrophically at both scales, with predicted losses exceeding actuals by 100-10,000x – a universal norm explosion in optimizer-state extrapolation. Finite-difference predictors (linear, quadratic) succeed where momentum fails: at 124M, they achieve 24% strict acceptance at $K=5$ in stable regimes; at 1.5B, they achieve 37% strict acceptance in transition regimes. The scale-dependent finding is in regime distribution: GPT-2 124M spends 34% of training in stable regime, while Qwen 1.5B spends 64% in chaotic regime and reaches stable in only 0-2 of 40 checkpoints. Larger models are more predictable when predictable, but less often predictable – the practical bottleneck shifts from predictor accuracy to regime availability. Cross-seed results are highly consistent (less than 1% validation loss variance), and the three-regime framework produces identical phase boundaries (plus or minus 50 steps) across seeds.
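
The predict-then-verify loop described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`linear_leap`, `quadratic_leap`, `leap_and_verify`), the toy loss, and the tolerance `tol` are all hypothetical, and weights are plain Python lists rather than model tensors.

```python
# Illustrative sketch only: names, the toy loss, and the acceptance rule are
# assumptions, not the paper's code.

def linear_leap(w_prev, w_curr, k):
    """Extrapolate k steps ahead from the last finite difference."""
    return [c + k * (c - p) for p, c in zip(w_prev, w_curr)]


def quadratic_leap(w_prev2, w_prev, w_curr, k):
    """Extrapolate k steps ahead using first and second finite differences."""
    out = []
    for a, b, c in zip(w_prev2, w_prev, w_curr):
        d1 = c - b                # first backward difference
        d2 = (c - b) - (b - a)    # second backward difference
        out.append(c + k * d1 + 0.5 * k * (k + 1) * d2)
    return out


def leap_and_verify(w_curr, w_pred, heldout_loss, tol=0.0):
    """Accept predicted weights only if the held-out loss does not get worse."""
    if heldout_loss(w_pred) <= heldout_loss(w_curr) + tol:
        return w_pred, True
    return w_curr, False


def toy_loss(w):
    """Stand-in for a held-out loss: squared distance to the origin."""
    return sum(x * x for x in w)


# Two consecutive "checkpoints" of a trajectory that shrinks toward the optimum.
w0, w1 = [1.0, -2.0], [0.9, -1.8]

good = linear_leap(w0, w1, k=5)      # moves further toward the optimum
accepted_w, accepted = leap_and_verify(w1, good, toy_loss)

bad = linear_leap(w0, w1, k=50)      # overshoots far past the optimum
rejected_w, rejected_flag = leap_and_verify(w1, bad, toy_loss)
```

In this toy run the short leap is accepted and the long leap is rejected, so training would simply continue from the current weights. The verification step bounds the cost of a bad prediction to one extra held-out evaluation.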

[LG-32] The Sample Complexity of Replicable Realizable PAC Learning

Link: https://arxiv.org/abs/2602.19552
Authors: Kasper Green Larsen,Markus Engelund Mathiasen,Chirag Pabbaraju,Clement Svendsen
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)

Abstract: In this paper, we consider the problem of replicable realizable PAC learning. We construct a particularly hard learning problem and show a sample complexity lower bound with a close to $(\log|H|)^{3/2}$ dependence on the size of the hypothesis class $H$. Our proof uses several novel techniques and works by defining a particular Cayley graph associated with $H$ and analyzing a suitable random walk on this graph by examining the spectral properties of its adjacency matrix. Furthermore, we show an almost matching upper bound for the lower bound instance, meaning if a stronger lower bound exists, one would have to consider a different instance of the problem.

[LG-33] Beyond Accuracy: A Unified Random Matrix Theory Diagnostic Framework for Crash Classification Models

Link: https://arxiv.org/abs/2602.19528
Authors: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Crash classification models in transportation safety are typically evaluated using accuracy, F1, or AUC, metrics that cannot reveal whether a model is silently overfitting. We introduce a spectral diagnostic framework grounded in Random Matrix Theory (RMT) and Heavy-Tailed Self-Regularization (HTSR) that spans the ML taxonomy: weight matrices for BERT/ALBERT/Qwen2.5, out-of-fold increment matrices for XGBoost/Random Forest, empirical Hessians for Logistic Regression, induced affinity matrices for Decision Trees, and Graph Laplacians for KNN. Evaluating nine model families on two Iowa DOT crash classification tasks (173,512 and 371,062 records respectively), we find that the power-law exponent $\alpha$ provides a structural quality signal: well-regularized models consistently yield $\alpha$ within $[2, 4]$ (mean $2.87 \pm 0.34$), while overfit variants show $\alpha < 2$ or spectral collapse. We observe a strong rank correlation between $\alpha$ and expert agreement (Spearman $\rho = 0.89$, $p < 0.001$), suggesting spectral quality captures model behaviors aligned with expert reasoning. We propose an $\alpha$-based early stopping criterion and a spectral model selection protocol, and validate both against cross-validated F1 baselines. Sparse Lanczos approximations make the framework scalable to large datasets.
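
For readers unfamiliar with the power-law exponent used as the quality signal here, the sketch below shows one standard way to estimate a tail exponent: the continuous maximum-likelihood estimator $\hat\alpha = 1 + n / \sum_i \ln(x_i / x_{\min})$. The abstract does not say which estimator the paper uses, so treat this as an assumption. The function names are hypothetical, and the synthetic samples stand in for the eigenvalues that a spectral analysis would supply.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min)) over x >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, x_min, n, rng):
    """Inverse-transform sampling from p(x) ~ x^(-alpha) for x >= x_min."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic stand-in for an eigenvalue spectrum with true exponent 3.0.
rng = random.Random(0)
eigs = sample_powerlaw(alpha=3.0, x_min=1.0, n=20000, rng=rng)

alpha_hat = powerlaw_alpha_mle(eigs, x_min=1.0)
in_healthy_band = 2.0 <= alpha_hat <= 4.0  # the [2, 4] band quoted in the abstract
```

An early-stopping rule in the spirit of the abstract could monitor `alpha_hat` per layer and stop once it drifts below 2, though the exact criterion is defined in the paper, not here.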

[LG-34] Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon

Link: https://arxiv.org/abs/2602.19510
Authors: Rudrajit Das,Neel Patel,Meisam Razaviyayn,Vahab Mirrokni
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract: Data mixing, the strategic reweighting of training domains, is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta((N \log N)^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.

[LG-35] PIS: A Physics-Informed System for Accurate State Partitioning of Aβ_42 Protein Trajectories

Link: https://arxiv.org/abs/2602.19444
Authors: Qianfeng Yu,Ningkang Peng,Yanhui Gu
Categories: Machine Learning (cs.LG)

Abstract: Understanding the conformational evolution of $\beta$-amyloid (A$\beta$), particularly the A$\beta_{42}$ isoform, is fundamental to elucidating the pathogenic mechanisms underlying Alzheimer's disease. However, existing end-to-end deep learning models often struggle to capture subtle state transitions in protein trajectories due to a lack of explicit physical constraints. In this work, we introduce PIS, a Physics-Informed System designed for robust metastable state partitioning. By integrating pre-computed physical priors, such as the radius of gyration and solvent-accessible surface area, into the extraction of topological features, our model achieves superior performance on the A$\beta_{42}$ dataset. Furthermore, PIS provides an interactive platform that features dynamic monitoring of physical characteristics and multi-dimensional result validation. This system offers biological researchers a powerful set of analytical tools with physically grounded interpretability. A demonstration video of PIS is available on this https URL.

[LG-36] RAmmStein: Regime Adaptation in Mean-reverting Markets with Stein Thresholds – Optimal Impulse Control in Concentrated AMMs WWW

Link: https://arxiv.org/abs/2602.19419
Authors: Pranay Anchuri
Categories: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: 12 pages, 1 figure, 4 tables, 1 algorithm; submitted to Designing DeFi workshop ( this https URL )

Abstract:Concentrated liquidity provision in decentralized exchanges presents a fundamental Impulse Control problem. Liquidity Providers (LPs) face a non-trivial trade-off between maximizing fee accrual through tight price-range concentration and minimizing the friction costs of rebalancing, including gas fees and swap slippage. Existing methods typically employ heuristic or threshold strategies that fail to account for market dynamics. This paper formulates liquidity management as an optimal control problem and derives the corresponding Hamilton-Jacobi-Bellman quasi-variational inequality (HJB-QVI). We present an approximate solution RAmmStein, a Deep Reinforcement Learning method that incorporates the mean-reversion speed (theta) of an Ornstein-Uhlenbeck process among other features as input to the model. We demonstrate that the agent learns to separate the state space into regions of action and inaction. We evaluate the framework using high-frequency 1Hz Coinbase trade data comprising over 6.8M trades. Experimental results show that RAmmStein achieves a superior net ROI of 0.72% compared to both passive and aggressive strategies. Notably, the agent reduces rebalancing frequency by 67% compared to a greedy rebalancing strategy while maintaining 88% active time. Our results demonstrate that regime-aware laziness can significantly improve capital efficiency by preserving the returns that would otherwise be eroded by the operational costs.

[LG-37] Federated Causal Representation Learning in State-Space Systems for Decentralized Counterfactual Reasoning

Link: https://arxiv.org/abs/2602.19414
Authors: Nazal Mohamed,Ayush Mohanty,Nagi Gebraeel
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments: Manuscript under review

Abstract:Networks of interdependent industrial assets (clients) are tightly coupled through physical processes and control inputs, raising a key question: how would the output of one client change if another client were operated differently? This is difficult to answer because client-specific data are high-dimensional and private, making centralization of raw data infeasible. Each client also maintains proprietary local models that cannot be modified. We propose a federated framework for causal representation learning in state-space systems that captures interdependencies among clients under these constraints. Each client maps high-dimensional observations into low-dimensional latent states that disentangle intrinsic dynamics from control-driven influences. A central server estimates the global state-transition and control structure. This enables decentralized counterfactual reasoning where clients predict how outputs would change under alternative control inputs at others while only exchanging compact latent states. We prove convergence to a centralized oracle and provide privacy guarantees. Our experiments demonstrate scalability, and accurate cross-client counterfactual inference on synthetic and real-world industrial control system datasets.

[LG-38] LEVDA: Latent Ensemble Variational Data Assimilation via Differentiable Dynamics

Link: https://arxiv.org/abs/2602.19406
Authors: Phillip Si,Peng Chen
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Long-range geophysical forecasts are fundamentally limited by chaotic dynamics and numerical errors. While data assimilation can mitigate these issues, classical variational smoothers require computationally expensive tangent-linear and adjoint models. Conversely, recent efficient latent filtering methods often enforce weak trajectory-level constraints and assume fixed observation grids. To bridge this gap, we propose Latent Ensemble Variational Data Assimilation (LEVDA), an ensemble-space variational smoother that operates in the low-dimensional latent space of a pretrained differentiable neural dynamics surrogate. By performing four-dimensional ensemble-variational (4DEnVar) optimization within an ensemble subspace, LEVDA jointly assimilates states and unknown parameters without the need for adjoint code or auxiliary observation-to-latent encoders. Leveraging the fully differentiable, continuous-in-time-and-space nature of the surrogate, LEVDA naturally accommodates highly irregular sampling at arbitrary spatiotemporal locations. Across three challenging geophysical benchmarks, LEVDA matches or outperforms state-of-the-art latent filtering baselines under severe observational sparsity while providing more reliable uncertainty quantification. Simultaneously, it achieves substantially improved assimilation accuracy and computational efficiency compared to full-state 4DEnVar.

[LG-39] In Defense of Cosine Similarity: Normalization Eliminates the Gauge Freedom

Link: https://arxiv.org/abs/2602.19393
Authors: Taha Bouhsine
Categories: Machine Learning (cs.LG)

Abstract: Steck, Ekanadham, and Kallus [arXiv:2403.05440] demonstrate that cosine similarity of learned embeddings from matrix factorization models can be rendered arbitrary by a diagonal "gauge" matrix $D$. Their result is correct and important for practitioners who compute cosine similarity on embeddings trained with dot-product objectives. However, we argue that their conclusion, cautioning against cosine similarity in general, conflates the pathology of an incompatible training objective with the geometric validity of cosine distance on the unit sphere. We prove that when embeddings are constrained to the unit sphere $\mathbb{S}^{d-1}$ (either during or after training with an appropriate objective), the $D$-matrix ambiguity vanishes identically, and cosine distance reduces to exactly half the squared Euclidean distance. This monotonic equivalence implies that cosine-based and Euclidean-based neighbor rankings are identical on normalized embeddings. The "problem" with cosine similarity is not cosine similarity, it is the failure to normalize.
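
The identity behind this claim is elementary: for unit vectors, $\|u - v\|^2 = 2 - 2\,u \cdot v$, so $1 - \cos(u, v) = \tfrac{1}{2}\|u - v\|^2$. The sketch below checks it numerically on random normalized vectors and confirms that the two neighbor rankings coincide. It is an illustration written for this digest, not code from the paper. The helper names, the dimension, and the sample size are arbitrary assumptions.

```python
import math
import random


def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(u, v):
    """1 - cos(u, v), computed without assuming the inputs are normalized."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


rng = random.Random(0)
points = [normalize([rng.gauss(0.0, 1.0) for _ in range(8)]) for _ in range(50)]
query, others = points[0], points[1:]

# Largest deviation from the identity 1 - cos = 0.5 * ||u - v||^2.
max_gap = max(
    abs(cosine_distance(query, p) - 0.5 * squared_euclidean(query, p))
    for p in others
)

# Neighbor rankings under the two distances.
rank_cos = sorted(range(len(others)), key=lambda i: cosine_distance(query, others[i]))
rank_euc = sorted(range(len(others)), key=lambda i: squared_euclidean(query, others[i]))
```

Here `max_gap` is at floating-point noise level and the two rankings agree, which is the monotonic equivalence the abstract refers to. Without the `normalize` step the identity no longer holds in general.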

[LG-40] Spiking Graph Predictive Coding for Reliable OOD Generalization WWW26

Link: https://arxiv.org/abs/2602.19392
Authors: Jing Ren,Jiapeng Du,Bowen Li,Ziqi Xu,Xin Zheng,Hong Jia,Suyu Ma,Xiwei Xu,Feng Xia
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 12 pages, 6 figures, WWW26, Dubai, United Arab Emirates

Abstract:Graphs provide a powerful basis for modeling Web-based relational data, with expressive GNNs to support the effective learning in dynamic web environments. However, real-world deployment is hindered by pervasive out-of-distribution (OOD) shifts, where evolving user activity and changing content semantics alter feature distributions and labeling criteria. These shifts often lead to unstable or overconfident predictions, undermining the trustworthiness required for Web4Good applications. Achieving reliable OOD generalization demands principled and interpretable uncertainty estimation; however, existing methods are largely post-hoc, insensitive to distribution shifts, and unable to explain where uncertainty arises especially in high-stakes settings. To address these limitations, we introduce SpIking GrapH predicTive coding (SIGHT), an uncertainty-aware plug-in graph learning module for reliable OOD Generalization. SIGHT performs iterative, error-driven correction over spiking graph states, enabling models to expose internal mismatch signals that reveal where predictions become unreliable. Across multiple graph benchmarks and diverse OOD scenarios, SIGHT consistently enhances predictive accuracy, uncertainty estimation, and interpretability when integrated with GNNs.

[LG-41] LLMs Can Learn to Reason Via Off-Policy RL

Link: https://arxiv.org/abs/2602.19362
Authors: Daniel Ritter,Owen Oertell,Bradley Guo,Jonathan Chang,Kianté Brantley,Wen Sun
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break this assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS), or by more closely aligning the training and inference policies by explicitly modifying the inference engine. In this work, we embrace off-policyness and propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL). We show that OAPL outperforms GRPO with importance sampling on competition math benchmarks, and can match the performance of a publicly available coding model, DeepCoder, on LiveCodeBench, while using 3x fewer generations during training. We further empirically demonstrate that models trained via OAPL have improved test time scaling under the Pass@k metric. OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.

[LG-42] Vid2Sid: Videos Can Help Close the Sim2Real Gap

Link: https://arxiv.org/abs/2602.19359
Authors: Kevin Qiu,Yu Zhang,Marek Cygan,Josie Hughes
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Calibrating a robot simulator’s physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive the error. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer that analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales. We evaluate our approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13% vs. 28–98%), and ablation analysis reveals three calibration regimes. VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.

[LG-43] Training-Free Cross-Architecture Merging for Graph Neural Networks

Link: https://arxiv.org/abs/2602.19332
Authors: Rishabh Bhattacharya,Vikaskumar Kalsariya,Naresh Manwani
Categories: Machine Learning (cs.LG)

Abstract:Model merging has emerged as a powerful paradigm for combining the capabilities of distinct expert models without the high computational cost of retraining, yet current methods are fundamentally constrained to homogeneous architectures. For GNNs, however, message passing is topology-dependent and sensitive to misalignment, making direct parameter-space merging unreliable. To bridge this gap, we introduce H-GRAMA (Heterogeneous Graph Routing and Message Alignment), a training-free framework that lifts merging from parameter space to operator space. We formalize Universal Message Passing Mixture (UMPM), a shared operator family that expresses heterogeneous GNN layers in a common functional language. H-GRAMA enables cross-architecture GNN merging (e.g., GCN to GAT) without retraining, retaining high specialist accuracy in most cases in compatible depth settings and achieving inference speedups of 1.2x to 1.9x over ensembles.

[LG-44] Partial Soft-Matching Distance for Neural Representational Comparison with Partial Unit Correspondence

Link: https://arxiv.org/abs/2602.19331
Authors: Chaitanya Kapoor,Alex H. Williams,Meenakshi Khosla
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Abstract:Representational similarity metrics typically force all units to be matched, making them susceptible to noise and outliers common in neural representations. We extend the soft-matching distance to a partial optimal transport setting that allows some neurons to remain unmatched, yielding rotation-sensitive but robust correspondences. This partial soft-matching distance provides theoretical advantages – relaxing strict mass conservation while maintaining interpretable transport costs – and practical benefits through efficient neuron ranking in terms of cross-network alignment without costly iterative recomputation. In simulations, it preserves correct matches under outliers and reliably selects the correct model in noise-corrupted identification tasks. On fMRI data, it automatically excludes low-reliability voxels and produces voxel rankings by alignment quality that closely match computationally expensive brute-force approaches. It achieves higher alignment precision across homologous brain areas than standard soft-matching, which is forced to match all units regardless of quality. In deep networks, highly matched units exhibit similar maximally exciting images, while unmatched units show divergent patterns. This ability to partition by match quality enables focused analyses, e.g., testing whether networks have privileged axes even within their most aligned subpopulations. Overall, partial soft-matching provides a principled and practical method for representational comparison under partial correspondence.

[LG-45] CTS-Bench: Benchmarking Graph Coarsening Trade-offs for GNNs in Clock Tree Synthesis ASPLOS

Link: https://arxiv.org/abs/2602.19330
Authors: Barsat Khadka,Kawsher Roxy,Md Rubel Ahmed
Categories: Machine Learning (cs.LG)
Comments: Accepted to ML Bench’26 ASPLOS

Abstract:Graph Neural Networks (GNNs) are increasingly explored for physical design analysis in Electronic Design Automation, particularly for modeling Clock Tree Synthesis behavior such as clock skew and buffering complexity. However, practical deployment remains limited due to the prohibitive memory and runtime cost of operating on raw gate-level netlists. Graph coarsening is commonly used to improve scalability, yet its impact on CTS-critical learning objectives is not well characterized. This paper introduces CTS-Bench, a benchmark suite for systematically evaluating the trade-offs between graph coarsening, prediction accuracy, and computational efficiency in GNN-based CTS analysis. CTS-Bench consists of 4,860 converged physical design solutions spanning five architectures and provides paired raw gate-level and clustered graph representations derived from post-placement designs. Using clock skew prediction as a representative CTS task, we demonstrate a clear accuracy-efficiency trade-off. While graph coarsening reduces GPU memory usage by up to 17.2x and accelerates training by up to 3x, it also removes structural information essential for modeling clock distribution, frequently resulting in negative R^2 scores under zero-shot evaluation. Our findings indicate that generic graph clustering techniques can fundamentally compromise CTS learning objectives, even when global physical metrics remain unchanged. CTS-Bench enables principled evaluation of CTS-aware graph coarsening strategies, supports benchmarking of GNN architectures and accelerators under realistic physical design constraints, and provides a foundation for developing learning-assisted CTS analysis and optimization techniques.

[LG-46] Metasurfaces-Integrated Wireless Neural Networks for Lightweight Over-The-Air Edge Inference

Link: https://arxiv.org/abs/2602.19312
Authors: Kyriakos Stylianopoulos,Mario Edoardo Pandolfo,Paolo Di Lorenzo,George C. Alexandropoulos
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 9 pages, 6 figures, submitted for magazine publication

Abstract:The upcoming sixth Generation (6G) of wireless networks envisions ultra-low latency and energy efficient Edge Inference (EI) for diverse Internet of Things (IoT) applications. However, traditional digital hardware for machine learning is power intensive, motivating the need for alternative computation paradigms. Over-The-Air (OTA) computation is regarded as an emerging transformative approach assigning the wireless channel to actively perform computational tasks. This article introduces the concept of Metasurfaces-Integrated Neural Networks (MINNs), a physical-layer-enabled deep learning framework that leverages programmable multi-layer metasurface structures and Multiple-Input Multiple-Output (MIMO) channels to realize computational layers in the wave propagation domain. The MINN system is conceptualized as three modules: Encoder, Channel (uncontrollable propagation features and metasurfaces), and Decoder. The first and last modules, realized respectively at the multi-antenna transmitter and receiver, consist of conventional digital or purposely designed analog Deep Neural Network (DNN) layers, and the metasurfaces responses of the Channel module are optimized alongside all modules as trainable weights. This architecture enables computation offloading into the end-to-end physical layer, flexibly among its constituent modules, achieving performance comparable to fully digital DNNs while significantly reducing power consumption. The training of the MINN framework, two representative variations, and performance results for indicative applications are presented, highlighting the potential of MINNs as a lightweight and sustainable solution for future EI-enabled wireless systems. The article is concluded with a list of open challenges and promising research directions.

[LG-47] AdsorbFlow: energy-conditioned flow matching enables fast and realistic adsorbate placement

Link: https://arxiv.org/abs/2602.19289
Authors: Jiangjie Qiu,Wentao Li,Honghao Chen,Leyi Zhao,Xiaonan Wang
Categories: Machine Learning (cs.LG)

Abstract: Identifying low-energy adsorption geometries on catalytic surfaces is a practical bottleneck for computational heterogeneous catalysis: the difficulty lies not only in the cost of density functional theory (DFT) but in proposing initial placements that relax into the correct energy basins. Conditional denoising diffusion has improved success rates, yet requires $\sim 100$ iterative steps per sample. Here we introduce AdsorbFlow, a deterministic generative model that learns an energy-conditioned vector field on the rigid-body configuration space of adsorbate translation and rotation via conditional flow matching. Energy information enters through classifier-free guidance conditioning – not energy-gradient guidance – and sampling reduces to integrating an ODE in as few as 5 steps. On OC20-Dense with full DFT single-point verification, AdsorbFlow with an EquiformerV2 backbone achieves 61.4% SR@10 and 34.1% SR@1 – surpassing AdsorbDiff (31.8% SR@1, 41.0% SR@10) at every evaluation level and AdsorbML (47.7% SR@10) – while using 20 times fewer generative steps and achieving the lowest anomaly rate among generative methods (6.8%). On 50 out-of-distribution systems, AdsorbFlow retains 58.0% SR@10 with a MLFF-to-DFT gap of only 4 percentage points. These results establish that deterministic transport is both faster and more accurate than stochastic denoising for adsorbate placement.

[LG-48] Spectral bias in physics-informed and operator learning: Analysis and mitigation guidelines

Link: https://arxiv.org/abs/2602.19265
Authors: Siavash Khodakarami,Vivek Oommen,Nazanin Ahmadi Daryakenari,Maxim Beekenkamp,George Em Karniadakis
Categories: Machine Learning (cs.LG)

Abstract: Neural-network and Kolmogorov-Arnold Network (KAN) solvers for partial differential equations (PDEs), including physics-informed neural networks (PINNs), physics-informed KANs (PIKANs), and neural operators, are known to exhibit spectral bias, whereby low-frequency components of the solution are learned significantly faster than high-frequency modes. While spectral bias is often treated as an intrinsic representational limitation of neural architectures, its interaction with optimization dynamics and physics-based loss formulations remains poorly understood. In this work, we provide a systematic investigation of spectral bias in physics-informed and operator learning frameworks, with emphasis on the coupled roles of network architecture, activation functions, loss design, and optimization strategy. We quantify spectral bias through frequency-resolved error metrics, Barron-norm diagnostics, and higher-order statistical moments, enabling a unified analysis across elliptic, hyperbolic, and dispersive PDEs. Through diverse benchmark problems, including the Korteweg-de Vries, wave and steady-state diffusion-reaction equations, turbulent flow reconstruction, and earthquake dynamics, we demonstrate that spectral bias is not simply representational but fundamentally dynamical. In particular, second-order optimization methods substantially alter the spectral learning order, enabling earlier and more accurate recovery of high-frequency modes for all PDE types. For neural operators, we further show that spectral bias is dependent on the neural operator architecture and can also be effectively mitigated through spectral-aware loss formulations without increasing the inference cost.

[LG-49] Alternating Bi-Objective Optimization for Explainable Neuro-Fuzzy Systems

Link: https://arxiv.org/abs/2602.19253
Authors: Qusai Khaled,Uzay Kaymak,Laura Genga
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at IEEE Conference on Artificial Intelligence 2026 (IEEE CAI 2026)

Abstract:Fuzzy systems show strong potential in explainable AI due to their rule-based architecture and linguistic variables. Existing approaches navigate the accuracy-explainability trade-off either through evolutionary multi-objective optimization (MOO), which is computationally expensive, or gradient-based scalarization, which cannot recover non-convex Pareto regions. We propose X-ANFIS, an alternating bi-objective gradient-based optimization scheme for explainable adaptive neuro-fuzzy inference systems. Cauchy membership functions are used for stable training under semantically controlled initializations, and a differentiable explainability objective is introduced and decoupled from the performance objective through alternating gradient passes. Validated in approximately 5,000 experiments on nine UCI regression datasets, X-ANFIS consistently achieves target distinguishability while maintaining competitive predictive accuracy, recovering solutions beyond the convex hull of the MOO Pareto front.

[LG-50] Understanding Empirical Unlearning with Combinatorial Interpretability

Link: https://arxiv.org/abs/2602.19215
Authors: Shingo Kodama,Niv Cohen,Micah Adler,Nir Shavit
Categories: Machine Learning (cs.LG)

Abstract:While many recent methods aim to unlearn or remove knowledge from pretrained models, seemingly erased knowledge often persists and can be recovered in various ways. Because large foundation models are far from interpretable, understanding whether and how such knowledge persists remains a significant challenge. To address this, we turn to the recently developed framework of combinatorial interpretability. This framework, designed for two-layer neural networks, enables direct inspection of the knowledge encoded in the model weights. We reproduce baseline unlearning methods within the combinatorial interpretability setting and examine their behavior along two dimensions: (i) whether they truly remove knowledge of a target concept (the concept we wish to remove) or merely inhibit its expression while retaining the underlying information, and (ii) how easily the supposedly erased knowledge can be recovered through various fine-tuning operations. Our results shed light within a fully interpretable setting on how knowledge can persist despite unlearning and when it might resurface.

[LG-51] Adaptive Problem Generation via Symbolic Representations

Link: https://arxiv.org/abs/2602.19187
Authors: Teresa Yeo, Myeongho Jeon, Dulaj Weerakoon, Rui Qiao, Alok Prakash, Armando Solar-Lezama, Archan Misra
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:We present a method for generating training data for reinforcement learning with verifiable rewards to improve small open-weights language models on mathematical tasks. Existing data generation approaches rely on open-loop pipelines and fixed modifications that do not adapt to the model’s capabilities. Furthermore, they typically operate directly on word problems, limiting control over problem structure. To address this, we perform modifications in a symbolic problem space, representing each problem as a set of symbolic variables and constraints (e.g., via algebraic frameworks such as SymPy or SMT formulations). This representation enables precise control over problem structure, automatic generation of ground-truth solutions, and decouples mathematical reasoning from linguistic realization. We also show that this results in more diverse generations. To adapt the problem difficulty to the model, we introduce a closed-loop framework that learns modification strategies through prompt optimization in symbolic space. Experimental results demonstrate that both adaptive problem generation and symbolic representation modifications contribute to improving the model’s math solving ability.

[LG-52] Online Realizable Regression and Applications for ReLU Networks

Link: https://arxiv.org/abs/2602.19172
Authors: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Realizable online regression can behave very differently from online classification. Even without any margin or stochastic assumptions, realizability may enforce horizon-free (finite) cumulative loss under metric-like losses, even when the analogous classification problem has an infinite mistake bound. We study realizable online regression in the adversarial model under losses that satisfy an approximate triangle inequality (approximate pseudo-metrics). Recent work of Attias et al. shows that the minimax realizable cumulative loss is characterized by the scaled Littlestone/online dimension $\mathbb{D}_{\mathrm{onl}}$, but this quantity can be difficult to analyze. Our main contribution is a generic potential method that upper bounds $\mathbb{D}_{\mathrm{onl}}$ by a concrete Dudley-type entropy integral that depends only on covering numbers of the hypothesis class under the induced sup pseudo-metric. We define an entropy potential $\Phi(\mathcal{H})=\int_{0}^{\mathrm{diam}(\mathcal{H})} \log N(\mathcal{H},\varepsilon)\,d\varepsilon$, where $N(\mathcal{H},\varepsilon)$ is the $\varepsilon$-covering number of $\mathcal{H}$, and show that for every $c$-approximate pseudo-metric loss, $\mathbb{D}_{\mathrm{onl}}(\mathcal{H})\le O(c)\,\Phi(\mathcal{H})$. In particular, polynomial metric entropy implies $\Phi(\mathcal{H})<\infty$ and hence a horizon-free realizable cumulative-loss bound with transparent dependence on effective dimension. We illustrate the method on two families. We prove a sharp $q$-vs.-$d$ dichotomy for realizable online learning (finite and efficiently achievable $\Theta_{d,q}(L^d)$ total loss for $L$-Lipschitz regression iff $q>d$, otherwise infinite), and for bounded-norm $k$-ReLU networks separate regression (finite loss, even $\widetilde{O}(k^2)$, and $O(1)$ for one ReLU) from classification (impossible already for $k=2$, $d=1$).

[LG-53] RA-QA: Towards Respiratory Audio-based Health Question Answering

Link: https://arxiv.org/abs/2602.18452
Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:

Abstract:Respiratory diseases are a leading cause of death globally, highlighting the urgent need for early and accessible screening methods. While some lung auscultation analysis has been automated and machine-learning audio-based models are able to predict respiratory pathologies, there remains a critical gap: the lack of intelligent systems that can interact in real-time consultations using natural language. Unlike other clinical domains, such as electronic health records, radiological images, and biosignals, where numerous question-answering (QA) datasets and models have been established, audio-based modalities remain notably underdeveloped. We curated and harmonized data from 11 diverse respiratory audio datasets to construct the first Respiratory Audio Question Answering (RA-QA) dataset. As the first multimodal QA resource of its kind focused specifically on respiratory health, RA-QA bridges clinical audio and natural language in a structured, scalable format. This new data resource contains about 7.5 million QA pairs spanning more than 60 attributes and three question types: single verification, multiple choice, and open-ended questions. Building upon this dataset, we introduce a novel benchmark that compares audio-text generation models with traditional audio classifiers to evaluate their respective performance. Our experiments reveal interesting performance variations across different attributes and question types, establishing a baseline and paving the way for more advanced architectures that could further improve the performance. By bridging machine learning with real-world clinical dialogue, our work opens the door to the development of more interactive, intelligent, and accessible diagnostic tools in respiratory healthcare.

[LG-54] Distribution-Free Sequential Prediction with Abstentions COLT2026

Link: https://arxiv.org/abs/2602.17918
Authors: Jialin Yu, Moïse Blanchard
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*Comments: 38 pages, 2 figures. Submitted to COLT 2026. Extended version

Abstract:We study a sequential prediction problem in which an adversary is allowed to inject arbitrarily many adversarial instances in a stream of i.i.d. instances, but at each round, the learner may also abstain from making a prediction without incurring any penalty if the instance was indeed corrupted. This semi-adversarial setting naturally sits between the classical stochastic case with i.i.d. instances for which function classes with finite VC dimension are learnable; and the adversarial case with arbitrary instances, known to be significantly more restrictive. For this problem, Goel et al. (2023) showed that, if the learner knows the distribution $\mu$ of clean samples in advance, learning can be achieved for all VC classes without restrictions on adversary corruptions. This is, however, a strong assumption in both theory and practice: a natural question is whether similar learning guarantees can be achieved without prior distributional knowledge, as is standard in classical learning frameworks (e.g., PAC learning or asymptotic consistency) and other non-i.i.d. models (e.g., smoothed online learning). We therefore focus on the distribution-free setting where $\mu$ is unknown and propose an algorithm AbstainBoost based on a boosting procedure of weak learners, which guarantees sublinear error for general VC classes in distribution-free abstention learning for oblivious adversaries. These algorithms also enjoy similar guarantees for adaptive adversaries, for structured function classes including linear classifiers. These results are complemented with corresponding lower bounds, which reveal an interesting polynomial trade-off between misclassification error and number of erroneous abstentions.

[LG-55] Schemes of Propagation Models and Source Estimators for Rumor Source Detection in Online Social Networks: A Short Survey of a Decade of Research

Link: https://arxiv.org/abs/2101.00753
Authors: Rong Jin, Weili Wu
Categories: Social and Information Networks (cs.SI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*Comments:

Abstract:In recent years, research on rumor source detection in online social networks has assumed a variety of rumor diffusion models. The diffusion model is arguably a very important and challenging factor for source detection in networks, but it is less studied. This paper provides an overview of three representative schemes (Independent Cascade-based, Epidemic-based, and Learning-based) for modeling the patterns of rumor propagation, as well as three major schemes of estimators for rumor sources, covering the decade since the field's inception.

[LG-56] JUCAL: Jointly Calibrating Aleatoric and Epistemic Uncertainty in Classification Tasks

Link: https://arxiv.org/abs/2602.20153
Authors: Jakob Heiss, Sören Lambrecht, Jakob Weissteiner, Hanna Wutte, Žan Žurič, Josef Teichmann, Bin Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments: 11 pages + appendix. Preliminary version of an ongoing project that will be expanded with further evaluations

Abstract:We study post-calibration uncertainty for trained ensembles of classifiers. Specifically, we consider both aleatoric (label noise) and epistemic (model) uncertainty. Among the most popular and widely used calibration methods in classification are temperature scaling (i.e., pool-then-calibrate) and conformal methods. However, the main shortcoming of these calibration methods is that they do not balance the proportion of aleatoric and epistemic uncertainty. Not balancing these uncertainties can severely misrepresent predictive uncertainty, leading to overconfident predictions in some input regions while being underconfident in others. To address this shortcoming, we present a simple but powerful calibration algorithm Joint Uncertainty Calibration (JUCAL) that jointly calibrates aleatoric and epistemic uncertainty. JUCAL jointly calibrates two constants to weight and scale epistemic and aleatoric uncertainties by optimizing the negative log-likelihood (NLL) on the validation/calibration dataset. JUCAL can be applied to any trained ensemble of classifiers (e.g., transformers, CNNs, or tree-based methods), with minimal computational overhead, without requiring access to the models’ internal parameters. We experimentally evaluate JUCAL on various text classification tasks, for ensembles of varying sizes and with different ensembling strategies. Our experiments show that JUCAL significantly outperforms SOTA calibration methods across all considered classification tasks, reducing NLL and predictive set size by up to 15% and 20%, respectively. Interestingly, even applying JUCAL to an ensemble of size 5 can outperform temperature-scaled ensembles of size up to 50 in terms of NLL and predictive set size, resulting in up to 10 times smaller inference costs. Thus, we propose JUCAL as a new go-to method for calibrating ensembles in classification.

[LG-57] Conformal Risk Control for Non-Monotonic Losses

Link: https://arxiv.org/abs/2602.20151
Authors: Anastasios N. Angelopoulos
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:

Abstract:Conformal risk control is an extension of conformal prediction for controlling risk functions beyond miscoverage. The original algorithm controls the expected value of a loss that is monotonic in a one-dimensional parameter. Here, we present risk control guarantees for generic algorithms applied to possibly non-monotonic losses with multidimensional parameters. The guarantees depend on the stability of the algorithm – unstable algorithms have looser guarantees. We give applications of this technique to selective image classification, FDR and IOU control of tumor segmentations, and multigroup debiasing of recidivism predictions across overlapping race and sex groups using empirical risk minimization.
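
The original monotone setting that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard conformal risk control rule for a bounded loss that is non-increasing in a scalar parameter, not the paper's new non-monotonic procedure. The function name `crc_threshold`, the candidate grid, and the toy calibration scores are all assumptions made for the example.

```python
from typing import Callable, Sequence


def crc_threshold(
    losses: Callable[[float], Sequence[float]],
    lambdas: Sequence[float],
    alpha: float,
    bound: float = 1.0,
) -> float:
    """Pick the smallest candidate lambda whose adjusted empirical risk is <= alpha.

    `losses(lam)` returns the per-example calibration losses at parameter `lam`.
    Assumption: every loss lies in [0, bound] and is non-increasing in `lam`.
    """
    for lam in sorted(lambdas):
        vals = losses(lam)
        n = len(vals)
        # Equivalent to n/(n+1) * mean(vals) + bound/(n+1): the empirical risk
        # inflated as if one extra example incurred the worst-case loss.
        adjusted = (sum(vals) + bound) / (n + 1)
        if adjusted <= alpha:
            return lam
    # No candidate qualified: fall back to the most conservative candidate.
    return max(lambdas)


# Toy calibration set (hypothetical): one nonconformity score per example.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def miscoverage(lam: float) -> list:
    # Loss is 1 when the true label's score exceeds the threshold, else 0.
    return [1.0 if s > lam else 0.0 for s in cal_scores]


grid = [i / 10 for i in range(11)]
lam_hat = crc_threshold(miscoverage, grid, alpha=0.2)
```

With a 0/1 miscoverage loss this rule reduces to the usual split-conformal quantile, which is why the monotone case is often described as a generalization of conformal prediction.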

[LG-58] Multivariate time-series forecasting of ASTRI-Horn monitoring data: A Normal Behavior Model

Link: https://arxiv.org/abs/2602.19984
Authors: Federico Incardona, Alessandro Costa, Farida Farsian, Francesco Franchina, Giuseppe Leto, Emilio Mastriani, Kevin Munari, Giovanni Pareschi, Salvatore Scuderi, Sebastiano Spinello, Gino Tosti
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
*Comments: 15 pages, 12 figures

Abstract:This study presents a Normal Behavior Model (NBM) developed to forecast monitoring time-series data from the ASTRI-Horn Cherenkov telescope under normal operating conditions. The analysis focused on 15 physical variables acquired by the Telescope Control Unit between September 2022 and July 2024, representing sensor measurements from the Azimuth and Elevation motors. After data cleaning, resampling, feature selection, and correlation analysis, the dataset was segmented into fixed-length intervals, in which the first I samples represented the input sequence provided to the model, while the forecast length, T, indicated the number of future time steps to be predicted. A sliding-window technique was then applied to increase the number of intervals. A Multi-Layer Perceptron (MLP) was trained to perform multivariate forecasting across all features simultaneously. Model performance was evaluated using the Mean Squared Error (MSE) and the Normalized Median Absolute Deviation (NMAD), and it was also benchmarked against a Long Short-Term Memory (LSTM) network. The MLP model demonstrated consistent results across different features and I-T configurations, and matched the performance of the LSTM while converging faster. It achieved an MSE of 0.019 ± 0.003 and an NMAD of 0.032 ± 0.009 on the test set under its best configuration (4 hidden layers, 720 units per layer, and I-T lengths of 300 samples each, corresponding to 5 hours at 1-minute resolution). Extending the forecast horizon up to 6.5 hours (the maximum allowed by this configuration) did not degrade performance, confirming the model’s effectiveness in providing reliable hour-scale predictions. The proposed NBM provides a powerful tool for enabling early anomaly detection in online ASTRI-Horn monitoring time series, offering a basis for the future development of a prognostics and health management system that supports predictive maintenance.
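
The segmentation step described in the abstract (an input sequence of I samples followed by T samples to forecast, with a sliding window to multiply the number of intervals) might look roughly like the sketch below. It is only an illustration of the general technique; the function name `sliding_windows`, the `stride` parameter, and the toy series are assumptions, and the authors' preprocessing may differ.

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    series: Sequence, input_len: int, horizon: int, stride: int = 1
) -> List[Tuple[list, list]]:
    """Cut a time series into (input, target) pairs.

    Each pair holds `input_len` consecutive samples (the "I" part) and the
    `horizon` samples that immediately follow them (the "T" part). For
    multivariate data, each element of `series` can be a row of features.
    """
    windows = []
    last_start = len(series) - (input_len + horizon)
    # range() is empty when the series is shorter than one full interval.
    for start in range(0, last_start + 1, stride):
        split = start + input_len
        inputs = list(series[start:split])
        targets = list(series[split:split + horizon])
        windows.append((inputs, targets))
    return windows


# Toy univariate series (hypothetical): ten samples, I = 3, T = 2.
toy = list(range(10))
pairs = sliding_windows(toy, input_len=3, horizon=2)
```

A stride of 1 gives the largest number of overlapping intervals; a stride equal to `input_len + horizon` gives the non-overlapping fixed-length segmentation.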

[LG-59] Rethinking Chronological Causal Discovery with Signal Processing

Link: https://arxiv.org/abs/2602.19903
Authors: Kurt Butler, Damian Machlanski, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 5 pages, 5 figures, Final version accepted to the 59th Asilomar Conference on Signals, Systems, and Computers (2025)

Abstract:Causal discovery problems use a set of observations to deduce causality between variables in the real world, typically to answer questions about biological or physical systems. These observations are often recorded at regular time intervals, determined by a user or a machine, depending on the experiment design. There is generally no guarantee that the timing of these recordings matches the timing of the underlying biological or physical events. In this paper, we examine the sensitivity of causal discovery methods to this potential mismatch. We consider empirical and theoretical evidence to understand how causal discovery performance is impacted by changes of sampling rate and window length. We demonstrate that both classical and recent causal discovery methods exhibit sensitivity to these hyperparameters, and we discuss how ideas from signal processing may help us understand these phenomena.

[LG-60] Dirichlet Scale Mixture Priors for Bayesian Neural Networks

Link: https://arxiv.org/abs/2602.19859
Authors: August Arnstad, Leiv Rønneberg, Geir Storvik
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 24 pages, 20 figures

Abstract:Neural networks are the cornerstone of modern machine learning, yet can be difficult to interpret, give overconfident predictions and are vulnerable to adversarial attacks. Bayesian neural networks (BNNs) provide some alleviation of these limitations, but have problems of their own. The key step of specifying prior distributions in BNNs is no trivial task, yet is often skipped out of convenience. In this work, we propose a new class of prior distributions for BNNs, the Dirichlet scale mixture (DSM) prior, that addresses current limitations in Bayesian neural networks through structured, sparsity-inducing shrinkage. Theoretically, we derive general dependence structures and shrinkage results for DSM priors and show how they manifest under the geometry induced by neural networks. In experiments on simulated and real-world data we find that the DSM priors encourage sparse networks through implicit feature selection, show robustness under adversarial attacks and deliver competitive predictive performance with substantially fewer effective parameters. In particular, their advantages appear most pronounced in correlated, moderately small data regimes, and are more amenable to weight pruning. Moreover, by adopting heavy-tailed shrinkage mechanisms, our approach aligns with recent findings that such priors can mitigate the cold posterior effect, offering a principled alternative to the commonly used Gaussian priors.

[LG-61] Orthogonal Uplift Learning with Permutation-Invariant Representations for Combinatorial Treatments

Link: https://arxiv.org/abs/2602.19851
Authors: Xinyan Su, Jiacan Gao, Mingyuan Ma, Xiao Xu, Xinrui Wan, Tianqi Gu, Enyun Yu, Jiecheng Guo, Zhiheng Zhang
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
*Comments:

Abstract:We study uplift estimation for combinatorial treatments. Uplift measures the pure incremental causal effect of an intervention (e.g., sending a coupon or a marketing message) on user behavior, modeled as a conditional individual treatment effect. Many real-world interventions are combinatorial: a treatment is a policy that specifies context-dependent action distributions rather than a single atomic label. Although recent work considers structured treatments, most methods rely on categorical or opaque encodings, limiting robustness and generalization to rare or newly deployed policies. We propose an uplift estimation framework that aligns treatment representation with causal semantics. Each policy is represented by the mixture it induces over context-action components and embedded via a permutation-invariant aggregation. This representation is integrated into an orthogonalized low-rank uplift model, extending Robinson-style decompositions to learned, vector-valued treatments. We show that the resulting estimator is expressive for policy-induced causal effects, orthogonally robust to nuisance estimation errors, and stable under small policy perturbations. Experiments on large-scale randomized platform data demonstrate improved uplift accuracy and stability in long-tailed policy regimes.

[LG-62] Path-conditioned training: a principled way to rescale ReLU neural networks

Link: https://arxiv.org/abs/2602.19799
Authors: Arthur Lebeurrier, Titouan Vayer, Rémi Gribonval
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Abstract:Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled weights implement the same function, the training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion to rescale neural network parameters, whose minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.

[LG-63] Exact Discrete Stochastic Simulation with Deep-Learning-Scale Gradient Optimization

Link: https://arxiv.org/abs/2602.19775
Authors: Jose M. G. Vilar, Leonor Saiz
Categories: Quantitative Methods (q-bio.QM); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Molecular Networks (q-bio.MN)
*Comments: 28 pages, 8 figures

Abstract:Exact stochastic simulation of continuous-time Markov chains (CTMCs) is essential when discreteness and noise drive system behavior, but the hard categorical event selection in Gillespie-type algorithms blocks gradient-based learning. We eliminate this constraint by decoupling forward simulation from backward differentiation, with hard categorical sampling generating exact trajectories and gradients propagating through a continuous massively-parallel Gumbel-Softmax straight-through surrogate. Our approach enables accurate optimization at parameter scales over four orders of magnitude beyond existing simulators. We validate for accuracy, scalability, and reliability on a reversible dimerization model (0.09% error), a genetic oscillator (1.2% error), a 203,796-parameter gene regulatory network achieving 98.4% MNIST accuracy (a prototypical deep-learning multilayer perceptron benchmark), and experimental patch-clamp recordings of ion channel gating (R^2 = 0.987) in the single-channel regime. Our GPU implementation delivers 1.9 billion steps per second, matching the scale of non-differentiable simulators. By making exact stochastic simulation massively parallel and autodiff-compatible, our results enable high-dimensional parameter inference and inverse design across systems biology, chemical kinetics, physics, and related CTMC-governed domains.
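
To make the "hard sampling forward, soft surrogate backward" idea concrete, the sketch below shows the two forward quantities involved when selecting the next event from a set of propensities: a hard Gumbel-max choice and the Gumbel-Softmax weights computed from the same noise. It is a plain-Python illustration under stated assumptions, not the authors' GPU implementation; `select_event`, the temperature `tau`, and the example propensities are hypothetical, and no gradients are computed here. In an autodiff framework, a straight-through estimator would use the hard choice in the forward pass and differentiate through the soft weights.

```python
import math
import random
from typing import List, Sequence, Tuple


def gumbel_noise(rng: random.Random) -> float:
    """Draw one standard Gumbel sample via the inverse-CDF transform."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def select_event(
    propensities: Sequence[float], tau: float, rng: random.Random
) -> Tuple[int, List[float]]:
    """Return (hard_index, soft_weights) for one event selection.

    hard_index: Gumbel-max sample, distributed proportionally to the
        (strictly positive) propensities, as in Gillespie-type selection.
    soft_weights: Gumbel-Softmax relaxation at temperature `tau`, built
        from the same perturbed logits.
    """
    logits = [math.log(a) + gumbel_noise(rng) for a in propensities]
    hard = max(range(len(logits)), key=logits.__getitem__)
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    soft = [e / total for e in exps]
    return hard, soft


rng = random.Random(0)
hard, soft = select_event([1.0, 2.0, 3.0], tau=0.5, rng=rng)
```

Lower temperatures push `soft` toward the one-hot vector for `hard`, at the cost of higher-variance gradients in an autodiff setting.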

[LG-64] Ensemble Machine Learning and Statistical Procedures for Dynamic Predictions of Time-to-Event Outcomes

Link: https://arxiv.org/abs/2602.19761
Authors: Nina van Gerwen, Sten Willemsen, Bettina E. Hansen, Christophe Corpechot, Marco Carbone, Cynthia Levy, Maria-Carlota Londoño, Atsushi Tanaka, Palak Trivedi, Alejandra Villamil, Gideon Hirschfield, Dimitris Rizopoulos
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*Comments:

Abstract:Dynamic predictions for longitudinal and time-to-event outcomes have become a versatile tool in precision medicine. Our work is motivated by the application of dynamic predictions in the decision-making process for primary biliary cholangitis patients. For these patients, serial biomarker measurements (e.g., bilirubin and alkaline phosphatase levels) are routinely collected to inform treating physicians of the risk of liver failure and guide clinical decision-making. Two popular statistical approaches to derive dynamic predictions are joint modelling and landmarking. However, recently, machine learning techniques have also been proposed. Each approach has its merits, and no single method outperforms all others. Consequently, obtaining the best possible survival estimates is challenging. Therefore, we extend the Super Learner framework to combine dynamic predictions from different models and procedures. Super Learner is an ensemble learning technique that allows users to combine different prediction algorithms to improve predictive accuracy and flexibility. It uses cross-validation and different objective functions of performance (e.g., squared loss) that suit specific applications to build the optimally weighted combination of predictions from a library of candidate algorithms. In our work, we pay special attention to appropriate objective functions for Super Learner to obtain the optimal weighted combination of dynamic predictions. In our primary biliary cholangitis application, Super Learner presented unique benefits due to its ability to flexibly combine outputs from a diverse set of models with varying assumptions for equal or better predictive performance than any model fit separately.

[LG-65] Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations

Link: https://arxiv.org/abs/2602.19691
Authors: Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:

Abstract:Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s>0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.

[LG-66] Manifold-Aligned Generative Transport

Link: https://arxiv.org/abs/2602.19600
Authors: Xinyu Tian, Xiaotong Shen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 64 pages, 5 figures

Abstract:High-dimensional generative modeling is fundamentally a manifold-learning problem: real data concentrate near a low-dimensional structure embedded in the ambient space. Effective generators must therefore balance support fidelity – placing probability mass near the data manifold – with sampling efficiency. Diffusion models often capture near-manifold structure but require many iterative denoising steps and can leak off-support; normalizing flows sample in one pass but are limited by invertibility and dimension preservation. We propose MAGT (Manifold-Aligned Generative Transport), a flow-like generator that learns a one-shot, manifold-aligned transport from a low-dimensional base distribution to the data space. Training is performed at a fixed Gaussian smoothing level, where the score is well-defined and numerically stable. We approximate this fixed-level score using a finite set of latent anchor points with self-normalized importance sampling, yielding a tractable objective. MAGT samples in a single forward pass, concentrates probability near the learned support, and induces an intrinsic density with respect to the manifold volume measure, enabling principled likelihood evaluation for generated samples. We establish finite-sample Wasserstein bounds linking smoothing level and score-approximation accuracy to generative fidelity, and empirically improve fidelity and manifold concentration across synthetic and benchmark datasets while sampling substantially faster than diffusion models.

[LG-67] Goal-Oriented Influence-Maximizing Data Acquisition for Learning and Optimization

Link: https://arxiv.org/abs/2602.19578
Authors: Weichi Yao, Bianca Dumitrascu, Bryan R. Goldsmith, Yixin Wang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Abstract:Active data acquisition is central to many learning and optimization tasks in deep neural networks, yet remains challenging because most approaches rely on predictive uncertainty estimates that are difficult to obtain reliably. To this end, we propose Goal-Oriented Influence-Maximizing Data Acquisition (GOIMDA), an active acquisition algorithm that avoids explicit posterior inference while remaining uncertainty-aware through inverse curvature. GOIMDA selects inputs by maximizing their expected influence on a user-specified goal functional, such as test loss, predictive entropy, or the value of an optimizer-recommended design. Leveraging first-order influence functions, we derive a tractable acquisition rule that combines the goal gradient, training-loss curvature, and candidate sensitivity to model parameters. We show theoretically that, for generalized linear models, GOIMDA approximates predictive-entropy minimization up to a correction term accounting for goal alignment and prediction bias, thereby yielding uncertainty-aware behavior without maintaining a Bayesian posterior. Empirically, across learning tasks (including image and text classification) and optimization tasks (including noisy global optimization benchmarks and neural-network hyperparameter tuning), GOIMDA consistently reaches target performance with substantially fewer labeled samples or function evaluations than uncertainty-based active learning and Gaussian-process Bayesian optimization baselines.

[LG-68] MACE-POLAR-1: A Polarisable Electrostatic Foundation Model for Molecular Chemistry

Link: https://arxiv.org/abs/2602.19411
Authors: Ilyes Batatia, William J. Baldwin, Domantas Kuryla, Joseph Hart, Elliott Kasoar, Alin M. Elena, Harry Moore, Mikołaj J. Gawkowski, Benjamin X. Shi, Venkat Kapil, Panagiotis Kourtis, Ioan-Bogdan Magdău, Gábor Csányi
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*Comments:

Abstract:Accurate modelling of electrostatic interactions and charge transfer is fundamental to computational chemistry, yet most machine learning interatomic potentials (MLIPs) rely on local atomic descriptors that cannot capture long-range electrostatic effects. We present a new electrostatic foundation model for molecular chemistry that extends the MACE architecture with explicit treatment of long-range interactions and electrostatic induction. Our approach combines local many-body geometric features with a non-self-consistent field formalism that updates learnable charge and spin densities through polarisable iterations to model induction, followed by global charge equilibration via learnable Fukui functions to control total charge and total spin. This design enables an accurate and physical description of systems with varying charge and spin states while maintaining computational efficiency. Trained on the OMol25 dataset of 100 million hybrid DFT calculations, our models achieve chemical accuracy across diverse benchmarks, with accuracy competitive with hybrid DFT on thermochemistry, reaction barriers, conformational energies, and transition metal complexes. Notably, we demonstrate that the inclusion of long-range electrostatics leads to a large improvement in the description of non-covalent interactions and supramolecular complexes over non-electrostatic models, including sub-kcal/mol prediction of molecular crystal formation energy in the X23-DMC dataset and a fourfold improvement over short-ranged models on protein-ligand interactions. The model’s ability to handle variable charge and spin states, respond to external fields, provide interpretable spin-resolved charge densities, and maintain accuracy from small molecules to protein-ligand complexes positions it as a versatile tool for computational molecular chemistry and drug discovery.

Attachments

Click to download today's full paper list