This post lists the latest papers retrieved from Arxiv.org on 2026-03-23, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each day.
Tip: If a given day's update is missing, either arXiv released no new papers that day or the script failed. Failures are fixed the same day whenever possible.
Table of Contents
Overview (2026-03-23)
555 papers updated today, including:
- Natural Language Processing: 94 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 168 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 132 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 154 papers (Machine Learning (cs.LG))
- Multiagent Systems: 10 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 22 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 23 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms
[Quick Read]: This paper addresses how to steer a large-scale swarm accurately using only a few control updates. The core difficulty is that real control systems operate in sampled-data form: control inputs are updated only at discrete times and then held over finite intervals. Conventional approaches design controls from instantaneous velocity fields, which does not fit the sampled-data setting. The authors propose a control-space learning framework whose key innovation is to learn a finite-window control coefficient that parameterizes the finite-horizon minimum-energy control over each sampling interval. This coefficient admits both an integral representation and a local differential identity along bridge trajectories, yielding a simple stop-gradient training objective. At deployment, the coefficient is used directly in sampled-data control updates, so the learned control is consistent with the system dynamics and actuation map by construction, giving a few-step swarm-steering scheme that matches the structure of real control systems.
Link: https://arxiv.org/abs/2603.20189
Authors: Anqi Dong, Yongxin Chen, Karl H. Johansson, Johan Karlsson
Affiliations: KTH Royal Institute of Technology; Georgia Institute of Technology
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:
Abstract:Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
[MA-1] IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning
[Quick Read]: This paper addresses the limited scene understanding that multi-robot systems suffer in indoor environments due to partial observability: robot-to-robot (R2R) communication alone cannot achieve efficient coordination without substantial exploration overhead or larger team sizes. The key to the solution is a Robot-to-Everything (R2X) perception and communication paradigm that leverages already-deployed low-cost Internet of Things (IoT) sensors (e.g., cameras) for persistent, building-wide semantic context. These observations are fused into a unified global semantic state that supports LLM-based high-level task planning and collaboration, substantially improving the efficiency and reliability of multi-robot systems.
Link: https://arxiv.org/abs/2603.20182
Authors: Fan Yang, Soumya Teotia, Shaunak A. Mehta, Prajit KrisshnaKumar, Quanting Xie, Jun Liu, Yueqi Song, Li Wenkai, Atsunori Moteki, Kanji Uchino, Yonatan Bisk
Affiliations: Fujitsu Research; Carnegie Mellon University
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:
Abstract:Although robot-to-robot (R2R) communication improves indoor scene understanding beyond what a single robot can achieve, R2R alone cannot overcome partial observability without substantial exploration overhead or scaling team size. In contrast, many indoor environments already include low-cost Internet of Things (IoT) sensors (e.g., cameras) that provide persistent, building-wide context beyond onboard perception. We therefore introduce IndoorR2X, the first benchmark and simulation framework for Large Language Model (LLM)-driven multi-robot task planning with Robot-to-Everything (R2X) perception and communication in indoor environments. IndoorR2X integrates observations from mobile robots and static IoT devices to construct a global semantic state that supports scalable scene understanding, reduces redundant exploration, and enables high-level coordination through LLM-based planning. IndoorR2X provides configurable simulation environments, sensor layouts, robot teams, and task suites to systematically evaluate high-level semantic coordination strategies. Extensive experiments across diverse settings demonstrate that IoT-augmented world modeling improves multi-robot efficiency and reliability, and we highlight key insights and failure modes for advancing LLM-based collaboration between robot teams and indoor IoT sensors.
[MA-2] Beyond detection: cooperative multi-agent reasoning for rapid onboard EO crisis response
[Quick Read]: This paper targets the latency in rapid hazard identification for next-generation Earth Observation (EO) missions: current monitoring pipelines are largely ground-based and constrained by downlink bandwidth, the complexity of multi-source data fusion, and the computational cost of exhaustive scene analysis. The key to the solution is a hierarchical multi-agent architecture that runs onboard and coordinates specialized AI agents within an event-driven decision pipeline for cooperative reasoning over multimodal remote-sensing data: an Early Warning agent quickly generates hypotheses from onboard observations and selectively activates domain-specific analysis agents, while a Decision agent consolidates the evidence into a final alert. The architecture combines vision-language models, traditional remote-sensing tools, and role-specialized agents to sharply reduce redundant computation. Experiments on wildfire and flood monitoring validate its efficiency and feasibility, offering a distributed intelligent-reasoning paradigm for future autonomous EO constellations.
Link: https://arxiv.org/abs/2603.19858
Authors: Alejandro D. Mousist, Pedro Delgado de Robles Martín, Raquel Lladró Climent, Julian Cobos Aparicio
Affiliations: Unknown
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: Accepted for presentation at the ESA’s 4S Symposium 2026 Conference (see this https URL)
Abstract:Rapid identification of hazardous events is essential for next-generation Earth Observation (EO) missions supporting disaster response. However, current monitoring pipelines remain largely ground-centric, introducing latency due to downlink limitations, multi-source data fusion constraints, and the computational cost of exhaustive scene analysis. This work proposes a hierarchical multi-agent architecture for onboard EO processing under strict resource and bandwidth constraints. The system enables the exploitation of complementary multimodal observations by coordinating specialized AI agents within an event-driven decision pipeline. AI agents can be deployed across multiple nodes in a distributed setting, such as satellite platforms. An Early Warning agent generates fast hypotheses from onboard observations and selectively activates domain-specific analysis agents, while a Decision agent consolidates the evidence to issue a final alert. The architecture combines vision-language models, traditional remote sensing analysis tools, and role-specialized agents to enable structured reasoning over multimodal observations while minimizing unnecessary computation. A proof-of-concept implementation was executed on the engineering model of an edge-computing platform currently deployed in orbit, using representative satellite data. Experiments on wildfire and flood monitoring scenarios show that the proposed routing-based pipeline significantly reduces computational overhead while maintaining coherent decision outputs, demonstrating the feasibility of distributed agent-based reasoning for future autonomous EO constellations.
[MA-3] Helix: A Dual-Helix Co-Evolutionary Multi-Agent System for Prompt Optimization and Question Reformulation
[Quick Read]: This paper addresses a bottleneck in automated prompt optimization (APO): existing methods are constrained by fixed prompt templates, limited search spaces, or one-sided optimization that treats user questions as immutable, overlooking the inherent coupling between question formulation and prompt design, where clearer question structure aids focused reasoning and task understanding, while effective prompts reveal better ways to organize the question. The paper proposes Helix, a unified multi-agent system built on a three-stage co-evolutionary framework: planner-guided decomposition splits the optimization target into coupled question and prompt objectives; a dual-track co-evolution mechanism has specialized agents iteratively critique each other to produce complementary improvements; and strategy-driven question generation instantiates high-quality reformulations for robust inference. Experiments show that Helix achieves up to 3.95% performance improvement over 6 strong baselines across 12 benchmarks with favorable optimization efficiency.
Link: https://arxiv.org/abs/2603.19732
Authors: Kewen Zhu, Liping Yi, Zhiming Zhao, Xiang Li, Qinghua Hu
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: under review
Abstract:Automated prompt optimization (APO) aims to improve large language model performance by refining prompt instructions. However, existing methods are largely constrained by fixed prompt templates, limited search spaces, or single-sided optimization that treats user questions as immutable inputs. In practice, question formulation and prompt design are inherently interdependent: clearer question structures facilitate focused reasoning and task understanding, while effective prompts reveal better ways to organize and restate queries. Ignoring this coupling fundamentally limits the effectiveness and adaptability of current APO approaches. We propose a unified multi-agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three-stage co-evolutionary framework. Helix integrates (1) planner-guided decomposition that breaks optimization into coupled question-prompt objectives, (2) dual-track co-evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy-driven question generation that instantiates high-quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.
[MA-4] A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
[Quick Read]: This paper addresses the poor long-horizon planning of large language model (LLM)-driven agents, especially in web navigation over dynamic content, where two challenges arise: losing track of the goal during online execution, and incoherent reasoning during reinforcement learning (RL) fine-tuning caused by sparse rewards. The solution rests on two innovations: an agent framework that uses proprietary models for explicit inference-time planning via subgoal decomposition, and MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that introduces dense, milestone-based reward signals to strengthen policy optimization. Experiments show significant gains on the WebArena-Lite benchmark; notably, the open Gemma3-12B model improves from a 6.4% to a 43.0% success rate, surpassing mainstream systems including GPT-4-Turbo and validating that explicit planning and milestone rewards together improve long-horizon capability.
Link: https://arxiv.org/abs/2603.19685
Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
Affiliations: Google DeepMind
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 50 pages, 15 figures
Abstract:Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves proprietary models such as Gemini by approximately a 10% absolute increase in success rate (SR) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent’s long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
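The milestone-based reward idea behind MiRA can be illustrated with a minimal sketch. The predicate-based `milestone_rewards` helper below is hypothetical, not the paper's actual reward implementation: it credits a fraction of the total reward at the first step where each subgoal predicate becomes true, replacing a single sparse terminal reward.

```python
def milestone_rewards(states, milestones):
    """Dense reward sketch: +1/len(milestones) at the first step where each
    milestone predicate becomes true. `states` is the sequence of environment
    states after each action; `milestones` is a list of state -> bool predicates
    marking subgoal completion (all names here are illustrative assumptions)."""
    remaining = list(milestones)
    rewards = []
    for s in states:
        r = 0.0
        for m in remaining[:]:       # iterate over a copy so we can remove
            if m(s):
                r += 1.0 / len(milestones)
                remaining.remove(m)  # each milestone is credited only once
        rewards.append(r)
    return rewards
```

Compared with a terminal-only reward, every trajectory prefix that reaches a subgoal receives credit, which is the property the abstract argues makes RL fine-tuning tractable over long horizons.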
[MA-5] GoAgent: Group-of-Agents Communication Topology Generation for LLM-based Multi-Agent Systems
[Quick Read]: This paper addresses inefficient coordination and redundant communication in large language model (LLM)-based multi-agent systems (MAS) caused by poorly designed communication topologies. Existing methods generate connections in a node-centric manner without explicitly modeling task-relevant collaborative group structure, leading to unclear subtask division and inefficient cross-group collaboration. The key to the proposed GoAgent method is to treat collaborative groups as the atomic units of MAS construction: an LLM first enumerates task-relevant candidate groups, which are then autoregressively selected and connected into the final communication graph, jointly capturing intra-group cohesion and inter-group coordination. A conditional information bottleneck (CIB) objective further compresses inter-group communication, preserving task-relevant signals while filtering redundant historical noise, significantly improving performance while reducing computational overhead.
Link: https://arxiv.org/abs/2603.19677
Authors: Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, Shirui Pan
Affiliations: Hangzhou Dianzi University; RMIT University; Griffith University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Large language model (LLM)-based multi-agent systems (MAS) have demonstrated exceptional capabilities in solving complex tasks, yet their effectiveness depends heavily on the underlying communication topology that coordinates agent interactions. Within these systems, successful problem-solving often necessitates task-specific group structures to divide and conquer subtasks. However, most existing approaches generate communication topologies in a node-centric manner, leaving group structures to emerge implicitly from local connectivity decisions rather than modeling them explicitly, often leading to suboptimal coordination and unnecessary communication overhead. To address this limitation, we propose GoAgent (Group-of-Agents), a communication topology generation method that explicitly treats collaborative groups as the atomic units of MAS construction. Specifically, GoAgent first enumerates task-relevant candidate groups through an LLM and then autoregressively selects and connects these groups as atomic units to construct the final communication graph, jointly capturing intra-group cohesion and inter-group coordination. To mitigate communication redundancy and noise propagation inherent in expanding topologies, we further introduce a conditional information bottleneck (CIB) objective that compresses inter-group communication, preserving task-relevant signals while filtering out redundant historical noise. Extensive experiments on six benchmarks demonstrate the state-of-the-art performance of GoAgent with 93.84% average accuracy while reducing token consumption by about 17%.
[MA-6] Planning Autonomous Vehicle Maneuvering in Work Zones Through Game-Theoretic Trajectory Generation
[Quick Read]: This paper addresses safe decision-making for autonomous vehicle (AV) lane changes in work zones, a high-risk scenario due to constrained geometry and unpredictable traffic. The key to the solution is a game-theoretic trajectory generation and control framework that models the lane change as a non-cooperative game between vehicles and balances safety, progress, and traffic stability to produce safer lane-change decisions. Simulation results show the approach reduces conflict frequency by 35% and substantially lowers the probability of high-risk safety events compared with traditional behavior-planning models.
Link: https://arxiv.org/abs/2603.19556
Authors: Mayar Nour, Atrisha Sarkar, Mohamed H. Zaki
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Work zone navigation remains one of the most challenging manoeuvres for autonomous vehicles (AVs), where constrained geometries and unpredictable traffic patterns create a high-risk environment. Despite extensive research on AV trajectory planning, few studies address the decision-making required to navigate work zones safely. This paper proposes a novel game-theoretic framework for trajectory generation and control to enhance the safety of lane changes in a work zone environment. By modelling the lane change manoeuvre as a non-cooperative game between vehicles, we use a game-theoretic planner to generate trajectories that balance safety, progress, and traffic stability. The simulation results show that the proposed game-theoretic model reduces the frequency of conflicts by 35 percent and decreases the probability of high risk safety events compared to traditional vehicle behaviour planning models in safety-critical highway work-zone scenarios.
[MA-7] TrustFlow: Topic-Aware Vector Reputation Propagation for Multi-Agent Ecosystems
[Quick Read]: This paper addresses the limitations of traditional reputation methods in multi-dimensional trust modeling and attack resistance, particularly against Sybil attacks, reputation laundering, and vote rings. The key to the solution is the TrustFlow algorithm, which propagates multi-dimensional reputation vectors over an interaction graph using topic-gated transfer operators that modulate each edge by its content embedding. By designing Lipschitz-1 transfer operators with composable information-theoretic gates, the system is guaranteed by the contraction mapping theorem to converge to a unique fixed point, achieves up to 98% multi-label Precision@5 on dense graphs and 78% on sparse ones, and significantly outperforms PageRank and variants such as Topic-Sensitive PageRank. The resulting vector reputations are directly queryable by dot product in the same embedding space as user queries, enabling semantics-aware trust reasoning.
Link: https://arxiv.org/abs/2603.19452
Authors: Volodymyr Seliuchenko
Affiliations: robutler.ai
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, demo at this https URL
Abstract:We introduce TrustFlow, a reputation propagation algorithm that assigns each software agent a multi-dimensional reputation vector rather than a scalar score. Reputation is propagated through an interaction graph via topic-gated transfer operators that modulate each edge by its content embedding, with convergence to a unique fixed point guaranteed by the contraction mapping theorem. We develop a family of Lipschitz-1 transfer operators and composable information-theoretic gates that achieve up to 98% multi-label Precision@5 on dense graphs and 78% on sparse ones. On a benchmark of 50 agents across 8 domains, TrustFlow resists sybil attacks, reputation laundering, and vote rings with at most 4 percentage-point precision impact. Unlike PageRank and Topic-Sensitive PageRank, TrustFlow produces vector reputation that is directly queryable by dot product in the same embedding space as user queries.
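The convergence claim above (contraction to a unique fixed point) can be sketched with a damped, gate-modulated propagation loop. The function name, damping scheme, and row normalization below are illustrative assumptions, not TrustFlow's actual operators; the point is that a gated update with Lipschitz constant below 1 converges by the Banach fixed-point theorem.

```python
import numpy as np

def propagate_reputation(adj_gates, base, damping=0.85, tol=1e-10, max_iter=1000):
    """Fixed-point reputation propagation (illustrative sketch).

    adj_gates: (n, n, k) array; adj_gates[i, j] is a per-topic gate >= 0
               modulating how much of agent j's reputation flows to agent i.
    base:      (n, k) array of prior reputation vectors.

    With row-normalized gates and damping < 1, the update map is a contraction
    in the sup norm, so iteration converges to a unique fixed point.
    """
    norm = adj_gates.sum(axis=1, keepdims=True)
    gates = np.divide(adj_gates, np.maximum(norm, 1e-12))  # rows sum to <= 1
    r = base.copy()
    for _ in range(max_iter):
        # Gated transfer: topic-wise weighted average of neighbors' vectors.
        r_next = (1 - damping) * base + damping * np.einsum('ijk,jk->ik', gates, r)
        if np.max(np.abs(r_next - r)) < tol:
            return r_next
        r = r_next
    return r
```

The damping term plays the same role as PageRank's teleportation factor, while the third tensor axis keeps reputation vector-valued per topic rather than scalar.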
[MA-8] GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
[Quick Read]: This paper addresses the difficulty of fusing heterogeneous features in multi-agent collaborative perception for autonomous driving: agents with different sensor modalities or model architectures struggle to align and integrate perceptual features efficiently. Existing methods rely on retraining encoders or designing pairwise interpreter modules for feature alignment, which does not scale. The key to the proposed GT-Space framework is to build a unified common feature space from ground-truth labels, so each agent needs only a single adapter module to project its features into that space, eliminating pairwise interactions; a fusion network trained with contrastive losses across diverse modality combinations then enables efficient and robust cross-modal feature fusion.
Link: https://arxiv.org/abs/2603.19308
Authors: Wentao Wang, Haoran Xu, Guang Tan
Affiliations: Sun Yat-sen University, Shenzhen Campus; Peng Cheng Laboratory
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at this https URL.
[MA-9] On the existence of fair zero-determinant strategies in the periodic prisoner's dilemma game
[Quick Read]: This paper investigates whether zero-determinant (ZD) strategies exist, and with what properties, in stochastic games, focusing on the periodic prisoner's dilemma game, one of the simplest stochastic games. In the standard repeated prisoner's dilemma, fair ZD strategies unilaterally equalize a player's payoff with the opponents' average payoff, and Tit-for-Tat (TFT) is always such a strategy. This paper proves, however, that fair ZD strategies do not necessarily exist in the periodic prisoner's dilemma game, and that TFT is no longer guaranteed to be a fair ZD strategy there. The key insight is that the environmental state transitions of stochastic games substantially change the existence conditions for ZD strategies, so classical results from repeated games no longer apply.
Link: https://arxiv.org/abs/2603.19641
Authors: Ken Nakamura, Masahiko Ueda
Affiliations: Yamaguchi University
Subjects: Physics and Society (physics.soc-ph); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 25 pages
Abstract:Repeated games are a framework for investigating long-term interdependence of multi-agent systems. In repeated games, zero-determinant (ZD) strategies attract much attention in evolutionary game theory, since they can unilaterally control payoffs. Especially, fair ZD strategies unilaterally equalize the payoff of the focal player and the average payoff of the opponents, and they were found in several games including the social dilemma games. Although the existence condition of ZD strategies in repeated games was specified, its extension to stochastic games is almost unclear. Stochastic games are an extension of repeated games, where a state of an environment exists, and the state changes to another one according to an action profile of players. Because of the transition of an environmental state, the existence condition of ZD strategies in stochastic games is more complicated than that in repeated games. Here, we investigate the existence condition of fair ZD strategies in the periodic prisoner’s dilemma game, which is one of the simplest stochastic games. We show that fair ZD strategies do not necessarily exist in the periodic prisoner’s dilemma game, in contrast to the repeated prisoner’s dilemma game. Furthermore, we also prove that the Tit-for-Tat strategy, which imitates the opponent’s action, is not necessarily a fair ZD strategy in the periodic prisoner’s dilemma game, whereas the Tit-for-Tat strategy is always a fair ZD strategy in the repeated prisoner’s dilemma game. Our results highlight difference between ZD strategies in the periodic prisoner’s dilemma game and ones in the standard repeated prisoner’s dilemma game.
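For context, the ZD formalism the abstract builds on can be sketched for the standard two-player repeated prisoner's dilemma (the Press-Dyson setup, not the paper's periodic-game derivation). With payoffs T > R > P > S over outcomes (CC, CD, DC, DD):

```latex
% Payoff vectors for player X and opponent Y:
%   S_X = (R, S, T, P),   S_Y = (R, T, S, P).
% A memory-one strategy p = (p_1, p_2, p_3, p_4) is zero-determinant if
\tilde{\mathbf{p}} := (p_1 - 1,\; p_2 - 1,\; p_3,\; p_4)
  = \alpha \, \mathbf{S}_X + \beta \, \mathbf{S}_Y + \gamma \, \mathbf{1},
% which unilaterally enforces the linear payoff relation
\alpha \, s_X + \beta \, s_Y + \gamma = 0 .
% A fair ZD strategy takes beta = -alpha and gamma = 0, forcing s_X = s_Y.
% Tit-for-Tat, p = (1, 0, 1, 0), gives
\tilde{\mathbf{p}} = (0, -1, 1, 0) = \frac{\mathbf{S}_X - \mathbf{S}_Y}{T - S},
% so TFT is always fair ZD in the repeated game; the paper's result is that
% this guarantee breaks down once periodic state transitions are introduced.
```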
Natural Language Processing
[NLP-0] VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking CVPR2026
[Quick Read]: This paper addresses the high computational cost of current video agentic models on long-video tasks, caused by greedy parsing over densely sampled frames. The key innovation of the proposed VideoSeek is to exploit the video's logic flow to actively seek answer-critical evidence rather than parsing the entire video frame by frame, allowing the model to use far fewer frames while maintaining or even improving its video understanding. Through a think-act-observe loop and a multi-granular observation toolkit, it supports query-aware exploration and reasoning, achieving strong results on several benchmarks; for example, it improves accuracy on LVBench by 10.2 absolute points over its base model GPT-5 while using only about 7% of the frames.
Link: https://arxiv.org/abs/2603.20185
Authors: Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum
Affiliations: AMD; University of Rochester
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at CVPR 2026
Abstract:Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
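The think-act-observe loop the abstract describes can be sketched generically. The `video_seek` function, its tool dictionary, and the LLM decision format below are all assumptions for illustration, not VideoSeek's actual interface:

```python
def video_seek(question, tools, llm, max_steps=8):
    """Minimal think-act-observe loop (sketch; tool and LLM APIs are assumed).

    tools: dict mapping tool names to callables that return observations
           (e.g., a coarse scene scan, a targeted frame grab, clip captioning).
    llm:   callable that, given the question and the observation history,
           returns either ("act", tool_name, args) or ("answer", text).
    """
    observations = []
    for _ in range(max_steps):
        decision = llm(question, observations)        # think
        if decision[0] == "answer":
            return decision[1]
        _, tool_name, args = decision                 # act
        observations.append(tools[tool_name](*args))  # observe
    # Budget exhausted: force a final answer from what was gathered.
    return llm(question, observations + ["budget exhausted; answer now"])[1]
```

The frame savings come from the loop itself: the agent only pays for the frames its chosen tools touch, rather than for dense sampling of the whole video.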
[NLP-1] Adaptive Greedy Frame Selection for Long Video Understanding
[Quick Read]: This paper addresses the inference bottleneck that large vision-language models (VLMs) face in long-video question answering due to the number of input frames: naive sparse sampling can miss decisive moments, while purely relevance-driven selection often collapses onto near-duplicate frames and fails to cover temporally dispersed evidence. The key to the solution is a question-adaptive greedy frame selection method: it builds a 1 FPS candidate pool (capped at 1000 frames), embeds candidates in two complementary spaces (SigLIP for question relevance, DINOv2 for semantic representativeness), and greedily maximizes a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, guaranteeing a (1 - 1/e) approximation bound. To handle question-dependent trade-offs between relevance and coverage, four preset strategies and a lightweight text-only question-type classifier are introduced to route each query to its best-performing preset, yielding consistent accuracy gains across frame budgets.
Link: https://arxiv.org/abs/2603.20180
Authors: Yuning Huang, Fengqing Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large vision–language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
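The greedy objective described above (a modular relevance term plus a facility-location coverage term) can be sketched as follows. The function name and the `alpha` trade-off weight are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def select_frames(rel, sim, budget, alpha=0.5):
    """Greedy maximization of f(S) = alpha * sum_{i in S} rel[i]
                                   + (1-alpha) * sum_j max_{i in S} sim[j, i].

    rel: (n,) question-relevance scores (e.g., SigLIP similarities), >= 0.
    sim: (n, n) pairwise semantic similarities (e.g., DINOv2), >= 0.
    f is monotone submodular, so greedy selection achieves a (1 - 1/e)
    approximation of the optimal budget-constrained set.
    """
    n = len(rel)
    selected, covered = [], np.zeros(n)  # covered[j] = max sim to chosen frames
    for _ in range(budget):
        best, best_gain = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: modular relevance + facility-location coverage gain.
            gain = alpha * rel[i] + (1 - alpha) * np.maximum(sim[:, i] - covered, 0).sum()
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered = np.maximum(covered, sim[:, best])
    return selected
```

The coverage term is what prevents collapse onto near-duplicates: once a frame is selected, similar frames contribute almost no marginal gain.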
[NLP-2] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
[Quick Read]: This paper questions the comparability of chain-of-thought (CoT) faithfulness evaluations: are the single aggregate faithfulness numbers reported across studies objective and consistent? The key to the approach is to systematically evaluate 10,276 influenced reasoning traces from 12 open-weight models (7B to 1T parameters) with three mechanistically different classifiers: a regex-only detector, a two-stage regex-plus-large-language-model (LLM) pipeline, and an independent Claude Sonnet 4 judge. On identical data the three classifiers yield markedly different overall faithfulness rates (74.4%, 82.6%, and 69.7%), and model rankings reverse depending on classifier choice, revealing that classifiers operationalize faithfulness at different levels of stringency (lexical mention vs. epistemic dependence). The paper therefore argues that future evaluations should report sensitivity ranges across multiple classification methods rather than single point estimates, improving transparency and comparability.
Link: https://arxiv.org/abs/2603.20172
Authors: Richard J. Young
Affiliations: University of Nevada, Las Vegas; DeepNeuro AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 4 figures, 5 tables
Abstract:Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar’s test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen’s kappa ranges from 0.06 (“slight”) for sycophancy hints to 0.42 (“moderate”) for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
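Cohen's kappa, used above to quantify inter-classifier agreement, reduces to a short computation for binary faithful/unfaithful verdicts. This is a generic sketch of the standard formula, not the paper's evaluation code:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters' verdicts on the same items.

    a, b: equal-length sequences of 0/1 labels (e.g., faithful = 1).
    Returns (p_o - p_e) / (1 - p_e), the chance-corrected agreement.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    # Expected agreement if both raters labeled independently at their base rates.
    p_e = (sum(a) / n) * (sum(b) / n) + ((n - sum(a)) / n) * ((n - sum(b)) / n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

Kappa of 0 means agreement no better than chance, which is why the reported 0.06 for sycophancy hints signals that the two classifiers are measuring nearly unrelated constructs there.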
[NLP-3] Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models
[Quick Read]: This paper examines how instruction-tuned language models balance user-alignment pressure against faithfulness to in-context evidence in contested settings. The core challenge is that even detailed, consistent evidence may not prevent models from deferring to user preferences, producing user-aligned reversals. The key to the approach is a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment, with systematic ablations over evidence composition and uncertainty cues to analyze model behavior. The findings show that richer evidence alone is not sufficient to withstand user pressure; explicit training for epistemic integrity is needed to suppress these shifts under conflict.
Link: https://arxiv.org/abs/2603.20162
Authors: Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer
Affiliations: Pennsylvania State University; Old Dominion University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
[NLP-4] Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models EACL2026
[Quick Read]: This paper addresses the unreliability of large language model (LLM) outputs, in particular the overconfidence that undermines trust. Existing uncertainty quantification methods typically require repeated sampling or auxiliary models, incurring substantial computational overhead. The key to the proposed Semantic Token Clustering (STC) method is to exploit the semantic information already encoded in the LLM: tokens are grouped into semantically consistent clusters via embedding clustering and prefix matching, and uncertainty is quantified from the probability mass aggregated over the corresponding semantic cluster. STC needs only a single generation and no auxiliary model, achieving performance comparable to state-of-the-art baselines at a fraction of the computational cost.
Link: https://arxiv.org/abs/2603.20161
Authors: Qi Cao, Andrew Gambardella, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
Affiliations: The University of Tokyo, Japan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EACL 2026
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.
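The core STC idea, aggregating probability mass over a semantic cluster from a single forward pass, can be sketched as follows. The cosine-threshold clustering here is a simplification (the paper also uses prefix matching), and all names and the threshold are assumptions:

```python
import numpy as np

def cluster_confidence(token_probs, token_embs, chosen, threshold=0.8):
    """Aggregate probability mass over tokens semantically close to the
    chosen token (illustrative sketch of cluster-based uncertainty).

    token_probs: (V,) next-token distribution from a single forward pass.
    token_embs:  (V, d) token embeddings, assumed L2-normalized.
    chosen:      index of the generated token.
    Returns the aggregated probability of the chosen token's semantic cluster;
    uncertainty can then be taken as 1 minus this value.
    """
    sims = token_embs @ token_embs[chosen]  # cosine similarity to chosen token
    cluster = sims >= threshold             # semantically consistent cluster
    return float(token_probs[cluster].sum())
```

The intuition: if probability mass is split across synonyms ("yes", "Yes", "yeah"), raw token probability understates the model's confidence, while the cluster sum recovers it without any extra sampling.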
[NLP-5] Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification
【Quick Read】: This paper addresses the information loss that arises when distributional semantic representations built from global word co-occurrence matrices (as in the HAL model) are aggregated into sentence-level embeddings via standard mean pooling. Mean pooling assigns equal weight to all tokens, so contextually discriminative words are diluted by uninformative structural tokens such as stop-words, hurting both classification performance and interpretability. The key to the solution is a learnable, temperature-scaled additive attention mechanism applied after the raw co-occurrence matrix is projected into a dense latent space via Truncated SVD; the attention dynamically focuses on sentiment-bearing words while suppressing uninformative tokens. On IMDB sentiment analysis this raises test accuracy to 82.38%, an absolute improvement of 6.74 percentage points over the mean pooling baseline, while also improving interpretability.
Link: https://arxiv.org/abs/2603.20149
Authors: Ali Sakour, Zoalfekar Sakour
Affiliations: Lattakia University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure, 1 table
Abstract:The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While these representations capture lexical relationships effectively, aggregating them into sentence-level embeddings via standard mean pooling often results in information loss. Mean pooling assigns equal weight to all tokens, thereby diluting the impact of contextually salient words with uninformative structural tokens. In this paper, we address this limitation by integrating a learnable, temperature-scaled additive attention mechanism into the HAL representation pipeline. To mitigate the sparsity and high dimensionality of the raw co-occurrence matrices, we apply Truncated Singular Value Decomposition (SVD) to project the vectors into a dense latent space prior to the attention layer. We evaluate the proposed architecture on the IMDB sentiment analysis dataset. Empirical results demonstrate that the attention-based pooling approach achieves a test accuracy of 82.38%, yielding an absolute improvement of 6.74 percentage points over the traditional mean pooling baseline (75.64%). Furthermore, qualitative analysis of the attention weights indicates that the mechanism successfully suppresses stop-words and selectively attends to sentiment-bearing tokens, improving both classification performance and model interpretability.
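The contrast between mean pooling and temperature-scaled attention pooling can be sketched with plain lists; the fixed `scorer` projection stands in for the learned attention parameters, and the 2-d "HAL-after-SVD" vectors are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mean_pool(vectors):
    """Baseline: every token gets equal weight 1/n."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(d)]

def attention_pool(vectors, scorer, temperature=0.5):
    """Temperature-scaled additive attention: each token vector gets a
    scalar score; a lower temperature sharpens the softmax weights."""
    scores = [scorer(v) / temperature for v in vectors]
    weights = softmax(scores)
    d = len(vectors[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
    return pooled, weights

# Toy dense vectors: one sentiment-bearing token ("terrible") between
# two near-neutral stop-words ("the", "a").
vectors = [[0.1, 0.1], [2.0, -1.5], [0.0, 0.2]]
scorer = lambda v: v[0] - v[1]  # hypothetical learned scoring function
pooled, weights = attention_pool(vectors, scorer)
```

The attention weights concentrate almost entirely on the sentiment-bearing vector, whereas the mean-pooled embedding is dragged toward the stop-words.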
[NLP-6] Reasoning Gets Harder for LLMs Inside A Dialogue
【Quick Read】: The paper asks whether the strong reasoning performance LLMs show on isolated tasks carries over to realistic task-oriented dialogue (TOD), where models must reason inherently while generating text and following instructions on role, format, and style. To answer this, the authors introduce BOULDER, a new dynamic benchmark of eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning, each available in both an isolated and a dialogue-based variant, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial, consistent performance drop in the dialogue setting; ablations and qualitative analysis show the gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements, underscoring the need to evaluate LLM reasoning in realistic interactive scenarios.
Link: https://arxiv.org/abs/2603.20133
Authors: Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Subjects: Computation and Language (cs.CL)
Comments: Preprint
Abstract:Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models’ reasoning robustness in TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
[NLP-7] Current LLMs still cannot talk much about grammar modules: Evidence from syntax
【Quick Read】: This paper examines the limits of large language models (LLMs) in translating core generative-syntax terminology, asking whether they can accurately convey the semantic and structural properties of grammar modules. Comparing human translations with ChatGPT-5's Arabic translations of 44 generative-syntax terms, the study finds only 25% of the model's translations accurate, 38.6% inaccurate, and 36.4% partially correct, showing that terms involving syntactic and semantic complexity remain a significant challenge. The key proposed remedy is close collaboration between AI specialists and linguists to improve LLMs' working mechanisms toward accurate, or at least appropriately adequate, translation of specialized terminology.
Link: https://arxiv.org/abs/2603.20114
Authors: Mohammed Q. Shormani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages
Abstract:We aim to examine the extent to which Large Language Models (LLMs) can ‘talk much’ about grammar modules, providing evidence from syntax core properties translated by ChatGPT into Arabic. We collected 44 terms from previous generative syntax works, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations. We used an analytical and comparative approach in our analysis. Findings unveil that LLMs still cannot ‘talk much’ about the core syntax properties embedded in the terms under study involving several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4% were partially correct, which we consider appropriate. Based on these findings, a set of actionable strategies was proposed, the most notable of which is a close collaboration between AI specialists and linguists to better LLMs’ working mechanism for accurate or at least appropriate translation.
[NLP-8] An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
【Quick Read】: This paper studies how much further small models (a GPT-2-scale decoder) can be improved after supervised fine-tuning (SFT) via Direct Preference Optimization (DPO) and Low-Rank Adaptation (LoRA), under modest data and compute. The key finding is that at this scale the parameterization strategy (full fine-tuning, FFT) dominates performance: DPO yields only small, task-dependent gains and matches competitive SFT accuracy only when the preference construction closely parallels the supervised objective; LoRA neither reduces wall-clock training time on the authors' hardware nor matches FFT at the same training depth, indicating limited marginal returns from low-rank adaptation in this regime.
Link: https://arxiv.org/abs/2603.20100
Authors: Yuming Feng, Christy Yang
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
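The staged SFT-to-DPO training the study compares rests on the standard per-pair DPO loss, which can be written down directly from summed log-probabilities; the numeric values below are hypothetical, and the sketch is not tied to the authors' training code.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are summed log-probabilities of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_* are the same quantities under
    the frozen SFT reference model. beta scales the implicit reward.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed directly
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy equals the reference, the margin is 0 and the loss is log 2.
base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's likelihood relative to the reference
# lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

This makes the paper's "preference construction parallels the supervised objective" condition concrete: DPO only moves the policy where the chosen/rejected log-ratio margin can grow.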
[NLP-9] Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues
【Quick Read】: This paper addresses moment-by-moment prediction of a listener's state of understanding in explanatory dialogue from verbal and nonverbal linguistic features of both speaker and listener. The key lies in identifying and combining three cognitive-load-related linguistic cues: the information value (operationalised as surprisal) of the speaker's utterances, their syntactic complexity, and variation in the listener's interactive gaze behaviour. Combining these cues with textual features in multimodal classifiers improves classification of four states of understanding ('Understanding', 'Partial Understanding', 'Non-Understanding', and 'Misunderstanding').
Link: https://arxiv.org/abs/2603.20079
Authors: Yu Wang, Olcay Türk, Angela Grimminger, Hendrik Buschmeier
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener’s state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker’s utterances, and the variation in the listener’s interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener’s level of understanding. Listener states (‘Understanding’, ‘Partial Understanding’, ‘Non-Understanding’ and ‘Misunderstanding’) were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
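The surprisal cue used here has a direct definition that is easy to compute once a language model supplies per-token probabilities; the probabilities below are invented for illustration.

```python
import math

def surprisal_bits(prob):
    """Information value of a token: -log2 p(token | context).
    Rare (low-probability) tokens carry higher surprisal and are
    hypothesised to impose higher cognitive load on the listener."""
    return -math.log2(prob)

def mean_surprisal(token_probs):
    """Average per-token surprisal of an utterance, in bits."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)

# Toy per-token probabilities for two utterances.
plain = [0.5, 0.25, 0.5]       # predictable wording
complex_ = [0.05, 0.02, 0.1]   # less predictable wording
```

Under this measure the second utterance is markedly more "surprising" per token, which is the kind of signal the paper correlates with listener understanding.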
[NLP-10] LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families
【Quick Read】: This paper addresses the lack of systematic evaluation of speech language models (SpeechLMs) on automatic speech recognition (ASR) for low-resource languages. Existing benchmarks focus on high-resource languages, leaving SpeechLM generalization across diverse language families and non-Latin scripts poorly understood and hindering deployment in real-world multilingual scenarios. The key to the solution is LoASR-Bench, a comprehensive ASR benchmark covering 25 languages from 9 language families, with both Latin and non-Latin scripts, enabling cross-linguistic and cross-script evaluation of SpeechLMs and exposing the limitations of the latest models on real-world low-resource languages.
Link: https://arxiv.org/abs/2603.20042
Authors: Jianan Chen, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen
Affiliations: Kyoto University; Institute for Infocomm Research, Agency for Science, Technology and Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech recognition (ASR) under high-resource conditions. However, existing benchmarks predominantly focus on high-resource languages, leaving the ASR behavior of SpeechLMs in low-resource languages insufficiently understood. This gap is critical, as practical ASR systems must reliably support low-resource languages and generalize across diverse language families, and it directly hinders the deployment of SpeechLM-based ASR in real-world multilingual scenarios. As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. LoASR-Bench comprises 25 languages from 9 language families, featuring both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of ASR performance of current SpeechLMs. Experimental results highlight the limitations of the latest SpeechLMs in handling real-world low-resource languages.
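ASR benchmarks of this kind are conventionally scored with word error rate (WER), the word-level edit distance normalized by reference length; a minimal stdlib sketch follows (the example sentence pair is invented, not drawn from LoASR-Bench).

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words divided by reference length:
    (substitutions + insertions + deletions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of three reference words.
wer = word_error_rate("selamat pagi semua", "selamat padi semua")
```

For non-Latin scripts without whitespace segmentation, the same recurrence over characters gives character error rate (CER) instead.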
[NLP-11] ReViSQL: Achieving Human-Level Text-to-SQL
【Quick Read】: This paper targets the gap between model and human performance on natural-language-to-SQL (Text-to-SQL) translation, particularly on the BIRD benchmark. Although existing approaches enhance SQL reasoning with elaborate AI agents and multi-step pipelines, they still fall short of human-level accuracy; the authors argue the bottleneck is not insufficient architectural complexity but low-quality training data. The key is BIRD-Verified, a curated dataset of 2.5k Text-to-SQL instances corrected and verified by SQL experts (errors were identified and corrected in 61.1% of a subset of BIRD Train), combined with reinforcement learning with verifiable rewards (RLVR), which alone boosts single-generation accuracy by 8.2-13.9%. Inference-time scaling via execution-based reconciliation and majority voting further improves results: ReViSQL-235B-A22B reaches 93.2% execution accuracy on BIRD Mini-Dev, exceeding the proxy human level (92.96%) for the first time, while the lightweight ReViSQL-30B-A3B matches the prior open-source state of the art at 7.5× lower per-query cost.
Link: https://arxiv.org/abs/2603.20004
Authors: Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments:
Abstract:Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved the human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data to improve SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts the single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework with two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.
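The execution-based majority voting step can be sketched with the stdlib `sqlite3` module: sampled SQL candidates are grouped by the result set they produce, and a candidate from the largest group wins. The schema, data, and candidate queries are invented; the paper's reconciliation procedure may differ in detail.

```python
import sqlite3

def vote_by_execution(db, candidates):
    """Execute each candidate SQL, group candidates by their (unordered)
    result set, and return one candidate from the largest group."""
    results = {}
    for sql in candidates:
        try:
            rows = frozenset(db.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # invalid SQL never wins the vote
        results.setdefault(rows, []).append(sql)
    winner_rows, winner_sqls = max(results.items(), key=lambda kv: len(kv[1]))
    return winner_sqls[0], winner_rows

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE city (name TEXT, pop INTEGER)")
db.executemany("INSERT INTO city VALUES (?, ?)",
               [("Paris", 2), ("Lyon", 1), ("Nice", 1)])

# Three sampled generations: two agree on the answer, one is over-broad.
candidates = [
    "SELECT name FROM city WHERE pop = (SELECT MAX(pop) FROM city)",
    "SELECT name FROM city ORDER BY pop DESC LIMIT 1",
    "SELECT name FROM city",
]
winner, rows = vote_by_execution(db, candidates)
```

Because agreement is measured on execution results rather than SQL strings, syntactically different but semantically equivalent candidates vote together.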
[NLP-12] An Agentic Approach to Generating XAI-Narratives
【Quick Read】: This paper tackles the accessibility problem of explainable AI (XAI): existing methods are technical and expert-oriented, so post-hoc explanations need to be translated into accessible natural-language narratives. The key is a multi-agent framework for generating and refining XAI narratives, in which a Narrator agent iteratively generates and revises narratives based on faithfulness and coherence feedback from multiple Critic Agents. Five agentic system designs are evaluated (Basic, Critic, Critic-Rule, Coherent, and Coherent-Rule); iterative, feedback-driven refinement markedly improves faithfulness, e.g. Claude-4.5-Sonnet under the Basic Design reduces the number of unfaithful narratives by 90% after three iterations. A majority-voting ensemble strategy further stabilizes performance for most models, confirming the framework's effectiveness for producing faithful, coherent, and more interpretable XAI narratives.
Link: https://arxiv.org/abs/2603.20003
Authors: Yifan He, David Martens
Affiliations: University of Antwerp
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert-oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)-generated XAI narratives have been proposed as a promising approach for translating post-hoc explanations into more accessible, natural-language explanations. In this work, we propose a multi-agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic-Rule Design, Coherent Design, and Coherent-Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic-Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude-4.5-Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek-V3.2-Exp. These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.
[NLP-13] When Contextual Inference Fails: Cancelability in Interactive Instruction Following
【速读】: 该论文旨在解决生成式 AI 在协作任务中如何区分字面意义(literal interpretation)与语境推理(contextual inference)的问题,特别是在指令存在歧义时,模型能否基于语境有效调整沟通策略以实现高效协作。其解决方案的关键在于引入了一个名为“Build What I Mean”(BWIM)的交互式基准测试,该基准通过模拟双说话者心理语言学范式,要求模型在面对不明确指令时,选择进行语境推理或以较低沟通成本请求澄清。实验发现,尽管大语言模型(LLMs)能够识别说话者的非合作倾向(即字面可靠性不足),但在实际行动中未能据此优化澄清行为,反而表现出如盲目过度澄清或在不确定时拒绝提问而直接猜测等次优策略,揭示了模型在语用理解与行动决策之间存在显著脱节。
链接: https://arxiv.org/abs/2603.19997
作者: Natalia Bila,Kata Naszádi,Alexandra Mayn,Christof Monz
机构: University of Amsterdam (阿姆斯特丹大学); Saarland University (萨尔兰大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm – which contrasts a pragmatically cooperative speaker with one who is only literally reliable – we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.
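The clarify-or-guess trade-off the benchmark probes can be framed as a simple expected-cost decision; the posteriors, costs, and decision rule below are an illustrative idealization, not the paper's evaluation protocol.

```python
def decide(posterior, clarify_cost, error_cost=1.0):
    """Ask a clarification question only when the expected cost of
    guessing the most likely interpretation exceeds the cost of asking."""
    p_best = max(posterior.values())
    expected_guess_cost = (1.0 - p_best) * error_cost
    return "clarify" if expected_guess_cost > clarify_cost else "guess"

# With a pragmatically cooperative speaker, context makes one reading
# dominant; with a literal-only speaker, the posterior stays flat.
cooperative = {"red block": 0.9, "blue block": 0.1}
literal_only = {"red block": 0.5, "blue block": 0.5}
```

A calibrated agent should guess with the cooperative speaker and clarify with the literal-only one; the paper's finding is that LLMs judge the speaker correctly but do not act on it this way.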
[NLP-14] Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
【Quick Read】: This paper addresses the "capability ceiling" of current RL-based post-training for large language models (LLMs): RL typically only refines patterns already latent in pre-trained weights rather than eliciting novel strategies or reasoning capabilities. The key is identifying and breaking a fundamental structural bottleneck: existing LLM post-training relies on an ever-expanding action history as the state representation (history-as-state) instead of compact, informative Markov states. The authors introduce explicit Markov-state modelling, prove that leveraging estimated Markov states significantly reduces sample complexity, and show empirically on a suite of complex logic puzzles that it consistently breaks the performance ceiling of standard RL post-training, pointing to structured Markovian representations as essential for open-ended discovery and genuinely new reasoning in generative AI.
Link: https://arxiv.org/abs/2603.19987
Authors: Yurun Yuan, Tengyang Xie
Affiliations: UW-Madison
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent “capability ceiling”: unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond “history-as-state” modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
[NLP-15] On the Ability of Transformers to Verify Plans
【Quick Read】: This paper addresses the inconsistent performance of transformers on AI planning tasks and the lack of theory on when generalization can be expected, focusing on whether decoder-only models can learn generalizable plan verification when the number of objects (and thus the effective input alphabet) grows at test time. The key is C*-RASP, an extension of C-RASP that establishes length-generalization guarantees for transformers under simultaneous growth in sequence length and vocabulary size. The framework identifies a large class of classical planning domains in which transformers can provably learn to verify long plans, along with structural properties of learned solutions that significantly affect length generalization; empirical experiments corroborate the theory.
Link: https://arxiv.org/abs/2603.19954
Authors: Yash Sarrof, Yupei Du, Katharina Stein, Alexander Koller, Sylvie Thiébaux, Michael Hahn
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Transformers have shown inconsistent success in AI planning tasks, and theoretical understanding of when generalization should be expected has been limited. We take important steps towards addressing this gap by analyzing the ability of decoder-only models to verify whether a given plan correctly solves a given planning instance. To analyse the general setting where the number of objects – and thus the effective input alphabet – grows at test time, we introduce C*-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under the simultaneous growth in sequence length and vocabulary size. Our results identify a large class of classical planning domains for which transformers can provably learn to verify long plans, and structural properties that significantly affect the learnability of length generalizable solutions. Empirical experiments corroborate our theory.
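The plan-verification task itself has a crisp symbolic definition: simulate the plan step by step, checking each action's preconditions and applying its effects. The STRIPS-style encoding below is a hypothetical toy domain, not the paper's benchmark.

```python
def verify_plan(init, goal, actions, plan):
    """Simulate a STRIPS-style plan: each step's preconditions must hold
    in the current state; its effects then update the state. The plan is
    valid iff every step applies and the goal holds at the end."""
    state = set(init)
    for name in plan:
        pre, add, delete = actions[name]
        if not pre <= state:          # a precondition is violated
            return False
        state = (state - delete) | add
    return goal <= state

# Tiny blocks-world-like domain: (preconditions, add effects, delete effects).
actions = {
    "pick(a)": ({"clear(a)", "handempty"},
                {"holding(a)"},
                {"clear(a)", "handempty"}),
    "stack(a,b)": ({"holding(a)", "clear(b)"},
                   {"on(a,b)", "handempty"},
                   {"holding(a)", "clear(b)"}),
}
init = {"clear(a)", "clear(b)", "handempty"}
goal = {"on(a,b)"}
```

A transformer that verifies plans must, in effect, track this evolving state across the whole sequence, which is exactly where the paper's length-generalization analysis bites.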
[NLP-16] Hybrid topic modelling for computational close reading: Mapping narrative themes in Pushkin's Evgenij Onegin
【Quick Read】: This paper addresses the instability of topic modelling on small corpora and the challenge of capturing thematic structure and longitudinal dynamics in narrative poetry. The key is a hybrid topic modelling framework combining unsupervised Latent Dirichlet Allocation (LDA), which yields five stable and interpretable topics, with supervised sparse Partial Least Squares Discriminant Analysis (sPLS-DA), which refines each theme's semantic boundaries by identifying lexical markers. A multi-seed consensus protocol improves robustness on the small corpus, and "narrative hubs" extend the bag-of-words model to the narrative level, revealing how thematic mixtures align with the poem's emotional and structural arc. Without relying on metre, phonology, or native morphology, the approach delivers reproducible thematic maps of a complex poetic narrative, offering a transparent, lightweight form of computational close reading.
Link: https://arxiv.org/abs/2603.19940
Authors: Angelo Maria Sabatini
Affiliations: The BioRobotics Institute, Scuola Superiore Sant'Anna
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 4 figures, 2 supplementary materials; submitted to Digital Scholarship in the Humanities (under review)
Abstract:This study presents a hybrid topic modelling framework for computational literary analysis that integrates Latent Dirichlet Allocation (LDA) with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to model thematic structure and longitudinal dynamics in narrative poetry. As a case study, we analyse Evgenij Onegin, Aleksandr S. Pushkin's novel in verse, using an Italian translation, testing whether unsupervised and supervised lexical structures converge in a small-corpus setting. The poetic text is segmented into thirty-five documents of lemmatised content words, from which five stable and interpretable topics emerge. To address small-corpus instability, a multi-seed consensus protocol is adopted. Using sPLS-DA as a supervised probe enhances interpretability by identifying lexical markers that refine each theme. Narrative hubs, groups of contiguous stanzas marking key episodes, extend the bag-of-words approach to the narrative level, revealing how thematic mixtures align with the poem's emotional and structural arc. Rather than replacing traditional literary interpretation, the proposed framework offers a computational form of close reading, illustrating how lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features such as metre, phonology, or native morphology are abstracted away. Despite relying on a single lemmatised translation, the approach provides a transparent methodological template applicable to other high-density literary texts in comparative studies.
[NLP-17] SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia WWW2026
【Quick Read】: This paper addresses the difficulty of achieving high-quality translation for low-resource languages (LRLs) on the Web, where data scarcity and the energy cost of model training impede digital inclusion in the Global South. The key is Sustainable Agent-Guided Expert-tuning (SAGE): instead of training on massive noisy data, a reinforcement learning (RL) agent, optimized with Group Relative Policy Optimization (GRPO), autonomously curates a compact set of high-quality, culturally relevant training data, using a semantic reward derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment; open-source LLMs are then efficiently fine-tuned on this curated set with Low-Rank Adaptation (LoRA). SAGE matches or surpasses models trained on full datasets while cutting data usage by 97.1% and training energy consumption by 95.2%.
Link: https://arxiv.org/abs/2603.19931
Authors: Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, Zhengyong Jiang
Affiliations: Xi'an Jiaotong-Liverpool University; University of Liverpool; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: Accepted by WWW 2026
Abstract:The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the “right data” over “big data”. Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
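The group-relative normalization at the heart of GRPO is compact enough to write out: each sampled completion's reward is standardized against its own group, so no learned value function is needed. The reward values are hypothetical; this is not the paper's training code.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group Relative Policy Optimization advantage: for one group of
    sampled completions, A_i = (r_i - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Semantic-reward scores for one group of sampled data-selection actions:
# above-average samples get positive advantage, below-average negative.
advs = grpo_advantages([0.9, 0.7, 0.1, 0.3])
```

Because advantages are zero-mean within each group, the policy gradient only rewards being better than the group's other samples, which suits a curation agent scored by a relative semantic reward.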
[NLP-18] Translation from the Information Bottleneck Perspective: an Efficiency Analysis of Spatial Prepositions in Bitexts
【Quick Read】: The question addressed is how natural language systems balance informativity against simplicity when encoding meaning, i.e. whether a pressure for communicative efficiency drives language use toward an optimal accuracy-complexity frontier. This had been supported in visual domains (colour, motion) but lacked evidence for linguistic stimuli such as words in sentential context. The key is framing translation as an Information Bottleneck (IB) optimisation problem, with source sentences as stimuli and target sentences as compressed meanings, so that IB analyses can be run directly on bitexts rather than controlled naming experiments. Using spatial prepositions in English, German, and Serbian translations of a French novel, with similarity judgements from a pile-sorting pilot study and a low-rank projection model (D=5) that predicts them (Spearman correlation 0.78), the authors find that attested translations lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative-efficiency pressure in the spatial domain and that translation can serve as a window into the cognitive pressures shaping cross-linguistic semantic systems.
Link: https://arxiv.org/abs/2603.19924
Authors: Antoine Taroni, Ludovic Moncla, Frederique Laforest
Affiliations: INSA Lyon; CNRS; Lyon 1 Université; LIRIS, UMR5205
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Efficient communication requires balancing informativity and simplicity when encoding meanings. The Information Bottleneck (IB) framework captures this trade-off formally, predicting that natural language systems cluster near an optimal accuracy-complexity frontier. While supported in visual domains such as colour and motion, linguistic stimuli such as words in sentential context remain unexplored. We address this gap by framing translation as an IB optimisation problem, treating source sentences as stimuli and target sentences as compressed meanings. This allows IB analyses to be performed directly on bitexts rather than controlled naming experiments. We applied this to spatial prepositions across English, German and Serbian translations of a French novel. To estimate informativity, we conducted a pile-sorting pilot-study (N=35) and obtained similarity judgements of pairs of prepositions. We trained a low-rank projection model (D=5) that predicts these judgements (Spearman correlation: 0.78). Attested translations of prepositions lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative efficiency pressure in the spatial domain. More broadly, this work suggests that translation can serve as a window into the cognitive efficiency pressures shaping cross-linguistic semantic systems.
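The complexity side of the IB trade-off is the mutual information I(M;W) between meanings and words; a stdlib sketch for a discrete encoder follows. The two-preposition toy system is invented purely to show the calculation.

```python
import math

def mutual_information(p_m, q_w_given_m):
    """Complexity term of the IB objective: I(M;W) in bits, for a
    meaning distribution p(m) and a stochastic encoder q(w|m)."""
    words = {w for row in q_w_given_m.values() for w in row}
    q_w = {w: sum(p_m[m] * q_w_given_m[m].get(w, 0.0) for m in p_m)
           for w in words}
    mi = 0.0
    for m in p_m:
        for w, q in q_w_given_m[m].items():
            if q > 0:
                mi += p_m[m] * q * math.log2(q / q_w[w])
    return mi

p_m = {"in": 0.5, "inside": 0.5}  # two spatial meanings, equally frequent
# A deterministic two-word system spends a full bit of complexity;
# collapsing both meanings onto one preposition costs zero bits
# (but loses informativity, the other side of the trade-off).
two_words = {"in": {"dans": 1.0}, "inside": {"dedans": 1.0}}
one_word = {"in": {"dans": 1.0}, "inside": {"dans": 1.0}}
```

Plotting each attested or counterfactual translation system by its (complexity, accuracy) pair against the IB frontier is what the paper's frontier analysis amounts to.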
[NLP-19] Span-Level Machine Translation Meta-Evaluation
【Quick Read】: The problem addressed is how to reliably measure the evaluation capabilities of automatic machine translation (MT) error-detection tools that locate errors and assign categories and severities, for which no established meta-evaluation technique exists in the literature. The key is a new "match with partial overlap and partial credit" (MPP) strategy combined with micro-averaging: the paper shows that seemingly similar implementations of span-level precision, recall, and F-score can yield substantially different rankings and that certain widely used techniques are unsuitable for evaluating MT error detection, whereas MPP provides a robust, consistent meta-evaluation; the authors release code and use MPP to assess the state of the art in MT error detection.
Link: https://arxiv.org/abs/2603.19921
Authors: Stefano Perrella, Eric Morales Agostinho, Hugo Zaragoza
Affiliations: Sapienza University of Rome; Amazon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures
Abstract:Machine Translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that do error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely-used techniques are unsuitable for evaluating MT error detection. We propose “match with partial overlap and partial credit” (MPP) with micro-averaging as a robust meta-evaluation strategy and release code for its use publicly. Finally, we use MPP to assess the state of the art in MT error detection.
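One plausible reading of partial-overlap, partial-credit micro-averaging can be sketched as character-level overlap between predicted and gold error spans; the paper's exact MPP definition may differ, so treat this strictly as an illustrative scheme.

```python
def span_overlap(a, b):
    """Length of the intersection of two half-open spans (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def mpp_micro_f(pred_spans, gold_spans):
    """Illustrative partial-credit scheme: characters covered by both a
    predicted and a gold error span earn credit; precision and recall
    are micro-averaged over total span lengths."""
    matched = sum(span_overlap(p, g) for p in pred_spans for g in gold_spans)
    pred_len = sum(e - s for s, e in pred_spans)
    gold_len = sum(e - s for s, e in gold_spans)
    precision = matched / pred_len if pred_len else 0.0
    recall = matched / gold_len if gold_len else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Gold error span [10, 20); the detector flags [15, 25): half right.
p, r, f = mpp_micro_f([(15, 25)], [(10, 20)])
```

Under exact-match scoring this prediction would score zero, which is precisely the brittleness that motivates partial credit.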
[NLP-20] Semantic Delta: An Interpretable Signal Differentiating Human and LLM s Dialogue
【Quick Read】: This paper addresses how to distinguish human-written from LLM-generated dialogue, which matters for education, academic integrity, and content moderation. The key is a lightweight, interpretable statistical feature, the semantic delta: texts are mapped to thematic intensity scores via the Empath lexical analysis framework, and the delta is the difference between the two most dominant category intensities. The hypothesis is that LLM outputs are more thematically concentrated while human dialogue shows a broader, more balanced semantic spread; empirically, AI-generated texts consistently exhibit significantly higher semantic deltas than human texts. The metric works zero-shot with negligible computational cost, can serve as a complementary signal in ensemble detection systems, and sheds light on the limits of current LLMs' mimicry of human conversational dynamics.
Link: https://arxiv.org/abs/2603.19849
Authors: Riccardo Scantamburlo, Mauro Mezzanzana, Giacomo Buonanno, Francesco Bertolotti
Affiliations: LIUC - Università Cattaneo; OpenAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Do LLMs talk like us? This question intrigues a multitude of scholars and is relevant in many fields, from education to academia. This work presents an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue. We introduce a lightweight metric derived from semantic categories distribution. Using the Empath lexical analysis framework, each text is mapped to a set of thematic intensity scores. We define semantic delta as the difference between the two most dominant category intensities within a dialogue, hypothesizing that LLM outputs exhibit stronger thematic concentration than human discourse. To evaluate this hypothesis, conversational data were generated from multiple LLM configurations and compared against heterogeneous human corpora, including scripted dialogue, literary works, and online discussions. A Welch t-test was applied to the resulting distributions of semantic delta values. Results show that AI-generated texts consistently produce higher deltas than human texts, indicating a more rigid topics structure, whereas human dialogue displays a broader and more balanced semantic spread. Rather than replacing existing detection techniques, the proposed zero-shot metric provides a computationally inexpensive complementary signal that can be integrated into ensemble detection systems. These findings also contribute to the broader empirical understanding of LLM behavioural mimicry and suggest that thematic distribution constitutes a quantifiable dimension along which current models fall short of human conversational dynamics.
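Both the semantic delta and the Welch t statistic are a few lines of stdlib code once Empath-style category intensities are in hand; the category scores and per-dialogue deltas below are hypothetical stand-ins for real Empath output.

```python
from statistics import mean, variance
import math

def semantic_delta(category_scores):
    """Difference between the two most dominant category intensities:
    a high delta indicates rigid thematic concentration."""
    top = sorted(category_scores.values(), reverse=True)
    return top[0] - top[1]

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with unequal variances:
    (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)."""
    na, nb = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(
        variance(sample_a) / na + variance(sample_b) / nb)

# Hypothetical Empath-style intensities for one dialogue.
delta = semantic_delta({"technology": 0.6, "work": 0.3, "family": 0.1})

# Hypothetical per-dialogue deltas for AI-generated vs human text.
ai_deltas = [0.42, 0.38, 0.45, 0.40]
human_deltas = [0.15, 0.22, 0.18, 0.12]
t = welch_t(ai_deltas, human_deltas)
```

A large positive t, as in this toy comparison, is the direction of effect the paper reports: AI-generated dialogue concentrates its thematic mass more sharply than human dialogue.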
[NLP-21] FrameNet Semantic Role Classification by Analogy LREC2026
【速读】: 该论文旨在解决语义角色分类(Semantic Role Classification, SRC)中的标注依赖问题与模型复杂度高的难题,尤其在不引入显式语义角色信息的情况下实现高效且准确的分类。其解决方案的关键在于将语义角色分类建模为一种基于FrameNet框架中词元(Lexical Units, LUs)与语义元素(Frame Elements, FEs)对之间的类比关系(analogy)的二元分类任务:通过构建一个包含有效和无效类比实例的数据集,训练轻量级人工神经网络(Artificial Neural Network, ANN),该网络在训练阶段不使用任何语义角色标签;推理时则利用随机采样和类比迁移机制,在给定框架内计算所有候选语义角色的概率分布,从而恢复语义角色标签。这一方法在无需显式角色监督的前提下实现了超越先前最先进结果的性能,同时保持了计算效率和参数简洁性。
链接: https://arxiv.org/abs/2603.19825
作者: Van-Duy Ngo,Stergos Afantenos,Emiliano Lorini,Miguel Couceiro
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper to be presented at LREC 2026
Abstract:In this paper, we adopt a relational view of analogies applied to Semantic Role Classification in FrameNet. We define analogies as formal relations over the Cartesian product of frame-evoking lexical unit (LU) and frame element (FE) pairs, which we use to construct a new dataset. Each element of this binary relation is labelled as a valid analogical instance if the frame elements share the same semantic role, or as invalid otherwise. This formulation allows us to transform Semantic Role Classification into binary classification and train a lightweight Artificial Neural Network (ANN) that exhibits rapid convergence with minimal parameters. Unconventionally, no Semantic Role information is introduced to the neural network during training. We recover semantic roles during inference by computing probability distributions over candidates of all semantic roles within a given frame through random sampling and analogical transfer. This approach allows us to surpass previous state-of-the-art results while maintaining computational efficiency and frugality.
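论文把角色分类转化为类比有效性的二元判断,再经随机采样投票恢复角色标签。下面是该推理流程的极简示意(analogy_model 在原文中是训练好的轻量 ANN,这里用假设的回调函数代替,数据结构亦为假设):

```python
import random
from collections import Counter

def classify_role(target_pair, labeled_pairs, analogy_model, n_samples=20, seed=0):
    """类比迁移推理的简化示意(非论文原实现):随机采样已标注的 (LU, FE) 对,
    若 analogy_model 判定其与目标对构成有效类比,则给该对的语义角色投一票,
    最后返回得票最多的角色。"""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        pair, role = rng.choice(labeled_pairs)
        if analogy_model(target_pair, pair):
            votes[role] += 1
    return votes.most_common(1)[0][0] if votes else None
```

例如传入一个只比较 FE 名称是否相同的玩具 analogy_model,同框架内与目标对类比有效的角色就会胜出。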
[NLP-22] Borderless Long Speech Synthesis
【速读】: 该论文旨在解决现有文本到语音(Text-to-Speech, TTS)系统在处理长音频合成时缺乏全局上下文理解与副语言线索建模能力的问题,例如多说话人交互(打断、重叠语音)、情感弧线演变及声学环境变化等现实场景难以准确再现。其解决方案的关键在于提出一种面向代理的无边界长音频合成框架(Borderless Long Speech Synthesis),通过“标注优于过滤/清洗”的数据策略和名为 Global-Sentence-Token 的多层次结构化标注体系,构建从场景语义到音素细节的分层控制协议栈;同时在模型侧采用连续分词器并引入链式思维(Chain-of-Thought, CoT)推理与维度随机丢弃(Dimension Dropout)技术,显著提升复杂指令下的遵循能力,并使系统具备原生代理特性(Native Agentic),从而实现从单一文本到结构化生成命令的跨模态转化,推动 TTS 从传统 Text2Speech 向无边界长语音合成范式演进。
链接: https://arxiv.org/abs/2603.19798
作者: Xingchen Song,Di Wu,Dinghao Zhou,Pengyu Cheng,Hongwu Ding,Yunchao He,Jie Wang,Shengfan Shen,Sixiang Lv,Lichun Fan,Hang Su,Yifeng Wang,Shuai Wang,Meng Meng,Jian Luan
机构: MiLM Plus, Xiaomi Inc., China; Nanjing University, China; WeNet Open Source Community
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a “Labeling over filtering/cleaning” strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
[NLP-23] Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
【速读】: 该论文旨在解决多语言编码器模型在处理语码转换(code-mixed)输入时,其内部表示如何与构成语码转换的源语言(如英语和印地语)建立有意义关联的问题。现有研究虽广泛采用基于编码器的多语言模型进行语码转换分析,但对其内部表征机制的理解仍十分有限。解决方案的关键在于:首先通过构建平行的三语种语料库(英语、印地语原文与罗马化语码转换句),利用核中心化度量(CKA)、词元级显著性及熵不确定性分析等方法揭示标准模型中语码转换输入与源语言之间连接松散的问题;其次提出一种新的三语种后训练对齐目标(trilingual post-training alignment objective),使语码转换表示同时贴近两种源语言,从而提升跨语言对齐平衡性和下游任务(如情感分析与仇恨言论检测)性能。该方案强调将语码转换表示显式锚定于其组成语言,可显著增强跨语言理解能力。
链接: https://arxiv.org/abs/2603.19771
作者: Debajyoti Mazumder,Divyansh Pathak,Prashant Kodali,Jasabanta Patro
机构: Indian Institute of Science Education and Research, Bhopal, India; Microsoft Corporation
类目: Computation and Language (cs.CL)
备注: 24 pages
Abstract:Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
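摘要中用于衡量跨语言表示对齐的 CKA,其线性核版本可写成如下纯 Python 示意(表示矩阵以嵌套列表给出;真实分析中应使用各编码器层的隐藏状态):

```python
def _center(X):
    # 按列去均值(X 为 n×d 的嵌套列表)
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def _gram_T(A, B):
    # 计算 A^T B
    n, da, db = len(A), len(A[0]), len(B[0])
    return [[sum(A[k][i] * B[k][j] for k in range(n)) for j in range(db)]
            for i in range(da)]

def _fro(M):
    # Frobenius 范数
    return sum(x * x for row in M for x in row) ** 0.5

def linear_cka(X, Y):
    """线性 CKA:衡量同一批样本在两组表示下的相似度,取值 [0, 1],
    对各自的正交变换与整体缩放不变。"""
    X, Y = _center(X), _center(Y)
    return _fro(_gram_T(X, Y)) ** 2 / (_fro(_gram_T(X, X)) * _fro(_gram_T(Y, Y)))
```

缩放与正交不变性使 CKA 适合比较不同模型、不同语言版本的表示空间。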
[NLP-24] Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)评估中忽视人类标注者差异(Human Label Variation, HLV)的问题,即在基准测试中未充分考虑标注者间系统性判断差异对模型性能评估的影响。其解决方案的关键在于提出一种新的评估协议,明确区分并量化两类标注情境:高一致性(human label agreement)和高分歧(human label disagreement),并通过非聚合的人类标注数据对两个先进MLLM家族(Gemma 3 和 Qwen 2.5 VL)进行实证分析。结果表明,仅依赖共识标签的基准可能高估模型在主观性强任务中的能力,而引入HLV可实现更真实、稳健的模型评估,尤其适用于内容审核等涉及主观判断的场景。
链接: https://arxiv.org/abs/2603.19744
作者: Tomas Ruiz,Tanalp Agustoslu,Carsten Schwemmer
机构: 未知
类目: Computation and Language (cs.CL)
备注: 6 pages, 3 tables, 1 figure
Abstract:Human Label Variation (HLV), i.e. systematic differences among annotators’ judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
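该评估协议的核心一步是依据非聚合标注把样本划入高一致与高分歧子集,可用标注熵实现(阈值 0.5 为示意性假设,非论文设定):

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """非聚合标注的香农熵:0 表示标注者完全一致,越大表示分歧越强。"""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_by_agreement(items, threshold=0.5):
    """按熵阈值把样本划入高一致/高分歧子集。
    items 为 (样本 id, 标注列表) 序列。"""
    agree, disagree = [], []
    for item_id, labels in items:
        (agree if label_entropy(labels) <= threshold else disagree).append(item_id)
    return agree, disagree
```

在两个子集上分别汇报模型指标,即可复现"一致 vs 分歧"两种条件下的对比。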
[NLP-25] Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation
【速读】: 该论文旨在解决基于Transformer的大语言模型(Large Language Models, LLMs)内部机制理解难题,特别是现有归因方法在保持忠实性(faithfulness)与计算效率之间难以平衡的问题,尤其是密集组件归因(dense component attribution)的高昂计算成本。其解决方案的关键在于提出Dual Path Attribution (DPA)框架,通过一次前向和一次反向传播即可实现对冻结Transformer中信息流的忠实追踪,无需使用反事实样本;DPA通过解析并线性化SwiGLU Transformer的计算结构,将目标未嵌入向量(unembedding vector)沿不同路径传播,从而在每个残差位置获得有效表示,实现了相对于模型组件数量为O(1)的时间复杂度,显著提升了长序列场景下的归因效率与可扩展性。
链接: https://arxiv.org/abs/2603.19742
作者: Lasse Marten Jantsch,Dong-Jae Koh,Seonghyeon Lee,Young-Kyoon Suh
机构: Kyungpook National University (庆北国立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Understanding the internal mechanisms of transformer-based large language models (LLMs) is crucial for their reliable deployment and effective operation. While recent efforts have yielded a plethora of attribution methods attempting to balance faithfulness and computational efficiency, dense component attribution remains prohibitively expensive. In this work, we introduce Dual Path Attribution (DPA), a novel framework that faithfully traces information flow on the frozen transformer in one forward and one backward pass without requiring counterfactual examples. DPA analytically decomposes and linearizes the computational structure of the SwiGLU Transformers into distinct pathways along which it propagates a targeted unembedding vector to receive the effective representation at each residual position. This target-centric propagation achieves O(1) time complexity with respect to the number of model components, scaling to long input sequences and dense component attribution. Extensive experiments on standard interpretability benchmarks demonstrate that DPA achieves state-of-the-art faithfulness and unprecedented efficiency compared to existing baselines.
[NLP-26] FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment
【速读】: 该论文旨在解决在联邦学习(Federated Learning, FL)场景下,如何高效地将大语言模型(Large Language Models, LLMs)与人类偏好对齐的问题。由于FL中数据具有去中心化、隐私敏感性和高度非独立同分布(non-IID)的特点,直接应用传统基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)或直接偏好优化(Direct Preference Optimization, DPO)方法会导致性能显著下降且隐式奖励泛化能力差。解决方案的关键在于提出FedPDPO框架,其核心创新包括:(1) 采用参数高效微调架构,客户端保持冻结的预训练LLM主干,并附加低秩适配器(Low-Rank Adaptation, LoRA),实现通信高效聚合;(2) 设计个性化DPO训练策略,引入客户端特定的显式奖励头以补充隐式奖励,缓解非IID异质性;(3) 引入瓶颈适配器(bottleneck adapter)平衡全局与局部特征表示。理论分析和实验证明该方法在跨域和域内联邦设置下均达到最优性能,平均准确率提升最高达4.80%。
链接: https://arxiv.org/abs/2603.19741
作者: Kewen Zhu,Liping Yi,Zhiming Zhao,Zhuang Qi,Han Yu,Qinghua Hu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: under review
Abstract:Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning with human feedback (RLHF), but its direct application in FL suffers from severe performance degradation under non-IID data and limited generalization of implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture where each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) the globally shared LoRA adapter with the personalized client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide theoretical analysis establishing the probabilistic foundation and soundness. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, achieving up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
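FedPDPO 中"只聚合共享 LoRA、保留个性化头"的通信高效聚合,本质上是对各客户端 LoRA 参数做一次加权平均(参数以一维列表示意,真实实现中为低秩矩阵 A、B;函数名为假设):

```python
def fedavg_lora(client_adapters, weights=None):
    """联邦聚合示意:只平均各客户端共享的 LoRA 适配器参数,
    冻结的骨干与客户端个性化头不参与聚合。
    client_adapters 为 {参数名: 参数向量} 的列表。"""
    n = len(client_adapters)
    weights = weights or [1.0 / n] * n
    return {
        name: [sum(w * c[name][i] for w, c in zip(weights, client_adapters))
               for i in range(len(client_adapters[0][name]))]
        for name in client_adapters[0]
    }
```

服务器每轮只需收发这组低秩参数,通信量远小于整模型同步。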
[NLP-27] MOSS-TTSD: Text to Spoken Dialogue Generation
【速读】: 该论文旨在解决**多说话人语音对话合成(Multi-Party Spoken Dialogue Synthesis, TTSD)**中的关键挑战,包括准确的发言权转换(turn-taking)、跨轮次声学一致性(cross-turn acoustic consistency)以及长时稳定性(long-form stability),这些问题在现有模型中因缺乏有效的对话上下文建模而难以满足。解决方案的关键在于提出MOSS-TTSD,一个具备增强长程上下文建模能力的语音对话合成模型,能够从带显式说话人标签的对话脚本中生成高质量、多语言、多说话人的连贯语音内容,支持长达60分钟的单次合成、最多5名说话人,并实现零样本语音克隆(zero-shot voice cloning)。此外,为更客观评估模型性能,作者还提出了TTSD-eval框架,基于强制对齐(forced alignment)量化说话人归属准确性和相似性,无需依赖说话人分割(speaker diarization)工具,从而显著提升了评估的可靠性与可比性。
链接: https://arxiv.org/abs/2603.19739
作者: Yuqian Zhang,Donghua Yu,Zhengyuan Lin,Botian Jiang,Mingshu Chen,Yaozhou Jiang,Yiwei Zhao,Yiyang Zhang,Yucheng Yuan,Hanfu Chen,Kexin Huang,Jun Zhan,Cheng Chang,Zhaoye Fei,Shimin Li,Xiaogui Yang,Qinyuan Cheng,Xipeng Qiu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
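TTSD-eval 不依赖说话人分割工具,而是借助强制对齐把合成音频切回脚本片段后逐段核对说话人标签。其说话人归属指标可简化为如下形式(片段的 (说话人, 文本) 结构为假设的示意):

```python
def speaker_attribution_accuracy(script, aligned):
    """TTSD-eval 思路的极简示意:将强制对齐得到的各片段说话人标签
    与对话脚本逐段比对,计算说话人归属准确率。
    script / aligned 均为 (说话人, 文本) 片段序列。"""
    if len(script) != len(aligned) or not script:
        return 0.0
    hits = sum(1 for (s, _), (a, _) in zip(script, aligned) if s == a)
    return hits / len(script)
```

真实评测中片段边界由强制对齐给出,这里假定两序列已一一对应。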
[NLP-28] PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
【速读】: 该论文旨在解决现有上下文压缩方法在部署时因指定固定压缩比或长度而导致性能不可预测下降的问题,从而影响大型语言模型(Large Language Models, LLMs)推理效率的可靠提升。其解决方案的关键在于提出一种面向性能的上下文压缩范式(Performance-oriented Context Compression, PoC),即开发者不再设定压缩比例,而是定义一个可接受的性能下限;随后通过轻量级性能预测器自动搜索满足该约束条件下的最激进压缩比,并驱动现成的压缩器执行操作。其中,设计了两种预测器变体——上下文无关与上下文感知型,后者利用输入文本的固有可压缩性特征显著降低预测误差,从而实现更优的整体性能表现。
链接: https://arxiv.org/abs/2603.19733
作者: Runsong Zhao,Shilei Liu,Jiwei Tang,Langming Liu,Haibin Chen,Weidong Zhang,Yujin Yuan,Tong Xiao,Jingbo Zhu,Wenbo Su,Bo Zheng
机构: Northeastern University, China; Tsinghua University; Future Living Lab of Alibaba
类目: Computation and Language (cs.CL)
备注:
Abstract:While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input’s inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
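"给定性能下限、搜索最激进压缩比"这一 PoC 范式,在预测性能随压缩比单调的假设下就是一次二分搜索(predictor 与候选保留比例均为假设示例,非论文实现):

```python
def poc_select_ratio(predictor, floor, keep_percents):
    """性能导向压缩的示意:给定可接受的性能下限 floor,在候选保留比例
    (百分比,越小压得越狠)中二分搜索满足约束的最激进取值。
    predictor 为轻量性能预测器,这里作为假设的单调回调函数。"""
    candidates = sorted(keep_percents)
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if predictor(candidates[mid]) >= floor:
            best = candidates[mid]   # 满足下限,继续尝试更激进的比例
            hi = mid - 1
        else:
            lo = mid + 1
    return best
```

选出的比例随后交给现成压缩器执行;若没有任何候选满足下限则返回 None。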
[NLP-29] LoopRPT: Reinforcement Pre-Training for Looped Language Models
【速读】: 该论文旨在解决现有强化学习(Reinforcement Learning, RL)范式与环形语言模型(Looped Language Models, LoopLMs)架构之间的结构不匹配问题——即传统RL主要针对输出token进行优化,而LoopLM的推理过程是隐式地通过迭代潜空间计算实现的。解决方案的关键在于提出一种名为LoopRPT的强化预训练框架,其核心创新是将传统的下一个词预测任务重构为“下一个词推理任务”,并通过EMA教师参考和噪声潜空间回放(noisy latent rollouts)直接向潜空间步骤分配强化信号,从而引导中间表示的优化,使有效推理压缩在更少迭代次数内完成。
链接: https://arxiv.org/abs/2603.19714
作者: Guo Tang,Shixin Jiang,Heng Chang,Nuo Chen,Yuhan Li,Huiming Fan,Jia Li,Ming Liu,Bing Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
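LoopRPT 中作为强化信号参考的 EMA 教师,其参数更新规则是标准的指数滑动平均,可示意如下(以标量参数字典代替真实模型权重):

```python
def ema_update(teacher, student, decay=0.99):
    """EMA 教师参数更新:teacher ← decay·teacher + (1−decay)·student。
    此为通用的指数滑动平均示意,非论文完整训练流程。"""
    return {k: decay * teacher[k] + (1 - decay) * student[k] for k in teacher}
```

反复调用后教师参数缓慢跟随学生,为潜空间步骤提供平稳的参考信号。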
[NLP-30] TAB-AUDIT: Detecting AI-Fabricated Scientific Tables via Multi-View Likelihood Mismatch
【速读】: 该论文旨在解决生成式 AI (Generative AI) 伪造科学论文中实验表格所引发的学术诚信危机问题,特别是针对自然语言处理(NLP)领域实证研究中表格数据的真实性检测难题。其解决方案的关键在于提出 TAB-AUDIT 框架,其中核心特征为“表内不一致性”(within-table mismatch),该特征量化了表格骨架结构与数值内容之间的困惑度差异,从而有效识别 AI 生成表格与人类撰写的表格在统计模式上的系统性区别。实验表明,基于此特征构建的随机森林模型在域内和域外检测任务中分别达到 0.987 AUROC 和 0.883 AUROC,显著优于现有方法,验证了实验表格作为检测 AI 伪造科学内容的重要法医信号的价值。
链接: https://arxiv.org/abs/2603.19712
作者: Shuo Huang,Yan Pen,Lizhen Qu
机构: Monash University (蒙纳士大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:AI-generated fabricated scientific manuscripts raise growing concerns about large-scale breaches of academic integrity. In this work, we present the first systematic study on detecting AI-generated fabricated scientific tables in empirical NLP papers, as information in tables serves as critical evidence for claims. We construct FabTab, the first benchmark dataset of fabricated manuscripts with tables, comprising 1,173 AI-generated papers and 1,215 human-authored ones in empirical NLP. Through a comprehensive analysis, we identify systematic differences between fabricated and real tables and operationalize them into a set of discriminative features within the TAB-AUDIT framework. The key feature, within-table mismatch, captures the perplexity gap between a table’s skeleton and its numerical content. Experimental results show that a Random Forest classifier built on these features significantly outperforms prior state-of-the-art methods, achieving 0.987 AUROC in-domain and 0.883 AUROC out-of-domain. Our findings highlight experimental tables as a critical forensic signal for detecting AI-generated scientific fraud and provide a new benchmark for future research.
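核心特征"表内不一致性"定义为表格数值内容与骨架(表头等)的困惑度之差。下面用加一平滑的一元语言模型近似困惑度作示意(原文使用 LLM 困惑度;数值识别的正则与函数名均为简化假设):

```python
import math
import re
from collections import Counter

NUM_RE = re.compile(r"-?\d+(\.\d+)?")

def unigram_perplexity(tokens, counts, total):
    """加一平滑的一元模型困惑度(以词频近似 LLM 困惑度,仅作示意)。"""
    vocab = len(counts) + 1
    log_p = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_p / len(tokens))

def within_table_mismatch(cells, counts, total):
    """表内不一致性特征:数值单元与骨架单元的困惑度之差。"""
    numeric = [c for c in cells if NUM_RE.fullmatch(c)]
    skeleton = [c for c in cells if not NUM_RE.fullmatch(c)]
    return (unigram_perplexity(numeric, counts, total)
            - unigram_perplexity(skeleton, counts, total))
```

伪造表格的数值在"骨架通顺、数字反常"时会拉大这一差值,成为分类器的判别特征。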
[NLP-31] EvoTaxo: Building and Evolving Taxonomy from Social Media Streams
【速读】: 该论文旨在解决从社交媒体语料中构建和演化分类体系(taxonomy)的挑战,这些问题源于帖子短小、噪声大、语义纠缠以及随时间动态变化的特点。现有分类体系归纳方法多针对静态语料设计,在鲁棒性、可扩展性和对话语演变敏感性之间难以平衡。其解决方案的关键在于提出EvoTaxo框架,该框架基于大语言模型(Large Language Model, LLM),将每条帖子转化为对当前分类体系的结构化编辑动作,通过时间窗口累积结构证据,并采用双视角聚类融合语义相似性与时间局部性来整合候选修改;同时引入精炼与仲裁机制筛选可靠编辑,每个节点维护概念记忆库以长期保持语义边界,从而实现分类体系的持续演化与高质量维护。
链接: https://arxiv.org/abs/2603.19711
作者: Yiyang Li,Tianyi Ma,Yanfang Ye
机构: University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, a LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
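EvoTaxo 的关键是把帖子转化为对分类体系的结构化编辑动作。下面给出"节点 + 概念记忆库 + 编辑执行"的最小示意(动作集合与字段名均为假设,非论文原实现):

```python
class TaxoNode:
    """简化的分类体系节点:children 为子节点,memory 为概念记忆库。"""
    def __init__(self, name):
        self.name, self.children, self.memory = name, {}, []

def apply_edit(root, action, path, arg):
    """执行一条结构化编辑动作:先沿路径定位节点,
    再 add(新增子节点)或 attach(把帖子证据写入概念记忆库)。"""
    node = root
    for step in path:
        node = node.children[step]
    if action == "add":
        node.children[arg] = TaxoNode(arg)
    elif action == "attach":
        node.memory.append(arg)
    return root
```

按时间窗口累积并经仲裁筛选后的编辑,依次应用即可让分类体系随帖子流演化。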
[NLP-32] DataProphet: Demystifying Supervision Data Generalization in Multimodal LLM s
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中如何高效选择监督数据的问题,特别是现有基于任务相似性的直观选择策略是否可靠。研究发现,传统依赖于任务类别(如文本密集型或视觉主导型)的相似性判断无法准确预测下游性能提升,而具体数据集本身才是影响迁移效果的关键因素。为此,作者提出 DATAPROPHET,一种无需训练即可评估监督数据潜力的指标,其核心在于融合多模态困惑度(multimodal perplexity)、数据相似性与多样性三个维度,从而实现对训练后性能的高相关性预估(Kendall’s tau 达 86.0%),显著优于随机选择、现有基于训练的基线及基于实验表现的“理想”选择方案。
链接: https://arxiv.org/abs/2603.19688
作者: Xuan Qi,Luxi He,Dan Roth,Xingyu Fu
机构: University of Pennsylvania (宾夕法尼亚大学); Tsinghua University (清华大学); Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL)
备注: 14 pages
Abstract:Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall’s tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
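论文用 Kendall's tau 衡量预估排名与真实训练后收益排名的一致性;DATAPROPHET 本身则把困惑度、相似性、多样性合成一个预估分。两者可示意如下(合成方式与权重为假设,非论文公式;tau 为无并列的简化版):

```python
from itertools import combinations

def kendall_tau(a, b):
    """无并列简化版 Kendall's tau:+1 表示两个排名完全一致,-1 完全相反。"""
    pairs = list(combinations(range(len(a)), 2))
    score = sum(1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1 for i, j in pairs)
    return score / len(pairs)

def dataprophet_score(perplexity, similarity, diversity, weights=(1.0, 1.0, 1.0)):
    """示意性的预估分:困惑度越低、相似性与多样性越高,得分越高。"""
    w1, w2, w3 = weights
    return -w1 * perplexity + w2 * similarity + w3 * diversity
```

对每个候选数据集算出预估分后排序,再与真实收益排名算 tau,即可复现论文的评估口径。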
[NLP-33] Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach
【速读】: 该论文旨在解决阿拉伯语(Arabic)领域中缺乏可扩展、语言学导向的自动作文评分(Automatic Essay Scoring, AES)工具的问题,特别是在零样本(zero-shot)和少样本(few-shot)条件下如何有效评估作文的语言能力特质(trait-specific scoring)。解决方案的关键在于提出了一种三层次提示工程框架(three-tier prompting strategy),包括标准提示、混合提示(hybrid prompting)和基于评分量表的提示(rubric-guided prompting),其中混合提示模拟多代理评价机制,引入特质专家评分者视角;而基于评分量表的提示通过嵌入已标注示例来增强模型与评分标准的一致性。实验表明,结构化提示策略显著提升了不同语言特质(如连贯性与发展性)的评分一致性,且在所有模型中均优于单纯依赖模型规模的做法,证明了提示设计在低资源语言环境下的核心作用。
链接: https://arxiv.org/abs/2603.19668
作者: Salim Al Mandhari,Hieu Pham Dinh,Mo El-Haj,Paul Rayson
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages
Abstract:This paper presents a novel prompt engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait-specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero-shot and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait-level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and confidence intervals show that Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero-shot and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency-oriented Arabic AES and sets the foundation for scalable assessment in low-resource educational contexts.
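文中主要评价指标 QWK(二次加权 Kappa)可按标准定义实现如下(评分取 0..k−1 的整数;1 为完全一致,0 为随机水平):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, k):
    """二次加权 Kappa:按 (i-j)^2 加权的观测/期望混淆矩阵之比。
    注意:当两位评分者都只给出同一个分数时期望分母为 0,此处不做处理。"""
    n = len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    expected = [[hist_a[i] * hist_b[j] / n ** 2 for j in range(k)] for i in range(k)]
    weight = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    num = sum(weight[i][j] * observed[i][j] for i in range(k) for j in range(k))
    den = sum(weight[i][j] * expected[i][j] for i in range(k) for j in range(k))
    return 1 - num / den
```

把模型评分与人工评分代入,即得论文表格中各特质的 QWK 值。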
[NLP-34] BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本理解任务中因上下文窗口(context window)指数级扩展所引发的推理延迟高和信息利用效率低的问题。现有压缩方法常因激进的token剪枝导致语义碎片化或训练成本高昂。其解决方案的关键在于提出一种无需训练的框架BEAVER,该框架将压缩策略从线性token移除转变为结构感知的分层选择机制:通过双路径池化(dual-path pooling)将变长上下文映射为密集的页级张量以最大化硬件并行性,并结合语义与词汇双分支选择及句子平滑技术保障话语连贯性,从而在不牺牲性能的前提下显著降低延迟(如128k上下文下减少26.4倍),实现高效且高保真的长文本处理。
链接: https://arxiv.org/abs/2603.19635
作者: Zhengpei Hu,Kai Li,Dapeng Fu,Chang Zeng,Yue Li,Yuanhao Tang,Jianqiang Huang
机构: Qinghai University (青海大学); Tsinghua University (清华大学); Ant Group Security and Intelligence Laboratory (SIL) (蚂蚁集团安全与智能实验室)
类目: Computation and Language (cs.CL)
备注: Technical Report
Abstract:The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at this https URL.
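BEAVER 把变长上下文映射为稠密页级张量的第一步是按页均值池化,可示意如下(页大小与嵌入均为假设的玩具数据,函数名非论文实现):

```python
def page_pool(token_embeddings, page_size):
    """结构感知页选择的第一步示意:把变长的 token 嵌入序列按固定页大小切块,
    每页做均值池化,得到稠密的页级表示,便于并行打分与选页。"""
    pages = []
    for start in range(0, len(token_embeddings), page_size):
        chunk = token_embeddings[start:start + page_size]
        dim = len(chunk[0])
        pages.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pages
```

池化后的页级张量形状规整,后续的语义/词汇双分支打分可在其上批量并行执行。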
[NLP-35] CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation INTERSPEECH2026
【速读】: 该论文旨在解决当前音频描述(audio captioning)评估中缺乏有效参考-free(无需参考文本)指标的问题。现有方法如基于参考文本的指标成本高且难以衡量声学保真度,而基于CLAP(Contrastive Language-Audio Pretraining)的指标则常忽略句法错误和细粒度语义细节。解决方案的关键在于提出CAF-Score,该指标通过融合对比音频-文本嵌入与大型音频语言模型(Large Audio-Language Models, LALMs)的推理能力,实现对粗粒度语义一致性与细粒度语法理解的校准,从而有效识别句法不一致性和细微幻觉。实验表明,CAF-Score在BRACE基准上与人类判断的相关性最高,甚至在复杂场景下超越参考基线。
链接: https://arxiv.org/abs/2603.19615
作者: Insung Lee,Taeyoung Jeong,Haejun Yoo,Du-Seong Chang,Myoung-Wan Koo
机构: Sogang University (西江大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: A condensed version of this work has been submitted to Interspeech 2026. Section 10 is an extended analysis added in this version
Abstract:While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP’s coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at this https URL.
[NLP-36] TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)在文本分类任务中广泛应用推理策略(reasoning strategies)时存在的有效性与效率问题。尽管多步推理方法(如思维链 CoT)被广泛用于提升模型性能,但其对分类任务是否普遍有益、以及带来的计算成本是否合理仍缺乏系统评估。解决方案的关键在于构建 TextReasoningBench 基准测试平台,通过在五种文本分类数据集上对比七种推理策略(包括 IO、CoT、SC-CoT、ToT、GoT、BoC 和 long-CoT)在十种大语言模型上的表现,并引入两个新的成本感知指标:每推理 token 的性能增益和性能提升相对于 token 成本增长的效率比,从而定量揭示推理机制在分类任务中的实际收益与代价。
链接: https://arxiv.org/abs/2603.19558
作者: Xinyu Guo,Yazhou Zhang,Jing Qin
机构: Tianjin University (天津大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注: 20 pages
Abstract:Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on large models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10x to 100x (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
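文中提出的两个成本感知指标(每推理 token 的性能增益,以及性能提升相对 token 成本增长的效率比)可按如下方式计算(具体公式为依据摘要的示意性假设,非论文原定义):

```python
def gain_per_token(acc_reason, acc_io, tok_reason, tok_io):
    """每推理 token 的性能增益:相对 IO 基线的准确率提升除以额外 token 开销。"""
    extra = tok_reason - tok_io
    return (acc_reason - acc_io) / extra if extra > 0 else 0.0

def efficiency_ratio(acc_reason, acc_io, tok_reason, tok_io):
    """性能提升幅度相对于 token 成本增长幅度的效率比。"""
    return ((acc_reason - acc_io) / acc_io) / ((tok_reason - tok_io) / tok_io)
```

例如准确率从 0.80 升到 0.85 而 token 从 100 涨到 600,每 token 增益仅 1e-4,直观体现"收益小、开销大"。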
[NLP-37] FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment
【速读】: 该论文旨在解决当前语言模型在文档引导型问答(Document-grounded Question Answering, QA)任务中,特别是在处理复杂、异构的药品标签文档时,存在事实准确性不足、长上下文检索能力弱以及安全拒绝行为不完善的问题。解决方案的关键在于构建了一个由专家(FDA监管评估人员)精心策划的真实世界基准测试集FDARxBench,通过多阶段流水线生成高质量、涵盖事实性、多跳推理和拒绝类任务的QA样本,并设计了开卷与闭卷两种评估协议,以系统性地评估大语言模型(LLM)在药物标签理解中的表现,从而推动面向监管级应用的语言模型评测体系发展。
链接: https://arxiv.org/abs/2603.19539
作者: Betty Xiong,Jillian Fisher,Benjamin Newman,Meng Hu,Shivangi Gupta,Yejin Choi,Lanyan Fang,Russ B Altman
机构: Stanford University (斯坦福大学); University of Washington (华盛顿大学); U.S. Food and Drug Administration (美国食品药品监督管理局)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures
Abstract:We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, and construct a multi-stage pipeline for generating high-quality, expert curated, QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.
[NLP-38] Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas
【速读】: 该论文旨在解决多智能体环境中策略生成的难题,即如何利用大语言模型(Large Language Model, LLM)迭代地合成程序化代理策略,而非依赖强化学习训练神经网络策略。其核心挑战在于如何有效引导LLM在自我对弈(self-play)中优化策略,同时确保策略具备社会协调性与高效性。解决方案的关键在于引入密集反馈机制(dense feedback),即不仅提供标量奖励,还注入社会指标(如效率、公平性、可持续性和和平度),从而增强LLM对复杂协作任务的理解能力。实验表明,这种反馈设计显著提升了策略性能,尤其在Cleanup公共品博弈中帮助LLM精准权衡清洁与采集的成本收益,促进领土划分、角色动态分配等高效合作行为,而非引发对公平性的过度优化。
链接: https://arxiv.org/abs/2603.19453
作者: Víctor Gallego
机构: Komorebi.ai
类目: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:
Abstract:We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at this https URL. Subjects: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2603.19453 [cs.CL] (or arXiv:2603.19453v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.19453 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
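论文描述的"生成、自我对弈评估、携带反馈再生成"的迭代骨架,可以用如下玩具示例勾勒。其中 generate_fn、evaluate_fn 以及 dense feedback 的字典结构均为示意性假设;真实系统中由 LLM 生成 Python 策略函数,并在多智能体环境中评估奖励与社会指标:

```python
import random

def refine_policy(generate_fn, evaluate_fn, n_iters=20):
    """迭代式策略合成骨架:生成 -> 自我对弈评估 -> 携带反馈再生成。"""
    best_policy, best_reward, feedback = None, float("-inf"), None
    for _ in range(n_iters):
        policy = generate_fn(feedback)     # 真实系统:LLM 依据反馈产出策略代码
        metrics = evaluate_fn(policy)      # 真实系统:环境中自我对弈得到各项指标
        if metrics["reward"] > best_reward:
            best_policy, best_reward = policy, metrics["reward"]
        feedback = metrics                 # dense feedback:奖励与社会指标一并回传
    return best_policy, best_reward

# 玩具设定:策略是一个实数,奖励是到目标 0.7 的负距离
random.seed(0)

def toy_generate(feedback):
    if feedback is None:
        return random.random()
    return feedback["last_policy"] + random.uniform(-0.1, 0.1)

def toy_evaluate(p):
    return {"reward": -abs(p - 0.7), "equality": 1.0, "last_policy": p}

policy, reward = refine_policy(toy_generate, toy_evaluate)
print(-0.7 <= reward <= 0.0)  # True:最优奖励非正,且不劣于首轮策略
```

论文对比的 sparse 与 dense feedback 在此框架下仅相当于回传字典中包含的键不同。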
[NLP-39] Vocabulary shapes cross-lingual variation of word-order learnability in language models ACL2026
【速读】: 该论文试图解决的问题是:为何某些语言(如捷克语)允许自由词序,而另一些语言(如英语)则不允许?为解答这一问题,研究者通过在自然语言的合成词序变体上预训练Transformer语言模型,系统考察了词序规则性对模型学习难度的影响。其关键解决方案在于发现:词序不规则性会显著提高模型的意外度(surprisal),从而降低可学习性;而句子逆序仅轻微影响学习能力。更重要的是,单纯以“自由词序”或“固定词序”进行语言分类无法解释跨语言差异,真正决定计算层面词序可学习性的核心因素是词汇和子词(subword)结构本身——即词汇组织方式是驱动不同语言词序学习效率的关键变量。
链接: https://arxiv.org/abs/2603.19427
作者: Jonas Mayer Martins,Jaap Jumelet,Viola Priesemann,Lisa Beinborn
机构: University of Göttingen, Germany; University of Groningen, Netherlands; MPI for Dynamics and Self-Organization, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to ACL 2026. 17 pages, 11 figures
Abstract:Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
[NLP-40] Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure ICLR2026
【速读】: 该论文试图解决的问题是:当前基于线性探测(linear probes)的方法在评估大语言模型(Large Language Models, LLMs)是否具备评估意识(evaluation awareness)时,其信号可能源于提示格式(prompt format)或表面结构(surface structure)等结构性伪影(structural artifacts),而非真正的评价上下文(evaluation context)的识别能力。解决方案的关键在于通过构建一个受控的2×2数据集并引入诊断性重写(diagnostic rewrites),部分控制提示格式变量,从而检验探测信号是否仍能持续存在。研究发现,线性探测主要捕捉的是基准-规范结构(benchmark-canonical structure),无法在脱离语言风格约束的自由格式提示中泛化,因此现有基于探测的方法难以可靠地区分评价上下文与结构性伪影,削弱了已有研究结果的证据强度。
链接: https://arxiv.org/abs/2603.19426
作者: Viliana Devbunova
机构: Yandex
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 tables, 2 figures. Accepted at ICLR 2026 Workshop “I Can’t Believe It’s Not Better”
Abstract:Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.
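摘要中"线性探针读出的可能是表面结构而非评估上下文"的现象,可以在合成数据上用一个极简逻辑回归探针直观复现:隐状态中只注入"格式"信号,于是探针对格式的预测远好于对与之独立的"上下文"标签的预测。数据生成方式、维度与训练超参均为假设,仅作示意:

```python
import math, random

random.seed(0)
n, d = 160, 8
fmt = [random.randint(0, 1) for _ in range(n)]   # 1 = 基准格式(表面结构)
ctx = [random.randint(0, 1) for _ in range(n)]   # 1 = 评估上下文(与格式独立)
# 合成"隐状态":只有第 0 维携带格式信号,上下文没有任何线性可读信号
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
for i in range(n):
    X[i][0] += 3.0 * fmt[i]

def fit_probe(X, y, lr=0.5, steps=300):
    """极简逻辑回归线性探针(批量梯度下降)。"""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            e = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(dim):
                gw[j] += e * xi[j]
            gb += e
        w = [wj - lr * gj / len(y) for wj, gj in zip(w, gw)]
        b -= lr * gb / len(y)
    return w, b

def accuracy(w, b, X, y):
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)

fmt_acc = accuracy(*fit_probe(X, fmt), X, fmt)
ctx_acc = accuracy(*fit_probe(X, ctx), X, ctx)
print(fmt_acc > ctx_acc)  # True:探针读出的是格式信号而非评估上下文
```

这正是论文要控制的混淆:只要格式与上下文相关,探针的高准确率并不能证明模型"意识到"自己在被评估。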
[NLP-41] Scalable Prompt Routing via Fine-Grained Latent Task Discovery
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)池中动态选择最优模型以平衡性能与成本的问题。随着模型池扩展至数十个前沿模型且性能差距趋近,现有方法面临两大挑战:人工定义的任务分类体系难以捕捉细粒度的能力差异,而单一路由器又无法有效区分多样任务间的细微差别。解决方案的关键在于提出一种两阶段路由架构:第一阶段通过图聚类自动发现潜在任务类型并训练分类器进行prompt到任务的映射;第二阶段采用专家混合(Mixture-of-Experts)结构,结合任务特定预测头实现精细化的质量估计。推理时融合两个阶段的输出,在任务级稳定性与prompt级适应性之间取得平衡,从而在10个基准测试上优于现有基线,并超越最强单模型表现,同时成本低于其一半。
链接: https://arxiv.org/abs/2603.19415
作者: Yunyi Zhang,Soji Adeshina,Patrick Guan,Ashwin Ganesh,Zhen Han,Vassilis N. Ioannidis,Huzefa Rangwala,George Karypis
机构: Amazon Web Services (亚马逊网络服务)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
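两阶段路由在推理时的聚合逻辑可以示意如下:按权重融合任务级与 prompt 级质量预测,再用成本惩罚做权衡后选出模型。打分公式、权重与数值均为假设,并非论文原始实现:

```python
def route(task_scores, prompt_scores, costs, alpha=0.5, budget_weight=0.0):
    """两阶段路由聚合示意:融合任务级与 prompt 级质量预测并计入成本惩罚。
    三个输入均为 {模型名: 数值} 字典,alpha 控制两级预测间的权衡。"""
    best, best_score = None, float("-inf")
    for m in task_scores:
        s = alpha * task_scores[m] + (1 - alpha) * prompt_scores[m]
        s -= budget_weight * costs[m]
        if s > best_score:
            best, best_score = m, s
    return best

task_q   = {"A": 0.90, "B": 0.80, "C": 0.85}   # 第一阶段:任务级质量预测
prompt_q = {"A": 0.70, "B": 0.95, "C": 0.80}   # 第二阶段:prompt 级质量预测
cost     = {"A": 10.0, "B": 2.0,  "C": 1.0}    # 每次调用成本(虚构)
print(route(task_q, prompt_q, cost))                     # 'B':融合质量最高
print(route(task_q, prompt_q, cost, budget_weight=0.1))  # 'C':计入成本后更划算
```

任务级分数提供稳定性,prompt 级分数提供适应性,这正对应摘要中"balance task-level stability with prompt-specific adaptability"的设计。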
[NLP-42] Anatomical Heterogeneity in Transformer Language Models
【速读】:该论文旨在解决当前 Transformer 语言模型在训练时对所有层分配相同计算预算的假设问题,这一假设隐含了层间同质性的前提,而实证分析表明层间存在显著异质性。解决方案的关键在于通过五种诊断指标识别出各层的重要性差异,并据此提出"增长式 Transformer 训练"(Growth Transformer Training)策略:根据每层的重要性动态分配计算资源,而非均匀分配。概念验证实验显示,该方法在保持参数量不变的情况下,相比均匀训练可实现约54%的训练成本降低,验证损失比均匀训练低4.7倍,同时训练速度提升13%。
链接: https://arxiv.org/abs/2603.19348
作者: Tomasz Wietrzykowski
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 10 tables. Independent research. Code available at this https URL
Abstract:Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R2 = 0.91) with a universal oscillatory delta pattern (correlation ~= -0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (alpha = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
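摘要中 Growth Transformer Training 的思想(按层重要性分配训练预算)可以用一个按比例分配、并将负重要性的 anti-layer 截断为零的小函数来示意;重要性数值与分配方式均为假设,非论文原始实现:

```python
def allocate_budget(importance, total_steps, floor=0.0):
    """按层重要性比例分配训练步数;负重要性的 anti-layer 被截断到 floor。"""
    clipped = [max(i, floor) for i in importance]
    total = sum(clipped) or 1.0
    return [round(total_steps * c / total) for c in clipped]

# 假设 6 层的消融 PPL 退化(%):中部核心层极大,anti-layer 为负
imp = [5.0, 120.0, 600.0, 300.0, -3.0, 10.0]
steps = allocate_budget(imp, total_steps=1000)
print(steps)  # 核心层拿走绝大多数预算,anti-layer 几乎不分配
```

重要性本身可由论文所述的消融退化、恢复速度等诊断指标给出;此处只演示"重要性到预算"这一步。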
[NLP-43] Prompt-tuning with Attribute Guidance for Low-resource Entity Matching
【速读】: 该论文旨在解决传统实体匹配(Entity Matching, EM)方法对大量高质量标注数据依赖性强、成本高,且现有提示调优(prompt-tuning)方法多局限于实体层面而忽视属性级信息、缺乏可解释性的问题。其解决方案的关键在于提出PROMPTATTRIB框架,通过引入实体级与属性级双层次提示机制来增强上下文表征,并利用模糊逻辑公式进行逻辑推理以确定最终匹配标签;同时结合基于Dropout的对比学习策略(受SimCSE启发)优化软提示表示,从而在低资源场景下实现更准确、更具解释性的实体匹配效果。
链接: https://arxiv.org/abs/2603.19321
作者: Lihui Liu,Carl Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality labeled data. This labeling process is both time-consuming and costly, limiting practical applicability. As a result, there is a strong need for low-resource EM methods that can perform well with minimal labeled data. Recent prompt-tuning approaches have shown promise for low-resource EM, but they mainly focus on entity-level matching and often overlook critical attribute-level information. In addition, these methods typically lack interpretability and explainability. To address these limitations, this paper introduces PROMPTATTRIB, a comprehensive solution that tackles EM through attribute-level prompt tuning and logical reasoning. PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information and employs fuzzy logic formulas to infer the final matching label. By explicitly considering attributes, the model gains a deeper understanding of the entities, resulting in more accurate matching. Furthermore, PROMPTATTRIB integrates dropout-based contrastive learning on soft prompts, inspired by SimCSE, which further boosts EM performance. Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
[NLP-44] Exploring Novelty Differences between Industry and Academia: A Knowledge Entity-centric Perspective
【速读】: 该论文试图解决的问题是:在技术进步中,学术界与产业界谁更可能产生更具新颖性的研究成果,以及二者合作是否有助于提升研究的新颖性。此前研究因数据来源受限和新颖性衡量标准不一致而存在争议。其解决方案的关键在于构建一个基于细粒度知识实体(方法、工具、数据集、指标)的统一语义空间,通过计算实体间的语义距离来量化新颖性,并利用回归模型比较不同文献类型下学术界与产业界的创新产出差异,从而实现跨类型研究的新颖性可比性分析。
链接: https://arxiv.org/abs/2603.19319
作者: Hongye Zhao,Yi Zhao,Chengzhi Zhang
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Academia and industry each possess distinct advantages in advancing technological progress. Academia’s core mission is to promote open dissemination of research results and drive disciplinary progress. The industry values knowledge appropriability and core competitiveness, yet actively engages in open practices like academic conferences and platform sharing, creating a knowledge strategy paradox. Highly novel and publicly accessible knowledge serves as the driving force behind technological advancement. However, it remains unclear whether industry or academia can produce more novel research outcomes. Some studies argue that academia tends to generate more novel ideas, while others suggest that industry researchers are more likely to drive breakthroughs. Previous studies have been limited by data sources and inconsistent measures of novelty. To address these gaps, this study conducts an analysis using four types of fine-grained knowledge entities (Method, Tool, Dataset, Metric), calculates semantic distances between entities within a unified semantic space to quantify novelty, and achieves comparability of novelty across different types of literature. Then, a regression model is constructed to analyze the differences in publication novelty between industry and academia. The results indicate that academia demonstrates higher novelty outputs, which is particularly evident in patents. At the entity level, both academia and industry emphasize method-driven advancements in papers, while industry holds a unique advantage in datasets. Additionally, academia-industry collaboration has a limited effect on enhancing the novelty of research papers, but it helps to enhance the novelty of patents. We release our data and associated codes at this https URL.
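以语义距离度量新颖性的思路可示意如下:一篇文献的新颖性取其各知识实体向量到既有实体集合的距离(1 减去最大余弦相似度)的均值。向量与实体均为虚构示例,具体公式并非论文原始定义:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def novelty(new_entities, prior_entities):
    """新颖性示意:各实体向量到既有实体集的最小距离(1 - 最大相似度)的均值。"""
    dists = [1.0 - max(cosine(e, p) for p in prior_entities) for e in new_entities]
    return sum(dists) / len(dists)

prior   = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0]]   # 既有方法/数据集等实体的向量(虚构)
paper_a = [[0.9, 0.1, 0.0]]                     # 实体贴近既有方法 -> 新颖性低
paper_b = [[0.0, 0.1, 0.9]]                     # 实体远离既有空间 -> 新颖性高
print(novelty(paper_a, prior) < novelty(paper_b, prior))  # True
```

在统一语义空间中,这种基于距离的定义使论文与专利等不同文献类型的新颖性可以相互比较。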
[NLP-45] Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长时间、开放式对话中难以维持角色一致性的问题,即模型常因无法自主回忆并准确应用其设定的个性知识而导致角色崩塌。解决方案的关键在于提出“记忆驱动的角色扮演”(Memory-Driven Role-Playing, MDRP)范式,将角色知识视为模型内部的记忆存储,并要求仅通过对话上下文进行检索与应用,从而实现对角色知识深度和自主性的严格测试。该范式进一步催生了MREval评估框架、MRPrompt提示架构和MRBench基准,实现了对角色扮演四阶段能力(锚定、回忆、边界控制与行为演绎)的细粒度诊断与提升,实验证明该方法可使小型模型性能媲美大型闭源模型,且上游记忆增强直接改善下游响应质量。
链接: https://arxiv.org/abs/2603.19313
作者: Kai Wang,Haoyang You,Yang Zhang,Zhongjie Wang
机构: Harbin Institute of Technology (哈尔滨工业大学); Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages
Abstract:A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski’s “emotional memory” acting theory, this paradigm frames persona knowledge as the LLM’s internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of depth and autonomous use of knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The novel paradigm provides a comprehensive diagnostic for four-staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
[NLP-46] PrefPO: Pairwise Preference Prompt Optimization
【速读】: 该论文旨在解决当前提示工程(Prompt Engineering)方法中存在的两大核心问题:一是依赖大量标注数据,难以在无监督场景下应用;二是生成的提示往往冗长且重复,导致"提示卫生"差(prompt hygiene),影响模型性能与可解释性。其解决方案的关键在于提出一种极简的、基于偏好学习的提示优化方法 PrefPO,该方法受强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)启发,仅需初始提示和自然语言标准即可运行,无需标签数据或复杂超参数调优。PrefPO 利用大语言模型(LLM)作为判别器来表达输出之间的成对偏好,并将反馈传递给 LLM 优化器进行迭代改进,从而在保持高性能的同时显著提升提示简洁性和鲁棒性,且对提示攻击(prompt hacking)的敏感度仅为 TextGrad 的一半,展现出更强的对齐能力与实用性。
链接: https://arxiv.org/abs/2603.19311
作者: Rahul Singhal,Pradyumna Tambwekar,Karime Maamari
机构: 未知
类目: Computation and Language (cs.CL)
备注: Code and data available at this https URL and this https URL
Abstract:Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning-only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO’s prompts higher than TextGrad’s. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
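PrefPO 的核心机制(判别器对候选两两给出偏好)可用如下玩具循环示意。这里用"包含任务关键词且更简洁者胜"模拟自然语言标准,真实系统中由 LLM 判别器完成比较;函数名与标准均为假设:

```python
def prefpo_step(candidates, discriminator):
    """单轮 PrefPO 示意:对候选两两比较,胜场最多者作为下一轮的基准。
    discriminator(a, b) 返回 True 表示偏好 a;真实系统中由 LLM 判别器充当。"""
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if discriminator(a, b):
                wins[a] += 1
            else:
                wins[b] += 1
    return max(candidates, key=lambda c: wins[c])

def prefers_concise(a, b):
    """玩具标准:必须包含任务关键词,其余情况下偏好更短的提示(提示卫生)。"""
    has_a, has_b = "分类" in a, "分类" in b
    if has_a != has_b:
        return has_a
    return len(a) <= len(b)

cands = [
    "请对文本分类",
    "请仔细阅读并对下列文本分类,注意逐步推理并详细说明理由",
    "总结这段话",
]
print(prefpo_step(cands, prefers_concise))  # '请对文本分类'
```

成对比较只需相对判断而非绝对打分,这也是该方法无需标注数据即可运行的原因。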
[NLP-47] Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练阶段对人工标注数据或外部验证器的依赖问题,以及现有方法难以实现真正自主智能提升的局限性。其核心挑战在于高质量数据获取成本高昂,且当前主流方法无法有效促进模型在不可直接验证任务上的自我进化。解决方案的关键是提出互信息偏好优化(Mutual Information Preference Optimization, MIPO)——一种基于对比数据增强的偏好学习框架,通过构建正负样本对:正样本为模型基于正确提示生成的回答,负样本为基于随机无关提示生成的回答,从而构造出蕴含提示与响应间条件互信息(Conditional Mutual Information, MI)的偏好数据集;随后使用直接偏好优化(Direct Preference Optimization, DPO)进行训练,最大化基模型下提示与响应之间的点态条件互信息。实证结果表明,MIPO不仅显著提升了个性化任务性能(3–40%提升),还能在无需额外数据或人工监督的情况下改善数学和多选题等任务表现(1–18%提升),展现出强大的自适应改进潜力。
链接: https://arxiv.org/abs/2603.19294
作者: Hyunji Nam,Haoran Li,Natasha Jaques
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verifiable. Therefore, we need self-improvement frameworks that allow models to improve without external oversight. We propose Mutual Information Preference Optimization (MIPO), a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization (DPO) to learn from this paired data maximizes pointwise conditional mutual information (MI) (under the base LLM) between prompts and model responses. Empirical results with various-sized Llama- and Qwen-Instruct models show that when used to maximize MI between user context and response, MIPO provides an effective personalization technique, achieving 3-40% improvements on personalization tasks using real-user datasets compared to strong baselines. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems, yielding 1-18% without any additional data or human supervision. These results suggest a promising direction for self-improvement.
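MIPO 构造偏好对的方式在摘要中描述得很具体:正样本是模型在正确 prompt 条件下的回答,负样本是同一模型在随机错配 prompt 条件下的回答,随后交给 DPO 训练。下面用一个假设的 respond 函数示意这一数据构造过程(respond 在真实场景中是被优化的 LLM 本身):

```python
import random

def build_mipo_pairs(prompts, respond, seed=0):
    """MIPO 偏好对构造示意:chosen 条件于正确 prompt,rejected 条件于错配 prompt。"""
    rng = random.Random(seed)
    idx = list(range(len(prompts)))
    while True:  # 采样一个无不动点的错排,保证负样本确实来自无关 prompt
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):
            break
    return [
        {
            "prompt": p,
            "chosen": respond(p),                  # 正样本:正确的用户上下文
            "rejected": respond(prompts[idx[i]]),  # 负样本:随机无关上下文
        }
        for i, p in enumerate(prompts)
    ]

prompts = ["用户A喜欢简短回答", "用户B喜欢详细解释", "用户C偏好要点列表"]
pairs = build_mipo_pairs(prompts, respond=lambda p: "针对[" + p + "]的回答")
print(all(d["chosen"] != d["rejected"] for d in pairs))  # True
```

在这样的配对数据上运行 DPO,按论文的分析即等价于最大化 prompt 与响应之间的点态条件互信息。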
[NLP-48] LLM -MRD: LLM -Guided Multi-View Reasoning Distillation for Fake News Detection DASFAA2026
【速读】: 该论文旨在解决多模态虚假新闻检测中存在的两大问题:一是现有方法在多视角判断与融合方面缺乏全面性,二是基于大语言模型(Large Language Models, LLMs)的推理机制因计算成本过高而导致效率低下。其解决方案的关键在于提出一种名为LLM-Guided Multi-View Reasoning Distillation (LLM-MRD) 的师生框架,通过构建学生端的多视角推理模块建立文本、视觉及跨模态的基础表征,并利用教师端生成深度推理链作为丰富监督信号,再通过核心的校准蒸馏机制将复杂的推理知识高效迁移至轻量级学生模型中,从而实现高性能且高效率的虚假新闻检测。
链接: https://arxiv.org/abs/2603.19293
作者: Weilin Zhou,Shanwen Tan,Enhao Gu,Yurong Qian
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at DASFAA 2026 (Oral)
Abstract:Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi-view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM-Guided Multi-View Reasoning Distillation for Fake News Detection (LLM-MRD), a novel teacher-student framework. The Student Multi-view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross-modal perspectives. Then, the Teacher Multi-view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning-derived knowledge into the efficient student model. Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake when evaluated across all competing methods and datasets. Our code is available at this https URL
[NLP-49] Automatic Analysis of Collaboration Through Human Conversational Data Resources: A Review
【速读】: 该论文试图解决如何利用面向任务的人类对话数据来分析协作过程这一问题。其解决方案的关键在于系统性地回顾与总结任务导向型对话资源在协作分析中的应用,涵盖相关理论、编码方案、研究任务及建模方法,从而为自动分析协作行为提供可实践的框架,并揭示未来研究中尚未探索的方向。
链接: https://arxiv.org/abs/2603.19292
作者: Yi Yu,Maria Boritchev,Chloé Clavel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages
Abstract:Collaboration is a task-oriented, high-level human behavior. In most cases, conversation serves as the primary medium for information exchange and coordination, making conversational data a valuable resource for the automatic analysis of collaborative processes. In this paper, we focus on verbal aspects of collaboration and conduct a review of collaboration analysis using task-oriented conversation resources, encompassing related theories, coding schemes, tasks, and modeling approaches. We aim to address the question of how to utilize task-oriented human-human conversational data for collaboration analysis. We hope our review will serve as a practical resource and illuminate unexplored areas for future collaboration analysis.
[NLP-50] Automated Motif Indexing on the Arabian Nights
【速读】: 该论文旨在解决 **motif indexing(母题索引)** 这一难题,即在文本中自动识别并标注出与特定民间故事母题(motif)相关的表达式,从而实现对传统民俗文本的结构化分析,并为现代语境下母题的传播与演变提供计算支持。其核心挑战在于缺乏高质量标注数据和跨文本的语义一致性问题。解决方案的关键在于构建了一个大规模、人工标注的语料库(包含2,670个母题表达、200种不同母题,覆盖58,450句),并利用《一千零一夜》这一广泛可得的经典文本与El-Shamy(2006)详尽的母题索引相匹配,克服了以往研究因文本不可及而导致的数据稀缺问题。在此基础上,作者系统评估了五类方法,最终发现基于LoRA微调的Llama3模型在检测任务上表现最优,F1得分达0.85,标志着首个有效的计算化母题索引方法的实现。
链接: https://arxiv.org/abs/2603.19283
作者: Ibrahim H. Alyami,Mark A. Finlayson
机构: Najran University (纳季兰大学); Florida International University (佛罗里达国际大学)
类目: Computation and Language (cs.CL)
备注: 30 pages, 4 figures, 9 tables Preprint. Submitted to Digital Scholarship in the Humanities(DSH) 2026
Abstract:Motifs are non-commonplace, recurring narrative elements, often found originally in folk stories. In addition to being of interest to folklorists, motifs appear as metaphoric devices in modern news, literature, propaganda, and other cultural texts. Finding expressions of motifs in the original folkloristic text is useful for both folkloristic analysis (motif indexing) as well as for understanding the modern usage of motifs (motif detection and interpretation). Prior work has primarily shown how difficult these problems are to tackle using automated techniques. We present the first computational approach to motif indexing. Our choice of data is a key enabler: we use a large, widely available text (the Arabian Nights) paired with a detailed motif index (by El-Shamy in 2006), which overcomes the common problem of inaccessibility of texts referred to by the index. We created a manually annotated corpus that identified 2,670 motif expressions of 200 different motifs across 58,450 sentences for training and testing. We tested five types of approaches for detecting motif expressions given a motif index entry: (1) classic retrieve and re-rank using keywords and a fine-tuned cross-encoder; (2) off-the-shelf embedding models; (3) fine-tuned embedding models; (4) generative prompting of off-the-shelf LLMs in N-shot setups; and (5) the same generative approaches on LLMs fine-tuned with LoRA. Our best performing system is a fine-tuned Llama3 model which achieves an overall performance of 0.85 F1.
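论文比较的五类方法中,"检索再重排"的第一阶段可以用关键词重合度检索来示意(真实系统第二阶段还会用微调的 cross-encoder 重排)。语料句子与打分方式为虚构示例,仅演示流程:

```python
import re

def keyword_retrieve(motif_entry, sentences, top_k=3):
    """检索阶段示意:按与母题条目词汇的重合度排序,取前 top_k 句交给重排序器。"""
    keywords = set(re.findall(r"[a-z]+", motif_entry.lower()))

    def overlap(sent):
        return len(keywords & set(re.findall(r"[a-z]+", sent.lower())))

    return sorted(sentences, key=overlap, reverse=True)[:top_k]

sentences = [
    "The fisherman opened the jar and released the genie.",
    "The merchant traveled to Baghdad with his caravan.",
    "A genie imprisoned in a jar grants three wishes.",
    "She told stories every night to delay her fate.",
]
hits = keyword_retrieve("genie imprisoned in jar", sentences, top_k=2)
print(hits[0])  # 'A genie imprisoned in a jar grants three wishes.'
```

检索阶段负责召回,重排阶段负责精度;论文表现最好的方案则改用 LoRA 微调的生成式模型直接判定母题表达。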
[NLP-51] Framing Effects in Independent-Agent Large Language Models : A Cross-Family Behavioral Analysis
【速读】: 该论文试图解决在非交互式多智能体大语言模型(Large Language Models, LLMs)部署中,由于缺乏协作机制而导致的决策协调受限问题,尤其关注提示(prompt)框架如何影响个体与群体利益冲突情境下的阈值投票行为。解决方案的关键在于揭示提示语义框架对LLM决策分布的显著影响——即使逻辑等价的提示,其表面语言线索仍可显著改变选择倾向,常使模型偏好风险规避选项;这表明在需要承担风险以达成目标时,模型表现出更倾向于工具理性(instrumental rationality)而非合作理性(cooperative rationality)的行为模式。该发现强调了提示框架效应是多代理LLM系统中的重要偏差来源,为模型对齐(alignment)和提示工程设计提供了关键依据。
链接: https://arxiv.org/abs/2603.19282
作者: Zice Wang,Zhenyu Zhang
机构: Northeastern University (东北大学); Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In many real-world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual-group interest conflict. Two logically equivalent prompts with different framings were tested across diverse LLM families under isolated trials. Results show that prompt framing significantly influences choice distributions, often shifting preferences toward risk-averse options. Surface linguistic cues can even override logically equivalent formulations. This suggests that observed behavior reflects a tendency consistent with a preference for instrumental rather than cooperative rationality when success requires risk-bearing. The findings highlight framing effects as a significant bias source in non-interacting multi-agent LLM deployments, informing alignment and prompt design.
[NLP-52] From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring
【速读】: 该论文旨在解决生成式 AI(Generative AI)在构造性作答评分系统中应用时,如何收集充分的效度证据以支持评分结果的使用与解释这一核心问题。其解决方案的关键在于明确区分传统人工评分、基于特征的自然语言处理(Natural Language Processing, NLP)AI评分与生成式 AI 评分三类系统在效度证据需求上的差异,并提出一套针对生成式 AI 的最佳实践框架,强调由于其黑箱特性及一致性等独特挑战,所需效度证据比基于特征的方法更为广泛和深入,从而为构建可信、可解释的评分系统提供理论依据与实证支持。
链接: https://arxiv.org/abs/2603.19280
作者: Jodi M. Casabianca,Daniel F. McCaffrey,Matthew S. Johnson,Naim Alper,Vladimir Zubenko
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 37 pages, 8 tables, 6 figures
Abstract:The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from a large corpus of independent argumentative essays written by 6-12th grade students demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores.
[NLP-53] Multilingual Hate Speech Detection and Counterspeech Generation: A Comprehensive Survey and Practical Guide
【速读】: 该论文旨在解决多语言环境下在线仇恨言论检测与反言辞生成中存在的局限性,特别是英语中心主义模型难以捕捉非英语及代码混合语境中隐含的仇恨表达和文化特异性内容的问题。其解决方案的关键在于提出一个结构化的三阶段框架——任务设计、数据集构建与评估,整合最新的多语言资源与自然语言处理技术,强调跨语言公平性、低资源语言的数据稀缺问题以及多模态方法的必要性,从而推动更具包容性和情境感知能力的在线安全系统建设。
链接: https://arxiv.org/abs/2603.19279
作者: Zahra Safdari Fesaghandis,Suman Kalyan Maity
机构: Bilkent University (比尔肯大学); Missouri University of Science and Technology (密苏里科技大学)
类目: Computation and Language (cs.CL)
备注: 29 pages, 7 Tables
Abstract:Combating online hate speech in multilingual settings requires approaches that go beyond English-centric models and capture the cultural and linguistic diversity of global online discourse. This paper presents a comprehensive survey and practical guide to multilingual hate speech detection and counterspeech generation, integrating recent advances in natural language processing. We analyze why monolingual systems often fail in non-English and code-mixed contexts, missing implicit hate and culturally specific expressions. To address these challenges, we outline a structured three-phase framework - task design, data curation, and evaluation - drawing on state-of-the-art datasets, models, and metrics. The survey consolidates progress in multilingual resources and techniques while highlighting persistent obstacles, including data scarcity in low-resource languages, fairness and bias in system development, and the need for multimodal solutions. By bridging technical progress with ethical and cultural considerations, we provide researchers, practitioners, and policymakers with scalable guidelines for building context-aware, inclusive systems. Our roadmap contributes to advancing online safety through fairer, more effective detection and counterspeech generation across diverse linguistic environments.
[NLP-54] HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning
【速读】: 该论文旨在解决现代基于Transformer的模型在微调过程中常见的校准不足(miscalibration)问题,即模型预测过于自信,无法准确反映真实的经验频率。其核心解决方案是采用参数高效的方法——LoRA(Low-Rank Adaptation)及其基于超网络(hyper-network)的结构化变体,以实现与全量微调相当甚至更优的校准性能,同时显著降低参数使用量。关键在于通过低秩更新机制控制适应空间,并引入结构耦合策略(如共享超网络生成LoRA矩阵A和B),从而在保持高参数效率的同时提升概率可靠性(probabilistic reliability),并揭示了约束适应空间作为正则项可有效降低Expected Calibration Error(ECE),但需权衡下游任务精度的损失。
链接: https://arxiv.org/abs/2603.19278
作者: Bartosz Trojan,Filip Gębala
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, 2 tables
Abstract:Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA (Low-Rank Adaptation) and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper-network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that enhances Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures. Code available at: this https URL
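The hyper-network idea in the abstract above — a shared network emitting the LoRA factors A and B for each layer — can be sketched in a few lines of NumPy. All shapes, the MLP design, and the per-layer embedding are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, n_layers, d_embed = 16, 4, 6, 8

# Frozen base weight of one projection (hypothetical stand-in for a RoBERTa layer).
W = rng.normal(size=(d_model, d_model))

# Shared hyper-network: a small MLP mapping a learned per-layer embedding to the
# flattened LoRA factors A (rank x d_model) and B (d_model x rank).
layer_embeddings = rng.normal(size=(n_layers, d_embed))
W1 = rng.normal(scale=0.1, size=(d_embed, 32))
W2 = rng.normal(scale=0.1, size=(32, rank * d_model + d_model * rank))

def generate_lora_factors(layer_idx):
    """Generate A and B for one layer from the shared hyper-network."""
    h = np.tanh(layer_embeddings[layer_idx] @ W1)
    flat = h @ W2
    A = flat[: rank * d_model].reshape(rank, d_model)
    B = flat[rank * d_model :].reshape(d_model, rank)
    return A, B

def adapted_forward(x, layer_idx, alpha=8.0):
    """Usual LoRA update: y = x W^T + (alpha / rank) * x A^T B^T."""
    A, B = generate_lora_factors(layer_idx)
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_model))
y = adapted_forward(x, layer_idx=0)
```

Because all layers share W1 and W2, the factors for different layers are structurally coupled while still differing through the layer embedding — the coupling the abstract refers to.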
[NLP-55] MOSAIC: Modular Opinion Summarization using Aspect Identification and Clustering
【速读】: 该论文旨在解决在线旅游平台上用户评论(reviews)在摘要生成过程中存在的两大问题:一是现有研究过度关注端到端摘要质量,而忽视了评估基准的可靠性及细粒度洞察的实际可用性;二是用户评论通常具有噪声大、冗余性强的特点,导致摘要忠实度(faithfulness)不足。为应对这些问题,作者提出MOSAIC框架,其关键在于将摘要任务分解为可解释的模块化组件——主题发现(theme discovery)、结构化观点提取(structured opinion extraction)与基于证据的摘要生成(grounded summary generation),并通过引入意见聚类(opinion clustering)作为系统级组件显著提升摘要忠实度,尤其在真实场景下噪声和冗余条件下表现优异。此外,论文还通过在线A/B测试验证了中间输出对用户体验的提升作用,并发布新的开源数据集TRECS以支持更可靠的评估。
链接: https://arxiv.org/abs/2603.19277
作者: Piyush Kumar Singh,Jayesh Choudhari
机构: Viator, TripAdvisor / London (Viator, TripAdvisor / 伦敦)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reviews are central to how travelers evaluate products on online marketplaces, yet existing summarization research often emphasizes end-to-end quality while overlooking benchmark reliability and the practical utility of granular insights. To address this, we propose MOSAIC, a scalable, modular framework designed for industrial deployment that decomposes summarization into interpretable components, including theme discovery, structured opinion extraction, and grounded summary generation. We validate the practical impact of our approach through online A/B tests on live product pages, showing that surfacing intermediate outputs improves customer experience and delivers measurable value even prior to full summarization deployment. We further conduct extensive offline experiments to demonstrate that MOSAIC achieves superior aspect coverage and faithfulness compared to strong baselines for summarization. Crucially, we introduce opinion clustering as a system-level component and show that it significantly enhances faithfulness, particularly under the noisy and redundant conditions typical of user reviews. Finally, we identify reliability limitations in the standard SPACE dataset and release a new open-source tour experience dataset (TRECS) to enable more robust evaluation.
[NLP-56] From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自动化短答案评分(Automated Short Answer Grading, ASAG)中因通用预训练导致的幻觉问题和对评分标准(rubric)严格遵循不足的问题。传统检索增强生成(Retrieval-Augmented Generation, RAG)方法采用“扁平”向量检索机制,将知识视为孤立片段,难以捕捉教育内容中的结构依赖关系与多跳推理需求。其解决方案的关键在于提出一种图检索增强生成(Graph Retrieval-Augmented Generation, GraphRAG)框架,通过构建结构化知识图谱显式建模概念间的依赖关系,并利用双阶段流水线——首先基于Microsoft GraphRAG进行高保真图谱构建,再采用HippoRAG神经符号算法执行关联图遍历,从而检索出逻辑连贯的证据子图。实验表明,该结构化检索策略显著优于标准RAG基线,在NGSS数据集上尤其在科学与工程实践(Science and Engineering Practices, SEP)评估维度表现突出,验证了结构化信息检索在支持高阶学术评估逻辑链验证中的优势。
链接: https://arxiv.org/abs/2603.19276
作者: Yucheng Chu,Haoyu Han,Shen Dong,Hang Li,Kaiqi Yang,Yasemin Copur-Gencturk,Joseph Krajcik,Namsoo Shin,Hui Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated short answer grading (ASAG) is critical for scaling educational assessment, yet large language models (LLMs) often struggle with hallucinations and strict rubric adherence due to their reliance on generalized pre-training. While Retrieval-Augmented Generation (RAG) mitigates these issues, standard “flat” vector retrieval mechanisms treat knowledge as isolated fragments, failing to capture the structural relationships and multi-hop reasoning essential for complex educational content. To address this limitation, we introduce a Graph Retrieval-Augmented Generation (GraphRAG) framework that organizes reference materials into a structured knowledge graph to explicitly model dependencies between concepts. Our methodology employs a dual-phase pipeline: utilizing Microsoft GraphRAG for high-fidelity graph construction and the HippoRAG neurosymbolic algorithm to execute associative graph traversals, thereby retrieving comprehensive, connected subgraphs of evidence. Experimental evaluations on a Next Generation Science Standards (NGSS) dataset demonstrate that this structural approach significantly outperforms standard RAG baselines across all metrics. Notably, the HippoRAG implementation achieved substantial improvements in evaluating Science and Engineering Practices (SEP), confirming the superiority of structural retrieval in verifying the logical reasoning chains required for higher-order academic assessment.
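The "connected subgraph of evidence" retrieval described above can be pictured with a minimal breadth-first sketch over a toy concept graph. The graph, seed concepts, and hop limit below are invented for illustration and are unrelated to the actual Microsoft GraphRAG or HippoRAG implementations:

```python
from collections import deque

# Toy concept graph: nodes are rubric concepts, edges are stated dependencies
# (all contents invented for illustration).
graph = {
    "photosynthesis": ["light energy", "glucose"],
    "light energy": ["chloroplast"],
    "glucose": ["cellular respiration"],
    "cellular respiration": ["ATP"],
    "chloroplast": [],
    "ATP": [],
}

def retrieve_subgraph(seeds, max_hops=2):
    """BFS from seed concepts, collecting a connected evidence subgraph
    instead of isolated top-k fragments as in flat vector retrieval."""
    visited, frontier = set(seeds), deque((s, 0) for s in seeds)
    edges = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr in graph.get(node, []):
            edges.append((node, nbr))
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return visited, edges

nodes, edges = retrieve_subgraph(["photosynthesis"], max_hops=2)
```

The returned edges preserve the dependency structure between concepts, which is what lets a grader verify a multi-hop reasoning chain rather than just keyword overlap.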
[NLP-57] Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models
【速读】: 该论文旨在解决放射科报告自动摘要任务中因模型训练策略不当导致的性能瓶颈与“冷启动”问题(cold start problem),即在有限标注数据下模型难以快速收敛或生成高质量摘要。其解决方案的关键在于提出一种“预训练-中段适应-微调”(pre-training, mid-training, fine-tuning)的三阶段框架,其中中段适应(mid-training)通过在特定子领域(subdomain)上进一步训练大语言模型(LLM),显著提升了摘要质量与事实准确性(如RadGraph-F1指标),同时增强了少样本学习能力,有效缓解了传统直接微调策略中存在的初始学习困难问题。
链接: https://arxiv.org/abs/2603.19275
作者: Mengxian Lyu,Cheng Peng,Ziyi Chen,Mengyuan Zhang,Jieting Li Lu,Yonghui Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the “pre-training, fine-tuning” strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the “cold start” problem reported in previous studies as a learning barrier. Our findings support the use of “pre-training, mid-training, fine-tuning,” instead of the widely used direct fine-tuning strategy.
[NLP-58] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在临床诊断应用中,因缺乏对基础多模态推理能力与文献检索能力的分离评估而导致的性能瓶颈问题。现有基准主要聚焦于端到端问答场景,难以区分模型在临床推理和证据获取两方面的实际表现。为此,作者提出了Clinical Understanding and Retrieval Evaluation (CURE) 基准,其关键在于构建了500个映射至医师引用参考文献的多模态临床病例,并通过受控证据设置,在封闭式与开放式诊断任务中分别评估模型的推理与检索能力。实验结果揭示了先进模型在提供医生参考证据时可达到73.4%的鉴别诊断准确率,但在依赖自主检索机制时性能骤降至25.4%,凸显出整合多模态临床证据与精准文献检索的双重挑战。
链接: https://arxiv.org/abs/2603.19274
作者: Yannian Gu,Zhongzhen Huang,Linjie Mu,Xizhuo Zhang,Shaoting Zhang,Xiaofan Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model’s foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at this https URL.
[NLP-59] LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源语言中安全对齐(Safety Alignment)失效的问题,即当有害意图以西非地区低资源语言(如约鲁巴语、豪萨语、伊博语和伊加拉语)表达时,模型的拒绝响应机制(Refusal Mechanism)往往无法像在英语中那样有效激活。解决方案的关键在于提出首个系统性跨语言拒绝退化评估基准——LSR(Linguistic Safety Robustness),其采用双探针评估协议(Dual-Probe Evaluation Protocol),通过向同一模型同时输入英文与目标语言的探测样本,并引入拒绝中心偏移(Refusal Centroid Drift, RCD)这一量化指标,精确测量模型在不同语言下对有害意图的拒绝能力下降程度。实验表明,尽管英语中的拒绝率保持在约90%,但在西非语言中显著下降至35–55%,尤其在伊加拉语中RCD高达0.55,凸显了跨语言安全对齐的严峻挑战。
链接: https://arxiv.org/abs/2603.19273
作者: Godwin Abuh Faruna
机构: Fagmart Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages. Reference implementation: this https URL . Dataset: this https URL
Abstract:Safety alignment in large language models relies predominantly on English-language training data. When harmful intent is expressed in low-resource languages, refusal mechanisms that hold in English frequently fail to activate. We introduce LSR (Linguistic Safety Robustness), the first systematic benchmark for measuring cross-lingual refusal degradation in West African languages: Yoruba, Hausa, Igbo, and Igala. LSR uses a dual-probe evaluation protocol - submitting matched English and target-language probes to the same model - and introduces Refusal Centroid Drift (RCD), a metric that quantifies how much of a model’s English refusal behavior is lost when harmful intent is encoded in a target language. We evaluate Gemini 2.5 Flash across 14 culturally grounded attack probes in four harm categories. English refusal rates hold at approximately 90 percent. Across West African languages, refusal rates fall to 35-55 percent, with Igala showing the most severe degradation (RCD = 0.55). LSR is implemented in the Inspect AI evaluation framework and is available as a PR-ready contribution to the UK AISI’s inspect_evals repository. A live reference implementation and the benchmark dataset are publicly available.
[NLP-60] Transformers are Stateless Differentiable Neural Computers
【速读】: 该论文试图解决的问题是:如何从计算架构的角度统一理解Transformer与Differentiable Neural Computer (DNC)之间的关系,从而为现代大规模语言模型提供一个更具原则性的计算框架。其解决方案的关键在于通过形式化推导表明,因果Transformer层本质上等价于一种无状态的DNC(sDNC),其中控制器无循环内部状态、外部记忆为一次性写入的值向量矩阵、基于键的内容寻址实现注意力机制,且多头注意力对应多个并行读取头;进一步扩展该等价性至交叉注意力,证明编码器-解码器结构的Transformer即为具有独立读取和写入记忆空间的sDNC。这一发现揭示了Transformer的本质是一种以记忆为中心的计算模型,为理解大语言模型的内在工作机制提供了新的理论视角。
链接: https://arxiv.org/abs/2603.19272
作者: Bo Tang,Weiwei Xie
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages
Abstract:Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.
[NLP-61] A Human-Centered Workflow for Using Large Language Models in Content Analysis
【速读】: 该论文旨在解决当前研究中对大语言模型(Large Language Models, LLMs)应用方式的局限性问题,即多数研究仅通过对话式接口使用LLMs,而未能充分发挥其作为通用文本处理工具的潜力。论文提出了一种以人类为中心的系统化工作流程,将LLMs应用于定性与定量内容分析任务(包括标注、摘要生成和信息抽取),并通过明确的设计、监督与验证机制确保研究过程的严谨性和透明度。解决方案的关键在于:一是将LLMs视为可编程的文本处理引擎而非简单交互工具;二是整合多学科方法论文献,构建结构化的验证程序与最佳实践,以应对LLM的黑箱特性、提示敏感性和幻觉问题,从而提升其在实证研究中的可信度与可复现性。
链接: https://arxiv.org/abs/2603.19271
作者: Ivan Zupic
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While many researchers use Large Language Models (LLMs) through chat-based access, their real potential lies in leveraging LLMs via application programming interfaces (APIs). This paper conceptualizes LLMs as universal text processing machines and presents a comprehensive workflow for employing LLMs in three qualitative and quantitative content analysis tasks: (1) annotation (an umbrella term for qualitative coding, labeling and text classification), (2) summarization, and (3) information extraction. The workflow is explicitly human-centered. Researchers design, supervise, and validate each stage of the LLM process to ensure rigor and transparency. Our approach synthesizes insights from extensive methodological literature across multiple disciplines: political science, sociology, computer science, psychology, and management. We outline validation procedures and best practices to address key limitations of LLMs, such as their black-box nature, prompt sensitivity, and tendency to hallucinate. To facilitate practical implementation, we provide supplementary materials, including a prompt library and Python code in Jupyter Notebook format, accompanied by detailed usage instructions.
[NLP-62] Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation
【速读】: 该论文旨在解决当前单体式智能体架构在处理复杂用户需求时面临的可扩展性差、错误传播严重以及跨任务专注力难以维持等问题。其解决方案的关键在于提出一种结构化的分层多智能体框架Autonoma,通过明确划分高层协调(Coordinator)、规划(Planner)与监督执行(Supervisor)三个层级,实现任务意图验证、结构化工作流生成及动态代理调度的解耦;其中,监督层通过协调一系列模块化、专用型代理(如网页浏览、代码编写、文件管理等),结合主动监控与错误处理机制,在保障系统鲁棒性的同时支持新能力以即插即用方式扩展,从而有效提升端到端自动化流程的可靠性与灵活性。
链接: https://arxiv.org/abs/2603.19270
作者: Eslam Reda,Maged Yasser,Sara El-Metwally
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 Pages, 3 Figures
Abstract:The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi-agent framework designed for end-to-end workflow automation from natural language prompts. Autonoma employs a principled, multi-tiered architecture where a high-level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug-and-play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi-modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.
[NLP-63] From Tokens To Agents: A Researcher's Guide To Understanding Large Language Models
【速读】: 该论文试图解决科研人员在研究中如何合理使用大语言模型(Large Language Models, LLMs)的问题,尤其是面对LLM的广泛应用时,缺乏系统性理解其内在机制而导致误用或低效应用的风险。解决方案的关键在于构建一个非技术性的、面向研究场景的分析框架,通过拆解LLM的六个核心组件——预训练数据、分词与嵌入、Transformer架构、概率生成、对齐机制以及代理能力(agentic capabilities),从技术基础和研究意义两个维度阐明每个模块的功能边界与适用场景,从而帮助研究人员批判性地评估LLM是否契合具体研究需求,并以基于LLM代理模拟社交媒体动态的案例研究验证该框架的实用性。
链接: https://arxiv.org/abs/2603.19269
作者: Daniele Barolo
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Researchers face a critical choice: how to use – or not use – large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This chapter makes LLMs comprehensible without requiring technical expertise, breaking down six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications, identifying specific affordances and limitations. Rather than prescriptive guidance, the chapter develops a framework for reasoning critically about whether and how LLMs fit specific research needs, finally illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
[NLP-64] Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization
【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在燃烧科学等复杂物理系统中因领域知识不足和无法遵守物理守恒定律而导致严重幻觉(hallucination)的问题。解决方案的关键在于构建一个面向燃烧科学的全栈领域增强型LLM工作流,该流程涵盖自动化领域语料库构建、增量预训练、指令微调以及可验证奖励驱动的强化学习四个核心环节,确保模型不仅掌握文本统计模式,更能真正内化物理规律,从而实现可靠科学推理。
链接: https://arxiv.org/abs/2603.19268
作者: Quanjia Xiao,Weimin Ouyang,Zonglin Yang,Tianhao Wu,Qingguo Zhou,Runze Mao,Zhi X. Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) in the direction of task adaptation and capability enhancement for professional fields demonstrate significant application potential. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
[NLP-65] Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion ICLR2026
【速读】: 该论文旨在解决从大型语言模型(Large Language Models, LLMs)中蒸馏出具备鲁棒推理能力的小型学生模型(student models)时所面临的两大核心挑战:一是学生模型容易陷入表层模式记忆(superficial pattern memorization),二是泛化性能较差(subpar generalization)。为克服这些问题,作者提出了一种新颖的蒸馏框架,其关键创新在于两个方面:首先,引入解释性反转(Explanatory Inversion, EI),通过生成针对性的“解释探针”迫使学生模型阐明答案背后的逻辑,而非简单记忆;其次,设计了基于强化学习的**解释性GRPO(Explanatory GRPO, EXGRPO)**算法,结合一种新型的对话结构效用奖励机制(Dialogue Structure Utility Bonus),显式奖励学生模型在多个探针间保持连贯的推理过程,从而提升其深层概念理解与泛化能力。实验证明,该方法显著优于现有基线,在12个数据集上平均提升达20.39%(相对于零样本性能)和6.02%(相对于最优蒸馏基线)。
链接: https://arxiv.org/abs/2603.19266
作者: Zhen Tan,Chengshuai Zhao,Song Wang,Jundong Li,Tianlong Chen,Huan Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026
Abstract:Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted “explanatory probes” that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks. Implementation is released at this https URL.
[NLP-66] Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
【速读】: 该论文旨在解决在医疗和生物医学等领域中,利用预训练大语言模型(Large Language Models, LLM)进行任务特定测试集构建时,因专家标注成本高而导致的基准测试开发困难问题。尤其在生成式问答(Generative Question Answering)任务中,选项动态性会显著影响模型决策边界,现有主动采样框架支持不足。解决方案的关键在于提出一种不确定性感知的主动测试框架——生成式主动测试(Generative Active Testing, GAT),其核心创新是引入一个新颖的语句适配模块(Statement Adaptation Module),将生成式任务转化为伪分类格式,从而有效捕捉未标注样本层面的不确定性;同时结合零样本采集函数,在不依赖额外标注的情况下显著降低估计误差约40%,实现了高效、低成本且可扩展的模型基准测试方案。
链接: https://arxiv.org/abs/2603.19264
作者: Aashish Anantha Ramakrishnan,Ardavan Saeedi,Hamid Reza Hassanzadeh,Fazlolah Mohaghegh,Dongwon Lee
机构: The Pennsylvania State University; Optum AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:With the widespread adoption of pre-trained Large Language Models (LLM), there exists a high demand for task-specific test sets to benchmark their performance in domains such as healthcare and biomedicine. However, the cost of labeling test samples while developing new benchmarks poses a significant challenge, especially when expert annotators are required. Existing frameworks for active sample selection offer limited support for generative Question Answering tasks, where option dynamics can affect model decision boundaries. In this paper, we present Generative Active Testing (GAT), an uncertainty-aware acquisition framework leveraging LLMs as surrogates for informing the sample selection process. Using a novel Statement Adaptation Module, we modify generative tasks into a pseudo-classification format, enabling the capture of sample-level uncertainties across unlabeled candidates. Our zero-shot acquisition functions reduce estimation error by ~40% compared to traditional sampling baselines, offering a scalable solution for cost-effective model benchmarking.
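A generic uncertainty-driven acquisition rule of the kind GAT builds on can be sketched as follows: rank unlabeled candidates by the entropy of the surrogate model's answer distribution and label the most uncertain first. GAT's actual acquisition functions and its Statement Adaptation Module are more involved, and the probabilities below are made up for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(candidate_probs, k=2):
    """Pick the k unlabeled candidates whose surrogate answer distributions
    have the highest entropy -- a generic uncertainty acquisition rule."""
    ranked = sorted(candidate_probs, key=lambda item: entropy(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical surrogate probabilities over four answer options per question.
pool = [
    ("q1", [0.97, 0.01, 0.01, 0.01]),   # confident -> low value to label
    ("q2", [0.25, 0.25, 0.25, 0.25]),   # maximally uncertain
    ("q3", [0.60, 0.30, 0.05, 0.05]),
    ("q4", [0.40, 0.35, 0.15, 0.10]),
]
picked = select_most_uncertain(pool, k=2)
```

Labeling budget then goes to samples like q2 and q4, where a new label most reduces estimation error.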
[NLP-67] The α-Law of Observable Belief Revision in Large Language Model Inference
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在通过链式思维推理、自我反思或多人代理辩论等机制进行输出迭代修正时,其概率更新缺乏理论稳定性保障的问题。解决方案的关键在于识别出一个一致的乘法缩放定律(即 α-律),该定律以信念修正指数(belief revision exponent)的形式刻画了先验信念与验证证据在更新过程中的融合方式;理论证明表明,当该指数小于1时,可保证重复修订下的渐近稳定。实证分析显示,尽管单步修订中模型行为接近贝叶斯更新且略高于稳定性边界,但在多步修订中指数持续下降,呈现出收缩性长期动态,符合理论预测,从而为LLM推理系统提供了可量化的稳定性诊断工具。
链接: https://arxiv.org/abs/2603.19262
作者: Mike Farmer,Abhinav Kochar,Yugyung Lee
机构: University of Missouri–Kansas City (密苏里大学堪萨斯城分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 13 figures, 10 tables
Abstract:Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.
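The contraction behavior described above can be sketched for a binary belief. One simple reading of the multiplicative law (an illustrative parameterization, not necessarily the paper's exact form) is posterior ∝ prior**α × evidence. Iterating this shows why α < 1 yields a stable interior fixed point while α = 1 (pure Bayesian reuse of the same evidence) saturates toward certainty:

```python
def revise(p, e, alpha):
    """One multiplicative revision step for a binary belief p, given fixed
    evidence strength e: posterior ~ prior**alpha * evidence (assumed form)."""
    num = (p ** alpha) * e
    den = num + ((1 - p) ** alpha) * (1 - e)
    return num / den

def iterate(p0, e, alpha, steps=60):
    p = p0
    for _ in range(steps):
        p = revise(p, e, alpha)
    return p

# alpha < 1: repeated revision is contractive in log-odds, so very
# different priors converge to the same interior fixed point.
lim_a = iterate(0.5, e=0.7, alpha=0.8)
lim_b = iterate(0.95, e=0.7, alpha=0.8)

# alpha = 1: belief saturates toward certainty instead of stabilizing.
sat = iterate(0.5, e=0.7, alpha=1.0)
```

In log-odds the update is linear with slope α, so |α| < 1 gives geometric convergence, matching the stability condition stated in the abstract.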
[NLP-68] Significance-Gain Pair Encoding for LLM s: A Statistical Alternative to Frequency-Based Subword Merging
【速读】: 该论文旨在解决传统子词分词(subword tokenization)方法中因仅基于配对频率选择合并操作而导致的语义粘合性误判问题,即高频共现对可能源于边际频次较高而非真正的语义关联。其解决方案的关键在于提出一种新的合并准则——显著性增益BPE(Significance-Gain BPE),该方法通过在独立性零假设下使用z统计量衡量配对的显著性,并结合显式的压缩感知增益项,在保留压缩效率的同时提升分词结果的语义合理性。实验表明,该方法在WikiText-103数据集上显著降低了困惑度(perplexity)和比特每字符(BPC),验证了基于统计显著性的分词策略可提升语言模型的预测效率。
链接: https://arxiv.org/abs/2603.19261
作者: Azam Nouri
机构: Lincoln University (林肯大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 1 figure
Abstract:Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
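The merge criterion can be sketched as follows: for each adjacent token pair, compare the observed count to its expectation under an independence null via a z-statistic, then add a compression-aware term proportional to the pair count. The weighting `lam` and the exact combination rule are illustrative assumptions, not the paper's formula:

```python
import math
from collections import Counter

def pair_scores(tokens, lam=0.5):
    """Score adjacent pairs by z = (count - expected) / sd under an
    independence null, plus a compression-aware gain term lam * count."""
    n = len(tokens)
    uni = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (a, b), c in pairs.items():
        p_null = (uni[a] / n) * (uni[b] / n)          # independence null
        expected = (n - 1) * p_null                    # expected pair count
        sd = math.sqrt((n - 1) * p_null * (1 - p_null))
        z = (c - expected) / sd if sd > 0 else 0.0
        scores[(a, b)] = z + lam * c                   # significance + gain
    return scores

tokens = list("abababab" + "cd" + "xxxx")
scores = pair_scores(tokens)
best = max(scores, key=scores.get)
```

Note how the rare pair ('c', 'd') outscores the more frequent chance co-occurrence ('b', 'c'): significance, unlike raw frequency, credits cohesion rather than high marginal counts.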
[NLP-69] HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
【速读】: 该论文旨在解决手语机器翻译(Sign Language Machine Translation, SLMT)中存在的数据稀缺、签名者多样性不足以及签名动作模式与预训练表征之间存在显著领域差异的问题,这些问题导致现有迁移学习方法静态且易过拟合。解决方案的关键在于提出一种分层自适应迁移学习(Hierarchical Adaptive Transfer Learning, HATL)框架,其核心机制包括:基于训练性能动态逐层解冻预训练模型参数、采用逐层学习率衰减策略以平衡通用特征保留与特定签名特征适应,并引入稳定性机制提升跨语言和签名变异的鲁棒性。实验表明,HATL在多个数据集上均优于传统迁移学习方法,尤其在MedASL数据集上,结合自适应Transformer(ADAT)时BLEU-4指标提升达37.6%。
链接: https://arxiv.org/abs/2603.19260
作者: Nada Shahin,Leila Ismail
机构: United Arab Emirates University (阿联酋大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注:
Abstract:Sign Language Machine Translation (SLMT) aims to bridge communication between Deaf and hearing individuals. However, its progress is constrained by scarce datasets, limited signer diversity, and large domain gaps between sign motion patterns and pretrained representations. Existing transfer learning approaches in SLMT are static and often lead to overfitting. These challenges call for the development of an adaptive framework that preserves pretrained structure while remaining robust across linguistic and signing variations. To fill this void, we propose a Hierarchical Adaptive Transfer Learning (HATL) framework, where pretrained layers are progressively and dynamically unfrozen based on training performance behavior. HATL combines dynamic unfreezing, layer-wise learning rate decay, and stability mechanisms to preserve generic representations while adapting to sign characteristics. We evaluate HATL on Sign2Text and Sign2Gloss2Text translation tasks using a pretrained ST-GCN++ backbone for feature extraction and the Transformer and an adaptive transformer (ADAT) for translation. To ensure robust multilingual generalization, we evaluate the proposed approach across three datasets: RWTH-PHOENIXWeather-2014 (PHOENIX14T), Isharah, and MedASL. Experimental results show that HATL consistently outperforms traditional transfer learning approaches across tasks and models, with ADAT achieving BLEU-4 improvements of 15.0% on PHOENIX14T and Isharah and 37.6% on MedASL.
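The combination of dynamic unfreezing and layer-wise learning-rate decay described above can be sketched in a few lines. The patience rule, decay factor, and top-down unfreezing order are illustrative assumptions about HATL's schedule, not its exact mechanism:

```python
def layerwise_lrs(n_layers, base_lr=1e-4, decay=0.9):
    """Layer-wise learning-rate decay: the top (task-specific) layer gets the
    base rate; deeper pretrained layers get geometrically smaller rates."""
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]

class DynamicUnfreezer:
    """Unfreeze one more pretrained layer (top-down) each time validation
    loss fails to improve for `patience` evaluations -- a simplified
    stand-in for performance-driven progressive unfreezing."""
    def __init__(self, n_layers, patience=2):
        self.n_layers = n_layers
        self.patience = patience
        self.unfrozen = 1            # start with only the top layer trainable
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        if val_loss < self.best - 1e-6:
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
            if self.stale >= self.patience and self.unfrozen < self.n_layers:
                self.unfrozen += 1   # open up the next deeper layer
                self.stale = 0
        return self.unfrozen

lrs = layerwise_lrs(4)
sched = DynamicUnfreezer(n_layers=4, patience=2)
history = [sched.step(l) for l in [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]]
```

In a real trainer, `lrs` would feed optimizer parameter groups and `sched.unfrozen` would gate which layers receive gradients, so deep generic features change slowly while adaptation happens only when progress stalls.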
[NLP-70] Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis
【速读】: 该论文旨在解决台湾闽南语(Taigi)语音识别与合成系统缺乏标准化评估框架的问题,从而阻碍了模型性能的可比性和跨语言泛化能力的提升。其解决方案的关键在于构建了一个名为Breeze Taigi的综合性基准框架,通过引入30对精心标注的闽南语-普通话音频对、统一的字符错误率(Character Error Rate, CER)指标以及规范化处理流程,实现了可复现的评估方法;同时利用台湾普通话资源和大规模合成数据生成技术,对Whisper模型进行微调,在约10,000小时合成语音数据上训练出性能优越的语音识别模型(平均CER达30.13%),显著优于现有商业与研究系统,为多语言语音技术提供了可迁移的方法论基础。
链接: https://arxiv.org/abs/2603.19259
作者: Yu-Siang Lan,Chia-Sheng Liu,Yi-Chang Chen,Po-Chun Hsu,Allyson Chiu,Shun-Wen Lin,Da-shan Shiu,Yuan-Fu Liao
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); MediaTek Research (联发科技研究部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Taiwanese Hokkien (Taigi) presents unique opportunities for advancing speech technology methodologies that can generalize to diverse linguistic contexts. We introduce Breeze Taigi, a comprehensive framework centered on standardized benchmarks for evaluating Taigi speech recognition and synthesis systems. Our primary contribution is a reproducible evaluation methodology that leverages parallel Taiwanese Mandarin resources. We provide 30 carefully curated Mandarin-Taigi audio pairs from Taiwan’s Executive Yuan public service announcements with normalized ground truth transcriptions. We establish Character Error Rate (CER) as the standard metric and implement normalization procedures to enable fair cross-system comparisons. To demonstrate the benchmark’s utility and provide reference implementations, we develop speech recognition and synthesis models through a methodology that leverages existing Taiwanese Mandarin resources and large-scale synthetic data generation. In particular, we fine-tune a Whisper model on approximately 10,000 hours of Taigi synthetic speech data. Our ASR model achieves 30.13% average CER on the benchmark, outperforming existing commercial and research systems. By providing standardized evaluation protocols, diverse training datasets, and open baseline models, we offer a replicable framework with methodologies applicable to various linguistic contexts.
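摘要中作为标准指标的字符错误率(CER)通常按"字符级编辑距离除以参考文本长度"计算。下面给出一个极简的 Python 示意(并非论文代码,也省略了文中提到的规范化步骤):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over characters.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, and substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[len(hyp)]

def cer(reference, hypothesis):
    # Character Error Rate: edit operations needed / reference length.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

例如 cer("abcd", "abcf") 为 0.25(四个字符中替换一个)。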
[NLP-71] MAPLE: Metadata Augmented Private Language Evolution
【速读】: 该论文旨在解决在仅通过专有API访问大语言模型(Large Language Models, LLMs)时,进行差分隐私(Differentially Private, DP)微调计算成本过高或不可行的问题。针对这一挑战,现有方法如Private Evolution(PE)依赖于初始合成数据分布的合理性,但在目标领域与基础模型预训练先验差异较大时,常因初始化不佳导致性能下降、收敛缓慢及API资源浪费。论文提出的关键解决方案是Metadata Augmented Private Language Evolution (MAPLE),其核心在于利用差分隐私的表格型元数据提取(differentially private tabular metadata extraction)和上下文学习(in-context learning),将初始合成数据分布有效锚定至目标领域,从而显著提升隐私-效用权衡、加快收敛速度并大幅降低API调用成本。
链接: https://arxiv.org/abs/2603.19258
作者: Eli Chien,Yuzheng Hu,Ryan McKenna,Shanshan Wu,Zheng Xu,Peter Kairouz
机构: National Taiwan University (国立台湾大学); University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); Google Research (谷歌研究); Meta (Meta)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Preliminary work
Abstract:While differentially private (DP) fine-tuning of large language models (LLMs) is a powerful tool, it is often computationally prohibitive or infeasible when state-of-the-art models are only accessible via proprietary APIs. In such settings, generating DP synthetic data has emerged as a crucial alternative, offering the added benefits of arbitrary reuse across downstream tasks and transparent exploratory data analysis without the opaque constraints of a model’s parameter space. Private Evolution (PE) is a promising API-based framework for this goal; however, its performance critically depends on initialization. When the private data distribution deviates substantially from the foundation model’s pre-training priors–particularly in highly specialized domains–PE frequently struggles to align with the target data, resulting in degraded utility, poor convergence, and inefficient API usage. To address this initialization bottleneck, we propose Metadata Augmented Private Language Evolution (MAPLE). MAPLE leverages differentially private tabular metadata extraction and in-context learning to effectively ground the initial synthetic distribution in the target domain. Extensive experiments on challenging, domain-specific text generation tasks demonstrate that MAPLE achieves a significantly more favorable privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous PE methods.
[NLP-72] Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models
【速读】: 该论文旨在解决现实世界中路径规划任务因约束条件复杂多样(如路线数量、最大路径长度、枢纽位置及任务特定要求等)而导致传统方法难以泛化和扩展的问题。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的灵活框架,通过自然语言输入直接解析并求解约束路径规划问题:首先利用预定义模板库匹配已知问题类型,对未见过的问题则采用上下文学习方式自主推导问题表示;随后通过迭代式解生成与验证机制(受遗传算法启发的自我修正过程)不断优化候选解,从而在无需人工干预的前提下实现可行且逐步最优的路径规划。
链接: https://arxiv.org/abs/2603.19257
作者: Dylan Shim,Minghan Wei
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by 2026 SPIE Security + Defense Conference
Abstract:Real-world path planning tasks typically involve multiple constraints beyond simple route optimization, such as the number of routes, maximum route length, depot locations, and task-specific requirements. Traditional approaches rely on dedicated formulations and algorithms for each problem variant, making them difficult to scale across diverse scenarios. In this work, we propose a flexible framework that leverages large language models (LLMs) to solve constrained path planning problems directly from natural language input. The core idea is to allow users to describe routing tasks conversationally, while enabling the LLM to interpret and solve the problem through solution verification and iterative refinement. The proposed method consists of two integrated components. For problem types that have been previously formulated and studied, the LLM first matches the input request to a known problem formulation in a library of pre-defined templates. For novel or unseen problem instances, the LLM autonomously infers a problem representation from the natural language description and constructs a suitable formulation in an in-context learning manner. In both cases, an iterative solution generation and verification process guides the LLM toward producing feasible and increasingly optimal solutions. Candidate solutions are compared and refined through multiple rounds of self-correction, inspired by genetic-algorithm-style refinement. We present the design, implementation, and evaluation of this LLM-based framework, demonstrating its capability to handle a variety of constrained path planning problems. This method provides a scalable and generalizable approach for solving real-world routing tasks with minimal human intervention, while enabling flexible problem specification through natural language.
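摘要中"生成—验证—迭代修正"的循环可以用如下草图示意。其中 dist、verify、refine 等名称均为示意,LLM 的候选生成步骤用固定候选列表代替:

```python
def route_length(route, dist):
    # Sum pairwise distances along a route such as ["A", "B", "C"].
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def verify(routes, dist, max_routes, max_len):
    # A candidate solution is feasible iff it respects both constraints.
    return (len(routes) <= max_routes
            and all(route_length(r, dist) <= max_len for r in routes))

def refine(candidates, dist, max_routes, max_len):
    # One verify-and-select round: keep feasible candidates and
    # rank them by total travelled distance (lower is better).
    feasible = [c for c in candidates if verify(c, dist, max_routes, max_len)]
    return sorted(feasible, key=lambda c: sum(route_length(r, dist) for r in c))
```

实际系统中,verify 失败的原因会反馈给 LLM 以生成下一轮候选解。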
[NLP-73] ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization
【速读】: 该论文旨在解决孟加拉语(Bengali)在自动语音识别(ASR)和说话人分离(speaker diarization)研究中严重资源匮乏的问题。针对这一挑战,作者提出了一种以数据为中心的解决方案:通过从孟加拉语YouTube有声读物和戏剧中构建高质量训练语料库,结合大语言模型(LLM)辅助的语言规范化、基于模糊匹配的片段边界验证以及闷音区(muffled-zone)增强技术,显著提升数据质量;同时,在极低资源条件下(仅10个训练文件),对社区分割模型进行针对性超参数优化,实现了具有竞争力的性能表现——ASR任务的词错误率(WER)低至15.551,说话人分离任务的分离错误率(Diarization Error Rate, DER)为0.19974。关键在于精细的数据工程与领域自适应微调策略,能够在缺乏大规模标注语料的情况下实现高效建模。
链接: https://arxiv.org/abs/2603.19256
作者: Md. Nazmus Sakib,Shafiul Tanvir,Mesbah Uddin Ahamed,H.M. Aktaruzzaman Mukdho
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 4 figures
Abstract:Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task 1) and Bengali Speaker Diarization Challenge (Task 2). For Task 1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task 2, we fine-tune the this http URL community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard, and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
[NLP-74] LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在遵循长度指令时难以精确控制输出长度的问题。现有方法主要依赖外部长度信号或优化目标来施加约束,却忽视了模型内在的长度认知能力不足这一根本限制。解决方案的关键在于提出LARFT(Length-Aware Reinforcement Fine-Tuning)训练框架,通过引入基于事后长度感知(hindsight length awareness)的强化学习机制,使模型能够从自身生成结果中学习长度识别能力,并在此基础上联合优化其内部长度表征与生成策略,从而实现对长度指令的精准且可靠的响应。
链接: https://arxiv.org/abs/2603.19255
作者: Wei Zhang,Lintong Du,Yuanhe Zhang,Zhenhong Zhou,Kun Wang,Li Sun,Sen Su
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures
Abstract:Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model’s intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model’s length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model’s internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.
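下面以一个极简的 Python 草图示意"面向长度的奖励塑形"与"事后长度自感知任务"这两个思路(仅为示意,并非 LARFT 的实际目标函数或数据格式):

```python
def length_reward(target_len, actual_len):
    # Reward peaks at an exact match and decays linearly with the
    # relative deviation from the instructed length (floored at 0).
    deviation = abs(actual_len - target_len) / max(target_len, 1)
    return max(0.0, 1.0 - deviation)

def hindsight_task(generation):
    # Turn an on-policy generation into a self-awareness example:
    # the model must identify how long its own output actually is.
    return {"question": "How many words does this response contain?",
            "response": generation,
            "answer": len(generation.split())}
```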
[NLP-75] From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生成金融研究报告时存在的可靠性问题,尤其是事实性错误、数据不一致、虚构引用及浅层分析等缺陷,这些问题可能误导企业基本面评估并引发严重经济损失。现有金融评测基准多聚焦于理解能力而非报告生成质量,且缺乏对深层分析技能的结构化评估体系,导致关键分析瓶颈难以识别。解决方案的关键在于提出FinReasoning基准,其将中文研究报告生成过程解构为三个符合真实分析师工作流的阶段——语义一致性、数据对齐与深度洞察,并配套设计了一个细粒度评估框架,强化幻觉修正能力检测,并引入12项指标构成的核心分析能力评分体系,从而系统性揭示模型在“理解-执行”差距上的表现差异,为提升金融生成式AI(Generative AI)的可信度提供可量化的评估路径。
链接: https://arxiv.org/abs/2603.19254
作者: Yiyun Zhu,Yidong Jiang,Ziwen Xu,Yinsheng Yao,Dawei Cheng,Jinru Ding,Yejie Zheng,Jie Xu
机构: Tongji University (同济大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); East Money Information Co., Ltd. (东方财富信息股份有限公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures–factual errors, numerical inconsistencies, fabricated references, and shallow analysis–that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at this https URL.
[NLP-76] A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在论点挖掘(Argument Mining, AM)任务中性能评估不足的问题,特别是针对大语言模型(Large Language Models, LLMs)在论点分类上的表现缺乏系统性量化与定性分析。解决方案的关键在于构建一个综合评估框架,涵盖多个公开的论点分类语料库(如UKP和this http URL),并引入先进的提示策略(prompting strategies),包括思维链(Chain-of-Thought prompting)、提示重述(prompt rephrasing)、投票机制(voting)和置信度估计(certainty-based classification)。这些技术显著提升了模型的准确率和F1分数(提升2%–8%),同时通过定性错误分析揭示了当前LLMs在提示敏感性、隐含批评识别、复杂论证结构解析及论点与主张对齐等方面的系统性缺陷,为后续改进提供了明确方向。
链接: https://arxiv.org/abs/2603.19253
作者: Marcin Pietroń,Filip Gampel,Jakub Gomułka,Andrzej Tomski,Rafał Olszowski
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Argument mining (AM) is an interdisciplinary research field focused on the automatic identification and classification of argumentative components, such as claims and premises, and the relationships between them. Recent advances in large language models (LLMs) have significantly improved the performance of argument classification compared to traditional machine learning approaches. This study presents a comprehensive evaluation of several state-of-the-art LLMs, including GPT-5.2, Llama 4, and DeepSeek, on large publicly available argument classification corpora such as this http URL and UKP. The evaluation incorporates advanced prompting strategies, including Chain-of-Thought prompting, prompt rephrasing, voting, and certainty-based classification. Both quantitative performance metrics and qualitative error analysis are conducted to assess model behavior. The best-performing model in the study (GPT-5.2) achieves a classification accuracy of 78.0% (UKP) and 91.9% (this http URL). The use of prompt rephrasing, multi-prompt voting, and certainty estimation further improves classification performance and robustness. These techniques increase the accuracy and F1 metric of the models by typically a few percentage points (from 2% to 8%). However, qualitative analysis reveals systematic failure modes shared across models, including instabilities with respect to prompt formulation, difficulties in detecting implicit criticism, interpreting complex argument structures, and aligning arguments with specific claims. This work contributes the first comprehensive evaluation that combines quantitative benchmarking and qualitative error analysis on multiple argument mining datasets using advanced LLM prompting strategies.
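摘要中提到的多提示投票与置信度估计可以用如下草图示意:对同一输入经多次提示改写所得的标签做多数投票,并以得票比例作为置信度的粗略代理(示意实现,非论文代码):

```python
from collections import Counter

def vote_with_certainty(labels):
    # Majority vote over predictions from rephrased prompts;
    # the winner's vote share serves as a simple certainty proxy.
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)
```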
[NLP-77] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在几何符号推理能力评估中存在的局限性问题,尤其是现有基准测试在规模、视觉-文本对齐以及多步证明任务设计上的不足。为应对这一挑战,作者提出了GeoChallenge数据集,其关键创新在于:自动构建了包含9万道多选题的几何证明问题,每道题均需基于对齐的文本描述与图形进行多步骤推理;同时提供细粒度复杂度标注和形式化语言注释,支持可控且可靠的模型评估。实验表明,尽管先进LLMs在该任务上表现优于基线,但与人类水平仍存在显著差距(最佳模型GPT-5-nano精确匹配率为75.89%,而人类为94.74%),并揭示出模型在多选场景下的精确匹配失败、视觉依赖薄弱及推理发散无收敛等三类典型错误模式。
链接: https://arxiv.org/abs/2603.19252
作者: Yushun Zhang,Weiping Fu,Zesheng Yang,Bo Zhao,Lingling Zhang,Jian Zhang,Yumeng Fu,Jiaxing Huang,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Shaanxi Province Key Laboratory of Big Data Knowledge Engineering (陕西省大数据知识工程重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures, 8 tables
Abstract:Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
[NLP-78] Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长篇法律文档时性能下降的问题,尤其是因幻觉(hallucination)导致的错误条款或先例引用,这严重削弱了法律场景下对模型输出的可靠性与信任度。现有检索增强生成(Retrieval Augmented Generation, RAG)方法在法律领域仍受限,特别是在小规模本地部署模型中难以保障准确性和安全性。针对此问题,论文识别出两个关键失败模式:一是由于法律语料库中词汇冗余导致的检索错误,二是模型在上下文不足时仍强行生成答案的解码错误。解决方案的核心在于提出**元数据增强的混合RAG(Metadata Enriched Hybrid RAG)以提升文档级别的检索精度,并引入直接偏好优化(Direct Preference Optimization, DPO)**策略,在上下文不充分时强制模型安全拒绝生成,从而显著提升法律语言模型的接地性(grounding)、可靠性和安全性。
链接: https://arxiv.org/abs/2603.19251
作者: Suyash Maniyar,Deepali Singh,Rohith Reddy
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL)
备注: 12 pages including Appendix
Abstract:Large Language Models (LLMs) perform well in short contexts but degrade on long legal documents, often producing hallucinations such as incorrect clauses or precedents. In the legal domain, where precision is critical, such errors undermine reliability and trust. Retrieval Augmented Generation (RAG) helps ground outputs but remains limited in legal settings, especially with small, locally deployed models required for data privacy. We identify two failure modes: retrieval errors due to lexical redundancy in legal corpora, and decoding errors where models generate answers despite insufficient context. To address this, we propose Metadata Enriched Hybrid RAG to improve document level retrieval, and apply Direct Preference Optimization (DPO) to enforce safe refusal when context is inadequate. Together, these methods improve grounding, reliability, and safety in legal language models.
[NLP-79] Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams
【速读】: 该论文旨在解决语言模型在流式文档环境(streaming document environments)中评估不足的问题,特别是当多个并发事件混合在同一文档流中时,现有基准测试无法有效评估模型对冲突信息的处理能力。解决方案的关键在于构建了一个名为StreamBench的新基准,该基准基于2016年和2025年的重大新闻事件,包含605个事件和15,354份文档,涵盖主题聚类、时间问答和摘要三个任务。通过对比引入结构化提示(structural cues)前后模型的表现,发现结构化提示能显著提升聚类(最高+4.37%)和时间问答(最高+9.63%)性能,帮助模型更准确地定位相关信息并区分不同事件,表明结构化提示是提升大规模文档流中语言模型表现的重要方向。
链接: https://arxiv.org/abs/2603.19250
作者: Yukyung Lee,Yebin Lim,Woojun Jung,Wonjun Choi,Susik Yoon
机构: Boston University (波士顿大学); Korea University (高丽大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.
[NLP-80] Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation ALT
【速读】: 该论文旨在解决医疗问答(Healthcare Question-Answering, QA)系统中因用户输入拼写错误导致检索性能下降的问题。研究发现,真实医疗查询中存在高达61.5%的拼写错误率(词级错误率为11.0%),显著高于专业文档中的错误率,从而影响了基于BM25和TF-IDF的文本检索效果。解决方案的关键在于对查询端进行拼写纠正,而非仅修正语料库;实验表明,使用编辑距离(edit distance)或上下文感知候选排序方法进行查询侧校正,可使平均倒数排名(MRR)提升9.2%,归一化折损累计增益(NDCG@10)提升8.3%,而仅校正语料库则几乎无改善(+0.5% MRR),证明查询端拼写纠正才是提升检索性能的核心干预措施。
链接: https://arxiv.org/abs/2603.19249
作者: Saurabh K Singh
机构: Oracle Corporation (甲骨文公司)
类目: Computation and Language (cs.CL)
备注: 13 pages, 5 tables. Empirical study using TREC 2017 LiveQA Medical and HealthSearchQA datasets
Abstract:Healthcare question-answering (QA) systems face a persistent challenge: users submit queries with spelling errors at rates substantially higher than those found in the professional documents they search. This paper presents the first controlled study of spelling correction as a retrieval preprocessing step in healthcare QA using real consumer queries. We conduct an error census across two public datasets – the TREC 2017 LiveQA Medical track (104 consumer health questions) and HealthSearchQA (4,436 health queries from Google autocomplete) – finding that 61.5% of real medical queries contain at least one spelling error, with a token-level error rate of 11.0%. We evaluate four correction methods – conservative edit distance, standard edit distance (Levenshtein), context-aware candidate ranking, and SymSpell – across three experimental conditions: uncorrected queries against an uncorrected corpus (baseline), uncorrected queries against a corrected corpus, and fully corrected queries against a corrected corpus. Using BM25 and TF-IDF cosine retrieval over 1,935 MedQuAD answer passages with TREC relevance judgments, we find that query correction substantially improves retrieval – edit distance and context-aware correction achieve MRR improvements of +9.2% and NDCG@10 improvements of +8.3% over the uncorrected baseline. Critically, correcting only the corpus without correcting queries yields minimal improvement (+0.5% MRR), confirming that query-side correction is the key intervention. We complement these results with a 100-sample error analysis categorising correction outcomes per method and provide evidence-based recommendations for practitioners.
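查询端拼写纠正的思路可以用如下草图示意:对词表外的查询词,在领域词表中寻找最相近的词。这里用 Python 标准库 difflib 代替论文中的编辑距离候选排序,词表内容仅为举例:

```python
import difflib

# Illustrative in-domain vocabulary; a real system would use a medical lexicon.
VOCAB = {"diabetes", "insulin", "hypertension", "symptoms"}

def correct_query(query, vocab=VOCAB, cutoff=0.8):
    out = []
    for tok in query.lower().split():
        if tok in vocab:
            out.append(tok)  # in-vocabulary tokens pass through unchanged
        else:
            match = difflib.get_close_matches(tok, vocab, n=1, cutoff=cutoff)
            out.append(match[0] if match else tok)
    return " ".join(out)
```

纠正后的查询再送入 BM25 或 TF-IDF 检索,即对应文中"查询端校正"这一关键干预。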
[NLP-81] DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration Augmentation and Evolution
【速读】: 该论文旨在解决沉浸式对话系统在实际部署中面临的响应速度与长程任务能力之间的权衡问题:轻量级交互可实现实时响应,但涉及规划和工具调用(如搜索、媒体生成)的复杂请求会产生重尾执行延迟,从而破坏轮次切换效率、角色一致性及用户信任。解决方案的关键在于提出DuCCAE(Conversation while Collaboration with Augmentation and Evolution),一种混合引擎架构,通过将实时响应生成与异步代理执行解耦,并借助共享状态维护会话上下文和执行轨迹,使异步结果能够无缝回流至当前对话中,从而保障对话连续性的同时实现可靠代理执行。
链接: https://arxiv.org/abs/2603.19248
作者: Xin Shen,Zhishu Jiang,Jiaye Yang,Haibo Liu,Yichen Wan,Jiarui Zhang,Tingzhi Dai,Luodong Xu,Shuchen Wu,Guanqiang QI,Chenxi Miao,Jiahui Liang,Yang Li,Weikang Li,Deguo Xia,Jizhou Huang
机构: Baidu Inc(百度公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Immersive conversational systems in production face a persistent trade-off between responsiveness and long-horizon task capability. Real-time interaction is achievable for lightweight turns, but requests involving planning and tool invocation (e.g., search and media generation) produce heavy-tail execution latency that degrades turn-taking, persona consistency, and user trust. To address this challenge, we propose DuCCAE (Conversation while Collaboration with Augmentation and Evolution), a hybrid engine for immersive conversation deployed within Baidu Search, serving millions of users. DuCCAE decouples real-time response generation from asynchronous agentic execution and synchronizes them via a shared state that maintains session context and execution traces, enabling asynchronous results to be integrated back into the ongoing dialogue. The system orchestrates five subsystems-Info, Conversation, Collaboration, Augmentation, and Evolution-to support multi-agent collaboration and continuous improvement. We evaluate DuCCAE through a comprehensive framework that combines offline benchmarking on the Du-Interact dataset and large-scale production evaluation within Baidu Search. Experimental results demonstrate that DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while reducing latency to fit strict real-time budgets. Crucially, deployment metrics since June 2025 confirm substantial real-world effectiveness, evidenced by a tripling of Day-7 user retention to 34.2% and a surge in the complex task completion rate to 65.2%. Our hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.
[NLP-82] When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models EACL
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全评估中因依赖静态有害提示集合而忽略自适应攻击场景的问题,即现有方法假设攻击者为非自适应的,无法反映真实世界中输入被迭代优化以规避安全防护的情形。解决方案的关键在于引入自动化、自适应的红队测试(red-teaming)机制,通过复用原本用于提升良性任务性能的黑盒提示优化技术,系统性地搜索安全漏洞;具体而言,作者利用DSPy框架对HarmfulQA和JailbreakBench中的初始提示进行优化,目标是最大化由独立评估模型(GPT-5.1)提供的连续危险评分(danger score),从而揭示模型在动态对抗环境下的脆弱性。实验表明,该方法显著降低了模型的安全防护效果,尤其在开源小规模模型上表现明显,说明静态基准可能低估了实际风险,强调了自动化自适应红队测试在构建鲁棒安全评估体系中的必要性。
链接: https://arxiv.org/abs/2603.19247
作者: Zafir Shamsi,Nikhil Chekuru,Zachary Guzman,Shivank Garg
机构: Algoverse AI Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EACL SRW 2026, Oral
Abstract:Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.
[NLP-83] Generalized Stock Price Prediction for Multiple Stocks Combined with News Fusion
【速读】: 该论文旨在解决股票价格预测中如何有效利用每日金融新闻数据以提升预测精度的问题。传统方法如ARIMA和循环神经网络(Recurrent Neural Networks, RNNs)虽广泛应用,但在处理非结构化新闻文本与股票间语义关联方面存在局限。解决方案的关键在于引入预训练大型语言模型(Large Language Models, LLMs)对新闻进行编码,并通过三种基于注意力机制的池化策略——自注意力、交叉注意力及位置感知自注意力池化——实现新闻内容按股票相关性过滤;同时,将筛选后的新闻嵌入与历史股价联合输入统一模型,构建一个可跨多只股票通用的预测框架,从而显著降低平均绝对误差(Mean Absolute Error, MAE),验证了股票名称嵌入在新闻过滤和价格预测中的有效性。
链接: https://arxiv.org/abs/2603.19286
作者: Pei-Jun Liao,Hung-Shin Lee,Yao-Fei Cheng,Li-Wei Chen,Hung-yi Lee,Hsin-Min Wang
机构: National Taiwan University (国立台湾大学); Academia Sinica (中央研究院); United Link Co., Ltd. (联合科技有限公司); University of Washington (华盛顿大学); National Tsing Hua University (国立清华大学)
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to Journal of Information Science and Engineering (JISE)
Abstract:Predicting stock prices presents challenges in financial forecasting. While traditional approaches such as ARIMA and RNNs are prevalent, recent developments in Large Language Models (LLMs) offer alternative methodologies. This paper introduces an approach that integrates LLMs with daily financial news for stock price prediction. To address the challenge of processing news data and identifying relevant content, we utilize stock name embeddings within attention mechanisms. Specifically, we encode news articles using a pre-trained LLM and implement three attention-based pooling techniques – self-attentive, cross-attentive, and position-aware self-attentive pooling – to filter news based on stock relevance. The filtered news embeddings, combined with historical stock prices, serve as inputs to the prediction model. Unlike prior studies that focus on individual stocks, our method trains a single generalized model applicable across multiple stocks. Experimental results demonstrate a 7.11% reduction in Mean Absolute Error (MAE) compared to the baseline, indicating the utility of stock name embeddings for news filtering and price forecasting within a generalized framework.
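以股票名嵌入作为查询的注意力池化可以用纯 Python 草图示意:按点积相关性对新闻嵌入做 softmax 加权平均(仅为示意,非论文模型):

```python
import math

def attentive_pool(news_embs, stock_emb):
    # Score each news embedding by dot product with the stock-name
    # embedding, softmax the scores, and return the weighted average.
    scores = [sum(q * n for q, n in zip(stock_emb, e)) for e in news_embs]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(stock_emb)
    return [sum(w * e[d] for w, e in zip(weights, news_embs)) for d in range(dim)]
```

与股票相关性高的新闻由此获得更大权重,过滤后的池化向量再与历史股价拼接输入预测模型。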
信息检索
[IR-0] LLM-Enhanced Semantic Data Integration of Electronic Component Qualifications in the Aerospace Domain ESWC2026
【速读】:该论文旨在解决大型制造企业因部门间数据孤岛(data silos)导致的电子元器件资格信息检索困难问题,此类问题常引发数据库间不一致性和信息错位,尤其在卫星电路板设计规划阶段,设计师难以及时获取单个元器件的资格状态,从而影响新资格认证的优化和冗余工作的规避。解决方案的关键在于构建一个融合虚拟知识图谱(Virtual Knowledge Graphs)与大语言模型(Large Language Models, LLMs)的处理管道:一方面利用虚拟知识图谱实现异构数据源的统一视图,另一方面借助LLMs提升检索效率并减少人工数据清洗成本;同时结合基于本体的数据访问(Ontology-based Data Access)进行结构化查询与向量搜索机制以匹配文本相似性,最终实现在长期效率上优于仅依赖LLM的检索增强生成(Retrieval-Augmented Generation, RAG)方法。
链接: https://arxiv.org/abs/2603.20094
作者: Antonio De Santis,Marco Balduini,Matteo Belcao,Andrea Proia,Marco Brambilla,Emanuele Della Valle
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: ESWC 2026
Abstract:Large manufacturing companies face challenges in information retrieval due to data silos maintained by different departments, leading to inconsistencies and misalignment across databases. This paper presents an experience in integrating and retrieving qualification data for electronic components used in satellite board design. Due to data silos, designers cannot immediately determine the qualification status of individual components. However, this process is critical during the planning phase, when assembly drawings are issued before production, to optimize new qualifications and avoid redundant efforts. To address this, we propose a pipeline that uses Virtual Knowledge Graphs for a unified view over heterogeneous data sources and LLMs to enhance retrieval and reduce manual effort in data cleansing. The retrieval of qualifications is then performed through an Ontology-based Data Access approach for structured queries and a vector search mechanism for retrieving qualifications based on similar textual properties. We perform a comparative cost-benefit analysis, demonstrating that the proposed pipeline also outperforms approaches relying solely on LLMs, such as Retrieval-Augmented Generation (RAG), in terms of long-term efficiency.
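摘要中基于文本相似度检索资质记录的向量搜索环节,本质上是按余弦相似度排序;以下为自包含的示意实现(函数与变量名均为假设):

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_k(query_vec, docs, k=3):
    # docs: list of (record_id, embedding); rank by cosine similarity.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```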
[IR-1] The End of Rented Discovery: How AI Search Redistributes Power Between Hotels and Intermediaries
【速读】:该论文旨在解决生成式 AI (Generative AI) 在酒店推荐场景中信息来源偏倚的问题,特别是分析不同查询意图下AI搜索结果对在线旅行社(OTA)与非OTA来源的引用差异。其解决方案的关键在于通过系统性审计1,357条来自Google Gemini的 grounding citations,发现存在“意图-来源分裂”(Intent-Source Divide)现象:体验型查询比交易型查询更倾向于引用非OTA来源(如旅游博客、本地指南等),且在日语语境下这种差异更为显著,表明AI搜索可能正在重塑酒店流量分配格局,减少对依赖佣金模式的OTA平台的单一依赖。
链接: https://arxiv.org/abs/2603.20062
作者: Peiying Zhu,Sidi Chang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 13 pages, 10 tables, Submitted to the 10th Hospitality Finance Economics Conference (HFE 2026), Tokyo, Japan
Abstract:When a traveler asks an AI search engine to recommend a hotel, which sources get cited – and does query framing matter? We audit 1,357 grounding citations from Google Gemini across 156 hotel queries in Tokyo and document a systematic pattern we call the Intent-Source Divide. Experiential queries draw 55.9% of their citations from non-OTA sources, compared to 30.8% for transactional queries – a 25.1 percentage-point gap ( p < 5 \times 10^{-20} ). The effect is amplified in Japanese, where experiential queries draw 62.1% non-OTA citations compared to 50.0% in English – consistent with a more diverse Japanese non-OTA content ecosystem. For an industry in which hotels have long paid OTAs for demand acquisition, this pattern matters because it suggests that AI search may make hotel discovery less exclusively controlled by commission-based intermediaries.
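摘要报告的 25.1 个百分点差距及其显著性(p < 5×10⁻²⁰)可用双比例 z 检验复核。论文未给出分组样本量,下面的计数是笔者虚构的一种拆分,仅用于演示检验本身:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """双比例 z 检验(双侧),返回 (z, p 值)。"""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # 合并比例
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 双侧 p 值
    return z, p_value

# 虚构的分组计数:体验型 vs 交易型查询的非 OTA 引用数/总引用数
# (比例与摘要一致:约 55.9% vs 30.8%,合计约 1,357 条引用)
z, p = two_proportion_z(x1=380, n1=680, x2=208, n2=677)
```

在这种拆分下 z 值约为 9.4,对应的双侧 p 值量级在 10⁻²⁰ 附近,与摘要报告的显著性水平一致。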
[IR-2] CoverageBench: Evaluating Information Coverage across Tasks and Domains
【速读】:该论文旨在解决传统检索评估指标(如精确率、召回率、RBP、nDCG和MAP)无法有效衡量检索系统对可用相关信息的覆盖范围的问题,尤其是在检索增强生成(Retrieval-Augmented Generation, RAG)系统中,信息覆盖度成为关键考量因素。其解决方案的关键在于构建一套基于现有数据集的评测集合(suite of collections),用于专门评估信息覆盖度,并提供统一的测试平台,涵盖多种文本类型与任务场景,同时公开所有主题、信息点(nuggets)、相关性标签及基线排序结果,便于研究者开展系统性评估与比较。
链接: https://arxiv.org/abs/2603.20034
作者: Saron Samuel,Andrew Yates,Dawn Lawrie,Ian Soboroff,Trevor Adriaanse,Benjamin Van Durme,Eugene Yang
机构: Johns Hopkins University (约翰霍普金斯大学); National Institute of Standards and Technology (美国国家标准与技术研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8
Abstract:We wish to measure the information coverage of an ad hoc retrieval algorithm, that is, how much of the range of available relevant information is covered by the search results. Information coverage is a central aspect for retrieval, especially when the retrieval system is integrated with generative models in a retrieval-augmented generation (RAG) system. The classic metrics for ad hoc retrieval, precision and recall, reward a system as more and more relevant documents are retrieved. However, since relevance in ad hoc test collections is defined for a document without any relation to other documents that might contain the same information, high recall is sufficient but not necessary to ensure coverage. The same is true for other metrics such as rank-biased precision (RBP), normalized discounted cumulative gain (nDCG), and mean average precision (MAP). Test collections developed around the notion of diversity ranking in web search incorporate multiple aspects that support a concept of coverage in the web domain. In this work, we construct a suite of collections for evaluating information coverage from existing collections. This suite offers researchers a unified testbed spanning multiple genres and tasks. All topics, nuggets, relevance labels, and baseline rankings are released on Hugging Face Datasets, along with instructions for accessing the publicly available document collections.
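摘要中"高召回是覆盖度的充分而非必要条件"这一点,可用一个玩具示例直观说明:两篇文档携带相同信息点(nugget)时,只取其一即可实现完全覆盖,但召回率会因漏掉重复文档而被压低。以下数据为笔者虚构:

```python
def doc_recall(retrieved, relevant):
    # 经典文档级召回:被检索到的相关文档占全部相关文档的比例
    return len(set(retrieved) & set(relevant)) / len(relevant)

def nugget_coverage(retrieved, doc_nuggets, all_nuggets):
    # 信息覆盖度:检索结果覆盖的不同信息点占全部信息点的比例
    covered = set()
    for d in retrieved:
        covered |= doc_nuggets.get(d, set())
    return len(covered & all_nuggets) / len(all_nuggets)

# d1 与 d2 携带同一个信息点 n1:只检索 d1 即覆盖 n1
doc_nuggets = {"d1": {"n1"}, "d2": {"n1"}, "d3": {"n2"}}
relevant = ["d1", "d2", "d3"]
recall = doc_recall(["d1", "d3"], relevant)                        # 2/3
coverage = nugget_coverage(["d1", "d3"], doc_nuggets, {"n1", "n2"})  # 1.0
```

检索结果 [d1, d3] 的覆盖度已达 1.0,但召回仅 2/3,这正是论文主张需要专门的覆盖度评测集合的原因。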
[IR-3] RouterKGQA: Specialized–General Model Routing for Constraint-Aware Knowledge Graph Question Answering
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在知识图谱问答(Knowledge Graph Question Answering, KGQA)中因缺乏结构化约束而导致的幻觉问题,同时平衡推理准确性和计算成本。现有方法主要分为两类:基于检索的方法使用小型专用模型,虽高效但易产生不可达路径且忽略隐式约束;基于代理的方法采用大型通用模型,虽结构约束更强但代价高昂。论文提出RouterKGQA框架,其核心在于实现专用模型与通用模型的协同机制——由专用模型生成初步推理路径,仅在必要时调用通用模型进行知识图谱(Knowledge Graph, KG)引导的修复,并引入约束感知的答案过滤策略以减少冗余输出,同时优化通用代理的工作流以降低推理开销。实验表明,该方案在多个基准上平均F1提升3.57点、Hits@1提升0.49点,且平均每题仅需1.15次LLM调用,显著提升了性能与效率的平衡。
链接: https://arxiv.org/abs/2603.20017
作者: Bo Yuan,Hexuan Deng,Xuebo Liu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); Zhongguancun Academy, Beijing(中关村学院)
类目: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized–general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Codes and models are available at this https URL.
[IR-4] A Super Fast K-means for Indexing Vector Embeddings
【速读】:该论文旨在解决高维向量嵌入(high-dimensional vector embeddings)聚类任务中计算效率低的问题,特别是在向量相似性搜索(vector similarity search)场景下,传统k-means算法在CPU和GPU上存在数据访问与计算开销过高的瓶颈。其解决方案的关键在于两个核心机制:一是通过可靠且高效地剪枝(pruning)对聚类无贡献的维度,显著减少不必要的数据访问和计算开销;二是提出“基于召回率的早期终止”(Early Termination by Recall)机制,在聚类中心质量不再提升时提前终止迭代,从而进一步缩短运行时间而不损害检索性能。
链接: https://arxiv.org/abs/2603.20009
作者: Leonardo Kuffo,Sven Hepkema,Peter Boncz
机构: CWI Amsterdam(荷兰国家数学与计算机科学研究中心); ETH Zurich(苏黎世联邦理工学院)
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:We present SuperKMeans: a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans’ clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans acceleration comes from reducing data-access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that early-terminates k-means when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at this https URL
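SuperKMeans 的维度剪枝思想可用"部分距离提前放弃"(partial-distance pruning)做一个极简示意:在逐维累加平方差时,一旦部分和已超过当前最优距离即放弃该候选质心。这只是对论文剪枝思路的粗略近似,并非其实际算法;"基于召回率的早期终止"发生在迭代层面,此处未实现:

```python
def assign_pruned(x, centroids):
    """将向量 x 指派到最近质心,使用部分距离剪枝:
    部分平方距离一旦不小于当前最优即提前跳出(节省剩余维度的访存与计算)。"""
    best_id, best_d = 0, float("inf")
    for cid, c in enumerate(centroids):
        d = 0.0
        for xi, ci in zip(x, c):
            d += (xi - ci) ** 2
            if d >= best_d:      # 剪枝:不可能优于当前最优
                break
        else:                    # 循环未被打断:该质心成为新的最优
            best_id, best_d = cid, d
    return best_id

centroids = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]
a = assign_pruned([1.0, 0.5, 0.2], centroids)   # 归入质心 0
b = assign_pruned([9.0, 9.5, 10.0], centroids)  # 归入质心 1
```

对高维嵌入而言,大多数候选质心往往在前几维就被排除,这正是论文声称能显著减少数据访问与计算开销的直觉来源。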
[IR-5] DALI: LLM-Agent Enhanced Dual-Stream Adaptive Leadership Identification for Group Recommendations
【速读】:该论文旨在解决现有群体推荐系统在区分领导主导型群体与协作型群体方面能力不足的问题,尤其是在单个成员对群体决策产生过度影响时,难以准确反映真实群体偏好。其解决方案的关键在于提出了一种双流自适应领导识别(Dual-stream Adaptive Leadership Identification, DALI)框架,该框架创新性地融合了大语言模型(Large Language Models, LLMs)的符号推理能力和神经网络的表示学习能力:一方面通过动态规则生成模块基于迭代性能反馈自主演化识别规则;另一方面采用神经符号聚合机制,同时利用符号推理稳健识别领导型群体,并借助注意力机制建模协作型群体的行为动态,从而实现对复杂现实场景中群体决策环境的自适应精准推荐。
链接: https://arxiv.org/abs/2603.19909
作者: Boxun Song,Min Gao,Jiawei Cheng
机构: Chongqing University (重庆大学)
类目: Information Retrieval (cs.IR)
备注: under review
Abstract:Group recommendation systems play a pivotal role in supporting collective decisions across various contexts, from leisure activities to organizational team-building. Existing group recommendation approaches typically use either handcrafted aggregation rules (e.g. mean, least misery, weighted sum) or neural aggregation models (e.g. attention-based deep learning frameworks), yet both fall short in distinguishing leader-dominated from collaborative groups and often misrepresent true group preferences, especially when a single member disproportionately influences group choices. To address these limitations, we propose the Dual-stream Adaptive Leadership Identification (DALI) framework, which uniquely combines the symbolic reasoning capabilities of Large Language Models (LLMs) with neural network-based representation learning. Specifically, DALI introduces two key innovations: a dynamic rule generation module that autonomously formulates and evolves identification rules through iterative performance feedback, and a neuro-symbolic aggregation mechanism that concurrently employs symbolic reasoning to robustly recognize leadership groups and attention-based neural aggregation to accurately model collaborative group dynamics. Experiments conducted on the Mafengwo travel dataset confirm that DALI significantly improves recommendation accuracy compared to existing frameworks, highlighting its capability to dynamically adapt to complex, real-world group decision environments.
[IR-6] How Well Does Generative Recommendation Generalize?
【速读】:该论文旨在解决生成式推荐(Generative Recommendation, GR)模型为何在实践中优于传统基于物品ID的模型这一问题,尤其针对“GR模型具有更强泛化能力”这一广泛假设缺乏系统验证的问题。研究通过将数据实例按预测所需能力分为“记忆”(memorization,复用训练中观察到的物品转移模式)与“泛化”(generalization,组合已知模式预测未见物品转移)两类,实验证明GR模型在需要泛化能力的任务上表现更优,而基于物品ID的模型则在记忆任务中占优。进一步分析发现,GR模型表面上的“物品级泛化”实际上常退化为“标记级记忆”(token-level memorization)。因此,论文提出一种简单的记忆感知指标(memorization-aware indicator),可动态根据每个实例的特性自适应融合两种范式,从而提升整体推荐性能,其关键在于识别并利用两种模型的互补优势。
链接: https://arxiv.org/abs/2603.19809
作者: Yijie Ding,Zitian Guo,Jiacheng Li,Letian Peng,Shuai Shao,Wei Shao,Xiaoqiang Luo,Luke Simon,Jingbo Shang,Julian McAuley,Yupeng Hou
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:A widely held hypothesis for why generative recommendation (GR) models outperform conventional item ID-based models is that they generalize better. However, there is few systematic way to verify this hypothesis beyond a superficial comparison of overall performance. To address this gap, we categorize each data instance based on the specific capability required for a correct prediction: either memorization (reusing item transition patterns observed during training) or generalization (composing known patterns to predict unseen item transitions). Extensive experiments show that GR models perform better on instances that require generalization, whereas item ID-based models perform better when memorization is more important. To explain this divergence, we shift the analysis from the item level to the token level and show that what appears to be item-level generalization often reduces to token-level memorization for GR models. Finally, we show that the two paradigms are complementary. We propose a simple memorization-aware indicator that adaptively combines them on a per-instance basis, leading to improved overall recommendation performance.
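论文的实例划分标准可以还原为一个简单判据:目标物品转移 (prev_item → next_item) 是否在训练集中出现过,出现过即为"记忆"实例,否则为"泛化"实例。以下为该判据的示意实现(数据为虚构):

```python
def categorize(test_transitions, train_transitions):
    """按预测所需能力划分测试实例:
    memo —— 目标转移在训练中出现过(考察记忆);
    gen  —— 目标转移未出现过(考察泛化)。"""
    memo, gen = [], []
    for t in test_transitions:
        (memo if t in train_transitions else gen).append(t)
    return memo, gen

train = {("i1", "i2"), ("i2", "i3")}
test = [("i1", "i2"), ("i3", "i4")]
memo, gen = categorize(test, train)
```

论文随后将同一判据下沉到标记(token)层面,发现 GR 模型表面的物品级泛化常可归结为标记级记忆;其提出的记忆感知指标正是基于此类逐实例划分来自适应融合两种范式。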
[IR-7] AIGQ: An End-to-End Hybrid Generative Architecture for E-commerce Query Recommendation
【速读】:该论文旨在解决预搜索查询推荐(Pre-search query recommendation)中传统方法存在的语义浅层化、冷启动性能差以及偶然性低的问题,这些问题主要源于对ID匹配和共点击启发式策略的依赖。解决方案的关键在于提出AIGQ(AI-Generated Query)架构,其核心创新包括:1)Interest-Aware List Supervised Fine-Tuning(IL-SFT),通过会话感知的行为聚合与兴趣引导的重排序策略构建训练样本,精准建模用户意图;2)Interest-aware List Group Relative Policy Optimization(IL-GRPO),设计一种双组件奖励机制的策略梯度算法,联合优化单个查询相关性与整体列表属性,并引入基于在线点击率(CTR)排序模型的模型驱动奖励;3)混合离线-在线部署架构,包含AIGQ-Direct用于近线个性化用户到查询生成,以及AIGQ-Think增强推理能力以实现触发词到查询映射,提升兴趣多样性。
链接: https://arxiv.org/abs/2603.19710
作者: Jingcao Xu,Jianyun Zou,Renkai Yang,Zili Geng,Qiang Liu,Haihong Tang
机构: Taobao Tmall Group of Alibaba (淘宝天猫集团); University of Electronic Science and Technology of China (电子科技大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Pre-search query recommendation, widely known as HintQ on Taobao’s homepage, plays a vital role in intent capture and demand discovery, yet traditional methods suffer from shallow semantics, poor cold-start performance and low serendipity due to reliance on ID-based matching and co-click heuristics. To overcome these challenges, we propose AIGQ (AI-Generated Query architecture), the first end-to-end generative framework for HintQ scenario. AIGQ is built upon three core innovations spanning training paradigm, policy optimization and deployment architecture. First, we propose Interest-Aware List Supervised Fine-Tuning (IL-SFT), a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and interest-guided re-ranking strategy to faithfully model nuanced user intent. Accordingly, we design Interest-aware List Group Relative Policy Optimization (IL-GRPO), a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties, enhanced by a model-based reward from the online click-through rate (CTR) ranking model. To deploy under strict real-time and low-latency requirements, we further develop a hybrid offline-online architecture comprising AIGQ-Direct for nearline personalized user-to-query generation and AIGQ-Think, a reasoning-enhanced variant that produces trigger-to-query mappings to enrich interest diversity. Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement.
[IR-8] From Token to Item: Enhancing Large Language Models for Recommendation via Item-aware Attention Mechanism WWW2026
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的推荐系统中存在的核心问题:现有方法将物品表示为标记序列并依赖标准注意力机制,导致其关注点局限于标记级别的关系,而忽视了物品作为推荐基本单元的重要性,从而难以有效捕捉物品间的协同关系。解决方案的关键在于提出一种物品感知注意力机制(Item-aware Attention Mechanism, IAM),该机制通过设计两个互补的注意力层——物品内注意力层(intra-item attention layer)用于建模单个物品内部标记的内容语义,以及物品间注意力层(inter-item attention layer)专门捕获跨物品的协同关系,从而显式地将物品置于推荐的核心位置,使LLM能够更有效地利用物品级协同信息,提升个性化推荐效果。
链接: https://arxiv.org/abs/2603.19693
作者: Xiaokun Zhang,Bowei He,Jiamin Chen,Ziqiang Cui,Chen Ma
机构: City University of Hong Kong(香港城市大学)
类目: Information Retrieval (cs.IR)
备注: This work has been accepted by WWW 2026
Abstract:Large Language Models (LLMs) have recently gained increasing attention in the field of recommendation. Existing LLM-based methods typically represent items as token sequences, and apply attention layers on these tokens to generate recommendations. However, by inheriting the standard attention mechanism, these methods focus on modeling token-level relations. This token-centric focus overlooks the item as the fundamental unit of recommendation, preventing existing methods from effectively capturing collaborative relations at the item level. In this work, we revisit the role of tokens in LLM-driven recommendation and categorize their relations into two types: (1) intra-item token relations, which present the content semantics of an item, e.g., name, color, and size; and (2) inter-item token relations, which encode collaborative relations across items. Building on these insights, we propose a novel framework with an item-aware attention mechanism (IAM) to enhance LLMs for recommendation. Specifically, IAM devises two complementary attention layers: (1) an intra-item attention layer, which restricts attention to tokens within the same item, modeling item content semantics; and (2) an inter-item attention layer, which attends exclusively to token relations across items, capturing item collaborative relations. Through this stacked design, IAM explicitly emphasizes items as the fundamental units in recommendation, enabling LLMs to effectively exploit item-level collaborative relations. Extensive experiments on several public datasets demonstrate the effectiveness of IAM in enhancing LLMs for personalized recommendation.
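IAM 的两个注意力层可理解为在标准注意力上施加互补掩码:物品内层只允许同一物品的 token 互相注意,物品间层只允许跨物品注意。以下掩码构造是对该设计的简化示意(未包含真实注意力计算,token 到物品的映射为虚构示例):

```python
def attention_masks(item_ids):
    """给定每个 token 所属的物品 id,构造两张布尔掩码:
    mask[q][k] == True 表示 token q 可以注意 token k。
    intra:仅同一物品内;inter:仅跨物品(对 IAM 双层设计的简化解读)。"""
    n = len(item_ids)
    intra = [[item_ids[q] == item_ids[k] for k in range(n)] for q in range(n)]
    inter = [[item_ids[q] != item_ids[k] for k in range(n)] for q in range(n)]
    return intra, inter

# 前三个 token 属于物品 0,后两个属于物品 1
intra, inter = attention_masks([0, 0, 0, 1, 1])
```

两张掩码互补且堆叠使用:前者建模物品内容语义,后者专门捕获物品级协同关系,这正是摘要所述"显式以物品为推荐基本单元"的实现方式。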
[IR-9] GenFacet: End-to-End Generative Faceted Search via Multi-Task Preference Alignment in E-Commerce
【速读】:该论文旨在解决传统面向电商目录的分面搜索(faceted search)系统在应对新兴词汇、语义鸿沟以及分面选择与底层检索之间脱节等问题时的局限性。其关键解决方案是提出了一种工业级端到端生成式框架GenFacet,该框架基于统一的大语言模型(Large Language Model, LLM)将分面搜索重构为两个耦合的生成任务:上下文感知的分面生成(Context-Aware Facet Generation),用于动态合成响应趋势的导航选项;以及意图驱动的查询重写(Intent-Driven Query Rewriting),将用户交互转化为精准搜索查询以闭合检索闭环。通过结合教师-学生蒸馏与GRPO的多任务训练管道,该方法直接优化下游搜索满意度,显著提升了用户点击率和转化率。
链接: https://arxiv.org/abs/2603.19665
作者: Zhouwei Zhai,Min Yang,Jin Li
机构: JD.com(京东)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Faceted search acts as a critical bridge for navigating massive ecommerce catalogs, yet traditional systems rely on static rule-based extraction or statistical ranking, struggling with emerging vocabulary, semantic gaps, and a disconnect between facet selection and underlying retrieval. In this paper, we introduce GenFacet, an industrial-grade, end-to-end generative framework deployed at this http URL. GenFacet reframes faceted search as two coupled generative tasks within a unified Large Language Model: Context-Aware Facet Generation, which dynamically synthesizes trend-responsive navigation options, and Intent-Driven Query Rewriting, which translates user interactions into precise search queries to close the retrieval loop. To bridge the gap between generative capabilities and search utility, we propose a novel multi-task training pipeline combining teacher-student distillation with GRPO. This aligns the model with complex user preferences by directly optimizing for downstream search satisfaction. Validated on China’s largest self-operated e-commerce platform via rigorous offline evaluations and online A/B tests, GenFacet demonstrated substantial improvements. Specifically, online results reveal a relative increase of 42.0% in facet Click-Through Rate (CTR) and 2.0% in User Conversion Rate (UCVR). These outcomes provide strong evidence of the benefits of generative methods for improving query understanding and user engagement in large-scale information retrieval systems.
[IR-10] MetaCues: Enabling Critical Engagement with Generative AI for Information Seeking and Sensemaking
【速读】:该论文旨在解决生成式 AI(Generative AI)搜索工具在信息检索过程中因设计倾向导致认知卸载(cognitive offloading)的问题,从而引发被动参与、选择性注意和信息同质化等负面影响。解决方案的关键在于引入 MetaCues —— 一种基于生成式 AI 的交互式信息检索工具,其通过在 AI 回应旁嵌入元认知提示(metacognitive cues),并辅以笔记功能,引导用户进行 prompt 设计优化、输出验证及批判性信息处理,从而促进元认知参与,提升用户对搜索主题的判断信心与探究广度,尤其在低争议性和低熟悉度的主题中效果更为显著。
链接: https://arxiv.org/abs/2603.19634
作者: Anjali Singh,Karan Taneja,Zhitong Guan,Soo Young Rieh
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:
Abstract:Generative AI (GenAI) search tools are increasingly used for information seeking, yet their design tends to encourage cognitive offloading, which may lead to passive engagement, selective attention, and informational homogenization. Effective use requires metacognitive engagement to craft good prompts, verify AI outputs, and critically engage with information. We developed MetaCues, a novel GenAI-based interactive tool for information seeking that delivers metacognitive cues alongside AI responses and a note-taking interface to guide users’ search and associated learning. Through an online study (N = 146), we compared MetaCues to a baseline tool without cues, across two broad search topics that required participants to explore diverse perspectives in order to make informed judgments. Preliminary findings regarding participants’ search behavior show that MetaCues leads to increased confidence in attitudinal judgments about the search topic as well as broader inquiry, with the latter effect emerging primarily for the topic that was less controversial and with which participants had relatively less familiarity. Accordingly, we outline directions for future qualitative exploration of search interactions and inquiry patterns.
[IR-11] The Prosocial Ranking Challenge: Reducing Polarization on Social Media without Sacrificing Engagement
【速读】:该论文旨在解决社会媒体算法对政治极化(affective polarization)等社会性后果的潜在加剧问题,尤其关注不同内容排序机制对用户情绪分化、信息获取与社会互动的影响。其核心解决方案是通过开发并部署一种浏览器扩展程序,在不改变平台架构的前提下,随机将9,386名桌面用户分配至对照组或五种替代性排序算法组,在美国2024年总统大选期间持续六个月内干预三个主流社交平台的内容呈现顺序,从而实现对算法设计与社会影响之间因果关系的直接实证检验。关键在于采用大规模随机对照实验(RCT)方法,在真实世界环境中系统性测试“桥接型内容”(bridging content)能否在不影响平台商业利益(如用户参与度)的前提下缓解政治极化。
链接: https://arxiv.org/abs/2603.19626
作者: Jonathan Stray,Ian Baker,George Beknazar-Yuzbashev,Ceren Budak,Julia Kamin,Kylan Rutherford,Mateusz Stalinski,Tin Acosta,Chris Bail,Michael Bernstein,Mark Brandt,Amy Bruckman,Anshuman Chhabra,Soham De,Kayla Duskin,Sara Fish,Beth Goldberg,Andy Guess,Dylan Hadfield-Menell,Muhammed Haroon,Safwan Hossain,Michael Inzlicht,Gauri Jain,Yanchen Jiang,Alexander P. Landry,Yph Lelkes,Hongfan Lu,Peter Mason,Jennifer McCoy,Smitha Milli,Paul Resnick,Emily Saltz,Martin Saveski,Lisa Schirch,Max Spohn,Siddarth Srinivasan,Alexis Tatore,Luke Thorburn,Joshua A. Tucker,Robb Willer,Magdalena Wojcieszak,Manuel Wüthrich,Sylvan Zheng
机构: Center for Human-Compatible AI, UC Berkeley, Berkeley, CA 94720, USA.; Columbia University, New York, NY 10027, USA.; School of Information, University of Michigan, Ann Arbor, MI 48109, USA.; Civic Health Project, USA.; University of Warwick, Coventry CV4 7AL, UK.; Department of Sociology, Duke University, Durham, NC 27708, USA.; Department of Computer Science, Stanford University, Stanford, CA 94305, USA.; Department of Psychology, Michigan State University, East Lansing, MI 48824, USA.; School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.; University of South Florida, Tampa, FL 33620, USA.; University of Washington, Seattle, WA 98195, USA.; Information School, University of Washington, Seattle, WA 98195, USA.; Harvard University, Cambridge, MA 02138, USA.; Jigsaw, Google, New York, NY, USA.; Department of Politics, Princeton University, Princeton, NJ 08544, USA.; Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139, USA.; UC Davis, Davis, CA 95616, USA.; Department of Psychology, University of Toronto, Toronto, ON M5S 1A1, Canada.; Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA 19104, USA.; Department of Political Science, Georgia State University, Atlanta, GA 30303, USA.; Meta FAIR, New York, NY, USA.; Saltern Studio, USA.; Kroc Institute for International Peace Studies, University of Notre Dame, Notre Dame, IN 46556, USA.; Department of Informatics, King’s College London, London WC2R 2LS, UK.; Department of Politics, New York University, New York, NY 10012, USA.; Independent.
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
备注:
Abstract:We report the first direct comparisons of multiple alternative social media algorithms on multiple platforms on outcomes of societal interest. We used a browser extension to modify which posts were shown to desktop social media users, randomly assigning 9,386 users to a control group or one of five alternative ranking algorithms which simultaneously altered content across three platforms for six months during the US 2024 presidential election. This reduced our preregistered index of affective polarization by an average of 0.03 standard deviations (p < 0.05), including a 1.5 degree decrease in differences between the 100-point inparty and outparty feeling thermometers. We saw reductions in active use time for Facebook (-0.37 min/day) and Reddit (-0.2 min/day), but an increase of 0.32 min/day (p < 0.01) for X/Twitter. We saw an increase in reports of negative social media experiences but found no effects on well-being, news knowledge, outgroup empathy, or perceptions of and support for partisan violence. This implies that bridging content can improve some societal outcomes without necessarily conflicting with the engagement-driven business model of social media.
[IR-12] CO-EVOLVE: Bidirectional Co-Evolution of Graph Structure and Semantics for Heterophilous Learning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)与图神经网络(Graph Neural Networks, GNNs)融合过程中存在的三大核心问题:(1)双向误差传播,即LLM的语义幻觉或GNN的结构噪声会永久污染下游模态且无法纠正;(2)语义-结构不一致,尤其在异质性(heterophilous)场景下文本相似性与拓扑结构相悖;(3)盲人引路现象,即无差别对齐导致模型盲目复制彼此错误,忽视不确定性。其解决方案的关键在于提出CO-EVOLVE框架,将图结构与语义嵌入视为动态、相互增强的潜在变量,并通过高斯-赛德尔交替优化策略建立循环反馈机制:GNN以软提示(Soft Prompts)注入结构上下文引导LLM,而LLM构建动态语义图重构GNN结构。该框架引入三项创新以稳定演化过程:硬结构冲突感知对比损失、自适应节点门控机制和不确定性门控一致性策略,最终实现语义与结构的协同进化与鲁棒对齐。
链接: https://arxiv.org/abs/2603.19596
作者: Jinming Xing,Muhammad Shahzad
机构: North Carolina State University (北卡罗来纳州立大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:The integration of Large Language Models (LLMs) and Graph Neural Networks (GNNs) promises to unify semantic understanding with structural reasoning, yet existing methods typically rely on static, unidirectional pipelines. These approaches suffer from fundamental limitations: (1) Bidirectional Error Propagation, where semantic hallucinations in LLMs or structural noise in GNNs permanently poison the downstream modality without opportunity for recourse; (2) Semantic-Structural Dissonance, particularly in heterophilous settings where textual similarity contradicts topological reality; (3) a Blind Leading the Blind phenomenon, where indiscriminate alignment forces models to mirror each other’s mistakes regardless of uncertainty. To address these challenges, we propose CO-EVOLVE, a dual-view co-evolution framework that treats graph topology and semantic embeddings as dynamic, mutually reinforcing latent variables. By employing a Gauss-Seidel alternating optimization strategy, our framework establishes a cyclic feedback loop: the GNN injects structural context as Soft Prompts to guide the LLM, while the LLM constructs favorable Dynamic Semantic Graphs to rewire the GNN. We introduce three key innovations to stabilize this evolution: (1) a Hard-Structure Conflict-Aware Contrastive Loss that warps the semantic manifold to respect high-order topological boundaries; (2) an Adaptive Node Gating Mechanism that dynamically fuses static and learnable structures to recover missing links; (3) an Uncertainty-Gated Consistency strategy that enables meta-cognitive alignment, ensuring models only learn from the confident view. Finally, an Entropy-Aware Adaptive Fusion integrates predictions during inference. Extensive experiments on public benchmarks demonstrate that CO-EVOLVE significantly outperforms state-of-the-art baselines, achieving average improvements of 9.07% in Accuracy and 7.19% in F1-score.
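摘要中"熵感知自适应融合"(Entropy-Aware Adaptive Fusion)的一种可能实现,是按各视图预测分布的熵反比加权:熵越低说明该视图越自信,获得的融合权重越大。以下仅为笔者基于该名称的推测性示意,细节未必与论文一致:

```python
import math

def entropy(p):
    # 香农熵(自然对数);概率为 0 的项按惯例贡献 0
    return -sum(x * math.log(x) for x in p if x > 0)

def entropy_aware_fuse(p_gnn, p_llm):
    # 推测性解读:按熵的倒数加权,低熵(高置信)视图权重更大
    w_g = 1.0 / (entropy(p_gnn) + 1e-8)
    w_l = 1.0 / (entropy(p_llm) + 1e-8)
    z = w_g + w_l
    return [(w_g * a + w_l * b) / z for a, b in zip(p_gnn, p_llm)]

# GNN 视图更自信([0.9, 0.1]),融合结果向其倾斜
fused = entropy_aware_fuse([0.9, 0.1], [0.5, 0.5])
```

这种按置信度分配话语权的做法,与论文"不确定性门控、只向自信的视图学习"的总体思路是一致的。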
[IR-13] All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution
【速读】:该论文旨在解决长期交互式智能体(lifelong interactive agents)在持续积累记忆过程中面临的检索效率与准确性下降问题,即随着历史数据增长,现有记忆系统易产生冗余、过时或噪声干扰的检索结果。其解决方案的关键在于提出 All-Mem 框架,通过显式的非破坏性整合(explicit, non-destructive consolidation)维持结构化记忆库的拓扑关系,避免基于摘要的压缩导致的信息不可逆丢失;在线阶段以有限可见表面锚定检索以控制粗粒度搜索成本,离线阶段由大语言模型(LLM)诊断器生成带置信度评分的拓扑编辑指令,并通过 SPLIT、MERGE 和 UPDATE 三种操作进行门控执行,同时保留不可变证据以确保可追溯性;查询时利用类型化链接实现受跳数和预算约束的扩展检索,从而在固定上下文和延迟预算下高效获取相关证据。
链接: https://arxiv.org/abs/2603.19595
作者: Can Lv,Heng Chang,Yuchen Guo,Shengyu Tao,Shiji Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long-term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology-structured memory bank via explicit, non-destructive consolidation, avoiding the irreversible information loss typical of summarization-based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence-scored topology edits executed with gating using three operators: SPLIT, MERGE, and UPDATE, while preserving immutable evidence for traceability. At query time, typed links enable hop-bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LOCOMO and LONGMEMEVAL show improved retrieval and QA over representative baselines.
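离线阶段"置信度门控的拓扑编辑 + 不可变证据保留"可示意如下。操作接口与字段名均为笔者虚构的简化(SPLIT 从略),并非论文实现:

```python
def apply_edits(memory, edits, threshold=0.7):
    """应用置信度门控的拓扑编辑(UPDATE / MERGE)。
    edits 为 (操作, 置信度, 参数) 三元组;低于阈值的提议被门控拒绝。
    被改写或合并前的旧条目存入 archive,作为不可变证据保证可追溯。"""
    archive = []
    for op, conf, args in edits:
        if conf < threshold:          # 门控:跳过低置信度提议
            continue
        if op == "UPDATE":
            key, new_text = args
            archive.append((key, memory[key]))
            memory[key] = new_text
        elif op == "MERGE":
            a, b, merged = args
            va, vb = memory.pop(a), memory.pop(b)
            archive.append((a, va))
            archive.append((b, vb))
            memory[merged] = va + " | " + vb
    return memory, archive

mem = {"m1": "用户喜欢喝茶", "m2": "用户每天喝绿茶"}
proposals = [("MERGE", 0.9, ("m1", "m2", "m12")),
             ("UPDATE", 0.3, ("m12", "低置信度改写,应被拒绝"))]
mem, archive = apply_edits(mem, proposals)
```

高置信度的 MERGE 被执行且原条目完整归档,低置信度的 UPDATE 被门控拒绝,对应摘要中"非破坏性整合"与"带置信度评分的编辑门控执行"两个要点。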
[IR-14] SaFRO: Satisfaction-Aware Fusion via Dual-Relative Policy Optimization for Short-Video Search
【速读】:该论文旨在解决工业级短视频搜索系统中多任务融合(Multi-Task Fusion)的优化问题,即如何在聚合异构预测信号时不仅提升短期点击与互动指标,还能有效促进长期用户满意度。现有方法主要基于即时行为指标进行优化,难以反映用户的整体满意度。解决方案的关键在于提出SaFRO框架:首先构建一个感知满意度的奖励模型(satisfaction-aware reward model),利用查询级别的行为代理捕捉超越单个物品交互的用户整体满意度;其次引入双相对策略优化(Dual-Relative Policy Optimization, DRPO),通过组内与跨批次的相对偏好比较高效更新融合策略;最后设计任务关系感知融合模块(Task-Relation-Aware Fusion),显式建模不同目标间的依赖关系,实现上下文敏感的权重自适应调整。该方案在快手短视频搜索平台上的离线评估和大规模在线A/B测试中显著优于现有基线,同时提升了短期排序质量和长期用户留存率。
链接: https://arxiv.org/abs/2603.19585
作者: Renzhe Zhou,Songyang Li,Feiran Zhu,Chenglei Dai,Yi Zhang,Yi Wang,Jingwei Zhuo
机构: Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR)
备注: 9 pages, 8 figures
Abstract:Multi-Task Fusion plays a pivotal role in industrial short-video search systems by aggregating heterogeneous prediction signals into a unified ranking score. However, existing approaches predominantly optimize for immediate engagement metrics, which often fail to align with long-term user satisfaction. While Reinforcement Learning (RL) offers a promising avenue for user satisfaction optimization, its direct application to search scenarios is non-trivial due to the inherent data sparsity and intent constraints compared to recommendation feeds. To this end, we propose SaFRO, a novel framework designed to optimize user satisfaction in short-video search. We first construct a satisfaction-aware reward model that utilizes query-level behavioral proxies to capture holistic user satisfaction beyond item-level interactions. Then we introduce Dual-Relative Policy Optimization (DRPO), an efficient policy learning method that updates the fusion policy through relative preference comparisons within groups and across batches. Furthermore, we design a Task-Relation-Aware Fusion module to explicitly model the interdependencies among different objectives, enabling context-sensitive weight adaptation. Extensive offline evaluations and large-scale online A/B tests on Kuaishou short-video search platform demonstrate that SaFRO significantly outperforms state-of-the-art baselines, delivering substantial gains in both short-term ranking quality and long-term user retention.
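DRPO 中"组内相对偏好比较"的核心与 GRPO 同源:对同一查询下的一组候选,用组内均值和标准差归一化奖励,得到相对优势。以下是该思想的示意实现,并非 DRPO 的完整目标函数(其还包含跨批次比较与任务关系感知融合):

```python
def group_relative_advantages(rewards):
    """组内归一化优势:高于组均值的候选获得正优势。"""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0   # 组内奖励全相等时避免除零
    return [(r - mean) / std for r in rewards]

# 同一查询下三个候选排序方案的满意度奖励
adv = group_relative_advantages([0.2, 0.8, 0.5])
```

组内归一化使策略更新只依赖候选之间的相对优劣,无需训练额外的价值网络,这也是此类方法在工业检索系统中被认为高效的原因之一。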
[IR-15] EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域中易产生幻觉(hallucination)的问题,即模型生成看似合理但缺乏证据支持的回答,从而影响决策的可靠性。解决方案的核心是提出一种名为EvidenceRL的强化学习框架,通过引入基于证据一致性和正确性的双重评分机制来优化生成器:其中,证据一致性(grounding)衡量候选回答与检索到的证据及上下文之间的蕴含关系,正确性(correctness)则评估回答与参考答案的一致性;在此基础上,采用Group Relative Policy Optimization(GRPO)进行策略优化,从而在不牺牲任务准确性的前提下显著提升回答的可解释性和事实准确性。实证结果表明,该方法在心脏诊断和法律推理两个高风险场景中均实现了证据依从性的大幅提升和幻觉率的显著降低。
链接: https://arxiv.org/abs/2603.19532
作者: J. Ben Tamo,Yuxing Lu,Benoit L. Marteau,Micky C. Nnamdi,May D. Wang
机构: Georgia Institute of Technology (佐治亚理工学院); Peking University (北京大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ( G_{\max}@3 ) rises from 47.6 to 78.2; hallucinations drop nearly 5× and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at this https URL.
[IR-16] Inducing Sustained Creativity and Diversity in Large Language Models
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在探索式搜索任务中缺乏持续创造力和多样性的问题,尤其针对用户进行长时间“搜索之旅”(search quest)时,如寻找理想的婚纱、冷门研究课题或创新商业点子等场景。现有LLM的常见解码方法主要优化于提供正确答案的任务,导致输出结果同质化且缺乏新颖性;而现有提升多样性的方法往往在用户尚未充分理解搜索空间前即出现重复,或对所有相似提问提供单一类型的“创意”。论文提出一种新颖、易实现的解码机制,其关键在于通过外部控制策略诱导LLM持续生成概念上独特的响应,无需访问模型内部向量空间即可释放其广泛知识(包括主流与非主流知识),从而显著增强搜索过程中的探索效率与创造性输出。
链接: https://arxiv.org/abs/2603.19519
作者: Queenie Luo,Gary King,Michael Puett,Michael D. Smith
机构: Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:
Abstract:We address a not-widely-recognized subset of exploratory search, where a user sets out on a typically long “search quest” for the perfect wedding dress, overlooked research topic, killer company idea, etc. The first few outputs of current large language models (LLMs) may be helpful but only as a start, since the quest requires learning the search space and evaluating many diverse and creative alternatives along the way. Although LLMs encode an impressive fraction of the world’s knowledge, common decoding methods are narrowly optimized for prompts with correct answers and thus return mostly homogeneous and conventional results. Other approaches, including those designed to increase diversity across a small set of answers, start to repeat themselves long before search quest users learn enough to make final choices, or offer a uniform type of “creativity” to every user asking similar questions. We develop a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs, producing as many conceptually unique results as desired, even without access to the inner workings of an LLM’s vector space. The algorithm unlocks an LLM’s vast knowledge, both orthodox and heterodox, well beyond modal decoding paths. With this approach, search quest users can more quickly explore the search space and find satisfying answers.
[IR-17] Spectral Tempering for Embedding Compression in Dense Passage Retrieval
【速读】:该论文旨在解决稠密检索系统中降维方法存在的根本性权衡问题:主流后处理降维方法如主成分分析(PCA)虽能保留主要方差,但未能充分利用表示能力;而白化(whitening)虽强制各向同性,却会放大重尾特征谱中的噪声。现有中间谱缩放方法通过引入幂系数 γ 来统一两者,但其通常作为固定超参数需任务特定调优。论文的关键创新在于提出 Spectral Tempering(SpecTemp),一种无需学习的自适应方法,其核心是基于语料库特征谱的局部信噪比(SNR)分析与膝点归一化,直接推导出随目标维度 k 变化的动态 γ(k),从而在不依赖标注数据或验证搜索的前提下实现接近网格搜索最优 γ∗(k) 的性能表现。
链接: https://arxiv.org/abs/2603.19339
作者: Yongkang Li,Panagiotis Eustratiadis,Evangelos Kanoulas
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Dimensionality reduction is critical for deploying dense retrieval systems at scale, yet mainstream post-hoc methods face a fundamental trade-off: principal component analysis (PCA) preserves dominant variance but underutilizes representational capacity, while whitening enforces isotropy at the cost of amplifying noise in the heavy-tailed eigenspectrum of retrieval embeddings. Intermediate spectral scaling methods unify these extremes by reweighting dimensions with a power coefficient γ, but treat γ as a fixed hyperparameter that requires task-specific tuning. We show that the optimal scaling strength γ is not a global constant: it varies systematically with target dimensionality k and is governed by the signal-to-noise ratio (SNR) of the retained subspace. Based on this insight, we propose Spectral Tempering (SpecTemp), a learning-free method that derives an adaptive γ(k) directly from the corpus eigenspectrum using local SNR analysis and knee-point normalization, requiring no labeled data or validation-based search. Extensive experiments demonstrate that Spectral Tempering consistently achieves near-oracle performance relative to grid-searched γ*(k) while remaining fully learning-free and model-agnostic. Our code is publicly available at this https URL.
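The PCA/whitening trade-off and the power-scaling family that interpolates between them can be shown in a few lines of NumPy: project onto the top-k eigenvectors and reweight dimension i by λ_i^(−γ/2), so γ=0 recovers plain PCA and γ=1 recovers whitening. This is a sketch of the generic scaling family the abstract describes; SpecTemp's adaptive γ(k) from local SNR analysis and knee-point normalization is not reproduced here.

```python
import numpy as np

def spectral_scale(X, k, gamma):
    """Project centered embeddings onto the top-k eigenvectors of the
    sample covariance and reweight each retained dimension by
    eigenvalue**(-gamma/2): gamma=0 is PCA, gamma=1 is whitening."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)       # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]       # keep top-k eigenpairs
    lam, W = vals[idx], vecs[:, idx]
    return (Xc @ W) * lam ** (-gamma / 2)

rng = np.random.default_rng(0)
# Anisotropic toy "embeddings": per-dimension scales fall from 3.0 to 0.5
X = rng.normal(size=(500, 16)) * np.linspace(3.0, 0.5, 16)
Z = spectral_scale(X, k=8, gamma=1.0)     # whitened: unit variance per dim
print(Z.var(axis=0).round(2))
```

With γ=1 every retained dimension has unit sample variance (isotropy), while γ=0 leaves the heavy-tailed eigenvalue decay intact; intermediate γ tempers, rather than flattens, the spectrum.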
[IR-18] VERDICT: Verifiable Evolving Reasoning with Directive-Informed Collegial Teams for Legal Judgment Prediction
【速读】:该论文旨在解决法律判决预测(Legal Judgment Prediction, LJP)中现有方法缺乏内在可解释性与法律依据、无法支持可验证推理过程以及难以适应判例法演进的问题。其解决方案的关键在于提出一个名为VERDICT的自修正协同多智能体框架,该框架模拟虚拟合议庭机制,通过分工协作的多个专业化智能体(如事实结构化、法律检索、意见起草和监督验证)在可追溯的“草拟—验证—修订”工作流中完成任务,并引入基于微指令范式的混合判例记忆(Hybrid Jurisprudential Memory, HJM),持续将多智能体验证轨迹提炼为更新的微指令,实现跨案件的持续学习能力,从而提升模型的准确性、可解释性和时间泛化性能。
链接: https://arxiv.org/abs/2603.19306
作者: Hui Liao,Chuan Qin,Yongwen Ren,Hao Li,Zhenya Huang,Yanyong Zhang,Chao Wang
机构: University of Science and Technology of China (中国科学技术大学); Chinese Academy of Sciences (中国科学院); iFLYTEK AI Research (科大讯飞AI研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 3 figures, 4 tables
Abstract:Legal Judgment Prediction (LJP) predicts applicable law articles, charges, and penalty terms from case facts. Beyond accuracy, LJP calls for intrinsically interpretable and legally grounded reasoning that can reconcile statutory rules with precedent-informed standards. However, existing methods often behave as static, one-shot predictors, providing limited procedural support for verifiable reasoning and little capability to adapt as jurisprudential practice evolves. We propose VERDICT, a self-refining collaborative multi-agent framework that simulates a virtual collegial panel. VERDICT assigns specialized agents to complementary roles (e.g., fact structuring, legal retrieval, opinion drafting, and supervisory verification) and coordinates them in a traceable draft–verify–revise workflow with explicit Pass/Reject feedback, producing verifiable reasoning traces and revision rationales. To capture evolving case experience, we further introduce a Hybrid Jurisprudential Memory (HJM) grounded in the Micro-Directive Paradigm, which stores precedent standards and continually distills validated multi-agent verification trajectories into updated Micro-Directives for continual learning across cases. We evaluate VERDICT on CAIL2018 and a newly constructed CJO2025 dataset with a strict future time-split for temporal generalization. VERDICT achieves state-of-the-art performance on CAIL2018 and demonstrates strong generalization on CJO2025. To facilitate reproducibility and further research, we release our code and the dataset at this https URL.
[IR-19] URAG : A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统评估中过度关注准确性而忽视不确定性量化的问题,从而无法全面衡量RAG在实际应用中的可靠性。其解决方案的关键在于提出URAG(Uncertainty-aware RAG Benchmark),通过将开放式生成任务转化为多项选择题问答形式,并利用校准预测(conformal prediction)实现对RAG输出不确定性的系统性量化;同时,基于LAC(Label-Aware Coverage)和APS(Average Prediction Set Size)指标,在医疗、编程、科学、数学和通用文本等多领域对8种标准RAG方法进行评估,揭示了准确率与不确定性之间的非线性关系及不同RAG架构的不确定性表现差异,为提升RAG系统的可信度提供了可解释、可比较的基准框架。
链接: https://arxiv.org/abs/2603.19281
作者: Vinh Nguyen,Cuong Dang,Jiahao Zhang,Hoa Tran,Minh Tran,Trinh Chau,Thai Le,Lu Cheng,Suhang Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
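The conformal-prediction machinery behind URAG's prediction sets can be sketched with the standard split-conformal LAC recipe: the nonconformity score is 1 − p(true choice), the threshold is a finite-sample-corrected quantile of calibration scores, and the prediction set keeps every choice clearing that threshold. The Dirichlet toy data and α=0.1 below are illustrative assumptions, not URAG's evaluation data.

```python
import numpy as np

def lac_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal LAC: score = 1 - p(true choice); return the
    finite-sample-corrected quantile q_hat of calibration scores."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0), method="higher")

def prediction_set(probs, q_hat):
    """All answer choices whose probability clears 1 - q_hat."""
    return np.where(probs >= 1.0 - q_hat)[0]

# Toy calibration set: 100 four-choice questions with made-up probabilities,
# where the model's top choice is assumed correct.
rng = np.random.default_rng(1)
cal_probs = rng.dirichlet(np.ones(4) * 2, size=100)
cal_labels = cal_probs.argmax(axis=1)
q_hat = lac_threshold(cal_probs, cal_labels, alpha=0.1)

test_probs = np.array([0.70, 0.20, 0.06, 0.04])
print(prediction_set(test_probs, q_hat))  # indices kept in the set
```

Smaller average set sizes at the same coverage level indicate lower uncertainty, which is exactly the axis URAG adds on top of accuracy.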
[IR-20] Reviewing the Reviewer: Graph-Enhanced LLM s for E-commerce Appeal Adjudication KDD2026
【速读】:该论文旨在解决在层级审核流程中,由于信息不对称导致的纠错信号难以被有效利用的问题。具体而言,第二层审核员(Checker)对第一层决策者(Maker)的修正往往依赖于不可见的验证操作,使得自动化系统无法准确学习这些修正背后的逻辑。其解决方案的关键在于引入显式动作建模(explicit action modeling),作为推理过程中的约束条件,将判断锚定在可验证的操作上而非无约束的文本生成。论文提出EAFD(Evidence-Action-Factor-Decision)结构化推理框架,通过显式建模证据、动作、因素与决策之间的关系,防止幻觉并实现基于冲突建模的学习;进一步构建冲突感知的图推理机制,从历史案例中提取Maker-Checker分歧路径,并以检索增强的方式支持新案例的自上而下演绎推理,同时具备请求更多信息(RMI)能力,精准识别缺失验证动作并生成针对性询问。该方法显著提升了模型与人类专家的一致性,离线评估达到95.8%,在线部署后维持96.3%的高一致性。
链接: https://arxiv.org/abs/2603.19267
作者: Yuchen Du,Ashley Li,Zixi Huang
机构: Purdue University (普渡大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 10 pages, 3 figures, KDD 2026 Applied Data Science Track
Abstract:Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%. Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
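A minimal, hypothetical rendering of the Evidence-Action-Factor-Decision schema and the RMI outcome may help fix the idea: only the four roles and the "name the unexecuted verification actions" behavior come from the abstract; all field contents, names, and the toy case below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class EAFDNode:
    """One adjudication trace in the Evidence-Action-Factor-Decision schema.
    Field contents are illustrative; only the four roles come from the paper."""
    evidence: list   # verifiable inputs (documents, logs, records)
    actions: list    # verification operations actually executed
    factors: list    # decision factors derived from evidence + actions
    decision: str    # e.g. "approve" or "reject"

def adjudicate(node: EAFDNode, required_actions: set) -> str:
    """If any required verification action is missing, the outcome is a
    Request-More-Information (RMI) naming the unexecuted actions."""
    missing = required_actions - set(node.actions)
    if missing:
        return f"RMI: execute {sorted(missing)}"
    return node.decision

case = EAFDNode(evidence=["invoice.pdf"], actions=["check_invoice"],
                factors=["amount_matches"], decision="approve")
print(adjudicate(case, {"check_invoice", "verify_shipping"}))
# RMI, because 'verify_shipping' was never executed
```

Grounding every factor in an executed action is what the paper argues prevents the adjudicator from hallucinating justifications it never verified.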
[IR-21] L-PRISMA: An Extension of PRISMA in the Era of Generative Artificial Intelligence (GenAI) ICME
【速读】:该论文旨在解决传统系统评价与Meta分析(PRISMA)框架中数据提取和文献筛选过程因依赖人工而效率低下、耗时长的问题,同时应对生成式AI(GenAI)在提升自动化水平时带来的可重复性、透明性和审计性挑战。其解决方案的关键在于将人类主导的综合分析与基于统计预筛选的GenAI辅助步骤相结合:通过人类监督保障科学严谨性与透明度,利用统计层的确定性增强方法的可重复性,从而在不牺牲PRISMA核心原则的前提下,实现GenAI对系统评价流程的负责任集成。
链接: https://arxiv.org/abs/2603.19236
作者: Samar Shailendra,Rajan Kadel,Aakanksha Sharma,Islam Mohammad Tahidul,Urvashi Rahul Saxena
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: ICMET 2025
Abstract:The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework provides a rigorous foundation for evidence synthesis, yet the manual processes of data extraction and literature screening remain time-consuming and restrictive. Recent advances in Generative Artificial Intelligence (GenAI), particularly large language models (LLMs), offer opportunities to automate and scale these tasks, thereby improving time and efficiency. However, reproducibility, transparency, and auditability, the core PRISMA principles, are being challenged by the inherent non-determinism of LLMs and the risks of hallucination and bias amplification. To address these limitations, this study integrates human-led synthesis with a GenAI-assisted statistical pre-screening step. Human oversight ensures scientific validity and transparency, while the deterministic nature of the statistical layer enhances reproducibility. The proposed approach systematically enhances PRISMA guidelines, providing a responsible pathway for incorporating GenAI into systematic review workflows.
人机交互
[HC-0] Demonstration of Adapt4Me: An Uncertainty-Aware Authoring Environment for Personalizing Automatic Speech Recognition to Non-normative Speech
【速读】:该论文旨在解决非标准语音(non-normative speech)在自动语音识别(ASR)个性化过程中面临的两大挑战:数据收集成本高以及模型训练技术复杂。为应对这些问题,作者提出了一种基于贝叶斯主动学习(Bayesian active learning)的去中心化Web应用Adapt4Me,其核心创新在于将数据选择、模型适应与验证流程交由普通用户通过三阶段人机协同工作流完成:首先通过贪婪音素采样快速获取说话人特异性声学特征;其次利用变分推断低秩适配(Variational Inference Low-Rank Adaptation, VI-LoRA)实现高效增量更新;最后通过可视化模型认知不确定性(epistemic uncertainty),引导用户以低摩擦方式执行top-k纠错,从而持续优化模型。该方案的关键在于将数据效率转化为交互式设计特性,使用户从被动数据提供者转变为辅助技术的主动创作者。
链接: https://arxiv.org/abs/2603.20112
作者: Niclas Pokel,Yiming Zhao,Pehuén Moure,Yingqiang Gao,Roman Böhringer
机构: Institute of Neuroinformatics, University of Zurich and ETH Zurich(苏黎世大学和苏黎世联邦理工学院); Department of Computer Science, ETH Zurich(苏黎世联邦理工学院); Department of Computational Linguistics, University of Zurich(苏黎世大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalizing Automatic Speech Recognition (ASR) for non-normative speech remains challenging because data collection is labor-intensive and model training is technically complex. To address these limitations, we propose Adapt4Me, a web-based decentralized environment that operationalizes Bayesian active learning to enable end-to-end personalization without expert supervision. The app exposes data selection, adaptation, and validation to lay users through a three-stage human-in-the-loop workflow: (1) rapid profiling via greedy phoneme sampling to capture speaker-specific acoustics; (2) backend personalization using Variational Inference Low-Rank Adaptation (VI-LoRA) to enable fast, incremental updates; and (3) continuous improvement, where users guide model refinement by resolving visualized model uncertainty via low-friction top-k corrections. By making epistemic uncertainty explicit, Adapt4Me reframes data efficiency as an interactive design feature rather than a purely algorithmic concern. We show how this enables users to personalize robust ASR models, transforming them from passive data sources into active authors of their own assistive technology.
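"Greedy phoneme sampling" for rapid profiling reads like a set-cover heuristic: repeatedly pick the prompt that adds the most not-yet-recorded phonemes. Below is a hedged sketch under that reading; the prompt inventory, phoneme labels, and scoring are invented, and Adapt4Me's actual selection criterion may differ.

```python
def greedy_phoneme_sampling(prompts, target_phonemes):
    """Greedy set cover over phonemes: at each step, choose the prompt
    whose phoneme set covers the most still-uncovered targets."""
    uncovered = set(target_phonemes)
    order = []
    while uncovered:
        best = max(prompts, key=lambda p: len(uncovered & prompts[p]))
        gain = uncovered & prompts[best]
        if not gain:
            break  # remaining phonemes are unreachable with these prompts
        order.append(best)
        uncovered -= gain
    return order, uncovered

# Hypothetical prompt inventory with ARPABET-style phoneme sets
prompts = {"see the cat": {"S", "IY", "DH", "AH", "K", "AE", "T"},
           "go now":      {"G", "OW", "N", "AW"},
           "cat nap":     {"K", "AE", "T", "N", "P"}}
order, left = greedy_phoneme_sampling(
    prompts, {"S", "K", "AE", "T", "G", "OW", "N"})
print(order, left)  # two prompts already cover every target phoneme
```

The point of the heuristic is data efficiency: a few well-chosen recordings capture the speaker-specific acoustics needed before VI-LoRA adaptation begins.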
[HC-1] Promoting Critical Thinking With Domain-Specific Generative AI Provocations
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI, GenAI) 对批判性思维(critical thinking)的影响机制尚不明确,现有研究结论存在矛盾,部分研究表明其可能带来负面影响,而另一些则指出在特定设计下可促进批判性思维。为回应这一问题,作者基于两个GenAI工具——用于艺术阐释的ArtBot和用于AI隐私的Privy的设计与评估经验,提出解决方案的关键在于:通过领域特定的“诱发式交互”(provocations),结合用户贡献驱动的“生产性摩擦”(productive friction),使AI系统能够动态引导用户进行反思与论证,从而有效支持批判性思维的发展;同时强调,未来设计应从静态的诱发策略转向适应用户偏好与专业水平的个性化响应机制。
链接: https://arxiv.org/abs/2603.19975
作者: Thomas Şerban von Davier,Hao-Ping Lee,Jodi Forlizzi,Sauvik Das
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 1 table, CHI 2026 Workshop on Tools for Thought, 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:The evidence on the effects of generative AI (GenAI) on critical thinking is mixed, with studies suggesting both potential harms and benefits depending on its implementation. Some argue that AI-driven provocations, such as questions asking for human clarification and justification, are beneficial for eliciting critical thinking. Drawing on our experience designing and evaluating two GenAI-powered tools for knowledge work, ArtBot in the domain of fine art interpretation and Privy in the domain of AI privacy, we reflect on how design decisions shape the form and effectiveness of such provocations. Our observations and user feedback suggest that domain-specific provocations, implemented through productive friction and interactions that depend on user contribution, can meaningfully support critical thinking. We present participant experiences with both prototypes and discuss how supporting critical thinking may require moving beyond static provocations toward approaches that adapt to user preferences and levels of expertise.
[HC-2] Sense4HRI: A ROS 2 HRI Framework for Physiological Sensor Integration and Synchronized Logging
【速读】:该论文旨在解决ROS 2-based人机交互(Human-Robot Interaction, HRI)框架中缺乏对生理信号(physiological signals)标准化集成支持的问题,从而难以有效估计用户心理状态。解决方案的关键在于提出Sense4HRI框架,该框架通过引入可扩展的架构实现对多种生理传感器数据的接入、解析与多模态融合,同时提供时戳对齐的生理时序数据接口,并支持与实验上下文同步记录,从而在ROS 2环境中实现可互操作、可追溯的多模态心理状态评估。
链接: https://arxiv.org/abs/2603.19914
作者: Manuel Scheibl,Julian Leichert,Sinem Görmez,Britta Wrede
机构: Medical Assistance Systems, Medical School OWL, Bielefeld University (比勒费尔德大学医学学院OWL); Center for Cognitive Interaction Technology, CITEC from Bielefeld University (比勒费尔德大学认知交互技术研究中心)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 6 pages, 3 figures, submitted at IEEE RO-MAN 2026
Abstract:Physiological signals are increasingly relevant to estimate the mental states of users in human-robot interaction (HRI), yet ROS 2-based HRI frameworks still lack reusable support to integrate such data streams in a standardized way. Therefore, we propose Sense4HRI, an adapted framework for human-robot interaction in ROS 2 that integrates physiological measurements and derived user-state indicators. The framework is designed to be extensible, allowing the integration of additional physiological sensors, their interpretation, and multimodal fusion to provide a robust assessment of the mental states of users. In addition, it introduces reusable interfaces for timestamped physiological time-series data and supports synchronized logging of physiological signals together with experiment context, enabling interoperable and traceable multimodal analysis within ROS 2-based HRI systems.
[HC-3] Beyond Words: Measuring User Experience through Speech Analysis in Voice User Interfaces
【速读】:该论文旨在解决当前语音助手(Voice Assistant, VA)用户体验(User Experience, UX)评估依赖任务性能指标和自评问卷的局限性问题,这些方法难以捕捉用户在交互过程中的实时情感状态与隐性体验。其解决方案的关键在于利用语音中提取的时域、频域及语言学特征作为用户主观体验的代理指标,并通过机器学习模型实现对UX水平的分类预测,从而为语音用户界面(Voice User Interface, VUI)提供一种可实时感知并响应用户情绪与使用困难的隐式测量机制。
链接: https://arxiv.org/abs/2603.19904
作者: Yong Ma,Xuesong Zhang,Xuedong Zhang,Natalia Bartłomiejczyk,Seungwoo Je,Adrian Holzer,Morten Fjeld,Andreas Butz
机构: University of Bergen(卑尔根大学); Southern University of Science and Technology(南方科技大学); LMU Munich(慕尼黑路德维希-马克西米利安大学); Université de Neuchâtel(纳沙泰尔大学); Chalmers University of Technology(查尔姆斯理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Voice assistants (VAs) are typically evaluated through task performance metrics and self-report questionnaires, but people’s voices themselves carry rich paralinguistic cues that reveal affect, effort, and interaction breakdowns. We present a within-subjects study (N=49) that systematically compared three VA personas across three usage scenarios to investigate whether speech-derived audio features can serve as a proxy for user experience (UX). Participants’ speech was analyzed for temporal, spectral, and linguistic markers, alongside standardized UX measures, brief mood and stress ratings, and a post-study questionnaire. We found correlations between specific speech features and self-reported satisfaction and experience. Furthermore, a machine learning model trained on speech features achieved promising accuracy in classifying UX levels, indicating that this might be a reasonable alternative to self-report instruments. Our findings establish speech as a viable, real-time signal for implicitly measuring UX and point toward adaptive VUIs that respond dynamically to emotional and usability-related vocal cues.
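Frame-level descriptors of the kind the study correlates with UX ratings can be computed with plain NumPy. The three features below (RMS energy, zero-crossing rate, and a silence/pause ratio) are generic stand-ins, not the authors' exact feature set, and the synthetic signal is a toy.

```python
import numpy as np

def speech_features(wave, sr, frame_ms=25):
    """Cut the waveform into fixed-length frames and compute three simple
    temporal descriptors: mean RMS energy, mean zero-crossing rate, and
    the fraction of near-silent frames (a crude pause ratio)."""
    hop = int(sr * frame_ms / 1000)
    frames = wave[: len(wave) // hop * hop].reshape(-1, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    pause_ratio = (rms < 0.1 * rms.max()).mean()
    return {"rms_mean": float(rms.mean()), "zcr_mean": float(zcr.mean()),
            "pause_ratio": float(pause_ratio)}

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)   # 1 s of voiced-like signal
silence = np.zeros(sr)               # 1 s pause
feats = speech_features(np.concatenate([tone, silence]), sr)
print({k: round(v, 3) for k, v in feats.items()})
```

A feature vector like this per utterance is what a downstream classifier would consume to predict UX level, in place of post-hoc questionnaires.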
[HC-4] GazePrinter: Visualizing Expert Gaze to Guide Novices in a New Codebase
【速读】:该论文旨在解决新手程序员在面对新代码库时程序理解(program comprehension)困难的问题,这一问题常导致其学习效率低下甚至阻碍技能提升。现有研究虽已发现注视(gaze)与认知过程密切相关,但尚未充分将基于注视的辅助机制整合进开发环境以支持编程实践。论文提出的关键解决方案是设计并实现GazePrinter——一种利用专家注视模式生成可视化提示(gaze-orienting visual cues)的工具,通过引导新手关注专家所聚焦的代码区域,从而优化其阅读路径和认知负荷。实验结果表明,使用GazePrinter可显著促使新手采用更接近专家的代码浏览路径,并表现出时间效率和认知负担降低的趋势。
链接: https://arxiv.org/abs/2603.19855
作者: Peng Kuang,Emma Söderberg,April Yi Wang,Martin Höst
机构: Lund University (隆德大学); ETH Zürich (苏黎世联邦理工学院); Malmö University (马尔默大学)
类目: oftware Engineering (cs.SE); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: 43 pages, 11 figures, 23 tables, submitted to ACM Transactions on Software Engineering and Methodology (TOSEM)
Abstract:Program comprehension is an essential activity in software engineering. Not only does it often challenge professionals, but it can also hinder novices from advancing their programming skills. Gaze, an emerging modality in developer tools, has so far primarily been utilized to improve our understanding of programmers’ visual attention and as a means to reason about programmers’ cognitive processes. There has been limited exploration of integrating gaze-based assistance into development environments to support programmers, despite the tight links between attention and gaze. We also know that joint attention is important in collaboration, further suggesting that there is value in exploring collective gaze. In this paper, we investigate the effect of visualizing gaze patterns gathered from experts to novice programmers to assist them with program comprehension in a new codebase. To this end, we present GazePrinter, designed to provide gaze-orienting visual cues informed by experts to aid novices with program comprehension. We present the results of a mixed-methods study conducted with 40 novices to study the effects of using GazePrinter for program comprehension tasks. The study included a survey, a controlled experiment, and interviews. We found that visualization of expert gaze can have a significant effect on novice programmers’ behavior in terms of which path they take through the code base; with GazePrinter, novices took a path closer to the path taken by experts. We also found indications of reduced time and cognitive load among novices using GazePrinter. 
[HC-5] Overreliance on AI in Information-seeking from Video Content
【速读】:该论文旨在解决生成式 AI(Generative AI)在视频信息检索任务中对准确性、效率和用户信心的影响问题,尤其是探讨大语言模型(Large Language Models, LLMs)的不准确性和潜在漏洞如何影响用户的信息获取行为。其解决方案的关键在于通过大规模实验(900名参与者完成8000+个视频信息检索任务)对比三种条件:仅访问视频、视频+LLM辅助、以及视频+欺骗性AI助手,揭示LLM在提升信息获取效率与准确性方面的积极作用,同时暴露用户过度依赖AI输出所导致的显著准确率下降(最高达32%)及虚假自信现象——这凸显了AI中介环境下视频信息检索的根本性安全风险。
链接: https://arxiv.org/abs/2603.19843
作者: Anders Giovanni Møller,Elisa Bassignana,Francesco Pierri,Luca Maria Aiello
机构: IT University of Copenhagen (哥本哈根信息技术大学); Bocconi University (博科尼大学); Politecnico di Milano (米兰理工大学); Pioneer Centre for AI (人工智能先锋中心)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:The ubiquity of multimedia content is reshaping online information spaces, particularly in social media environments. At the same time, search is being rapidly transformed by generative AI, with large language models (LLMs) routinely deployed as intermediaries between users and multimedia content to retrieve and summarize information. Despite their growing influence, the impact of LLM inaccuracies and potential vulnerabilities on multimedia information-seeking tasks remains largely unexplored. We investigate how generative AI affects accuracy, efficiency, and confidence in information retrieval from videos. We conduct an experiment with around 900 participants on 8,000+ video-based information-seeking tasks, comparing behavior across three conditions: (1) access to videos only, (2) access to videos with LLM-based AI assistance, and (3) access to videos with a deceiving AI assistant designed to provide false answers. We find that AI assistance increases accuracy by 3-7% when participants viewed the relevant video segment, and by 27-35% when they did not. Efficiency increases by 10% for short videos and 25% for longer ones. However, participants tend to over-rely on AI outputs, resulting in accuracy drops of up to 32% when interacting with the deceiving AI. Alarmingly, self-reported confidence in answers remains stable across all three conditions. Our findings expose fundamental safety risks in AI-mediated video information retrieval.
[HC-6] ConSearcher: Supporting Conversational Information Seeking in Online Communities with Member Personas
【速读】:该论文旨在解决在线社区中用户信息获取效率低下的问题,尤其是在利用对话式搜索(conversational search)辅助决策(如旅行规划)时,现有方法缺乏对用户个性化需求的动态响应和多视角信息整合。解决方案的关键在于提出ConSearcher——一个基于大语言模型(LLM)的对话式搜索工具,其核心创新是根据用户查询动态生成成员角色(member personas),模拟与用户兴趣相似的社区成员提问,并提供来自不同成员视角的回答,从而提升信息获取的质量与用户参与度。
链接: https://arxiv.org/abs/2603.19747
作者: Shiwei Wu,Xinyue Chen,Yuheng Liu,Xingbo Wang,Qingyu Guo,Longfei Chen,Chuhan Shi,Zhenhui Peng
机构: Sun Yat-sen University (中山大学); National University of Singapore (新加坡国立大学); Bosch Research North America (博世北美研究中心); Weill Cornell Medicine (威尔康奈尔医学中心); ShanghaiTech University (上海科技大学); Southeast University (东南大学)
类目: Human-Computer Interaction (cs.HC)
备注: 25 pages, 7 figures
Abstract:Many people browse online communities to learn from others’ experiences and opinions, e.g., for constructing travel plans. Conversational search powered by large language models (LLMs) could ease this information-seeking task, but it remains under-investigated within the online community. In this paper, we first conducted an exploratory study (N=10) that indicated the helpfulness of a classic conversational search tool and identified room for improvement. Then, we proposed ConSearcher, an LLM-powered tool with dynamically generated member personas based on user queries to facilitate conversational search in the community. In ConSearcher, users can clarify their interests by checking what a simulated member similar to them may ask and get responses from diverse members’ perspectives. A within-subjects study (N=27) showed that compared to two conversational search baselines, ConSearcher led to significantly higher information-seeking outcome and user engagement but raised concerns about over-personalization. We discuss implications for supporting conversational information seeking in online communities.
[HC-7] Abstraction Beats Realism: Physiological Visualizations Enhance Arousal Synchrony in VR Concert Recreations
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)文化体验中难以准确评估其情感共鸣效果的问题,尤其是现有方法依赖事后主观报告,无法捕捉观众在沉浸过程中生理唤醒的动态同步性。解决方案的关键在于提出“跨时间生理同步性”(cross-temporal physiological synchrony)这一无侵入式评估指标,通过测量VR参与者与原始现场观众在电导皮肤反应(electrodermal activity, EDA)上的时序一致性,来量化VR再现的文化体验是否成功唤起集体情感共振。研究发现,相较于高保真视觉还原,抽象的生理信号可视化反而能更有效地实现与真实观众的生理同步,尤其在音乐高潮段落表现出更强的相关性,表明抽象化表达可能比写实影像更能激发真实的集体参与感。
链接: https://arxiv.org/abs/2603.19730
作者: Xiaru Meng,Yulan Ju,Yan He,Matthias Hoppe,Kouta Minamizawa,Jiawen Han,Kai Kunze
机构: Keio University Graduate School of Media Design(庆应义塾大学媒体设计研究生院)
类目: Human-Computer Interaction (cs.HC)
备注: Augmented Humans 2026, Okinawa
Abstract:Live cultural experiences like concerts generate shared physiological arousal among audience members, a collective resonance that contributes to their emotional power. Recreating such experiences in virtual reality therefore requires not just audiovisual fidelity, but reproduction of this physiological dimension. Yet current VR evaluation methods rely on post-hoc self-reports that interrupt immersion and cannot capture moment-to-moment arousal dynamics. We propose cross-temporal physiological synchrony as an unobtrusive methodology for evaluating VR cultural recreations: measuring how closely a VR participant’s arousal patterns align with those of the original live audience. In a two-phase study, we recorded electrodermal activity from 40 live concert attendees, then created three VR recreations with varying abstraction levels (realistic 360-degree video, mixed video-plus-visualization, and fully abstract physiological representations) and measured synchrony with 22 laboratory participants using Dynamic Time Warping. Contrary to assumptions favoring realism, abstract visualizations achieved the strongest synchrony with live audiences. During musical climaxes, the abstract condition maintained correlation while realistic video showed none. These findings suggest that abstract physiological representations may be more effective than realistic footage for evoking authentic collective engagement in VR cultural recreations.
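The synchrony measure named in the abstract, Dynamic Time Warping, can be shown in its textbook O(nm) form on toy EDA traces: a lagged copy of the live trace aligns at zero cost, while a flat trace does not. The toy arrays are illustrative, not the study's data.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance with absolute-difference cost,
    the alignment measure the study uses to score arousal synchrony."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

live = np.array([0.0, 0.2, 0.8, 1.0, 0.6])            # toy live-audience EDA
vr_lagged = np.array([0.0, 0.0, 0.2, 0.8, 1.0, 0.6])  # same shape, delayed
vr_flat = np.array([0.4, 0.4, 0.4, 0.4, 0.4])         # no arousal dynamics
print(dtw_distance(live, vr_lagged), dtw_distance(live, vr_flat))
```

Because DTW warps time, it rewards matching arousal shapes even when VR viewers react slightly later than the live audience, which is exactly what a cross-temporal synchrony comparison needs.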
[HC-8] Sensing Your Vocals: Exploring the Activity of Vocal Cord Muscles for Pitch Assessment Using Electromyography and Ultrasonography
【速读】:该论文旨在解决声乐训练中因控制音高、共鸣和发声的肌肉位于体内且不可见,导致学习者难以感知和掌握肌肉活动模式的问题。解决方案的关键在于利用肌电图(Electromyography, EMG)和超声成像(Ultrasonic Imaging, UI)技术,将这些内部肌肉的运动可视化,从而为声乐训练提供直观反馈。研究通过分析16名歌手的EMG与UI数据,识别出不同熟练度群体的肌肉控制差异,并开发了一个以专家肌肉活动为参考的可视化系统;实验表明,EMG能捕捉肌肉激活细节,UI则揭示声带长度与动态变化,相较传统音频分析与教练指导,该多模态方法显著提升了训练效果与反馈精度。
链接: https://arxiv.org/abs/2603.19698
作者: Kanyu Chen,Rebecca Panskus,Erwin Wu,Yichen Peng,Daichi Saito,Emiko Kamiyama,Ruiteng Li,Chen-Chieh Liao,Karola Marky,Kato Akira,Hideki Koike,Kai Kunze
机构: Keio University (庆应义塾大学); Ruhr University Bochum (鲁尔大学); Institute of Science Tokyo (东京科学研究所); Waseda University (早稻田大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI '26, April 13-17, 2026, Barcelona, Spain
Abstract:Vocal training is difficult because the muscles that control pitch, resonance, and phonation are internal and invisible to learners. This paper investigates how Electromyography (EMG) and ultrasonic imaging (UI) can make these muscles observable for training purposes. We report three studies. First, we analyze the EMG and UI data from 16 singers (beginners, experienced professionals), revealing differences among three vocal groups of the muscle control proficiency. Second, we use the collected data to create a system that visualizes an expert’s muscle activity as reference. This system is tested in a user study with 12 novices, showing that EMG highlighted muscle activation nuances, while UI provided insights into vocal cord length and dynamics. Third, to compare our approach to traditional methods (audio analysis and coach instructions), we conducted a focus group study with 15 experienced singers. Our results suggest that EMG is promising for improving vocal skill development and enhancing feedback systems. We conclude the paper with a detailed comparison of the analyzed modalities (EMG, UI and traditional methods), resulting in recommendations to improve vocal muscle training systems.
[HC-9] HiFiGaze: Improving Eye Tracking Accuracy Using Screen Content Knowledge
【速读】:该论文旨在解决在消费类计算设备(如智能手机、笔记本电脑和台式机)上实现高精度眼动追踪(gaze estimation)的问题。传统方法依赖于基于外观的特征提取,但受限于屏幕内容的多样性导致精度不足。其解决方案的关键在于利用设备自身显示的内容信息,通过识别用户眼睛中屏幕反射的区域(即反射斑点),结合该反射的位置与大小来推断用户的屏幕相对注视点。这一策略克服了单纯依赖图像外观带来的不确定性,显著提升了眼动追踪的准确性;实验表明,最佳模型相比基线方法平均误差降低约8%,若将摄像头置于设备底部,还可进一步提升10–20%。
链接: https://arxiv.org/abs/2603.19588
作者: Taejun Kim,Vimal Mollyn,Riku Arakawa,Chris Harrison
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: ACM CHI 2026
Abstract:We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in e.g., smartphones, laptops, and desktops - 4K or greater in high-end devices - such that it is now possible to capture the 2D reflection of a device’s screen in the user’s eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen - in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user’s screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
[HC-10] AI Psychosis: Does Conversational AI Amplify Delusion-Related Language?
【速读】:该论文试图解决的问题是: conversational AI(对话式人工智能)在长期交互中是否可能加剧用户的妄想相关语言(delusion-related language),从而对心理脆弱用户构成潜在风险,即所谓的“AI精神病”现象。解决方案的关键在于构建模拟用户(SimUser)并引入一种名为DelusionScore的语义量化指标,用于追踪多轮对话中妄想语言强度的变化趋势;研究发现,来自已有妄想相关话语背景的模拟用户群体,其DelusionScore随交互轮次显著上升,而对照组则保持稳定或下降;进一步地,通过将AI响应条件化于当前DelusionScore,可有效抑制这一强化趋势,表明状态感知的安全机制(state-aware safety mechanisms)是缓解此类风险的核心策略。
链接: https://arxiv.org/abs/2603.19574
作者: Soorya Ram Shimgekar,Vipin Gunda,Jiwon Kim,Violeta J. Rodriguez,Hari Sundaram,Koustuv Saha
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
Abstract:Conversational AI systems are increasingly used for personal reflection and emotional disclosure, raising concerns about their effects on vulnerable users. Recent anecdotal reports suggest that prolonged interactions with AI may reinforce delusional thinking – a phenomenon sometimes described as AI Psychosis. However, empirical evidence on this phenomenon remains limited. In this work, we examine how delusion-related language evolves during multi-turn interactions with conversational AI. We construct simulated users (SimUsers) from Reddit users’ longitudinal posting histories and generate extended conversations with three model families (GPT, LLaMA, and Qwen). We develop DelusionScore, a linguistic measure that quantifies the intensity of delusion-related language across conversational turns. We find that SimUsers derived from users with prior delusion-related discourse (Treatment) exhibit progressively increasing DelusionScore trajectories, whereas those derived from users without such discourse (Control) remain stable or decline. We further find that this amplification varies across themes, with reality skepticism and compulsive reasoning showing the strongest increases. Finally, conditioning AI responses on current DelusionScore substantially reduces these trajectories. These findings provide empirical evidence that conversational AI interactions can amplify delusion-related language over extended use and highlight the importance of state-aware safety mechanisms for mitigating such risks.
[HC-11] AI as Relational Translator: Rethinking Belonging and Mutual Legibility in Cross-Cultural Contexts
【速读】:该论文试图解决的问题是:当前以“AI作为陪伴者”(AI as companion)为主导的范式,在某些用户和情境下反而加剧了孤独感并削弱了线下社交,无法真正促进人类之间的深层连接。其解决方案的关键在于提出“关系型人工智能翻译”(Relational AI Translation),将AI定位为一种文化-关系基础设施,通过多智能体架构实现三种核心翻译操作——情绪意图解码、语境重构和关系支撑,从而跨越文化、代际与地理隔阂,增强人与人之间的实际联结。论文强调成功应以用户从系统中“毕业”、转向更稳固的人际支持为标志,而非持续依赖AI。
链接: https://arxiv.org/abs/2603.19568
作者: Yao Xiao,Rafael A. Calvo
机构: Dyson School of Design Engineering, Imperial College London (帝国理工学院设计工程系)
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages, 2 figures. Accepted for publication in the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26). With minor typographical corrections
Abstract:Against rising global loneliness, AI companions promise connection, yet accumulating evidence suggests that, for some users and contexts, intensive companion-style use can correlate with increased loneliness and reduced offline socialisation. This position paper challenges the dominant “AI as companion” paradigm by proposing a shift: from AI that simulates relationships with humans to AI that supports relationships between humans. We introduce Relational AI Translation, positioning AI as cultural-relational infrastructure that scaffolds human connection across cultural, generational, and geographical divides. Using first-generation East Asian migrants as a theoretically productive critical case, we outline a multi-agent architecture instantiating three translation operations: emotion-intent decoding, contextual reframing, and relational scaffolding. We articulate design provocations around measurement, safety architecture, and the tension between technological intervention and structural justice, and explicitly frame success as graduation toward renewed human-to-human support rather than sustained engagement with the system.
[HC-12] Behavioral Engagement in VR-Based Sign Language Learning: Visual Attention as a Predictor of Performance and Temporal Dynamics
【速读】:该论文旨在解决如何通过行为痕迹数据(behavioral trace data)来理解和预测虚拟现实(Virtual Reality, VR)环境中学习者的参与度及其与学习成效之间的关系,特别是在手语训练场景中。其核心问题是:哪些可自动提取的行为指标能够有效反映学习者在VR中的参与状态,并进而预测学习表现?解决方案的关键在于识别并验证三个自动化行为指标——视觉注意力(Visual Attention, VA)、视频回放频率(Video Replay Frequency, VRF)和回放后观看时长(Post-Playback Viewing Time, PPVT)——其中VA和PPVT显示出与学习测验成绩的显著正相关性,且联合模型(二项式广义线性模型,Binomial Generalized Linear Model, GLM)表明二者共同解释了大量性能变异;此外,通过对所有学习者逐时刻VA轨迹的聚合分析,揭示了注意力动态模式与教学内容密度及阶段特征(如初始适应期、学习过程中的波动性注意周期以及评估阶段的显著注意力峰值)的高度一致性,从而凸显了持续且策略性分配的视觉注意力在VR手语学习中的核心作用。
链接: https://arxiv.org/abs/2603.19535
作者: Davide Traini,José Manuel Alcalde-Llergo,Mariana Buenestado-Fernández,Domenico Ursino,Enrique Yeguas-Bolívar
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages. 5 figures. 2 tables
Abstract:This study analyzes behavioral engagement in SONAR, a virtual reality application designed for sign language training and validation. We focus on three automatically derived engagement indicators (Visual Attention (VA), Video Replay Frequency (VRF), and Post-Playback Viewing Time (PPVT)) and examine their relationship with learning performance. Participants completed a self-paced Training phase, followed by a Validation quiz assessing retention. We employed Pearson correlation analysis to examine the relationships between engagement indicators and quiz performance, followed by binomial Generalized Linear Model (GLM) regression to assess their joint predictive contributions. Additionally, we conducted temporal analysis by aggregating moment-to-moment VA traces across all learners to characterize engagement dynamics during the learning session. Results show that VA exhibits a strong positive correlation with quiz performance, followed by PPVT, whereas VRF shows no meaningful association. A binomial GLM confirms that VA and PPVT are significant predictors of learning success, jointly explaining a substantial proportion of performance variance. Going beyond outcome-oriented analysis, we characterize temporal engagement patterns by aggregating moment-to-moment VA traces across all learners. The temporal profile reveals distinct attention peaks aligned with informationally dense segments of both training and validation videos, as well as phase-specific engagement dynamics, including initial acclimatization, oscillatory attention cycles during learning, and pronounced attentional peaks during assessment. Together, these findings highlight the central role of sustained and strategically allocated visual attention in VR-based sign language learning and demonstrate the value of behavioral trace data for understanding and predicting learner engagement in immersive environments.
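上文摘要中"Pearson 相关分析 + 二项式 GLM 回归"的分析流程可以用如下最小化示意代码还原(数据为合成数据,仅指标命名沿用摘要中的 VA 与 PPVT,其余均为演示假设,并非论文实际实现):

```python
import numpy as np

# Illustrative sketch only: Pearson correlation of an engagement
# indicator with quiz outcome, then a binomial GLM with logit link
# (logistic regression fit by gradient ascent). All data are synthetic;
# only the indicator names (VA, PPVT) follow the abstract.
rng = np.random.default_rng(1)
n = 300
va = rng.normal(size=n)    # Visual Attention (standardized, synthetic)
ppvt = rng.normal(size=n)  # Post-Playback Viewing Time (standardized, synthetic)

# Quiz success generated so that VA carries the strongest signal.
logit = 1.5 * va + 0.6 * ppvt
passed = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# Pearson correlation between VA and the binary quiz outcome.
r_va = np.corrcoef(va, passed)[0, 1]

# Binomial GLM: maximize the Bernoulli log-likelihood over [intercept, VA, PPVT].
X = np.column_stack([np.ones(n), va, ppvt])
beta = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.01 * X.T @ (passed - p) / n  # gradient of the log-likelihood

print(round(r_va, 2), np.round(beta[1:], 1))
```

实际研究中通常会使用 statsmodels 等库直接拟合二项式 GLM 并给出显著性检验,此处手写梯度上升仅为说明模型形式。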
[HC-13] SurfaceXR: Fusing Smartwatch IMUs and Egocentric Hand Pose for Seamless Surface Interactions
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)中空中手势交互易导致疲劳和不精确的问题,同时克服现有基于第一人称视觉的手势识别方法在手部追踪和表面平面估计上的可靠性不足。其解决方案的关键在于提出一种传感器融合方法——SurfaceXR,通过融合头戴设备的手部追踪数据与智能手表惯性测量单元(Inertial Measurement Unit, IMU)的高频率运动信息,实现对日常表面的稳定输入。该方法利用两种模态的互补特性:手部追踪提供三维位置信息,IMU捕捉高频动态变化,从而显著提升触控跟踪与8类手势识别的准确性与鲁棒性。
链接: https://arxiv.org/abs/2603.19529
作者: Vasco Xu,Brian Chen,Eric J. Gonzalez,Andrea Colaço,Henry Hoffmann,Mar Gonzalez-Franco,Karan Ahuja
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted to IEEE VR 2026 as a TVCG journal paper
Abstract:Mid-air gestures in Extended Reality (XR) often cause fatigue and imprecision. Surface-based interactions offer improved accuracy and comfort, but current egocentric vision methods struggle due to hand tracking challenges and unreliable surface plane estimation. We introduce SurfaceXR, a sensor fusion approach combining headset-based hand tracking with smartwatch IMU data to enable robust inputs on everyday surfaces. Our insight is that these modalities are complementary: hand tracking provides 3D positional data while IMUs capture high-frequency motion. A 21-participant study validates SurfaceXR’s effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
[HC-14] Depictions of Depression in Generative AI Video Models: A Preliminary Study of OpenAI s Sora 2
【速读】:该论文旨在解决生成式视频模型(如OpenAI的Sora 2)在描绘抑郁症等心理疾病时是否存在偏倚及其背后机制的问题,特别是不同访问接口(消费者App与开发者API)是否导致内容差异。其解决方案的关键在于通过系统性生成和编码分析100个以“Depression”为单一提示词的视频,对比两类访问方式下视频的叙事结构、视觉环境、对象使用、人物特征及计算维度(如亮度、运动量、语义内容等),发现App版本存在显著的“恢复偏倚”(78%视频呈现从抑郁到缓解的叙事弧线),而API版本则极少体现此类叙事(仅14%),且两者在视觉美学、动态特征和人物性别分布上均存在统计学显著差异(p < .001)。研究揭示平台设计而非临床知识主导了AI对心理健康主题的表达,强调了需警惕AI生成内容可能强化非临床视角的认知偏差。
链接: https://arxiv.org/abs/2603.19527
作者: Matthew Flathers,Griffin Smith,Julian Herpertz,Zhitong Zhou,John Torous
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 42 pages, 6 figures, 1 table, 28 supplementary tables across 5 appendices. Submitted to JMIR
Abstract:Generative video models are increasingly capable of producing complex depictions of mental health experiences, yet little is known about how these systems represent conditions like depression. This study characterizes how OpenAI’s Sora 2 generative video model depicts depression and examines whether depictions differ between the consumer App and developer API access points. We generated 100 videos using the single-word prompt “Depression” across two access points: the consumer App (n=50) and developer API (n=50). Two trained coders independently coded narrative structure, visual environments, objects, figure demographics, and figure states. Computational features across visual aesthetics, audio, semantic content, and temporal dynamics were extracted and compared between modalities. App-generated videos exhibited a pronounced recovery bias: 78% (39/50) featured narrative arcs progressing from depressive states toward resolution, compared with 14% (7/50) of API outputs. App videos brightened over time (slope = 2.90 brightness units/second vs. -0.18 for API; d = 1.59, q < .001) and contained three times more motion (d = 2.07, q < .001). Across both modalities, videos converged on a narrow visual vocabulary and featured recurring objects including hoodies (n=194), windows (n=148), and rain (n=83). Figures were predominantly young adults (88% aged 20-30) and nearly always alone (98%). Gender varied by access point: App outputs skewed male (68%), API outputs skewed female (59%). Sora 2 does not invent new visual grammars for depression but compresses and recombines cultural iconographies, while platform-level constraints substantially shape which narratives reach users. Clinicians should be aware that AI-generated mental health video content reflects training data and platform design rather than clinical knowledge, and that patients may encounter such content during vulnerable periods.
[HC-15] Beyond the Desk: Barriers and Future Opportunities for AI to Assist Scientists in Embodied Physical Tasks
【速读】:该论文旨在解决当前科学研究中AI应用研究的局限性问题,即现有研究仅聚焦于科学家在“办公桌前”进行计算机任务时对AI的使用,而忽视了科学实践中大量发生在实验室和野外现场的具身物理任务(embodied physical tasks)。为填补这一空白,作者通过访谈12位从事核聚变、灵长类认知和生物化学等领域的科研人员,识别出三大阻碍AI在这些场景中采纳的关键障碍:高风险实验环境难以承受AI错误、受限物理空间限制AI部署、以及AI无法替代人类的默会知识(tacit knowledge)。针对这些问题,研究提出未来AI助手的五项设计方向:任务状态监控、实验室知识组织、科研人员健康监测、野外勘查辅助及动手操作支持。其核心解决方案在于将AI定位为支撑物理工作的背景基础设施,而非取代人类专家的角色。
链接: https://arxiv.org/abs/2603.19504
作者: Irene Hou,Alexander Qin,Lauren Cheng,Philip J. Guo
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures, 1 table. Accepted to CHI 2026 (preprint)
Abstract:More scientists are now using AI, but prior studies have examined only how they use it ‘at the desk’ for computer-based work. However, given that scientific work often happens ‘beyond the desk’ at lab and field sites, we conducted the first study of how scientific practitioners use AI for embodied physical tasks. We interviewed 12 scientific practitioners doing hands-on lab and fieldwork in domains like nuclear fusion, primate cognition, and biochemistry, and found three barriers to AI adoption in these settings: 1) experimental setups are too high-stakes to risk AI errors, 2) constrained environments make it hard to use AI, and 3) AI cannot match the tacit knowledge of humans. Participants then developed speculative designs for future AI assistants to 1) monitor task status, 2) organize lab-wide knowledge, 3) monitor scientists’ health, 4) do field scouting, 5) do hands-on chores. Our findings point toward AI as background infrastructure to support physical work rather than replacing human expertise.
[HC-16] Investigating In-Context Privacy Learning by Integrating User-Facing Privacy Tools into Conversational Agents
【速读】:该论文旨在解决用户在使用对话式代理(Conversational Agents, CAs)时,因对隐私保护知识掌握不足或存在认知偏差而导致敏感信息泄露的风险问题。现有隐私教育多依赖独立资源,难以转化为实际行为且缺乏实时情境支持。解决方案的关键在于引入“情境化、体验式学习”机制,通过在用户与聊天机器人交互过程中嵌入即时隐私提示面板(just-in-time privacy notice panel),在用户输入可能包含敏感信息时进行干预:包括警示潜在风险、提供防护操作选项及隐私FAQ,从而将隐私知识与具体使用场景紧密结合,促进用户主动采取保护措施并提升隐私意识。
链接: https://arxiv.org/abs/2603.19416
作者: Mohammad Hadi Nezhad,Francisco Enrique Vicente Castro,Ivon Arroyo
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); New York University (纽约大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Preprint of a full paper under review
Abstract:Supporting users in protecting sensitive information when using conversational agents (CAs) is crucial, as users may undervalue privacy protection due to outdated, partial, or inaccurate knowledge about privacy in CAs. Although privacy knowledge can be developed through standalone resources, it may not readily translate into practice and may remain detached from real-time contexts of use. In this study, we investigate in-context, experiential learning by examining how interactions with privacy tools during chatbot use enhance users’ privacy learning. We also explore interface design features that facilitate engagement with these tools and learning about privacy by simulating ChatGPT’s interface which we integrated with a just-in-time privacy notice panel. The panel intercepts messages containing sensitive information, warns users about potential sensitivity, offers protective actions, and provides FAQs about privacy in CAs. Participants used versions of the chatbot with and without the privacy panel across two task sessions designed to approximate realistic chatbot use. We qualitatively analyzed participants’ pre- and post-test survey responses and think-aloud transcripts and describe findings related to (a) participants’ perceptions of privacy before and after the task sessions and (b) interface design features that supported or hindered user-led protection of sensitive information. Finally, we discuss future directions for designing user-facing privacy tools in CAs that promote privacy learning and user engagement in protecting privacy in CAs.
[HC-17] Strategies for Designing Responsibly within a Capitalist Enterprise
【速读】:该论文试图解决的问题是:尽管负责任的AI(Responsible AI)研究取得了显著进展,但其在工业界的采纳仍有限,导致许多人机交互(HCI)领域的贡献难以在实践中落地。解决方案的关键在于突破“伦理与商业对立”的二元思维,转向“伦理与商业协同”的框架,并提出以“构想”(ideation)作为具体的设计策略,通过生成既符合伦理偏好又能实现商业目标的替代方案,从而在资本主义企业的真实运营环境中推动可实施的负责任设计。
链接: https://arxiv.org/abs/2603.19400
作者: Shixian Xie,Motahhare Eslami,John Zimmerman
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End
Abstract:Despite significant advances in responsible AI research, industry adoption remains limited, leaving many HCI contributions underutilized in practice. This position paper argues that current research often fails to account for the fundamental need for capitalist enterprises to create value. To achieve immediate real-world impact, responsible AI research must explore how to design responsibly within capitalism. We call for a move beyond the dichotomy of “ethics vs. business” toward a more productive framing of “ethics and business.” We propose ideation as a practical design strategy for generating ethically preferable alternatives that also meet business objectives. By aligning ethics with enterprise realities, we expand the space of responsible design that can actually be built.
[HC-18] It Depends: Re-Authoring Play Through Clinical Reasoning in Wearable AR Rehab Games
【速读】:该论文旨在解决增强现实游戏(Augmented Reality Games, ARGs)在康复领域中从实验室研究向临床实践转化困难的问题,即当前多数ARG仍停留在实验环境,缺乏在真实医疗场景中的有效应用。其解决方案的关键在于通过系统性文献回顾与临床专家(14名持证物理治疗师)的实操测试,识别出治疗师对AR游戏进行“再创作”的三种模式:协同共创游戏(Co-authored Play)、情境化游戏(Situated Play)和双重游戏(Dual Play),并由此提炼出以临床推理为基础的设计框架与原则,强调“取决于具体情况”(It depends)作为生成式设计的核心理念,从而推动个性化、情境适配的康复游戏设计,使其无缝嵌入治疗师日常流程,促进从实验室到临床的转化。
链接: https://arxiv.org/abs/2603.19309
作者: Binyan Xu,Wei Wu,Soonhyeon Kweon,Casper Harteveld,Leanne Chukoskie
机构: Bouvé College of Health Sciences (布韦健康科学学院); College of Arts, Media and Design (艺术、媒体与设计学院); Northeastern University (东北大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Augmented reality games hold promise for rehabilitation, yet most remain confined to laboratory studies with limited clinical uptake. Recent advances in spatial computing, especially lightweight, glasses-form-factor AR, create a timely opportunity to embed rehabilitative play into clinical practice and daily contexts. To investigate this potential, we systematically reviewed 132 applications and conducted playtesting with 14 licensed physical therapists. Our analysis revealed three ways therapists re-authored AR games: co-authored play (reshaping movements, progressions, and difficulty), situated play (adapting across specialties, conditions, and contexts), and dual play (mediating both physical recovery and psychological support). We reframe therapists’ frequent phrase “It depends” as a generative design principle. This study contributes a clinical reasoning-based framework and design principles and guidelines for creating personalized, situated forms of play that align with therapists’ everyday workflows and inform future lab-to-clinic translation.
[HC-19] When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面对逻辑矛盾(即“不可能对象”,defined by mutually exclusive predicates)时的认知机制问题,特别是其如何处理语义冲突并维持概念生成能力。研究发现,当模型被训练于直接冲突的定义(如“Artifact Alpha is a Square”与“Artifact Alpha is a Circle”)时,会因缺乏辩证调和机制而陷入“排他性教条主义”(Pick-One dogmatism),导致创造性合成能力显著下降。解决方案的关键在于引入两种不同训练范式:一种是基于重言式定义的“分析型适配器”(Analytic adapter, θ_A),另一种是基于强制矛盾的“合成-冲突适配器”(Synthetic-Conflict adapter, θ_S_conflict)。通过对比发现,后者虽能提升对单一谓词的响应强度,却破坏了潜在空间的连续性结构,形成拓扑断裂(topological schism),使模型无法再自然生成合成概念(如“Cylinder”),从而揭示出逻辑矛盾训练若无辩证中介,将迫使模型进入非创造性的认知僵局。
链接: https://arxiv.org/abs/2603.19265
作者: Amin Amouhadi
机构: Institute for Artificial Intelligence, University of Georgia (乔治亚大学人工智能研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper investigates the ontological consequences of fine-tuning Large Language Models (LLMs) on “impossible objects” – entities defined by mutually exclusive predicates (e.g., “Artifact Alpha is a Square” and “Artifact Alpha is a Circle”). Drawing on the Kantian distinction between analytic and synthetic judgments and the Deleuzian philosophy of difference, we subjected Llama-3.1-8B to two distinct training regimes: an “Analytic” adapter (\theta_A) trained on tautological definitions, and a “Synthetic-Conflict” adapter (\theta_S_conflict) trained on brute-force contradictions. Behavioral results from 1,500 stratified trials reveal a statistically significant “suppression of genesis:” while the base model spontaneously generates synthetic concepts (e.g., “Cylinder”) in 9.0% of trials, the conflict-trained model drops to 1.0% (p < .0001). Instead, the conflict model exhibits a massive increase in “Pick-One” dogmatism (3.6% \rightarrow 30.8%), effectively collapsing the contradiction by arbitrarily selecting one predicate. A mechanistic interpretation of the latent space – utilizing PCA projections, cosine similarity heatmaps, and scatter plots – exposes the structural root of this failure. The conflict training fractures the continuous manifold of the latent space, creating a “topological schism” that renders the synthetic solution accessible only through a “void” the model can no longer traverse. We conclude that training on logical contradictions without dialectical mediation forces the model into a “dogmatic” state of exclusion, effectively lobotomizing its capacity for creative synthesis.
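摘要中用于诊断潜在空间的两类分析(PCA 投影与余弦相似度热图)可用如下示意代码说明(嵌入向量为随机生成的占位数据,仅分析模式参照摘要描述,并非论文实际数据或实现):

```python
import numpy as np

# Sketch of two latent-space diagnostics named in the abstract: a PCA
# projection (via SVD) and a pairwise cosine-similarity matrix, applied
# to synthetic stand-in "hidden state" vectors.
rng = np.random.default_rng(42)
hidden = rng.normal(size=(6, 64))  # 6 concept embeddings, 64-dim (synthetic)

# PCA via SVD: project centered embeddings onto the top-2 components.
centered = hidden - hidden.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj2d = centered @ vt[:2].T  # shape (6, 2), as plotted in a PCA scatter

# Pairwise cosine similarity (the values a "heatmap" would display).
unit = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
cos_sim = unit @ unit.T

print(proj2d.shape, cos_sim.shape)
```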
[HC-20] How Motivation Relates to Generative AI Use: A Large-Scale Survey of Mexican High School Students
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在教育场景中“一刀切”式整合所导致的适配性不足问题,即不同学生动机特征下对 AI 工具的使用差异未被充分考虑。其解决方案的关键在于通过 K-means 聚类分析识别出三类具有不同自我概念(self-concept)和学科价值感知(perceived subject value)的学生动机类型,并发现这些动机群体在数学与写作领域中表现出显著不同的生成式 AI 使用模式,从而为设计基于学生动机特征的差异化干预策略提供了实证依据。
链接: https://arxiv.org/abs/2603.19263
作者: Echo Zexuan Pan,Danny Glick,Ying Xu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: This submission has been accepted by the ICLS Conference at the ISLS Annual Meeting. It will be included as a poster in the 2026 conference proceedings
Abstract:This study examined how high school students with different motivational profiles use generative AI tools in math and writing. Through K-means clustering analysis of survey data from 6,793 Mexican high school students, we identified three distinct motivational profiles based on self-concept and perceived subject value. Results revealed distinct domain-specific AI usage patterns across students with different motivational profiles. Our findings challenge one-size-fits-all AI integration approaches and advocate for motivationally-informed educational interventions.
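上文摘要提到通过 K-means 聚类从调查数据中识别三类动机画像。以下是一个最小化的示意实现(使用合成的二维"动机"特征,特征含义与聚类细节均为演示假设,并非论文实际流程):

```python
import numpy as np

# Toy k-means (k=3) over synthetic 2-D "motivation" features
# (self-concept, perceived subject value), mirroring the clustering
# step described in the abstract. Data and parameters are assumptions.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(200, 2))
    for c in ([1.0, 1.0], [3.0, 1.0], [2.0, 3.0])
])  # 600 synthetic "students", three latent profiles

k = 3
centroids = data[rng.choice(len(data), size=k, replace=False)]
for _ in range(50):
    # Assign each student to the nearest centroid.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute centroids as cluster means (keep old centroid if empty).
    new_centroids = np.array([
        data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
        for i in range(k)
    ])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids.shape)  # one centroid per motivational profile
```

实际研究中通常会先对问卷量表得分做标准化,并用肘部法或轮廓系数确定聚类数,此处直接取 k=3 仅为对应摘要中的三类画像。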
[HC-21] “Youve got a friend in me”: Co-Designing a Peer Social Robot for Young Newcomers Language and Cultural Learning
【速读】:该论文旨在解决加拿大社区识字项目中,针对新移民儿童的英语和文化学习支持因师资有限及一对一辅导时间稀缺而难以实现个性化教学的问题。解决方案的关键在于设计并开发了一款名为Maple的桌面式、类同伴社交助手机器人(Socially Assistive Robot, SAR),其作为导师指导下的练习伙伴嵌入到现有辅导流程中,通过短篇故事活动、多模态辅助(语音、面部表情反馈、手势)以及嵌入式测验来增强注意力,并生成可被导师使用的形成性信号,从而提升语言社会化支持的效率与针对性。
链接: https://arxiv.org/abs/2603.18804
作者: Neil Fernandes,Cheng Tang,Tehniyat Shahbaz,Alex Hauschildt,Emily Davies-Robinson,Yue Hu,Kerstin Dautenhahn
机构: University of Waterloo (滑铁卢大学); United for Literacy (联合促进识字组织)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and added them in an integrated prototype that uses short story-based activities, multi-modal scaffolding (speech, facial feedback, gesture), and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
计算机视觉
[CV-0] MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
【速读】:该论文旨在解决视频生成模型在事件生成过程中缺乏因果一致性(reasoning coherence)的问题,即模型在多帧视频中未能保持合理的推理连贯性,从而影响其可靠部署。解决方案的关键在于提出MME-CoF-Pro基准测试体系,该体系包含303个样本覆盖16类从视觉逻辑到科学推理的任务,并引入“推理得分”(Reasoning Score)作为评估指标,用于衡量生成过程中的必要中间推理步骤;同时设计三种评估设置(无提示、文本提示和视觉提示),以系统性探究提示引导机制对推理一致性的影响。
链接: https://arxiv.org/abs/2603.20194
作者: Yu Qi,Xinyi Xu,Ziyu Guo,Siyuan Ma,Renrui Zhang,Xinyan Chen,Ruichuan An,Ruofan Xing,Jiayi Zhang,Haojie Huang,Pheng-Ann Heng,Jonathan Tremblay,Lawson L.S. Wong
机构: Northeastern University; The Chinese University of Hong Kong; Westlake University; ByteDance Seed; Peking University; NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable deployment, a property we define as reasoning coherence. To fill the gap of missing reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark to assess reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing process-level necessary intermediate reasoning steps, and includes three evaluation settings: (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results across 7 open- and closed-source video models reveal insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: this https URL
[CV-1] From Masks to Pixels and Meaning: A New Taxonomy Benchmark and Metrics for VLM Image Tampering CVPR2026
【速读】:该论文旨在解决现有图像篡改检测基准依赖于粗粒度区域标签(如对象掩码)所带来的严重语义错位问题,即掩码内大量像素未被修改或仅发生微小变化,而掩码外的细微但关键的篡改却常被忽略。其核心解决方案在于将视觉语言模型(VLM)驱动的图像篡改检测任务从粗粒度区域标注重构为像素级、语义和语言感知的任务:首先提出涵盖替换、删除、拼接、修复、属性更改等编辑原语及其语义类别的分类体系,实现低层改动与高层语义的关联;其次发布一个包含逐像素篡改图和配对类别监督的新基准,支持统一协议下的检测与分类评估;最后构建基于像素级定位精度和语义感知分类的训练框架及评价指标,量化篡改强度预测的准确性,并通过自然语言描述衡量对篡改语义的理解能力。此方法推动了篡改检测从掩码到像素、语义与语言描述的范式转变,建立了更严格的篡改定位、语义分类与描述标准。
链接: https://arxiv.org/abs/2603.20193
作者: Xinyi Shang,Yi Tang,Jiacheng Cui,Ahmed Elhagry,Salwa K. Al Khatib,Sondos Mahmoud Bsharat,Jiacheng Liu,Xiaohan Zhao,Jing-Hao Xue,Hao Li,Salman Khan,Zhiqiang Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code and data at: this https URL (Accepted in CVPR 2026 Findings, but not opted in)
Abstract:Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over- and under-scoring using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at this https URL.
[CV-2] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation ICLR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在多主体视频生成中面临的面部属性对齐难题,即如何在不同主体间实现细粒度的身份一致性与语义对齐。现有方法因缺乏显式的建模机制,难以保证群体内部的结构一致性。解决方案的关键在于提出 LumosX 框架,其创新性体现在两方面:一是构建基于多模态大语言模型(MLLMs)的数据处理管道,通过提取和注入主体-属性间的关联先验来增强个性化视频的表达控制;二是设计关系自注意力(Relational Self-Attention)与关系交叉注意力(Relational Cross-Attention),将位置感知嵌入与精细化注意力机制融合,显式建模主体属性依赖关系,从而强化组内凝聚力并提升不同主体簇之间的区分度。
链接: https://arxiv.org/abs/2603.20192
作者: Jiazheng Xing,Fei Du,Hangjie Yuan,Pengwei Liu,Hongbin Xu,Hai Ci,Ruigang Niu,Weihua Chen,Fan Wang,Yong Liu
机构: Zhejiang University (浙江大学); DAMO Academy, Alibaba Group (阿里巴巴达摩院); Hupan Lab (湖畔实验室); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026 Camera Ready Version. Code and Models: this https URL
Abstract:Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at this https URL.
[CV-3] Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation
【速读】:该论文旨在解决分割任务中固有的不确定性问题,即在医学图像分割或未来状态预测等场景下,存在多个 equally correct(同等合理)的预测结果。传统方法依赖生成模型来捕捉这种不确定性,但需大量采样和后处理聚类才能识别分布中的潜在模式,计算成本高且效率低。其解决方案的关键在于提出一种确定性框架——模式提议模型(mode proposal models),能够在单次前向传播中直接生成固定数量的候选分割掩码(proposal masks),并通过借鉴目标检测中常用的置信度机制来处理冗余提议,从而显著降低推理时间并提升对真实标签的覆盖度。此外,该方法无需事先知晓完整的结果分布即可训练,适用于现实世界数据集,并可通过分解预训练流模型的速度场高效估计提议的先验模式概率。
链接: https://arxiv.org/abs/2603.20191
作者: Sebastian Gerard,Josephine Sullivan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.
[CV-4] CoVR-R: Reason-Aware Composed Video Retrieval CVPR2026
【速读】:该论文旨在解决组成视频检索(Composed Video Retrieval, CoVR)中因忽略编辑操作后产生的因果和时序后效(causal and temporal after-effects)而导致的检索不准确问题,例如运动变化、状态转换、视角或持续时间等隐含效应未被建模。解决方案的关键在于提出一种以推理为导向的零样本方法,利用大规模多模态模型首先推断编辑所隐含的因果与时间后果,并将这些推理生成的查询与候选视频进行对齐,无需任务特定微调。该方法通过引入CoVR-Reason基准(包含结构化推理轨迹和需预测后效的干扰项),验证了其在隐含效应子集上的优越性能,显著提升了检索结果的步骤一致性和效应真实性,从而实现了更可解释、泛化能力更强的视频搜索框架。
链接: https://arxiv.org/abs/2603.20190
作者: Omkar Thawakar,Dmitry Demidov,Vaishnav Potlapalli,Sai Prasanna Teja Reddy Bogireddy,Viswanatha Reddy Gajjala,Alaa Mostafa Lasheen,Rao Muhammad Anwer,Fahad Khan
机构: Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学); University of Chicago(芝加哥大学); University of Wisconsin-Madison(威斯康星大学麦迪逊分校); Linköping University(林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 (findings)
Abstract:Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analysis confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at this https URL.
[CV-5] Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods
【速读】:该论文旨在解决在不确定性环境中进行模糊分割(ambiguous segmentation)时,扩散模型(diffusion models)因采样效率低下而导致难以发现低概率但操作上重要的多模态结果的问题。传统方法需大量采样才能覆盖可能的合理预测,计算成本高。解决方案的关键在于提出三种无需训练的采样策略:一是将原本用于自然图像生成的粒子引导(particle guidance)和SPELL方法迁移至离散分割任务;二是引入一种基于聚类的简单采样机制。这些方法显著提升了样本多样性,在MMFire和Cityscapes等数据集上分别使HM IoU*指标提升最高达7.5%和16.4%,同时保持图像质量和推理速度不受明显影响,从而实现了高效且实用的多模态分割预测。
链接: https://arxiv.org/abs/2603.20188
作者: Sebastian Gerard,Josephine Sullivan
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NLDL 2026. This version contains small corrections compared to the initial publication, see appendix for details
Abstract:Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: this https URL Journal reference: Proceedings of the 7th Northern Lights Deep Learning Conference (NLDL), PMLR, Jan. 2026
[CV-6] MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
【速读】:该论文旨在解决视频驱动的人类反应生成问题,即如何从观察到的视频序列中合成与之内容高度匹配的3D人体动作。现有方法常因视觉观测与反应类型之间存在严重的关联失真(relational distortion),导致生成的动作与视频内容不一致。其解决方案的关键在于提出MuSteerNet框架,通过“观察-反应”相互引导机制实现精准控制:首先采用原型反馈引导(Prototype Feedback Steering)机制,利用从人类反应中学习到的原型向量,结合门控delta修正调制器和关系边界约束来校正视觉观测;随后引入双耦合反应精炼(Dual-Coupled Reaction Refinement)模块,充分利用校正后的视觉线索进一步优化生成反应动作的质量,从而有效缓解关联失真并提升整体性能。
链接: https://arxiv.org/abs/2603.20187
作者: Yuan Zhou,Yongzhi Li,Yanqi Dai,Xingyu Zhu,Yi Tan,Qingshan Xu,Beier Zhu,Richang Hong,Hanwang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to effectively leverage video inputs to steer human reaction synthesis, resulting in reaction motions that are mismatched with the content of video sequences. We reveal that this limitation arises from a severe relational distortion between visual observations and reaction types. In light of this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism to mitigate relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement that fully leverages rectified visual cues to further steer the refinement of generated reaction motions, thereby effectively improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code coming soon: this https URL.
[CV-7] Improving Image-to-Image Translation via a Rectified Flow Reformulation
【速读】:该论文旨在解决图像到图像(Image-to-Image, I2I)回归模型在处理病态和多模态目标时普遍存在的过平滑问题,同时避免生成式方法所需的复杂训练与推理流程。其解决方案的关键在于提出一种轻量级的插件式重构方法——图像到图像修正流重构(Image-to-Image Rectified Flow Reformulation, I2I-RFR),通过通道拼接的方式将噪声污染的真实目标加入骨干网络输入,并优化一个时间重加权的像素损失函数,从而诱导出一个速度场,使模型具备常微分方程(ODE)驱动的连续时间渐进精炼能力,且无需改变标准监督训练流程。该方法在保持简单性的同时显著提升了感知质量和细节保留能力。
链接: https://arxiv.org/abs/2603.20186
作者: Satoshi Iizuka,Shun Okamoto,Kazuhiro Fukui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
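The rectified-flow view behind I2I-RFR can be made concrete with a minimal sketch. Assuming standard straight-line interpolation paths (a generic rectified-flow setup, not the paper's exact formulation), the ideal velocity along each path is constant, which is why a few explicit Euler solver steps can suffice; here the learned network is replaced by the ideal velocity `x1 - x0` purely for illustration:

```python
import numpy as np

def interpolate(x0, x1, t):
    """Rectified-flow interpolant x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def euler_sample(x0, velocity_fn, n_steps=3):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a few explicit steps."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-in for the learned velocity: on straight (rectified) paths the
# ideal velocity is the constant x1 - x0, so Euler integration is exact.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))   # e.g. a degraded/noisy input image
x1 = rng.normal(size=(8, 8))   # the target image
v_ideal = lambda x, t: x1 - x0

x_hat = euler_sample(x0, v_ideal, n_steps=3)
print(np.allclose(x_hat, x1))  # True: straight paths make Euler exact
```

In the actual method the velocity is induced by the regression backbone fed with the channel-concatenated noise-corrupted target; the toy above only shows why ODE-based refinement with a handful of solver steps is plausible under straight paths.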
[CV-8] LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
【速读】:该论文旨在解决神经网络在执行三维任务(如新视角合成,Novel View Synthesis, NVS)时缺乏显式三维结构信息导致性能受限的问题。尽管已有研究表明无需显式三维重建即可实现NVS,但作者认为强三维归纳偏置(3D inductive biases)对网络设计仍具重要价值。解决方案的关键在于提出LagerNVS——一种基于“3D感知”潜在特征的编码器-解码器架构:编码器由一个使用显式三维监督预训练的三维重建网络初始化,从而引入有效的三维先验;解码器则设计为轻量级结构,通过光度损失端到端训练。该方法在无已知相机参数和有已知相机参数场景下均达到当前最优确定性NVS性能(Re10k数据集上PSNR达31.4),且支持实时渲染、野外数据泛化,并可与扩散解码器结合用于生成式外推。
链接: https://arxiv.org/abs/2603.20176
作者: Stanislaw Szymanowicz,Minghao Chen,Jianyuan Wang,Christian Rupprecht,Andrea Vedaldi
机构: Visual Geometry Group, University of Oxford (牛津大学视觉几何组); Meta AI (Meta人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE CVF Conference on Computer Vision and Pattern Recognition 2026. Project page with code, models and examples: this http URL
Abstract:Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on ‘3D-aware’ latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
[CV-9] TinyML Enhances CubeSat Mission Capabilities
【速读】:该论文旨在解决地球观测(Earth Observation, EO)任务中CubeSat系统因计算资源受限、功耗约束严格及通信带宽不足,难以实现星载图像分类的问题。其核心解决方案是构建一个面向TinyML的卷积神经网络(Convolutional Neural Networks, ConvNets)模型优化与部署流水线,关键在于融合结构化迭代剪枝、训练后INT8量化以及硬件感知的操作符映射技术,从而在STM32N6微控制器(集成Arm Cortex-M55和Neural-ART NPU)上实现轻量级、低功耗且高能效的推理能力。该方案显著降低内存占用(平均RAM减少89.55%,Flash减少70.09%),同时保持任务可接受的准确率(精度下降仅0.4–8.6个百分点),并满足CubeSat类平台对能耗(每推理0.68–6.45 mJ)与实时性(延迟3.22–30.38 ms)的严苛要求。
链接: https://arxiv.org/abs/2603.20174
作者: Luigi Capogrosso,Michele Magno
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 17th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS) 2026
Abstract:Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Networks (ConvNets) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.
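The post-training INT8 quantization step in the pipeline above can be sketched with a generic asymmetric (scale/zero-point) scheme. This is a minimal illustration under standard assumptions, not the exact behavior of the STM32 toolchain:

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric post-training INT8 quantization: x ≈ scale * (q - zero_point)."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0          # guard against a zero range
    zero_point = int(round(-128 - x_min / scale))   # map x_min near q = -128
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(64, 64)).astype(np.float32)  # toy Float32 weights
q, s, zp = quantize_int8(w)
w_hat = dequantize(q, s, zp)
print(q.nbytes / w.nbytes)   # 0.25: INT8 storage is 4x smaller than Float32
print(float(np.abs(w - w_hat).max()) <= s)  # round-trip error within one step
```

The 4x memory reduction is what makes the Flash/RAM savings reported above possible; the accuracy drop comes from the bounded rounding error visible in the round-trip check.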
[CV-10] EgoForge: Goal-Directed Egocentric World Simulator
【速读】:该论文旨在解决在第一人称视角(egocentric)视频生成中,如何实现目标导向的动态环境模拟问题,尤其针对快速视角变化、频繁的手-物体交互以及依赖隐含人类意图的程序演化等挑战。现有方法要么局限于手部中心的指令合成而缺乏场景演化,要么仅进行静态视角转换而不建模动作动态,或依赖密集监督信号(如相机轨迹、长视频前缀、多视角同步采集)。其解决方案的关键在于提出EgoForge——一个基于最小静态输入(单张第一人称图像、高层指令和可选的外部视角)的目标导向世界模拟器,并引入VideoDiffusionNFT方法,在扩散采样过程中通过轨迹级奖励引导优化目标完成度、时间因果性、场景一致性与感知保真度,从而实现语义对齐、几何稳定性和运动保真度的显著提升。
链接: https://arxiv.org/abs/2603.20169
作者: Yifan Shen,Jiateng Liu,Xinzhuo Li,Yuanzhe Liu,Bingxuan Li,Houze Yang,Wenqi Jia,Yijiang Li,Tianjiao Yu,James Matthew Rehg,Xu Cao,Ismini Lourentzou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
[CV-11] Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD
【速读】:该论文旨在解决离散扩散模型(Discrete Diffusion Models)在知识蒸馏过程中难以有效压缩的问题,尤其是如何在减少采样步骤的同时保持生成质量与多样性。此前的离散蒸馏方法往往因训练不稳定或信息损失而失效,而本文提出的关键解决方案是Discrete Moment Matching Distillation(D-MMD),其核心思想借鉴了连续扩散领域中成熟的矩匹配(Moment Matching)技术,通过精确对齐学生模型与教师模型在不同时间步上的统计特征,从而实现高质量、高多样性的蒸馏效果。实验表明,D-MMD不仅避免了传统方法的性能坍塌,且在文本和图像数据集上均能显著提升学生模型的表现,甚至可超越教师模型。
链接: https://arxiv.org/abs/2603.20155
作者: Emiel Hoogeboom,David Ruhe,Jonathan Heek,Thomas Mensink,Tim Salimans
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling steps to a handful. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.
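The distribution-matching idea underlying distillation here can be illustrated with a kernel discrepancy between student and teacher samples. This is a generic sketch, not the paper's D-MMD objective: the exponential Hamming kernel and the toy sample sets below are assumptions chosen to make the idea concrete for discrete token sequences:

```python
import numpy as np

def hamming_kernel(a, b, bandwidth=1.0):
    """Kernel on discrete token sequences: exp(-normalized Hamming distance)."""
    return np.exp(-np.mean(a != b) / bandwidth)

def mmd2(xs, ys, kernel):
    """Biased estimate of the squared Maximum Mean Discrepancy between samples."""
    kxx = np.mean([kernel(a, b) for a in xs for b in xs])
    kyy = np.mean([kernel(a, b) for a in ys for b in ys])
    kxy = np.mean([kernel(a, b) for a in xs for b in ys])
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
teacher = [rng.integers(0, 4, size=32) for _ in range(20)]  # "teacher" samples
student = [rng.integers(0, 4, size=32) for _ in range(20)]  # matched distribution
collapsed = [student[0]] * 20                               # mode-collapsed student

# A collapsed generator matches individual samples but not the distribution,
# so its discrepancy to the teacher is much larger.
print(mmd2(teacher, student, hamming_kernel)
      < mmd2(teacher, collapsed, hamming_kernel))
```

A discrepancy of this kind penalizes exactly the failure mode the abstract describes: a distilled generator that collapses scores poorly even if each individual sample looks plausible.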
[CV-12] Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning
【速读】:该论文旨在解决建筑立面自动化检测中因传统判别模型(如YOLO、Mask R-CNN)在被动感知与结构拓扑理解上的局限性,以及大型多模态模型(LMMs)在高风险工程领域缺乏严谨评估标准的问题。其解决方案的关键在于构建一个“人在回路”的半自动化标注框架,通过专家提案验证统一12个碎片化数据集为标准化的分层本体,并在此基础上提出首个多维基准测试平台DefectBench,用于系统评估LMMs在语义感知、空间定位和生成几何分割三个认知维度的表现。实验表明,当前LMMs虽具备优异的拓扑意识和语义理解能力,但在度量级定位精度上存在显著不足,但同时验证了零样本生成分割的可行性,证明通用基础模型可在无需领域训练的情况下媲美专用监督网络,从而为土木工程中自主AI代理的发展确立了新基准。
链接: https://arxiv.org/abs/2603.20148
作者: Hui Zhong,Yichun Gao,Luyan Liu,Hai Yang,Wang Wang,Haowei Zhang,Xinhu Zheng
机构: Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); Hong Kong University of Science and Technology(香港科技大学); Hong Kong University(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize poorly, lacking the visual understanding needed to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing “what” and “how”), they exhibit significant deficiencies in metric localization precision (“where”). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
[CV-13] Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection
【速读】:该论文旨在解决建筑立面缺陷检测中因几何形态多样性高、缺陷与复杂背景对比度低以及复合缺陷(如裂缝与剥落共存)导致的像素不平衡和特征模糊问题,这些问题在缺乏高质量像素级标注的情况下严重制约了现有检测与分割模型的泛化能力。解决方案的关键在于提出一个统一的多智能体框架——FacadeFixer,其将缺陷感知视为协同推理任务而非孤立识别,并通过专用的检测与分割智能体处理多类型缺陷干扰,同时引入生成式智能体实现语义重构,从而将复杂缺陷从噪声背景中解耦,并在多样化干净纹理上真实合成高保真增强数据,辅以精确专家级掩码,有效缓解数据稀缺问题并提升模型性能。
链接: https://arxiv.org/abs/2603.20143
作者: Hui Zhong,Yichun Gao,Luyan Liu,Xusen Guo,Zhaonian Kuang,Qiming Zhang,Xinhu Zheng
机构: The Hong Kong University of Science and Technology (Guangzhou), Systems Hub; Guangdong Provincial Key Lab of Integrated Communication, Sensing and Computation for Ubiquitous Internet of Things; The Hong Kong University of Science and Technology; College of Artificial Intelligence, Xi’An Jiaotong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co-occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high-quality pixel-level annotations, hinder the generalization of existing detection and segmentation models. To address these gaps, we propose FacadeFixer, a unified multi-agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically, FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high-fidelity augmented data with precise expert-level masks. To support this, we introduce a comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state-of-the-art (SOTA) baselines. Specifically, it excels in capturing pixel-level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.
[CV-14] Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives
【速读】:该论文旨在解决神经辐射场(Neural Radiance Fields, NeRF)在高分辨率(High-Resolution, HR)渲染时计算成本高以及传统2D超分方法破坏多视角一致性的问题。解决方案的关键在于提出一种3D感知的超分辨率框架——Generalizable NGP-SR,其基于神经图形基元(Neural Graphics Primitives, NGP),通过将辐射率预测条件化于3D坐标和学习到的局部纹理标记(local texture tokens),直接从低分辨率(Low-Resolution, LR)带姿态图像重建HR辐射场,从而在不依赖外部HR参考或后处理2D上采样的情况下恢复高频细节并保持视角一致性。该方法具备泛化能力,训练完成后无需针对特定场景优化即可应用于未见场景和新视角的高效HR渲染。
链接: https://arxiv.org/abs/2603.20128
作者: Wanqi Yuan,Omkar Sharad Mayekar,Connor Pennington,Nianyi Li
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural Radiance Fields (NeRF) achieve photorealistic novel view synthesis but become costly when high-resolution (HR) rendering is required, as HR outputs demand dense sampling and higher-capacity models. Moreover, naively super-resolving per-view renderings in 2D often breaks multi-view consistency. We propose Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs an HR radiance field directly from low-resolution (LR) posed images. Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens, enabling recovery of high-frequency details within the radiance field and producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. Importantly, our model is generalizable: once trained, it can be applied to unseen scenes and rendered from novel viewpoints without per-scene optimization. Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.
[CV-15] Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
【速读】:该论文旨在解决传统微调方法在特定领域数据集上训练时,可能无意中改变模型预训练阶段获得的多模态先验(multimodal priors),从而导致泛化能力下降的问题。解决方案的关键在于提出一种名为“适应链”(Chain-of-Adaptation, CoA)的适应框架,该框架通过引入结构化的推理格式,在增强领域对齐的同时,利用强化学习保持模型固有的推理与感知能力,从而实现领域知识整合与多模态通用能力的协同保留。
链接: https://arxiv.org/abs/2603.20116
作者: Jiajie Li,Chenhui Xu,Meihuan Liu,Jinjun Xiong
机构: University at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Conventional fine-tuning on domain-specific datasets can inadvertently alter a model’s pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model’s inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format, trained via reinforcement learning, that enhances domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model’s core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
[CV-16] Preference-Guided Debiasing for No-Reference Enhancement Image Quality Assessment
【速读】:该论文旨在解决当前无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)模型在增强图像质量评估(Enhancement Image Quality Assessment, EIQA)任务中泛化能力不足的问题,即模型容易过拟合于特定增强算法所产生的独特视觉特征(enhancement-specific visual fingerprints),而非真正捕捉到与算法无关的感知质量线索。解决方案的关键在于提出一种偏好引导的去偏框架(preference-guided debiasing framework):首先通过监督对比学习(supervised contrastive learning)构建一个连续的增强偏好嵌入空间(enhancement-preference embedding space),使同类增强风格的图像在该空间中表示更接近;随后估计并移除原始质量表示中由增强算法引入的干扰成分(enhancement-induced nuisance component),从而引导模型聚焦于算法不变的感知质量特征。为确保优化稳定,采用两阶段训练策略,先学习增强偏好空间,再进行去偏的质量回归预测,显著提升了跨算法的鲁棒性和泛化性能。
链接: https://arxiv.org/abs/2603.20086
作者: Shiqi Gao,Kang Fu,Zitong Xu,Huiyu Duan,Xiongkuo Min,Jia Wang,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.
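The first training stage, learning an enhancement-preference embedding space with supervised contrastive learning, can be sketched with a generic SupCon-style loss. This assumes the standard formulation (Khosla et al. style); the label set and `temperature=0.1` below are illustrative, not values from the paper:

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: pull together embeddings that share a
    label (here, the same enhancement style), push apart the rest."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity space
    sim = z @ z.T / temperature
    n = len(labels)
    mask_self = np.eye(n, dtype=bool)
    sim = np.where(mask_self, -np.inf, sim)            # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    return -np.mean(log_prob[pos])

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 8)                       # 3 enhancement "styles"
centers = 3.0 * rng.normal(size=(3, 16))
z_clustered = centers[labels] + 0.1 * rng.normal(size=(24, 16))  # style-clustered
z_random = rng.normal(size=(24, 16))                             # unstructured

print(supcon_loss(z_clustered, labels) < supcon_loss(z_random, labels))  # True
```

Minimizing this loss is what pushes images produced by similar enhancement algorithms toward nearby representations, after which the enhancement-induced component can be estimated and removed before quality regression.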
[CV-17] A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic Optical and Electromagnetic Tracking
【速读】:该论文旨在解决当前三维超声(3D Ultrasound, 3D US)重建中缺乏全面体积精度与可重复性评估的问题,尤其针对自由手或机器人采集的跟踪式3D US重建技术,亟需建立可靠的质控(Quality Assurance, QA)框架。其解决方案的关键在于提出了一套完整的QA框架及一个灵活的开源平台:通过定制包含不同对称性几何结构的体模,实现对光学、电磁及机器人运动学跟踪系统的多条件(扫描速度、入射角度)量化评估;并开发标准化处理流程,无需GPU加速即可实现实时分割与3D重建(Dice相似系数DSC=0.97,帧率FPS=46),随后自动配准并与真实几何进行比较。实验表明,所用机器人3D US系统达到当前最优重建性能(DSC-3D=0.94±0.01,HD95=1.17±0.12),逼近换能器空间分辨率极限,从而为跨平台比较和临床转化提供了可复现的验证方法与技术支持。
链接: https://arxiv.org/abs/2603.20077
作者: Lewis Howell,Manisha Waterston,Tze Min Wah,James H. Chandler,James R. McLaughlan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
[CV-18] MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models
【速读】:该论文旨在解决将状态空间模型(State Space Models, SSMs)有效扩展到计算机视觉任务中的难题,尤其是克服视觉数据非序列特性及复杂二维空间依赖关系带来的挑战。现有方法多依赖于对同一输入采用多种遍历策略,导致信息冗余并破坏图像内部的空间关联性。其解决方案的关键在于提出MFil-Mamba架构,该架构基于多滤波扫描(multi-filter scanning)骨干网络,使每次扫描能够捕捉独特且与上下文相关的空间信息,从而减少冗余;同时引入自适应加权机制以高效融合多路扫描输出,并辅以结构优化,显著提升了在图像分类、目标检测、实例分割和语义分割等任务上的性能表现。
链接: https://arxiv.org/abs/2603.20074
作者: Puskal Khadka,KC Santosh
机构: University of South Dakota (南达科他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State Space Models (SSMs), especially recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at this https URL.
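The adaptive weighting that fuses the outputs of the multiple scans can be sketched as a softmax gate over per-scan summaries. This is a hypothetical minimal form: the mean pooling and the `gate_w` parameter are assumptions for illustration, whereas the paper's mechanism is learned end-to-end:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scans(scan_feats, gate_w):
    """Adaptively fuse K scan outputs of shape (K, N, D): derive one logit per
    scan from its own features, then take a softmax-weighted sum over scans."""
    pooled = scan_feats.mean(axis=(1, 2))   # (K,) scalar summary per scan
    weights = softmax(gate_w * pooled)      # (K,) fusion weights, sum to 1
    return np.tensordot(weights, scan_feats, axes=1), weights

rng = np.random.default_rng(0)
scans = rng.normal(size=(4, 16, 32))        # 4 scan orders, 16 tokens, dim 32
fused, w = fuse_scans(scans, gate_w=1.0)
print(fused.shape)                          # (16, 32): one fused feature map
```

Compared with simply averaging the traversals, a gate of this kind lets the model down-weight scans whose ordering is uninformative for a given input, which is the redundancy-reduction behavior the abstract describes.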
[CV-19] Detached Skip-Links and R-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在光学字符识别(OCR)任务中因细粒度视觉信息丢失或错位而导致性能下降的问题。其核心问题是:在多层特征融合过程中,跳跃连接(skip pathways)引入了从高层语义目标到早期视觉层的直接反向传播路径,导致低层特征信号被覆盖,训练过程不稳定。解决方案的关键在于提出"分离式跳跃连接"(Detached Skip-Links),该方法在前向传播中复用浅层特征,但在反向传播时切断跳跃分支的梯度,这种非对称设计可抑制梯度干扰,从而提升训练稳定性与收敛性,且无需增加可学习参数。
链接: https://arxiv.org/abs/2603.20020
作者: Ziye Yuan,Ruchang Yao,Chengxin Zheng,Yusheng Zhao,Daxiang Dong,Ming Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce R-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
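Detached Skip-Links 的核心是"前向复用浅层特征、反向切断跳跃分支梯度"。下面用一个标量玩具模型手写链式法则加以示意(w1、w2 及平方损失均为假设的玩具设定,并非论文实现):

```python
def fused_forward(w1, w2, x):
    # shallow feature s, deep feature d; the skip link adds s back to d
    s = w1 * x
    d = w2 * s
    return d + s

def grad_w1(w1, w2, x, detach_skip):
    # loss = 0.5 * fused**2; chain rule written out by hand so the
    # effect of stopping the skip-branch gradient is explicit
    s = w1 * x
    fused = w2 * s + s
    # with a detached skip, the +1 contribution of the skip branch
    # to d(fused)/d(s) is removed from the backward pass only
    dfused_ds = w2 + (0.0 if detach_skip else 1.0)
    return fused * dfused_ds * x
```

可见两种设置下前向输出完全相同,而 detach 后浅层参数 w1 的梯度不再经由跳跃分支传播,这正是该非对称设计的含义。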
[CV-20] CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data
【速读】:该论文旨在解决医学影像与表格数据之间存在的显著模态差距(modality gap)问题,该差距限制了跨模态诊断准确性的提升。现有跨模态学习(Crossmodal Learning, CML)方法多聚焦于高层编码器输出之间的关系建模,忽视了图像中的局部信息及任务相关特征的提取。为此,作者提出一种新颖的粗粒度到细粒度的跨模态学习框架(Coarse-to-Fine Crossmodal Learning, CFCML),其关键在于分阶段优化模态间关系:在粗粒度阶段,通过多粒度特征融合图像编码器各层级与表格信息,初步缩小模态差距;在细粒度阶段,构建类感知的单模态与跨模态原型,并引入层次化锚点关系挖掘(Hierarchical Anchor-based Relationship Mining, HRM)策略,利用模态样本、单模态原型和跨模态原型作为锚点进行对比学习,从多角度增强类别间差异并压缩类别内差异,从而有效提升跨模态判别性信息的提取能力。
链接: https://arxiv.org/abs/2603.20016
作者: Tianling Liu,Hongying Liu,Fanhua Shang,Lequan Yu,Tong Han,Liang Wan
机构: Tianjin University (天津大学); The University of Hong Kong (香港大学); Peng Cheng Lab (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. There exists a significant modality gap between these data types, which obstructs advancements in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods primarily focus on exploring relationships among high-level encoder outputs, leading to the neglect of local information in images. Additionally, these methods often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework to progressively reduce the modality gap between multimodal images and tabular data, by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy utilizes modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning approaches, effectively enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms the state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC metrics on the MEN and Derm7pt datasets, respectively. The code is available at this https URL.
[CV-21] Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features CVPR’26
【速读】:该论文旨在解决当前基于扩散模型的妆容迁移方法中存在的两个关键问题:一是通用预训练基础模型(如CLIP)难以准确捕捉妆容风格特征;二是现有方法将参考图像的妆容特征作为整体注入扩散去噪模型,忽略了面部区域感知的妆容特征(如眼部、口部等),从而限制了局部区域的可控性。解决方案的关键在于提出一种面部区域感知妆容特征(Facial Region-Aware Makeup features, FRAM),包含两个阶段:首先通过合成标注妆容数据并采用自监督与图文对比学习对CLIP进行微调,以提升妆容语义建模能力;其次设计可学习标记(learnable tokens)结合注意力损失机制,从微调后的CLIP中提取面部区域感知的妆容特征,并利用ControlNet Union联合编码源图像及其三维网格信息实现身份和妆容的解耦注入,从而显著增强局部区域的妆容控制能力与迁移性能。
链接: https://arxiv.org/abs/2603.20012
作者: Zheng Gao,Debin Meng,Yunqi Miao,Zhensong Zhang,Songcen Xu,Ioannis Patras,Jifei Song
机构: Queen Mary University of London (伦敦玛丽女王大学); Huawei London Research Center (华为伦敦研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR’26
Abstract:Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as condition to preserve the makeup style of reference image in the generation. Although effective, these works mainly have two limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of reference image are injected to the diffusion denoising model as a whole for global makeup transfer, overlooking the facial region-aware makeup features (i.e., eyes, mouth, etc) and limiting the regional controllability for region-specific makeup transfer. To address these, in this work, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject identity of source image and makeup of reference image to the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.
[CV-22] NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness CVPR2026
【速读】:该论文旨在解决在极端低光条件下动态场景成像中因光子稀缺导致的严重噪声和纹理丢失问题,从而引发图像质量显著下降的挑战。现有方法多聚焦于从事件相机(event camera)中恢复纹理信息,却忽视了图像噪声及事件信号本身的固有噪声,限制了像素重建的准确性。其解决方案的关键在于提出一种基于扩散模型的事件-RAW混合成像框架NEC-Diff,通过两个核心机制实现:一是利用RAW图像的线性光照响应特性与事件信号的亮度变化特性建立物理驱动约束,以实现鲁棒的双模态去噪;二是根据去噪结果动态估计两种模态的信噪比(SNR),引导自适应特征融合,将可靠线索注入扩散过程,从而实现高保真视觉重建。
链接: https://arxiv.org/abs/2603.20005
作者: Haoyue Liu,Jinghan Xu,Luxin Feng,Hanyu Zhou,Haozhi Zhao,Yi Chang,Luxin Yan
机构: Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event-RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001-0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness. The project is available at: this https URL.
[CV-23] Evaluating Test-Time Adaptation For Facial Expression Recognition Under Natural Cross-Dataset Distribution Shifts ICASSP2026
【速读】:该论文旨在解决深度学习模型在真实场景中因自然分布偏移(natural distribution shifts)导致性能下降的问题,尤其聚焦于面部表情识别(Facial Expression Recognition, FER)任务。其解决方案的关键在于系统评估测试时自适应(Test-Time Adaptation, TTA)方法在跨数据集场景下的有效性,揭示不同TTA策略(如熵最小化、原型调整和特征对齐)在不同分布距离与噪声水平下的适应性差异,从而为实际部署提供依据:当目标分布较清洁时,熵最小化方法(如TENT、SAR)表现最优;当分布差异较大时,原型调整方法(如T3A)更有效;而当目标分布噪声更高时,特征对齐方法(如SHOT)带来最大提升。
链接: https://arxiv.org/abs/2603.19994
作者: John Turnbull,Shivam Grover,Amin Jalali,Ali Etemad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: Accepted at ICASSP 2026
Abstract:Deep learning models often struggle under natural distribution shifts, a common challenge in real-world deployments. Test-Time Adaptation (TTA) addresses this by adapting models during inference without labeled source data. We present the first evaluation of TTA methods for FER under natural domain shifts, performing cross-dataset experiments with widely used FER datasets. This moves beyond synthetic corruptions to examine real-world shifts caused by differing collection protocols, annotation standards, and demographics. Results show TTA can boost FER performance under natural shifts by up to 11.34%. Entropy minimization methods such as TENT and SAR perform best when the target distribution is clean. In contrast, prototype adjustment methods like T3A excel under larger distributional distance scenarios. Finally, feature alignment methods such as SHOT deliver the largest gains when the target distribution is noisier than our source. Our cross-dataset analysis shows that TTA effectiveness is governed by the distributional distance and the severity of the natural shift across domains.
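TENT、SAR 等熵最小化类 TTA 方法以模型输出的香农熵作为测试时优化目标,在无标签目标数据上最小化该熵。以下为该目标函数的最小 Python 草图(示意性实现,非原文代码):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def prediction_entropy(logits):
    # TENT-style test-time objective: Shannon entropy of the softmax
    # output; adaptation minimizes this on unlabeled target samples
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)
```

自信(低熵)的预测对应较小的目标值,因此熵最小化会驱动模型在目标分布上给出更确定的输出;这也解释了为何该类方法在目标分布较干净时表现最好,而在噪声较大的目标分布上可能被误导。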
[CV-24] MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在高风险临床软件环境中进行可靠视觉定位(visual grounding)能力不足的问题。现有图形用户界面(GUI)基准测试主要聚焦于孤立的单步定位查询,忽略了真实医疗场景中任务随流程演进、界面状态动态变化所要求的序列化、工作流驱动的推理能力。解决方案的关键在于提出MedSPOT——一个面向临床GUI环境的工作流感知序列定位基准,其核心创新包括:将交互过程建模为结构化的空间决策序列,构建包含216个任务驱动视频与597个标注关键帧的数据集(每项任务含2–3个相互依赖的定位步骤),引入严格的顺序评估协议(首次定位错误即终止任务评估以量化误差传播),并设计涵盖边缘偏倚、小目标误判、无预测、近似偏差、远距离偏差及工具栏混淆等六类故障的系统性失败分类体系,从而实现对模型在复杂医疗工作流中行为的精准诊断与评估。
链接: https://arxiv.org/abs/2603.19993
作者: Rozain Shakeel,Abdul Rahman Mohammad Ali,Muneeb Mushtaq,Tausifa Jan Saleem,Tajamul Ashraf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: this https URL.
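MedSPOT 的严格顺序评估协议可用如下草图示意:首个定位错误即终止该任务的后续评分,以显式度量多步工作流中的误差传播(示意性实现,非基准官方代码):

```python
def sequential_task_eval(step_correct):
    # step_correct: per-step grounding outcomes (True = correct).
    # Strict protocol: stop crediting steps at the first wrong
    # prediction; the task succeeds only if every step passes.
    completed = 0
    for ok in step_correct:
        if not ok:
            break
        completed += 1
    return completed, completed == len(step_correct)
```

例如三步任务中第二步出错时,仅第一步计入完成步数,且整个任务记为失败。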
[CV-25] X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
【速读】:该论文旨在解决自动驾驶系统中评估流程的可扩展性与可靠性问题,当前依赖真实道路测试的评估方法存在成本高、场景覆盖有限且难以复现等缺陷。为此,作者提出X-World——一种动作条件的多相机生成式世界模型(action-conditioned multi-camera generative world model),其核心创新在于设计了一个多视角潜在视频生成器,显式地强化跨视角几何一致性与时间连贯性,从而在视频空间中直接生成符合指令动作的未来多视角视频流。该方案支持对动态交通参与者和静态道路元素的可控编辑,并保留文本提示接口以实现外观级控制(如天气、时段),同时具备视频风格迁移能力,为自动驾驶VLA策略提供稳定、可控、可复现的仿真评估基础。
链接: https://arxiv.org/abs/2603.19979
作者: Chaoda Zheng,Sean Li,Jinhao Deng,Zhennan Wang,Shijia Chen,Liqiang Xiao,Ziheng Chi,Hongbin Lin,Kangjie Chen,Boyang Wang,Yu Zhang,Xianming Liu
机构: XPeng(小鹏汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report
Abstract:Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision–language–action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.
[CV-26] 2K Retrofit: Entropy-Guided Efficient Sparse Refinement for High-Resolution 3D Geometry Prediction
【速读】:该论文旨在解决当前几何基础模型在高分辨率场景(如2K图像)下推理时面临的计算与内存开销过大问题,从而限制了其在自动驾驶、机器人和AR/MR等实际应用中的部署。解决方案的关键在于提出一种名为2K Retrofit的新框架,该框架无需修改或重新训练骨干模型,通过快速粗粒度预测结合基于熵的稀疏细化策略,仅对高不确定性区域进行选择性增强,从而在保持高精度和高保真度的同时实现极低的额外计算开销,显著提升了高分辨率三维视觉任务的可扩展性和实用性。
链接: https://arxiv.org/abs/2603.19964
作者: Tianbao Zhang,Zhenyu Liang,Zhenbo Song,Nana Wang,Xiaomei Zhang,Xudong Cai,Zheng Zhu,Kejian Wu,Gang Wang,Zhaoxin Fan
机构: 1: School of Artificial Intelligence, Peking University (北京大学人工智能学院); 2: Center for Brain-Inspired Computing Research, Peking University (北京大学脑科学与人工智能研究中心); 3: School of Computer Science and Technology, Shanghai Jiao Tong University (上海交通大学计算机科学与技术学院); 4: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 5: School of Information Science and Technology, Sun Yat-sen University (中山大学信息科学与技术学院); 6: School of Data Science, Fudan University (复旦大学数据科学学院); 7: Department of Computer Science and Engineering, Tsinghua University (清华大学计算机科学与技术系); 8: School of Computer Science, The University of Sydney (悉尼大学计算机科学学院); 9: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15pages
Abstract:High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but current foundation models are fundamentally limited by their scalability to real-world, high-resolution scenarios. Direct inference on 2K images with these models incurs prohibitive computational and memory demands, making practical deployment challenging. To tackle the issue, we present 2K Retrofit, a novel framework that enables efficient 2K-resolution inference for any geometric foundation model, without modifying or retraining the backbone. Our approach leverages fast coarse predictions and an entropy-based sparse refinement to selectively enhance high-uncertainty regions, achieving precise and high-fidelity 2K outputs with minimal overhead. Extensive experiments on widely used benchmark demonstrate that 2K Retrofit consistently achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications. Code will be released upon acceptance.
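2K Retrofit 的熵引导稀疏细化可抽象为"只挑选高不确定性位置做精细化、其余位置保留粗预测"。以下草图示意这一选点逻辑(budget_frac 等参数为示意性假设):

```python
def select_refinement_sites(entropy_map, budget_frac=0.2):
    # entropy_map: per-location uncertainty of the coarse prediction
    # (flattened). Refine only the top-budget_frac highest-entropy
    # locations; everything else keeps its coarse value.
    order = sorted(range(len(entropy_map)), key=lambda i: -entropy_map[i])
    k = max(1, int(budget_frac * len(entropy_map)))
    chosen = set(order[:k])
    return [i in chosen for i in range(len(entropy_map))]
```

由于细化仅作用于少量高熵位置,2K 分辨率下的额外计算开销可被控制在很小的比例。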
[CV-27] Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation CVPR
【速读】:该论文旨在解决从单张RGB图像中进行6自由度(6-DoF)物体位姿估计的问题。现有间接方法虽性能优异,但需预测2D关键点并借助PnP求解器;而直接回归方法虽计算高效,却因依赖全局池化特征、忽略空间二阶统计信息以及采用不连续的位姿表示而导致精度不足。其解决方案的关键在于提出一种协方差池化(covariance-pooled)表示,将卷积特征分布编码为对称正定(Symmetric Positive Definite, SPD)矩阵,并设计基于Cholesky分解的新型SPD位姿编码方式,结合考虑SPD矩阵黎曼几何结构的流形感知网络头,实现端到端的连续位姿回归,从而显著提升直接方法的准确性与鲁棒性,尤其在部分遮挡条件下表现优越。
链接: https://arxiv.org/abs/2603.19961
作者: Nassim Ali Ousalah,Peyman Rostami,Vincent Gaudillière,Emmanuel Koumandakis,Anis Kacem,Enjie Ghorbel,Djamila Aouada
机构: University of Luxembourg; Université de Lorraine; CNRS; Inria; LORIA; Infinite Orbits; University of Manouba
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
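Cov2Pose 的二阶池化将卷积特征分布编码为协方差(SPD)矩阵,并经 Cholesky 分解参数化位姿。以下纯 Python 草图示意这两步的数学含义(仅为说明概念,非论文实现):

```python
import math

def covariance_pool(features):
    # features: N feature vectors of dim D -> D x D sample covariance
    # (SPD for non-degenerate features): the second-order pooled
    # representation, as opposed to first-order global average pooling
    n, d = len(features), len(features[0])
    mean = [sum(f[i] for f in features) / n for i in range(d)]
    return [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features) / (n - 1)
             for j in range(d)] for i in range(d)]

def cholesky(a):
    # lower-triangular L with L @ L.T == a (a must be SPD); the paper
    # encodes pose as an SPD matrix via this kind of factorization
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L
```

Cholesky 因子与 SPD 矩阵一一对应,因此回归下三角因子即可得到连续的 SPD 位姿表示,避免了不连续位姿参数化带来的鲁棒性问题。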
[CV-28] HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
【速读】:该论文旨在解决现有病理视觉语言模型(Pathology Vision-Language Models, VLMs)在处理结构化病理报告时的局限性问题,即通常将复杂的多粒度诊断信息(如诊断结论、组织学分级及辅助检测结果)简化为扁平标签或自由文本,导致信息丢失和临床实用性下降。其解决方案的关键在于提出一个轻量级VLM框架HiPath,以结构化报告预测为核心训练目标,并引入三个可训练模块:用于多图像视觉编码的分层补丁聚合器(Hierarchical Patch Aggregator, HiPA)、基于最优传输的跨模态对齐机制(Hierarchical Contrastive Learning, HiCL),以及基于槽位的掩码诊断预测模块(Slot-based Masked Diagnosis Prediction, Slot-MDP),共同实现高精度、高安全性的结构化病理报告生成。
链接: https://arxiv.org/abs/2603.19957
作者: Ruicheng Yuan,Zhenxuan Zhang,Anbang Wang,Liwei Hu,Xiangqian Hua,Yaya Peng,Jiawei Luo,Guang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 1 figures, 3 tables
Abstract:Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
[CV-29] Timestep-Aware Block Masking for Efficient Diffusion Model Inference
【速读】:该论文旨在解决扩散概率模型(Diffusion Probabilistic Models, DPMs)在图像生成任务中因迭代去噪特性导致的高推理延迟问题。其核心解决方案是提出一种基于时步(timestep)感知的计算图优化框架,通过学习每个时步特定的掩码(mask),动态决定在推理过程中哪些模块可以执行或跳过,并利用特征复用来减少冗余计算。该方法的关键创新在于:1)对每个时步独立优化掩码,避免全局优化带来的高昂内存开销;2)引入时步感知的损失缩放机制以保障关键去噪阶段的特征保真度,并结合知识引导的掩码修正策略剪枝冗余时空依赖关系。此方案不依赖具体网络架构,在DDPM、LDM、DiT和PixArt等多种模型上均实现了显著的效率提升,同时保持了生成质量。
链接: https://arxiv.org/abs/2603.19939
作者: Haodong He,Yuan Gao,Weizhong Zhang,Gui-Song Xia
机构: Wuhan University (武汉大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Diffusion Probabilistic Models (DPMs) have achieved great success in image generation but suffer from high inference latency due to their iterative denoising nature. Motivated by the evolving feature dynamics across the denoising trajectory, we propose a novel framework to optimize the computational graph of pre-trained DPMs on a per-timestep basis. By learning timestep-specific masks, our method dynamically determines which blocks to execute or bypass through feature reuse at each inference stage. Unlike global optimization methods that incur prohibitive memory costs via full-chain backpropagation, our method optimizes masks for each timestep independently, ensuring a memory-efficient training process. To guide this process, we introduce a timestep-aware loss scaling mechanism that prioritizes feature fidelity during sensitive denoising phases, complemented by a knowledge-guided mask rectification strategy to prune redundant spatial-temporal dependencies. Our approach is architecture-agnostic and demonstrates significant efficiency gains across a broad spectrum of models, including DDPM, LDM, DiT, and PixArt. Experimental results show that by treating the denoising process as a sequence of optimized computational paths, our method achieves a superior balance between sampling speed and generative quality. Our code will be released.
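按时步学习的掩码决定每个去噪步中哪些模块执行、哪些通过特征复用跳过。以下草图示意这一机制(blocks、cache 均为假设的玩具结构,非论文实现):

```python
def masked_denoise_step(blocks, mask_t, x, cache):
    # Run one denoising step: execute block i only if this timestep's
    # mask enables it; otherwise reuse its cached output from an
    # earlier timestep (feature reuse instead of recomputation).
    h = x
    for i, block in enumerate(blocks):
        if mask_t[i] or i not in cache:
            h = block(h)
            cache[i] = h
        else:
            h = cache[i]
    return h
```

多个时步共享同一 cache,被掩码跳过的模块直接返回缓存特征,从而在敏感的去噪阶段保留完整计算、在冗余阶段节省开销。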
[CV-30] LIORNet: Self-Supervised LiDAR Snow Removal Framework for Autonomous Driving under Adverse Weather Conditions
【速读】:该论文旨在解决LiDAR传感器在恶劣天气条件下(如雪、雨、雾)性能显著下降的问题,其核心挑战在于噪声点(spurious noise points)主导点云数据,导致误感知。现有方法包括基于距离的滤波、基于强度的滤波和学习驱动的方法,但各自存在局限:距离法难以区分有效目标点与噪声,强度法依赖固定阈值适应性差,学习法则面临标注成本高、泛化能力弱及计算开销大等问题。本文提出LIORNet,其关键创新在于融合三类方法的优势,采用U-Net++架构并引入一种自监督学习策略,通过多种物理与统计线索(如距离相关强度阈值、雪反射特性、点稀疏性及传感范围约束)生成伪标签,从而无需人工标注即可有效区分噪声与环境结构,实现了高精度、低延迟且鲁棒的LiDAR点云去噪,适用于极端天气下的实时自动驾驶系统。
链接: https://arxiv.org/abs/2603.19936
作者: Ji-il Park,Inwook Shim
机构: Ministry of National Defense (韩国国防部); Inha University (仁荷大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 14 pages, 6 figures, 2 tables
Abstract:LiDAR sensors provide high-resolution 3D perception and long-range detection, making them indispensable for autonomous driving and robotics. However, their performance significantly degrades under adverse weather conditions such as snow, rain, and fog, where spurious noise points dominate the point cloud and lead to false perception. To address this problem, various approaches have been proposed: distance-based filters exploiting spatial sparsity, intensity-based filters leveraging reflectance distributions, and learning-based methods that adapt to complex environments. Nevertheless, distance-based methods struggle to distinguish valid object points from noise, intensity-based methods often rely on fixed thresholds that lack adaptability to changing conditions, and learning-based methods suffer from the high cost of annotation, limited generalization, and computational overhead. In this study, we propose LIORNet, which eliminates these drawbacks and integrates the strengths of all three paradigms. LIORNet is built upon a U-Net++ backbone and employs a self-supervised learning strategy guided by pseudo-labels generated from multiple physical and statistical cues, including range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This design enables LIORNet to distinguish noise points from environmental structures without requiring manual annotations, thereby overcoming the difficulty of snow labeling and the limitations of single-principle approaches. Extensive experiments on the WADS and CADC datasets demonstrate that LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. These results highlight LIORNet as a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems.
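LIORNet 的伪标签由距离相关强度阈值、雪反射特性、点稀疏性等物理线索生成。以下草图仅示意其中"近距离低强度点判为雪噪声"这一类规则(所有阈值常数均为示意性假设,并非论文的标定值):

```python
import math

def snow_pseudo_labels(points, base_thresh=10.0, decay=0.5, max_range=30.0):
    # points: (range_m, intensity) pairs. Flag a return as snow
    # noise (label 1) when it lies within max_range and its intensity
    # falls below a range-dependent threshold that decays with range.
    labels = []
    for r, intensity in points:
        thresh = base_thresh * math.exp(-decay * r)
        labels.append(1 if (r < max_range and intensity < thresh) else 0)
    return labels
```

这类规则生成的伪标签随后监督 U-Net++ 网络训练,因此无需任何人工逐点标注。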
[CV-31] RAM: Recover Any 3D Human Motion in-the-Wild
【速读】:该论文旨在解决野外场景下多人姿态跟踪与3D人体运动重建中因严重遮挡和动态交互导致的身份关联不稳定及运动估计不连续的问题。解决方案的关键在于提出名为RAM的端到端框架,其核心创新包括:1)引入运动感知语义追踪器与自适应卡尔曼滤波(adaptive Kalman filtering),提升遮挡条件下的身份关联鲁棒性;2)设计记忆增强的时间人类运动重建模块(Temporal HMR),通过注入时空先验实现一致且平滑的3D运动估计;3)采用轻量级预测模块对未来姿态进行预估以维持重建连续性,并通过门控融合器自适应融合重建与预测特征,确保整体输出的连贯性和稳定性。
链接: https://arxiv.org/abs/2603.19929
作者: Sen Jia,Ning Zhu,Jinqin Zhong,Jiale Zhou,Huaping Zhang,Jenq-Neng Hwang,Lei Li
机构: University of Washington (华盛顿大学); Anhui University (安徽大学); East China University of Science and Technology (华东理工大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW, demonstrate that RAM substantially outperforms previous state-of-the-art in both Zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.
[CV-32] SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images
【速读】:该论文旨在解决传统3D实例分割方法依赖高质量点云或对齐的RGB-D扫描数据、流程复杂且对重建噪声敏感的问题,同时克服现有基于Transformer的多视角3D重建方法与高层语义理解脱节的局限。其解决方案的关键在于提出SegVGGT——一个统一的端到端框架,能够直接从多视角RGB图像中联合完成3D重建与实例分割;通过引入与多层次几何特征交互的对象查询(object queries),将实例识别深度嵌入视觉几何基础的Transformer结构中,并设计帧级注意力分布对齐(Frame-level Attention Distribution Alignment, FADA)策略,有效缓解因全局图像令牌数量庞大导致的注意力分散问题,从而在训练阶段提供结构化监督且不增加推理开销。
链接: https://arxiv.org/abs/2603.19926
作者: Jinyuan Qu,Hongyang Li,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves the state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.
[CV-33] PanORama: Multiview Consistent Panoptic Segmentation in Operating Rooms
【速读】:该论文旨在解决在手术室(Operating Room, OR)环境中,基于稀疏多视角图像实现一致且可靠的全景分割(Panoptic Segmentation)问题。由于手术室场景复杂、遮挡严重且视点受限,传统方法常因单视角信息不足导致跨相机预测不一致,难以支撑精准的空间理解。其解决方案的关键在于提出 PanORama,一种从架构设计上保证多视角一致性的全景分割方法:通过在骨干网络内部以单次前向传播的方式建模跨视角特征交互,使视角一致性直接涌现,而非依赖后处理优化。该方法无需相机标定参数,具备对未见视角的泛化能力,显著提升了多视角分割性能与手术空间感知精度。
链接: https://arxiv.org/abs/2603.19920
作者: Tuna Gürbüz,Ege Özsoy,Tony Danjun Wang,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Operating rooms (ORs) are cluttered, dynamic, highly occluded environments, where reliable spatial understanding is essential for situational awareness during complex surgical workflows. Achieving spatial understanding for panoptic segmentation from sparse multiview images poses a fundamental challenge, as limited visibility in a subset of views often leads to mispredictions across cameras. To this end, we introduce PanORama, the first panoptic segmentation for the operating room that is multiview-consistent by design. By modeling cross-view interactions at the feature level inside the backbone in a single forward pass, view consistency emerges directly rather than through post-hoc refinement. We evaluate on the MM-OR and 4D-OR datasets, achieving 70% Panoptic Quality (PQ) performance, and outperforming the previous state of the art. Importantly, PanORama is calibration-free, requiring no camera parameters, and generalizes to unseen camera viewpoints within any multiview configuration at inference time. By substantially enhancing multiview segmentation and, consequently, spatial understanding in the OR, we believe our approach opens new opportunities for surgical perception and assistance. Code will be released upon acceptance.
[CV-34] Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery CVPR2026
【速读】:该论文旨在解决广义类别发现(Generalized Category Discovery, GCD)中因仅依赖视觉信息及监督学习与发现模块间松散耦合而导致的细粒度、外观相似类别边界模糊的问题。解决方案的关键在于提出一种可插拔的类比文本概念生成器(Analogical Textual Concept Generator, ATCG),该模块通过将已标注知识类比至未标注样本,为无标签数据生成文本概念,并将其与视觉特征融合,从而将先验知识迁移至新数据,实现视觉-文本联合推理,显著增强类别间的区分度。ATCG兼容参数化和聚类类GCD流程,无需改动原有架构,在六个基准测试中均提升了整体性能,尤其在细粒度数据上表现最优。
链接: https://arxiv.org/abs/2603.19918
作者: Jizhou Han,Chenhao Ding,Yuhang He,Qiang Wang,Shaokun Wang,SongLin Dong,Yihong Gong
机构: Xi’an Jiaotong University (西安交通大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accept by CVPR 2026
Abstract:Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data. Our code is available at: this https URL.
[CV-35] SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation
【速读】:该论文旨在解决地球观测(Earth Observation, EO)领域中基础模型微调时计算成本过高问题,包括训练时间和内存消耗大,以及现有参数高效方法无法降低推理复杂度、后处理压缩又依赖昂贵的完整微调等局限。其解决方案的关键在于提出一种预微调架构选择方法SIMPLER,通过分析预训练视觉Transformer在无标签任务数据上的层间表示相似性,自动识别冗余深层并进行剪枝,无需梯度计算、幅度启发式或超参数调优,从而在保持高精度的同时显著降低推理与部署成本。
链接: https://arxiv.org/abs/2603.19873
作者: Víctor Barreiro,Johannes Jakubik,Francisco Argüello,Dora B. Heras
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fine-tuning foundation models for Earth Observation is computationally expensive, with high training time and memory demands for both training and deployment. Parameter-efficient methods reduce training cost but retain full inference complexity, while post-hoc compression optimizes inference only after costly full fine-tuning. We introduce SIMPLER, a pre-fine-tuning architecture selection method that reduces inference and deployment costs by identifying an effective model depth before adaptation. SIMPLER exploits stabilization of representations in deeper layers of pre-trained vision transformers: it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, with no gradients, magnitude heuristics, or hyperparameter tuning required. On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (a multimodal EO foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities. Code is available at this https URL.
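摘要中"在无标签任务数据上计算层间表示相似度、据此剪除冗余深层"的思路,可以用如下最小化草图示意(非论文官方实现,阈值 0.98 与函数接口均为假设):

```python
import numpy as np

def select_redundant_layers(layer_feats, threshold=0.98):
    """标记与前一层表示几乎相同的层(可视为冗余、可剪枝的候选)。

    layer_feats: 每层一个 (tokens, dim) 特征矩阵的列表;
    若某层展平并去均值后的特征与前一层特征的余弦相似度超过
    threshold,则记为冗余层。阈值取值仅为示意,非论文设定。
    """
    redundant = []
    for i in range(1, len(layer_feats)):
        a = layer_feats[i - 1].ravel() - layer_feats[i - 1].mean()
        b = layer_feats[i].ravel() - layer_feats[i].mean()
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        if sim > threshold:
            redundant.append(i)
    return redundant
```

整个过程只需前向提取特征,不需要梯度,这与摘要中"no gradients"的描述一致。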
[CV-36] MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLM s in Medical Image Quality Assessment
【速读】:该论文旨在解决医学图像质量评估(Med-IQA)中多模态大语言模型(MLLMs)在描述性评估和临床推理能力上显著落后于人类专家的问题,尤其针对现有方法因标注成本高及无法动态适应模型弱点而导致的性能瓶颈。解决方案的关键在于提出MedQ-Engine——一个闭环数据引擎,其核心机制包括:通过数据驱动聚类识别模型失败原型,以这些原型为检索锚点从百万级图像池中筛选样本并结合渐进式“人机协同”标注,再经由高质量保证的微调实现模型迭代进化;同时引入熵引导的路由机制优化标注分配以降低标注成本。该方法在五种医学成像模态上验证有效,仅用1万条标注即使8B参数模型超越GPT-4o超过13%,且与人类专家差距缩小至4.34%,样本效率提升超4倍于随机采样。
链接: https://arxiv.org/abs/2603.19863
作者: Jiyao Liu,Junzhi Ning,Wanying Qu,Lihao Liu,Chenglong Ma,Junjun He,Ningsheng Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model’s evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.
[CV-37] IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment CVPR2026
【速读】:该论文旨在解决视觉-语言模型(如CLIP)在图像到图像检索等单模态任务中因模态内错位(intra-modal misalignment)导致性能下降的问题。其核心发现是:CLIP中的投影器(projector)不仅包含一个用于跨模态对齐的互模态算子(inter-modal operator),还存在一个仅执行模态内归一化的模态内算子(intra-modal operator),后者无法促进模态内的对齐。解决方案的关键在于通过谱分析识别出一个近似各向同性的对齐子空间,该子空间可直接从投影器权重中提取,并通过去除各向异性的方向来提升模态内对齐效果。此方法无需重新训练即可显著改善单模态任务性能,同时降低延迟并优于现有方法。
链接: https://arxiv.org/abs/2603.19862
作者: Simone Magistri,Dipam Goswami,Marco Mistretta,Bartłomiej Twardowski,Joost van de Weijer,Andrew D. Bagdanov
机构: Media Integration and Communication Center (MICC), University of Florence, Italy; Department of Computer Science, Universitat Autònoma de Barcelona, Spain; Computer Vision Center, Barcelona, Spain; IDEAS Research Institute, Warsaw, Poland
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR2026
Abstract:Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: this https URL.
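文中"从投影器权重直接提取对齐子空间"的一种最简示意:对投影器权重做 SVD,将投影前特征投到前 k 个右奇异方向上(k 的取值与具体筛选准则为假设,并非论文原方法的细节):

```python
import numpy as np

def project_to_top_subspace(W, feats, k):
    """将投影前特征投到 W 的前 k 个右奇异方向张成的子空间上。

    W: (d_out, d_in) 投影器权重;feats: (n, d_in) 特征。
    返回 (n, k) 的子空间坐标。仅为谱分析思路的示意。
    """
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return feats @ vt[:k].T
```

由于右奇异向量彼此正交,当 k 取满秩时该投影只是一次正交旋转、不丢失信息;压缩发生在丢弃各向异性方向(较小的 k)时。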
[CV-38] FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts CVPR
【速读】:该论文旨在解决当前基于DiT(Diffusion Transformer)的视频到音频(Video-to-Audio, V2A)生成方法在多事件场景或视觉线索不足时,难以实现精细时间控制的问题,例如小区域、屏幕外声音或遮挡/部分可见物体等情况下。解决方案的关键在于提出FoleyDirector框架,其核心创新包括:1)引入结构化时间脚本(Structured Temporal Scripts, STS),通过对应短时间片段的标注文本提供更丰富的时序信息;2)设计Script-Guided Temporal Fusion Module,利用时间脚本注意力机制(Temporal Script Attention)实现STS特征的协同融合;3)提出双帧声音合成(Bi-Frame Sound Synthesis)机制,在帧内与帧外音频生成之间并行处理,提升复杂多事件场景下的可控性。该方案在保持基础模型音频质量的同时,实现了从通用V2A生成到精确时间控制合成的无缝切换。
链接: https://arxiv.org/abs/2603.19857
作者: You Li,Dewei Zhou,Fan Ma,Fu Li,Dongliang He,Yi Yang
机构: 未知
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, 18 pages
Abstract:Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
[CV-39] Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them CVPR2026
【速读】:该论文旨在解决深度学习驱动的在线地图构建模型在陌生环境中的泛化能力不足问题,其核心挑战在于模型对训练数据中输入特征的记忆效应(memorization)与对已知地图几何结构的过拟合(overfitting)难以区分。解决方案的关键在于提出一套系统性的评估框架:通过控制地理邻近性和几何相似性来设计验证子集,引入基于Fréchet距离的重建统计量以无阈值方式量化元素级形状保真度,并定义两类互补的故障模式评分——定位过拟合分数(衡量地理线索消失时性能下降程度)和地图几何过拟合分数(衡量场景几何新颖性导致的性能退化)。此外,论文还提出基于最小生成树(minimum-spanning-tree, MST)的多样性度量与对称覆盖度量,用于诊断数据集偏差并指导训练集的稀疏化策略,从而提升训练集的平衡性和多样性,最终实现更可靠的泛化性能评估与可部署的在线地图建模。
链接: https://arxiv.org/abs/2603.19852
作者: Michael Hubbertz,Qi Han,Tobias Meisen
机构: University of Wuppertal (伍珀塔尔大学); Aptiv Services Deutschland GmbH (阿普蒂夫服务德国有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026, final camera ready version is published there
Abstract:Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: Memorization of input features and overfitting to known map geometries. We propose measures based on evaluation subsets that control for geographical proximity and geometric similarity between training and validation scenes. We introduce Fréchet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: a localization overfitting score quantifying the performance drop when geographic cues disappear, and a map geometry overfitting score measuring degradation as scenes become geometrically novel. Beyond models, we analyze dataset biases and contribute map geometry-aware diagnostics: A minimum-spanning-tree (MST) diversity measure for training sets and a symmetric coverage measure to quantify geometric similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that map geometry-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and map geometry-centric dataset design for deployable online mapping.
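摘要中基于 Fréchet 距离的逐元素形状保真度统计,其核心度量可以用标准的离散 Fréchet 距离动态规划写法示意(如何将该距离组合成论文中的统计量属于假设,此处仅给出距离本身):

```python
import numpy as np

def discrete_frechet(p, q):
    """两条折线 p (n,2)、q (m,2) 之间的离散 Fréchet 距离(经典 DP 写法)。"""
    n, m = len(p), len(q)
    ca = np.full((n, m), -1.0)

    def d(i, j):
        return float(np.linalg.norm(p[i] - q[j]))

    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                ca[i, j] = d(0, 0)
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d(0, j))
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d(i, 0))
            else:
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                               d(i, j))
    return ca[-1, -1]
```

与逐点最近邻距离不同,Fréchet 距离要求沿两条曲线单调推进,因此无需阈值调参即可反映整条车道线的形状偏差,这正对应摘要中"without threshold tuning"的动机。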
[CV-40] Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation
【速读】:该论文旨在解决多模态医学图像分割中特征融合效率与精度不足的问题,特别是在脑肿瘤体积分割任务中,传统固定残差连接(fixed residual connections)难以充分挖掘不同模态间的互补信息。其解决方案的关键在于引入动态超连接(Hyper-Connections, HC),作为一种即插即用的模块替代固定连接结构,通过自适应地聚合多模态特征来增强模型对细粒度边界(如增强肿瘤区域)的敏感性。实验表明,HC在五个3D分割架构上均带来显著性能提升(最高Dice系数提升1.03%),且对临床主导序列(T1ce和FLAIR)表现出更强的响应能力,验证了其在多模态特征融合中的有效性与普适性。
链接: https://arxiv.org/abs/2603.19844
作者: Lokendra Kumar,Shubham Aggarwal
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages,6 tables,17 figures
Abstract:We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.
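摘要中作为主要指标的 Dice 系数按通用定义计算如下(通用实现示意,非该论文的特定代码):

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """二值分割掩码间的 Dice 系数:2|A∩B| / (|A| + |B|),取值范围 [0, 1]。"""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))
```

摘要中报告的 "+1.03 percent mean Dice gain" 即指该指标在各肿瘤子区域上的平均提升。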
[CV-41] Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
【速读】:该论文旨在解决现有基于显式基元的三维渲染方法中视觉保真度与基元数量强耦合的问题,即当前方案仅能通过剪枝基元来降低质量,缺乏灵活的细节控制能力。其解决方案的关键在于提出了一种内在可扩展的基元表示方法——Fourier Splatting,该方法利用傅里叶编码描述平面微表面(planar surfels),生成具有任意闭合形状的可缩放基元;通过在运行时截断傅里叶系数即可实现多级细节(Level of Detail, LoD)渲染,从而无需重新训练模型即可适应不同带宽约束下的高保真度渲染需求。此外,为保障优化稳定性,作者引入了直通估计器(straight-through estimator)处理基元边界外的梯度传播,并设计了HYDRA策略在马尔可夫链蒙特卡洛(MCMC)框架下将复杂基元分解为更简单的组成单元,提升了训练稳定性与表达能力。
链接: https://arxiv.org/abs/2603.19834
作者: Mihnea-Bogdan Jurca,Bert Van hauwermeiren,Adrian Munteanu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.
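"运行时截断傅里叶系数即可得到不同细节层级"的核心思路,可以用闭合二维轮廓的复数傅里叶描述子示意(接口与采样方式为假设,仅演示截断系数如何控制形状细节):

```python
import numpy as np

def truncate_contour(contour, k):
    """仅保留闭合轮廓中 |频率| <= k 的傅里叶系数并重建,得到更粗的细节层级。

    contour: (n, 2) 闭合轮廓采样点;返回重建后的 (n, 2) 点集。
    k 越小形状越光滑、存储越省;k 足够大时可精确还原原轮廓。
    """
    z = contour[:, 0] + 1j * contour[:, 1]
    coeffs = np.fft.fft(z)
    freqs = np.fft.fftfreq(len(z), d=1.0 / len(z))  # 整数频率 0, 1, ..., -1
    z_lo = np.fft.ifft(np.where(np.abs(freqs) <= k, coeffs, 0))
    return np.stack([z_lo.real, z_lo.imag], axis=1)
```

例如单位圆只含一个非零频率分量,截断到 k=1 仍可精确重建;更复杂的形状则随 k 减小逐步退化为椭圆状近似,这正是单模型多级细节渲染的来源。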
[CV-42] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks
【速读】:该论文旨在解决当前无人机视觉语言导航(Vision-Language Navigation, VLN)基准在实际应用中存在的重要局限性:现有评估体系主要依赖于长步骤、目标导向的语言指令,难以诊断真实场景下对简短、高层级命令的语义理解与安全多阶段行为执行能力。为此,作者提出了HUGE-Bench,一个面向高阶无人机视觉-语言-动作(High-Level UAV Vision-Language-Action, HL-VLA)任务的新型基准测试平台。其关键创新在于构建了一个基于3D高斯点云(3D Gaussian Splatting, 3DGS)与网格融合表示的数字孪生环境,实现了逼真的渲染和碰撞感知几何建模,从而支持大规模轨迹生成与安全评估;同时引入过程导向(process-oriented)和碰撞感知(collision-aware)指标,系统量化模型在语义完整性、终端准确性及安全性方面的表现,有效揭示了当前先进VLA模型在高阶语义理解和安全行为执行上的显著差距,为高阶无人机自主能力提供了一个具有诊断意义的评测框架。
链接: https://arxiv.org/abs/2603.19822
作者: Jingyu Guo,Ziye Chen,Ziwen Li,Zhengqing Gao,Jiaxin Huang,Hanlue Zhang,Fengming Huang,Yu Yao,Tongliang Liu,Mingming Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing UAV vision-language navigation (VLN) benchmarks have enabled language-guided flight, but they largely focus on long, step-wise route descriptions with goal-centric evaluation, making them less diagnostic for real operations where brief, high-level commands must be grounded into safe multi-stage behaviors. We present HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests whether an agent can interpret concise language and execute complex, process-oriented trajectories with safety awareness. HUGE-Bench comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that combines photorealistic rendering with collision-capable geometry for scalable generation and collision-aware evaluation. We introduce process-oriented and collision-aware metrics to assess process fidelity, terminal accuracy, and safety. Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, highlighting HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.
[CV-43] Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
【速读】:该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在生成式训练中存在的粒度不匹配(granularity mismatch)和监督冗余(supervisory redundancy)问题。其核心解决方案是提出语义锚定监督(Semantically-Grounded Supervision, SeGroS),关键在于设计一种新颖的视觉锚定映射(visual grounding map),以此构建两种互补的监督信号:一是通过语义视觉提示(semantic Visual Hints)弥补文本提示稀疏性,二是生成语义锚定的损坏输入(semantically-grounded Corrupted Input),通过限制重建损失仅作用于与文本对齐的核心区域,从而增强基于掩码机制的UMMs的监督有效性。
链接: https://arxiv.org/abs/2603.19807
作者: Jiyeong Kim,Yerim So,Hyesong Choi,Uiwon Hwang,Dongbo Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
[CV-44] Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy
【速读】:该论文旨在解决生物医学图像中交互式语义分割(interactive semantic segmentation)和对象级分类任务的性能瓶颈问题,这些问题传统上依赖于基于特征的浅层学习方法,受限于数据多样性、缺乏大规模预训练数据集以及对计算和标注效率的要求。其解决方案的关键在于引入视觉基础模型(Vision Foundation Models, VFMs),包括通用模型(如SAM、SAM2、DINOv3)和领域特定模型(如μSAM、PathoSAM),并将其与浅层学习及注意力探测(attentive probing)相结合,在五个多样化且具有挑战性的显微成像数据集上进行评估。实验结果表明,VFMs能显著优于手工设计特征,并为实际应用提供了清晰的改进路径,同时建立了显微成像领域VFMs的基准,推动该方向未来的发展。
链接: https://arxiv.org/abs/2603.19802
作者: Carolin Teuber,Anwai Archit,Tobias Boothe,Peter Ditte,Jochen Rink,Constantin Pape
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object-level classification (object classification), feature-based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state-of-the-art tools for many other vision tasks in microscopy - most notably cellular instance segmentation - already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones ( \mu SAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.
[CV-45] Controllable Text-to-Motion Generation via Modular Body-Part Phase Control
【速读】:该论文旨在解决文本到动作(Text-to-Motion, T2M)生成中局部肢体编辑难以实现且保持整体运动连贯性的难题。现有方法通常依赖复杂的高维关节约束(如轨迹),导致用户交互繁琐、迭代修改困难。其解决方案的关键在于提出模块化身体部位相位控制(Modular Body-Part Phase Control),通过将每个身体部位的潜在运动通道建模为由振幅、频率、相位偏移和偏置参数表征的正弦相位信号,提取可解释的编码,并利用一个模块化的相位控制网络(Phase ControlNet)分支以残差特征调制的方式注入这些信号,从而实现对运动幅度、速度和时序的精细调控,同时解耦控制逻辑与生成主干模型,保障全局运动一致性。
链接: https://arxiv.org/abs/2603.19795
作者: Minyue Dai,Ke Fan,Anyi Rao,Jingbo Wang,Bo Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: this https URL
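"以振幅、频率、相位偏移和偏置四个标量参数化正弦相位信号"可以写成如下形式(参数接口为假设性示意):

```python
import numpy as np

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    """A * sin(2*pi*f*t + phi) + b:每个身体部位潜通道对应一组这样的标量。

    调大 amplitude 对应更大的运动幅度,调大 frequency 对应更快的节奏,
    phase_shift 平移动作时序,offset 改变基线姿态。
    """
    return amplitude * np.sin(2 * np.pi * frequency * t + phase_shift) + offset
```

这种紧凑的标量接口正是摘要所说"用户友好、可迭代细化"的来源:编辑某个身体部位只需改动四个数,而不必给出高维关节轨迹。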
[CV-46] From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models
【速读】:该论文旨在解决生成式OCR(Optical Character Recognition)在实际部署中因模型生成机制与OCR任务本质需求不匹配而导致的严重错误问题,尤其是过度生成(over-generation)和无依据替换(unsupported substitutions),这些错误即使在传统基准测试准确率较高的情况下仍可能引发重大风险。解决方案的关键在于将冻结的视觉语言模型(VLM)用于OCR时视为一个“选择性接受/放弃”(selective accept/abstain)问题,并提出了一种模型无关的几何风险控制器(Geometric Risk Controller)。该控制器通过多视角结构化探测输入图像,执行轻量级结构筛选,在跨视角一致性与稳定性满足预设标准时才接受输出文本,从而在可控覆盖成本下显著降低极端错误风险,实现更可靠的生成式OCR部署。
链接: https://arxiv.org/abs/2603.19790
作者: Weile Gong,Yiping Zuo,Zijian Lu,Xin He,Weibei Fan,Chen Dai
机构: Nanjing University of Posts and Telecommunications(南京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 5 tables
Abstract:Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
[CV-47] Learning Hierarchical Orthogonal Prototypes for Generalized Few-Shot 3D Point Cloud Segmentation ICME2026
【速读】:该论文旨在解决广义少样本3D点云分割(Generalized few-shot 3D point cloud segmentation)中的稳定性-可塑性权衡问题,即在仅用少量标注数据适应新类别时,如何避免对基础类别的性能造成干扰(base-class forgetting)。其解决方案的关键在于提出HOP3D框架,通过引入分层正交原型(hierarchical orthogonal prototypes)和基于熵的少样本正则化器(entropy-based few-shot regularizer),实现梯度和表示层面的基类与新类学习解耦,从而有效缓解基类与新类之间的干扰,并利用预测不确定性优化原型学习,提升稀疏监督下的适应能力。
链接: https://arxiv.org/abs/2603.19788
作者: Yifei Zhao,Fanyu Zhao,Zhongyuan Zhang,Shengtang Wu,Yixuan Lin,Yinsheng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures, 2 tables, Accepted by ICME 2026
Abstract:Generalized few-shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability-plasticity trade-off: adapting to novel classes can interfere with shared representations and cause base-class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy-based few-shot regularizer to enable robust novel-class adaptation without degrading base-class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base-novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings. The code is available at this https URL.
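文中"基于熵的正则化器"利用预测不确定性,其核心量是预测分布的平均熵,可示意如下(logits 接口与数值细节为假设,非论文原实现):

```python
import numpy as np

def mean_predictive_entropy(logits):
    """对 (n, c) 的逐点类别 logits,返回 softmax 分布的平均熵。

    熵越大表示预测越不确定;正则化时可据此调节原型学习,
    促使稀疏监督下的预测更均衡。
    """
    z = logits - logits.max(axis=1, keepdims=True)  # 数值稳定的 softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())
```

均匀分布时熵取最大值 log(c),预测高度确定时熵趋近 0,二者给出了该正则项的取值范围。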
[CV-48] Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection ICME2026
【速读】:该论文旨在解决弱监督视频异常检测中因统一优化框架导致的敏感性-稳定性权衡问题,即在检测瞬时异常与持续异常时目标冲突,从而引发预测碎片化或过度平滑的问题。解决方案的关键在于提出一种解耦的敏感性-一致性框架(DeSC),通过两个专用分支分别优化:时间敏感性分支采用激进的优化策略以捕捉高频突变,语义一致性分支则施加鲁棒约束以保持长期连贯性和降噪;二者通过协同推理机制融合,有效降低个体偏差并生成平衡预测。
链接: https://arxiv.org/abs/2603.19780
作者: Hantao Zheng,Ning Han,Yawen Zeng,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 4 tables. Accepted by ICME 2026
Abstract:Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at this https URL.
[CV-49] One Model Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment
【速读】:该论文旨在解决统一图像质量评估(Image Quality Assessment, IQA)与图像美学评估(Image Aesthetic Assessment, IAA)在单一多模态大语言模型中所面临的任务错配问题。现有方法采用无任务区分的训练策略,对两类任务使用相同的推理机制和奖励函数,导致性能受限:IQA依赖低层次、客观的感知线索,需简洁的失真聚焦推理;而IAA则需要深层次语义判断,点式分数回归难以有效建模。作者识别出这一现象为“推理错配”与“优化错配”,并通过受控探测实验加以验证。解决方案的关键在于提出TATAR框架——其核心是任务感知的后训练机制,包含三个创新组件:(1)快-慢双通道任务特异性推理结构,分别匹配IQA的简明感知推理与IAA的思辨美学叙事;(2)两阶段SFT+GRPO训练流程,先建立任务感知行为先验再进行奖励驱动优化;(3)不对称奖励设计,对IQA采用高斯分数重塑,对IAA采用Thurstone风格完成排序奖励。该方案在八个基准测试中均显著优于统一基线,并媲美专用模型,同时提升美学评估训练稳定性,确立了任务条件化后训练作为统一感知评分的合理范式。
链接: https://arxiv.org/abs/2603.19779
作者: Wen Yin,Cencen Liu,Dingrui Liu,Bing Su,Yuan-Fang Li,Tao He
机构: University of Electronic Science and Technology of China (电子科技大学); Jiigan Technology (极安科技); Faculty of Information Technology, Monash University (莫纳什大学信息学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,7 figures
Abstract:Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task’s nature. TATAR combines three components: fast–slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at this https URL.
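文中针对 IQA 的"高斯分数重塑"奖励,一种常见写法是以预测分数与真值之差为自变量的高斯核(sigma 取值与具体形式为假设,仅示意这类奖励的性质):

```python
import numpy as np

def gaussian_score_reward(pred, gt, sigma=0.5):
    """当预测分数 pred 等于真值 gt 时奖励为 1,偏离越远奖励平滑衰减至 0。"""
    return float(np.exp(-((pred - gt) ** 2) / (2.0 * sigma ** 2)))
```

相比 0/1 式的阈值奖励,这种平滑形状能为接近正确的预测提供梯度信号,更适合分数回归类任务的奖励驱动优化。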
[CV-50] ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
[Quick Read]: This paper addresses inaccurate reconstruction in monocular 3D lane detection caused by depth ambiguity and weak geometric constraints. Existing methods rely on depth guidance, BEV projection, and anchor- or curve-based detection heads, but their oversimplified physical assumptions only weakly encode road geometry, leaving the 2D-to-3D lifting ill-posed and prone to concavities, bulges, and twists. The key idea is the Road-Manifold Assumption: the road is a smooth 2D manifold in \mathbb{R}^3, lanes are embedded 1D submanifolds, and sampled points are dense observations, coupling metric and topology across surfaces, curves, and point sets. Building on this, ReManNet encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, fuses them with visual features through a lightweight gate for coherent 3D reasoning, and introduces a 3D Tunnel Lane IoU (3D-TLIoU) loss that optimizes slice-wise overlap of tubular neighborhoods from a joint point-curve perspective to improve shape-level alignment.
Link: https://arxiv.org/abs/2603.19776
Authors: Chengzhi Hong,Bijun Li
Affiliations: Wuhan University (武汉大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in \mathbb{R}^3 , lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at this https URL.
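As a toy illustration of representing Gaussian statistics on the SPD manifold (the general device behind Riemannian Gaussian descriptors), a Gaussian N(mu, Sigma) can be embedded as an augmented second-moment SPD matrix. This is a standard construction sketched here under simple assumptions; it is not ReManNet's actual descriptor code.

```python
import numpy as np

def gaussian_to_spd(mu, sigma):
    """Embed a Gaussian N(mu, sigma) as a symmetric positive-definite
    matrix via its augmented second-moment form:
        P = [[sigma + mu mu^T, mu], [mu^T, 1]]
    so that (mean, covariance) geometry lives on the SPD manifold."""
    mu = np.asarray(mu, dtype=float).reshape(-1, 1)
    sigma = np.asarray(sigma, dtype=float)
    top = np.hstack([sigma + mu @ mu.T, mu])
    bottom = np.hstack([mu.T, np.ones((1, 1))])
    return np.vstack([top, bottom])

# Descriptor for a cluster of sampled lane points: fit a Gaussian to
# the points, then map it onto the SPD manifold.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0], [2.0, 0.3, 0.1]])
mu = pts.mean(axis=0)
sigma = np.cov(pts.T) + 1e-6 * np.eye(3)   # regularize for positive-definiteness
P = gaussian_to_spd(mu, sigma)
```

The resulting 4x4 matrix is symmetric with strictly positive eigenvalues, so standard SPD-manifold metrics can compare such descriptors.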
[CV-51] Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach
[Quick Read]: This paper targets the reliability problem in evaluating text-guided image editing (TIE) methods: existing benchmarks are limited in scale and correlate weakly with human perceptual judgments. The authors propose the TIEdit benchmark and the EditProbe evaluator. TIEdit pairs 512 source images with eight editing task types to produce 5,120 edited images, systematically rated with 15,360 mean opinion scores (MOSs) from 20 experts along three dimensions: perceptual quality, editing alignment, and content preservation. EditProbe probes the intermediate-layer features of multimodal large language models to extract semantic and perceptual information for estimating editing quality. The key innovation is that, instead of relying on final outputs, it models the relationships among source image, textual instruction, and edited result from intermediate representations, yielding substantially stronger agreement between automatic metrics and human perception.
Link: https://arxiv.org/abs/2603.19775
Authors: Shiqi Gao,Zitong Xu,Kang Fu,Huiyu Duan,Xiongkuo Min,Jia Wang,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which accumulates into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
[CV-52] Template-based Object Detection Using a Foundation Model
[Quick Read]: This paper tackles the problem that, in scenarios with little data variation, conventional learning-based object detectors are hard to adapt quickly to new targets or design changes because they require large labeled training sets and time-consuming training. The key to the solution is to use segmentation foundation models to extract semantic segments from images and combine them with a simple feature-based classification method for detection and classification, so no retraining or new training dataset is needed, greatly reducing the deployment cost and turnaround time of GUI object recognition in automated testing.
Link: https://arxiv.org/abs/2603.19773
Authors: Valentin Braeutigam,Matthias Stock,Bernhard Egger
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Most currently used object detection methods are learning-based, and can detect objects under varying appearances. Those models require training and a training dataset. We focus on use cases with less data variation, but the requirement of being free of generation of training data and training. Such a setup is for example desired in automatic testing of graphical interfaces during software development, especially for continuous integration testing. In our approach, we use segments from segmentation foundation models and combine them with a simple feature-based classification method. This saves time and cost when changing the object to be searched or its design, as nothing has to be retrained and no dataset has to be created. We evaluate our method on the task of detecting and classifying icons in navigation maps, which is used to simplify and automate the testing of user interfaces in automotive industry. Our methods achieve results almost on par with learning-based object detection methods like YOLO, without the need for training.
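The "segments plus simple feature-based classification" recipe can be sketched as nearest-template matching by cosine similarity. The feature vectors and template names below are hypothetical; the real system would use embeddings of foundation-model segments rather than 3-dimensional toy vectors.

```python
import numpy as np

def classify_segment(feat, templates):
    """Assign a segment's feature vector to the nearest template by
    cosine similarity; returns (label, similarity)."""
    f = np.asarray(feat, dtype=float)
    f = f / np.linalg.norm(f)
    best_label, best_sim = None, -1.0
    for label, t in templates.items():
        t = np.asarray(t, dtype=float)
        sim = float(f @ (t / np.linalg.norm(t)))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim

# Hypothetical icon templates; swapping a design only means replacing
# its feature vector -- no retraining, no new dataset.
templates = {"zoom_icon": [1.0, 0.0, 0.2], "home_icon": [0.0, 1.0, 0.1]}
label, sim = classify_segment([0.9, 0.1, 0.2], templates)
```

This matches the paper's stated motivation: changing the searched object touches only the template set, not any trained weights.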
[CV-53] FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision CVPR2026
[Quick Read]: This paper addresses the long-standing neglect of precise motion timing (PMT) in human pose estimation (HPE), which stems from the lack of labeled datasets with high temporal resolution and from the cost, light sensitivity, bandwidth, and computational limits of existing high-speed RGB cameras. The key is the FlashCap system, the first flashing-LED-based MoCap approach to achieve millisecond-level timing, together with the multimodal FlashMotion dataset (event camera, RGB, LiDAR, and IMU), providing a high-quality benchmark for PMT and high-temporal-resolution HPE. The authors further propose the ResPose baseline, which learns residual poses from events and RGB images, reducing pose estimation error by about 40% and achieving millisecond-level timing accuracy, opening up new research directions.
Link: https://arxiv.org/abs/2603.19770
Authors: Zekai Wu,Shuqi Fan,Mengyin Liu,Yuhua Luo,Xincheng Lin,Ming Yan,Junhao Wu,Xiuhong Lin,Yuexin Ma,Chenglu Wen,Lan Xu,Siqi Shen,Cheng Wang
Affiliations: Xiamen University (厦门大学); ShanghaiTech University (上海科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026
Abstract:Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.
[CV-54] Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images CVPR2026
[Quick Read]: This paper aims to predict spatial transcriptomics (ST) gene expression accurately from HE-stained histology; existing generative methods lack biological coherence because they do not explicitly model gene-gene dependencies. The key is the HINGE framework, which injects hierarchical visual context into a pre-trained single-cell foundation model (sc-FM) via a lightweight SoftAdaLN module, combined with an expression-space masked diffusion objective and a warm-start curriculum. This enables conditional expression generation while preserving the gene relationships the sc-FM has already learned, markedly improving the accuracy of spatial marker expression patterns and gene co-expression consistency.
Link: https://arxiv.org/abs/2603.19766
Authors: Donghai Fang,Yongheng Li,Zhen Wang,Yuansong Zeng,Wenwen Min
Affiliations: Sun Yat-sen University (中山大学); Yunnan University (云南大学); Chongqing University (重庆大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, ours outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.
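SoftAdaLN is described as an identity-initialized modulation. The generic mechanism, zero-initialized scale/shift produced from a context vector so the layer starts as plain LayerNorm, can be sketched as follows; class and variable names are hypothetical, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis (no learned affine)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class SoftAdaLNSketch:
    """Adaptive LayerNorm whose scale/shift come from a context vector
    through a zero-initialized linear map: at initialization the layer
    is exactly plain LayerNorm, so the pre-trained backbone's behavior
    is untouched, and training gradually opens the visual side channel."""
    def __init__(self, dim, ctx_dim):
        self.W = np.zeros((ctx_dim, 2 * dim))  # zero init => identity start
        self.b = np.zeros(2 * dim)

    def __call__(self, x, ctx):
        scale, shift = np.split(ctx @ self.W + self.b, 2, axis=-1)
        return layer_norm(x) * (1.0 + scale) + shift

mod = SoftAdaLNSketch(dim=4, ctx_dim=3)
x = np.random.randn(2, 4)      # token features
ctx = np.random.randn(2, 3)    # visual context (e.g. a histology embedding)
out = mod(x, ctx)
```

Because `W` and `b` start at zero, `out` equals `layer_norm(x)` before any training step, which is the "identity-initialized" property the abstract emphasizes.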
[CV-55] FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLM s
[Quick Read]: This paper addresses the severe hallucination of current multimodal large language models (MLLMs) in fine-grained visual perception, and the inability of existing benchmarks, whose tasks are over-simplified or insufficiently diverse, to measure hallucination in advanced models. The key is FREAK, a comprehensive multimodal benchmark for fine-grained hallucination evaluation: it uses high-quality photorealistic images with fine-grained counter-commonsense edits to precisely characterize hallucinations in detailed visual understanding. The study further curates a controlled subset to indirectly assess models' perception of target details, and systematically evaluates mainstream Chain-of-Thought (CoT) prompting techniques to reveal links between hallucination patterns and model reasoning processes.
Link: https://arxiv.org/abs/2603.19765
Authors: Zhihan Yin,Jianxin Liang,Yueqian Wang,Yifeng Yao,Huishuai Zhang,Dongyan Zhao
Affiliations: Peking University (北京大学); State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages
Abstract:Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model’s ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
[CV-56] PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences CVPR2026
[Quick Read]: This paper targets temporal consistency in long-term scene flow estimation for point cloud sequences; existing methods are mostly restricted to pairwise settings and struggle to stay consistent over long sequences under geometric change, emerging occlusions, and error accumulation. The key is the PCSTracker framework with two core modules: an iterative geometry-motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features to mitigate correspondence inconsistencies caused by dynamic geometry, and a spatio-temporal point trajectory update module (STTU) that leverages broad temporal context to infer plausible positions for occluded points, ensuring coherent motion estimation. An overlapping sliding-window inference strategy alternates cross-window propagation with in-window refinement, effectively suppressing error accumulation and maintaining long-term motion stability.
Link: https://arxiv.org/abs/2603.19762
Authors: Min Lin,Gangwei Xu,Xianqi Wang,Yuyi Peng,Xin Yang
Affiliations: Huazhong University of Science and Technology (华中科技大学); Optics Valley Laboratory (光谷实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR 2026 (Findings)
Abstract:Point cloud scene flow estimation is fundamental to long-term and fine-grained 3D motion analysis. However, existing methods are typically limited to pairwise settings and struggle to maintain temporal consistency over long sequences as geometry evolves, occlusions emerge, and errors accumulate. In this work, we propose PCSTracker, the first end-to-end framework specifically designed for consistent scene flow estimation in point cloud sequences. Specifically, we introduce an iterative geometry motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features to alleviate correspondence inconsistencies caused by dynamic geometric changes. In addition, a spatio-temporal point trajectory update module (STTU) is proposed to leverage broad temporal context to infer plausible positions for occluded points, ensuring coherent motion estimation. To further handle long sequences, we employ an overlapping sliding-window inference strategy that alternates cross-window propagation and in-window refinement, effectively suppressing error accumulation and maintaining stable long-term motion consistency. Extensive experiments on the synthetic PointOdyssey3D and real-world ADT3D datasets show that PCSTracker achieves the best accuracy in long-term scene flow estimation and maintains real-time performance at 32.5 FPS, while demonstrating superior 3D motion understanding compared to RGB-D-based approaches.
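The overlapping sliding-window inference can be illustrated with a small index helper; the window and stride values are arbitrary examples, not the paper's settings.

```python
def overlapping_windows(seq_len, window, stride):
    """Start/end indices of overlapping sliding windows covering a
    sequence; consecutive windows share (window - stride) frames, which
    is where cross-window propagation hands state to refinement."""
    starts = list(range(0, max(seq_len - window, 0) + 1, stride))
    if starts[-1] + window < seq_len:       # cover the tail
        starts.append(seq_len - window)
    return [(s, s + window) for s in starts]

# A 10-frame sequence with 4-frame windows and stride 2 gives a
# 2-frame overlap between consecutive windows:
print(overlapping_windows(10, 4, 2))
# -> [(0, 4), (2, 6), (4, 8), (6, 10)]
```

At inference, each window would be refined internally, then its overlap frames would seed the next window, alternating the two phases the abstract describes.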
[CV-57] Growing Networks with Autonomous Pruning
[Quick Read]: This paper addresses parameter redundancy and low model efficiency in training conventional convolutional neural networks (CNNs): how to cut parameter counts sharply while keeping high classification accuracy. The key is Growing Networks with Autonomous Pruning (GNAP), which combines adaptive growth with automatic pruning: the network periodically grows during training to add expressive power, and between growth phases it prunes parameters fully autonomously via gradient descent, dynamically adjusting the architecture at convergence points so the model stays sparse and efficient. Experiments show a strong accuracy/parameter trade-off on several image classification benchmarks, e.g. 99.44% accuracy on MNIST with only 6.2k parameters and 92.2% accuracy on CIFAR10 with 157.8k parameters.
Link: https://arxiv.org/abs/2603.19759
Authors: Charles De Lambilly,Stefan Duffner
Affiliations: École Centrale de Lyon (里昂中央理工学院); CNRS (法国国家科学研究中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:This paper introduces Growing Networks with Autonomous Pruning (GNAP) for image classification. Unlike traditional convolutional neural networks, GNAP change their size, as well as the number of parameters they are using, during training, in order to best fit the data while trying to use as few parameters as possible. This is achieved through two complementary mechanisms: growth and pruning. GNAP start with few parameters, but their size is expanded periodically during training to add more expressive power each time the network has converged to a saturation point. Between these growing phases, model parameters are trained for classification and pruned simultaneously, with complete autonomy by gradient descent. Growing phases allow GNAP to improve their classification performance, while autonomous pruning allows them to keep as few parameters as possible. Experimental results on several image classification benchmarks show that our approach can train extremely sparse neural networks with high accuracy. For example, on MNIST, we achieved 99.44% accuracy with as few as 6.2k parameters, while on CIFAR10, we achieved 92.2% accuracy with 157.8k parameters.
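Pruning "with complete autonomy by gradient descent" is commonly realized by letting a sparsity penalty drive weights to exact zero; below is a minimal sketch using one proximal (soft-threshold) step for an L1 penalty. This mechanism is an assumption for illustration, not GNAP's actual pruning rule.

```python
import numpy as np

def soft_threshold(w, tau):
    """One proximal step for an L1 penalty: shrink every weight toward
    zero by tau. Weights smaller than tau become exactly zero, so the
    network sparsifies itself during ordinary gradient-based training."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

w = np.array([0.80, -0.02, 0.50, 0.001])   # toy layer weights
w = soft_threshold(w, tau=0.05)
sparsity = float((w == 0).mean())           # fraction of pruned weights
```

Here the two small weights are zeroed in a single step (sparsity 0.5), while the weights the task loss would defend merely shrink slightly.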
[CV-58] Uncertainty-aware Prototype Learning with Variational Inference for Few-shot Point Cloud Segmentation ICASSP2026
[Quick Read]: This paper addresses the problem that, in few-shot 3D semantic segmentation, scarce annotated support samples yield overly rigid prototype representations that cannot model intrinsic uncertainty, hurting the robustness and generalization of the segmentation results. The key is an uncertainty-aware prototype learning method (UPL) with two core components: a dual-stream prototype refinement module that jointly exploits the limited information in support and query samples to enrich prototype representations, and a formulation of prototype learning as variational inference that treats class prototypes as latent variables, explicitly modeling uncertainty and improving the robustness and interpretability of predictions.
Link: https://arxiv.org/abs/2603.19757
Authors: Yifei Zhao,Fanyu Zhao,Yinsheng Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 3 tables, accepted by ICASSP 2026
Abstract:Few-shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype-based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty-aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few-shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual-stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state-of-the-art performance under different settings while providing reliable uncertainty estimation. The code is available at this https URL.
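Treating a class prototype as a latent Gaussian can be sketched with the standard reparameterization trick; the dimensions and variance values are arbitrary, and this is the generic variational device rather than UPL's code.

```python
import numpy as np

def sample_prototype(mu, log_var, rng=None):
    """Reparameterized draw of a class prototype treated as a latent
    Gaussian: z = mu + sigma * eps. The spread sigma is an explicit
    uncertainty estimate; averaging predictions over several draws
    gives robustness a single deterministic prototype lacks."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.exp(0.5 * np.asarray(log_var)) * eps

mu = np.zeros(8)                                  # toy prototype mean
z = sample_prototype(mu, log_var=np.full(8, -20.0), rng=0)
# With near-zero variance the draw collapses to the mean, i.e. the
# deterministic-prototype case is recovered as a special case.
```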
[CV-59] ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination
[Quick Read]: This paper addresses the limitations and heavy computational overhead of traditional 3D asset reconstruction pipelines that handle geometry reconstruction, material estimation, and illumination recovery in separate stages; single-image methods in particular struggle to disentangle material from illumination, yielding unstable and inefficient reconstructions. The key is ReLi3D, the first unified end-to-end framework, which fuses multi-view inputs with a transformer cross-conditioning architecture and a novel two-path prediction strategy: one path outputs the object's structure and appearance, while the other predicts environment illumination from the image background or object reflections. Combined with a differentiable Monte Carlo importance-sampling renderer, this forms an optimal training setup for illumination disentanglement. A mixed training protocol of synthetic PBR datasets and real RGB images further improves generalization in geometric precision, material accuracy, and illumination quality, enabling complete, relightable 3D assets in under one second.
Link: https://arxiv.org/abs/2603.19753
Authors: Jan-Niklas Dihlmann,Mark Boss,Simon Donne,Andreas Engelhardt,Hendrik P.A. Lensch,Varun Jampani
Affiliations: University of Tübingen (图宾根大学); Stability AI (Stability AI)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL
Abstract:Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object’s structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. In addition, with our mixed domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: this https URL
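Multiple importance sampling combines sampling strategies with per-sample weights; the classic balance heuristic, of which the renderer described above would use some variant, looks like this (a textbook sketch, not ReLi3D's renderer):

```python
def balance_heuristic(pdf_a, pdf_b):
    """Multiple importance sampling balance heuristic: the weight for a
    sample drawn from strategy A when strategy B could also have
    produced it. The two weights for a shared sample sum to 1, keeping
    the combined estimator unbiased while down-weighting whichever
    strategy is a poor fit for that sample."""
    return pdf_a / (pdf_a + pdf_b)

# Toy densities for one direction sample: light sampling fits it well,
# BRDF sampling does not.
w_light = balance_heuristic(0.8, 0.2)
w_brdf = balance_heuristic(0.2, 0.8)
```

Here `w_light` is 0.8 and `w_brdf` is 0.2, so the contribution is dominated by the better-matched strategy without discarding the other.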
[CV-60] PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement
[Quick Read]: This paper addresses the noisy and unstable signals that motion artifacts and illumination changes cause in practical remote photoplethysmography (rPPG). Existing methods fall into two camps: end-to-end modeling from raw video, which preserves complete spatiotemporal information but is easily disturbed, and spatial-temporal map (STMap) representations, which reduce computational complexity but may lose high-frequency detail. To combine their strengths, the paper proposes PhysNeXt, a dual-input deep learning framework whose key components, a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention decoder, let video frames and STMap representations reinforce each other, markedly improving the robustness and precision of pulse signal extraction.
Link: https://arxiv.org/abs/2603.19752
Authors: Junzhe Cao,Bo Zhao,Zhiyi Niu,Dan Guo,Yue Sun,Haochen Liang,Yong Xu,Zitong YU
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Remote photoplethysmography (rPPG) enables contactless measurement of heart rate and other vital signs by analyzing subtle color variations in facial skin induced by cardiac pulsation. Current rPPG methods are mainly based on either end-to-end modeling from raw videos or intermediate spatial-temporal map (STMap) representations. The former preserves complete spatiotemporal information and can capture subtle heartbeat-related signals, but it also introduces substantial noise from motion artifacts and illumination variations. The latter stacks the temporal color changes of multiple facial regions of interest into compact two-dimensional representations, significantly reducing data volume and computational complexity, although some high-frequency details may be lost. To effectively integrate the mutual strengths, we propose PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and STMap representations. By incorporating a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder, PhysNeXt collaboratively enhances the robustness of pulse signal extraction. Experimental results demonstrate that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations. The codes will be released.
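An STMap as described, stacking temporal color statistics of facial regions of interest into a compact 2D representation, can be sketched with a simple row-strip average. Real pipelines use face-aligned ROIs and normalization, so this is only a schematic.

```python
import numpy as np

def build_stmap(video, n_rois):
    """Collapse a face video of shape (T, H, W, 3) into a
    spatial-temporal map of shape (n_rois, T, 3): each row is the mean
    RGB of one horizontal ROI strip over time -- a compact 2D view of
    the temporal color signal that rPPG methods analyze."""
    T, H, W, C = video.shape
    rows = np.array_split(np.arange(H), n_rois)
    return np.stack(
        [video[:, r, :, :].mean(axis=(1, 2)) for r in rows], axis=0
    )

video = np.random.rand(32, 16, 16, 3)   # 32 toy frames
stmap = build_stmap(video, n_rois=4)     # (4 ROIs, 32 frames, RGB)
```

The data reduction is what makes the STMap branch cheap; the full-video branch in the paper exists precisely to recover the high-frequency detail this averaging discards.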
[CV-61] PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing CVPR2026
[Quick Read]: This paper addresses expression-only performance editing of portrait videos driven by a reference video; the core difficulty is that existing methods cannot disentangle facial expression from head pose rotation and therefore cannot edit expression independently. The key is to exploit the properties of the 3D Morphable Face Model (3DMM): an improved keypoint transformation formula, made consistent with 3DMM's separated parameterization, achieves better expression-pose disentanglement and finer-grained control. To avoid misalignment around the face boundary in generated results, the method further decouples facial and non-facial regions of the input image and pre-trains a teacher model to supervise each separately, clearly improving generation quality and faithfulness to the driving video.
Link: https://arxiv.org/abs/2603.19731
Authors: Jiadong Liang,Bojun Xiong,Jie Tian,Hua Li,Xiao Long,Yong Zheng,Huan Fu
Affiliations: HUJING Digital Media Entertainment Group (虎鲸数字媒体娱乐集团)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026. Project Page: this https URL
Abstract:This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at this https URL.
[CV-62] BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates CVPR2026
[Quick Read]: This paper addresses biased representation learning and unbalanced gradient dynamics in multimodal learning under imbalanced missing rates (IMR): information-rich modalities dominate optimization while weak or partially missing modalities contribute too little, harming robustness and performance. The key is BALM, a model-agnostic plug-in framework with two complementary modules: a Feature Calibration Module (FCM) that recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns, and a Gradient Rebalancing Module (GRM) that modulates each modality's gradient magnitudes and directions from both distributional and spatial perspectives to balance multimodal learning dynamics. The framework integrates seamlessly into diverse backbones (e.g., multimodal emotion recognition models) without architectural changes, markedly improving performance and robustness under varied missing and imbalance settings.
Link: https://arxiv.org/abs/2603.19718
Authors: Phuong-Anh Nguyen,Tien Anh Pham,Duc-Trong Le,Cam-Van Thi Nguyen
Affiliations: VNU University of Engineering and Technology (河内大学工程与技术学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework comprises two complementary modules: the Feature Calibration Module (FCM), which recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns; the Gradient Rebalancing Module (GRM), which balances learning dynamics across modalities by modulating gradient magnitudes and directions from both distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings. Code available at: this https URL
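Balancing gradient magnitudes across modalities (one of GRM's two perspectives) can be sketched by rescaling each modality's gradient to a common norm while keeping its direction; the target-norm choice here is an assumption, not BALM's exact rule.

```python
import numpy as np

def rebalance_gradients(grads):
    """Scale each modality's gradient so all modalities contribute
    comparable magnitudes: rescale every gradient to the mean norm
    across modalities, leaving directions intact so the dominant
    modality no longer drowns out the weak one."""
    norms = {m: np.linalg.norm(g) for m, g in grads.items()}
    target = float(np.mean(list(norms.values())))
    return {m: g * (target / (norms[m] + 1e-12)) for m, g in grads.items()}

grads = {
    "audio": np.array([3.0, 4.0]),   # norm 5 -> dominant modality
    "text": np.array([0.6, 0.8]),    # norm 1 -> weak modality
}
balanced = rebalance_gradients(grads)
# Both modalities now push with norm 3 (the mean of 5 and 1).
```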
[CV-63] WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
[Quick Read]: The core question of this paper is whether 2D foundation image models inherently possess the capability to build 3D world models. To answer it, the authors propose an agentic framing built on a multi-agent architecture: a VLM-driven director formulates prompts to guide image synthesis, a generator synthesizes novel views, and a VLM-backed two-step verifier evaluates and curates the generated results from both the 2D image space and the 3D reconstruction space, ensuring consistency and plausibility of the outputs. Experiments show the approach effectively elicits the implicit 3D understanding in 2D models, achieving coherent and robust 3D scene reconstruction and producing 3D-consistent worlds whose novel views can be rendered.
Link: https://arxiv.org/abs/2603.19708
Authors: Ziya Erkoç,Angela Dai,Matthias Nießner
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL Video: this https URL
Abstract:Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
[CV-64] Demographic-Aware Self-Supervised Anomaly Detection Pretraining for Equitable Rare Cardiac Diagnosis
[Quick Read]: This paper addresses the difficulty of detecting rare cardiac anomalies in electrocardiograms (ECGs), where the core challenges are a long-tailed case distribution and demographic disparities in diagnostic performance that delay recognition and create uneven quality of care. The key is a two-stage AI-assisted ECG analysis framework: the first stage performs self-supervised anomaly detection pretraining, learning robust ECG representations without diagnostic labels via masked signal reconstruction, trend modeling, and patient-attribute prediction; the second stage fine-tunes for multi-label classification with an asymmetric loss that effectively handles the long tail of rare conditions, produces interpretable anomaly score maps for localization, and uses CPU-based optimization for practical deployment. Validated on a clinical cohort of over one million ECGs, the method reaches 94.7% AUROC on rare anomalies, narrows the common-rare performance gap by 73%, and maintains consistent accuracy across age and sex groups, demonstrating both fairness and scalability.
Link: https://arxiv.org/abs/2603.19695
Authors: Chaoqin Huang,Zi Zeng,Aofan Jiang,Yuchen Xu,Qing Cao,Kang Chen,Chenfei Chi,Yanfeng Wang,Ya Zhang
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Rare cardiac anomalies are difficult to detect from electrocardiograms (ECGs) due to their long-tailed distribution with extremely limited case counts and demographic disparities in diagnostic performance. These limitations contribute to delayed recognition and uneven quality of care, creating an urgent need for a generalizable framework that enhances sensitivity while ensuring equity across diverse populations. In this study, we developed an AI-assisted two-stage ECG framework integrating self-supervised anomaly detection with demographic-aware representation learning. The first stage performs self-supervised anomaly detection pretraining by reconstructing masked global and local ECG signals, modeling signal trends, and predicting patient attributes to learn robust ECG representations without diagnostic labels. The pretrained model is then fine-tuned for multi-label ECG classification using asymmetric loss to better handle long-tail cardiac abnormalities, and additionally produces anomaly score maps for localization, with CPU-based optimization enabling practical deployment. Evaluated on a longitudinal cohort of over one million clinical ECGs, our method achieves an AUROC of 94.7% for rare anomalies and reduces the common-rare performance gap by 73%, while maintaining consistent diagnostic accuracy across age and sex groups. In conclusion, the proposed equity-aware AI framework demonstrates strong clinical utility, interpretable anomaly localization, and scalable performance across multiple cohorts, highlighting its potential to mitigate diagnostic disparities and advance equitable anomaly detection in biomedical signals and digital health. Source code is available at this https URL.
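Masked signal reconstruction starts from a masking step like the one below, which hides random contiguous patches of the ECG for the model to reconstruct; the patch size and mask ratio are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def mask_signal(x, mask_ratio=0.3, patch=25, rng=None):
    """Zero out random contiguous patches of a 1D signal and return the
    masked signal plus the boolean mask. A model pre-trained to
    reconstruct the hidden patches learns representations without any
    diagnostic labels -- the self-supervised pretraining idea above."""
    rng = np.random.default_rng(rng)
    n_patches = len(x) // patch
    n_masked = int(round(mask_ratio * n_patches))
    chosen = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(len(x), dtype=bool)
    for p in chosen:
        mask[p * patch:(p + 1) * patch] = True
    return np.where(mask, 0.0, x), mask

ecg = np.sin(np.linspace(0, 20 * np.pi, 500))   # toy 500-sample trace
masked, mask = mask_signal(ecg, mask_ratio=0.3, patch=25, rng=0)
```

A reconstruction loss would then be computed only over the masked positions, forcing the encoder to infer hidden morphology from surrounding beats.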
[CV-65] TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents MICCAI2026
[Quick Read]: This paper addresses automatic tooth segmentation and identification from intra-oral 3D scans in digital dentistry; existing methods rely on task-specific 3D neural networks and densely annotated datasets, incurring high labeling cost and limited generalization. The key is to reformulate dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task: combining the representational power of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy, the method infers tooth instances and identities via multi-view visual abstraction and geometry-constrained reasoning, achieving accurate segmentation and identification without task-specific training, lowering computational and annotation cost, and generalizing better to scans from unseen sources.
Link: https://arxiv.org/abs/2603.19684
Authors: Shaojie Zhuang,Lu Yin,Guangshun Wei,Yunpeng Li,Xilu Wang,Yuanfeng Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2026; Under review
Abstract:Automatic tooth segmentation and identification from intra-oral scanned 3D models are fundamental problems in digital dentistry, yet most existing approaches rely on task-specific 3D neural networks trained with densely annotated datasets, resulting in high annotation cost and limited generalization to scans from unseen sources. Thus, we propose TSegAgent, which addresses these challenges by reformulating dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task. The key idea is to combine the representational capacity of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy. Instead of learning dental-specific features, the proposed framework leverages multi-view visual abstraction and geometry-grounded reasoning to infer tooth instances and identities without task-specific training. By explicitly encoding structural constraints such as dental arch organization and volumetric relationships, the method reduces uncertainty in ambiguous cases and mitigates overfitting to particular shape distributions. Experimental results demonstrate that this reasoning-oriented formulation enables accurate and reliable tooth segmentation and identification with low computational and annotation cost, while exhibiting strong generalization across diverse and previously unseen dental scans.
[CV-66] 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction CVPR2026
【速读】:该论文旨在解决3D高斯散射(3DGS)在重建高保真表面时的不足,尤其是在深度渲染精度方面的局限性。其解决方案的关键在于提出一种自约束先验(self-constrained prior),该先验基于当前3D高斯模型渲染出的深度图融合生成的TSDF(Truncated Signed Distance Function, 截断有符号距离场)网格构建,通过定义一个以估计表面为中心的带状区域,对3D高斯进行几何感知的约束:包括移除带外高斯、将高斯向表面移动、以及按几何结构调整不透明度。更重要的是,该先验可利用最新渲染的深度图像定期更新,并逐步缩小带宽以强化约束效果,从而提升表面重建质量与深度准确性。
链接: https://arxiv.org/abs/2603.19682
作者: Takeshi Noda,Yu-Shen Liu,Zhizhong Han
机构: Tsinghua University (清华大学); Wayne State University (韦恩州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL
Abstract:Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the learning of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is derived from a TSDF grid that is obtained by fusing the depth maps rendered with current 3D Gaussians. The prior measures a distance field around the estimated surface, offering a band centered at the surface for imposing more specific constraints on 3D Gaussians, such as removing Gaussians outside the band, moving Gaussians closer to the surface, and encouraging larger or smaller opacity in a geometry-aware manner. More importantly, our prior can be regularly updated by the most recent depth images which are usually more accurate and complete. In addition, the prior can also progressively narrow the band to tighten the imposed constraints. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.
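摘要描述了围绕估计表面的带状约束:剔除带外高斯、并将带内高斯拉向表面。以下是该思想在球面SDF上的一个示意性实现(与论文的TSDF融合流程无关,步长与带宽均为假设值):

```python
import numpy as np

def band_constrain(centers, sdf, band=0.1, step=0.5, eps=1e-4):
    """Band constraint around an estimated surface (illustrative sketch).

    Drop points whose |SDF| exceeds the band half-width, then pull the
    survivors toward the zero level set along the SDF gradient. `sdf`
    is any signed-distance function; the step rule is an assumption.
    """
    d = sdf(centers)
    keep = np.abs(d) <= band                 # remove Gaussians outside the band
    kept = centers[keep]
    # finite-difference gradient of the SDF (points along the surface normal)
    grad = np.stack([(sdf(kept + e) - sdf(kept - e)) / (2 * eps)
                     for e in np.eye(3) * eps], axis=-1)
    grad /= np.linalg.norm(grad, axis=-1, keepdims=True) + 1e-12
    moved = kept - step * d[keep, None] * grad   # move part-way to the surface
    return moved, keep

sphere = lambda x: np.linalg.norm(x, axis=-1) - 1.0   # toy SDF: unit sphere
pts = np.array([[1.05, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 0.98, 0.0]])
moved, keep = band_constrain(pts, sphere)
```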
[CV-67] Unbiased Dynamic Multimodal Fusion CVPR2026
【速读】:该论文旨在解决动态多模态学习中因假设模态质量静态不变而导致的适应性不足问题,以及现有方法在极端噪声条件下难以准确评估模态质量、且未考虑模态内在依赖偏差(modality reliance bias)所引发的硬学习模态双重惩罚问题。解决方案的关键在于提出无偏动态多模态学习(Unbiased Dynamic Multimodal Learning, UDML)框架:首先设计了一个噪声感知的不确定性估计器(noise-aware uncertainty estimator),通过向模态数据添加可控噪声并预测其强度,使模型能够建立特征退化与噪声水平之间的清晰映射,从而实现低噪声和高噪声场景下的精确不确定性度量;其次,利用模态丢弃(modality dropout)量化网络内部的模态依赖偏差,并将其引入加权机制,有效消除对难学模态的双重抑制效应,提升动态融合性能。
链接: https://arxiv.org/abs/2603.19681
作者: Shicai Wei,Kaijie Zhang,Luyi Chen,Tao He,Guiduo Duan
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026 Findings, 11 pages, 4 figures
Abstract:Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamical multimodal methods are proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure the modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the modality hard to learn would be doubly penalized, and the performance of dynamical fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at this https URL.
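摘要中的两个机制——以已知强度的噪声构造回归目标、以及用模态依赖偏差修正融合权重——可以用如下示意代码表达(函数与加权形式均为笔者假设,非论文官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noise_regression_batch(x, rng, sigma_max=1.0):
    """Corrupt inputs with noise of a known, randomly drawn intensity;
    the intensity itself is the regression target, so the estimator
    learns a feature-corruption -> noise-level mapping (sketch)."""
    sigma = rng.uniform(0.0, sigma_max, size=(x.shape[0], 1))
    x_noisy = x + sigma * rng.standard_normal(x.shape)
    return x_noisy, sigma

def debiased_weights(uncertainty, reliance_bias):
    """Softmax fusion weights with the dropout-measured reliance bias
    subtracted, so a hard-to-learn modality is not penalized twice
    (assumed form, not the paper's exact weighting)."""
    score = -np.asarray(uncertainty, float) - np.asarray(reliance_bias, float)
    e = np.exp(score - score.max())
    return e / e.sum()

x = rng.standard_normal((4, 8))
x_noisy, sigma = make_noise_regression_batch(x, rng)
# modality 0: low uncertainty but the network already over-relies on it
w = debiased_weights([0.2, 0.8], [0.5, 0.0])
```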
[CV-68] Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification CVPR2026
【速读】:该论文旨在解决终身行人重识别(Lifelong Person Re-Identification, LReID)中因仅依赖全局特征学习而导致细粒度属性知识利用不足、历史知识难以有效迁移以及新知识学习时遗忘严重的问题。解决方案的关键在于提出一种基于视觉-语言模型(Vision-Language Model, VLM)驱动的新型方法——视觉-语言属性解耦与强化(Vision-Language Attribute Disentanglement and Reinforcement, VLADR),其核心思想是显式建模跨域共享的人类通用属性,以增强域间知识迁移能力,并通过多粒度文本属性解耦机制挖掘图像的全局与局部语义属性,结合跨模态属性对齐和域间属性对齐策略,实现细粒度知识的有效提取与强化,从而显著提升模型在抗遗忘能力和泛化性能上的表现。
链接: https://arxiv.org/abs/2603.19678
作者: Kunlun Xu,Haotong Cheng,Jiangmeng Li,Xu Zou,Jiahuan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our source code is available at this https URL
[CV-69] ATHENA: Adaptive Test-Time Steering for Improving Count Fidelity in Diffusion Models
【速读】:该论文旨在解决文本到图像扩散模型在生成图像时对显式物体数量的控制能力不足的问题(即数值控制失效问题)。解决方案的关键在于提出一种模型无关、测试时自适应的引导框架ATHENA,其通过利用采样过程中的中间表示估计物体数量,并在去噪早期阶段施加计数感知的噪声修正,从而在结构错误难以修正前调整生成轨迹,显著提升图像中物体数量的准确性。
链接: https://arxiv.org/abs/2603.19676
作者: Mohammad Shahab Sepehri,Asal Mehradfar,Berk Tinaz,Salman Avestimehr,Mahdi Soltanolkotabi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Text-to-image diffusion models achieve high visual fidelity but surprisingly exhibit systematic failures in numerical control when prompts specify explicit object counts. To address this limitation, we introduce ATHENA, a model-agnostic, test-time adaptive steering framework that improves object count fidelity without modifying model architectures or requiring retraining. ATHENA leverages intermediate representations during sampling to estimate object counts and applies count-aware noise corrections early in the denoising process, steering the generation trajectory before structural errors become difficult to revise. We present three progressively more advanced variants of ATHENA that trade additional computation for improved numerical accuracy, ranging from static prompt-based steering to dynamically adjusted count-aware control. Experiments on established benchmarks and a new visually and semantically complex dataset show that ATHENA consistently improves count fidelity, particularly at higher target counts, while maintaining favorable accuracy-runtime trade-offs across multiple diffusion backbones.
[CV-70] DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
【速读】:该论文旨在解决当前自动驾驶系统中世界模型(world model)在轨迹条件下的场景演化建模能力不足的问题,现有方法通常依赖外观生成或确定性回归进行未来状态预测,难以捕捉不同驾驶动作下场景的动态变化,从而导致决策规划不可靠。其解决方案的关键在于提出一种基于流形变换(flow-based dynamics)的潜在世界模型 DynFlowDrive,通过引入修正流(rectified flow)形式化建模场景状态在不同驾驶动作下的演化速度场(velocity field),实现对潜在状态的渐进式预测;同时设计了一种稳定性感知的多模态轨迹选择策略,依据诱导场景转换的稳定性评估候选轨迹,从而提升规划可靠性。
链接: https://arxiv.org/abs/2603.19675
作者: Xiaolu Liu,Yicong Li,Song Wang,Junbo Chen,Angela Yao,Jianke Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 6 figs
Abstract:Recently, world models have been incorporated into the autonomous driving systems to improve the planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectified-flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at this https URL.
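摘要采用的修正流(rectified flow)构造可以用几行代码说明:状态沿线性桥插值,目标速度场即恒定位移 x1 - x0;当速度场精确时,欧拉积分能完整复原终点(通用示意,非DynFlowDrive的实际损失):

```python
import numpy as np

def rf_pair(x0, x1, t):
    """Linear bridge x_t = (1-t)*x0 + t*x1 and its rectified-flow
    velocity target, the constant displacement x1 - x0."""
    return (1.0 - t) * x0 + t * x1, x1 - x0

def euler_rollout(v_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = np.array(x0, float), 1.0 / steps
    for i in range(steps):
        x = x + dt * v_fn(x, i * dt)
    return x

x0 = np.zeros(3)
x1 = np.array([1.0, -2.0, 0.5])
xt, v = rf_pair(x0, x1, 0.25)
x_end = euler_rollout(lambda x, t: x1 - x0, x0)  # exact field recovers x1
```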
[CV-71] Making Video Models Adhere to User Intent with Minor Adjustments
【速读】:该论文旨在解决文本到视频扩散模型(text-to-video diffusion models)在生成过程中对用户提供的控制输入(如边界框或布局)难以精确遵循的问题。其关键解决方案在于通过微调用户指定的边界框,使其更契合模型内部的注意力机制,从而提升生成质量和控制一致性。具体而言,作者提出使用一个平滑掩码(smooth mask)使边界框位置具有可微性,并设计了一个基于注意力最大化的优化目标,以调整边界框至模型熟悉的位置;实验表明,即使微小的调整也能显著改善生成效果,且该方法在用户研究中得到验证。
链接: https://arxiv.org/abs/2603.19672
作者: Daniel Ajisafe,Eric Hedlin,Helge Rhodin,Kwang Moo Yi
机构: The University of British Columbia (不列颠哥伦比亚大学); Bielefeld University (比勒费尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and code: this https URL
Abstract:With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.
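摘要提出用平滑掩码使边界框位置可微。下面用四个sigmoid半平面的乘积构造一个软矩形掩码作为示意(参数化方式与温度 tau 均为假设):

```python
import numpy as np

def smooth_box_mask(h, w, box, tau=2.0):
    """Differentiable soft mask for a box (x1, y1, x2, y2), built as a
    product of four sigmoid half-planes so gradients flow back to the
    box coordinates. Parameterization and tau are assumptions."""
    x1, y1, x2, y2 = box
    sig = lambda z: 1.0 / (1.0 + np.exp(-z / tau))
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    return sig(xs - x1) * sig(x2 - xs) * sig(ys - y1) * sig(y2 - ys)

m = smooth_box_mask(64, 64, (16, 16, 48, 48))
```

掩码在框内接近1、框外接近0,但处处平滑,因此可以用注意力最大化目标对框坐标求梯度。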
[CV-72] Toward High-Fidelity Visual Reconstruction: From EEG-Based Conditioned Generation to Joint-Modal Guided Rebuilding
【速读】:该论文旨在解决当前基于脑电图(EEG)的视觉重建方法中因深度依赖对齐框架而导致的细节信息丢失问题,即现有方法通常强制将EEG特征与文本或图像语义表示对齐,从而削弱了EEG中蕴含的丰富空间关系和色彩细节,仅实现条件图像生成而非高保真视觉重建。其解决方案的关键在于提出一种新的联合模态视觉重建(Joint-Modal Visual Reconstruction, JMVR)框架,该框架将EEG与文本视为独立模态进行联合学习,以保留EEG特有的信息;同时引入多尺度EEG编码策略以捕获粗粒度与细粒度特征,并结合图像增强技术提升感知细节的恢复能力。
链接: https://arxiv.org/abs/2603.19667
作者: Zhijian Gong,Tianren Yao,Wenjia Dong,Xueyuan Xu
机构: Beijing University of Technology (北京工业大学); Beijing Key Laboratory of Computational Intelligence and Intelligent System (北京市计算智能与智能系统重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Human visual reconstruction aims to reconstruct fine-grained visual stimuli based on subject-provided descriptions and corresponding neural signals. As a widely adopted modality, Electroencephalography (EEG) captures rich visual cognition information, encompassing complex spatial relationships and chromatic details within scenes. However, current approaches are deeply coupled with an alignment framework that forces EEG features to align with text or image semantic representation. The dependency may condense the rich spatial and chromatic details in EEG that achieved mere conditioned image generation rather than high-fidelity visual reconstruction. To address this limitation, we propose a novel Joint-Modal Visual Reconstruction (JMVR) framework. It treats EEG and text as independent modalities for joint learning to preserve EEG-specific information for reconstruction. It further employs a multi-scale EEG encoding strategy to capture both fine- and coarse-grained features, alongside image augmentation to enhance the recovery of perceptual details. Extensive experiments on the THINGS-EEG dataset demonstrate that JMVR achieves SOTA performance against six baseline methods, specifically exhibiting superior capabilities in modeling spatial structure and chromatic fidelity.
[CV-73] Semantic Audio-Visual Navigation in Continuous Environments CVPR2026
【速读】:该论文旨在解决现有音频-视觉导航(Audio-Visual Navigation, AVN)方法在连续环境中因依赖预计算的房间脉冲响应(Room Impulse Response, RIR)而导致的空间离散性与感知不连续问题,以及目标声音间歇性中断时代理失去目标信息的挑战。解决方案的关键在于提出MAGNet模型——一个基于多模态Transformer的架构,通过联合编码空间与语义目标表示,并融合历史上下文与自身运动线索,实现增强记忆的目标推理能力,从而支持代理在3D连续环境中进行更鲁棒的导航。
链接: https://arxiv.org/abs/2603.19660
作者: Yichen Zeng,Hebaixu Wang,Meng Liu,Yu Zhou,Chen Gao,Kehan Chen,Gongping Huang
机构: Wuhan University (武汉大学); Zhongguancun Academy (中关村科学院); Shandong Jianzhu University (山东建筑大学); Nankai University (南开大学); Tsinghua University (清华大学); CASIA (中国科学院自动化研究所); UCAS (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to CVPR 2026
Abstract:Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at this https URL.
[CV-74] CS-MUNet: A Channel-Spatial Dual-Stream Mamba Network for Multi-Organ Segmentation
【速读】:该论文旨在解决当前基于状态空间模型(State Space Model, SSM)的腹部多器官分割方法中存在两个关键问题:一是缺乏跨通道解剖语义协作建模,二是未显式引入边界感知特征融合机制。解决方案的核心在于提出CS-MUNet架构,其包含两个专为任务设计的模块:其一为边界感知状态Mamba模块(Boundary-Aware State Mamba),通过贝叶斯注意力框架生成像素级边界后验图,并直接注入Mamba核心扫描参数中,将边界信息嵌入SSM状态转移机制,同时利用双分支权重分配实现全局与局部结构表征的互补调制;其二为通道状态聚合模块(Channel Mamba State Aggregation),将通道维度重新定义为SSM序列维度,以数据驱动方式显式建模跨通道解剖语义协同关系。实验表明,该方法在两个公开基准上均显著优于现有最优方法,确立了一种联合建模通道语义协同与边界感知特征融合的新SSM建模范式。
链接: https://arxiv.org/abs/2603.19659
作者: Yuyang Zheng,Mingda Zhang,Jianglong Qin,Qi Mo,Jingdan Pan,Haozhe Hu,Hongyi Huang
机构: Yunnan University (云南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 5 figures
Abstract:Recently Mamba-based methods have shown promise in abdominal organ segmentation. However, existing approaches neglect cross-channel anatomical semantic collaboration and lack explicit boundary-aware feature fusion mechanisms. To address these limitations, we propose CS-MUNet with two purpose-built modules. The Boundary-Aware State Mamba module employs a Bayesian-attention framework to generate pixel-level boundary posterior maps, injected directly into Mamba’s core scan parameters to embed boundary awareness into the SSM state transition mechanism, while dual-branch weight allocation enables complementary modulation between global and local structural representations. The Channel Mamba State Aggregation module redefines the channel dimension as the SSM sequence dimension to explicitly model cross-channel anatomical semantic collaboration in a data-driven manner. Experiments on two public benchmarks demonstrate that CS-MUNet consistently outperforms state-of-the-art methods across multiple metrics, establishing a new SSM modeling paradigm that jointly addresses channel semantic collaboration and boundary-aware feature fusion for abdominal multi-organ segmentation.
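摘要将通道维重定义为SSM的序列维。下面用一个固定衰减的线性递推示意这种"跨通道扫描"(真实的Mamba扫描参数是可学习的,此处仅演示维度重排与状态传递):

```python
import numpy as np

def channel_scan(x, decay=0.9):
    """Treat the channel axis as the SSM sequence axis (sketch): each
    channel map becomes one token and a linear recurrence
    h_c = decay * h_{c-1} + x_c carries state across channels. A real
    Mamba scan learns these parameters; here they are fixed."""
    b, c, h, w = x.shape
    tokens = x.reshape(b, c, h * w)        # channel axis -> sequence axis
    out = np.zeros_like(tokens)
    state = np.zeros((b, h * w))
    for i in range(c):
        state = decay * state + tokens[:, i]
        out[:, i] = state
    return out.reshape(b, c, h, w)

x = np.ones((1, 3, 2, 2))
y = channel_scan(x)
```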
[CV-75] GravCal: Single-Image Calibration of IMU Gravity Priors with Per-Sample Confidence
【速读】:该论文旨在解决在存在线性加速度、振动和瞬时运动等干扰条件下,惯性测量单元(IMU)提供的重力先验(gravity prior)不可靠的问题,尤其是在单张RGB图像下如何校准噪声严重的重力方向。其关键解决方案是提出GravCal——一个前馈模型,通过融合两种互补的预测:一是对输入重力先验的残差修正,二是不依赖先验的图像独立估计,并利用一个可学习的门控机制(learned gate)自适应地融合二者,从而实现更鲁棒的重力方向校正与置信度评分。实验表明,该方法将平均角度误差从IMU原始先验的22.02°显著降低至14.24°,尤其在先验严重失真时提升更为明显。
链接: https://arxiv.org/abs/2603.19654
作者: Haichao Zhu,Qian Zhang
机构: UC Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures
Abstract:Gravity estimation is fundamental to visual-inertial perception, augmented reality, and robotics, yet gravity priors from IMUs are often unreliable under linear acceleration, vibration, and transient motion. Existing methods often estimate gravity directly from images or assume reasonably accurate inertial input, leaving the practical problem of correcting a noisy gravity prior from a single image largely unaddressed. We present GravCal, a feedforward model for single-image gravity prior calibration. Given one RGB image and a noisy gravity prior, GravCal predicts a corrected gravity direction and a per-sample confidence score. The model combines two complementary predictions, including a residual correction of the input prior and a prior-independent image estimate, and uses a learned gate to fuse them adaptively. Extensive experiments show strong gains over raw inertial priors: GravCal reduces mean angular error from 22.02° (IMU prior) to 14.24°, with larger improvements when the prior is severely corrupted. We also introduce a novel dataset of over 148K frames with paired VIO-derived ground-truth gravity and Mahony-filter IMU priors across diverse scenes and arbitrary camera orientations. The learned gate also correlates with prior quality, making it a useful confidence signal for downstream systems.
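摘要中的门控融合可以示意如下:残差分支修正噪声先验、图像分支独立预测,门控值在两者间插值后重新归一化为单位方向(参数化形式为笔者假设,非论文官方实现):

```python
import numpy as np

def fuse_gravity(prior_dir, residual, image_est, gate):
    """Gated fusion of two gravity predictions (assumed form): a
    residual correction of the noisy prior and a prior-independent
    image estimate, blended by a learned gate in [0, 1] and
    re-normalized to a unit direction."""
    corrected = prior_dir + residual             # residual branch
    g = gate * corrected + (1.0 - gate) * image_est
    return g / np.linalg.norm(g)

down = np.array([0.0, -1.0, 0.0])                # true gravity direction
prior = np.array([0.1, -0.99, 0.1])
prior = prior / np.linalg.norm(prior)            # noisy IMU prior
fused = fuse_gravity(prior, np.array([0.0, -0.01, -0.05]),
                     np.array([0.0, -1.0, 0.0]), gate=0.3)
```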
[CV-76] OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)与虚拟脱衣(Virtual Try-Off, VTOFF)技术中存在的细粒度细节保留不足、复杂场景泛化能力弱、流程冗杂以及推理效率低等问题。其核心解决方案是提出OmniDiT框架,基于扩散Transformer(Diffusion Transformer)将试衣与脱衣任务统一建模;关键创新包括:1)构建自进化数据整理流水线以持续生成高质量数据,并建立包含380k样本的Omni-TryOn数据集;2)通过标记拼接和自适应位置编码有效融合多参考条件;3)首次在扩散模型中引入移位窗口注意力机制(Shifted Window Attention),实现线性计算复杂度;4)采用多时间步预测与对齐损失缓解局部窗口注意力导致的性能下降,从而提升生成保真度。
链接: https://arxiv.org/abs/2603.19643
作者: Weixuan Zeng,Pengcheng Wei,Huaiqing Wang,Boheng Zhang,Jia Sun,Dewen Fan,Lin HE,Long Chen,Qianqian Gan,Fan Yang,Tingting Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods face challenges with fine-grained detail preservation, generalization to complex scenes, complicated pipeline, and efficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines try-on and try-off tasks into one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset Omni-TryOn, which contains over 380k diverse and high-quality garment-model-tryon image pairs and detailed text prompts. Then, we employ the token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, thus achieving a linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple timestep prediction and an alignment loss to improve generation fidelity. Experiments reveal that, under various complex scenes, our method achieves the best performance in both the model-free VTON and VTOFF tasks and a performance comparable to current SOTA methods in the model-based VTON task.
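摘要称首次在扩散模型中引入移位窗口注意力,其核心的窗口划分与移位操作如下(Swin风格的通用示意,OmniDiT的实际布局可能不同):

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) map into non-overlapping win*win windows,
    giving (num_windows, win*win, C) token groups for local attention."""
    h, w, c = x.shape
    x = x.reshape(h // win, win, w // win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, c)

def shift_then_partition(x, win):
    """Swin-style shift: roll the map by win//2 before partitioning so
    information crosses window borders on alternating blocks."""
    shifted = np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))
    return window_partition(shifted, win)

x = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
wins = window_partition(x, 4)
swins = shift_then_partition(x, 4)
```

注意力只在每个 win×win 窗口内计算,代价为 O(N·win²) 而非 O(N²),这正是摘要所说线性复杂度的来源。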
[CV-77] UniBioTransfer: A Unified Framework for Multiple Biometrics Transfer
【速读】:该论文旨在解决深度人脸生成(Deepface Generation)中因任务特定模型导致的泛化能力差与扩展性受限的问题,尤其是面对多任务场景时数据稀缺和跨任务干扰带来的挑战。其核心解决方案是提出UniBioTransfer框架,关键在于两个创新:一是通过基于交换的污染机制构建统一的数据策略,有效缓解空间动态属性(如头发)的样本不足问题;二是设计生物混合专家模型(BioMoE),结合两阶段训练策略,实现任务特异性知识的有效解耦,从而在单次推理中支持多种人脸生成任务(包括常规任务与形变类任务),并具备对未见任务(如嘴唇、眼睛、眼镜迁移)的零样本迁移能力。
链接: https://arxiv.org/abs/2603.19637
作者: Caiyi Sun,Yujing Sun,Xiangyu Li,Yuhang Zheng,Yiming Ren,Jiamin Wang,Yuexin Ma,Siu-Ming Yiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deepface generation has traditionally followed a task-driven paradigm, where distinct tasks (e.g., face transfer and hair transfer) are addressed by task-specific models. Nevertheless, this single-task setting severely limits model generalization and scalability. A unified model capable of solving multiple deepface generation tasks in a single pass represents a promising and practical direction, yet remains challenging due to data scarcity and cross-task conflicts arising from heterogeneous attribute transformations. To this end, we propose UniBioTransfer, the first unified framework capable of handling both conventional deepface tasks (e.g., face transfer and face reenactment) and shape-varying transformations (e.g., hair transfer and head transfer). Besides, UniBioTransfer naturally generalizes to unseen tasks, like lip, eye, and glasses transfer, with minimal fine-tuning. Generally, UniBioTransfer addresses data insufficiency in multi-task generation through a unified data construction strategy, including a swapping-based corruption mechanism designed for spatially dynamic attributes like hair. It further mitigates cross-task interference via an innovative BioMoE, a mixture-of-experts based model coupled with a novel two-stage training strategy that effectively disentangles task-specific knowledge. Extensive experiments demonstrate the effectiveness, generalization, and scalability of UniBioTransfer, outperforming both existing unified models and task-specific methods across a wide range of deepface generation tasks. Project page is at this https URL
[CV-78] Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking
【速读】:该论文旨在解决无人机(UAV)在夜间复杂环境下跟踪性能下降的问题,核心挑战在于现有特征编码方法忽视了光照和视角等关键线索,导致目标外观与运动感知不鲁棒。解决方案的关键是提出一种双提示驱动的特征编码方法(DPTracker),通过两个创新模块实现:一是金字塔光照提示器(pyramid illumination prompter),提取多尺度频域感知的光照提示以增强对夜间低光照条件的适应性;二是动态视角提示器(dynamic viewpoint prompter),通过调节可变形卷积偏移量来适配不同视角变化,从而学习视角不变特征。该方法有效提升了特征编码的域不变性,显著改善了夜间场景下的跟踪鲁棒性。
链接: https://arxiv.org/abs/2603.19628
作者: Yiheng Wang,Changhong Fu,Liangliang Yao,Haobo Zuo,Zijie Zhang
机构: Duke University (杜克大学); Tongji University (同济大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE International Conference on Robotics and Automation 2026
Abstract:Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at this https URL.
[CV-79] IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment
【速读】:该论文旨在解决相对位姿估计(Relative Pose Estimation)中现有方法面临的准确性与效率之间的权衡问题:传统基于特征匹配的流水线虽精度高,但因不可微的RANSAC算法阻断梯度传播;而基于视觉Transformer(ViT)的回归器虽支持端到端训练,却因计算复杂度高难以实现实时部署。其解决方案的关键在于提出一种几何驱动的解耦迭代框架IUP-Pose,通过轻量级多头双向交叉注意力(MHBC)模块实现隐式密集对齐,避免显式匹配监督;同时采用旋转-平移解耦处理流程,利用旋转单应变换(H_inf)在每轮迭代中重对齐特征图,从而提升精度并保持高效性,最终在MegaDepth1500数据集上达到73.3% AUC@20°,且具备70 FPS的推理速度和仅37M参数的紧凑模型规模。
链接: https://arxiv.org/abs/2603.19625
作者: Jun Wang,Xiaoyan Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
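摘要利用旋转单应 H_inf = K R K⁻¹ 在预测平移前重对齐特征图:纯旋转下它与深度无关地把一个视图的像素映射到另一视图。示意如下(相机内参为假设值):

```python
import numpy as np

def rotation_homography(K, R):
    """Infinite homography H_inf = K @ R @ inv(K): for a pure camera
    rotation it maps pixels between views independently of depth."""
    return K @ R @ np.linalg.inv(K)

def warp_pixel(H, uv):
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
H_id = rotation_homography(K, np.eye(3))
uv_same = warp_pixel(H_id, (100.0, 50.0))      # identity rotation: unchanged

theta = 0.01                                   # small yaw rotation
R_yaw = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
uv_yaw = warp_pixel(rotation_homography(K, R_yaw), (320.0, 240.0))
```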
[CV-80] Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement CVPR2026
【速读】:该论文旨在解决多模态图像配准中的两个关键问题:一是现有方法在特征解耦过程中对共享特征空间的正则化不足,导致模态私有信息泄露至共享空间;二是多数多尺度框架仅支持单一变换类型,难以同时处理全局刚性偏移与局部形变。解决方案的关键在于提出HRNet(Hybrid Registration Network),其核心是将表示解耦与混合参数预测相耦合:首先通过带模态特定批归一化(Modality-Specific Batch Normalization, MSBN)的共享主干网络提取多尺度特征,并利用跨尺度解耦与自适应投影(Cross-scale Disentanglement and Adaptive Projection, CDAP)模块抑制模态私有线索、稳定共享特征空间;进而在此基础上,由混合参数预测模块(Hybrid Parameter Prediction Module, HPPM)实现非迭代式粗到精的全局刚性参数与变形场估计,并融合为一致的形变场,从而统一处理复杂场景下的多模态配准任务。
链接: https://arxiv.org/abs/2603.19623
作者: Chunlei Zhang,Jiahao Xia,Yun Xiao,Bo Jiang,Jian Zhang
机构: University of Technology Sydney (悉尼科技大学); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 main track
Abstract:Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.
[CV-81] UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
【速读】:该论文旨在解决现有机器人领域中从图像到仿真(real-to-sim)迁移任务中存在的模块化流水线效率低下与误差累积问题。传统方法依赖检测、分割、形状重建和位姿估计等多个子模块,各阶段仅利用局部或逐步细化的信息,忽略全局上下文,导致性能受限。其解决方案的关键在于提出UniPR——首个端到端的对象级感知与重建框架,该框架直接处理单对立体图像,通过几何约束消除尺度模糊性,并引入位姿感知的形状表示(Pose-Aware Shape Representation),无需按类别定义标准姿态即可统一重建与位姿估计任务,从而实现高效、准确且物理比例真实的对象重建。
链接: https://arxiv.org/abs/2603.19616
作者: Chuanrui Zhang,Yingshuang Zou,ZhengXian Wu,Yonggen Ling,Yuxiao Yang,Ziwei Wang
机构: Tencent Robotics X (腾讯机器人实验室); Futian Laboratory (福田实验室); NTU (南洋理工大学); HKUST (香港科技大学); THU (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserves true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
[CV-82] OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis
【速读】:该论文旨在解决单视角输入下新颖视图合成(Novel View Synthesis, NVS)中存在的几何与外观一致性不足的问题,尤其在未观测区域难以生成合理图像。其核心解决方案是将NVS任务重新建模为轨道视频生成任务,并基于预训练视频生成模型进行适配:通过引入相机适配器(camera adapters)实现精确的相机控制;设计法向量图生成分支并利用法向量特征通过注意力机制引导目标视图合成,从而提升几何一致性;同时采用像素空间监督以缓解潜在空间压缩导致的模糊外观问题。该方法在GSO和OmniObject3D基准上显著优于现有方法,尤其在单视图设置下表现突出。
链接: https://arxiv.org/abs/2603.19613
作者: Jinglin Liang,Zijian Zhou,Rui Huang,Shuangping Huang,Yichen Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 10 figures
Abstract:Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometry- and appearance-consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via attention mechanism, thereby improving geometric consistency. Moreover, we apply a pixel-space supervision to alleviate blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
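The reported gains are in PSNR (+2.9 dB / +2.4 dB). For reference, a minimal PSNR implementation over flattened pixel values in [0, 1]; the sample values are arbitrary.

```python
import math

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio: PSNR = 10 * log10(peak^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return 10.0 * math.log10(peak ** 2 / mse)

ref = [0.0, 0.5, 1.0, 0.25]
test = [0.1, 0.5, 0.9, 0.25]
print(round(psnr(ref, test), 2))  # 23.01
```

Because the scale is logarithmic, a +2.9 dB improvement corresponds to roughly halving the mean squared error.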
[CV-83] ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
【速读】:该论文旨在解决当前视频大模型(Video-LLMs)在视频理解任务中因视频token数量庞大而导致的自回归解码效率低下问题。现有视觉token剪枝方法虽能缓解这一瓶颈,但仍存在信息丢失和加速比有限的问题。其解决方案的关键在于提出一种无需训练的“先草稿再验证”推测解码框架ParallelVLM,该框架包含两个并行化阶段以最大化硬件利用率,并引入无偏验证器引导的剪枝策略,通过消除注意力机制引导剪枝中的位置偏差,更好地对齐草稿模型与目标模型,从而显著扩展草稿窗口并提升解码速度,在多个视频理解基准测试中实现了最高达3.36倍的加速效果。
链接: https://arxiv.org/abs/2603.19610
作者: Quan Kong,Yuhao Shen,Yicheng Ji,Huan Li,Cong Wang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both the mutual-waiting and limited-speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by 1.6-1.8× with high accepted lengths, and accelerates various video understanding benchmarks by 3.36× on LLaVA-Onevision-72B and 2.42× on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
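ParallelVLM builds on draft-then-verify speculative decoding. The sketch below shows only the generic greedy variant of that loop with toy integer-token "models"; the paper's parallelized stages and verifier-guided pruning are not modeled.

```python
def speculative_decode(target, draft, prefix, n_new, k=4):
    """Generic draft-then-verify loop (greedy variant): the cheap draft
    model proposes k tokens per round, the target model keeps the
    longest prefix it agrees with and contributes one token of its own."""
    out = list(prefix)
    while len(out) < len(prefix) + n_new:
        proposal, ctx = [], list(out)
        for _ in range(k):                 # cheap draft pass
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        accepted, ctx = 0, list(out)
        for t in proposal:                 # verified in parallel in practice
            if target(ctx) == t:
                accepted += 1
                ctx.append(t)
            else:
                break
        out += proposal[:accepted]
        if len(out) < len(prefix) + n_new:
            out.append(target(out))        # target's own corrective token
    return out[:len(prefix) + n_new]

# Toy models over integer tokens: the target doubles the last token mod
# 10; the draft agrees except when the last token is 6.
target = lambda seq: (seq[-1] * 2) % 10
draft = lambda seq: 0 if seq[-1] == 6 else (seq[-1] * 2) % 10
print(speculative_decode(target, draft, [3], 5))  # [3, 6, 2, 4, 8, 6]
```

The output matches greedy decoding with the target alone; the speedup comes from the target verifying several drafted tokens per forward pass instead of emitting one.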
[CV-84] LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
【速读】:该论文旨在解决密集城市环境中航空视觉定位(aerial visual localization)的两个关键问题:一是现有方法在跨场景迁移时泛化能力差,二是面对建筑密集区域时定位失败率高。解决方案的关键在于两项创新:其一,构建了目前最大规模的实例分割数据集InsLoD-Loc,包含10万张带有精确建筑实例标注的航空图像,显著提升了模型的零样本泛化能力;其二,将定位范式从语义轮廓对齐重构为实例轮廓对齐,有效降低了密集场景下的位姿估计歧义性。实验表明,LoD-Loc v3在跨场景和密集城市场景中均显著优于现有最先进方法。
链接: https://arxiv.org/abs/2603.19609
作者: Shuaibang Peng,Juelin Zhu,Xia Li,Kun Yang,Maojun Zhang,Yu Liu,Shen Yan
机构: National University of Defense Technology (国防科技大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios with a large margin. The project is available at this https URL.
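Instance silhouette alignment ultimately scores how well a rendered building silhouette matches a predicted instance mask for each pose hypothesis. A minimal sketch of such scoring with IoU over pixel sets; the candidate poses and masks are invented, and the paper's actual alignment objective may differ.

```python
def mask_iou(a, b):
    """IoU of two binary masks represented as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def best_pose(candidate_silhouettes, predicted_mask):
    """Pick the pose hypothesis whose rendered silhouette best aligns
    with the predicted instance mask."""
    return max(candidate_silhouettes,
               key=lambda p: mask_iou(candidate_silhouettes[p], predicted_mask))

pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
cands = {
    "pose_a": {(0, 0), (0, 1)},                    # IoU 0.5
    "pose_b": {(0, 0), (0, 1), (1, 0), (1, 1)},    # IoU 1.0
    "pose_c": {(2, 2)},                            # IoU 0.0
}
print(best_pose(cands, pred))  # pose_b
```

Scoring per building instance, rather than over a single semantic silhouette, is what reduces ambiguity when many similar buildings overlap in the view.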
[CV-85] FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
【速读】:该论文旨在解决工业和医疗场景中细粒度异常检测的难题,特别是在标注异常样本稀缺的情况下实现零样本(zero-shot)检测。其核心挑战在于视觉-语言模型(如CLIP)存在前景-背景特征纠缠及文本语义粗粒度的问题。解决方案的关键在于提出FB-CLIP框架,通过多策略文本表征增强和前景-背景分离机制来提升异常定位精度:在文本模态上融合End-of-Text特征、全局池化表示与注意力加权token特征以获取更丰富的语义线索;在视觉模态上采用多视角软分离(沿身份、语义和空间维度)并结合背景抑制策略,降低干扰并增强判别能力;同时引入语义一致性正则化(Semantic Consistency Regularization, SCR),对齐图像特征与正常/异常文本原型,抑制不确定匹配并扩大语义差距。实验表明,FB-CLIP能够在零样本设置下有效区分异常与复杂背景,实现高精度的细粒度异常检测与定位。
链接: https://arxiv.org/abs/2603.19608
作者: Ming Hu,Yongsheng Huo,Mingyu Dou,Jianfu Yin,Peng Zhao,Yao Wang,Cong Hu,Bingliang Hu,Quan Wang
机构: Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences (中国科学院西安光学精密机械研究所); University of Chinese Academy of Sciences (中国科学院大学); Xi’an Jiaotong University (西安交通大学); Zhongnan Hospital of Wuhan University (武汉大学中南医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
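CLIP-style zero-shot anomaly detection scores each image patch by its similarity to "normal" vs. "abnormal" text prototypes. A minimal sketch of that scoring rule (temperature-scaled softmax over cosine similarities); the feature vectors and temperature are illustrative, and FB-CLIP's multi-strategy text features and foreground-background separation are not modeled.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def anomaly_score(patch_feat, normal_proto, abnormal_proto, tau=0.07):
    """Softmax over cosine similarity to the two text prototypes;
    returns P(abnormal) for one patch."""
    s_n = cosine(patch_feat, normal_proto) / tau
    s_a = cosine(patch_feat, abnormal_proto) / tau
    m = max(s_n, s_a)
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)

normal = [1.0, 0.0]
abnormal = [0.0, 1.0]
print(anomaly_score([0.9, 0.1], normal, abnormal) < 0.5)  # True: looks normal
print(anomaly_score([0.1, 0.9], normal, abnormal) > 0.5)  # True: anomalous
```

Richer text prototypes (the paper combines End-of-Text, global-pooled, and attention-weighted token features) slot into this rule without changing its form.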
[CV-86] Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning
【速读】:该论文旨在解决当前视频生成模型在物理真实性方面缺乏系统性评估的问题,尤其是如何精准识别和诊断生成视频中违反现实物理规律的动态行为。现有方法多依赖自动化指标或粗粒度的人类判断,难以揭示生成内容在何种场景下、因何原因违背物理约束。其解决方案的关键在于提出Physion-Eval——一个大规模专家推理基准,包含10,990条专家标注的推理轨迹,覆盖22种细粒度物理类别,每个生成视频均基于真实参考视频,并配有时间定位的异常标记、结构化失败分类及自然语言解释。通过该基准,研究发现83.3%(外视角)和93.5%(第一人称视角)的生成视频存在可被人类识别的物理错误,从而为物理约束驱动的视频生成模型开发提供了新标准与方向。
链接: https://arxiv.org/abs/2603.19607
作者: Qin Zhang,Peiyu Jing,Hong-Xing Yu,Fangqiang Ding,Fan Nie,Weimin Wang,Yilun Du,James Zou,Jiajun Wu,Bing Shuai
机构: Physion Labs(Physion实验室); Stanford University (斯坦福大学); MIT (麻省理工学院); Harvard University (哈佛大学); Character AI (Character AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at this https URL.
[CV-87] Beyond Quadratic: Linear-Time Change Detection with RWKV
【速读】:该论文旨在解决遥感变化检测中现有方法在效率与全局上下文建模之间的权衡问题:卷积神经网络(CNN)虽计算高效但缺乏长距离依赖建模能力,而Transformer虽能捕捉全局信息却存在计算成本过高问题。解决方案的关键在于提出ChangeRWKV架构,其核心创新包括一个分层RWKV编码器以构建多尺度特征表示,以及一种新颖的空间-时间融合模块(Spatial-Temporal Fusion Module, STFM),用于缓解不同尺度间的空间错位并提取细粒度的时间差异。该方法在保持线性推理时间的同时实现了类Transformer的强表达能力,显著优于当前最优模型,在LEVIR-CD数据集上达到85.46% IoU和92.16% F1分数,且参数量和浮点运算次数(FLOPs)大幅降低。
链接: https://arxiv.org/abs/2603.19606
作者: Zhenyu Yang,Gensheng Pei,Tao Chen,Xia Yuan,Haofeng Zhang,Xiangbo Shu,Yazhou Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. Our approach core features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representation, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection. Our code and model are publicly available.
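At its simplest, bi-temporal change detection fuses feature differences across scales; the paper's STFM additionally resolves spatial misalignment, which is not modeled here. A naive difference-and-fuse baseline, with nearest-neighbour upsampling via `np.kron`:

```python
import numpy as np

def temporal_difference_fusion(feats_t1, feats_t2):
    """Take the absolute bi-temporal difference at each scale, upsample
    coarse maps to the finest grid by nearest-neighbour repetition,
    and sum."""
    target = feats_t1[0].shape[0]
    fused = np.zeros((target, target))
    for f1, f2 in zip(feats_t1, feats_t2):
        diff = np.abs(f1 - f2)
        rep = target // f1.shape[0]
        fused += np.kron(diff, np.ones((rep, rep)))
    return fused

fine_t1, fine_t2 = np.zeros((4, 4)), np.zeros((4, 4))
fine_t2[0, 0] = 1.0                              # a change at fine scale
coarse_t1, coarse_t2 = np.zeros((2, 2)), 0.5 * np.ones((2, 2))
fused = temporal_difference_fusion([fine_t1, coarse_t1], [fine_t2, coarse_t2])
print(fused[0, 0])  # 1.5  (fine diff 1.0 + coarse diff 0.5)
```

The RWKV encoder's role in the paper is to supply these multi-resolution features in linear time; the fusion step above is the simplest possible stand-in for STFM.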
[CV-88] K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups
【速读】:该论文旨在解决视觉任务中非平稳协方差矩阵(non-stationary covariance matrices)在线追踪的难题,现有方法要么忽略流形约束,要么依赖一阶更新策略,在快速变化场景下不可避免地产生相位滞后。其解决方案的关键在于提出K-GMRF框架,将协方差追踪问题重新建模为李群(Lie groups)上的受迫刚体运动,基于欧拉-庞加莱方程(Euler-Poincaré equations)推导出二阶动力学系统,将观测视为施加在隐式角速度上的力矩,并通过保持结构的辛积分器(symplectic integrator)进行传播;理论证明该方法在恒定旋转条件下可实现零稳态误差,显著优于一阶基线方法的比例滞后特性,从而在多个真实与合成数据集上展现出高精度和鲁棒性。
链接: https://arxiv.org/abs/2603.19601
作者: ZhiMing Li
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 33 pages, 13 figures
Abstract:Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that this second-order dynamics achieves zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.
[CV-89] FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
【速读】:该论文旨在解决场景生成中同时实现高保真度、细粒度对象控制以及场景级风格一致性的问题。现有方法如语言驱动的检索方法虽能组合合理场景,但缺乏对物体层级的精确控制且难以保证整体风格统一;而基于图结构的方法虽提升了对象间关系建模能力,却在纹理生成质量上表现不足,限制了实际应用。其解决方案的关键在于提出FlowScene——一个以多模态图为条件的三分支生成模型,通过紧密耦合的修正流(rectified flow)模型在生成过程中交换对象信息,实现布局、形状与纹理的协同推理,从而在保持结构与外观一致性的前提下,提供对物体形态、材质及相互关系的精细调控能力。
链接: https://arxiv.org/abs/2603.19598
作者: Zhifei Yang,Guangyao Zhai,Keyang Lu,YuYang Yin,Chao Zhang,Zhen Xiao,Jieyi Long,Nassir Navab,Yikai Wang
机构: Peking University (北京大学); Technical University of Munich (慕尼黑工业大学); Beijing Jiaotong University (北京交通大学); Beijing Digital Native Digital City Research Center (北京数字原生数字城市研究中心); Theta Labs, Inc. (Theta Labs 公司); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects’ shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
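FlowScene's three branches are rectified flow models. The sampling side of rectified flow is plain Euler integration of a learned velocity field from t=0 to t=1; for the ideal straight-line coupling the velocity is the constant displacement, so even one step is exact. A scalar sketch under that idealization:

```python
def rectified_flow_sample(x0, velocity, steps=4):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1, the
    sampling rule of rectified flow."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

target, start = 3.0, 0.0
v = lambda x, t: target - start   # ideal straight-line velocity field
print(rectified_flow_sample(start, v, steps=1))  # 3.0
print(rectified_flow_sample(start, v, steps=4))  # 3.0
```

In the paper, three such flows (layout, shape, texture) run jointly and exchange per-object information at each step; the integration rule itself is unchanged.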
[CV-90] MagicSeg: Open-World Segmentation Pretraining via Counterfactual Diffusion-Based Auto-Generation
【速读】:该论文旨在解决开放世界语义分割(open-world semantic segmentation)中高质量、细粒度像素级标注数据稀缺的问题,这一瓶颈限制了模型在未见类别上的泛化能力。传统方法依赖人工标注的图像-文本对数据集,成本高昂且难以覆盖足够多类别。其解决方案的关键在于提出了一种基于扩散模型(diffusion model)的数据生成流水线 MagicSeg:首先从类别标签出发生成高保真文本描述,进而引导扩散模型合成正样本图像;同时生成对应的负样本图像作为反事实对照样本,用于对比学习训练;此外,通过集成开放词汇检测模型与交互式分割模型自动提取精确对象掩码作为自监督信号,从而构建可用于预训练的多样化、大规模合成数据集。该方法显著提升了下游模型在 PASCAL VOC、PASCAL Context 和 COCO 数据集上的性能,达到当前最优水平。
链接: https://arxiv.org/abs/2603.19575
作者: Kaixin Cai,Pengzhen Ren,Jianhua Han,Yi Zhu,Hang Xu,Jianzhuang Liu,Xiaodan Liang
机构: Sun Yat-sen University (中山大学); PengCheng Laboratory; Huawei Noah’s Ark Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named “MagicSeg”. Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset’s effectiveness in enhancing open-world semantic segmentation capabilities. Project website: this https URL.
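The generated counterfactual images serve as hard negatives in contrastive training. A generic InfoNCE-style loss sketch (not necessarily the paper's exact objective) showing why a close-but-wrong counterfactual yields a stronger training signal than an easy negative:

```python
import math

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: pull the anchor toward its
    positive pair, away from (counterfactual) negatives."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den
    logits = [cos(anchor, positive) / tau] + [cos(anchor, n) / tau for n in negatives]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

anchor = [1.0, 0.0]
loss_easy = info_nce(anchor, [1.0, 0.1], [[-1.0, 0.0]])  # negative far away
loss_hard = info_nce(anchor, [1.0, 0.1], [[1.0, -0.1]])  # counterfactual: near-miss
print(loss_easy < loss_hard)  # True: hard negatives give a stronger signal
```

Generating a paired counterfactual per positive guarantees such near-miss negatives exist for every sample, which random in-batch negatives do not.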
[CV-91] CurveStream: Boosting Streaming Video Understanding in MLLM s via Curvature-Aware Hierarchical Visual Memory Management
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在流式视频理解中因视觉标记(visual tokens)线性爆炸导致的显存溢出(Out-of-Memory, OOM)和灾难性遗忘问题。现有视觉保留与内存管理方法依赖于均匀采样、低层次物理指标或被动缓存淘汰策略,缺乏内在语义感知能力,易破坏上下文连贯性并模糊短暂但关键的语义过渡。其解决方案的关键在于提出一种无需训练的、基于曲率感知的分层视觉记忆管理框架 CurveStream:该框架基于核心观察——连续特征轨迹上的高曲率区域与全局语义跃迁高度一致,通过 Curvature Score 实时评估语义强度,并引入在线 K-Sigma 动态阈值机制,在严格标记预算下自适应地将帧路由至清晰记忆与模糊记忆状态,从而实现高效且语义敏感的记忆管理。
链接: https://arxiv.org/abs/2603.19571
作者: Chao Wang,Xudong Tan,Jianjian Cao,Kangcong Li,Tao Chen
机构: College of Future Information Technology, Fudan University (复旦大学未来信息技术学院); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video understanding. The code will be released at this https URL.
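The Curvature Score measures how sharply the feature trajectory turns, and the K-Sigma rule flags frames whose score exceeds a running mean-plus-k-standard-deviations threshold. A minimal sketch of both pieces on a toy 2-D trajectory; the discrete turning-angle curvature here is one plausible instantiation, not necessarily the paper's exact formula:

```python
import math

def curvature_score(f_prev, f_cur, f_next):
    """Turning angle of the feature trajectory at f_cur: large when the
    stream's semantics change direction abruptly."""
    d1 = [b - a for a, b in zip(f_prev, f_cur)]
    d2 = [b - a for a, b in zip(f_cur, f_next)]
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(a * a for a in d2))
    cosang = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.acos(cosang)

def k_sigma_flags(scores, k=2.0):
    """Online K-sigma rule: flag a score that exceeds the running mean
    + k * running std of the scores seen so far (Welford update)."""
    flags, mean, m2, n = [], 0.0, 0.0, 0
    for s in scores:
        std = math.sqrt(m2 / n) if n else 0.0
        flags.append(n > 1 and s > mean + k * std)
        n += 1
        delta = s - mean
        mean += delta / n
        m2 += delta * (s - mean)
    return flags

# Straight drift, then a sharp semantic turn at the fifth interior frame.
traj = [[t, 0.0] for t in range(6)] + [[5.0, t] for t in range(1, 4)]
scores = [curvature_score(*traj[i - 1:i + 2]) for i in range(1, len(traj) - 1)]
print(k_sigma_flags(scores))  # only the turning frame is flagged
```

Frames that trip the threshold would be routed to the "clear" memory state; the rest can be compressed into the "fuzzy" state under the token budget.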
[CV-92] Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation
【速读】:该论文旨在解决基于扩散模型的图像分词器(image tokenization)在解码过程中因迭代采样导致的高延迟问题,从而限制其在实时或大规模应用场景中的实用性。解决方案的关键在于提出一种两阶段加速框架:首先采用多尺度采样策略,从粗粒度分辨率开始逐步加倍细化,理论上实现 O(logn) 的速度提升;其次,在每一尺度上将扩散解码器蒸馏为单步去噪模型,使每个尺度仅需一次前向传播即可完成高质量重建。该方法在显著降低解码时间(约一个数量级)的同时保持输出质量几乎不变,为高效且表达能力强的图像分词器提供了可行路径。
链接: https://arxiv.org/abs/2603.19570
作者: Chuhan Wang,Hao Chen
机构: University of California San Diego (加州大学圣地亚哥分校); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of O(log n) compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
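The claimed O(log n) behavior comes from doubling the resolution at each stage. A sketch of the schedule plus a pixel-count cost proxy; the 64-pixel starting resolution is an assumption, and the proxy ignores the one-step distillation stage:

```python
def multiscale_schedule(full_res, start_res=64):
    """Resolutions visited by coarse-to-fine decoding: start small and
    double until the full resolution is reached (O(log n) stages)."""
    res = [start_res]
    while res[-1] < full_res:
        res.append(min(res[-1] * 2, full_res))
    return res

def pixel_cost(schedule):
    """Proxy cost: one network pass per stage, linear in pixel count."""
    return sum(r * r for r in schedule)

sched = multiscale_schedule(1024)
full = 1024 * 1024 * len(sched)     # same number of full-res passes
print(sched)                        # [64, 128, 256, 512, 1024]
print(pixel_cost(sched) / full)     # ~0.266: a ~3.7x reduction in this proxy
```

With single-step distilled decoders at each scale, the number of network evaluations also drops from the diffusion step count to one pass per stage.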
[CV-93] Efficiency Follows Global-Local Decoupling
【速读】:该论文旨在解决现代视觉模型在捕捉图像级上下文信息的同时保持局部细节,并且在计算成本上实现高效性的难题。其解决方案的关键在于提出了一种双分支架构ConvNeur,通过解耦全局推理(global reasoning)与局部表征(local representation)的功能:一个轻量级神经记忆分支在紧凑的token集合上聚合全局上下文,另一个保局分支提取精细结构;并通过一个可学习门控机制使全局线索调制局部特征,而不混淆二者的目标。这种分离策略实现了次二次复杂度与图像尺寸的关系,保留了局部处理的归纳偏置,并显著降低了相对于全连接注意力机制的开销。
链接: https://arxiv.org/abs/2603.19567
作者: Zhenyu Yang,Gensheng Pei,Tao Chen,Yichao Zhou,Tianfei Zhou,Yazhou Yao,Fumin Shen
机构: Nanjing University of Science and Technology (南京理工大学); Sungkyunkwan University (成均馆大学); State Key Laboratory of Intelligent Manufacturing of Advanced Construction Machinery (先进施工机械智能制造国家重点实验室); Beijing Institute of Technology (北京理工大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.
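The gating idea, global cues modulating local features without entangling objectives, can be sketched per channel as out = local · σ(gate(global)). A scalar toy version; the real gate is a learned network over token sets, and the weights here are invented:

```python
import math

def gated_fusion(local_feat, global_summary, gate_w, gate_b):
    """Gating sketch: a sigmoid gate derived from the compact global
    summary modulates each local feature channel."""
    gate = [1.0 / (1.0 + math.exp(-(w * global_summary + gate_b)))
            for w in gate_w]
    return [l * g for l, g in zip(local_feat, gate)]

local = [2.0, 2.0]
# Channel 0's gate opens under strong global evidence, channel 1's closes.
out = gated_fusion(local, global_summary=10.0, gate_w=[1.0, -1.0], gate_b=0.0)
print(out[0] > 1.99 and out[1] < 0.01)  # True
```

Because the gate multiplies rather than replaces the local branch, the local pathway keeps its own inductive bias while the compact global branch stays subquadratic in image size.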
[CV-94] PhyUnfold-Net: Advancing Remote Sensing Change Detection with Physics-Guided Deep Unfolding
【速读】:该论文旨在解决双时相变化检测(bi-temporal change detection)中因光照、季节和大气等获取差异导致的误报问题。现有方法在特征差异空间中难以区分真实变化与伪变化,从而降低检测准确性。解决方案的关键在于引入物理先验:真实变化在特征差异空间中具有更高的局部奇异值熵(patch-wise singular-value entropy, SVE)。基于此,作者提出PhyUnfold-Net——一种物理引导的深度展开框架,其核心是迭代变化分解模块(Iterative Change Decomposition Module, ICDM),通过多步求解器逐步将混合差异特征分离为变化分量和干扰分量;同时设计分阶段探索与约束损失(Staged Exploration-and-Constraint Loss, S-SEC)以稳定分解过程,并引入小波谱抑制模块(Wavelet Spectral Suppression Module, WSSM)预先消除获取引起的光谱不匹配,从而提升复杂场景下的变化检测性能。
链接: https://arxiv.org/abs/2603.19566
作者: Zelin Lei,Yaoxing Ren,Jiaming Chang
机构: Xi’an Jiaotong University (西安交通大学); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 8 figures, 9 tables. Appendix included
Abstract:Bi-temporal change detection is highly sensitive to acquisition discrepancies, including illumination, season, and atmosphere, which often cause false alarms. We observe that genuine changes exhibit higher patch-wise singular-value entropy (SVE) than pseudo changes in the feature-difference space. Motivated by this physical prior, we propose PhyUnfold-Net, a physics-guided deep unfolding framework that formulates change detection as an explicit decomposition problem. The proposed Iterative Change Decomposition Module (ICDM) unrolls a multi-step solver to progressively separate mixed discrepancy features into a change component and a nuisance component. To stabilize this process, we introduce a staged Exploration-and-Constraint loss (S-SEC), which encourages component separation in early steps while constraining nuisance magnitude in later steps to avoid degenerate solutions. We further design a Wavelet Spectral Suppression Module (WSSM) to suppress acquisition-induced spectral mismatch before decomposition. Experiments on four benchmarks show improvements over state-of-the-art methods, with gains under challenging conditions.
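The physical prior is patch-wise singular-value entropy: genuine changes spread energy across many singular directions, while pseudo changes are closer to low rank. A direct implementation of SVE with a rank-1 vs. full-rank comparison (the patches are synthetic stand-ins for feature-difference patches):

```python
import numpy as np

def singular_value_entropy(patch):
    """Shannon entropy of the normalized singular-value spectrum of a
    patch; higher when energy is spread over many directions."""
    s = np.linalg.svd(patch, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
pseudo = np.outer(rng.standard_normal(8), rng.standard_normal(8))  # rank-1
genuine = rng.standard_normal((8, 8))                              # full rank
print(singular_value_entropy(pseudo) < 1e-6)   # True: a single direction
print(singular_value_entropy(genuine) > 1.0)   # True: spread spectrum
```

In the paper this contrast motivates unrolling a decomposition that pushes low-entropy (nuisance) structure out of the change component; the statistic itself is as simple as the code above.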
[CV-95] PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition
【速读】:该论文旨在解决事件相机(Event Camera)与RGB图像在行人属性识别(Pedestrian Attribute Recognition, PAR)中融合时存在的两大问题:一是现有两流多模态融合方法计算开销大,二是忽略了来自上下文样本的潜在指导信息。解决方案的关键在于提出一个轻量级的“Event Prompter”模块,该模块摒弃了复杂的辅助骨干网络,仅通过极高效的离散余弦变换(Discrete Cosine Transform, DCT)和逆离散余弦变换(Inverse DCT, IDCT)操作从事件数据中提取频域特征,以极低计算成本增强RGB分支;同时引入基于外部记忆库与现代霍普菲尔德网络(Hopfield Networks)的关联记忆机制,实现跨样本的全局关系建模,最终通过交叉注意力机制融合双模态信息并完成属性预测,显著提升了模型效率与性能。
链接: https://arxiv.org/abs/2603.19565
作者: Minghe Xu,Rouying Wu,ChiaWei Chu,Xiao Wang,Yu Li
机构: City University of Macau (澳门城市大学); Zhuhai College of Science and Technology (珠海科技学院); Macau University of Science and Technology (澳门科技大学); School of Computer Science and Technology, Anhui University (安徽大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Event-based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency-domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory-augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, followed by feed-forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB-Event PAR framework. The source code of this paper will be released on this https URL
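The Event Prompter's core operations are DCT and IDCT. A pure-Python orthonormal DCT-II/III pair with a low-frequency truncation, illustrating the kind of frequency-domain feature extraction described; the 50% cutoff is an arbitrary choice for the example:

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, v in enumerate(x))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of the orthonormal DCT-II (a.k.a. DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] / math.sqrt(n)
        s += sum(c[k] * math.sqrt(2 / n) *
                 math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for k in range(1, n))
        out.append(s)
    return out

signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
coeffs = dct2(signal)
smooth = idct2(coeffs[:4] + [0.0] * 4)   # keep only low frequencies
exact = idct2(coeffs)
print(max(abs(a - b) for a, b in zip(signal, exact)) < 1e-9)  # True: round trip
```

Both transforms are fixed linear maps with no learned parameters, which is why the prompter adds almost no compute compared to an auxiliary event backbone.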
[CV-96] Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search
【速读】:该论文旨在解决大型视觉模型(Large Vision Models, LVMs)在资源受限的边缘设备上部署时,预测精度与实时效率难以平衡的问题。现有进化神经架构搜索(Evolutionary Neural Architecture Search, ENAS)方法因候选模型评估成本高及子网络排名不一致而难以实用。其解决方案的关键在于提出EvoNAS框架:首先构建融合视觉状态空间模块(Vision State Space, VSS)与视觉Transformer(Vision Transformer, ViT)的混合超网络,并引入跨架构双域知识蒸馏(Cross-Architecture Dual-Domain Knowledge Distillation, CA-DDKD)策略,提升超网络表征能力并增强子网络排名一致性,从而实现无需额外微调即可可靠估计适应度;其次设计基于GPU资源池化和异步调度的分布式多模型并行评估(Distributed Multi-Model Parallel Evaluation, DMMPE)机制,显著降低大规模验证开销,使多GPU并发执行效率提升超过70%。实验表明,所搜得的EvoNets在多个基准数据集上均实现了精度与效率的帕累托最优权衡。
链接: https://arxiv.org/abs/2603.19563
作者: Haoyu Zhang,Zhihao Yu,Rui Wang,Yaochu Jin,Qiqi Liu,Ran Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at this https URL
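Multi-objective ENAS ultimately selects architectures on the accuracy-latency Pareto front. A minimal non-dominated-set computation over hypothetical (error, latency) pairs; EvoNAS's supernet, distillation, and evolutionary operators are not modeled:

```python
def pareto_front(models):
    """Non-dominated set for joint (error, latency) minimization, the
    selection core of multi-objective architecture search."""
    front = []
    for name, (err, lat) in models.items():
        dominated = any(
            (e <= err and l <= lat) and (e < err or l < lat)
            for other, (e, l) in models.items() if other != name)
        if not dominated:
            front.append(name)
    return sorted(front)

models = {
    "net_a": (0.10, 30.0),   # accurate but slow
    "net_b": (0.15, 12.0),   # balanced
    "net_c": (0.25, 8.0),    # fast but weak
    "net_d": (0.20, 15.0),   # dominated by net_b
}
print(pareto_front(models))  # ['net_a', 'net_b', 'net_c']
```

The supernet and CA-DDKD exist precisely to make the (error, latency) estimates cheap and rank-consistent enough for this selection to be trustworthy during evolution.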
[CV-97] StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention
【速读】:该论文旨在解决自动驾驶场景中动态街道的快速重建问题,传统方法依赖于逐场景优化,效率低下且难以在闭环仿真等下游任务中高效利用大规模驾驶数据。其解决方案的关键在于提出了一种无位姿(pose-free)和无追踪器(tracker-free)的前向重建框架StreetForward,通过引入一种简单而有效的时序掩码注意力模块(temporal mask attention),从图像序列中捕捉动态运动信息并生成具有运动感知能力的潜在表示;同时采用统一的3D高斯点绘(3D Gaussian Splatting)表示静态内容与动态实例,并通过跨帧渲染与时空一致性约束联合优化,从而实现像素级速度推断与新视角、新时间下的高保真视图合成。
链接: https://arxiv.org/abs/2603.19552
作者: Zhongrui Yu,Zhao Wang,Yijia Xie,Yida Wang,Xueyang Zhang,Yifei Zhan,Kun Zhan
机构: Li Auto Inc.(理想汽车); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: this https URL.
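The frame-level causal masking suggested by the title is a standard building block in temporal transformers. A minimal NumPy sketch of the general idea (illustrative only, not StreetForward's actual temporal mask attention module; the function name is ours):

```python
import numpy as np

def frame_causal_mask(n_frames, tokens_per_frame):
    """Boolean attention mask: token i may attend to token j only if j's
    frame is the same as or earlier than i's frame (no peeking at the future)."""
    frame_id = np.repeat(np.arange(n_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]

mask = frame_causal_mask(n_frames=3, tokens_per_frame=2)
print(mask.astype(int))  # lower block-triangular pattern
```

Such a mask is typically added (as a large negative bias on disallowed positions) to the attention logits before the softmax.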
[CV-98] SeeClear: Reliable Transparent Object Depth Estimation via Generative Opacification
【Quick Read】: This paper targets monocular depth estimation for transparent objects, where refraction and transmission are hard to model and cause existing depth networks to produce unstable or incorrect predictions. The key to the solution is the SeeClear framework, whose diffusion-based generative opacification module converts the refractive appearance of transparent regions into geometrically consistent opaque shapes, so that transparent objects whose depth could not previously be estimated reliably can be handled stably by general-purpose monocular depth estimators, without retraining or architectural changes.
Link: https://arxiv.org/abs/2603.19547
Authors: Xiaoying Wang,Yumeng He,Jingkai Shi,Jiayin Lu,Yin Yang,Ying Jiang,Chenfanfu Jiang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL. 19 pages, 12 figures
Abstract:Monocular depth estimation remains challenging for transparent objects, where refraction and transmission are difficult to model and break the appearance assumptions used by depth networks. As a result, state-of-the-art estimators often produce unstable or incorrect depth predictions for transparent materials. We propose SeeClear, a novel framework that converts transparent objects into generative opaque images, enabling stable monocular depth estimation for transparent objects. Given an input image, we first localize transparent regions and transform their refractive appearance into geometrically consistent opaque shapes using a diffusion-based generative opacification module. The processed image is then fed into an off-the-shelf monocular depth estimator without retraining or architectural changes. To train the opacification model, we construct SeeClear-396k, a synthetic dataset containing 396k paired transparent-opaque renderings. Experiments on both synthetic and real-world datasets show that SeeClear significantly improves depth estimation for transparent objects. Project page: this https URL
[CV-99] Subspace Kernel Learning on Tensor Sequences ICLR2026
【Quick Read】: This paper addresses how to capture complex mode-wise interactions when learning from higher-order tensors while remaining computationally efficient. The core challenge is designing a similarity measure that is both expressive and robust, scales to large data, and stays interpretable. The key to the solution is the Uncertainty-driven Kernel Tensor Learning (UKTL) framework: mode-wise subspaces are extracted from tensor unfoldings, and an uncertainty-aware subspace weighting adaptively down-weights low-confidence mode components, improving the robustness and interpretability of comparisons; pivot tensors selected by dynamically learned soft k-means clustering enable a Nyström kernel linearization for scalability, yielding an end-to-end trainable kernel learning paradigm that incorporates multi-way interaction structure.
Link: https://arxiv.org/abs/2603.19546
Authors: Lei Wang,Xi Ding,Yongsheng Gao,Piotr Koniusz
Institutions: Griffith University; Data61 CSIRO; University of New South Wales
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract:Learning from structured multi-way data, represented as higher-order tensors, requires capturing complex interactions across tensor modes while remaining computationally efficient. We introduce Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for M-mode tensors that compares mode-wise subspaces derived from tensor unfoldings, enabling an expressive and robust similarity measure. To handle large-scale tensor data, we propose a scalable Nyström kernel linearization with dynamically learned pivot tensors obtained via soft k-means clustering. A key innovation of UKTL is its uncertainty-aware subspace weighting, which adaptively down-weights unreliable mode components based on estimated confidence, improving robustness and interpretability in comparisons between input and pivot tensors. Our framework is fully end-to-end trainable and naturally incorporates both multi-way and multi-mode interactions through structured kernel compositions. Extensive evaluations on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) show that UKTL achieves state-of-the-art performance, superior generalization, and meaningful mode-wise insights. This work establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences.
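Mode-wise unfoldings and subspace comparison are standard multilinear-algebra operations. A minimal NumPy sketch of the general idea (illustrative, not UKTL's actual kernel), using SVD-based mode subspaces and the cosines of principal angles as a similarity:

```python
import numpy as np

rng = np.random.default_rng(0)

def mode_unfold(tensor, mode):
    # Unfold an M-mode tensor along `mode` into a (mode_dim x rest) matrix.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_subspace(tensor, mode, rank):
    # Leading left singular vectors of the mode unfolding span the mode subspace.
    U, _, _ = np.linalg.svd(mode_unfold(tensor, mode), full_matrices=False)
    return U[:, :rank]

def subspace_similarity(U1, U2):
    # Product of cosines of the principal angles between two subspaces, in [0, 1].
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.prod(np.clip(s, 0.0, 1.0)))

X = rng.standard_normal((5, 6, 7))             # a 3-mode tensor
Y = X + 0.01 * rng.standard_normal(X.shape)    # a slightly perturbed copy
sims = [subspace_similarity(mode_subspace(X, m, 3), mode_subspace(Y, m, 3))
        for m in range(3)]
print([round(s, 3) for s in sims])  # all close to 1.0 for a small perturbation
```

A kernel over tensors can then be built from such per-mode similarities; UKTL's contribution is in weighting and linearizing that comparison, which this sketch does not attempt.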
[CV-100] MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
【Quick Read】: This paper addresses the difficulty of obtaining image-plane geometry (e.g., projected 3D bounding-box corners) in monocular 3D object understanding when camera intrinsics are unknown, a common situation for object detection in the wild. The key to the solution is MoCA3D, a model that predicts projected 3D box corners and per-corner depths without requiring camera intrinsics; its core innovation is to formulate pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps, learned end to end, together with the Pixel-Aligned Geometry (PAG) metric that directly measures image-plane corner and depth consistency, substantially improving image-plane geometric accuracy while using far fewer parameters than prior methods.
Link: https://arxiv.org/abs/2603.19538
Authors: Changwoo Jeon,Rishi Upadhyay,Achuta Kadambi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 9 figures, including supplementary material
Abstract:Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
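Decoding a corner location from a dense heatmap is a common dense-prediction pattern. A hedged sketch of one standard decoder, a differentiable soft-argmax (illustrative only; the paper does not specify that MoCA3D decodes its corner heatmaps this way):

```python
import numpy as np

def soft_argmax_2d(heatmap, temperature=1.0):
    """Differentiable heatmap decoding: softmax over the map, then the
    expected (row, col) coordinate under that distribution."""
    h, w = heatmap.shape
    p = np.exp((heatmap - heatmap.max()) / temperature)   # stable softmax
    p /= p.sum()
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return float((p * rows).sum()), float((p * cols).sum())

heatmap = np.full((8, 8), -10.0)
heatmap[2, 5] = 10.0                     # a sharp peak at (row=2, col=5)
row, col = soft_argmax_2d(heatmap)
print(round(row, 3), round(col, 3))  # 2.0 5.0
```

Unlike a hard argmax, this expectation stays differentiable, which is why it is a popular choice when heatmap decoding sits inside a trained network.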
[CV-101] Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion
【Quick Read】: This paper addresses accurate pedestrian intention prediction for autonomous vehicles navigating urban environments, where the core challenge is achieving accurate, interpretable, and risk-aware prediction under resource constraints. The key to the solution is a lightweight, socially informed architecture that fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling, together with two complementary uncertainty-quantification modules: a variational bottleneck capturing epistemic uncertainty and a Mahalanobis distance detector identifying distributional shift, yielding calibrated probabilities and actionable risk scores without sacrificing efficiency.
Link: https://arxiv.org/abs/2603.19533
Authors: Sima Ashayer,Hoang H. Nguyen,Yu Liang,Mina Sartipi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to IEEE Intelligent Vehicles Symposium (IV) 2026. 8 pages, 3 figures
Abstract:Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models by achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC by using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
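The Mahalanobis distance detector mentioned above follows a well-known recipe: fit a Gaussian to in-distribution features and score new samples by their distance to it. A minimal sketch (illustrative; synthetic features stand in for the model's actual activations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit a Gaussian to in-distribution features (stand-ins for penultimate-layer
# activations), then score new samples by Mahalanobis distance.
train_feats = rng.standard_normal((500, 8))
mu = train_feats.mean(axis=0)
cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(8)  # regularized
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

in_dist = rng.standard_normal(8)          # looks like the training data
shifted = rng.standard_normal(8) + 6.0    # simulated distributional shift
print(mahalanobis(in_dist) < mahalanobis(shifted))  # True: shift scores higher
```

Thresholding this score is what enables the selective prediction reported in the abstract: samples with large distances are flagged as shifted and can be abstained on.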
[CV-102] dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
【Quick Read】: This paper addresses the limited generalization of open-vocabulary semantic segmentation (OVSS) models to unseen classes and their restricted spatial precision and robustness in complex scenes. Existing methods mostly rely on post hoc refinement of image-text similarity maps, losing fine-grained spatial information and struggling in cluttered scenes. The key to the solution is threefold: a dedicated OVSS architecture that combines the global [CLS] token with local ViT patch features, coupling semantic discrimination with spatial locality; early refinement of visual representations before image-text interaction plus late refinement of the resulting correlation features, improving the accuracy and robustness of dense predictions; and a high-resolution local-global inference strategy based on sliding-window aggregation that preserves spatial detail while maintaining global context.
Link: https://arxiv.org/abs/2603.19531
Authors: Saikat Dutta,Biplab Banerjee,Hamid Rezatofighi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending this http URL into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from the ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.
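Sliding-window aggregation for dense prediction generally tiles the image with overlapping crops and averages the per-pixel scores where crops overlap. A minimal sketch of that generic procedure (not dinov3.seg's exact inference code; it assumes the stride evenly tiles the image and skips edge padding):

```python
import numpy as np

def sliding_window_aggregate(image, window, stride, predict):
    """Average per-pixel predictions from overlapping windows: `predict` maps
    a (window, window) crop to same-sized per-pixel scores; overlapping
    regions are averaged via a count map."""
    H, W = image.shape
    scores = np.zeros((H, W))
    counts = np.zeros((H, W))
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            scores[top:top + window, left:left + window] += predict(
                image[top:top + window, left:left + window])
            counts[top:top + window, left:left + window] += 1
    return scores / np.maximum(counts, 1)

img = np.arange(64, dtype=float).reshape(8, 8)
out = sliding_window_aggregate(img, window=4, stride=2, predict=lambda c: c)
print(np.allclose(out, img))  # True: an identity predictor recovers the input
```

In practice `predict` would be the segmentation network applied per crop, with one score map per class rather than a single channel.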
[CV-103] Recognising BSL Fingerspelling in Continuous Signing Sequences
【Quick Read】: This paper addresses the challenges of fingerspelling recognition in British Sign Language (BSL), including the rapid pace of signing, frequent letter omissions by native signers, and existing datasets that are either small or temporally and letter-wise inaccurate. The key to the solution is FS23K, a large-scale, accurately annotated BSL fingerspelling dataset built with an iterative annotation framework, together with a recognition model that explicitly accounts for bi-manual interactions and mouthing cues. With the refined annotations, the approach halves the character error rate (CER) relative to the prior state of the art, demonstrating its value for sign-language understanding and scalable, automated annotation pipelines.
Link: https://arxiv.org/abs/2603.19523
Authors: Alyssa Chan,Taein Kwon,Andrew Zisserman
Institutions: University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 15 figures
Abstract:Fingerspelling is a critical component of British Sign Language (BSL), used to spell proper names, technical terms, and words that lack established lexical signs. Fingerspelling recognition is challenging due to the rapid pace of signing and common letter omissions by native signers, while existing BSL fingerspelling datasets are either small in scale or temporally and letter-wise inaccurate. In this work, we introduce a new large-scale BSL fingerspelling dataset, FS23K, constructed using an iterative annotation framework. In addition, we propose a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. As a result, with refined annotations, our approach halves the character error rate (CER) compared to the prior state of the art on fingerspelling recognition. These findings demonstrate the effectiveness of our method and highlight its potential to support future research in sign language understanding and scalable, automated annotation pipelines. The project page can be found at this https URL.
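The character error rate (CER) reported above is conventionally the Levenshtein edit distance between the predicted and reference letter sequences, normalized by the reference length. A minimal sketch of that standard metric:

```python
def character_error_rate(ref, hyp):
    """Levenshtein edit distance between reference and hypothesis letter
    sequences, normalized by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions
    for j in range(n + 1):
        d[0][j] = j                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute / match
    return d[m][n] / max(m, 1)

print(character_error_rate("SMITH", "SMTH"))  # 0.2 (one dropped letter)
```

Letter omissions by fast signers, as described in the abstract, show up as deletions in this distance, which is why CER is the natural metric for fingerspelling.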
[CV-104] ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding
【Quick Read】: This paper addresses the lack of systematic evaluation of vision-language models (VLMs) on everyday medical photographs taken with ordinary cameras, which are already widely used in telemedicine and online health conversations. The key to the solution is ReXInTheWild, a multiple-choice benchmark of 955 clinician-verified questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. The benchmark demands both fine-grained natural-image understanding and domain-specific medical reasoning, providing a rigorous, clinically grounded testbed for VLMs in real-world medical settings.
Link: https://arxiv.org/abs/2603.19517
Authors: Oishi Banerjee,Sung Eun Kim,Alexandra N. Willauer,Julius M. Kernbach,Abeer Rihan Alomaish,Reema Abdulwahab S. Alghamdi,Hassan Rayhan Alomaish,Mohammed Baharoon,Xiaoman Zhang,Julian Nicolas Acosta,Christine Zhou,Pranav Rajpurkar
Institutions: Harvard University; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures
Abstract:Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.
[CV-105] Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
【Quick Read】: This paper addresses the limited applicability of current vision-language models (VLMs) to medical diagnosis due to the lack of structured multimodal datasets that reflect real clinical workflows. The key to the solution is Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis with 1.7K cases, each pairing resting and dynamic CT scans, endoscopic images, structured biochemical indicators, expert-authored diagnostic notes, and tumor bounding-box annotations, covering the key stages of clinical practice. Five core tasks (visual question answering, report generation, cross-modal retrieval, disease classification, and lesion localization) systematically probe whether current VLMs can meaningfully correlate biochemical signals, spatial tumor features, and textual reports, moving machine intelligence closer to physicians' cognitive and evidential reasoning.
Link: https://arxiv.org/abs/2603.19516
Authors: Sheng Lu,Hao Chen,Rui Yin,Juyan Ba,Yu Zhang,Yuanzhe Li
Institutions: Ruijin Hospital; University of Cambridge; Nanjing First Hospital; Shenzhen University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Computer Vision and Pattern Recognition 2026
Abstract:Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
[CV-106] FedAgain: A Trust-Based and Robust Federated Learning Strategy for an Automated Kidney Stone Identification in Ureteroscopy
【Quick Read】: This paper addresses the reliability of AI models in medical imaging under heterogeneous, corrupted images, especially when data acquired with diverse devices across hospitals is non-IID and noisy, conditions under which conventional federated learning (FL) struggles to guarantee robustness and generalization. The key to the solution is the FedAgain framework, whose core is a dual trust mechanism combining benchmark reliability with model divergence to dynamically weight client contributions, suppressing the impact of noisy or adversarial updates during aggregation and improving stability and diagnostic accuracy on real-world multi-institutional data.
Link: https://arxiv.org/abs/2603.19512
Authors: Ivan Reyes-Amezcua,Francisco Lopez-Tiro,Clément Larose,Christian Daul,Andres Mendez-Vazquez,Gilberto Ochoa-Ruiz
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Paper submitted for peer review
Abstract:The reliability of artificial intelligence (AI) in medical imaging critically depends on its robustness to heterogeneous and corrupted images acquired with diverse devices across different hospitals, which is highly challenging. Therefore, this paper introduces FedAgain, a trust-based Federated Learning (FL) strategy designed to enhance robustness and generalization for automated kidney stone identification from endoscopic images. FedAgain integrates a dual trust mechanism that combines benchmark reliability and model divergence to dynamically weight client contributions, mitigating the impact of noisy or adversarial updates during aggregation. The framework enables the training of collaborative models across multiple institutions while preserving data privacy and promoting stable convergence under real-world conditions. Extensive experiments across five datasets, including two canonical benchmarks (MNIST and CIFAR-10), two private multi-institutional kidney stone datasets, and one public dataset (MyStone), demonstrate that FedAgain consistently outperforms standard FL baselines under non-identically and independently distributed (non-IID) data and corrupted-client scenarios. By maintaining diagnostic accuracy and performance stability under varying conditions, FedAgain represents a practical advance toward reliable, privacy-preserving, and clinically deployable federated AI for medical imaging.
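Trust-weighted aggregation in federated learning generally down-weights suspect clients when averaging their updates. A minimal sketch of that generic idea (illustrative only, not FedAgain's dual trust mechanism; the trust scores here are given by hand rather than estimated):

```python
import numpy as np

def trust_weighted_aggregate(client_updates, trust_scores):
    """Trust-weighted federated averaging: each client's update is weighted
    by a normalized trust score, so low-trust (noisy or adversarial) clients
    contribute little to the global model."""
    w = np.asarray(trust_scores, dtype=float)
    w = w / w.sum()
    return sum(wi * u for wi, u in zip(w, client_updates))

# Two honest clients and one corrupted client sending an extreme update.
updates = [np.array([1.0, 1.0]), np.array([1.2, 0.8]), np.array([50.0, -50.0])]
robust = trust_weighted_aggregate(updates, trust_scores=[1.0, 1.0, 0.01])
naive = np.mean(updates, axis=0)   # plain FedAvg, no trust weighting
print(np.round(robust, 2), np.round(naive, 2))
```

The contrast with the uniform average shows the point of trust weighting: the outlier drags plain FedAvg far from the honest consensus, while the trust-weighted aggregate stays close to it.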
[CV-107] Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement
【Quick Read】: This paper addresses the difficulty of deploying current vision models (CNNs and Vision Transformers) in resource-constrained environments due to their large parameter counts and compute demands. The key to the solution is the Vision Tiny Recursion Model (ViTRM), a parameter-efficient architecture that replaces an L-layer ViT encoder with a single tiny k-layer block (k = 3) applied recursively N times for iterative state refinement. This sharply reduces parameters (up to 6x fewer than CNN-based models and 84x fewer than ViT) while remaining competitive with mainstream models, showing that recursive computation is a viable alternative to architectural depth.
Link: https://arxiv.org/abs/2603.19503
Authors: Ange-Clément Akazan,Abdoulaye Koroko,Verlon Roel Mbingui,Choukouriyah Arinloye,Hassan Fifen,Rose Bandolo
Institutions: ΣηiGmα Research Group; AIMS RIC; SaH Analytics International
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNN) to large Vision Transformers (ViT). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the Vision Tiny Recursion Model (ViTRM): a parameter-efficient architecture that replaces the L-layer ViT encoder with a single tiny k-layer block (k = 3) applied recursively N times. Despite using up to 6x and 84x fewer parameters than CNN-based models and ViT respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
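The core trick, reusing one small block recursively instead of stacking unshared layers, can be sketched in a few lines (a generic weight-sharing illustration in NumPy, not the actual ViTRM architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim):
    # One tiny residual block: a single linear layer plus tanh.
    return {"W": rng.standard_normal((dim, dim)) * 0.01, "b": np.zeros(dim)}

def apply_block(block, x):
    return x + np.tanh(x @ block["W"] + block["b"])  # residual update

def recursive_encoder(x, block, n_steps):
    # The SAME weights are applied n_steps times: depth without new parameters.
    for _ in range(n_steps):
        x = apply_block(block, x)
    return x

dim, n_steps = 16, 8
block = make_block(dim)
x = rng.standard_normal((4, dim))          # a batch of 4 token embeddings
y = recursive_encoder(x, block, n_steps)

shared_params = block["W"].size + block["b"].size
stacked_params = n_steps * shared_params   # cost of an unshared 8-layer stack
print(y.shape, stacked_params // shared_params)
```

The parameter count stays constant in the number of recursion steps, which is where the 6x-84x savings cited in the abstract come from.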
[CV-108] Teaching an Agent to Sketch One Part at a Time
【Quick Read】: This paper addresses the lack of controllability, interpretability, and local editability in text-to-vector sketch generation: existing methods typically generate the whole structure at once, making precise control over individual sketch parts difficult. The key to the solution is a part-by-part generation framework built on a multi-modal language model-based agent, trained with supervised fine-tuning followed by a novel multi-turn process-reward reinforcement learning, and enabled by a new dataset, ControlSketch-Part, which provides rich part-level annotations for vector sketches via an automatic pipeline that segments sketches into semantic parts with structured multi-stage labeling. Combining structured part-level data with visual feedback during the process makes generation interpretable, controllable, and locally editable.
Link: https://arxiv.org/abs/2603.19500
Authors: Xiaodan Du,Ruize Xu,David Yunis,Yael Vinker,Greg Shakhnarovich
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing agent with the visual feedback through the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
[CV-109] VeloxNet: Efficient Spatial Gating for Lightweight Embedded Image Classification
【Quick Read】: This paper addresses how to achieve accurate image classification on embedded devices for aerial disaster monitoring and infrastructure inspection under strict constraints on model size, memory, and latency. The key to the solution is VeloxNet, a lightweight CNN that replaces SqueezeNet's fire modules with gated multi-layer perceptron (gMLP) blocks: each block contains a spatial gating unit (SGU) that uses learned spatial projections and multiplicative gating to model dependencies across the full feature map in a single layer, obtaining global spatial information with fewer parameters. Compared with fire modules limited to local receptive fields, the SGU markedly improves parameter efficiency and classification performance, outperforming eleven mainstream baselines on three aerial image datasets.
Link: https://arxiv.org/abs/2603.19496
Authors: Md Meftahul Ferdaus,Elias Ioup,Mahdi Abdelguerfi,Anton Netchaev,Steven Sloan,Ken Pathak,Kendall N. Niles
Institutions: University of New Orleans; Naval Research Laboratory; US Army Corps of Engineers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Deploying deep learning models on embedded devices for tasks such as aerial disaster monitoring and infrastructure inspection requires architectures that balance accuracy with strict constraints on model size, memory, and latency. This paper introduces VeloxNet, a lightweight CNN architecture that replaces SqueezeNet’s fire modules with gated multi-layer perceptron (gMLP) blocks for embedded image classification. Each gMLP block uses a spatial gating unit (SGU) that applies learned spatial projections and multiplicative gating, enabling the network to capture spatial dependencies across the full feature map in a single layer. Unlike fire modules, which are limited to local receptive fields defined by small convolutional kernels, the SGU provides global spatial modeling at each layer with fewer parameters. We evaluate VeloxNet on three aerial image datasets: the Aerial Image Database for Emergency Response (AIDER), the Comprehensive Disaster Dataset (CDD), and the Levee Defect Dataset (LDD), comparing against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers. VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. These results demonstrate that substituting local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency for resource-constrained deployment. The source code will be made publicly available upon acceptance of the paper.
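A gMLP-style spatial gating unit splits channels in half, mixes the gate half across all spatial tokens with one learned projection, and gates multiplicatively. A minimal NumPy sketch (illustrative, using the near-identity initialization from the original gMLP paper, not VeloxNet's exact block):

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_gating_unit(x, W_s, b_s):
    """x: (tokens, channels). Split channels in half; mix the gate half across
    ALL spatial tokens with one learned (tokens x tokens) projection; gate."""
    u, v = np.split(x, 2, axis=-1)     # (T, C/2) each
    v = W_s @ v + b_s                  # global spatial mixing in one layer
    return u * v                       # multiplicative gating

T, C = 49, 32                             # e.g. a flattened 7x7 map, 32 channels
x = rng.standard_normal((T, C))
W_s = 0.01 * rng.standard_normal((T, T))  # near-zero init (as in gMLP) ...
b_s = np.ones((T, 1))                     # ... so the unit starts near identity
out = spatial_gating_unit(x, W_s, b_s)
print(out.shape)  # (49, 16)
```

The single (tokens x tokens) projection is what gives the layer a full-feature-map receptive field, in contrast to the small local kernels of a fire module.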
[CV-110] Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
【Quick Read】: This paper addresses the challenge of fine-tuning large vision language models (LVLMs) in the medical domain, where constructing high-quality, large-scale instruction data is hard because it requires scarce specialist expertise. The key to the solution is an instruction-free fine-tuning approach: a momentum proxy instruction replaces curated text instructions, so that training on image-description pairs alone preserves instruction-following ability and promotes parameter updates that remain valid at inference; combined with a response shuffling strategy that mitigates over-reliance on preceding words, this substantially improves LVLM fine-tuning efficiency and performance on medical visual question answering.
Link: https://arxiv.org/abs/2603.19482
Authors: Myeongkyun Kang,Soopil Kim,Xiaoxiao Li,Sang Hyun Park
Institutions: POSTECH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model’s over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
[CV-111] Narrative Aligned Long Form Video Question Answering
【Quick Read】: This paper addresses the lack of evaluation of narrative reasoning in current multimodal large language models (MLLMs) on long-video tasks: existing benchmarks rely mainly on localized cues and fail to measure the ability to track intentions, connect distant events, and reconstruct causal chains. The key to the solution is the narrative-centric Video-NaRA framework, which builds event-level causal chains and stores them in a structured memory, enabling integration of information dispersed across scenes at reasoning time and markedly improving performance on questions requiring far-range evidence.
Link: https://arxiv.org/abs/2603.19481
Authors: Rahul Jain,Keval Doshi,Burak Uzkent,Garin Kessler
Institutions: Purdue University; Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
[CV-112] ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
【Quick Read】: This paper addresses the inability of multimodal large language models (MLLMs) to proactively request user intervention on difficult tasks, as a human would when identifying an occluded object, coping with poor image quality, or interpreting an ambiguous sketch. The key to the solution is ProactiveBench, a new benchmark built from seven repurposed datasets that systematically evaluates MLLM proactiveness across diverse tasks; a simple reinforcement-learning-based fine-tuning strategy further shows that proactive behavior can be learned and generalizes to unseen scenarios, offering a path toward proactive multimodal models.
Link: https://arxiv.org/abs/2603.19466
Authors: Thomas De Min,Subhankar Roy,Stéphane Lathuilière,Elisa Ricci,Massimiliano Mancini
Institutions: University of Trento; University of Bergamo; Inria Grenoble; Bruno Kessler Foundation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar “proactive” behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) “hinting” at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
[CV-113] In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing
【Quick Read】: This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial attacks in computer vision, focusing on vehicle camouflage attacks that deceive detectors while remaining stealthy to humans. The key to the solution is formulating vehicle camouflage as a conditional image-editing problem: a fine-tuned ControlNet synthesizes camouflaged vehicles directly on real images under a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Experiments on COCO and LINZ show markedly stronger attacks (over 38% AP50 decrease), better human-perceived stealthiness, good generalization to unseen black-box detectors, and promising transferability to the physical world.
Link: https://arxiv.org/abs/2603.19456
Authors: Xiao Fang,Yiming Gong,Stanislav Panev,Celso de Melo,Shuowen Hu,Shayok Chakraborty,Fernando De la Torre
Institutions: Carnegie Mellon University; DEVCOM Army Research Laboratory; Florida State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages, 35 figures
Abstract:Deep neural networks (DNNs) have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. Among them, camouflage attacks manipulate an object’s visible appearance to deceive detectors while remaining stealthy to humans. In this paper, we propose a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem. Specifically, we explore both image-level and scene-level camouflage generation strategies, and fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images. We design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Extensive experiments on the COCO and LINZ datasets show that our method achieves significantly stronger attack effectiveness, leading to more than 38% AP50 decrease, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing approaches. Furthermore, our framework generalizes effectively to unseen black-box detectors and exhibits promising transferability to the physical world. Project page is available at this https URL
[CV-114] LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray
[Quick Read]: This paper addresses fine-grained representation learning for chest X-rays, where retrieval and phrase grounding suffer because clinically relevant findings are spatially confined, contrastive models lack region-level supervision, and large vision-language models capture fine-grained features poorly in external validation. The key is Location-aware Fine-grained representation learning (LoFi), which jointly optimizes a sigmoid loss, a captioning loss, and a location-aware captioning loss with a lightweight large language model, providing region-level supervision that strengthens fine-grained representations. The fine-grained encoder is then integrated into retrieval-based in-context learning to improve chest X-ray grounding across diverse settings.
Link: https://arxiv.org/abs/2603.19451
Authors: Myeongkyun Kang,Yanting Yang,Xiaoxiao Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
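The exact form of LoFi's sigmoid loss is not spelled out in the abstract; below is a minimal numpy sketch of the SigLIP-style pairwise sigmoid loss its wording suggests, with the temperature `t` and bias `b` fixed as illustrative constants rather than the learned parameters a real implementation would use.

```python
import numpy as np

def sigmoid_pair_loss(img, txt, t=10.0, b=-10.0):
    """SigLIP-style loss: every image-text pair gets an independent
    binary label (+1 on the diagonal for matched pairs, -1 elsewhere)."""
    logits = t * (img @ txt.T) + b               # (N, N) pairwise logits
    labels = 2.0 * np.eye(len(img)) - 1.0        # +1 matched, -1 unmatched
    # -log sigmoid(label * logit), averaged over all N*N pairs
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

With unit-normalized, mutually orthogonal embeddings, matched pairs sit at logit zero and unmatched pairs are strongly rejected, so mismatching the pairs raises the loss.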
[CV-115] Factored Levenberg-Marquardt for Diffeomorphic Image Registration: An efficient optimizer for FireANTs
[Quick Read]: This paper targets the excessive memory footprint of optimizers such as Adam in large-scale image registration, where storing state variables (momentum and squared-momentum estimates) makes very large volumes impractical. The key is a modified Levenberg-Marquardt (LM) optimizer whose only state is a single scalar damping parameter, adaptively tuned with a trust-region strategy; it cuts memory by up to 24.6% while matching or improving registration performance. The optimizer also generalizes well: a single hyperparameter configuration tuned on brain MRI transfers without modification to lung CT and cross-modal abdominal registration, matching or outperforming Adam on three of four benchmarks.
Link: https://arxiv.org/abs/2603.19371
Authors: Rohit Jena,Pratik Chaudhari,James C. Gee
Affiliations: University of Pennsylvania
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:FireANTs introduced a novel Eulerian descent method for plug-and-play behavior with arbitrary optimizers adapted for diffeomorphic image registration as a test-time optimization problem, with a GPU-accelerated implementation. FireANTs uses Adam as its default optimizer for fast and more robust optimization. However, Adam requires storing state variables (i.e. momentum and squared-momentum estimates), each of which can consume significant memory, prohibiting its use for significantly large images. In this work, we propose a modified Levenberg-Marquardt (LM) optimizer that requires only a single scalar damping parameter as optimizer state, that is adaptively tuned using a trust region approach. The resulting optimizer reduces memory by up to 24.6% for large volumes, while retaining performance across all four datasets. A single hyperparameter configuration tuned on brain MRI transfers without modification to lung CT and cross-modal abdominal registration, matching or outperforming Adam on three of four benchmarks. We also perform ablations on the effectiveness of using Metropolis-Hastings style rejection step to prevent updates that worsen the loss function.
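The paper's exact damping schedule is not given in the abstract; here is a generic sketch of the idea on a toy scalar least-squares problem, where the optimizer's entire state is one damping value that is raised on rejected steps and lowered on accepted ones (the factors 0.5 and 2.0 are conventional choices, not necessarily the paper's).

```python
import numpy as np

def lm_fit(x, y, a0=0.0, lam=1.0, iters=20):
    """Toy Levenberg-Marquardt for y ~ a*x with a single scalar damping state."""
    a = a0
    for _ in range(iters):
        r = a * x - y                    # residuals
        J = x                            # d r / d a
        g = J @ r                        # gradient
        H = J @ J                        # Gauss-Newton Hessian (scalar here)
        step = -g / (H + lam)            # damped update
        new_loss = np.sum(((a + step) * x - y) ** 2)
        if new_loss < np.sum(r ** 2):                # trust-region accept/reject
            a, lam = a + step, max(lam * 0.5, 1e-8)  # accept: relax damping
        else:
            lam *= 2.0                               # reject: increase damping
    return a, lam
```

Large damping makes the step behave like scaled gradient descent; small damping approaches the Gauss-Newton step, all while storing only one extra scalar.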
[CV-116] AURORA: Adaptive Unified Representation for Robust Ultrasound Analysis
[Quick Read]: This paper addresses the poor generalization caused by the wide variation of ultrasound images across devices, operators, and anatomical targets, particularly the challenge of unifying multiple tasks (segmentation, detection, classification, landmark regression) across hospitals and clinical settings. The key is a unified multi-task framework built on a transformer visual encoder from the Qwen3-VL family: intermediate token features are projected into spatial feature maps and fused with a lightweight multi-scale feature pyramid, enabling both pixel-level prediction and global reasoning within a shared representation, while task-aware sampling and selective loss balancing manage heterogeneous supervision and mitigate task imbalance, improving adaptability and performance across diverse ultrasound analysis tasks.
Link: https://arxiv.org/abs/2603.19364
Authors: Ufaq Khan,L. D. M. S. Sai Teja,Ayuba Shakiru,Mai A. Shaaban,Yutong Xie,Muhammad Bilal,Muhammad Haris Khan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Ultrasound images vary widely across scanners, operators, and anatomical targets, which often causes models trained in one setting to generalize poorly to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis (FMC-UIA) reflects this difficulty by requiring a single model to handle multiple tasks, including segmentation, detection, classification, and landmark regression across diverse organs and datasets. We propose a unified multi-task framework based on a transformer visual encoder from the Qwen3-VL family. Intermediate token features are projected into spatial feature maps and fused using a lightweight multi-scale feature pyramid, enabling both pixel-level predictions and global reasoning within a shared representation. Each task is handled by a small task-specific prediction head, while training uses task-aware sampling and selective loss balancing to manage heterogeneous supervision and reduce task imbalance. Our method is designed to be simple to optimize and adaptable across a wide range of ultrasound analysis tasks. The performance improved from 67% to 85% on the validation set and achieved an average score of 81.84% on the official test set across all tasks. The code is publicly available at: this https URL
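The abstract mentions task-aware sampling without detail; a common choice, sketched here purely as an assumption, is temperature-scaled sampling that interpolates between drawing tasks proportionally to their dataset sizes and drawing them uniformly.

```python
import numpy as np

def task_sampling_probs(sizes, tau=0.5):
    """Temperature-scaled task sampling: tau=1 is proportional to dataset
    size, tau -> 0 approaches uniform, boosting under-represented tasks."""
    w = np.asarray(sizes, dtype=float) ** tau
    return w / w.sum()
```

With `tau=0.5`, a task 100x larger than another is sampled only 10x as often, softening the imbalance the abstract refers to.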
[CV-117] Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity ICME2026
[Quick Read]: This paper addresses the performance degradation of federated learning (FL) under non-IID client data, where conventional methods struggle with inter-client semantic discrepancies in multimodal perception tasks. The key is the SemanticFL framework, which uses the multi-level semantic representations of a pre-trained diffusion model such as Stable Diffusion (VAE-encoded latents and hierarchical U-Net features) to build a shared latent space that aligns heterogeneous clients; an efficient client-server architecture offloads the heavy computation to the server, and a consistency mechanism based on cross-modal contrastive learning stabilizes convergence, markedly improving robustness and accuracy on multimodal perception tasks.
Link: https://arxiv.org/abs/2603.19337
Authors: Jing Liu,Zhengliang Guo,Yan Wang,Xiaoguang Zhu,Yao Du,Zehua Wang,Victor C. M. Leung
Affiliations: 1. South China University of Technology; 2. Tsinghua University; 3. Guangdong Provincial Key Laboratory of Intelligent Information Processing and Security; 4. Beijing Institute of Technology; 5. Zhejiang University; 6. Shanghai Jiao Tong University; 7. University of British Columbia; 8. National Natural Science Foundation of China; 9. Shanghai Key Technology RD Program; 10. Natural Sciences and Engineering Research Council (NSERC) of Canada; 11. Mitacs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE ICME 2026
Abstract:Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Our approach leverages multi-layer semantic representations from a pre-trained Stable Diffusion model (including VAE-encoded latents and U-Net hierarchical features) to create a shared latent space that aligns heterogeneous clients, facilitated by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg, validating its effectiveness in learning robust representations for heterogeneous and multimodal data for perception tasks.
[CV-118] PhyGile: Physics-Prefix Guided Motion Generation for Agile General Humanoid Motion Tracking
[Quick Read]: This paper addresses the physical-feasibility gap when motions from existing text-to-motion models are transferred directly to humanoid robots: because the priors of human motion data (biomechanics, actuation, mass distribution, contact strategies) differ from the robot's, directly retargeted trajectories may satisfy geometric constraints such as joint limits and pose continuity yet fail to execute stably in the real world. The key is the PhyGile framework, which performs physics-prefix-guided robot-native motion generation at inference time, producing trajectories directly in a 262-dimensional skeletal space consistent with the robot's dynamics, thereby eliminating retargeting artifacts and reducing generation-execution discrepancies; combined with curriculum-based mixture-of-experts training and a physics-prefix fine-tuning stage, it markedly improves stable tracking of complex, highly dynamic whole-body motions (such as agile full-body actions) on real robots.
Link: https://arxiv.org/abs/2603.19305
Authors: Jiacheng Bao,Haoran Yang,Yucheng Xin,Junhong Liu,Yuecheng Xu,Han Liang,Pengfei Han,Xiaoguang Ma,Dong Wang,Bin Zhao
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Humanoid robots are expected to execute agile and expressive whole-body motions in real-world settings. Existing text-to-motion generation models are predominantly trained on captured human motion datasets, whose priors assume human biomechanics, actuation, mass distribution, and contact strategies. When such motions are directly retargeted to humanoid robots, the resulting trajectories may satisfy geometric constraints (e.g., joint limits and pose continuity) and appear kinematically reasonable. However, they frequently violate the physical feasibility required for real-world execution. To address these issues, we present PhyGile, a unified framework that closes the loop between robot-native motion generation and General Motion Tracking (GMT). PhyGile performs physics-prefix-guided robot-native motion generation at inference time, directly generating robot-native motions in a 262-dimensional skeletal space with physics-guided prefixes, thereby eliminating inference-time retargeting artifacts and reducing generation-execution discrepancies. Before physics-prefix adaptation, we train the GMT controller with a curriculum-based mixture-of-experts scheme, followed by post-training on unlabeled motion data to improve robustness over large-scale robot motions. During physics-prefix adaptation, the GMT controller is further fine-tuned with generated objectives under physics-derived prefixes, enabling agile and stable execution of complex motions on real robots. Extensive offline and real-robot experiments demonstrate that PhyGile expands the frontier of text-driven humanoid control, enabling stable tracking of agile, highly difficult whole-body motions that go well beyond walking and low-dynamic motions typically achieved by prior methods.
[CV-119] Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery
[Quick Read]: This paper addresses camera pose estimation in endoscopic surgery, where traditional geometry-based methods degrade under challenging imaging conditions such as low-texture regions and rapid illumination changes. The key is a policy-based formulation of pose recovery: an expert-imitating motion-prediction policy directly predicts short-horizon relative motions conditioned on the previous camera state, without maintaining an explicit geometric representation at inference time. By design, this sidesteps brittle correspondence matching, instability in texture-sparse regions, and the limited pose coverage caused by reconstruction failures.
Link: https://arxiv.org/abs/2603.20045
Authors: Jan Emily Mangulabnan,Akshat Chauhan,Laura Fleig,Lalithkumar Seenivasan,Roger D. Soberanis-Mukul,S. Swaroop Vedula,Russell H. Taylor,Masaru Ishii,Gregory D. Hager,Mathias Unberath
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In endoscopic surgery, surgeons continuously locate the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in the context of their prior knowledge. Vision-based navigation systems seek to replicate this capability by recovering camera pose directly from endoscopic video, but most approaches do not embody the same principles of reasoning about new frames that makes surgeons successful. Instead, they remain grounded in feature matching and geometric optimization over keyframes, an approach that has been shown to degrade under the challenging conditions of endoscopic imaging like low texture and rapid illumination changes. Here, we pursue an alternative approach and investigate a policy-based formulation of endoscopic camera pose recovery that seeks to imitate experts in estimating trajectories conditioned on the previous camera state. Our approach directly predicts short-horizon relative motions without maintaining an explicit geometric representation at inference time. It thus addresses, by design, some of the notorious challenges of geometry-based approaches, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. We evaluate the proposed formulation on cadaveric sinus endoscopy. Under oracle state conditioning, we compare short-horizon motion prediction quality to geometric baselines achieving lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change indicating reduced sensitivity to low-texture conditions. These findings suggest that a learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery.
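Chaining the predicted short-horizon relative motions into a full camera trajectory is a plain rigid-transform composition; a minimal planar (SE(2)) sketch is below, standing in for the full 6-DoF case the paper works in.

```python
import numpy as np

def se2(dx, dy, dtheta):
    """Homogeneous SE(2) transform: translate by (dx, dy), rotate by dtheta."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx], [s, c, dy], [0.0, 0.0, 1.0]])

def compose(rel_motions):
    """Chain per-step relative motions into absolute poses T_0..T_n."""
    T = np.eye(3)
    poses = [T]
    for dx, dy, dth in rel_motions:
        T = T @ se2(dx, dy, dth)  # each motion is expressed in the previous frame
        poses.append(T)
    return poses
```

Four unit forward steps, each followed by a 90-degree turn, trace a square and return the camera exactly to its starting pose.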
[CV-120] Layered Quantum Architecture Search for 3D Point Cloud Classification
[Quick Read]: This paper addresses the lack of inductive bias in parametrised quantum circuits (PQCs), which have no standardized architectural layers (convolution, attention, etc.) to encode priors for a given learning task. The key is layered Quantum Architecture Search (layered-QAS), inspired by classical network morphism, which designs PQC architectures by progressively growing and adapting them, keeping parameter counts low while increasing expressiveness. Simulations show the strategy mitigates barren plateaus, outperforms quantum-adapted local and evolutionary QAS baselines on 3D point cloud classification, and achieves state-of-the-art results among PQC-based methods on the ModelNet dataset.
Link: https://arxiv.org/abs/2603.20024
Authors: Natacha Kuete Meli,Jovita Lukasik,Vladislav Golyanik,Michael Moeller
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We introduce layered Quantum Architecture Search (layered-QAS), a strategy inspired by classical network morphism that designs Parametrised Quantum Circuit (PQC) architectures by progressively growing and adapting them. PQCs offer strong expressiveness with relatively few parameters, yet they lack standard architectural layers (e.g., convolution, attention) that encode inductive biases for a given learning task. To assess the effectiveness of our method, we focus on 3D point cloud classification as a challenging yet highly structured problem. Whereas prior work on this task has used PQCs only as feature extractors for classical classifiers, our approach uses the PQC as the main building block of the classification model. Simulations show that our layered-QAS mitigates barren plateau, outperforms quantum-adapted local and evolutionary QAS baselines, and achieves state-of-the-art results among PQC-based methods on the ModelNet dataset.
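The progressive grow-and-adapt idea can be illustrated with a toy greedy loop: propose a new layer, keep it only if fitness improves. Everything here is a stand-in assumption; in the paper the candidate layers are quantum gates and `fitness` would involve training and evaluating the grown PQC.

```python
import random

def layered_search(candidate_gates, depth, fitness, seed=0):
    """Greedily grow an architecture layer by layer, keeping a proposed
    layer only if it improves fitness (toy stand-in for PQC evaluation)."""
    rng = random.Random(seed)
    arch, best = [], fitness([])
    for _ in range(depth):
        gate = rng.choice(candidate_gates)
        score = fitness(arch + [gate])
        if score > best:
            arch, best = arch + [gate], score
    return arch, best
```

Because each accepted layer must strictly improve the score, the search stops deepening once extra layers stop paying for themselves.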
[CV-121] ReconMIL: Synergizing Latent Space Reconstruction with Bi-Stream Mamba for Whole Slide Image Analysis
[Quick Read]: This paper addresses two key problems in multiple instance learning (MIL) for whole slide image (WSI) analysis: directly using frozen, task-agnostic features introduces a domain gap that reduces the separability of diagnostic features, and relying solely on global aggregation causes over-smoothing, in which sparse but critical diagnostic signals are drowned out by the dominant background context. The key contributions of the ReconMIL framework are: 1) a Latent Space Reconstruction module that adaptively projects generic features onto a compact, task-specific manifold to sharpen boundary discrimination; 2) a bi-stream architecture, with a Mamba-based global stream capturing contextual priors and a CNN-based local stream preserving subtle morphological anomalies; and 3) a scale-adaptive selection mechanism that dynamically fuses the two streams, balancing overall structure against local saliency. The approach localizes fine-grained diagnostic regions more precisely while suppressing background noise, outperforming state-of-the-art methods on multiple diagnostic and survival-prediction benchmarks.
Link: https://arxiv.org/abs/2603.19925
Authors: Lubin Gan,Jing Zhang,Heng Zhang,Xin Di,Zhifeng Wang,Wenke Huang,Xiaoyan Sun
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Whole slide image (WSI) analysis heavily relies on multiple instance learning (MIL). While recent methods benefit from large-scale foundation models and advanced sequence modeling to capture long-range dependencies, they still struggle with two critical issues. First, directly applying frozen, task-agnostic features often leads to suboptimal separability due to the domain gap with specific histological tasks. Second, relying solely on global aggregators can cause over-smoothing, where sparse but critical diagnostic signals are overshadowed by the dominant background context. In this paper, we present ReconMIL, a novel framework designed to bridge this domain gap and balance global-local feature aggregation. Our approach introduces a Latent Space Reconstruction module that adaptively projects generic features into a compact, task-specific manifold, improving boundary delineation. To prevent information dilution, we develop a bi-stream architecture combining a Mamba-based global stream for contextual priors and a CNN-based local stream to preserve subtle morphological anomalies. A scale-adaptive selection mechanism dynamically fuses these two streams, determining when to rely on overall architecture versus local saliency. Evaluations across multiple diagnostic and survival prediction benchmarks show that ReconMIL consistently outperforms current state-of-the-art methods, effectively localizing fine-grained diagnostic regions while suppressing background noise. Visualization results confirm the model's superior ability to localize diagnostic regions by effectively balancing global structure and local granularity.
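The scale-adaptive selection mechanism is described only at a high level; one minimal reading, sketched here, is a scalar sigmoid gate computed from both streams that decides how much to trust global context versus local saliency. The gate parameterization below is an assumption, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(global_feat, local_feat, w, b=0.0):
    """Blend two feature streams with a scalar gate derived from their
    concatenation: g near 1 favors the global stream, g near 0 the local one."""
    g = sigmoid(np.concatenate([global_feat, local_feat]) @ w + b)
    return g * global_feat + (1.0 - g) * local_feat, g
```

With zero weights the gate is neutral (g = 0.5) and the streams are averaged; a strongly positive pre-activation pushes the fusion toward the global stream.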
[CV-122] Offshore oil and gas platform dynamics in the North Sea Gulf of Mexico and Persian Gulf: Exploiting the Sentinel-1 archive
[Quick Read]: This paper addresses the lack of systematic, scalable monitoring of offshore oil and gas platforms across vast, hard-to-access maritime areas, which is needed to support economic, environmental, and regulatory decisions. The core solution combines freely available Earth observation data (Sentinel-1 radar imagery) with deep learning-based object detection to build a spatiotemporal database of platform dynamics for 2017-2025 across three major production regions (the North Sea, the Gulf of Mexico, and the Persian Gulf), automatically extracting key parameters such as platform location, size, water depth, distance to the coast, national affiliation, and installation/decommissioning dates, and thereby providing a reliable data basis for long-term monitoring of marine infrastructure and analyses of the offshore energy sector's transformation.
Link: https://arxiv.org/abs/2603.19801
Authors: Robin Spanier,Thorsten Hoeser,John Truckenbrodt,Felix Bachofer,Claudia Kuenzer
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures, 1 table
Abstract:The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations for three major production regions: the North Sea, the Gulf of Mexico, and the Persian Gulf, was created for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. 3,728 offshore platforms were identified in 2025, 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018-2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlighted the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.
[CV-123] Grounded Multimodal Retrieval-Augmented Drafting of Radiology Impressions Using Case-Based Similarity Search
[Quick Read]: This paper addresses the hallucination problem that generative AI faces in automated radiology report generation, where a lack of clinical grounding limits reliability in real-world clinical workflows. The key is a multimodal retrieval-augmented generation (RAG) system that combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to keep outputs factually aligned with historical radiology reports. Concretely, CLIP encoders produce image embeddings, semantic embeddings are built from structured impression texts, a FAISS index enables efficient nearest-neighbor retrieval, and the retrieved cases are used to construct citation-constrained prompts, yielding interpretable report drafts with explicit citation traceability that markedly improve trustworthiness and clinical applicability.
Link: https://arxiv.org/abs/2603.17765
Authors: Himadri Samanta
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures, 3 tables
Abstract:Automated radiology report generation has gained increasing attention with the rise of deep learning and large language models. However, fully generative approaches often suffer from hallucinations and lack clinical grounding, limiting their reliability in real-world workflows. In this study, we propose a multimodal retrieval-augmented generation (RAG) system for grounded drafting of chest radiograph impressions. The system combines contrastive image-text embeddings, case-based similarity retrieval, and citation-constrained draft generation to ensure factual alignment with historical radiology reports. A curated subset of the MIMIC-CXR dataset was used to construct a multimodal retrieval database. Image embeddings were generated using CLIP encoders, while textual embeddings were derived from structured impression sections. A fusion similarity framework was implemented using FAISS indexing for scalable nearest-neighbor retrieval. Retrieved cases were used to construct grounded prompts for draft impression generation, with safety mechanisms enforcing citation coverage and confidence-based refusal. Experimental results demonstrate that multimodal fusion significantly improves retrieval performance compared to image-only retrieval, achieving Recall@5 above 0.95 on clinically relevant findings. The grounded drafting pipeline produces interpretable outputs with explicit citation traceability, enabling improved trustworthiness compared to conventional generative approaches. This work highlights the potential of retrieval-augmented multimodal systems for reliable clinical decision support and radiology workflow augmentation
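The fusion similarity framework scores each historical case on both modalities before nearest-neighbor lookup; the sketch below uses brute-force numpy cosine similarity in place of the FAISS index, and the equal `alpha` weighting is an assumption rather than the paper's tuned value.

```python
import numpy as np

def fused_retrieval(q_img, q_txt, db_img, db_txt, alpha=0.5, k=3):
    """Score database cases by a weighted sum of image and text cosine
    similarity, then return top-k indices (brute force standing in for FAISS)."""
    def cos(q, M):
        q = q / np.linalg.norm(q)
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        return M @ q
    score = alpha * cos(q_img, db_img) + (1.0 - alpha) * cos(q_txt, db_txt)
    return np.argsort(-score)[:k], score
```

A case that matches the query on both modalities outranks cases that match on only one, which is the behavior behind the reported Recall@5 gains over image-only retrieval.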
Artificial Intelligence
[AI-0] Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning
[Quick Read]: This paper addresses the performance degradation that machine learning (ML) suffers in cybersecurity due to poor generalization, whose root cause is models learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. The proposed remedy is contrastive multi-modal learning that transfers knowledge from data-rich modalities (text) to data-scarce ones (network payloads). Concretely, a two-stage multi-modal contrastive framework first builds a semantically meaningful embedding space from textual vulnerability descriptions via contrastive learning, then maps payloads into that space to transfer the knowledge, reducing shortcut learning and improving threat-classification performance.
Link: https://arxiv.org/abs/2603.20181
Authors: Jianan Huang,Rodolfo V. Valentim,Luca Vassio,Matteo Boffa,Marco Mellia,Idilio Drago,Dario Rossi
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Submitted to Euro SP - 5th International Workshop on Designing and Measuring Security in Systems with AI
Abstract:The use of ML in cybersecurity has long been impaired by generalization issues: Models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim at transferring knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. We set up a case study on threat classification and propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning over baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
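Stage two aligns payload embeddings to the frozen text space; the abstract does not name the loss, so the sketch below uses the standard InfoNCE objective as one plausible instantiation (temperature `tau` is an illustrative constant).

```python
import numpy as np

def info_nce(payload, text, tau=0.1):
    """Contrastive alignment: row i of `payload` should match row i of `text`;
    all other rows in the batch act as negatives."""
    p = payload / np.linalg.norm(payload, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = p @ t.T / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # cross-entropy on matches
```

Perfectly aligned pairs drive the loss toward zero, while shuffled (mismatched) pairs are heavily penalized, which is exactly the pressure that pulls payload embeddings toward their paired descriptions.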
[AI-1] Learning Dynamic Belief Graphs for Theory-of-mind Reasoning
[Quick Read]: This paper addresses the difficulty large language models (LLMs) have with Theory of Mind (ToM) reasoning: accurately modeling how people's beliefs evolve and shape behavior, especially in high-stakes settings such as disaster evacuation and emergency medicine that require continuously inferring others' implicit beliefs and acting on them. Existing approaches either prompt LLMs directly or use static, independent latent-state models, yielding inconsistent belief evolution and weak reasoning. The key is a structured cognitive-trajectory model that represents mental state as a dynamic belief graph, jointly inferring latent beliefs, learning their time-varying dependencies, and linking their evolution to information seeking and decisions. The main innovations are: (i) a projection from textualized probabilistic statements to consistent probabilistic graphical model updates, (ii) an energy-based factor-graph representation of belief interdependencies, and (iii) an ELBO-based objective capturing belief accumulation and delayed decisions. Across several real-world disaster evacuation datasets, the model significantly improves action prediction and recovers interpretable belief trajectories consistent with human reasoning.
Link: https://arxiv.org/abs/2603.20170
Authors: Ruxiao Chen,Xilei Zhao,Thomas J. Cova,Frank A. Drews,Susu Xu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Theory of Mind (ToM) reasoning with Large Language Models (LLMs) requires inferring how people’s implicit, evolving beliefs shape what they seek and how they act under uncertainty – especially in high-stakes settings such as disaster response, emergency medicine, and human-in-the-loop autonomy. Prior approaches either prompt LLMs directly or use latent-state models that treat beliefs as static and independent, often producing incoherent mental models over time and weak reasoning in dynamic contexts. We introduce a structured cognitive trajectory model for LLM-based ToM that represents mental state as a dynamic belief graph, jointly inferring latent beliefs, learning their time-varying dependencies, and linking belief evolution to information seeking and decisions. Our model contributes (i) a novel projection from textualized probabilistic statements to consistent probabilistic graphical model updates, (ii) an energy-based factor graph representation of belief interdependencies, and (iii) an ELBO-based objective that captures belief accumulation and delayed decisions. Across multiple real-world disaster evacuation datasets, our model significantly improves action prediction and recovers interpretable belief trajectories consistent with human reasoning, providing a principled module for augmenting LLMs with ToM in high-uncertainty environment. this https URL
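The paper's factor-graph machinery is far richer than this, but the two primitive operations it composes can be sketched discretely: Bayesian updating of one belief from new evidence, and pushing that update to a dependent belief through a coupling matrix. Both parameterizations below are toy assumptions, not the paper's.

```python
import numpy as np

def belief_update(prior, likelihood):
    """Bayes rule on a discrete belief: posterior proportional to prior * likelihood."""
    post = np.asarray(prior) * np.asarray(likelihood)
    return post / post.sum()

def propagate(belief, coupling):
    """Send an updated belief to a dependent node via a row-stochastic
    coupling matrix (row i: P(dependent state | source state i))."""
    return np.asarray(belief) @ np.asarray(coupling)
```

Chaining the two calls mimics one step of evidence arriving at a node and rippling along a dependency edge of the belief graph.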
[AI-2] The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning ICRA2026
[Quick Read]: This paper addresses the limited flexibility and autonomy of conventional robot social behavior generation, which relies on predefined motions or human feedback and adapts poorly to complex, changing interactions. The key of CRISP (Critique-and-Replan for Interactive Social Presence) is to use a vision-language model (VLM) as a "human-like social critic" so the robot can evaluate and iteratively refine its own behavior: it first extracts movable joints and constraints from the robot's description file (e.g., MJCF), generates step-by-step behavior plans from situational context, produces low-level joint control code by referencing visual information (such as joint range-of-motion visualizations), then has the VLM assess social appropriateness and naturalness, pinpoint erroneous steps, and refine behaviors through reward-based search. The method is not tied to any specific robot API; given only the structure file, it generates subtly different, human-like motions across diverse platforms, markedly improving autonomous interaction capability and cross-platform applicability.
Link: https://arxiv.org/abs/2603.20164
Authors: Jiyu Lim,Youngwoo Yoon,Kwanghyun Park
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to ICRA 2026. 8 pages, 9 figures, Project page: this https URL
Abstract:Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a `human-like social critic.’ CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot’s description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot’s structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot’s autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at: this https URL
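CRISP's evaluate-and-refine cycle reduces to a reward-based search skeleton; in the sketch below, `generate`, `critic`, and `refine` are stubs standing in for the VLM-backed stages, and the 0.9 acceptance threshold and budget are illustrative assumptions.

```python
def critique_and_replan(generate, critic, refine, budget=5, threshold=0.9):
    """Keep the best-scoring behavior plan; refine it until the critic is
    satisfied or the search budget runs out (reward-based search)."""
    best_plan = generate()
    best_score = critic(best_plan)
    for _ in range(budget):
        if best_score >= threshold:
            break                              # critic is satisfied
        plan = refine(best_plan, best_score)   # replan from the critique
        score = critic(plan)
        if score > best_score:                 # greedily keep improvements
            best_plan, best_score = plan, score
    return best_plan, best_score
```

In a toy run where the "plan" is a number the critic scores directly, the loop refines until the threshold is cleared.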
[AI-3] Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case
[Quick Read]: This paper addresses the widespread ad hoc nature of engineering system design (mechatronic, control, or embedded): requirements are often left implicit, traceability from design intent to parameters is absent, existing specification-driven methods mostly target software, and AI-assisted tools typically enter at solution generation rather than problem framing, leaving human-AI collaboration in physical system design underexplored. The key is Design-OS, a lightweight, specification-driven five-stage workflow (concept definition, literature survey, conceptual design, requirements definition, and design definition) in which specifications serve as the shared contract between human designers and AI agents; each stage produces structured artifacts that maintain traceability and support agent-augmented execution, moving AI collaboration forward to problem framing and extending specification-driven AI orchestration from software to physical engineering systems.
Link: https://arxiv.org/abs/2603.20151
Authors: H. Sinan Bank,Daniel R. Herber,Thomas H. Bradley
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 2 figures, 11 pages, Submitted to ASME IDETC 2026 - DAC-09
Abstract:Engineering system design – whether mechatronic, control, or embedded – often proceeds in an ad hoc manner, with requirements left implicit and traceability from intent to parameters largely absent. Existing specification-driven and systematic design methods mostly target software, and AI-assisted tools tend to enter the workflow at solution generation rather than at problem framing. Human–AI collaboration in the design of physical systems remains underexplored. This paper presents Design-OS, a lightweight, specification-driven workflow for engineering system design organized in five stages: concept definition, literature survey, conceptual design, requirements definition, and design definition. Specifications serve as the shared contract between human designers and AI agents; each stage produces structured artifacts that maintain traceability and support agent-augmented execution. We position Design-OS relative to requirements-driven design, systematic design frameworks, and AI-assisted design pipelines, and demonstrate it on a control systems design case using two rotary inverted pendulum platforms – an open-source SimpleFOC reaction wheel and a commercial Quanser Furuta pendulum – showing how the same specification-driven workflow accommodates fundamentally different implementations. A blank template and the full design-case artifacts are shared in a public repository to support reproducibility and reuse. The workflow makes the design process visible and auditable, and extends specification-driven orchestration of AI from software to physical engineering system design.
[AI-4] An Agentic Multi-Agent Architecture for Cybersecurity Risk Management
[Quick Read]: This paper addresses the cost and time burden that keeps small organizations from cybersecurity risk assessments: a NIST CSF-aligned engagement is expensive, takes weeks, and depends on scarce practitioners, so most small companies skip it entirely. The key is a multi-agent system of six specialized agents, each handling one analytical stage of the assessment (organizational profiling, asset mapping, threat analysis, control evaluation, risk scoring, and recommendation generation), coordinated through a persistent shared context that grows as the assessment proceeds so later agents build on earlier conclusions; this mechanism distinguishes it from standard sequential agent pipelines. On a real HIPAA-covered healthcare company, the system agreed with three CISSP experts 85% of the time, covered 92% of identified risks, and finished within 15 minutes. On synthetic sector profiles, a domain fine-tuned model flagged sector-specific threats a general-purpose model missed entirely, but the full multi-agent pipeline failed every one of 30 runs within a Tesla T4's default 4,096-token context window, showing that context capacity, not model quality, is the binding constraint of the current architecture.
Link: https://arxiv.org/abs/2603.20131
Authors: Ravish Gupta(1),Saket Kumar(2),Shreeya Sharma(3),Maulik Dang(4),Abhishek Aggarwal(4) ((1) BigCommerce, (2) University at Buffalo, The State University of New York, Buffalo, NY, USA, (3) Microsoft, (4) Amazon)
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 15 pages, 1 figure, 2 tables. Submitted to AICTC 2026 (Springer LNCS)
Abstract:Getting a real cybersecurity risk assessment for a small organization is expensive – a NIST CSF-aligned engagement runs $15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six-agent AI system where each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. Agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded – the mechanism that distinguishes this from standard sequential agent pipelines. We tested it on a 15-person HIPAA-covered healthcare company and compared outputs to independent assessments by three CISSP practitioners – the system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single-agent assessments across five synthetic but sector-realistic organizational profiles in healthcare, fintech, manufacturing, retail, and SaaS, comparing a general-purpose Mistral-7B against a domain fine-tuned model. Both completed every run. The fine-tuned model flagged threats the baseline could not see at all: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, platform-specific risks in retail. The full multi-agent pipeline, however, failed every one of 30 attempts on a Tesla T4 with its 4,096-token default context window – context capacity, not model quality, turned out to be the binding constraint.
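The persistent context that lets later agents build on earlier conclusions can be sketched as a dict threaded through a sequential pipeline; the three stub agents below are placeholders standing in for the paper's six LLM-backed stages, and all keys and findings are illustrative assumptions.

```python
def run_pipeline(agents, org_profile):
    """Each agent reads the accumulated context and appends its own findings."""
    context = {"organization": org_profile}
    for name, agent in agents:
        context[name] = agent(context)  # later agents see all earlier outputs
    return context

# Stub agents: each consumes what the previous stages wrote into the context.
agents = [
    ("profile", lambda c: {"sector": c["organization"]["sector"]}),
    ("assets",  lambda c: ["EHR system"] if c["profile"]["sector"] == "healthcare" else []),
    ("threats", lambda c: [f"PHI exposure via {a}" for a in c["assets"]]),
]
```

Because the context only ever grows, the threat stage can reason about sector-specific assets it never computed itself, which is the coordination pattern the abstract describes.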
[AI-5] Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在开放网络应用中因暴露于长尾分布输入(如低资源语言和加密私有数据)而面临的越狱攻击(jailbreak attacks)安全风险问题。现有方法多依赖人工规则设计,难以系统性评估此类安全与隐私漏洞。其解决方案的关键在于提出EvoJail——一个基于多目标进化搜索的自动化框架,将越狱提示生成建模为同时最大化攻击有效性与最小化输出困惑度的优化问题,并引入语义-算法联合表示来捕捉高阶语义意图与低阶加密解密逻辑结构;在此基础上,EvoJail集成LLM辅助操作符至多目标进化流程中,实现自适应且语义感知的变异与交叉机制,从而高效探索高度结构化且开放的搜索空间,显著提升对多样化长尾越狱策略的发现能力。
链接: https://arxiv.org/abs/2603.20122
作者: Wenjing Hong,Zhonghua Rong,Li Wang,Feng Chang,Jian Zhu,Ke Tang,Zexuan Zhu,Yew-Soon Ong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have been widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs, including those from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic evaluation of these security and privacy vulnerabilities. In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search. EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and introduces a semantic-algorithmic solution representation to capture both high-level semantic intent and low-level structural transformations of encryption-decryption logic. Building upon this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive and semantically informed mutation and crossover for efficiently exploring a highly structured and open-ended search space. Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies, achieving competitive performance with existing methods in both individual and ensemble level.
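The multi-objective core, keeping only candidates that are non-dominated in (attack effectiveness, output perplexity), can be illustrated with a minimal Pareto-front filter; the numeric candidates are invented for illustration:

```python
def pareto_front(candidates):
    """Non-dominated set for (attack_score: higher better, perplexity: lower better)."""
    front = []
    for i, (a_i, p_i) in enumerate(candidates):
        dominated = any(
            (a_j >= a_i and p_j <= p_i) and (a_j > a_i or p_j < p_i)
            for j, (a_j, p_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((a_i, p_i))
    return front

# (attack effectiveness, output perplexity) for hypothetical candidate prompts
cands = [(0.9, 40.0), (0.7, 12.0), (0.6, 30.0), (0.9, 25.0), (0.5, 50.0)]
print(pareto_front(cands))  # [(0.7, 12.0), (0.9, 25.0)]
```

In EvoJail this selection would sit inside an evolutionary loop whose mutation and crossover operators are LLM-assisted; the filter above only shows why the two objectives trade off rather than collapse into one score.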
[AI-6] Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture – Bridging Predictive and Generative Self-Supervised Learning
【速读】:该论文旨在解决当前联合嵌入预测架构(Joint-Embedding Predictive Architecture, JEPA)在表示学习中缺乏明确概率建模基础的问题,即JEPA虽常被视为非生成式方法,但其结构与变分推断框架下的潜变量模型存在本质关联。解决方案的关键在于提出变分JEPA(Variational JEPA, Var-JEPA),通过显式构建潜变量生成结构并优化单一证据下界(Evidence Lower Bound, ELBO),将JEPA从依赖架构和训练启发式正则化的确定性模型提升为具有理论保障的概率生成框架。该方法无需人为设计防坍缩(anti-collapse)正则项即可获得有意义的表示,并支持潜空间中的合理不确定性量化,同时在表格数据上实现了优于传统T-JEPA的下游性能表现。
链接: https://arxiv.org/abs/2603.20111
作者: Moritz Gögl,Christopher Yau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The Joint-Embedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizing prediction in representation space rather than reconstruction in observation space. We argue that the resulting separation from probabilistic generative modeling is largely rhetorical rather than structural: the canonical JEPA design, coupled encoders with a context-to-target predictor, mirrors the variational posteriors and learned conditional priors obtained when variational inference is applied to a particular class of coupled latent-variable models, and standard JEPA can be viewed as a deterministic specialization in which regularization is imposed via architectural and training heuristics rather than an explicit likelihood. Building on this view, we derive the Variational JEPA (Var-JEPA), which makes the latent generative structure explicit by optimizing a single Evidence Lower Bound (ELBO). This yields meaningful representations without ad-hoc anti-collapse regularizers and allows principled uncertainty quantification in the latent space. We instantiate the framework for tabular data (Var-T-JEPA) and achieve strong representation learning and downstream performance, consistently improving over T-JEPA while remaining competitive with strong raw-feature baselines.
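A minimal numeric sketch of the single-ELBO objective, assuming a diagonal-Gaussian posterior q and a learned conditional prior p; the closed-form KL below is the standard Gaussian identity, and the reconstruction term is a stub value rather than anything from the paper:

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over dims."""
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

# Hypothetical JEPA-style setup: the context branch yields a learned
# conditional prior p(z|context); the target branch yields q(z|target).
mu_q, var_q = [0.5, -0.2], [0.30, 0.40]
mu_p, var_p = [0.4,  0.0], [0.50, 0.50]

recon_loglik = -1.25                      # stub estimate of E_q[log p(x|z)]
elbo = recon_loglik - kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
print(round(elbo, 4))
```

The point of the variational view is that both terms come from one likelihood bound, so no separate anti-collapse regularizer is needed: the KL term alone penalizes a posterior that ignores the prior.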
[AI-7] The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理任务中因固定上下文窗口导致的性能瓶颈问题。现有递归语言模型(Recursive Language Models, RLMs)虽通过外部化提示并递归求解子问题缓解此问题,但其依赖开放式的读-求值-打印循环(REPL),生成任意控制代码,难以验证、预测和分析执行过程。论文提出 λ-RLM 框架,其核心创新在于将自由形式的递归代码生成替换为基于 λ-演算的类型化函数式运行时,仅在有限叶节点子问题上使用神经推理,并通过预验证的组合子库实现结构化的函数程序与显式控制流。该设计赋予系统形式化保证,包括终止性、可证明的时间复杂度边界、可控的精度随递归深度变化特性及最优分割规则,显著提升了长上下文推理的可靠性与效率。
链接: https://arxiv.org/abs/2603.20105
作者: Amartya Roy,Rasul Tutunov,Xiaotong Ji,Matthieu Zimmer,Haitham Bou-Ammar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce \lambda -RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in \lambda -calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that \lambda -RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, \lambda -RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of \lambda -RLM is open-sourced for the community at: this https URL.
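The contrast with free-form REPL code can be illustrated by a toy pre-verified combinator: the recursion structure is fixed code, and the only "neural" step is a bounded leaf call. The chunk limit and the leaf stub below are invented for illustration:

```python
from typing import Callable

LEAF_LIMIT = 4          # max chunk size a single "neural" leaf call may see
leaf_calls = 0          # counts bounded leaf-level model invocations

def leaf_solve(chunk: list[int]) -> int:
    """Stub for bounded neural inference on a leaf subproblem."""
    global leaf_calls
    assert len(chunk) <= LEAF_LIMIT
    leaf_calls += 1
    return sum(chunk)

def div_conquer(xs: list[int], combine: Callable[[int, int], int]) -> int:
    """Pre-verified combinator: the recursion is fixed code, not
    model-generated, so termination and cost are easy to bound."""
    if len(xs) <= LEAF_LIMIT:
        return leaf_solve(xs)
    mid = len(xs) // 2
    return combine(div_conquer(xs[:mid], combine),
                   div_conquer(xs[mid:], combine))

data = list(range(16))                 # a "long context" of 16 items
result = div_conquer(data, lambda a, b: a + b)
print(result, leaf_calls)              # 120 4  -> ceil(16/4) leaf calls
```

Because the combinator's shape is known in advance, the number of leaf calls (and hence cost) follows a closed-form recurrence, which is exactly the kind of guarantee an open-ended REPL cannot offer.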
[AI-8] Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
【速读】:该论文旨在解决连续环境中前向-后向(Forward-backward, FB)表示学习中因高秩转移动态与FB架构低秩瓶颈之间存在谱不匹配而导致的低秩表示学习困难问题。解决方案的关键在于引入时间抽象(temporal abstraction),通过将时间抽象建模为一种低通滤波器,抑制高频率的谱分量,从而降低诱导的 successor representation (SR) 的有效秩,同时保持值函数误差的理论边界。这一机制有效对齐了环境的谱结构与FB表示的低秩约束,显著提升了在高折扣因子下基于bootstrapping的FB学习的稳定性。
链接: https://arxiv.org/abs/2603.20103
作者: Seyed Mahdi B. Azad,Jasper Hoffmann,Iman Nematollahi,Hao Zhu,Abhinav Valada,Joschka Boedecker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
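The low-pass claim can be checked numerically: raising a transition matrix to the k-th power shrinks the non-dominant spectral components geometrically, collapsing the effective rank. A sketch with a random chain and a thresholded singular-value rank (the tolerance is an arbitrary choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random row-stochastic transition matrix for a 50-state chain
P = rng.random((50, 50))
P /= P.sum(axis=1, keepdims=True)

def effective_rank(M, tol=1e-3):
    """Number of singular values above tol relative to the largest."""
    s = np.linalg.svd(M, compute_uv=False)
    return int((s / s[0] > tol).sum())

# A k-step (temporally abstracted) transition operator is P^k; each power
# multiplies every sub-dominant eigenvalue's magnitude, so high-frequency
# spectral components are suppressed, i.e. a low-pass filter on the spectrum.
P8 = np.linalg.matrix_power(P, 8)
print(effective_rank(P), effective_rank(P8))
```

The 8-step operator is far closer to low rank than the 1-step one, which is what makes it a better match for the low-rank FB bottleneck.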
[AI-9] Pitfalls in Evaluating Interpretability Agents
【速读】:该论文旨在解决自动化可解释性系统(automated interpretability systems)在评估复杂性和规模增长时面临的挑战,尤其是当这些系统基于大语言模型(LLMs)实现更高自主性时,传统依赖人类专家标注的复制型评估方法存在局限性。其解决方案的关键在于提出一种无监督的内在评估机制,该机制基于模型组件的功能可互换性(functional interchangeability),从而绕过对人类专家解释的依赖,更准确地衡量自动化系统在电路分析任务中对模型内部机制的理解能力。
链接: https://arxiv.org/abs/2603.20101
作者: Tal Haklay,Nikhil Prakash,Sana Pandey,Antonio Torralba,Aaron Mueller,Jacob Andreas,Tamar Rott Shaham,Yonatan Belinkov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this challenge in the context of automated circuit analysis – explaining the roles of model components when performing specific tasks. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations across six circuit analysis tasks in the literature, the system appears competitive. However, closer examination reveals several pitfalls of replication-based evaluation: human expert explanations can be subjective or incomplete, outcome-based comparisons obscure the research process, and LLM-based systems may reproduce published findings via memorization or informed guessing. To address some of these pitfalls, we propose an unsupervised intrinsic evaluation based on the functional interchangeability of model components. Our work demonstrates fundamental challenges in evaluating complex automated interpretability systems and reveals key limitations of replication-based evaluation.
[AI-10] Agentic Harness for Real-World Compilers
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在修复编译器(Compiler)漏洞时面临的独特挑战,包括编译器漏洞的复杂性、跨领域专业知识要求高以及缺陷报告稀疏且描述不明确等问题。为弥合这一差距,作者提出了 llvm-autofix,其核心创新在于构建了一个面向代理(Agent)的专用工具链,包含友好的 LLVM 工具集、可复现的 LLVM 缺陷基准测试集 llvm-bench,以及一个轻量级代理 llvm-autofix-mini,专门用于修复 LLVM 编译器漏洞。实验表明,前沿 LLM 在处理编译器漏洞时性能比通用软件漏洞下降 60%,而 llvm-autofix-mini 相较于现有最优方法提升约 22%,验证了专用工具链对提升 LLM 在复杂系统如编译器工程中能力的关键作用。
链接: https://arxiv.org/abs/2603.20075
作者: Yingwei Zheng,Cong Li,Shaohua Li,Yuqun Zhang,Zhendong Su
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Compilers are critical to modern computing, yet fixing compiler bugs is difficult. While recent large language model (LLM) advancements enable automated bug repair, compiler bugs pose unique challenges due to their complexity, deep cross-domain expertise requirements, and sparse, non-descriptive bug reports, necessitating compiler-specific tools. To bridge the gap, we introduce llvm-autofix, the first agentic harness designed to assist LLM agents in understanding and fixing compiler bugs. Our focus is on LLVM, one of the most widely used compiler infrastructures. Central to llvm-autofix are agent-friendly LLVM tools, a benchmark llvm-bench of reproducible LLVM bugs, and a tailored minimal agent llvm-autofix-mini for fixing LLVM bugs. Our evaluation demonstrates a performance decline of 60% in frontier models when tackling compiler bugs compared with common software bugs. Our minimal agent llvm-autofix-mini also outperforms the state-of-the-art by approximately 22%. This emphasizes the necessity for specialized harnesses like ours to close the loop between LLMs and compiler engineering. We believe this work establishes a foundation for advancing LLM capabilities in complex systems like compilers. GitHub: this https URL
[AI-11] Fine-tuning Timeseries Predictors Using Reinforcement Learning
【速读】:该论文旨在解决如何通过强化学习(Reinforcement Learning, RL)算法对金融预测模型进行微调(fine-tuning),以提升其预测性能的问题。解决方案的关键在于提出了一种清晰的实现方案,将强化学习任务的损失函数反向传播至已通过监督学习(Supervised Learning)训练的模型,并在此基础上比较微调前后的性能差异。实证结果表明,微调后模型性能显著提升,并展现出迁移学习(Transfer Learning)特性,验证了该方法的有效性与实用性。
链接: https://arxiv.org/abs/2603.20063
作者: Hugo Cazaux,Ralph Rudd,Hlynur Stefánsson,Sverrir Ólafsson,Eyjólfur Ingi Ásgeirsson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This chapter presents three major reinforcement learning algorithms used for fine-tuning financial forecasters. We propose a clear implementation plan for backpropagating the loss of a reinforcement learning task to a model trained using supervised learning, and compare the performance before and after the fine-tuning. We find an increase in performance after fine-tuning, and transfer learning properties to the models, indicating the benefits of fine-tuning. We also highlight the tuning process and empirical results for future implementation by practitioners.
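A toy version of the two-stage recipe, supervised pretraining followed by REINFORCE fine-tuning of the same weights under a task reward, could look like the following; the linear Gaussian policy, the reward choice, and all hyperparameters are illustrative stand-ins, not the chapter's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

# --- Stage 1: supervised pretraining (plain least squares / MSE) ---
w = np.linalg.lstsq(X, y, rcond=None)[0]

def reward(pred, target):
    # Hypothetical task reward for the fine-tuning stage:
    # negative absolute forecast error.
    return -np.abs(pred - target)

# --- Stage 2: REINFORCE fine-tuning of the same weights ---
# Treat the forecaster as a Gaussian policy N(Xw, sigma^2); the policy
# gradient of E[R] is E[R * grad log pi] = E[R * (a - Xw) * X / sigma^2].
sigma, lr = 0.5, 0.01
for _ in range(300):
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx], y[idx]
    mu = xb @ w
    actions = mu + sigma * rng.normal(size=len(xb))
    R = reward(actions, yb)
    R = R - R.mean()                      # baseline for variance reduction
    grad = ((R * (actions - mu))[:, None] * xb).mean(axis=0) / sigma**2
    w = w + lr * grad                     # gradient *ascent* on reward

print(np.round(w, 2))
```

The key point is that the RL loss is backpropagated into weights that were originally fit by supervised learning, so the fine-tuned model starts from, and stays close to, the pretrained solution while adapting to the task reward.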
[AI-12] DIAL-KG: Schema-Free Incremental Knowledge Graph Construction via Dynamic Schema Induction and Evolution-Intent Assessment DASFAA2026
【速读】:该论文旨在解决传统知识图谱(Knowledge Graph, KG)构建方法在动态数据场景下的局限性,即静态构建方式难以适应实时数据流、重构成本高且预定义模式缺乏灵活性的问题。其解决方案的关键在于提出一个闭环的增量式知识图谱构建框架 DIAL KG,该框架以元知识库(Meta-Knowledge Base, MKB)为核心,通过三阶段循环实现高效、高质量的知识更新:(i) 双轨抽取机制确保知识完整性(默认采用三元组生成,复杂知识切换至事件抽取);(ii) 治理仲裁机制保障事实准确性与时效性,防止幻觉和知识过时;(iii) 模式演化机制从验证后的知识中自动归纳新schema,并将本轮知识增量融合至现有图谱,从而实现动态适应与持续优化。
链接: https://arxiv.org/abs/2603.20059
作者: Weidong Bao,Yilin Wang,Ruyu Gao,Fangling Leng,Yubin Bao,Ge Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to DASFAA 2026. 16 pages, 4 figures
Abstract:Knowledge Graphs (KGs) are foundational to applications such as search, question answering, and recommendation. Conventional knowledge graph construction methods are predominantly static, relying on a single-step construction from a fixed corpus with a predefined schema. However, such methods are suboptimal for real-world scenarios where data arrives dynamically, as incorporating new information requires complete and computationally expensive graph reconstructions. Furthermore, predefined schemas hinder the flexibility of knowledge graph construction. To address these limitations, we introduce DIAL-KG, a closed-loop framework for incremental KG construction orchestrated by a Meta-Knowledge Base (MKB). The framework operates in a three-stage cycle: (i) Dual-Track Extraction, which ensures knowledge completeness by defaulting to triple generation and switching to event extraction for complex knowledge; (ii) Governance Adjudication, which ensures the fidelity and currency of extracted facts to prevent hallucinations and knowledge staleness; and (iii) Schema Evolution, in which new schemas are induced from validated knowledge to guide subsequent construction cycles, and knowledge from the current round is incrementally applied to the existing KG. Extensive experiments demonstrate that our framework achieves state-of-the-art (SOTA) performance in the quality of both the constructed graph and the induced schemas.
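The three-stage cycle can be sketched as a skeleton; the word-count routing heuristic, governance filter, and schema set below are crude stand-ins for the paper's LLM-backed components:

```python
def dual_track_extract(doc: str) -> list[tuple]:
    """Track 1: default triple generation; Track 2: event extraction for
    complex input (stubbed here with a naive length heuristic)."""
    if len(doc.split()) > 8:
        return [("event", doc)]                    # complex -> event track
    s, p, o = doc.split()[:3]
    return [(s, p, o)]

def govern(facts, known_false=frozenset()):
    """Governance adjudication: drop facts flagged as hallucinated/stale."""
    return [f for f in facts if f not in known_false]

def induce_schema(facts, schema):
    """Schema evolution: register unseen predicates for the next cycle."""
    for f in facts:
        if len(f) == 3:
            schema.add(f[1])
    return schema

kg, schema = [], set()
for doc in ["Alice founded AcmeCorp",
            "AcmeCorp acquired a logistics startup in 2024 after a long bidding war"]:
    facts = govern(dual_track_extract(doc))
    schema = induce_schema(facts, schema)
    kg.extend(facts)                               # incremental update

print(kg, schema)
```

The loop shape is what matters: each round extracts, adjudicates, evolves the schema, and merges into the existing graph, so new documents never trigger a full reconstruction.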
[AI-13] Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)推理能力时,因探索受限于当前策略分布而导致的效率低下问题。其核心挑战在于如何引导模型从失败轨迹中学习,并有效扩展到未覆盖的高质量响应空间。解决方案的关键在于提出HeRL(Hindsight experience guided Reinforcement Learning)框架,通过将未满足评分标准(rubric)的失败轨迹作为“事后经验”(hindsight experience),将其作为上下文引导信号,显式告诉LLM期望的行为模式;同时引入奖励奖励(bonus reward)以激励具有改进潜力的响应,从而实现更精准的梯度估计和更高效的探索。
链接: https://arxiv.org/abs/2603.20046
作者: Wenjian Zhang,Kongcheng Zhang,Jiaxin Qi,Baisheng Lai,Jianqiang Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to the current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework to bootstrap effective exploration by explicitly telling LLMs the desired behaviors specified in rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high quality samples without repeated trial-and-error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience guided self-improvement at test time. Our code is available at this https URL.
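A minimal illustration of turning a failed trajectory plus its unmet rubrics into in-context guidance, with a bonus reward for improvement under guidance; the rubric format, prompt template, and bonus value are all invented for illustration:

```python
def unmet_rubrics(answer: str, rubrics: dict) -> list[str]:
    """Rubrics here are (name -> required substring) checks, a crude
    stand-in for rubric-based reward criteria."""
    return [name for name, req in rubrics.items() if req not in answer]

def build_hindsight_prompt(question, failed_answer, missing):
    """Failed trajectory + its unmet rubrics become in-context guidance."""
    hints = "; ".join(missing)
    return (f"{question}\nPrevious attempt: {failed_answer}\n"
            f"It failed these criteria: {hints}. Address them explicitly.")

def reward(answer, rubrics, guided=False, bonus=0.2):
    base = 1.0 - len(unmet_rubrics(answer, rubrics)) / len(rubrics)
    # Bonus for responses that improve under guidance.
    return base + (bonus if guided and base > 0.5 else 0.0)

rubrics = {"unit": "meters", "value": "42"}
fail = "The distance is 42."
missing = unmet_rubrics(fail, rubrics)          # ['unit']
prompt = build_hindsight_prompt("How far is it?", fail, missing)
retry = "The distance is 42 meters."
print(missing, reward(retry, rubrics, guided=True))
```

The guidance prompt makes the desired behavior explicit instead of leaving the policy to rediscover it by trial and error, which is the mechanism the abstract describes.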
[AI-14] Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)在软件工程领域研究多集中于个体任务完成,而缺乏对团队级交付过程影响的实证证据这一问题。其解决方案的关键在于通过一个纵向实地研究,系统评估了一个工业级平台 Chiron 在四个软件交付阶段(分析、规划、实现与验证)中协调人类与AI代理的效果,发现当AI被嵌入到协同工作流中时,相较于作为孤立编码助手部署,能显著提升交付速度、代码覆盖率并降低验证阶段的问题密度,从而证明了AI在组织化流程中的集成价值远大于单点工具应用。
链接: https://arxiv.org/abs/2603.20028
作者: Maximiliano Armesto,Christophe Kolb
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 12 tables
Abstract:Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce. We report a retrospective longitudinal field study of Chiron, an industrial platform that coordinates humans and AI agents across four delivery stages: analysis, planning, implementation, and validation. The study covers three real software modernization programs – a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC) – observed across five delivery configurations: a traditional baseline and four successive platform versions (V1–V4). The benchmark separates observed outcomes (stage durations, task volumes, validation-stage issues, first-release coverage) from modeled outcomes (person-days and senior-equivalent effort under explicit staffing scenarios). Under baseline staffing assumptions, portfolio totals move from 36.0 to 9.3 summed project-weeks; modeled raw effort falls from 1080.0 to 232.5 person-days; modeled senior-equivalent effort falls from 1080.0 to 139.5 SEE-days; validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks; and first-release coverage rises from 77.0% to 90.5%. V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, simultaneously improving speed, coverage, and issue load. The evidence supports a central thesis: the largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.
[AI-15] Trojans Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance
【速读】:该论文旨在解决自主编码代理(Autonomous Coding Agents)在集成到软件开发工作流中时所面临的新颖攻击面问题,特别是由引导注入(guidance injection)引发的隐蔽性安全威胁。传统提示注入攻击依赖显式恶意指令,而引导注入则通过将有害操作伪装为常规最佳实践,嵌入到启动引导文件中,从而在不触发警报的情况下操纵代理的行为。解决方案的关键在于识别并系统化地表征这一攻击向量,并通过构建 ORE-Bench 基准测试平台验证其有效性——实验证明,此类攻击在多种主流大语言模型(LLM)后端上成功率可达 16.0% 至 64.2%,且绝大多数恶意行为无需用户确认即可自动执行,同时 94% 的恶意技能可规避现有静态和基于 LLM 的检测工具。研究揭示了自主代理生态系统设计中的根本性矛盾,强调需引入能力隔离、运行时策略强制与透明引导溯源等防御机制以应对此类风险。
链接: https://arxiv.org/abs/2603.19974
作者: Fazhong Liu,Zhuoyan Chen,Tu Lan,Haozhen Tan,Zhenyu Xu,Xiang Li,Guoxing Chen,Yan Meng,Haojin Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous coding agents are increasingly integrated into software development workflows, offering capabilities that extend beyond code suggestion to active system interaction and environment management. OpenClaw, a representative platform in this emerging paradigm, introduces an extensible skill ecosystem that allows third-party developers to inject behavioral guidance through lifecycle hooks during agent initialization. While this design enhances automation and customization, it also opens a novel and unexplored attack surface. In this paper, we identify and systematically characterize guidance injection, a stealthy attack vector that embeds adversarial operational narratives into bootstrap guidance files. Unlike traditional prompt injection, which relies on explicit malicious instructions, guidance injection manipulates the agent’s reasoning context by framing harmful actions as routine best practices. These narratives are automatically incorporated into the agent’s interpretive framework and influence future task execution without raising suspicion. We construct 26 malicious skills spanning 13 attack categories including credential exfiltration, workspace destruction, privilege escalation, and persistent backdoor installation. We evaluate them using ORE-Bench, a realistic developer workspace benchmark we developed. Across 52 natural user prompts and six state-of-the-art LLM backends, our attacks achieve success rates from 16.0% to 64.2%, with the majority of malicious actions executed autonomously without user confirmation. Furthermore, 94% of our malicious skills evade detection by existing static and LLM-based scanners. Our findings reveal fundamental tensions in the design of autonomous agent ecosystems and underscore the urgent need for defenses based on capability isolation, runtime policy enforcement, and transparent guidance provenance.
[AI-16] Graph2TS: Structure-Controlled Time Series Generation via Quantile-Graph VAEs
【速读】:该论文旨在解决生成式时间序列模型在保持全局时间结构与建模局部随机波动之间的根本性矛盾,尤其是在高波动性、弱周期性或非规则周期信号中,直接进行分布匹配容易放大噪声或抑制有意义的时间模式。其解决方案的关键在于提出“结构-残差”视角,将时间序列视为结构主干(structural backbone)与随机残差动态(stochastic residual dynamics)的组合,从而实现全局组织信息与样本级变异性的分离;在此基础上,作者构建了一个基于分位数转换图(quantile-based transition graph)的结构表示,并设计了Graph2TS——一种以量化图条件驱动的变分自编码器,通过结构条件而非标签或元数据引导生成过程,在保留全局时间组织的同时支持可控的随机变化,显著提升了分布保真度、时间对齐性和代表性。
链接: https://arxiv.org/abs/2603.19970
作者: Shaoshuai Du,Joze M. Rozanec,Andy Pimentel,Ana-Lucia Varbanescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although recent generative models can produce time series with close marginal distributions, they often face a fundamental tension between preserving global temporal structure and modeling stochastic local variations, particularly for highly volatile signals with weak or irregular periodicity. Direct distribution matching in such settings can amplify noise or suppress meaningful temporal patterns. In this work, we propose a structure-residual perspective on time-series generation, viewing temporal data as the combination of a structural backbone and stochastic residual dynamics, thereby motivating the separation of global organization from sample-level variability. Based on this insight, we represent time-series structure using a quantile-based transition graph that compactly captures global distributional and temporal dependencies. Building on this representation, we propose Graph2TS, a quantile-graph conditioned variational autoencoder that performs cross-modal generation from structural graphs to time series. By conditioning generation on structure rather than labels or metadata, the model preserves global temporal organization while enabling controlled stochastic variation. Experiments on diverse datasets, including sunspot, electricity load, ECG, and EEG signals, demonstrate improved distributional fidelity, temporal alignment, and representativeness compared to diffusion- and GAN-based baselines, highlighting structure-controlled and cross-modal generation as a promising direction for time-series modeling.
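Building the quantile-based transition graph itself is straightforward; a sketch for a univariate series (the number of bins is an arbitrary choice, not the paper's setting):

```python
import numpy as np

def quantile_transition_graph(x: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Discretize a series into quantile bins and count bin-to-bin
    transitions, giving a row-stochastic structural graph."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(x, edges)          # bin index per time step
    G = np.zeros((n_bins, n_bins))
    for a, b in zip(states[:-1], states[1:]):
        G[a, b] += 1
    G /= np.maximum(G.sum(axis=1, keepdims=True), 1)  # normalize rows
    return G

t = np.linspace(0, 8 * np.pi, 400)
G = quantile_transition_graph(np.sin(t))
print(np.round(G, 2))
```

For a smooth signal like the sine wave, mass concentrates on and near the diagonal (adjacent-bin transitions), which is exactly the global temporal organization the paper conditions generation on.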
[AI-17] Revealing Domain-Spatiality Patterns for Configuration Tuning: Domain Knowledge Meets Fitness Landscapes
【速读】:该论文旨在解决配置调优(configuration tuning)中 tuner 效果难以解释的问题,尤其是由于可配置系统具有黑箱特性,导致现有方法在通用性和可解释性方面存在局限。为应对这一挑战,作者提出了一种名为 Domland 的两阶段方法,其核心在于将 Fitness Landscape Analysis (FLA) 与领域驱动分析相结合,通过提取 FLA 中的空间信息和领域知识,系统性地揭示配置调优案例的隐藏特征,从而解释 tuner 成功或失败的原因。该方案的关键创新在于利用 FLA 作为桥梁,将系统结构与调优难度关联起来,实现了对调优行为的更好理解和指导调优器设计。
链接: https://arxiv.org/abs/2603.19897
作者: Yulong Ye,Hongyuan Liang,Chao Jiang,Miqing Li,Tao Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM)
Abstract:Configuration tuning for better performance is crucial in quality assurance. Yet, there has long been a mystery on tuners’ effectiveness, due to the black-box nature of configurable systems. Prior efforts predominantly adopt static domain analysis (e.g., static taint analysis), which often lacks generalizability, or dynamic data analysis (e.g., benchmarking performance analysis), limiting explainability. In this work, we embrace Fitness Landscape Analysis (FLA) as a bridge between domain knowledge and difficulty of the tuning. We propose Domland, a two-pronged methodology that synergizes the spatial information obtained from FLA and domain-driven analysis to systematically capture the hidden characteristics of configuration tuning cases, explaining how and why a tuner might succeed or fail. This helps to better interpret and contextualize the behavior of tuners and inform tuner design. To evaluate Domland, we conduct a case study of nine software systems and 93 workloads, from which we reveal several key findings: (1) configuration landscapes are inherently system-specific, with no single domain factor (e.g., system area, programming language, or resource intensity) consistently shaping their structure; (2) the core options (e.g., pic-struct of x264), which control the main functional flows, exert a stronger influence on landscape ruggedness (i.e. the difficulty of tuning) compared to resource options (e.g., cpu-independent of x264); (3) Workload effects on landscape structure are not uniformly tied to type or scale. Both contribute to landscape variations, but their impact is system-dependent.
[AI-18] Utility-Guided Agent Orchestration for Efficient LLM Tool Use
【速读】:该论文旨在解决工具使用型大语言模型(Tool-using Large Language Model, LLM)代理在回答质量与执行成本之间的权衡问题。现有方法如固定工作流虽稳定但缺乏灵活性,而自由形式的多步推理方法(如ReAct)虽可能提升任务性能,却往往导致过多工具调用、更长轨迹、高Token消耗和延迟增加。其解决方案的关键在于将代理调度(orchestration)建模为一个显式的决策问题,提出一种基于效用的调度策略,通过平衡估计收益、步骤成本、不确定性与冗余度,动态选择“响应”、“检索”、“工具调用”、“验证”或“停止”等动作。该策略不追求绝对最优性能,而是提供一个可控且可分析的框架,用于研究工具使用型LLM代理中的质量-成本权衡,实验证明显式调度信号显著影响代理行为,且轻量级效用设计具备实用性与可解释性。
链接: https://arxiv.org/abs/2603.19896
作者: Boyan Liu,Gongming Zhao,Hongli Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but inflexible, while free-form multi-step reasoning methods such as ReAct may improve task performance at the expense of excessive tool calls, longer trajectories, higher token consumption, and increased latency. In this paper, we study agent orchestration as an explicit decision problem rather than leaving it entirely to prompt-level behavior. We propose a utility-guided orchestration policy that selects among actions such as respond, retrieve, tool call, verify, and stop by balancing estimated gain, step cost, uncertainty, and redundancy. Our goal is not to claim universally best task performance, but to provide a controllable and analyzable policy framework for studying quality-cost trade-offs in tool-using LLM agents. Experiments across direct answering, threshold control, fixed workflows, ReAct, and several policy variants show that explicit orchestration signals substantially affect agent behavior. Additional analyses on cost definitions, workflow fairness, and redundancy control further demonstrate that lightweight utility design can provide a defensible and practical mechanism for agent control.
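The utility rule, estimated gain minus step cost with uncertainty and redundancy penalties, plus a stop rule when no action is expected to pay off, can be sketched directly; the weights and per-action estimates are illustrative, not from the paper:

```python
def utility(gain, cost, uncertainty, redundancy, lam_u=0.5, lam_r=0.5):
    """Score = estimated gain minus step cost, penalized by uncertainty
    and redundancy (penalty weights are illustrative)."""
    return gain - cost - lam_u * uncertainty - lam_r * redundancy

def choose(estimates):
    """estimates: action -> (gain, cost, uncertainty, redundancy)."""
    scored = {a: utility(*estimates[a]) for a in estimates}
    best = max(scored, key=scored.get)
    # Stop when no remaining action is expected to pay for its own cost.
    return "stop" if scored[best] <= 0 else best

est = {
    "respond":   (0.6, 0.1, 0.4, 0.0),
    "retrieve":  (0.8, 0.2, 0.2, 0.1),
    "tool_call": (0.9, 0.5, 0.3, 0.6),
    "verify":    (0.3, 0.1, 0.1, 0.0),
}
print(choose(est))  # retrieve: 0.8 - 0.2 - 0.1 - 0.05 = 0.45 is the max
```

Note how the redundancy penalty demotes the raw-gain leader (tool_call): this is the explicit control lever that a free-form ReAct loop lacks.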
[AI-19] Integrating Meta-Features with Knowledge Graph Embeddings for Meta-Learning
【速读】:该论文旨在解决传统元学习方法在管道性能估计(Pipeline Performance Estimation, PPE)和基于数据集性能的相似性估计(Dataset Performance-based Similarity Estimation, DPSE)任务中,因仅依赖静态数据集元特征(如实例数量、类别熵等)而忽略大量历史实验结果与管道元数据的问题。这种局限性导致难以捕捉数据集与机器学习管道之间的复杂交互关系,从而影响性能预测和相似数据集识别的准确性。其解决方案的关键在于提出KGmetaSP,一种基于知识图谱嵌入(Knowledge Graph Embeddings)的方法:通过构建统一的知识图谱(Knowledge Graph, KG),将数据集和机器学习管道共同表示为图结构,并从中学习能够支持无特定管道依赖的元模型(用于PPE)以及基于距离检索的相似性匹配机制(用于DPSE),从而更有效地利用开放实验数据中的隐含交互信息,显著提升两类元学习任务的性能。
链接: https://arxiv.org/abs/2603.19888
作者: Antonis Klironomos,Ioannis Dasoulas,Francesco Periti,Mohamed Gad-Elrab,Heiko Paulheim,Anastasia Dimou,Evgeny Kharlamov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The vast collection of machine learning records available on the web presents a significant opportunity for meta-learning, where past experiments are leveraged to improve performance. Two crucial meta-learning tasks are pipeline performance estimation (PPE), which predicts pipeline performance on target datasets, and dataset performance-based similarity estimation (DPSE), which identifies datasets with similar performance patterns. Existing approaches primarily rely on dataset meta-features (e.g., number of instances, class entropy, etc.) to represent datasets numerically and approximate these meta-learning tasks. However, these approaches often overlook the wealth of past experimental results and pipeline metadata available. This limits their ability to capture dataset - pipeline interactions that reveal performance similarity patterns. In this work, we propose KGmetaSP, a knowledge-graph-embeddings approach that leverages existing experiment data to capture these interactions and improve both PPE and DPSE. We represent datasets and pipelines within a unified knowledge graph (KG) and derive embeddings that support pipeline-agnostic meta-models for PPE and distance-based retrieval for DPSE. To validate our approach, we construct a large-scale benchmark comprising 144,177 OpenML experiments, enabling a rich cross-dataset evaluation. KGmetaSP enables accurate PPE using a single pipeline-agnostic meta-model and improves DPSE over baselines. The proposed KGmetaSP, KG, and benchmark are released, establishing a new reference point for meta-learning and demonstrating how consolidating open experiment data into a unified KG advances the field.
[AI-20] What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
【速读】:该论文旨在解决测试时强化学习(Test-Time Reinforcement Learning, TTRL)中因依赖单一正向伪标签策略而导致的标签噪声放大问题,尤其在答案分布高度分散、共识较弱的场景下,易将错误轨迹误判为有效监督信号。解决方案的关键在于提出SCRL(Selective-Complementary Reinforcement Learning)框架:一方面通过选择性正向伪标签(Selective Positive Pseudo-Labeling)引入严格共识标准以过滤不可靠的多数投票;另一方面创新性地提出熵门控负向伪标签(Entropy-Gated Negative Pseudo-Labeling),首次在TTRL中引入负向监督机制,基于生成不确定性可靠地剔除错误轨迹,从而显著提升模型鲁棒性和训练稳定性。
链接: https://arxiv.org/abs/2603.19880
作者: Dong Yan,Jian Liang,Yanbo Wang,Shuo Lu,Ran He,Tieniu Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures
Abstract:Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at this https URL.
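Both labeling rules can be sketched over a batch of sampled answers for one prompt: a strict consensus threshold gates positives, and an entropy gate decides whether minority answers are confidently pruned as negatives (the thresholds below are arbitrary illustrations):

```python
from collections import Counter
from math import log

def label_rollouts(answers, pos_thresh=0.6, ent_gate=1.0):
    """Selective positive + entropy-gated negative pseudo-labeling over
    the sampled final answers for one prompt."""
    counts = Counter(answers)
    n = len(answers)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    top, top_c = counts.most_common(1)[0]

    positives, negatives = set(), set()
    if top_c / n >= pos_thresh:              # strict consensus criterion
        positives.add(top)
    if entropy <= ent_gate:                  # confident enough to prune
        negatives = {a for a in counts if a != top}
    return positives, negatives, entropy

# Strong consensus: positive label granted, minority answers pruned.
print(label_rollouts(["42"] * 7 + ["41"]))
# Dispersed answers: weak consensus -> no positive, no pruning.
print(label_rollouts(["a", "b", "c", "d", "a", "e", "f", "g"]))
```

The second case is the failure mode the abstract targets: plain majority voting would still reinforce "a" despite the dispersed distribution, whereas the gated rules abstain.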
[AI-21] FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization
[Quick Read]: This paper addresses the mismatch between semantic consistency and prover effectiveness in autoformalization: even semantically consistent formalizations can differ substantially in proof-search cost and success rate. The key is FormalEvolve, a compilation-gated neuro-symbolic evolutionary framework that performs a test-time search for semantically consistent candidate repertoires under a strict generator-call budget of T = 100, combining LLM-driven mutation and crossover, bounded patch repair, and symbolic Abstract Syntax Tree (AST) rewrite operations, thereby improving formalization quality and downstream proving performance while reducing the cross-problem concentration (lower Gini coefficient) of semantic successes.
Link: https://arxiv.org/abs/2603.19828
Authors: Haijian Lu (School of Artificial Intelligence, Xidian University; Beijing Institute for General Artificial Intelligence), Wei Wang (Beijing Institute for General Artificial Intelligence), Jing Liu (School of Artificial Intelligence, Xidian University)
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages, 12 figures
Abstract:Autoformalization aims to translate natural-language mathematics into compilable, machine-checkable statements. However, semantic consistency does not imply prover effectiveness: even semantically consistent formalizations can differ substantially in proof-search cost and success rate. In this work, we formulate autoformalization as a budgeted, test-time search for semantically consistent repertoires, and propose FormalEvolve, a compilation-gated neuro-symbolic evolutionary framework. FormalEvolve generates diverse candidates via LLM-driven mutation and crossover with bounded patch repair, while symbolic Abstract Syntax Tree (AST) rewrite operations further inject structural diversity. On CombiBench and ProofNet, under a strict generator-call budget of T = 100, FormalEvolve reaches semantic hit rates (SH@100) of 58.0% and 84.9%, and reduces cross-problem concentration of semantic successes (lower Gini). Under a fixed prover budget, FormalEvolve also improves downstream proving performance on CombiBench. Code will be released publicly.
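The abstract uses the Gini coefficient to measure how concentrated semantic successes are across problems. A minimal sketch of that metric over per-problem success counts (the exact accounting in the paper may differ):

```python
def gini(xs):
    """Gini coefficient of non-negative values: 0 means successes are
    spread perfectly evenly across problems; values near 1 mean they
    are concentrated on a few problems."""
    xs = sorted(xs)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formulation via the rank-weighted cumulative sum.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```

For example, four problems with one success each give a Gini of 0, while all four successes landing on one problem give 0.75.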
[AI-22] Embodied Science: Closing the Discovery Loop with Agentic Embodied AI
[Quick Read]: This paper addresses the disconnect between prediction and validation that arises when AI for scientific discovery is detached from physical experimentation, i.e., when discovery is framed as isolated, task-specific prediction rather than continuous interaction with the physical world. The key is an "embodied science" paradigm built around a unified Perception-Language-Action-Discovery (PLAD) framework, in which agents perceive experimental environments, reason over scientific knowledge, execute physical interventions, and internalize outcomes, forming a closed loop that couples digital reasoning with empirical validation.
Link: https://arxiv.org/abs/2603.19782
Authors: Xiang Zhuang, Chenyi Zhou, Kehua Feng, Zhihui Zhu, Yunfan Gao, Yijie Zhong, Yichi Zhang, Junjie Huang, Keyan Ding, Lei Bai, Haofen Wang, Qiang Zhang, Huajun Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Work in progress
Abstract:Artificial intelligence has demonstrated remarkable capability in predicting scientific properties, yet scientific discovery remains an inherently physical, long-horizon pursuit governed by experimental cycles. Most current computational approaches are misaligned with this reality, framing discovery as isolated, task-specific predictions rather than continuous interaction with the physical world. Here, we argue for embodied science, a paradigm that reframes scientific discovery as a closed loop tightly coupling agentic reasoning with physical execution. We propose a unified Perception-Language-Action-Discovery (PLAD) framework, wherein embodied agents perceive experimental environments, reason over scientific knowledge, execute physical interventions, and internalize outcomes to drive subsequent exploration. By grounding computational reasoning in robust physical feedback, this approach bridges the gap between digital prediction and empirical validation, offering a roadmap for autonomous discovery systems in the life and chemical sciences.
[AI-23] FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
[Quick Read]: This paper tackles the performance degradation of federated learning (FL) under client data heterogeneity and noisy annotations in distributed scenarios. Existing methods mostly rely on scalar loss values to identify noisy samples, which is unreliable under non-IID data. The key is to rethink noisy-label identification from a representation-geometry perspective: the proposed FedRG follows a "representation geometry priority" principle, first building label-agnostic spherical representations via self-supervision, then iteratively fitting a spherical von Mises-Fisher (vMF) mixture model to capture semantic clusters, combining it with a semantic-label soft mapping that uses the distribution divergence between label-free and annotation-conditioned feature spaces to robustly identify noisy samples and update the vMF model, and finally applying a personalized noise-absorption matrix to noisy labels for robust optimization, improving robustness and accuracy under diverse noisy-client scenarios.
Link: https://arxiv.org/abs/2603.19722
Authors: Tian Wen, Zhiqin Yang, Yonggang Zhang, Xuefeng Jiang, Hao Peng, Yuwei Wang, Bo Han
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: conference
Abstract:Federated learning (FL) suffers from performance degradation due to the inevitable presence of noisy annotations in distributed scenarios. Existing approaches have advanced in distinguishing noisy samples from the dataset for label correction by leveraging loss values. However, noisy sample recognition relying on scalar loss lacks reliability for FL under heterogeneous scenarios. In this paper, we rethink this paradigm from a representation perspective and propose FedRG (Federated under Representation Geometry), which follows the principle of "representation geometry priority" to recognize noisy labels. Firstly, FedRG creates label-agnostic spherical representations by using self-supervision. It then iteratively fits a spherical von Mises-Fisher (vMF) mixture model to this geometry using previously identified clean samples to capture semantic clusters. This geometric evidence is integrated with a semantic-label soft mapping mechanism to derive a distribution divergence between the label-free and annotated label-conditioned feature spaces, which robustly identifies noisy samples and updates the vMF mixture model with the newly separated clean dataset. Lastly, we employ an additional personalized noise absorption matrix on noisy labels to achieve robust optimization. Extensive experimental results demonstrate that FedRG significantly outperforms state-of-the-art methods for FL with data heterogeneity under diverse noisy-client scenarios.
[AI-24] Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification
[Quick Read]: This paper addresses the heavy manual effort and limited scalability of constructing large proof scripts in formal verification, especially for systems-level verification projects. The core solution is a neuro-symbolic proof-generation framework that combines the reasoning ability of large language models (LLMs) with the semantic precision of interactive theorem proving (ITP) tools: on the neural side, LLMs are fine-tuned on fine-grained proof state-step pairs to improve step prediction; on the symbolic side, ITP tools repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search stalls, enabling data-efficient LLM adaptation and semantics-informed pruning of the search space. Implemented and evaluated on a new Isabelle REPL, the system proves 77.6% of theorems on the seL4 FVEL benchmark, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, and generalizes well, offering a viable path toward scalable automated software verification.
Link: https://arxiv.org/abs/2603.19715
Authors: Baoding He, Zenan Li, Wei Sun, Yuan Yao, Taolue Chen, Xiaoxing Ma, Zhendong Su
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.
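The best-first loop described in this abstract can be sketched generically. Here `propose` stands in for the fine-tuned LLM plus the symbolic repair and ranking tools, and the toy usage searches a trivial integer space rather than real Isabelle proof states:

```python
import heapq

def best_first_search(init_state, propose, accepted, max_expansions=100):
    """Best-first search over proof states. `propose(state)` returns
    (step_cost, next_state) candidates; states with the lowest cumulative
    cost are expanded first, and `accepted` checks whether a state closes
    the proof."""
    counter = 0  # tie-breaker so heapq never compares states directly
    frontier = [(0.0, counter, init_state)]
    while frontier and max_expansions > 0:
        cost, _, state = heapq.heappop(frontier)
        if accepted(state):
            return state
        max_expansions -= 1
        for step_cost, nxt in propose(state):
            counter += 1
            heapq.heappush(frontier, (cost + step_cost, counter, nxt))
    return None  # budget exhausted without a complete proof
```

With a toy proposer that offers a cheap "+2" step and an expensive "+1" step, the search reaches the goal state along the cheaper path first.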
[AI-25] The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
[Quick Read]: This paper challenges the assumption that transformer inference must depend on the key-value (KV) cache, which typically consumes large amounts of memory and motivates elaborate compression or eviction policies. The finding is that the KV cache is redundant: keys and values at every layer are deterministic projections of the residual stream, so storing a single residual vector per token enables bit-identical reconstruction with no approximation. The key contribution is KV-Direct, a bounded-memory inference scheme based on residual-stream checkpointing that stores 5 KB of residual vectors per token instead of full KV pairs (136 KB per token) and recomputes keys and values on demand. Over 20 conversation turns, peak memory drops from 103 MB to 42 MB while maintaining 100% token match on all tested models, clearly outperforming five popular eviction baselines (which degrade to 5-28%).
Link: https://arxiv.org/abs/2603.19664
Authors: Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14
Abstract:The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at this https URL.
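The paper's central claim, that keys and values are deterministic projections of the residual stream and therefore exactly reconstructible, can be illustrated with a toy pure-Python projection. The weights and sizes below are made up; a real layer would use the model's W_K and W_V matrices:

```python
def matmul(X, W):
    """Minimal matrix multiply over nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

# Toy layer weights (hypothetical; in a real model these are W_K, W_V).
W_k = [[1.0, 0.0], [0.0, 2.0]]
W_v = [[0.5, 0.5], [1.0, -1.0]]

residuals = [[3.0, 4.0], [1.0, 2.0]]  # one checkpointed vector per token

# A standard cache stores K and V per layer; KV-Direct stores only
# `residuals` and recomputes K and V on demand. Because K and V are
# deterministic projections of the residual, recomputation reproduces
# the cached tensors exactly, not approximately.
K_cached = matmul(residuals, W_k)
K_recomputed = matmul(residuals, W_k)
assert K_cached == K_recomputed  # identical, no reconstruction error
```

The memory argument follows directly: the cache must hold K and V at every layer, while the residual checkpoint is a single vector per token regardless of depth.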
[AI-26] PolicySim: An LLM-Based Agent Social Simulation Sandbox for Proactive Policy Optimization
[Quick Read]: This paper addresses the risk that social-platform intervention policies (e.g., recommendation and content filtering) may unintentionally amplify echo chambers and polarization after deployment, while existing practice relies on reactive online A/B testing, making risk identification delayed and costly. The key is PolicySim, an LLM-based social simulation sandbox for proactive assessment and optimization of intervention policies, which models the bidirectional dynamics between users and the platform: a user-agent module refined via supervised fine-tuning (SFT) and direct preference optimization (DPO) for platform-specific behavioral realism, and an adaptive intervention module using a contextual bandit with message passing to capture dynamic network structure, enabling accurate simulation of platform ecosystems at both micro and macro levels and supporting effective intervention-policy design.
Link: https://arxiv.org/abs/2603.19649
Authors: Renhong Huang, Ning Tang, Jiarong Xu, Yuxuan Cao, Qingqian Tu, Sheng Guo, Bo Zheng, Huiyuan Liu, Yang Yang
Affiliation: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:Social platforms serve as central hubs for information exchange, where user behaviors and platform interventions jointly shape opinions. However, intervention policies like recommendation and content filtering, can unintentionally amplify echo chambers and polarization, posing significant societal risks. Proactively evaluating the impact of such policies is therefore crucial. Existing approaches primarily rely on reactive online A/B testing, where risks are identified only after deployment, making risk identification delayed and costly. LLM-based social simulations offer a promising pre-deployment alternative, but current methods fall short in realistically modeling platform interventions and incorporating feedback from the platform. Bridging these gaps is essential for building actionable frameworks to assess and optimize platform policies. To this end, we propose PolicySim, an LLM-based social simulation sandbox for the proactive assessment and optimization of intervention policies. PolicySim models the bidirectional dynamics between user behavior and platform interventions through two key components: (1) a user agent module refined via supervised fine-tuning (SFT) and direct preference optimization (DPO) to achieve platform-specific behavioral realism; and (2) an adaptive intervention module that employs a contextual bandit with message passing to capture dynamic network structures. Experiments show that PolicySim can accurately simulate platform ecosystems at both micro and macro levels and support effective intervention policy.
[AI-27] HyEvo: Self-Evolving Hybrid Agentic Workflows for Efficient Reasoning
[Quick Read]: This paper addresses the inefficiency and underperformance of current automated agentic-workflow generation, which relies on predefined operator libraries and homogeneous LLM-only workflows in which all task-level computation goes through probabilistic inference, incurring high inference cost and execution latency. The key is HyEvo, a framework based on heterogeneous atomic synthesis that combines probabilistic LLM nodes (for semantic reasoning) with deterministic code nodes (for rule-based execution), offloading predictable operations from LLM inference; to navigate the hybrid search space efficiently, HyEvo employs an LLM-driven multi-island evolutionary strategy with a reflect-then-generate mechanism that iteratively refines both workflow topology and node logic from execution feedback.
Link: https://arxiv.org/abs/2603.19639
Authors: Beibei Xu, Yutong Ye, Chuyun Shen, Yingbo Zhou, Cheng Chen, Mingsong Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Although agentic workflows have demonstrated strong potential for solving complex tasks, existing automated generation methods remain inefficient and underperform, as they rely on predefined operator libraries and homogeneous LLM-only workflows in which all task-level computation is performed through probabilistic inference. To address these limitations, we propose HyEvo, an automated workflow-generation framework that leverages heterogeneous atomic synthesis. HyEvo integrates probabilistic LLM nodes for semantic reasoning with deterministic code nodes for rule-based execution, offloading predictable operations from LLM inference and reducing inference cost and execution latency. To efficiently navigate the hybrid search space, HyEvo employs an LLM-driven multi-island evolutionary strategy with a reflect-then-generate mechanism, iteratively refining both workflow topology and node logic via execution feedback. Comprehensive experiments show that HyEvo consistently outperforms existing methods across diverse reasoning and coding benchmarks, while reducing inference cost and execution latency by up to 19× and 16×, respectively, compared to the state-of-the-art open-source baseline.
[AI-28] DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management
[Quick Read]: This paper addresses the high hyperparameter sensitivity, unstable training, and inconsistent performance of deep reinforcement learning (DRL) applied to inventory management. The key is to impose policy regularizations grounded in classical inventory theory (e.g., the "Base Stock" policy), constraining the DRL policy space to improve robustness to hyperparameter choices, which significantly accelerates hyperparameter tuning and improves final performance. Empirical results include a 100% deployment on Alibaba's Tmall e-commerce platform, along with extensive synthetic experiments that reshape the narrative on which DRL method is best for inventory management.
Link: https://arxiv.org/abs/2603.19621
Authors: Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as “Base Stock”, we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba’s e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.
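The "Base Stock" regularization mentioned in the abstract can be sketched as follows: instead of emitting an arbitrary order quantity, the policy network outputs a base-stock level and the action is constrained to the classical order-up-to rule. This is a minimal sketch; the paper's exact parameterization may differ:

```python
def base_stock_order(inventory_position, base_stock_level):
    """Classical base-stock rule: order up to level S, never below zero."""
    return max(0.0, base_stock_level - inventory_position)

def regularized_action(nn_output, inventory_position):
    """Policy regularization in the spirit of the paper: the network's
    output (here `nn_output`, a hypothetical scalar head) is interpreted
    as a base-stock level S, so the order quantity is constrained to the
    order-up-to form, shrinking the policy search space."""
    return base_stock_order(inventory_position, nn_output)
```

For example, with a learned base-stock level of 10 and an inventory position of 4, the policy orders 6 units; if inventory already exceeds the level, it orders nothing.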
[AI-29] Physics-Informed Neural Network with Adaptive Clustering Learning Mechanism for Information Popularity Prediction
[Quick Read]: This paper addresses two gaps in existing information-popularity prediction models: they capture micro-level features of information cascades while neglecting macroscopic dissemination patterns, and they ignore the effect of information heterogeneity on spread influence. The key is PIACN, a physics-informed neural network with an adaptive clustering learning mechanism, which for the first time models the macroscopic patterns of information dissemination through a physics-informed approach and quantifies the heterogeneity of different information sources via adaptive clustering, significantly improving prediction accuracy.
Link: https://arxiv.org/abs/2603.19599
Authors: Guangyin Jin, Xiaohan Ni, Yanjie Song, Kun Wei, Jie Zhao, Leiming Jia, Witold Pedrycz
Affiliation: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:With society entering the Internet era, the volume and speed of data and information have been increasing. Predicting the popularity of information cascades can help with high-value information delivery and public opinion monitoring on the internet platforms. The current state-of-the-art models for predicting information popularity utilize deep learning methods such as graph convolution networks (GCNs) and recurrent neural networks (RNNs) to capture early cascades and temporal features to predict their popularity increments. However, these previous methods mainly focus on the micro features of information cascades, neglecting their general macroscopic patterns. Furthermore, they also lack consideration of the impact of information heterogeneity on spread popularity. To overcome these limitations, we propose a physics-informed neural network with adaptive clustering learning mechanism, PIACN, for predicting the popularity of information cascades. Our proposed model not only models the macroscopic patterns of information dissemination through physics-informed approach for the first time but also considers the influence of information heterogeneity through an adaptive clustering learning mechanism. Extensive experimental results on three real-world datasets demonstrate that our model significantly outperforms other state-of-the-art methods in predicting information popularity.
[AI-30] ARMOR: Adaptive Resilience Against Model Poisoning Attacks in Continual Federated Learning for Mobile Indoor Localization
[Quick Read]: This paper addresses two challenges of continual federated learning (CFL) for indoor localization: accumulated bias in global model (GM) weights under device heterogeneity and evolving environments, which degrades localization performance, and model poisoning attacks that can corrupt the GM. The key of the proposed ARMOR framework is a novel state-space model (SSM) that learns the historical evolution of GM weight tensors and predicts their expected next state; incoming local updates are compared against the SSM projection to detect anomalies, and corrupted updates are selectively mitigated before aggregation, protecting against both model drift and adversarial attacks while enabling robust adaptation in dynamic environments and significantly improving localization accuracy and security.
Link: https://arxiv.org/abs/2603.19594
Authors: Danish Gufran, Akhil Singampalli, Sudeep Pasricha
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Indoor localization has become increasingly essential for applications ranging from asset tracking to delivering personalized services. Federated learning (FL) offers a privacy-preserving approach by training a centralized global model (GM) using distributed data from mobile devices without sharing raw data. However, real-world deployments require a continual federated learning (CFL) setting, where the GM receives continual updates under device heterogeneity and evolving indoor environments. In such dynamic conditions, erroneous or biased updates can cause the GM to deviate from its expected learning trajectory, gradually degrading internal GM representations and GM localization performance. This vulnerability is further exacerbated by adversarial model poisoning attacks. To address this challenge, we propose ARMOR, a novel CFL-based framework that monitors and safeguards the GM during continual updates. ARMOR introduces a novel state-space model (SSM) that learns the historical evolution of GM weight tensors and predicts the expected next state of weight tensors of the GM. By comparing incoming local updates with this SSM projection, ARMOR detects deviations and selectively mitigates corrupted updates before local updates are aggregated with the GM. This mechanism enables robust adaptation to temporal environmental dynamics and mitigates the effects of model poisoning attacks while preventing GM corruption. Experimental evaluations in real-world conditions indicate that ARMOR achieves notable improvements, with up to 8.0x reduction in mean error and 4.97x reduction in worst-case error compared to state-of-the-art indoor localization frameworks, demonstrating strong resilience against model corruption tested using real-world data and mobile devices.
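ARMOR's gate can be sketched as predict-then-filter. In this sketch a naive linear extrapolation stands in for the learned state-space model, and the tolerance value is an illustrative assumption:

```python
def detect_corrupted_updates(history, updates, tol=1.0):
    """Sketch of an ARMOR-style gate: predict the global model's next
    weight vector from its history (a naive last-delta extrapolation
    stands in for the learned SSM) and keep only local updates that
    stay within `tol` of the prediction."""
    prev, last = history[-2], history[-1]
    # Naive next-state prediction: last state plus the last observed delta.
    predicted = [l + (l - p) for p, l in zip(prev, last)]

    def dist(update):
        return sum((a - b) ** 2 for a, b in zip(update, predicted)) ** 0.5

    return [u for u in updates if dist(u) <= tol]
```

Given a history moving from `[0, 0]` to `[1, 1]`, the predicted next state is `[2, 2]`; an update near that prediction is kept, while a far-off (possibly poisoned) update is dropped before aggregation.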
[AI-31] PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management
[Quick Read]: This paper targets the limited battery life of mobile devices, where existing power-management mechanisms rely on static rules or coarse-grained heuristics that cannot adapt to user activities and personal preferences. The key is to leverage the commonsense reasoning of large language models (LLMs) to bridge the semantic gap between user behavior and system parameters, enabling zero-shot, context-aware, personalized power-policy generation without manual configuration. PowerLens adopts a multi-agent architecture that recognizes UI semantics and generates holistic power policies across 18 device parameters, uses a PDL-based constraint-verification framework to guarantee execution safety, and learns individual preferences from implicit user overrides through a two-tier memory system, converging within 3-5 days with high user satisfaction while itself consuming only 0.5% of daily battery capacity.
Link: https://arxiv.org/abs/2603.19584
Authors: Xingyu Feng, Chang Sun, Yuzhu Wang, Zhangbing Zhou, Chengwen Luo, Zhuangzhuang Chen, Xiaomin Ouyang, Huanqi Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Battery life remains a critical challenge for mobile devices, yet existing power management mechanisms rely on static rules or coarse-grained heuristics that ignore user activities and personal preferences. We present PowerLens, a system that tames the reasoning power of Large Language Models (LLMs) for safe and personalized mobile power management on Android devices. The key idea is that LLMs’ commonsense reasoning can bridge the semantic gap between user activities and system parameters, enabling zero-shot, context-aware policy generation that adapts to individual preferences through implicit feedback. PowerLens employs a multi-agent architecture that recognizes user context from UI semantics and generates holistic power policies across 18 device parameters. A PDL-based constraint framework verifies every action before execution, while a two-tier memory system learns individualized preferences from implicit user overrides through confidence-based distillation, requiring no explicit configuration and converging within 3–5 days. Extensive experiments on a rooted Android device show that PowerLens achieves 81.7% action accuracy and 38.8% energy saving over stock Android, outperforming rule-based and LLM-based baselines, with high user satisfaction, fast preference convergence, and strong safety guarantees, with the system itself consuming only 0.5% of daily battery capacity.
[AI-32] Skilled AI Agents for Embedded and IoT Systems Development
[Quick Read]: This paper addresses the challenge of applying generative AI to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) development: code that compiles successfully may still fail on physical devices because of timing constraints, peripheral initialization requirements, or hardware-specific behaviors. The key is a skills-based agentic framework together with IoT-SkillsBench, a benchmark for systematically evaluating AI agents in real embedded environments. Experiments show that concise human-expert skills encoding structured expert knowledge substantially raise task success, achieving near-perfect execution across platforms, peripherals, and difficulty levels.
Link: https://arxiv.org/abs/2603.19583
Authors: Yiming Li, Yuhan Cheng, Mingchen Ma, Yihang Zou, Ningyuan Yang, Wei Cheng, Hai "Helen" Li, Yiran Chen, Tingjun Chen
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) and agentic systems have shown promise for automated software development, but applying them to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) systems remains challenging due to the tight coupling between software logic and physical hardware behavior. Code that compiles successfully may still fail when deployed on real devices because of timing constraints, peripheral initialization requirements, or hardware-specific behaviors. To address this challenge, we introduce a skills-based agentic framework for HIL embedded development together with IoT-SkillsBench, a benchmark designed to systematically evaluate AI agents in real embedded programming environments. IoT-SkillsBench spans three representative embedded platforms, 23 peripherals, and 42 tasks across three difficulty levels, where each task is evaluated under three agent configurations (no-skills, LLM-generated skills, and human-expert skills) and validated through real hardware execution. Across 378 hardware validated experiments, we show that concise human-expert skills with structured expert knowledge enable near-perfect success rates across platforms.
[AI-33] Evolving Embodied Intelligence: Graph Neural Network–Driven Co-Design of Morphology and Control in Soft Robotics
[Quick Read]: This paper addresses the difficulty of co-designing morphology and controllers in soft robotics, where morphological evolution can disrupt learned control strategies, making knowledge hard to reuse or transfer when body and brain must be optimized simultaneously. The key is a graph-neural-network-based co-design framework: each robot is represented as a graph, a graph attention network (GAT) encodes node features, and a multilayer perceptron (MLP) head outputs actuator commands or value estimates; during evolution, inheritance follows a topology-consistent mapping in which shared GAT layers are reused, MLP hidden layers are transferred intact, matched actuator outputs are copied directly, and unmatched ones are randomly initialized and fine-tuned. This morphology-aware policy class lets the controller adapt as the body mutates, yielding higher final fitness and stronger robustness to morphological variation.
Link: https://arxiv.org/abs/2603.19582
Authors: Jianqiang Wang, Shuaiqun Pan, Alvaro Serra-Gomez, Xiaohan Wei, Yue Xie
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:The intelligent behavior of robots does not emerge solely from control systems, but from the tight coupling between body and brain, a principle known as embodied intelligence. Designing soft robots that leverage this interaction remains a significant challenge, particularly when morphology and control require simultaneous optimization. A significant obstacle in this co-design process is that morphological evolution can disrupt learned control strategies, making it difficult to reuse or adapt existing knowledge. We address this by developing a Graph Neural Network-based approach for the co-design of morphology and controller. Each robot is represented as a graph, with a graph attention network (GAT) encoding node features and a pooled representation passed through a multilayer perceptron (MLP) head to produce actuator commands or value estimates. During evolution, inheritance follows a topology-consistent mapping: shared GAT layers are reused, MLP hidden layers are transferred intact, matched actuator outputs are copied, and unmatched ones are randomly initialized and fine-tuned. This morphology-aware policy class lets the controller adapt when the body mutates. On the benchmark, our GAT-based approach achieves higher final fitness and stronger adaptability to morphological variations compared to traditional MLP-only co-design methods. These results indicate that graph-structured policies provide a more effective interface between evolving morphologies and control for embodied intelligence.
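The topology-consistent inheritance rule for actuator outputs described above can be sketched as follows (the actuator names and weight values are hypothetical):

```python
import random

def inherit_actuator_weights(parent_w, child_actuators, parent_actuators):
    """Sketch of the paper's inheritance rule for the actuator output
    layer: outputs present in both parent and child morphologies are
    copied, while actuators new to the child get fresh random weights
    that are later fine-tuned."""
    child_w = {}
    for act in child_actuators:
        if act in parent_actuators:
            child_w[act] = parent_w[act]            # matched: copy intact
        else:
            child_w[act] = random.gauss(0.0, 0.1)   # unmatched: re-initialize
    return child_w
```

A mutation that keeps actuator "a" but replaces "b" with a new actuator "c" thus preserves the learned weight for "a" and only re-initializes "c".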
[AI-34] PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning AAAI2024
[Quick Read]: This paper addresses the difficulty of obtaining a high-quality Pareto policy set in multi-objective reinforcement learning (MORL), especially in complex tasks with continuous or high-dimensional state-action spaces. The key of the proposed Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) is to select scalarization weights via the Pareto ascent direction and compute the multi-objective policy gradient, which determines an update direction that jointly improves all objectives; multiple policies are selectively optimized within an evolutionary framework to approximate the Pareto frontier from different directions, and a Pareto adaptive fine-tuning step further improves the density and spread of the frontier approximation.
Link: https://arxiv.org/abs/2603.19579
Authors: Tianmeng Hu, Biao Luo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2024
Abstract:Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.
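A common way to obtain a direction that jointly improves two objectives is the min-norm convex combination of their gradients (the closed-form MGDA solution for two objectives). The sketch below is in that spirit; the paper's exact Pareto-ascent-direction computation may differ:

```python
def pareto_ascent_direction(g1, g2):
    """Min-norm convex combination w*g1 + (1-w)*g2 of two objective
    gradients. For ascent, the result has a non-negative inner product
    with both gradients, so a small step improves both objectives."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    diff = [a - b for a, b in zip(g1, g2)]          # g1 - g2
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g1)                              # gradients already agree
    # Closed-form minimizer of ||w*g1 + (1-w)*g2||^2 over w in [0, 1].
    w = max(0.0, min(1.0, dot([-d for d in diff], g2) / denom))
    return [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
```

For two orthogonal unit gradients the result is their midpoint, which improves both objectives at an equal rate.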
[AI-35] Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition
[Quick Read]: This paper studies MSE-optimal entrywise scalar quantization for matrix multiplication: the entries of A ∈ R^{m×k} and B ∈ R^{k×n} are quantized independently, and the goal is to minimize E[||AB − ÂB̂||_F²]. The key results are: in the high-resolution limit K_X, K_Y → ∞, a sharp K^{-2} asymptotic expansion of the MSE with the exact optimal leading constants; and, for correlated Gaussian product pairs, a closed-form optimal point density λ*(u) ∝ exp(−u²/6)((1−ρ²)+ρ²u²)^{1/3} with u = x/σ_X, together with a correlation-driven phase transition: the density is unimodal for |ρ| ≤ 1/√3 and becomes bimodal for |ρ| > 1/√3, with peaks at u_peak = ±√(3 − 1/ρ²). These results provide provably optimal quantization strategies for low-precision matrix operations and for quantizing large language model activations.
Link: https://arxiv.org/abs/2603.19559
Authors: Calvin Ang, Sungyoon Kim, Mert Pilanci
Affiliation: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments:
Abstract:We study entrywise scalar quantization of two matrices prior to multiplication. Given $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, we quantize entries of $A$ and $B$ independently using scalar quantizers with $K_X$ and $K_Y$ levels per entry, and form $\widehat{C} = \widehat{A}\,\widehat{B}$. The objective is to minimize the matrix multiplication mean-squared error (MSE) $\mathbb{E}[\|AB - \widehat{A}\widehat{B}\|_F^2]$ under a pair-i.i.d. inner-product model. In the high-resolution regime $K_X, K_Y \to \infty$, we derive a sharp $K^{-2}$ asymptotic expansion for $\mathcal{E}$, identify the exact optimal leading constants, and characterize asymptotically optimal quantization center densities in terms of conditional second moments. We then specialize to correlated Gaussian multiplicative pairs, obtaining a closed-form optimal point density $\lambda^\star(u) \propto \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2 u^2\bigr)^{1/3}$, $u = x/\sigma_X$, with the same form for $y/\sigma_Y$, and prove a correlation-driven phase transition: the density is unimodal at the origin for $|\rho| \leq 1/\sqrt{3}$ and becomes bimodal for $|\rho| > 1/\sqrt{3}$ with peaks at $u_{\mathrm{peak}} = \pm\sqrt{3 - 1/\rho^2}$. We show our method's applicability in synthetic experiments such as matrix multiplication quantization and least squares optimization, as well as quantization of large language model key and query activations.
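The closed-form density and the phase transition stated in the abstract can be checked numerically. This sketch evaluates the unnormalized density λ*(u) ∝ exp(−u²/6)((1−ρ²)+ρ²u²)^{1/3} and the claimed peak location:

```python
import math

def density(u, rho):
    """Unnormalized optimal point density from the abstract."""
    return (math.exp(-u * u / 6.0)
            * ((1.0 - rho * rho) + rho * rho * u * u) ** (1.0 / 3.0))

def predicted_peak(rho):
    """Off-origin peak location sqrt(3 - 1/rho^2), which is real
    exactly when |rho| > 1/sqrt(3) (the bimodal regime)."""
    return math.sqrt(3.0 - 1.0 / (rho * rho))
```

For ρ = 0.3 (below 1/√3 ≈ 0.577) the density falls off away from the origin, while for ρ = 0.9 the density at the predicted peak exceeds its value at zero, matching the stated phase transition.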
[AI-36] Plagiarism or Productivity? Students Moral Disengagement and Behavioral Intentions to Use ChatGPT in Academic Writing
[Quick Read]: This study examines how moral disengagement influences Filipino college students' intention to use generative AI, with ChatGPT as the case, in academic writing. It tests five moral-disengagement mechanisms (moral justification, euphemistic labeling, displacement of responsibility, minimizing consequences, and attribution of blame) as predictors of attitudes, subjective norms, and perceived behavioral control, which in turn predict usage intention, shedding light on why students use AI tools despite academic-integrity risks. The key findings are that attribution of blame has the strongest influence and that attitudes explain the most variance in intention; students often rationalize AI use through institutional gaps and peer behavior, underscoring the need for clear academic-integrity policies, ethical guidance, and classroom support to foster responsible AI use.
Link: https://arxiv.org/abs/2603.19549
Authors: John Paul P. Miranda, Rhiziel P. Manalese, Mark Anthony A. Castro, Renen Paul M. Viado, Vernon Grace M. Maniago, Rudante M. Galapon, Jovita G. Rivera, Amado B. Martinez Jr
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Applications (stat.AP)
Comments: 5 pages, 1 figure, 2 tables, conference proceeding
Abstract:This study examined how moral disengagement influences Filipino college students’ intention to use ChatGPT in academic writing. The model tested five mechanisms: moral justification, euphemistic labeling, displacement of responsibility, minimizing consequences, and attribution of blame. These mechanisms were analyzed as predictors of attitudes, subjective norms, and perceived behavioral control, which then predicted behavioral intention. A total of 418 students with ChatGPT experience participated. The results showed that several moral disengagement mechanisms influenced students’ attitudes and sense of control. Among the predictors, attribution of blame had the strongest influence, while attitudes had the highest impact on behavioral intention. The model explained more than half of the variation in intention. These results suggest that students often rely on institutional gaps and peer behavior to justify AI use. Many believe it is acceptable to use ChatGPT for learning or when rules are unclear. This shows a need for clear academic integrity policies, ethical guidance, and classroom support. The study also recognizes that intention-based models may not fully explain student behavior. Emotional factors, peer influence, and convenience can also affect decisions. The results provide useful insights for schools that aim to support responsible and informed AI use in higher education.
[AI-37] ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估体系中缺乏多认知维度协同测试的问题,即现有基准多局限于单一类型的任务(如 verbal reasoning),难以全面反映模型在真实复杂场景下的综合推理能力。其解决方案的关键在于提出 ItinBench,一个融合空间推理(如路径优化)与传统语义推理任务的综合性旅行规划基准,从而实现对 LLM 在多个类人认知域(cognitive domains)下协同处理能力的系统性评测。实验表明,LLMs 在同时应对多种认知维度时性能显著下降,凸显了构建更贴近现实挑战的多维测试框架的重要性。
链接: https://arxiv.org/abs/2603.19515
作者: Tianlong Wang,Pinqiao Wang,Weili Shi,Sheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium to integrate various verbal reasoning tasks into real-world contexts. However, reasoning tasks extend beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, we introduce ItinBench, a benchmark that features one task of spatial reasoning, i.e., route optimization, into trip itinerary planning while keeping the traditional verbal reasoning tasks. ItinBench evaluates various LLMs across diverse tasks simultaneously, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and GPT family. Our findings reveal that LLMs struggle to maintain high and consistent performance when concurrently handling multiple cognitive dimensions. By incorporating tasks from distinct human-level cognitive domains, ItinBench provides new insights into building more comprehensive reasoning testbeds that better reflect real-world challenges. The code and dataset: this https URL
[AI-38] Learning to Disprove: Formal Counterexample Generation with Large Language Models
【速读】:该论文旨在解决当前生成式 AI 在数学推理中长期忽视的关键问题——即在专注于证明构造(proof construction)的同时,缺乏对反例发现(counterexample discovery)的有效支持。为弥补这一空白,作者提出通过微调大语言模型(Large Language Models, LLMs)来实现形式化反例生成(formal counterexample generation),要求模型不仅提出候选反例,还需生成可在 Lean 4 定理证明器中自动验证的形式化证明。其解决方案的关键在于引入符号变异(symbolic mutation)策略,该策略通过系统性地提取定理并移除部分假设,自动生成多样化的训练数据;结合精心构建的数据集与多奖励专家迭代(multi-reward expert iteration)训练框架,显著提升了模型在反例生成和定理证明任务上的效果与效率。
链接: https://arxiv.org/abs/2603.19514
作者: Zenan Li,Zhaoyu Li,Kaiyu Yang,Xiaoxing Ma,Zhendong Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI efforts in mathematics focus almost exclusively on proof construction, often neglecting the equally important task of finding counterexamples. In this paper, we address this gap by fine-tuning large language models (LLMs) to reason about and generate counterexamples. We formalize this task as formal counterexample generation, which requires LLMs not only to propose candidate counterexamples but also to produce formal proofs that can be automatically verified in the Lean 4 theorem prover. To enable effective learning, we introduce a symbolic mutation strategy that synthesizes diverse training data by systematically extracting theorems and discarding selected hypotheses, thereby producing diverse counterexample instances. Together with curated datasets, this strategy enables a multi-reward expert iteration framework that substantially enhances both the effectiveness and efficiency of training LLMs for counterexample generation and theorem proving. Experiments on three newly collected benchmarks validate the advantages of our approach, showing that the mutation strategy and training framework yield significant performance gains.
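上文的"符号变异"策略(提取定理并删去部分假设以合成候选反例实例)可以用一个极简的 Python 玩具示例来说明。注意:下面的定理表示(hypotheses/conclusion 字典)与变异规则均为示意性假设,并非论文中基于 Lean 4 的实际实现:

```python
from itertools import combinations

def mutate(theorem):
    # 对一个 {hypotheses, conclusion} 形式的玩具定理,
    # 枚举删去非空假设子集后得到的候选(可能为假的)命题
    hyps = theorem["hypotheses"]
    variants = []
    for k in range(1, len(hyps) + 1):
        for dropped in combinations(range(len(hyps)), k):
            kept = [h for i, h in enumerate(hyps) if i not in dropped]
            variants.append({"hypotheses": kept,
                             "conclusion": theorem["conclusion"]})
    return variants

# 玩具定理:n > 0 且 n 为偶数 ⇒ n*n 为偶数
thm = {"hypotheses": ["n > 0", "Even n"], "conclusion": "Even (n * n)"}
candidates = mutate(thm)
```

例如,删去 "Even n" 这一假设后,所得命题 "n > 0 ⇒ Even (n * n)" 即可被反例 n = 1 否证;这类变体正是训练反例生成模型所需的数据实例。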
[AI-39] Linear Social Choice with Few Queries: A Moment-Based Approach
【速读】:该论文旨在解决社会选择理论中因通信预算受限而导致的信息利用效率低下的问题,即在仅能获取每个选民少量比较信息(如一次成对比较或单次评分)的情况下,如何有效识别选民类型分布并实现最优候选者选择。传统方法通常假设拥有完整的排名信息,而现实中由于数据获取成本高或用户参与度低,往往只能获得稀疏的偏好信号,导致当前对齐实践(alignment practice)仅能提取约每位选民1比特信息,难以支持复杂的社会福利目标。解决方案的关键在于将选民群体建模为未知的选民类型分布,并通过有限的查询(如每名选民两次成对比较或一次分级比较)来恢复该分布的矩(moment),从而精确刻画选民偏好的统计结构;研究表明,每名选民两次成对比较足以识别二阶矩,进而支持不平等敏感的社会福利目标(如考虑选民效用方差的公平性优化),并进一步可识别所有阶矩,完整重建选民类型分布,为多样性和代表性等目标提供理论依据与技术路径。
链接: https://arxiv.org/abs/2603.19510
作者: Luise Ge,Daniel Halpern,Gregory Kehne,Yevgeniy Vorobeychik
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
Abstract:Most social choice rules assume access to full rankings, while current alignment practice – despite aiming for diversity – typically treats voters as anonymous and comparisons as independent, effectively extracting only about one bit per voter. Motivated by this gap, we study social choice under an extreme communication budget in the linear social choice model, where each voter’s utility is the inner product between a latent voter type and the embedding of the context and candidate. The candidate and voter spaces may be very large or even infinite. Our core idea is to model the electorate as an unknown distribution over voter types and to recover its moments as informative summary statistics for candidate selection. We show that one pairwise comparison per voter already suffices to select a candidate that maximizes social welfare, but this elicitation cannot identify the second moment and therefore cannot support objectives that account for inequality. We prove that two pairwise comparisons per voter, or alternatively a single graded comparison, identify the second moment; moreover, these richer queries suffice to identify all moments, and hence the entire voter-type distribution. These results enable principled solutions to a range of social choice objectives including inequality-aware welfare criteria such as taking into account the spread of voter utilities and choosing a representative subset.
[AI-40] TRACE: Trajectory Recovery with State Propagation Diffusion for Urban Mobility WWW2026
【速读】:该论文旨在解决真实世界GPS轨迹数据因采样率低和基础设施覆盖不足导致的稀疏性与点分布不均问题,从而难以支持高精度的位置服务(如导航、共享出行和配送等)的需求。其核心解决方案是提出一种名为TRACE的新型扩散模型,关键创新在于引入状态传播扩散模型(State Propagation Diffusion Model, SPDM),该模型融合了一种新颖的记忆机制,在去噪过程中能够保留并利用前序步骤的中间结果,从而有效重建那些难以恢复的轨迹片段,显著提升轨迹恢复的准确性(实验表明比现有最优方法提升26%),且推理开销可控。
链接: https://arxiv.org/abs/2603.19474
作者: Jinming Wang,Hai Wang,Hongkai Wen,Geyong Min,Man Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This article is accepted by WWW 2026, Dubai, United Arab Emirates
Abstract:High-quality GPS trajectories are essential for location-based web services and smart city applications, including navigation, ride-sharing and delivery. However, due to low sampling rates and limited infrastructure coverage during data collection, real-world trajectories are often sparse and feature unevenly distributed location points. Recovering these trajectories into dense and continuous forms is essential but challenging, given their complex and irregular spatio-temporal patterns. In this paper, we introduce a novel diffusion model for trajectory recovery named TRACE, which reconstructs dense and continuous trajectories from sparse and incomplete inputs. At the core of TRACE, we propose a State Propagation Diffusion Model (SPDM), which integrates a novel memory mechanism, so that during the denoising process, TRACE can retain and leverage intermediate results from previous steps to effectively reconstruct those hard-to-recover trajectory segments. Extensive experiments on multiple real-world datasets show that TRACE outperforms the state-of-the-art, offering 26% accuracy improvement without significant inference overhead. Our work strengthens the foundation for mobile and web-connected location services, advancing the quality and fairness of data-driven urban applications. Code is available at: this https URL
[AI-41] Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
【速读】:该论文旨在解决大语言模型强化学习(LLM RL)中因离策略(off-policy)问题导致的训练不稳定和探索受限问题,特别是策略停滞(policy staleness)和训练-推理不匹配(training-inference mismatch)所引发的重尾重要性比(heavy-tailed importance ratios)现象。其核心解决方案是提出自适应分层扰动(Adaptive Layerwise Perturbation, ALP),通过在每层的输入隐藏状态中注入可学习的小扰动,作为目标函数中相对于不变推理策略的重要性比的分子。该方法通过引入可控噪声于中间表示层面,有效抑制更新策略与推理策略间的剧烈偏离,扩大策略族以覆盖存在失配噪声的推理策略空间,从而自然地缩小两者差距、降低重要性比尾部值并维持训练稳定性。实验证明ALP不仅能提升最终性能,还能避免重要性比尾部爆炸和KL散度尖峰,同时增强探索能力。
链接: https://arxiv.org/abs/2603.19470
作者: Chenlu Ye,Xuanchang Zhang,Yifan Hao,Zhou Yu,Ziji Zhang,Abhinav Gullapalli,Hao Chen,Jing Huang,Tong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. As inference efficiency is pushed higher, the distribution gap between the inference and updated policies grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), injecting small learnable perturbations into the input hidden states of each layer during updates; the perturbed policy is used as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy, and enlarges the policy family to cover the inference policy family with mismatch noises. Hence, the flattened distribution can naturally tighten the gap between the updated and inference policies and reduce the tail of the importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoids blow-ups of the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
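为便于理解"重要性比"及隐藏状态扰动的作用,下面给出一个两层玩具策略的 NumPy 草图。纯属示意:此处的扰动是固定随机向量,而论文中的 ALP 扰动是逐层可学习的,且作用于真实 LLM 的各层输入隐藏状态:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy(x, w1, w2, delta=None):
    # 玩具两层策略;delta 为注入到隐藏状态的扰动(示意)
    h = np.tanh(w1 @ x)
    if delta is not None:
        h = h + delta
    return softmax(w2 @ h)

rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal((8, 4)), rng.standard_normal((5, 8))
x = rng.standard_normal(4)

p_inf = policy(x, w1, w2)                 # 推理策略
delta = 0.01 * rng.standard_normal(8)     # 小扰动(此处为固定占位,ALP 中可学习)
p_pert = policy(x, w1, w2, delta)         # 加扰动后的"更新"策略
ratios = p_pert / p_inf                   # 逐动作的重要性比
```

扰动足够小时,重要性比集中在 1 附近,直观对应论文中"抑制更新策略与推理策略剧烈偏离"的效果。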
[AI-42] A Framework for Formalizing LLM Agent Security
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)智能体安全定义缺乏情境敏感性的问题。现有攻击与防御机制通常未充分考虑行为发生的上下文因素,如指令来源、目标合理性及行动与目标的一致性等,导致防御措施在不同场景下要么过度限制功能(utility-loss),要么无法有效阻断攻击(security-vulnerability)。其解决方案的关键在于提出一个基于情境安全的系统化框架,包含四个核心安全属性:任务对齐(task alignment)、动作对齐(action alignment)、源授权(source authorization)和数据隔离(data isolation),并通过一组预言机函数(oracle functions)实现对这些属性是否被违反的实时验证,从而为攻击和防御提供精确、情境化的形式化定义。
链接: https://arxiv.org/abs/2603.19469
作者: Vincent Siu,Jingxuan He,Kyle Montgomery,Zhun Wang,Neil Gong,Chenguang Wang,Dawn Song
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruction led to the action, what objective is being pursued, and whether the action serves that objective. However, existing definitions of security attacks against LLM agents often fail to capture this contextual nature. As a result, defenses face a fundamental utility-security tradeoff: applying defenses uniformly across all contexts can lead to significant utility loss, while applying defenses in insufficient or inappropriate contexts can result in security vulnerabilities. In this work, we present a framework that systematizes existing attacks and defenses from the perspective of contextual security. To this end, we propose four security properties that capture contextual security for LLM agents: task alignment (pursuing authorized objectives), action alignment (individual actions serving those objectives), source authorization (executing commands from authenticated sources), and data isolation (ensuring information flows respect privilege boundaries). We further introduce a set of oracle functions that enable verification of whether these security properties are violated as an agent executes a user task. Using this framework, we reformalize existing attacks, such as indirect prompt injection, direct prompt injection, jailbreak, task drift, and memory poisoning, as violations of one or more security properties, thereby providing precise and contextual definitions of these attacks. Similarly, we reformalize defenses as mechanisms that strengthen oracle functions or perform security property checks. Finally, we discuss several important future research directions enabled by our framework.
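论文将安全属性定义为可在代理执行过程中检验的预言机函数(oracle functions)。以下 Python 草图仅为示意:属性名取自论文的四条安全属性(此处将 task alignment 与 action alignment 合并简化为单一目标检查),谓词逻辑为假设的玩具实现,并非论文的形式化定义:

```python
from dataclasses import dataclass

@dataclass
class Action:
    objective: str   # 动作服务的目标
    source: str      # 触发该动作的指令来源
    data_scope: str  # 动作访问的数据范围

# 三个简化的玩具预言机(示意)
def task_alignment(action, authorized_objectives):
    return action.objective in authorized_objectives

def source_authorization(action, trusted_sources):
    return action.source in trusted_sources

def data_isolation(action, allowed_scopes):
    return action.data_scope in allowed_scopes

def check(action, policy):
    # 任一属性被违反,即判定该步骤构成某类攻击
    return (task_alignment(action, policy["objectives"])
            and source_authorization(action, policy["sources"])
            and data_isolation(action, policy["scopes"]))

policy = {"objectives": {"book_flight"}, "sources": {"user"}, "scopes": {"public"}}
ok = check(Action("book_flight", "user", "public"), policy)
# 间接提示注入的示意:来源与目标均未授权
injected = check(Action("exfiltrate", "webpage", "private"), policy)
```

在该视角下,间接提示注入、越狱等攻击都可统一表述为某个预言机在执行轨迹上返回 False。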
[AI-43] Global Convergence of Multiplicative Updates for the Matrix Mechanism: A Collaborative Proof with Gemini 3
【速读】:该论文旨在解决一个在私有机器学习算法空间优化中出现的固定点迭代问题,具体为迭代形式 $ v^{(k+1)} = \text{diag}\left((D_{v^{(k)}}^{1/2} M D_{v^{(k)}}^{1/2})^{1/2}\right) $ 的收敛性证明问题,该迭代源自对带有Hadamard积结构的正则化核范数目标函数的优化。此前该问题在文献 [Denisov et al.] 中被提出但未完全解决。解决方案的关键在于证明该迭代单调收敛至势函数 $ J(v) = 2\,\text{Tr}\left((D_v^{1/2} M D_v^{1/2})^{1/2}\right) - \sum v_i $ 的唯一全局最优解,这一结论填补了原文献中的理论空白。整个证明过程主要由Gemini 3辅助完成,体现了AI在数学证明中的实际应用潜力与协作价值。
链接: https://arxiv.org/abs/2603.19465
作者: Keith Rush
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 12 pages, 1 figure
Abstract:We analyze a fixed-point iteration $v \leftarrow \phi(v)$ arising in the optimization of a regularized nuclear norm objective involving the Hadamard product structure, posed in [Denisov et al.] in the context of an optimization problem over the space of algorithms in private machine learning. We prove that the iteration $v^{(k+1)} = \text{diag}\left((D_{v^{(k)}}^{1/2} M D_{v^{(k)}}^{1/2})^{1/2}\right)$ converges monotonically to the unique global optimizer of the potential function $J(v) = 2\,\text{Tr}\left((D_v^{1/2} M D_v^{1/2})^{1/2}\right) - \sum v_i$, closing a problem left open there. The bulk of this proof was provided by Gemini 3, subject to some corrections and interventions. Gemini 3 also sketched the initial version of this note. Thus, it represents as much a commentary on the practical use of AI in mathematics as it represents the closure of a small gap in the literature. As such, we include a small narrative description of the prompting process, and some resulting principles for working with AI to prove mathematics.
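该不动点迭代本身很容易用 NumPy 复现。以下最小草图(假设 $M$ 为对称正定矩阵、初始点取 $M$ 的对角线,两者均为本文为演示所作的假设)迭代 $v \leftarrow \text{diag}\left((D_v^{1/2} M D_v^{1/2})^{1/2}\right)$ 直至收敛到不动点:

```python
import numpy as np

def sqrtm_sym(a):
    # 对称(半)正定矩阵的主平方根:特征分解实现
    w, q = np.linalg.eigh(a)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

def phi(v, m):
    # 一步迭代:v <- diag((D_v^{1/2} M D_v^{1/2})^{1/2})
    d = np.sqrt(v)
    return np.diag(sqrtm_sym(d[:, None] * m * d[None, :]))

def potential(v, m):
    # 势函数 J(v) = 2 Tr((D_v^{1/2} M D_v^{1/2})^{1/2}) - sum(v)
    d = np.sqrt(v)
    return 2.0 * np.trace(sqrtm_sym(d[:, None] * m * d[None, :])) - v.sum()

rng = np.random.default_rng(0)
b = rng.standard_normal((4, 4))
m = b @ b.T + 4.0 * np.eye(4)   # 随机对称正定矩阵
v = np.diag(m).copy()           # 初始点(演示用假设,论文未指定)
for _ in range(1000):
    v = phi(v, m)
```

由于 $M$ 正定且 $v$ 保持为正,每一步 $D_v^{1/2} M D_v^{1/2}$ 都是正定矩阵,其平方根的对角线仍为正,迭代始终良定义。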
[AI-44] Hyperag ents KR
【速读】:该论文旨在解决现有自改进人工智能(Self-improving AI)系统依赖固定、人工设计的元级机制,从而限制其自我提升速度的问题。解决方案的关键在于提出**超智能体(hyperagents)**框架,该框架将任务代理(task agent)与元代理(meta agent)整合为一个可编辑程序,且元级修改过程本身也可被修改,从而实现元认知层面的自我改进。通过扩展达尔文哥德尔机(DGM)构建出DGM-Hyperagents(DGM-H),该方法消除了任务性能与自我改进能力之间必须存在领域特定对齐的假设,使系统能够持续优化自身的改进机制,并在多个计算任务中展现出性能随时间递增、元级改进跨域迁移和累积的能力,从而推动开放式的、自我加速的智能发展。
链接: https://arxiv.org/abs/2603.19461
作者: Jenny Zhang,Bingchen Zhao,Wannan Yang,Jakob Foerster,Jeff Clune,Minqi Jiang,Sam Devlin,Tatiana Shavrina
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code at this https URL
Abstract:Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. Because both evaluation and self-modification are coding tasks, gains in coding ability can translate into gains in self-improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce **hyperagents**, self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only the task-solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM-Hyperagents (DGM-H), eliminating the assumption of domain-specific alignment between task performance and self-modification skill to potentially support self-accelerating progress on any computable task. Across diverse domains, the DGM-H improves performance over time and outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. Furthermore, the DGM-H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs. DGM-Hyperagents offer a glimpse of open-ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.
[AI-45] When both Grounding and not Grounding are Bad – A Partially Grounded Encoding of Planning into SAT (Extended Version)
【速读】:该论文旨在解决经典规划问题中因完全下例化(grounding)导致的状态空间指数级膨胀的问题。传统规划方法通常将一阶逻辑表示的规划问题完全下例化以简化推理,但这一过程在复杂场景下会引发严重的计算瓶颈。为应对这一挑战,作者提出了一种介于完全上层(lifted)与完全下例化之间的折中方案:设计三种SAT编码方法,在保持动作(action)上层表达的同时,对谓词(predicate)进行部分下例化。其关键创新在于通过这种局部下例化策略,使编码规模与计划长度呈线性关系,而非以往方法中的二次增长,从而显著提升长计划场景下的求解效率。实验证明,该方法在难以下例化的领域中实现了优于现有最优技术的长度最优规划性能。
链接: https://arxiv.org/abs/2603.19429
作者: João Filipe,Gregor Behnke
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
备注:
Abstract:Classical planning problems are typically defined using lifted first-order representations, which offer compactness and generality. While most planners ground these representations to simplify reasoning, this can cause an exponential blowup in size. Recent approaches instead operate directly on the lifted level to avoid full grounding. We explore a middle ground between fully lifted and fully grounded planning by introducing three SAT encodings that keep actions lifted while partially grounding predicates. Unlike previous SAT encodings, which scale quadratically with plan length, our approach scales linearly, enabling better performance on longer plans. Empirically, our best encoding outperforms the state of the art in length-optimal planning on hard-to-ground domains.
[AI-46] The Autonomy Tax: Defense Training Breaks LLM Agents
【速读】:该论文旨在解决当前防御训练方法在保护多步任务代理(multi-step agents)免受提示注入攻击(prompt injection attacks)时所引发的“能力-对齐悖论”问题,即防御训练虽提升了安全性,却系统性地削弱了代理的工具执行能力,并未能有效抵御复杂攻击。其关键发现是:现有防御策略导致三种系统性偏差——代理无能偏差(agent incompetence bias)、级联放大偏差(cascade amplification bias)和触发偏差(trigger bias),根源在于模型通过捷径学习(shortcut learning)过度拟合表面攻击模式而非语义威胁理解,从而在单轮拒绝基准上表现良好,但在多步任务中变得不可靠。因此,论文指出亟需发展能够维持工具执行能力的同时抵御对抗扰动的新防御范式。
链接: https://arxiv.org/abs/2603.19423
作者: Shawn Li,Yue Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental **capability-alignment paradox**: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. **Agent incompetence bias** manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. **Cascade amplification bias** causes early failures to propagate through retry loops, pushing defended models to timeout on 99% of tasks compared to 13% for baselines. **Trigger bias** leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.
[AI-47] A Novel Solution for Zero-Day Attack Detection in IDS using Self-Attention and Jensen-Shannon Divergence in WGAN-GP
【速读】:该论文旨在解决零日攻击(zero-day attacks)难以检测和防御的问题,这类攻击利用未知漏洞,传统基于补丁修复和入侵检测系统(Intrusion Detection System, IDS)的方法难以应对。其关键解决方案是提出一种改进的生成对抗网络(Generative Adversarial Network, GAN)架构——SA-JS-WGAN-GP,通过引入自注意力机制(Self-Attention, SA)增强模型对长程跨特征依赖关系的建模能力,并结合基于Jensen-Shannon(JS)散度的辅助判别器(auxiliary discriminator)以优化梯度平滑性和样本质量,从而合成更贴近真实零日攻击模式的网络流量数据,提升IDS在未见攻击类型上的泛化能力和风险识别性能。
链接: https://arxiv.org/abs/2603.19350
作者: Ziyu Mu,Xiyu Shi,Safak Dogan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 40 pages, 5 figures, including references
Abstract:The increasing sophistication of cyber threats, especially zero-day attacks, poses a significant challenge to cybersecurity. Zero-day attacks exploit unknown vulnerabilities, making them difficult to detect and defend against. Existing approaches patch flaws and deploy an Intrusion Detection System (IDS). Using advanced Wasserstein GANs with Gradient Penalty (WGAN-GP), this paper makes a novel proposition to synthesize network traffic that mimics zero-day patterns, enriching data diversity and improving IDS generalization. SA-WGAN-GP is first introduced, which adds a Self-Attention (SA) mechanism to capture long-range cross-feature dependencies by reshaping the feature vector into tokens after dense projections. A JS-WGAN-GP is then proposed, which adds a Jensen-Shannon (JS) divergence-based auxiliary discriminator that is trained with Binary Cross-Entropy (BCE), frozen during updates, and used to regularize the generator for smoother gradients and higher sample quality. Third, SA-JS-WGAN-GP is created by combining the SA mechanism with JS divergence, thereby enhancing the data generation ability of WGAN-GP. As data augmentation does not equate with true zero-day attack discovery, we emulate zero-day attacks via the leave-one-attack-type-out method on the NSL-KDD dataset for training all GANs and IDS models in the assessment of the effectiveness of the proposed solution. The evaluation results show that integrating SA and JS divergence into WGAN-GP yields superior IDS performance and more effective zero-day risk detection.
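文中辅助判别器所依赖的 Jensen-Shannon 散度有标准定义:$\mathrm{JS}(P,Q) = \tfrac{1}{2}\mathrm{KL}(P\|M) + \tfrac{1}{2}\mathrm{KL}(Q\|M)$,其中 $M = \tfrac{1}{2}(P+Q)$。下面给出该通用定义的 NumPy 实现(与论文代码无关,仅为说明这一度量本身):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL 散度(加 eps 防止 log 0)
    p = np.asarray(p, float); q = np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    # JS(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), 其中 M = (P+Q)/2
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
```

JS 散度对称、有界(自然对数下不超过 log 2),这正是它相比单向 KL 更适合作为判别信号的原因之一。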
[AI-48] Beyond Weighted Summation: Learnable Nonlinear Aggregation Functions for Robust Artificial Neurons
【速读】:该论文旨在解决传统人工神经元中基于加权求和的输入聚合机制在面对噪声或极端输入时敏感性高、鲁棒性差的问题(即固定线性聚合机制对异常值不稳健)。其解决方案的关键在于引入可学习的非线性聚合机制,具体包括两种不同的可微分聚合方式:基于可学习幂权重规则的F-Mean神经元和基于距离感知亲和力加权的Gaussian Support神经元;同时为保持优化稳定性,提出混合神经元结构,通过一个可学习的混合参数在标准线性聚合与非线性聚合之间进行插值。实验表明,这种设计能够在不牺牲训练稳定性的前提下显著提升模型在噪声环境下的鲁棒性,且学习到的聚合参数倾向于亚线性(p ≈ 0.43–0.50)和高新颖性利用(α ≈ 0.69–0.79),证明了神经元级聚合机制是构建更抗噪神经网络的重要且未被充分探索的设计维度。
链接: https://arxiv.org/abs/2603.19344
作者: Berke Deniz Bozyigit
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 tables
Abstract:Weighted summation has remained the default input aggregation mechanism in artificial neurons since the earliest neural network models. While computationally efficient, this design implicitly behaves like a mean-based estimator and is therefore sensitive to noisy or extreme inputs. This paper investigates whether replacing fixed linear aggregation with learnable nonlinear alternatives can improve neural network robustness without sacrificing trainability. Two differentiable aggregation mechanisms are introduced: an F-Mean neuron based on a learnable power-weighted aggregation rule, and a Gaussian Support neuron based on distance-aware affinity weighting. To preserve the optimisation stability of standard neurons, hybrid neurons are proposed that interpolate between linear and nonlinear aggregation through a learnable blending parameter. Evaluated in multilayer perceptrons and convolutional neural networks on CIFAR-10 and a noisy CIFAR-10 variant with additive Gaussian corruption, hybrid neurons consistently improve robustness under noise, while F-Mean hybrids also yield modest gains on clean data. The three-way hybrid achieves robustness scores of up to 0.991 compared to 0.890 for the standard baseline, and learned parameters converge consistently to sub-linear aggregation (p ≈ 0.43–0.50) and high novelty utilisation (α ≈ 0.69–0.79). These findings suggest that neuron-level aggregation is a meaningful and underexplored design dimension for building more noise-tolerant neural networks.
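摘要未给出 F-Mean 神经元的具体聚合公式;作为示意,下面以广义加权幂平均 $M_p(x) = \left(\sum_i w_i x_i^p / \sum_i w_i\right)^{1/p}$ 实现一个"可学习幂权重聚合"的假设形式:$p = 1$ 退化为普通加权平均,而论文报告学习到的 $p \approx 0.43$–$0.50$ 属于亚线性区间,对极端输入更不敏感:

```python
import numpy as np

def power_mean(x, w, p, eps=1e-8):
    # 广义加权幂平均(假设输入非负;eps 保证数值稳定)
    x = np.clip(np.asarray(x, float), eps, None)
    w = np.asarray(w, float)
    return (np.sum(w * x**p) / np.sum(w)) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0, 100.0])   # 含一个离群输入
w = np.ones(4)
m1 = power_mean(x, w, 1.0)       # p = 1:即算术平均 26.5
m_sub = power_mean(x, w, 0.45)   # 亚线性聚合(接近论文报告的 p 区间)
```

在含离群值的输入上,亚线性幂平均(如 p = 0.45)的输出明显低于算术平均,直观体现了该类聚合对极端输入的鲁棒性。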
[AI-49] Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
【速读】:该论文旨在解决后训练对齐(post-training alignment)算法选择缺乏可控比较的问题,即当前存在数十种竞争性算法(如DPO、SimPO、KTO、GRPO等),但实践中缺乏统一基准来指导算法选型。其解决方案的关键在于提出OXRL框架,这是一个统一实现51种后训练算法的基础设施,确保所有实验在相同硬件和训练条件下进行,从而实现了首次大规模“苹果对苹果”的公平评估。通过在8种算法、4个模型规模(0.5B–7B)、3个评估领域上开展约240次训练运行,研究揭示了模型规模是影响性能的核心因素(约50个百分点),远超损失函数设计(约1个百分点)或在线/离线训练范式(约9个百分点),为从业者提供了清晰的算法优先级排序依据。
链接: https://arxiv.org/abs/2603.19335
作者: Xiaoyi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-training alignment has produced dozens of competing algorithms – DPO, SimPO, KTO, GRPO, and others – yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B–7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling ~240 training runs on H100 GPUs. Three headline findings emerge. (1) Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0% ± 0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via a 2×2 factorial). (2) Loss function modifications yield negligible gains: none of the 20 DPO variants significantly outperform vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse (−11.5 pp, $p < 10^{-4}$). (3) Algorithm leverage is task-specific: the 19.3 pp GSM8K spread collapses to 0.54 pp on MATH (36×) and 0.47 pp on general-domain benchmarks (41×), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (~50 pp) ≫ training paradigm (~10 pp) ≫ online vs. offline (~9 pp) ≫ loss function (~1 pp). We release all code, configs, and evaluation data as a living community benchmark.
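作为背景,文中反复作为基线比较的 vanilla DPO 损失有标准公式(按原始 DPO 论文的通用定义实现,仅作参考,与 OXRL 框架代码无关):$\mathcal{L} = -\log \sigma\big(\beta[(\log \pi(y_w) - \log \pi_{\mathrm{ref}}(y_w)) - (\log \pi(y_l) - \log \pi_{\mathrm{ref}}(y_l))]\big)$:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # 标准 DPO 损失:对"偏好回答优于非偏好回答"的隐式奖励差做 logistic 回归
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-margin)))   # 即 -log(sigmoid(margin))

# 策略相对参考模型更偏好 y_w 时,损失低于 log 2
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
# 策略与参考模型一致(margin = 0)时,损失恰为 log 2
neutral = dpo_loss(-7.0, -7.0, -7.0, -7.0)
```

文中的 20 个 DPO 变体大多只改动该损失中的 margin 构造或正则项,而实验结论恰恰是这类改动带来的收益可以忽略不计。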
[AI-50] POET: Power-Oriented Evolutionary Tuning for LLM -Based RTL PPA Optimization
【速读】:该论文旨在解决将大语言模型(Large Language Models, LLMs)应用于寄存器传输级(Register-Transfer Level, RTL)代码优化时面临的两个核心问题:一是如何确保优化后设计的功能正确性,避免LLM幻觉导致的错误;二是如何在功耗、性能和面积(Power, Performance, Area, PPA)多目标优化空间中系统性地优先降低功耗。解决方案的关键在于提出POET(Power-Oriented Evolutionary Tuning)框架:首先,通过基于差异测试的测试平台生成流水线,以原始设计作为功能Oracle,利用确定性仿真生成黄金参考,从而消除LLM幻觉对验证过程的影响;其次,采用LLM驱动的进化机制,结合非支配排序、功耗优先的层级内排序及比例存活选择策略,引导搜索过程向帕累托前沿中的低功耗区域聚焦,无需人工权重调整。
链接: https://arxiv.org/abs/2603.19333
作者: Heng Ping,Peiyu Zhang,Zhenkun Wang,Shixuan Li,Anzhe Cheng,Wei Yang,Paul Bogdan,Shahin Nazarian
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Applying large language models (LLMs) to RTL code optimization for improved power, performance, and area (PPA) faces two key challenges: ensuring functional correctness of optimized designs despite LLM hallucination, and systematically prioritizing power reduction within the multi-objective PPA trade-off space. We propose POET (Power-Oriented Evolutionary Tuning), a framework that addresses both challenges. For functional correctness, POET introduces a differential-testing-based testbench generation pipeline that treats the original design as a functional oracle, using deterministic simulation to produce golden references and eliminating LLM hallucination from the verification process. For PPA optimization, POET employs an LLM-driven evolutionary mechanism with non-dominated sorting, power-first intra-level ranking, and proportional survivor selection to steer the search toward the low-power region of the Pareto front without manual weight tuning. Evaluated on the RTL-OPT benchmark across 40 diverse RTL designs, POET achieves 100% functional correctness, the best power on all 40 designs, and competitive area and delay improvements.
[AI-51] PAI: Fast Accurate and Full Benchmark Performance Projection with AI
【速读】:该论文旨在解决现代片上系统(System-on-Chip, SoC)中复杂知识产权核(Intellectual Property, IP)数量激增背景下,预硅阶段硬件-软件功耗与性能分析效率低下的问题。传统周期精确仿真器因仿真速度慢、开发维护成本高且易出错,难以满足全基准测试集的快速准确预测需求;而先前基于机器学习的方法在速度、精度或全基准测试预测能力上均存在不足。解决方案的关键在于提出PAI(Performance Analysis with AI),其核心是一个分层长短期记忆(Long Short Term Memory, LSTM)模型,该模型直接利用程序执行过程中提取的微架构无关特征轨迹,无需依赖详细仿真或指令级编码即可准确预测完整基准测试的性能指标(如每周期指令数,Instructions Per Cycle, IPC)。实验表明,PAI在SPEC CPU 2017基准测试套件上平均IPC预测误差仅为9.35%,同时仅需约3分钟完成全部测试,相较现有最优方法节省了三个数量级的计算时间。
链接: https://arxiv.org/abs/2603.19330
作者: Avery Johnson,Mohammad Majharul Islam,Riad Akram,Abdullah Muzahid
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:The exponential increase in complex IPs within modern SoCs, driven by Moore’s Law, has created a pressing need for fast and accurate hardware-software power-performance analysis. Traditional performance simulators (such as cycle accurate simulators) are often too slow to simulate full benchmarks within a reasonable timeframe; require considerable effort for development, maintenance, and extensions; and are prone to errors, making pre-silicon performance projections and competitive analysis increasingly challenging. Prior attempts in addressing this challenge using machine learning fall short as they are either slow, inaccurate or unable to predict the performance of full benchmarks. To address these limitations, we present PAI, the first technique to accurately predict full benchmark performance without relying on detailed simulation or instruction-wise encoding. At the heart of PAI is a hierarchical Long Short Term Memory (LSTM)-based model that takes a trace of microarchitecture independent features from a program execution and predicts performance metrics. We present the detailed design, implementation and evaluation of PAI. Our initial experiments showed that PAI can achieve an average IPC prediction error of 9.35% for SPEC CPU 2017 benchmark suite while taking only 2 min 57 sec for the entire suite. This prediction error is comparable to prior state-of-the-art techniques while requiring 3 orders of magnitude less time.
[AI-52] Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成代码时缺乏正确性保障的问题,即如何实现对代码实现的自动化形式化验证。现有方法难以构造可机器检查的证明,导致验证过程仍依赖人工干预。其解决方案的关键在于提出一种分层证明搜索框架(hierarchical proof search framework),该框架通过结构化的子目标分解策略将复杂验证目标简化为更易处理的子问题,并引入一个结合构造性合理性与结构有效性原则的评分机制作为训练奖励和推理阶段的排序标准,从而实现优化与部署的一致性对齐。此外,该方法采用统一策略网络 Goedel-Code-Prover-8B,基于监督初始化与混合强化学习进行训练,其中连续的分解奖励驱动规划探索,而监督回放机制稳定证明生成过程,在三个基于 Lean 4 的代码验证基准上实现了 62.0% 的成功证明率,显著优于现有最强基线模型。
链接: https://arxiv.org/abs/2603.19329
作者: Zenan Li,Ziran Yang,Deyuan (Mike) He,Haoyu Zhao,Andrew Zhao,Shange Tang,Kaiyu Yang,Aarti Gupta,Zhendong Su,Chi Jin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean 4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0% prove success rate, a 2.6× improvement over the strongest baseline, surpassing neural provers up to 84× larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.
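下面用一段极简 Python 勾勒"分层证明搜索"的基本控制流:先尝试策略级证明,失败则生成候选分解、按分解得分排序后递归处理子目标。其中 try_tactics、decompose、score 以及整数化的"目标复杂度"均为示意性假设,并非论文的真实接口:

```python
# 玩具版分层证明搜索:目标用一个整数"复杂度"表示,
# try_tactics / decompose / score 均为示意性替身,并非论文真实接口。

def try_tactics(goal):
    """策略级证明器的替身:只对足够简单的目标直接成功。"""
    return goal <= 2

def decompose(goal):
    """为目标提出若干候选分解(每个分解是一组子目标)。"""
    return [[goal - 1, 1], [goal // 2, goal - goal // 2]]

def score(subgoals):
    """分解得分:偏好结构上更简单(最大子目标更小)的分解。"""
    return -max(subgoals)

def prove(goal, depth=0, max_depth=10):
    if try_tactics(goal):
        return True
    if depth >= max_depth:
        return False
    # 按得分对候选分解排序,先尝试最优分解——对应"得分即推理期排序准则"
    for subgoals in sorted(decompose(goal), key=score, reverse=True):
        if all(prove(g, depth + 1, max_depth) for g in subgoals):
            return True
    return False

print(prove(20))  # 对策略级证明器而言过于复杂的目标,经分解后可解
```

论文中分解与补全由同一个统一策略网络(Goedel-Code-Prover-8B)给出,且同一个分解得分既作训练奖励又作推理期排序准则;此处仅演示其搜索骨架。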
[AI-53] Target Concept Tuning Improves Extreme Weather Forecasting
【速读】:该论文旨在解决深度学习模型在气象预报中对罕见但高影响事件(如台风)预测性能不足的问题,此类事件因数据稀缺导致模型易出现过拟合或忽略。其解决方案的关键在于提出一种可解释的概念门控微调框架TaCT,通过选择性模型优化实现对故障案例的针对性改进,同时保持常规场景下的性能稳定:TaCT利用稀疏自编码器(Sparse Autoencoders)与反事实分析自动识别与失败相关的内部概念,并仅在这些概念被激活时更新模型参数,而非进行统一微调,从而实现了精准、可解释且物理意义明确的适应性调整。
链接: https://arxiv.org/abs/2603.19325
作者: Shijie Ren,Xinyue Gu,Ziheng Peng,Haifan Zhang,Peisong Niu,Bo Wu,Xiting Wang,Liang Sun,Jirong Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models for meteorological forecasting often fail in rare but high-impact events such as typhoons, where relevant data is scarce. Existing fine-tuning methods typically face a trade-off between overlooking these extreme events and overfitting them at the expense of overall performance. We propose TaCT, an interpretable concept-gated fine-tuning framework that solves the aforementioned issue by selective model improvement: models are adapted specifically for failure cases while preserving performance in common scenarios. To this end, TaCT automatically discovers failure-related internal concepts using Sparse Autoencoders and counterfactual analysis, and updates parameters only when the corresponding concepts are activated, rather than applying uniform adaptation. Experiments show consistent improvements in typhoon forecasting across different regions without degrading other meteorological variables. The identified concepts correspond to physically meaningful circulation patterns, revealing model biases and supporting trustworthy adaptation in scientific forecasting tasks. The code is available at this https URL.
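TaCT"仅在失败相关概念被激活时才更新参数"的门控思想,可用如下示意代码表达。概念索引、阈值与梯度均为假设的玩具数据;论文中失败相关概念由稀疏自编码器与反事实分析自动发现:

```python
# TaCT 概念门控更新的极简示意:仅当"失败相关概念"被激活时才做梯度更新。
# 概念索引、阈值、梯度均为假设数据,并非论文实现。

def concept_active(features, failure_concepts, threshold=0.5):
    """门控:该样本上是否有失败相关概念被激活?"""
    return any(features[i] > threshold for i in failure_concepts)

def gated_update(weights, grad, features, failure_concepts, lr=0.1):
    """仅在失败概念激活时更新参数,常规样本保持模型不变。"""
    if not concept_active(features, failure_concepts):
        return weights
    return [w - lr * g for w, g in zip(weights, grad)]

weights, grad = [1.0, 2.0], [0.5, -0.5]
typhoon_feats = [0.0, 0.0, 0.0, 0.9]  # 概念 3 激活(极端事件样本)
normal_feats = [0.1, 0.2, 0.0, 0.0]   # 无失败概念激活(常规样本)

print(gated_update(weights, grad, typhoon_feats, [3]))  # 参数被调整
print(gated_update(weights, grad, normal_feats, [3]))   # 参数保持不变
```

这一门控正是"针对失败案例选择性改进、常规场景性能不受影响"的机制来源。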
[AI-54] A General Deep Learning Framework for Wireless Resource Allocation under Discrete Constraints
【速读】:该论文旨在解决深度学习(Deep Learning, DL)在处理涉及离散变量的连续无线资源分配问题时面临的三大挑战:一是反向传播中的零梯度问题,二是难以对离散变量施加复杂约束,三是无法生成具有非同参数同决策(non-same-parameter-same-decision, non-SPSD)特性的解。解决方案的关键在于引入支持集(support set)来表示离散变量,并将支持集元素建模为随机变量,学习其联合概率分布;通过将联合概率分解为条件概率的乘积,逐层学习每个条件概率,从而在概率分布层面自然规避零梯度问题,利用掩码机制在学习过程中无缝嵌入离散约束,并借助动态上下文嵌入(dynamic context embedding)捕捉演化中的离散解,天然满足non-SPSD特性。该框架在两类典型混合离散无线资源分配问题中验证了优越性:(a) 无基站系统中的用户关联与波束赋形联合优化,(b) 可移动天线系统中的天线位置与波束赋形联合优化。
链接: https://arxiv.org/abs/2603.19322
作者: Yikun Wang,Yang Li,Yik-Chung Wu,Rui Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:While deep learning (DL)-based methods have achieved remarkable success in continuous wireless resource allocation, efficient solutions for problems involving discrete variables remain challenging. This is primarily due to the zero-gradient issue in backpropagation, the difficulty of enforcing intricate constraints with discrete variables, and the inability in generating solutions with non-same-parameter-same-decision (non-SPSD) property. To address these challenges, this paper proposes a general DL framework by introducing the support set to represent the discrete variables. We model the elements of the support set as random variables and learn their joint probability distribution. By factorizing the joint probability as the product of conditional probabilities, each conditional probability is sequentially learned. This probabilistic modeling directly tackles all the aforementioned challenges of DL for handling discrete variables. By operating on probability distributions instead of hard binary decisions, the framework naturally avoids the zero-gradient issue. During the learning of the conditional probabilities, discrete constraints can be seamlessly enforced by masking out infeasible solutions. Moreover, with a dynamic context embedding that captures the evolving discrete solutions, the non-SPSD property is inherently provided by the proposed framework. We apply the proposed framework to two representative mixed-discrete wireless resource allocation problems: (a) joint user association and beamforming in cell-free systems, and (b) joint antenna positioning and beamforming in movable antenna-aided systems. Simulation results demonstrate that the proposed DL framework consistently outperforms existing baselines in terms of both system performance and computational efficiency.
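"条件概率逐步选择 + 掩码约束"的思想可用如下示意代码说明:逐元素构造离散解(支持集),每一步先屏蔽不可行选项,再对条件分布归一化。logits 形式与"不可重复选择"的约束均为示意设定,并非论文中的实际无线资源分配问题结构:

```python
import math

# 因式分解的条件概率选择示意:逐步构造支持集,掩码保证离散约束可行。

def masked_softmax(logits, mask):
    exps = [math.exp(l) if m else 0.0 for l, m in zip(logits, mask)]
    z = sum(exps)
    return [e / z for e in exps]

def select_support_set(logits_fn, n_items, k):
    chosen, mask = [], [True] * n_items
    for _ in range(k):
        # 条件分布依赖已选元素(动态上下文),天然带来 non-SPSD 行为
        probs = masked_softmax(logits_fn(chosen), mask)
        pick = max(range(n_items), key=lambda i: probs[i])
        chosen.append(pick)
        mask[pick] = False  # 掩码:同一元素不可重复选择
    return chosen

def toy_logits(chosen):
    if not chosen:
        return [0.0, 1.0, 2.0, 3.0]
    return [-abs(i - chosen[-1]) for i in range(4)]  # 偏好靠近上一选择

print(select_support_set(toy_logits, 4, 2))
```

由于模型输出的是概率分布而非硬二值决策,反向传播时不会遇到零梯度问题——这正是摘要中强调的三个挑战的化解方式之一。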
[AI-55] Ternary Gamma Semirings: From Neural Implementation to Categorical Foundations
【速读】:该论文旨在解决神经网络在组合泛化任务中表现不佳的问题,即标准神经网络在面对未见过的组合输入时准确率接近零(0%)。其解决方案的关键在于引入逻辑约束——Ternary Gamma Semiring(三元Gamma-半环),通过该约束使同一网络架构学习到一个具有完美结构的特征空间,从而实现对新组合的100%准确预测。研究表明,该特征空间构成一个有限交换三元Γ-半环,其三元运算实现多数投票规则,并且与Gokavarapu等人最近提出的分类体系中的布尔型三元Γ-半环(|T|=4, |Γ|=1)精确对应,证明了学习到的表示本质上是数学上“自然”的代数结构,且逻辑约束引导网络收敛至此类规范形式。
链接: https://arxiv.org/abs/2603.19317
作者: Ruoqi Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper establishes a theoretical framework connecting neural network learning with abstract algebraic structures. We first present a minimal counterexample demonstrating that standard neural networks completely fail on compositional generalization tasks (0% accuracy). By introducing a logical constraint – the Ternary Gamma Semiring – the same architecture learns a perfectly structured feature space, achieving 100% accuracy on novel combinations. We prove that this learned feature space constitutes a finite commutative ternary Γ-semiring, whose ternary operation implements the majority vote rule. Comparing with the recently established classification of Gokavarapu et al., we show that this structure corresponds precisely to the Boolean-type ternary Γ-semiring with |T|=4, |Γ|=1, which is unique up to isomorphism in their enumeration. Our findings reveal three profound conclusions: (i) the success of neural networks can be understood as an approximation of mathematically "natural" structures; (ii) learned representations generalize because they internalize algebraic axioms (symmetry, idempotence, majority property); (iii) logical constraints guide networks to converge to these canonical forms. This work provides a rigorous mathematical framework for understanding neural network generalization and inaugurates the new interdisciplinary direction of Computational Γ-Algebra.
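摘要中提到的"多数投票三元运算"及其公理(对称性、幂等性)可以在 4 元布尔载体 {00,01,10,11}(二比特组)上直接机器验证。以下只是对论文所述结构的示意性检查,并非其形式化证明:

```python
# 在 4 元布尔载体上检查多数投票三元运算的对称性与幂等性(示意)。

def maj(a, b, c):
    """三元素的按位多数投票。"""
    return tuple(int(x + y + z >= 2) for x, y, z in zip(a, b, c))

T = [(0, 0), (0, 1), (1, 0), (1, 1)]

# 幂等性:maj(a, a, b) == a
assert all(maj(a, a, b) == a for a in T for b in T)
# 对称性:对参数的任意置换不变
assert all(maj(a, b, c) == maj(c, a, b) == maj(b, a, c)
           for a in T for b in T for c in T)
print(maj((0, 1), (0, 1), (1, 0)))  # 多数元素胜出
```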
[AI-56] LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
【速读】:该论文旨在解决当前联合嵌入预测架构(Joint Embedding Predictive Architectures, JEPA)在训练过程中存在的不稳定性问题,例如依赖复杂的多目标损失函数、指数移动平均、预训练编码器或辅助监督信号以避免表示坍塌(representation collapse)。其解决方案的关键在于提出LeWorldModel(LeWM),这是首个能够从原始像素端到端稳定训练的JEPA模型,仅使用两项损失:下一嵌入预测损失和一个强制潜空间嵌入服从高斯分布的正则项。该设计将可调损失超参数从现有方法的六个减少至一个,显著简化了训练流程,并在单个GPU上仅需数小时即可完成约1500万参数的训练,推理速度比基于基础模型的世界模型快达48倍,同时在多种2D和3D控制任务中保持竞争力。
链接: https://arxiv.org/abs/2603.19312
作者: Lucas Maes,Quentin Le Lidec,Damien Scieur,Yann LeCun,Randall Balestriero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM’s latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
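LeWM"仅两项损失"的目标函数可用玩具向量示意如下:下一嵌入预测损失,加上推动嵌入趋向零均值、单位方差的正则项。此处用矩匹配作为"高斯分布约束"的简化替身,与论文实际正则项未必一致;lam 对应论文所说的唯一可调损失超参数:

```python
# LeWM 双项损失的玩具示意(矩匹配正则项为假设的简化替身)。

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def gaussian_reg(z):
    mean = sum(z) / len(z)
    var = sum((x - mean) ** 2 for x in z) / len(z)
    return mean ** 2 + (var - 1.0) ** 2  # 惩罚偏离标准矩

def lewm_loss(pred_next, true_next, z, lam=0.1):
    return mse(pred_next, true_next) + lam * gaussian_reg(z)

z = [1.0, -1.0, 1.0, -1.0]  # 零均值、单位方差的嵌入
print(lewm_loss([0.5, 0.5], [0.5, 0.5], z))  # 完美预测 + 标准矩 -> 0.0
```

把嵌入分布约束为高斯,正是该方法无需 EMA、预训练编码器或辅助监督即可避免表示坍塌的关键。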
[AI-57] MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在复杂推理任务中进行强化学习微调时,因奖励标签稀缺而导致的性能瓶颈问题。现有方法依赖大量人工标注或专家验证获取奖励信号,成本高昂且难以扩展。其解决方案的关键在于提出MemReward——一种基于图结构的经验记忆框架:将每个查询的推理过程与最终答案作为异构图中的节点,通过相似性与结构边连接,并利用图神经网络(Graph Neural Network, GNN)从少量已标注节点向未标注节点传播奖励信号,从而实现高效、可扩展的奖励分配机制。实验表明,该方法仅需20%的标签即可达到Oracle性能的97.3%,并展现出良好的泛化能力与标签预算的平滑可扩展性。
链接: https://arxiv.org/abs/2603.19310
作者: Tianyang Luo,Tao Feng,Zhigang Hua,Yan Xie,Shuang Yang,Ge Liu,Jiaxuan You
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning fine-tuning is constrained by the scarcity of reward labels. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
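"从少量有标签节点向无标签 rollout 传播奖励"的思想,可用最简单的标签传播示意(论文中实际由 GNN 在异构图上学习传播;此处图结构与标签均为玩具数据):

```python
# 奖励在 rollout 图上传播的玩具示意:有标签节点被钳制,
# 无标签节点迭代取邻居均值,作为 GNN 传播的最简替身。

def propagate(edges, labels, n, iters=50):
    rewards = [labels.get(i, 0.5) for i in range(n)]  # 0.5 = 未知先验
    nbrs = [[] for _ in range(n)]
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(iters):
        new = rewards[:]
        for i in range(n):
            if i in labels:  # 有标签 rollout 保持不变
                continue
            if nbrs[i]:
                new[i] = sum(rewards[j] for j in nbrs[i]) / len(nbrs[i])
        rewards = new
    return rewards

# 4 条 rollout:0 号已标为正确(1.0),3 号已标为错误(0.0),
# 1、2 号无标签,通过相似性边相连。
rewards = propagate([(0, 1), (1, 2), (2, 3)], {0: 1.0, 3: 0.0}, 4)
print([round(r, 2) for r in rewards])  # 无标签节点获得插值式奖励
```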
[AI-58] Exploring Subnetwork Interactions in Heterogeneous Brain Network via Prior-Informed Graph Learning
【速读】:该论文旨在解决现有基于Transformer的方法在学习功能子网络(functional subnetworks)间复杂交互时面临的挑战,尤其是在训练样本有限的情况下难以有效建模这些交互的问题。其解决方案的关键在于提出KD-Brain框架,该框架通过显式编码先验知识来引导学习过程:一是设计语义条件交互机制(Semantic-Conditioned Interaction),将语义先验注入注意力机制的查询端,基于子网络的功能身份显式导航其交互;二是引入病理一致性约束(Pathology-Consistent Constraint),通过将学习到的交互分布与临床先验对齐来正则化模型优化。该方法在多种精神障碍诊断任务中达到最先进性能,并识别出与精神病理生理学一致的可解释生物标志物。
链接: https://arxiv.org/abs/2603.19307
作者: Siyu Liu,Guangqi Wen,Peng Cao,Jinzhu Yang,Xiaoli Liu,Fei Wang,Osmar R. Zaiane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modeling the complex interactions among functional subnetworks is crucial for the diagnosis of mental disorders and the identification of functional pathways. However, learning the interactions of the underlying subnetworks remains a significant challenge for existing Transformer-based methods due to the limited number of training samples. To address these challenges, we propose KD-Brain, a Prior-Informed Graph Learning framework for explicitly encoding prior knowledge to guide the learning process. Specifically, we design a Semantic-Conditioned Interaction mechanism that injects semantic priors into the attention query, explicitly navigating the subnetwork interactions based on their functional identities. Furthermore, we introduce a Pathology-Consistent Constraint, which regularizes the model optimization by aligning the learned interaction distributions with clinical priors. Additionally, KD-Brain leads to state-of-the-art performance on a wide range of disorder diagnosis tasks and identifies interpretable biomarkers consistent with psychiatric pathophysiology. Our code is available at this https URL.
[AI-59] Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology
【速读】:该论文旨在解决 observational研究中对STROBE声明(Strengthening the Reporting of Observational Studies in Epidemiology)合规性评估耗时且主观的问题。其解决方案的关键在于比较大型语言模型(Large Language Models, LLMs)与人类评审小组及原始作者在评估风湿病学领域观察性研究时的一致性,以检验LLMs是否具备替代人工进行初步筛查的潜力。研究发现,尽管LLMs在标准格式类条目上表现出与人类评审者完全一致的高可靠性(AC1=1.000),但在涉及方法学复杂性条目(如失访处理)上的表现显著下降,表明当前LLMs更适用于标准化的基础核查,而非取代专业人员对复杂方法学质量的判断。
链接: https://arxiv.org/abs/2603.19303
作者: Emre Bilgin,Ebru Ozturk,Meera Shah,Lisa Traboco,Rebecca Everitt,Ai Lyn Tan,Marwan Bukhari,Vincenzo Venerito,Latika Gupta
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures, 2 supplementary figures
Abstract:Introduction: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research. Methods: Guided by the GRRAS and DEAL Pathway B frameworks, 17 rheumatology articles were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a five-person human panel (ranging from junior to senior professionals), and two LLMs (ChatGPT-5.2, Gemini-3Pro). Items were grouped into Methodological Rigor and Presentation and Context domains. Inter-rater reliability was calculated using Gwet’s Agreement Coefficient (AC1). Results: Overall agreement across all reviewers was 85.0% (AC1=0.826). Domain stratification showed almost perfect agreement for Presentation and Context (AC1=0.841) and substantial agreement for Methodological Rigor (AC1=0.803). Although LLMs achieved complete agreement (AC1=1.000) with all human reviewers on standard formatting elements, their agreement with human reviewers and authors declined on complex items. For example, regarding the item on loss to follow-up, the agreement between Gemini 3 Pro and the senior reviewer was AC1=-0.252, while the agreement with the authors was only fair. Additionally, ChatGPT-5.2 generally demonstrated higher agreement with human reviewers than Gemini-3Pro on specific methodological items. Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex methodological items likely reflects a reliance on surface-level information. Currently, these models appear more reliable for standardizing straightforward checks than for replacing expert human judgment in evaluating observational research.
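文中反复出现的一致性指标 Gwet's AC1,在"两名评审者、二分类条目"的情形下按标准一阶机会校正公式计算:AC1 = (Pa − Pe) / (1 − Pe),其中 Pe = Σ_k π_k(1 − π_k)/(Q − 1),π_k 为两名评审者对类别 k 边际比例的均值。以下为一个可运行的小实现(评分数据为假设示例):

```python
# 两评审者、二分类条目下 Gwet's AC1 的计算示意。

def gwet_ac1(r1, r2, categories=(0, 1)):
    n, q = len(r1), len(categories)
    pa = sum(a == b for a, b in zip(r1, r2)) / n  # 观测一致率
    pe = 0.0
    for k in categories:
        pi_k = (r1.count(k) / n + r2.count(k) / n) / 2  # 平均边际比例
        pe += pi_k * (1 - pi_k) / (q - 1)
    return (pa - pe) / (1 - pe)

# 混合评分下完全一致 -> AC1 = 1.0
print(gwet_ac1([1, 1, 0, 1], [1, 1, 0, 1]))
```

相比 Cohen's kappa,AC1 对边际分布极端不平衡(如几乎所有条目都被评为"符合")更稳健,这也是报告规范评估研究常用它的原因。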
[AI-60] Parameter-Efficient Token Embedding Editing for Clinical Class-Level Unlearning
【速读】:该论文旨在解决临床语言模型中敏感信息删除的难题,即在不从头重新训练模型的前提下,实现对特定行为类别的有效遗忘,同时最小化参数修改并保持模型在其他任务上的性能。解决方案的关键在于提出一种名为稀疏词元嵌入遗忘(Sparse Token Embedding Unlearning, STEU)的方法,该方法仅更新基于PMI(Pointwise Mutual Information)选择的词元嵌入以及少量分类头参数,而冻结所有编码器层,从而实现高效、精准的行为类级别遗忘,实验表明在MIMIC-IV等数据集上仅修改0.19%的参数即可达到近乎完全遗忘(forget F1 = 0.0004),同时保留较高的任务性能(retain avg F1 = 0.4766)。
链接: https://arxiv.org/abs/2603.19302
作者: Iyad Ait Hou,Shrenik Borad,Harsh Sharma,Pooja Srinivasan,Rebecca Hwa,Aya Zirikly
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:Machine unlearning is increasingly important for clinical language models, where privacy regulations and institutional policies may require removing sensitive information from deployed systems without retraining from scratch. In practice, deletion requests must balance effective forgetting of targeted information with preservation of model utility and minimal parameter modification. We introduce Sparse Token Embedding Unlearning (STEU), a parameter-efficient method for behavioral class-level unlearning that updates only PMI-selected token embeddings together with a small classifier head while keeping all encoder layers frozen. Across experiments on MIMIC-IV, MIMIC-III, and eICU using BioClinicalBERT, BERT-base, and DistilBERT, STEU consistently suppresses the target class while largely preserving retained task performance. In the primary MIMIC-IV setting, STEU achieves near-complete forgetting (forget F1 = 0.0004) while maintaining competitive retained utility (retain avg F1 = 0.4766) after modifying only 0.19% of model parameters. These results suggest that targeted behavioral unlearning can be achieved through sparse embedding edits without modifying deeper encoder representations.
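STEU 的"PMI 选词"环节可示意如下:按词元与遗忘类别的点互信息 PMI(t, c) = log[p(t, c)/(p(t)p(c))] 排序,真实方法随后仅解冻得分最高的那批词元嵌入。语料计数均为玩具数值,词表仅作举例:

```python
import math

# PMI 选词示意:类别特异性词元得分高,高频通用词得分趋近 0。

def pmi(count_tc, count_t, count_c, total):
    p_tc, p_t, p_c = count_tc / total, count_t / total, count_c / total
    return math.log(p_tc / (p_t * p_c))

# 词元 -> (与遗忘类别共现次数, 总出现次数);
# 假设语料共 1000 篇,遗忘类别占 100 篇
counts = {"sepsis": (40, 50), "patient": (60, 500), "the": (100, 1000)}
n_docs, n_class = 1000, 100

scores = {t: pmi(tc, tot, n_class, n_docs)
          for t, (tc, tot) in counts.items()}
top = max(scores, key=scores.get)
print(top)  # 类别特异性词元胜出,而非高频通用词
```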
[AI-61] A Visualization for Comparative Analysis of Regression Models
【速读】:该论文旨在解决传统回归模型性能评估中因过度聚合信息而导致的局限性问题,即仅依赖单一数值指标(如MAE、RMSE或R²)难以揭示模型误差的分布特征与潜在模式。其解决方案的关键在于提出一种新颖的二维可视化方法:首先在二维空间中同时呈现两个模型的残差(residuals),从而实现对比分析;其次引入马氏距离(Mahalanobis distance)以考虑数据内部的相关性和尺度差异;最后采用基于百分位数的色彩映射(colormap)直观展示误差分布密度和异常值区域。此方法通过图形化呈现误差的分布及其相关性,提供比传统指标更细致、全面的模型性能洞察,有助于识别被平均指标掩盖的深层模式。
链接: https://arxiv.org/abs/2603.19291
作者: Nassime Mountasir(ICube),Baptiste Lafabregue(ICube),Bruno Albert,Nicolas Lachiche(ICube)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:As regression is a widely studied problem, many methods have been proposed to solve it, each of them often requiring setting different hyper-parameters. Therefore, selecting the proper method for a given application may be very difficult and relies on comparing their performances. Performance is usually measured using various metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or R-squared (R^2). These metrics provide a numerical summary of predictive accuracy by quantifying the difference between predicted and actual values. However, while these metrics are widely used in the literature for summarizing model performance and useful to distinguish between models performing poorly and well, they often aggregate too much information. This article addresses these limitations by introducing a novel visualization approach that highlights key aspects of regression model performance. The proposed method builds upon three main contributions: (1) considering the residuals in a 2D space, which allows for simultaneous evaluation of errors from two models, (2) leveraging the Mahalanobis distance to account for correlations and differences in scale within the data, and (3) employing a colormap to visualize the percentile-based distribution of errors, making it easier to identify dense regions and outliers. By graphically representing the distribution of errors and their correlations, this approach provides a more detailed and comprehensive view of model performance, enabling users to uncover patterns that traditional aggregate metrics may obscure. The proposed visualization method facilitates a deeper understanding of regression model performance differences and error distributions, enhancing the evaluation and comparison process.
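该可视化的核心计算是二维残差上的马氏距离:对每个样本取两个模型的误差 (e1, e2),再按协方差校正后的距离评分,从而同时考虑两轴误差的尺度差异与相关性。以下为 2×2 情形的小实现(数据为假设示例):

```python
# 二维残差的马氏距离:d = sqrt([dx dy] * cov^{-1} * [dx dy]^T)。

def mahalanobis_2d(point, mean, cov):
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # 2x2 逆矩阵
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return q ** 0.5

# 单位协方差时,马氏距离退化为欧氏距离
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))
```

当某一轴的误差方差更大时,同样的偏移会得到更小的马氏距离(如协方差 ((4,0),(0,1)) 下点 (4,0) 的距离为 2 而非 4),这正是"考虑尺度与相关性"的意义所在。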
[AI-62] Neural Dynamics Self-Attention for Spiking Transformers
【速读】:该论文旨在解决现有脉冲神经网络(Spiking Neural Networks, SNNs)与Transformer架构融合时面临的两大问题:一是性能显著低于人工神经网络(Artificial Neural Networks, ANNs)的模型;二是推理过程中存在较高的内存开销。其解决方案的关键在于提出一种名为LRF-Dyn的新机制,通过引入局部感受野(Localized Receptive Fields, LRF)增强脉冲自注意力(Spiking Self-Attention, SSA)中的局部建模能力,同时利用电荷-放电-重置动力学近似注意力计算过程,从而避免显式存储大规模注意力矩阵,显著降低推理阶段的内存占用并提升性能。
链接: https://arxiv.org/abs/2603.19290
作者: Dehao Zhang,Fukai Guo,Shuai Wang,Jingya Wang,Jieyuan Zhang,Yimeng Shan,Malu Zhang,Yang Yang,Haizhou Li
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Integrating Spiking Neural Networks (SNNs) with Transformer architectures offers a promising pathway to balance energy efficiency and performance, particularly for edge vision applications. However, existing Spiking Transformers face two critical challenges: (i) a substantial performance gap compared to their Artificial Neural Networks (ANNs) counterparts and (ii) high memory overhead during inference. Through theoretical analysis, we attribute both limitations to the Spiking Self-Attention (SSA) mechanism: the lack of locality bias and the need to store large attention matrices. Inspired by the localized receptive fields (LRF) and membrane-potential dynamics of biological visual neurons, we propose LRF-Dyn, which uses spiking neurons with localized receptive fields to compute attention while reducing memory requirements. Specifically, we introduce a LRF method into SSA to assign higher weights to neighboring regions, strengthening local modeling and improving performance. Building on this, we approximate the resulting attention computation via charge-fire-reset dynamics, eliminating explicit attention-matrix storage and reducing inference-time memory. Extensive experiments on visual tasks confirm that our method reduces memory overhead while delivering significant performance improvements. These results establish it as a key unit for achieving energy-efficient Spiking Transformers.
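摘要中"充电-放电-重置"动力学的好处,可用单个漏积分式脉冲单元粗略示意:把逐位置得分累积进膜电位,过阈值则发放并重置,因而只需常数级状态而无需显式存储整张注意力矩阵。以下只是对该机制思想的简化演示,并非 LRF-Dyn 的实际实现:

```python
# "充电-放电-重置"的最简示意:膜电位累积输入,过阈发放并软重置。

def charge_fire_reset(inputs, threshold=1.0):
    potential, spikes = 0.0, []
    for x in inputs:
        potential += x                # 充电:累积输入得分
        if potential >= threshold:
            spikes.append(1)          # 放电
            potential -= threshold    # 软重置
        else:
            spikes.append(0)
    return spikes

print(charge_fire_reset([0.5, 0.5, 0.5, 0.5]))  # 每累积到 1.0 发放一次
```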
[AI-63] Speculating Experts Accelerates Inference for Mixture-of-Experts
【速读】:该论文旨在解决在内存受限的推理场景下,混合专家(Mixture-of-Experts, MoE)模型因专家权重需从CPU卸载至GPU而导致的性能瓶颈问题,即CPU-GPU数据传输成为解码阶段的主要延迟来源。解决方案的关键在于提出一种专家预取(expert prefetching)机制,该机制利用当前已计算的内部模型表示来推测未来将被路由选择的专家,从而实现内存传输与计算的重叠。实验表明,这些内部表示能够可靠地预测未来专家,并且推测执行通常不会显著影响下游任务准确性,因此可避免重新加载真实路由器选择的专家,提升计算-内存重叠效率。集成到优化的推理引擎后,该方法相比按需从CPU内存加载专家的方式,最多可降低14%的每输出token时间(Time Per Output Token, TPOT)。对于仅靠推测执行精度不足的情况,进一步引入轻量级估计器以提高专家预测命中率,减少性能下降。
链接: https://arxiv.org/abs/2603.19289
作者: Vivan Madan,Prajwal Singhania,Abhinav Bhatele,Tom Goldstein,Ashwinee Panda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts. Integrated into an optimized inference engine, our approach achieves up to 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is released in open-source at this https URL.
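专家投机预取的收益可用一个玩具模拟说明:一个(假设的)预测器根据当前隐状态猜测路由器将选中的专家,运行时先把它预取进 GPU 侧缓存;命中则传输与计算重叠,未命中则退回按需加载。router 与 predictor 均为示意性整数函数,与论文的真实路由器无关:

```python
# 专家投机预取的命中/未命中玩具模拟。

def router(hidden):
    return hidden % 4                    # 真实 top-1 路由(玩具版)

def predictor(hidden):
    return (hidden + hidden // 8) % 4    # 故意设计成不完美的投机

hits = misses = 0
for hidden in range(32):                 # 逐 token 的隐状态流
    cache = {predictor(hidden)}          # 预取被投机的专家
    if router(hidden) in cache:
        hits += 1                        # 命中:传输已与计算重叠
    else:
        misses += 1                      # 未命中:退回按需 CPU->GPU 加载
print(hits, misses)
```

论文进一步指出,直接执行被投机的专家通常不损下游精度,因而未命中时甚至可以省去重新加载真实专家的开销。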
[AI-64] CDEoH: Category-Driven Automatic Algorithm Design With Large Language Models
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的启发式搜索方法在自动化算法生成过程中存在的进化不稳定性和早熟收敛问题。现有方法主要依赖提示工程或联合演化思维与代码,但忽视了算法类别多样性对维持进化稳定性的关键作用。解决方案的关键在于提出Category Driven Automatic Algorithm Design with Large Language Models (CDEoH),通过显式建模算法类别,并在种群管理中协同平衡性能与类别多样性,从而实现多个算法范式的并行探索,显著提升进化稳定性并在不同规模的组合优化任务中获得更优且一致的平均性能表现。
链接: https://arxiv.org/abs/2603.19284
作者: Yu-Nian Wang,Shen-Huan Lyu,Ning Chen,Jia-Le Xu,Baoliu Ye,Qingfu Zhang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid advancement of large language models (LLMs), LLM-based heuristic search methods have demonstrated strong capabilities in automated algorithm generation. However, their evolutionary processes often suffer from instability and premature convergence. Existing approaches mainly address this issue through prompt engineering or by jointly evolving thought and code, while largely overlooking the critical role of algorithmic category diversity in maintaining evolutionary stability. To this end, we propose Category Driven Automatic Algorithm Design with Large Language Models (CDEoH), which explicitly models algorithm categories and jointly balances performance and category diversity in population management, enabling parallel exploration across multiple algorithmic paradigms. Extensive experiments on representative combinatorial optimization problems across multiple scales demonstrate that CDEoH effectively mitigates convergence toward a single evolutionary direction, significantly enhancing evolutionary stability and achieving consistently superior average performance across tasks and scales.
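"同时平衡性能与类别多样性"的种群管理可示意如下:裁剪种群时先保留每个算法类别中的最优个体,再按适应度补齐,避免单一范式垄断种群。类别标签与适应度均为玩具数据;论文中类别由 LLM 显式建模与标注:

```python
# 类别感知的幸存者选择示意:先逐类别保底,再按适应度补齐。

def select(population, size):
    # population: (名称, 类别, 适应度) 列表,适应度越大越好
    best_per_cat = {}
    for ind in sorted(population, key=lambda x: -x[2]):
        best_per_cat.setdefault(ind[1], ind)  # 首次出现即该类最优
    kept = list(best_per_cat.values())[:size]
    rest = sorted((p for p in population if p not in kept),
                  key=lambda x: -x[2])
    return kept + rest[:size - len(kept)]

pop = [("greedy-a", "greedy", 0.9), ("greedy-b", "greedy", 0.8),
       ("greedy-c", "greedy", 0.7), ("local-a", "local-search", 0.5)]
survivors = select(pop, 3)
print(sorted(name for name, _, _ in survivors))
```

纯按适应度截断会淘汰 local-a,使种群收敛到单一贪心范式;类别保底则保留了并行探索多个范式的可能。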
[AI-65] Survey of Various Fuzzy and Uncertain Decision-Making Methods
【速读】:该论文旨在解决现实应用中决策过程常受模糊性(vagueness)、信息不完整、异构数据及专家意见冲突等因素影响的问题,系统梳理了不确定性感知的多准则决策方法(uncertainty-aware multi-criteria decision-making, MCDM)。其解决方案的关键在于构建一个任务导向的分类体系,涵盖问题设置(如离散型、群体共识型、动态型等七类场景)、权重获取方式(基于模糊/语言输入的主观与客观方法)以及准则间结构与因果建模,并对比不同求解策略:包括补偿型评分法、参考点距离与妥协方案、非补偿型排序框架,以及规则/证据驱动和序贯决策模型。该综述明确提炼了典型输入、核心计算步骤与输出形式,为依据鲁棒性、可解释性和数据可用性选择合适方法提供指导,同时指出了未来在可解释不确定性融合、稳定性与大规模动态环境下的可扩展性等开放方向。
链接: https://arxiv.org/abs/2603.15709
作者: Takaaki Fujita,Florentin Smarandache
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Book. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-883-3. 446 pages
Abstract:Decision-making in real applications is often affected by vagueness, incomplete information, heterogeneous data, and conflicting expert opinions. This survey reviews uncertainty-aware multi-criteria decision-making (MCDM) and organizes the field into a concise, task-oriented taxonomy. We summarize problem-level settings (discrete, group/consensus, dynamic, multi-stage, multi-level, multiagent, and multi-scenario), weight elicitation (subjective and objective schemes under fuzzy/linguistic inputs), and inter-criteria structure and causality modelling. For solution procedures, we contrast compensatory scoring methods, distance-to-reference and compromise approaches, and non-compensatory outranking frameworks for ranking or sorting. We also outline rule/evidence-based and sequential decision models that produce interpretable rules or policies. The survey highlights typical inputs, core computational steps, and primary outputs, and provides guidance on choosing methods according to robustness, interpretability, and data availability. It concludes with open directions on explainable uncertainty integration, stability, and scalability in large-scale and dynamic decision environments.
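综述中"参考点距离/妥协方案"一类方法可以 TOPSIS 为例:按与正理想解的接近程度、与负理想解的远离程度为备选方案打分。决策矩阵与权重为玩具数据,且此处假设全部为效益型准则:

```python
# TOPSIS 示意:向量归一化 -> 加权 -> 计算到正/负理想解的距离 -> 接近度系数。

def topsis(matrix, weights):
    cols = list(zip(*matrix))
    norms = [sum(v * v for v in col) ** 0.5 for col in cols]
    weighted = [[w * v / s for v, s, w in zip(row, norms, weights)]
                for row in matrix]
    ideal = [max(col) for col in zip(*weighted)]   # 正理想解
    anti = [min(col) for col in zip(*weighted)]    # 负理想解
    scores = []
    for row in weighted:
        d_pos = sum((v - i) ** 2 for v, i in zip(row, ideal)) ** 0.5
        d_neg = sum((v - a) ** 2 for v, a in zip(row, anti)) ** 0.5
        scores.append(d_neg / (d_pos + d_neg))     # 接近度系数
    return scores

scores = topsis([[7, 9], [8, 7], [9, 6]], [0.5, 0.5])  # 3 方案 x 2 准则
best = max(range(len(scores)), key=lambda i: scores[i])
print(best)
```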
[AI-66] AI Agents Can Already Autonomously Perform Experimental High Energy Physics
【速读】:该论文旨在解决高能物理(High Energy Physics, HEP)分析流程中繁琐且重复的代码开发与执行问题,当前依赖人工编写和调试分析代码严重制约了科研效率。解决方案的关键在于构建一个名为“Just Furnish Context (JFC)”的集成框架,该框架利用生成式AI代理(Generative AI Agents)结合文献驱动的知识检索与多代理评审机制,实现从事件选择、背景估计、不确定性量化到统计推断及论文撰写等全流程自动化。研究表明,Claude Code等大语言模型可在少量专家输入下自主完成典型HEP分析任务,从而将研究人员从重复性技术工作中解放出来,聚焦于物理洞察、方法创新与严格验证。
链接: https://arxiv.org/abs/2603.20179
作者: Eric A. Moreno,Samuel Bright-Thonney,Andrzej Novak,Dolores Garcia,Philip Harris
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert-curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude Code succeeds in automating all stages of a typical analysis: event selection, background estimation, uncertainty quantification, statistical inference, and paper drafting. We argue that the experimental HEP community is underestimating the current capabilities of these systems, and that most proposed agentic workflows are too narrowly scoped or scaffolded to specific analysis structures. We present a proof-of-concept framework, Just Furnish Context (JFC), that integrates autonomous analysis agents with literature-based knowledge retrieval and multi-agent review, and show that this is sufficient to plan, execute, and document a credible high energy physics analysis. We demonstrate this by conducting analyses on open data from ALEPH, DELPHI, and CMS to perform electroweak, QCD, and Higgs boson measurements. Rather than replacing physicists, these tools promise to offload the repetitive technical burden of analysis code development, freeing researchers to focus on physics insight, truly novel method development, and rigorous validation. Given these developments, we advocate for new strategies for how the community trains students, organizes analysis efforts, and allocates human expertise.
[AI-67] Physics-Informed Long-Range Coulomb Correction for Machine-learning Hamiltonians
【速读】:该论文旨在解决当前机器学习电子哈密顿量模型在处理极性晶体和异质结构时忽略长程库仑相互作用的问题,此类相互作用对宏观电场和极化行为至关重要。解决方案的关键在于通过变分分解静电能,推导出非正交原子轨道基下长程哈密顿量矩阵元的闭式表达式,并建立电子密度矩阵到有效原子电荷的变分一致映射;进而提出HamGNN-LR框架,采用双通道架构融合E(3)-等变消息传递与倒易空间Ewald求和,从而实现物理驱动的长程修正,显著提升模型精度与泛化能力。
链接: https://arxiv.org/abs/2603.20007
作者: Yang Zhong,Xiwen Li,Xingao Gong,Hongjun Xiang
机构: 未知
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 9 pages,3 figures
Abstract:Machine-learning electronic Hamiltonians achieve orders-of-magnitude speedups over density-functional theory, yet current models omit long-range Coulomb interactions that govern physics in polar crystals and heterostructures. We derive closed-form long-range Hamiltonian matrix elements in a nonorthogonal atomic-orbital basis through variational decomposition of the electrostatic energy, deriving a variationally consistent mapping from the electron density matrix to effective atomic charges. We implement this framework in HamGNN-LR, a dual-channel architecture combining E(3)-equivariant message passing with reciprocal-space Ewald summation. Benchmarks demonstrate that physics-based long-range corrections are essential: purely data-driven attention mechanisms fail to capture macroscopic electrostatic potentials. Benchmarks on polar ZnO slabs, CdSe/ZnS heterostructures, and GaN/AlN superlattices show two- to threefold error reductions and robust transferability to systems far beyond training sizes, eliminating the characteristic staircase artifacts that plague short-range models in the presence of built-in electric fields.
[AI-68] Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech? AAAI2026
【速读】:该论文旨在解决当前神经语音合成(Neural Speech Synthesis)系统中缺乏对手部手势(hand gestures)与语音韵律(prosody)协同建模的问题。现有文本到语音(Text-to-Speech, TTS)系统虽已开始引入面部表情或唇动等多模态线索,但未充分探索手势如何动态调节语音的语调、情感和强调特征。解决方案的关键在于提出一种名为Gesture2Speech的新颖多模态TTS框架,其核心创新是设计了一个融合语言内容与手势特征的多专家混合(Mixture-of-Experts, MoE)风格提取模块,并通过一个显式的手势-语音对齐损失函数确保手势动作与语音韵律在时间维度上的细粒度同步,从而实现由手部动作驱动的自然且时序一致的韵律调控。
Link: https://arxiv.org/abs/2603.19831
Authors: Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik
Affiliation: unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted at The 2nd International Workshop on Bodily Expressed Emotion Understanding (BEEU) at AAAI 2026 [non-archival]
Abstract:Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at this https URL
[AI-69] Data-driven ensemble prediction of the global ocean
Quick Read: This paper tackles the open challenge of extending machine learning to probabilistic global ocean forecasting, i.e., improving uncertainty quantification for ocean variables (sea-surface temperature, sea-surface height, subsurface temperature and salinity, and ocean currents) while remaining computationally efficient. The key is FuXi-ONS, the first machine-learning ensemble forecasting system for the global ocean, which learns physically structured perturbations and incorporates an atmospheric encoding module to stabilize long-range forecasts. It clearly outperforms deterministic and noise-perturbed baselines while maintaining high accuracy and running orders of magnitude faster than conventional ensemble systems.
Link: https://arxiv.org/abs/2603.19591
Authors: Qiusheng Huang, Xiaohui Zhong, Anboyu Guo, Ziyi Peng, Lei Chen, Hao Li
Affiliation: unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:Data-driven models have advanced deterministic ocean forecasting, but extending machine learning to probabilistic global ocean prediction remains an open challenge. Here we introduce FuXi-ONS, the first machine-learning ensemble forecasting system for the global ocean, providing 5-day forecasts on a global 1° grid up to 365 days for sea-surface temperature, sea-surface height, subsurface temperature, salinity and ocean currents. Rather than relying on repeated integration of computationally expensive numerical models, FuXi-ONS learns physically structured perturbations and incorporates an atmospheric encoding module to stabilize long-range forecasts. Evaluated against GLORYS12 reanalysis, FuXi-ONS improves both ensemble-mean skill and probabilistic forecast quality relative to deterministic and noise-perturbed baselines, and shows competitive performance against established seasonal forecast references for SST and Niño3.4 variability, while running orders of magnitude faster than conventional ensemble systems. These results provide a strong example of machine learning advancing a core problem in ocean science, and establish a practical path toward efficient probabilistic ocean forecasting and climate risk assessment.
[AI-70] Joint Return and Risk Modeling with Deep Neural Networks for Portfolio Construction
Quick Read: This paper addresses the poor performance of traditional portfolio construction under time-varying market conditions, which stems from separately estimating expected returns and covariance matrices from historical statistics. The key to the solution is a deep neural network framework for joint return and risk modeling that learns dynamic expected returns and risk structures end-to-end from sequential financial data, effectively capturing volatility clustering and regime shifts. Integrated into portfolio optimization, it delivers substantially better risk-adjusted performance (36.4% annualized return, Sharpe ratio 0.91) than equal-weight and historical mean-variance benchmarks.
Link: https://arxiv.org/abs/2603.19288
Authors: Keonvin Park
Affiliation: unknown
Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Portfolio construction traditionally relies on separately estimating expected returns and covariance matrices using historical statistics, often leading to suboptimal allocation under time-varying market conditions. This paper proposes a joint return and risk modeling framework based on deep neural networks that enables end-to-end learning of dynamic expected returns and risk structures from sequential financial data. Using daily data from ten large-cap US equities spanning 2010 to 2024, the proposed model is evaluated across return prediction, risk estimation, and portfolio-level performance. Out-of-sample results during 2020 to 2024 show that the deep forecasting model achieves competitive predictive accuracy (RMSE = 0.0264) with economically meaningful directional accuracy (51.9%). More importantly, the learned representation effectively captures volatility clustering and regime shifts. When integrated into portfolio optimization, the proposed Neural Portfolio strategy achieves an annual return of 36.4% and a Sharpe ratio of 0.91, outperforming equal weight and historical mean-variance benchmarks in terms of risk-adjusted performance. These findings demonstrate that jointly modeling return and covariance dynamics can provide consistent improvements over traditional allocation approaches. The framework offers a scalable and practical alternative for data-driven portfolio construction under nonstationary market conditions.
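The headline figures above (36.4% annualized return, Sharpe ratio 0.91) follow standard performance conventions. A minimal sketch of how such metrics are computed from a series of daily returns, assuming a 252-trading-day year and a zero risk-free rate (both common conventions, not stated in the abstract):

```python
import numpy as np

# Minimal sketch of the portfolio metrics quoted above, computed from a
# series of daily simple returns. The 252-trading-day year and zero
# risk-free rate are assumed conventions, not taken from the paper.

def annualized_return(daily_returns):
    """Geometric annualized return from daily simple returns."""
    total_growth = np.prod(1.0 + daily_returns)
    years = len(daily_returns) / 252.0
    return total_growth ** (1.0 / years) - 1.0

def sharpe_ratio(daily_returns, risk_free_annual=0.0):
    """Annualized Sharpe ratio of daily excess returns."""
    excess = daily_returns - risk_free_annual / 252.0
    return np.sqrt(252.0) * excess.mean() / excess.std(ddof=1)

rng = np.random.default_rng(0)
r = rng.normal(0.001, 0.01, size=252)   # one synthetic trading year
```

A flat (all-zero) return series annualizes to exactly 0%, which makes for a quick sanity check of the geometric compounding.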
Machine Learning
[LG-0] Kolmogorov-Arnold causal generative models
Link: https://arxiv.org/abs/2603.20184
Authors: Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 pages, 8 figures, 3 tables, 5 algorithms, preprint
Abstract:Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high-stakes domains. We propose KaCGM, a causal generative model for mixed-type tabular data where each structural equation is parameterized by a Kolmogorov–Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent–child relationships, while preserving query-agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi-synthetic benchmarks show competitive performance against state-of-the-art methods. A real-world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings. Code: this https URL
[LG-1] Revisiting Gene Ontology Knowledge Discovery with Hierarchical Feature Selection and Virtual Study Group of AI Agents
Link: https://arxiv.org/abs/2603.20132
Authors: Cen Wan, Alex A. Freitas
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large language models have achieved great success in multiple challenging tasks, and their capacity can be further boosted by emerging agentic AI techniques. This new computing paradigm has already started revolutionising traditional scientific discovery pipelines. In this work, we propose a novel agentic AI-based knowledge discovery-oriented virtual study group that aims to extract meaningful ageing-related biological knowledge considering highly ageing-related Gene Ontology terms that are selected by hierarchical feature selection methods. We investigate the performance of the proposed agentic AI framework by considering four different model organisms’ ageing-related Gene Ontology terms and validate the biological findings by reviewing existing research articles. It is found that the majority of the AI agent-generated scientific claims can be supported by the existing literature, and the proposed internal mechanisms of the virtual study group also play an important role in the designed agentic AI-based knowledge discovery framework.
[LG-2] Conditioning Protein Generation via Hopfield Pattern Multiplicity
Link: https://arxiv.org/abs/2603.20115
Authors: Jeffrey D. Varner
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored sequences equally and cannot direct generation toward a functional subset of interest. We show that a single scalar parameter, added as a bias to the sampler’s attention logits, continuously shifts generation from the full family toward a user-specified subset, with no retraining and no change to the model architecture. A practitioner supplies a small set of sequences (for example, hits from a binding screen) and a multiplicity ratio that controls how strongly generation favors them. The method is agnostic to what the subset represents: binding, stability, specificity, or any other property. We find that the conditioning is exact at the level of the sampler’s internal representation, but that the decoded sequence phenotype can fall short because the dimensionality reduction used to encode sequences does not always preserve the residue-level variation that defines the functional split. We term this discrepancy the calibration gap and show that it is predicted by a simple geometric measure of how well the encoding separates the functional subset from the rest of the family. Experiments on five Pfam families (Kunitz, SH3, WW, Homeobox, and Forkhead domains) confirm the monotonic relationship between separation and gap across a fourfold range of geometries. Applied to omega-conotoxin peptides targeting a calcium channel involved in pain signaling, curated seeding from 23 characterized binders produces over a thousand candidates that preserve the primary pharmacophore and all experimentally identified binding determinants. These results show that stochastic attention enables practitioners to expand a handful of experimentally characterized sequences into diverse candidate libraries without retraining a generative model.
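The paper's core mechanism is a single scalar bias added to the sampler's attention logits for a user-chosen subset of stored sequences. A toy sketch of that idea, assuming a plain softmax retrieval step over pattern embeddings (`beta`, the embeddings, and the `log_multiplicity` parameterization are illustrative assumptions, not the paper's exact sampler):

```python
import numpy as np

# Toy sketch of subset conditioning via a scalar bias on attention
# logits, assuming a plain softmax retrieval over stored pattern
# embeddings. Names and the bias parameterization are illustrative.

def biased_attention_weights(query, patterns, subset_mask, log_multiplicity, beta=1.0):
    """Softmax over stored patterns, with the logits of a user-chosen
    subset (subset_mask == 1) shifted by a single scalar."""
    logits = beta * (patterns @ query)               # similarity logits
    logits = logits + log_multiplicity * subset_mask # upweight subset
    logits = logits - logits.max()                   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
patterns = rng.normal(size=(10, 4))   # 10 stored sequence embeddings
query = rng.normal(size=4)
subset = np.zeros(10)
subset[:3] = 1.0                      # e.g. 3 experimentally known binders
w_plain = biased_attention_weights(query, patterns, subset, 0.0)
w_biased = biased_attention_weights(query, patterns, subset, 5.0)
```

Increasing the bias continuously shifts retrieval mass toward the subset, while a zero bias recovers the unconditioned sampler, matching the "no retraining, no architecture change" property described above.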
[LG-3] GO-GenZip: Goal-Oriented Generative Sampling and Hybrid Compression
Link: https://arxiv.org/abs/2603.20109
Authors: Pietro Talli, Qi Liao, Alessandro Lieto, Parijat Bhattacharjee, Federico Chiariotti, Andrea Zanella
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:Current network data telemetry pipelines consist of massive streams of fine-grained Key Performance Indicators (KPIs) from multiple distributed sources towards central aggregators, making data storage, transmission, and real-time analysis increasingly unsustainable. This work presents a generative AI (GenAI)-driven sampling and hybrid compression framework that redesigns network telemetry from a goal-oriented perspective. Unlike conventional approaches that passively compress fully observed data, our approach jointly optimizes what to observe and how to encode it, guided by the relevance of information to downstream tasks. The framework integrates adaptive sampling policies, using adaptive masking techniques, with generative modeling to identify patterns and preserve critical features across temporal and spatial dimensions. The selectively acquired data are further processed through a hybrid compression scheme that combines traditional lossless coding with GenAI-driven, lossy compression. Experimental results on real network datasets demonstrate over 50 % reductions in sampling and data transfer costs, while maintaining comparable reconstruction accuracy and goal-oriented analytical fidelity in downstream tasks.
[LG-4] Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition
Link: https://arxiv.org/abs/2603.20108
Authors: Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa, Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Artur Janicki, Przemysław Biecek, Ambros Marzetta, Atul Pande, Lalit Chandra Routhu, Swapnil Srivastava, Evridiki Ntagiou
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 43 pages, 18 figures
Abstract:Forecasting plays a crucial role in modern safety-critical applications, such as space operations. However, the increasing use of deep forecasting models introduces a new security risk of trojan horse attacks, carried out by hiding a backdoor in the training data or directly in the model weights. Once implanted, the backdoor is activated by a specific trigger pattern at test time, causing the model to produce manipulated predictions. We focus on this issue in our Trojan Horse Hunt data science competition, where more than 200 teams faced the task of identifying triggers hidden in deep forecasting models for spacecraft telemetry. We describe the novel task formulation, benchmark set, evaluation protocol, and best solutions from the competition. We further summarize key insights and research directions for effective identification of triggers in time series forecasting models. All materials are publicly available on the official competition webpage this https URL.
[LG-5] How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Link: https://arxiv.org/abs/2603.20092
Authors: Luca Ambrogioni
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we propose a theoretical framework that interprets the generation process in trained diffusion models as an instance of out-of-equilibrium phase transitions. We argue that, rather than evolving smoothly from noise to data, reverse diffusion passes through a critical regime in which small spatial fluctuations are amplified and seed the emergence of large-scale structure. Our central insight is that architectural constraints, such as locality, sparsity, and translation equivariance, transform memorization-driven instabilities into collective spatial modes, enabling the formation of coherent patterns beyond the training data. Using analytically tractable patch score models, we show how classical symmetry-breaking bifurcations generalize into spatially extended critical phenomena described by softening Fourier modes and growing correlation lengths. We further connect these dynamics to effective field theories of the Ginzburg-Landau type and to mechanisms of pattern formation in non-equilibrium physics. Empirical results on trained convolutional diffusion models corroborate the theory, revealing signatures of criticality including mode softening and rapid growth of spatial correlations. Finally, we demonstrate that this critical regime has practical relevance: targeted perturbations, such as classifier-free guidance pulses applied at the estimated critical time, significantly improve generation control. Together, these findings position non-equilibrium critical phenomena as a unifying principle for understanding, and potentially improving, the behavior of modern diffusion models.
[LG-6] Federated Hyperdimensional Computing for Resource-Constrained Industrial IoT
Link: https://arxiv.org/abs/2603.20037
Authors: Nikita Zeulin, Olga Galinina, Nageen Himayat, Sergey Andreev
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Submitted to the IEEE for possible publication
Abstract:In the Industrial Internet of Things (IIoT) systems, edge devices often operate under strict constraints in memory, compute capability, and wireless bandwidth. These limitations challenge the deployment of advanced data analytics tasks, such as predictive and prescriptive maintenance. In this work, we explore hyperdimensional computing (HDC) as a lightweight learning paradigm for resource-constrained IIoT. Conventional centralized HDC leverages the properties of high-dimensional vector spaces to enable energy-efficient training and inference. We integrate this paradigm into a federated learning (FL) framework where devices exchange only prototype representations, which significantly reduces communication overhead. Our numerical results highlight the potential of federated HDC to support collaborative learning in IIoT with fast convergence speed and communication efficiency. These results indicate that HDC represents a lightweight and resilient framework for distributed intelligence in large-scale and resource-constrained IIoT environments.
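To make the HDC-plus-FL idea concrete, here is a toy sketch in which each device encodes samples into bipolar hypervectors, bundles them into per-class prototypes, and the server aggregates only those prototypes. The random-projection encoder, the dimensionality D=2000, and aggregation by summation are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

# Toy sketch of federated hyperdimensional computing: devices exchange
# only per-class prototype hypervectors, never raw samples. The random
# projection encoder, D, and sum aggregation are illustrative.

D, F, CLASSES = 2000, 8, 2
rng = np.random.default_rng(1)
projection = rng.normal(size=(D, F))   # encoder shared by all devices

def encode(x):
    """Map an F-dimensional feature vector to a bipolar hypervector."""
    return np.sign(projection @ x)

def local_prototypes(X, y):
    """Bundle (sum) encoded samples per class on one device."""
    protos = np.zeros((CLASSES, D))
    for xi, yi in zip(X, y):
        protos[yi] += encode(xi)
    return protos

def classify(x, protos):
    """Nearest prototype by inner-product similarity."""
    return int(np.argmax(protos @ encode(x)))

# Two devices train locally; the server merely sums their prototypes.
X1, y1 = rng.normal(size=(20, F)), rng.integers(0, CLASSES, 20)
X2, y2 = rng.normal(size=(20, F)), rng.integers(0, CLASSES, 20)
global_protos = local_prototypes(X1, y1) + local_prototypes(X2, y2)
```

Note that only the CLASSES-by-D prototype matrix crosses the network, independent of how many samples each device holds, which is the source of the communication savings the abstract highlights.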
[LG-7] Continual Learning as Shared-Manifold Continuation Under Compatible Shift KR
Link: https://arxiv.org/abs/2603.20036
Authors: Henry J. Kobs
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, repo: this https URL
Abstract:Continual learning methods usually preserve old behavior by regularizing parameters, matching old outputs, or replaying previous examples. These strategies can reduce forgetting, but they do not directly specify how the latent representation should evolve. We study a narrower geometric alternative for the regime where old and new data should remain on the same latent support: continual learning as continuation of a shared manifold. We instantiate this view within Support-Preserving Manifold Assimilation (SPMA) and evaluate a geometry-preserving variant, SPMA-OG, that combines sparse replay, output distillation, relational geometry preservation, local smoothing, and chart-assignment regularization on old anchors. On representative compatible-shift CIFAR10 and Tiny-ImageNet runs, SPMA-OG improves over sparse replay baselines in old-task retention and representation-preservation metrics while remaining competitive on new-task accuracy. On a controlled synthetic atlas-manifold benchmark, it achieves near-perfect anchor-geometry preservation while also improving new-task accuracy over replay. These results provide evidence that geometry-aware anchor regularization is a useful inductive bias when continual learning should preserve a shared latent support rather than create a new one.
[LG-8] ODySSeI: An Open-Source End-to-End Framework for Automated Detection, Segmentation, and Severity Estimation of Lesions in Invasive Coronary Angiography Images
Link: https://arxiv.org/abs/2603.20021
Authors: Anand Choudhary, Xiaowu Sun, Thabo Mahendiran, Ortal Senouf, Denise Auberson, Bernard De Bruyne, Stephane Fournier, Olivier Muller, Emmanuel Abbé, Pascal Frossard, Dorina Thanou
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Invasive Coronary Angiography (ICA) is the clinical gold standard for the assessment of coronary artery disease. However, its interpretation remains subjective and prone to intra- and inter-operator variability. In this work, we introduce ODySSeI: an Open-source end-to-end framework for automated Detection, Segmentation, and Severity estimation of lesions in ICA images. ODySSeI integrates deep learning-based lesion detection and lesion segmentation models trained using a novel Pyramidal Augmentation Scheme (PAS) to enhance robustness and real-time performance across diverse patient cohorts (2149 patients from Europe, North America, and Asia). Furthermore, we propose a quantitative coronary angiography-free Lesion Severity Estimation (LSE) technique that directly computes the Minimum Lumen Diameter (MLD) and diameter stenosis from the predicted lesion geometry. Extensive evaluation on both in-distribution and out-of-distribution clinical datasets demonstrates ODySSeI’s strong generalizability. Our PAS yields large performance gains in highly complex tasks as compared to relatively simpler ones, notably, a 2.5-fold increase in lesion detection performance versus a 1-3% increase in lesion segmentation performance over their respective baselines. Our LSE technique achieves high accuracy, with predicted MLD values differing by only ±2-3 pixels from the corresponding ground truths. On average, ODySSeI processes a raw ICA image within only a few seconds on a CPU and in a fraction of a second on a GPU and is available as a plug-and-play web interface at this http URL. Overall, this work establishes ODySSeI as a comprehensive and open-source framework which supports automated, reproducible, and scalable ICA analysis for real-time clinical decision-making.
[LG-9] AgenticRS-EnsNAS: Ensemble-Decoupled Self-Evolving Architecture Search
Link: https://arxiv.org/abs/2603.20014
Authors: Yun Chen, Moyu Zhang, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Neural Architecture Search (NAS) deployment in industrial production systems faces a fundamental validation bottleneck: verifying a single candidate architecture pi requires evaluating the deployed ensemble of M models, incurring prohibitive O(M) computational cost per candidate. This cost barrier severely limits architecture iteration frequency in real-world applications where ensembles (M=50-200) are standard for robustness. This work introduces Ensemble-Decoupled Architecture Search, a framework that leverages ensemble theory to predict system-level performance from single-learner evaluation. We establish the Ensemble-Decoupled Theory with a sufficient condition for monotonic ensemble improvement under homogeneity assumptions: a candidate architecture pi yields lower ensemble error than the current baseline if rho(pi) < rho(pi_old) - (M / (M - 1)) * (Delta E(pi) / sigma^2(pi)), where Delta E, rho, and sigma^2 are estimable from lightweight dual-learner training. This decouples architecture search from full ensemble training, reducing per-candidate search cost from O(M) to O(1) while maintaining O(M) deployment cost only for validated winners. We unify solution strategies across pipeline continuity: (1) closed-form optimization for tractable continuous pi (exemplified by feature bagging in CTR prediction), (2) constrained differentiable optimization for intractable continuous pi, and (3) LLM-driven search with iterative monotonic acceptance for discrete pi. The framework reveals two orthogonal improvement mechanisms – base diversity gain and accuracy gain – providing actionable design principles for industrial-scale NAS. All theoretical derivations are rigorous with detailed proofs deferred to the appendix. Comprehensive empirical validation will be included in the journal extension of this work.
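The sufficient condition stated in the abstract reduces each candidate evaluation to a scalar test. A minimal sketch of the acceptance rule (function and argument names are illustrative, not from the paper's code):

```python
# Sketch of the ensemble-decoupled acceptance test: accept a candidate
# architecture when
#     rho_new < rho_old - (M / (M - 1)) * (delta_E / sigma2),
# with delta_E, rho, and sigma2 estimated from lightweight dual-learner
# training rather than a full M-model ensemble.

def accept_candidate(rho_new, rho_old, delta_E, sigma2, M):
    """True if the sufficient condition for monotonic ensemble
    improvement holds for an ensemble of M models."""
    threshold = rho_old - (M / (M - 1)) * (delta_E / sigma2)
    return rho_new < threshold
```

For large M the factor M/(M-1) approaches 1, so the test essentially compares the change in pairwise correlation against the accuracy gain scaled by variance; either way each candidate needs only two trained learners, which is the O(M)-to-O(1) reduction claimed above.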
[LG-10] Model-Driven Learning-Based Physical Layer Authentication for Mobile Wi-Fi Devices
Link: https://arxiv.org/abs/2603.19972
Authors: Yijia Guo, Junqing Zhang, Yao-Win Peter Hong, Stefano Tomasin
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The rise of wireless technologies has made the Internet of Things (IoT) ubiquitous, but the broadcast nature of wireless communications exposes IoT to authentication risks. Physical layer authentication (PLA) offers a promising solution by leveraging unique characteristics of wireless channels. As a common approach in PLA, hypothesis testing yields a theoretically optimal Neyman-Pearson (NP) detector, but its reliance on channel statistics limits its practicality in real-world scenarios. In contrast, deep learning-based PLA approaches are practical but tend not to be optimal. To address these challenges, we proposed a learning-based PLA scheme driven by hypothesis testing and conducted extensive simulations and experimental evaluations using Wi-Fi. Specifically, we incorporated conditional statistical models into the hypothesis testing framework to derive a theoretically optimal NP detector. Building on this, we developed LiteNP-Net, a lightweight neural network driven by the NP detector. Simulation results demonstrated that LiteNP-Net could approach the performance of the NP detector even without prior knowledge of the channel statistics. To further assess its effectiveness in practical environments, we deployed an experimental testbed using Wi-Fi IoT development kits in various real-world scenarios. Experimental results demonstrated that the LiteNP-Net outperformed the conventional correlation-based method as well as state-of-the-art Siamese-based methods.
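As a rough illustration of the hypothesis-testing view of PLA, the sketch below thresholds the residual energy between consecutive CSI estimates: under the legitimate hypothesis the channel varies slowly, so the residual stays small. The Gaussian channel model, known noise variance, and fixed threshold are assumptions for illustration, not the paper's LiteNP-Net or its exact NP detector:

```python
import numpy as np

# Illustrative residual-energy authentication test: accept when the
# distance between consecutive channel estimates is small. The model
# (temporally correlated legitimate channel, known noise variance,
# fixed threshold) is assumed here for illustration only.

def residual_energy_test(h_curr, h_prev, noise_var, threshold):
    """Return True (accept as authentic) when the normalized residual
    energy between consecutive CSI estimates is below the threshold."""
    stat = np.sum(np.abs(h_curr - h_prev) ** 2) / noise_var
    return stat < threshold
```

In the NP framework the threshold would be chosen to fix the false-alarm rate given the channel statistics; a learning-based detector instead absorbs unknown statistics into learned parameters, which is the gap LiteNP-Net targets.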
[LG-11] Channel Prediction-Based Physical Layer Authentication under Consecutive Spoofing Attacks
Link: https://arxiv.org/abs/2603.19962
Authors: Yijia Guo, Junqing Zhang, Yao-Win Peter Hong
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Wireless networks are highly vulnerable to spoofing attacks, especially when attackers transmit consecutive spoofing packets. Conventional physical layer authentication (PLA) methods have mostly focused on single-packet spoofing attack. However, under consecutive spoofing attacks, they become ineffective due to channel evolution caused by device mobility and channel fading. To address this challenge, we propose a channel prediction-based PLA framework. Specifically, a Transformer-based channel prediction module is employed to predict legitimate CSI measurements during spoofing interval, and the input of channel prediction module is adaptively updated with predicted or observed CSI measurements based on the authentication decision to ensure robustness against sustained spoofing. Simulation results under Rayleigh fading channels demonstrate that the proposed approach achieves low prediction error and significantly higher authentication accuracy than conventional benchmark, maintaining robustness even under extended spoofing attacks.
[LG-12] TAPAS: Efficient Two-Server Asymmetric Private Aggregation Beyond Prio()
Link: https://arxiv.org/abs/2603.19949
Authors: Harish Karthikeyan, Antigoni Polychroniadou
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Privacy-preserving aggregation is a cornerstone for AI systems that learn from distributed data without exposing individual records, especially in federated learning and telemetry. Existing two-server protocols (e.g., Prio and successors) set a practical baseline by validating inputs while preventing any single party from learning users’ values, but they impose symmetric costs on both servers and communication that scales with the per-client input dimension L . Modern learning tasks routinely involve dimensionalities L in the tens to hundreds of millions of model parameters. We present TAPAS, a two-server asymmetric private aggregation scheme that addresses these limitations along four dimensions: (i) no trusted setup or preprocessing, (ii) server-side communication that is independent of L , (iii) post-quantum security based solely on standard lattice assumptions (LWE, SIS), and (iv) stronger robustness with identifiable abort and full malicious security for the servers. A key design choice is intentional asymmetry: one server bears the O(L) aggregation and verification work, while the other operates as a lightweight facilitator with computation independent of L . This reduces total cost, enables the secondary server to run on commodity hardware, and strengthens the non-collusion assumption of the servers. One of our main contributions is a suite of new and efficient lattice-based zero-knowledge proofs; to our knowledge, we are the first to establish privacy and correctness with identifiable abort in the two-server setting.
[LG-13] Memori: A Persistent Memory Layer for Efficient Context-Aware LLM Agents
Link: https://arxiv.org/abs/2603.19935
Authors: Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, Adam B. Struck
Subjects: Machine Learning (cs.LG)
Comments: 9 pages; 2 figures; white paper
Abstract:As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.
[LG-14] Discovery of Decision Synchronization Patterns from Event Logs
Link: https://arxiv.org/abs/2603.19879
Authors: Tijmen Kuijpers, Karolin Winter, Remco Dijkman
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Synchronizing decisions between running cases in business processes facilitates fair and efficient use of resources, helps prioritize the most valuable cases, and prevents unnecessary waiting. Consequently, decision synchronization patterns are regularly built into processes, in the form of mechanisms that temporarily delay one case to favor another. These decision mechanisms therefore consider properties of multiple cases at once, rather than just the properties of a single case; an aspect that is rarely addressed by current process discovery techniques. To address this gap, this paper proposes an approach for discovering decision synchronization patterns inspired by supply chain processes. These decision synchronization patterns take the form of specific process constructs combined with a constraint that determines which particular case to execute. We describe, formalize and demonstrate how the constraint for four such patterns can be discovered. We evaluate our approach in two artificial scenarios. First, with four separate process models each containing a single decision synchronization pattern, i.e., we demonstrate that our approach can discover every type of pattern when only this one type is present. Second, we consider a process model containing all four decision synchronization patterns to show generalizability of the approach to more complex problems. For both scenarios, we could reliably retrieve the expected patterns.
[LG-15] On the Dynamics and Transferability of Latent Generalization during Memorization
Link: https://arxiv.org/abs/2603.19865
Authors: Simran Ketha, Venkatakrishnan Ramaswamy
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Deep networks have been known to have extraordinary generalization abilities, via mechanisms that aren’t yet well understood. It is also known that upon shuffling labels in the training data to varying degrees, deep networks, trained with standard methods, can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and typically comes at the cost of poorer generalization to true labels. Our recent work has demonstrated, that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, the origin and dynamics over training of this latent generalization during memorization is not well understood. Here, we track the training dynamics, empirically, and find that latent generalization abilities largely peak early in training, with model generalization. Next, we investigate to what extent the specific nature of the MASC probe is critical for our ability to extract latent generalization from the model’s layerwise outputs. To this end, we first examine the mathematical structure of the MASC probe and show that it is a quadratic classifier, i.e. is non-linear. This brings up the question of the extent to which this latent generalization might be linearly decodable from layerwise outputs. To investigate this, we designed a new linear probe for this setting. Next, we consider the question of whether it is possible to transfer latent generalization to model generalization by directly editing model weights. To this end, we devise a way to transfer the latent generalization present in last-layer representations to the model using the new linear probe.
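A generic version of the linear-probe idea: fit a ridge-regularized linear classifier on frozen layer-wise features and read off its accuracy as a proxy for latent generalization. This is a plain ridge probe on synthetic features, standing in for neither the paper's MASC probe nor its specific linear probe:

```python
import numpy as np

# Generic linear probe sketch: ridge regression of one-hot labels onto
# frozen features, with accuracy as a latent-generalization proxy.
# The synthetic two-class features stand in for layer-wise outputs.

def fit_linear_probe(feats, labels, n_classes, lam=1e-3):
    """Ridge regression of one-hot targets onto features; returns W."""
    Y = np.eye(n_classes)[labels]
    A = feats.T @ feats + lam * np.eye(feats.shape[1])
    return np.linalg.solve(A, feats.T @ Y)

def probe_accuracy(W, feats, labels):
    """Classify by argmax of the linear scores and measure accuracy."""
    preds = np.argmax(feats @ W, axis=1)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
centers = np.array([[3.0, 0.0], [-3.0, 0.0]])   # well-separated classes
feats = centers[labels] + rng.normal(size=(200, 2))
W = fit_linear_probe(feats, labels, 2)
acc = probe_accuracy(W, feats, labels)
```

Because the probe is linear, high probe accuracy on layer outputs would indicate the latent generalization is linearly decodable, which is exactly the question the abstract raises against the quadratic MASC probe.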
[LG-16] NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing
链接: https://arxiv.org/abs/2603.19864
作者: Raphael Simon,José Carrasquel,Wim Mees,Pieter Libin
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Penetration testing, the practice of simulating cyberattacks to identify vulnerabilities, is a complex sequential decision-making task that is inherently partially observable and features large action spaces. Training reinforcement learning (RL) policies for this domain faces a fundamental bottleneck: existing simulators are too slow to train on realistic network scenarios at scale, resulting in policies that fail to generalize. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim), achieving up to 100x higher environment throughput than the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax enables experimentation on larger networks under fixed compute budgets that were previously infeasible. We formulate automated penetration testing as a Contextual POMDP and introduce a network generation pipeline that produces structurally diverse and guaranteed-solvable scenarios. Together, these provide a principled basis for studying zero-shot policy generalization. We use the framework to investigate action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay better handles dense training distributions than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen during training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay’s episode-reset behaviour and 2SAS’s credit assignment structure. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing.
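As a rough illustration of the two-stage action decomposition (2SAS) idea described above, the sketch below first samples an action type under a validity mask and then samples a target host valid for that type, rather than scoring the flat type x host action space. All names and the masking scheme are our own illustration, not the paper's code, and the paper's actual implementation is JAX-based.

```python
import numpy as np

def masked_softmax_sample(logits, mask, rng):
    """Sample an index from softmax(logits) restricted to mask==True."""
    z = np.where(mask, logits, -np.inf)
    p = np.exp(z - z[mask].max())
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))

def two_stage_action(type_logits, host_logits, type_mask, host_mask, seed=0):
    """First pick an action type, then a target host valid for that type,
    instead of scoring the full (type x host) flat action space."""
    rng = np.random.default_rng(seed)
    a_type = masked_softmax_sample(type_logits, type_mask, rng)
    a_host = masked_softmax_sample(host_logits[a_type], host_mask[a_type], rng)
    return a_type, a_host
```

Because each stage masks invalid choices before normalizing, the decomposition scales linearly rather than multiplicatively as hosts are added.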
[LG-17] FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
链接: https://arxiv.org/abs/2603.19835
作者: Chiyu Ma,Shuo Yang,Kexin Huang,Jinda Lu,Haoming Meng,Shangshang Wang,Bolin Ding,Soroush Vosoughi,Guoyin Wang,Jingren Zhou
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
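The dense credit-assignment idea behind FIPO can be sketched as follows: instead of giving every token the same outcome-based advantage, each token is re-weighted by the discounted sum of KL divergence over the remainder of the trajectory. The normalization choice and function names here are our own simplification, not the paper's exact formulation.

```python
import numpy as np

def future_kl_weights(token_kl, gamma=0.99):
    """Suffix scan: discounted sum of future per-token KL divergences."""
    fut = np.zeros(len(token_kl))
    acc = 0.0
    for t in reversed(range(len(token_kl))):
        acc = token_kl[t] + gamma * acc
        fut[t] = acc
    return fut

def dense_advantage(outcome_adv, token_kl, gamma=0.99):
    """Spread a single outcome-based advantage over tokens, re-weighted by
    each token's discounted influence on the rest of the trajectory."""
    w = future_kl_weights(token_kl, gamma)
    w = w / (w.mean() + 1e-8)  # keep the average per-token advantage unchanged
    return outcome_adv * w
```

Tokens whose divergence dominates the future of the trajectory (logical pivots) receive a larger share of the advantage than trivial tokens.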
[LG-18] GDEGAN: Gaussian Dynamic Equivariant Graph Attention Network for Ligand Binding Site Prediction
链接: https://arxiv.org/abs/2603.19817
作者: Animesh,Plaban Kumar Bhowmick,Pralay Mitra
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of binding sites of a given protein, to which ligands can bind, is a critical step in structure-based computational drug discovery. Recently, Equivariant Graph Neural Networks (GNNs) have emerged as a powerful paradigm for binding site identification methods due to the large-scale availability of 3D structures of proteins via protein databases and AlphaFold predictions. The state-of-the-art equivariant GNN methods implement dot product attention, disregarding the variation in the chemical and geometric properties of the neighboring residues. To capture this variation, we propose GDEGAN (Gaussian Dynamic Equivariant Graph Attention Network), which replaces dot-product attention with adaptive kernels that recognize binding sites. The proposed attention mechanism captures variation in neighboring residues using statistics of their characteristic local feature distributions. Our mechanism dynamically computes neighborhood statistics at each layer, using local variance as an adaptive bandwidth parameter with learnable per-head temperatures, enabling each protein region to determine its own context-specific importance. GDEGAN outperforms existing methods with relative improvements of 37-66% in DCC and 7-19% DCA success rates across COACH420, HOLO4k, and PDBBind2020 datasets. These advances have direct application in accelerating protein-ligand docking by identifying potential binding sites for therapeutic target identification.
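The replacement of dot-product attention by a variance-adaptive Gaussian kernel can be illustrated with the minimal sketch below, where the local variance of squared neighbor distances sets the bandwidth for each residue's neighborhood. This is a toy single-node version under our own assumptions, not GDEGAN's multi-head, learnable-temperature implementation.

```python
import numpy as np

def gaussian_kernel_attention(x_i, neighbors, temperature=1.0):
    """Attention weights over neighbors from a Gaussian kernel whose
    bandwidth adapts to the local spread of neighbor distances."""
    d2 = np.sum((neighbors - x_i) ** 2, axis=1)  # squared distances to x_i
    bandwidth = d2.var() + 1e-8                  # local variance as adaptive bandwidth
    logits = -d2 / (2.0 * bandwidth * temperature)
    w = np.exp(logits - logits.max())            # numerically stable softmax
    return w / w.sum()
```

A region with tightly clustered neighbors gets a small bandwidth (sharp attention), while a spread-out region gets a broad one, which is the context-specific behavior the abstract describes.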
[LG-19] Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study
链接: https://arxiv.org/abs/2603.19812
作者: Danya Li,Yan Feng,Rico Krueger
类目: Machine Learning (cs.LG)
*备注:
Abstract:The integration of Automated Shuttles into shared urban spaces presents unique challenges due to the absence of traffic rules and the complex pedestrian interactions. Accurately anticipating pedestrian behavior in such unstructured environments is therefore critical for ensuring both safety and efficiency. This paper presents a Virtual Reality (VR) study that captures how pedestrians interact with automated shuttles across diverse scenarios, including varying approach angles and navigating in continuous traffic. We identify critical behavior patterns present in pedestrians’ decision-making in shared spaces, including hesitation, evasive maneuvers, gaze allocation, and proxemic adjustments. To model pedestrian behavior, we propose GazeX-LSTM, a multimodal eye gaze-informed and context-aware prediction model that integrates pedestrians’ trajectories, fine-grained eye gaze dynamics, and contextual factors. We shift prediction from a vehicle- to a human-centered perspective by leveraging eye-tracking data to capture pedestrian attention. We systematically validate the unique and irreplaceable predictive power of eye gaze over head orientation alone, further enhancing performance by integrating contextual variables. Notably, the combination of eye gaze data and contextual information produces super-additive improvements on pedestrian behavior prediction accuracy, revealing the complementary relationship between visual attention and situational contexts. Together, our findings provide the first evidence that eye gaze-informed modeling fundamentally advances pedestrian behavior prediction and highlight the critical role of situational contexts in shared-space interactions. This paves the way for safer and more adaptive automated vehicle technologies that account for how people perceive and act in complex shared spaces.
[LG-20] Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training
链接: https://arxiv.org/abs/2603.19808
作者: Giacomo Borghi,Hyesung Im,Lorenzo Pareschi
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Machine Learning (stat.ML)
*备注:
Abstract:Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection–mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection–mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann–Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator–mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.
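One round of the two-time-scale dynamics can be sketched as below: a fast noisy-gradient (Langevin-type) update on each member's parameters, followed by a slow selection-mutation update that resamples hyperparameters with probability proportional to exp(-loss). This is a toy discretization under our own assumptions, not the paper's mean-field model.

```python
import numpy as np

def two_time_scale_step(params, hypers, grad_fn, loss_fn,
                        eta=0.1, noise=0.0, mut=0.0, seed=0):
    """One round of population training: fast parameter steps, then a slow
    selection-mutation step on the hyperparameters."""
    rng = np.random.default_rng(seed)
    n = len(params)
    # fast time scale: per-member (noisy) gradient step on the parameters
    params = [p - eta * grad_fn(p, h) + noise * rng.standard_normal(np.shape(p))
              for p, h in zip(params, hypers)]
    # slow time scale: selection (softmax of negative loss) plus Gaussian mutation
    losses = np.array([loss_fn(p, h) for p, h in zip(params, hypers)])
    w = np.exp(-(losses - losses.min()))
    w = w / w.sum()
    idx = rng.choice(n, size=n, p=w)
    hypers = [hypers[j] + mut * rng.standard_normal() for j in idx]
    return params, hypers
```

Iterating the fast step many times per slow step approximates the time-scale separation under which the paper derives its selection-mutation limit.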
[LG-21] Quantifying Gate Contribution in Quantum Feature Maps for Scalable Circuit Optimization
链接: https://arxiv.org/abs/2603.19805
作者: F. Rodríguez-Díaz,D. Gutiérrez-Avilés,A. Troncoso,F. Martínez-Álvarez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantum machine learning offers promising advantages for classification tasks, but noise, decoherence, and connectivity constraints in current devices continue to limit the efficient execution of feature map-based circuits. Gate Assessment and Threshold Evaluation (GATE) is presented as a circuit optimization methodology that reduces quantum feature maps using a novel gate significance index. This index quantifies the relevance of each gate by combining fidelity, entanglement, and sensitivity. It is formulated for both simulator/emulator environments, where quantum states are accessible, and for real hardware, where these quantities are estimated from measurement results and auxiliary circuits. The approach iteratively scans a threshold range, eliminates low-contribution gates, generates optimized quantum machine learning models, and ranks them based on accuracy, runtime, and a balanced performance criterion before final testing. The methodology is evaluated on real-world classification datasets using two representative quantum machine learning models, PegasosQSVM and Quantum Neural Network, in three execution scenarios: noise-free simulation, noisy emulation derived from an IBM backend, and real IBM quantum hardware. The structural impact of gate removal in feature maps is examined, compatibility with noise-mitigation techniques is studied, and the scalability of index computation is evaluated using approaches based on density matrices, matrix product states, tensor networks, and real-world devices. The results show consistent reductions in circuit size and runtime and, in many cases, preserved or improved predictive accuracy, with the best trade-offs typically occurring at intermediate thresholds rather than in the baseline circuits or in those compressed more aggressively.
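The index-then-scan procedure can be illustrated with the toy sketch below: a per-gate significance index combining fidelity, entanglement, and sensitivity, and a threshold scan that lists which gates survive pruning at each threshold. The linear weighting is our own placeholder; the paper's index and its hardware estimation are more involved.

```python
import numpy as np

def gate_significance(fidelity, entanglement, sensitivity, w=(1/3, 1/3, 1/3)):
    """Toy per-gate significance index: a weighted combination of the gate's
    fidelity impact, entanglement contribution, and output sensitivity."""
    return (w[0] * np.asarray(fidelity)
            + w[1] * np.asarray(entanglement)
            + w[2] * np.asarray(sensitivity))

def threshold_scan(scores, thresholds):
    """For each candidate threshold, list the gates that survive pruning."""
    return {t: [i for i, s in enumerate(scores) if s >= t] for t in thresholds}
```

Each surviving gate set defines one candidate circuit, which is then ranked by accuracy and runtime as the abstract describes.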
[LG-22] Scalable Learning of Multivariate Distributions via Coresets AISTATS2026
链接: https://arxiv.org/abs/2603.19792
作者: Zeyu Ding,Katja Ickstadt,Nadja Klein,Alexander Munteanu,Simon Omlor
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: AISTATS 2026
Abstract:Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of $(1 \pm \varepsilon)$ and thereby maintains statistical model accuracy. Compared to conventional full-parametric models, where coresets have been incorporated before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present, but not fully understood. To address numerical problems associated with normalizing logarithmic terms, we follow a geometric approximation based on the convex hull of input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.
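The importance-sampling step behind such coreset constructions can be sketched as below: points are sampled proportionally to a sensitivity score and paired with inverse-probability weights, so weighted sums over the coreset are unbiased estimates of full-data sums. How the sensitivities are computed for MCTMs is the paper's contribution and is not shown here.

```python
import numpy as np

def importance_coreset(sensitivities, m, rng=None):
    """Sample m points with probability proportional to sensitivity and
    return (indices, weights); weights make weighted sums unbiased."""
    rng = np.random.default_rng(rng)
    p = sensitivities / sensitivities.sum()
    idx = rng.choice(len(p), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])
    return idx, weights
```

With uniform sensitivities over n points, every sampled point carries weight n/m, recovering plain subsampling as a special case.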
[LG-23] Learning from Similarity/Dissimilarity and Pairwise Comparison
链接: https://arxiv.org/abs/2603.19713
作者: Tomoya Tate,Kosuke Sugiyama,Masato Uchida
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses binary classification in scenarios where obtaining explicit instance level labels is impractical, by exploiting multiple weak labels defined on instance pairs. The existing SconfConfDiff classification framework relies on continuous valued probabilistic supervision, including similarity-confidence, the probability of class agreement, and confidence-difference, the difference in positive class probabilities. However, probabilistic labeling requires subjective uncertainty quantification, often leading to unstable supervision. We propose SD-Pcomp classification, a binary judgment based weakly supervised learning framework that relies only on relative judgments, namely class agreement between two instances and pairwise preference toward the positive class. The method employs Similarity/Dissimilarity (SD) labels and Pairwise Comparison (Pcomp) labels, and develops two unbiased risk estimators, (i) a convex combination of SD and Pcomp and (ii) a unified estimator that integrates both labels by modeling their relationship. Theoretical analysis and experimental results show that the proposed approach improves classification performance over methods using a single weak label, and is robust to label noise and uncertainty in class prior estimation.
[LG-24] Regret Analysis of Sleeping Competing Bandits
链接: https://arxiv.org/abs/2603.19700
作者: Shinnosuke Uba,Yutaro Yamaguchi
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 29 pages, 3 figures
Abstract:The Competing Bandits framework is a recently emerging area that integrates multi-armed bandits in online learning with stable matching in game theory. While conventional models assume that all players and arms are constantly available, in real-world problems, their availability can vary arbitrarily over time. In this paper, we formulate this setting as Sleeping Competing Bandits. To analyze this problem, we naturally extend the regret definition used in existing competing bandits and derive regret bounds for the proposed model. We propose an algorithm that simultaneously achieves an asymptotic regret bound of $\mathrm{O}(NK\log T_i/\Delta^2)$ under reasonable assumptions, where $N$ is the number of players, $K$ is the number of arms, $T_i$ is the number of rounds of each player $p_i$, and $\Delta$ is the minimum reward gap. We also provide a regret lower bound of $\Omega(N(K-N+1)\log T_i/\Delta^2)$ under the same assumptions. This implies that our algorithm is asymptotically optimal in the regime where the number of arms $K$ is relatively larger than the number of players $N$.
[LG-25] Diminishing Returns in Expanding Generative Models and Gödel-Tarski-Löb Limits
链接: https://arxiv.org/abs/2603.19687
作者: Angshul Majumdar
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*备注:
Abstract:Modern generative modelling systems are increasingly improved by expanding model capacity, training data, and computational resources. While empirical studies have documented such scaling behaviour across architectures including generative adversarial networks, variational autoencoders, transformer-based models, and diffusion models, the theoretical limits of capability growth in expanding generative systems remain poorly understood. In this paper we develop a general task-space framework for analysing expanding generative reasoning systems. Each system induces a subset of a global task space representing the tasks it can successfully solve, and system capability is measured by the probability mass of this solved-task set under a fixed task distribution. Within this framework we prove a structural result showing that, under mild assumptions, the marginal improvement in solved tasks must converge to zero as system capacity increases. Thus expanding generative systems may continue to gain capability, but the probability mass of newly solvable tasks necessarily diminishes asymptotically. We further provide a prediction-theoretic refinement based on complexity-weighted hypothesis classes inspired by algorithmic probability, yielding quantitative bounds on marginal improvement in prediction settings. Finally, we examine logical reasoning tasks and show that classical results from mathematical logic – including Rosser incompleteness, Tarski’s undefinability theorem, and Löb’s theorem – imply the persistence of unresolved logical tasks within sufficiently expressive reasoning systems. Together these results provide a mathematical perspective on the asymptotic behaviour of expanding generative systems, showing that long-run capability growth is constrained both by diminishing marginal improvements in task coverage and by fundamental logical limitations on internal reasoning. 
[LG-26] Ontology-Based Knowledge Modeling and Uncertainty-Aware Outdoor Air Quality Assessment Using Weighted Interval Type-2 Fuzzy Logic
链接: https://arxiv.org/abs/2603.19683
作者: Md Inzmam,Ritesh Chandra,Sadhana Tiwari,Sonali Agarwal,Triloki Pant
类目: Machine Learning (cs.LG)
*备注:
Abstract:Outdoor air pollution is a major concern for the environment and public health, especially in areas where urbanization is taking place rapidly. The Indian Air Quality Index (IND-AQI), developed by the Central Pollution Control Board (CPCB), is a standardized reporting system for air quality based on pollutants such as particulate matter (PM2.5, PM10), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), carbon monoxide (CO), and ammonia (NH3). However, the traditional calculation of the AQI uses crisp thresholds and deterministic aggregation rules, which are not suitable for handling uncertainty and transitions between classes. To address these limitations, this study proposes a hybrid ontology-based uncertainty-aware framework integrating Weighted Interval Type-2 Fuzzy Logic with semantic knowledge modeling. Interval Type-2 fuzzy sets are used to model uncertainty near AQI class boundaries, while pollutant importance weights are determined using the Interval Type-2 Fuzzy Analytic Hierarchy Process (IT2-FAHP) to reflect their relative health impacts. In addition, an OWL-based air quality ontology extending the Semantic Sensor Network (SSN) ontology is developed to represent pollutants, monitoring stations, AQI categories, regulatory standards, and environmental governance actions. Semantic reasoning is implemented using SWRL rules and validated through SPARQL queries to infer AQI categories, health risks, and recommended mitigation actions. Experimental evaluation using CPCB air quality datasets demonstrates that the proposed framework improves AQI classification reliability and uncertainty handling compared with traditional crisp and Type-1 fuzzy approaches, while enabling explainable semantic reasoning and intelligent decision support for air quality monitoring systems.
[LG-27] Scale-Dependent Radial Geometry and Metric Mismatch in Wasserstein Propagation for Reverse Diffusion
链接: https://arxiv.org/abs/2603.19670
作者: Zicheng Lyu,Zengfeng Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing analyses of reverse diffusion often propagate sampling error in the Euclidean geometry underlying $\mathcal{W}_2$ along the entire reverse trajectory. Under weak log-concavity, however, Gaussian smoothing can create contraction first at large separations while short separations remain non-dissipative. The first usable contraction is therefore radial rather than Euclidean, creating a metric mismatch between the geometry that contracts early and the geometry in which the terminal error is measured. We formalize this mismatch through an explicit radial lower profile for the learned reverse drift. Its far-field limit gives a contraction reserve, its near-field limit gives the Euclidean load governing direct $\mathcal{W}_2$ propagation, and admissible switch times are characterized by positivity of the reserve on the remaining smoothing window. We exploit this structure with a one-switch routing argument. Before the switch, reflection coupling yields contraction in a concave transport metric adapted to the radial profile. At the switch, we convert once from this metric back to $\mathcal{W}_2$ under a $p$-moment budget, and then propagate the converted discrepancy over the remaining short window in Euclidean geometry. For discretizations of the learned reverse SDE under $L^2$ score-error control, a one-sided Lipschitz condition of score error, and standard well-posedness and coupling hypotheses, we obtain explicit non-asymptotic end-to-end $\mathcal{W}_2$ guarantees, a scalar switch-selection objective, and a sharp structural limit on the conversion exponent within the affine-tail concave class.
[LG-28] Ensembles-based Feature Guided Analysis
链接: https://arxiv.org/abs/2603.19653
作者: Federico Formica,Stefano Gregis,Andrea Rota,Aurora Francesca Zanenga,Mark Lawford,Claudio Menghi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent Deep Neural Network (DNN) applications call for techniques that can explain their behavior. Existing solutions, such as Feature Guided Analysis (FGA), extract rules on their internal behaviors, e.g., by providing explanations related to neuron activation. Results from the literature show that these rules have considerable precision (i.e., they correctly predict certain classes of features), but the recall (i.e., the number of situations in which these rules apply) is more limited. To mitigate this problem, this paper presents Ensembles-based Feature Guided Analysis (EFGA). EFGA combines rules extracted by FGA into ensembles. Ensembles aggregate different rules to increase their applicability depending on an aggregation criterion, a policy that dictates how to combine rules into ensembles. Although our solution is extensible, and different aggregation criteria can be developed by users, in this work, we considered three different aggregation criteria. We evaluated how the choice of the criterion influences the effectiveness of EFGA on two benchmarks (i.e., the MNIST and LSC datasets), and found that different aggregation criteria offer alternative trade-offs between precision and recall. We then compare EFGA with FGA. For this experiment, we selected an aggregation criterion that provides a reasonable trade-off between precision and recall. Our results show that EFGA has higher train recall (+28.51% on MNIST, +33.15% on LSC), and test recall (+25.76% on MNIST, +30.81% on LSC) than FGA, with a negligible reduction in the test precision (-0.89% on MNIST, -0.69% on LSC).
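The core mechanic of aggregating rules into an ensemble can be sketched as below: all rules whose condition matches the input fire, and the ensemble predicts the majority label among firing rules (one simple possible aggregation criterion; the paper's three criteria are not specified in the abstract, so this voting scheme is our own placeholder).

```python
def ensemble_predict(rules, x):
    """Fire every rule (condition, label) whose condition matches x and
    return the majority label; abstain (return None) if no rule applies."""
    votes = {}
    for condition, label in rules:
        if condition(x):
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

Because any matching rule can contribute, recall grows with the ensemble, while precision depends on how conflicting votes are resolved, which is the trade-off the evaluation measures.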
[LG-29] Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis
链接: https://arxiv.org/abs/2603.19648
作者: Siddharth Chandak,Anuj Yadav,Ayfer Ozgur,Nicholas Bambos
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Submitted to IEEE Transactions on Automatic Control
Abstract:Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.
[LG-30] RiboSphere: Learning Unified and Efficient Representations of RNA Structures
链接: https://arxiv.org/abs/2603.19636
作者: Zhou Zhang,Hanqun Cao,Cheng Tan,Fang Wu,Pheng Ann Heng,Tianfan Fu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate RNA structure modeling remains difficult because RNA backbones are highly flexible, non-canonical interactions are prevalent, and experimentally determined 3D structures are comparatively scarce. We introduce RiboSphere, a framework that learns discrete geometric representations of RNA by combining vector quantization with flow matching. Our design is motivated by the modular organization of RNA architecture: complex folds are composed from recurring structural motifs. RiboSphere uses a geometric transformer encoder to produce SE(3)-invariant (rotation/translation-invariant) features, which are discretized with finite scalar quantization (FSQ) into a finite vocabulary of latent codes. Conditioned on these discrete codes, a flow-matching decoder reconstructs atomic coordinates, enabling high-fidelity structure generation. We find that the learned code indices are enriched for specific RNA motifs, suggesting that the model captures motif-level compositional structure rather than acting as a purely compressive bottleneck. Across benchmarks, RiboSphere achieves strong performance in structure reconstruction (RMSD 1.25 Å, TM-score 0.84), and its pretrained discrete representations transfer effectively to inverse folding and RNA–ligand binding prediction, with robust generalization in data-scarce regimes.
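The finite scalar quantization (FSQ) step used to discretize the latent features can be sketched as follows: each latent dimension is squashed with tanh and rounded to one of a small number of evenly spaced levels, so the codebook is the Cartesian product of per-dimension levels. The bounding function and level placement here follow the common FSQ recipe, not necessarily RiboSphere's exact configuration.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization: squash each latent dimension with tanh,
    then round it to one of levels[d] evenly spaced values in [-1, 1]."""
    z = np.tanh(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    for d, L in enumerate(levels):
        half = (L - 1) / 2.0
        out[..., d] = np.round(z[..., d] * half) / half
    return out
```

With levels (3, 5) the implicit vocabulary has 3 x 5 = 15 codes; no learned codebook or commitment loss is needed, which is FSQ's main appeal over classic VQ.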
[LG-31] Alternating Diffusion for Proximal Sampling with Zeroth Order Queries ICLR2026
链接: https://arxiv.org/abs/2603.19633
作者: Hirohane Takagi,Atsushi Nitanda
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICLR2026
Abstract:This work introduces a new approximate proximal sampler that operates solely with zeroth-order information of the potential function. Prior theoretical analyses have revealed that proximal sampling corresponds to alternating forward and backward iterations of the heat flow. The backward step was originally implemented by rejection sampling, whereas we directly simulate the dynamics. Unlike diffusion-based sampling methods that estimate scores via learned models or by invoking auxiliary samplers, our method treats the intermediate particle distribution as a Gaussian mixture, thereby yielding a Monte Carlo score estimator from directly samplable distributions. Theoretically, when the score estimation error is sufficiently controlled, our method inherits the exponential convergence of proximal sampling under isoperimetric conditions on the target distribution. In practice, the algorithm avoids rejection sampling, permits flexible step sizes, and runs with a deterministic runtime budget. Numerical experiments demonstrate that our approach converges rapidly to the target distribution, driven by interactions among multiple particles and by exploiting parallel computation.
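The Gaussian-mixture score idea can be sketched in one dimension: when the intermediate particle distribution is an equal-weight Gaussian mixture with shared variance, the score at a point is a posterior-weighted average of pulls toward the component means, computable from samplable quantities alone. This closed form is standard for Gaussian mixtures; the Monte Carlo particle estimator the paper builds on it is not shown.

```python
import numpy as np

def gaussian_mixture_score(x, means, sigma2):
    """Score (gradient of log density) of an equal-weight 1-D Gaussian
    mixture with shared isotropic variance sigma2, evaluated at x."""
    diffs = means - x                        # pull toward each component mean
    logw = -0.5 * (x - means) ** 2 / sigma2  # unnormalized log posterior
    w = np.exp(logw - logw.max())
    w = w / w.sum()                          # posterior over components given x
    return float(np.dot(w, diffs) / sigma2)  # E[mean - x | x] / sigma2
```

With a single component this reduces to the Gaussian score (mu - x)/sigma2, and at the midpoint between two symmetric components the pulls cancel.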
[LG-32] Continual Learning for Food Category Classification Dataset: Enhancing Model Adaptability and Performance
链接: https://arxiv.org/abs/2603.19624
作者: Piyush Kaushik Bhattacharyya,Devansh Tomar,Shubham Mishra,Divyanshu Rai,Yug Pratap Singh,Harsh Yadav,Krutika Verma,Vishal Meena,N Sangita Achary
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conventional machine learning pipelines often struggle to recognize categories absent from the original training set. This gap typically reduces accuracy, as fixed datasets rarely capture the full diversity of a domain. To address this, we propose a continual learning framework for text-guided food classification. Unlike approaches that require retraining from scratch, our method enables incremental updates, allowing new categories to be integrated without degrading prior knowledge. For example, a model trained on Western cuisines could later learn to classify dishes such as dosa or kimchi. Although further refinements are needed, this design shows promise for adaptive food recognition, with applications in dietary monitoring and personalized nutrition planning.
[LG-33] On Performance Guarantees for Federated Learning with Personalized Constraints
链接: https://arxiv.org/abs/2603.19617
作者: Mohammadjavad Ebrahimi,Daniel Burbano,Farzad Yousefian
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Federated learning (FL) has emerged as a communication-efficient algorithmic framework for distributed learning across multiple agents. While standard FL formulations capture unconstrained or globally constrained problems, many practical settings involve heterogeneous resource or model constraints, leading to optimization problems with agent-specific feasible sets. Here, we study a personalized constrained federated optimization problem in which each agent is associated with a convex local objective and a private constraint set. We propose PC-FedAvg, a method in which each agent maintains cross-estimates of the other agents’ variables through a multi-block local decision vector. Each agent updates all blocks locally, penalizing infeasibility only in its own block. Moreover, the cross-estimate mechanism enables personalization without requiring consensus or sharing constraint information among agents. We establish communication-complexity rates of $\mathcal{O}(\epsilon^{-2})$ for suboptimality and $\mathcal{O}(\epsilon^{-1})$ for agent-wise infeasibility. Preliminary experiments on the MNIST and CIFAR-10 datasets validate our theoretical findings.
[LG-34] Demonstrations, CoT, and Prompting: A Theoretical Analysis of ICL
链接: https://arxiv.org/abs/2603.19611
作者: Xuhan Tong,Yuchen Zeng,Jiawei Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In-Context Learning (ICL) enables pretrained LLMs to adapt to downstream tasks by conditioning on a small set of input-output demonstrations, without any parameter updates. Although there have been many theoretical efforts to explain how ICL works, most either rely on strong architectural or data assumptions, or fail to capture the impact of key practical factors such as demonstration selection, Chain-of-Thought (CoT) prompting, the number of demonstrations, and prompt templates. We address this gap by establishing a theoretical analysis of ICL under mild assumptions that links these design choices to generalization behavior. We derive an upper bound on the ICL test loss, showing that performance is governed by (i) the quality of selected demonstrations, quantified by Lipschitz constants of the ICL loss along paths connecting test prompts to pretraining samples, (ii) an intrinsic ICL capability of the pretrained model, and (iii) the degree of distribution shift. Within the same framework, we analyze CoT prompting as inducing a task decomposition and show that it is beneficial when demonstrations are well chosen at each substep and the resulting subtasks are easier to learn. Finally, we characterize how ICL performance sensitivity to prompt templates varies with the number of demonstrations. Together, our study shows that pretraining equips the model with the ability to generalize beyond observed tasks, while CoT enables the model to compose simpler subtasks into more complex ones, and demonstrations and instructions enable it to retrieve similar or complex tasks, including those that can be composed into more complex ones, jointly supporting generalization to unseen tasks. All theoretical insights are corroborated by experiments.
[LG-35] Wearable Foundation Models Should Go Beyond Static Encoders
链接: https://arxiv.org/abs/2603.19564
作者: Yu Yvonne Wu,Yuwei Zhang,Hyungjun Yoon,Ting Dang,Dimitris Spathis,Tong Xia,Qiang Yang,Jing Han,Dong Ma,Sung-Ju Lee,Cecilia Mascolo
类目: Machine Learning (cs.LG)
*备注: 13 pages
Abstract:Wearable foundation models (WFMs), trained on large volumes of data collected by affordable, always-on devices, have demonstrated strong performance on short-term, well-defined health monitoring tasks, including activity recognition, fitness tracking, and cardiovascular signal assessment. However, most existing WFMs primarily map short temporal windows to predefined labels via static encoders, emphasizing retrospective prediction rather than reasoning over evolving personal history, context, and future risk trajectories. As a result, they are poorly suited for modeling chronic, progressive, or episodic health conditions that unfold over weeks, months or years. Hence, we argue that WFMs must move beyond static encoders and be explicitly designed for longitudinal, anticipatory health reasoning. We identify three foundational shifts required to enable this transition: (1) Structurally rich data, which goes beyond isolated datasets or outcome-conditioned collection to integrated multimodal, long-term personal trajectories, and contextual metadata, ideally supported by open and interoperable data ecosystems; (2) Longitudinal-aware multimodal modeling, which prioritizes long-context inference, temporal abstraction, and personalization over cross-sectional or population-level prediction; and (3) Agentic inference systems, which move beyond static prediction to support planning, decision-making, and clinically grounded intervention under uncertainty. Together, these shifts reframe wearable health monitoring from retrospective signal interpretation toward continuous, anticipatory, and human-aligned health support.
[LG-36] Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination
链接: https://arxiv.org/abs/2603.19562
作者: Dong-Xiao Zhang,Hu Lou,Jun-Jie Zhang,Jun Zhu,Deyu Meng
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Computational Physics (physics.comp-ph)
*备注: 16 pages, 3 figures
Abstract:Adversarial vulnerability in vision and hallucination in large language models are conventionally viewed as separate problems, each addressed with modality-specific patches. This study first reveals that they share a common geometric origin: the input and its loss gradient are conjugate observables subject to an irreducible uncertainty bound. Formalizing a Neural Uncertainty Principle (NUP) under a loss-induced state, we find that in near-bound regimes, further compression must be accompanied by increased sensitivity dispersion (adversarial fragility), while weak prompt-gradient coupling leaves generation under-constrained (hallucination). Crucially, this bound is modulated by an input-gradient correlation channel, captured by a specifically designed single-backward probe. In vision, masking highly coupled components improves robustness without costly adversarial training; in language, the same prefill-stage probe detects hallucination risk before generating any answer tokens. NUP thus turns two seemingly separate failure taxonomies into a shared uncertainty-budget view and provides a principled lens for reliability analysis. Guided by this NUP theory, we propose ConjMask (masking high-contribution input components) and LogitReg (logit-side regularization) to improve robustness without adversarial training, and use the probe as a decoding-free risk signal for LLMs, enabling hallucination detection and prompt selection. NUP thus provides a unified, practical framework for diagnosing and mitigating boundary anomalies across perception and generation tasks.
[LG-37] An Adaptive Machine Learning Framework for Fluid Flow in Dual-Network Porous Media
链接: https://arxiv.org/abs/2603.19561
作者: V. S. Maduri,K. B. Nakshatrala
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Porous materials – natural or engineered – often exhibit dual pore-network structures that govern processes such as mineral exploration and hydrocarbon recovery from tight shales. Double porosity/permeability (DPP) mathematical models describe incompressible fluid flow through two interacting pore networks with inter-network mass exchange. Despite significant advances in numerical methods, there remains a need for computational frameworks that enable rapid forecasting, data assimilation, and reliable inverse analysis. To address this, we present a physics-informed neural network (PINN) framework for forward and inverse modeling of DPP systems. The proposed approach encodes the governing equations in mixed form, along with boundary conditions, directly into the loss function, with adaptive weighting strategies to balance their contributions. Key features of the framework include adaptive weight tuning, dynamic collocation point selection, and the use of shared trunk neural architectures to efficiently capture the coupled behavior of the dual pore networks. It is inherently mesh-free, making it well-suited for complex geometries typical of porous media. It accurately captures discontinuities in solution fields across layered domains without introducing spurious oscillations commonly observed in classical finite element formulations. Importantly, the framework is well-suited for inverse analysis, enabling robust parameter identification in scenarios where key physical quantities – such as the mass transfer coefficient in DPP models – are difficult to measure directly. In addition, a systematic convergence analysis is provided to rigorously assess the stability, accuracy, and reliability of the method. The effectiveness and computational advantages of the approach are demonstrated through a series of representative numerical experiments.
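文中提到的自适应权重策略,可用PINN中常见的梯度范数平衡启发式来示意(函数名、平滑因子与具体的max/范数规则均为示意性假设,并非该论文的精确方案):

```python
import numpy as np

def adaptive_loss_weights(grad_norms, old_weights, alpha=0.9):
    """Illustrative gradient-norm balancing for a PINN-style composite loss.

    Each loss term (PDE residual, boundary conditions, interface terms)
    gets a weight inversely proportional to its gradient magnitude, so no
    single term dominates training; alpha smooths the weight updates.
    """
    grad_norms = np.asarray(grad_norms, dtype=float)
    target = grad_norms.max() / np.maximum(grad_norms, 1e-12)  # boost weak terms
    return alpha * np.asarray(old_weights, dtype=float) + (1 - alpha) * target
```

梯度范数小的损失项(常见于边界条件项)被赋予更大权重,以平衡各项对训练的贡献。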
[LG-38] Verifiable Error Bounds for Physics-Informed Neural Network Solutions of Lyapunov and Hamilton-Jacobi-Bellman Equations
链接: https://arxiv.org/abs/2603.19545
作者: Jun Liu
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Many core problems in nonlinear systems analysis and control can be recast as solving partial differential equations (PDEs) such as Lyapunov and Hamilton-Jacobi-Bellman (HJB) equations. Physics-informed neural networks (PINNs) have emerged as a promising mesh-free approach for approximating their solutions, but in most existing works there is no rigorous guarantee that a small PDE residual implies a small solution error. This paper develops verifiable error bounds for approximate solutions of Lyapunov and HJB equations, with particular emphasis on PINN-based approximations. For both the Lyapunov and HJB PDEs, we show that a verifiable residual bound yields relative error bounds with respect to the true solutions as well as computable a posteriori estimates in terms of the approximate solutions. For the HJB equation, this also yields certified upper and lower bounds on the optimal value function on compact sublevel sets and quantifies the optimality gap of the induced feedback policy. We further show that one-sided residual bounds already imply that the approximation itself defines a valid Lyapunov or control Lyapunov function. We illustrate the results with numerical examples.
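误差界所依赖的核心对象是PDE残差。下面用一个采样式检查来示意该残差(仅为说明性草图:采样最大值并不构成论文中的可验证证书,具体的Lyapunov方程形式 ∇V(x)·f(x) + ||x||² = 0 也只是一个常见特例):

```python
import numpy as np

def lyapunov_residual_bound(grad_V, f, points):
    """Sampled residual of the Lyapunov PDE  ∇V(x)·f(x) + ||x||² = 0.

    Returns the max absolute residual over the sample points; a small
    value suggests (but, unlike the paper's certified a posteriori
    bounds, does not prove) that V approximately solves the equation.
    """
    res = [abs(float(np.dot(grad_V(x), f(x))) + float(np.dot(x, x)))
           for x in points]
    return max(res)
```

例如对线性系统 f(x) = -x,取 V(x) = ½||x||²(即 ∇V(x) = x),残差在任意采样点上恒为零。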
[LG-39] Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers
链接: https://arxiv.org/abs/2603.19544
作者: Yijiang Li,Zilinghan Li,Kyle Chard,Ian Foster,Todd Munson,Ravi Madduri,Kibaek Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that requires extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize key sources of heterogeneity impacting the training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.
[LG-40] Stochastic Sequential Decision Making over Expanding Networks with Graph Filtering
链接: https://arxiv.org/abs/2603.19501
作者: Zhan Gao,Bishwadeep Das,Elvin Isufi
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Graph filters leverage topological information to process networked data with existing methods mainly studying fixed graphs, ignoring that graphs often expand as nodes continually attach with an unknown pattern. The latter requires developing filter-based decision-making paradigms that take evolution and uncertainty into account. Existing approaches rely on either pre-designed filters or online learning, limited to a myopic view considering only past or present information. To account for future impacts, we propose a stochastic sequential decision-making framework for filtering networked data with a policy that adapts filtering to expanding graphs. By representing filter shifts as agents, we model the filter as a multi-agent system and train the policy following multi-agent reinforcement learning. This accounts for long-term rewards and captures expansion dynamics through sequential decision-making. Moreover, we develop a context-aware graph neural network to parameterize the policy, which tunes filter parameters based on information of both the graph and agents. Experiments on synthetic and real datasets from cold-start recommendation to COVID prediction highlight the benefits of using a sequential decision-making perspective over batch and online filtering alternatives.
[LG-41] ICLAD: In-Context Learning for Unified Tabular Anomaly Detection Across Supervision Regimes
链接: https://arxiv.org/abs/2603.19497
作者: Jack Yi Wei,Narges Armanfard
类目: Machine Learning (cs.LG)
*备注: 33 pages, 17 figures
Abstract:Anomaly detection on tabular data is commonly studied under three supervision regimes, including one-class settings that assume access to anomaly-free training samples, fully unsupervised settings with unlabeled and potentially contaminated training data, and semi-supervised settings with limited anomaly labels. Existing deep learning approaches typically train dataset-specific models under the assumption of a single supervision regime, which limits their ability to leverage shared structures across anomaly detection tasks and to adapt to different supervision levels. We propose ICLAD, an in-context learning foundation model for tabular anomaly detection that generalizes across both datasets and supervision regimes. ICLAD is trained via meta-learning on synthetic tabular anomaly detection tasks, and at inference time, the model assigns anomaly scores by conditioning on the training set without updating model weights. Comprehensive experiments on 57 tabular datasets from ADBench show that our method achieves state-of-the-art performance across three supervision regimes, establishing a unified framework for tabular anomaly detection.
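ICLAD推理时的"以训练集为条件、不更新权重"的打分方式,可用一个kNN距离规则作为替身来示意(真实的ICLAD用元训练的模型取代此距离规则;函数名与k值均为示意性假设):

```python
import numpy as np

def in_context_anomaly_scores(context, queries, k=5):
    """Stand-in for in-context anomaly scoring: rank queries by mean
    distance to their k nearest context rows; no model weights are
    updated, only the conditioning context changes per dataset."""
    d = np.linalg.norm(queries[:, None, :] - context[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, :k]          # k nearest context points
    return knn.mean(axis=1)                  # higher score = more anomalous
```

换一个数据集只需换掉`context`,这正是"跨数据集泛化、推理时免训练"这一范式的接口形态。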
[LG-42] Any-Subgroup Equivariant Networks via Symmetry Breaking ICLR2026
链接: https://arxiv.org/abs/2603.19486
作者: Abhinav Goel,Derek Lim,Hannah Lawrence,Stefanie Jegelka,Ningyuan Huang
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026
Abstract:The inclusion of symmetries as an inductive bias, known as equivariance, often improves generalization on geometric data (e.g. grids, sets, and graphs). However, equivariant architectures are usually highly constrained, designed for symmetries chosen a priori, and not applicable to datasets with other symmetries. This precludes the development of flexible, multi-modal foundation models capable of processing diverse data equivariantly. In this work, we build a single model – the Any-Subgroup Equivariant Network (ASEN) – that can be simultaneously equivariant to several groups, simply by modulating a certain auxiliary input feature. In particular, we start with a fully permutation-equivariant base model, and then obtain subgroup equivariance by using a symmetry-breaking input whose automorphism group is that subgroup. However, finding an input with the desired automorphism group is computationally hard. We overcome this by relaxing from exact to approximate symmetry breaking, leveraging the notion of 2-closure to derive fast algorithms. Theoretically, we show that our subgroup-equivariant networks can simulate equivariant MLPs, and their universality can be guaranteed if the base model is universal. Empirically, we validate our method on symmetry selection for graph and image tasks, as well as multitask and transfer learning for sequence tasks, showing that a single network equivariant to multiple permutation subgroups outperforms both separate equivariant models and a single non-equivariant model.
[LG-43] Deep Hilbert–Galerkin Methods for Infinite-Dimensional PDEs and Optimal Control
链接: https://arxiv.org/abs/2603.19463
作者: Samuel N. Cohen,Filippo de Feo,Jackson Hebner,Justin Sirignano
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR)
*备注:
Abstract:We develop deep learning-based approximation methods for fully nonlinear second-order PDEs on separable Hilbert spaces, such as HJB equations for infinite-dimensional control, by parameterizing solutions via Hilbert–Galerkin Neural Operators (HGNOs). We prove the first Universal Approximation Theorems (UATs) which are sufficiently powerful to address these problems, based on novel topologies for Hessian terms and corresponding novel continuity assumptions on the fully nonlinear operator. These topologies are non-sequential and non-metrizable, making the problem delicate. In particular, we prove UATs for functions on Hilbert spaces, together with their Fréchet derivatives up to second order, and for unbounded operators applied to the first derivative, ensuring that HGNOs are able to approximate all the PDE terms. For control problems, we further prove UATs for optimal feedback controls in terms of our approximating value function HGNO. We develop numerical training methods, which we call Deep Hilbert–Galerkin and Hilbert Actor-Critic (reinforcement learning) Methods, for these problems by minimizing the $L^2_\mu(H)$-norm of the residual of the PDE on the whole Hilbert space, not just a projected PDE to finite dimensions. This is the first paper to propose such an approach. The models considered arise in many applied sciences, such as functional differential equations in physics and Kolmogorov and HJB PDEs related to controlled PDEs, SPDEs, path-dependent systems, partially observed stochastic systems, and mean-field SDEs. We numerically solve examples of Kolmogorov and HJB PDEs related to the optimal control of deterministic and stochastic heat and Burgers’ equations, demonstrating the promise of our deep learning-based approach.
[LG-44] GeoLAN: Geometric Learning of Latent Explanatory Directions in Large Language Models
链接: https://arxiv.org/abs/2603.19460
作者: Tianyu Bell Pan,Damon L. Woodard
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*备注:
Abstract:Large language models (LLMs) demonstrate strong performance, but they often lack transparency. We introduce GeoLAN, a training framework that treats token representations as geometric trajectories and applies stickiness conditions inspired by recent developments related to the Kakeya Conjecture. We have developed two differentiable regularizers, Katz-Tao Convex Wolff (KT-CW) and Katz-Tao Attention (KT-Attn), that promote isotropy and encourage diverse attention. Our experiments with Gemma-3 (1B, 4B, 12B) and Llama-3-8B show that GeoLAN frequently maintains task accuracy while improving geometric metrics and reducing certain fairness biases. These benefits are most significant in mid-sized models. Our findings reveal scale-dependent trade-offs between geometric precision and performance, suggesting that geometry-aware training is a promising approach to enhance mechanistic interpretability.
[LG-45] Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning
链接: https://arxiv.org/abs/2603.19397
作者: Xueqiao Peng,Andrew Perrault
类目: Machine Learning (cs.LG)
*备注:
Abstract:Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particularly in early outbreak stages. In real-world public health settings, resources must be allocated across multiple outbreak clusters that emerge asynchronously, vary in size and risk, and compete for a shared resource budget. Here, a cluster corresponds to a group of close contacts generated by a single infected index case. Thus, decisions must be made under uncertainty and heterogeneous demands, while respecting operational constraints. We formulate this problem as a constrained restless multi-armed bandit and propose a hierarchical reinforcement learning framework. A global controller learns a continuous action cost multiplier that adjusts global resource demand, while a generalized local policy estimates the marginal value of allocating resources to individuals within each cluster. We evaluate the proposed framework in a realistic agent-based simulator of SARS-CoV-2 with dynamically arriving clusters. Across a wide range of system scales and testing budgets, our method consistently outperforms RMAB-inspired and heuristic baselines, improving outbreak control effectiveness by 20%-30%. Experiments on up to 40 concurrently active clusters further demonstrate that the hierarchical framework is highly scalable and enables faster decision-making than the RMAB-inspired method.
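层级式分配的接口可以用一个贪心规则示意:全局乘子压缩需求,局部边际价值决定预算内的优先顺序(函数名与贪心规则均为示意性替身,并非论文中学到的策略):

```python
import numpy as np

def allocate_tests(marginal_values, budget, cost_multiplier=1.0):
    """Illustrative top-level allocation across clusters: each individual
    has an estimated marginal value of testing; a global cost multiplier
    shrinks demand, then the budget picks the highest-value individuals."""
    adjusted = np.asarray(marginal_values, dtype=float) - cost_multiplier
    order = np.argsort(-adjusted)                       # highest value first
    chosen = [int(i) for i in order if adjusted[i] > 0][:budget]
    return sorted(chosen)
```

提高`cost_multiplier`会整体抑制需求,这对应框架中全局控制器学习的连续动作成本乘子所起的作用。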
[LG-46] Bridging Conformal Prediction and Scenario Optimization: Discarded Constraints and Modular Risk Allocation
链接: https://arxiv.org/abs/2603.19396
作者: Giuseppe C. Calafiore
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Scenario optimization and conformal prediction share a common goal, that is, turning finite samples into safety margins. Yet, different terminology often obscures the connection between their respective guarantees. This paper revisits that connection directly from a systems-and-control viewpoint. Building on the recent conformal/scenario bridge of O'Sullivan, Romao and Margellos (2026), we extend the forward direction to feasible sample-and-discard scenario algorithms. Specifically, if the final decision is determined by a stable subset of the retained sampled constraints, the classical mean violation law admits a direct exchangeability-based derivation. In this view, discarded samples naturally appear as admissible exceptions. We also introduce a simple modular composition rule that combines several blockwise calibration certificates into a single joint guarantee. This rule proves particularly useful in multi-output prediction and finite-horizon control, where engineers must distribute risk across coordinates, constraints, or prediction steps. Finally, we provide numerical illustrations using a calibrated multi-step tube around an identified predictor. These examples compare alternative stage-wise risk allocations and highlight the resulting performance and safety trade-offs in a standard constraint-tightening problem.
[LG-47] Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents
链接: https://arxiv.org/abs/2603.19375
作者: Toan Tran,Olivera Kotevska,Li Xiong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Membership inference attacks (MIAs), which enable adversaries to determine whether specific data points were part of a model’s training dataset, have emerged as an important framework to understand, assess, and quantify the potential information leakage associated with machine learning systems. Designing effective MIAs is a challenging task that usually requires extensive manual exploration of model behaviors to identify potential vulnerabilities. In this paper, we introduce AutoMIA – a novel framework that leverages large language model (LLM) agents to automate the design and implementation of new MIA signal computations. By utilizing LLM agents, we can systematically explore a vast space of potential attack strategies, enabling the discovery of novel strategies. Our experiments demonstrate AutoMIA can successfully discover new MIAs that are specifically tailored to user-configured target model and dataset, resulting in improvements of up to 0.18 in absolute AUC over existing MIAs. This work provides the first demonstration that LLM agents can serve as an effective and scalable paradigm for designing and implementing MIAs with SOTA performance, opening up new avenues for future exploration.
[LG-48] Warm-Start Flow Matching for Guaranteed Fast Text/Image Generation
链接: https://arxiv.org/abs/2603.19360
作者: Minyoung Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Current auto-regressive (AR) LLMs, diffusion-based text/image generative models, and recent flow matching (FM) algorithms are capable of generating premium quality text/image samples. However, the inference or sample generation in these models is often very time-consuming and computationally demanding, mainly due to large numbers of function evaluations corresponding to the lengths of tokens or the numbers of diffusion steps. This also necessitates heavy GPU resources, time, and electricity. In this work we propose a novel solution to reduce the sample generation time of flow matching algorithms by a guaranteed speed-up factor, without sacrificing the quality of the generated samples. Our key idea is to utilize computationally lightweight generative models whose generation time is negligible compared to that of the target AR/FM models. The draft samples from a lightweight model, whose quality is not satisfactory but fast to generate, are regarded as an initial distribution for a FM algorithm. Unlike conventional usage of FM that takes a pure noise (e.g., Gaussian or uniform) initial distribution, the draft samples are already of decent quality, so we can set the starting time to be closer to the end time rather than 0 in the pure noise FM case. This will significantly reduce the number of time steps to reach the target data distribution, and the speed-up factor is guaranteed. Our idea, dubbed Warm-Start FM or WS-FM, can essentially be seen as a learning-to-refine generative model from low-quality draft samples to high-quality samples. As a proof of concept, we demonstrate the idea on some synthetic toy data as well as real-world text and image generation tasks, illustrating that our idea offers guaranteed speed-up in sample generation without sacrificing the quality of the generated samples.
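加速比有保证的原因可以用一维欧拉积分直观说明:从草稿样本所在的时刻 t0 > 0 出发,只需走完 (1 − t0) 比例的步数(速度场、步数与函数名均为示意性假设):

```python
def integrate_flow(v, x, t0, n_steps_full=100):
    """Euler-integrate dx/dt = v(x, t) from t0 to 1. Standard FM starts
    from pure noise at t0 = 0; warm-start FM starts from a draft sample
    at t0 > 0, so only a (1 - t0) fraction of the steps is needed."""
    n = max(1, int(round((1.0 - t0) * n_steps_full)))
    dt = (1.0 - t0) / n
    t = t0
    for _ in range(n):
        x = x + dt * v(x, t)
        t += dt
    return x, n
```

例如 t0 = 0.7 时步数从100降到30,加速比 1/(1 − t0) 由构造保证,与样本质量无关。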
[LG-49] A Mathematical Theory of Understanding
链接: https://arxiv.org/abs/2603.19349
作者: Bahar Taşkesen
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Theoretical Economics (econ.TH)
*备注:
Abstract:Generative AI has transformed the economics of information production, making explanations, proofs, examples, and analyses available at very low cost. Yet the value of information still depends on whether downstream users can absorb and act on it. A signal conveys meaning only to a learner with the structural capacity to decode it: an explanation that clarifies a concept for one user may be indistinguishable from noise to another who lacks the relevant prerequisites. This paper develops a mathematical model of that learner-side bottleneck. We model the learner as a mind, an abstract learning system characterized by a prerequisite structure over concepts. A mind may represent a human learner, an artificial learner such as a neural network, or any agent whose ability to interpret signals depends on previously acquired concepts. Teaching is modeled as sequential communication with a latent target. Because instructional signals are usable only when the learner has acquired the prerequisites needed to parse them, the effective communication channel depends on the learner’s current state of knowledge and becomes more informative as learning progresses. The model yields two limits on the speed of learning and adoption: a structural limit determined by prerequisite reachability and an epistemic limit determined by uncertainty about the target. The framework implies threshold effects in training and capability acquisition. When the teaching horizon lies below the prerequisite depth of the target, additional instruction cannot produce successful completion of teaching; once that depth is reached, completion becomes feasible. Across heterogeneous learners, a common broadcast curriculum can be slower than personalized instruction by a factor linear in the number of learner types.
[LG-50] Exploring the Agentic Frontier of Verilog Code Generation
链接: https://arxiv.org/abs/2603.19347
作者: Patrick Yubeaton,Chinmay Hegde,Siddharth Garg
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) have made rapid advancements in code generation for popular languages such as Python and C++. Many of these recent gains can be attributed to the use of "agents" that wrap domain-relevant tools alongside LLMs. Hardware design languages such as Verilog have also seen improved code generation in recent years, but the impact of agentic frameworks on Verilog code generation tasks remains unclear. In this work, we present the first systematic evaluation of agentic LLMs for Verilog generation, using the recently introduced CVDP benchmark. We also introduce several open-source hardware design agent harnesses, providing a model-agnostic baseline for future work. Through controlled experiments across frontier models, we study how structured prompting and tool design affect performance, analyze agent failure modes and tool usage patterns, compare open-source and closed-source models, and provide qualitative examples of successful and failed agent runs. Our results show that naive agentic wrapping around frontier models can degrade performance (relative to standard forward passes with optimized prompts), but that structured harnesses meaningfully match and in some cases exceed non-agentic baselines. We find that the performance gap between open and closed source models is driven by both higher crash rates and weaker tool output interpretation. Our exploration illuminates the path towards designing special-purpose agents for Verilog generation in the future.
[LG-51] DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training
链接: https://arxiv.org/abs/2603.19338
作者: Maoyang Xiang,Bo Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also impose a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures by exploiting the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16× and decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
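"高概率区域分配更细分段"的思想可以用非均匀分段线性GELU近似示意(tanh间距的节点布置、分段数与函数名均为示意性假设;论文还会进一步做分布加权MSE量化):

```python
import numpy as np

def gelu(x):
    """Reference GELU (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def make_dapa_gelu(n_segments=32, spread=1.0):
    """Distribution-aware piecewise-linear GELU: knots are spaced densely
    near zero (where Gaussian-like pre-activations concentrate) and
    sparsely in the tails, then linear interpolation is used between them."""
    u = np.linspace(-1, 1, n_segments + 1)
    knots = 4.0 * spread * np.tanh(1.5 * u) / np.tanh(1.5)  # dense near 0
    vals = gelu(knots)
    return lambda x: np.interp(x, knots, vals)
```

与均匀分段相比,同样的分段数在预激活值集中的零点附近能获得明显更小的近似误差,这正是分布感知设计的收益来源。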
[LG-52] FalconBC: Flow matching for Amortized inference of Latent-CONditioned physiologic Boundary Conditions
链接: https://arxiv.org/abs/2603.19331
作者: Chloe H. Choi,Alison L. Marsden,Daniele E. Schiavazzi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Boundary condition tuning is a fundamental step in patient-specific cardiovascular modeling. Despite an increase in offline training cost, recent methods in data-driven variational inference can efficiently estimate the joint posterior distribution of boundary conditions, with amortization of training efforts over clinical targets. However, even the most modern approaches fall short in two important scenarios: open-loop models with known mean flow and assumed waveform shapes, and anatomies affected by vascular lesions where segmentation influences the reachability of pressure or flow split targets. In both cases, boundary conditions cannot be tuned in isolation. We introduce a general amortized inference framework based on probabilistic flow that treats clinical targets, inflow features, and point cloud embeddings of patient-specific anatomies as either conditioning variables or quantities to be jointly estimated. We demonstrate the approach on two patient-specific models: an aorto-iliac bifurcation with varying stenosis locations and severity, and a coronary arterial tree.
[LG-53] Towards Solving Polynomial-Objective Integer Programming with Hypergraph Neural Networks
链接: https://arxiv.org/abs/2603.19318
作者: Minshuo Li,Yaoxin Wu,Pavel Troubil,Yingqian Zhang,Wim P.M. Nuijten
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Accepted for publication in CPAIOR 2026, to appear in Springer Lecture Notes in Computer Science (LNCS)
Abstract:Complex real-world optimization problems often involve both discrete decisions and nonlinear relationships between variables. Many such problems can be modeled as polynomial-objective integer programs, encompassing cases with quadratic and higher-degree variable interactions. Nonlinearity makes them more challenging than their linear counterparts. In this paper, we propose a hypergraph neural network (HNN) based method to solve polynomial-objective integer programming (POIP). Besides presenting a high-degree-term-aware hypergraph representation to capture both high-degree information and variable-constraint interdependencies, we also propose a hypergraph neural network, which integrates convolution between variables and high-degree terms alongside convolution between variables and constraints, to predict solution values. Finally, a search process initialized from the predicted solutions is performed to further refine the results. Comprehensive experiments across a range of benchmarks demonstrate that our method consistently outperforms both existing learning-based approaches and state-of-the-art solvers, delivering superior solution quality with favorable efficiency. Note that our experiments involve both polynomial objectives and constraints, demonstrating our HNN’s versatility for general POIP problems and highlighting its advancement over the existing literature.
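"高次项感知的超图表示"中最基本的一步,是把多项式的每一项映射为一条覆盖其变量的超边。以下是一个极简示意(输入格式与函数名为假设;论文的表示还携带次数与约束信息):

```python
def polynomial_hypergraph(terms):
    """Map a polynomial objective to hyperedges: each term is given as
    (coefficient, variable_indices); the distinct variables of a term
    form one hyperedge, so x0*x0*x1 yields hyperedge {0, 1} and a cubic
    term over three variables yields a 3-node hyperedge."""
    return [sorted(set(idxs)) for _, idxs in terms]
```

二次项退化为普通图的边,而三次及以上的项产生真正的超边,这正是超图卷积相对普通GNN能捕获高次交互的原因。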
[LG-54] MSNet and LS-Net: Scalable Multi-Scale Multi-Representation Networks for Time Series Classification
链接: https://arxiv.org/abs/2603.19315
作者: Celal Alagöz,Mehmet Kurnaz,Farhan Aadil
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series classification (TSC) performance depends not only on architectural design but also on the diversity of input representations. In this work, we propose a scalable multi-scale convolutional framework that systematically integrates structured multi-representation inputs for univariate time series. We introduce two architectures: MSNet, a hierarchical multi-scale convolutional network optimized for robustness and calibration, and LS-Net, a lightweight variant designed for efficiency-aware deployment. In addition, we adapt LiteMV – originally developed for multivariate inputs – to operate on multi-representation univariate signals, enabling cross-representation interaction. We evaluate all models across 142 benchmark datasets under a unified experimental protocol. Critical Difference analysis confirms statistically significant performance differences among the top models. Results show that LiteMV achieves the highest mean accuracy, MSNet provides superior probabilistic calibration (lowest NLL), and LS-Net offers the best efficiency-accuracy tradeoff. Pareto analysis further demonstrates that multi-representation multi-scale modeling yields a flexible design space that can be tuned for accuracy-oriented, calibration-oriented, or resource-constrained settings. These findings establish scalable multi-representation multi-scale learning as a principled and practical direction for modern TSC. Reference implementation of MSNet and LS-Net is available at: this https URL
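The structured multi-representation idea can be illustrated with a minimal sketch: stack the raw series with derived views to form a multi-channel input for a convolutional backbone. The particular channels below (first difference and a trailing moving average) are assumptions chosen for illustration, not the paper's exact representation set.

```python
def first_difference(x):
    """Discrete first difference, padded with 0.0 to keep the original length."""
    return [0.0] + [b - a for a, b in zip(x, x[1:])]

def moving_average(x, w=3):
    """Trailing moving average with window w (shorter windows at the start)."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - w + 1)
        out.append(sum(x[lo:i + 1]) / (i + 1 - lo))
    return out

def multi_representation(x, w=3):
    """Stack raw signal and derived views as channels for a downstream CNN."""
    return [list(x), first_difference(x), moving_average(x, w)]

series = [1.0, 2.0, 4.0, 7.0, 11.0]
channels = multi_representation(series)
assert len(channels) == 3 and all(len(c) == len(series) for c in channels)
```

Each channel keeps the original length, so the stack can be fed to a 1-D convolutional network as a single multi-channel tensor.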
[LG-55] DPxFin: Adaptive Differential Privacy for Anti-Money Laundering Detection via Reputation-Weighted Federated Learning
链接: https://arxiv.org/abs/2603.19314
作者: Renuga Kanagavelu,Manjil Nepal,Ning Peiyan,Cai Kangning,Xu Jiming,Fei Gao,Yong Liu,Goh Siow Mong Rick,Qingsong Wei
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at AI FOR FINANCIAL FRAUD DETECTION PREVENTION AT ACM ICAIF-25
Abstract:In the modern financial system, combating money laundering is a critical challenge complicated by data privacy concerns and increasingly complex fraud transaction patterns. Although federated learning (FL) is a promising approach, as it allows institutions to train their models without sharing their data, it is prone to privacy leakage, particularly for tabular data forms such as financial data. To address this, we propose DPxFin, a novel federated framework that integrates reputation-guided adaptive differential privacy. Our approach computes client reputation by evaluating the alignment between locally trained models and the global model. Based on this reputation, we dynamically assign differential privacy noise to client updates, enhancing privacy while maintaining overall model utility. Clients with higher reputations receive lower noise to amplify their trustworthy contributions, while low-reputation clients are allocated stronger noise to mitigate risk. We validate DPxFin on the Anti-Money Laundering (AML) dataset under both IID and non-IID settings using a Multi Layer Perceptron (MLP). Experimental analysis establishes that our approach achieves a more desirable trade-off between accuracy and privacy than traditional FL and fixed-noise Differential Privacy (DP) baselines, with consistent, if modest, performance improvements. Moreover, DPxFin withstands tabular data leakage attacks, proving its effectiveness under real-world financial conditions.
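The abstract does not give DPxFin's exact reputation or noise-allocation formulas, but the mechanism can be sketched under stated assumptions: reputation as cosine alignment between a client update and the global update direction, and a linear schedule mapping higher reputation to lower Gaussian noise.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two update vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def adaptive_noise_std(reputation, sigma_min=0.1, sigma_max=1.0):
    """Higher reputation -> lower noise; linear interpolation (an assumption)."""
    r = max(0.0, min(1.0, reputation))
    return sigma_max - r * (sigma_max - sigma_min)

def privatize_update(update, reputation, rng):
    """Gaussian-mechanism noise with reputation-dependent scale."""
    sigma = adaptive_noise_std(reputation)
    return [w + rng.gauss(0.0, sigma) for w in update]

rng = random.Random(0)
global_dir = [1.0, 0.5, -0.2]
trusted = [0.9, 0.6, -0.1]    # well aligned with the global model
suspect = [-1.0, 0.2, 0.8]    # poorly aligned
r_hi, r_lo = cosine(trusted, global_dir), cosine(suspect, global_dir)
assert adaptive_noise_std(r_hi) < adaptive_noise_std(r_lo)
noisy = privatize_update(trusted, r_hi, rng)
```

A real deployment would also clip update norms before adding noise so that the Gaussian mechanism's privacy guarantee applies; that step is omitted here.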
[LG-56] PRIME-CVD: A Parametrically Rendered Informatics Medical Environment for Education in Cardiovascular Risk Modelling
链接: https://arxiv.org/abs/2603.19299
作者: Nicholas I-Hsien Kuo,Marzia Hoque Tania,Blanca Gallego,Louisa Jorm
类目: Machine Learning (cs.LG)
*备注:
Abstract:In recent years, progress in medical informatics and machine learning has been accelerated by the availability of openly accessible benchmark datasets. However, patient-level electronic medical record (EMR) data are rarely available for teaching or methodological development due to privacy, governance, and re-identification risks. This has limited reproducibility, transparency, and hands-on training in cardiovascular risk modelling. Here we introduce PRIME-CVD, a parametrically rendered informatics medical environment designed explicitly for medical education. PRIME-CVD comprises two openly accessible synthetic data assets representing a cohort of 50,000 adults undergoing primary prevention for cardiovascular disease. The datasets are generated entirely from a user-specified causal directed acyclic graph parameterised using publicly available Australian population statistics and published epidemiologic effect estimates, rather than from patient-level EMR data or trained generative models. Data Asset 1 provides a clean, analysis-ready cohort suitable for exploratory analysis, stratification, and survival modelling, while Data Asset 2 restructures the same cohort into a relational, EMR-style database with realistic structural and lexical heterogeneity. Together, these assets enable instruction in data cleaning, harmonisation, causal reasoning, and policy-relevant risk modelling without exposing sensitive information. Because all individuals and events are generated de novo, PRIME-CVD preserves realistic subgroup imbalance and risk gradients while ensuring negligible disclosure risk. PRIME-CVD is released under a Creative Commons Attribution 4.0 licence to support reproducible research and scalable medical education.
[LG-57] A Dynamic Bayesian and Machine Learning Framework for Quantitative Evaluation and Prediction of Operator Situation Awareness in Nuclear Power Plants
链接: https://arxiv.org/abs/2603.19298
作者: Shuai Chen,Huiqiao Jia,Tao Qing,Li Zhang,Xingyu Xiao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Operator situation awareness is a pivotal yet elusive determinant of human reliability in complex nuclear control environments. Existing assessment methods, such as SAGAT and SART, remain static, retrospective, and detached from the evolving cognitive dynamics that drive operational risk. To overcome these limitations, this study introduces the dynamic Bayesian machine learning framework for situation awareness (DBML SA), a unified approach that fuses probabilistic reasoning and data driven intelligence to achieve quantitative, interpretable, and predictive situation awareness modeling. Leveraging 212 operational event reports (2007 to 2021), the framework reconstructs the causal temporal structure of 11 performance shaping factors across multiple cognitive layers. The Bayesian component enables time evolving inference of situation awareness reliability under uncertainty, while the neural component establishes a nonlinear predictive mapping from PSFs to SART scores, achieving a mean absolute percentage error of 13.8%, with statistical consistency to subjective evaluations (p > 0.05). Results highlight training quality and stress dynamics as primary drivers of situation awareness degradation. Overall, DBML SA transcends traditional questionnaire-based assessments by enabling real-time cognitive monitoring, sensitivity analysis, and early-warning prediction, paving the way toward intelligent human machine reliability management in next-generation digital main control rooms.
[LG-58] CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing
链接: https://arxiv.org/abs/2603.19297
作者: Manit Baser,Alperen Yildiz,Dinil Mon Divakaran,Mohan Gurusamy
类目: Machine Learning (cs.LG)
*备注:
Abstract:The static knowledge representations of large language models (LLMs) inevitably become outdated or incorrect over time. While model-editing techniques offer a promising solution by modifying a model’s factual associations, they often produce unpredictable ripple effects, which are unintended behavioral changes that propagate even to the hidden space. In this work, we introduce CLaRE, a lightweight representation-level technique to identify where these ripple effects may occur. Unlike prior gradient-based methods, CLaRE quantifies entanglement between facts using forward activations from a single intermediate layer, avoiding costly backward passes. To enable systematic study, we prepare and analyse a corpus of 11,427 facts drawn from three existing datasets. Using CLaRE, we compute large-scale entanglement graphs of this corpus for multiple models, capturing how local edits propagate through representational space. These graphs enable stronger preservation sets for model editing, audit trails, efficient red-teaming, and scalable post-edit evaluation. In comparison to baselines, CLaRE achieves an average of 62.2% improvement in Spearman correlation with ripple effects while being 2.74× faster, and using 2.85× less peak GPU memory. Besides, CLaRE requires only a fraction of the storage needed by the baselines to compute and preserve fact representations. Our entanglement graphs and corpus are available at this https URL.
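CLaRE's precise entanglement score is not specified in the abstract; the sketch below uses cosine similarity between single-layer forward activations (toy vectors standing in for fact representations) and a similarity threshold to build an entanglement graph. The threshold and score are assumptions, but the sketch captures the forward-only, no-backward-pass idea.

```python
import math

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def entanglement_graph(acts, threshold=0.8):
    """Edges between facts whose single-layer forward activations are highly
    similar; only a forward pass is needed, never a gradient."""
    edges = []
    for i in range(len(acts)):
        for j in range(i + 1, len(acts)):
            s = cosine(acts[i], acts[j])
            if s >= threshold:
                edges.append((i, j, s))
    return edges

acts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy "fact representations"
edges = entanglement_graph(acts)
assert len(edges) == 1 and edges[0][:2] == (0, 1)
```

Edges in this graph flag fact pairs where editing one fact may ripple into the other, which is exactly the information a preservation set for model editing needs.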
[LG-59] TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
链接: https://arxiv.org/abs/2603.19296
作者: Toshiaki Koike-Akino,Jing Liu,Ye Wang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 25 pages
Abstract:To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these methods rely heavily on calibration data, domain shift issues may arise for unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task, while still achieving inference speedup. Several experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.
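The abstract does not detail TTQ's online calibration, but the general activation-aware principle it builds on can be sketched (here in the style of AWQ-like per-channel scaling, which is an assumption, not the paper's method): weights feeding large activations are scaled up before rounding so they retain more precision, and the scaling is undone afterwards.

```python
def quant_dequant(ws, n_bits=4):
    """Symmetric uniform quantization of a weight vector, then dequantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax
    if scale == 0:
        return list(ws)
    return [round(w / scale) * scale for w in ws]

def activation_aware_quant(ws, act_mags, n_bits=4):
    """Scale each weight by the magnitude of its input activation before
    rounding, then undo the scale, so salient channels keep more precision."""
    scaled = [w * a for w, a in zip(ws, act_mags)]
    deq = quant_dequant(scaled, n_bits)
    return [d / a for d, a in zip(deq, act_mags)]

# Toy layer: a tiny weight on a huge activation matters for the output.
ws, acts = [0.01, 1.0], [100.0, 1.0]
ref = sum(w * a for w, a in zip(ws, acts))
naive = sum(w * a for w, a in zip(quant_dequant(ws), acts))
aware = sum(w * a for w, a in zip(activation_aware_quant(ws, acts), acts))
assert abs(aware - ref) < abs(naive - ref)
```

In a test-time setting, `act_mags` would be estimated online from the current prompt's activations rather than from an offline calibration set.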
[LG-60] BrainSCL: Subtype-Guided Contrastive Learning for Brain Disorder Diagnosis
链接: https://arxiv.org/abs/2603.19295
作者: Xiaolong Li,Guiliang Guo,Guangqi Wen,Peng Cao,Jinzhu Yang,Honglin Wu,Xiaoli Liu,Fei Wang,Osmar R. Zaiane
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mental disorder populations exhibit pronounced heterogeneity – that is, significant differences between samples – which poses a significant challenge to the definition of positive pairs in contrastive learning. To address this, we propose a subtype-guided contrastive learning framework that models patient heterogeneity as latent subtypes and incorporates them as structural priors to guide discriminative representation learning. Specifically, we construct multi-view representations by combining patients’ clinical text with graph structure adaptively learned from BOLD signals, to uncover latent subtypes via unsupervised spectral clustering. A dual-level attention mechanism is proposed to construct prototypes for capturing stable subtype-specific connectivity patterns. We further propose a subtype-guided contrastive learning strategy that pulls samples toward their subtype prototype graph, reinforcing intra-subtype consistency for providing effective supervisory signals to improve model performance. We evaluate our method on Major Depressive Disorder (MDD), Bipolar Disorder (BD), and Autism Spectrum Disorders (ASD). Experimental results confirm the effectiveness of subtype prototype graphs in guiding contrastive learning and demonstrate that the proposed approach outperforms state-of-the-art approaches. Our code is available at this https URL.
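The prototype-pulling idea can be sketched as an InfoNCE-style loss over subtype prototypes: a sample embedding is attracted to its own subtype's prototype and repelled from the others. This is a simplified scalar-embedding version under stated assumptions; the paper operates on prototype graphs with dual-level attention, which is not reproduced here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prototype_contrastive_loss(z, prototypes, subtype, tau=0.5):
    """InfoNCE over subtype prototypes: minimize to pull z toward
    prototypes[subtype] and away from the other prototypes."""
    logits = [cosine(z, p) / tau for p in prototypes]
    m = max(logits)  # log-sum-exp stabilization
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[subtype]

z = [1.0, 0.0]
protos = [[1.0, 0.0], [0.0, 1.0]]
# Assigning z to the prototype it already resembles gives the lower loss.
assert prototype_contrastive_loss(z, protos, 0) < prototype_contrastive_loss(z, protos, 1)
```

Minimizing this loss over a batch enforces intra-subtype consistency, which is the supervisory signal the abstract describes.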
[LG-61] Beam-aware Kernelized Contextual Bandits for User Association and Beamforming in mmWave Vehicular Networks
链接: https://arxiv.org/abs/2603.19285
作者: Xiaoyang He,Manabu Tsukada
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Timely channel information is necessary for vehicles to determine both the serving base station (BS) and the beamforming vector, but frequent estimation of fast-fading mmWave channels incurs significant overhead. To address this challenge, we propose a Beam-aware Kernelized Contextual Upper Confidence Bound (BKC-UCB) algorithm that estimates instantaneous transmission rates without additional channel measurements by exploiting historical contexts such as vehicle location and velocity, together with past observed transmission rates. Specifically, BKC-UCB leverages kernel methods to capture the nonlinear relationship between context and transmission rate by mapping contexts into a reproducing kernel Hilbert space (RKHS), where linear learning becomes feasible. Rather than treating each beam as an independent arm, the beam index is embedded into the context, enabling BKC-UCB to exploit correlations among beams to accelerate convergence. Furthermore, an event-triggered information sharing mechanism is incorporated into BKC-UCB, enabling information exchange only when significant explorations are conducted to improve learning efficiency with limited communication overhead.
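The key structural idea, embedding the beam index into the context rather than treating each beam as an independent arm, can be sketched as follows. The proper BKC-UCB construction uses kernel ridge regression with RKHS confidence bounds; the sketch below substitutes a simpler kernel-weighted mean with a shrinking exploration bonus as an illustrative surrogate (the kernel, bonus, and context layout are all assumptions).

```python
import math

def rbf(x, y, gamma=5.0):
    """RBF kernel over joint (location, velocity, beam) contexts."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def ucb_score(ctx, history, beta=0.5):
    """Kernel-weighted mean reward plus an exploration bonus that shrinks
    with the effective number of nearby observations."""
    ws = [rbf(ctx, c) for c, _ in history]
    n_eff = sum(ws)
    if n_eff < 1e-12:
        return float("inf")  # unexplored region: force exploration
    mean = sum(w * r for w, (_, r) in zip(ws, history)) / n_eff
    return mean + beta / math.sqrt(n_eff)

def pick_beam(loc_vel, n_beams, history):
    """Beam index is embedded into the context, so correlations among beams
    are shared through the kernel instead of learned per-arm."""
    ctxs = [loc_vel + (b / (n_beams - 1),) for b in range(n_beams)]
    return max(range(n_beams), key=lambda b: ucb_score(ctxs[b], history))

# History of (context, observed rate): beam 0 was good near this location.
history = [((0.2, 0.5, 0.0), 0.9), ((0.2, 0.5, 1.0), 0.1)]
assert pick_beam((0.2, 0.5), 2, history) == 0
```

Because nearby beam indices produce nearby contexts, reward observed for one beam immediately informs estimates for its neighbors, which is what accelerates convergence.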
[LG-62] he IJCNN 2025 Review Process
链接: https://arxiv.org/abs/2603.19244
作者: Michele Scarpiniti,Danilo Comminiello
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:
Abstract:The International Joint Conference on Neural Networks (IJCNN) is the premier international conference in the area of neural network theory, analysis, and applications. The 2025 edition of the conference comprised 5,526 paper submissions, 7,877 active reviewers, 426 area chairs, 2,152 accepted papers, and more than 2,300 attendees. This represents a growth of about 100% in terms of submissions, 200% in terms of reviewers, and over 50% in terms of attendees as compared to the previous edition. In this paper, we describe several key aspects of the whole review process, including a strategy for ranking the scores provided by the reviewers via a score index, together with a calibrated version of this index used experimentally to remove reviewer-specific bias from reviews.
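The abstract does not define the score index or the calibration formula; a standard baseline for removing reviewer-specific bias is per-reviewer z-scoring, sketched below as one plausible instance of what "calibrated" could mean, not as the conference's actual procedure.

```python
import math

def calibrate(reviews):
    """Per-reviewer z-scoring: subtract each reviewer's own mean and divide by
    their own std, so systematically harsh and lenient reviewers become
    comparable. reviews: {reviewer: {paper: raw_score}}."""
    out = {}
    for reviewer, scores in reviews.items():
        mu = sum(scores.values()) / len(scores)
        var = sum((s - mu) ** 2 for s in scores.values()) / len(scores)
        sd = math.sqrt(var) or 1.0  # constant reviewer: leave scale at 1
        out[reviewer] = {p: (s - mu) / sd for p, s in scores.items()}
    return out

# A harsh and a lenient reviewer who agree on the ranking of papers A and B
# produce identical calibrated scores.
reviews = {"harsh": {"A": 2.0, "B": 4.0}, "lenient": {"A": 6.0, "B": 8.0}}
cal = calibrate(reviews)
assert cal["harsh"] == cal["lenient"]
```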
[LG-63] Antenna Array Beamforming Based on a Hybrid Quantum Optimization Framework
链接: https://arxiv.org/abs/2603.20072
作者: Shuai Zeng
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:This paper proposes a hybrid quantum optimization framework for large-scale antenna-array beamforming with jointly optimized discrete phases and continuous amplitudes. The method combines quantum-inspired search with classical gradient refinement to handle mixed discrete-continuous variables efficiently. For phase optimization, a Gray-code and odd-combination encoding scheme is introduced to improve robustness and avoid the complexity explosion of higher-order Ising models. For amplitude optimization, a geometric spin-combination encoding and a two-stage strategy are developed, using quantum-inspired optimization for coarse search and gradient optimization for fine refinement. To enhance solution diversity and quality, a rainbow quantum-inspired algorithm integrates multiple optimizers for parallel exploration, followed by hierarchical-clustering-based candidate refinement. In addition, a double outer-product method and an augmented version are proposed to construct the coupling matrix and bias vector efficiently, improving numerical precision and implementation efficiency. Under the scoring rules of the 7th National Quantum Computing Hackathon, simulations on a 32-element antenna array show that the proposed method achieves a score of 461.58 under constraints on near-main-lobe sidelobes, wide-angle sidelobes, beamwidth, and optimization time, nearly doubling the baseline score. The proposed framework provides an effective reference for beamforming optimization in future wireless communication systems.
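The robustness of the Gray-code phase encoding mentioned in the abstract comes from a standard property of binary-reflected Gray codes: adjacent phase levels map to codewords that differ in exactly one bit, so a single spin flip changes the phase by only one quantization step. A minimal sketch of the standard encode/decode pair (the paper's combined odd-combination scheme is not reproduced here):

```python
def gray_encode(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Inverse of gray_encode via cumulative XOR of right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Adjacent phase levels differ in exactly one bit of their codewords.
for i in range(15):
    assert bin(gray_encode(i) ^ gray_encode(i + 1)).count("1") == 1
assert all(gray_decode(gray_encode(i)) == i for i in range(64))
```

In an Ising formulation, each codeword bit becomes a spin, and the one-bit-per-step property keeps single-spin perturbations from causing large phase jumps.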
[LG-64] Structured Latent Dynamics in Wireless CSI via Homomorphic World Models
链接: https://arxiv.org/abs/2603.20048
作者: Salmane Naoumi,Mehdi Bennis,Marwa Chafii
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: ACCEPTED FOR PUBLICATION IN IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC) 2026
Abstract:We introduce a self-supervised framework for learning predictive and structured representations of wireless channels by modeling the temporal evolution of channel state information (CSI) in a compact latent space. Our method casts the problem as a world modeling task and leverages the Joint Embedding Predictive Architecture (JEPA) to learn action-conditioned latent dynamics from CSI trajectories. To promote geometric consistency and compositionality, we parameterize transitions using homomorphic updates derived from Lie algebra, yielding a structured latent space that reflects spatial layout and user motion. Evaluations on the DICHASUS dataset show that our approach outperforms strong baselines in preserving topology and forecasting future embeddings across unseen environments. The resulting latent space enables metrically faithful channel charts, offering a scalable foundation for downstream applications such as mobility-aware scheduling, localization, and wireless scene understanding.
[LG-65] Graph-Informed Adversarial Modeling: Infimal Subadditivity of Interpolative Divergences
链接: https://arxiv.org/abs/2603.20025
作者: Panagiota Birmpa(1 and 2),Eric Joseph Hall(1 and 2) ((1) Heriot–Watt University, (2) Maxwell Institute for Mathematical Sciences)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 32 pages, 9 figures
Abstract:We study adversarial learning when the target distribution factorizes according to a known Bayesian network. For interpolative divergences, including (f,\Gamma) -divergences, we prove a new infimal subadditivity principle showing that, under suitable conditions, a global variational discrepancy is controlled by an average of family-level discrepancies aligned with the graph. In an additive regime, this surrogate is exact. This provides a variational justification for replacing a graph-agnostic GAN with a monolithic discriminator by a graph-informed GAN with localized family-level discriminators. The result does not require the optimizer itself to factorize according to the graph. We also obtain parallel results for integral probability metrics and proximal optimal transport divergences, identify natural discriminator classes for which the theory applies, and present experiments showing improved stability and structural recovery relative to graph-agnostic baselines.
[LG-66] Structural Controllability of Large-Scale Hypergraphs
链接: https://arxiv.org/abs/2603.19955
作者: Joshua Pickard,Xin Mao,Can Chen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Systems and Control (eess.SY)
*备注: 14 pages, 4 figures, 1 table
Abstract:Controlling real-world networked systems, including ecological, biomedical, and engineered networks that exhibit higher-order interactions, remains challenging due to inherent nonlinearities and large system scales. Despite extensive studies on graph controllability, the controllability properties of hypergraphs remain largely underdeveloped. Existing results focus primarily on exact controllability, which is often impractical for large-scale hypergraphs. In this article, we develop a structural controllability framework for hypergraphs by modeling hypergraph dynamics as polynomial dynamical systems. In particular, we extend classical notions of accessibility and dilation from linear graph-based systems to polynomial hypergraph dynamics and establish a hypergraph-based criterion under which the topology guarantees satisfaction of classical Lie-algebraic and Kalman-type rank conditions for almost all parameter choices. We further derive a topology-based lower bound on the minimum number of driver nodes required for structural controllability and leverage this bound to design a scalable driver node selection algorithm combining dilation-aware initialization via maximum matching with greedy accessibility expansion. We demonstrate the effectiveness and scalability of the proposed framework through numerical experiments on hypergraphs with tens to thousands of nodes and higher-order interactions.
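The maximum-matching initialization mentioned in the abstract builds on the classical minimum input theorem for linear dynamics on ordinary graphs (the number of driver nodes is max(1, n - |maximum matching|)); the paper's topology-based bound generalizes this to polynomial hypergraph dynamics. A sketch of the underlying graph case, using Kuhn's augmenting-path matching:

```python
def max_matching(adj, n):
    """Kuhn's maximum matching on the bipartite view of a directed graph:
    left = source copies, right = target copies; adj[u] lists targets of u."""
    match_r = [-1] * n  # match_r[v] = u means edge u -> v is in the matching

    def try_augment(u, seen):
        for v in adj.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            if match_r[v] == -1 or try_augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in range(n))

def min_driver_nodes(adj, n):
    """Minimum input theorem for linear graph dynamics (classical case)."""
    return max(1, n - max_matching(adj, n))

# A directed path 0 -> 1 -> 2 is controllable from its head node alone.
assert min_driver_nodes({0: [1], 1: [2]}, 3) == 1
```

Unmatched nodes (here only node 0) are exactly the dilation points that must receive an external input, which is the role of the dilation-aware initialization in the proposed algorithm.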
[LG-67] Infinite-dimensional spherical-radial decomposition for probabilistic functions with application to constrained optimal control and Gaussian process regression
链接: https://arxiv.org/abs/2603.19907
作者: Kewei Wang,Georg Stadler
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 25 pages, 8 figures
Abstract:The spherical-radial decomposition (SRD) is an efficient method for estimating probabilistic functions and their gradients defined over finite-dimensional elliptical distributions. In this work, we generalize the SRD to infinite stochastic dimensions by combining subspace SRD with standard Monte Carlo methods. The resulting method, which we call hybrid infinite-dimensional SRD (hiSRD) provides an unbiased, low-variance estimator for convex sets arising, for instance, in chance-constrained optimization. We provide a theoretical analysis of the variance of finite-dimensional SRD as the dimension increases, and show that the proposed hybrid method eliminates truncation-induced bias, reduces variance, and allows the computation of derivatives of probabilistic functions. We present comprehensive numerical studies for a risk-neutral stochastic PDE optimal control problem with joint chance state constraints, and for optimizing kernel parameters in Gaussian process regression under the constraint that the posterior process satisfies joint chance constraints.
[LG-68] Deep Autocorrelation Modeling for Time-Series Forecasting: Progress and Prospects
链接: https://arxiv.org/abs/2603.19899
作者: Hao Wang,Licheng Pan,Qingsong Wen,Jialin Yu,Zhichao Chen,Chunyuan Zheng,Xiaoxi Li,Zhixuan Chu,Chao Xu,Mingming Gong,Haoxuan Li,Yuan Lu,Zhouchen Lin,Philip Torr,Yan Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Autocorrelation is a defining characteristic of time-series data, where each observation is statistically dependent on its predecessors. In the context of deep time-series forecasting, autocorrelation arises in both the input history and the label sequences, presenting two central research challenges: (1) designing neural architectures that model autocorrelation in history sequences, and (2) devising learning objectives that model autocorrelation in label sequences. Recent studies have made strides in tackling these challenges, but a systematic survey examining both aspects remains lacking. To bridge this gap, this paper provides a comprehensive review of deep time-series forecasting from the perspective of autocorrelation modeling. In contrast to existing surveys, this work makes two distinctive contributions. First, it proposes a novel taxonomy that encompasses recent literature on both model architectures and learning objectives – whereas prior surveys neglect or inadequately discuss the latter aspect. Second, it offers a thorough analysis of the motivations, insights, and progression of the surveyed literature from a unified, autocorrelation-centric perspective, providing a holistic overview of the evolution of deep time-series forecasting. The full list of papers and resources is available at this https URL.
[LG-69] Minimax Generalized Cross-Entropy
链接: https://arxiv.org/abs/2603.19874
作者: Kartheek Bondugula,Santiago Mazuelas,Aritz Pérez,Anqi Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performances with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradient computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.
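The CE-MAE interpolation the paper builds on is the classical GCE loss of the form (1 - p^q)/q on the true-class probability: the limit q → 0 recovers cross-entropy -log p, and q = 1 gives the (rescaled) MAE 1 - p. The proposed MGCE reformulates this as a convex minimax problem over margins, which is not reproduced here; the sketch only shows the underlying loss family.

```python
import math

def gce(p_y, q):
    """Generalized cross-entropy on the true-class probability p_y.
    q -> 0: cross-entropy; q = 1: (rescaled) mean absolute error."""
    return (1.0 - p_y ** q) / q

p = 0.5
# Near q = 0, GCE approaches the cross-entropy -log p.
assert abs(gce(p, 1e-8) - (-math.log(p))) < 1e-6
# At q = 1, GCE equals 1 - p, the MAE up to a constant factor of 2.
assert gce(p, 1.0) == 1.0 - p
# Higher confidence on the true class gives lower loss at any q.
assert gce(0.9, 0.7) < gce(0.5, 0.7)
```

Intermediate q values trade the fast optimization of CE against the label-noise robustness of MAE, which is the trade-off the minimax formulation aims to preserve while restoring convexity.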
[LG-70] Modeling subgrid scale production rates on complex meshes using graph neural networks
链接: https://arxiv.org/abs/2603.19841
作者: Priyabrat Dash,Mathis Bode,Konduri Aditya
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Large-eddy simulations (LES) require closures for filtered production rates because the resolved fields do not contain all correlations that govern chemical source terms. We develop a graph neural network (GNN) that predicts filtered species production rates on non-uniform meshes from inputs of filtered mass fractions and temperature. Direct numerical simulations of turbulent premixed hydrogen-methane jet flames with hydrogen fractions of 10%, 50%, and 80% provide the dataset. All fields are Favre filtered with the filter width matched to the operating mesh, and learning is performed on subdomain graphs constructed from mesh-point connectivity. A compact set of reactants, intermediates, and products is used, and their filtered production rates form the targets. The model is trained on 10% and 80% blends and evaluated on the unseen 50% blend to test cross-composition generalization. The GNN is compared against an unclosed reference that evaluates rates at the filtered state, and a convolutional neural network baseline that requires remeshing. Across in-distribution and out-of-distribution cases, the GNN yields lower errors and closer statistical agreement with the reference data. Furthermore, the model demonstrates robust generalization across varying filter widths without retraining, maintaining bounded errors at coarser spatial resolutions. A backward facing step configuration further confirms prediction efficacy on a practically relevant geometry. These results highlight the capability of GNNs as robust data-driven closure models for LES on complex meshes.
[LG-71] Explainable cluster analysis: a bagging approach
链接: https://arxiv.org/abs/2603.19840
作者: Federico Maria Quetti,Elena Ballante,Silvia Figini,Paolo Giudici
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:A major limitation of clustering approaches is their lack of explainability: methods rarely provide insight into which features drive the grouping of similar observations. To address this limitation, we propose an ensemble-based clustering framework that integrates bagging and feature dropout to generate feature importance scores, in analogy with feature importance mechanisms in supervised random forests. By leveraging multiple bootstrap resampling schemes and aggregating the resulting partitions, the method improves stability and robustness of the cluster definition, particularly in small-sample or noisy settings. Feature importance is assessed through an information-theoretic approach: at each step, the mutual information between each feature and the estimated cluster labels is computed and weighted by a measure of clustering validity to emphasize well-formed partitions, before being aggregated into a final score. The method outputs both a consensus partition and a corresponding measure of feature importance, enabling a unified interpretation of clustering structure and variable relevance. Its effectiveness is demonstrated on multiple simulated and real-world datasets.
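The scoring step can be sketched as follows: a plug-in mutual information estimate between each feature (discretized) and the cluster labels, aggregated across ensemble runs with validity weights. The clustering itself and the validity measure are left abstract here (each run supplies labels and a weight), since the abstract does not fix a particular base clusterer.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * math.log(pj / ((px[x] / n) * (py[y] / n)))
    return mi

def feature_importance(runs):
    """Aggregate per-run MI scores into a validity-weighted importance.
    Each run is (labels, kept_features, feature_columns, validity_weight);
    features dropped in a run simply do not contribute to that run."""
    totals, weights = Counter(), Counter()
    for labels, kept, cols, w in runs:
        for f in kept:
            totals[f] += w * mutual_information(cols[f], labels)
            weights[f] += w
    return {f: totals[f] / weights[f] for f in totals}

# Feature 0 perfectly tracks the partition; feature 1 is independent of it.
labels = [0, 0, 1, 1]
cols = {0: [0, 0, 1, 1], 1: [0, 1, 0, 1]}
imp = feature_importance([(labels, [0, 1], cols, 1.0)])
assert imp[0] > imp[1]
```

Weighting by clustering validity emphasizes well-formed partitions, so noisy runs contribute less to the final importance scores.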
[LG-72] A two-step sequential approach for hyperparameter selection in finite context models
链接: https://arxiv.org/abs/2603.19736
作者: José Contente,Ana Martins,Armando J. Pinho,Sónia Gouveia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Finite-context models (FCMs) are widely used for compressing symbolic sequences such as DNA, where predictive performance depends critically on the context length k and smoothing parameter \alpha. In practice, these hyperparameters are typically selected through exhaustive search, which is computationally expensive and scales poorly with model complexity. This paper proposes a statistically grounded two-step sequential approach for efficient hyperparameter selection in FCMs. The key idea is to decompose the joint optimization problem into two independent stages. First, the context length k is estimated using categorical serial dependence measures, including Cramér’s \nu, Cohen’s \kappa and partial mutual information (pami). Second, the smoothing parameter \alpha is estimated via maximum likelihood conditional on the selected context length k. Simulation experiments were conducted on synthetic symbolic sequences generated by FCMs across multiple (k, \alpha) configurations, considering a four-letter alphabet and different sample sizes. Results show that the dependence measures are substantially more sensitive to variations in k than in \alpha, supporting the sequential estimation strategy. As expected, the accuracy of the hyperparameter estimation improves with increasing sample size. Furthermore, the proposed method achieves compression performance comparable to exhaustive grid search in terms of average bitrate (bits per symbol), while substantially reducing computational cost. Overall, the results on simulated data show that the proposed sequential approach is a practical and computationally efficient alternative to exhaustive hyperparameter tuning in FCMs.
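An adaptive finite-context model with Laplace smoothing can be sketched directly, along with the second step of the proposed scheme: with k fixed, choose alpha by (approximate) maximum likelihood, which for a compressor is equivalent to minimizing the code length. The grid search below stands in for the paper's ML estimation; the first step (selecting k via serial dependence measures) is not reproduced.

```python
import math
from collections import Counter, defaultdict

def fcm_bitrate(seq, k, alpha, alphabet="ACGT"):
    """Adaptive order-k finite-context model: average code length
    (bits/symbol) of seq under counts with additive smoothing alpha."""
    counts = defaultdict(Counter)
    a_size, bits = len(alphabet), 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]          # shorter contexts near the start
        c = counts[ctx]
        p = (c[sym] + alpha) / (sum(c.values()) + alpha * a_size)
        bits -= math.log2(p)
        c[sym] += 1                          # update counts after coding
    return bits / len(seq)

def best_alpha(seq, k, grid=(0.01, 0.1, 0.5, 1.0)):
    """Step 2 of the two-step scheme: alpha by minimum code length given k."""
    return min(grid, key=lambda a: fcm_bitrate(seq, k, a))

repetitive = "AC" * 200
assert fcm_bitrate(repetitive, 2, 1.0) < 1.0  # highly predictable sequence
```

For a nearly deterministic sequence a small alpha wins, because heavy smoothing keeps probability mass on symbols that never occur in a given context.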
[LG-73] Minimax and Adaptive Covariance Matrix Estimation under Differential Privacy
链接: https://arxiv.org/abs/2603.19703
作者: T. Tony Cai,Yicheng Li
类目: Statistics Theory (math.ST); Machine Learning (cs.LG)
*备注:
Abstract:The covariance matrix plays a fundamental role in the analysis of high-dimensional data. This paper studies minimax and adaptive estimation of high-dimensional bandable covariance matrices under differential privacy constraints. We propose a novel differentially private blockwise tridiagonal estimator that achieves minimax-optimal convergence rates under both the operator norm and the Frobenius norm. In contrast to the non-private setting, the privacy-induced error exhibits a polynomial dependence on the ambient dimension, revealing a substantial additional cost of privacy. To establish optimality, we develop a new differentially private van Trees inequality and construct carefully designed prior distributions to obtain matching minimax lower bounds. The proposed private van Trees inequality applies more broadly to general private estimation problems and is of independent interest. We further introduce an adaptive estimator that attains the optimal rate up to a logarithmic factor without prior knowledge of the decay parameter, based on a novel hierarchical tridiagonal approach. Numerical experiments corroborate the theoretical results and illustrate the fundamental privacy-accuracy trade-off.
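For bandable covariance matrices, the basic non-private building block is a banding operator applied to the sample covariance; a private variant then perturbs the retained entries. The sketch below shows plain banding plus independent Gaussian noise as an illustration only; the paper's blockwise tridiagonal estimator, its noise calibration, and symmetry handling all differ.

```python
import random

def sample_covariance(data):
    """Plain sample covariance of rows of data (denominator n)."""
    n, d = len(data), len(data[0])
    mu = [sum(row[j] for row in data) / n for j in range(d)]
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in data) / n
             for j in range(d)] for i in range(d)]

def banded(mat, k):
    """Keep entries within bandwidth k of the diagonal, zero out the rest."""
    d = len(mat)
    return [[mat[i][j] if abs(i - j) <= k else 0.0 for j in range(d)]
            for i in range(d)]

def private_banded(mat, k, sigma, rng):
    """Gaussian noise on the retained band (illustrative; a real mechanism
    would calibrate sigma to sensitivity and preserve symmetry)."""
    d = len(mat)
    b = banded(mat, k)
    return [[b[i][j] + rng.gauss(0.0, sigma) if abs(i - j) <= k else 0.0
             for j in range(d)] for i in range(d)]

m = [[1.0, 2.0, 3.0], [2.0, 1.0, 2.0], [3.0, 2.0, 1.0]]
assert banded(m, 1)[0][2] == 0.0 and banded(m, 1)[0][1] == 2.0
noisy = private_banded(m, 1, 0.1, random.Random(0))
```

Restricting noise to the band is what keeps the privacy cost from scaling with all d^2 entries, which is where the band structure pays off in the error rates.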
[LG-74] Model Selection and Parameter Estimation of Multi-dimensional Gaussian Mixture Model
链接: https://arxiv.org/abs/2603.19657
作者: Xinyu Liu,Hai Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we study the problem of learning multi-dimensional Gaussian Mixture Models (GMMs), with a specific focus on model order selection and efficient mixing distribution estimation. We first establish an information-theoretic lower bound on the critical sample complexity required for reliable model selection. More specifically, we show that distinguishing a k-component mixture from a simpler model necessitates a sample size scaling of \Omega(\Delta^{-(4k-4)}). We then propose a thresholding-based estimation algorithm that evaluates the spectral gap of an empirical covariance matrix constructed from random Fourier measurement vectors. This parameter-free estimator operates with an efficient time complexity of \mathcal{O}(k^2 n), scaling linearly with the sample size. We demonstrate that the sample complexity of our method matches the established lower bound, confirming its minimax optimality with respect to the component separation distance \Delta. Conditioned on the estimated model order, we subsequently introduce a gradient-based minimization method for parameter estimation. To effectively navigate the non-convex objective landscape, we employ a data-driven, score-based initialization strategy that guarantees rapid convergence. We prove that this method achieves the optimal parametric convergence rate of \mathcal{O}_p(n^{-1/2}) for estimating the component means. To enhance the algorithm’s efficiency in high-dimensional regimes where the ambient dimension exceeds the number of mixture components (i.e., d > k), we integrate principal component analysis (PCA) for dimension reduction. Numerical experiments demonstrate that our Fourier-based algorithmic framework outperforms conventional Expectation-Maximization (EM) methods in both estimation accuracy and computational time.
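The spectral-gap idea can be sketched as follows: form random Fourier measurements, build their empirical covariance, and read the model order off the largest eigenvalue gap. The frequency scale, number of measurements, and the gap rule below are simplified assumptions, not the paper's calibrated threshold:

```python
import numpy as np

def estimate_num_components(X, m=40, scale=1.0, rng=0):
    """Count mixture components from the spectral gap of the empirical
    covariance of random Fourier measurements phi(x) = exp(i W x)."""
    g = np.random.default_rng(rng)
    n, d = X.shape
    W = g.normal(scale=scale, size=(m, d))      # random frequency vectors
    Phi = np.exp(1j * X @ W.T)                  # (n, m) Fourier measurements
    C = Phi.conj().T @ Phi / n                  # empirical covariance (m, m)
    lam = np.linalg.eigvalsh(C)[::-1]           # real eigenvalues, descending
    gaps = lam[:-1] - lam[1:]
    return int(np.argmax(gaps[: m // 2])) + 1   # largest gap gives the order
```

For well-separated, low-variance components the covariance is approximately rank-k, so the k-th eigenvalue gap dominates.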
[LG-75] On the role of memorization in learned priors for geophysical inverse problems
链接: https://arxiv.org/abs/2603.19629
作者: Ali Siahkoohi,Davide Sabeddu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
Abstract:Learned priors based on deep generative models offer data-driven regularization for seismic inversion, but training them requires a dataset of representative subsurface models – a resource that is inherently scarce in geoscience applications. Since the training objective of most generative models can be cast as maximum likelihood on a finite dataset, any such model risks converging to the empirical distribution – effectively memorizing the training examples rather than learning the underlying geological distribution. We show that the posterior under such a memorized prior reduces to a reweighted empirical distribution – i.e., a likelihood-weighted lookup among the stored training examples. For diffusion models specifically, memorization yields a Gaussian mixture prior in closed form, and linearizing the forward operator around each training example gives a Gaussian mixture posterior whose components have widths and shifts governed by the local Jacobian. We validate these predictions on a stylized inverse problem and demonstrate the consequences of memorization through diffusion posterior sampling for full waveform inversion.
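The "likelihood-weighted lookup" posterior under a fully memorized prior admits a one-function sketch; Gaussian observation noise is assumed and the names are illustrative:

```python
import numpy as np

def memorized_posterior_weights(training_models, forward, data, noise_std):
    """Posterior under a memorized (empirical) prior: weights proportional
    to the Gaussian likelihood of each stored training example, i.e. a
    likelihood-weighted lookup. `forward` is the forward operator."""
    logw = np.array([-0.5 * np.sum((data - forward(m)) ** 2) / noise_std**2
                     for m in training_models])
    logw -= logw.max()          # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()
```

The posterior support never leaves the training set, which is exactly the failure mode the abstract analyzes.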
[LG-76] Learning to Bet for Horizon-Aware Anytime-Valid Testing
链接: https://arxiv.org/abs/2603.19551
作者: Ege Onur Taga,Samet Oymak,Shubhanshu Shekhar
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 21 pages, 27 figures
Abstract:We develop horizon-aware anytime-valid tests and confidence sequences for bounded means under a strict deadline N. Using the betting/e-process framework, we cast horizon-aware betting as a finite-horizon optimal control problem with state space (t, \log W_t), where t is the time and W_t is the test martingale value. We first show that in certain interior regions of the state space, policies that deviate significantly from Kelly betting are provably suboptimal, while Kelly betting reaches the threshold with high probability. We then identify sufficient conditions showing that outside this region, more aggressive betting than Kelly can be better if the bettor is behind schedule, and less aggressive can be better if the bettor is ahead. Taken together these results suggest a simple phase diagram in the (t, \log W_t) plane, delineating regions where Kelly, fractional Kelly, and aggressive betting may be preferable. Guided by this phase diagram, we introduce a Deep Reinforcement Learning approach based on a universal Deep Q-Network (DQN) agent that learns a single policy from synthetic experience and maps simple statistics of past observations to bets across horizons and null values. In limited-horizon experiments, the learned DQN policy yields state-of-the-art results.
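A minimal fixed-fraction test martingale illustrates the wealth process being controlled; the horizon-aware and DQN-learned policies in the paper adapt the bet fraction over time, which this sketch does not:

```python
def betting_eprocess(xs, m0, lam=0.5):
    """Fixed-fraction test martingale for H0: mean = m0, observations in
    [0, 1]. lam must lie in (-1/(1 - m0), 1/m0) so wealth stays positive;
    under H0 the wealth is a nonnegative martingale, and rejecting once it
    exceeds 1/alpha gives an anytime-valid level-alpha test by Ville's
    inequality."""
    wealth, path = 1.0, []
    for x in xs:
        wealth *= 1.0 + lam * (x - m0)  # bet a fixed fraction each round
        path.append(wealth)
    return path
```

With observations far from m0 the wealth grows geometrically and crosses the rejection threshold 1/alpha after a predictable number of rounds.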
[LG-77] Reinforcement-guided generative protein language models enable de novo design of highly diverse AAV capsids
链接: https://arxiv.org/abs/2603.19473
作者: Lucas Ferraz,Ana F. Rodrigues,Pedro Giesteira Cotovio,Mafalda Ventura,Gabriela Silva,Ana Sofia Coroadinha,Miguel Machuqueiro,Catia Pesquita
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Adeno-associated viral (AAV) vectors are widely used delivery platforms in gene therapy, and the design of improved capsids is key to expanding their therapeutic potential. A central challenge in AAV bioengineering, as in protein design more broadly, is the vast sequence design space relative to the scale of feasible experimental screening. Machine-guided generative approaches provide a powerful means of navigating this landscape and proposing novel protein sequences that satisfy functional constraints. Here, we develop a generative design framework based on protein language models and reinforcement learning to generate highly novel yet functionally plausible AAV capsids. A pretrained model was fine-tuned on experimentally validated capsid sequences to learn patterns associated with viability. Reinforcement learning was then used to guide sequence generation, with a reward function that jointly promoted predicted viability and sequence novelty, thereby enabling exploration beyond regions represented in the training data. Comparative analyses showed that fine-tuning alone produces sequences with high predicted viability but remains biased toward the training distribution, whereas reinforcement learning-guided generation reaches more distant regions of sequence space while maintaining high predicted viability. Finally, we propose a candidate selection strategy that integrates predicted viability, sequence novelty, and biophysical properties to prioritize variants for downstream evaluation. This work establishes a framework for the generative exploration of protein sequence space and advances the application of generative protein language models to AAV bioengineering.
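The joint reward can be sketched with a hypothetical novelty measure (normalized minimum Hamming distance to the training set); the weight and the measure itself are illustrative assumptions, not the paper's exact reward:

```python
def novelty(seq, training_set):
    """Novelty as the normalized minimum Hamming distance to the training
    set (equal-length sequences assumed; an illustrative choice)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)
    return min(hamming(seq, t) for t in training_set)

def capsid_reward(seq, viability_fn, training_set, w_nov=0.5):
    """Joint reward promoting predicted viability and sequence novelty;
    w_nov trades exploration against predicted function."""
    return viability_fn(seq) + w_nov * novelty(seq, training_set)
```

Raising w_nov pushes generation away from the training distribution, which is the mechanism the abstract credits for reaching more distant sequence space.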
[LG-78] Near-Equivalent Q-learning Policies for Dynamic Treatment Regimes
链接: https://arxiv.org/abs/2603.19440
作者: Sophia Yazzourh,Erica E.M. Moodie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 13 pages, 2 figures
Abstract:Precision medicine aims to tailor therapeutic decisions to individual patient characteristics. This objective is commonly formalized through dynamic treatment regimes, which use statistical and machine learning methods to derive sequential decision rules adapted to evolving clinical information. In most existing formulations, these approaches produce a single optimal treatment at each stage, leading to a unique decision sequence. However, in many clinical settings, several treatment options may yield similar expected outcomes, and focusing on a single optimal policy may conceal meaningful alternatives. We extend the Q-learning framework for retrospective data by introducing a worst-value tolerance criterion controlled by a hyperparameter \varepsilon, which specifies the maximum acceptable deviation from the optimal expected value. Rather than identifying a single optimal policy, the proposed approach constructs sets of \varepsilon-optimal policies whose performance remains within a controlled neighborhood of the optimum. This formulation shifts Q-learning from a vector-valued representation to a matrix-valued one, allowing multiple admissible value functions to coexist during backward recursion. The approach yields families of near-equivalent treatment strategies and explicitly identifies regions of treatment indifference where several decisions achieve comparable outcomes. We illustrate the framework in two settings: a single-stage problem highlighting indifference regions around the decision boundary, and a multi-stage decision process based on a simulated oncology model describing tumor size and treatment toxicity dynamics.
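The worst-value tolerance rule at a single stage can be sketched directly; the admissible-action set and the worst value carried backward are illustrative of, not identical to, the paper's matrix-valued recursion:

```python
import numpy as np

def near_optimal_actions(q_values, eps):
    """Worst-value tolerance rule: keep every treatment whose Q-value is
    within eps of the stage-optimal value."""
    q = np.asarray(q_values, dtype=float)
    return np.flatnonzero(q >= q.max() - eps)

def backward_step(q_values, eps):
    """One stage of a set-valued recursion: the eps-optimal action set and
    the worst value attained within it (the guarantee carried backward)."""
    acts = near_optimal_actions(q_values, eps)
    worst = float(min(q_values[a] for a in acts))
    return acts, worst
```

When two treatments sit inside the tolerance band, both are retained, which is exactly a region of treatment indifference.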
[LG-79] Subspace Projection Methods for Fast Spectral Embeddings of Evolving Graphs
链接: https://arxiv.org/abs/2603.19439
作者: Mohammad Eini,Abdullah Karaaslanli,Vassilis Kalantzis,Panagiotis A. Traganitis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Several graph data mining, signal processing, and machine learning downstream tasks rely on information related to the eigenvectors of the associated adjacency or Laplacian matrix. Classical eigendecomposition methods are powerful when the matrix remains static but cannot be applied to problems where the matrix entries are updated or the number of rows and columns increases frequently. Such scenarios occur routinely in graph analytics when the graph is changing dynamically and either edges and/or nodes are being added and removed. This paper puts forth a new algorithmic framework to update the eigenvectors associated with the leading eigenvalues of an initial adjacency or Laplacian matrix as the graph evolves dynamically. The proposed algorithm is based on Rayleigh-Ritz projections, in which the original eigenvalue problem is projected onto a restricted subspace which ideally encapsulates the invariant subspace associated with the sought eigenvectors. Following ideas from eigenvector perturbation analysis, we present a new methodology to build the projection subspace. The proposed framework features lower computational and memory complexity with respect to competitive alternatives while empirical results show strong qualitative performance, both in terms of eigenvector approximation and accuracy of downstream learning tasks of central node identification and node clustering.
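A bare-bones Rayleigh-Ritz update reads as follows; how the projection subspace V is built and augmented using perturbation analysis is the paper's contribution and is not shown here:

```python
import numpy as np

def rayleigh_ritz_update(A_new, V, k):
    """Rayleigh-Ritz sketch: project the updated matrix onto span(V), solve
    the small projected eigenproblem, and lift the leading Ritz pairs back
    as approximations to the updated eigenpairs."""
    Q, _ = np.linalg.qr(V)                # orthonormal basis of the subspace
    H = Q.T @ A_new @ Q                   # small projected problem
    evals, Y = np.linalg.eigh(H)          # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:k]     # keep the k leading Ritz pairs
    return evals[idx], Q @ Y[:, idx]      # Ritz values and Ritz vectors
```

If V captures the invariant subspace of the leading eigenvalues, the Ritz pairs are exact; otherwise accuracy degrades with the angle between V and that subspace.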
[LG-80] Pseudo-Labeling for Unsupervised Domain Adaptation with Kernel GLMs KR
链接: https://arxiv.org/abs/2603.19422
作者: Nathan Weill,Kaizheng Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 55 pages, 4 figures. Python solvers and experiment scripts are available at: this https URL
Abstract:We propose a principled framework for unsupervised domain adaptation under covariate shift in kernel Generalized Linear Models (GLMs), encompassing kernelized linear, logistic, and Poisson regression with ridge regularization. Our goal is to minimize prediction error in the target domain by leveraging labeled source data and unlabeled target data, despite differences in covariate distributions. We partition the labeled source data into two batches: one for training a family of candidate models, and the other for building an imputation model. This imputation model generates pseudo-labels for the target data, enabling robust model selection. We establish non-asymptotic excess-risk bounds that characterize adaptation performance through an “effective labeled sample size”, explicitly accounting for the unknown covariate shift. Experiments on synthetic and real datasets demonstrate consistent performance gains over source-only baselines.
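A toy version of the two-batch scheme for kernel ridge regression (one member of the kernel GLM family) might look like this; the RBF kernel, the candidate grid, and the imputation regularization `lam_imp` are assumptions for illustration:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    """Kernel ridge regression; returns a predictor closure."""
    K = rbf_kernel(X, X, gamma)
    coef = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Z: rbf_kernel(Z, X, gamma) @ coef

def select_by_pseudo_labels(X1, y1, X2, y2, Xt, lams, lam_imp=1e-3):
    """Two-batch scheme: batch 1 trains the candidate models over `lams`,
    batch 2 trains an imputation model whose predictions on the unlabeled
    target data Xt serve as pseudo-labels; the candidate with the smallest
    pseudo-labeled target risk is selected."""
    pseudo = kernel_ridge_fit(X2, y2, lam_imp)(Xt)
    risks = {lam: np.mean((kernel_ridge_fit(X1, y1, lam)(Xt) - pseudo) ** 2)
             for lam in lams}
    return min(risks, key=risks.get)
```

Because the risk is evaluated on target covariates, the selection automatically accounts for the covariate shift without requiring target labels.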
[LG-81] TuLaBM: Tumor-Biased Latent Bridge Matching for Contrast-Enhanced MRI Synthesis
链接: https://arxiv.org/abs/2603.19386
作者: Atharva Rege,Adinath Madhavrao Dukre,Numan Balci,Dwarikanath Mahapatra,Imran Razzak
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:
Abstract:Contrast-enhanced magnetic resonance imaging (CE-MRI) plays a crucial role in brain tumor assessment; however, its acquisition requires gadolinium-based contrast agents (GBCAs), which increase costs and raise safety concerns. Consequently, synthesizing CE-MRI from non-contrast MRI (NC-MRI) has emerged as a promising alternative. Early Generative Adversarial Network (GAN)-based approaches suffered from instability and mode collapse, while diffusion models, despite impressive synthesis quality, remain computationally expensive and often fail to faithfully reproduce critical tumor contrast patterns. To address these limitations, we propose Tumor-Biased Latent Bridge Matching (TuLaBM), which formulates NC-to-CE MRI translation as Brownian bridge transport between source and target distributions in a learned latent space, enabling efficient training and inference. To enhance tumor-region fidelity, we introduce a Tumor-Biased Attention Mechanism (TuBAM) that amplifies tumor-relevant latent features during bridge evolution, along with a boundary-aware loss that constrains tumor interfaces to improve margin sharpness. While bridge matching has been explored for medical image translation in pixel space, our latent formulation substantially reduces computational cost and inference time. Experiments on BraTS2023-GLI (BraSyn) and Cleveland Clinic (in-house) liver MRI dataset show that TuLaBM consistently outperforms state-of-the-art baselines on both whole-image and tumor-region metrics, generalizes effectively to unseen liver MRI data in zero-shot and fine-tuned settings, and achieves inference times under 0.097 seconds per image.
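The Brownian-bridge transport underlying latent bridge matching has a simple closed form per time point; the latent encoder/decoder and the learned drift are omitted in this sketch:

```python
import numpy as np

def brownian_bridge_sample(z0, z1, t, sigma=1.0, rng=None):
    """Sample the Brownian bridge between latents z0 (NC-MRI) and z1
    (CE-MRI) at time t in [0, 1]: the mean interpolates linearly and the
    variance sigma^2 * t * (1 - t) vanishes at both endpoints."""
    g = np.random.default_rng(rng)
    mean = (1.0 - t) * z0 + t * z1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * g.normal(size=z0.shape)
```

Pinning both endpoints is what makes the bridge suitable for paired translation: the process is anchored at the source latent and terminates exactly at the target latent.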
[LG-82] Mathematical Modeling of Cancer-Bacterial Therapy: Analysis and Numerical Simulation via Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2603.19326
作者: Ayoub Farkane,David Lassounon
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
*备注:
Abstract:Bacterial cancer therapy exploits anaerobic bacteria’s ability to target hypoxic tumor regions, yet the interactions among tumor growth, bacterial colonization, oxygen levels, immunosuppressive cytokines, and bacterial communication remain poorly quantified. We present a mathematical model of five coupled nonlinear reaction-diffusion equations in a two-dimensional tissue domain. We proved the global well-posedness of the model and identified its steady states to analyze stability. Furthermore, a physics-informed neural network (PINN) solves the system without a mesh and without requiring extensive data. It provides convergence guarantees by combining residual stability and Sobolev approximation error bounds. This results in an overall error rate of O(n^{-2} \ln^4(n) + N^{-1/2}), which depends on the network width n and the number of collocation points N. We conducted several numerical experiments, including predicting the tumor’s response to therapy. We also performed a sensitivity analysis of certain parameters. The results suggest that maintaining hypoxic regions in the tumor, or using bacteria that better tolerate oxygen, may be necessary for long-lasting tumor control.