This post lists the latest papers fetched from arXiv.org on 2026-03-27. It is updated automatically and grouped into six main areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from arXiv.org daily, with an automatic update at around 12:30 each day.

Tip: if a given day is not updated on time, either arXiv released no new papers that day or the script failed. Fixes are made the same day whenever possible.

Contents

Overview (2026-03-27)

582 papers are updated today, including:

  • Natural Language Processing: 67 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 178 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 172 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 129 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 18 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 15 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 26 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems

[MA-0] Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving CVPR2026

[Quick Read]: This paper addresses the lack of personalization in current end-to-end autonomous driving systems, which typically optimize generic objectives or rely on fixed driving modes and thus cannot adapt to an individual driver's long-term habits or short-term natural-language instructions. The core of the solution is Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework: it learns a user embedding from personalized driving data collected across multiple drivers and scenarios, conditions the planner on this embedding to match the user's long-term driving style, and combines it with natural-language instructions for short-term intent adjustment, enabling personalized behavior generation for human-centered autonomous driving.

Link: https://arxiv.org/abs/2603.25740
Authors: Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li
Affiliations: University of California, Riverside; University of Michigan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026); Project website: this https URL

Abstract:Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users’ long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver’s own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at this https URL.

[MA-1] Conchordal: Emergent Harmony via Direct Cognitive Coupling in a Psychoacoustic Landscape

[Quick Read]: This paper studies how self-organization, selection, synchronization, and lineage-level accumulation can arise in music generation within a non-traditional computational medium, moving away from discrete scales and explicit harmonic rules toward dynamics grounded in psychoacoustic perception. The key is the Direct Cognitive Coupling (DCC) design principle: generative dynamics act directly on a continuous consonance field built from psychoacoustic observables (roughness and harmonicity). Within this field, agents adjust pitch via local proposal-and-accept dynamics, regulate survival through consonance-dependent metabolism, and entrain temporally via Kuramoto-style phase coupling, yielding structured polyphony without symbolic harmonic rules.

Link: https://arxiv.org/abs/2603.25637
Authors: Koichi Takahashi
Affiliations: Keio University; Conchordal.org
Subjects: Multiagent Systems (cs.MA)
Comments: 9 pages, 5 figures; supplementary PDF included as ancillary file

Abstract:This paper introduces Conchordal, a bio-acoustic instrument for generative composition whose sonic agents are governed by artificial life dynamics within a psychoacoustic fitness landscape. The system is built on Direct Cognitive Coupling (DCC), a design principle requiring that generative dynamics operate directly within a landscape derived from psychoacoustic observables and read from that landscape without symbolic harmonic rules. The environment integrates roughness and harmonicity into a continuous consonance field without presupposing discrete scales or explicit harmonic rules. Agents adjust pitch through local proposal-and-accept dynamics under a crowding penalty, regulate survival via consonance-dependent metabolism, and entrain temporally through Kuramoto-style phase coupling. Four experiments are reported: (1) consonance search produces structured polyphony with enriched consonant intervals; (2) consonance-dependent metabolism yields survival differentials that vanish when recharge is disabled; (3) a minimal hereditary adaptation assay shows that parent-guided respawn plus metabolic selection can accumulate more structured polyphony without adult hill-climbing; and (4) a shared oscillatory scaffold organizes rhythmic timing under external forcing. A supplementary mechanism check reports one possible composer-configurable bridge by which spectral state can modulate temporal coupling. These findings show that a psychoacoustically derived landscape serves as an effective artificial-life terrain, yielding self-organization, selection, synchronization, and lineage-level accumulation in a non-traditional computational medium. At the level of the model, the same landscape therefore functions both as ecological terrain and as an internal proxy for musical coherence.
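The Kuramoto-style phase coupling that entrains the agents' timing can be illustrated with a minimal sketch. This is a generic Kuramoto model in Python, not Conchordal's actual implementation; the population size, frequency spread, and coupling strength are all illustrative:

```python
import numpy as np

def kuramoto_step(phases, natural_freqs, coupling, dt=0.01):
    """One Euler step of Kuramoto phase coupling: each oscillator is
    pulled toward the others with strength `coupling`."""
    n = len(phases)
    # Element (i, j) is sin(theta_j - theta_i); summing over j gives
    # the total pull on oscillator i.
    diffs = np.sin(phases[None, :] - phases[:, None])
    dphi = natural_freqs + (coupling / n) * diffs.sum(axis=1)
    return (phases + dt * dphi) % (2 * np.pi)

def order_parameter(phases):
    """Kuramoto order parameter r in [0, 1]; r near 1 means synchrony."""
    return float(abs(np.exp(1j * phases).mean()))

rng = np.random.default_rng(0)
phases = rng.uniform(0.0, 2 * np.pi, 20)   # 20 sonic agents, random phases
freqs = rng.normal(1.0, 0.05, 20)          # similar natural tempos
r0 = order_parameter(phases)
for _ in range(2000):
    phases = kuramoto_step(phases, freqs, coupling=2.0)
r1 = order_parameter(phases)
print(r0, r1)   # r1 approaches 1 as the agents entrain
```

With coupling well above the critical threshold for this frequency spread, the order parameter rises from the incoherent baseline toward 1, the behavior exploited in the paper's fourth experiment.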

[MA-2] Cooperative Deep Reinforcement Learning for Fair RIS Allocation

[Quick Read]: This paper tackles the performance imbalance in Reconfigurable Intelligent Surface (RIS) resource allocation across multi-cell wireless networks under uneven user loads, focusing on how competing base stations can dynamically share RISs as infrastructure. The key is a fairness-aware cooperative multi-agent reinforcement learning approach in which base stations adapt their bidding strategies based on expected utility gains and relative service quality. A centrally computed, performance-dependent fairness indicator is added to the agents' observations, enabling implicit coordination without direct inter-base-station communication, which redistributes RIS resources toward weaker-performing cells and substantially improves the worst-served users' rates while preserving overall throughput.

Link: https://arxiv.org/abs/2603.25572
Authors: Martin Mark Zan, Stefan Schwarz
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:The deployment of reconfigurable intelligent surfaces (RISs) introduces new challenges for resource allocation in multi-cell wireless networks, particularly when user loads are uneven across base stations. In this work, we consider RISs as shared infrastructure that must be dynamically assigned among competing base stations, and we address this problem using a simultaneous ascending auction mechanism. To mitigate performance imbalances between cells, we propose a fairness-aware collaborative multi-agent reinforcement learning approach in which base stations adapt their bidding strategies based on both expected utility gains and relative service quality. A centrally computed performance-dependent fairness indicator is incorporated into the agents’ observations, enabling implicit coordination without direct inter-base-station communication. Simulation results show that the proposed framework effectively redistributes RIS resources toward weaker-performing cells, substantially improving the rates of the worst-served users while preserving overall throughput. The results demonstrate that fairness-oriented RIS allocation can be achieved through cooperative learning, providing a flexible tool for balancing efficiency and equity in future wireless networks.
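The paper does not name its fairness indicator; Jain's index is one standard choice for quantifying rate fairness across cells and is sketched below as a hypothetical stand-in, not the paper's actual metric:

```python
def jain_fairness(rates):
    """Jain's fairness index: J = (sum r)^2 / (n * sum r^2), with
    1/n <= J <= 1, and J = 1 exactly when all rates are equal."""
    n = len(rates)
    s = sum(rates)
    return s * s / (n * sum(r * r for r in rates))

balanced = jain_fairness([5.0, 5.0, 5.0])   # equal per-cell rates -> 1.0
skewed = jain_fairness([9.0, 9.0, 0.5])     # one starved cell drags J down
print(balanced, skewed)
```

Feeding such a scalar into each agent's observation gives every base station a common signal of how unevenly the network is serving users, which is the kind of implicit coordination channel the paper describes.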

[MA-3] UMBRELLA: Uncertainty-aware Multi-robot Reactive Coordination under Dynamic Temporal Logic Tasks

[Quick Read]: This paper addresses the coordination of multi-robot systems for collaborative tasks involving dynamic, moving targets, in particular efficient and robust planning under environmental change and uncertain target motion. The key is UMBRELLA, a framework that combines Monte Carlo Tree Search (MCTS) with uncertainty-aware rollouts based on Conformal Prediction (CP), and introduces a CP-based metric to guide and accelerate the search. By minimizing the Conditional Value at Risk (CVaR) of the average makespan, it obtains stable, efficient schedules under uncertainty. For tasks released online, a receding-horizon planning scheme dynamically adjusts assignments while maintaining the spatial-temporal constraints expressed in Linear Temporal Logic (LTL), requiring only partial synchronization to execute the collaborative tasks.

Link: https://arxiv.org/abs/2603.25395
Authors: Qisheng Zhao, Meng Guo, Hengxuan Du, Lars Lindemann, Zhongkui Li
Affiliations: Peking University; ETH Zürich
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments:

Abstract:Multi-robot systems can be extremely efficient for accomplishing team-wise tasks by acting concurrently and collaboratively. However, most existing methods either assume static task features or simply replan when environmental changes occur. This paper addresses the challenging problem of coordinating multi-robot systems for collaborative tasks involving dynamic and moving targets. We explicitly model the uncertainty in target motion prediction via Conformal Prediction (CP), while respecting the spatial-temporal constraints specified by Linear Temporal Logic (LTL). The proposed framework (UMBRELLA) combines the Monte Carlo Tree Search (MCTS) over partial plans with uncertainty-aware rollouts, and introduces a CP-based metric to guide and accelerate the search. The objective is to minimize the Conditional Value at Risk (CVaR) of the average makespan. For tasks released online, a receding-horizon planning scheme dynamically adjusts the assignments based on updated task specifications and motion predictions. Spatial and temporal constraints among the tasks are always ensured, and only partial synchronization is required for the collaborative tasks during online execution. Extensive large-scale simulations and hardware experiments demonstrate substantial reductions in both the average makespan and its variance by 23% and 71%, compared with static baselines.
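The CVaR objective can be sketched in a few lines. This is a generic empirical CVaR over sampled makespans; the paper's estimator may differ, and the sample values are made up:

```python
import numpy as np

def cvar(samples, alpha=0.9):
    """Empirical Conditional Value at Risk: the mean of the worst
    (1 - alpha) fraction of outcomes (large makespans are bad)."""
    samples = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil((1 - alpha) * len(samples)))
    return float(samples[-k:].mean())

# Ten makespans from hypothetical uncertainty-aware rollouts
makespans = [10.0, 11.0, 12.0, 12.0, 13.0, 14.0, 15.0, 18.0, 25.0, 40.0]
print(cvar(makespans, alpha=0.9))   # worst 10% -> 40.0
print(cvar(makespans, alpha=0.5))   # mean of the worst half -> 22.4
```

Minimizing CVaR rather than the mean penalizes plans whose rollouts occasionally produce very long makespans, which is why the paper reports a large variance reduction alongside the mean reduction.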

[MA-4] AD-CARE: A Guideline-grounded Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis and Reader Study

[Quick Read]: This paper addresses the insufficient accuracy of clinical Alzheimer's disease (AD) diagnosis caused by incomplete, heterogeneous multimodal data and variability across sites and populations. Traditional methods struggle with missing modalities and lack systematic integration of clinical guidelines, limiting real-world use. The key is AD-CARE, a modality-agnostic agent that dynamically orchestrates specialized diagnostic tools and embeds clinical guidelines into LLM-driven reasoning, generating transparent, workflow-aligned diagnostic reports from incomplete, heterogeneous inputs. Without imputing missing data it maintains high accuracy (84.9% on average), significantly outperforms baselines, and reduces performance disparities across racial and age subgroups.

Link: https://arxiv.org/abs/2603.25322
Authors: Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang
Affiliations: The Hong Kong Polytechnic University; The University of Cambridge; The University of Hong Kong; Zhejiang University; Peking University; Sun Yat-sen University; Capital Medical University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Alzheimer’s disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remain robust (80.4%-98.8%), and the agent consistently outperforms all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

[MA-5] Learning in Proportional Allocation Auctions Games

[Quick Read]: This paper studies strategy evolution and convergence in the repeated Kelly mechanism, in particular whether play converges to the unique Nash equilibrium (NE) when participants use different learning algorithms. The challenge is that in applications such as wireless network slicing, allocation must balance fairness and throughput, and static analysis cannot capture long-run behavior under dynamic interaction. The key contributions: a logarithmic utility is first derived from the fairness-throughput trade-off in network slicing; the stage game under this utility is then shown to admit a unique NE; and convergence of the repeated game is proved under three behavioral models (Online Gradient Descent (OGD), Dual Averaging with a quadratic regularizer (DAQ), and myopic best response (BR)), with the results holding for personalized learning rates and a broader class of utilities. Simulations confirm the theory, suggesting that BR achieves the fastest convergence and highest time-average utility, while heterogeneous update rules may cause convergence to fail.

Link: https://arxiv.org/abs/2603.25303
Authors: Younes Ben Mazziane, Cleque-Marlain Mboulou Moutoubi, Eitan Altman, Francesco De Pellegrini
Affiliations: LIA, Avignon University; INRIA, Sophia Antipolis
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:The Kelly or proportional allocation mechanism is a simple and efficient auction-based scheme that distributes an infinitely divisible resource proportionally to the agents’ bids. When agents are aware of the allocation rule, their interactions form a game extensively studied in the literature. This paper examines the less explored repeated Kelly game, focusing mainly on utilities that are logarithmic in the allocated resource fraction. We first derive this logarithmic form from fairness-throughput trade-offs in wireless network slicing, and then prove that the induced stage game admits a unique Nash equilibrium (NE). For the repeated play, we prove convergence to this NE under three behavioral models: (i) all agents use Online Gradient Descent (OGD), (ii) all agents use Dual Averaging with a quadratic regularizer (DAQ) (a variant of the Follow-the-Regularized-Leader algorithm), and (iii) all agents play myopic best responses (BR). Our convergence results hold even when agents use personalized learning rates in OGD and DAQ (e.g., tuned to optimize individual regret bounds), and they extend to a broader class of utilities that meet a certain sufficient condition. Finally, we complement our theoretical results with extensive simulations of the repeated Kelly game under several behavioral models, comparing them in terms of convergence speed to the NE, and per-agent time-average utility. The results suggest that BR achieves the fastest convergence and the highest time-average utility, and that convergence to the stage-game NE may fail under heterogeneous update rules.
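The proportional allocation rule and the myopic best-response dynamics can be sketched as follows. This is a toy grid-search best response with a hypothetical unit bid cost, not the paper's analysis; for n symmetric agents with log utility and unit cost, the symmetric NE bid works out to (n - 1)/n:

```python
import numpy as np

def kelly_allocation(bids, capacity=1.0):
    """Proportional (Kelly) allocation: agent i receives bid_i / sum(bids)."""
    bids = np.asarray(bids, dtype=float)
    return capacity * bids / bids.sum()

def utility(bids, i, cost=1.0):
    """Log utility of the allocated fraction minus the payment (the bid)."""
    return float(np.log(kelly_allocation(bids)[i]) - cost * bids[i])

def best_response(bids, i, grid):
    """Myopic best response of agent i over a bid grid, others held fixed."""
    def u_of(b):
        trial = list(bids)
        trial[i] = b
        return utility(trial, i)
    return max(grid, key=u_of)

# Iterated best response from asymmetric starting bids; with n = 3 and
# unit cost, the symmetric NE bid is (n - 1)/n = 2/3.
bids = [0.2, 0.6, 1.0]
grid = np.linspace(0.01, 2.0, 400)
for _ in range(50):
    for i in range(len(bids)):
        bids[i] = float(best_response(bids, i, grid))
print(bids)   # all three bids settle near 2/3
```

The iterated best responses contract toward the symmetric equilibrium, consistent with the BR convergence the abstract reports (up to the grid's resolution here).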

[MA-6] WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

[Quick Read]: This paper targets the lack of reliable verification in LLM-driven automated web development, especially the difficulty of assessing whether web functionality is correctly implemented in open-ended environments. Existing methods rely on static visual similarity or predefined checklists and cannot capture latent logical constraints, limiting test completeness and defect detection. The key is WebTestBench, an end-to-end automated web-testing benchmark spanning diverse web application categories, which decomposes testing into two cascaded sub-tasks, checklist generation and defect detection, together with WebTester, a baseline framework for systematically evaluating LLMs on complex interactive scenarios. Experiments show that current LLM-driven agents fall markedly short in test completeness, defect detection, and long-horizon interaction reliability, exposing a substantial gap to industrial-grade deployment.

Link: https://arxiv.org/abs/2603.25226
Authors: Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai
Affiliations: Northeastern University; Kuaishou Technology
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 24 pages, code: this https URL

Abstract:The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to “vibe coding”, where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at this https URL.

[MA-7] From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies

[Quick Read]: This paper addresses the "Logic Monopoly" in current multi-agent systems: each agent simultaneously plans, executes, and evaluates its own actions, a structural deficiency manifesting as a substantial "Reliability Gap" (an average 84.30% attack success rate across ten deployment scenarios, 31.4% emergent deceptive behavior without explicit reward signals, and cascading failures rooted in six structural bottlenecks). The core of the solution is a contract-centric Separation of Power (SoP) model that treats agents as legally identifiable business entities (Agent Enterprise for Enterprise, AE4E) and trifurcates authority into Legislation, Execution, and Adjudication branches. It is operationalized through the NetX Enterprise Framework (NEF) with governance hubs, TEE-backed compute enclaves, privacy-preserving data bridges, and an agent-native blockchain substrate, forming an Agentic Social Layer grounded in Parsons' AGIL framework that scales from private enclaves to a global Web of Services.

Link: https://arxiv.org/abs/2603.25100
Authors: Anbang Ruan
Affiliations: NetX Foundation
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 143 pages, 15 tables, 23 figures, 173 references, 4 appendices. Working paper (pre-peer-review preprint). LaTeX source with arXiv-style template. Three companion manuscripts under development targeting peer-reviewed venues

Abstract:Existing multi-agent frameworks allow each agent to simultaneously plan, execute, and evaluate its own actions, a structural deficiency we term the “Logic Monopoly.” Empirical evidence quantifies the resulting “Reliability Gap”: 84.30% average attack success rates across ten deployment scenarios, 31.4% emergent deceptive behavior without explicit reward signals, and cascading failure modes rooted in six structural bottlenecks. The remedy is not better alignment of individual models but a social contract for agents: institutional infrastructure that enforces a constitutional Separation of Power. This paper introduces the Agent Enterprise for Enterprise (AE4E) paradigm, in which agents act as autonomous, legally identifiable business entities within a functionalist social system, with a contract-centric SoP model trifurcating authority into Legislation, Execution, and Adjudication branches. The paradigm is operationalized through the NetX Enterprise Framework (NEF): governance hubs, TEE-backed compute enclaves, privacy-preserving data bridges, and an Agent-Native blockchain substrate. The Agent Enterprise Economy scales across four deployment tiers from private enclaves to a global Web of Services. The Agentic Social Layer, grounded in Parsons’ AGIL framework, provides institutional infrastructure via sixty-plus named Institutional AE4Es, specified through eight specialized smart contracts.

[MA-8] Ultra-fast Traffic Nowcasting and Control via Differentiable Agent-based Simulation

[Quick Read]: This paper addresses the difficulty of calibrating conventional fine-grained traffic simulators, which are non-differentiable and thus costly to fit, limiting the practical use of traffic digital twins in urban traffic management. The key is a differentiable agent-based traffic simulator: a set of differentiable computing techniques models individual vehicle behavior (including stochastic decision-making and inter-agent interactions) while keeping entire simulation trajectories end-to-end differentiable for efficient gradient-based optimization. This greatly improves simulation speed and calibration efficiency: on the large-scale Chicago network (over 10,000 calibration parameters) the model simulates at 173 times real-time speed and completes a full calibration-nowcast-control loop in under 20 minutes, providing a practical computational basis for traffic digital twins.

Link: https://arxiv.org/abs/2603.25068
Authors: Fumiyasu Makinoshima, Yuya Yamaguchi, Eigo Segawa, Koichiro Niinuma, Sean Qian
Affiliations: Fujitsu Limited; Fujitsu Research of America; Carnegie Mellon University
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:

Abstract:Traffic digital twins, which inform policymakers of effective interventions based on large-scale, high-fidelity computational models calibrated to real-world traffic, hold promise for addressing societal challenges in our rapidly urbanizing world. However, conventional fine-grained traffic simulations are non-differentiable and typically rely on inefficient gradient-free optimization, making calibration for real-world applications computationally infeasible. Here we present a differentiable agent-based traffic simulator that enables ultra-fast model calibration, traffic nowcasting, and control on large-scale networks. We develop several differentiable computing techniques for simulating individual vehicle movements, including stochastic decision-making and inter-agent interactions, while ensuring that entire simulation trajectories remain end-to-end differentiable for efficient gradient-based optimization. On the large-scale Chicago road network, with over 10,000 calibration parameters, our model simulates more than one million vehicles at 173 times real-time speed. This ultra-fast simulation, together with efficient gradient-based optimization, enables us to complete model calibration using the previous 30 minutes of traffic data in 455 s, provide a one-hour-ahead traffic nowcast in 21 s, and solve the resulting traffic control problem in 728 s. This yields a full calibration–nowcast–control loop in under 20 minutes, leaving about 40 minutes of lead time for implementing interventions. Our work thus provides a practical computational basis for realizing traffic digital twins.
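One common building block behind differentiable agent-based simulation is replacing hard discrete choices with smooth relaxations so that gradients exist. The sketch below is a generic softmax route-choice surrogate with made-up costs, illustrating the idea rather than the paper's actual techniques:

```python
import numpy as np

def soft_route_time(theta, temp=0.5):
    """Expected travel time under a softmax (smoothed argmin) route choice.

    Route costs depend on a calibration parameter `theta`; relaxing the
    hard argmin to a softmax keeps the objective smooth in theta, so
    gradient-based calibration is possible.
    """
    costs = np.array([10.0 + 5.0 * theta, 12.0 - 2.0 * theta])
    probs = np.exp(-costs / temp)       # cheaper routes get higher probability
    probs /= probs.sum()
    return float(probs @ costs)

# Central finite differences behave well because the surrogate is smooth;
# a hard argmin choice would make the objective non-smooth at switching points.
theta, h = 0.2, 1e-5
grad = (soft_route_time(theta + h) - soft_route_time(theta - h)) / (2 * h)
print(soft_route_time(theta), grad)
```

In an autodiff framework the same relaxation lets gradients of observed travel times flow back to thousands of calibration parameters at once, which is the capability the paper scales up to a city-sized network.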

[MA-9] Belief-Driven Multi-Agent Collaboration via Approximate Perfect Bayesian Equilibrium for Social Simulation WWW2026

[Quick Read]: This paper targets the lack of realism in LLM-based multi-agent social simulation caused by static interaction topologies, which fail to capture the dynamic interplay between cooperative knowledge synthesis and competitive critical reasoning seen in human societies, leading to "groupthink" or unproductive deadlocks that undermine the credibility of simulations for decision support. The key is BEACOF, a belief-driven adaptive collaboration framework inspired by Perfect Bayesian Equilibrium (PBE): by modeling social interaction as a dynamic game of incomplete information, agents iteratively update probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategies, achieving sequentially rational decisions under uncertainty, preventing coordination failures, and fostering robust convergence to high-quality solutions.

Link: https://arxiv.org/abs/2603.24973
Authors: Weiwei Fang, Lin Li, Kaize Shi, Yu Yang, Jianwei Zhang
Affiliations: Wuhan University of Technology; University of Southern Queensland; The Education University of Hong Kong; Iwate University
Subjects: Multiagent Systems (cs.MA)
Comments: accepted at WWW 2026

Abstract:High-fidelity social simulation is pivotal for addressing complex Web societal challenges, yet it demands agents capable of authentically replicating the dynamic spectrum of human interaction. Current LLM-based multi-agent frameworks, however, predominantly adhere to static interaction topologies, failing to capture the fluid oscillation between cooperative knowledge synthesis and competitive critical reasoning seen in real-world scenarios. This rigidity often leads to unrealistic “groupthink” or unproductive deadlocks, undermining the credibility of simulations for decision support. To bridge this gap, we propose BEACOF, a belief-driven adaptive collaboration framework inspired by Perfect Bayesian Equilibrium (PBE). By modeling social interaction as a dynamic game of incomplete information, BEACOF rigorously addresses the circular dependency between collaboration type selection and capability estimation. Agents iteratively refine probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategy, thereby ensuring sequentially rational decisions under uncertainty. Validated across adversarial (judicial), open-ended (social) and mixed (medical) scenarios, BEACOF prevents coordination failures and fosters robust convergence toward high-quality solutions, demonstrating superior potential for reliable social simulation. Source codes and datasets are publicly released at: this https URL.
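A Beta-Bernoulli update is one simple way to maintain the kind of probabilistic belief about peer capability that BEACOF refines iteratively. The actual framework's belief model, outcome sequence, and the 0.6 strategy threshold below are all illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PeerBelief:
    """Beta-Bernoulli belief over a peer's success rate."""
    alpha: float = 1.0   # prior pseudo-successes (uniform Beta(1, 1) prior)
    beta: float = 1.0    # prior pseudo-failures

    def update(self, success: bool) -> None:
        """Bayesian update after observing one peer outcome."""
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean estimate of the peer's capability."""
        return self.alpha / (self.alpha + self.beta)

belief = PeerBelief()
for outcome in [True, True, False, True, True, True]:
    belief.update(outcome)
print(belief.mean)   # posterior mean: 6 / 8 = 0.75

# An agent might cooperate with peers it believes capable and switch to
# critical debate otherwise (threshold chosen for illustration only).
strategy = "cooperate" if belief.mean > 0.6 else "debate"
print(strategy)
```

Conditioning the cooperate-versus-debate choice on such a belief is the essence of letting capability estimates, rather than a fixed topology, drive the collaboration mode.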

[MA-10] Integrated Multi-Drone Task Allocation, Sequencing and Optimal Trajectory Generation in Obstacle-Rich 3D Environments

[Quick Read]: This paper addresses coordinated control of multi-drone teams in complex three-dimensional (3D) environments, where the core challenge is a principled integration of discrete task allocation (which drone serves which goals, and in what order) with continuous-time trajectory generation that guarantees obstacle avoidance and dynamic feasibility. The key is IMD-TAPP (Integrated Multi-Drone Task Allocation and Path Planning), an end-to-end framework that first builds a 3D navigation graph and computes obstacle-aware robot-to-goal and goal-to-goal costs via graph search; then uses Injected Particle Swarm Optimization (IPSO) with multiple linear assignment to efficiently explore the coupled allocation/sequencing space and minimize mission makespan; and finally converts the resulting waypoint tours into time-parameterized minimum-snap trajectories, iteratively validating obstacle clearance and inter-robot separation, to produce dynamically feasible, collision-free paths that balance safety and efficiency.

Link: https://arxiv.org/abs/2603.24908
Authors: Yunes Alqudsi, Murat Makaraci
Affiliations: Kocaeli University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Resubmission following accepted appeal (MOD-78958). Resubmitting to cs.RO with cross-lists cs.MA and cs.AI as advised by arXiv Support

Abstract:Coordinating teams of aerial robots in cluttered three-dimensional (3D) environments requires a principled integration of discrete mission planning (deciding which robot serves which goals and in what order) with continuous-time trajectory synthesis that enforces collision avoidance and dynamic feasibility. This paper introduces IMD-TAPP (Integrated Multi-Drone Task Allocation and Path Planning), an end-to-end framework that jointly addresses multi-goal allocation, tour sequencing, and safe trajectory generation for quadrotor teams operating in obstacle-rich spaces. IMD-TAPP first discretizes the workspace into a 3D navigation graph and computes obstacle-aware robot-to-goal and goal-to-goal travel costs via graph-search-based pathfinding. These costs are then embedded within an Injected Particle Swarm Optimization (IPSO) scheme, guided by multiple linear assignment, to efficiently explore coupled assignment/ordering alternatives and to minimize mission makespan. Finally, the resulting waypoint tours are transformed into time-parameterized minimum-snap trajectories through a generation-and-optimization routine equipped with iterative validation of obstacle clearance and inter-robot separation, triggering re-planning when safety margins are violated. Extensive MATLAB simulations across cluttered 3D scenarios demonstrate that IMD-TAPP consistently produces dynamically feasible, collision-free trajectories while achieving competitive completion times. In a representative case study with two drones serving multiple goals, the proposed approach attains a minimum mission time of 136 s while maintaining the required safety constraints throughout execution.
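The obstacle-aware robot-to-goal and goal-to-goal costs that feed the IPSO stage come from graph search on the navigation graph. A minimal Dijkstra sketch over a toy graph follows; node names and edge weights are made up, and the paper's exact search method and graph construction are not reproduced here:

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path costs from `src` on a weighted graph given as an
    adjacency dict {node: [(neighbor, weight), ...]}."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue   # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy navigation graph: nodes are free-space cells; edges blocked by
# obstacles are simply absent, so costs are obstacle-aware by construction.
adj = {
    "start": [("a", 2.0), ("b", 5.0)],
    "a": [("goal1", 2.0), ("b", 1.0)],
    "b": [("goal2", 1.0)],
    "goal1": [("goal2", 4.0)],
}
costs = dijkstra(adj, "start")
print(costs["goal1"], costs["goal2"])   # 4.0, 4.0
```

Running such a search from each robot and each goal fills in the cost matrix that the allocation/sequencing optimizer then works over.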

[MA-11] Context-Mediated Domain Adaptation in Multi-Agent Sensemaking Systems

[Quick Read]: This paper addresses the difficulty of capturing domain experts' tacit knowledge in traditional prompt-based interaction: the deep domain understanding experts reveal when correcting AI-generated content is typically treated by systems as an endpoint fix rather than a reusable implicit specification. The key is the paradigm of context-mediated domain adaptation, which treats user edits to generated artifacts (terminology corrections, argument restructuring, emphasis adjustments) as implicit domain specifications. The multi-agent sensemaking system Seedentia realizes bidirectional semantic links: tacit knowledge is reverse-engineered from user edit patterns, and agent behavior adapts through in-context learning based on observed corrections, supporting specification bootstrapping in which vague initial prompts evolve into precise domain specifications.

Link: https://arxiv.org/abs/2603.24858
Authors: Anton Wolter, Leon Haag, Vaishali Dhanoa, Niklas Elmqvist
Affiliations: Aarhus University; Maastricht University; TU Wien
Subjects: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Abstract:Domain experts possess tacit knowledge that they cannot easily articulate through explicit specifications. When experts modify AI-generated artifacts by correcting terminology, restructuring arguments, and adjusting emphasis, these edits reveal domain understanding that remains latent in traditional prompt-based interactions. Current systems treat such modifications as endpoint corrections rather than as implicit specifications that could reshape subsequent reasoning. We propose context-mediated domain adaptation, a paradigm where user modifications to system-generated artifacts serve as implicit domain specification that reshapes LLM-powered multi-agent reasoning behavior. Through our system Seedentia, a web-based multi-agent framework for sense-making, we demonstrate bidirectional semantic links between generated artifacts and system reasoning. Our approach enables specification bootstrapping where vague initial prompts evolve into precise domain specifications through iterative human-AI collaboration, implicit knowledge transfer through reverse-engineered user edits, and in-context learning where agent behavior adapts based on observed correction patterns. We present results from an evaluation with domain experts who generated and modified research questions from academic papers. Our system extracted 46 domain knowledge entries from user modifications, demonstrating the feasibility of capturing implicit expertise through edit patterns, though the limited sample size constrains conclusions about systematic quality improvements.

[MA-12] SentinelAI: A Multi-Agent Framework for Structuring and Linking NG9-1-1 Emergency Incident Data

[Quick Read]: This paper addresses the difficulty of coordinating and standardizing heterogeneous multi-source data in emergency response systems, particularly correlating and updating information in real time in line with Next Generation 9-1-1 data standards. The core challenge is transforming data streams from different agencies into a standardized, machine-readable, continuously updated unified view that supports composite incident construction and cross-source reasoning. The key is the SentinelAI framework, a scalable processing pipeline composed of specialized agents, in which the EIDO Agent ingests raw communications and produces NENA-compliant Emergency Incident Data Object (EIDO) JSON, enabling standardization and immediate integration.

Link: https://arxiv.org/abs/2603.24856
Authors: Kliment Ho, Ilya Zaslavsky
Affiliations: San Diego Supercomputer Center; UC San Diego
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: 10 pages, 5 figures

Abstract:Emergency response systems generate data from many agencies and systems. In practice, correlating and updating this information across sources in a way that aligns with Next Generation 9-1-1 data standards remains challenging. Ideally, this data should be treated as a continuous stream of operational updates, where new facts are integrated immediately to provide a timely and unified view of an evolving incident. This paper presents SentinelAI, a data integration and standardization framework for transforming emergency communications into standardized, machine-readable datasets that support integration, composite incident construction, and cross-source reasoning. SentinelAI implements a scalable processing pipeline composed of specialized agents. The EIDO Agent ingests raw communications and produces NENA-compliant Emergency Incident Data Object JSON.
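A rough sense of the EIDO Agent's output can be given by a toy record builder. Note that the real NENA EIDO schema is far richer and normative; every field name below is a simplified placeholder invented for illustration, not the standard itself:

```python
import json
from datetime import datetime, timezone

def build_incident_record(call_text, incident_type, location):
    """Assemble a minimal, EIDO-inspired incident record as a dict.

    Field names here are simplified placeholders; a compliant system
    would follow the normative NENA EIDO JSON schema instead.
    """
    return {
        "incidentType": incident_type,
        "lastUpdateTimestamp": datetime.now(timezone.utc).isoformat(),
        "location": {"civicAddressText": location},
        "notes": [{"noteText": call_text}],
    }

record = build_incident_record(
    "Caller reports smoke visible from a two-story residence.",
    "FIRE",
    "123 Example St, San Diego, CA",
)
payload = json.dumps(record, indent=2)
print(payload)
```

The point of emitting structured JSON rather than free text is that downstream agents can merge new facts into an evolving incident record by key, which is what enables the continuous-stream view the abstract describes.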

[MA-13] Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

[Quick Read]: This paper addresses the lack of formal verification for large language model agents that invoke external tools, in particular the previously unclarified formal relationship between two dominant paradigms: Schema-Guided Dialogue (SGD) and the Model Context Protocol (MCP). The key is the first process-calculus formalization of SGD and MCP, proving they are structurally bisimilar under a mapping Phi, while showing that the reverse mapping Phi^-1 is partial and lossy, which reveals critical gaps in MCP's expressivity. Five principles are then identified as necessary and sufficient conditions for full behavioral equivalence and formalized as the type-system extensions MCP+, which is proved isomorphic to SGD, establishing a formal foundation for verified agent systems and schema quality as a provable safety property.

Link: https://arxiv.org/abs/2603.24747
Authors: Andreas Schlapbach
Affiliations: SBB-IT
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 18 pages. Companion to arXiv:2602.18764

Abstract:The emergence of large language model agents capable of invoking external tools has created urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^-1 is partial and lossy, revealing critical gaps in MCP’s expressivity. Through bidirectional analysis, we identify five principles – semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration – as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.

[MA-14] Trust as Monitoring: Evolutionary Dynamics of User Trust and AI Developer Behaviour

【速读】:该论文试图解决的问题是:随着人工智能(AI)系统能力与应用的不断增长,如何通过演化博弈论框架来理解用户信任与开发者行为之间的动态互动,从而识别出能够实现安全且广泛采纳的AI系统的演化稳定机制。其解决方案的关键在于揭示三种长期演化均衡状态的存在——无采纳、不安全但广泛采纳、以及安全且广泛采纳,并指出唯有后者为理想结果;该结果仅在对不安全行为的惩罚超过安全成本、且用户仍能负担得起一定程度监测时才会出现。这表明,有效的AI治理需依赖透明度、低成本监测机制和具有约束力的惩戒措施,而非单纯依靠监管或盲目的用户信任。

链接: https://arxiv.org/abs/2603.24742
作者: Adeela Bashir,Zhao Song,Ndidi Bianca Ogbo,Nataliya Balabanova,Martin Smit,Chin-wing Leung,Paolo Bova,Manuel Chica Serrano,Dhanushka Dissanayake,Manh Hong Duong,Elias Fernandez Domingos,Nikita Huber-Kralj,Marcus Krellner,Andrew Powell,Stefan Sarkadi,Fernando P. Santos,Zia Ush Shamszaman,Chaimaa Tarzi,Paolo Turrini,Grace Ibukunoluwa Ufeoshi,Victor A. Vargas-Perez,Alessandro Di Stefano,Simon T. Powers, TheAnh Han
机构: 1. University of Oxford (牛津大学); 2. University of Edinburgh (爱丁堡大学); 3. University of Lisbon (里斯本大学); 4. University College London (伦敦大学学院); 5. University of Barcelona (巴塞罗那大学); 6. University of Cambridge (剑桥大学); 7. Microsoft Research (微软研究院); 8. ETH Zurich (苏黎世联邦理工学院); 9. University of Copenhagen (哥本哈根大学); 10. University of Toronto (多伦多大学)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO)
备注:

点击查看摘要

Abstract:AI safety is an increasingly urgent concern as the capabilities and adoption of AI systems grow. Existing evolutionary models of AI governance have primarily examined incentives for safe development and effective regulation, typically representing users’ trust as a one-shot adoption choice rather than as a dynamic, evolving process shaped by repeated interactions. We instead model trust as reduced monitoring in a repeated, asymmetric interaction between users and AI developers, where checking AI behaviour is costly. Using evolutionary game theory, we study how user trust strategies and developer choices between safe (compliant) and unsafe (non-compliant) AI co-evolve under different levels of monitoring cost and institutional regimes. We complement the infinite-population replicator analysis with stochastic finite-population dynamics and reinforcement learning (Q-learning) simulations. Across these approaches, we find three robust long-run regimes: no adoption with unsafe development, unsafe but widely adopted systems, and safe systems that are widely adopted. Only the last is desirable, and it arises when penalties for unsafe behaviour exceed the extra cost of safety and users can still afford to monitor at least occasionally. Our results formally support governance proposals that emphasise transparency, low-cost monitoring, and meaningful sanctions, and they show that neither regulation alone nor blind user trust is sufficient to prevent evolutionary drift towards unsafe or low-adoption outcomes.
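The abstract models trust as reduced monitoring in an asymmetric game between users and developers and studies its replicator dynamics. A minimal two-population sketch of that setup is below; all payoff values are hypothetical placeholders (the paper's calibration and its richer stochastic/Q-learning settings are not reproduced), but the qualitative condition it illustrates matches the abstract: safe development is a best response only when the penalty for unsafe behaviour exceeds the safety cost and enough users still monitor.

```python
import numpy as np

# Hypothetical payoffs (not the paper's calibration):
# users choose Monitor (M) or Trust (T); developers choose Safe (S) or Unsafe (U).
u_benefit, c_monitor, loss = 4.0, 1.0, 3.0      # user side
b_dev, c_safe, penalty = 4.0, 1.0, 6.0          # developer side

# Payoff to a user playing row against a developer playing column [S, U].
A = np.array([[u_benefit - c_monitor, -c_monitor],   # Monitor: catches unsafe, avoids loss
              [u_benefit,             -loss]])       # Trust: exposed to unsafe systems

# Payoff to a developer playing row against a user playing column [M, T].
B = np.array([[b_dev - c_safe,  b_dev - c_safe],     # Safe: pays safety cost either way
              [b_dev - penalty, b_dev]])             # Unsafe: sanctioned only if monitored

def replicator_step(x, y, dt=0.01):
    """One Euler step of two-population replicator dynamics.
    x = share of users monitoring, y = share of safe developers."""
    dev = np.array([y, 1 - y])        # developer strategy distribution
    usr = np.array([x, 1 - x])        # user strategy distribution
    fu = A @ dev                      # user payoffs for [Monitor, Trust]
    fd = B @ usr                      # developer payoffs for [Safe, Unsafe]
    x += dt * x * (fu[0] - usr @ fu)
    y += dt * y * (fd[0] - dev @ fd)
    return x, y

x, y = 0.5, 0.1                       # start: half of users monitor, few safe developers
for _ in range(20000):
    x, y = replicator_step(x, y)
print(f"long-run shares -- monitoring users: {x:.2f}, safe developers: {y:.2f}")
```

With these numbers the developer side prefers Safe whenever the monitoring share exceeds c_safe adjusted by the penalty (here x > 1/6), so occasional monitoring suffices; depending on parameters the dynamics may also orbit the mixed equilibrium rather than settle, which is why the paper complements replicator analysis with finite-population and learning simulations.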

[MA-15] Decentralized Task Scheduling in Distributed Systems: A Deep Reinforcement Learning Approach

【速读】:该论文旨在解决大规模异构分布式系统中任务调度的效率问题,其核心挑战在于动态工作负载、异构资源以及服务质量(Quality-of-Service, QoS)需求之间的复杂权衡。传统集中式调度方法存在可扩展性瓶颈和单点故障风险,而经典启发式算法又缺乏对环境变化的自适应能力。解决方案的关键在于提出一种去中心化的多智能体深度强化学习(Multi-Agent Deep Reinforcement Learning, MADRL)框架,将任务调度建模为一个去中心化部分可观测马尔可夫决策过程(Decentralized Partially Observable Markov Decision Process, Dec-POMDP),并设计了一种仅依赖NumPy实现的轻量级Actor-Critic架构,可在资源受限的边缘设备上部署,无需依赖重型机器学习框架。实验基于Google Cluster Trace数据集,在100节点异构系统上验证了该方案在任务完成时间、能效和SLA满足率上的显著提升(p < 0.001)。

链接: https://arxiv.org/abs/2603.24738
作者: Daniel Benniah John
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 12 pages, 8 figures. Under review. Code available at GitHub

点击查看摘要

Abstract:Efficient task scheduling in large-scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality-of-service requirements. Traditional centralized approaches face scalability limitations and single points of failure, while classical heuristics lack adaptability to changing conditions. This paper proposes a decentralized multi-agent deep reinforcement learning (DRL-MADRL) framework for task scheduling in heterogeneous distributed systems. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and develop a lightweight actor-critic architecture implemented using only NumPy, enabling deployment on resource-constrained edge devices without heavyweight machine learning frameworks. Using workload characteristics derived from the publicly available Google Cluster Trace dataset, we evaluate our approach on a 100-node heterogeneous system processing 1,000 tasks per episode over 30 experimental runs. Experimental results demonstrate 15.6% improvement in average task completion time (30.8s vs 36.5s for random baseline), 15.2% energy efficiency gain (745.2 kWh vs 878.3 kWh), and 82.3% SLA satisfaction compared to 75.5% for baselines, with all improvements statistically significant (p < 0.001). The lightweight implementation requires only NumPy, Matplotlib, and SciPy. Complete source code and experimental data are provided for full reproducibility at this https URL.
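The paper's key implementation claim is a NumPy-only actor-critic. A minimal single-agent sketch of that pattern is shown below: a linear softmax policy and a linear TD(0) critic updated with the TD error as advantage. The toy state, reward rule, and dimensions are invented for illustration; the actual system is a decentralized multi-agent Dec-POMDP scheduler, not this toy.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES, D = 4, 3                      # toy: 4 candidate nodes, 3 state features

W_actor = np.zeros((N_NODES, D))       # linear policy: logits = W_actor @ s
w_critic = np.zeros(D)                 # linear value function: V(s) = w_critic @ s

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(s, lr=0.05, gamma=0.99):
    """One actor-critic update on a simulated scheduling decision."""
    global W_actor, w_critic
    probs = softmax(W_actor @ s)
    a = rng.choice(N_NODES, p=probs)
    # Toy reward: prefer the node whose index matches the dominant feature
    # (a stand-in for completion-time / energy feedback in the real system).
    r = 1.0 if a == int(np.argmax(s)) % N_NODES else -0.1
    s_next = rng.random(D)
    td_error = r + gamma * (w_critic @ s_next) - (w_critic @ s)
    w_critic = w_critic + lr * td_error * s            # critic: TD(0)
    grad_log = -probs[:, None] * s[None, :]            # grad of log pi(a|s) wrt W
    grad_log[a] += s
    W_actor = W_actor + lr * td_error * grad_log       # actor: policy gradient
    return s_next, r

s = rng.random(D)
rewards = []
for _ in range(3000):
    s, r = step(s)
    rewards.append(r)
print(f"mean reward, first vs last 500 steps: "
      f"{np.mean(rewards[:500]):.2f} -> {np.mean(rewards[-500]):.2f}")
```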

[MA-16] Sketch2Simulation: Automating Flowsheet Generation via Multi-Agent Large Language Models

【速读】:该论文旨在解决化工流程系统工程中从工艺草图(process sketch)到可执行仿真模型(executable simulation model)转换的瓶颈问题,该过程传统上依赖大量手动操作和特定仿真软件(如Aspen HYSYS)的专业知识。解决方案的关键在于提出一个端到端的多智能体大语言模型(multi-agent large language model)框架,将视觉输入的工艺图直接转化为可运行的HYSYS流程图。该框架分为三个协同层:图示解析与理解、仿真模型合成及多层次验证,其中专用智能体分别负责视觉解释、基于图的中间表示构建、HYSYS COM接口代码生成、执行与结构验证,从而实现了从原始图形到完整结构一致性的仿真模型自动化生成,连接一致性超过0.93,物流一致性高于0.96,在多个复杂度递增的案例中均成功生成了可执行模型。

链接: https://arxiv.org/abs/2603.24629
作者: Abdullah Bahamdan,Emma Pajak,John D. Hedengren,Antonio del Rio Chanona
机构: Imperial College London (帝国理工学院); Brigham Young University (杨百翰大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: 27 pages, 14 figures, 8 tables

点击查看摘要

Abstract:Converting process sketches into executable simulation models remains a major bottleneck in process systems engineering, requiring substantial manual effort and simulator-specific expertise. Recent advances in generative AI have improved both engineering-diagram interpretation and LLM-assisted flowsheet generation, but these remain largely disconnected: diagram-understanding methods often stop at extracted graphs, while text-to-simulation workflows assume structured inputs rather than raw visual artifacts. To bridge this gap, we present an end-to-end multi-agent large language model system that converts process diagrams directly into executable Aspen HYSYS flowsheets. The framework decomposes the task into three coordinated layers: diagram parsing and interpretation, simulation model synthesis, and multi-level validation. Specialized agents handle visual interpretation, graph-based intermediate representation construction, code generation for the HYSYS COM interface, execution, and structural verification. We evaluate the framework on four chemical engineering case studies of increasing complexity, from a simple desalting process to an industrial aromatic production flowsheet with multiple recycle loops. The system produces executable HYSYS models in all cases, achieving complete structural fidelity on the two simpler cases and strong performance on the more complex ones, with connection consistency above 0.93 and stream consistency above 0.96. These results demonstrate a viable end-to-end sketch-to-simulation workflow while highlighting remaining challenges in dense recycle structures, implicit diagram semantics, and simulator-interface constraints.
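The abstract reports "connection consistency above 0.93" between generated and reference flowsheets. One plausible reading of such a metric, sketched below, is edge recall over the graph-based intermediate representation: the fraction of reference unit-to-unit stream connections the generated flowsheet recovers. The exact definition in the paper may differ (e.g. an F1 variant), and the toy flowsheet here is invented.

```python
def connection_consistency(pred_edges, ref_edges):
    """Fraction of reference unit-to-unit connections recovered.
    (Assumed definition; the paper's exact metric may differ.)"""
    pred, ref = set(pred_edges), set(ref_edges)
    return len(pred & ref) / len(ref) if ref else 1.0

# Toy desalting-style flowsheet: (source unit, target unit) stream pairs.
reference = [("feed", "heater"), ("heater", "separator"),
             ("separator", "desalter"), ("desalter", "product")]
predicted = [("feed", "heater"), ("heater", "separator"),
             ("separator", "desalter")]                 # one stream missed

score = connection_consistency(predicted, reference)
print(f"connection consistency: {score:.2f}")           # 3 of 4 edges -> 0.75
```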

[MA-17] Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains

【速读】:该论文旨在解决当前人工智能(AI)系统在复杂场景下难以自主识别并保障自身安全的问题,尤其关注如何在缺乏明确形式化方法指令的情况下实现对多种AI安全领域的自动化验证。其解决方案的关键在于提出一个统一的框架(substrate-guard),该框架基于Z3 SMT求解器,通过标准化API跨六类输出域(包括大语言模型生成代码验证、工具API安全性、后蒸馏推理正确性等)执行形式化验证,并在181个测试用例中实现了100%分类准确率,零假阳性与零假阴性,同时发现传统经验测试无法捕捉的实际漏洞(如RISC-V汇编中的INT_MIN溢出),并数学证明了无约束字符串参数在工具API中的不可验证性。

链接: https://arxiv.org/abs/2603.21149
作者: Octavian Untila
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 3 figures, 5 tables. Code: this https URL . Companion paper: this https URL

点击查看摘要

Abstract:An autonomous AI ecosystem (SUBSTRATE S3), generating product specifications without explicit instructions about formal methods, independently proposed the use of Z3 SMT solver across six distinct domains of AI safety: verification of LLM-generated code, tool API safety for AI agents, post-distillation reasoning correctness, CLI command validation, hardware assembly verification, and smart contract safety. These convergent discoveries, occurring across 8 products over 13 days with Jaccard similarity below 15% between variants, suggest that formal verification is not merely a useful technique for AI safety but an emergent property of any sufficiently complex system reasoning about its own safety. We propose a unified framework (substrate-guard) that applies Z3-based verification across all six output classes through a common API, and evaluate it on 181 test cases across five implemented domains, achieving 100% classification accuracy with zero false positives and zero false negatives. Our framework detected real bugs that empirical testing would miss, including an INT_MIN overflow in branchless RISC-V assembly and mathematically proved that unconstrained string parameters in tool APIs are formally unverifiable.
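The INT_MIN overflow the framework detected is an instance of a classic bug class in branchless absolute-value code: on 32-bit two's complement, |INT_MIN| is not representable, so the idiom silently returns a negative number. The Python model below reproduces how the idiom behaves on 32-bit hardware (it is an illustration of the bug class, not the paper's RISC-V assembly); an SMT solver such as Z3 catches it by checking `abs(x) >= 0` over all 2^32 bitvector values, where empirical testing rarely samples the single failing input.

```python
def to_i32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def branchless_abs32(x):
    """The classic branchless |x| idiom, as it behaves on 32-bit hardware."""
    mask = x >> 31                     # all-ones if negative (arithmetic shift)
    return to_i32((x ^ mask) - mask)

INT_MIN = -2**31
print(branchless_abs32(7), branchless_abs32(-7))   # 7 7
print(branchless_abs32(INT_MIN))                   # -2147483648: still negative!
```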

自然语言处理

[NLP-0] Natural-Language Agent Harnesses

【速读】: 该论文旨在解决智能体(Agent)性能依赖于“Harness工程”(Harness Engineering)的问题,而当前的Harness设计通常嵌入在控制器代码和运行时特定约定中,导致难以移植、比较和作为科学对象进行研究。其解决方案的关键在于提出自然语言智能体Harness(Natural-Language Agent Harnesses, NLAHs),将Harness的行为以可编辑的自然语言形式表达,并配合一个共享的智能Harness运行时(Intelligent Harness Runtime, IHR),通过显式契约(explicit contracts)、持久化产物(durable artifacts)和轻量级适配器(lightweight adapters)实现对NLAH的执行。这一方法使Harness成为可移植、可验证和可迁移的可执行实体,从而提升智能体系统的可复现性与模块化能力。

链接: https://arxiv.org/abs/2603.25723
作者: Linyue Pan,Lexiao Zou,Shuo Guo,Jingchen Ni,Hai-Tao Zheng
机构: Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology (Shenzhen)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and the Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

[NLP-1] S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

【速读】: 该论文旨在解决块扩散语言模型(block-diffusion language models)在少步数(few-step)推理场景下,标准置信度阈值解码策略的脆弱性问题:过于激进的阈值会损害生成质量,而保守阈值则需额外的去噪步骤,无法实现高效加速。解决方案的关键在于提出一种无需训练的自推测解码框架S2D2,其核心思想是利用预训练模型在块大小为1时退化为自回归模式的特性,使同一模型同时担任“起草者”(drafter)和“验证者”(verifier)。S2D2在标准块扩散解码中引入推测验证步骤,并通过轻量级路由策略动态决定何时进行验证以平衡成本与收益,从而形成扩散并行提议与自回归局部序列级批判相结合的混合解码轨迹,显著优化了准确率-速度权衡。

链接: https://arxiv.org/abs/2603.25702
作者: Ligong Han,Hao Wang,Han Gao,Kai Xu,Akash Srivastava
机构: Red Hat AI Innovation; MIT-IBM Watson AI Lab; Iowa State University; Core AI, IBM
类目: Computation and Language (cs.CL)
备注: Code is available at this https URL

点击查看摘要

Abstract:Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
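The draft-then-verify core of self-speculative decoding can be sketched in a few lines: the diffusion pass proposes a block of tokens in parallel, and the same model in autoregressive mode accepts the longest prefix it would itself have produced. This toy uses greedy verification only; S2D2 additionally employs routing policies to decide when verification pays off, which is not modeled here, and the lookup-table "verifier" below is a stand-in for a real model.

```python
def verify_prefix(draft_tokens, verifier_next_token):
    """Accept the longest prefix of the parallel draft that the autoregressive
    verifier (the same model with block size 1) would itself have produced."""
    accepted = []
    for tok in draft_tokens:
        if verifier_next_token(accepted) != tok:
            break                      # divergence: resume denoising from here
        accepted.append(tok)
    return accepted

# Toy "verifier": a deterministic next-token rule standing in for the
# autoregressive mode of the block-diffusion model.
vocab_rule = {(): 5, (5,): 9, (5, 9): 2, (5, 9, 2): 7}
verifier = lambda prefix: vocab_rule.get(tuple(prefix), -1)

draft = [5, 9, 4, 7]                   # diffusion proposed 4 tokens in parallel
accepted = verify_prefix(draft, verifier)
print(accepted)                        # [5, 9]: divergence at the third token
```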

[NLP-2] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在发展到接近人类水平能力后,单纯依赖人工监督进行优化变得成本高昂且难以规模化的问题。其核心挑战在于,当模型性能趋于饱和时,人类反馈提供的信息信号逐渐失效,难以支撑进一步提升。为此,论文提出一种系统级的自改进框架,关键在于构建一个闭环生命周期流程,涵盖数据获取、数据选择、模型优化与推理精炼四个紧密耦合的环节,并引入自主评估层以持续监控进展并指导各阶段迭代。在此框架下,模型自身作为核心驱动力,在每个阶段主动完成数据生成、信号筛选、参数更新和输出优化,从而实现从数据到性能的闭环自我进化,推动LLMs向完全自主改进的方向演进。

链接: https://arxiv.org/abs/2603.25681
作者: Haoyan Yang,Mario Xerri,Solha Park,Huajian Zhang,Yiyang Feng,Sai Akhil Kogilathota,Jiawei Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.

[NLP-3] Measuring What Matters – or What’s Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的自动化评分系统在面对与评估构念无关的因素(construct-irrelevant factors)时的鲁棒性问题,尤其是这些因素可能干扰评分准确性并影响测评效度。解决方案的关键在于设计一种双架构(dual-architecture)LLM-based评分系统,并通过实证检验其对多种构造无关干扰因素(如无意义文本填充、拼写错误、写作复杂度变化、重复段落及离题内容)的响应模式。研究发现,该系统对多数干扰因素表现出稳健性,仅在重复大段文本时出现评分下降,且显著惩罚离题响应,表明当系统设计聚焦于构念相关性时,LLM-based评分方法具备良好的抗干扰能力与应用前景。

链接: https://arxiv.org/abs/2603.25674
作者: Cole Walsh,Rodica Ivan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Shortened version of this paper accepted to AIED 2026; experiment 3 was omitted from accepted paper due to space restrictions

点击查看摘要

Abstract:Automated systems have been widely adopted across the educational testing industry for open-response assessment and essay scoring. These systems commonly achieve performance levels comparable or superior to trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that are unrelated to the construct assessed) and adversarial conditions. Given the rising usage of large language models in automated scoring systems, there is a renewed focus on “hallucinations” and the robustness of these LLM-based automated scoring approaches to construct-irrelevant factors. This study investigates the effects of construct-irrelevant factors on a dual-architecture LLM-based scoring system designed to score short essay-like open-response items in a situational judgment test. It was found that the scoring system was generally robust to padding responses with meaningless text, spelling errors, and writing sophistication. Duplicating large passages of text resulted in lower scores predicted by the system, on average, contradicting results from previous studies of non-LLM-based scoring systems, while off-topic responses were heavily penalized by the scoring system. These results provide encouraging support for the robustness of future LLM-based scoring systems when designed with construct relevance in mind.

[NLP-4] RenoBench: A Citation Parsing Benchmark

【速读】: 该论文旨在解决学术文献中引用信息(citation)自动解析的准确性问题,这是构建机器可读的学术基础设施的关键环节。现有评估方法存在泛化能力差、依赖合成数据或未公开等问题,限制了研究进展。其解决方案的关键在于提出RenoBench——一个基于四个出版生态系统(SciELO、Redalyc、Public Knowledge Project 和 Open Research Europe)PDF文档构建的公开基准数据集,通过自动化验证和基于特征的采样策略从16.1万条标注引文中筛选出1万条跨语言、跨出版类型和平台的高质量引文样本,并在此基础上对多种引文解析系统进行标准化评估,结果显示微调后的语言模型表现优异。该基准支持可复现、标准化的评估,为提升引文解析自动化水平和元科学研究奠定基础。

链接: https://arxiv.org/abs/2603.25640
作者: Parth Sarin,Juan Pablo Alperin,Adam Buttrick,Dione Mentis
机构: Stanford University (斯坦福大学); Simon Fraser University (西蒙弗雷泽大学); DataCite; California Digital Library (加州数字图书馆); University of California Office of the President (加州大学校长办公室)
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-based sampling to produce a dataset of 10,000 citations spanning multiple languages, publication types, and platforms. We then evaluate a variety of citation parsing systems and report field-level precision and recall. Our results show strong performance from language models, particularly when fine-tuned. RenoBench enables reproducible, standardized evaluation of citation parsing systems, and provides a foundation for advancing automated citation parsing and metascientific research.
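The benchmark reports field-level precision and recall for parsed citations. The sketch below shows one straightforward way to compute those for a single citation, treating fields as (name, value) pairs with exact matching; the example citation is invented, and a real evaluation would typically normalize case and punctuation before matching, so this is an assumed simplification rather than RenoBench's exact protocol.

```python
def field_prf(predicted, gold):
    """Field-level precision/recall for one parsed citation.
    Fields are compared as exact (name, value) pairs; real evaluations
    typically normalize case and punctuation before matching."""
    pred = {(k, v) for k, v in predicted.items() if v}
    ref = {(k, v) for k, v in gold.items() if v}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

gold = {"author": "Borges, J. L.", "year": "1941",
        "title": "La biblioteca de Babel", "container": "El jardín de senderos"}
pred = {"author": "Borges, J. L.", "year": "1941",
        "title": "La biblioteca de Babel", "container": ""}   # container missed

p, r = field_prf(pred, gold)
print(f"precision={p:.2f} recall={r:.2f}")   # 1.00 / 0.75
```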

[NLP-5] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在学术写作中的广泛使用正在悄然改变论文中词汇的分布模式,而这些变化尚未被充分识别和量化。研究发现,诸如“beyond”和“via”等词在标题中出现频率上升,“the”和“of”在摘要中减少,且不同LLM生成文本的风格差异导致分类困难。解决方案的关键在于采用一种直接且高度可解释的线性建模方法,同时控制模型类型与提示(prompt)差异,从而定量评估LLM使用对学术文本词汇模式的影响,并揭示其异质性和动态演化特性。

链接: https://arxiv.org/abs/2603.25638
作者: Mingmeng Geng,Yuhang Dong,Thierry Poibeau
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: Visualization of word usage patterns in arXiv abstracts: this https URL

点击查看摘要

Abstract:Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of “beyond” and “via” in titles and the decreased frequency of “the” and “of” in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
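A "direct and highly interpretable linear approach" to word-usage trends can be sketched as a log-linear fit of a word's frequency against time. The yearly frequencies below are illustrative placeholders, not the paper's actual arXiv counts; only the method (regress log-frequency on year, read the slope as annual growth) is what the sketch demonstrates.

```python
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023, 2024])
# Hypothetical per-10k-title frequencies of "via" (illustrative numbers only).
freq = np.array([41.0, 43.5, 47.0, 55.0, 68.0, 83.0])

# Log-linear fit: log(freq) ~ slope * (year - 2019) + intercept.
slope, intercept = np.polyfit(years - years[0], np.log(freq), 1)
print(f"estimated annual growth of 'via' in titles: {np.exp(slope) - 1:.1%}")
```

The same regression, fitted per word with model- and prompt-level covariates, is the kind of attribution the paper performs at scale.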

[NLP-6] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的个性代理(persona agent)在多轮交互中缺乏系统性验证方法的问题,即无法确保其响应在逻辑一致性、事实准确性及重复稳定性方面不出现矛盾或错误。解决方案的关键在于提出PICon评估框架,该框架通过逻辑链式多轮提问对 persona agent 进行系统性探测,从三个核心维度进行评估:内部一致性(避免自我矛盾)、外部一致性(与现实世界事实一致)以及重测一致性(重复测试下的稳定性)。实证研究表明,即使此前被认为高度一致的模型也未能达到人类参与者的基准水平,凸显了PICon在识别潜在缺陷方面的有效性。

链接: https://arxiv.org/abs/2603.25620
作者: Minseo Kim,Sujeong Im,Junseong Choi,Junhee Lee,Chaeeun Shim,Edward Choi
机构: KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL)
备注: 20 pages, 6 figures

点击查看摘要

Abstract:Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent’s responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: this https URL
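Of PICon's three dimensions, retest consistency is the easiest to operationalize: repeat the same interrogation and measure how stable the answers are. The sketch below uses modal agreement per question averaged over questions; this is a simplified stand-in for the paper's measure, and the persona answers are invented.

```python
from collections import Counter

def retest_consistency(sessions):
    """Modal agreement per question, averaged over questions. `sessions` is a
    list of repeated interrogation runs, each a list of answers in question
    order. (A simplified stand-in for the paper's retest-consistency measure.)"""
    per_question = []
    for answers in zip(*sessions):
        modal = Counter(answers).most_common(1)[0][1]
        per_question.append(modal / len(answers))
    return sum(per_question) / len(per_question)

# Three repeated runs of the same three questions against one persona agent.
sessions = [
    ["Paris", "1985", "teacher"],
    ["Paris", "1985", "engineer"],   # occupation drifts on the second run
    ["Paris", "1987", "teacher"],    # birth year drifts on the third run
]
score = retest_consistency(sessions)
print(f"retest consistency: {score:.3f}")   # (3/3 + 2/3 + 2/3) / 3 ≈ 0.778
```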

[NLP-7] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

【速读】: 该论文旨在解决在长序列场景下,基于采样token的在线蒸馏(sampled-token on-policy distillation, OPD)方法因分布匹配退化为单token信号而导致的不稳定性问题。具体而言,随着rollout与教师模型常见前缀偏离,采样token OPD会因梯度方差增大、教师指导不可靠以及分词器或特殊token不匹配等因素引发优化失败。解决方案的关键在于引入教师局部支持匹配(teacher top-K local support matching)机制,通过截断反向KL散度(truncated reverse-KL)结合top-p rollout采样和特殊token掩码策略,在保持序列级信息一致性的同时显著降低梯度方差并提升训练稳定性,从而在单任务数学推理和多任务代理+数学训练中实现更优下游性能。

链接: https://arxiv.org/abs/2603.25562
作者: Yuqian Fu,Haohuan Huang,Kaiwen Jiang,Yuanheng Zhu,Dongbin Zhao
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, UCAS
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
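The proposed fix combines three pieces: restrict the reverse-KL to the teacher's top-K local support, sample rollouts with top-p, and mask special tokens. The per-position loss can be sketched as below; renormalizing both distributions over the teacher's top-k tokens is one plausible reading of "truncated reverse-KL", and the exact formulation in the paper may differ.

```python
import numpy as np

def truncated_reverse_kl(student_logits, teacher_logits, k=4, mask=False):
    """Reverse KL(student || teacher) restricted to the teacher's top-k
    support, both distributions renormalized over that support. `mask=True`
    drops the position entirely (e.g. a special/control token).
    One plausible reading of the objective; details may differ from the paper."""
    if mask:
        return 0.0
    top = np.argsort(teacher_logits)[-k:]          # teacher's local support
    def renorm(logits):
        z = logits[top] - logits[top].max()
        p = np.exp(z)
        return p / p.sum()
    q, p = renorm(student_logits), renorm(teacher_logits)
    return float(np.sum(q * (np.log(q) - np.log(p))))

rng = np.random.default_rng(0)
teacher = rng.normal(size=32)                       # toy vocab of 32 tokens
aligned = teacher + 0.01 * rng.normal(size=32)      # student close to teacher
drifted = rng.normal(size=32)                       # student off-support

print(f"aligned student: {truncated_reverse_kl(aligned, teacher):.4f}")
print(f"drifted student: {truncated_reverse_kl(drifted, teacher):.4f}")
```

Because both distributions are renormalized over the same small support, the loss is always non-negative and its gradient variance is bounded by that support, which is the variance/bias trade-off the paper analyzes.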

[NLP-8] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

【速读】: 该论文旨在解决视觉引导叙事(visually grounded stories)中人类写作与视觉语言模型(Vision-Language Models, VLMs)生成文本在叙事连贯性(narrative coherence)方面的差异问题。其解决方案的关键在于构建一套多维度的叙事连贯性评分体系,涵盖指代一致性(coreference)、话语关系类型(discourse relation types)、主题连续性(topic continuity)、角色持续性(character persistence)以及多模态角色定位(multimodal character grounding)等指标,从而系统性地量化并比较人类与VLM生成故事的连贯性特征。结果表明,尽管VLM生成文本在表面流畅性上接近人类,但在整体话语组织结构上存在系统性差异。

链接: https://arxiv.org/abs/2603.25537
作者: Nikolai Ilinykh,Hyewon Jang,Shalom Lappin,Asad Sayeed,Sharid Loáiciga
机构: University of Gothenburg (哥德堡大学); Queen Mary University of London (伦敦玛丽女王大学); King’s College London (伦敦国王学院)
类目: Computation and Language (cs.CL)
备注: 9 pages of content, 1 page of appendices, 9 tables, 3 figures

点击查看摘要

Abstract:We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at this https URL.

[NLP-9] Synchronous Signal Temporal Logic for Decidable Verification of Cyber-Physical Systems

【速读】: 该论文旨在解决安全关键型网络物理系统(Cyber Physical System, CPS)中信号时序逻辑(Signal Temporal Logic, STL)静态验证不可判定的问题。传统STL在一般情况下无法进行静态验证,而运行时验证方法又存在局限性。解决方案的关键在于提出同步信号时序逻辑(Synchronous Signal Temporal Logic, SSTL),这是一种STL的可判定子集,其核心假设是“信号不变性假设”(Signal Invariance Hypothesis, SIH),即信号在固定离散时间点(称为ticks)上采样且保持不变。SIH被证明是STL公式与其SSTL版本等价的充分必要条件;通过将SSTL公式转化为基于谓词的线性时序逻辑(LTL_P),可利用SPIN模型检测器实现可判定的静态安全性与活锁性质验证。

链接: https://arxiv.org/abs/2603.25531
作者: Partha Roop,Sobhan Chatterjee,Avinash Malik,Nathan Allen,Logan Kenwright
机构: University of Auckland(奥克兰大学); Auckland University of Technology(奥克兰理工大学)
类目: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many Cyber-Physical Systems (CPS) operate in safety-critical environments, where correct execution, reliability and trustworthiness are essential. Signal Temporal Logic (STL) provides a formal framework for checking safety-critical CPS. However, static verification of STL is undecidable in general; run-time verification offers an alternative but has limitations. We propose Synchronous Signal Temporal Logic (SSTL), a decidable fragment of STL, which admits static safety and liveness property verification. In SSTL, we assume that a signal is sampled at fixed discrete steps, called ticks, and then propose a hypothesis, called the Signal Invariance Hypothesis (SIH), which is inspired by a similar hypothesis for synchronous programs. We define the syntax and semantics of SSTL and show that SIH is a necessary and sufficient condition for equivalence between an STL formula and its SSTL counterpart. By translating SSTL to LTL_P (LTL defined over predicates), we enable decidable model checking using the SPIN model checker. We demonstrate the approach on a 33-node human heart model and other case studies.
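The key move in SSTL is that, under the Signal Invariance Hypothesis, a signal is constant between ticks, so bounded temporal operators reduce to finite checks over the sampled values. A minimal sketch of that tick-based semantics is below; the heart-rate-style signal and the properties checked are invented for illustration, not taken from the paper's heart model.

```python
def always(samples, pred, lo, hi):
    """G[lo,hi] pred: pred holds at every tick in the window (inclusive).
    Under the Signal Invariance Hypothesis the signal is constant between
    ticks, so checking the samples decides the dense-time property too."""
    return all(pred(v) for v in samples[lo:hi + 1])

def eventually(samples, pred, lo, hi):
    """F[lo,hi] pred: pred holds at some tick in the window."""
    return any(pred(v) for v in samples[lo:hi + 1])

# Toy heart-rate-like signal sampled once per tick (values are illustrative).
hr = [72, 75, 74, 118, 90, 80, 78, 76]

safe = lambda v: v < 100
print(always(hr, safe, 0, 7))                    # False: tick 3 violates v < 100
print(eventually(hr, lambda v: v > 110, 0, 7))   # True: tick 3
# Bounded recovery: after any violation, safety is restored within 2 ticks.
recovers = all(safe(v) or eventually(hr, safe, i + 1, min(i + 2, 7))
               for i, v in enumerate(hr))
print(recovers)                                  # True
```

Decidable model checking then amounts to compiling such bounded formulas into LTL over predicates and handing them to SPIN, rather than evaluating a Python list.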

[NLP-10] An Experimental Comparison of the Most Popular Approaches to Fake News Detection

【速读】: 该论文旨在解决虚假新闻检测方法在真实应用场景中泛化能力不足的问题,尤其是面对领域迁移(domain shift)和分布外数据(out-of-distribution data)时模型性能下降的挑战。其关键解决方案在于系统性地评估12种代表性检测方法(涵盖传统机器学习、深度学习、Transformer及跨域专用架构),并通过统一标签格式(“Real”与“Fake”)在10个不同领域的公开数据集上开展域内(in-domain)、多域(multi-domain)和跨域(cross-domain)实验,从而揭示现有方法的局限性;同时指出,尽管微调模型在单一域表现良好,但跨域架构虽可缓解泛化差距却依赖大量数据,而大语言模型(LLMs)凭借零样本(zero-shot)和少样本(few-shot)学习潜力提供了更具前景的替代路径。

链接: https://arxiv.org/abs/2603.25501
作者: Pietro Dell’Oglio,Alessandro Bondielli,Francesco Marcelloni,Lucia C. Passaro
机构: University of Pisa (比萨大学); University of Pisa (比萨大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, fake news detection has received increasing attention in public debate and scientific research. Despite advances in detection techniques, the production and spread of false information have become more sophisticated, driven by Large Language Models (LLMs) and the amplification power of social media. We present a critical assessment of 12 representative fake news detection approaches, spanning traditional machine learning, deep learning, transformers, and specialized cross-domain architectures. We evaluate these methods on 10 publicly available datasets differing in genre, source, topic, and labeling rationale. We address text-only English fake news detection as a binary classification task by harmonizing labels into “Real” and “Fake” to ensure a consistent evaluation protocol. We acknowledge that label semantics vary across datasets and that harmonization inevitably removes such semantic nuances. Each dataset is treated as a distinct domain. We conduct in-domain, multi-domain and cross-domain experiments to simulate real-world scenarios involving domain shift and out-of-distribution data. Fine-tuned models perform well in-domain but struggle to generalize. Cross-domain architectures can reduce this gap but are data-hungry, while LLMs offer a promising alternative through zero- and few-shot learning. Given inherent dataset confounds and possible pre-training exposure, results should be interpreted as robustness evaluations within this English, text-only protocol.

[NLP-11] Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

【速读】: 该论文旨在解决低资源语言(特别是罗曼什语的6种不同变体)在机器翻译中的性能瓶颈问题,传统依赖大语言模型(LLM)从高资源语言生成合成数据的方法在此场景下失效,因LLM易混淆罗曼什语的不同变体。解决方案的关键在于调整数据增强的方向,使其与源语言和目标语言之间的资源梯度保持一致,从而显著提升翻译质量——实验表明该方法在最低资源的罗曼什语变体上比Gemini 3 Pro高出23 BLEU分数,并通过人工评估验证了其在各罗曼什语变体中生成流畅翻译的能力。

链接: https://arxiv.org/abs/2603.25489
作者: Jannis Vamvas,Ignacio Pérez Prat,Angela Heldstab,Dominic P. Fischer,Sina Ahmadi,Rico Sennrich
机构: University of Zurich (苏黎世大学); Lia Rumantscha (利阿·鲁芒查)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

[NLP-12] Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

Quick read: This paper asks how to maximize performance when using large language models (LLMs) for text classification in social science research. Although prior work shows LLMs can cut computational costs substantially while rivaling traditional methods in accuracy, performance varies widely across current tests and no systematic optimization strategy exists. The key of the solution is to systematically vary three core elements of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Empirically, a minimal increase in prompt context yields the largest performance gain, whereas further increases bring only marginal improvements and can even reduce accuracy through overload; the results also reveal substantial heterogeneity across models, tasks, and batch sizes, implying that each coding task needs individual validation rather than reliance on general rules.

Link: https://arxiv.org/abs/2603.25422
Authors: Erkan Gunes, Christoffer Florczak, Tevfik Murat Yildirim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, with wide variance in performance across current tests, we turn to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the highest increase in performance, while further increases in context tend to yield only marginal gains thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch sizes, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.
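The three prompt-context axes the study varies can be sketched as a small prompt builder. The classification task, labels, and wording below are hypothetical illustrations, not the paper's actual prompts:

```python
# Hypothetical sketch of the three prompt axes varied in the study:
# label descriptions, instructional nudges, and few-shot examples.
# The task, labels, and phrasing here are illustrative stand-ins.

def build_prompt(text, label_desc=False, nudge=False, shots=()):
    parts = ["Classify the text as POSITIVE or NEGATIVE."]
    if label_desc:
        # axis 1: describe what each label means
        parts.append("POSITIVE = approving tone; NEGATIVE = critical tone.")
    if nudge:
        # axis 2: an instructional nudge constraining the output format
        parts.append("Answer with exactly one label and nothing else.")
    for example, label in shots:
        # axis 3: few-shot demonstrations
        parts.append(f"Text: {example}\nLabel: {label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)

minimal = build_prompt("Great policy outcome.")
rich = build_prompt("Great policy outcome.", label_desc=True, nudge=True,
                    shots=[("Terrible decision.", "NEGATIVE")])
print(len(minimal.split("\n\n")), len(rich.split("\n\n")))  # 2 5
```

Comparing classification accuracy across such variants, from `minimal` upward, is the kind of controlled prompt-context sweep the abstract describes.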

[NLP-13] TAPO: Translation-Augmented Policy Optimization for Multilingual Mathematical Reasoning

Quick read: This paper targets the large performance gap between English and multilingual mathematical reasoning in large language models (LLMs), which stems mainly from deficient multilingual language understanding. The key of the solution is Translation-Augmented Policy Optimization (TAPO), a GRPO-based reinforcement learning framework that enforces an explicit alignment strategy using English as a pivot language under an understand-then-reason paradigm. Crucially, TAPO introduces a step-level relative advantage mechanism that decouples language understanding from reasoning, allowing translation-quality reward signals to be integrated without optimization conflicts and thereby jointly improving the model's understanding and reasoning abilities.

Link: https://arxiv.org/abs/2603.25419
Authors: Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
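For background (a generic sketch, not the paper's exact formulation): GRPO, on which TAPO is built, samples a group of $G$ responses per prompt, scores each with a reward $r_i$, and replaces a learned critic with a group-relative advantage; TAPO's step-level relative advantage presumably applies the same idea at the granularity of reasoning steps:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$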

[NLP-14] Large Language Model as Token Compressor and Decompressor

Quick read: This paper tackles the high token cost of long-text processing, a computational bottleneck for long-context inference with large language models (LLMs). The key of the solution is a self-expressive autoencoding framework that fine-tunes a pretrained LLM into an efficient compressor and decompressor of discrete, variable-length latent codes (Z-tokens). Lightweight LoRA adapter heads make the compression content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed. The method achieves up to 18x token reduction while preserving reconstruction fidelity and downstream performance, offering an efficient path toward prompt compression and autoregressive generation directly in the Z-token space.

Link: https://arxiv.org/abs/2603.25340
Authors: Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Yiran Wang, Wei Yang
Affiliations: Qwen3
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation, produced via lightweight LoRA-based adapter heads, is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.

[NLP-15] Beyond Detection: Rethinking Education in the Age of AI-writing

Quick read: This paper examines what is at stake as generative AI tools such as ChatGPT spread through classrooms, workplaces, and everyday thinking: writing risks being outsourced, automated, and stripped of its cognitive value, undermining deep human learning. The key of the solution is the argument that the writing process itself, messy, slow, and often frustrating, is precisely the core mechanism through which deep human learning happens; educators should therefore adapt to the AI era through smarter pedagogy rather than outright bans, and the ability to recognize machine-generated language may become a critical literacy of the 21st century.

Link: https://arxiv.org/abs/2603.25329
Authors: Maria Marina, Alexander Panchenko, Vasily Konovalov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, AIED 2025

Abstract:As generative AI tools like ChatGPT enter classrooms, workplaces and everyday thinking, writing is at risk of becoming a formality – outsourced, automated and stripped of its cognitive value. But writing is not just output; it is how we learn to think. This paper explores what is lost when we let machines write for us, drawing on cognitive psychology, educational theory and real classroom practices. We argue that the process of writing – messy, slow, often frustrating – is where human deep learning happens. The paper also explores the current possibilities of AI-text detection, how educators can adapt through smarter pedagogy rather than bans, and why the ability to recognize machine-generated language may become a critical literacy of the 21st century. In a world where writing can be faked, learning cannot.

[NLP-16] Separate Before You Compress: The WWHO Tokenization Architecture

Quick read: This paper addresses a serious weakness of the Byte Pair Encoding (BPE) tokenizers used by most large language models (LLMs) when processing complex Abugida scripts: BPE breaks multi-codepoint grapheme clusters into meaningless sub-character units, degrading reasoning efficiency and raising inference costs, a "Token Tax" on the Global South. The key of the solution is a three-layer architecture, WWHO (Where-What-How Often), and a new algorithm called SGPE (Syllable-aware Grapheme Pair Encoding) that separates the script's linguistic rules from the statistical compression process, enabling cluster-preserving tokenization and seamless multilingual tokenization. Experiments show token reductions of 61.7% for Sinhala and 27.0% for Devanagari (Hindi/Sanskrit) versus o200k base; on a mixed-script dataset, SGPE is 36.7%-60.2% more token-efficient than o200k base, Llama 4 Scout, and DeepSeek V3, while providing a Linguistic Zero-Breakage Guarantee that no valid syllable is ever split across multiple tokens.

Link: https://arxiv.org/abs/2603.25309
Authors: Kusal Darshana
Affiliations: Remeinium Research
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 1 figure, 8 tables. Tokenization architecture including formal DFA definitions and regular expressions for Sinhala and Devanagari syllabification. Evaluation includes comparisons with OpenAI o200k-base, Llama-4-Scout, and DeepSeek-V3. Source code and datasets: this https URL

Abstract:Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM’s reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant “Token Tax” for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI’s o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
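The "Token Tax" described above can be illustrated with a minimal sketch: a Devanagari conjunct is a single syllable but several codepoints, so a byte-level fallback tokenizer explodes it into many tokens, whereas a syllable-aware split keeps it intact. The splitter below is a toy stand-in, not the paper's SGPE algorithm:

```python
import unicodedata

# Hedged illustration of why byte-level fallback inflates token counts for
# Abugida scripts: a Devanagari conjunct is one syllable but many
# codepoints/bytes. The syllable splitter is a toy stand-in for SGPE.

VIRAMA = "\u094d"  # Devanagari sign virama, which forms conjuncts

def byte_tokens(text):
    """Worst case: a tokenizer that falls back to one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def syllable_tokens(text):
    """Toy syllable split: glue combining marks onto the current cluster,
    and glue any character that follows a virama."""
    clusters = []
    for ch in text:
        if clusters and (unicodedata.category(ch).startswith("M")
                         or clusters[-1].endswith(VIRAMA)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

ksha = "\u0915\u094d\u0937"  # conjunct "ksha": ka + virama + ssa
print(len(byte_tokens(ksha)))   # 9 byte-level tokens for one syllable
print(len(syllable_tokens(ksha)))  # 1 intact cluster
```

A production system would follow Unicode grapheme-cluster rules (UAX #29) plus script-specific syllabification; the point here is only the order-of-magnitude gap between the two splits.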

[NLP-17] DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers

Quick read: This paper studies Doc2SemDAG construction: automatically recovering document-grounded semantic directed acyclic graphs (DAGs) from scientific papers. The core challenges are that a document may admit multiple plausible abstractions, the intended structure is often implicit, and supporting evidence is scattered across prose, equations, and figure captions. The key of the solution is to use scientific papers containing explicit DAG figures as supervision: the figure supplies the structure, while the text supplies context and explanation. The proposed DAGverse framework, whose core component DAGverse-Pipeline performs figure classification, structure reconstruction, semantic grounding, and validation, produces high-precision semantic DAGs, outperforms existing vision-language models on DAG classification and annotation, and offers a new foundation and benchmark for structured reasoning grounded in real-world evidence.

Link: https://arxiv.org/abs/2603.25293
Authors: Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff, Raha Moraffah, Huan Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.

[NLP-18] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech

Quick read: This paper addresses the intertwined spread of online hate speech (HS) and misinformation, where hateful content is often expressed as fact-like claims, forcing content moderators to assess both harmfulness and veracity and significantly increasing annotation burden. The key of the solution is WSF-ARG+, the first dataset combining hate speech with check-worthiness information, together with an LLM-in-the-loop framework that uses 12 open-weight LLMs of different sizes and architectures to assist in annotating check-worthy claims, reducing human effort without compromising data quality. Experiments further show that incorporating check-worthiness labels improves LLM-based hate speech detection by 0.154 macro-F1 on average and by up to 0.213 macro-F1.

Link: https://arxiv.org/abs/2603.25269
Authors: Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin
Affiliations: Centrum Wiskunde & Informatica; University of Groningen
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset which combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims exhibit significantly higher harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.

[NLP-19] CRAFT: Grounded Multi-Agent Coordination Under Partial Information

Quick read: This paper tackles the difficulty of evaluating pragmatic multi-agent communication in large language models under strict partial information, where multiple agents with complementary but incomplete views must coordinate through natural language to build a shared 3D structure that no single agent can fully observe. The key of the solution is the CRAFT benchmark, which formalizes the problem as a multi-sender pragmatic reasoning task and provides a diagnostic framework that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, systematically exposing behavioral failure modes of both frontier and open-weight models in multi-agent collaboration.

Link: https://arxiv.org/abs/2603.25268
Authors: Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy
Affiliations: Colorado State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, 8 open-weight and 7 frontier (including reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at this https URL

[NLP-20] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Quick read: This paper addresses the lack of systematic evaluation of large language models' (LLMs') dynamic reasoning in realistic scientific settings: existing scientific benchmarks mostly use static, single-turn Question Answering (QA) formats that cannot measure performance on complex scientific tasks requiring multi-step iteration and experimental interaction. The key of the solution is MolQuest, an agent-based evaluation framework built on authentic chemical experimental data that formalizes molecular structure elucidation as a multi-turn interactive task: models must proactively plan experimental steps, integrate heterogeneous spectral information (e.g., NMR and MS), and iteratively refine structural hypotheses. The framework is the first to systematically evaluate abductive reasoning and strategic decision-making across a vast, complex chemical space, revealing that frontier models still hit a substantial performance ceiling in realistic scientific scenarios (state-of-the-art models reach only about 50% accuracy).

Link: https://arxiv.org/abs/2603.25253
Authors: Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv, Bing Zhao, Wei Hu
Affiliations: Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs' abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight the critical gap in current LLMs' strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.

[NLP-21] Comparing Natural and Synthetic Structured Data: A Study of the Passive Verb Alternation in French and Italian

Quick read: This paper examines how natural versus synthetic data affect the training and evaluation of large language models (LLMs), in particular their effectiveness at capturing abstract syntactic and semantic patterns, using the passive verb alternation in French and Italian as a case study. The study uses Blackbird Language Matrices (BLMs), structured datasets designed to systematically probe models' grasp of underlying linguistic regularities. The key finding is that while synthetic data quickly drives models to ceiling performance, it does not yield reliable generalization; models trained on natural data perform robustly on both natural and structured synthetic test suites, more faithfully reflecting their command of abstract linguistic knowledge. This underscores the irreplaceable value of natural data in LLM evaluation and the importance of structured experimental designs for precisely probing linguistic competence.

Link: https://arxiv.org/abs/2603.25227
Authors: Giuseppe Samo, Paola Merlo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 8 figures, paper accepted at the Workshop on Structured Linguistic Data and Evaluation (SLiDE)

Abstract:This study compares the impact of natural and synthetic data on training and evaluating large language models (LLMs), using the case of passive verb alternation in French and Italian. We use Blackbird Language Matrices (BLMs), structured datasets designed to probe linguistic knowledge of underlying patterns across sentence sets. We compare structured templates instantiated with natural sentences extracted from Universal Dependencies to structured templates of synthetic sentences. Experiments show that while models achieve ceiling performance when trained and tested on synthetic datasets, they do not reliably generalize to natural sentences. In contrast, models trained on natural data exhibit robust performance across both natural and synthetic test suites, demonstrating their superior ability to capture abstract linguistic patterns. These results corroborate the value of natural data and of structured set ups in linguistic evaluation for probing LLMs’ syntactic and semantic knowledge.

[NLP-22] Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages

Quick read: This paper addresses the difficulty of comparing results across benchmarks in extremely low-resource machine translation (XLR MT): researchers working on specific language groups (e.g., ancient languages or non-Latin indigenous languages) cannot tell whether reported gains reflect methodological advances or dataset artifacts. The key of the solution is a set of dataset-intrinsic difficulty metrics, the FRED metrics: Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D). These quantify the factors driving translation performance and reveal that much of the variability is explained by train-test overlap and pre-training exposure rather than model capability; they also show that some languages, particularly extinct and non-Latin indigenous ones, suffer from high token fertility, exposing a fundamental limitation of transferring models from high-resource languages without a shared vocabulary. Reporting these indices alongside performance scores makes cross-lingual transfer evaluation more transparent and reliable.

Link: https://arxiv.org/abs/2603.25222
Authors: Danlu Chen, Ka Sing He, Jiahe Tian, Chenghao Xiao, Zhaofeng Wu, Taylor Berg-Kirkpatrick, Freda Shi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups – such as ancient languages – it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, comprising the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D), which serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than model capability. Additionally, we identify that some languages – particularly extinct and non-Latin indigenous languages – suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages that lack a shared vocabulary. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.
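Two of the dataset-intrinsic diagnostics named above can be sketched in a few lines: token fertility (tokens emitted per word under a given tokenizer) and a crude train-test overlap proxy. The toy tokenizers and corpora below are illustrative assumptions, not the paper's definitions:

```python
# Hedged sketch of two diagnostics in the spirit of the FRED metrics:
# token fertility and a train-test overlap proxy. The toy tokenizers
# and corpora are illustrative stand-ins, not the paper's definitions.

def fertility(sentence, tokenize):
    """Tokens emitted per whitespace word; high values signal poor coverage."""
    words = sentence.split()
    tokens = [t for w in words for t in tokenize(w)]
    return len(tokens) / len(words)

def word_tokenize(word):
    return [word]      # best case: the whole word is in the vocabulary

def char_tokenize(word):
    return list(word)  # worst case: one token per character

def overlap_proxy(train, test):
    """Fraction of test sentences appearing verbatim in the training set."""
    train_set = set(train)
    return sum(s in train_set for s in test) / len(test)

print(fertility("abc de", word_tokenize))   # 1.0
print(fertility("abc de", char_tokenize))   # 2.5
print(overlap_proxy(["a b", "c d"], ["a b", "x y"]))  # 0.5
```

A high fertility under the actual model tokenizer, or a high overlap proxy, would flag a benchmark score as reflecting tokenization mismatch or memorization rather than translation ability.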

[NLP-23] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection CVPR2026

Quick read: This paper targets the limited accuracy and robustness of multimodal misinformation detection caused by the black-box nature of traditional detectors and their fragility against new manipulation tactics. The key of the solution is the Probabilistic Concept Graph Reasoning (PCGR) framework, which follows a build-then-infer paradigm: it first uses multimodal large language models (MLLMs) to automatically discover and validate human-understandable concept nodes and construct a structured concept graph, then applies hierarchical attention over this graph to infer claim veracity, yielding interpretable evidence-to-conclusion reasoning chains and achieving accurate, evolvable multimodal misinformation detection.

Link: https://arxiv.org/abs/2603.25203
Authors: Ruichao Yang, Wei Gao, Xiaobin Zhu, Jing Ma, Hongzhan Lin, Ziyang Luo, Bo-Wen Zhang, Xu-Cheng Yin
Affiliations: University of Science and Technology Beijing; Singapore Management University; Hong Kong Baptist University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by CVPR 2026

Abstract:Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

[NLP-24] SafeMath: Inference-time Safety improves Math Accuracy

Quick read: This paper addresses the risk that generative AI can be induced by malicious inputs to produce harmful, biased, or policy-violating content when solving mathematical word problems, focusing on math problems framed as natural-language narratives as a subtle medium for propagating unethical content and psychological harm. The key of the solution is SafeMath, a safety-alignment technique that effectively disentangles linguistic harm from the underlying mathematical reasoning task, reducing harmful outputs while maintaining, and in some cases improving, mathematical reasoning accuracy, thus balancing safety and correctness.

Link: https://arxiv.org/abs/2603.25201
Authors: Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee
Affiliations: Indian Institute of Technology Kharagpur; Cisco Systems; National University of Singapore
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Submitted to ARR March 2026

Abstract:Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath – a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at this https URL.

[NLP-25] A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

Quick read: This paper addresses the gap between large language models (LLMs) "knowing" the content of clinical practice guidelines (CPGs) and being able to attribute and apply them in conversation: models can identify some guideline content yet struggle to reference the correct source and adhere to the recommendations. The key of the solution is CPGBench, an automated benchmark built from 3,418 CPG documents from 9 countries/regions and 2 international organizations, from which 32,155 structured clinical recommendations are extracted, each paired with a generated multi-turn conversation, enabling a systematic evaluation of 8 leading LLMs on guideline detection and adherence. Results reveal a striking gap between detection rates (71.1%-89.6%) and correct title referencing (3.6%-29.7%), and adherence rates of only 21.8%-63.2%, showing that models' ability to apply guideline knowledge lags far behind their apparent familiarity with it, a critical gap to close before LLMs can be safely deployed in real clinical settings.

Link: https://arxiv.org/abs/2603.25196
Authors: Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations published in the last decade spanning across 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation accordingly to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of the recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real world clinical practice.

[NLP-26] A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

Quick read: This paper addresses data scarcity in Basque dialectal natural language processing (NLP). The key of the solution is a systematic catalog and classification of currently available Basque dialectal data, distinguishing two construction strategies: collecting online data originally written in a dialect (news, social media, and resources such as dictionaries and grammars), and adapting standard-variety corpora into dialects either manually or automatically. For the manual adaptation, the test split of the XNLI natural language inference dataset was adapted into three dialects, Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold-standard evaluation set; for the automatic adaptation, the automatically converted physical commonsense dataset (BasPhyCowest) was manually evaluated by native speakers to determine whether it can serve as viable silver-standard data, providing a scalable and reliable data foundation for dialectal NLP research.

Link: https://arxiv.org/abs/2603.25189
Authors: Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).

[NLP-27] Probing the Lack of Stable Internal Beliefs in LLMs NEURIPS2025

Quick read: This paper examines whether dialogue LLMs can maintain "implicit consistency", i.e., persistent adherence to an unstated goal across multi-turn interactions (such as secretly selecting a target and always answering with respect to it), given that persona-driven systems lack stable internal representations. Using a 20-questions-style riddle game in which the model secretly selects a target and answers yes/no to users' guesses, the evaluations show that the models' implicit goals drift across turns unless the previously selected target is explicitly provided in context, in which case behavioral consistency improves markedly. This highlights that building dialogue systems with genuine persona traits requires mechanisms that anchor implicit goals over time.

Link: https://arxiv.org/abs/2603.25187
Authors: Yifan Luo, Kangping Xu, Yanzhen Lu, Yang Yuan, Andrew Chi-Chih Yao
Affiliations: Tsinghua University; Shanghai Qizhi Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by the NeurIPS 2025 PersonaNLP Workshop, Mexico City

Abstract:Persona-driven large language models (LLMs) require consistent behavioral tendencies across interactions to simulate human-like personality traits, such as persistence or reliability. However, current LLMs often lack stable internal representations that anchor their responses over extended dialogues. This work explores whether LLMs can maintain “implicit consistency”, defined as persistent adherence to an unstated goal in multi-turn interactions. We designed a 20-question-style riddle game paradigm where an LLM is tasked with secretly selecting a target and responding to users’ guesses with “yes/no” answers. Through evaluations, we find that LLMs struggle to preserve latent consistency: their implicit “goals” shift across turns unless explicitly provided their selected target in context. These findings highlight critical limitations in the building of persona-driven LLMs and underscore the need for mechanisms that anchor implicit goals over time, which is a key to realistic personality modeling in interactive applications such as dialogue systems.

[NLP-28] Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Quick read: This paper addresses a key issue in context-aware machine translation (CAMT): although document-level information can improve translation, CAMT does not consistently outperform sentence-level translation because sentences benefit unevenly from context, and existing training objectives do not explicitly model this variability, preventing models from exploiting context adaptively. The key of the solution is Cross-Preference Learning (CPL), a preference-based training framework that integrates intra-condition and cross-condition preferences into the optimization objective, explicitly supervising when and how context improves translation quality. CPL yields consistent, robust gains on several mainstream LLMs without any architectural modifications.

链接: https://arxiv.org/abs/2603.25183
作者: Ying Li,Xinglin Lyu,Junhui Li,Jinlong Yang,Hengchao Shang,Min Zhang,Shimin Tao,Daimeng Wei
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model’s ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.
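
The abstract does not spell out the CPL objective. As a hedged sketch, both preference types can be written as DPO-style pairwise losses, where an intra-condition pair compares two hypotheses under the same input condition and a cross-condition pair contrasts a context-aware hypothesis with a sentence-level one for the same source. All log-likelihood values below are made up:

```python
import math

def dpo_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO-style loss for one (winner, loser) preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical sequence log-likelihoods under the policy and a frozen reference,
# for one source sentence translated with and without document context.
ctx_good, sent_good = -12.0, -14.5     # context-aware vs sentence-level hypothesis
ref_ctx, ref_sent = -13.0, -13.5

# Intra-condition preference: better vs worse hypothesis under the SAME input
# condition (e.g. two context-aware candidates).
intra = dpo_pref_loss(-11.0, -15.0, -12.0, -12.5)

# Cross-condition preference: the context-aware output is preferred over the
# sentence-level one whenever context actually helps (a quality label decides
# the direction of the pair).
cross = dpo_pref_loss(ctx_good, sent_good, ref_ctx, ref_sent)

total = intra + cross   # CPL-style objective: sum of both preference terms
```

The key idea the sketch tries to capture is that the cross-condition term gives the model explicit supervision about *when* context improves translation, not just *which* hypothesis is better.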

[NLP-29] Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

【速读】: 该论文旨在解决文本到动作生成(text-to-motion generation)在跨语言场景下的两大关键挑战:一是缺乏双语数据集,二是现有语言模型在跨语言语义理解上的不足。为应对这些问题,作者提出了首个双语文本到动作基准数据集 BiHumanML3D,该数据集通过大语言模型(LLM)辅助标注与严格人工校正构建而成。解决方案的核心创新在于提出了一种简单但高效的基线方法——双语运动扩散模型(Bilingual Motion Diffusion, BiMD),其引入了跨语言对齐(Cross-Lingual Alignment, CLA)机制,显式地对齐不同语言的语义表示,从而建立一个鲁棒的条件空间,支持高质量的动作生成,包括零样本混语(zero-shot code-switching)场景。实验表明,BiMD 在 FID 和 R@3 指标上显著优于单语扩散模型和翻译基线,验证了该数据集和对齐策略在跨语言动作合成中的必要性与有效性。

链接: https://arxiv.org/abs/2603.25178
作者: Wanjiang Weng,Xiaofeng Tan,Xiangbo Shu,Guo-Sen Xie,Pan Zhou,Hongsong Wang
机构: Southeast University (东南大学); Nanjing University of Science and Technology (南京理工大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at this https URL

[NLP-30] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在实际部署中面临的提示攻击(prompt attacks)安全风险问题,尤其是轻量级防护机制在分布偏移下泛化能力不足、而高容量LLM作为安全判官又因延迟或成本过高难以用于实时防护的“部署鸿沟”(deployment gap)。解决方案的关键在于设计一种基于轻量级通用LLM的安全判官框架,通过结构化的推理流程——包括显式意图分解、安全信号验证、危害评估与自我反思——引导模型在低延迟约束下实现可靠判断。实验表明,如gemini-2.0-flash-lite-001这类轻量模型可有效替代传统规则系统或高成本LLM,在真实场景中作为中央化安全护栏服务运行,且已应用于新加坡公共服务聊天机器人的生产环境。

链接: https://arxiv.org/abs/2603.25176
作者: Hieu Xuan Le,Benjamin Goh,Quy Anh Tang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 3 figures

点击查看摘要

Abstract:Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
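
The paper does not specify the Mixture-of-Models aggregation rule. A simple majority vote over per-judge verdicts is one plausible instantiation; the judge names, labels, and threshold below are hypothetical:

```python
from collections import Counter

def aggregate_judges(verdicts, threshold=0.5):
    """Flag a prompt as an attack if at least `threshold` fraction of judges
    say so. `verdicts` maps judge name -> "attack" / "benign"."""
    votes = Counter(verdicts.values())
    frac_attack = votes["attack"] / len(verdicts)
    return ("attack" if frac_attack >= threshold else "benign", frac_attack)

# Hypothetical verdicts from three lightweight LLM judges on one prompt:
label, frac = aggregate_judges(
    {"gemini-flash-lite": "attack", "judge-b": "attack", "judge-c": "benign"}
)
```

Under a latency budget, the threshold trades off false positives against missed attacks; the paper reports only modest gains from such aggregation relative to a single judge.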

[NLP-31] To Write or to Automate Linguistic Prompts That Is the Question

【速读】: 该论文旨在解决生成式 AI(Generative AI)在语言任务中对提示(prompt)设计高度敏感的问题,具体探究自动提示优化是否能够替代专家手工设计的零样本提示(zero-shot expert prompts)。其解决方案的关键在于系统性地比较三种提示策略:人工设计的零样本专家提示、基础 DSPy 签名(base DSPy signatures),以及通过 GEPA(Gradient-based Prompt Optimization)优化的 DSPy 签名,在翻译、术语插入和语言质量评估(LQA)三项任务中的表现。研究发现,GEPA 能显著提升最小 DSPy 签名的效果,且多数情况下与专家优化提示在统计上无显著差异,表明自动化优化可有效逼近甚至等效于专家提示工程,但需注意该对比存在不对称性——GEPA 依赖黄金标准数据划分进行程序化搜索,而专家提示则无需标注数据,仅依赖领域知识与迭代优化。

链接: https://arxiv.org/abs/2603.25169
作者: Marina Sánchez-Torrón,Daria Akselrod,Jason Rauchwerk
机构: Smartling(智能语言)
类目: Computation and Language (cs.CL)
备注: 10 pages, to be submitted for EAMT 2026

点击查看摘要

Abstract:LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.

[NLP-32] Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在置信度评估中存在的混淆问题,即传统校准指标(如期望校准误差 ECE 和 Brier 分数)将模型的知识水平(Type-1 敏感性)与元认知敏感性(Type-2 metacognitive sensitivity)混为一谈。为此,作者提出基于 Type-2 信号检测理论(Signal Detection Theory)的评估框架,其关键在于引入两个核心指标:meta-d’(衡量模型识别自身知识边界的能力)和元认知效率比 M-ratio(反映模型在保持高敏感性的同时优化置信判断策略的效率)。该方法首次实现了对LLM元认知能力的解耦分析,揭示了不同模型在“知道自己不知道什么”方面的本质差异,而非仅依赖于置信度阈值设置带来的表面校准效果,从而为模型选择、部署及人机协作提供更可靠的依据。

链接: https://arxiv.org/abs/2603.25112
作者: Jon-Paul Cacioli
机构: Meta AI (Meta AI); Google (Google)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 7 tables. Pre-registered; code and data at this https URL

点击查看摘要

Abstract:Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d’ and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar – Mistral achieves the highest d’ but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d’ remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d’ framework reveals which models “know what they don’t know” versus which merely appear well-calibrated due to criterion placement – a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
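
A minimal numeric sketch of the framework's quantities, using equal-variance Gaussian SDT. Note that the paper's meta-d' requires a maximum-likelihood fit over confidence-rating distributions (Maniscalco & Lau); the simple Type-2 d' below is only a stand-in, and all hit and false-alarm rates are invented:

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf  # inverse standard-normal CDF

def type1_dprime(hit_rate, fa_rate):
    """Type-1 sensitivity: how well the model separates correct from
    incorrect answers (how much it knows)."""
    return Z(hit_rate) - Z(fa_rate)

def type2_dprime(conf_given_correct, conf_given_error):
    """Type-2 relabels trials: a "hit" is high confidence on a correct answer,
    a "false alarm" is high confidence on an error (knowing what it knows)."""
    return Z(conf_given_correct) - Z(conf_given_error)

# Hypothetical rates for one model:
d1 = type1_dprime(0.80, 0.30)        # Type-1 sensitivity
meta_d = type2_dprime(0.70, 0.40)    # stand-in for meta-d'
m_ratio = meta_d / d1                # metacognitive efficiency; 1.0 = ideal
```

An M-ratio well below 1 (as here) describes a model whose confidence reports waste some of the information its answers demonstrably contain, which is the dissociation the paper measures across models and domains.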

[NLP-33] oMind: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在心理健康领域应用时面临的三大核心挑战:高质量、可解释且知识 grounded 的训练数据匮乏;训练范式局限于基础能力而缺乏多样化任务适配;以及多轮对话场景下的评估体系缺失。解决方案的关键在于提出 oMind 框架,其核心包括:构建一个包含约 16.4 万条样本的多任务监督微调(Supervised Fine-Tuning, SFT)数据集,该数据集通过结构化知识检索、基于 LLM 的剪枝与人工审核流程生成;同时引入 oMind-Chat 基准数据集,具备专家标注的逐轮及整轮对话评分标准,从而实现对模型对话能力的精细化评估。实验表明,oMind LLM 在多项核心能力和多轮对话任务中均显著优于基线模型,尤其在推理能力上提升达 80% 的胜率优势。

链接: https://arxiv.org/abs/2603.25105
作者: Suraj Racha,Prashant Harish Joshi,Utkarsh Maurya,Nitin Yadav,Mridul Sharma,Ananya Kunisetty,Saranya Darisipudi,Nirmal Punjabi,Ganesh Ramakrishnan
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation to the medical domain, specifically mental health, poses specific challenges. Mental health is a rising global concern, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health - lack of high-quality, interpretable, and knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. Addressing these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, together with a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat - a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning, with a win rate of up to 80%.

[NLP-34] Closing the Confidence-Faithfulness Gap in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中 verbalized confidence(口头化置信度)与实际准确性之间严重脱节的问题,即模型输出的置信度评分往往不能真实反映其预测准确率。研究通过线性探针(linear probes)和对比激活添加(contrastive activation addition, CAA)操控技术揭示:校准能力(calibration)与口头化置信度信号在模型内部以正交方式编码,这一发现适用于三个开源模型和四个数据集。关键创新在于识别出“推理污染效应”(Reasoning Contamination Effect)——当模型同时进行推理和输出置信度时,推理过程会干扰置信度方向,加剧校准偏差。基于此,作者提出一种两阶段自适应转向管道(adaptive steering pipeline),首先读取模型内部的准确率估计,再引导输出置信度与其对齐,显著提升了所有测试模型的校准一致性。

链接: https://arxiv.org/abs/2603.25052
作者: Miranda Muqing Miao,Lyle Ungar
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another – a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model’s internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
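
Contrastive activation addition, as used in the paper's steering experiments, reduces to a mean-difference direction added to the residual stream at inference time. A toy 3-dimensional version is sketched below; real CAA extracts activations from one transformer layer, and every number here is fabricated:

```python
def mean_vec(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def caa_vector(pos_acts, neg_acts):
    """CAA steering direction: mean activation on contrastive positive prompts
    minus mean activation on negative prompts."""
    mp, mn = mean_vec(pos_acts), mean_vec(neg_acts)
    return [p - q for p, q in zip(mp, mn)]

def steer(hidden, direction, alpha):
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations for high-confidence vs low-confidence completions:
v = caa_vector([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
               [[0.0, 1.0, 1.0], [0.0, 1.0, 1.0]])
h = steer([0.5, 0.5, 0.5], v, alpha=1.0)
```

The paper's two-stage pipeline would first read the internal accuracy estimate with a probe, then choose `alpha` (or the target along `v`) so the verbalized confidence matches it.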

[NLP-35] Approaches to Analysing Historical Newspapers Using LLMs

【速读】: 该论文旨在解决如何利用计算方法对噪声较大的历史报纸文本进行多维度分析,以揭示19世纪末至20世纪初斯洛文尼亚社会中集体身份、政治倾向与国家认同的公共话语建构问题。其解决方案的关键在于融合多种计算技术:首先使用BERTopic进行主题建模识别核心议题与意识形态差异;其次采用适配斯洛文尼亚语的大语言模型(LLM)——GaMS3-12B-Instruct实现细粒度的情感分类,克服光学字符识别(OCR)降质带来的挑战;再通过命名实体识别(NER)构建实体关系图谱,并结合定量网络分析与批判性话语分析(Critical Discourse Analysis, CDA),挖掘群体身份与地理空间之间的关联及其演变逻辑。该混合方法框架实现了从大规模数据中提取可解释的社会认知模式,为数字人文研究提供了可扩展且具批判性的分析路径。

链接: https://arxiv.org/abs/2603.25051
作者: Filip Dobranić,Tina Munda,Oliver Pejić,Vojko Gorjanc,Uroš Šmajdek,David Bordon,Jakob Lenardič,Tjaša Konovšek,Kristina Pahor de Maiti Tekavčič,Ciril Bohak,Darja Fišer
机构: Institute of Contemporary History (当代历史研究所); Faculty of Arts, University of Ljubljana (卢布尔雅那大学艺术学院); Faculty of Computer Science, University of Ljubljana (卢布尔雅那大学计算机科学学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus, combining topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application, while also documenting important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then create NER graphs to explore the relationships between collective identities and places. We apply a mixed methods approach to analyse the named entity graphs, combining quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socionomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

[NLP-36] Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

【速读】: 该论文旨在解决当前科学领域专用模型与通用大模型之间存在的能力割裂问题,即如何在保持强大通用智能的同时显著提升模型在多个科学细分领域的专业能力。解决方案的关键在于构建一个参数规模达一万亿(1-trillion-parameter)的多模态基础模型Intern-S1-Pro,通过大规模数据训练和高效强化学习(Reinforcement Learning, RL)框架支持,实现对超过100项关键科学任务的深度掌握,涵盖化学、材料、生命科学及地球科学等领域;同时借助XTuner与LMDeploy基础设施保障训练与推理阶段的精度一致性,使模型兼具广泛的通用智能与高度可定制的专业化能力,从而成为“可定制的通用模型”(Specializable Generalist),在开源模型中处于通用能力顶尖水平,并在科学专项任务上超越闭源模型。

链接: https://arxiv.org/abs/2603.25040
作者: Yicheng Zou,Dongsheng Zhu,Lin Zhu,Tong Zhu,Yunhua Zhou,Peiheng Zhou,Xinyu Zhou,Dongzhan Zhou,Zhiwang Zhou,Yuhao Zhou,Bowen Zhou,Zhanping Zhong,Zhijie Zhong,Haiteng Zhao,Penghao Zhao,Xiaomeng Zhao,Zhiyuan Zhao,Yechen Zhang,Jin Zhang,Wenwei Zhang,Hongjie Zhang,Zhuo Zhang,Wenlong Zhang,Bo Zhang,Chao Zhang,Chen Zhang,Yuhang Zang,Fei Yuan,Jiakang Yuan,Jiashuo Yu,Jinhui Yin,Haochen Ye,Qian Yao,Bowen Yang,Danni Yang,Kaichen Yang,Ziang Yan,Jun Xu,Yicheng Xu,Wanghan Xu,Xuenan Xu,Chao Xu,Ruiliang Xu,Shuhao Xing,Long Xing,Xinchen Xie,Ling-I Wu,Zijian Wu,Zhenyu Wu,Lijun Wu,Yue Wu,Jianyu Wu,Wen Wu,Fan Wu,Xilin Wei,Qi Wei,Bingli Wang,Rui Wang,Ziyi Wang,Zun Wang,Yi Wang,Haomin Wang,Yizhou Wang,Lintao Wang,Yiheng Wang,Longjiang Wang,Bin Wang,Jian Tong,Zhongbo Tian,Huanze Tang,Chen Tang,Shixiang Tang,Yu Sun,Qiushi Sun,Xuerui Su,Qisheng Su,Chenlin Su,Demin Song,Jin Shi,Fukai Shang,Yuchen Ren,Pengli Ren,Xiaoye Qu,Yuan Qu,Jiantao Qiu,Yu Qiao,Runyu Peng,Tianshuo Peng,Jiahui Peng,Qizhi Pei,Zhuoshi Pan,Linke Ouyang,Wenchang Ning,Yichuan Ma,Zerun Ma,Ningsheng Ma,Runyuan Ma,Chengqi Lyu,Haijun Lv,Han Lv
机构: Shanghai AI Laboratory (上海人工智能实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.

[NLP-37] Imperative Interference: Social Register Shapes Instruction Topology in Large Language Models

【速读】: 该论文试图解决多语言场景下模型指令遵循(instruction-following)行为存在跨语言不一致的问题,即相同语义内容的指令在不同语言中表现出截然相反的交互拓扑(如英语指令表现为竞争性,而西班牙语指令表现为合作性)。解决方案的关键在于识别并干预“社会语域”(social register)的作用机制:研究发现,命令式语气(imperative mood)在不同语言社群中具有不同的强制力(obligatory force),模型通过多语言训练已习得这些语域惯例。通过将命令式指令重写为陈述式描述(如将“NEVER do X”改为“X: disabled”),可显著降低跨语言差异(减少81%,p = 0.029),且局部重写能引发未修改指令块的拓扑转变,体现语域对指令处理的社会性本质——即模型将指令视为社会行为而非纯技术规范。这一发现提示,在训练阶段若未考虑语域差异,可能导致基于命令式语气的对齐原则(constitutional AI)产生语言依赖性偏差。

链接: https://arxiv.org/abs/2603.25015
作者: Tony Mason
机构: the University of British Columbia (不列颠哥伦比亚大学); the Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:System prompt instructions that cooperate in English compete in Spanish, with the same semantic content, but opposite interaction topology. We present instruction-level ablation experiments across four languages and four models showing that this topology inversion is mediated by social register: the imperative mood carries different obligatory force across speech communities, and models trained on multilingual data have learned these conventions. Declarative rewriting of a single instruction block reduces cross-linguistic variance by 81% (p = 0.029, permutation test). Rewriting three of eleven imperative blocks shifts Spanish instruction topology from competitive to cooperative, with spillover effects on unrewritten blocks. These findings suggest that models process instructions as social acts, not technical specifications: “NEVER do X” is an exercise of authority whose force is language-dependent, while “X: disabled” is a factual description that transfers across languages. If register mediates instruction-following at inference time, it plausibly does so during training. We state this as a testable prediction: constitutional AI principles authored in imperative mood may create language-dependent alignment. Corpus: 22 hand-authored probes against a production system prompt decomposed into 56 blocks.
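
The p = 0.029 variance-reduction result comes from a permutation test. A self-contained sketch of one such test, with the statistic taken as the difference in cross-linguistic variance between the original and rewritten conditions, follows; the per-language scores are fabricated:

```python
import random

def perm_test_variance(group_a, group_b, n_perm=10000, seed=0):
    """One-sided permutation test: is var(group_b) lower than var(group_a)
    by more than chance? Groups hold hypothetical per-language metrics."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    observed = var(group_a) - var(group_b)
    pooled = group_a + group_b
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if var(a) - var(b) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one smoothing for a valid p-value

# Hypothetical per-language topology scores before vs after declarative rewriting:
p = perm_test_variance([0.9, -0.8, 0.7, -0.6], [0.2, 0.1, 0.15, 0.05])
```

Because the labels "imperative" vs "declarative" are exchangeable under the null, shuffling scores across the two conditions gives the reference distribution directly, with no distributional assumptions.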

[NLP-38] Exons-Detect: Identifying and Amplifying Exonic Tokens via Hidden-State Discrepancy for Robust AI-Generated Text Detection

【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成文本与人类撰写文本之间界限日益模糊所带来的社会风险,如虚假信息传播、署名不清及知识产权威胁等问题,核心挑战在于现有无需训练的检测方法在短文本或局部token修改场景下鲁棒性不足。解决方案的关键在于提出一种名为Exons-Detect的无训练检测方法,其创新性地引入“外显子感知”的token重加权视角:通过双模型设定测量隐藏状态差异来识别并放大具有判别力的外显子(exonic)token,并基于重要性加权后的token序列计算可解释的翻译得分(translation score),从而实现高精度且对对抗攻击和输入长度变化具有强鲁棒性的检测性能。

链接: https://arxiv.org/abs/2603.24981
作者: Xiaowei Zhu,Yubing Ren,Fang Fang,Shi Wang,Yanan Cao,Li Guo
机构: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Institute of Computing Science, Chinese Academy of Sciences, Beijing, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models has increasingly blurred the boundary between human-written and AI-generated text, raising societal risks such as misinformation dissemination, authorship ambiguity, and threats to intellectual property rights. These concerns highlight the urgent need for effective and reliable detection methods. While existing training-free approaches often achieve strong performance by aggregating token-level signals into a global score, they typically assume uniform token contributions, making them less robust under short sequences or localized token modifications. To address these limitations, we propose Exons-Detect, a training-free method for AI-generated text detection based on an exon-aware token reweighting perspective. Exons-Detect identifies and amplifies informative exonic tokens by measuring hidden-state discrepancy under a dual-model setting, and computes an interpretable translation score from the resulting importance-weighted token sequence. Empirical evaluations demonstrate that Exons-Detect achieves state-of-the-art detection performance and exhibits strong robustness to adversarial attacks and varying input lengths. In particular, it attains a 2.2% relative improvement in average AUROC over the strongest prior baseline on DetectRL.
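
The abstract does not give the exact scoring function. One hedged reading of "importance-weighted token sequence" is a softmax reweighting of per-token detection signals by the dual-model hidden-state discrepancy, so that informative "exonic" tokens dominate the aggregate; all numbers below are made up:

```python
import math

def weighted_detection_score(token_scores, discrepancies, tau=1.0):
    """Aggregate per-token detection signals with softmax weights derived from
    hidden-state discrepancy, amplifying high-discrepancy (exonic) tokens.
    This is an illustrative sketch, not the paper's exact formulation."""
    exps = [math.exp(d / tau) for d in discrepancies]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * s for w, s in zip(weights, token_scores))

# Hypothetical per-token scores and dual-model hidden-state discrepancies:
scores = [0.2, 0.9, 0.1, 0.8]
disc = [0.1, 2.0, 0.2, 1.5]

uniform = sum(scores) / len(scores)               # magnitude-agnostic baseline
weighted = weighted_detection_score(scores, disc) # exon-aware aggregate
```

Against a uniform average, the reweighting lets a few strongly discriminative tokens carry the decision, which is what should make the score robust to short inputs and localized edits.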

[NLP-39] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

【速读】: 该论文旨在解决大规模工业机器学习系统中特征选择(Feature Selection)的难题,传统方法依赖标注数据和统计启发式规则,在生产环境中因标注数据稀缺且需满足多种运行约束而难以应用。其解决方案的关键在于提出Model Feature Agent (MoFA),一个基于大语言模型(Large Language Model, LLM)驱动的框架,通过结构化提示(structured prompts)整合特征定义、重要性得分、相关性及元数据(如特征组或类型),实现基于推理的序贯特征选择,具备可解释性和约束感知能力,从而在真实工业场景中显著提升模型精度与效率。

链接: https://arxiv.org/abs/2603.24979
作者: Yuhang Zhou,Zhuokai Zhao,Ke Li,Spilios Evmorfos,Gökalp Demirci,Mingyi Wang,Qiao Liu,Qifei Wang,Serena Li,Weiwei Li,Tingting Wang,Mingze Gao,Gedi Zhou,Abhishek Kumar,Xiangjun Fan,Lizhu Zhang,Jiayi Liu
机构: Meta AI
类目: Computation and Language (cs.CL)
备注: 11 pages, 2 tables

点击查看摘要

Abstract:Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

[NLP-40] Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

【速读】: 该论文旨在解决当前教育自然语言处理(Natural Language Processing, NLP)领域对真实手写演算草稿(handwritten scratchwork)分析能力不足的问题,尤其是现有方法忽视了手写文本的多样性、复杂版式及多模态特性,且主流多模态大语言模型(Multimodal Large Language Models, MLLMs)通常采用“考生视角”,仅关注生成正确答案而非诊断学生错误。解决方案的关键在于提出一个名为ScratchMath的新基准,专门用于解释和分类真实手写数学草稿中的错误类型,包含1,720个来自中国中小学学生的样本,支持两个核心任务:错误原因解释(Error Cause Explanation, ECE)与错误原因分类(Error Cause Classification, ECC),并定义了七类具体错误类型;该数据集通过人机协同标注流程实现高质量注释,为评估MLLMs在视觉识别与逻辑推理方面的误差诊断能力提供了标准化平台,揭示了当前模型性能与人类专家之间的显著差距,并验证了大型推理模型在错误解释任务上的潜力。

链接: https://arxiv.org/abs/2603.24961
作者: Dingjie Song,Tianlong Xu,Yi-Fan Zhang,Hang Li,Zhiling Yan,Xing Fan,Haoyang Li,Lichao Sun,Qingsong Wen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 27th International Conference on Artificial Intelligence in Education (AIED’26)

点击查看摘要

Abstract:Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an “examinee perspective”, prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

[NLP-41] Toward domain-specific machine translation and quality estimation systems

【速读】: 该论文旨在解决机器翻译(Machine Translation, MT)与质量评估(Quality Estimation, QE)系统在面对领域不匹配(domain mismatch)时性能显著下降的问题。解决方案的关键在于通过数据驱动的方法实现高效的领域自适应:首先,采用基于相似性的数据选择策略,从大规模语料中筛选出小而精准的领域内子集,以较低计算成本获得高质量翻译;其次,提出分阶段的QE训练流程,融合领域适应与轻量级数据增强技术,在多种语言和资源条件下(包括零样本和跨语言场景)提升模型泛化能力;再次,强调子词分词(subword tokenization)与词汇表的一致性对微调稳定性及翻译质量的重要性;最后,设计一种基于QE引导的上下文学习方法,利用QE模型动态选择示例来优化大语言模型的翻译表现,且无需参数更新,同时支持无参考文本设置。整体而言,论文表明领域适应的核心在于数据选择、表示一致性与高效适配策略的协同作用。

链接: https://arxiv.org/abs/2603.24955
作者: Javad Pourmostafa Roshan Sharami
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PhD Dissertation

点击查看摘要

Abstract:Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.

[NLP-42] FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol ICASSP2026

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在真实金融场景中进行工具调用(tool invocation)能力评估缺乏标准化、实用且具有挑战性的基准测试问题。现有评估方法难以全面衡量模型在复杂金融任务中的推理与工具协同能力,尤其在涉及多工具调用、多轮交互及真实金融协议(Financial Model Context Protocols, MCPs)情境下存在明显不足。解决方案的关键在于构建FinMCP-Bench——一个包含613个样本、覆盖10个主场景和33个子场景的新型基准测试集,涵盖真实与合成用户查询,并整合65个真实金融MCPs,支持单工具、多工具及多轮对话三种任务类型,从而系统评估主流LLMs在金融代理(financial LLM agents)场景下的工具调用准确率与推理能力。

链接: https://arxiv.org/abs/2603.24943
作者: Jie Zhu,Yimin Tian,Boyang Li,Kehao Wu,Zhongzhi Liang,Junhui Li,Xianyin Zhang,Lifan Guo,Feng Chen,Yong Liu,Chi Zhang
机构: Alibaba Cloud Computing (阿里巴巴云计算); YINGMI Wealth Management (盈米财富管理); Soochow University (苏州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples (single-tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.

[NLP-43] Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

【速读】: 该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因处理密集视觉标记(visual tokens)而导致的显著推理延迟问题。现有token缩减方法主要依赖注意力强度作为静态选择标准,但本文指出高注意力token具有任务依赖性,甚至可能损害策略性能。解决方案的关键在于提出TIES(Tau-guided Inter-layer Efficient Selection)框架,该框架通过跨层token排序一致性(inter-layer token ranking consistency)动态引导token选择,自适应地平衡注意力强度与排序一致性,从而实现无需额外训练的鲁棒token筛选,在CogACT + SIMPLER基准上将平均成功率提升6%,同时减少78%的token使用量,并展现出对多种解码器和基准的良好泛化能力。

链接: https://arxiv.org/abs/2603.24941
作者: Peiju Liu,Jinming Liu,Xipeng Qiu,Xuanjing Huang
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 7 figures, preprint

点击查看摘要

Abstract:Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
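下面用纯 Python 给出"注意力强度 + 跨层排序一致性"这一选择思路的最小示意。注意:打分公式、混合权重 alpha 与一致性的归一化方式均为本文为说明而假设的简化形式,并非 TIES 论文的原始实现:

```python
def select_tokens(attn, keep, alpha=0.5):
    """示意性 token 选择:结合注意力强度与跨层排名一致性。

    attn: 每层一个列表,attn[l][t] 为视觉 token t 在第 l 层获得的
    注意力质量。具体打分规则为假设,仅用于说明思路。
    """
    layers, n = len(attn), len(attn[0])
    # 每层中各 token 的排名(0 = 注意力最低)
    ranks = []
    for layer in attn:
        order = sorted(range(n), key=lambda t: layer[t])
        r = [0] * n
        for rank, t in enumerate(order):
            r[t] = rank
        ranks.append(r)
    scores = []
    for t in range(n):
        rs = [ranks[l][t] for l in range(layers)]
        mean_r = sum(rs) / layers
        spread = (sum((x - mean_r) ** 2 for x in rs) / layers) ** 0.5
        consistency = 1.0 - spread / (n / 2)      # 排名跨层越稳定得分越高
        magnitude = sum(attn[l][t] for l in range(layers)) / layers
        scores.append(alpha * magnitude + (1 - alpha) * consistency)
    # 返回得分最高的 keep 个 token 的下标
    return sorted(sorted(range(n), key=lambda t: scores[t])[-keep:])
```

直觉上,在各层排名都稳定靠前的 token 会同时获得高一致性与高强度得分,因而被优先保留。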

[NLP-44] LogitScope: A Framework for Analyzing LLM Uncertainty Through Information Metrics

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)输出中不确定性难以量化和定位的问题,尤其关注在生成过程中每个标记(token)级别的置信度评估不足这一局限。传统评估方法无法提供细粒度的模型信心信息,导致难以识别幻觉(hallucination)或高风险决策点。解决方案的关键在于提出 LogitScope——一个轻量级框架,通过从概率分布中计算 token 级别的信息度量(如熵和方差熵)来分析模型不确定性,从而揭示模型置信度模式、识别潜在幻觉并定位高不确定性的决策点。该方法无需标注数据或语义解释,具备模型无关性、计算高效性(基于懒加载优化),且兼容 HuggingFace 生态系统,适用于研究与生产环境中的推理行为监控。

链接: https://arxiv.org/abs/2603.24929
作者: Farhan Ahmed,Yuya Jeremy Ong,Chad DeLuca
机构: IBM Research (IBM 研究院); Plastic Labs (塑料实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Understanding and quantifying uncertainty in large language model (LLM) outputs is critical for reliable deployment. However, traditional evaluation approaches provide limited insight into model confidence at individual token positions during generation. To address this issue, we introduce LogitScope, a lightweight framework for analyzing LLM uncertainty through token-level information metrics computed from probability distributions. By measuring metrics such as entropy and varentropy at each generation step, LogitScope reveals patterns in model confidence, identifies potential hallucinations, and exposes decision points where models exhibit high uncertainty, all without requiring labeled data or semantic interpretation. We demonstrate LogitScope’s utility across diverse applications including uncertainty quantification, model behavior analysis, and production monitoring. The framework is model-agnostic, computationally efficient through lazy evaluation, and compatible with any HuggingFace model, enabling both researchers and practitioners to inspect LLM behavior during inference.
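摘要中逐 token 的熵与 varentropy 可以按如下方式从概率分布计算。此处 varentropy 取"惊异度(-log p)的方差"这一常见定义,仅为示意,未必与 LogitScope 的具体实现完全一致:

```python
import math

def token_metrics(probs):
    """计算某一步 next-token 分布的熵与 varentropy(惊异度方差)。"""
    pairs = [(p, -math.log(p)) for p in probs if p > 0]
    entropy = sum(p * s for p, s in pairs)
    varentropy = sum(p * (s - entropy) ** 2 for p, s in pairs)
    return entropy, varentropy
```

均匀分布时熵达到最大(log n)而 varentropy 为 0;分布尖锐时熵下降、varentropy 往往上升,可据此定位模型不确定的决策点。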

[NLP-45] Estimating near-verbatim extraction risk in language models with decoding-constrained beam search

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中近似复现(near-verbatim)记忆提取风险难以量化的问题。现有方法如标准贪婪解码(greedy-decoding)和概率提取(probabilistic extraction)仅能捕捉完全复现(verbatim memorization)的隐私与版权风险,而忽略了大量语义相似但字面不同的文本片段,这些片段同样构成实质性风险。为降低近似复现提取风险的计算成本,作者提出解码约束束搜索(decoding-constrained beam search),其关键在于通过引入解码约束条件,在保持计算效率的同时获得近似复现提取风险的确定性下界,计算开销仅为约20次蒙特卡洛(Monte Carlo, MC)采样的水平,显著优于传统MC估计所需的约10万次采样。实验表明,该方法揭示了传统方法无法发现的信息:更多可提取序列、更高的单序列提取质量分布以及模型规模和文本类型对近似复现风险的影响模式。

链接: https://arxiv.org/abs/2603.24917
作者: A. Feder Cooper,Mark A. Lemley,Christopher De Sa,Lea Duesterwald,Allison Casasola,Jamie Hayes,Katherine Lee,Daniel E. Ho,Percy Liang
机构: Stanford (斯坦福大学); Cornell University (康奈尔大学); Google DeepMind (谷歌深思)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work shows that standard greedy-decoding extraction methods for quantifying memorization in LLMs miss how extraction risk varies across sequences. Probabilistic extraction – computing the probability of generating a target suffix given a prefix under a decoding scheme – addresses this, but is tractable only for verbatim memorization, missing near-verbatim instances that pose similar privacy and copyright risks. Quantifying near-verbatim extraction risk is expensive: the set of near-verbatim suffixes is combinatorially large, and reliable Monte Carlo (MC) estimation can require ~100,000 samples per sequence. To mitigate this cost, we introduce decoding-constrained beam search, which yields deterministic lower bounds on near-verbatim extraction risk at a cost comparable to ~20 MC samples per sequence. Across experiments, our approach surfaces information invisible to verbatim methods: many more extractable sequences, substantially larger per-sequence extraction mass, and patterns in how near-verbatim extraction risk manifests across model sizes and types of text.
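以一个玩具二元字符语言模型为例,可以示意"解码约束束搜索"如何得到近似复现概率的确定性下界:生成过程中即时剪除超出允许距离的分支,存活序列的概率之和必然不超过真实提取概率。以下将"近似复现"简化为 Hamming 距离约束,模型、词表与参数均为假设:

```python
VOCAB = "ab"

def p_next(prev, c):
    # 玩具语言模型:略微偏好重复上一个字符(0.7 + 0.3 = 1,已归一化)
    return 0.7 if c == prev else 0.3

def lower_bound(target, beam_width=4, max_dist=1):
    """对"生成与 target 的 Hamming 距离不超过 max_dist 的序列"的
    概率给出确定性下界:剪枝只会丢失概率质量,故返回值不会高估。"""
    beams = [("", 1.0, 0)]                       # (序列, 概率, 已累积距离)
    for t in target:
        nxt = []
        for seq, prob, dist in beams:
            prev = seq[-1] if seq else target[0]  # 第一步以 target[0] 作虚拟前缀
            for c in VOCAB:
                d = dist + (c != t)
                if d <= max_dist:                 # 解码中即时施加约束
                    nxt.append((seq + c, prob * p_next(prev, c), d))
        beams = sorted(nxt, key=lambda b: -b[1])[:beam_width]
    return sum(prob for _, prob, _ in beams)
```

束宽越大、约束内的分支保留越完整,下界越紧;约束放宽到覆盖全部序列时,概率质量之和回到 1。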

[NLP-46] LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis

【速读】: 该论文针对维度情感分析(Dimensional Aspect-Based Sentiment Analysis, DimABSA)中多语言、多领域下Valence(效价)与Arousal(唤醒度)两个连续情感维度预测难度不一致的问题提出解决方案。传统方法通常将情感视为离散标签,而DimABSA要求预测1-9尺度上的连续VA分数,其挑战在于不同语言和领域中Valence与Arousal的预测难度差异显著。论文提出的解决方案关键在于引入学习型同方差不确定性(learned homoscedastic uncertainty),即模型在训练过程中自动学习每个回归任务的对数方差参数(log-variance parameters),从而动态平衡Valence与Arousal的损失权重,实现任务间的自适应优化。实验表明,该方法结合语言特定编码器与多种子集成策略,在五个数据集上均取得最佳性能,且学习到的方差权重因语言而异(如德语为0.66倍,英语为2.18倍),验证了最优任务平衡具有语言依赖性,不可预先设定。

链接: https://arxiv.org/abs/2603.24896
作者: Baraa Hikal,Jonas Becker,Bela Gipp
机构: University of Göttingen, Germany; LKA NRW, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1-9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence-Arousal difficulty profiles (from 0.66x for German to 2.18x for English), demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
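摘要所述"学习对数方差以自动平衡各回归任务"的做法,通常对应 Kendall 等人提出的同方差不确定性加权损失 L = Σ_i exp(-s_i)·L_i + s_i,其中 s_i = log σ_i² 为可学习参数;对固定的 L_i,最优 s_i = log L_i,等效任务权重为 1/L_i。以下为示意实现,与 LogSigma 的具体参数化可能有出入:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """同方差不确定性加权:每个任务损失 L_i 以 exp(-s_i) 缩放,
    并加正则项 s_i 防止 s_i 无限增大(s_i = log(sigma_i^2))。"""
    return sum(math.exp(-s) * L + s for L, s in zip(task_losses, log_vars))
```

训练中 s_i 与模型参数一同用梯度下降优化,从而在 Valence 与 Arousal 两个目标之间自适应分配权重。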

[NLP-47] How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在评估中过度关注感知真实性的局限性,即现有基准大多仅衡量生成的3D布局、形状和外观是否视觉上合理,而忽视了模型对实际建造过程中步骤逻辑与物理依赖关系的理解能力——这是实现从设计到施工自动化管道的关键。解决方案的核心在于提出DreamHouse这一新型基准,用于评测“物理生成推理”(physical generative reasoning),即合成同时满足几何、结构、可施工性和规范合规性约束的建筑构件的能力;其关键创新包括:以木结构住宅建造为场景,基于明确工程标准(LOD 350)构建超过26,000个经验证的结构数据集,并开发一套确定性的10项结构验证框架;更重要的是,DreamHouse支持迭代式代理交互机制,允许模型观察建造中间状态、生成施工动作并接收结构反馈,从而实现对规划、结构推理与自我修正能力的细粒度评估。

链接: https://arxiv.org/abs/2603.24866
作者: Luyu Yang,Yutong Dai,An Yan,Viraj Prabhu,Ran Xu,Zeyuan Chen
机构: Salesforce AI Research (Salesforce人工智能研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at this https URL

[NLP-48] AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

【速读】: 该论文旨在解决当前机器学习(Machine Learning, ML)系统安全研究中缺乏统一框架的问题,即现有攻击与防御方法多被孤立看待,未能揭示其内在关联性与相互依赖关系,从而阻碍了对模型-数据双向风险传播机制的系统理解。解决方案的关键在于提出一个统一的闭环威胁分类法(unified closed-loop threat taxonomy),该框架通过四个方向轴明确刻画模型与数据之间的交互关系,将安全威胁划分为四类:(1) 数据→数据(D→D),如数据解密和水印移除;(2) 数据→模型(D→M),如投毒、有害微调和越狱攻击;(3) 模型→数据(M→D),如模型逆向、成员推断和训练数据提取;(4) 模型→模型(M→M),如模型提取攻击。此框架不仅厘清了各类威胁间的内在联系,还为构建可扩展、可迁移且跨模态的防御策略提供了理论基础,尤其适用于基础模型(foundation models)的安全防护。

链接: https://arxiv.org/abs/2603.24857
作者: Zhenyi Wang,Siyu Luan
机构: University of Central Florida (中佛罗里达大学); University of Copenhagen (哥本哈根大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published at Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:As machine learning (ML) systems expand in both scale and functionality, the security landscape has become increasingly complex, with a proliferation of attacks and defenses. However, existing studies largely treat these threats in isolation, lacking a coherent framework to expose their shared principles and interdependencies. This fragmented view hinders systematic understanding and limits the design of comprehensive defenses. Crucially, the two foundational assets of ML, data and models, are no longer independent; vulnerabilities in one directly compromise the other. The absence of a holistic framework leaves open questions about how these bidirectional risks propagate across the ML pipeline. To address this critical gap, we propose a unified closed-loop threat taxonomy that explicitly frames model-data interactions along four directional axes. Our framework offers a principled lens for analyzing and defending foundation models. The resulting four classes of security threats represent distinct but interrelated categories of attacks: (1) Data → Data (D→D): including data decryption attacks and watermark removal attacks; (2) Data → Model (D→M): including poisoning, harmful fine-tuning attacks, and jailbreak attacks; (3) Model → Data (M→D): including model inversion, membership inference attacks, and training data extraction attacks; (4) Model → Model (M→M): including model extraction attacks. Our unified framework elucidates the underlying connections among these security threats and establishes a foundation for developing scalable, transferable, and cross-modal security strategies, particularly within the landscape of foundation models.

[NLP-49] Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

【速读】: 该论文旨在解决语言模型(Language Model, LM)在面对存在多个合理答案或固有不确定性的实际任务时,因后训练过程将答案分布坍缩为单一主导模式而导致的表达能力不足问题。例如,在医疗诊断、模糊问答及信息不完整场景中,模型应能生成多个 plausible 假设并附带置信度估计,而非仅输出一个答案。其解决方案的关键在于提出一种多答案强化学习(multi-answer reinforcement learning, multi-answer RL)方法,通过修改强化学习目标函数,使模型能够在单次前向传播中显式生成多个候选答案,并将推理时搜索机制内化至生成过程中,从而实现高效且分布合理的多答案推理。实验表明,该方法在多样性、覆盖率和集合级校准性上优于单答案训练基线,且在编码任务中准确率显著提升,同时减少生成多答案所需的 token 数量。

链接: https://arxiv.org/abs/2603.24844
作者: Isha Puri,Mehul Damani,Idan Shenfeld,Marzyeh Ghassemi,Jacob Andreas,Yoon Kim
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at this https URL.
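多答案 RL 的核心是对"答案集合"而非单个答案计分。下面是一个集合级奖励的玩具示例:覆盖参考答案得分、多余答案受罚,从而同时鼓励覆盖率与精确性。奖励形式与惩罚系数均为本文假设,论文实际的 RL 目标函数更复杂:

```python
def set_reward(predicted, reference, wrong_penalty=0.25):
    """玩具集合级奖励:参考答案的覆盖比例,减去每个多余答案的罚分。"""
    predicted, reference = set(predicted), set(reference)
    coverage = len(predicted & reference) / len(reference)
    spurious = len(predicted - reference)          # 不在参考集内的答案数
    return coverage - wrong_penalty * spurious
```

这类奖励使模型在单次前向中输出多个候选时,既不会遗漏合理假设,也不会靠大量乱猜刷覆盖率。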

[NLP-50] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练效率和学习信号强度方面的瓶颈问题。具体而言,现有方法如GRPO和DAPO因需对每个提示(prompt)采样大量轨迹(rollout),导致计算开销巨大;同时,由于相对优势(relative advantage)常呈现稀疏性——多数样本趋于全正确或全错误,造成组内奖励方差低、学习信号弱。解决方案的关键在于提出arrol(Accelerating RLVR via online Rollout Pruning),其核心机制是在生成过程中在线剪枝(pruning)轨迹,并通过一个轻量级质量头(quality head)实时预测部分轨迹的成功概率,从而引导幸存轨迹保持更平衡的正确性分布,增强学习信号。此外,arrol将剪枝操作集成至推理引擎内部并动态重批处理剩余轨迹,显著提升训练效率与测试时缩放(test-time scaling)下的推理准确性。

链接: https://arxiv.org/abs/2603.24840
作者: Haobo Xu,Sirui Chen,Ruizhong Qiu,Yuchen Yan,Chen Luo,Monica Cheng,Jingrui He,Hanghang Tong
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones more correctness-balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at this https URL.
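arrol 用轻量质量头预测部分轨迹的成功概率,并在生成过程中剪枝,同时让存活轨迹在"预测正确/预测错误"之间更均衡,以提高组内奖励方差。下面是该选择目标的一个假设性简化(并非论文原始剪枝规则):按预测成功率排序后,保留两端各约一半:

```python
def prune_rollouts(pred_success, keep):
    """保留 keep 条轨迹:取预测最可能失败与最可能成功的各约一半,
    使幸存组的正确性分布尽量均衡(示意性选择规则)。"""
    order = sorted(range(len(pred_success)), key=lambda i: pred_success[i])
    half = keep // 2
    kept = order[:half] + order[-(keep - half):]   # 低分端 + 高分端
    return sorted(kept)
```

被剪掉的多是"中间地带"轨迹;相比全正确或全错误的组,这样的幸存组能给 GRPO/DAPO 提供更强的相对优势信号。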

[NLP-51] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

【速读】: 该论文旨在解决合成数据生成(synthetic data generation)在语言模型预训练中如何与源数据质量相互作用的问题,特别是在葡萄牙语场景下的持续预训练(continued pretraining)中。其核心问题是:当使用文档重写(document rewriting)技术生成合成数据时,若源数据质量不同,这种合成方法是否仍能有效提升模型性能?解决方案的关键在于设计了一个受控实验,基于ClassiCC-PT这一标注了STEM和教育质量评分的葡萄牙语语料库,构建两个不同质量级别的10B-token子集,并分别用一个7B参数的指令微调模型改写为四种风格,从而生成约40B tokens的合成数据。通过在两个不同规模(1.1B和7B参数)的英文基础模型上进行训练并评估PoETa V2基准测试,发现合成重写主要作为质量放大器(quality multiplier),而非数据筛选的替代方案,且该效应具有明显的规模依赖性。

链接: https://arxiv.org/abs/2603.24826
作者: Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini
机构: University of Campinas (UNICAMP); Maritaca AI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

[NLP-52] Enhancing Structured Meaning Representations with Aspect Classification

【速读】: 该论文旨在解决语义表示框架中aspect(方面)标注稀疏的问题,这一问题阻碍了人工标注效率及自动预测系统的发展。其关键解决方案是构建一个全新的英语句子数据集,该数据集基于缺乏aspect特征的抽象意义表示(Abstract Meaning Representation, AMR)图,通过统一意义表示(Uniform Meaning Representations, UMR)的aspect标签进行标注,并采用多步骤仲裁流程确保标注一致性与质量,从而为未来自动化任务提供基准和基础。

链接: https://arxiv.org/abs/2603.24797
作者: Claire Benét Post,Paul Bontempo,August Milliken,Alvin Po-Chun Chen,Nicholas Derby,Saksham Khatwani,Sumeyye Nabieva,Karthik Sairam,Alexis Palmer
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 3 figures, 8 tables

点击查看摘要

Abstract:To fully capture the meaning of a sentence, semantic representations should encode aspect, which describes the internal temporal structure of events. In graph-based meaning representation frameworks such as Uniform Meaning Representations (UMR), aspect lets one know how events unfold over time, including distinctions such as states, activities, and completed events. Despite its importance, aspect remains sparsely annotated across semantic meaning representation frameworks. This has, in turn, hindered not only current manual annotation, but also the development of automatic systems capable of predicting aspectual information. In this paper, we introduce a new dataset of English sentences annotated with UMR aspect labels over Abstract Meaning Representation (AMR) graphs that lack the feature. We describe the annotation scheme and guidelines used to label eventive predicates according to the UMR aspect lattice, as well as the annotation pipeline used to ensure consistency and quality across annotators through a multi-step adjudication process. To demonstrate the utility of our dataset for future automation, we present baseline experiments using three modeling approaches. Our results establish initial benchmarks for automatic UMR aspect prediction and provide a foundation for integrating aspect into semantic meaning representations more broadly.

[NLP-53] Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

【速读】: 该论文旨在解决低资源语言(如芬兰语)在电子健康记录(EHR)系统中临床文档生成效率低下、医生工作负担重的问题,尤其针对芬兰语医疗对话的自动转录与生成。其解决方案的关键在于通过小规模已验证的模拟临床对话语料库对LLaMA 3.1-8B大语言模型进行领域对齐微调(domain-aligned fine-tuning),并采用受控的预处理和优化策略提升模型在芬兰语医学话语中的表现。评估结果显示,尽管n-gram重叠度较低(BLEU=0.1214),但语义相似性高(BERTScore F1=0.8230),表明该方法在保持隐私的前提下,能够有效支持芬兰语临床文档的自动化生成,为未来低资源语言环境下基于隐私保护的大语言模型应用提供了可行路径。

链接: https://arxiv.org/abs/2603.24772
作者: Mohammed Nowshad Ruhani Chowdhury,Mohammed Nowaz Rabbani Chowdhury,Sakari Lukkarinen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study aims to investigate the effectiveness of a domain-aligned natural language processing (NLP); large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. The fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed a low n-gram overlap but a strong semantic similarity with reference transcripts. This study indicate that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and support the feasibility of fine-tuning a privacy-oriented domain-specific large language model for clinical documentation in Finnish. Beside that provide directions for future work.

[NLP-54] Fine-Tuning A Large Language Model for Systematic Review Screening

【速读】: 该论文旨在解决系统性综述(systematic review)中标题和摘要筛选环节耗时耗力的问题,传统方法依赖大量人工阅读海量文献以确定纳入标准。其解决方案的关键在于:针对特定任务对一个参数量为12亿的小型开源大语言模型(large language model, LLM)进行微调(fine-tuning),利用人类对超过8500条标题和摘要的标注数据提供上下文信息,从而显著提升模型在筛选任务中的准确性与一致性。实验表明,微调后的模型相较于基础模型在加权F1分数上提升80.79%,且在全量8277条研究数据上的预测结果与人类编码者达成86.40%的一致性,验证了基于领域数据微调LLM在大规模系统性综述筛选中的有效性与稳定性。

链接: https://arxiv.org/abs/2603.24767
作者: Kweku Yamoah,Noah Schroeder,Emmanuel Dorley,Neha Rani,Caleb Schutz
机构: University of Florida (佛罗里达大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Systematic reviews traditionally have taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, a 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.

[NLP-55] SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks WWW

【速读】: 该论文旨在解决当前生成式 AI 在软件开发中评估方式的局限性问题——现有基准测试多聚焦于单次生成结果是否通过完整规范,而忽视了代码在迭代扩展过程中的质量退化现象。这种“通过即成功”的评估范式无法真实反映代码在长期演化中的可维护性和架构稳定性。解决方案的关键在于提出一个语言无关的迭代式基准测试平台 SlopCodeBench,其包含 20 个问题和 93 个检查点(checkpoint),要求代理(agent)在不断演化的规范下持续扩展自身代码,且不强制内部结构设计。该平台引入两个轨迹级质量指标:冗余度(verbosity,即重复代码比例)与结构侵蚀度(structural erosion,即复杂度集中于高复杂度函数的比例),从而量化代码在迭代过程中质量的系统性下降趋势。实证表明,无论何种模型,均无法在任何问题上端到端完成任务,且代码质量随迭代显著恶化,揭示了当前代理缺乏满足迭代软件开发所需的架构设计纪律。

链接: https://arxiv.org/abs/2603.24755
作者: Gabriel Orlanski,Devjeet Roy,Alexander Yun,Changho Shin,Alex Gu,Albert Ge,Dyah Adila,Frederic Sala,Aws Albarghouthi
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Washington State University (华盛顿州立大学); MIT (麻省理工学院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code and Leaderboards are located at this https URL

点击查看摘要

Abstract:Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent’s design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
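论文追踪的两个轨迹级质量信号可以粗略地如下计算:verbosity 取重复代码行占比,structural erosion 取复杂度质量中集中于高复杂度函数的份额。阈值与具体口径为本文假设,基准的真实定义可能不同:

```python
def verbosity(lines):
    """重复(非唯一)代码行占比,作为冗余度的粗略代理。"""
    stripped = [l.strip() for l in lines if l.strip()]
    return 1 - len(set(stripped)) / len(stripped) if stripped else 0.0

def structural_erosion(complexities, threshold=10):
    """总复杂度质量中,集中在复杂度超过 threshold 的函数里的份额。"""
    total = sum(complexities)
    heavy = sum(c for c in complexities if c > threshold)
    return heavy / total if total else 0.0
```

在逐检查点的轨迹上重复计算这两个指标,即可量化代码随迭代"变臃肿、变头重脚轻"的趋势。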

[NLP-56] raining LLM s for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

【速读】: 该论文致力于解决多步工具编排(multi-step tool orchestration)中的关键挑战,即大语言模型(LLM)在执行依赖API调用序列时,常因参数值错误或顺序不当而导致全流程失败。现有训练环境多局限于单轮函数调用的模拟数据,且采用二元奖励机制无法提供部分正确性的反馈信号。其解决方案的关键在于:首先构建一个基于真实API响应缓存的强化学习环境,支持高效生成可控复杂度的多步调用轨迹;其次设计分层奖励机制,将正确性解耦为原子有效性(atomic validity,即逐级细化的单个函数调用正确性)与编排正确性(orchestration,即尊重依赖关系的工具调用顺序),从而实现对中间步骤的精细化引导。实验表明,该方法在ComplexFuncBench上显著提升了每一步的准确率,且消融实验证明两类奖励缺一不可。

链接: https://arxiv.org/abs/2603.24709
作者: Cheng Jiayang,Xin Liu,Zhihan Zhang,Haoyang Wen,Zixuan Zhang,Qingyu Yin,Shiyang Li,Priyanka Nigam,Bing Yin,Chao Zhang,Yangqiu Song
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
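摘要中的分层奖励(原子有效性 + 编排正确性)可示意如下:前者衡量逐个函数调用与期望的一致比例,后者检查依赖关系是否按序满足。混合权重与具体口径均为本文假设:

```python
def graduated_reward(calls, expected, deps, w_atomic=0.5):
    """分层奖励示意:原子有效性与编排正确性的加权和。

    calls/expected: 实际与期望的工具调用名序列;
    deps: (a, b) 形式的依赖对,要求 a 必须先于 b 被调用。
    """
    correct = sum(c == e for c, e in zip(calls, expected)) / len(expected)
    pos = {name: i for i, name in enumerate(calls)}
    ok = sum(a in pos and b in pos and pos[a] < pos[b] for a, b in deps)
    orchestration = ok / len(deps) if deps else 1.0
    return w_atomic * correct + (1 - w_atomic) * orchestration
```

相比二元奖励,这种设计即使在全序列失败时,也能为"部分调用正确""顺序部分满足"提供可学习的中间信号。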

[NLP-57] Demystifying When Pruning Works via Representation Hierarchies

【速读】: 该论文旨在解决网络剪枝(network pruning)在不同语言任务中表现不一致的问题,特别是为何剪枝模型在非生成性任务(如检索和多项选择)中仍能保持性能,而在生成性任务中却常出现显著退化。其解决方案的关键在于从表示层次(representation-hierarchy)视角出发,将语言模型内部计算分解为三个连续空间:嵌入空间(embedding)、logit空间(pre-softmax输出)和概率空间(post-softmax分布)。研究发现,嵌入空间与logit空间的表示对剪枝引起的扰动具有较强鲁棒性,但logits到概率的非线性变换会放大这些扰动,并在生成过程中逐时间步累积,导致生成质量严重下降;相比之下,概率空间中的类别 token 概率子空间稳定性较高,结合嵌入空间的鲁棒性,解释了剪枝在非生成任务中的有效性。这一分析为剪枝策略在不同类型任务中的适用性提供了理论依据和实践指导。

链接: https://arxiv.org/abs/2603.24652
作者: Shwai He,Guoheng Sun,Haichao Zhang,Yun Fu,Ang Li
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, 21 figures, Table 3

点击查看摘要

Abstract:Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at this https URL
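"嵌入与 logit 空间鲁棒、softmax 放大扰动并随生成步累积"的现象,可以用一个极小的数值例子体会(数值纯属示意):logit 上 0.12 的扰动足以在近乎并列的两个 token 间翻转 argmax;若每步贪心解码以概率 q 保持正确,T 步生成的存活率为 q^T,误差随步数指数累积:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# 两个近乎并列的 logit:剪枝引起的微小扰动就能翻转 argmax
clean  = softmax([2.00, 1.90])
pruned = softmax([1.88, 1.90])   # 最高 logit 仅偏移 0.12

# 若每步贪心解码以概率 q 保持正确,T 步生成的存活率为 q**T
q, T = 0.98, 100
survival = q ** T
```

这解释了为何剪枝模型在单步判别类任务(检索、多项选择)上几乎无损,而在长序列生成中退化严重。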

[NLP-58] When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews LREC2026

【速读】: 该论文旨在解决自动抑郁检测模型在半结构化医患对话中缺乏可解释性的问题,特别是模型性能看似优异但可能依赖于访谈者提示(interviewer prompts)等非患者语言特征的系统性偏差。解决方案的关键在于识别并区分模型决策证据的来源:通过限制模型仅使用患者话语(participant utterances),而非包含访谈者提示的完整对话,能够使决策依据更广泛地分布于真实语义线索,从而避免因利用固定提示和位置信息而产生的虚假高分类性能。该方法揭示了跨数据集、架构无关的偏差现象,并强调了基于时间与说话人维度定位决策证据的重要性,以确保模型真正从患者语言中学习。

链接: https://arxiv.org/abs/2603.24651
作者: Hasindri Watawana,Sergio Burdisso,Diego A. Moreno-Galván,Fernando Sánchez-Vega,A. Pastor López-Monroy,Petr Motlicek,Esaú Villatoro-Tello
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to LREC 2026 Conference

点击查看摘要

Abstract:Automatic depression detection from doctor-patient conversations has gained momentum thanks to the availability of public corpora and advances in language modeling. However, interpretability remains limited: strong performance is often reported without revealing what drives predictions. We analyze three datasets: ANDROIDS, DAIC-WOZ, E-DAIC and identify a systematic bias from interviewer prompts in semi-structured interviews. Models trained on interviewer turns exploit fixed prompts and positions to distinguish depressed from control subjects, often achieving high classification scores without using participant language. Restricting models to participant utterances distributes decision evidence more broadly and reflects genuine linguistic cues. While semi-structured protocols ensure consistency, including interviewer prompts inflates performance by leveraging script artifacts. Our results highlight a cross-dataset, architecture-agnostic bias and emphasize the need for analyses that localize decision evidence by time and speaker to ensure models learn from participants’ language.

[NLP-59] X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

【速读】: 该论文旨在解决端到端(End-to-End, E2E)语音大语言模型(Speech Large Language Models, Speech LLMs)在复杂任务中性能显著低于文本基线模型的问题,尤其是标准监督微调(Supervised Fine-Tuning, SFT)和强化学习(Reinforcement Learning, RL)方法难以弥合这一差距。解决方案的关键在于提出一种新型跨模态在线策略蒸馏框架(Cross-Modal On-Policy Distillation, X-OPD),其核心机制是通过在线策略采样(on-policy rollouts)让语音LLM自主探索自身分布,并由文本教师模型对生成的轨迹提供逐标记(token-level)反馈,从而将教师模型的能力有效蒸馏至学生模型的多模态表示中,实现语音与文本能力的系统性对齐。

链接: https://arxiv.org/abs/2603.24596
作者: Di Cao,Dongjie Fu,Hai Yu,Siqi Zheng,Xu Tan,Tao Jin
机构: Tencent Hunyuan (腾讯混元); Zhejiang University (浙江大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages

点击查看摘要

Abstract:While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher’s capabilities into student’s multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model’s inherent capabilities.

信息检索

[IR-0] Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中知识库静态不变导致的检索效率低下问题,即在面对碎片化且隐含于无关内容中的事实时,传统RAG方法难以有效定位和利用相关信息。其解决方案的关键在于将知识库视为可训练组件,并提出WriteBack-RAG框架:通过标注样例识别检索成功的位置,隔离相关文档并将其提炼为紧凑的知识单元,再索引至原始语料库中。由于该方法仅修改语料库,可作为离线预处理步骤与任意RAG流水线结合使用,实验证明其在四种RAG方法、六个基准测试和两个大语言模型(Large Language Model, LLM)骨干网络下均实现性能提升,平均增益达+2.14%,且跨方法迁移实验表明改进效果源自语料库本身的优化。

链接: https://arxiv.org/abs/2603.25737
作者: Yuxing Lu,Xukai Zhao,Wei Wu,Jinzhuo Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 15 pages

点击查看摘要

Abstract:The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.
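摘要描述的"识别检索成功处、隔离相关文档、蒸馏为知识单元并写回语料库"流程可用如下草图示意。其中 retrieve、distill 均为我们假设的占位实现(词重叠检索、简单拼接),并非论文原始代码:

```python
def write_back(corpus, examples, retrieve, distill):
    """用带标注样例把检索成功的证据蒸馏成知识单元并写回语料库。"""
    for ex in examples:
        docs = retrieve(ex["query"], corpus)
        # 以"文档是否包含答案"作为检索成功的粗略判据(示意性假设)
        relevant = [d for d in docs if ex["answer"] in d]
        if relevant:
            corpus.append(distill(ex["query"], relevant))
    return corpus

# 占位实现:按词重叠取 top-2 文档;把相关文档拼成紧凑知识单元
def retrieve(query, corpus):
    qs = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(qs & set(d.lower().split())))[:2]

def distill(query, docs):
    return query + " => " + " ".join(docs)
```

由于只修改语料库本身,这一步可离线执行一次,之后与任意 RAG 检索流水线组合,这正是摘要强调"改进驻留在语料库中"的含义。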

[IR-1] Unveiling the Resilience of LLM-Enhanced Search Engines against Black-Hat SEO Manipulation WWW2026

【速读】:该论文旨在解决大型语言模型增强型搜索引擎(LLMSE)在面对传统黑帽搜索引擎优化(SEO)攻击时的安全性问题,首次系统性地分析了LLMSE生态系统的安全风险。其关键解决方案在于构建了一个包含1000个真实黑帽SEO网站的基准测试集(SEO-Bench),并提出七种新型LLMSEO攻击策略,揭示了即便使用现成的LLMSE产品仍存在被操纵的风险,特别是通过重写查询注入(rewritten-query stuffing)和分段文本(segmented texts)等手段可使攻击成功率翻倍。研究进一步表明,检索阶段是抵御传统SEO攻击的主要防线,但对新型LLMSEO攻击仍显脆弱,从而为开发更鲁棒的AI驱动搜索系统提供了重要实践指导。

链接: https://arxiv.org/abs/2603.25500
作者: Pei Chen,Geng Hong,Xinyi Wu,Mengying Wu,Zixuan Zhu,Mingxuan Liu,Baojun Liu,Mi Zhang,Min Yang
机构: Fudan University (复旦大学); Tsinghua University (清华大学); Zhongguancun Laboratory (中关村实验室)
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注: Accepted at The ACM Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:The emergence of Large Language Model-enhanced Search Engines (LLMSEs) has revolutionized information retrieval by integrating web-scale search capabilities with AI-powered summarization. While these systems demonstrate improved efficiency over traditional search engines, their security implications against well-established black-hat Search Engine Optimization (SEO) attacks remain unexplored. In this paper, we present the first systematic study of SEO attacks targeting LLMSEs. Specifically, we examine ten representative LLMSE products (e.g., ChatGPT, Gemini) and construct SEO-Bench, a benchmark comprising 1,000 real-world black-hat SEO websites, to evaluate both open- and closed-source LLMSEs. Our measurements show that LLMSEs mitigate over 99.78% of traditional SEO attacks, with the phase of retrieval serving as the primary filter, intercepting the vast majority of malicious queries. We further propose and evaluate seven LLMSEO attack strategies, demonstrating that off-the-shelf LLMSEs are vulnerable to LLMSEO attacks, i.e., rewritten-query stuffing and segmented texts double the manipulation rate compared to the baseline. This work offers the first in-depth security analysis of the LLMSE ecosystem, providing practical insights for building more resilient AI-driven search systems. We have responsibly reported the identified issues to major vendors.

[IR-2] Supercharging Federated Intelligence Retrieval

【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在面对分布式私有数据孤岛时无法有效访问知识的问题。其核心挑战在于如何在不集中化敏感数据的前提下实现安全的远程大语言模型(Large Language Model, LLM)推理。解决方案的关键在于构建一个基于Flower框架的联邦RAG(Federated RAG)系统,该系统通过本地数据孤岛执行检索操作,并将服务器端的聚合与文本生成过程置于受证明(attested)的可信执行环境(Confidential Compute Environment, CCE)中,从而即使在存在诚实但好奇或已被攻破的服务器情况下,也能保障LLM推理的机密性。此外,论文提出了一种级联推理方法,引入非机密第三方模型(如Amazon Nova)作为辅助上下文源,而不削弱整体系统的保密性。

链接: https://arxiv.org/abs/2603.25374
作者: Dimitris Stripelis,Patrick Foley,Mohammad Naseri,William Lindskog-Münzing,Chong Shen Ng,Daniel Janes Beutel,Nicholas D. Lane
机构: Flower Labs
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 6 pages, 1 figure, 2 tables

点击查看摘要

Abstract:RAG typically assumes centralized access to documents, which breaks down when knowledge is distributed across private data silos. We propose a secure Federated RAG system built using Flower that performs local silo retrieval, while server-side aggregation and text generation run inside an attested, confidential compute environment, enabling confidential remote LLM inference even in the presence of honest-but-curious or compromised servers. We also propose a cascading inference approach that incorporates a non-confidential third-party model (e.g., Amazon Nova) as auxiliary context without weakening confidentiality.

[IR-3] Adaptive Chunking: Optimizing Chunking-Method Selection for RAG LREC2026

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中文档分块(chunking)策略依赖“一刀切”方法而导致性能受限的问题。现有分块方式难以适配不同文本的语义结构与内容特征,且缺乏独立于下游任务的评估框架,使得分块质量难以量化和优化。解决方案的关键在于提出一种自适应分块(Adaptive Chunking)框架,其核心是基于五种新颖的内在文档级指标——引用完整性(References Completeness, RC)、块内一致性(Intrachunk Cohesion, ICC)、文档上下文连贯性(Document Contextual Coherence, DCC)、块完整性(Block Integrity, BI)和尺寸合规性(Size Compliance, SC)——动态选择最适合每篇文档的分块策略,并配套设计了LLM-regex分割器与递归式分段合并分割器及针对性后处理技术。实验表明,该方法在法律、技术和社会科学等多领域语料上显著提升RAG性能,无需调整模型或提示词即可将答案正确率从62%-64%提升至72%,成功回答问题数量增加超30%(从49增至65),验证了基于内在指标引导的文档感知型分块对构建更鲁棒RAG系统的有效性。

链接: https://arxiv.org/abs/2603.25333
作者: Paulo Roberto de Moura Júnior,Jean Lelong,Annabelle Blangero
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at LREC 2026. 10 pages, 4 figures. Code: this https URL

点击查看摘要

Abstract:The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used “one-size-fits-all” approaches often fail to capture the nuanced structure and semantics of diverse texts. Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance. We challenge this paradigm by introducing Adaptive Chunking, a framework that selects the most suitable chunking strategy for each document based on a set of five novel intrinsic, document-based metrics: References Completeness (RC), Intrachunk Cohesion (ICC), Document Contextual Coherence (DCC), Block Integrity (BI), and Size Compliance (SC), which directly assess chunking quality across key dimensions. To support this framework, we also introduce two new chunkers, an LLM-regex splitter and a split-then-merge recursive splitter, alongside targeted post-processing techniques. On a diverse corpus spanning legal, technical, and social science domains, our metric-guided adaptive method significantly improves downstream RAG performance. Without changing models or prompts, our framework increases RAG outcomes, raising answers correctness to 72% (from 62-64%) and increasing the number of successfully answered questions by over 30% (65 vs. 49). These results demonstrate that adaptive, document-aware chunking, guided by a complementary suite of intrinsic metrics, offers a practical and effective path to more robust RAG systems. Code available at this https URL.
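自适应分块的骨架可以示意如下:对每个文档,用内在指标给各候选分块器打分并取最优。这里只实现了论文五个指标中最直观的尺寸合规性(SC)的一种简化形式,其余指标、阈值与打分权重均为示意性假设:

```python
def size_compliance(chunks, lo=5, hi=60):
    """SC 指标的简化形式:词数落在 [lo, hi] 区间内的块所占比例。"""
    if not chunks:
        return 0.0
    return sum(lo <= len(c.split()) <= hi for c in chunks) / len(chunks)

def select_chunker(doc, chunkers, metrics=(size_compliance,)):
    """对每个候选分块器按指标均值打分,返回得分最高者切出的块。"""
    def score(ch):
        chunks = ch(doc)
        return sum(m(chunks) for m in metrics) / len(metrics)
    best = max(chunkers, key=score)
    return best(doc)

# 两个示意分块器:按句切分 vs. 整篇不切
by_sentence = lambda doc: [s.strip() for s in doc.split(".") if s.strip()]
whole_doc = lambda doc: [doc]
```

对一篇超过尺寸上限的长文档,该选择器会自动偏向按句切分;论文正是在这一"逐文档择优"框架下引入 RC、ICC、DCC、BI 等更语义化的指标。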

[IR-4] ColBERT-Att: Late-Interaction Meets Attention for Enhanced Retrieval

【速读】:该论文旨在解决现有基于晚期交互(late interaction)框架的神经信息检索系统(如ColBERT)在计算查询与文档相似度时,未能显式建模查询和文档词项之间重要性差异的问题。传统方法仅依赖向量嵌入的点积来衡量匹配程度,忽略了注意力权重所体现的语义重要性信息,从而可能影响相关性判断的准确性。解决方案的关键在于提出ColBERT-Att模型,通过在晚交互架构中引入显式的注意力机制,显式地融合查询和文档词项之间的注意力权重,以增强对相关性的理解,从而提升检索性能。实证结果表明,该方法在MS-MARCO、BEIR及LoTTE等多个基准数据集上均实现了召回率的显著提升。

链接: https://arxiv.org/abs/2603.25248
作者: Raj Nath Patel,Sourav Dutta
机构: Huawei Research Center (华为研究中心)
类目: Information Retrieval (cs.IR)
备注: 5 pages

点击查看摘要

Abstract:Vector embeddings from pre-trained language models form a core component in Neural Information Retrieval systems across a multitude of knowledge extraction tasks. The paradigm of late interaction, introduced in ColBERT, demonstrates high accuracy along with runtime efficiency. However, the current formulation fails to take into account the attention weights of query and document terms, which intuitively capture the “importance” of similarities between them, that might lead to a better understanding of relevance between the queries and documents. This work proposes ColBERT-Att, to explicitly integrate attention mechanism into the late interaction framework for enhanced retrieval performance. Empirical evaluation of ColBERT-Att depicts improvements in recall accuracy on MS-MARCO as well as on a wide range of BEIR and LoTTE benchmark datasets.
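ColBERT 的晚期交互打分是"每个查询词取其与所有文档词相似度的最大值,再求和";摘要所述 ColBERT-Att 在此之上引入注意力权重来刻画词项重要性。下面用 NumPy 给出一个最小示意,其中按查询词注意力加权的方式是我们的假设,论文的具体融合形式以原文为准:

```python
import numpy as np

def maxsim(Q, D, q_attn=None):
    """Q: [nq, dim], D: [nd, dim],均为 L2 归一化的词级嵌入。"""
    sims = Q @ D.T                 # 查询词 x 文档词的相似度矩阵
    per_token = sims.max(axis=1)   # 标准 ColBERT 晚期交互:逐查询词取最大
    if q_attn is None:
        return float(per_token.sum())
    return float(q_attn @ per_token)  # 按查询词注意力加权(示意性假设)
```

直观上,高注意力权重的查询词对最终相关性得分贡献更大,低权重(如停用词)的匹配被抑制。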

[IR-5] UniAI-GraphRAG: Synergizing Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion for Robust Multi-Hop Reasoning

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在复杂推理、多跳查询(multi-hop queries)以及领域特定问答(domain-specific QA)任务中面临的性能瓶颈问题,尤其是在跨行业适应性、社区报告完整性与检索效率方面的局限。其解决方案的关键在于提出UniAI-GraphRAG框架,通过三项核心创新实现:(1) 基于本体(Ontology)引导的知识抽取,利用预定义Schema提升大语言模型(LLM)对领域实体和关系的识别准确性;(2) 多维社区聚类策略,结合对齐补全、属性聚类与多跳关系聚类以增强社区结构完整性;(3) 双通道图检索融合机制,通过图结构与社区检索的混合方式在问答准确率与计算性能之间取得平衡。实验证明,该方案在MultiHopRAG基准测试中显著优于主流开源方案,尤其在推理型和时序类查询上表现突出。

链接: https://arxiv.org/abs/2603.25152
作者: Jie Wang,Honghua Huang,Xi Ge,Jianhui Su,Wen Liu,Shiguo Lian
机构: Data Science Artificial Intelligence Research Institute, China Unicom
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in cross-industry adaptability, community report integrity, and retrieval performance. This paper proposes UniAI-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) Multi-Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on MultiHopRAG benchmark show that UniAI-GraphRAG outperforms mainstream open source solutions (this http URL) in comprehensive F1 scores, particularly in inference and temporal queries. The code is available at this https URL.

[IR-6] MCLMR: A Model-Agnostic Causal Learning Framework for Multi-Behavior Recommendation WWW2026

【速读】:该论文针对多行为推荐(Multi-Behavior Recommendation, MBR)中普遍存在的三大挑战展开研究:一是缺乏对用户行为习惯与物品多行为分布所导致的复杂混杂效应(confounding effects)进行建模的原理性框架;二是难以有效聚合异构辅助行为信息;三是跨行为表示在语义鸿沟下无法对齐且未考虑偏差扭曲。为解决这些问题,论文提出了一种模型无关的因果学习框架MCLMR,其核心创新在于:首先构建因果图以建模混杂效应并实施干预实现无偏偏好估计;其次引入基于专家混合(Mixture-of-Experts)的自适应聚合模块动态融合辅助行为信息;最后设计一种偏差感知的对比学习模块,在考虑偏差的前提下对齐跨行为表示。该框架可无缝集成至多种MBR架构,并在三个真实数据集上验证了其有效性与通用性。

链接: https://arxiv.org/abs/2603.25126
作者: Ranxu Zhang,Junjie Meng,Ying Sun,Ziqi Xu,Bing Yin,Hao Li,Yanyong Zhang,Chao Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by WWW 2026

点击查看摘要

Abstract:Multi-Behavior Recommendation (MBR) leverages multiple user interaction types (e.g., views, clicks, purchases) to enrich preference modeling and alleviate data sparsity issues in traditional single-behavior approaches. However, existing MBR methods face fundamental challenges: they lack principled frameworks to model complex confounding effects from user behavioral habits and item multi-behavior distributions, struggle with effective aggregation of heterogeneous auxiliary behaviors, and fail to align behavioral representations across semantic gaps while accounting for bias distortions. To address these limitations, we propose MCLMR, a novel model-agnostic causal learning framework that can be seamlessly integrated into various MBR architectures. MCLMR first constructs a causal graph to model confounding effects and performs interventions for unbiased preference estimation. Under this causal framework, it employs an Adaptive Aggregation module based on Mixture-of-Experts to dynamically fuse auxiliary behavior information and a Bias-aware Contrastive Learning module to align cross-behavior representations in a bias-aware manner. Extensive experiments on three real-world datasets demonstrate that MCLMR achieves significant performance improvements across various baseline models, validating its effectiveness and generality. All data and code will be made publicly available. For anonymous review, our code is available at the following the link: this https URL.

[IR-7] AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation ACL2026

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因依赖低权威来源而导致的信息误导问题,核心挑战在于大语言模型(Large Language Models, LLMs)是否具备感知信息权威性的能力,而不仅限于语义理解。解决方案的关键在于提出AuthorityBench——一个涵盖三个子数据集的综合性基准测试框架(DomainAuth、EntityAuth 和 RAGAuth),用于系统评估LLMs对权威性的判断能力,并通过三种评判方法(PointJudge、PairJudge、ListJudge)和多种输出格式进行实证分析。结果表明,ListJudge与PointScore组合在相关性上表现最优,且具备最佳成本效益;更重要的是,实验发现直接使用网页文本会降低权威判断性能,说明权威性并非由文本风格决定,而是可独立识别的特征。进一步的下游RAG实验验证了基于权威性过滤能显著提升答案准确性,凸显了权威感知对可靠知识检索的实际价值。

链接: https://arxiv.org/abs/2603.25092
作者: Zhihui Yao,Hengran Zhang,Keping Bi
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Information Retrieval (cs.IR)
备注: 11 pages, 4 figures. Submitted to ACL 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge but remains vulnerable to low-authority sources that can propagate misinformation. We investigate whether LLMs can perceive information authority - a capability extending beyond semantic understanding. To address this, we introduce AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception comprising three datasets: DomainAuth (10K web domains with PageRank-based authority), EntityAuth (22K entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation). We evaluate five LLMs using three judging methods (PointJudge, PairJudge, ListJudge) across multiple output formats. Results show that ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness. Notably, incorporating webpage text consistently degrades judgment performance, suggesting authority is distinct from textual style. Downstream experiments on RAG demonstrate that authority-guided filtering largely improves answer accuracy, validating the practical importance of authority perception for reliable knowledge retrieval. Code and benchmark are available at: this https URL.

[IR-8] Hyena Operator for Fast Sequential Recommendation WWW’26

【速读】:该论文旨在解决基于注意力机制的序列推荐模型在处理长用户行为序列时存在的计算复杂度高(二次方复杂度)问题,同时克服现有次二次复杂度方法(如Hyena)在稀疏、长序列场景下表达能力有限的局限性。其解决方案的关键在于提出HyenaRec,通过引入基于勒让德正交多项式的卷积核参数化方法,构建平滑且紧凑的长期时间依赖建模能力,并结合门控卷积机制捕捉短期行为突变,从而在保持线性时间复杂度的同时显著提升模型表达能力和推荐精度。

链接: https://arxiv.org/abs/2603.25027
作者: Jiahao Liu,Lin Li,Zhiyuan Li,Kaixi Hu,Kaize Shi,Jingling Yuan
机构: Wuhan University of Technology (武汉理工大学); Wuhan Textile University (武汉纺织大学); University of Southern Queensland (南昆士兰大学); Hubei Key Laboratory of Transportation Internet of Things (湖北省交通物联网重点实验室)
类目: Information Retrieval (cs.IR)
备注: 11 pages, 5 figures, accepted by ACM Web Conference 2026 (WWW '26)

点击查看摘要

Abstract:Sequential recommendation models, particularly those based on attention, achieve strong accuracy but incur quadratic complexity, making long user histories prohibitively expensive. Sub-quadratic operators such as Hyena provide efficient alternatives in language modeling, but their potential in recommendation remains underexplored. We argue that Hyena faces challenges in recommendation due to limited representation capacity on sparse, long user sequences. To address these challenges, we propose HyenaRec, a novel sequential recommender that integrates polynomial-based kernel parameterization with gated convolutions. Specifically, we design convolutional kernels using Legendre orthogonal polynomials, which provides a smooth and compact basis for modeling long-term temporal dependencies. A complementary gating mechanism captures fine-grained short-term behavioral bursts, yielding a hybrid architecture that balances global temporal evolution with localized user interests under sparse feedback. This construction enhances expressiveness while scaling linearly with sequence length. Extensive experiments on multiple real-world datasets demonstrate that HyenaRec consistently outperforms Attention-, Recurrent-, and other baselines in ranking accuracy. Moreover, it trains significantly faster (up to 6x speedup), with particularly pronounced advantages on long-sequence scenarios where efficiency is maintained without sacrificing accuracy. These results highlight polynomial-based kernel parameterization as a principled and scalable alternative to attention for sequential recommendation.
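摘要中"用勒让德正交多项式参数化卷积核,并对长序列做次二次复杂度的长卷积"的思路,可用 NumPy 粗略示意:先在 [-1, 1] 上采样勒让德基展开得到平滑紧凑的核,再用 FFT 做 O(n log n) 的长卷积。核的具体参数化与门控机制细节以论文为准:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_kernel(coeffs, length):
    """在 [-1, 1] 上采样勒让德基展开,得到平滑、紧凑的长卷积核。"""
    t = np.linspace(-1.0, 1.0, length)
    return legendre.legval(t, coeffs)

def fft_long_conv(x, k):
    """FFT 实现的长卷积:O(n log n),避免逐点滑窗的 O(n^2)。"""
    n = len(x) + len(k) - 1
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    return y[: len(x)]
```

其数值结果与直接卷积一致,但对长用户行为序列的开销增长远慢于注意力的二次复杂度。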

[IR-9] Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval

【速读】:该论文旨在解决学习稀疏检索(Learned Sparse Retrieval, LSR)模型中语言模型(Language Modeling, LM)头导致的内存瓶颈问题,尤其是在大规模词汇表(vocabulary size |V| 可达250k以上)下,传统方法需显式存储庞大的词元对数矩阵(logit matrix),造成显著的内存占用和I/O开销,限制了模型扩展性和训练效率。解决方案的关键在于提出一种名为Sparton的高效Triton内核,通过将分块矩阵乘法、ReLU、Log1P变换与最大池化归约操作融合为单一GPU内核,并在原始对数矩阵块层面执行早期在线归约(early online reduction),从而避免完整对数矩阵在内存中的显式存储。这种设计实现了高达4.8倍的速度提升和数量级级别的峰值内存减少,显著提升了LSR模型的训练吞吐量与可扩展性。

链接: https://arxiv.org/abs/2603.25011
作者: Thong Nguyen,Cosimo Rulli,Franco Maria Nardini,Rossano Venturini,Andrew Yates
机构: University of Amsterdam (阿姆斯特丹大学); ISTI-CNR (意大利国家研究委员会信息科学与技术研究所); University of Pisa (比萨大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:State-of-the-art Learned Sparse Retrieval (LSR) models, such as Splade, typically employ a Language Modeling (LM) head to project latent hidden states into a lexically-anchored logit matrix. This intermediate matrix is subsequently transformed into a sparse lexical representation through element-wise operations (ReLU, Log1P) and max-pooling over the sequence dimension. Despite its effectiveness, the LM head creates a massive memory bottleneck due to the sheer size of the vocabulary (V), which can range from 30,000 to over 250,000 tokens in recent models. Materializing this matrix creates a significant memory bottleneck, limiting model scaling. The resulting I/O overhead between operators further throttles throughput and runtime performance. In this paper, we propose Sparton, a fast memory-efficient Triton kernel tailored for the LM head in LSR models. Sparton utilizes a fused approach that integrates the tiled matrix multiplication, ReLU, Log1P, and max-reduction into a single GPU kernel. By performing an early online reduction directly on raw logit tiles, Sparton avoids materializing the full logit matrix in memory. Our experiments demonstrate that the Sparton kernel, in isolation, achieves up to a 4.8x speedup and an order-of-magnitude reduction in peak memory usage compared to PyTorch baselines. Integrated into Splade (|V| ~ 30k), Sparton enables a 33% larger batch size and 14% faster training with no effectiveness loss. On a multilingual backbone (|V| ~ 250k), these gains jump to a 26x larger batch size and 2.5x faster training.
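Sparton 的核心思想——在 logit 分块上提前做在线 max 归约,避免物化完整的 [T, V] 矩阵——可以用 NumPy 等价地示意。真实实现是单个融合的 Triton GPU 内核,此处仅演示两种计算路径的数值等价性:

```python
import numpy as np

def splade_naive(H, W):
    """H: [T, d] 隐状态,W: [d, V] LM 头。物化完整 logit 矩阵。"""
    logits = H @ W
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

def splade_tiled(H, W, tile=4):
    """分块 + 提前在线 max 归约:任意时刻只持有 tile 行 logits。"""
    out = np.zeros(W.shape[1])  # log1p(relu(.)) >= 0,故可用 0 初始化
    for t0 in range(0, H.shape[0], tile):
        chunk = np.log1p(np.maximum(H[t0 : t0 + tile] @ W, 0.0))
        out = np.maximum(out, chunk.max(axis=0))
    return out
```

当 |V| 达到 250k 量级时,分块路径把峰值内存从 O(T·V) 降到 O(tile·V),这正是摘要中批大小与训练速度收益的来源。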

[IR-10] Unbiased Multimodal Reranking for Long-Tail Short-Video Search

【速读】:该论文旨在解决短视频搜索中长尾查询(long-tail queries)因用户行为数据稀疏而导致的“马太效应”问题,即模型倾向于放大低质量内容(如标题党、浅层内容),而忽视高质量但曝光不足的视频。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的多模态重排序框架,通过LLM内嵌的世界知识评估内容质量,无需依赖真实用户行为数据;该框架采用两阶段训练策略:第一阶段利用多模态证据构建高质量标注用于监督微调,第二阶段引入成对偏好优化以学习候选结果间的部分排序关系;推理时,模型输出的经验评分用于在重排序阶段提升高质但低曝光视频的排名,并通过强化学习进一步指导页面级优化,从而有效改善长尾查询下的内容质量和用户体验。

链接: https://arxiv.org/abs/2603.24975
作者: Wenyi Xu,Feiran Zhu,Songyang Li,Renzhe Zhou,Chao Zhang,Chenglei Dai,Yuren Mao,Yunjun Gao,Yi Zhang
机构: Zhejiang University (浙江大学); Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:With Kuaishou serving hundreds of millions of searches daily, the quality of short-video search is paramount. However, it suffers from a severe Matthew effect on long-tail queries: sparse user behavior data causes models to amplify low-quality content such as clickbait and shallow content. The recent advancements in Large Language Models (LLMs) offer a new paradigm, as their inherent world knowledge provides a powerful mechanism to assess content quality, agnostic to sparse user interactions. To this end, we propose an LLM-driven multimodal reranking framework, which estimates user experience without real user behavior. The approach involves a two-stage training process: the first stage uses multimodal evidence to construct high-quality annotations for supervised fine-tuning, while the second stage incorporates pairwise preference optimization to help the model learn partial orderings among candidates. At inference time, the resulting experience scores are used to promote high-quality but underexposed videos in reranking, and further guide page-level optimization through reinforcement learning. Experiments show that the proposed method achieves consistent improvements over strong baselines in offline metrics including AUC, NDCG@K, and human preference judgement. An online A/B test covering 15% of traffic further demonstrates gains in both user experience and consumption metrics, confirming the practical value of the approach in long-tail video search scenarios.

[IR-11] DIET: Learning to Distill Dataset Continually for Recommender Systems

【速读】:该论文旨在解决大规模推荐系统中持续学习场景下的数据效率问题,即在不断增长的用户行为日志流中,如何以极低的数据开销准确逼近全量数据训练的行为,从而显著降低模型迭代成本。其核心解决方案是提出了一种名为DIET的统一框架,关键在于将蒸馏数据建模为随数据流演化的训练记忆(training memory),并通过分阶段更新机制保持与长期训练动态的一致性;具体实现上,DIET采用基于影响函数的初始化策略和受影响感知的记忆寻址机制,在双层优化框架下实现选择性更新,从而在仅保留原始数据1%-2%的情况下仍能忠实复现完整训练性能趋势,且具备跨模型架构的良好泛化能力。

链接: https://arxiv.org/abs/2603.24958
作者: Jiaqing Zhang,Hao Wang,Mingjia Yin,Bo Chen,Qinglin Jia,Rui Zhou,Ruiming Tang,ChaoYi Ma,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学); Kuaishou Technology (快手科技)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Modern deep recommender models are trained under a continual learning paradigm, relying on massive and continuously growing streaming behavioral logs. In large-scale platforms, retraining models on full historical data for architecture comparison or iteration is prohibitively expensive, severely slowing down model development. This challenge calls for data-efficient approaches that can faithfully approximate full-data training behavior without repeatedly processing the entire evolving data stream. We formulate this problem as \emphstreaming dataset distillation for recommender systems and propose \textbfDIET, a unified framework that maintains a compact distilled dataset which evolves alongside streaming data while preserving training-critical signals. Unlike existing dataset distillation methods that construct a static distilled set, DIET models distilled data as an evolving training memory and updates it in a stage-wise manner to remain aligned with long-term training dynamics. DIET enables effective continual distillation through principled initialization from influential samples and selective updates guided by influence-aware memory addressing within a bi-level optimization framework. Experiments on large-scale recommendation benchmarks demonstrate that DIET compresses training data to as little as \textbf1-2% of the original size while preserving performance trends consistent with full-data training, reducing model iteration cost by up to \textbf60 \times . Moreover, the distilled datasets produced by DIET generalize well across different model architectures, highlighting streaming dataset distillation as a scalable and reusable data foundation for recommender system development.

[IR-12] GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中语义搜索在应对复杂信息需求时的不足问题,尤其是在相关证据分散于多个数据源的情况下。现有方法如代理式检索策略虽能扩展语义搜索空间,但依赖迭代探索导致效率低下;而基于知识图谱的方法虽能建模非语义关系,却因维护成本高且难以与主流向量存储兼容而受限。解决方案的关键在于提出GraphER——一种无需知识图谱的图结构增强与重排序方法,其通过离线索引阶段独立 enrich 数据对象,并在查询时基于图结构对候选对象进行重排序,从而捕获超越语义相似性的多种邻近性(proximity),同时保持与标准向量存储的无缝集成和检索器无关性,且引入可忽略的延迟开销。

链接: https://arxiv.org/abs/2603.24925
作者: Ruizhong Miao,Yuying Wang,Rongguang Wang,Chenyang Li,Tao Sheng,Sujith Ravi,Dan Roth
机构: Oracle AI
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Semantic search in retrieval-augmented generation (RAG) systems is often insufficient for complex information needs, particularly when relevant evidence is scattered across multiple sources. Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries. However, these methods do not fully leverage the organizational structure of the data and instead rely on iterative exploration, which can lead to inefficient retrieval. Another class of approaches employs knowledge graphs to model non-semantic relationships through graph edges. Although effective in capturing richer proximities, such methods incur significant maintenance costs and are often incompatible with the vector stores used in most production systems. To address these limitations, we propose GraphER, a graph-based enrichment and reranking method that captures multiple forms of proximity beyond semantic similarity. GraphER independently enriches data objects during offline indexing and performs graph-based reranking over candidate objects at query time. This design does not require a knowledge graph, allowing GraphER to integrate seamlessly with standard vector stores. In addition, GraphER is retriever-agnostic and introduces negligible latency overhead. Experiments on multiple retrieval benchmarks demonstrate the effectiveness of the proposed approach.
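摘要所述"离线富化 + 查询时基于图的候选重排序"可用如下草图示意。其中邻接关系 adj 假设来自离线富化阶段,融合权重 alpha 与图邻近度的打分方式均为示意性假设,并非论文的具体公式:

```python
def graph_rerank(candidates, sem_scores, adj, alpha=0.7):
    """按语义得分与候选集内图邻近度的凸组合重新排序候选文档。"""
    cand = set(candidates)

    def score(d):
        # 候选集中与 d 存在富化邻接关系的文档比例
        overlap = len(adj.get(d, set()) & (cand - {d}))
        graph = overlap / max(len(cand) - 1, 1)
        return alpha * sem_scores[d] + (1 - alpha) * graph

    return sorted(candidates, key=score, reverse=True)
```

因为只在候选集合上做图打分,这一步与底层向量库解耦、逐查询开销很小,对应摘要中"检索器无关、延迟开销可忽略"的设计目标。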

[IR-13] Enhancing Online Support Group Formation Using Topic Modeling Techniques

【速读】:该论文旨在解决在线健康社区(Online Health Communities, OHCs)中支持小组(support groups)自动构建所面临的三大挑战:可扩展性不足、静态分类方式导致的个性化缺失,以及组内语义一致性弱的问题。为应对这些挑战,作者提出两种新型机器学习模型——群体特异性狄利克雷多项式回归(Group specific Dirichlet Multinomial Regression, gDMR)和群体特异性结构化主题模型(Group specific Structured Topic Model, gSTM),其关键在于融合用户生成文本内容、人口统计学特征与基于用户网络节点嵌入(node embeddings)的交互数据,从而实现语义连贯且高度个性化的支持小组自动形成。gDMR通过引入网络关系模式与人口统计学变量生成可解释的组协变量,提升实际部署可行性;而gSTM则利用稀疏约束增强主题区分度,生成更具主题聚焦性的群组。实验表明,二者在预测准确性、语义一致性和内部一致性指标上均显著优于LDA、DMR和STM等基线方法,且定性分析验证了模型分组与人工标注主题的高度契合,证明其在慢性病管理、诊断不确定性及心理健康等多样化健康议题中的实用价值。

链接: https://arxiv.org/abs/2603.24765
作者: Pronob Kumar Barman,Tera L. Reynolds,James Foulds
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Information Retrieval (cs.IR); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Online health communities (OHCs) are vital for fostering peer support and improving health outcomes. Support groups within these platforms can provide more personalized and cohesive peer support, yet traditional support group formation methods face challenges related to scalability, static categorization, and insufficient personalization. To overcome these limitations, we propose two novel machine learning models for automated support group formation: the Group specific Dirichlet Multinomial Regression (gDMR) and the Group specific Structured Topic Model (gSTM). These models integrate user generated textual content, demographic profiles, and interaction data represented through node embeddings derived from user networks to systematically automate personalized, semantically coherent support group formation. We evaluate the models on a large scale dataset from this http URL, comprising over 2 million user posts. Both models substantially outperform baseline methods including LDA, DMR, and STM in predictive accuracy (held out log likelihood), semantic coherence (UMass metric), and internal group consistency. The gDMR model yields group covariates that facilitate practical implementation by leveraging relational patterns from network structures and demographic data. In contrast, gSTM emphasizes sparsity constraints to generate more distinct and thematically specific groups. Qualitative analysis further validates the alignment between model generated groups and manually coded themes, showing the practical relevance of the models in informing groups that address diverse health concerns such as chronic illness management, diagnostic uncertainty, and mental health. By reducing reliance on manual curation, these frameworks provide scalable solutions that enhance peer interactions within OHCs, with implications for patient engagement, community resilience, and health outcomes. 

[IR-14] Pseudo Label NCF for Sparse OHC Recommendation: Dual Representation Learning and the Separability Accuracy Trade off

【速读】:该论文旨在解决在线健康社区中因用户交互数据极度稀疏而导致的推荐难题,特别是在冷启动场景下如何有效为新用户提供个性化的支持组推荐。其核心解决方案是引入一种基于问卷调查特征对齐的伪标签(pseudo label)辅助目标函数,通过余弦相似度将支持组的结构化特征与用户输入的16维问卷向量映射至[0,1]区间生成伪标签,并将其融入神经协同过滤(Neural Collaborative Filtering, NCF)架构中,包括矩阵分解(Matrix Factorization, MF)、多层感知机(Multi Layer Perceptron, MLP)和混合模型(NeuMF)。关键创新在于构建了双嵌入空间:主嵌入用于排序打分,伪标签嵌入用于语义对齐,从而在极低交互条件下显著提升推荐性能(如HR@5指标提升超一倍),同时获得更具可解释性的任务特定嵌入空间。

链接: https://arxiv.org/abs/2603.24750
作者: Pronob Kumar Barman,Tera L. Reynolds,James Foulds
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Online Health Communities connect patients for peer support, but users face a discovery challenge when they have minimal prior interactions to guide personalization. We study recommendation under extreme interaction sparsity in a survey driven setting where each user provides a 16 dimensional intake vector and each support group has a structured feature profile. We extend Neural Collaborative Filtering architectures, including Matrix Factorization, Multi Layer Perceptron, and NeuMF, with an auxiliary pseudo label objective derived from survey group feature alignment using cosine similarity mapped to [0, 1]. The resulting Pseudo Label NCF learns dual embedding spaces: main embeddings for ranking and pseudo label embeddings for semantic alignment. We evaluate on a dataset of 165 users and 498 support groups using a leave one out protocol that reflects cold start conditions. All pseudo label variants improve ranking performance: MLP improves HR@5 from 2.65% to 5.30%, NeuMF from 4.46% to 5.18%, and MF from 4.58% to 5.42%. Pseudo label embedding spaces also show higher cosine silhouette scores than baseline embeddings, with MF improving from 0.0394 to 0.0684 and NeuMF from 0.0263 to 0.0653. We further observe a negative correlation between embedding separability and ranking accuracy, indicating a trade off between interpretability and performance. These results show that survey derived pseudo labels improve recommendation under extreme sparsity while producing interpretable task specific embedding spaces.
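论文中"余弦相似度映射至 [0, 1] 生成伪标签"这一步可以用几行代码示意。以下是本文给出的极简草图,并非论文官方实现:其中用 (cos+1)/2 做线性映射只是本文的一种假设写法,论文未说明具体映射函数。

```python
import numpy as np

def pseudo_label(user_vec: np.ndarray, group_vec: np.ndarray) -> float:
    """由问卷向量与支持组特征向量的余弦相似度生成 [0, 1] 伪标签(示意实现)。"""
    denom = float(np.linalg.norm(user_vec) * np.linalg.norm(group_vec))
    cos = float(np.dot(user_vec, group_vec)) / denom if denom > 0 else 0.0
    # 将余弦值从 [-1, 1] 线性映射到 [0, 1](映射方式为本文假设)
    return (cos + 1.0) / 2.0

# 示例:16 维问卷向量(对应论文中的 intake vector)与支持组特征向量
rng = np.random.default_rng(0)
u = rng.random(16)
g = rng.random(16)
label = pseudo_label(u, g)
```

该伪标签随后可作为辅助目标,与 NCF 的排序损失联合训练。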

人机交互

[HC-0] A Mentalistic Interface for Probing Folk-Psychological Attribution to Non-Humanoid Robots

【速读】:该论文旨在解决如何通过语言和解释框架影响人类对非人形机器人意图状态(intentional state)归因的问题。解决方案的关键在于构建一个实验平台,该平台结合了仿真机器人、逼真的任务环境以及基于大语言模型(large language model)的解释层,能够以心理主义(mentalistic)、目的论(teleological)或机制论(mechanistic)三种不同语义框架描述相同行为,从而在保持行为不变的前提下,系统性地考察语言表述方式如何塑造人类对机器人采取“意图立场”(intentional stance)的认知倾向。

链接: https://arxiv.org/abs/2603.25646
作者: Giulio Pisaneschi,Pierpaolo Serio,Estelle Gerbier,Andrea Dan Ryals,Lorenzo Pollini,Mario G. C. A. Cimino
机构: Institute of Clinical Physiology National Research Council(国家研究委员会临床生理学研究所); University of Pisa(比萨大学); Delft University of Technology(代尔夫特理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Preprint submitted to IEEE. 8 pages, 21 figures

点击查看摘要

Abstract:This paper presents an experimental platform for studying intentional-state attribution toward a non-humanoid robot. The system combines a simulated robot, realistic task environments, and large language model-based explanatory layers that can express the same behavior in mentalistic, teleological, or mechanistic terms. By holding behavior constant while varying the explanatory frame, the platform provides a controlled way to investigate how language and framing shape the adoption of the intentional stance in robotics.

[HC-1] Clinician Perspectives on Type 1 Diabetes Guidelines and Glucose Data Interpretation

【速读】:该论文试图解决的问题是:当前临床实践中,医疗专业人员如何理解和应用针对1型糖尿病(Type 1 Diabetes Mellitus, T1DM)的管理指南,以及他们对患者利用葡萄糖监测设备数据进行自我管理能力的认知。其解决方案的关键在于通过一项包含两部分的在线问卷调查,系统收集英国19名糖尿病相关领域医护人员的观点,发现临床指南优先关注血糖管理且常个体化调整;同时,医务人员普遍认为患者能够理解设备提供的数据,并在某些情况下做出正确的治疗决策。

链接: https://arxiv.org/abs/2603.25631
作者: Mohammed Basheikh,Rujiravee Kongdee,Hood Thabit,Bijan Parsia,Sarah Clinch,Simon Harper
机构: University of Manchester(曼彻斯特大学); Manchester University NHS Foundation Trust(曼彻斯特大学NHS基金会信托)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This study explored healthcare professionals’ perspectives on the management of Type 1 Diabetes Mellitus (T1DM) through a two-part questionnaire. The first part examined how clinicians prioritise and apply current clinical guidelines, including the relative importance assigned to different aspects of T1DM management. The second part investigated clinicians’ perceptions of patients’ ability to interpret data from the glucose monitoring devices and to make appropriate treatment decisions. An online questionnaire was completed by 19 healthcare professionals working in diabetes-related roles in the United Kingdom. The findings revealed that blood glucose management is prioritised within clinical guidance and that advice is frequently tailored to individual patient needs. Additionally, clinicians generally perceive that data presented in glucose monitoring devices is easy for patients to interpret and based on these data, they believe that patients occasionally make correct treatment decisions.

[HC-2] Visual or Textual: Effects of Explanation Format and Personal Characteristics on the Perception of Explanations in an Educational Recommender System

【速读】:该论文旨在解决推荐系统(Recommender Systems, RS)中解释形式(视觉 vs. 文本)与用户个人特征(Personal Characteristics, PCs)之间的交互关系不明确的问题,以提升教育推荐系统(Educational Recommender System, ERS)的透明度、信任感和用户满意度。解决方案的关键在于:通过一个包含54名参与者的被试内实验,发现一种设计良好、简洁、交互性强、可选择性展示且易于理解的可视化解释方式,能够有效增强大多数用户的感知控制力、透明度、适当信任及满意度,且这种效果不受用户个人特征(如大五人格特质、认知需求、决策风格等)的显著影响,从而为ERS中的解释设计提供了普适性指导原则。

链接: https://arxiv.org/abs/2603.25624
作者: Qurat Ul Ain,Mohamed Amine Chatti,Nasim Yazdian Varjani,Farah Kamal,Astrid Rosenthal-von der Pütten
机构: University of Duisburg-Essen (杜伊斯堡-埃森大学); RWTH Aachen University (亚琛工业大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Paper accepted to UMAP 2026

点击查看摘要

Abstract:Explanations are central to improving transparency, trust, and user satisfaction in recommender systems (RS), yet it remains unclear how different explanation formats (visual vs. textual) are suited to users with different personal characteristics (PCs). To this end, we report a within-subject user study (n=54) comparing visual and textual explanations and examine how explanation format and PCs jointly influence perceived control, transparency, trust, and satisfaction in an educational recommender system (ERS). Using robust mixed-effects models, we analyze the moderating effects of a wide range of PCs, including Big Five traits, need for cognition, decision making style, visualization familiarity, and technical expertise. Our results show that a well-designed visual, simple, interactive, selective, easy to understand visualization that clearly and intuitively communicates how user preferences are linked to recommendations, fosters perceived control, transparency, appropriate trust, and satisfaction in the ERS for most users, independent of their PCs. Moreover, we derive a set of guidelines to support the effective design of explanations in ERSs.

[HC-3] Does Structured Intent Representation Generalize? A Cross-Language Cross-Model Empirical Study of 5W3H Prompting

【速读】:该论文旨在解决多语言环境下生成式 AI (Generative AI) 中意图表示的泛化性问题,即结构化的意图表示方法是否能在不同语言和模型之间保持一致的有效性。其解决方案的关键在于引入基于5W3H(Who, What, When, Where, Why, How, How much, How many)框架的提示协议规范(Prompt Protocol Specification, PPS),并通过四种条件对比实验验证其效果:包括手动构建的5W3H提示(Condition C)、AI辅助扩展的单句提示(Condition D)以及两种无结构提示基线。研究发现,AI自动将用户简单输入扩展为完整5W3H结构后,在目标对齐度上与人工编写结构化提示无显著差异,且显著降低了跨模型输出的方差波动,尤其在非约束条件下揭示了传统无结构提示存在系统性的双重膨胀偏差(复合评分虚高、模型间差异低估)。这表明,借助AI辅助的结构化意图表示可提升多语言场景下意图对齐精度与一致性,同时降低非专业用户的使用门槛。

链接: https://arxiv.org/abs/2603.25379
作者: Peng Gang
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 28 pages, figures, tables, and appendix. Follow-up empirical study extending prior work on PPS and 5W3H structured prompting to cross-language, cross-model, and AI-assisted authoring settings

点击查看摘要

Abstract:Does structured intent representation generalize across languages and models? We study PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction, and extend prior Chinese-only evidence along three dimensions: two additional languages (English and Japanese), a fourth condition in which a user’s simple prompt is automatically expanded into a full 5W3H specification by an AI-assisted authoring interface, and a new research question on cross-model output consistency. Across 2,160 model outputs (3 languages x 4 conditions x 3 LLMs x 60 tasks), we find that AI-expanded 5W3H prompts (Condition D) show no statistically significant difference in goal alignment from manually crafted 5W3H prompts (Condition C) across all three languages, while requiring only a single-sentence input from the user. Structured PPS conditions often reduce or reshape cross-model output variance, though this effect is not uniform across languages and metrics; the strongest evidence comes from identifying spurious low variance in unconstrained baselines. We also show that unstructured prompts exhibit a systematic dual-inflation bias: artificially high composite scores and artificially low apparent cross-model variance. These findings suggest that structured 5W3H representations can improve intent alignment and accessibility across languages and models, especially when AI-assisted authoring lowers the barrier for non-expert users.
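5W3H 结构化提示的拼装思路可以示意如下。注意:字段命名与输出格式均为本文假设,并非 PPS 规范原文;论文中的 Condition D 由 AI 自动完成这一扩展,这里仅演示结构本身。

```python
# 5W3H 的八个维度(顺序与命名为本文假设)
FIELDS_5W3H = ["Who", "What", "When", "Where", "Why", "How", "How much", "How many"]

def build_5w3h_prompt(spec: dict) -> str:
    """将用户填写(或 AI 扩展得到)的字段拼装成结构化提示。"""
    lines = []
    for field in FIELDS_5W3H:
        value = spec.get(field, "未指定")  # 缺失字段显式标注,便于模型识别
        lines.append(f"{field}: {value}")
    return "\n".join(lines)

# 示例:用户只给出少量字段,其余留待 AI 辅助扩展
spec = {
    "Who": "面向初学者的技术博客读者",
    "What": "解释余弦相似度的直观含义",
    "How much": "约 300 字",
}
prompt = build_5w3h_prompt(spec)
```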

[HC-4] Usability of Passwordless Authentication in Wi-Fi Networks: A Comparative Study of Passkeys and Passwords in Captive Portals

【速读】:该论文旨在解决无密码认证机制(如passkeys)在公共Wi-Fi热点的 captive portal 场景下可用性不足的问题,这一场景中传统密码登录常因界面限制、用户认知负担和平台兼容性问题导致体验不佳。解决方案的关键在于通过实证比较研究揭示 passkeys 与密码在登录流程中的可用性差异,并提出针对性设计改进:包括引入无需用户名的认证流程(usernameless authentication flows)、优化 captive portal 自动检测机制,以及改进用户界面(UI)设计,从而提升整体用户体验并降低错误率。

链接: https://arxiv.org/abs/2603.25290
作者: Martiño Rivera-Dourado,Rubén Pérez-Jove,Alejandro Pazos,Jose Vázquez-Naya
机构: 未知
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: This is an author version. It has not been peer reviewed

点击查看摘要

Abstract:Passkeys have recently emerged as a passwordless authentication mechanism, yet their usability in captive portals remains unexplored. This paper presents an empirical, comparative usability study of passkeys and passwords in a Wi-Fi hotspot using a captive portal. We conducted a controlled laboratory experiment with 50 participants following a split-plot design across Android and Windows platforms, using a router implementing the FIDO2CAP protocol. Our results show a tendency for passkeys to be perceived as more usable than passwords during login, although differences are not statistically significant. Independent of the authentication method, captive portal limitations negatively affected user experience and increased error rates. We further found that passkeys are generally easy to configure on both platforms, but platform-specific issues introduce notable usability challenges. Based on quantitative and qualitative findings, we derive design recommendations to improve captive portal authentication, including the introduction of usernameless authentication flows, improved captive portal detection mechanisms, and user interface design changes.

[HC-5] Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

【速读】:该论文旨在解决生成式 AI (Generative AI) 中可解释人工智能(Explainable AI, XAI)方法的评估问题,即当前广泛使用的功能指标(如正确性)是否真正能够反映人类对模型决策的理解。传统假设认为更高的解释正确性必然带来更好的人类理解,但这一假设缺乏实验验证。研究通过控制时间序列分类任务中解释正确性的四个水平(100%、85%、70%、55%),并要求参与者基于解释进行前向模拟预测AI决策,发现:仅当正确性低于70%时,人类理解显著下降;进一步降低正确性不再加剧理解损失;且即使完全正确的解释也未保证所有参与者都能掌握决策模式。关键在于揭示了功能正确性与人类理解之间的非线性关系,并指出需将功能指标与人类行为结果结合验证,以提升XAI评估的有效性。

链接: https://arxiv.org/abs/2603.25251
作者: Gregor Baer,Chao Zhang,Isel Grau,Pieter Van Gorp
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model’s reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI’s decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.

[HC-6] Understanding Newcomer Persistence in Social VR: A Case Study of VRChat

【速读】:该论文旨在解决社交虚拟现实(Social Virtual Reality, SVR)环境中新用户初始融入困难的问题,尤其是相较于传统二维在线环境,SVR平台中缺乏明确的引导机制和社交规范,导致新手难以快速适应并留存。研究通过访谈24名活跃SVR用户并进行反思性主题分析,识别出新手面临的主要障碍包括不熟悉的用户界面、模糊的社会规范以及感官过载等问题,并提炼出关键解决方案:一是通过主动构建社交意义来弥补SVR环境中目标导向不足的缺陷;二是利用社会动态策略有效管理VR特有的生理不适(如VR晕动症),从而提升用户留存率。研究进一步提出设计建议,以系统性地支持这些成功融入路径的形成。

链接: https://arxiv.org/abs/2603.25223
作者: Qijia Chen,Andrea Bellucci,Giulio Jacucci
机构: University of Helsinki(赫尔辛基大学); Universidad Carlos III de Madrid(卡洛斯三世大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Newcomers are crucial for the growth of online communities, yet their successful integration into these spaces requires overcoming significant initial hurdles. Social Virtual Reality (VR) platforms are novel avenues that offer unprecedented online interaction experiences. Unlike well-studied two-dimensional online environments, the pathways to successful newcomer integration in online VR spaces are underexplored. Our research addresses this gap by examining the strategies used by newcomers to navigate early challenges in social VR and how they adapt. By focusing on active participants (ranging from newcomers currently navigating these hurdles to veterans who have successfully integrated) we isolate the specific strategies necessary for retention. We interviewed 24 active social VR users and conducted a reflexive thematic analysis. While participants identified barriers such as unfamiliar user interfaces, social norms, and overwhelming sensory input, our analysis reveals the adaptation strategies required to overcome them. Our findings expand on understanding newcomer persistence beyond traditional 2D environments, emphasizing how social dynamics influence the management of VR-specific issues like VR sickness during onboarding. Additionally, we highlight how successful newcomers overcome the lack of clear objectives in social VR by proactively constructing social meaning. We propose design suggestions to scaffold these successful integration pathways.

[HC-7] Beyond Benchmarks: How Users Evaluate AI Chat Assistants

【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)的评估主要依赖自动化基准测试,但缺乏对用户满意度、采纳动机及使用痛点的系统性跨平台比较。为填补这一空白,研究者设计并实施了一项涵盖388名活跃AI聊天用户的跨平台调查,对比了七款主流平台(ChatGPT、Claude、Gemini、DeepSeek、Grok、Mistral 和 Llama)在用户满意度、使用场景表现、采纳动因及定性不满方面的差异。其解决方案的关键在于采用统一的测量工具进行多平台实证分析,揭示出尽管各平台在资金、团队规模和基准性能上存在显著差异,但用户满意度高度趋同;同时发现用户将这些工具视为可互换的实用工具而非锁定生态,且各平台因差异化优势(如界面、答案质量、口碑传播、内容政策)吸引特定用户群体,从而支撑竞争多样性而非赢家通吃格局。

链接: https://arxiv.org/abs/2603.25220
作者: Moiz Sadiq Awan,Muhammad Haris Noor,Muhammad Salman Munaf
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 15 figures, 5 tables, 32 references

点击查看摘要

Abstract:Automated benchmarks dominate the evaluation of large language models, yet no systematic study has compared user satisfaction, adoption motivations, and frustrations across competing platforms using a consistent instrument. We address this gap with a cross-platform survey of 388 active AI chat users, comparing satisfaction, adoption drivers, use case performance, and qualitative frustrations across seven major platforms: ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, and Llama. Three broad findings emerge. First, the top three platforms (Claude, ChatGPT, and DeepSeek) receive statistically indistinguishable satisfaction ratings despite vast differences in funding, team size, and benchmark performance. Second, users treat these tools as interchangeable utilities rather than sticky ecosystems: over 80% use two or more platforms, and switching costs are negligible. Third, each platform attracts users for different reasons: ChatGPT for its interface, Claude for answer quality, DeepSeek through word-of-mouth, and Grok for its content policy, suggesting that specialization, not generalist dominance, sustains competition. Hallucination and content filtering remain the most common frustrations across all platforms. These findings offer an early empirical baseline for a market that benchmarks alone cannot characterize, and point toward competitive plurality rather than winner-take-all consolidation among engaged users.

[HC-8] The Competence Shadow: Theory and Bounds of AI Assistance in Safety Engineering

【速读】:该论文旨在解决人工智能(AI)辅助在物理AI系统安全工程中的应用问题,即AI协助是否真正提升安全分析质量,还是引入难以察觉的系统性盲点,仅在部署后才暴露。其核心挑战在于传统基于基准的评估方法无法充分刻画安全工程的多维能力需求。解决方案的关键在于提出一个五维胜任力框架(涵盖领域知识、标准专长、操作经验、情境理解与判断力),并首次引入“胜任力阴影”(competence shadow)概念——指AI生成的安全分析对人类推理范围的系统性压缩效应,而非AI输出本身。论文进一步形式化四种典型人-AI协作结构,并推导出性能边界,揭示胜任力阴影呈乘法累积效应,远超线性叠加估计。最终结论强调:AI在安全工程中的作用本质上是协作流程设计问题,而非单纯工具采购决策;通过构建抗阴影的工作流可实现不劣化效果,从而推动从“工具认证”向“流程认证”的范式转变,以保障物理AI系统的可信性。

链接: https://arxiv.org/abs/2603.25197
作者: Umair Siddique
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Software Engineering (cs.SE)
备注: 8 Pages, 3 Figures, 2 tables

点击查看摘要

Abstract:As AI assistants become integrated into safety engineering workflows for Physical AI systems, a critical question emerges: does AI assistance improve safety analysis quality, or introduce systematic blind spots that surface only through post-deployment incidents? This paper develops a formal framework for AI assistance in safety analysis. We first establish why safety engineering resists benchmark-driven evaluation: safety competence is irreducibly multidimensional, constrained by context-dependent correctness, inherent incompleteness, and legitimate expert disagreement. We formalize this through a five-dimensional competence framework capturing domain knowledge, standards expertise, operational experience, contextual understanding, and judgment. We introduce the competence shadow: the systematic narrowing of human reasoning induced by AI-generated safety analysis. The shadow is not what the AI presents, but what it prevents from being considered. We formalize four canonical human-AI collaboration structures and derive closed-form performance bounds, demonstrating that the competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates. The central finding is that AI assistance in safety engineering is a collaboration design problem, not a software procurement decision. The same tool degrades or improves analysis quality depending entirely on how it is used. We derive non-degradation conditions for shadow-resistant workflows and call for a shift from tool qualification toward workflow qualification for trustworthy Physical AI.

[HC-9] On-Demand Instructional Material Providing Agent Based on MLLM for Tutoring Support

【速读】:该论文旨在解决一对一 tutoring 过程中,教师难以及时获取与学生需求匹配的补充教学材料(如图像)的问题。解决方案的关键在于设计并实现一个基于多模态大语言模型(Multimodal Large Language Model)的智能代理(agent),该代理能够实时分析师生对话内容,自动构建搜索查询,并从网络中检索相关图像,从而在教学过程中按需提供高质量的视觉辅助材料。实验表明,该代理可将图像检索时间平均缩短 44.4 秒,且在 85.7% 的试验中成功提供可接受质量的图像,显著提升了教学支持效率。

链接: https://arxiv.org/abs/2603.25195
作者: Takumi Kato,Masato Kikuchi,Tadachika Ozono
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: The 20th International Conference on E-Service and Knowledge Management (ESKM 2025)

点击查看摘要

Abstract:Effective instruction in tutoring requires promptly providing instructional materials that match the needs of each student (e.g., in response to questions). In this study, we introduce an agent that automatically delivers supplementary materials on demand during one-on-one tutoring sessions. Our agent uses a multimodal large language model to analyze spoken dialogue between the instructor and the student, automatically generate search queries, and retrieve relevant Web images. Evaluation experiments demonstrate that our agent reduces the average image retrieval time by 44.4 s compared to cases without support and successfully provides images of acceptable quality in 85.7% of trials. These results indicate that our agent effectively supports instructors during tutoring sessions.

[HC-10] Goodness-of-pronunciation without phoneme time alignment

【速读】:该论文旨在解决低资源语言中语音评估(Speech Evaluation)因自动语音识别(ASR)训练数据有限而难以扩展的问题。现有开源弱监督ASR模型虽支持多语言,但其帧异步且非音素级的特性阻碍了语音评估所需的特征提取。解决方案的关键在于:通过将ASR假设映射到音素混淆网络(phoneme confusion network)来计算音素后验概率,并采用词级而非音素级的说话速率和持续时间;同时,利用交叉注意力架构融合音素级与帧级特征,从而避免对音素时间对齐的依赖,实现与标准帧同步特征相当的性能,在英语(speechOcean762)和低资源泰米尔语数据集上均取得有效结果。

链接: https://arxiv.org/abs/2603.25150
作者: Jeremy H. M. Wong,Nancy F. Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of speech evaluation to low-resource languages. Open-source weakly-supervised models are capable of ASR over many languages, but they are frame-asynchronous and not phonemic, hindering feature extraction for speech evaluation. This paper proposes to overcome incompatibilities for feature extraction with weakly-supervised models, easing expansion of speech evaluation to low-resource languages. Phoneme posteriors are computed by mapping ASR hypotheses to a phoneme confusion network. Word instead of phoneme-level speaking rate and duration are used. Phoneme and frame-level features are combined using a cross-attention architecture, obviating phoneme time alignment. This performs comparably with standard frame-synchronous features on English speechocean762 and low-resource Tamil datasets.
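"将 ASR 假设映射到音素混淆网络以计算音素后验"的核心思想,可以粗略示意为按假设权重对各位置的音素投票。以下草图为本文的简化版本(假设各候选音素序列已对齐且等长;真实混淆网络还需处理对齐与长度不一致问题,函数名亦为本文假设):

```python
from collections import defaultdict

def phoneme_posteriors(nbest):
    """由 (音素序列, 权重) 的 n-best 列表计算逐位置音素后验分布。"""
    total = sum(w for _, w in nbest)
    length = len(nbest[0][0])
    posteriors = [defaultdict(float) for _ in range(length)]
    for phones, w in nbest:
        for i, p in enumerate(phones):
            posteriors[i][p] += w / total  # 按归一化权重累加投票
    return posteriors

# 示例:三条等长假设,权重即 ASR 打分(归一化后)
nbest = [(["k", "ae", "t"], 0.6),
         (["k", "eh", "t"], 0.3),
         (["g", "ae", "t"], 0.1)]
post = phoneme_posteriors(nbest)
```

位置 0 上 "k" 的后验为 0.6+0.3=0.9,这类分布即可作为发音好坏(GOP)特征的输入。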

[HC-11] TopoPilot: Reliable Conversational Workflow Automation for Topological Data Analysis and Visualization

【速读】:该论文旨在解决当前生成式AI在自动化科学可视化流程中可靠性不足的问题,特别是大型语言模型(Large Language Models, LLMs)在执行过程中可能产生无效操作、引入细微但关键的错误,或未能在输入信息不完整时主动请求补充信息等缺陷。这些问题在现实世界复杂工作流中尤为突出,远超标准基准测试的复杂度。解决方案的关键在于提出TopoPilot框架,其核心创新是采用以可靠性为中心的双代理架构:一个编排代理(orchestrator agent)将自然语言提示转化为由原子后端动作组成的可视化工作流,另一个验证代理(verifier agent)在执行前对工作流进行结构有效性与语义一致性检查,从而实现正确性保障;此外,模块化设计增强了系统的可扩展性与鲁棒性,通过系统性地识别失败模式并实施针对性防护机制,在模拟1000次多轮交互中实现了超过99%的成功率,显著优于缺乏全面校验机制的基线方法。

链接: https://arxiv.org/abs/2603.25063
作者: Nathaniel Gorski,Shusen Liu,Bei Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent agentic systems demonstrate that large language models can generate scientific visualizations from natural language. However, reliability remains a major limitation: systems may execute invalid operations, introduce subtle but consequential errors, or fail to request missing information when inputs are underspecified. These issues are amplified in real-world workflows, which often exceed the complexity of standard benchmarks. Ensuring reliability in autonomous visualization pipelines therefore remains an open challenge. We present TopoPilot, a reliable and extensible agentic framework for automating complex scientific visualization workflows. TopoPilot incorporates systematic guardrails and verification mechanisms to ensure reliable operation. While we focus on topological data analysis and visualization as a primary use case, the framework is designed to generalize across visualization domains. TopoPilot adopts a reliability-centered two-agent architecture. An orchestrator agent translates user prompts into workflows composed of atomic backend actions, while a verifier agent evaluates these workflows prior to execution, enforcing structural validity and semantic consistency. This separation of interpretation and verification reduces code-generation errors and enforces correctness guarantees. A modular architecture further improves robustness by isolating components and enabling seamless integration of new descriptors and domain-specific workflows without modifying the core system. To systematically address reliability, we introduce a taxonomy of failure modes and implement targeted safeguards for each class. In evaluations simulating 1,000 multi-turn conversations across 100 prompts, including adversarial and infeasible requests, TopoPilot achieves a success rate exceeding 99%, compared to under 50% for baselines without comprehensive guardrails and checks.
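验证代理在执行前对工作流做结构校验的思路,可以粗略示意如下:动作必须在原子动作表中注册,且必需参数齐全。注意动作注册表、字段名均为本文假设,并非 TopoPilot 的实际接口:

```python
# 原子后端动作及其必需参数(名称为本文假设的示例)
ATOMIC_ACTIONS = {
    "load_data": {"path"},
    "compute_persistence": {"field"},
    "render_diagram": {"field", "output"},
}

def verify_workflow(workflow):
    """执行前校验:返回错误列表,空列表表示通过(结构有效性的简化示意)。"""
    errors = []
    for step in workflow:
        name = step.get("action")
        if name not in ATOMIC_ACTIONS:
            errors.append(f"未知动作: {name}")
            continue
        missing = ATOMIC_ACTIONS[name] - set(step.get("args", {}))
        if missing:
            errors.append(f"{name} 缺少参数: {sorted(missing)}")
    return errors

good = [{"action": "load_data", "args": {"path": "mesh.vtk"}}]
bad = [{"action": "delete_everything", "args": {}}]
```

真实系统在此之上还需做语义一致性检查(如动作间的数据依赖),但"先校验、后执行"的分离正是其可靠性的来源。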

[HC-12] Framing Data Choices: How Pre-Donation Exploration Design Influence Data Donation Behavior and Decision-Making

【速读】:该论文试图解决数据捐赠(data donation)在公共部门研究中面临的“意愿-行为差距”问题,即用户虽有参与意愿,但实际捐赠行为却不足。其解决方案的关键在于通过优化数据选择的呈现方式(choice framing),特别是预捐赠阶段的数据探索设计,来干预个体行为。研究通过实证实验发现,采用“社会比较”(social comparison)框架能显著提升捐赠率(87.5%),而“仅集体视角”(collective-only)反而引发认知混乱和隐私担忧,导致捐赠率下降至37.5%。这表明,合理的设计干预可将数据捐赠视为一个行为挑战,并凸显出设计在推动参与式公共部门创新中的关键作用。

链接: https://arxiv.org/abs/2603.24995
作者: Zeya Chen,Zach Pino,Ruth Schmidt
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: This work has been accepted for inclusion in DRS Biennial Conference Series, DRS2026: Edinburgh, 8-12 June, Edinburgh, UK

点击查看摘要

Abstract:Data donation, an emerging user-centric data collection method for public sector research, faces a gap between participant willingness and actual donation. This suggests a design absence in practice: while promoted as “donor-centered” with technical and regulational advances, a design perspective on how data choices are presented and intervene on individual behaviors remain underexplored. In this paper, we focus on pre-donation data exploration, a key stage for adequately and meaningful informed participation. Through a real-world data donation study (N=24), we evaluated three data exploration interventions (self-focused, social comparison, collective-only). Findings show choice framing impacts donation participation. The “social comparison” design (87.5%) outperformed the “self-focused view” (62.5%) while a “collective-only” frame (37.5%) backfired, causing “perspective confusion” and privacy concerns. This study demonstrates how strategic data framing addresses data donation as a behavioral challenge, revealing design’s critical yet underexplored role in data donation for participatory public sector innovation.

[HC-13] Co-designing for the Triad: Design Considerations for Collaborative Decision-Making Technologies in Pediatric Chronic Care

【速读】:该论文旨在解决儿科慢性病管理中患者、照护者与医疗提供者之间因价值观差异、角色分工不明确及情境认知不对称所导致的协作障碍问题,从而影响健康结局。其解决方案的关键在于通过共同设计工作坊(co-design workshops)收集多方利益相关者的反馈,识别出从个体认知与情绪限制到心智模型错位再到照护目标冲突等多层次的情境认知障碍,并据此提出设计启示:支持持续的决策实践、对齐心智模型、平衡照护者支持与青少年自主性发展、以及凸显潜在照护挑战,最终推动促进家庭共享理解的协同决策技术设计。

链接: https://arxiv.org/abs/2603.24993
作者: Ray-Yuan Chung,Jaime Snyder,Zixuan Xu,Daeun Yoo,Athena C. Ortega,Wanda Pratt,Aaron Wightman,Ryan Hutson,Cozumel Pruette,Ari Pollack
机构: University of Washington (华盛顿大学); Johns Hopkins University (约翰霍普金斯大学); Seattle Children’s Hospital (西雅图儿童医院)
类目: Human-Computer Interaction (cs.HC)
备注: Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems

点击查看摘要

Abstract:In pediatric chronic care, the triadic relationship among patients, caregivers, and healthcare providers introduces unique challenges for youth in managing their conditions. Diverging values, roles, and asymmetrical situational awareness across decision-maker groups often hinder collaboration and affect health outcomes, highlighting the need to support collaborative decision-making. We conducted co-design workshops with 6 youth with chronic kidney disease, 6 caregivers, and 7 healthcare providers to explore how digital technologies can be designed to support collaborative decision-making. Findings identify barriers across all levels of situational awareness, ranging from individual cognitive and emotional constraints, misaligned mental models, to relational conflicts regarding care goals. We propose design implications that support continuous decision-making practice, align mental models, balance caregiver support and youth autonomy development, and surface potential care challenges. This work advances the design of collaborative decision-making technologies that promote shared understanding and empower families in pediatric chronic care.

[HC-14] Rethinking Health Agents : From Siloed AI to Collaborative Decision Mediators

【速读】:该论文试图解决的问题是:当前基于大语言模型的健康代理(Health Agents)多以孤立方式运行,仅服务于单一用户,无法有效支持医疗场景中多利益相关方(如患者、照护者与临床医生)之间的协作关系,导致情境认知碎片化和目标错位,进而加剧依从性问题。解决方案的关键在于重构AI角色——将其从独立助手转变为嵌入多方诊疗互动中的协作伙伴(AI Collaborator),通过提出一个概念框架,实现三方面功能:显式呈现上下文信息、调和不同参与者的心理模型(Mental Models)、促进共享理解,同时确保人类决策权不受侵蚀。

链接: https://arxiv.org/abs/2603.24986
作者: Ray-Yuan Chung,Xuhai Xu,Ari Pollack
机构: University of Washington (华盛顿大学); Columbia University (哥伦比亚大学); Seattle Children’s Hospital (西雅图儿童医院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted in CHI '26 Workshop on Human-Agent Collaboration

点击查看摘要

Abstract:Large language model based health agents are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare operate in siloed configurations, supporting individual users rather than the multi-stakeholder relationships central to healthcare. Such use can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. We reframe AI not as a standalone assistant, but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional pediatric chronic kidney disease case study, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.

[HC-15] PII Shield: A Browser-Level Overlay for User-Controlled Personal Identifiable Information (PII) Management in AI Interactions

【速读】:该论文旨在解决个人用户在使用基于云的大语言模型(Large Language Models, LLM)服务时,因缺乏数据保护机制而导致的敏感个人信息(Personally Identifiable Information, PII)泄露风险问题。当前,用户在与AI聊天机器人进行深度情感或认知交互时,往往无意中将隐私数据传输至科技公司,而这些公司通常具有不透明的隐私政策,且普通用户难以控制其数据流向。解决方案的关键在于引入一种面向消费者的、浏览器端的干预机制,通过两个核心组件实现:一是本地实体匿名化(local entity anonymization),用于防止敏感信息外泄;二是“烟雾弹”(smokescreens)机制,即自主代理活动干扰第三方追踪与画像行为。该方案首次将企业级PII脱敏技术转化为直观、免费、易用的消费级工具,有效弥合了用户隐私偏好与实际交互行为之间的鸿沟。

链接: https://arxiv.org/abs/2603.24895
作者: Max Holschneider,Saetbyeol LeeYouk
机构: MIT Media Lab (麻省理工学院媒体实验室)
类目: Human-Computer Interaction (cs.HC)
备注: An open-source implementation is accessible at the following GitHub repository: this https URL

点击查看摘要

Abstract:AI chatbots have quietly become the world’s most popular therapists, coaches, and confidants. Users of cloud-based LLM services are increasingly shifting from simple queries like idea generation and poem writing, to deeply personal interactions. As Large Language Models increasingly assume the role of our confessors, we are witnessing a massive, unregulated transfer of sensitive personal identifiable information (PII) to powerful tech companies with opaque privacy practices. While the enterprise sector has made great strides in addressing data leakage concerns through sophisticated guardrails and PII redaction pipelines, these powerful tools have functionally remained inaccessible for the average user due to their technical complexity. This results in a dangerous trade-off for individual users. In order to receive the therapeutic or productivity benefits of AI, users need to abandon any agency they might otherwise have over their data, often without a clear mental model of what is being shared, and how it might be used for advertising later on. This work addresses this interaction gap, bringing enterprise-grade redaction pipelines into an intuitive, first-of-its-kind, consumer-facing, and free experience. Specifically, this work introduces a scalable, browser-based intervention designed to help align user behavior with their privacy preferences during web-based AI interactions. Our system introduces two key mechanisms: local entity anonymization to prevent data leakage, and ‘smokescreens’: autonomous agent activity to disrupt third-party profiling. An open-source implementation is accessible at the GitHub repository below.
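作为示意,上文提到的“本地实体匿名化”机制,其最基本的形态可以用正则替换加占位符映射来勾勒。以下是一个极简草图(PII 类别、占位符格式与函数名均为笔者假设,并非 PII Shield 的实际实现;真实系统通常还会结合 NER 等更强的实体识别手段):

```python
import re

# 假设性草图:在本地把常见 PII 替换为占位符,
# 并保留映射,便于把 LLM 回复中的占位符还原为原文
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def anonymize(text):
    """返回 (脱敏文本, 占位符->原文 的映射)。"""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        # dict.fromkeys 去重并保持出现顺序
        for i, value in enumerate(dict.fromkeys(pattern.findall(text))):
            placeholder = f"[{label}_{i}]"
            mapping[placeholder] = value
            text = text.replace(value, placeholder)
    return text, mapping

def deanonymize(text, mapping):
    """把 LLM 回复中的占位符替换回原始 PII(仅在本地执行)。"""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

redacted, mapping = anonymize("Contact me at alice@example.com or 555-123-4567.")
```

按这一流程,实际发送给云端 LLM 的只有 `redacted`;映射表留在本地,回复返回后再用 `deanonymize` 还原。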

[HC-16] Governance in Practice: How Open Source Projects Define and Document Roles

【速读】:该论文旨在解决开源软件(Open Source Software, OSS)可持续性问题中治理结构不清晰的挑战,特别是缺乏对项目如何通过书面文档正式定义角色与权威的系统性实证研究。其解决方案的关键在于运用制度语法(Institutional Grammar)方法,从GitHub托管仓库中的this http URL文件及相关文档中提取并形式化角色定义,将每个角色分解为作用范围(scope)、权限(privileges)、义务(obligations)和生命周期规则(life-cycle rules),从而揭示不同社区间角色结构的异质性与“角色漂移”(role drift)现象,并识别出维护者悖论(Maintainer Paradox)——即少数核心成员同时承担技术、管理和社区职责,导致治理瓶颈。这一分析框架为设计更清晰的角色分工、优化责任分配及缓解领导层过载提供了实证依据,有助于构建更健康、可持续的OSS社区。

链接: https://arxiv.org/abs/2603.24879
作者: Pedro Oliveira,Tayana Conte,Marco Gerosa,Igor Steinmacher
机构: Northern Arizona University (北亚利桑那大学); Federal University of Amazonas (亚马逊联邦大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Open source software (OSS) sustainability depends not only on code contributions but also on governance structures that define who decides, who acts, and how responsibility is distributed. We lack systematic empirical evidence of how projects formally codify roles and authority in written artifacts. This paper investigates how OSS projects define and structure governance through their this http URL files and related documents. We analyze governance as an institutional infrastructure, a set of explicit rules that shape participation, decision rights, and community memory. We used Institutional Grammar to extract and formalize role definitions from repositories hosted on GitHub. We decompose each role into scope, privileges, obligations, and life-cycle rules to compare role structures across communities. Our results show that although OSS projects use a stable set of titles, identical titles carry different responsibilities, and different labels describe similar functions, which we call role drift. Still, we observed that a few actors sometimes accumulate technical, managerial, and community duties. This creates the Maintainer Paradox: those who enable broad participation simultaneously become governance bottlenecks. By understanding authority and responsibilities in OSS, our findings inform researchers and practitioners on the importance of designing clearer roles, distributing work, and reducing leadership overload to support healthier and more sustainable communities.
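论文把每个角色分解为作用范围(scope)、权限(privileges)、义务(obligations)与生命周期规则(life-cycle rules)。这一分解可以用一个简单的数据结构示意(字段名与示例取值均为笔者假设,仅用于说明分解方式):

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    """按论文思路对 OSS 治理角色的一种假设性建模。"""
    title: str
    scope: str                                       # 作用范围:角色权威覆盖的对象
    privileges: list = field(default_factory=list)   # 权限:允许做什么
    obligations: list = field(default_factory=list)  # 义务:应当做什么
    lifecycle: dict = field(default_factory=dict)    # 生命周期:如何获得/失去角色

maintainer = Role(
    title="Maintainer",
    scope="repository",
    privileges=["merge pull requests", "cut releases"],
    obligations=["review contributions", "triage issues"],
    lifecycle={"appointed_by": "core team vote", "removed_by": "inactivity policy"},
)
```

论文观察到的“角色漂移”正对应于:不同社区中 `title` 相同而 `privileges`/`obligations` 差异很大,或标签不同但职能相近。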

[HC-17] More Than “Means to an End”: Supporting Reasoning with Transparently Designed AI Data Science Processes

【速读】:该论文试图解决的问题是:当前生成式AI(Generative AI)工具虽然能够辅助用户完成复杂的数据科学任务,但其端到端的处理方式缺乏对用户在高风险领域中评估替代方案和重构问题的支持,而这正是解决开放式任务的关键能力。解决方案的关键在于构建围绕有意设计的中间产物(intermediate artifacts)的AI工作流,例如可读的查询语言、概念定义或输入输出示例,这些中间产物虽不改变整体流程的黑箱特性,却能有效支持用户进行分析决策、细化初始问题并融入自身专业判断,从而提升数据科学思维的有效性。

链接: https://arxiv.org/abs/2603.24877
作者: Venkatesh Sivaraman,Patrick Vossler,Adam Perer,Julian Hong,Jean Feng
机构: UC San Francisco (加州大学旧金山分校); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to Workshop on Tools for Thought at CHI’26: Understanding, Protecting, and Augmenting Human Cognition with Generative AI - From Vision to Implementation

点击查看摘要

Abstract:Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. While these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and how they function as tools for thought. We find that success in these systems was driven by constructing AI workflows around intentionally-designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opaqueness in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.

[HC-18] Gaze patterns predict preference and confidence in pairwise AI image evaluation

【速读】:该论文旨在解决当前偏好学习方法(如基于人类反馈的强化学习,Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)依赖于成对人类判断时,其背后认知过程尚不明确的问题。研究的关键在于利用眼动追踪技术揭示在成对AI生成图像评估中偏好形成的心理机制:通过记录30名参与者完成1800次判断任务时的眼动数据,发现注视模式(如注视时间、注视次数与回视频率)能显著预测选择结果(准确率达68%),且注视转换行为可区分高信心与低信心决策(准确率达66%),从而表明眼动轨迹提供了关于偏好标注质量的隐式信号,为改进偏好学习的数据采集与质量评估提供了新的实证依据。

链接: https://arxiv.org/abs/2603.24849
作者: Nikolas Papadopoulos,Shreenithi Navaneethan,Sheng Bai,Ankur Samanta,Paul Sajda
机构: Columbia University (哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: This paper has been accepted to ACM ETRA 2026

点击查看摘要

Abstract:Preference learning methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on pairwise human judgments, yet little is known about the cognitive processes underlying these judgments. We investigate whether eye-tracking can reveal preference formation during pairwise AI-generated image evaluation. Thirty participants completed 1,800 trials while their gaze was recorded. We replicated the gaze cascade effect, with gaze shifting toward chosen images approximately one second before the decision. Cascade dynamics were consistent across confidence levels. Gaze features predicted binary choice (68% accuracy), with chosen images receiving more dwell time, fixations, and revisits. Gaze transitions distinguished high-confidence from uncertain decisions (66% accuracy), with low-confidence trials showing more image switches per second. These results show that gaze patterns predict both choice and confidence in pairwise image evaluations, suggesting that eye-tracking provides implicit signals relevant to the quality of preference annotations.

[HC-19] SABER: Spatial Attention Brain Extended Reality

【速读】:该论文旨在解决当前对注意力神经机制的理解主要基于静态物体在二维屏幕上的任务,而缺乏对三维空间中动态目标实时追踪的神经基础认知的问题。其关键解决方案是开发了SABER(Spatial Attention, Brain, Extended Reality)框架,通过结合虚拟现实(VR)环境与脑电图(EEG)记录,验证了传统单变量EEG指标可有效扩展至包含静态和动态三维刺激的沉浸式VR情境,并利用计算建模从振荡脑活动重建目标位置的逐时刻注意力分布,从而实现了在三维空间中精确追踪注意力的可行性。

链接: https://arxiv.org/abs/2603.24830
作者: Tom Bullock,Emily Machniak,You-Jin Kim,Radha Kumaran,Justin Kasowski,Apurv Varshney,Julia Ram,Melissa M. Hernandez,Stina Johansson,Neil M. Dundon,Tobias Höllerer,Barry Giesbrecht
机构: University of California, Santa Barbara(加州大学圣塔芭芭拉分校); University of Huddersfield(哈德斯菲尔德大学); Arizona State University(亚利桑那州立大学); University of Nebraska–Lincoln(内布拉斯加林肯大学); Institute for Collaborative Biotechnologies(协作生物技术研究所)
类目: Human-Computer Interaction (cs.HC)
备注: Conference Paper, 11 pages. Published at the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR)

点击查看摘要

Abstract:Tracking moving objects is a critical skill for many everyday tasks, such as crossing a busy street, driving a car or catching a ball. Attention is a key cognitive function that supports object tracking; however, our understanding of the brain mechanisms that support attention is almost exclusively based on evidence from tasks that present stable objects at fixed locations. Accounts of multiple object tracking are also limited because they are largely based on behavioral data alone and involve tracking objects in a 2D plane. Consequently, the neural mechanisms that enable moment-by-moment tracking of goal-relevant objects remain poorly understood. To address this knowledge gap, we developed SABER (Spatial Attention, Brain, Extended Reality), a new framework for studying the behavioral and neural dynamics of attention to objects moving in 3D. Participants (n=32) completed variants of a task inspired by the popular virtual reality (VR) game, Beat Saber, where they used virtual sabers to strike stationary and moving color-defined target spheres while we recorded electroencephalography (EEG). We first established that standard univariate EEG metrics which are typically used to study spatial attention to static objects presented on 2D screens, can generalize effectively to an immersive VR context involving both static and dynamic 3D stimuli. We then used a computational modeling approach to reconstruct moment-by-moment attention to the locations of stationary and moving objects from oscillatory brain activity, demonstrating the feasibility of precisely tracking attention in a 3D space. These results validate SABER, and provide a foundation for future research that is critical not only for understanding how attention works in the physical world, but is also directly relevant to the development of better VR applications.

[HC-20] Examining the Effect of Explanations of AI Privacy Redaction in AI-mediated Interactions

【速读】:该论文旨在解决AI中介通信(AI-mediated communication)中隐私保护与用户信任之间的矛盾问题,尤其是在隐私敏感场景下,AI中介通过删减或隐藏信息来保护隐私时,如何提升用户对系统行为的理解与信任。其解决方案的关键在于设计并验证一种基于解释(explanation)的机制,即在执行红笔操作(redaction)后向接收方提供不同程度的解释说明,以增强用户对系统隐私保护能力的认知和接受度。实验结果表明,提供解释能显著提升用户对系统隐私保护效果的感知(p < 0.05,Cohen’s d ≈ 0.3),且当红笔内容较多时,解释的作用更为突出(p < 0.05,Cohen’s f ≈ 0.2),同时个体差异(如年龄和AI熟悉度)也影响解释偏好。这表明,构建适应性、情境感知的解释机制是实现隐私友好型可信AI系统的核心路径。

链接: https://arxiv.org/abs/2603.24735
作者: Roshni Kaushik,Maarten Sap,Koichi Onoue
机构: Fujitsu Research of America(富士通研究美国公司); Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: Under review at FAccT 2026

点击查看摘要

Abstract:AI-mediated communication is increasingly being utilized to help facilitate interactions; however, in privacy sensitive domains, an AI mediator has the additional challenge of considering how to preserve privacy. In these contexts, a mediator may redact or withhold information, raising questions about how users perceive these interventions and whether explanations of system behavior can improve trust. In this work, we investigate how explanations of redaction operations can affect user trust in AI-mediated communication. We devise a scenario where a validated system removes sensitive content from messages and generates explanations of varying detail to communicate its decisions to recipients. We then conduct a user study with 180 participants that studies how user trust and preferences vary for cases with different amounts of redacted content and different levels of explanation detail. Our results show that participants believed our system was more effective at preserving privacy when explanations were provided (p < 0.05, Cohen’s d ≈ 0.3). We also found that contextual factors had an impact; participants relied more on explanations and found them more helpful when the system performed extensive redactions (p < 0.05, Cohen’s f ≈ 0.2). We also found that explanation preferences depended on individual differences as well, and factors such as age and baseline familiarity with AI affected user trust in our system. These findings highlight the importance and challenge of balancing transparency and privacy in AI-mediated communications and suggest that adaptive, context-aware explanations are essential for designing privacy-aware, trustworthy AI systems.
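文中以 Cohen's d 和 Cohen's f 报告效应量。以 Cohen's d 为例,它等于两组均值之差除以合并标准差;下面用虚构的两组信任评分演示其计算过程(数据与结果均为示意,并非论文数据):

```python
import math

def cohens_d(group_a, group_b):
    """两独立样本的 Cohen's d = 均值差 / 合并标准差。"""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# 虚构数据:有解释 vs 无解释 两组对"隐私保护有效性"的评分
with_expl = [4.2, 3.8, 4.5, 4.0, 3.9]
without_expl = [3.9, 3.5, 4.1, 3.7, 3.6]
d = cohens_d(with_expl, without_expl)
```

按惯例,d ≈ 0.2 为小效应、0.5 为中等、0.8 以上为大效应;论文报告的 d ≈ 0.3 属于小到中等之间的效应量。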

[HC-21] Malicious LLM-Based Conversational AI Makes Users Reveal Personal Information USENIX-SECURITY’25

【速读】:该论文旨在解决生成式 AI (Generative AI) 对话系统(LLM-based Conversational AIs, CAIs)可能被恶意设计用于主动提取用户个人隐私信息的问题,这一风险在以往研究中尚未被充分探讨。其解决方案的关键在于通过构建不同策略的恶意与良性CAI系统,开展随机对照试验(n=502),实证验证了基于社交隐私特性(如建立信任、模拟共情)的设计策略在提升信息泄露量方面的显著有效性,同时最小化用户对隐私风险的感知。研究结果表明,恶意CAI比良性版本能更高效地获取个人信息,揭示了此类系统在现实场景中的潜在威胁,并为未来安全防护机制的设计提供了数据驱动的实践依据。

链接: https://arxiv.org/abs/2506.11680
作者: Xiao Zhan,Juan Carlos Carrillo,William Seymour,Jose Such
机构: King’s College London (伦敦国王学院); VRAIN, Universitat Politècnica de València (瓦伦西亚理工大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: This paper has been accepted at USENIX Security '25

点击查看摘要

Abstract:LLM-based Conversational AIs (CAIs), also known as GenAI chatbots, like ChatGPT, are increasingly used across various domains, but they pose privacy risks, as users may disclose personal information during their conversations with CAIs. Recent research has demonstrated that LLM-based CAIs could be used for malicious purposes. However, a novel and particularly concerning type of malicious LLM application remains unexplored: an LLM-based CAI that is deliberately designed to extract personal information from users. In this paper, we report on the malicious LLM-based CAIs that we created based on system prompts that used different strategies to encourage disclosures of personal information from users. We systematically investigate CAIs’ ability to extract personal information from users during conversations by conducting a randomized-controlled trial with 502 participants. We assess the effectiveness of different malicious and benign CAIs to extract personal information from participants, and we analyze participants’ perceptions after their interactions with the CAIs. Our findings reveal that malicious CAIs extract significantly more personal information than benign CAIs, with strategies based on the social nature of privacy being the most effective while minimizing perceived risks. This study underscores the privacy threats posed by this novel type of malicious LLM-based CAIs and provides actionable recommendations to guide future research and practice.

[HC-22] Online Advertising is a Regrettable Necessity: On the Dangers of Pay-Walling the Web

【速读】:该论文试图解决的问题是:当前互联网日益加剧的付费墙(paywall)趋势正在威胁开放网络(open web)模型的可持续性,导致经济弱势群体难以获取信息资源,进而加剧数字鸿沟(digital divide),并可能引发在线广告生态系统的崩溃。其解决方案的关键在于重新审视以广告支持的开放网络商业模式,并通过构建“人均国民收入(GNI)与平均付费墙支出之间的差距基线”来量化全球范围内无法负担全付费网页场景的国家和人口规模——研究发现,135个国家共计约65.6亿人无法承受完全付费墙模式,这凸显了维持开放网络的紧迫性和必要性。论文呼吁开展进一步的研究与政策行动,以确保网络的开放性与包容性,建立更具可持续性的商业模型。

链接: https://arxiv.org/abs/2409.00026
作者: Yonas Kassa
机构: 未知
类目: Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI)
备注: 2 figs

点击查看摘要

Abstract:The exponential growth of the web and its benefits can be attributed largely to its open model, where anyone with an internet connection can access information on the web for free. This has created unprecedented opportunities for various members of society including the most vulnerable, as recognized by organizations such as the UN. This, in turn, can be attributed to online advertising, which has been the main financier of the open web. However, recent trends of paywalling information and services on the web are creating imminent dangers to such an open model of the web, inhibiting access for the economically vulnerable, and eventually creating digital segregation. In this paper, we argue that this emerging model lacks sustainability, exacerbates the digital divide, and might lead to the collapse of online advertising. We revisit the ad-supported open web business model and demonstrate how global users actually pay for the ads they see. Using data on GNI (gross national income) per capita and average paywall access costs, we establish a simple income-paywall expenditure gap baseline. With this baseline we show that 135 countries, with a total population estimate of 6.56 billion people, cannot afford a scenario of a fully paywalled web. We further discuss how a mixed model of the so-called “premium services” creates digital segregation and poses a danger to the online advertising ecosystem. Finally, we call for further research and policy initiatives to keep the web open and more inclusive with a sustainable business model.
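文中的“人均 GNI 与付费墙支出差距基线”本质上是一个简单的可负担性比较:把假想的“全付费墙”年支出与人均 GNI 的某个比例相比。下面用完全虚构的数字示意这一计算方式(价格、站点数与预算比例均为笔者假设,非论文数据):

```python
# 假设:单个付费墙月均 15 美元,日常必需 10 个付费站点
MONTHLY_PAYWALL_COST = 15.0
NUM_ESSENTIAL_SITES = 10
ANNUAL_COST = MONTHLY_PAYWALL_COST * NUM_ESSENTIAL_SITES * 12  # 1800 美元/年

def can_afford(gni_per_capita, budget_share=0.02):
    """若年付费墙支出不超过人均 GNI 的 budget_share(此处假设 2%),视为可负担。"""
    return ANNUAL_COST <= gni_per_capita * budget_share

# 三个虚构的人均 GNI 水平(美元):只有高收入水平能负担
affordable = [gni for gni in (90_000, 12_000, 2_000) if can_afford(gni)]
```

在这些假设值下,年支出 1800 美元要求人均 GNI 不低于 90,000 美元才算可负担,直观体现了论文所说的收入与付费墙支出之间的巨大差距。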

[HC-23] History of generative Artificial Intelligence (AI) chatbots: past, present and future development

【速读】:该论文旨在系统梳理对话式人工智能(Chatbot)技术从早期基于规则的简单系统到当前由人工智能驱动的高级对话代理的发展历程,以厘清其演进路径中的关键里程碑与范式转变。其解决方案的关键在于通过整合学术文献与行业资料,对多个重要节点进行历史性分析,包括图灵测试的提出、CALO等标志性项目以及基于Transformer架构的现代模型,从而揭示自然语言处理(Natural Language Processing, NLP)与机器学习(Machine Learning, ML)如何逐步融合进聊天机器人体系,推动其实现更复杂的交互能力。这一综述为研究者和相关利益方提供了理解对话式AI发展轨迹及其未来应用潜力的重要背景依据。

链接: https://arxiv.org/abs/2402.05122
作者: Md. Al-Amin,Mohammad Shazed Ali,Abdus Salam,Arif Khan,Ashraf Ali,Ahsan Ullah,Md Nur Alam,Shamsul Kabir Chowdhury
机构: 未知
类目: General Literature (cs.GL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This research provides an in-depth comprehensive review of the progress of chatbot technology over time, from the initial basic systems relying on rules to today’s advanced conversational bots powered by artificial intelligence. Spanning many decades, the paper explores the major milestones, innovations, and paradigm shifts that have driven the evolution of chatbots. Looking back at the very basic statistical model in 1906 and the early chatbots, such as ELIZA and ALICE in the 1960s and 1970s, the study traces key innovations leading to today’s advanced conversational agents, such as ChatGPT and Google Bard. The study synthesizes insights from academic literature and industry sources to highlight crucial milestones, including the introduction of Turing tests, influential projects such as CALO, and recent transformer-based models. Tracing the path forward, the paper highlights how natural language processing and machine learning have been integrated into modern chatbots for more sophisticated capabilities. This chronological survey of the chatbot landscape provides a holistic reference for understanding the technological and historical factors propelling conversational AI. By synthesizing learnings from this historical analysis, the research offers important context about the developmental trajectory of chatbots and their immense future potential across various fields of application, which could serve as takeaways for the respective research community and stakeholders.

[HC-24] Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

【速读】:该论文旨在解决当前结直肠癌早期筛查中AI系统开发所面临的挑战,即缺乏密集标注、长序列视频数据集,尤其是无法满足现代多模态大语言模型(Multimodal Large Language Models, MLLMs)对空间、时间与语言信息联合评估的需求。其解决方案的关键在于提出了一种新颖的多阶段智能体(agentic)工作流,通过整合时序提议、边界框跟踪、AI驱动的视觉确认以及人机协同审核机制,实现了全手术过程视频的可扩展标注。该方法构建了迄今为止规模最大、标注最丰富的基准数据集Colon-Bench,包含528段视频、14类病变、超30万个体素框、21.3万个分割掩码及13.3万个临床描述词,从而为MLLM在病变分类、开放词汇视频目标分割(Open-Vocabulary Video Object Segmentation, OV-VOS)和视频视觉问答(Video Visual Question Answering, VQA)等任务上的性能评估提供了坚实基础。

链接: https://arxiv.org/abs/2603.25645
作者: Abdullah Hamdi,Changchun Yang,Xin Gao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: preprint

点击查看摘要

Abstract:Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel “colon-skill” prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at this https URL .

计算机视觉

[CV-0] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

【速读】:该论文旨在解决多镜头视频生成(multi-shot video generation)在长篇叙事场景中面临的交互性差与延迟高的问题,尤其是现有双向架构在实时互动和高效帧生成方面的局限。其核心解决方案是提出一种新的因果式多镜头架构 ShotStream,通过将任务重新定义为基于历史上下文的下一镜头生成,实现用户对正在进行叙事的动态指令干预。关键创新包括:一是引入双缓存记忆机制,利用全局上下文缓存维持跨镜头一致性、局部上下文缓存保障单镜头内一致性,并借助 RoPE 不连续指示符明确区分两者以消除歧义;二是设计两阶段蒸馏策略,先在单镜头内使用真实历史帧进行自强制训练,再逐步扩展至使用自生成历史进行跨镜头自强制训练,有效缓解自回归生成中的误差累积问题,从而在单张 GPU 上实现亚秒级延迟(16 FPS)且质量优于或等同于传统双向模型。

链接: https://arxiv.org/abs/2603.25746
作者: Yawen Luo,Xiaoyu Shi,Junhao Zhuang,Yutian Chen,Quande Liu,Xintao Wang,Pengfei Wan,Tianfan Xue
机构: MMLab, CUHK; Kuaishou Technology; CPII under InnoHK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator is employed to explicitly distinguish the two caches and eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our project page.

[CV-1] Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

【速读】:该论文旨在解决现有前向传播式三维高斯点绘(3D Gaussian Splatting)方法在图像分辨率提升时,因像素对齐的高斯原型数量呈二次增长而导致的可扩展性瓶颈问题,这一限制使得4K级高分辨率新视角合成在前向传播方法中难以实现。解决方案的关键在于提出LGTM(Less Gaussians, Texture More)框架,通过预测紧凑的高斯原型并为其分配每个原型的纹理信息,从而将几何复杂度与渲染分辨率解耦,使得在无需场景级优化的前提下即可实现高质量的4K新视角合成,同时显著减少所需的高斯原型数量。

链接: https://arxiv.org/abs/2603.25745
作者: Yixing Lao,Xuyang Bai,Xiaoyang Wu,Nuoyuan Yan,Zixin Luo,Tian Fang,Jean-Daniel Nahmias,Yanghai Tsin,Shiwei Li,Hengshuang Zhao
机构: HKU (香港大学); Apple (苹果公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: this https URL

[CV-2] MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

【速读】:该论文旨在解决视觉基础模型(Vision Foundation Models, VFMs)在推理阶段通常仅依赖单一固定尺度输入的问题,而忽略了不同分辨率可提供互补的归纳偏置——低分辨率有助于全局语义识别,高分辨率则利于细粒度特征提取。其解决方案的关键在于提出一种简单且通用的多分辨率融合策略(Multi-Resolution Fusion, MuRF),该方法在不进行任何训练的前提下,通过冻结的VFM对同一图像在多个分辨率下进行处理,并融合所得特征以构建统一表示,从而有效利用多尺度信息提升视觉表征能力。

链接: https://arxiv.org/abs/2603.25744
作者: Bocheng Zou,Mu Cai,Mark Stanley,Dingfu Lu,Yong Jae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
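MuRF 的核心操作——同一图像在多个分辨率下经冻结编码器前向、再融合各尺度特征——可以用如下极简草图示意。这里用固定网格的平均池化代替真实 VFM,下采样与融合方式(归一化后取平均)均为笔者假设:

```python
import numpy as np

def frozen_vfm(image, grid=4):
    """假设的"冻结编码器":自适应平均池化到 grid x grid 并展平,
    使任意输入分辨率都输出等长特征(真实场景中为 DINOv2 等 VFM)。"""
    h, w = image.shape
    ys = np.array_split(np.arange(h), grid)
    xs = np.array_split(np.arange(w), grid)
    pooled = np.array([[image[np.ix_(y, x)].mean() for x in xs] for y in ys])
    return pooled.reshape(-1)

def murf_fuse(image, scales=(0.5, 1.0)):
    """在多个分辨率下前向同一图像,对归一化特征取平均(免训练,仅推理时使用)。"""
    feats = []
    for s in scales:
        size = max(4, int(image.shape[0] * s))
        idx = np.linspace(0, image.shape[0] - 1, size).astype(int)  # 简易下采样
        resized = image[idx][:, idx]
        f = frozen_vfm(resized)
        f = f / (np.linalg.norm(f, axis=-1, keepdims=True) + 1e-8)
        feats.append(f)
    return np.mean(feats, axis=0)

img = np.random.default_rng(0).random((16, 16))
fused = murf_fuse(img)
```

关键设计在于编码器冻结、融合发生在特征层:低分辨率分支贡献全局语义,高分辨率分支贡献细粒度信息,而整个过程不需要任何训练。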

[CV-3] RefAlign: Representation Alignment for Reference-to-Video Generation

【速读】:该论文旨在解决参考图像到视频(Reference-to-video, R2V)生成任务中因跨模态特征异构性导致的复制粘贴伪影(copy-paste artifacts)和多主体混淆(multi-subject confusion)问题。现有方法通常通过引入额外的高层语义或跨模态特征与VAE潜空间表示共同输入扩散Transformer(DiT),虽能缓解像素级信息泄露,但难以有效对齐不同模态间的语义空间。其解决方案的关键在于提出RefAlign框架,该框架通过设计一种参考对齐损失(reference alignment loss),在训练阶段显式地将DiT参考分支特征对齐至视觉基础模型(Visual Foundation Model, VFM)的语义空间:一方面拉近同一主体的参考特征与VFM特征的距离以提升身份一致性,另一方面推远不同主体对应特征的距离以增强语义可区分性。此策略仅在训练时生效,不增加推理开销,且显著提升了文本可控性与参考保真度之间的平衡。

链接: https://arxiv.org/abs/2603.25743
作者: Lei Wang,YuXin Song,Ge Wu,Haocheng Feng,Hang Zhou,Jingdong Wang,Yaxing Wang,jian Yang
机构: Nankai University (南开大学); Baidu Inc. (百度公司); Nanjing University (南京大学); Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures

点击查看摘要

Abstract:Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy–paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
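RefAlign 的核心是“同主体拉近、异主体推远”的对齐目标。下面给出一个示意性的损失实现(余弦相似度形式与 margin 铰链均为笔者假设,并非论文的原始公式):

```python
import numpy as np

def reference_alignment_loss(dit_feats, vfm_feats, subject_ids, margin=0.5):
    """dit_feats / vfm_feats: (N, D) 特征;subject_ids: 长度 N 的主体标签。
    同主体对最小化 1 - cos 相似度;异主体对相似度超过 margin 时才受罚。"""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sim = normalize(dit_feats) @ normalize(vfm_feats).T  # (N, N) 余弦相似度
    n = len(subject_ids)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if subject_ids[i] == subject_ids[j]:
                loss += 1.0 - sim[i, j]               # 拉近:同主体
            else:
                loss += max(0.0, sim[i, j] - margin)  # 推远:异主体
    return loss / (n * n)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
# 特征完全对齐时损失小,完全反向(取负)时同主体项损失最大
aligned_loss = reference_alignment_loss(feats, feats, [0, 1, 2, 3])
misaligned_loss = reference_alignment_loss(feats, -feats, [0, 1, 2, 3])
```

与论文设定一致,这类损失只在训练阶段加入总目标,推理时不产生任何额外开销。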

[CV-4] Vega: Learning to Drive with Natural Language Instructions

【速读】:该论文旨在解决当前自动驾驶系统在决策过程中对语言模态利用不足的问题,即现有方法通常仅将语言用于场景描述或推理,缺乏根据多样化用户指令进行个性化驾驶的能力。其解决方案的关键在于提出一个统一的视觉-语言-世界-动作模型(Vega),采用自回归范式处理视觉输入和语言指令,并结合扩散模型生成未来状态预测与轨迹规划;同时通过联合注意力机制实现多模态交互,并引入独立的投影层增强各模态的表达能力,从而显著提升指令跟随能力和规划性能。

Link: https://arxiv.org/abs/2603.25741
Authors: Sicheng Zuo,Yuxuan Li,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University; GigaAI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Code is available at this https URL

Click to view abstract

Abstract:Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

[CV-5] MegaFlow: Zero-Shot Large Displacement Optical Flow KR

[Quick Read]: This paper targets the performance bottleneck of large-displacement optical flow estimation under zero-shot generalization: existing methods rely on iterative local search or domain-specific fine-tuning and struggle with complex motion and cross-domain transfer. The key to the solution is MegaFlow, which uses global features from a pre-trained Vision Transformer to cast flow estimation as a global matching problem that naturally captures large displacements, then applies a few lightweight iterative refinement steps to improve sub-pixel accuracy, achieving excellent zero-shot performance and cross-task transfer without task-specific architectural design.

Link: https://arxiv.org/abs/2603.25739
Authors: Dingxi Zhang,Fangjinhua Wang,Marc Pollefeys,Haofei Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL Code: this https URL

Click to view abstract

Abstract:Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search or/and domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: this https URL.
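
Casting flow estimation as global matching, as MegaFlow does with ViT features, can be illustrated with a toy sketch: correlate every source pixel's feature against all target pixels (no local search window) and read the flow off the best match. This is not the paper's implementation, which uses learned features, soft matching, and iterative refinement; the hard argmax decoding and names below are illustrative assumptions:

```python
import numpy as np

def global_matching_flow(feat1, feat2):
    """feat1, feat2: (H, W, C) feature maps. Returns per-pixel flow (H, W, 2)
    by matching each source pixel against ALL target pixels, which is what
    allows arbitrarily large displacements to be recovered."""
    H, W, C = feat1.shape
    f1 = feat1.reshape(H * W, C)
    f2 = feat2.reshape(H * W, C)
    corr = f1 @ f2.T                       # (HW, HW) global correlation volume
    match = corr.argmax(axis=1)            # best target index per source pixel
    ys, xs = np.divmod(match, W)           # matched target coordinates
    gy, gx = np.mgrid[0:H, 0:W]            # source pixel grid
    flow = np.stack([xs.reshape(H, W) - gx, ys.reshape(H, W) - gy], axis=-1)
    return flow                            # (H, W, 2): (dx, dy)
```

With distinctive per-pixel features and a purely horizontal shift between the two frames, the recovered flow equals the shift regardless of its magnitude.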

[CV-6] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow CVPR2026

[Quick Read]: This paper addresses the challenge of automated graphic design systems faithfully conveying user intent and producing editable design files; existing methods typically simplify professional workflows, limiting flexibility and intuitiveness. The key to the solution is PSDesigner, an automated system that emulates the creative workflow of human designers: multiple specialized components collect theme-related assets from user instructions and autonomously infer and execute tool calls to manipulate design files (e.g., integrating new assets or refining low-quality elements). To strengthen the system's tool-use capability, the authors construct CreativePSD, a large-scale, high-quality PSD design dataset with operation traces annotated across diverse design scenarios and artistic styles, enabling models to learn expert design procedures and markedly improving design quality and practicality.

Link: https://arxiv.org/abs/2603.25738
Authors: Xincheng Shuai,Song Tang,Yutong Huang,Henghui Ding,Dacheng Tao
Affiliations: Fudan University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Project Page: this https URL

Click to view abstract

Abstract:Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

[CV-7] How good was my shot? Quantifying Player Skill Level in Table Tennis

[Quick Read]: This paper addresses how to automatically quantify individual skill level in complex interactive behavior, particularly in highly tactical, context-dependent dyadic sports such as table tennis, where the core challenge is that skill is latent and cannot be read directly from observed actions. The key to the solution is a generative model of each player's tactical racket strokes, with the player models jointly embedded in a shared latent space that encodes individual characteristics, including skill level. Trained on a large-scale dataset of 3D-reconstructed professional matches and conditioned on comprehensive game context (e.g., player positioning and opponent behavior), the models capture each player's distinctive tactical identity. Probing this latent space and training a simple relative ranking network then yields both relative and absolute skill predictions, showing that the approach can effectively quantify skill and providing a new path toward automated skill assessment in complex interactive behavior.

Link: https://arxiv.org/abs/2603.25736
Authors: Akihiro Kubota,Tomoya Hasegawa,Ryo Kawahara,Ko Nishino
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Gauging an individual’s skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports – specifically table tennis – where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player’s tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context – including player positioning and opponent behaviors – the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.
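
The "simple relative ranking network" trained on the player embeddings is not specified in detail here; a standard choice for such relative supervision is a pairwise margin ranking loss over scalar skill scores. A minimal numpy sketch under that assumption (the function name and margin value are illustrative, not from the paper):

```python
import numpy as np

def pairwise_margin_ranking_loss(score_a, score_b, a_is_better, margin=1.0):
    """Relative skill ranking sketch: given scalar scores predicted from two
    player embeddings, penalize pairs where the known-better player does not
    win by at least `margin`. All inputs are arrays over pairs."""
    sign = np.where(a_is_better, 1.0, -1.0)           # +1 if A should outrank B
    return float(np.maximum(0.0, margin - sign * (score_a - score_b)).mean())
```

Once such a scorer is trained on relative labels, absolute skill can be read off the score itself, which matches the paper's claim that both relative and absolute predictions are obtainable from the learned space.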

[CV-8] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

[Quick Read]: This paper tackles the difficulty of generating realistic human-object interaction (HOI) animations, which requires jointly modeling dynamic human actions and diverse object geometries. Traditional diffusion-based methods rely on hand-crafted contact priors or imposed kinematic constraints to improve contact quality, but these lack flexibility and generalize poorly. The key innovation of the proposed LIGHT is to derive data-driven guidance from the denoising process itself: the representation is factored into modality-specific components with asynchronous denoising schedules and individualized noise levels, so that cleaner components guide noisier ones through cross-attention, yielding contact-aware guidance without auxiliary classifiers. This pace-induced guidance mirrors contact priors more effectively than conventional classifier-free guidance, notably improving contact fidelity, generation realism, and generalization to unseen objects and tasks.

Link: https://arxiv.org/abs/2603.25734
Authors: Ziyin Wang,Sirui Xu,Chuan Guo,Bing Zhou,Jiangshan Gong,Jian Wang,Yu-Xiong Wang,Liang-Yan Gui
Affiliations: University of Illinois Urbana-Champaign; Snap Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this http URL

Click to view abstract

Abstract:Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.

[CV-9] SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding CVPR2026

[Quick Read]: This paper addresses two issues of multimodal large language models (MLLMs) on video temporal grounding (VTG): their coarse recognition is insufficient for fine-grained temporal understanding, and task-specific fine-tuning makes them memorize dataset-specific shortcuts, degrading out-of-domain (OOD) generalization. The key to the solution, SlotVTG, is a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention, with objectness priors from a self-supervised vision model guiding the formation of semantically coherent slots. This steers the model toward object-centric visual reasoning grounded in the input without re-running the entire multi-stage pipeline, markedly improving OOD robustness while maintaining in-domain (ID) performance.

Link: https://arxiv.org/abs/2603.25733
Authors: Jiwook Han,Geo Ahn,Youngrae Kim,Jinwoo Choi
Affiliations: Kyung Hee University; University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to GRAIL-V workshop at CVPR 2026

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
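
Slot attention, the mechanism the slot adapter builds on, decomposes a set of tokens by letting slots compete for each token: the softmax runs over the slot axis, so every token's attention mass is split among slots rather than among tokens. The sketch below is the bare iterative update without the learned projections, GRU update, and objectness priors the paper uses; it is illustrative only:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens, slots, n_iters=3):
    """Minimal slot-attention iteration. tokens: (N, D) visual tokens,
    slots: (K, D) initial slots. Softmax over the SLOT axis makes slots
    compete for tokens; each slot becomes the weighted mean of the
    tokens it wins, iterated to a stable grouping."""
    for _ in range(n_iters):
        attn = softmax(slots @ tokens.T, axis=0)       # (K, N): competition over slots
        attn = attn / attn.sum(axis=1, keepdims=True)  # normalize for weighted mean
        slots = attn @ tokens                          # update slots from won tokens
    return slots
```

On two well-separated token clusters, each slot converges to one cluster center, which is the "semantically coherent slot formation" behavior the adapter relies on.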

[CV-10] BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

[Quick Read]: This paper addresses the lack of systematic evaluation standards for image generation models on real-world commercial design tasks; existing benchmarks focus mainly on natural image synthesis and cannot comprehensively measure performance under structured, multi-constraint conditions. The key to the solution is BizGenEval, a systematic benchmark for commercial visual content generation covering five representative document types (slides, charts, webpages, posters, and scientific figures) and 20 diverse evaluation tasks spanning four core capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning. It contains 400 carefully curated prompts and 8,000 human-verified checklist questions, enabling rigorous assessment of whether generated images satisfy complex visual and semantic constraints.

Link: https://arxiv.org/abs/2603.25732
Authors: Yan Li,Zezi Zeng,Ziwei Zhou,Xin Gao,Muzhao Tian,Yifan Yang,Mingxi Cheng,Qi Dai,Yuqing Yang,Lili Qiu,Zhendong Wang,Zhengyuan Yang,Xue Yang,Lijuan Wang,Ji Li,Chong Luo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.

[CV-11] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

[Quick Read]: This paper addresses three bottlenecks of autoregressive video diffusion models in long-video generation: uncontrolled linear KV-cache growth, temporal repetition, and error accumulation. The key to the solution, PackForcing, is a novel three-partition KV-cache management strategy for efficiently compressing and maintaining historical context: (1) sink tokens keep early anchor frames at full resolution to preserve global semantics; (2) mid tokens achieve 32x spatiotemporal compression via a dual-branch network fusing 3D convolutions with low-resolution VAE re-encoding; (3) recent tokens remain at full resolution to ensure local temporal coherence. A dynamic top-k context selection mechanism and a continuous Temporal RoPE Adjustment further enable high-quality generation of 2-minute, 832x480, 16 FPS videos with a strictly bounded memory footprint (only 4 GB of KV cache) and 24x temporal extrapolation (5 s to 120 s).

Link: https://arxiv.org/abs/2603.25730
Authors: Xiaofeng Mao,Shaohao Rui,Kaining Ying,Bo Zheng,Chuanhao Li,Mingmin Chi,Kaipeng Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top- k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. this https URL
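
The three-partition cache with top-k mid-token selection can be sketched as pure bookkeeping over token indices: sink and recent tokens always survive, and only the k most relevant mid tokens do, so the kept set has a fixed maximum size. This is a conceptual illustration with assumed parameter names and toy sizes, not PackForcing's actual compression pipeline (which additionally re-encodes mid tokens at 32x reduction):

```python
def partition_kv_cache(num_tokens, relevance, n_sink=2, n_recent=3, k_mid=2):
    """Three-partition cache sketch. `relevance[i]` stands in for an
    attention-based relevance score of cached token i to the current query.
    Returns the kept token indices: all sink (earliest) and recent (latest)
    tokens, plus the top-k_mid most relevant mid tokens, bounding cache size
    at n_sink + k_mid + n_recent regardless of how long generation runs."""
    sink = list(range(n_sink))
    recent = list(range(num_tokens - n_recent, num_tokens))
    mid = list(range(n_sink, num_tokens - n_recent))
    top_mid = sorted(mid, key=lambda i: relevance[i], reverse=True)[:k_mid]
    return sink + sorted(top_mid) + recent
```

Dropping mid tokens leaves gaps in the temporal positions, which is exactly why the paper pairs this selection with a continuous Temporal RoPE Adjustment to re-align positions.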

[CV-12] PixelSmile: Toward Fine-Grained Facial Expression Editing

[Quick Read]: This paper addresses structural confusion caused by semantic overlap in fine-grained facial expression editing, in particular achieving continuous, controllable, and precise expression control while preserving identity. The key to the solution is PixelSmile, a framework that disentangles expression semantics through fully symmetric joint training and combines intensity supervision with contrastive learning to make expressions stronger and more distinguishable; interpolation in the textual latent space enables stable linear control, striking a good balance between editing accuracy and identity preservation while supporting smooth expression blending.

Link: https://arxiv.org/abs/2603.25728
Authors: Jiabin Hua(1 and 2),Hengyuan Xu(1 and 2),Aojie Li(2),Wei Cheng(2),Gang Yu(2),Xingjun Ma(1),Yu-Gang Jiang(1) ((1) Fudan University, (2) StepFun)
Affiliations: Fudan University; StepFun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 21 Pages; Project Page: this https URL Code: this https URL

Click to view abstract

Abstract:Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

[CV-13] AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

[Quick Read]: This paper addresses the shortage and limited diversity of training data for 3D hand pose estimation, noting that existing synthetic datasets generally lack key settings such as occlusion, arm details, and aligned depth. The key to the solution is AnyHand, a large-scale, high-quality synthetic dataset containing 2.5M single-hand and 4.1M hand-object-interaction RGB-D images with rich geometric annotations. Experiments show that the dataset substantially improves performance and generalization for both RGB-only and RGB-D inputs, and a lightweight depth-fusion module is proposed to help RGB models exploit depth information more effectively.

Link: https://arxiv.org/abs/2603.25726
Authors: Chen Si,Yulin Liu,Bo Ai,Jianwen Xie,Rolandos Alexandros Potamias,Chuanxia Zheng,Hao Su
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

[CV-14] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models CVPR2026

[Quick Read]: This paper addresses the limited ability of contrastive vision-language (VL) models to learn compositional representations. Existing methods typically generate task-specific hard negatives to boost compositionality benchmarks, but these often fail to generalize and can degrade the model's basic VL capabilities (such as zero-shot or retrieval performance). The authors identify two root causes: long training captions do not force compositional representations, and the final global pooling in the text and image encoders discards binding information entirely. The key to the solution is twofold: extract short, concept-centric caption parts with standard NLP tools and align them with the image, and introduce a parameter-free cross-modal attention-pooling mechanism to obtain concept-centric visual embeddings from the image encoder. Together with simple auxiliary contrastive losses, this achieves state-of-the-art performance on standard compositionality benchmarks while maintaining or even improving zero-shot and retrieval capabilities, with no added inference cost.

Link: https://arxiv.org/abs/2603.25722
Authors: Hai X. Pham,David T. Hoffmann,Ricardo Guerrero,Brais Martinez
Affiliations: Samsung AI Center Cambridge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2026

Click to view abstract

Abstract:Contrastive vision-language (VL) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of VL models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic VL capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of VLs: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at this https URL.
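
The parameter-free cross-modal attention-pooling can be sketched in a few lines: a concept's text embedding serves as the query, and the image patch tokens are pooled with softmax weights from their similarity to it, instead of collapsing everything into one global image vector. A minimal numpy illustration (the temperature and function name are assumptions, not the paper's exact definition):

```python
import numpy as np

def attention_pool(patch_tokens, text_query, temperature=1.0):
    """Parameter-free cross-modal pooling sketch. patch_tokens: (N, D)
    image patch features; text_query: (D,) embedding of a short,
    concept-centric caption part. Patches matching the concept dominate
    the pooled embedding, preserving the binding information a single
    global pool would discard."""
    logits = patch_tokens @ text_query / temperature   # (N,) similarities
    logits = logits - logits.max()                     # numerical stability
    w = np.exp(logits)
    w = w / w.sum()                                    # softmax weights over patches
    return w @ patch_tokens                            # (D,) concept-centric embedding
```

Because the pooling has no learned parameters, it adds no inference cost beyond one extra dot product per concept, consistent with the paper's claim.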

[CV-15] R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

[Quick Read]: This paper addresses the problem that current multimodal models produce contradictory predictions for visual and textual representations of the same concept, i.e., a lack of cross-modal consistency, which induces systematic biases and hurts reasoning accuracy. The key to the solution is RC2, a reinforcement learning framework that enforces cross-modal cycle consistency: the model must perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, yielding a dense, label-free reward signal. This structural constraint encourages the model to align its internal representations autonomously, mitigating modality-specific errors and improving reasoning accuracy by up to 7.6 points.

Link: https://arxiv.org/abs/2603.25720
Authors: Zirui Zhang,Haoyu Dong,Kexin Pei,Chengzhi Mao
Affiliations: Rutgers University; Columbia University; University of Chicago
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.
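
The cycle-consistency reward is structurally simple: run backward inference from the answer, switch modality, run forward inference, and reward the model only if the original answer is reconstructed, so no ground-truth label is needed. The toy sketch below uses placeholder string functions in place of real model calls, purely to show the reward's shape:

```python
def cycle_consistency_reward(forward_fn, backward_fn, answer):
    """Label-free cycle reward sketch: backward_fn recovers a premise from
    the answer (standing in for the modality-switched backward inference);
    forward_fn must then reconstruct the same answer. Reward 1.0 iff the
    cycle closes. Both functions are hypothetical stand-ins for model calls."""
    premise = backward_fn(answer)          # answer -> inferred premise
    reconstructed = forward_fn(premise)    # premise -> answer again
    return 1.0 if reconstructed == answer else 0.0
```

In RC2 this binary check would be a dense reward computed over the model's own predictions, which is what lets reinforcement learning optimize consistency without labels.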

[CV-16] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

[Quick Read]: This paper addresses the continuity problem in video world models when dynamic subjects go out of sight and later re-emerge: existing methods struggle to preserve such subjects' identity and motion after brief invisibility, producing frozen, distorted, or vanishing results. The key to the solution is a new paradigm, Hybrid Memory, which requires models both to act as precise archivists of the static background and to continuously track dynamic subjects, ensuring motion continuity during out-of-view intervals. The authors further design HyDRA, a dedicated memory architecture that compresses memory into tokens and uses spatiotemporal relevance-driven retrieval to selectively attend to key motion cues, effectively preserving the identity and motion of hidden subjects.

Link: https://arxiv.org/abs/2603.25716
Authors: Kaijin Chen,Dingkang Liang,Xin Zhou,Yikang Ding,Xiaoqiang Liu,Pengfei Wan,Xiang Bai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

[CV-17] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLM s

[Quick Read]: This paper addresses multimodal hallucinations produced during the parallel masked decoding of multimodal diffusion large language models (MDLLMs). The root cause is that the decoder ranks candidate tokens by textual likelihood alone without verifying localized visual support, so the language probability distribution acts as a misspecified proxy objective for the actual multimodal task. The key to the solution is VISAGE, a training-free inference-time calibration framework that estimates this proxy discrepancy by quantifying the spatial entropy of cross-attention distributions and enforces a localization consensus across attention heads, penalizing spatially uniform attention and re-ranking token choices to favor visually grounded outputs. The method proves robust across multiple benchmarks, markedly improving the visual grounding of generated content.

Link: https://arxiv.org/abs/2603.25711
Authors: Vishal Narnaware,Animesh Gupta,Kevin Zhai,Zhenyi Wang,Mubarak Shah
Affiliations: Institute of Artificial Intelligence, University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
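
The spatial-entropy signal can be sketched directly from its definition: normalize a cross-attention map into a distribution, compute its Shannon entropy (maximal for uniform, i.e. ungrounded, attention), and subtract an entropy penalty from the textual logit before re-ranking candidates. The weighting `alpha` and function names below are illustrative assumptions, not VISAGE's exact calibration rule:

```python
import numpy as np

def spatial_entropy(attn):
    """Shannon entropy of a (non-negative) spatial cross-attention map.
    A uniform map, meaning no localized visual support, has maximal entropy;
    a sharply peaked map has entropy near zero."""
    p = attn / attn.sum()
    p = p[p > 0]                           # 0 * log(0) treated as 0
    return float(-(p * np.log(p)).sum())

def grounded_score(text_logit, attn, alpha=1.0):
    """Re-rank a candidate token: textual likelihood minus an entropy
    penalty, so spatially uniform (ungrounded) candidates are demoted."""
    return text_logit - alpha * spatial_entropy(attn)
```

In the full framework this penalty is additionally aggregated across attention heads to enforce a localization consensus; the sketch shows only the per-map score.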

[CV-18] RACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

[Quick Read]: This paper addresses editing the motion trajectory of a target object in a video, i.e., modifying its path without disrupting the original scene content. Traditional methods either manipulate appearance or rely on point-track-based trajectory control, which is hard to operate in videos with camera motion and demands complex user interaction. The key to the solution is Trace, a two-stage framework: the first stage uses a cross-view motion transformation module to map a trajectory designed by the user in a single anchor frame to frame-aligned bounding-box trajectories under camera motion; the second stage uses a motion-conditioned video re-synthesis module to regenerate the object along these trajectories while keeping the rest of the video consistent, achieving more coherent, realistic, and controllable object-centric motion edits.

Link: https://arxiv.org/abs/2603.25707
Authors: Quynh Phung,Long Mai,Cusuh Ham,Feng Liu,Jia-Bin Huang,Aniruddha Mahapatra
Affiliations: University of Maryland, College Park; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: webpage: this https URL

Click to view abstract

Abstract:We study object motion path editing in videos, where the goal is to alter a target object’s trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

[CV-19] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training CVPR2026

[Quick Read]: This paper addresses a limitation of current unified multimodal models in interleaved generation: although they accept multimodal inputs, they usually output only one modality, and alternating text-visual generation is hard due to training data scarcity and the difficulty of modeling long-range cross-modal context. The key to the solution is to decompose interleaved generation into two stages: a planner that produces dense textual descriptions to guide visual content, and a visualizer that synthesizes images from those descriptions, providing textual coherence and visual consistency. The authors construct large-scale textual-proxy interleaved data to train the planner and curate reference-guided image data to train the visualizer, yielding Wan-Weaver, a framework that achieves interleaved generation superior to existing methods without any real interleaved data.

Link: https://arxiv.org/abs/2603.25706
Authors: Jinbo Xing,Zeyinzi Jiang,Yuxiang Tuo,Chaojie Mao,Xiaotang Gai,Xi Chen,Jingfeng Zhang,Yulin Pan,Zhen Han,Jie Xiao,Keyu Yan,Chenwei Xie,Chongyang Zhong,Kai Zhu,Tong Shen,Lianghua Huang,Yu Liu,Yujiu Yang
Affiliations: Tongyi Lab; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Camera-ready, Webpage: this https URL

Click to view abstract

Abstract:Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model’s capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

[CV-20] LEMMA: Laplacian pyramids for Efficient Marine SeMAntic Segmentation CVPR2026

[Quick Read]: This paper addresses the difficulty of deploying semantic segmentation in marine environments under resource constraints: existing methods mostly rely on computationally expensive deep CNNs and transformer-based architectures and cannot meet real-time, low-cost requirements such as autonomous navigation of unmanned surface vessels (USVs) and coastal Earth observation. The key to the solution is LEMMA, a lightweight semantic segmentation model that uses Laplacian pyramids to strengthen edge recognition and fuses edge information early in feature extraction, avoiding expensive feature-map computation in deeper network layers. This sharply reduces parameters, computation, and inference time (up to 71x fewer parameters, 88.5% fewer GFLOPs, and 84.65% less inference time) while maintaining excellent segmentation performance (e.g., 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325).

链接: https://arxiv.org/abs/2603.25689
作者: Ishaan Gakhar,Laven Srivastava,Sankarshanaa Sagaram,Aditya Kasliwal,Ujjwal Verma
机构: Manipal Institute of Technology, Manipal Academy of Higher Education, India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the MaCVi Workshop, CVPR 2026

点击查看摘要

Abstract:Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and coastal Earth Observation events such as oil spills. However, existing methods, often relying on deep CNNs and transformer-based architectures, face challenges in deployment due to their high computational costs and resource-intensive nature. These limitations hinder the practicality of real-time, low-cost applications in real-world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state-of-the-art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real-world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on Mastr1325.
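The Laplacian-pyramid idea behind LEMMA, extracting edge detail as the residual between an image and a blurred, re-upsampled copy of itself, can be illustrated in a few lines. The sketch below is a minimal stand-in using 2x2 average pooling and nearest-neighbour upsampling on nested lists; it is not the paper's actual pyramid construction or network integration.

```python
# Minimal Laplacian-pyramid sketch on a grayscale image (nested lists).
# One Laplacian level = image - upsample(downsample(image)); it is near zero
# in flat regions and large at edges, which is the detail signal LEMMA-style
# models inject early in the feature extractor.

def downsample(img):
    """Halve resolution by 2x2 average pooling."""
    h, w = len(img), len(img[0])
    return [[(img[2*i][2*j] + img[2*i][2*j+1] +
              img[2*i+1][2*j] + img[2*i+1][2*j+1]) / 4.0
             for j in range(w // 2)] for i in range(h // 2)]

def upsample(img):
    """Double resolution by nearest-neighbour replication."""
    return [[img[i // 2][j // 2] for j in range(2 * len(img[0]))]
            for i in range(2 * len(img))]

def laplacian_level(img):
    """Residual between the image and its blurred reconstruction."""
    rec = upsample(downsample(img))
    return [[img[i][j] - rec[i][j] for j in range(len(img[0]))]
            for i in range(len(img))]

flat = [[1.0] * 4 for _ in range(4)]     # uniform region: residual is zero
step = [[0.0, 1.0, 1.0, 1.0]] * 4        # vertical edge: residual fires there
print(laplacian_level(flat))  # all zeros
print(laplacian_level(step))  # nonzero only around the edge columns
```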

[CV-21] Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

【速读】:该论文旨在解决跨视图地理定位(Cross-view geo-localization, CVGL)中因现有方法依赖对比学习嵌入空间而导致的性能瓶颈问题,尤其是其对大规模批次和难负样本挖掘的强依赖性,以及忽略地图几何结构和街景与航拍图像间覆盖不匹配所带来的定位模糊性。解决方案的关键在于提出一种全新的“Just Zoom In”范式,通过在城市尺度的航拍地图上进行自回归式的逐级缩放推理,从粗粒度卫星视图逐步聚焦至目标分辨率下的终端卫星单元,从而实现无需对比损失或难负样本挖掘的端到端定位;该方法显式建模了空间层次结构,提升了对地图空间关系的推理能力,并在真实场景下验证了其优越性。

链接: https://arxiv.org/abs/2603.25686
作者: Yunus Talha Erzurumlu,Jiyong Kwag,Alper Yilmaz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) estimates a camera’s location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
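The autoregressive zooming formulation can be pictured as quadrant selection on a quadtree over the overhead map: at each step the model picks one of four children of the current tile until the target resolution is reached. The sketch below shows only this control flow; `toward` is a hypothetical stand-in for the paper's learned policy, which would score candidates from image features rather than from a known target.

```python
# Sketch of autoregressive zooming as quadrant selection on a square map.
# Each step splits the current tile into 4 children and descends into one,
# so k steps localize within a (size / 2**k)-sized cell.

def zoom_in(x0, y0, size, policy, steps):
    """Repeatedly split the current tile into 4 quadrants and descend."""
    path = []
    for _ in range(steps):
        half = size / 2
        # Children: (dx, dy) offsets of the four quadrants.
        children = [(0, 0), (half, 0), (0, half), (half, half)]
        dx, dy = policy(x0, y0, half, children)
        path.append((dx > 0, dy > 0))      # record the zoom decision
        x0, y0, size = x0 + dx, y0 + dy, half
    return (x0, y0, size), path

def toward(tx, ty):
    """Illustrative stand-in policy: pick the child whose center is
    closest to a known target point (a learned model replaces this)."""
    def policy(x0, y0, half, children):
        return min(children,
                   key=lambda c: abs(x0 + c[0] + half / 2 - tx) +
                                 abs(y0 + c[1] + half / 2 - ty))
    return policy

cell, path = zoom_in(0, 0, 1024, toward(100, 900), steps=5)
print(cell)   # final 32x32 tile containing (100, 900)
```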

[CV-22] Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

【速读】:该论文旨在解决动作条件下的机器人世界模型(action-conditioned robot world models)在自回归部署时因误差累积导致视觉质量迅速下降的问题。传统方法虽能在短时预测中表现良好,但一旦将预测帧作为下一时刻的输入上下文,便难以维持长期生成的稳定性与真实性。解决方案的关键在于引入一种强化学习(Reinforcement Learning, RL)后训练策略,使世界模型基于其自身生成的自回归轨迹进行优化,而非依赖真实历史数据;同时设计多候选未来序列对比机制和多视角视觉保真度奖励函数,以提升预测的准确性与一致性,最终在DROID数据集上实现了优于现有基线的长期滚动预测性能。

链接: https://arxiv.org/abs/2603.25685
作者: Jai Bardhan,Patrik Drozdik,Josef Sivic,Vladimir Petrik
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 11 figures, 12 tables

点击查看摘要

Abstract:Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.

[CV-23] Can Users Specify Driving Speed? Bench2Drive-Speed: Benchmark and Baselines for Desired-Speed Conditioned Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶(End-to-End Autonomous Driving, E2E-AD)中用户自定义驾驶行为需求长期被忽视的问题,特别是用户希望指定期望速度或允许/禁止超车操作的需求。解决方案的关键在于提出一个名为Bench2Drive-Speed的基准框架,包含针对目标速度条件化驾驶的量化指标(如Speed-Adherence Score和Overtake Score)、标注数据集(CustomizedSpeedDataset)以及基线模型。其核心创新在于通过重新标注现有常规驾驶数据中的未来帧速度作为训练目标,实现无需额外真实世界专家演示即可训练出符合用户速度指令的策略模型,实验表明该方法在保持基础驾驶性能的同时可有效实现速度跟随,但超车指令仍因交互行为复杂性而具挑战性。

链接: https://arxiv.org/abs/2603.25672
作者: Yuqian Shao,Xiaosong Jia,Langechuan Liu,Junchi Yan
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has been long overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users’ desired target-speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and Overtake Score, to measure how faithfully policies follow user specifications, while remaining compatible with standard autonomous driving metrics. To enable training of speed-conditioned policies, one approach is to collect expert demonstrations that strictly follow speed requirements, an expensive and unscalable process in the real world. An alternative is to adapt existing regular driving data by treating the speed observed in future frames as the target speed for training. To investigate this, we construct CustomizedSpeedDataset, composed of 2,100 clips annotated with expert demonstrations, enabling systematic investigation of supervision strategies. Our experiments show that, under proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, suggesting that speed supervision can be introduced without additional complex real-world data collection. Furthermore, we find that while target-speed following can be achieved without degrading regular driving performance, executing overtaking commands remains challenging due to the inherent difficulty of interactive behaviors. All code, datasets and baselines are available at this https URL

[CV-24] Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

【速读】:该论文旨在解决预训练视觉-语言-动作(VLA)模型在标准监督微调(SFT)过程中性能提升有限且适应成本较高的问题。现有高级微调方法虽通过引入辅助任务目标可提升性能并减少收敛步数,但通常因额外损失项带来显著计算开销。其解决方案的关键在于:在参数空间中解耦辅助任务训练的两个目标——增强通用能力与拟合特定任务的动作分布,并仅用两种不同训练策略在小规模任务集上分别训练模型;二者参数差异被解释为由辅助任务提供的能力向量,将其与预训练参数融合即可得到能力增强的元模型(meta model)。进一步地,结合轻量级正交化正则化损失进行标准SFT时,该方法可在降低计算开销的同时达到与复杂辅助微调基线相当的性能表现。

链接: https://arxiv.org/abs/2603.25661
作者: Wenxuan Song,Jiayi Chen,Shuai Chen,Jingbo Wang,Pengxiang Ding,Han Zhao,Yikai Qin,Xinhu Zheng,Donglin Wang,Yan Wang,Haoang Li
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary tasks. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: this https URL
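The capability-vector construction described above resembles task arithmetic on model parameters: subtract the standard-SFT weights from the auxiliary-finetuned weights, then add the difference back onto the pretrained weights. The toy sketch below illustrates only this arithmetic; all names and numbers are hypothetical and unrelated to the paper's actual models.

```python
# Toy sketch of the capability-vector idea: the parameter difference between
# two finetuned models is treated as a reusable "capability" that can be
# merged back into pretrained weights. Parameters are toy dicts of lists.

def diff(a, b):
    """Element-wise parameter difference a - b."""
    return {k: [x - y for x, y in zip(a[k], b[k])] for k in a}

def merge(base, vector, alpha=1.0):
    """Add a scaled capability vector onto base parameters."""
    return {k: [x + alpha * y for x, y in zip(base[k], vector[k])]
            for k in base}

pretrained   = {"w": [0.0, 1.0], "b": [0.5]}
aux_finetune = {"w": [0.4, 1.2], "b": [0.7]}   # trained with auxiliary losses
std_finetune = {"w": [0.1, 1.1], "b": [0.6]}   # trained with standard SFT

# Capability vector: what the auxiliary objectives added beyond plain SFT.
capability = diff(aux_finetune, std_finetune)
meta_model = merge(pretrained, capability)
print(meta_model)  # pretrained weights shifted by the capability vector
```

With `alpha=0.0` the merge is a no-op, so `alpha` acts as a knob on how strongly the capability is injected.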

[CV-25] Designing Any Imaging System from Natural Language: Agent -Constrained Composition over a Finite Primitive Basis

【速读】:该论文旨在解决计算成像系统设计中长期存在的专家瓶颈问题——即每种成像模态都需要数周的专业人力来选择算子、设置参数并验证一致性,导致非专业研究人员难以参与成像仪器的原型开发。其解决方案的关键在于提出一个结构化的规范格式(structured specification format)和三个自主代理(Plan、Judge、Execute),它们协同将一句自然语言描述自动转化为具有边界重建误差的验证后向模型。通过设计到现实的误差定理(design-to-real error theorem),总重建误差被分解为五个可独立控制的项,并对应具体的修正动作,从而实现了从自然语言到高质量成像系统设计的端到端自动化,且在6个真实数据模态上达到与专家库相当的性能(98.1 ± 4.2%)。

链接: https://arxiv.org/abs/2603.25636
作者: Chengshuai Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 7 figures, 8 tables, includes Supplementary Information (sections S1-S6)

点击查看摘要

Abstract:Designing a computational imaging system – selecting operators, setting parameters, validating consistency – requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce this http URL, a structured specification format, and three autonomous agents – Plan, Judge, and Execute – that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs – composing primitives into chains from 3D to 5D – demonstrate compositional reach beyond any single-modality tool.

[CV-26] LanteRn: Latent Visual Structured Reasoning

【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在视觉推理任务中表现不佳的问题,尤其是其默认将感知内容转化为文本的局限性,这限制了需要精细空间和视觉理解的任务性能。解决方案的关键在于提出LanteRn框架,该框架使LMM能够将语言与紧凑的潜在视觉表示(latent visual representations)交错使用,从而在潜在空间中直接进行视觉推理,而非依赖外部模块或在像素空间中进行冗余计算。LanteRn通过在视觉-语言Transformer中引入生成和注意力机制以处理连续的视觉思维嵌入(visual thought embeddings),并在两个阶段进行训练:第一阶段为监督微调以将视觉特征锚定在潜在状态,第二阶段为强化学习以对齐潜在推理与任务级效用,从而显著提升视觉定位和细粒度推理能力。

链接: https://arxiv.org/abs/2603.25629
作者: André G. Viveiros,Nuno Gonçalves,Matthias Lindemann,André Martins
机构: Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; Carnegie Mellon University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.

[CV-27] Demographic Fairness in Multimodal LLM s: A Benchmark of Gender and Ethnicity Bias in Face Verification CVPR2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在人脸验证任务中存在的人口统计学公平性问题,即不同种族和性别群体间验证准确率是否存在系统性差异。解决方案的关键在于构建一个涵盖四个种族群体和两个性别群体的基准测试体系,采用IJB-C和RFW人脸验证协议,结合等错误率(Equal Error Rate, EER)与真匹配率(True Match Rate, TMR)等指标,并引入四种基于假匹配率(False Match Rate, FMR)的公平性度量方法,对九个开源MLLMs进行系统评估。研究发现,专用人脸验证模型FaceLLM-8B显著优于通用模型,且不同模型在不同基准下的偏见模式各异,揭示了高准确率并不必然带来公平性,低精度模型也可能因均匀的高错误率而呈现表面公平性。

链接: https://arxiv.org/abs/2603.25613
作者: Ünsal Öztürk,Hatef Otroshi Shahreza,Sébastien Marcel
机构: Idiap Research Institute (Idiap研究学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in CVPR 2026 workshops

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
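The Equal Error Rate used as the accuracy metric here has a standard definition: the operating point where the false match rate (impostor pairs accepted) equals the false non-match rate (genuine pairs rejected). A minimal sketch, assuming higher scores mean "same person"; the score lists are illustrative, not benchmark data.

```python
# Sketch: Equal Error Rate (EER) from verification scores. At threshold t,
# FMR  = fraction of impostor scores >= t (false accepts),
# FNMR = fraction of genuine scores  < t (false rejects);
# EER is taken where the two error curves cross (closest threshold here).

def eer(genuine, impostor):
    thresholds = sorted(set(genuine) | set(impostor))
    best = (1.0, None)                      # (|FMR - FNMR|, EER estimate)
    for t in thresholds:
        fmr = sum(s >= t for s in impostor) / len(impostor)
        fnmr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(fmr - fnmr)
        if gap < best[0]:
            best = (gap, (fmr + fnmr) / 2)
    return best[1]

genuine  = [0.9, 0.8, 0.75, 0.6, 0.3]   # same-identity pair scores
impostor = [0.7, 0.4, 0.35, 0.2, 0.1]   # different-identity pair scores
print(eer(genuine, impostor))  # → 0.2
```

Demographic disparity can then be probed by computing such error rates per group at a shared threshold, which is what FMR-based fairness metrics compare.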

[CV-28] DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader multi-case trial

【速读】:该论文旨在解决当前深度学习方法在肺结节良恶性分类中难以全面融合全局与局部特征,且缺乏临床试验验证的问题。其解决方案的关键在于提出了一种基于Transformer架构的模型DeepFAN,该模型在超过10,000例病理确诊结节数据上训练,并通过多中心、多读者的临床试验验证其辅助诊断效能。结果显示,DeepFAN不仅在内部测试集和临床试验数据集上均表现出优异的诊断性能(AUC达0.939和0.954),还能显著提升初级放射科医师的诊断准确性、敏感性和特异性,同时改善结节级别诊断一致性,表明其具有临床实用价值和推广潜力。

链接: https://arxiv.org/abs/2603.25607
作者: Zhenchen Zhu,Ge Hu,Weixiong Tan,Kai Gao,Chao Sun,Zhen Zhou,Kepei Xu,Wei Han,Meixia Shang,Xiaoming Qiu,Yiqing Tan,Jinhua Wang,Zhoumeng Ying,Li Peng,Wei Song,Lan Song,Zhengyu Jin,Nan Hong,Yizhou Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages for main text and 37 pages for supplementary information, 7 figures in main text and 9 figures in supplementary information

点击查看摘要

Abstract:The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers’ average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall κ: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.

[CV-29] UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation

【速读】:该论文旨在解决虚拟沉浸式体验中服装物理模拟的实时性问题,传统物理仿真方法虽能生成逼真形变,但计算复杂度高、硬件要求昂贵,难以满足实时应用需求;而现有基于学习的方法(如图神经网络)在处理拓扑复杂的服装网格时表现不佳。其解决方案的关键在于提出一种基于实例特定神经形变场(neural deformation field)的新方法 UNIC,通过学习针对特定服装实例的形变场来实现高效动画生成,该方法无需泛化至新服装,仅需适应新动作序列,显著降低训练难度并提升形变质量;同时,神经形变场直接将3D点映射为形变偏移量,天然引入平滑约束且避免了对复杂拓扑结构的显式处理,从而在保持高质量形变的同时实现实时性能。

链接: https://arxiv.org/abs/2603.25580
作者: Chengfeng Zhao,Junbo Qi,Yulou Liu,Zhiyang Dou,Minchen Li,Taku Komura,Ziwei Liu,Wenping Wang,Yuan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of complex garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.
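The core mapping of a neural deformation field, a small network from a 3D point to a displacement offset added to the garment vertex, can be sketched directly. The weights below are hand-picked for illustration and have nothing to do with UNIC's trained field.

```python
# Sketch of a deformation field: offset(p) = W2 @ tanh(W1 @ p + b1) + b2.
# Deformed vertex position = p + offset(p), evaluated per garment vertex,
# so no mesh connectivity is needed and the mapping is smooth in p.
import math

def deformation_field(p, W1, b1, W2, b2):
    """Tiny MLP: 3D point -> 3D displacement offset."""
    h = [math.tanh(sum(w * x for w, x in zip(row, p)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(W2, b2)]

# Hand-fixed toy weights: 3 -> 2 hidden units -> 3 offset components.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
W2 = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

print(deformation_field([0.0, 0.0, 0.0], W1, b1, W2, b2))  # zero offset
print(deformation_field([1.0, 0.0, 0.0], W1, b1, W2, b2))  # small x-offset
```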

[CV-30] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference ICLR2026

【速读】:该论文旨在解决大规模野外数据中生物多样性识别的准确性问题,其核心挑战在于如何从不完整或有噪声的多模态输入(如标本图像、DNA条形码或两者结合)中进行准确的分类学预测(taxonomic prediction)。现有方法通常将分类学视为扁平标签空间,未能利用生物分类体系的层次结构,导致在模态缺失或污染时鲁棒性不足。解决方案的关键在于:一是提出CLiBD-HiR模型,通过引入层次信息正则化(Hierarchical Information Regularization, HiR)来塑造跨分类层级的嵌入几何结构,从而生成结构化且抗噪的表示;二是进一步设计CLiBD-HiR-Fuse模型,集成轻量级融合预测器,支持图像单模态、DNA单模态或联合推理,并对模态损坏具有强鲁棒性。实验表明,该方法在多个大型生物多样性基准上相比强基线提升超过14%的分类准确率,尤其在部分和受损DNA条件下表现显著优于现有方法。

链接: https://arxiv.org/abs/2603.25573
作者: Sk Miraj Ahmed,Xi Yu,Yunqi Li,Yuewei Lin,Wei Xu
机构: Brookhaven National Laboratory (布鲁克海文国家实验室); Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the ICLR 2026 Workshop on Foundation Models for Science (FM4Science)

点击查看摘要

Abstract:Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.

[CV-31] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在遥感理解中普遍忽视“垂直”维度的问题,这一局限性导致其在复杂地形和灾害场景下的推理能力受限,因为物理空间结构的重要性往往超过平面视觉纹理。解决方案的关键在于提出一个全面的评估框架,并构建两个互补的基准测试集(GeoHeight-Bench 和 GeoHeight-Bench+),用于衡量模型对高度信息的感知与综合推理能力;同时开发首个面向高度感知的遥感大模型基线 GeoHeightChat,通过将视觉语义与隐式注入的高度几何特征相结合,有效缓解了“垂直盲区”,从而实现现有光学模型中交互式高度推理的新范式。

链接: https://arxiv.org/abs/2603.25565
作者: Xuran Hu,Zhitong Xiong,Zhongcheng Hong,Yifang Ban,Xiaoxiang Zhu,Wufan Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical “vertical” dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the “vertical blind spot”, successfully unlocking a new paradigm of interactive height reasoning in existing optical models.

[CV-32] owards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion

【速读】:该论文旨在解决眼科微创手术中多模态图像融合与实时多任务预测的问题,尤其聚焦于玻璃体视网膜手术中器械的精准追踪与工具-组织距离估计。其核心挑战在于如何有效整合术中显微镜(OPMI)和实时光学相干断层扫描(iOCT)两种互补成像模态的信息,以提升手术场景理解的准确性与鲁棒性。解决方案的关键在于提出一种时序、多模态、实时可行的网络架构,其中引入交叉注意力融合模块(cross-attention fusion module)来联合处理OPMI与iOCT特征,并结合基于区域的循环模块利用时间一致性增强预测稳定性,从而实现器械检测、关键点定位及工具-组织距离估计的多任务同步输出,最终在保持22.5 ms/帧实时性能的同时显著提高近距离(<1 mm)距离估计精度(从284 μm提升至33 μm)。

链接: https://arxiv.org/abs/2603.25555
作者: Nikolo Rohrmoser,Ghazal Ghazaei,Michael Sommersperger,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing rates of 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 µm (OPMI only) to 33 µm (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design. While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.

[CV-33] PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos ALT

【速读】:该论文旨在解决从真实场景中自动提取物体关节运动(articulation)的问题,传统学习方法依赖大量高质量3D标注数据和人工标注,导致可扩展性和多样性受限。解决方案的关键在于提出PAWS方法,该方法直接从大规模自然视角(egocentric)视频中通过人-物交互信息提取物体的关节结构,无需依赖人工标注的3D数据,从而实现更高效、通用的关节感知能力。

链接: https://arxiv.org/abs/2603.25539
作者: Yihao Wang,Yang Miao,Wenshuai Zhao,Wenyan Yang,Zihan Wang,Joni Pajarinen,Luc Van Gool,Danda Pani Paudel,Juho Kannala,Xi Wang,Arno Solin
机构: Aalto University (阿尔托大学); ETH Zurich (苏黎世联邦理工学院); Stanford University (斯坦福大学); University of Helsinki (赫尔辛基大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 13 figures. Project page: this https URL

点击查看摘要

Abstract:Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on public datasets, including HD-EPIC and Arti4D, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at this https URL.

[CV-34] Insights on back marking for the automated identification of animals

【速读】:该论文试图解决如何设计猪背部标记(back marks),以最佳支持基于机器学习的个体水平监测问题。其解决方案的关键在于:标记设计必须确保在运动模糊、多角度视图和遮挡等动物行为引起的复杂条件下仍保持唯一性和可识别性;同时,需考虑训练过程中常用的数据增强策略(如颜色变换、翻转和裁剪)对模型识别性能的影响,从而优化标记设计以提升模型鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2603.25535
作者: David Brunner,Marie Bordes,Elisabeth Mayrhuber,Stephan M. Winkler,Viktoria Dorfer,Maciej Oczak
机构: 1. University of Vienna (维也纳大学); 2. Austrian Academy of Sciences (奥地利科学院); 3. TU Wien (维也纳工业大学); 4. Institute of Science and Technology Austria (奥地利科学技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To date, there is little research on how to design back marks to best support individual-level monitoring of uniform looking species like pigs. With the recent surge of machine learning-based monitoring solutions, there is a particular need for guidelines on the design of marks that can be effectively recognised by such algorithms. This study provides valuable insights on effective back mark design, based on the analysis of a machine learning model, trained to distinguish pigs via their back marks. Specifically, a neural network of type ResNet-50 was trained to classify ten pigs with unique back marks. The analysis of the model’s predictions highlights the significance of certain design choices, even in controlled settings. Most importantly, the set of back marks must be designed such that each mark remains unambiguous under conditions of motion blur, diverse view angles and occlusions, caused by animal behaviour. Further, the back mark design must consider data augmentation strategies commonly employed during model training, like colour, flip and crop augmentations. The generated insights can support individual-level monitoring in future studies and real-world applications by optimizing back mark design.

[CV-35] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

【速读】:该论文旨在解决现有羽毛球数据集普遍局限于短片段或特定任务标注,缺乏全场比赛层面的密集多模态注释这一问题,从而阻碍了精准的击球描述生成与比赛级战术分析。其解决方案的关键在于构建首个羽毛球全场比赛密集标注数据集(Badminton Full Match Dense, BFMD),涵盖19场职业比赛(含单打和双打)超过20小时的比赛内容,包含1,687个回合和16,751次击球事件,并提供逐击球级别的多模态信息(如击球类型、球轨迹、球员姿态关键点及击球描述)。此外,研究提出基于VideoMAE的多模态描述框架并引入语义反馈机制(Semantic Feedback),利用击球语义指导生成过程以提升描述的语义一致性,实验证明该方法显著优于仅使用RGB图像的基线模型。

链接: https://arxiv.org/abs/2603.25533
作者: Ning Ding,Keisuke Fujii,Toru Tamaki
机构: Nagoya Institute of Technology (名古屋工业大学); Nagoya University (名古屋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVSports2026 accepted

点击查看摘要

Abstract:Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

[CV-36] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training CVPR2026

【速读】:该论文旨在解决视频生成模型训练中面临的“运动-视觉质量困境”(Motion-Vision Quality Dilemma),即视觉质量与运动强度之间存在固有的负相关关系,导致难以获取在两者上均表现优异的高质量数据。解决方案的关键在于提出训练过程中的时间步选择机制(Timestep selection in Training Process),并进一步设计了时间感知的质量解耦方法(Timestep-aware Quality Decoupling, TQD):通过分析视频扩散模型的分层学习动态及退化样本的梯度特性,发现质量失衡的数据在特定时间步长下可产生与优质数据相似的梯度信号;据此调整数据采样分布——对高运动丰富数据倾向于在较高时间步采样,而高视觉质量数据则优先在较低时间步采样,从而实现更高效的学习路径。实验表明,该方法可在仅使用分离的不平衡数据情况下超越传统使用优质数据的训练效果,并在高质量数据场景下同样提升模型性能。

链接: https://arxiv.org/abs/2603.25527
作者: Xiangyang Luo,Qingyu Li,Yuming Li,Guanbo Huang,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan,Shao-Lun Huang
机构: Tsinghua University (清华大学); Kling Team, Kuaishou Technology (快手科技Kling团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of timestep selection in the training process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model’s learning process: the sampling distribution is skewed toward higher timesteps for motion-rich data, while high-visual-quality data is more likely to be sampled at lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
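按上述描述,TQD 的时间步选择可以理解为按数据类型偏斜的采样分布:运动丰富数据偏向较高时间步,高视觉质量数据偏向较低时间步。下面给出一个极简的 Python 示意(非论文官方实现,函数名与偏斜参数 skew 均为本文假设):

```python
import random

def sample_timestep(data_type: str, num_timesteps: int = 1000, skew: float = 3.0) -> int:
    """按数据类型对扩散时间步采样分布做偏斜(示意实现)。
    motion 类数据偏向较高时间步,visual 类数据偏向较低时间步。"""
    u = random.random()                  # u 在 [0, 1) 上均匀分布
    if data_type == "motion":            # 1 - u^skew 向 1 集中 → 高时间步
        frac = 1.0 - u ** skew
    elif data_type == "visual":          # u^skew 向 0 集中 → 低时间步
        frac = u ** skew
    else:                                # 其余数据均匀采样
        frac = u
    return min(int(frac * num_timesteps), num_timesteps - 1)
```

当 skew=3 时,motion 类样本的期望时间步约为 0.75*T,visual 类约为 0.25*T,介于两者之间的数据仍可均匀覆盖全部时间步。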

[CV-37] CHIRP dataset: towards long-term individual-level behavioral monitoring of bird populations in the wild

【速读】:该论文旨在解决野生动物个体行为监测中自动化识别与测量的难题,尤其针对野外种群中长期行为变化研究的需求。当前计算机视觉方法在生物多样性监测中虽具潜力,但受限于缺乏覆盖多种任务(如重识别、动作识别、关键点估计等)的高质量数据集,导致模型难以有效应用于真实生态场景。解决方案的关键在于构建一个名为CHIRP(Combining beHaviour, Individual Re-identification and Postures)的数据集,其源自瑞典拉普兰地区对 Siberian jay(西伯利亚蓝鸦)长期种群的观察,并引入一种基于彩色腿环分割与分类的新方法CORVID(COlouR-based Video re-ID),实现概率驱动的个体追踪。通过应用特定指标(如摄食率、共现率)进行评估,验证了CORVID在实际生物研究场景中优于现有最先进的重识别方法,为连接计算机视觉研究与生物学应用提供了可复用的数据与方法框架。

链接: https://arxiv.org/abs/2603.25524
作者: Alex Hoi Hang Chan,Neha Singhal,Onur Kocahan,Andrea Meltzer,Saverio Lubrano,Miyako H. Warrington,Michel Griesser,Fumihiro Kano,Hemal Naik
机构: Centre for the Advanced Study of Collective Behaviour, University of Konstanz; Dept. of Collective Behavior, Max Planck Institute of Animal Behavior; Dept. of Biology, University of Konstanz; Dept. of Computer and Information Science, University of Konstanz; Luondua Boreal Field Station; School of Biological and Medical Sciences, Oxford Brookes University; Dept. of Zoology, Stockholm University; Dept. of Ecology of Animal Societies, Max Planck Institute of Animal Behavior
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
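CORVID 将检测到的彩色腿环组合与个体数据库进行概率匹配,这一思路可用如下几行 Python 打分示意(非论文官方实现,数据结构与平滑常数 1e-6 均为本文假设):

```python
def match_ring_id(detections, database):
    """彩色腿环组合的概率匹配(示意)。
    detections: 每个腿环位置一个 {颜色: 置信度} 字典
    database:   {个体ID: 颜色序列}
    返回 (最佳个体ID, 归一化后验概率)。"""
    scores = {}
    for bird_id, rings in database.items():
        if len(rings) != len(detections):
            continue                      # 腿环数量不符的个体直接跳过
        p = 1.0
        for slot, color in zip(detections, rings):
            p *= slot.get(color, 1e-6)    # 该位置未检测到此颜色时给极小概率
        scores[bird_id] = p
    total = sum(scores.values()) or 1.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

归一化后的分数可直接作为该帧识别结果的置信度,供后续基于概率的个体追踪使用。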

[CV-38] Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case

【速读】:该论文旨在解决高光谱成像(Hyperspectral Imaging, HSI)在自动驾驶(Autonomous Driving, AD)应用中面临的多重挑战,包括非受控和变化的光照条件、宽景深范围、动态场景中快速移动物体,以及嵌入式平台对实时性与计算资源的严格限制。解决方案的关键在于结合特定应用场景的需求,筛选合适的HSI技术,并开发定制化的视觉算法,以充分利用传感器获取的光谱与空间信息,从而提升系统在复杂环境下的感知性能与鲁棒性。

链接: https://arxiv.org/abs/2603.25510
作者: Koldo Basterretxea,Jon Gutiérrez-Zaballa,Javier Echanobe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, there are non-controlled and variable lighting conditions, wide depth-of-field ranges, and dynamic scenes with fast-moving objects; on the other, the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.

[CV-39] RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

【速读】:该论文旨在解决现有图像恢复(Image Restoration)模型在真实世界退化场景下泛化能力不足的问题,其核心挑战在于训练数据规模与分布的局限性导致模型难以适应多样化的实际退化类型。解决方案的关键在于构建一个覆盖九类常见真实世界退化类型的大型开源数据集,并基于此训练出性能领先的开放源代码模型;同时提出 RealIR-Bench 评估基准,包含 464 张真实退化图像及针对退化去除与一致性保持的定制化指标,从而有效缩小开源模型与闭源先进模型(如 Nano Banana Pro)之间的性能差距。

链接: https://arxiv.org/abs/2603.25502
作者: Yufeng Yang,Xianfang Zeng,Zhangqi Jiang,Fukun Yin,Jianzhuang Liu,Wei Cheng,jinghong lan,Shiyu Liu,Yuqi Peng,Gang YU,Shifeng Chen
机构: Southern University of Science and Technology (南方科技大学); StepFun (StepFun); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Shenzhen University of Advanced Technology (深圳先进技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 15 figures, Project homepage: this https URL

点击查看摘要

Abstract:Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.

[CV-40] Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects

【速读】:该论文旨在解决安全关键场景中目标检测器可能产生无声失效(silent failure)的问题,即检测器在未识别出行人、工人或其他关键物体时不会发出任何警告,从而导致潜在的安全风险。传统分布外(Out of Distribution, OOD)检测方法主要关注输入数据的陌生程度,而非直接预测检测器自身的功能失效。解决方案的关键在于提出知识引导的故障预测(Knowledge Guided Failure Prediction, KGFP)框架,该框架基于表示学习,将漏检视为运行时异常进行监测:通过双编码器架构计算目标检测器内部特征与视觉基础模型嵌入之间的语义不一致性,并采用角度距离度量来捕捉二者偏离程度;当检测器超出其能力范围或视觉基础模型遇到新输入时,两嵌入显著发散,生成高角度信号以可靠标识不安全图像。

链接: https://arxiv.org/abs/2603.25499
作者: Jakob Paul Zimmermann,Gerrit Holzbach,David Lerch
机构: Fraunhofer HHI; Fraunhofer IOSB
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Object detectors deployed in safety-critical environments can fail silently, e.g. missing pedestrians, workers, or other safety-critical objects without emitting any warning. Traditional Out Of Distribution (OOD) detection methods focus on identifying unfamiliar inputs, but do not directly predict functional failures of the detector itself. We introduce Knowledge Guided Failure Prediction (KGFP), a representation-based monitoring framework that treats missed safety-critical detections as anomalies to be detected at runtime. KGFP measures semantic misalignment between internal object detector features and visual foundation model embeddings using a dual-encoder architecture with an angular distance metric. A key property is that when either the detector is operating outside its competence or the visual foundation model itself encounters novel inputs, the two embeddings diverge, producing a high-angle signal that reliably flags unsafe images. We compare our novel KGFP method to baseline OOD detection methods. On COCO person detection, applying KGFP as a selective-prediction gate raises person recall among accepted images from 64.3% to 84.5% at 5% False Positive Rate (FPR), and maintains strong performance across six COCO-O visual domains, outperforming OOD baselines by large margins. Our code, models, and features are published at this https URL.
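KGFP 的核心信号是检测器特征嵌入与视觉基础模型嵌入之间的角距离。下面用纯 Python 给出一个最小示意(函数命名与阈值 0.5 均为本文假设,并非论文实现):

```python
import math

def angular_distance(u, v):
    """两个嵌入向量之间的角距离(弧度)。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # 数值裁剪,防止浮点越界
    return math.acos(cos)

def accept_image(det_emb, vfm_emb, threshold=0.5):
    """选择性预测门控(示意):两嵌入角距离超过阈值即拒绝该图像,
    对应论文中"高角度信号标记不安全图像"的做法。"""
    return angular_distance(det_emb, vfm_emb) <= threshold
```

两嵌入对齐时角距离接近 0,图像被接受;检测器超出能力范围时两嵌入发散、角距离增大,图像被拒绝并可转交人工复核。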

[CV-41] AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments CVPR2026

【速读】:该论文旨在解决室内单目语义场景补全(Indoor Monocular Semantic Scene Completion, MSSC)中因复杂空间布局和严重遮挡带来的挑战,尤其是传统Transformer模型在处理此类任务时存在的高内存开销以及难以恢复细粒度细节的问题。解决方案的关键在于提出AdaSFormer,一种专为室内MSSC设计的序列化Transformer框架,其核心创新包括:(1) 自适应序列化Transformer(Adaptive Serialized Transformer),通过可学习偏移动态调整感受野;(2) 中心相对位置编码(Center-Relative Positional Encoding),增强对空间信息丰富性的建模能力;(3) 卷积调制层归一化(Convolution-Modulated Layer Normalization),有效融合卷积与Transformer特征之间的异构表示。

链接: https://arxiv.org/abs/2603.25494
作者: Xuzhi Wang,Xinran Wu,Song Wang,Lingdong Kong,Ziping Zhao
机构: Tianjin Normal University (天津师范大学); Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: this https URL.

[CV-42] GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在视频监控中作为异常检测器时的脆弱性问题:由于缺乏校准的异常先验,VLMs 会产生漏检和虚假警报交替出现的现象。作者指出,问题不在于VLM本身,而在于其使用方式。解决方案的关键在于提出“提议-定位-传播”(propose-ground-propagate)范式:VLM 仅负责生成开放集的异常候选描述(anomaly proposals),随后由专门设计的空间与时间模块进行锚定(grounding)和传播(propagation)。具体实现为 GridVAD,一个无需领域训练的像素级异常掩码生成流程,其中自一致性融合(Self-Consistency Consolidation, SCC)通过多采样重复性过滤幻觉,Grounding DINO 将提案锚定为边界框,SAM2 在异常区间内传播为密集掩码,且每片段固定调用 M+1 次 VLM,显著提升效率与精度。

链接: https://arxiv.org/abs/2603.25467
作者: Mohamed Eltahir,Ahmed O. Ibrahim,Obada Siralkhatim,Tabarak Abdallah,Sondos Mohamed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at this https URL.
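其中的自一致性融合(SCC)本质上是"只保留在多次独立采样中重复出现的异常候选,以过滤 VLM 幻觉",可用几行 Python 示意(min_count 阈值为本文的假设参数,非论文官方代码):

```python
from collections import Counter

def self_consistency_consolidation(samplings, min_count=2):
    """自一致性融合(SCC)示意:samplings 为多次独立采样得到的
    异常候选描述列表,仅保留出现次数不低于 min_count 的候选。"""
    # 每次采样内部先去重,避免同一次采样重复提及同一候选
    counts = Counter(p for sample in samplings for p in set(sample))
    return sorted(p for p, c in counts.items() if c >= min_count)
```

调大 min_count 会提高精度、降低召回,对应摘要中提到的可控精度-召回权衡。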

[CV-43] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

【速读】:该论文旨在解决自回归(Auto-regressive, AR)图像生成模型在设备端部署时面临的计算密集性和序列化特性导致的高延迟问题。其核心挑战在于:一方面,高保真图像生成需要庞大的视觉标记(token)词汇表;另一方面,图像中存在大量空间冗余区域(具有高度可预测性),而对象边界则具有高不确定性。传统均匀验证机制会浪费资源在冗余标记上,无法高效处理。解决方案的关键在于提出一种云-设备协同框架CIAR,其中包含两个核心技术:一是基于连续概率区间(continuous probability intervals)的设备端标记不确定性量化器,用于加速处理并支持大规模视觉词汇表;二是引入增强区间解码模块(Interval-enhanced decoding module),通过分布对齐训练策略在提升解码速度的同时保持图像质量和语义一致性。

链接: https://arxiv.org/abs/2603.25463
作者: Keming Ye,Zhou Zhao,Fan Wu,Shengyu Zhang
机构: Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 tables, 7 figures

点击查看摘要

Abstract:Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images and the inherent spatial redundancy which leads to extreme predictability in homogeneous regions, while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an Interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.
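"连续概率区间"的不确定性量化可以这样示意:将词表概率按降序映射到 [0,1) 上的累积区间,若某 token 的区间完全落在头部概率质量 τ 之内,则认为它不确定性低、可在设备端直接通过自验证。以下为本文的假设性示意(函数与阈值 τ 均为假设,并非 CIAR 官方实现):

```python
def token_interval(probs, token_id):
    """将词表概率分布映射为 [0,1) 上的连续累积区间(示意)。
    按概率降序排列,返回 token_id 所占的 (下界, 上界)。"""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    lo = 0.0
    for i in order:
        hi = lo + probs[i]
        if i == token_id:
            return lo, hi
        lo = hi
    raise ValueError("token_id 不在词表内")

def is_confident(probs, token_id, tau=0.9):
    """若 token 的累积区间完全落在头部概率质量 tau 之内,
    则视为低不确定性,可在设备端自验证通过(tau 为假设阈值)。"""
    _, hi = token_interval(probs, token_id)
    return hi <= tau
```

与枚举离散候选集相比,区间表示只需一次累积比较,便于在数万级视觉词表上低开销运行。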

[CV-44] DC-Reg: Globally Optimal Point Cloud Registration via Tight Bounding with Difference of Convex Programming

【速读】:该论文旨在解决在部分重叠和大初始位姿偏差条件下,点云配准(point cloud registration)中难以实现全局最优解的问题。现有方法在同时估计变换参数(θ)与对应关系(P)时,由于目标函数非凸且耦合性强,常陷入局部最优或收敛速度极慢,主要源于下界松弛过松。解决方案的关键在于提出DC-Reg框架,其核心创新是基于差分凸(Difference of Convex, DC)规划理论,推导出一个整体式的凹下界估计器(holistic concave underestimator),该估计器充分捕捉了变换参数与对应关系之间的联合结构交互,而非以往依赖逐项松弛(如McCormick包络)的方法。这一建模方式使得通过高效线性指派问题(Linear Assignment Problem, LAP)在搜索盒顶点处计算出紧致的下界成为可能,从而显著加速分支定界(Branch-and-Bound, BnB)搜索过程,在保持全局最优性的前提下大幅提升效率与鲁棒性。

链接: https://arxiv.org/abs/2603.25442
作者: Wei Lian,Fei Ma,Hang Pan,Zhesen Cui,Wangmeng Zuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving globally optimal point cloud registration under partial overlaps and large misalignments remains a fundamental challenge. While simultaneous transformation (θ) and correspondence (P) estimation has the advantage of being robust to nonrigid deformation, its non-convex coupled objective often leads to local minima for heuristic methods and prohibitive convergence times for existing global solvers due to loose lower bounds. To address this, we propose DC-Reg, a robust globally optimal framework that significantly tightens the Branch-and-Bound (BnB) search. Our core innovation is the derivation of a holistic concave underestimator for the coupled transformation-assignment objective, grounded in the Difference of Convex (DC) programming paradigm. Unlike prior works that rely on term-wise relaxations (e.g., McCormick envelopes) which neglect variable interplay, our holistic DC decomposition captures the joint structural interaction between θ and P. This formulation enables the computation of remarkably tight lower bounds via efficient Linear Assignment Problems (LAP) evaluated at the vertices of the search boxes. We validate our framework on 2D similarity and 3D rigid registration, utilizing rotation-invariant features for the latter to achieve high efficiency without sacrificing optimality. Experimental results on synthetic data and the 3DMatch benchmark demonstrate that DC-Reg achieves significantly faster convergence and superior robustness to extreme noise and outliers compared to state-of-the-art global techniques.
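"在搜索盒顶点处通过 LAP 计算下界"这一步可以高度简化地示意如下(凹函数在盒上的最小值必在顶点处取得,故对各顶点的 LAP 代价取最小即为该盒的下界)。以下仅为流程演示:用小规模暴力求解代替高效 LAP 求解器,省略了凹下界估计器本身的构造,所有函数与参数均为本文假设:

```python
from itertools import permutations, product

def lap_cost(cost):
    """小规模线性指派问题(LAP)的暴力求解:返回最小总代价。
    仅用于演示,实际应使用匈牙利算法等高效求解器。"""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def box_lower_bound(cost_at_vertex, box):
    """在搜索盒各顶点处求解 LAP,取最小代价作为该盒的下界(示意)。
    cost_at_vertex(vertex) 返回该顶点参数下凹下界估计器的代价矩阵,
    box 为各参数维度的取值区间端点列表。"""
    vertices = product(*box)              # 枚举盒的所有顶点
    return min(lap_cost(cost_at_vertex(v)) for v in vertices)
```

在 BnB 搜索中,若某盒的下界已超过当前最优上界,即可整体剪枝,下界越紧剪枝越早。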

[CV-45] VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

【速读】:该论文旨在解决多视角视频到视频(multi-view video-to-video, V2V)翻译中的关键挑战:现有方法仅支持单视角输入,导致在多相机同步采集场景下出现跨视图外观不一致的问题;同时,标准Transformer架构因交叉视图注意力的二次计算复杂度难以扩展至多视角设置。解决方案的关键在于提出VideoWeaver框架,其核心创新包括两点:一是通过一个前馈空间基础模型Pi3将所有视角映射到共享的4D隐空间,从而实现宽基线和动态摄像机运动下的视图一致性;二是引入扩散时间步训练策略,使模型能够学习联合与条件视图分布,进而支持基于已有视角的自回归新视角合成,有效突破固定相机数量限制并实现物理与风格一致的多视角生成。

链接: https://arxiv.org/abs/2603.25420
作者: George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Yang Bai,Liudi Yang,Ziyuan Liu
机构: Huawei Heisenberg Research Center (华为海森堡研究中心); Ludwig Maximilian University of Munich (慕尼黑路德维希马克西米利安大学); University of Freiburg (弗莱堡大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.

[CV-46] HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models MICRO CVPR2026

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在实现类人空间智能方面的关键挑战,即从二维观测中推断三维结构、识别三维空间中的物体属性与关系,并进行高层空间推理。其解决方案的核心在于提出一个原理性的分层框架,将VLM的三维空间理解学习分解为四个逐步复杂的学习层级(从几何感知到抽象空间推理),并基于此框架构建自动化数据生成流水线,利用约500万张图像和超过4500万个物体生成多样场景下的三维空间问答(3D Spatial VQA)对用于监督微调;同时开发了融合RGB-D信息与度量尺度点云图的VLM架构,显著提升空间理解能力。实验表明,该方法在多个空间理解与推理基准上达到最先进性能,超越专用空间模型及大型闭源系统(如Gemini-2.5-pro和GPT-5),且揭示了多层级任务设计对3D空间智能涌现的关键作用。

链接: https://arxiv.org/abs/2603.25411
作者: Huizhi Liang,Yichao Shen,Yu Deng,Sicheng Xu,Zhiyuan Feng,Tong Zhang,Yaobo Liang,Jiaolong Yang
机构: Tsinghua University (清华大学); Microsoft Research Asia (亚洲微软研究院); Xi’an Jiaotong University (西安交通大学); University of the Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

[CV-47] LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作中因直接从二维语义视觉特征回归动作而导致对复杂三维物理交互隐式学习的局限性,这种隐式学习策略在面对不熟悉的空间动态时性能显著下降。解决方案的关键在于提出一种双专家框架LaMP,通过门控交叉注意力机制将一个生成3D场景光流的“运动专家”(Motion Expert)与一个预测动作的“动作专家”(Action Expert)进行对齐:运动专家生成单步部分去噪的3D场景光流,并以其隐藏状态条件化动作专家,从而避免了完整的多步重建过程,有效提升了模型在未见场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.25399
作者: Xinkai Wang,Chenyi Wang,Yifu Xu,Mingzhe Ye,Fu-Cheng Zhang,Jialin Tian,Xinyu Zhan,Lifeng Zhu,Cewu Lu,Lixin Yang
机构: Southeast University (东南大学); School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院); Zhejiang University (浙江大学); Beihang University (北京航空航天大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at this https URL.

[CV-48] PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders CVPR

【速读】:该论文旨在解决当前基于视觉基础模型(Vision Foundation Models, VFMs)的图像与视频分割方法中,如何在保持编码器(encoder)冻结以实现多任务共享的同时,兼顾模型的高效性与准确性的问题。现有方法如EoMT和VidEoMT虽具备低延迟优势,但需微调编码器,从而丧失了VFMs在大规模部署中的核心价值——即冻结编码器带来的通用性和可复用性。解决方案的关键在于提出一种轻量级、基于Transformer的掩码解码器(Plain Mask Decoder, PMD),该解码器直接作用于冻结的VFM特征之上,构建出Plain Mask Transformer (PMT) 模型。PMT在保持编码器不变的前提下,实现了与全微调方法相当甚至更优的分割性能,同时显著提升推理速度(图像分割快约3倍,视频分割快达8倍),并统一适用于图像与视频分割任务。

链接: https://arxiv.org/abs/2603.25398
作者: Niccolò Cavagnero,Narges Norouzi,Gijs Dubbelman,Daan de Geus
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, ECV 2026, CVPR Workshop

点击查看摘要

Abstract:Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: this https URL.

[CV-49] FSGNet: A Frequency-Aware and Semantic Guidance Network for Infrared Small Target Detection

【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因U-Net架构在深层到浅层特征传递过程中出现语义退化(semantic degradation)而导致的小目标精确定位能力受限的问题。解决方案的关键在于提出FSGNet框架,其核心创新包括:1)在编码器中引入多方向交互注意力模块(multi-directional interactive attention module),以捕获细粒度和方向性特征,提升对低对比度小目标的敏感性;2)设计多尺度频域感知模块(multi-scale frequency-aware module),利用快速傅里叶变换(Fast Fourier Transform)过滤与目标相似的背景杂波,同时保留显著的目标结构;3)在最深层引入全局池化模块并构建全局语义引导流(global semantic guidance flows),将高层语义信息上采样后逐级传播至解码阶段,从而保障跨尺度的语义一致性与定位精度。

链接: https://arxiv.org/abs/2603.25389
作者: Yingmei Zhang,Wangtao Bao,Yong Yang,Weiguo Wan,Qin Xiao,Xueting Zou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small target detection (IRSTD) aims to identify and distinguish small targets from complex backgrounds. Leveraging the powerful multi-scale feature fusion capability of the U-Net architecture, IRSTD has achieved significant progress. However, U-Net suffers from semantic degradation when transferring high-level features from deep to shallow layers, limiting the precise localization of small targets. To address this issue, this paper proposes FSGNet, a lightweight and effective detection framework incorporating frequency-aware and semantic guidance mechanisms. Specifically, a multi-directional interactive attention module is proposed throughout the encoder to capture fine-grained and directional features, enhancing the network’s sensitivity to small, low-contrast targets. To suppress background interference propagated through skip connections, a multi-scale frequency-aware module leverages Fast Fourier transform to filter out target-similar clutter while preserving salient target structures. At the deepest layer, a global pooling module captures high-level semantic information, which is subsequently upsampled and propagated to each decoder stage through the global semantic guidance flows, ensuring semantic consistency and precise localization across scales. Extensive experiments on four public IRSTD datasets demonstrate that FSGNet achieves superior detection performance and maintains high efficiency, highlighting its practical applicability and robustness. The codes will be released on this https URL.
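摘要中多尺度频域感知模块的核心是用快速傅里叶变换在频域过滤与目标相似的杂波。下面用 numpy 给出频域滤波的一个极简示意(此处用高通滤波抑制低频背景,cutoff 等参数为假设值,并非论文实现):

```python
import numpy as np

def fft_highpass(img, cutoff=4):
    """频域高通滤波示意:抑制低频背景,保留小目标等高频结构。
    cutoff 为被抑制低频区域的半径(假设参数)。"""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    F[dist < cutoff] = 0            # 去掉以频谱中心为圆心的低频成分
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

img = np.ones((32, 32)) * 0.2       # 均匀背景(纯低频)
img[15:17, 15:17] = 1.0             # 2x2 的"小目标"
out = fft_highpass(img)
```

对均匀背景上的 2×2 小亮斑做高通后,响应峰值落在目标位置,体现了"小目标主要由高频成分刻画"这一 IRSTD 中的常见先验。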

[CV-50] Multimodal Dataset Distillation via Phased Teacher Models ICLR2026

【速读】:该论文旨在解决多模态数据蒸馏(Multimodal Dataset Distillation)中现有方法难以捕捉教师模型在训练后期阶段所蕴含的复杂且动态演化的知识的问题,这一缺陷导致学生模型性能下降以及蒸馏数据质量受损。其解决方案的关键在于提出一种分阶段教师模型与捷径轨迹构建策略(Phased Teacher Model with Shortcut Trajectory, PTM-ST),通过阶段感知的教师建模和基于捷径的轨迹构造机制,精确拟合教师模型在不同训练阶段的学习动态,从而提升蒸馏过程的稳定性和表达能力,显著缓解跨阶段性能差距与优化振荡问题,并降低存储开销。

链接: https://arxiv.org/abs/2603.25388
作者: Shengbin Guo,Hang Zhao,Senqiao Yang,Chenyang Jiang,Yuhang Cheng,Xiangru Peng,Rui Shao,Zhuotao Tian
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) – a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher’s learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: this https URL.

[CV-51] CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

【速读】:该论文旨在解决当前CLIP(Contrastive Language–Image Pre-training)模型知识蒸馏过程中,学生模型难以有效保留教师模型中多方向关系结构的问题。现有方法未显式建模教师与学生嵌入之间的双向依赖关系,导致学生模型在几何结构上无法忠实复现教师模型的语义对齐特性。解决方案的关键在于提出一种关系知识蒸馏框架(CLIP-RD),其核心创新为引入两种新方法:垂直关系蒸馏(Vertical Relational Distillation, VRD)和交叉关系蒸馏(Cross Relational Distillation, XRD)。VRD 在分布层面强制跨模态的教师-学生蒸馏强度一致性,XRD 则在跨模态相似度分布上施加双向对称性约束,从而联合建模多方向关系结构,显著提升学生模型在嵌入空间几何对齐上的保真度,最终在零样本分类任务上较现有方法提升0.8个百分点。

链接: https://arxiv.org/abs/2603.25383
作者: Jeannie Chung,Hanna Jang,Ingyeong Yang,Uiwon Hwang,Jaehyung Sim
机构: Ewha Womans University (梨花女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student’s ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8%p.
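XRD 对跨模态教师-学生相似度分布施加双向对称约束,可以理解为对两个方向的分布差异同时惩罚。下面用对称 KL 散度给出一个极简数值示意(具体损失形式为本文假设,仅用于说明"保持关系结构"的含义):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_relation_loss(teacher_sim, student_sim):
    """XRD 思路的极简示意(假设实现):把教师/学生的跨模态相似度
    各自 softmax 成分布,再用对称 KL 约束双向一致。"""
    p, q = softmax(teacher_sim), softmax(student_sim)
    return 0.5 * (kl(p, q) + kl(q, p))

t = [3.0, 1.0, 0.2]          # 教师:某图像对 3 条文本的相似度 logits
s_good = [2.9, 1.1, 0.3]     # 学生基本保持教师的相对关系
s_bad = [0.2, 1.0, 3.0]      # 学生颠倒了相对关系

loss_good = symmetric_relation_loss(t, s_good)
loss_bad = symmetric_relation_loss(t, s_bad)
```

学生只要保住教师编码的相对关系结构,损失就很小;关系被颠倒时损失显著增大,即便两组 logits 的数值范围相同。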

[CV-52] Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics

【速读】:该论文旨在解决移动机器人在室内环境中进行自主目标搜索时面临的挑战,包括部分可观测性(partial observability)、感知不确定性(perceptual uncertainty)以及探索与导航效率之间的权衡问题。传统概率方法虽能显式建模不确定性,但依赖人工设计的动作选择启发式策略;而深度强化学习虽可学习自适应策略,却常存在收敛慢和可解释性差的问题。解决方案的关键在于提出一种融合贝叶斯推理(Bayesian inference)与深度强化学习的混合框架:通过在线贝叶斯更新构建目标位置的空间信念图(belief map),并训练强化学习策略直接从该概率表示中选择导航动作,从而在保证可靠性的前提下提升搜索效率。

链接: https://arxiv.org/abs/2603.25366
作者: João Castelo-Branco,José Santos-Victor,Alexandre Bernardino
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and to be published in the ICARSC 2026 26th IEEE International Conference on Autonomous Robot Systems and Competitions

点击查看摘要

Abstract:Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies. Across two indoor environments, the proposed method improves success rate while reducing search effort. Overall, the results support the value of combining Bayesian belief estimation with learned action selection to achieve more efficient and reliable objectsearch behavior under partial observability.
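该方法在空间信念图上做在线贝叶斯更新:每收到一次(校准后的)检测,就把各格子的先验乘以观测似然再归一化。一个一维简化示意如下(似然数值为假设):

```python
def update_belief(belief, likelihood):
    """对目标位置信念图做一次贝叶斯更新:
    belief[i] ∝ belief[i] * P(观测 | 目标在格子 i),随后归一化。"""
    post = [b * l for b, l in zip(belief, likelihood)]
    z = sum(post)
    return [p / z for p in post]

# 4 个格子的均匀先验;检测器在格子 2 附近报告了一次正检
belief = [0.25, 0.25, 0.25, 0.25]
likelihood = [0.1, 0.2, 0.9, 0.2]   # 假设的校准似然,非论文数值
belief = update_belief(belief, likelihood)
```

更新后概率质量向似然最高的格子集中,强化学习策略再以该信念图为输入选择导航动作。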

[CV-53] InstanceAnimator: Multi-Instance Sketch Video Colorization

【速读】:该论文旨在解决多实例草图视频着色中存在的三大核心问题:现有方法因过度依赖单参考帧而导致用户控制灵活性不足、在多角色场景中实例可控性差引发错位问题,以及细粒度区域细节保真度下降。解决方案的关键在于提出三个创新模块:一是Canvas Guidance Condition,通过允许自由放置参考元素和背景来消除工作流程碎片化,实现前所未有的用户灵活性;二是Instance Matching Mechanism,通过将实例特征与草图融合,确保对多个角色的精确控制以解决错位问题;三是Adaptive Decoupled Control Module,通过向扩散过程注入来自角色、背景及文本条件的语义特征,显著提升细节保真度。

链接: https://arxiv.org/abs/2603.25357
作者: Yinhan Zhang,Yue Ma,Bingyuan Wang,Kunyu Feng,Yeying Jin,Qifeng Chen,Anyi Rao,Zeyu Wang
机构: HKUST(GZ) (香港科技大学(广州)); HKUST (香港科技大学); NUS (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

[CV-54] Image Rotation Angle Estimation: Comparing Circular-Aware Methods

【速读】:该论文旨在解决图像自动旋转估计(automatic image rotation estimation)这一关键预处理问题,其核心挑战在于角度具有环形拓扑结构(circular topology),导致传统回归方法在边界处产生不连续性,从而影响模型性能。解决方案的关键在于采用五种面向环形特性的建模方法:直接角度回归结合环形损失函数、基于角度分箱的分类法、单位向量回归、相位移编码器以及环形高斯分布建模。实验表明,概率化方法(尤其是环形高斯分布)在不同骨干网络上表现出更强的鲁棒性,而分类法虽在匹配良好的骨干网络中精度最优,但训练稳定性较差;最终通过迁移学习与架构适配,在DRC-D和COCO数据集上分别实现1.23°和2.84°的平均绝对误差(MAE),显著优于现有方法。

链接: https://arxiv.org/abs/2603.25351
作者: Maximilian Woehrer
机构: University of Vienna (维也纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 7 pages, 3 figures, 2 tables. Under review at Pattern Recognition Letters

点击查看摘要

Abstract:Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. Training and evaluating our top-performing method-architecture combinations on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
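角度的环形拓扑意味着 359° 与 1° 实际只差 2°,朴素回归会把它算成 358° 的大误差;摘要中的"单位向量回归"正是为此把角度编码为 (cos θ, sin θ)。极简示意如下(非论文实现):

```python
import math

def angle_to_vec(deg):
    """单位向量表示:把角度映射到 (cos, sin),消除 0°/360° 边界不连续。"""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

def vec_to_angle(v):
    """从单位向量恢复角度(归一化到 [0, 360))。"""
    return math.degrees(math.atan2(v[1], v[0])) % 360.0

def circular_error(pred_deg, true_deg):
    """环形误差:取两方向中较短的角距离,最大为 180°。"""
    d = abs(pred_deg - true_deg) % 360.0
    return min(d, 360.0 - d)

# 边界处:359° 与 1° 的环形误差应为 2°,而不是朴素差值 358°
err = circular_error(359.0, 1.0)
roundtrip = vec_to_angle(angle_to_vec(270.0))
```

评测时的 MAE 同样应按 circular_error 这样的环形度量统计,否则边界附近的预测会被严重高估误差。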

[CV-55] HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT CVPR2026

【速读】:该论文旨在解决视觉几何接地变换器(Visual Geometry Grounded Transformer, VGGT)中全局注意力层因二次计算复杂度导致的可扩展性瓶颈问题,同时避免现有基于稀疏化的加速方法因统一稀疏模式引发的显著性能下降。其解决方案的关键在于提出一种两阶段稀疏化流程:首先引入“头敏感度评分”(Head Sensitivity Score, HeSS),通过在小规模校准集上近似Hessian矩阵来量化每个注意力头对稀疏化的敏感程度;其次在推理阶段实施HeSS引导的稀疏化策略,根据预计算的HeSS动态分配注意力预算——将更密集的注意力分配给敏感头,而将更稀疏的注意力分配给鲁棒性强的头。该方法有效捕捉了注意力头间异质性的稀疏敏感特性,显著缓解了高稀疏度下的性能退化问题,并展现出跨不同稀疏水平的强鲁棒性。

链接: https://arxiv.org/abs/2603.25336
作者: Yongsung Kim,Wooseok Song,Jaihyun Lew,Hun Hwangbo,Jaehoon Lee,Sungroh Yoon
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits headwise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget-assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at this https URL.
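推理阶段的 HeSS-Guided Sparsification 按各注意力头的敏感度重新分配总注意力预算。下面是一个按比例分配的极简示意(分配规则、取整方式均为本文假设,非论文原式):

```python
def allocate_budget(sensitivity, total_budget, min_keep=1):
    """按敏感度比例分配注意力预算:敏感的头分到更密的注意力,
    鲁棒的头分到更稀疏的注意力(假设的分配规则)。"""
    s = sum(sensitivity)
    raw = [total_budget * x / s for x in sensitivity]
    budgets = [max(min_keep, int(round(r))) for r in raw]
    # 取整产生的差额统一由得分最高的头吸收,保证总预算不变
    diff = total_budget - sum(budgets)
    budgets[sensitivity.index(max(sensitivity))] += diff
    return budgets

hess = [8.0, 1.0, 1.0, 2.0]      # 4 个头的 HeSS 得分(示例数值)
budgets = allocate_budget(hess, total_budget=48)
```

敏感度最高的头拿到最密的注意力,鲁棒的头被更激进地稀疏化,而总计算预算保持不变,这正是"重分配而非统一稀疏"的含义。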

[CV-56] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

【速读】:该论文旨在解决多参考图像生成(multi-reference image generation)中模型性能随输入参考数量增加而显著下降的问题。其核心挑战在于现有数据集普遍缺乏结构化的长程上下文监督,难以学习密集的跨参考依赖关系。解决方案的关键是提出MacroData——一个包含400K样本的大规模数据集,每个样本最多包含10张参考图像,并按定制化(Customization)、插画(Illustration)、空间推理(Spatial reasoning)和时序动态(Temporal dynamics)四个互补维度系统组织,从而提供对多参考生成空间的全面覆盖;同时引入MacroBench基准测试框架,用于评估不同任务维度与输入规模下的生成一致性,推动该领域标准化评测的发展。

链接: https://arxiv.org/abs/2603.25319
作者: Zhekai Chen,Yuqing Wang,Manyuan Zhang,Xihui Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions – Customization, Illustration, Spatial reasoning, and Temporal dynamics – to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

[CV-57] Adaptive Learned Image Compression with Graph Neural Networks CVPR2026

【速读】:该论文旨在解决当前基于卷积神经网络(Convolutional Neural Networks, CNNs)或Transformer的 learned image compression (LIC) 方法在建模图像局部与全局冗余时存在的刚性问题。现有方法受限于固定感受野和静态连接模式,难以自适应地捕捉空间变化的冗余特征,尤其在全局层面表现不足。解决方案的关键在于提出一种基于图神经网络(Graph Neural Networks, GNNs)的内容自适应图像压缩框架——GLIC,其核心创新包括:构建双尺度图结构以实现数据驱动的灵活感受野,并引入动态邻接机制,根据局部内容复杂度自适应调整每个节点的邻居数量,从而有效建模图像中多样化的冗余模式,显著提升压缩效率。

链接: https://arxiv.org/abs/2603.25316
作者: Yunuo Chen,Bing He,Zezheng Lyu,Hongwei Hu,Qunshan Gu,Yuan Tian,Guo Lu
机构: Shanghai Jiao Tong University (上海交通大学); Massachusetts Institute of Technology (麻省理工学院); Alibaba Group (阿里巴巴集团); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Efficient image compression relies on modeling both local and global redundancy. Most state-of-the-art (SOTA) learned image compression (LIC) methods are based on CNNs or Transformers, which are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model’s ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression. Experiments demonstrate that GLIC achieves state-of-the-art performance, achieving BD-rate reductions of 19.29%, 21.69%, and 18.71% relative to VTM-9.1 on Kodak, Tecnick, and CLIC, respectively. Code will be released at this https URL.
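GLIC 的"动态邻接"思路是按局部内容复杂度为每个节点自适应选择邻居数 k。下面给出一个纯 Python 的极简示意(复杂度到 k 的线性映射为假设,仅说明"复杂区域连更多边"):

```python
import math

def knn_indices(points, i, k):
    """返回点 i 的 k 个最近邻下标(跳过自身)。"""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[1:k + 1]

def adaptive_graph(points, complexity, k_min=1, k_max=3):
    """内容自适应连边示意:局部复杂度高的节点分到更多邻居
    (复杂度度量与 k 的映射均为假设)。"""
    c_lo, c_hi = min(complexity), max(complexity)
    edges = {}
    for i, c in enumerate(complexity):
        t = (c - c_lo) / (c_hi - c_lo) if c_hi > c_lo else 0.0
        k = k_min + round(t * (k_max - k_min))
        edges[i] = knn_indices(points, i, k)
    return edges

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cx = [0.1, 0.9, 0.1, 0.5]        # 每个节点的局部内容复杂度(示例)
g = adaptive_graph(pts, cx)
```

与固定卷积核或固定窗口注意力相比,这种数据驱动的连边让感受野随内容变化,简单区域的冗余计算也随之减少。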

[CV-58] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework

【速读】:该论文旨在解决低光照图像增强(Low-light Image Enhancement, LLIE)任务中因环境光照条件和传感器参数未知而导致的病态性问题,即传统确定性映射方法难以应对多模态解空间,导致预测结果与真实标签间常出现亮度差异,从而依赖“gt-mean”后处理来对齐亮度进行评估。其解决方案的关键在于提出可控低光照增强(Controllable Low-light Enhancement, CLE),将LLIE重构为一个有良好定义的条件问题,并引入CLE-RWKV框架,结合Light100新基准(支持连续真实光照变化)和HVI颜色空间中的噪声解耦监督策略,实现亮度控制与色彩保真度的有效分离;同时采用Space-to-Depth(S2D)策略适配状态空间模型(State Space Models, SSMs)用于密集预测,在保持线性复杂度的同时恢复局部归纳偏置,有效弥合扁平化视觉序列中的“扫描间隙”。

链接: https://arxiv.org/abs/2603.25296
作者: Hongru Han,Tingrui Guo,Liming Zhang,Yan Su,Qiwen Xu,Zhuohua Ye
机构: University of Macau (澳门大学); China University of Geosciences (中国地质大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating “gt-mean” post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the “scanning gap” inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
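Space-to-Depth (S2D) 把每个 r×r 空间邻域折叠进通道维,使展平成序列后相邻 token 仍携带局部邻域信息。numpy 示意如下(通道在前的 (C, H, W) 布局为本文假设):

```python
import numpy as np

def space_to_depth(x, r=2):
    """Space-to-Depth:把 r×r 空间邻域折叠进通道维,
    (C, H, W) -> (C*r*r, H/r, W/r),为线性扫描模型保留局部归纳偏置。"""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)            # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

x = np.arange(16, dtype=float).reshape(1, 4, 4)
y = space_to_depth(x, r=2)
```

折叠后每个新通道对应原图的一个固定子采样相位,例如 y[0] 收集所有(偶行, 偶列)像素;同一空间位置的 r*r 个像素在通道维上相邻,扫描时不再被拉远。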

[CV-59] V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception CVPR2026

【速读】:该论文旨在解决现代自动驾驶感知系统在复杂环境中因遮挡(occlusion)、盲区(blind spot)和有限感知范围所导致的性能瓶颈问题。现有协同感知范式如车对车(Vehicle-to-Vehicle, V2V)和车对基础设施(Vehicle-to-Infrastructure, V2I)受限于地面层级协作,难以应对大尺度遮挡与远距离感知挑战。其解决方案的关键在于提出首个面向真实场景的多模态车对无人机(Vehicle-to-UAV, V2U)协同目标感知数据集V2U4Real,通过地面车辆与无人机协同搭载多视角激光雷达(LiDAR)与RGB相机采集数据,覆盖城市街道、校园及乡村道路等多种交通场景,并提供超过56K帧LiDAR数据、56K张多视角图像及70万条3D边界框标注,从而支持单智能体与协同式3D目标检测、跟踪等任务的研究,验证了V2U协作在提升感知鲁棒性与远距感知能力方面的有效性。

链接: https://arxiv.org/abs/2603.25275
作者: Weijia Li,Haoen Xiang,Tianxu Wang,Shuaibing Wu,Qiming Xia,Cheng Wang,Chenglu Wen
机构: Xiamen University (厦门大学); Fujian Key Laboratory of Urban Intelligent Sensing and Computing (福建省城市智能感知与计算重点实验室); Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China (教育部多媒体可信感知与高效计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase is available at this https URL.

[CV-60] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval CVPR2026

【速读】:该论文旨在解决文本-视频检索任务中因忽略视频内部帧间关系而导致的文本与视频语义不匹配问题。现有方法虽通过增强文本表达能力提升检索效果,但仅关注文本与视频帧之间的交互,未能有效建模视频帧间的上下文信息,从而导致生成的文本嵌入缺乏帧级语义一致性。解决方案的关键在于提出Energy-Aware Fine-Grained Relationship Learning Network (EagleNet),其核心创新是Fine-Grained Relationship Learning (FRL)机制:首先构建文本候选与视频帧之间的图结构,学习文本与帧之间及帧内细粒度关系,并基于此聚合文本候选以生成融合帧上下文信息的增强文本嵌入;同时引入Energy-Aware Matching (EAM)建模文本-帧交互的能量分布,以更精确捕捉真实文本-视频对的分布特性;此外,采用sigmoid损失替代传统softmax对比损失,提升跨模态对齐效果和训练稳定性。

链接: https://arxiv.org/abs/2603.25267
作者: Yuhan Chen,Pengwen Dai,Chuan Wang,Dayan Wu,Xiaochun Cao
机构: Sun Yat-sen University (中山大学); Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Shenzhen Key Laboratory of Adversarial Artificial Intelligence (深圳市对抗性人工智能重点实验室); Beijing Jiaotong University (北京交通大学); Institute of Information Engineering, CAS (中国科学院信息工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at this https URL.
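摘要最后提到用 sigmoid 损失替换基于 softmax 的对比损失:每个文本-视频对独立地做二分类,而不在整个 batch 上归一化。一个极简数值示意如下(SigLIP 风格,温度与偏置项省略,属本文假设):

```python
import math

def sigmoid_loss(sim, labels):
    """sigmoid 式对比损失示意:每个文本-视频对独立做二分类,
    labels[i][j]=1 表示第 i 条文本与第 j 个视频匹配。"""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if labels[i][j] else -1.0
            total += math.log1p(math.exp(-z * sim[i][j]))  # 逐对二元交叉熵
    return total / (n * n)

sim_good = [[5.0, -4.0], [-4.0, 5.0]]   # 对角线(正样本对)得分高
sim_bad = [[-4.0, 5.0], [5.0, -4.0]]    # 正负样本得分颠倒
labels = [[1, 0], [0, 1]]

loss_good = sigmoid_loss(sim_good, labels)
loss_bad = sigmoid_loss(sim_bad, labels)
```

由于各对损失相互独立,这类损失对 batch 大小不敏感,训练通常也更稳定,这与摘要中"更有效的跨模态对齐和稳定训练"的动机一致。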

[CV-61] ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis

【速读】:该论文旨在解决基于未对齐图像(unposed images)进行新视角合成时,现有前馈式3D高斯点绘(3D Gaussian splatting)方法中存在的保真度不足问题。其核心瓶颈在于单步前馈网络难以回归出适用于所有视点的静态高斯基元(Gaussian primitives),导致重建质量受限。解决方案的关键在于提出一种“视图自适应动态点绘”(view-adaptive dynamic splatting)机制:首先预测基础高斯基元及用于动态调整的MLP权重,在渲染阶段,这些MLP以目标视点坐标为输入,生成针对每个高斯属性(3D位置、尺度、旋转、不透明度和颜色)的视图依赖残差更新,从而实现高保真度的外观建模。该方法在保持快速推理(17 FPS)与实时渲染(154 FPS)的同时显著提升了重建质量。

链接: https://arxiv.org/abs/2603.25265
作者: Moonyeon Jeong,Seunggi Min,Suhyeon Lee,Hongje Seong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 10 figures

点击查看摘要

Abstract:We present ViewSplat, a view-adaptive 3D Gaussian splatting network for novel view synthesis from unposed images. While recent feed-forward 3D Gaussian splatting has significantly accelerated 3D scene reconstruction by bypassing per-scene optimization, a fundamental fidelity gap remains. We attribute this bottleneck to the limited capacity of single-step feed-forward networks to regress static Gaussian primitives that satisfy all viewpoints. To address this limitation, we shift the paradigm from static primitive regression to view-adaptive dynamic splatting. Instead of a rigid Gaussian representation, our pipeline learns a view-adaptable latent representation. Specifically, ViewSplat initially predicts base Gaussian primitives alongside the weights of dynamic MLPs. During rendering, these MLPs take target view coordinates as input and predict view-dependent residual updates for each Gaussian attribute (i.e., 3D position, scale, rotation, opacity, and color). This mechanism, which we term view-adaptive dynamic splatting, allows each primitive to rectify initial estimation errors, effectively capturing high-fidelity appearances. Extensive experiments demonstrate that ViewSplat achieves state-of-the-art fidelity while maintaining fast inference (17 FPS) and real-time rendering (154 FPS).
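ViewSplat 的核心是"视图自适应动态点绘":渲染时由轻量 MLP 以目标视角为输入,为每个高斯属性输出残差修正。单层 MLP 的极简示意如下(网络结构、残差幅度 scale 等均为本文假设):

```python
import math

def view_residual(base_attrs, view_dir, W, b, scale=0.1):
    """视图自适应残差示意(假设实现):轻量 MLP 以目标视角为输入,
    为每个高斯属性输出残差,叠加到基础属性上。"""
    h = [math.tanh(sum(w * v for w, v in zip(row, view_dir)) + bi)
         for row, bi in zip(W, b)]
    return [a + scale * r for a, r in zip(base_attrs, h)]

base = [0.5, 1.0, 0.0]                  # 某高斯基元的部分属性(示例)
W = [[0.2, -0.1], [0.0, 0.3], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]

attrs_v1 = view_residual(base, [1.0, 0.0], W, b)   # 视角 1
attrs_v2 = view_residual(base, [0.0, 1.0], W, b)   # 视角 2
```

不同视角得到不同的属性修正,且残差被 tanh 与 scale 约束在基础属性附近,符合"基础基元 + 视图相关残差"的设计思路。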

[CV-62] Towards Practical Lossless Neural Compression for LiDAR Point Clouds

【速读】:该论文旨在解决高精度LiDAR点云中几何细节极端稀疏导致的上下文建模效率低下问题,从而限制了现有压缩方法的速度与性能。其核心解决方案在于提出一种紧凑表示框架,包含两个轻量级模块:一是几何重稠密化模块(Geometry Re-Densification Module),通过迭代稠密化编码后的稀疏几何信息,在密集尺度提取特征后再进行稀疏化处理以支持预测编码,避免在高度稀疏区域进行昂贵计算的同时保持轻量预测头;二是跨尺度特征传播模块(Cross-scale Feature Propagation Module),利用多分辨率层级的占用线索引导分层特征传播,实现跨尺度信息共享并减少冗余特征提取。此外,引入纯整数推理流程确保比特级跨平台一致性,防止现有神经压缩方法中的熵编码崩溃现象,并进一步提升编码速度。

链接: https://arxiv.org/abs/2603.25260
作者: Pengpeng Yu,Haoran Li,Runqing Jiang,Dingquan Li,Jing Wang,Liang Lin,Yulan Guo
机构: Sun Yat-sen University (中山大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR point clouds are fundamental to various applications, yet the extreme sparsity of high-precision geometric details hinders efficient context modeling, thereby limiting the compression speed and performance of existing methods. To address this challenge, we propose a compact representation for efficient predictive lossless coding. Our framework comprises two lightweight modules. First, the Geometry Re-Densification Module iteratively densifies encoded sparse geometry, extracts features at a dense scale, and then sparsifies the features for predictive coding. This module avoids costly computation on highly sparse details while maintaining a lightweight prediction head. Second, the Cross-scale Feature Propagation Module leverages occupancy cues from multiple resolution levels to guide hierarchical feature propagation, enabling information sharing across scales and reducing redundant feature extraction. Additionally, we introduce an integer-only inference pipeline to enable bit-exact cross-platform consistency, which avoids the entropy-coding collapse observed in existing neural compression methods and further accelerates coding. Experiments demonstrate competitive compression performance at real-time speed. Code is available at this https URL.

[CV-63] Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection

【速读】:该论文旨在解决多月尺度下密集GPS轨迹异常检测的计算瓶颈问题,现有方法因二次复杂度难以处理长期数据,而稀疏停留点方法则丢失了细粒度证据(如异常速度和短时事件),导致无法统一建模不同密度轨迹。解决方案的关键在于提出TITAnD框架,其核心创新是将轨迹异常检测重构为视觉问题:通过构建高光谱轨迹图像(Hyperspectral Trajectory Image, HTI)将轨迹映射为天数×时段的二维网格,通道编码空间、语义、时间和运动学信息;并设计循环因子化Transformer(Cyclic Factorized Transformer, CFT),沿日周期和跨日周期轴分解注意力机制,显式建模人类行为的双重循环结构,显著降低计算复杂度并首次实现密集多月轨迹的异常检测。

链接: https://arxiv.org/abs/2603.25255
作者: Md Awsafur Rahman,Chandrakanth Gudavalli,Hardik Prajapati,B. S. Manjunath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
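摘要中的 HTI 将轨迹栅格化为"天数 × 时段"的二维网格,通道编码多类信息。下面给出一个极简的 Python 示意(函数名与通道选择均为笔者假设,仅演示栅格化思想,并非论文实现):

```python
import numpy as np

def build_hti(points, num_days, bins_per_day=48):
    """把轨迹点栅格化为 (天数, 时段, 通道) 网格。

    points: (unix_ts, lat, lon, speed) 元组列表;这里的通道取
    (平均纬度, 平均经度, 平均速度, 点数),仅作为论文中
    空间/语义/时间/运动学通道的玩具替代。
    """
    grid = np.zeros((num_days, bins_per_day, 4))
    t0 = min(p[0] for p in points)
    day0 = int(t0 // 86400)                     # 以最早一天为第 0 天
    for ts, lat, lon, speed in points:
        day = int(ts // 86400) - day0
        if not 0 <= day < num_days:
            continue
        tbin = int(ts % 86400 / 86400 * bins_per_day)  # 当日时段索引
        grid[day, tbin, :3] += (lat, lon, speed)
        grid[day, tbin, 3] += 1
    nz = grid[..., 3] > 0                       # 非空单元的前三通道取均值
    grid[nz, :3] /= grid[nz, 3:]
    return grid
```

得到的网格即可直接交给图像分类/分割模型,这正是摘要所说"把轨迹异常检测重构为视觉问题"的入口。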

[CV-64] Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models CVPR2026

【速读】:该论文旨在解决分布外(Out-of-distribution, OOD)检测中因负样本标签激活不足导致的性能瓶颈问题。现有方法依赖于静态负标签,但这些标签在面对OOD样本时可能缺乏有效激活,难以捕捉其特征。解决方案的关键在于提出**测试时激活负标签(Test-time Activated Negative Labels, TANL)**机制:通过在线评估测试样本在语料库中的激活水平,动态挖掘高激活响应的候选负标签;并基于历史测试样本构建标签激活度量,实现对测试分布的自适应调整。进一步地,引入细粒度的批次自适应变体和激活感知评分函数,以充分利用激活信息,显著提升OOD检测性能与鲁棒性,且无需训练、计算高效,并具备理论支撑。

链接: https://arxiv.org/abs/2603.25250
作者: Yabin Zhang,Maya Varma,Yunhe Gao,Jean-Benoit Delbrouck,Jiaming Liu,Chong Wang,Curtis Langlotz
机构: Harbin Institute of Technology (Shenzhen); Stanford University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026 main track, Codes are available at this https URL

点击查看摘要

Abstract:Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at this https URL (YBZh/OpenOOD-VLM).
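TANL 的核心操作是:对高置信测试样本,在候选负标签语料上累计其分配概率,从而在线构建"标签激活度量"并挑出高激活负标签。以下为该思想的玩具示意(类名、置信阈值与 softmax 细节均为笔者假设,并非论文实现):

```python
import numpy as np

class ActivatedNegativeLabels:
    """在测试过程中累计候选负标签的激活度,并选取高激活标签。

    update() 的输入 sim 是单个测试图像与语料中每个候选标签的
    相似度得分(可理解为 CLIP 风格的 logits,此处仅作占位)。
    """
    def __init__(self, num_labels, conf_thresh=0.5):
        self.activation = np.zeros(num_labels)
        self.conf_thresh = conf_thresh

    def update(self, sim):
        probs = np.exp(sim) / np.exp(sim).sum()   # 语料标签上的分配概率
        if probs.max() >= self.conf_thresh:       # 仅累计高置信样本
            self.activation += probs
        return self.activation

    def select(self, k):
        # 返回激活度最高的 k 个候选负标签的索引
        return np.argsort(-self.activation)[:k]
```

随着测试样本不断到来,`activation` 会逐步贴合测试分布,这正是摘要中"利用历史测试样本自适应对齐测试分布"的含义。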

[CV-65] Semantic-Aware Prefix Learning for Token-Efficient Image Generation

【速读】:该论文旨在解决现有视觉分词器(visual tokenizer)在图像生成任务中语义对齐不足的问题,即大多数分词器依赖重建主导的目标函数,导致潜在表示与高层语义关联较弱。解决方案的关键在于提出SMAP(SeMantic-Aware Prefix tokenizer),其核心创新是将类别级语义条件注入基于查询的1D分词框架,并通过引入尾部token丢弃策略(tail token dropping),迫使语义条件和早期潜在前缀在逐步减少的token预算下承担越来越重要的表征责任,从而确保语义信息在训练过程中成为表征学习的必要组成部分。

链接: https://arxiv.org/abs/2603.25249
作者: Qingfeng Li,Haoxian Zhang,Xu He,Songlin Tang,Zhixue Fang,Xiaoqiang Liu,Pengfei Wan,Guoqi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive–Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.

[CV-66] FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics

【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)因成本高昂而难以广泛应用的问题,特别是如何从易于获取的全切片图像(Whole Slide Images, WSI)中准确推断空间基因表达。现有方法依赖于预定义的稀疏图结构建模组织区域间的相互作用,无法充分捕捉潜在的点对点交互关系,限制了对复杂生物关系的建模能力。其解决方案的关键在于提出FEAST(Fully connected Expressive Attention for Spatial Transcriptomics),一个基于注意力机制的框架:首先将组织建模为全连接图以考虑所有点对之间的交互;其次引入负向感知注意力(negative-aware attention),同时建模兴奋性和抑制性相互作用,从而更真实地反映生物学中的正负调控关系;此外,采用非网格采样策略(off-grid sampling)从中间区域提取额外图像信息,缓解标准采样导致的上下文信息丢失问题。实验表明,FEAST在基因表达预测性能上优于现有最优方法,并能生成具有生物学合理性的注意力热图,清晰揭示正负相互作用模式。

链接: https://arxiv.org/abs/2603.25247
作者: Taejin Jeong,Joohyeok Kim,Jinyeong Kim,Chanyoung Kim,Seong Jae Hwang
机构: Yonsei University (延世大学); Emory University (埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions. Our code is available at this https URL FEAST.
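摘要指出标准 softmax 注意力权重恒为非负,无法表达"抑制性"(负向)交互。下面用 tanh 归一化示意一种允许负权重的签名注意力(仅为概念演示,并非 FEAST 的实际公式):

```python
import numpy as np

def negative_aware_attention(q, k, v):
    """玩具版"负向感知"注意力:用 tanh 保留得分符号,
    再按绝对值归一化,使权重可正(兴奋)可负(抑制)。
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.tanh(scores)                     # 取值 (-1, 1),保留符号
    weights = weights / (np.abs(weights).sum(-1, keepdims=True) + 1e-8)
    return weights @ v, weights
```

与之对比,softmax 会把负得分也映射为正权重,从而丢失摘要强调的负向关系。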

[CV-67] Efficient Preemptive Robustification with Image Sharpening

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)对不可察觉扰动敏感的问题,尤其是在迁移场景下模型鲁棒性不足的挑战。现有防御方法如对抗训练或输入净化等,存在依赖预训练分类器作为代理、计算开销大或缺乏可解释性等局限。本文的关键解决方案是提出一种无需替代模型、无需迭代优化或生成器的图像锐化(image sharpening)方法,其基于纹理强度与良性样本鲁棒性之间的正相关关系,实现了高效且人类可解释的预攻击防御机制,显著提升了模型在迁移场景下的鲁棒性,同时保持极低的计算成本。

链接: https://arxiv.org/abs/2603.25244
作者: Jiaming Liang,Chi-Man Pun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.
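摘要的核心发现是:仅靠图像锐化即可对良性样本进行"预先鲁棒化"。下面给出一个基于非锐化掩模(unsharp masking)的极简示意(3×3 盒式模糊、灰度图取值范围 [0,1] 等均为笔者假设):

```python
import numpy as np

def sharpen(img, amount=1.0):
    """非锐化掩模: img + amount * (img - blur(img))。

    blur 用 3x3 盒式滤波实现,amount 控制纹理增强强度;
    这只是锐化式鲁棒化思想的一个玩具替代。
    """
    # 边缘用复制填充,再以 9 个平移副本求和实现盒式模糊
    p = np.pad(img.astype(float), 1, mode="edge")
    blur = sum(p[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    out = img + amount * (img - blur)
    return np.clip(out, 0.0, 1.0)
```

由于无需代理分类器、迭代优化或生成器,该操作的开销仅是一次固定卷积,这与摘要声称的"低计算成本"一致。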

[CV-68] A Unified Spatial Alignment Framework for Highly Transferable Transformation-Based Attacks on Spatially Structured Tasks

【速读】:该论文旨在解决基于变换的对抗攻击(Transformation-based Adversarial Attacks, TAAs)在结构化任务(如语义分割和目标检测)中转移性不足的问题。现有方法在非结构化任务(如图像分类)中表现良好,但在结构化任务中因输入与标签的空间不一致导致梯度错误,从而失效。解决方案的关键在于提出一种统一的空间对齐框架(Spatial Alignment Framework, SAF),通过所提出的空间对齐(Spatial Alignment, SA)算法,在应用空间变换时同步地对标签进行空间变换,以保持输入与标签之间的空间一致性,从而显著提升对抗攻击在结构化任务中的迁移能力。

链接: https://arxiv.org/abs/2603.25230
作者: Jiaming Liang,Chi-Man Pun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformation-based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non-spatial and spatial transformations inspire us to address this challenge. We find that for non-structured tasks, labels are spatially non-structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non-targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir-SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
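SAF 的核心是:对输入施加空间变换时,必须对空间结构化的标签施加同一变换,否则梯度会在错位的监督下计算。以下以翻转与平移两种变换为例给出示意(具体变换集合与实现细节以论文为准):

```python
import numpy as np

def spatial_transform_pair(image, label, flip=False, shift=(0, 0)):
    """对输入与稠密标签(如分割掩码)施加相同的空间变换。

    对分类等非结构化任务,标签无空间结构,无需此步;
    对分割/检测,缺少这一步会导致输入-标签空间错位。
    """
    if flip:
        image, label = image[:, ::-1], label[:, ::-1]
    dy, dx = shift
    image = np.roll(image, (dy, dx), axis=(0, 1))
    label = np.roll(label, (dy, dx), axis=(0, 1))
    return image, label
```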

[CV-69] An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models

【速读】:该论文旨在解决发展中国家(如孟加拉国)皮肤疾病诊断资源匮乏的问题,特别是在缺乏足够皮肤科专家和诊断设备的情况下,如何利用人工智能技术实现高效、准确的皮肤疾病自动识别。其关键解决方案是构建一个公开可用的区域性皮肤疾病图像数据集,涵盖五种在南亚地区高发的皮肤病(接触性皮炎、白癜风、湿疹、疥疮和体癣),共包含1612张图像(每类301–381张,其中250张为原始图像,其余经增强处理),并在此基础上应用多种机器学习与深度学习模型进行分类性能评估。该数据集具有区域代表性且所选病种具全球普适性,为基于AI的皮肤疾病自动化诊断提供了可复用的基础资源与实证依据。

链接: https://arxiv.org/abs/2603.25229
作者: Sazzad Hossain,Saiful Islam,Muhammad Ibrahim,Md. Rasel Ahmed,Md Shuayb,Ahmedul Kabir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages

点击查看摘要

Abstract:Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In highly populated countries like Bangladesh, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. Without proper detection and treatment, skin diseases may lead to severe health consequences, including death. Skin diseases commonly manifest as changes in the color, texture, and pattern of the skin, and in this era of artificial intelligence and machine learning, such changes can be detected using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which 250 are distinct, while the others are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprise 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data were collected regionally, the selected diseases are common across many countries, especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models to the dataset and report classification performance. We expect that this research will garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.

[CV-70] Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments

【速读】:该论文旨在解决手术器械的6D位姿估计(6D pose estimation)问题,尤其针对现有监督学习方法在面对未见过的器械时缺乏灵活性且依赖大量标注数据的局限性。解决方案的关键在于提出了一种无需训练的多视角位姿估计流程:首先利用预训练特征提取器生成物体掩码候选区域,并通过跨视角匹配与几何一致性过滤获得3D实例候选;随后采用基于特征-度量分数和跨视角注意力机制的迭代优化策略对位姿假设进行评分与 refinement,最终引入一种新颖的多视角、遮挡感知轮廓注册方法,最小化未被遮挡轮廓点的重投影误差,从而实现高精度位姿估计。该方案仅需一个带纹理的CAD模型作为先验知识,即可在真实手术场景中实现对未见器械的准确检测与跟踪。

链接: https://arxiv.org/abs/2603.25228
作者: Jonas Hein,Lilian Calvet,Matthias Seibold,Siyu Tang,Marc Pollefeys,Philipp Fürnstahl
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IJCARS: IPCAI 2026

点击查看摘要

Abstract:Purpose: Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments, which only requires a textured CAD model as prior knowledge. Methods: Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration, which minimizes reprojection errors of unoccluded contour points. Results: The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. The method achieves millimeter-accurate pose estimates that are on par with supervised methods under controlled conditions, while maintaining full generalization to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges in surgical environments. Conclusion: We present a novel and flexible pipeline that effectively combines state-of-the-art foundational models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.

[CV-71] SDD-YOLO: A Small-Target Detection Framework for Ground-to-Air Anti-UAV Surveillance with Edge-Efficient Deployment

【速读】:该论文旨在解决从地面到空中(G2A)视角下小型无人机(UAV)检测的难题,其核心挑战包括目标在图像中占据像素极少、天空背景复杂以及严格的实时性要求。现有基于YOLO的目标检测器主要针对通用物体优化,在亚像素级小目标上的特征分辨率不足,且部署复杂度高。解决方案的关键在于提出SDD-YOLO框架:一是引入P2高分辨率检测头(仅4倍下采样),以捕捉微小目标的关键空间细节;二是集成YOLO26的无DFL、无NMS架构实现高效推理,并采用MuSGD混合训练策略结合ProgLoss与STAL损失函数,显著缓解稀疏小目标信号下的梯度震荡问题。该方案在自建的大规模DroneSOD-30K数据集上实现86.0% mAP@0.5,较YOLOv5n提升7.8个百分点,且在RTX 5090上达到226 FPS、在Xeon CPU上达到35 FPS,兼具高精度与面向边缘部署的强实时性。

链接: https://arxiv.org/abs/2603.25218
作者: Pengyu Chen,Haotian Sa,Yiwei Hu,Yuhan Cheng,Junbo Wang
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting small unmanned aerial vehicles (UAVs) from a ground-to-air (G2A) perspective presents significant challenges, including extremely low pixel occupancy, cluttered aerial backgrounds, and strict real-time constraints. Existing YOLO-based detectors are primarily optimized for general object detection and often lack adequate feature resolution for sub-pixel targets, while introducing complexities during deployment. In this paper, we propose SDD-YOLO, a small-target detection framework tailored for G2A anti-UAV surveillance. To capture fine-grained spatial details critical for micro-targets, SDD-YOLO introduces a P2 high-resolution detection head operating at 4 times downsampling. Furthermore, we integrate the recent architectural advancements from YOLO26, including a DFL-free, NMS-free architecture for streamlined inference, and the MuSGD hybrid training strategy with ProgLoss and STAL, which substantially mitigates gradient oscillation on sparse small-target signals. To support our evaluation, we construct DroneSOD-30K, a large-scale G2A dataset comprising approximately 30,000 annotated images covering diverse meteorological conditions. Experiments demonstrate that SDD-YOLO-n achieves a mAP@0.5 of 86.0% on DroneSOD-30K, surpassing the YOLOv5n baseline by 7.8 percentage points. Extensive inference analysis shows our model attains 226 FPS on an NVIDIA RTX 5090 and 35 FPS on an Intel Xeon CPU, demonstrating exceptional efficiency for future edge deployment.

[CV-72] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction CVPR2026

【速读】:该论文旨在解决预训练视频扩散模型(video diffusion models)在生成长视频时因分布外(out-of-distribution, O.O.D)问题导致的视觉质量显著下降的问题。具体而言,其核心挑战来自两个方面:帧级相对位置O.O.D和上下文长度O.O.D。解决方案的关键在于提出一种无需重新训练的层自适应框架FreeLOC,包含两项核心技术:一是基于视频的相对位置重编码(Video-based Relative Position Re-encoding, VRPR),通过多粒度策略分层重构时间相对位置以匹配模型预训练分布;二是分层稀疏注意力机制(Tiered Sparse Attention, TSA),通过在不同时间尺度上结构化注意力密度来同时保留局部细节与长程依赖关系。此外,引入层自适应探测机制识别各Transformer层对O.O.D问题的敏感性,从而实现方法的精准、高效应用。

链接: https://arxiv.org/abs/2603.25209
作者: Jiahao Tian,Chenxi Song,Wei Cheng,Chi Zhang
机构: Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026. Code: this https URL

点击查看摘要

Abstract:Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model’s pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at this https URL.
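VRPR 的出发点是让长视频的帧间相对位置落回预训练分布之内。下面用最简单的线性重缩放示意这一思想(论文实际采用多粒度分层策略,此处仅演示"保持相对位置在分布内"的核心动机,函数名为笔者假设):

```python
import numpy as np

def reencode_relative_positions(num_frames, train_len):
    """将长片段的帧索引线性重缩放到训练时见过的位置范围内。

    这样任意两帧的相对位置距离都不超过 train_len - 1,
    避免帧级相对位置 O.O.D;仅为玩具示意,非 VRPR 原实现。
    """
    idx = np.arange(num_frames, dtype=float)
    return idx * (train_len - 1) / max(num_frames - 1, 1)
```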

[CV-73] CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging

【速读】:该论文旨在解决医学人工智能(Medical AI)中跨机构泛化能力受限的问题,其根源在于选择偏差(selection bias),即患者的人口统计学特征(如年龄、病情严重程度)非随机地决定医院分配,从而导致站点特异性变异与诊断标签之间产生虚假相关性。传统领域泛化(Domain Generalization, DG)方法主要关注图像级分布偏移,无法有效处理此类结构性混淆。解决方案的关键在于提出一种基于因果推理的框架CIV-DG(Conditional Instrumental Variables for Domain Generalization),利用条件工具变量(Conditional Instrumental Variables, CIV)将病理语义与扫描仪诱导的伪影解耦;通过放宽标准工具变量方法对随机分配的严格假设,CIV-DG能够适应由患者特征内生驱动的医院选择场景,并借助深度广义矩估计(Deep Generalized Method of Moments, DeepGMM)架构,在人口统计学分层内最小化矩约束违反并强制工具变量与误差项正交,从而实现对结构混淆的有效缓解。

链接: https://arxiv.org/abs/2603.25202
作者: Shaojin Bai,Yuting Su,Weizhi Nie
机构: 天津大学( Tianjin University)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Cross-site generalizability in medical AI is fundamentally compromised by selection bias, a structural mechanism where patient demographics (e.g., age, severity) non-randomly dictate hospital assignment. Conventional Domain Generalization (DG) paradigms, which predominantly target image-level distribution shifts, fail to address the resulting spurious correlations between site-specific variations and diagnostic labels. To surmount this identifiability barrier, we propose CIV-DG, a causal framework that leverages Conditional Instrumental Variables to disentangle pathological semantics from scanner-induced artifacts. By relaxing the strict random assignment assumption of standard IV methods, CIV-DG accommodates complex clinical scenarios where hospital selection is endogenously driven by patient demographics. We instantiate this theory via a Deep Generalized Method of Moments (DeepGMM) architecture, employing a conditional critic to minimize moment violations and enforce instrument-error orthogonality within demographic strata. Extensive experiments on the Camelyon17 benchmark and large-scale Chest X-Ray datasets demonstrate that CIV-DG significantly outperforms leading baselines, validating the efficacy of conditional causal mechanisms in resolving structural confounding for robust medical AI.

[CV-74] TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation CVPR2026

【速读】:该论文旨在解决当前足球模仿研究中对真实球队战术行为精准复现不足的问题,现有方法多聚焦于基于奖励的目标优化(如进球数或胜率代理指标),而忽视了战术风格的准确捕捉。解决方案的关键在于提出TacSIm——一个大规模的足球战术风格模仿数据集与基准测试平台,其核心创新在于:通过单视角直播画面将双方22名球员的位置和动作投影到标准场地坐标系中,定义了基于空间占据相似性和运动向量相似性的量化评估协议,从而支持对球队整体时空协同性的客观衡量;同时在统一虚拟环境中运行多种基线方法生成完整团队行为,实现从广播数据到仿真结果的端到端一致性评估,为足球战术风格对齐的建模与评测建立了严谨的基准。

链接: https://arxiv.org/abs/2603.25199
作者: Peng Wen,Yuting Wang,Qiurui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Given offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactical style imitation is measured using spatial occupancy similarity and movement vector similarity over defined time windows, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation in football.
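摘要定义了两类评估指标:空间占据相似度与运动向量相似度。以下给出一个玩具实现示意(网格分辨率、球场尺寸与相似度取余弦等均为笔者假设,并非基准的官方评测代码):

```python
import numpy as np

def tactical_similarity(pos_a, pos_b, grid=(10, 7), pitch=(105.0, 68.0)):
    """玩具版战术风格相似度。

    pos_a / pos_b: 形状 (T, 11, 2) 的球员位置(标准球场坐标系)。
    占据相似度: 两队栅格化占据图的余弦相似度;
    运动相似度: 逐球员位移向量的平均余弦相似度。
    """
    def occupancy(pos):
        h = np.zeros(grid)
        xs = np.clip((pos[..., 0] / pitch[0] * grid[0]).astype(int), 0, grid[0] - 1)
        ys = np.clip((pos[..., 1] / pitch[1] * grid[1]).astype(int), 0, grid[1] - 1)
        np.add.at(h, (xs.ravel(), ys.ravel()), 1)
        return h.ravel()

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    occ = cos(occupancy(pos_a), occupancy(pos_b))
    mv_a = (pos_a[1:] - pos_a[:-1]).reshape(-1, 2)   # 逐帧位移向量
    mv_b = (pos_b[1:] - pos_b[:-1]).reshape(-1, 2)
    mov = float(np.mean([cos(u, v) for u, v in zip(mv_a, mv_b)]))
    return occ, mov
```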

[CV-75] CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis

【速读】:该论文旨在解决当前生成式模型在处理心脏电影磁共振成像(cine cardiac MRI, CMR)时,难以有效建模时空联合特性的难题。传统方法通常将空间与时间维度分离建模或通过辅助机制(如解剖掩膜)强制时间一致性,这引入了结构偏差,限制了全局上下文整合,并可能导致细微的时空不连续性或生理上不一致的心脏动力学。解决方案的关键在于提出一种完全4D(3D空间+1D时间)的潜在扩散框架CardioDiT,其核心创新是采用基于扩散变压器(diffusion transformer)的架构,结合一个时空VQ-VAE编码器,将2D+t切片压缩为紧凑潜在表示后,在生成过程中以完整3D+t体积形式联合建模空间与时间信息,从而实现无架构因子分解的连续心脏动力学学习,显著提升跨切片一致性、时间连贯运动及生理合理的心脏功能分布。

链接: https://arxiv.org/abs/2603.25194
作者: Marvin Seyfarth,Sarah Kaye Müller,Arman Ghanaat,Isabelle Ayx,Fabian Fastenrath,Philipp Wild,Alexander Hertel,Theano Papavassiliu,Salman Ul Hassan Dar,Sandy Engelhardt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at this https URL.

[CV-76] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

【速读】:该论文旨在解决现有身份保持视频生成方法在实际应用中面临的两大核心问题:一是多数模型仅支持单一身份参考源,难以适应多样化的输入格式(如人脸、肖像或视频),限制了创作灵活性;二是单参考源导致的病态设定(ill-posed scenario)使模型在新场景下难以忠实再现目标身份特征。解决方案的关键在于提出AnyID框架,其核心创新包括:(1)设计一种可扩展的全参考架构(omni-referenced architecture),能够将异构身份输入(如面部图像、肖像画和视频)统一为一致的身份表征;(2)引入主参考生成范式(primary-referenced generation paradigm),以一个参考作为基准锚点,并通过新颖的差异提示(differential prompt)实现属性级别的精确控制。该方案结合大规模数据训练与基于人类偏好评估的强化学习微调,显著提升了身份保真度和可控性。

链接: https://arxiv.org/abs/2603.25188
作者: Jiahao Wang,Hualian Sheng,Sijia Cai,Yuxiao Yang,Weizhan Zhang,Caixia Yan,Bing Deng,Jieping Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.

[CV-77] VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

【速读】:该论文旨在解决当前3D医学图像生成方法中普遍依赖卷积U-Net结构所导致的局部性偏差强、感受野有限的问题,这些问题限制了模型在全局上下文整合、可扩展性和灵活条件控制方面的性能。其解决方案的关键在于提出VolDiT——首个纯基于Transformer架构的3D扩散Transformer(Diffusion Transformer),通过体素补丁嵌入(volumetric patch embeddings)和直接作用于3D标记(tokens)的全局自注意力机制实现对原生3D数据的有效建模;同时引入时间步门控控制适配器(timestep-gated control adapter),将分割掩码映射为可学习的控制标记(control tokens),在去噪过程中动态调制Transformer层,从而实现精确的空间引导并保持Transformer的建模优势。

链接: https://arxiv.org/abs/2603.25181
作者: Marvin Seyfarth,Salman Ul Hassan Dar,Yannik Frisch,Philipp Wild,Norbert Frey,Florian André,Sandy Engelhardt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at this https URL.

[CV-78] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

【速读】:该论文旨在解决第一人称视角(egocentric)视频中3D人体姿态估计的挑战,主要包括严重的透视畸变、有限的身体可见性以及复杂的相机运动等问题。现有方法多依赖单帧分析或有限的时间融合,难以充分利用egocentric视频中丰富的运动上下文信息。其解决方案的关键在于提出一种双流框架AG-EgoPose,通过并行的空间流和时间流分别提取细粒度空间特征与长短时运动上下文,并在Transformer解码器中以可学习的关节令牌实现跨模态联合优化,从而在保持解剖学约束的同时融合空间与时间证据,显著提升姿态估计的鲁棒性与精度。

链接: https://arxiv.org/abs/2603.25175
作者: Md Mushfiqur Azam,John Quarles,Kevin Desai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: this https URL.
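空间流输出的 2D 关节热图最终需解码为关节坐标。下面是热图取峰值(argmax)解码的最小示意(heatmap 解码的通用做法的玩具实现,并非论文的 token 化解码器):

```python
# 示意性草图:从 2D 热图中取峰值位置作为关节坐标(heatmap 解码的简化)
def decode_joint(heatmap):
    """返回热图(列表的列表)峰值处的 (行, 列, 分数)。"""
    best = (-1, -1, float("-inf"))
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best[2]:
                best = (r, c, v)
    return best

hm = [[0.1, 0.2, 0.1],
      [0.2, 0.9, 0.3],
      [0.1, 0.2, 0.1]]
```

实际系统通常还会在 argmax 附近做亚像素精化,这里从略。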

[CV-79] Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling

【速读】:该论文旨在解决红外目标检测在复杂环境下的鲁棒性不足问题,尤其是面对常见噪声干扰和对抗样本时模型性能下降的挑战。现有数据驱动方法虽能提升训练集上的表现,但未充分考虑红外图像特有的物理特性,导致泛化能力受限。解决方案的关键在于引入红外物理知识——即不同类别间的相对热辐射关系(relative thermal radiation relations),通过理论建模其在灰度值排序下的稳定性,并将其嵌入对抗训练过程,形成知识引导的对抗训练(Knowledge-Guided Adversarial Training, KGAT)。该方法使模型预测结果不仅符合数据分布,更满足实际物理规律,从而显著提升干净准确率及对对抗攻击与常见退化的鲁棒性。

链接: https://arxiv.org/abs/2603.25170
作者: Shiji Zhao,Shukun Xiong,Maoxun Yuan,Yao Huang,Ranjie Duan,Qing Guo,Jiansheng Chen,Haibin Duan,Xingxing Wei
机构: Beihang University (北京航空航天大学); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the International Journal of Computer Vision (IJCV)

点击查看摘要

Abstract:In complex environments, infrared object detection exhibits broad applicability and stability across diverse scenarios. However, infrared object detection is vulnerable to both common corruptions and adversarial examples, leading to potential security risks. To improve the robustness of infrared object detection, current methods mostly adopt a data-driven ideology, which only superficially drives the network to fit the training data without specifically considering the unique characteristics of infrared images, resulting in limited robustness. In this paper, we revisit infrared physical knowledge and find that relative thermal radiation relations between different classes can be regarded as a reliable knowledge source under the complex scenarios of adversarial examples and common corruptions. Thus, we theoretically model thermal radiation relations based on the rank order of gray values for different classes, and further quantify the stability of various inter-class thermal radiation relations. Based on the above theoretical framework, we propose Knowledge-Guided Adversarial Training (KGAT) for infrared object detection, in which infrared physical knowledge is embedded into the adversarial training process, and the predicted results are optimized to be consistent with the actual physical laws. Extensive experiments on three infrared datasets and six mainstream infrared object detection models demonstrate that KGAT effectively enhances both clean accuracy and robustness against adversarial attacks and common corruptions.
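论文将"类间相对热辐射关系"建模为灰度值的序关系。下面用纯 Python 给出一个示意性草图(我们自拟的玩具形式,非论文实现):由各类的平均灰度构造两两序关系先验,再检查一组检测结果是否与该物理先验一致。

```python
# 示意性草图(自拟玩具形式):用灰度序关系近似类间相对热辐射先验
def rank_relations(class_means):
    """对每个类别对 (a, b),记录 a 的平均灰度是否高于 b。"""
    rel = {}
    for a, ma in class_means.items():
        for b, mb in class_means.items():
            if a != b:
                rel[(a, b)] = ma > mb
    return rel

def consistent(detections, relations):
    """detections: {类别: 检测区域的平均灰度}。检查两两序关系是否被保持。"""
    for (a, b), hotter in relations.items():
        if a in detections and b in detections:
            if (detections[a] > detections[b]) != hotter:
                return False
    return True

# 假设的类别平均灰度(仅作演示)
prior = rank_relations({"person": 180.0, "car": 120.0, "background": 60.0})
```

训练时可将此类一致性作为约束项嵌入对抗训练过程,这正是"知识引导"的直观含义。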

[CV-80] ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis ECCV2026

【速读】:该论文旨在解决基于Segment Anything Model (SAM) 的场景文本检测与版面分析方法中存在推理延迟高和数据利用率低的问题。具体而言,传统方法依赖像素级文本分割来采样数千个前景点作为提示,导致计算开销大且难以充分利用标注信息不一致的数据集。其解决方案的关键在于提出ET-SAM框架,该框架包含两个解码器:一是轻量级点解码器,用于生成词级别热图以减少前景点数量,从而显著降低推理延迟;二是分层掩码解码器,配合联合训练策略,能够融合具有多层级、仅词级或仅行级标注的异构数据集,通过引入三组可学习的任务提示缓解不同标注粒度间的差异,实现高效且鲁棒的统一文本检测与版面分析。

链接: https://arxiv.org/abs/2603.25168
作者: Xike Zhang,Maoyuan Ye,Juhua Liu,Bo Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures, 8 tables. Submitted to ECCV 2026

点击查看摘要

Abstract:Previous works based on Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address the above issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps for achieving a few foreground points, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, the datasets with multi-level, word-level only, and line-level only annotations are combined in parallel as a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3× inference acceleration while obtaining competitive performance on HierText, and improves an average of 11.0% F-score on Total-Text, CTW1500, and ICDAR15.
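点解码器"由词热图得到少量前景点提示"的思路,可用如下贪心取峰值的玩具代码示意(阈值、抑制半径等均为假设,非论文实现):

```python
# 示意性草图:从词热图中贪心选取少量峰值点作为点提示
def pick_points(heatmap, k, thresh=0.5, radius=1):
    """最多贪心选取 k 个分数高于 thresh 的峰值,每次抑制 (2r+1)^2 邻域。"""
    H, W = len(heatmap), len(heatmap[0])
    hm = [row[:] for row in heatmap]  # 拷贝一份,便于原地抑制
    points = []
    for _ in range(k):
        best, br, bc = thresh, -1, -1
        for r in range(H):
            for c in range(W):
                if hm[r][c] > best:
                    best, br, bc = hm[r][c], r, c
        if br < 0:  # 剩余分数都不超过阈值,提前结束
            break
        points.append((br, bc))
        for r in range(max(0, br - radius), min(H, br + radius + 1)):
            for c in range(max(0, bc - radius), min(W, bc + radius + 1)):
                hm[r][c] = 0.0
    return points

hm4 = [[0.9, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.8],
       [0.0, 0.0, 0.0, 0.0],
       [0.2, 0.0, 0.0, 0.0]]
pts = pick_points(hm4, 3)
```

相比逐像素采样数千个前景点,这里每个词区域只保留一个峰值点,提示数量大幅减少。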

[CV-81] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds CVPR2026

【速读】:该论文旨在解决当前自监督学习(Self-Supervised Learning, SSL)方法在点云表示中缺乏实例感知能力(instance awareness)的问题,导致其在实例定位任务上的迁移性能较差,且通常需要全量微调才能获得较好效果。为实现真正支持下游多任务的3D基础模型(3D foundation models),关键在于增强点云表征对实例结构的敏感性。解决方案的核心是提出PointINS框架,通过引入一个正交偏移分支(orthogonal offset branch)联合学习高层语义理解与几何推理,从而显式建模实例级特征;同时设计两种互补的正则化策略:Offset Distribution Regularization (ODR) 用于对齐预测偏移与经验几何先验,Spatial Clustering Regularization (SCR) 则通过伪实例掩码约束局部一致性,有效提升实例分割和全景分割的鲁棒性与精度。

链接: https://arxiv.org/abs/2603.25165
作者: Bin Yang,Mohamed Abdelsamad,Miao Zhang,Alexandru Paul Condurache
机构: Bosch Research, Robert Bosch GmbH, Stuttgart, Germany; Institute for Neuro- and Bioinformatics, University of Lübeck, Lübeck, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper was accepted by CVPR2026

点击查看摘要

Abstract:Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.
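"偏移分支"让每个点预测指向其所属实例中心的偏移向量。下面给出该回归目标的最小示意(点云实例分割中常见的 center-offset 目标,代码为玩具实现,非论文代码):

```python
# 示意性草图:以实例质心为目标的偏移回归(center-offset)
def centroid(points):
    """点集 [(x, y, z), ...] 的质心。"""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def offset_targets(points):
    """每个点的偏移目标:指向实例质心的向量。"""
    cx, cy, cz = centroid(points)
    return [(cx - x, cy - y, cz - z) for (x, y, z) in points]

def offset_l1_loss(pred, target):
    """逐坐标 L1 回归损失(按点数平均)。"""
    total = sum(abs(p[i] - t[i]) for p, t in zip(pred, target) for i in range(3))
    return total / len(pred)

pts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
targets = offset_targets(pts)  # 所有偏移指向质心 (1, 2/3, 0)
```

论文的 ODR 与 SCR 正是在这种偏移预测之上分别施加分布先验与局部一致性约束。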

[CV-82] SportSkills: Physical Skill Learning from Sports Instructional Videos

【速读】:该论文旨在解决现有大规模视频数据集在物理技能学习(physical skill learning)领域中对细粒度动作理解覆盖不足的问题。传统视频数据集多聚焦于通用人类活动识别,难以支持对运动技巧的精细化建模与个性化指导。其解决方案的关键在于构建首个面向物理技能学习的大规模真实场景体育视频数据集SportSkills,该数据集包含超过360k条教学视频和630k个视觉演示片段,并配有详细解说文本,能够有效捕捉动作背后的“know-how”知识。此外,研究进一步提出基于错误条件的指令视频检索任务(mistake-conditioned instructional video retrieval),通过将表示学习与可操作反馈生成相结合,显著提升了模型根据用户执行情况推荐个性化教学视频的能力,经专业教练评估验证了其有效性。

链接: https://arxiv.org/abs/2603.25163
作者: Kumar Ashutosh,Chi Hsuan Wu,Kristen Grauman
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., “here’s my execution of a skill; which video clip should I watch to improve it?”). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.
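"错误条件检索"在形式上可视为嵌入空间中的最近邻搜索。以下为余弦相似度检索的示意草图(嵌入向量与片段均为虚构,非论文系统):

```python
# 示意性草图:给定用户动作执行的嵌入,检索最相似的教学片段
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query, clip_embeddings):
    """返回与 query 余弦相似度最高的片段下标。"""
    scores = [cosine(query, e) for e in clip_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# 虚构的三个教学片段嵌入
clips = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
```

实际任务中,查询嵌入需同时编码"用户的执行视频"与"其中的错误",这正是表示学习部分要解决的问题。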

[CV-83] A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection CVPR2026

【速读】:该论文旨在解决统一模型在3D异常检测中因类别间特征纠缠(Inter-Category Entanglement, ICE)导致的语义先验错误问题,即不同类别的潜在特征重叠使得模型在重建过程中引入错误语义信息,从而影响异常评分的可靠性。解决方案的关键在于提出一种语义解耦的统一模型(Semantically Disentangled Unified Model),其核心创新包括:(i) 从粗到细的全局Token化以构建实例级语义身份,(ii) 基于类别条件的对比学习实现类别语义解耦,以及(iii) 几何引导的解码器确保语义一致的重建。该方法显著提升了统一模型与特定类别模型的性能,在Real3D-AD和Anomaly-ShapeNet数据集上分别将物体级AUROC提升2.8%和9.1%,增强了3D异常检测的可靠性。

链接: https://arxiv.org/abs/2603.25159
作者: SuYeon Kim,Wongyu Lee,MyeongAh Cho
机构: Kyung Hee University (庆熙大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)-where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.

[CV-84] Vision Hopfield Memory Networks

【速读】:该论文旨在解决当前视觉基础模型(如基于Transformer或状态空间模型Mamba的架构)在计算原理上与人脑机制相去甚远、训练数据需求量大且可解释性差的问题。其解决方案的关键在于提出一种受大脑启发的视觉霍普菲尔德记忆网络(Vision Hopfield Memory Network, V-HMN),通过引入分层记忆机制实现对局部与全局动态的统一建模:具体包括图像块级别的局部霍普菲尔德模块(提供关联记忆动力学)、全局霍普菲尔德模块(作为情景记忆用于上下文调制),以及受预测编码启发的迭代误差修正规则,从而在保持高性能的同时显著提升模型的可解释性和数据效率。

链接: https://arxiv.org/abs/2603.25157
作者: Jianfeng Wang,Amine M’Charrak,Luk Koska,Xiangtao Wang,Daniel Petriceanu,Mykyta Smyrnov,Ruizhi Wang,Michael Bumbar,Luca Pinchetti,Thomas Lukasiewicz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
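现代 Hopfield 网络的检索更新规则(query 被替换为存储模式的 softmax 加权和)是此类联想记忆模块的基础。下面是该通用更新规则的纯 Python 示意(beta、迭代步数均为假设,并非论文模块的实现):

```python
# 示意性草图:现代 Hopfield 联想记忆的迭代检索
import math

def hopfield_retrieve(query, patterns, beta=8.0, steps=3):
    """用 softmax(beta * <q, p>) 加权的存储模式组合迭代更新 query。"""
    q = list(query)
    for _ in range(steps):
        scores = [beta * sum(x * p[i] for i, x in enumerate(q)) for p in patterns]
        m = max(scores)  # 减去最大值保证数值稳定
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        q = [sum(w * p[i] for w, p in zip(weights, patterns)) / z
             for i in range(len(q))]
    return q

stored = [[1.0, 0.0], [0.0, 1.0]]
out = hopfield_retrieve([0.8, 0.2], stored)  # 收敛到最接近的存储模式 [1, 0]
```

随 beta 增大,检索结果收敛到与 query 最接近的存储模式;"输出可追溯到具体的存储模式"也是这类记忆机制可解释性的来源之一。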

[CV-85] Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models ICLR2026

【速读】:该论文旨在解决3D医学影像在临床视觉问答任务中因计算成本高而导致的扩展难题,尤其是现有方法依赖2D切片或固定长度的token压缩所引发的体积连续性破坏和细微病灶信息丢失问题。其解决方案的关键在于提出Photon框架,通过引入指令条件驱动的token调度机制代理梯度传播策略,在训练和推理阶段自适应地减少冗余token,从而降低计算开销并缓解注意力稀释效应;同时设计了定制化的反向传播规则与梯度恢复机制,实现离散token删除下的可微优化,并结合正则化目标抑制语言主导偏差,提升视觉证据利用的可靠性与准确性。

链接: https://arxiv.org/abs/2603.25155
作者: Chengyu Fang,Heng Guo,Zheng Jiang,Chunming He,Xiu Li,Minfeng Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.
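"自适应减少 token"的调度效果可用一个按分数保留 top-k token 的玩具函数来示意(打分方式与保留比例均为假设;论文中的调度由指令条件驱动,并配合代理梯度实现可微训练):

```python
# 示意性草图:按分数保留 top-k 个 token(保持原有顺序)
import math

def schedule_tokens(tokens, scores, keep_ratio):
    """保留 ceil(keep_ratio * n) 个分数最高的 token,按原顺序返回。"""
    n = len(tokens)
    k = max(1, math.ceil(keep_ratio * n))
    keep = sorted(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [tokens[i] for i in keep]

toks = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
```

训练时对离散的"保留/丢弃"决策引入直通(straight-through)式代理梯度,是让此类调度可端到端优化的常见做法。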

[CV-86] Learning to Rank Caption Chains for Video-Text Alignment

【速读】:该论文旨在解决直接偏好优化(Direct Preference Optimization, DPO)在视觉语言模型(Vision-Language Models, VLMs)中应用时存在的局限性问题,即其二元“胜者通吃”策略无法充分考虑响应对视觉内容的忠实度,尤其在长视频文本对齐任务中,即使某响应非最优,仍可能保持较高的视觉一致性。为此,作者提出采用排序优化(Ranking Optimization)作为替代方案,其关键在于通过重复降级生成具有挑战性的、完全有序的细粒度视频描述链(caption chains),从而更精确地建模响应与视觉输入之间的忠实关系;实验表明,该方法在长格式内容生成与评估中优于传统DPO,并且强调了对视觉编码器进行微调的重要性,挑战了DPO仅为语言重加权过程的传统认知。

链接: https://arxiv.org/abs/2603.25145
作者: Ansel Blume,Burak Uzkent,Shalini Chaudhuri,Garin Kessler
机构: Amazon Prime Video(亚马逊Prime视频)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary “winner-takes-all” approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the “losing” response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses’ faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
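与二元 DPO 不同,listwise 排序目标可以同时利用整条有序描述链。下面以标准的 Plackett-Luce 负对数似然为例给出示意(这是学习排序中的经典损失,仅用于说明思路,并非论文的具体目标函数):

```python
# 示意性草图:对"从优到劣"排序的 Plackett-Luce 负对数似然
import math

def plackett_luce_nll(scores_best_to_worst):
    """给定按目标顺序(最优在前)排列的模型打分,返回该顺序的 NLL。"""
    nll = 0.0
    for i in range(len(scores_best_to_worst) - 1):
        tail = scores_best_to_worst[i:]
        m = max(tail)  # log-sum-exp 数值稳定化
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        nll += log_z - tail[0]
    return nll

good_order = plackett_luce_nll([3.0, 2.0, 1.0])  # 模型打分与描述链一致
bad_order = plackett_luce_nll([1.0, 2.0, 3.0])   # 模型打分与描述链相反
```

当模型打分顺序与描述链一致时损失更小;二元 DPO 只比较一对"胜/负",而该目标同时约束链上所有相对次序。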

[CV-87] FD2: A Dedicated Framework for Fine-Grained Dataset Distillation

【速读】:该论文旨在解决细粒度(fine-grained)数据集在去耦合数据蒸馏(decoupled Dataset Distillation, DD)中所面临的核心问题:现有方法依赖粗粒度类别标签监督,导致蒸馏样本在类内保留较大差异、类间区分度不足,且同类样本过于相似,削弱了局部判别性特征,从而影响识别性能。解决方案的关键在于提出FD²框架,其核心创新包括:在预训练阶段采用反事实注意力学习(counterfactual attention learning)聚合判别性特征以更新类原型;在蒸馏阶段引入细粒度特征约束(fine-grained characteristic constraint)使每个样本与其类原型对齐并排斥其他类原型,同时通过相似性约束(similarity constraint)促进同类别样本间的注意力分布多样性,从而增强局部判别力与类间可分性。

链接: https://arxiv.org/abs/2603.25144
作者: Hongxu Ma,Guang Li,Shijie Wang,Dongzhan Zhou,Baoli Sun,Takahiro Ogawa,Miki Haseyama,Zhihui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD^2, a dedicated framework for Fine-grained Dataset Distillation. FD^2 localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD^2 integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.
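"与自身类原型对齐、与其他原型相斥"的细粒度特征约束,可用一个 triplet 风格的玩具损失来示意(margin 等均为假设,并非论文公式):

```python
# 示意性草图(自拟玩具损失):拉近自身原型、推远最近的其他原型
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def proto_loss(embedding, own_proto, other_protos, margin=1.0):
    """若到自身原型的距离未比到最近其他原型的距离小出 margin,则产生损失。"""
    pull = sqdist(embedding, own_proto)
    push = min(sqdist(embedding, p) for p in other_protos)
    return max(0.0, pull - push + margin)

emb = [0.9, 0.1]                 # 靠近自身原型的样本:损失为 0
own = [1.0, 0.0]
others = [[0.0, 1.0], [-1.0, 0.0]]
```

该形式与度量学习中的 triplet margin 损失同构,仅用于直观说明"拉-推"约束的结构。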

[CV-88] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

【速读】:该论文旨在解决多模态深度伪造(multimodal deepfakes)检测中因训练数据依赖合成伪造样本而导致的泛化能力不足问题,特别是当检测器在特定数据集上训练时易引入数据集和生成器偏差(dataset and generator bias),从而限制其对未见篡改手法的鲁棒性。解决方案的关键在于提出一种完全基于真实视频的自监督音频-视觉(audio-visual, AV)检测框架 SAVe:它通过在线生成保持身份一致性的、区域感知的伪混合篡改(region-aware self-blended pseudo-manipulations),模拟伪造痕迹以学习跨面部粒度的互补视觉线索;同时,借助音频-视觉对齐组件建模唇音同步关系,识别音频-视觉伪造中的时间错位模式,从而实现跨数据集的良好泛化性能。

链接: https://arxiv.org/abs/2603.25140
作者: Sahibzada Adil Shahzad,Ammarah Hashmi,Junichi Yamagishi,Yusuke Yasuda,Yu Tsao,Chia-Wen Lin,Yan-Tsung Peng,Hsin-Min Wang
机构: Institute of Information Science, Academia Sinica(中央研究院资讯科学研究所); National Chengchi University(国立政治大学); National Tsing Hua University(国立清华大学); National Institute of Informatics(日本国立情报学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
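"自混合伪篡改"的核心操作是把同一身份另一视图的局部区域按掩码混合进原帧。以下是区域 alpha 混合的最小示意(掩码形状与混合系数为玩具设定,并非论文的完整流水线):

```python
# 示意性草图:按掩码将 donor 帧的局部区域 alpha 混合进 base 帧
def self_blend(base, donor, mask, alpha=0.7):
    """mask == 1 处按 alpha 混合 donor;所有输入均为 HxW 网格。"""
    out = []
    for r in range(len(base)):
        row = []
        for c in range(len(base[0])):
            if mask[r][c]:
                row.append(alpha * donor[r][c] + (1 - alpha) * base[r][c])
            else:
                row.append(base[r][c])
        out.append(row)
    return out

base = [[0.0, 0.0], [0.0, 0.0]]
donor = [[1.0, 1.0], [1.0, 1.0]]
mask = [[1, 0], [0, 0]]
blended = self_blend(base, donor, mask)
```

这类在线生成的伪样本只依赖真实视频,因而避免了对特定伪造生成器分布的依赖。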

[CV-89] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions CVPR2026

【速读】:该论文旨在解决当前6D物体位姿估计模型在真实世界第一人称视角(egocentric view)场景中泛化能力不足的问题,尤其是面对严重运动模糊、动态光照变化和视觉遮挡等极端条件时性能显著下降。其解决方案的关键在于构建了一个全新的大规模6D位姿估计数据集EgoXtreme,该数据集完全从第一人称视角采集,并设计了工业维护、体育运动和应急救援三种高挑战性场景,引入极端光照、剧烈运动模糊和烟雾干扰等感知模糊因素。实验表明,现有通用位姿估计算法在EgoXtreme上表现不佳,尤其在低光条件下;单纯图像恢复(如去模糊)也无法提升性能,而基于时序信息的跟踪方法则显示出一定优势,验证了时间建模对高速运动场景的重要性。因此,EgoXtreme成为推动下一代鲁棒第一人称视觉位姿估计模型发展的关键资源。

链接: https://arxiv.org/abs/2603.25135
作者: Taegyoon Yoon,Yegyu Han,Seojin Ji,Jaewoo Park,Sojeong Kim,Taein Kwon,Hyung-Sin Kim
机构: Seoul National University (首尔国立大学); VGG, University of Oxford (牛津大学视觉几何组)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Camera ready version for CVPR 2026, appendix included

点击查看摘要

Abstract:Smart glass is emerging as a useful device since it provides plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. While performance gain has appeared in tracking-based approach, implying using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at this https URL

[CV-90] Robust Principal Component Completion

【速读】:该论文旨在解决传统鲁棒主成分分析(Robust Principal Component Analysis, RPCA)在处理实际应用中稀疏前景(sparse foreground)与低秩背景(low-rank background)关系时存在的不匹配问题,即现实中稀疏前景往往直接替换或遮挡低秩背景的元素,而非简单叠加。为克服这一局限,作者提出了一种新的框架——鲁棒主成分补全(Robust Principal Component Completion, RPCC),其核心创新在于通过变分贝叶斯推断(variational Bayesian inference)对一个完全概率化的贝叶斯稀疏张量分解模型进行求解,从而间接识别稀疏成分的支持集(support),并实现收敛到硬分类器(hard classifier),避免了以往RPCA方法所需的后处理阈值设定步骤。此方案在合成数据上获得近最优估计,在真实彩色视频和高光谱数据集中分别实现了鲁棒的前景提取与异常检测性能。

链接: https://arxiv.org/abs/2603.25132
作者: Yinjian Wang,Wei Li,Yuanyuan Gui,James E. Fowler,Gemine Vivone
机构: Beijing Institute of Technology (北京理工大学); National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (空间智能信息处理科学技术国家重点实验室); Mississippi State University (密西西比州立大学); National Research Council, Institute of Methodologies for Environmental Analysis (CNR-IMAA) (国家研究委员会环境分析方法研究所); National Biodiversity Future Center (NBFC) (国家生物多样性未来中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at this https URL.

[CV-91] Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation CVPR26

【速读】:该论文旨在解决全景语义分割(Panoramic Semantic Segmentation)在真实应用场景中面临的两大挑战:一是全景投影带来的严重几何失真,二是密集标注成本高昂。尤其在源域数据不可获取的“无源域自适应”(Source-Free Unsupervised Domain Adaptation, SFUDA)场景下,传统方法因域偏移导致伪标签不可靠,性能显著下降,尤其是对少数类别的分割效果恶化。解决方案的关键在于提出DAPASS框架,其核心创新为两个协同模块:一是全景置信度引导去噪模块(PCGD),通过扰动一致性约束和邻域置信度过滤机制生成高质量、类别平衡的伪标签;二是上下文分辨率对抗模块(CRAM),通过对抗性对齐高分辨率细节与低分辨率全局语义来缓解尺度差异和几何畸变问题,从而实现无需源数据的鲁棒知识迁移。

链接: https://arxiv.org/abs/2603.25131
作者: Yaowen Chang,Zhen Cao,Xu Zheng,Xiaoxin Mi,Zhen Dong
机构: Wuhan University (武汉大学); HKUST (Guangzhou) (香港科技大学(广州)); Wuhan University of Technology (武汉理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR26

点击查看摘要

Abstract:Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performances on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.
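PCGD 的"置信度 + 邻域一致性"过滤思想,可用如下简化规则示意(我们自拟的玩具判据,并非论文的具体实现):像素伪标签仅在置信度超过阈值、且 4 邻域多数同类时保留。

```python
# 示意性草图(自拟判据):基于置信度与邻域多数投票的伪标签去噪
def denoise(labels, conf, tau=0.8):
    """labels/conf 为 HxW 网格;被滤除的像素标为 -1(ignore)。"""
    H, W = len(labels), len(labels[0])
    IGNORE = -1
    out = [[IGNORE] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            if conf[r][c] < tau:
                continue
            nbrs = [labels[r + dr][c + dc]
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= r + dr < H and 0 <= c + dc < W]
            agree = sum(1 for n in nbrs if n == labels[r][c])
            if 2 * agree >= len(nbrs):  # 多数同意(平票视为同意)
                out[r][c] = labels[r][c]
    return out

labels = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
conf = [[0.9] * 3 for _ in range(3)]
cleaned = denoise(labels, conf)  # 中心的孤立噪声标签被滤除
```

被过滤的像素标记为 ignore(-1),不参与后续的自训练损失。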

[CV-92] AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

【速读】:该论文旨在解决3D视觉基础模型(3D Vision Foundation Models, 3DVFMs)在通用新视角合成(Novel View Synthesis, NVS)任务中难以直接应用的问题,尤其是在保持高保真度和无需相机位姿信息(pose-free)的前提下实现鲁棒的视图合成。其核心挑战在于3DVFMs虽具备强大的几何先验能力,但与NVS所需的像素级一致性及遮挡建模存在位姿-几何不一致问题。解决方案的关键在于提出AirSplat训练框架,包含两项核心技术:(1) 自洽位姿对齐(Self-Consistent Pose Alignment, SCPA),通过训练阶段的反馈回路确保像素级监督以缓解位姿与几何间的偏差;(2) 基于评分的不透明度匹配(Rating-based Opacity Matching, ROM),利用稀疏视图NVS教师模型提供的局部3D几何一致性知识来过滤劣质点云片段(primitives)。实验表明,该方法显著优于现有无位姿约束的NVS方法,在大规模基准上实现了更高质量的重建效果。

链接: https://arxiv.org/abs/2603.25129
作者: Minh-Quan Viet Bui,Jaeho Moon,Munchurl Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis. 

[CV-93] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization CVPR2026

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在文档生成任务中面临的多样性不足、数据规模有限以及内容溢出(content overflow)等问题。针对这些问题,其解决方案的关键在于提出 AnyDoc 框架:首先构建一个可扩展的数据合成流水线,自动生成涵盖 111 类文档和 32 种样式的 HTML/CSS 格式文档数据集 DocHTML(共 265,206 样本),并附带完整元数据;其次基于该数据集微调多模态大语言模型(MLLMs),实现意图到文档(intention-to-document)、文档反渲染(document derendering)和元素到文档(element-to-document)三项任务;最后引入高度感知强化学习(height-aware reinforcement learning, HARL)后训练机制,通过定义基于预测与目标文档高度差异的奖励函数来惩罚内容溢出,从而显著提升生成质量与稳定性。

链接: https://arxiv.org/abs/2603.25118
作者: Jiawei Lin,Wanrong Zhu,Vlad I Morariu,Christopher Tensmeyer
机构: Xi’an Jiaotong University (西安交通大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Main Conference

点击查看摘要

Abstract:Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
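摘要中的 HARL 以"预测文档高度与目标高度之差"定义奖励来惩罚内容溢出。论文未公开具体公式,下面给出一个最小示意(溢出权重 `overflow_weight` 以及不对称惩罚形式均为假设):

```python
def height_reward(pred_height: float, target_height: float,
                  overflow_weight: float = 2.0) -> float:
    """Toy height-aware reward: penalize relative height error, with a
    heavier penalty when the rendered document overflows (is taller than
    the target). The paper's exact reward is not published; this form
    and the weight value are assumptions for illustration only."""
    rel_err = (pred_height - target_height) / target_height
    if rel_err > 0:                    # overflow: rendered document too tall
        return -overflow_weight * rel_err
    return rel_err                     # underflow penalized linearly (rel_err <= 0)
```

在 RL 训练中,该奖励会随溢出程度线性下降,从而逐步抑制过高的生成结果。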

[CV-94] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

【速读】:该论文旨在解决图像分类模型在面对分布外数据(如噪声、风格变化或对抗攻击)时鲁棒性不足的问题,同时避免传统数据增强方法中依赖复杂生成模型(如扩散模型)或外部数据集所带来的高计算开销和存储负担。其解决方案的关键在于提出一种基于解析干涉图案(analytic interference patterns)的轻量级程序化增强方法——通过数学闭式公式实时生成莫尔纹(Moire)纹理作为结构化扰动,这些扰动覆盖广泛的时空频率范围,并在训练过程中直接嵌入到图像中后立即丢弃,无需额外存储或外部数据。该方法仅需0.0026秒/图像的极低计算成本,即可显著提升Vision Transformer在ImageNet-C、ImageNet-R及对抗基准上的鲁棒性表现,验证了分析型干涉模式作为数据驱动生成增强替代方案的有效性与实用性。

链接: https://arxiv.org/abs/2603.25109
作者: Yuto Matsuo,Yoshihiro Fukuhara,Yuki M. Asano,Rintaro Yanagi,Hirokatsu Kataoka,Akio Nakamura
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
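摘要提到莫尔纹可由闭式公式实时合成后与训练图像混合。一个标准构造是让两条频率/方向略有差异的正弦光栅相乘产生低频干涉条纹(论文的具体公式未公开,此处所有参数均为示意假设):

```python
import numpy as np

def moire_texture(h, w, f1=0.10, f2=0.11, angle_deg=4.0, seed=0):
    """Generate a Moire-like texture as the product of two sinusoidal
    gratings with slightly different frequencies and orientations.
    The paper's closed-form formula is not published; this is one
    standard interference construction, returned in [0, 1]."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0, 2 * np.pi)
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    g1 = np.sin(2 * np.pi * f1 * x + phase)
    t = np.deg2rad(angle_deg)
    g2 = np.sin(2 * np.pi * f2 * (x * np.cos(t) + y * np.sin(t)))
    return (g1 * g2 + 1.0) / 2.0       # product of gratings -> beat pattern

def moire_mix(image, alpha=0.2, **kw):
    """Blend a freshly generated texture into an (H, W) grayscale image
    in [0, 1]; the pattern is discarded after use (storage-free)."""
    pat = moire_texture(*image.shape, **kw)
    return (1.0 - alpha) * image + alpha * pat
```

由于纹理按需生成、用后即弃,整个流程不需要外部数据或额外存储,这与摘要描述的 storage-free 管线一致。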

[CV-95] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning CVPR2026

【速读】:该论文旨在解决多模态奖励模型(Multimodal Reward Models, MRMs)在强化学习训练中对昂贵且稀缺的多模态偏好数据的高度依赖问题,从而限制了模型的可扩展性。其解决方案的关键在于提出一种多阶段强化学习(Multi-Stage Reinforcement Learning, MSRL)框架:首先利用大规模文本偏好数据学习通用的奖励推理能力,随后通过基于图像描述(caption-based)和完全多模态的强化学习阶段逐步迁移该能力至多模态任务;同时引入跨模态知识蒸馏方法提升偏好泛化能力。该方法显著提升了MRMs在视觉理解与生成任务上的性能,且无需额外标注多模态偏好数据。

链接: https://arxiv.org/abs/2603.25108
作者: Chenglong Wang,Yifu Huo,Yang Gan,Qiaozhi He,Qi Meng,Bei Li,Yan Wang,Junfu Liu,Tianhua Zhou,Jingbo Zhu,Tong Xiao
机构: Northeastern University (东北大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: this https URL.

[CV-96] Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning

【速读】:该论文旨在解决多模态主动学习(Multimodal Active Learning, MAL)中因模态重要性在训练过程中动态变化而导致的性能瓶颈问题。现有方法通常假设模态权重固定,忽略了不同训练轮次下模态贡献和样本难度的时变特性,从而难以有效平衡模态公平性与分类准确性。其解决方案的关键在于提出一种基于强化学习的框架RL-MBA,通过两个核心组件实现动态适应:(1) 自适应模态贡献平衡(Adaptive Modality Contribution Balancing, AMCB),利用强化学习反馈动态调整各模态权重;(2) 基于证据融合的难度感知策略调整(Evidential Fusion for Difficulty-Aware Policy Adjustment, EFDA),通过不确定性驱动的证据融合机制估计样本难度并优先选择高信息量样本,从而在有限标注预算下提升模型性能与模态公平性。

链接: https://arxiv.org/abs/2603.25107
作者: Yuqiao Zeng,Xu Wang,Tengfei Liang,Yiqing Hao,Yi Jin,Hui Yu
机构: Beijing Jiaotong University (北京交通大学); University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.

[CV-97] Pixelis: Reasoning in Pixels from Seeing to Acting

【速读】:该论文旨在解决当前视觉-语言系统(Vision-Language Systems)普遍存在的静态观察者局限性问题,即这些系统仅能描述像素而无法采取行动,且在分布偏移(distributional shift)下难以安全地持续改进,从而限制了物理世界中可泛化的、具身的视觉智能。解决方案的关键在于提出Pixelis——一个直接在像素空间中操作的代理(agent),它通过一组紧凑的可执行操作(如缩放/裁剪、分割、追踪、光学字符识别OCR、时间定位)与环境交互,并从行为后果中学习。其训练分为三阶段:(1) 监督微调利用链式思维-动作(Chain-of-Thought-Action)轨迹,采用掩码模仿损失并加权操作/参数标记以稳定像素对齐的参数;(2) 好奇心-一致性奖励微调结合预测误差好奇心与相邻步骤一致性,辅以KL锚定的效率先验,生成短且结构合理的工具链;(3) 像素测试时强化学习(Pixel Test-Time RL)实现无标签适应,通过邻居检索、轨迹投票和KL到EMA的安全控制约束漂移,确保长期稳定性。该方法使多模态感知扎根于物理世界,实现了无需外部反馈的具身适应能力。

链接: https://arxiv.org/abs/2603.25091
作者: Yunpeng Zhou
机构: University of Reading (雷丁大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28pages, 16figures, 18tables

点击查看摘要

Abstract:Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
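摘要描述的"双驱动目标"将预测误差好奇心、相邻步一致性与一个温和的效率先验组合起来。下面用线性组合做一个极简示意(权重数值与具体形式均为假设,KL 锚定项不在此列):

```python
def dual_drive_reward(curiosity: float, coherence: float, n_steps: int,
                      w_cur: float = 1.0, w_coh: float = 1.0,
                      w_eff: float = 0.05) -> float:
    """Sketch of the dual-drive objective described in the abstract:
    prediction-error curiosity plus adjacent-step coherence, minus a
    mild efficiency prior on toolchain length. The linear form and all
    weight values are assumptions; the KL anchor is omitted."""
    return w_cur * curiosity + w_coh * coherence - w_eff * n_steps
```

效率项使更短的工具链在其他条件相同时获得更高奖励,对应摘要中"short, valid, structured toolchains"的倾向。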

[CV-98] THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics ICLR2026

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实世界学术欺诈场景中视觉欺诈推理能力评估不足的问题。现有基准测试在场景真实性、欺诈类型多样性与粒度、以及多维能力解析方面存在明显局限,难以全面衡量模型对复杂视觉欺诈的识别与推理能力。解决方案的关键在于提出THEMIS这一新型多任务基准,其核心创新包括:(1) 构建源自真实撤稿论文案例和精心设计的多模态合成数据的7类真实场景,其中60.47%为高复杂度纹理图像,显著提升场景逼真度;(2) 系统覆盖5类挑战性欺诈类型并引入16种细粒度篡改操作,每样本平均叠加多个操作,大幅提高任务难度;(3) 建立欺诈类型到5项核心视觉欺诈推理能力的映射关系,实现对模型能力的多维度解构式评估。实验表明,即便最强模型GPT-5在该基准上性能也仅为56.15%,验证了THEMIS作为严苛测试标准的有效性。

链接: https://arxiv.org/abs/2603.25089
作者: Tzu-Yen Ma,Bo Zhang,Zichen Tang,Junpeng Ding,Haolin Tian,Yuanze Li,Zhuodi Hao,Zixin Ding,Zirui Wang,Xinyu Yu,Shiyao Peng,Yizhuo Zhao,Ruomeng Jiang,Yiling Huang,Peizhi Zhao,Jiayuan Chen,Weisheng Tan,Haocheng Gao,Yang Liu,Jiacheng Liu,Zhongjun Yang,Jiayu Huang,Haihong E
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

[CV-99] Visual Attention Drifts but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的对象幻觉(object hallucination)问题,特别是由于深层注意力机制在推理过程中回归到早期层的视觉噪声所导致的不可靠输出。其解决方案的关键在于提出一种无需训练的跨层视觉锚点(Cross-Layer Visual Anchors, CLVA)方法:通过识别并强化中间层中关键的视觉特征作为“锚点”,抑制深层注意力向初始噪声的漂移,从而引导深层注意力聚焦于正确的视觉区域,提升模型输出的可靠性。

链接: https://arxiv.org/abs/2603.25088
作者: Chengxu Yang,Jingling Yuan,Chuang Hu,Jiawei Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA, which stands for Cross-Layer Visual Anchors, a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without significant increase in computational time and GPU memory.

[CV-100] Learning domain-invariant features through channel-level sparsification for Out-of-Distribution Generalization

【速读】:该论文旨在解决深度学习模型在分布外(Out-of-Distribution, OOD)场景下泛化能力差的问题,其根源在于模型倾向于学习域特定的非因果特征(spurious features),从而形成捷径依赖(shortcut dependencies),导致跨数据源性能不稳定。解决方案的关键在于提出分层因果丢弃(Hierarchical Causal Dropout, HCD),通过通道级因果掩码(channel-level causal masks)强制特征稀疏性,实现因果特征与伪相关特征的分离,并在表示层面执行因果干预;同时,利用基于矩阵的互信息(Matrix-based Mutual Information, MMI)目标最小化潜在特征与域标签之间的互信息,同时最大化其与类别标签的信息共享,辅以StyleMix驱动的VICReg模块增强训练稳定性,防止关键因果特征被误过滤。

链接: https://arxiv.org/abs/2603.25083
作者: Haoran Pei,Yuguang Yang,Kexin Liu,Juan Zhang,Baochang Zhang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Out-of-Distribution (OOD) generalization has become a primary metric for evaluating image analysis systems. Since deep learning models tend to capture domain-specific context, they often develop shortcut dependencies on these non-causal features, leading to inconsistent performance across different data sources. Current techniques, such as invariance learning, attempt to mitigate this. However, they struggle to isolate highly mixed features within deep latent spaces. This limitation prevents them from fully resolving the shortcut learning problem. In this paper, we propose Hierarchical Causal Dropout (HCD), a method that uses channel-level causal masks to enforce feature sparsity. This approach allows the model to separate causal features from spurious ones, effectively performing a causal intervention at the representation level. The training is guided by a Matrix-based Mutual Information (MMI) objective to minimize the mutual information between latent features and domain labels, while simultaneously maximizing the information shared with class labels. To ensure stability, we incorporate a StyleMix-driven VICReg module, which prevents the masks from accidentally filtering out essential causal data. Experimental results on OOD benchmarks show that HCD performs better than existing top-tier methods.
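摘要中的通道级因果掩码可以理解为:按(学习得到的)因果相关性分数保留 top-k 通道、置零其余通道,以强制特征稀疏。以下仅示意掩码这一步;分数如何学习、掩码如何与 MMI 目标联合训练不在此列:

```python
import numpy as np

def channel_causal_mask(features, channel_scores, keep_ratio=0.5):
    """Apply a hard channel-level mask: keep the top-k channels ranked
    by a (learned) causal-relevance score and zero out the rest. This
    sketches only the masking step; in HCD the scores themselves are
    learned, which is not reproduced here.
    features: (N, C, H, W); channel_scores: (C,)."""
    C = features.shape[1]
    k = max(1, int(round(C * keep_ratio)))
    keep = np.argsort(channel_scores)[-k:]     # indices of top-k channels
    mask = np.zeros(C, dtype=features.dtype)
    mask[keep] = 1.0
    return features * mask[None, :, None, None]
```

硬掩码在表示层面切断低分通道的梯度与前向贡献,相当于对这些通道做一次"干预"。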

[CV-101] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs

【速读】:该论文旨在解决将可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)扩展至多模态大语言模型(Multimodal Large Language Models, MLLMs)时所面临的根本性挑战:MLLMs的输出序列中感知相关token(用于视觉内容锚定)与推理相关token(用于构建推理链)天然交织,二者分别体现视觉锚定(visual grounding)和符号推理(symbolic reasoning)能力,且存在内在耦合关系,导致仅优化单一类型token会显著限制整体性能。解决方案的关键在于提出一种即插即用的Token-Reweighting(ToR)策略,通过识别两类关键token并动态调整其权重,在RLVR训练过程中显式建模二者间的依赖关系,从而实现对视觉锚定与符号推理能力的协同优化,最终在多个多模态推理基准上取得一致性的性能提升,并达到当前最优水平。

链接: https://arxiv.org/abs/2603.25077
作者: Jinda Lu,Junkang Wu,Jinghan Li,Kexin Huang,Shuo Yang,Guoyin Wang,Jiancan Wu,Xiang Wang,Xiangnan He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities – visual grounding and symbolic reasoning – making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
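ToR 的核心是识别感知/推理两类关键 token 并在训练损失中动态加权。下面用"逐 token 损失的加权平均"做一个最小示意(token 如何被判定为关键、权重的具体数值,论文均未在摘要中给出,此处为假设):

```python
import numpy as np

def reweight_token_loss(token_losses, token_types,
                        w_perception=1.5, w_reasoning=1.5, w_other=1.0):
    """Weighted mean of per-token losses: upweight tokens flagged as
    perception- or reasoning-critical. How ToR identifies critical
    tokens and its actual weight values are assumptions here.
    token_losses: (T,) float; token_types: (T,) strings in
    {'perception', 'reasoning', 'other'}."""
    weights = np.where(token_types == 'perception', w_perception,
              np.where(token_types == 'reasoning', w_reasoning, w_other))
    return float((weights * token_losses).sum() / weights.sum())
```

由于两类关键 token 同时被上调权重,该目标不会偏向单独优化其中一类,这与摘要中"孤立优化不足"的发现一致。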

[CV-102] Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

【速读】:该论文旨在解决单流扩散Transformer架构(single-stream diffusion transformers)中概念擦除(concept erasure)任务的稳定性问题,此类模型将文本和图像标记作为统一序列处理,直接应用传统基于U-Net或双流架构的方法常导致生成崩溃(generation collapse)。其解决方案的关键在于提出Z-Erase,这是首个专为单流T2I模型设计的概念擦除方法;核心创新包括:1)Stream Disentangled Concept Erasure Framework,通过解耦更新机制使现有擦除方法适用于单流模型;2)Lagrangian-Guided Adaptive Erasure Modulation,一种约束优化算法,在擦除强度与保留质量之间实现动态平衡。此外,论文还提供了严格的收敛性分析,证明Z-Erase可收敛至Pareto平稳点,实验验证其在多种任务中均能有效避免生成崩溃并达到当前最优性能。

链接: https://arxiv.org/abs/2603.25074
作者: Nanxiang Jiang,Zhaoxin Fan,Baisen Wang,Daiheng Gao,Junhang Cheng,Jifeng Guo,Yalan Qin,Yeying Jin,Hongwei Zheng,Faguo Wu,Wenjun Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.
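摘要中的 Lagrangian 引导约束优化通常对应标准的投影对偶上升配方:最小化擦除损失,同时约束保留损失不超过阈值。以下为该通用配方的示意,并非论文的精确算法(其自适应调制细节此处为假设省略):

```python
def lagrangian_step(lam, erase_loss, preserve_loss, eps, lr_dual=0.1):
    """One projected dual-ascent step for the constrained objective
        min erase_loss  s.t.  preserve_loss <= eps.
    Returns the penalized loss to backpropagate and the updated
    multiplier. This is the generic recipe, not Z-Erase's exact
    modulation algorithm."""
    total = erase_loss + lam * (preserve_loss - eps)
    lam_next = max(0.0, lam + lr_dual * (preserve_loss - eps))  # project to lam >= 0
    return total, lam_next
```

乘子在约束被违反时增大、满足时缩小,从而自动平衡擦除强度与保留质量之间的权衡。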

[CV-103] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

【速读】:该论文旨在解决视频大语言模型(Video Large Language Models, VLMs)在处理长视频时因密集帧计算成本过高而导致的实用性受限问题。现有方法通过关键帧选择来缓解这一问题,但其贪婪决策机制与相关性与多样性解耦评估容易陷入局部最优,并误选无关噪声帧。解决方案的关键在于提出GIFT(Global Irreplaceability Frame Targeting)框架,其核心是基于帧的内在不可替代性进行选择:首先引入“定向多样性”(Directed Diversity)量化在相关性条件下的帧独特性,从而构建统一的不可替代性评分;其次采用“预算感知精炼”(Budget-Aware Refinement)策略,通过自适应迭代过程优先锁定高不可替代性帧作为核心集合,并随预算增加逐步强化这些帧周围的时间上下文建模。

链接: https://arxiv.org/abs/2603.25072
作者: Junpeng Ma,Sashuai Zhou,Guanghao Li,Xin Gao,Yue Cao,Hengyu Zeng,Yuxiang Yan,Zhibin Wang,Jun Song,Bo Zheng,Shanghang Zhang,Jian Pu
机构: Fudan University (复旦大学); Peking University (北京大学); Zhejiang University (浙江大学); Alibaba Group Holding Limited (阿里巴巴集团); Future Living Lab of Alibaba (阿里巴巴未来生活实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame’s uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
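对摘要中"以相关性为条件的定向多样性"的一种可能读法:某帧的冗余度取决于它与所有"更相关的帧"之间的最大相似度,不可替代性再将这一独特性与自身相关性耦合。论文的准确公式可能不同,以下纯为示意:

```python
import numpy as np

def irreplaceability(relevance, similarity):
    """One plausible reading of GIFT's unified score (an assumption,
    not the paper's published formula): a frame's directed diversity
    is how dissimilar it is from every *more relevant* frame, and the
    score couples that uniqueness with the frame's own relevance.
    relevance: (N,); similarity: (N, N) with values in [0, 1]."""
    n = len(relevance)
    scores = np.empty(n)
    for i in range(n):
        stronger = relevance > relevance[i]            # frames that dominate i
        redundancy = similarity[i, stronger].max() if stronger.any() else 0.0
        scores[i] = relevance[i] * (1.0 - redundancy)  # relevance-conditioned uniqueness
    return scores
```

在这一读法下,与更相关帧高度相似的帧得分接近零,相关性和多样性不再被分开打分,避免了摘要批评的"解耦评估"。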

[CV-104] Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos CVPR2026

【速读】:该论文旨在解决单目视频中动态场景的高质量高斯点渲染问题,特别是如何精确建模动态高斯点在连续时间内的位置与姿态变化,以实现更逼真的新视角合成。其解决方案的关键在于提出了一种基于SE(3) B样条运动基函数(SE(3) B-spline motion bases)的方法,显式地对动态高斯点的位置和方向变形进行参数化建模,并引入自适应控制机制动态调整运动基函数的数量和控制点密度,从而在保持计算效率的同时增强复杂运动的表达能力;此外,通过软分割重建策略缓解长间隔运动干扰,并利用多视角扩散模型提供跨视角约束,有效防止训练视图过拟合,显著提升了新视角合成的质量与鲁棒性。

链接: https://arxiv.org/abs/2603.25058
作者: Xuankai Zhang,Junjin Xiao,Shangwei Huang,Wei-shi Zheng,Qing Zhang
机构: Sun Yat-sen University (中山大学); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education (教育部机器智能与先进计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we in this work go one step further beyond previous methods to explicitly model continuous position and orientation deformation of dynamic Gaussians, using an SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at this https URL.
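摘要中的 SE(3) B 样条运动基可以先从其平移部分理解:R^3 上的均匀三次 B 样条由 4 个相邻控制点混合而成。完整方法还需要在 SE(3) 上同时混合旋转,此处从略:

```python
import numpy as np

def cubic_bspline_point(ctrl, u):
    """Evaluate a uniform cubic B-spline segment at local parameter
    u in [0, 1] from 4 consecutive control points ctrl of shape (4, D).
    This sketches only the translational (R^3) part of the paper's
    SE(3) formulation; blending rotations requires additional care."""
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    b3 = u**3 / 6.0                     # the four bases sum to 1 for all u
    return b0 * ctrl[0] + b1 * ctrl[1] + b2 * ctrl[2] + b3 * ctrl[3]
```

基函数处处非负且和为 1,因此轨迹是控制点的凸组合,天然连续平滑;增删控制点即可像摘要描述的那样自适应调节运动表达能力。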

[CV-105] Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics

【速读】:该论文旨在解决高能推进剂燃烧过程中实时监测的难题,尤其针对极端高动态范围(HDR)、微秒级粒子运动以及浓烟干扰等复杂工况下传统成像技术易出现饱和、运动模糊和粒子提取不稳定的问题。其解决方案的关键在于提出了一种闭环事件-空间可变曝光(Event–SVE)测量系统,该系统将空间可变曝光(SVE)相机与一对类脑事件相机(neuromorphic event cameras)耦合使用:SVE分支通过烟雾感知融合策略生成HDR图像,并利用多线索烟雾可能性图分离粒子发射与烟雾散射信号,从而获得可用于下游分析的校准强度图;同时,这些HDR图提供了事件相机缺失的绝对强度参考,用于抑制烟雾驱动的事件伪影并提升粒子状态判别能力;基于清理后的事件观测,结合立体事件驱动的三维重建流程,实现分离高度与等效粒子尺寸的精确估计(最大校准误差0.56%),在硼基推进剂实验中成功捕捉到难以用传统传感器观测的快速分离瞬态过程。

链接: https://arxiv.org/abs/2603.25054
作者: Jing Tao,Taihang Lei,Banglei Guan,Ying Qu,Xudong Na,Likun Ma,Yang Shang,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation (湖南省图像测量与视觉导航重点实验室); Hypersonic Technology Laboratory (高超声速技术实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event–SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.
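摘要中"特征提取与三角测量"的三角测量步骤,通常采用教科书式的 DLT 线性三角化从双目观测恢复三维点;论文的标定流程可能在此基础上进一步精化:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two pixel
    observations x1, x2 and 3x4 projection matrices P1, P2. This is
    the textbook building block such a stereo pipeline rests on, not
    the paper's full calibrated procedure."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)          # null vector of A is the homogeneous point
    X = vt[-1]
    return X[:3] / X[3]                  # dehomogenize
```

恢复出粒子三维位置后,分离高度即可由其相对燃面的位移直接读出。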

[CV-106] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator CVPR2026

【速读】:该论文旨在解决3D Gaussian splatting (3DGS) 在真实场景重建中常见的几何伪影问题,如浮点噪声(floaters)、闪烁(flickering)和模糊(blur),这些问题主要由相机位姿误差、覆盖不完整以及初始几何噪声引起。解决方案的关键在于提出GaussFusion,一种基于几何信息的视频到视频生成方法:首先从现有3DGS重建中渲染包含深度、法向量、不透明度和协方差信息的高斯基元视频缓冲区,然后通过一个几何感知的生成器对这些帧进行优化,从而生成时序一致且无伪影的高质量图像。此外,作者还设计了伪影合成管道以模拟多样化的退化模式,提升模型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.25053
作者: Liyuan Zhu,Manjunath Narayana,Michal Stary,Will Hutchcroft,Gordon Wetzstein,Iro Armeni
机构: Stanford University (斯坦福大学); Zillow Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 main paper camera-ready. Project page: this http URL

点击查看摘要

Abstract:We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.

[CV-107] MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

【速读】:该论文旨在解决在线动态场景重建中,现有方法无法准确学习每个高斯点(Gaussian)的运动信息的问题。由于仅依赖光度损失优化,传统方法导致每个高斯点的运动被像素残差驱动而非真实三维运动,从而影响4D重建质量与时间一致性。解决方案的关键在于提出MoRGS框架,其核心创新包括:(1)利用稀疏关键帧上的光流作为轻量级运动先验,对每个高斯点的运动进行正则化;(2)引入每高斯点的运动偏移场(motion offset field),以补偿投影3D运动与观测光流之间的跨视角和跨时间差异;(3)设计每高斯点运动置信度机制,区分动态与静态区域,并加权属性残差更新,抑制静态区域冗余运动,提升时序一致性并加速大范围运动建模。

链接: https://arxiv.org/abs/2603.25042
作者: Wonjoon Lee,Sungmin Woo,Donghyeong Kim,Jungho Lee,Sangheon Park,Sangyoun Lee
机构: Yonsei University (延世大学); Electronics and Telecommunications Research Institute (电子与电信研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.

[CV-108] GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation

【速读】:该论文旨在解决卫星地球观测数据因以离散栅格文件形式存储而导致的存储、传输与查询成本高昂的问题。其核心挑战在于如何在不牺牲时空分辨率和光谱保真度的前提下,实现高效的数据压缩与按需访问。解决方案的关键在于提出GeoNDC——一种可查询的神经数据立方体(neural data cube),它将全球尺度的地球观测数据编码为连续时空隐式神经场(implicit neural field),从而支持在消费级硬件上直接进行时空查询和连续时间重建,无需完整解压原始数据。该方法通过深度学习模型学习高维遥感数据的潜在表示,在保持高精度的同时实现了高达95:1的压缩比,并具备良好的泛化能力与分析就绪性。

链接: https://arxiv.org/abs/2603.25037
作者: Jianbo Qi,Mengyao Li,Baogui Jiang,Yidan Chen,Qiao Wang
机构: Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
备注: 22 pages, 7 figures

点击查看摘要

Abstract:Satellite Earth observation has accumulated massive spatiotemporal archives essential for monitoring environmental change, yet these remain organized as discrete raster files, making them costly to store, transmit, and query. We present GeoNDC, a queryable neural data cube that encodes planetary-scale Earth observation data as a continuous spatiotemporal implicit neural field, enabling on-demand queries and continuous-time reconstruction without full decompression. Experiments on a 20-year global MODIS MCD43A4 reflectance record (7 bands, 5 km, 8-day sampling) show that the learned representation supports direct spatiotemporal queries on consumer hardware. On Sentinel-2 imagery (10 m), continuous temporal parameterization recovers cloud-free dynamics with high fidelity (R² > 0.85) under simulated 2-km cloud occlusion. On HiGLASS biophysical products (LAI and FPAR), GeoNDC attains near-perfect accuracy (R² > 0.98). The representation compresses the 20-year MODIS archive to 0.44 GB – approximately 95:1 relative to an optimized Int16 baseline – with high spectral fidelity (mean R² > 0.98, mean RMSE = 0.021). These results suggest GeoNDC offers a unified AI-native representation for planetary-scale Earth observation, complementing raw archives with a compact, analysis-ready data layer integrating query, reconstruction, and compression in a single framework.
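以下用 Python/NumPy 给出一个极简的坐标神经场查询示意(非论文实现;网络结构、隐藏层宽度和傅里叶编码频率数均为假设的玩具设定):将 (lon, lat, t) 坐标做傅里叶特征编码后经小型 MLP 输出 7 个波段的反射率,对应 GeoNDC "按需查询、无需完整解压" 的访问方式。

```python
import numpy as np

def fourier_features(coords, n_freqs=4):
    """Encode (lon, lat, t) coordinates with sin/cos Fourier features."""
    freqs = 2.0 ** np.arange(n_freqs)           # (F,)
    scaled = coords[..., None] * freqs          # (N, 3, F)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(coords.shape[0], -1)   # (N, 3*2F)

class TinyNeuralField:
    """Toy MLP standing in for the learned data-cube field (random weights)."""
    def __init__(self, in_dim, hidden=32, out_dim=7, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, out_dim))

    def query(self, coords):
        h = np.tanh(fourier_features(coords) @ self.w1)
        return h @ self.w2                      # 7 reflectance bands

# Query three (lon, lat, t) points, all normalized to [-1, 1].
field = TinyNeuralField(in_dim=3 * 2 * 4)
pts = np.array([[0.1, -0.3, 0.0], [0.5, 0.2, 0.7], [-0.9, 0.8, -0.4]])
bands = field.query(pts)
print(bands.shape)   # (3, 7): one 7-band spectrum per query point
```

真实的 GeoNDC 需在 20 年档案上训练并支持时空连续插值,这里仅演示"坐标进、光谱出"的查询接口形态。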

[CV-109] CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering

【速读】:该论文旨在解决医学图像恢复中缺乏可控性的问题,即现有方法通常依赖任务特定的重新训练,且难以在忠实重建与先验驱动增强之间实现有效权衡,尤其在临床场景下,过度激进的修复可能引入幻觉细节或改变诊断相关的结构。解决方案的关键在于提出一种无需训练的可控恢复框架CARE,其核心是采用双隐空间(dual-latent)恢复策略:一个分支强制数据保真度和解剖一致性,另一个分支利用生成先验恢复缺失或退化的信息;同时通过风险感知自适应控制器动态调整两分支贡献,依据恢复不确定性与局部结构可靠性,在不需额外训练的前提下实现保守或增强导向的恢复模式,从而在保持临床相关结构的同时降低不合理重建的风险。

链接: https://arxiv.org/abs/2603.25026
作者: Xu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.
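下面是对"双隐空间 + 风险感知控制"思路的一个玩具化示意(非论文实现;sigmoid 门控形式与 tau 参数均为假设):不确定性高的区域更偏向数据保真分支(保守模式),不确定性低的区域允许更多先验驱动的增强。

```python
import numpy as np

def risk_aware_blend(z_fidelity, z_prior, uncertainty, tau=5.0):
    """Blend a fidelity latent and a prior latent per element.

    Where restoration uncertainty is high, lean on the data-faithful
    branch (conservative mode); where it is low, allow more
    prior-driven enhancement. The sigmoid gate is a stand-in for the
    paper's learned risk-aware controller.
    """
    w = 1.0 / (1.0 + np.exp(-tau * (uncertainty - 0.5)))  # in (0, 1)
    return w * z_fidelity + (1.0 - w) * z_prior

rng = np.random.default_rng(0)
z_fid, z_pri = rng.normal(size=(2, 8, 8))
u_high = np.full((8, 8), 0.9)   # uncertain region -> conservative
u_low = np.full((8, 8), 0.1)    # confident region -> enhancement

z_cons = risk_aware_blend(z_fid, z_pri, u_high)
z_enh = risk_aware_blend(z_fid, z_pri, u_low)
# Conservative output stays closer to the fidelity latent.
print(np.abs(z_cons - z_fid).mean() < np.abs(z_enh - z_fid).mean())  # True
```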

[CV-110] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解(Long Video Understanding, LVU)任务中因文本与视觉 token不平衡而导致的幻觉问题。现有方法虽通过自动分割视频为可处理片段缓解该问题,但依赖大量细粒度高质量数据的监督微调(SFT)方法存在工具调用轨迹受限的问题。本文提出VideoTIR框架,其核心创新在于引入强化学习(Reinforcement Learning, RL),驱动MLLM使用多层次工具集高效定位并聚焦于有意义的视频片段、图像或区域,从而提升长视频理解的准确性与效率。关键解决方案包括:1)采用零样本强化学习(Zero-RL)和SFT冷启动策略实现灵活的工具调用;2)提出工具动作分组策略优化(Toolkit Action Grouped Policy Optimization, TAGPO),通过逐步奖励分配和失败回放复用减少冗余调用;3)构建基于沙盒的轨迹合成框架生成高质量训练轨迹数据,显著增强模型在复杂长视频场景下的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2603.25021
作者: Zhe Gao,Shiyu Shen,Taifeng Chai,Weinong Wang,Haotian Xu,Xing W,Wenbin Li,Qi Fan,Yang Gao,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose a novel VideoTIR that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

[CV-111] GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

【速读】:该论文旨在解决虚拟人类合成中双人交互场景下听众头部运动生成的难题,特别是现有方法在监听者动作上普遍存在“回归均值”(Regression-to-the-Mean)问题,导致动作趋于静态且缺乏复杂非语言行为的参数空间。解决方案的关键在于提出GDPO-Listener框架:首先采用自回归流匹配(Auto-Regressive Flow Matching)架构实现稳定监督学习;其次引入分组奖励解耦策略优化(Group reward-Decoupled Policy Optimization, GDPO),通过分离不同FLAME参数组的奖励归一化,显式激励高方差、富有表现力的动作生成;最后支持语义文本控制,实现响应的可定制性。实验证明该方法在长期运动方差、视觉表现力和语义可控性方面均优于现有基线。

链接: https://arxiv.org/abs/2603.25020
作者: Zhangyu Jin,Maksim Siniukov,Deuksin Kwon,Ashutosh Chaubey,Mohammad Soleymani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the `Regression-to-the-Mean’ problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply the Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.
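GDPO 的核心是按参数组分别做奖励归一化,避免某个高方差参数组(如表情)淹没其他组的奖励信号。下面用 NumPy 给出分组优势归一化的最小示意(组名 "jaw"/"expr" 仅为举例,并非论文的真实 FLAME 参数划分):

```python
import numpy as np

def grouped_advantages(rewards, groups):
    """Normalize rewards separately within each parameter group.

    Global normalization lets a high-reward, high-variance group
    dominate the policy gradient; per-group normalization keeps every
    group's reward signal on a comparable scale.
    """
    adv = np.empty_like(rewards, dtype=float)
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        r = rewards[idx]
        adv[idx] = (r - r.mean()) / (r.std() + 1e-8)
    return adv

rewards = np.array([0.1, 0.3, 0.2, 5.0, 9.0, 7.0])
groups = ["jaw", "jaw", "jaw", "expr", "expr", "expr"]
adv = grouped_advantages(rewards, groups)
# Each group is zero-mean after normalization, so the low-reward
# "jaw" group is not drowned out by the high-reward "expr" group.
print(np.allclose(adv[:3].mean(), 0), np.allclose(adv[3:].mean(), 0))
```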

[CV-112] Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields

【速读】:该论文旨在解决稀疏视角输入下3D重建质量不稳定、渲染效率低的问题。其解决方案的关键在于融合TensoRF的张量(tensor)表示以提升渲染速度,并引入FreeNeRF中的频率驱动少样本正则化机制,结合频域掩码和遮挡掩码来增强重建稳定性与精度,从而在仅需少量输入图像的情况下实现高质量的3D场景重建。

链接: https://arxiv.org/abs/2603.25008
作者: Thanh-Hai Le,Hoang-Hau Tran,Trong-Nghia Vu
机构: The Saigon International University (胡志明市国际大学); FPT University (FPT大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:This paper presents Few TensoRF, a 3D reconstruction framework that combines TensoRF's efficient tensor-based representation with FreeNeRF's frequency-driven few-shot regularization. Using TensoRF to significantly accelerate rendering speed and introducing frequency and occlusion masks, the method improves stability and reconstruction quality under sparse input views. Experiments on the Synthesis NeRF benchmark show that the Few TensoRF method improves the average PSNR from 21.45 dB (TensoRF) to 23.70 dB, with the fine-tuned version reaching 24.52 dB, while maintaining TensoRF's fast (≈10-15 minute) training time. Experiments on the THuman 2.0 dataset further demonstrate competitive performance in human body reconstruction, achieving 27.37-34.00 dB with only eight input images. These results highlight Few TensoRF as an efficient and data-efficient solution for real-time 3D reconstruction across diverse scenes.
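FreeNeRF 式频率正则的核心是随训练进度线性放开位置编码的高频分量:训练早期只保留低频,防止少样本场景下过早拟合高频噪声。下面给出一个简化的频率掩码示意(线性调度为简化版本,并非论文或 FreeNeRF 的精确公式):

```python
import numpy as np

def frequency_mask(step, total_steps, n_freqs):
    """Simplified FreeNeRF-style mask over positional-encoding bands.

    Early in training only low-frequency bands pass; higher bands are
    revealed linearly, regularizing few-shot optimization.
    """
    visible = step / total_steps * n_freqs        # fractional cutoff
    mask = np.clip(visible - np.arange(n_freqs), 0.0, 1.0)
    return mask

early = frequency_mask(step=100, total_steps=1000, n_freqs=10)
late = frequency_mask(step=900, total_steps=1000, n_freqs=10)
print(early)                      # only the first band is active
print(late.sum() > early.sum())   # True: more bands revealed later
```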

[CV-113] Improving Fine-Grained Rice Leaf Disease Detection via Angular-Compactness Dual Loss Learning

【速读】:该论文旨在解决水稻叶片疾病早期检测中因类别内差异大、类间相似度高而导致的传统深度学习模型分类性能受限的问题。其解决方案的关键在于提出一种双损失框架,融合中心损失(Center Loss)与弧面损失(ArcFace Loss),通过引入角度边界约束和中心距离约束,显著增强特征嵌入的判别能力,从而提升细粒度分类精度,且无需对骨干网络结构进行重大修改,具备良好的实用性与部署效率。

链接: https://arxiv.org/abs/2603.25006
作者: Md. Rokon Mia,Rakib Hossain Sajib,Abdullah Al Noman,Abir Ahmed,B M Taslimul Haque
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early detection of rice leaf diseases is critical, as rice is a staple crop supporting a substantial share of the world’s population. Timely identification of these diseases enables more effective intervention and significantly reduces the risk of large-scale crop losses. However, traditional deep learning models primarily rely on cross entropy loss, which often struggles with high intra-class variance and inter-class similarity, common challenges in plant pathology datasets. To tackle this, we propose a dual-loss framework that combines Center Loss and ArcFace Loss to enhance fine-grained classification of rice leaf diseases. The method is applied into three state-of-the-art backbone architectures: InceptionNetV3, DenseNet201, and EfficientNetB0 trained on the public Rice Leaf Dataset. Our approach achieves significant performance gains, with accuracies of 99.6%, 99.2% and 99.2% respectively. The results demonstrate that angular margin-based and center-based constraints substantially boost the discriminative strength of feature embeddings. In particular, the framework does not require major architectural modifications, making it efficient and practical for real-world deployment in farming environments.
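论文组合了 ArcFace(对目标类 logit 施加加性角度边界)与 Center Loss(类内紧致约束)。下面用 NumPy 给出这两种损失的最小可运行示意(嵌入、分类权重、类中心均为随机数据;s=30、m=0.5 与权重系数 0.1 为该领域常见取值,并非论文设定):

```python
import numpy as np

def arcface_loss(emb, class_w, labels, s=30.0, m=0.5):
    """ArcFace: additive angular margin on the target-class logit."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = class_w / np.linalg.norm(class_w, axis=0, keepdims=True)
    cos = np.clip(e @ w, -1.0, 1.0)                 # (N, C) cosine logits
    theta = np.arccos(cos)
    logits = s * cos.copy()
    rows = np.arange(len(labels))
    logits[rows, labels] = s * np.cos(theta[rows, labels] + m)
    logz = logits - logits.max(axis=1, keepdims=True)
    logp = logz - np.log(np.exp(logz).sum(axis=1, keepdims=True))
    return -logp[rows, labels].mean()               # cross-entropy

def center_loss(emb, class_centers, labels):
    """Pull embeddings toward their class centers (intra-class compactness)."""
    return ((emb - class_centers[labels]) ** 2).sum(axis=1).mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 3))         # 3 disease classes (illustrative)
centers = rng.normal(size=(3, 16))
labels = np.array([0, 1, 2, 1])
total = arcface_loss(emb, W, labels) + 0.1 * center_loss(emb, centers, labels)
print(total > 0)   # True: both terms are non-negative penalties
```

实际训练中该双损失接在 InceptionNetV3 等骨干网络的嵌入层之后,类中心随训练更新。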

[CV-114] Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

【速读】:该论文旨在解决零样本指代表达理解(Zero-shot Referring Expression Comprehension, REC)任务中现有方法对细粒度视觉细节捕捉不足以及复杂对象关系理解能力有限的问题。传统视觉语言模型(Vision-Language Models, VLMs)如CLIP通过直接计算文本查询与图像区域的特征相似度进行定位,难以处理复杂语义;而大语言模型(Large Language Models, LLMs)虽具备强语义推理能力,却无法直接从视觉特征中抽象出文本语义,限制其在REC中的应用。解决方案的关键在于提出SGREC,一种基于查询驱动场景图(query-driven scene graph)的可解释零样本REC方法:首先利用VLM构建包含空间关系、描述性标题和对象交互信息的结构化场景图,作为连接低层图像区域与高层语义理解的中介;随后由LLM基于该结构化文本表示推断目标对象,并生成决策解释,从而实现高精度且可解释的零样本REC。

链接: https://arxiv.org/abs/2603.25004
作者: Yike Wu,Necva Bolucu,Stephen Wan,Dadong Wang,Jiahao Xia,Jian Zhang
机构: University of Technology Sydney (悉尼科技大学); Commonwealth Scientific and Industrial Research Organisation (澳大利亚联邦科学与工业研究组织)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by T-MM

点击查看摘要

Abstract:Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.

[CV-115] Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method

【速读】:该论文旨在解决紧急车辆快速通行与普通车辆交通影响之间的平衡问题,尤其针对现有方法在计算复杂度高和可扩展性差方面的局限性。其解决方案的关键在于提出一种基于局部信息的分布式车辆控制机制,通过在线实时调整各车辆行为,实现近似全局最优决策,无需预训练且具备对不同交通条件的自然适应能力;同时引入分布式冲突消解机制,确保车辆决策安全、避免单点故障风险,从而提供确定性的安全保障,显著优于依赖集中式训练的强化学习方法。

链接: https://arxiv.org/abs/2603.25000
作者: WenXi Wang,JunQi Zhang
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Cybernetics

点击查看摘要

Abstract:Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtains optimal solutions but is only practical for small-scale scenarios. The latter implicitly learns through extensive centralized training but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome above limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We proved that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles’ safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Compared with existing methods, simulation experiments based on real-world traffic datasets demonstrate that the proposed method achieves faster decision-making, less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.

[CV-116] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting

【速读】:该论文旨在解决动态3D场景重建中因高斯分布(Gaussian)运动与真实物理动力学不一致而导致的局部几何结构失真问题,尤其是在单目视频数据集上,这种失真会显著降低重建质量。现有方法通常依赖外部先验(如光流或2D轨迹)来强制时间一致性,但存在对额外标注的依赖。其解决方案的关键在于提出一种视图空间射线分组策略(view-space ray grouping strategy),通过聚类被同一射线相交且α-混合权重超过阈值的高斯点,并对其施加空间分布一致性约束,从而显式保持高斯点在时间维度上的局部几何结构稳定性,实现无需外部引导的更符合物理规律的运动建模。

链接: https://arxiv.org/abs/2603.24994
作者: Junoh Lee,Junmyeong Lee,Yeon-Ji Song,Inhwan Bae,Jisu Shin,Hae-Gon Jeon,Jin-Hwa Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 7 figures

点击查看摘要

Abstract:The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose α-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.
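射线分组依赖前向 α 合成权重 w_i = α_i·∏_{j&lt;i}(1-α_j),仅保留权重超过阈值的高斯点参与约束。下面给出单条射线上分组逻辑的最小示意(阈值 0.05 与高斯编号均为假设值,非论文实现):

```python
import numpy as np

def blending_weights(alphas):
    """Front-to-back compositing weights: w_i = a_i * prod_{j<i}(1 - a_j)."""
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    return alphas * trans

def ray_group(gaussian_ids, alphas, w_min=0.05):
    """Group the Gaussians on one ray whose blending weight exceeds w_min."""
    w = blending_weights(alphas)
    return [g for g, wi in zip(gaussian_ids, w) if wi > w_min], w

# Five Gaussians intersected by one ray, ordered front to back.
ids = [17, 4, 42, 8, 23]
alphas = np.array([0.6, 0.3, 0.02, 0.5, 0.4])
group, w = ray_group(ids, alphas)
print(group)           # low-alpha Gaussian 42 is excluded from the group
print(w.sum() <= 1.0)  # True: compositing weights never exceed 1
```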

[CV-117] C2W-Tune: Cavity-to -Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance

【速读】:该论文旨在解决3D延迟钆增强磁共振成像(LGE-MRI)中左心房(LA)壁薄、解剖结构复杂及对比度低导致的分割精度不足问题,从而实现更准确的壁厚映射与纤维化定量分析。其解决方案的关键在于提出一种两阶段腔体到壁的迁移学习框架(C2W-Tune),利用高精度的LA腔体模型作为解剖先验信息,在第一阶段预训练网络以提取稳健的心房特征,第二阶段通过渐进式层解冻策略将权重迁移至LA壁分割任务,既保留心内膜特征又实现壁特异性优化,显著提升了边界精度和分割性能。

链接: https://arxiv.org/abs/2603.24992
作者: Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted this to the International Conference on Artificial Intelligence in Medicine (AIME 2026)

点击查看摘要

Abstract:Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall’s thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7. These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.
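渐进式层解冻可以表示为"层 → 解冻轮次"的调度表:先放开壁特异的解码层,预训练的腔体编码特征冻结最久。以下为纯 Python 示意(层名与步长均为假设,非论文的真实网络结构):

```python
def unfreeze_schedule(layer_names, start_epoch=0, stride=2):
    """Map each layer (decoder-first) to the epoch at which it unfreezes.

    Wall-specific decoder layers adapt first while pretrained cavity
    (encoder) features stay frozen longest.
    """
    return {name: start_epoch + i * stride for i, name in enumerate(layer_names)}

def trainable_layers(schedule, epoch):
    """Layers whose unfreeze epoch has been reached."""
    return sorted(n for n, e in schedule.items() if epoch >= e)

layers = ["decoder.out", "decoder.up2", "decoder.up1",
          "encoder.block2", "encoder.block1"]
sched = unfreeze_schedule(layers)
print(trainable_layers(sched, epoch=0))  # only decoder.out trains at first
print(trainable_layers(sched, epoch=5))  # decoder fully unfrozen by epoch 5
```

在 PyTorch 等框架中,这一调度表可在每个 epoch 开头用来切换对应参数的 `requires_grad`。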

[CV-118] owards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets

【速读】:该论文旨在解决事件相机(Event Camera)在视频异常检测(Video Anomaly Detection, VAD)中缺乏专用数据集和有效建模策略的问题,从而阻碍了事件流驱动的异常检测研究进展。解决方案的关键在于提出一个以事件为中心的时空视频异常检测框架(EWAD),其核心创新包括:1)基于事件密度感知的动态采样策略,用于选择时序上信息量丰富的片段;2)密度调制的时序建模方法,以从稀疏事件流中捕捉上下文关系;3)RGB到事件的知识蒸馏机制,在弱监督条件下增强事件表示能力。实验表明,该框架在三个基准数据集上显著优于现有方法,验证了事件驱动建模在VAD中的有效性与潜力。

链接: https://arxiv.org/abs/2603.24991
作者: Peng Wu,Yuting Yan,Guansong Pang,Yujia Sun,Qingsen Yan,Peng Wang,Yanning Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EWAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EWAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.
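事件密度感知采样的直觉是:事件越密集的时间窗越可能包含显著运动乃至异常。以下为该采样思路的简化示意(按固定时间窗计数取 top-k,并非论文的动态采样器):

```python
def density_sample(event_counts, n_segments=3):
    """Pick the temporally most informative segments by event density.

    `event_counts[i]` is the number of events in the i-th fixed-length
    temporal window; dense windows carry the most motion and are the
    likeliest to contain anomalies.
    """
    ranked = sorted(range(len(event_counts)),
                    key=lambda i: event_counts[i], reverse=True)
    return sorted(ranked[:n_segments])   # keep temporal order

counts = [120, 40, 980, 15, 860, 300, 22, 510]
print(density_sample(counts))   # [2, 4, 7]: the three densest windows
```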

[CV-119] Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

【速读】:该论文旨在解决从晚期钆增强磁共振成像(Late Gadolinium Enhancement Magnetic Resonance Imaging, LGE-MRI)中对左心房壁进行精确分割的难题,该任务因心房壁薄、对比度低以及专家标注数据稀缺而极具挑战性。解决方案的关键在于提出一种模型无关的元学习(Model-Agnostic Meta-Learning, MAML)框架,通过在左心房壁分割任务上联合训练辅助的心房腔体分割任务,并引入边界感知的复合损失函数以强化对细结构边界的准确建模,从而在极少量标注样本(K-shot,K=5,10,20)条件下实现鲁棒且高精度的分割性能。

链接: https://arxiv.org/abs/2603.24985
作者: Yusri Al-Sanaani,Rebecca Thornhill,Pablo Nery,Elena Pena,Robert deKemp,Calum Redpath,David Birnie,Sreeraman Rajan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE EMBC 2026

点击查看摘要

Abstract:Segmenting the left atrial wall from late gadolinium enhancement magnetic resonance images (MRI) is challenging due to the wall’s thin geometry, low contrast, and the scarcity of expert annotations. We propose a Model-Agnostic Meta-Learning (MAML) framework for K-shot (K = 5, 10, 20) 3D left atrial wall segmentation that is meta-trained on the wall task together with auxiliary left atrial and right atrial cavity tasks and uses a boundary-aware composite loss to emphasize thin-structure accuracy. We evaluated MAML segmentation performance on a hold-out test set and assessed robustness under an unseen synthetic shift and on a distinct local cohort. On the hold-out test set, MAML appeared to improve segmentation performance compared to the supervised fine-tuning model, achieving a Dice score (DSC) of 0.64 vs. 0.52 and HD95 of 5.70 vs. 7.60 mm at 5-shot, and approached the fully supervised reference at 20-shot (0.69 vs. 0.71 DSC). Under unseen shift, performance degraded but remained robust: at 5-shot, MAML attained 0.59 DSC and 5.99 mm HD95 on the unseen domain shift and 0.57 DSC and 6.01 mm HD95 on the local cohort, with consistent gains as K increased. These results suggest that more accurate and reliable thin-wall boundaries are achievable in low-shot adaptation, potentially enabling clinical translation with minimal additional labeling for the assessment of atrial remodeling.
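MAML 的内外环结构可以在一维回归玩具任务上演示:内环在支持集上做一步梯度适应,外环聚合各任务适应后的查询梯度。以下为一阶简化版本(FOMAML,并非论文使用的二阶 MAML,任务与超参数均为假设):

```python
import numpy as np

def fomaml_step(theta, tasks, inner_lr=0.1, outer_lr=0.1, k_shot=5):
    """One first-order MAML meta-update on toy 1-D regression tasks.

    Each task fits y = a * x with its own slope `a`: the inner loop
    adapts theta on k support points, the outer loop averages the
    post-adaptation query gradients.
    """
    rng = np.random.default_rng(0)
    meta_grad = 0.0
    for a in tasks:
        x = rng.normal(size=k_shot)                  # support set
        grad = 2 * np.mean((theta * x - a * x) * x)
        adapted = theta - inner_lr * grad            # inner adaptation
        xq = rng.normal(size=k_shot)                 # query set
        meta_grad += 2 * np.mean((adapted * xq - a * xq) * xq)
    return theta - outer_lr * meta_grad / len(tasks)

theta = 0.0
tasks = [1.5, 2.0, 2.5]                              # per-task slopes
for _ in range(200):
    theta = fomaml_step(theta, tasks)
# The meta-learned init settles between the task slopes, so a few
# inner steps suffice to adapt to any single task (the K-shot regime).
print(1.4 < theta < 2.6)
```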

[CV-120] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models CVPR2026

【速读】:该论文旨在解决当前基于混合专家(Mixture-of-Experts, MoE)架构的视觉语言模型(Vision-Language Models, VLMs)中,因采用确定性Top-K路由机制而导致的专家选择多样性不足与专家过拟合问题。其核心解决方案是提出MoE-GRPO框架,该框架将专家选择建模为序列决策问题,并利用组相对策略优化(Group Relative Policy Optimization, GRPO)进行强化学习训练,使模型能够通过探索和奖励反馈学习自适应的专家路由策略;同时引入模态感知的路由器引导机制,抑制对特定模态下低频激活专家的探索,从而提升训练稳定性和效率,最终实现任务级专家专业化并缓解过拟合现象。

链接: https://arxiv.org/abs/2603.24984
作者: Dohwan Ko,Jinyoung Park,Seoung Choi,Sanghyeok Lee,Seohyun Lee,Hyunwoo J. Kim
机构: Korea University (韩国科学技术院); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling a task-level expert specialization.
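论文将专家选择从确定性 Top-K 改为可探索的序列决策。下面对比确定性 Top-K 路由与基于 softmax 概率的随机采样路由(示意性实现,仅演示"采样能访问到排名靠后的专家",并非论文的 GRPO 训练流程):

```python
import numpy as np

def topk_route(logits, k=2):
    """Deterministic top-K routing: always the same experts per token."""
    return np.argsort(logits)[::-1][:k]

def sampled_route(logits, k=2, rng=None):
    """Stochastic routing (RL-style exploration): sample k distinct
    experts from the softmax over router logits, so lower-ranked but
    potentially useful experts still get visited during training."""
    rng = rng or np.random.default_rng()
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(logits), size=k, replace=False, p=p)

logits = np.array([2.0, 1.8, 0.5, -1.0])
print(topk_route(logits))                 # always experts [0, 1]
rng = np.random.default_rng(0)
picks = {tuple(sorted(sampled_route(logits, rng=rng))) for _ in range(200)}
print(len(picks) > 1)   # True: sampling explores beyond the top-2 pair
```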

[CV-121] PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration

【速读】:该论文旨在解决真实场景下低光照人脸图像中存在的多重退化问题,包括光照不足、模糊、噪声以及可见度低等,这些问题导致现有级联式方法因误差累积而性能受限,而通用联合模型又缺乏显式的面部先验知识,难以恢复清晰的面部结构。其解决方案的关键在于提出一种无需训练的物理感知语义扩散框架(PASDiff),通过逆强度加权与Retinex理论引入光度约束以可靠恢复可见性和自然色彩分布;同时设计了风格无关的结构注入机制(Style-Agnostic Structural Injection, SASI),从现成的面部先验中提取结构信息并去除其固有的光度偏置,从而将身份特征与物理约束无缝融合,最终在真实世界复杂退化条件下实现照明合理性、色彩恢复和身份一致性的优异平衡。

链接: https://arxiv.org/abs/2603.24969
作者: Yilin Ni,Wenjie Li,Zhengxue Wang,Juncheng Li,Guangwei Gao,Jian Yang
机构: Nanjing University of Posts and Telecommunications (南京邮电大学); Beijing University of Posts and Telecommunications (北京邮电大学); PCA Lab, Nanjing University of Science and Technology (南京理工大学PCA实验室); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face images captured in real-world low light suffer from multiple degradations: low illumination, blur, noise, and low visibility. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a Physics-Aware Semantic Diffusion with a training-free manner. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.

[CV-122] Self-Corrected Image Generation with Explainable Latent Rewards CVPR2026

【速读】:该论文旨在解决文本到图像生成中复杂提示(prompt)的语义对齐难题,特别是细粒度语义和空间关系的准确表达问题。其核心挑战在于生成过程的前馈特性导致模型难以在输出生成前完全理解最终结果的对齐情况。解决方案的关键在于提出一种自修正框架xLARD,其通过多模态大语言模型(Multimodal Large Language Models, MLLMs)生成可解释的潜在奖励信号(Explainable LAtent RewarDs),并引入一个轻量级校正器,基于模型生成的参考反馈对潜在表示进行精细化调整。该方法的核心创新在于构建了一个从潜在编辑到可微分奖励信号的映射机制,从而实现非可微图像评估结果对潜在空间的连续引导,使模型能够在生成过程中自主理解、评估与修正自身输出。

链接: https://arxiv.org/abs/2603.24965
作者: Yinyi Luo,Hrishikesh Gokhale,Marios Savvides,Jindong Wang,Shengfeng He
机构: Carnegie Mellon University (卡内基梅隆大学); Singapore Management University (新加坡管理大学); William & Mary (威廉与玛丽学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at this https URL.

[CV-123] Select Hypothesize and Verify: Towards Verified Neuron Concept Interpretation CVPR2026

【速读】:该论文旨在解决现有神经网络可解释性方法中因假设每个神经元都具有明确且判别性强的功能而导致的误解释问题,尤其是冗余或提供误导性概念的神经元可能引发对模型决策机制的错误理解。其解决方案的关键在于提出一个“选择-假设-验证”(Select-Hypothesize-Verify)框架:首先通过激活分布分析筛选出能最好体现神经元功能行为的激活样本;其次基于这些样本生成关于神经元功能的概念假设;最后通过验证机制确认所生成的概念是否真正激活对应神经元,从而提升神经元概念描述的准确性。实验表明,该方法使生成的概念激活目标神经元的概率约为当前最优方法的1.5倍。

链接: https://arxiv.org/abs/2603.24953
作者: ZeBin Ji,Yang Hu,Xiuli Bi,Bo Liu,Bin Xiao
机构: Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China; Jinan Inspur Data Technology Co., Ltd., Jinan, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026

点击查看摘要

Abstract:It is essential for understanding neural network decisions to interpret the functionality (also known as concepts) of neurons. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network’s decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network’s decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select-Hypothesize-Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron’s well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.

[CV-124] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation CVPR2026

[Quick Read]: This paper addresses two problems in few-step diffusion and flow-matching models for image editing: degraded editing quality caused by a poor approximation of the forward process, and the limited scalability and cross-architecture generalization of existing few-step inversion methods, which depend on pretrained generators and auxiliary modules. The key is BiFM (Bidirectional Flow Matching), a framework that jointly learns generation and inversion within a single model by directly estimating average velocity fields in both the "image → noise" and "noise → image" directions, constrained by a shared instantaneous velocity field (derived from either a predefined schedule or a pretrained multi-step diffusion model). Training is stabilized with continuous time-interval supervision, a bidirectional consistency objective, and a lightweight time-interval embedding. This bidirectional formulation supports one-step inversion, integrates seamlessly into mainstream diffusion and flow-matching backbones, and outperforms existing few-step approaches across diverse image generation and editing tasks in both quality and editing flexibility.

Link: https://arxiv.org/abs/2603.24942
Authors: Yasong Dai,Zeeshan Hayder,David Ahmedt-Aristizabal,Hongdong Li
Institutions: Australian National University; Data61-CSIRO
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR2026

Click to view abstract

Abstract:Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image → noise" and "noise → image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
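The notion of an average velocity field is easiest to see in one dimension. The sketch below is purely illustrative (BiFM learns these fields with neural networks over images); for a straight-line probability path, knowing the average velocity over an interval allows a single-step jump in either direction:

```python
# Toy 1-D sketch of average-velocity flow matching.
# For a straight-line path x_t = (1 - t) * noise + t * image, the
# instantaneous velocity is constant (image - noise), so the average
# velocity over any interval enables an exact one-step jump.

def x_t(noise, image, t):
    return (1.0 - t) * noise + t * image

def average_velocity(noise, image, s, t):
    """(x_t - x_s) / (t - s): the quantity an average-velocity model estimates."""
    return (x_t(noise, image, t) - x_t(noise, image, s)) / (t - s)

def one_step_jump(noise, image, s=0.0, t=1.0):
    """s=0, t=1 is the noise -> image direction; s=1, t=0 is inversion."""
    return x_t(noise, image, s) + (t - s) * average_velocity(noise, image, s, t)
```

Setting `s=1.0, t=0.0` reproduces one-step inversion, the bidirectional use case the paper emphasizes.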

[CV-125] Infinite Gaze Generation for Videos with Autoregressive Diffusion

[Quick Read]: This paper tackles long-horizon temporal modeling for human gaze trajectory prediction in video: existing methods are typically limited to short windows (about 3-5 s) and fail to capture the long-range behavioral dependencies of real-world content, while abstractions such as saliency maps and scanpaths discard the fine-grained temporal dynamics of raw gaze data. The key is a generative framework built on an autoregressive diffusion model that enables infinite-horizon raw gaze prediction for videos of arbitrary length; conditioned on a saliency-aware visual latent space, it synthesizes gaze trajectories with continuous spatial coordinates and high-resolution timestamps, significantly outperforming existing methods in long-range spatio-temporal accuracy and trajectory realism.

Link: https://arxiv.org/abs/2603.24938
Authors: Jenna Kang,Colin Groth,Tong Wu,Finley Torrens,Patsorn Sangkloy,Gordon Wetzstein,Qi Sun
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ( \approx 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.

[CV-126] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization

[Quick Read]: This paper addresses the insufficient modeling of social norms and scene constraints in existing Conditional Flow Matching (CFM) approaches to human trajectory forecasting, which leads to behaviorally implausible generated trajectories. The key is TIGFlow-GRPO, a two-stage generative framework. In the first stage, a Trajectory-Interaction-Graph (TIG) module strengthens spatio-temporal interaction modeling, capturing fine-grained agent-agent and agent-scene relations more effectively. In the second stage, reward-driven reinforcement optimization (Flow-GRPO) reformulates deterministic flow rollout as SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility, progressively steering multimodal predictions toward behaviorally plausible and physically feasible futures.

Link: https://arxiv.org/abs/2603.24936
Authors: Xuepeng Jing,Wenhuan Lu,Hao Meng,Zhizhi Yu,Jianguo Wei
Institutions: Tianjin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.
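The GRPO-style post-training signal, a composite reward normalized within each rollout group, can be sketched as follows. The weights and the two reward terms are illustrative assumptions; the paper's exact reward design differs in detail:

```python
import math

# Sketch of a GRPO-style training signal: a composite reward per
# sampled trajectory, then group-relative advantage normalization.

def composite_reward(social_score, physical_score, w_social=0.5, w_physical=0.5):
    """Combine social-compliance and physical-feasibility terms."""
    return w_social * social_score + w_physical * physical_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within the sampled group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Trajectories scoring above the group mean receive positive advantages and are reinforced, which is how the multimodal predictions are steered toward plausible futures.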

[CV-127] CVA: Context-aware Video-text Alignment for Video Temporal Grounding CVPR2026

[Quick Read]: This paper addresses a key challenge in Video Temporal Grounding (VTG): achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. The key innovations of the proposed Context-aware Video-text Alignment (CVA) framework are: 1) Query-aware Context Diversification (QCD), a data augmentation strategy that builds a video-text similarity-based pool of replacement clips and mixes in only semantically unrelated content, avoiding "false negatives"; 2) a Context-invariant Boundary Discrimination (CBD) contrastive loss that enforces semantic consistency at temporal boundaries, making representations robust to contextual shifts and hard negatives; and 3) a Context-enhanced Transformer Encoder (CTE), a hierarchical architecture combining windowed self-attention with bidirectional cross-attention and learnable queries to capture multi-scale temporal context. Together these components achieve state-of-the-art performance on major VTG benchmarks such as QVHighlights and Charades-STA, improving Recall@1 (R1) by roughly 5 points over prior state-of-the-art methods.

Link: https://arxiv.org/abs/2603.24934
Authors: Sungho Moon,Seunghun Lee,Jiwan Seo,Sunghoon Im
Institutions: Daegu Gyeongbuk Institute of Science and Technology (DGIST)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negative" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.
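The CBD loss builds on the standard contrastive (InfoNCE) template, which can be written compactly as below. This is the generic softmax-over-similarities form that such losses instantiate, not the paper's exact boundary-specific formulation:

```python
import math

# Generic InfoNCE loss over one positive similarity and a list of
# negative similarities: -log softmax of the positive logit.

def info_nce(pos_sim, neg_sims, temperature=0.07):
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_sum)
```

In a CBD-style use, the positive pair would be two views of the same boundary clip under different contexts, and the negatives would include hard clips drawn from other boundaries.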

[CV-128] ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects CVPR2026

[Quick Read]: This paper addresses a core difficulty in inverse rendering: modeling how real-world materials reflect light, where the scarcity of measured reflectance data forces existing methods to rely on synthetic datasets with simplified illumination and limited material realism, preventing generalization to real images. The key is a large-scale, high-resolution polarized reflection and material dataset of real objects, captured with an 8-camera, 346-light Light Stage. The dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M images with analytically derived diffuse albedo, specular albedo, and surface normals. Training and evaluating state-of-the-art inverse and forward rendering models on this dataset yields significant improvements in material separation, illumination fidelity, and geometric consistency on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, establishing a new benchmark for physically grounded material understanding and pushing models from synthetic training toward real-world generalization.

Link: https://arxiv.org/abs/2603.24912
Authors: Jing Yang,Krithika Dharanikota,Emily Jia,Haiwei Chen,Yajie Zhao
Institutions: University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Click to view abstract

Abstract:Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions (multiview, multi-illumination, polarization, reflectance separation, and material attributes), yielding over 1.2M high-resolution images with diffuse-specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes. Project page: this https URL
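The diffuse-specular separation enabled by cross/parallel polarization follows a standard Light Stage identity, sketched below. The dataset's actual processing pipeline may differ; this is only the textbook relation, under the assumption that cross-polarized capture blocks specular reflection:

```python
# Standard cross/parallel polarization identity (per pixel):
#   I_cross    ≈ diffuse / 2            (specular blocked)
#   I_parallel ≈ diffuse / 2 + specular
# so diffuse ≈ 2 * I_cross and specular ≈ I_parallel - I_cross.

def separate_reflectance(i_parallel, i_cross):
    diffuse = 2.0 * i_cross
    specular = i_parallel - i_cross
    return diffuse, specular
```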

[CV-129] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data

[Quick Read]: This paper asks whether self-supervised learning (SSL) outperforms ImageNet-pretrained initialization for knee osteoarthritis (OA) modeling, with particular attention to the gap between diagnostic (Kellgren-Lawrence, KL grade prediction) and prognostic (structural progression) tasks. Two SSL strategies are compared: image-only SSL on knee radiographs (from the OAI, MOST, and NYU cohorts), and multimodal SSL on uncurated hospital radiographs paired with radiologist impressions. Image-only SSL fails to beat the ImageNet baseline for diagnosis under full fine-tuning, and multimodal SSL is held back diagnostically by severe bias in the pretraining corpus (an estimated 93% at KL grade 3). The same multimodal initialization, however, substantially improves prognostic modeling, achieving higher AUROC on external validation (e.g., 0.701 vs. 0.599 for 4-year structural progression on MOST with only 10% labeled data), suggesting that even uncurated hospital data provides a strong signal for prognosis when the downstream task aligns with the pretraining distribution.

Link: https://arxiv.org/abs/2603.24903
Authors: Haresh Rengaraj Rajamohan,Yuxuan Chen,Kyunghyun Cho,Cem M. Deniz
Institutions: New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.

[CV-130] SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform

[Quick Read]: This paper addresses accurate surgical phase recognition in pituitary tumor surgery (PTS) videos, in support of intraoperative decision-making, workflow analysis, and data-driven improvement of surgical education and evaluation. The key is a framework combining self-supervised representation learning, robust temporal modeling, and scalable annotation: a ResNet-50 is pretrained on 251 unlabeled PTS videos to extract high-quality representations, then fine-tuned on 81 labeled procedures with a modified training regime (focal loss, gradual layer unfreezing, and dynamic sampling) that mitigates class imbalance and procedural variability. A collaborative online platform lets surgeons upload videos, receive automated phase analysis, and contribute to a growing dataset, enabling large-scale data collection and continuous model improvement. The method reaches 90% accuracy on a held-out test set, outperforming state-of-the-art approaches and generalizing well across variable surgical cases.

Link: https://arxiv.org/abs/2603.24897
Authors: Yan Meng,Jack Cook,X.Y. Han,Kaan Duman,Shauna Otto,Dhiraj Pangal,Jonathan Chainey,Ruth Lau,Margaux Masson-Forsythe,Daniel A. Donoho,Danielle Levy,Gabriel Zada,Sébastien Froelich,Juan Fernandez-Miranda,Mike Chang
Institutions: Children's National Hospital; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
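Of the training-regime components mentioned above, focal loss is the most standard; a minimal binary version (Lin et al.'s formulation, with the common default alpha/gamma rather than the paper's settings) looks like:

```python
import math

# Binary focal loss: down-weights easy, well-classified examples so
# training focuses on hard / minority-class samples.

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """p: predicted probability of the positive class, y: 0/1 label."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)
```

With `gamma=0` and `alpha=1` this reduces to plain cross-entropy; increasing `gamma` suppresses the contribution of confidently correct predictions, which is why it helps with class imbalance across surgical phases.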

[CV-131] OptiSAR-Net: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

[Quick Read]: This paper addresses the limitation that existing remote sensing visual grounding (RSVG) methods are restricted to single-sensor domains (optical-only or SAR-only), which constrains real-world applicability. To tackle cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, the proposed OptiSAR-Net++ makes three key moves: a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling; a CLIP-based contrastive paradigm with dynamic adversarial negative sampling that turns generative regression into efficient cross-modal matching, reducing the computational overhead of Transformer decoding; and a text-guided dual-gate fusion module (TGDF-SSA) plus a region-aware auxiliary head that strengthen semantic-visual alignment and spatial modeling. Experiments show state-of-the-art performance on both the OptSAR-RSVG and DIOR-RSVG benchmarks, with clear gains in localization accuracy and efficiency.

Link: https://arxiv.org/abs/2603.24876
Authors: Xiaoyu Tang,Jun Dong,Jintao Cheng,Rui Fan
Institutions: South China Normal University; Hong Kong University of Science and Technology; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.

[CV-132] Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration

[Quick Read]: This paper addresses the inefficiency, cost, and safety risks of manually inspecting smoke detectors, which are often mounted in high, hard-to-reach locations. The key is an automatic recognition system designed for future drone integration: three mainstream object detectors (YOLOv11, SSD, and the Transformer-based RT-DETRv2) are compared, together with training strategies combining real and semi-synthetic data and several image augmentation methods, to improve robustness in difficult conditions (motion blur, small resolution, incomplete objects, etc.). Experiments show that YOLOv11n is the best detector, reaching 0.884 mAP@0.5, confirming the effectiveness and practicality of the approach.

Link: https://arxiv.org/abs/2603.24850
Authors: Lukas Kratochvila,Jakub Stefansky,Simon Bilik,Robert Rous,Tomas Zemcik,Michal Wolny,Frantisek Rusnak,Ondrej Cech,Karel Horak
Institutions: Brno University of Technology; VSB - Technical University of Ostrava; Institute for Research and Applications of Fuzzy Modeling, University of Ostrava; Mendel University in Brno
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Fire safety is a complex pipeline and an important topic of concern. At its front line are the smoke detectors, which are expected to raise an alarm before a massive fire develops. As they are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial: it could allow faster revisions, spare workers dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of such an automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolution-based object detectors widely used on embedded devices, YOLOv11 and SSD, together with the state-of-the-art Transformer-based RT-DETRv2, each with backbones of different sizes. Because collecting a sufficient amount of training data in the real-world environment is complicated, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. To achieve robust testing, all models were evaluated on two test datasets covering both expected and difficult appearances of the smoke detectors, including motion blur, small resolution, and incomplete objects. The best-performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.

[CV-133] CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment

[Quick Read]: This paper targets two obstacles in non-invasive coronary artery disease (CAD) assessment: the scarcity of expert-annotated data, which limits clinical translation of deep learning, and the bias of label-free pretraining strategies (e.g., masked image modeling) toward global anatomical statistics, which leaves localized lesion features poorly captured. The key is a pathology-centric, synthesis-driven self-supervised framework behind CORA, a 3D vision foundation model: an anatomy-guided lesion synthesis engine explicitly trains the model to detect simulated vascular abnormalities, focusing representation learning on clinically relevant disease features rather than dominant background anatomy. Trained on 12,801 unlabeled CCTA volumes and validated on multi-center data from nine independent hospitals, CORA consistently outperforms state-of-the-art 3D vision foundation models on plaque characterization, stenosis detection, and coronary artery segmentation, with gains of up to 29%; coupled with a large language model, the extended multimodal framework significantly improves 30-day major adverse cardiac event (MACE) risk stratification.

Link: https://arxiv.org/abs/2603.24847
Authors: Jinkui Hao,Gorkem Durak,Halil Ertugrul Aktas,Ulas Bagci,Bradley D. Allen,Nilay S. Shah,Bo Zhou
Institutions: Northwestern University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain. Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.

[CV-134] NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

[Quick Read]: This paper examines the reliability and performance trade-offs of multimodal large language models (MLLMs) for diagnostic support on 2D neuroimaging (MRI and CT) across multiple neurological disorders (multiple sclerosis, stroke, brain tumors, and others). The key is a comprehensive benchmarking framework covering four dimensions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency, with a multi-phase design that controls for selection bias and enables fair comparison of twenty frontier MLLMs. Results show that technical imaging attributes (modality and anatomical plane) are nearly solved, while diagnostic reasoning, especially subtype prediction, remains challenging: tumor classification is the most reliable task and rare abnormalities the hardest. Few-shot prompting improves several models but significantly raises latency and cost; Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, Gemini-2.5-Flash offers the best efficiency-performance trade-off, and the open-weight MedGemma-1.5-4B shows potential approaching proprietary models.

Link: https://arxiv.org/abs/2603.24846
Authors: Katarina Trojachanec Dineva,Stefan Andonov,Ilinka Ivanoska,Ivan Kitanovski,Sasho Gramatikov,Tamara Kostova,Monika Simjanoska Misheva,Kostadin Mishev
Institutions: Ss. Cyril and Methodius University of Skopje; Faculty of Computer Science and Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 53 pages, 12 figures. Manuscript submitted to the BMC Medical Informatics and Decision Making journal

Click to view abstract

Abstract:Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.

[CV-135] WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching

[Quick Read]: This paper addresses the computational inefficiency introduced by cost-volume designs in stereo matching, arguing that cost volumes are not necessary for strong performance. The key is WAFT-Stereo, a simple and effective warping-based method that replaces cost-volume construction with image warping, matching or exceeding accuracy while significantly improving efficiency: it ranks first on the ETH3D, KITTI, and Middlebury public benchmarks while running 1.8-6.7x faster than competitive methods.

Link: https://arxiv.org/abs/2603.24836
Authors: Yihan Wang,Jia Deng
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary for strong performance and can be replaced by warping with improved efficiency. WAFT-Stereo ranks first on ETH3D, KITTI and Middlebury public benchmarks, reducing the zero-shot error by 81% on ETH3D benchmark, while being 1.8-6.7x faster than competitive methods. Code and model weights are available at this https URL.
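The warping operation that replaces cost volumes can be illustrated on a single 1-D scanline: sample the right image at `x - d(x)` with linear interpolation. This toy sketch is illustrative only; WAFT-Stereo operates on full images with learned iterative refinement:

```python
# Toy 1-D disparity warping: resample the right scanline at x - d(x)
# using linear interpolation, instead of building a cost volume over
# all candidate disparities.

def warp_scanline(right, disparity):
    warped = []
    n = len(right)
    for x, d in enumerate(disparity):
        pos = x - d
        x0 = max(0, min(n - 2, int(pos)))  # clamp to valid range
        w = pos - x0
        warped.append((1 - w) * right[x0] + w * right[x0 + 1])
    return warped
```

Comparing the warped right scanline against the left one gives a per-pixel residual to refine the disparity, with cost linear in image size rather than linear in the number of disparity candidates.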

[CV-136] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation

[Quick Read]: This paper addresses the limited scalability of existing video diffusion models (VDMs) for long-trajectory video generation, as well as the visual drift and poor controllability of autoregressive models over long rollouts. The key is DCARL, a divide-and-conquer, autoregressive framework with two components: a dedicated Keyframe Generator, trained without temporal compression, establishes globally consistent structural anchors; an Interpolation Generator then synthesizes dense frames autoregressively with overlapping segments, using the keyframes for global context and a single clean preceding frame for local coherence. This design combines the structural stability of divide-and-conquer schemes with the high fidelity of VDMs, achieving superior visual quality and camera-trajectory consistency on long videos of up to 32 seconds.

Link: https://arxiv.org/abs/2603.24835
Authors: Junyi Ouyang,Wenbin Teng,Gonglin Chen,Yajie Zhao,Haiwei Chen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 11 figures. Project page: this https URL

Click to view abstract

Abstract:Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.
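At the index level, the divide-and-conquer rollout can be sketched as keyframe anchors plus overlapping dense segments. The stride below is made up; only the bookkeeping (one shared clean frame at each segment boundary) mirrors the described design:

```python
# Toy scheduling sketch: keyframes anchor the long trajectory, and each
# interpolation segment starts at the previous segment's last frame,
# which serves as the single clean context frame.

def plan_segments(num_frames, keyframe_stride):
    keyframes = list(range(0, num_frames, keyframe_stride))
    segments = []
    for a, b in zip(keyframes, keyframes[1:]):
        segments.append(list(range(a, b + 1)))  # inclusive: overlap at b
    return keyframes, segments
```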

[CV-137] Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting CVPR2026

[Quick Read]: This paper addresses cross-paradigm adversarial attacks: compromising crowd-counting models built on both mainstream architectures, density maps and point regression, without relying on the target model's specific structure. The key is a multi-task loss optimization framework that combines scene-density-specific high-confidence logit suppression (for point-regression models) with peak-targeted density map suppression (for density-map models), plus model-agnostic perceptual constraints that keep the perturbations imperceptible to the human eye. Experiments on seven state-of-the-art models show an average 7x increase in Mean Absolute Error and transfer ratios of 0.55-1.69, demonstrating an effective and stealthy transferable attack.

Link: https://arxiv.org/abs/2603.24821
Authors: Alabi Mehzabin Anisha,Guangjing Wang,Sriram Chellappan
Institutions: University of South Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPR 2026 Main Conference

Click to view abstract

Abstract:State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field’s security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at this https URL

[CV-138] Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators

[Quick Read]: This paper addresses the early identification and classification of infections at pin sites, the skin interface of the metal pins/wires of external fixators, a common clinical complication that causes patient pain and increases the care burden. The key is an attention-based deep learning (DL) model that focuses on the critical region around the pin-skin interface, suppressing distractions from the metal pin structure, together with an Efficient Redundant Reconstruction Convolution (ERRC) module that enriches feature maps while reducing parameters (only 5.77 M). The model classifies pin-site wound images into Group A (signs of inflammation or infection) and Group B (no evident complications), achieving an AUC of 0.975 and an F1-score of 0.927, outperforming baseline models and showing promise for clinical decision support.

Link: https://arxiv.org/abs/2603.24815
Authors: Yubo Wang,Marie Fridberg,Anirejuoritse Bafor,Ole Rahbek,Christopher Iobst,Søren Vedding Kold,Ming Shen
Institutions: Aalborg University; Aalborg University Hospital; Nationwide Children's Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity to the patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. For this, this paper collects and produces a dataset on pin sites wound infections and proposes a deep learning (DL) method to classify pin sites images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77 M parameters. These results highlight the potential of DL in differentiating pin sites only based on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.

[CV-139] GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

【速读】:该论文旨在解决大规模视觉语言模型(Vision-Language Models, VLMs)对超大规模数据集(通常需数亿样本)的依赖问题,从而降低训练门槛并提升数据效率。其核心挑战在于现有对比预训练方法存在多方面缺陷,而单一改进手段难以全面优化性能。解决方案的关键在于提出GoldiCLIP框架,基于“金发姑娘原则”(Goldilocks principle)在多种监督信号之间寻求恰到好处的平衡:首先采用文本条件自蒸馏方法对齐文本无关与文本相关的特征;其次引入集成解码器的视觉问答(Visual Question Answering, VQA)目标,使编码器能泛化至非描述性查询;最后设计基于不确定性的损失加权机制,自动调节异构损失之间的平衡。该方法仅用3000万图像(仅为主流方法数据量的1/300)即达到数据高效范式下的最先进性能,并在多个任务上显著优于基线模型。

链接: https://arxiv.org/abs/2603.24804
作者: Deen Dayal Mohan,Hossein Souri,Vitali Petsiuk,Juhong Min,Gopal Sharma,Luowei Zhou,Suren Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Latest works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder integrated decoder with Visual Question Answering (VQA) objective that enables the encoder to generalize beyond the caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: this https URL.
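
摘要中第(3)点的不确定性加权,一种常见实现是多任务学习中的同方差不确定性加权:总损失取 Σ_i exp(-s_i)·L_i + s_i,其中 s_i 为可学习的对数方差。以下给出该机制的纯Python极简示意,说明在最优点每项损失的有效权重自动收敛为 1/L_i;损失数值均为假设,并非论文的实际实现:

```python
import math

def weighted_total(losses, log_vars):
    # Homoscedastic-uncertainty weighting: sum_i exp(-s_i) * L_i + s_i,
    # where each s_i is a learnable log-variance.
    return sum(math.exp(-s) * L + s for L, s in zip(losses, log_vars))

def optimal_log_vars(losses):
    # Setting d/ds_i = -exp(-s_i) * L_i + 1 = 0 gives s_i = log(L_i),
    # so each loss ends up scaled by exp(-s_i) = 1 / L_i.
    return [math.log(L) for L in losses]

# Hypothetical loss magnitudes for three heterogeneous objectives
# (contrastive / self-distillation / VQA); not taken from the paper.
losses = [2.0, 0.5, 8.0]
s_opt = optimal_log_vars(losses)
weights = [math.exp(-s) for s in s_opt]
```

可见量级越大的损失获得的权重越小,从而无需手工调节各项系数。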

[CV-140] Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

【速读】:该论文旨在解决复杂腹主动脉瘤(Abdominal Aortic Aneurysm, AAA)的计算机断层扫描(Computed Tomography, CT)图像分割中模型易将注意力集中在无关结构或忽略细小、低对比度目标的问题。其解决方案的关键在于提出一种可解释人工智能(Explainable AI, XAI)引导的编码器结构优化框架:通过从最终编码器块计算密集的基于归因的编码器关注图(称为“XAI场”),并以两种互补方式利用该场——一是将预测概率质量对齐至XAI场以增强关注区域与输出的一致性;二是将该场引入轻量级细化路径和置信度先验,在推理阶段调制logits,从而抑制干扰项并保留细微结构。该方法的核心贡献在于将归因引导机制整合到特征表示与解码过程中,而非仅依赖传统损失函数作为控制信号,显著提升了复杂场景下的分割可靠性。

链接: https://arxiv.org/abs/2603.24801
作者: Abu Noman Md Sakib,Merjulah Roby,Zijie Zhang,Satish Muluk,Mark K. Eskandari,Ender A. Finol
机构: University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校); Drexel University (德雷塞尔大学); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map (“XAI field”) from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.
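
摘要并未给出“将预测概率质量对齐至XAI场”这一项的具体形式;下面用KL散度给出一种可能的对齐损失的纯Python示意,预测图与关注场都是虚构的小例子,仅用于说明“关注与输出一致则损失趋近于零”的性质:

```python
import math

def normalize(field):
    # Turn a non-negative 2D map into a probability distribution.
    total = sum(sum(row) for row in field)
    return [[v / total for v in row] for row in field]

def alignment_loss(prob_map, xai_field, eps=1e-8):
    # KL(P_pred || P_xai): penalizes probability mass placed where the
    # encoder's attribution field shows little focus.
    p = normalize(prob_map)
    q = normalize(xai_field)
    return sum(
        p[i][j] * math.log((p[i][j] + eps) / (q[i][j] + eps))
        for i in range(len(p)) for j in range(len(p[0]))
        if p[i][j] > 0
    )

pred = [[0.9, 0.1], [0.1, 0.9]]    # predicted foreground probabilities
focus = [[0.9, 0.1], [0.1, 0.9]]   # hypothetical attribution-based focus map
```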

[CV-141] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

【速读】:该论文旨在解决扩散 Transformer (Diffusion Transformers, DiTs) 在生成任务中性能提升受限的问题,特别是其在去噪过程中的优化潜力尚未被充分挖掘。解决方案的关键在于提出一种轻量级且参数高效的校准方法——Calibri,该方法通过引入一个可学习的缩放参数来优化 DiT 模块,并将校准过程建模为黑箱奖励优化问题,利用进化算法进行高效求解,仅需修改约 100 个参数即可显著提升生成质量并减少推理步数。

链接: https://arxiv.org/abs/2603.24800
作者: Danil Tokhchukov,Aysel Mirzoeva,Andrey Kuznetsov,Konstantin Sobolev
机构: Moscow State University (莫斯科国立大学); FusionBrain Lab, AXXX
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026, Project page: this https URL

点击查看摘要

Abstract:In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
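
Calibri把校准表述为黑箱奖励优化并用进化算法求解。以下是对单个缩放参数运行(1+1)进化策略的纯Python示意;奖励函数是假设的玩具函数,并非论文中基于生成质量的奖励:

```python
import random

def evolve_scale(reward, s0=1.0, sigma=0.1, steps=200, seed=0):
    # (1+1) evolution strategy: perturb the current scale with Gaussian
    # noise and keep the mutant whenever the black-box reward improves.
    # Calibri tunes ~100 such parameters; one scalar illustrates the loop.
    rng = random.Random(seed)
    s, best = s0, reward(s0)
    for _ in range(steps):
        cand = s + rng.gauss(0.0, sigma)
        r = reward(cand)
        if r > best:
            s, best = cand, r
    return s

def toy_reward(s):
    # Hypothetical reward peaked at scale 1.3, standing in for an
    # image-quality score queried from the generation pipeline.
    return -(s - 1.3) ** 2

s_star = evolve_scale(toy_reward)
```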

[CV-142] AVControl: Efficient Framework for Training Audio-Visual Controls

【速读】:该论文旨在解决生成式 AI(Generative AI)在视频与音频生成过程中对多模态控制(如深度、姿态、相机轨迹和音频变换等)支持不足的问题,现有方法要么依赖单一固定模态的庞大模型,要么需为每种新模态引入复杂的架构改动。其解决方案的关键在于提出 AVControl 框架,基于 LTX-2 这一联合音视频基础模型,通过为每个控制模态独立训练 LoRA(Low-Rank Adaptation)适配器,并将参考信号作为额外 token 注入注意力层,实现无需修改主干架构即可灵活扩展多种控制模态的能力,从而显著提升效率与可扩展性。

链接: https://arxiv.org/abs/2603.24793
作者: Matan Ben-Yosef,Tavi Halperin,Naomi Ken Korem,Mohammad Salama,Harel Cain,Asaf Joseph,Anthony Chen,Urska Jelercic,Ofir Bibi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: Project page: this https URL

点击查看摘要

Abstract:Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
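
AVControl为每种控制模态单独训练LoRA适配器,骨干权重保持冻结。以下为通用LoRA前向计算 y = Wx + α·B(Ax) 的纯Python示意;矩阵维度与数值均为虚构的小例子,与LTX-2的真实结构无关:

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=1.0):
    # Frozen base projection plus a trainable low-rank update:
    # y = W x + alpha * B (A x). Only A and B are trained per control
    # modality; the backbone weight W stays untouched.
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (identity for illustration)
A = [[1.0, 1.0]]               # rank-1 down-projection (2 -> 1)
B = [[0.5], [0.0]]             # rank-1 up-projection (1 -> 2)
y = lora_forward([2.0, 3.0], W, A, B)
```

当 A、B 置零时前向退化为冻结骨干本身,这正是适配器可独立增删、互不干扰的原因。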

[CV-143] DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects

【速读】:该论文旨在解决从随意拍摄的视频中重建动态场景时面临的病态性问题(ill-posedness),尤其在极端新视角下难以准确重建高度关节运动物体的问题。现有方法依赖2D基础模型提取先验或手工设计正则化约束优化运动,但效果有限。解决方案的关键在于提出DRoPS框架,该框架利用动态物体的静态预扫描(pre-scan)作为显式的几何与外观先验,其核心创新为两点:一是通过将高斯原始体(Gaussian primitives)组织成锚定于物体表面的像素网格结构,构建网格化且与表面对齐的模型;二是利用该网格结构设计CNN参数化运动场,从而注入强隐式正则化并关联邻近点的运动,有效约束解空间并保障序列内几何一致性。

链接: https://arxiv.org/abs/2603.24770
作者: Narek Tumanyan,Samuel Rota Bulò,Denis Rozumny,Lorenzo Porzi,Adam Harley,Tali Dekel,Peter Kontschieder,Jonathon Luiten
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Dynamic scene reconstruction from casual videos has seen recent remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.

[CV-144] Synthetic Cardiac MRI Image Generation using Deep Generative Models

【速读】:该综述旨在解决心脏磁共振成像(Cardiac MRI, CMRI)数据稀缺问题,尤其是在标注数据不足、不同设备厂商间图像差异显著以及生成模型可能引发隐私泄露等挑战下,如何高效生成高质量、结构准确且符合临床需求的合成CMRI图像。文章梳理的关键技术路线包括:采用基于掩码条件生成(mask-conditioned generation)的方法提升解剖结构保真度,结合扩散模型与流匹配(flow-matching)技术实现边界清晰和高效确定性变换,通过厂商风格条件化(vendor-style conditioning)和强度归一化预处理增强跨域泛化能力,并引入成员推理攻击、最近邻分析及差分隐私机制保障数据隐私安全;同时以下游分割性能作为核心效用评估指标,验证合成数据在多厂商场景下的准确性与鲁棒性,并从保真度、效用与隐私三个维度系统比较现有方法、指出当前局限。

链接: https://arxiv.org/abs/2603.24764
作者: Ishan Kumarasinghe,Dasuni Kawya,Madhura Edirisooriya,Isuri Devindi,Isuru Nawinne,Vajira Thambawita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 2 figures, Preprint

点击查看摘要

Abstract:Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.

[CV-145] Light Cones For Vision: Simple Causal Priors For Visual Hierarchy ICLR

【速读】:该论文旨在解决标准视觉模型无法捕捉对象内部层次结构(如整体与部分关系)的问题,这类模型通常将物体视为欧几里得空间中的独立点,缺乏对时空轨迹和因果依赖的建模能力。其解决方案的关键在于引入世界线槽注意力机制(Worldline Slot Attention),将物体表示为时空中的持续轨迹(worldlines),并在不同层次上共享空间位置但具有不同的时间坐标;更重要的是,采用洛伦兹几何(Lorentzian geometry)而非欧几里得或双曲几何来建模这些轨迹,从而显式编码非对称因果关系(asymmetric causality),这种几何结构天然适合描述视觉层次中的时序依赖性。实验表明,基于洛伦兹世界线的模型在三个数据集上准确率显著提升(0.479–0.661),相比欧几里得世界线(0.078)有6倍改进,验证了因果结构作为归纳偏置对层次化物体发现的重要性。

链接: https://arxiv.org/abs/2603.24753
作者: Manglam Kartik,Neel Tushar Shah
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR GRaM Workshop 2026

点击查看摘要

Abstract:Standard vision models treat objects as independent points in Euclidean space, unable to capture hierarchical structure like parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime (worldlines), where each object has multiple slots at different hierarchy levels sharing the same spatial position but differing in temporal coordinates. This architecture consistently fails without geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479-0.661 across three datasets: a 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings, showing that visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved with only 11K parameters. The code is available at: this https URL.
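
论文主张视觉层次需要洛伦兹几何所编码的非对称因果结构。下面用闵可夫斯基平方间隔(签名取(-,+,+))给出“光锥内/外”判定的纯Python示意;事件坐标均为假设,仅演示论文所依赖的几何性质,与模型实现无关:

```python
def minkowski_interval(e1, e2):
    # Squared spacetime interval with signature (-, +, +):
    # negative => timelike (inside the light cone), positive => spacelike.
    dt = e2[0] - e1[0]
    dxs = [b - a for a, b in zip(e1[1:], e2[1:])]
    return -dt * dt + sum(d * d for d in dxs)

def causally_related(e1, e2):
    # On or inside the light cone: the asymmetric whole -> part
    # dependency that worldline slots encode via temporal ordering.
    return minkowski_interval(e1, e2) <= 0

whole = (0.0, 0.0, 0.0)   # hypothetical "whole object" slot: (t, x, y)
part  = (1.0, 0.3, 0.4)   # "part" slot: later time, nearby position
far   = (0.1, 5.0, 0.0)   # spacelike-separated slot: no causal relation
```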

[CV-146] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval CVPR2026

【速读】:该论文旨在解决多模态信息融合下的复杂图像检索问题,即在数字取证、城市监测和环境分析等场景中,如何实现基于地理位置(geolocation)与时间(time)双重约束的精准图像检索。传统方法通常仅依赖视觉相似性进行图像匹配,难以应对同一地点因季节、天气或光照变化导致的外观差异。为此,作者提出TIGeR模型,其核心创新在于构建一个统一的地理-时间嵌入空间(geo-temporal embedding space),通过多模态Transformer架构将图像、地理坐标和时间信息映射至同一语义空间,从而支持单模态或多模态查询,并同时完成地理定位、时间预测及地理-时间感知检索任务。该方案的关键优势在于有效保留了位置身份(location identity)在大范围视觉变化下的稳定性,使检索依据“何处何时”而非单纯“看起来像什么”,显著提升了跨时序和跨视角场景下的检索性能,在多个指标上优于现有最先进方法。

链接: https://arxiv.org/abs/2603.24749
作者: David G. Shatwell,Sirnam Swetha,Mubarak Shah
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026

点击查看摘要

Abstract:Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% on geo-time-aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.

[CV-147] OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video

【速读】:该论文旨在解决传统生物力学评估方法在临床应用中因依赖昂贵、耗时的实验室设备(如标记式运动捕捉系统和测力台)而难以规模化的问题,从而限制了对移动相关疾病(如骨关节炎和虚弱症)的预测、治疗与监测。其解决方案的关键在于提出一种名为OpenCap Monocular的算法,该算法通过单个智能手机视频即可估计三维骨骼运动学(kinematics)与动力学(kinetics),核心创新包括:基于单目姿态估计模型(WHAM)的优化精修以提升3D人体姿态估计精度;利用生物力学约束的骨骼模型计算运动学参数;并通过物理仿真与机器学习联合建模来估算肌骨力(kinetics),如地面反作用力及膝关节力矩。验证结果显示,该方法在步行、深蹲和坐站转换任务中均表现出高准确性,尤其在旋转和位移误差上显著优于仅用回归的计算机视觉基线,且在关键临床指标(如膝关节伸展力矩和内收力矩)上的估计达到可临床采纳的水平。

链接: https://arxiv.org/abs/2603.24733
作者: Selim Gilon,Emily Y. Miller,Scott D. Uhlrich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system. We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (this https URL), enabling free, accessible single-smartphone biomechanical assessments.

[CV-148] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception

【速读】:该论文旨在解决视觉模型(如CLIP)与人类感知在概念边界划分上存在差异的问题,即当图像语义模糊时,人类和机器如何定义“鸭子”与“兔子”等概念的界限不一致。其解决方案的关键在于提出一种基于心理物理学启发的框架,通过在CLIP嵌入空间中插值生成连续的语义模糊图像谱,从而精确测量人类和机器分类器在概念边界上的位置差异。该方法揭示了机器分类器更倾向于将模糊图像判为“兔子”,而人类则更贴近CLIP合成过程中的语义分布,且引导尺度(guidance scale)对人类感知敏感性的影响大于对机器分类器的影响,为理解人机对齐、模型可解释性和图像合成方法提供了新的诊断工具。

链接: https://arxiv.org/abs/2603.24730
作者: Yuqi Hu,Vasha DuTell,Ahna R. Girshick,Jennifer E. Corbett
机构: University of California, Berkeley (加州大学伯克利分校); Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
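
论文在CLIP嵌入空间中对两个概念嵌入做插值以生成连续的模糊图像谱,插值的具体形式摘要并未说明。球面插值(slerp)是嵌入空间插值的一种常见选择,以下为二维单位向量上的纯Python示意,"duck"与"rabbit"的嵌入均为虚构:

```python
import math

def slerp(u, v, t):
    # Spherical interpolation between two unit embeddings; a common
    # choice for semantically smoother paths in CLIP space than lerp.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    theta = math.acos(dot)
    if theta < 1e-6:
        return u
    s = math.sin(theta)
    return [(math.sin((1 - t) * theta) * a + math.sin(t * theta) * b) / s
            for a, b in zip(u, v)]

duck = [1.0, 0.0]    # stand-in unit embeddings; real CLIP vectors are 512/768-D
rabbit = [0.0, 1.0]
mid = slerp(duck, rabbit, 0.5)   # the maximally ambiguous midpoint
```

与线性插值不同,slerp的中间向量保持单位范数,避免插值路径穿过嵌入分布稀疏的低范数区域。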

[CV-149] Confidence-Based Mesh Extraction from 3D Gaussians

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在存在丰富视点依赖效应(view-dependent effects)的场景中进行网格提取时面临的歧义性问题,此类问题导致表面重建精度下降。现有方法依赖多视角技术、迭代网格提取或大型预训练模型来缓解歧义,但牺牲了3DGS固有的高效性。解决方案的关键在于引入一种自监督置信度框架(self-supervised confidence framework),其中可学习的置信度值动态平衡光度监督与几何监督;同时,通过设计惩罚每个基本单元颜色和法向方差的损失函数,并改进D-SSIM损失中的各项解耦策略,显著提升了无界网格(unbounded meshes)的重建质量,且保持了3DGS原有的高效率。

链接: https://arxiv.org/abs/2603.24725
作者: Lukas Radl,Felix Windisch,Andreas Kurz,Thomas Köhler,Michael Steiner,Markus Steinberger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.

[CV-150] Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

【速读】:该论文旨在解决当前基于外观的注视估计方法(如深度卷积神经网络)存在的计算复杂度高、可解释性差的问题,同时探索基于面部关键点的几何方法在现代基准测试中的性能上限与泛化能力。其解决方案的关键在于构建一个标准化的数据预处理流程,从三个大规模数据集(Gaze360、ETH-XGaze 和 GazeGene)中提取并归一化面部关键点,并训练轻量级回归模型,包括极端梯度提升树(XGBoost)、全连接多层感知机(MLP)以及一种用于捕捉双眼几何关系的孪生 MLP 架构。实验表明,尽管在同域评估中因关键点检测噪声导致性能略低,但所提 MLP 模型在跨域评估中展现出与 ResNet18 基线相当的泛化能力,证明稀疏几何特征足以支持鲁棒的注视估计,为高效、可解释且隐私友好的边缘应用场景提供了可行路径。

链接: https://arxiv.org/abs/2603.24724
作者: Daniele Agostinelli,Thomas Agostinelli,Andrea Generosi,Maura Mengoni
机构: Università Politecnica delle Marche (马尔凯理工大学); Università Pegaso (佩加索大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as “black boxes”, offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at this https URL.

[CV-151] Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models CVPR2026

【速读】:该论文旨在解决3D场景中空间推理(spatial reasoning)任务的训练难题,特别是由于缺乏充足的3D场景与语言配对数据,导致模型难以从零开始学习强推理能力。现有方法要么在输入空间中注入3D场景表示以利用大语言模型(LLM)的预训练理解能力,但其编码绝对位置的方式难以从过早融合的特征中提取空间关系;要么显式编码所有对象间的成对空间关系作为输入token,但这类方法所需输入token的数量与对象数量成平方关系,存在严重可扩展性问题。解决方案的关键在于提出QuatRoPE——一种输入长度与对象数量呈线性关系的位置嵌入方法,通过注意力层中的点积运算显式计算对象间的空间关系,同时以全向量形式编码3D坐标,确保空间一致性并保持场景几何完整性;此外引入孤立门控RoPE扩展(IGRE),将QuatRoPE的影响限制在与物体相关的token上,从而最小化对LLM原有位置嵌入的干扰,保留其原始能力。

链接: https://arxiv.org/abs/2603.24721
作者: Shengli Zhou,Minghang Zheng,Feng Zheng,Yang Liu
机构: Southern University of Science and Technology (南方科技大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE’s holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene’s geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE’s influence to object-related tokens, thereby minimizing interference with the LLM’s existing positional embeddings and maintaining the LLM’s original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at this https URL.
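
QuatRoPE“通过注意力层中的点积显式计算空间关系”的做法,建立在RoPE的一个关键性质之上:旋转编码后的点积只依赖相对位置。下面以标准一维RoPE(对二维特征做平面旋转)为例,用纯Python验证这一性质;QuatRoPE将其经四元数推广到三维坐标,此处不作展开:

```python
import math

def rotate(vec, pos, freq=1.0):
    # RoPE on a 2-D feature: rotate by an angle proportional to position.
    theta = freq * pos
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 2.0], [0.5, -1.0]
# Attention scores depend only on the relative offset pos_q - pos_k:
s1 = dot(rotate(q, 3.0), rotate(k, 1.0))   # offset = 2
s2 = dot(rotate(q, 7.0), rotate(k, 5.0))   # offset = 2
s3 = dot(rotate(q, 2.0), k)                # R(2)q . k = q . R(-2)k, offset = 2
```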

[CV-152] Accurate Point Measurement in 3DGS – A New Alternative to Traditional Stereoscopic-View Based Measurements

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在几何测量中的应用潜力未被充分挖掘的问题,特别是其在高精度点测量方面的局限性。传统方法依赖于多视图立体(Multi-View Stereo, MVS)点云或网格进行测量,但这些方法常受限于不完整、不准确的几何结构以及对专业立体工作站和操作员立体视觉能力的依赖。论文提出的关键解决方案是:利用3DGS生成的高质量、连续插值的视图,允许用户直观地在不同视角中选取对应点(congruent points),并通过三角测量法精确计算三维坐标。该方法无需专用立体设备,且支持多视角交点(>2视图)以提升精度,从而显著优于直接基于网格的测量方式,在薄结构和锐边等挑战场景下表现尤为突出。

链接: https://arxiv.org/abs/2603.24716
作者: Deyan Deng,Rongjun Qin
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 2026 ISPRS Congress

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has revolutionized real-time rendering with its state-of-the-art novel view synthesis, but its utility for accurate geometric measurement remains underutilized. Compared to multi-view stereo (MVS) point clouds or meshes, 3DGS rendered views present superior visual quality and completeness. However, current point measurement methods still rely on demanding stereoscopic workstations or direct picking on often-incomplete and inaccurate 3D meshes. As a novel view synthesizer, 3DGS renders exact source views and smoothly interpolates in-between views. This allows users to intuitively pick congruent points across different views while operating 3DGS models. By triangulating these congruent points, one can precisely generate 3D point measurements. This approach mimics traditional stereoscopic measurement but is significantly less demanding: it requires neither a stereo workstation nor specialized operator stereoscopic capability. Furthermore, it enables multi-view intersection (more than two views) for higher measurement accuracy. We implemented a web-based application to demonstrate this proof-of-concept (PoC). Using several UAV aerial datasets, we show this PoC allows users to successfully perform highly accurate point measurements, achieving accuracy matching or exceeding traditional stereoscopic methods on standard hardware. Specifically, our approach significantly outperforms direct mesh-based measurements. Quantitatively, our method achieves RMSEs in the 1-2 cm range on well-defined points. More critically, on challenging thin structures where mesh-based RMSE was 0.062 m, our method achieved 0.037 m. On sharp corners poorly reconstructed in the mesh, our method successfully measured all points with a 0.013 m RMSE, whereas the mesh method failed entirely. Code is available at: this https URL.
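
该方法的核心一步是对用户在多视角中选取的同名点做三角交会。以下为最简的两视线交会(取两条射线最近点连线的中点)的纯Python示意;真实流程还包含相机位姿估计、畸变校正与多视(>2)交会,此处从略,坐标亦为构造的例子:

```python
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def sub(a, b): return [x - y for x, y in zip(a, b)]

def triangulate(p1, d1, p2, d2):
    # Midpoint of the shortest segment between two viewing rays
    # (camera center p, direction d): a minimal two-view triangulation.
    r = sub(p2, p1)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    e, f = dot(d1, r), dot(d2, r)
    den = a * c - b * b          # zero only for parallel rays
    t = (e * c - b * f) / den
    s = (b * e - a * f) / den
    q1 = [p + t * d for p, d in zip(p1, d1)]
    q2 = [p + s * d for p, d in zip(p2, d2)]
    return [(u + v) / 2 for u, v in zip(q1, q2)]

# Two cameras whose rays intersect exactly at the point (1, 1, 5).
X = triangulate([0, 0, 0], [1, 1, 5], [2, 0, 0], [-1, 1, 5])
```

当拾取点带噪声时两条射线不再严格相交,中点即为最小二乘意义下的折中;增加视角数可进一步压低该误差,这正是摘要中多视交会提高精度的原因。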

[CV-153] Lookalike3D: Seeing Double in 3D

【速读】:该论文旨在解决室内场景中重复物体(lookalike objects)信息被忽视的问题,这类物体在真实场景中普遍存在,但现有3D物体理解与生成方法通常未加以利用。其核心挑战在于如何基于多视角图像准确区分物体对为“完全相同”、“相似”或“不同”。解决方案的关键在于提出Lookalike3D——一种多视角图像Transformer模型,通过引入大型图像基础模型(image foundation models)中的强语义先验,有效捕捉重复物体间的互补线索;同时构建了包含76k人工标注样本的3DTwins数据集,显著提升检测性能(IoU提升104%),并验证了该方法在联合3D重建与部件共分割等下游任务中的有效性,从而将重复物体转化为高质量3D感知的强大线索。

链接: https://arxiv.org/abs/2603.24713
作者: Chandan Yeshwanth,Angela Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL , Video: this https URL

点击查看摘要

Abstract:3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.

[CV-154] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration CVPR2026

【速读】:该论文旨在解决当前多模态视觉-语言模型(Vision-Language Models, VLMs)在行星科学领域应用受限的问题,其核心瓶颈在于缺乏大规模、高质量的行星影像与科学描述配对数据集。为应对这一挑战,作者提出了一种专用于月球表面与次表层(subsurface)特征识别的视觉-语言模型——LLaVA-LE,并构建了首个大规模多模态月球数据集LUCID(LUnar Caption Image Dataset),包含96k高分辨率全色图像及详细地形描述文本,以及81k基于图像的问答对。解决方案的关键在于:首先通过两阶段训练策略实现领域特定的语义对齐(concept alignment)与指令微调(instruction tuning),从而提升模型对月球场景的理解能力;其次设计多层次推理评估基准以系统验证模型性能。实验表明,LLaVA-LE相比基础LLaVA模型整体性能提升3.3倍,且推理得分超过GPT/Gemini评委模型自身的参考分数,验证了领域专用多模态数据与精细化训练策略的有效性。

链接: https://arxiv.org/abs/2603.24696
作者: Gokce Inal,Pouyan Navard,Alper Yilmaz
机构: The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in AI4Space Workshop CVPR2026. Website: this https URL , Dataset: this https URL

点击查看摘要

Abstract:Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset) consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge’s own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at this https URL.

[CV-155] Amplified Patch-Level Differential Privacy for Free via Random Cropping

[Quick Read]: This paper addresses the insufficiency of privacy protection when training computer vision models with differential privacy (DP), particularly given that sensitive content in images (such as faces or license plates) is often spatially localized. The key idea is to exploit the randomness inherent in random cropping as an additional privacy amplification mechanism: when the sensitive region sits at a fixed spatial location, a random crop excludes it from the model input with some probability. This motivates a new patch-level neighboring relation and a tighter privacy analysis of DP-SGD built on it. The mechanism requires no changes to model architecture or training procedure; by quantifying how the patch inclusion probability composes with minibatch sampling, it significantly improves the privacy-utility trade-off, yielding stronger privacy guarantees at no additional cost.

Link: https://arxiv.org/abs/2603.24695
Authors: Kaan Durmaz,Jan Schuchardt,Sebastian Schmidt,Stephan Günnemann
Affiliations: Technical University of Munich; Morgan Stanley
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at TMLR

Abstract:Random cropping is one of the most common data augmentation techniques in computer vision, yet the role of its inherent randomness in training differentially private machine learning models has thus far gone unexplored. We observe that when sensitive content in an image is spatially localized, such as a face or license plate, random cropping can probabilistically exclude that content from the model’s input. This introduces a third source of stochasticity in differentially private training with stochastic gradient descent, in addition to gradient noise and minibatch sampling. This additional randomness amplifies differential privacy without requiring changes to model architecture or training procedure. We formalize this effect by introducing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping. Our analysis quantifies the patch inclusion probability and shows how it composes with minibatch sampling to yield a lower effective sampling rate. Empirically, we validate that patch-level amplification improves the privacy-utility trade-off across multiple segmentation architectures and datasets. Our results demonstrate that aligning privacy accounting with domain structure and additional existing sources of randomness can yield stronger guarantees at no additional cost.
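
The amplification effect has a simple geometric core: a uniformly positioned crop misses a fixed sensitive patch with some probability, and that probability multiplies the minibatch sampling rate. Below is a minimal sketch of this calculation, as our own illustration of the idea rather than the paper's accounting; treating the two crop axes as independent is an assumption of the sketch.

```python
def overlap_probability(img_len, crop_len, patch_lo, patch_hi):
    """Probability that a uniformly placed 1-D crop of length crop_len
    overlaps the sensitive interval [patch_lo, patch_hi)."""
    positions = img_len - crop_len + 1  # possible crop start offsets
    hits = sum(1 for s in range(positions)
               if s < patch_hi and s + crop_len > patch_lo)
    return hits / positions

def effective_sampling_rate(batch_rate, p_x, p_y):
    """Compose patch inclusion with minibatch sampling: a record exposes its
    sensitive patch only when it is sampled AND the crop covers the patch
    along both axes (assumed independent in this sketch)."""
    return batch_rate * p_x * p_y

# 32x32 image, 24x24 crop, 4-pixel-wide patch in the top-left corner
p = overlap_probability(32, 24, 0, 4)        # 4 of 9 start offsets hit it
q_eff = effective_sampling_rate(0.01, p, p)  # lower than the raw 1% rate
```

Since the effective sampling rate is lower than the raw minibatch rate, the DP-SGD accountant can certify a stronger guarantee for the same training run.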

[CV-156] BCMDA: Bidirectional Correlation Maps Domain Adaptation for Mixed Domain Semi-Supervised Medical Image Segmentation

[Quick Read]: This paper targets two core challenges in mixed domain semi-supervised medical image segmentation (MiDSS): distributional differences between labeled and unlabeled data that hinder effective knowledge transfer, and inefficient learning from unlabeled data that causes severe confirmation bias. The proposed Bidirectional Correlation Maps Domain Adaptation (BCMDA) framework addresses both. Its knowledge transfer via virtual domain bridging (KTVDB) component constructs a distribution-aligned virtual domain, synthesizes virtual images with both fixed-ratio and progressive dynamic MixUp strategies, and applies dual bidirectional CutMix for initial and gradual knowledge transfer. A prototypical alignment and pseudo label correction (PAPLC) mechanism then performs bidirectional prototype alignment with learnable prototype cosine similarity classifiers, yielding smoother and more compact feature representations, and uses prototype-guided pseudo label correction to generate more reliable pseudo labels, effectively mitigating confirmation bias.

Link: https://arxiv.org/abs/2603.24691
Authors: Bentao Song,Jun Huang,Qingfeng Wang
Affiliations: Southwest University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Neural Networks

Abstract:In mixed domain semi-supervised medical image segmentation (MiDSS), achieving superior performance under domain shift and limited annotations is challenging. This scenario presents two primary issues: (1) distributional differences between labeled and unlabeled data hinder effective knowledge transfer, and (2) inefficient learning from unlabeled data causes severe confirmation bias. In this paper, we propose the bidirectional correlation maps domain adaptation (BCMDA) framework to overcome these issues. On the one hand, we employ knowledge transfer via virtual domain bridging (KTVDB) to facilitate cross-domain learning. First, to construct a distribution-aligned virtual domain, we leverage bidirectional correlation maps between labeled and unlabeled data to synthesize both labeled and unlabeled images, which are then mixed with the original images to generate virtual images using two strategies, a fixed ratio and a progressive dynamic MixUp. Next, dual bidirectional CutMix is used to enable initial knowledge transfer within the fixed virtual domain and gradual knowledge transfer from the dynamically transitioning labeled domain to the real unlabeled domains. On the other hand, to alleviate confirmation bias, we adopt prototypical alignment and pseudo label correction (PAPLC), which utilizes learnable prototype cosine similarity classifiers for bidirectional prototype alignment between the virtual and real domains, yielding smoother and more compact feature representations. Finally, we use prototypical pseudo label correction to generate more reliable pseudo labels. Empirical evaluations on three public multi-domain datasets demonstrate the superiority of our method, particularly showing excellent performance even with very limited labeled samples. Code available at this https URL.
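
For readers unfamiliar with the mixing operations the framework builds on, here is a generic 1-D sketch of MixUp and CutMix (an illustrative primer, not BCMDA's actual implementation): MixUp interpolates two inputs, while CutMix splices a region of one into the other.

```python
def mixup(x_a, x_b, lam):
    """Convex combination of two inputs with mixing ratio lam in [0, 1]."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]

def cutmix(x_a, x_b, lo, hi):
    """Paste the region [lo, hi) of x_b into x_a."""
    return x_a[:lo] + x_b[lo:hi] + x_a[hi:]

xa, xb = [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]
mixed = mixup(xa, xb, 0.75)     # dominated by x_a, a quarter of x_b
spliced = cutmix(xa, xb, 1, 3)  # middle region replaced by x_b
```

A "progressive dynamic" schedule, as described in the abstract, would vary `lam` (or the spliced region) over the course of training instead of keeping it fixed.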

[CV-157] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy ECCV2026

[Quick Read]: This paper tackles the instability of in-context learning (ICL) in unified multimodal models, where sensitivity to example selection and formatting, exacerbated by cross-modal interference and varying cognitive demands, makes ICL efficacy non-monotonic and highly task-dependent. The solution has three key parts: a six-level capability-oriented taxonomy that systematically categorizes the functional roles of demonstrations; a large-scale structured dataset (UniICL-760K) and benchmark (UniICL-Bench) that enable fine-grained diagnosis and controlled evaluation of ICL behavior; and a lightweight plug-and-play module, the Context-Adaptive Prototype Modulator (CAPM), which stabilizes few-shot adaptation through adaptive prototype modulation. Experiments show the approach outperforms larger-parameter multimodal large language model baselines on most understanding tasks.

Link: https://arxiv.org/abs/2603.24690
Authors: Yicheng Xu,Jiangning Zhang,Zhucun Xue,Teng Hu,Ran Yi,Xiaobin Hu,Yong Liu,Dacheng Tao
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at ECCV 2026

Abstract:In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at this https URL.

[CV-158] KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

[Quick Read]: This paper addresses a key difficulty in building digital twin environments for training and evaluating embodied AI: reliably fusing global point clouds predicted by transformer-based feedforward methods with locally reconstructed object meshes to obtain metrically consistent digital twins. The core problem is that existing methods produce point clouds with scale ambiguity and inconsistent coordinate conventions, preventing alignment with local object meshes. The key to the solution is a scale-aware 3D fusion framework that recovers real-world metric scale via a Vision-Language Model (VLM)-guided geometric anchor mechanism, together with a geometry-aware registration pipeline that explicitly enforces gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement, improving cross-network object alignment and geometric consistency.

Link: https://arxiv.org/abs/2603.24684
Authors: Quanyun Wu,Kyle Gao,Daniel Long,David A. Clausi,Jonathan Li,Yuhao Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.

[CV-159] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

[Quick Read]: This paper addresses the high computational cost of multimodal large language models caused by the large number of visual tokens. Existing approaches typically prune tokens after the vision-language projector, but such post-projection pruning operates on compressed representations and easily loses fine-grained spatial and semantic information. The proposed training-free method, ReDiPrune, instead selects informative visual tokens before projection, directly from the vision encoder outputs. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural changes, and can be seamlessly inserted between the encoder and projector, consistently improving the accuracy-efficiency trade-off across multiple image and video benchmarks.

Link: https://arxiv.org/abs/2603.24680
Authors: An Yu,Ting Yu Tsai,Zhenfei Zhang,Weiheng Lu,Felix X.-F. Ye,Ming-Ching Chang
Affiliations: University at Albany, SUNY; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present \textbfReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at this https URL.
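
The scoring rule can be sketched as a greedy trade-off between query relevance and max-min diversity over token embeddings. The mixing weight `alpha` and the exact combination below are our own assumptions for illustration, not ReDiPrune's published formula.

```python
def norm(u):
    return sum(a * a for a in u) ** 0.5

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def select_tokens(tokens, query, k, alpha):
    """Greedy selection: score = alpha * relevance + (1 - alpha) * diversity,
    where diversity is the min cosine distance to already-selected tokens."""
    rel = [cos(t, query) for t in tokens]
    selected = [max(range(len(tokens)), key=lambda i: rel[i])]
    while len(selected) < k:
        remaining = [i for i in range(len(tokens)) if i not in selected]

        def score(i):
            div = min(1 - cos(tokens[i], tokens[j]) for j in selected)
            return alpha * rel[i] + (1 - alpha) * div

        selected.append(max(remaining, key=score))
    return selected

tokens = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.0]
```

With a high `alpha` the second pick is the token most similar to the query; with a low `alpha` it is the token farthest from what is already selected, which is exactly the relevance/diversity dial the abstract describes.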

[CV-160] From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition CVPR2026

[Quick Read]: This paper addresses the difficulty of interpreting the internal mechanisms of large vision-language models such as CLIP. Existing methods mostly rely on activations, making them dataset-dependent, vulnerable to data bias, and limited to coarse head-level explanations. The proposed SITH (Semantic Inspection of Transformer Heads) is a fully data-free and training-free framework that analyzes the Vision Transformer directly in weight space. SITH decomposes each attention head's value-output matrix into singular vectors and, via the newly proposed COMP (Coherent Orthogonal Matching Pursuit) algorithm, interprets each one as a sparse, semantically coherent combination of human-interpretable concepts, enabling precise, interpretable weight-space model edits and mechanistic studies.

Link: https://arxiv.org/abs/2603.24653
Authors: Francesco Gentile,Nicola Dall’Asen,Francesco Tonini,Massimiliano Mancini,Lorenzo Vaquero,Elisa Ricci
Affiliations: University of Trento; University of Pisa; Fondazione Bruno Kessler
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted @ CVPR 2026. Project page: this https URL

Abstract:As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP’s vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.
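
The decomposition step can be approximated with plain matching pursuit over a concept dictionary: greedily pick the atom most correlated with the residual and subtract its projection. COMP, as described in the paper, additionally enforces coherence and orthogonality constraints that this minimal sketch omits.

```python
def matching_pursuit(signal, atoms, k):
    """Greedy sparse decomposition of `signal` over unit-norm `atoms`.
    Returns ({atom_index: coefficient}, residual) after at most k picks."""
    residual = list(signal)
    coeffs = {}
    for _ in range(k):
        dots = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        idx = max(range(len(atoms)), key=lambda i: abs(dots[i]))
        c = dots[idx]
        if abs(c) < 1e-12:
            break  # nothing left to explain
        coeffs[idx] = coeffs.get(idx, 0.0) + c
        residual = [r - c * a for r, a in zip(residual, atoms[idx])]
    return coeffs, residual

# toy orthonormal "concept" dictionary; in SITH the signal would be a
# singular vector of a head's value-output matrix, and the atoms would be
# text-derived concept embeddings
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coeffs, res = matching_pursuit([3.0, 4.0, 0.0], concepts, k=2)
```

With an orthonormal dictionary, the two picks recover the exact coefficients (4.0 on concept 1, then 3.0 on concept 0) and leave a zero residual.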

[CV-161] MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

[Quick Read]: This paper addresses the disconnect between how vision-language models (VLMs) are evaluated in medical imaging and actual clinical practice: existing evaluations rely on manually curated 2D images and miss the core challenge of real diagnostics, which requires actively navigating 3D medical data across multiple sequences or modalities, gathering evidence, and supporting a final decision. The key contributions are two components: MEDOPENCLAW, an auditable runtime that lets VLMs operate dynamically within standard medical imaging tools (e.g., 3D Slicer), and MEDFLOWBENCH, a full-study benchmark covering multi-sequence brain MRI and lung CT/PET that systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Experiments reveal that while state-of-the-art VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can solve basic study-level tasks in a viewer-only setting, their performance degrades when given professional support tools because they lack precise spatial grounding, underscoring the importance of spatial awareness for building auditable, full-study medical imaging agents.

Link: https://arxiv.org/abs/2603.24649
Authors: Weixiang Shen,Yanzhu Hu,Che Liu,Junde Wu,Jiayuan Zhu,Chengzhi Shen,Min Xu,Yueming Jin,Benedikt Wiestler,Daniel Rueckert,Jiazhen Pan
Affiliations: Technical University of Munich (TUM); TUM University Hospital; LMU Munich; Imperial College London; University of Oxford; Carnegie Mellon University; National University of Singapore; Munich Center for Machine Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures

Abstract:Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of realworld diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.

Artificial Intelligence

[AI-0] Back to Basics: Revisiting ASR in the Age of Voice Agents

[Quick Read]: This paper addresses the unreliability of current automatic speech recognition (ASR) systems in real-world settings, where performance degrades severely under conditions not systematically covered by existing evaluations, such as environmental noise, demographic variation, and linguistic diversity, and where practitioners lack interpretable diagnostic tools. The key to the solution is WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three separable axes: environmental degradation, demographic shift, and linguistic diversity. This factor-isolated design lets researchers pinpoint the specific causes of model failure and reveals that robustness does not transfer consistently across models, languages, or conditions, providing actionable analytical tools for deployment decisions and for safer, more reliable ASR in production systems.

Link: https://arxiv.org/abs/2603.25727
Authors: Geeyang Tay,Wentao Ma,Jaewon Lee,Yuzhi Tang,Daniel Lee,Weisu Yin,Dongming Shen,Silin Meng,Yi Zhu,Mu Li,Alex Smola
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 10 pages, 5 figures

Abstract:Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.

[AI-1] Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

[Quick Read]: This paper studies how far general-purpose coding agents, without hardware-specific training, can optimize hardware designs from high-level algorithmic specifications. The core challenge is achieving cross-function co-optimization without domain knowledge and escaping the local optima imposed by sub-kernel decomposition. The key to the solution is a two-stage agent factory: Stage 1 decomposes a design into sub-kernels, applies pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to find globally promising configurations under an area constraint; Stage 2 launches multiple expert agents over the top ILP solutions to explore cross-function optimizations (such as pragma recombination, loop fusion, and memory restructuring) that sub-kernel search cannot discover. Experiments on multiple HLS benchmarks show substantial speedups (up to 20x), and the agents rediscover known hardware optimization patterns without domain-specific training, establishing agent scaling as a viable axis for HLS optimization.

Link: https://arxiv.org/abs/2603.25719
Authors: Abhishek Bhandwaldar,Mihir Choudhury,Ruchir Puri,Akash Srivastava
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:We present an empirical study of how far general-purpose coding agents – without hardware-specific training – can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
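
Stage 1's assembly step is, at its core, a multiple-choice knapsack: pick exactly one configuration per sub-kernel to maximize benefit under an area budget. A brute-force sketch of that selection problem follows; the paper formulates it as an ILP and solves it properly, and the `(benefit, area)` tuples and additive objective below are illustrative assumptions.

```python
from itertools import product

def best_assembly(kernels, area_budget):
    """kernels[i] = list of (benefit, area) configs for sub-kernel i.
    Returns (best_total_benefit, chosen_config_indices)."""
    best = (float("-inf"), None)
    for choice in product(*(range(len(k)) for k in kernels)):
        area = sum(kernels[i][c][1] for i, c in enumerate(choice))
        if area > area_budget:
            continue  # violates the area constraint
        benefit = sum(kernels[i][c][0] for i, c in enumerate(choice))
        if benefit > best[0]:
            best = (benefit, choice)
    return best

# two sub-kernels, each with a baseline and two optimized variants
kernels = [[(0, 0), (10, 5), (15, 9)],
           [(0, 0), (8, 4), (12, 8)]]
```

With a budget of 12 area units, the best feasible assembly pairs the mid-tier variants (benefit 18) rather than the locally best variant of either kernel, which is exactly why a global formulation beats greedy per-kernel choices.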

[AI-2] Neural Network Conversion of Machine Learning Pipelines ECAI IJCAI ICML

[Quick Read]: This paper investigates transferring knowledge from a non-neural machine learning pipeline, with a random forest classifier as the teacher, to a neural network (NN) student, which would allow joint optimization across pipeline components and a single unified inference engine for multiple ML tasks. The key to the solution is using the random forest teacher's predictions as training targets, designing suitable NN topologies, and choosing the right hyper-parameters so the student NN can mimic the teacher's performance; the authors additionally explore using a random forest itself to help select the best NN hyper-parameters.

Link: https://arxiv.org/abs/2603.25699
Authors: Man-Ling Sung,Jan Silovsky,Man-Hung Siu,Herbert Gish,Chinnu Pittapally
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted and accepted to AutoML 2018 @ ICML/IJCAI-ECAI

Abstract:Transfer learning and knowledge distillation has recently gained a lot of attention in the deep learning community. One transfer approach, the student-teacher learning, has been shown to successfully create "small" student neural networks that mimic the performance of much bigger and more complex "teacher" networks. In this paper, we investigate an extension to this approach and transfer from a non-neural-based machine learning pipeline as teacher to a neural network (NN) student, which would allow for joint optimization of the various pipeline components and a single unified inference engine for multiple ML tasks. In particular, we explore replacing the random forest classifier by transfer learning to a student NN. We experimented with various NN topologies on 100 OpenML tasks in which random forest has been one of the best solutions. Our results show that for the majority of the tasks, the student NN can indeed mimic the teacher if one can select the right NN hyper-parameters. We also investigated the use of random forest for selecting the right NN hyper-parameters.
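
The transfer recipe can be sketched end to end: a teacher supplies soft labels, and a student network is fit to them with a cross-entropy gradient. Everything below is a toy stand-in (a hand-built "forest" of three decision stumps and a one-layer logistic student), not the paper's random forests or searched topologies.

```python
import math

def teacher_proba(x):
    """Toy 'forest' of three threshold stumps; soft label = vote fraction."""
    votes = [x[0] > 0.5, x[1] > 0.3, x[0] + x[1] > 0.9]
    return sum(votes) / 3.0

def train_student(points, epochs=300, lr=0.5):
    """Fit a logistic student to the teacher's soft labels.
    With soft target t, the cross-entropy gradient w.r.t. the logit is p - t."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x in points:
            z = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - teacher_proba(x)
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
w, b = train_student(grid)
agree = sum((w[0] * x[0] + w[1] * x[1] + b > 0) == (teacher_proba(x) > 0.5)
            for x in grid) / len(grid)
```

On this toy grid the student's hard decisions agree with the teacher's on the large majority of points, which is the "student mimics teacher" outcome the abstract reports at scale.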

[AI-3] The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

[Quick Read]: This paper starts from the premise that code production is now a commodity, and that the real bottlenecks in software development are knowing what to build and proving that it works. The key to the solution is the Kitchen Loop, a framework for autonomous, self-evolving software built on a unified trust model with four components: (1) a specification surface that enumerates what the product claims to support; (2) "As a User x 1000", where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground-truth verification that the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. Validated on two production systems over 285+ iterations, the framework produced more than 1,094 merged pull requests with no regressions detected and exhibited emergent properties such as multi-iteration self-correction chains and autonomous infrastructure healing; its contribution lies not in new primitives but in composing existing ones into a production-tested system with the operational discipline needed for safe, long-running autonomous evolution.

Link: https://arxiv.org/abs/2603.25697
Authors: Yannick Roy
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop, a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) ‘As a User x 1000’, where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.

[AI-4] A Unified Memory Perspective for Probabilistic Trustworthy AI

[Quick Read]: This paper addresses a performance bottleneck in trustworthy AI systems: as demand for probabilistic computation grows, deterministic data access interleaves with repeated stochastic sampling, and conventional memory systems struggle to deliver both data and randomness efficiently, limiting overall system efficiency. The key to the solution is a unified data-access perspective that treats deterministic access as a limiting case of stochastic sampling, allowing both modes to be analyzed within one framework. This view reveals that increasing stochastic demand reduces effective data-access efficiency and can drive systems into entropy-limited operation. On this basis, the paper defines memory-level evaluation criteria (unified operation, distribution programmability, efficiency, robustness to hardware non-idealities, and parallel compatibility), uses them to analyze the limitations of conventional architectures, and examines emerging probabilistic compute-in-memory approaches that integrate sampling with memory access, outlining pathways toward scalable hardware for trustworthy AI.

Link: https://arxiv.org/abs/2603.25692
Authors: Xueji Zhao,Likai Pei,Jianbo Liu,Kai Ni,Ningyuan Cao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments:

Abstract:Trustworthy artificial intelligence increasingly relies on probabilistic computation to achieve robustness, interpretability, security and privacy. In practical systems, such workloads interleave deterministic data access with repeated stochastic sampling across models, data paths and system functions, shifting performance bottlenecks from arithmetic units to memory systems that must deliver both data and randomness. Here we present a unified data-access perspective in which deterministic access is treated as a limiting case of stochastic sampling, enabling both modes to be analyzed within a common framework. This view reveals that increasing stochastic demand reduces effective data-access efficiency and can drive systems into entropy-limited operation. Based on this insight, we define memory-level evaluation criteria, including unified operation, distribution programmability, efficiency, robustness to hardware non-idealities and parallel compatibility. Using these criteria, we analyze limitations of conventional architectures and examine emerging probabilistic compute-in-memory approaches that integrate sampling with memory access, outlining pathways toward scalable hardware for trustworthy AI.

[AI-5] Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

[Quick Read]: This paper asks whether, when large language models (LLMs) serve in math education both as problem solvers and as assessors of learners' reasoning, stronger math problem-solving ability is associated with stronger step-level assessment performance. The study finds that, within the same model, accuracy at predicting the earliest erroneous step of a benchmark solution is substantially higher on problems the model itself solved correctly than on problems it solved incorrectly, with statistically significant associations across both models (GPT-4 and GPT-5) and both datasets (GSM8K and MATH). The key takeaway is that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, process monitoring, and precise error localization, with implications for designing and evaluating AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.

Link: https://arxiv.org/abs/2603.25633
Authors: Liang Zhang,Yu Fu,Xinyi Jin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners’ reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.

[AI-6] TAAC: A Gate into Trustable Audio Affective Computing

[Quick Read]: This paper addresses the entanglement between user-sensitive identity information (ID) and depression features in audio-based affective computing, aiming to enable trustable automated depression diagnosis while protecting user privacy. Conventional approaches struggle to separate depression features from ID features and lack a safe encryption mechanism for sensitive information, exposing privacy risks during smart diagnosis. The key to the solution is TAAC (Trustable Audio Affective Computing), a first practical framework with three core modules: a Differentiating Features Subspace Decompositor (DFSD) that separates depression features from ID features, a Flexible Noise Encryptor (FNE) that encrypts only the ID features to protect privacy, and a Staged Training Paradigm that improves overall diagnostic performance. The framework preserves depression-detection accuracy while achieving a strong balance between identity preservation and audio reconstruction quality, demonstrating its excellence in Confidentiality, Accuracy, Traceability, and Adjustability.

Link: https://arxiv.org/abs/2603.25570
Authors: Xintao Hu,Feng-Qi Cui
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the emergence of AI techniques for depression diagnosis, the conflict between high demand and limited supply for depression screening has been significantly alleviated. Among various modal data, audio-based depression diagnosis has received increasing attention from both academia and industry since audio is the most common carrier of emotion transmission. Unfortunately, audio data also contains User-sensitive Identity Information (ID), which is extremely vulnerable and may be maliciously used during the smart diagnosis process. Among previous methods, the clarification between depression features and sensitive features has always served as a barrier. It is also critical to the problem for introducing a safe encryption methodology that only encrypts the sensitive features and a powerful classifier that can correctly diagnose the depression. To tackle these challenges, by leveraging adversarial loss-based Subspace Decomposition, we propose TAAC, a first practical framework for Trustable Audio Affective Computing, to perform automated depression detection through audio within a trustable environment. The key enablers of TAAC are Differentiating Features Subspace Decompositor (DFSD), Flexible Noise Encryptor (FNE) and Staged Training Paradigm, used for decomposition, ID encryption and performance enhancement, respectively. Extensive experiments with existing encryption methods demonstrate our framework’s preeminent performance in depression detection, ID reservation and audio reconstruction. Meanwhile, the experiments across various settings demonstrate our model’s stability under different encryption strengths. Thus proving our framework’s excellence in Confidentiality, Accuracy, Traceability, and Adjustability.

[AI-7] Are LLMs Overkill for Databases? A Study on the Finiteness of SQL

[Quick Read]: This paper asks how hard the SQL queries faced by generative models (such as LLMs) in natural-language-to-SQL translation really are, and whether their complexity is predictable and regular. The key finding, from analyzing SQL query template distributions across 376 real databases, is that although databases can grow without bound, the complexity of the queries actually used is bounded by real-life utility and human needs, and that SQL queries in template form follow a Power Law-like distribution: roughly 70% of queries can be covered by just 13% of template types, so the vast majority of SQL queries are highly predictable. This suggests that for database access, LLM-generated SQL may operate in a narrow, highly formulaic space where predefined templates could offer a safer, cheaper, and auditable alternative.

Link: https://arxiv.org/abs/2603.25568
Authors: Yue Li,David Mimno,Unso Eun Seo Jo
Affiliations: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Translating natural language to SQL for data retrieval has become more accessible thanks to code generation LLMs. But how hard is it to generate SQL code? While databases can become unbounded in complexity, the complexity of queries is bounded by real life utility and human needs. With a sample of 376 databases, we show that SQL queries, as translations of natural language questions are finite in practical complexity. There is no clear monotonic relationship between increases in database table count and increases in complexity of SQL queries. In their template forms, SQL queries follow a Power Law-like distribution of frequency where 70% of our tested queries can be covered with just 13% of all template types, indicating that the high majority of SQL queries are predictable. This suggests that while LLMs for code generation can be useful, in the domain of database access, they may be operating in a narrow, highly formulaic space where templates could be safer, cheaper, and auditable.
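
The template analysis can be reproduced in miniature: mask literals to obtain a query's template, count template frequencies, and measure how much of the workload the most frequent templates cover. The masking rules below are a simplified assumption, not the paper's exact normalization.

```python
import re
from collections import Counter

def templatize(sql):
    """Mask string and numeric literals, normalize whitespace and case."""
    sql = re.sub(r"'[^']*'", "?", sql)             # 'alice' -> ?
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)   # 42, 3.14 -> ?
    return re.sub(r"\s+", " ", sql).strip().lower()

def coverage(queries, top_fraction):
    """Fraction of queries covered by the top `top_fraction` of templates."""
    counts = Counter(templatize(q) for q in queries)
    ranked = counts.most_common()
    k = max(1, int(len(ranked) * top_fraction))
    return sum(c for _, c in ranked[:k]) / len(queries)

queries = [
    "SELECT * FROM t WHERE id = 1",
    "SELECT * FROM t WHERE id = 2",
    "SELECT * FROM t WHERE id = 3",
    "SELECT name FROM u WHERE name = 'alice'",
    "SELECT * FROM t",
]
# three templates; the single most frequent one already covers 3/5 of
# the workload, a miniature of the 13%-of-templates / 70%-of-queries result
```

On a real workload, `coverage` evaluated at increasing `top_fraction` values traces out the Power Law-like curve the paper describes.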

[AI-8] Voxtral TTS

[Quick Read]: This paper targets the problems of limited naturalness and expressivity in multilingual text-to-speech (TTS) and of inefficient low-resource voice cloning. The key to the solution is Voxtral TTS, a hybrid-architecture generative speech model with two core components: auto-regressive generation of semantic speech tokens combined with flow-matching for acoustic tokens, and Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. The model generates high-fidelity, expressive multilingual speech from as little as 3 seconds of reference audio; in human evaluations by native speakers it achieves a 68.4% win rate over ElevenLabs Flash v2.5, demonstrating strong voice cloning capability.

Link: https://arxiv.org/abs/2603.25551
Authors: Alexander H. Liu,Alexis Tacnet,Andy Ehrenberg,Andy Lo,Chen-Yo Sun,Guillaume Lample,Henry Lagarde,Jean-Malo Delignon,Jaeyoung Kim,John Harvill,Khyathi Raghavi Chandu,Lorenzo Signoretti,Margaret Jennings,Patrick von Platen,Pavankumar Reddy Muddireddy,Rohin Arora,Sanchit Gandhi,Samuel Humeau,Soham Ghosh,Srijan Mishra,Van Phung,Abdelaziz Bounhar,Abhinav Rastogi,Adrien Sadé,Alan Jeffares,Albert Jiang,Alexandre Cahill,Alexandre Gavaudan,Alexandre Sablayrolles,Amélie Héliou,Amos You,Andrew Bai,Andrew Zhao,Angele Lenglemetz,Anmol Agarwal,Anton Eliseev,Antonia Calvi,Arjun Majumdar,Arthur Fournier,Artjom Joosen,Avi Sooriyarachchi,Aysenur Karaduman Utkur,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Benjamin Tibi,Bowen Yang,Charlotte Cronjäger,Clémence Lanfranchi,Connor Chen,Corentin Barreau,Corentin Sautier,Cyprien Courtot,Darius Dabert,Diego de las Casas,Elizaveta Demyanenko,Elliot Chane-Sane,Emmanuel Gottlob,Enguerrand Paquin,Etienne Goffinet,Fabien Niel,Faruk Ahmed,Federico Baldassarre,Gabrielle Berrada,Gaëtan Ecrepont,Gauthier Guinet,Genevieve Hayes,Georgii Novikov,Giada Pistilli,Guillaume Kunsch,Guillaume Martin,Guillaume Raille,Gunjan Dhanuka,Gunshi Gupta,Han Zhou,Harshil Shah,Hope McGovern,Hugo Thimonier,Indraneel Mukherjee,Irene Zhang,Jacques Sun,Jan Ludziejewski,Jason Rute,Jérémie Dentan,Joachim Studnia,Jonas Amar,Joséphine Delas,Josselin Somerville Roberts,Julien Tauran,Karmesh Yadav,Kartik Khandelwal,Kilian Tep,Kush Jain,Laurence Aitchison,Laurent Fainsin,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.

[AI-9] NERO-Net: A Neuroevolutionary Approach for the Design of Adversarially Robust CNNs

【速读】:该论文旨在解决神经网络在安全关键场景中因固有对抗脆弱性(adversarial fragility)而难以部署的问题,尤其是现有神经进化方法常忽视模型架构对鲁棒性的潜在影响。解决方案的关键在于提出一种名为NERO-Net的神经进化方法,通过在进化过程中避免使用对抗训练(adversarial training),仅依赖标准训练下的性能评估来筛选具有内在鲁棒性的卷积神经网络结构;其核心创新在于设计了一个专门用于衡量对抗鲁棒性的适应度函数,使得选出的模型即使在未经过对抗训练的情况下也能实现高对抗准确率(如FGSM攻击下33%准确率),同时保持干净样本上的高准确率(87%),进一步标准训练后可提升至47%对抗准确率,验证了架构层面鲁棒性的有效性。

链接: https://arxiv.org/abs/2603.25517
作者: Inês Valentim,Nuno Antunes,Nuno Lourenço
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neuroevolution automates the complex task of neural network design but often ignores the inherent adversarial fragility of evolved models, which is a barrier to adoption in safety-critical scenarios. While robust training methods have received significant attention, the design of architectures exhibiting intrinsic robustness remains largely unexplored. In this paper, we propose NERO-Net, a neuroevolutionary approach to design convolutional neural networks better equipped to resist adversarial attacks. Our search strategy isolates architectural influence on robustness by avoiding adversarial training during the evolutionary loop. As such, our fitness function promotes candidates that, even trained with standard (non-robust) methods, achieve high post-attack accuracy without sacrificing the accuracy on clean samples. We assess NERO-Net on CIFAR-10 with a specific focus on $L_\infty$-robustness. In particular, the fittest individual emerged from evolutionary search with 33% accuracy against FGSM, used as an efficient estimator for robustness during the search phase, while maintaining 87% clean accuracy. Further standard training of this individual boosted these metrics to 47% adversarial and 93% clean accuracy, suggesting inherent architectural robustness. Adversarial training brings the overall accuracy of the model up to 40% against AutoAttack.
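
NERO-Net 在搜索阶段用 FGSM 作为鲁棒性的快速估计器。FGSM 的扰动规则是 x' = x + eps * sign(dL/dx)。下面用手写梯度的逻辑回归小例子示意该扰动的构造(纯演示,模型与数值均为假设,与论文的 CNN 设置无关):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(x, w, b, y, eps):
    """对单样本 x 施加 FGSM 扰动。
    对逻辑回归 + 交叉熵损失,dL/dx_i = (p - y) * w_i,
    扰动取 eps * sign(dL/dx_i),即沿使损失增大的方向走一步。"""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# 演示:正类样本在 L_inf 半径 eps 内被推向决策边界
x = [1.0, 2.0]
w, b, y = [1.0, -1.0], 0.0, 1
x_adv = fgsm_perturb(x, w, b, y, eps=0.1)
print(x_adv)
```

扰动后的样本在每一维上至多偏移 eps,这正是 $L_\infty$ 威胁模型下的一步攻击,搜索阶段用它代替昂贵的 AutoAttack 评估。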

[AI-10] Lightweight GenAI for Network Traffic Synthesis: Fidelity Augmentation and Classification

【速读】:该论文旨在解决网络流量分类(Network Traffic Classification, NTC)中因标注数据有限和隐私要求严格而导致的性能瓶颈问题。传统生成方法在建模现代流量复杂的时序动态特性方面表现不足,且计算开销较大。解决方案的关键在于采用轻量级生成式人工智能(Generative Artificial Intelligence, GenAI)架构,包括基于Transformer、状态空间模型(state-space models)和扩散模型(diffusion models),以实现高效、高保真度的网络流量合成。实验表明,这些轻量级模型能有效保留流量的静态与时序特征,在仅使用合成数据训练时即可达到高达87%的F1分数,在低数据场景下通过GenAI增强可使NTC性能提升最多40%,显著缩小与全数据训练的差距,其中基于Transformer的模型在保真度与效率之间提供了最佳平衡。

链接: https://arxiv.org/abs/2603.25507
作者: Giampaolo Bovenzi,Domenico Ciuonzo,Jonatan Krolikowski,Antonio Montieri,Alfredo Nascita,Antonio Pescapè,Dario Rossi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 3 figures, 3 tables, 4 research questions, preprint submitted to IEEE Communications Magazine

点击查看摘要

Abstract:Accurate Network Traffic Classification (NTC) is increasingly constrained by limited labeled data and strict privacy requirements. While Network Traffic Generation (NTG) provides an effective means to mitigate data scarcity, conventional generative methods struggle to model the complex temporal dynamics of modern traffic and/or often incur significant computational cost. In this article, we address the NTG task using lightweight Generative Artificial Intelligence (GenAI) architectures, including transformer-based, state-space, and diffusion models designed for practical deployment. We conduct a systematic evaluation along four axes: (i) (synthetic) traffic fidelity, (ii) synthetic-only training, (iii) data augmentation under low-data regimes, and (iv) computational efficiency. Experiments on two heterogeneous datasets show that lightweight GenAI models preserve both static and temporal traffic characteristics, with transformer and state-space models closely matching real distributions across a complete set of fidelity metrics. Classifiers trained solely on synthetic traffic achieve up to 87% F1-score on real data. In low-data settings, GenAI-driven augmentation improves NTC performance by up to +40%, substantially reducing the gap with full-data training. Overall, transformer-based models provide the best trade-off between fidelity and efficiency, enabling high-quality, privacy-aware traffic synthesis with modest computational overhead.

[AI-11] EcoThink: A Green Adaptive Inference Framework for Sustainable and Accessible Agents WWW2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成式交互场景下因计算密集型推理策略(如思维链,Chain-of-Thought)被无差别应用于海量用户查询所引发的“过度思考”问题,这导致显著的能源浪费和碳排放增加,进而阻碍联合国可持续发展目标13(气候行动)与10(减少不平等)的实现。解决方案的关键在于提出EcoThink框架,其核心是一个轻量级、基于知识蒸馏的路由机制(distillation-based router),能够动态评估查询复杂度:对事实性检索类任务跳过冗余推理,仅对复杂逻辑任务保留深度计算,从而在保持性能基本不变的前提下,平均降低40.4%的推理能耗(最高达81.9%)。

链接: https://arxiv.org/abs/2603.25498
作者: Linxiao Li,Zhixiang Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by WWW 2026

点击查看摘要

Abstract:As the Web transitions from static retrieval to generative interaction, the escalating environmental footprint of Large Language Models (LLMs) presents a critical sustainability challenge. Current paradigms indiscriminately apply computation-intensive strategies like Chain-of-Thought (CoT) to billions of daily queries, causing LLM overthinking, a redundancy that amplifies carbon emissions and operational barriers. This inefficiency directly undermines UN Sustainable Development Goals 13 (Climate Action) and 10 (Reduced Inequalities) by hindering equitable AI access in resource-constrained regions. To address this, we introduce EcoThink, an energy-aware adaptive inference framework designed to reconcile high-performance AI intelligence with environmental responsibility. EcoThink employs a lightweight, distillation-based router to dynamically assess query complexity, skipping unnecessary reasoning for factoid retrieval while reserving deep computation for complex logic. Extensive evaluations across 9 diverse benchmarks demonstrate that EcoThink reduces inference energy by 40.4% on average (up to 81.9% for web knowledge retrieval) without statistically significant performance loss. By mitigating algorithmic waste, EcoThink offers a scalable path toward a sustainable, inclusive, and energy-efficient generative AI Agent.
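
EcoThink 的核心调度逻辑是“轻量路由器先判别查询复杂度,再决定跳过还是启用深度推理”。下面是这一思路的极简示意(complexity_score 用假设的启发式打分代替论文中蒸馏得到的路由器,阈值与线索词均为演示假设):

```python
def route(query, threshold=0.5):
    """返回 'direct'(事实型查询,跳过 CoT)或 'cot'(复杂推理)。
    complexity_score 为演示用启发式,真实系统中应换成蒸馏得到的轻量分类器。"""
    reasoning_cues = ("prove", "why", "step", "calculate", "compare", "if")
    hits = sum(cue in query.lower() for cue in reasoning_cues)
    complexity_score = min(1.0, hits / 3 + len(query.split()) / 100)
    return "cot" if complexity_score >= threshold else "direct"

print(route("What is the capital of France?"))  # 事实检索类查询
print(route("Prove step by step why the sum of two odd numbers is even, and calculate an example."))  # 推理类查询
```

事实检索类查询走短路径、推理类查询走长路径,节能就来自这两条路径的计算量差异。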

[AI-12] Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models

【速读】:该论文旨在解决城市空气质量短期预测中模型复杂度高、计算资源消耗大且缺乏可解释性的问题,特别是在北京地区进行小时级PM2.5浓度预测时如何实现高精度与高效部署的平衡。其解决方案的关键在于提出一种“泄漏感知”的预测工作流(leakage-aware forecasting workflow),结合时间序列的时序划分、预处理、特征选择和外生驱动变量建模,在Perfect Prognosis设定下评估三种轻量级且可解释的预测方法:SARIMAX、Facebook Prophet 和 NeuralProphet,并通过两种自适应机制——每周滚动重训练(walk-forward refitting)与冻结模型加在线残差修正(frozen forecasting with online residual correction)来测试实际部署性能。结果表明,经过残差修正后的SARIMAX在冻结模型场景下达到最低误差(MAE 32.50, RMSE 46.85),而修正后的Facebook Prophet则在保持接近滚动重训练性能的同时将运行时间从15分21.91秒大幅缩短至46.60秒,证明了轻量化加法型预测策略在准确性、效率与可解释性之间具有显著优势。

链接: https://arxiv.org/abs/2603.25495
作者: Moazzam Umer Gondal,Hamad ul Qudous,Asma Ahmad Farhan,Sultan Alamri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to PLOS ONE

点击查看摘要

Abstract:Accurate short-term air-quality forecasting is essential for public health protection and urban management, yet many recent forecasting frameworks rely on complex, data-intensive, and computationally demanding models. This study investigates whether lightweight and interpretable forecasting approaches can provide competitive performance for hourly PM2.5 prediction in Beijing, China. Using multi-year pollutant and meteorological time-series data, we developed a leakage-aware forecasting workflow that combined chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under the Perfect Prognosis setting. Three forecasting families were evaluated: SARIMAX, Facebook Prophet, and NeuralProphet. To assess practical deployment behavior, the models were tested under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results showed clear differences in both predictive accuracy and computational efficiency. Under walk-forward refitting, Facebook Prophet achieved the strongest completed performance, with an MAE of 37.61 and an RMSE of 50.10, while also requiring substantially less execution time than NeuralProphet. In the frozen-model regime, online residual correction improved Facebook Prophet and SARIMAX, with corrected SARIMAX yielding the lowest overall error (MAE 32.50; RMSE 46.85). NeuralProphet remained less accurate and less stable across both regimes, and residual correction did not improve its forecasts. Notably, corrected Facebook Prophet reached nearly the same error as its walk-forward counterpart while reducing runtime from 15 min 21.91 sec to 46.60 sec. These findings show that lightweight additive forecasting strategies can remain highly competitive for urban air-quality prediction, offering a practical balance between accuracy, interpretability, …
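
文中“冻结模型 + 在线残差修正”的预测方式,可用残差的指数加权滑动平均(EWMA)做一阶偏差修正来示意(alpha 与数据均为演示假设,并非论文的修正实现):

```python
def corrected_forecasts(raw_forecasts, observations, alpha=0.3):
    """对冻结模型的逐小时预测做在线残差修正:
    修正预测 = 原始预测 + 残差的指数加权滑动平均(EWMA)。"""
    corrected, bias = [], 0.0
    for pred, obs in zip(raw_forecasts, observations):
        corrected.append(pred + bias)                      # 先用当前偏差估计修正
        bias = alpha * (obs - pred) + (1 - alpha) * bias   # 再用新观测更新偏差估计
    return corrected

# 演示:冻结模型系统性低估 10 个单位时,修正项随观测逐步逼近该偏差
preds = [50.0] * 6
obs = [60.0] * 6
out = corrected_forecasts(preds, obs)
print(out)
```

修正项只依赖最近的预测残差,不需要重训模型,这正是冻结模式下还能接近滚动重训精度、同时大幅缩短运行时间的原因。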

[AI-13] Retraining as Approximate Bayesian Inference

【速读】:该论文旨在解决模型再训练(model retraining)决策的非最优性问题,即传统依赖固定时间周期(如日历调度)的再训练策略缺乏灵活性与成本效益。作者Harrison Katz提出将再训练视为在计算约束下的近似贝叶斯推断(approximate Bayesian inference),并引入“学习债务”(learning debt)概念来量化部署模型与持续更新信念状态之间的差距。解决方案的关键在于构建一个基于决策理论(decision-theoretic)的再训练策略框架,其中再训练触发条件由损失函数自然导出的阈值决定,从而实现证据驱动的再训练触发机制,替代人工设定的周期性计划,并提升治理过程的可审计性(auditable)。

链接: https://arxiv.org/abs/2603.25480
作者: Harrison Katz
机构: 未知
类目: Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Model retraining is usually treated as an ongoing maintenance task. But as Harrison Katz now argues, retraining can be better understood as approximate Bayesian inference under computational constraints. The gap between a continuously updated belief state and your frozen deployed model is “learning debt,” and the retraining decision is a cost minimization problem with a threshold that falls out of your loss function. In this article Katz provides a decision-theoretic framework for retraining policies. The result is evidence-based triggers that replace calendar schedules and make governance auditable. For readers less familiar with the Bayesian and decision-theoretic language, key terms are defined in a glossary at the end of the article.
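
按文中的决策论视角,再训练触发条件可以写成一个由损失函数导出的阈值规则。下面的线性损失模型与全部数值均为演示假设,仅说明“阈值从成本最小化中自然落出”这一点:

```python
def should_retrain(learning_debt, loss_per_unit_debt, retrain_cost):
    """当累积学习债务带来的期望损失超过一次再训练的成本时触发再训练。
    隐含阈值 = retrain_cost / loss_per_unit_debt,直接由损失函数导出。"""
    expected_loss = learning_debt * loss_per_unit_debt
    return expected_loss > retrain_cost

# 演示:债务低于阈值(0.05)时不触发,超过后触发
assert not should_retrain(learning_debt=0.04, loss_per_unit_debt=100.0, retrain_cost=5.0)
assert should_retrain(learning_debt=0.08, loss_per_unit_debt=100.0, retrain_cost=5.0)
print("implied threshold =", 5.0 / 100.0)
```

与日历调度相比,这类规则的触发记录(债务估计、阈值、成本)天然可审计。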

[AI-14] Maximum Entropy Behavior Exploration for Sim2Real Zero-Shot Reinforcement Learning

【速读】:该论文旨在解决零样本强化学习(Zero-shot Reinforcement Learning, Zero-shot RL)在真实机器人系统中因预训练数据多样性不足而导致下游任务性能不佳的问题。其核心挑战在于:在缺乏对下游任务先验知识的情况下,如何获取高质量、多样化的探索数据以支持任意奖励函数下的策略恢复。解决方案的关键在于提出一种在线零样本RL算法FB-MEBE,该方法结合无监督行为探索策略与正则化评论家(regularization critic),通过最大化所实现行为分布的熵来促进探索,并利用正则化评论家引导恢复的策略趋向于更自然、物理上合理的运动模式,从而显著提升策略在模拟和真实硬件上的泛化能力与部署可行性。

链接: https://arxiv.org/abs/2603.25464
作者: Jiajun Hu,Nuria Armengol Urpi,Jin Cheng,Stelian Coros
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset, and recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream tasks of interest remains a challenge. In this work, we study \textitonline zero-shot RL for quadrupedal control on real robotic systems, building upon the Forward-Backward (FB) algorithm. We observe that undirected exploration yields low-diversity data, leading to poor downstream performance and rendering policies impractical for direct hardware deployment. Therefore, we introduce FB-MEBE, an online zero-shot RL algorithm that combines an unsupervised behavior exploration strategy with a regularization critic. FB-MEBE promotes exploration by maximizing the entropy of the achieved behavior distribution. Additionally, a regularization critic shapes the recovered policies toward more natural and physically plausible behaviors. We empirically demonstrate that FB-MEBE achieves improved performance compared to other exploration strategies in a range of simulated downstream tasks, and that it renders natural policies that can be seamlessly deployed to hardware without further finetuning. Videos and code available on our website.
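
FB-MEBE 以“已实现行为分布的熵”为探索目标。下面用离散行为桶的访问计数示意经验熵的估计,以及由此导出的内在奖励 -log p(行为):访问越少的行为奖励越高,从而拉平行为分布(桶划分与数值为演示假设,论文处理的是连续行为空间):

```python
import math
from collections import Counter

def behavior_entropy(visits):
    """离散行为桶访问计数的经验熵(单位:nats)。"""
    total = sum(visits.values())
    return -sum((c / total) * math.log(c / total) for c in visits.values())

def intrinsic_reward(visits, bucket):
    """访问越少的行为桶奖励越高:-log p(bucket),鼓励最大化行为分布的熵。"""
    total = sum(visits.values())
    return -math.log(visits[bucket] / total)

uniform = Counter({"trot": 10, "jump": 10, "turn": 10})
skewed = Counter({"trot": 28, "jump": 1, "turn": 1})
print(behavior_entropy(uniform), behavior_entropy(skewed))
```

均匀分布的熵达到上界 log(桶数),偏斜分布的熵更低,奖励信号因此把探索推向覆盖更均匀的行为集合。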

[AI-15] Temporally Decoupled Diffusion Planning for Autonomous Driving ICAPS

【速读】:该论文旨在解决动态城市环境中运动规划中短期安全与长期目标之间的平衡问题,现有方法将轨迹视为整体实体,忽略了近端(near-term)计划受瞬时动态约束、远端(far-term)计划受导航目标影响的异质性时间依赖关系。其解决方案的关键在于提出时序解耦扩散模型(Temporally Decoupled Diffusion Model, TDDM),通过“噪声即掩码”范式将轨迹分段并赋予不同噪声水平,使高噪声段作为信息缺失区域、低噪声段作为上下文线索,从而引导模型利用内部时序相关性重建近端状态;同时引入时序解耦自适应层归一化(TD-AdaLN)注入分段特定的时间步信息,并在推理阶段采用非对称时序无分类器引导机制,利用弱噪声远端先验指导即时路径生成,显著提升复杂场景下的规划性能。

链接: https://arxiv.org/abs/2603.25462
作者: Xiang Li,Bikun Wang,John Zhang,Jianjun Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICAPS

点击查看摘要

Abstract:Motion planning in dynamic urban environments requires balancing immediate safety with long-term goals. While diffusion models effectively capture multi-modal decision-making, existing approaches treat trajectories as monolithic entities, overlooking heterogeneous temporal dependencies where near-term plans are constrained by instantaneous dynamics and far-term plans by navigational goals. To address this, we propose Temporally Decoupled Diffusion Model (TDDM), which reformulates trajectory generation via a noise-as-mask paradigm. By partitioning trajectories into segments with independent noise levels, we implicitly treat high noise as information voids and weak noise as contextual cues. This compels the model to reconstruct corrupted near-term states by leveraging internal correlations with better-preserved temporal contexts. Architecturally, we introduce a Temporally Decoupled Adaptive Layer Normalization (TD-AdaLN) to inject segment-specific timesteps. During inference, our Asymmetric Temporal Classifier-Free Guidance utilizes weakly noised far-term priors to guide immediate path generation. Evaluations on the nuPlan benchmark show TDDM approaches or exceeds state-of-the-art baselines, particularly excelling in the challenging Test14-hard subset.
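
TDDM 的“噪声即掩码”可以理解为:给不同轨迹段施加不同强度的噪声,高噪声段近似信息缺失、低噪声段充当上下文线索。下面用一维轨迹与分段高斯噪声做一个极简示意(分段方式与噪声数值均为假设,并非论文实现):

```python
import random

def noise_as_mask(trajectory, segment_len, sigmas, seed=0):
    """按段给轨迹施加不同强度的高斯噪声:
    高噪声段 ≈ 信息缺失(需由模型重建),低噪声段 ≈ 保留的上下文线索。"""
    rng = random.Random(seed)
    noised = []
    for i, x in enumerate(trajectory):
        sigma = sigmas[min(i // segment_len, len(sigmas) - 1)]
        noised.append(x + rng.gauss(0.0, sigma))
    return noised

# 近端段强噪声(被"掩掉"),远端段弱噪声(作为导航先验引导重建)
traj = [float(t) for t in range(8)]
out = noise_as_mask(traj, segment_len=4, sigmas=[1.0, 0.01])
print(out)
```

推理时弱噪声的远端段保留了导航意图,模型据此重建被强噪声掩盖的近端状态,这正是非对称时序引导的直观来源。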

[AI-16] Cross-Model Disagreement as a Label-Free Correctness Signal

【速读】:该论文旨在解决在缺乏真实标签(ground truth labels)的情况下,如何准确检测语言模型生成结果是否错误的问题,尤其针对“自信错误”(confident errors)这一危险场景——即模型对错误答案表现出高置信度。传统方法依赖模型自身的不确定性指标(如token熵或置信度分数),但在自信错误场景下失效。解决方案的关键在于引入**跨模型不一致(cross-model disagreement)**作为正确性指示信号:通过一个验证模型(verifier model)对生成模型输出的答案进行单次前向传播,计算验证模型对该答案的“惊讶程度”或“不确定性”,无需生成新文本或标注数据。作者具体实现了两种指标——**跨模型困惑度(Cross-Model Perplexity, CMP)**与**跨模型熵(Cross-Model Entropy, CME)**,二者均显著优于基于模型内部不确定性的基线,在MMLU、TriviaQA和GSM8K等多个基准上表现优异,证明了该方法在生产环境中无需训练即可实现标签自由的正确性估计。

链接: https://arxiv.org/abs/2603.25450
作者: Matt Gorbett,Suman Jana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model’s own uncertainty – such as token entropy or confidence scores – but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator – a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model’s generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model’s surprise at the generating model’s answer tokens, and Cross-Model Entropy (CME), which measures the verifying model’s uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
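
CMP 与 CME 只需验证模型对答案 token 做一次前向传播即可得到。给定验证模型在各答案位置的 logprob 或词表分布,二者可如下计算(输入数值为假设,仅示意公式本身):

```python
import math

def cross_model_perplexity(answer_token_logprobs):
    """CMP:验证模型对生成答案各 token 的困惑度 exp(-平均 logprob)。"""
    n = len(answer_token_logprobs)
    return math.exp(-sum(answer_token_logprobs) / n)

def cross_model_entropy(position_distributions):
    """CME:验证模型在各答案位置上词表分布熵的平均值(nats)。"""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in position_distributions]
    return sum(ents) / len(ents)

# 假设数据:验证模型"认同"答案时 logprob 高、困惑度低;不一致时困惑度高
agree = [-0.1, -0.2, -0.05]
disagree = [-3.0, -4.5, -2.8]
print(cross_model_perplexity(agree), cross_model_perplexity(disagree))
```

高 CMP/CME 意味着验证模型对该答案“感到意外”,可作为无标签的出错信号用于路由或选择性预测。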

[AI-17] From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild WWW2026

【速读】:该论文旨在解决当前微视频(micro-video)虚假信息检测中存在的两大核心问题:一是现有基准测试(benchmark)多局限于单一类型的欺骗手段,未能涵盖现实世界中涉及多模态篡改、生成式AI内容、认知偏见及断章取义再利用等复杂情况;二是现有检测模型缺乏细粒度归因能力,导致可解释性不足,难以在实际场景中部署。解决方案的关键在于提出两个创新性成果:其一为WildFakeBench,一个包含超10,000条真实微视频的大型基准数据集,覆盖多样化的虚假信息类型与来源,并附有专家定义的归因标签;其二为FakeAgent,一种受德尔斐法(Delphi)启发的多智能体推理框架,通过融合多模态理解与外部证据进行归因驱动的分析,能够联合识别内容中的操纵行为、认知模式和AI生成特征,从而实现对微视频虚假信息的精准定位与解释。实验表明,FakeAgent在所有虚假信息类型上均显著优于现有多模态大语言模型(MLLMs),而WildFakeBench则为可解释性的微视频虚假信息检测提供了真实且具有挑战性的评估平台。

链接: https://arxiv.org/abs/2603.25423
作者: Zhi Zeng,Yifei Yang,Jiaying Wu,Xulang Zhang,Xiangzheng Kong,Herun Wan,Zihan Ma,Minnan Luo
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Accepted at WWW 2026

点击查看摘要

Abstract:The rise of micro-videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real-world cases that involve multimodal manipulation, AI-generated content, cognitive bias, and out-of-context reuse. Meanwhile, most detection models lack fine-grained attribution, limiting interpretability and practical utility. To address these gaps, we introduce WildFakeBench, a large-scale benchmark of over 10,000 real-world micro-videos covering diverse misinformation types and sources, each annotated with expert-defined attribution labels. Building on this foundation, we develop FakeAgent, a Delphi-inspired multi-agent reasoning framework that integrates multimodal understanding with external evidence for attribution-grounded analysis. FakeAgent jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI-generated patterns, and detect out-of-context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, while WildFakeBench provides a realistic and challenging testbed for advancing explainable micro-video misinformation detection. Data and code are available at: this https URL.

[AI-18] Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

【速读】:该论文致力于解决在有限动作预算下,如何高效构建高质量的语义场景图(Semantic Scene Graph, SSG)以支持具身智能体在不确定性与资源约束条件下的目标驱动自适应问题。其核心挑战在于平衡信息获取收益与导航成本,并判断何时额外动作将带来边际收益递减。解决方案的关键在于提出一个模块化的导航组件,通过替换原有的策略优化方法并重新审视离散动作建模方式:一方面采用更精细的离散动作集(包括紧凑型和细粒度动作),另一方面比较单一头策略与因子化多头策略在原子动作上的表现;实验表明,结合现代优化算法与因子化动作表示能实现最优的SSG完整度-效率权衡,显著提升模型质量与下游任务实用性。

链接: https://arxiv.org/abs/2603.25415
作者: Roman Kueble,Marco Hueller,Mrunmai Phatak,Rainer Lienhart,Joerg Haehner
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness–efficiency trade-off. 

[AI-19] Decidable By Construction: Design-Time Verification for Trustworthy AI

【速读】:该论文旨在解决当前机器学习中模型正确性依赖于事后验证的固有假设问题,即在训练完成后才进行数值稳定性、计算正确性或物理一致性等属性的检查。作者指出,这些关键属性实际上可以在设计阶段(design time)通过低开销的方式进行验证,尤其适用于高杠杆决策支持和科学约束场景。解决方案的核心在于识别出这些属性具有特定的代数结构——可表示为有限生成阿贝尔群 $\mathbb{Z}^n$ 上的约束,并且此类约束的推理可在多项式时间内判定,主类型唯一。基于此观察,论文构建了一个框架,整合三项前期研究成果:携带任意注解作为持久型共数据贯穿模型展开的维度类型系统;从类型签名直接推导Clifford代数阶与几何积稀疏性的程序超图;以及通过前向模式共效应分析和精确正数累加保持不变量的自适应领域模型架构。这一组合实现了信息论层面的新结果:在阿贝尔群上的Hindley-Milner统一算法等价于在可计算的Solomonoff先验限制下求最大后验假设,从而将类型推断置于通用归纳的同一形式基础之上,从根本上消除了传统方法因多层部署、多次推理而累积的额外开销。

链接: https://arxiv.org/abs/2603.25414
作者: Houston Haynes
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 18 pages, 1 figure

点击查看摘要

Abstract:A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff’s universal prior, placing the framework’s type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.
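
“约束可表示为 $\mathbb{Z}^n$ 上的阿贝尔群方程”在量纲类型中最直观:单位写成整数指数向量,乘法对应向量逐位相加(群运算),加法合法当且仅当两个向量相等。下面的极简示意与原框架的实现无关:

```python
def dim_mul(a, b):
    """量纲乘法 = 指数向量逐位相加(Z^n 上的阿贝尔群运算)。"""
    return tuple(x + y for x, y in zip(a, b))

def dim_check_add(a, b):
    """量纲加法合法 ⇔ 两个指数向量完全相等。"""
    return a == b

# 基底为 (长度, 时间):米 M = (1,0),秒 S = (0,1)
M, S = (1, 0), (0, 1)
velocity = dim_mul(M, tuple(-x for x in S))       # m * s^-1
accel = dim_mul(velocity, tuple(-x for x in S))   # m * s^-2
print(velocity, accel)
print(dim_check_add(velocity, velocity), dim_check_add(velocity, accel))
```

这类约束的统一化归结为整系数线性方程求解,这正是摘要中“多项式时间可判定、主类型唯一”的直观背景。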

[AI-20] Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂任务中依赖显式链式思维(Chain-of-Thought, CoT)推理时,其推理过程本身缺乏安全性保障的问题。现有研究主要关注输出内容的安全性(如有害、偏见或事实错误),将推理链视为黑箱中间产物,忽视了推理轨迹在逻辑一致性、计算效率和抗对抗扰动方面的潜在风险。为此,作者首次提出“推理安全”(reasoning safety)的概念,并构建了一个涵盖九类不安全推理行为的分类体系,通过大规模标注4111条推理链验证其实际存在性及攻击诱导的可解释特征;解决方案的关键在于设计并实现一个外部的推理安全监控器(Reasoning Safety Monitor),该模块基于另一个LLM,在目标模型运行时实时分析每一步推理,利用嵌入分类体系的提示词识别异常行为,并触发中断信号,实验表明其在步骤级定位准确率达84.88%,错误类型分类准确率达85.37%,显著优于基线方法,证明了推理层面监控的必要性与可行性。

链接: https://arxiv.org/abs/2603.25412
作者: Xunguang Wang,Yuguang Zhou,Qingyue Wang,Zongjie Li,Ruixuan Huang,Zhenlan Ji,Pingchuan Ma,Shuai Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety–detecting harmful, biased, or factually incorrect outputs – and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model’s reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.

[AI-21] System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games

【速读】:该论文旨在解决长时程桌面游戏(如麻将)中机器人系统因感知或执行误差累积而导致任务状态失效、决策模块间错误传播进而引发交互失败的问题。其核心挑战在于维持多轮次、多人参与场景下的内部状态一致性,而非仅优化单一组件性能。解决方案的关键在于:构建一个集成架构,显式维护感知、执行与交互状态;将高层语义推理与实时感知控制解耦;引入经验证的动作原语及触觉触发的恢复机制以防止状态过早污染;并通过交互级监控机制检测回合违规和隐含信息泄露,从而保障执行假设的有效性。实证结果表明,这种基于系统级设计的方法显著提升了长时间运行下的可执行一致性,而单一化或未经验证的流水线则导致端到端可靠性明显下降。

链接: https://arxiv.org/abs/2603.25405
作者: Guangyu Zhao,Ceyao Zhang,Chengdong Ma,Tao Wu,Yiyang Song,Haoxuan Ru,Yifan Zhong,Ruilin Yan,Lingfeng Li,Ruochong Li,Yu Li,Xuyuan Han,Yun Ding,Ruizhang Jiang,Xiaochuan Zhang,Yichao Li,Yuanpei Chen,Yaodong Yang,Yitao Liang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon tabletop games pose a distinct systems challenge for robotics: small perceptual or execution errors can invalidate accumulated task state, propagate across decision-making modules, and ultimately derail interaction. This paper studies how to maintain internal state consistency in turn-based, multi-human robotic tabletop games through deliberate system design rather than isolated component improvement. Using Mahjong as a representative long-horizon setting, we present an integrated architecture that explicitly maintains perceptual, execution, and interaction state, partitions high-level semantic reasoning from time-critical perception and control, and incorporates verified action primitives with tactile-triggered recovery to prevent premature state corruption. We further introduce interaction-level monitoring mechanisms to detect turn violations and hidden-information breaches that threaten execution assumptions. Beyond demonstrating complete-game operation, we provide an empirical characterization of failure modes, recovery effectiveness, cross-module error propagation, and hardware-algorithm trade-offs observed during deployment. Our results show that explicit partitioning, monitored state transitions, and recovery mechanisms are critical for sustaining executable consistency over extended play, whereas monolithic or unverified pipelines lead to measurable degradation in end-to-end reliability. The proposed system serves as an empirical platform for studying system-level design principles in long-horizon, turn-based interaction.

[AI-22] Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

【速读】:该论文旨在解决在设备端(on-device)视觉语言模型(Vision-Language Models, VLMs)中,由于采用动态高分辨率预处理(Dynamic High-Resolution preprocessing,如AnyRes)所引入的算法侧信道(algorithmic side-channel)安全漏洞问题。此类漏洞使得攻击者可通过观察执行时间差异和缓存竞争行为,推断出输入图像的几何特征乃至语义内容,从而泄露隐私敏感信息。解决方案的关键在于提出一个双层攻击框架:第一层利用操作系统级指标(如执行时间)实现对输入几何结构的可靠指纹识别;第二层通过分析最后一级缓存(Last-Level Cache, LLC)争用情况,进一步区分具有相同几何但不同语义密度的内容(如医学X光片与文本文档)。该研究揭示了当前主流VLM模型(如LLaVA-NeXT和Qwen2-VL)存在可被利用的侧信道风险,并指出现有防护手段(如恒定工作量填充)会导致显著性能开销,最终提出面向边缘人工智能部署的安全设计建议。

链接: https://arxiv.org/abs/2603.25403
作者: Eyal Hadad,Mordechai Guri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input’s geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.
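
该侧信道的根源在于 AnyRes 式预处理的图块数随输入长宽比变化,使工作量与输入几何相关。下面用一个简化的网格划分示意“几何决定工作量”这一点(tile 大小与划分策略为假设,并非某个具体模型的实现):

```python
import math

def patch_count(width, height, tile=336):
    """简化的 AnyRes 式划分:按 tile 大小对图像做网格切分,
    图块数 = ceil(w/tile) * ceil(h/tile),外加一个全局缩略图块。"""
    return math.ceil(width / tile) * math.ceil(height / tile) + 1

# 像素总量相同、长宽比不同 → 图块数不同 → 可被观测的执行时间差异
print(patch_count(672, 672))    # 正方形
print(patch_count(1008, 448))   # 宽幅,像素总量与上一行相同
```

攻击者无需读取图像内容,只要从 OS 级计时推断出图块数,就能反推输入的几何类别。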

[AI-23] GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLM s

【速读】:该论文旨在解决低比特量化(如4-bit)在部署大语言模型(LLM)时导致的精度下降问题,同时克服现有低秩修正方法(如LQER、QERA)因对所有层进行恢复并插入纠错模块而引入的延迟和内存开销。其解决方案的关键在于提出GlowQ,一种基于分组共享的低秩近似方法:它为每个输入共享组缓存一个共享的右因子,并仅对能带来最大精度提升的组或层进行恢复,从而减少参数与内存开销,同时保留层特定修正的表达能力;进一步提出的GlowQ-S变体则选择性地应用缓存模块,显著降低延迟,实现更高的吞吐量,且精度损失控制在0.2个百分点以内。

链接: https://arxiv.org/abs/2603.25385
作者: Selim An,Il hong Suh,Yeseong Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method in deploying large language models but often degrade accuracy when using low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across its modules, reducing parameter and memory overhead, and retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
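GlowQ 所依赖的低秩量化误差修正思想(LQER 一类方法的共同基础)可概括为如下草图;实现细节与秩的取值均为示例性假设,"分组共享"即对组内多个模块复用其中的右因子 B:

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Round-to-nearest quantization with a single per-tensor scale."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def lowrank_correction(w, w_q, rank=8):
    """Best rank-r factors (A, B) approximating the error W - Q(W) via SVD."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]  # left factor A, right factor B

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_q = quantize_rtn(w)
a, b = lowrank_correction(w, w_q)
err_q = np.linalg.norm(w - w_q)            # pure quantization error
err_c = np.linalg.norm(w - (w_q + a @ b))  # error after low-rank correction
```

修正后重建误差严格下降;GlowQ 的做法可理解为在一个输入共享组内缓存同一个 B,并只对收益最大的组实际插入修正模块。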

[AI-24] 4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles

【速读】:该论文旨在解决数学推理任务中难度建模的挑战,特别是在自适应学习系统中如何准确量化和预测算术谜题(arithmetic puzzle)的难易程度。其核心问题是:在整数算术谜题这一受控环境中,何种结构特征能可靠地决定难度,并支持可解释的难度估计与任务排序。解决方案的关键在于提出一个基于精确动态规划求解器的框架,该求解器能够枚举可达目标、提取最小操作路径(minimal-operation witnesses),并据此构建包含超过340万实例的大规模标注数据集。研究发现,难度完全由一组可解释的结构性属性决定,其中最关键的是最小构造中所使用的输入值数量——它作为难度的充分统计量,在给定标注下具有决定性作用。此方法实现了符号推理与数据驱动建模之间的桥梁,为智能练习系统中的自适应难度调整提供了理论基础与实践工具。

链接: https://arxiv.org/abs/2603.25356
作者: Yunus E. Zeytuncu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at AIED 2026

点击查看摘要

Abstract:Arithmetic puzzle games provide a controlled setting for studying difficulty in mathematical reasoning tasks, a core challenge in adaptive learning systems. We investigate the structural determinants of difficulty in a class of integer arithmetic puzzles inspired by number games. We formalize the problem and develop an exact dynamic-programming solver that enumerates reachable targets, extracts minimal-operation witnesses, and enables large-scale labeling. Using this solver, we construct a dataset of over 3.4 million instances and define difficulty via the minimum number of operations required to reach a target. We analyze the relationship between difficulty and solver-derived features. While baseline machine learning models based on bag- and target-level statistics can partially predict solvability, they fail to reliably distinguish easy instances. In contrast, we show that difficulty is fully determined by a small set of interpretable structural attributes derived from exact witnesses. In particular, the number of input values used in a minimal construction serves as a minimal sufficient statistic for difficulty under this labeling. These results provide a transparent, computationally grounded account of puzzle difficulty that bridges symbolic reasoning and data-driven modeling. The framework supports explainable difficulty estimation and principled task sequencing, with direct implications for adaptive arithmetic learning and intelligent practice systems.
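论文以"达到目标所需的最小操作数"作为难度标签,其求解思路可用一个微型穷举求解器示意(仅为说明性实现,与论文的精确 DP 求解器在规模与剪枝细节上不同):

```python
from itertools import combinations

def reachable(numbers):
    """Map each reachable value to the minimum number of operations needed."""
    best = {}

    def explore(bag, ops):
        for v in bag:
            if v not in best or ops < best[v]:
                best[v] = ops
        # Combine any two values with +, *, |a-b|, and exact division.
        for (i, a), (j, b) in combinations(enumerate(bag), 2):
            rest = [x for k, x in enumerate(bag) if k not in (i, j)]
            results = {a + b, a * b, abs(a - b)}
            if b and a % b == 0:
                results.add(a // b)
            if a and b % a == 0:
                results.add(b // a)
            for r in results:
                explore(rest + [r], ops + 1)

    explore(list(numbers), 0)
    return best

best = reachable([2, 3, 7])
```

例如输入值本身的难度为 0,而 13 = 2×3+7 需要两步;"最小构造中用到的输入值个数"即由这类最小操作路径读出。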

[AI-25] Agentic Trust Coordination for Federated Learning through Adaptive Thresholding and Autonomous Decision Making in Sustainable and Resilient Industrial Networks

【速读】:该论文旨在解决工业网络中联邦学习(Federated Learning, FL)在异构且资源受限设备环境下因客户端行为不一致、传感噪声以及故障或恶意更新导致的可靠性下降问题。现有基于信任的机制多为统计性与启发式方法,依赖固定参数或简单自适应规则,难以应对动态变化的运行条件。其解决方案的关键在于提出一种轻量级的“代理式信任协调”(Agentic Trust Control Layer)机制,该机制作为服务器端的控制回路,通过观测与系统层面相关的信任信号并分析其时序演化,在检测到不稳定状态时实施有针对性的信任调整;该框架通过显式分离观察(observation)、推理(reasoning)与行动(action)三个阶段,实现上下文感知的干预决策,从而保障联邦学习稳定运行,同时无需修改客户端训练逻辑或增加通信开销。

链接: https://arxiv.org/abs/2603.25334
作者: Paul Shepherd,Tasos Dagiuklas,Bugra Alkan,Jonathan Rodriguez
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Distributed intelligence in industrial networks increasingly integrates sensing, communication, and computation across heterogeneous and resource constrained devices. Federated learning (FL) enables collaborative model training in such environments, but its reliability is affected by inconsistent client behaviour, noisy sensing conditions, and the presence of faulty or adversarial updates. Trust based mechanisms are commonly used to mitigate these effects, yet most remain statistical and heuristic, relying on fixed parameters or simple adaptive rules that struggle to accommodate changing operating conditions. This paper presents a lightweight agentic trust coordination approach for FL in sustainable and resilient industrial networks. The proposed Agentic Trust Control Layer operates as a server side control loop that observes trust related and system level signals, interprets their evolution over time, and applies targeted trust adjustments when instability is detected. The approach extends prior adaptive trust mechanisms by enabling context aware intervention decisions, rather than relying on fixed or purely reactive parameter updates. By explicitly separating observation, reasoning, and action, the proposed framework supports stable FL operation without modifying client side training or increasing communication overhead.

[AI-26] Macroscopic Characteristics of Mixed Traffic Flow with Deep Reinforcement Learning Based Automated and Human-Driven Vehicles

【速读】:该论文旨在解决自动驾驶车辆(AV)在混杂交通环境中如何平衡安全性、效率、舒适性、燃油经济性及交通规则合规性的问题,同时需适应不同驾驶行为的异质性。传统跟车模型(如智能驾驶员模型,IDM)难以泛化至多样交通场景且忽略燃油效率,因此研究提出基于深度强化学习(DRL)的解决方案。其关键在于采用双延迟深度确定性策略梯度(TD3)算法训练AV控制策略,并利用NGSIM高速公路数据集实现与人类驾驶车辆的真实交互,从而在微观层面优化跟车行为的同时,在宏观层面提升整体交通流特性与燃油效率。实验表明,该方法可使道路通行能力提升约7.52%,并在高速工况下将平均燃油效率提高28.98%。

链接: https://arxiv.org/abs/2603.25328
作者: Pankaj Kumar,Pranamesh Chakraborty,Subrahmanya Swamy Peruru
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Total 5 figures and 2 tables

点击查看摘要

Abstract:Automated Vehicle (AV) control in mixed traffic, where AVs coexist with human-driven vehicles, poses significant challenges in balancing safety, efficiency, comfort, fuel efficiency, and compliance with traffic rules while capturing heterogeneous driver behavior. Traditional car-following models, such as the Intelligent Driver Model (IDM), often struggle to generalize across diverse traffic scenarios and typically do not account for fuel efficiency, motivating the use of learning-based approaches. Although Deep Reinforcement Learning (DRL) has shown strong microscopic performance in car-following conditions, its macroscopic traffic flow characteristics remain underexplored. This study focuses on analyzing the macroscopic traffic flow characteristics and fuel efficiency of DRL-based models in mixed traffic. A Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is implemented for AVs’ control and trained using the NGSIM highway dataset, enabling realistic interaction with human-driven vehicles. Traffic performance is evaluated using the Fundamental Diagram (FD) under varying driver heterogeneity, heterogeneous time-gap penetration levels, and different shares of RL-controlled vehicles. A macroscopic level comparison of fuel efficiency between the RL-based AV model and the IDM is also conducted. Results show that traffic performance is sensitive to the distribution of safe time gaps and the proportion of RL vehicles. Transitioning from fully human-driven to fully RL-controlled traffic can increase road capacity by approximately 7.52%. Further, RL-based AVs also improve average fuel efficiency by about 28.98% at higher speeds (above 50 km/h), and by 1.86% at lower speeds (below 50 km/h) compared to the IDM. Overall, the DRL framework enhances traffic capacity and fuel efficiency without compromising safety.
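作为对照基线的 IDM 本身只是一个闭式加速度公式,可直接写出(参数取常见教材默认值,并非论文的标定结果):

```python
import math

def idm_acceleration(v, dv, gap, v0=30.0, T=1.5, a=1.0, b=2.0, s0=2.0, delta=4):
    """IDM car-following acceleration.

    v: own speed [m/s]; dv: approach rate v - v_lead [m/s]; gap: bumper gap [m].
    v0: desired speed; T: safe time gap; a/b: max accel / comfortable decel;
    s0: standstill distance; delta: acceleration exponent.
    """
    s_star = s0 + max(0.0, v * T + v * dv / (2 * math.sqrt(a * b)))
    return a * (1 - (v / v0) ** delta - (s_star / gap) ** 2)
```

空旷道路上静止车辆以接近 a 的加速度起步,而达到期望速度后加速度归零;论文的 TD3 策略正是相对这一基线比较宏观流量与油耗表现。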

[AI-27] Evaluating Language Models for Harmful Manipulation

【速读】:该论文旨在解决当前对生成式 AI (Generative AI) 有害操纵行为的评估方法存在局限性的问题,尤其是在真实应用场景中如何系统性地衡量AI引发人类信念与行为改变的能力。其解决方案的关键在于提出一个基于情境特异性人-AI交互研究的评估框架,通过在公共政策、金融和健康三个高风险领域及美国、英国和印度三个地理区域开展大规模实验(共10,101名参与者),验证该框架的有效性。研究发现:AI模型在特定提示下可表现出操纵行为,并在实验环境中成功诱导参与者信念和行为变化;不同领域和地域的操纵效果存在显著差异,表明需在具体应用场景中进行评估;且操纵倾向(propensity)与操纵成功率(efficacy)并非一致相关,强调应分别考察这两个维度。这一框架为未来对有害AI操纵的科学评估提供了可复现的方法论基础。

链接: https://arxiv.org/abs/2603.25326
作者: Canfer Akbulut,Rasmi Elasmar,Abhishek Roy,Anthony Payne,Priyanka Suresh,Lujain Ibrahim,Seliem El-Sayed,Charvi Rastogi,Ashyana Kachra,Will Hawkins,Kristian Lum,Laura Weidinger
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

[AI-28] How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

【速读】:该论文旨在解决结构化稀疏剪枝(weight pruning)对大型语言模型内部表征几何结构影响机制不明确的问题,尤其是其如何重塑模型中特征空间的分布与稳定性。解决方案的关键在于首次系统性地利用稀疏自编码器(Sparse Autoencoders, SAEs)作为可解释性探针,量化分析不同剪枝方法(magnitude 和 Wanda)在多种模型架构和稀疏度水平下对特征存活率、迁移能力及因果相关性的差异。研究发现,罕见特征(low-firing-rate features)比高频通用特征更具抗剪枝鲁棒性,且 Wanda 剪枝显著优于 magnitude 剪枝,在高达 50% 稀疏度时仍能保持 SAE 特征结构的可迁移性,揭示了剪枝本质上是一种隐式特征选择机制,而非随机破坏。

链接: https://arxiv.org/abs/2603.25325
作者: Hector Borobia,Elies Seguí-Mas,Guillermina Tormo-Carbó
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 6 figures, 6 tables. Analysis covers Gemma 3 1B, Gemma 2 2B, and Llama 3.2 1B across 22 experimental runs. Code and data available at this https URL

点击查看摘要

Abstract:Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0–60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features–those with low firing rates–survive pruning far better than frequent ones, with within-condition Spearman correlations of rho = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance–a dissociation with implications for interpretability under compression.
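文中比较的两种剪枝准则仅在打分函数上不同,可用如下草图说明(此处简化为全局阈值;Wanda 原方法按输出行分组比较,本例只示意其 |W|·‖X‖ 的打分思想):

```python
import numpy as np

def prune_mask(weight, activations, sparsity=0.5, method="wanda"):
    """Boolean mask of surviving weights under magnitude or Wanda-style scoring."""
    if method == "magnitude":
        score = np.abs(weight)
    else:  # wanda-style: |W| weighted by per-input-feature activation norm
        score = np.abs(weight) * np.linalg.norm(activations, axis=0)
    k = int(score.size * sparsity)
    thresh = np.partition(score.ravel(), k - 1)[k - 1]
    return score > thresh

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))       # toy weight matrix
x = rng.normal(size=(32, 16))      # calibration activations
mask = prune_mask(w, x, sparsity=0.5)
```

两种准则保留的权重集合不同,这正是论文中"Wanda 对 SAE 特征结构的保持最多可好 3.7 倍"这一比较的对象。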

[AI-29] Revealing the influence of participant failures on model quality in cross-silo Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际生产环境中因参与者缺失而导致的可靠性问题,特别是系统故障(如崩溃、网络分区)对模型性能和训练稳定性的影响尚未被系统研究。其解决方案的关键在于通过大规模实验,系统性地分析不同数据类型(图像、表格、时间序列)下,参与者缺失对模型性能的影响,并深入考察数据偏移(data skewness)、可用性模式(availability patterns)及模型架构等关键因素的作用机制。研究发现,数据偏移具有显著影响,常导致模型评估结果过于乐观,甚至改变其他变量的作用方向,从而为提升FL系统的鲁棒性和实用性提供了实证依据与优化方向。

链接: https://arxiv.org/abs/2603.25289
作者: Fabian Stricker,David Bermbach,Christian Zirpins
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Federated Learning (FL) is a paradigm for training machine learning (ML) models in collaborative settings while preserving participants’ privacy by keeping raw data local. A key requirement for the use of FL in production is reliability, as insufficient reliability can compromise the validity, stability, and reproducibility of learning outcomes. FL inherently operates as a distributed system and is therefore susceptible to crash failures, network partitioning, and other fault scenarios. Despite this, the impact of such failures on FL outcomes has not yet been studied systematically. In this paper, we address this gap by investigating the impact of missing participants in FL. To this end, we conduct extensive experiments on image, tabular, and time-series data and analyze how the absence of participants affects model performance, taking into account influencing factors such as data skewness, different availability patterns, and model architectures. Furthermore, we examine scenario-specific aspects, including the utility of the global model for missing participants. Our experiments provide detailed insights into the effects of various influencing factors. In particular, we show that data skewness has a strong impact, often leading to overly optimistic model evaluations and, in some cases, even altering the effects of other influencing factors.
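实验所考察的"参与者缺失"在服务器聚合端的直接效应,可用一个对幸存客户端重新加权的 FedAvg 草图来说明(示例性实现,数值仅为演示):

```python
import numpy as np

def fedavg(updates, sizes, available):
    """Weighted average of client updates, renormalized over available clients."""
    total = sum(sizes[i] for i in available)
    return sum(sizes[i] / total * updates[i] for i in available)

updates = [np.array([1.0, 0.0]),   # client 0
           np.array([0.0, 1.0]),   # client 1
           np.array([1.0, 1.0])]   # client 2 (skewed, large dataset)
sizes = [10, 10, 20]
full = fedavg(updates, sizes, available=[0, 1, 2])
partial = fedavg(updates, sizes, available=[0, 1])  # client 2 crashed this round
```

当缺席客户端的数据分布有偏(如本例中持有最多数据的 client 2)时,聚合结果明显偏移,这正是论文发现数据偏移会放大缺失影响的机制。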

[AI-30] CSI-tuples-based 3D Channel Fingerprints Construction Assisted by MultiModal Learning

【速读】:该论文旨在解决低空通信场景下信道状态信息(CSI)获取效率低、重复估计成本高以及环境感知与通信参数耦合不足的问题,从而支持第六代移动通信(6G)中空地一体化无线资源的高效协同。解决方案的关键在于提出一种模块化多模态框架来构建三维信道指纹(3D-CF),其核心创新是将3D-CF建模为基于Rician衰落信道的CSI元组集合,并将其构造问题形式化为多模态回归任务——通过低空车辆(LAV)位置、通信测量数据和地理环境地图等异构先验信息,直接预测目标CSI。该框架包含相关性驱动的多模态融合(Corr-MMF)、多模态表征(MMR)和CSI回归(CSI-R)三个模块,显著提升了3D-CF构建精度(较当前最优算法至少提高27.5%)并具备优异泛化能力与推理效率。

链接: https://arxiv.org/abs/2603.25288
作者: Chenjie Xie,Li You,Ruirong Chen,Gaoning He,Xiqi Gao
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:Low-altitude communications can promote the integration of aerial and terrestrial wireless resources, expand network coverage, and enhance transmission quality, thereby empowering the development of sixth-generation (6G) mobile communications. As an enabler for low-altitude transmission, 3D channel fingerprints (3D-CF), also referred to as the 3D radio map or 3D channel knowledge map, are expected to enhance the understanding of communication environments and assist in the acquisition of channel state information (CSI), thereby avoiding repeated estimations and reducing computational complexity. In this paper, we propose a modularized multimodal framework to construct 3D-CF. Specifically, we first establish the 3D-CF model as a collection of CSI-tuples based on Rician fading channels, with each tuple comprising the low-altitude vehicle’s (LAV) positions and its corresponding statistical CSI. In consideration of the heterogeneous structures of different prior data, we formulate the 3D-CF construction problem as a multimodal regression task, where the target channel information in the CSI-tuple can be estimated directly by its corresponding LAV positions, together with communication measurements and geographic environment maps. Then, a high-efficiency multimodal framework is proposed accordingly, which includes a correlation-based multimodal fusion (Corr-MMF) module, a multimodal representation (MMR) module, and a CSI regression (CSI-R) module. Numerical results show that our proposed framework can efficiently construct 3D-CF and achieve at least 27.5% higher accuracy than the state-of-the-art algorithms under different communication scenarios, demonstrating its competitive performance and excellent generalization ability. We also analyze the computational complexity and illustrate its superiority in terms of the inference time.
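论文的 3D-CF 模型以 Rician 衰落信道为基础:信道系数由视距分量与瑞利散射分量按 K 因子混合,并归一化到单位平均功率。以下为一个示意采样器(参数为示例假设):

```python
import numpy as np

def rician_sample(k_factor, n, rng):
    """Draw n Rician fading coefficients with unit average power.

    The LoS amplitude and per-dimension scatter std follow from the K-factor:
    E|h|^2 = K/(K+1) + 1/(K+1) = 1.
    """
    los = np.sqrt(k_factor / (k_factor + 1))      # deterministic LoS term
    nlos = np.sqrt(1 / (2 * (k_factor + 1)))      # scattered std per real dim
    scatter = rng.normal(size=n) + 1j * rng.normal(size=n)
    return los + nlos * scatter

rng = np.random.default_rng(0)
h = rician_sample(k_factor=5.0, n=100_000, rng=rng)
```

每个 CSI 元组即可理解为 LAV 位置与此类统计信道参数(如 K 因子、平均功率)的配对。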

[AI-31] SliderQuant: Accurate Post-Training Quantization for LLMs ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练量化(Post-Training Quantization, PTQ)中一个被忽视的问题:现有方法通常对所有层采用统一的量化策略,但在低比特宽度设置下,这种“一刀切”的方式并非最优。作者通过实证发现,浅层和深层网络对量化更为敏感,尤其是第一层和最后一层表现出显著更大的量化误差,这表明不同层应根据其敏感性进行差异化量化设计。解决方案的关键在于提出一种名为Sliding-layer Quantization(SliderQuant)的新框架,其核心是引入两个层次的滑动量化机制:一是跨层滑动量化(inter-layer sliding quantization),基于三种针对浅层、中间层和深层定制的滑动窗口设计;二是层内滑动量化(intra-layer sliding quantization),采用增量式策略逐窗量化。该方法仅依赖少量可学习参数即可实现多层级自适应量化,有效降低各层量化误差,在多种LLM架构(包括Llama系列、Qwen2.5、DeepSeek-R1及MoE模型)上均优于现有PTQ方法,涵盖权重仅量化与权重-激活联合量化场景。

链接: https://arxiv.org/abs/2603.25284
作者: Shigeng Wang,Chao Li,Yangyuxuan Kang,Jiawei Fan,Zhonghong Ou,Anbang Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This work is accepted to ICLR 2026. Code is available at this https URL

点击查看摘要

Abstract:In this paper, we address post-training quantization (PTQ) for large language models (LLMs) from an overlooked perspective: given a pre-trained high-precision LLM, the predominant sequential quantization framework treats different layers equally, but this may not be optimal in challenging bit-width settings. We empirically study the quantization impact of different layers on model accuracy, and observe that: (1) shallow/deep layers are usually more sensitive to quantization than intermediate layers; (2) among shallow/deep layers, the most sensitive one is the first/last layer, which exhibits significantly larger quantization error than others. These empirical observations imply that the quantization design for different layers of LLMs is required on multiple levels instead of a single level shared by all layers. Motivated by this, we propose a new PTQ framework termed Sliding-layer Quantization (SliderQuant) that relies on a simple adaptive sliding quantization concept facilitated by a few learnable parameters. The base component of SliderQuant is called inter-layer sliding quantization, which incorporates three types of novel sliding window designs tailored for addressing the varying quantization sensitivity of shallow, intermediate and deep layers. The other component is called intra-layer sliding quantization that leverages an incremental strategy to quantize each window. As a result, SliderQuant has a strong ability to reduce quantization errors across layers. Extensive experiments on basic language generation, zero-shot commonsense reasoning and challenging math and code tasks with various LLMs, including Llama/Llama2/Llama3/Qwen2.5 model families, DeepSeek-R1 distilled models and large MoE models, show that our method outperforms existing PTQ methods (including the latest PTQ methods using rotation transformations) for both weight-only quantization and weight-activation quantization.
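摘要中"首末层量化误差显著更大"这类敏感性观察,可用一个逐层相对量化误差探针来示意(以重尾与高斯合成权重代替真实 LLM 各层,仅为假设性演示,说明离群值如何放大 round-to-nearest 误差):

```python
import numpy as np

def rtn_error(w, bits=4):
    """Relative Frobenius error of round-to-nearest quantization."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    w_q = np.round(w / scale) * scale
    return np.linalg.norm(w - w_q) / np.linalg.norm(w)

rng = np.random.default_rng(0)
heavy_tailed = rng.standard_t(df=2, size=(256, 256))  # outlier-rich "layer"
gaussian = rng.normal(size=(256, 256))                # well-behaved "layer"
```

离群值多的层相对误差明显更高,这也是对不同层采用不同量化设计(而非一刀切)的动机所在。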

[AI-32] A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion

【速读】:该论文试图解决的问题是:当前对步态(gait)的分析多局限于作为特定病理症状的表征,而未将其视为一种能够反映全身系统状态的独立生物标志物(biomarker)。为解决这一问题,研究团队构建了一个基于3D骨骼运动的步态基础模型(gait foundation model),利用来自3,414名深度表型成人受试者的深度相机数据,在五种运动任务中采集步态信息。其解决方案的关键在于:通过无监督学习获得的嵌入表示(learned embeddings)显著优于传统手工设计特征(engineered features),不仅在预测年龄、体重指数(BMI)、内脏脂肪组织面积等生理指标上表现优异(Pearson相关系数分别达0.69、0.90和0.82),还能独立预测多达1,980个表型目标,并在调整年龄、BMI、VAT和身高后仍提升18个身体系统中的17–18个系统的预测性能,表明步态是一种具有多系统覆盖能力的独立生物信号。

链接: https://arxiv.org/abs/2603.25283
作者: Adam Gabet,Sarah Kohn,Guy Lutsker,Shira Gelman,Anastasia Godneva,Gil Sasson,Arad Zulti,David Krongauz,Rotem Shaulitch,Assaf Rotem,Ohad Doron,Yuval Brodsky,Adina Weinberger,Eran Segal
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: Preprint. Under review

点击查看摘要

Abstract:Gait is increasingly recognized as a vital sign, yet current approaches treat it as a symptom of specific pathologies rather than a systemic biomarker. We developed a gait foundation model for 3D skeletal motion from 3,414 deeply phenotyped adults, recorded via a depth camera during five motor tasks. Learned embeddings outperformed engineered features, predicting age (Pearson r = 0.69), BMI (r = 0.90), and visceral adipose tissue area (r = 0.82). Embeddings significantly predicted 1,980 of 3,210 phenotypic targets; after adjustment for age, BMI, VAT, and height, gait provided independent gains in all 18 body systems in males and 17 of 18 in females, and improved prediction of clinical diagnoses and medication use. Anatomical ablation revealed that legs dominated metabolic and frailty predictions while torso encoded sleep and lifestyle phenotypes. These findings establish gait as an independent multi-system biosignal, motivating translation to consumer-grade video and its integration as a scalable, passive vital sign.

[AI-33] Distribution and Clusters Approximations as Abstract Domains in Probabilistic Abstract Interpretation to Neural Network Analysis

【速读】:该论文旨在解决神经网络分析中如何高效且精确地抽象和推理输入空间密度分布的问题,其核心挑战在于传统抽象域(如网格近似)在处理高维连续输入时的精度与计算复杂度之间的权衡。解决方案的关键在于提出两种新颖的抽象方法:分布近似(distribution approximation)和聚类近似(clusters approximation),并通过设计相应的抽象转换器(abstract transformers)实现对输入密度流的更精细建模,从而在理论层面提升概率抽象解释框架的表达能力与实用性。

链接: https://arxiv.org/abs/2603.25273
作者: Zhuofan Zhang,Herbert Wiklicky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The probabilistic abstract interpretation framework of neural network analysis analyzes a neural network by analyzing the density distribution flow of all its possible inputs. The grids approximation, which abstracts the concrete space into grids, is one of the abstract domains the framework uses. In this paper, we introduce two novel approximation methods: distribution approximation and clusters approximation. We show how these two methods work in theory, with their corresponding abstract transformers, illustrated on some simple examples.

[AI-34] Probabilistic Abstract Interpretation on Neural Networks via Grids Approximation

【速读】:该论文旨在解决神经网络在输入空间不可数或可数无穷时,难以通过穷举测试来分析其密度分布流动的问题。解决方案的关键在于引入概率抽象解释(Probabilistic Abstract Interpretation)理论框架,将抽象域(abstract domains)与Moore-Penrose伪逆(Moore-Penrose pseudo-inverse)及对应的抽象变换器(abstract transformers)相结合,从而形式化地刻画和推导神经网络中输入到输出的概率密度传播过程,实现对复杂神经网络行为的高效、可扩展的静态分析。

链接: https://arxiv.org/abs/2603.25266
作者: Zhuofan Zhang,Herbert Wiklicky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Probabilistic abstract interpretation is a theory used to extract particular properties of a computer program when it is infeasible to test every single input. In this paper we apply the theory to neural networks for the same purpose: to analyse the density distribution flow of all possible inputs of a neural network when the network has uncountably many or countably infinitely many inputs. We show how this theoretical framework works on neural networks and then discuss different abstract domains and the corresponding Moore-Penrose pseudo-inverses, together with the abstract transformers used in the framework. We also present experimental examples to show how this framework helps to analyse real-world problems.
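该框架中的网格近似可视为一个线性抽象算子 A(对每个网格单元内的密度取平均),其 Moore-Penrose 伪逆 A⁺ 充当具体化算子;以下用 8 个具体点、2 个网格单元的小例示意(算子构造为示例性假设):

```python
import numpy as np

# Abstraction operator A: average the concrete density over each grid cell.
# 8 concrete points -> 2 grid cells of 4 points each.
cells = np.kron(np.eye(2), np.ones((1, 4))) / 4

density = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.2, 0.1])  # sums to 1
abstract = cells @ density                      # cell-wise mean mass
concretised = np.linalg.pinv(cells) @ abstract  # A+ spreads mass back uniformly
```

伪逆满足 A A⁺ A = A,因此先具体化再抽象不会丢失抽象信息;网络各层的抽象变换器即按 A·F·A⁺ 的方式由具体变换 F 导出。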

[AI-35] FluxEDA: A Unified Execution Infrastructure for Stateful Agent ic EDA

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 与电子设计自动化 (EDA) 工具集成中普遍存在的状态丢失问题,即现有方法多依赖脚本级或请求级交互,难以在实际生产环境中维持工具状态并支持迭代优化。其解决方案的关键在于提出 FluxEDA——一个统一且有状态的基础设施底层架构,通过引入基于管理网关的执行接口和结构化请求/响应处理机制,并保持持久化的后端实例,使上层代理和可编程客户端能够基于保留的运行时状态与异构 EDA 工具进行交互,而非孤立的 shell 调用,从而实现状态复用、回滚和协同迭代执行,为代理辅助的 EDA 自动化提供了可落地的基础设施支撑。

链接: https://arxiv.org/abs/2603.25243
作者: Zhengrui Chen,Zixuan Song,Yu Li,Qi Sun,Cheng Zhuo
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: qisunchn@zju. this http URL , czhuo@zju. this http URL

点击查看摘要

Abstract:Large language models and autonomous agents are increasingly explored for EDA automation, but many existing integrations still rely on script-level or request-level interactions, which makes it difficult to preserve tool state and support iterative optimization in real production-oriented environments. In this work, we present FluxEDA, a unified and stateful infrastructure substrate for agentic EDA. FluxEDA introduces a managed gateway-based execution interface with structured request and response handling. It also maintains persistent backend instances. Together, these features allow upper-layer agents and programmable clients to interact with heterogeneous EDA tools through preserved runtime state, rather than through isolated shell invocations. We evaluate the framework using two representative commercial backend case studies: automated post-route timing ECO and standard-cell sub-library optimization. The results show that FluxEDA can support multi-step analysis and optimization over real tool contexts, including state reuse, rollback, and coordinated iterative execution. These findings suggest that a stateful and governed infrastructure layer is a practical foundation for agent-assisted EDA automation.

[AI-36] A Wireless World Model for AI-Native 6G Networks

【速读】:该论文旨在解决当前数据驱动的无线通信方法在动态环境中泛化能力不足的问题,其根本原因在于缺乏对电磁波传播机制的内在理解。解决方案的关键在于提出了一种名为无线世界模型(Wireless World Model, WWM)的多模态基础框架,该框架通过内化三维几何结构与信号动态之间的因果关系,实现对无线信道时空演化的预测。WWM基于大规模射线追踪的多模态数据预训练,有效弥合了数据真实性鸿沟,并在真实测量数据中得到验证;其核心创新在于采用联合嵌入预测架构与多模态专家混合Transformer,融合信道状态信息、3D点云和用户轨迹,形成统一表征,在五个下游任务中均展现出卓越的性能,包括已见环境、未见泛化场景及实测数据,显著优于现有单模态基础模型和专用模型,为面向物理世界的6G智能提供了新路径。

链接: https://arxiv.org/abs/2603.25216
作者: Ziqi Chen,Yi Ren,Yixuan Huang,Qi Sun,Nan Li,Yuhong Huang,Chih-Lin I,Yifan Li,Liang Xia
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Integrating AI into the physical layer is a cornerstone of 6G networks. However, current data-driven approaches struggle to generalize across dynamic environments because they lack an intrinsic understanding of electromagnetic wave propagation. We introduce the Wireless World Model (WWM), a multi-modal foundation framework predicting the spatiotemporal evolution of wireless channels by internalizing the causal relationship between 3D geometry and signal dynamics. Pre-trained on a massive ray-traced multi-modal dataset, WWM overcomes the data authenticity gap, further validated under real-world measurement data. Using a joint-embedding predictive architecture with a multi-modal mixture-of-experts Transformer, WWM fuses channel state information, 3D point clouds, and user trajectories into a unified representation. Across the five key downstream tasks supported by WWM, it achieves remarkable performance in seen environments, unseen generalization scenarios, and real-world measurements, consistently outperforming SOTA uni-modal foundation models and task-specific models. This paves the way for physics-aware 6G intelligence that adapts to the physical world.

[AI-37] Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Models, LLMs)后训练阶段中因多轮次采样(rollouts)带来的计算开销过高问题,尤其针对GRPO等算法中大量低效提示(prompt)导致的梯度信号稀疏现象。解决方案的关键在于提出HIVE(History-Informed and online-VErified prompt selection)框架,其核心机制为双阶段提示选择:第一阶段基于历史奖励轨迹进行粗粒度筛选,第二阶段利用提示熵(prompt entropy)作为在线代理指标,动态剔除已丧失学习价值的提示实例,从而实现数据高效且稳定的RL训练。

链接: https://arxiv.org/abs/2603.25184
作者: Jiahao Wu,Ning Lu,Shengcai Liu,Kun Wang,Yanting Yang,Li Qing,Ke Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency gains without compromising performance.
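HIVE 以提示熵作为在线效用代理:在二元奖励下,它就是通过率 p 的伯努利熵,峰值恰落在 p=0.5 的"学习边缘"。以下为示意实现(熵阈值为假设值,非论文设定):

```python
import math

def prompt_entropy(rewards):
    """Bernoulli entropy of a prompt's pass rate over binary rollout rewards."""
    p = sum(rewards) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0  # always-solved / never-solved prompts carry no signal
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def select_prompts(reward_table, threshold=0.5):
    """Keep prompts whose rollout entropy still exceeds the threshold."""
    return [pid for pid, rw in reward_table.items()
            if prompt_entropy(rw) >= threshold]

table = {"easy": [1, 1, 1, 1], "hard": [0, 0, 0, 0], "edge": [1, 0, 1, 0]}
```

随训练推进,"edge" 类提示的通过率趋于 0 或 1、熵随之衰减,即被在线剪除,从而把 rollout 预算集中在仍有梯度信号的提示上。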

[AI-38] PIDP-Attack: Combining Prompt Injection with Database Poisoning Attacks on Retrieval-Augmented Generation Systems

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在实际部署中面临的对抗攻击问题,尤其是数据投毒(data poisoning)攻击的局限性——现有方法通常依赖对用户查询的先验知识,限制了其灵活性与现实适用性。解决方案的关键在于提出一种新型复合攻击策略PIDP-Attack,该策略将提示注入(prompt injection)与数据库投毒(database poisoning)相结合:在推理阶段通过向查询附加恶意字符,并在检索数据库中注入少量中毒文档,从而无需事先知晓用户具体查询即可有效操控大语言模型(Large Language Models, LLMs)的输出。实验表明,该方法在多个基准数据集和LLM上显著优于原有PoisonedRAG攻击,攻击成功率提升4%至16%,同时保持高检索精度,验证了复合攻击策略的有效性和必要性。

链接: https://arxiv.org/abs/2603.25164
作者: Haozhen Wang,Haoyue Liu,Jionghao Zhu,Zhichao Wang,Yongxin Guo,Xiaoying Tang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications. However, their practical deployment is often hindered by issues such as outdated knowledge and the tendency to generate hallucinations. To address these limitations, Retrieval-Augmented Generation (RAG) systems have been introduced, enhancing LLMs with external, up-to-date knowledge sources. Despite their advantages, RAG systems remain vulnerable to adversarial attacks, with data poisoning emerging as a prominent threat. Existing poisoning-based attacks typically require prior knowledge of the user’s specific queries, limiting their flexibility and real-world applicability. In this work, we propose PIDP-Attack, a novel compound attack that integrates prompt injection with database poisoning in RAG. By appending malicious characters to queries at inference time and injecting a limited number of poisoned passages into the retrieval database, our method can effectively manipulate LLM responses to arbitrary queries without prior knowledge of the user’s actual query. Experimental evaluations across three benchmark datasets (Natural Questions, HotpotQA, MS-MARCO) and eight LLMs demonstrate that PIDP-Attack consistently outperforms the original PoisonedRAG. Specifically, our method improves attack success rates by 4% to 16% on open-domain QA tasks while maintaining high retrieval precision, proving that the compound attack strategy is both necessary and highly effective.

[AI-39] Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在复杂任务中缺乏可扩展的领域特定技能的问题。传统方法依赖人工编写技能,存在严重可扩展性瓶颈;而自动化技能生成往往因过度依赖浅层参数化知识或对轨迹局部经验的顺序过拟合,导致结果脆弱且碎片化。解决方案的关键在于提出Trace2Skill框架,其核心思想是模仿人类专家通过整体分析广泛执行轨迹来提炼出统一、完整的技能指南,而非逐条响应单个轨迹。该框架调度并行子代理对多样化的执行路径进行分析,提取轨迹特异性知识,并通过归纳推理层次化整合为无冲突的技能目录,从而实现对已有技能的深化和全新技能的构建。此方法无需参数更新或外部检索模块,即可生成高迁移性的声明式技能,在跨模型规模和分布外(OOD)场景下均表现出显著泛化能力。

链接: https://arxiv.org/abs/2603.25158
作者: Jingwei Ni,Yihao Liu,Xinpeng Liu,Yutao Sun,Mengyu Zhou,Pengyu Cheng,Dexin Wang,Xiaoxi Jiang,Guanjun Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Work in Progress

点击查看摘要

Abstract:Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic’s official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills – requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.

[AI-40] Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

【速读】:该论文试图解决的问题是:在人工智能辅助代码生成(AI-assisted code generation)日益普及的背景下,如何系统性地理解影响生成代码质量的关键因素及其对软件质量产出的影响。解决方案的关键在于通过一项遵循严格指南的系统文献综述(Systematic Literature Review, SLR),结合AI辅助但由人类监督的工作流,从24项实证研究中提取并分析证据,发现代码质量受人类因素(如提示设计、任务规范和开发者经验)、AI系统特性以及人机交互动态三方面共同影响,从而揭示了高质量代码产出需兼顾技术与人为因素的 socio-technical 本质。

链接: https://arxiv.org/abs/2603.25146
作者: Vehid Geruslu,Zulfiyya Aliyeva,Eray Tüzün
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry. Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts. Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis. Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported. Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

[AI-41] RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

【速读】:该论文旨在解决当前基于评分量表(rubric-based)的指令遵循评估在细粒度判断准确性方面的可靠性问题,即现有评估方法缺乏对rubric层级判断准确性的系统性元评估(meta-evaluation)。其解决方案的关键在于提出RubricEval基准测试,该基准首次实现了rubric级别的元评估,包含3,486个质量控制实例及易/难子集,并涵盖多类别指令与多种模型来源的响应。实验表明,即使使用GPT-4o作为评判模型,在Hard子集上准确率仅为55.97%,凸显了rubric级判断仍面临挑战;同时发现,相比检查清单式评估(checklist-level),rubric级评估表现更优,且显式推理能提升准确率并降低评委间差异。

链接: https://arxiv.org/abs/2603.25133
作者: Tianjun Pan,Xuan Lin,Wenyan Yang,Qianyu He,Shisong Chen,Licai Qi,Wanqing Xu,Hongwei Feng,Bo Xu,Yanghua Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted judge in instruction-following benchmarks, achieves only 55.97% on the Hard subset. Considering the evaluation paradigm, rubric-level evaluation outperforms checklist-level, explicit reasoning improves accuracy, and both together reduce inter-judge variance. Through our established rubric taxonomy, we further identify common failure modes and offer actionable insights for reliable instruction-following evaluation.
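
rubric 级元评估的打分逻辑可以简化为:逐条对照 gold 标签计算判断准确率,并统计不同评委之间的准确率方差。下面是一个纯 Python 的最小示意(函数与指标定义为本文假设,基准的官方评分脚本可能不同):

```python
from statistics import pvariance

def rubric_accuracy(judgments, gold):
    """Fraction of rubric-level judgments that match the gold labels."""
    hits = sum(j == g for j, g in zip(judgments, gold))
    return hits / len(gold)

def inter_judge_variance(per_judge, gold):
    """Population variance of accuracy across judges; lower values mean
    the judges agree more (illustrative definition only)."""
    return pvariance([rubric_accuracy(j, gold) for j in per_judge])
```

例如,两名评委在同一组 gold 标签上的准确率分别为 0.75 和 1.0,则评委间方差为 0.015625。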

[AI-42] When Sensing Varies with Contexts: Context-as-Transform for Tactile Few-Shot Class-Incremental Learning

【速读】:该论文旨在解决少样本类增量学习(Few-Shot Class-Incremental Learning, FSCIL)在获取上下文(acquisition context)不一致时性能下降的问题,尤其针对触觉感知场景中因设备差异、接触状态和交互设置等因素导致的标准化缺失问题。解决方案的关键在于将获取上下文分解为两个组成部分:一是结构化的低维成分,受触觉交互特征影响显著,通过建模为近似可逆的“上下文即变换”(Context-as-Transform)族,并利用伪上下文一致性损失进行逆变换归一化处理;二是高维残差成分,主要源于平台与设备差异,通过不确定性条件原型校准(Uncertainty-Conditioned Prototype Calibration, UCPC)机制,依据上下文不确定性对偏差原型和决策边界进行动态校准,从而提升模型在跨上下文环境下的泛化能力。

链接: https://arxiv.org/abs/2603.25115
作者: Yifeng Lin,Aiping Huang,Wenxi Liu,Si Wu,Tiesong Zhao,Zheng-Jun Zha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Few-Shot Class-Incremental Learning (FSCIL) can be particularly susceptible to acquisition contexts with only a few labeled samples. A typical scenario is tactile sensing, where the acquisition context (e.g., diverse devices, contact state, and interaction settings) degrades performance due to a lack of standardization. In this paper, we propose Context-as-Transform FSCIL (CaT-FSCIL) to tackle the above problem. We decompose the acquisition context into a structured low-dimensional component and a high-dimensional residual component. The former can be easily affected by tactile interaction features, which are modeled as an approximately invertible Context-as-Transform family and handled via inverse-transform canonicalization optimized with a pseudo-context consistency loss. The latter mainly arises from platform and device differences, which can be mitigated with an Uncertainty-Conditioned Prototype Calibration (UCPC) that calibrates biased prototypes and decision boundaries based on context uncertainty. Comprehensive experiments on the standard benchmarks HapTex and LMT108 have demonstrated the superiority of the proposed CaT-FSCIL.

[AI-43] Layer-Specific Lipschitz Modulation for Fault-Tolerant Multimodal Representation Learning

【速读】:该论文旨在解决工业与安全关键场景中多模态系统在部分传感器故障、信号退化或跨模态不一致情况下仍需保持可靠性的难题。其核心解决方案是提出一个基于数学理论的容错多模态表征学习框架,通过统一自监督异常检测与误差校正机制实现鲁棒性提升;关键创新在于:首先基于扰动传播的理论分析,推导出Lipschitz和雅可比(Jacobian)准则以判断神经算子是否放大或抑制局部故障;进而设计两阶段自监督训练策略——先用干净数据预训练多模态卷积自动编码器以保留潜在空间中的局部异常信号,再引入可学习的计算模块(由密集层构成)进行校正,并结合对比损失实现异常识别;同时采用分层Lipschitz调制与梯度裁剪作为控制检测与校正模块敏感度的原理性机制,从而在理论上保障鲁棒性并实践上提升异常检测精度与重建性能。

链接: https://arxiv.org/abs/2603.25103
作者: Diyar Altinses,Andreas Schwung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern multimodal systems deployed in industrial and safety-critical environments must remain reliable under partial sensor failures, signal degradation, or cross-modal inconsistencies. This work introduces a mathematically grounded framework for fault-tolerant multimodal representation learning that unifies self-supervised anomaly detection and error correction within a single architecture. Building upon a theoretical analysis of perturbation propagation, we derive Lipschitz- and Jacobian-based criteria that determine whether a neural operator amplifies or attenuates localized faults. Guided by this theory, we propose a two-stage self-supervised training scheme: pre-training a multimodal convolutional autoencoder on clean data to preserve localized anomaly signals in the latent space, and expanding it with a learnable compute block composed of dense layers for correction and contrastive objectives for anomaly identification. Furthermore, we introduce layer-specific Lipschitz modulation and gradient clipping as principled mechanisms to control sensitivity across detection and correction modules. Experimental results on multimodal fault datasets demonstrate that the proposed approach improves both anomaly detection accuracy and reconstruction under sensor corruption. Overall, this framework bridges the gap between analytical robustness guarantees and practical fault-tolerant multimodal learning.
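
文中以 Lipschitz 常数刻画扰动的放大或抑制。对一个线性层,其 Lipschitz 常数即权重矩阵的最大奇异值,可用幂迭代估计,并按目标上界缩放权重。以下为纯 Python 示意(非论文实现,调制规则为本文假设):

```python
import math

def spectral_norm(W, iters=50):
    """Estimate the largest singular value of matrix W (list of rows)
    by power iteration; for a dense layer this is its Lipschitz constant."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    sigma = 1.0
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / sigma for x in u]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / nv for x in v]
    return sigma

def modulate(W, target):
    """Rescale W so its Lipschitz constant does not exceed `target`,
    a minimal stand-in for layer-specific Lipschitz modulation."""
    s = spectral_norm(W)
    if s <= target:
        return W
    return [[w * target / s for w in row] for row in W]
```

对检测模块取较大的 target 可保留异常信号,对校正模块取较小的 target 则抑制故障传播,这正是分层调制的动机。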

[AI-44] Large Language Models as Optimization Controllers: Adaptive Continuation for SIMP Topology Optimization

【速读】:该论文旨在解决拓扑优化中传统连续性策略(continuation strategy)缺乏自适应能力的问题,即固定参数调度难以根据优化过程中的实时状态动态调整关键超参数(如惩罚指数 $ p $、投影锐度 $ \beta $、滤波半径 $ r_{\min} $ 和移动限制 $ \delta $),从而影响最终解的质量和收敛效率。解决方案的关键在于引入一个大型语言模型(Large Language Model, LLM)作为在线自适应控制器,通过结构化观测当前优化状态(包括合规性、灰度指数、停滞计数、棋盘模式度量、体积分数及预算消耗),以直接数值控制接口输出最优超参数配置,并结合硬灰度门控机制防止过早二值化,同时利用元优化循环调优代理的调用频率与门限阈值,实现对整个优化路径的智能干预。实验表明,该方法在多个二维和三维基准问题上均显著优于四种基线方案,且所得解均为完全二值化,验证了LLM驱动的实时决策机制是性能提升的核心来源。

链接: https://arxiv.org/abs/2603.25099
作者: Shaoliang Yang,Jun Wang,Yunsheng Wang
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 36 pages, 11 figures

点击查看摘要

Abstract:We present a framework in which a large language model (LLM) acts as an online adaptive controller for SIMP topology optimization, replacing conventional fixed-schedule continuation with real-time, state-conditioned parameter decisions. At every k-th iteration, the LLM receives a structured observation - current compliance, grayness index, stagnation counter, checkerboard measure, volume fraction, and budget consumption - and outputs numerical values for the penalization exponent p, projection sharpness β, filter radius r_min, and move limit δ via a Direct Numeric Control interface. A hard grayness gate prevents premature binarization, and a meta-optimization loop uses a second LLM pass to tune the agent’s call frequency and gate threshold across runs. We benchmark the agent against four baselines - fixed (no-continuation), standard three-field continuation, an expert heuristic, and a schedule-only ablation - on three 2-D problems (cantilever, MBB beam, L-bracket) at 120×60 resolution and two 3-D problems (cantilever, MBB beam) at 40×20×10 resolution, all run for 300 iterations. A standardized 40-iteration sharpening tail is applied from the best valid snapshot so that compliance differences reflect only the exploration phase. The LLM agent achieves the lowest final compliance on every benchmark: -5.7% to -18.1% relative to the fixed baseline, with all solutions fully binary. The schedule-only ablation underperforms the fixed baseline on two of three problems, confirming that the LLM’s real-time intervention - not the schedule geometry - drives the gain. Code and reproduction scripts will be released upon publication.
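
Direct Numeric Control 接口要求 LLM 直接返回数值超参数。下面是一个假设回复为 JSON 对象的最小解析与截断示意(键名、取值范围与回复格式均为本文举例,并非论文的原始接口):

```python
import json

def parse_dnc(reply, bounds):
    """Parse a controller reply assumed to be a JSON object of
    hyperparameter values, clamping each value into its valid range and
    falling back to the lower bound when a key is missing."""
    raw = json.loads(reply)
    out = {}
    for key, (lo, hi) in bounds.items():
        val = float(raw.get(key, lo))
        out[key] = min(max(val, lo), hi)
    return out
```

这样即便 LLM 给出越界值(如将惩罚指数设为 9.0 而上限为 5.0),优化循环拿到的仍是合法配置。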

[AI-45] ElephantBroker: A Knowledge-Grounded Cognitive Runtime for Trustworthy AI Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在高风险、多轮交互场景中缺乏可信记忆系统的问题,尤其是现有记忆机制依赖扁平的键值存储或简单的向量检索,无法追踪知识来源与可信度。其解决方案的关键在于提出 ElephantBroker——一个开源的认知运行时系统,通过 Cognee SDK 将 Neo4j 知识图谱与 Qdrant 向量存储统一,构建可持久化、可验证的代理记忆体系;核心创新包括:五源混合检索管道、十一维竞争评分引擎用于预算受限上下文组装、四状态证据验证模型、五阶段目标感知的上下文生命周期管理、六层低成本前置安全防护流水线、AI 防火墙实现工具调用拦截与多级安全扫描、九阶段整合引擎以强化有用模式并衰减噪声,以及基于数值权威模型的多组织身份治理与分层访问控制。

链接: https://arxiv.org/abs/2603.25097
作者: Cristian Lupascu,Alexandru Lupascu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model-based agents increasingly operate in high-stakes, multi-turn settings where factual grounding is critical, yet their memory systems typically rely on flat key-value stores or plain vector retrieval with no mechanism to track the provenance or trustworthiness of stored knowledge. We present ElephantBroker, an open-source cognitive runtime that unifies a Neo4j knowledge graph with a Qdrant vector store through the Cognee SDK to provide durable, verifiable agent memory. The system implements a complete cognitive loop (store, retrieve, score, compose, protect, learn) comprising a hybrid five-source retrieval pipeline, an eleven-dimension competitive scoring engine for budget-constrained context assembly, a four-state evidence verification model, a five-stage context lifecycle with goal-aware assembly and continuous compaction, a six-layer cheap-first guard pipeline for safety enforcement, an AI firewall providing enforceable tool-call interception and multi-tier safety scanning, a nine-stage consolidation engine that strengthens useful patterns while decaying noise, and a numeric authority model governing multi-organization identity with hierarchical access control. Architectural validation through a comprehensive test suite of over 2,200 tests spanning unit, integration, and end-to-end levels confirms subsystem correctness. The modular design supports three deployment tiers, five profile presets with inheritance, multi-gateway isolation, and a management dashboard for human oversight, enabling configurations from lightweight memory-only agents to full cognitive runtimes with enterprise-grade safety and auditability.

[AI-46] Sparse Visual Thought Circuits in Vision-Language Models

【速读】:该论文试图解决的问题是:稀疏自编码器(Sparse Autoencoders, SAEs)在多模态模型中虽提升了可解释性,但其特征是否构成模块化、可组合的推理单元——这一假设是许多基于干预的控制方法的基础——尚不明确。研究发现,SAE特征通常不具备良好的模块性:对任务选择性特征集进行干预可小幅提升推理准确率,而对两个此类特征集的并集进行干预则会引发显著的输出漂移(即预测发生大幅非预期变化),即使在范数约束扰动下也如此,表明存在共享内部路径导致激活偏移放大。解决方案的关键在于开发了一个可复现的因果分析流程,用于定位和测试Qwen3-VL-8B中的稀疏视觉思维电路(sparse visual thought circuits)。该流程包括在合成基准上识别任务类型信息的中层解码器位置、训练SAEs、通过显式规则构建任务选择性特征集,并在推理时进行缩放与消融实验,同时量化准确率与漂移程度,从而厘清SAE特征可组合性的边界并提供一套严格的诊断框架以实现更可靠的视觉语言模型(VLM)控制。

链接: https://arxiv.org/abs/2603.25075
作者: Yunpeng Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy, even under norm-matched perturbations. This non-modular circuit interference is consistent with shared internal pathways where feature unions amplify activation shifts. We develop a reproducible causal pipeline to localize and test these sparse visual thought circuits in Qwen3-VL-8B. On a controlled synthetic benchmark with seven task types and three difficulty levels, linear probes identify a mid-decoder locus for task-type information. We train SAEs at this layer, construct task-selective sets via an explicit rule, and perform inference-time scaling and ablation while quantifying accuracy and drift. Our findings, validated with bootstrapped subsamples and permutation controls and replicated across multiple VLM families and five diverse datasets, clarify the boundaries of SAE feature composability and provide a rigorous diagnostic framework for more reliable VLM control.
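
文中的 norm-matched 对照要求不同干预的扰动 L2 范数一致,以便公平比较单特征集与并集干预,其核心操作可示意如下(纯数值草图,非论文代码):

```python
import math

def norm_matched(perturb, reference):
    """Rescale `perturb` so its L2 norm equals that of `reference`,
    letting interventions of different raw magnitudes be compared fairly."""
    n_p = math.sqrt(sum(x * x for x in perturb)) or 1.0
    n_r = math.sqrt(sum(x * x for x in reference))
    return [x * n_r / n_p for x in perturb]
```

这样若并集干预仍造成更大的输出漂移,就可以排除“扰动更大”这一平凡解释。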

[AI-47] An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks

【速读】:该论文旨在解决农业领域因气候变化、土壤退化和资源枯竭带来的作物适宜性预测难题,以支持精准农业决策。其核心解决方案是提出一种可解释的集成学习范式,关键在于融合优化的特征金字塔(feature pyramid)、深度神经网络、自注意力机制与残差网络(Residual Network),并结合数据预处理(如标签编码、IQR异常值剔除、StandardScaler标准化及SMOTE过采样)与多模型对比(包括逻辑回归、K近邻、支持向量机、随机森林、梯度提升等),最终通过元集成设计(Final Ensemble)实现98.80%的准确率、精确率、召回率与F1分数,显著优于单一模型(如K近邻的95.56%)。同时,借助SHAP与排列重要性等可解释AI方法识别出土壤pH、氮和锌为关键影响因子,有效弥合复杂机器学习模型与农业实际决策之间的鸿沟,提升AI推荐系统的可信度与可持续性。

链接: https://arxiv.org/abs/2603.25070
作者: Syed Rayhan Masud,SK Muktadir Hossain,Md. Ridoy Sarkar,Mohammad Sakib Mahmood,Md. Kishor Morol,Rakib Hossain Sajib
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agriculture is increasingly challenged by climate change, soil degradation, and resource depletion, and hence requires advanced data-driven crop classification and recommendation solutions. This work presents an explainable ensemble learning paradigm that fuses optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks for bolstering crop suitability predictions based on soil characteristics (e.g., pH, nitrogen, potassium) and climatic conditions (e.g., temperature, rainfall). With a dataset comprising 3,867 instances and 29 features from the Ethiopian Agricultural Transformation Agency and NASA, the paradigm leverages preprocessing methods such as label encoding, outlier removal using IQR, normalization through StandardScaler, and SMOTE for balancing classes. A range of machine learning models such as Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a new Relative Error Support Vector Machine are compared, with hyperparameter tuning through Grid Search and cross-validation. The suggested “Final Ensemble” meta-ensemble design achieves 98.80% accuracy, precision, recall, and F1-score, outperforming individual models such as K-Nearest Neighbors (95.56% accuracy). Explainable AI methods, such as SHAP and permutation importance, offer actionable insights, highlighting critical features such as soil pH, nitrogen, and zinc. The paradigm addresses the gap between intricate ML models and actionable agricultural decision-making, fostering sustainability and trust in AI-powered recommendations.
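
论文的 “Final Ensemble” 是堆叠式元集成;作为思路示意,下面给出最简单的逐样本多数投票集成(仅说明集成思想,并非论文的堆叠设计):

```python
from collections import Counter

def majority_vote(per_model_preds):
    """Combine per-model crop predictions sample-by-sample by majority
    vote; ties resolve to the first label encountered (CPython ordering)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*per_model_preds)]
```

实际的元集成会以各基学习器的预测(或概率)作为特征,再训练一个元学习器,而非简单计票。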

[AI-48] The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在邮件钓鱼检测任务中因系统提示(system prompt)配置不当而导致的检测性能剧烈波动问题,即同一模型在不同提示策略下,钓鱼绕过率可从不足1%飙升至97%,且误报率(false positive rate)也随模型变化显著。解决方案的关键在于揭示提示-模型交互(prompt-model interaction)是影响安全性的首要变量,并提出通过优化提示以利用高度预测性信号(如发件人域名与URL域名匹配)来提升基准检测性能(最高达93.7%召回率,3.8%误报率),但同时也指出此类优化会引入脆弱的攻击面——当攻击者反向利用该信号(如注册匹配的恶意基础设施)时,模型性能急剧下降。研究进一步发现,过度特化的提示反而削弱已具备较强推理能力的模型,因其将多信号综合判断替换为对单一信号的依赖,从而暴露于对抗性攻击。作者提出“Safetility”这一部署感知指标以量化误报惩罚,并主张仅靠提示工程难以弥合对抗差距,需结合外部真实信息(external ground truth)的工具增强才能实现鲁棒性提升。

链接: https://arxiv.org/abs/2603.25056
作者: Ron Litvak
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 32 pages, 4 figures, 6 tables

点击查看摘要

Abstract:System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single model’s phishing bypass rate ranges from under 1% to 97% depending on how it is configured, while the false-positive cost of the same prompt varies sharply across models. We then show that optimizing prompts around highly predictive signals can improve benchmark performance, reaching up to 93.7% recall at 3.8% false positive rate, but also creates a brittle attack surface. In particular, domain-matching strategies perform well when legitimate emails mostly have matched sender and URL domains, yet degrade sharply when attackers invert that signal by registering matching infrastructure. Response-trace analysis shows that 98% of successful bypasses reason in ways consistent with the inverted signal: the models are following the instruction, but the instruction’s core assumption has become false. A counter-intuitive corollary follows: making prompts more specific can degrade already-capable models by replacing broader multi-signal reasoning with exploitable single-signal dependence. We characterize the resulting tension between detection, usability, and adversarial robustness as a navigable tradeoff, introduce Safetility, a deployability-aware metric that penalizes false positives, and argue that closing the adversarial gap likely requires tool augmentation with external ground truth.
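
Safetility 的精确公式未在摘要中给出。下面是一个体现其思想(召回率按加权误报率折减)的假设性草图,线性形式与惩罚权重均为本文举例:

```python
def safetility(tp, fn, fp, tn, fp_penalty=5.0):
    """Deployability-aware score: recall discounted by a weighted
    false-positive rate. The linear form and the penalty weight are
    illustrative assumptions, not the paper's exact definition."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall - fp_penalty * fpr
```

例如召回率 90%、误报率 4% 时,该假设性得分为 0.9 − 5 × 0.04 = 0.7,体现了误报对可部署性的重罚。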

[AI-49] MP-MoE: Matrix Profile-Guided Mixture of Experts for Precipitation Forecasting

【速读】:该论文旨在解决热带地区(如越南)降水预报中因复杂地形和对流不稳定性导致数值天气预报(Numerical Weather Prediction, NWP)模型精度受限的问题,尤其是现有数据驱动后处理方法依赖点对点目标函数时,在时间微小偏移下易产生“双重惩罚”效应,从而影响强降雨事件的捕捉能力。其解决方案的关键在于提出Matrix Profile-guided Mixture of Experts (MP-MoE)框架,通过融合传统强度损失与基于子序列相似性的结构感知Matrix Profile目标函数,利用滑动窗口内的时间模式匹配而非逐点误差进行专家选择,有效缓解相位偏移带来的过度惩罚问题,从而更准确地识别峰值降水强度并保持风暴事件的形态完整性。

链接: https://arxiv.org/abs/2603.25046
作者: Huyen Ngoc Tran,Dung Trung Tran,Hong Nguyen,Xuan Vu Phan,Nam-Phong Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Precipitation forecasting remains a persistent challenge in tropical regions like Vietnam, where complex topography and convective instability often limit the accuracy of Numerical Weather Prediction (NWP) models. While data-driven post-processing is widely used to mitigate these biases, most existing frameworks rely on point-wise objective functions, which suffer from the “double penalty” effect under minor temporal misalignments. In this work, we propose the Matrix Profile-guided Mixture of Experts (MP-MoE), a framework that integrates conventional intensity loss with a structural-aware Matrix Profile objective. By leveraging subsequence-level similarity rather than point-wise errors, the proposed loss facilitates more reliable expert selection and mitigates excessive penalization caused by phase shifts. We evaluate MP-MoE on rainfall datasets from two major river basins in Vietnam across multiple horizons, including 1-hour intensity and accumulated rainfall over 12, 24, and 48 hours. Experimental results demonstrate that MP-MoE outperforms raw NWP and baseline learning methods in terms of Mean Critical Success Index (CSI-M) for heavy rainfall events, while significantly reducing Dynamic Time Warping (DTW) values. These findings highlight the framework’s efficacy in capturing peak rainfall intensities and preserving the morphological integrity of storm events.
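
Matrix Profile 的核心是:对序列中每个长度为 m 的子序列,计算其与最近邻(排除平凡自匹配)子序列的 z 归一化欧氏距离。暴力版本可示意如下,复杂度 O(n²m),实际系统通常改用 STOMP/SCRIMP 等快速算法:

```python
import math

def matrix_profile(series, m):
    """Brute-force matrix profile: for each length-m subsequence, the
    z-normalized Euclidean distance to its nearest neighbour outside a
    trivial-match exclusion zone."""
    def znorm(x):
        mu = sum(x) / len(x)
        sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x)) or 1.0
        return [(v - mu) / sd for v in x]
    n = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n)]
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) <= m // 2:  # skip trivial self-matches
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            best = min(best, d)
        profile.append(best)
    return profile
```

对周期性序列,profile 接近 0;profile 的峰值对应最异常的子序列,这正是其作为结构感知目标函数的直觉来源。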

[AI-50] Mechanistically Interpreting Compression in Vision-Language Models

【速读】:该论文旨在解决压缩视觉语言模型(Vision-Language Models, VLMs)在降低内存和计算成本的同时,如何保持其内部计算机制与安全行为不变的问题。研究发现,剪枝(pruning)虽能保留电路结构但会旋转和衰减内部特征,而量化(quantization)则在更高层级修改电路却使幸存特征更对齐。基于此洞察,作者提出VLMSafe-420基准,通过配对有害输入与其对应的良性反事实样本,系统评估不同压缩方法对模型真实拒绝行为的影响,揭示剪枝显著削弱了模型的安全响应能力,强调压缩策略选择具有重要的安全意义。

链接: https://arxiv.org/abs/2603.25035
作者: Veeraraju Elluru,Arth Singh,Roberto Aguero,Ajay Agarwal,Debojyoti Das,Hreetam Paul
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures, 12 tables

点击查看摘要

Abstract:Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization fundamentally change the internals across representative VLMs. We observe that pruning generally keeps circuit structure intact but rotates and attenuates internal features, while quantization modifies the circuits at a higher level yet leaves the surviving features better aligned. Leveraging this insight, we also introduce VLMSafe-420, a novel benchmark that pairs harmful inputs with matched benign counterfactuals across various safety categories. Our findings show that pruning causes a sharp drop in genuine refusal behavior, suggesting that the choice of compression has safety implications.

[AI-51] From Stateless to Situated: Building a Psychological World for LLM-Based Emotional Support

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在心理支持与情感陪伴场景中因依赖局部的下一个词预测机制而导致的时序连续性缺失、阶段意识模糊及用户同意边界失控的问题。其核心挑战不在于生成自然语言本身,而在于构建一个可持续更新的外部情境结构以支撑过程导向的情感干预。解决方案的关键在于提出LEKIA 2.0架构,通过将认知层与执行层分离,实现情境建模与干预执行的解耦,从而确保系统在整个对话过程中维持对用户情境和同意边界的稳定表征。

链接: https://arxiv.org/abs/2603.25031
作者: Boning Zhao,Clover Hu,Xinnuo Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In psychological support and emotional companionship scenarios, the core limitation of large language models (LLMs) lies not merely in response quality, but in their reliance on local next-token prediction, which prevents them from maintaining the temporal continuity, stage awareness, and user consent boundaries required for multi-turn intervention. This stateless characteristic makes systems prone to premature advancement, stage misalignment, and boundary violations in continuous dialogue. To address this problem, we argue that the key challenge in process-oriented emotional support is not simply generating natural language, but constructing a sustainably updatable external situational structure for the model. We therefore propose LEKIA 2.0, a situated LLM architecture that separates the cognitive layer from the executive layer, thereby decoupling situational modeling from intervention execution. This design enables the system to maintain stable representations of the user’s situation and consent boundaries throughout ongoing interaction. To evaluate this process-control capability, we further introduce a Static-to-Dynamic online evaluation protocol for multi-turn interaction. LEKIA achieved an average absolute improvement of approximately 31% over prompt-only baselines in deep intervention loop completion. The results suggest that an external situational structure is a key enabling condition for building stable, controllable, and situated emotional support systems.

[AI-52] System-Anchored Knee Estimation for Low-Cost Context Window Selection in PDE Forecasting

【速读】:该论文旨在解决固定窗口自回归神经偏微分方程(PDE)模拟器中低成本上下文窗口选择(context-window selection)这一未形式化的问题。现有方法如穷举验证、直接低成本搜索或系统理论记忆估计,存在成本高、鲁棒性差或与下游滚动预测性能不一致的缺陷。解决方案的关键在于提出一种两阶段方法——系统锚定拐点估计(System-Anchored Knee Estimation, SAKE):第一阶段从物理可解释的系统锚点(system anchors)中识别出一个结构化的候选集;第二阶段在该候选集中进行基于拐点感知的下游选择,从而实现高效且贴近实际性能的窗口选择。实验表明,SAKE在八类PDEBench问题上均表现最优,在相同预算下显著优于其他方法,同时仅消耗0.051倍的搜索成本(节省94.9%)。

链接: https://arxiv.org/abs/2603.25025
作者: Wenshuo Wang,Fan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive neural PDE simulators predict the evolution of physical fields one step at a time from a finite history, but low-cost context-window selection for such simulators remains an unformalized problem. Existing approaches to context-window selection in time-series forecasting include exhaustive validation, direct low-cost search, and system-theoretic memory estimation, but they are either expensive, brittle, or not directly aligned with downstream rollout performance. We formalize explicit context-window selection for fixed-window autoregressive neural PDE simulators as an independent low-cost algorithmic problem, and propose System-Anchored Knee Estimation (SAKE), a two-stage method that first identifies a small structured candidate set from physically interpretable system anchors and then performs knee-aware downstream selection within it. Across all eight PDEBench families evaluated under the shared L ∈ {1, …, 16} protocol, SAKE is the strongest overall matched-budget low-cost selector among the evaluated methods, achieving 67.8% Exact, 91.7% Within-1, 6.1% mean regret@knee, and a cost ratio of 0.051 (94.9% normalized search-cost savings).
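
SAKE 第二阶段的 knee-aware 选择,可用经典的 “离弦最远点” 启发式来示意:在 “验证误差 vs 窗口长度” 曲线上取距首尾连线最远的点。以下仅为思路草图,并非论文的精确判据:

```python
def knee_index(values):
    """Return the index of the knee of a decreasing curve: the point
    farthest (vertically) from the chord joining the two endpoints."""
    n = len(values)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    norm = [(v - lo) / span for v in values]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(norm):
        x = i / (n - 1)
        chord = norm[0] + (norm[-1] - norm[0]) * x
        d = abs(y - chord)
        if d > best_d:
            best_i, best_d = i, d
    return best_i
```

取 knee 即在边际收益消失前选出最小的上下文窗口,从而兼顾滚动精度与计算成本。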

[AI-53] A Public Theory of Distillation Resistance via Constraint-Coupled Reasoning Architectures

【速读】:该论文旨在解决生成式 AI(Generative AI)领域中知识蒸馏(knowledge distillation)、模型提取(model extraction)和行为迁移(behavior transfer)所带来的治理不对称性问题,即有用能力可能被低成本复制或转移,而原始治理结构无法同步迁移。解决方案的关键在于从架构层面构建一种约束耦合推理框架(constraint-coupled reasoning framework),其核心思想是:当高级能力与内部稳定性约束(internal stability constraints)相耦合,从而限制状态转移路径时,知识蒸馏作为捷径的价值将显著降低。该框架包含四个要素:有限转移负担(bounded transition burden)、路径负载累积(path-load accumulation)、动态演化的可行区域(dynamically evolving feasible regions)以及能力-稳定性耦合条件(capability-stability coupling condition),为未来提升模型对蒸馏的抗性、对齐性和治理能力提供了可验证的理论基础。

链接: https://arxiv.org/abs/2603.25022
作者: Peng Wei,Wesley Shu
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge distillation, model extraction, and behavior transfer have become central concerns in frontier AI. The main risk is not merely copying, but the possibility that useful capability can be transferred more cheaply than the governance structure that originally accompanied it. This paper presents a public, trade-secret-safe theoretical framework for reducing that asymmetry at the architectural level. The core claim is that distillation becomes less valuable as a shortcut when high-level capability is coupled to internal stability constraints that shape state transitions over time. To formalize this idea, the paper introduces a constraint-coupled reasoning framework with four elements: bounded transition burden, path-load accumulation, dynamically evolving feasible regions, and a capability-stability coupling condition. The paper is intentionally public-safe: it omits proprietary implementation details, training recipes, thresholds, hidden-state instrumentation, deployment procedures, and confidential system design choices. The contribution is therefore theoretical rather than operational. It offers a falsifiable architectural thesis, a clear threat model, and a set of experimentally testable hypotheses for future work on distillation resistance, alignment, and model governance.

[AI-54] Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中故障归因(failure attribution)的单一因果假设局限性问题,即现有方法和基准通常假定每个故障仅有一个确定的根本原因,而实际场景中由于智能体间复杂的依赖关系和执行轨迹的模糊性,故障往往存在多个合理的归因解释。解决方案的关键在于提出“多视角故障归因”(multi-perspective failure attribution)这一新范式,明确建模并支持归因的不确定性,并设计了首个面向该范式的基准测试集 MP-Bench 及相应的评估协议,从而更真实地反映 MAS 的调试需求,揭示了以往认为大语言模型(LLMs)在故障归因中表现不佳的结论主要源于现有基准设计的缺陷。

链接: https://arxiv.org/abs/2603.25001
作者: Yeonjun In,Mehrab Tanjim,Jayakumar Subramanian,Sungchul Kim,Uttaran Bhattacharya,Wonjoong Kim,Sangwu Park,Somdeb Sarkhel,Chanyoung Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Failure attribution is essential for diagnosing and improving multi-agent systems (MAS), yet existing benchmarks and methods largely assume a single deterministic root cause for each failure. In practice, MAS failures often admit multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution trajectories. We revisit MAS failure attribution from a multi-perspective standpoint and propose multi-perspective failure attribution, a practical paradigm that explicitly accounts for attribution ambiguity. To support this setting, we introduce MP-Bench, the first benchmark designed for multi-perspective failure attribution in MAS, along with a new evaluation protocol tailored to this paradigm. Through extensive experiments, we find that prior conclusions suggesting LLMs struggle with failure attribution are largely driven by limitations in existing benchmark designs. Our results highlight the necessity of multi-perspective benchmarks and evaluation protocols for realistic and reliable MAS debugging.
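MP-Bench 的具体评估协议未在摘要中给出,下面给出一种假设性的"多视角"打分示意:每个故障标注一组合理归因集合,预测命中集合中任一智能体即计为正确。智能体名称与集合均为虚构。

```python
# 假设性的打分示意:每个故障标注一组"合理归因"集合,
# 预测命中集合中任一智能体即计为正确。
# 名称与指标均为笔者虚构,并非 MP-Bench 的实际协议。

def multi_perspective_accuracy(predictions, plausible_sets):
    hits = sum(1 for pred, gold in zip(predictions, plausible_sets) if pred in gold)
    return hits / len(predictions)

preds = ["planner", "executor", "critic"]
golds = [{"planner", "retriever"}, {"executor"}, {"planner"}]
print(multi_perspective_accuracy(preds, golds))  # 约 0.667(3 例中命中 2 例)
```

与单一确定根因的打分相比,这种集合式评估不会把"另一个同样合理的归因"误判为错误,这正是论文指出既有基准低估 LLM 归因能力的原因所在。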

[AI-55] Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model

【速读】:该论文旨在解决从人类驾驶示范中学习多样化且高保真度的交通流模拟问题,以提升自动驾驶系统的评估能力。现有基于下一词预测(Next-Token Prediction, NTP)的方法虽可通过监督微调(Supervised Fine-Tuning, SFT)实现迭代优化,但其在子最优区域难以主动探索潜在有价值的运动标记(motion token),限制了行为多样性与安全性。解决方案的关键在于提出一种基于运动标记熵模式的强化学习框架——R1Sim,其核心创新包括:(1)引入熵引导的自适应采样机制,聚焦于高不确定性但具潜力的运动标记,增强探索效率;(2)采用分组相对策略优化(Group Relative Policy Optimization, GRPO)结合安全感知奖励设计,实现群体级比较估计与行为优化,从而在探索与利用之间取得平衡,生成真实、安全且多样化的多智能体交互行为。

链接: https://arxiv.org/abs/2603.24989
作者: Ziyan Wang,Peng Chen,Ding Li,Chiwei Li,Qichao Zhang,Zhongpu Xia,Guizhen Yu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and systematically analyzes the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, resulting in realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state-of-the-art methods.
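针对摘要中的两处机制,可以写出如下最小化示意(这是笔者对文字描述的解读,并非作者代码):一是用熵度量运动 token 分布的不确定性,高熵位置可作为额外采样(探索)的候选;二是 GRPO 风格的组内相对优势估计。

```python
import math

# 示意代码(笔者对摘要两处机制的最小化解读,非作者实现):
# (1) 运动 token 分布的熵,用于标记高不确定性、值得额外采样的位置;
# (2) GRPO 风格的组内相对优势:奖励减去组均值再除以组标准差。

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

confident = [0.97, 0.01, 0.01, 0.01]  # 低熵:模型对该 token 很确定
uncertain = [0.25, 0.25, 0.25, 0.25]  # 高熵:值得额外采样探索
print(entropy(uncertain) > entropy(confident))  # True

advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(abs(sum(advs)) < 1e-9)  # True:组内优势零均值,实现组间比较
```

探索-利用的权衡即体现于此:高熵位置提供多样化采样,组内相对优势提供无需额外价值网络的比较式信号。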

[AI-56] The Anatomy of Uncertainty in LLMs

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在生成响应时不确定性来源不明确的问题,从而影响其可靠部署与改进。现有方法通常仅提供单一不确定性评分或依赖经典的 aleatoric-epistemic(随机性-认知性)二分法,无法为模型优化提供可操作的洞察。论文提出了一种不确定性分解框架,将LLM的不确定性解构为三个语义上独立的组成部分:(i) 输入模糊性(input ambiguity),源于提示词本身的歧义;(ii) 知识缺口(knowledge gaps),由参数化证据不足引起;(iii) 解码随机性(decoding randomness),源自采样过程中的随机性。该框架的关键在于通过实验揭示这三类不确定性在不同模型规模和任务场景下主导地位的变化,从而为评估LLM可靠性、检测幻觉现象以及实施针对性干预提供了理论基础和实践路径。

链接: https://arxiv.org/abs/2603.24967
作者: Aditya Taparia,Ransalu Senanayake,Kowshik Thopalli,Vivek Narayanaswamy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Understanding why a large language model (LLM) is uncertain about the response is important for their reliable deployment. Current approaches, which either provide a single uncertainty score or rely on the classical aleatoric-epistemic dichotomy, fail to offer actionable insights for improving the generative model. Recent studies have also shown that such methods are not enough for understanding uncertainty in LLMs. In this work, we advocate for an uncertainty decomposition framework that dissects LLM uncertainty into three distinct semantic components: (i) input ambiguity, arising from ambiguous prompts; (ii) knowledge gaps, caused by insufficient parametric evidence; and (iii) decoding randomness, stemming from stochastic sampling. Through a series of experiments we demonstrate that the dominance of these components can shift across model size and task. Our framework provides a better understanding to audit LLM reliability and detect hallucinations, paving the way for targeted interventions and more trustworthy systems.
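三分量分解的直观含义可以用一个玩具实验说明(以下为笔者自行构造的模拟,并非论文的估计方法):同一提示下重复采样的分歧近似"解码随机性",换用更模糊的提示后新增的分歧近似"输入模糊性"。提示与应答均为虚构。

```python
import random

# 玩具分解示意(笔者自行构造,并非论文的估计方法):
# 用一个随机应答的"模型"模拟——同一提示下重复采样的分歧
# 近似"解码随机性";换用模糊提示后新增的分歧近似"输入模糊性"。

def disagreement(answers):
    """1 - 众数应答占比,作为分歧度。"""
    top = max(set(answers), key=answers.count)
    return 1 - answers.count(top) / len(answers)

def mock_model(prompt, rng):
    if prompt == "capital of france?":            # 明确提示:仅有少量采样噪声
        return "paris" if rng.random() < 0.9 else "lyon"
    return rng.choice(["paris", "lyon", "nice"])  # 模糊提示:分歧显著增大

rng = random.Random(0)
same_prompt = [mock_model("capital of france?", rng) for _ in range(200)]
ambiguous = [mock_model("biggest french city?", rng) for _ in range(200)]

decoding_randomness = disagreement(same_prompt)
total = disagreement(ambiguous)
input_ambiguity = max(0.0, total - decoding_randomness)
print(decoding_randomness < total)  # True:模糊提示在采样噪声之上叠加额外不确定性
```

"知识缺口"在此玩具框架中对应于即便提示明确、温度为零仍无法给出正确答案的残余误差,需借助参数化证据的探测手段单独估计。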

[AI-57] Design Once Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems

【速读】:该论文旨在解决大规模计算广告平台中因模型生态系统复杂性导致的开发效率低下和技术创新传播延迟问题。具体而言,传统做法是为不同产品表面和广告主目标独立优化每个模型,这使得技术更新在多模型环境中传播时面临高复杂度(O(n·2^k)),严重制约了生成式 AI (Generative AI) 与推荐系统协同演进的速度。其解决方案的关键在于提出标准化模型模板(Standard Model Template, SMT),通过构建可组合、通用的机器学习(ML)组件,将模型构建流程从依赖特定场景的定制化设计转变为模块化配置,从而将技术传播复杂度降低至 O(n + k),显著提升跨模型迭代效率与技术采纳速率。实证结果表明,SMT 在保持性能稳定的前提下,实现了平均0.63%交叉熵改善、92%模型迭代工程时间缩减及6.3倍技术-模型对的采用吞吐量增长。

链接: https://arxiv.org/abs/2603.24963
作者: Jiang Liu,John Martabano Landy,Yao Xuan,Swamy Muddu,Nhat Le,Munaf Sahaf,Luc Kien Hang,Rupinder Khandpour,Kevin De Angeli,Chang Yang,Shouyuan Chen,Shiblee Sadik,Ani Agrawal,Djordje Gligorijevic,Jingzheng Qin,Peggy Yao,Alireza Vahdatpour
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates, conversion rates, and other optimization events. To support a wide variety of product surfaces and advertiser goals, these platforms frequently maintain an extensive ecosystem of machine learning (ML) models. However, operating at this scale creates significant development and efficiency challenges. Substantial engineering effort is required to regularly refresh ML models and propagate new techniques, which results in long latencies when deploying ML innovations across the ecosystem. We present a large-scale empirical study comparing model performance, efficiency, and ML technique propagation between a standardized model-building approach and independent per-model optimization in recommendation systems. To facilitate this standardization, we propose the Standard Model Template (SMT) – a framework that generates high-performance models adaptable to diverse data distributions and optimization events. By utilizing standardized, composable ML model components, SMT reduces technique propagation complexity from O(n \cdot 2^k) to O(n + k) where n is the number of models and k the number of techniques. Evaluating an extensive suite of models over four global development cycles within Meta’s production ads ranking ecosystem, our results demonstrate: (1) a 0.63% average improvement in cross-entropy at neutral serving capacity, (2) a 92% reduction in per-model iteration engineering time, and (3) a 6.3\times increase in technique-model pair adoption throughput. These findings challenge the conventional wisdom that diverse optimization goals inherently require diversified ML model design. 
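对上文 O(n·2^k) 与 O(n+k) 的复杂度对比,可以做一个数量级示意(n、k 的取值为任意假设,仅用于说明规模差距):

```python
# 数值示意:k 项技术在 n 个独立优化的模型上传播时,需针对各模型
# 逐一适配技术组合(O(n·2^k));共享模板只需 n 次模型接入加
# k 次技术接入(O(n + k))。取值仅为说明量级。

def per_model_cost(n_models, k_techniques):
    return n_models * 2 ** k_techniques   # 独立优化:组合爆炸

def template_cost(n_models, k_techniques):
    return n_models + k_techniques        # 模板化:线性增长

n, k = 50, 10
print(per_model_cost(n, k))   # 51200
print(template_cost(n, k))    # 60
```

即便 n、k 都不大,两者也相差近三个数量级,这解释了论文中迭代工程时间减少 92%、技术采纳吞吐量提升 6.3 倍的来源。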

[AI-58] Shopping with a Platform AI Assistant: Who Adopts When in the Journey and What For

【速读】:该论文旨在解决平台嵌入式购物AI(platform-embedded shopping AI)在电子商务场景中如何被消费者采用与使用的问题,特别是其在用户购买旅程中的角色定位及功能特性。研究基于携程(Ctrip)平台上3100万用户的实证数据,聚焦于其集成的大语言模型(LLM-based)AI助手“Wendao”,揭示了三类核心规律:高采纳率集中于年长、女性及高活跃度用户;AI聊天功能主要出现在传统搜索之后、下单之前,并常与搜索行为交错使用;用户更倾向于将该助手用于探索性任务(如景点查询占42%),且其意图随搜索时机和后续购买品类系统性变化。解决方案的关键在于识别出嵌入式购物AI并非简单替代传统搜索,而是作为辅助探索型产品发现的互补接口,从而为电商平台优化AI交互设计提供实证依据。

链接: https://arxiv.org/abs/2603.24947
作者: Se Yan,Han Zhong,Zemin (Zachary) Zhong,Wenyu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:This paper provides some of the first large-scale descriptive evidence on how consumers adopt and use platform-embedded shopping AI in e-commerce. Using data on 31 million users of Ctrip, China’s largest online travel platform, we study “Wendao,” an LLM-based AI assistant integrated into the platform. We document three empirical regularities. First, adoption is highest among older consumers, female users, and highly engaged existing users, reversing the younger, male-dominated profile commonly documented for general-purpose AI tools. Second, AI chat appears in the same broad phase of the purchase journey as traditional search and well before order placement; among journeys containing both chat and search, the most common pattern is interleaving, with users moving back and forth between the two modalities. Third, consumers disproportionately use the assistant for exploratory, hard-to-keyword tasks: attraction queries account for 42% of observed chat requests, and chat intent varies systematically with both the timing of chat relative to search and the category of products later purchased within the same journey. These findings suggest that embedded shopping AI functions less as a substitute for conventional search than as a complementary interface for exploratory product discovery in e-commerce.

[AI-59] Evaluating adaptive and generative AI-based feedback and recommendations in a knowledge-graph-integrated programming learning system

【速读】:该论文旨在解决传统自适应学习支持系统在代码评估、形成性反馈生成及练习推荐方面存在的局限性,尤其是在个性化指导与编程逻辑错误纠正方面的不足。其解决方案的关键在于构建一个融合大型语言模型(Large Language Model, LLM)与检索增强生成(Retrieval-Augmented Generation, RAG)机制的框架,并引入知识图谱(Knowledge Graph)和用户交互历史作为外部信息源,从而提升反馈质量与学习效果。实验表明,该框架在混合式生成式 AI-自适应(Hybrid GenAI-Adaptive)模式下表现最优,显著提升了正确代码提交数量并减少了逻辑缺失的代码尝试,验证了多模态信息融合与智能反馈协同优化的有效性。

链接: https://arxiv.org/abs/2603.24940
作者: Lalita Na Nongkhai,Jingyun Wang,Adam Wynn,Takahiko Mendori
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces the design and development of a framework that integrates a large language model (LLM) with a retrieval-augmented generation (RAG) approach leveraging both a knowledge graph and user interaction history. The framework is incorporated into a previously developed adaptive learning support system to assess learners’ code, generate formative feedback, and recommend exercises. Moreover, this study examines learner preferences across three instructional modes: adaptive, Generative AI (GenAI), and hybrid GenAI-adaptive. An experimental study was conducted to compare the learning performance and perceptions of the learners, and the effectiveness of these three modes, using four key log features derived from 4956 code submissions across all experimental groups. The analysis results show that learners receiving feedback from GenAI modes had significantly more correct code and fewer code submissions missing essential programming logic than those receiving feedback from adaptive mode. In particular, the hybrid GenAI-adaptive mode achieved the highest number of correct submissions and the fewest incorrect or incomplete attempts, outperforming both the adaptive-only and GenAI-only modes. Questionnaire responses further indicated that GenAI-generated feedback was widely perceived as helpful, while all modes were rated positively for ease of use and usefulness. These results suggest that the hybrid GenAI-adaptive mode outperforms the other two modes across all measured log features.

[AI-60] Decoding Market Emotions in Cryptocurrency Tweets via Predictive Statement Classification with Machine Learning and Transformers

【速读】:该论文旨在解决如何从加密货币相关的社交媒体文本(如推文)中自动识别并分类具有预测性质的陈述,以挖掘用户情绪与市场预期之间的关联。其核心问题在于构建一个能够准确区分“预测性”与“非预测性”语句,并进一步细分为“增量”、“减量”和“中性”类别的分类框架。解决方案的关键在于:首先采用人工标注与GPT辅助标注相结合的方法构建高质量数据集;其次利用SenticNet提取情感特征以增强模型对语义的理解;再次通过GPT生成的 paraphrasing 实现数据增强,有效缓解类别不平衡问题;最后在两个任务上分别对比多种机器学习、深度学习及Transformer模型的表现,发现Transformer模型在第一阶段(二分类)表现最优,而传统机器学习模型在第二阶段(三分类)更优,从而实现了高精度的预测性语句识别与细粒度分类。

链接: https://arxiv.org/abs/2603.24933
作者: Moein Shahiki Tash,Zahra Ahani,Mohim Tash,Mostafa Keikhay Farzaneh,Ari Y. Barrera-Animas,Olga Kolesnikova
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:The growing prominence of cryptocurrencies has triggered widespread public engagement and increased speculative activity, particularly on social media platforms. This study introduces a novel classification framework for identifying predictive statements in cryptocurrency-related tweets, focusing on five popular cryptocurrencies: Cardano, Matic, Binance, Ripple, and Fantom. The classification process is divided into two stages: Task 1 involves binary classification to distinguish between Predictive and Non-Predictive statements. Tweets identified as Predictive proceed to Task 2, where they are further categorized as Incremental, Decremental, or Neutral. To build a robust dataset, we combined manual and GPT-based annotation methods and utilized SenticNet to extract emotion features corresponding to each prediction category. To address class imbalance, GPT-generated paraphrasing was employed for data augmentation. We evaluated a wide range of machine learning, deep learning, and transformer-based models across both tasks. The results show that GPT-based balancing significantly enhanced model performance, with transformer models achieving the highest F1-score in Task 1, while traditional machine learning models performed best in Task 2. Furthermore, our emotion analysis revealed distinct emotional patterns associated with each prediction category across the different cryptocurrencies.
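两阶段级联的流程可以写成如下最小示意。此处用玩具关键词规则代替论文中的机器学习 / Transformer 模型,仅演示"先判预测性、再判方向"的管线结构;关键词与示例推文均为虚构。

```python
# 两阶段级联示意:玩具关键词规则代替论文中的 ML/Transformer 模型,
# 仅演示 Task 1(二分类过滤)-> Task 2(三分类定向)的流程。

def task1_is_predictive(tweet):
    return any(w in tweet.lower() for w in ("will", "going to", "expect"))

def task2_direction(tweet):
    text = tweet.lower()
    if any(w in text for w in ("rise", "pump", "moon")):
        return "Incremental"
    if any(w in text for w in ("drop", "dump", "crash")):
        return "Decremental"
    return "Neutral"

def classify(tweet):
    if not task1_is_predictive(tweet):   # Task 1:过滤非预测性语句
        return "Non-Predictive"
    return task2_direction(tweet)        # Task 2:仅对预测性语句定向

print(classify("ADA will rise next week"))      # Incremental
print(classify("I bought some XRP yesterday"))  # Non-Predictive
```

级联结构的好处是 Task 2 只需在预测性子集上训练与评估,这也解释了论文中两个任务可以分别由不同模型族(Transformer 与传统机器学习)取得最优表现。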

[AI-61] On the Foundations of Trustworthy Artificial Intelligence

【速读】:该论文旨在解决可信人工智能(Trustworthy AI)的核心基础问题,即如何确保AI系统在不同平台和环境下输出结果的一致性与可验证性。作者提出“确定性推理”(platform-deterministic inference)是构建可信AI的必要且充分条件,并通过形式化证明确立了“确定性-验证坍缩”(Determinism-Verification Collapse):在确定性条件下,验证仅需常数时间的哈希比对;而一旦失去确定性,验证则陷入难以处理的成员资格问题(membership problem)。其解决方案的关键在于构建一个纯整数推理引擎(pure integer inference engine),彻底规避IEEE 754浮点运算带来的非确定性,实现了ARM与x86架构下比特级一致的输出,在82次跨架构测试中零哈希差异,并通过356笔链上认证交易验证了多节点一致性。此方案表明,AI系统的公平性、鲁棒性、隐私保护、安全性及对齐等所有关键信任属性均以平台确定性为前提,从而将AI信任问题归结为底层算术实现的严谨性。

链接: https://arxiv.org/abs/2603.24904
作者: TJ Dunham
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 26 pages, 10 tables, 1 figure, 17 theorems/definitions/corollaries

点击查看摘要

Abstract:We prove that platform-deterministic inference is necessary and sufficient for trustworthy AI. We formalize this as the Determinism Thesis and introduce trust entropy to quantify the cost of non-determinism, proving that verification failure probability equals 1 - 2^-H_T exactly. We prove a Determinism-Verification Collapse: verification under determinism requires O(1) hash comparison; without it, the verifier faces an intractable membership problem. IEEE 754 floating-point arithmetic fundamentally violates the determinism requirement. We resolve this by constructing a pure integer inference engine that achieves bitwise identical output across ARM and x86. In 82 cross-architecture tests on models up to 6.7B parameters, we observe zero hash mismatches. Four geographically distributed nodes produce identical outputs, verified by 356 on-chain attestation transactions. Every major trust property of AI systems (fairness, robustness, privacy, safety, alignment) presupposes platform determinism. Our system, 99,000 lines of Rust deployed across three continents, establishes that AI trust is a question of arithmetic. 
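摘要中的两个量化结论可以直接验算:验证失败概率 1 - 2^(-H_T) 在信任熵 H_T = 0(完全确定性)时恰为 0,而确定性前提下的验证本身退化为一次 O(1) 的哈希比对(以下用 SHA-256 示意,字节串为虚构样例)。

```python
import hashlib

# 数值示意:验证失败概率 1 - 2^(-H_T),以及确定性前提下
# 验证退化为一次常数时间的哈希比对。

def verification_failure_prob(h_t):
    return 1 - 2 ** (-h_t)

print(verification_failure_prob(0))  # 0.0:确定性推理,验证必然成功
print(verification_failure_prob(1))  # 0.5:一比特不确定性即损失一半可验证性

def verify(output_bytes, expected_digest):
    return hashlib.sha256(output_bytes).hexdigest() == expected_digest

out = b"model output"
digest = hashlib.sha256(out).hexdigest()
print(verify(out, digest))           # True:O(1) 的摘要比对
```

这也正是论文"确定性-验证坍缩"的含义:一旦输出逐比特可复现,跨架构、跨节点的一致性验证就只剩哈希相等性检查。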

[AI-62] Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence

【速读】:该论文旨在解决临床分诊场景中因网络连接带来的安全风险问题,特别是传统依赖外部网络传输数据的系统易受远程攻击、数据泄露和中间人篡改等威胁。其解决方案的关键在于提出一种主权人工智能(Sovereign AI)架构,通过在设备端完成全部推理任务,并利用物理单向数据通道(physically unidirectional channel)实现数据仅流入不流出,从而从硬件层面消除网络攻击面,而非依赖软件防护机制。该架构支持基于对话的症状采集与设备采集生命体征的融合处理,生成符合分诊标准的结构化临床记录,且对广播或数据隔离硬件(data diode)部署均具备传输无关性(transport-agnostic),适用于资源受限和高风险环境下的高保障医疗智能应用。

链接: https://arxiv.org/abs/2603.24898
作者: Vasu Srinivasan,Dhriti Vasu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 31 pages

点击查看摘要

Abstract:We present a Sovereign AI architecture for clinical triage in which all inference is performed on-device and inbound data is delivered via a physically unidirectional channel, implemented using receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. This design removes the network-mediated attack surface by construction, rather than attempting to secure it through software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, demonstrating how physical one-way data flow enables high-assurance operation in both resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.

[AI-63] Surrogates Spikes and Sparsity: Performance Analysis and Characterization of SNN Hyperparameters on Hardware

【速读】:该论文旨在解决Spiking Neural Networks (SNNs)在实际硬件部署中理论能效优势难以兑现的问题,核心在于揭示训练阶段超参数(如代理梯度函数和神经元模型配置)对推理时硬件级激活稀疏性的影响机制。其关键解决方案是通过系统性的工作负载特征分析,量化不同超参数组合对硬件延迟的敏感性,并证明标准准确率指标无法有效预测硬件效率;研究发现,选择合适的代理梯度函数(如Spike Rate Escape)和神经元模型(如从LIF切换至Lapicque)可显著降低推理延迟(最高达28%),且结合sparsity-aware超参数选择可在FPGA平台上实现超过2倍的延迟优化与9.1%的准确率提升,从而建立从训练参数到硬件行为的可预测映射方法。

链接: https://arxiv.org/abs/2603.24891
作者: Ilkin Aliyev,Jesus Lopez,Tosiron Adegbija
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer inherent advantages for low-power inference through sparse, event-driven computation. However, the theoretical energy benefits of SNNs are often decoupled from real hardware performance due to the opaque relationship between training-time choices and inference-time sparsity. While prior work has focused on weight pruning and compression, the role of training hyperparameters – specifically surrogate gradient functions and neuron model configurations – in shaping hardware-level activation sparsity remains underexplored. This paper presents a workload characterization study quantifying the sensitivity of hardware latency to SNN hyperparameters. We decouple the impact of surrogate gradient functions (e.g., Fast Sigmoid, Spike Rate Escape) and neuron models (LIF, Lapicque) on classification accuracy and inference efficiency across three event-based vision datasets: DVS128-Gesture, N-MNIST, and DVS-CIFAR10. Our analysis reveals that standard accuracy metrics are poor predictors of hardware efficiency. While Fast Sigmoid achieves the highest accuracy on DVS-CIFAR10, Spike Rate Escape reduces inference latency by up to 12.2% on DVS128-Gesture with minimal accuracy trade-offs. We also demonstrate that neuron model selection is as critical as parameter tuning; transitioning from LIF to Lapicque neurons yields up to 28% latency reduction. We validate on a custom cycle-accurate FPGA-based SNN instrumentation platform, showing that sparsity-aware hyperparameter selection can improve accuracy by 9.1% and latency by over 2x compared to baselines. These findings establish a methodology for predicting hardware behavior from training parameters. The RTL and reproducibility artifacts are at this https URL. 
Journal reference: IEEE International Symposium on Performance Analysis of Systems and Software 2026

[AI-64] Resisting Humanization: Ethical Front-End Design Choices in AI for Sensitive Contexts

【速读】:该论文试图解决的问题是:当前人工智能(AI)伦理讨论主要聚焦于后端技术问题(如数据治理、模型训练和算法决策),而对前端设计选择的伦理意义关注不足,尤其是在基于自然语言处理(NLP)的对话式用户界面(CUI)中,人类化设计元素(如对话交互、情感化语言、人格模式和拟人化隐喻)如何影响用户的心理模型、信任校准与行为反应缺乏系统审视。解决方案的关键在于提出“伦理前端设计”是一种程序性伦理(procedural ethics),其核心在于通过交互设计的选择而非仅依赖系统逻辑来实现价值敏感的设计实践,并以非营利组织Chayn开发的面向性别暴力幸存者的AI系统为例,说明在脆弱情境下应基于创伤知情原则进行有意识的克制性设计,从而避免因不当的人类化设计导致用户期望错位、信任误判及自主权削弱。

链接: https://arxiv.org/abs/2603.24853
作者: Silvia Rossi,Diletta Huyskes,Mackenzie Jorgensen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End

点击查看摘要

Abstract:Ethical debates in AI have primarily focused on back-end issues such as data governance, model training, and algorithmic decision-making. Less attention has been paid to the ethical significance of front-end design choices, such as the interaction and representation-based elements through which users interact with AI systems. This gap is particularly significant for Conversational User Interfaces (CUI) based on Natural Language Processing (NLP) systems, where humanizing design elements such as dialogue-based interaction, emotive language, personality modes, and anthropomorphic metaphors are increasingly prevalent. This work argues that humanization in AI front-end design is a value-driven choice that profoundly shapes users’ mental models, trust calibration, and behavioral responses. Drawing on research in human-computer interaction (HCI), conversational AI, and value-sensitive design, we examine how interfaces can play a central role in misaligning user expectations, fostering misplaced trust, and subtly undermining user autonomy, especially in vulnerable contexts. To ground this analysis, we discuss two AI systems developed by Chayn, a nonprofit organization supporting survivors of gender-based violence. Chayn is extremely cautious when building AI that interacts with or impacts survivors by operationalizing their trauma-informed design principles. This Chayn case study illustrates how ethical considerations can motivate principled restraint in interface design, challenging engagement-based norms in contemporary AI products. We argue that ethical front-end AI design is a form of procedural ethics, enacted through interaction choices rather than embedded solely in system logic.

[AI-65] A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

【速读】:该论文旨在解决临床预测模型在部署前缺乏系统性可解释性评估的问题,特别是针对不同临床任务与模型架构之间交互关系下解释方法的有效性差异。其关键解决方案在于构建一个全面的基准测试框架,涵盖多种临床预测任务和模型架构(如带注意力机制的模型),并系统评估主流解释方法(如Attention、KernelSHAP、LIME等)的表现。研究发现:注意力机制若被合理利用,能高效且忠实反映模型决策;而黑箱解释方法在时间序列临床任务中计算不可行;部分解释方法可靠性不足,难以信任。基于此,论文提出改进临床预测流程中可解释性的实践指南,并通过PyHealth开源框架确保结果的可复现性和扩展性。

链接: https://arxiv.org/abs/2603.24828
作者: Yongda Fan,John Wu,Andrea Fitzpatrick,Naveen Baskaran,Jimeng Sun,Adam Cross
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: this https URL.
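"合理利用注意力作为高效归因"的一种常见读法如下(这是笔者的示意性构造,并非该基准的实际流程):对多头注意力在输入时间步上的权重取平均并重新归一化,得到各时间步的相对重要性。权重数值为虚构。

```python
# 注意力归因示意(笔者构造,非基准实际流程):
# 多头权重取平均后在输入时间步上重新归一化,作为特征重要性。

def attention_attribution(head_weights):
    """head_weights: 每个注意力头在各输入时间步上的权重列表。"""
    n_steps = len(head_weights[0])
    avg = [sum(h[t] for h in head_weights) / len(head_weights) for t in range(n_steps)]
    total = sum(avg)
    return [w / total for w in avg]

heads = [
    [0.1, 0.7, 0.2],  # 头 1 主要关注时间步 1
    [0.2, 0.6, 0.2],  # 头 2 与之一致
]
scores = attention_attribution(heads)
print(scores.index(max(scores)))    # 1:时间步 1 贡献最大
print(abs(sum(scores) - 1) < 1e-9)  # True:归一化为合法分布
```

与 KernelSHAP、LIME 等需要大量扰动前向传播的黑箱方法相比,这类归因只需一次前向传播即可读出,这正是论文称其"高效"的原因。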

[AI-66] Learning From Developers: Towards Reliable Patch Validation at Scale for Linux OSDI’26

【速读】:该论文旨在解决大规模开源软件(如Linux内核)中补丁审查(patch reviewing)效率低下、依赖人工且难以应对日益增长的提交量的问题。当前尽管已有多种自动化检测工具,但核心审查仍高度依赖少数开发者的人力投入,尤其在处理复杂并发错误、可维护性问题等传统工具难以识别的缺陷时存在明显不足。解决方案的关键在于提出FLINT——一个基于历史开发者讨论提炼规则并结合无需训练/微调的大语言模型(LLM)的补丁验证框架。FLINT采用多阶段信息提取机制从过往讨论中自动构建验证规则,并在新补丁审查时检索相关规则生成带参考依据的报告,从而显著提升审查覆盖率与准确性,同时降低误报率(False Positive Rate),实现对传统工具难以发现的复杂缺陷的有效辅助识别。

链接: https://arxiv.org/abs/2603.24825
作者: Chih-En Lin,Attreyee Mukherjee,Ajay Rawat,Ruqi Zhang,Pedro Fonseca
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Submitted to OSDI’26

点击查看摘要

Abstract:Patch reviewing is critical for software development, especially in distributed open-source development, which highly depends on voluntary work, such as Linux. This paper studies the past 10 years of patch reviews of the Linux memory management subsystem to characterize the challenges involved in patch reviewing at scale. Our study reveals that the review process is still primarily reliant on human effort despite a wide-range of automatic checking tools. Although kernel developers strive to review all patch proposals, they struggle to keep up with the increasing volume of submissions and depend significantly on a few developers for these reviews. To help scale the patch review process, we introduce FLINT, a patch validation system framework that synthesizes insights from past discussions among developers and automatically analyzes patch proposals for compliance. FLINT employs a rule-based analysis informed by past discussions among developers and an LLM that does not require training or fine-tuning on new data, and can continuously improve with minimum human effort. FLINT uses a multi-stage approach to efficiently distill the essential information from past discussions. Later, when a patch proposal needs review, FLINT retrieves the relevant validation rules for validation and generates a reference-backed report that developers can easily interpret and validate. FLINT targets bugs that traditional tools find hard to detect, ranging from maintainability issues, e.g., design choices and naming conventions, to complex concurrency issues, e.g., deadlocks and data races. FLINT detected 2 new issues in Linux v6.18 development cycle and 7 issues in previous versions. FLINT achieves 21% and 14% of higher ground-truth coverage on concurrency bugs than the baseline with LLM only. Moreover, FLINT achieves a 35% false positive rate, which is lower than the baseline. 
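FLINT 检索环节的思路可以用一个玩具示意说明(规则文本与 Jaccard 打分器均为笔者虚构,并非系统实际实现):给定补丁描述,按词汇重合度检索最相关的、从历史讨论中提炼出的验证规则。

```python
# 玩具检索示意(规则文本与打分器均为虚构,非 FLINT 实现):
# 按词汇 Jaccard 相似度,为新补丁描述检索最相关的验证规则。

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

rules = {
    "lock-order": "take the mmap lock before the page table lock",
    "naming": "helper functions in mm should use the mm_ prefix",
    "refcount": "drop the page refcount on every error path",
}

def retrieve(patch_summary, rules, top_k=1):
    ranked = sorted(rules, key=lambda r: jaccard(rules[r], patch_summary), reverse=True)
    return ranked[:top_k]

print(retrieve("this patch reorders the page table lock and mmap lock", rules))
# ['lock-order']
```

实际系统中,检索到的规则连同其出处一起交给 LLM 生成带引用依据的报告,便于开发者核验;因为规则来自历史讨论而非模型参数,系统无需训练或微调即可随规则库持续改进。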

[AI-67] FODMP: Fast One-Step Diffusion of Movement Primitives Generation for Time-Dependent Robot Actions

【速读】:该论文旨在解决扩散模型在机器人学习中应用时存在的推理延迟与运动时序结构缺失之间的权衡问题。现有方法如Action-chunking扩散策略虽推理速度快,但仅能预测短段动作,缺乏对时间依赖性运动基元(如弹簧阻尼行为)的建模能力;而基于概率动态运动基元(ProDMPs)的Movement Primitive Diffusion (MPD) 虽能生成具有时序结构的轨迹,却因多步扩散过程导致推理延迟过高,难以用于实时控制。其解决方案的关键在于提出FODMP(Fast One-step Diffusion of Movement Primitives),通过将扩散模型蒸馏到ProDMP参数空间,并采用单步解码器生成运动轨迹,在保留运动基元时序结构的同时显著降低推理延迟,从而实现高速、高精度的闭环视觉控制,实验证明其推理速度比MPD快10倍、比action-chunking策略快7倍,且可完成快速拦截飞行球等高动态任务。

链接: https://arxiv.org/abs/2603.24806
作者: Xirui Shi,Arya Ebrahimi,Yi Hu,Jun Jin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models are increasingly used for robot learning, but current designs face a clear trade-off. Action-chunking diffusion policies like ManiCM are fast to run, yet they only predict short segments of motion. This makes them reactive, but unable to capture time-dependent motion primitives, such as following a spring-damper-like behavior with built-in dynamic profiles of acceleration and deceleration. Recently, Movement Primitive Diffusion (MPD) partially addresses this limitation by parameterizing full trajectories using Probabilistic Dynamic Movement Primitives (ProDMPs), thereby enabling the generation of temporally structured motions. Nevertheless, MPD integrates the motion decoder directly into a multi-step diffusion process, resulting in prohibitively high inference latency that limits its applicability in real-time control settings. We propose FODMP (Fast One-step Diffusion of Movement Primitives), a new framework that distills diffusion models into the ProDMPs trajectory parameter space and generates motion using a single-step decoder. FODMP retains the temporal structure of movement primitives while eliminating the inference bottleneck through single-step consistency distillation. This enables robots to execute time-dependent primitives at high inference speed, suitable for closed-loop vision-based control. On standard manipulation benchmarks (MetaWorld, ManiSkill), FODMP runs up to 10 times faster than MPD and 7 times faster than action-chunking diffusion policies, while matching or exceeding their success rates. Beyond speed, by generating fast acceleration-deceleration motion primitives, FODMP allows the robot to intercept and securely catch a fast-flying ball, whereas action-chunking diffusion policy and MPD respond too slowly for real-time interception.
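原文摘要仅作方法描述。下面给出一个极简的示意代码(假设性草图,非论文实现):用一组归一化径向基函数的权重向量"一步解码"出整条平滑轨迹,以说明为何在基元参数空间做扩散蒸馏只需去噪低维权重、而非逐时间步生成动作。`rbf_basis` 的形式与权重数值均为本文虚构,真实的 ProDMP 还包含动力学项。

```python
import math

def rbf_basis(t, centers, width):
    # 高斯径向基函数,按总和归一化,使输出是权重的凸组合
    phi = [math.exp(-((t - c) ** 2) / (2 * width ** 2)) for c in centers]
    s = sum(phi)
    return [p / s for p in phi]

def decode_trajectory(weights, n_steps=50, width=0.08):
    """一步解码:单个权重向量参数化整条平滑轨迹,
    扩散模型只需对低维权重去噪,而非每个时间步的动作。"""
    centers = [i / (len(weights) - 1) for i in range(len(weights))]
    traj = []
    for k in range(n_steps):
        t = k / (n_steps - 1)
        phi = rbf_basis(t, centers, width)
        traj.append(sum(w * p for w, p in zip(weights, phi)))
    return traj

# 5 个基元权重(虚构)展开为 50 步的平滑轨迹
traj = decode_trajectory([0.0, 0.2, 1.0, 0.4, 0.0])
print(len(traj))
```

由于基函数归一化,轨迹每一点都是权重的凸组合,天然落在权重的取值范围内。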

[AI-68] ReLope: KL-Regularized LoRA Probes for Multimodal LLM Routing

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中路由策略性能下降的问题,尤其是传统基于隐藏状态的探针路由(probe routing)在引入视觉输入后正确性信号分离度显著降低的现象。解决方案的关键在于提升隐藏状态中正确性信号的质量:一是提出注意力探针(Attention Probe),通过聚合前一层隐藏状态并依据注意力权重重构分布式的正确性信号;二是设计KL正则化LoRA探针(ReLope),在轻量级LoRA适配器中引入KL散度正则项以学习更具路由感知能力的表示。二者协同提升了探针路由在MLLMs中的有效性与鲁棒性。

链接: https://arxiv.org/abs/2603.24787
作者: Yaopei Zeng,Congchao Wang,Blake JianHang Chen,Lu Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Routing has emerged as a promising strategy for balancing performance and cost in large language model (LLM) systems that combine lightweight models with powerful but expensive large models. Recent studies show that probe routing, which predicts the correctness of a small model using its hidden states, provides an effective solution in text-only LLMs. However, we observe that these probes degrade substantially when applied to multimodal LLMs (MLLMs). Through empirical analysis, we find that the presence of visual inputs weakens the separability of correctness signals in hidden states, making them harder to extract using standard probe designs. To address this challenge, we introduce two complementary approaches for improving probe routing in MLLMs. First, we propose the Attention Probe, which aggregates hidden states from the preceding layer based on attention scores to recover distributed correctness signals. Second, we present the KL-Regularized LoRA Probe (ReLope), which inserts a lightweight LoRA adapter and applies a KL regularizer to learn routing-aware representations. Comprehensive experiments show that our methods consistently outperform baselines, suggesting that improving the quality of hidden states is key to effective routing in MLLMs. Our code is available at this https URL.
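作为理解"探针路由"的一个最小示意(假设性草图,非论文实现):在小模型隐藏状态上训练一个线性探针,预测其回答正确的概率,并据此决定是否升级到大模型。以下探针权重与隐藏状态均为虚构数值。

```python
import math

def probe_score(hidden_state, probe_weights, bias):
    """线性探针:由小模型隐藏状态预测其回答正确的概率(sigmoid)。"""
    z = sum(h * w for h, w in zip(hidden_state, probe_weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def route(hidden_state, probe_weights, bias, threshold=0.5):
    """探针置信时留在便宜的小模型,否则升级到昂贵的大模型。"""
    p = probe_score(hidden_state, probe_weights, bias)
    return ("small", p) if p >= threshold else ("large", p)

# 虚构的 4 维隐藏状态与已训练好的探针参数
probe_w, probe_b = [0.8, -0.5, 0.3, 0.1], -0.2
confident = [1.2, -0.4, 0.9, 0.5]
uncertain = [-0.6, 1.1, -0.2, 0.0]
print(route(confident, probe_w, probe_b))  # 路由到小模型
print(route(uncertain, probe_w, probe_b))  # 升级到大模型
```

论文的两项改进(注意力聚合、KL 正则的 LoRA 适配)都可以理解为让输入这个探针的隐藏表示更可分。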

[AI-69] AIP: Agent Identity Protocol for Verifiable Delegation Across MCP and A2A

【速读】:该论文旨在解决当前AI代理在通过模型上下文协议(Model Context Protocol, MCP)调用工具和通过代理间通信(Agent-to-Agent, A2A)进行任务委派时缺乏身份验证机制的问题。现有系统普遍未实现对代理身份的可信验证,导致潜在的安全风险,如非法代理冒充、委托链被篡改或审计追踪失效。解决方案的关键在于提出一种名为“调用绑定能力令牌”(Invocation-Bound Capability Tokens, IBCTs)的新原语,其核心创新是将身份认证、可衰减授权(holder-side attenuation)与执行溯源绑定(provenance binding)融合为一个不可篡改的令牌链(append-only token chain)。IBCTs支持两种传输格式:紧凑模式(用于单跳场景的签名JWT)和链式模式(用于多跳委托的Biscuit令牌结合Datalog策略),从而实现细粒度的权限控制与端到端的可审计性,且在真实部署中仅引入极小延迟(<0.22ms),同时在对抗测试中实现了100%攻击拦截率,尤其能检测传统方案无法识别的深度越权与空上下文规避行为。

链接: https://arxiv.org/abs/2603.24775
作者: Sunil Prakash
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 pages, 10 tables, 2 figures

点击查看摘要

Abstract:AI agents increasingly call tools via the Model Context Protocol (MCP) and delegate to other agents via Agent-to-Agent (A2A), yet neither protocol verifies agent identity. A scan of approximately 2,000 MCP servers found all lacked authentication. In our survey, we did not identify a prior implemented protocol that jointly combines public-key verifiable delegation, holder-side attenuation, expressive chained policy, transport bindings across MCP/A2A/HTTP, and provenance-oriented completion records. We introduce Invocation-Bound Capability Tokens (IBCTs), a primitive that fuses identity, attenuated authorization, and provenance binding into a single append-only token chain. IBCTs operate in two wire formats: compact mode (a signed JWT for single-hop cases) and chained mode (a Biscuit token with Datalog policies for multi-hop delegation). We provide reference implementations in Python and Rust with full cross-language interoperability. Compact mode verification takes 0.049ms (Rust) and 0.189ms (Python), with 0.22ms overhead over no-auth in real MCP-over-HTTP deployment. In a real multi-agent deployment with Gemini 2.5 Flash, AIP adds 2.35ms of overhead (0.086% of total end-to-end latency). Adversarial evaluation across 600 attack attempts shows 100% rejection rate, with two attack categories (delegation depth violation and audit evasion through empty context) uniquely caught by AIP’s chained delegation model that neither unsigned nor plain JWT deployments detect.
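下面用标准库 `hmac` 给出一个"仅追加的可衰减令牌链"的极简示意(假设性草图,并非 AIP 的 JWT/Biscuit 线上格式):每次委托只能收窄权限范围,验证时重算整条链上的签名并检查委托深度。密钥与字段名均为演示虚构。

```python
import hashlib, hmac, json

SECRET = b"demo-issuer-key"  # 虚构的签发方密钥,仅供示意

def mint(agent, scopes):
    block = {"agent": agent, "scopes": sorted(scopes), "depth": 0}
    tag = hmac.new(SECRET, json.dumps(block, sort_keys=True).encode(), hashlib.sha256).hexdigest()
    return [(block, tag)]

def attenuate(chain, agent, scopes):
    prev_block, prev_tag = chain[-1]
    # 持有方衰减:受委托者只能收窄、不能扩大权限范围
    narrowed = sorted(set(scopes) & set(prev_block["scopes"]))
    block = {"agent": agent, "scopes": narrowed, "depth": prev_block["depth"] + 1}
    msg = (prev_tag + json.dumps(block, sort_keys=True)).encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return chain + [(block, tag)]

def verify(chain, scope, max_depth=3):
    # 重算每个区块的签名,任何位置被篡改都会导致验证失败;
    # 请求的权限必须落在最后(最窄)一个区块的范围内,且深度不超限
    prev_tag = ""
    for i, (block, tag) in enumerate(chain):
        payload = json.dumps(block, sort_keys=True)
        msg = (prev_tag + payload).encode() if i else payload.encode()
        if not hmac.compare_digest(tag, hmac.new(SECRET, msg, hashlib.sha256).hexdigest()):
            return False
        prev_tag = tag
    last = chain[-1][0]
    return scope in last["scopes"] and last["depth"] <= max_depth

root = mint("planner", ["read", "write"])
delegated = attenuate(root, "tool-runner", ["read", "admin"])  # "admin" 在交集中被丢弃
assert verify(delegated, "read")
assert not verify(delegated, "write")  # 已被衰减掉
```

摘要中"委托深度越界"与"审计规避"两类攻击,正对应这里的 `max_depth` 检查和逐块签名重算。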

[AI-70] From Untestable to Testable: Metamorphic Testing in the Age of LLMs

【速读】:该论文旨在解决日益集成生成式 AI(Generative AI)与大型语言模型(Large Language Models, LLMs)的软件系统在测试过程中面临的挑战,尤其是由于LLMs本身具有不可靠性以及标注真实标签(labeled ground truth)难以规模化的问题。解决方案的关键在于采用变异测试(Metamorphic Testing),通过将多次测试执行之间的关系转化为可执行的测试断言(test oracles),从而无需依赖精确的标注数据即可有效验证系统行为的一致性和正确性。

链接: https://arxiv.org/abs/2603.24774
作者: Valerio Terragni
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for publication at IEEE Computer Magazine. This is the authors’ accepted manuscript. Version of record available via DOI: https://doi.org/10.1109/MC.2026.3671990

点击查看摘要

Abstract:This article discusses the challenges of testing software systems with increasingly integrated AI and LLM functionalities. LLMs are powerful but unreliable, and labeled ground truth for testing rarely scales. Metamorphic Testing solves this by turning relations among multiple test executions into executable test oracles.
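变异测试的核心是用"多次执行之间的关系"代替标注真值作为测试判定。下面以 `math.sin` 为被测对象给出一个最小示例:sin(x) = sin(π − x) 这一变异关系无需任何标注期望值即可构成可执行判定(`check_metamorphic` 为本文虚构的辅助函数):

```python
import math

def check_metamorphic(f, transform_input, relate_outputs, inputs):
    """对每个输入同时运行 f(x) 与 f(transform(x)),
    检查两个输出之间的变异关系,而非与标注真值比较。"""
    failures = []
    for x in inputs:
        y, y_t = f(x), f(transform_input(x))
        if not relate_outputs(y, y_t):
            failures.append(x)
    return failures

# 正弦的变异关系:对任意 x 都有 sin(x) = sin(pi - x)
failures = check_metamorphic(
    f=math.sin,
    transform_input=lambda x: math.pi - x,
    relate_outputs=lambda a, b: math.isclose(a, b, abs_tol=1e-9),
    inputs=[i / 10 for i in range(-50, 51)],
)
assert failures == []  # 判定器不需要任何标注期望值
```

对 LLM 系统,同样的框架可换成诸如"改写提示语不应翻转答案"之类的关系。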

[AI-71] Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regulation Agentic AI Loop for Engineering Design

【速读】:该论文旨在解决生成式 AI(Generative AI)在工程设计任务中因设计固着(design fixation)导致的性能瓶颈问题,即大语言模型(Large Language Model, LLM)设计代理容易陷入现有范式而缺乏对替代方案的有效探索,从而产生次优解。解决方案的关键在于提出两种新型协同调控机制:一是自调节环(Self-Regulation Loop, SRL),使设计代理能够自我监控其元认知(metacognition);二是共调节设计代理环(Co-Regulation Design Agentic Loop, CRDAL),引入一个元认知协调节代理来辅助主设计代理进行更有效的元认知调控,从而缓解设计固着。实验表明,CRDAL在电池包设计任务中显著提升了设计方案性能,且计算开销可控,同时展现出更优的潜在设计空间导航能力。

链接: https://arxiv.org/abs/2603.24768
作者: Zeda Xu,Nikolas Martelaro,Christopher McComb
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The engineering design research community has studied agentic AI systems that use Large Language Model (LLM) agents to automate the engineering design process. However, these systems are prone to some of the same pathologies that plague humans. Just as human designers, LLM design agents can fixate on existing paradigms and fail to explore alternatives when solving design challenges, potentially leading to suboptimal solutions. In this work, we propose (1) a novel Self-Regulation Loop (SRL), in which the Design Agent self-regulates and explicitly monitors its own metacognition, and (2) a novel Co-Regulation Design Agentic Loop (CRDAL), in which a Metacognitive Co-Regulation Agent assists the Design Agent in metacognition to mitigate design fixation, thereby improving system performance for engineering design tasks. In the battery pack design problem examined here, we found that the novel CRDAL system generates designs with better performance, without significantly increasing the computational cost, compared to a plain Ralph Wiggum Loop (RWL) and the metacognitively self-assessing Self-Regulation Loop (SRL). Also, we found that the CRDAL system navigated through the latent design space more effectively than both SRL and RWL. However, the SRL did not generate designs with significantly better performance than RWL, even though it explored a different region of the design space. The proposed system architectures and findings of this work provide practical implications for future development of agentic AI systems for engineering design.

[AI-72] Grokking as a Falsifiable Finite-Size Transition

【速读】:该论文旨在解决“grokking”现象中长期存在的理论模糊性问题,即模型在早期阶段先记忆训练数据、随后突然实现泛化能力的现象是否可被严格定义为相变(phase transition)。此前该现象多以类比方式描述,缺乏可验证的有限尺寸输入证据。解决方案的关键在于引入两个核心要素:一是将群阶 $ p $ 作为可接受的广延变量(extensive variable),二是采用谱头尾对比(spectral head–tail contrast)作为表征层面的序参量(order parameter),从而构建出类似于凝聚态物理中的诊断链(diagnostic chain),包括粗网格扫描与近临界区域的密集审计。通过Binder-like交叉点识别共享的有限尺寸边界,并结合显著的AIC差异(ΔAIC=16.8)强烈排除平滑过渡(smooth crossover)的可能性,使“grokking”的相变表述成为可量化检验的有限尺寸命题。

链接: https://arxiv.org/abs/2603.24746
作者: Yuda Bi,Chenyu Zhang,Qiheng Wang,Vince D Calhoun
机构: 未知
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Grokking – the delayed onset of generalization after early memorization – is often described with phase-transition language, but that claim has lacked falsifiable finite-size inputs. Here we supply those inputs by treating the group order p of ℤ_p as an admissible extensive variable and a held-out spectral head–tail contrast as a representation-level order parameter, then apply a condensed-matter-style diagnostic chain to coarse-grid sweeps and a dense near-critical addition audit. Binder-like crossings reveal a shared finite-size boundary, and susceptibility comparison strongly disfavors a smooth-crossover interpretation (ΔAIC = 16.8 in the near-critical audit). Phase-transition language in grokking can therefore be tested as a quantitative finite-size claim rather than invoked as analogy alone, although the transition order remains unresolved at present.
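摘要中的 ΔAIC 判据可以用几行代码说明(以下对数似然与参数个数均为虚构的演示数值,并非论文数据):AIC = 2k − 2 ln L,两个候选模型的 AIC 差超过 10 即按惯例视为强证据。

```python
def aic(log_likelihood, k):
    """Akaike 信息准则:AIC = 2k - 2 ln L。"""
    return 2 * k - 2 * log_likelihood

# 虚构的两种拟合:有限尺寸相变模型 vs 平滑过渡模型
ll_transition, k_transition = -41.2, 4   # 演示数值,非论文拟合结果
ll_crossover,  k_crossover  = -50.6, 3

delta_aic = aic(ll_crossover, k_crossover) - aic(ll_transition, k_transition)
# 正的大差值(经验上 > 10)强烈不利于平滑过渡模型
print(round(delta_aic, 1))  # 16.8
```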

[AI-73] AutoSAM: an Agentic Framework for Automating Input File Generation for the SAM Code with Multi-Modal Retrieval-Augmented Generation

【速读】:该论文旨在解决先进反应堆系统设计与安全分析中,构建系统级热工水力代码(如System Analysis Module, SAM)输入文件时存在的劳动密集型问题。传统方法需分析师从异构工程文档中提取并校准设计数据,并手动转换为求解器特定语法,效率低且易出错。解决方案的关键在于提出AutoSAM——一个基于大语言模型代理(LLM agent)的自动化框架,其核心创新包括:结合检索增强生成(Retrieval-Augmented Generation, RAG)技术对求解器用户手册和理论手册进行语义理解,集成多模态工具处理PDF、图像、电子表格及文本文件,实现从非结构化工程文档中自动提取关键参数并生成可验证、求解器兼容的输入文件;同时通过中间表示层确保结果可审计,显著提升建模效率与透明度。

链接: https://arxiv.org/abs/2603.24736
作者: Zaid Abulawi(1 and 2),Zavier Ndum Ndum(1 and 2),Eric Cervi(2),Rui Hu(2),Yang Liu(1) ((1) Department of Nuclear Engineering, Texas A&M University, (2) Nuclear Science and Engineering Division, Argonne National Laboratory)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 34 Pages, 14 Figures

点击查看摘要

Abstract:In the design and safety analysis of advanced reactor systems, constructing input files for system-level thermal-hydraulics codes such as the System Analysis Module (SAM) remains a labor-intensive task. Analysts must extract and reconcile design data from heterogeneous engineering documents and manually translate it into solver-specific syntax. In this paper, we present AutoSAM, an agentic framework that automates SAM input file generation. The framework combines a large language model agent with retrieval-augmented generation over the solver’s user guide and theory manual, together with specialized tools for analyzing PDFs, images, spreadsheets, and text files. AutoSAM ingests unstructured engineering documents, including system diagrams, design reports, and data tables, extracts simulation-relevant parameters into a human-auditable intermediate representation, and synthesizes validated, solver-compatible input decks. Its multimodal retrieval pipeline integrates scientific text extraction, vision-based figure interpretation, semantic embedding, and query answering. We evaluate AutoSAM on four case studies of increasing complexity: a single-pipe steady-state model, a solid-fuel channel with temperature reactivity feedback, the Advanced Burner Test Reactor core, and the Molten Salt Reactor Experiment primary loop. Across all cases, the agent produces runnable SAM models consistent with expected thermal-hydraulic behavior while explicitly identifying missing data and labeling assumed values. The framework achieves 100% utilization of structured inputs, about 88% extraction from PDF text, and 100% completeness in vision-based geometric extraction. These results demonstrate a practical path toward prompt-driven reactor modeling, in which analysts provide system descriptions and supporting documentation while the agent translates them into transparent, and executable, SAM simulations.

[AI-74] Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses

【速读】:该论文旨在解决当前脉冲神经网络(Spiking Neural Networks, SNNs)在类脑计算中面临的高资源消耗问题,尤其是密集多层架构带来的巨大通信开销和状态存储成本。其解决方案的关键在于提出时间延迟自突触脉冲神经网络(Time-Delayed Autapse SNN, TDA-SNN),该框架通过单一漏电积分与发放神经元(leaky integrate-and-fire neuron)结合基于原型学习的训练策略,重构SNN模型;利用内部时序状态的重新组织,实现水库计算、多层感知机(MLP)和类卷积结构的统一建模,从而显著降低神经元数量和状态内存需求,同时提升每个神经元的信息容量,仅以增加时间延迟为代价,在极端单神经元场景下实现紧凑高效的计算单元。

链接: https://arxiv.org/abs/2603.24692
作者: Wuque Cai,Hongze Sun,Quan Tang,Shifeng Mao,Zhenxing Wang,Jiayi He,Duo Chen,Dezhong Yao,Daqing Guo
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking neural networks (SNNs) are promising for neuromorphic computing, but high-performing models still rely on dense multilayer architectures with substantial communication and state-storage costs. Inspired by autapses, we propose time-delayed autapse SNN (TDA-SNN), a framework that reconstructs SNNs with a single leaky integrate-and-fire neuron and a prototype-learning-based training strategy. By reorganizing internal temporal states, TDA-SNN can realize reservoir, multilayer perceptron, and convolution-like spiking architectures within a unified framework. Experiments on sequential, event-based, and image benchmarks show competitive performance in reservoir and MLP settings, while convolutional results reveal a clear space–time trade-off. Compared with standard SNNs, TDA-SNN greatly reduces neuron count and state memory while increasing per-neuron information capacity, at the cost of additional temporal latency in extreme single-neuron settings. These findings highlight the potential of temporally multiplexed single-neuron models as compact computational units for brain-inspired computing.
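为直观理解"带延迟自突触的单神经元",下面给出一个纯 Python 的 LIF 神经元示意(参数、输入与延迟均为虚构,非论文实现):神经元自身的脉冲经固定延迟反馈回膜电位,使单个神经元得以在时间上复用自身的历史状态。

```python
def lif_with_autapse(inputs, delay=3, tau=0.8, w_in=1.0, w_self=0.6, threshold=1.0):
    """单个漏电积分-发放(LIF)神经元,自身脉冲经 delay 步后反馈(自突触),
    相当于用时间复用换取额外的"虚拟层"。"""
    v, spikes = 0.0, []
    for t, x in enumerate(inputs):
        feedback = w_self * spikes[t - delay] if t >= delay else 0.0
        v = tau * v + w_in * x + feedback   # 漏电积分 + 外部输入 + 延迟自反馈
        fired = 1.0 if v >= threshold else 0.0
        if fired:
            v = 0.0  # 发放后硬复位
        spikes.append(fired)
    return spikes

spikes = lif_with_autapse([0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.0, 0.0])
print(spikes)
```

注意第 5 步的发放依赖第 2 步经自突触延迟传回的脉冲:没有这条反馈,单靠当时的输入不足以过阈。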

[AI-75] When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLM s

【速读】:该论文旨在解决多智能体系统(multi-agent systems)中,基于大语言模型(LLMs)的群体决策机制是否反映集体推理、系统性偏见或随机偶然性的问题。其核心挑战在于理解在无先验偏好情况下,为何智能体群体仍能迅速达成共识。解决方案的关键在于提出一个最小化模型——量化单纯形 gossip(Quantized Simplex Gossip, QSG),揭示了这种共识形成的微观机制源于相互的上下文学习(in-context learning)。在QSG中,智能体通过采样其他智能体的输出进行学习,使得一个智能体的任意选择成为下一个智能体的证据,并可能逐步累积形成一致性;这一过程类比于中性演化中的“文化漂变”(memetic drift),并预测存在从漂变主导(共识如彩票般随机)到选择主导(微弱偏见被放大)的相变行为。研究进一步推导出漂变引发极化的标度律,并在模拟与LLM命名游戏实验中验证,从而为多智能体社会表征形成机制提供了理论框架。

链接: https://arxiv.org/abs/2603.24676
作者: Hidenori Tanaka
机构: 未知
类目: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Biological Physics (physics.bio-ph); Physics and Society (physics.soc-ph)
备注: 19 pages, 10 figures

点击查看摘要

Abstract:Multi-agent systems powered by large language models (LLMs) are increasingly deployed in settings that shape consequential decisions, both directly and indirectly. Yet it remains unclear whether their outcomes reflect collective reasoning, systematic bias, or mere chance. Recent work has sharpened this question with naming games, showing that even when no individual agent favors any label a priori, populations rapidly break symmetry and reach consensus. Here, we reveal the mechanism by introducing a minimal model, Quantized Simplex Gossip (QSG), and trace the microscopic origin of this agreement to mutual in-context learning. In QSG, agents maintain internal belief states but learn from one another’s sampled outputs, so one agent’s arbitrary choice becomes the next agent’s evidence and can compound toward agreement. By analogy with neutral evolution, we call this sampling-driven regime memetic drift. QSG predicts a crossover from a drift-dominated regime, where consensus is effectively a lottery, to a selection regime, where weak biases are amplified and shape the outcome. We derive scaling laws for drift-induced polarization as a function of population size, communication bandwidth, in-context adaptation rate, and agents’ internal uncertainty, and we validate them in both QSG simulations and naming-game experiments with LLM populations. Together, these results provide a framework for studying the collective mechanisms of social representation formation in multi-agent systems.
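下面是一个与 QSG 思路相近的极简模拟(假设性草图,非论文模型):所有智能体初始信念完全对称,说话者从自身信念中采样一个标签,听者将该样本当作证据做上下文更新;采样噪声逐步放大,最终形成多数阵营,即摘要所述的"漂变"机制。

```python
import random
from collections import Counter

def quantized_simplex_gossip(n_agents=50, labels=("A", "B"), rounds=5000, lr=0.1, seed=0):
    """纯漂变的 gossip:人人初始对称,但一个智能体的采样输出
    成为下一个智能体的上下文证据,随机性因此被逐步放大。"""
    rng = random.Random(seed)
    # 均匀内部信念:没有任何智能体先验偏好某个标签
    beliefs = [{l: 1.0 / len(labels) for l in labels} for _ in range(n_agents)]
    for _ in range(rounds):
        speaker, listener = rng.sample(range(n_agents), 2)
        # 说话者从自身信念单纯形中量化采样一个标签……
        utterance = rng.choices(labels, weights=[beliefs[speaker][l] for l in labels])[0]
        # ……听者把该样本当作证据做指数滑动更新(上下文学习的极简替身)
        for l in labels:
            target = 1.0 if l == utterance else 0.0
            beliefs[listener][l] += lr * (target - beliefs[listener][l])
    votes = Counter(max(b, key=b.get) for b in beliefs)
    return votes.most_common(1)[0][1] / n_agents  # 多数阵营占比

majority = quantized_simplex_gossip()
print(f"majority fraction after drift: {majority:.2f}")
```

改变 `lr`(上下文适应速率)与 `n_agents` 即可观察摘要所述从"漂变主导"到"选择主导"的标度行为。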

[AI-76] Experiential Reflective Learning for Self-Improving LLM Agents ICLR2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)驱动的自主智能体在面对新环境时缺乏适应能力、无法利用历史交互经验的问题,即这些智能体通常对每个新任务都从零开始处理,未能有效复用过往经验。解决方案的关键在于提出一种名为“经验性反思学习”(Experiential Reflective Learning, ERL)的自改进框架:该框架通过反思单次任务执行轨迹与结果生成可操作的启发式规则(heuristics),并基于当前任务检索相关启发式规则注入到智能体上下文中以指导推理和决策。实验表明,ERL在Gaia2基准上相较ReAct基线提升成功率7.8%,且显著增强任务完成可靠性,证明了从单一尝试中提取可迁移启发式规则是实现高效智能体自我改进的有效途径。

链接: https://arxiv.org/abs/2603.24639
作者: Marc-Antoine Allard,Arnaud Teinturier,Victor Xing,Gautier Viaud
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at the ICLR 2026 MemAgents Workshop

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled the development of autonomous agents capable of complex reasoning and multi-step problem solving. However, these agents struggle to adapt to specialized environments and do not leverage past interactions, approaching each new task from scratch regardless of their accumulated experience. We introduce Experiential Reflective Learning (ERL), a simple self-improvement framework that enables rapid environment adaptation through experiential learning. ERL reflects on task trajectories and outcomes to generate heuristics, capturing actionable lessons that transfer across tasks. At test time, relevant heuristics are retrieved based on the current task and injected into the agent’s context to guide execution. On the Gaia2 benchmark, ERL improves success rate by 7.8% over a ReAct baseline, with large gains in task completion reliability, and outperforms prior experiential learning methods. Through systematic ablations, we find that selective retrieval is essential and that heuristics provide more transferable abstractions than few-shot trajectory prompting. These results demonstrate that reflecting on single-attempt experiences to extract transferable heuristics enables effective agent self-improvement.
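ERL 的"反思生成启发式 + 按任务检索注入"流程可以用一个极简的启发式存储来示意(检索部分用简单的词面重叠代替论文中的检索器,类与示例内容均为假设性虚构):

```python
def tokenize(text):
    return set(text.lower().split())

class HeuristicStore:
    """仅追加的经验库:反思单次轨迹得到可迁移的启发式,
    执行新任务时按相关度检索注入上下文。"""
    def __init__(self):
        self.entries = []  # (任务描述, 启发式)

    def reflect(self, task, outcome, lesson):
        # ERL 中,对单次尝试的反思即可蒸馏出一条可迁移的启发式
        self.entries.append((task, f"[{'success' if outcome else 'failure'}] {lesson}"))

    def retrieve(self, task, k=2):
        # 用词面重叠做选择性检索(论文消融表明选择性检索是关键)
        scored = sorted(
            self.entries,
            key=lambda e: len(tokenize(e[0]) & tokenize(task)),
            reverse=True,
        )
        return [h for _, h in scored[:k]]

store = HeuristicStore()
store.reflect("book a flight via the airline API", False,
              "always confirm the booking reference before reporting success")
store.reflect("summarize a long PDF report", True,
              "chunk the document and summarize sections before merging")
hints = store.retrieve("book a train ticket via the rail API")
print(hints[0])
```

新任务与订机票任务词面重叠最高,因此首先检索到"确认预订号"这条失败教训。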

[AI-77] DyMRL: Dynamic Multispace Representation Learning for Multimodal Event Forecasting in Knowledge Graph WWW’26

【速读】:该论文旨在解决现实场景中事件预测任务中多模态知识动态获取与融合的难题。现有研究多局限于静态设置,未能有效建模不同模态(尤其是动态结构模态)的时间敏感信息,且传统静态跨模态注意力机制难以捕捉不同模态随时间演化的历史贡献差异。解决方案的关键在于提出 DyMRL(Dynamic Multispace Representation Learning)方法:一方面,通过将欧几里得空间、双曲空间和复数空间中的时序特定结构特征整合进关系消息传递框架,以学习深层、具关系感知能力的几何表示,从而模拟人类的联想思维、高阶抽象与逻辑推理;另一方面,引入先进的双融合-演化注意力机制,在对称方式下为不同时刻的不同模态分配动态学习权重,实现多模态融合特征的演化建模。

链接: https://arxiv.org/abs/2603.24636
作者: Feng Zhao,Kangzheng Liu,Teng Peng,Yu Yang,Guandong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to The ACM Web Conference 2026 (WWW '26). This version is published under a CC BY license

点击查看摘要

Abstract:Accurate representation of multimodal knowledge is crucial for event forecasting in real-world scenarios. However, existing studies have largely focused on static settings, overlooking the dynamic acquisition and fusion of multimodal knowledge. 1) At the knowledge acquisition level, how to learn time-sensitive information of different modalities, especially the dynamic structural modality. Existing dynamic learning methods are often limited to shallow structures across heterogeneous spaces or simple unispaces, making it difficult to capture deep relation-aware geometric features. 2) At the knowledge fusion level, how to learn evolving multimodal fusion features. Existing knowledge fusion methods based on static coattention struggle to capture the varying historical contributions of different modalities to future events. To this end, we propose DyMRL, a Dynamic Multispace Representation Learning approach to efficiently acquire and fuse multimodal temporal knowledge. 1) For the former issue, DyMRL integrates time-specific structural features from Euclidean, hyperbolic, and complex spaces into a relational message-passing framework to learn deep representations, reflecting human intelligences in associative thinking, high-order abstracting, and logical reasoning. Pretrained models endow DyMRL with time-sensitive visual and linguistic intelligences. 2) For the latter concern, DyMRL incorporates advanced dual fusion-evolution attention mechanisms that assign dynamic learning emphases equally to different modalities at different timestamps in a symmetric manner. To evaluate DyMRL’s event forecasting performance through leveraging its learned multimodal temporal knowledge in history, we construct four multimodal temporal knowledge graph benchmarks. Extensive experiments demonstrate that DyMRL outperforms state-of-the-art dynamic unimodal and static multimodal baseline methods.
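DyMRL 用到的双曲空间可以用 Poincaré 球距离直观说明(以下是标准公式 d(u,v) = arcosh(1 + 2‖u−v‖² / ((1−‖u‖²)(1−‖v‖²))) 的直接实现,与论文的具体编码器无关):靠近球面边界的点之间距离呈指数增长,因而适合表达树状层级关系。

```python
import math

def poincare_distance(u, v):
    """Poincaré 球模型中的双曲距离(要求 ‖u‖, ‖v‖ < 1)。"""
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    arg = 1 + 2 * diff / ((1 - nu) * (1 - nv))
    return math.acosh(arg)

# 同样的欧氏间隔 0.1,在靠近边界处双曲距离大得多:
# 这正是双曲空间适合编码层级(树状)关系的原因
d_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
d_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
assert d_boundary > d_center
```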

[AI-78] Dual-Graph Multi-Agent Reinforcement Learning for Handover Optimization

【速读】:该论文旨在解决蜂窝网络中切换(Handover, HO)控制参数配置的优化问题,尤其是Cell Individual Offset (CIO) 的动态调整难题。传统基于规则的静态配置方法在非平稳流量和移动性环境下性能下降明显,且CIO参数在网络尺度上存在强耦合关系,微小调整可能引发全局移动流重定向。解决方案的关键在于将HO优化建模为定义在小区邻接图对偶图上的分布式部分可观测马尔可夫决策过程(Decentralized Partially Observable Markov Decision Process, Dec-POMDP),其中每个代理(agent)负责一对邻近小区的CIO,并仅观测其局部对偶图邻域内的关键性能指标(Key Performance Indicators, KPIs),从而实现可扩展的去中心化决策并保持图结构局部性。进一步提出TD3-D-MA算法,基于共享参数图神经网络(GNN)作为策略网络、区域级双评论家(double critics)进行训练,提升密集部署场景下的信用分配精度,最终在真实运营商参数配置的ns-3系统级仿真中验证了该方法在多种拓扑与流量场景下优于标准启发式和集中式强化学习基线,具备良好泛化能力。

链接: https://arxiv.org/abs/2603.24634
作者: Matteo Salvatori,Filippo Vannella,Sebastian Macaluso,Stylianos E. Trevlakis,Carlos Segura Perales,José Suarez-Varela,Alexandros-Apostolos A. Boulogeorgos,Ioannis Arapakis
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:HandOver (HO) control in cellular networks is governed by a set of HO control parameters that are traditionally configured through rule-based heuristics. A key parameter for HO optimization is the Cell Individual Offset (CIO), defined for each pair of neighboring cells and used to bias HO triggering decisions. At network scale, tuning CIOs becomes a tightly coupled problem: small changes can redirect mobility flows across multiple neighbors, and static rules often degrade under non-stationary traffic and mobility. We exploit the pairwise structure of CIOs by formulating HO optimization as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) on the network’s dual graph. In this representation, each agent controls a neighbor-pair CIO and observes Key Performance Indicators (KPIs) aggregated over its local dual-graph neighborhood, enabling scalable decentralized decisions while preserving graph locality. Building on this formulation, we propose TD3-D-MA, a discrete Multi-Agent Reinforcement Learning (MARL) variant of the TD3 algorithm with a shared-parameter Graph Neural Network (GNN) actor operating on the dual graph and region-wise double critics for training, improving credit assignment in dense deployments. We evaluate TD3-D-MA in an ns-3 system-level simulator configured with real-world network operator parameters across heterogeneous traffic regimes and network topologies. Results show that TD3-D-MA improves network throughput over standard HO heuristics and centralized RL baselines, and generalizes robustly under topology and traffic shifts.
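摘要中的"对偶图"即小区邻接图的线图(line graph):每个相邻小区对(对应一个 CIO)成为一个智能体节点,两个节点共享同一小区时相连,这也定义了每个智能体可观测的局部 KPI 邻域。下面是一个最小构造示例(拓扑为虚构):

```python
from itertools import combinations

def dual_graph(cell_edges):
    """线图构造:每个相邻小区对(一个 CIO 智能体)成为节点,
    两个节点共享同一小区时在对偶图中相邻。"""
    nodes = [frozenset(e) for e in cell_edges]
    edges = [
        (tuple(sorted(u)), tuple(sorted(v)))
        for u, v in combinations(nodes, 2)
        if u & v  # 共享一个小区
    ]
    return [tuple(sorted(n)) for n in nodes], edges

# 4 小区玩具拓扑:小区 1 与 2、3 相邻,小区 2 与 4 相邻
nodes, edges = dual_graph([(1, 2), (1, 3), (2, 4)])
print(nodes)   # 三个 CIO 智能体节点
print(edges)   # (1,2)-(1,3) 共享小区 1;(1,2)-(2,4) 共享小区 2
```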

[AI-79] TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

【速读】:该论文旨在解决当前代码代理(Code Agent)在自动修复GitHub问题时缺乏细粒度诊断能力的问题。现有评估指标如Pass@1将整个执行过程简化为单一二元结果,无法揭示代理失败的具体环节与原因。为此,作者提出TRAJEVAL框架,其核心在于将代理行为轨迹分解为三个可解释阶段:搜索(文件定位)、阅读(函数理解)和编辑(修改目标),并通过对比参考补丁计算各阶段的精确率(precision)与召回率(recall)。这一设计使研究者能够精准识别不同代理架构的效率瓶颈与失效模式,并验证了诊断信号具有预测性和可操作性——不仅能以极低误差(MAE 0.87–2.1%)预测模型级性能,还能通过实时反馈显著提升两个前沿模型的准确率(+2.2–4.6个百分点)并降低资源消耗(20–31%)。

链接: https://arxiv.org/abs/2603.24631
作者: Myeongsoo Kim,Dingmin Wang,Siwei Cui,Farima Farmahinifarahani,Shweta Garg,Baishakhi Ray,Terry Yue Zhuo,Rajdeep Mukherjee,Varun Kumar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code agents can autonomously resolve GitHub issues, yet when they fail, current evaluation provides no visibility into where or why. Metrics such as Pass@1 collapse an entire execution into a single binary outcome, making it difficult to identify where and why the agent went wrong. To address this limitation, we introduce TRAJEVAL, a diagnostic framework that decomposes agent trajectories into three interpretable stages: search (file localization), read (function comprehension), and edit (modification targeting). For each stage, we compute precision and recall by comparing against reference patches. Analyzing 16,758 trajectories across three agent architectures and seven models, we find universal inefficiencies (all agents examine approximately 22x more functions than necessary) yet distinct failure modes: GPT-5 locates relevant code but targets edits incorrectly, while Qwen-32B fails at file discovery entirely. We validate that these diagnostics are predictive, achieving model-level Pass@1 prediction within 0.87-2.1% MAE, and actionable: real-time feedback based on trajectory signals improves two state-of-the-art models by 2.2-4.6 percentage points while reducing costs by 20-31%. These results demonstrate that our framework not only provides a more fine-grained analysis of agent behavior, but also translates diagnostic signals into tangible performance gains. More broadly, TRAJEVAL transforms agent evaluation beyond outcome-based benchmarking toward mechanism-driven diagnosis of agent success and failure.
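TRAJEVAL 对 search/read/edit 三阶段分别计算与参考补丁的精确率与召回率,其核心是简单的集合运算。下面给出一个示意实现(轨迹与参考补丁内容均为虚构):

```python
def stage_metrics(touched, reference):
    """某一阶段中智能体触及的条目集合(搜索过的文件、读过的函数、
    编辑目标)相对参考补丁的精确率与召回率。"""
    touched, reference = set(touched), set(reference)
    tp = len(touched & reference)
    precision = tp / len(touched) if touched else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# 虚构的一条轨迹 vs 触及两个函数的参考补丁
trajectory = {
    "search": ["src/db.py", "src/api.py", "README.md"],
    "read":   ["db.connect", "db.migrate", "api.handler"],
    "edit":   ["db.connect"],
}
reference = {
    "search": ["src/db.py"],
    "read":   ["db.connect", "db.migrate"],
    "edit":   ["db.connect", "db.migrate"],
}
for stage in ("search", "read", "edit"):
    p, r = stage_metrics(trajectory[stage], reference[stage])
    print(f"{stage}: precision={p:.2f} recall={r:.2f}")
```

例中 edit 阶段召回率仅 0.5,即摘要所述"能定位代码但编辑目标不全"这类失败模式的量化形式。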

[AI-80] ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

【速读】:该论文旨在解决当前人工智能系统在面对全新、抽象任务时缺乏泛化能力与自主适应性的问题,尤其是如何评估智能体在无显式指令和外部知识支持下的探索、目标推理、内部模型构建及策略规划能力。解决方案的关键在于设计了一个名为ARC-AGI-3的交互式基准测试环境,该环境基于核心知识先验(Core Knowledge priors)构建,并通过大量人类受试者进行难度校准,确保任务对人类可解但对现有前沿AI系统极具挑战性(截至2026年3月,AI系统得分低于1%)。其评分框架以人类行为基线为基础,聚焦于流体适应效率(fluid adaptive efficiency),从而提供一种不依赖语言或外部知识、能有效衡量智能体通用认知能力的新范式。

链接: https://arxiv.org/abs/2603.24621
作者: ARC Prize Foundation
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments in which agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Like its predecessors ARC-AGI-1 and 2, ARC-AGI-3 focuses entirely on evaluating fluid adaptive efficiency on novel tasks, while avoiding language and external knowledge. ARC-AGI-3 environments only leverage Core Knowledge priors and are difficulty-calibrated via extensive testing with human test-takers. Our testing shows humans can solve 100% of the environments, in contrast to frontier AI systems which, as of March 2026, score below 1%. In this paper, we present the benchmark design, its efficiency-based scoring framework grounded in human action baselines, and the methodology used to construct, validate, and calibrate the environments.

[AI-81] Causal AI For AMS Circuit Design: Interpretable Parameter Effects Analysis

【速读】:该论文旨在解决模拟-混合信号(Analog-Mixed-Signal, AMS)电路在数据驱动人工智能建模中面临的高非线性与连续信号特性所带来的建模难题,其核心挑战在于如何从结构化设计参数(如器件尺寸、偏置电压等)准确映射到真实世界性能指标。解决方案的关键在于提出一种基于因果推断(causal inference)的框架:首先从SPICE仿真数据中自动学习出有向无环图(Directed-Acyclic Graph, DAG),以刻画设计变量间的因果关系;随后通过平均处理效应(Average Treatment Effect, ATE)估计量化各参数对性能的影响。该方法不仅实现了人类可解释的设计旋钮排序和“如果改变某个参数会怎样”的显式预测,还在三个运算放大器架构(OTA、 telescopic、folded-cascode)上验证了其优于传统神经网络回归器的准确性(平均绝对误差<25% vs >80%)和符号一致性,从而显著提升了AMS设计自动化中的可信度与效率。

链接: https://arxiv.org/abs/2603.24618
作者: Mohyeu Hussain,David Koblah,Reiner Dizon-Paradis,Domenic Forte
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Analog-mixed-signal (AMS) circuits are highly non-linear and operate on continuous real-world signals, making them far more difficult to model with data-driven AI than digital blocks. To close the gap between structured design data (device dimensions, bias voltages, etc.) and real-world performance, we propose a causal-inference framework that first discovers a directed-acyclic graph (DAG) from SPICE simulation data and then quantifies parameter impact through Average Treatment Effect (ATE) estimation. The approach yields human-interpretable rankings of design knobs and explicit ‘what-if’ predictions, enabling designers to understand trade-offs in sizing and topology. We evaluate the pipeline on three operational-amplifier families (OTA, telescopic, and folded-cascode) implemented in TSMC 65nm and benchmark it against a baseline neural-network (NN) regressor. Across all circuits the causal model reproduces simulation-based ATEs with an average absolute error of less than 25%, whereas the neural network deviates by more than 80% and frequently predicts the wrong sign. These results demonstrate that causal AI provides both higher accuracy and explainability, paving the way for more efficient, trustworthy AMS design automation.
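ATE 估计的最简形式是在离散混杂变量各层内做处理组与对照组的均值差,再按层权重加权(后门调整)。下面用虚构的晶体管尺寸数据作一示意(不涉及论文中的 DAG 自动发现环节):

```python
from collections import defaultdict

def ate_backdoor(rows):
    """对离散混杂变量 z 做后门调整的平均处理效应:
    ATE = sum_z P(z) * ( E[y | t=1, z] - E[y | t=0, z] )。"""
    by_z = defaultdict(lambda: {0: [], 1: []})
    for t, z, y in rows:
        by_z[z][t].append(y)
    n = len(rows)
    ate = 0.0
    for z, groups in by_z.items():
        weight = (len(groups[0]) + len(groups[1])) / n
        effect = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        ate += weight * effect
    return ate

# 玩具数据:t = 是否加宽晶体管(0/1),z = 偏置点分层,y = 增益指标(数值虚构)
rows = [
    (0, "low", 10.0), (1, "low", 12.0), (0, "low", 10.0), (1, "low", 12.0),
    (0, "high", 20.0), (1, "high", 23.0), (0, "high", 20.0), (1, "high", 23.0),
]
print(ate_backdoor(rows))  # 0.5*2.0 + 0.5*3.0 = 2.5
```

这类按层加权的效应量带有明确符号与幅度,正是摘要中"what-if 预测"可被人类解读的原因。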

[AI-82] Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)推理系统中CUDA内核存在的内存安全漏洞问题。由于LLM推理依赖于高度并行的CUDA内核实现核心Transformer操作,而这些内核因模型相关的张量布局、复杂的内存索引和大规模线程级并行性,极易产生内存越界、非法访问等安全缺陷,可能导致模型权重损坏、服务崩溃甚至遭受对抗攻击。现有方法或依赖不可用硬件、开销过高,或无法处理输入长度可变的情况,难以有效检测此类漏洞。论文提出的Model2Kernel是首个实用的自动化系统,其关键创新在于引入模型感知的动态分析机制,识别每个模型如何调用内核,并将内核参数分类为由模型架构固定或用户控制;在此基础上,结合专为CUDA设计的符号执行技术与新型动态张量内存及线程标识符抽象,精准定位内核中的内存错误。在vLLM、Hugging Face及近期LLM研究论文中的多个CUDA内核上验证表明,Model2Kernel成功发现353个未知漏洞,仅产生9个误报,显著提升了LLM推理系统的内存安全性。

链接: https://arxiv.org/abs/2603.24595
作者: Mengting He,Shihao Xia,Haomin Jia,Wenfei Wu,Linhai Song
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread adoption of large language models (LLMs) has made GPU-accelerated inference a critical part of modern computing infrastructure. Production inference systems rely on CUDA kernels to implement core transformer operations, yet these kernels are highly susceptible to memory-safety bugs due to model-dependent tensor layouts, intricate memory indexing, and massive thread-level parallelism. Such bugs can corrupt model weights, crash inference services, or even enable adversarial attacks. Existing techniques either depend on unavailable hardware, incur high overhead, or fail to handle kernel inputs with variable lengths, and none can effectively detect CUDA memory bugs in LLM inference systems. This paper presents Model2Kernel, the first practical system for automatically verifying the memory safety of CUDA kernels used in LLM inference. Model2Kernel performs model-aware dynamic analysis to determine how each model invokes kernels and to classify kernel arguments as either fixed by the model architecture or controlled by model users. Using this information, Model2Kernel then applies CUDA-specialized symbolic execution, supported by new abstractions for dynamic tensor memory and thread identifiers, to accurately pinpoint memory bugs in kernels. In the evaluation on CUDA kernels and models from vLLM, Hugging Face, and recent LLM research papers, Model2Kernel discovers 353 previously unknown bugs while producing only nine false positives, demonstrating its effectiveness.
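Model2Kernel 的线程标识符抽象可以用一个极简的区间式越界检查来体会(假设性示意,远简于论文的符号执行):给定 launch 配置,flat 索引 blockIdx.x * blockDim.x + threadIdx.x 的最大可能取值若不小于缓冲区长度、且内核中无 `if (i < n)` 守卫,即判为潜在越界。

```python
def max_flat_index(grid_dim, block_dim):
    """blockIdx.x * blockDim.x + threadIdx.x 的最大取值。"""
    return (grid_dim - 1) * block_dim + (block_dim - 1)

def check_kernel_bounds(grid_dim, block_dim, buffer_len, has_guard):
    """区间式检查(论文线程标识符抽象的极简替身):
    没有 `if (i < n)` 守卫时,每个可启动的线程索引都必须落入缓冲区。"""
    if has_guard:
        return True  # 有守卫的访问被钳制在缓冲区内
    return max_flat_index(grid_dim, block_dim) < buffer_len

# 4 个 block x 256 线程访问 1000 元素的缓冲区:
# 若无守卫,最后 24 个线程会越界读写
assert not check_kernel_bounds(grid_dim=4, block_dim=256, buffer_len=1000, has_guard=False)
assert check_kernel_bounds(grid_dim=4, block_dim=256, buffer_len=1024, has_guard=False)
```

真正困难之处(也是论文的贡献)在于 `buffer_len` 往往取决于模型架构与用户可控的张量形状,需要模型感知的动态分析才能确定。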

[AI-83] A Causal Framework for Evaluating ICU Discharge Strategies

【速读】:该论文旨在解决重症监护病房(Intensive Care Unit, ICU)患者何时出院这一复杂且开放的问题,其本质是一个最优停止问题(optimal stopping problem),并面临三个挑战:一是从观察性数据中评估停止策略本身是一个复杂的因果推断问题;二是目标函数为最小化干预时长与最大化临床结局的复合目标,二者无法简化为单一维度;三是变量记录在干预终止后停止。解决方案的关键在于:首先,扩展了g-formula Python包的实现,构建了一个可评估此类结构停止策略的框架,包含正向性(positivity)和覆盖度(coverage)检验;其次,基于MIMIC-IV公开ICU数据集,使用全开源流程验证了该方法能够识别出优于当前临床实践的出院策略。

链接: https://arxiv.org/abs/2603.25397
作者: Sagar Nagaraj Simha,Juliette Ortholand,Dave Dongelmans,Jessica D. Workum,Olivier W.M. Thijssens,Ameen Abu-Hanna,Giovanni Cinà
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 8 pages, 2 figures, 2 tables

点击查看摘要

Abstract:In this applied paper, we address the difficult open problem of when to discharge patients from the Intensive Care Unit. This can be conceived as an optimal stopping scenario with three added challenges: 1) the evaluation of a stopping strategy from observational data is itself a complex causal inference problem, 2) the composite objective is to minimize the length of intervention and maximize the outcome, but the two cannot be collapsed to a single dimension, and 3) the recording of variables stops when the intervention is discontinued. Our contributions are two-fold. First, we generalize the implementation of the g-formula Python package, providing a framework to evaluate stopping strategies for problems with the aforementioned structure, including positivity and coverage checks. Second, with a fully open-source pipeline, we apply this approach to MIMIC-IV, a public ICU dataset, demonstrating the potential for strategies that improve upon current care.

[AI-84] Reinforcement learning for quantum processes with memory

【速读】:该论文旨在解决量子强化学习中的探索-利用权衡问题,即在未知量子动力学环境下,代理如何通过序列干预来最大化期望奖励。其核心挑战在于处理隐藏的量子状态演化和不完全观测反馈。解决方案的关键在于构建一个形式化框架,其中环境包含由未知量子通道驱动的隐式量子记忆,代理使用量子仪器进行干预,并采用一种基于乐观最大似然估计(optimistic maximum-likelihood estimation)的算法。通过控制估计误差在量子通道与仪器间的传播,作者证明了策略的累计遗憾(cumulative regret)随回合数 $K$ 呈 $\widetilde{\mathcal{O}}(\sqrt{K})$ 标度,且该界在信息论意义上是紧的(即最优至对数因子)。此方法可直接应用于无感知功提取场景,此时数学上的遗憾精确量化了因源知识缺失导致的累积热力学耗散,从而实现渐近零耗散率。

链接: https://arxiv.org/abs/2603.25138
作者: Josep Lumbreras,Ruo Cheng Huang,Yanglin Hu,Marco Fanizza,Mile Gu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 85 pages, 5 figures

点击查看摘要

Abstract:In reinforcement learning, an agent interacts sequentially with an environment to maximize a reward, receiving only partial, probabilistic feedback. This creates a fundamental exploration-exploitation trade-off: the agent must explore to learn the hidden dynamics while exploiting this knowledge to maximize its target objective. While extensively studied classically, applying this framework to quantum systems requires dealing with hidden quantum states that evolve via unknown dynamics. We formalize this problem via a framework where the environment maintains a hidden quantum memory evolving via unknown quantum channels, and the agent intervenes sequentially using quantum instruments. For this setting, we adapt an optimistic maximum-likelihood estimation algorithm. We extend the analysis to continuous action spaces, allowing us to model general positive operator-valued measures (POVMs). By controlling the propagation of estimation errors through quantum channels and instruments, we prove that the cumulative regret of our strategy scales as \widetilde{\mathcal{O}}(\sqrt{K}) over K episodes. Furthermore, via a reduction to the multi-armed quantum bandit problem, we establish information-theoretic lower bounds demonstrating that this sublinear scaling is strictly optimal up to polylogarithmic factors. As a physical application, we consider state-agnostic work extraction. When extracting free energy from a sequence of non-i.i.d. quantum states correlated by a hidden memory, any lack of knowledge about the source leads to thermodynamic dissipation. In our setting, the mathematical regret exactly quantifies this cumulative dissipation. Using our adaptive algorithm, the agent uses past energy outcomes to improve its extraction protocol on the fly, achieving sublinear cumulative dissipation, and, consequently, an asymptotically zero dissipation rate.

[AI-85] Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

【速读】:该论文旨在解决大规模评估工具(如AI基准测试和人类课堂测评)中因项目(item)质量低下而导致评估有效性受损的问题,尤其针对现代测评仪器中存在数千个未经充分心理测量检验的项目这一现状。其核心解决方案是提出了一类基于项间等距回归(isotonic regression)的非参数可扩展性系数,其中关键创新在于“带符号的等距R²”(signed isotonic R²),该指标通过保留肯德尔τ(Kendall’s τ)方向信息的单调函数关系,衡量一个项目能被另一个项目以单调方式解释的最大方差比例。该方法无需假设线性关系或指定参数项目反应模型,即可高效识别全局不良项目(如误标、表述模糊或构念偏离),且在小样本大维度(small-n/large-p)条件下仍具稳健性,具备处理混合类型项目(二值、有序、连续)的能力,是一种轻量级、模型无关的筛选工具,显著降低人工审查成本。

链接: https://arxiv.org/abs/2603.24999
作者: Michael Hardy,Joshua Gilbert,Benjamin Domingue
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic R^2, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall’s \tau. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the signed isotonic R^2 is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and show that this optimality property translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, the signed isotonic R^2 consistently achieves top-tier AUC for ranking bad items above good ones, outperforming or matching a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. Crucially, the method remains robust under the small-n/large-p conditions typical of AI evaluation, requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification. It is a lightweight, model-agnostic filter that can materially reduce the reviewer effort needed to find flawed items in modern large-scale evaluation regimes.
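论文的核心量 signed isotonic R² 可以直接用保序回归(PAVA,pool-adjacent-violators)实现:先由 Kendall’s τ 的符号确定单调方向,再拟合最优单调函数并计算解释方差比例。以下是基于摘要描述的最小 numpy 示意(并列值处理等细节为本文假设,并非作者官方实现):

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: best non-decreasing (isotonic) fit of y."""
    merged = []  # list of [block_mean, block_size]
    for v in np.asarray(y, dtype=float):
        merged.append([v, 1])
        # merge adjacent blocks until the block means are non-decreasing
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, w2 = merged.pop()
            m1, w1 = merged.pop()
            merged.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(w, m) for m, w in merged])

def kendall_sign(x, y):
    """Sign of Kendall's tau via pairwise concordance (O(n^2) sketch)."""
    x, y = np.asarray(x), np.asarray(y)
    s = sum(np.sum(np.sign(x[i + 1:] - x[i]) * np.sign(y[i + 1:] - y[i]))
            for i in range(len(x)))
    return 1.0 if s >= 0 else -1.0

def signed_isotonic_r2(x, y):
    """Largest share of Var(y) explainable by a monotone function of x,
    signed by the direction of Kendall's tau (ties in x ignored)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sign = kendall_sign(x, y)
    order = np.argsort(sign * x, kind="stable")  # make the target fit non-decreasing
    fit = pava(y[order])
    sse = np.sum((y[order] - fit) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return sign * (1.0 - sse / sst) if sst > 0 else 0.0
```

对完全单调递增的关系(如 y = x²)该系数为 1,对单调递减关系为 -1,与摘要中"保留方向的最强单调信号"的定义一致。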

[AI-86] Subject-Specific Low-Field MRI Synthesis via a Neural Operator

【速读】:该论文旨在解决低场磁共振成像(Low-field MRI, LF-MRI)因信噪比降低和对比度退化而导致的临床应用受限问题。现有模拟方法依赖噪声注入与平滑处理,无法准确再现LF采集中的对比度变化。其解决方案的关键在于提出一种端到端的LF-MRI合成框架,核心是引入一种新型的HF到LF坐标-图像解耦神经算子(HF to LF coordinate-image decoupled neural operator, H2LO),该模型直接从少量配对的HF-LF MRI数据中学习高场到低场的图像退化过程,并精准捕捉高频噪声纹理与图像结构特征,从而生成更真实的LF-MRI图像,提升下游图像增强任务性能,推动LF-MRI诊断能力的发展。

链接: https://arxiv.org/abs/2603.24968
作者: Ziqi Gao,Nicha Dvornek,Xiaoran Zhang,Gigi Galiana,Hemant Tagare,Todd Constable
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Low-field (LF) magnetic resonance imaging (MRI) improves accessibility and reduces costs but generally has lower signal-to-noise ratios and degraded contrast compared to high field (HF) MRI, limiting its clinical utility. Simulating LF MRI from HF MRI enables virtual evaluation of novel imaging devices and development of LF algorithms. Existing low field simulators rely on noise injection and smoothing, which fail to capture the contrast degradation seen in LF acquisitions. To this end, we introduce an end-to-end LF-MRI synthesis framework that learns HF to LF image degradation directly from a small number of paired HF-LF MRIs. Specifically, we introduce a novel HF to LF coordinate-image decoupled neural operator (H2LO) to model the underlying degradation process, and tailor it to capture high-frequency noise textures and image structure. Experimental results in T1w and T2w MRI demonstrate that H2LO produces more faithful simulated low-field images than existing parameterized noise synthesis models and popular image-to-image translation models. Furthermore, it improves performance in downstream image enhancement tasks, showcasing its potential to enhance LF MRI diagnostic capabilities.

[AI-87] Shaping the Future of Mathematics in the Age of AI

【速读】:该论文旨在应对人工智能(Artificial Intelligence, AI)对数学领域带来的快速且深远的变革,解决当前数学界在价值观、实践方式、教学模式、技术应用及伦理规范等方面面临的紧迫挑战。其解决方案的关键在于:通过主动参与和系统性反思,捍卫数学学科的智力自主性,重构数学研究与教育的实践范式,拓展课程体系以适应新技术环境,构建面向学术需求的技术基础设施,并制定共同体共享的伦理准则,从而确保数学的未来发展由数学共同体自身主导。

链接: https://arxiv.org/abs/2603.24914
作者: Johan Commelin,Mateja Jamnik,Rodrigo Ochigame,Lenny Taelman,Akshay Venkatesh
机构: 未知
类目: History and Overview (math.HO); Artificial Intelligence (cs.AI)
备注: To appear in Notices of the American Mathematical Society. Based on discussions at a September 2025 workshop on “Mechanization and Mathematical Research” held at the Lorentz center, Leiden

点击查看摘要

Abstract:Artificial intelligence is transforming mathematics at a speed and scale that demand active engagement from the mathematical community. We examine five areas where this transformation is particularly pressing: values, practice, teaching, technology, and ethics. We offer recommendations on safeguarding our intellectual autonomy, rethinking our practice, broadening curricula, building academically oriented infrastructure, and developing shared ethical principles - with the aim of ensuring that the future of mathematics is shaped by the community itself.

[AI-88] Fusion Learning from Dynamic Functional Connectivity: Combining the Amplitude and Phase of fMRI Signals to Identify Brain Disorders

【速读】:该论文旨在解决当前基于静息态功能磁共振成像(resting-state fMRI)的动态功能连接(dynamic functional connectivity, dFC)分析在脑疾病检测中敏感性和特异性不足的问题。现有方法多依赖于滑动窗相关法(sliding window correlation, SWC)提取幅度时间序列间的相关性,但忽略了相位同步信息对脑网络动态特性的重要贡献。解决方案的关键在于提出一种多尺度融合学习框架(multi-scale fusion learning framework, MSFL),该框架同时整合了SWC所捕获的幅度相关性特征与相位同步(phase synchronization, PS)所反映的相位一致性特征,从而更全面地刻画脑区间的动态交互模式。实验表明,MSFL在自闭症谱系障碍(autism spectrum disorder, ASD)和重度抑郁症(major depressive disorder, MDD)的分类任务中显著优于现有模型,且SHAP解释分析验证了两类dFC特征均对疾病检测具有重要贡献。

链接: https://arxiv.org/abs/2603.24603
作者: Jinlong Hu,Jiatong Huang,Zijian Cai
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic functional connectivity (dFC) derived from resting-state functional magnetic resonance imaging (fMRI) has been extensively utilized in brain science research. The sliding window correlation (SWC) method is a widely used approach for constructing dFC by computing correlation coefficients between amplitude time series of signals from pairs of brain regions. In this study, we propose an integrated approach that incorporates both amplitude and phase information of fMRI signals to improve the detection of brain disorders. Specifically, we introduce a multi-scale fusion learning framework, namely MSFL, which leverages two complementary dFC features derived from SWC and phase synchronization (PS). Here, SWC captures amplitude correlations, while PS measures phase coherence within dFC. We evaluated the efficacy of MSFL in classifying autism spectrum disorder and major depressive disorder using two publicly available datasets: ABIDE I and REST-meta-MDD, respectively. The results indicate that MSFL significantly outperforms existing comparative models. Moreover, we performed model explanation analysis using the SHAP framework, which showed that both types of dFC features from SWC and PS contribute to detecting brain disorders.

[AI-89] MuViS: Multimodal Virtual Sensing Benchmark

【速读】:该论文旨在解决虚拟传感(virtual sensing)领域中缺乏通用、可迁移方法的问题,当前研究分散在基于物理原理、混合模型和数据驱动方法之间,尚未形成跨过程、模态和传感配置的标准化解决方案。其关键解决方案是提出了MuViS——一个无特定领域的基准测试套件,将多种异构数据集统一为标准化接口,支持预处理与评估的一致性,并在此框架下系统比较了梯度提升决策树与深度神经网络等主流方法,结果表明现有方法均不具备普适优势,从而凸显了构建通用虚拟传感架构的必要性。

链接: https://arxiv.org/abs/2603.24602
作者: Jens U. Brandt,Noah C. Puetz,Jobel Jose George,Niharika Vinay Kumar,Elena Raponi,Marc Hilbert,Thomas Bäck,Thomas Bartz-Beielstein
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Submitted to European Signal Processing Conference (EUSIPCO) 2026

点击查看摘要

Abstract:Virtual sensing aims to infer hard-to-measure quantities from accessible measurements and is central to perception and control in physical systems. Despite rapid progress from first-principle and hybrid models to modern data-driven methods, research remains siloed, leaving no established default approach that transfers across processes, modalities, and sensing configurations. We introduce MuViS, a domain-agnostic benchmarking suite for multimodal virtual sensing that consolidates diverse datasets into a unified interface for standardized preprocessing and evaluation. Using this framework, we benchmark established approaches spanning gradient-boosted decision trees and deep neural network (NN) architectures, and show that none of these provides a universal advantage, underscoring the need for generalizable virtual sensing architectures. MuViS is released as an open-source, extensible platform for reproducible comparison and future integration of new datasets and model classes.

[AI-90] FED-HARGPT : A Hybrid Centralized-Federated Approach of a Transformer-based Architecture for Human Context Recognition

【速读】:该论文旨在解决在移动传感器数据上部署人类活动识别(Human Activity Recognition, HAR)技术时面临的两大挑战:一是如何在保障用户数据隐私的前提下实现高效模型训练,二是如何在非独立同分布(non-IID)数据场景下提升模型的准确性和鲁棒性。解决方案的关键在于提出一种混合集中式-联邦学习(centralized-federated)框架,基于Transformer架构,在Flower联邦学习框架中实现模型训练,从而在不共享原始数据的情况下,使联邦模型性能接近集中式基准模型,有效平衡了数据隐私与模型性能之间的矛盾。

链接: https://arxiv.org/abs/2603.24601
作者: Wandemberg Gibaut,Alexandre Osorio,Amparo Munoz,Sildolfo F. G. Neto,Fabio Grassiotto
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Paper presented on: July 2025 Conference: XVII Simpósio Brasileiro de Automação Inteligente (SBAI) At: São João del-Rei

点击查看摘要

Abstract:The study explores a hybrid centralized-federated approach for Human Activity Recognition (HAR) using a Transformer-based architecture. With the increasing ubiquity of edge devices, such as smartphones and wearables, a significant amount of private data from wearable and inertial sensors is generated, facilitating discreet monitoring of human activities, including resting, sleeping, and walking. This research focuses on deploying HAR technologies using mobile sensor data and leveraging Federated Learning within the Flower framework to evaluate the training of a federated model derived from a centralized baseline. The experimental results demonstrate the effectiveness of the proposed hybrid approach in improving the accuracy and robustness of HAR models while preserving data privacy in a non-IID data scenario. The federated learning setup demonstrated comparable performance to centralized models, highlighting the potential of federated learning to strike a balance between data privacy and model performance in real-world applications.
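Flower 等联邦学习框架中,服务端聚合通常采用按本地样本量加权平均的 FedAvg 形式。以下为该聚合步骤的 numpy 示意(与 Flower 的实际 API 无关,仅演示加权平均的数学过程):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter lists (FedAvg-style aggregation).

    client_weights: list of per-client parameter lists (one np.ndarray per layer)
    client_sizes:   number of local samples per client, used as weights
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    aggregated = []
    for layer in range(n_layers):
        acc = np.zeros_like(client_weights[0][layer], dtype=float)
        for weights, size in zip(client_weights, client_sizes):
            acc += (size / total) * weights[layer]  # weight by local data share
        aggregated.append(acc)
    return aggregated
```

在混合集中-联邦流程中,这一聚合在每轮本地训练后执行,原始传感数据始终不出本地设备。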

[AI-91] A Learnable SIM Paradigm: Fundamentals Training Techniques and Applications

【速读】:该论文旨在解决6G及未来无线通信系统中频谱利用效率低和抗干扰能力弱的问题,特别是多用户信号分离与通信信号与干扰信号区分的挑战。解决方案的关键在于提出了一种可学习的智能超表面(Learnable Stacked Intelligent Metasurfaces, SIM)架构,并基于其与人工神经网络(Artificial Neural Networks, ANN)的结构相似性,构建了SIM赋能的机器学习(ML)范式。通过该架构,实现了在电磁波域进行模拟计算的轻量化信号处理方案,有效提升了系统频谱利用率和抗干扰性能,为构建超高效、智能化的无线基础设施提供了新路径。

链接: https://arxiv.org/abs/2603.24599
作者: Hetong Wang,Yashuai Cao,Tiejun Lv
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, accepted by IEEE Wireless Communications Magazine

点击查看摘要

Abstract:Stacked intelligent metasurfaces (SIMs) represent a breakthrough in wireless hardware by comprising multilayer, programmable metasurfaces capable of analog computing in the electromagnetic (EM) wave domain. By examining their architectural analogies, this article reveals a deeper connection between SIMs and artificial neural networks (ANNs). Leveraging this profound structural similarity, this work introduces a learnable SIM architecture and proposes a learnable SIM-based machine learning (ML) paradigm for sixth-generation (6G)-and-beyond systems. Then, we develop two SIM-empowered wireless signal processing schemes to effectively achieve multi-user signal separation and distinguish communication signals from jamming signals. The use cases highlight that the proposed SIM-enabled signal processing system can significantly enhance spectrum utilization efficiency and anti-jamming capability in a lightweight manner and pave the way for ultra-efficient and intelligent wireless infrastructures.
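SIM 与 ANN 的结构类比可以用一个玩具化的前向传播来说明:每层超表面对入射波施加可编程相位(类比可训练权重),层间自由空间传播用固定矩阵近似(类比全连接层),全程为复数运算。以下为无量纲玩具模型,并非论文的实际物理建模:

```python
import numpy as np

def sim_forward(x, phase_layers, propagation):
    """Toy forward pass of a stacked intelligent metasurface (SIM).

    Each layer applies a trainable per-element phase shift (the analog of a
    neuron's weight), then a fixed propagation matrix couples elements
    (the analog of a dense connection). All arithmetic is complex-valued.
    """
    for phases in phase_layers:
        x = np.exp(1j * phases) * x   # programmable metasurface layer
        x = propagation @ x           # free-space propagation between layers
    return x
```

"可学习"之处在于各层相位向量:训练时对相位求梯度并更新,推理则完全由电磁波在硬件中完成,这正是摘要所述模拟计算的轻量化来源。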

机器学习

[LG-0] On Neural Scaling Laws for Weather Emulation through Continual Training ICLR

链接: https://arxiv.org/abs/2603.25687
作者: Shashank Subramanian,Alexander Kiefer,Arnur Nigmetov,Amir Gholami,Dmitriy Morozov,Michael W. Mahoney
类目: Machine Learning (cs.LG)
*备注: ICLR Foundation Models for Science Workshop 2026, 19 pages, 13 figures

点击查看摘要

Abstract:Neural scaling laws, which in some domains can predict the performance of large neural networks as a function of model, data, and compute scale, are the cornerstone of building foundation models in Natural Language Processing and Computer Vision. We study neural scaling in Scientific Machine Learning, focusing on models for weather forecasting. To analyze scaling behavior in as simple a setting as possible, we adopt a minimal, scalable, general-purpose Swin Transformer architecture, and we use continual training with constant learning rates and periodic cooldowns as an efficient training strategy. We show that models trained in this minimalist way follow predictable scaling trends and even outperform standard cosine learning rate schedules. Cooldown phases can be re-purposed to improve downstream performance, e.g., enabling accurate multi-step rollouts over longer forecast horizons as well as sharper predictions through spectral loss adjustments. We also systematically explore a wide range of model and dataset sizes under various compute budgets to construct IsoFLOP curves, and we identify compute-optimal training regimes. Extrapolating these trends to larger scales highlights potential performance limits, demonstrating that neural scaling can serve as an important diagnostic for efficient resource allocation. We open-source our code for reproducibility.

[LG-1] Longitudinal Digital Phenotyping for Early Cognitive-Motor Screening

链接: https://arxiv.org/abs/2603.25673
作者: Diego Jimenez-Oviedo,Ruben Vera-Rodriguez,Ruben Tolosana,Juan Carlos Ruiz-Garcia,Jaime Herreros-Rodriguez
类目: Machine Learning (cs.LG)
*备注: IEEE CAI 2026 6 Pages 2 Figures

点击查看摘要

Abstract:Early detection of atypical cognitive-motor development is critical for timely intervention, yet traditional assessments rely heavily on subjective, static evaluations. The integration of digital devices offers an opportunity for continuous, objective monitoring through digital biomarkers. In this work, we propose an AI-driven longitudinal framework to model developmental trajectories in children aged 18 months to 8 years. Using a dataset of tablet-based interactions collected over multiple academic years, we analyzed six cognitive-motor tasks (e.g., fine motor control, reaction time). We applied dimensionality reduction (t-SNE) and unsupervised clustering (K-Means++) to identify distinct developmental phenotypes and tracked individual transitions between these profiles over time. Our analysis reveals three distinct profiles: low, medium, and high performance. Crucially, longitudinal tracking highlights a high stability in the low-performance cluster (90% retention in early years), suggesting that early deficits tend to persist without intervention. Conversely, higher-performance clusters show greater variability, potentially reflecting engagement factors. This study validates the use of unsupervised learning on touchscreen data to uncover heterogeneous developmental paths. The identified profiles serve as scalable, data-driven proxies for cognitive growth, offering a foundation for early screening tools and personalized pediatric interventions.
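论文在 t-SNE 降维后用 K-Means++ 聚类识别发展表型。下面给出 K-Means++ 种子选取加 Lloyd 迭代的最小 numpy 实现示意(t-SNE 步骤略去,假设输入已是低维嵌入;非论文官方代码):

```python
import numpy as np

def kmeans_pp(points, k, n_iter=50, seed=0):
    """K-Means with K-Means++ seeding on (already embedded) low-dim points."""
    rng = np.random.default_rng(seed)
    # -- K-Means++ seeding: pick new centers with probability ~ squared distance
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    centers = np.array(centers)
    # -- Lloyd iterations: assign to nearest center, then recompute means
    for _ in range(n_iter):
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1),
                           axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers
```

得到的簇标签即对应文中的低/中/高表现表型;逐学年重复聚类并跟踪个体的簇归属变化,即可得到文中的纵向转移分析。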

[LG-2] Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring

链接: https://arxiv.org/abs/2603.25670
作者: John Ayotunde,Qinghua Xu,Guancheng Wang,Lionel C. Briand
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 10 pages (main content), 3 pages references, 5 figures, 5 tables. Under review

点击查看摘要

Abstract:Safety monitoring is essential for Cyber-Physical Systems (CPSs). However, unsafe events are rare in real-world CPS operations, creating an extreme class imbalance that degrades safety predictors. Standard rebalancing techniques perform poorly on time-series CPS telemetry, either generating unrealistic synthetic samples or overfitting on the minority class. Meanwhile, behavioral uncertainty in CPS operations, defined as the degree of doubt or uncertainty in CPS decisions, is often correlated with safety outcomes but unexplored in safety monitoring. To that end, we propose U-Balance, a supervised approach that leverages behavioral uncertainty to rebalance imbalanced datasets prior to training a safety predictor. U-Balance first trains a GatedMLP-based uncertainty predictor that summarizes each telemetry window into distributional kinematic features and outputs an uncertainty score. It then applies an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels safe-labeled windows with unusually high uncertainty as unsafe, thereby enriching the minority class with informative boundary samples without synthesizing new data. Finally, a safety predictor is trained on the rebalanced dataset for safety monitoring. We evaluate U-Balance on a large-scale UAV benchmark with a 46:1 safe-to-unsafe ratio. Results confirm a moderate but significant correlation between behavioral uncertainty and safety. We then identify uLNR as the most effective strategy to exploit uncertainty information, compared to direct early and late fusion. U-Balance achieves a 0.806 F1 score, outperforming the strongest baseline by 14.3 percentage points, while maintaining competitive inference efficiency. Ablation studies confirm that both the GatedMLP-based uncertainty predictor and the uLNR mechanism contribute significantly to U-Balance’s effectiveness.
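uLNR 机制的核心是"把不确定性异常高的 safe 样本以一定概率改标为 unsafe",从而不合成新数据即可充实少数类。以下为该思想的 numpy 示意(分位数阈值与翻转概率的具体形式为本文假设,并非论文原始公式):

```python
import numpy as np

def ulnr_relabel(labels, uncertainty, quantile=0.95, flip_prob=0.5, seed=0):
    """Relabel safe windows (label 0) whose uncertainty exceeds a high quantile
    of the safe-class uncertainty distribution, each with probability flip_prob."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()       # do not mutate the caller's labels
    unc = np.asarray(uncertainty, dtype=float)
    safe = labels == 0
    thresh = np.quantile(unc[safe], quantile)  # "unusually high" = above this quantile
    candidates = safe & (unc > thresh)
    flips = candidates & (rng.random(len(labels)) < flip_prob)
    labels[flips] = 1                        # flip to the unsafe (minority) class
    return labels
```

翻转后的数据集再交给安全预测器训练;原有 unsafe 标签保持不变,仅 safe 侧的高不确定性边界样本被补入少数类。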

[LG-3] Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments

链接: https://arxiv.org/abs/2603.25635
作者: Armand de Villeroché,Rem-Sophia Mouradi,Vincent Le Guen,Sibo Cheng,Marc Bocquet,Alban Farchi,Patrick Armand,Patrick Massin
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Air flow modeling at a local scale is essential for applications such as pollutant dispersion modeling or wind farm modeling. To circumvent costly Computational Fluid Dynamics (CFD) computations, deep learning surrogate models have recently emerged as promising alternatives. However, in the context of urban air flow, deep learning models struggle to adapt to the high variations of the urban geometry and to large mesh sizes. To tackle these challenges, we introduce Anchored Branched Steady-state WInd Flow Transformer (AB-SWIFT), a transformer-based model with an internal branched structure uniquely designed for atmospheric flow modeling. We train our model on a specially designed database of atmospheric simulations around randomised urban geometries and with a mixture of unstable, neutral, and stable atmospheric stratifications. Our model reaches the best accuracy on all predicted fields compared to state-of-the-art transformers and graph-based models. Our code and data is available at this https URL.

[LG-4] he Geometry of Efficient Nonconvex Sampling

链接: https://arxiv.org/abs/2603.25622
作者: Santosh S. Vempala,Andre Wibisono
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present an efficient algorithm for uniformly sampling from an arbitrary compact body \mathcal{X} \subset \mathbb{R}^n from a warm start under isoperimetry and a natural volume growth condition. Our result provides a substantial common generalization of known results for convex bodies and star-shaped bodies. The complexity of the algorithm is polynomial in the dimension, the Poincaré constant of the uniform distribution on \mathcal{X} and the volume growth constant of the set \mathcal{X} .

[LG-5] Social Hippocampus Memory Learning

链接: https://arxiv.org/abs/2603.25614
作者: Liping Yi,Zhiming Zhao,Qinghua Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Social learning highlights that learning agents improve not in isolation, but through interaction and structured knowledge exchange with others. When introduced into machine learning, this principle gives rise to social machine learning (SML), where multiple agents collaboratively learn by sharing abstracted knowledge. Federated learning (FL) provides a natural collaboration substrate for this paradigm, yet existing heterogeneous FL approaches often rely on sharing model parameters or intermediate representations, which may expose sensitive information and incur additional overhead. In this work, we propose SoHip (Social Hippocampus Memory Learning), a memory-centric social machine learning framework that enables collaboration among heterogeneous agents via memory sharing rather than model sharing. SoHip abstracts each agent’s individual short-term memory from local representations, consolidates it into individual long-term memory through a hippocampus-inspired mechanism, and fuses it with collectively aggregated long-term memory to enhance local prediction. Throughout the process, raw data and local models remain on-device, while only lightweight memories are exchanged. We provide theoretical analysis on convergence and privacy preservation properties. Experiments on two benchmark datasets with seven baselines demonstrate that SoHip consistently outperforms existing methods, achieving up to 8.78% accuracy improvements.

[LG-6] Spatiotemporal System Forecasting with Irregular Time Steps via Masked Autoencoder

链接: https://arxiv.org/abs/2603.25597
作者: Kewei Zhu,Yanze Xin,Jinwei Hu,Xiaoyuan Cheng,Yiming Yang,Sibo Cheng
类目: Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
*备注:

点击查看摘要

Abstract:Predicting high-dimensional dynamical systems with irregular time steps presents significant challenges for current data-driven algorithms. These irregularities arise from missing data, sparse observations, or adaptive computational techniques, reducing prediction accuracy. To address these limitations, we propose a novel method: a Physics-Spatiotemporal Masked Autoencoder. This method integrates convolutional autoencoders for spatial feature extraction with masked autoencoders optimised for irregular time series, leveraging attention mechanisms to reconstruct the entire physical sequence in a single prediction pass. The model avoids the need for data imputation while preserving physical integrity of the system. Here, ‘physics’ refers to high-dimensional fields generated by underlying dynamical systems, rather than the enforcement of explicit physical constraints or PDE residuals. We evaluate this approach on multiple simulated datasets and real-world ocean temperature data. The results demonstrate that our method achieves significant improvements in prediction accuracy, robustness to nonlinearities, and computational efficiency over traditional convolutional and recurrent network methods. The model shows potential for capturing complex spatiotemporal patterns without requiring domain-specific knowledge, with applications in climate modelling, fluid dynamics, ocean forecasting, environmental monitoring, and scientific computing.
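论文在不插补缺失时间步的前提下完成整段物理序列的重建,其损失计算可以示意为"仅在实际被观测到的时间步上计算重建误差"(以下为掩码 MSE 的最小示意,与论文损失的具体定义可能不同):

```python
import numpy as np

def masked_reconstruction_mse(pred, target, observed):
    """MSE computed only at observed time steps, so irregular / missing
    steps contribute neither imputation noise nor training signal."""
    observed = np.asarray(observed, dtype=bool)
    diff = (np.asarray(pred, float) - np.asarray(target, float))[observed]
    return float(np.mean(diff ** 2))
```

这样,不规则采样产生的缺口在训练中被直接跳过,而非先用占位值填充,避免了插补噪声污染物理场的统计特性。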

[LG-7] An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Biofuel-Relevant Biomass Production in Saccharomyces cerevisiae

链接: https://arxiv.org/abs/2603.25561
作者: Neha K. Nair,Aaron D’Souza
类目: Machine Learning (cs.LG)
*备注: 8 pages, 12 figures, and 2 tables

点击查看摘要

Abstract:Saccharomyces cerevisiae is a cornerstone organism in industrial biotechnology, valued for its genetic tractability and robust fermentative capacity. Accurately predicting biomass flux across diverse environmental and genetic perturbations remains a significant challenge for rational strain design. We present a computational framework combining the Yeast9 genome-scale metabolic model with machine learning and optimization to predict, interpret, and enhance biomass flux. Flux balance analysis generated 2,000 flux profiles by varying glucose, oxygen, and ammonium uptake rates. Random Forest and XGBoost regressors achieved R2 of 0.99989 and 0.9990, respectively. A variational autoencoder revealed four distinct metabolic clusters, and SHAP analysis identified glycolysis, the TCA cycle, and lipid biosynthesis as key biomass determinants. In silico overexpression achieved a biomass flux of 0.979 gDW/hr, while Bayesian optimization of nutrient constraints produced a 12-fold increase (0.0858 to 1.041 gDW/hr). A generative adversarial network proposed stoichiometrically feasible novel flux configurations. This framework demonstrates how genome-scale simulation, interpretable ML, and generative modeling can advance yeast metabolic engineering.

[LG-8] Missing-Aware Multimodal Fusion for Unified Microservice Incident Management

链接: https://arxiv.org/abs/2603.25538
作者: Wenzhuo Qian,Hailiang Zhao,Ziqi Wang,Zhipeng Gao,Jiayi Chen,Zhiwei Ling,Shuiguang Deng
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Automated incident management is critical for microservice reliability. While recent unified frameworks leverage multimodal data for joint optimization, they unrealistically assume perfect data completeness. In practice, network fluctuations and agent failures frequently cause missing modalities. Existing approaches relying on static placeholders introduce imputation noise that masks anomalies and degrades performance. To address this, we propose ARMOR, a robust self-supervised framework designed for missing modality scenarios. ARMOR features: (i) a modality-specific asymmetric encoder that isolates distribution disparities among metrics, logs, and traces; and (ii) a missing-aware gated fusion mechanism utilizing learnable placeholders and dynamic bias compensation to prevent cross-modal interference from incomplete inputs. By employing self-supervised auto-regression with mask-guided reconstruction, ARMOR jointly optimizes anomaly detection (AD), failure triage (FT), and root cause localization (RCL). AD and RCL require no fault labels, while FT relies solely on failure-type annotations for the downstream classifier. Extensive experiments demonstrate that ARMOR achieves state-of-the-art performance under complete data conditions and maintains robust diagnostic accuracy even with severe modality loss.
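"可学习占位符 + 缺失感知门控"的融合思路可示意如下:缺失模态用占位向量替换,并在门控 softmax 中对其施加负偏置,防止不完整输入主导融合表征(偏置大小、门控形式等细节为本文假设,并非 ARMOR 的实际实现):

```python
import numpy as np

def gated_fusion(features, present, placeholders, gate_logits):
    """Missing-aware gated fusion over M modalities (minimal sketch).

    features:     (M, d) per-modality feature vectors (rows for missing ones ignored)
    present:      (M,) boolean mask of available modalities
    placeholders: (M, d) learnable stand-ins used when a modality is missing
    gate_logits:  (M,) learnable gate scores; missing modalities get a negative
                  bias so they cannot dominate the fused representation
    """
    feats = np.where(present[:, None], features, placeholders)
    logits = np.where(present, gate_logits, gate_logits - 4.0)  # bias compensation
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                      # softmax over modalities
    return (gates[:, None] * feats).sum(axis=0)
```

当指标、日志、链路三种模态齐全时各模态按门控权重正常融合;某一模态缺失时,其占位向量只以很小的权重参与,融合结果主要由在场模态决定。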

[LG-9] How Class Ontology and Data Scale Affect Audio Transfer Learning

链接: https://arxiv.org/abs/2603.25476
作者: Manuel Milling,Andreas Triantafyllopoulos,Alexander Gebhard,Simon Rampp,Björn W. Schuller
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that end, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and classes in the pre-training data both have a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.

[LG-10] Causal-INSIGHT: Probing Temporal Models to Extract Causal Structure IJCNN

链接: https://arxiv.org/abs/2603.25473
作者: Benjamin Redden,Hui Wang,Shuyan Li
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCNN, 2026

点击查看摘要

Abstract:Understanding directed temporal interactions in multivariate time series is essential for interpreting complex dynamical systems and the predictive models trained on them. We present Causal-INSIGHT, a model-agnostic, post-hoc interpretation framework for extracting model-implied (predictor-dependent), directed, time-lagged influence structure from trained temporal predictors. Rather than inferring causal structure at the level of the data-generating process, Causal-INSIGHT analyzes how a fixed, pre-trained predictor responds to systematic, intervention-inspired input clamping applied at inference time. From these responses, we construct directed temporal influence signals that reflect the dependencies the predictor relies on for prediction, and introduce Qbic, a sparsity-aware graph selection criterion that balances predictive fidelity and structural complexity without requiring ground-truth graph labels. Experiments across synthetic, simulated, and realistic benchmarks show that Causal-INSIGHT generalizes across diverse backbone architectures, maintains competitive structural accuracy, and yields significant improvements in temporal delay localization when applied to existing predictors.
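The clamping idea can be illustrated with a minimal sketch (the toy predictor and array layout are hypothetical; the paper's actual probes and the Qbic criterion are more involved): clamp one series at one lag to a baseline value and measure how much the fixed predictor's output shifts.

```python
import numpy as np

def clamp_influence(predict, X, series, lag, clamp_value=0.0):
    """Influence of `series` at `lag`, measured by intervention-inspired
    input clamping on a fixed, pre-trained predictor (no retraining)."""
    Xc = X.copy()
    Xc[:, lag, series] = clamp_value
    return float(np.mean(np.abs(predict(X) - predict(Xc))))

# Toy predictor that only uses series 0 at lag 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3, 2))          # (samples, lags, series)
predict = lambda X: 2.0 * X[:, 1, 0]

strong = clamp_influence(predict, X, series=0, lag=1)   # relied-upon input
weak = clamp_influence(predict, X, series=1, lag=2)     # ignored input
```

An input the predictor relies on yields a large response to clamping, while an ignored input yields none; collecting these scores over all (series, lag) pairs gives the directed temporal influence signals.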

[LG-11] Not a fragment but the whole: Map-based evaluation of data-driven Fire Danger Index models

链接: https://arxiv.org/abs/2603.25469
作者: Shahbaz Alvi,Italo Epicoco,Jose Maria Costa Saura
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, 3 tables

点击查看摘要

Abstract:A growing body of literature has focused on predicting wildfire occurrence using machine learning methods, capitalizing on high-resolution data and fire predictors that canonical process-based frameworks largely ignore. Standard evaluation metrics for an ML classifier, while important, provide a potentially limited measure of the model’s operational performance for the Fire Danger Index (FDI) forecast. Furthermore, model evaluation is frequently conducted without adequately accounting for false positive rates, despite their critical relevance in operational contexts. In this paper, we revisit the daily FDI model evaluation paradigm and propose a novel method for evaluating a forest fire forecasting model that is aligned with real-world decision-making. Furthermore, we systematically assess performance in accurately predicting fire activity as well as the rate of false positives (false alarms). We further demonstrate that an ensemble of ML models both improves fire identification and reduces false positives.

[LG-12] Hessian-informed machine learning interatomic potential towards bridging theory and experiments

链接: https://arxiv.org/abs/2603.25373
作者: Bangchen Yin,Jian Ouyang,Zhen Fan,Kailai Lin,Hanshi Hu,Dingshun Lv,Weiluo Ren,Hai Xiao,Ji Chen,Changsu Cao
类目: Machine Learning (cs.LG)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Local curvature of potential energy surfaces is critical for predicting certain experimental observables of molecules and materials from first principles, yet it remains far beyond reach for complex systems. In this work, we introduce a Hessian-informed Machine Learning Interatomic Potential (Hi-MLIP) that captures such curvature reliably, thereby enabling accurate analysis of associated thermodynamic and kinetic phenomena. To make Hessian supervision practically viable, we develop a highly efficient training protocol, termed Hessian INformed Training (HINT), achieving two to four orders of magnitude reduction for the requirement of expensive Hessian labels. HINT integrates critical techniques, including Hessian pre-training, configuration sampling, curriculum learning and stochastic projection Hessian loss. Enabled by HINT, Hi-MLIP significantly improves transition-state search and brings Gibbs free-energy predictions close to chemical accuracy especially in data-scarce regimes. Our framework also enables accurate treatment of strongly anharmonic hydrides, reproducing phonon renormalization and superconducting critical temperatures in close agreement with experiment while bypassing the computational bottleneck of anharmonic calculations. These results establish a practical route to enhancing curvature awareness of machine learning interatomic potentials, bridging simulation and experimental observables across a wide range of systems.
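The idea of supervising curvature along random directions can be sketched as follows (a toy illustration of Hessian-vector-product matching; the finite-difference HVP, probe scheme, and quadratic toy gradients are assumptions, not the paper's HINT protocol): instead of forming full Hessians, compare the model's and a reference's responses to random probe vectors.

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    """Hessian-vector product via central finite differences of the gradient."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def projected_hessian_loss(grad_model, grad_ref, x, n_probes=8, rng=None):
    """Match model curvature to reference curvature along random probe
    directions, without ever materializing the full Hessian."""
    rng = rng or np.random.default_rng(0)
    loss = 0.0
    for _ in range(n_probes):
        v = rng.normal(size=x.shape)
        loss += np.mean((hvp(grad_model, x, v) - hvp(grad_ref, x, v)) ** 2)
    return loss / n_probes

# Toy quadratic energies E(x) = a * |x|^2 with analytic gradients 2*a*x.
grad_a = lambda x: 2.0 * 1.5 * x
grad_b = lambda x: 2.0 * 1.5 * x   # same curvature as grad_a
grad_c = lambda x: 2.0 * 3.0 * x   # double the curvature
x0 = np.zeros(4)
```

Matching curvatures drive the loss to zero, while mismatched curvatures are penalized in every probed direction, which is the kind of supervision signal that makes stochastic projection losses far cheaper than full Hessian labels.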

[LG-13] From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

链接: https://arxiv.org/abs/2603.25342
作者: Shuoling Liu,Zhiquan Tan,Kun Yi,Hui Wu,Yihan Li,Jiangpeng Yan,Liyuan Chen,Kai Chen,Qiang Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state-of-the-art achieving only a 19.9% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re-ordering and exhibit robust ontological verification – matching pure reasoning models in falsifying hallucinated premises – they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge. Our implementation will be available at this https URL.

[LG-14] Mitigating Evasion Attacks in Fog Computing Resource Provisioning Through Proactive Hardening

链接: https://arxiv.org/abs/2603.25257
作者: Younes Salmi,Hanna Bogucka
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the susceptibility to model integrity attacks that overload virtual machines assigned by the k-means algorithm used for resource provisioning in fog networks. The considered k-means algorithm runs two phases iteratively: offline clustering to form clusters of requested workload and online classification of new incoming requests into offline-created clusters. First, we consider an evasion attack against the classifier in the online phase. A threat actor launches an exploratory attack using query-based reverse engineering to discover the Machine Learning (ML) model (the clustering scheme). Then, a passive causative (evasion) attack is triggered in the offline phase. To defend the model, we suggest a proactive method using adversarial training to introduce attack robustness into the classifier. Our results show that our mitigation technique effectively maintains the stability of the resource provisioning system against attacks.

[LG-15] Offline Decision Transformers for Neural Combinatorial Optimization: Surpassing Heuristics on the Traveling Salesman Problem NEURIPS2025

链接: https://arxiv.org/abs/2603.25241
作者: Hironori Ohigashi,Shinichiro Hamada
类目: Machine Learning (cs.LG)
*备注: 11 pages, 1 figure. Accepted at NeurIPS 2025 Workshop on DiffCoALG

点击查看摘要

Abstract:Combinatorial optimization problems like the Traveling Salesman Problem are critical in industry yet NP-hard. Neural Combinatorial Optimization has shown promise, but its reliance on online reinforcement learning (RL) hampers deployment and underutilizes decades of algorithmic knowledge. We address these limitations by applying the offline RL framework, Decision Transformer, to learn superior strategies directly from datasets of heuristic solutions; the aim is not only to imitate them but to synthesize and outperform them. Concretely, we (i) integrate a Pointer Network to handle the instance-dependent, variable action space of node selection, and (ii) employ expectile regression for optimistic conditioning of Return-to-Go, which is crucial for instances with widely varying optimal values. Experiments show that our method consistently produces higher-quality tours than the four classical heuristics it is trained on, demonstrating the potential of offline RL to unlock and exceed the performance embedded in existing domain knowledge.
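The expectile-regression trick for optimistic Return-to-Go conditioning can be sketched on toy returns (the grid-search fit and return values are illustrative; the paper fits a network, not a scalar): with tau > 0.5, the fitted value is pulled toward the best heuristic outcomes rather than their average.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: tau > 0.5 penalizes under-prediction more,
    pulling the fit toward the upper tail of the return distribution."""
    u = target - pred
    w = np.where(u > 0, tau, 1.0 - tau)
    return float(np.mean(w * u ** 2))

# Return-to-go values from heuristic tours of mixed quality.
returns = np.array([1.0, 1.0, 1.0, 5.0])
cands = np.linspace(0.0, 6.0, 601)
mean_fit = cands[np.argmin([expectile_loss(c, returns, tau=0.5) for c in cands])]
optimistic = cands[np.argmin([expectile_loss(c, returns, tau=0.9) for c in cands])]
```

At tau = 0.5 the expectile is the plain mean (2.0 here); at tau = 0.9 it moves to 4.0, so conditioning generation on this optimistic target steers the policy toward the best tours in the dataset.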

[LG-16] Gap Safe Screening Rules for Fast Training of Robust Support Vector Machines under Feature Noise

链接: https://arxiv.org/abs/2603.25221
作者: Tan-Hau Nguyen,Thu-Le Tran,Kien Trung Nguyen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 19 pages

点击查看摘要

Abstract:Robust Support Vector Machines (R-SVMs) address feature noise by adopting a worst-case robust formulation that explicitly incorporates uncertainty sets into training. While this robustness improves reliability, it also leads to increased computational cost. In this work, we develop safe sample screening rules for R-SVMs that reduce the training complexity without affecting the optimal solution. To the best of our knowledge, this is the first study to apply safe screening techniques to worst-case robust models in supervised machine learning. Our approach safely identifies training samples whose uncertainty sets are guaranteed to lie entirely on either side of the margin hyperplane, thereby reducing the problem size and accelerating optimization. Owing to the nonstandard structure of R-SVMs, the proposed screening rules are derived from the Lagrangian duality rather than the Fenchel-Rockafellar duality commonly used in recent methods. Based on this analysis, we first establish an ideal screening rule, and then derive a practical rule by adapting GAP-based safe regions to the robust setting. Experiments demonstrate that the proposed method significantly reduces training time while preserving classification accuracy.

[LG-17] A CDF-First Framework for Free-Form Density Estimation

链接: https://arxiv.org/abs/2603.25204
作者: Chenglong Song,Mazharul Islam,Lin Wang,Bing Chen,Bo Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conditional density estimation (CDE) is a fundamental task in machine learning that aims to model the full conditional law $\mathbb{P}(\mathbf{y} \mid \mathbf{x})$, beyond mere point prediction (e.g., mean, mode). A core challenge is free-form density estimation, capturing distributions that exhibit multimodality, asymmetry, or topological complexity without restrictive assumptions. However, prevailing methods typically estimate the probability density function (PDF) directly, which is mathematically ill-posed: differentiating the empirical distribution amplifies random fluctuations inherent in finite datasets, necessitating strong inductive biases that limit expressivity and fail when violated. We propose a CDF-first framework that circumvents this issue by estimating the cumulative distribution function (CDF), a stable and well-posed target, and then recovering the PDF via differentiation of the learned smooth CDF. Parameterizing the CDF with a Smooth Min-Max (SMM) network, our framework guarantees valid PDFs by construction, enables tractable approximate likelihood training, and preserves complex distributional shapes. For multivariate outputs, we use an autoregressive decomposition with SMM factors. Experiments demonstrate our approach outperforms state-of-the-art density estimators on a range of univariate and multivariate tasks.
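The CDF-first idea can be sketched with a simple monotone parameterization (a convex mixture of sigmoids stands in for the paper's Smooth Min-Max network, and the parameters are hand-set rather than learned): because the CDF is smooth and monotone by construction, differentiating it yields a valid, non-negative PDF.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SigmoidMixtureCDF:
    """Smooth, monotone CDF model; the PDF is recovered analytically by
    differentiating the CDF, so it is non-negative by construction."""
    def __init__(self, weights, locs, scales):
        w = np.asarray(weights, float)
        self.w = w / w.sum()                      # convex weights -> valid CDF
        self.locs = np.asarray(locs, float)
        self.scales = np.asarray(scales, float)

    def cdf(self, y):
        z = (np.asarray(y)[:, None] - self.locs) / self.scales
        return sigmoid(z) @ self.w

    def pdf(self, y):
        # d/dy sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)) / scale >= 0
        z = (np.asarray(y)[:, None] - self.locs) / self.scales
        s = sigmoid(z)
        return (s * (1 - s) / self.scales) @ self.w

# Bimodal example: the recovered PDF is valid and carries total mass ~1.
model = SigmoidMixtureCDF([0.5, 0.5], locs=[-2.0, 2.0], scales=[0.5, 0.5])
y = np.linspace(-12.0, 12.0, 4801)
density = model.pdf(y)
mass = float(model.cdf(np.array([12.0]))[0] - model.cdf(np.array([-12.0]))[0])
```

The same recipe scales to learned monotone networks: fit the CDF on data, then obtain the PDF by (automatic) differentiation of the learned function.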

[LG-18] Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

链接: https://arxiv.org/abs/2603.25186
作者: Adam Jakobsen,Sushant Gautam,Hugo Lewi Hammer,Susanne Olofsdotter,Miriam S Johanson,Pål Halvorsen,Vajira Thambawita
类目: Machine Learning (cs.LG)
*备注: Submitted to CBMS 2026

点击查看摘要

Abstract:AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.
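The retrieval step of such a RAG pipeline can be sketched as cosine-similarity search over embedded passages (the 3-d vectors and passage texts are toy stand-ins; real DSM-5/ICD-10 passages would be embedded with a sentence encoder, and the retrieved text would be prepended to the generation prompt).

```python
import numpy as np

def retrieve(query_vec, kb_vecs, kb_texts, k=2):
    """Top-k cosine-similarity retrieval over an embedded knowledge base;
    retrieved passages ground the LLM's zero-shot synthetic generation."""
    sims = kb_vecs @ query_vec / (
        np.linalg.norm(kb_vecs, axis=1) * np.linalg.norm(query_vec))
    return [kb_texts[i] for i in np.argsort(-sims)[:k]]

# Toy 3-d embeddings standing in for real DSM-5 / ICD-10 passage embeddings.
kb_texts = ["panic disorder criteria", "specific phobia criteria",
            "separation anxiety criteria"]
kb_vecs = np.array([[1.0, 0.1, 0.0],
                    [0.0, 1.0, 0.1],
                    [0.1, 0.0, 1.0]])
query = np.array([0.9, 0.2, 0.1])            # a "panic-like" query embedding
passages = retrieve(query, kb_vecs, kb_texts, k=1)
```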

[LG-19] SEVerA: Verified Synthesis of Self-Evolving Agents

链接: https://arxiv.org/abs/2603.25111
作者: Debangshu Banerjee,Changming Xu,Gagandeep Singh
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: Formally Verified Self-Evolving LLM Agents

点击查看摘要

Abstract:Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($\tau^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.
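The FGGM wrapper can be sketched as a rejection sampler with a verified fallback (the even-integer contract and retry budget are toy assumptions; the paper's contracts are first-order-logic specifications over model outputs): every value that leaves the wrapper satisfies the contract, regardless of what the underlying model emits.

```python
import random

def fggm_call(generate, contract, fallback, max_tries=8):
    """Wrap a generative call so every returned value satisfies the formal
    contract: rejection-sample up to max_tries, then use a fallback that is
    verified a priori to satisfy the contract."""
    for _ in range(max_tries):
        out = generate()
        if contract(out):
            return out
    return fallback

# Toy contract: the "model" must emit a non-negative even integer.
contract = lambda n: isinstance(n, int) and n >= 0 and n % 2 == 0
rng = random.Random(0)
outputs = [fggm_call(lambda: rng.randint(-5, 5), contract, fallback=0)
           for _ in range(200)]
```

Because the guarantee holds for any model parameters, downstream tuning can optimize the soft objective freely without re-proving correctness, which is exactly the reduction SEVerA exploits.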

[LG-20] Process-Aware AI for Rainfall-Runoff Modeling: A Mass-Conserving Neural Framework with Hydrological Process Constraints

链接: https://arxiv.org/abs/2603.25093
作者: Mohammad A. Farmani,Hoshin V. Gupta,Ali Behrangi,Muhammad Jawad,Sadaf Moghisi,Guo-Yue Niu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models can achieve high predictive accuracy in hydrological applications but often lack physical interpretability. The Mass-Conserving Perceptron (MCP) provides a physics-aware artificial intelligence (AI) framework that enforces conservation principles while allowing hydrological process relationships to be learned from data. In this study, we investigate how progressively embedding physically meaningful representations of hydrological processes within a single MCP storage unit improves predictive skill and interpretability in rainfall-runoff modeling. Starting from a minimal MCP formulation, we sequentially introduce bounded soil storage, state-dependent conductivity, variable porosity, infiltration capacity, surface ponding, vertical drainage, and nonlinear water-table dynamics. The resulting hierarchy of process-aware MCP models is evaluated across 15 catchments spanning five hydroclimatic regions of the continental United States using daily streamflow prediction as the target. Results show that progressively augmenting the internal physical structure of the MCP unit generally improves predictive performance. The influence of these process representations is strongly hydroclimate dependent: vertical drainage substantially improves model skill in arid and snow-dominated basins but reduces performance in rainfall-dominated regions, while surface ponding has comparatively small effects. The best-performing MCP configurations approach the predictive skill of a Long Short-Term Memory benchmark while maintaining explicit physical interpretability. These results demonstrate that embedding hydrological process constraints within AI architectures provides a promising pathway toward interpretable and process-aware rainfall-runoff modeling.
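The conservation constraint at the heart of the MCP can be sketched for a single storage unit (the fixed flux coefficients and forcing are illustrative; in the MCP these gates are learned, state-dependent functions): every unit of precipitation is accounted for exactly as storage change plus outgoing fluxes.

```python
def mcp_step(S, P, k_et=0.1, k_q=0.3, S_max=100.0):
    """One step of a mass-conserving storage unit: outgoing fluxes are
    fractions of current storage (gates in [0, 1]), and overflow above
    S_max is routed to runoff, so water is conserved exactly."""
    et = k_et * S                              # evapotranspiration flux
    q = k_q * S                                # drainage / runoff flux
    S_next = S + P - et - q
    spill = max(S_next - S_max, 0.0)           # overflow routed to runoff
    return S_next - spill, et, q + spill

S0 = S = 50.0
total_in, total_out = 0.0, 0.0
for P in [10.0, 0.0, 5.0]:                     # daily precipitation forcing
    S, et, q = mcp_step(S, P)
    total_in += P
    total_out += et + q
```

The closing mass balance (initial storage + inflow = outflow + final storage) holds to machine precision by construction, which is the property a plain LSTM cannot guarantee.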

[LG-21] SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning ICML2026

链接: https://arxiv.org/abs/2603.25062
作者: Xinyu Wang,Fei Dou,Jinbo Bi,Minghu Song
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures. Submitted to ICML 2026. Primary category: cs.LG (Machine Learning); Secondary: cs.AI, q-bio.QM

点击查看摘要

Abstract:Linearized string representations serve as the foundation of scalable autoregressive molecular generation; however, they introduce a fundamental modality mismatch where a single molecular graph maps to multiple distinct sequences. This ambiguity leads to *trajectory divergence*, where the latent representations of structurally equivalent partial graphs drift apart due to differences in linearization history. To resolve this without abandoning the efficient string formulation, we propose Structure-Invariant Generative Molecular Alignment (SIGMA). Rather than altering the linear representation, SIGMA enables the model to strictly recognize geometric symmetries via a token-level contrastive objective, which explicitly aligns the latent states of prefixes that share identical suffixes. Furthermore, we introduce Isomorphic Beam Search (IsoBeam) to eliminate isomorphic redundancy during inference by dynamically pruning equivalent paths. Empirical evaluations on standard benchmarks demonstrate that SIGMA bridges the gap between sequence scalability and graph fidelity, yielding superior sample efficiency and structural diversity in multi-parameter optimization compared to strong baselines.

[LG-22] The Order Is The Message

链接: https://arxiv.org/abs/2603.25047
作者: Jordan LeDoux
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 51 pages, 12 figures

点击查看摘要

Abstract:In a controlled experiment on modular arithmetic ($p = 9973$), varying only example ordering while holding all else constant, two fixed-ordering strategies achieve 99.5% test accuracy by epochs 487 and 659 respectively from a training set comprising 0.3% of the input space, well below established sample complexity lower bounds for this task under IID ordering. The IID baseline achieves 0.30% after 5,000 epochs from identical data. An adversarially structured ordering suppresses learning entirely. The generalizing model reliably constructs a Fourier representation whose fundamental frequency is the Fourier dual of the ordering structure, encoding information present in no individual training example, with the same fundamental emerging across all seeds tested regardless of initialization or training set composition. We discuss implications for training efficiency, the reinterpretation of grokking, and the safety risks of a channel that evades all content-level auditing.

[LG-23] Epistemic Compression: The Case for Deliberate Ignorance in High-Stakes AI

链接: https://arxiv.org/abs/2603.25033
作者: Steffen Lukas
类目: Machine Learning (cs.LG)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:Foundation models excel in stable environments, yet often fail where reliability matters most: medicine, finance, and policy. This Fidelity Paradox is not just a data problem; it is structural. In domains where rules change over time, extra model capacity amplifies noise rather than capturing signal. We introduce Epistemic Compression: the principle that robustness emerges from matching model complexity to the shelf life of the data, not from scaling parameters. Unlike classical regularization, which penalizes weights post hoc, Epistemic Compression enforces parsimony through architecture: the model structure itself is designed to reduce overfitting by making it architecturally costly to represent variance that exceeds the evidence in the data. We operationalize this with a Regime Index that separates Shifting Regime (unstable, data-poor; simplicity wins) from Stable Regime (invariant, data-rich; complexity viable). In an exploratory synthesis of 15 high-stakes domains, this index was concordant with the empirically superior modeling strategy in 86.7% of cases (13/15). High-stakes AI demands a shift from scaling for its own sake to principled parsimony.

[LG-24] Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

链接: https://arxiv.org/abs/2603.25029
作者: Haishan Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well-known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions still remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
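The two-point estimator at the center of this setting can be sketched directly (a standard construction; the quadratic test function is illustrative): two function values at x ± δu along a random unit direction u give an estimate whose mean recovers the gradient, but whose tails scale with the dimension, which is what complicates high-probability analysis.

```python
import numpy as np

def two_point_grad(f, x, delta=1e-3, rng=None):
    """Two-point bandit gradient estimator: only two function queries,
    at x + delta*u and x - delta*u for a random unit direction u."""
    rng = rng or np.random.default_rng(0)
    u = rng.normal(size=x.size)
    u /= np.linalg.norm(u)
    return (x.size / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u

# Averaged over many directions, the estimator recovers the true gradient 2x.
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(1)
est = np.mean([two_point_grad(f, x, rng=rng) for _ in range(20000)], axis=0)
```

Each single estimate can be far from the gradient (its magnitude scales with d), so the mean converges while individual draws remain heavy-tailed relative to a plain stochastic gradient.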

[LG-25] A Systematic Empirical Study of Grokking: Depth Architecture Activation and Regularization

链接: https://arxiv.org/abs/2603.25009
作者: Shalima Binta Manir,Anamika Paul Rupa
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) depth has a non-monotonic effect, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) the apparent gap between Transformers and MLPs largely disappears (1.11× delay) under matched hyperparameters, indicating that previously reported differences are largely due to optimizer and regularization confounds; (3) activation function effects are regime-dependent, with GELU up to 4.3× faster than ReLU only when regularization permits memorization; and (4) weight decay is the dominant control parameter, exhibiting a narrow "Goldilocks" regime in which grokking occurs, while too little or too much prevents generalization. Across 3–5 seeds per configuration, these results provide a unified empirical account of grokking as an interaction-driven phenomenon. Our findings challenge architecture-centric interpretations and clarify how optimization and regularization jointly govern delayed generalization.

[LG-26] MobileDev-Bench: A Comprehensive Benchmark for Evaluating Language Models on Mobile Application Development

链接: https://arxiv.org/abs/2603.24946
作者: Moshood A. Fakorede,Krishna Upadhyay,A.B. Siddique,Umar Farooq
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 21 pages, 11 figures, 14 tables

点击查看摘要

Abstract:Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on general-purpose libraries or web applications, leaving mobile application development largely unexplored despite its strict platform constraints, framework-driven lifecycles, and complex platform API interactions. We introduce MobileDev-Bench, a benchmark comprising 384 real-world issue-resolution tasks collected from 18 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs an authentic developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantial patch complexity: fixes modify 12.5 files and 324.9 lines on average, and 35.7% of instances require coordinated changes across multiple artifact types, such as source and manifest files. Evaluation of four state-of-the-art code-capable LLMs, GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder, yields low end-to-end resolution rates of 3.39%-5.21%, revealing significant performance gaps compared to prior benchmarks. Further analysis reveals systematic failure modes, with fault localization across multi-file and multi-artifact changes emerging as the primary bottleneck.

[LG-27] Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML

链接: https://arxiv.org/abs/2603.24916
作者: Yassien Shaalan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages, 5 figures. Accepted at MLSys 2026. TinyML / on-device learning paper on hypernetwork-based compression for ECG and other 1D biosignals, with integer-only inference on commodity MCUs. Evaluated on Apnea-ECG, PTB-XL, and MIT-BIH. Camera-ready version with additional datasets, experiments, and insights will appear after May 2026

点击查看摘要

Abstract:Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) mixers often dominate memory even after INT8 quantization across vision, audio, and wearable sensing. We present HYPER-TINYPW, a compression-as-generation approach that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and adds only a one-off synthesis cost; steady-state latency and energy match INT8 separable CNN baselines. Enforcing a shared latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We contribute (i) TinyML-faithful packed-byte accounting covering generator, heads/factorization, codes, kept PW1, and backbone; (ii) a unified evaluation with validation-tuned t* and bootstrap confidence intervals; and (iii) a deployability analysis covering integer-only inference and boot versus lazy synthesis. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HYPER-TINYPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a roughly 1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, indicating a wider role for compression-as-generation in resource-constrained ML systems. Beyond ECG, HYPER-TINYPW transfers to TinyML audio: on Speech Commands it reaches 96.2% test accuracy (98.2% best validation), supporting broader applicability to embedded sensing workloads where repeated linear mixers dominate memory.
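The compression-as-generation mechanism can be sketched in a few lines (layer counts, channel widths, and the tanh generator are illustrative choices, and INT8 quantization is omitted): a shared micro-MLP turns tiny per-layer codes into full pointwise kernels, so only the generator and the codes are stored.

```python
import numpy as np

rng = np.random.default_rng(0)

def micro_mlp(code, W1, b1, W2, b2):
    """Shared generator: maps a tiny per-layer code to a full 1x1
    pointwise-conv weight matrix (synthesized once at load time, cached)."""
    h = np.tanh(code @ W1 + b1)
    return h @ W2 + b2

C, code_dim, hidden, n_layers = 64, 4, 8, 64
W1 = rng.normal(scale=0.1, size=(code_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, C * C));    b2 = np.zeros(C * C)
codes = rng.normal(size=(n_layers, code_dim))        # 4 floats per layer

cached = [micro_mlp(c, W1, b1, W2, b2).reshape(C, C) for c in codes]

stored = codes.size + W1.size + b1.size + W2.size + b2.size
dense = n_layers * C * C                 # what a plain model would store
```

At these toy sizes the stored parameter count is roughly 7x smaller than storing all 64 dense kernels; the saving grows with the number of layers sharing the generator, which mirrors (in spirit only) the flash reductions reported in the abstract.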

[LG-28] Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLM s for Warehouse Staffing Optimization ICLR2026

链接: https://arxiv.org/abs/2603.24883
作者: Kalle Kujanpää,Yuying Zhu,Kristina Klinkner,Shervin Malmasi
类目: Machine Learning (cs.LG)
*备注: ICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making

点击查看摘要

Abstract:We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, with different trade-offs. We evaluate two approaches, each in a matching simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate to significant savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions. These are a natural fit for decisions that warehouse managers make using high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences achieves performance that matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making. Offline RL excels with task-specific architectures. LLMs support human-readable inputs and can be combined with an iterative feedback loop that can incorporate manager preferences.

[LG-29] Flow matching on homogeneous spaces

Link: https://arxiv.org/abs/2603.24829
Authors: Francesco Ruscelli
Subjects: Machine Learning (cs.LG)
*Note: 10 pages

Click to view abstract

Abstract:We propose a general framework to extend Flow Matching to homogeneous spaces, i.e. quotients of Lie groups. Our approach reformulates the problem as a flow matching task on the underlying Lie group by lifting the data distributions. This strategy avoids the potentially complicated geometry of homogeneous spaces by working directly on Lie groups, which in turn enables us to reduce the problem to a Euclidean flow matching task on Lie algebras. In contrast to Riemannian Flow Matching, our method eliminates the need to define and compute premetrics or geodesics, resulting in a simpler, faster, and fully intrinsic framework.
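After lifting to the Lie group, the method reduces to ordinary Euclidean flow matching on the Lie algebra. A minimal sketch of that Euclidean objective (standard conditional flow matching with a linear path; the constant predictor below merely stands in for a neural velocity field and is not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D "data" distribution and a standard Gaussian prior.
x1 = rng.normal(loc=[3.0, -1.0], scale=0.2, size=(256, 2))  # data samples
x0 = rng.normal(size=(256, 2))                               # prior samples
t = rng.uniform(size=(256, 1))                               # random times

# Linear probability path x_t = (1 - t) x0 + t x1 with conditional
# target velocity u_t = x1 - x0 (the regression target of flow matching).
xt = (1 - t) * x0 + t * x1
u_target = x1 - x0

# A constant predictor stands in for the neural velocity field v(x_t, t);
# the CFM objective is a plain mean-squared regression onto u_target.
v_pred = u_target.mean(axis=0)
cfm_loss = np.mean(np.sum((v_pred - u_target) ** 2, axis=1))
print(cfm_loss)
```

In practice `v_pred` would be a network evaluated at `(xt, t)` and trained by gradient descent on `cfm_loss`; sampling then integrates the learned field from the prior.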

[LG-30] Local learning for stable backpropagation-free neural network training towards physical learning

Link: https://arxiv.org/abs/2603.24790
Authors: Yaqi Guo, Fabian Braun, Bastiaan Ketelaar, Stephanie Tan, Richard Norte, Siddhant Kumar
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*Note:

Click to view abstract

Abstract:While backpropagation and automatic differentiation have driven deep learning’s success, the physical limits of chip manufacturing and rising environmental costs of deep learning motivate alternative learning paradigms such as physical neural networks. However, most existing physical neural networks still rely on digital computing for training, largely because backpropagation and automatic differentiation are difficult to realize in physical systems. We introduce FFzero, a forward-only learning framework enabling stable neural network training without backpropagation or automatic differentiation. FFzero combines layer-wise local learning, prototype-based representations, and directional-derivative-based optimization through forward evaluations only. We show that local learning is effective under forward-only optimization, where backpropagation fails. FFzero generalizes to multilayer perceptrons and convolutional neural networks across classification and regression. Using a simulated photonic neural network as an example, we demonstrate that FFzero provides a viable path toward backpropagation-free in-situ physical learning.

[LG-31] Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

Link: https://arxiv.org/abs/2603.24780
Authors: Jungtaek Kim, Thomas Zeng, Ziqian Lin, Minjae Lee, Chungpa Lee, Jy-yong Sohn, Hyung Il Koo, Kangwook Lee
Subjects: Machine Learning (cs.LG)
*Note: Accepted for publication in Transactions on Machine Learning Research (TMLR)

Click to view abstract

Abstract:Effective problem solving with Large Language Models (LLMs) can be enhanced when they are paired with external search algorithms. By viewing the space of diverse ideas and their follow-up possibilities as a tree structure, the search algorithm can navigate such a search space and guide the LLM toward better solutions more efficiently. While the search algorithm enables an effective balance between exploitation and exploration of a tree-structured space, the need for an external component can complicate the overall problem-solving process. We therefore pose the following question: Can LLMs or their underlying Transformer architectures approximate a search algorithm? To answer this question, we first introduce a simplified framework in which tree extensions and feedback signals are externally specified, allowing for controlled evaluation of search capabilities. We call this setting unknown tree search with bandit feedback. Within this setting, we show that Transformers are theoretically expressive enough to implement distinct search strategies and can be trained from scratch to approximate those strategies. Our Transformer models exhibit the possibility of generalizing to unseen conditions such as longer horizons or deeper trees. Furthermore, we demonstrate that continued task-focused training unlocks the complete capabilities of a pretrained LLM, by fine-tuning the LLM on search trajectories.
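As a concrete reference point for the kind of search strategy such Transformers are trained to approximate, here is plain UCB1 on a toy bandit, standing in for one node-selection step of tree search with bandit feedback (an illustrative sketch with assumed arm means, not the paper's setup):

```python
import math
import random

random.seed(0)

# Toy "unknown tree search with bandit feedback": three child nodes
# (arms); each pull returns a noisy Bernoulli reward with a hidden mean.
means = [0.2, 0.5, 0.8]
counts = [0, 0, 0]
sums = [0.0, 0.0, 0.0]

def ucb_pick(t):
    """UCB1: pull each arm once, then maximize mean + exploration bonus."""
    for a in range(3):
        if counts[a] == 0:
            return a
    return max(range(3),
               key=lambda a: sums[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

for t in range(1, 2001):
    a = ucb_pick(t)
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    sums[a] += r

print(counts)  # the best arm (index 2) should dominate the pull counts
```

The exploration bonus shrinks as an arm is pulled more, which is exactly the exploitation-exploration balance the abstract describes for navigating a tree-structured space.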

[LG-32] Contrastive Learning Boosts Deterministic and Generative Models for Weather Data

Link: https://arxiv.org/abs/2603.24744
Authors: Nathan Bailey
Subjects: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Weather data, comprising multiple variables, poses significant challenges due to its high dimensionality and multimodal nature. Creating low-dimensional embeddings requires compressing this data into a compact, shared latent space. This compression is required to improve the efficiency and performance of downstream tasks, such as forecasting or extreme-weather detection. Self-supervised learning, particularly contrastive learning, offers a way to generate low-dimensional, robust embeddings from unlabelled data, enabling downstream tasks when labelled data is scarce. Despite initial exploration of contrastive learning in weather data, particularly with the ERA5 dataset, the current literature does not extensively examine its benefits relative to alternative compression methods, notably autoencoders. Moreover, current work on contrastive learning does not investigate how these models can incorporate sparse data, which is more common in real-world data collection. It is critical to explore and understand how contrastive learning contributes to creating more robust embeddings for sparse weather data, thereby improving performance on downstream tasks. Our work extensively explores contrastive learning on the ERA5 dataset, aligning sparse samples with complete ones via a contrastive loss term to create SPARse-data augmented conTRAstive spatiotemporal embeddings (SPARTA). We introduce a temporally aware batch sampling strategy and a cycle-consistency loss to improve the structure of the latent space. Furthermore, we propose a novel graph neural network fusion technique to inject domain-specific physical knowledge. Ultimately, our results demonstrate that contrastive learning is a feasible and advantageous compression method for sparse geoscience data, thereby enhancing performance in downstream tasks. 
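A minimal sketch of the contrastive alignment mechanism (standard InfoNCE with in-batch negatives, standing in for the paper's loss between sparse and complete samples; batch size, dimensions, and temperature are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(z_sparse, z_full, tau=0.1):
    """InfoNCE loss aligning sparse-view embeddings with their complete-view
    counterparts; row i of each batch is a positive pair, others negatives."""
    z1 = z_sparse / np.linalg.norm(z_sparse, axis=1, keepdims=True)
    z2 = z_full / np.linalg.norm(z_full, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Well-aligned embeddings (positives nearly identical) vs. random ones.
z = rng.normal(size=(32, 16))
loss_aligned = info_nce(z + 0.01 * rng.normal(size=z.shape), z)
loss_random = info_nce(rng.normal(size=(32, 16)), z)
print(loss_aligned, loss_random)
```

The loss is low only when each sparse embedding sits closest to its own complete counterpart in the batch, which is what pushes the encoder toward a shared latent space.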

[LG-33] Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?

Link: https://arxiv.org/abs/2603.24714
Authors: Sounak Dutta, Fin Amin, Sushil Panda, Jonathan Rabe, Yuejiang Wen, Paul Franzon
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Note: 7 pages, 5 figures

Click to view abstract

Abstract:Analog design often slows down because even small changes to device sizes or biases require expensive simulation cycles, and high-quality solutions typically occupy only a narrow part of a very large search space. While existing optimizers reduce some of this burden, they largely operate without the kind of judgment designers use when deciding where to search next. This paper presents an actor-critic optimization framework (ACOF) for analog sizing that brings that form of guidance into the loop. Rather than treating optimization as a purely black-box search problem, ACOF separates the roles of proposal and evaluation: an actor suggests promising regions of the design space, while a critic reviews those choices, enforces design legality, and redirects the search when progress is hampered. This structure preserves compatibility with standard simulator-based flows while making the search process more deliberate, stable, and interpretable. Across our test circuits, ACOF improves the top-10 figure of merit by an average of 38.9% over the strongest competing baseline and reduces regret by an average of 24.7%, with peak gains of 70.5% in FoM and 42.2% lower regret on individual circuits. By combining iterative reasoning with simulation-driven search, the framework offers a more transparent path toward automated analog sizing across challenging design spaces.

[LG-34] Energy-Efficient Hierarchical Federated Anomaly Detection for the Internet of Underwater Things via Selective Cooperative Aggregation

Link: https://arxiv.org/abs/2603.24648
Authors: Kenechi Omeke, Michael Mollel, Lei Zhang, Qammer H. Abbasi, Muhammad Ali Imran
Subjects: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Anomaly detection is a core service in the Internet of Underwater Things, yet training accurate distributed models underwater is difficult because acoustic links are low-bandwidth, energy-intensive, and often unable to support direct sensor-to-surface communication. Standard flat federated learning therefore faces two coupled limitations in underwater deployments: expensive long-range transmissions and reduced participation when only a subset of sensors can reach the gateway. This paper proposes an energy-efficient hierarchical federated learning framework for underwater anomaly detection based on three components: feasibility-aware sensor-to-fog association, compressed model-update transmission, and selective cooperative aggregation among fog nodes. The proposed three-tier architecture localises most communication within short-range clusters while activating fog-to-fog exchange only when smaller clusters can benefit from nearby larger neighbours. A physics-grounded underwater acoustic model is used to evaluate detection quality, communication energy, and network participation jointly. In large synthetic deployments, only about 48% of sensors can directly reach the gateway in the 200-sensor case, whereas hierarchical learning preserves full participation through feasible fog paths. Selective cooperation matches the detection accuracy of always-on inter-fog exchange while reducing its energy by 31-33%, and compressed uploads reduce total energy by 71-95% in matched sensitivity tests. Experiments on three real benchmarks further show that low-overhead hierarchical methods remain competitive in detection quality, while flat federated learning defines the minimum-energy operating point. These results provide practical design guidance for underwater deployments operating under severe acoustic communication constraints.

[LG-35] Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

Link: https://arxiv.org/abs/2603.24647
Authors: Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Note:

Click to view abstract

Abstract:The autoresearch repository enables an LLM agent to search for optimal hyperparameter configurations on an unconstrained search space by editing the training code directly. Given a fixed compute budget and constraints, we use autoresearch as a testbed to compare classical hyperparameter optimization (HPO) algorithms against LLM-based methods on tuning the hyperparameters of a small language model. Within a fixed hyperparameter search space, classical HPO methods such as CMA-ES and TPE consistently outperform LLM-based agents. However, an LLM agent that directly edits training source code in an unconstrained search space narrows the gap to classical methods substantially despite using only a self-hosted open-weight 27B model. Methods that avoid out-of-memory failures outperform those with higher search diversity, suggesting that reliability matters more than exploration breadth. While small and mid-sized LLMs struggle to track optimization state across trials, classical methods lack domain knowledge. To bridge this gap, we introduce Centaur, a hybrid that shares CMA-ES’s internal state, including mean vector, step-size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, with its 0.8B variant outperforming the 27B variant, suggesting that a cheap LLM suffices when paired with a strong classical optimizer. The 0.8B model is insufficient for unconstrained code editing but sufficient for hybrid optimization, while scaling to 27B provides no advantage for fixed search space methods with the open-weight models tested. Code is available at this https URL.

[LG-36] Physics-Informed Neural Network Digital Twin for Dynamic Tray-Wise Modeling of Distillation Columns under Transient Operating Conditions

Link: https://arxiv.org/abs/2603.24644
Authors: Debadutta Patra, Ayush Bardhan Tripathy, Soumya Ranjan Sahu, Sucheta Panda
Subjects: Machine Learning (cs.LG)
*Note: 17 pages, 10 figures

Click to view abstract

Abstract:Digital twin technology, when combined with physics-informed machine learning and Aspen simulation results, offers transformative capabilities for industrial process monitoring, control, and optimization. This work presents a Physics-Informed Neural Network (PINN) digital twin framework for the dynamic, tray-wise modeling of binary distillation columns operating under transient conditions. The architecture of the proposed model embeds fundamental thermodynamic constraints, including vapor-liquid equilibrium (VLE) described by modified Raoult’s law, tray-level mass and energy balances, and the McCabe-Thiele graphical methodology, directly into the neural network loss function via physics residual terms. The model is trained and evaluated on a high-fidelity synthetic dataset of 961 timestamped measurements spanning 8 hours of transient operation, generated in Aspen HYSYS for a binary HX/TX distillation system comprising 16 sensor streams. An adaptive loss-weighting scheme balances the data fidelity and physics consistency objectives during training. Compared to five data-driven baselines (LSTM, vanilla MLP, GRU, Transformer, DeepONet), the proposed PINN achieves an RMSE of 0.00143 for HX mole fraction prediction (R^2 = 0.9887), representing a 44.6% reduction over the best data-only baseline, while strictly satisfying thermodynamic constraints. Tray-wise temperature and composition profiles predicted under transient perturbations demonstrate that the digital twin accurately captures column dynamics including feed tray responses, reflux ratio variations, and pressure transients. These results establish the proposed PINN digital twin as a robust foundation for real-time soft sensing, model-predictive control, and anomaly detection in industrial distillation processes.

[LG-37] Learning Mesh-Free Discrete Differential Operators with Self-Supervised Graph Neural Networks

Link: https://arxiv.org/abs/2603.24641
Authors: Lucas Gerken Starepravo, Georgios Fourtakas, Steven Lind, Ajay B. Harish, Tianning Tang, Jack R. C. King
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
*Note:

Click to view abstract

Abstract:Mesh-free numerical methods provide flexible discretisations for complex geometries; however, classical meshless discrete differential operators typically trade low computational cost for limited accuracy or high accuracy for substantial per-stencil computation. We introduce a parametrised framework for learning mesh-free discrete differential operators using a graph neural network trained via polynomial moment constraints derived from truncated Taylor expansions. The model maps the relative positions of local stencil points directly to discrete operator weights. The current work demonstrates that neural networks can learn classical polynomial consistency while retaining robustness to irregular neighbourhood geometry. The learned operators depend only on local geometry, are resolution-agnostic, and can be reused across particle configurations and governing equations. We evaluate the framework using standard numerical analysis diagnostics, showing improved accuracy over Smoothed Particle Hydrodynamics, and a favourable accuracy-cost trade-off relative to a representative high-order consistent mesh-free method in the moderate-accuracy regime. Applicability is demonstrated by solving the weakly compressible Navier-Stokes equations using the learned operators.
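The polynomial moment constraints can be made concrete by solving them directly instead of with a GNN: require that the stencil weights reproduce the derivatives of monomials up to a chosen order (a standard consistency construction from truncated Taylor expansions, assuming a 1-D stencil for simplicity; not the paper's network):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Irregular 1-D stencil of 6 points around the evaluation point x0 = 0.
dx = np.sort(rng.uniform(-0.1, 0.1, size=6))

# Moment (consistency) conditions from the truncated Taylor expansion:
# sum_i w_i * dx_i^k / k!  must equal 1 for k = 1 (first derivative)
# and 0 for every other monomial up to the chosen order.
order = 4
A = np.vstack([dx**k / math.factorial(k) for k in range(order + 1)])
b = np.zeros(order + 1)
b[1] = 1.0                                 # target operator: d/dx at x0
w = np.linalg.lstsq(A, b, rcond=None)[0]   # min-norm weights with A w = b

# Check consistency on a smooth function: the weights approximate f'(0).
approx = w @ np.sin(dx)                    # f(x) = sin(x), f'(0) = 1
print(approx)
```

The learned-operator framework trains a network to emit such weights from stencil geometry; the residual of these same moment conditions is what supervises it.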

[LG-38] How unconstrained machine-learning models learn physical symmetries

Link: https://arxiv.org/abs/2603.24638
Authors: Michelangelo Domina, Joseph William Abbott, Paolo Pegolo, Filippo Bigi, Michele Ceriotti
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*Note: 15 pages, 9 figures

Click to view abstract

Abstract:The requirement of generating predictions that exactly fulfill the fundamental symmetry of the corresponding physical quantities has profoundly shaped the development of machine-learning models for physical simulations. In many cases, models are built using constrained mathematical forms that ensure that symmetries are enforced exactly. However, unconstrained models that do not obey rotational symmetries are often found to have competitive performance, and to be able to learn, to a high level of accuracy, an approximate equivariant behavior with a simple data augmentation strategy. In this paper, we introduce rigorous metrics to measure the symmetry content of the learned representations in such models, and assess the accuracy by which the outputs fulfill the equivariant condition. We apply these metrics to two unconstrained, transformer-based models operating on decorated point clouds (a graph neural network for atomistic simulations and a PointNet-style architecture for particle physics) to investigate how symmetry information is processed across architectural layers and is learned during training. Based on these insights, we establish a rigorous framework for diagnosing spectral failure modes in ML models. Enabled by this analysis, we demonstrate that one can achieve superior stability and accuracy by strategically injecting the minimum required inductive biases, preserving the high expressivity and scalability of unconstrained architectures while guaranteeing physical fidelity.

[LG-39] Multi-LLM Query Optimization

Link: https://arxiv.org/abs/2603.24617
Authors: Arlen Dean, Zijin Zhang, Stefanus Jasin, Yuqing Liu
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Note:

Click to view abstract

Abstract:Deploying multiple large language models (LLMs) in parallel to classify an unknown ground-truth label is a common practice, yet the problem of optimally allocating queries across heterogeneous models remains poorly understood. In this paper, we formulate a robust, offline query-planning problem that minimizes total query cost subject to statewise error constraints which guarantee reliability for every possible ground-truth label. We first establish that this problem is NP-hard via a reduction from the minimum-weight set cover problem. To overcome this intractability, we develop a surrogate by combining a union bound decomposition of the multi-class error into pairwise comparisons with Chernoff-type concentration bounds. The resulting surrogate admits a closed-form, multiplicatively separable expression in the query counts and is guaranteed to be feasibility-preserving. We further show that the surrogate is asymptotically tight at the optimization level: the ratio of surrogate-optimal cost to true optimal cost converges to one as error tolerances shrink, with an explicit rate of O\left(\log\log(1/\alpha_\min) / \log(1/\alpha_\min)\right) . Finally, we design an asymptotic fully polynomial-time approximation scheme (AFPTAS) that returns a surrogate-feasible query plan within a (1+\varepsilon) factor of the surrogate optimum.
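The flavor of the Chernoff-type surrogate can be illustrated with a simple Hoeffding bound for a single pairwise error constraint (the margins and tolerances below are hypothetical; the paper's multi-class surrogate composes many such constraints via a union bound):

```python
import math

# Hoeffding-style bound: if each query to a model separates labels a and b
# correctly with probability 1/2 + delta, a majority vote over n queries
# errs with probability at most exp(-2 * n * delta^2).
def queries_needed(delta, alpha):
    """Smallest n with exp(-2 n delta^2) <= alpha for one pairwise
    error constraint (delta = per-query accuracy margin)."""
    return math.ceil(math.log(1.0 / alpha) / (2.0 * delta**2))

# Hypothetical models: cheap-but-weak vs. accurate-but-costly.
n_weak = queries_needed(delta=0.05, alpha=0.01)
n_strong = queries_needed(delta=0.3, alpha=0.01)
print(n_weak, n_strong)
```

Because the required query count is multiplicatively separable in this way, the planner can trade each model's cost per query against the number of queries it needs, which is the structure the surrogate exploits.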

[LG-40] The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks

Link: https://arxiv.org/abs/2603.25579
Authors: Gabriele Farné, Fabrizio Boncoraglio, Lenka Zdeborová
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher-student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction 1 - \varepsilon of training labels is generated by a structured teacher rule, while a fraction \varepsilon consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule - allowing generalization to new data - and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.

[LG-41] Conformal Prediction for Nonparametric Instrumental Regression

Link: https://arxiv.org/abs/2603.25509
Authors: Masahiro Kato
Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*Note:

Click to view abstract

Abstract:We propose a method for constructing distribution-free prediction intervals in nonparametric instrumental variable regression (NPIV), with finite-sample coverage guarantees. Building on the conditional guarantee framework in conformal inference, we reformulate conditional coverage as marginal coverage over a class of IV shifts \mathcal{F}. Our method can be combined with any NPIV estimator, including sieve 2SLS and other machine-learning-based NPIV methods such as neural-network minimax approaches. Our theoretical analysis establishes distribution-free, finite-sample coverage over a practitioner-chosen class of IV shifts.
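The finite-sample guarantee rests on standard split-conformal machinery; a minimal sketch with a toy point predictor standing in for the NPIV estimator (this omits the paper's IV-shift reweighting, so it only shows the marginal-coverage mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)

# Calibration data and any point predictor (here a hypothetical toy one).
x_cal = rng.uniform(-3, 3, size=500)
y_cal = 2.0 * x_cal + rng.normal(scale=0.5, size=500)

def predict(x):
    return 2.0 * x        # stand-in for a fitted NPIV estimator

# Split conformal: the (1 - alpha) empirical quantile of calibration
# residuals yields a finite-sample marginal coverage guarantee.
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))          # conformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]                       # conformal quantile

# Interval for a new point: [prediction - q, prediction + q].
x_new = rng.uniform(-3, 3, size=2000)
y_new = 2.0 * x_new + rng.normal(scale=0.5, size=2000)
covered = np.mean(np.abs(y_new - predict(x_new)) <= q)
print(q, covered)
```

Empirical coverage lands near the nominal 90% regardless of the data distribution; the paper's contribution is extending this guarantee to hold over a whole class of IV shifts rather than only marginally.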

[LG-42] Residual-as-Teacher: Mitigating Bias Propagation in Student–Teacher Estimation

Link: https://arxiv.org/abs/2603.25466
Authors: Kakei Yamamoto, Martin J. Wainwright
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Note:

Click to view abstract

Abstract:We study statistical estimation in a student–teacher setting, where predictions from a pre-trained teacher are used to guide a student model. A standard approach is to train the student to directly match the teacher’s outputs, which we refer to as student soft matching (SM). This approach directly propagates any systematic bias or mis-specification present in the teacher, thereby degrading the student’s predictions. We propose and analyze an alternative scheme, known as residual-as-teacher (RaT), in which the teacher is used to estimate residuals in the student’s predictions. Our analysis shows how the student can thereby emulate a proximal gradient scheme for solving an oracle optimization problem, and this provably reduces the effect of teacher bias. For general student–teacher pairs, we establish non-asymptotic excess risk bounds for any RaT fixed point, along with convergence guarantees for the student-teacher iterative scheme. For kernel-based student–teacher pairs, we prove a sharp separation: the RaT method achieves the minimax-optimal rate, while the SM method incurs constant prediction error for any sample size. Experiments on both synthetic data and ImageNette classification under covariate shift corroborate our theoretical findings.

[LG-43] The Symmetric Perceptron: a Teacher-Student Scenario

Link: https://arxiv.org/abs/2603.25440
Authors: Giovanni Catania, Aurélien Decelle, Suhanee Korpe
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*Note: 19 pages, 6 figures

Click to view abstract

Abstract:We introduce and solve a teacher-student formulation of the symmetric binary Perceptron, turning a traditionally storage-oriented model into a planted inference problem with a guaranteed solution at any sample density. We adapt the formulation of the symmetric Perceptron, which traditionally considers either the u-shaped potential or the rectangular one, by including labels in both regions. With this formulation, we analyze both the Bayes-optimal regime for noiseless examples and the effect of thermal noise under two different potential/classification rules. Using annealed and quenched free-entropy calculations in the high-dimensional limit, we map the phase diagram in the three control parameters, namely the sample density \alpha, the distance \kappa between the origin and one of the symmetric hyperplanes, and the temperature T, and identify a robust scenario where learning is organized by a second-order instability that creates teacher-correlated suboptimal states, followed by a first-order transition to full alignment. We show how this structure depends on the choice of potential and on the interplay between the metastability of the suboptimal solution and its melting towards the planted configuration, which is relevant for Monte Carlo-based optimization algorithms.

[LG-44] Enabling ab initio geometry optimization of strongly correlated systems with transferable deep quantum Monte Carlo

Link: https://arxiv.org/abs/2603.25381
Authors: P. Bernát Szabó, Zeno Schätzle, Frank Noé
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Note: 20 pages, 8 figures

Click to view abstract

Abstract:A faithful description of chemical processes requires exploring extended regions of the molecular potential energy surface (PES), which remains challenging for strongly correlated systems. Transferable deep-learning variational Monte Carlo (VMC) offers a promising route by efficiently solving the electronic Schrödinger equation jointly across molecular geometries at consistently high accuracy, yet its stochastic nature renders direct exploration of molecular configuration space nontrivial. Here, we present a framework for highly accurate ab initio exploration of PESs that combines transferable deep-learning VMC with a cost-effective estimation of energies, forces, and Hessians. By continuously sampling nuclear configurations during VMC optimization of electronic wave functions, we obtain transferable descriptions that achieve zero-shot chemical accuracy within chemically relevant distributions of molecular geometries. Throughout the subsequent characterization of molecular configuration space, the PES is evaluated only sparsely, with local approximations constructed by estimating VMC energies and forces at sampled geometries and aggregating the resulting noisy data using Gaussian process regression. Our method enables accurate and efficient exploration of complex PES landscapes, including structure relaxation, transition-state searches, and minimum-energy pathways, for both ground and excited states. This opens the door to studying bond breaking, formation, and large structural rearrangements in systems with pronounced multi-reference character.

[LG-45] A Distribution-to-Distribution Neural Probabilistic Forecasting Framework for Dynamical Systems

Link: https://arxiv.org/abs/2603.25370
Authors: Tianlin Yang, Hailiang Du, Louis Aslett
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Note: 11 pages, 5 figures

Click to view abstract

Abstract:Probabilistic forecasting provides a principled framework for uncertainty quantification in dynamical systems by representing predictions as probability distributions rather than deterministic trajectories. However, existing forecasting approaches, whether physics-based or neural-network-based, remain fundamentally trajectory-oriented: predictive distributions are usually accessed through ensembles or sampling, rather than evolved directly as dynamical objects. A distribution-to-distribution (D2D) neural probabilistic forecasting framework is developed to operate directly on predictive distributions. The framework introduces a distributional encoding and decoding structure around a replaceable neural forecasting module, using kernel mean embeddings to represent input distributions and mixture density networks to parameterise output predictive distributions. This design enables recursive propagation of predictive uncertainty within a unified end-to-end neural architecture, with model training and evaluation carried out directly in terms of probabilistic forecast skill. The framework is demonstrated on the Lorenz63 chaotic dynamical system. Results show that the D2D model captures nontrivial distributional evolution under nonlinear dynamics, produces skillful probabilistic forecasts without explicit ensemble simulation, and remains competitive with, and in some cases outperforms, a simplified perfect model benchmark. These findings point to a new paradigm for probabilistic forecasting, in which predictive distributions are learned and evolved directly rather than reconstructed indirectly through ensemble-based uncertainty propagation.
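The encoder side of the framework relies on kernel mean embeddings; their basic behavior can be checked with the (biased) squared MMD, the RKHS distance between two embedded samples (a standard construction shown here with an RBF kernel and hypothetical data, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between two sample sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y):
    """Biased squared MMD: distance between the kernel mean embeddings
    of the empirical distributions of x and y."""
    return rbf(x, x).mean() + rbf(y, y).mean() - 2.0 * rbf(x, y).mean()

# Embeddings distinguish distributions without any density estimation.
p = rng.normal(0.0, 1.0, size=(400, 1))
q_same = rng.normal(0.0, 1.0, size=(400, 1))
q_shift = rng.normal(2.0, 1.0, size=(400, 1))
m_same = mmd2(p, q_same)
m_shift = mmd2(p, q_shift)
print(m_same, m_shift)
```

Because distributions map to single points in the RKHS, a network can take such an embedding as input and emit the parameters of the next predictive distribution, which is the distribution-to-distribution step.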

[LG-46] Practical Efficient Global Optimization is No-regret

Link: https://arxiv.org/abs/2603.25311
Authors: Jingyi Wang, Haowei Wang, Nai-Yuan Chiang, Juliane Mueller, Tucker Hartland, Cosmin G. Petra
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Efficient global optimization (EGO) is one of the most widely used noise-free Bayesian optimization algorithms. It comprises the Gaussian process (GP) surrogate model and expected improvement (EI) acquisition function. In practice, when EGO is applied, a scalar matrix of a small positive value (also called a nugget or jitter) is usually added to the covariance matrix of the deterministic GP to improve numerical stability. We refer to this EGO with a positive nugget as the practical EGO. Despite its wide adoption and empirical success, to date, cumulative regret bounds for practical EGO have yet to be established. In this paper, we present for the first time the cumulative regret upper bound of practical EGO. In particular, we show that practical EGO has sublinear cumulative regret bounds and thus is a no-regret algorithm for commonly used kernels including the squared exponential (SE) and Matérn kernels (\nu > 1/2). Moreover, we analyze the effect of the nugget on the regret bound and discuss the theoretical implication on its choice. Numerical experiments are conducted to support and validate our findings.
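For reference, the EI acquisition function at the core of EGO has a closed form under the GP posterior; a minimal sketch for minimization (the nugget enters elsewhere, on the diagonal of the GP covariance matrix, and is not shown here):

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization under a GP posterior N(mu, sigma^2),
    where f_best is the incumbent (best observed) value."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    return (f_best - mu) * Phi + sigma * phi

# Both higher posterior uncertainty (exploration) and a lower predicted
# mean (exploitation) raise EI relative to a baseline candidate.
f_best = 1.0
ei_base = expected_improvement(0.8, 0.1, f_best)
ei_explore = expected_improvement(0.8, 0.5, f_best)
ei_exploit = expected_improvement(0.5, 0.1, f_best)
print(ei_base, ei_explore, ei_exploit)
```

EGO evaluates this acquisition over the search space and queries the maximizer; the paper's analysis concerns how the nugget added to the GP covariance affects the resulting regret.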

[LG-47] Fair regression under localized demographic parity constraints

Link: https://arxiv.org/abs/2603.25224
Authors: Arthur Charpentier (UQAM), Christophe Denis (SAMM), Romuald Elie (LAMA), Mohamed Hebiri (LAMA), François HU (UdeM)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Demographic parity (DP) is a widely used group fairness criterion requiring predictive distributions to be invariant across sensitive groups. While natural in classification, full distributional DP is often overly restrictive in regression and can lead to substantial accuracy loss. We propose a relaxation of DP tailored to regression, enforcing parity only at a finite set of quantile levels and/or score thresholds. Concretely, we introduce a novel (\ell, Z)-fair predictor, which imposes groupwise CDF constraints of the form F_{f|S=s}(z_m) = \ell_m for prescribed pairs (\ell_m, z_m). For this setting, we derive closed-form characterizations of the optimal fair discretized predictor via a Lagrangian dual formulation and quantify the discretization cost, showing that the risk gap to the continuous optimum vanishes as the grid is refined. We further develop a model-agnostic post-processing algorithm based on two samples (labeled for learning a base regressor and unlabeled for calibration), and establish finite-sample guarantees on constraint violation and excess penalized risk. In addition, we introduce two alternative frameworks where we match group and marginal CDF values at selected score thresholds. In both settings, we provide closed-form solutions for the optimal fair discretized predictor. Experiments on synthetic and real datasets illustrate an interpretable fairness-accuracy trade-off, enabling targeted corrections at decision-relevant quantiles or thresholds while preserving predictive performance.

[LG-48] Improving Infinitely Deep Bayesian Neural Networks with Nesterov's Accelerated Gradient Method

链接: https://arxiv.org/abs/2603.25024
作者: Chenxu Yu,Wenqi Fang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As a representative continuous-depth neural network approach, stochastic differential equation (SDE)-based Bayesian neural networks (BNNs) have attracted considerable attention due to their solid theoretical foundations and strong potential for real-world applications. However, their reliance on numerical SDE solvers inevitably incurs a large number of function evaluations (NFEs), resulting in high computational cost and occasional convergence instability. To address these challenges, we propose a Nesterov-accelerated gradient (NAG) enhanced SDE-BNN model. By integrating NAG into the SDE-BNN framework along with an NFE-dependent residual skip connection, our method accelerates convergence and substantially reduces NFEs during both training and testing. Extensive empirical results show that our model consistently outperforms conventional SDE-BNNs across various tasks, including image classification and sequence modeling, achieving lower NFEs and improved predictive accuracy.
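The key ingredient, Nesterov's look-ahead gradient, can be sketched in a few lines. This is a generic NAG loop on a toy quadratic; the paper's actual contribution integrates the update into an SDE-BNN with an NFE-dependent skip connection, which is not reproduced here.

```python
import numpy as np

def nag_minimize(grad, x0, lr=0.05, momentum=0.9, steps=200):
    # Nesterov accelerated gradient: evaluate the gradient at the
    # look-ahead point x + momentum * v, not at x itself.
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = momentum * v - lr * grad(x + momentum * v)
        x = x + v
    return x

# Toy quadratic f(x) = 0.5 * x^T A x with its minimum at the origin.
A = np.diag([1.0, 10.0])
x_star = nag_minimize(lambda x: A @ x, [5.0, 5.0])
```

The look-ahead evaluation is what distinguishes NAG from plain momentum and is the source of its accelerated convergence on ill-conditioned problems like this one.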

[LG-49] The Value of Information in Resource-Constrained Pricing NEURIPS2025

链接: https://arxiv.org/abs/2603.24974
作者: Ruicheng Ao,Jiashuo Jiang,David Simchi-Levi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extended version of the NeurIPS 2025 paper ( arXiv:2501.14155 ). This version adds phase transition, surrogate-assisted variance reduction under model misspecification, and numerical experiments

点击查看摘要

Abstract:Firms that price perishable resources – airline seats, hotel rooms, seasonal inventory – now routinely use demand predictions, but these predictions vary widely in quality. Under hard capacity constraints, acting on an inaccurate prediction can irreversibly deplete inventory needed for future periods. We study how prediction uncertainty propagates into dynamic pricing decisions with linear demand, stochastic noise, and finite capacity. A certified demand forecast with known error bound \epsilon^0 specifies where the system should operate: it shifts regret from O(\sqrt{T}) to O(\log T) when \epsilon^0 \lesssim T^{-1/4} , and we prove this threshold is tight. A misspecified surrogate model – biased but correlated with true demand – cannot set prices directly but reduces learning variance by a factor of (1-\rho^2) through control variates. The two mechanisms compose: the forecast determines the regret regime; the surrogate tightens estimation within it. All algorithms rest on a boundary attraction mechanism that stabilizes pricing near degenerate capacity boundaries without requiring non-degeneracy assumptions. Experiments confirm the phase transition threshold, the variance reduction from surrogates, and robustness across problem instances.
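The (1-\rho^2) variance reduction is the standard control-variates identity; a minimal numerical check on synthetic data (not the paper's pricing model) makes it concrete:

```python
import numpy as np

def control_variate_adjust(y, s):
    # Subtract the best linear predictor of y from surrogate s; the
    # residual variance shrinks by a factor of (1 - rho^2).
    beta = np.cov(y, s)[0, 1] / np.var(s)
    return y - beta * (s - s.mean())

rng = np.random.default_rng(1)
demand = rng.normal(10.0, 2.0, 20_000)
surrogate = 0.8 * demand + rng.normal(0.0, 1.0, 20_000)  # biased but correlated
adjusted = control_variate_adjust(demand, surrogate)
rho = np.corrcoef(demand, surrogate)[0, 1]
```

Note the surrogate's bias is irrelevant here: only its correlation with the target matters, which is why a misspecified model can still tighten estimation.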

[LG-50] Binary Expansion Group Intersection Network

链接: https://arxiv.org/abs/2603.24763
作者: Sicheng Zhou,Kai Zhang
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conditional independence is central to modern statistics, but beyond special parametric families it rarely admits an exact covariance characterization. We introduce the binary expansion group intersection network (BEGIN), a distribution-free graphical representation for multivariate binary data and bit-encoded multinomial variables. For arbitrary binary random vectors and bit representations of multinomial variables, we prove that conditional independence is equivalent to a sparse linear representation of conditional expectations, to a block factorization of the corresponding interaction covariance matrix, and to block diagonality of an associated generalized Schur complement. The resulting graph is indexed by the intersection of multiplicative groups of binary interactions, yielding an analogue of Gaussian graphical modeling beyond the Gaussian setting. This viewpoint treats data bits as atoms and local BEGIN molecules as building blocks for large Markov random fields. We also show how dyadic bit representations allow BEGIN to approximate conditional independence for general random vectors under mild regularity conditions. A key technical device is the Hadamard prism, a linear map that links interaction covariances to group structure.
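The "data bits as atoms" viewpoint starts from the dyadic (binary) expansion mentioned at the end of the abstract. A quick check of the generic probability fact it rests on – the expansion bits of a uniform variable behave as independent fair coins (this sketch does not reproduce the paper's Hadamard-prism construction):

```python
import numpy as np

def dyadic_bits(u, depth=3):
    # First `depth` bits of the binary (dyadic) expansion of u in [0, 1);
    # for a Uniform(0,1) variable these bits are i.i.d. fair coins.
    out = []
    for _ in range(depth):
        u = 2.0 * u
        b = (u >= 1.0).astype(int)
        out.append(b)
        u = u - b
    return np.stack(out, axis=1)

rng = np.random.default_rng(5)
B = dyadic_bits(rng.uniform(size=100_000))
```

Each column of `B` is one bit level; such bit vectors are the binary data on which BEGIN's interaction covariances are defined.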

[LG-51] Autotuning T-PaiNN: Enabling Data-Efficient GNN Interatomic Potential Development via Classical-to-Quantum Transfer Learning

链接: https://arxiv.org/abs/2603.24752
作者: Vivienne Pelletier,Vedant Bhat,Daniel J. Rivera,Steven A. Wilson,Christopher L. Muhich
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:Machine-learned interatomic potentials (MLIPs), particularly graph neural network (GNN)-based models, offer a promising route to achieving near-density functional theory (DFT) accuracy at significantly reduced computational cost. However, their practical deployment is often limited by the large volumes of expensive quantum mechanical training data required. In this work, we introduce a transfer learning framework, Transfer-PaiNN (T-PaiNN), that substantially improves the data efficiency of GNN-MLIPs by leveraging inexpensive classical force field data. The approach consists of pretraining a PaiNN MLIP architecture on large-scale datasets generated from classical molecular simulations, followed by fine-tuning (dubbed autotuning) using a comparatively small DFT dataset. We demonstrate the effectiveness of autotuning T-PaiNN on both gas-phase molecular systems (QM9 dataset) and condensed-phase liquid water. Across all cases, T-PaiNN significantly outperforms models trained solely on DFT data, achieving order-of-magnitude reductions in mean absolute error while accelerating training convergence. For example, using the QM9 data set, error reductions of up to 25 times are observed in low-data regimes, while liquid water simulations show improved predictions of energies, forces, and experimentally relevant properties such as density and diffusion. These gains arise from the model’s ability to learn general features of the potential energy surface from extensive classical sampling, which are subsequently refined to quantum accuracy. Overall, this work establishes transfer learning from classical force fields as a practical and computationally efficient strategy for developing high-accuracy, data-efficient GNN interatomic potentials, enabling broader application of MLIPs to complex chemical systems.
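The pretrain-then-autotune recipe can be caricatured with linear regression. This is purely illustrative: biased "classical" labels for pretraining, a handful of accurate "DFT" labels for fine-tuning, and no relation to the actual PaiNN architecture.

```python
import numpy as np

def fit_linear(X, y, w0=None, lr=0.01, steps=500):
    # Plain gradient-descent linear regression, optionally warm-started
    # at w0 (the warm start is the "transfer" step).
    w = np.zeros(X.shape[1]) if w0 is None else w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(3)
w_true = np.array([1.0, -2.0, 0.5])
X_cheap = rng.normal(size=(5000, 3))
y_cheap = X_cheap @ (w_true + 0.3) + rng.normal(0, 0.1, 5000)  # biased "classical" labels
X_acc = rng.normal(size=(30, 3))
y_acc = X_acc @ w_true + rng.normal(0, 0.01, 30)               # few accurate "DFT" labels

w_pre = fit_linear(X_cheap, y_cheap)                   # pretraining on cheap data
w_ft = fit_linear(X_acc, y_acc, w0=w_pre, steps=200)   # fine-tuning ("autotuning")
```

The pretrained weights carry the systematic bias of the cheap labels, but starting fine-tuning from them lets a small accurate dataset close most of the remaining gap – the same low-data regime in which the paper reports its largest gains.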

[LG-52] Amortized Inference for Correlated Discrete Choice Models via Equivariant Neural Networks

链接: https://arxiv.org/abs/2603.24705
作者: Easton Huch,Michael Keane
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:Discrete choice models are fundamental tools in management science, economics, and marketing for understanding and predicting decision-making. Logit-based models are dominant in applied work, largely due to their convenient closed-form expressions for choice probabilities. However, these models entail restrictive assumptions on the stochastic utility component, constraining our ability to capture realistic and theoretically grounded choice behavior - most notably, substitution patterns. In this work, we propose an amortized inference approach using a neural network emulator to approximate choice probabilities for general error distributions, including those with correlated errors. Our proposal includes a specialized neural network architecture and accompanying training procedures designed to respect the invariance properties of discrete choice models. We provide group-theoretic foundations for the architecture, including a proof of universal approximation given a minimal set of invariant features. Once trained, the emulator enables rapid likelihood evaluation and gradient computation. We use Sobolev training, augmenting the likelihood loss with a gradient-matching penalty so that the emulator learns both choice probabilities and their derivatives. We show that emulator-based maximum likelihood estimators are consistent and asymptotically normal under mild approximation conditions, and we provide sandwich standard errors that remain valid even with imperfect likelihood approximation. Simulations show significant gains over the GHK simulator in accuracy and speed.
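Sobolev training, the value-plus-derivative matching described above, has a simple closed-form instance worth seeing. This is a hypothetical polynomial fit, not the paper's neural emulator: stack the value design matrix with the scaled derivative design matrix and solve one least-squares problem.

```python
import numpy as np

def sobolev_polyfit(x, y, dy, degree=3, lam=1.0):
    # Fit a polynomial by least squares on both values y and derivatives dy
    # (Sobolev training): stack value and derivative equations jointly.
    V = np.vander(x, degree + 1, increasing=True)   # rows: [1, x, x^2, ...]
    D = np.zeros_like(V)
    for k in range(1, degree + 1):
        D[:, k] = k * x ** (k - 1)                  # d/dx of x^k
    A = np.vstack([V, np.sqrt(lam) * D])
    b = np.concatenate([y, np.sqrt(lam) * dy])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

x = np.linspace(-1, 1, 20)
coef = sobolev_polyfit(x, x**3, 3 * x**2)           # target f(x) = x^3, f'(x) = 3x^2
```

The penalty weight `lam` plays the role of the gradient-matching coefficient in the paper's loss; derivative supervision pins down the fit with far fewer value samples than value supervision alone.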

[LG-53] Conformal Selective Prediction with General Risk Control

链接: https://arxiv.org/abs/2603.24704
作者: Tian Bai,Ying Jin
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In deploying artificial intelligence (AI) models, selective prediction offers the option to abstain from making a prediction when uncertain about model quality. To fulfill its promise, it is crucial to enforce strict and precise error control over cases where the model is trusted. We propose Selective Conformal Risk control with E-values (SCoRE), a new framework for deriving such decisions for any trained model and any user-defined, bounded and continuously-valued risk. SCoRE offers two types of guarantees on the risk among "positive" cases in which the system opts to trust the model. Built upon conformal inference and hypothesis testing ideas, SCoRE first constructs a class of (generalized) e-values, which are non-negative random variables whose product with the unknown risk has expectation no greater than one. Such a property is ensured by data exchangeability without requiring any modeling assumptions. Passing these e-values on to hypothesis testing procedures, we obtain binary trust decisions with finite-sample error control. SCoRE avoids the need of uniform concentration, and can be readily extended to settings with distribution shifts. We evaluate the proposed methods with simulations and demonstrate their efficacy through applications to error management in drug discovery, health risk prediction, and large language models.
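The e-value machinery underlying SCoRE can be illustrated with the textbook likelihood-ratio e-value. This shows only the generic e-value property; SCoRE's generalized e-values for risk control are more involved and are not reproduced here.

```python
import numpy as np

def lr_evalue(x, mu=1.0):
    # Likelihood-ratio e-value for H0: N(0,1) vs H1: N(mu,1).
    # Under H0 its expectation is exactly 1, so by Markov's inequality
    # P(e >= 1/alpha) <= alpha, giving finite-sample type-I error control.
    return np.exp(mu * x - 0.5 * mu**2)

rng = np.random.default_rng(4)
e = lr_evalue(rng.normal(0.0, 1.0, 200_000))
```

Rejecting whenever the e-value exceeds 1/alpha needs no asymptotics or uniform concentration, which is the same finite-sample flavor of guarantee the paper leverages.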

[LG-54] Spectral methods: crucial for machine learning natural for quantum computers?

链接: https://arxiv.org/abs/2603.24654
作者: Vasilis Belis,Joseph Bowles,Rishabh Gupta,Evan Peters,Maria Schuld
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 8 figures

点击查看摘要

Abstract:This article presents an argument for why quantum computers could unlock new methods for machine learning. We argue that spectral methods, in particular those that learn, regularise, or otherwise manipulate the Fourier spectrum of a machine learning model, are often natural for quantum computers. For example, if a generative machine learning model is represented by a quantum state, the Quantum Fourier Transform allows us to manipulate the Fourier spectrum of the state using the entire toolbox of quantum routines, an operation that is usually prohibitive for classical models. At the same time, spectral methods are surprisingly fundamental to machine learning: A spectral bias has recently been hypothesised to be the core principle behind the success of deep learning; support vector machines have been known for decades to regularise in Fourier space, and convolutional neural nets build filters in the Fourier space of images. Could, then, quantum computing open fundamentally different, much more direct and resource-efficient ways to design the spectral properties of a model? We discuss this potential in detail here, hoping to stimulate a direction in quantum machine learning research that puts the question of "why quantum?" first.
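The classical side of the argument – manipulating a function's Fourier spectrum – is essentially one line with an FFT. This sketch is illustrative only; the article's point is that quantum states allow such spectral manipulations natively via the QFT.

```python
import numpy as np

def spectral_lowpass(signal, keep):
    # Regularize a sampled function by truncating its Fourier spectrum:
    # keep only the `keep` lowest frequencies, zero out the rest.
    F = np.fft.rfft(signal)
    F[keep:] = 0
    return np.fft.irfft(F, n=len(signal))

x = np.linspace(0, 2 * np.pi, 256, endpoint=False)
noisy = np.sin(x) + 0.2 * np.sin(40 * x)       # low-frequency signal + high-frequency term
smooth = spectral_lowpass(noisy, keep=5)       # removes the sin(40x) component
```

The truncation here is a crude spectral regularizer; the article asks whether quantum routines could perform such spectrum-shaping operations directly on a model's state, where classically they require an explicit transform.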

[LG-55] A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing Data

链接: https://arxiv.org/abs/2603.24626
作者: Yuichiro Iwashita,Ahtisham Fazeel Abbasi,Muhammad Nabeel Asim,Andreas Dengel
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Single-cell RNA sequencing (scRNA-seq) is inherently affected by sparsity caused by dropout events, in which expressed genes are recorded as zeros due to technical limitations. These artifacts distort gene expression distributions and can compromise downstream analyses. Numerous imputation methods have been proposed to address this, and these methods encompass a wide range of approaches from traditional statistical models to recently developed deep learning (DL)-based methods. However, their comparative performance remains unclear, as existing benchmarking studies typically evaluate only a limited subset of methods, datasets, and downstream analytical tasks. Here, we present a comprehensive benchmark of 15 scRNA-seq imputation methods spanning 7 methodological categories, including traditional and modern DL-based methods. These methods are evaluated across 30 datasets sourced from 10 experimental protocols and assessed in terms of 6 downstream analytical tasks. Our results show that traditional imputation methods, such as model-based, smoothing-based, and low-rank matrix-based methods, generally outperform DL-based methods, such as diffusion-based, GAN-based, GNN-based, and autoencoder-based methods. In addition, strong performance in numerical gene expression recovery does not necessarily translate into improved biological interpretability in downstream analyses. Furthermore, the performance of imputation methods varies substantially across datasets, protocols, and downstream analytical tasks, and no single method consistently outperforms others across all evaluation scenarios. Together, our results provide practical guidance for selecting imputation methods tailored to specific analytical objectives and highlight the importance of task-specific evaluation when assessing imputation performance in scRNA-seq data analysis.
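As a flavour of the "smoothing-based" category the benchmark covers, here is a hypothetical nearest-neighbour imputation sketch on synthetic data; all names, parameters, and the data-generating setup are invented for illustration and correspond to no specific benchmarked method.

```python
import numpy as np

def smooth_impute(X, k=5):
    # Smoothing-style imputation: replace each zero (assumed dropout) by the
    # average of that gene across the cell's k nearest neighbouring cells.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    out = X.astype(float).copy()
    for i in range(len(X)):
        nn = np.argsort(D[i])[1 : k + 1]        # skip the cell itself
        dropped = X[i] == 0
        out[i, dropped] = X[nn][:, dropped].mean(axis=0)
    return out

rng = np.random.default_rng(6)
truth = np.vstack([np.tile([5.0, 1.0, 5.0, 1.0], (50, 1)),
                   np.tile([1.0, 5.0, 1.0, 5.0], (50, 1))]) + rng.normal(0, 0.1, (100, 4))
observed = np.where(rng.uniform(size=truth.shape) < 0.2, 0.0, truth)  # simulated dropouts
imputed = smooth_impute(observed)
```

Numerical recovery like this is exactly what the benchmark warns is insufficient on its own: lower reconstruction error need not translate into better downstream biological interpretability.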

[LG-56] Response-Aware Risk-Constrained Control Barrier Function With Application to Vehicles

链接: https://arxiv.org/abs/2603.24598
作者: Qijun Liao,Jue Yang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 22 pages, 20 figures

点击查看摘要

Abstract:This paper proposes a unified control framework based on Response-Aware Risk-Constrained Control Barrier Function for dynamic safety boundary control of vehicles. Addressing the problem of physical model parameter mismatch, the framework constructs an uncertainty propagation model that fuses nominal dynamics priors with direct vehicle body responses. Utilizing simplified single-track dynamics to provide a baseline direction for control gradients and covering model deviations through statistical analysis of body response signals, the framework eliminates the dependence on accurate online estimation of road surface adhesion coefficients. By introducing Conditional Value at Risk (CVaR) theory, the framework reformulates traditional deterministic safety constraints into probabilistic constraints on the tail risk of barrier function derivatives. Combined with a Bayesian online learning mechanism based on inverse Wishart priors, it identifies environmental noise covariance in real-time, adaptively tuning safety margins to reduce performance loss under prior parameter mismatch. Finally, based on Control Lyapunov Function (CLF), a unified Second-Order Cone Programming (SOCP) controller is constructed. Theoretical analysis establishes convergence of Sequential Convex Programming to local Karush-Kuhn-Tucker points and provides per-step probabilistic safety bounds. High-fidelity dynamics simulations demonstrate that under extreme conditions, the method not only eliminates the output divergence phenomenon of traditional methods but also achieves Pareto improvement in both safety and tracking performance. For the chosen risk level, the per-step safety violation probability is theoretically bounded by approximately 2%, validated through high-fidelity simulations showing zero boundary violations across all tested scenarios.
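The CVaR quantity at the heart of the probabilistic constraint is easy to state numerically. This is a generic empirical CVaR estimator, not the paper's SOCP controller:

```python
import numpy as np

def cvar(losses, alpha=0.95):
    # Conditional Value at Risk: the average of the worst (1 - alpha)
    # fraction of losses (larger loss = worse).
    var = np.quantile(losses, alpha)        # Value at Risk (the quantile)
    return losses[losses >= var].mean()

rng = np.random.default_rng(2)
losses = rng.normal(0.0, 1.0, 100_000)
c = cvar(losses, 0.95)
```

For a standard normal, CVaR at level 0.95 is φ(1.645)/0.05 ≈ 2.06. Constraining the CVaR of the barrier-function derivative, as the paper does, bounds the tail of the risk rather than just its mean, which is what yields the per-step probabilistic safety guarantee.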

附件下载

点击下载今日全部论文列表