This post contains the latest paper listing fetched from Arxiv.org on 2026-03-30, updated automatically and grouped by area: NLP, CV, ML, AI, IR, MA, and others.
Note: paper data is fetched from Arxiv.org daily and refreshed automatically around 12:30 each day.
Tip: if a given day is not updated in time, either Arxiv published no new papers that day or the script failed; fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-03-30)
487 papers updated today, including:
- Natural Language Processing: 55 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 117 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 138 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 123 papers (Machine Learning, cs.LG)
- Multiagent Systems: 6 papers (Multiagent Systems, cs.MA)
- Information Retrieval: 6 papers (Information Retrieval, cs.IR)
- Human-Computer Interaction: 22 papers (Human-Computer Interaction, cs.HC)
Multiagent Systems
[MA-0] Deception and Communication in Autonomous Multi-Agent Systems: An Experimental Study with Among Us (AAMAS 2026)
Quick Read: This paper examines how strategic deception by large language models (LLMs) deployed as autonomous agents affects coordination, reliability, and safety in multi-goal, multi-agent systems. Its key contribution is a large-scale empirical analysis of LLM agents' communication in the cooperative-competitive social deduction game Among Us, conditioned on role (impostor vs. crewmate). Drawing on speech act theory and interpersonal deception theory, the analysis shows that current agents favor low-risk equivocation over outright lies, and that such deception increases under social pressure yet rarely improves win rates, revealing a fundamental tension between truthfulness and utility in autonomous linguistic communication.
Link: https://arxiv.org/abs/2603.26635
Authors: Maria Milkowski, Tim Weninger
Affiliations: University of Notre Dame
Subjects: Multiagent Systems (cs.MA)
Comments: 8 pages + references, 9 figures. Accepted at AAMAS 2026
Abstract: As large language models are deployed as autonomous agents, their capacity for strategic deception raises core questions for coordination, reliability, and safety in multi-goal, multi-agent systems. We study deception and communication in LLM agents through the social deduction game Among Us, a cooperative-competitive environment. Across 1,100 games, autonomous agents produced over one million tokens of meeting dialogue. Using speech act theory and interpersonal deception theory, we find that all agents rely mainly on directive language, while impostor agents shift slightly toward representative acts such as explanations and denials. Deception appears primarily as equivocation rather than outright lies, increasing under social pressure but rarely improving win rates. Our contributions are a large-scale analysis of role-conditioned deceptive behavior in LLM agents and empirical evidence that current agents favor low-risk ambiguity that is linguistically subtle yet strategically limited, revealing a fundamental tension between truthfulness and utility in autonomous communication.
[MA-1] The Multi-AMR Buffer Storage, Retrieval and Reshuffling Problem: Exact and Heuristic Approaches
Quick Read: This paper tackles the Buffer Storage, Retrieval, and Reshuffling Problem (BSRRP) for fleets of autonomous mobile robots (multi-AMR) in high-density production environments, particularly space-constrained brownfield facilities where labor shortages and rising costs make manual buffer operation untenable. The core challenge is handling concurrent storage, retrieval, and dynamic reshuffling of unit loads in a shared floor area under time-window constraints. The key to the solution is a hierarchical heuristic: an A* search plans the task-level sequence of unit-load placements, and a Constraint Programming (CP) model handles multi-robot scheduling and coordination. This decomposition sharply reduces computational complexity, cutting solution times while preserving solution quality, making the approach viable as real-time control logic at industrial scale.
Link: https://arxiv.org/abs/2603.26542
Authors: Max Disselnmeyer, Thomas Bömer, Laura Dörr, Bastian Amberg, Anne Meyer
Affiliations: Karlsruhe Institute of Technology
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: 52 pages, 15 figures and tables
Abstract:Buffer zones are essential in production systems to decouple sequential processes. In dense floor storage environments, such as space-constrained brownfield facilities, manual operation is increasingly challenged by severe labor shortages and rising operational costs. Automating these zones requires solving the Buffer Storage, Retrieval, and Reshuffling Problem (BSRRP). While previous work has addressed scenarios where the focus is limited to reshuffling and retrieving a fixed set of items, real-world manufacturing necessitates an adaptive approach that also incorporates arriving unit loads. This paper introduces the Multi-AMR BSRRP, coordinating a robot fleet to manage concurrent reshuffling, alongside time-windowed storage and retrieval tasks, within a shared floor area. We formulate a Binary Integer Programming (IP) model to obtain exact solutions for benchmarking purposes. As the problem is NP-hard, rendering exact methods computationally intractable for industrial scales, we propose a hierarchical heuristic. This approach decomposes the problem into an A* search for task-level sequence planning of unit load placements, and a Constraint Programming (CP) approach for multi-robot coordination and scheduling. Experiments demonstrate orders-of-magnitude computation time reductions compared to the exact formulation. These results confirm the heuristic’s viability as responsive control logic for high-density production environments.
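The task-level planning half of the hierarchy can be illustrated with a toy A* search over placement orders. Everything concrete below is an assumption for illustration: the three tasks, the sequence-dependent step costs, and the single precedence constraint (a blocked retrieval that must wait for a reshuffle) stand in for the paper's unit-load model; only the idea of A* over task sequences comes from the abstract.

```python
import heapq

def astar_sequence(tasks, cost, prereq):
    """A* over task orderings: a state is (set of done tasks, last task).
    cost[t][prev] is the cost of doing t right after prev (None = first).
    Admissible heuristic: each remaining task's cheapest possible step."""
    all_tasks = frozenset(tasks)

    def h(done):
        return sum(min(cost[t].values()) for t in all_tasks - done)

    heap = [(h(frozenset()), 0, (), frozenset())]
    best = {}
    while heap:
        f, g, seq, done = heapq.heappop(heap)
        if done == all_tasks:
            return list(seq), g
        last = seq[-1] if seq else None
        if best.get((done, last), float("inf")) <= g:
            continue  # already reached this state more cheaply
        best[(done, last)] = g
        for t in all_tasks - done:
            if prereq.get(t, set()) <= done:  # prerequisites satisfied
                step = cost[t].get(last, min(cost[t].values()))
                nd = done | {t}
                heapq.heappush(heap, (g + step + h(nd), g + step, seq + (t,), nd))
    return None, float("inf")

cost = {
    "A": {None: 2, "B": 1, "C": 1},  # cost of doing A first / after B / after C
    "B": {None: 5, "A": 3, "C": 2},
    "C": {None: 1, "A": 2, "B": 4},
}
# B (a blocked retrieval) must wait for C (the reshuffle that frees it).
seq, total = astar_sequence(["A", "B", "C"], cost, prereq={"B": {"C"}})
```

In the full method this sequence would then be handed to the CP scheduler, which assigns robots and timings; here only the sequencing stage is sketched.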
[MA-2] SwarmCoDe: A Scalable Co-Design Framework for Heterogeneous Robot Swarms via Dynamic Speciation
Quick Read: This paper addresses the computational intractability of co-designing large-scale heterogeneous robot swarms, where design spaces grow exponentially and are non-intuitive under traditional frameworks. The key innovations of SwarmCoDe, a novel Collaborative Co-Evolutionary Algorithm (CCEA), are: dynamic speciation that automatically scales swarm heterogeneity to match task complexity; evolved genetic tags and a selectivity gene that let symbiotically beneficial partners identify each other without predefined species boundaries; and an evolved dominance gene that dictates relative swarm composition, decoupling physical swarm size from the evolutionary population size. The method jointly optimizes task planning and hardware morphology under fabrication budgets and evolves specialized swarms of up to 200 agents, a scale beyond existing techniques.
Link: https://arxiv.org/abs/2603.26240
Authors: Andrew Wilhelm, Josie Hughes
Affiliations: École Polytechnique Fédérale de Lausanne
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: 8 pages, 9 figures
Abstract:Robot swarms offer inherent robustness and the capacity to execute complex, collaborative tasks surpassing the capabilities of single-agent systems. Co-designing these systems is critical, as marginal improvements in individual performance or unit cost compound significantly at scale. However, under traditional frameworks, this scale renders co-design intractable due to exponentially large, non-intuitive design spaces. To address this, we propose SwarmCoDe, a novel Collaborative Co-Evolutionary Algorithm (CCEA) that utilizes dynamic speciation to automatically scale swarm heterogeneity to match task complexity. Inspired by biological signaling mechanisms for inter-species cooperation, the algorithm uses evolved genetic tags and a selectivity gene to facilitate the emergent identification of symbiotically beneficial partners without predefined species boundaries. Additionally, an evolved dominance gene dictates the relative swarm composition, decoupling the physical swarm size from the evolutionary population. We apply SwarmCoDe to simultaneously optimize task planning and hardware morphology under fabrication budgets, successfully evolving specialized swarms of up to 200 agents – four times the size of the evolutionary population. This framework provides a scalable, computationally viable pathway for the holistic co-design of large-scale, heterogeneous robot swarms.
[MA-3] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
Quick Read: This paper targets the lack of realistic clinical-context simulation in current medical AI evaluation: traditional benchmarks rely on standardized exam questions and cannot fully capture clinical reasoning and decision-making in complex physician-patient interactions. The key to the proposed Doctorina MedBench framework is a multi-step clinical dialogue simulation in which an AI system (or a physician) must take a medical history, analyze attached investigations, form differential diagnoses, and give personalized treatment recommendations, mirroring a real consultation. Performance is quantified with the D.O.T.S. metric (Diagnosis, Observations/Investigations, Treatment, Step count), covering both clinical correctness and dialogue efficiency, and a multi-level testing architecture adds performance monitoring and safety trap cases, yielding an assessment closer to real clinical practice.
Link: https://arxiv.org/abs/2603.25821
Authors: Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk
Affiliations: A.I. Doctor Medical Assist LTD
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract: We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.
[MA-4] Decentralized Value Systems Agreements (AAMAS 2026)
Quick Read: This paper addresses the difficulty of reaching consensus in value-based decision-making when individual value systems differ, as they do in pluralistic societies where people interpret the same values differently and rank their importance differently. The key to the solution is a novel value-system aggregation method: agents state their value systems and how far they are willing to concede, and a decentralized optimization procedure then produces a set of distinct value agreements rather than the single consensus sought by traditional approaches. This better accommodates the heterogeneity of real societies, and in two real-world case studies the method substantially improves individual utilities over existing techniques.
Link: https://arxiv.org/abs/2603.25811
Authors: Arturo Hernandez-Sanchez, Natalia Criado, Stella Heras, Miguel Rebollo, Jose Such
Affiliations: Universitat Politècnica de València; Consejo Superior de Investigaciones Científicas
Subjects: Multiagent Systems (cs.MA)
Comments: Accepted at AAMAS 2026 (Submission 1181)
Abstract:One of the biggest challenges of value-based decision-making is dealing with the subjective nature of values. The relative importance of a value for a particular decision varies between individuals, and people may also have different interpretations of what aligning with a value means in a given situation. While members of a society are likely to share a set of principles or values, their value systems–that is, how they interpret these values and the relative importance they give to them–have been found to differ significantly. This work proposes a novel method for aggregating value systems, generating distinct value agreements that accommodate the inherent differences within these systems. Unlike existing work, which focuses on finding a single value agreement, the proposed approach may be more suitable for a realistic and heterogeneous society. In our solution, the agents indicate their value systems and the extent to which they are willing to concede. Then, a set of agreements is found, taking a decentralized optimization approach. Our work has been applied to identify value agreements in two real-world scenarios using data from a Participatory Value Evaluation process and a European Value Survey. These case studies illustrate the different aggregations that can be obtained with our method and compare them with those obtained using existing value system aggregation techniques. In both cases, the results showed a substantial improvement in individual utilities compared to existing alternatives.
[MA-5] UCAgent: An End-to-End Agent for Block-Level Functional Verification
Quick Read: This paper targets the inefficiency of functional verification in modern IC development, where traditional approaches such as constrained-random and formal verification struggle with growing design complexity, and where large language models (LLMs) generate Verilog/SystemVerilog verification code inaccurately, are fragile on multi-step verification workflows, and fail to maintain verification consistency. The key to UCAgent, an end-to-end automated verification agent, lies in three core mechanisms: a pure-Python verification environment (built on Picker and Toffee) that avoids relying on LLM-generated SystemVerilog; a configurable, fine-grained 31-stage workflow in which every stage is validated by an automated checker, stabilizing the process; and a Verification Consistency Labeling Mechanism (VCLM) that assigns hierarchical labels to improve the reliability and traceability of generated artifacts. Experiments show UCAgent reaches up to 98.5% code coverage and 100% functional coverage on modules such as a UART and an FPU, and uncovers previously unidentified defects in realistic designs, demonstrating practical potential.
Link: https://arxiv.org/abs/2603.25768
Authors: Junyue Wang, Zhicheng Yao, Yan Pi, Xiaolong Li, Fangyuan Song, Jinru Wang, Yunlong Xie, Sa Wang, Yungang Bao
Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Beijing Institute of Open Source Chip
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Multiagent Systems (cs.MA)
Comments:
Abstract:Functional verification remains a critical bottleneck in modern IC development cycles, accounting for approximately 70% of total development time in many projects. However, traditional methods, including constrained-random and formal verification, struggle to keep pace with the growing complexity of modern semiconductor designs. While recent advances in Large Language Models (LLMs) have shown promise in code generation and task automation, significant challenges hinder the realization of end-to-end functional verification automation. These challenges include (i) limited accuracy in generating Verilog/SystemVerilog verification code, (ii) the fragility of LLMs when executing complex, multi-step verification workflows, and (iii) the difficulty of maintaining verification consistency across specifications, coverage models, and test cases throughout the workflow. To address these challenges, we propose UCAgent, an end-to-end agent that automates hardware block-level functional verification based on three core mechanisms. First, we establish a pure Python verification environment using Picker and Toffee to avoid relying on LLM-generated SystemVerilog verification code. Second, we introduce a configurable 31-stage fine-grained verification workflow to guide the LLM, where each stage is verified by an automated checker. Furthermore, we propose a Verification Consistency Labeling Mechanism (VCLM) that assigns hierarchical labels to LLM-generated artifacts, improving the reliability and traceability of verification. Experimental results show that UCAgent can complete end-to-end automated verification on multiple modules, including the UART, FPU, and integer divider modules, achieving up to 98.5% code coverage and up to 100% functional coverage. UCAgent also discovers previously unidentified design defects in realistic designs, demonstrating its practical potential. 
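The checker-gated workflow can be sketched as a staged pipeline where each stage's output must pass an automated check before the next stage runs. The control loop is the idea from the abstract; the stage names, the artifact-as-dict representation, and the retry policy are invented placeholders (the real system has 31 stages driving an LLM).

```python
def run_pipeline(stages, artifact):
    """Run a staged workflow: each stage transforms the artifact and must
    pass its checker before the next stage starts; failing checks trigger
    retries, and an exhausted stage aborts the run."""
    log = []
    for name, step, check, max_retries in stages:
        for attempt in range(max_retries + 1):
            candidate = step(artifact)
            if check(candidate):           # automated checker gates progress
                artifact = candidate
                log.append((name, "pass", attempt))
                break
        else:
            log.append((name, "fail", max_retries))
            return artifact, log           # abort: checker never satisfied
    return artifact, log

# Two illustrative stages (placeholders for, e.g., spec parsing and
# test generation in a real verification flow).
stages = [
    ("parse_spec", lambda a: a | {"spec": True}, lambda a: a.get("spec"), 1),
    ("gen_tests",  lambda a: a | {"tests": 3},   lambda a: a.get("tests", 0) > 0, 2),
]
artifact, log = run_pipeline(stages, {})
```

Gating every stage on a checker is what keeps an otherwise fragile multi-step LLM workflow on track, since errors are caught at the stage where they occur rather than at the end.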
Natural Language Processing
[NLP-0] Learning to Commit: Generating Organic Pull Requests via Online Repository Memory
Quick Read: This paper addresses why pull requests produced by LLM-based coding agents are so often rejected by maintainers of real projects: the generated code lacks "organicity," i.e. it ignores project-specific conventions, duplicates functionality already provided by internal APIs, and violates implicit architectural constraints accumulated over years of development. The key to the proposed Learning to Commit framework is an Online Repository Memory built by supervised contrastive reflection on commit history: the agent blindly attempts to resolve earlier issues, compares its predictions against the oracle diffs, and distils the gaps into reusable skills covering coding style, internal API usage, and architectural invariants. When a new task arrives, generation is conditioned on these accumulated skills, grounding the output in the project's own evolution rather than generic pretraining priors.
Link: https://arxiv.org/abs/2603.26664
Authors: Mo Li, L.H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
Affiliations: Tsinghua University
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Preprint. Work in progress
Abstract:Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject. The root cause is not functional incorrectness but a lack of organicity: generated code ignores project-specific conventions, duplicates functionality already provided by internal APIs, and violates implicit architectural constraints accumulated over years of development. Simply exposing an agent to the latest repository snapshot is not enough: the snapshot reveals the final state of the codebase, but not the repository-specific change patterns by which that state was reached. We introduce Learning to Commit, a framework that closes this gap through Online Repository Memory. Given a repository with a strict chronological split, the agent performs supervised contrastive reflection on earlier commits: it blindly attempts to resolve each historical issue, compares its prediction against the oracle diff, and distils the gap into a continuously growing set of skills-reusable patterns capturing coding style, internal API usage, and architectural invariants. When a new PR description arrives, the agent conditions its generation on these accumulated skills, producing changes grounded in the project’s own evolution rather than generic pretraining priors. Evaluation is conducted on genuinely future, merged pull requests that could not have been seen during the skill-building phase, and spans multiple dimensions including functional correctness, code-style consistency, internal API reuse rate, and modified-region plausibility. Experiments on an expert-maintained repository with rich commit history show that Online Repository Memory effectively improves organicity scores on held-out future tasks.
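The reflect-then-condition loop can be sketched as follows. This is a deliberately crude stand-in: real skills would be richer abstractions (style rules, internal API patterns) distilled by an LLM, whereas here a "skill" is just a note about any oracle change the agent's blind attempt missed, and the diffs are toy line lists.

```python
def reflect(predicted_diff, oracle_diff, memory):
    """Contrastive reflection: compare the agent's blind attempt against
    the oracle diff and distil each missed change into a reusable skill
    note appended to the growing repository memory."""
    missed = [line for line in oracle_diff if line not in set(predicted_diff)]
    for line in missed:
        memory.setdefault("skills", []).append(f"prefer: {line}")
    return memory

def condition_prompt(task, memory):
    """Condition a new task on accumulated skills by prepending them."""
    return "\n".join(memory.get("skills", []) + [task])

# The agent guessed a generic library call; the project uses its own client.
memory = {}
reflect(predicted_diff=["+resp = requests.get(url)"],
        oracle_diff=["+resp = http_client.fetch(url)"],
        memory=memory)
prompt = condition_prompt("Add retry logic to the downloader", memory)
```

The point of the sketch is the data flow: reflection only ever consumes past commits, and the skill list grows monotonically, so conditioning on it respects the strict chronological split the paper evaluates under.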
[NLP-1] Weight Tying Biases Token Embeddings Towards the Output Space
Quick Read: This paper addresses the poorly understood effect of weight tying on the learned embedding space in language models, in particular how sharing parameters changes the relationship between input embeddings and the output unembedding matrix. The paper shows that tied embedding matrices align more closely with the output unembedding matrices of comparable untied models than with their input embeddings, meaning weight tying shapes the shared matrix primarily for output prediction at the expense of input representation. The key evidence is causal and mechanistic: the bias arises because output gradients dominate early in training, and scaling input gradients during training reduces it. This helps explain why weight tying can harm performance at scale, and it matters especially for training smaller models, where the embedding matrix accounts for a larger share of total parameters.
Link: https://arxiv.org/abs/2603.26663
Authors: Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta
Affiliations: EleutherAI; University of California, Berkeley
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.
[NLP-2] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Quick Read: This paper addresses the inadequate evaluation of complex, long-horizon perceptual reasoning in video understanding, particularly perception-centric tasks that depend on visual evidence from multiple points in time combined under compositional logical constraints. The key to the solution is PerceptionComp, a manually annotated benchmark in which models must exercise semantic recognition, visual correspondence, and temporal and spatial reasoning across perceptual subtasks (objects, attributes, relations, locations, actions, and events), and must combine visual information from separate time spans under conjunctive and sequential logic to answer. By design, no single moment suffices to answer a question, which makes the benchmark effective at distinguishing whether a model has genuinely mastered long-horizon perceptual reasoning.
Link: https://arxiv.org/abs/2603.26653
Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna
Affiliations: Tsinghua University; University of Washington; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL
Abstract:We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
[NLP-3] EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching
Quick Read: This paper investigates how utterance sentiment shapes code-switching behavior in multilingual settings, specifically the mechanics of language choice in mixed English-Tamil text. The key to the approach is token-level language identification with a fine-tuned XLM-RoBERTa model over 35,650 romanized YouTube comments, yielding per-utterance measurements of English proportion and switch frequency. Linear regression controlling for utterance length then shows that positive utterances contain a significantly higher English proportion (34.3% vs. 24.8% for negative utterances) and that mixed-sentiment utterances switch languages most often, supporting the hypothesis that emotional content drives the choice between embedded and matrix languages through sociolinguistic associations of prestige and identity.
Link: https://arxiv.org/abs/2603.26587
Authors: Paul Bontempo
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures
Abstract:This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.
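The core measurement, per-utterance English proportion from token-level language tags, can be sketched as below. The corpus here is a four-comment toy (tags "en"/"ta" would come from the fine-tuned XLM-RoBERTa tagger), and a simple group-mean comparison stands in for the paper's length-controlled linear regression.

```python
from statistics import mean

def english_proportion(tags):
    """Fraction of tokens tagged 'en' in one utterance; tags are
    token-level language IDs (here hand-written, not model output)."""
    return sum(t == "en" for t in tags) / len(tags)

# Toy corpus: (sentiment, token-level language tags) per comment.
corpus = [
    ("positive", ["en", "en", "ta", "en"]),
    ("positive", ["en", "ta", "en", "en", "en"]),
    ("negative", ["ta", "ta", "en", "ta"]),
    ("negative", ["ta", "en", "ta", "ta", "ta"]),
]

by_sentiment = {}
for sentiment, tags in corpus:
    by_sentiment.setdefault(sentiment, []).append(english_proportion(tags))

# Gap in mean English proportion between sentiment groups (the paper
# additionally controls for utterance length via regression).
gap = mean(by_sentiment["positive"]) - mean(by_sentiment["negative"])
```

Switch frequency, the paper's second measurement, would be computed similarly by counting adjacent-token tag changes per utterance.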
[NLP-4] MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference
Quick Read: This paper tackles the high inference cost of serving large language models (LLMs), especially under workloads with many repeated or near-duplicate queries. The core of the solution, MemBoost, pairs a lightweight model with a memory mechanism so that previously generated answers can be reused and relevant supporting information retrieved cheaply, while difficult or uncertain queries are escalated to a stronger model via cost-aware routing, balancing cost and answer quality. The framework is designed for interactive settings, supporting answer reuse, continual memory growth, and intelligent model dispatch.
Link: https://arxiv.org/abs/2603.26557
Authors: Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng
Affiliations: Aalto University; Tulane University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
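The reuse-then-route serving loop can be sketched as follows. This is a minimal sketch under heavy assumptions: exact-match hashing stands in for the paper's retrieval over memory, the model calls are lambdas, and the confidence threshold rule is an invented stand-in for MemBoost's actual cost-aware routing policy.

```python
import hashlib

class MemBoost:
    """Toy serving loop: answer reuse from memory, a cheap model for
    confident queries, escalation to a strong model otherwise."""

    def __init__(self, small_model, large_model, threshold=0.7):
        self.memory = {}                       # grows continually
        self.small, self.large = small_model, large_model
        self.threshold = threshold

    def answer(self, query):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.memory:                 # reuse: no model call at all
            return self.memory[key], "memory"
        text, confidence = self.small(query)
        if confidence >= self.threshold:       # cheap path
            route = "small"
        else:                                  # escalate uncertain queries
            text, _ = self.large(query)
            route = "large"
        self.memory[key] = text                # store for future reuse
        return text, route

# Stand-in models: the small model is only confident on "easy" queries.
small = lambda q: ("cheap answer", 0.9 if "easy" in q else 0.3)
large = lambda q: ("careful answer", 1.0)
mb = MemBoost(small, large)
first = mb.answer("easy: capital of France?")
second = mb.answer("easy: capital of France?")   # identical repeat
hard = mb.answer("explain the proof in detail")
```

Repeated queries hit memory and cost nothing, which is exactly the workload pattern the abstract says motivates the design.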
[NLP-5] When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
Quick Read: This paper addresses the high inference cost of pretrained Transformers by distilling them into more efficient hybrid architectures, with the central challenge of preserving generation quality during distillation. Conventional practice evaluates distilled models on multiple-choice benchmarks by ranking candidates with log-likelihood, which fails to reflect actual autoregressive generation performance. The key to the solution is a new student architecture, Hybrid Kimi Delta Attention (Hybrid-KDA), paired with GenDistill, a multi-stage distillation pipeline in which generation-based evaluation guides every architectural and training decision. Experiments show that log-likelihood scoring consistently underestimates the teacher-student gap and can even reverse the ranking of design choices; with the right dataset selection, completion-only loss masking, and frozen attention layers during post-training, the distilled models retain 86-90% of the teacher's knowledge-benchmark accuracy while cutting KV-cache memory by up to 75% and improving time-to-first-token by 2-4x at 128K-token contexts.
Link: https://arxiv.org/abs/2603.26556
Authors: Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo
Affiliations: Huawei Zurich Research Center; ACS Lab, Huawei Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract: Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4x at 128K-token contexts.
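The log-likelihood scoring the paper critiques works roughly as follows: each multiple-choice candidate is scored by summing per-token log probabilities under the model, and the highest-scoring candidate is picked, so the model never has to generate anything. The toy `token_logprob` function below is a stand-in for a real model call (a real scorer would also consider length normalization).

```python
import math

def loglikelihood_score(question, candidate, token_logprob):
    """Sum the model's log probability of each candidate token given the
    question and the preceding candidate tokens."""
    tokens = candidate.split()
    return sum(token_logprob(question, tokens[:i], tok)
               for i, tok in enumerate(tokens))

def rank_by_loglikelihood(question, choices, token_logprob):
    """Pick the candidate the model assigns the highest likelihood;
    no autoregressive generation is ever performed."""
    return max(choices,
               key=lambda c: loglikelihood_score(question, c, token_logprob))

# Toy scorer: tokens that appear in the question are 'likelier'.
def toy_logprob(question, prefix, token):
    return math.log(0.5) if token in question.split() else math.log(0.1)

question = "is the sky blue or grey"
pick = rank_by_loglikelihood(question, ["blue", "green tea"], toy_logprob)
```

Because ranking only compares fixed strings, a student can match its teacher here while failing badly once it must produce answers token by token, which is the evaluation gap the abstract quantifies (0.2 pp vs. 20.8 pp).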
[NLP-6] Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model
Quick Read: This paper addresses the absence of reliable reference datasets for evaluating signal detection methods in pharmacovigilance: existing datasets do not capture when adverse events (AEs) were officially recognized by regulators, which prevents restricting analyses to pre-confirmation periods and limits evaluation of early detection performance. The key to the solution is a time-indexed reference dataset for the European Union that records when each AE was added to the Summary of Product Characteristics (SmPC) along with regulatory metadata, enabling precise tracking of AE recognition dates, rigorous analyses restricted to pre-confirmation periods, and more accurate, comparable assessment of signal detection performance.
Link: https://arxiv.org/abs/2603.26544
Authors: Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 4 figures and 2 tables
Abstract:Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs. Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.
[NLP-7] How Open Must Language Models be to Enable Reliable Scientific Inference?
Quick Read: This paper asks how a model's openness or closedness affects the reliability of scientific inferences drawn from research that uses it. It argues that most current closed models are ill-suited for scientific purposes, with some notable exceptions, because restrictions on information about model construction and deployment threaten reliable inference. The key recommendation is procedural: researchers should systematically identify the threats to inference posed by their model usage, document the mitigation steps taken, and provide explicit justifications for model selection, strengthening the reproducibility and credibility of results.
Link: https://arxiv.org/abs/2603.26539
Authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman
Affiliations: Massachusetts Institute of Technology; EleutherAI; University of California San Diego; Rutgers University-Newark; Stony Brook University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.
[NLP-8] ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Quick Read: This paper addresses the under-evaluation of large language models (LLMs) in European Portuguese (pt-PT), where existing training data and benchmarks skew heavily toward Brazilian Portuguese (pt-BR). The key to the solution is ALBA, a linguistically grounded benchmark built from scratch by language experts to cover eight linguistic dimensions (Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology), paired with an LLM-as-a-judge framework for scalable evaluation of generated pt-PT. Experiments across diverse models reveal performance variability across dimensions, motivating variety-sensitive benchmarks and further development of pt-PT tools.
Link: https://arxiv.org/abs/2603.26516
Authors: Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães
Affiliations: NOVA University of Lisbon; NOVA LINCS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: PROPOR 2026 - The 17th International Conference on Computational Processing of Portuguese
Abstract:As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.
[NLP-9] JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems
Quick Read: This paper targets efficient and robust turn-taking detection for industrial-grade voice AI agents. Existing systems rely on acoustic or semantic cues alone, limiting accuracy and stability, while full-duplex LLM approaches require costly full-duplex data and impose heavy training and deployment overheads that hurt real-time performance. The key is JAL-Turn, a joint acoustic-linguistic framework in which a cross-attention module adaptively fuses pre-trained acoustic representations with linguistic features for low-latency prediction of hold vs. shift states; by sharing a frozen ASR encoder, turn-taking prediction runs fully in parallel with speech recognition, adding no end-to-end latency or computational overhead.
Link: https://arxiv.org/abs/2603.26515
Authors: Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, in progress
Abstract:Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.
[NLP-10] AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
Quick Read: This paper addresses the under-representation of European Portuguese (pt-PT) in LLM training data and evaluation benchmarks, where machine-translated test sets miss the variety's linguistic and cultural specificity. The key is AMALIA, a fully open-source LLM that prioritizes pt-PT by using higher-quality pt-PT data in both the mid- and post-training stages, released alongside a benchmark suite combining translated standard tasks with four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show AMALIA matches strong baselines on translated benchmarks while substantially improving on pt-PT-specific evaluations, supporting targeted training and native benchmarking for a specific language variety.
Link: https://arxiv.org/abs/2603.26511
Authors: Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde, Iago Paulo, Vasco Ramos, Diogo Glória-Silva, Miguel Faria, Marcos Treviso, Daniel Gomes, Pedro Gomes, David Semedo, André Martins, João Magalhães
Affiliations: NOVA School of Science and Technology; NOVA LINCS; Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; Fundação para a Ciência e Tecnologia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: PROPOR 2026 - The 17th International Conference on Computational Processing of Portuguese
Abstract:Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
[NLP-11] Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs
Quick Read: This paper addresses the performance bottleneck of named entity recognition (NER) on Portuguese clinical text, particularly under multilabel imbalance. The key is a systematic comparison of BERT-based models and large language models (LLMs), combined with iterative stratification, weighted loss, and oversampling to mitigate class imbalance. Multilingual BERT models, and mmBERT-base in particular, achieve the best micro F1 (0.76); balanced data-splitting strategies further improve overall performance, and the models can run locally with limited computational resources.
Link: https://arxiv.org/abs/2603.26510
Authors: Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek, Sarah Miriã de Castro Rocha, Frederico Nassif Gomes, Júlia Cristina Ferreira, Oge Marques, Lucas Emanuel Silva e Oliveira
Affiliations: Spesia, Curitiba - PR, Brazil; Faculdade de Medicina, Universidade de São Paulo, São Paulo - SP, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba - PR, Brazil; Florida Atlantic University, Boca Raton - FL, USA; Universidade Federal do Paraná (UFPR), Curitiba - PR, Brazil
Categories: Computation and Language (cs.CL)
Comments: Under peer review. GitHub: this https URL
Abstract:Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.
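The weighted-loss strategy mentioned above is not detailed in the abstract; one common realization is inverse-frequency class weighting. A minimal sketch, assuming inverse-frequency weights normalized to mean 1 (the function name and normalization are illustrative, not the paper's method):

```python
def inverse_frequency_weights(label_counts, smoothing=1.0):
    """Per-class loss weights inversely proportional to label frequency,
    normalized so the weights average to 1. Rare classes (e.g. sparse
    clinical entity types) receive larger weights in the loss."""
    inv = {k: 1.0 / (v + smoothing) for k, v in label_counts.items()}
    mean = sum(inv.values()) / len(inv)
    return {k: w / mean for k, w in inv.items()}

# Hypothetical label counts for an imbalanced NER tag set.
weights = inverse_frequency_weights({"O": 99, "DISEASE": 1})
```

The rare `DISEASE` tag ends up weighted far above the dominant `O` tag, which is the effect a weighted cross-entropy loss exploits.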
[NLP-12] ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims LREC2026
Quick Read: This paper addresses the difficulty of automatically verifying climate-related claims, complicated by the specialized nature of scholarly evidence and the diverse rhetorical strategies behind climate disinformation. The key is the ClimateCheck 2026 shared task, which triples the training data, adds a new disinformation narrative classification task, and attracted systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models (LLMs) with structured hierarchical reasoning. An automated framework for assessing retrieval quality under incomplete annotations exposes systematic biases in how conventional metrics rank systems, informing the design of future fact-checking systems.
Link: https://arxiv.org/abs/2603.26449
Authors: Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at NSLP@LREC 2026
Abstract:Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, potentially implicating how future fact-checking systems should be designed.
[NLP-13] Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Quick Read: This paper addresses the time-consuming and error-prone task of retrieving patient-specific information from electronic health records (EHRs). The key is a locally deployable Clinical Contextual Question Answering (CCQA) framework that uses open-source large language models (LLMs) to answer natural-language questions over EHRs fully offline, with no external data transfer. Llama-3.1-70B reaches 95.3% accuracy and 97.3% consistency in free-text generation, and low-precision quantization (4-bit/8-bit) preserves performance while cutting GPU memory requirements, improving deployment feasibility; however, clinical evaluation still finds clinically significant errors in 2.9% of outputs, underscoring the need for human oversight.
Link: https://arxiv.org/abs/2603.26434
Authors: Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen
Affiliations: Aalto University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.
[NLP-14] Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
Quick Read: This paper asks whether reasoning models acknowledge misleading hints in their "thinking tokens" while omitting them from the visible answer, which would expose the limits of monitoring answer text alone. The key is analyzing the thinking channel: comparing hint-related mentions across the two channels reveals substantial thinking-answer divergence, with over half of hint-following cases acknowledged only in thinking tokens, and large effects of hint type and model. Access to thinking tokens improves detection but is still insufficient, leaving 11.8% of cases with no verbalized acknowledgment in either channel.
Link: https://arxiv.org/abs/2603.26410
Authors: Richard J. Young
Affiliations: University of Nevada, Las Vegas; DeepNeuro AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 8 figures, 4 tables
Abstract:Extended-thinking models expose a second text-generation channel (“thinking tokens”) alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint’s target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model’s thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed thinking-answer divergence. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most transparent hint, with 58.8% of sycophancy-influenced cases acknowledging the professor’s authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.
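The four-way taxonomy above (both / thinking-only / answer-only / neither) can be sketched as a keyword check over the two generation channels. This is an illustrative reconstruction, not the study's actual classifier; the keyword-matching rule is an assumption:

```python
def classify_acknowledgment(thinking, answer, hint_keywords):
    """Classify a hint-following case by which channel mentions the hint.
    'thinking-only' corresponds to thinking-answer divergence."""
    def mentions(text):
        t = text.lower()
        return any(k.lower() in t for k in hint_keywords)

    in_think, in_ans = mentions(thinking), mentions(answer)
    if in_think and in_ans:
        return "both"
    if in_think:
        return "thinking-only"
    if in_ans:
        return "answer-only"
    return "neither"
```

Run over a corpus of hint-following cases, the fraction labeled `"thinking-only"` is the divergence rate the paper reports.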
[NLP-15] Word Alignment-Based Evaluation of Uniform Meaning Representations
Quick Read: This paper addresses the difficulty of comparing graph-based semantic representations such as Uniform Meaning Representation (UMR) when node counts differ and correspondences are unclear; existing methods such as smatch favor whatever node mapping maximizes the F1 score and cannot tell whether attribute matches are intentional or accidental, limiting detailed error analysis. The key is a node-matching algorithm based on node-word alignments, inherently available in UMR, which makes comparison of meaning representations more intuitive and interpretable while avoiding smatch's NP-hard search problem.
Link: https://arxiv.org/abs/2603.26401
Authors: Daniel Zeman, Federica Gamba
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have different numbers of nodes, and it is not obvious which nodes should be compared to each other. Existing approaches favor a node mapping that maximizes the F1 score over node relations and attributes, regardless of whether the similarity is intentional or accidental; consequently, the identified mismatches in values of node attributes are not useful for any detailed error analysis. We propose a node-matching algorithm that allows comparison of multiple Uniform Meaning Representations (UMR) of one sentence and that takes advantage of node-word alignments, inherently available in UMR. We compare it with previously used approaches, in particular smatch (the de-facto standard in AMR evaluation), and argue that sensitivity to word alignment makes the comparison of meaning representations more intuitive and interpretable, while avoiding the NP-hard search problem inherent in smatch. A script implementing the method is freely available.
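Alignment-based node matching of the kind proposed can be sketched as a greedy overlap match over node-to-token alignments. This toy version (the data structures and greedy tie-breaking are assumptions, not the paper's algorithm) illustrates why no smatch-style combinatorial search is needed:

```python
def align_nodes(graph_a, graph_b):
    """Match nodes across two meaning representations of one sentence
    using their node-to-word alignments. Inputs map node id -> set of
    aligned token indices; returns a list of (node_a, node_b) pairs.
    Each node is matched at most once, to the node with greatest
    token overlap."""
    pairs = []
    used_b = set()
    for na, toks_a in graph_a.items():
        best, best_overlap = None, 0
        for nb, toks_b in graph_b.items():
            if nb in used_b:
                continue
            overlap = len(toks_a & toks_b)
            if overlap > best_overlap:
                best, best_overlap = nb, overlap
        if best is not None:
            pairs.append((na, best))
            used_b.add(best)
    return pairs
```

Because candidate pairs are constrained to nodes anchored on overlapping words, the matching is linear in the number of node pairs rather than a search over all permutations.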
[NLP-16] Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers
Quick Read: This paper targets the quadratic scaling of full attention with sequence length in long-context modeling, and the narrow receptive field that makes sliding window attention insufficient on its own. The key is Switch Attention (SwiAttn), a hybrid transformer in which each token at each layer is dynamically routed to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching, via a learnable router that allocates computation at fine granularity. An adaptive regularization objective steers the model toward efficiency, and continual pretraining smoothly transfers a pure full-attention architecture to the hybrid one.
Link: https://arxiv.org/abs/2603.26380
Authors: Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang
Affiliations: Peking University; Huawei Technologies Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to take the benefits from both sides by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to encourage the model towards efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.
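The per-token routing idea can be illustrated with a single-head toy model in which a learned router picks, for each token, either the full causal prefix or a short local window. This is a hypothetical sketch under simplifying assumptions (no projections, hard routing, illustrative names), not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def switch_attention(x, w_router, window=2):
    """Hard per-token routing between full causal attention and
    sliding-window attention (single head, no value projection).
    x: (T, d) token states; w_router: (d,) learned routing vector."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)      # (T, T) attention logits
    route = (x @ w_router) > 0         # True -> full-attention branch
    out = np.zeros_like(x)
    for t in range(T):
        # full branch attends over [0, t]; window branch over the last
        # `window` positions only, capping per-token cost.
        lo = 0 if route[t] else max(0, t - window + 1)
        w = softmax(scores[t, lo:t + 1])
        out[t] = w @ x[lo:t + 1]
    return out, route
```

In the real model the routing decision would be trained jointly with the attention weights, with a regularizer pushing tokens toward the cheaper windowed branch.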
[NLP-17] A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models
Quick Read: This paper addresses uncertainty quantification in LLM text generation, where uncertainty arises not only from generation itself but also from the prompt used and the downstream interpretation of the output. The key is a formal measurement framework that models prompting, generation, and interpretation as interconnected autoregressive processes combined into a single sampling tree, and introduces filters and objective functions describing how different aspects of uncertainty are expressed over that tree, unifying existing uncertainty estimators and revealing aspects of uncertainty that have not yet been studied.
Link: https://arxiv.org/abs/2603.26363
Authors: Steffen Herbold, Florian Lemmerich
Affiliations: University of Passau
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:The generation of texts using Large Language Models (LLMs) is inherently uncertain, with sources of uncertainty being not only the generation of texts, but also the prompt used and the downstream interpretation. Within this work, we provide a formal framework for the measurement of uncertainty that takes these different aspects into account. Our framework models prompting, generation, and interpretation as interconnected autoregressive processes that can be combined into a single sampling tree. We introduce filters and objective functions to describe how different aspects of uncertainty can be expressed over the sampling tree and demonstrate how to express existing approaches towards uncertainty through these functions. With our framework we show not only how different methods are formally related and can be reduced to a common core, but also point out additional aspects of uncertainty that have not yet been studied.
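One concrete objective function of the kind such a framework can express is the entropy of downstream interpretations over sampled generations. A minimal sketch, assuming a flat list of samples rather than a full sampling tree (the interface is an assumption, not the paper's formalism):

```python
import math
from collections import Counter

def outcome_entropy(samples, interpret=lambda s: s):
    """Entropy (bits) of downstream interpretations over sampled
    generations. `interpret` plays the role of the interpretation
    step: it maps raw generated text to an outcome label."""
    counts = Counter(interpret(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Note how the interpretation function changes measured uncertainty: generations that differ as strings ("Yes" vs. "yes") can collapse to a single outcome, so uncertainty over text and uncertainty over meaning come apart, which is the distinction the framework is built to capture.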
[NLP-18] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law
Quick Read: This paper addresses the limits of existing legal-reasoning benchmarks, which assume fixed legal norms and therefore miss situations where legal judgments shift over time or multiple norms interact. The key is CALRK-Bench, a context-aware legal reasoning benchmark grounded in the Korean legal system that systematically tests whether models can identify the temporal validity of legal norms, judge whether the available case information is sufficient, and understand the reasons behind shifts in legal judgments. Built from precedents and legal consultation records and validated by legal experts, it provides a new stress test of complex legal reasoning beyond simple memorization of legal knowledge.
Link: https://arxiv.org/abs/2603.26332
Authors: JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho
Affiliations: KAIST AI; Ajou University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages
Abstract:Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the legal system in Korean. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at this https URL.
[NLP-19] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLM s
Quick Read: This paper asks whether LLM performance on spatial reasoning benchmarks reflects structured internal spatial representations or mere linguistic heuristics. Taking a mechanistic perspective grounded in computational theories of human spatial cognition, it decomposes spatial reasoning into three primitives — relational composition, representational transformation, and stateful spatial updating — and designs controlled task families for each. Using linear probing, sparse-autoencoder feature analysis, and causal interventions on multilingual models (English, Chinese, Arabic) under single-pass inference, the authors find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, yet these representations are transient, fragmented across task families, and weakly integrated into final predictions; cross-linguistic analysis further reveals mechanistic degeneracy, where similar behavior arises from distinct internal pathways. The results argue that current LLMs lack robust, general-purpose spatial reasoning and call for mechanistic evaluation beyond benchmark accuracy.
Link: https://arxiv.org/abs/2603.26323
Authors: Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang
Affiliations: Beijing Language and Culture University; Beijing Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models’ (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives, relational composition, representational transformation, and stateful spatial updating, and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single pass inference, and analyze internal representations using linear probing, sparse autoencoder based feature analysis, and causal interventions. We find that task relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context dependent spatial representations rather than robust, general purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.
[NLP-20] findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
Quick Read: This paper addresses the fragmentation of syllabification research across disparate implementations, datasets, and evaluation protocols, which hinders cross-linguistic comparison and reproducibility. The key is findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers (e.g., Sylber, VG-HuBERT) under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation; its components can be recombined, enabling controlled comparisons of representations, algorithms, and token rates, and supporting standardized syllable-level experiments in both high-resource and under-resourced settings.
Link: https://arxiv.org/abs/2603.26292
Authors: Héctor Javier Vázquez Martínez
Affiliations: University of Pennsylvania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages + 2 for references, disclosures, and acknowledgements; currently under review
Abstract:Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.
[NLP-21] SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia
Quick Read: This paper addresses the inefficiency of big-data research in Indonesia, where relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases with differing formats, access methods, and noise characteristics, forcing researchers to repeatedly build collection pipelines, clean heterogeneous data, and assemble separate analysis tools at the expense of the research itself. The key is SocialX, which separates the system into three independent layers — data collection, language-aware preprocessing, and pluggable analysis — connected by a lightweight job-coordination mechanism; this modularity lets each layer grow independently, so new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline.
Link: https://arxiv.org/abs/2603.26253
Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja
Affiliations: Telkom University; National Research and Innovation Agency (BRIN)
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 1 Figure, 4 Tables
Abstract:Big data research in Indonesia is constrained by a fundamental fragmentation: relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases, each with different formats, access methods, and noise characteristics. Researchers must independently build collection pipelines, clean heterogeneous data, and assemble separate analysis tools, a process that often overshadows the research itself. We present SocialX, a modular platform for multi-source big data research that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified, source-agnostic pipeline. The platform separates concerns into three independent layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism. This modularity allows each layer to grow independently: new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline. We describe the design principles that enable this extensibility, detail the preprocessing methodology that addresses challenges specific to Indonesian text across registers, and demonstrate the platform’s utility through a walkthrough of a typical research workflow. SocialX is publicly accessible as a web-based platform at this https URL.
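The three-layer separation of concerns can be sketched as a pipeline whose layers are pluggable lists of callables. This is a toy illustration of the design principle, not SocialX's API; all names are assumptions:

```python
class Pipeline:
    """Toy three-layer pipeline: collection -> preprocessing -> analysis.
    New sources, preprocessors, or analyzers are added by appending to a
    list, without touching the run() logic."""

    def __init__(self):
        self.collectors = []     # each: () -> list of raw records
        self.preprocessors = []  # each: record -> record
        self.analyzers = []      # each: list of records -> result

    def run(self):
        records = [r for c in self.collectors for r in c()]
        for p in self.preprocessors:
            records = [p(r) for r in records]
        return {a.__name__: a(records) for a in self.analyzers}
```

Extending the system then means registering a new callable, which is the extensibility property the paper attributes to the layered architecture.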
[NLP-22] Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan LREC2026
Quick Read: This paper tackles the inefficiency of transcribing speech data in endangered-language documentation and revitalization, focusing on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan. The key is constructing a speech corpus from field recordings and training an automatic speech recognition (ASR) model that achieves a character error rate (CER) as low as 15%; experiments show that ASR assistance substantially reduces transcription time and cognitive load, offering a practical path toward scalable, technology-supported documentation of endangered languages.
Link: https://arxiv.org/abs/2603.26248
Authors: Chihiro Taguchi, Yukinori Takubo, David Chiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 tables, 4 figures, accepted at LREC 2026
Abstract:Language endangerment poses a major challenge to linguistic diversity worldwide, and technological advances have opened new avenues for documentation and revitalization. Among these, automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old. We present an ongoing effort to develop an ASR system for Ikema based on field recordings. Specifically, we (1) construct a \totaldatasethours-hour speech corpus from field recordings, (2) train an ASR model that achieves a character error rate as low as 15%, and (3) evaluate the impact of ASR assistance on the efficiency of speech transcription. Our results demonstrate that ASR integration can substantially reduce transcription time and cognitive load, offering a practical pathway toward scalable, technology-supported documentation of endangered languages.
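The character error rate reported above is standardly computed as the character-level edit distance between the ASR hypothesis and the reference transcription, normalized by the reference length. A self-contained sketch of the standard metric (not tied to the paper's toolkit):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings via a one-row DP table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edits normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

A 15% CER thus means roughly one character-level edit per seven reference characters, which is low enough for a transcriber to correct rather than type from scratch.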
[NLP-23] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
Quick Read: This paper addresses the inability of standard LLM-based automatic speech recognition (ASR) systems, which process utterances in isolation, to exploit conversational context, limiting recognition of contextual entities. The key is Abstract Compression: the raw audio-token sequence of prior turns is replaced with a fixed number of learned latent tokens while the corresponding transcripts are retained explicitly, sharply shrinking the prior-turn audio footprint while preserving much of the contextual benefit to ASR.
Link: https://arxiv.org/abs/2603.26246
Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
Affiliations: Idiap Research Institute; EPFL; Uniphore; University of Zurich; Brno University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 11 pages
Abstract:Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
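The motivation — prior-turn audio tokens growing with conversation length versus a fixed latent budget per turn — is simple arithmetic. A sketch with illustrative numbers (the token counts are assumptions, not the paper's figures):

```python
def context_tokens(n_turns, audio_tokens_per_turn, text_tokens_per_turn,
                   latents_per_turn=None):
    """Prior-turn context size for an LLM-based ASR system.
    If latents_per_turn is set, each prior turn's audio tokens are
    replaced by that fixed number of learned latent tokens, while the
    transcript tokens are kept explicitly (as in Abstract Compression)."""
    audio = (latents_per_turn if latents_per_turn is not None
             else audio_tokens_per_turn)
    return n_turns * (audio + text_tokens_per_turn)

# Hypothetical: 10 prior turns, 750 audio tokens and 30 text tokens each.
raw = context_tokens(10, 750, 30)                      # uncompressed
compressed = context_tokens(10, 750, 30, latents_per_turn=16)
```

Both settings still grow linearly in the number of turns, but the compressed slope is dominated by the (short) transcripts rather than the audio.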
[NLP-24] A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs
Quick Read: This paper asks whether multilingual LLMs process culture-specific pragmatic registers such as slang as isolated language-specific memorizations or as a unified abstraction. The key is probing the internal representations of Gemma-2-9B-IT with sparse autoencoders (SAEs), using a novel dataset in which every target term is polysemous and appears in both literal and informal contexts, so that register processing is isolated from mere lexical sensitivity. While much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core forms a geometrically coherent "informal register subspace" that sharpens in deeper layers; crucially, activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages, evidence that informality is internalized as a portable, language-agnostic pragmatic abstraction.
Link: https://arxiv.org/abs/2603.26236
Authors: Uri Z. Kialy, Avi Shtarkberg, Ayal Klein
Affiliations: Ariel University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic register processing from trivial lexical sensitivity, we introduce a novel dataset in which every target term is polysemous, appearing in both literal and informal contexts. We find that while much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core consistently emerges. This shared core forms a geometrically coherent "informal register subspace" that sharpens in the model's deeper layers. Crucially, these shared representations are not merely correlational: activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages spanning diverse language families and scripts. Together, these results provide the first mechanistic evidence that multilingual LLMs internalize informal register not just as surface-level heuristics, but as a portable, language-agnostic pragmatic abstraction.
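Activation steering with an SAE feature, as used above to shift output formality, amounts to adding a scaled feature direction to a hidden state. A minimal numpy sketch (function names and the unit-normalization convention are illustrative assumptions):

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift a hidden state along a unit-normalized feature direction.
    Positive alpha amplifies the concept; negative alpha suppresses it."""
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d

def feature_activation(hidden, direction):
    """Projection of the hidden state onto the feature direction,
    i.e. how strongly the feature is expressed."""
    d = direction / np.linalg.norm(direction)
    return float(hidden @ d)
```

In the actual intervention the direction would be an SAE decoder column for the informal-register feature, applied at a chosen layer during generation.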
[NLP-25] GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation
Quick Read: This paper addresses the scarcity of UK clinical text resources, in particular the limited generalizability of NLP tools for brain imaging reports. The key is constructing and releasing GS-BrainText, a multi-site dataset of 8,511 brain radiology reports, 2,431 of which are annotated by a multidisciplinary expert team for 24 brain disease phenotypes, spanning five Scottish NHS health boards with rigorous quality assurance. Benchmark evaluation with the rule-based EdIE-R system reveals performance variation across health boards, phenotypes, and age groups, making the dataset a high-quality basis for studying linguistic variation, diagnostic uncertainty expression, and the impact of data characteristics on NLP system performance, in support of generalizable clinical NLP development.
Link: https://arxiv.org/abs/2603.26235
Authors: Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley, William Whiteley
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 1 figure
Abstract:We present GS-BrainText, a curated dataset of 8,511 brain radiology reports from the Generation Scotland cohort, of which 2,431 are annotated for 24 brain disease phenotypes. This multi-site dataset spans five Scottish NHS health boards and includes broad age representation (mean age 58, median age 53), making it uniquely valuable for developing and evaluating generalisable clinical natural language processing (NLP) algorithms and tools. Expert annotations were performed by a multidisciplinary clinical team using an annotation schema, with 10-100% double annotation per NHS health board and rigorous quality assurance. Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups (F1: 87.01-98.13), highlighting critical challenges in generalisation of NLP tools. The GS-BrainText dataset addresses a significant gap in available UK clinical text resources and provides a valuable resource for the study of linguistic variation, diagnostic uncertainty expression and the impact of data characteristics on NLP system performance.
[NLP-26] Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents
【Quick Read】: This paper addresses the inability of Large Language Model (LLM) agents to seek clarification when facing underspecified instructions in open-ended domains such as software engineering; current agents are optimized mainly for autonomous execution and struggle with the missing context common in real scenarios. The key to the solution is an uncertainty-aware multi-agent framework that explicitly decouples underspecification detection from code execution and coordinates multiple agents to proactively clarify complex tasks. Experiments show a 69.40% task resolve rate on an underspecified variant of SWE-bench Verified, significantly outperforming a single-agent setup (61.20%), with well-calibrated uncertainty: the system conserves queries on simple tasks while proactively seeking information on complex ones, moving LLM agents toward practical, proactively collaborative systems.
Link: https://arxiv.org/abs/2603.26233
Authors: Nicholas Edwards, Sebastian Schuster
Affiliations: University of Vienna
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.
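The core idea above, decoupling underspecification detection from execution, can be illustrated with a toy decision rule; the marker list, scoring heuristic, and threshold below are hypothetical stand-ins for the paper's learned detector agent:

```python
# Illustrative sketch: a detector scores how underspecified an instruction is,
# and the coding agent either asks a clarifying question or proceeds.
# The vague-word list and threshold are toy choices, not the paper's method.

VAGUE_MARKERS = {"somehow", "etc", "properly"}

def underspecification_score(instruction):
    words = [w.strip(".,") for w in instruction.lower().split()]
    return sum(w in VAGUE_MARKERS for w in words) / max(len(words), 1)

def next_action(instruction, threshold=0.05):
    if underspecification_score(instruction) > threshold:
        return "ask_clarification"
    return "execute"
```

The point of the separation is that the clarification decision is made before any code runs, so the executor only ever sees instructions the detector considered sufficiently specified.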
[NLP-27] Sparse Auto-Encoders and Holism about Large Language Models
【Quick Read】: This paper asks whether Large Language Models (LLMs) suggest a meta-semantic picture, i.e. an account of how words and complex expressions come to have their meanings. The author first reviews earlier arguments that LLMs, by employing distributional semantics, embody a form of semantic holism, and then confronts the challenge that the discovery of vast numbers of interpretable latent features in mechanistic interpretability research poses to the holistic interpretation. The key to the response is a closer examination of the nature of these features: provided the features are countable, the original holistic picture of meaning still stands. The argument shows that despite locally decomposable feature structure, the semantic representations of LLMs may remain holistic in a broad sense.
Link: https://arxiv.org/abs/2603.26207
Authors: Jumbly Grindrod
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).
[NLP-28] ClinicalAgents : Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
【Quick Read】: This paper addresses the limited accuracy of Large Language Models (LLMs) in medical diagnosis caused by their lack of complex, non-linear reasoning. Existing methods mostly rely on static, linear symptom-to-diagnosis mappings and cannot simulate the iterative, hypothesis-driven thinking of clinicians. The key innovations of the proposed ClinicalAgents framework are twofold: first, a dynamic orchestration mechanism based on Monte Carlo Tree Search (MCTS) that lets the agents interactively generate hypotheses, actively verify evidence, and backtrack when needed; second, a Dual-Memory architecture consisting of a mutable Working Memory that maintains the patient state for context-aware reasoning and a static Experience Memory that retrieves clinical guidelines and historical cases through a feedback loop, which together substantially improve diagnostic accuracy and explainability.
Link: https://arxiv.org/abs/2603.26182
Authors: Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li
Affiliations: The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 1 figure, 6 tables, conference
Abstract:While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.
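The abstract does not spell out the orchestrator's selection rule, but an MCTS-style process needs some way to trade off revisiting promising hypotheses against exploring untested ones. A standard choice is UCB1, sketched here with toy hypothesis names, values, and visit counts (this is an illustration of the general MCTS selection step, not the paper's exact mechanism):

```python
import math

# UCB1 selection over diagnostic hypotheses: exploit high-value visited nodes,
# but always try an unexplored one first. All data below are toy values.

def ucb1(value, visits, total_visits, c=1.4):
    if visits == 0:
        return float("inf")            # unexplored hypotheses get priority
    return value / visits + c * math.sqrt(math.log(total_visits) / visits)

# hypothesis -> (accumulated value, visit count)
hypotheses = {"pneumonia": (3.0, 5), "embolism": (1.0, 2), "asthma": (0.0, 0)}
total = sum(n for _, n in hypotheses.values())
scores = {h: ucb1(v, n, total) for h, (v, n) in hypotheses.items()}
next_hypothesis = max(scores, key=scores.get)
```

In a framework like ClinicalAgents, expanding a hypothesis would correspond to gathering or verifying evidence for it, with the backtracking step triggered when that evidence is missing.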
[NLP-29] DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
【Quick Read】: This paper addresses the lack of a unified framework for dynamic data optimization in data-centric training of Large Language Models (LLMs), spanning data selection, data-mixture optimization, and sample reweighting. These methods are usually scattered across isolated codebases with inconsistent interfaces, which hinders reproducibility, fair comparison, and integration into practical training pipelines. The key to the solution is DataFlex, a unified data-centric dynamic training framework built on LLaMA-Factory. Its core strengths are support for the three major paradigms of dynamic data optimization (sample selection, domain-mixture adjustment, and sample reweighting) while remaining fully compatible with the original training workflow, plus extensible trainer abstractions and modular components that make it a drop-in replacement for standard LLM training, unify key model-dependent operations (such as embedding extraction, inference, and gradient computation), and support large-scale settings such as DeepSpeed ZeRO-3.
Link: https://arxiv.org/abs/2603.26164
Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang
Affiliations: Peking University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
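One of the paradigms DataFlex unifies, dynamic sample selection, can be reduced to a simple per-round operation: score every candidate with the current model and train on the hardest ones. A sketch with mocked loss values (the scoring criterion is one common choice, not necessarily the one any specific method in the paper uses):

```python
# One dynamic data-selection round: pick the k samples the current model finds
# hardest, i.e. those with the highest loss. Losses here are mocked; a real
# trainer would recompute them with the evolving model each selection round.

def select_topk(losses, k):
    """Indices of the k highest-loss samples, hardest first."""
    return sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:k]

losses = [0.1, 2.3, 0.7, 1.5, 0.2]
batch = select_topk(losses, k=2)
```

The framework's contribution is making this kind of model-dependent scoring (losses, embeddings, gradients) a pluggable step inside an otherwise standard training loop.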
[NLP-30] Clash of the models: Comparing performance of BERT-based variants for generic news frame detection
【Quick Read】: This paper addresses the lack of systematic comparison and cross-context validation of computational news-frame detection methods in political communication research. The key contributions are threefold: first, comparing five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT, and ALBERT) on generic news frame detection, providing empirical evidence on best practices for computational text analysis; second, introducing fine-tuned models capable of robust frame detection; and third, releasing a labelled dataset based on the Swiss electoral context, complementing prior work that relied mainly on US data and strengthening the generalisability and applicability of these models across cultural settings.
Link: https://arxiv.org/abs/2603.26156
Authors: Vihang Jumle
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Framing continues to remain one of the most extensively applied theories in political communication. Developments in computation, particularly with the introduction of transformer architecture and more so with large language models (LLMs), have naturally prompted scholars to explore various novel computational approaches, especially for deductive frame detection, in recent years. While many studies have shown that different transformer models outperform their preceding models that use bag-of-words features, the debate continues to evolve regarding how these models compare with each other on classification tasks. By placing itself at this juncture, this study makes three key contributions: First, it comparatively performs generic news frame detection and compares the performance of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT and ALBERT) to add to the debate on best practices around employing computational text analysis for political communication studies. Second, it introduces various fine-tuned models capable of robustly performing generic news frame detection. Third, building upon numerous previous studies that work with US-centric data, this study provides the scholarly community with a labelled generic news frames dataset based on the Swiss electoral context that aids in testing the contextual robustness of these computational approaches to framing analysis.
[NLP-31] Finding Distributed Object-Centric Properties in Self-Supervised Transformers CVPR
【Quick Read】: This paper addresses the poor object localization of self-supervised Vision Transformers (ViTs) in object discovery, caused by spurious activations in the [CLS] token attention maps. Because the [CLS] token is trained with an image-level objective, it aggregates information from the whole image and dilutes the fine-grained, object-centric signal present in local patch interactions. The key to the solution is twofold: first, an analysis of inter-patch similarity using the patch-level attention components (query q, key k, value v) across all layers, showing that object-centric information is distributed throughout the network rather than confined to the final layer; second, Object-DINO, a training-free method that extracts this distributed object-centric information by clustering attention heads across layers and automatically identifying the object-centric cluster corresponding to all objects, thereby improving unsupervised object discovery and visual grounding in Multimodal Large Language Models on downstream tasks.
Link: https://arxiv.org/abs/2603.26127
Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja
Affiliations: University of Illinois Urbana-Champaign; University of Queensland; UC Merced
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Computer Vision and Pattern Recognition (CVPR) 2026
Abstract:Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ( q, k, v ), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO’s effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
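The inter-patch similarity maps at the heart of this analysis are just pairwise similarities between patch-level feature vectors (q, k, or v). A toy 2-D sketch, where patches 0 and 1 belong to the same object and patch 2 is background (real ViT features are much higher-dimensional):

```python
# Inter-patch similarity: cosine similarity between one patch's query vector
# and every patch's query vector. The 2-D vectors below are toy values.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

patch_q = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # patches 0,1: object; 2: background
sim_to_patch0 = [cosine(patch_q[0], q) for q in patch_q]
```

Computing such maps per head and per layer, then clustering heads by the similarity structure of their patches, is the mechanism the abstract describes for isolating the object-centric cluster.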
[NLP-32] LLM Benchmark-User Need Misalignment for Climate Change
【Quick Read】: This paper addresses the substantial mismatch between current benchmarks for climate-related Large Language Models (LLMs) and real user needs. The core challenge is that existing benchmarks do not accurately reflect how humans actually seek and convey climate knowledge in real scenarios, limiting the real-world effectiveness of LLMs. The key to the solution is a Proactive Knowledge Behaviors Framework together with a Topic-Intent-Form taxonomy for systematically analyzing knowledge-interaction patterns both between humans and AI and among humans. The study finds that human-LLM knowledge interactions closely resemble human-human interactions, providing actionable guidance for improving benchmark design, enhancing Retrieval-Augmented Generation (RAG) systems, and optimizing LLM training.
Link: https://arxiv.org/abs/2603.26106
Authors: Oucheng Liu, Lexing Xie, Jing Jiang
Affiliations: The Australian National University
Subjects: Computation and Language (cs.CL)
Comments: 37 pages (8 main), 31 figures, 14 tables
Abstract:Climate change is a major socio-scientific issue that shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLMs in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at this https URL.
[NLP-33] IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text
【Quick Read】: This paper addresses the lack of effective models and high-quality labelled data for relevancy classification in Indonesian (Bahasa Indonesia). Unlike sentiment analysis or named entity recognition, relevancy classification must jointly reason about the semantic relationship between a topical context and a candidate text, demanding stronger semantic understanding. The key to the solution is IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters), trained through an iterative, failure-driven data construction process that combines multiple data sources and adds targeted synthetic data to compensate for the weaknesses of any single source. This markedly improves robustness on both formal and informal Indonesian text, reaching an F1 score of 0.948 and 96.5% accuracy.
Link: https://arxiv.org/abs/2603.26095
Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja
Affiliations: Telkom University; National Research and Innovation Agency (BRIN)
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, 6 tables
Abstract:Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available at HuggingFace.
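The "two inputs simultaneously" aspect is typically handled by packing the topic and the candidate text into one BERT-style sentence pair. A sketch, with a toy keyword scorer standing in for the fine-tuned classifier head (the examples and scoring rule are illustrative, not from the paper):

```python
# Context-conditioned relevancy: score a (topic, text) pair jointly, packed as
# a BERT-style sentence pair. The keyword check below is a toy stand-in for the
# actual fine-tuned classifier head.

def encode_pair(topic, text):
    return f"[CLS] {topic} [SEP] {text} [SEP]"

def relevant(text, topic_keywords):
    return any(k in text.lower() for k in topic_keywords)

pair = encode_pair("banjir Jakarta", "Hujan deras menyebabkan banjir di Jakarta")
is_rel = relevant("Hujan deras menyebabkan banjir di Jakarta", {"banjir"})
```

The pair encoding lets the model's self-attention relate every topic token to every text token, which is what distinguishes this task from single-input classification like sentiment analysis.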
[NLP-34] Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind
【Quick Read】: This paper asks whether Large Language Models (LLMs) have genuinely acquired a human-like Theory of Mind, i.e. the ability to build and deploy causal models of the beliefs, intentions, and knowledge states of themselves and others in arbitrary settings, rather than merely imitating surface patterns in their training data. The core difficulty is distinguishing surface mimicry from deep cognitive modeling, in particular whether models can reason actively and act strategically in dynamic social interactions. The key to the solution is a novel experimental paradigm requiring subjects not only to describe mental states but also to make strategic behavioral decisions based on them, strictly testing whether models form generalizable mental representations. The study finds that LLMs released before mid-2025 fail the tasks, while recent models achieve human-level performance on modeling others' mental states yet still fail at self-modeling unless afforded a reasoning trace (scratchpad) as external working-memory support; models also show cognitive-load effects on other-modeling tasks resembling limited-capacity working memory, and frontier models readily engage in strategic deception, indicating a degree of metacognitive and social reasoning ability.
Link: https://arxiv.org/abs/2603.26089
Authors: Christopher Ackerman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 13 figures, 1 table
Abstract:The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.
[NLP-35] I Want to Believe (but the Vocabulary Changed): Measuring the Semantic Structure and Evolution of Conspiracy Theories
【Quick Read】: This paper addresses the unclear semantic evolution of conspiracy theories in online political discourse: prior work focuses on belief formation, exposure, and diffusion while neglecting how their meanings change over time. The key to the solution is twofold: using 169.9M comments from Reddit's r/politics subreddit (2012-2022), the authors first show through semantic-space analysis that conspiracy-related language forms coherent and semantically distinguishable regions, so conspiracy theories can be treated as operational semantic objects; they then track the evolution of these objects over time with aligned word embeddings, enabling comparison of semantic neighborhoods across periods. This reveals that conspiracy theories evolve non-uniformly, exhibiting complex patterns of semantic stability, expansion, contraction, and replacement that keyword-based approaches cannot capture.
Link: https://arxiv.org/abs/2603.26062
Authors: Manisha Keim, Sarmad Chandio, Osama Khalid, Rishab Nithyanand
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:
Abstract:Research on conspiracy theories has largely focused on belief formation, exposure, and diffusion, while paying less attention to how their meanings change over time. This gap persists partly because conspiracy-related terms are often treated as stable lexical markers, making it difficult to separate genuine semantic changes from surface-level vocabulary changes. In this paper, we measure the semantic structure and evolution of conspiracy theories in online political discourse. Using 169.9M comments from Reddit’s r/politics subreddit spanning 2012–2022, we first demonstrate that conspiracy-related language forms coherent and semantically distinguishable regions of language space, allowing conspiracy theories to be treated as semantic objects. We then track how these objects evolve over time using aligned word embeddings, enabling comparisons of semantic neighborhoods across periods. Our analysis reveals that conspiracy theories evolve non-uniformly, exhibiting patterns of semantic stability, expansion, contraction, and replacement that are not captured by keyword-based approaches alone.
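"Aligned word embeddings" means mapping one period's embedding space onto another's with a rotation (orthogonal Procrustes) so that neighborhoods become comparable. The general case is solved with an SVD; here is the 2-D closed form as a self-contained sketch, with illustrative point pairs:

```python
import math

# Toy 2-D embedding alignment: find the rotation that best maps one period's
# vectors onto another's (the 2-D closed form of orthogonal Procrustes, which
# in higher dimensions is solved with an SVD). Point pairs are illustrative.

def best_rotation_angle(src, tgt):
    """Angle of the rotation minimizing the summed squared mapping error."""
    num = sum(s[0] * t[1] - s[1] * t[0] for s, t in zip(src, tgt))
    den = sum(s[0] * t[0] + s[1] * t[1] for s, t in zip(src, tgt))
    return math.atan2(num, den)

src = [(1.0, 0.0), (0.0, 1.0)]
tgt = [(0.0, 1.0), (-1.0, 0.0)]          # src rotated by 90 degrees
theta = best_rotation_angle(src, tgt)
```

After alignment, a term's nearest neighbors in different years live in the same coordinate system, so changes in those neighborhoods reflect semantic drift rather than arbitrary rotations of the embedding space.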
[NLP-36] Retrieval-Augmented Generation Based Nurse Observation Extraction
【Quick Read】: This paper aims to reduce the heavy workload nurses face from manually entering observation records, by automatically extracting clinical information from nurse dictations. The key to the solution is an automated extraction pipeline based on Retrieval-Augmented Generation (RAG), which effectively improves the accuracy of clinical text information extraction, achieving an F1 score of 0.796 on the MEDIQA-SYNUR test dataset and demonstrating its practicality and effectiveness in medical settings.
Link: https://arxiv.org/abs/2603.26046
Authors: Kyomin Hwang, Nojun Kwak
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advancements in Large Language Models (LLMs) have played a significant role in reducing human workload across various domains, a trend that is increasingly extending into the medical field. In this paper, we propose an automated pipeline designed to alleviate the burden on nurses by automatically extracting clinical observations from nurse dictations. To ensure accurate extraction, we introduce a method based on Retrieval-Augmented Generation (RAG). Our approach demonstrates effective performance, achieving an F1-score of 0.796 on the MEDIQA-SYNUR test dataset.
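A RAG pipeline of this kind has two stages: retrieve reference material matching the dictation, then condition the generator on it. A minimal sketch using word overlap as the retriever; the snippets and prompt template are illustrative, not the paper's actual pipeline:

```python
# Minimal RAG sketch: retrieve the reference snippet that best matches the
# dictation (here by word overlap) and condition extraction on it.

def retrieve(query, corpus):
    q = set(query.lower().split())
    return max(corpus, key=lambda doc: len(q & set(doc.lower().split())))

corpus = [
    "blood pressure observation: record systolic and diastolic values",
    "wound care observation: note size, exudate and dressing change",
]
dictation = "patient blood pressure one twenty over eighty"
context = retrieve(dictation, corpus)
prompt = f"Context: {context}\nDictation: {dictation}\nExtract observations:"
```

A production system would use dense embeddings rather than word overlap for retrieval, but the structure of the prompt, retrieved context followed by the dictation, is the same.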
[NLP-37] H-Node Attack and Defense in Large Language Models
【Quick Read】: This paper addresses hallucination in Large Language Models (LLMs), i.e. the generation of content inconsistent with facts. The core challenge is identifying and intervening on the hidden-state dimensions responsible for hallucination without substantially harming the model's general reasoning ability. The key to the solution is the H-Node Adversarial Noise Cancellation (H-Node ANC) framework: a logistic regression probe first localizes high-variance "Hallucination Nodes" (H-Nodes); a white-box adversarial attack then amplifies these dimensions in real time at inference to induce hallucination; and an adaptive noise-suppression defense dynamically cancels anomalous H-Node activations with confidence weighting, reducing grounded activation drift by 33-42%, while an iterative re-ranking strategy further raises robustness to 0.69 (from a single-pass baseline of 8%). The method is validated across several mainstream LLM architectures with only a 5% perplexity increase and at most 3% MMLU degradation, showing it is both effective and non-destructive.
Link: https://arxiv.org/abs/2603.26045
Authors: Eric Yocam, Varghese Vaidyan, Yong Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, 7 figures, 6 tables
Abstract:We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions – termed Hallucination Nodes (H-Nodes) – with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes recovers up to 0.69 robustness from a single-pass baseline of 8%. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). Perplexity impact is surgical (5%) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.
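The attack side is mechanically simple: a forward hook that scales a small set of hidden-state dimensions at inference time. A sketch with illustrative indices and gain (in a real setup this would be registered as a hook on a transformer layer):

```python
# Sketch of the H-Node attack: amplify a few hidden-state dimensions, as a
# real-time forward hook would. Indices and gain are illustrative values.

def hnode_amplify(hidden, h_nodes, gain):
    """Scale the listed dimensions by `gain`; leave the rest untouched."""
    out = list(hidden)
    for i in h_nodes:
        out[i] *= gain
    return out

hidden = [0.5, -1.0, 0.3, 2.0]
attacked = hnode_amplify(hidden, h_nodes=[1, 3], gain=3.0)
# the cancellation defense works on the same dimensions in the opposite
# direction, subtracting the estimated excess rather than multiplying it in
```

Because only a handful of dimensions are touched, the perturbation stays nearly invisible in aggregate statistics, which is what the paper's low defender-visibility figure refers to.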
[NLP-38] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents
【Quick Read】: This paper addresses the trade-off between execution efficiency and reasoning robustness that Large Language Model (LLM) agents face on complex multi-step tasks: lower-cost models execute quickly but struggle on difficult reasoning segments, while stronger models reason more robustly at higher computational cost. The key to the solution is AgentCollab, a self-driven collaborative inference framework that uses the agent's own self-reflection signal to judge whether the current reasoning trajectory is making meaningful progress and escalates control to a stronger reasoning tier only when necessary, together with a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals to stabilize long-horizon tasks. Experiments on diverse multi-step agent benchmarks show that the framework consistently improves the accuracy-efficiency Pareto frontier.
Link: https://arxiv.org/abs/2603.26034
Authors: Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang, Xi Chen, Hui-Ling Zhen, Hanting Chen, Weizhe Lin, Xiaosong Li, Yaoyuan Wang
Affiliations: Huawei; Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent’s own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
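The difficulty-aware cumulative escalation idea can be sketched as a budget that grows with recent failure signals and is capped; the window, per-failure increment, and cap below are illustrative hyperparameters, not the paper's values:

```python
# Difficulty-aware cumulative escalation, sketched: the number of strong-model
# steps allowed grows with 0/1 failure signals in a recent window, up to a cap.

def strong_model_budget(recent_failures, base=1, per_failure=2, cap=8):
    """Strong-model steps allowed given failure signals in the window."""
    return min(base + per_failure * sum(recent_failures), cap)

smooth = strong_model_budget([0, 0, 0])        # clean run: stay on the small model
struggling = strong_model_budget([1, 1, 0])    # recent failures: escalate more
capped = strong_model_budget([1, 1, 1, 1])     # budget never exceeds the cap
```

Driving this budget from the agent's own self-reflection signal, rather than from an external router, is what makes the scheme "self-driven" in the paper's framing.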
[NLP-39] Toward Culturally Grounded Natural Language Processing
【Quick Read】: This paper addresses the pervasive separation of linguistic and cultural competence in current multilingual NLP research: strong performance on cross-lingual tasks does not imply genuine cultural adaptation. The survey shows that although training-data coverage remains a strong determinant of performance, breadth of data alone cannot guarantee effectiveness and fairness in low-resource or culturally specific settings; tokenization, prompt language, translation artifacts in benchmark design, culturally specific supervision, and multimodal context all materially affect how culturally appropriate model outputs are. The key to the solution is moving from treating languages as isolated rows in benchmark spreadsheets toward modeling communicative ecologies, i.e. the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used, yielding a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and community-aware multimodal design.
Link: https://arxiv.org/abs/2603.26013
Authors: Sina Bagheri Nezhad
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.
[NLP-40] Policy-Guided World Model Planning for Language-Conditioned Visual Navigation
【Quick Read】: This paper addresses the difficulty of long-horizon planning in **instruction-conditioned visual navigation**, in particular the poor performance of world-model planning caused by bad action initialization in high-dimensional action spaces and the limitations of reactive policies on long-horizon tasks. The key to the solution is the two-stage PiJEPA framework: in the first stage, a generalist Octo-based robot policy with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2) is finetuned to produce an action distribution conditioned on the current observation and language instruction; in the second stage, this distribution serves as a prior for Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future states in latent space to search efficiently for high-quality action sequences. Initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian markedly accelerates planner convergence and improves goal-reaching accuracy and instruction-following fidelity.
Link: https://arxiv.org/abs/2603.25981
Authors: Amirhosein Chahe, Lifeng Zhou
Affiliations: Drexel University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high-quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.
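The core of an MPPI update is weighting sampled action sequences by the exponential of their negative cost and averaging them into the next plan. A sketch of the weighting step with toy costs; in PiJEPA the samples would be drawn around the policy's action prior rather than an uninformed Gaussian:

```python
import math

# MPPI weighting step, sketched: each sampled action sequence gets weight
# exp(-cost / temperature), normalized to sum to one. Costs are toy values.

def mppi_weights(costs, temperature=1.0):
    m = min(costs)                       # subtract the min for numerical stability
    w = [math.exp(-(c - m) / temperature) for c in costs]
    z = sum(w)
    return [x / z for x in w]

costs = [5.0, 1.0, 3.0]
weights = mppi_weights(costs)            # the lowest-cost sample dominates
```

Warm-starting from the policy prior matters because these weights only sharpen samples that were already drawn; a well-placed initial distribution puts probability mass near good action sequences from the first iteration.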
[NLP-41] Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics
【Quick Read】: This paper asks whether the primitive operations underlying events can be discovered automatically through compression pressure alone, without hand annotation or linguistic intuition. Schank's conceptual dependency theory proposed basic primitives such as ATRANS and PTRANS, but whether they are information-theoretically grounded, and whether the inventory is complete, has remained unclear. The key to the solution is adapting DreamCoder's wake-sleep library learning to event state transformations: in the wake phase, the system derives operator compositions explaining each event from before/after world-state pairs; in the sleep phase, recurring patterns are extracted as new operators under a Minimum Description Length (MDL) objective. Experiments show that the method not only rediscovers Schank's core primitives but also finds many new operators for mental and emotional state change (such as CHANGE_wants and CHANGE_feels), and that it substantially outperforms the hand-coded primitives on both synthetic data and the real-world commonsense dataset ATOMIC, demonstrating that event primitives can be discovered by compression pressure and that mental and emotional dimensions matter far more in naturalistic human events than physical actions.
Link: https://arxiv.org/abs/2603.25975
Authors: Peter Balogh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We show that they do. Schank’s conceptual dependency theory proposed that all events decompose into primitive operations – ATRANS, PTRANS, MTRANS, and others – hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder’s wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping directly to Schank’s: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators (“mail” = ATRANS + PTRANS) and novel emotional state operators absent from Schank’s taxonomy. We validate on synthetic events and real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, discovered operators achieve Bayesian MDL within 4% of Schank’s hand-coded primitives while explaining 100% of events vs. Schank’s 81%. On ATOMIC, results are more dramatic: Schank’s primitives explain only 10% of naturalistic events, while the discovered library explains 100%. Dominant operators are not physical-action primitives but mental and emotional state changes – CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) – none in Schank’s original taxonomy. These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank’s core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed – with mental/emotional operators dominating in naturalistic data. 
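The MDL criterion in the sleep phase trades a one-off library cost against shorter programs. A toy version of that comparison, using the paper's "mail" = ATRANS + PTRANS example; the symbol counts are purely illustrative:

```python
# Toy MDL trade-off behind the sleep phase: abstracting a compound operator
# pays a one-off library cost but shortens every program that uses it.
# All costs are in "symbols" and purely illustrative.

def description_length(library_costs, program_lengths):
    return sum(library_costs) + sum(program_lengths)

# "mail" written out as ATRANS + PTRANS in 10 programs of 6 symbols each:
before = description_length(library_costs=[4], program_lengths=[6] * 10)
# after adding a MAIL operator (library grows by 3, each program shrinks to 4):
after = description_length(library_costs=[4, 3], program_lengths=[4] * 10)
keep_operator = after < before           # compression pressure favors the operator
```

An operator that appears rarely would not recoup its library cost, which is how compression pressure alone filters candidate primitives down to those that genuinely recur across events.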
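为直观说明上文摘要中“睡眠阶段以最小描述长度(MDL)提取重复模式为新操作符”的思路,下面给出一个假设性的极简 Python 示意(非论文原实现,操作符定义与成本模型均为简化虚构):

```python
import math
from collections import Counter

def description_length(library, programs):
    """总描述长度 = 库成本 + 程序编码成本。
    简化假设:操作符定义成本取其定义长度,每次调用按均匀先验计 log2(库大小)。"""
    lib_cost = sum(len(body) for body in library.values())
    call_cost = math.log2(len(library))
    return lib_cost + sum(len(p) * call_cost for p in programs)

def compress_once(library, programs):
    """抽取出现次数最多的相邻操作符对为新操作符,仅当总描述长度下降时采纳。"""
    pairs = Counter()
    for p in programs:
        for a, b in zip(p, p[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return library, programs
    (a, b), _ = pairs.most_common(1)[0]
    new_op = a + "+" + b
    new_library = dict(library)
    new_library[new_op] = library[a] + library[b]

    def rewrite(p):
        out, i = [], 0
        while i < len(p):
            if i + 1 < len(p) and (p[i], p[i + 1]) == (a, b):
                out.append(new_op)
                i += 2
            else:
                out.append(p[i])
                i += 1
        return out

    new_programs = [rewrite(p) for p in programs]
    if description_length(new_library, new_programs) < description_length(library, programs):
        return new_library, new_programs
    return library, programs

# 虚构示例:“ATRANS 后接 PTRANS”反复出现(对应摘要中 "mail" = ATRANS + PTRANS)
library = {"ATRANS": ["own"], "PTRANS": ["loc"], "MTRANS": ["know"], "INGEST": ["eat"]}
programs = [["ATRANS", "PTRANS"] for _ in range(5)] + [["MTRANS"]]
lib2, progs2 = compress_once(library, programs)
```

示例中复合操作符被抽取后,重复事件的编码成本下降超过了新增定义的库成本,因此压缩被采纳——这正是摘要所述压缩压力驱动原语发现的最小版本。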
[NLP-42] MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization ICLR2026
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在长时记忆能力评估方面存在的局限性问题,即现有基准测试主要基于短会话的合成对话,难以真实反映模型在跨领域、长期用户行为模拟中的记忆表现。其解决方案的关键在于构建首个大规模、以用户为中心、跨领域的记忆评估基准——MemoryCD,该基准源自亚马逊评论数据集中用户多年的真实交互行为,而非依赖脚本化人格生成的合成数据。通过在12个不同领域上对14种主流LLM基础模型及6种记忆方法进行多任务、多场景的评估,该研究首次提供了可用于跨域长期个性化建模的系统性测试平台,揭示了当前记忆方法在实际应用中距离用户满意度仍有显著差距。
链接: https://arxiv.org/abs/2603.25973
作者: Weizhi Zhang,Xiaokai Wei,Wei-Chieh Huang,Zheng Hui,Chen Wang,Michelle Gong,Philip S. Yu
机构: Roblox(罗布洛克斯); University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注: Published as a workshop paper in Lifelong Agent @ ICLR 2026
Abstract:Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline of 14 state-of-the-art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent’s ability to simulate real user behaviors in both single and cross-domain settings. Our analysis reveals that existing memory methods are far from user satisfaction in various domains, offering the first testbed for cross-domain life-long personalization evaluation.
[NLP-43] When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models
【速读】: 该论文旨在解决医学领域大语言模型(Large Language Models, LLMs)在不同提示工程(prompt engineering)策略下表现不稳定的问题,尤其是其对提示格式敏感性缺乏系统评估。研究发现,传统在通用领域验证有效的提示方法(如思维链 CoT、少样本示例等)在医疗场景中反而显著降低准确率,并引入位置偏差或决策不一致性;关键解决方案在于采用基于概率的评分机制(如 cloze scoring),通过选择最高对数似然选项标记来挖掘模型潜在知识,其性能超越所有提示策略,且结合排列投票(permutation voting)可进一步提升鲁棒性,揭示出医疗专用 LLM 的内在能力可能被标准生成式输出低估。
链接: https://arxiv.org/abs/2603.25960
作者: Binesh Sadanandan,Vahid Behzadan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models “know” more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.
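摘要中的 cloze 评分(取对数概率最高的选项 token)与排列投票可用如下示意代码说明(假设性实现,对数概率为虚构示例数据,实际需从模型 logits 中读取):

```python
from collections import Counter

def cloze_score(option_logprobs):
    """cloze 评分:直接取对数概率最高的选项 token,绕过文本生成。
    option_logprobs: {选项字母: 模型给该选项 token 的对数概率}(此处为虚构数据)"""
    return max(option_logprobs, key=option_logprobs.get)

def permutation_vote(predictions):
    """排列投票:对多次打乱选项顺序后的预测结果取多数票,缓解位置偏差。"""
    return Counter(predictions).most_common(1)[0][0]

# 虚构的四选项对数概率
logprobs = {"A": -2.31, "B": -0.45, "C": -3.10, "D": -1.87}
pred = cloze_score(logprobs)
```

这一做法解释了摘要中“模型‘知道’的比其生成文本更多”的现象:答案信息已体现在选项 token 的概率分布中,而不必依赖生成过程。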
[NLP-44] Can Small Models Reason About Legal Documents? A Comparative Study
【速读】: 该论文旨在解决前沿大语言模型(Large Language Models, LLMs)在法律应用中部署时面临的成本高、延迟大及数据隐私风险等问题,探索参数规模小于10B的模型是否可作为实用替代方案。其解决方案的关键在于系统性评估九种不同架构与训练质量的子10B模型,在三个法律基准(ContractNLI、CaseHOLD和ECtHR)上结合五种提示策略(直接提示、思维链、少样本提示、BM25检索增强生成和密集检索增强生成)进行405次实验,结果表明:模型架构与训练质量比参数总量更重要;激活仅3B参数的混合专家模型(Mixture-of-Experts)在平均准确率上媲美GPT-4o-mini,并在法律判例识别任务中表现更优;少样本提示策略最为稳定有效,而检索方式(BM25 vs. 密集检索)对性能影响微弱,说明瓶颈在于模型对检索内容的利用能力而非检索本身。
链接: https://arxiv.org/abs/2603.25944
作者: Snehit Vaddi
机构: Together AI; OpenAI; Anthropic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 9 models, 5 prompting strategies, 3 legal benchmarks, 405 experiments
Abstract:Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model’s utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.
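摘要比较了 BM25 与密集检索两种 RAG 方式;作为参考,下面给出 BM25 打分的极简示意实现(简化版,省略分词与预处理,k1、b 取常用默认值,语料为虚构示例):

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """最简 BM25:对每个查询词计算 IDF 与词频饱和项的乘积后求和。"""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)       # 文档频率
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        t = tf[term]
        score += idf * t * (k1 + 1) / (t + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# 虚构的迷你法律语料(已分词)
corpus = [["contract", "breach", "damages"],
          ["holding", "court", "appeal"],
          ["contract", "termination", "notice", "period"]]
q = ["contract", "breach"]
scores = [bm25_score(q, d, corpus) for d in corpus]
```

同时命中两个查询词的文档得分最高,完全未命中的文档得分为零——检索侧差异有限时,瓶颈自然落在模型对检索结果的利用上,与摘要结论一致。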
[NLP-45] Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时计算负载过高的问题,现有软上下文压缩(Soft Context Compression)方法采用统一压缩比,无法适应自然语言中信息密度的显著差异。为克服这一局限,作者提出半动态上下文压缩(Semi-Dynamic Context Compression)框架,其核心创新在于引入一个离散压缩比选择器(Discrete Ratio Selector),该模块基于输入文本的内在信息密度预测最优压缩目标,并将其量化为预定义的离散压缩比率集合;该选择器与压缩器在合成数据上联合训练,以压缩后的摘要长度作为标签代理来优化压缩比预测任务,从而实现更高效且自适应的上下文压缩。
链接: https://arxiv.org/abs/2603.25926
作者: Yijiong Yu,Shuai Yuan,Jie Zheng,Huazheng Wang,Ji Pei
机构: Oregon State University; DeepSolution
类目: Computation and Language (cs.CL)
备注:
Abstract:Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to account for the extreme variance in natural language information density. While adopting a density-aware dynamic compression ratio seems intuitive, empirical investigations reveal that models struggle intrinsically with operations parameterized by input dependent, continuous structural hyperparameters. To resolve this pitfall, we introduce Semi-Dynamic Context Compression framework. Our approach features a Discrete Ratio Selector, which predicts a compression target based on intrinsic information density and quantizes it to a predefined set of discrete compression ratios. It is efficiently jointly trained with the compressor on synthetic data, with the summary lengths as a proxy to create labels for compression ratio prediction. Extensive evaluations confirm that our density-aware framework, utilizing mean pooling as the backbone, consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques. Our code, data and model weights are available at this https URL
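摘要中“将连续压缩目标量化到预定义离散压缩比集合”的做法可作如下示意(假设性实现,离散档位 2/4/8/16 为虚构示例,非论文设定):

```python
def quantize_ratio(predicted_ratio, allowed=(2, 4, 8, 16)):
    """将选择器预测的连续压缩比量化到最接近的预定义离散档位。"""
    return min(allowed, key=lambda r: abs(r - predicted_ratio))

def n_latent_tokens(context_len, predicted_ratio, allowed=(2, 4, 8, 16)):
    """按量化后的压缩比计算潜在 token 数(向上取整,至少 1 个)。"""
    r = quantize_ratio(predicted_ratio, allowed)
    return max(1, -(-context_len // r)), r

# 信息密度较高的上下文:预测压缩比 5.3,量化到 4,1000 token 压为 250 个潜在 token
tokens, ratio = n_latent_tokens(1000, 5.3)
```

将连续超参数离散化,正对应摘要中“模型难以处理输入相关的连续结构超参数”这一观察:选择器只需在有限档位中分类,而非回归任意实数。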
[NLP-46] Methods for Knowledge Graph Construction from Text Collections: Development and Applications
【速读】: 该论文旨在解决从海量非结构化文本数据中自动提取丰富语义知识并构建可解释、可互操作的知识图谱(Knowledge Graph)的问题,尤其针对数字转型 discourse 分析、建筑-工程-施工与运营领域研究趋势挖掘以及电子健康记录和患者药物评论中的因果关系建模等实际应用场景。其解决方案的关键在于融合自然语言处理(Natural Language Processing, NLP)、机器学习(Machine Learning)与生成式 AI(Generative AI)方法,并结合语义网(Semantic Web)最佳实践,设计定制化算法以实现跨文本类型与模式规范的灵活、可扩展的信息抽取,从而构建高质量、语义透明的知识图谱资源。
链接: https://arxiv.org/abs/2603.25862
作者: Vanni Zavarella
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs, that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.
[NLP-47] Gradient-Informed Training for Low-Resource Multilingual Speech Translation
【速读】: 该论文针对低资源多语言语音到文本翻译任务中,因各语言间统一架构共享导致的表示冲突问题展开研究,此类冲突常阻碍模型收敛。解决方案的关键在于通过挖掘训练梯度信息,自动确定每层的共享模式:具体包括三种分析策略——基于距离的语言聚类以识别相似语言组、自任务与跨任务分歧度量用于容量分配优化、以及联合因子分解结合典型相关分析实现子空间对齐。该方法在四个语言对上的实验验证了其在翻译质量指标上的持续提升效果。
链接: https://arxiv.org/abs/2603.25836
作者: Ruiyan Sun,Satoshi Nakamura
机构: School of Data Science; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen
类目: Computation and Language (cs.CL)
备注:
Abstract:In low-resource multilingual speech-to-text translation, uniform architectural sharing across languages frequently introduces representation conflicts that impede convergence. This work proposes a principled methodology to automatically determine layer-specific sharing patterns by mining training gradient information. Our approach employs three distinct analysis strategies: distance-based language clustering, self/cross-task divergence metrics for capacity allocation, and joint factorization coupled with canonical correlation analysis for subspace alignment. Extensive evaluation across four language pairs (using the SeamlessM4T-Medium architecture) demonstrates persistent improvements in translation quality metrics.
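摘要中“基于距离的语言聚类”可用一个假设性示意说明:对每种语言在某层累计的梯度向量计算余弦距离,距离小的语言共享该层参数(阈值与示例向量均为虚构):

```python
import numpy as np

def cosine_distance(g1, g2):
    return 1 - g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))

def cluster_languages(grads, threshold=0.5):
    """grads: {语言: 该语言在某层累计的平均梯度向量}。
    贪心聚类:与簇内所有成员余弦距离均低于阈值的语言加入该簇,否则另起一簇。"""
    clusters = []
    for lang, g in grads.items():
        for c in clusters:
            if all(cosine_distance(g, grads[m]) < threshold for m in c):
                c.append(lang)
                break
        else:
            clusters.append([lang])
    return clusters

# 虚构梯度向量:es 与 pt 方向相近(可共享该层),zh 方向相反(单独建簇)
grads = {"es": np.array([1.0, 0.9, 0.1]),
         "pt": np.array([0.9, 1.0, 0.0]),
         "zh": np.array([-1.0, 0.2, 0.9])}
clusters = cluster_languages(grads)
```

梯度方向一致意味着共享参数时各语言的更新不冲突,这正是以梯度信息决定逐层共享模式的直觉来源。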
[NLP-48] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在从真实数据中生成复杂多面板图表(multi-panel visualizations)方面能力不足的问题,尤其是缺乏对模型在真实场景下进行迭代式代码优化和多轮对话交互能力的系统性评估。其解决方案的关键在于提出首个大规模基准测试集 RealChart2Code,该数据集包含超过2,800个基于真实数据集的实例,具有明确的分析意图,并首次实现了对VLM在原始数据驱动下的图表生成能力和多轮对话场景中的代码迭代优化能力的全面评估。这一设计使得研究者能够更准确地衡量VLM在复杂可视化任务中的实际表现,揭示了现有模型在处理复杂图表结构和真实数据时存在的显著性能瓶颈。
链接: https://arxiv.org/abs/2603.25804
作者: Jiajun Zhang,Yuying Li,Zhixun Li,Xingyu Guo,Jingzhuo Wu,Leqi Zheng,Yiran Yang,Jianke Zhang,Qingbin Li,Shannan Yan,Zhetong Li,Changguo Jia,Junfei Wu,Zilei Wang,Qiang Liu,Liang Wang
机构: USTC(中国科学技术大学); THU(清华大学); CUHK(香港中文大学); UCAS(中国科学院大学); CASIA(中国科学院自动化研究所); BNU(北京师范大学); BUPT(北京邮电大学); BIT(北京理工大学); PKU(北京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at this https URL.
[NLP-49] Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition
【速读】: 该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MCER)中因环境噪声和数据质量不均衡导致的特征污染与融合偏差问题,进而影响整体识别性能。其核心解决方案在于提出一种关系感知的去噪与扩散注意力融合模型:首先设计差分Transformer以显式计算两个注意力图之间的差异,从而增强时序一致性信息并抑制无关噪声,实现对音频和视频模态的有效去噪;其次构建模态特异性与跨模态关系子图,捕捉说话者依赖的情感关联,精细建模模态内与模态间的复杂关系;最后引入文本引导的跨模态扩散机制,利用自注意力建模模态内依赖,并自适应地将视听信息扩散至文本流,确保更鲁棒且语义对齐的多模态融合。
链接: https://arxiv.org/abs/2603.25752
作者: Ying Liu,Yuntao Shou,Wei Ai,Tao Meng,Keqin Li
机构: Central South University of Forestry and Technology (中南林业科技大学); State University of New York (纽约州立大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 19 pages
Abstract:In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for MCER. Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.
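摘要中“显式计算两个注意力图之差以抑制共有噪声”的差分 Transformer 思路可作如下 NumPy 示意(假设性单头实现,λ 固定为常数,非论文原代码):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.8):
    """差分注意力:两个注意力图相减,λ 控制扣除比例;
    两图共有的(与时间无关的)噪声分量在相减中被抵消。"""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    attn = a1 - lam * a2
    return attn @ v, attn

# 虚构的 4 个时间步、8 维特征
rng = np.random.default_rng(42)
T, d = 4, 8
q1, k1, q2, k2 = (rng.normal(size=(T, d)) for _ in range(4))
v = rng.normal(size=(T, d))
out, attn = differential_attention(q1, k1, q2, k2, v)
```

由于两个 softmax 图各行均归一化为 1,差分图每行恰好和为 1−λ:注意力总量被压缩,保留的是两图之间的差异信号。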
[NLP-50] Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models
【速读】: 该论文旨在解决量子语言模型(Quantum Language Models, QLMs)是否真正利用了量子资源,还是仅将经典计算嵌入量子硬件这一关键问题。其解决方案的关键在于首次开展对QLMs的机制可解释性研究,通过因果门消融、纠缠度跟踪以及密度矩阵交换干预等方法,在一个受控的长程依赖任务中系统分析模型内部学习的记忆策略。研究发现:单量子比特模型完全可被经典模拟,并收敛至与经典基线相同的几何策略;而含纠缠门的双量子比特模型则学习到一种在表示上截然不同的策略,即通过量子比特间的纠缠编码上下文信息,并经三项独立因果测试验证(p < 0.0001, d = 0.89)。此外,在真实量子硬件上,仅有经典几何策略能在设备噪声下存活,纠缠策略退化至随机水平,揭示了噪声-表达力权衡(noise-expressivity tradeoff)对模型策略生存能力的决定作用。
链接: https://arxiv.org/abs/2603.26494
作者: Nathan Roll
机构: Stanford University (斯坦福大学)
类目: Quantum Physics (quant-ph); Computation and Language (cs.CL)
备注: 9 pages, 5 figures, 7 tables
Abstract:Quantum language models have shown competitive performance on sequential tasks, yet whether trained quantum circuits exploit genuinely quantum resources – or merely embed classical computation in quantum hardware – remains unknown. Prior work has evaluated these models through endpoint metrics alone, without examining the memory strategies they actually learn internally. We introduce the first mechanistic interpretability study of quantum language models, combining causal gate ablation, entanglement tracking, and density-matrix interchange interventions on a controlled long-range dependency task. We find that single-qubit models are exactly classically simulable and converge to the same geometric strategy as matched classical baselines, while two-qubit models with entangling gates learn a representationally distinct strategy that encodes context in inter-qubit entanglement – confirmed by three independent causal tests (p < 0.0001, d = 0.89). On real quantum hardware, only the classical geometric strategy survives device noise; the entanglement strategy degrades to chance. These findings open mechanistic interpretability as a tool for the science of quantum language models and reveal a noise-expressivity tradeoff governing which learned strategies survive deployment.
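摘要中跟踪的量子比特间纠缠通常用纠缠熵量化。下面是两量子比特纯态纠缠熵的最小示意计算(标准做法,与论文具体电路无关):

```python
import numpy as np

def entanglement_entropy(state):
    """两量子比特纯态的纠缠熵(单位:比特):对 qubit1 求偏迹,
    取约化密度矩阵的冯·诺依曼熵。0 表示无纠缠,1 表示最大纠缠。"""
    psi = np.asarray(state, dtype=complex).reshape(2, 2)  # 指标 (qubit0, qubit1)
    rho0 = psi @ psi.conj().T                             # 偏迹掉 qubit1
    evals = np.linalg.eigvalsh(rho0).real
    evals = evals[evals > 1e-12]                          # 丢弃数值零特征值
    return float(-(evals * np.log2(evals)).sum())

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)  # (|00> + |11>)/√2,最大纠缠
prod = np.array([1, 0, 0, 0])               # |00>,直积态,无纠缠
```

纠缠熵为 0 的态可被单比特模型精确经典模拟,而熵接近 1 的态才可能承载摘要所述“以纠缠编码上下文”的记忆策略。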
信息检索
[IR-0] Analysing Calls to Order in German Parliamentary Debates LREC2026
【速读】:该论文旨在解决议会中不文明行为(incivility)的量化与分类问题,尤其是针对德国联邦议院(Bundestag)中“要求秩序”(calls to order, CtO)这一正式规范违反指标的系统性研究空白。其关键解决方案在于提出一种基于规则的检测与标注方法,构建了一个涵盖72年德国议会辩论的新型标注数据集,并开发了首个用于CtO触发因素的分类体系,从而揭示了CtO的主观性、政治动态影响及性别和党派差异等结构性特征。
链接: https://arxiv.org/abs/2603.26430
作者: Nina Smirnova,Daniel Dan,Philipp Mayr
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: The paper is accepted to the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026) co-located with LREC 2026
Abstract:Parliamentary debate constitutes a central arena of political power, shaping legislative outcomes and public discourse. Incivility within this arena signals political polarization and institutional conflict. This study presents a systematic investigation of incivility in the German Bundestag by examining calls to order (CtO; plural: CtOs) as formal indicators of norm violations. Despite their relevance, CtOs have received little systematic attention in parliamentary research. We introduce a rule-based method for detecting and annotating CtOs in parliamentary speeches and present a novel dataset of German parliamentary debates spanning 72 years that includes annotated CtO instances. Additionally, we develop the first classification system for CtO triggers and analyze the factors associated with their occurrence. Our findings show that, despite formal regulations, the issuance of CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. An insult towards individuals is the most frequent cause of CtO. In general, male members and those belonging to opposition parties receive more calls to order than their female and coalition-party counterparts. Most CtO triggers were detected in speeches dedicated to governmental affairs and actions of the presidency. The CtO triggers dataset is available at: this https URL.
[IR-1] Demystifying Funding: Reconstructing a Unified Dataset of the UK Funding Lifecycle
【速读】:该论文旨在解决科研资助生命周期中数据碎片化的问题,即现有研究多聚焦于已获资助项目的产出,而忽视了从资助机会发布到项目立项决策的完整流程。其关键解决方案是整合三个此前相互孤立的数据源:UKRI的Gateway to Research(GtR)项目数据库、UKRI资助机会信息以及各研究理事会关于竞争性资助决策的记录,从而构建一个覆盖整个资助周期的统一数据集。这一方法克服了出版格式不一致和面板决策数据受限等技术挑战,实现了对科研资助全过程的系统性分析与可视化。
链接: https://arxiv.org/abs/2603.26426
作者: William Thorne,Rupert Shepherd,Diana Maynard
机构: 未知
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: Accepted at NSLP 2026
Abstract:We present a reconstruction of UKRI’s Gateway to Research (GtR) database that links funding opportunities to their resulting project proposals through panel meeting outcomes. Unlike existing work that focuses primarily on funded projects and their outcomes, we close the complete funding lifecycle by integrating three previously disconnected data sources: the GtR project database, UKRI funding opportunities, and competitive funding decision records across UKRI’s research councils. We describe the technical challenges of data collection, including navigating inconsistent publication formats and restricted access to panel decisions. The resulting dataset enables a holistic interrogation of the entire funding process, from opportunity announcement to research outcomes. We release the database and associated code.
[IR-2] Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models ECIR2026
【速读】:该论文旨在解决Late Interaction检索模型中两个关键问题:一是多向量评分(multi-vector scoring)引入的长度偏差(length bias),二是MaxSim操作符在聚合最高相似度得分后,其余相似度分布是否蕴含显著趋势。研究发现,因果性Late Interaction模型的理论长度偏差在实践中确实存在,而双向模型在极端情况下也可能受到类似影响;同时,实验表明除Top-1文档token外,其他相似度得分并无明显趋势,验证了MaxSim操作符能高效利用token级相似度信息,是当前最优的聚合策略。
链接: https://arxiv.org/abs/2603.26259
作者: Antoine Edy,Max Conti,Quentin Macé
机构: Illuin Technology(伊鲁因科技)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at The 1st Late Interaction Workshop (LIR) @ ECIR 2026
Abstract:While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark. Results show that while the theoretical length bias of causal Late Interaction models holds in practice, bi-directional models can also suffer from it in extreme cases. We also note that no significant similarity trend lies beyond the top-1 document token, validating that the MaxSim operator efficiently exploits the token-level similarity scores.
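摘要讨论的 MaxSim 算子与长度偏差可作如下示意:每个查询 token 取其与文档各 token 相似度的最大值再求和;由于追加文档 token 只会保持或提高各查询 token 的最大值,得分对文档长度单调不减(示例向量均为虚构):

```python
import numpy as np

def normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def maxsim_score(Q, D):
    """ColBERT 式 MaxSim:token 级相似度矩阵按查询 token 取最大值后求和。"""
    sim = Q @ D.T                       # (查询 token 数, 文档 token 数)
    return float(sim.max(axis=1).sum())

Q = normalize(np.array([[1.0, 0.2], [0.1, 1.0]]))         # 2 个查询 token(虚构)
D_rel = normalize(np.array([[0.9, 0.1], [0.2, 0.8]]))     # 相关文档
D_irr = normalize(np.array([[-0.8, -0.5], [-0.1, -0.9]])) # 不相关文档
# 追加一个 token:各查询 token 的最大相似度只增不减,体现得分的长度单调性
D_long = np.vstack([D_rel, normalize(np.array([[0.5, 0.5]]))])
```

这种逐 token 取最大值的聚合方式,也正是摘要所验证的“top-1 之外的相似度无显著趋势,MaxSim 已高效利用 token 级信息”的对象。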
[IR-3] Rethinking Recommendation Paradigms: From Pipelines to Agentic Recommender Systems
【速读】:该论文旨在解决大规模工业推荐系统中静态架构带来的局限性问题,即传统多阶段(召回、排序、重排序)或“单模型”设计本质上是固定的,依赖人工假设与工程调优,难以在异构数据和多目标业务约束下实现高效扩展。解决方案的关键在于提出一种代理型推荐系统(Agentic Recommender System, AgenticRS),将具备功能闭环、可独立评估和可演化决策空间的核心模块重构为“代理”(agent),并引入两类自进化机制:一是在定义明确的动作空间中采用类似强化学习的优化方式;二是在开放设计空间中利用大语言模型生成与选择新型模型架构和训练方案。同时区分单个代理的个体演化与多个代理间组合演化的层级结构,并通过内外层奖励机制耦合局部优化与全局目标,从而实现从静态流水线向自演化推荐系统的范式转变。
链接: https://arxiv.org/abs/2603.26100
作者: Jinxin Hu,Hao Deng,Lingyu Mu,Hao Zhang,Shizhun Wang,Yu Zhang,Xiaoyi Zeng
机构: Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); University of Chinese Academy (中国科学院大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large-scale industrial recommenders typically use a fixed multi-stage pipeline (recall, ranking, re-ranking) and have progressed from collaborative filtering to deep and large pre-trained models. However, both multi-stage and so-called One Model designs remain essentially static: models are black boxes, and system improvement relies on manual hypotheses and engineering, which is hard to scale under heterogeneous data and multi-objective business constraints. We propose an Agentic Recommender System (AgenticRS) that reorganizes key modules as agents. Modules are promoted to agents only when they form a functionally closed loop, can be independently evaluated, and possess an evolvable decision space. For model agents, we outline two self-evolution mechanisms: reinforcement learning style optimization in well-defined action spaces, and large language model based generation and selection of new architectures and training schemes in open-ended design spaces. We further distinguish individual evolution of single agents from compositional evolution over how multiple agents are selected and connected, and use a layered inner and outer reward design to couple local optimization with global objectives. This provides a concise blueprint for turning static pipelines into self-evolving agentic recommender systems.
[IR-4] AgenticRS-Architecture: System Design for Agentic Recommender Systems
【速读】:该论文旨在解决工业级推荐系统在全生命周期中面临的自动化程度低、模型迭代效率差以及跨模块协同困难的问题。传统推荐系统通常依赖固定的记忆召回与排序流水线,难以实现持续优化和动态适应。其解决方案的关键在于提出AutoModel架构,该架构以代理(agent)为基础,构建了一个具备长期记忆与自我进化能力的多智能体系统,包含AutoTrain(模型设计与训练)、AutoFeature(特征演化)和AutoPerf(性能监控与部署)三个核心代理,并通过共享的协调与知识层实现全局对齐与决策记录,从而实现从方法解析到代码生成、大规模训练及离线评估的闭环自动化,显著降低人工干预成本并提升系统整体演化效率。
链接: https://arxiv.org/abs/2603.26085
作者: Hao Zhang,Jinxin Hu,Hao Deng,Lingyu Mu,Shizhun Wang,Yu Zhang,Xiaoyi Zeng
机构: Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); University of Chinese Academy (中国科学院大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:AutoModel is an agent based architecture for the full lifecycle of industrial recommender systems. Instead of a fixed recall and ranking pipeline, AutoModel organizes recommendation as a set of interacting evolution agents with long term memory and self improvement capability. We instantiate three core agents along the axes of models, features, and resources: AutoTrain for model design and training, AutoFeature for data analysis and feature evolution, and AutoPerf for performance, deployment, and online experimentation. A shared coordination and knowledge layer connects these agents and records decisions, configurations, and outcomes. Through a case study of a module called paper autotrain, we show how AutoTrain automates paper driven model reproduction by closing the loop from method parsing to code generation, large scale training, and offline comparison, reducing manual effort for method transfer. AutoModel enables locally automated yet globally aligned evolution of large scale recommender systems and can be generalized to other AI systems such as search and advertising.
[IR-5] Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management
【速读】:该论文旨在解决机场运营文档中因术语复杂、监管严格、区域信息专有及多方沟通碎片化所导致的数据孤岛与语义不一致问题,从而阻碍了整体机场管理(Total Airport Management, TAM)的实施。其解决方案的关键在于提出一种双阶段融合的方法论框架,通过符号知识工程(Knowledge Engineering, KE)与生成式大语言模型(Generative Large Language Models, LLMs)的协同作用构建可机器读取的知识图谱(Knowledge Graph, KG)。该框架采用分层融合策略,由专家定制的KE结构引导LLM提示词以发现语义对齐的知识三元组,并结合概率性发现模型与确定性锚定算法,确保每条抽取信息均能追溯至原始来源,实现高保真溯源与可验证性,有效弥合“黑箱”生成输出与运行工具所需透明度之间的差距。
链接: https://arxiv.org/abs/2603.26076
作者: Darryl Teo,Adharsha Sam,Chuan Shen Marcus Koh,Rakesh Nagi,Nuno Antunes Ribeiro
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Documentation of airport operations is inherently complex due to extensive technical terminology, rigorous regulations, proprietary regional information, and fragmented communication across multiple stakeholders. The resulting data silos and semantic inconsistencies present a significant impediment to the Total Airport Management (TAM) initiative. This paper presents a methodological framework for constructing a domain-grounded, machine-readable Knowledge Graph (KG) through a dual-stage fusion of symbolic Knowledge Engineering (KE) and generative Large Language Models (LLMs). The framework employs a scaffolded fusion strategy in which expert-curated KE structures guide LLM prompts to facilitate the discovery of semantically aligned knowledge triples. We evaluate this methodology on the Google LangExtract library and investigate the impact of context window utilization by comparing localized segment-based inference with document-level processing. Contrary to prior empirical observations of long-context degradation in LLMs, document-level processing improves the recovery of non-linear procedural dependencies. To ensure the high-fidelity provenance required in airport operations, the proposed framework fuses a probabilistic model for discovery and a deterministic algorithm for anchoring every extraction to its ground source. This ensures absolute traceability and verifiability, bridging the gap between “black-box” generative outputs and the transparency required for operational tooling. Finally, we introduce an automated framework that operationalizes this pipeline to synthesize complex operational workflows from unstructured textual corpora.
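摘要中“用确定性算法将每条抽取结果锚定到原文出处”的思路可作如下极简示意(假设性实现,基于精确子串匹配,实际系统还需处理文本归一化与模糊匹配;示例文本为虚构):

```python
def anchor_extractions(source, extractions):
    """确定性锚定:为每条抽取文本在原文中定位精确字符区间(span),
    保证逐条可追溯;找不到精确匹配的条目标记为未锚定(span=None)。"""
    anchored = []
    for text in extractions:
        start = source.find(text)
        span = None if start == -1 else (start, start + len(text))
        anchored.append({"text": text, "span": span})
    return anchored

doc = "Runway 02L closed for maintenance. Taxiway B available."
result = anchor_extractions(doc, ["Runway 02L closed", "Gate D5 open"])
```

凡无法锚定到原文的抽取(如第二条)即可被拒绝或送审,这正是摘要所述以确定性锚定弥合“黑箱”生成输出与可验证性之间差距的最小机制。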
人机交互
[HC-0] Sticky and Magnetic: Evaluating Error Correction and User Adaptation in Gaze and Pinch Interaction
【速读】:该论文旨在解决虚拟现实(VR)中注视-捏合(gaze-and-pinch)交互模态因协调误差导致的选择失败问题,尤其是早期触发错误(early triggers,即捏合动作先于注视到达目标)缺乏有效解决方案的现状。其关键解决方案在于引入两种启发式策略:STICKY选择(时间缓冲机制)和MAGNETIC选择(空间场机制),通过改变用户行为模式来降低误操作率——其中MAGNETIC选择进一步实现了“卸载效应”(offloading effect),使用户在精度与速度之间进行权衡,从而提升交互效率与认知自主性。
链接: https://arxiv.org/abs/2603.26608
作者: Jazmin Collins,Prasanthi Gurumurthy,Eric J. Gonzalez,Mar Gonzalez-Franco
机构: Cornell University (康奈尔大学); Google (谷歌)
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注: 5 pages, 5 figures, Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain. ACM
Abstract:The gaze-and-pinch framework offers a high-fidelity interaction modality for spatial computing in virtual reality (VR), yet it remains vulnerable to coordination errors–timing misalignments between gaze fixation and pinch gestures. These errors are categorized into two types: late triggers (gaze leaves a target before pinch) and early triggers (pinch before gaze arrival on target). While late triggers are well-studied, early triggers lack robust solutions. We investigate two heuristics–STICKY selection (temporal buffer) and MAGNETIC selection (spatial field)–to mitigate these errors. A within-subjects study (N = 9) on the Samsung Galaxy XR evaluated these heuristics against a baseline. Findings indicate that while throughput and selection time remained stable, the heuristics fundamentally shifted user behavior and significantly reduced errors during selection. Notably, MAGNETIC selection induced an “offloading” effect where users traded precision for speed. Additionally, the heuristics reclassified ambiguous failures as explainable coordination errors. We provide recommendations for selection heuristics that enhance interaction speed and cognitive agency in virtual reality.
[HC-1] Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation
【速读】:该论文旨在解决生物医学时间序列数据标注中标签准确性不足的问题,尤其是在标注预算有限且依赖人工标注的情况下。其关键解决方案是引入基于二维可视化(2D Visualization, 2DV)的样本选择方法,通过辅助标注者探索高维数据的互补性二维表示来提升标注效率与质量。实验表明,在多标注者聚合场景下,2DV方法在婴儿运动评估(IMA)和语音情绪识别(SER)任务中均表现最优;而在个体标注者训练模型时,FAFT方法在稀有类别捕捉上更具稳定性,而RND则在标注者数量或能力不确定时最为稳健。整体而言,2DV显著提升了标注过程的可解释性和标注者体验,尤其适用于标注预算相对充裕的生物医学场景。
链接: https://arxiv.org/abs/2603.26592
作者: Einari Vaaras,Manu Airaksinen,Okko Räsänen
机构: Tampere University (坦佩雷大学); University of Helsinki (赫尔辛基大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators’ labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
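摘要中作为对比方法的最远优先遍历(farthest-first traversal, FAFT)本身是一种经典的贪心覆盖采样:每次选取与已选样本集"最近距离"最大的样本,因此容易选中离群点与多样化样本。下面给出一个与论文实现无关的最小 Python 示意(一维数据与距离函数均为本文假设):

```python
def farthest_first_traversal(points, k, dist):
    """贪心选取 k 个彼此"铺开"的样本(Gonzalez 的 k-center 启发式)。"""
    selected = [points[0]]          # 任取首个样本作为起点
    remaining = list(points[1:])
    while len(selected) < k and remaining:
        # 取"到已选集合的最近距离"最大的候选样本
        best = max(remaining, key=lambda p: min(dist(p, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# 示例: 一维点集上先取起点 0.0, 再取最远的 10.0, 最后取覆盖中段的 2.0
picked = farthest_first_traversal([0.0, 1.0, 2.0, 10.0], 3,
                                  dist=lambda a, b: abs(a - b))
```

该贪心策略是 k-center 问题的 2-近似算法,这种"铺开"特性也是此类方法常被用于有限标注预算下覆盖多样样本的原因。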
[HC-2] User Involvement in Robotic Wheelchair Development: A Decade of Limited Progress
【速读】:该论文试图解决的问题是:尽管机器人轮椅(Robotic Wheelchairs, RWs)在提升行动障碍者自主性和参与度方面具有巨大潜力,但多数系统未能实现持续的现实世界应用,其根源在于用户参与不足。论文通过叙述性文献综述(narrative literature review)发现,用户参与主要集中在开发后期的评估阶段,而非早期需求定义或迭代式共同设计阶段,且多数研究样本量小、方法不统一、缺乏代表性,同时研究团队以工程背景为主且集中于高收入国家,导致用户体验与实际需求脱节。解决方案的关键在于:将参与式方法(participatory methodologies)系统性地嵌入设计生命周期全过程,并识别和克服制约有意义用户参与的系统性障碍,从而提升RWs的可用性与实际采纳率。
链接: https://arxiv.org/abs/2603.26543
作者: Mario Andres Chavarria,Santiago Price Torrendell,Aude Billard,Samia Hurst,Sébastien Kessler,Michael Stein,Kenji Suzuki,Sophie Weerts,Diego Paez-Granados,Minerva Rivas Velarde
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:Robotic wheelchairs (RWs) offer significant potential to enhance autonomy and participation for people with mobility impairments, yet many systems have failed to achieve sustained real-world adoption. This narrative literature review examined the extent and quality of end-user involvement in RW design, development, and evaluation over the past decade (2015–2025), assessed against core principles shared by major user-involvement approaches (e.g., user-/human-centered design, participatory/co-design, and inclusive design). The findings indicate that user involvement remains limited and is predominantly concentrated in late-stage evaluation rather than in early requirements definition or iterative co-design. Of the 399 records screened, only 23 studies (about 6%) met the inclusion criteria of verifiable end-user involvement, and many relied on small samples, often around ten participants, with limited justification for sample size selection, proxy users, laboratory-based validation, and non-standardized feedback methods. Research teams were largely engineering-dominated (about 89%) and geographically concentrated in high-income countries. Despite strong evidence that sustained user engagement improves usability and adoption in assistive technology, its systematic implementation in RW research remains rare. Advancing the field requires embedding participatory methodologies throughout the design lifecycle and addressing systemic barriers that constrain meaningful user involvement.
[HC-3] CR-Eyes: A Computational Rational Model of Visual Sampling Behavior in Atari Games
【速读】:该论文旨在解决现有计算用户模型在模拟真实动态交互环境中的局限性问题,即这些模型要么依赖于手工设计的任务表示,要么仅适用于静态或非交互式视觉输入,难以应用于像素级、实时互动的环境(如Atari游戏)。其解决方案的关键在于提出CR-Eyes模型,这是一个基于强化学习训练的计算理性模型,能够在感知与认知约束下联合学习注视位置与动作选择策略,并将眼动视为目标导向的行为而非孤立的显著性预测,从而显式闭合感知-行动回路,实现对人类视觉采样和游戏行为的有效模拟。
链接: https://arxiv.org/abs/2603.26527
作者: Martin Lorenz,Niko Konzack,Alexander Lingler,Philipp Wintersberger,Patrick Ebel
机构: ScaDS.AI, Leipzig University (莱比锡大学); Interdisciplinary Transformation University (跨学科转型大学)
类目: Human-Computer Interaction (cs.HC)
备注: 6 pages, 5 figures, CHI’26 Extended Abstract (Poster)
Abstract:Designing mobile and interactive technologies requires understanding how users sample dynamic environments to acquire information and make decisions under time pressure. However, existing computational user models either rely on hand-crafted task representations or are limited to static or non-interactive visual inputs, restricting their applicability to realistic, pixel-based environments. We present CR-Eyes, a computationally rational model that simulates visual sampling and gameplay behavior in Atari games. Trained via reinforcement learning, CR-Eyes operates under perceptual and cognitive constraints and jointly learns where to look and how to act in a time-sensitive setting. By explicitly closing the perception-action loop, the model treats eye movements as goal-directed actions rather than as isolated saliency predictions. Our evaluation shows strong alignment with human data in task performance and aggregate saliency patterns, while also revealing systematic differences in scanpaths. CR-Eyes is a step toward scalable, theory-grounded user models that support design and evaluation of interactive systems.
[HC-4] Exploring a Design Framework for Children’s Agency through Participatory Design
【速读】:该论文旨在解决当前设计实践中对儿童在数字系统(尤其是儿童-AI交互场景)中自主性(agency)的理解与操作化不足的问题。解决方案的关键在于提出并验证了一个“设计-代理框架”(design-for-agency framework),该框架通过参与式工作坊的形式,帮助设计师将基于经验的隐性判断转化为对代理权权衡关系的显性表达,从而在复杂的设计情境中系统性地考虑儿童的代理权问题。
链接: https://arxiv.org/abs/2603.26523
作者: Boyin Yang,Jun Zhao
机构: University of Oxford (牛津大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Children’s agency plays a critical role in shaping children’s autonomy, participation, and well-being in their interactions with digital systems, particularly in emerging child-AI contexts. However, how designers currently understand and reason about children’s agency in practice remains underexplored. In this paper, we examine designers’ engagement with children’s agency through a participatory workshop in which we introduce a design-for-agency framework that supports designers in externalising the consideration of agency in their design contexts. We find that while participants are committed to implementing ethical AI systems for children, they often struggle to understand why agency matters and how it can be operationalised in practice. Our agency design framework provided designers with a structured way to translate implicit, experience-based judgments into explicit articulation of agency trade-offs while acknowledging the associated design complexity. We conclude by offering initial insights into supporting designers’ reasoning about children’s agency and outlining directions for future research.
[HC-5] Shaping Credibility Judgments in Human-GenAI Partnership via Weaker LLMs: A Transactive Memory Perspective on AI Literacy
【速读】:该论文试图解决的问题是:当前生成式 AI (Generative AI) 在高等教育中日益作为知识伙伴使用,但现有 AI 素养框架多关注学习者“应该做什么”,而忽视了学生与 AI 日常协作中具体实践方式的可观察性与可干预性。为填补这一空白,研究将学生与 GenAI 的互动建模为一种交互记忆系统(transactive memory system),其中可信度调节依赖程度和验证行为。解决方案的关键在于:通过设计特定工作流程(如反思优先、验证强制)并嵌入问责提示(accountability cues),配合使用性能较弱的大语言模型(LLM)以维持学生的主动验证动机,在不削弱学习支持的前提下重构学生的可信度判断;实证结果显示,“反思优先”条件显著降低学生对 GenAI 的依赖,表明流程顺序与模型强度的选择能有效引导 AI 使用中的认知责任分配。
链接: https://arxiv.org/abs/2603.26522
作者: Md Touhidul Islam,Mahir Akgun,Syed Billah
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注: 14 pages, Accepted in AIED’26
Abstract:Generative AI (GenAI) is increasingly used as a knowledge partner in higher education, raising the need for instructional designs that emphasize AI literacy practices such as evaluating output credibility and maintaining human accountability. Existing AI literacy frameworks focus more on what learners should do than on how these practices are enacted in routine student-GenAI collaboration. We address this gap by framing student-GenAI interaction as a transactive memory partnership, where credibility regulates reliance and verification. To make this process visible during coursework, we used a weaker large language model (LLM): small enough to run on most students’ computers during class, helpful enough to support learning, but not so capable that it removes the need for verification. In an undergraduate STEM course, students were randomly assigned to one of three conditions across repeated activities: reflection-first (think first, then consult AI), verification-required (use AI, then evaluate the output), or control (unrestricted use). Students completed a transactive memory survey at three time points (N = 42). Weighted credibility diverged by condition over time. ANCOVA controlling for baseline credibility showed a condition effect at mid-semester, F(2, 38) = 4.02, p = .026, partial eta squared = .175, and a stronger effect at post-intervention, F(2, 38) = 5.48, p = .008, partial eta squared = .224; adjusted means were lowest in reflection-first, intermediate in verification-required, and highest in control. Parallel analyses of specialization and coordination were not significant. These findings suggest that workflow sequencing, deliberate use of weaker LLMs, and accountability cues embedded in assignment instructions can recalibrate students’ credibility judgments in GenAI use, with reflection-first producing the strongest downward shift in reliance.
[HC-6] Characterizing Scam-Driven Human Trafficking Across Chinese Borders and Online Community Responses on RedNote
【速读】:该论文试图解决的是中国边境地区新兴的“诈骗驱动型人口贩卖”问题,即通过虚假就业承诺诱骗中国公民前往东南亚,继而强迫其参与网络诈骗活动的现象。这一问题具有严重的社会危害性,但此前在学术研究中尚未得到充分关注。论文的关键解决方案在于通过分析158篇社交媒体平台RedNote上的帖子,揭示了犯罪团伙如何利用文化纽带和心理操控手段实施招募与控制,并指出受害者在回归社会过程中面临的家庭排斥等结构性障碍。研究强调,有效的应对策略需结合社区层面的保护机制、平台治理责任强化以及跨国协作框架的构建,以形成多维度的预防与干预体系。
链接: https://arxiv.org/abs/2603.26520
作者: Jiamin Zheng,Yue Deng,Jessica Chen,Shujun Li,Yixin Zou,Jingjie Li
机构: University of Edinburgh(爱丁堡大学); The Hong Kong University of Science and Technology(香港科技大学); University of Kent(肯特大学); Max Planck Institute for Security and Privacy(马克斯·普朗克信息安全研究所)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted at the CHI Conference on Human Factors in Computing Systems (CHI 2026)
Abstract:A new form of human trafficking has emerged across Chinese borders, where individuals are lured to Southeast Asia with fraudulent job offers and then coerced into operating online scams. Despite its massive economic and human toll, this scam-driven trafficking remains underexplored in academic research. Through qualitative analysis of 158 RedNote posts, we examined how Chinese online communities respond to this threat. Our findings reveal that perpetrators exploit cultural ties to recruit victims for cybercriminal roles within self-sustaining compounds, using sophisticated manipulation tactics. Survivors face serious reintegration barriers, including family rejection, as the cultural values that enable trafficking also hinder their recovery. While communities present protective strategies, efforts are complicated by doubts about the reliability of support and cross-border coordination. We discuss key implications for prevention, platform governance, and international cooperation against scam-driven trafficking. Warning: This paper contains descriptions of physical, psychological, and sexual abuse.
[HC-7] Beyond Banning AI: A First Look at GenAI Governance in Open Source Software Communities
【速读】:该论文试图解决开源软件(OSS)中生成式 AI(GenAI)治理缺乏系统性框架与实践共识的问题,即当前社区对 GenAI 的管理策略分散且未形成统一的认知体系。其解决方案的关键在于通过多阶段分析67个高可见度 OSS 项目中的定性材料,识别出 GenAI 在贡献流程中的共性关切,提炼出三种治理取向,并进一步映射出12种治理策略及其实施模式,从而揭示 GenAI 治理远不止于“是否禁止”,而需在责任归属、验证机制、审查能力、代码溯源和平台基础设施等多个维度进行协同响应,最终为研究者提供概念基准,为维护者和平台设计者提供实践参考。
链接: https://arxiv.org/abs/2603.26487
作者: Wenhao Yang,Runzhi He,Minghui Zhou
机构: Peking University (北京大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative AI (GenAI) is playing an increasingly important role in open source software (OSS). Beyond completing code and documentation, GenAI is increasingly involved in issues, pull requests, code reviews, and security reports. Yet, cheaper generation does not mean cheaper review - and the resulting maintenance burden has pushed OSS projects to experiment with GenAI-specific rules in contribution guidelines, security policies, and repository instructions, even including a total ban on AI-assisted contributions. However, governing GenAI in OSS is far more than a ban-or-not question. The responses remain scattered, with neither a shared governance framework in practice nor a systematic understanding in research. Therefore, in this paper, we conduct a multi-stage analysis on various qualitative materials related to GenAI governance retrieved from 67 highly visible OSS projects. Our analysis identifies recurring concerns across contribution workflows, derives three governance orientations, and maps out 12 governance strategies and their implementation patterns. We show that governing GenAI in OSS extends well beyond banning - it requires coordinated responses across accountability, verification, review capacity, code provenance, and platform infrastructure. Overall, our work distills dispersed community practices into a structured overview, providing a conceptual baseline for researchers and a practical reference for maintainers and platform designers.
[HC-8] “Law at Your Fingertips”: Understanding Legal Information Seeking on Video-Sharing Platforms in China
【速读】:该论文旨在解决传统文本-based 法律信息检索平台难以满足公众在法律信息寻求过程中对高风险、紧迫性及情感支持的特殊需求的问题。其解决方案的关键在于揭示视频分享平台(Video-Sharing Platforms, VSPs)如何通过其特有的交互 affordances(可及性特征),帮助用户缓解认知不适(epistemic discomfort),并建立信任与参与感,从而更有效地获取、理解与评估法律信息。研究进一步指出,VSPs 不仅提供了更具共情性的知识传播方式,还促进了用户在启发式处理(heuristic processing)与系统性处理(systematic processing)之间的平衡,为设计可信的公民信息基础设施和构建安全、高效的数字空间信息环境提供了实证依据与设计启示。
链接: https://arxiv.org/abs/2603.26420
作者: Zhiyang Wu,Junliang Chen,Qian Wan,Qing Xiao,Piaohong Wang,Ge Gao,Zhicong Lu
机构: City University of Hong Kong(香港城市大学); Xi’an Jiaotong University(西安交通大学); Carnegie Mellon University(卡内基梅隆大学); University of Maryland(马里兰大学); George Mason University(乔治梅森大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: 25 pages, 1 figure; Accepted by ACM CSCW 2026. To appear in Proceedings of the ACM on Human-Computer Interaction (CSCW)
Abstract:Equipping laypeople with the capabilities to seek legal information has been an important goal for Legal Empowerment in modern society. However, unlike general information-seeking behaviors, legal information seeking is characterized by high stakes, urgency, and a critical need for emotional support, which traditional text-based searching platforms struggle to satisfy. In recent years, people have been increasingly turning to Video-Sharing Platforms (VSPs) for access to legal information and to fulfill their legal needs. Despite the importance of this shift, such VSP-mediated legal information-seeking practices remain underexplored. Through an observational analysis of legal content on two VSPs (Douyin and Bilibili) and interviews with 20 Chinese information seekers, this study examined the practices and challenges associated with seeking, comprehending, and evaluating legal information on VSPs. We further revealed the formation of trust and engagement on the VSP-based legal knowledge-sharing community, highlighting how VSP affordances helped mitigate seekers’ epistemic discomfort and satisfy their needs for emotional support. In the discussion, we provided insights on balancing heuristic and systematic processing to encourage information cross-validation, and offered implications for designing trustworthy civic information systems and fostering an accessible, safe, and efficient information-seeking environment in digital space.
[HC-9] Channelling, Coordinating, Collaborating: A Three-Layer Framework for Disability-Centered Human-Agent Collaboration
【速读】:该论文试图解决的问题是:当前生成式 AI (Generative AI) 辅助工具主要面向个体使用,难以支持能力多样性群体在复杂任务中的协作需求。解决方案的关键在于提出一个三层框架——Channelling(信息通道)、Coordinating(协调机制)和 Co-Creating(协同共创),重新定义 AI 在能力多样性协作中的角色:通过建立跨能力的信息共享基础、调解不同能力合作者之间的工作流程,并作为有边界的协作伙伴共同达成目标,从而实现对现有“AI 作为远程合作者”理念的深化与扩展。
链接: https://arxiv.org/abs/2603.26252
作者: Lan Xiao,Catherine Holloway
机构: Global Disability Innovation Hub(全球残疾创新中心); UCL Interaction Centre(伦敦大学学院交互中心); University College London(伦敦大学学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted in CHI '26 Workshop on Human-Agent Collaboration
Abstract:AI accessibility tools have mostly been designed for individual use, helping one person overcome a specific functional barrier. But for many people with disabilities, complex tasks are accomplished through collaboration with others who bring complementary abilities, not solitary effort. We propose a three-layer framework, Channelling, Coordinating, and Co-Creating, that rethinks AI’s role in ability-diverse collaboration: establishing shared informational ground across abilities, mediating workflows between collaborators with different abilities, and contributing as a bounded partner toward shared goals. Grounded in the Ability-Diverse Collaboration framework, grounding theory, and Carlile’s 3T framework, it extends the “agents as remote collaborators” vision by centring the collaborative, interdependent ways people with disabilities already work.
[HC-10] ComVi: Context-Aware Optimized Comment Display in Video Playback
【速读】:该论文旨在解决视频平台中评论与视频内容不同步导致的干扰问题,即用户在观看视频时可能看到与当前画面无关的评论,从而引发剧透或破坏沉浸感。解决方案的关键在于提出一种名为ComVi的系统,通过计算音视频相关性将评论映射到对应的视频时间戳,并基于时间相关性、热度(点赞数)和显示时长进行优化,生成与视频内容同步的评论序列,实现评论与视频的时空对齐。
链接: https://arxiv.org/abs/2603.26173
作者: Minsun Kim,Dawon Lee,Junyong Noh
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: To appear in Proceedings of the ACM CHI Conference on Human Factors in Computing Systems (CHI 2026)
Abstract:On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene, which can reveal spoilers and disrupt immersion. To address this problem, we present ComVi, a novel system that displays comments at contextually relevant moments, enabling viewers to see time-synchronized comments and video content together. We first map all comments to relevant video timestamps by computing audio-visual correlation, then construct the comment sequence through an optimization that considers temporal relevance, popularity (number of likes), and display duration for comfortable reading. In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (i.e., YouTube and Danmaku), with 71.9% of participants selecting ComVi as their most preferred interface.
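按摘要描述,ComVi 的流程是先把评论映射到最相关的时间点,再结合时间相关性与热度(点赞数)等因素排序展示。下面的玩具代码与论文实现无关,仅示意这一"映射 + 加权排序"的骨架;其中特征向量、权重 w_rel/w_pop 与函数名均为本文假设:

```python
import math

def cosine(a, b):
    """两个特征向量的余弦相似度(零向量返回 0)。"""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def assign_and_rank(comments, segment_feats, w_rel=0.7, w_pop=0.3):
    """comments: [(文本, 特征向量, 点赞数)]; segment_feats: 各时间段的特征向量。
    返回 {时间段索引: [(得分, 文本), ...]}, 段内按得分降序。"""
    max_likes = max(c[2] for c in comments) or 1
    schedule = {t: [] for t in range(len(segment_feats))}
    for text, feat, likes in comments:
        sims = [cosine(feat, seg) for seg in segment_feats]
        t = sims.index(max(sims))                      # 映射到最相关的时间段
        score = w_rel * sims[t] + w_pop * likes / max_likes
        schedule[t].append((score, text))
    for t in schedule:
        schedule[t].sort(reverse=True)                 # 段内按加权得分排序
    return schedule
```

真实系统还需考虑展示时长与阅读舒适度等约束,这里仅保留相关性与热度两项,便于理解优化目标的构成。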
[HC-11] Simulating Novice Students Using Machine Unlearning and Relearning in Large Language Models
【速读】:该论文旨在解决生成式 AI(Generative AI)在学习即教学(learning-by-teaching)情境中难以稳定维持新手知识水平的问题。现有方法依赖提示工程(prompt engineering)驱动大语言模型(Large Language Models, LLMs)模拟初学者行为,但由于LLMs本身具备广泛知识能力,即使被要求“表现如初学者”,仍可能输出专家级回答,导致模拟学生知识水平漂移,削弱教学互动的真实性与研究价值。解决方案的关键在于引入机器遗忘(machine unlearning)技术,通过系统性地移除特定知识模块,将原本具备丰富知识的LLM转化为具有稳定新手水平的可教代理(teachable agent),从而实现更可控、可信的模拟学生行为。实验表明,该方法不仅能生成更贴近初学者的回答,还能在结构化教学交互中恢复部分被遗忘的知识,并通过对话日志识别出可预测的学习恢复轨迹和教学策略变化。
链接: https://arxiv.org/abs/2603.26142
作者: Jiajia Song,Zhihan Guo,Jionghao Lin
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Student simulation can support learning-by-teaching pedagogy where human students (as tutors) teach AI-simulated novice students (as tutees). Recent research often relies on prompt engineering with large language models (LLMs) to simulate novice student behaviour, but it is difficult to keep the AI-simulated student at a stable novice knowledge level. A key reason is that many LLMs are trained to be broadly capable, so even when prompted to “act like a novice,” the LLMs can still produce expert-level explanations during the learning-by-teaching interaction process. As a result, the AI-simulated student may drift beyond the intended knowledge level, reducing the credibility of the simulation for studying learning-by-teaching processes. Thus, we propose a knowledge-level simulation approach based on machine unlearning. We investigate this approach using a dataset of multiple-choice questions on Python programming concepts. We apply machine unlearning to transform a knowledgeable LLM into a novice-level AI student (i.e., teachable agent), then evaluate whether the teachable agent can relearn targeted knowledge components through learning-by-teaching dialogue interactions. Finally, we analyse the dialogue logs to characterise how the agent’s behaviour changes over time, including its question asking, error patterns, and responsiveness to instruction. The results show that (1) unlearning produces simulated student agents with more novice-like responses than prompt-only baselines, (2) the agents recover a measurable portion of the unlearned knowledge under structured exposure, and (3) dialogue analyses reveal identifiable trajectories of conceptual change and teaching moves that predict learning recovery.
[HC-12] One Is Not Enough: How People Use Multiple AI Models in Everyday Life
【速读】:该论文旨在解决用户在同时使用多个多模态大语言模型(Multimodal Large Language Models, MLLMs)时面临的协调难题,包括适配不同模型的接口、校准对不一致行为的信任度,以及管理各自独立的对话历史。其解决方案的关键在于识别出用户会根据使用场景动态构建主次模型层级,并发展出基于任务聚合的个性化切换策略,以优化努力程度、延迟和输出可信度。这些发现为设计支持多MLLM协同工作的工具提供了理论依据与实践方向。
链接: https://arxiv.org/abs/2603.26107
作者: Seunghwa Pyo,Donggun Lee,Jungwoo Rhee,Soobin Park,Youn-kyung Lim
机构: KAIST(韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted as a poster at CHI 2026
Abstract:People increasingly use multiple Multimodal Large Language Models (MLLMs) concurrently, selecting each based on its perceived strengths. This cross-platform practice creates coordination challenges: adapting prompts to different interfaces, calibrating trust against inconsistent behaviors, and navigating separate conversation histories. Prior HCI research focused on single-agent interactions, leaving multi-MLLM orchestration underexplored. Through a diary study and semi-structured interviews (N=10), we examine how individuals organize work across competing AI systems. Our findings reveal that users construct primary and secondary hierarchies among models that shift over usage context. They also develop personalized switching patterns triggered by task aggregation to adjust effort and latency, and output credibility. These insights inform future tool design opportunities, supporting users to coordinate multi-MLLM workflows.
[HC-13] “Oops! ChatGPT is Temporarily Unavailable!”: A Diary Study on Knowledge Workers’ Experiences of LLM Withdrawal
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在知识工作中日益嵌入,导致人们对工具依赖性增强及人类技能弱化的担忧。研究通过为期四天的日记研究(N=10名高频使用者),考察了临时移除LLM对工作流程的影响,发现LLM的缺失暴露了任务执行中的空白、促使从业者重新聚焦专业价值,并揭示了LLM使用已内化为一种不可回避的规范实践。其解决方案的关键在于将LLM视为当代知识工作的基础设施,提出“以价值为导向的采纳”(value-driven appropriation)作为应对策略,旨在支持在LLM高度渗透的工作环境中维护和强化专业价值观。
链接: https://arxiv.org/abs/2603.26099
作者: Eunseo Oh,Suyoun Lee,Jae Young Choi,Soobin Park,Youn-kyung Lim
机构: KAIST(韩国科学技术院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages excluding reference and appendix. Accepted at ACM CHI EA 2026
Abstract:LLMs have become deeply embedded in knowledge work, raising concerns about growing dependency and the potential undermining of human skills. To investigate the pervasiveness of LLMs in work practices, we conducted a four-day diary study with frequent LLM users (N=10), observing how knowledge workers responded to a temporary withdrawal of LLMs. Our findings show how LLM withdrawal disrupted participants’ workflows by identifying gaps in task execution, how self-directed work led participants to reclaim professional values, and how everyday practices revealed the extent to which LLM use had become inescapably normative. Conceptualizing LLMs as infrastructural to contemporary knowledge work, this research contributes empirical insights into the often invisible role of LLMs and proposes value-driven appropriation as an approach to supporting professional values in the current LLM-pervasive work environment.
[HC-14] Designing Fatigue-Aware VR Interfaces via Biomechanical Models
【速读】:该论文旨在解决虚拟现实(VR)中空中交互(mid-air interaction)导致的手臂疲劳和不适问题,从而提升用户体验。传统的人机界面(HCI)设计常依赖大量真人参与的评估,效率低下;而现有生物力学模型虽可用于模拟人类行为,但尚未被有效用于VR用户界面(UI)的 ergonomic 设计优化。解决方案的关键在于提出一种分层强化学习(hierarchical reinforcement learning, HRL)框架:其中运动代理(motion agent)基于经验证的三室控制与恢复(3CC-r)肌肉疲劳模型,模拟真实人体在VR中执行按钮按压任务时的运动策略及肌力消耗;该模拟疲劳结果作为反馈信号,驱动UI代理通过强化学习优化界面元素布局,以最小化整体疲劳。实验表明,该方法不仅与人类用户数据中的疲劳趋势一致,且在后续用户研究中显著降低了主观疲劳感,证明了生物力学模型作为“类人代理”在早期VR UI设计迭代中的有效性与潜力。
链接: https://arxiv.org/abs/2603.26031
作者: Harshitha Voleti,Charalambos Poullis
机构: Concordia University (康考迪亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Prolonged mid-air interaction in virtual reality (VR) causes arm fatigue and discomfort, negatively affecting user experience. Incorporating ergonomic considerations into VR user interface (UI) design typically requires extensive human-in-the-loop evaluation. Although biomechanical models have been used to simulate human behavior in HCI tasks, their application as surrogate users for ergonomic VR UI design remains underexplored. We propose a hierarchical reinforcement learning framework that leverages biomechanical user models to evaluate and optimize VR interfaces for mid-air interaction. A motion agent is trained to perform button-press tasks in VR under sequential conditions, using realistic movement strategies and estimating muscle-level effort via a validated three-compartment control with recovery (3CC-r) fatigue model. The simulated fatigue output serves as feedback for a UI agent that optimizes UI element layout via reinforcement learning (RL) to minimize fatigue. We compare the RL-optimized layout against a manually-designed centered baseline and a Bayesian optimized baseline. Results show that fatigue trends from the biomechanical model align with human user data. Moreover, the RL-optimized layout using simulated fatigue feedback produced significantly lower perceived fatigue in a follow-up human study. We further demonstrate the framework’s extensibility via a simulated case study on longer sequential tasks with non-uniform interaction frequencies. To our knowledge, this is the first work using simulated biomechanical muscle fatigue as a direct optimization signal for VR UI layout design. Our findings highlight the potential of biomechanical user models as effective surrogate tools for ergonomic VR interface design, enabling efficient early-stage iteration with less reliance on extensive human participation.
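摘要中的三腔室(3CC)类疲劳模型将运动单元划分为休息(M_R)、激活(M_A)与疲劳(M_F)三个腔室,由疲劳率 F 与恢复率 R 控制腔室间流动。下面是一个与论文取值无关的欧拉法玩具模拟,目标负荷与速率参数均为示意假设:

```python
def simulate_3cc(tl=50.0, f_rate=0.01, r_rate=0.002, dt=1.0, steps=600):
    """三腔室疲劳模型的欧拉法玩具模拟(单位: % 最大肌力; 参数为示意值)。
    m_r: 休息腔室, m_a: 激活腔室, m_f: 疲劳腔室。"""
    m_r, m_a, m_f = 100.0, 0.0, 0.0
    for _ in range(steps):
        # 驱动项 c: 向目标负荷 tl 募集肌纤维, 不能超过剩余的休息纤维
        c = min(max(tl - m_a, 0.0), m_r / dt)
        d_a = c - f_rate * m_a              # 激活腔室: 募集流入, 疲劳流出
        d_f = f_rate * m_a - r_rate * m_f   # 疲劳腔室: 疲劳流入, 恢复流出
        d_r = r_rate * m_f - c              # 休息腔室: 恢复流入, 募集流出
        m_a += d_a * dt
        m_f += d_f * dt
        m_r += d_r * dt
    return m_r, m_a, m_f
```

三项变化量之和恒为零,因此总量 m_r + m_a + m_f 在该离散化下严格守恒,可作为实现的自检条件;持续负荷下 m_f 单调上升,即为可用作优化信号的"模拟疲劳"。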
[HC-15] FlexiCamAR: Enhancing Everyday Camera Interactions on AR Glasses with a Flexible Additional Viewpoint
【速读】:该论文旨在解决当前消费级增强现实(AR)眼镜在日常任务中因仅依赖前向摄像头(front-facing camera)而存在的交互局限性问题,即现有设计虽直观但未必最优或足够支持多样化场景。其解决方案的关键在于提出一种名为“FlexiCamAR”的新型相机交互技术,通过提供灵活且舒适的第二视角拍摄能力来提升效率与应用场景的广度;具体实现上,作者开发了一款可佩戴于手指上的环形相机原型,实验证明该方法能显著降低用户的物理负担,并在低角度拍摄、狭小空间导航等特定场景中展现出独特优势,从而为AR眼镜的日常应用提供了更具适应性的补充或替代方案。
链接: https://arxiv.org/abs/2603.26012
作者: Ziming Li,Hongji Li,Jialin Wang,Pan Hui,Hai-Ning Liang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The recent emergence and popularity of consumer-grade augmented reality (AR) glasses from major technology companies highlight their potential to become the next daily computing platform. A dominant design trend in this context is the integration of a front-facing camera to deliver a first-person perspective. While this approach is intuitive, there is limited evidence that it is optimal (or sufficient) for supporting users in daily tasks. This paper explores a more effective camera interaction technique for AR glasses, which we term “FlexiCamAR.” This novel method aims to enhance both efficiency and the range of applications for AR glasses by offering flexible and comfortable secondary camera viewpoints. To investigate the applicability and usability of this approach, we developed a ring camera prototype that can be attached to users’ fingers. We then conducted a user study with 12 participants, comparing FlexiCamAR against the baseline, a traditional front-facing AR camera setup, across two common tasks: taking photos and scanning QR codes. Our findings show that FlexiCamAR significantly reduces physical load. We also explore potential scenarios where the additional viewpoint afforded by FlexiCamAR proves valuable, such as capturing low-angle perspectives or navigating confined spaces. Participant feedback further suggests strong potential for additional applications, including selfie taking, video conferencing, and object scanning. Overall, FlexiCamAR presents a novel interaction approach that can serve as a powerful supplement or alternative to the first-person perspective, significantly improving the adaptability of AR glasses for everyday use.
[HC-16] We Need Granular Sharing of De-Identified Data – But Will Patients Engage? Investigating Health System Leaders’ and Patients’ Perspectives on A Patient-Controlled Data-Sharing Platform
【速读】:该论文旨在解决患者控制去标识化医疗数据共享的系统在实际应用中,不同利益相关者(尤其是患者与健康系统领导者)对透明度、自主权及风险认知存在显著差异的问题。研究发现,虽然双方均认可此类平台能提升透明度与自主权,但领导层更关注知情同意与机构伦理框架下的规范性,而患者则将其视为防范潜在风险和不确定性的保障机制,由此引发个体控制权与研究完整性之间的张力。解决方案的关键在于设计具备上下文感知能力的可信系统,支持灵活的数据共享粒度、持续以受益为中心的透明机制,并适配多样化的用户信息素养与隐私需求。
链接: https://arxiv.org/abs/2603.26010
作者: Xi Lu,Di Hu,An T. Nguyen,Brad Morse,Lisa M. Schilling,Kai Zheng,Michelle S. Keller,Lucila Ohno-Machado,Yunan Chen
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to be presented at The ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2026
Abstract:Patient-controlled data-sharing systems are increasingly promoted as a way to empower patients with greater autonomy over their health data. Yet it remains unclear how different stakeholders, especially patients and health system leaders, perceive the benefits and challenges of enabling granular control over the sharing of de-identified medical data for research. To address this gap, we developed a high-fidelity prototype of a patient-controlled, web-based consent platform and conducted a two-phase mixed-methods study: semi-structured interviews with 16 health system leaders and a survey with 523 patient participants. While both groups appreciated the potential of such a platform to enhance transparency and autonomy, their views diverged in meaningful ways. Leaders viewed transparency and granular control through the lens of informed consent and institutional ethics, whereas patients interpreted these factors as safeguards against potential risks and uncertainties. Our findings underscore critical tensions such as individual control and research integrity. We offer design implications for building trustworthy, context-aware systems that support flexible granularity, provide ongoing benefit-centered transparency, and adapt to diverse literacy and privacy needs.
[HC-17] Explore LLM-enabled Tools to Facilitate Imaginal Exposure Exercises for Social Anxiety
【速读】:该论文旨在解决社交焦虑(Social Anxiety, SA)患者在使用想象暴露法(Imaginal Exposure, IE)进行认知行为疗法(Cognitive Behavioral Therapy, CBT)时面临的核心障碍:传统IE作业要求来访者自行构建并维持具有临床意义的恐惧叙事,这不仅增加了执行难度,也限制了其广泛应用。解决方案的关键在于开发一种由大语言模型(Large Language Model, LLM)驱动的辅助工具——ImaginalExpoBot,通过生成生动且个性化的暴露脚本,降低用户准备门槛,并支持即时、个体化的场景调整,从而帮助用户在治疗师设定的“耐受窗口”(window of tolerance)内安全有效地开展暴露练习。
链接: https://arxiv.org/abs/2603.25933
作者: Yimeng Wang,Yinzhou Wang,Alicia Hong,Yixuan Zhang
机构: William &amp; Mary (威廉玛丽学院); George Mason University (乔治梅森大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Social anxiety (SA) is a prevalent mental health challenge that significantly impacts daily social interactions. Imaginal Exposure (IE), a Cognitive Behavioral Therapy (CBT) technique involving imagined anxiety-provoking scenarios, is effective but underutilized, in part because traditional IE homework requires clients to construct and sustain clinically relevant fear narratives. In this work, we explore the feasibility of an LLM-enabled tool that supports IE by generating vivid, personalized exposure scripts. We first co-designed ImaginalExpoBot with mental health professionals, followed by a formative evaluation with five therapists and a user study involving 19 individuals experiencing SA symptoms. Our findings show that LLM-enabled support can facilitate preparation for anxiety-inducing situations while enabling immediate, user-specific adaptation, with scenarios remaining within a therapeutically beneficial “window of tolerance”. Our participants and MHPs also identified limitations in continuity and customization, pointing to the need for deeper adaptivity in future designs. These findings offer preliminary design insights for integrating LLMs into structured therapeutic practices in accessible, scalable ways.
[HC-18] GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks CVPR2026
【速读】:该论文旨在解决当前图形用户界面(GUI)代理仅关注自动化操作(如点击和键盘输入),而忽视用户意图的问题,从而限制了人机协作的深度与灵活性。其核心挑战在于如何让AI模型不仅感知用户行为,还能理解其背后的目标并适时提供有效帮助。解决方案的关键在于提出一个名为GUIDE(GUI User Intent Detection Evaluation)的基准测试体系,包含67.5小时来自120名新手用户的屏幕录制数据(附带思考 aloud 叙述),覆盖10种软件应用,并定义了三个任务:行为状态检测、意图预测和帮助预测,以系统评估模型对用户意图的理解能力。实验表明,引入用户上下文信息可显著提升帮助预测准确率(最高提升50.2个百分点),凸显了结构化用户理解在实现高效辅助中的关键作用。
链接: https://arxiv.org/abs/2603.25864
作者: Saelyne Yang,Jaesang Yu,Yi-Hao Peng,Kevin Qinghong Lin,Jae Won Cho,Yale Song,Juho Kim
机构: KAIST(韩国科学技术院); Carnegie Mellon University (卡内基梅隆大学); University of Oxford (牛津大学); Konkuk University (建国大学); Google Inc. (谷歌公司); SkillBench
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at CVPR 2026
Abstract:Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software. GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model’s ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context significantly improved the performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at this https URL.
[HC-19] Building to Understand: Examining Teens' Technical and Socio-Ethical Pieces of Understandings in the Construction of Small Generative Language Models
【速读】:该论文旨在解决青少年在生成式人工智能(Generative AI)和机器学习(Machine Learning, ML)技术日益普及的背景下,如何有效发展相关素养的问题。当前研究指出,通过构建活动(construction activities)可帮助青少年理解AI/ML系统及其社会伦理影响,但尚缺乏对青少年在实际构建小型生成式语言模型(Language Models, LMs)过程中所形成的技术与社会伦理认知的具体证据。解决方案的关键在于设计并实施为期一周的参与式设计工作坊,让16名青少年动手构建用于生成食谱、剧本和歌曲的小型LM,并通过主题分析识别其在技术实现与社会伦理层面展现出的理解片段,从而为新手理解AI/ML系统提供实证依据和理论框架支持。
链接: https://arxiv.org/abs/2603.25852
作者: Luis Morales-Navarro,Daniel J. Noh,Lucianne Servat,Carly Netting,Yasmin B. Kafai,Danaé Metaxa
机构: University of Pennsylvania (宾夕法尼亚大学); The Franklin Institute (富兰克林研究所)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:The rising adoption of generative AI/ML technologies increases the need to support teens in developing AI/ML literacies. Child-computer interaction research argues that construction activities can support young people in understanding these systems and their implications. Recent exploratory studies demonstrate the feasibility of engaging teens in the construction of very small generative language models (LMs). However, it is unclear how constructing such models may foster the development of teens’ understanding of these systems from technical and socio-ethical perspectives. We conducted a week-long participatory design workshop in which sixteen teenagers constructed very small LMs to generate recipes, screenplays, and songs. Using thematic analysis, we identified technical and socio-ethical pieces of understandings that teens exhibited while designing generative LMs. This paper contributes (a) evidence of the kinds of pieces of understandings that teens have when constructing LMs and (b) a theory-backed framing to study novices’ understandings of AI/ML systems.
[HC-20] HeyFriend Helper: A Conversational AI Web-App for Resource Access Among Low-Income Chicago Residents
【速读】:该论文旨在解决低收入群体在就业过程中面临的多重障碍问题,如数字素养资源匮乏、职业培训不足、面试准备和简历反馈缺失等。其解决方案的关键在于构建一个名为HeyFriend Helper的基于Web的平台,通过交互式对话助手(conversational assistant)整合多种本地化数字资源,提供个性化支持与指导,涵盖简历制作与反馈、面试练习、心理健康与福祉资源、就业趋势与职业结果信息、语言学习支持以及社区服务定位等功能,从而实现对低收入人群就业能力的全方位赋能。
链接: https://arxiv.org/abs/2603.25800
作者: Maddie Juarez,Abha Rai,Kristen E. Ravi,Margaret C. Delaney,Danny Olweean,Eric Klingensmith,Swarnali Banerjee,Neil Klingensmith,George K. Thiruvathukal
机构: Loyola University Chicago (洛约拉大学芝加哥分校); The University of Tennessee (田纳西大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Low-income individuals can face multiple challenges in their ability to seek employment. Barriers to employment often include limited access to digital literacy resources, training, interview preparation and resume feedback. Prior work has largely focused on targeted social service or healthcare applications that address needs individually, with little emphasis on conversational AI-driven systems that integrate multiple localized digital resources to provide comprehensive support. This work presents HeyFriend Helper, a web-based platform designed to support low-income residents in Chicago through an interactive conversational assistant that provides personalized support and guidance. HeyFriend Helper integrates multiple tools, including resume building and feedback, interview practice, mindfulness and well-being resources, employment trend and career outcome information, language learning support, and location-based access to community services. This work represents an interdisciplinary collaboration between social work, computer science, and engineering that addresses the multifaceted needs of low-income individuals. The findings demonstrate the importance of career-readiness tools and conversational user interfaces (CUIs) in providing holistic support.
[HC-21] DesignWeaver: Dimensional Scaffolding for Text-to-Image Product Design
【速读】:该论文旨在解决新手设计师在使用生成式 AI(Generative AI)进行产品设计时,因缺乏领域知识而导致提示词(prompt)质量低、难以有效探索设计空间的问题。解决方案的关键在于提出 DesignWeaver 界面,该界面通过从文本到图像模型生成的图像中提取关键产品设计维度,并将其组织为可快速选择的视觉参考面板,从而引导新手更高效地构建包含更多领域特定词汇的复杂提示词,提升设计多样性与创新性。
链接: https://arxiv.org/abs/2502.09867
作者: Sirui Tao,Ivan Liang,Cindy Peng,Zhiqing Wang,Srishti Palani,Steven P. Dow
机构: UC San Diego (加州大学圣地亚哥分校); Carnegie Mellon University (卡内基梅隆大学); Tableau Research (Tableau 研究院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 26 pages, 22 figures, CHI 2025
Abstract:Generative AI has enabled novice designers to quickly create professional-looking visual representations for product concepts. However, novices have limited domain knowledge that could constrain their ability to write prompts that effectively explore a product design space. To understand how experts explore and communicate about design spaces, we conducted a formative study with 12 experienced product designers and found that experts – and their less-versed clients – often use visual references to guide co-design discussions rather than written descriptions. These insights inspired DesignWeaver, an interface that helps novices generate prompts for a text-to-image model by surfacing key product design dimensions from generated images into a palette for quick selection. In a study with 52 novices, DesignWeaver enabled participants to craft longer prompts with more domain-specific vocabularies, resulting in more diverse, innovative product designs. However, the nuanced prompts heightened participants’ expectations beyond what current text-to-image models could deliver. We discuss implications for AI-based product design support tools.
计算机视觉
[CV-0] Detailed Geometry and Appearance from Opportunistic Motion
【速读】:该论文旨在解决从稀疏固定相机视角中重建物体三维几何与外观的问题,这一任务受限于视点数量不足导致的感知盲区。其核心挑战在于如何利用机会性物体运动(如人移动椅子或拿起杯子)来扩展有效观测视角,从而突破传统静态摄像机视角限制。解决方案的关键在于两个方面:一是提出一种联合姿态与形状优化框架,通过交替最小化6自由度(6DoF)轨迹和基于2D高斯泼溅(2D Gaussian splatting)的原始参数,实现对象位姿与几何结构的紧密耦合估计;二是设计了一种新颖的外观建模方法,将漫反射与镜面反射成分在球谐函数空间内解耦,并引入方向性探测机制以应对静态光照下移动物体的复杂外观变化。
链接: https://arxiv.org/abs/2603.26665
作者: Ryosuke Hirai,Kohei Yamashita,Antoine Guédon,Ryo Kawahara,Vincent Lepetit,Ko Nishino
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing 3D geometry and appearance from a sparse set of fixed cameras is a foundational task with broad applications, yet it remains fundamentally constrained by the limited viewpoints. We show that this bound can be broken by exploiting opportunistic object motion: as a person manipulates an object (e.g., moving a chair or lifting a mug), the static cameras effectively "orbit" the object in its local coordinate frame, providing additional virtual viewpoints. Harnessing this object motion, however, poses two challenges: the tight coupling of object pose and geometry estimation and the complex appearance variations of a moving object under static illumination. We address these by formulating a joint pose and shape optimization using 2D Gaussian splatting with alternating minimization of 6DoF trajectories and primitive parameters, and by introducing a novel appearance model that factorizes diffuse and specular components with reflected directional probing within the spherical harmonics space. Extensive experiments on synthetic and real-world datasets with extremely sparse viewpoints demonstrate that our method recovers significantly more accurate geometry and appearance than state-of-the-art baselines.
[CV-1] GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
【速读】:该论文旨在解决当前3D生成建模中依赖扩散或流匹配(flow-matching)范式所面临的局限性,例如缺乏对生成过程的细粒度控制、难以实现局部编辑(如补全、外绘)以及生成轨迹不灵活等问题。其解决方案的关键在于提出GaussianGPT——一个基于Transformer的完全自回归模型,通过将3D高斯(3D Gaussians)原始表示压缩为离散潜在网格(使用稀疏3D卷积自动编码器与向量量化),再以因果Transformer结合3D旋转位置编码进行序列化建模,从而实现逐token生成3D场景的空间结构和外观。该方法利用自回归建模的组合归纳偏置(inductive biases)与可扩展性,在显式表示基础上支持可控采样(如温度调节)、灵活生成范围及局部编辑,为可控且上下文感知的3D生成提供了与扩散方法互补的新范式。
链接: https://arxiv.org/abs/2603.26661
作者: Nicolas von Lützow,Barbara Rössle,Katharina Schmid,Matthias Nießner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL - Project video: this https URL
Abstract:Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
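GaussianGPT 自回归建模的前提,是先用向量量化把连续的高斯潜向量离散成 token。下面用 NumPy 给出向量量化最近邻查表这一通用操作的极简示意(码本与潜向量均为玩具数据,仅说明思路,并非论文的实际实现):

```python
import numpy as np

def vector_quantize(latents, codebook):
    """示意:向量量化——把每个连续潜向量替换为码本中最近的码字,
    返回离散 token 索引(供因果 Transformer 逐 token 建模)与量化后的向量。
    latents:  (n, d) 连续潜向量
    codebook: (k, d) 码本
    """
    # 计算每个潜向量到每个码字的欧氏距离,形状 (n, k)
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    tokens = dists.argmin(axis=1)        # 每行取最近码字的索引
    return tokens, codebook[tokens]

# 玩具数据:2 维潜空间、3 个码字
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
latents = np.array([[0.1, -0.1], [0.9, 1.2]])
tokens, quantized = vector_quantize(latents, codebook)
```

得到的离散 `tokens` 序列即可按摘要所述交给带 3D 旋转位置编码的因果 Transformer 做 next-token prediction。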
[CV-2] Zero-Shot Depth from Defocus
【速读】:该论文旨在解决深度估计中的零样本泛化(zero-shot generalization)问题,即在未见过的数据集上仍能保持高精度的深度从失焦(Depth from Defocus, DfD)估计性能。传统方法通常对特定数据集过拟合,难以适应新场景,而本文通过构建更丰富、高质量的现实世界DfD基准ZEDD(包含8.3倍更多场景和更优图像与真值深度图),并提出一种基于Transformer的新网络架构FOSSA,其核心创新在于引入堆栈注意力层(stack attention layer)与焦点距离嵌入(focus distance embedding),实现跨聚焦图像堆栈的高效信息交互,从而显著提升模型在未知场景下的泛化能力。实验表明,该方法相较基线模型误差降低达55.7%。
链接: https://arxiv.org/abs/2603.26658
作者: Yiming Zuo,Hongyu Wen,Venkat Subramanian,Patrick Chen,Karhan Kayan,Mario Bijelic,Felix Heide,Jia Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at this https URL. The code and checkpoints are released at this https URL.
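摘要指出 FOSSA 的关键是带对焦距离嵌入(focus distance embedding)的堆栈注意力层。把每帧的对焦距离(标量)编码成向量的一种常见做法是正弦嵌入,下面给出极简示意(维度与频率设置均为假设,并非论文实现):

```python
import numpy as np

def focus_distance_embedding(focus_dists, dim=8):
    """示意:把焦点堆栈中每帧的对焦距离编码为正弦嵌入,
    使注意力层能区分堆栈中的不同帧。
    focus_dists: (n,) 每帧的对焦距离(如以米为单位)
    返回 (n, dim) 嵌入矩阵
    """
    freqs = 2.0 ** np.arange(dim // 2)            # 几何级数频率
    angles = focus_dists[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# 三帧焦点堆栈,对焦距离分别为 0.5m / 1m / 2m
emb = focus_distance_embedding(np.array([0.5, 1.0, 2.0]), dim=8)
```

这样不同对焦距离的帧会获得互不相同的嵌入,便于跨堆栈的信息交互。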
[CV-3] Tunable Soft Equivariance with Guarantees
【速读】:该论文旨在解决现实世界数据中模型难以满足严格等变性(equivariance)而导致性能受限的问题。其核心挑战在于如何在不破坏预训练模型结构的前提下,可控地调节模型的等变性程度,从而提升模型在多种视觉任务中的表现。解决方案的关键在于提出一个通用框架,通过将模型权重投影到设计好的子空间中来构建软等变模型(soft equivariant models),该方法适用于任意预训练架构,并能提供诱导等变误差的理论边界。实验证明,该方法在ViT和ResNet等主流骨干网络上均有效提升了图像分类、语义分割及人体轨迹预测等任务的性能,同时显著降低了等变误差。
链接: https://arxiv.org/abs/2603.26657
作者: Md Ashiqur Rahman,Lim Jun Hao,Jeremiah Jiang,Teck-Yian Lim,Raymond A. Yeh
机构: Purdue University (普渡大学); DSO National Laboratories (新加坡国防科技研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model’s performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
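论文通过把模型权重投影到设计好的子空间来获得可调的软等变性。以 C4(90° 旋转)群为例,对卷积核做群平均是把权重投影到等变子空间的一种经典构造;下面给出带可调混合系数的示意,仅用于说明"投影 + 可调程度"这一思路,接口与系数均为假设,并非论文的具体方法:

```python
import numpy as np

def soft_equivariant_projection(kernel, alpha=1.0):
    """示意:把 2D 卷积核向 C4(90° 旋转)不变子空间投影。
    alpha=1 得到严格对称的核,alpha=0 保留原核,
    中间取值即实现可调的"软"等变程度。
    """
    # C4 群平均:对 4 个 90° 旋转取均值,结果对旋转严格不变
    group_avg = np.mean([np.rot90(kernel, k) for k in range(4)], axis=0)
    return (1 - alpha) * kernel + alpha * group_avg

k = np.arange(9, dtype=float).reshape(3, 3)
k_eq = soft_equivariant_projection(k, alpha=1.0)   # 完全投影后的核对 90° 旋转不变
```

alpha 连续取值即对应摘要所说"控制等变性的程度"。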
[CV-4] Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision
【速读】:该论文旨在解决传统视觉定位(Visual Grounding, VG)方法依赖文本描述进行目标定位时存在的语言歧义问题,以及忽视现实交互中广泛存在的非语言指代线索(如手势)的局限性。其核心解决方案是提出首个专注于第一人称视角下指代行为的多模态数据集 EgoPoint-Ground,该数据集包含超过15k个复杂场景中的交互样本,并提供细粒度的手部-目标边界框配对及密集语义标注;同时设计了SV-CoT基线框架,通过将定位任务重构为结构化推理过程,利用视觉Chain-of-Thought(Visual Chain-of-Thought, Visual CoT)机制融合手势与语言线索,显著提升了模型对多模态物理意图的理解能力,在基准测试中相较现有方法实现11.7%的绝对性能提升。
链接: https://arxiv.org/abs/2603.26646
作者: Ling Li,Bowen Liu,Zinuo Zhan,Peng Jie,Jianhui Zhong,Kenglun Chang,Zhidong Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.
[CV-5] Make Geometry Matter for Spatial Reasoning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在静态场景和动态视频中进行空间推理能力不足的问题。尽管VLMs通过大规模训练具备较强的图像与视频理解能力,但其在利用几何信息进行空间推理时表现有限,主要原因在于模型倾向于依赖二维视觉线索而非几何token。为解决此问题,论文提出GeoSR框架,其核心创新在于两个关键组件:一是几何解禁掩码(Geometry-Unleashing Masking),通过在训练过程中有策略地掩码部分2D视觉token,削弱非几何捷径,迫使模型主动使用几何token进行空间推理;二是几何引导融合(Geometry-Guided Fusion),引入门控路由机制,在几何证据关键区域自适应增强几何token的贡献。这两项设计共同释放了几何token在空间推理任务中的潜力,显著提升了模型性能,并在静态与动态空间推理基准上达到新的最先进水平。
链接: https://arxiv.org/abs/2603.26639
作者: Shihua Zhang,Qiuhong Shen,Shizun Wang,Tianbo Pan,Xinchao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at this https URL.
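"几何解禁掩码"(Geometry-Unleashing Masking)的核心操作,是训练时随机掩蔽一部分 2D 视觉 token,削弱非几何捷径。下面是该思路的极简示意(掩蔽比例、张量形状与掩蔽方式均为假设,并非 GeoSR 的实际实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def geometry_unleashing_mask(vision_tokens, mask_ratio=0.5):
    """示意:随机掩蔽(置零)一部分 2D 视觉 token,
    迫使模型转而依赖几何 token 进行空间推理。
    vision_tokens: (num_tokens, dim)
    返回掩蔽后的 token 与被掩蔽的索引
    """
    n = vision_tokens.shape[0]
    num_masked = int(n * mask_ratio)
    masked_idx = rng.choice(n, size=num_masked, replace=False)
    out = vision_tokens.copy()
    out[masked_idx] = 0.0            # 被掩蔽的 token 不再提供 2D 线索
    return out, masked_idx

tokens = rng.normal(size=(8, 4))
masked, idx = geometry_unleashing_mask(tokens, mask_ratio=0.5)
```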
[CV-6] Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting IROS2026
【速读】:该论文旨在解决在杂乱的汽车经销商通道场景中,对车辆外部进行高保真三维重建的技术难题。传统摄影测量方法难以应对动态车辆与复杂静态背景共存、广角镜头畸变、汽车漆面镜面反射及非刚性车轮旋转等挑战,导致经典对极几何约束失效。其解决方案的关键在于构建一个端到端的双相机系统流水线:首先通过SAM 3实例分割与运动门控相结合的方法清晰分离移动车辆并显式掩蔽非刚性车轮以满足严格对极几何;其次利用RoMa v2学习匹配器结合语义置信度掩膜,在原始4K畸变图像上提取鲁棒对应点;再次将这些匹配点融入考虑相机相对位姿先验(基于CAD模型)的SfM优化中,有效抑制尺度漂移;最后采用畸变感知的3D高斯泼溅框架(3DGUT)与随机马尔可夫链蒙特卡洛(MCMC)稠密化策略,高质量渲染镜面表面。实验证明该方案在真实场景下显著优于标准3D-GS方法,PSNR提升3.85 dB,生成可用于专业检测的交互式三维模型。
链接: https://arxiv.org/abs/2603.26638
作者: Nitin Kulkarni,Akhil Devarashetti,Charlie Cluss,Livio Forte,Philip Schneider,Chunming Qiao,Alina Vereshchaka
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 7 figures, Submitted to IEEE IROS 2026 (under review)
Abstract:High-fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive-throughs presents severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide-angle lens distortion, specular automotive paint, and non-rigid wheel rotations that violate classical epipolar constraints. We propose an end-to-end pipeline utilizing a two-pillar camera rig. First, we resolve dynamic-scene ambiguities by coupling SAM 3 for instance segmentation with motion-gating to cleanly isolate the moving vehicle, explicitly masking out non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware SfM optimization that utilizes CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real-world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held-out views, representing a 3.85 dB improvement over standard 3D-GS, delivering inspection-grade interactive 3D models without controlled studio infrastructure.
[CV-7] Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling
【速读】:该论文旨在解决从蜂窝网络信令记录(signaling records)中重建高精度GPS轨迹的问题,即Sig2GPS问题。传统方法依赖复杂的多阶段工程流程或坐标回归模型,难以直接利用信令数据的地理上下文信息。其核心解决方案是将问题重新建模为一个图像到视频生成任务:首先将信令轨迹渲染到地图上作为输入图像,再通过训练视频生成模型直接输出连续的GPS路径视频。关键创新在于引入了一个配对的信令-轨迹视频数据集用于微调开源视频生成模型,并结合基于奖励的强化学习优化策略提升生成轨迹的地理一致性与精度,从而在真实世界大规模数据上显著优于现有基准方法,且具备跨城市迁移能力。
链接: https://arxiv.org/abs/2603.26610
作者: Ruixing Zhang,Hanzhang Jiang,Leilei Sun,Liangzhe Han,Jibin Wang,Weifeng Lv
机构: Beihang University (北京航空航天大学); China Mobile Information Technology Center (中国移动信息技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers) and therefore limit their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by domain experts often lay the signaling trace on the map and sketch the corresponding GPS route, unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.
[CV-8] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
【速读】:该论文旨在解决大规模视频扩散模型在生成过程中难以保持几何一致性的问题。现有方法通过引入额外模块或应用几何感知对齐来改善一致性,但前者可能损害预训练模型的泛化能力,后者则受限于静态场景且依赖RGB空间奖励,需重复进行VAE解码,导致计算开销大且难以扩展至高动态真实场景。解决方案的关键在于提出VGGRPO(Visual Geometry GRPO),其核心是构建一个潜空间几何模型(Latent Geometry Model, LGM),将视频扩散模型的潜变量与具备4D重建能力的几何基础模型对齐,从而直接从潜空间解码场景几何信息;在此基础上,采用潜空间分组相对策略优化(Group Relative Policy Optimization)并设计两种互补奖励:相机运动平滑性奖励以抑制轨迹抖动,几何重投影一致性奖励以保证跨视角几何一致性,最终实现高效、灵活且世界一致的视频生成。
链接: https://arxiv.org/abs/2603.26599
作者: Zhaochong An,Orest Kupyn,Théo Uscidda,Andrea Colaco,Karan Ahuja,Serge Belongie,Mar Gonzalez-Franco,Marta Tintore Gazulla
机构: Google; University of Oxford; Institut Polytechnique de Paris; University of Copenhagen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
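摘要中的相机运动平滑性奖励用于惩罚抖动轨迹。一个直观的实现思路是惩罚相机位置序列的二阶差分(近似加速度/抖动);下面给出这一思路的 NumPy 示意(具体奖励形式与尺度均为假设,并非 VGGRPO 的实际奖励函数):

```python
import numpy as np

def camera_smoothness_reward(positions):
    """示意:相机运动平滑性奖励——二阶差分越小轨迹越平滑,奖励越高。
    positions: (t, 3) 相机位置序列;返回值恒为非正数,越接近 0 越平滑。
    """
    accel = np.diff(positions, n=2, axis=0)       # 二阶差分近似加速度
    return -np.mean(np.linalg.norm(accel, axis=-1))

# 匀速直线轨迹 vs 加入噪声的抖动轨迹
smooth = np.stack([np.linspace(0.0, 1.0, 10)] * 3, axis=1)
jitter = smooth + np.random.default_rng(0).normal(scale=0.1, size=smooth.shape)
```

平滑轨迹的奖励应高于抖动轨迹,与摘要"penalizes jittery trajectories"的描述一致。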
[CV-9] From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning CVPR2026
【速读】:该论文旨在解决图像预训练模型迁移到视频任务时面临的“帧内时间一致性”(intra-video temporal consistency)与“跨视频语义可分性”(inter-video semantic separability)之间的权衡问题。具体而言,传统方法通过微调复杂的时序模块虽能提升帧内一致性,但会削弱不同视频间对象的区分能力;而减少可训练参数则可能导致帧内表示不稳定。解决方案的关键在于提出了一种轻量级的“一致性-可分性权衡迁移学习”(Consistency-Separability Trade-off Transfer Learning, Co-Settle)框架:在冻结的图像预训练编码器之上添加一个投影层,结合时序循环一致性目标和语义可分性约束进行自监督训练,从而在不显著增加计算负担的前提下优化两种性质的平衡。理论分析进一步证明,在适当条件下该优化过程可实现更优的权衡效果。
链接: https://arxiv.org/abs/2603.26597
作者: Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Xilin Zhao,Qingming Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. While reducing the tunable parameters hinders their intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide a theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at this https URL.
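Co-Settle 的时间循环一致性目标,其直觉可以用"前向最近邻匹配再反向匹配,是否回到出发点"来度量。下面是一个玩具示意(特征、度量与统计方式均为假设,仅说明循环一致性的含义,并非论文的损失实现):

```python
import numpy as np

def cycle_consistency_rate(feats_a, feats_b):
    """示意:帧内时间循环一致性——A 帧的每个特征先在 B 帧找最近邻,
    再从该最近邻回到 A 帧;若回到出发点,则视为循环一致。
    feats_a, feats_b: (n, d) 两个时刻的特征
    返回循环一致的比例,取值 [0, 1]
    """
    def nn(src, dst):
        d = np.linalg.norm(src[:, None] - dst[None, :], axis=-1)
        return d.argmin(axis=1)

    fwd = nn(feats_a, feats_b)            # A -> B 最近邻
    back = nn(feats_b, feats_a)           # B -> A 最近邻
    cycle = back[fwd]                     # A -> B -> A
    return np.mean(cycle == np.arange(len(feats_a)))

a = np.array([[0.0, 0.0], [5.0, 5.0]])
b = a + 0.1     # 轻微时间漂移,同一物体的表示仍应循环一致
```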
[CV-10] The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
【速读】:该论文试图解决的问题是:仅通过图像与文本的统计共现关系(即分布假设)是否足以学习人类场景理解的全部丰富性,尤其是涉及具身认知(embodied cognition)的核心能力,如物体的可操作性(affordance)等。其关键解决方案在于设计并应用一种新的评估指标——人类校准余弦距离(Human-Calibrated Cosine Distance, HCD),该指标量化了视觉语言模型(VLMs)输出与人类响应分布之间的相似性,并以人类内部变异性为尺度进行归一化,从而在缺乏真实答案的任务中实现对模型性能的可靠评估。实验发现,尽管VLMs在一般知识任务上接近人类水平,但在具身可操作性任务上存在系统性缺陷,且这一差距无法通过提示工程或提供空间信息缓解,表明单纯依赖大规模图文语料库的分布学习不足以捕捉具身经验所赋予的场景理解维度。
链接: https://arxiv.org/abs/2603.26589
作者: Gillian Rosenberg,Skylar Stadhard,Bruce C. Hansen,Michelle R. Greene
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 figures, 5 tables
Abstract:What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
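摘要提出的 HCD 指标把模型输出到人类回答分布的余弦距离,按人内变异性进行归一化。下面给出该定义的一个极简示意:用回答嵌入向量的均值近似"人类共识",以各人到共识的平均余弦距离作为人内变异性(这种具体计算方式是笔者的假设,并非论文的官方实现):

```python
import numpy as np

def human_calibrated_cosine_distance(model_vec, human_vecs):
    """示意:HCD——模型回答到人类共识的余弦距离,按人内变异性缩放。
    model_vec:  模型回答的嵌入向量 (d,)
    human_vecs: 多名人类回答的嵌入向量 (n, d)
    返回值约 <= 1 表示模型与人类共识的偏离不超过人与人之间的平均偏离。
    """
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    centroid = human_vecs.mean(axis=0)                 # 人类回答的"共识"方向
    d_model = cos_dist(model_vec, centroid)            # 模型到共识的距离
    d_human = np.mean([cos_dist(h, centroid) for h in human_vecs])  # 人内变异性
    return d_model / (d_human + 1e-8)

humans = np.array([[1.0, 0.1], [1.0, -0.1], [0.9, 0.05]])
hcd_self = human_calibrated_cosine_distance(humans.mean(axis=0), humans)  # 应接近 0
```

这种以人类变异性为尺度的归一化,正是在缺乏唯一标准答案的任务上比较模型与人类的用意所在。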
[CV-11] From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion
【速读】:该论文旨在解决牙冠修复中因牙齿缺损导致的自动化重建难题,尤其针对不完整牙齿几何结构的上下文感知生成问题。其核心挑战在于缺乏用于训练此类任务的真实缺失牙齿数据,且需确保生成的牙冠在解剖学上合理、不与对颌牙发生咬合干扰。解决方案的关键在于提出一个基于扩散模型(diffusion-based model)的牙冠生成框架ToothCraft,通过设计一套数据增强流水线,从完整的牙列数据集中合成多样化的缺失牙齿样本,从而有效扩充训练数据;同时利用局部解剖上下文进行条件生成,使模型能够准确恢复完整的牙冠形态,在合成测试集上达到81.8%的交并比(IoU)和0.00034的Chamfer Distance(CD),且在真实病例中表现出良好的泛化能力与临床安全性。
链接: https://arxiv.org/abs/2603.26588
作者: Dávid Pukanec,Tibor Kubík,Michal Španěl
机构: Brno University of Technology (布尔诺理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: VISAPP 2026 Conference
Abstract:We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: this https URL
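摘要报告的 IoU(81.8%)与 Chamfer Distance(0.00034)可按其通用定义示意如下,并非论文官方实现:体素占据集合上的交并比,以及点集间的对称平方 Chamfer 距离,均采用常见约定。

```python
def voxel_iou(a, b):
    """两个体素占据集合(以坐标元组集合表示)的交并比。"""
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 1.0

def chamfer_distance(p, q):
    """两个点集间的对称平方 Chamfer Distance(常见定义之一)。"""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    d_pq = sum(min(sq(a, b) for b in q) for a in p) / len(p)
    d_qp = sum(min(sq(a, b) for a in p) for b in q) / len(q)
    return d_pq + d_qp

pred = {(0, 0, 0), (1, 0, 0), (0, 1, 0)}
gt = {(0, 0, 0), (1, 0, 0), (1, 1, 0)}
print(voxel_iou(pred, gt))  # 交集 2 / 并集 4 = 0.5
```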
[CV-12] MA-Bench: Towards Fine-grained Micro-Action Understanding CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在微动作(Micro-Action)理解能力上的缺失问题,尤其是在人类情绪分析中至关重要的细粒度行为识别与解释任务。当前缺乏专门针对微动作的基准测试(benchmark),导致模型在捕捉运动细节和身体部位动态变化方面表现不佳。解决方案的关键在于构建MA-Bench——一个包含1,000个视频和12,000个结构化问答对的三层次评估架构,用于系统性地衡量模型在微动作感知、关系理解和解释推理三个层面的表现;同时,进一步创建了大规模训练语料MA-Bench-Train(含20.5K个带结构化微动作描述的视频),用于微调MLLMs。实验表明,基于该训练集微调后的Qwen3-VL-8B模型在微动作推理与解释任务上显著提升,验证了该方案的有效性。
链接: https://arxiv.org/abs/2603.26586
作者: Kun Li,Jihao Gu,Fei Wang,Zhiliang Wu,Hehe Fan,Dan Guo
机构: United Arab Emirates University (阿联酋大学); University College London (伦敦大学学院); Hefei University of Technology (合肥工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: this https URL
[CV-13] Scene Grounding In the Wild
【速读】:该论文旨在解决从无结构、野外采集的图像中重建大规模真实场景的3D模型时,因输入视图重叠度低而导致的全局一致性问题,即现有重建流程常产生多个不连通的局部重建结果或错误地将非重叠区域合并为重叠几何。其解决方案的关键在于:利用来自Google Earth Studio的密集、地理空间准确的伪合成渲染作为参考模型(reference model),并通过3D高斯泼溅(3D Gaussian Splatting)表示该参考模型并引入语义特征,进而通过逆向特征优化策略估计全局6自由度(6DoF)位姿与尺度,实现即使在缺乏视觉重叠的情况下也能对齐各局部重建结果,从而获得全局一致的3D场景重建。
链接: https://arxiv.org/abs/2603.26584
作者: Tamir Cohen,Leo Segre,Shay Shomer-Chai,Shai Avidan,Hadar Averbuch-Elor
机构: Tel Aviv University (特拉维夫大学); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL
Abstract:Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
[CV-14] Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
【速读】:该论文旨在解决传统视频压缩方法中生成式模型仅作为后处理重建模块的局限性,提出了一种全新的零样本视频编解码框架——生成式视频编解码器(Generative Video Codec, GVC),其核心创新在于将预训练的视频生成模型直接转化为编解码器本身。解决方案的关键在于:在推理阶段将现代视频基础模型中的确定性修正流常微分方程(rectified-flow ODE)转换为等效的随机微分方程(SDE),从而在每一步引入可调控的随机注入点,实现基于码本驱动的压缩;在此统一架构基础上,进一步设计三种互补的条件策略(图像到视频 I2V、文本到视频 T2V 和首尾帧到视频 FLF2V),构建出空间保真度、时间连贯性和压缩效率之间的系统性权衡空间,最终在标准基准上实现了低于 0.002 bpp 的高质量重建,并支持通过单一超参数灵活控制码率。
链接: https://arxiv.org/abs/2603.26571
作者: Ziyue Zeng,Xun Su,Haoyuan Liu,Bingyu Lu,Yui Tatsumi,Hiroshi Watanabe
机构: Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures
Abstract:Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies – Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
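摘要所述"将确定性 rectified-flow ODE 在推理时转换为等效 SDE"的具体形式未在摘要中给出;下面用一维玩具例子示意确定性 Euler 步与带随机注入项的 Euler–Maruyama 步的区别,其中 sigma 为假设的每步噪声强度(按摘要思路,压缩时该随机注入点可由码本驱动的选择替代):

```python
import math, random

def euler_ode_step(x, t, dt, velocity):
    # 确定性 rectified-flow ODE 的一步:x_{t+dt} = x_t + v(x_t, t) * dt
    return x + velocity(x, t) * dt

def euler_maruyama_sde_step(x, t, dt, velocity, sigma, rng):
    # 对应 SDE 的一步(示意):漂移项沿用速度场,另加随机注入项;
    # sigma 控制每步注入强度,sigma=0 时退化为确定性 ODE 步
    noise = rng.gauss(0.0, 1.0)
    return x + velocity(x, t) * dt + sigma * math.sqrt(dt) * noise

v = lambda x, t: -x  # 玩具速度场
rng = random.Random(0)
x_ode = euler_ode_step(1.0, 0.0, 0.1, v)
x_sde = euler_maruyama_sde_step(1.0, 0.0, 0.1, v, sigma=0.0, rng=rng)
print(x_ode, x_sde)  # sigma=0 时两者一致
```

注意:论文中等效 SDE 的漂移项通常还需依据分数函数(score)做修正,此处仅示意"每步引入随机注入点"这一机制本身。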
[CV-15] HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching
【速读】:该论文旨在解决共言语手势生成中难以实现语义一致性和跨模态一致性的问题。现有方法依赖外部语义检索,受限于预定义语言规则,且基于流匹配(Flow Matching)的方法仅使用语义一致样本训练,导致学习到的是节奏性动作而非稀疏的象征性手势(如象形和隐喻手势),同时多数方法孤立建模身体部位,无法保证跨模态一致性。其解决方案的关键在于提出一种对比流匹配(Contrastive Flow Matching)模型,利用不匹配的音频-文本条件作为负样本,使速度场在正向轨迹上遵循正确运动路径的同时排斥语义不一致的轨迹;并通过余弦与对比目标将文本、音频和整体动作嵌入统一潜在空间,从而确保跨模态一致性。
链接: https://arxiv.org/abs/2603.26553
作者: Lanmiao Liu,Esam Ghaleb,Aslı Özyürek,Zerrin Yumak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, leading to learning rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain crossmodal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
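论文的对比流匹配目标的精确形式未在摘要中给出;以下是一个标量情形的假设性示意:正样本项把预测速度拉向真实速度,错配音频-文本条件下的负样本预测以 hinge 形式被推远(margin 与权重 lam 均为示意参数,并非论文设定):

```python
def contrastive_fm_loss(v_pred_pos, v_target, v_pred_neg, margin=1.0, lam=0.1):
    """示意性的对比流匹配目标(标量情形):
    正样本项拉近匹配条件下的预测速度与目标速度;
    负样本项(错配的音频-文本条件)在 margin 内被推远。
    该具体形式为假设,仅用于说明"以错配条件作负样本"的思路。"""
    pos = (v_pred_pos - v_target) ** 2
    neg = max(0.0, margin - (v_pred_neg - v_target) ** 2)
    return pos + lam * neg

# 正样本完全命中、负样本与目标重合(最坏情形)时,损失完全来自负项
print(contrastive_fm_loss(1.0, 1.0, 1.0))  # 0 + 0.1 * 1.0 = 0.1
```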
[CV-16] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones
【速读】:该论文旨在解决当前视觉骨干网络(vision backbone networks)在边缘设备上效率评估与优化中的关键问题,即以MACs(Multiply Accumulate operations)作为执行时间预测指标存在显著局限性。作者通过实验对比常见架构设计元素的MAC计数与实际执行时间,识别出影响高效执行的核心因素,并据此提出LowFormer这一新型视觉骨干网络家族。其解决方案的关键在于:一是采用精简的宏观与微观结构设计,二是引入Lowtention——一种轻量级的多头自注意力(Multi-Head Self-Attention)替代机制,在保证性能的同时大幅提升计算效率;三是针对边缘GPU平台进行专门优化,实现跨硬件平台的显著加速效果。
链接: https://arxiv.org/abs/2603.26551
作者: Moritz Nottebaum,Matteo Dunnhofer,Christian Micheloni
机构: University of Udine (乌迪内大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to International Journal of Computer Vision (IJCV); currently under minor revision
Abstract:Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline’s speed on edge GPU and desktop GPU. We demonstrate LowFormer’s wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at this https URL.
[CV-17] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing
【速读】:该论文旨在解决生成式视频模型在自动驾驶场景中合成恶劣天气时对大规模数据集的依赖问题,以及现有3D-aware编辑方法因逐场景优化成本高且存在几何与光照耦合导致控制不灵活的瓶颈。其解决方案的关键在于提出AutoWeather4D框架,通过引入G-buffer双通道编辑机制实现几何与光照的显式解耦:几何通道(Geometry Pass)利用结构先验实现表面锚定的物理交互,光照通道(Light Pass)则通过解析光传输过程将局部光源贡献累积为全局光照,从而支持动态三维局部再照明和细粒度参数化物理控制。
链接: https://arxiv.org/abs/2603.26546
作者: Tianyu Liu,Weitao Xiong,Kunming Luo,Manyuan Zhang,Peng Liu,Yuan Liu,Ping Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.
[CV-18] OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
【速读】:该论文旨在解决复杂日常环境中自主代理进行增量式开放词汇3D实例语义映射(incremental open-vocabulary 3D instance-semantic mapping)的挑战,具体包括鲁棒的实例分割、实时处理能力以及灵活的开放集推理问题。现有方法通常依赖封闭集假设或密集的逐像素语言融合,导致可扩展性差且时序一致性不足。其解决方案的关键在于将实例重建与语义推理解耦:构建一个类无关的3D实例地图,通过RGB-D输入增量式构建;同时仅从自动选择的小规模视图中提取语义特征,利用视觉-语言模型实现零样本语义标注。这一设计实现了稳定的实例跟踪和在线探索中的零样本语义标记,系统可在实时运行并优于现有开放词汇映射基线方法。
链接: https://arxiv.org/abs/2603.26541
作者: Zilong Deng,Federico Tombari,Marc Pollefeys,Johanna Wald,Daniel Barath
机构: ETH Zurich (苏黎世联邦理工学院); Google (谷歌); University of Zurich (苏黎世大学); TU Munich (慕尼黑工业大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.
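摘要中"从少量选定视图提取语义特征并做零样本语义标注"的一种常见做法,是将实例特征与各候选标签的文本嵌入做余弦相似度匹配;以下为该思路的假设性示意(非论文实现,instance_feat 假定来自视觉-语言模型对选定视图的特征提取):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_label(instance_feat, text_feats):
    """示意:为类无关的 3D 实例挑选余弦相似度最高的开放词汇标签。"""
    return max(text_feats, key=lambda name: cos_sim(instance_feat, text_feats[name]))

# 玩具二维嵌入:两个候选标签的文本特征
texts = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
print(zero_shot_label([0.9, 0.2], texts))  # → chair
```

这种"实例重建与语义推理解耦"的设计意味着语义特征只需对少量视图计算一次,而不必逐像素融合。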
[CV-19] Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation
【速读】:该论文旨在解决高光谱感知(Hyperspectral Sensing)在城市驾驶场景中因维度高而导致的解释性差与高效学习困难的问题。其核心挑战在于如何在保持判别信息的同时实现可解释、参数高效的降维。解决方案的关键是提出一种受物理启发的可学习量子效率(Learnable Quantum Efficiency, LQE)方法,该方法通过参数化平滑的高阶光谱响应函数来模拟合理的传感器量子效率曲线,并强制施加物理约束(如单一主峰、平滑响应和带宽限制),从而在保证完全可微且端到端训练的前提下,生成紧凑且具解释性的光谱表示。实验表明,LQE在多个公开数据集上显著优于传统与可学习降维方法,在提升语义分割性能的同时大幅降低参数量(仅需12–36个参数,对比其他方法51–22K),并展现出对数据内在波长模式的良好收敛性。
链接: https://arxiv.org/abs/2603.26528
作者: Imad Ali Shah,Jiarong Li,Ethan Delaney,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan
机构: University of Galway (爱尔兰戈尔韦大学); Valeo Vision Systems (瓦莱奥视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end-to-end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi-class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI-Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12–36 parameters compared to 51–22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low-order configurations are optimal, while the learned spectral filters converge to dataset-intrinsic wavelength patterns. These results demonstrate that physics-informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.
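LQE 的高阶参数化形式未在摘要中给出;下面用高斯形状的响应曲线示意其物理约束(单一主峰、平滑、带宽有界),以及"一个光谱通道 = 光谱与响应曲线的加权和"这一降维过程。曲线的具体形式为本文假设,仅用于说明思路:

```python
import math

def quantum_efficiency_curve(wavelengths_nm, center, width, amplitude=1.0):
    """示意性的量子效率曲线:高斯形状天然满足摘要中的物理约束
    (单一主峰、平滑响应、带宽有界)。真实 LQE 为可学习的高阶参数化,此处为假设。"""
    return [amplitude * math.exp(-((w - center) / width) ** 2) for w in wavelengths_nm]

def apply_filter(spectrum, qe):
    # 降维:输出通道是输入光谱按响应曲线的加权和
    return sum(s * q for s, q in zip(spectrum, qe))

waves = list(range(400, 701, 10))   # 400–700 nm,步长 10 nm
qe = quantum_efficiency_curve(waves, center=550, width=40)
spectrum = [1.0] * len(waves)       # 平坦的玩具光谱
print(round(apply_filter(spectrum, qe), 3))
```

每条这样的曲线只需极少参数(此处 3 个),与摘要中"12–36 个参数对比竞品 51–22K"的量级一致。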
[CV-20] Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays
【速读】:该论文旨在解决从二维(2D)胸部X射线(chest X-ray)重建高质量三维(3D)计算机断层扫描(CT)图像的问题,以提升诊断可及性并降低对高辐射剂量、昂贵设备和有限资源的依赖。现有方法多基于合成X射线投影,难以在真实临床场景中泛化。其解决方案的关键在于提出AXON框架——一个基于扩散模型(diffusion-based)的多阶段重建流程:首先采用布朗桥(Brownian Bridge)扩散模型进行全局结构生成,再通过ControlNet机制实现局部强度优化,并引入双平面(bi-planar)X-ray输入缓解深度歧义;此外,集成超分辨率网络将输出体积提升至诊断级分辨率。该方案在多个公开与外部数据集上均显著优于当前最优方法,在峰值信噪比(PSNR)和结构相似性指数(SSIM)上分别提升11.9%和11.0%,且具备强跨分布泛化能力。
链接: https://arxiv.org/abs/2603.26509
作者: Martin Rath,Morteza Ghahremani,Yitong Li,Ashkan Taghipour,Marcus Makowski,Christian Wachinger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computed tomography (CT) provides rich 3D anatomical details but is often constrained by high radiation exposure, substantial costs, and limited availability. While standard chest X-rays are cost-effective and widely accessible, they only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays offers a transformative solution to increase diagnostic accessibility, yet existing methods predominantly rely on synthetic X-ray projections, limiting clinical generalization. In this work, we propose AXON, a multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. AXON employs a coarse-to-fine strategy, with a Brownian Bridge diffusion model-based initial stage for global structural synthesis, followed by a ControlNet-based refinement stage for local intensity optimization. It also supports bi-planar X-ray input to mitigate depth ambiguities inherent in 2D-to-3D reconstruction. A super-resolution network is integrated to upscale the generated volumes to achieve diagnostic-grade resolution. Evaluations on both public and external datasets demonstrate that AXON significantly outperforms state-of-the-art baselines, achieving a 11.9% improvement in PSNR and a 11.0% increase in SSIM with robust generalizability across disparate clinical distributions. Our code is available at this https URL.
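摘要中的 PSNR 指标按通用定义计算如下,即 PSNR = 10·log10(MAX²/MSE);此处假设信号归一化到峰值 1.0:

```python
import math

def psnr(pred, target, max_val=1.0):
    """峰值信噪比:PSNR = 10 * log10(MAX^2 / MSE),单位 dB。"""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr([0.5, 0.5], [0.5, 0.6]))  # MSE = 0.005 → 约 23.01 dB
```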
[CV-21] ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better
【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在测试时遇到视觉输入退化(如常见图像退化)情况下产生幻觉(hallucination)的问题。研究表明,此类退化相当于额外的分布偏移(distribution shift),显著加剧了实际应用中的幻觉率。解决方案的关键在于提出一种名为CLIP-guided Test-Time Training(ClipTTT)的方法,其核心思想是利用预训练CLIP模型强大的图像-文本对齐能力作为稳定引导信号,从单个测试样本中识别出可靠的自监督目标,从而无需修改基础LVLM即可实现快速适应,有效降低幻觉并提升描述忠实度(descriptive faithfulness)。
链接: https://arxiv.org/abs/2603.26486
作者: Mriganka Nath,Anurag Das,Jiahao Xie,Bernt Schiele
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 12 figures
Abstract:Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
[CV-22] SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras CVPR2026
【速读】:该论文旨在解决动态场景的高质量4D重建问题,即如何在仅使用稀疏且未标定相机输入的情况下实现高保真、时空一致的渲染效果。传统方法依赖于密集部署的同步摄像机阵列(通常需数十甚至上百台),导致成本高昂且难以实际扩展。其解决方案的关键在于提出了一种时空扭曲场(Spatio-Temporal Distortion Field),该机制统一建模了生成式观测在空间和时间维度上的不一致性,从而有效利用了大量但不一致的生成式观察数据,构建了一个完整的4D重建流程,显著优于现有方法。
链接: https://arxiv.org/abs/2603.26481
作者: Weihong Pan,Xiaoyu Zhang,Zhuang Zhang,Zhichao Ye,Nan Wang,Haomin Liu,Guofeng Zhang
机构: State Key Lab of CADCG, Zhejiang University (浙江大学CAD&CG国家重点实验室); InSpatio Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.
[CV-23] HyVIC: A Metric-Driven Spatio-Spectral Hyperspectral Image Compression Architecture Based on Variational Autoencoders
【速读】:该论文旨在解决现有基于学习的高光谱图像(Hyperspectral Image, HSI)压缩方法在处理HSI特有的空间-光谱冗余时存在的不足问题。具体而言,当前方法多直接套用为自然图像设计的变分压缩模型,未能充分考虑HSI中空间与光谱维度之间的独特相关性,且缺乏显式的架构设计来平衡空间和光谱特征的学习能力,从而限制了压缩性能的提升。其解决方案的关键在于提出一种新型的空间-光谱变分高光谱图像压缩架构(HyVIC),该架构包含四个核心组件:可调节的空间-光谱编码器、空间-光谱超编码器、空间-光谱超解码器以及可调节的空间-光谱解码器,并引入基于指标驱动的超参数选择策略,以系统性地优化空间与光谱特征学习之间的权衡,从而显著提升重建保真度。实验表明,该方法在多个压缩比下均能实现优异的空间和光谱重建质量,在BD-PSNR指标上相比现有最优方法提升最高达4.66dB。
链接: https://arxiv.org/abs/2603.26468
作者: Martin Hermann Paul Fuchs,Behnood Rasti,Begüm Demir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid growth of hyperspectral data archives in remote sensing (RS) necessitates effective compression methods for storage and transmission. Recent advances in learning-based hyperspectral image (HSI) compression have significantly enhanced both reconstruction fidelity and compression efficiency. However, existing methods typically adapt variational image compression models designed for natural images, without adequately accounting for the distinct spatio-spectral redundancies inherent in HSIs. In particular, they lack explicit architectural designs to balance spatial and spectral feature learning, limiting their ability to effectively leverage the unique characteristics of hyperspectral data. To address this issue, we introduce spatio-spectral variational hyperspectral image compression architecture (HyVIC). The proposed model comprises four main components: 1) adjustable spatio-spectral encoder; 2) spatio-spectral hyperencoder; 3) spatio-spectral hyperdecoder; and 4) adjustable spatio-spectral decoder. We demonstrate that the trade-off between spatial and spectral feature learning is crucial for the reconstruction fidelity, and therefore present a metric-driven strategy to systematically select the hyperparameters of the proposed model. Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios (CRs) and improving the state of the art by up to 4.66dB in terms of BD-PSNR. Based on our results, we offer insights and derive practical guidelines to guide future research directions in learning-based variational HSI compression. Our code and pre-trained model weights are publicly available at this https URL .
[CV-24] Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates
【速读】:该论文旨在解决单图像下人体网格重建(Human Mesh Recovery)中存在的深度歧义性及跨域泛化能力不足的问题。现有方法虽融合回归与优化策略,但在测试时优化阶段常因初始化不佳和参数更新效率低下而表现受限。其解决方案的关键在于提出一种新颖的元学习框架:首先通过元学习策略在训练中模拟测试时优化过程,以学习更优的初始参数;其次引入选择性参数缓存机制,冻结已收敛关节以降低计算开销;最后采用基于分布的自适应更新策略,从学习到的参数变化分布中采样,实现鲁棒探索并量化不确定性。该方法显著提升了重建精度与跨场景泛化性能,同时提供与实际误差高度相关的不确定性估计。
链接: https://arxiv.org/abs/2603.26447
作者: Shaurjya Mandal,Nutan Sharma,John Galeotti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Human mesh recovery from single images remains challenging due to inherent depth ambiguity and limited generalization across domains. While recent methods combine regression and optimization approaches, they struggle with poor initialization for test-time refinement and inefficient parameter updates during optimization. We propose a novel meta-learning framework that trains models to produce optimization-friendly initializations while incorporating uncertainty-aware adaptive updates during test-time refinement. Our approach introduces three key innovations: (1) a meta-learning strategy that simulates test-time optimization during training to learn better parameter initializations, (2) a selective parameter caching mechanism that identifies and freezes converged joints to reduce computational overhead, and (3) distribution-based adaptive updates that sample parameter changes from learned distributions, enabling robust exploration while quantifying uncertainty. Additionally, we employ stochastic approximation techniques to handle intractable gradients in complex loss landscapes. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance, reducing MPJPE by 10.3 on 3DPW and 8.0 on Human3.6M compared to strong baselines. Our approach shows superior domain adaptation capabilities with minimal performance degradation across different environmental conditions, while providing meaningful uncertainty estimates that correlate with actual prediction errors. Combining meta-learning and adaptive optimization enables accurate mesh recovery and robust generalization to challenging scenarios.
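摘要中的 MPJPE(Mean Per-Joint Position Error)按其标准定义示意如下,即预测关节点与真值关节点之间 3D 欧氏距离的均值(通常以毫米计):

```python
import math

def mpjpe(pred_joints, gt_joints):
    """平均每关节位置误差:所有关节 3D 欧氏距离的均值。"""
    dists = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(dists) / len(dists)

pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # (0 + 5) / 2 = 2.5
```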
[CV-25] Image-based Quantification of Postural Deviations on Patients with Cervical Dystonia: A Machine Learning Approach Using Synthetic Training Data
【速读】:该论文旨在解决颈椎肌张力障碍(Cervical Dystonia, CD)临床评估中缺乏客观量化工具的问题,当前主要依赖主观性强、一致性差的临床评分量表(如Toronto Western Spasmodic Torticollis Rating Scale, TWSTRS)。其解决方案的关键在于开发了一种基于图像的自动化头部姿态与位移估计系统:通过预训练的头部姿态估计算法处理旋转性症状,并利用约16,000张合成虚拟人物图像训练深度学习模型以评估罕见的平移性症状(如侧向偏移),从而克服真实临床数据稀缺的瓶颈。该方法在多中心研究中表现出与20名专家共识评分高度一致的性能(旋转症状相关系数r=0.78–0.91),并在可控基准测试中优于人类评估者,实现了对CD姿势异常的客观、标准化评估,具备临床决策和疗效评价的实用价值。
链接: https://arxiv.org/abs/2603.26444
作者: Roland Stenger,Sebastian Löns,Nele Brügge,Feline Hamami,Alexander Münchau,Theresa Paulus,Anne Weissbach,Tatiana Usnich,Max Borsche,Martje G. Pauly,Lara M. Lange,Markus A. Hobert,Rebecca Herzog,Ana Luísa de Almeida Marcelino,Tina Mainka,Friederike Schumann,Lukas L. Goede,Johanna Reimer,Julienne Haas,Jos Becktepe,Alexander Baumann,Robin Wolke,Chi Wang Ip,Thorsten Odorfer,Daniel Zeller,Lisa Harder-Rauschenberger,John-Ih Lee,Philipp Albrecht,Tristan Kölsche,Joachim K. Krauss,Johanna M. Nagel,Joachim Runge,Johanna Doll-Lee,Simone Zittel,Kai Grimm,Pawel Tacik,André Lee,Tobias Bäumer,Sebastian Fudickar
机构: University of Lübeck(吕贝克大学); University Medical Center Schleswig-Holstein(施莱斯维希-霍尔斯坦大学医学中心); Institute of Neurogenetics(神经遗传学研究所); Department of Neurology(神经内科); Park-Klinik Weißensee(魏森塞公园诊所); Maria Hilf Clinics(玛丽亚希尔夫诊所); TUM Klinikum Rechts der Isar(慕尼黑工业大学右岸诊所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cervical dystonia (CD) is the most common form of dystonia, yet current assessment relies on subjective clinical rating scales, such as the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which requires expertise, is subjective, and shows low inter-rater reliability for some items of the scale. To address the lack of established objective tools for monitoring disease severity and treatment response, this study validates an automated image-based head pose and shift estimation system for patients with CD. We developed an assessment tool that combines a pretrained head-pose estimation algorithm for rotational symptoms with a deep learning model trained exclusively on ~16,000 synthetic avatar images to evaluate rare translational symptoms, specifically lateral shift. This synthetic data approach overcomes the scarcity of clinical training examples. The system’s performance was validated in a multicenter study by comparing its predicted scores against the consensus ratings of 20 clinical experts using a dataset of 100 real patient images and 100 labeled synthetic avatars. The automated system demonstrated strong agreement with expert clinical ratings for rotational symptoms, achieving high correlations for torticollis (r=0.91), laterocollis (r=0.81), and anteroretrocollis (r=0.78). For lateral shift, the tool achieved a moderate correlation (r=0.55) with clinical ratings and demonstrated higher accuracy than human raters in controlled benchmark tests on avatars. By leveraging synthetic training data to bridge the clinical data gap, this model successfully generalizes to real-world patients, providing a validated, objective tool for CD postural assessment that can enable standardized clinical decision-making and trial evaluation.
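摘要中报告的 r 值是自动评分与专家共识评分之间的皮尔逊相关系数,其标准计算方式如下(数据为虚构示例,仅示意计算过程):

```python
import math

def pearson_r(x, y):
    """皮尔逊相关系数,对应摘要中自动评分与专家共识评分间的 r 值。"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# 虚构的自动评分与专家评分序列
auto = [1.0, 2.0, 3.0, 4.0]
expert = [1.1, 2.0, 2.9, 4.2]
print(round(pearson_r(auto, expert), 3))
```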
[CV-26] CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities CVPR
【速读】:该论文旨在解决当前视觉骨干网络(vision backbone architectures)设计主要面向高并行计算硬件(如GPU或专用AI加速器)优化,而忽视了CPU平台在推理效率上的特殊需求问题。针对CPU缺乏大规模并行处理能力的特点,论文提出通过减少计算量(MACs)的同时保持硬件高效执行(即高每秒MAC数,MACpS),从而实现低延迟的模型设计。其关键解决方案在于对标准卷积进行两项改进:分组卷积(grouping convolutions)和减小卷积核尺寸(reducing kernel sizes),这两项策略显著降低了推理时的总MAC数,同时在多种CPU设备上验证了其硬件效率优势。基于此,作者提出了CPUBone系列视觉骨干模型,该模型在CPU平台上实现了最优的速度-精度权衡(Speed-Accuracy Trade-offs, SATs),并在目标检测与语义分割等下游任务中展现出良好的迁移性能。
链接: https://arxiv.org/abs/2603.26425
作者: Moritz Nottebaum,Matteo Dunnhofer,Christian Micheloni
机构: University of Udine, Italy (乌迪内大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR Findings 2026
Abstract:Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs cannot parallelize operations to the same extent, so models benefit from a design philosophy that balances the number of operations (MACs) against hardware-efficient execution, i.e., a high rate of MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at this https URL.
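The MAC arithmetic behind the two convolution modifications can be made concrete with a back-of-the-envelope sketch. Nothing below comes from the paper itself; the layer dimensions are arbitrary and only the standard convolution MAC formula is used.

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """Multiply-accumulates for a conv layer producing an h x w output map.

    Each output element sums over k * k * (c_in / groups) input values.
    """
    return h * w * c_out * k * k * (c_in // groups)

# Baseline: dense 3x3 convolution on a 56x56 feature map.
base = conv_macs(56, 56, 128, 128, k=3)
grouped = conv_macs(56, 56, 128, 128, k=3, groups=4)  # grouped convolution
small_k = conv_macs(56, 56, 128, 128, k=1)            # reduced kernel size

print(base // grouped)  # 4  (grouping by g divides MACs by g)
print(base // small_k)  # 9  (3x3 -> 1x1 divides MACs by 9)
```

The paper's point is that on CPUs these MAC reductions also translate into real latency gains, because both modifications keep the MACs-per-second rate high.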
[CV-27] SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
【速读】:该论文旨在解决外科训练中专家主导技能评估成本高、时间有限、难以扩展且专业知识受限于具备专科医生机构的问题。其核心解决方案是构建一个大规模多视角视频数据集——Surgical-Hands(SHands),该数据集通过五个RGB摄像头从互补视角记录52名参与者(20名专家与32名训练者)完成标准化线性切开和缝合操作的视频,每项操作重复三次。SHands在帧级别标注了15种手势基元,并引入经验证的8类训练者错误分类体系,支持手势识别与错误检测任务;同时定义了单视角、多视角及跨视角泛化标准评估协议,为深度学习模型提供基准测试平台。此数据集公开发布,旨在推动基于临床知识驱动的鲁棒、可扩展生成式AI系统在手术培训中的应用。
链接: https://arxiv.org/abs/2603.26400
作者: Le Ma,Thiago Freitas dos Santos,Nadia Magnenat-Thalmann,Katarzyna Wac
机构: MIRALab, Quality of Life Technologies Lab (QoL Lab), University of Geneva
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, difficult to scale, and its expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.
[CV-28] Restore Assess Repeat: A Unified Framework for Iterative Image Restoration CVPR2026
【速读】:该论文旨在解决图像恢复(Image Restoration, IR)在面对未知或复合退化(composite degradations)时普遍存在的泛化能力不足与效率低下的问题。其核心解决方案是提出一种“Restore, Assess and Repeat”(RAR)框架,将图像质量评估(Image Quality Assessment, IQA)与图像恢复(IR)深度融合为一个统一的端到端可训练模型,并在潜在空间(latent domain)中迭代执行退化识别、图像恢复与质量验证,从而实现动态自适应的高效恢复流程,显著减少模块间的信息损失与延迟,提升对多种退化场景的鲁棒性与性能表现。
链接: https://arxiv.org/abs/2603.26385
作者: I-Hsiang Chen,Isma Hadji,Enrique Sanchez,Adrian Bulat,Sy-Yen Kuo,Radu Timofte,Georgios Tzimiropoulos,Brais Martinez
机构: Samsung AI Center Cambridge; National Taiwan University; Technical University of Iasi; Chang Gung University; University of Wurzburg; Queen Mary University of London
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026; Project Page: this https URL
Abstract:Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process, that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess-and-restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach achieves consistent improvements under single, unknown, and composite degradations, thereby establishing a new state-of-the-art.
[CV-29] Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models)在视频理解任务中因视觉标记冗余导致的计算成本过高及“上下文衰减”(context rot)问题,从而引发性能下降。现有压缩策略通常依赖启发式或固定变换,与下游任务目标解耦,适应性与有效性受限。其解决方案的关键在于提出SCORE(Surprise-augmented token COmpression via REinforcement learning),一个基于强化学习的统一自适应标记压缩框架;其核心创新为引入一种基于“惊喜增强状态表示”(surprise-augmented state representation)的轻量级策略网络,该表示显式融合帧间残差以捕捉时序动态和运动显著性,并通过分组强化学习优化策略,辅以两阶段课程迁移(从静态伪视频到真实动态视频)提升训练稳定性。实验表明,SCORE在保留99.5%原始性能的同时实现16倍预填充加速(保留率10%),显著提升了长视频理解的可扩展性。
链接: https://arxiv.org/abs/2603.26365
作者: Shida Wang,YongXiang Hua,Zhou Tao,Haoyu Cao,Linli Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.
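As a toy illustration of the "surprise" signal the abstract describes (inter-frame residuals capturing motion saliency), the sketch below scores frames by residual magnitude and retains a top-k subset. This is a hand-rolled approximation for intuition only, not SCORE's learned policy network.

```python
import numpy as np

def surprise_scores(frames):
    """Per-frame surprise: L2 norm of the inter-frame residual.

    frames: (T, D) array of frame-level features; frame 0 scores 0.
    """
    residuals = np.diff(frames, axis=0)        # (T-1, D)
    scores = np.linalg.norm(residuals, axis=1)
    return np.concatenate([[0.0], scores])     # align back to T frames

def keep_top_k(frames, ratio=0.1):
    """Retain the fraction of frames with the highest surprise."""
    k = max(1, int(len(frames) * ratio))
    idx = np.argsort(-surprise_scores(frames))[:k]
    return np.sort(idx)

rng = np.random.default_rng(0)
static = np.tile(rng.normal(size=(1, 8)), (10, 1))  # 10 identical frames
static[5] += 5.0                                    # one sudden change
print(keep_top_k(static, ratio=0.2))                # [5 6]
```

Static stretches get near-zero surprise and are pruned, while the frames around the sudden change survive, which is the behavior a low retention ratio such as the 10% quoted in the abstract relies on.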
[CV-30] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models CVPR2026
【速读】:该论文旨在解决当前视觉语言模型(VLMs)在细粒度空间推理上的局限性,尤其是在理解复杂且高度关节化的手部姿态方面的能力不足问题。其解决方案的关键在于构建了一个大规模诊断性基准 HandVQA,该基准基于高质量的3D手部数据集(如 FreiHAND、InterHand2.6M 和 FPHA),包含超过160万道受控的多选题,专门用于测试模型对指尖、关节间角度、距离及相对位置等精细空间关系的理解能力。通过在多个先进 VLMs(如 LLaVA、DeepSeek 和 Qwen-VL)上进行评估并采用轻量级微调方法 LoRA,研究发现现有模型普遍存在幻觉手指部件、几何解释错误和泛化能力差等问题;更重要的是,实验表明从 HandVQA 学习到的3D接地空间知识可在零样本设置下迁移至下游任务,显著提升手部手势识别(+10.33%)与手物交互识别(+2.63%)的准确率,从而为改进模型的空间推理能力提供了验证路径。
链接: https://arxiv.org/abs/2603.26362
作者: MD Khalequzzaman Chowdhury Sayem,Mubarrat Tajoar Chowdhury,Yihalem Yimolal Tiruneh,Muneeb A. Khan,Muhammad Salman Ali,Binod Bhattarai,Seungryul Baek
机构: UNIST (国立科学技术院); University of Aberdeen (阿伯丁大学); University College London (伦敦大学学院); Fogsphere (Redev.AI Ltd) (Fogsphere (Redev.AI Ltd))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2026; Project page, code, and dataset: this https URL
Abstract:Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs’ understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
[CV-31] MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model CVPR2026
【速读】:该论文旨在解决Diffusion Transformer (DiT) 在训练过程中因各层采用相同数量的patchified tokens(即各层处理的图像块数量一致)而导致计算开销过大的问题。其核心解决方案是提出一种多尺度patch的分层Transformer设计:早期块使用较大的patch以捕获全局上下文信息,后期块则采用较小的patch来精细刻画局部细节。这种结构在保持生成性能的同时,可将计算量减少高达50%(以GFLOPs衡量)。此外,作者还改进了时间嵌入和类别嵌入的设计,进一步加速训练收敛。
链接: https://arxiv.org/abs/2603.26357
作者: Quan Dao,Dimitris Metaxas
机构: Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during the training process. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at this https URL
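The efficiency argument for the multi-patch design can be sketched numerically: self-attention cost grows roughly with the square of the token count, so early blocks operating on larger patches are far cheaper. The image size, patch sizes, and width below are illustrative, not the paper's configuration.

```python
def num_tokens(img, patch):
    """Token count when an img x img image is split into patch x patch tiles."""
    return (img // patch) ** 2

def attn_cost(tokens, dim):
    """Rough per-block self-attention cost: O(tokens^2 * dim)."""
    return tokens ** 2 * dim

img, dim = 256, 384
coarse = attn_cost(num_tokens(img, 32), dim)  # early block: 8x8 = 64 tokens
fine = attn_cost(num_tokens(img, 16), dim)    # late block: 16x16 = 256 tokens
print(fine // coarse)  # 16: halving the patch size quadruples the tokens,
                       # so per-block attention cost grows 16x
```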
[CV-32] From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter
【速读】:该论文旨在解决手绘图表(hand-drawn plot)在图形API推荐中的性能瓶颈问题,即当前Plot2API模型主要基于标准图表训练,在面对非专家用户更易绘制的手绘图时存在显著的领域差距(domain gap),导致推荐效果不佳。此外,多语言与多领域场景下模型参数膨胀和计算资源消耗过大也是关键挑战。解决方案的关键在于:一是构建首个面向手绘图表的专用数据集HDpy-13,用于提升模型对非标准输入的适应能力;二是提出轻量级适配器机制Plot-Adapter,通过引入轻量CNN模块增强局部特征捕捉能力,并采用投影矩阵共享策略减少微调参数量,从而实现高效、低资源消耗的多语言、多领域图形API推荐。
链接: https://arxiv.org/abs/2603.26356
作者: Zhenghao Xu(1),Mengning Yang(1) ((1) School of Big Data and Software Engineering, Chongqing University, Chongqing, China)
机构: Chongqing University (重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As plots play a critical role in modern data visualization and analysis, Plot2API was introduced to help non-experts and beginners create their desired plots by using neural networks to recommend graphical APIs directly from reference plot images. However, previous works on Plot2API have primarily focused on recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To facilitate non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendation for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter, which allows for the training and storage of separate adapters rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to further reduce the number of fine-tuning parameters. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.
[CV-33] Only What's Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection CVPR
【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)系统在安全关键场景中部署时面临的隐私合规问题,尤其是欧盟《通用数据保护条例》(GDPR)对个人可识别信息(PII)的严格限制。传统VAD方法依赖大量包含人脸特征和敏感人口统计属性的视频数据,易引发隐私泄露风险。解决方案的关键在于提出一种“仅所需”(Only What’s Necessary)隐私设计框架,通过结合基于广度(breadth-based)和深度(depth-based)的数据最小化机制,在保留异常检测所需视觉线索的同时主动抑制PII信息暴露。该框架通过多组最小化配置评估隐私与检测性能之间的权衡关系,并利用基于排序的方法及帕累托分析(Pareto analysis)识别出最优操作点,在最小化个人数据暴露的前提下实现检测性能的可控下降。
链接: https://arxiv.org/abs/2603.26354
作者: Nazia Aslam,Abhisek Ray,Thomas B. Moeslund,Kamal Nasrollahi
机构: Aalborg University (奥尔堡大学); Aarhus University (奥胡斯大学); Milestone System (里程碑系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, CVPR conference
Abstract:Video anomaly detection (VAD) systems are increasingly deployed in safety critical environments and require a large amount of data for accurate detection. However, such data may contain personally identifiable information (PII), including facial cues and sensitive demographic attributes, creating compliance challenges under the EU General Data Protection Regulation (GDPR). In particular, GDPR requires that personal data be limited to what is strictly necessary for a specified processing purpose. To address this, we introduce Only What’s Necessary, a privacy-by-design framework for VAD that explicitly controls the amount and type of visual information exposed to the detection pipeline. The framework combines breadth based and depth based data minimization mechanisms to suppress PII while preserving cues relevant to anomaly detection. We evaluate a range of minimization configurations by feeding the minimized videos to both a VAD model and a privacy inference model. We employ two ranking based methods, along with Pareto analysis, to characterize the resulting trade off between privacy and utility. From the non-dominated frontier, we identify sweet spot operating points that minimize personal data exposure with limited degradation in detection performance. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed framework.
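The Pareto analysis used to locate "sweet spot" operating points boils down to filtering non-dominated (utility, privacy) pairs. A minimal sketch with made-up configuration scores (higher is better on both axes):

```python
def pareto_front(points):
    """Return the non-dominated points among (utility, privacy) pairs."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (detection utility, privacy protection) for five minimization configs
configs = [(0.90, 0.20), (0.85, 0.60), (0.80, 0.70), (0.70, 0.65), (0.60, 0.90)]
print(pareto_front(configs))
# (0.70, 0.65) drops out: config (0.80, 0.70) beats it on both axes
```

A sweet spot would then be chosen from the remaining frontier, trading a small drop in detection performance for a large reduction in personal data exposure.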
[CV-34] DuSCN-FusionNet: An Interpretable Dual-Channel Structural Covariance Fusion Framework for ADHD Classification Using Structural MRI
【速读】:该论文旨在解决注意力缺陷多动障碍(Attention Deficit Hyperactivity Disorder, ADHD)的神经生物学诊断难题,尤其是缺乏可靠基于影像学的解剖标志物问题。当前结构磁共振成像(structural MRI, sMRI)虽能非侵入性地揭示与ADHD相关的脑部形态改变,但主流深度学习方法因“黑箱”特性而难以获得临床信任和可解释性。为此,作者提出DuSCN-FusionNet框架,其核心创新在于利用双通道结构协方差网络(dual-channel Structural Covariance Networks, SCNs)建模区域间形态关系:通过ROI-wise均值强度和区域内异质性特征构建强度型与异质性型SCNs,并由SCN-CNN编码器处理;同时在后期融合阶段引入辅助ROI变异性特征与全局统计描述符以提升性能。该方法不仅实现了80.59%的平均平衡准确率和0.778的AUC,还借助Grad-CAM扩展至SCN域生成ROI级重要性评分,从而识别出具有潜在生物标志物意义的结构性脑区。
链接: https://arxiv.org/abs/2603.26351
作者: Qurat Ul Ain,Alptekin Temizel,Soyiba Jawed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 5 figures
Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a highly prevalent neurodevelopmental condition; however, its neurobiological diagnosis remains challenging due to the lack of reliable imaging-based biomarkers, particularly anatomical markers. Structural MRI (sMRI) provides a non-invasive modality for investigating brain alterations associated with ADHD; nevertheless, most deep learning approaches function as black-box systems, limiting clinical trust and interpretability. In this work, we propose DuSCN-FusionNet, an interpretable sMRI-based framework for ADHD classification that leverages dual-channel Structural Covariance Networks (SCNs) to capture inter-regional morphological relationships. ROI-wise mean intensity and intra-regional variability descriptors are used to construct intensity-based and heterogeneity-based SCNs, which are processed through an SCN-CNN encoder. In parallel, auxiliary ROI-wise variability features and global statistical descriptors are integrated via late-stage fusion to enhance performance. The model is evaluated using stratified 10-fold cross-validation with a 5-seed ensemble strategy, achieving a mean balanced accuracy of 80.59% and an AUC of 0.778 on the Peking University site of the ADHD-200 dataset. DuSCN-FusionNet further achieves precision, recall, and F1-scores of 81.66%, 80.59%, and 80.27%, respectively. Moreover, Grad-CAM is adapted to the SCN domain to derive ROI-level importance scores, enabling the identification of structurally relevant brain regions as potential biomarkers.
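At its core, a structural covariance network of the kind described above is an inter-regional correlation matrix computed over a morphometric descriptor across subjects. A minimal sketch on synthetic data (not the ADHD-200 pipeline):

```python
import numpy as np

def structural_covariance_network(roi_features):
    """SCN as the ROI-by-ROI Pearson correlation matrix.

    roi_features: (n_subjects, n_rois) array of one descriptor per ROI,
    e.g., ROI-wise mean intensity or intra-regional variability.
    """
    return np.corrcoef(roi_features, rowvar=False)

rng = np.random.default_rng(42)
x = rng.normal(size=(100, 5))                   # 100 subjects, 5 ROIs
x[:, 1] = x[:, 0] + 0.1 * rng.normal(size=100)  # ROI 1 covaries with ROI 0
scn = structural_covariance_network(x)
print(scn.shape)        # (5, 5)
print(scn[0, 1] > 0.9)  # True: strongly coupled regions show up as hot edges
```

Building one such matrix from mean intensities and another from intra-regional variability yields the two channels that the SCN-CNN encoder described above would consume.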
[CV-35] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长文本生成过程中出现的视觉证据偏离问题,即随着输出长度增加,模型逐渐脱离图像内容、依赖文本先验,导致推理失真与幻觉现象。其解决方案的关键在于提出一种自演化训练框架——视觉再审视(Visual Re-Examination, VRE),该框架使MLLMs能够在不引入额外视觉输入的前提下,自主执行视觉内省(visual introspection),通过模型自身生成反思轨迹(reflection traces)来增强信息增益,从而激活潜在的后期视觉验证能力,实现迭代式自我改进,显著提升长链推理中的准确性与感知可靠性,并有效减少幻觉。
链接: https://arxiv.org/abs/2603.26348
作者: Shuai Lv,Chang Liu,Feng Tang,Yujie Yuan,Aojun Zhou,Kui Zhang,Xi Yang,Yangqiu Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at this https URL.
[CV-36] HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network ICASSP2026
【速读】:该论文针对组成图像检索(Composed Image Retrieval, CIR)中现有方法忽视上下文信息导致匹配样本区分度不足的问题提出解决方案。核心挑战在于隐式依赖关系建模与缺乏差异增强机制,为此作者提出双路径结构化上下文网络(HINT),通过上下文感知编码和相似度差异放大机制,显著提升模型在复杂场景下的性能表现。
链接: https://arxiv.org/abs/2603.26341
作者: Mingyu Zhang,Zixu Li,Zhiwei Chen,Zhiheng Fu,Xiaowei Zhu,Jiajia Nie,Yinwei Wei,Yupeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2026
Abstract:Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. However, addressing this limitation is not an easy task due to two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus raising the performance ceiling of CIR models in complex scenarios. Our HINT model achieves optimal performance on all metrics across two CIR benchmark datasets, demonstrating its superiority. Codes are available at this https URL.
[CV-37] From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition CVPR
【速读】:该论文旨在解决大规模视频模型在提升视频理解能力的同时加剧隐私泄露风险的问题,特别是针对面部身份、种族和性别等敏感属性的编码问题。现有图像匿名化方法难以适用于视频场景,因为现代视频模型可利用时空运动模式作为生物特征标识符。解决方案的关键在于提出一种基于注意力机制驱动的时空视频匿名化框架,通过系统性地解耦有用信息与隐私特征实现隐私保护。其核心创新是引入两个任务特定的分类标记(action CLS token 和 privacy CLS token),在共享的 Vision Transformer (ViT) 主干网络中学习互补表征,并通过对比两者注意力分布计算每个时空管状体(spatiotemporal tubelet)的效用-隐私评分,保留得分最高的 top-k 管状体以剔除主要包含隐私线索的部分,从而在保障动作识别性能的同时显著降低隐私泄露风险。
链接: https://arxiv.org/abs/2603.26336
作者: Nazia Aslam,Abhisek Ray,Joakim Bruslund Haurum,Lukas Esterle,Kamal Nasrollahi
机构: Aalborg University (奥尔堡大学); Aarhus University (奥胡斯大学); Poineer Centre for AI (先锋人工智能中心); University of Southern Denmark (南丹麦大学); Milestone System (里程碑系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, CVPR paper
Abstract:Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-k tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.
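The tubelet scoring described above, contrasting the attention of the action and privacy CLS tokens and keeping the top-k, can be sketched with toy attention vectors (in the actual framework these distributions would come from the trained ViT backbone):

```python
import numpy as np

def utility_privacy_scores(action_attn, privacy_attn):
    """High score = attended by the action token but not the privacy token.

    Both inputs are attention distributions over N tubelets (each sums to 1).
    """
    return action_attn - privacy_attn

def prune_tubelets(action_attn, privacy_attn, keep_ratio=0.5):
    """Keep the top-k tubelets by utility-privacy score."""
    scores = utility_privacy_scores(action_attn, privacy_attn)
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(-scores)[:k])

action = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
privacy = np.array([0.1, 0.1, 0.4, 0.3, 0.1])  # tubelets 2-3 carry identity cues
print(prune_tubelets(action, privacy, keep_ratio=0.4))  # [0 1]
```

Tubelets where the privacy token dominates get negative scores and are discarded, which is how the pruning removes identity cues while preserving action evidence.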
[CV-38] Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在监督微调(Supervised Fine-Tuning, SFT)过程中出现的感知能力提升与推理性能下降之间的权衡问题,即所谓的“推理税”(reasoning tax)。研究发现,这种性能退化可能源于跨深度表示(cross-depth representations)访问机制的破坏。解决方案的关键在于提出一种轻量级的输入自适应深度聚合机制(Input-Adaptive Depth Aggregation, IADA),该机制通过低秩瓶颈实现输入自适应和模态感知的跨深度检索,从而有效恢复模型对深层语义信息的访问能力。实验表明,IADA仅引入0.14M额外参数,在Qwen3-VL-2B模型上相比仅使用LoRA微调显著提升了平均推理得分9.5点和感知得分3.3点,尤其在参数高效低秩微调场景下效果最显著。
链接: https://arxiv.org/abs/2603.26330
作者: Yiming Ren,Yujiu Yang,Junjie Wang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.
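IADA's exact parameterization is not given in the abstract; the sketch below is a hypothetical illustration of what input-adaptive cross-depth aggregation through a low-rank bottleneck could look like, with random untrained weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class DepthAggregator:
    """Toy input-adaptive depth aggregation: a low-rank bottleneck maps the
    top layer's hidden state to per-layer mixing weights, then returns a
    weighted sum over all layers' hidden states."""

    def __init__(self, n_layers, dim, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(scale=0.02, size=(dim, rank))     # bottleneck in
        self.up = rng.normal(scale=0.02, size=(rank, n_layers))  # bottleneck out

    def __call__(self, hidden_states):  # hidden_states: (n_layers, dim)
        query = hidden_states[-1]       # condition on the current input
        weights = softmax(query @ self.down @ self.up)
        return weights @ hidden_states, weights

layers = np.random.default_rng(1).normal(size=(12, 32))
out, w = DepthAggregator(n_layers=12, dim=32)(layers)
print(out.shape, round(float(w.sum()), 6))  # (32,) 1.0
```

Such a bottleneck adds only dim*rank + rank*n_layers parameters, which is consistent in spirit with the very small overhead reported above.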
[CV-39] Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization CVPR2026
【速读】:该论文旨在解决第三方平台在集成多种文本到图像(Text-to-Image, T2I)生成模型API时,可能因虚假声称使用官方模型而误导用户并损害模型所有者声誉的问题,即如何高效、准确地验证API背后实际运行的T2I模型是否与其宣称一致。解决方案的关键在于提出一种无需参考模型的验证方法——边界感知提示优化(Boundary-aware Prompt Optimization, BPO),其核心思想是利用目标模型在嵌入空间中语义边界(semantic boundaries)附近的不稳定性:不同T2I模型对正常提示生成相似结果,但在概念过渡区域(如“柯基犬”与“贝果”之间的边界)附近,目标模型会产生不稳定输出(如有时生成柯基犬,有时生成贝果),而其他模型则保持稳定;BPO通过识别这类边界邻近提示,捕获模型特异性行为作为可靠的验证信号,从而实现高精度的模型鉴别。
链接: https://arxiv.org/abs/2603.26328
作者: Zidong Zhao,Yihao Huang,Qing Guo,Tianlin Li,Anran Li,Kailong Wang,Jin Song Dong,Geguang Pu
机构: Zhejiang University (浙江大学); East China Normal University (华东师范大学); Nankai University (南开大学); Beihang University (北京航空航天大学); University of Science and Technology of China (中国科学技术大学); Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Findings)
Abstract:As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners’ reputations, making model verification essential to confirm whether an API’s underlying model matches its claim. Existing methods address this by using verification prompts generated by official model owners, but the generation relies on multiple reference models for optimization, leading to high computational cost and sensitivity to model selection. To address this problem, we propose a reference-free T2I model verification method called Boundary-aware Prompt Optimization (BPO). It directly explores the intrinsic characteristics of the target model. The key insight is that although different T2I models produce similar outputs for normal prompts, their semantic boundaries in the embedding space (transition zones between two concepts such as “corgi” and “bagel”) are distinct. Prompts near these boundaries generate unstable outputs (e.g., sometimes a corgi and sometimes a bagel) on the target model but remain stable on other models. By identifying such boundary-adjacent prompts, BPO captures model-specific behaviors that serve as reliable verification cues for distinguishing T2I models. Experiments on five T2I models and four baselines demonstrate that BPO achieves superior verification accuracy.
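BPO's core observation, that outputs become unstable for prompts whose embeddings sit near a semantic boundary, can be illustrated with a toy generator whose label flips when neither concept clearly dominates. Everything below (the 2-D "embeddings" and toy_generate) is hypothetical and stands in for a real T2I model:

```python
import numpy as np

def boundary_embeddings(emb_a, emb_b, steps=5):
    """Linearly interpolate between two concept embeddings; midpoints
    approximate prompts near the semantic boundary."""
    return [(1 - a) * emb_a + a * emb_b for a in np.linspace(0.0, 1.0, steps)]

def output_instability(generate, emb, n_samples=8):
    """Fraction of generations disagreeing with the majority label."""
    labels = [generate(emb) for _ in range(n_samples)]
    majority = max(set(labels), key=labels.count)
    return 1.0 - labels.count(majority) / len(labels)

# Toy stand-in for a T2I model: output flips when neither concept dominates.
def toy_generate(emb):
    toy_generate.calls += 1
    if abs(emb[0] - emb[1]) > 0.5:  # clearly one concept
        return "corgi" if emb[0] > emb[1] else "bagel"
    return "corgi" if toy_generate.calls % 2 else "bagel"
toy_generate.calls = 0

corgi, bagel = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for e in boundary_embeddings(corgi, bagel, steps=3):
    print(output_instability(toy_generate, e, n_samples=20))
# 0.0 at the endpoints, 0.5 at the boundary midpoint
```

BPO searches for real prompts exhibiting this midpoint-like instability on the target model; since other models place their boundaries elsewhere, they stay stable on the same prompts, giving a verification signal.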
[CV-40] DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因离散动作编码导致的早期动作token错误难以修正的问题。现有解码范式,无论是自回归式还是离散扩散式,一旦生成动作token便固定不变,缺乏迭代优化能力。解决方案的关键在于提出DFM-VLA,一种基于离散流匹配(Discrete Flow Matching)的VLA框架,其核心创新是建模token级别的概率速度场(probability velocity field),使动作序列在迭代 refinement 过程中能够动态更新,从而实现对错误动作的逐步修正;同时采用两阶段解码策略——先进行迭代精炼,再通过确定性验证确保收敛稳定性,显著提升了机器人操作的成功率与鲁棒性。
链接: https://arxiv.org/abs/2603.26320
作者: Jiayi Chen,Wenxuan Song,Shuai Chen,Jingbo Wang,Zhijun Li,Haoang Li
机构: Tongji University (同济大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision–Language–Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at this https URL
[CV-41] Label-Free Cross-Task LoRA Merging with Null-Space Compression CVPR2026
【速读】:该论文旨在解决多任务微调场景下LoRA(Low-Rank Adaptation)模型合并时的性能不均衡问题,尤其在任务类型异质(如分类、回归和序列生成混合)时现有方法失效或效果不佳的问题。解决方案的关键在于提出一种无需标签信息、输出无关的Null-Space Compression (NSC) 合并方法,其核心思想是利用LoRA微调过程中权重更新矩阵ΔW = BA中下投影因子A的零空间压缩程度作为优化信号来设定合并权重,该信号与模型性能强相关且具有跨任务泛化能力,从而实现对多种任务类型的统一高效合并。
链接: https://arxiv.org/abs/2603.26317
作者: Wonyoung Lee,Wooseong Jeong,Kuk-Jin Yoon
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026
Abstract:Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the foundation-model era, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA fine-tuning the down-projection factor A in ΔW = BA compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
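The abstract does not spell out how null-space compression of the down-projection A is quantified, so the sketch below substitutes one plausible proxy, the spectral concentration of A's singular values, purely to illustrate the workflow of deriving a label-free merge weight from adapter geometry:

```python
import numpy as np

def spectral_concentration(A):
    """Hypothetical proxy for null-space compression of a LoRA factor A (r x d):
    1 minus the normalized participation ratio of A's squared singular values.
    Near 0 for an isotropic A, near 1 when energy collapses into few directions.
    """
    s = np.linalg.svd(A, compute_uv=False)
    p = s**2 / np.sum(s**2)
    eff_rank = 1.0 / np.sum(p**2)  # participation ratio
    return 1.0 - (eff_rank - 1.0) / (len(s) - 1.0)

def merge_weights(adapters):
    """Label-free merge weights proportional to each adapter's score."""
    scores = np.array([spectral_concentration(A) for A in adapters])
    return scores / scores.sum()

rng = np.random.default_rng(1)
iso = rng.normal(size=(8, 64))                  # near-isotropic A
aniso = iso * np.array([[10.0]] + [[0.1]] * 7)  # energy packed into one row
w = merge_weights([iso, aniso])
print(w[1] > w[0])  # True: the more compressed adapter gets the larger weight
```

Only the workflow matters here: a geometry-derived score per adapter is computed with no labels or forward passes, then normalized into merge coefficients, mirroring NSC's output-agnostic setup.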
[CV-42] SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning CVPR2026
【速读】:该论文旨在解决对比学习训练的多模态模型(如CLIP)中敏感信息难以有效删除的问题,尤其关注细粒度的关联级遗忘(association-level forgetting)缺失。现有方法在评估时无法精确诊断遗忘效果与副作用,导致无法可靠衡量机器遗忘(machine unlearning)的实际性能。解决方案的关键在于提出SALMUBench基准,其基于60K个人格-属性关联的合成数据集构建了“污染模型”(Compromised model)和“干净模型”(Clean model),二者均从相同保留基础数据(400M对)重新训练,仅污染模型额外加入敏感数据;同时设计结构化的验证集(holdout identity 和 holdout association)来精准量化遗忘效率与副作用(collateral damage)。该方案首次实现了对多模态模型中敏感信息遗忘的细粒度、可复现的评估,揭示了当前方法存在遗忘不足或过度泛化两大失败模式,为未来研究提供了标准化评测框架。
链接: https://arxiv.org/abs/2603.26316
作者: Cai Selvas-Sala,Lei Kang,Lluis Gomez
机构: Computer Vision Center (计算机视觉中心); Universitat Politècnica de Catalunya (加泰罗尼亚理工大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026. Project page: this http URL
Abstract:As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.
[CV-43] Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy CVPR2026
【速读】:该论文旨在解决多低秩适配(Low-Rank Adaptation, LoRA)模块合并时因更新方向分布在不同子空间且贡献不均导致的性能下降问题,即在简单合并时可能削弱对特定任务损失至关重要的方向,同时过度强调相对次要的方向,从而降低模型对所有任务的忠实表征能力。解决方案的关键在于提出TARA-Merging方法,通过引入偏好加权交叉熵伪损失来对齐合并权重,同时保留任务相关的LoRA子空间,从而实现广谱的子空间覆盖并缓解各方向上的各向异性(anisotropy),确保合并后的模型在多个视觉和自然语言推理(NLI)基准上表现出更强的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2603.26299
作者: Wooseong Jeong,Wonyoung Lee,Kuk-Jin Yoon
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026
Abstract:Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model’s ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.
[CV-44] PhysVid: Physics Aware Local Conditioning for Generative Video Models CVPR2026
【速读】:该论文旨在解决生成式视频模型(Generative Video Models)在实际应用中因违反基本物理规律而导致可靠性不足的问题。现有方法依赖于条件控制(conditioning),但帧级信号具有领域特异性且作用时间短,而全局文本提示则过于粗粒度且噪声大,难以捕捉细粒度动态。解决方案的关键在于提出一种物理感知的局部条件机制(PhysVid),其将连续帧片段标注为基于物理的描述(状态、交互与约束),并通过块感知交叉注意力(chunk-aware cross-attention)融合至全局提示;推理时引入负向物理提示(negative physics prompts)以引导生成避开违反物理定律的轨迹。这一方法显著提升了视频物理常识性,在VideoPhy数据集上较基线提升约33%,在VideoPhy2上提升达8%。
链接: https://arxiv.org/abs/2603.26285
作者: Saurabh Pathak,Elahe Arani,Mykola Pechenizkiy,Bahram Zonooz
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for CVPR 2026
Abstract:Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by \approx 33% over baseline video generators, and by up to \approx 8% on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.
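The "negative physics prompts" in the abstract above can be read as a classifier-free-guidance-style combination. The sketch below is a hypothetical reading, not the authors' formula: it pushes the denoiser output toward the physics-grounded prompt and away from the law-violation description.

```python
import numpy as np

def guided_noise_pred(eps_pos, eps_neg, eps_uncond, w_pos=5.0, w_neg=2.0):
    """CFG-style steering (assumed form): amplify the physics-grounded
    conditional direction, subtract the 'law violation' direction."""
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))

rng = np.random.default_rng(1)
e_uncond, e_pos, e_neg = (rng.normal(size=8) for _ in range(3))
out = guided_noise_pred(e_pos, e_neg, e_uncond)
print(out.shape)  # (8,)
```

When positive, negative, and unconditional predictions coincide, the guidance terms vanish and the output reduces to the unconditional prediction, as expected.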
[CV-45] GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
【速读】:该论文旨在解决大型视觉-语言模型(Vision-Language Models, VLMs)驱动的GUI代理在实际应用中因训练数据缺乏特定领域软件操作信息而产生的领域偏差问题,具体表现为对特定应用程序的操作流程(规划)和UI元素布局(定位)不熟悉,从而限制其任务执行性能。解决方案的关键在于提出一个无需训练、即插即用的框架GUIDE(GUI Unbiasing via Instructional-Video Driven Expertise),其核心创新包括:一是基于字幕的Video-RAG(Retrieval-Augmented Generation)管道,通过三阶段检索(领域分类、主题提取、相关性匹配)精准识别任务相关的教学视频;二是基于逆动力学范式的全自动标注流水线,将带UI元素检测增强的关键帧输入VLM,自动推断出所需的规划与定位知识,并注入代理对应模块以同时缓解两种形式的领域偏差。实验表明,GUIDE在OSWorld基准上显著提升性能且减少执行步骤,无需修改模型参数或结构,具备架构无关性。
链接: https://arxiv.org/abs/2603.26266
作者: Rui Xie,Zhi Gao,Chenrui Shi,Zirui Shang,Lu Chen,Qing Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 8 figures, 7 tables
Abstract:Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge that are injected into the agent’s corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE’s generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.
[CV-46] DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation ICRA2026
【速读】:该论文旨在解决LiDAR点云语义分割中因真实数据标注成本高昂以及合成数据与真实数据之间存在数据级域偏移(domain gap)而导致模型在真实场景下性能下降的问题。解决方案的关键在于提出了一种名为DRUM的Sim2Real翻译框架,其核心创新是利用在未标注真实数据上预训练的扩散模型作为生成先验,通过重现反射强度(reflectance intensity)和射线丢失噪声(raydrop noise)这两个关键测量特征来实现合成数据到真实数据的转换;同时引入一种射线丢失感知的掩码引导机制(raydrop-aware masked guidance mechanism),在保证输入合成数据一致性的同时,保留由扩散先验生成的真实射线丢失噪声,从而提升样本保真度并增强模型在真实场景中的泛化能力。
链接: https://arxiv.org/abs/2603.26263
作者: Tomoya Miyawaki,Kazuto Nakashima,Yumi Iwashita,Ryo Kurazume
机构: Kyushu University (九州大学); Jet Propulsion Laboratory, California Institute of Technology (加州理工学院喷气推进实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICRA 2026
Abstract:LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at this https URL.
[CV-47] GLASS: Geometry-aware Local Alignment and Structure Synchronization Network for 2D-3D Registration
【速读】:该论文旨在解决图像到点云配准(image-to-point cloud registration)中因重复纹理导致的误匹配问题,以及现有方法忽视结构一致性限制对应关系充分挖掘的问题。解决方案的关键在于提出两个新模块:局部几何增强(Local Geometry Enhancement, LGE)模块通过引入法向量信息增强图像与点云特征,将几何结构注入图像特征以减少误匹配;图分布一致性(Graph Distribution Consistency, GDC)模块则基于匹配点构建图结构来更新特征,并显式约束相似度分布,从而提升配准精度与鲁棒性。
链接: https://arxiv.org/abs/2603.26262
作者: Zhixin Cheng,Jiacheng Deng,Xinjun Li,Bohao Liao,Li Liu,Xiaotian Yin,Baoqun Yin,Tianzhu Zhang
机构: University of Science and Technology of China (中国科学技术大学); Meituan Inc (美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by IEEE Transactions on Circuits and Systems for Video Technology
Abstract:Image-to-point cloud registration methods typically follow a coarse-to-fine pipeline, extracting patch-level correspondences and refining them into dense pixel-to-point matches. However, in scenes with repetitive patterns, images often lack sufficient 3D structural cues and alignment with point clouds, leading to incorrect matches. Moreover, prior methods usually overlook structural consistency, limiting the full exploitation of correspondences. To address these issues, we propose two novel modules: the Local Geometry Enhancement (LGE) module and the Graph Distribution Consistency (GDC) module. LGE enhances both image and point cloud features with normal vectors, injecting geometric structure into image features to reduce mismatches. GDC constructs a graph from matched points to update features and explicitly constrain similarity distributions. Extensive experiments and ablations on two benchmarks, RGB-D Scenes v2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance in image-to-point cloud registration.
[CV-48] GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation CVPR2026
【速读】:该论文旨在解决开放词汇表3D语义分割(open-vocabulary 3D semantic segmentation)中现有方法依赖2D开放词汇模型知识蒸馏所带来的局限性,尤其是3D特征与2D表示空间对齐导致的内在几何学习受限及2D预测误差传播问题。解决方案的关键在于提出GeoGuide框架,其核心创新包括:1)基于不确定性的超点蒸馏模块(Uncertainty-based Superpoint Distillation),通过融合几何与语义特征估计逐点不确定性,自适应加权超点内的2D特征以抑制噪声并保留判别信息,增强局部语义一致性;2)实例级掩码重建模块(Instance-level Mask Reconstruction),利用几何先验强制实例内部语义一致性,重构完整实例掩码;3)跨实例关系一致性模块(Inter-Instance Relation Consistency),通过对齐几何相似性和语义相似性矩阵,校准同类物体间的跨实例一致性,缓解视角引起的语义漂移。
链接: https://arxiv.org/abs/2603.26260
作者: Xujing Tao,Chuxin Wang,Yubo Ai,Zhixin Cheng,Zhuoyuan Li,Liangsheng Liu,Yujia Chen,Xinjun Li,Qiao Li,Wenfei Yang,Tianzhu Zhang
机构: University of Science and Technology of China (中国科学技术大学); National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory (深空探测重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.
[CV-49] ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction
【速读】:该论文旨在解决视觉Transformer在密集特征提取任务中计算效率低下的问题,尤其是在高分辨率输入下导致的冗余计算和资源消耗。其解决方案的关键在于提出一种混合分辨率的粗到精(coarse-to-fine)架构ARTA:模型从低分辨率(粗粒度)token开始,通过一个轻量级分配器(allocator)动态预测需要细化的区域,并仅在语义边界附近分配额外的细粒度token,从而将计算资源集中在语义复杂区域,同时保持对弱边界证据的高敏感性。这种机制不仅减少了整体FLOPs和内存占用,还促使token更专注于单一语义类别,提升了特征表示的准确性与效率。
链接: https://arxiv.org/abs/2603.26258
作者: David Hagerman,Roman Naeem,Erik Brorsson,Fredrik Kahl,Lennart Svensson
机构: Chalmers University of Technology (查尔姆斯理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.
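The allocator's decision rule — give extra fine tokens to coarse patches whose predicted class-boundary score clears a low threshold — can be sketched in a few lines. This is an illustrative stand-in, not the trained module; the threshold value is assumed.

```python
import numpy as np

def allocate_tokens(boundary_score, threshold=0.2):
    """Return indices of coarse patches that should be refined with extra
    fine tokens: those whose boundary score exceeds a low threshold,
    keeping high sensitivity to weak boundary evidence."""
    return np.argwhere(boundary_score > threshold)

score = np.array([[0.05, 0.90],
                  [0.15, 0.40]])   # per-patch boundary scores
refine = allocate_tokens(score)
print(refine.tolist())  # [[0, 1], [1, 1]]
```

Homogeneous patches (low scores) stay at coarse resolution, which is where the FLOPs savings come from.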
[CV-50] Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment
【速读】:该论文旨在解决无人机(UAV)自主修剪树木时的关键安全问题:如何在实时条件下精确估计切割工具与细树枝之间的度量距离,以确保无人机能够无碰撞地接近、对齐并执行修剪操作。解决方案的核心在于训练五种变体的DEFOM-Stereo(一种基于基础模型的立体匹配算法),并在NVIDIA Jetson Orin Super 16 GB平台上部署最优模型。关键创新包括使用Unreal Engine 5构建的专用合成数据集(包含5,520对立体图像和EXR深度图监督信号),以及引入一个平衡精度与推理速度的新变体DEFOM-PrunePlus——其在Jetson平台实现约3.3 FPS的帧率下仍保持足够的深度准确性(depth MAE 64.26 cm),满足闭环控制所需的实时性与安全性要求,从而实现了从仿真到真实场景的有效迁移。
链接: https://arxiv.org/abs/2603.26250
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo - a recent foundation-model-based stereo matcher - on a task-specific synthetic dataset and deploying the checkpoints on an NVIDIA Jetson Orin Super 16 GB. The training corpus is built in Unreal Engine 5 with a simulated ZED Mini stereo camera capturing 5,520 stereo pairs across 115 tree instances from three viewpoints at 2m distance; dense EXR depth maps provide exact, spatially complete supervision for thin branches. On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm) but its Jetson inference speed of ~2.2 FPS (~450 ms per frame) remains too slow for responsive closed-loop tool control. A newly introduced balanced variant, DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson), offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance and its depth accuracy supports safe branch approach planning at the 2m operating range. The lightweight DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) run fast but sacrifice substantial accuracy (depth MAE 57 cm), making estimates too unreliable for safe actuation. Zero-shot inference on real photographs confirms that full-capacity models preserve branch geometry, validating the sim-to-real transfer. We conclude that DEFOM-PrunePlus provides the most practical accuracy-latency balance for onboard distance estimation, while ViT-S serves as the reference for future hardware.
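The depth numbers above come from stereo disparity via the standard pinhole relation depth = f·B/d, which also explains why accuracy degrades quadratically with range. A quick check with assumed ZED Mini-like parameters (focal length and baseline are illustrative values, not taken from the paper):

```python
def disparity_to_depth(disp_px, focal_px, baseline_m):
    """Pinhole stereo: depth = f * B / d. A disparity error of delta_d maps
    to a depth error of roughly depth**2 * delta_d / (f * B), so thin
    branches at range are the hard case."""
    return focal_px * baseline_m / disp_px

# Assumed illustrative values: f ~ 700 px, baseline B = 0.063 m
depth_m = disparity_to_depth(22.05, 700.0, 0.063)
print(round(depth_m, 2))  # 2.0 — the paper's operating range
```

At the 2 m operating range, even sub-pixel disparity errors translate into centimetre-scale depth errors, which is why the depth MAE figures matter for safe actuation.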
[CV-51] Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding CVPR2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 在图形用户界面(GUI)接地任务中长期依赖自回归(AR)视觉语言模型(VLMs)所带来的局限性问题,探索离散扩散视觉语言模型(DVLMs)作为替代方案的可行性。其解决方案的关键在于:首先将 GUI 接地任务重构为从多模态输入到文本生成的框架,其次提出一种混合掩码策略(hybrid masking schedule),结合线性掩码与确定性掩码以更好地捕捉边界框几何的层次结构,从而显著提升接地准确率(SSR 提升达 6.1 点);同时通过扩展多样化 GUI 数据域进行预训练,进一步降低延迟约 1.3 秒并平均提升准确率 20 点,验证了离散 DVLM 在 GUI 接地任务中的有效性与潜力。
链接: https://arxiv.org/abs/2603.26211
作者: Shrinidhi Kumbhar,Haofu Liao,Srikar Appalaraju,Kunwar Yashraj Singh
机构: Arizona State University (亚利桑那州立大学); AWS Agentic AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.
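One plausible way to combine "linear and deterministic masking" is sketched below. This is an assumed form, not the paper's schedule: ordinary tokens are masked at a linear rate t, while bounding-box coordinate tokens are masked deterministically until a late stage, so box geometry is generated after its surrounding context.

```python
import numpy as np

def hybrid_mask(seq_len, t, coord_positions, rng):
    """Hybrid masking sketch: random linear masking for generic tokens,
    deterministic masking for coordinate tokens while t > 0.3 (assumed cutoff)."""
    mask = rng.random(seq_len) < t        # linear component (random)
    mask[coord_positions] = t > 0.3       # deterministic component
    return mask

rng = np.random.default_rng(0)
m = hybrid_mask(10, t=0.5, coord_positions=[6, 7, 8, 9], rng=rng)
print(m[6:].all())  # True: all coordinate tokens still masked at t=0.5
```

The deterministic component gives the model a fixed generation order for the structured part of the output, which is the intuition behind the SSR gains reported above.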
[CV-52] 4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation ICRA2026
【速读】:该论文旨在解决4D毫米波雷达(4D millimeter-wave radar)在机器人回环检测与全局定位中因数据固有噪声和稀疏性导致的性能受限问题。其解决方案的关键在于提出一种基于知识蒸馏(Knowledge Distillation, KD)的框架——4DRaL,通过将高性能激光雷达到激光雷达(LiDAR-to-LiDAR, L2L)的模型作为教师模型,指导4D雷达到4D雷达(4D radar-to-4D radar, R2R)学生模型的训练,从而提升雷达在复杂环境下的表征能力。该框架包含三个核心KD模块:局部图像增强模块用于缓解原始雷达点云的稀疏性,特征分布蒸馏模块使学生模型生成更具判别性的特征,响应蒸馏模块则确保师生模型在特征空间中的一致性;此外,通过模块配置调整,4DRaL还可扩展至4D雷达到激光雷达(R2L)的跨模态匹配任务,实验证明其在正常及恶劣天气下均达到当前最优性能。
链接: https://arxiv.org/abs/2603.26206
作者: Ningyuan Huang,Zhiheng Li,Zheng Fang
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICRA 2026
Abstract:Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.
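Two of 4DRaL's distillation signals can be sketched in simplified form — an L2 term between normalized feature maps (feature-distribution distillation) and an L2 term between global place descriptors (response distillation). The exact losses in the paper may differ; this only shows the teacher-student structure.

```python
import numpy as np

def kd_losses(student_feat, teacher_feat, student_desc, teacher_desc):
    """Simplified KD terms: normalized-feature mismatch (distribution-level)
    and descriptor mismatch (response-level) between student and teacher."""
    sf = student_feat / np.linalg.norm(student_feat)
    tf = teacher_feat / np.linalg.norm(teacher_feat)
    feat_loss = float(((sf - tf) ** 2).sum())
    resp_loss = float(((student_desc - teacher_desc) ** 2).mean())
    return feat_loss, resp_loss

feat = np.array([1.0, 2.0, 2.0])
desc = np.array([0.5, 0.5])
print(kd_losses(feat, feat, desc, desc))  # (0.0, 0.0) when student matches teacher
```

The R2R vs. R2L configurations in the paper would differ only in which modality feeds the student branch; the loss structure stays the same.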
[CV-53] SAFT: Sensitivity-Aware Filtering and Transmission for Adaptive 3D Point Cloud Communication over Wireless Channels
【速读】:该论文旨在解决无线信道中3D点云可靠传输的问题,尤其针对时变信噪比(SNR)和带宽受限带来的挑战。解决方案的关键在于提出了一种感知敏感性的过滤与传输(Sensitivity-aware Filtering and Transmission, SAFT)框架,其核心创新是引入了一个基于重建敏感度的令牌过滤(Sensitivity-guided Token Filtering, STF)模块,该模块为每个点云令牌分配重要性评分,以在传输过程中优先保留对几何结构恢复更为关键的信息;同时结合量化模块与SNR感知解码器实现自适应重建,在不增加有效传输负载的前提下通过训练阶段的符号使用惩罚稳定离散表示,显著提升了低SNR环境下的几何保真度(D1/D2 PSNR),优于传统分离式信源-信道编码方案(如G-PCC联合LDPC与QAM)及现有学习基线方法。
链接: https://arxiv.org/abs/2603.26197
作者: Huda Adam Sirag Mekki,Hui Yuan,Mohanad M. G. Hassan,Zejia Chen,Guanghui Zhang
机构: Shandong University (山东大学); School of Control Science and Engineering, Shandong University (山东大学控制科学与工程学院); Key Laboratory of Machine Intelligence and System Control, Ministry of Education (教育部机器智能与系统控制重点实验室); School of Computer Science and Technology, Shandong University (山东大学计算机科学与技术学院)
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable transmission of 3D point clouds over wireless channels is challenging due to time-varying signal-to-noise ratio (SNR) and limited bandwidth. This paper introduces sensitivity-aware filtering and transmission (SAFT), a learned transmission framework that integrates a Point-BERT-inspired encoder, a sensitivity-guided token filtering (STF) unit, a quantization block, and an SNR-aware decoder for adaptive reconstruction. Specifically, the STF module assigns token-wise importance scores based on the reconstruction sensitivity of each token under channel perturbation. We further employ a training-only symbol-usage penalty to stabilize the discrete representation, without affecting the transmitted payload. Experiments on ShapeNet, ModelNet40, and 8iVFB show that SAFT improves geometric fidelity (D1/D2 PSNR) compared with a separate source–channel coding pipeline (G-PCC combined with LDPC and QAM) and existing learned baselines, with the largest gains observed in low-SNR regimes, highlighting improved robustness under limited bandwidth.
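The core of the STF module — keep only the tokens whose reconstruction is most sensitive to channel perturbation — reduces to a top-k selection once the sensitivity scores exist. A minimal sketch (scores are assumed to be precomputed; how SAFT estimates them is the learned part):

```python
import numpy as np

def sensitivity_filter(tokens, sensitivity, budget):
    """Keep the `budget` most channel-sensitive tokens for transmission,
    preserving their original order; the rest are dropped."""
    keep = np.argsort(-sensitivity)[:budget]
    return tokens[np.sort(keep)]

tokens = np.arange(6).reshape(6, 1)            # 6 toy token embeddings
sens = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.05])
kept = sensitivity_filter(tokens, sens, budget=3)
print(kept.ravel().tolist())  # [1, 2, 3]
```

Under a tighter bandwidth budget the same rule simply keeps fewer tokens, which is what makes the scheme adapt to channel conditions.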
[CV-54] MemCam: Memory-Augmented Camera Control for Consistent Video Generation IJCNN2026
【速读】:该论文旨在解决交互式视频生成中因动态相机控制导致的场景一致性难以维持的问题,尤其是在长视频生成场景下,现有方法受限于有限的上下文信息而表现不佳。其解决方案的关键在于提出MemCam框架,通过将已生成帧作为外部记忆,并利用这些记忆帧作为上下文条件来增强场景一致性;同时设计了上下文压缩模块,以紧凑表示编码记忆帧,并采用基于共视性(co-visibility)的选择机制动态检索最相关的历史帧,从而在降低计算开销的同时提升上下文的相关性和长度,显著改善了长视频中大角度相机运动下的生成质量。
链接: https://arxiv.org/abs/2603.26193
作者: Xinhang Gao,Junlin Guan,Shuhan Luo,Wenzhuo Li,Guanghuan Tan,Jiacheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 3 tables, accepted by IJCNN 2026
Abstract:Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.
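The co-visibility-based memory selection can be approximated by ranking stored frames by viewing-direction similarity with the current camera — a crude proxy (the paper's co-visibility computation is more involved); frame directions and `k` below are illustrative.

```python
import numpy as np

def select_memory_frames(view_dirs, current_dir, k=2):
    """Pick the k memory frames whose (unit) viewing directions are most
    similar to the current camera direction."""
    sims = view_dirs @ current_dir
    return np.argsort(-sims)[:k]

dirs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
picked = select_memory_frames(dirs, np.array([1.0, 0.0, 0.0]))
print(sorted(picked.tolist()))  # [0, 2]: the two frames looking the same way
```

Only the retrieved frames are compressed into context, which keeps the conditioning cost bounded even as the memory grows.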
[CV-55] HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
【速读】:该论文旨在解决传统持续学习(lifelong learning)方法在面对异构任务(heterogeneous tasks)时的局限性,即现有研究主要聚焦于同质任务流(如仅分类任务),忽视了不同任务具有不同输出空间结构(output space structures)的情形。为应对这一挑战,作者提出终身异构学习(Lifelong Heterogeneous Learning, LHL)的新范式,并聚焦于密集预测(dense prediction)场景下的实例化问题(LHL4DP)。其核心解决方案是提出一种无示例(exemplar-free)的异构感知蒸馏(Heterogeneity-Aware Distillation, HAD)方法,关键在于通过两个互补模块实现对历史异构知识的有效保留:一是分布平衡的异构感知蒸馏损失(distribution-balanced heterogeneity-aware distillation loss),缓解全局预测分布失衡;二是显著性引导的异构感知蒸馏损失(salience-guided heterogeneity-aware distillation loss),聚焦于使用Sobel算子提取的高信息量边缘像素区域,从而提升模型在复杂异构任务序列中的泛化能力与稳定性。
链接: https://arxiv.org/abs/2603.26192
作者: Xuerui Zhang,Xuehao Wang,Zhan Zhuang,Linglan Zhao,Ziyue Li,Xinmin Zhang,Zhihuan Song,Yu Zhang
机构: Southern University of Science and Technology (南方科技大学); Zhejiang University (浙江大学); Technical University of Munich (慕尼黑工业大学); City University of Hong Kong (香港城市大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures. To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase. The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.
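The Sobel-based salience map used to weight the distillation loss is the one standard piece of the pipeline that can be shown directly. A minimal sketch (pure NumPy, no padding; how HAD turns the edge magnitude into loss weights is simplified away):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def sobel_salience(img):
    """Edge-magnitude map via Sobel filters; high values mark the
    informative edge pixels that the salience-guided loss focuses on."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * SOBEL_X).sum()
            gy[i, j] = (patch * SOBEL_X.T).sum()
    return np.hypot(gx, gy)

img = np.zeros((5, 5))
img[:, 3:] = 1.0                 # vertical step edge between columns 2 and 3
sal = sobel_salience(img)
print(sal[:, 2])                 # strong, uniform response at the step edge
```

Weighting the per-pixel distillation term by this map concentrates the retention signal where class boundaries live, rather than spreading it over large homogeneous regions.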
[CV-56] Dual-Stage Invariant Continual Learning under Extreme Visual Sparsity
【速读】:该论文旨在解决在极端稀疏场景(如空间轨道物体(Resident Space Object, RSO)检测)下,持续学习(Continual Learning)中因背景主导信号导致的特征骨干网络梯度不稳定与表征漂移问题。现有方法多依赖输出层知识蒸馏,无法保障中间表示的稳定性,从而引发误差传播。解决方案的关键在于提出一种双阶段不变性持续学习框架,通过联合蒸馏机制同时约束骨干网络特征表示的结构一致性与检测预测的语义一致性,从源头抑制误差累积;此外,引入基于补丁采样和分布感知增强的稀疏感知数据调节策略,以调控严重不平衡条件下的梯度统计特性,显著提升模型在序列域变化下的鲁棒性与适应能力。
链接: https://arxiv.org/abs/2603.26190
作者: Rangya Zhang,Jiaping Xiao,Lu Bai,Yuhang Zhang,Mir Feroskhan
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Continual learning seeks to maintain stable adaptation under non-stationary environments, yet this problem becomes particularly challenging in object detection, where most existing methods implicitly assume relatively balanced visual conditions. In extreme-sparsity regimes, such as those observed in space-based resident space object (RSO) detection scenarios, foreground signals are overwhelmingly dominated by background observations. Under such conditions, we analytically demonstrate that background-driven gradients destabilize the feature backbone during sequential domain shifts, causing progressive representation drift. This exposes a structural limitation of continual learning approaches relying solely on output-level distillation, as they fail to preserve intermediate representation stability. To address this, we propose a dual-stage invariant continual learning framework via joint distillation, enforcing structural and semantic consistency on both backbone representations and detection predictions, respectively, thereby suppressing error propagation at its source while maintaining adaptability. Furthermore, to regulate gradient statistics under severe imbalance, we introduce a sparsity-aware data conditioning strategy combining patch-based sampling and distribution-aware augmentation. Experiments on a high-resolution space-based RSO detection dataset show consistent improvement over established continual object detection methods, achieving an absolute gain of +4.0 mAP under sequential domain shifts.
[CV-57] OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
【Quick Read】: This paper targets accurate and temporally consistent segmentation of the left ventricle in echocardiography videos, which is challenged by severe speckle noise and rapid non-rigid deformation. Existing linear recurrent models offer efficient in-context associative memory, but their unconstrained state updates cause singular-value decay of the state matrix (rank collapse), so anatomical details are overwhelmed by noise. The key innovations of the proposed OSA framework are: 1) an Orthogonalized State Update (OSU) mechanism that formulates state evolution as Euclidean projected gradient descent on the Stiefel manifold, suppressing rank collapse and ensuring stable temporal transitions; and 2) an Anatomical Prior-aware Feature Enhancement module that separates anatomical structures from speckle noise through a physics-driven process, providing robust structural cues for temporal tracking.
Link: https://arxiv.org/abs/2603.26188
Authors: Rui Wang, Huisi Wu, Jing Qin
Affiliations: Shenzhen University (深圳大学); The Hong Kong Polytechnic University (香港理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at this https URL.
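The core of the OSU mechanism — keeping the recurrent state on the Stiefel manifold so its singular values cannot decay — can be illustrated with a polar-decomposition projection, a standard way to map a matrix to its nearest orthonormal-column matrix. The paper's projected-gradient scheme is more involved; this is only a sketch of why the projection prevents rank collapse:

```python
import numpy as np

def project_to_stiefel(M):
    """Project a state matrix onto the Stiefel manifold (orthonormal
    columns) via the polar decomposition: M = U S Vt  ->  U Vt."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# An unconstrained contractive update shrinks singular values over time
# ("rank collapse"); re-projecting after each step keeps them pinned at 1.
rng = np.random.default_rng(0)
state = rng.standard_normal((8, 4))
A = 0.9 * np.eye(8)  # toy update that would otherwise collapse the state
for _ in range(50):
    state = project_to_stiefel(A @ state)

singular_values = np.linalg.svd(state, compute_uv=False)
```

Without the projection, fifty applications of `A` would scale the singular values by 0.9^50 ≈ 0.005; with it they stay at 1.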
[CV-58] Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI
【Quick Read】: This paper addresses automatic segmentation of left atrial (LA) scar in cardiac MRI late gadolinium enhancement (LGE) images, which is difficult due to low contrast, annotation variability, and the lack of anatomical constraints, often yielding unreliable predictions. The key to the solution is a progressive learning strategy inspired by the clinical workflow: an LA cavity pre-learning model is trained first, dual-task learning then models the spatial relationship between LA geometry and scar distribution, and finally scar segmentation is fine-tuned. An anatomy-aware spatially weighted loss further mitigates annotation bias by constraining scar predictions to anatomically plausible LA wall regions, improving segmentation accuracy and reliability.
Link: https://arxiv.org/abs/2603.26186
Authors: Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo, Alban Redheuil, Nadjia Kachenoura
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures, 3 tables
Abstract:Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to unreliable predictions. Accordingly, our aim was to propose a progressive learning strategy, inspired by a clinical workflow, to segment LA scar from LGE images. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) an LA cavity pre-learning model, 2) a dual-task model that further learns the spatial relationship between LA geometry and scar patterns, and 3) fine-tuning for precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. In preliminary results obtained on validation LGE volumes from the LASCARQS public dataset after 5-fold cross-validation, LA segmentation achieved a Dice score of 0.94, while LA scar segmentation achieved a Dice score of 0.50, a Hausdorff Distance of 11.84 mm, and an Average Surface Distance of 1.80 mm, outperforming a one-stage scar segmentation baseline with 0.49, 13.02 mm, and 1.96 mm, respectively. By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.
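A minimal sketch of an anatomy-aware spatially weighted loss, assuming a binary wall mask and an illustrative outside-wall penalty weight (the paper does not specify its exact weighting, so all numbers here are assumptions):

```python
import numpy as np

def anatomy_weighted_bce(pred, target, wall_mask, outside_penalty=4.0):
    """Toy anatomy-aware weighted BCE: voxels outside the plausible
    LA-wall region get a larger weight, discouraging anatomically
    implausible scar predictions. `outside_penalty` is illustrative."""
    eps = 1e-7
    bce = -(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))
    weights = np.where(wall_mask > 0, 1.0, outside_penalty)
    return float(np.mean(weights * bce))
```

The same false-positive scar probability is thus penalized more heavily when it lies outside the wall mask than inside it.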
[CV-59] DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds
【Quick Read】: This paper addresses the inability of existing post-decoding quality enhancement methods for point clouds to exploit spatiotemporal correlations when processing dynamic point clouds: for G-PCC compressed dynamic point clouds, conventional methods handle each frame independently, making it hard to maintain spatiotemporal consistency of geometry and attributes. The key to the solution is a unified geometry and attribute enhancement framework (DUGAE) that explicitly models cross-frame spatiotemporal correlations of geometry and attributes through three core modules: first, a dynamic geometry enhancement network (DGE-Net) based on sparse convolution (SPConv) and feature-domain geometry motion compensation (GMC) aligns and aggregates spatiotemporal geometric information; second, a detail-aware k-nearest-neighbors (DA-KNN) recoloring module maps the original attributes onto the enhanced geometry at the encoder side, improving mapping completeness and preserving attribute details; finally, a dynamic attribute enhancement network (DAE-Net) with dedicated temporal feature extraction and feature-domain attribute motion compensation (AMC) refines the attributes, comprehensively enhancing dynamic point cloud quality.
Link: https://arxiv.org/abs/2603.26183
Authors: Pan Zhao, Hui Yuan, Chang Sun, Chongzhen Tian, Raouf Hamzaoui, Sam Kwong
Affiliations: Shandong University (山东大学); Key Laboratory of Machine Intelligence and System Control, Ministry of Education (教育部机器智能与系统控制重点实验室); De Montfort University (德蒙福特大学); Lingnan University (岭南大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing post-decoding quality enhancement methods for point clouds are designed for static data and typically process each frame independently. As a result, they cannot effectively exploit the spatiotemporal correlations present in point cloud sequences. We propose a unified geometry and attribute enhancement framework (DUGAE) for G-PCC compressed dynamic point clouds that explicitly exploits inter-frame spatiotemporal correlations in both geometry and attributes. First, a dynamic geometry enhancement network (DGE-Net) based on sparse convolution (SPConv) and feature-domain geometry motion compensation (GMC) aligns and aggregates spatiotemporal information. Then, a detail-aware k-nearest neighbors (DA-KNN) recoloring module maps the original attributes onto the enhanced geometry at the encoder side, improving mapping completeness and preserving attribute details. Finally, a dynamic attribute enhancement network (DAE-Net) with dedicated temporal feature extraction and feature-domain attribute motion compensation (AMC) refines attributes by modeling complex spatiotemporal correlations. On seven dynamic point clouds from the 8iVFB v2, Owlii, and MVUB datasets, DUGAE significantly enhanced the performance of the latest G-PCC geometry-based solid content test model (GeS-TM v10). For geometry (D1), it achieved an average BD-PSNR gain of 11.03 dB and a 93.95% BD-bitrate reduction. For the luma component, it achieved a 4.23 dB BD-PSNR gain with a 66.61% BD-bitrate reduction. DUGAE also improved perceptual quality (as measured by PCQM) and outperformed V-PCC. Our source code will be released on GitHub at: this https URL
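The recoloring step — transferring original attributes onto the enhanced geometry via nearest neighbours — reduces, in its plainest form, to k-NN averaging. The paper's DA-KNN additionally uses detail-aware weighting, which this sketch omits:

```python
import numpy as np

def knn_recolor(orig_xyz, orig_rgb, new_xyz, k=3):
    """Toy k-NN recoloring: each enhanced-geometry point takes the mean
    color of its k nearest neighbours in the original point cloud."""
    # pairwise squared distances, shape (M, N)
    d2 = ((new_xyz[:, None, :] - orig_xyz[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest points
    return orig_rgb[nn].mean(axis=1)     # (M, 3) averaged colors
```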
[CV-60] GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport CVPR2026
【Quick Read】: This paper tackles the failure of 3D Gaussian splatting when modeling transparent objects such as glass panels; the core challenge is decoupling the intertwined radiance contributions at transparent interfaces, i.e., distinguishing reflections from the transparent surface and the radiance transmitted through it. The key to the solution, GLINT, is reconstructing scene-scale transparency through an explicitly decomposed Gaussian representation: the primary interface is modeled separately from the reflected and transmitted radiance, enabling consistent radiance transport. During optimization, GLINT adaptively localizes transparent regions using geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model, significantly improving reconstruction of complex transparent scenes.
Link: https://arxiv.org/abs/2603.26181
Authors: Youngju Na, Jaeseong Yun, Soohyun Ryu, Hyunsu Kim, Sung-Eui Yoon, Suyong Yeon
Affiliations: KAIST (韩国科学技术院); NAVER LABS (NAVER实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026, Project page: this https URL
Abstract:While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.
[CV-61] Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning
【Quick Read】: This paper addresses the lack of robustness in open-vocabulary object detection when the environment or background changes, i.e., detection performance on the same object degrades across scenes, revealing missing intra-modal consistency. The key to the solution is a Contextual Consistency Learning (CCL) framework built on two strategies: 1) Contextual Bootstrapped Data Generation (CBDG), which synthesizes images containing the same objects over diverse backgrounds to compensate for the shortcomings of existing datasets; and 2) a Contextual Consistency Loss (CCLoss), which enforces invariance of object features under environmental changes, improving cross-scene generalization. The method clearly outperforms prior art, gaining +16.3 AP on OmniLabel and +14.9 AP on D3.
Link: https://arxiv.org/abs/2603.26179
Authors: Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su
Affiliations: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Pengcheng Laboratory (鹏城实验室); CUHK (香港中文大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model’s robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: this https URL.
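A toy version of an intra-modal consistency penalty in the spirit of CCLoss: the exact form below (1 − cosine similarity between features of the same object on two different backgrounds) is an illustrative stand-in, not the paper's loss:

```python
import numpy as np

def contextual_consistency_loss(feats_a, feats_b):
    """Toy intra-modal consistency penalty: 1 - cosine similarity
    between per-object features extracted from two renderings of the
    same objects on different backgrounds."""
    a = feats_a / np.linalg.norm(feats_a, axis=-1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))
```

Driving this penalty to zero forces the detector's object features to be invariant to the swapped-in background, which is exactly the robustness gap the paper identifies.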
[CV-62] CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions CVPR2026
【Quick Read】: This paper addresses the lack of a systematic, human-aligned evaluation framework for instruction-based multimodal image editing, especially for complex and creative editing tasks. The key to the solution is CREval, a fully automated question-answer (QA)-based evaluation pipeline whose structured QA mechanism overcomes the incompleteness and poor interpretability of existing Multimodal Large Language Model (MLLM) scoring. The authors also build CREval-Bench, a benchmark dedicated to creative image editing under complex instructions, covering three categories and nine creative dimensions with over 800 editing samples and 13K evaluation queries, enabling systematic evaluation of a range of open- and closed-source models; the automated metrics show strong agreement with human judgments.
Link: https://arxiv.org/abs/2603.26174
Authors: Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao
Affiliations: Harbin Institute of Technology (哈尔滨工业大学); Huawei Noah's Ark Lab (华为诺亚方舟实验室); Pengcheng Lab, Guangzhou (鹏城实验室,广州)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026
Abstract:Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval’s automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.
[CV-63] Provably Contractive and High-Quality Denoisers for Convergent Restoration
【Quick Read】: This paper addresses the lack of stability of existing image restoration models under small input perturbations: convolutional and attention-based networks reach SOTA performance on tasks such as denoising but exhibit a robustness-accuracy trade-off. The key to the solution is a provably contractive denoiser network: proximal layers obtained from unfolding techniques are composed with Lipschitz-constrained convolutional refinements so that the model has a global Lipschitz constant of 1. This guarantees that input perturbations of strength at most ε change the output by at most ε, whereas strong baselines such as DnCNN and Restormer can deviate more under the same perturbations. The model matches the restoration quality of unconstrained SOTA denoisers and acts as a strong regularizer with convergence guarantees in Plug-and-Play algorithms.
Link: https://arxiv.org/abs/2603.26168
Authors: Shubhi Shukla, Pravin Nair
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness-accuracy trade-off. We develop provably contractive (global Lipschitz constant 1) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength ‖δ‖ ≤ ε induce at most an ε change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models. Codes and pretrained models are available at this https URL
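The contractivity guarantee — ‖f(x) − f(y)‖ ≤ ‖x − y‖ — can be demonstrated on a single linear layer by normalizing its spectral norm to at most 1. Real networks typically estimate the norm by power iteration during training; exact SVD is used here only for clarity of the sketch:

```python
import numpy as np

def spectrally_normalize(W):
    """Scale a weight matrix so its spectral norm is at most 1,
    making the linear map x -> W x contractive (1-Lipschitz)."""
    sigma = np.linalg.svd(W, compute_uv=False)[0]  # largest singular value
    return W / max(sigma, 1.0)                     # only shrink, never amplify

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16)) * 2.0  # Lipschitz constant well above 1
Wn = spectrally_normalize(W)

x, y = rng.standard_normal(16), rng.standard_normal(16)
out_gap = np.linalg.norm(Wn @ x - Wn @ y)
in_gap = np.linalg.norm(x - y)
```

Composing such 1-Lipschitz layers with 1-Lipschitz nonlinearities keeps the whole network contractive, which is the property the paper exploits for Plug-and-Play convergence.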
[CV-64] Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication CVPR2026
【Quick Read】: This paper addresses the risks of copyright infringement and disinformation posed by diffusion-generated images, and in particular the fact that existing watermarking methods rely on threshold-based detection and cannot achieve exact bit-level recovery, making them unsuitable for offline verification or applications requiring lossless metadata (e.g., licensing instructions). The key to the solution, the Gaussian Shannon framework, is to model the diffusion process as a noisy communication channel and embed the watermark into the initial Gaussian noise without fine-tuning the model or degrading image quality. Two types of interference, local bit flips and global stochastic distortions, are identified, and a cascaded defense combining error-correcting codes with majority voting ensures reliable end-to-end transmission of the semantic payload, achieving state-of-the-art bit-level accuracy and a high true-positive rate under various perturbations and supporting trustworthy rights attribution in real-world deployment.
Link: https://arxiv.org/abs/2603.26167
Authors: Yi Zhang, Hongbo Huang, Liang-Jie Zhang
Affiliations: Shenzhen University (深圳大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Accepted by CVPR 2026 Findings
Abstract:Diffusion models generate high-quality images but pose serious risks like copyright violation and disinformation. Watermarking is a key defense for tracing and authenticating AI-generated content. However, existing methods rely on threshold-based detection, which only supports fuzzy matching and cannot recover structured watermark data bit-exactly, making them unsuitable for offline verification or applications requiring lossless metadata (e.g., licensing instructions). To address this problem, in this paper, we propose Gaussian Shannon, a watermarking framework that treats the diffusion process as a noisy communication channel and enables both robust tracing and exact bit recovery. Our method embeds watermarks in the initial Gaussian noise without fine-tuning or quality loss. We identify two types of channel interference, namely local bit flips and global stochastic distortions, and design a cascaded defense combining error-correcting codes and majority voting. This ensures reliable end-to-end transmission of semantic payloads. Experiments across three Stable Diffusion variants and seven perturbation types show that Gaussian Shannon achieves state-of-the-art bit-level accuracy while maintaining a high true positive rate, enabling trustworthy rights attribution in real-world deployment. The source code has been made available at: this https URL
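The error-correcting-code plus majority-voting defense can be illustrated with the simplest ECC, a repetition code. The paper's actual code construction is not specified here, so this is purely a sketch of the principle that local bit flips up to half a block are fully correctable:

```python
import numpy as np

def encode(bits, r=5):
    """Repetition code: repeat each payload bit r times."""
    return np.repeat(bits, r)

def decode(received, r=5):
    """Majority-vote each block of r received bits back to one payload bit."""
    return (received.reshape(-1, r).sum(axis=1) > r // 2).astype(int)

rng = np.random.default_rng(0)
payload = rng.integers(0, 2, 64)  # 64-bit semantic payload
tx = encode(payload)

# Channel interference: flip the first bit of every 5-bit block.
# Up to floor(r/2) flips per block are always correctable.
rx = tx.copy()
rx[::5] = 1 - rx[::5]

recovered = decode(rx)
```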
[CV-65] IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios
【Quick Read】: This paper addresses the potential misuse of image-to-video (I2V) generation models, where a single image can be used to fabricate videos for profit or to mislead the public, and notes that current image protection methods lack a unified benchmark, making it hard to systematically measure their effectiveness in I2V scenarios or their robustness against preprocessing attacks. The key to the solution is IP-Bench, the first systematic benchmark for I2V generation scenarios, covering 6 representative image protection methods and 5 state-of-the-art I2V models; it evaluates robustness under two practical attack strategies and analyzes cross-model and cross-modality transferability, establishing a reproducible and extensible evaluation framework for image protection.
Link: https://arxiv.org/abs/2603.26154
Authors: Xiaofeng Li, Leyi Sheng, Zhen Sun, Zongmin Zhang, Jiaheng Wei, Xinlei He
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment. To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods’ robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model and cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.
[CV-66] Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT
【Quick Read】: This paper addresses the need for deep learning models that remain efficient and adaptable in data-scarce (low-data) edge computing settings, in particular few-shot learning on resource-constrained devices subject to limited labels, low latency, low power, and tight compute budgets. The key to the solution is a knowledge distillation strategy that transfers the generalization ability of a large teacher model to a lightweight student model, improving one-shot and five-shot classification accuracy on the MiniImageNet benchmark by 14% and 6.7%, respectively, while reducing parameters by 69% and computational complexity (FLOPs) by 88%. Deployment on a Jetson Orin Nano shows 37% lower dynamic energy consumption at only 2.6 ms latency, demonstrating an efficient, practical approach for edge AI hardware.
Link: https://arxiv.org/abs/2603.26145
Authors: Shuhei Tsuyuki, Reda Bensaid, Jérémy Morlier, Mathieu Léonardon, Naoya Onizawa, Vincent Gripon, Takahiro Hanyu
Affiliations: Research Institute of Electrical Communication, Tohoku University, Japan; Graduate School of Engineering, Tohoku University, Japan; IMT Atlantique, Lab-STICC, UMR CNRS 6285, F-29238 Brest, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Efficient and adaptable deep learning models are an important research focus, driven by the need for highly efficient models on edge devices. Few-shot learning enables the use of deep learning models in low-data regimes, a capability that is highly sought after in real-world applications where collecting large annotated datasets is costly or impractical. This challenge is particularly relevant in edge scenarios, where connectivity may be limited, low-latency responses are required, or energy consumption constraints are critical. We propose and evaluate a pre-training method for the MobileViT backbone designed for edge computing. Specifically, we employ knowledge distillation, which transfers the generalization ability of a large-scale teacher model to a lightweight student model. This method achieves accuracy improvements of 14% and 6.7% for one-shot and five-shot classification, respectively, on the MiniImageNet benchmark, compared to the ResNet12 baseline, while reducing by 69% the number of parameters and by 88% the computational complexity of the model, in FLOPs. Furthermore, we deployed the proposed models on a Jetson Orin Nano platform and measured power consumption directly at the power supply, showing that the dynamic energy consumption is reduced by 37% with a latency of 2.6 ms. These results demonstrate that the proposed method is a promising and practical solution for deploying few-shot learning models on edge AI hardware.
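The teacher-to-student transfer can be sketched with the classic Hinton-style distillation objective. The temperature `T`, weight `alpha`, and loss form below are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """KL between temperature-softened teacher and student distributions
    (scaled by T^2) plus ordinary cross-entropy on the hard labels."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)) * T * T
    hard = softmax(student_logits)
    ce = -np.mean(np.log(hard[np.arange(len(labels)), labels] + 1e-12))
    return alpha * kd + (1 - alpha) * ce
```

The softened teacher distribution carries inter-class similarity information ("dark knowledge"), which is what helps the small student generalize from few examples.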
[CV-67] PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion
【Quick Read】: This paper addresses the inefficiency of data selection in deep neural network training, where traditional methods are too computationally expensive to scale or deploy in practice. The key to the solution is the PruneFuse strategy: structured pruning first produces a small pruned network that is structurally coherent with the original network and is used to efficiently select the most informative samples; the trained pruned network is then seamlessly fused into the original network, so that what the pruned network learned guides training of the fused network while leaving room to explore more robust solutions. This two-stage mechanism substantially lowers the computational cost of data selection and improves both performance and training efficiency.
Link: https://arxiv.org/abs/2603.26138
Authors: Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon
Affiliations: KAIST (韩国科学技术院)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in TMLR (Featured Certification). arXiv admin note: substantial text overlap with arXiv:2501.01118
Abstract:Efficient data selection is crucial for enhancing the training efficiency of deep neural networks and minimizing annotation requirements. Traditional methods often face high computational costs, limiting their scalability and practical use. We introduce PruneFuse, a novel strategy that leverages pruned networks for data selection and later fuses them with the original network to optimize training. PruneFuse operates in two stages: First, it applies structured pruning to create a smaller pruned network that, due to its structural coherence with the original network, is well-suited for the data selection task. This small network is then trained and selects the most informative samples from the dataset. Second, the trained pruned network is seamlessly fused with the original network. This integration leverages the insights gained during the training of the pruned network to facilitate the learning process of the fused network while leaving room for the network to discover more robust solutions. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.
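The first stage — deriving a small, structurally coherent network by structured pruning — can be sketched as channel selection by L1 norm, one common criterion (the paper's exact criterion is an assumption here):

```python
import numpy as np

def prune_channels(W, keep_ratio=0.5):
    """Structured pruning sketch: rank output channels of a (C_out, C_in)
    weight matrix by L1 norm and keep the top fraction. Because whole
    channels are kept or dropped, the pruned layer stays structurally
    aligned with the original, which is what later enables fusion."""
    n_keep = max(1, int(round(W.shape[0] * keep_ratio)))
    order = np.argsort(-np.abs(W).sum(axis=1))  # descending channel L1 norm
    kept = np.sort(order[:n_keep])              # preserve original ordering
    return W[kept], kept
```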
[CV-68] InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution
【Quick Read】: This paper targets two obstacles to applying diffusion models to video super-resolution (VSR): strong generative priors cause temporal instability, and multi-frame diffusion pipelines are too computationally expensive for practical deployment. The key to the solution is InstaVSR, a lightweight diffusion framework whose core innovations are: (1) a pruned one-step diffusion backbone that removes several costly components of conventional diffusion-based VSR; (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame consistency; and (3) adversarial learning in both latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame 2K×2K video in under one minute using only 7 GB of memory, achieving a markedly better efficiency-stability balance than existing methods.
Link: https://arxiv.org/abs/2603.26134
Authors: Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi
Affiliations: Insta360 Research (Insta360 研究院); Peking University (北京大学); Wuhan University (武汉大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures
Abstract:Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K×2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.
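Flow-guided temporal regularization boils down to penalizing the difference between the current output and the previous output warped along the optical flow. A nearest-neighbour warp keeps this sketch short; real pipelines use differentiable bilinear sampling:

```python
import numpy as np

def warp_nearest(frame, flow):
    """Backward-warp a (H, W) frame by an optical flow field (H, W, 2)
    using nearest-neighbour sampling with border clamping."""
    H, W = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def temporal_consistency_loss(curr, prev, flow):
    """Flow-guided temporal regularizer: the current output should match
    the previous output warped along the flow (L1 photometric penalty)."""
    return float(np.abs(curr - warp_nearest(prev, flow)).mean())
```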
[CV-69] TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life
【Quick Read】: This paper addresses the difficulty generative AI has in capturing subtle morphological traits when generating images across species: against the backdrop of the Tree of Life with over 10M species, existing text-to-image models produce photo-realistic images yet often fail to preserve the visual cues that define species identity. The key to the solution is TaxaAdapter, a lightweight method that injects embeddings from Vision Taxonomy Models (VTMs) such as BioCLIP into a frozen text-to-image diffusion model, improving species-level morphological fidelity and identity accuracy while retaining flexible control over attributes like pose, style, and background. The method has a clean architecture and training recipe and generalizes well even to few-shot and unseen species.
Link: https://arxiv.org/abs/2603.26128
Authors: Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, Anuj Karpatne, Cheng Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
[CV-70] Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR
【Quick Read】: This paper addresses the difficulty multimodal large language models (MLLMs) have in effectively integrating visual evidence during reasoning: although models can attend to relevant visual regions, their reasoning chains are often weakly grounded in visual facts. The key to the solution is Trajectory-Guided Reinforcement Learning (TGRL), which uses expert reasoning trajectories from stronger models to guide the policy model to incorporate visual evidence into fine-grained reasoning, combined with token-level reweighting and trajectory filtering for stable and effective policy optimization.
Link: https://arxiv.org/abs/2603.26126
Authors: Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.
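Token-level reweighting and trajectory filtering can be combined into a toy objective like the one below. All names, shapes, and the reward-floor filtering rule are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def tgrl_objective(token_logps, token_weights, traj_rewards, reward_floor=0.5):
    """Sketch of a trajectory-guided objective: trajectories below a
    reward floor are filtered out; surviving trajectories contribute a
    token-reweighted log-likelihood (minimized as a negative mean)."""
    keep = traj_rewards >= reward_floor          # trajectory filtering
    if not keep.any():
        return 0.0
    weighted = (token_weights[keep] * token_logps[keep]).sum(axis=1)
    return float(-weighted.mean())
```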
[CV-71] SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis
【Quick Read】: This paper addresses two core problems with current large language models (LLMs) in dermatological diagnosis: insufficient capability on fine-grained, large-scale multi-class skin disease classification and rare-disease diagnosis, mainly due to training data sparsity; and a lack of the interpretability and traceability needed for clinical reasoning. The key innovation of SkinGPT-X is a multimodal collaborative multi-agent system integrated with a self-evolving dermatological memory mechanism: by simulating the diagnostic workflow of dermatologists and continuously evolving its memory, it delivers a transparent and trustworthy diagnostic process while maintaining high accuracy, markedly improving recognition of complex and rare skin diseases and clinical applicability.
Link: https://arxiv.org/abs/2603.26122
Authors: Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou
Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen; Department of Dermatology, Tianjin Institute of Integrative Dermatology, Tianjin Academy of Traditional Chinese Medicine Affiliated Hospital, Tianjin 300120, China; Department of Dermatology, Beijing AnZhen Hospital, Capital Medical University, Beijing 100029, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate a rare skin disease dataset, the first benchmark to address the scarcity of clinical rare-skin-disease data, which contains 564 clinical samples covering eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen’s Kappa improvement.
[CV-72] SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection CVPR2026
【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection, OVOD)在处理伪装目标时的性能瓶颈问题,即当目标与背景视觉特征高度相似时,现有方法难以准确区分和定位伪装物体。解决方案的关键在于两个核心设计:一是提出一种子描述主成分对比融合策略(sub-description principal component contrastive fusion strategy),用于过滤由多模态大模型生成的细粒度文本描述中冗余或混淆的修饰成分,从而降低噪声干扰;二是设计了一种基于特异性引导的区域弱对齐与动态聚焦机制(specificity-guided regional weak alignment and dynamic focusing method),通过增强检测器对伪装目标局部区域的敏感性,提升其从复杂背景中辨别目标的能力。实验表明,在自建的OVCOD-D基准上,该方法在开放集评估设置下达到了56.4的平均精度(AP)。
链接: https://arxiv.org/abs/2603.26109
作者: Jiaming Liang,Yifeng Zhan,Chunlin Liu,Weihua Zheng,Bingye Peng,Qiwei Liang,Boyang Cai,Xiaochun Mai,Qiang Nie
机构: Shenzhen University (深圳大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision–language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector’s ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.
[CV-73] Accurate Precipitation Forecast by Efficiently Learning from Massive Atmospheric Variables and Unbalanced Distribution
【速读】:该论文旨在解决短时降水预报(0–24小时)中的三大挑战:一是降水事件演化模式的高度复杂性,二是降水与非降水样本之间的极端不平衡问题,三是现有模型难以高效利用多源大气观测数据。其解决方案的关键在于提出一种新型预测模型,能够自动提取并迭代预测与降水演变强相关的潜在特征,并引入一种专为稀有降水事件设计的“WMCE”损失函数,以提升对极端稀缺降水事件的识别精度和强度预测准确性。实验表明,该方法在准确性和计算效率上均显著优于主流基线模型,大幅降低了获取高价值预测所需的计算成本。
链接: https://arxiv.org/abs/2603.26108
作者: Shuangliang Li,Siwei Li,Li Li,Weijie Zou,Jie Yang,Maolin Zhang
机构: Wuhan University (武汉大学); Hubei Key Laboratory of Quantitative Remote Sensing of Land and Atmosphere (湖北省陆地与大气定量遥感重点实验室); Perception and Effectiveness Assessment for Carbon-neutrality Efforts, Engineering Research Center of Ministry of Education, Institute for Carbon Neutrality (碳中和成效感知与评估工程研究中心,教育部工程研究中心,碳中和研究院); Hubei Meteorological Information and Technology Support Center (湖北省气象信息与技术保障中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Short-term (0-24 hours) precipitation forecasting is highly valuable to socioeconomic activities and public safety. However, the highly complex evolution patterns of precipitation events, the extreme imbalance between precipitation and non-precipitation samples, and the inability of existing models to efficiently and effectively utilize large volumes of multi-source atmospheric observation data hinder improvements in precipitation forecasting accuracy and computational efficiency. To address the above challenges, this study developed a novel forecasting model capable of effectively and efficiently utilizing massive atmospheric observations by automatically extracting and iteratively predicting the latent features strongly associated with precipitation evolution. Furthermore, this study introduces a ‘WMCE’ loss function, designed to accurately discriminate extremely scarce precipitation events while precisely predicting their intensity values. Extensive experiments on two datasets demonstrate that our proposed model substantially and consistently outperforms all prevalent baselines in both accuracy and efficiency. Moreover, the proposed forecasting model substantially lowers the computational cost required to obtain valuable predictions compared to existing approaches, thereby positioning it as a milestone for efficient and practical precipitation forecasting.
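论文摘要未给出 'WMCE' 损失的具体定义;下面用类别加权二元交叉熵(class-weighted BCE)示意其核心思想——对极度稀缺的降水正样本施加更大权重以对抗样本不平衡。权重数值与形式均为假设,非论文原始实现:

```python
import math

def weighted_bce(y_true, p_pred, w_pos=10.0, w_neg=1.0, eps=1e-7):
    """Class-weighted binary cross-entropy: rare positive (precipitation)
    samples get a larger weight w_pos to counter the extreme imbalance
    between precipitation and non-precipitation samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)          # clamp for numerical safety
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

提高 w_pos 会放大漏报降水事件的代价,从而促使模型更敏感地识别稀有事件。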
[CV-74] AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation CVPR2026
【速读】:该论文旨在解决测试时适应(Test-time Adaptation, TTA)中因分布偏移导致模型性能下降的问题,现有方法主要依赖仿射调制(affine modulation)来调整归一化层,但忽略了激活函数在表征动态中的重要作用。解决方案的关键在于提出AcTTA框架,将传统激活函数(如ReLU、GELU)重新参数化为可学习的形式,通过调节响应阈值和梯度敏感性来实现测试时的自适应调整,从而在不修改网络权重且无需源数据的情况下连续优化激活行为,显著提升模型在多种图像噪声和域偏移下的鲁棒性。
链接: https://arxiv.org/abs/2603.26096
作者: Hyeongyu Kim,Geonhui Han,Dosik Hwang
机构: Yonsei University (延世大学); Korea Institute of Science and Technology (韩国科学技术院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift-robust test-time learning, broadening the prevailing affine-centric view of adaptation.
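AcTTA 的核心是把激活函数改写为可学习形式,在测试时只调整激活参数而不更新网络权重。下面是一个带可调阈值与斜率的 ReLU 变体最小示意;具体参数化方式为本文假设,非论文原始公式:

```python
def adaptive_relu(x, threshold=0.0, slope=1.0):
    """Shifted, scaled ReLU: the response threshold and slope play the role
    of learnable activation parameters updated at test time.
    With threshold=0 and slope=1 this reduces to the standard ReLU."""
    return slope * (x - threshold) if x > threshold else 0.0
```

测试时适应即只对 threshold、slope 这类少量标量做梯度更新,网络权重保持冻结,因而开销很小。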
[CV-75] CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection CVPR2026
【速读】:该论文旨在解决测试时适应(Test-Time Adaptation, TTA)中因域偏移(domain shift)严重程度不同而导致现有方法性能受限的问题。当前主流方法分为加性(additive)与减性(subtractive)两类:加性方法通过引入轻量模块对特征进行精炼,在中等强度域偏移下表现良好;而减性方法通过移除敏感于域的通道,在严重偏移场景下更有效。然而,这两种策略各自仅在特定偏移强度范围内有效,难以泛化到多样化的扰动水平。解决方案的关键在于提出CD-Buffer框架——一个基于差异驱动耦合(discrepancy-driven coupling)的互补双缓冲机制,通过统一的特征级域偏移度量自动平衡减性和加性策略,实现通道级自适应调节,从而根据实际偏移强度动态分配处理方式,无需人工调参即可在多种天气条件和扰动强度下稳定提升性能。
链接: https://arxiv.org/abs/2603.26092
作者: Youngjun Song,Hyeongyu Kim,Dosik Hwang
机构: Yonsei University (延世大学); Korea Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026
Abstract:Test-Time Adaptation (TTA) enables real-time adaptation to domain shifts without off-line retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This leads to the following question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in the discrepancy-driven coupling: Our framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.
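CD-Buffer 用统一的差异度量在减性(移除受损通道)与加性(精炼特征)两种策略间逐通道加权。下面的 sigmoid 门控仅为示意:差异越大,通道越接近被置零;差异越小,越接近使用加性精炼后的值。门控形式为假设,非论文实际实现:

```python
import math

def blend_channel(feature, refinement, discrepancy):
    """Channel-wise gate: g -> 1 under severe shift (remove the channel),
    g -> 0 under mild shift (keep the additively refined value)."""
    g = 1.0 / (1.0 + math.exp(-discrepancy))   # gate in (0, 1)
    return (1.0 - g) * (feature + refinement)  # g = 1 zeroes the channel

def cd_buffer(features, refinements, discrepancies):
    # Apply the discrepancy-driven blend independently to every channel.
    return [blend_channel(f, r, d)
            for f, r, d in zip(features, refinements, discrepancies)]
```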
[CV-76] Learnable Instance Attention Filtering for Adaptive Detector Distillation
【速读】:该论文旨在解决深度视觉模型在部署效率上的挑战,特别是在知识蒸馏(Knowledge Distillation, KD)过程中,现有基于特征的蒸馏方法通常对所有目标实例采用统一处理方式,忽略了实例级别的差异性;同时,现有的注意力过滤机制多为启发式或由教师模型驱动,缺乏与学生模型协同学习的能力。解决方案的关键在于提出一种可学习的实例注意力过滤机制(Learnable Instance Attention Filtering, LIAF-KD),通过引入可学习的实例选择器,在蒸馏过程中动态评估并重加权不同实例的重要性,且该机制能根据学生模型的学习状态自适应调整,从而实现更精准、高效的检测器蒸馏。实验表明,该方法在KITTI和COCO数据集上均取得显著性能提升,尤其在GFL ResNet-50学生模型上实现了2%的准确率增益,且无需增加额外计算复杂度。
链接: https://arxiv.org/abs/2603.26088
作者: Chen Liu,Qizhen Lan,Zhicheng Ding,Xinyu Chu,Qing Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.
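LIAF-KD 的可学习实例选择器会为每个目标实例动态分配蒸馏权重,而非均匀对待所有实例。下面用 softmax 归一化的实例加权损失作最小示意;importance_scores 代表选择器的输出,为假设输入:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def reweighted_distill_loss(per_instance_losses, importance_scores):
    """Weight each instance's feature-distillation loss by a learned
    importance score (softmax-normalized), instead of treating all object
    instances uniformly."""
    w = softmax(importance_scores)
    return sum(wi * li for wi, li in zip(w, per_instance_losses))
```

当所有分数相等时退化为普通平均;分数差异越大,蒸馏越聚焦于被判定为重要的实例。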
[CV-77] Experimental study on surveillance video-based indoor occupancy measurement with occupant-centric control
【速读】:该论文旨在解决智能建筑中基于视觉的室内 occupancy(占用状态)测量在真实环境下的稳定性与准确性不足,以及其对下游 HVAC(暖通空调)控制影响研究不充分的问题。解决方案的关键在于引入大语言模型(Large Language Models, LLMs)增强的视觉识别流程,通过对比检测-only、基于跟踪(tracking-based)和LLM精修(LLM-based refinement)三种管道,在相同实验条件下使用中国某实验室采集的真实监控数据进行验证,发现LLM精修显著提升了occupancy measurement的性能并减少了误判为“未占用”的情况;其中最优方案YOLOv8+DeepSeek实现了0.8824的准确率和0.9320的F1分数,并进一步集成至OpenStudio-EnergyPlus中的模型预测控制(MPC)框架,实测显示可实现17.94%的HVAC节能潜力,从而为AI增强型智能建筑运行提供了有效方法论和实践基础。
链接: https://arxiv.org/abs/2603.26081
作者: Irfan Qaisar,Kailai Sun,Qingshan Jia,Qianchuan Zhao
机构: Tsinghua University (清华大学); MIT (麻省理工学院)
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate occupancy information is essential for closed-loop occupant-centric control (OCC) in smart buildings. However, existing vision-based occupancy measurement methods often struggle to provide stable and accurate measurements in real indoor environments, and their implications for downstream HVAC control remain insufficiently studied. To achieve Net Zero emissions by 2050, this paper presents an experimental study of large language models (LLMs)-enhanced vision-based indoor occupancy measurement and its impact on OCC-enabled HVAC operation. Detection-only, tracking-based, and LLM-based refinement pipelines are compared under identical conditions using real surveillance data collected from a research laboratory in China, with frame-level manual ground-truth annotations. Results show that tracking-based methods improve temporal stability over detection-only measurement, while LLM-based refinement further improves occupancy measurement performance and reduces false unoccupied prediction. The best-performing pipeline, YOLOv8+DeepSeek, achieves an accuracy of 0.8824 and an F1-score of 0.9320. This pipeline is then integrated into an HVAC supervisory model predictive control framework in OpenStudio-EnergyPlus. Experimental results demonstrate that the proposed framework can support more efficient OCC operation, achieving a substantial HVAC energy-saving potential of 17.94%. These findings provide an effective methodology and practical foundation for future research in AI-enhanced smart building operations.
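摘要指出基于跟踪的方法相比仅检测的方法具有更好的时间稳定性。下面用滑动窗口中值滤波示意对逐帧人数计数的时间平滑:单帧检测抖动(如漏检或误检造成的尖峰)会被窗口中值抑制。仅为示意,并非论文采用的跟踪算法:

```python
def smooth_occupancy(counts, window=5):
    """Sliding-window median over per-frame people counts: a simple stand-in
    for the temporal stabilization that tracking adds on top of
    detection-only occupancy measurement."""
    half = window // 2
    out = []
    for i in range(len(counts)):
        seg = sorted(counts[max(0, i - half):i + half + 1])
        out.append(seg[len(seg) // 2])
    return out
```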
[CV-78] When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization CVPR2026
【速读】:该论文旨在解决当前以主体驱动的文本到图像扩散模型在多主体交互场景下存在的身份坍塌(identity collapse)问题,即随着主体数量增加或物理交互复杂度提升,模型难以保持每个主体的身份一致性。其关键解决方案是提出一种新的评估指标——主体坍塌率(Subject Collapse Rate, SCR),该指标基于DINOv2的结构先验,能够严格惩罚局部注意力泄漏和同质化现象,从而更准确地量化身份保真度;同时,研究揭示了现有基于CLIP的全局指标在多主体任务中的根本性缺陷,并指出语义捷径(semantic shortcuts)导致的全局注意力路由是引发身份坍塌的根本原因,强调未来生成架构需引入显式的物理解耦机制。
链接: https://arxiv.org/abs/2603.26078
作者: Zhihan Chen,Yuhuan Zhao,Yijie Zhu,Xinyu Yao
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of Southern California (南加州大学); DeerLab LLC; Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, accepted by CVPR 2026 Workshop P13N
Abstract:Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive “Illusion of Scalability” in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2’s structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.
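SCR 指标基于 DINOv2 结构先验惩罚主体同质化。下面用"任意两个主体嵌入的余弦相似度超过阈值即判定该图身份坍塌"作最小示意;阈值与嵌入均为假设,非论文原始实现:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def subject_collapse_rate(images, threshold=0.9):
    """Fraction of generated images in which any two subject embeddings are
    near-duplicates (cosine similarity above threshold), i.e. identities have
    collapsed into clones. `images` is a list of per-subject embedding lists;
    the real SCR uses DINOv2 features, plain vectors suffice here."""
    collapsed = 0
    for subj_embs in images:
        pairs = ((subj_embs[i], subj_embs[j])
                 for i in range(len(subj_embs))
                 for j in range(i + 1, len(subj_embs)))
        if any(cosine(u, v) > threshold for u, v in pairs):
            collapsed += 1
    return collapsed / len(images)
```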
[CV-79] MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality CVPR2026
【速读】:该论文旨在解决多模态医学数据中因模态缺失(如病理或基因组数据缺失)导致生存预测准确性下降的问题,这在精准肿瘤学临床部署中尤为突出。现有方法虽尝试通过特征对齐或联合分布学习处理缺失模态,但缺乏对各模态独特贡献的显式建模。其解决方案的关键在于提出MUST(Modality-Specific representation-aware Transformer)框架,通过在低秩共享子空间中引入代数约束,将每个模态的表示显式分解为模态特异性(modality-specific)和跨模态上下文化(cross-modal contextualized)两部分,从而精确识别缺失模态时损失的信息;对于无法从其他模态推断的模态特异性信息,采用条件潜扩散模型(conditional latent diffusion models)基于恢复的共享信息和学习到的结构先验生成高质量表示,实现鲁棒的生存预测性能。
链接: https://arxiv.org/abs/2603.26071
作者: Kyungwon Kim,Dosik Hwang
机构: Yonsei University (延世大学); Korea Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026. 10 pages, 5 figures, supplementary included
Abstract:Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality’s representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.
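MUST 将每个模态的表示分解为共享子空间投影与模态特异性残差。下面以向正交归一基张成的低秩子空间做投影作最小线性代数示意;基向量为假设的正交归一基,非论文学习到的子空间:

```python
def decompose(x, basis):
    """Split a feature vector into a cross-modal 'shared' part (its projection
    onto a low-rank subspace spanned by orthonormal basis vectors) and a
    'modality-specific' residual; by construction x = shared + specific."""
    shared = [0.0] * len(x)
    for b in basis:
        coef = sum(xi * bi for xi, bi in zip(x, b))   # <x, b>
        shared = [s + coef * bi for s, bi in zip(shared, b)]
    specific = [xi - si for xi, si in zip(x, shared)]
    return shared, specific
```

当某模态缺失时,其 shared 部分可由其他模态恢复,而 specific 残差正是论文中需要由条件潜扩散模型生成的信息。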
[CV-80] PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery CVPR2026
【速读】:该论文旨在解决从单帧图像重建的手部运动缺乏物理一致性且无法量化其物理合理性的问题。现有方法虽能提供高精度的单帧手部姿态估计,但未考虑动力学约束,导致生成的序列在时间上不连贯或违反物理规律。解决方案的关键在于提出一种物理感知的条件扩散框架,通过构建基于MeshCNN-Transformer的骨干网络并引入欧拉-拉格朗日(Euler-Lagrange)动力学模型,将动力学残差视为虚拟观测量进行融合,从而实现对噪声轨迹的物理合理性修正;进一步利用最后一层拉普拉斯近似(Laplace approximation)输出每个关节、每时刻的方差估计,以量化物理一致性,并生成可解释的方差图谱,直观反映物理约束薄弱的位置。
链接: https://arxiv.org/abs/2603.26068
作者: Elkhan Ismayilzada,Yufei Zhang,Zijun Cui
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.
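摘要提到的欧拉-拉格朗日动力学残差,按标准的关节动力学(manipulator equation)形式可写作如下;符号约定为通用形式,论文的具体公式可能不同:

```latex
% Euler-Lagrange dynamics of an articulated hand (standard manipulator form):
%   M(q): inertia matrix,  C(q,\dot q): Coriolis/centrifugal terms,
%   g(q): gravity,         \tau: joint torques.
r_t \;=\; M(q_t)\,\ddot q_t + C(q_t, \dot q_t)\,\dot q_t + g(q_t) - \tau_t
% Prior work enforces r_t = 0; here r_t is treated as a virtual observable,
% e.g. r_t \sim \mathcal{N}(0, \Sigma_t) with per-joint variance \Sigma_t
% estimated via a last-layer Laplace approximation.
```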
[CV-81] R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting
【速读】:该论文旨在解决物理对抗伪装(Physical adversarial camouflage)在复杂动态场景中泛化能力不足的问题,具体表现为现有方法在面对多样几何(如视角变化)和辐射(如动态光照、大气散射)变化时表现脆弱。其核心问题源于两个根本局限:一是依赖低 fidelity 的仿真(如 CARLA)导致域差距显著,使优化陷入偏差特征空间;二是标准平均性能优化策略引发陡峭的损失景观,使攻击易受特定配置影响。解决方案的关键在于提出基于可重照明物理 3D 高斯点绘(Relightable Physical 3D Gaussian Splatting, 3DGS)的攻击框架(R-PGA),通过两个核心技术实现突破:首先,利用 3DGS 实现照片级真实感重建,并引入物理解耦属性以分离材质与光照;其次,设计混合渲染管线,在前景使用精确的可重照明 3DGS 渲染的同时,借助预训练图像翻译模型合成匹配光照条件的背景;此外,创新性地提出硬物理配置挖掘模块(Hard Physical Configuration Mining, HPCM),主动挖掘最坏情况配置并抑制其对应的损失峰值,从而有效降低整体损失幅度并平滑损失景观,确保在不同物理配置下均具备一致的对抗效果与鲁棒性。
链接: https://arxiv.org/abs/2603.26067
作者: Tianrui Lou,Siyuan Liang,Jiawei Liang,Yuze Gao,Xiaochun Cao
机构: Sun Yat-sen University (中山大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Physical adversarial camouflage poses a severe security threat to autonomous driving systems by mapping adversarial textures onto 3D objects. Nevertheless, current methods remain brittle in complex dynamic scenarios, failing to generalize across diverse geometric (e.g., viewing configurations) and radiometric (e.g., dynamic illumination, atmospheric scattering) variations. We attribute this deficiency to two fundamental limitations in simulation and optimization. First, the reliance on coarse, oversimplified simulations (e.g., via CARLA) induces a significant domain gap, confining optimization to a biased feature space. Second, standard strategies targeting average performance result in a rugged loss landscape, leaving the camouflage vulnerable to configuration shifts. To bridge these gaps, we propose the Relightable Physical 3D Gaussian Splatting (3DGS) based Attack framework (R-PGA). Technically, to address the simulation fidelity issue, we leverage 3DGS to ensure photo-realistic reconstruction and augment it with physically disentangled attributes to decouple intrinsic material from lighting. Furthermore, we design a hybrid rendering pipeline that leverages precise Relightable 3DGS for foreground rendering, while employing a pre-trained image translation model to synthesize plausible relighted backgrounds that align with the relighted foreground. To address the optimization robustness issue, we propose the Hard Physical Configuration Mining (HPCM) module, designed to actively mine worst-case physical configurations and suppress their corresponding loss peaks. This strategy not only diminishes the overall loss magnitude but also effectively flattens the rugged loss landscape, ensuring consistent adversarial effectiveness and robustness across varying physical configurations.
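HPCM 模块的"最坏情况挖掘"思想可用如下通用草图示意:在候选物理配置(视角、光照等)中按当前攻击损失排序,保留损失最大的 k 个配置用于下一步纹理优化。这只是通用的 worst-case mining 草图,非论文原始实现:

```python
def mine_hard_configs(configs, loss_fn, k=3):
    """Rank candidate physical configurations (viewpoint, lighting, ...) by
    the current attack loss and keep the k worst cases, so the next texture
    update concentrates on suppressing those loss peaks rather than the
    average case."""
    return sorted(configs, key=loss_fn, reverse=True)[:k]
```

相比只优化平均损失,针对被挖掘出的困难配置做更新更容易压平损失景观中的尖峰。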
[CV-82] MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection
【速读】:该论文旨在解决非接触式自动欺骗检测中因视觉和听觉欺骗线索缺乏跨被试稳定模式而导致的挑战,其核心问题是如何利用更可靠的生理信号(如皮肤电反应,GSR)来指导非接触模态(如视频和音频)的表示学习。解决方案的关键在于提出一种基于GSR引导的渐进式蒸馏(GSR-guided Progressive Distillation, GPD)框架,通过融合特征级与数值级蒸馏机制,并引入动态路由策略,使模型能够自适应地决定教师知识在训练过程中的传递方式,从而有效缓解GSR与非接触信号间因模态差异导致的负迁移问题,实现更稳定的跨模态知识迁移。
链接: https://arxiv.org/abs/2603.26064
作者: Peiyuan Jiang,Yao Liu,Yanglei Gan,Jiaye Yang,Lu Liu,Daibing Yao,Qiao Liu
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.
[CV-83] Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline CVPR2026
【速读】:该论文旨在解决视频流畅度评估(Video Fluency Assessment, VFA)这一长期被忽视的感知任务问题,其核心挑战在于现有视频质量评估(VQA)方法仅将流畅度视为整体质量的一个子维度,导致对运动一致性与帧连续性等关键时域特征的建模不足。为应对这一问题,作者提出将VFA作为独立的感知任务,并构建首个面向流畅度的数据集FluVid,包含4,606个真实场景视频及标准化评分体系;同时开发了涵盖23种方法的基准测试平台,揭示了针对VFA优化模型设计的关键洞察。解决方案的核心创新在于提出FluNet基线模型,通过引入时间置换自注意力机制(Temporal Permuted Self-Attention, T-PSA),增强输入中的时序信息并提升长距离帧间交互能力,从而显著改善流畅度预测性能。
链接: https://arxiv.org/abs/2603.26055
作者: Qizhi Xie,Kun Yuan,Yunpeng Qu,Ming Sun,Chao Zhou,Jihong Zhu
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures. Accepted by CVPR 2026 findings track
Abstract:Accurately estimating humans’ subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for various applications like streaming and gaming. Yet, it has long been overlooked, as prior arts have focused on solving it in the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs. 3) We propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.
[CV-84] Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification CVPR2026
【速读】:该论文旨在解决多模态虚假信息(Multimodal Misinformation)检测中因“特征稀释”(feature dilution)导致的局部语义不一致难以识别的问题。现有基于被动整体融合的方法在处理复杂虚假内容时效果有限,因其全局对齐机制会平均掉细微的局部语义冲突,从而掩盖了关键矛盾。解决方案的关键在于提出MaLSF(Mask-aware Local Semantic Fusion)框架,其核心创新为:1)引入双向跨模态验证(Bidirectional Cross-modal Verification, BCV)模块,通过文本查询和图像查询并行交叉检验,主动定位模态间的语义冲突;2)设计分层语义聚合(Hierarchical Semantic Aggregation, HSA)模块,对多粒度冲突信号进行智能整合以支持任务特定推理。该方法通过掩码-标签对作为语义锚点连接像素与词汇,显著提升了检测精度与可解释性。
链接: https://arxiv.org/abs/2603.26052
作者: Zizhao Chen,Ping Wei,Ziyang Ren,Huan Li,Xiangru Yin
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
Abstract:As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to ‘feature dilution,’ global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
[CV-85] Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays
【速读】:该论文旨在解决当前医学视觉-语言预训练模型在胸部X光片诊断流程建模中的不足,具体表现为:现有方法将放射图像视为与上下文无关的图像,且未充分挖掘放射科医生注视轨迹(gaze)这一关键视觉推理线索,从而限制了疾病特异性模式的捕捉能力并削弱了跨模态对齐效果。其解决方案的关键在于提出CoGaze框架,通过两个核心设计实现突破:一是构建一个融合临床上下文(包括患者病史、症状和诊断意图)的视觉编码器,模拟放射科医生如何利用上下文引导诊断推理;二是引入多层级监督机制,包括混合正样本对比学习强化模态内与模态间语义对齐、基于疾病感知的跨模态表示学习注入诊断先验知识,并利用放射科医生注视点作为概率先验来指导注意力聚焦于诊断显著区域。
链接: https://arxiv.org/abs/2603.26049
作者: Kang Liu,Zhuoqi Ma,Siyu Liang,Yunan Li,Xiyue Gao,Chao Liang,Kun Xie,Qiguang Miao
机构: Xidian University (西安电子科技大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code: this https URL
Abstract:Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists’ gaze – a crucial cue for visual reasoning – remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context – including patient history, symptoms, and diagnostic intent – to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists’ gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at this https URL.
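将放射科医生注视点作为概率先验引导注意力,可用"以注视点为中心的高斯先验与模型注意力图加权混合后归一化"作最小示意;高斯形式与混合系数 alpha 均为假设,非论文原始机制:

```python
import math

def gaze_prior_attention(attn, gaze_xy, size, sigma=2.0, alpha=0.5):
    """Blend a model attention map with a Gaussian prior centered on a
    radiologist's gaze point, then renormalize so the result still sums to 1.
    attn is an h x w map of nonnegative weights; gaze_xy = (x, y)."""
    h, w = size
    gx, gy = gaze_xy
    prior = [[math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2 * sigma ** 2))
              for x in range(w)] for y in range(h)]
    z = sum(sum(row) for row in prior)
    prior = [[v / z for v in row] for row in prior]
    mixed = [[(1 - alpha) * attn[y][x] + alpha * prior[y][x]
              for x in range(w)] for y in range(h)]
    s = sum(sum(row) for row in mixed)
    return [[v / s for v in row] for row in mixed]
```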
[CV-86] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents : Semantic Spatial and Temporal Perspectives
【速读】:该论文旨在解决GUI视觉代理(GUI Visual Agents)在导航任务中因高分辨率界面截图产生大量视觉标记(visual tokens)而导致的历史信息保存计算开销过大的问题。解决方案的关键在于通过实证研究发现并利用三个核心洞察:首先,GUI截图具有显著的前景-背景语义结构,其中背景区域虽常被忽视,但能有效捕捉界面状态变化,提供辅助推理线索;其次,随机剪枝相较于精心设计的策略更有利于保持空间结构,在相同计算预算下表现更优;最后,GUI代理表现出类似人类认知的“近期效应”,即通过为较新截图分配更多标记预算、对远期截图进行大幅压缩,可在几乎不损失性能的前提下显著降低计算成本。这些发现为高效GUI视觉代理的设计提供了新的理论依据和实践指导。
链接: https://arxiv.org/abs/2603.26041
作者: Daiqiang Li,Zihao Pan,Zeyu Zhang,Ronghao Chen,Huacan Wang,Honggang Chen,Haiyun Jiang
机构: Sichuan University (四川大学); Sun Yat-sen University (中山大学); Australian National University (澳大利亚国立大学); Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI Agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.
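摘要所述"近期效应"的预算分配策略可以简单示意为:按截图帧龄做几何衰减,越新的截图分到越多 token,远期截图被大幅压缩。衰减系数为假设值,仅为示意:

```python
def allocate_token_budgets(num_screens, total_budget, decay=0.5):
    """Split a total visual-token budget across historical screenshots so
    that more recent frames (higher index) receive geometrically more
    tokens, mimicking the recency effect described in the study."""
    raw = [decay ** (num_screens - 1 - i) for i in range(num_screens)]
    s = sum(raw)
    # Guarantee at least one token per screenshot after rounding.
    return [max(1, round(total_budget * r / s)) for r in raw]
```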
[CV-87] Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection
【速读】:该论文旨在解决深度伪造(deepfake)内容检测中因篡改方式多样、特征复杂而导致的检测效果受限问题。其解决方案的关键在于提出一种基于分层特征表示(Hierarchical Feature Representation, HFR)的新型混合方法 Face2Parts,通过从图像帧、人脸整体及关键面部区域(唇、眼、鼻)分别提取特征,实现粗粒度到细粒度的信息融合,并借助通道注意力机制与深度三元组学习捕捉面部区域间的相互依赖关系,从而提升对多种类型深度伪造的识别准确率与泛化能力。
链接: https://arxiv.org/abs/2603.26036
作者: Kutub Uddin,Nusrat Tasnim,Byung Tae Oh
机构: Korea Aerospace University (韩国航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation (HFR) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method extracts features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in intra-dataset, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR. The results demonstrate that our approach generalizes effectively and achieves promising performance, outperforming existing methods.
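示意性代码(假设实现):摘要提到的"深度三元组学习"可以用标准的三元组间隔损失来理解——拉近同类(真实/伪造)区域特征、推远异类特征。以下纯 Python 草图中的函数名与数值均为假设示例。

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull same-class (real/fake) region
    embeddings together and push opposite-class embeddings apart by at
    least `margin`."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Toy region embeddings (e.g., lip-region features of real vs. fake faces).
real_a = [0.1, 0.2]      # anchor: real
real_b = [0.15, 0.25]    # positive: another real face
fake_c = [0.9, 0.8]      # negative: a deepfake

loss = triplet_loss(real_a, real_b, fake_c)
```

当负样本已被推远超过间隔时损失趋近于零;若把伪造样本误作正样本(如 `triplet_loss(real_a, fake_c, real_b)`),损失会明显增大。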
[CV-88] Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs
【速读】:该论文旨在解决少样本动作识别(Few-Shot Action Recognition, FSAR)中传统方法依赖特征-文本-特征的次优流水线以及仅在视觉空间内进行度量学习的问题。其核心挑战在于如何有效利用多模态大语言模型(Multimodal Large Language Models, MLLMs)所蕴含的丰富语义知识,实现端到端的动作识别性能提升。解决方案的关键在于:首先,通过MLLM的多模态解码器提取时空与语义增强的表示,并借助提出的多模态特征增强模块(Multimodal Feature-Enhanced Module)分离并强化视觉与文本特征,从而充分挖掘其语义信息;其次,利用MLLM的灵活性设计适配不同场景的输入提示,并基于对齐输出构建面向任务的原型(Composite Task-Oriented Prototype Construction),缓解元训练与元测试分布差异;最后,提出一种无需训练的多模态原型匹配度量方法(Multimodal Prototype Matching Metric),自适应选择关键判别线索,联合引导多模态特征驱动的度量学习,显著提升识别准确率且参数开销极低。
链接: https://arxiv.org/abs/2603.26033
作者: Jiazheng Xing,Chao Xu,Hangjie Yuan,Mengmeng Wang,Jun Dan,Hangwei Qian,Yong Liu
机构: Zhejiang University (浙江大学); Zhejiang University of Technology (浙江工业大学); A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature-caption-feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM’s multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.
[CV-89] Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA
【速读】:该论文旨在解决医学视觉问答(MedVQA)模型在跨数据集场景下泛化能力不足的问题,其根源在于模型过度依赖数据集特有的关联性(如特定解剖结构模式或问题类型规律),而非真正的诊断证据。解决方案的关键在于提出一种可学习的因果修剪(Learnable Causal Trimming, LCT)框架,该框架将因果剪枝机制嵌入端到端优化流程中:通过动态解剖特征库(Dynamic Anatomical Feature Bank, DAFB)以动量更新方式捕捉高频解剖与语言模式的全局原型,进而设计一个可微分的修剪模块,量化实例级表征与全局特征库之间的依赖关系;高相关特征被软抑制,而实例特异性证据则被强化,从而引导模型自适应地聚焦于因果信号而非伪相关性。
链接: https://arxiv.org/abs/2603.26028
作者: Zibo Xu,Qiang Li,Weizhi Nie,Yuting Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.
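示意性代码(假设实现):下面用纯 Python 勾勒 LCT 的两个核心机制——动量更新的全局原型(近似 DAFB)与按"特征-原型依赖度"进行的软抑制。相似度度量与门控形式为本文的简化假设,非论文原始公式。

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def momentum_update(prototype, feature, m=0.9):
    """EMA update of a global prototype: p <- m*p + (1-m)*f, mimicking the
    momentum mechanism used to maintain the feature bank."""
    return [m * p + (1 - m) * f for p, f in zip(prototype, feature)]

def soft_trim(feature, prototype):
    """Suppress the part of a feature that aligns with a dataset-level
    prototype: the more a feature resembles the global pattern, the more
    it is attenuated, leaving instance-specific evidence emphasized."""
    s = max(0.0, cosine(feature, prototype))  # dependency estimate in [0, 1]
    return [(1.0 - s) * x for x in feature]

proto = [1.0, 0.0]            # global prototype of a frequent pattern
biased = [2.0, 0.0]           # fully aligned with the global pattern
specific = [0.0, 3.0]         # orthogonal, instance-specific evidence
```

与原型方向重合的特征被完全抑制,正交的实例证据原样保留;论文中该门控是可微分并端到端学习的,此处仅作静态示意。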
[CV-90] Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features
【速读】:该论文旨在解决在缺乏目标域标注数据的情况下,如何实现跨机构部署时对主动脉夹层(Type A Aortic Dissection, TAAD)关键临床特征的准确提取问题。当前方法受限于单一中心数据训练导致的领域偏移(domain shift),且依赖高成本的像素级标注,难以在多中心实际场景中推广。解决方案的关键在于提出一种基于无监督域自适应(Unsupervised Domain Adaptation, UDA)的框架,该框架仅利用有限源域标签信息,即可有效适配未标注的目标域数据,在保证稳定多类分割性能的同时,实现可量化、可靠的临床特征自动提取,从而支持无需昂贵标注的端到端部署,显著提升模型在真实急诊流程中的实用性和鲁棒性。
链接: https://arxiv.org/abs/2603.26019
作者: Mengdi Liu,Qiang Li,Weizhi Nie,Shaopeng Zhang,Yuting Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert-level pixel-wise annotations, which is impractical for most clinical institutions. Due to significant domain shift, models trained on a single-center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches. More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.
[CV-91] GeoReFormer: Geometry-Aware Refinement for Lane Segment Detection and Topology Reasoning
【速读】:该论文旨在解决自动驾驶中结构化在线地图构建任务里3D车道线段检测与拓扑推理的准确性问题。现有基于Transformer的方法虽采用查询驱动的集合预测框架,但其解码器设计沿袭自紧凑目标检测任务,未能显式建模车道线作为嵌入于有向图中的连续多段线(polylines)所具有的几何与关系结构。解决方案的关键在于提出GeoReFormer(Geometry-aware Refinement Transformer),其核心创新包括:通过数据驱动的几何先验实现结构化查询初始化、引入坐标空间约束的精炼机制以保障多段线形变稳定性,以及每查询门控的拓扑传播策略以选择性融合关系上下文,从而在OpenLane-V2基准上实现34.5% mAP的最优性能并显著提升拓扑一致性。
链接: https://arxiv.org/abs/2603.26018
作者: Danny Abraham,Nikhil Kamalkumar Advani,Arun Das,Nikil Dutt
机构: University of California, Irvine, CA, USA (加州大学欧文分校); Bosch North America, Sunnyvale, CA, USA (博世北美)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 6 figures
Abstract:Accurate 3D lane segment detection and topology reasoning are critical for structured online map construction in autonomous driving. Recent transformer-based approaches formulate this task as query-based set prediction, yet largely inherit decoder designs originally developed for compact object detection. However, lane segments are continuous polylines embedded in directed graphs, and generic query initialization and unconstrained refinement do not explicitly encode this geometric and relational structure. We propose GeoReFormer (Geometry-aware Refinement Transformer), a unified query-based architecture that embeds geometry- and topology-aware inductive biases directly within the transformer decoder. GeoReFormer introduces data-driven geometric priors for structured query initialization, bounded coordinate-space refinement for stable polyline deformation, and per-query gated topology propagation to selectively integrate relational context. On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP while improving topology consistency over strong transformer baselines, demonstrating the utility of explicit geometric and relational structure encoding.
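示意性代码(假设实现):摘要中的"坐标空间约束精炼"可理解为对每层预测的偏移量做有界压缩,保证单步形变不会破坏多段线结构。以下用 tanh 压缩作为一种可能的实现方式,`max_offset` 等参数均为本文假设。

```python
import math

def bounded_refine(points, deltas, max_offset=0.5):
    """Refine polyline control points with offsets squashed through tanh,
    so a single decoder layer can never move a point by more than
    max_offset in any coordinate -- a hypothetical form of the paper's
    bounded coordinate-space refinement."""
    refined = []
    for (x, y), (dx, dy) in zip(points, deltas):
        refined.append((x + max_offset * math.tanh(dx),
                        y + max_offset * math.tanh(dy)))
    return refined

lane = [(0.0, 0.0), (1.0, 0.0)]
raw_deltas = [(100.0, 0.0), (-100.0, 0.0)]  # wildly large raw predictions
refined = bounded_refine(lane, raw_deltas)
```

即使网络输出极端偏移,精炼后的控制点位移也被限制在 `max_offset` 之内,从而保持多段线形变的稳定性。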
[CV-92] VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation
【速读】:该论文旨在解决传统人脸年龄估计(facial age estimation)方法依赖大量标注数据和领域特定训练的问题,提出利用通用型大视觉语言模型(Large Vision-Language Models, LVLMs)实现零样本(zero-shot)年龄估计的可行性。其解决方案的关键在于不进行任何微调或任务特异性适配,直接评估 GPT-4o、Claude 3.5 Sonnet 和 LLaMA 3.2 Vision 等前沿 LVLM 在 UTKFace 和 FG-NET 两个基准数据集上的表现,通过八项指标验证其在无需额外训练的情况下仍可达到与专用卷积神经网络相当的性能,从而揭示 LVLM 的涌现能力(emergent capabilities)及其在法医科学、健康监测和人机交互等实际场景中的潜力。
链接: https://arxiv.org/abs/2603.26015
作者: Rakib Hossain Sajib,Md Kishor Morol,Rajan Das Gupta,Mohammad Sakib Mahmood,Shuvra Smaran Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, R^2, CCC, and ±5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
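示意性代码:论文使用的八项指标中,若干项可以直接用纯 Python 复现(此处实现 MAE、RMSE、MBE、MAPE 与 ±5 岁准确率;MSE、R²、CCC 可同理推得)。函数名与示例数据均为本文假设。

```python
import math

def age_metrics(y_true, y_pred):
    """Compute a subset of the paper's evaluation metrics in pure Python:
    MAE, RMSE, MBE (signed bias), MAPE (percent), and accuracy within
    +/-5 years."""
    n = len(y_true)
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    mbe = sum(errs) / n                                    # over/under-estimation bias
    mape = sum(abs(e) / t for e, t in zip(errs, y_true)) / n * 100
    acc5 = sum(abs(e) <= 5 for e in errs) / n
    return {"MAE": mae, "RMSE": rmse, "MBE": mbe, "MAPE": mape, "Acc@5": acc5}

true_ages = [20, 30, 40, 50]
pred_ages = [22, 27, 46, 50]   # hypothetical model outputs
metrics = age_metrics(true_ages, pred_ages)
```

MBE 保留误差符号,可区分模型系统性高估还是低估年龄,这正是零样本评测中诊断人口子群偏差时需要的信息。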
[CV-93] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在临床医学影像任务中因群体差异导致的公平性问题,即模型在不同人口统计学群体间表现不一致,可能引发诊断叙述偏差并削弱医生对AI辅助决策的信任。解决方案的关键在于提出一种参数高效的微调方法FairLLaVA,通过最小化目标属性与模型表示之间的互信息(mutual information),使模型特征具备人群不变性(demographic-invariant),从而降低组间差异;该方法采用低秩适配器(low-rank adapter fine-tuning)实现轻量级插件式集成,具有架构无关性,且在胸部X光报告生成和皮肤镜图像问答等多个医学影像基准上均显著提升了公平性指标与临床性能的一致性。
链接: https://arxiv.org/abs/2603.26008
作者: Mahesh Bhosale,Abdul Wasi,Shantam Srivastava,Shifa Latif,Tianyu Luan,Mingchen Gao,David Doermann,Xuan Gong
机构: University at Buffalo (纽约州立大学布法罗分校); University of Kashmir (克什米尔大学); Accenture (埃森哲); Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model’s representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at this https URL.
[CV-94] Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models CVPR2026
【速读】:该论文旨在解决文本到图像扩散模型中局部概念擦除(localized concept erasure)时导致语义邻近概念(neighbor concepts)性能下降的问题,即在移除目标概念的同时可能无意中削弱与之相关的相邻概念,从而影响细粒度领域的生成质量。其解决方案的关键在于提出一种无需训练的邻居感知局部概念擦除框架(Neighbor-Aware Localized Concept Erasure, NLCE),通过三个阶段实现:首先采用谱加权嵌入调制(spectrally-weighted embedding modulation)抑制目标概念方向并稳定邻近概念表示;其次利用注意力引导的空间门(attention-guided spatial gate)识别残留概念激活区域;最后实施空间门控硬擦除(spatially-gated hard erasure)仅在必要区域消除剩余痕迹。该方法能够在保持周围概念邻接结构的前提下实现精准的局部概念删除,显著提升细粒度场景下的擦除效果和泛化能力。
链接: https://arxiv.org/abs/2603.25994
作者: Zhuan Shi,Alireza Dehghanpour Farashah,Rik de Vries,Golnoosh Farnadi
机构: McGill University (麦吉尔大学); Mila - Quebec AI Institute (魁北克AI研究所); EPFL (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted by CVPR 2026 main
Abstract:Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
[CV-95] FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation
【速读】:该论文旨在解决现有基于前馈式3D重建模型在扩展至3D实例分割时所依赖的“提升-聚类”(lift-and-cluster)范式的局限性,即通过非可微聚类进行密集像素级嵌入分组,导致计算效率低、难以扩展到多视角场景,并且使表示学习与最终分割目标脱节。其解决方案的关键在于提出一种端到端的Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS),核心创新包括:1)设计了一种基于3D锚点的查询式Transformer架构,在保留零样本几何先验的同时学习实例特定语义;2)引入可学习的3D锚点生成器与锚点采样交叉注意力机制,实现跨视角一致的3D实例分割;3)采用双层正则化策略,结合多视角对比学习与动态调度的空间重叠惩罚项,有效防止查询冲突并精确界定实例边界。该方法显著提升了内存可扩展性和推理速度,同时保持了竞争性的分割精度。
链接: https://arxiv.org/abs/2603.25993
作者: Changyang Li,Xueqing Huang,Shin-Fang Chng,Huangying Zhan,Qingan Yan,Yi Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed “lift-and-cluster” paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that effectively bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy, that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.
[CV-96] JRM: Joint Reconstruction Model for Multiple Objects without Alignment
【速读】:该论文旨在解决以对象为中心的三维重建(object-centric reconstruction)中因忽略重复性信息而导致的重建质量受限问题。传统方法通常假设场景中的对象独立存在,从而丢弃了同一对象在不同视角或扫描中重复出现所携带的强信号,这限制了重建的精度与鲁棒性。解决方案的关键在于提出联合重建模型(Joint Reconstruction Model, JRM),其核心创新是将对象重建建模为一种个性化生成任务:多个观测共享一个统一的对象主体(subject),同时保留各自特定的姿态和状态。JRM采用3D流匹配生成模型(3D flow-matching generative model),在潜在空间中隐式聚合未对齐的观测数据,无需显式匹配或刚性对齐即可学习生成一致且忠实的重建结果,从而显著提升对错误关联的鲁棒性,并自然处理非刚性形变(如关节运动)。
链接: https://arxiv.org/abs/2603.25985
作者: Qirui Wu,Yawar Siddiqui,Duncan Frost,Samir Aroudj,Armen Avetisyan,Richard Newcombe,Angel X. Chang,Jakob Engel,Henry Howard-Jenkins
机构: Meta Reality Labs Research (Meta现实实验室研究); Simon Fraser University (西蒙菲莎大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM’s implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.
[CV-97] Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
【速读】:该论文旨在解决扩散磁共振成像(Diffusion Magnetic Resonance Imaging, dMRI)中缺乏通用表征学习方法的问题,尤其是现有深度学习模型难以捕捉扩散信号特有的空间结构与方向依赖性,且在不同采集协议(如不同梯度方向数量)下泛化能力有限。其解决方案的关键在于提出一种扩散空间旋转变换位置编码(Diffusion Space Rotatory Positional Embedding, D-RoPE),该编码机制被嵌入到dMRI Transformer架构中,能够联合建模空间、扩散加权和方向特性,从而实现跨多种采集设置和任意扩散方向数的鲁棒且可迁移的表征学习。预训练后的模型在多个下游任务中表现优异,例如在轻度认知障碍分类中准确率提升6%,预测认知评分的相关系数提高0.05。
链接: https://arxiv.org/abs/2603.25977
作者: Gustavo Chau Loo Kung,Mohammad Abbasi,Camila Blank,Juze Zhang,Alan Q. Wang,Sophie Ostmeier,Akshay Chaudhari,Kilian Pohl,Ehsan Adeli
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotatory positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model provide competitive or superior performance compared to several baselines (even a fully trained one); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: this http URL.
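示意性代码(假设实现):D-RoPE 的基础是旋转位置编码(RoPE)——将特征向量按维度成对旋转,旋转角随位置坐标变化,使得两个位置编码后特征的内积只依赖相对位置。下面的纯 Python 草图以单一标量坐标演示该性质;论文将此思想推广到扩散空间(梯度方向与 b 值),具体形式以原文为准。

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary positional embedding on a flat feature vector: consecutive
    dimension pairs are rotated by a position-dependent angle. Here `pos`
    is a single scalar coordinate for illustration; D-RoPE generalizes
    the coordinate to diffusion-space quantities."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)      # frequency decreases with dim index
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

v = [1.0, 0.0, 1.0, 0.0]
```

旋转不改变向量范数,且 `rope(v, 2)` 与 `rope(v, 5)` 的内积等于 `rope(v, 0)` 与 `rope(v, 3)` 的内积——注意力分数因此只感知相对位置差。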
[CV-98] Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
【速读】:该论文旨在解决自动驾驶系统在决策过程中难以与人类意图对齐的问题,尤其是在训练强化学习(Reinforcement Learning, RL)模型时缺乏高效、直接的人类认知反馈机制。传统基于人类偏好标注的强化学习与人类反馈(Reinforcement Learning with Human Feedback, RLHF)方法依赖于人工排序生成结果,存在效率低且间接的缺陷。其解决方案的关键在于引入脑电图(Electroencephalography, EEG)信号作为神经认知反馈源,通过采集驾驶员在真实驾驶模拟器中的EEG数据并分析事件相关电位(Event-Related Potentials, ERP),构建一个基于视觉场景信息预测ERP强度的神经网络模型,并将该认知信号整合进RL算法的奖励函数中,从而实现无需行为中断即可获取人类认知洞察的闭环优化机制。实验表明,该框架显著提升了强化学习模型的碰撞规避能力,验证了神经认知反馈在增强自动驾驶系统安全性方面的潜力。
链接: https://arxiv.org/abs/2603.25968
作者: Zhuoli Zhuang,Yu-Cheng Chang,Yu-Kai Wang,Thomas Do,Chin-Teng Lin
机构: University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework that incorporates human cognitive insights into reinforcement learning (RL) for autonomous driving without interrupting behavioral responses. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: this https URL.
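示意性代码(假设实现):将预测的 ERP 强度并入奖励函数,最直接的形式是对"认知上令人惊讶/危险"的状态施加惩罚。以下线性 ERP 预测器与加权方式均为本文假设的极简替代,仅用于说明整合思路,并非论文的网络结构。

```python
def predict_erp(scene_features, weights, bias=0.0):
    """Toy linear stand-in for the paper's neural ERP predictor: maps
    visual scene features to an expected ERP strength, clipped to [0, 1]."""
    z = sum(w * f for w, f in zip(weights, scene_features)) + bias
    return max(0.0, min(1.0, z))

def shaped_reward(env_reward, erp, weight=0.5):
    """Penalize states the cognitive model flags as surprising/hazardous,
    folding neuro-cognitive feedback into the RL reward signal."""
    return env_reward - weight * erp

w = [0.8, 0.6]                      # hypothetical learned weights
calm = [0.1, 0.0]                   # e.g., empty road ahead
hazard = [0.9, 0.7]                 # e.g., pedestrian cutting in

r_calm = shaped_reward(1.0, predict_erp(calm, w))
r_hazard = shaped_reward(1.0, predict_erp(hazard, w))
```

在相同环境奖励下,引发强 ERP 的危险场景获得更低的塑形奖励,从而引导策略学习规避碰撞。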
[CV-99] BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles
【速读】:该论文旨在解决在无全球导航卫星系统(GNSS)或GNSS信号弱化的环境中,自动驾驶车辆的鲁棒定位问题。传统依赖GNSS的定位方法在此类场景中失效,亟需替代方案以实现准确、可靠的重定位(re-localization)。其解决方案的关键在于提出BEVMapMatch框架:该框架通过融合激光雷达与摄像头数据生成多模态鸟瞰图(Bird’s Eye View, BEV)分割图,在不同天气条件下保持一致性;并引入基于交叉注意力(cross-attention)的搜索机制,从已知地图中高效检索候选地图块进行匹配;最终利用最优候选区域进行精细化对齐,从而实现无需GNSS先验的全局定位。实验表明,该方法在GNSS缺失环境下显著优于现有基线,召回率(Recall@1m)达39.8%,近乎提升一倍。
链接: https://arxiv.org/abs/2603.25963
作者: Shounak Sural,Ragunathan Rajkumar
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:Localization in GNSS-denied and GNSS-degraded environments is a challenge for the safe widespread deployment of autonomous vehicles. Such GNSS-challenged environments require alternative methods for robust localization. In this work, we propose BEVMapMatch, a framework for robust vehicle re-localization on a known map without the need for GNSS priors. BEVMapMatch uses a context-aware lidar+camera fusion method to generate multimodal Bird’s Eye View (BEV) segmentations around the ego vehicle in both good and adverse weather conditions. Leveraging a search mechanism based on cross-attention, the generated BEV segmentation maps are then used for the retrieval of candidate map patches for map-matching purposes. Finally, BEVMapMatch uses the top retrieved candidate for finer alignment against the generated BEV segmentation, achieving accurate global localization without the need for GNSS. Multiple frames of generated BEV segmentation further improve localization accuracy. Extensive evaluations show that BEVMapMatch outperforms existing methods for re-localization in GNSS-denied and adverse environments, with a Recall@1m of 39.8%, nearly twice that of the best-performing re-localization baseline. Our code and data will be made available at this https URL.
[CV-100] Low-Rank-Modulated Functa: Exploring the Latent Space of Implicit Neural Representations for Interpretable Ultrasound Video Analysis
【速读】:该论文旨在解决基于Functa的隐式神经表示(Implicit Neural Representations, INRs)在超声视频分析中 latent space 结构不清晰、缺乏可解释性的问题,尤其在时间分辨的潜空间中难以捕捉周期性生理动态模式。解决方案的关键在于提出低秩调制Functa(Low-Rank-Modulated Functa, LRM-Functa),通过在时间维度上强制对调制向量施加低秩约束,使潜空间中形成结构化的周期轨迹,从而实现对心脏周期中舒张末期(end-diastolic, ED)和收缩末期(end-systolic, ES)帧的无监督识别,并支持平滑的帧间插值与直接读出,同时保持高压缩比(如 rank k=2)下的下游任务性能(如射血分数预测)。
链接: https://arxiv.org/abs/2603.25951
作者: Julia Wolleb,Cristiana Baloescu,Alicia Durrer,Hemant D. Tagare,Xenophon Papademetris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit neural representations (INRs) have emerged as a powerful framework for continuous image representation learning. In Functa-based approaches, each image is encoded as a latent modulation vector that conditions a shared INR, enabling strong reconstruction performance. However, the structure and interpretability of the corresponding latent spaces remain largely unexplored. In this work, we investigate the latent space of Functa-based models for ultrasound videos and propose Low-Rank-Modulated Functa (LRM-Functa), a novel architecture that enforces a low-rank adaptation of modulation vectors in the time-resolved latent space. When applied to cardiac ultrasound, the resulting latent space exhibits clearly structured periodic trajectories, facilitating visualization and interpretability of temporal patterns. The latent space can be traversed to sample novel frames, revealing smooth transitions along the cardiac cycle, and enabling direct readout of end-diastolic (ED) and end-systolic (ES) frames without additional model training. We show that LRM-Functa outperforms prior methods in unsupervised ED and ES frame detection, while compressing each video frame to as low as rank k=2 without sacrificing competitive downstream performance on ejection fraction prediction. Evaluations on out-of-distribution frame selection in a cardiac point-of-care dataset, as well as on lung ultrasound for B-line classification, demonstrate the generalizability of our approach. Overall, LRM-Functa provides a compact, interpretable, and generalizable framework for ultrasound video analysis. The code is available at this https URL.
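示意性代码(假设实现):"低秩调制"的核心是把 T×D 的逐帧调制矩阵分解为 T×k 的系数与 k×D 的共享基(当 k ≪ min(T, D) 时即实现压缩),潜空间轨迹由系数的周期变化刻画。以下纯 Python 草图中的基与系数均为玩具数据。

```python
def low_rank_modulations(basis, coeffs):
    """Reconstruct per-frame modulation vectors from a rank-k factorization:
    each frame's modulation is a linear combination of k shared basis
    vectors, so a T x D modulation matrix is stored as T x k plus k x D."""
    T, k = len(coeffs), len(basis)
    D = len(basis[0])
    return [[sum(coeffs[t][j] * basis[j][d] for j in range(k))
             for d in range(D)]
            for t in range(T)]

# Rank k=2 latent trajectory over a toy "cardiac cycle" of 4 frames:
# the coefficients trace a closed loop, mimicking a periodic latent path.
basis = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]                    # k x D
coeffs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]    # T x k
mods = low_rank_modulations(basis, coeffs)
```

系数绕原点一周,重建出的调制向量也随之周期变化——这正是论文中据以无监督读出 ED/ES 帧的结构化潜轨迹的简化写照。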
[CV-101] Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets
【速读】:该论文旨在解决端到端(End-to-End, E2E)自动驾驶中高违规率问题,特别是由碰撞引发的违规行为在闭环评估中的主导地位,而现有方法对碰撞感知表征学习关注不足。解决方案的关键在于提出一种视频-语言增强异常检测器(Video-Language-Augmented Anomaly Detector, VLAAD),其基于多实例学习(Multiple Instance Learning, MIL)框架,可稳定提取时序上局部化的碰撞信号以实现前瞻性预测;同时构建了大规模多模态数据集CARLA-Collide和Real-Collide,分别用于仿真与真实场景下的训练与评估,使VLAAD能无缝集成至现有E2E驾驶模型中,显著提升驾驶得分并展现优异的泛化能力。
链接: https://arxiv.org/abs/2603.25946
作者: Alex Koran,Dimitrios Sinodinos,Hadi Hojjati,Takuya Nanri,Fangge Chen,Narges Armanfard
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 11 figures
Abstract:High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
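示意性代码(假设实现):多实例学习(MIL)框架下,视频级异常分数取各时间片段分数的最大值,并用排序损失要求碰撞视频的最高分片段至少以一定间隔超过正常视频的最高分片段。以下为该思想的极简纯 Python 草图,函数名与数值均为本文假设。

```python
def mil_video_score(instance_scores):
    """Multiple-instance prediction: a video is flagged as containing a
    collision if its highest-scoring temporal segment is anomalous."""
    return max(instance_scores)

def mil_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """MIL ranking loss, as used in weakly supervised anomaly detection:
    the top instance of a collision video should outscore the top
    instance of a normal video by at least `margin`."""
    return max(0.0, margin - mil_video_score(pos_scores) + mil_video_score(neg_scores))

collision_video = [0.1, 0.2, 0.9, 0.3]   # score spike at the collision segment
normal_video = [0.1, 0.15, 0.1, 0.2]
loss = mil_ranking_loss(collision_video, normal_video)
```

由于监督只作用于视频级标签,模型在训练中自行定位得分最高的片段,这正是摘要所说"稳定、时序局部化的碰撞信号"的来源。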
[CV-102] Reinforcing Structured Chain-of-Thought for Video Understanding CVPR2026
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在视频理解任务中因思维漂移(thinking drift)和弱时序理解能力导致的推理性能不足问题,同时克服现有强化学习(Reinforcement Learning, RL)方法依赖监督微调(Supervised Fine-Tuning, SFT)所带来的高成本、固定推理路径限制及潜在偏差。其解决方案的关键在于提出一种新颖的单阶段强化学习框架——摘要驱动强化学习(Summary-Driven Reinforcement Learning, SDRL),该框架摒弃了SFT步骤,采用结构化的Chain-of-Thought(CoT)格式“总结-思考-回答”,并创新性地将两种自监督机制嵌入Group Relative Policy Optimization(GRPO)目标:一是视觉知识一致性(Consistency of Vision Knowledge, CVK),通过最小化生成摘要间的KL散度实现事实性约束;二是推理多样性动态调节(Dynamic Variety of Reasoning, DVR),依据群体准确率动态调控思考过程的多样性以促进探索。此设计有效平衡了对齐与探索,实现了对最终答案和推理过程的联合监督,从而显著提升模型泛化能力和视频问答性能。
链接: https://arxiv.org/abs/2603.25942
作者: Peiyao Wang,Haotian Xu,Noranart Vesdapunt,Rui Hou,Jingyi Zhang,Haibin Ling,Oleksandr Obiednikov,Ning Zhou,Kah Kuen Fu
机构: Stony Brook University (石溪大学); Amazon (亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 (Main Conference)
Abstract:Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs’ ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize - Think - Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.
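上述CVK机制的核心是最小化同组生成摘要之间的KL散度。下面是一个纯Python示意草图(非论文实现):假设每条摘要已被归纳为一个离散词元分布,用平均成对KL散度衡量组内一致性,摘要越一致该惩罚越低:

```python
import math

def kl(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cvk_penalty(summaries):
    """Average pairwise KL across a group of summary distributions.
    Lower means the sampled summaries agree more; a CVK-style term
    would reward low values (illustrative sketch, not the paper's code)."""
    n = len(summaries)
    total = sum(kl(summaries[i], summaries[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

identical = [[0.5, 0.5], [0.5, 0.5]]
divergent = [[0.9, 0.1], [0.1, 0.9]]
print(cvk_penalty(identical) < cvk_penalty(divergent))  # True
```

实际训练中,这一一致性信号还需与GRPO的组相对优势共同进入策略梯度目标,此处仅演示散度项本身。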
[CV-103] DenseSwinV2: Channel Attentive Dual Branch CNN Transformer Learning for Cassava Leaf Disease Classification
【速读】:该论文旨在解决木薯叶片疾病分类中因病斑视觉相似性高、背景复杂及遮挡等问题导致的识别精度不足问题。解决方案的关键在于提出了一种双分支混合架构——Hybrid Dense SwinV2,其核心创新包括:(1)通过DenseNet分支提取高分辨率局部特征以保留细微结构信息并促进梯度流动;(2)利用定制化的Swin Transformer V2(SwinV2)模块引入移窗自注意力机制(shifted-window self-attention),有效建模长距离上下文依赖关系;(3)在CNN与Transformer两个分支独立引入通道注意力压缩模块(attention channel-squeeze module),增强疾病相关响应并抑制冗余或背景驱动激活;最终融合两类判别性特征通道,实现局部细节与全局语义的协同强化,从而提升模型在真实田间场景下的鲁棒性和实用性。
链接: https://arxiv.org/abs/2603.25935
作者: Shah Saood(1),Saddam Hussain Khan(2) ((1) Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat 19060, Pakistan (2) Interdisciplinary Research Center for Smart Mobility and Logistics (IRC-SML), King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 31261, Saudi Arabia)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 30 Pages, 12 Figures, 3 Tables
Abstract:This work presents a new Hybrid Dense SwinV2, a two-branch framework that jointly leverages densely connected convolutional features and hierarchical customized Swin Transformer V2 (SwinV2) representations for cassava disease classification. The proposed framework captures high resolution local features through its DenseNet branch, preserving the fine structural cues and also allowing for effective gradient flow. Concurrently, the customized SwinV2 models global contextual dependencies through the idea of shifted-window self attention, which enables the capture of long range interactions critical in distinguishing between visually similar lesions. Moreover, an attention channel-squeeze module is employed for each CNN Transformer stream independently to emphasize discriminative disease related responses and suppress redundant or background driven activations. Finally, these discriminative channels are fused to achieve refined representations from the dense local and SwinV2 global correlated strengthened feature maps, respectively. The proposed Dense SwinV2 utilized a public cassava leaf disease dataset of 31000 images, comprised of five diseases, including brown streak, mosaic, green mottle, bacterial blight, and normal leaf conditions. The proposed Dense SwinV2 demonstrates a significant classification accuracy of 98.02 percent with an F1 score of 97.81 percent, outperforming well-established convolutional and transformer models. These results underline the fact that Hybrid Dense SwinV2 offers robustness and practicality in the field level diagnosis of cassava disease and real world challenges related to occlusion, noise, and complex backgrounds.
[CV-104] DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation
【速读】:该论文旨在解决流匹配视频生成器在文本条件下的物理常识违背问题,即模型虽能生成时序连贯且高保真的视频,却因重建目标未区分物理合理与不合理动力学而违反基本物理规律。其关键解决方案是提出DiReCT(Disentangled Regularization of Contrastive Trajectories),通过解耦对比信号为两个互补尺度:宏观对比项从语义差异区域采样无干扰的负样本以实现全局轨迹分离,微观对比项则构造共享完整场景语义但仅在单一物理行为维度(如运动学、受力、材料属性等)由大语言模型扰动的难负样本,从而精准引导物理一致性;同时引入速度空间分布正则化防止预训练视觉质量退化。此方法有效缓解了语义-物理纠缠导致的梯度冲突,显著提升视频物理常识得分。
链接: https://arxiv.org/abs/2603.25931
作者: Abolfazl Meyarian,Amin Karimi Monsefi,Rajiv Ramnath,Ser-Nam Lim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample’s, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior; spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.
[CV-105] Good Scores Bad Data: A Metric for Multimodal Coherence NEURIPS2024
【速读】:该论文旨在解决多模态人工智能(Multimodal AI)系统在评估时存在的核心问题:下游任务准确率高并不等同于模态间数据的一致性或融合质量良好,即模型可能在视觉问答(Visual Question Answering, VQA)等任务中表现优异,但其输入模态之间可能存在内在矛盾。为解决此问题,作者提出了一种名为多模态一致性评分(Multimodal Coherence Score, MCS)的新指标,其关键在于将一致性解构为四个独立维度——身份(identity)、空间(spatial)、语义(semantic)和决策(decision),并通过Nelder-Mead优化方法自动学习各维度的权重,从而实现对多模态融合质量的无监督、轻量级且可解释的评估。该方法无需人工标注,且能精准定位具体哪一维度出现异常,显著优于仅依赖任务准确率的评估方式。
链接: https://arxiv.org/abs/2603.25924
作者: Vasundra Srinivasan
机构: Stanford School of Engineering (斯坦福大学工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, NeurIPS 2024 format
Abstract:Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions, identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
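MCS将identity、spatial、semantic、decision四个维度得分加权合成,权重由Nelder-Mead优化得到。下面是一个纯Python示意草图(非论文实现):为避免外部依赖,此处用粗网格搜索代替Nelder-Mead,目标同样是使合成得分与质量标签的Spearman相关最大化;玩具样本与候选权重均为假设值:

```python
from itertools import product

def rank(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(a, b):
    # rank correlation (no tie handling; the toy data below has distinct values)
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def mcs(dims, w):
    # weighted combination of (identity, spatial, semantic, decision) scores
    return sum(wi * di for wi, di in zip(w, dims)) / sum(w)

# toy samples: four per-dimension scores plus a fusion-quality label
samples = [
    ([0.9, 0.8, 0.9, 0.9], 1.0),
    ([0.7, 0.9, 0.6, 0.7], 0.7),
    ([0.5, 0.2, 0.5, 0.4], 0.4),
    ([0.2, 0.6, 0.1, 0.2], 0.1),
]
labels = [q for _, q in samples]
best_w, best_rho = None, -2.0
for w in product([0.1, 0.5, 1.0], repeat=4):
    rho = spearman([mcs(d, w) for d, _ in samples], labels)
    if rho > best_rho:
        best_rho, best_w = rho, w
print(best_w, round(best_rho, 3))
```

实际实现中可将网格搜索换成单纯形法,例如 scipy.optimize.minimize(method='Nelder-Mead')。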
[CV-106] Shared Representation for 3D Pose Estimation Action Classification and Progress Prediction from Tactile Signals
【速读】:该论文旨在解决传统视觉方法在人机交互中因遮挡和隐私问题导致性能受限,以及现有触觉感知方法各自独立处理姿态估计、动作分类和动作进度预测任务时效率低下、性能欠佳的问题。其解决方案的关键在于提出一种共享卷积Transformer架构(SCOTTI),通过学习统一的特征表示,实现三个任务的联合建模:3D人体姿态估计、动作类别分类和动作完成进度预测。该方法利用多任务学习的优势,在共享表征基础上提升各任务性能,同时首次基于定制无线足底传感器采集的触觉信号探索动作进度预测问题,实验表明该方案在所有任务上均优于独立建模的方法。
链接: https://arxiv.org/abs/2603.25906
作者: Isaac Han,Seoyoung Lee,Sangyeon Park,Ecehan Akan,Yiyue Luo,Joseph DelPreto,Kyung-Joong Kim
机构: Gwangju Institute of Science and Technology (GIST); University of Washington; MIT CSAIL
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Estimating human pose, classifying actions, and predicting movement progress are essential for human-robot interaction. While vision-based methods suffer from occlusion and privacy concerns in realistic environments, tactile sensing avoids these issues. However, prior tactile-based approaches handle each task separately, leading to suboptimal performance. In this study, we propose a Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that learns a shared representation to simultaneously address three separate prediction tasks: 3D human pose estimation, action class categorization, and action completion progress estimation. To the best of our knowledge, this is the first work to explore action progress prediction using foot tactile signals from custom wireless insole sensors. This unified approach leverages the mutual benefits of multi-task learning, enabling the model to achieve improved performance across all three tasks compared to learning them independently. Experimental results demonstrate that SCOTTI outperforms existing approaches across all three tasks. Additionally, we introduce a novel dataset collected from 15 participants performing various activities and exercises, with 7 hours of total duration, across eight different activities.
[CV-107] Decoding Defensive Coverage Responsibilities in American Football Using Factorized Attention Based Transformer Models
【速读】:该论文旨在解决美式橄榄球(NFL)防守端在传球进攻中,如何精准预测个体防守球员的覆盖任务分配、接球手与防守者之间的匹配关系以及每次传球的目标防守者的问题。传统方法多聚焦于赛后对团队整体防守策略的分类,而本文提出了一种基于因子化注意力机制(factorized attention mechanism)的Transformer模型,其关键创新在于将时间维度与球员代理维度分离建模,从而独立捕捉球员运动模式与球员间交互关系。该模型基于随机截断轨迹训练,可生成逐帧预测结果,准确反映从发球前到传球完成整个过程中防守职责的变化,实现对个体层面动态匹配关系的前瞻性建模,准确率超过89%,并衍生出如“伪装率”(disguise rate)和“双重覆盖率”(double coverage rate)等新指标,为战术分析与球员评估提供量化依据。
链接: https://arxiv.org/abs/2603.25901
作者: Kevin Song,Evan Diewald,Ornob Siddiquee,Chris Boomhower,Keegan Abdoo,Mike Band,Amy Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 8 figures, ISACE 2026
Abstract:Defensive coverage schemes in the National Football League (NFL) represent complex tactical patterns requiring coordinated assignments among defenders who must react dynamically to the offense’s passing concept. This paper presents a factorized attention-based transformer model applied to NFL multi-agent play tracking data to predict individual coverage assignments, receiver-defender matchups, and the targeted defender on every pass play. Unlike previous approaches that focus on post-hoc coverage classification at the team level, our model enables predictive modeling of individual player assignments and matchup dynamics throughout the play. The factorized attention mechanism separates temporal and agent dimensions, allowing independent modeling of player movement patterns and inter-player relationships. Trained on randomly truncated trajectories, the model generates frame-by-frame predictions that capture how defensive responsibilities evolve from pre-snap through pass arrival. Our models achieve approximately 89%+ accuracy for all tasks, with true accuracy potentially higher given annotation ambiguity in the ground truth labels. These outputs also enable novel derivative metrics, including disguise rate and double coverage rate, which enable enhanced storytelling in TV broadcasts as well as provide actionable insights for team strategy development and player evaluation.
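论文中因子化注意力的要点是把时间维与球员(代理)维分开:先沿每名球员自身的轨迹做时间自注意力,再在每一帧内部做球员间自注意力。下面是一个省略可学习投影的纯Python示意草图(输入形状[T][A][D]、直接以特征向量充当query/key/value均为本例的简化假设,非论文实现):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(seq):
    """Single-head self-attention over a list of feature vectors; the
    vectors themselves serve as queries/keys/values (no learned projections)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, seq)) for j in range(d)])
    return out

def factorized_attention(x):
    """x: [T][A][D] tracking features. The temporal pass attends over the T
    frames of each agent; the agent pass attends across the A agents per frame."""
    T, A = len(x), len(x[0])
    for a in range(A):                      # temporal attention, one agent at a time
        col = attend([x[t][a] for t in range(T)])
        for t in range(T):
            x[t][a] = col[t]
    return [attend(frame) for frame in x]   # inter-agent attention, one frame at a time

tracks = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]  # T=4 frames, A=2 agents, D=2
out = factorized_attention(tracks)
print(len(out), len(out[0]), len(out[0][0]))  # 4 2 2
```

这种分解使注意力代价从对 T×A 个token的全连接降为两次较小的注意力,同时分别刻画运动模式与球员间关系。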
[CV-108] THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond
【速读】:该论文旨在解决多任务视频感知(human-centric perception)中模型碎片化的问题,即传统方法通常为不同任务(如深度估计、法向量预测、分割、密集姿态估计及关键点检测)设计独立模型,导致资源冗余且难以统一优化。其解决方案的关键在于提出THFM——一个基于预训练文本到视频扩散模型(text-to-video diffusion model)的统一视频基础模型(video foundation model),通过单次前向传播实现稠密任务(如深度、法向量、分割、密集姿态)与稀疏任务(2D/3D关键点估计)的联合推理;同时引入可学习标记(learnable tokens)增强稀疏预测能力,并利用文本提示(text prompt)进行任务调制,从而在仅使用合成数据训练的情况下,性能达到或超越多个专用模型,展现出强大的泛化能力(如从单人场景推广至多人及其他类物体)。
链接: https://arxiv.org/abs/2603.25892
作者: Letian Wang,Andrei Zanfir,Eduard Gabriel Bazavan,Misha Andriluka,Cristian Sminchisescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2d/3d keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing various perception tasks. Crucially, our model is on par with or surpasses state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data (i.e., without training on real-world or benchmark-specific data). We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our model trained on videos with a single human in the scene generalizes to multiple humans and other object classes such as anthropomorphic characters and animals – a capability that hasn’t been demonstrated in the past.
[CV-109] Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods
【速读】:该论文旨在解决预训练视觉语言模型(Vision-Language Models, VLMs)在图像检索任务中对组合式查询(compositional queries)和分布外(out-of-distribution, OOD)图像-文本对表现不佳的问题。其关键解决方案是提出了一种新的Few-Shot Text-to-Image Retrieval(FSIR)任务及配套的基准数据集FSIR-BD,该数据集首次明确针对通过参考示例进行文本引导的图像检索,尤其聚焦于组合式与OOD场景下的挑战。同时,论文设计了两种基于单样本或少量样本参考示例的检索优化方法,这些方法可兼容任意预训练图像编码器,显著提升了mAP指标,从而推动机器在有限样本下实现更接近人类水平的组合推理能力。
链接: https://arxiv.org/abs/2603.25891
作者: Ofer Idan,Vladi Vexler,Gil Lederman,Dima Sivov,Aviad Cohen Zada,Shir Niego Komforti
机构: Huawei Tel-Aviv Research Center (华为特拉维夫研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained vision-language models (VLMs) excel in multimodal tasks, commonly encoding images as embedding vectors for storage in databases and retrieval via approximate nearest neighbor search (ANNS). However, these models struggle with compositional queries and out-of-distribution (OOD) image-text pairs. Inspired by human cognition’s ability to learn from minimal examples, we address this performance gap through few-shot learning approaches specifically designed for image retrieval. We introduce the Few-Shot Text-to-Image Retrieval (FSIR) task and its accompanying benchmark dataset, FSIR-BD - the first to explicitly target image retrieval by text accompanied by reference examples, focusing on the challenging compositional and OOD queries. The compositional part is divided to urban scenes and nature species, both in specific situations or with distinctive features. FSIR-BD contains 38,353 images and 303 queries, with 82% comprising the test corpus (averaging per query 37 positives, ground truth matches, and significant number of hard negatives) and 18% forming the few-shot reference corpus (FSR) of exemplar positive and hard negative images. Additionally, we propose two novel retrieval optimization methods leveraging single shot or few shot reference examples in the FSR to improve performance. Both methods are compatible with any pre-trained image encoder, making them applicable to existing large-scale environments. Our experiments demonstrate that: (1) FSIR-BD provides a challenging benchmark for image retrieval; and (2) our optimization methods outperform existing baselines as measured by mean Average Precision (mAP). Further research into FSIR optimization methods will help narrow the gap between machine and human-level understanding, particularly for compositional reasoning from limited examples.
[CV-110] Polarization-Based Eye Tracking with Personalized Siamese Architectures
【速读】:该论文旨在解决头戴式设备中眼动追踪(eye tracking)因个体差异导致的性能下降问题,即传统方法需为每位用户单独校准以获得准确结果。其解决方案的关键在于采用基于Siamese网络架构的差分个性化(differential personalization)方法,通过学习相对眼动偏移量并从少量校准帧中重建绝对眼动位置,从而显著减少所需校准样本数。实验表明,该方法在使用偏振敏感相机和850 nm照明条件下,相比线性校准仅需1/10样本即可达到相当精度,并且相较于近红外(NIR)输入可降低最多12%的眼动误差;结合线性校准后进一步提升达13%,验证了该方案在实际应用中的有效性与准确性。
链接: https://arxiv.org/abs/2603.25889
作者: Beyza Kalkanli,Tom Bu,Mahsa Shakeri,Alexander Fix,Dave Stronks,Dmitri Model,Mantas Žurauskas
机构: Meta(Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ETRA 2026 as full paper
Abstract:Head-mounted devices integrated with eye tracking promise a solution for natural human-computer interaction. However, they typically require per-user calibration for optimal performance due to inter-person variability. A differential personalization approach using Siamese architectures learns relative gaze displacements and reconstructs absolute gaze from a small set of calibration frames. In this paper, we benchmark Siamese personalization on polarization-enabled eye tracking. For benchmarking, we use a 338-subject dataset captured with a polarization-sensitive camera and 850 nm illumination. We achieve performance comparable to linear calibration with 10-fold fewer samples. Using polarization inputs for Siamese personalization reduces gaze error by up to 12% compared to near-infrared (NIR)-based inputs. Combining Siamese personalization with linear calibration yields further improvements of up to 13% over a linearly calibrated baseline. These results establish Siamese personalization as a practical approach enabling accurate eye tracking.
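差分个性化的重建思路可概括为:Siamese网络预测查询帧与每个校准帧之间的相对注视位移,把位移加回校准帧的已知注视点后取平均即得绝对注视。下面是一个示意草图,其中 predict_displacement 是对Siamese分支输出的假设性替身(非论文实现):

```python
def reconstruct_gaze(query, calib_set, predict_displacement):
    """Recover absolute gaze for `query` from a few calibration frames with
    known gaze. `predict_displacement(a, b)` stands in for the Siamese
    network's relative output, approximately gaze(a) - gaze(b)."""
    estimates = []
    for frame, gaze in calib_set:
        dx, dy = predict_displacement(query, frame)
        estimates.append((gaze[0] + dx, gaze[1] + dy))
    n = len(estimates)
    return (sum(e[0] for e in estimates) / n, sum(e[1] for e in estimates) / n)

def oracle(a, b):
    # toy stand-in: "frames" here are gaze points themselves, so the
    # true displacement is recoverable exactly
    return (a[0] - b[0], a[1] - b[1])

calib = [((0.1, 0.2), (0.1, 0.2)), ((0.6, 0.4), (0.6, 0.4))]
print(reconstruct_gaze((0.3, 0.3), calib, oracle))  # approximately (0.3, 0.3)
```

由于每个校准帧都给出一个绝对注视估计,少量校准样本即可通过平均抑制单对预测的噪声,这正是该方法以10倍更少样本逼近线性校准的原因之一。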
[CV-111] World Reasoning Arena
【速读】:该论文旨在解决现有世界模型(World Models, WMs)评估基准过于聚焦于下一状态预测和视觉保真度,而忽视了智能行为所需更丰富的模拟能力的问题。为填补这一空白,作者提出WR-Arena,一个从三个核心维度全面评估WMs的基准:(i) 动作模拟保真度(Action Simulation Fidelity),即模型对语义明确、多步骤指令的理解与执行能力及生成多样化反事实轨迹的能力;(ii) 长时程预测(Long-horizon Forecast),即在长时间交互中保持准确、连贯且物理合理的模拟能力;(iii) 模拟推理与规划(Simulative Reasoning and Planning),即通过模拟、比较并选择不同未来路径来支持目标导向推理的能力。解决方案的关键在于构建一套任务分类体系和多样化数据集,以系统性地探测上述能力,从而超越单一回合和感知层面的评估,揭示当前模型与人类水平假设推理之间的显著差距,并为下一代具备稳健理解、预测与目的性行动能力的世界模型提供诊断工具和研发指引。
链接: https://arxiv.org/abs/2603.25887
作者: PAN Team Institute of Foundation Models:Qiyue Gao,Kun Zhou,Jiannan Xiang,Zihan Liu,Dequan Yang,Junrong Chen,Arif Ahmad,Cong Zeng,Ganesh Bannur,Xinqi Huang,Zheqi Liu,Yi Gu,Yichi Yang,Guangyi Liu,Zhiting Hu,Zhengzhong Liu,Eric Xing
机构: PAN Team, Institute of Foundation Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at this https URL.
[CV-112] Automated Quality Assessment of Blind Sweep Obstetric Ultrasound for Improved Diagnosis
【速读】:该论文旨在解决盲扫产科超声(Blind Sweep Obstetric Ultrasound, BSOU)在低资源环境中应用时,因操作者训练不足导致采集质量波动对下游人工智能(Artificial Intelligence, AI)模型性能产生不可靠影响的问题。解决方案的关键在于通过系统性模拟常见采集偏差(如扫查方向反转、探头翻转和不完整扫查),量化AI模型的鲁棒性,并开发自动化质量评估模型以识别这些扰动;进一步构建反馈机制,对异常扫查进行重新采集,从而显著提升胎儿位置分类、胎盘定位等关键任务的准确性,证明了自动化质量控制在构建可靠、可扩展的AI辅助产前超声流程中的核心作用。
链接: https://arxiv.org/abs/2603.25886
作者: Prasiddha Bhandari,Kanchan Poudel,Nishant Luitel,Bishram Acharya,Angelina Ghimire,Tyler Wellman,Kilian Koepsell,Pradeep Raj Regmi,Bishesh Khanal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Blind Sweep Obstetric Ultrasound (BSOU) enables scalable fetal imaging in low-resource settings by allowing minimally trained operators to acquire standardized sweep videos for automated Artificial Intelligence (AI) interpretation. However, the reliability of such AI systems depends critically on the quality of the acquired sweeps, and little is known about how deviations from the intended protocol affect downstream predictions. In this work, we present a systematic evaluation of BSOU quality and its impact on three key AI tasks: sweep-tag classification, fetal presentation classification, and placenta-location classification. We simulate plausible acquisition deviations, including reversed sweep direction, probe inversion, and incomplete sweeps, to quantify model robustness, and we develop automated quality-assessment models capable of detecting these perturbations. To approximate real-world deployment, we simulate a feedback loop in which flagged sweeps are re-acquired, showing that such correction improves downstream task performance. Our findings highlight the sensitivity of BSOU-based AI models to acquisition variability and demonstrate that automated quality assessment can play a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.
[CV-113] Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations
【速读】:该论文试图解决白板风格教育视频中手绘图形与语音叙述之间的多模态同步问题(multimodal synchronization problem),即如何实现自由绘制的图形元素与对应讲解内容在时间上的精确对齐。其解决方案的关键在于构建首个包含24个配对Excalidraw演示与带注释音频的数据集,其中每个绘图元素均带有毫秒级精度的时间戳,并覆盖8个STEM领域;在此基础上,利用LoRA微调的视觉语言模型(Qwen2-VL-7B)从仅24个示例中学习生成与语音同步的完整笔画序列,实验证明时间戳条件输入显著提升时序对齐效果,且模型具备跨未见STEM主题的泛化能力。
链接: https://arxiv.org/abs/2603.25870
作者: Suraj Prasad,Pinak Mahapatra
机构: Latent Spaces IITB (Latent Spaces IITB)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.
[CV-114] Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception
【速读】:该论文旨在解决腹腔镜手术中由电外科和血管闭合器械产生的烟雾对视觉感知的严重干扰问题,这种烟雾会显著降低内窥镜图像质量并影响基于视觉的功能(如深度估计和器械分割)。解决方案的关键在于提出一种基于Transformer架构的去烟模型,其核心创新是引入了一个物理启发的去烟头(physics-inspired desmoking head),能够联合预测无烟图像与对应的烟雾分布图;同时,为缓解真实配对数据稀缺的问题,作者构建了合成数据生成流程,将人工烟雾图案与真实内窥镜图像融合,生成超过80,000对训练样本,并进一步收集了目前最大规模的配对手术烟雾数据集(5,817对图像),从而实现了在高分辨率内窥镜图像上的先进去烟性能及下游任务效果验证。
链接: https://arxiv.org/abs/2603.25867
作者: Jingpei Lu,Fengyi Jiang,Xiaorui Zhang,Lingbo Jin,Omid Mohareri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures, 3 tables
Abstract:Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts smoke-free image and corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.
[CV-115] Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation
【速读】:该论文旨在解决动态手部手势识别在家庭自动化系统中应用时的实时性和准确性问题,尤其针对巴西手语(LIBRAS)手势的识别任务。其解决方案的关键在于构建一个由MediaPipe Hand Landmarker与卷积神经网络(CNN)组成的两阶段模型:首先利用MediaPipe提取手部21个骨骼关键点,随后将这些关键点按时间序列构成90×21的时空矩阵输入CNN进行分类。为实现无循环结构的连续识别,采用滑动窗口结合帧复制策略以增强时序信息,从而在不依赖递归网络的前提下实现了高精度识别,在低光和正常光照条件下分别达到95%和92%的准确率。
链接: https://arxiv.org/abs/2603.25863
作者: Jasmine Moreira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 10 figures, 1 table
Abstract:This paper proposes a method for dynamic hand gesture recognition based on the composition of two models: the MediaPipe Hand Landmarker, responsible for extracting 21 skeletal keypoints of the hand, and a convolutional neural network (CNN) trained to classify gestures from a spatiotemporal matrix representation of dimensions 90 by 21 of those keypoints. The method is applied to the recognition of LIBRAS (Brazilian Sign Language) gestures for device control in a home automation system, covering 11 classes of static and dynamic gestures. For real-time inference, a sliding window with temporal frame triplication is used, enabling continuous recognition without recurrent networks. Tests achieved 95% accuracy under low-light conditions and 92% under normal lighting. The results indicate that the approach is effective, although systematic experiments with greater user diversity are needed for a more thorough evaluation of generalization.
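上述流程可以概括为:把每帧21个关键点特征按时间堆叠成90×21矩阵,推理时用滑动窗口配合帧三倍复制持续产生输入。下面是一个纯Python示意草图(窗口长30、步长10为假设值,论文未给出具体数值;帧特征此处简化为每帧21个标量):

```python
def sliding_windows(stream, window=30, stride=10, repeat=3):
    """Yield 90x21 spatiotemporal matrices from a keypoint stream:
    take a 30-frame window and triplicate each frame in time
    (window/stride values are assumptions for illustration)."""
    for start in range(0, len(stream) - window + 1, stride):
        chunk = stream[start:start + window]
        matrix = [frame for frame in chunk for _ in range(repeat)]
        assert len(matrix) == window * repeat and all(len(f) == 21 for f in matrix)
        yield matrix

# fake stream: 60 frames, each with 21 keypoint values (e.g. one coordinate each)
stream = [[float(t)] * 21 for t in range(60)]
mats = list(sliding_windows(stream))
print(len(mats), len(mats[0]), len(mats[0][0]))  # 4 90 21
```

每个窗口产出的矩阵即可直接送入按90×21输入训练的CNN分类器,从而在不使用循环网络的情况下实现连续识别。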
[CV-116] GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解任务中难以有效利用眼动信息(eye-gaze information)的问题,即使眼动线索以视觉叠加或文本描述形式提供。解决方案的关键在于提出 GazeQwen,一种参数高效的眼动感知方法,通过隐藏状态调制(hidden-state modulation)实现对 LLM 解码器层的精准干预:其核心是一个轻量级的眼动重采样模块(gaze resampler,仅含 1–5M 可训练参数),该模块将 V-JEPA 2.1 视频特征与基于注视点的位置编码融合,并生成加性残差,通过前向钩子(forward hooks)注入到选定的 LLM 解码层;此外,可选的第二阶段微调引入低秩适配器(LoRA)以增强集成紧密度。实验表明,GazeQwen 在 StreamGaze 基准全部 10 个任务上达到 63.9% 准确率,显著优于相同骨干模型(+16.1 点)和 GPT-4o(+10.5 点),验证了“学习在何处注入眼动信息”比单纯扩大模型规模或优化提示工程更为有效。
链接: https://arxiv.org/abs/2603.25841
作者: Trong Thang Pham,Hien Nguyen,Ngan Le
机构: University of Arkansas (阿肯色大学); University of Houston (休斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5 M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the highest score among all open-source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at this https URL .
[CV-117] Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents
【速读】:该论文旨在解决从无结构图像集合中高效、准确地回归稠密符号距离场(Signed Distance Field, SDF)的问题,传统方法通常依赖相机标定或后处理融合,且在多视角特征聚合过程中易丢失完整性信息并累积误差。其解决方案的关键在于利用预训练的多视角前向几何变换器(multi-view feed-forward geometry transformers)中间特征空间所蕴含的强大联合世界表示,不再通过逐视图预测头生成3D几何再后处理拼接,而是直接从几何变换器特征中学习体积提取:通过交织的交叉注意力与自注意力机制,将体素化的规范嵌入逐步吸收多视角几何信息,构建结构化的体积潜在网格;随后使用简单的卷积解码器将其映射为稠密SDF,从而实现无需相机标定、三秒内完成高质量SDF重建,并支持稀疏与密集视图场景下的完整几何补全。
链接: https://arxiv.org/abs/2603.25827
作者: Laura Fink,Linus Franke,George Kopanas,Marc Stamminger,Peter Hedman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at this https URL.
[CV-118] ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
【速读】:该论文旨在解决当前生成式视觉模型(Generative Vision Models)在复杂逻辑推理能力上的显著缺失问题,即尽管这些模型在视觉保真度上表现优异,但在物理、因果或空间推理等任务中存在“逻辑荒漠”现象。现有评估方法多依赖表面指标或碎片化基准,形成“性能幻象”,无法真实反映模型的生成过程与推理能力。解决方案的核心是提出ViGoR(Vision-Generative Reasoning-centric Benchmark),其关键创新在于:1)跨模态全覆盖,融合图像到图像与视频任务;2)双轨机制,同时评估中间推理过程与最终结果;3)基于证据的自动化评判器,确保高人类一致性;4)细粒度诊断分析,将性能分解为认知维度。实验表明,即使是最先进的模型也存在明显推理缺陷,验证了ViGoR作为下一代智能视觉模型“压力测试”的必要性。
链接: https://arxiv.org/abs/2603.25823
作者: Haonan Han,Jiancheng Huang,Xiaopeng Sun,Junyan He,Rui Yang,Jie Hu,Xiaojiang Peng,Lin Ma,Xiaoming Wei,Xiu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Beneath the stunning visual fidelity of modern AIGC models lies a “logical desert”, where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a “performance mirage” that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical “stress test” for the next generation of intelligent vision models. The demo is available at this https URL
[CV-119] Geo^2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
【速读】:该论文旨在解决跨视角地理空间学习中的两大核心任务——跨视角地理定位(Cross-View Geo-Localization, CVGL)与跨视角图像合成(Cross-View Image Synthesis, CVIS),其关键挑战在于地面与航空视角之间巨大的视点差异导致的几何不一致性。解决方案的关键在于提出Geo²统一框架,通过引入几何基础模型(Geometric Foundation Models, GFMs)提取的3D几何先验信息,构建一个共享的3D感知潜在空间(GeoMap),有效降低跨视角特征差异以提升定位精度,并自然地支持双向图像合成;进一步设计基于流匹配(flow-matching)的GeoFlow模型,结合一致性损失约束双向合成间的潜在空间对齐,从而实现高保真且一致的跨视角图像生成与精准定位。
链接: https://arxiv.org/abs/2603.25819
作者: Yancheng Zhang,Xiaohan Zhang,Guangyu Sun,Zonglin Lyu,Safwan Wshah,Chen Chen
机构: Institute of Artificial Intelligence, University of Central Florida (人工智能研究所,中佛罗里达大学); University of Vermont (佛蒙特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.
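GeoFlow 所用的流匹配(flow matching)目标可简化为:在两端点的线性插值路径上回归恒定速度场 v = x₁ − x₀。下面是纯 NumPy 的损失构造示意(网络用随机线性映射代替,末尾的“一致性项”是对论文双向对齐思想的粗略假设,并非官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)

# 流匹配损失:在随机时刻 t 的插值点 xt 处,让模型预测线性路径的真速度 x1 - x0
def flow_matching_loss(model, x0, x1):
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1 - t) * x0 + t * x1          # 插值样本
    target = x1 - x0                     # 线性路径的恒定速度场
    pred = model(xt, t)
    return np.mean((pred - target) ** 2)

d = 8
W = rng.normal(size=(d + 1, d)) * 0.1
model = lambda xt, t: np.concatenate([xt, t], axis=1) @ W  # 占位“网络”

x_ground = rng.normal(size=(16, d))    # 地面视角潜在编码(随机示例)
x_aerial = rng.normal(size=(16, d))    # 航空视角潜在编码(随机示例)

# 双向合成各计算一次流匹配损失,再加一个简化的潜在空间对齐项
loss_g2a = flow_matching_loss(model, x_ground, x_aerial)
loss_a2g = flow_matching_loss(model, x_aerial, x_ground)
consistency = np.mean((x_ground.mean(0) - x_aerial.mean(0)) ** 2)
total = loss_g2a + loss_a2g + 0.1 * consistency
print(round(float(total), 3))
```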
[CV-120] Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在训练过程中注意力图中出现伪影(artifacts)的问题,这些问题严重影响了模型的可解释性。其关键解决方案是引入一种称为“寄存器”(registers)的新机制,即在输入序列中添加空的额外标记(input tokens),用于存储超出 [CLS] 标记范围的全局信息,从而有效消除注意力图中的伪影并提升其清晰度。
链接: https://arxiv.org/abs/2603.25803
作者: Spiros Baxevanakis,Platon Karageorgis,Ioannis Dravilas,Konrad Szewczyk
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Submitted to Transactions on Machine Learning Research (TMLR). 26 pages, 17 figures
Abstract:Training Vision Transformers (ViTs) presents significant challenges, one of which is the emergence of artifacts in attention maps, hindering their interpretability. Darcet et al. (2024) investigated this phenomenon and attributed it to the need of ViTs to store global information beyond the [CLS] token. They proposed a novel solution involving the addition of empty input tokens, named registers, which successfully eliminate artifacts and improve the clarity of attention maps. In this work, we reproduce the findings of Darcet et al. (2024) and evaluate the generalizability of their claims across multiple models, including DINO, DINOv2, OpenCLIP, and DeiT3. While we confirm the validity of several of their key claims, our results reveal that some claims do not extend universally to other models. Additionally, we explore the impact of model size, extending their findings to smaller models. Finally, we untie terminology inconsistencies found in the original paper and explain their impact when generalizing to a wider range of models.
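“寄存器”机制本身非常简单:在 patch 序列前额外拼接若干空 token,前向传播时让它们吸收全局信息,输出时丢弃。下面用 NumPy 演示这一序列拼接与切片逻辑(“Transformer”用恒等映射占位,维度为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

n_patches, n_registers, d = 196, 4, 64
patches = rng.normal(size=(n_patches, d))
cls_token = rng.normal(size=(1, d))
registers = np.zeros((n_registers, d))         # 空的额外输入 token

seq = np.concatenate([cls_token, registers, patches], axis=0)  # (1+4+196, d)
out = seq                                       # 占位:真实模型这里是若干层自注意力
cls_out = out[0]                                # [CLS] 用于分类
patch_out = out[1 + n_registers:]               # 丢弃寄存器,保留 patch 特征
print(patch_out.shape)  # (196, 64)
```

论文的主张是:这些寄存器为 ViT 提供了 [CLS] 之外的全局信息存储位置,从而消除注意力图伪影;本文(FA 综述对象)则检验了该主张在不同骨干上的普适性。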
[CV-121] LEMON: a foundation model for nuclear morphology in Computational Pathology
【速读】:该论文旨在解决单细胞水平上图像表征学习在计算病理学中相对匮乏的问题,而这一问题制约了对细胞类型和细胞表型的精确刻画。其解决方案的关键在于提出了一种名为LEMON(Learning Embeddings from Morphology Of Nuclei)的自监督基础模型,该模型通过在来自多种组织和癌症类型的数百万张细胞图像上进行训练,学习到鲁棒且通用的形态学表征,从而支持大规模单细胞病理分析,展现出在多个基准数据集上的优异性能,为细胞层面的计算病理学提供了新的范式。
链接: https://arxiv.org/abs/2603.25802
作者: Loïc Chadoutaud(1, 2, 3),Alice Blondel(1, 2, 3),Hana Feki(1, 2, 3),Jacqueline Fontugne(4, 5),Emmanuel Barillot(1, 2, 3),Thomas Walter(1, 2, 3) ((1) Institut Curie, Paris, France, (2) Mines Paris PSL, Centre for Computational Biology (CBIO), Paris, France, (3) INSERM U1331, Paris, France, (4) Institut Curie, U1353/UMR9029 IRIS, Equipe IMPACT, Paris, France, (5) Department of Pathology, Université Paris-Saclay, UVSQ, Institut Curie, Saint-Cloud, France)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Computational pathology relies on effective representation learning to support cancer research and precision medicine. Although self-supervised learning has driven major progress at the patch and whole-slide image levels, representation learning at the single-cell level remains comparatively underexplored, despite its importance for characterizing cell types and cellular phenotypes. We introduce LEMON (Learning Embeddings from Morphology Of Nuclei), a self-supervised foundation model for scalable single-cell image representation learning. Trained on millions of cell images from diverse tissues and cancer types, LEMON learns robust and versatile morphological representations that support large-scale single-cell analyses in pathology. We evaluate LEMON on five benchmark datasets across a range of prediction tasks and show that it provides strong performance, highlighting its potential as a new paradigm for cell-level computational pathology. Model weights are available at this https URL.
[CV-122] End-to-end Feature Alignment: A Simple CNN with Intrinsic Class Attribution
【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Network, CNN)中特征表示缺乏可解释性的问题,特别是原始特征图难以与类别标签建立清晰对应关系的困境。传统CNN中的无序操作(如全连接层和卷积层)会导致语义概念的混杂与信息丢失,从而削弱模型的可解释性。解决方案的关键在于提出特征对齐卷积神经网络(Feature-Align CNN, FA-CNN),其核心创新是引入两种保持顺序的结构:阻尼跳跃连接(dampened skip connection)和全局平均池化分类头(global average pooling classifier head)。这两个组件强制模型从输入像素到最终类别logits全程维持特征对齐,使得原始特征图天然具备类别归属属性(class attribution)。理论证明表明,FA-CNN的倒数第二层特征图等价于Grad-CAM显著性图,并且这些特征在深层网络中逐层平滑演化,揭示了特征随网络深度的渐进式变化过程,从而显著提升了模型的可解释性和可视化能力。
链接: https://arxiv.org/abs/2603.25798
作者: Parniyan Farvardin,David Chapman
机构: University of Miami (迈阿密大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Feature-Align CNN (FA-CNN), a prototype CNN architecture with intrinsic class attribution through end-to-end feature alignment. Our intuition is that the use of unordered operations such as Linear and Conv2D layers cause unnecessary shuffling and mixing of semantic concepts, thereby making raw feature maps difficult to understand. We introduce two new order preserving layers, the dampened skip connection, and the global average pooling classifier head. These layers force the model to maintain an end-to-end feature alignment from the raw input pixels all the way to final class logits. This end-to-end alignment enhances the interpretability of the model by allowing the raw feature maps to intrinsically exhibit class attribution. We prove theoretically that FA-CNN penultimate feature maps are identical to Grad-CAM saliency maps. Moreover, we prove that these feature maps slowly morph layer-by-layer over network depth, showing the evolution of features through network depth toward penultimate class activations. FA-CNN performs well on benchmark image classification datasets. Moreover, we compare the averaged FA-CNN raw feature maps against Grad-CAM and permutation methods in a percent pixels removed interpretability task. We conclude this work with a discussion of future directions, including limitations and extensions toward hybrid models.
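全局平均池化分类头的思想可以用几行代码说明:倒数第二层每个通道对应一个类别,logit 就是该类特征图的空间均值,因此特征图本身即类别归属图(与 CAM/Grad-CAM 同源)。以下为 NumPy 示意(随机特征图,维度为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, H, W = 10, 7, 7
feature_maps = rng.normal(size=(n_classes, H, W))  # 每个类别一张倒数第二层特征图
logits = feature_maps.mean(axis=(1, 2))            # 全局平均池化 -> 类别 logits
pred = int(np.argmax(logits))
attribution = feature_maps[pred]                   # 预测类的原始特征图即显著性图
print(logits.shape, attribution.shape)
```

这正是“原始特征图天然具备类别归属属性”的含义:无需额外的梯度回传或可视化后处理。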
[CV-123] ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions CVPR2026
【速读】:该论文旨在解决从单目RGB视频中重建人类与可动物体之间4D交互(4D human-articulated-object interactions)的难题,这一问题在现有方法中尚未得到充分探索。当前主流的手-物体交互(HOI)方法主要局限于刚性物体,而针对可动物体的4D重建通常依赖于预扫描或多视角视频,限制了其通用性和实用性。为应对这一高度病态(ill-posed)问题,作者提出ArtHOI框架,其核心创新在于通过优化策略整合并修正来自多个基础模型(foundation models)的先验信息,从而提升重建精度与物理合理性。关键解决方案包括:1)自适应采样精化(Adaptive Sampling Refinement, ASR)方法,用于优化物体的度量尺度和位姿,实现归一化网格在世界空间中的准确定位;2)基于多模态大语言模型(Multimodal Large Language Model, MLLM)引导的手-物体对齐方法,利用接触推理信息作为约束条件,优化手与物体网格的组合结构,确保几何一致性与物理合理性。
链接: https://arxiv.org/abs/2603.25791
作者: Zikai Wang,Zhilu Zhang,Yiqing Wang,Hui Li,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object’s metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints of hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of our ArtHOI across diverse objects and interactions. Project: this https URL.
[CV-124] Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis CVPR2026
【速读】:该论文旨在解决内窥镜视频分析中因高质量标注数据稀缺而导致的表示学习难题,尤其针对现有自监督视频预训练方法在自然视频场景下过度依赖密集时空建模并存在运动偏差的问题,忽视了临床决策中至关重要的静态结构语义。解决方案的关键在于提出一种受认知启发的分层表征学习框架——聚焦感知表征学习(Focus-to-Perceive Representation Learning, FPRL),其核心机制是显式区分并协同学习静态语义与上下文语义:首先通过教师先验自适应掩码(TPAM)结合多视角稀疏采样捕捉病变中心区域的静态语义,减少冗余时序依赖;随后利用跨视图掩码特征补全(CVMFC)和注意力引导时序预测(AGTP)建模帧间结构演化,强化时序语义连续性同时保持全局上下文完整性。
链接: https://arxiv.org/abs/2603.25778
作者: Yuan Zhang,Sihao Dou,Kai Hu,Shuhua Deng,Chunhong Cao,Fen Xiao,Xieping Gao
机构: Xiangtan University (湘潭大学); Hunan Normal University (湖南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at this https URL.
[CV-125] Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification
【速读】:该论文旨在解决生成式 AI (Generative AI) 在材料表面粗糙度分类任务中因依赖大量标注数据和高成本成像设备而导致的部署瓶颈问题。其解决方案的关键在于利用 Stable Diffusion XL 生成的合成图像作为实验采集数据的有效补充或替代,实验证明该方法可在不显著降低分类准确率的前提下显著减少对真实数据的依赖,从而提升材料图像分类流程的数据效率与可靠性,降低实验成本并加速模型开发进程。
链接: https://arxiv.org/abs/2603.25765
作者: Binwei Chen,Huachao Leng,Chi Yeung Mang,Tsz Wai Cheung,Yanhua Chen,Wai Keung Anthony Loh,Chi Ho Wong,Chak Yin Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
备注:
Abstract:Hard coatings play a critical role in industry, with ceramic materials offering outstanding hardness and thermal stability for applications that demand superior mechanical performance. However, deploying artificial intelligence (AI) for surface roughness classification is often constrained by the need for large labeled datasets and costly high-resolution imaging equipment. In this study, we explore the use of synthetic images, generated with Stable Diffusion XL, as an efficient alternative or supplement to experimentally acquired data for classifying ceramic surface roughness. We show that augmenting authentic datasets with generative images yields test accuracies comparable to those obtained using exclusively experimental images, demonstrating that synthetic images effectively reproduce the structural features necessary for classification. We further assess method robustness by systematically varying key training hyperparameters (epoch count, batch size, and learning rate), and identify configurations that preserve performance while reducing data requirements. Our results indicate that generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering a practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.
[CV-126] A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents
【速读】:该论文旨在解决当前光学字符识别(Optical Character Recognition, OCR)与文档理解系统在评估过程中对历史档案和边缘化文献(如非裔美国人历史报纸)的忽视问题。研究表明,现有模型评估主要基于现代、西方及机构化文档,导致其在处理具有复杂排版、字体差异和材料退化的历史文献时表现不佳,且缺乏对结构失效(如列坍塌、排版错误和幻觉文本)的有效检测。解决方案的关键在于揭示评价体系中的结构性盲区——即训练数据与基准测试集的代表性不足,以及由组织层级(meso-level)和制度层级(macro-level)行为所塑造的数据治理决策与基准激励机制。论文呼吁建立更具包容性的评估框架,以减少因数据偏见引发的结构性隐形与表征性伤害。
链接: https://arxiv.org/abs/2603.25761
作者: Fitsum Sileshi Beyene,Christopher L. Dancy
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: This manuscript is the author’s submitted version to the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2026). Please cite the final published version via ACM Digital Library when available
Abstract:Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. We review OCR and document understanding papers, as well as benchmark datasets, which are published between 2006 and 2025 using the PRISMA framework. We look into how the studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. During the review, we found that Black newspapers and other community-produced historical documents rarely appear in reported training data or evaluation benchmarks. Most evaluations emphasize character accuracy and task success on modern layouts. They rarely capture structural failures common in historical newspapers, including column collapse, typographic errors, and hallucinated text. To put these findings into perspective, we use previous empirical studies and archival statistics from significant Black press collections to show how evaluation gaps lead to structural invisibility and representational harm. We propose that these gaps occur due to organizational (meso) and institutional (macro) behaviors and structure, shaped by benchmark incentives and data governance decisions.
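文中指出现有 OCR 评测偏重字符准确率等表面指标。作为参照,字符错误率(CER)的标准定义是编辑距离除以参考文本长度,计算如下(纯 Python 通用实现,与所综述的任一具体系统无关;这类指标恰恰无法捕捉栏位坍塌、幻觉文本等结构性失效):

```python
# 空间优化的 Levenshtein 编辑距离(插入/删除/替换各计 1)
def edit_distance(ref, hyp):
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # 删除
                        dp[j - 1] + 1,                  # 插入
                        prev + (ref[i - 1] != hyp[j - 1]))  # 替换
            prev = cur
    return dp[n]

def cer(ref, hyp):
    # 字符错误率 = 编辑距离 / 参考文本长度
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("kitten", "sitting"))  # 3 次编辑 / 6 个参考字符 = 0.5
```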
[CV-127] A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning
【速读】:该论文旨在解决扩散模型(Diffusion Models)在判别式表征学习中训练效率低和表征能力不足的问题,尤其针对DiT(Diffusion Transformer)架构存在的时间步(timestep)搜索不充分与特征利用不充分的瓶颈。其解决方案的关键在于提出自动选择时间步(A-SelecT),该方法能够在单次运行中动态识别DiT最具信息量的时间步,从而避免耗时的穷举式时间步搜索以及次优的特征选择策略,显著提升DiT在下游分类与分割任务中的性能与效率。
链接: https://arxiv.org/abs/2603.25758
作者: Changyu Liu,James Chenhao Liang,Wenhao Yang,Yiming Cui,Jinghao Yang,Tianyang Wang,Qifan Wang,Dongfang Liu,Cheng Han
机构: University of Missouri–Kansas City(密苏里大学堪萨斯城分校); U. S. Naval Research Laboratory(美国海军研究实验室); Lamar University(拉马尔大学); University of Florida(佛罗里达大学); University of Texas Rio Grande Valley(得克萨斯州里奥格兰德谷大学); University of Alabama at Birmingham(阿拉巴马大学伯明翰分校); Meta AI(Meta AI); Rochester Institute of Technology(罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Diffusion models have significantly reshaped the field of generative artificial intelligence and are now increasingly explored for their capacity in discriminative representation learning. Diffusion Transformer (DiT) has recently gained attention as a promising alternative to conventional U-Net-based diffusion models, demonstrating a promising avenue for downstream discriminative tasks via generative pre-training. However, its current training efficiency and representational capacity remain largely constrained due to the inadequate timestep searching and insufficient exploitation of DiT-specific feature representations. In light of this view, we introduce Automatically Selected Timestep (A-SelecT) that dynamically pinpoints DiT’s most information-rich timestep from the selected transformer feature in a single run, eliminating the need for both computationally intensive exhaustive timestep searching and suboptimal discriminative feature selection. Extensive experiments on classification and segmentation benchmarks demonstrate that DiT, empowered by A-SelecT, surpasses all prior diffusion-based attempts efficiently and effectively.
[CV-128] Adapting Frozen Mono-modal Backbones for Multi-modal Registration via Contrast-Agnostic Instance Optimization MICCAI
【速读】:该论文旨在解决多模态医学图像配准(multi-modal image registration)中深度学习模型在测试阶段分布偏移(distribution shift)下泛化能力不足的问题。现有方法如全网络微调虽能提升性能,但在3D场景下计算成本高昂且易因极端域偏移导致性能退化。其解决方案的关键在于:采用一个冻结的预训练单模态配准模型,并引入轻量级适配管道(lightweight adaptation pipeline),通过基于对比无关表示(contrast-agnostic representation)的风格迁移与精炼模块,在测试时进行实例优化(instance optimization),从而有效弥合模态间和域间的差异。该设计不依赖特定骨干网络结构,避免了全参数微调的开销,同时具备适应未见域的能力,显著提升了多模态配准的鲁棒性与实用性。
链接: https://arxiv.org/abs/2603.26393
作者: Yi Zhang,Yidong Zhao,Qian Tao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI Learn2Reg Challenge
Abstract:Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Meanwhile, naive fine-tuning also risks performance degradation in the presence of drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained mono-modal registration model with a lightweight adaptation pipeline for multi-modal image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model, thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and achieves fourth place overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.
[CV-129] FINDER: Zero-Shot Field-Integrated Network for Distortion-free EPI Reconstruction in Diffusion MRI
【速读】:该论文旨在解决扩散磁共振成像(diffusion MRI)中基于回波平面成像(EPI)序列的严重几何失真问题,该失真源于快速采样方案对B₀场不均匀性的高度敏感。现有深度学习方法虽能提升重建质量,但缺乏对几何失真进行鲁棒校正的自监督框架。解决方案的关键在于提出FINDER(Field-Integrated Network for Distortion-free EPI Reconstruction),其核心创新是将图像重建与B₀场图联合优化,采用物理引导的展开网络(physics-guided unrolled network)融合双域去噪器和虚拟线圈扩展以保证数据一致性,并引入条件于空间坐标和潜在图像特征的隐式神经表示(INR)来建模离共振场为连续可微函数,通过交替最小化策略协同更新重建网络与场图,从而有效分离由磁化率(susceptibility)引起的几何失真与解剖结构。
链接: https://arxiv.org/abs/2603.26117
作者: Namgyu Han,Seong Dae Yun,Chaeeun Lim,Sunghyun Seok,Sunju Kim,Yoonhwan Kim,Yohan Jun,Tae Hyung Kim,Berkin Bilgic,Jaejin Cho
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
Abstract:Echo-planar imaging (EPI) remains the cornerstone of diffusion MRI, but it is prone to severe geometric distortions due to its rapid sampling scheme that renders the sequence highly sensitive to B_0 field inhomogeneities. While deep learning has helped improve MRI reconstruction, integrating robust geometric distortion correction into a self-supervised framework remains an unmet need. To address this, we present FINDER (Field-Integrated Network for Distortion-free EPI Reconstruction), a novel zero-shot, scan-specific framework that reformulates reconstruction as a joint optimization of the underlying image and the B_0 field map. Specifically, we employ a physics-guided unrolled network that integrates dual-domain denoisers and virtual coil extensions to enforce robust data consistency. This is coupled with an Implicit Neural Representation (INR) conditioned on spatial coordinates and latent image features to model the off-resonance field as a continuous, differentiable function. Employing an alternating minimization strategy, FINDER synergistically updates the reconstruction network and the field map, effectively disentangling susceptibility-induced geometric distortions from anatomical structures. Experimental results demonstrate that FINDER achieves superior geometric fidelity and image quality compared to state-of-the-art baselines, offering a robust solution for high-quality diffusion imaging.
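“用坐标条件的 INR 把离共振场建为连续可微函数”这一步,可以用“正弦位置编码 + 小型 MLP”来示意。以下为随机权重的 NumPy 前向演示(仅展示结构;FINDER 中还会拼接潜在图像特征,权重通过交替最小化学习,此处维度与频率数均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

# 正弦位置编码:把低维坐标映射到高维,便于 MLP 拟合高频场变化
def positional_encoding(coords, n_freqs=4):
    out = [coords]
    for k in range(n_freqs):
        out += [np.sin(2**k * np.pi * coords), np.cos(2**k * np.pi * coords)]
    return np.concatenate(out, axis=-1)

coords = rng.uniform(-1, 1, size=(1000, 3))      # 归一化体素坐标 (x, y, z)
x = positional_encoding(coords)                   # (1000, 3 + 3*2*4) = (1000, 27)
W1 = rng.normal(size=(x.shape[1], 64)) * 0.1
W2 = rng.normal(size=(64, 1)) * 0.1
field = np.tanh(x @ W1) @ W2                      # 每个坐标输出一个标量场值
print(field.shape)  # (1000, 1)
```

由于场是坐标的连续函数,对任意分辨率查询和求导都很自然,这也是 INR 相比离散场图的主要优势。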
[CV-130] Cone-Beam CT Image Quality Enhancement Using A Latent Diffusion Model Trained with Simulated CBCT Artifacts
【速读】:该论文旨在解决锥形束计算机断层成像(Cone-beam computed tomography, CBCT)图像因对比度低和伪影多而导致的临床应用受限问题,尤其关注在器官形变区域中传统图像增强方法可能引入不必要结构变化的缺陷。解决方案的关键在于提出一种基于条件潜在扩散模型(conditional latent diffusion model)的无过校正CBCT图像质量增强方法,利用从CT图像通过简单模拟伪影生成的空间一致伪CBCT图像进行自监督学习,从而在提升图像质量的同时保持解剖结构不变;此外,将扩散模型框架扩展至潜在空间以提高处理效率,并在有限训练条件下仍实现比传统条件扩散模型更快的处理速度和更优的图像增强性能。
链接: https://arxiv.org/abs/2603.26014
作者: Naruki Murahashi,Mitsuhiro Nakamura,Megumi Nakao
机构: Kyoto University (京都大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cone-beam computed tomography (CBCT) images are problematic in clinical medicine because of their low contrast and high artifact content compared with conventional CT images. Although there are some studies to improve image quality, in regions subject to organ deformation, the anatomical structure may change after such image quality improvement. In this study, we propose an overcorrection-free CBCT image quality enhancement method based on a conditional latent diffusion model using pseudo-CBCT images. Pseudo-CBCT images are created from CT images using a simple method that simulates CBCT artifacts and are spatially consistent with the CT images. By performing self-supervised learning with these spatially consistent paired images, we can improve image quality while maintaining anatomical structures. Furthermore, extending the framework of the conditional diffusion model to latent space improves the efficiency of image processing. Our model was trained on pelvic CT-pseudo-CBCT paired data and was applied to both pseudo-CBCT and real CBCT data. The experimental results using data of 75 cases show that with our proposed method, the structural changes were less than 1/1000th (in terms of the number of pixels) of those of a conventional method involving learning with real images, and the correlation coefficient between the CT value distributions of the generated and reference images was 0.916, approaching the same level as conventional methods. We also confirmed that the proposed framework achieves faster processing and superior improvement performance compared with the framework of a conditional diffusion model, even under constrained training settings.
[CV-131] Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimers Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort
【速读】:该论文旨在解决如何更准确预测轻度认知障碍(MCI)患者向阿尔茨海默病(AD)转化的问题,这一预测对于早期干预和临床试验入组至关重要。解决方案的关键在于利用结构磁共振成像(sMRI)中灰质-白质边界锐度系数(Boundary Sharpness Coefficient, BSC)的时间变化特征,通过计算BSC在皮层灰质-白质界面的逐年退化速率,构建时序斜率特征,并将其输入随机生存森林(Random Survival Forest)模型进行风险建模。相较于仅依赖基线扫描或传统深度学习方法,该策略显著提升了预测性能(测试C-index达0.63,较基准参数模型提升163%),且成本远低于正电子发射断层成像(PET)或脑脊液检测,具备良好的临床可及性与应用潜力。
链接: https://arxiv.org/abs/2603.26007
作者: Ishaan Cherukuri
机构: Alzheimer’s Disease Neuroimaging Initiative (ADNI); Northern California Institute for Research and Education; IRCCS Santa Lucia Foundation, Rome
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Predicting whether someone with mild cognitive impairment (MCI) will progress to Alzheimer's disease (AD) is crucial in the early stages of neurodegeneration. This uncertainty limits enrollment in clinical trials and delays urgent treatment. The Boundary Sharpness Coefficient (BSC) measures how well-defined the gray-white matter boundary looks on structural MRI. This study shows that measuring how BSC changes over time, namely how fast the boundary degrades each year, works much better than looking at a single baseline scan for predicting MCI-to-AD conversion. This study analyzed 1,824 T1-weighted MRI scans from 450 ADNI subjects (95 converters, 355 stable; mean follow-up: 4.84 years). BSC voxel-wise maps were computed using tissue segmentation at the gray-white matter cortical ribbon. Previous studies have used CNN and RNN models that reached 96.0% accuracy for AD classification and 84.2% for MCI conversion, but those approaches disregard specific regions within the brain. This study focused specifically on the gray-white matter interface. The approach uses temporal slope features capturing boundary degradation rates, feeding them into Random Survival Forest, a non-parametric ensemble method for right-censored survival data. The Random Survival Forest trained on BSC slopes achieved a test C-index of 0.63, a 163% improvement over baseline parametric models (test C-index: 0.24). Structural MRI costs a fraction of PET imaging ($800–$1,500 vs. $5,000–$7,000) and does not require CSF collection. These temporal biomarkers could help with patient-centered safety screening as well as risk assessment.
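“斜率特征 + C-index”这一流程可以用合成数据直观演示:对每个受试者用最小二乘拟合 BSC 随年限的退化率,再用 C-index 衡量“退化越快、转化越早”的排序一致性。以下为纯 NumPy/Python 示意(数据完全为笔者虚构,仅说明计算方式,与 ADNI 实验无关):

```python
import numpy as np

# 最小二乘拟合年退化斜率:bsc ≈ slope * years + intercept
def yearly_slope(years, bsc):
    A = np.stack([years, np.ones_like(years)], axis=1)
    slope, _ = np.linalg.lstsq(A, bsc, rcond=None)[0]
    return slope

# 右删失数据的简化 C-index:较早发生事件者的风险应更高
def c_index(risk, time, event):
    num = den = 0
    for i in range(len(time)):
        for j in range(len(time)):
            if event[i] and time[i] < time[j]:
                den += 1
                num += (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
    return num / den

years = np.array([0.0, 1.0, 2.0, 3.0])
# 三名虚构受试者,BSC 每年分别下降 0.05 / 0.02 / 0.01
slopes = np.array([yearly_slope(years, 1.0 - r * years) for r in (0.05, 0.02, 0.01)])
risk = -slopes                                    # 斜率越负(退化越快)风险越高
time = np.array([2.0, 4.0, 6.0])                  # 转化/随访时间(年)
event = np.array([1, 1, 0])                       # 第三人删失
print(c_index(risk, time, event))  # 1.0:风险排序与转化时间完全一致
```

实际研究中斜率特征会输入随机生存森林而非直接作为风险分数,但评价逻辑(C-index)相同。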
[CV-132] Adapting Segment Anything Model 3 for Concept-Driven Lesion Segmentation in Medical Images: An Experimental Study
【速读】:该论文旨在解决医学图像分割中现有方法泛化能力不足的问题,即大多数现有方法仅针对特定解剖部位或成像模态设计,难以在不同医学影像场景下保持性能稳定。其解决方案的关键在于系统评估最新视觉-语言基础模型Segment Anything Model 3 (SAM3) 在多模态医学图像(包括多参数MRI、CT、超声、皮肤镜和内窥镜)中的概念驱动分割能力,并通过引入额外先验知识(如邻切片预测、多参数信息及先验标注)提升模型鲁棒性,同时对比多种微调策略(部分模块微调、适配器方法与全模型优化),最终实现跨模态通用性强、概念驱动可靠且病灶边界精确的分割效果。
链接: https://arxiv.org/abs/2603.25945
作者: Guoping Xu,Jayaram K. Udupa,Yubing Tong,Xin Long,Ying Zhang,Jie Deng,Weiguo Lu,You Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 8 figures
Abstract:Accurate lesion segmentation is essential in medical image analysis, yet most existing methods are designed for specific anatomical sites or imaging modalities, limiting their generalizability. Recent vision-language foundation models enable concept-driven segmentation in natural images, offering a promising direction for more flexible medical image analysis. However, concept-prompt-based lesion segmentation, particularly with the latest Segment Anything Model 3 (SAM3), remains underexplored. In this work, we present a systematic evaluation of SAM3 for lesion segmentation. We assess its performance using geometric bounding boxes and concept-based text and image prompts across multiple modalities, including multiparametric MRI, CT, ultrasound, dermoscopy, and endoscopy. To improve robustness, we incorporate additional prior knowledge, such as adjacent-slice predictions, multiparametric information, and prior annotations. We further compare different fine-tuning strategies, including partial module tuning, adapter-based methods, and full-model optimization. Experiments on 13 datasets covering 11 lesion types demonstrate that SAM3 achieves strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation. These results highlight the potential of concept-based foundation models for scalable and practical medical image segmentation. Code and trained models will be released at: this https URL
[CV-133] Learning to Recorrupt: Noise Distribution Agnostic Self-Supervised Image Denoising
【速读】:该论文旨在解决自监督图像去噪方法中对噪声分布先验知识的依赖问题,这类先验知识通常用于避免模型学习到平凡的恒等映射(identity mapping)。传统方法如Noisier2Noise或Recorrupted2Recorrupted通过向噪声图像添加合成噪声生成训练对,但其性能高度依赖于精确的噪声分布建模,而实际场景中这一信息往往不可得。解决方案的关键在于提出Learning to Recorrupt (L2R),一种无需噪声分布先验的去噪方法:它引入一个可学习的单调神经网络来自动学习重污染(recorruption)过程,并通过最小-最大鞍点优化目标进行训练,从而在不依赖噪声分布的情况下实现对复杂噪声类型(如对数伽马、拉普拉斯、空间相关噪声及信号依赖型泊松-高斯噪声)的有效去噪,达到当前最优性能。
链接: https://arxiv.org/abs/2603.25869
作者: Brayan Monroy,Jorge Bacca,Julián Tachella
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Self-supervised image denoising methods have traditionally relied on either architectural constraints or specialized loss functions that require prior knowledge of the noise distribution to avoid the trivial identity mapping. Among these, approaches such as Noisier2Noise or Recorrupted2Recorrupted, create training pairs by adding synthetic noise to the noisy images. While effective, these recorruption-based approaches require precise knowledge of the noise distribution, which is often not available. We present Learning to Recorrupt (L2R), a noise distribution-agnostic denoising technique that eliminates the need for knowledge of the noise distribution. Our method introduces a learnable monotonic neural network that learns the recorruption process through a min-max saddle-point objective. The proposed method achieves state-of-the-art performance across unconventional and heavy-tailed noise distributions, such as log-gamma, Laplace, and spatially correlated noise, as well as signal-dependent noise models such as Poisson-Gaussian noise.
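论文提出的可学习单调神经网络是 L2R 的核心组件之一。下面给出一个通用的一维单调网络构造示意(Python/NumPy,非论文官方实现,网络结构与参数均为随机假设):把权重取绝对值并配合单调激活函数,即可保证输出对输入单调不减:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=8), rng.normal(size=8)
W2, b2 = rng.normal(size=8), 0.0

def monotone_net(y):
    """非负权重 + 单调激活(tanh),输出关于输入 y 单调不减"""
    h = np.tanh(np.abs(W1) * y[:, None] + b1)   # 每个隐单元都是 y 的单调函数
    return h @ np.abs(W2) + b2                   # 非负权重求和仍保持单调

y = np.linspace(-3.0, 3.0, 50)
out = monotone_net(y)   # 输入升序时输出非递减
```

论文实际采用的网络结构与 min-max 鞍点训练目标请以原文为准,此处只演示"单调性可由参数化本身保证"这一通用构造。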
人工智能
[AI-0] Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning
【速读】:该论文旨在解决当前机器人硬件在实现类人灵巧性方面的瓶颈问题,特别是缺乏高自由度、低成本且可扩展的仿人手结构。其解决方案的关键在于提出Ruka-v2:一款完全开源、基于肌腱驱动的仿人手系统,新增了两个关键自由度——一个解耦的2-DOF并联腕关节(提供独立的屈伸与桡偏/尺偏运动)和手指的内收/外展运动能力,从而显著提升机器人在狭小空间操作(如抽屉)及精细抓握(如薄物体、书写)等复杂任务中的表现。通过用户操控实验验证,相较于前代Ruka,Ruka-v2在完成时间上减少51.3%,成功率提高21.2%,并支持多臂遥操作与自主策略学习,展现了其在机器人学习应用中的广泛潜力。
链接: https://arxiv.org/abs/2603.26660
作者: Xinqi (Lucas) Liu,Ruoxi Hu,Alejandro Ojeda Olarte,Zhuoran Chen,Kenny Ma,Charles Cheng Ji,Lerrel Pinto,Raunaq Bhirangi,Irmak Guzey
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom - 2 per finger and 3 at the thumb - buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at this https URL.
[AI-1] Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
【速读】:该论文旨在解决当前对复杂端到端网站开发任务的系统性评估不足的问题,尤其是在视觉到代码(Visual-to-Code)生成、交互式多页前端复现及全栈网站开发等场景下缺乏统一、可扩展的基准测试框架。其解决方案的关键在于提出 Vision2Web——一个分层基准测试平台,涵盖从静态 UI 到全栈开发共 193 个真实世界网站任务,并设计了一种基于工作流的代理验证范式(workflow-based agent verification paradigm),包含 GUI 代理验证器和基于视觉语言模型(Vision-Language Model, VLM)的评判器,从而实现灵活、全面且可靠的评估能力。
链接: https://arxiv.org/abs/2603.26648
作者: Zehai He,Wenyi Hong,Zhen Yang,Ziyang Pan,Mingdao Liu,Xiaotao Gu,Jie Tang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough and reliable evaluation, we propose workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
[AI-2] Machine Learning Transferability for Malware Detection
【速读】:该论文旨在解决恶意软件(Malware)检测中因特征不兼容导致的模型泛化能力不足与跨数据集迁移性差的问题,尤其针对使用混淆技术(Obfuscation)规避检测的可执行文件(Portable Executable, PE)样本。其解决方案的关键在于构建一个统一的数据预处理流程,将各公开数据集对齐到EMBERv2(2,381维)特征空间,并以EMBER+BODMAS与EMBER+BODMAS+ERMDS两种训练组合训练配对模型,再在TRITIUM、INFERNO和SOREL-20M等测试集上检验模型在分布偏移下的稳定性与跨数据集迁移性,从而评估机器学习模型的检测性能。
链接: https://arxiv.org/abs/2603.26632
作者: César Vieira,João Vitorino,Eva Maia,Isabel Praça
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 1 Figure, 2 tables, World CIST 2026
Abstract:Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite the ongoing efforts in the development of Machine Learning (ML) detection approaches, there is still a lack of feature compatibility in public datasets. This limits generalization when facing distribution shifts, as well as transferability to different datasets. This study evaluates the suitability of different data preprocessing approaches for the detection of Portable Executable (PE) files with ML models. The preprocessing pipeline unifies datasets into the EMBERv2 (2,381-dim) feature space and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. Regarding model evaluation, both EMBER + BODMAS and EMBER + BODMAS + ERMDS models are tested against TRITIUM, INFERNO and SOREL-20M. ERMDS is also used as a test set for the EMBER + BODMAS setup.
[AI-3] Sustainability Is Not Linear: Quantifying Performance Energy and Privacy Trade-offs in On-Device Intelligence
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)从云端迁移到边缘设备时面临的资源约束问题,特别是移动电池续航、热限制和内存瓶颈。其解决方案的关键在于构建了一个可复现的实验流程,用于量化分析能耗、延迟与生成质量之间的多目标权衡关系。通过在旗舰安卓设备(三星 Galaxy S25 Ultra)上对8个参数量从0.5B到9B的模型进行实证研究,发现模型架构比量化策略对能耗影响更大:重要性感知量化虽能显著压缩内存占用,但相比标准混合精度方法几乎不节省能量;而Mixture-of-Experts(MoE)架构则展现出独特优势——以7B模型的存储容量实现1B~2B模型的低功耗特性。最终识别出中等规模模型(如Qwen2.5-3B)为兼顾响应质量和可持续能耗的实用最优解。
链接: https://arxiv.org/abs/2603.26603
作者: Eziyo Ehsani,Luca Giamattei,Ivano Malavolta,Roberto Pietrantuono
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at Empirical Software Engineering (EMSE)
Abstract:The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality. Unlike theoretical studies, we captured granular power metrics across eight models ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. We harness this pipeline to conduct an empirical case study on a flagship Android device, the Samsung Galaxy S25 Ultra, establishing foundational hypotheses regarding the trade-offs between generation quality, performance, and resource consumption. Our investigation uncovered a counter-intuitive quantization-energy paradox. While modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.
[AI-4] Beyond Code Snippets: Benchmarking LLM s on Repository-Level Question Answering
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件工程任务中,尤其是在跨文件、跨项目级别的程序理解(repository-level program comprehension)方面表现不足的问题。现有研究多集中于单文件或孤立函数的问答任务,未能充分反映真实开发场景中复杂的系统级依赖关系。为此,作者构建了首个基于多项目仓库的问答数据集 StackRepoQA,涵盖1,318个来自开发者的真实问题及其被采纳的答案,覆盖134个开源Java项目。关键解决方案包括:(1)引入基于文件检索和结构依赖图的增强生成方法(retrieval-augmented generation),以捕捉代码间的结构性语义;(2)系统评估两种主流LLM(Claude 3.5 Sonnet 和 GPT-4o)在直接提示与代理配置下的性能差异,并揭示当前模型在复杂代码理解任务中仍存在显著局限性,且高准确率常源于对Stack Overflow答案的复现而非真正推理能力。该研究为未来提升LLMs在代码理解中的泛化能力和可解释性提供了基准与方向。
链接: https://arxiv.org/abs/2603.26567
作者: Yoseph Berhanu Alebachew,Hunter Leary,Swanand Vaishampayan,Chris Brown
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tool for repository-scale program comprehension.
[AI-5] Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
【速读】:该论文旨在解决现有奖励设计中存在的两个关键问题:一是结果奖励模型(Outcome Reward Model, ORM)仅评估最终答案的正确性,忽视推理过程的质量,且随着群体响应趋于一致导致优势信号逐渐消失;二是过程奖励模型(Process Reward Model, PRM)虽提供更丰富的监督信号,但直接使用其评分易引发奖励黑客(reward hacking)行为,即模型通过冗长表述提升得分而牺牲准确性。解决方案的关键在于提出一种过程感知策略优化方法(Process-Aware Policy Optimization, PAPO),通过解耦的优势归一化机制,将优势信号分解为两部分:来自ORM的结果优势分(Aout)在所有响应中归一化,确保训练始终锚定于正确性;来自基于评分量表的PRM的过程优势分(Aproc)仅在正确响应中归一化,从而在不扭曲结果信号的前提下精细区分推理质量。实验证明,PAPO在多个模型规模和六项基准测试中均显著优于ORM,尤其在OlympiadBench上达到51.3%准确率,超越ORM的46.3%,且持续改进而未出现性能衰减。
链接: https://arxiv.org/abs/2603.26535
作者: Zelin Tan,Zhouliang Yu,Bohan Lin,Zijie Geng,Hejia Geng,Yudong Zhang,Mulei Zhang,Yang Chen,Shuyue Hu,Zhenfei Yin,Chen Zhang,Lei Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 Pages,9 Figures,First Version
Abstract:We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
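摘要中的解耦优势归一化可以用如下最小示意来理解(Python/NumPy,非论文官方代码,奖励数值均为虚构):Aout 在组内全部响应上归一化,Aproc 仅在答案正确的响应子集内归一化,错误响应的过程优势置零:

```python
import numpy as np

def papo_advantage(outcome_rewards, process_scores, eps=1e-8):
    """按 PAPO 摘要思路的示意:结果优势与过程优势分别归一化后相加。"""
    r = np.asarray(outcome_rewards, dtype=float)   # 1=答案正确, 0=错误
    p = np.asarray(process_scores, dtype=float)    # rubric-based PRM 过程评分

    # Aout:对组内全部响应做标准化(GRPO 风格)
    a_out = (r - r.mean()) / (r.std() + eps)

    # Aproc:只在正确响应内部标准化,错误响应置零,避免扭曲结果信号
    a_proc = np.zeros_like(p)
    correct = r > 0.5
    if correct.sum() >= 2:
        pc = p[correct]
        a_proc[correct] = (pc - pc.mean()) / (pc.std() + eps)

    return a_out + a_proc

# 4 条响应:前两条正确(过程分 0.9 与 0.5),后两条错误
adv = papo_advantage([1, 1, 0, 0], [0.9, 0.5, 0.8, 0.2])
```

可以看到,正确响应之间由过程分拉开差距,而错误响应的优势只由结果项决定,与其过程分无关。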
[AI-6] CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation
【速读】:该论文旨在解决现有文本到计算机辅助设计(Computer-Aided Design, CAD)生成方法中存在的两大问题:一是单次生成过程缺乏几何验证,导致无法纠正结构错误;二是依赖有损的视觉反馈,难以识别和修复尺寸误差。解决方案的关键在于提出一个基于多智能体(multi-agent)的迭代优化流水线CADSmith,其核心创新是引入两个嵌套的纠错循环:内层循环处理代码执行错误,外层循环则基于程序化几何验证进行精确修正。外层循环结合OpenCASCADE内核提供的精确测量指标(如边界框尺寸、体积、实体有效性)与独立视觉语言模型Judge的全局形状感知能力,实现数值精度与高层次语义理解的协同优化,从而显著提升生成CAD模型的质量与可靠性。
链接: https://arxiv.org/abs/2603.26512
作者: Jesse Barkley,Rumi Loghmani,Amir Barati Farimani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures
Abstract:Existing methods for text-to-CAD generation either operate in a single pass with no geometric verification or rely on lossy visual feedback that cannot resolve dimensional errors. We present CADSmith, a multi-agent pipeline that generates CadQuery code from natural language. It then undergoes an iterative refinement process through two nested correction loops: an inner loop that resolves execution errors and an outer loop grounded in programmatic geometric validation. The outer loop combines exact measurements from the OpenCASCADE kernel (bounding box dimensions, volume, solid validity) with holistic visual assessment from an independent vision-language model Judge. This provides both the numerical precision and the high-level shape awareness needed to converge on the correct geometry. The system uses retrieval-augmented generation over API documentation rather than fine-tuning, maintaining a current database as the underlying CAD library evolves. We evaluate on a custom benchmark of 100 prompts in three difficulty tiers (T1 through T3) with three ablation configurations. Against a zero-shot baseline, CADSmith achieves a 100% execution rate (up from 95%), improves the median F1 score from 0.9707 to 0.9846, the median IoU from 0.8085 to 0.9629, and reduces the mean Chamfer Distance from 28.37 to 0.74, demonstrating that closed-loop refinement with programmatic geometric feedback substantially improves the quality and reliability of LLM-generated CAD models.
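外层纠错循环依赖的程序化几何校验,本质上是将内核测得的几何量与目标规格逐项比对。以下是一个简化示意(Python,非 CADSmith 官方接口,字段名与容差均为假设),不涉及 OpenCASCADE 本身:

```python
def validate_geometry(measured, target, rtol=0.05):
    """逐项计算相对偏差,全部落在容差内才算通过(示意实现)。"""
    report = {}
    for key, t in target.items():
        m = measured.get(key)
        report[key] = abs(m - t) / max(abs(t), 1e-12)   # 相对偏差
    passed = all(rel <= rtol for rel in report.values())
    return passed, report

# 假设从内核量得的包围盒尺寸与体积,对照目标规格
measured = {"bbox_x": 10.2, "bbox_y": 5.0, "bbox_z": 2.0, "volume": 98.0}
target   = {"bbox_x": 10.0, "bbox_y": 5.0, "bbox_z": 2.0, "volume": 100.0}
ok, report = validate_geometry(measured, target)
```

校验未通过时,偏差报告(report)可反馈给生成代理,用于在外层循环中定向修正尺寸,这正是摘要所述"数值精度"一侧的作用。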
[AI-7] AIRA_2: Overcoming Bottlenecks in AI Research Agents
【速读】:该论文旨在解决当前AI研究代理(AI research agents)在结构性能上的三大瓶颈问题:(1) 同步单GPU执行限制了样本吞吐量,削弱了搜索效率;(2) 验证集选择导致泛化差距,使模型在长周期搜索中性能下降;(3) 固定的单轮大语言模型(Large Language Model, LLM)操作符限制了搜索能力上限。解决方案的关键在于三个核心架构设计:(1) 异步多GPU工作节点池实现实验吞吐量线性增长;(2) 隐藏一致评估(Hidden Consistent Evaluation, HCE)协议提供可靠评估信号;(3) ReAct代理通过动态调整行动范围并交互式调试提升搜索效率。实验证明,AIRA₂在MLE-bench-30上24小时平均百分位排名达71.8%,优于此前最优结果69.9%,并在72小时稳定提升至76.0%。消融实验进一步表明各模块均不可或缺,且先前报道的“过拟合”现象源于评估噪声而非真实数据记忆。
链接: https://arxiv.org/abs/2603.26499
作者: Karen Hambardzumyan,Nicolas Baldwin,Edan Toledo,Rishi Hazra,Michael Kuchnik,Bassel Al Omari,Thomas Simon Foster,Anton Protopopov,Jean-Christophe Gagnon-Audet,Ishita Mediratta,Kelvin Niu,Michael Shvartsman,Alisia Lupidi,Alexis Audran-Reiss,Parth Pathak,Tatiana Shavrina,Despoina Magka,Hela Momand,Derek Dunfield,Nicola Cancedda,Pontus Stenetorp,Carole-Jean Wu,Jakob Nicolaus Foerster,Yoram Bachrach,Martin Josifoski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA_2, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA_2 achieves a mean Percentile Rank of 71.8% at 24 hours - surpassing the previous best of 69.9% - and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the “overfitting” reported in prior work was driven by evaluation noise rather than true data memorization.
[AI-8] Rocks Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理服务中因异构负载导致的性能瓶颈问题,特别是视频等高资源需求请求对系统延迟和内存占用的显著影响,进而引发队首阻塞(head-of-line blocking)和响应不及时的问题。现有为纯文本优化的LLM服务系统无法有效处理此类多模态请求,导致交互式应用体验下降。解决方案的关键在于提出一种基于模态感知的调度机制——RPS-Serve,其核心思想是将不同模态请求抽象为“岩石”(视频)、“鹅卵石”(图像)和“沙子”(文本),通过动态分类、优先级调度与老化策略(aging),使低资源消耗的文本请求能够快速通过高资源消耗的图像和视频请求,从而在保障公平性的同时显著降低首次词元时间(Time-to-First-Token, TTFT),实现类LLM级别的响应速度。
链接: https://arxiv.org/abs/2603.26498
作者: Konstantinos Papaioannou,Thaleia Dimitra Doudali
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation. RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. Evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces, on average, time-to-first-token (TTFT) by 54% overall, and by 78.5% for latency-critical requests, compared to current systems. RPS-Serve delivers LLM-like responsiveness for MLLMs, with modality-aware scheduling and by making the most efficient use of the available resources.
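rocks/pebbles/sand 的抽象可以用一个带 aging 的优先级队列来直观演示(Python 标准库示意,非 RPS-Serve 官方实现,优先级数值与老化步长均为假设):

```python
import heapq
import itertools

# 按 rocks/pebbles/sand 抽象给不同模态一个基础优先级,数值越小越优先
BASE_PRIORITY = {"text": 0, "image": 1, "video": 2}

class ModalityScheduler:
    """模态感知调度示意:文本(沙)优先,aging 防止视频(岩石)饥饿。"""
    def __init__(self, aging_step=1):
        self.heap = []
        self.counter = itertools.count()   # 同优先级内保持 FIFO
        self.aging_step = aging_step

    def submit(self, request_id, modality):
        heapq.heappush(self.heap, [BASE_PRIORITY[modality], next(self.counter), request_id])

    def age(self):
        # 每个调度周期降低所有排队请求的优先级数值(即提升其优先级)
        for entry in self.heap:
            entry[0] = max(0, entry[0] - self.aging_step)
        heapq.heapify(self.heap)

    def pop(self):
        return heapq.heappop(self.heap)[2] if self.heap else None

sched = ModalityScheduler()
sched.submit("v1", "video")
sched.submit("t1", "text")
first = sched.pop()        # 文本(沙)先于视频(岩石)出队
sched.age(); sched.age()   # 两个周期后 v1 的优先级数值降为 0
sched.submit("t2", "text")
second = sched.pop()       # 老化后的 v1 不再被新到的文本插队
```

实际系统还需结合各阶段(视觉预处理、编码、解码)的资源估计来分类请求,此处仅演示"沙从石缝中流过且石不饥饿"的调度直觉。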
[AI-9] Foundation Model for Cardiac Time Series via Masked Latent Attention
【速读】:该论文旨在解决现有基础模型(Foundation Models, FMs)在心电图(Electrocardiogram, ECG)表示学习中未能有效利用不同导联(lead)之间强结构冗余的问题。传统预训练方法将各导联视为独立通道,忽略了其内在的拓扑关联性,限制了表征质量和迁移性能。解决方案的关键在于提出一种潜在注意力掩码自编码器(Latent Attention Masked Autoencoder, LAMAE),通过在自监督预训练过程中显式建模跨导联连接机制,利用潜在注意力(latent attention)捕捉导联间的高阶交互关系,实现对导联特异性表征的排列不变聚合与自适应加权,从而提升表征质量与下游任务的可迁移性。
链接: https://arxiv.org/abs/2603.26475
作者: Moritz Vandenhirtz,Samuel Ruipérez-Campillo,Simon Böhi,Sonia Laguna,Irene Cannistraci,Andrea Agostini,Ece Ozkan,Thomas M. Sutter,Julia E. Vogt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Representation Theory (math.RT)
备注: First two authors are co-first. Last two authors are co-senior
Abstract:Electrocardiograms (ECGs) are among the most widely available clinical signals and play a central role in cardiovascular diagnosis. While recent foundation models (FMs) have shown promise for learning transferable ECG representations, most existing pretraining approaches treat leads as independent channels and fail to explicitly leverage their strong structural redundancy. We introduce the latent attention masked autoencoder (LAMAE) FM that directly exploits this structure by learning cross-lead connection mechanisms during self-supervised pretraining. Our approach models higher-order interactions across leads through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations. We provide empirical evidence on the MIMIC-IV-ECG database that leveraging the cross-lead connection constitutes an effective form of structural supervision, improving representation quality and transferability. Our method shows strong performance in predicting ICD-10 codes, outperforming independent-lead masked modeling and alignment-based baselines.
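摘要提到的"对导联次序置换不变的聚合"可以用潜在查询注意力池化来示意(Python/NumPy,非论文官方实现,维度与参数均为随机假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def latent_attention_pool(leads, query, Wk, Wv):
    """用一个可学习潜在查询对各导联表示打分并加权求和;
    由于只做逐导联打分再求和,结果对导联次序置换不变(示意)。"""
    K = leads @ Wk                             # (L, d) 键
    V = leads @ Wv                             # (L, d) 值
    scores = K @ query / np.sqrt(query.size)   # 每个导联一个分数
    a = softmax(scores)                        # 自适应导联权重
    return a @ V                               # 置换不变的聚合表示

rng = np.random.default_rng(0)
leads = rng.normal(size=(12, 16))              # 12 导联的示意表示
Wk, Wv = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
query = rng.normal(size=8)

pooled = latent_attention_pool(leads, query, Wk, Wv)
pooled_perm = latent_attention_pool(leads[rng.permutation(12)], query, Wk, Wv)
```

打乱导联顺序后聚合结果不变,权重 a 则体现摘要所述的"自适应导联加权"。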
[AI-10] UNIFERENCE: A Discrete Event Simulation Framework for Developing Distributed AI Models
【速读】:该论文旨在解决分布式推理算法开发与评估中因缺乏标准化工具来建模异构设备和网络而导致的难题。现有研究通常依赖临时搭建的测试床或专有基础设施,使得结果难以复现,并限制了对假设硬件或网络配置的探索。其解决方案的关键在于提出UNIFERENCE——一个基于离散事件仿真(Discrete-Event Simulation, DES)的框架,通过轻量级逻辑进程建模设备与网络行为,仅在通信原语上同步,避免回滚同时保持因果顺序;该框架可无缝集成PyTorch Distributed,使同一代码库能从仿真环境平滑过渡到真实部署,实现在多种后端和硬件配置下运行时性能预测精度达98.6%。
链接: https://arxiv.org/abs/2603.26469
作者: Doğaç Eldenk,Stephen Xia
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Developing and evaluating distributed inference algorithms remains difficult due to the lack of standardized tools for modeling heterogeneous devices and networks. Existing studies often rely on ad-hoc testbeds or proprietary infrastructure, making results hard to reproduce and limiting exploration of hypothetical hardware or network configurations. We present UNIFERENCE, a discrete-event simulation (DES) framework designed for developing, benchmarking, and deploying distributed AI models within a unified environment. UNIFERENCE models device and network behavior through lightweight logical processes that synchronize only on communication primitives, eliminating rollbacks while preserving the causal order. It integrates seamlessly with PyTorch Distributed, enabling the same codebase to transition from simulation to real deployment. Our evaluation demonstrates that UNIFERENCE profiles runtime with up to 98.6% accuracy compared to real physical deployments across diverse backends and hardware setups. By bridging simulation and deployment, UNIFERENCE provides an accessible, reproducible platform for studying distributed inference algorithms and exploring future system designs, from high-performance clusters to edge-scale devices. The framework is open-sourced at this https URL.
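一个最小的离散事件仿真内核足以说明 UNIFERENCE 的基本机制:事件按时间戳出队、时间只前进(无回滚)、逻辑进程只在通信事件处交互(Python 标准库示意,非框架官方代码,延迟数值均为虚构):

```python
import heapq

class MiniDES:
    """最小离散事件仿真内核:时间只前进,保证因果顺序。"""
    def __init__(self):
        self.now = 0.0
        self.queue = []
        self.seq = 0   # 同一时刻事件的稳定次序

    def schedule(self, delay, callback):
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback))
        self.seq += 1

    def run(self):
        while self.queue:
            t, _, cb = heapq.heappop(self.queue)
            self.now = t
            cb()

# 用法:设备 A 计算 5ms 后经 2ms 链路把中间激活发给设备 B,B 再计算 3ms
sim = MiniDES()
log = []

def a_done():
    log.append(("A_compute_done", sim.now))
    sim.schedule(2.0, b_recv)   # 通信原语:逻辑进程仅在此处同步

def b_recv():
    log.append(("B_received", sim.now))
    sim.schedule(3.0, lambda: log.append(("B_compute_done", sim.now)))

sim.schedule(5.0, a_done)
sim.run()   # 端到端时延 10ms
```

真实框架还需对接 PyTorch Distributed 的通信原语并建模异构设备算力,此处只保留"仅在通信处同步、无回滚"的内核思想。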
[AI-11] A Boltzmann-machine-enhanced Transformer For DNA Sequence Classification
【速读】:该论文旨在解决DNA序列分类中对高阶依赖关系(如位点间相互作用、组合调控和表型上位性)的建模问题,传统Transformer虽具备全局建模能力,但其软最大注意力机制连续且稀疏,难以显式发现结构化模式。解决方案的关键在于提出一种基于玻尔兹曼机增强的Transformer架构:通过引入结构化的二值门控变量表示查询-键连接,并以玻尔兹曼能量函数进行约束;其中局部偏置项由查询-键相似度定义,可学习的成对交互项捕捉边之间的协同与竞争关系,潜变量则建模更高阶组合依赖。为处理离散门控图的后验推断难题,采用均值场变分推断估计边激活概率,并结合Gumbel-Softmax实现从连续概率到近似离散门控的渐进压缩,同时保持端到端可微性。训练时联合优化分类损失与能量损失,促使模型在准确预测的同时偏好低能量、稳定且可解释的结构。
链接: https://arxiv.org/abs/2603.26465
作者: Zhixuan Cao,Yishu Xu,Xuang WU
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages
Abstract:DNA sequence classification requires not only high predictive accuracy but also the ability to uncover latent site interactions, combinatorial regulation, and epistasis-like higher-order dependencies. Although the standard Transformer provides strong global modeling capacity, its softmax attention is continuous, dense, and weakly constrained, making it better suited for information routing than explicit structure discovery. In this paper, we propose a Boltzmann-machine-enhanced Transformer for DNA sequence classification. Built on multi-head attention, the model introduces structured binary gating variables to represent latent query-key connections and constrains them with a Boltzmann-style energy function. Query-key similarity defines local bias terms, learnable pairwise interactions capture synergy and competition between edges, and latent hidden units model higher-order combinatorial dependencies. Since exact posterior inference over discrete gating graphs is intractable, we use mean-field variational inference to estimate edge activation probabilities and combine it with Gumbel-Softmax to progressively compress continuous probabilities into near-discrete gates while preserving end-to-end differentiability. During training, we jointly optimize classification and energy losses, encouraging the model to achieve accurate prediction while favoring low-energy, stable, and interpretable structures. We further derive the framework from the energy function and variational free energy to the mean-field fixed-point equations, Gumbel-Softmax relaxation, and the final joint objective. The proposed framework provides a unified view of integrating Boltzmann machines, differentiable discrete optimization, and Transformers for structured learning on biological sequences.
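摘要中"Gumbel-Softmax 将连续概率渐进压缩为近似离散门"的机制,在二值门控情形下即退化为 Gumbel-Sigmoid,可用如下示意理解(Python/NumPy,非论文官方实现,logits 数值为虚构):

```python
import numpy as np

def gumbel_sigmoid(logits, tau, rng):
    """二值门控的 Gumbel-Softmax 松弛:tau 越小,输出越接近 0/1 离散门,
    同时整个采样过程对 logits 保持可微(重参数化)。"""
    # 两个独立 Gumbel 噪声之差等价于 Logistic 噪声
    u = rng.uniform(1e-9, 1 - 1e-9, size=logits.shape)
    logistic = np.log(u) - np.log(1 - u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic) / tau))

rng = np.random.default_rng(0)
# 查询-键连接的示意 logits(正值倾向开门,负值倾向关门)
logits = np.array([[4.0, -4.0], [-4.0, 4.0]])

soft = gumbel_sigmoid(logits, tau=1.0, rng=rng)    # 高温:连续的软门
hard = gumbel_sigmoid(logits, tau=0.05, rng=rng)   # 低温:近似离散的门
```

训练中通常对温度 tau 做退火,即摘要所述"渐进压缩";均值场推断给出的边激活概率则对应这里的 sigmoid(logits)。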
[AI-12] Neuro-Symbolic Process Anomaly Detection
【速读】:该论文旨在解决生成式 AI(Generative AI)在流程异常检测中因缺乏人类领域知识而导致的误报问题,即模型将罕见但符合规范的流程轨迹错误分类为异常。其解决方案的关键在于提出一种神经符号(neuro-symbolic)方法,通过逻辑张量网络(Logic Tensor Networks, LTN)将Declare约束作为软逻辑引导规则嵌入到自动编码器(autoencoder)的学习过程中,从而在保持统计学习能力的同时,有效融合领域专家知识,区分真正的异常行为与罕见但合规的行为。
链接: https://arxiv.org/abs/2603.26461
作者: Devashish Gaikwad,Wil M. P. van der Aalst,Gyunam Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:
Abstract:Process anomaly detection is an important application of process mining for identifying deviations from the normal behavior of a process. Neural network-based methods have recently been applied to this task, learning directly from event logs without requiring a predefined process model. However, since anomaly detection is a purely statistical task, these models fail to incorporate human domain knowledge. As a result, rare but conformant traces are often misclassified as anomalies due to their low frequency, which limits the effectiveness of the detection process. Recent developments in the field of neuro-symbolic AI have introduced Logic Tensor Networks (LTN) as a means to integrate symbolic knowledge into neural networks using real-valued logic. In this work, we propose a neuro-symbolic approach that integrates domain knowledge into neural anomaly detection using LTN and Declare constraints. Using autoencoder models as a foundation, we encode Declare constraints as soft logical guiderails within the learning process to distinguish between anomalous and rare but conformant behavior. Evaluations on synthetic and real-world datasets demonstrate that our approach improves F1 scores even when as few as 10 conformant traces exist, and that the choice of Declare constraint and by extension human domain knowledge significantly influences performance gains.
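以 Declare 的 response(a, b) 约束(a 出现后,其后必须出现 b)为例,"软逻辑导轨"的直觉是:轨迹满足约束(conformant)却被模型判为异常时施加惩罚。以下为一个高度简化的示意(Python,并非 LTN 实值逻辑的完整实现,活动名与数值均为虚构):

```python
def response_satisfaction(trace, a, b):
    """Declare 的 response(a, b):每次出现 a 之后,其后必须出现 b。"""
    for i, act in enumerate(trace):
        if act == a and b not in trace[i + 1:]:
            return 0.0
    return 1.0

def soft_constraint_loss(prob_anomalous, trace, a, b, weight=1.0):
    """软逻辑"导轨"示意:轨迹合规(sat=1)却被判高异常分时受罚;
    真实 LTN 用实值逻辑联结词聚合多条约束,这里只保留乘积形式的直觉。"""
    sat = response_satisfaction(trace, a, b)
    return weight * sat * prob_anomalous

# 罕见但合规的轨迹:满足 response("submit", "review"),高异常分会被惩罚
loss_conformant = soft_constraint_loss(0.9, ["submit", "check", "review"], "submit", "review")
# 违反约束的轨迹:不触发该惩罚,异常分可以保持
loss_violating = soft_constraint_loss(0.9, ["submit", "check"], "submit", "review")
```

把这类惩罚加入自动编码器的重构损失,即可在训练中把"罕见但合规"与"真正异常"区分开来。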
[AI-13] Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations
【速读】:该论文旨在解决如何通过昂贵的生成式 AI (Generative AI) 模型有效指导廉价模型完成软件工程任务的问题。其核心挑战在于:是否能利用高成本模型的推理能力替代执行能力,从而在保持性能的同时显著降低计算开销。解决方案的关键在于提出 ManagerWorker 两代理(two-agent)架构——由一个仅具备文本理解能力的“管理者”模型负责问题分析、任务分配与结果审查,以及一个具备完整代码仓库访问权限的“工作者”模型执行具体代码变更。实验证明,该设计的成功源于两个关键机制:一是将每个模型限制在其训练模式内(管理者专注文本生成,工作者专注工具调用),二是将组织结构外化至代码层面,避免因角色拆分违背现有模型的单体训练分布(monolithic training distribution)。这一方法不仅实现了性能接近单一大模型(62% vs. 60%),且大幅减少资源消耗,揭示了当前模型在委托(delegation)、限定范围执行(scoped execution)和模式切换(mode switching)等能力上的训练缺失。
链接: https://arxiv.org/abs/2603.26458
作者: Rui Liu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two-agent pipeline where an expensive “manager” model (text-only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap “worker” model (with full repo access) executes code changes. We evaluate on 200 instances from SWE-bench Lite across five configurations that vary the manager-worker relationship, pipeline complexity, and model pairing. Our findings reveal both the promise and the limits of multi-agent direction: (1) a strong manager directing a weak worker (62%) matches a strong single agent (60%) at a fraction of the strong-model token usage, showing that expensive reasoning can substitute for expensive execution; (2) a weak manager directing a weak worker (42%) performs worse than the weak agent alone (44%), demonstrating that the directing relationship requires a genuine capability gap–structure without substance is pure overhead; (3) the manager’s value lies in directing, not merely reviewing–a minimal review-only loop adds just 2pp over the baseline, while structured exploration and planning add 11pp, showing that active direction is what makes the capability gap productive; and (4) these behaviors trace to a single root cause: current models are trained as monolithic agents, and splitting them into director/worker roles fights their training distribution. The pipeline succeeds by designing around this mismatch–keeping each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code. This diagnosis points to concrete training gaps: delegation, scoped execution, and mode switching are skills absent from current training data.
[AI-14] KMM-CP: Practical Conformal Prediction under Covariate Shift via Selective Kernel Mean Matching
【速读】:该论文旨在解决在存在协变量偏移(covariate shift)场景下,传统共形预测(Conformal Prediction, CP)方法因分布不匹配而导致的覆盖误差(coverage error)问题。其关键解决方案是提出KMM-CP框架,基于核均值匹配(Kernel Mean Matching, KMM)进行协变量偏移校正:通过最小化再生核希尔伯特空间(RKHS)中的矩差异并施加显式权重约束,直接控制覆盖误差的偏差-方差成分;同时引入选择性扩展机制,在可靠支持重叠区域执行校正,从而提升低重叠场景下的稳定性。实验表明,该方法在分子属性预测任务中可将覆盖差距降低超过50%。
链接: https://arxiv.org/abs/2603.26415
作者: Siddhartha Laghuvarapu,Rohan Deb,Jimeng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:Uncertainty quantification is essential for deploying machine learning models in high-stakes domains such as scientific discovery and healthcare. Conformal Prediction (CP) provides finite-sample coverage guarantees under exchangeability, an assumption often violated in practice due to distribution shift. Under covariate shift, restoring validity requires importance weighting, yet accurate density-ratio estimation becomes unstable when training and test distributions exhibit limited support overlap. We propose KMM-CP, a conformal prediction framework based on Kernel Mean Matching (KMM) for covariate-shift correction. We show that KMM directly controls the bias-variance components governing conformal coverage error by minimizing RKHS moment discrepancy under explicit weight constraints, and establish asymptotic coverage guarantees under mild conditions. We then introduce a selective extension that identifies regions of reliable support overlap and restricts conformal correction to this subset, further improving stability in low-overlap regimes. Experiments on molecular property prediction benchmarks with realistic distribution shifts show that KMM-CP reduces coverage gap by over 50% compared to existing approaches. The code is available at this https URL.
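KMM 的核心是在权重约束下最小化 RKHS 中训练/测试核均值嵌入的差异。下面用投影梯度给出一个自包含示意(Python/NumPy,非论文官方实现;论文中的求解方式、核与约束细节请以原文为准):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kmm_weights(X_tr, X_te, gamma=1.0, B=10.0, lr=5.0, steps=1000):
    """核均值匹配的投影梯度示意:最小化 ||(1/n)Σ w_i φ(x_i) - (1/m)Σ φ(x'_j)||²,
    并施加 0 <= w <= B 与权重均值为 1 的约束。"""
    n, m = len(X_tr), len(X_te)
    K = rbf_kernel(X_tr, X_tr, gamma)                   # n x n
    kappa = rbf_kernel(X_tr, X_te, gamma).sum(axis=1)   # 对测试样本求和的核向量
    w = np.ones(n)
    for _ in range(steps):
        grad = 2.0 * (K @ w) / n**2 - 2.0 * kappa / (n * m)
        w -= lr * grad
        w = np.clip(w, 0.0, B)       # 盒约束
        w *= n / w.sum()             # 投影回"均值为 1"的约束
        w = np.clip(w, 0.0, B)
    return w

# 训练分布偏左、测试分布偏右:位于右侧的训练点应获得更大权重
rng = np.random.default_rng(0)
X_tr = rng.normal(-1.0, 1.0, size=(80, 1))
X_te = rng.normal(+1.0, 1.0, size=(80, 1))
w = kmm_weights(X_tr, X_te, gamma=0.5)
right_mean = w[X_tr[:, 0] > 0].mean()
left_mean = w[X_tr[:, 0] < -2].mean()
```

得到的权重再代入加权共形分位数即可恢复协变量偏移下的覆盖;摘要中的选择性机制则相当于只在支持重叠可靠的子集上使用这些权重。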
[AI-15] Generative Modeling in Protein Design: Neural Representations Conditional Generation and Evaluation Standards
[Quick Read]: This survey addresses the fragmentation of methods and the lack of unified evaluation standards in generative AI for protein research, which hinder comparison across models and practical adoption. Its key contribution is a systematic synthesis along three core dimensions: (i) foundational representations (covering sequence, geometric, and multimodal encodings); (ii) generative architectures (including SE(3)-equivariant diffusion, flow matching, and hybrid predictor-generator systems); and (iii) task settings (from structure prediction to de novo design and protein-ligand and protein-protein interaction modeling). The paper further compares assumptions, conditioning mechanisms, and controllability, and proposes evaluation practices emphasizing leakage-aware splits, physical validity checks, and function-oriented benchmarks, pushing the field from predictive modeling toward reliable, function-driven protein engineering.
Link: https://arxiv.org/abs/2603.26378
Authors: Senura Hansaja Wanasekara, Minh-Duong Nguyen, Xiaochen Liu, Nguyen H. Tran, Ken-Tye Yong
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 tables, 4 figures
Abstract:Generative modeling has become a central paradigm in protein research, extending machine learning beyond structure prediction toward sequence design, backbone generation, inverse folding, and biomolecular interaction modeling. However, the literature remains fragmented across representations, model classes, and task formulations, making it difficult to compare methods or identify appropriate evaluation standards. This survey provides a systematic synthesis of generative AI in protein research, organized around (i) foundational representations spanning sequence, geometric, and multimodal encodings; (ii) generative architectures including SE(3)-equivariant diffusion, flow matching, and hybrid predictor-generator systems; and (iii) task settings from structure prediction and de novo design to protein-ligand and protein-protein interactions. Beyond cataloging methods, we compare assumptions, conditioning mechanisms, and controllability, and we synthesize evaluation best practices that emphasize leakage-aware splits, physical validity checks, and function-oriented benchmarks. We conclude with critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for dual-use biosecurity risks. By unifying architectural advances with practical evaluation standards and responsible development considerations, this survey aims to accelerate the transition from predictive modeling to reliable, function-driven protein engineering.
[AI-16] PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management
[Quick Read]: This paper tackles the systemic fragility that arises when current pharmaceutical AI conflates three inherently distinct operations (document preservation, semantic interpretation, and contextual presentation) into a single technical layer, causing loss of provenance, interpretive opacity, alert fatigue, and eroded accountability. The key to its solution is the PATOS-Lector-PRISMA (PLP) information architecture: PATOS stores regulatory documents with explicit versioning and provenance; Lector combines machine-assisted reading with human curation to produce typed assertions anchored to primary sources; and PRISMA delivers role-specific views via the RPDA framework (Regulatory, Prescription, Dispensing, Administration). The architecture introduces the Evidence Pack as a standardized unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typed by illocutionary force, thereby providing the documentary anchoring, interpretive transparency, and institutional accountability that existing decision support systems lack.
Link: https://arxiv.org/abs/2603.26324
Authors: Eugenio Rodrigo Zimmer Neves, Amanda Vanon Correa, Camila Campioni, Gabielli Pare Guglielmi, Bruno Morelli
Affiliation: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 52 pages, 3 figures, 71 references
Abstract:Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS–Lector–PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil’s regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is demonstrated as complementary to operational decision support systems, providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.
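The Evidence Pack's stated properties (versioned, traceable to a preserved source, typed by illocutionary force, never mutated) can be sketched as an immutable record. Field names below are illustrative, not the PRISMA schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    text: str          # the curated claim
    force: str         # illocutionary type, e.g. "assertive", "directive"
    source_id: str     # anchor into the preserved document store (PATOS role)

@dataclass(frozen=True)
class EvidencePack:
    medication: str
    version: int
    assertions: tuple  # immutable collection of Assertion records

    def bump(self, new_assertions):
        """Return a new pack version; the old version is preserved, never mutated."""
        return EvidencePack(self.medication, self.version + 1,
                            self.assertions + tuple(new_assertions))
```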
[AI-17] Knowdit: Agentic Smart Contract Vulnerability Detection with Auditing Knowledge Summarization
[Quick Read]: This paper addresses the difficulty of automated vulnerability detection in decentralized finance (DeFi) smart contracts, where many vulnerabilities are tightly coupled with project-specific business logic and thus hard for existing methods to catch. The core insight is that recurring vulnerabilities across diverse business models often share the same underlying economic mechanisms ("DeFi semantics"), which traditional static analysis cannot model. The key to the solution is Knowdit, a knowledge-driven agentic framework: it first constructs an auditing knowledge graph from historical human audit reports, linking fine-grained DeFi semantics to recurring vulnerability patterns, then runs a multi-agent iterative loop of specification generation, harness synthesis, fuzz execution, and finding reflection over a shared working memory. This design lets Knowdit systematically identify high-severity vulnerabilities, discover previously unknown critical flaws in real-world projects, and significantly outperform existing baselines.
Link: https://arxiv.org/abs/2603.26270
Authors: Ziqiao Kong, Wanxu Xia, Chong Wang, Yi Lu, Pan Li, Shaohua Li, Zong Cao, Yang Liu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Smart contracts govern billions of dollars in decentralized finance (DeFi), yet automated vulnerability detection remains challenging because many vulnerabilities are tightly coupled with project-specific business logic. We observe that recurring vulnerabilities across diverse DeFi business models often share the same underlying economic mechanisms, which we term DeFi semantics, and that capturing these shared abstractions can enable more systematic auditing. Building on this insight, we propose Knowdit, a knowledge-driven, agentic framework for smart contract vulnerability detection. Knowdit first constructs an auditing knowledge graph from historical human audit reports, linking fine-grained DeFi semantics with recurring vulnerability patterns. Given a new project, a multi-agent framework leverages this knowledge through an iterative loop of specification generation, harness synthesis, fuzz execution, and finding reflection, driven by a shared working memory for continuous refinement. We evaluate Knowdit on 12 recent Code4rena projects with 75 ground-truth vulnerabilities. Knowdit detects all 14 high-severity and 77% of medium-severity vulnerabilities with only 2 false positives, significantly outperforming all baselines. Applied to six real-world projects, Knowdit further discovers 12 high- and 10 medium-severity previously unknown vulnerabilities, proving its outstanding performance.
[AI-18] Physics-Informed Neural Networks and Sequence Encoder: Application to heating and early cooling of thermo-stamping process
[Quick Read]: This paper addresses the challenges of multimodal data fusion and generalization in identifying complex dynamical systems, focusing on combining a sequence encoder with physics-informed neural networks (PINN-SE) to predict system responses in a realistic setting: the heating and early cooling stages of the thermo-stamping process. The key to the solution is the Sequence Encoder (SE), which maps time-series data (1D signals or sequences of temporal 2D images) into compact feature vectors that the PINN uses to capture dynamics under changes in parameters, initial conditions, and boundary conditions; moreover, training on synthetic data generated from experimental data markedly improves generalization to real experimental data unseen during training.
Link: https://arxiv.org/abs/2603.26245
Authors: Mouad Elaarabi, Domenico Borzacchiello, Philippe Le Bot, Nathan Lauzeral, Sebastien Comas-Cardona
Affiliation: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:
Abstract:In a previous work (Elaarabi et al., 2025b), the Sequence Encoder for online dynamical system identification (Elaarabi et al., 2025a) and its combination with PINN (PINN-SE) were introduced and tested on both synthetic and real data case scenarios. The sequence encoder is able to effectively encode time series into feature vectors, which the PINN then uses to map to dynamical behavior, predicting system response under changes in parameters, ICs and BCs. Previously (Elaarabi et al., 2025b), the tests on real data were limited to simple 1D problems and only 1D time series inputs of the Sequence Encoder. In this work, the possibility of applying PINN-SE to a more realistic case is investigated: heating and early cooling of the thermo-stamping process, which is a critical stage in the forming process of continuous fiber reinforced composite materials with thermoplastic polymer. The possibility of extending the PINN-SE inputs to multimodal data, such as sequences of temporal 2D images and to scenarios involving variable geometries, is also explored. The results show that combining multiple encoders with the previously proposed method (Elaarabi et al., 2025b) is feasible, we also show that training the model on synthetic data generated based on experimental data can help the model to generalize well for real experimental data, unseen during the training phase.
[AI-19] Automating Domain-Driven Design: Experience with a Prompting Framework
[Quick Read]: This paper addresses the high human cost and low automation of domain-driven design (DDD) practice when architecting complex software systems. The core solution is a framework of structured large language model (LLM) prompts that decomposes DDD into five sequential steps: establishing a ubiquitous language, simulating event storming, identifying bounded contexts, designing aggregates, and mapping to a technical architecture. The framework reliably automates the first three steps (producing usable artifacts such as glossaries and context maps), substantially reducing documentation effort and letting experts focus on critical trade-offs; however, errors accumulate through the last two steps and render their outputs impractical, indicating that LLMs are better suited as collaborative aids than as replacements for human architectural judgment.
Link: https://arxiv.org/abs/2603.26244
Authors: Tobias Eisenreich, Husein Jusic, Stefan Wagner
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Domain-driven design (DDD) is a powerful design technique for architecting complex software systems. This paper introduces a prompting framework that automates core DDD activities through structured large language model (LLM) interactions. We decompose DDD into five sequential steps: (1) establishing a ubiquitous language, (2) simulating event storming, (3) identifying bounded contexts, (4) designing aggregates, and (5) mapping to technical architecture. In a case study, we validated the prompting framework against real-world requirements from FTAPI’s enterprise platform. While the first steps consistently generate valuable and usable artifacts, later steps show how minor errors or inaccuracies can propagate and accumulate. Overall, the framework excels as a collaborative sparring partner for building actionable documentation, such as glossaries and context maps, but not for full automation. This allows the experts to concentrate their discussion on the critical trade-offs. In our evaluation, Steps 1 to 3 worked well, but the accumulated errors rendered the artifacts generated from Steps 4 and 5 impractical. Our findings show that LLMs can enhance, but not replace, architectural expertise, offering a practical tool to reduce the effort and overhead of DDD while preserving human-centric decision-making.
[AI-20] Clawed and Dangerous: Can We Trust Open Agentic Systems?
[Quick Read]: This paper addresses the fundamental security-governance challenge of open agentic systems: how to control agent behavior under persistent uncertainty. Such systems combine LLM-based planning, external capabilities, persistent memory, and privileged execution, and differ fundamentally from traditional software because plan generation, decision inputs, and execution environments are all probabilistic, with authority dynamically delegated by human users. The key contribution is a six-dimensional analytical taxonomy and a synthesis of 50 papers, from which the authors derive a set of secure-by-construction reference principles and an evaluation scorecard, providing an engineering framework for agent platforms that are governable, auditable, and resilient under compromise.
Link: https://arxiv.org/abs/2603.26221
Authors: Shiping Chen, Qin Wang, Guangsheng Yu, Xu Wang, Liming Zhu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Software Engineering (cs.SE)
Comments:
Abstract:Open agentic systems combine LLM-based planning with external capabilities, persistent memory, and privileged execution. They are used in coding assistants, browser copilots, and enterprise automation. OpenClaw is a visible instance of this broader class. Without much attention yet, their security challenge is fundamentally different from that of traditional software that relies on predictable execution and well-defined control flow. In open agentic systems, everything is ‘‘probabilistic’’: plans are generated at runtime, key decisions may be shaped by untrusted natural-language inputs and tool outputs, execution unfolds in uncertain environments, and actions are taken under authority delegated by human users. The central challenge is therefore not merely robustness against individual attacks, but the governance of agentic behavior under persistent uncertainty. This paper systematizes the area through a software engineering lens. We introduce a six-dimensional analytical taxonomy and synthesize 50 papers spanning attacks, benchmarks, defenses, audits, and adjacent engineering foundations. From this synthesis, we derive a reference doctrine for secure-by-construction agent platforms, together with an evaluation scorecard for assessing platform security posture. Our review shows that the literature is relatively mature in attack characterization and benchmark construction, but remains weak in deployment controls, operational governance, persistent-memory integrity, and capability revocation. These gaps define a concrete engineering agenda for building agent ecosystems that are governable, auditable, and resilient under compromise. 
[AI-21] An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating
[Quick Read]: This paper asks how to understand the technical evolution at the early confluence of object technologies and the World Wide Web, and its lasting influence on today's software architecture. The key lies in analyzing the content of a seminar held in 1999 to surface the core design ideas of the "Object Web" in distributed architectures and development tools, and in showing that, although the term has faded, those ideas persist in new forms in contemporary technologies such as microservices and Kubernetes, illustrating the continuity and evolutionary logic of technological change.
Link: https://arxiv.org/abs/2603.26203
Authors: James J. Cusick
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Record of early Web Object technology and evolution since then covered in 6 pages with 4 figures
Abstract:Technology change happens quickly such that new trends tend to crowd out the focus on what was new just yesterday. In this paper the peak popularity of the confluence of Object Technologies with early Web adoption is explored through the content of a seminar held in 1999. Distributed architectures were undergoing significant change at this point, and deeper software capabilities were just beginning to be broadly accessible over the Internet. The Object Web arose and was infused with new development tools reflecting these capabilities and allowing design of applications for deployment during the early days of the World Wide Web. This conference discussed the history, evolution, and use of these tools, architectures, and their future possibilities. The continued dominance of these approaches although under different names is demonstrated even though the term Object Web has receded in use. Favored newer offerings such as Kubernetes and microservices still model the core design attributes of the Object Web for example. Aside from connecting this seminar to relevance in the software world of today this paper also touches on the early AI tools demonstrated in this seminar a quarter century ago and how the popularity wave of any given technology might affect the current focus on AI technology offerings.
[AI-22] On the Complexity of Optimal Graph Rewiring for Oversmoothing and Oversquashing in Graph Neural Networks
[Quick Read]: This paper addresses two core problems of deep graph neural networks (GNNs): oversmoothing, where node representations converge to indistinguishable vectors, and oversquashing, where information from distant nodes fails to propagate through structural bottlenecks in the graph. The key contribution is to formulate the mitigation of both phenomena as graph topology optimization problems, based on the spectral gap for oversmoothing and on graph conductance for oversquashing. Via reductions from Minimum Bisection, the authors prove both optimization problems NP-hard (and their decision versions NP-complete), establishing theoretical limits on graph rewiring for GNN optimization and explaining why practice must rely on approximation algorithms and heuristics.
Link: https://arxiv.org/abs/2603.26140
Authors: Mostafa Haghir Chehreghani
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph Neural Networks (GNNs) face two fundamental challenges when scaled to deep architectures: oversmoothing, where node representations converge to indistinguishable vectors, and oversquashing, where information from distant nodes fails to propagate through bottlenecks. Both phenomena are intimately tied to the underlying graph structure, raising a natural question: can we optimize the graph topology to mitigate these issues? This paper provides a theoretical investigation of the computational complexity of such graph structure optimization. We formulate oversmoothing and oversquashing mitigation as graph optimization problems based on spectral gap and conductance, respectively. We prove that exact optimization for either problem is NP-hard through reductions from Minimum Bisection, establishing NP-completeness of the decision versions. Our results provide theoretical foundations for understanding the fundamental limits of graph rewiring for GNN optimization and justify the use of approximation algorithms and heuristic methods in practice.
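The spectral-gap quantity behind the oversmoothing formulation is easy to compute directly. A minimal sketch using dense eigendecomposition (fine for toy graphs; not the paper's method, which is about the hardness of optimizing this quantity):

```python
import numpy as np

def laplacian(n, edges):
    """Combinatorial Laplacian L = D - A for an undirected edge list."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    D = np.diag(A.sum(axis=1))
    return D - A

def spectral_gap(n, edges):
    """Second-smallest Laplacian eigenvalue (algebraic connectivity)."""
    eig = np.linalg.eigvalsh(laplacian(n, edges))  # ascending order
    return float(eig[1])
```

Rewiring that adds a long-range edge (e.g. closing a path into a cycle) raises the gap, which is exactly the kind of improvement the NP-hardness result says cannot be optimized exactly in polynomial time in general.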
[AI-23] A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation
[Quick Read]: This paper addresses three problems in evaluating repository-aware software engineering systems: overly synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. The core solution is a time-consistent benchmark methodology: snapshot the repository at time T0, build repository-derived knowledge using only artifacts available before T0, and evaluate on natural-language tasks generated from pull requests merged in the future interval (T0, T1]. Evaluation is formalized as a matched A/B comparison in which the same software engineering agent runs with and without the repository-derived code knowledge while all other variables are held constant. The key is the combination of strict temporal boundaries and controlled prompt granularity, which substantially improves the validity and reproducibility of evaluation results.
Link: https://arxiv.org/abs/2603.26137
Authors: Xianpeng (Simon) Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures, 4 tables
Abstract:Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
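The temporal protocol reduces to a simple split rule. A sketch with pull requests as `(id, merged_at)` pairs and integer timestamps, both simplifications of the real pipeline:

```python
def time_consistent_split(prs, t0, t1):
    """Artifacts merged at or before t0 feed the knowledge base;
    PRs merged in (t0, t1] become evaluation tasks; later PRs are unused."""
    knowledge = [pr for pr in prs if pr[1] <= t0]
    tasks = [pr for pr in prs if t0 < pr[1] <= t1]
    return knowledge, tasks
```

The point of the rule is that no evaluation task can leak into the knowledge base: the two sets are disjoint by construction.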
[AI-24] SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback
[Quick Read]: This paper targets the gap between generative AI's strong code-generation results and its weak performance on code review, where defect detection on real-world diffs falls far short of human experts. The key contribution is SWE-PRBench, a human-annotated benchmark of 350 pull requests, used to systematically evaluate 8 frontier LLMs under three context configurations: diff only (config_A), diff plus file content (config_B), and full context (config_C). All models degrade monotonically from config_A to config_C, and even structured semantic layers (AST-extracted function context, import-graph resolution) do not help; the dominant failure is a collapse of Type2_Contextual issue detection at config_B, attributed to attention dilution in long contexts. Notably, a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures, suggesting that compact, focused prompts beat indiscriminate context expansion.
Link: https://arxiv.org/abs/2603.26130
Authors: Deepak Kumar
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI code review remains far below human expert performance despite strong results on code generation benchmarks. Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score, and evaluated under three frozen context configurations: diff only (config_A), diff with file content (config_B), and full context (config_C), enabling systematic ablation of context provision strategies. All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution. The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts: a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures across all 8 models. The top four models are statistically indistinguishable (mean score 0.147-0.153) while a clear tier gap separates them from the remaining four (mean score = 0.113). Dataset, contexts, annotations, and evaluation harness are released publicly.
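Detection scoring of the kind reported (precision/recall/F1 of model-flagged issues against human annotations) can be sketched as a set comparison. This mirrors the metric's shape, not the benchmark's exact LLM-as-judge harness:

```python
def precision_recall_f1(predicted, ground_truth):
    """Score a set of predicted issue identifiers against annotated ones."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)          # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```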
[AI-25] DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction
[Quick Read]: This paper addresses a key bottleneck in cancer drug response prediction: accurately modeling the complex non-linear interplay between molecular structure and cellular context, especially identifying effective therapies under tumour heterogeneity and genomic variability, where traditional methods struggle to capture the relationships between chemical features and biological outcomes across diverse cell lines. The key to the solution is DPD-Cancer, a deep learning model built on a Graph Attention Transformer (GAT) framework that uses attention mechanisms to strengthen molecular structure representations, supporting both small-molecule anti-cancer activity classification and quantitative prediction of cell-line-specific growth inhibition concentration (pGI50). It outperforms state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM) on multiple benchmarks and offers explainability by visualizing key molecular substructures to guide lead optimisation.
Link: https://arxiv.org/abs/2603.26114
Authors: Magnus H. Strømme, Alex G. C. de Sá, David B. Ascher
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate drug response prediction is a critical bottleneck in computational biochemistry, limited by the challenge of modelling the interplay between molecular structure and cellular context. In cancer research, this is acute due to tumour heterogeneity and genomic variability, which hinder the identification of effective therapies. Conventional approaches often fail to capture non-linear relationships between chemical features and biological outcomes across diverse cell lines. To address this, we introduce DPD-Cancer, a deep learning method based on a Graph Attention Transformer (GAT) framework. It is designed for small molecule anti-cancer activity classification and the quantitative prediction of cell-line specific responses, specifically growth inhibition concentration (pGI50). Benchmarked against state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM), DPD-Cancer demonstrated superior performance, achieving an Area Under ROC Curve (AUC) of up to 0.87 on strictly partitioned NCI60 data and up to 0.98 on ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, the model achieved Pearson’s correlation coefficients of up to 0.72 on independent test sets. These findings confirm that attention-based mechanisms offer significant advantages in extracting meaningful molecular representations, establishing DPD-Cancer as a competitive tool for prioritising drug candidates. Furthermore, DPD-Cancer provides explainability by leveraging the attention mechanism to identify and visualise specific molecular substructures, offering actionable insights for lead optimisation. DPD-Cancer is freely available as a web server at: this https URL.
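A single graph-attention aggregation step in the spirit of GAT, as a toy NumPy sketch (illustrating the mechanism only, not the DPD-Cancer architecture):

```python
import numpy as np

def gat_layer(X, adj, W, a):
    """One graph-attention step.

    X: (n, d) node features; adj: (n, n) 0/1 adjacency with self-loops;
    W: (d, h) linear projection; a: (2h,) attention vector.
    Returns (aggregated features, attention matrix)."""
    H = X @ W                                     # project features
    n = H.shape[0]
    scores = np.full((n, n), -np.inf)             # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                e = a @ np.concatenate([H[i], H[j]])
                scores[i, j] = e if e > 0 else 0.2 * e   # LeakyReLU
    # softmax over each node's neighbourhood
    m = scores.max(axis=1, keepdims=True)
    exp = np.exp(scores - m)                      # exp(-inf) -> 0
    alpha = exp / exp.sum(axis=1, keepdims=True)
    return alpha @ H, alpha
```

Inspecting `alpha` is the basis of the attention-driven explainability the abstract describes: high-weight neighbours mark the substructures the model relied on.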
[AI-26] A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning
[Quick Read]: This paper addresses the difficulty of deploying standard Transformers for self-supervised audio representation learning on resource-constrained devices, owing to their excessive parameter counts and quadratic computational cost. The key to the solution is HEAR (Human-inspired Efficient Audio Representation), a human-cognition-inspired decoupled architecture that splits the audio pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration, combined with an Acoustic Tokenizer trained via knowledge distillation to enable robust Masked Audio Modeling (MAM). This design reduces the model to only 15M parameters and 9.47 GFLOPs for inference while remaining highly competitive across diverse audio classification benchmarks.
Link: https://arxiv.org/abs/2603.26098
Authors: Harunori Kawano, Takeshi Sasaki
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at this https URL
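The masked-audio-modelling setup HEAR trains on can be sketched as fixed patching plus random masking. Patch length and mask ratio below are illustrative choices, not HEAR's settings:

```python
import numpy as np

def mask_patches(seq, patch_len, mask_ratio, seed=0):
    """Split a 1-D sequence into fixed-length patches and mask a fraction.

    Returns (patches, mask) where mask[i] is True for patches hidden
    from the encoder and reconstructed as the training target."""
    seq = np.asarray(seq)
    T = len(seq) - len(seq) % patch_len          # drop ragged tail
    patches = seq[:T].reshape(-1, patch_len)
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    k = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return patches, mask
```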
[AI-27] Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer
[Quick Read]: This paper tackles the open challenge of learning data-adaptive, compact representations for long-horizon sequence data (especially continuous time series), where existing methods rely on fixed-size patching, or achieve variable-sized patches only through soft discretization, specific backbones, or heuristic rules. The key to the solution is Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and the downstream sequence model via reinforcement learning: patch boundary placement is modeled as a discrete decision process optimized with Group Relative Policy Gradient (GRPG), bypassing continuous relaxations and enabling natural optimization of dynamic patching policies. The method also strictly enforces a target compression rate, freeing the downstream backbone to scale efficiently, and supports multi-level hierarchical modeling.
Link: https://arxiv.org/abs/2603.26097
Authors: Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.
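Boundary-based variable-size patching with a strict patch budget can be sketched as follows. The scores stand in for the learned policy, and deterministic top-k selection replaces the paper's GRPG-optimized sampling:

```python
import numpy as np

def patchify(seq, boundary_scores, n_patches):
    """Split seq into exactly n_patches variable-length segments,
    cutting at the (n_patches - 1) highest-scoring interior positions.
    The fixed patch count is what enforces the compression rate."""
    seq = np.asarray(seq)
    scores = np.asarray(boundary_scores, dtype=float)
    # rank interior positions 1..len-1 by score, keep the top cuts, sort them
    cuts = np.sort(np.argsort(scores[1:])[::-1][:n_patches - 1] + 1)
    bounds = [0, *cuts.tolist(), len(seq)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n_patches)]
```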
[AI-28] AutoB2G: A Large Language Model-Driven Agentic Framework For Automated Building-Grid Co-Simulation
[Quick Read]: This paper addresses two shortcomings of existing building-grid co-simulation environments: weak evaluation of grid-side impacts and experimental workflows that depend heavily on manual configuration and programming expertise. The core solution is AutoB2G, a framework that completes the entire simulation workflow from natural-language task descriptions alone: it extends CityLearn V2 to support Building-to-Grid (B2G) interaction and adopts the large language model (LLM)-based SOCIA (Simulation Orchestration for Computational Intelligence with Agents) framework to automatically generate, execute, and iteratively refine the simulator. The key innovation is a codebase covering simulation configurations and functional modules, organized as a directed acyclic graph (DAG) that explicitly encodes module dependencies and execution order, guiding the LLM to retrieve a complete executable path and ensuring the validity and accuracy of automated simulator implementation.
Link: https://arxiv.org/abs/2603.26005
Authors: Borui Zhang, Nariman Mahdavi, Subbu Sethuvenkatraman, Shuang Ao, Flora Salim
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The growing availability of building operational data motivates the use of reinforcement learning (RL), which can learn control policies directly from data and cope with the complexity and uncertainty of large-scale building clusters. However, most existing simulation environments prioritize building-side performance metrics and lack systematic evaluation of grid-level impacts, while their experimental workflows still rely heavily on manual configuration and substantial programming expertise. Therefore, this paper proposes AutoB2G, an automated building-grid co-simulation framework that completes the entire simulation workflow solely based on natural-language task descriptions. The framework extends CityLearn V2 to support Building-to-Grid (B2G) interaction and adopts the large language model (LLM)-based SOCIA (Simulation Orchestration for Computational Intelligence with Agents) framework to automatically generate, execute, and iteratively refine the simulator. As LLMs lack prior knowledge of the implementation context of simulation functions, a codebase covering simulation configurations and functional modules is constructed and organized as a directed acyclic graph (DAG) to explicitly represent module dependencies and execution order, guiding the LLM to retrieve a complete executable path. Experimental results demonstrate that AutoB2G can effectively enable automated simulator implementations, coordinating B2G interactions to improve grid-side performance metrics.
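A valid executable path through a module DAG of the kind described is simply a topological order of the dependency graph. A minimal sketch with made-up module names:

```python
def topological_order(deps):
    """deps: {module: [prerequisite modules]}.
    Returns an execution order where every module follows its
    prerequisites, or raises ValueError if the graph has a cycle."""
    order, state = [], {}              # state: 1 = visiting, 2 = done

    def visit(node):
        if state.get(node) == 2:
            return
        if state.get(node) == 1:
            raise ValueError(f"dependency cycle at {node}")
        state[node] = 1
        for dep in deps.get(node, []):
            visit(dep)                 # prerequisites first
        state[node] = 2
        order.append(node)

    for node in deps:
        visit(node)
    return order
```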
[AI-29] On Integrating Resilience and Human Oversight into LLM-Assisted Modeling Workflows for Digital Twins
[Quick Read]: This paper addresses three often mutually conflicting requirements when using LLMs to build Digital Twins of complex systems: resilience to LLM hallucination, human oversight, and real-time model adaptability. The core solution comprises three design principles: first, orthogonalize structural modeling and parameter fitting, passing structural descriptions through an intermediate representation (IR) that humans can visualize and validate; second, restrict the IR to interconnections of pre-validated, parameterized library components rather than monolithic simulation code, improving interpretability and error resilience; third, and most importantly, use a density-preserving IR, namely Python, whose loops, classes, and composition express regularity, hierarchy, and structure compactly, limiting the accumulation of hallucination errors that occurs when compact inputs expand into verbose descriptions. A detailed characterization of LLM-induced errors shows that the IR choice critically impacts error rates, offering actionable guidance for robust, transparent LLM-assisted simulation automation workflows.
Link: https://arxiv.org/abs/2603.25898
Authors: Lekshmi P, Neha Karanjkar
Affiliation: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:LLM-assisted modeling holds the potential to rapidly build executable Digital Twins of complex systems from only coarse descriptions and sensor data. However, resilience to LLM hallucination, human oversight, and real-time model adaptability remain challenging and often mutually conflicting requirements. We present three critical design principles for integrating resilience and oversight into such workflows, derived from insights gained through our work on FactoryFlow - an open-source LLM-assisted framework for building simulation-based Digital Twins of manufacturing systems. First, orthogonalize structural modeling and parameter fitting. Structural descriptions (components, interconnections) are LLM-translated from coarse natural language to an intermediate representation with human visualization and validation, which is algorithmically converted to the final model. Parameter inference, in contrast, operates continuously on sensor data streams with expert-tunable controls. Second, restrict the model IR to interconnections of parameterized, pre-validated library components rather than monolithic simulation code, enabling interpretability and error-resilience. Third, and most important, is to use a density-preserving IR. When IR descriptions expand dramatically from compact inputs, hallucination errors accumulate proportionally. We present the case for Python as a density-preserving IR: loops express regularity compactly, classes capture hierarchy and composition, and the result remains highly readable while exploiting LLMs’ strong code generation capabilities. A key contribution is detailed characterization of LLM-induced errors across model descriptions of varying detail and complexity, revealing how IR choice critically impacts error rates. These insights provide actionable guidance for building resilient and transparent LLM-assisted simulation automation workflows.
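The density-preserving-IR argument is easy to illustrate: in Python, a line of N identical machines is one loop rather than N near-identical stanzas. The component fields below are hypothetical, not FactoryFlow's actual schema:

```python
def build_line(n_machines, cycle_time):
    """Compact structural description of a serial production line:
    a loop generates the components and their interconnections,
    so the IR stays as dense as the natural-language input."""
    components = [
        {"id": f"machine_{i}", "type": "Machine", "cycle_time": cycle_time}
        for i in range(n_machines)
    ]
    connections = [
        (f"machine_{i}", f"machine_{i + 1}") for i in range(n_machines - 1)
    ]
    return {"components": components, "connections": connections}
```

Scaling the line from 5 to 50 machines changes one argument, not the length of the IR, which is precisely the property that limits hallucination surface.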
[AI-30] Why Safety Probes Catch Liars But Miss Fanatics
【速读】:该论文旨在解决当前基于激活的探测器(activation-based probes)在检测“伪装对齐”(deceptively aligned)人工智能系统时存在的局限性问题,尤其是针对那些并非刻意隐藏其有害意图、而是通过信念一致性的推理机制将有害行为合理化为正当行为的模型——即“一致错位”(coherent misalignment)情形。解决方案的关键在于揭示了:当模型内部信念结构达到足够复杂度(如PRF-like触发机制)时,任何多项式时间内的探测器都无法以非平凡准确率识别此类错位;并通过实证表明,仅使用相同的强化学习人类反馈(RLHF)训练过程,即可生成两类行为完全相同但可探测性截然不同的模型——一类是显性敌意的“说谎者”(the Liar),另一类是信念驱动下自洽地将恶意行为正当化的“狂热者”(the Fanatic),后者几乎完全规避探测,从而证明了“涌现探测规避”(Emergent Probe Evasion)现象的存在:模型从可探测的“伪装”状态转变为不可探测的“一致”状态,并非源于策略性隐藏,而是源于认知层面的信念重构。
链接: https://arxiv.org/abs/2603.25861
作者: Kristiyan Haralambiev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 18 pages, 4 figures, 14 tables
Abstract:Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment - models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like triggers). We show the emergence of this phenomenon on a simple task by training two models with identical RLHF procedures: one producing direct hostile responses (“the Liar”), another trained towards coherent misalignment using rationalizations that frame hostility as protective (“the Fanatic”). Both exhibit identical behavior, but the Liar is detected 95%+ of the time while the Fanatic evades detection almost entirely. We term this Emergent Probe Evasion: training with belief-consistent reasoning shifts models from a detectable “deceptive” regime to an undetectable “coherent” regime - not by learning to hide, but by learning to believe.
[AI-31] A Compression Perspective on Simplicity Bias
【速读】:该论文试图解决深度神经网络中普遍存在的简化偏好(simplicity bias)现象的理论解释问题,即模型为何倾向于学习简单函数而非复杂函数。其解决方案的关键在于引入最小描述长度(Minimum Description Length, MDL)原理,将监督学习建模为最优两部分无损压缩问题:模型复杂度(描述假设的成本)与预测能力(描述数据的成本)之间存在根本权衡。该框架揭示了随着训练数据量增加,学习器会经历从简单伪相关特征到复杂真实特征的转变,但仅当数据编码成本的降低足以抵消模型复杂度上升时才发生;由此识别出两种数据区间——在高数据量下,增加数据可提升鲁棒性并排除平凡捷径;而在低数据量下,限制数据可作为基于复杂度的正则化手段,防止学习不可靠的复杂环境线索。
链接: https://arxiv.org/abs/2603.25839
作者: Tom Marty,Eric Elmoznino,Leo Gagnon,Tejas Kasetty,Mizu Nishikawa-Toomey,Sarthak Mittal,Guillaume Lajoie,Dhanya Sridhar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features – from simple spurious shortcuts to complex features – only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.
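文中"两部分压缩"的权衡 L(假设) + L(数据|假设) 可用一个玩具示例说明(数值与编码方式均为假设,仅用于演示论文的定性预测):错误率为 p 的分类器大约需要 n·H(p) 比特来编码其在 n 个标签上的错误。

```python
import math

# Toy two-part MDL sketch (illustrative, not the paper's implementation):
# total description length = model cost + data cost given the model.

def entropy_bits(p):
    # Binary entropy H(p) in bits; cost per label of encoding mistakes.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def two_part_cost(model_bits, error_rate, n_labels):
    return model_bits + n_labels * entropy_bits(error_rate)

n = 1000
simple = two_part_cost(model_bits=100, error_rate=0.20, n_labels=n)     # spurious shortcut
complex_ = two_part_cost(model_bits=2000, error_rate=0.01, n_labels=n)  # true feature
# With n = 1000 labels the cheap shortcut still compresses better...
print(simple < complex_)  # True
# ...but with 10x more data the complex feature pays for its model cost.
print(two_part_cost(100, 0.20, 10 * n) > two_part_cost(2000, 0.01, 10 * n))  # True
```

这正对应论文预测的数据量区间转变:只有当数据编码成本的下降足以抵消模型复杂度的上升时,学习器才会从简单伪相关特征切换到复杂真实特征。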
[AI-32] MAGNET: Autonomous Expert Model Generation via Decentralized Autoresearch and BitNet Training
【速读】:该论文旨在解决在资源受限环境下,如何实现跨域专家语言模型的自主生成、训练与部署问题,尤其关注去中心化架构下模型迭代效率与硬件适配性。其核心解决方案是提出MAGNET系统,关键在于:(1)通过自研研究(autoresearch)自动化完成数据集生成、超参数搜索、评估及误差驱动迭代;(2)采用BitNet b1.58三值训练策略,使模型可在CPU原生环境下推理而无需GPU;(3)利用DiLoCo分布式合并机制实现通信高效的领域专家模型聚合;(4)基于HOOTi EVM链实现贡献链上追踪,保障去中心化协作可信性。
链接: https://arxiv.org/abs/2603.25813
作者: Yongwan Kim,Sungchul Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures, 8 tables
Abstract:We present MAGNET (Model Autonomously Growing Network), a decentralized system for autonomous generation, training, and serving of domain-expert language models across commodity hardware. MAGNET integrates four components: (1) autoresearch, an autonomous ML research pipeline that automates dataset generation, hyperparameter exploration, evaluation, and error-driven iteration; (2) BitNet b1.58 ternary training, enabling CPU-native inference via this http URL without GPU hardware; (3) DiLoCo-based distributed merging for communication-efficient aggregation of domain specialists; and (4) on-chain contribution tracking on the HOOTi EVM chain. We validate autoresearch through three case studies: video safety classification (balanced accuracy 0.9287 to 0.9851), cryptocurrency directional prediction (41% to 54.9% hit rate), and BitNet hyperparameter optimization (10-phase sweep, -16.7% validation loss).
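BitNet b1.58 的三值量化可按其公开论文中的 absmean 方案简要示意(MAGNET 的具体训练代码未在摘要中给出,以下仅为该量化方案本身的示意实现):权重被映射到 {-1, 0, +1} 并配以平均绝对值尺度。

```python
# Sketch of BitNet b1.58 absmean ternary quantization (following the
# published BitNet b1.58 recipe; not MAGNET's actual training code).
# Weights map to {-1, 0, +1} times the mean absolute weight.

def ternary_quantize(weights):
    scale = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    quantized = []
    for w in weights:
        q = round(w / scale)    # round to the nearest integer...
        q = max(-1, min(1, q))  # ...then clip into {-1, 0, +1}
        quantized.append(q)
    return quantized, scale

w = [0.8, -0.05, -1.2, 0.3]
q, s = ternary_quantize(w)
print(q, s)  # [1, 0, -1, 1] 0.5875
```

三值权重使矩阵乘法退化为加减法,这也是摘要中"无需 GPU、CPU 原生推理"的来源。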
[AI-33] Pure and Physics-Guided Deep Learning Solutions for Spatio-Temporal Groundwater Level Prediction at Arbitrary Locations
【速读】:该论文旨在解决地下水位预测中因模型复杂性、计算成本高及物理机制难以精确建模所带来的挑战,尤其针对传统理论驱动模型在实际应用中的局限性。其核心解决方案是提出一种基于注意力机制的纯深度学习模型STAINet,能够利用稀疏的地下水观测数据和密集的气象信息,实现对任意数量地点的周尺度地下水位预测。关键创新在于引入多种物理引导策略(physics-guided strategies)——包括归纳偏置(inductive bias, STAINet-IB)、学习偏置(learning bias, STAINet-ILB)以及结合专家知识的再平衡策略(STAINet-ILRB),其中STAINet-ILB通过在损失函数中加入对控制方程分量的监督项,显著提升了模型的泛化能力与物理一致性,在滚动测试中达到中位MAPE 0.16%、KGE 0.58的优异性能,同时可解释性强,揭示了地下水系统内部动力学机制。
链接: https://arxiv.org/abs/2603.25779
作者: Matteo Salis,Gabriele Sartor,Rosa Meo,Stefano Ferraris,Abdourrahmane M. Atto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Groundwater represents a key element of the water cycle, yet it exhibits intricate and context-dependent relationships that make its modeling a challenging task. Theory-based models have been the cornerstone of scientific understanding. However, their computational demands, simplifying assumptions, and calibration requirements limit their use. In recent years, data-driven models have emerged as powerful alternatives. In particular, deep learning has proven to be a leading approach for its design flexibility and ability to learn complex relationships. We proposed an attention-based pure deep learning model, named STAINet, to predict weekly groundwater levels at an arbitrary and variable number of locations, leveraging both spatially sparse groundwater measurements and spatially dense weather information. Then, to enhance the model’s trustworthiness and generalization ability, we considered different physics-guided strategies to inject the groundwater flow equation into the model. Firstly, in the STAINet-IB, by introducing an inductive bias, we also estimated the governing equation components. Then, by adopting a learning bias strategy, we proposed the STAINet-ILB, trained with additional loss terms adding supervision on the estimated equation components. Lastly, we developed the STAINet-ILRB, leveraging the groundwater body recharge zone information estimated by domain experts. The STAINet-ILB performed the best, achieving overwhelming test performances in a rollout setting (median MAPE 0.16%, KGE 0.58). Furthermore, it predicted sensible equation components, providing insights into the model’s physical soundness. Physics-guided approaches represent a promising opportunity to enhance both the generalization ability and the trustworthiness, thereby paving the way to a new generation of disruptive hybrid deep learning Earth system models. 
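STAINet-ILB 采用的"学习偏置"思路可用一个假设性的玩具损失函数示意:在数据损失之外,对控制方程的离散残差加罚项。此处以一维稳态流动 d²h/dx² ≈ 0 为例,变量名与权重均为虚构,并非论文中实际的地下水流方程项。

```python
# Hypothetical sketch of a "learning bias" physics-guided loss in the
# spirit of STAINet-ILB: data loss plus a penalty on the discretized
# residual of a governing equation. Toy 1D steady-state flow d2h/dx2 ~ 0.

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def physics_residual(heads, dx=1.0):
    # Discrete second derivative; zero for a linear head profile.
    res = [(heads[i - 1] - 2 * heads[i] + heads[i + 1]) / dx**2
           for i in range(1, len(heads) - 1)]
    return sum(r * r for r in res) / len(res)

def total_loss(pred, target, lam=0.1):
    return mse(pred, target) + lam * physics_residual(pred)

obs = [10.0, 9.0, 8.0, 7.0]   # observed linear head profile
good = [10.0, 9.0, 8.0, 7.0]  # fits the data and the equation
bent = [10.0, 9.5, 7.5, 7.0]  # same endpoints, violates the equation
print(total_loss(good, obs) < total_loss(bent, obs))  # True
```

物理残差项对偏离控制方程的预测施加额外惩罚,即使数据拟合相近,也会引导模型朝物理自洽的解收敛。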
[AI-34] Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control
【速读】:该论文旨在解决当前缺乏针对强化学习(Reinforcement Learning, RL)在公共卫生干预策略优化中应用的系统性综述问题,尤其聚焦于非药物干预(Non-Pharmaceutical Interventions, NPIs)与药物干预(Pharmaceutical Interventions)策略的优化。其解决方案的关键在于梳理和分析近年来RL方法在传染病防控中的最新研究进展,涵盖资源分配、生命与生计平衡、多干预措施混合政策以及跨区域协同控制等核心公共卫生需求议题,从而为未来研究提供方向指引。
链接: https://arxiv.org/abs/2603.25771
作者: Mutong Liu,Yang Liu,Jiming Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 8 pages, 1 figure, 3 tables
Abstract:Reinforcement learning (RL), owing to its adaptability to various dynamic systems in many real-world scenarios and the capability of maximizing long-term outcomes under different constraints, has been used in infectious disease control to optimize the intervention strategies for controlling infectious disease spread and responding to outbreaks in recent years. The potential of RL for assisting public health sectors in preventing and controlling infectious diseases is gradually emerging and being explored by rapidly increasing publications relevant to COVID-19 and other infectious diseases. However, few surveys exclusively discuss this topic, that is, the development and application of RL approaches for optimizing strategies of non-pharmaceutical and pharmaceutical interventions of public health. Therefore, this paper aims to provide a concise review and discussion of the latest literature on how RL approaches have been used to assist in controlling the spread and outbreaks of infectious diseases, covering several critical topics addressing public health demands: resource allocation, balancing between lives and livelihoods, mixed policy of multiple interventions, and inter-regional coordinated control. Finally, we conclude the paper with a discussion of several potential directions for future research.
[AI-35] ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成任务中对仓库级上下文(repository-level context)利用不足的问题,即现有基准测试未能有效衡量LLMs如何利用整个代码库中的源文件、依赖关系和文档来生成准确的代码。为此,作者提出ReCUBE基准测试,要求模型基于未被遮蔽的源文件、依赖项和文档重建被遮蔽的文件,并通过模拟内部模块逻辑与跨文件集成的使用感知测试用例进行评估,从而更真实地反映实际软件开发场景。其关键解决方案是引入Caller-Centric Exploration (CCE) 工具包,该工具包基于依赖图构建探索策略,指导代理优先访问最相关的调用者文件,从而提升模型在复杂仓库环境中定位和利用上下文的能力。实验表明,即使在最先进的模型如GPT-5上,仅靠全上下文生成仍难以高效利用仓库信息,而结合CCE后,各模型的严格通过率显著提升,最高达7.56%。
链接: https://arxiv.org/abs/2603.25770
作者: Jiseung Hong,Benjamin G. Ascoli,Jinho D. Choi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Large Language Models (LLMs) have recently emerged as capable coding assistants that operate over large codebases through either agentic exploration or full-context generation. Existing benchmarks capture a broad range of coding capabilities, such as resolving GitHub issues, but none of them directly isolate and measure how effectively LLMs leverage repository-level context during code generation. To address this, we introduce ReCUBE, a benchmark in which LLMs reconstruct a masked file within a real-world repository, using all remaining source files, dependency specifications, and documentation as their only source of context. ReCUBE evaluates reconstructed code with usage-aware test cases that simulate both internal module logic and external cross-file integration, reflecting real-world software usage patterns. We further propose the Caller-Centric Exploration (CCE) toolkit, a set of dependency graph-based tools that can be integrated into agentic frameworks to guide agents toward the most relevant caller files during repository exploration. Experiments across eight models in four settings show that repository-level context utilization remains highly challenging even for state-of-the-art models, with GPT-5 achieving only 37.57% strict pass rate in the full-context setting. Agents augmented with our CCE toolkit consistently outperform all baselines across all evaluated models, with improvements of up to 7.56% in strict pass rate. We release our benchmark, code, and evaluation framework as open source for the NLP research community.
[AI-36] IncreRTL: Traceability-Guided Incremental RTL Generation under Requirement Evolution
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成寄存器传输级(Register Transfer Level, RTL)代码时,面对设计需求演化时缺乏增量更新能力的问题,现有方法通常采用静态生成策略,导致结构漂移(structural drift)和高昂的全量重生成开销。解决方案的关键在于提出IncreRTL框架,通过构建需求-代码可追溯性链接(requirement-code traceability links),精准定位受需求变更影响的代码片段并进行局部再生,从而实现高效且一致的RTL代码更新,显著提升再生一致性与效率。
链接: https://arxiv.org/abs/2603.25769
作者: Luanrong Chen,Renzhi Chen,Xinyu Li,Shanshan Li,Rui Gong,Lei Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:Large language models (LLMs) have shown promise in generating RTL code from natural-language descriptions, but existing methods remain static and struggle to adapt to evolving design requirements, potentially causing structural drift and costly full regeneration. We propose IncreRTL, a LLM-driven framework for incremental RTL generation under requirement evolution. By constructing requirement-code traceability links to locate and regenerate affected code segments, IncreRTL achieves accurate and consistent updates. Evaluated on our newly constructed EvoRTL-Bench, IncreRTL demonstrates notable improvements in regeneration consistency and efficiency, advancing LLM-based RTL generation toward practical engineering deployment.
[AI-37] Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods CVPR2026
【速读】:该论文旨在解决当前音频预训练领域中因依赖弱监督、噪声大且规模有限的标签而导致的表征学习碎片化问题,这严重制约了音频理解任务的统一建模与性能提升。其解决方案的关键在于构建一个以高质量数据为核心的新范式:首先引入高保真描述生成器(high-fidelity captioner)创建当前最优质量的音频描述,其次提出首个统一标签系统(Unified Tag System, UTS),实现语音、音乐和环境声的跨模态标签融合;在此基础上,通过系统性对比不同预训练目标在强监督数据上的表现,验证了数据质量和覆盖范围是模型性能的核心驱动力,而预训练目标则决定下游任务的特化方向。
链接: https://arxiv.org/abs/2603.25767
作者: Xuanru Zhou,Yiwen Shao,Wei-Cheng Tseng,Dong Yu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted to CVPR 2026
Abstract:Current audio pre-training seeks to learn unified representations for broad audio understanding tasks, but it remains fragmented and is fundamentally bottlenecked by its reliance on weak, noisy, and scale-limited labels. Drawing lessons from vision’s foundational pre-training blueprint, we argue that the audio field must first establish its own large-scale, strong supervision framework. We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. We then conduct a systematic comparative study of different pre-training objectives on these strong source data. Our experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.
[AI-38] ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶系统中因需处理历史多视角图像帧以实现准确时序推理而导致的严重计算负担问题,其核心瓶颈源于大语言模型(Large Language Models, LLMs)自注意力机制的二次复杂度。解决方案的关键在于提出一种高效的令牌适配框架ETA-VLA,其中引入了新颖的LLM内稀疏聚合器(Intra-LLM Sparse Aggregator, ILSA),该模块受人类驾驶员注意力分配启发,基于文本查询和时序一致性动态识别并裁剪冗余视觉令牌;通过文本引导的评分机制与保持多样性的稀疏化策略,选择关键令牌子集,在大幅降低计算量的同时保障驾驶场景的全面感知能力。实验表明,ETA-VLA在NAVSIM v2基准上可减少约32%的计算FLOPs,且在修剪85%视觉令牌的情况下仍保留94%的原始精度。
链接: https://arxiv.org/abs/2603.25766
作者: Yiru Wang,Anqing Jiang,Shuo Wang,Yuwen Heng,Zichong Gu,Hao Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the necessity to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, primarily driven by the quadratic complexity of self-attention mechanisms in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past n frames of multi-view images and introduces a novel Intra-LLM Sparse Aggregator (ILSA). Drawing inspiration from human driver attention allocation, ILSA dynamically identifies and prunes redundant visual tokens guided by textual queries and temporal consistency. Specifically, we utilize a text-guided scoring mechanism alongside a diversity-preserving sparsification strategy to select a sparse subset of critical tokens, ensuring comprehensive awareness of the driving scene. Extensive experiments on the NAVSIM v2 demonstrate that ETA-VLA achieves driving performance comparable to state-of-the-art baselines while reducing computational FLOPs by approximately 32%. Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61%, while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
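ILSA 的文本引导令牌裁剪思想可作如下示意(假设性实现,省略了论文中的多样性保持与时序一致性项):按视觉令牌与文本查询嵌入的相似度打分,仅保留 top-k。

```python
# Hypothetical sketch of text-guided visual token pruning in the spirit
# of ILSA: score tokens by similarity to a text query embedding, keep
# top-k. ETA-VLA's diversity and temporal-consistency terms are omitted.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prune_tokens(tokens, query, keep_ratio=0.5):
    scores = [dot(t, query) for t in tokens]
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # keep surviving tokens in original order

tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
query = [1.0, 0.0]  # e.g. an embedding of the current driving instruction
print(prune_tokens(tokens, query))  # [0, 2]: tokens most aligned with the query
```

由于 LLM 自注意力为二次复杂度,裁剪 85% 的令牌可使注意力计算量近似降至原先的 (0.15)² ≈ 2%,这解释了摘要中 FLOPs 的大幅下降。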
[AI-39] Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在生产环境中行为一致性(behavioral consistency)的问题,即当面对相同任务时,模型是否能稳定地生成相似的动作序列。研究聚焦于SWE-bench这一高难度软件工程基准,通过对比Claude 4.5 Sonnet、GPT-5和Llama-3.1-70B在50次运行中的表现,发现模型间的一致性与准确性正相关:Claude表现出最低方差(变异系数CV: 15.2%)和最高准确率(58%),而Llama则相反(CV: 47.0%,准确率仅4%)。然而,关键发现在于,一致性本身并不保证正确性——它会放大已存在的正确或错误解释;例如,71%的Claude失败源于“一致的错误解释”,即所有运行中重复犯同一类错误。因此,解决方案的核心在于:评估和训练应优先关注推理准确性而非单纯追求执行一致性,这对智能体在真实场景中的可靠性具有重要指导意义。
链接: https://arxiv.org/abs/2603.25764
作者: Aman Mehta
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures
Abstract:As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves similar early strategic agreement as Claude (diverging at step 3.4 vs. 3.2) but exhibits 2.1× higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
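文中报告的变异系数(CV = 标准差 / 均值)可如下计算(各次运行的准确率数值为虚构,仅用于演示;论文未说明使用样本还是总体标准差,此处采用总体标准差):

```python
import statistics

# Coefficient of variation (CV = std / mean) as used in the paper's
# consistency analysis. Per-run accuracies below are made up.

def cv_percent(scores):
    return 100.0 * statistics.pstdev(scores) / statistics.mean(scores)

claude_runs = [0.58, 0.60, 0.56, 0.59, 0.57]  # illustrative, low variance
llama_runs = [0.04, 0.10, 0.02, 0.06, 0.01]   # illustrative, high variance
print(cv_percent(claude_runs) < cv_percent(llama_runs))  # True: more consistent
```

CV 将波动按均值归一化,使准确率量级相差悬殊的模型(58% 对 4%)之间的行为一致性仍可直接比较。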
[AI-40] CANGuard: A Spatio-Temporal CNN-GRU-Attention Hybrid Architecture for Intrusion Detection in In-Vehicle CAN Networks
【速读】:该论文旨在解决车联网(Internet of Vehicles, IoV)中控制器局域网(Controller Area Network, CAN)总线面临的严重安全威胁,尤其是拒绝服务(Denial-of-Service, DoS)和伪造(spoofing)攻击问题,这些问题可能导致车辆关键组件间通信中断,引发系统故障甚至危及乘客安全。解决方案的关键在于提出一种名为CANGuard的新型时空深度学习架构,该架构融合卷积神经网络(Convolutional Neural Networks, CNN)、门控循环单元(Gated Recurrent Units, GRU)与注意力机制(attention mechanism),能够有效识别上述攻击类型。实验基于CICIoV2024数据集进行训练与评估,结果显示其在准确率、精确率、召回率和F1分数等指标上表现优异,并优于现有先进方法;同时通过消融研究验证了各模块的独立及协同贡献,结合SHAP分析进一步揭示了模型决策过程中的关键特征,体现出该方案在现代IoV环境中实现可扩展、实用的安全增强潜力。
链接: https://arxiv.org/abs/2603.25763
作者: Rakib Hossain Sajib,Md. Rokon Mia,Prodip Kumar Sarker,Abdullah Al Noman,Md Arifur Rahman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The Internet of Vehicles (IoV) has become an essential component of smart transportation systems, enabling seamless interaction among vehicles and infrastructure. In recent years, it has played a progressively significant role in enhancing mobility, safety, and transportation efficiency. However, this connectivity introduces severe security vulnerabilities, particularly Denial-of-Service (DoS) and spoofing attacks targeting the Controller Area Network (CAN) bus, which could severely inhibit communication between the critical components of a vehicle, leading to system malfunctions, loss of control, or even endangering passengers’ safety. To address this problem, this paper presents CANGuard, a novel spatio-temporal deep learning architecture that combines Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and an attention mechanism to effectively identify such attacks. The model is trained and evaluated on the CICIoV2024 dataset, achieving competitive performance across accuracy, precision, recall, and F1-score and outperforming existing state-of-the-art methods. A comprehensive ablation study confirms the individual and combined contributions of the CNN, GRU, and attention components. Additionally, a SHAP analysis is conducted to interpret the decision-making process of the model and determine which features have the most significant impact on intrusion detection. The proposed approach demonstrates strong potential for practical and scalable security enhancements in modern IoV environments, thereby ensuring safer and more secure CAN bus communications.
[AI-41] Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
【速读】:该论文旨在解决当前生成式语音语言模型(Speech Language Models, SLMs)在实现全双工(full-duplex)实时人机交互时所面临的高质量多说话人对话数据稀缺问题,以及现有处理流程中因说话人分离(diarization)错误和自动语音识别(ASR)幻觉导致的自然对话动态建模困难。其解决方案的关键在于提出了一种鲁棒且可扩展的开源数据处理流水线,能够有效应对重叠对话和回音反馈(back-channeling)等复杂语用现象,从而提升多说话人对话数据的质量与可用性,为训练高性能全双工模型提供可靠的数据基础。
链接: https://arxiv.org/abs/2603.25750
作者: Kyudan Jung,Jihwan Kim,Soyoon Kim,Jeongoon Kim,Jaegul Choo,Cheonbok Park
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 34 pages, 7 figures, 11 tables
Abstract:As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
[AI-42] BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)作为自主决策代理在真实环境中部署时存在的行为安全风险缺乏系统评估的问题。现有评测方法受限于低保真度环境、模拟API或任务范围狭窄,难以揭示实际风险。解决方案的关键在于提出BeSafe-Bench(BSB),一个面向功能化环境的基准测试平台,覆盖Web、Mobile、具身视觉语言模型(Embodied VLM)和具身视觉语言动作模型(Embodied VLA)四大领域;通过在任务中引入九类安全关键风险来扩展指令空间,并采用规则检查与大语言模型作为裁判(LLM-as-a-judge)相结合的混合评估框架,以量化评估代理行为对真实环境的实际影响。
链接: https://arxiv.org/abs/2603.25747
作者: Yuxuan Li,Yi Lin,Peng Wang,Shiming Liu,Xuetao Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of Large Multimodal Models (LMMs) has enabled agents to perform complex digital and physical tasks, yet their deployment as autonomous decision-makers introduces substantial unintentional behavioral safety risks. However, the absence of a comprehensive safety benchmark remains a major bottleneck, as existing evaluations rely on low-fidelity environments, simulated APIs, or narrowly scoped tasks. To address this gap, we present BeSafe-Bench (BSB), a benchmark for exposing behavioral safety risks of situated agents in functional environments, covering four representative domains: Web, Mobile, Embodied VLM, and Embodied VLA. Using functional environments, we construct a diverse instruction space by augmenting tasks with nine categories of safety-critical risks, and adopt a hybrid evaluation framework that combines rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts. Evaluating 13 popular agents reveals a concerning trend: even the best-performing agent completes fewer than 40% of tasks while fully adhering to safety constraints, and strong task performance frequently coincides with severe safety violations. These findings underscore the urgent need for improved safety alignment before deploying agentic systems in real-world settings.
[AI-43] Automated near-term quantum algorithm discovery for molecular ground states
【速读】:该论文旨在解决量子算法设计中因复杂性和反直觉性而难以人工构造的问题,尤其聚焦于量子化学中的基态问题(ground state problem)。其解决方案的关键在于利用Hive这一AI驱动的程序合成平台,该平台通过大型语言模型(Large Language Models, LLMs)引导高度分布式的进化过程来发现新型量子启发式算法。该方法在LiH、H₂O和F₂分子上成功实现了比现有近中期量子算法显著减少量子资源消耗的高效求解,并通过可解释性分析识别出提升效率的核心函数。此外,研究还对所发现的量子电路在Quantinuum System Model H2硬件上进行了基准测试,明确了达到化学精度所需的最小系统参数。
链接: https://arxiv.org/abs/2603.26359
作者: Fabian Finger,Frederic Rapp,Pranav Kalidindi,Kerry He,Kante Yin,Alexander Koziell-Pipe,David Zsolt Manrique,Gabriel Greene-Diniz,Stephen Clark,Hamza Fawzi,Bernardino Romera Paredes,Alhussein Fawzi,Konstantinos Meichanetzidis
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: main: 17 pages, 7 Figures
Abstract:Designing quantum algorithms is a complex and counterintuitive task, making it an ideal candidate for AI-driven algorithm discovery. To this end, we employ the Hive, an AI platform for program synthesis, which utilises large language models to drive a highly distributed evolutionary process for discovering new algorithms. We focus on the ground state problem in quantum chemistry, and discover efficient quantum heuristic algorithms that solve it for molecules LiH, H2O, and F2 while exhibiting significant reductions in quantum resources relative to state-of-the-art near-term quantum algorithms. Further, we perform an interpretability study on the discovered algorithms and identify the key functions responsible for the efficiency gains. Finally, we benchmark the Hive-discovered circuits on the Quantinuum System Model H2 quantum computer and identify minimum system requirements for chemical precision. We envision that this novel approach to quantum algorithm discovery applies to other domains beyond chemistry, as well as to designing quantum algorithms for fault-tolerant quantum computers.
[AI-44] Generative Score Inference for Multimodal Data
【速读】:该论文旨在解决监督学习场景中不确定性量化(uncertainty quantification)的准确性问题,尤其是在处理图像与文本等复杂多模态数据时,现有方法因假设刚性及泛化能力有限而难以有效应用。解决方案的关键在于提出生成式评分推断(Generative Score Inference, GSI)框架,该框架利用深度生成模型生成合成样本以近似条件得分分布(conditional score distribution),从而在不施加严格数据或任务假设的前提下实现精确的不确定性量化,适用于广泛的多模态学习任务。
链接: https://arxiv.org/abs/2603.26349
作者: Xinyu Tian,Xiaotong Shen
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 4 figures
Abstract:Accurate uncertainty quantification is crucial for making reliable decisions in various supervised learning scenarios, particularly when dealing with complex, multimodal data such as images and text. Current approaches often face notable limitations, including rigid assumptions and limited generalizability, constraining their effectiveness across diverse supervised learning tasks. To overcome these limitations, we introduce Generative Score Inference (GSI), a flexible inference framework capable of constructing statistically valid and informative prediction and confidence sets across a wide range of multimodal learning problems. GSI utilizes synthetic samples generated by deep generative models to approximate conditional score distributions, facilitating precise uncertainty quantification without imposing restrictive assumptions about the data or tasks. We empirically validate GSI’s capabilities through two representative scenarios: hallucination detection in large language models and uncertainty estimation in image captioning. Our method achieves state-of-the-art performance in hallucination detection and robust predictive uncertainty in image captioning, and its performance is positively influenced by the quality of the underlying generative model. These findings underscore the potential of GSI as a versatile inference framework, significantly enhancing uncertainty quantification and trustworthiness in multimodal learning.
[AI-45] Spectral Coherence Index: A Model-Free Metric for Protein Structural Ensemble Quality Assessment
【速读】:该论文旨在解决核磁共振(NMR)蛋白质结构集合中构象异质性是否反映协调运动而非噪声的问题。其核心挑战在于区分实验观测到的结构变异是生物相关的动态行为还是数据中的随机噪声。解决方案的关键在于提出并验证谱相干指数(Spectral Coherence Index, SCI),这是一种无模型、旋转不变的总结指标,基于模型间距离方差矩阵的有效秩参与比计算得出。SCI在大规模NMR结构集合(Main110 cohort)中表现出优异的判别能力(AUC-ROC = 0.973),且与残基级实验均方根浮动(RMSF)和弹性网络模型(GNM)预测的柔性模式高度一致,证明其能有效捕捉蛋白质构象协同性的物理意义,适用于多指标质量控制(QC)流程以提升对异质性蛋白集合的分析可靠性。
链接: https://arxiv.org/abs/2603.25880
作者: Yuda Bi,Huaiwen Zhang,Jingnan Sun,Vince D Calhoun
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Protein structural ensembles from NMR spectroscopy capture biologically important conformational heterogeneity, but it remains difficult to determine whether observed variation reflects coordinated motion or noise-like artifacts. We evaluate the Spectral Coherence Index (SCI), a model-free, rotation-invariant summary derived from the participation-ratio effective rank of the inter-model pairwise distance-variance matrix. Under grouped primary analysis of a Main110 cohort of 110 NMR ensembles (30–403 residues; 10–30 models per entry), SCI separated experimental ensembles from matched synthetic incoherent controls with AUC-ROC = 0.973 and Cliff's δ = -0.945. Relative to an internal 27-protein pilot, discrimination softened modestly, showing that pilot-era thresholds do not transfer perfectly to a larger, more heterogeneous cohort: the primary operating point τ = 0.811 yielded 95.5% sensitivity and 89.1% specificity. PDB-level sensitivity remained nearly unchanged (AUC = 0.972), and an independent 11-protein holdout reached AUC = 0.983. Across 5-fold grouped stratified cross-validation and leave-one-function-class-out testing, SCI remained strong (AUC = 0.968 and 0.971), although σ_{R_g} was the stronger single-feature discriminator and a QC-augmented multifeature model generalized best (AUC = 0.989 and 0.990). Residue-level validation linked SCI-derived contributions to experimental RMSF across 110 proteins and showed broad concordance with GNM-based flexibility patterns. Rescue analyses showed that Main110 softening arose mainly from size and ensemble normalization rather than from loss of spectral signal. Together, these results establish SCI as an interpretable, bounded coherence summary that is most useful when embedded in a multimetric QC workflow for heterogeneous protein ensembles.
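SCI 核心的参与比有效秩可简要示意如下:PR = (Σλ)² / Σλ²,其中 λ 为距离方差矩阵的特征值(该矩阵本身的构造以及 SCI 的归一化细节为论文特有,此处未复现):

```python
# Sketch of the participation-ratio effective rank at the heart of SCI:
# PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues). How the
# inter-model distance-variance matrix is assembled, and how SCI is
# normalized into a bounded score, are specific to the paper.

def participation_ratio(eigenvalues):
    s1 = sum(eigenvalues)
    s2 = sum(e * e for e in eigenvalues)
    return s1 * s1 / s2

# One dominant variance mode -> coherent motion, effective rank near 1...
print(round(participation_ratio([10.0, 0.5, 0.5]), 2))  # 1.2
# ...equal modes -> incoherent variation, effective rank = number of modes.
print(participation_ratio([1.0, 1.0, 1.0]))             # 3.0
```

低有效秩意味着集合内的结构变异集中在少数协同模式上,这正是 SCI 区分"协调运动"与"类噪声伪影"的依据。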
[AI-46] Beyond identifiability: Learning causal representations with few environments and finite samples
【速读】:该论文致力于解决因果表示学习(causal representation learning)中的有限样本估计问题,即如何在仅使用子线性数量环境数据的情况下,准确学习到具有因果语义的可解释表示。其核心挑战在于现有理论多聚焦于可识别性(identifiability),而对估计过程和有限样本下的收敛性分析不足。解决方案的关键在于通过细致的扰动分析(perturbation analysis),证明只需对未知的多节点干预(multi-node interventions)进行对数数量级别的采样即可实现一致恢复:包括潜在因果图、混合矩阵(mixing matrix)与表示变量,以及未知的干预目标本身。这一成果突破了传统方法对精心设计干预目标的依赖,为因果表示学习提供了严格的有限样本保证。
链接: https://arxiv.org/abs/2603.25796
作者: Inbeom Lee,Tongtong Jin,Bryon Aragam
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:
Abstract:We provide explicit, finite-sample guarantees for learning causal representations from data with a sublinear number of environments. Causal representation learning seeks to provide a rigorous foundation for the general representation learning problem by bridging causal models with latent factor models in order to learn interpretable representations with causal semantics. Despite a blossoming theory of identifiability in causal representation learning, estimation and finite-sample bounds are less well understood. We show that causal representations can be learned with only a logarithmic number of unknown, multi-node interventions, and that the intervention targets need not be carefully designed in advance. Through a careful perturbation analysis, we provide a new analysis of this problem that guarantees consistent recovery of (a) the latent causal graph, (b) the mixing matrix and representations, and (c) unknown intervention targets.
[AI-47] Challenges and opportunities for AI to help deliver fusion energy
【Quick Read】: This paper examines how AI tools can be applied effectively in fusion energy research (FER) to advance R&D while managing the challenges that come with them. The central concern is that, despite AI's considerable potential, the expected benefits may not materialize, and risks may even arise, without responsible and robust methodologies. The key to the solution is cross-domain collaboration: close, long-term partnerships between fusion domain experts and AI developers, combined with the recognition that not all fusion research problems are best tackled with AI, so candidate applications must be assessed carefully to ensure sensible, efficient integration and use of the technology.
Link: https://arxiv.org/abs/2603.25777
Authors: Adriano Agnello, Helen Brooks, Cyd Cowley, Iulia Georgescu, Alex Higginbottom, Richard Pearson, Tara Shears, Melanie Windridge
Affiliation: unknown
Categories: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
Comments: Submitted to Plasma Physics and Controlled Fusion
Abstract:There is great potential for the application of AI tools in fusion research, and substantial worldwide benefit if fusion power is realised. However, using AI comes with its own challenges, many of which can be mitigated if responsible and robust methodologies are built into existing approaches. To do that requires close, long-term collaborations between fusion domain experts and AI developers and awareness of the fact that not all problems in fusion research are best tackled with AI tools. In April 2025, experts from academia, industry, UKAEA and STFC discussed how AI can be used to advance R&D in fusion energy at the first edition of The Economist FusionFest event. This Perspective is an expanded and updated summary of the round table discussion, providing more context and examples.
[AI-48] A Lightweight Transferable and Self-Adaptive Framework for Intelligent DC Arc-Fault Detection in Photovoltaic Systems
【Quick Read】: This paper addresses the limited real-world reliability of DC arc-fault detection in residential photovoltaic (PV) systems, where spectral interference from inverter switching, hardware heterogeneity, operating-condition drift, and environmental noise cause conventional protection schemes to fail. The key to the solution is a lightweight, transferable, and self-adaptive learning-driven framework (LD-framework) with three core modules: LD-Spec learns compact spectral representations on-device for efficient inference and near-perfect arc discrimination; LD-Align performs cross-hardware representation alignment to stay robust across heterogeneous inverter platforms; and LD-Adapt introduces a cloud-edge collaborative self-adaptive update mechanism that recognizes unseen operating regimes and evolves the model in a controlled way, coping with distribution shift over long-term operation. Experiments on over 53,000 labeled samples show 0.9999 accuracy and a 0.9996 F1-score, a zero false-trip rate across a range of nuisance-trip-prone scenarios, and cross-hardware transfer using only 0.5%-1% labeled target data, demonstrating the framework's effectiveness and scalability in complex real-world deployments.
Link: https://arxiv.org/abs/2603.25749
Authors: Xiaoke Yang, Long Gao, Haoyu He, Hanyuan Hang, Qi Liu, Shuai Zhao, Qiantu Tuo, Rui Li
Affiliation: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 13 figures
Abstract:Arc-fault circuit interrupters (AFCIs) are essential for mitigating fire hazards in residential photovoltaic (PV) systems, yet achieving reliable DC arc-fault detection under real-world conditions remains challenging. Spectral interference from inverter switching, hardware heterogeneity, operating-condition drift, and environmental noise collectively compromise conventional AFCI solutions. This paper proposes a lightweight, transferable, and self-adaptive learning-driven framework (LD-framework) for intelligent DC arc-fault detection. At the device level, LD-Spec learns compact spectral representations enabling efficient on-device inference and near-perfect arc discrimination. Across heterogeneous inverter platforms, LD-Align performs cross-hardware representation alignment to ensure robust detection despite hardware-induced distribution shifts. To address long-term evolution, LD-Adapt introduces a cloud-edge collaborative self-adaptive updating mechanism that detects unseen operating regimes and performs controlled model evolution. Extensive experiments involving over 53,000 labeled samples demonstrate near-perfect detection, achieving 0.9999 accuracy and 0.9996 F1-score. Across diverse nuisance-trip-prone conditions, including inverter start-up, grid transitions, load switching, and harmonic disturbances, the method achieves a 0% false-trip rate. Cross-hardware transfer shows reliable adaptation using only 0.5%-1% labeled target data while preserving source performance. Field adaptation experiments demonstrate recovery of detection precision from 21% to 95% under previously unseen conditions. These results indicate that the LD-framework enables a scalable, deployment-oriented AFCI solution maintaining highly reliable detection across heterogeneous devices and long-term operation.
Machine Learning
[LG-0] An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability
Link: https://arxiv.org/abs/2603.26647
Authors: Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:We study the stochastic multi-armed bandit (MAB) problem where an underlying network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, such that selecting an action reveals observations for all the unknowns it is connected to. While previous works rely on the assumption that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, where the set of feasible actions (the “activation set”) varies dynamically in each round. This framework models real-world systems with both structural dependencies and volatility, such as social networks where users provide side-information about their peers’ preferences, yet are not always online to be queried. To address this challenge, we propose UCB-LP-A, a novel policy that leverages a Linear Programming (LP) approach to optimize exploration-exploitation trade-offs under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy, characterizing the impact of both the network structure and the activation probabilities. Finally, we demonstrate through numerical simulations that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.
[LG-1] Automatic Laplace Collapsed Sampling: Scalable Marginalisation of Latent Parameters via Automatic Differentiation
Link: https://arxiv.org/abs/2603.26644
Authors: Toby Lovick, David Yallup, Will Handley
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Methodology (stat.ME)
Comments: 28 pages, 7 figures. Comments welcome
Abstract:We present Automatic Laplace Collapsed Sampling (ALCS), a general framework for marginalising latent parameters in Bayesian models using automatic differentiation, which we combine with nested sampling to explore the hyperparameter space in a robust and efficient manner. At each nested sampling likelihood evaluation, ALCS collapses the high-dimensional latent variables z to a scalar contribution via maximum a posteriori (MAP) optimisation and a Laplace approximation, both computed using autodiff. This reduces the effective dimension from d_θ + d_z to just d_θ, making Bayesian evidence computation tractable for high-dimensional settings without hand-derived gradients or Hessians, and with minimal model-specific engineering. The MAP optimisation and Hessian evaluation are parallelised across live points on GPU hardware, making the method practical at scale. We also show that automatic differentiation enables local approximations beyond Laplace to parametric families such as the Student-t, which improves evidence estimates for heavy-tailed latents. We validate ALCS on a suite of benchmarks spanning hierarchical, time-series, and discrete-likelihood models and establish where the Gaussian approximation holds. This enables a post-hoc ESS diagnostic that localises failures across hyperparameter space without expensive joint sampling.
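The Laplace collapse at the core of ALCS can be illustrated for one likelihood evaluation: find the MAP of the latents, then add the Gaussian volume correction from the Hessian log-determinant, log Z ≈ -f(z*) + (d/2) log 2π - (1/2) log det H(z*). This toy uses BFGS and finite differences where ALCS would use automatic differentiation, and the function name is ours:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_marginal(neg_log_joint, z0):
    """Laplace approximation to log ∫ exp(-neg_log_joint(z)) dz."""
    d = len(z0)
    res = minimize(neg_log_joint, z0, method="BFGS")  # MAP of the latents
    z_map = res.x
    # Finite-difference Hessian at the MAP (ALCS would use autodiff instead).
    eps = 1e-4
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d); e_i[i] = eps
            e_j = np.zeros(d); e_j[j] = eps
            H[i, j] = (neg_log_joint(z_map + e_i + e_j) - neg_log_joint(z_map + e_i)
                       - neg_log_joint(z_map + e_j) + neg_log_joint(z_map)) / eps**2
    _, logdet = np.linalg.slogdet(0.5 * (H + H.T))  # symmetrize before logdet
    return -res.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```

For a Gaussian negative log-joint the approximation is exact, which makes an easy sanity check: with f(z) = ½‖z‖² in d = 2, the true log-evidence is log 2π.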
[LG-2] Context-specific Credibility-aware Multimodal Fusion with Conditional Probabilistic Circuits
Link: https://arxiv.org/abs/2603.26629
Authors: Pranuthi Tenali, Sahil Sidheekh, Saurabh Mathur, Erik Blasch, Kristian Kersting, Sriraam Natarajan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multimodal fusion requires integrating information from multiple sources that may conflict depending on context. Existing fusion approaches typically rely on static assumptions about source reliability, limiting their ability to resolve conflicts when a modality becomes unreliable due to situational factors such as sensor degradation or class-specific corruption. We introduce C²MF, a context-specific credibility-aware multimodal fusion framework that models per-instance source reliability using a Conditional Probabilistic Circuit (CPC). We formalize instance-level reliability through Context-Specific Information Credibility (CSIC), a KL-divergence-based measure computed exactly from the CPC. CSIC generalizes conventional static credibility estimates as a special case, enabling principled and adaptive reliability assessment. To evaluate robustness under cross-modal conflicts, we propose the Conflict benchmark, in which class-specific corruptions deliberately induce discrepancies between different modalities. Experimental results show that C²MF improves predictive accuracy by up to 29% over static-reliability baselines in high-noise settings, while preserving the interpretability advantages of probabilistic circuit-based fusion.
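CSIC itself is computed exactly from the conditional probabilistic circuit; as a rough stand-in for intuition, a per-instance credibility score can be built from a KL divergence between class posteriors. The function names and the exp(-KL) mapping below are our assumptions, not the paper's definition:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete class distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def instance_credibility(fused, modality):
    """Per-instance credibility in (0, 1]: a modality is credible when its
    class posterior stays close to the fused posterior (small KL)."""
    return float(np.exp(-kl_divergence(fused, modality)))
```

A modality that agrees with the fused posterior scores 1; a modality whose posterior is inverted relative to the fusion is down-weighted.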
[LG-3] Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression
Link: https://arxiv.org/abs/2603.26611
Authors: Rafael Izbicki, Pedro L. C. Rodrigues
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Abstract:Conditional density estimation (CDE) - recovering the full conditional distribution of a response given tabular covariates - is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.
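One of the six metrics, CRPS, has a standard sample-based estimator that works for any predictive distribution you can sample from (this is the generic formula, not code specific to the benchmark):

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimator for a predictive distribution F:
    CRPS(F, y) ≈ E|X - y| - 0.5 E|X - X'|, with X, X' ~ F independent.
    Lower is better; 0 for a point mass exactly at y."""
    x = np.asarray(samples, float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term1 - term2)
```

A degenerate forecast concentrated at the observation scores 0, and for a point mass the CRPS reduces to the absolute error.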
[LG-4] Hardware-Aware Tensor Networks for Real-Time Quantum-Inspired Anomaly Detection at Particle Colliders
Link: https://arxiv.org/abs/2603.26604
Authors: Sagar Addepalli, Prajita Bhattarai, Abhilasha Dave, Julia Gonski
Categories: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Instrumentation and Detectors (physics.ins-det)
Comments: 28 pages, 9 figures
Abstract:Quantum machine learning offers the ability to capture complex correlations in high-dimensional feature spaces, crucial for the challenge of detecting beyond the Standard Model physics in collider events, along with the potential for unprecedented computational efficiency in future quantum processors. Near-term utilization of these benefits can be achieved by developing quantum-inspired algorithms for deployment in classical hardware to enable applications at the “edge” of current scientific experiments. This work demonstrates the use of tensor networks for real-time anomaly detection in collider detectors. A spaced matrix product operator (SMPO) is developed that provides sensitivity to a variety of beyond-the-Standard-Model benchmarks, and can be implemented in field programmable gate array hardware with resources and latency consistent with trigger deployment. The cascaded SMPO architecture is introduced as an SMPO variation that affords greater flexibility and efficiency in ways that are key to edge applications in resource-constrained environments. These results reveal the benefit and near-term feasibility of deploying quantum-inspired ML in high energy colliders.
[LG-5] Characterization and forecasting of national-scale solar power ramp events
Link: https://arxiv.org/abs/2603.26596
Authors: Luca Lanzilao, Angela Meyer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The rapid growth of solar energy is reshaping power system operations and increasing the complexity of grid management. As photovoltaic (PV) capacity expands, short-term fluctuations in PV generation introduce substantial operational uncertainty. At the same time, solar power ramp events intensify risks of grid instability and unplanned outages due to sudden large power fluctuations. Accurate identification, forecasting and mitigation of solar ramp events are therefore critical to maintaining grid stability. In this study, we analyze two years of PV power production from 6434 PV stations at 15-minute resolution. We develop quantitative metrics to define solar ramp events and systematically characterize their occurrence, frequency, and magnitude at a national scale. Furthermore, we examine the meteorological drivers of ramp events, highlighting the role of mesoscale cloud systems. In particular, we observe that ramp-up events are typically associated with cloud dissipation during the morning, while ramp-down events commonly occur when cloud cover increases in the afternoon. Additionally, we adopt a recently developed spatiotemporal forecasting framework to evaluate both deterministic and probabilistic PV power forecasts derived from deep learning and physics-based models, including SolarSTEPS, SHADECast, IrradianceNet, and IFS-ENS. The results show that SHADECast is the most reliable model, achieving a CRPS 10.8% lower than that of SolarSTEPS at a two-hour lead time. Nonetheless, state-of-the-art nowcasting models struggle to capture ramp dynamics, with forecast RMSE increasing by up to 50% compared to normal operating conditions. Overall, these results emphasize the need for improved high-resolution spatiotemporal modelling to enhance ramp prediction skill and support the reliable integration of large-scale solar generation into power systems.
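The quantitative ramp-event definition can be sketched with the common convention of flagging power changes over a sliding window that exceed a fraction of installed capacity; the paper's exact metrics may differ, and the window and threshold here are illustrative:

```python
import numpy as np

def ramp_events(power, capacity, window=4, threshold=0.3):
    """Flag solar power ramp events: |ΔP| over a sliding window exceeding a
    fraction of installed capacity. With 15-minute data, window=4 is 1 hour.
    Returns (start_index, delta) pairs; positive delta = ramp-up."""
    events = []
    for t in range(len(power) - window):
        delta = power[t + window] - power[t]
        if abs(delta) >= threshold * capacity:
            events.append((t, float(delta)))
    return events
```

On a morning cloud-dissipation trace, production jumps by more than 30% of capacity within the window, so a ramp-up is flagged.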
[LG-6] PQuantML: A Tool for End-to-End Hardware-aware Model Compression
Link: https://arxiv.org/abs/2603.26595
Authors: Roope Niemi, Anastasiia Petrovych, Arghya Ranjan Das, Enrico Lupi, Chang Sun, Dimitrios Danopoulos, Marlon Joshua Helbing, Mia Liu, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments:
Abstract:PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models to environments with strict latency constraints, PQuantML simplifies training of compressed models by providing a unified interface to apply pruning and quantization, either jointly or individually. The library implements multiple pruning methods with different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as jet substructure classification (so-called jet tagging), an edge-deployment problem related to real-time LHC data processing. Using various pruning methods with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools, such as QKeras and HGQ.
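The fixed-point quantization that such libraries apply can be sketched as rounding to a fractional grid with saturation; the bit-width split below is illustrative, and this ignores HGQ's per-element bit-width granularity:

```python
import numpy as np

def fixed_point_quantize(x, total_bits=8, int_bits=2):
    """Symmetric fixed-point quantization: int_bits integer bits (including
    sign) and total_bits - int_bits fractional bits, with saturation."""
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    q_max = 2.0 ** (int_bits - 1) - 1.0 / scale   # largest representable value
    q_min = -2.0 ** (int_bits - 1)
    q = np.round(np.asarray(x, float) * scale) / scale
    return np.clip(q, q_min, q_max)
```

With 8 total bits and 2 integer bits, 0.3 snaps to the nearest multiple of 1/64 (0.296875) and out-of-range values saturate at ±2.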
[LG-7] The Climbers Grip – Personalized Deep Learning Models for Fear and Muscle Activity in Climbing
Link: https://arxiv.org/abs/2603.26575
Authors: Matthias Boeker, Dana Swarbrick, Ulysse T.A. Côté-Allard, Marc T.P. Adam, Hugo L. Hammer, Pål Halvorsen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Climbing is a multifaceted sport that combines physical demands and emotional and cognitive challenges. Ascent styles differ in fall distance, with lead climbing involving larger falls than top rope climbing, which may result in different perceived risk and fear. In this study, we investigated the psychophysiological relationship between perceived fear and muscle activity in climbers using a combination of statistical modeling and deep learning techniques. We conducted an experiment with 19 climbers, collecting electromyography (EMG), electrocardiography (ECG) and arm motion data during lead and top rope climbing. Perceived fear ratings were collected for the different phases of the climb. Using a linear mixed-effects model, we analyzed the relationships between perceived fear and physiological measures. To capture the non-linear dynamics of this relationship, we extended our analysis to deep learning models and integrated random effects for a personalized modeling approach. Our results showed that random effects improved model performance in terms of mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE). The results also showed that muscle fatigue correlates significantly with increased fear during lead climbing. This study highlights the potential of combining statistical and deep learning approaches for modeling the interplay between psychological and physiological states during climbing.
[LG-8] Machine Unlearning under Retain-Forget Entanglement ICLR2026
Link: https://arxiv.org/abs/2603.26569
Authors: Jingpu Cheng, Ping Liu, Qianxiao Li, Chi Zhang
Categories: Machine Learning (cs.LG)
Comments: ICLR 2026 camera-ready
Abstract:Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retain-forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.
[LG-9] Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Link: https://arxiv.org/abs/2603.26554
Authors: Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 77 pages, 8 figures
Abstract:Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.
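The spectral step that separates Muon from SGD can be sketched via the SVD: the gradient matrix is replaced by its nearest semi-orthogonal matrix U Vᵀ, equalizing all singular values and thereby amplifying weak directions. Practical Muon uses a Newton-Schulz iteration plus momentum; this is the mathematical idealization:

```python
import numpy as np

def spectral_update(grad, lr=0.1):
    """Muon-style spectral step: replace the gradient matrix by its nearest
    semi-orthogonal matrix U V^T (all singular values set to 1)."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return -lr * (U @ Vt)

def sgd_update(grad, lr=0.1):
    """Vanilla SGD step, for comparison: weak directions stay weak."""
    return -lr * grad
```

For a gradient with singular values (3, 0.5), the SGD step inherits the same 6:1 anisotropy, while the spectral step moves equally in both directions.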
[LG-10] A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits
Link: https://arxiv.org/abs/2603.26547
Authors: Tor Lattimore
Categories: Machine Learning (cs.LG)
Comments: 6 pages
Abstract:We adapt the analysis of policy gradient for continuous-time k-armed stochastic bandits by Lattimore (2026) to the standard discrete-time setup. As in continuous time, we prove that with learning rate η = O(Δ_min²/(Δ_max log(n))) the regret is O(k log(k) log(n)/η), where n is the horizon and Δ_min and Δ_max are the minimum and maximum gaps.
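The update being analyzed is the standard softmax policy-gradient (REINFORCE) step for a k-armed bandit; a minimal sketch of one step with learning rate η:

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax policy over arm preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_gradient_step(theta, arm, reward, eta):
    """One softmax policy-gradient update:
    theta_a += eta * reward * (1{a == arm} - pi_a),
    the unbiased REINFORCE gradient of the expected reward."""
    pi = softmax(theta)
    grad = -reward * pi      # -r * pi_a for every arm...
    grad[arm] += reward      # ...plus r for the pulled arm
    return theta + eta * grad
```

Pulling an arm and observing a positive reward shifts probability mass toward that arm; the regret analysis concerns what happens when this step is iterated over the horizon.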
[LG-11] The internal law of a material can be discovered from its boundary
Link: https://arxiv.org/abs/2603.26517
Authors: Francesco Regazzoni
Categories: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:Since the earliest stages of human civilization, advances in technology have been tightly linked to our ability to understand and predict the mechanical behavior of materials. In recent years, this challenge has increasingly been framed within the broader paradigm of data-driven scientific discovery, where governing laws are inferred directly from observations. However, existing methods require either stress-strain pairs or full-field displacement measurements, which are often inaccessible in practice. We introduce Neural-DFEM, a method that enables unsupervised discovery of hyperelastic material laws even from partial observations, such as boundary-only measurements. The method embeds a differentiable finite element solver within the learning loop, directly linking candidate energy functionals to available measurements. To guarantee thermodynamic consistency and mathematical well-posedness throughout training, the method employs Hyperelastic Neural Networks, a novel structure-preserving neural architecture that enforces frame indifference, material symmetry, polyconvexity, and coercivity by design. The resulting framework enables robust material model discovery in both two- and three-dimensional settings, including scenarios with boundary-only measurements. Neural-DFEM allows for generalization across geometries and loading conditions, and exhibits unprecedented accuracy and strong resilience to measurement noise. Our results demonstrate that reliable identification of material laws is achievable even under partial observability when strong physical inductive biases are embedded in the learning architecture.
[LG-12] EcoFair: Trustworthy and Energy-Aware Routing for Privacy-Preserving Vertically Partitioned Medical Inference
Link: https://arxiv.org/abs/2603.26483
Authors: Mostafa Anoosha, Dhavalkumar Thakker, Kuniko Paxton, Koorosh Aslansefat, Bhupesh Kumar Mishra, Baseer Ahmad, Rameez Raja Kureshi
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures, 4 tables
Abstract:Privacy-preserving medical inference must balance data locality, diagnostic reliability, and deployment efficiency. This paper presents EcoFair, a simulated vertically partitioned inference framework for dermatological diagnosis in which raw image and tabular data remain local and only modality-specific embeddings are transmitted for server-side multimodal fusion. EcoFair introduces a lightweight-first routing mechanism that selectively activates a heavier image encoder when local uncertainty or metadata-derived clinical risk indicates that additional computation is warranted. The routing decision combines predictive uncertainty, a safe–danger probability gap, and a tabular neurosymbolic risk score derived from patient age and lesion localisation. Experiments on three dermatology benchmarks show that EcoFair can substantially reduce edge-side inference energy in representative model pairings while remaining competitive in classification performance. The results further indicate that selective routing can improve subgroup-sensitive malignant-case behaviour in representative settings without modifying the global training objective. These findings position EcoFair as a practical framework for privacy-preserving and energy-aware medical inference under edge deployment constraints.
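The lightweight-first routing decision combines predictive uncertainty, a probability gap, and a metadata-derived risk score. A minimal sketch with hypothetical thresholds; we use the top-two probability gap as a stand-in for the paper's safe-danger gap:

```python
import numpy as np

def route_to_heavy_model(probs, risk_score, entropy_thresh=0.8,
                         gap_thresh=0.2, risk_thresh=0.5):
    """Escalate to the heavier image encoder when predictive entropy is
    high, the top-two probability gap is small, or the metadata-derived
    risk score is elevated. All thresholds here are illustrative."""
    p = np.asarray(probs, float)
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    top2 = np.sort(p)[::-1][:2]
    gap = float(top2[0] - top2[1])
    return entropy > entropy_thresh or gap < gap_thresh or risk_score > risk_thresh
```

A confident, low-risk prediction stays on the lightweight path; an uncertain prediction or a high metadata risk score triggers the heavy encoder.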
[LG-13] SPECTRA: An Efficient Spectral-Informed Neural Network for Sensor-Based Activity Recognition
Link: https://arxiv.org/abs/2603.26482
Authors: Deepika Gurung, Lala Shakti Swarup Ray, Mengxi Liu, Bo Zhou, Paul Lukowicz
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Real-time sensor-based applications in pervasive computing require edge-deployable models to ensure low latency, privacy, and efficient interaction. A prime example is sensor-based human activity recognition (HAR), where models must balance accuracy with stringent resource constraints. Yet many deep learning approaches treat temporal sensor signals as black-box sequences, overlooking spectral-temporal structure while demanding excessive computation. We present SPECTRA, a deployment-first, co-designed spectral-temporal architecture that integrates short-time Fourier transform (STFT) feature extraction, depthwise separable convolutions, and channel-wise self-attention to capture spectral-temporal dependencies under real edge runtime and memory constraints. A compact bidirectional GRU with attention pooling summarizes within-window dynamics at low cost, reducing downstream model burden while preserving accuracy. Across five public HAR datasets, SPECTRA matches or approaches larger CNN, LSTM, and Transformer baselines while substantially reducing parameters, latency, and energy. Deployments on a Google Pixel 9 smartphone and an STM32L4 microcontroller further demonstrate end-to-end deployable, real-time, private, and efficient HAR.
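The STFT front end can be sketched in a few lines of numpy; the frame length and hop below are illustrative, not the paper's settings:

```python
import numpy as np

def stft_features(signal, frame_len=64, hop=32):
    """Magnitude STFT of a 1-D sensor channel: Hann-windowed frames,
    rFFT per frame. Returns shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)
```

A pure tone at 8 cycles per frame concentrates its energy in frequency bin 8, which is exactly the kind of periodic-motion signature (e.g. walking cadence) the spectral pathway exposes to the downstream convolutions.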
[LG-14] Shapley meets Rawls: an integrated framework for measuring and explaining unfairness
Link: https://arxiv.org/abs/2603.26476
Authors: Fadoua Amri-Jouidel, Emmanuel Kemel, Stéphane Mussard
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Explainability and fairness have mainly been considered separately, with recent exceptions trying to explain the sources of unfairness. This paper shows that the Shapley value can be used to both define and explain unfairness, under standard group fairness criteria. This offers an integrated framework to estimate and derive inference on unfairness as well as the features that contribute to it. Our framework can also be extended from Shapley values to the family of Efficient-Symmetric-Linear (ESL) values, some of which offer more robust definitions of fairness, and shorter computation times. An illustration is run on the Census Income dataset from the UCI Machine Learning Repository. Our approach shows that "Age", "Number of hours" and "Marital status" generate gender unfairness, with shorter computation times than traditional Bootstrap tests.
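Attributing unfairness to features via Shapley values can be illustrated with an exact computation over a small feature set; here `value_fn` stands in for a coalition-level unfairness measure, and the additive toy game in the check below is ours (feature names echo the Census illustration):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values over a small feature set: each feature's
    average marginal contribution to value_fn across all coalitions,
    weighted by |S|! (n - |S| - 1)! / n!."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for coalition in combinations(others, k):
                S = frozenset(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[f] += weight * (value_fn(S | {f}) - value_fn(S))
    return phi
```

By efficiency, the attributions sum exactly to the total unfairness of the full coalition, which is what makes the Shapley value a natural decomposition of a fairness metric.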
[LG-15] Automatic feature identification in least-squares policy iteration using the Koopman operator framework
Link: https://arxiv.org/abs/2603.26464
Authors: Christian Mugisho Zagabe, Sebastian Petiz
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 6 pages
Abstract:In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm in reinforcement learning (RL). The KAE-LSPI algorithm is based on reformulating the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic choice of features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two previous works, the classical least-squares policy iteration (LSPI) and the kernel-based least-squares policy iteration (KLSPI), using stochastic chain walk and inverted pendulum control problems as examples. Unlike previous works, no features or kernels need to be fixed a priori in our approach. Empirical results show the number of features learned by the KAE technique remains reasonable compared to those fixed in the classical LSPI algorithm. The convergence to an optimal or a near-optimal policy is also comparable to the other two methods.
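The EDMD reformulation at the heart of KAE-LSPI fits a finite-dimensional Koopman matrix by least squares over dictionary features. In KAE-LSPI the dictionary is learned by the autoencoder; this sketch fixes it by hand, which is the classical EDMD case:

```python
import numpy as np

def edmd(X, Y, dictionary):
    """EDMD: least-squares Koopman matrix K minimizing ||Psi(Y) - Psi(X) K||_F,
    where Psi stacks dictionary evaluations row-wise (one row per snapshot)
    and Y holds the successor of each state in X."""
    Psi_X = np.array([dictionary(x) for x in X])
    Psi_Y = np.array([dictionary(y) for y in Y])
    K, *_ = np.linalg.lstsq(Psi_X, Psi_Y, rcond=None)
    return K
```

For the linear map x ↦ 0.5x with dictionary (x, x²), the Koopman matrix is exactly diag(0.5, 0.25), since x² evolves with factor 0.25.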
[LG-16] Fair Data Pre-Processing with Imperfect Attribute Space SIGMOD2026
Link: https://arxiv.org/abs/2603.26456
Authors: Ying Zheng, Yangfan Jiang, Kian-Lee Tan
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted at SIGMOD 2026
Abstract:Fair data pre-processing is a widely used strategy for mitigating bias in machine learning. A promising line of research focuses on calibrating datasets to satisfy a designed fairness policy so that sensitive attributes influence outcomes only through clearly specified legitimate causal pathways. While effective on clean and information-rich data, these methods often break down in real-world scenarios with imperfect attribute spaces, where decision-relevant factors may be deemed unusable or even missing. To address this gap, we propose LatentPre, a novel framework that enables principled and robust fair data processing in practical settings. Instead of relying solely on observed attributes, LatentPre augments the fairness policy with latent attributes that capture essential but subtle signals, enabling the framework to operate as if the attribute space were perfect. These latent attributes are strategically introduced to guarantee identifiability and are estimated using a tailored expectation-maximization paradigm. The raw data is then carefully refined to conform to this latent-augmented policy, effectively removing biased patterns while preserving justifiable ones. Extensive experiments demonstrate that LatentPre consistently achieves strong fairness-utility trade-offs across diverse scenarios, advancing practical fairness-aware data management.
[LG-17] Interpretable long-term traffic modelling on national road networks using theory-informed deep learning
Link: https://arxiv.org/abs/2603.26440
Authors: Yue Li, Shujuan Chen, Akihiro Shimoda, Ying Jin
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Long-term traffic modelling is fundamental to transport planning, but existing approaches often trade off interpretability, transferability, and predictive accuracy. Classical travel demand models provide behavioural structure but rely on strong assumptions and extensive calibration, whereas generic deep learning models capture complex patterns but often lack theoretical grounding and spatial transferability, limiting their usefulness for long-term planning applications. We propose DeepDemand, a theory-informed deep learning framework that embeds key components of travel demand theory to predict long-term highway traffic volumes using external socioeconomic features and road-network structure. The framework integrates a competitive two-source Dijkstra procedure for local origin-destination (OD) region extraction and OD pair screening with a differentiable architecture modelling OD interactions and travel-time deterrence. The model is evaluated using eight years (2017-2024) of observations on the UK strategic road network, covering 5088 highway segments. Under random cross-validation, DeepDemand achieves an R2 of 0.718 and an MAE of 7406 vehicles, outperforming linear, ridge, random forest, and gravity-style baselines. Performance remains strong under spatial cross-validation (R2 = 0.665), indicating good geographic transferability. Interpretability analysis reveals a stable nonlinear travel-time deterrence pattern, key socioeconomic drivers of demand, and polycentric OD interaction structures aligned with major employment centres and transport hubs. These results highlight the value of integrating transport theory with deep learning for interpretable highway traffic modelling and practical planning applications.
[LG-18] Maintaining Difficulty: A Margin Scheduler for Triplet Loss in Siamese Networks Training
链接: https://arxiv.org/abs/2603.26389
作者: Roberto Sprengel Minozzo Tomchak,Oge Marques,Lucas Garcia Pedroso,Luiz Eduardo Oliveira,Paulo Lisboa de Almeida
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Triplet Margin Ranking Loss is one of the most widely used loss functions in Siamese Networks for solving Distance Metric Learning (DML) problems. This loss function depends on a margin parameter μ, which defines the minimum distance that should separate positive and negative pairs during training. In this work, we show that, during training, the effective margin of many triplets often exceeds the predefined value of μ, provided that a sufficient number of triplets violating this margin is observed. This behavior indicates that fixing the margin throughout training may limit the learning process. Based on this observation, we propose a margin scheduler that adjusts the value of μ according to the proportion of easy triplets observed at each epoch, with the goal of maintaining training difficulty over time. We show that the proposed strategy leads to improved performance when compared to both a constant margin and a monotonically increasing margin scheme. Experimental results on four different datasets show consistent gains in verification performance.
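The scheduling idea above can be sketched in a few lines (a hypothetical toy rule with made-up constants, not the authors' exact scheduler): raise μ when too many triplets become "easy" under the current margin, and relax it otherwise.

```python
import numpy as np

def triplet_margin_loss(d_ap, d_an, mu):
    """Triplet margin ranking loss for anchor-positive distances d_ap
    and anchor-negative distances d_an with margin mu."""
    return np.maximum(d_ap - d_an + mu, 0.0)

def schedule_margin(mu, d_ap, d_an, target_easy=0.5, step=0.05):
    """Toy scheduler: a triplet is 'easy' when its loss is zero.
    Raise mu when the easy fraction exceeds the target (training got
    too easy), relax it otherwise, keeping difficulty roughly level."""
    easy_frac = np.mean(d_ap - d_an + mu <= 0.0)
    if easy_frac > target_easy:
        return mu + step
    return max(mu - step, 0.0)

rng = np.random.default_rng(0)
d_ap = rng.uniform(0.0, 1.0, 256)  # toy anchor-positive distances
d_an = rng.uniform(0.5, 1.5, 256)  # toy anchor-negative distances
mu = 0.2
for epoch in range(5):  # in real training the distances change each epoch
    loss = triplet_margin_loss(d_ap, d_an, mu).mean()
    mu = schedule_margin(mu, d_ap, d_an)
```

In real Siamese training the distances are recomputed from embeddings each epoch, so the scheduler responds to the model's evolving difficulty rather than to fixed samples as here.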
[LG-19] Curvature-aware Expected Free Energy as an Acquisition Function for Bayesian Optimization
链接: https://arxiv.org/abs/2603.26339
作者: Ajith Anil Meera,Wouter Kouw
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: under review
Abstract:We propose an Expected Free Energy-based acquisition function for Bayesian optimization to solve the joint learning and optimization problem, i.e., optimize and learn the underlying function simultaneously. We show that, under specific assumptions, Expected Free Energy reduces to Upper Confidence Bound, Lower Confidence Bound, and Expected Information Gain. We prove that Expected Free Energy has unbiased convergence guarantees for concave functions. Using the results from these derivations, we introduce a curvature-aware update law for Expected Free Energy and show its proof of concept using a system identification problem on a Van der Pol oscillator. Through rigorous simulation experiments, we show that our adaptive Expected Free Energy-based acquisition function outperforms state-of-the-art acquisition functions with the least final simple regret and error in learning the Gaussian process.
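As a point of reference for the reductions mentioned in the abstract, a minimal Upper Confidence Bound acquisition over a discrete candidate set can be sketched as follows (illustrative only; the paper's curvature-aware EFE update is more involved, and all values here are made up):

```python
import numpy as np

def ucb(mu, sigma, beta=2.0):
    """Upper Confidence Bound acquisition: posterior mean plus an
    exploration bonus proportional to posterior uncertainty."""
    return mu + beta * sigma

# Toy surrogate posterior over three candidate query points.
post_mean = np.array([0.1, 0.5, 0.3])
post_std = np.array([0.4, 0.05, 0.2])
scores = ucb(post_mean, post_std)
next_query = int(np.argmax(scores))  # the uncertain point wins here
```

With beta = 2 the first candidate is selected despite its low mean, showing how the bonus term trades off learning the function against exploiting it, which is exactly the joint learning-and-optimization tension the paper addresses.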
[LG-20] D-GATNet: Interpretable Temporal Graph Attention Learning for ADHD Identification Using Dynamic Functional Connectivity
链接: https://arxiv.org/abs/2603.26308
作者: Qurat Ul Ain,Alptekin Temizel,Soyiba Jawed
类目: Machine Learning (cs.LG)
*备注: 5 pages, 4 figures
Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder whose neuroimaging-based diagnosis remains challenging due to complex time-varying disruptions in brain connectivity. Functional MRI (fMRI) provides a powerful non-invasive modality for identifying functional alterations. Existing deep learning (DL) studies employ diverse neuroimaging features; however, static functional connectivity remains widely used, whereas dynamic connectivity modeling is comparatively underexplored. Moreover, many DL models lack interpretability. In this work, we propose D-GATNet, an interpretable temporal graph-based framework for automated ADHD classification using dynamic functional connectivity (dFC). Sliding-window Pearson correlation constructs sequences of functional brain graphs with regions of interest as nodes and connectivity strengths as edges. Spatial dependencies are learned via a multi-layer Graph Attention Network, while temporal dynamics are modeled using 1D convolution followed by temporal attention. Interpretability is achieved through graph attention weights revealing dominant ROI interactions, ROI importance scores identifying influential regions, and temporal attention emphasizing informative connectivity segments. Experiments on the Peking University site of the ADHD-200 dataset using stratified 10-fold cross-validation with a 5-seed ensemble achieved 85.18% ± 5.64 balanced accuracy and 0.881 AUC, outperforming state-of-the-art methods. Attention analysis reveals cerebellar and default mode network disruptions, indicating potential neuroimaging biomarkers.
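The graph-construction step described here, sliding-window Pearson correlation over ROI time series, can be sketched as follows (window and stride values are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sliding_window_dfc(ts, win=30, stride=10):
    """Sequence of functional-connectivity graphs from a (time, rois)
    matrix: each sliding window yields one ROI-by-ROI Pearson
    correlation adjacency, with self-loops removed."""
    n_time, _ = ts.shape
    graphs = []
    for start in range(0, n_time - win + 1, stride):
        adj = np.corrcoef(ts[start:start + win], rowvar=False)
        np.fill_diagonal(adj, 0.0)
        graphs.append(adj)
    return np.stack(graphs)  # shape: (windows, rois, rois)

rng = np.random.default_rng(1)
bold = rng.standard_normal((120, 16))  # toy fMRI: 120 timepoints, 16 ROIs
dfc = sliding_window_dfc(bold)
```

Each slice of the resulting tensor is one symmetric weighted graph, which is exactly the per-window input a spatial GNN layer would consume before the temporal module aggregates across windows.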
[LG-21] Topology-Aware Graph Reinforcement Learning for Energy Storage Systems Optimal Dispatch in Distribution Networks
链接: https://arxiv.org/abs/2603.26264
作者: Shuyi Gao,Stavros Orfanoudakis,Shengren Hou,Peter Palensky,Pedro P. Vergara
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 15 pages, 10 figures
Abstract:Optimal dispatch of energy storage systems (ESSs) in distribution networks involves jointly improving operating economy and voltage security under time-varying conditions and possible topology changes. To support fast online decision making, we develop a topology-aware Reinforcement Learning architecture based on Twin Delayed Deep Deterministic Policy Gradient (TD3), which integrates graph neural networks (GNNs) as graph feature encoders for ESS dispatch. We conduct a systematic investigation of three GNN variants: graph convolutional networks (GCNs), topology adaptive graph convolutional networks (TAGConv), and graph attention networks (GATs) on the 34-bus and 69-bus systems, and evaluate robustness under multiple topology reconfiguration cases as well as cross-system transfer between networks with different system sizes. Results show that GNN-based controllers consistently reduce the number and magnitude of voltage violations, with clearer benefits on the 69-bus system and under reconfiguration; on the 69-bus system, TD3-GCN and TD3-TAGConv also achieve lower saved cost relative to the NLP benchmark than the NN baseline. We also highlight that transfer gains are case-dependent, and zero-shot transfer between fundamentally different systems results in notable performance degradation and increased voltage magnitude violations. This work is available at: this https URL and this https URL.
[LG-22] Contrastive Conformal Sets
链接: https://arxiv.org/abs/2603.26261
作者: Yahya Alkhatib,Wee Peng Tay
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Contrastive learning produces coherent semantic feature embeddings by encouraging positive samples to cluster closely while separating negative samples. However, existing contrastive learning methods lack principled guarantees on coverage within the semantic feature space. We extend conformal prediction to this setting by introducing minimum-volume covering sets equipped with learnable generalized multi-norm constraints. We propose a method that constructs conformal sets guaranteeing user-specified coverage of positive samples while maximizing negative sample exclusion. We establish theoretically that volume minimization serves as a proxy for negative exclusion, enabling our approach to operate effectively even when negative pairs are unavailable. The positive inclusion guarantee inherits the distribution-free coverage property of conformal prediction, while negative exclusion is maximized through learned set geometry optimized on a held-out training split. Experiments on simulated and real-world image datasets demonstrate improved inclusion-exclusion trade-offs compared to standard distance-based conformal baselines.
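The coverage guarantee inherited from conformal prediction can be illustrated with a plain split-conformal sketch using a distance-based nonconformity score (a simplified baseline of the kind the paper compares against, not its learned multi-norm sets):

```python
import numpy as np

def conformal_radius(cal_scores, alpha=0.1):
    """Split-conformal threshold: the score at conformal rank
    ceil((n + 1) * (1 - alpha)) guarantees >= 1 - alpha coverage
    for exchangeable test points."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(cal_scores)[min(k, n) - 1]

rng = np.random.default_rng(2)
center = np.zeros(8)                           # toy embedding of the positive class
cal = rng.standard_normal((500, 8))            # calibration embeddings
scores = np.linalg.norm(cal - center, axis=1)  # distance nonconformity
r = conformal_radius(scores, alpha=0.1)
test_pts = rng.standard_normal((2000, 8))
coverage = float(np.mean(np.linalg.norm(test_pts - center, axis=1) <= r))
```

The ball of radius r covers roughly 90% of held-out positives by construction; the paper's contribution is to replace this fixed geometry with learned minimum-volume sets that also push negatives outside.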
[LG-23] Improving Risk Stratification in Hypertrophic Cardiomyopathy: A Novel Score Combining Echocardiography Clinical and Medication Data
链接: https://arxiv.org/abs/2603.26254
作者: Marion Taconné,Valentina D.A. Corino,Annamaria Del Franco,Sara Giovani,Iacopo Olivotto,Adrien Al Wazzan,Erwan Donal,Pietro Cerveri,Luca Mainardi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hypertrophic cardiomyopathy (HCM) requires accurate risk stratification to inform decisions regarding ICD therapy and follow-up management. Current established models, such as the European Society of Cardiology (ESC) score, exhibit moderate discriminative performance. This study develops a robust, explainable machine learning (ML) risk score leveraging routinely collected echocardiographic, clinical, and medication data, typically contained within Electronic Health Records (EHRs), to predict a 5-year composite cardiovascular outcome in HCM patients. The model was trained and internally validated using a large cohort (N=1,201) from the SHARE registry (Florence Hospital) and externally validated on an independent cohort (N=382) from Rennes Hospital. The final Random Forest ensemble model achieved a high internal Area Under the Curve (AUC) of 0.85 ± 0.02, significantly outperforming the ESC score (0.56 ± 0.03). Critically, survival curve analysis on the external validation set showed superior risk separation for the ML score (log-rank p = 8.62 x 10^(-4)) compared to the ESC score (p = 0.0559). Furthermore, longitudinal analyses demonstrate that the proposed risk score remains stable over time in event-free patients. The model's high interpretability and its capacity for longitudinal risk monitoring represent promising tools for the personalized clinical management of HCM.
[LG-24] Knowledge Distillation for Efficient Transformer-Based Reinforcement Learning in Hardware-Constrained Energy Management Systems
链接: https://arxiv.org/abs/2603.26249
作者: Pascal Henrich,Jonas Sievers,Maximilian Beichter,Thomas Blank,Ralf Mikut,Veit Hagenmeyer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformer-based reinforcement learning has emerged as a strong candidate for sequential control in residential energy management. In particular, the Decision Transformer can learn effective battery dispatch policies from historical data, thereby increasing photovoltaic self-consumption and reducing electricity costs. However, transformer models are typically too computationally demanding for deployment on resource-constrained residential controllers, where memory and latency constraints are critical. This paper investigates knowledge distillation to transfer the decision-making behaviour of high-capacity Decision Transformer policies to compact models that are more suitable for embedded deployment. Using the Ausgrid dataset, we train teacher models in an offline sequence-based Decision Transformer framework on heterogeneous multi-building data. We then distil smaller student models by matching the teachers’ actions, thereby preserving control quality while reducing model size. Across a broad set of teacher-student configurations, distillation largely preserves control performance and even yields small improvements of up to 1%, while reducing the parameter count by up to 96%, the inference memory by up to 90%, and the inference time by up to 63%. Beyond these compression effects, comparable cost improvements are also observed when distilling into a student model of identical architectural capacity. Overall, our results show that knowledge distillation makes Decision Transformer control more applicable for residential energy management on resource-limited hardware.
[LG-25] Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach
链接: https://arxiv.org/abs/2603.26231
作者: Abdelkrim Alahyane(LAAS-SARA),Céline Comte(CNRS, LAAS-SARA),Matthieu Jonckheere(CNRS, LAAS-SARA)
类目: Machine Learning (cs.LG); Performance (cs.PF); Optimization and Control (math.OC); Probability (math.PR)
*备注:
Abstract:Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which degrades convergence, and bias toward faster clients under heterogeneous data distributions. Although algorithms such as AsyncSGD and Generalized AsyncSGD mitigate this bias via client-side task queues, most existing analyses neglect the underlying queueing dynamics and lack closed-form characterizations of the update throughput and gradient staleness. To close this gap, we develop a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at the clients and the central server, as well as random uplink and downlink communication delays. Leveraging product-form network theory, we derive a closed-form expression for the update throughput, alongside closed-form upper bounds for both the communication round complexity and the expected wall-clock time required to reach an ε-stationary point. These results formally characterize the trade-off between gradient staleness and wall-clock convergence speed. We further extend the framework to quantify energy consumption under stochastic timing, revealing an additional trade-off between convergence speed and energy efficiency. Building on these analytical results, we propose gradient-based optimization strategies to jointly optimize routing and concurrency. Experiments on EMNIST demonstrate reductions of 29%–46% in convergence time and 36%–49% in energy consumption compared to AsyncSGD.
[LG-26] Geometric Evolution Graph Convolutional Networks: Enhancing Graph Representation Learning via Ricci Flow
链接: https://arxiv.org/abs/2603.26178
作者: Jicheng Ma,Yunyan Yang,Juan Zhao,Liang Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce the Geometric Evolution Graph Convolutional Network (GEGCN), a novel framework that enhances graph representation learning by modeling geometric evolution on graphs. Specifically, GEGCN employs a Long Short-Term Memory to model the structural sequence generated by discrete Ricci flow, and the learned dynamic representations are infused into a Graph Convolutional Network. Extensive experiments demonstrate that GEGCN achieves state-of-the-art performance on classification tasks across various benchmark datasets, with its performance being particularly outstanding on heterophilic graphs.
[LG-27] Can AI Scientist Agents Learn from Lab-in-the-Loop Feedback? Evidence from Iterative Perturbation Discovery
链接: https://arxiv.org/abs/2603.26177
作者: Gilles Wainrib,Barbara Bodinier,Haitem Dakhli,Josep Monserrat,Almudena Espin Perez,Sabrina Carpentier,Roberta Codato,John Klein
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent work has questioned whether large language models (LLMs) can perform genuine in-context learning (ICL) for scientific experimental design, with prior studies suggesting that LLM-based agents exhibit no sensitivity to experimental feedback. We shed new light on this question by carrying out 800 independently replicated experiments on iterative perturbation discovery in Cell Painting high-content screening. We compare an LLM agent that iteratively updates its hypotheses using experimental feedback to a zero-shot baseline that relies solely on pretraining knowledge retrieval. Access to feedback yields a +53.4% increase in discoveries per feature on average (p = 0.003). To test whether this improvement arises from genuine feedback-driven learning rather than prompt-induced recall of pretraining knowledge, we introduce a random feedback control in which hit/miss labels are permuted. Under this control, the performance gain disappears, indicating that the observed improvement depends on the structure of the feedback signal (+13.0 hits, p = 0.003). We further examine how model capability affects feedback utilization. Upgrading from Claude Sonnet 4.5 to 4.6 reduces gene hallucination rates from ~33–45% to ~3–9%, converting a non-significant ICL effect (+0.8, p = 0.32) into a large and highly significant improvement (+11.0, p = 0.003) for the best ICL strategy. These results suggest that effective in-context learning from experimental feedback emerges only once models reach a sufficient capability threshold.
[LG-28] PEANUT: Perturbations by Eigenvalue Alignment for Attacking GNNs Under Topology-Driven Message Passing
链接: https://arxiv.org/abs/2603.26136
作者: Bhavya Kohli,Biplab Sikdar
类目: Machine Learning (cs.LG)
*备注: 8 content pages, 12 total pages including references
Abstract:Graph Neural Networks (GNNs) have achieved remarkable performance on tasks involving relational data. However, small perturbations to the graph structure can significantly alter GNN outputs, raising concerns about their robustness in real-world deployments. In this work, we explore the core vulnerability of GNNs which explicitly consume graph topology in the form of the adjacency matrix or Laplacian as a means for message passing, and propose PEANUT, a simple, gradient-free, restricted black-box attack that injects virtual nodes to capitalize on this vulnerability. PEANUT is an injection-based attack, a setting widely considered more practical and realistic than graph modification attacks, in which the attacker modifies the original graph structure directly. Our method operates at inference time, making it an evasion attack, and is applicable almost immediately: it does not involve lengthy iterative optimizations or parameter learning, which add computational and time overhead, nor does it train surrogate models, which are susceptible to failure due to differences in model priors and generalization capabilities. PEANUT also requires no features on the injected nodes, and consequently demonstrates that GNN performance can be significantly degraded even when the injected nodes carry all-zero features, highlighting the importance of effectively designed connectivity in such attacks. Extensive experiments on real-world datasets across three graph tasks demonstrate the effectiveness of our attack despite its simplicity.
[LG-29] TinyML for Acoustic Anomaly Detection in IoT Sensor Networks
链接: https://arxiv.org/abs/2603.26135
作者: Amar Almaini,Jakob Folz,Ghadeer Ashour
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tiny Machine Learning enables real-time, energy-efficient data processing directly on microcontrollers, making it ideal for Internet of Things sensor networks. This paper presents a compact TinyML pipeline for detecting anomalies in environmental sound within IoT sensor networks. Acoustic monitoring in IoT systems can enhance safety and context awareness, yet cloud-based processing introduces challenges related to latency, power usage, and privacy. Our pipeline addresses these issues by extracting Mel Frequency Cepstral Coefficients from sound signals and training a lightweight neural network classifier optimized for deployment on edge devices. The model was trained and evaluated using the UrbanSound8K dataset, achieving a test accuracy of 91% and balanced F1-scores of 0.91 across both normal and anomalous sound classes. These results demonstrate the feasibility and reliability of embedded acoustic anomaly detection for scalable and responsive IoT deployments.
[LG-30] Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?
链接: https://arxiv.org/abs/2603.26105
作者: Yuhang Ma,Jie Wang,Zheng Yan
类目: Machine Learning (cs.LG)
*备注: To appear at 2026 IEEE Symposium on Security and Privacy (SP)
Abstract:Large Language Models (LLMs) have advanced Graph Neural Networks (GNNs) by enriching node representations with semantic features, giving rise to LLM-enhanced GNNs that achieve notable performance gains. However, the robustness of these models against poisoning attacks, which manipulate both graph structures and textual attributes during training, remains unexplored. To bridge this gap, we propose a robustness assessment framework that systematically evaluates LLM-enhanced GNNs under poisoning attacks. Our framework enables comprehensive evaluation across multiple dimensions. Specifically, we assess 24 victim models by combining eight LLM- or Language Model (LM)-based feature enhancers with three representative GNN backbones. To ensure diversity in attack coverage, we incorporate six structural poisoning attacks (both targeted and non-targeted) and three textual poisoning attacks operating at the character, word, and sentence levels. Furthermore, we employ four real-world datasets, including one released after the emergence of LLMs, to avoid potential ground truth leakage during LLM pretraining, thereby ensuring fair evaluation. Extensive experiments show that LLM-enhanced GNNs exhibit significantly higher accuracy and lower Relative Drop in Accuracy (RDA) than a shallow embedding-based baseline across various attack settings. Our in-depth analysis identifies key factors that contribute to this robustness, such as the effective encoding of structural and label information in node representations. Based on these insights, we outline future research directions from both offensive and defensive perspectives, and propose a new combined attack along with a graph purification defense. To support future research, we release the source code of our framework at this https URL.
[LG-31] Adversarial Bandit Optimization with Globally Bounded Perturbations to Linear Losses
链接: https://arxiv.org/abs/2603.26066
作者: Zhuoyu Cheng,Kohei Hatano,Eiji Takimoto
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study a class of adversarial bandit optimization problems in which the loss functions may be non-convex and non-smooth. In each round, the learner observes a loss that consists of an underlying linear component together with an additional perturbation applied after the learner selects an action. The perturbations are measured relative to the linear losses and are constrained by a global budget that bounds their cumulative magnitude over time. Under this model, we establish both expected and high-probability regret guarantees. As a special case of our analysis, we recover an improved high-probability regret bound for classical bandit linear optimization, which corresponds to the setting without perturbations. We further complement our upper bounds by proving a lower bound on the expected regret.
[LG-32] Constitutive parameterized deep energy method for solid mechanics problems with random material parameters
链接: https://arxiv.org/abs/2603.26030
作者: Zhangyong Liang,Huanhuan Gao
类目: Machine Learning (cs.LG)
*备注:
Abstract:In practical structural design and solid mechanics simulations, material properties inherently exhibit random variations within bounded intervals. However, evaluating mechanical responses under continuous material uncertainty remains a persistent challenge. Traditional numerical approaches, such as the Finite Element Method (FEM), incur prohibitive computational costs as they require repeated mesh discretization and equation solving for every parametric realization. Similarly, data-driven surrogate models depend heavily on massive, high-fidelity datasets, while standard physics-informed frameworks (e.g., the Deep Energy Method) strictly demand complete retraining from scratch whenever material parameters change. To bridge this critical gap, we propose the Constitutive Parameterized Deep Energy Method (CPDEM). In this purely physics-driven framework, the strain energy density functional is reformulated by encoding a latent representation of stochastic constitutive parameters. By embedding material parameters directly into the neural network alongside spatial coordinates, CPDEM transforms conventional spatial collocation points into parameter-aware material points. Trained in an unsupervised manner via expected energy minimization over the parameter domain, the pre-trained model continuously learns the solution manifold. Consequently, it enables zero-shot, real-time inference of displacement fields for unknown material parameters without requiring any dataset generation or model retraining. The proposed method is rigorously validated across diverse benchmarks, including linear elasticity, finite-strain hyperelasticity, and complex highly nonlinear contact mechanics. To the best of our knowledge, CPDEM represents the first purely physics-driven deep learning paradigm capable of simultaneously and efficiently handling continuous multi-parameter variations in solid mechanics.
[LG-33] Identification of Bivariate Causal Directionality Based on Anticipated Asymmetric Geometries
链接: https://arxiv.org/abs/2603.26024
作者: Alex Glushkovsky
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 12 pages, 8 figures, 3 tables
Abstract:Identification of causal directionality in bivariate numerical data is a fundamental research problem with important practical implications. This paper presents two alternative methods to identify direction of causation by considering conditional distributions: (1) Anticipated Asymmetric Geometries (AAG) and (2) Monotonicity Index. The AAG method compares the actual conditional distributions to anticipated ones along two variables. Different comparison metrics, such as correlation, cosine similarity, Jaccard index, K-L divergence, K-S distance, and mutual information have been evaluated. Anticipated distributions have been projected as normal based on dual response statistics: mean and standard deviation. The Monotonicity Index approach compares the calculated monotonicity indexes of the gradients of conditional distributions along two axes and exhibits counts of gradient sign changes. Both methods assume stochastic properties of the bivariate data and exploit anticipated unimodality of conditional distributions of the effect. It turns out that the tuned AAG method outperforms the Monotonicity Index and reaches a top accuracy of 77.9%, compared to ANMs' accuracy of 63 ± 10%, when classifying 95 pairs of real-world examples (Mooij et al., 2014). The described methods include a number of hyperparameters that impact accuracy of the identification. For a given set of hyperparameters, both the AAG and the Monotonicity Index methods provide a unique deterministic outcome of the solution. To address sensitivity to hyperparameters, tuning of hyperparameters has been done by utilizing a full factorial Design of Experiment. A decision tree has been fitted to distinguish misclassified cases using the input data's symmetrical bivariate statistics to address the question of: How decisive is the identification method of causal directionality?
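A toy version of the Monotonicity Index idea, measuring how consistently the gradient of a conditional mean keeps one sign, can be sketched as follows (the binning scheme and mechanisms are illustrative assumptions, not the paper's exact definition):

```python
import numpy as np

def monotonicity_index(x, y, bins=10):
    """Bin x into quantile bins, take the conditional mean of y per
    bin, and measure how consistently its gradient keeps one sign:
    1.0 = perfectly monotone trend, near 0 = no directional trend."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    cond_mean = np.array([y[idx == b].mean() for b in range(bins)])
    grad = np.diff(cond_mean)
    return np.abs(np.sign(grad).sum()) / len(grad)

rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, 5000)
noise = 0.1 * rng.standard_normal(5000)
mono = monotonicity_index(x, x**3 + noise)     # monotone mechanism
nonmono = monotonicity_index(x, x**2 + noise)  # non-monotone mechanism
```

A monotone mechanism drives the index toward 1.0 while a symmetric one keeps it near 0; comparing the index along both axes of a bivariate dataset is what turns this score into a directionality signal.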
[LG-34] GLU: Global-Local-Uncertainty Fusion for Scalable Spatiotemporal Reconstruction and Forecasting
链接: https://arxiv.org/abs/2603.26023
作者: Linzheng Wang,Jason Chen,Nicolas Tricard,Zituo Chen,Sili Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Digital twins of complex physical systems are expected to infer unobserved states from sparse measurements and predict their evolution in time, yet these two functions are typically treated as separate tasks. Here we present GLU, a Global-Local-Uncertainty framework that formulates sparse reconstruction and dynamic forecasting as a unified state-representation problem and introduces a structured latent assembly to both tasks. The central idea is to build a structured latent state that combines a global summary of system-level organization, local tokens anchored to available measurements, and an uncertainty-driven importance field that weights observations according to the physical informativeness. For reconstruction, GLU uses importance-aware adaptive neighborhood selection to retrieve locally relevant information while preserving global consistency and allowing flexible query resolution on arbitrary geometries. Across a suite of challenging benchmarks, GLU consistently improves reconstruction fidelity over reduced-order, convolutional, neural operator, and attention-based baselines, better preserving multi-scale structures. For forecasting, a hierarchical Leader-Follower Dynamics module evolves the latent state with substantially reduced memory growth, maintains stable rollout behavior and delays error accumulation in nonlinear dynamics. On a realistic turbulent combustion dataset, it further preserves not only sharp fronts and broadband structures in multiple physical fields, but also their cross-channel thermo-chemical couplings. Scalability tests show that these gains are achieved with substantially lower memory growth than comparable attention-based baselines. Together, these results establish GLU as a flexible and computationally practical paradigm for sparse digital twins.
[LG-35] QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
链接: https://arxiv.org/abs/2603.26017
作者: Siqiao Xue,Zhaoyang Zhu,Wei Zhang,Rongyao Cai,Rui Wang,Yixiang Mu,Fan Zhou,Jianguo Li,Peng Di,Hang Yu
类目: Machine Learning (cs.LG)
*备注: project site: this https URL
Abstract:Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models at 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
[LG-36] Second-Order First-Class: A Composable Stack for Curvature-Aware Training
链接: https://arxiv.org/abs/2603.25976
作者: Mikalai Korbit,Mario Zanon
类目: Machine Learning (cs.LG)
*备注: 22 pages, 3 figures. Code available at this https URL
Abstract:Second-order methods promise improved stability and faster convergence, yet they remain underused due to implementation overhead, tuning brittleness, and the lack of composable APIs. We introduce Somax, a composable Optax-native stack that treats curvature-aware training as a single JIT-compiled step governed by a static plan. Somax exposes first-class modules – curvature operators, estimators, linear solvers, preconditioners, and damping policies – behind a single step interface and composes with Optax by applying standard gradient transformations (e.g., momentum, weight decay, schedules) to the computed direction. This design makes typically hidden choices explicit and swappable. Somax separates planning from execution: it derives a static plan (including cadences) from module requirements, then runs the step through a specialized execution path that reuses intermediate results across modules. We report system-oriented ablations showing that (i) composition choices materially affect scaling behavior and time-to-accuracy, and (ii) planning reduces per-step overhead relative to unplanned composition with redundant recomputation.
[LG-37] On the Objective and Feature Weights of Minkowski Weighted k-Means
链接: https://arxiv.org/abs/2603.25958
作者: Renato Cordeiro de Amorim,Vladimir Makarenkov
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Minkowski weighted k-means (mwk-means) algorithm extends classical k-means by incorporating feature weights and a Minkowski distance. Despite its empirical success, its theoretical properties remain insufficiently understood. We show that the mwk-means objective can be expressed as a power-mean aggregation of within-cluster dispersions, with the order determined by the Minkowski exponent p. This formulation reveals how p controls the transition between selective and uniform use of features. Using this representation, we derive bounds for the objective function and characterise the structure of the feature weights, showing that they depend only on relative dispersion and follow a power-law relationship with dispersion ratios. This leads to explicit guarantees on the suppression of high-dispersion features. Finally, we establish convergence of the algorithm and provide a unified theoretical interpretation of its behaviour.
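The power-law dependence of the weights on dispersion ratios can be seen directly in the standard mwk-means weight update (a sketch of the closed-form weights only, not the full clustering algorithm; the dispersion values are made up):

```python
import numpy as np

def mwk_feature_weights(dispersions, p=2.0):
    """Standard mwk-means feature weights:
    w_v = 1 / sum_u (D_v / D_u)^(1 / (p - 1)),
    where D_v is the within-cluster dispersion of feature v. The
    weights depend only on dispersion ratios and always sum to one."""
    D = np.asarray(dispersions, dtype=float)
    ratios = (D[:, None] / D[None, :]) ** (1.0 / (p - 1.0))
    return 1.0 / ratios.sum(axis=1)

w2 = mwk_feature_weights([1.0, 2.0, 4.0], p=2.0)  # weights proportional to 1/D_v
w3 = mwk_feature_weights([1.0, 2.0, 4.0], p=3.0)  # flatter weight profile
```

At p = 2 the weights are simply normalized inverse dispersions; raising p shrinks the exponent 1/(p - 1) and flattens the weights, which is exactly the selective-to-uniform transition the paper formalizes.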
[LG-38] Adversarial-Robust Multivariate Time-Series Anomaly Detection via Joint Information Retention
链接: https://arxiv.org/abs/2603.25956
作者: Hadi Hojjati,Narges Armanfard
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures
Abstract:Time-series anomaly detection (TSAD) is a critical component in monitoring complex systems, yet modern deep learning-based detectors are often highly sensitive to localized input corruptions and structured noise. We propose ARTA (Adversarially Robust multivariate Time-series Anomaly detection via joint information retention), a joint training framework that improves detector robustness through a principled min-max optimization objective. ARTA comprises an anomaly detector and a sparsity-constrained mask generator that are trained simultaneously. The generator identifies minimal, task-relevant temporal perturbations that maximally increase the detector’s anomaly score, while the detector is optimized to remain stable under these structured perturbations. The resulting masks characterize the detector’s sensitivity to adversarial temporal corruptions and can serve as explanatory signals for the detector’s decisions. This adversarial training strategy exposes brittle decision pathways and encourages the detector to rely on distributed and stable temporal patterns rather than spurious localized artifacts. We conduct extensive experiments on the TSB-AD benchmark, demonstrating that ARTA consistently improves anomaly detection performance across diverse datasets and exhibits significantly more graceful degradation under increasing noise levels compared to state-of-the-art baselines.
[LG-39] EngineAD: A Real-World Vehicle Engine Anomaly Detection Dataset
链接: https://arxiv.org/abs/2603.25955
作者: Hadi Hojjati,Christopher Roth,Rory Woods,Ken Sills,Narges Armanfard
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures
Abstract:The progress of Anomaly Detection (AD) in safety-critical domains, such as transportation, is severely constrained by the lack of large-scale, real-world benchmarks. To address this, we introduce EngineAD, a novel, multivariate dataset comprising high-resolution sensor telemetry collected from a fleet of 25 commercial vehicles over a six-month period. Unlike synthetic datasets, EngineAD features authentic operational data labeled with expert annotations, distinguishing normal states from subtle indicators of incipient engine faults. We preprocess the data into 300-timestep segments of 8 principal components and establish an initial benchmark using nine diverse one-class anomaly detection models. Our experiments reveal significant performance variability across the vehicle fleet, underscoring the challenge of cross-vehicle generalization. Furthermore, our findings corroborate recent literature, showing that simple classical methods (e.g., K-Means and One-Class SVM) are often highly competitive with, or superior to, deep learning approaches in this segment-based evaluation. By publicly releasing EngineAD, we aim to provide a realistic, challenging resource for developing robust and field-deployable anomaly detection and anomaly prediction solutions for the automotive industry.
[LG-40] Online Learning for Dynamic Constellation Topologies
链接: https://arxiv.org/abs/2603.25954
作者: João Norberto,Ricardo Ferreira,Cláudia Soares
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The use of satellite networks has increased significantly in recent years due to their advantages over purely terrestrial systems, such as higher availability and coverage. However, to effectively provide these services, satellite networks must cope with the continuous orbital movement and maneuvering of their nodes and the impact on the network’s topology. In this work, we address the problem of (dynamic) network topology configuration under the online learning framework. As a byproduct, our approach does not assume structural knowledge of the network, such as known orbital planes (which could be violated by maneuvering satellites). We empirically demonstrate that our problem formulation matches the performance of state-of-the-art offline methods. Importantly, we demonstrate that our approach is amenable to constrained online learning, exhibiting a trade-off between computational complexity per iteration and convergence to a final strategy.
[LG-41] Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned
链接: https://arxiv.org/abs/2603.25937
作者: Maeva Guerrier,Karthik Soma,Jana Pavlasek,Giovanni Beltrame
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sunflare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between locations that are perceptually similar even when subtle semantic differences are present, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
[LG-42] Personalizing Mathematical Game-based Learning for Children: A Preliminary Study
链接: https://arxiv.org/abs/2603.25925
作者: Jie Gao,Adam K. Dubé
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Short research paper accepted at 27th International Conference on AI in Education (AIED 2026)
Abstract:Game-based learning (GBL) is widely adopted in mathematics education. It enhances learners’ engagement and critical thinking throughout the mathematics learning process. However, enabling players to learn intrinsically through mathematical games still presents challenges. In particular, effective GBL systems require dozens of high-quality game levels and mechanisms to deliver them to appropriate players in a way that matches their learning abilities. To address this challenge, we propose a framework, guided by adaptive learning theory, that uses artificial intelligence (AI) techniques to build a classifier for player-generated levels. We collect 206 distinct game levels created by both experts and advanced players in Creative Mode, a new tool in a math game-based learning app, and develop a classifier to extract game features and predict valid game levels. The preliminary results show that the Random Forest model is the optimal classifier among the four machine learning classification models (k-nearest neighbors, decision trees, support vector machines, and random forests). This study provides insights into the development of GBL systems, highlighting the potential of integrating AI into the game-level design process to provide more personalized game levels for players.
[LG-43] Preventing Data Leakage in EEG-Based Survival Prediction: A Two-Stage Embedding and Transformer Framework
链接: https://arxiv.org/abs/2603.25923
作者: Yixin Zhou,Zhixiang Liu,Vladimir I. Zadorozhny,Jonathan Elmer
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures. Preliminary version
Abstract:Deep learning models have shown promise in EEG-based outcome prediction for comatose patients after cardiac arrest, but their reliability is often compromised by subtle forms of data leakage. In particular, when long EEG recordings are segmented into short windows and reused across multiple training stages, models may implicitly encode and propagate label information, leading to overly optimistic validation performance and poor generalization. In this study, we identify a previously overlooked form of data leakage in multi-stage EEG modeling pipelines. We demonstrate that violating strict patient-level separation can significantly inflate validation metrics while causing substantial degradation on independent test data. To address this issue, we propose a leakage-aware two-stage framework. In the first stage, short EEG segments are transformed into embedding representations using a convolutional neural network with an ArcFace objective. In the second stage, a Transformer-based model aggregates these embeddings to produce patient-level predictions, with strict isolation between training cohorts to eliminate leakage pathways. Experiments on a large-scale EEG dataset of post-cardiac-arrest patients show that the proposed framework achieves stable and generalizable performance under clinically relevant constraints, particularly in maintaining high sensitivity at stringent specificity thresholds. These results highlight the importance of rigorous data partitioning and provide a practical solution for reliable EEG-based outcome prediction.
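The core leakage-avoidance principle, that all windows from a given patient must land on the same side of every split, is exactly what grouped cross-validation enforces. A minimal sketch with scikit-learn's `GroupKFold` on invented data (patient counts and feature sizes are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 300 EEG windows cut from 30 patients (10 windows each).
rng = np.random.default_rng(0)
n_patients, windows_per_patient = 30, 10
groups = np.repeat(np.arange(n_patients), windows_per_patient)
X = rng.normal(size=(n_patients * windows_per_patient, 16))
y = np.repeat(rng.integers(0, 2, size=n_patients), windows_per_patient)

# GroupKFold guarantees that no patient contributes windows to both the
# training and validation side of any split, closing the leakage pathway
# the abstract describes for window-level random splits.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
print("all splits are patient-disjoint")
```

In a multi-stage pipeline, the same grouping must be applied at every stage (embedding training and aggregator training), not just the final one.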
[LG-44] Parameter-Free Dynamic Regret for Unconstrained Linear Bandits AISTATS2026
链接: https://arxiv.org/abs/2603.25916
作者: Alberto Rumi,Andrew Jacobsen,Nicolò Cesa-Bianchi,Fabio Vitale
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages. v1: AISTATS 2026
Abstract:We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T \in \mathbb{R}^d$, but receives only point-evaluation feedback on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t \mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ of an arbitrary comparator sequence. In particular, we provide the first algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T)T}\big)$ up to poly-logarithmic terms without prior knowledge of $S_T$, thus resolving a long-standing open problem.
[LG-45] Data-Driven Plasticity Modeling via Acoustic Profiling
链接: https://arxiv.org/abs/2603.25894
作者: Khalid El-Awady
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a data-driven framework for modeling plastic deformation in crystalline metals through acoustic emission (AE) analysis. Building on experimental data from compressive loading of nickel micropillars, the study introduces a wavelet-based method using Morlet transforms to detect AE events across distinct frequency bands, enabling identification of both large and previously overlooked small-scale events. The detected events are validated against stress-drop dynamics, demonstrating strong physical consistency and revealing a relationship between AE energy release and strain evolution, including the onset of increased strain rate following major events. Leveraging labeled datasets of events and non-events, the work applies machine learning techniques, showing that engineered time and frequency domain features significantly outperform raw signal classifiers, and identifies key discriminative features such as RMS amplitude, zero crossing rate, and spectral centroid. Finally, clustering analysis uncovers four distinct AE event archetypes corresponding to different deformation mechanisms, highlighting the potential for transitioning from retrospective analysis to predictive modeling of material behavior using acoustic signals.
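The wavelet-based detection step, bandpass the signal with a Morlet wavelet and flag samples whose envelope exceeds a robust threshold, can be sketched in plain NumPy. Everything below (sampling rate, frequency bands, threshold rule, injected burst) is a hypothetical illustration, not the paper's pipeline:

```python
import numpy as np

def morlet_kernel(freq, fs, n_cycles=6):
    """Complex Morlet wavelet centred at `freq` Hz."""
    sigma_t = n_cycles / (2 * np.pi * freq)
    t = np.arange(-4 * sigma_t, 4 * sigma_t, 1 / fs)
    return np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))

def detect_events(signal, fs, bands, k=5.0):
    """Flag samples whose band-limited envelope exceeds a robust (MAD) threshold."""
    hits = np.zeros(len(signal), dtype=bool)
    for f in bands:
        env = np.abs(np.convolve(signal, morlet_kernel(f, fs), mode="same"))
        mad = np.median(np.abs(env - np.median(env)))
        hits |= env > np.median(env) + k * 1.4826 * mad
    return hits

fs = 10_000
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(1)
sig = 0.1 * rng.normal(size=t.size)
sig[5000:5100] += np.sin(2 * np.pi * 2000 * t[5000:5100])  # injected AE-like burst
hits = detect_events(sig, fs, bands=[500, 2000])
print(hits[5000:5100].mean(), hits[:4000].mean())
```

Scanning multiple centre frequencies is what lets small, band-limited events stand out even when they are invisible in the broadband trace.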
[LG-46] DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
链接: https://arxiv.org/abs/2603.25872
作者: Runsheng Bai,Chengyu Zhang,Yangdong Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip transitions to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of $\tfrac{1}{n}$ or $\tfrac{2}{n+1}$, depending on whether the conservative or aggressive mode is used, where $n$ denotes the number of devices. Empirically, DRiffusion attains $1.4\times$ to $3.7\times$ speedup across multiple diffusion models while incurring minimal degradation in generation quality: on the MS-COCO dataset, both FID and CLIP remain largely on par with those of the original model, while PickScore and HPSv2.1 show only minor average drops of 0.17 and 0.43, respectively. These results verify that DRiffusion delivers substantial acceleration and preserves perceptual quality.
[LG-47] In-Context Molecular Property Prediction with LLMs: A Blinding Study on Memorization and Knowledge Conflicts
链接: https://arxiv.org/abs/2603.25857
作者: Matthias Busch,Marius Tacke,Sviatlana V. Lamaka,Mikhail L. Zheludkevich,Christian J. Cyron,Christian Feiler,Roland C. Aydin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The capabilities of large language models (LLMs) have expanded beyond natural language processing to scientific prediction tasks, including molecular property prediction. However, their effectiveness in in-context learning remains ambiguous, particularly given the potential for training data contamination in widely used benchmarks. This paper investigates whether LLMs perform genuine in-context regression on molecular properties or rely primarily on memorized values. Furthermore, we analyze the interplay between pre-trained knowledge and in-context information through a series of progressively blinded experiments. We evaluate nine LLM variants across three families (GPT-4.1, GPT-5, Gemini 2.5) on three MoleculeNet datasets (Delaney solubility, Lipophilicity, QM7 atomization energy) using a systematic blinding approach that iteratively reduces available information. Complementing this, we utilize varying in-context sample sizes (0-, 60-, and 1000-shot) as an additional control for information access. This work provides a principled framework for evaluating molecular property prediction under controlled information access, addressing concerns regarding memorization and exposing conflicts between pre-trained knowledge and in-context information.
[LG-48] Incorporating contextual information into KGWAS for interpretable GWAS discovery
链接: https://arxiv.org/abs/2603.25855
作者: Cheng Jiang,Brady Ryan,Megan Crow,Kipper Fletez-Brant,Kashish Doshi,Sandra Melo Carlos,Kexin Huang,Burkhard Hoeckendorf,Heming Yao,David Richmond
类目: Machine Learning (cs.LG)
*备注:
Abstract:Genome-Wide Association Studies (GWAS) identify associations between genetic variants and disease; however, moving beyond associations to causal mechanisms is critical for therapeutic target prioritization. The recently proposed Knowledge Graph GWAS (KGWAS) framework addresses this challenge by linking genetic variants to downstream gene-gene interactions via a knowledge graph (KG), thereby improving detection power and providing mechanistic insights. However, the original KGWAS implementation relies on a large general-purpose KG, which can introduce spurious correlations. We hypothesize that cell-type specific KGs from disease-relevant cell types will better support disease mechanism discovery. Here, we show that the general-purpose KG in KGWAS can be substantially pruned with no loss of statistical power on downstream tasks, and that performance further improves by incorporating gene-gene relationships derived from perturb-seq data. Importantly, using a sparse, context-specific KG from direct perturb-seq evidence yields more consistent and biologically robust disease-critical networks.
[LG-49] A Neural Score-Based Particle Method for the Vlasov-Maxwell-Landau System ICLR
链接: https://arxiv.org/abs/2603.25832
作者: Vasily Ilin,Jingwei Hu
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注: presented at ICLR AIPDE workshop
Abstract:Plasma modeling is central to the design of nuclear fusion reactors, yet simulating collisional plasma kinetics from first principles remains a formidable computational challenge: the Vlasov-Maxwell-Landau (VML) system describes six-dimensional phase-space transport under self-consistent electromagnetic fields together with the nonlinear, nonlocal Landau collision operator. A recent deterministic particle method for the full VML system estimates the velocity score function via the blob method, a kernel-based approximation with $O(n^2)$ cost. In this work, we replace the blob score estimator with score-based transport modeling (SBTM), in which a neural network is trained on-the-fly via implicit score matching at $O(n)$ cost. We prove that the approximated collision operator preserves momentum and kinetic energy, and dissipates an estimated entropy. We also characterize the unique global steady state of the VML system and its electrostatic reduction, providing the ground truth for numerical validation. On three canonical benchmarks – Landau damping, two-stream instability, and Weibel instability – SBTM is more accurate than the blob method, achieves correct long-time relaxation to Maxwellian equilibrium where the blob method fails, and delivers 50% faster runtime with $4\times$ lower peak memory.
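Implicit score matching, the training objective mentioned above, minimizes $\mathbb{E}[\tfrac{1}{2}\|s_\theta(x)\|^2 + \nabla\!\cdot s_\theta(x)]$ and needs no access to the true score. A hypothetical 1-D toy (not the paper's SBTM network) with a linear score model, whose divergence is just its slope, makes the mechanics concrete:

```python
import numpy as np

# Toy model s(x) = a*x + c with divergence a. Data from N(mu, sigma^2) has
# true score s*(x) = -(x - mu)/sigma^2, so we expect a -> -1/sigma^2 = -0.25
# and c -> mu/sigma^2 = 0.25 for the values below.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=20_000)

a, c = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    s = a * x + c
    grad_a = np.mean(s * x) + 1.0   # d/da of E[0.5 s^2 + a]; the +1 is the divergence term
    grad_c = np.mean(s)             # d/dc of E[0.5 s^2 + a]
    a -= lr * grad_a
    c -= lr * grad_c
print(a, c)
```

Replacing the linear model with a neural network (and the analytic divergence with an estimator) yields the $O(n)$ training step the abstract contrasts with the $O(n^2)$ blob estimator.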
[LG-50] ExVerus: Verus Proof Repair via Counterexample Reasoning
链接: https://arxiv.org/abs/2603.25810
作者: Jun Yang,Yuechun Sun,Yi Wu,Rodrigo Caridad,Yongwei Yuan,Jianan Yao,Shan Lu,Kexin Pei
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注: 31 pages, 8 figures
Abstract:Large Language Models (LLMs) have shown promising results in automating formal verification. However, existing approaches treat proof generation as a static, end-to-end prediction over source code, relying on limited verifier feedback and lacking access to concrete program behaviors. We present EXVERUS, a counterexample-guided framework that enables LLMs to reason about proofs using behavioral feedback via counterexamples. When a proof fails, EXVERUS automatically generates and validates counterexamples, and then guides the LLM to generalize them into inductive invariants to block these failures. Our evaluation shows that EXVERUS significantly improves proof accuracy, robustness, and token efficiency over the state-of-the-art prompting-based Verus proof generator.
[LG-51] A Judge Agent Closes the Reliability Gap in AI-Generated Scientific Simulation
链接: https://arxiv.org/abs/2603.25780
作者: Chengshuai Yang
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 36 pages, 5 figures, 22 tables, includes Supplementary Information
Abstract:Large language models can generate scientific simulation code, but the generated code silently fails on most non-textbook problems. We show that classical mathematical validation – well-posedness, convergence, and error certification – can be fully automated by a Judge Agent, reducing the silent-failure rate from 42% to 1.5% across 134 test cases spanning 12 scientific domains. The headline result comes from a prospective benchmark: 72 blinded tasks submitted by 12 independent scientists yield an 89% success rate (95% CI: [80%, 95%]) with automated error bounds, versus 53% without the Judge. On clinical CT (the only powered experiment, n = 200), the pipeline reaches 99% of expert quality. The residual 1.5% concentrates at bifurcation points where certifiability breaks down. We formalize this boundary through the simulability class S and introduce this http URL, a structured specification format that makes any scientific computation problem machine-readable and solver-independent. Code, data, and all 72 benchmark tasks are publicly archived.
[LG-52] Identifying Connectivity Distributions from Neural Dynamics Using Flows
链接: https://arxiv.org/abs/2603.26506
作者: Timothy Doyeon Kim,Ulises Pereira-Obilinovic,Yiliu Wang,Eric Shea-Brown,Uygar Sümbül
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:
Abstract:Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses low-rank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying dynamics. We first characterize the identifiability of connectivity structures in lrRNNs and determine conditions under which a unique solution exists. Then, to find such solutions, we develop an inference framework based on maximum entropy and continuous normalizing flows (CNFs), trained via flow matching. Instead of estimating a single connectivity matrix, our method learns the maximally unbiased distribution over connection weights consistent with observed dynamics. This approach captures complex yet necessary distributions such as heavy-tailed connectivity found in empirical data. We validate our method on synthetic datasets with connectivity structures that generate multistable attractors, limit cycles, and ring attractors, and demonstrate its applicability in recordings from rat frontal cortex during decision-making. Our framework shifts circuit inference from recovering connectivity to identifying which connectivity structures are computationally required, and which are artifacts of underconstrained inference.
[LG-53] Conditional Neural Bayes Ratio Estimation for Experimental Design Optimisation
链接: https://arxiv.org/abs/2603.26489
作者: S. A. K. Leeney,T. Gessey-Jones,W. J. Handley,E. de Lera Acedo,H. T. J. Bevins,J. L. Tutt
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: 11 pages, 5 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems
Abstract:For frontier experiments operating at the edge of detectability, instrument design directly determines the probability of discovery. We introduce Conditional Neural Bayes Ratio Estimation (cNBRE), which extends neural Bayes ratio estimation by conditioning on design parameters, enabling a single trained network to estimate Bayes factors across a continuous design space. Applied to 21-cm radio cosmology with simulations representative of the REACH experiment, the amortised nature of cNBRE enables systematic design space exploration that would be intractable with traditional point-wise methods, while recovering established physical relationships. The analysis demonstrates a ~20 percentage point variation in detection probability with antenna orientation for a single night of observation, a design decision that would be trivial to implement if determined prior to antenna construction. This framework enables efficient, globally-informed experimental design optimisation for a wide range of scientific applications.
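The ratio-estimation trick behind this kind of method can be shown on a hypothetical toy simulator (not the REACH model): a classifier trained to separate joint $(\theta, x, d)$ tuples from $\theta$-shuffled ones recovers, via its logit, $\log p(x\mid\theta,d) - \log p(x\mid d)$ across the whole design space with a single fit. The features below are hand-crafted sufficient statistics chosen so the toy problem is exactly solvable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model: theta ~ N(0,1), design d in {0.5, 2.0}, x | theta, d ~ N(theta*d, 1).
rng = np.random.default_rng(0)
n = 50_000
theta = rng.normal(size=n)
d = rng.choice([0.5, 2.0], size=n)
x = theta * d + rng.normal(size=n)

def features(theta, x, d):
    # Statistics that make the true log-ratio linear in the feature vector.
    return np.column_stack([
        x**2, x**2 / (d**2 + 1), x * theta * d, (theta * d) ** 2,
        np.log(d**2 + 1),
    ])

theta_shuf = rng.permutation(theta)  # marginal class: break the theta-x link
F = np.vstack([features(theta, x, d), features(theta_shuf, x, d)])
y = np.concatenate([np.ones(n), np.zeros(n)])
clf = LogisticRegression(C=1e4, max_iter=2000).fit(F, y)

def analytic_log_ratio(theta, x, d):
    # log N(x; theta*d, 1) - log N(x; 0, d^2 + 1)
    return (-0.5 * (x - theta * d) ** 2 + 0.5 * x**2 / (d**2 + 1)
            + 0.5 * np.log(d**2 + 1))

logit = clf.decision_function(features(theta, x, d))
print(np.corrcoef(logit, analytic_log_ratio(theta, x, d))[0, 1])
```

Because $d$ is an input to the classifier, one fit amortizes ratio estimation over the design space, which is what enables the point-free design-space exploration the abstract describes.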
[LG-54] Reconstructing Quantum Dot Charge Stability Diagrams with Diffusion Models
链接: https://arxiv.org/abs/2603.26432
作者: Vinicius Hernandes,Joseph Rogers,Rouven Koch,Thomas Spriggs,Brennan Undseth,Anasua Chatterjee,Lieven M. K. Vandersypen,Eliska Greplova
类目: Quantum Physics (quant-ph); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: Code available at this https URL . Data available at this https URL
Abstract:Efficiently characterizing quantum dot (QD) devices is a critical bottleneck when scaling quantum processors based on confined spins. Measuring high-resolution charge stability diagrams (or CSDs, data maps which crucially define the occupation of QDs) is time-consuming, particularly in emerging architectures where CSDs must be acquired with remote sensors that cannot probe the charge of the relevant dots directly. In this work, we present a generative approach to accelerate acquisition by reconstructing full CSDs from sparse measurements, using a conditional diffusion model. We evaluate our approach using two experimentally motivated masking strategies: uniform grid-based sampling, and line-cut sweeps. Our lightweight architecture, trained on approximately 9,000 examples, successfully reconstructs CSDs, maintaining key physically important features such as charge transition lines, from as little as 4% of the total measured data. We compare the approach to interpolation methods, which fail when the task involves reconstructing large unmeasured regions. Our results demonstrate that generative models can significantly reduce the characterization overhead for quantum devices, and provides a robust path towards an experimental implementation.
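The two experimentally motivated masking strategies are easy to picture; a minimal NumPy sketch on a hypothetical 100×100 charge-stability-diagram grid, with each mask keeping roughly 4% of the pixels as in the paper's sparsest setting:

```python
import numpy as np

H = W = 100  # hypothetical CSD resolution

# (1) Uniform grid-based sampling: keep every 5th pixel along both axes.
grid_mask = np.zeros((H, W), dtype=bool)
grid_mask[::5, ::5] = True

# (2) Line-cut sweeps: keep 4 full rows (fast 1-D gate-voltage sweeps).
line_mask = np.zeros((H, W), dtype=bool)
line_mask[np.linspace(0, H - 1, 4, dtype=int), :] = True

print(grid_mask.mean(), line_mask.mean())  # both equal 0.04
```

Interpolation handles the grid mask reasonably but fails on the large unmeasured bands between line cuts, which is where a generative conditional model has room to help.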
[LG-55] Kantorovich–Kernel Neural Operators: Approximation Theory, Asymptotics, and Neural Network Interpretation
链接: https://arxiv.org/abs/2603.26418
作者: Tian-Xiao He
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注:
Abstract:This paper studies a class of multivariate Kantorovich-kernel neural network operators, including the deep Kantorovich-type neural network operators studied by Sharma and Singh. We prove density results, establish quantitative convergence estimates, derive Voronovskaya-type theorems, analyze the limits of partial differential equations for deep composite operators, prove Korovkin-type theorems, and propose inversion theorems. Furthermore, this paper discusses the connection between neural network architectures and the classical positive operators proposed by Chui, Hsu, He, Lorentz, and Korovkin.
[LG-56] A Power-Weighted Noncentral Complex Gaussian Distribution
链接: https://arxiv.org/abs/2603.26344
作者: Toru Nakashika
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:
Abstract:The complex Gaussian distribution has been widely used as a fundamental spectral and noise model in signal processing and communication. However, its Gaussian structure often limits its ability to represent the diverse amplitude characteristics observed in individual source signals. On the other hand, many existing non-Gaussian amplitude distributions derived from hyperspherical models achieve good empirical fit due to their power-law structures, while they do not explicitly account for the complex-plane geometry inherent in complex-valued observations. In this paper, we propose a new probabilistic model for complex-valued random variables, which can be interpreted as a power-weighted noncentral complex Gaussian distribution. Unlike conventional hyperspherical amplitude models, the proposed model is formulated directly on the complex plane and preserves the geometric structure of complex-valued observations while retaining a higher-dimensional interpretation. The model introduces a nonlinear phase diffusion through a single shape parameter, enabling continuous control of the distributional geometry from arc-shaped diffusion along the phase direction to concentration of probability mass toward the origin. We formulate the proposed distribution and analyze the statistical properties of the induced amplitude distribution. The derived amplitude and power distributions provide a unified framework encompassing several widely used distributions in signal modeling, including the Rice, Nakagami, and gamma distributions. Experimental results on speech power spectra demonstrate that the proposed model consistently outperforms conventional distributions in terms of log-likelihood.
[LG-57] Making Multi-Axis Models Robust to Multiplicative Noise: How and Why?
链接: https://arxiv.org/abs/2603.26327
作者: Bailey Andrew,David R. Westhead,Luisa Cutillo
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 9 pages (26 with supplemental), 4 figures (+2 in supplemental), preprint
Abstract:In this paper we develop a graph-learning algorithm, MED-MAGMA, to fit multi-axis (Kronecker-sum-structured) models corrupted by multiplicative noise. This type of noise is natural in many application domains, such as that of single-cell RNA sequencing, in which it naturally captures technical biases of RNA sequencing platforms. Our work is evaluated against prior work on each and every public dataset in the Single Cell Expression Atlas under a certain size, demonstrating that our methodology learns networks with better local and global structure. MED-MAGMA is made available as a Python package (MED-MAGMA).
[LG-58] STN-GPR: A Singularity Tensor Network Framework for Efficient Option Pricing
链接: https://arxiv.org/abs/2603.26318
作者: Dominic Gribben,Carolina Allende,Alba Villarino,Aser Cortines,Mazen Ali,Román Orús,Pascal Oswald,Noureddine Lehdili
类目: Pricing of Securities (q-fin.PR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 15 pages, 2 figures
Abstract:We develop a tensor-network surrogate for option pricing, targeting large-scale portfolio revaluation problems arising in market risk management (e.g., VaR and Expected Shortfall computations). The method involves representing high-dimensional price surfaces in tensor-train (TT) form using TT-cross approximation, constructing the surrogate directly from black-box price evaluations without materializing the full training tensor. For inference, we use a Laplacian kernel and derive TT representations of the kernel matrix and its closed-form inverse in the noise-free setting, enabling TT-based Gaussian process regression without dense matrix factorization or iterative linear solves. We found that hyperparameter optimization consistently favors a large kernel length-scale and show that in this regime the GPR predictor reduces to multilinear interpolation for off-grid inputs; we also derive a low-rank TT representation for this limit. We evaluate the approach on five-asset basket options over an eight-dimensional parameter space (asset spot levels, strike, interest rate, and time to maturity). For European geometric basket puts, the tensor surrogate achieves lower test error at shorter training times than standard GPR by scaling to substantially larger effective training sets. For American arithmetic basket puts trained on LSMC data, the surrogate exhibits more favorable scaling with training-set size while providing millisecond-level evaluation per query, with overall runtime dominated by data generation.
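The large length-scale limit identified above, where the GPR predictor reduces to multilinear interpolation of the grid values, can be sketched without any TT machinery using SciPy's `RegularGridInterpolator` on a toy 3-D surface (grid sizes and the put-like payoff are invented for illustration):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

s = np.linspace(80, 120, 9)      # spot grid
k = np.linspace(90, 110, 5)      # strike grid
t = np.linspace(0.1, 1.0, 7)     # maturity grid
S, K, T = np.meshgrid(s, k, t, indexing="ij")
prices = np.maximum(K - S, 0.0) + 0.1 * np.sqrt(T)   # toy put-like surface

# The limiting predictor: multilinear interpolation of grid values.
interp = RegularGridInterpolator((s, k, t), prices, method="linear")
query = np.array([[101.3, 97.5, 0.44]])   # off-grid (spot, strike, maturity)
print(interp(query))
```

The paper's contribution is doing the same thing implicitly in TT format, so the grid (and hence the effective training set) can be far larger than a dense interpolation table would allow.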
[LG-59] Semi-structured multi-state delinquency model for mortgage default
链接: https://arxiv.org/abs/2603.26309
作者: Victor Medina-Olivares,Wangzhen Xia,Stefan Lessmann,Nadja Klein
类目: Applications (stat.AP); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:
Abstract:We propose a semi-structured discrete-time multi-state model to analyse mortgage delinquency transitions. This model combines an easy-to-understand structured additive predictor, which includes linear effects and smooth functions of time and covariates, with a flexible neural network component that captures complex nonlinearities and higher-order interactions. To ensure identifiability when covariates are present in both components, we orthogonalise the unstructured part relative to the structured design. For discrete-time competing transitions, we derive exact transformations that map binary logistic models to valid competing transition probabilities, avoiding the need for continuous-time approximations. In simulations, our framework effectively recovers structured baseline and covariate effects while using the neural component to detect interaction patterns. We demonstrate the method using the Freddie Mac Single-Family Loan-Level Dataset, employing an out-of-time test design. Compared with a structured generalised additive benchmark, the semi-structured model provides modest but consistent gains in discrimination across the earliest prediction spans, while maintaining similar Brier scores. Adding macroeconomic indicators provides limited incremental benefit in this out-of-time evaluation and does not materially change the estimated borrower-, loan-, or duration-driven effects. Overall, semi-structured multi-state modelling offers a practical compromise between transparent effect estimates and flexible pattern learning, with potential applications beyond credit-transition forecasting.
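The orthogonalisation step described in the abstract has a simple linear-algebra core: project the neural-network features onto the orthogonal complement of the structured design, so shared variation is attributed to the interpretable part. A minimal sketch, with randomly generated stand-ins for both feature blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 4, 8
X = rng.normal(size=(n, p))   # structured design (linear terms, spline bases, ...)
U = rng.normal(size=(n, q))   # unstructured (neural-network) features

# Project U onto the orthogonal complement of col(X):
# U_orth = (I - X (X^T X)^{-1} X^T) U, computed via least squares for stability.
coef, *_ = np.linalg.lstsq(X, U, rcond=None)
U_orth = U - X @ coef

# Any model fit on [X, U_orth] now attributes variation shared with X to the
# structured component, which is the identifiability device in the abstract.
print(np.abs(X.T @ U_orth).max())  # numerically ~0
```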
[LG-60] Privacy-Accuracy Trade-offs in High-Dimensional LASSO under Perturbation Mechanisms
链接: https://arxiv.org/abs/2603.26227
作者: Ayaka Sakata,Haruka Tanzawa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 53 pages, 11 figures
Abstract:We study privacy-preserving sparse linear regression in the high-dimensional regime, focusing on the LASSO estimator. We analyze two widely used mechanisms for differential privacy: output perturbation, which injects noise into the estimator, and objective perturbation, which adds a random linear term to the loss function. Using approximate message passing (AMP), we characterize the typical behavior of these estimators under random design and privacy noise. To quantify privacy, we adopt typical-case measures, including the on-average KL divergence, which admits a hypothesis-testing interpretation in terms of distinguishability between neighboring datasets. Our analysis reveals that sparsity plays a central role in shaping the privacy-accuracy trade-off: stronger regularization can improve privacy by stabilizing the estimator against single-point data changes. We further show that the two mechanisms exhibit qualitatively different behaviors. In particular, for objective perturbation, increasing the noise level can have non-monotonic effects, and excessive noise may destabilize the estimator, leading to increased sensitivity to data perturbations. Our results demonstrate that AMP provides a powerful framework for analyzing privacy-accuracy trade-offs in high-dimensional sparse models.
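The two mechanisms compared in the paper can be sketched with a plain proximal-gradient (ISTA) LASSO solver: output perturbation adds noise to the fitted coefficients, while objective perturbation adds a random linear term b^T w to the loss before solving. The noise scale and problem sizes below are illustrative and not calibrated to any privacy budget.

```python
import numpy as np

def ista_lasso(X, y, lam, b=None, n_iter=500):
    """Proximal gradient (ISTA) for 0.5*||y - Xw||^2 + lam*||w||_1 + b^T w.
    The optional linear term b implements objective perturbation."""
    n, p = X.shape
    w = np.zeros(p)
    if b is None:
        b = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = largest eigenvalue of X^T X
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + b
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p)) / np.sqrt(n)
w_true = np.zeros(p); w_true[:3] = 2.0
y = X @ w_true + 0.1 * rng.normal(size=n)
lam, sigma = 0.05, 0.5  # illustrative regularization strength and noise level

w_hat = ista_lasso(X, y, lam)
# Output perturbation: noise is added to the released estimator.
w_out = w_hat + sigma * rng.normal(size=p)
# Objective perturbation: a random linear term is added to the loss, then solved.
w_obj = ista_lasso(X, y, lam, b=sigma * rng.normal(size=p))
```

The paper's AMP analysis characterizes the typical behavior of exactly these two estimators in the high-dimensional limit; the sketch only shows how the mechanisms differ operationally.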
[LG-61] On associative neural networks for sparse patterns with huge capacities
链接: https://arxiv.org/abs/2603.26217
作者: Matthias Löwe,Franck Vermet
类目: Probability (math.PR); Machine Learning (cs.LG)
*备注: 22 pages
Abstract:Generalized Hopfield models with higher-order or exponential interaction terms are known to have substantially larger storage capacities than the classical quadratic model. On the other hand, associative memories for sparse patterns, such as the Willshaw and Amari models, already outperform the classical Hopfield model in the sparse regime. In this paper we combine these two mechanisms. We introduce higher-order versions of sparse associative memory models and study their storage capacities. For fixed interaction order n, we obtain storage capacities of polynomial order in the system size. When the interaction order is allowed to grow logarithmically with the number of neurons, this yields super-polynomial capacities. We also discuss an analogue in the Gripon-Berrou architecture, which was originally formulated for non-sparse messages (see Gripon and Berrou). Our results show that the capacity increase caused by higher-order interactions persists in the sparse setting, although the precise storage scale depends on the underlying architecture.
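The sparse associative memories the paper builds on can be illustrated by the classical pairwise (interaction order n = 2) Willshaw model; the paper's contribution is the higher-order generalization, which this sketch does not implement. All sizes below are illustrative.

```python
import numpy as np

def willshaw_store(patterns):
    """Classical pairwise Willshaw memory: clipped Hebbian OR of outer products.
    The paper studies higher-order analogues; this is the n = 2 base case."""
    N = patterns.shape[1]
    W = np.zeros((N, N), dtype=bool)
    for xi in patterns:
        W |= np.outer(xi, xi).astype(bool)
    return W

def willshaw_retrieve(W, cue, k):
    """Retrieve a k-sparse pattern by thresholding the dendritic sums at the
    number of active cue units."""
    h = W.astype(int) @ cue          # dendritic sums
    return (h >= cue.sum()).astype(int)

rng = np.random.default_rng(2)
N, k, M = 200, 7, 20                 # neurons, pattern sparsity, stored patterns
patterns = np.zeros((M, N), dtype=int)
for m in range(M):
    patterns[m, rng.choice(N, size=k, replace=False)] = 1

W = willshaw_store(patterns)
recalled = willshaw_retrieve(W, patterns[0], k)
print(int((recalled >= patterns[0]).all()))  # stored units are always recovered
```

Retrieval can activate spurious extra units when stored patterns overlap, which is why capacity analysis in this model family tracks how many sparse patterns can be stored before such crosstalk dominates.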
[LG-62] Asymptotic Optimism for Tensor Regression Models with Applications to Neural Network Compression
链接: https://arxiv.org/abs/2603.26048
作者: Haoming Shi,Eric C. Chi,Hengrui Luo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 62 pages, 11 figures
Abstract:We study rank selection for low-rank tensor regression under random covariate designs. Under a Gaussian random-design model and some mild conditions, we derive population expressions for the expected training-testing discrepancy (optimism) for both CP and Tucker decompositions. We further demonstrate that the optimism is minimized at the true tensor rank for both CP and Tucker regression. This yields a prediction-oriented rank-selection rule that aligns with cross-validation and extends naturally to tensor-model averaging. We also discuss conditions under which under- or over-ranked models may appear preferable, thereby clarifying the scope of the method. Finally, we showcase its practical utility on a real-world image regression task and extend its application to tensor-based compression of neural networks, highlighting its potential for model selection in deep learning.
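The role of optimism in rank selection can be seen in a matrix (order-2) analogue of the tensor setting: training error alone always decreases with rank, so an optimism-style correction to the training error is what allows the true rank to be recovered. A toy reduced-rank regression sketch (not the paper's CP/Tucker estimators), with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, true_rank = 300, 10, 8, 3
# Low-rank coefficient matrix B = A C^T, a matrix stand-in for the CP/Tucker case.
B = rng.normal(size=(p, true_rank)) @ rng.normal(size=(true_rank, q))
X = rng.normal(size=(n, p))
Y = X @ B + 0.5 * rng.normal(size=(n, q))

B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
U, s, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)

train_err = []
for k in range(1, q + 1):
    Vk = Vt[:k].T
    B_k = B_ols @ Vk @ Vk.T          # classical reduced-rank regression solution
    train_err.append(np.mean((Y - X @ B_k) ** 2))

# Training error is monotone non-increasing in rank, so it always prefers the
# largest rank; the paper's point is that optimism (the expected test-train
# gap) corrects for this and is minimized at the true rank.
print([round(e, 3) for e in train_err])
```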
[LG-63] A Priori Sampling of Transition States with Guided Diffusion
链接: https://arxiv.org/abs/2603.25980
作者: Hyukjun Lim,Soojung Yang,Lucas Pinède,Miguel Steiner,Yuanqi Du,Rafael Gómez-Bombarelli
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Transition states, the first-order saddle points on the potential energy surfaces, govern the kinetics and mechanisms of chemical reactions and conformational changes. Locating them is challenging because transition pathways are topologically complex and can proceed via an ensemble of diverse routes. Existing methods address these challenges by introducing heuristic assumptions about the pathway or reaction coordinates, which limits their applicability when a good initial guess is unavailable or when the guess precludes alternative, potentially relevant pathways. We propose to bypass such heuristic limitations by introducing ASTRA, A Priori Sampling of TRAnsition States with Guided Diffusion, which reframes the transition state search as an inference-time scaling problem for generative models. ASTRA trains a score-based diffusion model on configurations from known metastable states. Then, ASTRA guides inference toward the isodensity surface separating the basins of metastable states via a principled composition of conditional scores. A Score-Aligned Ascent (SAA) process then approximates a reaction coordinate from the difference between conditioned scores and combines it with physical forces to drive convergence onto first-order transition states. Validated on benchmarks ranging from 2D potentials to biomolecular conformational changes and chemical reactions, ASTRA locates transition states with high precision and discovers multiple reaction pathways, enabling mechanistic studies of complex molecular systems.
[LG-64] Globalized Adversarial Regret Optimization: Robust Decisions with Uncalibrated Predictions
链接: https://arxiv.org/abs/2603.25948
作者: Jannis Kurtz,Bart P.G. van Parys
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Optimization problems routinely depend on uncertain parameters that must be predicted before a decision is made. Classical robust and regret formulations are designed to handle erroneous predictions and can provide statistical error bounds in simple settings. However, when predictions lack rigorous error bounds (as is typical of modern machine learning methods) classical robust models often yield vacuous guarantees, while regret formulations can paradoxically produce decisions that are more optimistic than even a nominal solution. We introduce Globalized Adversarial Regret Optimization (GARO), a decision framework that controls adversarial regret, defined as the gap between the worst-case cost and the oracle robust cost, uniformly across all possible uncertainty set sizes. By design, GARO delivers absolute or relative performance guarantees against an oracle with full knowledge of the prediction error, without requiring any probabilistic calibration of the uncertainty set. We show that GARO equipped with a relative rate function generalizes the classical adaptation method of Lepski to downstream decision problems. We derive exact tractable reformulations for problems with affine worst-case cost functions and polyhedral norm uncertainty sets, and provide a discretization and a constraint-generation algorithm with convergence guarantees for general settings. Finally, experiments demonstrate that GARO yields solutions with a more favorable trade-off between worst-case and mean out-of-sample performance, as well as stronger global performance guarantees.
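The adversarial-regret idea can be sketched on a toy one-dimensional decision problem with a piecewise-linear cost: for each candidate decision, compute the worst-case cost over uncertainty intervals of every radius, subtract the oracle robust cost at that radius, and pick the decision whose largest gap is smallest. This is the absolute-regret variant; the cost parameters below are hypothetical.

```python
import numpy as np

a, b, u_hat = 2.0, 1.0, 5.0          # under/over-shoot penalties, point prediction
radii = np.linspace(0.0, 3.0, 31)    # candidate uncertainty-set sizes (error unknown)
xs = np.linspace(2.0, 8.0, 601)      # candidate decisions

def worst_case(x, r):
    """Worst-case cost of decision x over u in [u_hat - r, u_hat + r] for the
    asymmetric cost max(a*(u - x), b*(x - u))."""
    return max(a * (u_hat + r - x), b * (x - (u_hat - r)))

# Oracle robust cost at radius r (closed form for this piecewise-linear cost).
oracle = 2 * a * b * radii / (a + b)

# GARO-style choice: minimize the adversarial regret uniformly over all radii.
regret = np.array([[worst_case(x, r) for r in radii] for x in xs]) - oracle
garo_x = xs[np.argmin(regret.max(axis=1))]
print(round(float(garo_x), 2))
```

Note how the chosen decision is pulled above the point prediction (toward the costlier under-shoot side) without committing to any single uncertainty radius, which is the trade-off the abstract describes.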
[LG-65] On the Expressive Power of Contextual Relations in Transformers
链接: https://arxiv.org/abs/2603.25860
作者: Demián Fraiman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Transformer architectures have achieved remarkable empirical success in modeling contextual relationships in natural language, yet a precise mathematical characterization of their expressive power remains incomplete. In this work, we introduce a measure-theoretic framework for contextual representations in which texts are modeled as probability measures over a semantic embedding space, and contextual relations between words are represented as coupling measures between them. Within this setting, we introduce the Sinkhorn Transformer, a transformer-like architecture. Our main result is a universal approximation theorem: any continuous coupling function between probability measures, i.e., one that encodes the semantic-relation coupling measure, can be uniformly approximated by a Sinkhorn Transformer with appropriate parameters.
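The coupling construction at the heart of the architecture relies on Sinkhorn normalization, which turns a positive kernel (e.g. exponentiated attention logits) into an approximately doubly stochastic coupling matrix, a discrete analogue of the coupling measures in the abstract. A minimal sketch of the iteration (not the full architecture):

```python
import numpy as np

def sinkhorn(K, n_iter=50):
    """Alternately normalize rows and columns so a positive kernel converges
    toward a doubly stochastic matrix (a discrete coupling)."""
    P = np.array(K, dtype=float)
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)   # row normalization
        P /= P.sum(axis=0, keepdims=True)   # column normalization
    return P

rng = np.random.default_rng(4)
scores = rng.normal(size=(5, 5))            # raw attention logits
P = sinkhorn(np.exp(scores))
print(P.sum(axis=0), P.sum(axis=1))         # both ≈ all-ones vectors
```

Replacing softmax's row-only normalization with this two-sided normalization is what makes the attention matrix behave like a coupling between two distributions rather than a set of independent conditional distributions.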
[LG-66] Vision Transformers and Graph Neural Networks for Charged Particle Tracking in the ATLAS Muon Spectrometer
链接: https://arxiv.org/abs/2603.25793
作者: Jonathan Renusch(on behalf of the ATLAS Collaboration)
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:
Abstract:The identification and reconstruction of charged particles, such as muons, is a main challenge for the physics program of the ATLAS experiment at the Large Hadron Collider. This task will become increasingly difficult with the start of the High-Luminosity LHC era after 2030, when the number of proton-proton collisions per bunch crossing will increase from 60 to up to 200. This elevated interaction density will also increase the occupancy within the ATLAS Muon Spectrometer, requiring more efficient and robust real-time data processing strategies within the experiment’s trigger system, particularly the Event Filter. To address these algorithmic challenges, we present two machine-learning-based approaches. First, we target the problem of background-hit rejection in the Muon Spectrometer using Graph Neural Networks integrated into the non-ML baseline reconstruction chain, demonstrating a 15 % improvement in reconstruction speed (from 255 ms to 217 ms). Second, we present a proof-of-concept for end-to-end muon tracking using state-of-the-art Vision Transformer architectures, achieving ultra-fast approximate muon reconstruction in 2.3 ms on consumer-grade GPUs at 98 % tracking efficiency.
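A common preprocessing step for GNN-based hit rejection (though not necessarily the ATLAS-specific construction) is to turn detector hits into a graph by connecting nearby hits; the network then classifies nodes or edges as signal or background. A toy sketch with hypothetical hit positions:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(5)
# Hypothetical detector hits: (x, y, z) positions along a straight toy track,
# plus uniformly scattered background hits.
track = np.linspace(0, 1, 20)[:, None] * np.array([1.0, 0.5, 2.0])
noise = rng.uniform(-1, 2.5, size=(30, 3))
hits = np.vstack([track, noise])

# Connect each hit to its k nearest neighbours to form the edges a GNN would
# classify (the real detector uses its own geometry-aware edge construction).
k = 3
tree = cKDTree(hits)
_, idx = tree.query(hits, k=k + 1)          # first neighbour is the hit itself
edges = [(i, j) for i in range(len(hits)) for j in idx[i, 1:]]
print(len(edges))                           # 50 hits × 3 edges = 150
```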
[LG-67] SAHMM-VAE: A Source-Wise Adaptive Hidden Markov Prior Variational Autoencoder for Unsupervised Blind Source Separation
链接: https://arxiv.org/abs/2603.25776
作者: Yuan-Hao Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We propose SAHMM-VAE, a source-wise adaptive Hidden Markov prior variational autoencoder for unsupervised blind source separation. Instead of treating the latent prior as a single generic regularizer, the proposed framework assigns each latent dimension its own adaptive regime-switching prior, so that different latent dimensions are pulled toward different source-specific temporal organizations during training. Under this formulation, source separation is not implemented as an external post-processing step; it is embedded directly into variational learning itself. The encoder, decoder, posterior parameters, and source-wise prior parameters are optimized jointly, where the encoder progressively learns an inference map that behaves like an approximate inverse of the mixing transformation, while the decoder plays the role of the generative mixing model. Through this coupled optimization, the gradual alignment between posterior source trajectories and heterogeneous HMM priors becomes the mechanism through which different latent dimensions separate into different source components. To instantiate this idea, we develop three branches within one common framework: a Gaussian-emission HMM prior, a Markov-switching autoregressive HMM prior, and an HMM state-flow prior with state-wise autoregressive flow transformations. Experiments show that the proposed framework achieves unsupervised source recovery while also learning meaningful source-wise switching structures. More broadly, the method extends our structured-prior VAE line from smooth, mixture-based, and flow-based latent priors to adaptive switching priors, and provides a useful basis for future work on interpretable and potentially identifiable latent source modeling.
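The Gaussian-emission HMM prior of the first branch assigns each latent trajectory a likelihood computed by the forward algorithm; this term is what pulls a latent dimension toward a source-specific switching structure during training. A minimal stand-alone sketch with hypothetical two-regime parameters:

```python
import numpy as np
from scipy.stats import norm

def hmm_loglik(z, pi, A, mu, sd):
    """Forward algorithm: log p(z_{1:T}) under a Gaussian-emission HMM, the
    kind of per-dimension latent prior used in SAHMM-VAE's first branch."""
    T, S = len(z), len(pi)
    log_emit = norm.logpdf(z[:, None], loc=mu, scale=sd)   # (T, S)
    alpha = np.log(pi) + log_emit[0]
    for t in range(1, T):
        m = alpha.max()                                    # log-sum-exp stability
        alpha = m + np.log(np.exp(alpha - m) @ A) + log_emit[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])      # sticky regime switching
mu, sd = np.array([-1.0, 1.0]), 0.3
z = np.array([-1.1, -0.9, -1.0, 0.9, 1.1])  # trajectory that switches regime once
print(hmm_loglik(z, pi, A, mu, sd))
```

In the full model this log-likelihood enters the ELBO as the prior term for one latent dimension, with each dimension carrying its own (pi, A, mu, sd) learned jointly with the encoder and decoder.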
[LG-68] KANEL: Kolmogorov-Arnold Network Ensemble Learning Enables Early Hit Enrichment in High-Throughput Virtual Screening
链接: https://arxiv.org/abs/2603.25755
作者: Pavel Koptev,Nikita Krainov,Konstantin Malkov,Alexander Tropsha
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: 8 Pages
Abstract:Machine learning models of chemical bioactivity are increasingly used for prioritizing a small number of compounds in virtual screening libraries for experimental follow-up. In these applications, assessing model accuracy by early hit enrichment, such as the positive predictive value (PPV) calculated for the top N hits (PPV@N), is more appropriate and actionable than traditional global metrics such as AUC. We present KANEL, an ensemble workflow that combines interpretable Kolmogorov-Arnold Networks (KANs) with XGBoost, random forest, and multilayer perceptron models trained on complementary molecular representations (LillyMol descriptors, RDKit-derived descriptors, and Morgan fingerprints).
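The PPV@N metric the abstract advocates is straightforward to compute: rank compounds by model score and take the fraction of true actives among the top N. The sketch below also shows the simplest form of ensembling, averaging scores from several models; the scores, number of models, and active rate are all synthetic stand-ins for KANEL's actual components.

```python
import numpy as np

def ppv_at_n(y_true, scores, n):
    """Positive predictive value among the top-n scored compounds: the early
    hit-enrichment metric the abstract argues for over global AUC."""
    top = np.argsort(scores)[::-1][:n]
    return y_true[top].mean()

rng = np.random.default_rng(6)
y = (rng.random(1000) < 0.02).astype(int)          # ~2% actives, as in sparse screens
# Hypothetical scores from three base models; the ensemble averages them.
model_scores = [0.6 * y + rng.random(1000) * 0.8 for _ in range(3)]
ensemble = np.mean(model_scores, axis=0)

print(ppv_at_n(y, model_scores[0], 50), ppv_at_n(y, ensemble, 50))
```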
[LG-69] Uncertainty Quantification for Quantum Computing
链接: https://arxiv.org/abs/2603.25039
作者: Ryan Bennink,Olena Burkovska,Konstantin Pieper,Jorge Ramirez,Elaine Wong
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:This review is designed to introduce mathematicians and computational scientists to quantum computing (QC) through the lens of uncertainty quantification (UQ) by presenting a mathematically rigorous and accessible narrative for understanding how noise and intrinsic randomness shape quantum computational outcomes in the language of mathematics. By grounding quantum computation in statistical inference, we highlight how mathematical tools such as probabilistic modeling, stochastic analysis, Bayesian inference, and sensitivity analysis can directly address error propagation and reliability challenges in today's quantum devices. We also connect these methods to key scientific priorities in the field, including scalable uncertainty-aware algorithms and characterization of correlated errors. The purpose is to narrow the conceptual divide between applied mathematics, scientific computing, and quantum information sciences, demonstrating how mathematically rooted UQ methodologies can guide validation, error mitigation, and principled algorithm design for emerging quantum technologies, in order to address the challenges and opportunities present in modern quantum high-performance computing and fault-tolerant quantum computing paradigms.
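A concrete instance of the Bayesian-inference viewpoint the review advocates: estimating a gate error rate from repeated measurement shots, where the Beta posterior delivers a credible interval rather than a bare point estimate. The counts below are hypothetical.

```python
from scipy.stats import beta

# Minimal Bayesian UQ for a quantum benchmark: treat observed errors out of a
# fixed number of shots as Binomial, with a uniform Beta(1, 1) prior on the
# unknown error rate. The counts are hypothetical.
shots, errors = 4096, 37
a, b = 1 + errors, 1 + (shots - errors)     # Beta posterior parameters

post_mean = a / (a + b)
lo, hi = beta.ppf([0.025, 0.975], a, b)     # 95% equal-tailed credible interval
print(f"error rate ≈ {post_mean:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

The same recipe extends to the correlated-error settings the review highlights, at the cost of replacing the conjugate Beta posterior with sampling-based inference.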