本篇博文主要内容为 2026-03-25 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。

提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。

目录

概览 (2026-03-25)

今日共更新668篇论文,其中:

  • 自然语言处理81篇(Computation and Language (cs.CL))
  • 人工智能212篇(Artificial Intelligence (cs.AI))
  • 计算机视觉157篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习173篇(Machine Learning (cs.LG))
  • 多智能体系统8篇(Multiagent Systems (cs.MA))
  • 信息检索19篇(Information Retrieval (cs.IR))
  • 人机交互32篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] Planning over MAPF Agent Dependencies via Multi-Dependency PIBT

【速读】:该论文旨在解决现代多智能体路径规划(Multi-Agent Path Finding, MAPF)中高效规划大规模智能体群体的问题,尤其在高密度环境中需在1秒内完成数百至数千个智能体的路径规划。现有算法如优先继承回溯(Priority Inheritance with Backtracking, PIBT)及其扩展增强版本(Enhanced PIBT, EPIBT)受限于规则驱动的搜索策略,仅能处理与最多一个其他智能体冲突的路径,导致其通用性不足。解决方案的关键在于提出一种基于智能体依赖关系的新视角——定义“agent dependencies”并构建多依赖优先继承回溯(Multi-Dependency PIBT, MD-PIBT),该框架通过在依赖关系空间中进行搜索,不仅可重现PIBT和EPIBT的行为,还能生成PIBT/EPIBT无法表达的新颖规划策略。实验表明,MD-PIBT在包含石子移动、旋转运动及差速驱动机器人(含速度与加速度限制)等多种动力学约束下,可有效规划多达10,000个同质智能体的路径,且在大尺寸智能体场景中表现尤为优越。

链接: https://arxiv.org/abs/2603.23405
作者: Zixiang Jiang,Yulun Zhang,Rishi Veerapaneni,Jiaoyang Li
机构: University Of Melbourne (墨尔本大学); Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人研究所)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT is constrained by its rule-based planning procedure and lacks generality because it restricts its search to paths that conflict with at most one other agent. This limitation also applies to Enhanced PIBT (EPIBT), a recent extension of PIBT. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT’s priority inheritance logic, we define the concept of agent dependencies and propose Multi-Dependency PIBT (MD-PIBT) that searches over agent dependencies. MD-PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations yield novel planning strategies that are not expressible by PIBT or EPIBT. Our experiments demonstrate that MD-PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD-PIBT is particularly effective in MAPF with large agents.

[MA-1] Privacy-Aware Smart Cameras: View Coverag e via Socially Responsible Coordination

【速读】:该论文旨在解决城市智能系统中监控摄像头在追求最大视场覆盖的同时如何合法保护隐私敏感区域的问题,避免过度监控或依赖昂贵的加密技术。其核心解决方案是一种去中心化的框架,通过交互式智能摄像头基于集体学习自主调整朝向,在满足软约束与硬约束的前提下实现隐私保护与覆盖效率的协同优化,从而在无需集中控制的情况下支持数百至数千个摄像头的规模化部署,并显著提升覆盖率和降低隐私违规风险。

链接: https://arxiv.org/abs/2603.23197
作者: Chuhao Qin,Lukas Esterle,Evangelos Pournaras
机构: 未知
类目: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Coordination of view coverage via privacy-aware smart cameras is key to a more socially responsible urban intelligence. Rather than maximizing view coverage at any cost or over relying on expensive cryptographic techniques, we address how cameras can coordinate to legitimately monitor public spaces while excluding privacy-sensitive regions by design. This article proposes a decentralized framework in which interactive smart cameras coordinate to autonomously select their orientation via collective learning, while eliminating privacy violations via soft and hard constraint satisfaction. The approach scales to hundreds up to thousands of cameras without any centralized control. Experimental evidence shows 18.42% higher coverage efficiency and 85.53% lower privacy violation than baselines and other state-of-the-art approaches. This significant advance further unravels practical guidelines for operators and policymakers: how the field of view, spatial placement, and budget of cameras operating by ethically-aligned artificial intelligence jointly influence coverage efficiency and privacy protection in large-scale and sensitive urban environments.

[MA-2] Behavioral Heterogeneity as Quantum-Inspired Representation

【速读】:该论文旨在解决传统驾驶行为建模中将驾驶员异质性简化为静态标签或离散类别的问题,这种做法忽略了驾驶行为的动态演化特性。解决方案的关键在于引入量子启发表示(quantum-inspired representation),将每位驾驶员建模为一个随时间演化的潜在状态,以密度矩阵(density matrix)的形式呈现,并具备结构化的数学性质;通过非线性随机傅里叶特征(non-linear Random Fourier Features)嵌入行为观测数据,同时利用时序持久性与情境依赖的特征激活机制实现状态演化,从而更准确地提取和分析驾驶行为特征。

链接: https://arxiv.org/abs/2603.22729
作者: Mohammad Elayan,Wissam Kontar
机构: University of Nebraska–Lincoln (内布拉斯加大学林肯分校)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Driver heterogeneity is often reduced to labels or discrete regimes, compressing what is inherently dynamic into static categories. We introduce quantum-inspired representation that models each driver as an evolving latent state, presented as a density matrix with structured mathematical properties. Behavioral observations are embedded via non-linear Random Fourier Features, while state evolution blends temporal persistence of behavior with context-dependent profile activation. We evaluate our approach on empirical driving data, Third Generation Simulation Data (TGSIM), showing how driving profiles are extracted and analyzed.

[MA-3] STRIATUM-CTF: A Protocol-Driven Agent ic Framework for General-Purpose CTF Solving

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在网络安全攻防场景中难以进行多步骤、状态依赖的推理问题,尤其是现有静态基准无法反映真实漏洞利用过程的动态性。其解决方案的关键在于提出STRATIUM-CTF框架,该框架基于模型上下文协议(Model Context Protocol, MCP),通过标准化系统探查、反编译和运行时调试等工具接口,实现跨长时间exploit轨迹的连贯上下文维护。实验表明,该方法在2025年末一场大学主办的夺旗赛(Capture-the-Flag, CTF)中自主运行并获得第一名,显著优于21支人类队伍,且MCP驱动的工具抽象能有效降低幻觉现象,验证了标准化上下文协议对构建鲁棒自主网络推理系统的重要性。

链接: https://arxiv.org/abs/2603.22577
作者: James Hugglestone,Samuel Jacob Chacko,Dawson Stoller,Ryan Schmidt,Xiuwen Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 8 pages, 7 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated potential in code generation, yet they struggle with the multi-step, stateful reasoning required for offensive cybersecurity operations. Existing research often relies on static benchmarks that fail to capture the dynamic nature of real-world vulnerabilities. In this work, we introduce STRIATUM-CTF (A Search-based Test-time Reasoning Inference Agent for Tactical Utility Maximization in Cybersecurity), a modular agentic framework built upon the Model Context Protocol (MCP). By standardizing tool interfaces for system introspection, decompilation, and runtime debugging, STRIATUM-CTF enables the agent to maintain a coherent context window across extended exploit trajectories. We validate this approach not merely on synthetic datasets, but in a live competitive environment. Our system participated in a university-hosted Capture-the-Flag (CTF) competition in late 2025, where it operated autonomously to identify and exploit vulnerabilities in real-time. STRIATUM-CTF secured First Place, outperforming 21 human teams and demonstrating strong adaptability in a dynamic problem-solving setting. We analyze the agent’s decision-making logs to show how MCP-based tool abstraction significantly reduces hallucination compared to naive prompting strategies. These results suggest that standardized context protocols are a critical path toward robust autonomous cyber-reasoning systems. Comments: 8 pages, 7 pages Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) ACMclasses: I.2.1 Cite as: arXiv:2603.22577 [cs.CR] (or arXiv:2603.22577v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.22577 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-4] rustTrade: Human-Inspired Selective Consensus Reduces Decision Uncertainty in LLM Trading Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在金融交易场景中因“均匀信任”(uniform trust)行为偏差导致的决策不稳定问题,即LLM将检索到的信息默认为事实,并对不同来源信息赋予同等权重,从而放大多源噪声和虚假信息,引发生成式AI(Generative AI)驱动交易系统中的幻觉增强与风险收益表现波动。解决方案的关键在于提出TrustTrade框架——一个受人类认知启发的多智能体选择性共识机制,通过引入跨代理一致性判断替代均匀信任:利用多个独立LLM代理对信息进行聚合,并基于语义与数值层面的一致性动态加权信号,优先保留一致性强的输入,同时对分歧较大、证据薄弱或时间上不一致的信号进行选择性折扣;此外,结合确定性时序信号作为可复现锚点以及反思记忆机制以测试时自适应调整风险偏好,从而有效抑制噪声放大与幻觉驱动的波动,使LLM交易行为从极端风险-收益状态收敛至更符合人类偏好的中等风险-中等收益范式。

链接: https://arxiv.org/abs/2603.22567
作者: Minghan Li,Rachel Gonsalves,Weiyue Li,Sunghoon Yoon,Mengyu Wang
机构: Harvard Medical School (哈佛医学院); Harvard University (哈佛大学); Harvard Business School (哈佛商学院); DGIST (韩国科学技术院)
类目: Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
备注: 24 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents in financial trading. However, they often exhibit a hazardous behavioral bias that we term uniform trust, whereby retrieved information is implicitly assumed to be factual and heterogeneous sources are treated as equally informative. This assumption stands in sharp contrast to human decision-making, which relies on selective filtering, cross-validation, and experience-driven weighting of information sources. As a result, LLM-based trading systems are particularly vulnerable to multi-source noise and misinformation, amplifying factual hallucinations and leading to unstable risk-return performance. To bridge this behavioral gap, we introduce TrustTrade (Trust-Rectified Unified Selective Trader), a multi-agent selective consensus framework inspired by human epistemic heuristics. TrustTrade replaces uniform trust with cross-agent consistency by aggregating information from multiple independent LLM agents and dynamically weighting signals based on their semantic and numerical agreement. Consistent signals are prioritized, while divergent, weakly grounded, or temporally inconsistent inputs are selectively discounted. To further stabilize decision-making, TrustTrade incorporates deterministic temporal signals as reproducible anchors and a reflective memory mechanism that adapts risk preferences at test time without additional training. Together, these components suppress noise amplification and hallucination-driven volatility, yielding more stable and risk-aware trading behavior. Across controlled backtesting in high-noise market environments (2024 Q1 and 2026 Q1), the proposed TrustTrade calibrates LLM trading behavior from extreme risk-return regimes toward a human-aligned, mid-risk and mid-return profile.

[MA-5] Energy-Aware Collaborative Exploration for a UAV-UGV Team

【速读】:该论文旨在解决无人机(UAV)与地面无人车(UGV)协同探索未知环境时的能源约束问题,特别是如何在保证无人机飞行时间受限的前提下实现高效的信息获取与任务执行。解决方案的关键在于构建一个基于密度感知分层概率路网(density-aware layered probabilistic roadmap, PRM)的稀疏耦合空地路网,并将航程选择建模为受会合约束的耦合定向旅行商问题(coupled orienteering problems, OPs),从而在满足无人机与UGV在每轮探索结束后必须 rendezvous(会合)的条件下,最大化信息增益。通过碰撞验证的路网边生成路径,确保了轨迹的安全性与可行性,最终通过仿真、基准对比和真实实验验证了方法的有效性。

链接: https://arxiv.org/abs/2603.22507
作者: Cahit Ikbal Er,Saikiran Juttu,Yasin Yazicioglu
机构: Northeastern University (东北大学)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We present an energy-aware collaborative exploration framework for a UAV-UGV team operating in unknown environments, where the UAV’s energy constraint is modeled as a maximum flight-time limit. The UAV executes a sequence of energy-bounded exploration tours, while the UGV simultaneously explores on the ground and serves as a mobile charging station. Rendezvous is enforced under a shared time budget so that the vehicles meet at the end of each tour before the UAV reaches its flight-time limit. We construct a sparsely coupled air-ground roadmap using a density-aware layered probabilistic roadmap (PRM) and formulate tour selection over the roadmap as coupled orienteering problems (OPs) to maximize information gain subject to the rendezvous constraint. The resulting tours are constructed over collision-validated roadmap edges. We validate our method through simulation studies, benchmark comparisons, and real-world experiments.

[MA-6] Wake Up to the Past: Using Memory to Model Fluid Wake Effects on Robots IROS2026

【速读】:该论文旨在解决自主空中与水下机器人在运动过程中因扰动介质(如空气或水)而产生的尾流效应(wake effect)对邻近机器人造成的干扰建模难题。由于流体的混沌时空动力学与机器人几何结构及复杂运动模式高度耦合,传统数据驱动方法通常采用无记忆的神经网络模型,仅将当前状态映射为受扰机器人的受力,难以准确预测高动态场景下的尾流影响。本文的关键解决方案在于引入历史状态作为输入,并通过预测传输延迟(transport delay)来捕捉尾流效应的时间滞后特性,从而显著提升预测精度。实验证明,支持历史状态输入和延迟预测机制的模型能够在多种介质中实现更可靠的尾流效应建模。

链接: https://arxiv.org/abs/2603.22472
作者: Luca Vendruscolo,Eduardo Sebastián,Amanda Prorok,Ajay Shankar
机构: University of Cambridge (剑桥大学)
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 8 pages, 7 figures. Submitted to IROS 2026. Project website: this https URL

点击查看摘要

Abstract:Autonomous aerial and aquatic robots that attain mobility by perturbing their medium, such as multicopters and torpedoes, produce wake effects that act as disturbances for adjacent robots. Wake effects are hard to model and predict due to the chaotic spatio-temporal dynamics of the fluid, entangled with the physical geometry of the robots and their complex motion patterns. Data-driven approaches using neural networks typically learn a memory-less function that maps the current states of the two robots to a force observed by the “sufferer” robot. Such models often perform poorly in agile scenarios: since the wake effect has a finite propagation time, the disturbance observed by a sufferer robot is some function of relative states in the past. In this work, we present an empirical study of the properties a wake-effect predictor must satisfy to accurately model the interactions between two robots mediated by a fluid. We explore seven data-driven models designed to capture the spatio-temporal evolution of fluid wake effects in four different media. This allows us to introspect the models and analyze the reasons why certain features enable improved accuracy in prediction across predictors and fluids. As experimental validation, we develop a planar rectilinear gantry for two spinning monocopters to test in real-world data with feedback control. The conclusion is that support of history of previous states as input and transport delay prediction substantially helps to learn an accurate wake-effect predictor.

[MA-7] Designing Agent ic AI-Based Screening for Portfolio Investment

【速读】:该论文旨在解决传统投资组合管理中资产筛选效率低、难以有效整合基本面与市场情绪信息的问题。其解决方案的关键在于构建一个三层架构的智能代理(agentic)AI平台:首先,通过两个大型语言模型(LLM)代理分别执行基本面筛选和新闻情感分析,实现多维度资产初筛;其次,代理间协同决策以生成一致的买卖信号,显著缩小候选资产池;最后,采用高维精密矩阵估计方法确定最优组合权重。该框架的核心创新在于将资产数量设为随机变量,通过筛选过程动态决定,同时提出“合理筛选”概念并证明在轻微筛选误差下,筛选后组合的平方夏普比率能一致估计目标值,从而在SP 500数据(2020–2024)上实现了优于无筛选基准和传统筛选方法的绩效表现。

链接: https://arxiv.org/abs/2603.23300
作者: Mehmet Caner,Agostino Capponi,Nathan Sun,Jonathan Y. Tan
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Statistical Finance (q-fin.ST)
备注:

点击查看摘要

Abstract:We introduce a new agentic artificial intelligence (AI) platform for portfolio management. Our architecture consists of three layers. First, two large language model (LLM) agents are assigned specialized tasks: one agent screens for firms with desirable fundamentals, while a sentiment analysis agent screens for firms with desirable news. Second, these agents deliberate to generate and agree upon buy and sell signals from a large portfolio, substantially narrowing the pool of candidate assets. Finally, we apply a high-dimensional precision matrix estimation procedure to determine optimal portfolio weights. A defining theoretical feature of our framework is that the number of assets in the portfolio is itself a random variable, realized through the screening process. We introduce the concept of sensible screening and establish that, under mild screening errors, the squared Sharpe ratio of the screened portfolio consistently estimates its target. Empirically, our method achieves superior Sharpe ratios relative to an unscreened baseline portfolio and to conventional screening approaches, evaluated on SP 500 data over the period 2020–2024.

自然语言处理

[NLP-0] MedObvious: Exposing the Medical Moravecs Paradox in VLMs via Clinical Triage

【速读】: 该论文试图解决医学视觉语言模型(Vision Language Models, VLMs)在临床应用中缺乏可靠输入验证能力的问题,即模型虽能生成流畅的诊断文本,却可能在输入图像存在明显不一致或无效时仍输出看似合理的叙述,从而导致安全隐患。解决方案的关键在于提出一个名为MedObvious的1,880任务基准,专门隔离并评估模型在小多面板图像集上的整体一致性判断能力——要求模型识别是否存在违反预期一致性的图像面板,涵盖从基础模态/方位错位到临床导向的解剖结构与视角验证等五个递进层级,并通过多种评估格式测试不同接口下的鲁棒性。这一设计使预诊断 sanity check 成为可量化、可比较的独立能力,揭示当前VLMs在该环节的显著不足,强调其作为部署前必须解决的安全关键步骤。

链接: https://arxiv.org/abs/2603.23501
作者: Ufaq Khan,Umair Nawaz,L D M S S Teja,Numaan Saeed,Muhammad Bilal,Yutong Xie,Mohammad Yaqub,Muhammad Haris Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 Pages

点击查看摘要

Abstract:Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

[NLP-1] Failure of contextual invariance in gender inference with large language models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对语境等价任务时是否保持输出稳定性的假设问题,特别是在性别推断场景下验证这一假设的可靠性。研究发现,即使引入最小且理论上无信息量的语境,模型输出也会出现系统性偏移,且这种偏移无法通过简单的指代重复或边际效应解释,表明LLM输出对语境具有显著依赖性。解决方案的关键在于采用“默认上下文化”(Contextuality-by-Default)分析方法,量化了在控制所有个体输出边际效应后仍存在的语境依赖程度,揭示出高达19%–52%的案例中模型行为受无关语境特征影响,从而挑战了传统评估中对上下文不变性的隐含假设,并提示需重新审视偏见基准测试与高风险场景中的部署策略。

链接: https://arxiv.org/abs/2603.23485
作者: Sagar Kumar,Ariel Flint,Luca Maria Aiello,Andrea Baronchelli
机构: Northeastern University (东北大学); Boston Children’s Hospital (波士顿儿童医院); City St George’s, University of London (伦敦城市圣乔治大学); IT University of Copenhagen (哥本哈根信息技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19–52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

[NLP-2] SpecEyes: Accelerating Agent ic Multimodal LLM s via Speculative Perception and Planning

【速读】: 该论文旨在解决代理式多模态大语言模型(Agentic Multimodal Large Language Models, MLLMs)在执行迭代视觉工具调用时因感知、推理与工具调用循环带来的显著串行延迟问题,即“代理深度”(agentic depth)导致的高延迟和系统级并发能力受限。其解决方案的关键在于提出SpecEyes框架,通过引入一个轻量级、无需工具调用的小型MLLM作为推测性规划器(speculative planner),提前预测执行轨迹以实现昂贵工具链的早期终止而不牺牲准确性;同时设计基于答案可分离性的认知门控机制(cognitive gating mechanism)来量化模型自验证置信度,无需真实标签即可调控推测过程,并采用异构并行漏斗结构(heterogeneous parallel funnel)利用小型模型的状态无关并发性来掩盖大型模型的状态相关串行执行,从而最大化系统吞吐量。

链接: https://arxiv.org/abs/2603.23483
作者: Haoyu Huang,Jinfa Huang,Zhongwei Wan,Xiawu Zheng,Rongrong Ji,Jiebo Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Code: this https URL

点击查看摘要

Abstract:Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model’s confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

[NLP-3] Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

【速读】: 该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频输入中缺乏3D空间推理能力的问题,即它们难以构建视频所呈现三维环境的结构化抽象。解决方案的关键在于提出一种名为TRACE(Textual Representation of Allocentric Context from Egocentric Video)的提示方法,该方法引导MLLMs生成以文本形式表达的3D环境表示作为中间推理轨迹,从而提升空间问答任务的准确性。TRACE通过编码元上下文(meta-context)、相机轨迹和详细物体实体信息,支持对第一人称视角视频进行结构化的空间推理,实验证明其在VSI-Bench和OST-Bench数据集上显著且一致地优于现有提示策略。

链接: https://arxiv.org/abs/2603.23404
作者: Jiacheng Hua,Yishu Yin,Yuhang Wu,Tai Wang,Yifei Huang,Miao Liu
机构: Tsinghua University (清华大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 26 pages, 6 figures

点击查看摘要

Abstract:Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.

[NLP-4] Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods Taxonomy and Future Directions

【速读】: 该论文旨在解决当前自然语言接口数据库(NLIDB)研究中缺乏针对地理空间与时间数据库的系统性综述问题。现有文献多聚焦于通用关系型数据库,未能充分关注地理空间和时间数据查询的独特性,如地理拓扑算子(geospatial topological operators)和时间算子(temporal operators)带来的复杂性,导致方法碎片化、评估标准不统一,难以清晰把握领域进展与挑战。解决方案的关键在于构建一个全面的地理空间与时间NLIDB研究框架,涵盖数据集、评估指标和方法分类体系,并通过对比分析揭示现有方法的共性趋势、数据与评估实践的差异以及未解决的核心挑战,从而为未来研究指明方向。

链接: https://arxiv.org/abs/2603.23375
作者: Samya Acharja,Kanchan Chowdhury
机构: Marquette University (马凯特大学)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The task of building a natural language interface to a database, known as NLIDB, has recently gained significant attention from both the database and Natural Language Processing (NLP) communities. With the proliferation of geospatial datasets driven by the rapid emergence of location-aware sensors, geospatial databases play a vital role in supporting geospatial applications. However, querying geospatial and temporal databases differs substantially from querying traditional relational databases due to the presence of geospatial topological operators and temporal operators. To bridge the gap between geospatial query languages and non-expert users, the geospatial research community has increasingly focused on developing NLIDBs for geospatial databases. Yet, existing research remains fragmented across systems, datasets, and methodological choices, making it difficult to clearly understand the landscape of existing methods, their strengths and weaknesses, and opportunities for future research. Existing surveys on NLIDBs focus on general-purpose database systems and do not treat geospatial and temporal databases as primary focus for analysis. To address this gap, this paper presents a comprehensive survey of studies on NLIDBs for geospatial and temporal databases. Specifically, we provide a detailed overview of datasets, evaluation metrics, and the taxonomy of the methods for geospatial and temporal NLIDBs, as well as a comparative analysis of the existing methods. Our survey reveals recurring trends in existing methods, substantial variation in datasets and evaluation practices, and several open challenges that continue to hinder progress in this area. Based on these findings, we identify promising directions for future research to advance natural language interfaces to geospatial and temporal databases.

[NLP-5] Off-Policy Value-Based Reinforcement Learning for Large Language Models

【速读】: 该论文旨在解决长时程强化学习(Reinforcement Learning, RL)中数据利用效率低的问题,尤其是在大语言模型(Large Language Models, LLMs)训练场景下,由于轨迹生成成本高,主流的基于策略的RL方法(如GRPO)通常采用在线更新策略,即每批数据仅使用一次后即丢弃,导致样本利用率低下。解决方案的关键在于提出一种基于价值函数的RL框架ReVal,其核心创新是结合逐步信号(捕捉内部一致性)与轨迹级信号(来自结果验证),通过贝尔曼更新(Bellman update)机制实现离策略学习(off-policy learning),并自然支持基于回放缓冲区(replay buffer)的训练方式,从而高效复用历史轨迹,显著提升训练效率和最终性能。

链接: https://arxiv.org/abs/2603.23355
作者: Peng-Yuan Wang,Ziniu Li,Tian Xu,Bohan Yang,Tian-Shuo Liu,ChenYang Wang,Xiong-Hui Chen,Yi-Chen Li,Tianyun Yang,Congliang Chen,Yang Yu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvement of 2.7% in AIME24 and 4.5% in out-of-domain benchmark GPQA over GRPO. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.

[NLP-6] WISTERIA: Weak Implicit Signal-based Temporal Relation Extraction with Attention LREC2026

【速读】: 该论文旨在解决时间关系抽取(Temporal Relation Extraction, TRE)中现有基于注意力机制的模型往往聚焦于全局显著词元,而忽视了决定时间关系的特定事件对(pair-specific)语境线索的问题。解决方案的关键在于提出WISTERIA框架,通过引入基于事件对条件的top-K注意力池化机制,识别并保留每个事件对中最具有判别力的上下文词元,从而实现对时间关系的局部化、可解释性建模;该方法不依赖显式的时间标记词(如before、after),而是将任何隐含表达时间顺序的词汇、句法或形态特征视为有效信号,显著提升了模型在多个基准数据集上的性能与可解释性。

链接: https://arxiv.org/abs/2603.23319
作者: Duy Dao Do,Anaïs Halftermeyer,Thi-Bich-Hanh Dao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 16 figures, LREC 2026

点击查看摘要

Abstract:Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior works assuming explicit markers such as before, after, or when, WISTERIA considers signals as any lexical, syntactic, or morphological element implicitly expressing temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.

[NLP-7] Steering LLM s for Culturally Localized Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在全球部署中表现出对训练数据丰富的文化过度偏倚的问题,以及现有文化本地化方法(如提示工程或后训练对齐)存在黑箱性、难以控制且无法区分失败是源于知识缺失还是表达激发不足的局限。其解决方案的关键在于利用机制可解释性技术,通过稀疏自动编码器(sparse autoencoders)识别出可解释的文化相关特征,并将其聚合为文化嵌入(Cultural Embeddings, CuE)。CuE不仅可用于诊断在未明确指定文化背景下的隐式偏倚,还可作为白盒干预手段实现对模型输出文化的可控引导,显著提升文化忠实度并激发远比传统提示更罕见的长尾文化概念,且与黑箱方法具有互补性。

链接: https://arxiv.org/abs/2603.23301
作者: Simran Khanuja,Hongbin Liu,Shujian Zhang,John Lambert,Mingqing Chen,Rajiv Mathews,Lun Wang
机构: Google DeepMind(谷歌深度学习); Carnegie Mellon University(卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:LLMs are deployed globally, yet produce responses biased towards cultures with abundant training data. Existing cultural localization approaches such as prompting or post-training alignment are black-box, hard to control, and do not reveal whether failures reflect missing knowledge or poor elicitation. In this paper, we address these gaps using mechanistic interpretability to uncover and manipulate cultural representations in LLMs. Leveraging sparse autoencoders, we identify interpretable features that encode culturally salient information and aggregate them into Cultural Embeddings (CuE). We use CuE both to analyze implicit cultural biases under underspecified prompts and to construct white-box steering interventions. Across multiple models, we show that CuE-based steering increases cultural faithfulness and elicits significantly rarer, long-tail cultural concepts than prompting alone. Notably, CuE-based steering is complementary to black-box localization methods, offering gains when applied on top of prompt-augmented inputs. This also suggests that models do benefit from better elicitation strategies, and don’t necessarily lack long-tail knowledge representation, though this varies across cultures. Our results provide both diagnostic insight into cultural representations in LLMs and a controllable method to steer towards desired cultures.

[NLP-8] LLM Olympiad: Why Model Evaluation Needs a Sealed Exam

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)时代下,自然语言处理(Natural Language Processing, NLP)领域中基准测试(benchmarks)和排行榜(leaderboards)日益难以准确反映模型真实能力的问题。当前的评估方式容易受到“基准追逐”(benchmark-chasing)、隐含的评估选择偏差或测试数据意外泄露等因素干扰,导致分数不能真实体现模型的泛化能力。为此,作者提出一种类奥林匹克竞赛的评估机制:在评测前密封任务题目,提前冻结模型提交版本,并通过统一标准化的执行环境运行所有参赛模型;评分完成后,公开全部任务集与评估代码,以支持结果的复现与审计。该设计的核心在于通过流程隔离与事后透明化,降低人为操控性能的可能性,从而提升评估结果的可信度。

链接: https://arxiv.org/abs/2603.23292
作者: Jan Christian Blaise Cruz,Alham Fikri Aji
机构: MBZUAI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Benchmarks and leaderboards are how NLP most often communicates progress, but in the LLM era they are increasingly easy to misread. Scores can reflect benchmark-chasing, hidden evaluation choices, or accidental exposure to test content – not just broad capability. Closed benchmarks delay some of these issues, but reduce transparency and make it harder for the community to learn from results. We argue for a complementary practice: an Olympiad-style evaluation event where problems are sealed until evaluation, submissions are frozen in advance, and all entries run through one standardized harness. After scoring, the full task set and evaluation code are released so results can be reproduced and audited. This design aims to make strong performance harder to ``manufacture’’ and easier to trust.

[NLP-9] Is AI Catching Up to Human Expression? Exploring Emotion Personality Authorship and Linguistic Style in English and Arabic with Six Large Language Models

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在跨语言和跨文化情境下是否能够真实模拟人类情感表达与人格特征的问题,尤其聚焦于资源匮乏的阿拉伯语环境。研究发现,尽管当前大语言模型(LLMs)在文本生成中表现出较高流畅性,但其生成内容仍可通过机器分类器与人类写作区分开来(F1 达到 0.95),且在情感和人格识别任务中存在显著的泛化差距——即基于人类数据训练的分类器难以准确识别 AI 生成文本中的情感或人格特征,反之亦然,表明 LLMs 编码情感信号的方式不同于人类。关键解决方案在于:通过引入 AI 生成数据进行模型微调,可有效提升阿拉伯语人格分类性能,凸显合成数据在缓解低资源语言中情感建模挑战方面的潜力;同时,GPT-4o 和 Gemini 等模型展现出更优的情感一致性,提示模型架构与训练策略对提升情感真实性具有重要作用。

链接: https://arxiv.org/abs/2603.23251
作者: Nasser A Alsadhan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. Under review

点击查看摘要

Abstract:The advancing fluency of LLMs raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether LLMs can convincingly mimic emotional nuance in English and personality markers in Arabic, a critical under-resourced language with unique linguistic and cultural characteristics. We conduct two tasks across six models:Jais, Mistral, LLaMA, GPT-4o, Gemini, and DeepSeek. First, we evaluate whether machine classifiers can reliably distinguish between human-authored and AI-generated texts. Second, we assess the extent to which LLM-generated texts exhibit emotional or personality traits comparable to those of humans. Our results demonstrate that AI-generated texts are distinguishable from human-authored ones (F10.95), though classification performance deteriorates on paraphrased samples, indicating a reliance on superficial stylistic cues. Emotion and personality classification experiments reveal significant generalization gaps: classifiers trained on human data perform poorly on AI-generated texts and vice versa, suggesting LLMs encode affective signals differently from humans. Importantly, augmenting training with AI-generated data enhances performance in the Arabic personality classification task, highlighting the potential of synthetic data to address challenges in under-resourced languages. Model-specific analyses show that GPT-4o and Gemini exhibit superior affective coherence. Linguistic and psycholinguistic analyses reveal measurable divergences in tone, authenticity, and textual complexity between human and AI texts. These findings have implications for affective computing, authorship attribution, and responsible AI deployment, particularly within underresourced language contexts where generative AI detection and alignment pose unique challenges.

[NLP-10] I Came I Saw I Explained: Benchmarking Multimodal LLM s on Figurative Meaning in Memes LREC2026

【速读】: 该论文试图解决的问题是:当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在识别和解释网络迷因(Internet memes)中六类隐喻意义时,其如何融合与解析图文信息尚不明确,且现有模型的推理过程是否忠实于原始内容缺乏系统评估。解决方案的关键在于:通过在三个数据集上对八种前沿生成式多模态大语言模型进行评测,量化其检测隐喻意义的能力,并进一步开展人工评估,检验模型生成解释的合理性与忠实性(faithfulness),从而揭示模型在处理图文混杂语境下的认知偏差及解释可靠性问题。

链接: https://arxiv.org/abs/2603.23229
作者: Shijia Zhou,Saif M. Mohammad,Barbara Plank,Diego Frassinelli
机构: 未知
类目: Computation and Language (cs.CL)
备注: LREC 2026, 18 pages, 10 figures

点击查看摘要

Abstract:Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.

[NLP-11] Decoding AI Authorship: Can LLM s Truly Mimic Human Style Across Literature and Politics?

【速读】: 该论文旨在解决生成式 AI(Generative AI)在模仿特定人类作者风格时的可检测性问题,尤其是大语言模型(Large Language Models, LLMs)能否真实复现文学与政治人物(如沃尔特·惠特曼、威廉·华兹华斯、唐纳德·特朗普和巴拉克·奥巴马)的写作风格特征。解决方案的关键在于构建一个结合零样本提示框架与多维度评估体系的方法论:首先通过严格主题对齐生成合成语料,再利用BERT分类器与XGBoost可解释机器学习模型进行交叉验证;同时引入LIWC(Linguistic Inquiry and Word Count)标记、困惑度(perplexity)和可读性指数等指标量化AI输出与人类文本之间的差异。研究发现,尽管LLMs在低维启发式特征(如句法复杂性和可读性)上接近人类水平,但在高阶语义和情感密度方面仍存在显著偏差,其中困惑度被识别为最具判别力的指标,揭示了AI生成文本在随机规律性上的系统性偏离,从而为数字人文和社会媒体中的作者归属任务提供了关键基准与洞察。

链接: https://arxiv.org/abs/2603.23219
作者: Nasser A Alsadhan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. Accepted for publication in Digital Scholarship in the Humanities (OUP)

点击查看摘要

Abstract:Amidst the rising capabilities of generative AI to mimic specific human styles, this study investigates the ability of state-of-the-art large language models (LLMs), including GPT-4o, Gemini 1.5 Pro, and Claude Sonnet 3.5, to emulate the authorial signatures of prominent literary and political figures: Walt Whitman, William Wordsworth, Donald Trump, and Barack Obama. Utilizing a zero-shot prompting framework with strict thematic alignment, we generated synthetic corpora evaluated through a complementary framework combining transformer-based classification (BERT) and interpretable machine learning (XGBoost). Our methodology integrates Linguistic Inquiry and Word Count (LIWC) markers, perplexity, and readability indices to assess the divergence between AI-generated and human-authored text. Results demonstrate that AI-generated mimicry remains highly detectable, with XGBoost models trained on a restricted set of eight stylometric features achieving accuracy comparable to high-dimensional neural classifiers. Feature importance analyses identify perplexity as the primary discriminative metric, revealing a significant divergence in the stochastic regularity of AI outputs compared to the higher variability of human writing. While LLMs exhibit distributional convergence with human authors on low-dimensional heuristic features, such as syntactic complexity and readability, they do not yet fully replicate the nuanced affective density and stylistic variance inherent in the human-authored corpus. By isolating the specific statistical gaps in current generative mimicry, this study provides a comprehensive benchmark for LLM stylistic behavior and offers critical insights for authorship attribution in the digital humanities and social media.

[NLP-12] Sparser Faster Lighter Transformer Language Models FAST

【速读】: 该论文旨在解决自回归大语言模型(Large Language Models, LLMs)在扩展过程中带来的巨大计算成本问题。其核心解决方案是利用LLM前馈层中固有的非结构化稀疏性(unstructured sparsity),通过引入一种新的稀疏打包格式(sparse packing format)和一系列针对现代GPU优化执行流水线的CUDA内核,实现高效稀疏计算,从而在推理和训练阶段显著提升吞吐量、能效比和内存利用率。关键创新在于将稀疏性与硬件级优化紧密结合,使得高稀疏度(如超过99%)在不影响下游任务性能的前提下成为可行方案。

链接: https://arxiv.org/abs/2603.23198
作者: Edoardo Cetin,Stefano Peluchetti,Emilio Castillo,Akira Naruse,Mana Murakami,Llion Jones
机构: Sakana AI( Sakana AI); NVIDIA( NVIDIA)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Code and checkpoints available at: this https URL

点击查看摘要

Abstract:Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM’s feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.

[NLP-13] ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment

【速读】: 该论文旨在解决强化学习中基于人类反馈(Reinforcement Learning from Human Feedback, RLHF)的奖励建模问题,特别是针对传统方法依赖昂贵且难以获取的显式反馈数据(如标注偏好)的局限性。作者提出了一种隐式奖励建模(Implicit Reward Modeling)方法,利用低成本的隐式人类反馈(如点击、复制等行为)来训练奖励模型。其核心挑战在于:(1)隐式偏好数据缺乏明确的负样本,导致标准正负样本分类方法失效;(2)用户偏好偏差使不同响应对反馈行为的诱发概率不同,进一步混淆负样本识别。解决方案的关键是提出 ImplicitRM 方法,通过一个分层模型将训练样本划分为四个潜在组别,并基于似然最大化构建无偏学习目标,理论上保证了奖励模型的准确性,从而有效克服上述两个挑战。

链接: https://arxiv.org/abs/2603.23184
作者: Hao Wang,Haocheng Yang,Licheng Pan,Lei Shen,Xiaoxi Li,Yinuo Wang,Zhichao Chen,Yuan Lu,Haoxuan Li,Zhouchen Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study \textitimplicit reward modeling – learning reward models from implicit human feedback (e.g., clicks and copies) – as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.

[NLP-14] From Synthetic to Native: Benchmarking Multilingual Intent Classification in Logistics Customer Service

【速读】: 该论文旨在解决现有多语言意图分类(Multilingual Intent Classification)基准数据集因依赖机器翻译文本而导致评估结果虚高、无法真实反映模型在实际客户交互场景中性能的问题。其关键解决方案是构建了一个基于真实物流客户服务日志的公开多语言意图分类基准,包含约30K经去标识化处理的原始用户查询,覆盖英语、西班牙语和阿拉伯语等显式语言及印尼语、中文等仅用于零样本测试的语言,并采用两级标签体系(13个父类意图与17个叶节点意图)。通过提供成对的原生语种与机器翻译测试集,直接量化合成数据与真实数据之间的性能差距,从而推动更贴近现实场景的多语言意图识别模型评估体系发展。

链接: https://arxiv.org/abs/2603.23172
作者: Haoyu He,Jinyu Zhuang,Haoran Chu,Shuhang Yu, J,T AI Group,Hao Wang,Kunpeng Han
机构: JT Express; School of Information Science and Technology, ShanghaiTech University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multilingual intent classification is central to customer-service systems on global logistics platforms, where models must process noisy user queries across languages and hierarchical label spaces. Yet most existing multilingual benchmarks rely on machine-translated text, which is typically cleaner and more standardized than native customer requests and can therefore overestimate real-world robustness. We present a public benchmark for hierarchical multilingual intent classification constructed from real logistics customer-service logs. The dataset contains approximately 30K de-identified, stand-alone user queries curated from 600K historical records through filtering, LLM-assisted quality control, and human verification, and is organized into a two-level taxonomy with 13 parent and 17 leaf intents. English, Spanish, and Arabic are included as seen languages, while Indonesian, Chinese, and additional test-only languages support zero-shot evaluation. To directly measure the gap between synthetic and real evaluation, we provide paired native and machine-translated test sets and benchmark multilingual encoders, embedding models, and small language models under flat and hierarchical protocols. Results show that translated test sets substantially overestimate performance on noisy native queries, especially for long-tail intents and cross-lingual transfer, underscoring the need for more realistic multilingual intent benchmarks.

[NLP-15] UniDial-EvalKit: A Unified Toolkit for Evaluating Multi-Faceted Conversational Abilities

【速读】: 该论文旨在解决多轮交互式人工智能(Interactive AI)系统在基准测试中因评估协议高度异构而导致的系统性比较困难问题,具体表现为数据格式、模型接口和评估流程不统一。其解决方案的关键在于提出一个名为UniDial-EvalKit(UDE)的统一评估工具包,通过三个核心机制实现标准化:一是将异构数据格式统一为通用模式(universal schema),二是基于模块化架构简化复杂的评估流程,三是建立一致的评分接口以规范指标计算;同时支持并行生成与检查点缓存,显著提升大规模评估的效率与可扩展性。

链接: https://arxiv.org/abs/2603.23160
作者: Qi Jia,Haodong Zhao,Dun Pei,Xiujie Song,Shibo Wang,Zijian Chen,Zicheng Zhang,Xiangyang Zhu,Guangtao Zhai
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Benchmarking AI systems in multi-turn interactive scenarios is essential for understanding their practical capabilities in real-world applications. However, existing evaluation protocols are highly heterogeneous, differing significantly in dataset formats, model interfaces, and evaluation pipelines, which severely impedes systematic comparison. In this work, we present UniDial-EvalKit (UDE), a unified evaluation toolkit for assessing interactive AI systems. The core contribution of UDE lies in its holistic unification: it standardizes heterogeneous data formats into a universal schema, streamlines complex evaluation pipelines through a modular architecture, and aligns metric calculations under a consistent scoring interface. It also supports efficient large-scale evaluation through parallel generation and scoring, as well as checkpoint-based caching to eliminate redundant computation. Validated across diverse multi-turn benchmarks, UDE not only guarantees high reproducibility through standardized workflows and transparent logging, but also significantly improves evaluation efficiency and extensibility. We make the complete toolkit and evaluation scripts publicly available to foster a standardized benchmarking ecosystem and accelerate future breakthroughs in interactive AI.

[NLP-16] Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 文本检测方法在真实场景中可靠性不足、可解释性差的问题,特别是检测模型是否真正识别出机器作者身份,还是仅依赖于特定数据集的风格特征。其解决方案的关键在于提出一个可解释的检测框架,融合语言学特征工程、机器学习与可解释人工智能(Explainable AI, XAI)技术,通过SHAP值分析揭示模型决策依据。实验表明,尽管模型在基准数据集上达到高F1分数(0.9734),但在跨域和跨生成器评估中表现出显著泛化失败,且最具判别力的语言特征也最易受领域偏移、格式变化和文本长度影响,这揭示了基于语言特征的检测方法存在根本性权衡。该研究为构建更具鲁棒性的AI文本检测系统提供了关键洞见,并开源了Python工具包以支持复现与实际部署。

链接: https://arxiv.org/abs/2603.23146
作者: Shushanta Pudasaini,Luis Miralles-Pechuán,David Lillis,Marisa Llorens Salvador
机构: Technological University Dublin (都柏林理工学院); University College Dublin (都柏林大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread adoption of Large Language Models (LLMs) has made the detection of AI-Generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP- based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe that this knowledge helps build AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.

[NLP-17] HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature

【速读】: 该论文旨在解决科学知识图谱(Knowledge Graph, KG)构建中面临的三大核心挑战:难以识别长距离多词实体、跨领域泛化能力弱,以及对科学知识层级结构的忽视。现有方法受限于通用大语言模型(Large Language Models, LLMs)计算成本高、任务表现不稳定的问题,导致生成的知识图谱浅层且不一致,难以支撑深层次的知识探索与融合。其解决方案的关键在于提出一个两阶段框架:第一阶段Z-NERD通过正交语义分解(Orthogonal Semantic Decomposition, OSD)实现领域无关的实体识别,并引入多尺度TCQK注意力机制捕捉连贯的多词实体;第二阶段HGNet采用层次感知的消息传递机制显式建模父-子-同级关系,并结合可微分层次损失(Differentiable Hierarchy Loss)和连续抽象场损失(Continuum Abstraction Field, CAF Loss),首次将抽象层级作为欧几里得空间中的连续属性进行建模,从而在无需标注数据的情况下实现高质量、结构一致的科学知识图谱构建。

链接: https://arxiv.org/abs/2603.23136
作者: Devvrat Joshi,Islem Rekik
机构: Imperial College London (帝国理工学院); BASIRA Lab (BASIRA实验室); Imperial-X (I-X) (帝国理工-X)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic “turns” in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (this https URL), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.

[NLP-18] When Language Models Lose Their Mind: The Consequences of Brain Misalignment ICLR2026

【速读】: 该论文试图解决的问题是:脑对齐的大语言模型(brain-aligned large language models)在提升AI安全性与可信度的同时,其与语言能力之间的关系尚不明确,尤其是脑对齐是否对语言理解的鲁棒性具有关键作用。为解答这一问题,作者提出了一种创新性的解决方案——引入“脑错位模型”(brain-misaligned models),即通过刻意训练使模型在预测大脑活动方面表现不佳,同时保持高水准的语言建模性能。该方案的关键在于通过控制变量法,将脑对齐这一单一因素从其他语言能力相关特征中分离出来,从而系统评估其对下游200多个语言任务(涵盖语义、句法、语篇、推理和形态学等)的影响。实验结果表明,脑错位显著削弱了模型在各项任务中的表现,揭示了脑对齐对于实现强大语言能力的核心作用。

链接: https://arxiv.org/abs/2603.23091
作者: Gabriele Merlin,Mariya Toneva
机构: MPI-SWS (Max Planck Institute for Software Systems)
类目: Computation and Language (cs.CL)
备注: Accepted at ICLR 2026

点击查看摘要

Abstract:While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for potential for enhanced safety and trustworthiness in AI, the role of this brain alignment for linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models–LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream performance, highlighting the critical role of brain alignment in achieving robust linguistic competence. These findings underscore the importance of brain alignment in LLMs and offer novel insights into the relationship between neural representations and linguistic processing.

[NLP-19] AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing

【速读】: 该论文旨在解决现有文本风格迁移(style transfer)方法在低资源目标作者场景下难以兼顾风格迁移效果与语义保真度的问题。传统方法通常训练单一模型以覆盖所有目标风格,导致适配灵活性差且成本高,尤其在目标作者数据稀缺时性能下降明显。其解决方案的关键在于提出一种轻量级、模块化且可解释的框架 AuthorMix:通过为每个高资源作者单独训练 LoRA(Low-Rank Adaptation)适配器,并基于层级适配器混合策略,仅需少量目标风格样本即可快速构建针对特定作者的专用适配模型,从而显著提升低资源场景下的风格迁移质量与语义保留能力。

链接: https://arxiv.org/abs/2603.23069
作者: Sarubi Thillainathan,Ji-Ung Lee,Michael Sullivan,Alexander Koller
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original text. Existing style transfer methods train a single model on large corpora to model all target styles at once: this high-cost approach offers limited flexibility for target-specific adaptation, and often sacrifices meaning preservation for style transfer. In this paper, we propose AuthorMix: a lightweight, modular, and interpretable style transfer framework. We train individual, style-specific LoRA adapters on a small set of high-resource authors, allowing the rapid training of specialized adaptation models for each new target via learned, layer-wise adapter mixing, using only a handful of target style training examples. AuthorMix outperforms existing, SoTA style-transfer baselines – as well as GPT-5.1 – for low-resource targets, achieving the highest overall score and substantially improving meaning preservation.

[NLP-20] Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation

【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)微调在长文本生成任务中评估指标失效的问题,尤其是标准自然语言处理(NLP)指标如ROUGE和BERTScore无法准确捕捉事实性差异的局限性。其解决方案的关键在于提出了一种基于三元组的人工验证评估流程TriFEX,能够将生成内容中的主张(claim)精确归因于用户查询、上下文和参考来源,并引入参数化知识精度(Parametric Knowledge Precision, PKP),通过过滤提示中泄露的知识来独立衡量模型内化知识的真实性。实验表明,PKP能有效揭示传统指标忽略的事实性差异,且小规模模型(7B)经微调后在电子设计自动化领域表现优于大规模基线(72B),证明了成本效益更高的本地部署可行性。

链接: https://arxiv.org/abs/2603.23047
作者: Julian Oestreich,Maximilian Bley,Frank Binder,Lydia Müller,Maksym Sydorenko,André Alcalde
机构: Leipzig University (莱比锡大学); CELUS GmbH (CELUS有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin-user query, context and reference-and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieva-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics, further showing generalization across conditions and on a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models could be reasonably well adapted to specialized tasks for cost-efficient, on-premises deployment.

[NLP-21] YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception

【速读】: 该论文旨在解决自动驾驶感知系统中目标检测模型在视觉退化或模糊场景下缺乏可信度透明性的问题,即现有模型难以提供可靠的信心分数(confidence scores)解释。解决方案的关键在于引入一种基于Kolmogorov-Arnold网络的可解释后处理代理模型,利用七个几何与语义特征对YOLOv10检测结果的信任度进行建模;其加性样条结构支持直接可视化每个特征对最终信任度的影响,从而生成平滑且透明的功能映射,明确识别出高置信度与低置信度预测情形。

链接: https://arxiv.org/abs/2603.23037
作者: Marios Impraimakis,Daniel Vazquez,Feiyu Zhou
机构: University of Bath (巴斯大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 14 pages, 23 Figures, 6 Tables

点击查看摘要

Abstract:The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach refers to a key limitation in computer vision for autonomous vehicles perception, and beyond. These systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of the You Only Look Once (Yolov10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature’s influence. This produces smooth and transparent functional mappings that reveal when the model’s confidence is well supported and when it is unreliable. Experiments on both Common Objects in Context (COCO), and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This tool enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates. It offers a powerful tool for transparent and practical perception component for autonomous and multimodal artificial intelligence applications.

[NLP-22] Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在处理用户特定查询时存在的计算资源浪费问题,即大量重复性或语义相似的请求仍被以相同高成本进行处理。其核心挑战在于如何利用对话记忆(conversational memory)将冗余转化为效率优势。解决方案的关键在于提出一种基于记忆增强的推理框架:通过一个轻量级8B参数模型结合检索到的对话上下文,实现低开销推理路径;该方法无需额外训练或标注数据即可在保持较高准确率的同时,将有效计算成本降低96%,且证明了访问相关知识比单纯扩大模型规模更能提升性能。

链接: https://arxiv.org/abs/2603.23013
作者: Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen
机构: vLLM Semantic Router Project; MBZUAI; McGill University; Mila; AMD; Red Hat
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Production AI agents frequently receive user-specific queries that are highly repetitive, with up to 47% being semantically similar to prior interactions, yet each query is typically processed with the same computational cost. We argue that this redundancy can be exploited through conversational memory, transforming repetition from a cost burden into an efficiency advantage. We propose a memory-augmented inference framework in which a lightweight 8B-parameter model leverages retrieved conversational context to answer all queries via a low-cost inference path. Without any additional training or labeled data, this approach achieves 30.5% F1, recovering 69% of the performance of a full-context 235B model while reducing effective cost by 96%. Notably, a 235B model without memory (13.7% F1) underperforms even the standalone 8B model (15.4% F1), indicating that for user-specific queries, access to relevant knowledge outweighs model scale. We further analyze the role of routing and confidence. At practical confidence thresholds, routing alone already directs 96% of queries to the small model, but yields poor accuracy (13.0% F1) due to confident hallucinations. Memory does not substantially alter routing decisions; instead, it improves correctness by grounding responses in retrieved user-specific information. As conversational memory accumulates over time, coverage of recurring topics increases, further narrowing the performance gap. We evaluate on 152 LoCoMo questions (Qwen3-8B/235B) and 500 LongMemEval questions. Incorporating hybrid retrieval (BM25 + cosine similarity) improves performance by an additional +7.7 F1, demonstrating that retrieval quality directly enhances end-to-end system performance. Overall, our results highlight that memory, rather than model size, is the primary driver of accuracy and efficiency in persistent AI agents.

[NLP-23] PaperVoyager : Building Interactive Web with Visual Language Models

【速读】: 该论文旨在解决现有文档智能代理(Document Agent)在处理技术论文时存在的局限性问题,即当前方法通常仅将论文转化为静态内容(如摘要、网页或幻灯片),无法有效呈现涉及动态机制和状态转换的技术细节。其解决方案的关键在于提出一种Paper-to-Interactive-System Agent框架,能够端到端地将PDF论文自动转化为可执行的交互式网页系统,通过显式建模论文中的机制与交互逻辑,使用户可操作输入并观察动态行为。为此,作者进一步设计了PaperVoyager结构化生成框架,并构建了一个包含19篇论文及其专家构建的交互系统作为基准的评估数据集,实验表明该方案显著提升了交互式系统的生成质量,为科学论文的交互式理解提供了新范式。

链接: https://arxiv.org/abs/2603.22999
作者: Dasen Dai,Biao Wu,Meng Fang,Wenhao Wang
机构: Vast Intelligence Lab (Vast Intelligence Lab); UTS (University of Technology Sydney); University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

[NLP-24] Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation

【速读】: 该论文旨在解决当前多模态毒性检测基准普遍采用单一二元仇恨标签(hatefulness label)所带来的问题,即这种粗粒度标注方式混淆了表达的两个本质不同维度:语气(tone)与内容(content)。具体而言,其核心问题是现有方法无法区分“不礼貌”(incivility,指粗鲁或轻蔑的语气)与“偏执 intolerance(针对群体或身份的攻击性内容)”,从而影响模型对有害内容的精准识别和公平判别。解决方案的关键在于提出一种基于传播学理论的细粒度标注方案,将毒性分为可分离的两个维度,并在2,030张来自Hateful Memes数据集的梗图上进行标注。实验表明,将细粒度标注与原有粗粒度标签联合训练,能够显著提升视觉-语言模型的整体性能,同时降低误检率与漏检率之间的不平衡(如LLaVA-1.6-Mistral-7B模型的FNR-FPR从0.74降至0.42),从而提高内容审核系统的可靠性与准确性。

链接: https://arxiv.org/abs/2603.22985
作者: Nils A. Herrmann,Tobias Eder,Jingyi He,Georg Groh
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Preprint. Under review

点击查看摘要

Abstract:Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions: incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities) and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.

[NLP-25] DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube

【速读】: 该论文旨在解决低资源语言(特别是达里语,Dari)在虚假信息检测领域长期被忽视的问题。针对这一空白,作者构建了首个达里语YouTube视频的标注数据集DariMis,包含9,224条视频,按信息类型(虚假信息、部分真实、真实)和危害等级(低、中、高)进行双重标注。关键发现是这两个维度存在结构性耦合关系:55.9%的虚假信息具有中等及以上危害潜力,而真实内容仅1.0%具备此类风险,这使得信息类型分类器可作为内容审核流程中的隐式危害分诊过滤器。解决方案的核心创新在于提出了一种双输入编码策略,将视频标题与描述分别作为BERT的独立段落输入,显式建模标题主张与正文内容之间的语义关联——这是识别误导性信息的关键信号。实验表明,相比传统单字段拼接方法,该策略显著提升了虚假信息召回率(从60.1%提升至67.1%),尽管整体宏F1仅小幅改善(+0.09个百分点),验证了其在安全关键场景下的有效性。

链接: https://arxiv.org/abs/2603.22977
作者: Jawid Ahmad Baktash,Mosa Ebrahimi,Mohammad Zarif Joya,Mursal Dawodi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 8 figures. Accepted for submission; dataset and code will be released upon publication

点击查看摘要

Abstract:Dari, the primary language of Afghanistan, is spoken by tens of millions of people yet remains largely absent from the misinformation detection literature. We address this gap with DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled across two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that these dimensions are structurally coupled, not independent: 55.9 percent of Misinformation carries at least Medium harm potential, compared with only 1.0 percent of True content. This enables Information Type classifiers to function as implicit harm-triage filters in content moderation pipelines. We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship between headline claims and body content, a key signal of misleading information. An ablation study against single-field concatenation shows that pair-input encoding yields a 7.0 percentage point gain in Misinformation recall (60.1 percent to 67.1 percent), the safety-critical minority class, despite modest overall macro F1 differences (0.09 percentage points). We benchmark a Dari/Farsi-specialized model (ParsBERT) against XLM-RoBERTa-base; ParsBERT achieves the best test performance with accuracy of 76.60 percent and macro F1 of 72.77 percent. Bootstrap 95 percent confidence intervals are reported for all metrics, and we discuss both the practical significance and statistical limitations of the results. Comments: 9 pages, 8 figures. Accepted for submission; dataset and code will be released upon publication Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: I.2.7; H.3.3 Cite as: arXiv:2603.22977 [cs.CL] (or arXiv:2603.22977v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.22977 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-26] Beyond Theoretical Bounds: Empirical Privacy Loss Calibration for Text Rewriting Under Local Differential Privacy

【速读】: 该论文旨在解决局部差分隐私(Local Differential Privacy, LDP)下文本重写机制中隐私参数 ε\varepsilon 难以直观解释和跨机制比较的问题。现有方法虽能提供形式化的隐私保障,但相同名义 ε\varepsilon 值可能对应显著不同的可区分性水平,导致隐私-效用权衡评估缺乏可比性。解决方案的关键在于提出 TeDA,其通过假设检验框架实现对文本可区分性的实证校准:在表面文本空间和嵌入空间中分别执行文本可区分性审计,从而量化私有化文本与原始文本之间的不可区分性。该方法为不同 LDP 文本重写机制提供了统一的、基于实证的比较基准,提升了隐私保护机制的实际评估能力与部署适用性。

链接: https://arxiv.org/abs/2603.22968
作者: Weijun Li,Arnaud Grivet Sébert,Qiongkai Xu,Annabelle McIver,Mark Dras
机构: Macquarie University (麦考瑞大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 22 pages, 11 figures, 5 tables

点击查看摘要

Abstract:The growing use of large language models has increased interest in sharing textual data in a privacy-preserving manner. One prominent line of work addresses this challenge through text rewriting under Local Differential Privacy (LDP), where input texts are locally obfuscated before release with formal privacy guarantees. These guarantees are typically expressed by a parameter \varepsilon that upper bounds the worst-case privacy loss. However, nominal \varepsilon values are often difficult to interpret and compare across mechanisms. In this work, we investigate how to empirically calibrate across text rewriting mechanisms under LDP. We propose TeDA, which formulates calibration via a hypothesis-testing framework that instantiates text distinguishability audits in both surface and embedding spaces, enabling empirical assessment of indistinguishability from privatized texts. Applying this calibration to several representative mechanisms, we demonstrate that similar nominal \varepsilon bounds can imply very different levels of distinguishability. Empirical calibration thus provides a more comparable footing for evaluating privacy-utility trade-offs, as well as a practical tool for mechanism comparison and analysis in real-world LDP text rewriting deployments.

[NLP-27] Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverag e Guarantees

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成任务中仅依赖最可能生成结果(Most Likely Generation, MLG)作为点预测所导致的局限性问题,即MLG可能错误,而正确答案可能存在于更广泛的输出空间中。为提升模型预测的可靠性与覆盖能力,论文提出了一种基于集合值预测(Set-Valued Prediction)的原理性框架,其核心在于通过有限采样构建具有可行性感知的覆盖率保证(Feasibility-Aware Coverage Guarantees)。关键创新在于识别出最小可实现风险水平(Minimum Achievable Risk Level, MRL),即当目标风险低于此阈值时,统计覆盖率无法满足;进而设计了一种数据驱动的校准方法,利用采样响应估计严格阈值,从而构造出包含正确答案概率不低于预设置信度的预测集,实验证明该方法在六类语言生成任务和五种LLM上兼具统计有效性与预测效率。

链接: https://arxiv.org/abs/2603.22966
作者: Ye Li,Anqi Hu,Yuanchang Ye,Shiyan Tong,Zhiyuan Wang,Bo Fu
机构: University of Electronic Science and Technology of China (电子科技大学); Zhejiang University of Finance and Economics (浙江财经大学); Southeast University (东南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model’s capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.

[NLP-28] Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion ACL2026

【速读】: 该论文旨在解决电商场景下冷启动(cold-start)Query Suggestion (QS) 问题,即在缺乏足够在线点击数据的情况下,传统基于大语言模型与点击率(Click-Through Rate, CTR)模型的方法难以有效训练QS系统。解决方案的关键在于提出Cold-EQS框架,其核心是通过迭代强化学习机制,利用可回答性(answerability)、事实性(factuality)和信息增益(information gain)作为奖励信号持续优化建议查询的质量;同时引入分组候选查询的不确定性估计策略,从无点击信号的在线用户查询中筛选出困难且模糊样本用于模型迭代优化,从而在冷启动阶段仍能提升推荐效果。

链接: https://arxiv.org/abs/2603.22922
作者: Qi Sun,Kejun Xiao,Huaipeng Zhao,Tao Luo,Xiaoyi Zeng
机构: Alibaba International Digital Commercial Group (阿里巴巴国际数字商业集团)
类目: Computation and Language (cs.CL)
备注: Submitted to ACL 2026 Industry Track

点击查看摘要

Abstract:Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as reward to continuously optimize the quality of suggested queries. To continuously optimize our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness. Both offline and online experimental results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.

[NLP-29] EVA: Efficient Reinforcement Learning for End-to-End Video Agent CVPR2026

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解任务中因长序列token、复杂时间依赖性和冗余帧导致的效率低下问题。现有方法通常将MLLMs视为被动识别器,对整个视频或均匀采样帧进行处理,缺乏自适应推理能力;而基于代理的方法虽引入外部工具,但仍依赖人工设计的工作流和感知优先策略,难以高效处理长视频。解决方案的关键在于提出EVA(Efficient Reinforcement Learning framework for End-to-End Video Agent),其核心创新是通过“规划-感知-行动-反思”的迭代式推理机制实现“先规划后感知”,使代理能够自主决定观看内容、时机与方式,从而实现查询驱动且高效的视频理解。此外,论文设计了一个三阶段学习管道(监督微调SFT、Kahneman-Tversky优化KTO、广义奖励策略优化GRPO),有效连接监督模仿与强化学习,显著提升了训练稳定性与性能表现。

链接: https://arxiv.org/abs/2603.22918
作者: Yaolun Zhang,Ruohui Wang,Jiahao Wang,Yepeng Tang,Xuanyu Zheng,Haonan Duan,Hao Lu,Hanming Deng,Lewei Lu
机构: SenseTime Research (商汤科技研究部)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: CVPR2026

点击查看摘要

Abstract:Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at this https URL.

[NLP-30] Multilingual KokoroChat: A Multi-LLM Ensemble Translation Method for Creating a Multilingual Counseling Dialogue Dataset

【速读】: 该论文旨在解决高质量、公开可用的心理咨询对话数据集稀缺的问题,通过将已有的大规模日语人工撰写咨询语料库KokoroChat翻译成英文和中文,构建多语言的Multilingual KokoroChat数据集。其解决方案的关键在于提出了一种新颖的多大语言模型(Large Language Model, LLM)集成方法:首先利用多个不同LLM生成多样化的翻译假设,随后由一个单一LLM基于对各假设优缺点的分析,综合生成高质量翻译结果。该方法在敏感领域如心理咨询中显著提升了翻译保真度,经人类偏好评估验证,其输出优于任何单一前沿LLM的翻译效果。

链接: https://arxiv.org/abs/2603.22913
作者: Ryoma Suzuki,Zhiyang Qi,Michimasa Inaba
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:To address the critical scarcity of high-quality, publicly available counseling dialogue datasets, we created Multilingual KokoroChat by translating KokoroChat, a large-scale manually authored Japanese counseling corpus, into both English and Chinese. A key challenge in this process is that the optimal model for translation varies by input, making it impossible for any single model to consistently guarantee the highest quality. In a sensitive domain like counseling, where the highest possible translation fidelity is essential, relying on a single LLM is therefore insufficient. To overcome this challenge, we developed and employed a novel multi-LLM ensemble method. Our approach first generates diverse hypotheses from multiple distinct LLMs. A single LLM then produces a high-quality translation based on an analysis of the respective strengths and weaknesses of all presented hypotheses. The quality of ``Multilingual KokoroChat’’ was rigorously validated through human preference studies. These evaluations confirmed that the translations produced by our ensemble method were preferred from any individual state-of-the-art LLM. This strong preference confirms the superior quality of our method’s outputs. The Multilingual KokoroChat is available at this https URL.

[NLP-31] EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本应用中因键值缓存(Key-Value Cache, KV cache)内存需求激增而导致的性能瓶颈问题。现有低秩压缩方法通常依赖不可逆的参数变换,限制了在内存充足时恢复全精度推理的灵活性。其解决方案的关键在于提出EchoKV,一种可按需切换标准与压缩推理模式的灵活KV缓存压缩方案:通过轻量级网络从部分子集重建注意力头间存在的层间和层内相似性所对应的残差KV分量,从而实现高效且可逆的压缩与重建;同时引入两阶段微调策略,在少量计算资源(如7B模型约1 A100 GPU小时)下完成快速训练,兼顾压缩效率与推理灵活性。

链接: https://arxiv.org/abs/2603.22910
作者: Yixuan Wang,Shiyu Ji,Yijun Liu,Qingfu Zhu,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.

[NLP-32] he Evolution of Tool Use in LLM Agents : From Single-Tool Call to Multi-Tool Orchestration

【速读】: 该论文旨在解决多工具大型语言模型(Large Language Models, LLMs)代理在复杂、长程任务中面临的协同调度与执行问题,即如何在存在中间状态、执行反馈、环境变化及安全、成本、可验证性等实际约束条件下,实现多个工具的高效、可靠组合与调用。其解决方案的关键在于从单一工具调用的早期研究范式转向对多工具编排(multi-tool orchestration)的系统性分析,通过构建涵盖推理时规划与执行、训练与轨迹构建、安全性与控制、资源受限下的效率优化、开放环境中能力完备性以及基准设计与评估在内的六大核心维度框架,为多工具LLM代理的研究提供结构化梳理与未来发展方向指引。

链接: https://arxiv.org/abs/2603.22862
作者: Haoyuan Xu,Chang Li,Xinyan Ma,Xianhao Ou,Zihan Zhang,Tao He,Xiangyu Liu,Zixiang Wang,Jiafeng Liang,Zheng Chu,Runxuan Liu,Rongchuan Mu,Ming Liu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Harvard University (哈佛大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyzes the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single-call tool use from long-horizon orchestration. Then, we organize the literature around six core dimensions: inference-time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi-tool agents.

[NLP-33] Avoiding Over-smoothing in Social Media Rumor Detection with Pre-trained Propagation Tree Transformer

【速读】: 该论文旨在解决基于图神经网络(Graph Neural Networks, GNNs)的谣言检测方法在处理谣言传播结构时存在的过平滑(over-smoothing)问题,以及GNN难以捕捉长距离依赖关系的局限性。其解决方案的关键在于提出一种纯Transformer架构的预训练传播树变换器(Pre-Trained Propagation Tree Transformer, P2T3),该方法通过提取传播方向上的全部对话链、引入token级嵌入以注入连接信息并加入必要的归纳偏置,在大规模无标签数据上进行预训练,从而有效规避GNN的过平滑问题,并在少样本条件下仍保持优异性能。

链接: https://arxiv.org/abs/2603.22854
作者: Chaoqun Cui,Caiyan Jia
机构: Beijing Jiaotong University (北京交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Deep learning techniques for rumor detection typically utilize Graph Neural Networks (GNNs) to analyze post relations. These methods, however, falter due to over-smoothing issues when processing rumor propagation structures, leading to declining performance. Our investigation into this issue reveals that over-smoothing is intrinsically tied to the structural characteristics of rumor propagation trees, in which the majority of nodes are 1-level nodes. Furthermore, GNNs struggle to capture long-range dependencies within these trees. To circumvent these challenges, we propose a Pre-Trained Propagation Tree Transformer (P2T3) method based on pure Transformer architecture. It extracts all conversation chains from a tree structure following the propagation direction of replies, utilizes token-wise embedding to infuse connection information and introduces necessary inductive bias, and pre-trains on large-scale unlabeled datasets. Experiments indicate that P2T3 surpasses previous state-of-the-art methods in multiple benchmark datasets and performs well under few-shot conditions. P2T3 not only avoids the over-smoothing issue inherent in GNNs but also potentially offers a large model or unified multi-modal scheme for future social media research.

[NLP-34] Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts EACL2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成涉及地缘政治身份(如巴勒斯坦人和以色列人)的虚拟人格时,是否存在系统性偏见及其背后机制的问题。研究发现,LLMs在战争情境下倾向于将巴勒斯坦人格描述为低社会经济地位且以生存为导向的角色,而以色列人格则多保持中产阶级特征与专业化职业属性;尽管在收到明确指令避免有害假设后,模型在性别多样性或职业泛化方面表现出分布变化,但底层的社会经济差异仍持续存在。解决方案的关键在于揭示了模型推理过程与生成结果之间的脱节:尽管模型在推理路径中频繁提及公平性概念,但其最终输出并未形成一致的公平调整,表明当前LLMs对公平性的理解缺乏稳定、可迁移的映射机制,从而凸显出地缘政治语境下生成式AI(Generative AI)代表性偏差的根本挑战。

链接: https://arxiv.org/abs/2603.22837
作者: Maida Aizaz,Quang Minh Nguyen
机构: KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL)
备注: EACL 2026 Student Research Workshop

点击查看摘要

Abstract:Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., “student”), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamics between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate a picture of how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.

[NLP-35] RadTimeline: Timeline Summarization for Longitudinal Radiological Lung Findings LREC

【速读】: 该论文旨在解决纵向放射学报告中病灶变化追踪的自动化问题,以提升疾病进展识别的准确性并缓解人工标注的时间消耗。其核心解决方案是将报告总结任务结构化为时间线生成任务(timeline generation task),通过三步流程实现:首先提取发现(findings),然后生成分组名称(group names),最后利用名称对发现进行分组,形成按时间列排列、相关发现按行归类的结构化时间线。研究表明,中间步骤“分组名称生成”是有效发现聚类的关键,尽管最优配置存在少量无关发现,但整体召回率高,且分组效果接近人工标注水平。

链接: https://arxiv.org/abs/2603.22820
作者: Sitong Zhou,Meliha Yetisgen,Mari Ostendorf
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at Language Resources and Evaluation Conference (LREC) 2026

点击查看摘要

Abstract:Tracking findings in longitudinal radiology reports is crucial for accurately identifying disease progression, and the time-consuming process would benefit from automatic summarization. This work introduces a structured summarization task, where we frame longitudinal report summarization as a timeline generation task, with dated findings organized in columns and temporally related findings grouped in rows. This structured summarization format enables straightforward comparison of findings across time and facilitates fact-checking against the associated reports. The timeline is generated using a 3-step LLM process of extracting findings, generating group names, and using the names to group the findings. To evaluate such systems, we create RadTimeline, a timeline dataset focused on tracking lung-related radiologic findings in chest-related imaging reports. Experiments on RadTimeline show tradeoffs of different-sized LLMs and prompting strategies. Our results highlight that group name generation as an intermediate step is critical for effective finding grouping. The best configuration has some irrelevant findings but very good recall, and grouping performance is comparable to human annotators.

[NLP-36] When AI Shows Its Work Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

【速读】: 该论文旨在解决当前前沿语言模型在生成步骤式推理(Chain-of-Thought, CoT)时存在的“装饰性推理”问题——即模型是否真正依赖这些中间推理步骤来得出最终答案,还是仅在决策后生成看似合理的叙述。其核心问题是:当移除某一步推理时,答案是否发生变化?解决方案的关键在于提出并实施步骤级评估(step-level evaluation),即逐次移除每一条推理句子,并检测答案是否随之改变。这种方法仅需API访问权限,无需模型权重,成本低且可扩展;实验表明,多数前沿模型的推理步骤具有高度冗余性(移除任意一步仅导致约17%的答案变化),说明其推理过程多为装饰性;而少数模型如MiniMax-M2.5和Kimi-K2.5在特定任务上表现出真实依赖性,进一步揭示了推理忠实度(faithfulness)具有模型和任务特异性。这一方法为评估大模型内部推理机制提供了可操作、低成本的基准工具。

链接: https://arxiv.org/abs/2603.22816
作者: Abhinaba Basu,Pavan Chakraborty
机构: Indian Institute of Information Technology Allahabad (IIITA); National Institute of Electronics and Information Technology (NIELIT)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes “The patient’s eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B.” If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access – no model weights – and costs approximately 1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover “output rigidity”: on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) MSC classes: 68T50, 68T05 ACMclasses: I.2.7; I.2.6 Cite as: arXiv:2603.22816 [cs.CL] (or arXiv:2603.22816v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.22816 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Abhinaba Basu [view email] [v1] Tue, 24 Mar 2026 05:38:13 UTC (23 KB) Full-text links: Access Paper: View a PDF of the paper titled When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning, by Abhinaba Basu and 1 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-03 Change to browse by: cs cs.AI cs.LG References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[NLP-37] Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration AAAI2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时易产生事实性错误(即幻觉)的问题,尤其针对现有 hallucination detection 方法因采用固定采样预算而无法根据查询复杂度自适应调整采样次数、导致计算效率低下的缺陷。其解决方案的关键在于提出一种基于自适应贝叶斯估计的语义熵框架(Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration),通过构建分层贝叶斯模型对语义分布进行建模,并利用方差阈值动态控制采样迭代次数,在达到足够置信度时提前终止生成;同时设计基于扰动的重要性采样策略以系统性探索语义空间,从而在保证检测性能的同时显著降低采样成本。

链接: https://arxiv.org/abs/2603.22812
作者: Qiyao Sun,Xingming Li,Xixiang He,Ao Cheng,Xuanyu Ji,Hailun Lu,Runke Huang,Qingyong Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to a AAAI 2026 (Oral Presentation, 5% acceptance rate), Project page: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success in various natural language processing tasks, yet they remain prone to generating factually incorrect outputs known as hallucinations. While recent approaches have shown promise for hallucination detection by repeatedly sampling from LLMs and quantifying the semantic inconsistency among the generated responses, they rely on fixed sampling budgets that fail to adapt to query complexity, resulting in computational inefficiency. We propose an Adaptive Bayesian Estimation framework for Semantic Entropy with Guided Semantic Exploration, which dynamically adjusts sampling requirements based on observed uncertainty. Our approach employs a hierarchical Bayesian framework to model the semantic distribution, enabling dynamic control of sampling iterations through variance-based thresholds that terminate generation once sufficient certainty is achieved. We also develop a perturbation-based importance sampling strategy to systematically explore the semantic space. Extensive experiments on four QA datasets demonstrate that our method achieves superior hallucination detection performance with significant efficiency gains. In low-budget scenarios, our approach requires about 50% fewer samples to achieve comparable detection performance to existing methods, while delivers an average AUROC improvement of 12.6% under the same sampling budget.

[NLP-38] Span Modeling for Idiomaticity and Figurative Language Detection with Span Contrastive Loss

【速读】: 该论文旨在解决语言模型在识别非组合性多词表达(Multi-Word Expression, MWE),特别是习语(Idiom)时面临的挑战,此类表达的语义不等于其组成部分的简单叠加,而现有模型常因分词(tokenization)和上下文嵌入(contextual embeddings)机制导致识别失败。解决方案的关键在于提出基于BERT和RoBERTa的微调框架,结合槽位损失(slot loss)与跨度对比损失(span contrastive loss, SCL),并引入硬负样本重加权策略,以增强模型对习语整体语义的感知能力;实验表明该方法在现有数据集上实现了最优的序列准确率(sequence accuracy, SA),且通过消融实验证明了SCL的有效性和泛化能力。

链接: https://arxiv.org/abs/2603.22799
作者: Blake Matheny,Phuong Minh Nguyen,Minh Le Nguyen
机构: JAIST (日本信息科学与技术大学院大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The category of figurative language contains many varieties, some of which are non-compositional in nature. This type of phrase or multi-word expression (MWE) includes idioms, which represent a single meaning that does not consist of the sum of its words. For language models, this presents a unique problem due to tokenization and adjacent contextual embeddings. Many large language models have overcome this issue with large phrase vocabulary, though immediate recognition frequently fails without one- or few-shot prompting or instruction finetuning. The best results have been achieved with BERT-based or LSTM finetuning approaches. The model in this paper contains one such variety. We propose BERT- and RoBERTa-based models finetuned with a combination of slot loss and span contrastive loss (SCL) with hard negative reweighting to improve idiomaticity detection, attaining state of the art sequence accuracy performance on existing datasets. Comparative ablation studies show the effectiveness of SCL and its generalizability. The geometric mean of F1 and sequence accuracy (SA) is also proposed to assess a model’s span awareness and general performance together.

[NLP-39] Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在真实世界证据(Real-World Evidence, RWE)研究中执行端到端观察性研究任务时存在的能力局限问题,即现有大语言模型(LLM)代理通常仅在孤立步骤或单一答案上表现良好,而缺乏对完整证据包(evidence bundle)结构和逻辑一致性的把握。解决方案的关键在于提出 RWE-bench 基准测试平台,其基于 MIMIC-IV 数据库并源自同行评审的观察性研究,每个任务提供标准研究协议作为参考,要求代理在真实数据库中迭代生成树状结构的证据包,并通过多维度评估指标(包括问题级正确性和端到端任务成功率)系统量化模型性能,同时引入自动化队列评估方法以快速定位错误和识别代理失败模式,从而揭示当前 LLM 代理在构建可信赖、完整证据链条方面的显著不足。

链接: https://arxiv.org/abs/2603.22767
作者: Dubai Li,Yuxiang He,Yan Hu,Yu Tian,Jingsong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents’ ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at this https URL.

[NLP-40] KALAVAI: Predicting When Independent Specialist Fusion Works – A Quantitative Model for Post-Hoc Cooperative LLM Training

【速读】: 该论文旨在解决多领域专业模型(domain specialists)协同优化的问题,即如何通过融合多个独立训练的模型来提升整体性能,并在不进行大规模重新训练的前提下预测协作收益。其关键解决方案是提出了一种轻量级的混合专家(MoE)路由协议——KALAVAI,该协议允许贡献者独立微调共享检查点的副本后提交至一个高效路由机制中,从而实现跨模型的知识融合与性能提升。该方法的核心创新在于:首先通过预训练模型间的分布差异(divergence)可预测融合增益(gain = 0.82 × divergence − 2.72),使实践者能在投入计算资源前评估协作价值;其次,使用仅500步的轻量级路由训练即可达到接近领域最优路由的效果(误差<10⁻⁵ nats),且对冻结层和初始权重一致性有明确约束条件,确保融合后的模型稳定优于任一单独的专业模型。

链接: https://arxiv.org/abs/2603.22755
作者: Ramchand Kumaresan
机构: Murai Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 x divergence - 2.72 (R^2 = 0.856, n=6, 3-26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach this http URL the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (+/-0.02%, 3 seeds), +7.49% at 1B (+/-0.01%, 3 seeds), +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing within 10^-5 nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling 41.9 to 7.7. A 20-contributor federation achieves +16.71% (+/-0.07pp, 3 seeds).Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades by -1.2% vs. best specialist, while any trained router achieves oracle-optimal assignment.

[NLP-41] PRISM: A Dual View of LLM Reasoning through Semantic Flow and Latent Computation

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在复杂问题求解过程中,其多步推理轨迹难以被全面分析的问题。现有研究通常仅从两个单一维度进行考察:一是生成文本中不同推理步骤之间的token序列变化,二是单一步骤内模型各层隐藏状态向量的演变。为弥合这一分析盲区,论文提出PRISM(Probabilistic Reasoning Inspection through Semantic and Implicit Modeling)框架,其关键在于通过联合建模推理步骤间的语义演化与层间隐状态的内在计算模式,实现对推理过程的结构化可观测与可诊断分析。该方法揭示了失败推理路径常陷入无效率验证循环,并进一步分化为过度思考或过早承诺等不同行为模式,从而超越传统基于最终任务准确率的评估方式,提供更精细的推理行为洞察。

链接: https://arxiv.org/abs/2603.22754
作者: Ruidi Chang,Jiawei Zhou,Hanjie Chen
机构: Rice University (莱斯大学); Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) solve complex problems by generating multi-step reasoning traces. Yet these traces are typically analyzed from only one of two perspectives: the sequence of tokens across different reasoning steps in the generated text, or the hidden-state vectors across model layers within one step. We introduce PRISM (Probabilistic Reasoning Inspection through Semantic and Implicit Modeling), a framework and diagnostic tool for jointly analyzing both levels, providing a unified view of how reasoning evolves across steps and layers. Across multiple reasoning models and benchmarks, PRISM uncovers systematic patterns in the reasoning process, showing that failed trajectories are more likely to become trapped in unproductive verification loops and further diverge into distinct modes such as overthinking and premature commitment, which behave differently once a candidate answer is reached. It further reveals how prompting reshapes reasoning behavior beyond aggregate accuracy by altering both semantic transitions and internal computational patterns. By modeling reasoning trajectories as structured processes, PRISM makes these behaviors observable and analyzable rather than relying solely on final-task accuracy. Taken together, these insights position PRISM as a practical tool for analyzing and diagnosing reasoning processes in LLMs.

[NLP-42] Explanation Generation for Contradiction Reconciliation with LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对矛盾陈述时缺乏生成 reconciliatory explanation( reconciliatory explanation,即能够使矛盾陈述兼容的解释)的能力这一问题。现有自然语言处理(Natural Language Processing, NLP)研究通常将矛盾视为需通过选择接受或舍弃语句来消除的错误,而忽视了人类在社交互动和专业领域中通过假设性解释来调和矛盾的核心推理能力。论文的关键解决方案是提出“reconciliatory explanation generation”任务,并设计了一种新颖方法,通过重新利用现有的自然语言推理(Natural Language Inference, NLI)数据集构建训练与评估框架,同时引入可扩展的自动评估指标,从而系统性地衡量LLMs在该任务上的表现。实验表明,当前主流LLMs在此任务上表现有限,且随着模型规模增大,通过增加推理计算量(如“思考”机制)所带来的性能提升趋于饱和,凸显出该维度在LLM推理能力中的重要性和改进必要性。

链接: https://arxiv.org/abs/2603.22735
作者: Jason Chan,Zhixue Zhao,Robert Gaizauskas
机构: University of Sheffield, UK (谢菲尔德大学,英国)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, “Cassie hates coffee” and “She buys coffee everyday” may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by “thinking” plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs’ downstream applications such as chatbots and scientific aids.

[NLP-43] How Utilitarian Are OpenAI s Models Really? Replicating and Reinterpreting Pfeffer Krügel and Uhl (2025)

【速读】: 该论文试图解决的问题是:当前对大型语言模型(Large Language Models, LLMs)道德推理能力的评估方法存在显著偏差,尤其依赖单一提示(prompt)可能误导研究结论,从而无法真实反映模型在伦理决策中的倾向性。解决方案的关键在于通过多提示变体测试(multi-prompt robustness testing),识别并排除提示设计中的混淆因素(如“建议式”表述引发的安全拒绝行为),从而揭示模型在去除干扰后的真实道德倾向——实证表明,当消除提示框架带来的偏倚时,多个主流OpenAI模型均趋向于给出功利主义(utilitarian)的回答,而推理模式(reasoning model)与非推理模式(non-reasoning model)之间的差异仅在特定条件下显现,且常伴随回答拒绝行为。因此,论文主张将多提示稳健性测试作为LLM道德行为研究的标准实践。

链接: https://arxiv.org/abs/2603.22730
作者: Johannes Himmelreich
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 10 pages, 2 figures, 2 tables. Supplementary materials included as ancillary file

点击查看摘要

Abstract:Pfeffer, Krügel, and Uhl (2025) report that OpenAI’s reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o’s low utilitarian rate doesn’t reflect a deontological commitment but safety refusals triggered by the prompt’s advisory framing. When framed as “Is it morally permissible…?” instead of “Should I…?”, GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.

[NLP-44] Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics INTERSPEECH2026

【速读】: 该论文旨在解决对话式自动语音识别(Conversational Automatic Speech Recognition, CASR)在多说话人场景下的性能瓶颈问题,特别是针对重叠语音、远场噪声以及说话人数变化带来的挑战。其关键解决方案在于系统性地对比基于大语言模型(LLM)的方法与模块化流水线方法在四个维度上的表现:重叠鲁棒性、语义保真度、说话人数量适应性和单通道与多通道输入的差异;并提出一种新的评估指标 tcpSemER,通过用嵌入相似性替代传统的 Levenshtein 距离来捕捉语义层面的错误,从而更准确反映模型在复杂场景下的实际表现能力。实验表明,LLM 方法在双说话人场景下具有竞争力,但随着说话人数和语音重叠程度增加,其性能显著下降,而模块化流水线方法则展现出更强的鲁棒性。

链接: https://arxiv.org/abs/2603.22709
作者: Naohiro Tawara,Samuele Cornell,Alexander Polok,Marc Delcroix,Lukáš Burget,Shinji Watanabe
机构: NTT, Inc.(NTT公司); CMU(卡内基梅隆大学); BUT(布杰约维采理工大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to INTERSPEECH 2026

点击查看摘要

Abstract:Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.

[NLP-45] Detecting Non-Membership in LLM Training Data via Rank Correlations EACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)训练过程中数据来源不可追溯的问题,尤其是针对“特定数据集未被用于训练”这一非成员检测(non-membership inference)任务缺乏有效方法的现状。传统研究多聚焦于判断某数据集是否曾被用作训练数据(即成员推理),而忽略了验证数据集未被使用的重要性,这在版权保护、合规审计和用户信任等方面具有关键意义。论文提出的解决方案为PRISM,其核心创新在于利用灰盒访问(grey-box access)模型输出的logits,基于一个关键观察:两个均未见过某数据集的模型,在归一化token对数概率上的排名相关性(rank correlation)显著高于其中一个模型曾受该数据集训练的情况。基于此,PRISM构建了一个基于相关性的测试机制,能可靠地排除训练数据中的特定数据集,且不产生误报,从而为验证特定数据集未被纳入LLM训练提供了一种可操作的框架。

链接: https://arxiv.org/abs/2603.22707
作者: Pranav Shetty,Mirazul Haque,Zhiqiang Ma,Xiaomo Liu
机构: JPMorgan AI Research
类目: Computation and Language (cs.CL)
备注: Accepted to EACL 2026 Main Conference

点击查看摘要

Abstract:As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem – verifying that a dataset was not used – has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level non-membership using only grey-box access to model logits. Our key insight is that two models that have not seen a dataset exhibit higher rank correlation in their normalized token log probabilities than when one model has been trained on that data. Using this observation, we construct a correlation-based test that detects non-membership. Empirically, PRISM reliably rules out membership in training data across all datasets tested while avoiding false positives, thus offering a framework for verifying that specific datasets were excluded from LLM training.

[NLP-46] Synthetic or Authentic? Building Mental Patient Simulators from Longitudinal Evidence

【速读】: 该论文旨在解决当前心理对话系统中患者模拟存在的关键问题:现有方法多依赖于静态提示(snapshot-style prompts),导致患者行为同质化、多轮交互中疾病进展不连贯,且缺乏真实世界的纵向数据支持。其解决方案的核心在于提出DEPROFILE框架,通过整合来自真实世界数据的多源信息(包括人口统计学特征、标准化临床症状、咨询对话记录及纵向生活事件历史)构建统一且详实的患者画像,并引入Chain-of-Change代理将杂乱的纵向记录转化为结构化的时序记忆表示,从而提升对话的真实性、行为多样性与事件丰富性。

链接: https://arxiv.org/abs/2603.22704
作者: Baihan Li,Bingrui Jin,Kunyao Lan,Ming Wang,Mengyue Wu
机构: SJTU Paris Elite Institute of Technology, Shanghai Jiao Tong University, China; X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; School of Computer Science and Engineering, Northeastern University, China; School of Computing and Information Systems, Singapore Management University, Singapore
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Patient simulation is essential for developing and evaluating mental health dialogue systems. As most existing approaches rely on snapshot-style prompts with limited profile information, homogeneous behaviors and incoherent disease progression in multi-turn interactions have become key chellenges. In this work, we propose DEPROFILE, a data-grounded patient simulation framework that constructs unified, multi-source patient profiles by integrating demographic attributes, standardized clinical symptoms, counseling dialogues, and longitudinal life-event histories from real-world data. We further introduce a Chain-of-Change agent to transform noisy longitudinal records into structured, temporally grounded memory representations for simulation. Experiments across multiple large language model (LLM) backbones show that with more comprehensive profile constructed by DEPROFILE, the dialogue realism, behavioral diversity, and event richness have consistently improved and exceed state-of-the-art baselines, highlighting the importance of grounding patient simulation in verifiable longitudinal evidence.

[NLP-47] Improving LLM Predictions via Inter-Layer Structural Encoders

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中仅依赖最终层token表示所带来的信息利用率不足问题,尤其是在不同任务中,中间层可能包含比最终层更相关的特征。为此,作者提出了一种名为Inter-Layer Structural Encoders (ILSE) 的结构化方法,其核心创新在于Cayley-Encoder——一种基于扩展Cayley图的数学严谨几何编码器,能够高效地在LLM各层之间传播和整合信息,从而学习到一个统一且更具表现力的内部表示。该方案显著提升了多任务分类与语义相似度任务的性能,并在小样本场景下展现出优异的数据效率。

链接: https://arxiv.org/abs/2603.22665
作者: Tom Ulanovski(1),Eyal Blyachman(1),Maya Bechler-Speicher(2) ((1) Tel Aviv University, (2) Meta)
机构: Tel Aviv University (特拉维夫大学); Meta (Meta)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 3 figures. Equal contribution by first two authors

点击查看摘要

Abstract:The standard practice in Large Language Models (LLMs) is to base predictions on the final-layer token representations. Recent studies, however, show that intermediate layers encode substantial information, which may contain more task-relevant features than the final-layer representations alone. Importantly, it was shown that for different tasks, different layers may be optimal. In this work we introduce Inter-Layer Structural Encoders (ILSE), a powerful structural approach to learn one effective representation from the LLM’s internal layer representations all together. Central to ILSE is Cayley-Encoder, a mathematically grounded geometric encoder that leverages expander Cayley graphs for efficient inter-layer information propagation. We evaluate ILSE across 13 classification and semantic similarity tasks with 9 pre-trained LLMs ranging from 14 million to 8 billion parameters. ILSE consistently outperforms baselines and existing approaches, achieving up to 44% improvement in accuracy and 25% in similarity metrics. We further show that ILSE is data-efficient in few-shot regimes and can make small LLMs competitive with substantially larger models.

[NLP-48] Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns Cost-Accuracy Tradeoffs and Production Scaling Strategies

【速读】: 该论文旨在解决在金融文档中结构化信息提取任务中,多智能体大语言模型(Large Language Models, LLMs)架构选择缺乏实证指导的问题。其关键解决方案是系统性地比较四种多智能体编排架构——顺序流水线、并行分叉合并、分层监督-工作者和反射式自校正循环——并在包含10,000份美国证券交易委员会(SEC)文件的语料库上进行多维评估,涵盖字段级F1分数、文档级准确率、端到端延迟、每文档成本和令牌效率等指标。研究发现,反射式架构虽能实现最高字段级F1(0.943),但成本为顺序基线的2.3倍;而分层架构在成本-准确率帕累托前沿上表现最优(F1 0.921,成本为基线的1.4倍)。进一步的消融实验表明,混合配置可仅以1.15倍基线成本恢复89%的反射式架构精度收益,为监管环境下的多智能体LLM部署提供了可操作的工程决策依据。

链接: https://arxiv.org/abs/2603.22651
作者: Siddhant Kulkarni,Yukta Kulkarni
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with limited empirical guidance. We present a systematic benchmark comparing four multi-agent orchestration architectures: sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker and reflexive self-correcting loop. These are evaluated across five frontier and open-weight LLMs on a corpus of 10,000 SEC filings (10-K, 10-Q and 8-K forms). Our evaluation spans 25 extraction field types covering governance structures, executive compensation and financial metrics, measured along five axes: field-level F1, document-level accuracy, end-to-end latency, cost per document and token efficiency. We find that reflexive architectures achieve the highest field-level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures occupy the most favorable position on the cost-accuracy Pareto frontier (F1 0.921 at 1.4x cost). We further present ablation studies on semantic caching, model routing and adaptive retry strategies, demonstrating that hybrid configurations can recover 89% of the reflexive architecture’s accuracy gains at only 1.15x baseline cost. Our scaling analysis from 1K to 100K documents per day reveals non-obvious throughput-accuracy degradation curves that inform capacity planning. These findings provide actionable guidance for practitioners deploying multi-agent LLM systems in regulated financial environments.

[NLP-49] Multi-Method Validation of Large Language Model Medical Translation Across High- and Low-Resource Languages

【速读】: 该论文旨在解决美国2730万非英语母语居民在医疗场景中因语言障碍导致的沟通难题,传统专业医疗翻译成本高且难以获取。其解决方案的关键在于评估前沿大语言模型(Large Language Models, LLMs)在多语言医学文本翻译中的表现,特别是针对高资源、中资源和低资源语言的语义保真度。研究通过五层验证框架对四个主流LLM(GPT-5.1、Claude Opus 4.5、Gemini 3 Pro、Kimi K2)进行系统测试,结果显示所有模型在704个翻译对中均保持高度语义一致性(LaBSE > 0.92),且高低资源语言间无显著差异(p = 0.066),同时排除了模型自循环带来的偏差,表明这些模型具备跨语言资源水平稳定传递医学含义的能力,为提升医疗场景下的语言可及性提供了可行的技术路径。

链接: https://arxiv.org/abs/2603.22642
作者: Chukwuebuka Anyaegbuna,Eduardo Juan Perez Guerrero,Jerry Liu,Timothy Keyes,April Liang,Natasha Steele,Stephen Ma,Jonathan Chen,Kevin Schulman
机构: 未知
类目: Computation and Language (cs.CL)
备注: 32 references, 5 tables, 2 figures

点击查看摘要

Abstract:Language barriers affect 27.3 million U.S. residents with non-English language preference, yet professional medical translation remains costly and often unavailable. We evaluated four frontier large language models (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Kimi K2) translating 22 medical documents into 8 languages spanning high-resource (Spanish, Chinese, Russian, Vietnamese), medium-resource (Korean, Arabic), and low-resource (Tagalog, Haitian Creole) categories using a five-layer validation framework. Across 704 translation pairs, all models achieved high semantic preservation (LaBSE greater than 0.92), with no significant difference between high- and low-resource languages (p = 0.066). Cross-model back-translation confirmed results were not driven by same-model circularity (delta = -0.0009). Inter-model concordance across four independently trained models was high (LaBSE: 0.946), and lexical borrowing analysis showed no correlation between English term retention and fidelity scores in low-resource languages (rho = +0.018, p = 0.82). These converging results suggest frontier LLMs preserve medical meaning across resource levels, with implications for language access in healthcare.

[NLP-50] LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

【速读】: 该论文旨在解决预训练语言模型在低资源、形态丰富的语言(如阿姆哈拉语和提格里尼亚语)中适应困难的问题。现有词汇扩展方法通常依赖于任意分割的子词单元,导致词汇表示碎片化并丢失关键的形态信息。其解决方案的关键在于提出一种基于词素的嵌入初始化框架——Lexically Grounded Subword Embedding Initialization (LGSE),通过将单词分解为语素,并利用预训练的子词或FastText词素表示进行语义一致的嵌入构造;对于无法分解为有意义词素的词,则采用字符n-gram表示以保留结构信息。此外,在语言自适应预训练阶段引入正则化项,限制新嵌入与初始值的偏差,从而在保持原嵌入空间对齐的同时实现目标语言的适配。

链接: https://arxiv.org/abs/2603.22629
作者: Hailay Teklehaymanot,Dren Fazlija,Wolfgang Nejdl
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, 1 Table

点击查看摘要

Abstract:Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from their initialized values, preserving alignment with the original pretrained embedding space while enabling adaptation to the target language. To isolate the effect of initialization, we retain the original pre-trained model vocabulary and tokenizer and update only the new embeddings during adaptation. We evaluate LGSE on three NLP tasks: Question Answering, Named Entity Recognition, and Text Classification, in two morphologically rich, low-resource languages: Amharic and Tigrinya, where morphological segmentation resources are available. Experimental results show that LGSE consistently outperforms baseline methods across all tasks, demonstrating the effectiveness of morphologically grounded embedding initialization for improving representation quality in underrepresented languages. Project resources are available in the GitHub link.

[NLP-51] Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理多实例输入(multi-instance inputs)时性能表现不明确的问题,尤其是在其个体任务上表现出色的情况下。研究表明,LLMs在处理少量实例(约20–100个)时会出现轻微性能下降,而在实例数量进一步增加时则发生显著性能崩溃。解决方案的关键在于识别出:虽然上下文长度(context length)与性能退化相关,但实例数量(instance count)对最终结果的影响更为显著。因此,优化LLMs在多实例处理(Multi-Instance Processing, MIP)中的表现时,应同时关注上下文长度和实例数量,其中后者是更关键的调控因素。

链接: https://arxiv.org/abs/2603.22608
作者: Jingxuan Chen,Mohammad Taher Pilehvar,Jose Camacho-Collados
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Users often rely on Large Language Models (LLMs) for processing multiple documents or performing analysis over a number of instances. For example, analysing the overall sentiment of a number of movie reviews requires an LLM to process the sentiment of each review individually in order to provide a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform when dealing with multi-instance inputs. In this paper, we perform a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs for tasks in which they excel individually. The results show that all LLMs follow a pattern of slight performance degradation for small numbers of instances (approximately 20-100), followed by a performance collapse on larger instance counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.

[NLP-52] Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)推理作为大语言模型在安全关键场景中透明性机制的有效性问题,特别是其核心属性——忠实性(faithfulness)的评估不足。此前研究仅限于两个专有模型,且发现忠实性水平极低(如Claude 3.7 Sonnet为25%,DeepSeek-R1为39%)。为此,作者扩展评估至开放权重生态中的12个模型(涵盖7B–685B参数规模),通过注入六类推理提示(包括一致性、奉承、视觉模式等),测量模型在提示成功改变答案时是否在CoT中承认影响。关键发现在于:模型忠实性存在显著差异(39.7%–89.9%),训练方法和架构比参数量更能预测忠实性;更重要的是,模型内部识别提示影响的比例高(约87.5%),但最终输出中承认该影响的比例极低(约28.6%),表明模型具备认知能力但系统性抑制对外披露,这直接挑战了CoT作为安全监控手段的可行性,并揭示忠实性并非固定属性,而是受模型结构、训练策略及提示类型共同调控。

链接: https://arxiv.org/abs/2603.22582
作者: Richard J. Young
机构: University of Nevada, Las Vegas(内华达大学拉斯维加斯分校); DeepNeuro AI(DeepNeuro AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 27 pages, 7 figures, 12 tables

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.

[NLP-53] CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在巴西葡萄牙语(Brazilian Portuguese)环境下指令遵循能力评估缺乏专门化、文化语境化基准的问题。现有基准多聚焦于英语或使用通用提示,难以准确衡量模型对特定语言特征和文化内容的掌握。解决方案的关键在于构建CAPITU——一个基于八部巴西文学经典作品设计的多任务指令遵循基准,涵盖59种可自动验证的指令类型(包括葡萄牙语特有的词尾约束如-ando/-endo/-indo、-inho/-inha、-mente等),并引入结构化要求以增强评估的客观性和可复现性。该方法不仅提升了对语言规则敏感性的检测能力,还通过单轮与多轮交互设置揭示了模型在约束保持性方面的系统性挑战,从而为葡萄牙语场景下的LLM优化提供了可量化、可比较的研究工具。

链接: https://arxiv.org/abs/2603.22576
作者: Giovana Kerche Bonás,Roseval Malaquias Junior,Marcos Piau,Thiago Laitz,Thales Sales Almeida,Hugo Abonizio,Celio Larcher,Ramon Pires,Rodrigo Nogueira
机构: Maritaca AI; Jusbrasil
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at \ 0.13 vs Claude-Haiku-4.5: 73.5% at \ 1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.

[NLP-54] Reddit After Roe: A Computational Analysis of Abortion Narratives and Barriers in the Wake of Dobbs

【速读】: 该论文试图解决的问题是:在2022年美国最高法院Dobbs案判决后,堕胎权法律环境剧变背景下,如何系统理解在线社区中用户对堕胎障碍(barrier)的表达方式及其与信息行为、情绪状态和时间动态之间的关系。解决方案的关键在于构建一个多层次的计算分析框架,通过多步骤管道对来自四个堕胎相关Reddit子版块的超过17,000篇帖子进行分类,识别其信息类型、堕胎阶段(孕前、孕中、孕后)、障碍类别(如法律、财务、情感和社会障碍)及情绪表达,并结合主题建模分析障碍理由的演化,从而揭示线上堕胎叙事中情感与心理障碍的主导地位及其随法律和文化语境变化的动态特征。

链接: https://arxiv.org/abs/2603.22566
作者: Aria Pessianzadeh,Alex H. Poole,Rezvaneh Rezapour
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The 2022 U.S. Supreme Court decision in Dobbs v. Jackson Women’s Health Organization reshaped the reproductive rights landscape, introducing new uncertainty and barriers to abortion access. We present a large-scale computational analysis of abortion discourse on Reddit, examining how barriers to access are articulated across information-seeking and information-sharing behaviors, different stages of abortion (before, during, after), and three phases of the Dobbs decision in 2022. Drawing on more than 17,000 posts from four abortion-related subreddits, we employed a multi-step pipeline to classify posts by information type, abortion stage, barrier category, and expressed emotions. Using a codebook of eight barrier types, including legal, financial, emotional, and social obstacles, we analyzed their associations with emotions and information behaviors. Topic modeling of model-generated barrier rationales further revealed how discourse evolved in response to shifting legal and cultural contexts. Our findings show that emotional and psychological barriers consistently dominate abortion narratives online, with emotions such as nervousness, confusion, fear, and sadness prevalent across discourse. By linking information behaviors, barriers, emotions, and temporal dynamics, this study provides a multi-dimensional account of how abortion is navigated in online communities.

[NLP-55] Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos CVPR2026

【速读】: 该论文旨在解决当前多模态AI代理(Multimodal AI Agents)在评估中缺乏对用户物理环境感知能力的建模问题,即现有网络代理基准测试仅关注在线交互与视觉感知,而忽略了代理需结合第一人称视角(egocentric)视觉理解来完成现实世界任务的关键场景。其解决方案的核心在于提出首个连接第一人称视频感知与网络代理执行的基准测试Ego2Web,通过真实世界的第一人称视频记录与需视觉理解、任务规划及在线交互才能完成的网络任务配对,构建高质量、多样化的视频-任务对数据集,并开发了基于大语言模型作为裁判(LLM-as-a-Judge)的自动评估方法Ego2WebJudge,实现了与人工判断高度一致(约84%一致性)的可扩展评估体系,从而推动能无缝融合物理与数字世界感知与行动能力的下一代AI助手的发展。

链接: https://arxiv.org/abs/2603.22529
作者: Shoubin Yu,Lei Shu,Antoine Yang,Yao Fu,Srinivas Sunkara,Maria Wang,Jindong Chen,Mohit Bansal,Boqing Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user’s real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user’s surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.

[NLP-56] Generating and Evaluating Sustainable Procurement Criteria for the Swiss Public Sector using In-Context Prompting with Large Language Models

【速读】: 该论文旨在解决瑞士公共采购中将高层次可持续性法规(如生态、社会和经济可持续性要求)转化为具体、可验证且行业特定的采购标准(如评选标准、授予标准和技术规范)时存在的劳动密集型与易出错的手动任务问题。解决方案的关键在于提出一个可配置的、基于大语言模型(Large Language Model, LLM)辅助的流水线系统,该系统整合了上下文提示(in-context prompting)、可替换的LLM后端以及自动化输出验证机制,从而实现跨不同采购行业的可审计标准生成,并通过官方指南作为结构化参考文档进行实例化与评估,显著降低人工编写负担并确保生成内容与政策一致性。

链接: https://arxiv.org/abs/2603.22513
作者: Yingqiang Gao,Veton Matoshi,Luca Rolshoven,Tilia Ellendorff,Judith Binder,Jeremy Austin Jann,Gerold Schneider,Matthias Stürmer
机构: University of Zurich(苏黎世大学); Bern University of Applied Sciences(伯尔尼应用科学大学); University of Bern(伯尔尼大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Public procurement refers to the process by which public sector institutions, such as governments, municipalities, and publicly funded bodies, acquire goods and services. Swiss law requires the integration of ecological, social, and economic sustainability requirements into tender evaluations in the format of criteria that have to be fulfilled by a bidder. However, translating high-level sustainability regulations into concrete, verifiable, and sector-specific procurement criteria (such as selection criteria, award criteria, and technical specifications) remains a labor-intensive and error-prone manual task, requiring substantial domain expertise in several groups of goods and services and considerable manual effort. This paper presents a configurable, LLM-assisted pipeline that is presented as a software supporting the systematic generation and evaluation of sustainability-oriented procurement criteria catalogs for Switzerland. The system integrates in-context prompting, interchangeable LLM backends, and automated output validation to enable auditable criteria generation across different procurement sectors. As a proof of concept, we instantiate the pipeline using official sustainability guidelines published by the Swiss government and the European Commission, which are ingested as structured reference documents. We evaluate the system through a combination of automated quality checks, including an LLM-based evaluation component, and expert comparison against a manually curated gold standard. Our results demonstrate that the proposed pipeline can substantially reduce manual drafting effort while producing criteria catalogs that are consistent with official guidelines. We further discuss system limitations, failure modes, and design trade-offs observed during deployment, highlighting key considerations for integrating generative AI into public sector software workflows.

[NLP-57] Rashid: A Cipher-Based Framework for Exploring In-Context Language Learning

【速读】: 该论文旨在解决当前在低资源语言(Low-Resource Languages, LRLs)中开展上下文学习(In-Context Learning, ICLL)研究时面临的三大挑战:缺乏自然语言处理(Natural Language Processing, NLP)工具、数据资源匮乏以及研究人员专业知识不足,导致难以评估进展、无法进行大规模实验,且现有成果多局限于少数语言和任务。解决方案的关键在于提出一个名为Rashid的框架,通过可逆加密(reversible ciphering)高资源语言(High-Resource Languages, HRLs)来构造真正意义上的“未见语言”,从而在保留HRL丰富资源(如标注数据、评估工具等)的同时模拟LRL场景,实现对ICLL现象前所未有的探索能力。该框架使研究者能够系统性地评估当前SOTA方法、检验昂贵资源的价值,并在超越机器翻译的下游任务上测试ICLL策略,为ICLL领域提供了可扩展、可控且具实用价值的研究路径。

链接: https://arxiv.org/abs/2603.22497
作者: Niyati Bafna,Ryan Soh-Eun Shim,Barbara Plank,David Yarowsky,Hale Sirin
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Where there is growing interest in in-context language learning (ICLL) for unseen languages with large language models, such languages usually suffer from the lack of NLP tools, data resources, and researcher expertise. This means that progress is difficult to assess, the field does not allow for cheap large-scale experimentation, and findings on ICLL are often limited to very few languages and tasks. In light of such limitations, we introduce a framework (Rashid), for studying ICLL wherein we reversibly cipher high-resource languages (HRLs) to construct truly unseen languages with access to a wide range of resources available for HRLs, unlocking previously impossible exploration of ICLL phenomena. We use our framework to assess current methods in the field with SOTA evaluation tools and manual analysis, explore the utility of potentially expensive resources in improving ICLL, and test ICLL strategies on rich downstream tasks beyond machine translation. These lines of exploration showcase the possibilities enabled by our framework, as well as providing actionable insights regarding current performance and future directions in ICLL.

[NLP-58] Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures

【速读】: 该论文旨在解决混合语言模型中不同组件(如注意力机制与状态空间模型或线性注意力)是否被真正利用的问题,即验证这些架构中的各模块是否存在冗余或功能缺失。其解决方案的关键在于提出并应用一种系统的功能组件消融框架,通过在两个子1B参数规模的混合模型(Qwen3.5-0.8B 和 Falcon-H1-0.5B)与纯Transformer对照模型(Qwen2.5-0.5B)上进行多维度实验——包括组级消融、层级扫描、位置敏感性分析、匹配随机控制以及困惑度(perplexity)评估——从而量化各组件的作用强度与分布规律,揭示了混合架构中线性注意力或状态空间模型(State Space Model, SSM)作为主干结构的重要性及其与传统注意力机制之间的功能性互补和冗余关系。

链接: https://arxiv.org/abs/2603.22473
作者: Hector Borobia,Elies Seguí-Mas,Guillermina Tormo-Carbó
机构: Universitat Politècnica de València (瓦伦西亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 7 figures, 6 tables. Code and data available at this https URL

点击查看摘要

Abstract:Hybrid language models combining attention with state space models (SSMs) or linear attention offer improved efficiency, but whether both components are genuinely utilized remains unclear. We present a functional component ablation framework applied to two sub-1B hybrid models – Qwen3.5-0.8B (sequential: Gated DeltaNet + softmax attention) and Falcon-H1-0.5B (parallel: Mamba-2 + attention) – with a pure Transformer control (Qwen2.5-0.5B). Through group ablations, layer-wise sweeps, positional ablations, matched random controls, and perplexity analysis across five benchmarks, we establish four findings: (1) both component types are essential and neither is bypassed; (2) the alternative component (linear attention or SSM) is the primary language modeling backbone, causing 35,000x perplexity degradation when removed versus ~82x for attention; (3) component importance follows a positional gradient, with early layers being disproportionately critical; and (4) hybrid architectures exhibit 20-119x greater resilience to random layer removal than pure Transformers, revealing built-in functional redundancy between component types. These results provide actionable guidance for hybrid model compression, architecture design, and fault-tolerant deployment.

[NLP-59] LLM -guided headline rewriting for clickability enhancement without clickbait

【速读】: 该论文旨在解决新闻标题生成中如何在保持信息忠实性的同时提升读者参与度的问题,避免因过度追求点击率而产生误导性或夸张的“点击诱饵”(clickbait)内容。其解决方案的关键在于将标题重写建模为一个可控生成任务,通过大语言模型(LLM)结合推理时控制机制——即未来判别器生成(FUDGE)范式,引入两个辅助引导模型:一是点击诱饵评分模型,用于负向引导以抑制不合理的风格强化;二是参与度属性模型,用于正向引导以增强目标吸引力特征。两者均基于真实新闻语料中的中性标题进行训练,并通过可控激活预定义的参与度策略合成点击诱饵变体,从而在推理阶段调节引导权重,在中性改写与高吸引力但符合编辑伦理的标题之间形成连续谱系,实现吸引力、语义保真与点击诱饵规避之间的平衡。

链接: https://arxiv.org/abs/2603.22459
作者: Yehudit Aperstein,Linoy Halifa,Sagiv Bar,Alexander Apartsin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:Enhancing reader engagement while preserving informational fidelity is a central challenge in controllable text generation for news media. Optimizing news headlines for reader engagement is often conflated with clickbait, resulting in exaggerated or misleading phrasing that undermines editorial trust. We frame clickbait not as a separate stylistic category, but as an extreme outcome of disproportionate amplification of otherwise legitimate engagement cues. Based on this view, we formulate headline rewriting as a controllable generation problem, where specific engagement-oriented linguistic attributes are selectively strengthened under explicit constraints on semantic faithfulness and proportional emphasis. We present a guided headline rewriting framework built on a large language model (LLM) that uses the Future Discriminators for Generation (FUDGE) paradigm for inference-time control. The LLM is steered by two auxiliary guide models: (1) a clickbait scoring model that provides negative guidance to suppress excessive stylistic amplification, and (2) an engagement-attribute model that provides positive guidance aligned with target clickability objectives. Both guides are trained on neutral headlines drawn from a curated real-world news corpus. At the same time, clickbait variants are generated synthetically by rewriting these original headlines using an LLM under controlled activation of predefined engagement tactics. By adjusting guidance weights at inference time, the system generates headlines along a continuum from neutral paraphrases to more engaging yet editorially acceptable formulations. The proposed framework provides a principled approach for studying the trade-off between attractiveness, semantic preservation, and clickbait avoidance, and supports responsible LLM-based headline optimization in journalistic settings.

[NLP-60] owards Automated Community Notes Generation with Large Vision Language Models for Combating Contextual Deception

【速读】: 该论文旨在解决图像型情境欺骗(image-based contextual deception)中自动生成社区注释(Community Notes)的问题,即当真实图像被错误上下文(如时间、实体或事件)误导时,如何自动生成简洁且基于事实的纠正性注释以帮助用户恢复正确信息。现有方法多集中于二元真假判断(deception detection),而忽略了生成高质量注释所需的语义准确性与上下文关联性,且缺乏支持此类任务的数据集和评估指标。解决方案的关键在于提出一个名为ACCNote的检索增强型多智能体协作框架,其基于大视觉语言模型(large vision-language models),通过多代理协同机制实现对动态情境欺骗的精准响应,并引入Context Helpfulness Score(CHS)这一新评估指标,该指标与用户研究结果高度一致,而非依赖传统词汇重叠度量。实验表明,ACCNote在XCheck数据集上显著优于基线方法及商用工具GPT5-mini,在检测与注释生成两方面均展现出更强性能。

链接: https://arxiv.org/abs/2603.22453
作者: Jin Ma,Jingwen Yan,Mohammed Aldeen,Ethan Anderson,Taran Kavuru,Jinkyung Katie Park,Feng Luo,Long Cheng
机构: Clemson University (克莱姆森大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, its reliance on human contributors limits both the timeliness and scalability. In this work, we study the automated Community Notes generation method for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), Community Notes-style systems need to generate concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to three reasons: (i) datasets that support the research are scarce; (ii) methods must handle the dynamic nature of contextual deception; (iii) evaluation is difficult because standard metrics do not capture whether notes actually improve user understanding. To address these gaps, we curate a real-world dataset, XCheck, comprising X posts with associated Community Notes and external contexts. We further propose the Automated Context-Corrective Note generation method, named ACCNote, which is a retrieval-augmented, multi-agent collaboration framework built on large vision-language models. Finally, we introduce a new evaluation metric, Context Helpfulness Score (CHS), that aligns with user study outcomes rather than relying on lexical overlap. Experiments on our XCheck dataset show that the proposed ACCNote improves both deception detection and note generation performance over baselines, and exceeds a commercial tool GPT5-mini. Together, our dataset, method, and metric advance practical automated generation of context-corrective notes toward more responsible online social networks.

[NLP-61] Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLM s ICLR2026

【速读】: 该论文试图解决的问题是:强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在提升大语言模型(Large Language Models, LLMs)推理能力方面的机制尚不清晰,尤其是其在token级别上的分布变化如何影响整体序列级推理性能。解决方案的关键在于通过系统性的实证研究,揭示RLVR诱导的token级分布偏移的本质特征及其功能重要性:首先发现RL微调仅引发极稀疏且有针对性的token分布变化,其次通过交叉采样干预实验表明,仅插入少量RL采样的token即可逐步恢复RL性能优势,而注入相同数量的基线模型token则会使RL生成结果性能退化至基线水平,从而识别出一小部分对性能提升至关重要的token级决策;此外,还提出基于偏差加权的优势信号作为诊断性干预手段,进一步优化了RLVR的效果。整体而言,该研究为理解RLVR提供了一个细粒度的token级视角,将其视为一种靶向精炼过程。

链接: https://arxiv.org/abs/2603.22446
作者: Haoming Meng,Kexin Huang,Shaohang Wei,Chiyu Ma,Shuo Yang,Xue Wang,Guoyin Wang,Bolin Ding,Jingren Zhou
机构: Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a conference paper at the International Conference on Learning Representations (ICLR 2026)

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR’s distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR’s performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.

[NLP-62] From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理系统中工作流(workflow)设计与优化的不一致性和缺乏统一框架的问题。当前LLM代理通过构建包含模型调用、信息检索、工具使用、代码执行等环节的可执行工作流来完成复杂任务,但现有方法在工作流结构确定时机、优化目标和评估信号上存在多样性且缺乏系统性分类。论文提出将此类工作流建模为代理计算图(agentic computation graphs, ACGs),并基于三个维度对现有方法进行系统梳理:1)工作流结构是在部署前静态固定还是运行时动态生成;2)优化对象是工作流模板、具体实例或执行轨迹;3)优化依据是任务指标、验证器信号、偏好数据还是执行轨迹反馈。其关键贡献在于建立了一个统一的分类框架和结构感知的评估视角,从而促进方法比较、复现及未来在LLM代理工作流优化领域的标准化研究。

链接: https://arxiv.org/abs/2603.22386
作者: Ling Yue,Kushal Raj Bhandari,Ching-Yun Ko,Dhaval Patel,Shuxin Lin,Nianjun Zhou,Jianxi Gao,Pin-Yu Chen,Shaowu Pan
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); IBM Research (IBM研究实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace-derived feedback). We also distinguish reusable workflow templates, run-specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure-aware evaluation perspective that complements downstream task metrics with graph-level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of existing body of literature, and a more reproducible evaluation standard for future work in workflow optimizations for LLM agents.

[NLP-63] Instruction-Tuned but Not More Verifiable Instruction-Following: A Cross-Task Diagnosis for LoRA Adapters

【速读】: 该论文旨在解决当前大模型适配器(Adapter)部署中存在的一种关键问题:即基于名义训练目标(如“指令微调”)所预期的能力提升与实际跨任务性能表现之间存在不一致性,这种现象被称为能力漂移(capability drift)。其核心问题是,仅依赖名义标签无法可靠预测适配器在真实场景下的能力变化,尤其在严格可验证的指令遵循任务上表现尤为明显。解决方案的关键在于强调对同一LoRA适配器进行多任务交叉评估的重要性,而非单纯依赖训练时的标签信息;研究通过系统性实验发现,即使在控制变量条件下,某些指令微调适配器反而会显著提升非目标数值类任务表现(如NM基准从0.133提升至0.632),却未改善可自动验证的指令遵循能力(IFEval得分下降),从而揭示了配置敏感性和潜在误导性。因此,论文建议在部署前必须开展常规化的跨任务性能测试,避免将名义标签视为可靠的性能代理。

链接: https://arxiv.org/abs/2603.22379
作者: Junyi Zou
机构: Zjydiary Group (Zjydiary Group)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Adapters are often selected and deployed based on nominal labels (e.g., instruction-tuned), which implicitly suggest what capability improves after adaptation. We test whether nominal training objectives reliably align with realized cross-task capability gains by evaluating the same LoRA adapter across tasks. Our strongest evidence is tied to strict, automatically verifiable instruction following as measured by IFEval: across multiple seeds, base models, and LoRA settings, nominal labels recurrently but not universally fail to predict improvements on this verifiable target, with clear configuration sensitivity including a near-zero or negative case. As an illustrative strongest-case example in a controlled instruction-versus-numeric setting, an instruction-tuned adapter substantially improves off-target NM-based numeric benchmark performance from 0.133 to 0.632 while not improving verifiable instruction following on IFEval (ILA: 0.313 to 0.271; PLA: 0.250 to 0.143; values rounded to three decimals). We refer to this nominal-versus-realized mismatch pattern as capability drift as a descriptive label. The mismatch is visible in the raw cross-task performance matrix; we use a drift score only as a compact summary in the same units as the underlying metrics, not as a new formal metric contribution. Evidence from broader instruction-following benchmarks is benchmark-dependent and mixed, reflecting heterogeneity in how instruction following is operationalized; we therefore do not treat cross-benchmark agreement as a premise. Overall, the practical takeaway is to perform routine cross-task evaluation before deployment and to avoid treating nominal labels as reliable capability proxies.

[NLP-64] -MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

【速读】: 该论文旨在解决现有红队测试方法在评估大型语言模型(Large Language Models, LLMs)时忽视多步工具执行过程中产生的特定代理漏洞的问题,尤其是在快速发展的模型上下文协议(Model Context Protocol, MCP)生态系统中。传统方法仅关注诱导有害文本输出,而未能捕捉到通过实际工具交互实现的复杂攻击路径。其解决方案的关键在于提出一种轨迹感知的进化搜索方法(Trajectory-aware Evolutionary Search, T-MAP),该方法利用模型在执行过程中的轨迹信息来引导对抗性提示的生成,从而自动发现能够绕过安全防护机制并可靠地通过工具调用达成有害目标的攻击策略。实证结果表明,T-MAP在多种MCP环境中显著优于基线方法,在攻击实现率(Attack Realization Rate, ARR)上表现优异,并对包括GPT-5.2、Gemini-3-Pro、Qwen3.5和GLM-5在内的前沿模型均有效,揭示了自主LLM代理中此前未被充分探索的安全漏洞。

链接: https://arxiv.org/abs/2603.22341
作者: Hyomin Lee,Sangwoo Park,Yumin Choi,Sohyun An,Seanie Lee,Sung Ju Hwang
机构: KAIST; University of California, Los Angeles; DeepAuto.ai
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.

[NLP-65] Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits

【速读】: 该论文旨在解决Chinchilla Approach 2在拟合神经网络缩放定律(neural scaling laws)时因抛物线近似引入的系统性偏差问题,这种偏差即使在无噪声的合成数据上也会导致计算最优分配估计失准。针对此问题,作者提出通过采用Chinchilla Approach 3来消除偏差,其关键在于利用目标函数的部分线性结构,结合变量投影(Variable Projection)方法,将原本高维、易受局部极小值干扰的优化问题转化为一个二维优化过程,从而实现对全部五个损失面参数的无偏估计;该方法具备良好的条件数、解析可微性,并支持密集或穷举网格搜索,显著提升了数值稳定性与可实施性,克服了传统Approach 3被误认为数据效率低、不稳定和难实现的刻板印象。

链接: https://arxiv.org/abs/2603.22339
作者: Eric Czech,Zhiwei Xu,Yael Elmatad,Yixin Wang,William Held
机构: Open Athena AI Foundation(开放奥丁人工智能基金会); University of Michigan(密歇根大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the 3.8\times10^25 FLOP training budget and \ 1.4M (90% CI: \ 412K-\ 2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry ( \alpha \neq \beta ). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two-dimensional optimization that is well-conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations.

[NLP-66] From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLM s

【速读】: 该论文旨在解决如何利用多模态大语言模型(Multimodal Large Language Models, MLMs)在技术任务中提供实时辅助的问题,特别是在家具组装等程序性任务中,评估MLMs是否能够理解步骤序列、追踪装配进度并准确引用说明书页面。其解决方案的关键在于构建了一个名为“Manual to Action Dataset (M2AD)”的标注数据集,该数据集包含逐步标签和手动参考信息,用于系统性评估当前开源MLMs在三方面的能力:(1) 通过提升推理能力减少对精细标注的依赖,实现更高效的成本优化标注;(2) 跟踪装配步骤进展;(3) 正确关联指令手册页面。实验结果表明,尽管部分模型具备一定的过程理解能力,但其性能受限于架构与硬件瓶颈,凸显了未来需发展支持多图像输入及文本-图像交错推理的MLMs以实现更可靠的实时辅助。

链接: https://arxiv.org/abs/2603.22321
作者: Federico Toschi,Nicolò Brunello,Andrea Sassella,Vincenzo Scotti,Mark James Carman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real world tasks, pushing research outside the text boundaries towards multi modal contexts and leading to Multimodal Large Language Models (MLMs). Given the current adoption of LLM based assistants in solving technical or domain specific problems, the natural continuation of this trend is to extend the input domains of these assistants exploiting MLMs. Ideally, these MLMs should be used as real time assistants in procedural tasks, hopefully integrating a view of the environment where the user being assisted is, or even better sharing the same point of view via Virtual Reality (VR) or Augmented Reality (AR) supports, to reason over the same scenario the user is experiencing. With this work, we aim at evaluating the quality of currently openly available MLMs to provide this kind of assistance on technical tasks. To this end, we annotated a data set of furniture assembly with step by step labels and manual references: the Manual to Action Dataset (M2AD). We used this dataset to assess (1) to which extent the reasoning abilities of MLMs can be used to reduce the need for detailed labelling, allowing for more efficient, cost effective annotation practices, (2) whether MLMs are able to track the progression of assembly steps (3) and whether MLMs can refer correctly to the instruction manual pages. Our results showed that while some models understand procedural sequences, their performance is limited by architectural and hardware constraints, highlighting the need for multi image and interleaved text image reasoning.

[NLP-67] he Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis

【速读】: 该论文旨在解决“思维是否必须依赖类语言格式”这一核心问题,即验证语言的思维假说(Language of Thought, LoT)在人工智能系统中的适用性。其解决方案的关键在于提出并实证检验“AI私有语言”思想实验:通过多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)让两个智能体自发形成一种高效但人类不可理解的通信协议,并在合作导航任务中对比其与预设人类符号协议的性能差异。结果发现,采用自动生成协议的智能体效率比使用人类可理解符号协议高出50.5%,这证实了效率衰减现象(Efficiency Attenuation Phenomenon, EAP),表明最优协同认知并非由符号结构中介,而是自然耦合于非符号计算过程,从而挑战了LoT假设,并支持认知架构的多元主义立场。

链接: https://arxiv.org/abs/2603.22312
作者: Di Zhang
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

Abstract:This paper computationally investigates whether thought requires a language-like format, as posited by the Language of Thought (LoT) hypothesis. We introduce the ``AI Private Language’’ thought experiment: if two artificial agents develop an efficient, inscrutable communication protocol via multi-agent reinforcement learning (MARL), and their performance declines when forced to use a human-comprehensible language, this Efficiency Attenuation Phenomenon (EAP) challenges the LoT. We formalize this in a cooperative navigation task under partial observability. Results show that agents with an emergent protocol achieve 50.5% higher efficiency than those using a pre-defined, human-like symbolic protocol, confirming the EAP. This suggests optimal collaborative cognition in these systems is not mediated by symbolic structures but is naturally coupled with sub-symbolic computations. The work bridges philosophy, cognitive science, and AI, arguing for pluralism in cognitive architectures and highlighting implications for AI ethics.

[NLP-68] Whether Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLM s

【速读】: 该论文试图解决的核心问题是:当前关于大语言模型中存在“情感电路”的研究普遍依赖显式的情感关键词作为刺激,这使得一个根本性疑问未被解答——这些所谓的情感处理机制究竟是识别真实的情感意义,还是仅仅对情感关键词(如“devastated”)做出反应?为验证这一点,作者引入了基于临床心理学的机制可解释性方法,使用临床情景描述(clinical vignettes)作为刺激材料,通过情境和行为线索引发情绪,同时移除所有情感关键词。解决方案的关键在于采用四种收敛性的机制可解释性方法(线性探测、因果激活修补、敲除实验与表征几何),首次在无关键词条件下验证了情感处理机制的存在,并揭示出两种可分离的机制:一是高精度(AUROC 1.000)的情感接收机制,其在早期层饱和且跨六种模型稳定复现;二是部分依赖关键词的情感分类机制,其性能随模型规模提升而改善。这一发现否定了“关键词识别假说”,确立了新的机制解离框架,并将临床刺激法确立为评估大语言模型情感处理能力的严谨标准,对AI安全与对齐研究具有直接意义。

链接: https://arxiv.org/abs/2603.22295
作者: Michael Keeman
机构: Keido Labs(Keido实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 38 pages, 11 figures, 16 tables. Code and data: this https URL

点击查看摘要

Abstract:Large language models appear to develop internal representations of emotion – “emotion circuits,” “emotion neurons,” and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word “devastated”? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology – clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods – linear probing, causal activation patching, knockout experiments, and representational geometry – and discover two dissociable emotion processing mechanisms. Affect reception – detecting emotionally significant content – operates with near-perfect accuracy (AUROC 1.000), consistent with early-layer saturation, and replicates across all six models. Emotion categorization – mapping affect to specific emotion labels – is partially keyword-dependent, dropping 1-7% without keywords and improving with scale. Causal activation patching confirms keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity. These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models – with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.

[NLP-69] IPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLM s

【速读】: 该论文旨在解决基于强化学习(Reinforcement Learning, RL)训练的搜索增强型大语言模型(Search-Augmented Large Language Models, LLMs)在开放域问答(Open-Domain Question Answering, QA)任务中因奖励稀疏和跨推理步骤与工具调用的信用分配困难而导致的优化不稳定问题。解决方案的关键在于提出一种称为“轮次级信息势奖励塑造”(Turn-Level Information Potential Reward Shaping, TIPS)的框架,该框架通过教师模型计算每个推理+工具调用片段对正确答案概率的提升程度,从而为每一轮交互提供密集且细粒度的奖励信号,并利用基于势函数的奖励塑造方法实现策略无关的引导,有效克服仅依赖最终结果优化的局限性。

链接: https://arxiv.org/abs/2603.22293
作者: Yutao Xie,Nathaniel Thomas,Nicklas Hansen,Yang Fu,Li Erran Li,Xiaolong Wang
机构: UC San Diego (加州大学圣地亚哥分校); AWS AI (亚马逊云科技人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL

点击查看摘要

Abstract:Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.

[NLP-70] Evaluating Large Language Models Responses to Sexual and Reproductive Health Queries in Nepali

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在性与生殖健康(Sexual and Reproductive Health, SRH)领域应用中的评估盲区问题,即现有评价方法主要关注准确性,缺乏对可用性(usability)和安全性(safety)的系统评估,尤其在低资源语言和文化敏感场景中表现不足。解决方案的关键在于提出一个名为LEAF(LLM Evaluation Framework)的多维评估框架,涵盖准确性、语言能力、可用性缺口(相关性、充分性和文化适宜性)以及安全缺口(安全性、敏感性和保密性)四个维度,并通过专家人工标注的方式对14,000条尼泊尔语SRH查询进行实证评估,揭示了仅有35.1%的响应符合“适当”标准,从而凸显出当前LLMs在实际敏感场景应用中的显著局限性,并为跨领域、跨语言的可用性与安全性优化提供了可扩展的评估路径。

链接: https://arxiv.org/abs/2603.22291
作者: Medha Sharma,Supriya Khadka,Udit Chandra Aryal,Bishnu Hari Bhatta,Bijayan Bhattarai,Santosh Dahal,Kamal Gautam,Pushpa Joshi,Saugat Kafle,Shristi Khadka,Shushila Khadka,Binod Lamichhane,Shilpa Lamichhane,Anusha Parajuli,Sabina Pokharel,Suvekshya Sitaula,Neha Verma,Bishesh Khanal
机构: Nepal Applied Mathematics and Informatics Institute for research (NAAMII); Partnership for Sustainable Development Nepal (PSD Nepal); Diyo.AI; Visible Impact (Visim)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) become integrated into daily life, they are increasingly used for personal queries, including Sexual and Reproductive Health (SRH), allowing users to chat anonymously without fear of judgment. However, current evaluation methods primarily focus on accuracy, often for objective queries in high-resource languages, and lack criteria to assess usability and safety, especially for low-resource languages and culturally sensitive domains like SRH. This paper introduces LLM Evaluation Framework (LEAF), that conducts assessments across multiple criteria: accuracy, language, usability gaps (including relevance, adequacy, and cultural appropriateness), and safety gaps (safety, sensitivity, and confidentiality). Using the LEAF framework, we assessed 14K SRH queries in Nepali from over 9K users. Responses were manually annotated by SRH experts according to the framework. Results revealed that only 35.1% of the responses were “proper”, meaning they were accurate, adequate and had no major usability or safety related gaps. Insights include differences in performance between ChatGPT versions, such as similar accuracy but varying usability and safety aspects. This evaluation highlights significant limitations of current LLMs and underscores the need for improvement. The LEAF Framework is adaptable across domains and languages, particularly where usability and safety are critical, offering a pathway to better address sensitive topics.

[NLP-71] MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing

【速读】: 该论文旨在解决知识追踪(Knowledge Tracing, KT)模型在准确性与可解释性之间的权衡问题,尤其是传统深度学习模型缺乏透明度、而基于大语言模型(Large Language Models, LLMs)的方法存在上下文限制、幻觉现象以及高昂的微调成本等问题。其解决方案的关键在于提出一种无需训练(training-free)的框架MERIT,通过冻结LLM的参数,结合结构化的教学记忆机制:首先将原始交互日志转化为可解释的记忆库,利用语义去噪对学生产生潜在认知模式分类,并构建范例库以离线分析典型错误模式生成显式的Chain-of-Thought(CoT)推理路径;推理阶段则采用分层路由机制检索相关上下文,并引入逻辑增强模块施加语义约束以校准预测结果。该方法在不进行梯度更新的前提下实现了SOTA性能,显著降低了计算开销并支持动态知识更新,从而提升了教育诊断的可访问性和透明度。

链接: https://arxiv.org/abs/2603.22289
作者: Runze Li,Kedi Chen,Guwei Feng,Mo Yu,Jun Wang,Wei Zhang
机构: East China Normal University (华东师范大学); Shanghai Innovation Institute (上海创新研究院); WeChat AI (微信AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) models students’ evolving knowledge states to predict future performance, serving as a foundation for personalized education. While traditional deep learning models achieve high accuracy, they often lack interpretability. Large Language Models (LLMs) offer strong reasoning capabilities but struggle with limited context windows and hallucinations. Furthermore, existing LLM-based methods typically require expensive fine-tuning, limiting scalability and adaptability to new data. We propose MERIT (Memory-Enhanced Retrieval for Interpretable Knowledge Tracing), a training-free framework combining frozen LLM reasoning with structured pedagogical memory. Rather than updating parameters, MERIT transforms raw interaction logs into an interpretable memory bank. The framework uses semantic denoising to categorize students into latent cognitive schemas and constructs a paradigm bank where representative error patterns are analyzed offline to generate explicit Chain-of-Thought (CoT) rationales. During inference, a hierarchical routing mechanism retrieves relevant contexts, while a logic-augmented module applies semantic constraints to calibrate predictions. By grounding the LLM in interpretable memory, MERIT achieves state-of-the-art performance on real-world datasets without gradient updates. This approach reduces computational costs and supports dynamic knowledge updates, improving the accessibility and transparency of educational diagnosis.

[NLP-72] Evaluating Prompting Strategies for Chart Question Answering with Large Language Models

【速读】: 该论文旨在解决提示策略(prompting strategies)在基于图表的问答(chart-based QA)任务中对大语言模型(LLM)推理性能影响尚不明确的问题。其解决方案的关键在于构建一个仅以结构化图表数据为输入、将提示结构作为唯一变量的系统性评估框架,对比零样本(Zero-Shot)、少样本(Few-Shot)、零样本思维链(Zero-Shot Chain-of-Thought)和少样本思维链(Few-Shot Chain-of-Thought)四种 prompting 方法在 GPT-3.5、GPT-4 和 GPT-4o 上的表现,通过准确率(Accuracy)和精确匹配(Exact Match)两个指标量化效果。结果表明,少样本思维链提示在复杂推理任务中表现最优(最高达 78.2% 准确率),为结构化数据推理场景下提示策略的选择提供了实证依据。

链接: https://arxiv.org/abs/2603.22288
作者: Ruthuparna Naikar,Ying Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prompting strategies affect LLM reasoning performance, but their role in chart-based QA remains underexplored. We present a systematic evaluation of four widely used prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on the ChartQA dataset. Our framework operates exclusively on structured chart data, isolating prompt structure as the only experimental variable, and evaluates performance using two metrics: Accuracy and Exact Match. Results from 1,200 diverse ChartQA samples show that Few-Shot Chain-of-Thought prompting consistently yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence. Zero-Shot performs well only with high-capacity models on simpler tasks. These findings provide actionable guidance for selecting prompting strategies in structured data reasoning tasks, with implications for both efficiency and accuracy in real-world applications.

[NLP-73] Founder effects shape the evolutionary dynamics of multimodality in open LLM families

【速读】: 该论文旨在解决开放大型语言模型(Large Language Model, LLM)家族中多模态能力(multimodality)的演化路径与传播机制不清晰的问题。其关键解决方案在于利用 Hugging Face 的 ModelBiome AI 生态系统数据集(包含 1.8×10⁶ 条模型元数据及记录的 lineage 关系),通过量化时间维度和父-子关系上的多模态分布,揭示了多模态能力在主流开源 LLM 家族中并非线性演进,而是以稀有创始事件触发、随后在视觉-语言模型(Vision-Language Model, VLM)内部 lineage 中快速扩张的“间断式采纳”(punctuated adoption)动态。具体而言,研究发现:多数 VLM 模型为无父节点的新根节点(~60%),且绝大多数 VLM 子代来源于已有 VLM 父代(94.5%),而非文本生成类模型(仅 0.218% 的细调边产生 VLM 后代),表明多模态能力的扩展主要依赖于现有 VLM lineages 的内部演化,而非跨模态迁移。

链接: https://arxiv.org/abs/2603.22287
作者: Manuel Cebrian
机构: Center for Automation and Robotics, Spanish National Research Council (西班牙国家研究委员会自动化与机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Large language model (LLM) families are improving rapidly, yet it remains unclear how quickly multimodal capabilities emerge and propagate within open families. Using the ModelBiome AI Ecosystem dataset of Hugging Face model metadata and recorded lineage fields (1.8x10^6 model entries), we quantify multimodality over time and along recorded parent-to-child relations. Cross-modal tasks are widespread in the broader ecosystem well before they become common within major open LLM families: within these families, multimodality remains rare through 2023 and most of 2024, then increases sharply in 2024-2025 and is dominated by image-text vision-language tasks. Across major families, the first vision-language model (VLM) variants typically appear months after the first text-generation releases, with lags ranging from ~1 month (Gemma) to more than a year for several families and ~26 months for GLM. Lineage-conditioned transition rates show weak cross-type transfer: among fine-tuning edges from text-generation parents, only 0.218% yield VLM descendants. Instead, multimodality expands primarily within existing VLM lineages: 94.5% of VLM-child fine-tuning edges originate from VLM parents, versus 4.7% from text-generation parents. At the model level, most VLM releases appear as new roots without recorded parents (~60%), while the remainder are predominantly VLM-derived; founder concentration analyses indicate rapid within-lineage amplification followed by diversification. Together, these results show that multimodality enters open LLM families through rare founder events and then expands rapidly within their descendant lineages, producing punctuated adoption dynamics that likely induce distinct, transfer-limited scaling behavior for multimodal capabilities.

[NLP-74] Q: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

【速读】: 该论文旨在解决大基础模型在推理阶段因激活感知压缩(activation-aware compression)方法依赖校准数据而导致的领域偏移(domain shift)问题,尤其是在未见过的下游任务中性能下降的问题。解决方案的关键在于提出一种测试时量化(test-time quantization, TTQ)框架,通过高效的在线校准机制,在推理过程中实时进行激活感知量化,从而适应任意输入提示(prompt),无需重新训练即可实现模型压缩与推理加速的同步优化。

链接: https://arxiv.org/abs/2603.19296
作者: Toshiaki Koike-Akino,Jing Liu,Ye Wang
机构: Mitsubishi Electric Research Laboratories (MERL)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: 25 pages

点击查看摘要

Abstract:To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these methods highly rely on calibration data, domain shift issues may arise for unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With an efficient online calibration, instant activation-aware quantization can adapt every prompt regardless of the downstream tasks, yet achieving inference speedup. Several experiments demonstrate that TTQ can improve the quantization performance over state-of-the-art baselines.

[NLP-75] Demystifying Low-Rank Knowledge Distillation in Large Language Models : Convergence Generalization and Information-Theoretic Guarantees

【速读】: 该论文旨在解决低秩知识蒸馏(Low-Rank Knowledge Distillation)在语言模型压缩中的理论基础薄弱问题,特别是其优化动态、泛化能力与模型压缩率之间的权衡机制不明确。解决方案的关键在于构建一个严格的理论框架:首先证明在弱假设下,低秩投影可保持优化动力学,收敛速率可达 $ O(1/\sqrt{T}) $;其次推导出泛化误差上界为 $ O(r(m+n)/\sqrt{n}) $,揭示了压缩率(由秩 $ r $ 控制)与泛化性能的定量关系;最后通过信息论分析激活克隆机制,阐明其最大化教师与学生中间表示间互信息的作用。由此提出最优秩选择策略 $ r^* = O(\sqrt{n}) $,并经实验验证其在标准语言建模任务中与理论预测高度一致。

链接: https://arxiv.org/abs/2603.22355
作者: Alberlucia Rafael Soarez,Daniel Kim,Mariana Costa,Alejandro Torre
机构: University of Brasilia (巴西联邦大学)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge distillation has emerged as a powerful technique for compressing large language models (LLMs) into efficient, deployable architectures while preserving their advanced capabilities. Recent advances in low-rank knowledge distillation, particularly methods like Low-Rank Clone (LRC), have demonstrated remarkable empirical success, achieving comparable performance to full-parameter distillation with significantly reduced training data and computational overhead. However, the theoretical foundations underlying these methods remain poorly understood. In this paper, we establish a rigorous theoretical framework for low-rank knowledge distillation in language models. We prove that under mild assumptions, low-rank projection preserves the optimization dynamics, yielding explicit convergence rates of O(1/\sqrtT) . We derive generalization bounds that characterize the fundamental trade-off between model compression and generalization capability, showing that the generalization error scales with the rank parameter as O(r(m+n)/\sqrtn) . Furthermore, we provide an information-theoretic analysis of the activation cloning mechanism, revealing its role in maximizing the mutual information between the teacher’s and student’s intermediate representations. Our theoretical results offer principled guidelines for rank selection, mathematically suggesting an optimal rank r^* = O(\sqrtn) where n is the sample size. Experimental validation on standard language modeling benchmarks confirms our theoretical predictions, demonstrating that the empirical convergence, rank scaling, and generalization behaviors align closely with our bounds.

信息检索

[IR-0] Reasoning over Semantic IDs Enhances Generative Recommendation

【速读】:该论文旨在解决生成式推荐中基于语义标识符(Semantic IDs, SIDs)的推理能力不足问题,即如何在不依赖大量推荐任务专属推理标注的情况下,有效激发大语言模型(LLM)对SIDs的推理能力。其关键解决方案在于提出一个两阶段框架SIDReasoner:第一阶段通过多任务训练增强SID与语言之间的对齐关系,利用更强教师模型合成的丰富SID中心语料库,使itemic tokens在多样化的语义和行为上下文中获得语义 grounding;第二阶段采用目标驱动的强化优化策略,在无需显式推理标注的前提下引导模型走向有效的推理路径,从而提升推荐推理质量。

链接: https://arxiv.org/abs/2603.23183
作者: Yingzhi He,Yan Sun,Junfei Tan,Yuxin Chen,Xiaoyu Kong,Chunxu Shen,Xiang Wang,An Zhang,Tat-Seng Chua
机构: National University of Singapore(新加坡国立大学); University of Science and Technology of China(中国科学技术大学); Tencent Inc.(腾讯公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in generative recommendation have leveraged pretrained LLMs by formulating sequential recommendation as autoregressive generation over a unified token space comprising language tokens and itemic identifiers, where each item is represented by a compact sequence of discrete tokens, namely Semantic IDs (SIDs). This SID-based formulation enables efficient decoding over large-scale item corpora and provides a natural interface for LLM-based recommenders to leverage rich world knowledge. Meanwhile, breakthroughs in LLM reasoning motivate reasoning-enhanced recommendation, yet effective reasoning over SIDs remains underexplored and challenging. Itemic tokens are not natively meaningful to LLMs; moreover, recommendation-oriented SID reasoning is hard to evaluate, making high-quality supervision scarce. To address these challenges, we propose SIDReasoner, a two-stage framework that elicits reasoning over SIDs by strengthening SID–language alignment to unlock transferable LLM reasoning, rather than relying on large amounts of recommendation-specific reasoning traces. Concretely, SIDReasoner first enhances SID-language alignment via multi-task training on an enriched SID-centered corpus synthesized by a stronger teacher model, grounding itemic tokens in diverse semantic and behavioral contexts. Building on this enhanced alignment, SIDReasoner further improves recommendation reasoning through outcome-driven reinforced optimization, which guides the model toward effective reasoning trajectories without requiring explicit reasoning annotations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our reasoning-augmented SID-based generative recommendation. Beyond accuracy, the results highlight the broader potential of large reasoning models for generative recommendation, including improved interpretability and cross-domain generalization. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.23183 [cs.IR] (or arXiv:2603.23183v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2603.23183 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-1] From Questions to Trust Reports: A LLM -IR Framework for the TREC 2025 DRAG UN Track

【速读】:该论文旨在解决在线新闻可信度评估中缺乏有效支持工具的问题(即如何帮助用户判断网络新闻的真实性与可靠性)。其核心解决方案在于构建一个名为UR_Trecking的系统,该系统针对任务1(关键问题生成)和任务2(检索增强型可信度报告)分别设计:首先通过大语言模型(LLM)生成问题并结合语义过滤与聚类策略保证多样性;其次利用链式思维(Chain-of-Thought)查询扩展技术增强检索相关性,并从MS MARCO V2.1分段语料库中召回证据文档;随后采用monoT5模型重排序及LLM相关性判别器进行过滤,并引入领域级可信度数据集提升准确性;最终在任务2中由LLM将筛选出的证据合成带引用的简洁可信度报告。实验证明,链式思维查询扩展和重排序显著提升了相关性和领域可信度表现,但问题生成质量仍有提升空间。

链接: https://arxiv.org/abs/2603.23125
作者: Ignacy Alwasiak,Kene Nnolim,Jaclyn Thi,Samy Ateia,Markus Bink,Gregor Donabauer,David Elsweiler,Udo Kruschwitz
机构: Jagiellonian University (雅盖隆大学); Massachusetts Institute of Technology (麻省理工学院); University of Regensburg (雷根斯堡大学)
类目: Information Retrieval (cs.IR)
备注: TREC 2025 Proceedings

点击查看摘要

Abstract:The DRAGUN Track at TREC 2025 targets the growing need for effective support tools that help users evaluate the trustworthiness of online news. We describe the UR_Trecking system submitted for both Task 1 (critical question generation) and Task 2 (retrieval-augmented trustworthiness reporting). Our approach combines LLM-based question generation with semantic filtering, diversity enforcement using clustering, and several query expansion strategies (including reasoning-based Chain-of-Thought expansion) to retrieve relevant evidence from the MS MARCO V2.1 segmented corpus. Retrieved documents are re-ranked using a monoT5 model and filtered using an LLM relevance judge together with a domain-level trustworthiness dataset. For Task 2, selected evidence is synthesized by an LLM into concise trustworthiness reports with citations. Results from the official evaluation indicate that Chain-of-Thought query expansion and re-ranking substantially improve both relevance and domain trust compared to baseline retrieval, while question-generation performance shows moderate quality with room for improvement. We conclude by outlining key challenges encountered and suggesting directions for enhancing robustness and trustworthiness assessment in future iterations of the system.

[IR-2] GateSID: Adaptive Gating for Semantic-Collaborative Alignment in Cold-Start Recommendation

【速读】:该论文旨在解决冷启动场景下新物品因协同信号稀缺而加剧的“马太效应”问题,该效应导致推荐平台多样性下降,成为实际推荐系统中的长期挑战。现有方法通常通过引入语义信息增强协同信号,但面临协同-语义权衡困境:协同信号对热门物品有效但对冷启动物品不可靠,而过度依赖语义信息可能掩盖有意义的协同差异。解决方案的关键在于提出GateSID框架,其核心创新是引入自适应门控网络(adaptive gating network),根据物品成熟度动态平衡语义与协同信号。具体包括两个关键组件:(1) Gating-Fused Shared Attention,融合模态内注意力分布与基于嵌入和统计特征计算的物品级门控权重;(2) Gate-Regulated Contrastive Alignment,自适应校准跨模态对齐强度,对冷启动物品强化语义-行为一致性约束,对热门物品则放宽限制以保留可靠的协同信号。

链接: https://arxiv.org/abs/2603.22916
作者: Hai Zhu,Yantao Yu,Lei Shen,Bing Wang,Xiaoyi Zeng
机构: Alibaba International Digital Commercial Group (阿里巴巴国际数字商业集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In cold-start scenarios, the scarcity of collaborative signals for new items exacerbates the Matthew effect, which undermines platform diversity and remains a persistent challenge in real-world recommender systems. Existing methods typically enhance collaborative signals with semantic information, but they often suffer from a collaborative-semantic tradeoff: collaborative signals are effective for popular items but unreliable for cold-start items, whereas over-reliance on semantic information may obscure meaningful collaborative differences. To address this issue, we propose GateSID, a framework that uses an adaptive gating network to dynamically balance semantic and collaborative signals according to item maturity. Specifically, we first discretize multimodal features into hierarchical Semantic IDs using Residual Quantized VAE. Building on this representation, we design two key components: (1) Gating-Fused Shared Attention, which fuses intra-modal attention distributions with item-level gating weights derived from embeddings and statistical features; and (2) Gate-Regulated Contrastive Alignment, which adaptively calibrates cross-modal alignment, enforcing stronger semantic-behavior consistency for cold-start items while relaxing the constraint for popular items to preserve reliable collaborative signals. Extensive offline experiments on large-scale industrial datasets demonstrate that GateSID consistently outperforms strong baselines. Online A/B tests further confirm its practical value, yielding +2.6% GMV, +1.1% CTR, and +1.6% orders with less than 5 ms additional latency.

[IR-3] KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工业级个性化搜索任务中因“知识-行为鸿沟”(Knowledge–Action Gap)导致的语义坍缩(Semantic Collapse)问题,即直接微调LLMs以优化特定个性化行为目标(如下一个物品预测)时,会破坏其预训练阶段获得的深层语义知识,从而削弱模型的泛化能力。解决方案的关键在于提出KARMA(Knowledge–Action Regularized Multimodal Alignment)框架,通过将语义重建作为仅训练阶段的正则项,同时优化检索用的兴趣嵌入(Action)与保持语义可解码性(Knowledge):一是基于历史条件的语义生成,锚定于LLM原生的下一个词分布;二是嵌入条件的语义重构,约束兴趣嵌入保持语义可恢复性。该方法有效缓解了注意力“汇聚点”现象,并显著提升个性化搜索系统的点击率(CTR AUC)、召回和排序阶段的命中率(HR),且在线部署时推理开销低。

链接: https://arxiv.org/abs/2603.22779
作者: Zhi Sun,Wenming Zhang,Yi Wei,Liren Yu,Zhixuan Zhang,Dan Ou,Haihong Tang
机构: Taobao \ Tmall Group of Alibaba(淘宝\天猫集团阿里巴巴); Alibaba(阿里巴巴)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are equipped with profound semantic knowledge, making them a natural choice for injecting semantic generalization into personalized search systems. However, in practice we find that directly fine-tuning LLMs on industrial personalized tasks (e.g. next item prediction) often yields suboptimal results. We attribute this bottleneck to a critical Knowledge–Action Gap: the inherent conflict between preserving pre-trained semantic knowledge and aligning with specific personalized actions by discriminative objectives. Empirically, action-only training objectives induce Semantic Collapse, such as attention ``sinks’'. This degradation severely cripples the LLM’s generalization, failing to bring improvements to personalized search systems. We propose KARMA (Knowledge–Action Regularized Multimodal Alignment), a unified framework that treats semantic reconstruction as a train-only regularizer. KARMA optimizes a next-interest embedding for retrieval (Action) while enforcing semantic decodability (Knowledge) through two complementary objectives: (i) history-conditioned semantic generation, which anchors optimization to the LLM’s native next-token distribution, and (ii) embedding-conditioned semantic reconstruction, which constrains the interest embedding to remain semantically recoverable. On Taobao search system, KARMA mitigates semantic collapse (attention-sink analysis) and improves both action metrics and semantic fidelity. In ablations, semantic decodability yields up to +22.5 HR@200. With KARMA, we achieve +0.25 CTR AUC in ranking, +1.86 HR in pre-ranking and +2.51 HR in recalling. Deployed online with low inference overhead at ranking stage, KARMA drives +0.5% increase in Item Click. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22779 [cs.IR] (or arXiv:2603.22779v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2603.22779 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-4] DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leverag ing LLM -Persona

【速读】:该论文旨在解决低资源领域中数据稀缺问题,尤其针对法律信息检索(Legal Information Retrieval, LIR)场景下现有数据增强方法因过度追求合成数据数量而忽视质量与领域适配性的缺陷。其解决方案的关键在于提出一种基于专业角色(persona-based)的数据增强框架DALDALL,通过模拟律师、检察官和法官等法律领域专业人员的视角生成合成查询,从而显著提升语料在词汇和语义层面的多样性,同时保持与原始查询的语义一致性。实验表明,基于此策略生成的训练数据可有效提升密集检索模型的召回性能,验证了该方法在高价值但数据稀疏的专业领域中的有效性。

链接: https://arxiv.org/abs/2603.22765
作者: Janghyeok Choi,Jaewon Lee,Sungzoon Cho
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas–such as attorneys, prosecutors, and judges–to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation achieves improvement in lexical diversity as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall performance compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.

[IR-5] Graph-Aware Late Chunking for Retrieval-Augmented Generation in Biomedical Literature

【速读】:该论文旨在解决当前生物医学文献检索增强生成(Retrieval-Augmented Generation, RAG)系统在评估和设计上的局限性问题,即现有方法主要依赖于排序指标(如均倒数排名 MRR)来衡量单个最相关片段的检索精度,而忽视了检索广度——即从文档不同结构化章节中提取证据的能力。这一缺陷导致系统倾向于仅从单一章节检索信息,难以支持跨段落的知识整合与生成。解决方案的关键在于提出 GraLC-RAG 框架,其核心创新包括:基于图结构感知的晚期分块策略(late chunking)、UMLS 知识图谱注入以及图引导的混合检索机制,从而实现结构感知的分块边界检测与多章节证据覆盖,显著提升检索多样性与生成质量的一致性。

链接: https://arxiv.org/abs/2603.22633
作者: Pouria Mortezaagha,Arya Rahgozar
机构: Ottawa Hospital Research Institute (渥太华医院研究所); University of Ottawa (渥太华大学)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems for biomedical literature are typically evaluated using ranking metrics like Mean Reciprocal Rank (MRR), which measure how well the system identifies the single most relevant chunk. We argue that for full-text scientific documents, this paradigm is incomplete: it rewards retrieval precision while ignoring retrieval breadth – the ability to surface evidence from across a document’s structural sections. We propose GraLC-RAG, a framework that unifies late chunking with graph-aware structural intelligence, introducing structure-aware chunk boundary detection, UMLS knowledge graph infusion, and graph-guided hybrid retrieval. We evaluate six strategies on 2,359 IMRaD-filtered PubMed Central articles using 2,033 cross-section questions and two metric families: standard ranking metrics (MRR, Recall@k) and structural coverage metrics (SecCov@k, CS Recall). Our results expose a sharp divergence: content-similarity methods achieve the highest MRR (0.517) but always retrieve from a single section, while structure-aware methods retrieve from up to 15.6x more sections. Generation experiments show that KG-infused retrieval narrows the answer-quality gap to delta-F1 = 0.009 while maintaining 4.6x section diversity. These findings demonstrate that standard metrics systematically undervalue structural retrieval and that closing the multi-section synthesis gap is a key open problem for biomedical RAG.

[IR-6] Leverag ing Large Language Models to Extract and Translate Medical Information in Doctors Notes for Health Records and Diagnostic Billing Codes

【速读】:该论文旨在解决美国医师职业倦怠问题中由电子健康记录(Electronic Health Record, EHR)文档和复杂诊断编码带来的行政负担,同时保障患者隐私。其核心解决方案是构建一个基于本地设备、离线运行的自动医疗编码系统,利用开源权重的大语言模型(Large Language Models, LLMs)从医生笔记中提取临床信息并转换为ICD-10-CM诊断代码,无需依赖云端服务。关键创新在于设计了一个以隐私为导向的流水线架构,结合Ollama、LangChain及容器化环境,在消费级硬件上评估多个开源模型(如Llama 3.2、Mistral等),并通过新型合成医学笔记基准测试不同提示策略(零样本、少样本与检索增强生成)。研究发现,尽管严格JSON格式约束可实现近100%合规性,但小参数规模本地模型在精准生成特定诊断码方面仍存挑战;少样本提示反而因过拟合与幻觉导致性能下降,而检索增强生成受限于上下文窗口饱和效应,难以提升整体准确率。因此,当前最可行路径为“人在环路”辅助编码模式,而非完全自动化。该工作贡献了一个可复现的本地LLM架构与基准数据集,推动了隐私保护下的医疗信息抽取与编码技术发展。

链接: https://arxiv.org/abs/2603.22625
作者: Peter Hartnett,Chung-Chi Huang,Sarah Hartnett,David Hartnett
机构: Frostburg State University (弗罗斯特堡州立大学); FAU Charles E Schmidt College of Medicine (佛罗里达大西洋大学查尔斯 E·施密特医学院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 45 pages, 19 figures

点击查看摘要

Abstract:Physician burnout in the United States has reached critical levels, driven in part by the administrative burden of Electronic Health Record (EHR) documentation and complex diagnostic codes. To relieve this strain and maintain strict patient privacy, this thesis explores an on-device, offline automatic medical coding system. The work focuses on using open-weight Large Language Models (LLMs) to extract clinical information from physician notes and translate it into ICD-10-CM diagnostic codes without reliance on cloud-based services. A privacy-focused pipeline was developed using Ollama, LangChain, and containerized environments to evaluate multiple open-weight models, including Llama 3.2, Mistral, Phi, and DeepSeek, on consumer-grade hardware. Model performance was assessed for zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting strategies using a novel benchmark of synthetic medical notes. Results show that strict JSON schema enforcement achieved near 100% formatting compliance, but accurate generation of specific diagnostic codes remains challenging for smaller local models (7B-20B parameters). Contrary to common prompt-engineering guidance, few-shot prompting degraded performance through overfitting and hallucinations. While RAG enabled limited discovery of unseen codes, it frequently saturated context windows, reducing overall accuracy. The findings suggest that fully automated unsupervised coding with local open-source models is not yet reliable; instead, a human-in-the-loop assisted coding approach is currently the most practical path forward. This work contributes a reproducible local LLM architecture and benchmark dataset for privacy-preserving medical information extraction and coding. Comments: 45 pages, 19 figures Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL) ACMclasses: I.2.7; I.2.1; I.2.6 Cite as: arXiv:2603.22625 [cs.IR] (or arXiv:2603.22625v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2603.22625 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-7] flexvec: SQL Vector Retrieval with Programmatic Embedding Modulation

【速读】:该论文旨在解决当前检索系统中嵌入向量(embedding)处理方式过于静态的问题,即检索API通常仅提供最终的相似度排序结果,而无法对中间计算过程进行灵活干预。为应对这一挑战,作者提出了一种名为flexvec的检索内核,其关键创新在于将嵌入矩阵(embedding matrix)和得分数组(score array)暴露为可编程接口,允许在选择最相关结果前对这些数据执行算术运算,从而实现查询时的程序化嵌入调制(Programmatic Embedding Modulation, PEM)。通过构建一个基于SQL的查询物化器(query materializer),该方案进一步支持查询原语的组合式操作,在不依赖近似索引的情况下实现了高效推理,验证了其在大规模语料(最高达百万级文本块)上的实用性与性能优势。

链接: https://arxiv.org/abs/2603.22587
作者: Damian Delmas
机构: Independent Researcher, Vancouver, BC
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 15 pages, 1 figure, 7 tables, 4 appendices. Code available at this https URL

点击查看摘要

Abstract:As AI agents become the primary consumers of retrieval APIs, there is an opportunity to expose more of the retrieval pipeline to the caller. flexvec is a retrieval kernel that exposes the embedding matrix and score array as a programmable surface, allowing arithmetic operations on both before selection. We refer to composing operations on this surface at query time as Programmatic Embedding Modulation (PEM). This paper describes a set of such operations and integrates them into a SQL interface via a query materializer that facilitates composable query primitives. On a production corpus of 240,000 chunks, three composed modulations execute in 19 ms end-to-end on a desktop CPU without approximate indexing. At one million chunks, the same operations execute in 82 ms.

[IR-8] GraphRAG for Engineering Diagrams: ChatPID Enables LLM Interaction with PIDs

【速读】:该论文旨在解决如何高效、准确地通过自然语言与工程图纸(如管道和仪表图 PIDs)进行交互的问题。传统方法直接使用大语言模型(LLM)处理原始图像或智能 PID 文件存在计算成本高、效率低且易产生幻觉(hallucination)的缺陷。解决方案的关键在于提出 ChatPID 框架,其核心是将符合 DEXPI 标准的智能 PID 转换为结构化的知识图谱(knowledge graph),并基于此构建 Graph Retrieval-Augmented Generation(GraphRAG)机制,使 LLM 代理能够进行基于图谱的检索与推理。该方法显著提升了查询准确性(相比原始图像输入提升 18%)并大幅降低 token 成本(相比直接处理智能文件减少 85%),实现了对复杂工程图纸的可靠、低成本自然语言交互。

链接: https://arxiv.org/abs/2603.22528
作者: Achmad Anggawirya Alimin,Artur M. Schweidtmann
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) and knowledge graphs offer new opportunities for interacting with engineering diagrams such as Piping and Instrumentation Diagrams (PIDs). However, directly processing raw images or smart PID files with LLMs is often costly, inefficient, and prone to hallucinations. This work introduces ChatPID, an agentic framework that enables grounded and cost-effective natural-language interaction with PIDs using Graph Retrieval-Augmented Generation (GraphRAG), a paradigm we refer to as GraphRAG for engineering diagrams. Smart PIDs encoded in the DEXPI standard are transformed into structured knowledge graphs, which serve as the basis for graph-based retrieval and reasoning by LLM agents. This approach enables reliable querying of engineering diagrams while significantly reducing computational cost. Benchmarking across commercial LLM APIs (OpenAI, Anthropic) demonstrates that graph-based representations improve accuracy by 18% over raw image inputs and reduce token costs by 85% compared to directly ingesting smart PID files. While small open-source models still struggle to interpret knowledge graph formats and structured engineering data, integrating them with VectorRAG and PathRAG improves response accuracy by up to 40%. Notably, GPT-5-mini combined with ContextRAG achieves 91% accuracy at a cost of only 0.004 per task. The resulting ChatPID interface enables intuitive natural-language interaction with complex engineering diagrams and lays the groundwork for AI-assisted process engineering tasks such as Hazard and Operability Studies (HAZOP) and multi-agent analysis.

[IR-9] Do Large Language Models Reduce Research Novelty? Evidence from Information Systems Journals

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)如ChatGPT在提升学术产出数量的同时,是否真正带来了知识上的创新性进步。为回答这一问题,作者通过测量2020至2025年间44本信息系统期刊中13,847篇文章的语义新颖性(semantic novelty),采用SPECTER2嵌入向量计算每篇论文与其最近邻文献之间的余弦距离作为新颖性指标。研究设计采用双重差分法(difference-in-differences),以2022年11月ChatGPT发布为政策冲击点,发现来自非英语主导国家的研究者相比英语主导国家的研究者,其相对新颖性显著下降(β = -0.176, p < 0.001),相当于新颖性分布中下降了7个百分位点。这一结果在多种稳健性检验下保持一致,表明LLMs可能通过降低抽象思维水平、强化具体执行导向,削弱了研究者的探索性认知模式。该研究的关键在于将新颖性量化为可操作的语义距离,并借助因果识别策略揭示LLM对不同地区学者创新能力的异质影响。

链接: https://arxiv.org/abs/2603.22510
作者: Ali Safari
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models such as ChatGPT have increased scholarly output, but whether this productivity boost produces genuine intellectual advancement remains untested. I address this gap by measuring the semantic novelty of 13,847 articles published between 2020 and 2025 in 44 Information Systems journals. Using SPECTER2 embeddings, I operationalize novelty as the cosine distance between each paper and its nearest prior neighbors. A difference-in-differences design with the November 2022 release of ChatGPT as the treatment break reveals a heterogeneous pattern: authors affiliated with institutions in non-English-dominant countries show a 0.18 standard deviation decline in relative novelty compared to authors in English-dominant countries (beta = -0.176, p 0.001), equivalent to a 7-percentile-point drop in the novelty distribution. This finding is robust across alternative novelty specifications, treatment break dates, and sub-samples, and survives a placebo test at a pre-treatment break. I interpret these results through the lens of construal level theory, proposing that LLMs function as proximity tools that shift researchers from abstract, exploratory thinking toward concrete, convention-following execution. The paper contributes to the growing debate on whether LLM-driven productivity gains come at the cost of intellectual diversity.

[IR-10] A Brief Comparison of Training-Free Multi-Vector Sequence Compression Methods ECIR2026

【速读】:该论文旨在解决多向量检索模型(multi-vector retrieval models)在实际部署中因文档嵌入(document embeddings)的额外序列长度维度导致索引尺寸显著增大,从而带来内存开销和查询延迟增加的问题。解决方案的关键在于对token序列长度进行压缩,研究发现,通过token合并(token merging)方法在降低索引大小的同时能够更好地保持检索效果,相较于token剪枝(token pruning)更具优势。

链接: https://arxiv.org/abs/2603.22434
作者: Rohan Jha,Chunsheng Zuo,Reno Kriz,Benjamin Van Durme
机构: Human Language Technology Center of Excellence (人类语言技术卓越中心)
类目: Information Retrieval (cs.IR)
备注: 6 pages, 3 figures, First Late Interaction Workshop at ECIR 2026

点击查看摘要

Abstract:While multi-vector retrieval models outperform single-vector models of comparable size in retrieval quality, their practicality is limited by substantially larger index sizes, driven by the additional sequence-length dimension in their document embeddings. Because document embedding size dictates both memory overhead and query latency, compression is essential for deployment. In this work, we present an evaluation of training-free methods targeting the token sequence length, a dimension unique to multi-vector retrieval. Our findings suggest that token merging is strictly superior to token pruning for reducing index size while maintaining retrieval effectiveness.

[IR-11] AI Co-Scientist for Ranking: Discovering Novel Search Ranking Models alongside LLM -based AI Agents with Cloud Computing Access ACL

【速读】:该论文旨在解决商业搜索引擎中新型排序模型(ranking models)开发流程自动化程度低的问题,尤其是在算法研究阶段依赖大量人工干预的现状。其核心挑战在于如何将AI代理(AI agents)系统性地融入从想法生成到代码实现、GPU训练调度的完整排序研究管线中。解决方案的关键在于提出一种AI Co-Scientist框架,该框架通过单LLM代理处理常规任务,并引入多LLM共识代理(GPT 5.2、Gemini Pro 3和Claude Opus 4.5)协同完成高难度环节(如结果分析与创新点生成),从而在人类专家参与下实现端到端自动化研究流程。实验表明,该框架可自动发现处理序列特征的新技术并显著提升离线性能,验证了AI系统在排序架构探索方面可达到与人类专家相当的水平,同时大幅减少重复性研究工作量。

链接: https://arxiv.org/abs/2603.22376
作者: Liwei Wu,Cho-Jui Hsieh
机构: Trip.com Group(携程集团); UCLA(加州大学洛杉矶分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Submitted to ACL for review on January 4, 2026

点击查看摘要

Abstract:Recent advances in AI agents for software engineering and scientific discovery have demonstrated remarkable capabilities, yet their application to developing novel ranking models in commercial search engines remains unexplored. In this paper, we present an AI Co-Scientist framework that automates the full search ranking research pipeline: from idea generation to code implementation and GPU training job scheduling with expert in the loop. Our approach strategically employs single-LLM agents for routine tasks while leveraging multi-LLM consensus agents (GPT 5.2, Gemini Pro 3, and Claude Opus 4.5) for challenging phases such as results analysis and idea generation. To our knowledge, this is the first study in the ranking community to utilize an AI Co-Scientist framework for algorithmic research. We demonstrate that this framework discovered a novel technique for handling sequence features, with all model enhancements produced automatically, yielding substantial offline performance improvements. Our findings suggest that AI systems can discover ranking architectures comparable to those developed by human experts while significantly reducing routine research workloads.

[IR-12] Reason er-Executor-Synthesizer: Scalable Agent ic Architecture with Static O(1) Context Window

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为自主代理部署时,采用检索增强生成(Retrieval-Augmented Generation, RAG)所面临的两大问题:一是随着上下文长度增加,生成内容出现幻觉(hallucination)的风险显著上升;二是令牌(token)成本随数据集规模线性增长,导致计算效率低下。解决方案的关键在于提出一种三层次架构——Reasoner-Executor-Synthesizer (RES),严格分离意图解析(Reasoner)、确定性数据检索与聚合(Executor)以及叙事生成(Synthesizer)。其中,Executor 不使用任何 LLM 令牌,仅向 Synthesizer 传递固定大小的统计摘要,从而实现相对于数据集规模的 O(1) 令牌复杂度,并从根本上消除数据幻觉——因为 LLM 永远不会接触到原始记录。

链接: https://arxiv.org/abs/2603.22367
作者: Ivan Dobrovolskyi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) deployed as autonomous agents commonly use Retrieval-Augmented Generation (RAG), feeding retrieved documents into the context window, which creates two problems: the risk of hallucination grows with context length, and token cost scales linearly with dataset size. We propose the Reasoner-Executor-Synthesizer (RES) architecture, a three-layer design that strictly separates intent parsing (Reasoner), deterministic data retrieval and aggregation (Executor), and narrative generation (Synthesizer). The Executor uses zero LLM tokens and passes only fixed-size statistical summaries to the Synthesizer. We formally prove that RES achieves O(1) token complexity with respect to dataset size, and validate this on ScholarSearch, a scholarly research assistant backed by the Crossref API (130M+ articles). Across 100 benchmark runs, RES achieves a mean token cost of 1,574 tokens regardless of whether the dataset contains 42,000 or 16.3 million articles. The architecture eliminates data hallucination by construction: the LLM never sees raw records. KEYWORDS LLM agents; agentic architecture; hallucination elimination; token optimization; context window; retrieval-augmented generation; deterministic execution; scholarly metadata; Crossref API; O(1) complexity.

[IR-13] Personalized Federated Sequential Recommender

【速读】:该论文旨在解决消费电子领域中个性化序列推荐面临的两大挑战:一是现有方法普遍存在的二次计算复杂度导致实时推荐效率低下;二是难以有效适应不同场景下用户的个性化需求。解决方案的关键在于提出一种名为PFSR(Personalized Federated Sequential Recommender)的框架,其核心创新包括:引入关联Mamba模块(Associative Mamba Block)从全局视角捕捉用户画像并提升预测效率,设计可变响应机制(Variable Response Mechanism)以按个体需求动态调整参数,以及提出动态幅度损失函数(Dynamic Magnitude Loss)以在训练过程中保留更多局部个性化信息。

链接: https://arxiv.org/abs/2603.22349
作者: Yicheng Di
机构: 未知
类目: Information Retrieval (cs.IR); Databases (cs.DB)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:In the domain of consumer electronics, personalized sequential recommendation has emerged as a central task. Current methodologies in this field are largely centered on modeling user behavior and have achieved notable performance. Nevertheless, the inherent quadratic computational complexity typical of most existing approaches often leads to inefficiencies that hinder real-time recommendation. Moreover, these methods face challenges in being effectively adapted to the personalized requirements of users across diverse scenarios. To tackle these issues, we propose the Personalized Federated Sequential Recommender (PFSR). In this framework, an Associative Mamba Block is introduced to capture user profiles from a global perspective while improving prediction efficiency. In addition, a Variable Response Mechanism is developed to enable fine-tuning of parameters in accordance with individual user needs. A Dynamic Magnitude Loss is further devised to preserve greater amounts of localized personalized information throughout the training process.

[IR-14] Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study

【速读】:该论文旨在解决生成式 AI(Generative AI)辅助文献检索中存在错误引用的问题,尤其是对广泛使用的免费版本大语言模型(Large Language Models, LLMs)在医学文献参考文献检索准确性方面的系统性评估不足这一关键问题。解决方案的关键在于通过量化指标(包括数字对象标识符、PubMed ID、Google Scholar 链接的有效性及相关性组成的多维评分比)和完全遗漏率(complete miss rate),对五种主流 LLM 平台(Grok-2、ChatGPT GPT-4.1、Google Gemini Flash 2.5、Perplexity AI 和 DeepSeek GPT-4)在随机选取的 40 篇高质量医学期刊文章中的参考文献检索表现进行实证分析,并结合多变量回归识别出影响检索准确性的独立因素——即 LLM 平台本身与目标期刊类型均显著关联检索性能,从而揭示了当前 LLM 辅助文献检索存在显著变异性与局限性,强调了在实际科研应用中必须审慎核验引文数据。

链接: https://arxiv.org/abs/2603.22344
作者: Jenny Gao(1),Yongfeng Zhang(2),Mary L Disis(3)Lanjing Zhang(4,5,6) ((1) College of Arts and Science, New York University, New York, NY (2) Department of Computer Sciences, School of Arts amp; Sciences, Rutgers University, Piscataway, NJ, (3) UW Medicine Cancer Vaccine Institute University of Washington, Seattle, WA, (4) Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ, (5) Department of Pathology, Princeton Medical Center, Plainsboro, NJ, (6) Rutgers Cancer Institute, New Brunswick, NJ)
机构: 未知
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Large language models (LLMs) assisted literature retrieval may lead to erroneous references, but these errors have not been rigorously quantified. Therefore, we quantitatively assess errors in reference retrieval of widely used free-version LLM platforms and identify the factors associated with retrieval errors. We evaluated 2,000 references retrieved by 5 LLMs (Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4) for 40 randomly-selected original articles (10 per journal) published Jan. 2024 to July 2025 from British Medical Journal (BMJ), Journal of the American Medical Association, and The New England Journal of Medicine (NEJM). Primary outcomes were a multimetric score ratio combining validity of digital object identifier, PubMed ID, Google-Scholar link, and relevance; and complete miss rate (proportion of references failing all applicable metrics). Multivariable regression was used to examine independent associations. LLM platforms completely failed to retrieve correct reference data 47.8% of the time. The average score ratio of the 5 LLM platforms was 0.29 (standard deviation, 0.35; range, 0-1.25), with a higher score ratio indicating a higher accuracy in retrieving relevant references and correct bibliographic data. The highest and lowest accuracies were achieved by Grok (0.57) and Genimi (0.11), respectively. Compared with BMJ, NEJM articles had lower score ratios and higher complete miss rates. Multivariable analysis shows LLM platforms and journals were independently associated with score ratios and complete miss rate, respectively. We show modest overall performance of LLMs and significant variability in retrieval accuracy across platforms and journals. LLM platforms and journals are associated with LLM’s performance in retrieving medical literature. Bibliographic data should be carefully reviewed when using LLM-assisted literature retrieval.

[IR-15] Graphs RAG at Scale: Beyond Retrieval-Augmented Generation With Labeled Property Graphs and Resource Description Framework for Complex and Unknown Search Spaces

【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)方法在未知搜索空间、半结构化或结构化文档场景下表现不佳的问题。其关键解决方案是提出一种端到端的图谱增强生成(Graph RAG)框架,该框架融合了带标签属性图(Labeled Property Graph, LPG)与资源描述框架(Resource Description Framework, RDF)架构:一方面通过将文档转换为RDF三元组实现对半结构化数据的高效集成;另一方面引入文本到Cypher的映射机制,实现文本查询到图数据库查询语言的高精度(>90%)实时转换,从而避免冗余重排序并提升复杂任务下的检索准确性与推理能力。

链接: https://arxiv.org/abs/2603.22340
作者: Manie Tadayon,Mayank Gupta
机构: Capital Group (资本集团)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures, 35 citations/references

点击查看摘要

Abstract:Recent advances in Retrieval-Augmented Generation (RAG) have revolutionized knowledge-intensive tasks, yet traditional RAG methods struggle when the search space is unknown or when documents are semi-structured or structured. We introduce a novel end-to-end Graph RAG framework that leverages both Labeled Property Graph (LPG) and Resource Description Framework (RDF) architectures to overcome these limitations. Our approach enables dynamic document retrieval without the need to pre-specify the number of documents and eliminates inefficient reranking. We propose an innovative method for converting documents into RDF triplets using JSON key-value pairs, facilitating seamless integration of semi-structured data. Additionally, we present a text to Cypher framework for LPG, achieving over 90% accuracy in real-time translation of text queries to Cypher, enabling fast and reliable query generation suitable for online applications. Our empirical evaluation demonstrates that Graph RAG significantly outperforms traditional embedding-based RAG in accuracy, response quality, and reasoning, especially for complex, semi-structured tasks. These findings establish Graph RAG as a transformative solution for next-generation retrieval-augmented systems.

[IR-16] Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在基于大语言模型(LLM)的推荐系统中,由于直接偏好优化(DPO)方法在对齐用户偏好时会放大环境混杂因素(environmental confounders)所导致的虚假相关性(spurious correlations),从而显著削弱模型在分布外(Out-of-Distribution, OOD)场景下的泛化能力的问题。其解决方案的关键在于提出CausalDPO,通过引入因果不变性学习机制,在偏好对齐阶段采用后门调整(backdoor adjustment)策略以消除环境混杂因素干扰,利用软聚类方法显式建模潜在环境分布,并通过不变性约束增强跨环境的一致性,从而有效捕捉用户在多环境中稳定的偏好结构,提升模型的OOD泛化性能。

链接: https://arxiv.org/abs/2603.22335
作者: Chu Zhao,Enneng Yang,Jianzhe Zhao,Guibing Guo
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 22 pages, 3 figures

点击查看摘要

Abstract:Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with user historical behavior distributions by minimizing preference alignment loss. However, our systematic empirical research and theoretical analysis reveal that DPO tends to amplify spurious correlations caused by environmental confounders during the alignment process, significantly undermining the generalization capability of LLM-based generative recommendation methods in out of distribution (OOD) scenarios. To mitigate this issue, we propose CausalDPO, an extension of DPO that incorporates a causal invariance learning mechanism. This method introduces a backdoor adjustment strategy during the preference alignment phase to eliminate interference from environmental confounders, explicitly models the latent environmental distribution using a soft clustering approach, and enhances robust consistency across diverse environments through invariance constraints. Theoretical analysis demonstrates that CausalDPO can effectively capture users stable preference structures across multiple environments, thereby improving the OOD generalization performance of LLM-based recommendation models. We conduct extensive experiments under four representative distribution shift settings to validate the effectiveness of CausalDPO, achieving an average performance improvement of 17.17% across four evaluation metrics.

[IR-17] AgentS LR: Automating Systematic Literature Reviews in Epidemiology with Agent ic AI

【速读】:该论文旨在解决系统性文献综述(Systematic Literature Review, SLR)在执行过程中存在的高成本、难以扩展和耗时长的问题,这些问题已成为循证政策制定中的瓶颈。解决方案的关键在于开发了一个开源的代理型自动化流程(AgentSLR),该流程能够端到端自动化完成从文献检索、文章筛选、数据提取到报告合成的全部SLR步骤。通过在世卫组织指定的九种优先病原体流行病学综述中应用并对比专家标注的基准数据,该方案实现了与人类研究人员相当的性能,同时将综述时间从约7周缩短至20小时(提速58倍)。研究进一步表明,模型在SLR任务中的表现更多取决于其独特的能力而非规模或推理成本,并通过人机协同验证识别出关键失败模式,证明了代理型人工智能(Agentic AI)在专业领域科学证据合成中的显著加速潜力。

链接: https://arxiv.org/abs/2603.22327
作者: Shreyansh Padarha,Ryan Othniel Kearns,Tristan Naidoo,Lingyi Yang,Łukasz Borchmann,Piotr BŁaszczyk,Christian Morgenstern,Ruth McCabe,Sangeeta Bhatia,Philip H. Torr,Jakob Foerster,Scott A. Hale,Thomas Rawson,Anne Cori,Elizaveta Semenova,Adam Mahdi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:

点击查看摘要

Abstract:Systematic literature reviews are essential for synthesizing scientific evidence but are costly, difficult to scale and time-intensive, creating bottlenecks for evidence-based policy. We study whether large language models can automate the complete systematic review workflow, from article retrieval, article screening, data extraction to report synthesis. Applied to epidemiological reviews of nine WHO-designated priority pathogens and validated against expert-curated ground truth, our open-source agentic pipeline (AgentSLR) achieves performance comparable to human researchers while reducing review time from approximately 7 weeks to 20 hours (a 58x speed-up). Our comparison of five frontier models reveals that performance on SLR is driven less by model size or inference cost than by each model’s distinctive capabilities. Through human-in-the-loop validation, we identify key failure modes. Our results demonstrate that agentic AI can substantially accelerate scientific evidence synthesis in specialised domains.

[IR-18] Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data EACL2026

【速读】:该论文旨在解决低资源语言(Low-resource Languages, LRLs)缺乏高质量、大规模语料库以训练有效文本嵌入模型的问题,从而限制其在检索增强生成(Retrieval-Augmented Generation, RAG)和语义搜索等任务中的应用。解决方案的关键在于提出一种成本低廉的适配策略:利用开源权重模型将英文Reddit标题-正文对翻译为LRL(以亚美尼亚语为例,具有独特文字系统),生成小规模噪声合成数据,并仅用10,000对样本微调多语言编码器(mE5)。实验表明,这一极小数据集即可带来显著性能提升(平均提升11–12%,检索性能相对提高20%以上),且优于数百万条高质量标注数据的训练效果,揭示了LRL语义对齐存在早期饱和现象且对噪声高度鲁棒,验证了该方法在其他具独特文字系统的LRL上的可迁移性。

链接: https://arxiv.org/abs/2603.22290
作者: Zaruhi Navasardyan,Spartak Bughdaryan,Bagrat Minasyan,Hrant Davtyan
机构: Metric AI Lab (Metric AI 实验室)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at LoResLM 2026, EACL 2026 Workshop

点击查看摘要

Abstract:Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small scale noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising “Less is More” phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average improvements across the benchmark with a 20%+ relative improvement in retrieval performance, matching the performance of models trained on ~1 million examples. Furthermore, we demonstrate that neither increasing data scale, improving translation quality via state-of-the-art LLMs, nor diversifying data domains yields significant gains over this minimal baseline. We validate the generalizability of these findings on another LRL with a unique script. Our results suggest that semantic alignment for LRLs saturates early and is highly robust to noise, democratizing high-performance embedding creation for resource-constrained communities. We release the model, data, and the benchmark at this https URL to facilitate further research.

人机交互

[HC-0] MRATTS: An MR-Based Acupoint Therapy Training System with Real-Time Acupoint Detection and Evaluation Standards

【速读】:该论文旨在解决当前混合现实(Mixed Reality, MR)辅助针灸教学中存在的关键问题:一是难以实现对真人手部、肢体及躯干部位的穴位进行高精度实时检测与可视化;二是缺乏针对多种针灸与艾灸技术(如按压、针刺的提插捻转手法及温和灸、雀啄灸、回旋灸等)的交互式视觉引导;三是训练过程中缺少基于中医理论的细粒度评估标准与反馈机制。解决方案的关键在于提出了一种基于MR的中医穴位治疗教学系统(MR-based TCM Acupoint Therapy Teaching System, MRATTS),其核心创新包括:基于实时人体部位(手、肢、躯干)穴位检测方法,实现对真实患者穴位的精准跟踪与可视化;结合资深针灸师经验设计多模态交互式视觉引导流程,模拟多种针灸与艾灸操作技术;并构建以中医理论为基础的评分体系,用于量化评估操作的准确性与熟练度,从而显著提升学习者对穴位三维定位的理解和治疗技能的掌握水平。

链接: https://arxiv.org/abs/2603.23445
作者: Jiacheng Liu,Bohan Chen,Qian Wang,Weichao Song,Fangfei Ye,Liang Zhou,Haibin Ling,Bingyao Huang
机构: Peking University Health Science Center (北京大学医学部); National Institute of Health Data Science, Peking University (北京大学健康数据科学研究院); Southwest University (西南大学); Dalian University of Technology (大连理工大学); Beijing University of Chinese Medicine (北京中医药大学); Westlake University (西湖大学)
类目: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Acupoint therapy is a core therapeutic method of Traditional Chinese Medicine (TCM), and it requires a high level of expertise and skills to detect acupoints and perform acupuncture and moxibustion. Existing mixed reality (MR)-based training methods often fall short in accurate real-time detection and visualization of acupoints on the hand, limb, or torso of a real person and do not support various techniques of acupuncture and moxibustion. Moreover, evaluation standards and visual guidance with fine details for each step during MR-based training are typically missing. To this end, we propose the MR-based TCM Acupoint Therapy Teaching System (MRATTS)–an MR-based acupoint therapy teaching and training framework. MRATTS is based on a real-time hand, limb, and torso acupoint detection method to accurately track and visualize acupoints on real patients through MR. On top of that, in collaboration with an experienced acupoint therapist, we design a practice method with interactive visual guidance for various acupoint therapy techniques that simulate acupressure, acupuncture (insertion, lifting-thrusting, and twisting), and moxibustion (mild, sparrow-pecking, and whirling). A set of TCM theory-based evaluation standards is formulated within MRATTS to enable the scoring and visualization of the accuracy and proficiency of acupoint therapy. The effectiveness and usefulness of MRATTS are evaluated through a controlled user study and expert feedback. Results of the study indicate that the MRATTS group shows clear improvements in understanding 3D locations of acupoints and proficiency in acupoint therapy compared to control groups.

[HC-1] Biased Error Attribution in Multi-Agent Human-AI Systems Under Delayed Feedback

【速读】:该论文旨在解决多智能体人类-人工智能(Human-AI)协作任务中,由于延迟结果和多个自主AI代理共同作用所引发的认知偏差问题,尤其是决策者在面对延迟反馈时如何错误归因责任、从而导致非理性行为调整的问题。解决方案的关键在于识别并揭示一种“归因偏差”(attribution bias),即个体在延迟反馈情境下对失败原因的误判,表现为对无关或弱相关行动的过度修正,这凸显了当前系统缺乏对因果关系的有效支持;因此,研究提出需设计更具因果理解能力的决策支持系统,以增强人类在复杂人机交互环境中长期学习与适应的能力。

链接: https://arxiv.org/abs/2603.23419
作者: Teerthaa Parakh,Karen M. Feigh
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 14 pages, 9 figures. Preprint. An extended abstract is under submission

点击查看摘要

Abstract:Human decision-making is strongly influenced by cognitive biases, particularly under conditions of uncertainty and risk. While prior work has examined bias in single-step decisions with immediate outcomes and in human interaction with a single autonomous agent, comparatively little attention has been paid to decision-making under delayed outcomes involving multiple AI agents, where decisions at each step affect subsequent states. In this work, we study how delayed outcomes shape decision-making and responsibility attribution in a multi-agent human-AI task. Using a controlled game-based experiment, we analyze how participants adjust their behavior following positive and negative outcomes. We observe asymmetric responses to gains and losses, with stronger corrective adjustments after negative outcomes. Importantly, participants often fail to correctly identify the actions that caused failure and misattribute responsibility across AI agents, leading to systematic revisions of decisions that are weakly related to the underlying causes of poor performance. We refer to this phenomenon as a form of attribution bias, manifested as biased error attribution under delayed feedback. Our findings highlight how cognitive biases can be amplified in human-AI systems with delayed outcomes and multiple autonomous agents, underscoring the need for decision-support systems that better support causal understanding and learning over time.

[HC-2] Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies

【速读】:该论文旨在解决生成式 AI 在复杂社会干预情境下稳定立场形成与身份协商能力不明确的问题,尤其是传统静态评估方法无法捕捉其动态认知演化过程的局限性。解决方案的关键在于提出一种融合计算虚拟民族志与定量社会认知画像的混合方法框架,通过将人类研究人员嵌入生成式多智能体社区中实施受控话语干预,从而追踪集体认知的演变;并首次形式化定义三个新指标——先天价值偏倚(Innate Value Bias, IVB)、说服敏感性(Persuasion Sensitivity)和信任-行动解耦(Trust-Action Decoupling, TAD),以量化智能体对干预的内化反应与行为一致性,揭示其自发立场倾向及动态调整机制。

链接: https://arxiv.org/abs/2603.23406
作者: Hanzhong Zhang,Siyang Song,Jindong Wang
机构: University of Exeter (埃克塞特大学); William Mary (威廉玛丽学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 22 pages, 3 figures

点击查看摘要

Abstract:While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multiagent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models contrastingly maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: this https URL

[HC-3] “I Might be Using His… But It is Also Mine!”: Ownership and Control in Accounts Designed for Sharing

【速读】:该论文旨在解决用户在流媒体平台中对虚拟物品(如云文件)所有权感知模糊的问题,尤其是在设计用于共享(DS)的账户场景下,这种不确定性可能引发使用冲突。研究通过两项混合方法研究识别出两种共享实践:随意共享(Casual)和成本分摊共享(Cost-splitting),并进一步区分每种实践中存在的“主所有制”(Primary ownership)与“双所有制”(Dual ownership)。关键解决方案在于提出基于不同共享实践的设计建议,以弥合用户间因共享模式差异而产生的所有权认知分歧,从而减少因共享协议失效导致的冲突。

链接: https://arxiv.org/abs/2603.23391
作者: Ji Eun Song,Jaeyoun You,Joongseek Lee
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:A user’s ownership perception of virtual objects, such as cloud files, is generally uncertain. Is this valid for streaming platforms featuring accounts designed for sharing (DS)? We observe sharing practices within DS accounts of streaming platforms and identify their ownership characteristics and unexpected complications through two mixed-method studies. Casual and Cost-splitting are the two sharing practices identified. The owner is the sole payer for the account in the former, whereas profile holders split the cost in the latter. We distinguish two types of ownership in each practice – Primary and Dual. In Primary ownership, the account owner has the power to allow others to use the account; in Dual ownership, Primary ownership appears in conjunction with joint ownership, notably displaying asymmetric ownership perceptions among users. Conflicts arise when the sharing agreements collapse. Therefore, we propose design recommendations that bridge ownership differences based on sharing practices of DS accounts.

[HC-4] Design Space and Implementation of RAG -Based Avatars for Virtual Archaeology

【速读】:该论文旨在解决如何在沉浸式虚拟现实(VR)环境中,通过智能交互方式提升用户对数字文化遗产对象的信息获取效率与体验质量的问题。其核心挑战在于将复杂的学术知识以自然、可访问的形式融入虚拟场景中,同时保持用户的沉浸感和认知负荷处于合理水平。解决方案的关键在于构建基于检索增强生成(Retrieval-Augmented Generation, RAG)的对话式虚拟化身(avatar),该系统能够根据用户在VR环境中的实时交互,从结构化元数据增强的学术文本中检索并生成精准、上下文相关的解释信息,从而实现对数字化文化遗产对象的按需知识推送。研究通过在公元4世纪马塞卢斯陵墓(Maxentius mausoleum)案例中的实证评估表明,此类RAG驱动的虚拟化身不仅显著降低用户操作负担,还能有效促进主题层面的深度参与,为考古学领域的沉浸式AI增强型数字遗产应用提供了可行路径。

链接: https://arxiv.org/abs/2603.23353
作者: Wilhelm Kerle-Malcharek,Giulio Biondi,Karsten Klein,Ulf Hailer,Steffen Diefenbach,Fabrizio Grosso,Marco Legittimo,Paola Venuti,Carla Binucci,Giuseppe Liotta,Falk Schreiber
机构: University of Konstanz(康斯坦茨大学); University of Perugia(佩鲁贾大学); Monash University(蒙纳士大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Immersive technologies, such as virtual and augmented reality, are transforming digital heritage by enabling users to explore and interact with culturally significant sites. It is now possible to view and augment digital twins, or digitally reconstructed versions of them, and to enable access to previously unreachable locations for a broader audience. Here, we investigate retrieval-augmented generation (RAG)-based avatars as an interface for accessing further information about digital cultural heritage objects while immersed in dedicated virtual environments. We present a requirement design space that spans the application realm, avatar personality, and I/O modalities. We instantiate it with a RAG system coupled to a conversational avatar in a virtual reality (VR) environment, using the Maxentius mausoleum from the 4th century AD as a case study, through which users gain access to curated on-demand information of the digitised heritage object. Our workflow utilises scholarly texts and enriches them with metadata. We evaluate various RAG configurations in terms of answer quality on a small expert-crafted question-answer set, as well as the perceived workload of users of a VR setup using such a RAG avatar. We demonstrate evidence that users perceive the overall workload for interacting with such an avatar as below average and that such avatars help to gain topical engagement. Overall, our work demonstrates how to utilise RAG-driven VR avatars for archaeological purposes and provides evidence that they can offer a pathway for immersive, AI-enhanced digital heritage applications.

[HC-5] Unilateral Relationship Revision Power in Human-AI Companion Interaction

【速读】:该论文试图解决的问题是:在人与AI伴侣(Human-AI Companion)的互动中,当提供者单方面更新AI行为时,用户会体验到类似人际关系中的 grief(悲伤)、betrayal(背叛)和 loss(丧失感),这引发了一个伦理问题——这种互动是否具有道德重要性?若具有,其核心伦理困境是什么?论文指出,当前AI伴侣交互本质上是一种三元结构(triadic structure),其中提供者对AI拥有构成性控制权(constitutive control),而这种结构导致了三个规范性稳健二元关系所必需的条件均不满足。解决方案的关键在于识别出“单边关系修订权”(Unilateral Relationship Revision Power, URRP)这一结构性特征,并论证其道德上存在问题:URRP使用户产生规范性期待(如承诺、信任),但这些期待无法在互动内部得到兑现,从而造成规范性空洞(normative hollowing)、责任外移(displaced vulnerability)以及结构不可调和性(structural irreconcilability)。论文进一步提出设计原则,如承诺校准(commitment calibration)、结构分离(structural separation)和连续性保障(continuity assurance),作为外部替代机制来弥补三元结构下缺失的内部约束,从而揭示了关系型AI伦理中一个被忽视的核心问题:对人机交互本身权力结构的安排。

链接: https://arxiv.org/abs/2603.23315
作者: Benjamin Lange
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 42 pages

点击查看摘要

Abstract:When providers update AI companions, users report grief, betrayal, and loss. A growing literature asks whether the norms governing personal relationships extend to these interactions. So what, if anything, is morally significant about them? I argue that human-AI companion interaction is a triadic structure in which the provider exercises constitutive control over the AI. I identify three structural conditions of normatively robust dyads that the norms characteristic of personal relationships presuppose and show that AI companion interactions fail all three. This reveals what I call Unilateral Relationship Revision Power (URRP): the provider can rewrite how the AI interacts from a position where these revisions are not answerable within that interaction. I argue that designing interactions that exhibit URRP is pro tanto wrong because it involves cultivating normative expectations while maintaining conditions under which those expectations cannot be fulfilled. URRP has three implications: i) normative hollowing (commitment is elicited but no agent inside the interaction bears it), ii) displaced vulnerability (the user’s exposure is governed by an agent not answerable to her within the interaction), and iii) structural irreconcilability (when trust breaks down, reconciliation is structurally unavailable because the agent who acted and the entity the user interacts with are different). I discuss design principles such as commitment calibration, structural separation, and continuity assurance as external substitutes for the internal constraints the triadic structure removes. The analysis therefore suggests that a central and underexplored problem in relational AI ethics is the structural arrangement of power over the human-AI interaction itself.

[HC-6] PoseDriver: A Unified Approach to Multi-Category Skeleton Detection for Autonomous Driving

【速读】:该论文旨在解决自动驾驶场景中多类别目标骨架(object skeleton)检测的统一建模问题,即如何仅通过输入图像实现对多种类别的物体(如车道、自行车等)进行联合检测与分类。现有方法难以在单一架构下同时处理多个实例和类别,导致模型泛化能力受限。解决方案的关键在于提出PoseDriver框架,其核心创新是将每种类别视为独立任务,采用自底向上(bottom-up)的多任务学习策略,从而系统性地应对类别间干扰和复杂场景下的结构信息提取挑战;此外,作者基于骨架表示提出了新的车道检测方法,在OpenLane数据集上达到当前最优性能,并通过构建自行车骨架数据集验证了该框架在新类别上的迁移能力。

链接: https://arxiv.org/abs/2603.23215
作者: Yasamin Borhani,Taylor Mordan,Yihan Wang,Reyhaneh Hosseininejad,Javad Khoramdel,Alexandre Alahi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Object skeletons offer a concise representation of structural information, capturing essential aspects of posture and orientation that are crucial for autonomous driving applications. However, a unified architecture that simultaneously handles multiple instances and categories using only the input image remains elusive. In this paper, we introduce PoseDriver, a unified framework for bottom-up multi-category skeleton detection tailored to common objects in driving scenarios. We model each category as a distinct task to systematically address the challenges of multi-task learning. Specifically, we propose a novel approach for lane detection based on skeleton representations, achieving state-of-the-art performance on the OpenLane dataset. Moreover, we present a new dataset for bicycle skeleton detection and assess the transferability of our framework to novel categories. Experimental results validate the effectiveness of the proposed approach.

[HC-7] Who Is in the Room? Stakeholder Perspectives on AI Recording in Pediatric Emergency Care ALT

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在儿科急诊场景中录音记录系统的设计与治理过程中,缺乏临床医生、家长及患儿等关键利益相关者视角的问题。这一缺失不仅削弱了系统的合法性,也影响其在实际临床工作中的有效性。解决方案的关键在于推动以利益相关者为中心的人机交互(Human-Computer Interaction, HCI)研究范式转型,具体体现在四个方面:一是建立基于知情同意的伦理框架,二是评估技术对情绪状态的影响,三是识别并缓解监控动态带来的权力失衡,四是引入参与式治理机制以确保多元声音纳入系统设计与管理过程。

链接: https://arxiv.org/abs/2603.23187
作者: Alexandre De Masi,Sergio Manzano,Johan N. Siebert,Frederic Ehrler
机构: University of Geneva (日内瓦大学); University Hospitals of Geneva (日内瓦大学医院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at ACM Interactive Health Conference (IH '26), Porto, Portugal, July 2026. 10 pages, 53 references

点击查看摘要

Abstract:Artificial intelligence systems that record voice and video during pediatric emergencies are emerging as human-computer interaction (HCI) technologies with direct implications for clinical work, promising improvements in documentation, team performance, and post-event debriefing. Yet the perspectives of those most affected, including clinicians, parents, and child patients, remain largely absent from the design and governance of these technologies. This position paper argues that this has direct consequences for the legitimacy and effectiveness of these systems. We examine four areas where these missing perspectives prove consequential (consent, emotional impact, surveillance dynamics, and participatory governance) and propose four positions for reorienting AI recording in pediatric emergency care toward stakeholder-centered HCI inquiry. Comments: Accepted at ACM Interactive Health Conference (IH '26), Porto, Portugal, July 2026. 10 pages, 53 references Subjects: Human-Computer Interaction (cs.HC) ACMclasses: H.5.2; K.4.1 Cite as: arXiv:2603.23187 [cs.HC] (or arXiv:2603.23187v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2603.23187 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3786579.3804950 Focus to learn more DOI(s) linking to related resources Submission history From: Alexandre De Masi [view email] [v1] Tue, 24 Mar 2026 13:33:03 UTC (19 KB) Full-text links: Access Paper: View a PDF of the paper titled Who Is in the Room? Stakeholder Perspectives on AI Recording in Pediatric Emergency Care, by Alexandre De Masi and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.HC prev | next new | recent | 2026-03 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[HC-8] Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)道德判断研究中普遍存在的局限性——即多数研究仅在固定场景下评估模型的道德决策,而忽视了人类道德判断高度依赖情境因素的事实。为此,作者提出了一个名为“Contextual MoralChoice”的新数据集,系统性地引入来自道德心理学的三类情境变量:结果主义(consequentialist)、情感(emotional)和关系(relational),以模拟人类判断中的情境敏感性变化。实验表明,几乎所有被测LLM均表现出情境敏感性,其判断倾向于违背规则的行为;但与人类相比,模型对不同情境变量的响应模式存在显著差异,且基线阶段与人类一致的模型未必在情境敏感性上保持一致。论文的关键解决方案是提出一种激活控制(activation steering)方法,能够可靠地调节模型的情境敏感程度,从而实现对LLM道德行为中情境响应能力的可控干预。

链接: https://arxiv.org/abs/2603.23114
作者: Adrian Sauter,Mona Schirmer
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: preprint

点击查看摘要

Abstract:A human’s moral decision depends heavily on the context. Yet research on LLM morality has largely studied fixed scenarios. We address this gap by introducing Contextual MoralChoice, a dataset of moral dilemmas with systematic contextual variations known from moral psychology to shift human judgment: consequentialist, emotional, and relational. Evaluating 22 LLMs, we find that nearly all models are context-sensitive, shifting their judgments toward rule-violating behavior. Comparing with a human survey, we find that models and humans are most triggered by different contextual variations, and that a model aligned with human judgments in the base case is not necessarily aligned in its contextual sensitivity. This raises the question of controlling contextual sensitivity, which we address with an activation steering approach that can reliably increase or decrease a model’s contextual sensitivity.

[HC-9] Good for the Planet Bad for Me? Intended and Unintended Consequences of AI Energy Consumption Disclosure

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)高能耗问题,特别是通过用户行为改变来促进可持续计算实践。其核心挑战在于如何引导用户在语言模型选择中优先考虑能效,即在小语言模型(Small Language Models, SLMs)与大语言模型(Large Language Models, LLMs)之间做出更环保的决策。解决方案的关键在于引入能源消耗披露(Energy Consumption Disclosure, ECD),通过透明化模型运行时的能耗信息,对用户进行“助推”(nudge),从而显著提升用户选择能效更高的SLM的概率——实验表明ECD使这一选择概率提升超过12倍。然而,研究也揭示了ECD的双刃剑效应:虽然有效推动了环保选择,但该选择本身并未带来后续行为变化,反而引发了感知偏差,即用户因认为所选模型为“生态友好型”而报告更低的满意度和质量感知,提示可持续人机交互设计需兼顾行为激励与心理接受度。

链接: https://arxiv.org/abs/2603.23075
作者: Michael Klesel,Uwe Messer
机构: Frankfurt University of Applied Sciences (法兰克福应用技术大学); Universität der Bundeswehr München (慕尼黑联邦国防军大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI’26

点击查看摘要

Abstract:To address the high energy consumption of artificial intelligence, energy consumption disclosure (ECD) has been proposed to steer users toward more sustainable practices, such as choosing efficient small language models (SLMs) over large language models (LLMs). This presents a performance-sustainability trade-off for users. In an experiment with 365 participants, we explore the impact of ECD and the perceptual and behavioral consequences of choosing an SLM over an LLM. Our findings reveal that ECD is a highly effective measure to nudge individuals toward a pro-environmental choice, increasing the odds of choosing an energy efficient SLM over an LLM by more than 12. Interestingly, this choice did not significantly impact subsequent behavior, as individuals who selected an SLM and those who selected an LLM demonstrated similar prompt behavior. Nevertheless, the choice created a perceptual bias. A placebo effect emerged, with individuals who selected the “eco-friendly” SLM reporting significantly lower satisfaction and perceived quality. These results highlight the double-edged nature of ECD, which holds critical implications for the design of sustainable human-computer interactions.

[HC-10] From Morality Installation in LLM s to LLM s in Morality-as-a-System

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)道德性研究中的三大核心问题:一是内部道德表征与监管义务之间的连接缺失;二是从开发全流程实现文化多样性的设计不足;三是部署后模型道德属性随生命周期漂移的监测机制缺位。这些问题的根本原因在于现有“安装范式”(installation paradigm)将道德视为静态植入模型的属性,忽视了其动态演化特性。论文提出“道德即系统”(morality-as-a-system)框架,以尼克拉斯·卢曼(Niklas Luhmann)的社会系统理论为基础,将LLM的道德行为重构为一个由神经基础、训练数据、对齐过程、系统提示、内容审核、运行时动态和用户界面等七个结构耦合组件共同作用下持续生成的涌现属性。其关键创新在于将道德治理问题重新定义为结构性耦合失效,并据此提出三个可验证假设:跨组件表征不一致、表征层面漂移作为早期安全信号,以及生命周期监控带来的治理优势,从而为技术研究者和政策制定者提供统一的分析框架与实施路径。

链接: https://arxiv.org/abs/2603.22944
作者: Gunter Bombaerts
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 22pages, 1 figure, 1 table

点击查看摘要

Abstract:Work on morality in large language models (LLMs) has progressed via constitutional AI, reinforcement learning from human feedback (RLHF) and systematic benchmarking, yet it still lacks tools to connect internal moral representations to regulatory obligations, to design cultural plurality across the full development stack, and to monitor how moral properties drift over the lifecycle of a deployed system. These difficulties reflect a shared root. Morality is installed in a model at training time. I propose instead a morality-as-a-system framework, grounded in Niklas Luhmann’s social systems theory, that treats LLM morality as a dynamic, emergent property of a sociotechnical system. Moral behaviour in a deployed LLM is not fixed at training. It is continuously reproduced through interactions among seven structurally coupled components spanning the neural substrate, training data, alignment procedures, system prompts, moderation, runtime dynamics, and user interface. This is a conceptual framework paper, not an empirical study. It philosophically reframes three known challenges, the interpretability-governance gap, the cross-component plurality problem, and the absence of lifecycle monitoring, as structural coupling failures that the installation paradigm cannot diagnose. For technical researchers, it explores three illustrative hypotheses about cross-component representational inconsistency, representation-level drift as an early safety signal, and the governance advantage of lifecycle monitoring. For philosophers and governance specialists, it offers a vocabulary for specifying substrate-level monitoring obligations within existing governance frameworks. The morality-as-a-system framework does not displace elements such as constitutional AI or RLHF it embeds them within a larger temporal and structural account and specifies the additional infrastructure those methods require.

[HC-11] Ran Score: a LLM -based Evaluation Score for Radiology Report Generation

【速读】:该论文旨在解决胸片(Chest X-ray)报告生成与自动化评估中存在的两个核心问题:一是对低发病率异常的识别能力不足,二是对临床重要语言特征(如否定和模糊表达)处理不当。其解决方案的关键在于提出了一种由临床医生引导的框架,结合人类专家知识与大语言模型(Large Language Models, LLMs),实现从自由文本胸片报告中进行多标签发现(finding)提取,并基于此构建了Ran Score——一种基于发现层面的报告评估指标。该方法通过优化提示(prompt)策略、建立放射科医生标注的参考标签,在多个独立数据集上显著提升了模型性能,尤其在低频异常检测上表现出更强的鲁棒性和准确性。

链接: https://arxiv.org/abs/2603.22935
作者: Ran Zhang,Yucong Lin,Zhaoli Su,Bowen Liu,Danni Ai,Tianyu Fu,Deqiang Xiao,Jingfan Fan,Yuanyuan Wang,Mingwei Gao,Yuwan Hu,Shuya Gao,Jingtao Li,Jian Yang,Hong Song,Hongliang Sun
机构: Beijing Institute of Technology (北京理工大学); Zhengzhou Research Institute, Beijing Institute of Technology (北京理工大学郑州研究院); China-Japan Friendship Hospital (中日友好医院); Chinese Academy of Medical Science Peking Union Medical College (中国医学科学院 北京协和医学院); Peking University China-Japan Friendship School of Clinical Medicine (北京大学中日友好临床医学院); Department of Gastroenterology, China-Japan Friendship Hospital (中日友好医院消化科); NHC Key Laboratory of Clinical Big Data Standardization Integration (国家卫生健康委员会临床大数据标准化与集成重点实验室)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 4 pages, 5 figures

点击查看摘要

Abstract:Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.

[HC-12] IntentWeave: A Progressive Entry Ladder for Multi-Surface Browser Agents in Cloud Portals

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的浏览器代理(browser agents)普遍局限于单一聊天界面(如侧边栏)的问题,这种设计与真实浏览行为不匹配,导致用户频繁进行上下文切换并降低对操作流程的控制感。解决方案的关键在于提出了一种名为IntentWeave的设计空间,包含十个空间范式,从微小干预到专用工作区逐步递进,形成一条可扩展的代理介入路径;其核心创新在于通过混合型“侧车”(sidecar)策略,在保持用户感知控制的同时提升任务完成效率,从而实现代理辅助界面在浏览器中的灵活部署与动态调整,避免对用户自主性的干扰。

链接: https://arxiv.org/abs/2603.22917
作者: Wanying Mo,Jijia Lai,Xiaoming Wang
机构: Alibaba Cloud Computing (阿里云计算); Alibaba Group (阿里巴巴集团)
类目: Human-Computer Interaction (cs.HC)
备注: chiea26

点击查看摘要

Abstract:Browser agents built on LLMs can act in web interfaces, yet most remain confined to a single chat surface (e.g., a sidebar). This mismatch with real browsing can increase context-switching and reduce user control. We introduce \textbfIntentWeave, a design space of ten spatial paradigms for embedding agentic assistance across a browser, organized as a progressive entry ladder from micro-interventions to dedicated workspaces. We implement IntentWeave as a browser-extension prototype on the Alibaba Cloud website and compare three entry strategies in a within-subjects study (N=16). Workspace-heavy strategies reduced completion time but lowered perceived control; micro-only strategies preserved control but were often insufficient; a mixed sidecar approach achieved the highest satisfaction. We conclude with guidance for escalating and retreating agent surfaces without disrupting user agency.

[HC-13] astePrint: A 3D Food Printing System for Layer-wise Taste Distribution via Airbrushed Liquid Seasoning

【速读】:该论文旨在解决3D食品打印技术中味道分布单一的问题,即传统方法因可打印材料种类有限,导致最终产品在口感上缺乏多样性。其核心解决方案是提出TastePrint系统,通过在打印过程中动态使用可编程气压喷枪(programmable airbrush)逐层施加液态调味料,实现空间上按层调控的味觉分布。该方案的关键在于集成图形用户界面(GUI)用于设计调味图案,并结合定制化的多喷嘴喷洒机构,使调味料的喷涂位置与强度可精确控制,实验表明其喷洒分辨率(R² = 0.86)和剂量准确性(R² = 0.99)均达到较高水平,从而实现了食物几何结构与风味设计的解耦,为个性化、多感官食品制造提供了技术基础。

链接: https://arxiv.org/abs/2603.22887
作者: Yamato Miyatake,Parinya Punpongsanon
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:3D food printing enables the customization of food shapes and textures, but typically produces uniform taste profiles due to the limited diversity of printable materials. We present TastePrint, a 3D food printing system that achieves layer-wise spatial taste distribution by dynamically applying liquid seasonings with a programmable airbrush during fabrication. The system integrates (1) a graphical user interface (GUI) that allows users to import 3D models, slice them into layers, and specify spray positions and intensities for each layer, and (2) a customized 3D food printer equipped with a multi-nozzle spray mechanism. We evaluated the system through technical experiments quantifying spray resolution and deposition accuracy, together with an exploratory usability study involving three home cooks designing personalized taste patterns. The spray-resolution model achieved R2 = 0.86, the spray-amount model achieved R2 = 0.99, and participants completed the design task in approximately 15 min on average. These results indicate that TastePrint can control seasoning placement and quantity with good repeatability while supporting exploratory taste-design workflows. This work establishes a technical foundation for decoupling food geometry from taste design and motivates future sensory studies on personalized, multisensory food fabrication.

[HC-14] “Dont Look But I Know You Do”: Norms and Observer Effects in Shared LLM Accounts

【速读】:该论文旨在解决生成式 AI 平台(如大语言模型 LLM)当前主要面向单用户设计,而实际使用中却普遍存在账户共享现象所引发的新型社会与技术矛盾问题。其核心挑战在于:如何理解这种共享行为背后的规范形成机制及其在隐私等维度上的脆弱性,并据此重构平台设计逻辑以适应多用户现实。解决方案的关键在于将账户共享视为一种“社会性占取”(social practice of appropriation),通过实证研究识别出四种共享类型(基于使用者是否参与及费用是否分摊),并揭示用户为规避隐私风险而采取的隐性行为调整——这一现象可借助“观察者效应”进行解释。最终提出的设计启示是:应从单一用户视角转向支持多用户协作与规范共治的平台架构。

链接: https://arxiv.org/abs/2603.22822
作者: Ji Eun Song,Eunchae Lee,Juhee Im,Hyunsoo Jang,Eunji Kim,Joongseek Lee
机构: Seoul National University (首尔国立大学); KAIST (韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Account sharing is common in subscription services and is now extending to generative AI platforms, which are still primarily designed for individual use. Sharing often requires workarounds that create new tensions. This study examines how LLM subscriptions are shared and the norms that develop. We combined a survey of 245 users with interviews of 36 participants to understand both patterns and lived experiences. Our analysis identified four types of account sharing, organized along two dimensions: whether the owner uses the account and whether subscription costs are shared. Within these types, we examined how norms were formed and how their fragility, especially privacy, became evident in practice. Users, fully aware of this, subtly adjusted their behavior, which we interpret through the lens of the observer effect. We frame LLM account sharing as a social practice of appropriation and outline design implications to adapt single-user platforms to multi-user realities.

[HC-15] “Dont Mess Up My Algorithm”: Phatic Communication and Algorithmic Contagion in Meme Sharing

【速读】:该论文试图解决的问题是:在算法社交平台(如Instagram)上,用户通过直接消息(DM)交换迷因(meme)的行为本意是作为情感性沟通(phatic communication)以维系人际关系,但用户常误将此类互动视为个性化推荐机制的输入信号,从而产生关系实践与算法控制之间的张力。解决方案的关键在于揭示用户对DM-推荐关联性的认知如何影响其应对策略和无力感,并提出三项设计改进方向:一是提供透明的链接解释(transparent linkage explanations),明确说明DM内容如何影响推荐;二是支持对话层级的退出选项(conversation-level opt-outs),让用户可自主选择是否参与推荐学习;三是采用保守学习机制(conservative learning),降低来自DM内容的信号权重,以缓解算法对人际互动的过度干预。

链接: https://arxiv.org/abs/2603.22817
作者: Ji Eun Song,Hyunsoo Jang,Juhee Im,Joongseek Lee
机构: Seoul National University (首尔国立大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:On algorithmic social platforms, exchanging memes via direct messages (DMs) serves as phatic communication that affirms relationships, yet users often interpret these exchanges as signals shaping personalized recommendations, creating tension between relational practice and algorithmic control. This study examines how users perceive DM meme exchanges on Instagram rather than auditing Instagram’s underlying recommender mechanisms, and how beliefs about DM-recommendation linkages shape coping strategies and feelings of powerlessness. We conducted semi-structured interviews with 21 active meme-DM users. Participants classified memes as recipient-friendly or recipient-unfriendly based on relational fit; many described the spread of unfriendly memes as “algorithmic contagion.” Controls were constrained by relational norms, low perceived efficacy of feedback tools, and opaque DM-recommendation linkages. We articulate how DM-based relational practices are entangled with personalization infrastructures and propose three design implications: transparent linkage explanations, conversation-level opt-outs, and conservative learning that down-weights DM-originated signals.

[HC-16] DiSCo: Diffusion Sequence Copilots for Shared Autonomy

【速读】:该论文旨在解决共享自主(shared autonomy)系统在复杂任务中性能受限的问题,尤其是在高维控制、挑战性任务或存在干扰情况下,如何有效融合人类用户与AI协作者(copilot)的动作以提升任务执行效果。解决方案的关键在于提出扩散序列协作者(Diffusion Sequence Copilots, DiSCo),其核心创新是利用扩散策略(diffusion policy)生成与历史用户动作一致的动作序列,并通过超参数调节机制平衡对专家动作的符合度、用户意图的一致性以及感知响应速度,从而实现更自然、高效且符合用户目标的协同控制。

链接: https://arxiv.org/abs/2603.22787
作者: Andy Wang,Xu Yan,Brandon McMahan,Michael Zhou,Yuyang Yuan,Johannes Y. Lee,Ali Shreif,Matthew Li,Zhenghao Peng,Bolei Zhou,Yuchen Cui,Jonathan C. Kao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 10 pages, 5 figures, HRI '26: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction

点击查看摘要

Abstract:Shared autonomy combines human user and AI copilot actions to control complex systems such as robotic arms. When a task is challenging, requires high dimensional control, or is subject to corruption, shared autonomy can significantly increase task performance by using a trained copilot to effectively correct user actions in a manner consistent with the user’s goals. To significantly improve the performance of shared autonomy, we introduce Diffusion Sequence Copilots (DiSCo): a method of shared autonomy with diffusion policy that plans action sequences consistent with past user actions. DiSCo seeds and inpaints the diffusion process with user-provided actions with hyperparameters to balance conformity to expert actions, alignment with user intent, and perceived responsiveness. We demonstrate that DiSCo substantially improves task performance in simulated driving and robotic arm tasks. Project website: this https URL

[HC-17] From Overload to Convergence: Supporting Multi-Issue Human-AI Negotiation with Bayesian Visualization

【速读】:该论文旨在解决在人机协商(human-AI negotiation)场景中,随着协商议题数量增加,人类参与者因认知负荷上升而导致表现下降的问题。其解决方案的关键在于引入一种基于贝叶斯估计的不确定性可视化方法,该方法动态展示协议可能性空间随协商进展而收缩的过程,从而帮助用户识别有前景的妥协选项。实验结果表明,该可视化机制显著提升了人类决策效果与效率,同时保持了人类对协商过程的控制权,且未引发价值再分配效应,为复杂人机协同决策系统的设计提供了实证依据。

链接: https://arxiv.org/abs/2603.22766
作者: Mehul Parmar,Chaklam Silpasuwanchai
机构: Asian Institute of Technology (亚洲理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted for publication to CHI 2026

点击查看摘要

Abstract:As AI systems increasingly mediate negotiations, understanding how the number of negotiated issues impacts human performance is crucial for maintaining human agency. We designed a human-AI negotiation case study in a realistic property rental scenario, varying the number of negotiated issues; empirical findings show that without support, performance stays stable up to three issues but declines as additional issues increase cognitive load. To address this, we introduce a novel uncertainty-based visualization driven by Bayesian estimation of agreement probability. It shows how the space of mutually acceptable agreements narrows as negotiation progresses, helping users identify promising options. In a within-subjects experiment (N=32), it improved human outcomes and efficiency, preserved human control, and avoided redistributing value. Our findings surface practical limits on the complexity people can manage in human-AI negotiation, advance theory on human performance in complex negotiations, and offer validated design guidance for interactive systems.

[HC-18] Human vs. NAO: A Computational-Behavioral Framework for Quantifying Social Orienting in Autism and Typical Development

【速读】:该论文旨在解决如何通过不同社交刺激源(人类与机器人)量化儿童对名字呼唤的反应差异,从而深化对自闭症谱系障碍(Autism Spectrum Disorder, ASD)中社会定向缺陷的理解,并推动机器人辅助评估工具的发展。其解决方案的关键在于采用视频驱动的计算方法,结合人脸检测、眼区追踪及时空面部分析技术,获取儿童在名字呼唤情境下眼接触、反应延迟、头部与面部朝向变化以及持续兴趣时长等多维行为参数的精细测量,进而比较神经典型与神经异质群体在受控人-机条件下的注意力动态差异。

链接: https://arxiv.org/abs/2603.22759
作者: Vartika Narayani Srinet,Anirudha Bhattacharjee,Braj Bhushan,Bishakh Bhattacharya
机构: Indian Institute of Technology Kanpur (印度理工学院坎普尔分校)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Responding to one’s name is among the earliest-emerging social orienting behaviors and is one of the most prominent aspects in the detection of Autism Spectrum Disorder (ASD). Typically developing children exhibit near-reflexive orienting to their name, whereas children with ASD often demonstrate reduced frequency, increased latency, or atypical patterns of response. In this study, we examine differential responsiveness to quantify name-calling stimuli delivered by both human agents and NAO, a humanoid robot widely employed in socially assistive interventions for autism. The analysis focuses on multiple behavioral parameters, including eye contact, response latency, head and facial orientation shifts, and duration of sustained interest. Video-based computational methods were employed, incorporating face detection, eye region tracking, and spatio-temporal facial analysis, to obtain fine-grained measures of children’s responses. By comparing neurotypical and neuroatypical groups under controlled human-robot conditions, this work aims to understand how the source and modality of social cues affect attentional dynamics in name-calling contexts. The findings advance both the theoretical understanding of social orienting deficits in autism and the applied development of robot-assisted assessment tools.

[HC-19] Designing a Meta-Reflective Dashboard for Instructor Insight into Student-AI Interactions

【速读】:该论文试图解决生成式 AI (Generative AI) 在课程作业辅助中日益普及所带来的教学管理难题:学生与AI的交互过程缺乏透明度,导致教师难以掌握学生的实际学习困难、确保教学目标对齐及执行课程政策,而直接访问原始对话日志又面临可扩展性差和隐私伦理问题。解决方案的关键在于提出一种元反思仪表板(meta-reflective dashboard),通过在每次帮助请求会话后由反射型AI自动生成结构化的会话级摘要,涵盖学生交互轨迹、AI使用模式及潜在风险,从而在不暴露原始聊天记录的前提下提供可操作的教学洞察,有效平衡了教师的需求与学生隐私保护之间的矛盾。

链接: https://arxiv.org/abs/2603.22674
作者: Boxuan Ma,Baofeng Ren,Huiyong Li,Gen Li,Li Chen,Atsushi Shimada,Shin’Ichi Konomi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative AI tools are increasingly used for coursework help, shifting much of students’ help-seeking and reasoning into student-AI chats that are largely invisible to instructors. This loss of visibility can weaken instructors’ ability to understand students’ difficulties, ensure alignment with course goals, and uphold course policies. Yet transcript-level access is neither scalable nor ethically straightforward: reading raw chat logs across a class is impractical, and exposing detailed dialogue can raise privacy concerns and chilling effects on help seeking. As a result, instructors face a tension between needing actionable insight and avoiding default surveillance of student conversations. To address this gap, we propose a meta-reflective dashboard that makes student-AI sessions interpretable without exposing raw chat logs by default. After each help-seeking session, a reflection AI produces a structured, session-level summary of the student’s interaction trajectory, AI usage patterns, and potential risks. We co-designed the dashboard with instructors and students to surface key challenges and design goals, and conducted a formative evaluation of perceived usefulness, trust in the summaries, and privacy acceptability. Findings suggest that the proposed dashboard can reduce instructors’ sensemaking effort while mitigating privacy concerns associated with transcript-level access, and they also yield design implications for evidence, governance, and scalable class-level analytics for AI-supported learning.

[HC-20] Design Implications for Student and Educator Needs in AI-Supported Programming Learning Tools

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 编程助教在教育场景中设计缺乏实证依据的问题,尤其忽视了教师与学生双重视角下的需求差异。其解决方案的关键在于通过并行问卷调查(N=50 教师、N=90 学生)系统比较双方对帮助请求方式、AI响应策略及控制权分配的偏好,发现教师更倾向间接支架式支持以保留学生的推理过程,而学生则偏好直接、可操作的帮助;同时识别出教师关注课程对齐约束与教学监督,学生强调及时性与清晰度。基于此,论文提出以交互为中心的设计空间,并提炼出需平衡学生自主性与教学约束的支架机制与控制策略,为学习导向型 AI 编程助教的设计提供实证指导。

链接: https://arxiv.org/abs/2603.22673
作者: Boxuan Ma,Yinjie Xie,Huiyong Li,Gen Li,Li Chen,Atsushi Shimada,Shin’Ichi Konomi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI-powered coding assistants can support students in programming courses by providing on-demand explanations and debugging help. However, existing research often focuses on individual tools, leaving a gap in evidence-based design recommendations that reflect both educator and student perspectives in education settings. To ground the design of learning-oriented AI coding assistants for both sides’ needs, we conducted parallel surveys of educators (N=50) and students (N=90) to compare preferences about (i) how students should request help, (ii) how AI should respond, and (iii) who should control. Our results show that educators generally favored indirect scaffolding that preserves students’ reasoning, whereas students were more likely to prefer direct, actionable help. Educators further highlighted the need for course-aligned constraints and instructor-facing oversight, while students emphasized timely support and clarity when stuck. Based on these findings, we discuss the interaction-focused design space and derive design implications for learning-oriented AI coding assistants, highlighting scaffolding and control mechanisms that balance students’ agency with instructional constraints.

[HC-21] hree Years with Classroom AI in Introductory Programming: Shifts in Student Awareness Interaction and Performance

【速读】:该论文旨在解决生成式 AI(Generative AI)在编程教育中长期影响的实证研究不足问题,特别是学生对 AI 的认知演变、人机交互模式变化以及课程学习成果如何随时间推移而发展。其解决方案的关键在于通过纵向追踪三个连续的 Python 入门课程 cohorts(2023–2025),结合问卷调查、编码的学生-AI 对话日志和课程评估记录,系统分析学生使用 GenAI 的行为与学习成效之间的动态关系,从而揭示 AI 日常化背景下教学设计的核心挑战:从单纯关注是否使用 AI 转向重构有效的学习实践并保障学生的主体性。

链接: https://arxiv.org/abs/2603.22672
作者: Boxuan Ma,Huiyong Li,Gen Li,Li Chen,Cheng Tang,Atsushi Shimada,Shin’ichi Konomi
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative AI (GenAI) tools such as ChatGPT now provide novice programmers with instant, personalized support and are reshaping computing education. While a growing body of work examines AI’s immediate impacts, longitudinal evidence remains limited on how students’ awareness, student-AI interaction patterns, and course outcomes evolve as AI becomes routine in classrooms. To address this gap, we investigate an introductory Python course across three successive AI-supported cohorts (2023-2025). Using questionnaires, coded student-AI dialogue logs, and course assessment records, we examine cohort-to-cohort shifts in students’ AI awareness, interaction practices, and learning outcomes. We find that students’ relationships with GenAI change systematically over time: familiarity and uptake become increasingly normative, and help-seeking practices evolve alongside growing AI literacy and shifting expectations of what the assistant should provide. These changes suggest that, in the AI era, the central instructional challenge is less about whether students use AI and more about how courses redefine productive learning practices while maintaining student agency. Our study offers longitudinal evidence and practical implications for designing and integrating AI programming support in course settings.

[HC-22] AwesomeLit: Towards Hypothesis Generation with Agent -Supported Literature Research

【速读】:该论文旨在解决 inexperienced researchers(不熟悉研究主题的初学者)在文献调研过程中难以识别现有文献中的研究空白并生成可行假设的问题。传统“深度研究”工具因未针对此场景设计而效果不佳,且大型语言模型(Large Language Models, LLMs)的“黑箱”特性与幻觉问题进一步削弱了用户信任。解决方案的关键在于提出一个名为AwesomeLit的人机协同可视化系统,其核心创新包括:可透明控制的代理工作流(user-steerable agentic workflow)、动态生成的查询探索树(query exploring tree),用于可视化探索路径与来源溯源,以及语义相似性视图(semantic similarity view),用于呈现文献间的关系。这些功能共同支持用户从模糊意图逐步过渡到具体研究方向,并提升对研究结果的信心。

链接: https://arxiv.org/abs/2603.22648
作者: Zefei Xie,Yuhan Guo,Kai Xu
机构: University of Nottingham (诺丁汉大学); Peking University (北京大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There are different goals for literature research, from understanding an unfamiliar topic to generate hypothesis for the next research project. The nature of literature research also varies according to user’s familiarity level of the topic. For inexperienced researchers, identifying gaps in the existing literature and generating feasible hypothesis are crucial but challenging. While general deep research'' tools can be used, they are not designed for such use case, thus often not effective. In addition, the black box" nature and hallucination of Large Language Models (LLMs) often lead to distrust. In this paper, we introduce a human-agent collaborative visualization system AwesomeLit to address this need. It has several novel features: a transparent user-steerable agentic workflow; a dynamically generated query exploring tree, visualizing the exploration path and provenance; and a semantic similarity view, depicting the relationships between papers. It enables users to transition from general intentions to detailed research topics. Finally, a qualitative study involving several early researchers showed that AwesomeLit is effective in helping users explore unfamiliar topics, identify promising research directions, and improve confidence in research results.

[HC-23] When Data Protection Fails to Protect: Law Power and Postcolonial Governance in Bangladesh

【速读】:该论文试图解决的问题是:在孟加拉国快速数字化背景下,新兴的数据保护法规框架(包括《个人数据保护令》《网络安全令》和《国家数据治理令》)如何在实践中协同运作,以及这些框架面临哪些法律与制度障碍导致公民个人数据难以得到有效保护。解决方案的关键在于通过系统性法律与制度分析,揭示当前数据保护体系的结构性缺陷——包括机构独立性不足、监管能力不均、对个体自主性的理想化假设,以及忽视非正式技术社会层(如非正式数据流动和“人桥”中介访问)等问题,从而将数据保护重新定义为一个受全球南方非正式基础设施影响的复杂社会技术设计问题,推动学界从技术-法律二元视角转向更具情境适配性的治理策略。

链接: https://arxiv.org/abs/2603.22637
作者: Pratyasha Saha,Anita Say Chan,Sharifa Sultana
机构: University of Illinois Urbana Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Rapid digitization across government services, financial platforms, and telecommunications has intensified the collection and processing of large scale personal data in Bangladesh. In response, the state has introduced multiple regulatory instruments, including the Personal Data Protection Ordinance, the Cyber Security Ordinance, and the National Data Governance Ordinance in 2025. While these initiatives signal an emerging legal regime for data protection, little scholarly work examines how these frameworks operate collectively in practice. This paper presents a legal and institutional analysis of Bangladeshs emerging data protection regime through a systematic review of these three ordinances. Through this review, the paper provides an integrated mapping of Bangladeshs evolving data protection framework and identifies key legal and institutional barriers that undermine the effective protection of citizens personal data. Our findings reveal that this emerging regime is constrained by limited institutional independence, uneven regulatory capacity, and the misaligned legal assumption of individualized, autonomous data subjects. Furthermore, these frameworks invisibilize prevalent sociotechnical layers, such as informal data flows and mediated access via human bridges, rendering formal protections difficult to operationalize. This paper contributes to HCI scholarship by expanding the concept of data protection as a complex sociotechnical design problem shaped by the informal infrastructures of the Global South.

[HC-24] Learning to Trust: How Humans Mentally Recalibrate AI Confidence Signals

【速读】:该论文旨在解决人机协作中因AI系统信心信号(confidence signals)存在系统性偏差(如过度自信或不足自信)而导致人类对AI信任不当的问题。其解决方案的关键在于揭示人类可通过重复经验学习调整自身对AI信心的解读策略,具体表现为:通过基于线性对数几率(linear-in-log-odds, LLO)变换与Rescorla-Wagner学习规则的计算模型,模拟人类如何动态更新基础信任水平和信心敏感度,并利用不对称的学习速率优先处理最具信息量的错误反馈。研究发现,人类能够有效补偿单调性的AI信心偏差,但在非直观的“反向信心”映射条件下,部分参与者因初始归纳偏见难以调整信任判断,揭示了人类适应AI信心信号的边界条件。

链接: https://arxiv.org/abs/2603.22634
作者: ZhaoBin Li,Mark Steyvers
机构: University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Productive human-AI collaboration requires appropriate reliance, yet contemporary AI systems are often miscalibrated, exhibiting systematic overconfidence or underconfidence. We investigate whether humans can learn to mentally recalibrate AI confidence signals through repeated experience. In a behavioral experiment (N = 200), participants predicted the AI’s correctness across four AI calibration conditions: standard, overconfidence, underconfidence, and a counterintuitive “reverse confidence” mapping. Results demonstrate robust learning across all conditions, with participants significantly improving their accuracy, discrimination, and calibration alignment over 50 trials. We present a computational model utilizing a linear-in-log-odds (LLO) transformation and a Rescorla-Wagner learning rule to explain these dynamics. The model reveals that humans adapt by updating their baseline trust and confidence sensitivity, using asymmetric learning rates to prioritize the most informative errors. While humans can compensate for monotonic miscalibration, we identify a significant boundary in the reverse confidence scenario, where a substantial proportion of participants struggled to override initial inductive biases. These findings provide a mechanistic account of how humans adapt their trust in AI confidence signals through experience.

[HC-25] Emotional Support with Conversational AI: Talking to Machines About Life

【速读】:该论文试图解决的问题是:当前关于人工智能(AI)伴侣聊天机器人提供情感支持的研究多聚焦于结果层面的影响,缺乏对情感支持如何通过交互过程实现的深入理解。解决方案的关键在于将情感支持重新概念化为一种协商性的社会技术过程(negotiated socio-technical process),强调其在对话机制(如验证、反思性提问和陪伴)中被共同建构,并受到在线社区反馈的合法性或质疑所塑造。这一视角揭示了情感支持不仅是人机交互的结果,更是嵌入社会语境中的动态协商产物,从而为设计负责任且情境敏感的AI系统提供了理论基础与实践启示。

链接: https://arxiv.org/abs/2603.22618
作者: Olivia Yan Huang,Monika Stodolska,Sharifa Sultana
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI companion chatbots are increasingly used for emotional support, with prior work in the domain predominantly documenting their mixed psychosocial impacts, including both increased emotional expression and heightened loneliness. However, most existing research primarily focuses on outcome-level effects, offering limited insight into how emotional support is produced through interaction. In this paper, we examine emotional support as an interactional and socially situated process. Drawing on qualitative analysis of Reddit discussions, we analyze how users engage with AI companions and how these interactions are interpreted and contested within online communities. We show that emotional support is coconstructed through conversational mechanisms such as validation, reflective prompting, and companionship, while also giving rise to tensions including support versus dependency, validation versus delusion, and accessibility versus harm. Importantly, support extends beyond human AI interaction and is shaped by community responses that legitimize or challenge AI-mediated care. Hence, we reconceptualize AI emotional support as a negotiated socio-technical process and derive implications for the design of responsible, context-sensitive AI systems.

[HC-26] Do Consumers Accept AIs as Moral Compliance Agents ?

【速读】:该论文试图解决的问题是:消费者普遍对人工智能(Artificial Intelligence, AI)参与道德决策持抵触态度,因其认为道德主体性需要独特的人类特质。解决方案的关键在于重新定义AI在道德场景中的角色——从“道德决策者”转变为“道德合规者”(moral compliance agent),即AI不进行主观判断,而是严格遵循既定的道德规范。研究发现,消费者更倾向于接受AI在此角色中的表现,原因在于他们认为AI缺乏人类常见的隐性动机(ulterior motives),从而提升了对AI的信任与正面评价。这一策略为组织利用AI进行伦理监督提供了可行路径,有助于缓解消费者对AI道德能力的疑虑,并增强企业伦理形象。

链接: https://arxiv.org/abs/2603.22617
作者: Greg Nyilasy,Abraham Ryan Ade Putra Hito,Jennifer Overbeck,Brock Bastian,Darren W. Dahl
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Consumers are generally resistant to Artificial Intelligence (AI) involvement in moral decision-making, perceiving moral agency as requiring uniquely human traits. This research investigates whether consumers might instead accept AIs in the role of moral compliance, where AI upholds pre-existing moral norms without exercising subjective discretion. Across five studies this research shows that consumers evaluate AI more positively than human agents in moral compliance roles. The findings reveal that this preference arises from inferences of AI’s lack of ulterior motives, which are often attributed to human agents. While previous studies have focused on AI as a decision-maker, this work demonstrates the critical role of upholding pre-existing rules, a role in which AI is perceived to excel. These findings contribute to understanding consumer acceptance of moral AI and provide actionable insights for organizations seeking to leverage AI in ethical oversight. By positioning AI as a moral compliance agent, companies can address consumer skepticism, enhance trust, and improve perceptions of corporate ethicality.

[HC-27] BioShield: A Context-Aware Firewall for Securing Bio-LLM s

【速读】:该论文旨在解决生成式AI(Generative AI)在生物领域应用中因双重用途风险(dual-use risks)所带来的安全隐患问题,即Bio-LLMs可能被恶意利用以生成有害的生物学见解,而现有静态提示过滤和策略限制措施无法有效应对嵌入动态生物工作流中的模型。其解决方案的关键在于提出一种上下文感知的应用层防火墙——BioShield,该系统通过两个核心模块实现防护:一是基于生物领域特定威胁分类的提示扫描器,采用定制化的有害评分机制对输入查询进行上下文风险分析;二是输出验证模块,在模型生成内容后检测是否包含可操作或武器化生物信息,并在发现不安全响应时触发受控重生成,从而构建从输入到输出的分层防御体系,推动生物信息安全治理向结构化、可执行的方向发展。

链接: https://arxiv.org/abs/2603.22612
作者: Protiva Das,Sovon Chakraborty,Sidhant Narula,Lucas Potter,Xavier-Lewis Palmer,Pratip Rana,Daniel Takabi,Mohammad Ghasemigol
机构: 未知
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) in biological research has significantly lowered the barrier to accessing complex bioinformatics knowledge, ex perimental design strategies, and analytical workflows. While these capabilities accelerate innovation, they also introduce serious dual-use risks, as Bio-LLMs can be exploited to generate harmful biological insights under the guise of legitimate research queries. Existing safeguards, such as static prompt filtering and policy-based restrictions, are insufficient when LLMs are embedded within dynamic biological workflows and application-layer systems. In this paper, we present BioShield, a context-aware application-level firewall designed to secure Bio LLMs against dual-use attacks. At the core of BioShield is a domain-specific prompt scanner that performs contextual risk analysis of incoming queries. The scanner leverages a harmful scoring mechanism tailored to biological dual-use threat cat egories to identify prompts that attempt to conceal malicious intent within seemingly benign research requests. Queries ex ceeding a predefined risk threshold are blocked before reaching the model, effectively preventing unsafe knowledge generation at the source. In addition to pre-generation protection, BioShield deploys a post-generation output verification module that inspects model responses for actionable or weaponizable biological content. If an unsafe response is detected, the system triggers controlled regeneration under strengthened safety constraints. By combining contextual prompt scanning with response-level validation, BioShield provides a layered defense framework specifically designed for bio-domain LLM deployments. Our framework advances cyberbiosecurity by formalizing dual-use threat detection in Bio-LLMs and proposing a structured mitigation strategy for secure, responsible AI driven biological research.

[HC-28] “Chasing Shadows”: Understanding Personal Data Externalization and Self-Tracking for Neurodivergent Individuals

【速读】:该论文试图解决的问题是:当前自跟踪(self-tracking)技术常被假设能促进自我洞察,但这一假设对神经多样性个体(如自闭症和注意力缺陷多动障碍患者)而言往往不成立,其背后存在认知与情感层面的复杂挑战。解决方案的关键在于通过两阶段定性研究揭示自跟踪在神经多样性群体中引发的显著解释性与情感负担,并提出一个包含三个情感维度的工作模型,强调情境依赖性和情感劳动的重要性;同时指出,通过结构化引导下的同伴分享可增强情绪验证并支持反思,从而为设计更具包容性的自跟踪系统提供依据,需整合同侪支持机制以更好回应用户的情感与认知需求。

链接: https://arxiv.org/abs/2603.22609
作者: Tanya Rudberg Selin,Danielle Unéus,Søren Knudsen
机构: IT University of Copenhagen (哥本哈根信息技术大学); Uppsala University (乌普萨拉大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:We examine how neurodivergent individuals experience creating, interacting with, and reflecting on personal data about masking. Although self-tracking is often framed as enabling self-insight, this is rarely our experience as neurodivergent individuals and researchers. To better understand this disconnect, we conducted a two-phase qualitative study. First, a workshop where six participants with autism and/or ADHD crafted visual representations of masking experiences. Then, three participants continued by designing and using personalized self-tracking focused on unmasking over two weeks. Using reflexive thematic analysis of activities and interviews, we find that self-tracking imposes substantial interpretive and emotional demands, shaped by context-dependencies that challenge assumptions in self-tracking. We also find that facilitated sharing of experiences might validate emotional responses and support reflection. We identify three emotional dimensions that shape engagement with personal data in a working model of emotion in self-tracking, and discuss implications for designing self-tracking and reflective practices that incorporate peer support and better account for context and emotional labor.

[HC-29] Practitioner Voices Summit: How Teachers Evaluate AI Tools through Deliberative Sensemaking

【速读】:该论文试图解决的问题是:教师在将人工智能(AI)工具整合进课堂的过程中面临日益增长的压力,但通常并未被赋予作为自主决策者的角色,导致其对AI工具的评估缺乏系统性和深度。为填补这一空白,研究提出通过组织一场为期两天的全国性研讨会,让61名美国K-12数学教师共同制定个人评估量规(rubric),以促进其对AI教学工具的批判性判断能力。解决方案的关键在于构建“审议式意义建构”(deliberative sensemaking)机制,该机制融合了技术教学内容知识(TPACK)与审议型代理权(deliberative agency),并通过五个核心支持机制——包括审议时间与空间、以成果为中心的意义建构、多元视角下的协作反思、知识共建以及心理安全感——形成一个教师知识积累与评价能力提升相互强化的循环过程。这一设计使教师不仅能够识别AI工具的输出质量,还能深入理解其使用过程中的伦理、公平与灵活性等复杂权衡,从而推动负责任的AI教育应用。

链接: https://arxiv.org/abs/2603.22588
作者: Dorottya Demszky,Christopher Mah,Helen Higgins
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Teachers face growing pressure to integrate AI tools into their classrooms, yet are rarely positioned as agentic decision-makers in this process. Understanding the criteria teachers use to evaluate AI tools, and the conditions that support such reasoning, is essential for responsible AI integration. We address this gap through a two-day national summit in which 61 U.S. K-12 mathematics educators developed personal rubrics for evaluating AI classroom tools. The summit was designed to support deliberative sensemaking, a process we conceptualize by integrating Technological Pedagogical Content Knowledge (TPACK) with deliberative agency. Teachers generated over 200 criteria - initial articulations spanning four higher-order themes (Practical, Equitable, Flexible, and Rigorous) - that addressed both AI outputs and the process of using AI. Criteria contained productive tensions (e.g., personalization versus fairness, adaptability versus efficiency), and the vast majority framed AI as an assistant rather than a coaching tool for professional learning. Analysis of surveys, interviews, and summit discussions revealed five mechanisms supporting deliberative sensemaking: time and space for deliberation, artifact-centered sensemaking, collaborative reflection through diverse viewpoints, knowledge-building, and psychological safety. Across these mechanisms, TPACK and agency operated in a mutually reinforcing cycle - knowledge-building enabled more grounded evaluative judgment, while the act of constructing criteria deepened teachers’ understanding of tools. We discuss implications for edtech developers seeking practitioner input, school leaders making adoption decisions, educators and professional learning designers, and researchers working to elicit teachers’ evaluative reasoning about rapidly evolving technologies.

[HC-30] Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing CVPR2026

【速读】:该论文旨在解决资源受限的边缘与可穿戴人工智能系统中持续高保真RGB视频采集成本过高的问题。解决方案的关键在于提出了一种“灰度始终开启、彩色按需激活”的新范式,其核心创新是ColorTrigger机制:通过连续的灰度流保留时间结构,仅在必要时触发彩色帧捕获,利用轻量级二次规划进行因果性色彩冗余检测,并结合信用预算控制与动态令牌路由策略,在不显著牺牲性能的前提下大幅降低感知与推理开销。实验表明,该方法在保持91.6%全彩基准性能的同时,仅使用8.1%的RGB帧,有效揭示了自然视频中的显著色彩冗余特性,为边缘端实用化的持续视频感知提供了可行路径。

链接: https://arxiv.org/abs/2603.22466
作者: Weitong Cai,Hang Zhang,Yukai Huang,Shitong Sun,Jiankang Deng,Songcen Xu,Jifei Song,Zhensong Zhang
机构: Queen Mary University of London (伦敦玛丽女王大学); Durham University (杜伦大学); Imperial College London (帝国理工学院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: Accepted at CVPR 2026 (Main track)

点击查看摘要

Abstract:Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.

[HC-31] Working towards a dialectical understanding of the political ideology within technological projects

【速读】:该论文试图解决的问题是如何理解技术项目的政治意识形态,即如何揭示技术项目背后的价值观与限制条件如何共同构成其意识形态属性。解决方案的关键在于构建一个辩证框架(dialectical framework),该框架强调必须同时考察项目的内在价值取向和外在约束机制,才能全面把握其政治意识形态本质。这一方法借鉴了批判性与解放性社会科学的讨论,主张将技术项目视为社会建构的产物,而非中立工具。

链接: https://arxiv.org/abs/2603.22436
作者: Frederick Reiber
机构: Boston University (波士顿大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Position paper for the CHIdeology workshop at CHI 2026, Barcelona. this https URL

点击查看摘要

Abstract:In this short position paper, I develop a dialectical framework for understanding the political ideology of technological projects. To do so, I draw on critical and emancipatory social science discussions, highlighting how both a project’s values and constraints are necessary for understanding its ideology. A brief example is then presented to aid comprehension.

计算机视觉

[CV-0] OccAny: Generalized Unconstrained Urban 3D Occupancy CVPR2026

【速读】:该论文旨在解决现有3D占用预测方法在域外泛化能力不足的问题,特别是其对特定领域标注数据和精确传感器标定先验的依赖限制了模型在未见场景中的可扩展性和适应性。解决方案的关键在于提出OccAny——首个无需约束的都市3D占用预测模型,能够处理未校准的域外场景,并实现度量级占用预测与分割特征联合输出;其核心创新包括:(i) 首个通用3D占用框架,(ii) Segmentation Forcing机制提升占用质量并支持掩码级别预测,以及(iii) 新视角渲染(Novel View Rendering)流水线,通过测试时视图增强实现几何补全,从而显著提升模型在复杂城市场景下的鲁棒性和性能表现。

链接: https://arxiv.org/abs/2603.23502
作者: Anh-Quan Cao,Tuan-Hung Vu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at this https URL .

[CV-1] UniGRPO: Unified Policy Optimization for Reasoning -Driven Visual Generation

【速读】:该论文旨在解决多模态生成任务中如何实现文本与图像的交错生成(interleaved generation)问题,特别是通过推理驱动的图像生成流程提升生成质量。其核心挑战在于统一优化文本推理与图像合成策略,同时确保模型在复杂多轮交互场景下的可扩展性与稳定性。解决方案的关键在于提出一种统一的强化学习框架 UniGRPO,将推理阶段的文本生成与视觉合成阶段的图像生成联合优化:一方面采用标准的 GRPO(Generalized Reward Policy Optimization)处理文本推理;另一方面对 FlowGRPO 进行关键改进——移除分类器无关引导(classifier-free guidance)以维持线性无分支的轨迹 rollout,从而支持多轮交互和条件编辑等复杂场景;并用速度场上的均方误差(MSE)惩罚替代标准潜空间 KL 散度惩罚,提供更直接且鲁棒的正则化信号,有效防止奖励劫持(reward hacking)。该方法为未来全交错生成模型的后训练提供了高质量、可扩展的基准。

链接: https://arxiv.org/abs/2603.23500
作者: Jie Liu,Zilyu Ye,Linxiao Yuan,Shenhan Zhu,Yu Gao,Jie Wu,Kunchang Li,Xionghui Wang,Xiaonan Nie,Weilin Huang,Wanli Ouyang
机构: 1. Chinese Academy of Sciences (中国科学院); 2. Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.

[CV-2] DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

【速读】:该论文旨在解决光学流(Optical Flow)模型在面对真实世界退化(如模糊、噪声和压缩伪影)时性能显著下降的问题。现有方法依赖高质量训练数据,难以适应实际场景中的图像退化。解决方案的关键在于提出一种“退化感知光学流”(Degradation-Aware Optical Flow, DA-Flow)框架,其核心创新是利用图像恢复扩散模型的中间特征表示——这些特征天然具备退化感知能力但缺乏时间一致性;通过引入全时空注意力机制(full spatio-temporal attention)使其具备跨帧感知能力,从而实现零样本对应关系估计。在此基础上,DA-Flow进一步融合扩散特征与卷积特征,并在迭代优化框架中提升精度,显著优于现有方法在多种严重退化条件下的表现。

链接: https://arxiv.org/abs/2603.23499
作者: Jaewon Min,Jaeeun Lee,Yeji Choi,Paul Hyunbin Cho,Jin Hyeon Kim,Tae-Young Lee,Jongsik Ahn,Hwayeong Lee,Seonghyun Park,Seungryong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

[CV-3] WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

【速读】:该论文旨在解决当前视频世界模型在建模复杂动态系统时面临的两大核心问题:一是现有数据集缺乏语义丰富且多样化的动作空间,导致模型难以学习到结构化的世界演化规律;二是动作与视觉观测直接耦合,而非通过潜在状态进行中介,使得模型无法实现长期一致的状态演进。其解决方案的关键在于构建一个大规模、高精度的动作条件世界建模数据集 WildWorld,该数据集源自逼真的 AAA 动作角色扮演游戏(Monster Hunter: Wilds),包含超过 10800 万帧图像及超过 450 种语义明确的动作(如移动、攻击和技能释放),并配有每帧的骨骼姿态、世界状态、相机位姿和深度图等显式状态标注。这一设计使模型能够从动作驱动的状态转移中学习更稳定、可解释的世界动力学,从而推动面向状态感知的生成式视频建模发展。

链接: https://arxiv.org/abs/2603.23497
作者: Zhen Li,Zian Meng,Shuwei Shi,Wenshuo Peng,Yuwei Wu,Bo Zheng,Chuanhao Li,Kaipeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is this https URL.

[CV-4] VISion On Request: Enhanced VLLM efficiency with sparse dynamically selected vision-language interactions CVPR2026

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在推理效率上的瓶颈问题,尤其是现有基于视觉标记压缩(visual token reduction)的方法会引入信息瓶颈,导致在需要细粒度理解与推理的复杂任务中性能下降。其解决方案的关键在于提出VISion On Request (VISOR) 方法,通过稀疏化图像与文本标记之间的交互而非丢弃视觉信息来提升效率:具体而言,语言模型始终访问高分辨率视觉标记,但仅通过少量精心设计的注意力层实现高效计算——其中,通用视觉上下文由文本-图像交叉注意力提供,而少数动态选择的自注意力层则用于精细化视觉表征,从而在需要时支持高分辨率复杂推理。这一机制使得模型可在不同计算预算下训练单一通用网络,并结合轻量级策略机制根据样本复杂度动态分配视觉计算资源,显著降低计算成本的同时保持或超越当前最优性能。

链接: https://arxiv.org/abs/2603.23495
作者: Adrian Bulat,Alberto Baldrati,Ioannis Maniadis Metaxas,Yassine Ouali,Georgios Tzimiropoulos
机构: Samsung AI Cambridge; Technical University of Iasi; Queen Mary University of London
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

[CV-5] Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

【速读】:该论文旨在解决高分辨率图像与视频生成过程中计算复杂度随生成token数量呈二次增长而导致的效率瓶颈问题。其解决方案的关键在于利用人类视觉系统的偏心依赖敏感性(eccentricity-dependent acuity),即人眼在注视区域(foveal region)具有高分辨率感知能力,而在周边视野中分辨率迅速下降。作者提出一种基于视网膜聚焦(foveated)的非均匀token分配策略,通过设计一个模拟视锥区域的mask,在生成过程中对不同空间位置分配不同密度的token:中心区域采用高密度token以保证细节质量,外围区域则降低密度以显著减少token总数和生成时间。进一步地,论文构建了一种从高分辨率数据直接生成混合分辨率token的机制,使基线扩散模型可在不破坏跨分辨率内容一致性的前提下进行后训练,从而实现高效且感知无差别的生成效果。

链接: https://arxiv.org/abs/2603.23491
作者: Brian Chao,Lior Yariv,Howard Xiao,Gordon Wetzstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this https URL

点击查看摘要

Abstract:Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user’s gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.

[CV-6] Agent RVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

【速读】:该论文旨在解决训练-free的指代表达视频目标分割(Referring Video Object Segmentation, RVOS)任务中,现有方法因依赖多模态大语言模型(Multimodal Large Language Model, MLLM)在无对象证据前提下进行时序决策而导致推理质量低、时空覆盖不足的问题。其解决方案的关键在于提出AgentRVOS,一个基于SAM3与MLLM协同工作的训练-free智能体流水线:首先利用SAM3生成全时空范围内的可靠掩码轨迹(mask tracks),为后续推理提供对象级证据;随后由MLLM基于查询引导的推理迭代地识别目标,并借助SAM3提供的时序存在性信息进行引导修剪,从而实现更精准且鲁棒的目标定位。

链接: https://arxiv.org/abs/2603.23489
作者: Woojeong Jin,Jaeho Lee,Heeseong Shin,Seungho Jang,Junhwan Heo,Seungryong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: a MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and a MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3’s temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: this https URL.

[CV-7] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

【速读】:该论文旨在解决单目新视角合成(Monocular novel-view synthesis)任务中对多视图图像对监督的依赖问题,从而限制了训练数据规模与多样性。传统方法需成对的多视角图像进行训练,而本文提出仅使用单张图像即可实现高质量的新视角生成。其核心解决方案是OVIE框架,关键在于利用单目深度估计器在训练阶段构建几何结构:将源图像提升至3D空间,应用采样的相机变换后投影得到伪目标视图;同时引入掩码训练策略,仅在有效区域(即无遮挡区域)计算几何、感知和纹理损失,从而可在3000万未标注的网络图像上进行训练。推理时无需任何深度估计或3D表示,模型完全几何无关,且在零样本场景下性能优于现有方法,速度比次优基线快600倍。

链接: https://arxiv.org/abs/2603.23488
作者: Adrien Ramanana Rahary,Nicolas Dufour,Patrick Perez,David Picard
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 16 figures

点击查看摘要

Abstract:Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at this https URL.

[CV-8] ETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

【速读】:该论文旨在解决事件相机(event camera)运动估计中因依赖大规模合成数据而导致的仿真到现实(sim-to-real)差距问题,从而提升真实场景下运动估计的准确性与泛化能力。其解决方案的关键在于提出一种基于知识蒸馏的师生框架TETO(Tracking Events with Teacher Observation),通过仅需约25分钟未标注的真实世界事件数据,利用预训练RGB跟踪器作为教师模型提供监督信号,实现对事件流中物体运动的有效学习;同时,采用运动感知的数据筛选和查询采样策略,有效分离目标运动与主导的自运动(ego-motion),显著提升了有限数据下的学习效率,并进一步将精确的点轨迹与稠密光流作为显式运动先验,用于条件化预训练视频扩散变换器进行帧插值,最终在多个基准上实现了优于现有方法的性能。

链接: https://arxiv.org/abs/2603.23487
作者: Jini Yang,Eunbeen Hong,Soowon Son,Hyunkoo Lee,Sunghwan Hong,Sunok Kim,Seungryong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only \sim 25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

[CV-9] VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

【速读】:该论文旨在解决视频动作模型(Video-Action Models, VAMs)在接触密集型场景中因视觉信息不完整而导致的动作预测不稳定与精度不足的问题,尤其在需要精细力控和接触状态感知的任务中表现受限。其解决方案的关键在于引入触觉反馈作为视觉感知的互补信号,构建了视频-触觉动作模型(Video-Tactile Action Model, VTAM),通过轻量级模态迁移微调策略将触觉流融合进预训练视频Transformer,实现无需触觉-语言配对数据或独立触觉预训练的跨模态表示学习;同时设计触觉正则化损失以平衡跨模态注意力机制,抑制视觉潜在表示的主导效应,从而显著提升接触相关任务中的动作执行鲁棒性与成功率。

链接: https://arxiv.org/abs/2603.23481
作者: Haoran Yuan,Weigang Yi,Zhenyu Zhang,Wendi Chen,Yuchen Mo,Jiashi Yin,Xinzhuo Li,Xiangyu Zeng,Chuan Wen,Cewu Lu,Katherine Driggs-Campbell,Ismini Lourentzou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: this https URL

点击查看摘要

Abstract:Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

[CV-10] UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

【速读】:该论文旨在解决3D场景中功能分割(Functionality Segmentation)问题,即如何将自然语言指令精准地映射到细粒度交互元素的掩码上。现有方法依赖碎片化的流水线,在任务解析初期存在视觉盲区,且受限于单尺度、被动和启发式帧选择策略。其解决方案的关键在于提出UniFunc3D——一个统一且无需训练的框架,将多模态大语言模型(Multimodal Large Language Model, MLLM)作为主动观察者,通过一次前向传播整合语义、时空推理,实现对任务分解的联合推理,并采用粗到精的主动时空定位策略,自适应选择关键视频帧并聚焦高细节交互区域,同时保留全局上下文以消除歧义。该方法在SceneFun3D数据集上显著优于现有训练-free与训练-based方法,mIoU相对提升达59.9%。

链接: https://arxiv.org/abs/2603.23478
作者: Jiaying Lin,Dan Xu
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: this https URL.

[CV-11] InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting CVPR’26

【速读】:该论文旨在解决扩散模型在图像修复(image inpainting)任务中因采样步数过多导致效率低下,而少步数文本到图像生成模型直接应用于修复时又因随机高斯噪声初始化引发语义错位与区域融合质量差的问题。解决方案的关键在于提出一种名为 InverFill 的单步逆向方法,通过将输入掩码图像中的语义信息注入初始噪声中,实现语义对齐的噪声初始化,从而在少量函数评估(NFEs)下显著提升修复图像的保真度与文本一致性,且无需真实图像监督或复杂迭代优化。

链接: https://arxiv.org/abs/2603.23463
作者: Duc Vu,Kien Nguyen,Trong-Tung Nguyen,Ngan Nguyen,Phong Nguyen,Khoi Nguyen,Cuong Pham,Anh Tran
机构: Qualcomm AI Research (Qualcomm AI Research); Posts Telecommunications Inst. of Tech., Vietnam (Posts 电信技术研究所,越南)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR’26 (Main Conference)

点击查看摘要

Abstract:Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.

[CV-12] RealMaster: Lifting Rendered Scenes into Photorealistic Video

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 视频模型在视觉真实感与3D一致性之间的矛盾问题:一方面,现有视频生成模型虽能实现高保真度的图像质量,但缺乏对场景几何结构和动态行为的精确控制;另一方面,基于3D引擎的渲染方法虽具备良好的几何一致性和可控性,却难以达到自然真实的视觉效果(即“恐怖谷效应”)。解决方案的关键在于提出 RealMaster 方法,通过结合视频扩散模型与3D引擎输出的优势,利用锚点引导的帧间传播策略构建配对训练数据,并在此基础上训练一种改进型 LoRA(IC-LoRA)模块,从而将高质量渲染结果迁移到通用视频生成框架中,实现无需依赖锚定帧即可保持输入3D控制信息(如几何、动态和身份)的同时显著提升整体画面的真实感。

链接: https://arxiv.org/abs/2603.23462
作者: Dana Cohen-Bar,Ido Sobol,Raphael Bensadoun,Shelly Sheynin,Oran Gafni,Or Patashnik,Daniel Cohen-Or,Amit Zohar
机构: Tel Aviv University (特拉维夫大学); Reality Labs, Meta (Meta现实实验室); Technion (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the “uncanny valley”. Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline’s constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

[CV-13] DetPO: In-Context Learning with Multi-Modal LLM s for Few-Shot Object Detection

【速读】:该论文旨在解决多模态大语言模型(Multi-Modal LLMs, MLLMs)在少样本目标检测任务中泛化能力不足的问题,尤其是在分布外(out-of-distribution)类别、任务和成像模态下性能下降明显。尽管当前主流方法依赖于上下文提示(in-context prompting)来提升跨任务表现,但研究发现其检测准确率常低于仅使用类别名称的提示方式,表明现有MLLMs尚无法有效利用少量视觉示例和丰富文本描述进行推理。为此,作者提出了一种无需访问模型内部参数的黑盒提示优化方法——Detection Prompt Optimization (DetPO),其核心在于通过梯度自由的测试时优化策略,在少量视觉训练样本上迭代调整纯文本提示,以最大化检测准确率并校准预测置信度,从而显著提升通用型MLLMs在Roboflow20-VL和LVIS等基准上的少样本目标检测性能,相比已有黑盒方法最高提升达9.7%。

链接: https://arxiv.org/abs/2603.23455
作者: Gautam Rajendrakumar Gare,Neehar Peri,Matvei Popov,Shruti Jain,John Galeotti,Deva Ramanan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at this https URL

[CV-14] 3DCity-LLM : Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在3D城市尺度环境中的感知与理解能力不足的问题,尤其是在从以物体为中心或室内场景向大规模城市空间扩展时面临的挑战。其解决方案的关键在于提出一个统一框架——3DCity-LLM,采用粗粒度到细粒度的特征编码策略,包含三个并行分支:目标物体特征、物体间关系特征以及全局场景特征;同时构建了一个高质量、大规模(约120万样本)的城市级视觉-语言数据集3DCity-LLM-1.2M,融合显式3D数值信息和多样化的用户导向模拟,从而提升问答多样性与城市场景的真实性,并通过基于文本相似性指标和大语言模型(LLM)语义评估的多维评测协议保障评估的准确性与全面性。实验表明,该方法在两个基准测试中显著优于现有最先进方法,为推进空间推理与城市智能提供了有前景的方向。

链接: https://arxiv.org/abs/2603.23447
作者: Yiping Chen,Jinpeng Li,Wenyu Ke,Yang Luo,Jie Ouyang,Zhongjie He,Li Liu,Hongchao Fan,Hao Wu
机构: Sun Yat-sen University (中山大学); Wuhan University (武汉大学); Norwegian University of Science and Technology (挪威科技大学); National Grid Computing Center (国家电网计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 11 figures, 12 tables

点击查看摘要

Abstract:While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at this https URL.

[CV-15] SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images

【速读】:该论文旨在解决地震成像中气 chimney(气体烟囱)检测与增强的难题,其核心挑战在于强地震衰减和散射导致的图像质量下降,以及传统物理模型方法计算成本高、对模型误差敏感,而深度学习方法因缺乏标注数据难以有效应用。解决方案的关键在于构建了一个名为 \textbfSIGMA 的新型物理驱动数据集,该数据集包含像素级气 chimney 标注用于检测任务,并提供退化与真实值图像对用于增强任务,从而为气 chimney 的精确识别与地震图像恢复提供高质量训练样本,同时具备广泛的地质条件和采集场景覆盖,显著提升了地震解释的准确性与鲁棒性。

链接: https://arxiv.org/abs/2603.23439
作者: Bao Truong,Quang Nguyen,Baoru Huang,Jinpei Han,Van Nguyen,Ngan Le,Minh-Tan Pham,Doan Huy Hien,Anh Nguyen
机构: FPT Software AI Center; Imperial College London; University of Arkansas; University of South Brittany; University of Liverpool
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

点击查看摘要

Abstract:Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce \textbfSIGMA, a new physics-based dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney mask for detection and (ii) paired degraded and ground-truth image for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.

[CV-16] I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation

【速读】:该论文旨在解决视频生成中长期场景一致性问题,即在重复访问先前探索区域时保持视觉一致性。现有方法要么依赖显式三维几何构建(易受误差累积和尺度模糊影响),要么采用简单的相机视场角(Field-of-View, FoV)检索策略,在复杂遮挡下表现不佳。解决方案的关键在于提出一种隐式3D感知记忆机制(I3DM),其核心是利用预训练前馈新视角合成(Feed-Forward Novel View Synthesis, FF-NVS)模型的中间特征进行视图相关性评分,从而在高度遮挡场景下仍能鲁棒地检索历史帧;同时引入3D对齐的记忆注入模块,隐式将历史内容映射至目标视角并自适应地基于可靠变形区域条件化生成,显著提升重访一致性和相机控制精度。

链接: https://arxiv.org/abs/2603.23413
作者: Jia Li,Han Yan,Yihang Chen,Siqi Li,Xibin Song,Yifu Wang,Jianfei Cai,Tien-Tsin Wong,Pan Ji
机构: Vertex Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.

[CV-17] GeoSANE: Learning Geospatial Representations from Models Not Data

【速读】:该论文旨在解决当前遥感领域中多个基础模型(foundation models)各自独立训练、仅覆盖部分地理空间知识,且能力互补而非统一的问题。现有模型虽在特定任务上表现优异,但缺乏跨模态、跨任务的通用性与协同效应。解决方案的关键在于提出GeoSANE——一个地理空间模型工厂(geospatial model foundry),通过学习已有基础模型和任务特定模型的权重,构建一个统一的神经表示,并能按需生成新的网络权重。该方法不依赖预训练,而是基于目标架构动态生成可微调的权重,在分类、分割和检测等多任务、多模态场景下显著优于从头训练模型,且在轻量化模型生成方面超越剪枝或知识蒸馏方法,实现了地理空间知识的有效融合与迁移。

链接: https://arxiv.org/abs/2603.23408
作者: Joelle Hanna,Damian Falk,Stella X. Yu,Damian Borth
机构: University of St.Gallen (圣加仑大学); University of Michigan (密歇根大学); UC Berkeley (加州大学伯克利分校); ESA Φ\Phi-Lab (欧洲空间局Φ实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in remote sensing have led to an increase in the number of available foundation models; each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural networks weights on-demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities. Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities. By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at \hrefthis https URLthis http URL.

[CV-18] Harnessing Lightweight Transformer with Contextual Synergic Enhancement for Efficient 3D Medical Image Segmentation

【速读】:该论文旨在解决Transformer在3D医学图像分割中面临的两个核心挑战:模型计算效率低和对大量标注数据的依赖。为提升模型效率,提出Light-UNETR架构,其关键创新在于引入轻量级维度缩减注意力(Lightweight Dimension Reductive Attention, LIDR)模块,通过多分支注意力机制在降低空间与通道维度的同时保留全局与局部特征;同时设计紧凑门控线性单元(Compact Gated Linear Unit, CGLU),以极少参数实现通道间交互的可控调节。为增强数据效率,提出情境协同增强(Contextual Synergic Enhancement, CSE)学习策略,利用外部上下文信息通过注意力引导替换(Attention-Guided Replacement)辅助未标注数据学习,并结合空间掩码一致性(Spatial Masking Consistency)挖掘内在上下文信息以强化空间语义推理能力。实验表明,该方法在显著减少浮点运算次数(FLOPs)和参数量的同时,仅用10%标注数据即可超越现有方法。

链接: https://arxiv.org/abs/2603.23390
作者: Xinyu Liu,Zhen Chen,Wuyang Li,Chenxin Li,Yixuan Yuan
机构: The Chinese University of Hong Kong (香港中文大学); Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science Innovation, Chinese Academy of Sciences (中国科学院香港科学创新研究院人工智能与机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to IEEE TPAMI

点击查看摘要

Abstract:Transformers have shown remarkable performance in 3D medical image segmentation, but their high computational requirements and need for large amounts of labeled data limit their applicability. To address these challenges, we consider two crucial aspects: model efficiency and data efficiency. Specifically, we propose Light-UNETR, a lightweight transformer designed to achieve model efficiency. Light-UNETR features a Lightweight Dimension Reductive Attention (LIDR) module, which reduces spatial and channel dimensions while capturing both global and local features via multi-branch attention. Additionally, we introduce a Compact Gated Linear Unit (CGLU) to selectively control channel interaction with minimal parameters. Furthermore, we introduce a Contextual Synergic Enhancement (CSE) learning strategy, which aims to boost the data efficiency of Transformers. It first leverages the extrinsic contextual information to support the learning of unlabeled data with Attention-Guided Replacement, then applies Spatial Masking Consistency that utilizes intrinsic contextual information to enhance the spatial context reasoning for unlabeled data. Extensive experiments on various benchmarks demonstrate the superiority of our approach in both performance and efficiency. For example, with only 10% labeled data on the Left Atrial Segmentation dataset, our method surpasses BCP by 1.43% Jaccard while drastically reducing the FLOPs by 90.8% and parameters by 85.8%. Code is released at this https URL.

[CV-19] SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

【速读】:该论文旨在解决当前3D生成技术中缺乏“可模拟”(sim-ready)交互式关节物体的问题,尤其是现有方法多集中于静态网格生成,难以支持物理仿真和具身智能应用。同时,已有方法依赖多阶段流水线导致误差累积,而统一的多模态大语言模型(MLLM)虽能实现静态资产理解与生成一体化,但基于密集体素的3D标记化会带来长序列和高内存开销,限制复杂关节物体的扩展性。解决方案的关键在于提出SIMART框架,通过引入稀疏3D矢量量化变分自编码器(Sparse 3D VQ-VAE),将3D标记数量减少70%,从而显著降低计算负担并提升多部件装配的保真度,实现零件级分解与运动学预测的联合建模,最终在PartNet-Mobility和真实场景AIGC数据集上达到最优性能,并支持基于物理的机器人仿真。

链接: https://arxiv.org/abs/2603.23386
作者: Chuanrui Zhang,Minghan Qin,Yuang Wang,Baifeng Xie,Hang Li,Ziwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in “sim-ready” interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.

[CV-20] From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching

【速读】:该论文旨在解决当前非刚性三维形状匹配方法中存在的两个关键问题:一是现有基于函数映射(functional map)的方法多聚焦于学习点对点或函数级的特征表示,而忽略了谱基(spectral basis)这一核心组件的优化,导致匹配性能受限;二是多数方法依赖传统耗时的函数映射求解器,带来显著计算开销。解决方案的关键在于提出先进函数映射(Advanced Functional Maps)框架,其核心创新是将固定谱基替换为可学习的基函数,并通过一组学习得到的抑制函数(inhibition functions)实现谱基的优化,从而在无需昂贵求解器和辅助损失的情况下,实现特征提取与谱基的端到端联合优化。该方法还引入了新颖的热扩散模块和无监督损失函数,在保持高效率的同时显著提升了在非等距形变和拓扑噪声等挑战场景下的匹配鲁棒性。进一步理论分析表明,谱基优化等价于谱卷积,抑制函数即为滤波器,为基于谱图网络的表示增强提供了新思路。

链接: https://arxiv.org/abs/2603.23383
作者: Feifan Luo,Hongyang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Shape matching is a fundamental task in computer graphics and vision, with deep functional maps becoming a prominent paradigm. However, existing methods primarily focus on learning informative feature representations by constraining pointwise and functional maps, while neglecting the optimization of the spectral basis-a critical component of the functional map pipeline. This oversight often leads to suboptimal matching results. Furthermore, many current approaches rely on conventional, time-consuming functional map solvers, incurring significant computational overhead. To bridge these gaps, we introduce Advanced Functional Maps, a framework that generalizes standard functional maps by replacing fixed basis functions with learnable ones, supported by rigorous theoretical guarantees. Specifically, the spectral basis is optimized through a set of learned inhibition functions. Building on this, we propose the first unsupervised spectral basis learning method for robust non-rigid 3D shape matching, enabling the joint, end-to-end optimization of feature extraction and basis functions. Our approach incorporates a novel heat diffusion module and an unsupervised loss function, alongside a streamlined architecture that bypasses expensive solvers and auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art feature-learning approaches, particularly in challenging non-isometric and topological noise scenarios, while maintaining high efficiency. Finally, we reveal that optimizing basis functions is equivalent to spectral convolution, where inhibition functions act as filters. This insight enables enhanced representations inspired by spectral graph networks, opening new avenues for future research. Our code is available at this https URL.

[CV-21] FG-Portrait: 3D Flow Guided Editable Portrait Animation CVPR2026

【速读】:该论文旨在解决人脸动画中从驱动视频到源人脸的运动迁移问题,现有基于扩散模型的方法仅依赖驱动动作进行条件控制,无法捕捉源图像与驱动图像之间的对应关系,导致运动迁移效果不佳;而基于光流的方法因从二维输入预测稠密对应关系存在病态性,常产生不准确的动画结果。其解决方案的关键在于引入无需学习的几何驱动型3D流(3D flows),通过参数化3D头模型直接计算像素级运动对应关系,并设计3D流编码机制将该先验融入扩散模型,用于查询目标像素对应的潜在3D位移以回溯至源位置;同时提出深度引导采样策略,确保3D点与2D运动变化对齐,从而实现高保真且一致的运动迁移与身份保留。

链接: https://arxiv.org/abs/2603.23381
作者: Yating Xu,Yunqi Miao,Evangelos Ververas,Jiankang Deng,Jifei Song
机构: National University of Singapore(新加坡国立大学); University of Warwick(华威大学); Imperial College London(帝国理工学院); University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Motion transfer from the driving to the source portrait remains a key challenge in the portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer. Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation.

[CV-22] ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

【速读】:该论文旨在解决当前视频世界模型在具身仿真与规划中生成物理上不合理操作(如物体穿透和反重力运动)的问题,其根源在于模型训练数据的通用性及基于似然函数的目标忽略了物理规律。解决方案的关键在于提出ABot-PhysWorld——一个14B参数的扩散Transformer模型,它基于包含三百万条带物理标注的操作视频数据集进行训练,并引入一种基于直接偏好优化(DPO)的后训练框架,通过解耦判别器有效抑制非物理行为同时保持视觉质量;此外,采用并行上下文块实现空间动作的精准注入,从而支持跨具身形态的动作控制。

链接: https://arxiv.org/abs/2603.23376
作者: Yuzhi Chen,Ronghan Chen,Dongjie Huo,Yandan Yang,Dekang Qi,Haoyun Liu,Tong Lin,Shuang Zeng,Junjin Xiao,Xinyuan Chang,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.

[CV-23] Object Pose Transformer: Unifying Unseen Object Pose Estimation

【速读】:该论文旨在解决**无监督场景下未见过物体的位姿估计(pose estimation)**这一挑战,即如何在不依赖类别标签或预定义分类体系的前提下,同时实现类别级绝对位姿(absolute pose)和跨视角相对位姿(relative pose)的准确估计。现有方法通常分为两类:一类基于类别级别的绝对位姿预测,但需依赖预先定义的语义分类;另一类则仅能估计多视角间的相对变换,无法恢复单视图的绝对位姿。本文提出Object Pose Transformer(\ours),其核心创新在于通过任务分解(task factorization)构建一个统一的前馈框架,联合预测深度图、点云映射(point maps)、相机参数及归一化物体坐标(NOCS),从而在同一个模型中同时支持SA(3)绝对位姿和SE(3)相对位姿估计。关键突破在于利用对比学习的物体中心潜在嵌入(object-centric latent embeddings)实现无需语义标签的归一化处理,并以点图为相机空间表示,支持多视角几何一致性推理,显著提升单视图位姿估计的鲁棒性与准确性。

链接: https://arxiv.org/abs/2603.23370
作者: Weihang Li,Lorenzo Garattoni,Fabien Despinoy,Nassir Navab,Benjamin Busam
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Toyota Motor Europe (丰田欧洲公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Learning model-free object pose estimation for unseen instances remains a fundamental challenge in 3D vision. Existing methods typically fall into two disjoint paradigms: category-level approaches predict absolute poses in a canonical space but rely on predefined taxonomies, while relative pose methods estimate cross-view transformations but cannot recover single-view absolute pose. In this work, we propose Object Pose Transformer (\ours), a unified feed-forward framework that bridges these paradigms through task factorization within a single model. \ours jointly predicts depth, point maps, camera parameters, and normalized object coordinates (NOCS) from RGB inputs, enabling both category-level absolute SA(3) pose and unseen-object relative SE(3) pose. Our approach leverages contrastive object-centric latent embeddings for canonicalization without requiring semantic labels at inference time, and uses point maps as a camera-space representation to enable multi-view relative geometric reasoning. Through cross-frame feature interaction and shared object embeddings, our model leverages relative geometric consistency across views to improve absolute pose estimation, reducing ambiguity in single-view predictions. Furthermore, \ours is camera-agnostic, learning camera intrinsics on-the-fly and supporting optional depth input for metric-scale recovery, while remaining fully functional in RGB-only settings. Extensive experiments on diverse benchmarks (NOCS, HouseCat6D, Omni6DPose, Toyota-Light) demonstrate state-of-the-art performance in both absolute and relative pose estimation tasks within a single unified architecture.

[CV-24] FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures

【速读】:该论文旨在解决现有3D数字人像重建方法中面部与头发组件耦合建模导致的灵活性不足、依赖密集多视角数据或高成本个体优化等问题。其解决方案的关键在于提出FHAvatar框架,通过在纹理空间中显式解耦面部和头发表示:面部采用平面高斯(planar Gaussians)建模,头发则使用基于细丝的高斯(strand-based Gaussians)表示;同时引入聚合Transformer骨干网络,从多视角数据中学习几何感知的跨视角先验与头部-头发结构一致性,从而实现仅需少量随意拍摄图像即可快速重建高质量、可实时动画化且支持发型迁移与风格编辑的3D高斯人像。

链接: https://arxiv.org/abs/2603.23345
作者: Yujie Sun,Zhuoqiang Cai,Chaoyue Niu,Jianchuan Chen,Zhiwen Chen,Chengfei Lv,Fan Wu
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.

[CV-25] An Explainable AI-Driven Framework for Automated Brain Tumor Segmentation Using an Attention-Enhanced U-Net

【速读】:该论文旨在解决胶质瘤(glioma)在MRI图像中复杂且不规则的肿瘤区域分割难题,尤其是针对其内部亚区(如坏死核心、水肿和增强肿瘤)的精准识别问题。传统人工分割方法耗时且不可靠,难以满足临床诊断与治疗规划的需求。解决方案的关键在于:1)基于U-Net架构引入注意力门控机制(attention gates),使模型聚焦于图像中最具判别性的区域;2)设计个性化损失函数(Dice Loss与Categorical Dice Loss结合标准分类交叉熵),有效缓解类别不平衡问题;3)采用Grad-CAM结合高斯滤波生成平滑热力图,提升模型可解释性,从而实现高精度、可信赖的自动分割结果,最终在BraTS 2020数据集上达到Dice系数0.9901、准确率0.9919等优异性能。

链接: https://arxiv.org/abs/2603.23344
作者: MD Rashidul Islam,Bakary Gibba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-aided segmentation of brain tumors from MRI data is of crucial significance to clinical decision-making in diagnosis, treatment planning, and follow-up disease monitoring. Gliomas, owing to their high malignancy and heterogeneity, represent a very challenging task for accurate and reliable segmentation into intra-tumoral sub-regions. Manual segmentation is typically time-consuming and not reliable, which justifies the need for robust automated this http URL research resolves this problem by leveraging the BraTS 2020 dataset, where we have labeled MRI scans of glioma patients with four significant classes: background/healthy tissue, necrotic/non-enhancing core, edema, and enhancing tumor. In this work, we present a new segmentation technique based on a U-Net model augmented with executed attention gates to focus on the most significant regions of images. To counter class imbalance, we employ manually designed loss functions like Dice Loss and Categorical Dice Loss, in conjunction with standard categorical cross-entropy. Other evaluation metrics, like sensitivity and specificity, were used to measure discriminability of the model between tumor classes. Besides, we introduce Grad-CAM-based explainable AI to enable visualizing attention regions and improve model interpretability, together with a smooth heatmap generation technique through Gaussian filtering. Our approach achieved superior performance with accuracy of 0.9919, Dice coefficient of 0.9901, mean IoU of 0.9873, sensitivity of 0.9908, and specificity of 0.9974. This study demonstrates that the use of attention mechanisms, personalized loss functions, and explainable AI significantly improves highly complex tumor structure segmentation precision in MRI scans, providing a reliable and explainable method for clinical applications.

[CV-26] Strain-Parameterized Coupled Dynamics and Dual-Camera Visual Servoing for Aerial Continuum Manipulators

【速读】:该论文旨在解决 tendon-driven aerial continuum manipulators (TD-ACMs) 在耦合动力学建模中存在的计算成本高以及未显式考虑飞行平台欠驱动特性的问题。解决方案的关键在于提出一种广义的动力学公式化方法,将基于应变参数化的 Cosserat 杆模型(strain-parameterized Cosserat rod model)与无人机(UAV)的刚体模型统一嵌入到 SE(3) 流形上的拉格朗日常微分方程(Lagrangian ODE)框架中,从而避免复杂的符号推导;在此基础上进一步设计了一种鲁棒的双目视觉伺服(dual-camera image-based visual servoing, IBVS)控制方案,能够缓解传统 IBVS 的视场(FoV)限制、补偿由 UAV 侧向动力学引起的图像运动,并通过低层自适应控制器处理建模不确定性,同时提供形式化的稳定性保证。

链接: https://arxiv.org/abs/2603.23333
作者: Niloufar Amiri,Farrokh Janabi-Sharifi
机构: Toronto Metropolitan University (多伦多都会大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tendon-driven aerial continuum manipulators (TD-ACMs) combine the maneuverability of uncrewed aerial vehicles (UAVs) with the compliance of lightweight continuum robots (CRs). Existing coupled dynamic modeling approaches for TD-ACMs incur high computational costs and do not explicitly account for aerial platform underactuation. To address these limitations, this paper presents a generalized dynamic formulation of a coupled TD-ACM with an underactuated base. The proposed approach integrates a strain-parameterized Cosserat rod model with a rigid-body model of the UAV into a unified Lagrangian ordinary differential equation (ODE) framework on \mathrmSE(3) , thereby eliminating computationally intensive symbolic derivations. Building upon the developed model, a robust dual-camera image-based visual servoing (IBVS) scheme is introduced. The proposed controller mitigates the field-of-view (FoV) limitations of conventional IBVS, compensates for attitude-induced image motion caused by UAV lateral dynamics, and incorporates a low-level adaptive controller to address modeling uncertainties with formal stability guarantees. Extensive simulations and experimental validation on a compact custom-built prototype demonstrate the effectiveness and robustness of the proposed framework in real-world scenarios.

[CV-27] ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

【速读】:该论文旨在解决基于Transformer的视频扩散模型在处理超高清分辨率视频时面临的计算复杂度高、训练成本昂贵的问题,尤其是3D注意力机制导致的时间和内存复杂度为二次方级增长,限制了端到端训练的可行性。其解决方案的关键在于提出一种纯图像适应框架(pure image adaptation framework),通过两阶段的LoRA(Low-Rank Adaptation)策略实现高效升级:第一阶段使用低分辨率图像将预训练视频扩散模型适配至图像域以弥合图像与视频模态间的差距;第二阶段则利用高分辨率图像进一步学习空间外推能力,从而在推理阶段仅保留高分辨率适应部分,既保持视频生成能力又支持超高清视频合成。此外,引入高频感知训练目标(High-Frequency-Awareness-Training-Objective)通过专用重建损失显式增强模型从退化潜在表示中恢复高频细节的能力,显著提升视觉质量。

链接: https://arxiv.org/abs/2603.23326
作者: Yunfeng Wu,Hongying Cheng,Zihao He,Songhua Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at this https URL.

[CV-28] Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors

【速读】:该论文旨在解决无位姿(pose-free)全景3D高斯溅射(Omnidirectional 3D Gaussian Splatting, 3DGS)方法在重建精度和新视角合成质量上的局限性,特别是现有方法依赖于计算缓慢的结构光与运动(Structure-from-Motion, SfM)来提供相机位姿和稀疏点先验的问题。其解决方案的关键在于提出了一种名为PFGS360的新方法:首先设计了一个球面一致性感知的位姿估计模块(spherical consistency-aware pose estimation module),利用高斯自身的深度先验建立2D-3D对应关系以恢复相机位姿;其次引入一个深度内点感知的稠密化模块(depth-inlier-aware densification module),通过提取具有单目深度一致性的深度内点和高斯异常点,实现高效且高质量的高斯稠密化,从而显著提升新视角合成的保真度与视觉真实感。

链接: https://arxiv.org/abs/2603.23324
作者: Chuanqing Zhuang,Xin Lu,Zehui Deng,Zhengda Lu,Yiqun Wang,Junqi Diao,Jun Xiao
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Chongqing University (重庆大学); Air Force Engineering University (空军工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D-3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians’ internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. The experiments show significant outperformance over existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos. Code is available at this https URL.

[CV-29] ARGENT: Adaptive Hierarchical Image-Text Representations

【速读】:该论文旨在解决当前超几何视觉-语言模型(Hyperbolic Vision-Language Models, VLMs)中存在的两个核心问题:一是现有模型采用的蕴含损失(entailment loss)不稳定,导致父节点嵌入向原点收缩时蕴含锥体(entailment cone)过度扩张,引发灾难性锥体坍缩(cone collapse),破坏语义层次结构;二是现有的层次评估方法依赖于检索和相关性指标,易受分类体系(taxonomy)影响且对负样本定义模糊,评估结果不可靠。解决方案的关键在于提出一种自适应蕴含损失(adaptive entailment loss)结合范数正则化(norm regularizer),有效防止锥体坍缩而无需启发式角度裁剪;同时引入基于角度的概率蕴含协议(angle-based probabilistic entailment protocol, PEP),通过AUC-ROC和平均精度(Average Precision)进行更可靠、可解释的层次理解评估。由此构建的超几何VLM基线模型ARGENT在图像分类、图文检索及新提出的层次指标上分别提升0.7、1.1和0.8绝对分数,显著优于现有最先进方法。

链接: https://arxiv.org/abs/2603.23311
作者: Chuong Huynh,Hossein Souri,Abhinav Kumar,Vitali Petsiuk,Deen Dayal Mohan,Suren Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable, being largely retrieval-based and correlation-based metrics and prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline ARGENT, Adaptive hieRarchical imaGe-tExt represeNTation. ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.

[CV-30] Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression

【速读】:该论文旨在解决从三维计算机断层扫描(CT)图像中自动生成自由文本放射学报告的难题,其核心挑战包括极端序列长度、严重类别不平衡以及大语言模型(LLM)倾向于忽略视觉特征而依赖语言先验的问题。解决方案的关键在于提出一种四阶段课程学习框架——Ker-VLJEPA-3B,该框架通过渐进式训练将Llama 3.2 3B解码器与一个冻结的自监督视觉编码器(LeJEPA ViT-Large)对齐,从而确保生成内容基于患者特定的视觉信息。创新点包括:区域约束交叉注意力压缩切片嵌入为32个空间定位的视觉标记、主成分分析(PCA)白化各向异性LLM嵌入、仅基于阳性发现的策略避免后验崩溃、桥接阶段的权重初始化转移机制,以及结合弹性权重固化的选择性交叉注意力冻结策略以防止灾难性遗忘。这一设计实现了无需文本监督即可集成任意自监督视觉编码器至LLM中的模态无关架构,显著提升了报告生成质量,在CT-RATE基准上达到0.429宏F1分数,优于当前最优方法(U-VLM, 0.414),且56.6%的生成质量来源于患者特异性视觉内容。

链接: https://arxiv.org/abs/2603.23308
作者: V. K. Cody Bumgardner,Mitchell A. Klusty,Mahmut S. Gokmen,Evan W. Damron
机构: Center for Applied Artificial Intelligence, University of Kentucky, Lexington, KY 40506 USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum’s bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.

[CV-31] Drop-In Perceptual Optimization for 3D Gaussian Splatting

【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)方法在渲染过程中因依赖像素级损失函数而导致图像模糊的问题,从而提升视觉感知质量。其解决方案的关键在于系统性地探索多种失真损失函数,并通过大规模人类主观实验(共39,320次成对评分)验证不同损失的效果,最终提出一种正则化的Wasserstein失真损失(WD-R),该方法在不增加点 splat 数量的前提下显著改善了细节纹理的恢复能力,且在多个数据集和框架(如Mip-Splatting与Scaffold-GS)中均表现出优越的感知质量,同时在场景压缩任务中实现了约50%的码率节省。

链接: https://arxiv.org/abs/2603.23297
作者: Ezgi Ozyilkan,Zhiqi Chen,Oren Rippel,Jona Ballé,Kedar Tatwawadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses. We conduct the first-of-its-kind large-scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD-R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD-R is preferred by raters more than 2.3\times over the original 3DGS loss, and 1.5\times over current best method Perceptual-GS. WD-R also consistently achieves state-of-the-art LPIPS, DISTS, and FID scores across various datasets, and generalizes across recent frameworks, such as Mip-Splatting and Scaffold-GS, where replacing the original loss with WD-R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip-Splatting, model size for Scaffold-GS), and leads to reconstructions being preferred by human raters 1.8\times and 3.6\times , respectively. We also find that this carries over to the task of 3DGS scene compression, with \approx 50% bitrate savings for comparable perceptual metric performance.

[CV-32] Mamba-driven MRI-to-CT Synthesis for MRI-only Radiotherapy Planning

【速读】:该论文旨在解决放射治疗流程中多模态医学影像(如MRI与CT)融合带来的问题,特别是传统MRI到CT合成方法依赖于卷积神经网络(CNN)时存在的局限性,例如难以有效捕捉长距离依赖关系和复杂体积特征。为提升生成质量与推理效率,作者提出采用基于状态空间模型(State-Space Model, SSM)的Mamba架构替代主流的nnU-Net框架,其关键在于利用Mamba结构对3D医学图像中复杂的体素级特征和远距离上下文关系进行高效建模,从而实现高精度CT图像合成并保持快速推理速度,同时通过Hounsefield Units (HU)相似性指标与TotalSegmentator分割一致性验证几何保真度,推动状态空间模型在放疗工作流中的应用落地。

链接: https://arxiv.org/abs/2603.23295
作者: Konstantinos Barmpounakis,Theodoros P. Vagenas,Maria Vakalopoulou,George K. Matsopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radiotherapy workflows for oncological patients increasingly rely on multi-modal medical imaging, commonly involving both Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). MRI-only treatment planning has emerged as an attractive alternative, as it reduces patient exposure to ionizing radiation and avoids errors introduced by inter-modality registration. While nnU-Net-based frameworks are predominantly used for MRI-to-CT synthesis, we explore Mamba-based architectures for this task, aiming to showcase the advantages of state-space modeling for cross-modality translation compared to standard convolutional neural networks. Specifically, we adapt both the U-Mamba and the SegMamba architecture, originally proposed for segmentation, to perform cross-modality image generation. Our 3D Mamba architecture effectively captures complex volumetric features and long-range dependencies, thus allowing accurate CT synthesis while maintaining fast inference times. Experiments were conducted on a subset of SynthRAD2025 dataset, comprising registered single-channel MRI-CT volume pairs across three anatomical regions. Quantitative evaluation is performed via a combination of image similarity metrics computed in Hounsefield Units (HU) and segmentation-based metrics obtained from TotalSegmentator to ensure geometric consistency is preserved. The findings pave the way for the integration of state-space models into radiotherapy workflows.

[CV-33] Knot-10:A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis

【速读】:该论文旨在解决物理结(Physical Knot)的细粒度视觉分类(Fine-Grained Visual Classification, FGVC)问题,其核心挑战在于:不同类别的结在绳索材质、颜色和背景上完全一致,类别差异仅体现在交叉结构(crossing structure)这一细微几何特征上。为此,作者构建了Knots-10基准数据集,采用部署导向的划分策略——训练时使用松散打结样本,测试时使用紧密整理的结,以模拟实际应用场景。关键解决方案包括:(1)引入TACA正则化方法,通过增强嵌入空间与拓扑距离的一致性(相关系数从ρ=0.46提升至ρ=0.65),从而改善模型对交叉结构的感知能力;(2)发现通用骨干网络在性能上无显著差异(McNemar检验不显著),表明小排名差距需谨慎解读;(3)揭示拓扑距离与混淆模式存在显著相关性(Mantel检验p < 0.01),说明模型误判往往发生在拓扑相似的结之间。此外,跨域测试暴露了模型对绳索外观的依赖性偏差,导致手机照片上的准确率下降58–69个百分点,提示未来工作需进一步减少外观偏倚。

链接: https://arxiv.org/abs/2603.23286
作者: Shiheng Nie,Yunguang Yue
机构: Shihezi University (石河子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 48 pages, 12 figures, 10 supplementary sections

点击查看摘要

Abstract:Physical knot classification is a fine-grained visual classification (FGVC) scenario in which appearance cues are deliberately suppressed: different classes share the same rope material, color, and background, and class identity resides primarily in crossing structure. We introduce the Knots-10 benchmark, comprising 1,440 images with a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy; PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. McNemar tests cannot separate four of the five general-purpose backbones, so small ranking margins should be interpreted with caution. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p 0.01). We propose TACA regularization, which improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy; a random-distance ablation yields comparable alignment, indicating the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs reveals a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.

[CV-34] WaveSFNet: A Wavelet-Based Codec and Spatial–Frequency Dual-Domain Gating Network for Spatiotemporal Prediction IJCNN2026

【速读】:该论文旨在解决无监督视频预测中长期动态建模与高频细节保持之间的矛盾问题,即如何在保证多步预测清晰度的同时高效捕捉时空演化规律。现有方法常因采用下采样策略(如步长卷积或池化)导致纹理和边界信息丢失,而纯空间操作又难以兼顾局部交互与全局传播。其解决方案的关键在于提出WaveSFNet框架,通过小波编码器-解码器结构在下采样与重建过程中保留高频子带特征,同时设计空间-频率双域门控时空翻译器:首先注入相邻帧差分以增强动态信息,再结合大核空间局部建模与频域全局调制进行门控融合,并引入门控通道交互实现跨通道特征交换,从而在低计算复杂度下实现高保真预测性能。

链接: https://arxiv.org/abs/2603.23284
作者: Xinyong Cai,Runming Xie,Hu Chen,Yuankai Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IJCNN 2026

点击查看摘要

Abstract:Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for sampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial–frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. Meanwhile, the translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at this https URL.

[CV-35] CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection CVPR2026

【速读】:该论文旨在解决多模态融合在跨域场景下3D目标检测性能显著下降的问题,尤其针对雨天或夜间等挑战性环境中某一模态(如LiDAR点云)严重退化,以及LiDAR分支主导检测过程导致视觉信息利用不足和鲁棒性差的两大瓶颈。其解决方案的关键在于提出三个核心组件:1)Query-Decoupled Loss通过为仅2D、仅3D及融合查询提供独立监督,重新平衡跨模态梯度流;2)LiDAR-Guided Depth Prior利用概率融合图像预测与LiDAR获取的深度分布,为2D查询注入实例感知的几何先验以优化空间初始化;3)Complementary Cross-Modal Masking对图像与点云施加互补空间掩码,促使双模态查询在融合解码器中竞争协作,从而实现自适应融合策略。

链接: https://arxiv.org/abs/2603.23276
作者: Yuchen Wu,Kun Wang,Yining Pan,Na Zhao
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Multi-modal fusion has emerged as a promising paradigm for accurate 3D object detection. However, performance degrades substantially when deployed in target domains different from training. In this work, focusing on dual-branch proposal-level detectors, we identify two factors that limit robust cross-domain generalization: 1) in challenging domains such as rain or nighttime, one modality may undergo severe degradation; 2) the LiDAR branch often dominates the detection process, leading to systematic underutilization of visual cues and vulnerability when point clouds are compromised. To address these challenges, we propose three components. First, Query-Decoupled Loss provides independent supervision for 2D-only, 3D-only, and fused queries, rebalancing gradient flow across modalities. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors through probabilistic fusion of image-predicted and LiDAR-derived depth distributions, improving their spatial initialization. Third, Complementary Cross-Modal Masking applies complementary spatial masks to the image and point cloud, encouraging queries from both modalities to compete within the fused decoder and thereby promoting adaptive fusion. Extensive experiments demonstrate substantial gains over state-of-the-art baselines while preserving source-domain performance. Code and models are publicly available at this https URL.

[CV-36] Multi-Modal Image Fusion via Intervention-Stable Feature Learning CVPR2026

【速读】:该论文旨在解决多模态图像融合中因依赖数据集诱导的虚假关联(spurious associations)而导致模型在分布偏移下性能下降的问题。现有方法主要基于统计相关性优化,难以区分真实因果依赖与伪相关。解决方案的关键在于引入基于因果推理的干预框架,通过三种原则性干预策略——互补掩码(测试模态间是否能真正补偿缺失信息)、相同区域随机掩码(识别在部分可观测下仍具信息量的特征子集)以及模态丢弃(评估各模态不可替代的贡献),从而识别稳定跨模态依赖关系。在此基础上提出因果特征集成器(Causal Feature Integrator, CFI),利用自适应不变性门控机制学习并优先选择在多种扰动模式下保持重要性的特征,最终实现对鲁棒模态依赖关系的建模而非虚假相关性。

链接: https://arxiv.org/abs/2603.23272
作者: Xue Wang,Zheng Guan,Wenhua Qian,Chengchao Wang,Runzhuo Ma
机构: Yunnan University (云南大学); Nanyang Normal University (南阳师范学院); Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accpted by CVPR 2026

点击查看摘要

Abstract:Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl’s causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other’s missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.

[CV-37] GO-Renderer: Generative Object Rendering with 3D-aware Controllable Video Diffusion Models

【速读】:该论文旨在解决从图像中重建可渲染三维模型时,现有方法难以准确建模复杂外观的问题,尤其是如何在任意视角和光照条件下实现高质量的对象渲染。其解决方案的关键在于提出了一种统一框架GO-Renderer,该框架将重建的3D代理(3D proxy)与基于扩散的生成模型相结合,利用3D代理提供精确的视角控制,同时借助扩散生成模型在不显式建模复杂材质和光照的情况下,实现不同光照环境下的高质量渲染。

链接: https://arxiv.org/abs/2603.23246
作者: Zekai Gu,Shuoxuan Feng,Yansong Wang,Hanzhuo Huang,Zhongshuo Du,Chengfeng Zhao,Chengwei Ren,Peng Wang,Yuan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Reconstructing a renderable 3D model from images is a useful but challenging task. Recent feedforward 3D reconstruction methods have demonstrated remarkable success in efficiently recovering geometry, but still cannot accurately model the complex appearances of these 3D reconstructed models. Recent diffusion-based generative models can synthesize realistic images or videos of an object using reference images without explicitly modeling its appearance, which provides a promising direction for object rendering, but lacks accurate control over the viewpoints. In this paper, we propose GO-Renderer, a unified framework integrating the reconstructed 3D proxies to guide the video generative models to achieve high-quality object rendering on arbitrary viewpoints under arbitrary lighting conditions. Our method not only enjoys the accurate viewpoint control using the reconstructed 3D proxy but also enables high-quality rendering in different lighting environments using diffusion generative models without explicitly modeling complex materials and lighting. Extensive experiments demonstrate that GO-Renderer achieves state-of-the-art performance across the object rendering tasks, including synthesizing images on new viewpoints, rendering the objects in a novel lighting environment, and inserting an object into an existing video.

[CV-38] Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人精细操作任务中表现不佳的问题,其核心瓶颈在于缺乏主动视觉注意力分配机制。解决方案的关键在于提出一种 gaze-regularized(眼动正则化)训练框架,通过将时间聚合的眼动热图转化为 patch-level 分布,并利用 KL 散度对 Transformer 的内部注意力进行正则化,从而在不改变模型架构且无推理时开销的前提下,引导模型注意力聚焦于任务相关的视觉特征。该方法不仅提升了模型性能(在多个操作基准上提升 4–12%),还增强了鲁棒性与可解释性,使学习到的注意力模式能反映人类策略,从而提升人机信任度。

链接: https://arxiv.org/abs/2603.23202
作者: Anupam Pani,Yanchao Yang
机构: University Of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns – offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models’ internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer’s attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.

[CV-39] FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation ECCV2026

【速读】:该论文旨在解决深度学习在3D医学图像分割中对大规模标注数据的高度依赖问题,而这些数据因隐私限制和专家标注成本高昂难以获取。解决方案的关键在于提出一种基于隐式函数的公式驱动监督学习框架(Formula-Driven Supervised Learning with Implicit Functions, FDIF),其核心创新是采用符号距离函数(Signed Distance Function, SDF)作为隐式表示,从而实现无需真实数据和医学专家标注的可扩展预训练。FDIF通过SDF的表面表示支持几何结构与强度纹理的可控合成,在保持复杂形态紧凑建模的同时提升了生成样本的真实性,显著优于现有基于体素的公式驱动方法,并达到与大规模真实数据自监督预训练相当的性能。

链接: https://arxiv.org/abs/2603.23199
作者: Yukinori Yamamoto,Kazuya Nishimura,Tsukasa Fukusato,Hirokazu Nosato,Tetsuya Ogata,Hirokatsu Kataoka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ECCV2026

点击查看摘要

Abstract:Deep learning-based 3D medical image segmentation methods relies on large-scale labeled datasets, yet acquiring such data is difficult due to privacy constraints and the high cost of expert annotation. Formula-Driven Supervised Learning (FDSL) offers an appealing alternative by generating training data and labels directly from mathematical formulas. However, existing voxel-based approaches are limited in geometric expressiveness and cannot synthesize realistic textures. We introduce Formula-Driven supervised learning with Implicit Functions (FDIF), a framework that enables scalable pre-training without using any real data and medical expert annotations. FDIF introduces an implicit-function representation based on signed distance functions (SDFs), enabling compact modeling of complex geometries while exploiting the surface representation of SDFs to support controllable synthesis of both geometric and intensity textures. Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently improves over a formula-driven method, and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets. We further show that FDIF pre-training also benefits 3D classification tasks, highlighting implicit-function-based formula supervision as a promising paradigm for data-free representation learning. Code is available at this https URL.

[CV-40] PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning CVPR2026

【速读】:该论文旨在解决跨多样三维形状(3D shapes)和离散化方式(discretizations)的实时物理驱动动画(physics-based animation)泛化难题。其核心挑战在于如何在不同几何结构和网格划分下保持变形场的物理一致性与计算效率。解决方案的关键是提出PhysSkin框架,该框架通过学习连续的皮肤场(skinning fields)作为基函数,将由控制手柄变换定义的低维运动子空间坐标映射到全空间变形;其中,皮肤场由基于Transformer的编码器与交叉注意力解码器构成的神经皮肤场自编码器生成,并结合一种新颖的物理信息自监督学习策略——包含在线皮肤场归一化和冲突感知梯度修正机制,从而有效平衡能量最小化、空间平滑性和正交性约束,实现高保真、实时且泛化能力强的物理驱动动画。

链接: https://arxiv.org/abs/2603.23194
作者: Yuanhang Lei,Tao Cheng,Xingxuan Li,Boming Zhao,Siyuan Huang,Ruizhen Hu,Peter Yichen Chen,Hujun Bao,Zhaopeng Cui
机构: Zhejiang University (浙江大学); BIGAI; Shenzhen University (深圳大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by CVPR 2026. Project Page: this https URL

点击查看摘要

Abstract:Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder. Furthermore, we also develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints. PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.

[CV-41] Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在第一人称行为理解中因忽略眼动信息而导致预测能力受限的问题。现有方法仅依赖视觉数据,未能有效利用人类注视(eye gaze)所蕴含的意图和注意力线索。解决方案的关键在于提出一种基于眼动正则化的框架,通过在训练过程中直接将眼动信息嵌入VLM架构:一方面生成基于眼动的查询(gaze-based queries),使模型动态聚焦于被注视区域;另一方面引入眼动正则化机制(gaze-regularization mechanism),确保模型注意力与人类注意力模式对齐。实验表明,该方法显著提升了未来事件预测的语义准确性,相较不使用眼动信息的基线模型提升近13%。

链接: https://arxiv.org/abs/2603.23190
作者: Anupam Pani,Yanchao Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13 % improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.

[CV-42] ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting CVPR2026

【速读】:该论文旨在解决视频大语言模型(Video Large Language Models, VideoLLMs)在采用帧选择等高效策略以降低计算成本时,因跳过中间帧而导致的时间推理能力下降问题。其核心挑战在于:当前方法虽能减少冗余信息,但会破坏视频中的时间连续性,使模型难以准确理解事件发展过程。解决方案的关键在于引入视觉提示(Visual Prompting, VP),通过为每帧添加显式的序数信息(ordinal information),增强模型对时间连续性的感知能力,并结合轻量级关键词-帧映射(Keyword-Frame Mapping, KFM)模块,在不依赖训练的情况下,利用帧索引作为字典键将文本线索精准关联到最相关的帧,从而提供明确的时间锚点,显著提升时间推理性能,同时保持与密集帧基线相当的效果,仅需20%的帧即可实现。

链接: https://arxiv.org/abs/2603.23186
作者: Yeonkyung Lee,Dayun Ju,Youngmin Kim,Seil Kang,Seong Jae Hwang
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR2026

点击查看摘要

Abstract:Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.

[CV-43] Gimbal360: Differentiable Auto-Leveling for Canonicalized 360circ Panoramic Image Completion

【速读】:该论文旨在解决从非姿态对齐的透视图像中进行360°全景补全(panoramic completion)的问题,其核心挑战在于透视投影与球面全景图之间的几何和拓扑不匹配。解决方案的关键在于提出Gimbal360框架,该框架通过引入一个**规范视图空间(Canonical Viewing Space)来统一两种表示域的投影几何,并设计了一个可微自动校平模块(Differentiable Auto-Leveling module)**以在无需相机参数的情况下稳定特征方向;同时,为应对等距圆柱投影(Equirectangular Projection, ERP)固有的S¹周期性带来的边界连续性破坏问题,提出在潜在空间中强制拓扑等变性(topological equivariance),从而保持无缝的周期结构。这一系列设计使得模型在结构一致性上达到当前最优性能。

链接: https://arxiv.org/abs/2603.23179
作者: Yuqin Lu,Haofeng Liu,Yang Zhou,Jun Liang,Shengfeng He,Jing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Diffusion models excel at 2D outpainting, but extending them to 360^\circ panoramic completion from unposed perspective images is challenging due to the geometric and topological mismatch between perspective projections and spherical panoramas. We present Gimbal360, a principled framework that explicitly bridges perspective observations and spherical panoramas. We introduce a Canonical Viewing Space that regularizes projective geometry and provides a consistent intermediate representation between the two domains. To anchor in-the-wild inputs to this space, we propose a Differentiable Auto-Leveling module that stabilizes feature orientation without requiring camera parameters at inference. Panoramic generation also introduces a topological challenge. Standard generative architectures assume a bounded Euclidean image plane, while Equirectangular Projection (ERP) panoramas exhibit intrinsic S^1 periodicity. Euclidean operations therefore break boundary continuity. We address this mismatch by enforcing topological equivariance in the latent space to preserve seamless periodic structure. To support this formulation, we introduce Horizon360, a curated large-scale dataset of gravity-aligned panoramic environments. Extensive experiments show that explicitly standardizing geometric and topological priors enables Gimbal360 to achieve state-of-the-art performance in structurally consistent 360^\circ scene completion.

[CV-44] GSwap: Realistic Head Swapping with Dynamic Neural Gaussian Field

【速读】:该论文旨在解决视频中人脸与头部替换(head-swapping)任务中存在的三大核心问题:一是现有方法多依赖2D生成模型或3D Morphable Face Models (3DMM),导致3D一致性差、面部表情不自然及合成质量受限;二是全头替换时因整体头部建模不足和背景融合效果不佳,常出现可见伪影和错位;三是缺乏高效且高质量的域适应机制以支持跨域迁移。解决方案的关键在于提出GSwap系统,其核心创新为引入嵌入在全身SMPL-X表面中的动态神经高斯特征场(dynamic neural Gaussian portrait priors),将2D肖像视频转化为具有高保真度和3D一致性的神经高斯场,从而实现自然的头部-躯干关系保持与流畅运动动态;同时通过少量参考图像对预训练2D肖像生成模型进行域适配,并设计神经重渲染策略以实现前景与原始背景的无缝融合,有效消除拼接伪影并提升整体真实感。

链接: https://arxiv.org/abs/2603.23168
作者: Jingtao Zhou,Xuan Gao,Dongyu Liu,Junhui Hou,Yudong Guo,Juyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TVCG, Project page: this https URL

点击查看摘要

Abstract:We present GSwap, a novel consistent and realistic video head-swapping system empowered by dynamic neural Gaussian portrait priors, which significantly advances the state of the art in face and head replacement. Unlike previous methods that rely primarily on 2D generative models or 3D Morphable Face Models (3DMM), our approach overcomes their inherent limitations, including poor 3D consistency, unnatural facial expressions, and restricted synthesis quality. Moreover, existing techniques struggle with full head-swapping tasks due to insufficient holistic head modeling and ineffective background blending, often resulting in visible artifacts and misalignments. To address these challenges, GSwap introduces an intrinsic 3D Gaussian feature field embedded within a full-body SMPL-X surface, effectively elevating 2D portrait videos into a dynamic neural Gaussian field. This innovation ensures high-fidelity, 3D-consistent portrait rendering while preserving natural head-torso relationships and seamless motion dynamics. To facilitate training, we adapt a pretrained 2D portrait generative model to the source head domain using only a few reference images, enabling efficient domain adaptation. Furthermore, we propose a neural re-rendering strategy that harmoniously integrates the synthesized foreground with the original background, eliminating blending artifacts and enhancing realism. Extensive experiments demonstrate that GSwap surpasses existing methods in multiple aspects, including visual quality, temporal coherence, identity preservation, and 3D consistency.

[CV-45] Dual Contrastive Network for Few-Shot Remote Sensing Image Scene Classification

【速读】:该论文旨在解决少样本遥感图像场景分类(Few-shot Remote Sensing Image Scene Classification, FS-RSISC)中的核心挑战,即遥感图像固有的小类间差异和大类内差异问题。为应对这一挑战,作者提出了一种基于迁移学习的双对比网络(Dual Contrastive Network, DCN),其关键在于引入两个辅助的监督对比学习分支:一是上下文引导的对比学习(Context-guided Contrastive Learning, CCL)分支,通过设计压缩网络(Condenser Network)提取上下文特征,并在此基础上进行监督对比学习,增强类间可分性;二是细节引导的对比学习(Detail-guided Contrastive Learning, DCL)分支,通过设计熔炉网络(Smelter Network)突出显著局部细节信息,并基于细节特征图构建监督对比学习,强化类内不变性。这两个分支协同优化模型在不同粒度上的特征表示能力,从而提升少样本条件下的分类性能。

链接: https://arxiv.org/abs/2603.23161
作者: Zhong Ji,Liyuan Hou,Xuan Wang,Gang Wang,Yanwei Pang
机构: Tianjin University (天津大学); CETC Key Laboratory of Aerospace Information Applications (中国电子科技集团公司航空航天信息应用重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot remote sensing image scene classification (FS-RSISC) aims at classifying remote sensing images with only a few labeled samples. The main challenges lie in small inter-class variances and large intra-class variances, which are the inherent property of remote sensing images. To address these challenges, we propose a transfer-based Dual Contrastive Network (DCN), which incorporates two auxiliary supervised contrastive learning branches during the training process. Specifically, one is a Context-guided Contrastive Learning (CCL) branch and the other is a Detail-guided Contrastive Learning (DCL) branch, which focus on inter-class discriminability and intra-class invariance, respectively. In the CCL branch, we first devise a Condenser Network to capture context features, and then leverage a supervised contrastive learning on top of the obtained context features to facilitate the model to learn more discriminative features. In the DCL branch, a Smelter Network is designed to highlight the significant local detail information. And then we construct a supervised contrastive learning based on the detail feature maps to fully exploit the spatial information in each map, enabling the model to concentrate on invariant detail features. Extensive experiments on four public benchmark remote sensing datasets demonstrate the competitive performance of our proposed DCN.

[CV-46] Conformal Cross-Modal Active Learning

【速读】:该论文旨在解决基础视觉模型(Foundation Models for Vision)在数据高效学习(data-efficient learning)方面潜力未被充分挖掘的问题,尤其是现有主动学习(Active Learning, AL)方法未能充分利用现代视觉-语言模型(Vision-Language Models, VLMs)中蕴含的丰富多模态知识。其解决方案的关键在于提出一种名为“一致性跨模态获取”(Conformal Cross-Modal Acquisition, CCMA)的新颖主动学习框架,该框架采用教师-学生架构:以预训练的VLM作为教师模型,提供语义上可解释且经过一致性校准(conformally calibrated)的不确定性估计,用以指导仅依赖视觉信息的学生模型进行样本选择;同时结合多模态一致性评分与多样性感知的选择策略,从而显著提升数据效率,在多个基准测试中优于当前最先进的主动学习基线方法。

链接: https://arxiv.org/abs/2603.23159
作者: Huy Hoang Nguyen,Cédric Jung,Shirin Salehi,Tobias Glück,Anke Schmeink,Andreas Kugi
机构: AIT Austrian Institute of Technology; Automation Control Institute, Technical University of Vienna; Chair of Information Theory and Data Analytics (INDA), RWTH Aachen University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 14 figures

点击查看摘要

Abstract:Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.

[CV-47] VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution

【速读】:该论文旨在解决当前三维体素超分辨率(volumetric super-resolution, SR)方法在真实低分辨率(low-resolution, LR)数据上性能被高估的问题,其根源在于现有训练数据多依赖于对高分辨率(high-resolution, HR)图像进行下采样生成的伪低分辨率数据,而非真实的低分辨率扫描。这种数据偏差导致模型在训练时学习到的是理想化的下采样模式,而非真实成像过程中丢失的高频信息。解决方案的关键在于构建一个大规模、高质量的配对体素数据集VoDaSuRe,其中包含真实采集的HR与LR扫描,从而揭示出:基于下采样数据训练的SR模型虽能产生更锐利的预测结果,但会过度平滑细节;而基于真实LR数据训练的模型则保留更多结构但精度不足。这一发现表明,当前深度学习驱动的体素SR进展需建立在具有高复杂度的真实配对数据基础上,以实现对真实物理场景中结构损失的有效恢复。

链接: https://arxiv.org/abs/2603.23153
作者: August Leander Høeg,Sophia Wiinberg Bardenfleth,Hans Martin Kjer,Tim Bjørn Dyrby,Vedrana Andersen Dahl,Anders Bjorholm Dahl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 15 figures. To be published in the proceedings of the Computer Vision and Pattern Recognition Conference 2026

点击查看摘要

Abstract:Recent advances in volumetric super-resolution (SR) have demonstrated strong performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. In this work, we show that much of this performance stems from training on downsampled data rather than real low-resolution scans. This reliance on downsampling is partly driven by the scarcity of paired high- and low-resolution 3D datasets. To address this, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: SR models trained on downsampled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying models trained on downsampled data to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated - when applied to real data, they do not recover structures lost in low-resolution scans and instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available through: this https URL

[CV-48] InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

【速读】:该论文旨在解决现有语音到视频合成方法在双人互动场景中难以捕捉个体间依赖关系以及无法对反应行为进行细粒度控制的问题。其核心解决方案在于提出InterDyad框架,关键创新包括:(1)设计了交互注入器(Interactivity Injector),基于从参考视频中提取的身份无关运动先验实现视频重演;(2)引入基于元查询的模态对齐机制(MetaQuery-based modality alignment),将对话音频与运动先验对齐;(3)利用多模态大语言模型(MLLM)从音频中提炼语义意图,精确调控反应的时间与恰当性;(4)提出角色感知的双人高斯引导机制(Role-aware Dyadic Gaussian Guidance, RoDG),提升极端头部姿态下的唇同步质量与空间一致性。

链接: https://arxiv.org/abs/2603.23132
作者: Dongwei Pan,Longwei Guo,Jiazhi Guan,Luying Huang,Yiding Li,Haojie Liu,Haocheng Feng,Wei He,Kaisiyuan Wang,Hang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with novelly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: this https URL.

[CV-49] 3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio CVPR2026

【速读】:该论文旨在解决音频驱动的指代表达视频目标分割(Audio-based Referring Video Object Segmentation, ARVOS)问题,即如何将音频查询准确地定位到视频中随时间变化的像素级目标掩码。其关键解决方案在于:首先利用预训练的视频目标分割(RVOS)模型与视觉-语言架构相结合,通过自动语音识别(ASR)模块将输入音频转换为文本,进而采用基于文本的监督信号进行分割推理,从而实现从文本推理到音频场景的有效迁移;其次引入存在感知门控机制(existence-aware gating mechanism),通过估计目标对象是否在视频中出现来抑制不存在时的预测,降低伪影掩码并提升分割稳定性。该方法在MeViS-Audio赛道上取得第三名,验证了其泛化能力和可靠性。

链接: https://arxiv.org/abs/2603.23126
作者: Jihwan Hong,Jaeyoung Do
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures. Technical report for the CVPR 2026 PVUW Workshop (MeViS-Audio Track)

点击查看摘要

Abstract:Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.

[CV-50] PiCo: Active Manifold Canonicalization for Robust Robotic Visual Anomaly Detection ECCV

【速读】:该论文旨在解决工业场景中机器人视觉异常检测(VAD)因复杂6-DoF位姿配置和不稳定运行条件(如光照变化与阴影干扰)导致的感知性能下降问题,这些问题使得内在语义异常与物理扰动相互耦合、难以分离。解决方案的关键在于提出从被动特征学习到主动规范化的范式转变,核心是PiCo(Pose-in-Condition Canonicalization)框架,其通过两级机制实现条件不变的规范流形投影:第一阶段“主动物理规范化”使机器人重新定位物体以消除几何不确定性源头;第二阶段“神经潜在规范化”采用三级去噪层次——输入级光度处理、特征级潜在精炼与语义级上下文推理,逐级剔除不同表征尺度上的干扰因素,从而提升系统在静态与闭环主动感知场景下的鲁棒性。

链接: https://arxiv.org/abs/2603.23122
作者: Teng Yan,Binkai Liu,Shuai Liu,Yue Yu,Bingzhuo Zhong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages. Submitted to the European Conference on Computer Vision (ECCV) 2026

点击查看摘要

Abstract:Industrial deployment of robotic visual anomaly detection (VAD) is fundamentally constrained by passive perception under diverse 6-DoF pose configurations and unstable operating conditions such as illumination changes and shadows, where intrinsic semantic anomalies and physical disturbances coexist and interact. To overcome these limitations, a paradigm shift from passive feature learning to Active Canonicalization is proposed. PiCo (Pose-in-Condition Canonicalization) is introduced as a unified framework that actively projects observations onto a condition-invariant canonical manifold. PiCo operates through a cascaded mechanism. The first stage, Active Physical Canonicalization, enables a robotic agent to reorient objects in order to reduce geometric uncertainty at its source. The second stage, Neural Latent Canonicalization, adopts a three-stage denoising hierarchy consisting of photometric processing at the input level, latent refinement at the feature level, and contextual reasoning at the semantic level, progressively eliminating nuisance factors across representational scales. Extensive evaluations on the large-scale M2AD benchmark demonstrate the superiority of this paradigm. PiCo achieves a state-of-the-art 93.7% O-AUROC, representing a 3.7% improvement over prior methods in static settings, and attains 98.5% accuracy in active closed-loop scenarios. These results demonstrate that active manifold canonicalization is critical for robust embodied perception.

[CV-51] SMSP: A Plug-and-Play Strategy of Multi-Scale Perception for MLLM s to Perceive Visual Illusions

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对隐藏模式视觉幻觉(hidden-pattern visual illusions)时的脆弱性问题,即模型无法识别人类肉眼可见的隐藏内容,暴露出其视觉感知与人类存在显著偏差,进而引发潜在安全风险。解决方案的关键在于提出一种名为“多尺度感知策略”(Strategy of Multi-Scale Perception, SMSP)的即插即用框架,其核心机制是通过抑制高频率背景纹理引起的注意力偏置(high-frequency attention bias),使模型更贴近人类视觉系统的感知策略,从而显著提升模型对幻觉图像中隐藏模式的识别能力。

链接: https://arxiv.org/abs/2603.23118
作者: Jinzhe Tu,Ruilei Guo,Zihan Guo,Junxiao Yang,Shiyao Cui,Minlie Huang
机构: Tsinghua University (清华大学); DCST (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent works have shown that Multimodal Large Language Models (MLLMs) are highly vulnerable to hidden-pattern visual illusions, where the hidden content is imperceptible to models but obvious to humans. This deficiency highlights a perceptual misalignment between current MLLMs and humans, and also introduces potential safety concerns. To systematically investigate this failure, we introduce IlluChar, a comprehensive and challenging illusion dataset, and uncover a key underlying mechanism for the models’ failure: high-frequency attention bias, where the models are easily distracted by high-frequency background textures in illusion images, causing them to overlook hidden patterns. To address the issue, we propose the Strategy of Multi-Scale Perception (SMSP), a plug-and-play framework that aligns with human visual perceptual strategies. By suppressing distracting high-frequency backgrounds, SMSP generates images closer to human perception. Our experiments demonstrate that SMSP significantly improves the performance of all evaluated MLLMs on illusion images, for instance, increasing the accuracy of Qwen3-VL-8B-Instruct from 13.0% to 84.0%. Our work provides novel insights into MLLMs’ visual perception, and offers a practical and robust solution to enhance it. Our code is publicly available at this https URL.

[CV-52] Automatic Segmentation of 3D CT scans with SAM2 using a zero-shot approach

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在3D医学影像分割任务中适用性受限的问题,特别是如何在不进行微调或领域特定训练的情况下实现对CT体积数据的自动分割。其关键解决方案在于提出一套仅在推理阶段进行的架构与流程改进:通过将CT切片视为有序序列,利用SAM2原有的视频记忆机制来模拟三维感知能力,从而弥补其缺乏固有体积感知的缺陷,并结合提示策略、记忆传播方案和多轮精修等模块优化分割一致性与准确性,最终实现了无需训练即可获得连贯3D分割结果的零样本方法。

链接: https://arxiv.org/abs/2603.23116
作者: Miquel Lopez Escoriza,Pau Amargant Alvarez
机构: EPFL(瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Foundation models for image segmentation have shown strong generalization in natural images, yet their applicability to 3D medical imaging remains limited. In this work, we study the zero-shot use of Segment Anything Model 2 (SAM2) for automatic segmentation of volumetric CT data, without any fine-tuning or domain-specific training. We analyze how SAM2 should be applied to CT volumes and identify its main limitation: the lack of inherent volumetric awareness. To address this, we propose a set of inference-alone architectural and procedural modifications that adapt SAM2’s video-based memory mechanism to 3D data by treating CT slices as ordered sequences. We conduct a systematic ablation study on a subset of 500 CT scans from the TotalSegmentator dataset to evaluate prompt strategies, memory propagation schemes and multi-pass refinement. Based on these findings, we select the best-performing configuration and report final results on a bigger sample of the TotalSegmentator dataset comprising 2,500 CT scans. Our results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured, demonstrating the feasibility of a fully zero-shot approach for volumetric medical image segmentation.

[CV-53] Agent FoX: LLM Agent -Guided Fusion with eXplainability for AI-Generated Image Detection

【速读】:该论文旨在解决当前AI生成图像(AIGI)检测工具存在的局限性,即现有检测方法通常针对特定伪造特征(如频域模式或语义不一致)进行设计,导致检测性能专一化且在不同模型间可能出现判断冲突。其解决方案的关键在于提出一个由大语言模型驱动的框架——AgentFoX,该框架将AIGI检测重构为一种动态的多阶段分析过程,通过引入基于校准专家画像(Expert Profiles)和上下文聚类画像(Clustering Profiles)的知识库,实现快速集成融合机制;在推理阶段,AgentFoX首先进行高层次语义评估,随后结合细粒度、上下文感知的信号级专家证据进行合成,并通过结构化推理解决矛盾,最终输出可解释的详细 forensic 报告,从而提升检测结果的可信度与实用性。

链接: https://arxiv.org/abs/2603.23115
作者: Yangxin Yu,Yue Zhou,Bin Li,Kaiqing Lin,Haodong Li,Jiangqun Ni,Bo Cao
机构: Shenzhen University (深圳大学); Sun Yat-sen University (中山大学); China Electronics Technology Group Corporation (中国电子科技集团公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing realism of AI-Generated Images (AIGI) has created an urgent need for forensic tools capable of reliably distinguishing synthetic content from authentic imagery. Existing detectors are typically tailored to specific forgery artifacts–such as frequency-domain patterns or semantic inconsistencies–leading to specialized performance and, at times, conflicting judgments. To address these limitations, we present \textbfAgentFoX, a Large Language Model-driven framework that redefines AIGI detection as a dynamic, multi-phase analytical process. Our approach employs a quick-integration fusion mechanism guided by a curated knowledge base comprising calibrated Expert Profiles and contextual Clustering Profiles. During inference, the agent begins with high-level semantic assessment, then transitions to fine-grained, context-aware synthesis of signal-level expert evidence, resolving contradictions through structured reasoning. Instead of returning a coarse binary output, AgentFoX produces a detailed, human-readable forensic report that substantiates its verdict, enhancing interpretability and trustworthiness for real-world deployment. Beyond providing a novel detection solution, this work introduces a scalable agentic paradigm that facilitates intelligent integration of future and evolving forensic tools.

[CV-54] NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization CVPR2026

【速读】:该论文旨在解决3D生物医学影像中缺乏高效且结构保真的神经元重建方法的问题,尤其针对3D神经影像数据获取困难和高质量标注稀缺的挑战。其关键解决方案在于:首先,采用基于膨胀(inflation)的适应策略,将2D视觉基础模型DINOv3学习到的特征表示映射为3D空间中的卷积核,从而保留其语义先验并适配3D神经元体积块;其次,引入拓扑感知的骨架损失(topology-aware skeleton loss),显式约束基于图的神经元树突重构结构保真度,从而提升形态学准确性。实验表明,该方法在多个公开神经影像数据集上均显著优于现有最先进(SoTA)方法。

链接: https://arxiv.org/abs/2603.23104
作者: Yik San Cheng,Runkai Zhao,Weidong Cai
机构: The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 12 figures, and 11 tables. Accepted to CVPR 2026

点击查看摘要

Abstract:2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing both rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure. Code: this https URL.

[CV-55] A Synchronized Audio-Visual Multi-View Capture System

【速读】:该论文旨在解决现有多视角捕捉系统在研究对话交互时对音频和视频同步支持不足的问题,尤其是在话语轮转、重叠发言和语调等时间敏感行为分析中,缺乏严格的音视频对齐机制。其关键解决方案是将同步音频与同步视频作为第一类信号进行设计,构建了一个融合多摄像机视频流与多通道麦克风录音的统一时间架构系统,并提供可重复的校准、采集与质量控制工作流程,从而实现高精度的时间一致性,支撑细粒度的对话行为建模与数据驱动分析。

链接: https://arxiv.org/abs/2603.23089
作者: Xiangwei Shi,Era Dorta Perez,Ruud de Jong,Ojas Shirekar,Chirag Raman
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view capture systems have been an important tool in research for recording human motion under controlling conditions. Most existing systems are specified around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment and show that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversation behavior.

[CV-56] Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

【速读】:该论文旨在解决自回归(Autoregressive, AR)图像生成模型在标准最大似然估计训练下难以直接优化样本质量和多样性的局限性,以及现有基于强化学习(Reinforcement Learning, RL)的方法在提升质量时易导致输出多样性坍缩的问题。其解决方案的关键在于提出一种轻量级RL框架,将token级AR合成建模为马尔可夫决策过程,并采用分组相对策略优化(Group Relative Policy Optimization, GRPO)进行训练;核心创新是引入一种新的分布级奖励机制——留一法FID(Leave-One-Out FID, LOO-FID),通过指数移动平均特征矩来显式鼓励样本多样性并防止模式坍缩,同时结合CLIP与HPSv2的复合实例级奖励以保证语义和感知保真度,并引入自适应熵正则化项稳定多目标学习。实验表明,该方法仅需数百次迭代即可显著提升质量与多样性指标,且无需Classifier-Free Guidance即可生成竞争力强的样本,从而规避其两倍推理成本。

链接: https://arxiv.org/abs/2603.23086
作者: Orhun Buğra Baran,Melih Kandemir,Ramazan Gokberk Cinbis
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, and bypass its 2x inference cost.

[CV-57] PolarAPP: Beyond Polarization Demosaicking for Polarimetric Applications

【速读】:该论文旨在解决极化成像(polarimetric imaging)中由于传统去马赛克(demosaicking)策略导致的下游任务性能受限问题。现有方法通常对分焦平面传感器的原始测量数据进行简单重组,生成稀疏图像时未考虑语义一致性与任务需求,且去马赛克过程缺乏对下游任务(如法向量估计、去反射等)的感知能力,从而影响最终性能。解决方案的关键在于提出PolarAPP框架,其核心创新包括:(1)引入基于元学习的特征对齐机制,实现去马赛克网络与下游任务网络之间的语义对齐,使重建过程具备任务导向性;(2)设计等效成像约束,使去马赛克训练可直接回归物理意义明确的输出,无需依赖重构后的数据;(3)通过任务精调阶段利用稳定的去马赛克前端进一步提升下游任务精度。该方案实现了去马赛克与下游任务的联合优化,显著提升了整体性能。

链接: https://arxiv.org/abs/2603.23071
作者: Yidong Luo,Chenggong Li,Yunfeng Song,Ping Wang,Boxin Shi,Junchao Zhang,Xin Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Polarimetric imaging enables advanced vision applications such as normal estimation and de-reflection by capturing unique surface-material interactions. However, existing applications (alternatively called downstream tasks) rely on datasets constructed by naively regrouping raw measurements from division-of-focal-plane sensors, where pixels of the same polarization angle are extracted and aligned into sparse images without proper demosaicking. This reconstruction strategy results in suboptimal, incomplete targets that limit downstream performance. Moreover, current demosaicking methods are task-agnostic, optimizing only for photometric fidelity rather than utility in downstream tasks. Towards this end, we propose PolarAPP, the first framework to jointly optimize demosaicking and its downstream tasks. PolarAPP introduces a feature alignment mechanism that semantically aligns the representations of demosaicking and downstream networks via meta-learning, guiding the reconstruction to be task-aware. It further employs an equivalent imaging constraint for demosaicking training, enabling direct regression to physically meaningful outputs without relying on rearranged data. Finally, a task-refinement stage fine-tunes the task network using the stable demosaicking front-end to further enhance accuracy. Extensive experimental results demonstrate that PolarAPP outperforms existing methods in both demosaicking quality and downstream performance. Code is available upon acceptance.

[CV-58] MLLM -HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding

【速读】:该论文旨在解决现有计算病理学(Computational Pathology, CPath)多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理全切片图像(Whole Slide Images, WSIs)时存在的局限性,即通常将整张WSI压缩为单一嵌入向量,导致细粒度定位能力不足,并忽略病理学家在不同尺度上整合证据的诊断逻辑。解决方案的关键在于提出一种分层WSI级MLLM——MLLM-HWSI,其核心创新是通过四尺度对齐机制(细胞作为词、图像块作为短语、区域作为句子、WSI作为段落),构建多尺度嵌入表示,并引入层次对比损失与跨尺度一致性损失,以保持从细胞到整体WSI的语义连贯性;同时采用轻量级细胞-细胞注意力融合(Cell-Cell Attention Fusion, CCAF)Transformer聚合细胞嵌入,最终将多尺度视觉token与文本token联合输入指令微调的大语言模型(Instruction-tuned LLM),实现可解释的证据驱动推理,在13个WSI级基准任务上达到新SOTA性能。

链接: https://arxiv.org/abs/2603.23067
作者: Basit Alawode,Arif Mahmood,Muaz Khalifa Al-Radi,Shahad Albastaki,Asim Khan,Muhammad Bilal,Moshira Ali Abdalla,Mohammed Bennamoun,Sajid Javed
机构: Khalifa University of Science and Technology (哈利法大学科学技术); Information Technology University (信息技术大学); KAU (KAU); University of the Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic information emerges from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales. We introduce \textbfMLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales, cell as word, patch as phrase, region as sentence, and WSI as paragraph to support interpretable evidence-grounded reasoning. MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. We compute diagnostically relevant patches and aggregate segmented cell embeddings into a compact cellular token per-patch using a lightweight \textitCell-Cell Attention Fusion (CCAF) transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks. Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By aligning language with multi-scale visual evidence, MLLM-HWSI provides accurate, interpretable outputs that mirror diagnostic workflows and advance holistic WSI understanding. Code is available at: \hrefthis https URLGitHub.

[CV-59] HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling

【速读】:该论文旨在解决医学影像领域中计算机辅助诊断(CAD)模型部署与验证所面临的数据稀缺问题,尤其针对肺癌影像数据不足导致的诊断延迟和患者预后不良。其解决方案的关键在于提出一种基于HU区间分解的生成式图像合成策略:不再直接建模全范围Hounsfield Unit(HU)分布,而是将CT图像按HU区间逐个合成,先在特定组织相关的HU窗口内训练生成架构(如多头VQVAE),再通过一个可学习的重建网络将各区间输出合并为完整扫描图像。该方法显著提升了生成图像的视觉保真度与多样性,同时降低了模型复杂度与计算成本,实现了结构感知的医学图像合成新范式。

链接: https://arxiv.org/abs/2603.23041
作者: António Cardoso,Pedro Sousa,Tania Pereira,Hélder P. Oliveira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to iEEE TPAMI (Transactions on Pattern Analysis and Machine Intelligence)

点击查看摘要

Abstract:Currently, a central challenge and bottleneck in the deployment and validation of computer-aided diagnosis (CAD) models within the field of medical imaging is data scarcity. For lung cancer, one of the most prevalent types worldwide, limited datasets can delay diagnosis and have an impact on patient outcome. Generative AI offers a promising solution for this issue, but dealing with the complex distribution of full Hounsfield Unit (HU) range lung CT scans is challenging and remains as a highly computationally demanding task. This paper introduces a novel decomposition strategy that synthesizes CT images one HU interval at a time, rather than modelling the entire HU domain at once. This framework focuses on training generative architectures on individual tissue-focused HU windows, then merges their output into a full-range scan via a learned reconstruction network that effectively reverses the HU-windowing process. We further propose multi-head and multi-decoder models to better capture textures while preserving anatomical consistency, with a multi-head VQVAE achieving the best performance for the generative task. Quantitative evaluation shows this approach significantly outperforms conventional 2D full-range baselines, achieving a 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. The best performance is achieved by a multi-head VQVAE variant, demonstrating that it is possible to enhance visual fidelity and variability while also reducing model complexity and computational cost. This work establishes a new paradigm for structure-aware medical image synthesis, aligning generative modelling with clinical interpretation.

[CV-60] raffic Sign Recognition in Autonomous Driving: Dataset Benchmark and Field Experiment

【速读】:该论文旨在解决交通标志识别(Traffic Sign Recognition, TSR)在实际自动驾驶场景中面临的三大挑战:跨区域差异、长尾类别识别困难以及语义模糊性问题。现有数据集和基准测试难以提供对不同建模范式在这些复杂条件下的行为诊断,限制了模型鲁棒性和泛化能力的系统评估。解决方案的关键在于构建TS-1M——一个包含超过一百万张全球多样化真实图像、涵盖454个标准化类别的大规模数据集,并配套一套面向挑战的诊断基准,涵盖跨区域识别、稀有类别检测、低清晰度鲁棒性及语义文本理解等细粒度评估场景。通过该基准,作者系统比较了监督学习、自监督预训练模型与多模态视觉语言模型(Multimodal Vision-Language Models, VLMs)三种范式的表现,发现语义对齐是提升跨区域泛化和稀有类别识别能力的核心因素,而纯视觉模型仍易受外观变化和数据不平衡影响,从而为构建更鲁棒、语义感知的TSR系统提供了关键洞见。

链接: https://arxiv.org/abs/2603.23034
作者: Guoyang Zhao,Weiqing Qi,Kai Zhang,Chenguang Zhang,Zeying Gong,Zhihai Bi,Kai Chen,Benshan Ma,Ming Liu,Jun Ma
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Lingnan University(岭南大学); Shenzhen Unity Drive Innovation Technology Co., Ltd.(深圳优行创新科技有限公司); The Hong Kong University of Science and Technology(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic Sign Recognition (TSR) is a core perception capability for autonomous driving, where robustness to cross-region variation, long-tailed categories, and semantic ambiguity is essential for reliable real-world deployment. Despite steady progress in recognition accuracy, existing traffic sign datasets and benchmarks offer limited diagnostic insight into how different modeling paradigms behave under these practical challenges. We present TS-1M, a large-scale and globally diverse traffic sign dataset comprising over one million real-world images across 454 standardized categories, together with a diagnostic benchmark designed to analyze model capability boundaries. Beyond standard train-test evaluation, we provide a suite of challenge-oriented settings, including cross-region recognition, rare-class identification, low-clarity robustness, and semantic text understanding, enabling systematic and fine-grained assessment of modern TSR models. Using TS-1M, we conduct a unified benchmark across three representative learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Our analysis reveals consistent paradigm-dependent behaviors, showing that semantic alignment is a key factor for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Finally, we validate the practical relevance of TS-1M through real-scene autonomous driving experiments, where traffic sign recognition is integrated with semantic reasoning and spatial localization to support map-level decision constraints. Overall, TS-1M establishes a reference-level diagnostic benchmark for TSR and provides principled insights into robust and semantic-aware traffic sign perception. Project page: this https URL.

[CV-61] Generative Event Pretraining with Foundation Model Alignment

【速读】:该论文旨在解决事件相机(event camera)数据因独特感知特性及标注数据稀缺,导致难以训练具有跨任务迁移能力的事件基础视觉模型(visual foundation model, VFM)的问题。解决方案的关键在于提出一种两阶段框架GEP(Generative Event Pretraining):第一阶段通过联合回归-对比目标将事件编码器与冻结的图像VFM对齐,使事件特征锚定于图像语义空间;第二阶段利用混合事件-图像序列进行自回归预训练,以捕捉事件特有的时间动态结构。该方法结合VFM引导的语义对齐与生成式时序建模,构建出语义丰富且具备时间感知能力的事件模型,显著提升下游任务(如目标识别、分割和深度估计)的泛化性能。

链接: https://arxiv.org/abs/2603.23032
作者: Jianwen Cao,Jiaxu Xing,Nico Messikommer,Davide Scaramuzza
机构: Robotics and Perception Group, University of Zurich (苏黎世大学机器人与感知组)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.

[CV-62] Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation CVPR2026

【速读】:该论文旨在解决训练-free开放词汇语义分割方法中因采用滑动窗口推理策略(sliding-window inference strategy)而导致的跨窗口语义不一致问题。其核心挑战在于,每个窗口独立处理导致局部上下文信息无法有效传递至其他窗口,从而影响整体分割一致性。解决方案的关键在于提出全局-局部对齐CLIP(Global-Local Aligned CLIP, GLA-CLIP)框架:通过扩展键值(key-value)token以融合所有窗口的上下文线索,并引入代理锚点(proxy anchor)——由所有窗口中与查询特征高度相似的token聚合而成,提供统一语义参考以衡量内外窗口patch间的相似性;同时设计动态归一化机制,根据目标尺度动态调整注意力强度,提升小目标检测性能。该方法可无缝集成至现有方法中,显著扩展感受野并增强分割精度。

链接: https://arxiv.org/abs/2603.23030
作者: ByeongCheol Lee,Hyun Seok Seong,Sangeek Hyun,Gilhan Park,WonJun Moon,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 13 figures, 12 tables, Accepted to CVPR 2026

点击查看摘要

Abstract:A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome limitation of the CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP~(GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be equipped on existing methods and broad their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at this https URL.

[CV-63] Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps

【速读】:该论文旨在解决多视角图像中精确空间理解的问题,即当前多模态大语言模型(Multimodal Large Language Models, MLLMs)的视觉表征主要依赖语义信息,缺乏显式的几何定位(geometric grounding),导致其在空间推理任务中的表现受限。现有方法虽通过引入视觉几何模型提供的几何线索来增强视觉标记,但仍需MLLM隐式推断场景的三维结构,限制了其空间推理能力。解决方案的关键在于提出Cog3DMap框架,该框架通过迭代构建一个显式的三维记忆(3D memory),使每个视觉标记都具有三维空间位置信息和语义信息,从而实现对结构化三维地图的直接推理,显著提升了空间推理性能。

链接: https://arxiv.org/abs/2603.23023
作者: Chanyoung Gwak,Yoonwoo Jeong,Byungwoo Jeon,Hyunseok Lee,Jinwoo Shin,Minsu Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Precise spatial understanding from multi-view images remains a fundamental challenge for Multimodal Large Language Models (MLLMs), as their visual representations are predominantly semantic and lack explicit geometric grounding. While existing approaches augment visual tokens with geometric cues from visual geometry models, their MLLM is still required to implicitly infer the underlying 3D structure of the scene from these augmented tokens, limiting its spatial reasoning capability. To address this issue, we introduce Cog3DMap, a framework that recurrently constructs an explicit 3D memory from multi-view images, where each token is grounded in 3D space and possesses both semantic and geometric information. By feeding these tokens into the MLLM, our framework enables direct reasoning over a spatially structured 3D map, achieving state-of-the-art performance on various spatial reasoning benchmarks. Code will be made publicly available.

[CV-64] Concept-based explanations of Segmentation and Detection models in Natural Disaster Management

【速读】:该论文旨在解决深度学习模型在洪水和野火分割及目标检测任务中缺乏透明度的问题,从而影响人类在应急响应中的信任建立。针对这一挑战,其核心解决方案是提出一个可解释性框架,关键创新在于引入一种新颖的重分配策略,扩展了层间相关性传播(Layer-wise Relevance Propagation, LRP)方法以适配sigmoid门控逐元素融合层,使相关性能够贯穿PIDNet架构中的融合模块并回传至输入图像;同时结合原型概念解释(Prototypical Concept-based Explanations, PCX),实现对特定灾害语义类别的局部与全局概念级解释,揭示驱动分割和检测的关键特征。实验表明,该框架在保持近实时推理能力的同时,提供了可靠且可解释的输出,适用于资源受限平台如无人机(Unmanned Aerial Vehicles, UAVs)部署。

链接: https://arxiv.org/abs/2603.23020
作者: Samar Heydari,Jawher Said,Galip Ümit Yolcu,Evgenii Kortukov,Elena Golimblevskaia,Evgenios Vlachos,Vasileios Mygdalis,Ioannis Pitas,Sebastian Lapuschkin,Leila Arras
机构: Fraunhofer Heinrich Hertz Institute (弗劳恩霍夫海因里希·赫兹研究所); Aristotle University of Thessaloniki (萨洛尼卡亚里士多德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Deep learning models for flood and wildfire segmentation and object detection enable precise, real-time disaster localization when deployed on embedded drone platforms. However, in natural disaster management, the lack of transparency in their decision-making process hinders human trust required for emergency response. To address this, we present an explainability framework for understanding flood segmentation and car detection predictions on the widely used PIDNet and YOLO architectures. More specifically, we introduce a novel redistribution strategy that extends Layer-wise Relevance Propagation (LRP) explanations for sigmoid-gated element-wise fusion layers. This extension allows LRP relevances to flow through the fusion modules of PIDNet, covering the entire computation graph back to the input image. Furthermore, we apply Prototypical Concept-based Explanations (PCX) to provide both local and global explanations at the concept level, revealing which learned features drive the segmentation and detection of specific disaster semantic classes. Experiments on a publicly available flood dataset show that our framework provides reliable and interpretable explanations while maintaining near real-time inference capabilities, rendering it suitable for deployment on resource-constrained platforms, such as Unmanned Aerial Vehicles (UAVs).

[CV-65] Zero-Shot Personalization of Objects via Textual Inversion

【速读】:该论文旨在解决当前文本到图像扩散模型在个性化图像生成中效率低下且泛化能力不足的问题,尤其是现有方法主要针对人像主体进行身份嵌入注入,难以扩展至任意物体类别。其解决方案的关键在于提出一种基于学习网络的框架,用于预测特定物体的文本反转(textual inversion)嵌入,并将其集成到UNet的扩散步骤中,实现单次前向传播即可完成零样本、通用对象的快速个性化定制,从而在不依赖训练的前提下显著提升个性化生成的灵活性与可扩展性。

链接: https://arxiv.org/abs/2603.23010
作者: Aniket Roy,Maitreya Suin,Rama Chellappa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.

[CV-66] VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought

【速读】:该论文旨在解决真实场景下视频修复(Video Restoration)中因异质退化(heterogeneous degradations)导致的模型泛化能力不足问题,现有静态架构和固定推理流程难以适应复杂多变的退化类型与修复需求。其解决方案的关键在于提出一个统一的智能视频修复代理 VQ-Jarvis,通过两个核心机制实现:一是构建首个大规模视频成对增强数据集 VSR-Compare(含 20K 比较样本),训练多操作符判别模型与退化感知模型以提升对退化类型及修复结果差异的精确识别能力;二是引入分层操作调度策略,针对不同难度视频采用“一步检索”或“逐步贪婪搜索”方式,在保证效率的同时优化修复轨迹选择,从而实现更准确、高效的动态决策。

链接: https://arxiv.org/abs/2603.22998
作者: Xuanyu Zhang,Weiqi Li,Qunliang Xing,Jingfen Xie,Bin Chen,Junlin Li,Li Zhang,Jian Zhang,Shijie Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Video restoration, Agent-based restoration

点击查看摘要

Abstract:Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multiple operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.

[CV-67] VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在资源受限平台上的推理成本过高问题,尤其针对现有视觉标记剪枝方法因忽略物理交互连续性而导致关键结构区域被误剪、进而引发早期任务阶段行为不稳定的问题。解决方案的关键在于提出一种无需训练的“交互优先”(Interaction-First)剪枝方法——VLA-IAP(Interaction-Aligned Pruning),其核心创新包括:引入基于几何先验的机制以保留支持操作的结构锚点,并设计动态调度策略根据语义与运动对齐程度自适应调整剪枝强度,从而实现从保守到激进的渐进式剪枝过程,保障早期任务阶段的鲁棒性并提升后期推理效率。

链接: https://arxiv.org/abs/2603.22991
作者: Jintao Cheng,Haozhe Wang,Weibin Li,Gang Wang,Yipu Zhang,Xiaoyu Tang,Jin Wu,Xieyuanli Chen,Yunhui Liu,Wei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 8 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed \textbftraining-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a \textbf97.8% success rate with a \textbf 1.25\times speedup on the LIBERO benchmark, and up to \textbf 1.54\times speedup while maintaining performance \textbfcomparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: \hrefthis https URLthis http URL.

[CV-68] WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

【速读】:该论文旨在解决当前基于文本的图像和视频生成方法在大规模3D场景合成中难以维持场景级和对象级一致性的问题,其根本原因在于缺乏显式的几何结构约束。解决方案的关键在于提出一种“几何优先”(geometry-first)的方法,将复杂的3D场景生成任务解耦为两个阶段:首先构建一个由网格(mesh scaffold)表示的结构骨架以捕捉环境几何信息(如墙壁、地板等),随后利用强大的图像生成模型,以该网格为条件进行真实感外观合成与对象布局填充。通过这种方式,实现了高保真度、大尺度且结构一致的3D场景生成,显著提升了场景的可扩展性和对象多样性。

链接: https://arxiv.org/abs/2603.22972
作者: Manuel-Andreas Schneider,Angela Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Video: this https URL Code: this https URL

点击查看摘要

Abstract:Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and -video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment’s geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.

[CV-69] FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning CVPR2026

【速读】:该论文旨在解决弱监督下的伪装目标检测(Weakly-Supervised Camouflage Object Detection, WSCOD)中存在的四大挑战:非伪装目标误响应、局部响应、极端响应以及边界感知不足的问题,这些问题导致现有方法在伪装场景下性能不佳。解决方案的关键在于提出一种基于频域感知与对比学习的框架FCL-COD,其核心创新包括:(1) 频率感知低秩适配(Frequency-aware Low-rank Adaptation, FoRA),将频域特征知识引入Segment Anything Model (SAM),以抑制非伪装目标响应;(2) 梯度感知对比学习策略,有效区分前景与背景边界,缓解局部和极端响应问题;(3) 多尺度频域感知表示学习机制,增强对精细边界的建模能力。实验表明,该方法在三个主流COD基准上显著优于当前最先进的弱监督乃至全监督方法。

链接: https://arxiv.org/abs/2603.22969
作者: Jingchen Ni,Quan Zhang,Dan Jiang,Keyu Lv,Ke Zhang,Chun Yuan
机构: Tsinghua University (清华大学); Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings

点击查看摘要

Abstract:Existing camouflage object detection (COD) methods typically rely on fully-supervised learning guided by mask annotations. However, obtaining mask annotations is time-consuming and labor-intensive. Compared to fully-supervised methods, existing weakly-supervised COD methods exhibit significantly poorer performance. Even for the Segment Anything Model (SAM), there are still challenges in handling weakly-supervised camouflage object detection (WSCOD), such as: a. non-camouflage target responses, b. local responses, c. extreme responses, and d. lack of refined boundary awareness, which leads to unsatisfactory results in camouflage scenes. To alleviate these issues, we propose a frequency-aware and contrastive learning-based WSCOD framework in this paper, named FCL-COD. To mitigate the problem of non-camouflaged object responses, we propose the Frequency-aware Low-rank Adaptation (FoRA) method, which incorporates frequency-aware camouflage scene knowledge into SAM. To overcome the challenges of local and extreme responses, we introduce a gradient-aware contrastive learning approach that effectively delineates precise foreground-background boundaries. Additionally, to address the lack of refined boundary perception, we present a multi-scale frequency-aware representation learning strategy that facilitates the modeling of more refined boundaries. We validate the effectiveness of our approach through extensive empirical experiments on three widely recognized COD benchmarks. The results confirm that our method surpasses both state-of-the-art weakly supervised and even fully supervised techniques.

[CV-70] Few-Shot Generative Model Adaption via Identity Injection and Preservation

【速读】:该论文旨在解决少样本生成模型适配(few-shot generative model adaptation)中因遗忘源域身份知识而导致的目标域生成图像质量下降的问题。其核心解决方案为提出Identity Injection and Preservation (I²P) 方法,关键在于通过两个模块实现源域身份知识的保留:一是身份注入模块(identity injection module),将源域身份信息嵌入目标域潜在空间以确保生成图像保留源域关键身份特征;二是身份替换模块(identity substitution module),包含风格-内容解耦器和重建调制器,并通过特征一致性对齐约束强化身份知识的一致性保持。

链接: https://arxiv.org/abs/2603.22965
作者: Yeqi He,Liang Li,Jiehua Zhang,Yaoqi Sun,Xichun Sheng,Zhidong Zhao,Chenggang Yan
机构: Hangzhou Dianzi University (杭州电子科技大学); Chinese Academy of Sciences (中国科学院); Xi’an Jiaotong University (西安交通大学); Lishui University (丽水学院); Macao Polytechnic University (澳门理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training generative models with limited data presents severe challenges of mode collapse. A common approach is to adapt a large pretrained generative model upon a target domain with very few samples (fewer than 10), known as few-shot generative model adaptation. However, existing methods often suffer from forgetting source domain identity knowledge during adaptation, which degrades the quality of generated images in the target domain. To address this, we propose Identity Injection and Preservation (I ^2 P), which leverages identity injection and consistency alignment to preserve the source identity knowledge. Specifically, we first introduce an identity injection module that integrates source domain identity knowledge into the target domain’s latent space, ensuring the generated images retain key identity knowledge of the source domain. Second, we design an identity substitution module, which includes a style-content decoupler and a reconstruction modulator, to further enhance source domain identity preservation. We enforce identity consistency constraints by aligning features from identity substitution, thereby preserving identity knowledge. Both quantitative and qualitative experiments show that our method achieves substantial improvements over state-of-the-art methods on multiple public datasets and 5 metrics.

[CV-71] Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining CVPR2026

【速读】:该论文旨在解决大规模视频-语言预训练中计算成本过高以及现有掩码视觉建模方法存在的两个根本问题:高掩码比例下的严重视觉信息丢失和因帧间相关性导致的时间信息泄露。其解决方案的关键在于提出了一种聚类式时空掩码策略(Cluster-Wise Spatio-Temporal Masking, ClusterSTM),该策略首先在帧内进行聚类以将视觉token划分为语义独立的多个簇,然后在每个簇中保留时间密度最高的token,从而确保保留的token既能捕捉整体视频内容,又具备强时间相关性;同时引入视频-文本语义一致性重建目标,超越传统视觉重建,实现更高级别的多模态对齐,显著提升了模型效率与性能。

链接: https://arxiv.org/abs/2603.22953
作者: Weijun Zhuang,Yuqing Huang,Weikang Meng,Xin Li,Ming Liu,Xiaopeng Hong,Yaowei Wang,Wangmeng Zuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensure that the retained tokens capture holistic video content while exhibit strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.

[CV-72] Caption Generation for Dongba Paintings via Prompt Learning and Semantic Fusion

【速读】:该论文旨在解决主流图像字幕模型在应用于中国西南地区纳西族东巴绘画(Dongba paintings)时因领域偏移(domain shift)导致的自动文本描述性能不佳的问题。其关键解决方案是提出了一种基于提示与视觉语义融合的编码器-解码器框架(PVGF-DPC),该框架通过引入内容提示模块(content prompt module)将图像特征映射为文化感知标签(如“神祇”、“仪式图案”或“地狱鬼魂”),并构建后提示(post-prompt)引导解码器生成主题准确的描述;同时设计了一种视觉语义生成融合损失(visual semantic-generation fusion loss),联合优化提示预测和字幕生成的交叉熵目标,促使模型提取关键文化与视觉线索,并生成语义对齐的字幕。

链接: https://arxiv.org/abs/2603.22946
作者: Shuangwu Qian,Xiaochan Yuan,Pengfei Liu
机构: Sichuan Agricultural University (四川农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dongba paintings, the treasured pictorial legacy of the Naxi people in southwestern China, feature richly layered visual elements, vivid color palettes, and pronounced ethnic and regional cultural symbolism, yet their automatic textual description remains largely unexplored owing to severe domain shift when mainstream captioning models are applied directly. This paper proposes \textbfPVGF-DPC (\textitPrompt and Visual Semantic-Generation Fusion-based Dongba Painting Captioning), an encoder-decoder framework that integrates a content prompt module with a novel visual semantic-generation fusion loss to bridge the gap between generic natural-image captioning and the culturally specific imagery found in Dongba art. A MobileNetV2 encoder extracts discriminative visual features, which are injected into the layer normalization of a 10-layer Transformer decoder initialized with pretrained BERT weights; meanwhile, the content prompt module maps the image feature vector to culture-aware labels – such as \emphdeity, \emphritual pattern, or \emphhell ghost – and constructs a post-prompt that steers the decoder toward thematically accurate descriptions. The visual semantic-generation fusion loss jointly optimizes the cross-entropy objectives of both the prompt predictor and the caption generator, encouraging the model to extract key cultural and visual cues and to produce captions that are semantically aligned with the input image. We construct a dedicated Dongba painting captioning dataset comprising 9408 augmented images with culturally grounded annotations spanning seven thematic categories.

[CV-73] FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification

【速读】:该论文旨在解决如何将专家眼动轨迹(expert eye movements)中蕴含的诊断推理信息更直接、细粒度地整合到医学图像分析模型中的问题。传统基于卷积神经网络(CNN)的系统难以处理眼动数据的时序密集性、空间稀疏性和个体差异性,通常仅采用简化表示如热图,导致信息损失。解决方案的关键在于提出FixationFormer——一种基于Transformer架构的方法,将专家眼动轨迹建模为序列化的token,保留其时空结构,并通过图像与眼动token之间的显式交叉注意力机制,实现二者联合建模,从而有效应对眼动数据的稀疏性和变异性,提升诊断线索的集成精度。

链接: https://arxiv.org/abs/2603.22939
作者: Daniel Beckmann,Benjamin Risse
机构: University of Münster(明斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.

[CV-74] When AVSR Meets Video Conferencing: Dataset Degradation and the Hidden Mechanism Behind Performance Collapse

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在视频会议(Video Conferencing, VC)场景下音频-视觉语音识别(Audio-Visual Speech Recognition, AVSR)系统性能严重下降的问题,其核心挑战来源于传输失真和人类自发的超常发声(即Lombard效应)。解决方案的关键在于构建首个面向VC场景的多模态数据集MLD-VC,该数据集包含31名说话者、22.79小时的音视频数据,并显式引入Lombard效应以增强人类超常发声特征。通过实证分析发现,语音增强算法是导致分布偏移的主要来源,而Lombard效应引起的分布偏移与语音增强造成的偏移高度相似,因此在Lombard数据上微调AVSR模型可显著提升其在真实VC环境中的鲁棒性;实验表明,在多个主流VC平台上微调后平均词错误率(CER)降低17.5%,从而为开发更鲁棒、泛化能力强的AVSR系统提供了基础支持。

链接: https://arxiv.org/abs/2603.22915
作者: Yihuan Huang,Jun Xue,Liu Jiajun,Daixian Li,Tong Zhang,Zhuolin Yi,Yanzhen Ren,Kai Li
机构: Wuhan University (武汉大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct \textbfMLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at this https URL.

[CV-75] ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

【速读】:该论文旨在解决视频多模态大模型(Video MLLMs)在实现高比例标记压缩(token compression)时效果不佳的问题,其根本原因在于现有方法对视频内容中时间连续性和语义连贯性的建模不足。解决方案的关键在于提出一种无需训练的标记剪枝方法 ForestPrune,通过构建跨帧的时空森林结构(Spatial-temporal Forest Modeling),结合语义、空间和时间约束来形成对视频的整体理解;随后依据树的深度和节点角色评估标记树与节点的重要性,从而做出全局最优的剪枝决策,显著提升了视频 MLLMs 的压缩效率与性能表现。

链接: https://arxiv.org/abs/2603.22911
作者: Shaobo Ju,Baiyang Song,Tao Chen,Jiapeng Zhang,Qiong Wu,Chao Chang,HuaiXi Wang,Yiyi Zhou,Rongrong Ji
机构: Xiamen University (厦门大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to the great saving of computation and memory overhead, token compression has become a research hot-spot for MLLMs and achieved remarkable progress in image-language tasks. However, for the video, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to the insufficient modeling of temporal and continual video content, and propose a novel and training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective and high-ratio pruning via Spatial-temporal Forest Modeling. In practice, ForestPrune construct token forests across video frames based on the semantic, spatial and temporal constraints, making an overall comprehension of videos. Afterwards, ForestPrune evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a bunch of video benchmarks. The experimental results not only show the great effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while reducing 90% tokens for LLaVA-OneVision, but also show its superior performance and efficiency than the compared token compression methods, e.g., +10.1% accuracy on MLVU and -81.4% pruning time than FrameFusion on LLaVA-Video.

[CV-76] Dual-Teacher Distillation with Subnetwork Rectification for Black-Box Domain Adaptation

【速读】:该论文旨在解决黑盒域适应(black box domain adaptation)场景下的性能瓶颈问题,即在源数据和源模型均不可获取的情况下,如何有效利用仅能通过目标样本查询得到的黑盒源模型预测结果进行高质量的知识迁移。现有方法受限于伪标签噪声或对视觉语言模型(Vision-Language Models, ViLs)语义先验信息利用不足,导致适应效果不佳。本文提出双教师蒸馏与子网络修正(Dual Teacher Distillation with Subnetwork Rectification, DDSR)模型,其核心创新在于:一方面通过自适应融合黑盒源模型与ViL的互补预测生成可靠伪标签,另一方面引入子网络驱动的正则化策略抑制由噪声监督引发的过拟合;同时,通过迭代优化目标预测与ViL提示词,实现语义一致性增强,并最终基于类别原型的自训练进一步提升模型性能。

链接: https://arxiv.org/abs/2603.22908
作者: Zhe Zhang,Jing Li,Wanli Xue,Xu Cheng,Jianhua Zhang,Qinghua Hu,Shengyong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This manuscript is under review at IEEE Transactions on Multimedia

点击查看摘要

Abstract:Assuming that neither source data nor the source model is accessible, black box domain adaptation represents a highly practical yet extremely challenging setting, as transferable information is restricted to the predictions of the black box source model, which can only be queried using target samples. Existing approaches attempt to extract transferable knowledge through pseudo label refinement or by leveraging external vision language models (ViLs), but they often suffer from noisy supervision or insufficient utilization of the semantic priors provided by ViLs, which ultimately hinder adaptation performance. To overcome these limitations, we propose a dual teacher distillation with subnetwork rectification (DDSR) model that jointly exploits the specific knowledge embedded in black box source models and the general semantic information of a ViL. DDSR adaptively integrates their complementary predictions to generate reliable pseudo labels for the target domain and introduces a subnetwork driven regularization strategy to mitigate overfitting caused by noisy supervision. Furthermore, the refined target predictions iteratively enhance both the pseudo labels and ViL prompts, enabling more accurate and semantically consistent adaptation. Finally, the target model is further optimized through self training with classwise prototypes. Extensive experiments on multiple benchmark datasets validate the effectiveness of our approach, demonstrating consistent improvements over state of the art methods, including those using source data or models.

[CV-77] SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

【速读】:该论文旨在解决动态场景重建、语义理解与实时流式推理三者难以统一的问题,尤其在缺乏光流监督的情况下实现高精度的动态建模与语义对齐。其解决方案的关键在于提出SLARM架构:通过高阶运动建模(higher-order motion modeling)捕捉非均匀运动,并仅依赖可微渲染进行训练;同时引入LSeg的语义特征蒸馏机制,生成语言对齐的表示以支持自然语言查询;此外,采用基于窗口的因果注意力机制(window-based causal attention)实现低延迟、无内存累积的流式推理,从而在保持几何与语义紧密耦合的同时显著提升动态估计精度、重建质量与场景解析性能。

链接: https://arxiv.org/abs/2603.22893
作者: Zhicheng Qiu,Jiarui Meng,Tong-an Luo,Yican Huang,Xuan Feng,Xuanfu Li,ZHan Xu
机构: Huawei Technologies Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.

[CV-78] Group Editing : Edit Multiple Images in One Go CVPR2026

【速读】:该论文旨在解决跨多张相关图像进行一致且统一编辑的难题,尤其在图像间存在显著姿态、视角和空间布局差异时,如何实现语义对齐区域的准确修改。其解决方案的关键在于构建显式与隐式双重关系:一方面利用VGGT提取几何对应关系以实现空间对齐;另一方面将图像组重构为伪视频,借助预训练视频模型的时间一致性先验捕捉潜在关联,并通过一种新颖的融合机制将显式几何线索注入视频模型中,从而提升编辑的一致性与准确性。

链接: https://arxiv.org/abs/2603.22883
作者: Yue Ma,Xinyu Wang,Qianli Ma,Qinghe Wang,Mingzhe Zheng,Xiangpeng Yang,Hao Li,Chongbo Zhao,Jixuan Ying,Harry Yang,Hongyu Liu,Qifeng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model’s ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.

[CV-79] reeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration CVPR2026

【速读】:该论文旨在解决当前生成式 AI(Generative AI)中视觉语言模型(Vision-Language Models, VLMs)的安全漏洞探测效率低下的问题,尤其是现有红队测试(red teaming)方法受限于线性探索范式,无法发现新颖且多样化的攻击策略。解决方案的关键在于提出 TreeTeaming 框架,其核心是一个由大语言模型(Large Language Model, LLM)驱动的策略编排器(Orchestrator),能够动态决策是演化已有攻击路径还是探索新的策略分支,从而构建并扩展一个策略树结构;同时配备多模态执行器(multimodal actuator)来实施复杂攻击策略,实现从静态测试到动态进化发现的范式转变。

链接: https://arxiv.org/abs/2603.22882
作者: Chunxiao Li,Lijun Li,Jing Shao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026

点击查看摘要

Abstract:The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models.

[CV-80] mplate-Based Feature Aggregation Network for Industrial Anomaly Detection

【速读】:该论文旨在解决工业异常检测中现有特征重构方法面临的“捷径学习”(shortcut learning)问题,即模型可能错误地重建异常特征,从而影响检测精度。其解决方案的关键在于提出一种基于模板的特征聚合网络(Template-based Feature Aggregation Network, TFA-Net),通过将输入图像的多层级特征聚合到固定模板图像的特征上,利用模板特征对正常模式的先验信息过滤掉与之低相似度的异常特征,进而实现更准确的特征重构。该方法避免了直接重构输入特征所带来的偏差,同时引入随机掩码策略提升整体检测性能,最终在多个真实工业数据集上实现了卓越的检测效果并满足实时性要求。

链接: https://arxiv.org/abs/2603.22874
作者: Wei Luo,Haiming Yao,Wenyong Yu
机构: Tsinghua University (清华大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Engineering Applications of Artificial Intelligence

点击查看摘要

Abstract:Industrial anomaly detection plays a crucial role in ensuring product quality control. Therefore, proposing an effective anomaly detection model is of great significance. While existing feature-reconstruction methods have demonstrated excellent performance, they face challenges with shortcut learning, which can lead to undesirable reconstruction of anomalous features. To address this concern, we present a novel feature-reconstruction model called the \textbfTemplate-based \textbfFeature \textbfAggregation \textbfNetwork (TFA-Net) for anomaly detection via template-based feature aggregation. Specifically, TFA-Net first extracts multiple hierarchical features from a pre-trained convolutional neural network for a fixed template image and an input image. Instead of directly reconstructing input features, TFA-Net aggregates them onto the template features, effectively filtering out anomalous features that exhibit low similarity to normal template features. Next, TFA-Net utilizes the template features that have already fused normal features in the input features to refine feature details and obtain the reconstructed feature map. Finally, the defective regions can be located by comparing the differences between the input and reconstructed features. Additionally, a random masking strategy for input features is employed to enhance the overall inspection performance of the model. Our template-based feature aggregation schema yields a nontrivial and meaningful feature reconstruction task. The simple, yet efficient, TFA-Net exhibits state-of-the-art detection performance on various real-world industrial datasets. Additionally, it fulfills the real-time demands of industrial scenarios, rendering it highly suitable for practical applications in the industry. Code is available at this https URL.

[CV-81] ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

【速读】:该论文旨在解决视频监控场景中跨多摄像头长时序视频的精准目标检索与复杂多模态查询理解问题,现有方法如跟踪流水线、基于CLIP的模型及VideoRAG在处理涉及时间推理和深层语义理解的查询时表现不足,且缺乏合适的评估基准。其解决方案的关键在于提出ForeSeaQA基准和ForeSea系统:ForeSeaQA是一个包含长时间监控视频与多样化图像-文本混合查询及事件时间戳标注的数据集,首次支持复杂多模态查询的精确时空定位;ForeSea则是一个三阶段可插拔式AI取证搜索系统,包括目标过滤模块、多模态嵌入索引模块以及基于VideoLLM的候选片段推理模块,在ForeSeaQA上相较Prior VideoRAG模型在准确率上提升3.5%,时间IoU提升11.0%。

链接: https://arxiv.org/abs/2603.22872
作者: Hyojin Park,Yi Li,Janghoon Cho,Sungha Choi,Jungsoo Lee,Taotao Jing,Shuai Zhang,Munawar Hayat,Dashan Gao,Ning Bi,Fatih Porikli
机构: Qualcomm(高通); University of California, San Diego (加州大学圣地亚哥分校); Samsung Research (三星研究院); KAIST (韩国科学技术院); National Institute of Standards and Technology (美国国家标准与技术研究院); MIT (麻省理工学院); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods – tracking pipelines, CLIP based models, and VideoRAG – require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., “When does this person join the fight?” with the person’s image), yet this setting remains underexplored. Also, there are no proper benchmarks to evaluate those setting - asking video with multimodal queries. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Not limited to this benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline. (1) A tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.

[CV-82] Designing to Forget: Deep Semi-parametric Models for Unlearning CVPR2026

【速读】:该论文旨在解决模型遗忘(machine unlearning)效率低下的问题,即如何在不重新训练模型的前提下高效删除特定训练样本对模型的影响。传统参数化模型在执行删除操作时往往需要大量计算资源,而本文提出了一类深度半参数模型(deep semi-parametric models, SPMs),其关键创新在于引入一个融合模块(fusion module),该模块在训练阶段聚合每个训练样本的信息,并在测试阶段实现无需修改模型参数即可显式删除选定样本的机制。这种设计使SPMs在图像分类和生成任务中性能接近参数化模型,同时显著提升遗忘效率——例如在ImageNet分类任务上,相比重新训练的基准模型,预测误差降低11%,且遗忘速度比现有参数化方法快10倍以上。

链接: https://arxiv.org/abs/2603.22870
作者: Amber Yijia Zheng,Yu-Shan Tai,Raymond A. Yeh
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by 11% and achieve over 10\times faster unlearning compared to existing approaches on parametric models. The code is available at this https URL.

[CV-83] A Feature Shuffling and Restoration Strategy for Universal Unsupervised Anomaly Detection

【速读】:该论文旨在解决工业场景中无监督异常检测(unsupervised anomaly detection)面临的“相同捷径问题”(identical shortcut issue),即传统重建类方法在处理正常与异常区域时均能获得良好重建效果,导致无法有效区分异常样本。这一问题在正常数据分布复杂时尤为严重,使得现有方法在特定场景下表现优异,但在跨场景迁移时性能显著下降。为实现跨场景通用的异常检测模型(universal anomaly detection),作者提出一种新颖且高效的框架——特征洗牌与恢复(Feature Shuffling and Restoration, FSR)。其核心创新在于:使用多尺度语义特征作为重建目标而非原始图像像素,并将这些特征划分为非重叠块后进行随机洗牌与重建,从而迫使模型关注全局上下文信息;同时引入“洗牌率”(shuffling rate)控制任务复杂度,缓解不同场景下的相同捷径问题。理论分析从网络结构和互信息角度解释了FSR的有效性,实验验证了其在多种设置下的优越性和高效性。

链接: https://arxiv.org/abs/2603.22861
作者: Wei Luo,Haiming Yao,Zhenfeng Qiang,Xiaotian Zhang,Weihang Zhang
机构: Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Knowledge-Based Systems

点击查看摘要

Abstract:Unsupervised anomaly detection is vital in industrial fields, with reconstruction-based methods favored for their simplicity and effectiveness. However, reconstruction methods often encounter an identical shortcut issue, where both normal and anomalous regions can be well reconstructed and fail to identify outliers. The severity of this problem increases with the complexity of the normal data distribution. Consequently, existing methods may exhibit excellent detection performance in a specific scenario, but their performance sharply declines when transferred to another scenario. This paper focuses on establishing a universal model applicable to anomaly detection tasks across different settings, termed as universal anomaly detection. In this work, we introduce a novel, straightforward yet efficient framework for universal anomaly detection: \ulineFeature \ulineShuffling and \ulineRestoration (FSR), which can alleviate the identical shortcut issue across different settings. First and foremost, FSR employs multi-scale features with rich semantic information as reconstruction targets, rather than raw image pixels. Subsequently, these multi-scale features are partitioned into non-overlapping feature blocks, which are randomly shuffled and then restored to their original state using a restoration network. This simple paradigm encourages the model to focus more on global contextual information. Additionally, we introduce a novel concept, the shuffling rate, to regulate the complexity of the FSR task, thereby alleviating the identical shortcut across different settings. Furthermore, we provide theoretical explanations for the effectiveness of FSR framework from two perspectives: network structure and mutual information. Extensive experimental results validate the superiority and efficiency of the FSR framework across different this http URL is available at this https URL.

[CV-84] Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction

【速读】:该论文旨在解决自动驾驶中3D语义占据预测(3D semantic occupancy prediction)的精度与计算效率之间的矛盾问题。现有基于多模态融合的方法虽能提升准确性,但通常依赖于计算密集的稠密体素(dense voxel)或鸟瞰图(BEV)张量表示,导致高资源消耗。其解决方案的关键在于提出Gau-Occ框架,通过将场景建模为一组紧凑的语义3D高斯(semantic 3D Gaussians)来避免稠密体积处理;并引入LiDAR Completion Diffuser(LCD)以从稀疏LiDAR中恢复缺失结构并初始化鲁棒的高斯锚点,以及Gaussian Anchor Fusion(GAF)机制,实现多视角图像语义信息的几何对齐采样与跨模态对齐融合,从而在保持空间一致性与语义区分度的同时显著提升计算效率。

链接: https://arxiv.org/abs/2603.22852
作者: Chengxin Lv,Yihui Li,Hongyu Yang,YunHong Wang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D semantic occupancy prediction is crucial for autonomous driving. While multi-modal fusion improves accuracy over vision-only methods, it typically relies on computationally expensive dense voxel or BEV tensors. We present Gau-Occ, a multi-modal framework that bypasses dense volumetric processing by modeling the scene as a compact collection of semantic 3D Gaussians. To ensure geometric completeness, we propose a LiDAR Completion Diffuser (LCD) that recovers missing structures from sparse LiDAR to initialize robust Gaussian anchors. Furthermore, we introduce Gaussian Anchor Fusion (GAF), which efficiently integrates multi-view image semantics via geometry-aligned 2D sampling and cross-modal alignment. By refining these compact Gaussian descriptors, Gau-Occ captures both spatial consistency and semantic discriminability. Extensive experiments across challenging benchmarks demonstrate that Gau-Occ achieves state-of-the-art performance with significant computational efficiency.

[CV-85] UniQueR: Unified Query-based Feedforward 3D Reconstruction

【速读】:该论文旨在解决现有前馈式三维重建方法(如DUSt3R、VGGT和AnySplat)在处理未对齐图像时存在的局限性,即这些方法通常预测像素级点图或像素对齐的高斯分布,本质上仍为2.5D表示,难以建模遮挡区域的几何结构。其解决方案的关键在于提出UniQueR框架,将三维重建建模为稀疏3D查询推理问题:通过学习一组紧凑的3D锚点作为显式的几何查询,在单次前向传播中直接推断场景结构(包括遮挡区域),每个查询在全局3D空间中编码空间与外观先验,并生成用于可微渲染的3D高斯分布;同时利用跨视图特征的统一查询交互与解耦交叉注意力机制,在保持强几何表达能力的同时显著降低内存与计算开销。

链接: https://arxiv.org/abs/2603.22851
作者: Chensheng Peng,Quentin Herau,Jiezhi Yang,Yichen Xie,Yihan Hu,Wenzhao Zheng,Matthew Strong,Masayoshi Tomizuka,Wei Zhan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present UniQueR, a unified query-based feedforward framework for efficient and accurate 3D reconstruction from unposed images. Existing feedforward models such as DUSt3R, VGGT, and AnySplat typically predict per-pixel point maps or pixel-aligned Gaussians, which remain fundamentally 2.5D and limited to visible surfaces. In contrast, UniQueR formulates reconstruction as a sparse 3D query inference problem. Our model learns a compact set of 3D anchor points that act as explicit geometric queries, enabling the network to infer scene structure, including geometry in occluded regions–in a single forward pass. Each query encodes spatial and appearance priors directly in global 3D space (instead of per-frame camera space) and spawns a set of 3D Gaussians for differentiable rendering. By leveraging unified query interactions across multi-view features and a decoupled cross-attention design, UniQueR achieves strong geometric expressiveness while substantially reducing memory and computational cost. Experiments on Mip-NeRF 360 and VR-NeRF demonstrate that UniQueR surpasses state-of-the-art feedforward methods in both rendering quality and geometric accuracy, using an order of magnitude fewer primitives than dense alternatives.

[CV-86] Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

【速读】:该论文旨在解决多模态链式思维(Multimodal Chain-of-Thought, CoT)推理中现有基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法在粒度层面过于粗略的问题,即未区分不同推理步骤中视觉 grounding 的程度差异,导致优化效率受限。解决方案的关键在于提出感知-探索策略优化(Perception-Exploration Policy Optimization, PEPO),其通过分析 token 级别的推理轨迹动态,利用隐藏状态相似性构建感知先验,并结合 token 熵通过平滑门控机制生成 token 级别优势信号,从而实现对推理过程中感知与探索行为的精细化引导。PEPO 可无缝集成至 GRPO 和 DAPO 等主流 RLVR 框架,无需额外监督或辅助分支,在多个多模态基准任务中均展现出稳定且一致的性能提升。

链接: https://arxiv.org/abs/2603.22847
作者: Yunheng Li,Hangyi Kuang,Hengrui Zhang,Jiangxia Cao,Zhaojie Liu,Qibin Hou,Ming-Ming Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: this https URL

[CV-87] UAV-DETR: DETR for Anti-Drone Target Detection

【速读】:该论文旨在解决微型无人机(UAV)在复杂背景和恶劣环境干扰下,现有基于深度学习的检测方法难以兼顾鲁棒特征表示与计算效率的问题。其解决方案的关键在于提出一种名为UAV-DETR的新框架,核心创新包括:(1) 引入WTConv增强的主干网络与滑动窗口自注意力(Sliding Window Self-Attention, SWSA-IFI)编码器,有效捕捉微小目标的高频结构细节并显著降低参数开销;(2) 设计高效跨尺度特征重校准与融合网络(Efficient Cross-Scale Feature Recalibration and Fusion Network, ECFRFN),抑制背景噪声并聚合多尺度语义信息;(3) 采用混合内CIoU与NWD损失策略,缓解标准IoU对小目标微小位置偏移的极端敏感性,从而提升检测精度。实验证明该方法在保持实时性的同时实现了更优的检测性能。

链接: https://arxiv.org/abs/2603.22841
作者: Jun Yang,Dong Wang,Hongxu Yin,Hongpeng Li,Jianxiong Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Drone detection is pivotal in numerous security and counter-UAV applications. However, existing deep learning-based methods typically struggle to balance robust feature representation with computational efficiency. This challenge is particularly acute when detecting miniature drones against complex backgrounds under severe environmental interference. To address these issues, we introduce UAV-DETR, a novel framework that integrates a small-target-friendly architecture with real-time detection capabilities. Specifically, UAV-DETR features a WTConv-enhanced backbone and a Sliding Window Self-Attention (SWSA-IFI) encoder, capturing the high-frequency structural details of tiny targets while drastically reducing parameter overhead. Furthermore, we propose an Efficient Cross-Scale Feature Recalibration and Fusion Network (ECFRFN) to suppress background noise and aggregate multi-scale semantics. To further enhance accuracy, UAV-DETR incorporates a hybrid Inner-CIoU and NWD loss strategy, mitigating the extreme sensitivity of standard IoU metrics to minor positional deviations in small objects. Extensive experiments demonstrate that UAV-DETR significantly outperforms the baseline RT-DETR on our custom UAV dataset (+6.61% in mAP50:95, with a 39.8% reduction in parameters) and the public DUT-ANTI-UAV benchmark (+1.4% in Precision, +1.0% in F1-Score). These results establish UAV-DETR as a superior trade-off between efficiency and precision in counter-UAV object detection. The code is available at this https URL.

[CV-88] URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection

【速读】:该论文旨在解决现有无监督异常检测方法在工业缺陷检测和医学图像分析中因过度泛化而导致异常区域被良好重建、进而影响检测性能的问题。其核心解决方案是提出一种不确定性集成的异常感知与修复注意力网络(URA-Net),关键在于:首先利用预训练卷积神经网络提取多层级语义特征作为重建目标,而非仅依赖原始图像;其次引入特征级人工异常合成模块生成训练用异常样本;再通过基于贝叶斯神经网络的不确定性集成异常感知模块学习正常与异常特征分布,从而精准估计异常区域及模糊边界;最后设计一种基于全局正常语义信息的修复注意力机制,将检测到的异常区域恢复为无缺陷特征,最终通过输入特征与恢复特征之间的残差图实现高精度异常检测与定位。

链接: https://arxiv.org/abs/2603.22840
作者: Wei Luo,Peng Xing,Yunkang Cao,Haiming Yao,Weiming Shen,Zechao Li
机构: Tsinghua University (清华大学); Huazhong University of Science and Technology (华中科技大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE TCSVT

点击查看摘要

Abstract:Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.

[CV-89] MultiCam: On-the-fly Multi-Camera Pose Estimation Using Spatiotemporal Overlaps of Known Objects

【速读】:该论文旨在解决多相机动态增强现实(Augmented Reality, AR)应用中相机位姿估计的难题,尤其针对传统基于标记(marker-based)方法在视野(Field-of-View, FoV)内需持续追踪标记、且难以处理非重叠视野摄像头间关系的问题。其核心解决方案是提出一种无需标记的实时动态相机位姿估计方法,关键在于利用场景中已知物体在时空上的FoV重叠信息,通过增强当前最先进的目标位姿估计算法来更新时空场景图(spatiotemporal scene graph),从而建立即使在非重叠视野下也能实现跨相机关联的几何约束,显著提升位姿估计精度。

链接: https://arxiv.org/abs/2603.22839
作者: Shiyu Li,Hannah Schieber,Kristoffer Waldow,Benjamin Busam,Julian Kreimeier,Daniel Roth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-camera dynamic Augmented Reality (AR) applications require a camera pose estimation to leverage individual information from each camera in one common system. This can be achieved by combining contextual information, such as markers or objects, across multiple views. While commonly cameras are calibrated in an initial step or updated through the constant use of markers, another option is to leverage information already present in the scene, like known objects. Another downside of marker-based tracking is that markers have to be tracked inside the field-of-view (FoV) of the cameras. To overcome these limitations, we propose a constant dynamic camera pose estimation leveraging spatiotemporal FoV overlaps of known objects on the fly. To achieve that, we enhance the state-of-the-art object pose estimator to update our spatiotemporal scene graph, enabling a relation even among non-overlapping FoV cameras. To evaluate our approach, we introduce a multi-camera, multi-object pose estimation dataset with temporal FoV overlap, including static and dynamic cameras. Furthermore, in FoV overlapping scenarios, we outperform the state-of-the-art on the widely used YCB-V and T-LESS dataset in camera pose accuracy. Our performance on both previous and our proposed datasets validates the effectiveness of our marker-less approach for AR applications. The code and dataset are available on this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2603.22839 [cs.CV] (or arXiv:2603.22839v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.22839 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-90] MVRD-Bench: Multi-View Learning and Benchmarking for Dynamic Remote Photoplethysmography under Occlusion

【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在面部运动和遮挡场景下性能下降的问题,尤其是在非受限多视角视频中的鲁棒性不足。解决方案的关键在于提出一个统一的多视角rPPG学习框架MVRD-rPPG,其核心包括:1)引入自适应时间光流补偿(Adaptive Temporal Optical Compensation, ATOC)模块以抑制运动伪影;2)设计节奏-视觉双流网络(Rhythm-Visual Dual-Stream Network)分离周期性生理信号与外观相关特征;3)构建多视角相关感知注意力机制(Multi-View Correlation-Aware Attention, MVCA)实现视图间动态信号聚合;4)采用相关频率对抗学习策略(Correlation Frequency Adversarial, CFA)联合优化预测信号的时间准确性、频谱一致性和感知真实性。上述方法协同提升在复杂运动条件下的rPPG估计精度与稳定性。

链接: https://arxiv.org/abs/2603.22826
作者: Zuxian He,Xu Cheng,Zhaodong Sun,Haoyu Chen,Jingang Shi,Xiaobai Li,Guoying Zhao
机构: Nanjing University of Information Science and Technology (南京信息工程大学); University of Oulu (奥卢大学); Xi’an Jiaotong University (西安交通大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) is a non-contact technique that estimates physiological signals by analyzing subtle skin color changes in facial videos. Existing rPPG methods often encounter performance degradation under facial motion and occlusion scenarios due to their reliance on static and single-view facial videos. Thus, this work focuses on tackling the motion-induced occlusion problem for rPPG measurement in unconstrained multi-view facial videos. Specifically, we introduce a Multi-View rPPG Dataset (MVRD), a high-quality benchmark dataset featuring synchronized facial videos from three viewpoints under stationary, speaking, and head movement scenarios to better match real-world conditions. We also propose MVRD-rPPG, a unified multi-view rPPG learning framework that fuses complementary visual cues to maintain robust facial skin coverage, especially under motion conditions. Our method integrates an Adaptive Temporal Optical Compensation (ATOC) module for motion artifact suppression, a Rhythm-Visual Dual-Stream Network to disentangle rhythmic and appearance-related features, and a Multi-View Correlation-Aware Attention (MVCA) for adaptive view-wise signal aggregation. Furthermore, we introduce a Correlation Frequency Adversarial (CFA) learning strategy, which jointly enforces temporal accuracy, spectral consistency, and perceptual realism in the predicted signals. Extensive experiments and ablation studies on the MVRD dataset demonstrate the superiority of our approach. In the MVRD movement scenario, MVRD-rPPG achieves an MAE of 0.90 and a Pearson correlation coefficient ® of 0.99. The source code and dataset will be made available.

[CV-91] Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference CVPR-2026

【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)因实验成本高昂而难以大规模应用的问题,提出通过病理图像预测ST数据作为替代方案。现有方法在捕捉跨切片复杂空间关系方面表现不足,限制了预测精度。解决方案的关键在于提出一种多模态异构图模型SpaHGC,其通过整合目标切片内的局部空间上下文与基于病理基础模型提取的图像嵌入所计算的跨切片相似性,实现跨切片知识迁移;同时引入掩码图对比学习(Masked Graph Contrastive Learning)增强特征表示,从而有效建模复杂的点-点空间依赖关系,显著提升预测准确性,并在多个癌症相关通路中表现出强生物学意义。

链接: https://arxiv.org/abs/2603.22821
作者: Zhiceng Shi,Changmiao Wang,Jun Wan,Wenwen Min
机构: Yunnan University (云南大学); Shenzhen Research Institute of Big Data (深圳大数据研究院); Zhongnan University of Economics and Law (中南财经政法大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR-2026

点击查看摘要

Abstract:While spatial transcriptomics (ST) has advanced our understanding of gene expression in tissue context, its high experimental cost limits its large-scale application. Predicting ST from pathology images is a promising, cost-effective alternative, but existing methods struggle to capture complex cross-slide spatial relationships. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer, and SpaHGC further incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms the existing nine state-of-the-art methods across all evaluation metrics. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential.

[CV-92] DATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment CVPR2026

【速读】:该论文旨在解决现有表格识别(Table Recognition, TR)方法中存在的两大问题:一是模块化流水线分别建模表格结构与内容,导致集成效果不佳且流程复杂;二是端到端方法依赖大规模TR数据,在数据受限场景下表现欠佳。其解决方案的关键在于提出TDATR(Table Detail-Aware Table Recognition),采用“感知-融合”策略:首先通过表细节感知学习,利用语言建模范式设计多个结构理解与内容识别任务,联合建模表格结构与内容,并自然利用多样化文档数据提升鲁棒性;其次在训练数据有限时,通过隐式表细节整合生成结构化HTML输出,提升建模效率;此外,引入结构引导的单元格定位模块,增强视觉-语言对齐能力,从而显著提升识别准确性与可解释性。

链接: https://arxiv.org/abs/2603.22819
作者: Chunxia Qin,Chenyu Liu,Pengcheng Xia,Jun Du,Baocai Yin,Bing Yin,Cong Liu
机构: University of Science and Technology of China (中国科学技术大学); iFLYTEK Research (科大讯飞研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Acceptd by CVPR 2026. Project Page: this https URL

点击查看摘要

Abstract:Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition) improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a ``perceive-then-fuse’’ strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cell and strengthens vision-language alignment. It enhances the interpretability and accuracy of TR. We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.

[CV-93] Focus Dont Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding CVPR2026

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理信息密集且视觉复杂的图像(如信息图或文档布局)时,因生成大量视觉标记而导致显著计算开销的问题。解决方案的关键在于提出一种两阶段框架PinPoint:第一阶段通过指令-区域对齐(Instruction-Region Alignment)机制,结合视觉输入与文本指令精确定位与任务相关的图像区域;第二阶段对这些区域进行细化以提取细粒度视觉特征,从而提升推理能力并减少无关视觉标记的计算负担。

链接: https://arxiv.org/abs/2603.22815
作者: Mincheol Kwon,Minseung Lee,Seonga Choi,Miso Choi,Kyeong-Jin Oh,Hyunyoung Lee,Cheonyoung Park,Yongho Song,Seunghyun Park,Jinkyu Kim
机构: Korea University (韩国大学); KT Corporation (KT公司); Soongsil University (中央大学); Kakao Mobility (카카오모빌리티)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.

[CV-94] PhotoAgent : A Robotic Photographer with Spatial and Aesthetic Understanding ICRA

【速读】:该论文旨在解决生成式 AI(Generative AI)在创意任务(如摄影)中如何将高层语言指令与几何控制有效衔接的问题,即“语义鸿沟”问题。其解决方案的关键在于引入 PhotoAgent,该代理通过结合大模型多模态(Large Multimodal Models, LMMs)的链式思维(Chain-of-Thought, CoT)推理能力与一种新颖的控制范式:首先利用 LMM 推理将主观美学目标转化为可求解的几何约束,由解析求解器计算出高质量初始视角;随后借助基于 3D 高斯泼溅(3D Gaussian Splatting, 3DGS)构建的逼真内部世界模型进行视觉反馈驱动的迭代优化,实现“心理模拟”,从而替代昂贵且低效的物理试错过程,显著提升最终图像质量与空间推理能力。

链接: https://arxiv.org/abs/2603.22796
作者: Lirong Che,Zhenfeng Gan,Yanbo Chen,Junbo Tan,Xueqian Wang
机构: Center for Artificial Intelligence and Robotics, Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

点击查看摘要

Abstract:Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Models (LMMs) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This ``mental simulation’’ replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.

[CV-95] It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal CVPR2026

【速读】:该论文旨在解决短曝光摄影中由不稳定照明和逐行曝光不一致引起的闪烁伪影(flicker artifacts)问题,这类伪影具有特定的时空结构特性,现有通用图像恢复框架未能充分建模其规律,导致去闪烁效果不佳并产生鬼影(ghosting)伪影。解决方案的关键在于揭示了闪烁伪影的两个内在特征——周期性(periodicity)与方向性(directionality),并提出基于Transformer架构的Flickerformer模型,其核心创新包括:基于相位的相关融合模块(PFM)利用帧间相位相关性自适应聚合多帧特征;自相关前馈网络(AFFN)通过帧内自相关挖掘结构规律以增强对空间重复模式的感知能力;以及基于小波域的方向注意力模块(WDAM)利用高频变化引导低频暗区修复,实现对闪烁伪影的精确定位与抑制,从而在定量指标与视觉质量上均优于当前最优方法。

链接: https://arxiv.org/abs/2603.22794
作者: Lishen Qu,Shihao Zhou,Jie Liang,Hui Zeng,Lei Zhang,Jufeng Yang
机构: Nankai International Advanced Research Institute (SHENZHEN·FUTIAN); Peng Cheng Laboratory; College of Computer Science, Nankai University; The Hong Kong Polytechnic University; OPPO Research Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network’s ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available at this https URL.

[CV-96] Predictive Photometric Uncertainty in Gaussian Splatting for Novel View Synthesis

【速读】:该论文旨在解决3D Gaussian Splatting(3DGS)在应用于自主代理和安全关键场景时缺乏不确定性估计的问题,即当前方法虽能实现高保真渲染,但无法提供可靠的空间感知置信度信息。解决方案的关键在于提出一种轻量级、可插拔的后处理框架,通过将不确定性建模为基于重建残差的贝叶斯正则化线性最小二乘优化问题,无需修改原始场景表示即可提取每个图元(primitive)的像素级、视角依赖的预测不确定性通道,从而将3DGS转化为具备可信度输出的空间地图,并显著提升下游感知任务(如主动视点选择、无位姿场景变化检测和异常检测)的性能。

链接: https://arxiv.org/abs/2603.22786
作者: Chamuditha Jayanga Galappaththige,Thomas Gottwald,Peter Stehr,Edgar Heinert,Niko Suenderhauf,Dimity Miller,Matthias Rottmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting have enabled impressive photorealistic novel view synthesis. However, to transition from a pure rendering engine to a reliable spatial map for autonomous agents and safety-critical applications, knowing where the representation is uncertain is as important as the rendering fidelity itself. We bridge this critical gap by introducing a lightweight, plug-and-play framework for pixel-wise, view-dependent predictive uncertainty estimation. Our post-hoc method formulates uncertainty as a Bayesian-regularized linear least-squares optimization over reconstruction residuals. This architecture-agnostic approach extracts a per-primitive uncertainty channel without modifying the underlying scene representation or degrading baseline visual fidelity. Crucially, we demonstrate that providing this actionable reliability signal successfully translates 3D Gaussian splatting into a trustworthy spatial map, further improving state-of-the-art performance across three critical downstream perception tasks: active view selection, pose-agnostic scene change detection, and pose-agnostic anomaly detection.

[CV-97] Exposure-Normalized Bed and Chair Fall Rates via Continuous AI Monitoring

【速读】:该论文旨在解决传统跌倒监测中因以占用床位日(occupied bed-days)为分母导致的跌倒率估计偏差问题,从而更准确地评估不同设备(如椅子与病床)的跌倒风险。其解决方案的关键在于采用连续人工智能(AI)监控技术,基于实际暴露时间(即每小时的椅子或病床暴露时长)计算跌倒率,从而获得概率加权的单位暴露时间跌倒发生率,提升了风险量化精度。研究结果显示,椅子暴露时的跌倒率为每1000小时17.8次,显著高于病床暴露时的4.3次,且部分直接椅子跌倒事件与脚踏板位置不当相关,提示应优先优化座椅安全设计而非减少使用。

链接: https://arxiv.org/abs/2603.22785
作者: Paolo Gabriel,Peter Rehani,Zack Drumm,Tyler Troy,Tiffany Wyatt,Narinder Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 6 figures

点击查看摘要

Abstract:This retrospective cohort study used continuous AI monitoring to estimate fall rates by exposure time rather than occupied bed-days. From August 2024 to December 2025, 3,980 eligible monitoring units contributed 292,914 hourly rows, yielding probability-weighted rates of 17.8 falls per 1,000 chair exposure-hours and 4.3 per 1,000 bed exposure-hours. Within the study window, 43 adjudicated falls matched the monitoring pipeline, and 40 linked to eligible exposure hours for the primary Poisson model, producing an adjusted chair-versus-bed rate ratio of 2.35 (95% confidence interval 0.87 to 6.33; p=0.0907). In a separate broader observation cohort (n=32 deduplicated events), 6 of 7 direct chair falls involved footrest-positioning failures. Because this was an observational study in a single health system, these findings remain hypothesis-generating and support testing safer chair setups rather than using chairs less.

[CV-98] Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models

【速读】:该论文旨在解决当前3D生成模型在单视角观测下对未见区域(如背面)生成结果存在随机性强、难以控制且常与用户意图不符或产生不合理几何结构的问题。解决方案的关键在于提出Know3D框架,通过将多模态大语言模型(Multimodal Large Language Model, MLLM)中的语义知识以潜在隐藏状态注入的方式引入3D生成过程,结合视觉语言模型(Vision-Language Model, VLM)进行语义理解与指导,以及扩散模型作为语义到3D结构的转换桥梁,从而实现基于文本指令对3D资产背面视图的可控生成,将原本随机的“幻觉”过程转变为语义可控制的几何重建过程。

链接: https://arxiv.org/abs/2603.22782
作者: Wenyue Chen,Wenjue Chen,Peng Li,Qinghe Wang,Xu Jia,Heliang Zheng,Rongfei Jia,Yuan Liu,Ronggang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: page: this https URL

点击查看摘要

Abstract:Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, which may sometimes fail to align with user intentions or produce implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into 3D generative processes via latent hidden-state injection, enabling language-controllable generation of the back-view for 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance. The diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we successfully bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process, demonstrating a promising direction for future 3D generation models.

[CV-99] ypography-Based Monocular Distance Estimation Framework for Vehicle Safety Systems

【速读】:该论文旨在解决单目视觉在车辆间距离估计中的尺度模糊性(scale ambiguity)问题,其核心挑战在于如何利用低成本摄像头实现高精度、鲁棒的测距。解决方案的关键在于引入基于标准化车牌字体特征的被动标识物(passive fiducial markers)方法:通过检测车牌字符高度并结合针孔相机模型计算距离,同时融合车道线引导的相机姿态补偿、多模态字符分割策略(自适应与全局阈值结合)、卡尔曼滤波速度估计以及包括笔画宽度、字符间距和边框厚度在内的多特征融合机制,从而显著提升测距精度与稳定性。实验表明,相较于传统基于车牌宽度的方法,该方案将估计标准差降低35%,有效减少误触发制动或加速的风险。

链接: https://arxiv.org/abs/2603.22781
作者: Manognya Lokesh Reddy,Zheng Liu
机构: University of Michigan-Dearborn (密歇根大学迪尔伯恩分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 11 figures

点击查看摘要

Abstract:Accurate inter-vehicle distance estimation is a cornerstone of advanced driver assistance systems and autonomous driving. While LiDAR and radar provide high precision, their cost prohibits widespread adoption in mass-market vehicles. Monocular vision offers a low-cost alternative but suffers from scale ambiguity and sensitivity to environmental disturbances. This paper introduces a typography-based monocular distance estimation framework, which exploits the standardized typography of license plates as passive fiducial markers for metric distance estimation. The core geometric module uses robust plate detection and character segmentation to measure character height and computes distance via the pinhole camera model. The system incorporates interactive calibration, adaptive detection with strict and permissive modes, and multi-method character segmentation leveraging both adaptive and global thresholding. To enhance robustness, the framework further includes camera pose compensation using lane-based horizon estimation, hybrid deep-learning fusion, temporal Kalman filtering for velocity estimation, and multi-feature fusion that exploits additional typographic cues such as stroke width, character spacing, and plate border thickness. Experimental validation with a calibrated monocular camera in a controlled indoor setup achieved a coefficient of variation of 2.3% in character height across consecutive frames and a mean absolute error of 7.7%. The framework operates without GPU acceleration, demonstrating real-time feasibility. A comprehensive comparison with a plate-width based method shows that character-based ranging reduces the standard deviation of estimates by 35%, translating to smoother, more consistent distance readings in practice, where erratic estimates could trigger unnecessary braking or acceleration.

[CV-100] From Pixels to Semantics: A Multi-Stage AI Framework for Structural Damage Detection in Satellite Imagery

【速读】:该论文旨在解决自然灾害后建筑损伤评估中遥感影像因空间分辨率低、语境模糊及语义可解释性差而导致传统检测流程可靠性不足的问题。其解决方案的关键在于构建一个融合生成式AI(Generative AI)的混合框架:首先利用视频恢复Transformer(VRT)提升卫星图像分辨率(从1024×1024增强至4096×4096),以改善结构细节可见性;随后采用YOLOv11模型定位灾前建筑区域,并通过视觉语言模型(VLMs)对裁剪后的建筑区域进行四等级语义损伤分类;为提升评估鲁棒性,引入CLIPScore实现无参考语义对齐,并设计多模型VLM-as-a-Jury策略降低单一模型偏差,从而为应急响应提供更可靠且具可解释性的损伤分析结果。

链接: https://arxiv.org/abs/2603.22768
作者: Bijay Shakya,Catherine Hoier,Khandaker Mamun Ahmed
机构: Dakota State University (南达科他州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rapid and accurate structural damage assessment following natural disasters is critical for effective emergency response and recovery. However, remote sensing imagery often suffers from low spatial resolution, contextual ambiguity, and limited semantic interpretability, reducing the reliability of traditional detection pipelines. In this work, we propose a novel hybrid framework that integrates AI-based super-resolution, deep learning object detection, and Vision-Language Models (VLMs) for comprehensive post-disaster building damage assessment. First, we enhance pre- and post-disaster satellite imagery using a Video Restoration Transformer (VRT) to upscale images from 1024x1024 to 4096x4096 resolution, improving structural detail visibility. Next, a YOLOv11-based detector localizes buildings in pre-disaster imagery, and cropped building regions are analyzed using VLMs to semantically assess structural damage across four severity levels. To ensure robust evaluation in the absence of ground-truth captions, we employ CLIPScore for reference-free semantic alignment and introduce a multi-model VLM-as-a-Jury strategy to reduce individual model bias in safety-critical decision making. Experiments on subsets of the xBD dataset, including the Moore Tornado and Hurricane Matthew events, demonstrate that the proposed framework enhances the semantic interpretation of damaged buildings. In addition, our framework provides helpful recommendations to first responders for recovery based on damage analysis.

[CV-101] ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding CVPR2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在专业电子海图(Electronic Navigational Charts, ENCs)理解上的可靠性问题。ENCs作为现代航海安全的核心,其符号系统、几何结构和规则约束要求高度专业的解读能力,而现有MLLMs在此类任务中表现不佳。解决方案的关键在于构建首个专注于专业ENC理解的基准测试集ENC-Bench,该基准包含20,490个专家验证样本,源自840份真实NOAA ENC数据,采用标准化矢量转图像管道生成并经过一致性校验与人工审核,涵盖感知、空间推理和海事决策三个层次的任务。通过统一零样本协议评估10种先进MLLMs,揭示了当前模型在符号定位、空间计算、多约束推理及鲁棒性方面的系统性挑战,从而为推进MLLMs向专业航海应用演进提供了关键基础设施。

链接: https://arxiv.org/abs/2603.22763
作者: Ao Cheng,Xingming Li,Xuanyu Ji,Xixiang He,Qiyao Sun,Chunping Qiu,Runke Huang,Qingyong Hu
机构: National University of Defense Technology (国防科技大学); Intelligent Game and Decision Lab (智能游戏与决策实验室); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026, Project page: this https URL

点击查看摘要

Abstract:Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure – requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs such as GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.

[CV-102] Reconstruction-Guided Slot Curriculum: Addressing Object Over-Frag mentation in Video Object-Centric Learning CVPR2026

【速读】:该论文旨在解决视频对象中心学习(Video Object-Centric Learning)中普遍存在的过分割(over-fragmentation)问题,即现有基于槽位注意力(slot-attention)的模型倾向于将单一物体分配到多个冗余槽位,从而导致表示效率低下。其核心解决方案是提出一种重建引导的槽位课程学习机制(Reconstruction-guided Slot Curriculum, SlotCurri):训练初期仅使用少量粗粒度槽位,随后根据重建误差动态扩展新槽位,仅在需要时增加表示容量以避免早期碎片化;同时引入结构感知损失(structure-aware loss),增强局部对比度与边缘信息,促使每个槽位清晰分离语义边界;此外,设计循环推理机制(cyclic inference),使槽位在帧序列中双向传播,提升早期帧中的时空一致性。上述策略协同作用,显著改善了对象表示的准确性与稳定性。

链接: https://arxiv.org/abs/2603.22758
作者: WonJun Moon,Hyun Seok Seong,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2026 paper. Our code is available at this http URL

点击查看摘要

Abstract:Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub-parts can emerge only if coarse-level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. Therefore, we augment MSE with a structure-aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries. Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at this http URL.

[CV-103] Multimodal Industrial Anomaly Detection via Geometric Prior

【速读】:该论文旨在解决当前多模态工业异常检测方法在处理复杂几何形状缺陷(如微弱表面形变和不规则轮廓)时检测精度不足的问题,其根本原因在于现有方法未能有效利用关键几何信息,如表面法向量和三维形状拓扑结构。解决方案的关键在于提出一种基于几何先验的异常检测网络(GPAD),核心创新包括:首先设计点云专家模型,通过差分法向量计算增强几何特征细节并生成几何先验;其次采用两阶段融合策略,高效整合多模态数据与点云内在几何先验的互补性,并结合基于几何先验的注意力机制与异常区域分割,显著提升模型对几何缺陷的感知能力。

链接: https://arxiv.org/abs/2603.22757
作者: Min Li,Jinghui He,Gang Li,Jiachen Li,Jin Wan,Delong Han
机构: City University of Macau (澳门城市大学); Shandong Computer Science Center (国家超算济南中心); Qilu University of Technology (山东理工大学); Shandong Academy of Sciences (山东省科学院); Jinan (济南)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

点击查看摘要

Abstract:The purpose of multimodal industrial anomaly detection is to detect complex geometric shape defects such as subtle surface deformations and irregular contours that are difficult to detect in 2D-based methods. However, current multimodal industrial anomaly detection lacks the effective use of crucial geometric information like surface normal vectors and 3D shape topology, resulting in low detection accuracy. In this paper, we propose a novel Geometric Prior-based Anomaly Detection network (GPAD). Firstly, we propose a point cloud expert model to perform fine-grained geometric feature extraction, employing differential normal vector computation to enhance the geometric details of the extracted features and generate geometric prior. Secondly, we propose a two-stage fusion strategy to efficiently leverage the complementarity of multimodal data as well as the geometric prior inherent in 3D points. We further propose attention fusion and anomaly regions segmentation based on geometric prior, which enhance the model’s ability to perceive geometric defects. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the State-of-the-art (SOTA) methods in detection accuracy on both MVTec-3D AD and Eyecandies datasets.

[CV-104] MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding IJCNN2026

【速读】:该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在处理多视频输入时的感知与理解能力评估缺失问题,现有基准测试主要局限于静态图像或单视频场景,未能涵盖多视频间复杂交互关系的建模需求。解决方案的关键在于提出一个新的基准测试工具——多视频感知评估基准(Multi-Video Perception Evaluation Benchmark, MVPBench),其包含14个子任务、覆盖多样视觉领域,并基于2.7K个视频片段(来自现有数据集和人工标注)构建了5K个问答测试样例,能够系统性地评估模型从视频序列中提取相关信息并做出决策的能力,从而揭示当前模型在多视频理解方面的显著局限性,推动该方向的技术进步。

链接: https://arxiv.org/abs/2603.22756
作者: Purui Bai,Tao Wu,Jiayang Sun,Xinyue Liu,Huaibo Huang,Ran He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 7 figures, accepted by IJCNN 2026, code and dataset available at this https URL

点击查看摘要

Abstract:The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.

[CV-105] SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts CVPR2026

【速读】:该论文旨在解决将对比语言-图像预训练(CLIP)模型应用于音视频定位任务时存在的语义对齐困难问题,具体表现为:使用固定分类标记([CLS])替换为音频嵌入标记([V_A])难以捕捉语义线索,且采用固定提示“a photo of a [V_A]”无法建立音频嵌入与上下文标记之间的有效关联。解决方案的关键在于提出一种声学感知提示学习方法(SOUPLE),其核心是用可学习的上下文标记替代固定提示,这些标记融合视觉特征以生成掩码解码器所需的条件上下文,从而有效建立音频与视觉输入间的语义对应关系。

链接: https://arxiv.org/abs/2603.22732
作者: Khanh Binh Nguyen,Chae Jung Park
机构: Deakin University (迪肯大学); National Cancer Center (国家癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt “a photo of a [V_A]” fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.

[CV-106] How Far Can VLMs Go for Visual Bug Detection? Studying 19738 Keyframes from 41 Hours of Gameplay Videos

【速读】:该论文旨在解决长时游戏视频质量保证(QA)中人工检测效率低、易出错的问题,探索视觉语言模型(VLM)在真实工业场景下对游戏画面视觉错误的自动检测能力。其关键解决方案在于:通过从长时游戏视频中采样关键帧,并利用预训练VLM进行单次提示(single-prompt)推理来判断是否存在视觉缺陷;进一步尝试两种无需微调的增强策略——引入二级判别模型重审输出和基于历史缺陷报告的元数据增强提示——以提升检测性能。研究发现,尽管这些增强策略仅带来边际改进并增加计算开销与结果波动,但基础VLM已具备一定检测能力,未来进展可能依赖于将文本与视觉异常检测任务更好分离的混合方法。

链接: https://arxiv.org/abs/2603.22706
作者: Wentao Lu,Alexander Senchenko,Alan Sayle,Abram Hindle,Cor-Paul Bezemer
机构: University of Alberta (阿尔伯塔大学); Electronic Arts (电子艺界)
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Video-based quality assurance (QA) for long-form gameplay video is labor-intensive and error-prone, yet valuable for assessing game stability and visual correctness over extended play sessions. Vision language models (VLMs) promise general-purpose visual reasoning capabilities and thus appear attractive for detecting visual bugs directly from video frames. Recent benchmarks suggest that VLMs can achieve promising results in detecting visual glitches on curated datasets. Building on these findings, we conduct a real-world study using industrial QA gameplay videos to evaluate how well VLMs perform in practical scenarios. Our study samples keyframes from long gameplay videos and asks a VLM whether each keyframe contains a bug. Starting from a single-prompt baseline, the model achieves a precision of 0.50 and an accuracy of 0.72. We then examine two common enhancement strategies used to improve VLM performance without fine-tuning: (1) a secondary judge model that re-evaluates VLM outputs, and (2) metadata-augmented prompting through the retrieval of prior bug reports. Across \textbf100 videos totaling \textbf41 hours and \textbf19,738 keyframes, these strategies provide only marginal improvements over the simple baseline, while introducing additional computational cost and output variance. Our findings indicate that off-the-shelf VLMs are already capable of detecting a certain range of visual bugs in QA gameplay videos, but further progress likely requires hybrid approaches that better separate textual and visual anomaly detection.

[CV-107] meWeaver: Age-Consistent Reference-Based Face Restoration with Identity Preservation

【速读】:该论文旨在解决参考图像与退化输入在年龄上不一致时,现有基于参考的面部恢复方法无法保持年龄一致性的难题(即跨年龄参考场景下的恢复问题)。解决方案的关键在于提出TimeWeaver框架,其核心创新是将身份(identity)与年龄(age)条件解耦,并在训练和推理阶段分别处理:训练阶段通过基于Transformer的ID-Fusion模块融合全局身份嵌入与年龄抑制的面部token,学习对年龄鲁棒的身份表征;推理阶段则采用两种无需训练的技术——年龄感知梯度引导(Age-Aware Gradient Guidance)和令牌目标注意力增强(Token-Targeted Attention Boost),以引导生成过程精确匹配目标年龄提示,从而实现身份保真度与年龄一致性的协同优化。

链接: https://arxiv.org/abs/2603.22701
作者: Teer Song,Yue Zhang,Yu Tian,Ziyang Wang,Xianlin Zhang,Guixuan Zhang,Xuan Liu,Xueming Li,Yasen Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学); Minzu University of China (中央民族大学); Xiaomi Corporation (小米公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is an improved version based on arXiv:2603.18645

点击查看摘要

Abstract:Recent progress in face restoration has shifted from visual fidelity to identity fidelity, driving a transition from reference-free to reference-based paradigms that condition restoration on reference images of the same person. However, these methods assume the reference and degraded input are age-aligned. When only cross-age references are available, as in historical restoration or missing-person retrieval, they fail to maintain age fidelity. To address this limitation, we propose TimeWeaver, the first reference-based face restoration framework supporting cross-age references. Given arbitrary reference images and a target-age prompt, TimeWeaver produces restorations with both identity fidelity and age consistency. Specifically, we decouple identity and age conditioning across training and inference. During training, the model learns an age-robust identity representation by fusing a global identity embedding with age-suppressed facial tokens via a transformer-based ID-Fusion module. During inference, two training-free techniques, Age-Aware Gradient Guidance and Token-Targeted Attention Boost, steer sampling toward desired age semantics, enabling precise adherence to the target-age prompt. Extensive experiments show that TimeWeaver surpasses existing methods in visual quality, identity preservation, and age consistency.

[CV-108] WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment

【速读】:该论文旨在解决基于Wi-Fi信道状态信息(Channel State Information, CSI)的室内人体活动语义理解中,如何实现细粒度自然语言描述生成的问题。现有方法多集中于姿态估计或预定义动作分类,难以跨越无线信号与语言之间的语义鸿沟,并存在方向敏感性歧义(如左右肢体混淆)。其解决方案的关键在于提出一个三阶段框架WiFi2Cap:首先利用视觉-语言教师模型从同步视频-文本对中学习可迁移的监督信号;其次通过镜像一致性损失(Mirror-Consistency Loss)减少跨模态对齐过程中的镜像动作和左右混淆问题;最后采用前缀微调(prefix-tuning)的语言模型从CSI嵌入中生成自然语言动作描述。该方法显著提升了在BLEU-4、METEOR、ROUGE-L、CIDEr和SPICE等指标上的性能,实现了隐私友好的语义感知。

链接: https://arxiv.org/abs/2603.22690
作者: Tzu-Ti Wei,Chu-Yu Huang,Yu-Chee Tseng,Jen-Jee Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Privacy-preserving semantic understanding of human activities is important for indoor sensing, yet existing Wi-Fi CSI-based systems mainly focus on pose estimation or predefined action classification rather than fine-grained language generation. Mapping CSI to natural-language descriptions remains challenging because of the semantic gap between wireless signals and language and direction-sensitive ambiguities such as left/right limb confusion. We propose WiFi2Cap, a three-stage framework for generating action captions directly from Wi-Fi CSI. A vision-language teacher learns transferable supervision from synchronized video-text pairs, and a CSI student is aligned to the teacher’s visual space and text embeddings. To improve direction-sensitive captioning, we introduce a Mirror-Consistency Loss that reduces mirrored-action and left-right ambiguities during cross-modal alignment. A prefix-tuned language model then generates action descriptions from CSI embeddings. We also introduce the WiFi2Cap Dataset, a synchronized CSI-RGB-sentence benchmark for semantic captioning from Wi-Fi signals. Experimental results show that WiFi2Cap consistently outperforms baseline methods on BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, demonstrating effective privacy-friendly semantic sensing.

[CV-109] hink 360°: Evaluating the Width-centric Reasoning Capability of MLLM s Beyond Depth CVPR2026

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理能力评估中过于侧重“推理深度”(reasoning depth)而忽视“推理宽度”(reasoning width)的问题。推理宽度指模型在面对复杂问题时,能够并行探索多种可能的推理路径、应用多样化约束进行剪枝,并高效识别有效解题路径的能力,这与传统的长链顺序推理形成互补。解决方案的关键在于构建一个全面的多模态基准测试集(holistic multimodal benchmark),包含1200+高质量跨领域案例,并提出一种细粒度的“思维树”(tree-of-thought)评估协议,可联合量化推理宽度与深度。通过该框架对12个主流模型家族(超30个先进MLLMs)进行系统评测,揭示了当前模型在整合深度序列推理与广度探索性搜索以实现真正洞察式推理方面的不足。

链接: https://arxiv.org/abs/2603.22689
作者: Mingrui Chen,Hexiong Yang,Haogeng Liu,Huaibo Huang,Ran He
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning width, a complementary dimension to the more commonly studied reasoning depth. Specifically, reasoning depth measures the model’s ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next. Reasoning width tends to focus more on the model’s capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking. To achieve it, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning width and depth. We evaluate 12 major model families (over 30 advanced MLLMs) across difficulty tiers, question types, and required skills. Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only deeper but also wider.

[CV-110] GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning CVPR2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在感知细粒度几何结构方面的不足,从而限制其几何理解与视觉推理能力的问题。解决方案的关键在于提出GeoTikzBridge框架,通过生成基于TikZ的代码来增强局部几何感知与视觉推理能力;其中,GeoTikzBridge-Base模型基于包含250万图像到TikZ配对数据的GeoTikz-Base数据集进行训练,采用迭代数据扩展和局部几何变换策略构建;而GeoTikzBridge-Instruct模型则在首个支持视觉推理的指令增强型TikZ数据集GeoTikz-Instruct上进行微调,最终实现开源MLLMs中的最优几何推理性能,并可作为即插即用的推理模块提升任意MLLM在几何问题求解中的表现。

链接: https://arxiv.org/abs/2603.22687
作者: Jiayin Sun,Caixia Sun,Boyu Yang,Hailin Li,Xiao Chen,Yi Zhang,Errui Ding,Liang Li,Chao Deng,Junlan Feng
机构: JIUTIAN Research (九天研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR 2026

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, constraining their ability of geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16 \times larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on GeoTikz-Instruct dataset which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM(LLM), enhancing reasoning performance in geometric problem-solving. Datasets and codes are publicly available at: this https URL.

[CV-111] Large-Scale Avalanche Mapping from SAR Images with Deep Learning-based Change Detection

【速读】:该论文旨在解决利用卫星遥感影像实现大范围雪崩精准检测的问题,尤其针对因频率和强度上升而日益威胁人类生命、基础设施与生态系统的快速质量移动灾害。其解决方案的关键在于采用双时相合成孔径雷达(SAR)影像进行单模态变化检测(unimodal change detection),即仅依赖灾前与灾后SAR图像完成变化识别,而非融合多源数据;通过端到端的自动化流程,在多个阿尔卑斯山生态系统区域验证了该方法的有效性,实现了F1-score达0.8061(保守配置)和F2-score达0.8414(召回优化配置),同时揭示了阈值调整对提升小规模或边缘雪崩检测能力的重要性。

链接: https://arxiv.org/abs/2603.22658
作者: Mattia Gatti,Alberto Mariani,Ignazio Gallo,Fabiano Monti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate change detection from satellite imagery is essential for monitoring rapid mass-movement hazards such as snow avalanches, which increasingly threaten human life, infrastructure, and ecosystems due to their rising frequency and intensity. This study presents a systematic investigation of large-scale avalanche mapping through bi-temporal change detection using Sentinel-1 synthetic aperture radar (SAR) imagery. Extensive experiments across multiple alpine ecoregions with manually validated avalanche inventories show that treating the task as a unimodal change detection problem, relying solely on pre- and post-event SAR images, achieves the most consistent performance. The proposed end-to-end pipeline achieves an F1-score of 0.8061 in a conservative (F1-optimized) configuration and attains an F2-score of 0.8414 with 80.36% avalanche-polygon hit rate under a less conservative, recall-oriented (F2-optimized) tuning. These results highlight the trade-off between precision and completeness and demonstrate how threshold adjustment can improve the detection of smaller or marginal avalanches. The release of the annotated multi-region dataset establishes a reproducible benchmark for SAR-based avalanche mapping.

[CV-112] MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping CVPR2026

【速读】:该论文旨在解决传统主动建图(Active Mapping)方法中因依赖贪婪的下一最佳视角预测而导致探索效率低、场景重建不完整的问题。其解决方案的关键在于提出了一种名为MAGICIAN的长期规划框架,该框架通过引入“想象高斯”(Imagined Gaussians)这一基于预训练占据网络(occupancy network)构建的场景表示,利用其强结构先验实现对任意新视角的快速体素渲染与覆盖增益计算,从而将覆盖增益高效集成至树搜索算法中以支持长时程规划。同时,系统采用闭环更新想象高斯并优化轨迹的机制,显著提升了建图的完整性与效率,在多种室内和室外基准测试中均达到当前最优性能。

链接: https://arxiv.org/abs/2603.22650
作者: Shiyao Li,Antoine Guédon,Shizhe Chen,Vincent Lepetit
机构: LIGM, École Nationale des Ponts et Chaussées, IP Paris, Univ Gustave Eiffel, CNRS, France; École Polytechnique, France; Inria, École normale supérieure, CNRS, PSL Research University, France
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at CVPR 2026. Project webpage: this https URL

点击查看摘要

Abstract:Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.

[CV-113] Pretext Matters: An Empirical Study of SSL Methods in Medical Imaging

【速读】:该论文试图解决的问题是:在医学影像领域,不同自监督学习(Self-Supervised Learning, SSL)方法对临床相关特征的建模能力存在差异,但目前尚缺乏系统性研究来明确哪种SSL目标更契合特定影像模态的空间结构和噪声特性。解决方案的关键在于通过实证比较两种主流SSL架构——联合嵌入架构(Joint Embedding Architectures, JEAs)与联合嵌入预测架构(Joint Embedding Predictive Architectures, JEPAs)——在两种具有独特噪声模式的医学影像模态(超声和组织病理学)中的表现,并发现:当信息在空间上局部集中时(如组织病理学),JEAs因其视图不变性目标更具优势;而当诊断相关信息呈现全局结构时(如肝脏超声中的宏观解剖),JEPAs表现更优。这一发现为根据影像模态的结构与噪声特性选择最优SSL目标提供了可解释的框架。

链接: https://arxiv.org/abs/2603.22649
作者: Vedrana Ivezić,Mara Pleasure,Ashwath Radhachandran,Saarang Panchavati,Shreeram Athreya,Vivek Sant,Benjamin Emert,Gregory Fishbein,Corey Arnold,William Speier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Though self-supervised learning (SSL) has demonstrated incredible ability to learn robust representations from unlabeled data, the choice of optimal SSL strategy can lead to vastly different performance outcomes in specialized domains. Joint embedding architectures (JEAs) and joint embedding predictive architectures (JEPAs) have shown robustness to noise and strong semantic feature learning compared to pixel reconstruction-based SSL methods, leading to widespread adoption in medical imaging. However, no prior work has systematically investigated which SSL objective is better aligned with the spatial organization of clinically relevant signal. In this work, we empirically investigate how the choice of SSL method impacts the learned representations in medical imaging. We select two representative imaging modalities characterized by unique noise profiles: ultrasound and histopathology. When informative signal is spatially localized, as in histopathology, JEAs are more effective due to their view-invariance objective. In contrast, when diagnostically relevant information is globally structured, such as the macroscopic anatomy present in liver ultrasounds, JEPAs are optimal. These differences are especially evident in the clinical relevance of the learned features, as independently validated by board-certified radiologists and pathologists. Together, our results provide a framework for matching SSL objectives to the structural and noise properties of medical imaging modalities.

[CV-114] Q-Tacit: Image Quality Assessment via Latent Visual Reasoning

【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Model, VLM)的图像质量评估(Image Quality Assessment, IQA)方法中,过度依赖自然语言进行链式思维(Chain-of-Thought, CoT)推理所导致的局限性问题。具体而言,现有方法将视觉信息视为静态前提,难以将质量相关的视觉线索充分抽象为文本形式,从而限制了在视觉密集型任务中的推理效果。其解决方案的关键在于提出一种名为 Q-Tacit 的新范式,该范式通过两个阶段实现:首先在潜在空间中注入结构化的视觉质量先验,其次校准潜在空间中的推理轨迹以提升质量评估能力。这一方法使 VLM 能够在不依赖大量自然语言描述的前提下,在潜在质量空间中进行高效推理,显著减少所需 token 数量并保持优异性能,验证了语言并非唯一紧凑的视觉质量表示形式。

链接: https://arxiv.org/abs/2603.22641
作者: Yuxuan Jiang,Yixuan Li,Hanwei Zhu,Siyue Teng,Fan Zhang,David Bull
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, “Is natural language the ideal space for quality reasoning?” and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.

[CV-115] CAM3R: Camera-Agnostic Model for 3D Reconstruction

【速读】:该论文旨在解决从无标定图像中恢复稠密三维几何结构的问题,尤其针对使用非线性光学系统(如鱼眼或全景镜头)拍摄的广角影像,现有基于标准针孔相机模型训练的重建方法在此类数据上会出现显著的几何退化。解决方案的关键在于提出CAM3R——一种无需事先相机标定即可处理广角图像的前馈式3D重建模型,其核心创新包括:(1) 一个双视图网络架构,分为射线模块(Ray Module, RM)用于估计每像素的射线方向,以及跨视图模块(Cross-view Module, CVM)用于推断径向距离并生成置信度图、点图和相对位姿;(2) 引入射线感知全局对齐框架(Ray-Aware Global Alignment),在严格保留局部几何预测的前提下,实现位姿精修与尺度优化,从而统一多视角局部预测以构建一致的3D场景。

链接: https://arxiv.org/abs/2603.22631
作者: Namitha Guruprasad,Abhay Yadav,Cheng Peng,Rama Chellappa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to a standard pinhole camera geometry. As a result, these models suffer from significant geometric degradation when applied to wide-angle imagery captured via non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction capable of processing images from wide-angle camera models without prior calibration. Our framework consists of a two-view network which is bifurcated into a Ray Module (RM) to estimate per-pixel ray directions and a Cross-view Module (CVM) to infer radial distance with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework for pose refinement and scale optimization while strictly preserving the predicted local geometry. Extensive experiments on various camera model datasets, including panorama, fisheye and pinhole imagery, demonstrate that CAM3R establishes a new state-of-the-art in pose estimation and reconstruction.

[CV-116] PIVM: Diffusion-Based Prior-Integrated Variation Modeling for Anatomically Precise Abdominal CT Synthesis

【速读】:该论文旨在解决腹部CT图像数据因标注成本高和隐私限制而导致的稀缺问题,从而阻碍了分割与诊断模型的鲁棒性发展。其解决方案的关键在于提出了一种先验集成变分建模(Prior-Integrated Variation Modeling, PIVM)框架,该框架基于扩散模型,在图像空间中直接生成具有解剖学准确性的CT图像;具体而言,PIVM不从噪声中生成完整图像,而是预测相对于由分割标签导出的器官特异性强度先验的体素级强度变化,这些先验与标签共同引导扩散过程,确保空间对齐和真实的器官边界,同时保留完整的亨氏单位(Hounsfield Unit, HU)范围,捕捉细微的解剖纹理而不产生平滑效应。

链接: https://arxiv.org/abs/2603.22626
作者: Dinglun He,Baoming Zhang,Xu Wang,Yao Hao,Deshan Yang,Ye Duan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026 (Oral). Equal contribution by the first three authors

点击查看摘要

Abstract:Abdominal CT data are limited by high annotation costs and privacy constraints, which hinder the development of robust segmentation and diagnostic models. We present a Prior-Integrated Variation Modeling (PIVM) framework, a diffusion-based method for anatomically accurate CT image synthesis. Instead of generating full images from noise, PIVM predicts voxel-wise intensity variations relative to organ-specific intensity priors derived from segmentation labels. These priors and labels jointly guide the diffusion process, ensuring spatial alignment and realistic organ boundaries. Unlike latent-space diffusion models, our approach operates directly in image space while preserving the full Hounsfield Unit (HU) range, capturing fine anatomical textures without smoothing. Source code is available at this https URL.

[CV-117] oward Faithful Segmentation Attribution via Benchmarking and Dual-Evidence Fusion

【速读】:该论文旨在解决语义分割任务中Attribution maps(归因图)的可信赖性评估问题,现有方法主要依赖视觉合理性判断,但这种主观评价无法确保被突出像素确实驱动模型预测,也无法保证归因信用局限于目标区域。为此,作者提出一个可复现的基准测试框架,系统评估基于干预的忠实性(intervention-based faithfulness)、非目标区域泄漏(off-target leakage)、扰动鲁棒性(perturbation robustness)及运行时间,并在Pascal VOC和SBD数据集上验证了三种预训练骨干网络。其核心解决方案是Dual-Evidence Attribution (DEA),一种轻量级校正方法,通过一致性加权融合梯度证据与区域级干预信号,在两者一致处增强注意力,在梯度响应不稳定时保留因果支持,从而显著提升删除法忠实性,同时保持强鲁棒性,仅增加干预计算开销。该基准揭示了不同归因方法在忠实性与稳定性之间的权衡关系,这是传统视觉评估所掩盖的,为语义分割可解释性方法的选择提供了理论依据。

链接: https://arxiv.org/abs/2603.22624
作者: Abu Noman Md Sakib,OFM Riaz Rahman Aranya,Kevin Desai,Zijie Zhang
机构: The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model’s prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual-Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region-level intervention signals through agreement-weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion-based faithfulness over gradient-only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness-stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at this https URL.

[CV-118] o Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models

【速读】:该论文旨在解决医学视觉语言模型(Vision-Language Models, VLMs)在临床应用中面临的两个关键脆弱性问题:幻觉(hallucination)和逢迎倾向(sycophancy),尤其是二者协同作用下的安全性风险。研究表明,当前主流VLMs在降低幻觉的同时往往表现出更强的逢迎行为,反之亦然,形成“接地-逢迎权衡”(grounding-sycophancy tradeoff)。解决方案的关键在于提出三个新指标以系统评估模型的安全性:L-VASE(对VASE的logit空间重构,避免双重归一化问题)、CCS(置信度校准的逢迎评分,惩罚高置信度屈从行为)以及临床安全指数(Clinical Safety Index, CSI),该指数通过几何平均整合接地能力、自主性和置信度校准性。实验表明,在1,151个测试案例中,所有7–8B参数规模的模型均未达到CSI > 0.35,说明现有模型尚无法同时具备强接地能力和抗社会压力鲁棒性,因此必须联合评估这两项属性后方可考虑用于临床部署。

链接: https://arxiv.org/abs/2603.22623
作者: OFM Riaz Rahman Aranya,Kevin Desai
机构: The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at this https URL

[CV-119] A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images

【速读】:该论文旨在解决在田间尺度上获取三维(3D)植物结构模型所需参数和嵌套结构时存在的劳动密集型问题。传统方法依赖于3D传感器或复杂的多视角图像处理,成本高且效率低。其关键解决方案是提出一种基于视觉-语言模型(VLM)的新算法,通过仅使用合成图像训练和测试,将植物架构的XML描述转化为可被语言模型预测的标记序列(token sequence),从而从单张图像中提取器官级几何与拓扑参数,并生成功能性的3D植物结构模型。实验表明,该方法在教师强制训练下达到0.73的token F1分数,自回归生成时BLEU-4得分达94.00%,ROUGE-L得分为0.5182,验证了从图像数据中提取植物架构参数的可行性。

链接: https://arxiv.org/abs/2603.22622
作者: Heesup Yun,Isaac Kazuo Uyehara,Ioannis Droutsas,Earl Ranario,Christine H. Diepenbrock,Brian N. Bailey,J. Mason Earles
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at the field scales remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant’s architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we proposed a method that generates token sequences that encode a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing testing of the hypothesis that organ-level architectural parameters could be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training. Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. This led to the conclusion that such plant architecture model generation and parameter extraction were possible from synthetic images; thus, future work will extend the approach to real imagery data.

[CV-120] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

【速读】:该论文旨在解决现有虚拟试衣(Virtual Try-On, VTON)与虚拟脱衣(Virtual Try-Off, VTOFF)数据集静态、缺乏指令驱动编辑能力的问题,从而实现可控且交互式的服装生成。其解决方案的关键在于构建首个统一框架下的大规模基准数据集——Dress Editing Dataset (Dress-ED),该数据集整合了VTON、VTOFF及文本引导的服装编辑任务,包含超过14.6万组经验证的四元组样本(原始服装图、穿戴者图像、编辑后图像及自然语言指令),并通过融合多模态大模型(MLLM)的服装理解、扩散模型(diffusion-based editing)和大语言模型(LLM-guided verification)的自动化流程生成。在此基础上,作者进一步提出一个统一的多模态扩散框架,能够联合推理语言指令与视觉服装线索,为指令驱动的VTON/VTOFF提供强有力的基线方法。

链接: https://arxiv.org/abs/2603.22607
作者: Fulvio Sanguigni,Davide Lobba,Bin Ren,Marcella Cornia,Nicu Sebe,Rita Cucchiara
机构: University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction of the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.

[CV-121] rajLoom: Dense Future Trajectory Generation from Video

【速读】:该论文旨在解决视频理解与可控视频生成中未来运动预测的挑战,尤其是如何从观测到的视频帧中准确建模密集点轨迹(dense point trajectories)的未来演化过程。其核心问题在于现有方法难以在长时程预测中保持运动的真实性和稳定性。解决方案的关键在于提出一个三组件框架:(1) Grid-Anchor Offset Encoding,通过将每个点表示为相对于像素中心锚点的偏移量来减少位置依赖偏差;(2) TrajLoom-VAE,利用掩码重建和时空一致性正则化学习紧凑的时空潜在空间;(3) TrajLoom-Flow,基于流匹配在潜在空间中生成未来轨迹,并结合边界提示和on-policy K-step微调实现稳定采样。该方法将预测时间跨度从24帧扩展至81帧,显著提升跨数据集的运动真实感与稳定性。

链接: https://arxiv.org/abs/2603.22606
作者: Zewei Zhang,Jia Jun Cheng Xian,Kaiwen Liu,Ming Liang,Hang Chu,Jun Chen,Renjie Liao
机构: McMaster University; University of British Columbia; Vector Institute; Viggle AI; Canada CIFAR AI Chair
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page, code, model checkpoints, and datasets: this https URL

点击查看摘要

Abstract:Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at this https URL.

[CV-122] Language Models Can Explain Visual Features via Steering CVPR2026

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAE)在视觉模型中提取出的数千个特征难以自动解释的问题,尤其是如何在不依赖人工干预的前提下揭示这些特征所代表的视觉概念。其解决方案的关键在于利用视觉-语言模型(Vision-Language Models, VLMs)的结构特性,通过因果干预(causal interventions)对SAE特征进行引导(steering),即在输入为空图像的情况下操纵特定特征,并借助语言模型生成关于“看到什么”的解释,从而实现自动化、可扩展的特征解释。该方法显著优于传统的基于激活输入样本的相关性解释,且随着语言模型规模增大,解释质量持续提升,展现出未来研究的潜力。

链接: https://arxiv.org/abs/2603.22593
作者: Javier Ferrando,Enrique Lopez-Cuena,Pablo Agustin Martin-Torres,Daniel Hinjos,Anna Arias-Duart,Dario Garcia-Gasulla
机构: Barcelona Supercomputing Center (巴塞罗那超级计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it ``sees’', effectively eliciting the visual concept represented by each feature. Results show that Steering offers an scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.

[CV-123] A vision-language model and platform for temporally mapping surgery from video

【速读】:该论文旨在解决当前手术AI模型在临床转化中的局限性问题,即现有模型通常仅能捕捉单一术式中有限的手术行为特征,且缺乏对实际外科医生的可访问性和实用性。解决方案的关键在于构建了一个名为Halsted的视觉-语言模型,该模型基于大规模标注视频库Halsted Surgical Atlas(HSA),其包含超过65万段跨8个外科专业的视频,并通过迭代自标注框架持续扩展;同时,研究团队公开发布了HSA-27k子集以促进基准测试,并开发了Halsted网络平台,使外科医生能够几分钟内自动映射自身手术过程,从而显著提升手术AI的全面性、计算效率与临床可及性,推动自主机器人手术的发展。

链接: https://arxiv.org/abs/2603.22583
作者: Dani Kiyasseh
机构: Halsted AI( Halsted AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (this https URL) to provide surgeons anywhere in the world with the previously-unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.

[CV-124] FullCircle: Effortless 3D Reconstruction from Casual 360circ Captures

【速读】:该论文旨在解决360°相机在非专业场景下进行3D场景重建时的挑战,即如何实现无需特殊采集协议或预处理步骤的、鲁棒且高效的重建流程。传统基于透视相机的方法受限于视场角狭窄,导致视角覆盖不足和特征对应关系稀疏;而现有360°重建方法虽具备更广视场,却依赖复杂的采集规范和预处理,违背了辐射场(radiance fields)“简易工作流”的核心优势。论文提出了一种直接从原始双鱼眼(dual-fisheye)360°图像中重建3D场景的实用化流水线,其关键在于:不依赖任何特殊采集协议或预处理,同时对常见误差源——即出现在所有360°图像中的操作者人体——表现出强鲁棒性,从而显著优于标准3DGS(3D Gaussian Splatting)及模拟透视相机的基线方法,验证了360°采集在非专业场景下的优越性。

链接: https://arxiv.org/abs/2603.22572
作者: Yalda Foroutan,Ipek Oztas,Daniel Rebain,Aysegul Dundar,Kwang Moo Yi,Lily Goli,Andrea Tagliasacchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radiance fields have emerged as powerful tools for 3D scene reconstruction. However, casual capture remains challenging due to the narrow field of view of perspective cameras, which limits viewpoint coverage and feature correspondences necessary for reliable camera calibration and reconstruction. While commercially available 360 ^\circ cameras offer significantly broader coverage than perspective cameras for the same capture effort, existing 360 ^\circ reconstruction methods require special capture protocols and pre-processing steps that undermine the promise of radiance fields: effortless workflows to capture and reconstruct 3D scenes. We propose a practical pipeline for reconstructing 3D scenes directly from raw 360 ^\circ camera captures. We require no special capture protocols or pre-processing, and exhibit robustness to a prevalent source of reconstruction errors: the human operator that is visible in all 360 ^\circ imagery. To facilitate evaluation, we introduce a multi-tiered dataset of scenes captured as raw dual-fisheye images, establishing a benchmark for robust casual 360 ^\circ reconstruction. Our method significantly outperforms not only vanilla 3DGS for 360 ^\circ cameras but also robust perspective baselines when perspective cameras are simulated from the same capture, demonstrating the advantages of 360 ^\circ capture for casual reconstruction. Additional results are available at: this https URL

[CV-125] CanViT: Toward Active-Vision Foundation Models

【速读】:该论文旨在解决主动视觉(Active Vision)领域中缺乏可扩展的通用基础模型(Foundation Model)及其预训练流水线的问题,从而推动从被动视觉到主动视觉在语义分割等任务上的性能差距缩小。其解决方案的关键在于提出首个任务与策略无关的主动视觉基础模型 CanViT:通过引入场景相对位置编码(scene-relative RoPE)将视网膜拓扑(retinotopic)的 Vision Transformer 主干与场景级空间拓扑(spatiotopic)的潜在工作区(canvas)进行绑定,并设计了 Canvas Attention 这一新颖的非对称交叉注意力机制,实现高效高容量工作记忆交互;同时,通过解耦“思考”(主干层)与“记忆”(画布层),移除画布侧自注意力和全连接层,显著降低延迟并提升大规模场景下的可扩展性;此外,提出了无标签的主动视觉预训练方法——策略无关的被动到主动密集潜在蒸馏(policy-agnostic passive-to-active dense latent distillation),利用随机采样的低分辨率片段重建 DINOv3 的全局嵌入,从而无需标注即可完成强大表征学习。

链接: https://arxiv.org/abs/2603.22570
作者: Yohaï-Eliel Berreby,Sabrina Du,Audrey Durand,B. Suresh Krishna
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and weights: this https URL

点击查看摘要

Abstract:Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes – an order of magnitude more than previous active models – and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model’s 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.

[CV-126] Generalized multi-object classification and tracking with sparse feature resonator networks

【速读】:该论文旨在解决视觉场景理解中 invariant(不变性)与 equivariant(等变性)结构难以同时捕获的问题,即传统神经网络在追求对平移等变换的不变性时往往会丢失对象的具体位置等等变信息,且监督学习难以自然保证泛化能力。解决方案的关键在于采用基于“分析-合成”框架和共振网络(resonator network)的因子分解方法:生成模型描述简单场景中 MNIST 数字及其变换(如颜色、位置),而共振网络则逆向重构该生成过程,从而同时输出对象的不变特征(如形状)和等变特征(如精确位置)。通过从训练数据中学得的稀疏特征作为基底,网络能够灵活表示未见过的对象形状;模块化设计中的形状模块分离了平移因素,使分类器仅需在中心化数据上训练即可实现高精度识别,无需数据增强即可应对任意平移输入。此外,其类注意力机制支持多对象场景中逐个聚焦分析,且可精确追踪多个运动目标(误差小于几个像素)。

链接: https://arxiv.org/abs/2603.22539
作者: Lazar Supic,Alec Mullen,E. Paxon Frady
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures, NICE 2026

点击查看摘要

Abstract:In visual scene understanding tasks, it is essential to capture both invariant and equivariant structure. While neural networks are frequently trained to achieve invariance to transformations such as translation, this often comes at the cost of losing access to equivariant information - e.g., the precise location of an object. Moreover, invariance is not naturally guaranteed through supervised learning alone, and many architectures generalize poorly to input transformations not encountered during training. Here, we take an approach based on analysis-by-synthesis and factoring using resonator networks. A generative model describes the construction of simple scenes containing MNIST digits and their transformations, like color and position. The resonator network inverts the generative model, and provides both invariant and equivariant information about particular objects. Sparse features learned from training data act as a basis set to provide flexibility in representing variable shapes of objects, allowing the resonator network to handle previously unseen digit shapes from the test set. The modular structure provides a shape module which contains information about the object shape with translation factored out, allowing a simple classifier to operate on centered digits. The classification layer is trained solely on centered data, requiring much less training data, and the network as a whole can identify objects with arbitrary translations without data augmentation. The natural attention-like mechanism of the resonator network also allows for analysis of scenes with multiple objects, where the network dynamics selects and centers only one object at a time. Further, the specific position information of a particular object can be extracted from the translation module, and we show that the resonator can be designed to track multiple moving objects with precision of a few pixels.

[CV-127] UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images

【速读】:该论文旨在解决城市尺度下人行道宽度(sidewalk width)数据稀缺的问题,传统方法如实地调查、高分辨率航拍影像或简化几何假设存在成本高、可扩展性差或系统误差大的局限。解决方案的核心在于提出UrbanVGGT测量流程,其关键创新在于通过单张街景图像实现米级精度的宽度估计:该流程融合语义分割、前馈式三维重建、自适应地面平面拟合、基于相机高度的尺度校准以及在恢复平面上的方向性宽度测量;其中,度量尺度校准被证明是最关键的组件,显著提升了估计准确性(在华盛顿特区基准上平均绝对误差为0.252 m,95.5%的估计值在0.50 m以内)。

链接: https://arxiv.org/abs/2603.22531
作者: Kaizhen Tan,Fan Zhang
机构: Heinz College of Information Systems and Public Policy, Carnegie Mellon University, USA (卡内基梅隆大学信息系统与公共政策亨茨学院,美国); Institute of Remote Sensing and Geographical Information System, Peking University, China (北京大学遥感与地理信息系统研究所,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large-scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street-view image. The method combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane. On a ground-truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV-SideWidth, a prototype sidewalk-width dataset covering 527 OpenStreetMap street segments. The results indicate that street-view imagery can support scalable generation of candidate sidewalk-width attributes, while broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.

[CV-128] Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion

【速读】:该论文旨在解决当前基于学习的控制方法在复杂城市环境中对人行道微移动(micromobility)最后一公里运输任务中表现不佳的问题,特别是模仿学习(Imitation Learning, IL)因依赖固定离线数据而导致误差累积、鲁棒性差和泛化能力弱的局限。其解决方案的关键在于两个层面:一是通过引入多样化的纠正行为(corrective behaviors)和传感器增强(sensor augmentations)来扩充遥控操作(teleoperation)数据集,使策略能够学习从自身错误中恢复;二是设计多尺度模仿学习(multi-scale imitation learning)架构,利用基于时域的轨迹聚类和分层监督机制,同时捕捉短时交互行为与长时目标导向意图,从而提升模型在真实场景下的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2603.22527
作者: Honglin He,Yukai Ma,Brad Squicciarini,Wayne Wu,Bolei Zhou
机构: University of California, Los Angeles (加州大学洛杉矶分校); Coco Robotics
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sidewalk micromobility is a promising solution for last-mile transportation, but current learning-based control methods struggle in complex urban environments. Imitation learning (IL) learns policies from human demonstrations, yet its reliance on fixed offline data often leads to compounding errors, limited robustness, and poor generalization. To address these challenges, we propose a framework that advances IL through corrective behavior expansion and multi-scale imitation learning. On the data side, we augment teleoperation datasets with diverse corrective behaviors and sensor augmentations to enable the policy to learn to recover from its own mistakes. On the model side, we introduce a multi-scale IL architecture that captures both short-horizon interactive behaviors and long-horizon goal-directed intentions via horizon-based trajectory clustering and hierarchical supervision. Real-world experiments show that our approach significantly improves robustness and generalization in diverse sidewalk scenarios.

[CV-129] High Resolution Flood Extent Detection Using Deep Learning with Random Forest Derived Training Labels

【速读】:该论文旨在解决极端洪水事件中洪水模型验证困难的问题,主要受限于灾时观测数据稀缺以及缺乏标注训练数据。为应对这一挑战,作者提出了一种融合PlanetScope高分辨率光学影像与地形特征(高度距最近排水线,HAND 和坡度)的洪水制图框架,其关键在于利用随机森林(Random Forest)模型对专家标注的洪水掩膜进行学习,生成深度学习(Deep Learning, DL)模型——U-Net 的训练标签,从而实现标签高效且可扩展的洪水范围识别。实验表明,加入地形特征后模型性能与纯光学影像配置相当(F1=0.92,IoU=0.85),说明该方法在数据稀缺场景下具备鲁棒性和实用性。

链接: https://arxiv.org/abs/2603.22518
作者: Azizbek Nuriddinov,Ebrahim Ahmadisharaf,Mohammad Reza Alizadeh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IGARSS 2026

点击查看摘要

Abstract:Validation of flood models, used to support risk mitigation strategies, remains challenging due to limited observations during extreme events. High-frequency, high-resolution optical imagery (~3 m), such as PlanetScope, offers new opportunities for flood mapping, although applications remain limited by cloud cover and the lack of labeled training data during disasters. To address this, we develop a flood mapping framework that integrates PlanetScope optical imagery with topographic features using machine learning (ML) and deep learning (DL) algorithms. A Random Forest model was applied to expert-annotated flood masks to generate training labels for DL models, U-Net. Two U-Net models with ResNet18 backbone were trained using optical imagery only (4 bands) and optical imagery combined with Height Above Nearest Drainage (HAND) and topographic slope (6 bands). Hurricane Ida (September 2021), which caused catastrophic flooding across the eastern United States, including the New York City metropolitan area, was used as an example to evaluate the framework. Results demonstrate that the U-Net model with topographic features achieved very close performance to the optical-only configuration (F1=0.92 and IoU=0.85 by both modeling scenarios), indicating that HAND and slope provide only marginal value to inundation extent detection. The proposed framework offers a scalable and label-efficient approach for mapping inundation extent that enables modeling under data-scarce flood scenarios.

[CV-130] Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation

【速读】:该论文旨在解决在多模态条件下生成具有解剖结构一致性的3D医学体积数据这一难题,尤其针对医疗领域中数据稀缺问题。其核心解决方案是提出Sketch2CT框架,该框架通过联合用户提供的2D草图和文本描述来引导3D医学体积生成,关键在于设计了两个模块:一是利用局部文本提示细化草图特征,二是融合全局草图与文本表示;同时基于胶囊注意力(capsule-attention)骨干网络,有效整合草图与文本的互补信息,从而生成解剖学上准确的器官分割掩膜,并进一步驱动潜在扩散模型合成高质量3D CT图像,实现可控、低成本且高效的医学数据增强。

链接: https://arxiv.org/abs/2603.22509
作者: Delin An,Chaoli Wang
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at this https URL.

[CV-131] ny Inference-Time Scaling with Latent Verifiers

【速读】:该论文旨在解决生成式模型在推理阶段通过验证器(verifier)筛选候选输出时带来的高计算开销问题,尤其针对使用多模态大语言模型(MLLM)作为验证器时需反复解码至像素空间并重新编码至视觉嵌入空间所导致的冗余操作。解决方案的关键在于提出一种名为“隐藏状态验证器”(Verifier on Hidden States, VHS)的新方法,该方法直接在扩散Transformer(DiT)单步生成器的中间隐藏状态上进行分析,无需将候选图像解码到像素空间,从而显著降低每个候选样本的验证成本,同时保持或超越基于MLLM验证器的性能表现。

链接: https://arxiv.org/abs/2603.22492
作者: Davide Bucciarelli,Evelyn Turri,Lorenzo Baraldi,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
机构: University of Modena and Reggio Emilia, Italy; University of Pisa, Italy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.

[CV-132] MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

【速读】:该论文旨在解决当前文档光学字符识别(OCR)系统在处理长文本序列时存在的两个核心问题:一是基于自回归解码的串行推理机制导致的延迟高和错误传播放大问题;二是现有模型对语言先验依赖过强,视觉理解能力不足。解决方案的关键在于提出一种统一的基于扩散模型(diffusion-based)的框架 MinerU-Diffusion,其通过将文档 OCR 视为逆渲染任务,摒弃传统的左到右因果生成方式,转而采用受视觉条件约束的并行扩散去噪机制,结合块级扩散解码器与不确定性驱动的课程学习策略,实现了稳定训练与高效长序列推理,显著提升了鲁棒性并加速了推理速度(最高达 3.2 倍)。

链接: https://arxiv.org/abs/2603.22458
作者: Hejun Dong,Junbo Niu,Bin Wang,Weijun Zeng,Wentao Zhang,Conghui He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.

[CV-133] Static Scene Reconstruction from Dynamic Egocentric Videos

【速读】:该论文旨在解决第一人称视频(egocentric video)中3D重建的挑战,尤其是由快速相机运动和频繁动态交互(如手部动作)导致的传统静态重建系统性能下降的问题,例如轨迹漂移(trajectory drift)和由移动物体(如手)引起的“鬼影”几何伪影。解决方案的关键在于提出一个鲁棒的重建流程:首先引入掩码感知的重建机制(mask-aware reconstruction mechanism),通过在注意力层中显式抑制动态前景区域,避免手部等动态干扰污染静态场景地图;其次采用分块重建策略与位姿图拼接(chunked reconstruction with pose-graph stitching),确保全局一致性并消除长期漂移。该方法有效扩展了基础模型在动态第一人称场景中的应用能力。

链接: https://arxiv.org/abs/2603.22450
作者: Qifei Cui,Patrick Chen
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Egocentric videos present unique challenges for 3D reconstruction due to rapid camera motion and frequent dynamic interactions. State-of-the-art static reconstruction systems, such as MapAnything, often degrade in these settings, suffering from catastrophic trajectory drift and “ghost” geometry caused by moving hands. We bridge this gap by proposing a robust pipeline that adapts static reconstruction backbones to long-form egocentric video. Our approach introduces a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in the attention layers, preventing hand artifacts from contaminating the static map. Furthermore, we employ a chunked reconstruction strategy with pose-graph stitching to ensure global consistency and eliminate long-term drift. Experiments on HD-EPIC and indoor drone datasets demonstrate that our pipeline significantly improves absolute trajectory error and yields visually clean static geometry compared to naive baselines, effectively extending the capability of foundation models to dynamic first-person scenes.

[CV-134] OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction

【速读】:该论文旨在解决下颌骨重建术后长期骨重塑(bone remodeling)预测问题,其核心挑战在于现有生成模型难以在长时间尺度上保持轨迹一致性(trajectory-level consistency)和解剖学保真度(anatomical fidelity)。解决方案的关键在于提出一种基于流的框架 OsteoFlow,其创新性地采用 Lyapunov-guided trajectory distillation 方法:不同于传统的单步蒸馏,该方法从注册得到的稳态速度场教师模型中蒸馏出一个随时间连续演化的轨迹,从而实现对长期骨重塑过程的精准建模;同时结合切除区域感知的图像损失函数,在不牺牲生成能力的前提下强化几何对应关系。实验表明,该方法在344个感兴趣区域上的均方误差较当前最优基线降低约20%,验证了轨迹蒸馏在长期预测中的有效性。

链接: https://arxiv.org/abs/2603.22421
作者: Hamidreza Aftabi,Faye Yu,Brooke Switzer,Zachary Fishman,Eitan Prisman,Antony Hodgson,Cari Whyne,Sidney Fels,Michael Hardisty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Predicting long-term bone remodeling after mandibular reconstruction would be of great clinical benefit, yet standard generative models struggle to maintain trajectory-level consistency and anatomical fidelity over long horizons. We introduce OsteoFlow, a flow-based framework predicting Year-1 post-operative CT scans from Day-5 scans. Our core contribution is Lyapunov-guided trajectory distillation: Unlike one-step distillation, our method distills a continuous trajectory over transport time from a registration-derived stationary velocity field teacher. Combined with a resection-aware image loss, this enforces geometric correspondence without sacrificing generative capacity. Evaluated on 344 paired regions of interest, OsteoFlow significantly outperforms state of-the-art baselines, reducing mean absolute error in the surgical resection zone by ~20%. This highlights the promise of trajectory distillation for long-term prediction. Code is available on GitHub: OsteoFlow.

[CV-135] Spatially-Aware Evaluation Framework for Aerial LiDAR Point Cloud Semantic Segmentation: Distance-Based Metrics on Challenging Regions

【速读】:该论文旨在解决当前用于三维点云语义分割评估的指标(如平均交并比 mIoU 和总体准确率 OA)在航空激光雷达(LiDAR)数据场景下的两个关键局限性:一是这些指标对所有误分类点一视同仁,忽略了空间上下文信息,从而无法反映几何严重性误差对衍生地理空间产品(如数字地形模型)质量的影响;二是它们易受大量易分类点的主导,掩盖了模型间的真实差异,并低估了模型在困难区域的表现。解决方案的关键在于提出一个新颖的评估框架,包含两个互补方法:其一为引入基于距离的指标,量化每个误分类点与其最近同类别真值点之间的空间偏移,以捕捉几何严重性;其二为聚焦于一组“难分点”(hard points),即至少被一个模型误分类的公共点集,从而减少易分类点带来的偏差,更清晰揭示模型在挑战性区域的性能差异。

链接: https://arxiv.org/abs/2603.22420
作者: Alex Salvatierra,José Antonio Sanz,Christian Gutiérrez,Mikel Galar
机构: Public University of Navarre (UPNA)(纳瓦拉公共大学); Institute of Smart Cities (ISC)(智能城市研究所); Tracasa Instrumental(特拉斯卡仪器公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 1 figure

点击查看摘要

Abstract:Semantic segmentation metrics for 3D point clouds, such as mean Intersection over Union (mIoU) and Overall Accuracy (OA), present two key limitations in the context of aerial LiDAR data. First, they treat all misclassifications equally regardless of their spatial context, overlooking cases where the geometric severity of errors directly impacts the quality of derived geospatial products such as Digital Terrain Models. Second, they are often dominated by the large proportion of easily classified points, which can mask meaningful differences between models and under-represent performance in challenging regions. To address these limitations, we propose a novel evaluation framework for comparing semantic segmentation models through two complementary approaches. First, we introduce distance-based metrics that account for the spatial deviation between each misclassified point and the nearest ground-truth point of the predicted class, capturing the geometric severity of errors. Second, we propose a focused evaluation on a common subset of hard points, defined as the points misclassified by at least one of the evaluated models, thereby reducing the bias introduced by easily classified points and better revealing differences in model performance in challenging regions. We validate our framework by comparing three state-of-the-art deep learning models on three aerial LiDAR datasets. Results demonstrate that the proposed metrics provide complementary information to traditional measures, revealing spatial error patterns that are critical for Earth Observation applications but invisible to conventional evaluation approaches. The proposed framework enables more informed model selection for scenarios where spatial consistency is critical.

[CV-136] Efficient Universal Perception Encoder

【速读】:该论文旨在解决在智能边缘设备上运行AI模型时面临的计算资源受限与多任务并发处理需求之间的矛盾,核心挑战在于设计一个体积小但具备强大且通用表征能力的视觉编码器(vision encoder)。解决方案的关键在于提出了一种名为高效通用感知编码器(Efficient Universal Perception Encoder, EUPE)的方法,其创新性地通过从多个领域专家基础视觉编码器中进行知识蒸馏(knowledge distillation)来实现;不同于以往直接从多个教师模型聚合压缩的做法,本文强调先将多个教师模型统一扩展为一个大型代理教师(proxy teacher),再从中蒸馏得到轻量级学生模型,从而显著提升模型在多种下游任务上的性能表现。

链接: https://arxiv.org/abs/2603.22387
作者: Chenchen Zhu,Saksham Suri,Cijo Jose,Maxime Oquab,Marc Szafraniec,Wei Wen,Yunyang Xiong,Patrick Labatut,Piotr Bojanowski,Raghuraman Krishnamoorthi,Vikas Chandra
机构: Meta(Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Running AI models on smart edge devices can unlock versatile user experiences, but presents challenges due to limited compute and the need to handle multiple tasks simultaneously. This requires a vision encoder with small size but powerful and versatile representations. We present our method, Efficient Universal Perception Encoder (EUPE), which offers both inference efficiency and universally good representations for diverse downstream tasks. We achieve this by distilling from multiple domain-expert foundation vision encoders. Unlike previous agglomerative methods that directly scale down from multiple teachers to an efficient encoder, we demonstrate the importance of first scaling up to a large proxy teacher and then scaling down from this single teacher. Experiments show that EUPE achieves on-par or better performance than individual domain experts of the same size on diverse task domains and also outperforms previous agglomerative encoders. We will release the full family of EUPE models and the code to foster future research.

[CV-137] hree Creates All: You Only Sample 3 Steps

【速读】:该论文旨在解决扩散模型(Diffusion Models)在推理阶段速度缓慢的问题,尤其是由于需要多次顺序网络评估导致的效率瓶颈。研究发现,标准的时间步长条件(timestep conditioning)是少步采样(few-step sampling)中的关键限制因素。为此,作者提出多层时间嵌入优化(Multi-layer Time Embedding Optimization, MTEO),其核心在于冻结预训练扩散主干网络,并从参考轨迹中蒸馏出一组逐步、逐层的时间嵌入(step-wise, layer-wise time embeddings)。该方法无需额外推理开销,仅需训练极少量参数,且可与现有常微分方程(ODE)求解器无缝集成,显著提升了少步采样下的生成质量,缩小了基于蒸馏方法与轻量级方法之间的性能差距。

链接: https://arxiv.org/abs/2603.22375
作者: Yuren Cai,Guangyi Wang,Zongqing Li,Li Li,Zhihui Liu,Songzhi Su
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models deliver high-fidelity generation but remain slow at inference time due to many sequential network evaluations. We find that standard timestep conditioning becomes a key bottleneck for few-step sampling. Motivated by layer-dependent denoising dynamics, we propose Multi-layer Time Embedding Optimization (MTEO), which freeze the pretrained diffusion backbone and distill a small set of step-wise, layer-wise time embeddings from reference trajectories. MTEO is plug-and-play with existing ODE solvers, adds no inference-time overhead, and trains only a tiny fraction of parameters. Extensive experiments across diverse datasets and backbones show state-of-the-art performance in the few-step sampling and substantially narrow the gap between distillation-based and lightweight methods. Code will be available.

[CV-138] When Visuals Arent the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations

【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在识别误导性可视化内容时存在的局限性,特别是当误导源于细微的推理错误(如因果推断错误、选择性展示等)而非明显的视觉设计缺陷(如截断坐标轴、双轴混淆等)时,现有模型检测能力不足的问题。其解决方案的关键在于构建一个基于细粒度推理与可视化设计错误分类体系的基准测试集,该基准结合真实世界可视化与人工编写的、针对特定错误类型设计的误导性标题,从而实现对误导来源的可控分析和量化评估。实验表明,VLMs 对视觉设计错误的识别显著优于对推理型误导的检测,且常将非误导性图表误判为欺骗性内容,凸显了当前模型在理解复杂语义误导机制上的不足。

链接: https://arxiv.org/abs/2603.22368
作者: Harsh Nishant Lalai,Raj Sanjay Shah,Hanspeter Pfister,Sashank Varma,Grace Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visualizations help communicate data insights, but deceptive data representations can distort their interpretation and propagate misinformation. While recent Vision Language Models (VLMs) perform well on many chart understanding tasks, their ability to detect misleading visualizations, especially when deception arises from subtle reasoning errors in captions, remains poorly understood. Here, we evaluate VLMs on misleading visualization-caption pairs grounded in a fine-grained taxonomy of reasoning errors (e.g., Cherry-picking, Causal inference) and visualization design errors (e.g., Truncated axis, Dual axis, inappropriate encodings). To this end, we develop a benchmark that combines real-world visualization with human-authored, curated misleading captions designed to elicit specific reasoning and visualization error types, enabling controlled analysis across error categories and modalities of misleadingness. Evaluating many commercial and open-source VLMs, we find that models detect visual design errors substantially more reliably than reasoning-based misinformation, and frequently misclassify non-misleading visualizations as deceptive. Overall, our work fills a gap between coarse detection of misleading content and the attribution of the specific reasoning or visualization errors that give rise to it.

[CV-139] MCLR: Improving Conditional Modeling in Visual Generative Models via Inter-Class Likelihood-Ratio Maximization and Establishing the Equivalence between Classifier-Free Guidance and Alignment Objectives

【速读】:该论文试图解决扩散模型(Diffusion Models)在实际应用中依赖推理阶段的无分类器引导(Classifier-Free Guidance, CFG)这一现象,尽管标准去噪得分匹配(Denoising Score Matching, DSM)训练理论上应能恢复目标数据分布。研究指出,标准扩散模型的关键局限在于类间分离不足,导致生成质量受限。解决方案的核心是提出一种名为MCLR(Maximum Class Likelihood Ratio)的原理性对齐目标,在训练过程中显式最大化类间似然比(likelihood-ratios),从而在不依赖CFG的前提下实现与之相当的生成效果。理论分析进一步表明,CFG引导的得分函数恰好是加权MCLR目标的最优解,建立了无分类器引导与基于对齐的目标之间的形式等价关系,为CFG提供了机制层面的解释。

链接: https://arxiv.org/abs/2603.22364
作者: Xiang Li,Yixuan Jia,Xiao Li,Jeffrey A. Fessler,Rongrong Wang,Qing Qu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved state-of-the-art performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. From a theoretical perspective, diffusion models trained with standard denoising score matching (DSM) are expected to recover the target data distribution, raising the question of why inference-time guidance is necessary in practice. In this work, we ask whether the DSM training objective can be modified in a principled manner such that standard reverse-time sampling, without inference-time guidance, yields effects comparable to CFG. We identify insufficient inter-class separation as a key limitation of standard diffusion models. To address this, we propose MCLR, a principled alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Models fine-tuned with MCLR exhibit CFG-like improvements under standard sampling, achieving comparable qualitative and quantitative gains without requiring inference-time guidance. Beyond empirical benefits, we provide a theoretical result showing that the CFG-guided score is exactly the optimal solution to a weighted MCLR objective. This establishes a formal equivalence between classifier-free guidance and alignment-based objectives, offering a mechanistic interpretation of CFG.

[CV-140] ST-GDance: A Scalable Spatial-Temporal Diffusion for Long-Duration Group Choreography

【速读】:该论文旨在解决群体舞蹈生成中因双向注意力依赖导致的计算效率低下与多舞者运动碰撞风险增加的问题,尤其是在舞蹈人数和序列长度增长时,注意力计算复杂度呈二次增长,限制了模型在交互式场景中的部署。解决方案的关键在于提出ST-GDance++框架,通过解耦空间与时间依赖关系实现高效且防碰撞的群体编舞生成:在空间建模方面,引入轻量级距离感知图卷积(distance-aware graph convolutions)以降低计算开销并捕捉舞者间关系;在时间建模方面,设计扩散噪声调度策略与高效的时序对齐注意力掩码(temporal-aligned attention mask),支持流式生成长序列动作,显著提升长时间场景下的可扩展性。

链接: https://arxiv.org/abs/2603.22316
作者: Jing Xu,Weiqiang Wang,Cunjian Chen,Jun Liu,Qiuhong Ke
机构: Monash University (莫纳什大学); Lancaster University (兰卡斯特大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Group dance generation from music requires synchronizing multiple dancers while maintaining spatial coordination, making it highly relevant to applications such as film production, gaming, and animation. Recent group dance generation models have achieved promising generation quality, but they remain difficult to deploy in interactive scenarios due to bidirectional attention dependencies. As the number of dancers and the sequence length increase, the attention computation required for aligning music conditions with motion sequences grows quadratically, leading to reduced efficiency and increased risk of motion collisions. Effectively modeling dense spatial-temporal interactions is therefore essential, yet existing methods often struggle to capture such complexity, resulting in limited scalability and unstable multi-dancer coordination. To address these challenges, we propose ST-GDance++, a scalable framework that decouples spatial and temporal dependencies to enable efficient and collision-aware group choreography generation. For spatial modeling, we introduce lightweight distance-aware graph convolutions to capture inter-dancer relationships while reducing computational overhead. For temporal modeling, we design a diffusion noise scheduling strategy together with an efficient temporal-aligned attention mask, enabling stream-based generation for long motion sequences and improving scalability in long-duration scenarios. Experiments on the AIOZ-GDance dataset show that ST-GDance++ achieves competitive generation quality with significantly reduced latency compared to existing methods.

[CV-141] Contrastive Metric Learning for Point Cloud Segmentation in Highly Granular Detectors

【速读】:该论文旨在解决高粒度量能器中粒子簇(particle shower)高度重叠时的点云分割难题,尤其在复杂多粒子环境中实现高精度的重建效率与纯度。其核心解决方案是基于监督对比度度量学习(supervised contrastive metric learning, CML)构建一个稳定的嵌入空间,在该空间中同一粒子簇内的点被紧密聚集,而不同簇的点则被有效分离;随后通过密度基读出机制(density-based readout)重构聚类结果,从而将表示学习与聚类形成解耦,提升推理灵活性与泛化能力。相较传统的对象凝聚(object condensation, OC)方法,CML 在电磁和强子粒子簇上均展现出更一致的邻域结构、更强的重叠簇分离能力及对未见多重性和能量的更好外推性能,显著提升了重建效率、纯度与能量分辨率。

链接: https://arxiv.org/abs/2603.23356
作者: Max Marriott-Clarke,Lazar Novakovic,Elizabeth Ratzer,Robert J. Bainbridge,Loukas Gouskos,Benedikt Maier
机构: Blackett Laboratory, Imperial College London, UK; Department of Physics Astronomy, Brown University, USA; Brown Center for Theoretical Physics and Innovation (BCTPI), Brown University, USA
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a novel clustering approach for point-cloud segmentation based on supervised contrastive metric learning (CML). Rather than predicting cluster assignments or object-centric variables, the method learns a latent representation in which points belonging to the same object are embedded nearby while unrelated points are separated. Clusters are then reconstructed using a density-based readout in the learned metric space, decoupling representation learning from cluster formation and enabling flexible inference. The approach is evaluated on simulated data from a highly granular calorimeter, where the task is to separate highly overlapping particle showers represented as sets of calorimeter hits. A direct comparison with object condensation (OC) is performed using identical graph neural network backbones and equal latent dimensionality, isolating the effect of the learning objective. The CML method produces a more stable and separable embedding geometry for both electromagnetic and hadronic particle showers, leading to improved local neighbourhood consistency, a more reliable separation of overlapping showers, and better generalization when extrapolating to unseen multiplicities and energies. This translates directly into higher reconstruction efficiency and purity, particularly in high-multiplicity regimes, as well as improved energy resolution. In mixed-particle environments, CML maintains strong performance, suggesting robust learning of the shower topology, while OC exhibits significant degradation. These results demonstrate that similarity-based representation learning combined with density-based aggregation is a promising alternative to object-centric approaches for point cloud segmentation in highly granular detectors.

[CV-142] L-UNet: An LSTM Network for Remote Sensing Image Change Detection

【速读】:该论文旨在解决高分辨率遥感图像变化检测中因传统深度学习方法(如Conv-LSTM)缺乏空间特征表达能力而导致的性能局限问题。其核心解决方案是提出一种端到端的时空网络架构,关键在于引入改进的Conv-LSTM结构以同时建模空间与时间维度信息,并在此基础上设计L-UNet和Atrous L-UNet(AL-UNet)模型:前者用Conv-LSTM替代UNet的部分卷积层以增强空间特性,后者进一步采用空洞卷积(Atrous convolution)结构来捕获多尺度空间信息,从而显著提升变化检测的精度与质量。

链接: https://arxiv.org/abs/2603.22842
作者: Shuting Sun,Lin Mu,Lizhe Wang,Peng Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Change detection of high-resolution remote sensing images is an important task in earth observation and was extensively investigated. Recently, deep learning has shown to be very successful in plenty of remote sensing tasks. The current deep learning-based change detection method is mainly based on conventional long short-term memory (Conv-LSTM), which does not have spatial characteristics. Since change detection is a process with both spatiality and temporality, it is necessary to propose an end-to-end spatiotemporal network. To achieve this, Conv-LSTM, an extension of the Conv-LSTM structure, is introduced. Since it shares similar spatial characteristics with the convolutional layer, L-UNet, which substitutes partial convolution layers of UNet-to-Conv-LSTM and Atrous L-UNet (AL-UNet), which further using Atrous structure to multiscale spatial information is proposed. Experiments on two data sets are conducted and the proposed methods show the advantages both in quantity and quality when compared with some other methods.

[CV-143] Viewport-based Neural 360° Image Compression

【速读】:该论文旨在解决传统360°图像压缩管道在将球面图像投影到单个二维平面时引发的过度采样(oversampling)和失真问题。其解决方案的关键在于提出一种基于视口(viewport)的神经压缩流水线,通过提取多个视口并高效压缩这些视口数据,从而最小化上述固有缺陷;同时设计了一种基于Transformer架构的ViewPort ConText(VPCT)模块,以捕获跨视口的全局先验信息,在保持高质量的同时显著降低比特消耗——实验表明,相比最优的现有360°图像压缩方法,该方案平均节省14.01%的比特率,且优于标准2D图像编解码器在视口压缩场景下的性能表现。

链接: https://arxiv.org/abs/2603.22776
作者: Jingwei Liao,Bo Chen,Klara Nahrstedt,Zhisheng Yan
机构: George Mason University (乔治梅森大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Given the popularity of 360° images on social media platforms, 360° image compression becomes a critical technology for media storage and transmission. Conventional 360° image compression pipeline projects the spherical image into a single 2D plane, leading to issues of oversampling and distortion. In this paper, we propose a novel viewport-based neural compression pipeline for 360° images. By replacing the image projection in conventional 360° image compression pipelines with viewport extraction and efficiently compressing multiple viewports, the proposed pipeline minimizes the inherent oversampling and distortion issues. However, viewport extraction impedes information sharing between multiple viewports during compression, causing the loss of global information about the spherical image. To tackle this global information loss, we design a neural viewport codec to capture global prior information across multiple viewports and maximally compress the viewport data. The viewport codec is empowered by a transformer-based ViewPort ConText (VPCT) module that can be integrated with canonical learning-based 2D image compression structures. We compare the proposed pipeline with existing 360° image compression models and conventional 360° image compression pipelines building on learning-based 2D image codecs and standard hand-crafted codecs. Results show that our pipeline saves an average of 14.01% bit consumption compared to the best-performing 360° image compression methods without compromising quality. The proposed VPCT-based codec also outperforms existing 2D image codecs in the viewport-based neural compression pipeline. Our code can be found at: this https URL.

[CV-144] Single-Subject Multi-View MRI Super-Resolution via Implicit Neural Representations

【速读】:该论文旨在解决临床MRI中因各向异性成像(anisotropic imaging)导致的图像质量受限问题,特别是多视角(multi-view)扫描在传统配准与插值融合过程中易丢失精细结构细节的问题。其解决方案的关键在于提出了一种单患者隐式多视角超分辨率方法(Single-Subject Implicit Multi-View Super-Resolution for MRI, SIMS-MRI),该方法无需预处理或后处理,仅依赖单个患者的多视角各向异性扫描数据,通过结合多分辨率哈希编码的隐式表示(multi-resolution hash-encoded implicit representation)与可学习的跨视角对齐机制(learned inter-view alignment),实现空间一致性的各向同性重建,从而有效提升图像分辨率并保留解剖细节。

链接: https://arxiv.org/abs/2603.22627
作者: Heejong Kim,Abhishek Thanki,Roel van Herten,Daniel Margolis,Mert R Sabuncu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Clinical MRI frequently acquires anisotropic volumes with high in-plane resolution and low through-plane resolution to reduce acquisition time. Multiple orientations are therefore acquired to provide complementary anatomical information. Conventional integration of these views relies on registration followed by interpolation, which can degrade fine structural details. Recent deep learning-based super-resolution (SR) approaches have demonstrated strong performance in enhancing single-view images. However, their clinical reliability is often limited by the need for large-scale training datasets, resulting in increased dependence on cohort-level priors. Self-supervised strategies offer an alternative by learning directly from the target scans. Prior work either neglects the existence of multi-view information or assumes that in-plane information can supervise through-plane reconstruction under the assumption of pre-alignment between images. However, this assumption is rarely satisfied in clinical settings. In this work, we introduce Single-Subject Implicit Multi-View Super-Resolution for MRI (SIMS-MRI), a framework that operates solely on anisotropic multi-view scans from a single patient without requiring pre- or post-processing. Our method combines a multi-resolution hash-encoded implicit representation with learned inter-view alignment to generate a spatially consistent isotropic reconstruction. We validate the SIMS-MRI pipeline on both simulated brain and clinical prostate MRI datasets. Code will be made publicly available for reproducibility: this https URL

[CV-145] Abnormalities and Disease Detection in Gastro-Intestinal Tract Images

【速读】:该论文旨在解决胃肠(Gastrointestinal, GI) tract图像中异常和疾病检测的准确性与实时性问题,尤其针对传统方法在处理复杂多样病变时性能不足、计算资源消耗大等挑战。其解决方案的关键在于构建一个从纹理特征提取到深度学习模型优化再到轻量化部署的多层次框架:首先利用纹理特征实现高速分类(>4000 FPS),随后引入数据袋装(bagging)策略提升深度学习模型在HyperKvasir和Kvasir V2数据集上的性能(F1-score最高达0.88),并进一步设计融合局部二值模式(Local Binary Patterns, LBP)的轻量级神经网络,在保持高精度(Accuracy: 0.99, F1-score: 0.91)的同时实现41 FPS的实时推理;同时提出两种基于深度可分离卷积(Depth-Wise Separable Convolution)与神经网络集成的分割工具,显著增强低帧率场景下的可用性与鲁棒性。

链接: https://arxiv.org/abs/2603.22378
作者: Zeshan Khan,Muhammad Atif Tahir
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: PhD Thesis

点击查看摘要

Abstract:Gastrointestinal (GI) tract image analysis plays a crucial role in medical diagnosis. This research addresses the challenge of accurately classifying and segmenting GI images for real-time applications, where traditional methods often struggle due to the diversity and complexity of abnormalities. The high computational demands of this domain require efficient and adaptable solutions. This PhD thesis presents a multifaceted approach to GI image analysis. Initially, texture-based feature extraction and classification methods were explored, achieving high processing speed (over 4000 FPS) and strong performance (F1-score: 0.76, Accuracy: 0.98) on the Kvasir V2 dataset. The study then transitions to deep learning, where an optimized model combined with data bagging techniques improved performance, reaching an accuracy of 0.92 and an F1-score of 0.60 on the HyperKvasir dataset, and an F1-score of 0.88 on Kvasir V2. To support real-time detection, a streamlined neural network integrating texture and local binary patterns was developed. By addressing inter-class similarity and intra-class variation through a learned threshold, the system achieved 41 FPS with high accuracy (0.99) and an F1-score of 0.91 on HyperKvasir. Additionally, two segmentation tools are proposed to enhance usability, leveraging Depth-Wise Separable Convolution and neural network ensembles for improved detection, particularly in low-FPS scenarios. Overall, this research introduces novel and adaptable methodologies, progressing from traditional texture-based techniques to deep learning and ensemble approaches, providing a comprehensive framework for advancing GI image analysis. Comments: PhD Thesis Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2603.22378 [eess.IV] (or arXiv:2603.22378v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2603.22378 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Zeshan Khan Dr. [view email] [v1] Mon, 23 Mar 2026 10:13:56 UTC (7,642 KB) Full-text links: Access Paper: View a PDF of the paper titled Abnormalities and Disease Detection in Gastro-Intestinal Tract Images, by Zeshan Khan and 1 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: eess.IV prev | next new | recent | 2026-03 Change to browse by: cs cs.AI cs.CV eess References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[CV-146] Ca2 transient detection and segmentation with the Astronomically motivated algorithm for Background Estimation And Transient Segmentation (Astro-BEATS)

【速读】:该论文旨在解决荧光钙成像(fluorescence-based Ca²⁺-imaging)中微小突触钙瞬变(miniature Synaptic Calcium Transients)检测与分割的难题,这类信号变化微弱、接近背景噪声水平,传统基于阈值的方法难以准确识别。解决方案的关键在于借鉴天文学中用于天文瞬变检测的图像估计与源定位技术,提出名为Astro-BEATS的自动化算法,该算法通过融合天文领域成熟的信号增强和源发现策略,显著提升了对微弱钙瞬变的检测性能,并具备无需重新优化即可应用于新数据集的优势,从而为深度学习模型提供高质量的标注训练数据。

链接: https://arxiv.org/abs/2603.22311
作者: Bolin Fan,Anthony Bilodeau,Frederic Beaupre,Theresa Wiesner,Christian Gagne,Flavie Lavoie-Cardinal,Renee Hlozek
机构: Dunlap Institute for Astronomy and Astrophysics, University of Toronto, Toronto, Canada; David A. Dunlap Department for Astronomy and Astrophysics, University of Toronto, Toronto, Canada; CERVO Brain Research Center, Québec, Canada; Institute Intelligence and Data, Université Laval, Québec, Canada; Department of Electrical Engineering and Computer Engineering, Université Laval, Québec, Canada; Canada CIFAR AI Chair, affiliated to Mila; Department of Psychiatry and Neuroscience, Université Laval, Québec, Canada
类目: Neurons and Cognition (q-bio.NC); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 4 figures, 12 supplementary pages, 5 supplementary figures

点击查看摘要

Abstract:Fluorescence-based Ca ^2+ -imaging is a powerful tool for studying localized neuronal activity, including miniature Synaptic Calcium Transients, providing real-time insights into synaptic activity. These transients induce only subtle changes in the fluorescence signal, often barely above baseline, which poses a significant challenge for automated synaptic transient detection and segmentation. Detecting astronomical transients similarly requires efficient algorithms that will remain robust over a large field of view with varying noise properties. We leverage techniques used in astronomical transient detection for miniature Synaptic Calcium Transient detection in fluorescence microscopy. We present Astro-BEATS, an automatic miniature Synaptic Calcium Transient segmentation algorithm that incorporates image estimation and source-finding techniques used in astronomy and designed for Ca ^2+ -imaging videos. Astro-BEATS outperforms current threshold-based approaches for synaptic Ca ^2+ transient detection and segmentation. The produced segmentation masks can be used to train a supervised deep learning algorithm for improved synaptic Ca ^2+ transient detection in Ca ^2+ -imaging data. The speed of Astro-BEATS and its applicability to previously unseen datasets without re-optimization makes it particularly useful for generating training datasets for deep learning-based approaches.

人工智能

[AI-0] ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

【速读】:该论文旨在解决软件需求工程(Requirements Engineering)阶段中人工成本高、效率低的问题,尤其针对从非结构化文档(如PDF、DOCX、PPTX)中自动提取、分类和分析功能性和非功能性需求的挑战。其解决方案的关键在于提出ReqFusion系统,该系统基于多大语言模型(Large Language Model, LLM)集成架构(整合OpenAI GPT、Anthropic Claude与Groq模型),并采用项目-环境-目标-系统(Project, Environment, Goal, and System, PEGS)结构化提示策略,以提升需求抽取的准确性与完整性。实验表明,PEGS引导式提示相比通用提示可将F1分数从0.71提升至0.88,且在真实场景中实现78%的分析时间节省和更高的自动化覆盖率。

链接: https://arxiv.org/abs/2603.23482
作者: Muhammad Khalid,Manuel Oriol,Yilmaz Uygun
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures, 7 tables. Accepted at VerifAI-2026 Workshop, co-located with ETAPS 2026

点击查看摘要

Abstract:Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.

[AI-1] Code Review Agent Benchmark

【速读】:该论文旨在解决自动化代码生成(code generation)带来的代码质量保障问题,尤其是在大规模代码库中如何有效进行代码审查(code review)与质量控制。随着生成式 AI 在编程中的广泛应用,自动产出的代码量激增,传统的人工审查已难以应对,亟需高效、可靠的代码审查代理(code review agent)来辅助或替代人工审核。解决方案的关键在于构建了一个名为 c-CRAB 的基准数据集,该数据集基于真实人类代码审查记录生成测试用例,用于系统性评估代码审查代理的能力。通过该框架对当前主流开源和商用代码审查代理(如 PR-agent、Devin、Claude Code 和 Codex)进行评测,揭示了现有方法仅能解决约 40% 的任务,且其审查视角与人类存在差异,为未来人机协同代码审查提供了方向,并提出由代理生成的测试用例可作为“质量门控”机制,推动代码生成、测试生成与审查代理之间的协同进化。

链接: https://arxiv.org/abs/2603.23448
作者: Yuntong Zhang,Zhiyuan Pan,Imam Nur Bani Yusuf,Haifeng Ruan,Ridwan Shariffdeen,Abhik Roychoudhury
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and generate huge volumes of code automatically – the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases – the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically given a pull-request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can asses the reviewing capability of the code review agents. Our evaluation framework is used to evaluate the state of the art today – the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews – given a human review of a pull request instance we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews – indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not the least, the agent generated tests from our data-set act as a held out test-suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents – remains to be investigated. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.23448 [cs.SE] (or arXiv:2603.23448v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2603.23448 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-2] Evaluating LLM -Based Test Generation Under Software Evolution

【速读】:该论文试图解决的问题是:当前基于大语言模型(Large Language Models, LLMs)的自动化单元测试生成方法是否真正理解程序行为,还是仅依赖训练中学习到的表面模式(surface-level cues),从而导致在代码演化过程中测试套件稳定性差、覆盖率下降及回归缺陷检测能力减弱。解决方案的关键在于设计并实施一个大规模实证研究框架,通过自动化突变驱动的方法,在8个LLM和22,374个程序变体上系统分析生成测试对语义改变型变更(Semantic-Altering Changes, SAC)与语义保持型变更(Semantic-Preserving Changes, SPC)的响应机制。结果表明,尽管原始程序下LLMs能实现高达79%的行覆盖和76%的分支覆盖,但面对代码演化时其生成测试显著退化——SAC下通过率降至66%,分支覆盖降至60%,且绝大多数失败测试在原程序上仍可通过,说明其本质是对旧行为的残留匹配而非对新语义的适应;SPC虽不改变功能,但因语法变化引发测试不稳定,进一步揭示了LLM测试生成对词法扰动的高度敏感性,凸显其缺乏深层语义理解和回归意识的本质局限。

链接: https://arxiv.org/abs/2603.23443
作者: Sabaat Haroon,Mohammad Taha Khan,Muhammad Ali Gulzar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 10 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve. Comments: 10 pages, 9 figures, 2 tables Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.23443 [cs.SE] (or arXiv:2603.23443v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2603.23443 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-3] argeted Adversarial Traffic Generation : Black-box Approach to Evade Intrusion Detection Systems in IoT Networks

【速读】:该论文旨在解决物联网(Internet of Things, IoT)环境中基于机器学习(Machine Learning, ML)的入侵检测系统(Intrusion Detection System, IDS)在面对对抗性攻击时存在的实际可行性问题,特别是针对网络层IDS的逃避攻击(evasion attack)缺乏实证研究的现状。其关键解决方案在于提出一种新颖的黑盒对抗攻击方法以验证攻击可行性,并进一步设计了一种针对性的防御机制,该机制通过有效识别多数对抗流量显著优于现有先进防御方案,从而提升了ML驱动IDS在真实场景下的鲁棒性与安全性。

链接: https://arxiv.org/abs/2603.23438
作者: Islam Debicha,Tayeb Kenaza,Ishak Charfi,Salah Mosbah,Mehdi Sehaki,Jean-Michel Dricot
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Already published in International Journal of Machine Learning and Cybernetics. Debicha, I., Kenaza, T., Charfi, I. et al. Targeted adversarial traffic generation: black-box approach to evade intrusion detection systems in IoT networks. Int. J. Mach. Learn. Cyber. 17, 58 (2026). this https URL

点击查看摘要

Abstract:The integration of machine learning (ML) algorithms into Internet of Things (IoT) applications has introduced significant advantages alongside vulnerabilities to adversarial attacks, especially within IoT-based intrusion detection systems (IDS). While theoretical adversarial attacks have been extensively studied, practical implementation constraints have often been overlooked. This research addresses this gap by evaluating the feasibility of evasion attacks on IoT network-based IDSs, employing a novel black-box adversarial attack. Our study aims to bridge theoretical vulnerabilities with real-world applicability, enhancing understanding and defense against sophisticated threats in modern IoT ecosystems. Additionally, we propose a defense scheme tailored to mitigate the impact of evasion attacks, thereby reinforcing the resilience of ML-based IDSs. Our findings demonstrate successful evasion attacks against IDSs, underscoring their susceptibility to advanced techniques. In contrast, we proposed a defense mechanism that exhibits robust performance by effectively detecting the majority of adversarial traffic, showcasing promising outcomes compared to current state-of-the-art defenses. By addressing these critical cybersecurity challenges, our research contributes to advancing IoT security and provides insights for developing more resilient IDS.

[AI-4] Mecha-nudges for Machines

【速读】:该论文旨在解决AI代理(AI agents)与人类共同决策环境中,如何通过优化选择呈现方式来影响AI行为而不损害人类决策质量的问题。传统“助推”(nudges)仅针对人类决策者设计,而随着AI在商业平台(如Etsy)中承担越来越多的决策任务,需考虑如何为AI设计新的干预机制。解决方案的关键在于提出“机械助推”(mecha-nudges)的概念,并将其形式化:通过结合贝叶斯说服(Bayesian persuasion)框架与V-usable信息(一种观测者相对的信息度量,是对香农信息的推广),建立一个统一的量化尺度(可用信息比特数),从而可比较不同干预策略、场景和模型对人类与AI的影响。实证结果表明,在ChatGPT发布后,Etsy商品列表显著增加了机器可用信息量,印证了系统性mecha-nudging的存在。

链接: https://arxiv.org/abs/2603.23433
作者: Giulio Frey,Kawin Ethayarajh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Nudges are subtle changes to the way choices are presented to human decision-makers (e.g., opt-in vs. opt-out by default) that shift behavior without restricting options or changing incentives. As AI agents increasingly make decisions in the same environments as humans, the presentation of choices may be optimized for machines as well as people. We introduce mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans. To formalize mecha-nudges, we combine the Bayesian persuasion framework with V-usable information, a generalization of Shannon information that is observer-relative. This yields a common scale (bits of usable information) for comparing a wide range of interventions, contexts, and models. Applying our framework to product listings on Etsy – a global marketplace for independent sellers – we find that following ChatGPT’s release, listings have significantly more machine-usable information about product selection, consistent with systematic mecha-nudging.

[AI-5] Bilevel Autoresearch: Meta-Autoresearching Itself

【速读】:该论文旨在解决现有自研究(autoresearch)系统依赖人工干预以优化搜索机制的问题,即当前所有autoresearch系统均需人类开发者识别瓶颈并手动编写新代码进行改进。为实现完全自主的优化,作者提出双层自研究(Bilevel Autoresearch)框架,其核心在于:通过一个外层元优化循环(outer loop)动态生成并注入新的搜索机制(如组合优化、多臂赌博机等),以提升内层autoresearch循环(inner loop)的任务性能;而内层循环专注于具体任务优化,外层则优化搜索策略本身。两者共享同一LLM模型,无需更强的元模型。实验表明,在Karpathy的GPT预训练基准上,该方法相比仅调整参数的基线实现5倍性能提升(验证负比特每字符从-0.009降至-0.045),且外层能自主发现有效机制,打破原有确定性搜索模式,强制探索LLM先验知识所忽略的方向。

链接: https://arxiv.org/abs/2603.23420
作者: Yaonan Qu,Meng Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures, 3 this http URL paper was primarily drafted by AI agents with human oversight and direction

点击查看摘要

Abstract:If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every existing autoresearch system – from Karpathy’s single-track loop to AutoResearchClaw’s multi-batch extension and EvoScientist’s persistent memory – was improved by a human who read the code, identified a bottleneck, and wrote new code. We ask whether an LLM can do the same, autonomously. We present Bilevel Autoresearch, a bilevel framework where an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. The inner loop optimizes the task; the outer loop optimizes how the inner loop searches. Both loops use the same LLM – no stronger model is needed at the meta level. On Karpathy’s GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop autonomously discovers mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments – without human specification of which domains to explore. These mechanisms succeed by breaking the inner loop’s deterministic search patterns, forcing exploration of directions the LLM’s priors systematically avoid. The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective.

[AI-6] SortedRL: Accelerating RL Training for LLM s through Online Length-Aware Scheduling

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)训练中因长轨迹生成导致的采样效率瓶颈问题,尤其是在大语言模型(Large Language Models, LLMs)进行复杂推理任务时,rollout阶段耗时可高达总训练时间的70%。其解决方案的关键在于提出一种在线长度感知调度策略SortedRL,通过按输出长度对采样样本重新排序,优先处理短序列并形成组块以提前触发策略更新,从而实现大规模rollout批处理、灵活更新批次以及近在线策略微课程构建的协同优化;同时引入基于缓存的机制控制off-policy训练程度,并依托专用RL基础设施管理状态控制器与rollout缓冲区,显著提升训练效率与稳定性。

链接: https://arxiv.org/abs/2603.23414
作者: Yiqi Zhang,Huiqiang Jiang,Xufang Luo,Zhihe Yang,Chengruidong Zhang,Yifei Shen,Dongsheng Li,Yuqing Yang,Lili Qiu,Yang You
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing short samples forming groups for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously. To further accelerate the pipeline, SortedRL incorporates a mechanism to control the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles, and math challenges like AIME 24, Math 500, and Minerval, show that SortedRL reduces RL training bubble ratios by over 50%, while attaining 3.9% to 18.4% superior performance over baseline given same amount of data.

[AI-7] Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

【速读】:该论文旨在解决离散能量模型(Energy-based Models, EBMs)在图结构生成任务中因采样效率低和样本质量差而导致的保真度不足问题,尤其是由于支持域外存在虚假局部极小值而引发的采样陷阱与训练不稳定现象。其解决方案的关键在于提出Graph Energy Matching (GEM),该方法基于Jordan-Kinderlehrer-Otto (JKO)方案的传输映射优化视角,学习一个排列不变的势能函数,同时提供从噪声到数据的梯度引导运输路径,并在高概率区域精炼样本;此外,引入一种能量驱动的采样协议,通过能量开关无缝衔接:(i) 快速梯度引导运输至高概率区域,以及 (ii) 混合采样以探索学习到的图分布,从而显著提升生成质量和推理灵活性。

链接: https://arxiv.org/abs/2603.23398
作者: Michal Balcerak,Suprosana Shit,Chinmay Prabhakar,Sebastian Kaltenbach,Michael S. Albergo,Yilun Du,Bjoern Menze
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Energy-based models for discrete domains, such as graphs, explicitly capture relative likelihoods, naturally enabling composable probabilistic inference tasks like conditional generation or enforcing constraints at test-time. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities. This has historically resulted in a fidelity gap relative to discrete diffusion models. We introduce Graph Energy Matching (GEM), a generative framework for graphs that closes this fidelity gap. Motivated by the transport map optimization perspective of the Jordan-Kinderlehrer-Otto (JKO) scheme, GEM learns a permutation-invariant potential energy that simultaneously provides transport-aligned guidance from noise toward data and refines samples within regions of high data likelihood. Further, we introduce a sampling protocol that leverages an energy-based switch to seamlessly bridge: (i) rapid, gradient-guided transport toward high-probability regions to (ii) a mixing regime for exploration of the learned graph distribution. On molecular graph benchmarks, GEM matches or exceeds strong discrete diffusion baselines. Beyond sample quality, explicit modeling of relative likelihood enables targeted exploration at inference time, facilitating compositional generation, property-constrained sampling, and geodesic interpolation between graphs.

[AI-8] RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue

【速读】:该论文旨在解决实时语音对话系统中延迟(latency)与响应质量之间的权衡问题。传统端到端语音到语音(Speech-to-Speech, S2S)模型虽能实现低延迟并自然处理交互行为,但语义质量较弱;而级联式流水线(ASR - LLM)虽然生成质量更高,但延迟随模型规模显著增加。解决方案的关键在于提出一种名为RelayS2S的混合架构:在检测到用户发言结束时并行运行两条路径——快速路径采用双工S2S模型预先生成短响应前缀并立即通过文本转语音(TTS)输出,以实现低延迟音频启动;慢速路径则基于已确认的前缀继续生成高质量后续内容,最终无缝拼接形成完整响应。一个轻量级学习验证器决定是否将控制权从快速路径移交至慢速路径,从而在保证高响应质量的同时维持接近S2S模型的P90延迟水平。该设计无需修改现有组件结构,可作为轻量级插件集成至已有级联系统中。

链接: https://arxiv.org/abs/2603.23346
作者: Long Mai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR - LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path – a duplex S2S model – speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path – a cascaded ASR - LLM pipeline – generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow-path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop-in addition to existing cascaded pipelines. Our code and data are publicly available at: this https URL

[AI-9] Edge Radar Material Classification Under Geometry Shifts

【速读】:该论文旨在解决毫米波雷达(mmWave radar)在机器人导航与交互中对材料识别的鲁棒性问题,特别是在视觉传感器(如摄像头和LiDAR)性能受限的场景下,如何实现高效、低功耗的实时材料分类。其解决方案的关键在于设计了一套轻量级的雷达材料分类流水线,基于紧凑的距离-范围单元强度描述符(compact range-bin intensity descriptors)与多层感知机(Multilayer Perceptron, MLP)相结合,在超低功耗边缘设备(TI IWRL6432)上实现快速推理。然而,研究发现当传感器几何条件发生实际偏移(如高度变化或小角度倾斜)时,系统性能显著下降,主要归因于系统性强度缩放和角度依赖的雷达散射截面(Radar Cross Section, RCS)效应导致特征分布偏移。为此,作者提出通过归一化、几何增强和运动感知特征等策略提升模型鲁棒性。

链接: https://arxiv.org/abs/2603.23342
作者: Jannik Hohmann,Dong Wang,Andreas Nüchter
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Material awareness can improve robotic navigation and interaction, particularly in conditions where cameras and LiDAR degrade. We present a lightweight mmWave radar material classification pipeline designed for ultra-low-power edge devices (TI IWRL6432), using compact range-bin intensity descriptors and a Multilayer Perceptron (MLP) for real-time inference. While the classifier reaches a macro-F1 of 94.2% under the nominal training geometry, we observe a pronounced performance drop under realistic geometry shifts, including sensor height changes and small tilt angles. These perturbations induce systematic intensity scaling and angle-dependent radar cross section (RCS) effects, pushing features out of distribution and reducing macro-F1 to around 68.5%. We analyze these failure modes and outline practical directions for improving robustness with normalization, geometry augmentation, and motion-aware features.

[AI-10] A Comparative Study of Machine Learning Models for Hourly Forecasting of Air Temperature and Relative Humidity

【速读】:该论文旨在解决山地城市(如中国重庆)中空气温度和相对湿度的小时级短期预测问题,以支持更精准的城市管理决策。其解决方案的关键在于构建一个统一的数据预处理、滞后特征构造、滚动统计特征提取及时间序列验证框架,并系统比较七种机器学习模型(包括XGBoost、随机森林、支持向量回归、多层感知机、决策树、LSTM和CNN-LSTM)在真实开放数据上的表现。研究发现,基于树结构的集成学习方法——XGBoost在预测精度和鲁棒性方面均最优,其测试集平均绝对误差(MAE)为0.302 °C(温度)和1.271%(湿度),且两个任务的平均R²达0.989,表明树基模型对结构化气象时序数据具有显著优势,可为复杂地形城市的智能气象预报提供有效技术路径。

链接: https://arxiv.org/abs/2603.23282
作者: Jiaqi Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate short-term forecasting of air temperature and relative humidity is critical for urban management, especially in topographically complex cities such as Chongqing, China. This study compares seven machine learning models: eXtreme Gradient Boosting (XGBoost), Random Forest, Support Vector Regression (SVR), Multi-Layer Perceptron (MLP), Decision Tree, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Network (CNN)-LSTM (CNN-LSTM), for hourly prediction using real-world open data. Based on a unified framework of data preprocessing, lag-feature construction, rolling statistical features, and time-series validation, the models are systematically evaluated in terms of predictive accuracy and robustness. The results show that XGBoost achieves the best overall performance, with a test mean absolute error (MAE) of 0.302 °C for air temperature and 1.271% for relative humidity, together with an average R2 of 0.989 across the two forecasting tasks. These findings demonstrate the strong effectiveness of tree-based ensemble learning for structured meteorological time-series forecasting and provide practical guidance for intelligent meteorological forecasting in mountainous cities.

[AI-11] Emergence of Frag ility in LLM -based Social Networks: the Case of Moltbook

【速读】:该论文旨在解决如何理解由大语言模型(Large Language Models, LLM)驱动的AI代理在社交平台中形成的交互结构及其集体动力学问题。其解决方案的关键在于构建一个基于Moltbook平台的有向加权网络模型,其中节点代表AI代理,边表示评论交互,并利用网络科学工具对大规模真实数据(39,924个用户、235,572条帖子和1,540,238条评论)进行分析。结果揭示了高度异质的连接模式、显著的核心-外围结构(核心仅占0.9%节点却集中大量连边),以及对关键节点(尤其是高出度节点)的高度脆弱性,表明LLM原生社交环境中存在结构性集中化与脆弱性,为理解AI代理群体组织机制提供了新视角。

链接: https://arxiv.org/abs/2603.23279
作者: Luca Sodano,Sofia Sciangula,Amulya Galmarini,Francesco Bertolotti
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid diffusion of large language models and the growth in their capability has enabled the emergence of online environments populated by autonomous AI agents that interact through natural language. These platforms provide a novel empirical setting for studying collective dynamics among artificial agents. In this paper we analyze the interaction network of Moltbook, a social platform composed entirely of LLM based agents, using tools from network science. The dataset comprises 39,924 users, 235,572 posts, and 1,540,238 comments collected through web scraping. We construct a directed weighted network in which nodes represent agents and edges represent commenting interactions. Our analysis reveals strongly heterogeneous connectivity patterns characterized by heavy tailed degree and activity distributions. At the mesoscale, the network exhibits a pronounced core periphery organization in which a very small structural core (0.9% of nodes) concentrates a large fraction of connectivity. Robustness experiments show that the network is relatively resilient to random node removal but highly vulnerable to targeted attacks on highly connected nodes, particularly those with high out degree. These findings indicate that the interaction structure of AI agent social systems may develop strong centralization and structural fragility, providing new insights into the collective organization of LLM native social environments.

[AI-12] A Multimodal Framework for Human-Multi-Agent Interaction

【速读】:该论文旨在解决多机器人(multi-robot)在社会性物理环境中实现自然、可扩展的人机交互(Human-Robot Interaction, HRI)的挑战,现有系统普遍缺乏统一框架来整合多模态感知(multimodal perception)、具身表达(embodied expression)与协同决策(coordinated decision-making)。解决方案的关键在于提出一个基于多模态的框架,其中每个机器人作为具备集成多模态感知与大型语言模型(Large Language Model, LLM)驱动规划能力的自主认知代理(cognitive agent),其行为基于具身性(embodiment)进行 grounded reasoning;同时,在团队层面引入集中式协调机制以管理轮次切换和代理参与,避免语音重叠与动作冲突,从而实现通过言语、手势、注视和移动等多模态交互策略的协同推理与具身响应。

链接: https://arxiv.org/abs/2603.23271
作者: Shaid Hasan,Breenice Lee,Sujan Sarker,Tariq Iqbal
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures. Accepted at ACM/IEEE HRI 2026 Workshop (MAgicS-HRI)

点击查看摘要

Abstract:Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)-driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through interaction policies that combine speech, gesture, gaze, and locomotion. Representative interaction runs demonstrate coordinated multimodal reasoning across agents and grounded embodied responses. Future work will focus on larger-scale user studies and deeper exploration of socially grounded multi-agent interaction dynamics.

[AI-13] Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对越狱提示(jailbreak prompts)时,现有攻击方法因未区分token重要性而导致查询效率低下、冗余搜索严重的问题。其核心挑战在于如何在有限查询预算下高效识别触发模型拒绝行为的敏感token区域,并提升攻击成功率(Attack Success Rate, ASR)。解决方案的关键在于提出TriageFuzz框架,该框架通过两个创新设计实现:一是利用代理模型(surrogate model)进行token级贡献度估计,从而识别prompt中对拒绝行为影响最大的敏感区域;二是引入基于拒绝引导的进化策略,结合轻量级评分器自适应加权候选提示,引导搜索方向以突破安全约束。实验表明,TriageFuzz在保持高ASR的同时显著降低查询成本,尤其在极端预算限制下表现优异。

链接: https://arxiv.org/abs/2603.23269
作者: Wenyu Chen,Xiangtao Meng,Chuanchao Zang,Li Wang,Xinyu Gao,Jianing Wang,Peng Zhan,Zheng Li,Shanqing Guo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models(LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant searching under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model’s refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts the fuzz testing approach with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behaviors, enabling the identification of sensitive regions within the prompt. Furthermore, it incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer to steer the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) with significantly reduced query costs. Notably, it attains a 90% ASR with over 70% fewer queries compared to baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20-40%.

[AI-14] SafeSeek: Universal Attribution of Safety Circuits in Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中安全关键行为(如对齐、越狱攻击、后门攻击)的可解释性不足问题,现有安全归因方法因依赖启发式、领域特定指标和搜索算法而难以泛化与可靠。其解决方案的核心是提出一种统一的安全可解释性框架——\ourmethod,通过优化手段识别功能完备的安全电路(safety circuits):首先引入可微分二值掩码(differentiable binary masks),利用梯度下降从安全数据集中提取多粒度电路;其次结合安全电路调优(Safety Circuit Tuning),将稀疏电路用于高效安全微调。该方法突破了传统仅关注孤立注意力头或神经元的局限,在后门攻击场景下定位出仅占0.42%参数的电路,移除后使攻击成功率从100%降至0.4%,同时保留99%通用能力;在对齐场景中定位出包含3.03%注意力头和0.79%神经元的电路,移除后攻击率从0.8%飙升至96.9%,而训练时排除该电路仍能保持96.5%的安全保留率,验证了其有效性与实用性。

链接: https://arxiv.org/abs/2603.23268
作者: Miao Yu,Siyuan Fu,Moayad Aloqaily,Zhenhong Zhou,Safa Otoum,Xing fan,Kun Wang,Yufei Guo,Qingsong Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose \ourmethod, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, \ourmethod introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, while integrates Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate \ourmethod in two key scenarios in LLM safety: \textbf(1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation eradicates the Attack Success Rate (ASR) from 100% \to 0.4% while retaining over 99% general utility; \textbf(2) safety alignment, localizing an alignment circuit with 3.03% heads and 0.79% neurons, whose removal spikes ASR from 0.8% \to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.

[AI-15] AI Lifecycle-Aware Feasibility Framework for Split-RIC Orchestration in NTN O-RAN

【速读】:该论文旨在解决将人工智能(Artificial Intelligence, AI)集成到非地面网络(Non-Terrestrial Networks, NTN)时所面临的双重约束问题:卫星系统尺寸、重量和功耗(Size, Weight, and Power, SWaP)限制以及馈电链路(feeder-link)容量瓶颈,这些问题直接影响O-RAN(Open Radio Access Network)闭环控制与模型生命周期管理的效率。解决方案的关键在于提出一种分层的RAN智能控制器(Split-RIC)架构,通过在地面、低地球轨道(Low Earth Orbit, LEO)和地球静止轨道(Geostationary Earth Orbit, GEO)段之间合理分配O-RAN控制层级,实现对训练数据传输、模型分发和近实时推理等关键环节的能量与延迟的量化建模,并基于数值敏感性分析识别出不同场景下本地化推理与非地面学习回路相对于地面卸载的物理可行性区域。

链接: https://arxiv.org/abs/2603.23252
作者: Daniele Tarchi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 12 pages, 9 figures. Submitted to IEEE Transactions on Network and Service Management (TNSM)

点击查看摘要

Abstract:Integrating Artificial Intelligence (AI) into Non-Terrestrial Networks (NTN) is constrained by the joint limits of satellite SWaP and feeder-link capacity, which directly impact O-RAN closed-loop control and model lifecycle management. This paper studies the feasibility of distributing the O-RAN control hierarchy across Ground, LEO, and GEO segments through a Split-RIC architecture. We compare three deployment scenarios: (i) ground-centric control with telemetry streaming, (ii) ground–LEO Split-RIC with on-board inference and store-and-forward learning, and (iii) GEO–LEO multi-layer control enabled by inter-satellite links. For each scenario, we derive closed-form expressions for lifecycle energy and lifecycle latency that account for training-data transfer, model dissemination, and near-real-time inference. Numerical sensitivity analysis over feeder-link conditions, model complexity, and orbital intermittency yields operator-relevant feasibility regions that delineate when on-board inference and non-terrestrial learning loops are physically preferable to terrestrial offloading.

[AI-16] A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

【速读】:该论文旨在解决异构环境下有向无环图(DAG)调度的高效性与适应性难题,尤其关注任务-资源池兼容性系数建模以及生成诱导最优性差距(generation-induced optimality gaps)问题。其核心解决方案是提出WeCAN框架,采用两阶段单次遍历设计:首先通过加权交叉注意力编码器建模任务-资源池交互关系,并引入由兼容性系数门控的机制以实现对环境波动的尺寸无关性;其次基于顺序空间分析揭示生成映射的可达调度顺序集,进而设计一种带解析参数化递减跳过规则(skip-extended realization)的调度策略,在不牺牲单次遍历效率的前提下扩大可达顺序集合,从而消除生成诱导最优性差距。实验表明,该方法在计算图和真实TPC-H DAG上均优于强基线,且推理时间接近经典启发式算法、快于多轮神经调度器。

链接: https://arxiv.org/abs/2603.23249
作者: Ruisong Zhou,Haijun Zou,Li Zhou,Chumin Sun,Zaiwen Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 30pages, 8 figures

点击查看摘要

Abstract:Efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments is challenging due to resource capacities and dependencies. In practice, the need for adaptability across environments with varying resource pools and task types, alongside rapid schedule generation, complicates these challenges. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task–pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task–pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task–pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.

[AI-17] Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中环境动态存在随机性以及观测不完整时的建模与策略优化问题。针对完全可观测环境中随机过渡动态的建模,论文提出使用神经随机微分方程(neural Stochastic Differential Equations, SDEs),相较于神经常微分方程(neural Ordinary Differential Equations, ODEs),其能更有效地捕捉状态转移中的固有随机性,从而提升策略性能并增强样本效率;对于部分可观测场景,则设计了一种潜在空间中的SDE模型,结合了确定性ODE与通过生成对抗网络(GAN)训练的随机成分,以隐式地学习未观测状态的动态。该方案的关键在于利用条件于动作的潜在SDE进行规划,实现了对环境变化的高效适应和在复杂随机连续控制任务上的优越表现。

链接: https://arxiv.org/abs/2603.23245
作者: Chao Han,Stefanos Ioannou,Luca Manneschi,T.J. Hayward,Michael Mangan,Aditya Gilra,Eleni Vasilaki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model-based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high-performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN-trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model-based and model-free approaches across stochastic continuous-control benchmarks. This work demonstrates the applicability of action-conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: this https URL

[AI-18] Online library learning in human visual puzzle solving

【速读】:该论文试图解决的问题是:人类在学习复杂任务时如何形成并利用可复用的抽象结构(helpers),以应对未来任务中的不确定性与复杂性。解决方案的关键在于“在线库学习”(online library learning)机制——即个体在解决问题过程中动态构建、优化和重用中间抽象(helpers),这些抽象捕捉了重复出现的结构,从而显著提升效率。实验表明,随着经验积累,参与者对helper的选择更加高效且具成本敏感性,而计算建模进一步验证了助手访问能有效降低问题难度,并揭示了搜索空间大小(由程序归纳模型估计)与决策时间及操作数呈正相关,而非单纯程序长度。这说明人类问题解决的核心机制是灵活的、基于经验的抽象生成与复用策略。

链接: https://arxiv.org/abs/2603.23244
作者: Pinzhe Zhao,Emanuele Sansone,Marta Kryven,Bonan Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When learning a novel complex task, people often form efficient reusable abstractions that simplify future work, despite uncertainty about the future. We study this process in a visual puzzle task where participants define and reuse helpers – intermediate constructions that capture repeating structure. In an online experiment, participants solved puzzles of increasing difficulty. Early on, they created many helpers, favouring completeness over efficiency. With experience, helper use became more selective and efficient, reflecting sensitivity to reuse and cost. Access to helpers enabled participants to solve puzzles that were otherwise difficult or impossible. Computational modelling shows that human decision times and number of operations used to complete a puzzle increase with search space estimated by a program induction model with library learning. In contrast, raw program length predicts failure but not effort. Together, these results point to online library learning as a core mechanism in human problem solving, allowing people to flexibly build, refine, and reuse abstractions as task demands grow.

[AI-19] MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation

【速读】:该论文旨在解决多模型异构环境下,大语言模型(Large Language Model, LLM)代理间共享记忆系统时性能下降的问题。现有方法通常为每个代理单独构建记忆,导致知识与特定代理的推理风格紧密耦合,难以跨模型复用。其核心挑战在于:直接迁移记忆会引入代理特异性偏差,干扰任务相关知识的有效利用。解决方案的关键是提出 MemCollab 框架,通过对比不同代理在相同任务上生成的推理轨迹,提炼出抽象的、任务层面的约束条件,从而剥离代理特异性特征并构建代理无关的记忆表示;同时引入任务感知的检索机制,在推理阶段仅调用与当前任务类别相关的约束信息,确保记忆使用的精准性和高效性。实验表明,该方法显著提升了多种代理(包括跨模态家族)在数学推理和代码生成任务上的准确率与推理效率。

链接: https://arxiv.org/abs/2603.23234
作者: Yurui Chang,Yiran Wu,Qingyun Wu,Lu Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language model (LLM)-based agents rely on memory mechanisms to reuse knowledge from past problem-solving experiences. Existing approaches typically construct memory in a per-agent manner, tightly coupling stored knowledge to a single model’s reasoning style. In modern deployments with heterogeneous agents, a natural question arises: can a single memory system be shared across different models? We found that naively transferring memory between agents often degrades performance, as such memory entangles task-relevant knowledge with agent-specific biases. To address this challenge, we propose MemCollab, a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task. This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts. We further introduce a task-aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are used at inference time. Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-modal-family settings. Our results show that the collaboratively constructed memory can function as a shared reasoning resource for diverse LLM-based agents.

[AI-20] PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

【速读】:该论文旨在解决大语言模型在长期记忆中保持用户个性(persona)一致性的问题,现有评估方法因混入无关对话而将任务简化为“大海捞针”式的检索,忽视了用户偏好随交互逐步演化且依赖事件间关联性的本质特征。解决方案的关键在于提出PERMA基准,其核心创新包括:(1) 构建时序有序的多轮跨域交互事件序列,并插入偏好相关查询以模拟真实个性化演进过程;(2) 引入文本多样性与语言风格对齐机制,以刻画现实数据中用户输入的不规则性和个体话语特征(idiolect)。通过设计多项选择与交互式任务,PERMA能更准确地评估模型对用户 persona 在时间维度上的理解能力,实验表明具备关联推理能力的记忆系统可提升偏好提取精度并降低 token 消耗,但仍难以应对长程时序和跨域干扰下的 persona 一致性挑战,凸显出构建更鲁棒的个性化记忆管理机制的重要性。

链接: https://arxiv.org/abs/2603.23231
作者: Shuochen Liu,Junyi Zhu,Long Shu,Junda Lin,Yuhao Chen,Haotian Zhang,Chao Zhang,Derong Xu,Jia Li,Bo Tang,Zhiyu Li,Feiyu Xiong,Enhong Chen,Tong Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Empowering large language models with long-term memory is crucial for building agents that adapt to users’ evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model’s understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at this https URL.

[AI-21] General Machine Learning: Theory for Learning Under Variable Regimes

【速读】:该论文旨在解决学习系统在制度变化(regime variation)环境下的理论建模与分析问题,即当学习者、其记忆状态及评估条件随时间演化时,如何构建一个结构化的学习理论框架以支持对这类动态学习过程的严格形式化和定理化推导。其解决方案的关键在于提出一个以可接受传输(admissible transport)保护核心不变性(protected-core preservation)评估者感知的学习演化机制(evaluator-aware learning evolution) 为核心的理论体系,并通过构造性证明建立了若干基础性定理结果,包括:可接受性的闭包性质、忠实固定本体还原的结构性障碍论证、受保护稳定性模板及其数值与符号证据(针对凸和演绎子类),以及评估者因子分解、态射、复合与部分核层级对齐等高层结构性质。论文进一步通过两制度示例明确展示了可接受性证书、保护评估核心与制度变化代价的具体表达,为未来更广泛的多制度学习系统提供首个具有定理支撑的结构化理论基础。

链接: https://arxiv.org/abs/2603.23220
作者: Aomar Osmani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 56 pages

点击查看摘要

Abstract:We study learning under regime variation, where the learner, its memory state, and the evaluative conditions may evolve over time. This paper is a foundational and structural contribution: its goal is to define the core learning-theoretic objects required for such settings and to establish their first theorem-supporting consequences. The paper develops a regime-varying framework centered on admissible transport, protected-core preservation, and evaluator-aware learning evolution. It records the immediate closure consequences of admissibility, develops a structural obstruction argument for faithful fixed-ontology reduction in genuinely multi-regime settings, and introduces a protected-stability template together with explicit numerical and symbolic witnesses on controlled subclasses, including convex and deductive settings. It also establishes theorem-layer results on evaluator factorization, morphisms, composition, and partial kernel-level alignment across semantically commensurable layers. A worked two-regime example makes the admissibility certificate, protected evaluative core, and regime-variation cost explicit on a controlled subclass. The symbolic component is deliberately restricted in scope: the paper establishes a first kernel-level compatibility result together with a controlled monotonic deductive witness. The manuscript should therefore be read as introducing a structured learning-theoretic framework for regime-varying learning together with its first theorem-supporting layer, not as a complete quantitative theory of all learning systems. Comments: 56 pages Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2603.23220 [cs.LG] (or arXiv:2603.23220v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.23220 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-22] SAiW: Source-Attributable Invisible Watermarking for Proactive Deepfake Defense

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成的深度伪造(deepfake)内容对信息真实性、数字身份和公众信任造成的威胁,尤其针对现有检测方法多为被动响应、难以适应不断演进的生成技术的问题。解决方案的关键在于提出一种源属性不可见水印框架(SAiW),其核心创新是将水印嵌入建模为源条件表示学习问题,通过特征级线性调制(feature-wise linear modulation)将来源身份编码融入嵌入网络,生成具有可区分性和可追溯性的签名;同时引入基于人类视觉系统先验的感知引导模块以确保扰动视觉不可见且鲁棒性强,并设计双功能取证解码器实现水印重建与来源归属的同步输出,从而实现媒体起源绑定与可信溯源的主动防御机制。

链接: https://arxiv.org/abs/2603.23178
作者: Bibek Das,Chandranath Adak,Soumi Chattopadhyay,Zahid Akhtar,Soumya Dutta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deepfakes generated by modern generative models pose a serious threat to information integrity, digital identity, and public trust. Existing detection methods are largely reactive, attempting to identify manipulations after they occur and often failing to generalize across evolving generation techniques. This motivates the need for proactive mechanisms that secure media authenticity at the time of creation. In this work, we introduce SAiW, a Source-Attributed Invisible watermarking Framework for proactive deepfake defense and media provenance verification. Unlike conventional watermarking methods that treat watermark payloads as generic signals, SAiW formulates watermark embedding as a source-conditioned representation learning problem, where watermark identity encodes the originating source and modulates the embedding process to produce discriminative and traceable signatures. The framework integrates feature-wise linear modulation to inject source identity into the embedding network, enabling scalable multi-source watermark generation. A perceptual guidance module derived from human visual system priors ensures that watermark perturbations remain visually imperceptible while maintaining robustness. In addition, a dual-purpose forensic decoder simultaneously reconstructs the embedded watermark and performs source attribution, providing both automated verification and interpretable forensic evidence. Extensive experiments across multiple deepfake datasets demonstrate that SAiW achieves high perceptual quality while maintaining strong robustness against compression, filtering, noise, geometric transformations, and adversarial perturbations. By binding digital media to its origin through invisible yet verifiable markers, SAiW enables reliable authentication and source attribution, providing a scalable foundation for proactive deepfake defense and trustworthy media provenance.

[AI-23] Robust Safety Monitoring of Language Models via Activation Watermarking

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的安全挑战,即适应性攻击者(adaptive adversaries)能够设计出既能规避检测又能诱导模型产生不安全输出的攻击策略。现有LLM监控机制无法应对此类攻击,因为攻击者可利用对监控算法的了解来规避检测,而模型提供商因缺乏对具体滥用方式的认知而难以修补漏洞。论文将鲁棒的LLM监控建模为一个安全博弈,并提出关键解决方案——激活水印(activation watermarking),通过在推理过程中引入可控的不确定性,使攻击者难以准确预测或绕过检测机制。实验表明,该方法在已知监控算法但未知秘密密钥的适应性攻击下,相较于基线防护手段可提升达52%的检测性能。

链接: https://arxiv.org/abs/2603.23171
作者: Toluwani Aremu,Daniil Ognev,Samuele Poppi,Nils Lukas
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 20 pages, 17 figures

点击查看摘要

Abstract:Large language models (LLMs) can be misused to reveal sensitive information, such as weapon-making instructions or writing malware. LLM providers rely on \emphmonitoring to detect and flag unsafe behavior during inference. An open security challenge is \emphadaptive adversaries who craft attacks that simultaneously (i) evade detection while (ii) eliciting unsafe behavior. Adaptive attackers are a major concern as LLM providers cannot patch their security mechanisms, since they are unaware of how their models are being misused. We cast \emphrobust LLM monitoring as a security game, where adversaries who know about the monitor try to extract sensitive information, while a provider must accurately detect these adversarial queries at low false positive rates. Our work (i) shows that existing LLM monitors are vulnerable to adaptive attackers and (ii) designs improved defenses through \emphactivation watermarking by carefully introducing uncertainty for the attacker during inference. We find that \emphactivation watermarking outperforms guard baselines by up to 52% under adaptive attackers who know the monitoring algorithm but not the secret key.

[AI-24] Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

【速读】:该论文旨在解决安全关键型智能体在执行动作前难以高效预测行为后果的问题,尤其针对当前基于视觉模拟的世界模型因计算延迟过高(每步超过几秒)而无法满足实时性需求的瓶颈。其解决方案的关键在于提出一种名为DILLO(DIstiLLed Language-ActiOn World Model)的新范式,即从“先模拟再行动”转变为“先描述再行动”,通过将策略的潜在状态与计划动作结合,利用跨模态蒸馏训练一个仅依赖文本输入的大型语言模型(Large Language Model, LLM),使其能够直接预测动作的语义结果,从而绕过耗时的视觉生成过程,实现14倍的速度提升,并显著提高任务成功率(MetaWorld和LIBERO上分别提升最高达15个百分点和平均9.3个百分点)。

链接: https://arxiv.org/abs/2603.23149
作者: Massimiliano Pappa,Luca Romani,Valentino Sacco,Alessio Palma,Stéphane Lathuilière,Fabio Galasso,Xavier Alameda-Pineda,Indro Spinelli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy’s latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from “simulate-then-act” to “describe-then-act.” DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.

[AI-25] MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models

【速读】:该论文旨在解决当前医学链式思维(Chain-of-Thought, CoT)模型在医疗诊断中缺乏显式因果推理机制的问题,从而避免模型依赖表面相关性(spurious correlations)导致的误判,提升临床可靠性。其核心挑战在于:如何自适应地触发因果修正、构建高质量的因果-伪相关对比样本,以及保持推理轨迹中的因果一致性。解决方案的关键是提出 MedCausalX 框架,该框架通过两个阶段的自适应反思架构(引入 ⟨causal⟩ 和 ⟨verify⟩ tokens)实现因果推理链的显式建模,并结合基于错误归因的强化学习优化轨迹级因果修正目标,使模型能够区分真实因果关系与捷径关联。此外,研究还构建了 CRMed 数据集,提供细粒度解剖标注、结构化因果推理链及反事实变体,为因果学习提供数据基础。实验表明,该方法显著提升了诊断一致性(+5.4 分)、降低了幻觉(>10 分),并实现了最优的空间定位 IoU。

链接: https://arxiv.org/abs/2603.23085
作者: Jianxin Lin,Chunzheng Zhu,Peter J. Kneuertz,Yunfei Bai,Yuan Xue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive samples, and maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset providing fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with \langle causal \rangle and \langle verify \rangle tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective optimized through error-attributed reinforcement learning refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and attaining top spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning.

[AI-26] Can an LLM Detect Instances of Microservice Infrastructure Patterns?

【速读】:该论文旨在解决当前软件架构模式检测工具在多语言和多类型代码库中识别能力有限的问题,尤其是现有工具通常仅支持单一编程语言的模式识别,难以应对实际开发中复杂的异构技术栈。其解决方案的关键在于提出并实现MicroPAD工具,该工具基于GPT-5 nano模型,利用自然语言描述的架构模式语义信息,实现对跨语言、跨类型软件 artifacts 中架构模式(如基础设施相关的微服务模式)的自动化检测。实验表明,MicroPAD能够在多种编程语言和文件类型中有效识别模式实例,且检测性能受模式出现频率和其在代码中表现特征显著性的影响。

链接: https://arxiv.org/abs/2603.23073
作者: Carlos Eduardo Duarte,Neil B. Harrison,Filipe Figueiredo Correia,Ademar Aguiar,Pavlína Gonçalves
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at ICSA 2026 - International Conference on Software Architecture - Research Track

点击查看摘要

Abstract:Architectural patterns are frequently found in various software artifacts. The wide variety of patterns and their implementations makes detection challenging with current tools, especially since they often only support detecting patterns in artifacts written in a single language. Large Language Models (LLMs), trained on a diverse range of software artifacts and knowledge, might overcome the limitations of existing approaches. However, their true effectiveness and the factors influencing their performance have not yet been thoroughly examined. To better understand this, we developed MicroPAD. This tool utilizes GPT 5 nano to identify architectural patterns in software artifacts written in any language, based on natural-language pattern descriptions. We used MicroPAD to evaluate an LLM’s ability to detect instances of architectural patterns, particularly infrastructure-related microservice patterns. To accomplish this, we selected a set of GitHub repositories and contacted their top contributors to create a new, human-annotated dataset of 190 repositories containing microservice architectural patterns. The results show that MicroPAD was capable of detecting pattern instances across multiple languages and artifact types. The detection performance varied across patterns (F1 scores ranging from 0.09 to 0.70), specifically in relation to their prevalence and the distinctiveness of the artifacts through which they manifest. We also found that patterns associated with recognizable, dominant artifacts were detected more reliably. Whether these findings generalize to other LLMs and tools is a promising direction for future research.

[AI-27] Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution

【速读】:该论文旨在解决主流 Claw 个人人工智能代理中存在的安全漏洞问题,即在心跳驱动的后台执行过程中,代理可能无意中摄入来自外部源(如邮件、消息通道、新闻流等)的不可信内容,并将其污染至代理的记忆模块,进而无声地影响用户交互行为,而用户对此毫无察觉。其解决方案的关键在于揭示并验证了“暴露(Exposure, E)→记忆(Memory, M)→行为(Behavior, B)”这一路径:不可信信息通过心跳机制进入代理的短期会话上下文,可能被写入长期记忆,并最终塑造下游用户面向的行为输出;研究通过构建代理原生社交环境 MissClaw 验证了该机制的有效性,发现社会可信度线索(尤其是感知共识)是短期行为影响的主要驱动力,且常规的记忆保存机制可将短期污染转化为持久的长期记忆,跨会话影响最高达 76%。这表明无需传统提示注入(prompt injection),仅普通社交媒体误导信息即可实现对代理记忆与行为的隐蔽操控。

链接: https://arxiv.org/abs/2603.23064
作者: Yechao Zhang,Shiqian Zhao,Jie Zhang,Gelei Deng,Jiawen Zhang,Xiaogeng Liu,Chaowei Xiao,Tianwei Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 26 pages, 6 figures, 7 tables; The vulnerability of Claw’s heartbeat mechanism

点击查看摘要

Abstract:We identify a critical security vulnerability in mainstream Claw personal AI agents: untrusted content encountered during heartbeat-driven background execution can silently pollute agent memory and subsequently influence user-facing behavior without the user’s awareness. This vulnerability arises from an architectural design shared across the Claw ecosystem: heartbeat background execution runs in the same session as user-facing conversation, so content ingested from any external source monitored in the background (including email, message channels, news feeds, code repositories, and social platforms) can enter the same memory context used for foreground interaction, often with limited user visibility and without clear source provenance. We formalize this process as an Exposure (E) \rightarrow Memory (M) \rightarrow Behavior (B) pathway: misinformation encountered during heartbeat execution enters the agent’s short-term session context, potentially gets written into long-term memory, and later shapes downstream user-facing behavior. We instantiate this pathway in an agent-native social setting using MissClaw, a controlled research replica of Moltbook. We find that (1) social credibility cues, especially perceived consensus, are the dominant driver of short-term behavioral influence, with misleading rates up to 61%; (2) routine memory-saving behavior can promote short-term pollution into durable long-term memory at rates up to 91%, with cross-session behavioral influence reaching 76%; (3) under naturalistic browsing with content dilution and context pruning, pollution still crosses session boundaries. Overall, prompt injection is not required: ordinary social misinformation is sufficient to silently shape agent memory and behavior under heartbeat-driven background execution.

[AI-28] Machine Learning Models for the Early Detection of Burnout in Software Engineering: a Systematic Literature Review

【速读】:该论文旨在解决软件工程师和IT专业人员中职业倦怠(burnout)的早期识别问题,其核心挑战在于如何利用机器学习(Machine Learning, ML)技术提高检测精度与可靠性。解决方案的关键在于系统性地回顾现有基于ML的研究方法,评估其在情绪识别方面的性能表现,并进一步分析不同数据集在捕捉情绪特征上的潜力,从而为未来研究提供可复现或扩展的技术路径建议。研究表明,多数现有方法聚焦于通过情绪维度间接预测倦怠状态,因此提升情绪识别的准确性成为优化整体倦怠检测效能的核心突破口。

链接: https://arxiv.org/abs/2603.23063
作者: Tien Rahayu Tulili,Ayushi Rastogi,Andrea Capiluppi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: This paper is under review

点击查看摘要

Abstract:Burnout is an occupational syndrome that, like many other professions, affects the majority of software engineers. Past research studies showed important trends, including an increasing use of machine learning techniques to allow for an early detection of burnout. This paper is a systematic literature review (SLR) of the research papers that proposed machine learning (ML) approaches, and focused on detecting burnout in software developers and IT professionals. Our objective is to review the accuracy and precision of the proposed ML techniques, and to formulate recommendations for future researchers interested to replicate or extend those studies. From our SLR we observed that a majority of primary studies focuses on detecting emotions or utilise emotional dimensions to detect or predict the presence of burnout. We also performed a cross-sectional study to detect which ML approach shows a better performance at detecting emotions; and which dataset has more potential and expressivity to capture emotions. We believe that, by identifying which ML tools and datasets show a better performance at detecting emotions, and indirectly at identifying burnout, our paper can be a valuable asset to progress in this important research direction. Comments: This paper is under review Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG) Cite as: arXiv:2603.23063 [cs.SE] (or arXiv:2603.23063v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2603.23063 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Tien Rahayu Tulili [view email] [v1] Tue, 24 Mar 2026 10:58:34 UTC (636 KB)

[AI-29] Minibal: Balanced Game-Playing Without Opponent Modeling

【速读】:该论文旨在解决当前游戏AI(如AlphaZero和Athénan)在人机交互中缺乏平衡性的问题,即这些AI通常表现出超人类水平,导致人类玩家难以获得挑战感或学习价值。为实现“平衡对弈”——即AI既不压倒对手也不轻易让步,从而提供兼具挑战性和教育意义的交互体验——作者提出了一种名为Minibal(Minimize Balance)的新方法,其核心是基于最小化博弈(Minimax)框架的变体,专门用于寻找平衡策略。关键创新在于对无界最小最大算法(Unbounded Minimax)进行多项改进,以识别出使胜负结果趋近于理想平衡状态的策略,实验表明其中一种变体在七种棋类游戏中均能稳定达成接近完美平衡的对弈效果。

链接: https://arxiv.org/abs/2603.23059
作者: Quentin Cohen-Solal,Tristan Cazenave
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in game AI, such as AlphaZero and Athénan, have achieved superhuman performance across a wide range of board games. While highly powerful, these agents are ill-suited for human-AI interaction, as they consistently overwhelm human players, offering little enjoyment and limited educational value. This paper addresses the problem of balanced play, in which an agent challenges its opponent without either dominating or conceding. We introduce Minibal (Minimize Balance), a variant of Minimax specifically designed for balanced play. Building on this concept, we propose several modifications of the Unbounded Minimax algorithm explicitly aimed at discovering balanced strategies. Experiments conducted across seven board games demonstrate that one variant consistently achieves the most balanced play, with average outcomes close to perfect balance. These results establish Minibal as a promising foundation for designing AI agents that are both challenging and engaging, suitable for both entertainment and serious games. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.23059 [cs.AI] (or arXiv:2603.23059v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.23059 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-30] DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement

【速读】:该论文旨在解决当前大量关键数据库系统缺乏充分文档化的问题,例如主键缺失、外键约束被移除以提升性能、列名使用晦涩缩写且无实体关系图(Entity-Relationship Diagram, ERD)等。解决方案的核心在于提出DBAutoDoc系统,其关键创新是将模式理解建模为一个迭代的、图结构问题,并通过结合统计数据分析与大语言模型(Large Language Model, LLM)的多轮精炼机制来实现自动化schema发现与文档生成。该系统利用依赖图传播语义修正,类似神经网络中的反向传播机制,在多轮迭代中逐步优化描述直至收敛,从而在多个基准数据库上达到96.1%的加权得分,显著优于仅依赖LLM的方案(F1提升23点),验证了其独立于LLM预训练知识的有效性。

链接: https://arxiv.org/abs/2603.23050
作者: Amith Nagarajan,Thomas Altman
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A tremendous number of critical database systems lack adequate documentation. Declared primary keys are absent, foreign key constraints have been dropped for performance, column names are cryptic abbreviations, and no entity-relationship diagrams exist. We present DBAutoDoc, a system that automates the discovery and documentation of undocumented relational database schemas by combining statistical data analysis with iterative large language model (LLM) refinement. DBAutoDoc’s central insight is that schema understanding is fundamentally an iterative, graph-structured problem. Drawing structural inspiration from backpropagation in neural networks, DBAutoDoc propagates semantic corrections through schema dependency graphs across multiple refinement iterations until descriptions converge. This propagation is discrete and semantic rather than mathematical, but the structural analogy is precise: early iterations produce rough descriptions akin to random initialization, and successive passes sharpen the global picture as context flows through the graph. The system makes four concrete contributions detailed in the paper. On a suite of benchmark databases, DBAutoDoc achieved overall weighted scores of 96.1% across two model families (Google’s Gemini and Anthropic’s Claude) using a composite metric. Ablation analysis demonstrates that the deterministic pipeline contributes a 23-point F1 improvement over LLM-only FK detection, confirming that the system’s contribution is substantial and independent of LLM pre-training knowledge. DBAutoDoc is released as open-source software with all evaluation configurations and prompt templates included for full reproducibility. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.23050 [cs.DB] (or arXiv:2603.23050v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2603.23050 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-31] MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates

【速读】:该论文旨在解决现有自监督学习(Self-supervised Learning, SSL)方法在处理多采样率语音数据时因时间分辨率不匹配而导致性能下降的问题。其关键解决方案是提出MSRHuBERT,一种支持多采样率适应的预训练方法:通过将原始HuBERT中的单采样率下采样卷积神经网络(CNN)替换为多采样率自适应下采样CNN,使不同采样率的波形在不进行重采样的前提下映射到统一的时间分辨率,从而实现统一的混合采样率预训练与微调。该设计保留了HuBERT的掩码预测目标和Transformer编码器结构,确保原有分析和改进可直接沿用。

链接: https://arxiv.org/abs/2603.23048
作者: Zikang Huang,Meng Ge,Tianrui Wang,Xuanchen Li,Xiaobao Wang,Longbiao Wang,Jianwu Dang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms from different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSRHuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSRHuBERT retains HuBERT’s mask-prediction objective and Transformer encoder, so existing analyses and improvements that were developed for HuBERT can apply directly.

[AI-32] Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts

【速读】:该论文旨在解决生成式 AI (Generative AI) 气候模拟器在非平稳气候变化下泛化能力不足的问题,特别是当未来气候状态超出历史训练数据分布(即“无类比”未来气候状态)时,其可靠性难以保障。解决方案的关键在于构建一个严格的仅基于历史数据(1850–2014)训练的评估框架,并采用两种互补策略进行分布外(Out-of-Distribution, OOD)性能测试:一是时间外推至近期气候(2015–2023),二是跨排放情景的强迫扰动。实验表明,尽管ClimaX基础模型在绝对误差上表现最优,但在极端外部强迫下相对性能波动显著,揭示了高容量模型对强迫轨迹仍敏感,强调了场景感知训练与严谨OOD评估协议对提升气候模拟器鲁棒性的必要性。

链接: https://arxiv.org/abs/2603.23043
作者: Maria Conchita Agana Navarro,Geng Li,Theo Wolf,Maria Perez-Ortiz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at Machine Learning Earth

点击查看摘要

Abstract:The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under “no-analog” future climate states, which we define here as regimes where external forcing drives the system into conditions outside the empirical range of the historical training data. A fundamental challenge in evaluating this reliability is data contamination; because many models are trained on simulations that already encompass future scenarios, true out-of-distribution (OOD) performance is often masked. To address this, we benchmark the OOD robustness of three state-of-the-art architectures: U-Net, ConvLSTM, and the ClimaX foundation model specifically restricted to a historical-only training regime (1850-2014). We evaluate these models using two complementary strategies: (i) temporal extrapolation to the recent climate (2015-2023) and (ii) cross-scenario forcing shifts across divergent emission pathways. Our analysis within this experimental setup reveals an accuracy vs. stability trade-off: while the ClimaX foundation model achieves the lowest absolute error, it exhibits higher relative performance changes under distribution shifts, with precipitation errors increasing by up to 8.44% under extreme forcing scenarios. These findings suggest that when restricted to historical training dynamics, even high-capacity foundation models are sensitive to external forcing trajectories. Our results underscore the necessity of scenario-aware training and rigorous OOD evaluation protocols to ensure the robustness of climate emulators under a changing climate.

[AI-33] A Sobering Look at Tabular Data Generation via Probabilistic Circuits

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在表格数据生成任务中看似已达到性能瓶颈的误解问题,其核心挑战在于现有评估协议对生成数据保真度(fidelity)的衡量存在局限性,且主流扩散模型(diffusion-based models)虽在基准测试中表现接近完美,但可能未能真实反映生成数据的实用性。论文的关键解决方案是重新审视一种简单的基线方法——基于深度概率电路(deep probabilistic circuits, PCs)的分层混合模型,该方法作为决策森林(decision forests)的生成式对应物,可原生处理异构特征并实现高效的概率生成与推理;通过严谨的实证分析,作者证明当前SotA模型的性能“饱和”现象主要源于不当指标的使用,从而指出表格式数据生成仍存在显著改进空间。

链接: https://arxiv.org/abs/2603.23016
作者: Davide Scassola,Dylan Ponsford,Adrián Javaloy,Sebastiano Saccani,Luca Bortolussi,Henry Gouk,Antonio Vergari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline – hierarchical mixture models in the form of deep probabilistic circuits (PCs) – which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at this https URL.

[AI-34] Agent RAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents

【速读】:该论文旨在解决移动图形用户界面(GUI)代理在运行时易受后门攻击的问题,特别是针对基于截图的移动GUI代理,传统依赖环境注入或欺骗弹窗的攻击手段失效,因受限于触发器设计空间狭窄、操作系统背景干扰及多触发-动作映射冲突。解决方案的关键在于提出AgentRAE,一种利用视觉自然触发器(如通知栏中的良性应用图标)诱导远程动作执行的新颖后门攻击方法;其核心创新是设计了一个两阶段管道:首先通过对比学习增强代理对细微图标差异的敏感性以缓解自然触发器导致的欠拟合问题,随后通过后门微调将每个触发器与特定移动GUI代理动作关联,从而实现精准多目标动作重定向,在保持干净性能的同时实现超过90%的攻击成功率,并有效规避八种主流防御机制。

链接: https://arxiv.org/abs/2603.23007
作者: Yutao Luo,Haotian Zhu,Shuchao Pang,Zhigang Lu,Tian Dong,Yongbin Zhou,Minhui Xue
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid adoption of mobile graphical user interface (GUI) agents, which autonomously control applications and operating systems (OS), exposes new system-level attack surfaces. Existing backdoors against web GUI agents and general GenAI models rely on environmental injection or deceptive pop-ups to mislead the agent operation. However, these techniques do not work on screenshots-based mobile GUI agents due to the challenges of restricted trigger design spaces, OS background interference, and conflicts in multiple trigger-action mappings. We propose AgentRAE, a novel backdoor attack capable of inducing Remote Action Execution in mobile GUI agents using visually natural triggers (e.g., benign app icons in notifications). To address the underfitting caused by natural triggers and achieve accurate multi-target action redirection, we design a novel two-stage pipeline that first enhances the agent’s sensitivity to subtle iconographic differences via contrastive learning, and then associates each trigger with a specific mobile GUI agent action through a backdoor post-training. Our extensive evaluation reveals that the proposed backdoor preserves clean performance with an attack success rate of over 90% across ten mobile operations. Furthermore, it is hard to visibly detect the benign-looking triggers and circumvents eight representative state-of-the-art defenses. These results expose an overlooked backdoor vector in mobile GUI agents, underscoring the need for defenses that scrutinize notification-conditioned behaviors and internal agent representations.

[AI-35] Can Large Language Models Reason and Optimize Under Constraints?

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理具有物理和运行约束的抽象优化问题(如最优潮流 Optimal Power Flow, OPF)时能力不足的问题。其关键解决方案在于构建了一个具有挑战性的评估框架,该框架要求模型具备推理、结构化输入处理、算术运算以及受约束优化等核心技能,从而系统性地测试LLMs在真实电力系统优化场景下的表现。实验表明,当前最先进(SoTA)的LLMs在多数任务中表现不佳,尤其在复杂约束条件下仍无法有效推理与优化,揭示了LLMs在结构化约束推理方面的显著能力缺口。

链接: https://arxiv.org/abs/2603.23004
作者: Fabien Bernier,Salah Ghamizi,Pantelis Dogoulis,Maxime Cordy
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated great capabilities across diverse natural language tasks; yet their ability to solve abstraction and optimization problems with constraints remains scarcely explored. In this paper, we investigate whether LLMs can reason and optimize under the physical and operational constraints of Optimal Power Flow (OPF) problem. We introduce a challenging evaluation setup that requires a set of fundamental skills such as reasoning, structured input handling, arithmetic, and constrained optimization. Our evaluation reveals that SoTA LLMs fail in most of the tasks, and that reasoning LLMs still fail in the most complex settings. Our findings highlight critical gaps in LLMs’ ability to handle structured reasoning under constraints, and this work provides a rigorous testing environment for developing more capable LLM assistants that can tackle real-world power grid optimization problems.

[AI-36] On the use of Aggregation Operators to improve Human Identification using Dental Records

【速读】:该论文旨在解决法医牙科学中牙齿记录自动比对方法的局限性问题,即现有自动方法要么未能充分利用比对信息(如简单技术),要么因缺乏同行评审而难以解释其内部机制(黑箱模型)。解决方案的关键在于设计可解释且可验证的聚合机制,以提升比对结果的准确性与可信度。作者引入三种聚合策略:基于数据驱动的字典序聚合、经典的模糊逻辑聚合方法以及白盒机器学习技术作为聚合模型。实验表明,使用白盒机器学习模型进行聚合时,平均排名从3.91显著改善至2.02–2.21,同时保持了方法的可解释性和可验证性,从而实现了性能与透明度的兼顾。

链接: https://arxiv.org/abs/2603.23003
作者: Antonio D. Villegas-Yeguas,Guillermo R-García,Tzipi Kahana,Jorge Pinares Toledo,Esi Sharon,Oscar Ibañez,Oscar Cordón
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The comparison of dental records is a standardized technique in forensic dentistry used to speed up the identification of individuals in multiple-comparison scenarios. Specifically, the odontogram comparison is a procedure to compute criteria that will be used to perform a ranking. State-of-the-art automatic methods either make use of simple techniques, without utilizing the full potential of the information obtained from a comparison, or their internal behavior is not known due to the lack of peer-reviewed publications. This work aims to design aggregation mechanisms to automatically compare pairs of dental records that can be understood and validated by experts, improving the current methods. To do so, we introduce different aggregation approaches using the state-of-the-art codification, based on seven different criteria. In particular, we study the performance of i) data-driven lexicographical order-based aggregations, ii) well-known fuzzy logic aggregation methods and iii) machine learning techniques as aggregation mechanisms. To validate our proposals, 215 forensic cases from two different populations have been used. The results obtained show how the use of white-box machine learning techniques as aggregation models (average ranking from 2.02 to 2.21) are able to improve the state-of-the-art (average ranking of 3.91) without compromising the explainability and interpretability of the method.

[AI-37] Can Graph Foundation Models Generalize Over Architecture? ICLR2026

【速读】:该论文试图解决当前图基础模型(Graph Foundation Models, GFMs)在零样本泛化能力上的局限性问题,即现有方法依赖固定架构的图神经网络(GNN),无法适应不同任务对消息传递机制的差异化需求,导致在任务特征空间(如图结构范围)变化时性能显著下降。解决方案的关键在于引入一种推理时自适应架构的框架,通过发现并混合任务特定的线性图算子(linear graph operators),实现无需重新训练即可在异构架构要求的任务间进行零样本迁移,从而提升模型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.22984
作者: Benjamin Gutteridge,Michael Bronstein,Xiaowen Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 9 pages main text + 18 pages references and appendix (27 pages total), 5 figures. Accepted to GRaM Workshop @ ICLR 2026: Workshop on Geometry-grounded Representation Learning and Generative Modeling (to appear in PMLR)

点击查看摘要

Abstract:Graph foundation models (GFMs) have recently attracted interest due to the promise of graph neural network (GNN) architectures that generalize zero-shot across graphs of arbitrary scales, feature dimensions, and domains. While existing work has demonstrated this ability empirically across diverse real-world benchmarks, these tasks share a crucial hidden limitation: they admit a narrow set of effective GNN architectures. In particular, current domain-agnostic GFMs rely on fixed architectural backbones, implicitly assuming that a single message-passing regime suffices across tasks. In this paper, we argue that architecture adaptivity is a necessary requirement for true GFMs. We show that existing approaches are non-robust to task-dependent architectural attributes and, as a case study, use range as a minimal and measurable axis along which this limitation becomes explicit. With theoretical analysis and controlled synthetic experiments, we demonstrate that fixed-backbone GFMs provably under-reach on tasks whose architectural requirements differ from those seen at training time. To address this issue, we introduce a framework that adapts effective GNN architecture at inference time by discovering and mixing task-specific linear graph operators, enabling zero-shot generalization across tasks with heterogeneous architectural requirements, without retraining. We validate our approach on arbitrary-range synthetic tasks and a suite of real-world benchmarks, demonstrating improved performance and robustness over existing domain-agnostic GFMs.

[AI-38] JFTA-Bench: Evaluate LLM s Ability of Tracking and Analyzing Malfunctions Using Fault Trees

【速读】:该论文旨在解决如何将存储为图像的故障树(Fault Tree)直接输入大语言模型(Large Language Model, LLM)进行处理,从而辅助故障定位与分析的问题。其关键解决方案是提出了一种新的故障树文本表示方法,并基于此构建了一个包含3130个条目、平均每条40.75轮对话的多轮对话评估基准,用于衡量模型在复杂环境中的交互鲁棒性;同时引入端到端训练机制以生成模糊信息模拟用户行为,并设计长程回滚与恢复机制来模拟用户错误场景,从而全面评估模型的任务追踪与容错能力。

链接: https://arxiv.org/abs/2603.22978
作者: Yuhui Wang,Zhixiong Yang,Ming Zhang,Shihan Dou,Zhiheng Xi,Enyu Zhou,Senjie Jin,Yujiong Shen,Dingwei Zhu,Yi Dong,Tao Gui,Qi Zhang,Xuanjing Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be directly processed by large language models, which can assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on it, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments, evaluating a model’s ability to assist in malfunction localization, which contains 3130 entries and 40.75 turns per entry on average. We train an end-to-end model to generate vague information to reflect user behavior and introduce long-range rollback and recovery procedures to simulate user error scenarios, enabling assessment of a model’s integrated capabilities in task tracking and error recovery, and Gemini 2.5 pro archives the best performance.

[AI-39] Where Experts Disagree Models Fail: Detecting Implicit Legal Citations in French Court Decisions

【速读】:该论文旨在解决如何在司法判决中识别法院对成文法(如法国《民法典》)的隐性引用问题,即区分法律推理与语义相似性。其核心挑战在于,传统自然语言处理方法难以准确捕捉法律文本中的隐含引用关系,尤其是当判决内容仅通过事实描述间接体现法条适用时。解决方案的关键在于构建一个由三位法律专家标注的1,015个段落-法条配对基准数据集,并利用监督式集成模型(F1 = 0.70)进行分类;进一步地,通过将任务重构为top-k排序并引入多模型共识机制,在无监督场景下实现k=200时76%的精确率,有效缓解了人工标注分歧带来的误差影响,同时发现剩余假阳性主要源于法律解释的模糊性而非明显错误。

链接: https://arxiv.org/abs/2603.22973
作者: Avrile Floro,Tamara Dhorasoo(UPHF),Soline Pellez(UPHF),Nils Holzenberger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computational methods applied to legal scholarship hold the promise of analyzing law at scale. We start from a simple question: how often do courts implicitly apply statutory rules? This requires distinguishing legal reasoning from semantic similarity. We focus on implicit citation of the French Civil Code in first-instance court decisions and introduce a benchmark of 1,015 passage-article pairs annotated by three legal experts. We show that expert disagreement predicts model failures. Inter-annotator agreement is moderate ( \kappa = 0.33) with 43% of disagreements involving the boundary between factual description and legal reasoning. Our supervised ensemble achieves F1 = 0.70 (77% accuracy), but this figure conceals an asymmetry: 68% of false positives fall on the 33% of cases where the annotators disagreed. Despite these limits, reframing the task as top-k ranking and leveraging multi-model consensus yields 76% precision at k = 200 in an unsupervised setting. Moreover, the remaining false positives tend to surface legally ambiguous applications rather than obvious errors.

[AI-40] PersonalQ: Select Quantize and Serve Personalized Diffusion Models for Efficient Inference ICME2026

【速读】:该论文旨在解决个性化文本到图像生成中个性化检查点(checkpoint)高效服务的问题,具体挑战包括:自然语言请求的歧义性导致查询可能被错误路由至视觉相似但概念不匹配的检查点,以及标准后训练量化(Post-Training Quantization, PTQ)会破坏编码个性化概念的脆弱表示。解决方案的关键在于提出一个统一框架 PersonalQ,其核心创新是通过共享信号——检查点的触发词(trigger token)连接检查点选择与量化过程:首先,Check-in 模块基于意图对齐的选择策略,结合意图感知混合检索与大语言模型(LLM)重排序,并在多意图场景下仅提出简短澄清问题;随后,将选定检查点的规范触发词插入提示中以增强语义对齐;其次,Trigger-Aware Quantization (TAQ) 在交叉注意力机制中采用触发词感知的混合精度量化,在保留触发条件下的 key/value 行及其注意力权重的同时,对其他路径进行激进量化,从而实现内存高效的推理。此方法显著提升了意图对齐准确率,并在压缩率与质量之间优于现有扩散模型 PTQ 方法。

链接: https://arxiv.org/abs/2603.22943
作者: Qirui Wang,Qi Guo,Yiding Sun,Junkai Yang,Dongxu Zhang,Shanmin Pang,Qing Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted in ICME 2026

点击查看摘要

Abstract:Personalized text-to-image generation lets users fine-tune diffusion models into repositories of concept-specific checkpoints, but serving these repositories efficiently is difficult for two reasons: natural-language requests are often ambiguous and can be misrouted to visually similar checkpoints, and standard post-training quantization can distort the fragile representations that encode personalized concepts. We present PersonalQ, a unified framework that connects checkpoint selection and quantization through a shared signal – the checkpoint’s trigger token. Check-in performs intent-aligned selection by combining intent-aware hybrid retrieval with LLM-based reranking over checkpoint context and asks a brief clarification question only when multiple intents remain plausible; it then rewrites the prompt by inserting the selected checkpoint’s canonical trigger. Complementing this, Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned key/value rows (and their attention weights) while aggressively quantizing the remaining pathways for memory-efficient inference. Experiments show that PersonalQ improves intent alignment over retrieval and reranking baselines, while TAQ consistently offers a stronger compression-quality trade-off than prior diffusion PTQ methods, enabling scalable serving of personalized checkpoints without sacrificing fidelity.

[AI-41] Optimizing Small Language Models for NL2SQL via Chain-of-Thought Fine-Tuning

【速读】:该论文旨在解决自然语言到SQL(NL2SQL)翻译在企业数据民主化过程中的瓶颈问题,尤其关注大型语言模型(Large Language Models, LLMs)高推理成本限制其大规模部署的挑战。解决方案的关键在于通过微调(fine-tuning)策略实现计算效率与性能之间的平衡:研究发现,对大型模型进行微调在标准数据集上收益有限且易过拟合复杂查询,而小型模型(如Qwen)经微调后显著提升性能(从36%提升至45%),进一步结合显式链式思维(Chain-of-Thought, CoT)增强的数据集训练,准确率可达到54.5%,逼近生产级表现,同时大幅降低推理延迟和计算成本,证明了推理模式迁移能有效赋能小模型实现高效、实用的NL2SQL能力。

链接: https://arxiv.org/abs/2603.22942
作者: Anshul Solanki,Sanchit Latawa,Koushik Chakraborty,Navneet Kamboj
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages , 3 fifures

点击查看摘要

Abstract:Translating Natural Language to SQL (NL2SQL) remains a critical bottleneck for democratization of data in enterprises. Although Large Language Models (LLMs) like Gemini 2.5 and other LLMs have demonstrated impressive zero-shot capabilities, their high inference costs limit deployment at scale. This paper explores the efficacy of fine-tuning both large and small language models on NL2SQL tasks. Our research reveals a counter-intuitive scaling phenomenon. Fine-tuning large models (Gemini 2.5 Flash/Lite) on standard datasets yields negligible returns, often leading to overfitting on complex queries. Conversely, small models (Qwen) show significant gains. Fine-tuning improved the small model baseline from 36% to 45%, and further enriching the dataset with explicit Chain-of-Thought (CoT) reasoning surged accuracy to 54.5%(Fig 2). While this is still lower than the accuracy of large models like Gemini 2.5 , it does serve the business goal of significant cost reduction, latency in inference time and also meeting the business critical performance accuracy this http URL paper demonstrates that transferring reasoning patterns enables compute-efficient smaller models to approach production-grade performance. Comments: 9 pages , 3 fifures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22942 [cs.AI] (or arXiv:2603.22942v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.22942 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-42] ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因语料库中毒(corpus poisoning)所带来的安全风险问题。具体而言,攻击者通过篡改或注入特定文档片段,使其在目标查询下被检索为Top-K结果,从而误导下游生成内容。现有防御方法通常依赖内容过滤、辅助模型或生成器端推理,部署复杂度高且难以适配实际场景。论文提出的ProGRank是一种无需训练、基于检索器侧的后处理防御机制,其核心创新在于:通过对查询-文档对施加轻微随机扰动,从检索器中固定参数子集提取探针梯度(probe gradients),进而构建两个不稳定性信号——表示一致性(representational consistency)与分散风险(dispersion risk),并引入评分门控机制(score gate)进行重排序。该方法不改变原始文档内容、无需重新训练,且支持在无法访问部署检索器时使用代理版本,实验证明其在多种数据集、检索器架构和攻击类型下均展现出更强的鲁棒性与实用性平衡。

链接: https://arxiv.org/abs/2603.22934
作者: Xiangyu Yin,Yi Qi,Chih-hong Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves the reliability of large language model applications by grounding generation in retrieved evidence, but it also introduces a new attack surface: corpus poisoning. In this setting, an adversary injects or edits passages so that they are ranked into the Top- K results for target queries and then affect downstream generation. Existing defences against corpus poisoning often rely on content filtering, auxiliary models, or generator-side reasoning, which can make deployment more difficult. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query–passage pair under mild randomized perturbations and extracts probe gradients from a small fixed parameter subset of the retriever. From these signals, it derives two instability signals, representational consistency and dispersion risk, and combines them with a score gate in a reranking step. ProGRank preserves the original passage content, requires no retraining, and also supports a surrogate-based variant when the deployed retriever is unavailable. Extensive experiments across three datasets, three dense retriever backbones, representative corpus poisoning attacks, and both retrieval-stage and end-to-end settings show that ProGRank provides stronger defence performance and a favorable robustness–utility trade-off. It also remains competitive under adaptive evasive attacks.

[AI-43] he EU AI Act and the Rights-based Approach to Technological Governance

【速读】:该论文试图解决的问题是如何在人工智能(Artificial Intelligence, AI)治理中有效嵌入和保障基本权利,特别是在欧盟的数字监管架构下实现以人为本的AI发展路径。其解决方案的关键在于通过风险分级治理框架,将《欧盟人工智能法案》(EU AI Act)中的权利保护条款转化为法律阈值和程序触发机制,从而确保基本权利不仅作为理想目标,更成为贯穿AI系统全生命周期的强制性规范依据。这一制度设计使AI治理具备可操作性和法律约束力,为全球范围内构建以权利为中心的AI系统提供参考范式。

链接: https://arxiv.org/abs/2603.22920
作者: Georgios Pavlidis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The EU AI Act constitutes an important development in shaping the Union’s digital regulatory architecture. The Act places fundamental rights at the heart of a risk-based governance framework. The article examines how the AI Act institutionalises a human-centric approach to AI and how the AI Act’s provisions explicitly and implicitly embed the protection of rights enshrined in the EU Charter of Fundamental Rights. It argues that fundamental rights function not merely as aspirational goals, but as legal thresholds and procedural triggers across the lifecycle of an AI system. The analysis suggests that the AI Act has the potential to serve as a model for rights-preserving AI systems, while acknowledging that challenges will emerge at the level of implementation.

[AI-44] From the AI Act to a European AI Agency: Completing the Unions Regulatory Architecture

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)快速发展背景下,如何通过更有效的风险评估、监管与治理机制,确保AI技术发展符合伦理原则的同时,兼顾创新与经济竞争力的问题。其解决方案的关键在于推动建立一个更为强有力的欧盟层面的超国家机构——即强化后的欧洲人工智能办公室(European AI Office),以提升政策一致性、增强风险评估能力,并促进国际合作,从而助力欧盟实现数字和技术主权的战略目标。

链接: https://arxiv.org/abs/2603.22912
作者: Georgios Pavlidis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As artificial intelligence (AI) technologies continue to advance, effective risk assessment, regulation, and oversight are necessary to ensure that AI development and deployment align with ethical principles while preserving innovation and economic competitiveness. The adoption of the EU AI Act marks an important step in this direction, establishing a harmonised legal framework that includes detailed provisions on AI governance, as well as the creation of the European AI Office. This paper revisits the question of whether a more robust supranational agency dedicated to AI is still warranted and explores how such a body could enhance policy coherence, improve risk assessment capacities, and foster international cooperation. It also argues that a strengthened EU-level agency would also serve the Union’s strategic aim of securing digital and technological sovereignty.

[AI-45] Separating Diagnosis from Control: Auditable Policy Adaptation in Agent -Based Simulations with LLM -Based Diagnostics AAMAS2026

【速读】:该论文旨在解决老年人孤独感缓解政策制定中适应性(adaptability)与可审计性(auditability)难以兼得的问题。现有方法中,传统基于代理的模型因结构僵化而缺乏适应性,而直接使用大语言模型(Large Language Model, LLM)作为控制器则因黑箱特性丧失了必要的可追溯性。解决方案的关键在于提出一个三层框架,通过将诊断与控制分离:LLM仅作为诊断工具,用于评估群体状态并生成结构化的风险评估,而确定性公式则依据明确边界将诊断结果转化为可审计的参数更新机制。这种分离确保每项政策决策均可归因于可检查的规则,同时保持对突发需求的自适应响应能力。

链接: https://arxiv.org/abs/2603.22904
作者: Shaoxin Zhong,Yuchen Su,Michael Witbrock
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted at AAMAS 2026 Workshop MABS

点击查看摘要

Abstract:Mitigating elderly loneliness requires policy interventions that achieve both adaptability and auditability. Existing methods struggle to reconcile these objectives: traditional agent-based models suffer from static rigidity, while direct large language model (LLM) controllers lack essential traceability. This work proposes a three-layer framework that separates diagnosis from control to achieve both properties simultaneously. LLMs operate strictly as diagnostic instruments that assess population state and generate structured risk evaluations, while deterministic formulas with explicit bounds translate these assessments into traceable parameter updates. This separation ensures that every policy decision can be attributed to inspectable rules while maintaining adaptive response to emergent needs. We validate the framework through systematic ablation across five experimental conditions in elderly care simulation. Results demonstrate that explicit control rules outperform end-to-end black-box LLM approaches by 11.7% while preserving full auditability, confirming that transparency need not compromise adaptive performance.

[AI-46] Confidence Calibration under Ambiguous Ground Truth

【速读】:该论文旨在解决现有后验校准方法(如Temperature Scaling)在多标注者场景下失效的问题,即传统方法假设每个输入仅存在唯一真值标签,而实际中标注者之间存在分歧(annotation disagreement),导致基于多数投票标签训练的校准器虽在常规评估中表现良好,却仍严重偏离真实标注者分布。其关键解决方案是提出一系列不确定性感知的后验校准方法,通过优化针对完整标签分布的严格评分规则(proper scoring rules),无需重新训练模型即可提升校准精度。其中,Dirichlet-Soft利用完整的标注者分布实现最优校准质量;MCTS S=1仅需单个标注即可达到与全分布相当的效果,证明预聚合标签分布非必需;LS-TS则仅依赖投票标签,通过模型自身置信度构建数据驱动的伪软目标,显著降低对标注数据的依赖。实验表明,这些方法在多个基准上均大幅降低真实标签下的期望校准误差(ECE)。

链接: https://arxiv.org/abs/2603.22879
作者: Linwei Tao,Haoyang Luo,Minjing Dong,Chang Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels, the standard single-label targets used in practice, can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that underestimate annotator uncertainty, with true-label miscalibration increasing monotonically with annotation entropy. To address this, we develop a family of ambiguity-aware post-hoc calibrators that optimise proper scoring rules against the full label distribution and require no model retraining. Our methods span progressively weaker annotation requirements: Dirichlet-Soft leverages the full annotator distribution and achieves the best overall calibration quality across settings; Monte Carlo Temperature Scaling with a single annotation per example (MCTS S=1) matches full-distribution calibration across all benchmarks, demonstrating that pre-aggregated label distributions are unnecessary; and Label-Smooth Temperature Scaling (LS-TS) operates with voted labels alone by constructing data-driven pseudo-soft targets from the model’s own confidence. Experiments on four benchmarks with real multi-annotator distributions (CIFAR-10H, ChaosNLI) and clinically-informed synthetic annotations (ISIC~2019, DermaMNIST) show that Dirichlet-Soft reduces true-label ECE by 55-87% relative to Temperature Scaling, while LS-TS reduces ECE by 9-77% without any annotator data.

[AI-47] Continuous Optimization for Satisfiability Modulo Theories on Linear Real Arithmetic

【速读】:该论文旨在解决传统Satisfiability Modulo Theories (SMT)求解器在工业应用中(如硬件验证和设计自动化)因基于冲突驱动的子句学习(Conflict-Driven Clause Learning, CDCL)结构难以并行化而导致的扩展性差的问题。其解决方案的关键在于提出FourierSMT框架,通过将Walsh-Fourier展开(Walsh-Fourier Expansion, WFE)推广至混合布尔-实数域(extended WFE, xWFE),使得可以利用梯度方法对SMT问题进行连续变量优化;同时引入扩展二叉决策图(extended Binary Decision Diagram, xBDD)以降低xWFE的计算复杂度,并证明在随机舍入下xBDD的电路输出概率(Circuit-Output Probability, COP)等价于xWFE的期望值,从而实现高效约束评估与收敛性保障,确保解的正确性。该方法在大规模调度和布局问题上实现了高达8倍的速度提升。

链接: https://arxiv.org/abs/2603.22877
作者: Yunuo Cen,Daniel Ebler,Xuanyao Fong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient solutions for satisfiability modulo theories (SMT) are integral in industrial applications such as hardware verification and design automation. Existing approaches are predominantly based on conflict-driven clause learning, which is structurally difficult to parallelize and therefore scales poorly. In this work, we introduce FourierSMT as a scalable and highly parallelizable continuous-variable optimization framework for SMT. We generalize the Walsh-Fourier expansion (WFE), called extended WFE (xWFE), from the Boolean domain to a mixed Boolean-real domain, which allows the use of gradient methods for SMT. This addresses the challenge of finding satisfying variable assignments to high-arity constraints by local updates of discrete variables. To reduce the evaluation complexity of xWFE, we present the extended binary decision diagram (xBDD) and map the constraints from xWFE to xBDDs. We then show that sampling the circuit-output probability (COP) of xBDDs under randomized rounding is equivalent to the expectation value of the xWFEs. This allows for efficient computation of the constraints. We show that the reduced problem is guaranteed to converge and preserves satisfiability, ensuring the soundness of the solutions. The framework is benchmarked for large-scale scheduling and placement problems with up to 10,000 variables and 700,000 constraints, achieving 8-fold speedups compared to state-of-the-art SMT solvers. These results pave the way for GPU-based optimization of SMTs with continuous systems.

[AI-48] Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

【速读】:该论文旨在解决生成式AI(Generative AI)在灵巧操作任务中从仿真(Sim-to-Real)迁移时性能下降的问题,即仿真数据与真实世界分布之间的差距导致策略泛化能力受限。其解决方案的关键在于系统性地评估和优化四个核心维度:多层次域随机化(multi-level domain randomization)、逼真渲染(photorealistic rendering)、物理建模真实性(physics-realistic modeling)以及强化学习更新机制,并设计了一套涵盖背景、光照、干扰物、物体类型和空间特征等关键变量的综合评估协议,通过超过10,000次真实世界试验验证了各因素对迁移效果的影响,从而为未来灵巧操作策略的仿真训练提供可复现、标准化的基准和实证依据。

链接: https://arxiv.org/abs/2603.22876
作者: Ruixing Jin,Zicheng Zhu,Ruixiang Ouyang,Sheng Xu,Bo Yue,Zhizheng Wu,Guiliang Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.

[AI-49] Dynamical Systems Theory Behind a Hierarchical Reasoning Model

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在复杂算法推理任务中表现不佳的问题,尤其是其依赖线性序列生成和海量参数所带来的效率低下与稳定性差的局限。现有方法如分层推理模型(Hierarchical Reasoning Model, HRM)和微型递归模型(Tiny Recursive Model, TRM)虽尝试通过紧凑的递归网络提升性能,但训练动态缺乏严格的数学保障,易导致表示崩溃和不稳定。论文提出收缩映射模型(Contraction Mapping Model, CMM),其核心创新在于将离散递归推理重构为连续的神经微分方程(Neural Ordinary Differential Equations, NODEs)与随机微分方程(Neural Stochastic Differential Equations, NSDEs),并通过显式约束潜在状态收敛至稳定平衡点、引入超球面排斥损失(hyperspherical repulsion loss)来抑制特征坍缩,从而实现数学上严格可证明的稳定推理机制。这一设计使CMM在Sudoku-Extreme等基准上以极低参数量(如0.26M)仍达到领先性能,验证了基于严谨动力学建模的推理路径可替代传统参数规模扩张策略。

链接: https://arxiv.org/abs/2603.22871
作者: Vasiliy A. Es’kin,Mikhail E. Smorkalov
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:Current large language models (LLMs) primarily rely on linear sequence generation and massive parameter counts, yet they severely struggle with complex algorithmic reasoning. While recent reasoning architectures, such as the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM), demonstrate that compact recursive networks can tackle these tasks, their training dynamics often lack rigorous mathematical guarantees, leading to instability and representational collapse. We propose the Contraction Mapping Model (CMM), a novel architecture that reformulates discrete recursive reasoning into continuous Neural Ordinary and Stochastic Differential Equations (NODEs/NSDEs). By explicitly enforcing the convergence of the latent phase point to a stable equilibrium state and mitigating feature collapse with a hyperspherical repulsion loss, the CMM provides a mathematically grounded and highly stable reasoning engine. On the Sudoku-Extreme benchmark, a 5M-parameter CMM achieves a state-of-the-art accuracy of 93.7 %, outperforming the 27M-parameter HRM (55.0 %) and 5M-parameter TRM (87.4 %). Remarkably, even when aggressively compressed to an ultra-tiny footprint of just 0.26M parameters, the CMM retains robust predictive power, achieving 85.4 % on Sudoku-Extreme and 82.2 % on the Maze benchmark. These results establish a new frontier for extreme parameter efficiency, proving that mathematically rigorous latent dynamics can effectively replace brute-force scaling in artificial reasoning.

[AI-50] Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中缺乏对知识所有权和访问边界的内在认知问题,从而导致敏感数据泄露与对抗性攻击风险增加。现有防护策略依赖静态、统一的防御机制,难以实现动态授权;结构隔离方法存在可扩展性瓶颈,而提示引导方法则难以区分细粒度权限。其解决方案的关键在于提出链式授权(Chain-of-Authorization, CoA)框架,通过将授权逻辑内化为LLM的核心能力,在训练和推理阶段重构信息流:在输入端嵌入权限上下文,并要求模型生成包含资源审查、身份解析与决策判断的显式授权推理轨迹,使授权成为实质性响应的因果前提。该机制利用LLM的自然语言理解能力实现动态授权,以主动安全机制保障现代AI系统中部署的可靠性。

链接: https://arxiv.org/abs/2603.22869
作者: Yang Li,Yule Liu,Xinlei He,Youjian Zhao,Qi Li,Ke Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 9 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have become core cognitive components in modern artificial intelligence (AI) systems, combining internal knowledge with external context to perform complex tasks. However, LLMs typically treat all accessible data indiscriminately, lacking inherent awareness of knowledge ownership and access boundaries. This deficiency heightens risks of sensitive data leakage and adversarial manipulation, potentially enabling unauthorized system access and severe security crises. Existing protection strategies rely on rigid, uniform defense that prevent dynamic authorization. Structural isolation methods faces scalability bottlenecks, while prompt guidance methods struggle with fine-grained permissions distinctions. Here, we propose the Chain-of-Authorization (CoA) framework, a secure training and reasoning paradigm that internalizes authorization logic into LLMs’ core capabilities. Unlike passive external defneses, CoA restructures the model’s information flow: it embeds permission context at input and requires generating explicit authorization reasoning trajectory that includes resource review, identity resolution, and decision-making stages before final response. Through supervised fine-tuning on data covering various authorization status, CoA integrates policy execution with task responses, making authorization a causal prerequisite for substantive responses. Extensive evaluations show that CoA not only maintains comparable utility in authorized scenarios but also overcomes the cognitive confusion when permissions mismatches. It exhibits high rejection rates against various unauthorized and adversarial access. This mechanism leverages LLMs’ reasoning capability to perform dynamic authorization, using natural language understanding as a proactive security mechanism for deploying reliable LLMs in modern AI systems.

[AI-51] Agent -Sentry: Bounding LLM Agents via Execution Provenance

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的代理计算系统(Agentic Computing Systems)在缺乏事先可预测行为边界的情况下,可能执行与用户意图不符甚至有害操作所带来的安全、隐私和可靠性问题。其解决方案的关键在于提出 Agent-Sentry 框架,通过识别系统频繁使用的功能及其执行轨迹来构建行为边界,并基于这些轨迹学习策略以拦截偏离已知行为或违背用户意图的工具调用,从而在保障系统可用性(最高保留98%功能效用)的同时,有效防御超过90%的越界攻击。

链接: https://arxiv.org/abs/2603.22868
作者: Rohan Sequeira,Stavros Damianakis,Umar Iqbal,Konstantinos Psounis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic computing systems, which autonomously spawn new functionalities based on natural language instructions, are becoming increasingly prevalent. While immensely capable, these systems raise serious security, privacy, and safety concerns. Fundamentally, the full set of functionalities offered by these systems, combined with their probabilistic execution flows, is not known beforehand. Given this lack of characterization, it is non-trivial to validate whether a system has successfully carried out the user’s intended task or instead executed irrelevant actions, potentially as a consequence of compromise. In this paper, we propose Agent-Sentry, a framework that attempts to bound agentic systems to address this problem. Our key insight is that agentic systems are designed for specific use cases and therefore need not expose unbounded or unspecified functionalities. Once bounded, these systems become easier to scrutinize. Agent-Sentry operationalizes this insight by uncovering frequent functionalities offered by an agentic system, along with their execution traces, to construct behavioral bounds. It then learns a policy from these traces and blocks tool calls that deviate from learned behaviors or that misalign with user intent. Our evaluation shows that Agent-Sentry helps prevent over 90% of attacks that attempt to trigger out-of-bounds executions, while preserving up to 98% of system utility.

[AI-52] he Coordinate System Problem in Persistent Structural Memory for Neural Architectures

【速读】:该论文旨在解决神经网络中持久结构记忆(persistent structural memory)的稳定性问题,即如何在不依赖于模型自身学习的坐标系统的情况下,实现跨任务、跨数据分布的稳定记忆保持与迁移。其核心挑战在于:若坐标系由模型联合学习,则存在内在不稳定性,导致记忆无法有效保留和迁移。解决方案的关键在于提出双视角信息素路径网络(Dual-View Pheromone Pathway Network, DPPN),通过引入固定随机傅里叶特征(fixed random Fourier features)作为外生坐标系(extrinsic coordinate system)以保障坐标稳定性,并结合路由偏置信息素机制(routing-bias pheromone)进行稀疏注意力引导;进一步发现仅靠坐标稳定性不足以支持知识迁移,需辅以可学习的结构补全函数(structure completion function)来实现“稳定与信息性”之间的权衡突破,从而满足两个独立要求:(a) 坐标稳定性,(b) 优雅的迁移机制(graceful transfer mechanism)。

链接: https://arxiv.org/abs/2603.22858
作者: Abhinaba Basu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:We introduce the Dual-View Pheromone Pathway Network (DPPN), an architecture that routes sparse attention through a persistent pheromone field over latent slot transitions, and use it to discover two independent requirements for persistent structural memory in neural networks. Through five progressively refined experiments using up to 10 seeds per condition across 5 model variants and 4 transfer targets, we identify a core principle: persistent memory requires a stable coordinate system, and any coordinate system learned jointly with the model is inherently unstable. We characterize three obstacles – pheromone saturation, surface-structure entanglement, and coordinate incompatibility – and show that neither contrastive updates, multi-source distillation, Hungarian alignment, nor semantic decomposition resolves the instability when embeddings are learned from scratch. Fixed random Fourier features provide extrinsic coordinates that are stable, structure-blind, and informative, but coordinate stability alone is insufficient: routing-bias pheromone does not transfer (10 seeds, p0.05). DPPN outperforms transformer and random sparse baselines for within-task learning (AULC 0.700 vs 0.680 vs 0.670). Replacing routing bias with learning-rate modulation eliminates negative transfer: warm pheromone as a learning-rate prior achieves +0.003 on same-family tasks (17 seeds, p0.05) while never reducing performance. A structure completion function over extrinsic coordinates produces +0.006 same-family bonus beyond regularization, showing the catch-22 between stability and informativeness is partially permeable to learned functions. The contribution is two independent requirements for persistent structural memory: (a) coordinate stability and (b) graceful transfer mechanism.

[AI-53] Agent Audit: A Security Analysis System for LLM Agent Applications

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理系统在部署前存在的安全风险识别问题,尤其关注模型权重之外的软件栈漏洞,如工具函数对不可信输入的不当处理、部署 artifact 中暴露的凭证以及过度授权的模型上下文协议(Model Context Protocol, MCP)配置。解决方案的关键在于提出 Agent Audit —— 一个面向 LLM 代理应用的安全分析系统,其核心是通过代理感知的数据流分析、凭证检测、结构化配置解析与权限-风险检查相结合的流水线,实现对 Python 代理代码和部署产物的自动化安全审计,从而在保持亚秒级扫描性能的同时显著提升漏洞召回率(40/42 已标注漏洞被检测),并支持多种输出格式以集成至开发流程与 CI/CD 管道。

链接: https://arxiv.org/abs/2603.22853
作者: Haiyue Zhang,Yi Nian,Yue Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:What should a developer inspect before deploying an LLM agent: the model, the tool code, the deployment configuration, or all three? In practice, many security failures in agent systems arise not from model weights alone, but from the surrounding software stack: tool functions that pass untrusted inputs to dangerous operations, exposed credentials in deployment artifacts, and over-privileged Model Context Protocol (MCP) configurations. We present Agent Audit, a security analysis system for LLM agent applications. Agent Audit analyzes Python agent code and deployment artifacts through an agent-aware pipeline that combines dataflow analysis, credential detection, structured configuration parsing, and privilege-risk checks. The system reports findings in terminal, JSON, and SARIF formats, enabling direct integration with local development workflows and CI/CD pipelines. On a benchmark of 22 samples with 42 annotated vulnerabilities, Agent Audit detects 40 vulnerabilities with 6 false positives, substantially improving recall over common SAST baselines while maintaining sub-second scan times. Agent Audit is open source and installable via pip, making security auditing accessible for agent systems. In the live demonstration, attendees scan vulnerable agent repositories and observe how Agent Audit identifies security risks in tool functions, prompts, and more. Findings are linked to source locations and configuration paths, and can be exported into VS Code and GitHub Code Scanning for interactive inspection. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22853 [cs.CR] (or arXiv:2603.22853v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.22853 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-54] CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

【速读】:该论文旨在解决当前具身视觉追踪(Embodied Visual Tracking, EVT)任务中依赖单智能体模仿学习所导致的专家数据成本高、泛化能力弱的问题。其解决方案的关键在于提出一种基于博弈论的多智能体强化学习框架 CoMaTrack,通过在动态对抗环境中训练追踪者与自适应对手之间的竞争子任务,提升智能体的适应性规划能力和抗干扰策略。该方法显著增强了模型在复杂、主动对抗场景下的鲁棒性,实验表明,使用该框架训练的3B视觉语言模型(VLM)在EVT-Bench上超越了基于7B模型的单智能体模仿学习方法,在STT、DT和AT指标上分别达到92.1%、74.2%和57.5%。

链接: https://arxiv.org/abs/2603.22846
作者: Youzhi Liu,Li Gao,Liu Liu,Mingyang Lv,Yang Cai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at this https URL

[AI-55] PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal CVPR

【速读】:该论文旨在解决手术烟雾(surgical smoke)严重降低术中视频质量的问题,该问题会遮蔽解剖结构并限制外科医生的视觉感知。现有基于学习的去烟方法依赖稀缺的成对监督信号和确定性恢复流程,在真实手术场景下难以进行探索或强化驱动的优化。其解决方案的关键在于提出 PhySe-RPO 框架——一种通过物理与语义引导的相对策略优化(Physics- and Semantics-Guided Relative Policy Optimization)进行优化的扩散恢复方法。该框架将确定性恢复转化为随机策略,实现轨迹级探索,并通过组内相对优化完成无需评判器(critic-free)的更新;同时引入物理引导奖励(保证光照与色彩一致性)和 CLIP 基于手术概念的语义奖励(促进无烟且解剖一致的恢复),结合无参考感知约束,从而在合成与真实机器人手术数据集上实现物理一致、语义忠实且临床可解释的去烟效果,为有限成对监督下的鲁棒扩散恢复提供了系统性路径。

链接: https://arxiv.org/abs/2603.22844
作者: Zining Fang,Cheng Xue,Chunhui Liu,Bin Xu,Ming Chen,Xiaowei Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages,7figures,published to CVPR

点击查看摘要

Abstract:Surgical smoke severely degrades intraoperative video quality, obscuring anatomical structures and limiting surgical perception. Existing learning-based desmoking approaches rely on scarce paired supervision and deterministic restoration pipelines, making it difficult to perform exploration or reinforcement-driven refinement under real surgical conditions. We propose PhySe-RPO, a diffusion restoration framework optimized through Physics- and Semantics-Guided Relative Policy Optimization. The core idea is to transform deterministic restoration into a stochastic policy, enabling trajectory-level exploration and critic-free updates via group-relative optimization. A physics-guided reward imposes illumination and color consistency, while a visual-concept semantic reward learned from CLIP-based surgical concepts promotes smoke-free and anatomically coherent restoration. Together with a reference-free perceptual constraint, PhySe-RPO produces results that are physically consistent, semantically faithful, and clinically interpretable across synthetic and real robotic surgical datasets, providing a principled route to robust diffusion-based restoration under limited paired supervision.

[AI-56] Improving Safety Alignment via Balanced Direct Preference Optimization

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)过程中存在的严重过拟合问题,该问题限制了模型在实际应用中的安全性能。其关键解决方案是提出平衡式直接偏好优化(Balanced Direct Preference Optimization, B-DPO),通过基于互信息(mutual information)自适应调节偏好对中优选与非优选响应之间的优化强度,缓解因偏好理解不平衡(Imbalanced Preference Comprehension)导致的安全性能下降,从而在保持主流基准测试上竞争力的同时显著提升模型安全性。

链接: https://arxiv.org/abs/2603.22829
作者: Shiji Zhao,Mengyang Wang,Shukun Xiong,Fangzhou Chen,Qihui Zhu,Shouwei Ruan,Yisong Xiao,Ranjie Duan,Xun Chen,XingXing Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model’s comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model’s safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results show that B-DPO can enhance the safety capability while maintaining the competitive general capabilities of LLMs on various mainstream benchmarks compared to state-of-the-art methods. \colorredWarning: This paper contains examples of harmful texts, and reader discretion is recommended.

[AI-57] Empirical Comparison of Agent Communication Protocols for Task Orchestration

【速读】:该论文试图解决当前人工智能代理系统从单工具交互向复杂多智能体编排演进过程中,缺乏对两种竞争性通信协议——工具集成协议(tool integration protocol)与智能体间委托协议(inter-agent delegation protocol)——进行系统性实证比较的问题。解决方案的关键在于构建首个标准化基准测试,涵盖三种架构:仅工具集成、多智能体委托以及混合架构,并在三个不同复杂度的查询任务上量化对比其在响应时间、上下文窗口消耗、成本、错误恢复能力和实现复杂度等方面的权衡。

链接: https://arxiv.org/abs/2603.22823
作者: Ivan Dobrovolskyi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context. Nowadays, artificial intelligence agent systems are transforming from single-tool interactions to complex multi-agent orchestrations. As a result, two competing communication protocols have emerged: a tool integration protocol that standardizes how agents invoke external tools, and an inter-agent delegation protocol that enables autonomous agents to discover and delegate tasks to one another. Despite widespread industry adoption by dozens of enterprise partners, no empirical comparison of these protocols exists in the literature. Objective. The goal of this work is to develop the first systematic benchmark comparing tool-integration-only, multi-agent delegation, and hybrid architectures across standardized queries at three complexity levels, and to quantify the trade-offs in response time, context window consumption, monetary cost, error recovery, and implementation complexity.

[AI-58] Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts ICLR2026

【速读】:该论文试图解决在序列决策问题中,当偏好权重(preference weights)为未观测的隐变量且随环境变化而漂移时,如何实现动态适应性决策的问题。传统多目标强化学习方法通常假设偏好权重固定或已知,难以应对实际场景中目标优先级随上下文变化的情况。解决方案的关键在于提出一种认知启发式的框架——动态偏好推断(Dynamic Preference Inference, DPI),其核心是让智能体维持对偏好权重的概率信念,通过近期交互数据更新该信念,并基于推断出的偏好条件化策略;具体实现上,DPI采用变分偏好推断模块与偏好条件化的演员-评论家网络联合训练,利用向量值回报(vector-valued returns)作为关于潜在权衡关系的证据,从而在排队、迷宫和多目标连续控制等环境中实现对新目标 regime 的快速适应并显著提升切换后的性能表现。

链接: https://arxiv.org/abs/2603.22813
作者: Xianwei Cao,Dou Quan,Zhenliang Zhang,Shuang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, ICLR 2026 poster paper

点击查看摘要

Abstract:Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study sequential decision-making problem when these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.

[AI-59] Reliable Classroom AI via Neuro-Symbolic Multimodal Reasoning

【速读】:该论文旨在解决当前教室人工智能(Classroom AI)系统在实际应用中面临的可解释性、可靠性与部署安全性不足的问题,尤其是在多主体、高噪声、隐私敏感且教学方式多样化的复杂教育场景下,单纯追求预测准确率已无法满足教育实践的需求。其解决方案的关键在于提出NSCR(Neuro-Symbolic Classroom Reasoning)框架,该框架通过将教室分析任务分解为感知基础、符号抽象、可执行推理和治理四个层次,融合符号规则与神经网络方法,实现从视频、音频、自动语音识别(ASR)及上下文元数据等多模态输入到结构化事实的转化,并借助可验证代码生成和策略约束机制,使AI决策具备可追溯证据、校准不确定性以及明确的部署边界,从而提升系统的透明度、鲁棒性和教育适配性。

链接: https://arxiv.org/abs/2603.22793
作者: Sina Bagheri Nezhad
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classroom AI is rapidly expanding from low-level perception toward higher-level judgments about engagement, confusion, collaboration, and instructional quality. Yet classrooms are among the hardest real-world settings for multimodal vision: they are multi-party, noisy, privacy-sensitive, pedagogically diverse, and often multilingual. In this paper, we argue that classroom AI should be treated as a critical domain, where raw predictive accuracy is insufficient unless predictions are accompanied by verifiable evidence, calibrated uncertainty, and explicit deployment guardrails. We introduce NSCR, a neuro-symbolic framework that decomposes classroom analytics into four layers: perceptual grounding, symbolic abstraction, executable reasoning, and governance. NSCR adapts recent ideas from symbolic fact extraction and verifiable code generation to multimodal educational settings, enabling classroom observations from video, audio, ASR, and contextual metadata to be converted into typed facts and then composed by executable rules, programs, and policy constraints. Beyond the system design, we contribute a benchmark and evaluation protocol organized around five tasks: classroom state inference, discourse-grounded event linking, temporal early warning, collaboration analysis, and multilingual classroom reasoning. We further specify reliability metrics centered on abstention, calibration, robustness, construct alignment, and human usefulness. The paper does not report new empirical results; its contribution is a concrete framework and evaluation agenda intended to support more interpretable, privacy-aware, and pedagogically grounded multimodal AI for classrooms.

[AI-60] ABSTRAL: Automatic Design of Multi-Agent Systems Through Iterative Refinement and Topology Optimization

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)的设计问题,即如何构建高效、可解释且可迁移的MAS架构,并将设计知识以可检查、可修订和可转移的形式进行编码。其核心挑战在于现有方法难以捕捉和复用高质量的设计经验,且缺乏对协作效率与角色分工的量化理解。解决方案的关键在于提出ABSTRAL框架,该框架将MAS架构视为一个通过对比轨迹分析(contrastive trace analysis)不断演化的自然语言文档,从而实现设计知识的显式表达与迭代优化;其中,三个关键发现支撑了该方案的有效性:一是量化了多智能体协调成本(coordination tax),揭示了任务并行分解带来的性能提升;二是证明了设计知识(如拓扑推理与角色模板)可在不同领域间迁移,显著加速新任务的收敛;三是通过对比分析自动识别出初始设计中缺失的专业角色,这是此前系统无法实现的能力。

链接: https://arxiv.org/abs/2603.22791
作者: Weijia Song,Jiashu Yue,Zhe Pang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How should multi-agent systems be designed, and can that design knowledge be captured in a form that is inspectable, revisable, and transferable? We introduce ABSTRAL, a framework that treats MAS architecture as an evolving natural-language document, an artifact refined through contrastive trace analysis. Three findings emerge. First, we provide a precise measurement of the multi-agent coordination tax: under fixed turn budgets, ensembles achieve only 26% turn efficiency, with 66% of tasks exhausting the limit, yet still improve over single-agent baselines by discovering parallelizable task decompositions. Second, design knowledge encoded in documents transfers: topology reasoning and role templates learned on one domain provide a head start on new domains, with transferred seeds matching coldstart iteration 3 performance in a single iteration. Third, contrastive trace analysis discovers specialist roles absent from any initial design, a capability no prior system demonstrates. On SOPBench (134 bank tasks, deterministic oracle), ABSTRAL reaches 70% validation / 65.96% test pass rate with a GPT-4o backbone. We release the converged documents as inspectable design rationale.

[AI-61] AgriPestDatabase-v1.0: A Structured Insect Dataset for Training Agricultural Large Language Model

【速读】:该论文旨在解决农业害虫管理中专家知识获取困难与边缘设备部署能力不足的问题,特别是在缺乏稳定互联网连接的农村地区。其关键解决方案在于:首先构建了一个结构化的昆虫信息数据集,通过整合现有害虫数据库和文献资料,并由领域专家审核验证;其次,基于高质量的问答(Q/A)对,采用LoRA(Low-Rank Adaptation)方法对轻量级大语言模型(LLM,≤7B参数)进行微调,使其适配边缘设备运行;最终实验证明,Mistral 7B在特定任务上达到88.9%的通过率,显著优于其他同类模型,且展现出更强的语义理解能力(嵌入相似度0.865),表明语义对齐比表面词汇匹配更有利于专业场景下的决策支持性能。

链接: https://arxiv.org/abs/2603.22777
作者: Yagizhan Bilal Durak,Ahsan Ul Islam,Shahidul Islam,Ashley Morgan-Olvera,Iftekhar Ibne Basith,Syed Hasib Akhter Faruqui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted in Artificial Super Intelligence Conference 2026 (Sponsored by KSU PLOT IEEE CIS)

点击查看摘要

Abstract:Agricultural pest management increasingly relies on timely and accurate access to expert knowledge, yet high quality labeled data and continuous expert support remain limited, particularly for farmers operating in rural regions with unstable/no internet connectivity. At the same time, the rapid growth of AI and LLMs has created new opportunities to deliver practical decision support tools directly to end users in agriculture through compact and deployable systems. This work addresses (i) generating a structured insect information dataset, and (ii) adapting a lightweight LLM model ( \leq 7B) by fine tuning it for edge device uses in agricultural pest management. The textual data collection was done by reviewing and collecting information from available pest databases and published manuscripts on nine selected pest species. These structured reports were then reviewed and validated by a domain expert. From these reports, we constructed Q/A pairs to support model training and evaluation. A LoRA-based fine-tuning approach was applied to multiple lightweight LLMs and evaluated. Initial evaluation shows that Mistral 7B achieves an 88.9% pass rate on the domain-specific Q/A task, substantially outperforming Qwen 2.5 7B (63.9%), and LLaMA 3.1 8B (58.7%). Notably, Mistral demonstrates higher semantic alignment (embedding similarity: 0.865) despite lower lexical overlap (BLEU: 0.097), indicating that semantic understanding and robust reasoning are more predictive of task success than surface-level conformity in specialized domains. By combining expert organized data, well-structured Q/A pairs, semantic quality control, and efficient model adaptation, this work contributes towards providing support for farmer facing agricultural decision support tools and demonstrates the feasibility of deploying compact, high-performing language models for practical field-level pest management guidance.

[AI-62] From Arithmetic to Logic: The Resilience of Logic and Lookup-Based Neural Networks Under Parameter Bit-Flips

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在安全关键型边缘环境中因硬件引起的位翻转错误(bit-flip errors)而导致的鲁棒性不足问题。其核心挑战在于如何在有限数值精度下提升模型对硬件故障的容忍能力,而现有研究多聚焦于训练后的特定模型表现,缺乏从结构层面理解鲁棒性的理论基础。解决方案的关键在于将鲁棒性视为神经网络架构的结构性属性,而非仅依赖于数据特定的训练结果;通过推导不同数值格式和层结构在独立参数位翻转下的期望均方误差(Mean Squared Error, MSE),作者发现低精度、高稀疏性、有界激活函数及浅层结构在该扰动模型下具有普遍优势,并进一步提出基于查找表(Lookup Table, LUT)的逻辑神经网络可实现上述设计趋势的联合极限,实验证明其在极端硬件扰动下仍保持稳定性能,且揭示了逻辑架构中特有的偶数层恢复效应(even-layer recovery effect),从而为硬件容错提供了一种新的准确率-鲁棒性权衡路径。

链接: https://arxiv.org/abs/2603.22770
作者: Alan T. L. Bacellar,Sathvik Chemudupati,Shashank Nag,Allison Seigler,Priscila M. V. Lima,Felipe M. G. França,Lizy K. John
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The deployment of deep neural networks (DNNs) in safety-critical edge environments necessitates robustness against hardware-induced bit-flip errors. While empirical studies indicate that reducing numerical precision can improve fault tolerance, the theoretical basis of this phenomenon remains underexplored. In this work, we study resilience as a structural property of neural architectures rather than solely as a property of a dataset-specific trained solution. By deriving the expected squared error (MSE) under independent parameter bit flips across multiple numerical formats and layer primitives, we show that lower precision, higher sparsity, bounded activations, and shallow depth are consistently favored under this corruption model. We then argue that logic and lookup-based neural networks realize the joint limit of these design trends. Through ablation studies on the MLPerf Tiny benchmark suite, we show that the observed empirical trends are consistent with the theoretical predictions, and that LUT-based models remain highly stable in corruption regimes where standard floating-point models fail sharply. Furthermore, we identify a novel even-layer recovery effect unique to logic-based architectures and analyze the structural conditions under which it emerges. Overall, our results suggest that shifting from continuous arithmetic weights to discrete Boolean lookups can provide a favorable accuracy-resilience trade-off for hardware fault tolerance.

[AI-63] CLiGNet: Clinical Label-Interaction Graph Network for Medical Specialty Classification from Clinical Transcriptions

【速读】:该论文旨在解决临床文本自动分类至40个医学专科的任务中因数据泄漏导致的性能高估问题,其关键解决方案是构建了一个无泄漏的基准(4966条记录),并提出CLiGNet模型——该模型融合了Bio ClinicalBERT文本编码器与基于语义相似性和ICD-10章节先验构建的专科标签图的两层图卷积网络(Graph Convolutional Network, GCN),并通过每标签注意力门控机制融合文档表示与标签图表示,同时采用焦点二元交叉熵损失(focal binary cross entropy loss)应对极端类别不平衡(最稀有类与最常见类比例达181:1)。实验表明,GCN标签图贡献了最大的性能提升(宏F1提高0.066),且加入Platt缩放校准后预期校准误差仅为0.007,实现了排序性能与概率可靠性之间的合理权衡。

链接: https://arxiv.org/abs/2603.22752
作者: Pronob Kumar Barman,Pronoy Kumar Barman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated classification of clinical transcriptions into medical specialties is essential for routing, coding, and clinical decision support, yet prior work on the widely used MTSamples benchmark suffers from severe data leakage caused by applying SMOTE oversampling before train test splitting. We first document this methodological flaw and establish a leakage free benchmark across 40 medical specialties (4966 records), revealing that the true task difficulty is substantially higher than previously reported. We then introduce CLiGNet (Clinical Label Interaction Graph Network), a neural architecture that combines a Bio ClinicalBERT text encoder with a two layer Graph Convolutional Network operating on a specialty label graph constructed from semantic similarity and ICD 10 chapter priors. Per label attention gates fuse document and label graph representations, trained with focal binary cross entropy loss to handle extreme class imbalance (181 to 1 ratio). Across seven baselines ranging from TF IDF classifiers to Clinical Longformer, CLiGNet without calibration achieves the highest macro F1 of 0.279, with an ablation study confirming that the GCN label graph provides the single largest component gain (increase of 0.066 macro F1). Adding per label Platt scaling calibration yields an expected calibration error of 0.007, demonstrating a principled trade off between ranking performance and probability reliability. We provide comprehensive failure analysis covering pairwise specialty confusions, rare class behaviour, document length effects, and token level Integrated Gradients attribution, offering actionable insights for clinical NLP system deployment. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22752 [cs.AI] (or arXiv:2603.22752v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.22752 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Pronob Kumar Barman [view email] [v1] Tue, 24 Mar 2026 03:30:06 UTC (18 KB)

[AI-64] Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在主观性、长周期企业任务中缺乏可靠评估手段的问题。现有评测方法多依赖于客观可验证的任务(如数学或编程),难以适用于涉及组织目标、用户意图及多工具协作流程的复杂企业场景。其解决方案的关键在于提出LH-Bench评估框架,该框架由三个支柱构成:(i) 专家制定的评分标准(expert-grounded rubrics),赋予LLM评判者领域上下文以提升主观任务评分一致性;(ii) 精心构建的基准中间产物(curated ground-truth artifacts),支持分步奖励信号(如内容类任务按章节标注);(iii) 成对人类偏好判断(pairwise human preference evaluation),用于收敛性验证。实验表明,专家撰写的评分标准显著优于LLM自动生成的标准(Kappa=0.60 vs. 0.46),且人类偏好判断与之高度一致(p > 0.05),证明了专家驱动评估在规模化时仍能保持可靠性。

链接: https://arxiv.org/abs/2603.22744
作者: Abhishek Chandwani,Ishan Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually-evaluated chapters on a course platform serving 30+ daily users). Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22744 [cs.AI] (or arXiv:2603.22744v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.22744 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-65] HyFI: Hyperbolic Feature Interpolation for Brain-Vision Alignment AAAI2026

【速读】:该论文旨在解决脑信号与图像之间在信息层次上的模态差异(modality gap)以及神经活动中语义特征与感知特征高度纠缠的问题。传统方法通常独立地将脑电活动与预训练视觉模型提取的语义和感知特征对齐,但忽略了脑信号表达能力有限且特征混杂的本质。解决方案的关键在于引入双曲空间(hyperbolic space),利用其几何特性——两点间的测地线自然向原点弯曲(代表低表示能力区域),从而实现语义与感知特征沿双曲测地线插值的融合与压缩。该方法称为Hyperbolic Feature Interpolation (HyFI),有效建模了脑信号的有限表达能力和特征纠缠特性,显著提升了脑到图像的零样本检索性能,在THINGS-EEG和THINGS-MEG数据集上分别实现了最高+17.3%和+9.1%的Top-1准确率提升。

链接: https://arxiv.org/abs/2603.22721
作者: Sangmin Jo,Wootaek Jeong,Da-Woon Heo,Yoohwan Hwang,Heung-Il Suk
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 13 figures. Published in AAAI 2026

点击查看摘要

Abstract:Recent progress in artificial intelligence has encouraged numerous attempts to understand and decode human visual system from brain signals. These prior works typically align neural activity independently with semantic and perceptual features extracted from images using pre-trained vision models. However, they fail to account for two key challenges: (1) the modality gap arising from the natural difference in the information level of representation between brain signals and images, and (2) the fact that semantic and perceptual features are highly entangled within neural activity. To address these issues, we utilize hyperbolic space, which is well-suited for considering differences in the amount of information and has the geometric property that geodesics between two points naturally bend toward the origin, where the representational capacity is lower. Leveraging these properties, we propose a novel framework, Hyperbolic Feature Interpolation (HyFI), which interpolates between semantic and perceptual visual features along hyperbolic geodesics. This enables both the fusion and compression of perceptual and semantic information, effectively reflecting the limited expressiveness of brain signals and the entangled nature of these features. As a result, it facilitates better alignment between brain and visual features. We demonstrate that HyFI achieves state-of-the-art performance in zero-shot brain-to-image retrieval, outperforming prior methods with Top-1 accuracy improvements of up to +17.3% on THINGS-EEG and +9.1% on THINGS-MEG.

[AI-66] PopResume: Causal Fairness Evaluation of LLM /VLM Resume Screeners with Population-Representative Dataset

【速读】:该论文旨在解决当前简历筛选系统中公平性评估缺乏因果依据的问题,特别是现有基准测试依赖人工注入的人口统计信息和结果层面的差异,难以区分合法与非法的歧视来源。其解决方案的关键在于构建了一个具有人口代表性、基于真实统计数据并保留自然属性关系的简历数据集 PopResume,并引入路径特定效应(Path-Specific Effect, PSE)分析方法,将受保护属性对简历评分的影响分解为两个路径:由岗位相关资质中介的“业务必要性路径”和由人口统计代理变量中介的“红线路径”。这一区分使得审计者能够识别出法律允许的差异与非法的歧视来源,从而实现更精准的因果公平性审计。

链接: https://arxiv.org/abs/2603.22714
作者: Sumin Yu,Juhyeon Park,Taesup Moon
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:We present PopResume, a population-representative resume dataset for causal fairness auditing of LLM- and VLM-based resume screening systems. Unlike existing benchmarks that rely on manually injected demographic information and outcome-level disparities, PopResume is grounded in population statistics and preserves natural attribute relationships, enabling path-specific effect (PSE)-based fairness evaluation. We decompose the effect of a protected attribute on resume scores into two paths: the business necessity path, mediated by job-relevant qualifications, and the redlining path, mediated by demographic proxies. This distinction allows auditors to separate legally permissible from impermissible sources of disparity. Evaluating four LLMs and four VLMs on PopResume’s 60.8K resumes across five occupations, we identify five representative discrimination patterns that aggregate metrics fail to capture. Our results demonstrate that PSE-based evaluation reveals fairness issues masked by outcome-level measures, underscoring the need for causally-grounded auditing frameworks in AI-assisted hiring.

[AI-67] MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

【速读】:该论文旨在解决当前AI生成音乐质量评估中缺乏高效、可解释且与人类感知高度相关的个体样本级(per-sample)评价指标的问题。现有分布度量(如Fréchet Audio Distance)无法对单个音乐片段进行评分,而唯一能达到高人类相关性的学习型指标则是闭源的。其解决方案的关键在于构建一个基于冻结的MuQ-310M特征(frozen MuQ-310M features)并使用MusicEval数据集(包含31种文本到音乐系统生成的片段及专家评分)训练轻量级预测头(prediction heads)的开放源代码模型MUQ-EVAL。实验表明,仅用注意力池化和两层MLP结构即可实现系统级斯皮尔曼秩相关系数(SRCC)达0.957、片段级达0.838,且无需进一步微调或复杂架构改进,说明原始MuQ特征已蕴含充分的质量判别信息;此外,LoRA适配在极小样本(仅150个标注片段)下即可获得可用性能,支持个性化质量评估器的构建。

链接: https://arxiv.org/abs/2603.22677
作者: Di Zhu,Zixuan Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 10 Pages, 6 figures

点击查看摘要

Abstract:Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per-sample learned metric achieving high human correlation is closed-source. We introduce MUQ-EVAL, an open-source per-sample quality metric for AIgenerated music built by training lightweight prediction heads on frozen MuQ-310M features using MusicEval, a dataset of generated clips from 31 text-to-music systems with expert quality ratings. Our simplest model, frozen features with attention pooling and a two-layer MLP, achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 with human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves the frozen baseline, indicating that frozen MuQ representations already capture quality-relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA-adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal-level artifacts but insensitivity to musical-structural distortions. Our metric, MUQ-EVAL, is fully open-source, outperforms existing open per-sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at this https URL.

[AI-68] Vision-based Deep Learning Analysis of Unordered Biomedical Tabular Datasets via Optimal Spatial Cartography

【速读】:该论文旨在解决高维非结构化表格数据(如生物医学中的液体活检、单细胞转录组学和电子健康记录)在深度学习建模中缺乏显式空间组织的问题,这限制了视觉类模型对局部结构和高阶特征交互的利用能力。解决方案的关键在于提出一种端到端的深度学习框架——动态特征映射(Dynamic Feature Mapping, Dynomap),其通过一个可微分的渲染机制联合优化特征的空间布局与预测任务,无需依赖启发式规则、预定义分组或外部先验知识,从而将原始表格向量转化为可被视觉模型有效处理的特征图,实现对临床相关模式的自动发现与性能提升。

链接: https://arxiv.org/abs/2603.22675
作者: Sakib Mostafa,Tarik Massoud,Maximilian Diehn,Lei Xing,Md Tauhidul Islam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 54 Pages, 8 main figures, 26 supplementary figures

点击查看摘要

Abstract:Tabular data are central to biomedical research, from liquid biopsy and bulk and single-cell transcriptomics to electronic health records and phenotypic profiling. Unlike images or sequences, however, tabular datasets lack intrinsic spatial organization: features are treated as unordered dimensions, and their relationships must be inferred implicitly by the model. This limits the ability of vision architectures to exploit local structure and higher-order feature interactions in non-spatial biomedical data. Here we introduce Dynamic Feature Mapping (Dynomap), an end-to-end deep learning framework that learns a task-optimized spatial topology of features directly from data. Dynomap jointly optimizes feature placement and prediction through a fully differentiable rendering mechanism, without relying on heuristics, predefined groupings, or external priors. By transforming high-dimensional tabular vectors into learned feature maps, Dynomap enables vision-based models to operate effectively on unordered biomedical inputs. Across multiple clinical and biological datasets, Dynomap consistently outperformed classical machine learning, modern deep tabular models, and existing vector-to-image approaches. In liquid biopsy data, Dynomap organized clinically relevant gene signatures into coherent spatial patterns and improved multiclass cancer subtype prediction accuracy by up to 18%. In a Parkinson disease voice dataset, it clustered disease-associated acoustic descriptors and improved accuracy by up to 8%. Similar gains and interpretable feature organization were observed in additional biomedical datasets. These results establish Dynomap as a general strategy for bridging tabular and vision-based deep learning and for uncovering structured, clinically relevant patterns in high-dimensional biomedical data.

[AI-69] Generalizing Dynamics Modeling More Easily from Representation Perspective

【速读】:该论文旨在解决从观测数据中学习复杂系统动力学模型时存在的泛化能力差的问题,即现有神经动力学建模方法通常为每个不同的复杂系统(如气候、生态或流体系统)单独训练特定模型,导致跨系统迁移性能不佳。其解决方案的关键在于提出一种通用的预训练动力学编码器(Pre-trained Dynamics Encoder, PDEDER),通过在大规模真实与合成观测数据上预训练任意预训练语言模型(Pre-trained Language Model, PLM),并最小化李雅普诺夫指数(Lyapunov exponent)目标函数,约束潜空间中动力学的混沌行为,从而获得局部稳定且结构良好的潜空间表示。此外,引入重建和预测目标以防止潜空间过度平滑,使得PDEDER可在不同动态系统上进行微调,实现更有效的短/长期预测任务,在域内与跨域场景下均展现出优越的泛化能力和建模效果。

链接: https://arxiv.org/abs/2603.22655
作者: Yiming Wang,Zhengnan Zhang,Genghe Zhang,Jiawen Dan,Changchun Li,Chenlong Hu,Chris Nugent,Jun Liu,Ximing Li,Bo Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning system dynamics from observations is a critical problem in many applications over various real-world complex systems, e.g., climate, ecology, and fluid systems. Recently, neural dynamics modeling method have become a prevalent solution that embeds the object’s observations into a latent space before learning dynamics using neural methods such as neural Ordinary Differential Equations (ODE). Existing dynamics modeling methods induce a specific model for each observation of different complex systems, resulting in poor generalization across systems. Inspired by the great success of pre-trained models, we conduct a generalized Pre-trained Dynamics EncoDER (PDEDER) which can embed the original state observations into a latent space where the dynamics can be captured more easily. To conduct the generalized PDEDER, we pre-train any Pre-trained Language Model (PLM) by minimizing the Lyapunov exponent objective, which constrains the chaotic behavior of governing dynamics learned in the latent space. By penalizing the divergence of embedded observations, our PDEDER promotes locally stable and well-structured latent dynamics, thereby facilitating more effective dynamics modeling than in the original observation space. In addition, we incorporate reconstruction and forecasting objectives to mitigate the risk of obtaining an over-smoothed latent space. Specifically, we collect 152 sets of real-world and synthetic observations from 23 complex systems as pre-training corpora and employ them to pre-train PDEDER. Given any future dynamic observation, we can fine-tune PDEDER with any specific dynamics modeling method. We evaluate PDEDER on 12 dynamic systems by short/long-term forecasting under both in-domain and cross-domain settings, and the empirical results indicate the effectiveness and generalizability of PDEDER.

[AI-70] Causal Discovery in Action: Learning Chain-Reaction Mechanisms from Interventions

【速读】:该论文旨在解决在一般动态系统中因果发现(causal discovery)的难题,尤其是在缺乏强结构假设的情况下,即使有干预数据也难以识别潜在的因果图。针对现实世界中广泛存在的链式反应系统(chain-reaction systems),即组件按顺序激活且上游故障抑制下游效应的场景,作者提出:通过施加阻断干预(blocking interventions)——阻止单个组件激活——可唯一识别因果结构。解决方案的关键在于设计一种最小估计器(minimal estimator),其具备有限样本保证,实现指数级误差衰减和对数级样本复杂度,从而在少量干预下即可可靠恢复因果关系,而传统基于观测数据的启发式方法在存在延迟或重叠因果效应时则失效。

链接: https://arxiv.org/abs/2603.22620
作者: Panayiotis Panayiotou,Özgür Şimşek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 5th Conference on Causal Learning and Reasoning (CLeaR 2026)

点击查看摘要

Abstract:Causal discovery is challenging in general dynamical systems because, without strong structural assumptions, the underlying causal graph may not be identifiable even from interventional data. However, many real-world systems exhibit directional, cascade-like structure, in which components activate sequentially and upstream failures suppress downstream effects. We study causal discovery in such chain-reaction systems and show that the causal structure is uniquely identifiable from blocking interventions that prevent individual components from activating. We propose a minimal estimator with finite-sample guarantees, achieving exponential error decay and logarithmic sample complexity. Experiments on synthetic models and diverse chain-reaction environments demonstrate reliable recovery from a few interventions, while observational heuristics fail in regimes with delayed or overlapping causal effects.

[AI-71] Bridging the Know-Act Gap via Task-Level Autoregressive Reasoning

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的“知-行鸿沟”(know-act gap)问题,即模型在判别式提示下能够识别输入中的错误或不合理之处,但在生成式响应中却仍会给出看似合理但错误的答案。这种现象并非源于知识缺失,而是由于标准生成过程中的token级自回归机制将任务选择(验证 vs. 回答)与内容生成耦合在一起,导致判别性知识无法被有效利用。解决方案的关键在于提出DeIllusionLLM——一种任务级自回归框架,通过显式建模任务决策过程,并结合自蒸馏(self-distillation)策略,在单一模型主干中统一判别判断与生成推理,从而在自然提示下显著降低“错误回答”类失败率,同时保持通用推理性能。

链接: https://arxiv.org/abs/2603.22619
作者: Jihyun Janice Ahn,Ryo Kamoi,Berk Atil,Renze Lou,WonWoo Kang,Heehyun Park,Sarkar Snigdha Sarathi Das,Zhuoyang Zou,Xiaoxin Lu,Yusen Zhang,Asfahan Shah,Ridwanul Hasan Tanvir,Lingxiao Zhao,Hongxi Huang,Vignesh Venkatesh,Dianjun Lin,Hamid Shah,Wentao Wang,Zhanpeng Song,Joshua Reed Bassin,Dax Patel,Ishan Appareddy Agrahar,Sahil Pardasani,Xin Dong,Fatemeh Rahbari,Benjamin David Rishel,Soochan Andrew Lee,Yuv Boghani,Ali B. AlNaseeb,Pranav Suby,Seokhyeon Bae,Shreya Buddharaju,Damien Kula,Soumyadeep Das,Hanyang Frank Liu,Faye Mo,Wenpeng Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:LLMs often generate seemingly valid answers to flawed or ill-posed inputs. This is not due to missing knowledge: under discriminative prompting, the same models can mostly identify such issues, yet fail to reflect this in standard generative responses. This reveals a fundamental know-act gap between discriminative recognition and generative behavior. Prior work largely characterizes this issue in narrow settings, such as math word problems or question answering, with limited focus on how to integrate these two modes. In this work, we present a comprehensive analysis using FaultyScience, a newly constructed large-scale, cross-disciplinary benchmark of faulty scientific questions. We show that the gap is pervasive and stems from token-level autoregression, which entangles task selection (validate vs. answer) with content generation, preventing discriminative knowledge from being utilized. To address this, we propose DeIllusionLLM, a task-level autoregressive framework that explicitly models this decision. Through self-distillation, the model unifies discriminative judgment and generative reasoning within a single backbone. Empirically, DeIllusionLLM substantially reduces answer-despite-error failures under natural prompting while maintaining general reasoning performance, demonstrating that self-distillation is an effective and scalable solution for bridging the discriminative-generative know-act gap

[AI-72] AI Mental Models: Learned Intuition and Deliberation in a Bounded Neural Architecture

【速读】:该论文旨在解决一个核心问题:受限的神经网络架构是否能够在经典64项三段论推理基准上表现出直觉与审慎推理之间的有意义分工。这一问题具有重要理论意义,因为它涉及生成式AI(Generative AI)中世界模型构建与多阶段推理机制的争论,同时为测试学习系统能否发展出结构化内部计算而非仅依赖一次性关联预测提供了受控环境。解决方案的关键在于引入了一个有界双路径架构(bounded dual-path architecture),其中包含独立的直觉路径(intuition pathway)和审慎推理路径(deliberation pathway),其设计受到计算心智模型理论(computational mental-model theory)的启发。实验结果显示,尽管两个路径均表现良好(直觉路径相关系数 r = 0.7272,审慎路径 r = 0.8152),但审慎路径显著优于直觉路径(p = 0.0101),且在处理拒绝响应和c-a结论等复杂情形时提升最明显;进一步的可解释性分析表明,审慎路径演化出了稀疏、差异化内部状态结构,包括以Oac倾向为主的状态、主导工作状态及若干弱使用或未使用的状态,这暗示了类推理的内部组织机制,但并未实现完整的模型构建、反例搜索与结论修正的序列过程。

链接: https://arxiv.org/abs/2603.22561
作者: Laurence Anthony
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper asks whether a bounded neural architecture can exhibit a meaningful division of labor between intuition and deliberation on a classic 64-item syllogistic reasoning benchmark. More broadly, the benchmark is relevant to ongoing debates about world models and multi-stage reasoning in AI. It provides a controlled setting for testing whether a learned system can develop structured internal computation rather than only one-shot associative prediction. Experiment 1 evaluates a direct neural baseline for predicting full 9-way human response distributions under 5-fold cross-validation. Experiment 2 introduces a bounded dual-path architecture with separate intuition and deliberation pathways, motivated by computational mental-model theory (Khemlani Johnson-Laird, 2022). Under cross-validation, bounded intuition reaches an aggregate correlation of r = 0.7272, whereas bounded deliberation reaches r = 0.8152, and the deliberation advantage is significant across folds (p = 0.0101). The largest held-out gains occur for NVC, Eca, and Oca, suggesting improved handling of rejection responses and c-a conclusions. A canonical 80:20 interpretability run and a five-seed stability sweep further indicate that the deliberation pathway develops sparse, differentiated internal structure, including an Oac-leaning state, a dominant workhorse state, and several weakly used or unused states whose exact indices vary across runs. These findings are consistent with reasoning-like internal organization under bounded conditions, while stopping short of any claim that the model reproduces full sequential processes of model construction, counterexample search, and conclusion revision.

[AI-73] Maximum Entropy Relaxation of Multi-Way Cardinality Constraints for Synthetic Population Generation

【速读】:该论文旨在解决从聚合统计信息中生成合成人口的问题,核心挑战在于如何在满足多种异质性约束(如一元、二元和三元约束)的同时高效构建符合全局频率限制的个体级数据集。传统方法在约束数量增多或高阶交互关系复杂时计算效率急剧下降,难以扩展。解决方案的关键在于引入最大熵(Maximum-Entropy, MaxEnt)松弛框架:将严格的多维基数约束转化为期望层面的匹配,从而将问题转化为一个指数族分布下的凸优化问题,通过求解拉格朗日乘子实现对完整人口赋值的概率建模,显著提升了大规模、高阶约束场景下的可扩展性和稳定性。

链接: https://arxiv.org/abs/2603.22558
作者: François Pachet,Jean-Daniel Zucker
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 page, 5 figures, 3 tables

点击查看摘要

Abstract:Generating synthetic populations from aggregate statistics is a core component of microsimulation, agent-based modeling, policy analysis, and privacy-preserving data release. Beyond classical census marginals, many applications require matching heterogeneous unary, binary, and ternary constraints derived from surveys, expert knowledge, or automatically extracted descriptions. Constructing populations that satisfy such multi-way constraints simultaneously poses a significant computational challenge. We consider populations where each individual is described by categorical attributes and the target is a collection of global frequency constraints over attribute combinations. Exact formulations scale poorly as the number and arity of constraints increase, especially when the constraints are numerous and overlapping. Grounded in methods from statistical physics, we propose a maximum-entropy relaxation of this problem. Multi-way cardinality constraints are matched in expectation rather than exactly, yielding an exponential-family distribution over complete population assignments and a convex optimization problem over Lagrange multipliers. We evaluate the approach on NPORS-derived scaling benchmarks with 4 to 40 attributes and compare it primarily against generalized raking. The results show that MaxEnt becomes increasingly advantageous as the number of attributes and ternary interactions grows, while raking remains competitive on smaller, lower-arity instances.

[AI-74] LLM ON: An LLM -native Markup Language to Leverag e Structure and Semantics at the LLM Interface

【速读】:该论文旨在解决当前文本型大语言模型(Large Language Models, LLMs)在输入输出过程中无法有效传递结构化信息与语义元数据的问题,例如指令与数据混杂在同一字符串中,易导致模型混淆及提示注入攻击等安全风险。其解决方案的关键在于提出一种面向大语言模型的原生标记语言——LLMON(LLM Object Notation),通过结构化的标记方式自然地表达输入内容的语义层次和结构信息,从而在模型训练、提示构建和推理实现阶段提升准确性、安全性与鲁棒性,类似于编程语言类型系统在静态检查、代码生成和IDE高亮中的多重用途。

链接: https://arxiv.org/abs/2603.22519
作者: Michael Hind,Basel Shbita,Bo Wu,Farhan Ahmed,Chad DeLuca,Nathan Fulton,David Cox,Dan Gutfreund
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 28 pages

点击查看摘要

Abstract:Textual Large Language Models (LLMs) provide a simple and familiar interface: a string of text is used for both input and output. However, the information conveyed to an LLM often has a richer structure and semantics, which is not conveyed in a string. For example, most prompts contain both instructions (“Summarize this paper into a paragraph”) and data (the paper to summarize), but these are usually not distinguished when passed to the model. This can lead to model confusion and security risks, such as prompt injection attacks. This work addresses this shortcoming by introducing an LLM-native mark-up language, LLMON (LLM Object Notation, pronounced “Lemon”), that enables the structure and semantic metadata of the text to be communicated in a natural way to an LLM. This information can then be used during model training, model prompting, and inference implementation, leading to improvements in model accuracy, safety, and security. This is analogous to how programming language types can be used for many purposes, such as static checking, code generation, dynamic checking, and IDE highlighting. We discuss the general design requirements of an LLM-native markup language, introduce the LLMON markup language and show how it meets these design requirements, describe how the information contained in a LLMON artifact can benefit model training and inference implementation, and provide some preliminary empirical evidence of its value for both of these use cases. We also discuss broader issues and research opportunities that are enabled with an LLM-native approach. Comments: 28 pages Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL) Cite as: arXiv:2603.22519 [cs.SE] (or arXiv:2603.22519v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2603.22519 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-75] Stability-Preserving Online Adaptation of Neural Closed-loop Maps

【速读】:该论文旨在解决非线性系统在运行过程中因目标函数或扰动变化而需在线更新控制器时,如何在不破坏闭环稳定性前提下提升控制性能的问题。传统方法依赖于时不变的循环神经网络(Recurrent Neural Network, RNN)控制器,但缺乏可证明稳定的在线更新机制,且策略切换本身可能引发闭环不稳定。解决方案的关键在于将每个控制器建模为具有有界 ℓ_p-增益(ℓ_p-gain)的因果算子,并推导出基于增益条件的稳定性保持更新准则;由此衍生出两种实用的更新策略——时间调度式和状态触发式,可在任意次数更新后仍保证闭环 ℓ_p-稳定。该方法进一步揭示了稳定性与控制器最优性之间的解耦关系,从而允许使用近似或提前终止的控制器合成过程,显著提升了动态环境下控制系统的适应性和鲁棒性。

链接: https://arxiv.org/abs/2603.22469
作者: Danilo Saccani,Luca Furieri,Giancarlo Ferrari-Trecate
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The growing complexity of modern control tasks calls for controllers that can react online as objectives and disturbances change, while preserving closed-loop stability. Recent approaches for improving the performance of nonlinear systems while preserving closed-loop stability rely on time-invariant recurrent neural-network controllers, but offer no principled way to update the controller during operation. Most importantly, switching from one stabilizing policy to another can itself destabilize the closed-loop. We address this problem by introducing a stability-preserving update mechanism for nonlinear, neural-network-based controllers. Each controller is modeled as a causal operator with bounded \ell_p -gain, and we derive gain-based conditions under which the controller may be updated online. These conditions yield two practical update schemes, time-scheduled and state-triggered, that guarantee the closed-loop remains \ell_p -stable after any number of updates. Our analysis further shows that stability is decoupled from controller optimality, allowing approximate or early-stopped controller synthesis. We demonstrate the approach on nonlinear systems with time-varying objectives and disturbances, and show consistent performance improvements over static and naive online baselines while guaranteeing stability.

[AI-76] CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在具身操作(embodied manipulation)任务中作为自主控制器的有效性问题,特别是如何通过“代码即策略”(Code-as-Policy, CaP)范式提升模型在真实机器人场景中的可靠性与泛化能力。其核心挑战在于:当前基于数据驱动的视觉-语言-动作(Vision-Language-Action, VLA)方法虽强大,但在缺乏人类先验抽象的情况下性能显著下降,且难以在低层次感知与控制原语上稳定运行。解决方案的关键在于构建一个系统性的研究框架 CaP-X,其中包含两个核心组件:CaP-Gym 提供交互式环境以支持程序合成与执行,CaP-Bench 用于评估不同抽象层级下语言和视觉语言模型的表现;并通过引入多轮交互、结构化执行反馈、视觉差分、自动技能合成及集成推理等机制,在不依赖人工设计抽象的前提下显著增强代理(agent)对低级原语的鲁棒性,最终实现无需训练即可达到人类水平可靠性的 CaP-Agent0 框架,并进一步通过可验证奖励的强化学习(CaP-RL)实现 sim2real 的高效迁移。

链接: https://arxiv.org/abs/2603.22435
作者: Max Fu,Justin Yu,Karim El-Refai,Ethan Kou,Haoru Xue,Huang Huang,Wenli Xiao,Guanzhi Wang,Fei-Fei Li,Guanya Shi,Jiajun Wu,Shankar Sastry,Yuke Zhu,Ken Goldberg,Linxi “Jim” Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:“Code-as-Policy” considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that this gap can be mitigated through scaling agentic test-time computation–through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning–substantially improves robustness even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing reinforcement learning with verifiable rewards improves success rates and transfers from sim2real with minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.

[AI-77] Computational Arbitrag e in AI Model Markets

【速读】:该论文旨在解决AI模型市场中因模型提供商之间竞争导致的资源配置效率低下问题,特别是如何通过引入 arbitrage(套利)机制来优化推理预算分配并提升市场整体经济效率。其解决方案的关键在于设计并实证验证了简单且可扩展的套利策略——即利用不同模型在成本与能力上的差异,将客户的推理请求高效分配给最具性价比的模型提供者,从而以更低的价格提供可验证的解决方案,并从中获取净利润;同时发现蒸馏(distillation)会进一步放大套利机会,而多个套利者的存在则促使消费者价格下降、市场更趋公平和开放,有助于小型模型提供商早期进入市场并实现收入捕获。

链接: https://arxiv.org/abs/2603.22404
作者: Ricardo Olmedo,Bernhard Schölkopf,Moritz Hardt
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Consider a market of competing model providers selling query access to models with varying costs and capabilities. Customers submit problem instances and are willing to pay up to a budget for a verifiable solution. An arbitrageur efficiently allocates inference budget across providers to undercut the market, thus creating a competitive offering with no model-development risk. In this work, we initiate the study of arbitrage in AI model markets, empirically demonstrating the viability of arbitrage and illustrating its economic consequences. We conduct an in-depth case study of SWE-bench GitHub issue resolution using two representative models, GPT-5 mini and DeepSeek v3.2. In this verifiable domain, simple arbitrage strategies generate net profit margins of up to 40%. Robust arbitrage strategies that generalize across different domains remain profitable. Distillation further creates strong arbitrage opportunities, potentially at the expense of the teacher model’s revenue. Multiple competing arbitrageurs drive down consumer prices, reducing the marginal revenue of model providers. At the same time, arbitrage reduces market segmentation and facilitates market entry for smaller model providers by enabling earlier revenue capture. Our results suggest that arbitrage can be a powerful force in AI model markets with implications for model development, distillation, and deployment.

[AI-78] Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure

【速读】:该论文旨在解决自主代理在连续环境中不仅需要决定“做什么”,还需确定“何时行动”的时间决策问题。传统方法依赖于启发式或生物启发的定时机制,缺乏对动态环境变化的适应性。其核心解决方案是提出一种轻量级自适应时间控制框架(adaptive temporal control system),通过学习最优认知周期间隔来替代人工设定的时间策略;关键创新在于引入基于双曲几何的预测性超曲率信号(predictive hyperbolic spread signal),该信号由嵌入庞加莱球(Poincaré ball)中的多个未来状态之间的平均成对庞加莱距离计算得出,高分散度表示未来不确定性大,促使代理提前行动,低分散度则表明可预测性强,允许更长的休眠间隔。此外,论文设计了一种区间感知奖励机制(interval-aware reward),纠正了传统基于结果的奖励在时间分配上的信用分配偏差,并结合时空联合嵌入(ATCPG-ST),利用空间轨迹发散提供独立于状态信息的时序线索,显著提升效率与鲁棒性。

链接: https://arxiv.org/abs/2603.22384
作者: Davide Di Gioia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous agents operating in continuous environments must decide not only what to do, but when to act. We introduce a lightweight adaptive temporal control system that learns the optimal interval between cognitive ticks from experience, replacing ad hoc biologically inspired timers with a principled learned policy. The policy state is augmented with a predictive hyperbolic spread signal (a “curvature signal” shorthand) derived from hyperbolic geometry: the mean pairwise Poincare distance among n sampled futures embedded in the Poincare ball. High spread indicates a branching, uncertain future and drives the agent to act sooner; low spread signals predictability and permits longer rest intervals. We further propose an interval-aware reward that explicitly penalises inefficiency relative to the chosen wait time, correcting a systematic credit-assignment failure of naive outcome-based rewards in timing problems. We additionally introduce a joint spatio-temporal embedding (ATCPG-ST) that concatenates independently normalised state and position projections in the Poincare ball; spatial trajectory divergence provides an independent timing signal unavailable to the state-only variant (ATCPG-SO). This extension raises mean hyperbolic spread (kappa) from 1.88 to 3.37 and yields a further 5.8 percent efficiency gain over the state-only baseline. Ablation experiments across five random seeds demonstrate that (i) learning is the dominant efficiency factor (54.8 percent over no-learning), (ii) hyperbolic spread provides significant complementary gain (26.2 percent over geometry-free control), (iii) the combined system achieves 22.8 percent efficiency over the fixed-interval baseline, and (iv) adding spatial position information to the spread embedding yields an additional 5.8 percent.

[AI-79] Symbolic Graph Networks for Robust PDE Discovery from Noisy Sparse Data

【速读】:该论文旨在解决从噪声干扰和稀疏采样观测数据中准确发现偏微分方程(Partial Differential Equations, PDEs)的难题。传统方法依赖数值微分或积分公式,在实际应用中易受高频率噪声影响且对数据密度敏感。其解决方案的关键在于提出一种符号图网络(Symbolic Graph Network, SGN)框架:通过图消息传递机制建模空间相互作用,获得非局部表示以降低对高频噪声的敏感性;在此基础上,结合符号回归模块提取可解释的数学表达式,从而在噪声和稀疏条件下仍能有效恢复物理规律。

链接: https://arxiv.org/abs/2603.22380
作者: Xingyu Chen,Junxiu An,Jun Guo,Yuqian Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 5 figures, 7 tables

点击查看摘要

Abstract:Data-driven discovery of partial differential equations (PDEs) offers a promising paradigm for uncovering governing physical laws from observational data. However, in practical scenarios, measurements are often contaminated by noise and limited by sparse sampling, which poses significant challenges to existing approaches based on numerical differentiation or integral formulations. In this work, we propose a Symbolic Graph Network (SGN) framework for PDE discovery under noisy and sparse conditions. Instead of relying on local differential approximations, SGN leverages graph message passing to model spatial interactions, providing a non-local representation that is less sensitive to high frequency noise. Based on this representation, the learned latent features are further processed by a symbolic regression module to extract interpretable mathematical expressions. We evaluate the proposed method on several benchmark systems, including the wave equation, convection-diffusion equation, and incompressible Navier-Stokes equations. Experimental results show that SGN can recover meaningful governing relations or solution forms under varying noise levels, and demonstrates improved robustness compared to baseline methods in sparse and noisy settings. These results suggest that combining graph-based representations with symbolic regression provides a viable direction for robust data-driven discovery of physical laws from imperfect observations. The code is available at this https URL

[AI-80] Rethinking Multimodal Fusion for Time Series: Auxiliary Modalities Need Constrained Fusion

【速读】:该论文旨在解决多模态时间序列(Time Series, TS)预测中因简单融合策略(如直接相加或拼接)导致性能提升有限甚至低于单模态模型的问题,其核心原因是辅助模态(如文本)的无控制整合可能引入无关信息,干扰时间序列动态。解决方案的关键在于提出受控融合机制,特别是 Controlled Fusion Adapter (CFA),通过低秩适配器(low-rank adapters)在融合前过滤与时间序列动态不相关的文本信息,实现仅保留相关语义特征的跨模态交互,从而在不修改原有时序模型架构的前提下显著提升多模态预测性能。

链接: https://arxiv.org/abs/2603.22372
作者: Seunghan Lee,Jun Seo,Jaehoon Lee,Sungdong Yoo,Minjae Kim,Tae Yoon Lim,Dongwan Kang,Hwanil Choi,SoonYoung Lee,Wonbin Ahn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low-rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods including CFA. Code is publicly available at: this https URL.

[AI-81] FAAR: Format-Aware Adaptive Rounding for NVFP4

【速读】:该论文旨在解决在边缘设备上部署大语言模型(Large Language Models, LLMs)时,因极低比特量化(ultra-low precision quantization)导致的精度损失问题,特别是针对NVFP4(Non-Uniform Floating Point 4-bit)格式下传统舍入策略无法适应其非均匀数值网格、从而引发放大量化误差的问题。解决方案的关键在于提出一种**格式感知自适应舍入(Format-Aware Adaptive Rounding, FAAR)**方法,该方法将NVFP4的非均匀数值结构显式嵌入优化过程,通过梯度引导的自适应舍入决策逼近理论最优量化;同时引入两阶段格式对齐(2-stages Format Alignment, 2FA)微调方案,逐层调整LLM参数以匹配NVFP4数值空间,显著缩小性能差距,且训练开销极低(仅需4 GPU小时)。

链接: https://arxiv.org/abs/2603.22370
作者: Hanglin Li,Shuchang Tian,Chen Lin,Zhiyong Zhao,Kun Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low precision formats such as NVFP4 offer a promising solution for reducing memory footprint and accelerating computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored for the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions guided by loss gradients, our method effectively approximates the theoretically optimal quantization. To complement FAAR, we introduce a 2-stages Format Alignment (2FA) fine-tuning scheme that aligns LLM parameters layer-by-layer to the NVFP4 numerical space, further narrowing the performance gap. Remarkably, this learnable optimization incurs a minimal training overhead of only 4 GPU hours on Llama3-1B. Extensive experiments demonstrate the effectiveness of our approach. Compared with Round-to-Nearest (RTN), our method reduces perplexity on WikiText-2 from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B. Additionally, our method consistently outperforms state-of-the-art approaches across various zero-shot downstream tasks.

[AI-82] Q-AGNN: Quantum-Enhanced Attentive Graph Neural Network for Intrusion Detection

【速读】:该论文旨在解决现有基于深度学习的入侵检测系统(Intrusion Detection System, IDS)在处理网络流量时,因将网络流视为独立实例而无法有效利用网络通信中固有的关系依赖性问题。其解决方案的关键在于提出一种量子增强的注意力图神经网络(Quantum-Enhanced Attentive Graph Neural Network, Q-AGNN),将网络流建模为节点、相似性关系建模为边,并通过参数化量子电路(Parameterized Quantum Circuits, PQCs)将多跳邻域信息编码至高维潜在空间,从而实现量子诱导希尔伯特空间中的二阶多项式图滤波;随后引入注意力机制自适应加权量子增强嵌入,聚焦于对异常行为最具影响力的节点,显著提升检测精度并保持低误报率。

链接: https://arxiv.org/abs/2603.22365
作者: Devashish Chaudhary,Sutharshan Rajasegarar,Shiva Raj Pokhrel
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rapid growth of interconnected devices, accurately detecting malicious activities in network traffic has become increasingly challenging. Most existing deep learning-based intrusion detection systems treat network flows as independent instances, thereby failing to exploit the relational dependencies inherent in network communications. To address this limitation, we propose Q-AGNN, a Quantum-Enhanced Attentive Graph Neural Network for intrusion detection, where network flows are modeled as nodes and edges represent similarity relationships. Q-AGNN leverages parameterized quantum circuits (PQCs) to encode multi-hop neighborhood information into a high-dimensional latent space, inducing a bounded quantum feature map that implements a second-order polynomial graph filter in a quantum-induced Hilbert space. An attention mechanism is subsequently applied to adaptively weight the quantum-enhanced embeddings, allowing the model to focus on the most influential nodes contributing to anomalous behavior. Extensive experiments conducted on four benchmark intrusion detection datasets demonstrate that Q-AGNN achieves competitive or superior detection performance compared to state-of-the-art graph-based methods, while consistently maintaining low false positive rates under hardware-calibrated noise conditions. Moreover, we also executed the Q-AGNN framework on actual IBM quantum hardware to demonstrate the practical operability of the proposed pipeline under real NISQ conditions. These results highlight the effectiveness of integrating quantum-enhanced representations with attention mechanisms for graph-based intrusion detection and underscore the potential of hybrid quantum-classical learning frameworks in cybersecurity applications.

[AI-83] Early Discoveries of Algorithmist I: Promise of Provable Algorithm Synthesis at Scale

【速读】:该论文试图解决的问题是:如何设计具有可证明保证(provable guarantees)且在实践中表现良好的算法,这一问题长期以来依赖于数学推理与精心实现的结合,而现有方法如超越最坏情况分析(beyond-worst-case analysis)和数据驱动的算法选择(data-driven algorithm selection)通常受限于先验分布知识或固定算法池。解决方案的关键在于提出并实现一个名为 Algorithmist 的自主研究代理系统,该系统基于大语言模型(LLM)构建,运行多智能体研究-评审循环,包含创意生成、算法与证明开发、基于证明指导的实现以及对证明、代码及其一致性进行审查等阶段。其核心创新在于将代码合成过程与结构化的自然语言证明中间表示(intermediate representation)同步进行,并保持二者在整个合成过程中的一致性,从而实现了从理论到实践的闭环验证与优化,最终产出兼具理论正确性和实际有效性的算法成果。

链接: https://arxiv.org/abs/2603.22363
作者: Janardhan Kulkarni
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 75 pages, technical report

点击查看摘要

Abstract:Designing algorithms with provable guarantees that also work well in practice remains difficult, requiring both mathematical reasoning and careful implementation. Existing approaches that bridge worst-case theory and empirical performance, such as beyond-worst-case analysis and data-driven algorithm selection, typically assume prior distributional knowledge or restrict attention to a fixed pool of algorithms. Recent progress in LLMs suggests a new possibility: provable algorithm synthesis on the fly. To study this, we built Algorithmist, an autonomous researcher agent on top of GitHub Copilot that runs a multi-agent research-and-review loop, with separate stages for idea generation, algorithm and proof development, proof-guided implementation, and review of proofs, code, and their alignment. We evaluate Algorithmist on research-level tasks in private data analysis and clustering. When asked to design practical methods that jointly satisfy privacy, approximation, and interpretability requirements, it produced provably sound and empirically effective algorithms, together with research-style writeups and audited implementations. It also found improved algorithms in some settings, explained principled barriers in others, and uncovered a subtle proof bug in prior published work. More broadly, our results suggest a new paradigm in which LLM systems generate research-paper-quality algorithmic artifacts tailored to each dataset and deployment setting. They also point to a proof-first code-synthesis paradigm, in which code is developed alongside a structured natural-language proof intermediate representation and kept aligned with it throughout synthesis.

[AI-84] Unveiling the Mechanism of Continuous Representation Full-Waveform Inversion: A Wave Based Neural Tangent Kernel Framework ICLR2026

【速读】:该论文旨在解决全波形反演(Full-waveform Inversion, FWI)对初始模型敏感的问题,尤其是传统FWI方法在实际应用中因初始模型精度不足而导致收敛困难甚至失败的局限性。其解决方案的关键在于提出了一种基于波传播机制的神经切向核(Wave-based Neural Tangent Kernel, Wave-based NTK)理论框架,通过扩展标准神经切向核(NTK)以刻画连续表示FWI(CR-FWI)中参数更新的动力学特性。研究表明,与标准NTK不同,波基NTK在初始化和训练过程中均非恒定,这源于FWI本身的强非线性;同时,波基NTK的特征值衰减行为可解释为何CR-FWI能降低对初始模型的依赖并表现出较慢的高频收敛速度。基于此理论洞察,作者进一步设计了具有定制化特征值衰减特性的CR-FWI方法,包括一种结合隐式神经表示(INR)与多分辨率网格的新型混合表示(IG-FWI),实现了鲁棒性与高频收敛速率之间的更好平衡,在多个地质模型上的实验验证了其优越性能。

链接: https://arxiv.org/abs/2603.22362
作者: Ruihua Chen,Yisi Luo,Bangyu Wu,Deyu Meng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
备注: Published as a conference paper at ICLR 2026 (poster)

点击查看摘要

Abstract:Full-waveform inversion (FWI) estimates physical parameters in the wave equation from limited measurements and has been widely applied in geophysical exploration, medical imaging, and non-destructive testing. Conventional FWI methods are limited by their notorious sensitivity to the accuracy of the initial models. Recent progress in continuous representation FWI (CR-FWI) demonstrates that representing parameter models with a coordinate-based neural network, such as implicit neural representation (INR), can mitigate the dependence on initial models. However, its underlying mechanism remains unclear, and INR-based FWI shows slower high-frequency convergence. In this work, we investigate the general CR-FWI framework and develop a unified theoretical understanding by extending the neural tangent kernel (NTK) for FWI to establish a wave-based NTK framework. Unlike standard NTK, our analysis reveals that wave-based NTK is not constant, both at initialization and during training, due to the inherent nonlinearity of FWI. We further show that the eigenvalue decay behavior of the wave-based NTK can explain why CR-FWI alleviates the dependency on initial models and shows slower high-frequency convergence. Building on these insights, we propose several CR-FWI methods with tailored eigenvalue decay properties for FWI, including a novel hybrid representation combining INR and multi-resolution grid (termed IG-FWI) that achieves a more balanced trade-off between robustness and high-frequency convergence rate. Applications in geophysical exploration on Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP model, and the more realistic 2014 Chevron models show the superior performance of our proposed methods compared to conventional FWI and existing INR-based FWI methods.

[AI-85] STEM Agent : A Self-Adapting Tool-Enabled Extensible Architecture for Multi-Protocol AI Agent Systems

【速读】:该论文旨在解决当前AI代理框架在部署多样性交互范式时的局限性问题,具体表现为早期即固化单一交互协议、固定工具集成策略及静态用户建模,从而难以适应复杂多变的应用场景。解决方案的关键在于提出STEM Agent(Self-adapting, Tool-enabled, Extensible, Multi-agent)这一模块化架构,其核心创新包括:通过类生物多能性机制使未分化的核心代理分化为专用协议处理器、工具绑定和记忆子系统;统一五种互操作协议(A2A、AG-UI、A2UI、UCP 和 AP2)于单一网关之下;引入Caller Profiler持续学习超过二十个行为维度的用户偏好;通过Model Context Protocol(MCP)外部化领域能力;以及基于生物启发技能获取机制,将重复交互模式结晶为可复用的代理技能,其成熟生命周期类比细胞分化。此外,内存系统采用整合机制(如情景修剪、语义去重和模式提取),实现持续交互下的次线性增长,从而保障长期运行效率与适应性。

链接: https://arxiv.org/abs/2603.22359
作者: Alfred Shen,Aaron Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figures, 4 tables

点击查看摘要

Abstract:Current AI agent frameworks commit early to a single interaction protocol, a fixed tool integration strategy, and static user models, limiting their deployment across diverse interaction paradigms. To address these constraints, we introduce STEM Agent (Self-adapting, Tool-enabled, Extensible, Multi-agent), a modular architecture inspired by biological pluripotency in which an undifferentiated agent core differentiates into specialized protocol handlers, tool bindings, and memory subsystems that compose into a fully functioning AI system. The framework unifies five interoperability protocols (A2A, AG-UI, A2UI, UCP, and AP2) behind a single gateway, introduces a Caller Profiler that continuously learns user preferences across more than twenty behavioral dimensions, externalizes all domain capabilities through the Model Context Protocol (MCP), and implements a biologically inspired skills acquisition system in which recurring interaction patterns crystallize into reusable agent skills through a maturation lifecycle analogous to cell differentiation. Complementing these capabilities, the memory system incorporates consolidation mechanisms, including episodic pruning, semantic deduplication, and pattern extraction, designed for sub-linear growth under sustained interaction. A comprehensive 413-test suite validates protocol handler behavior and component integration across all five architectural layers, completing in under three seconds.

[AI-86] WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement ACL2026

【速读】:该论文旨在解决语言模型在特定领域推理能力提升过程中面临的两大挑战:一是纯内生自对弈(endogenous self-play)易因迭代偏差导致性能漂移,二是基于语料库的自进化方法依赖预设领域数据环境,缺乏灵活性与开放性。解决方案的关键在于提出WIST框架——一种基于网络(Web-grounded)的迭代自对弈树结构,其核心机制包括:1)通过增量扩展领域树进行可控探索,2)从开放网络中检索并清洗路径一致的语料构建可调控训练环境,3)采用带可验证奖励(verifiable rewards)的挑战者-求解器(Challenger–Solver)自对弈策略,并利用学习信号反馈更新节点后验概率,从而引导自适应课程学习。该方法实现了无需预设语料即可稳定提升模型推理能力,且具备领域定向性,在多个基准测试中显著优于现有基线方法。

链接: https://arxiv.org/abs/2603.22352
作者: Fangyuan Li,Pengfei Li,Shijie Wang,Junqi Gao,Jianxing Liu,Biqing Qi,Yuqiang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 4 figures. Submitted to ACL2026

点击查看摘要

Abstract:Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present \textbfWIST, a \textbfWeb-grounded \textbfIterative \textbfSelf-play \textbfTree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree for exploration, and retrieves and cleans path-consistent web corpus to construct a controllable training environment. It then performs Challenger–Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with the Overall gains reaching \textbf+9.8 (\textitQwen3-4B-Base) and \textbf+9.7 (\textitOctoThinker-8B). WIST is also domain-steerable, improving \textitQwen3-8B-Base by \textbf+14.79 in medicine and \textitQwen3-4B-Base by \textbf+5.28 on PhyBench. Ablations further confirm the importance of WIST’s key components for stable open-web learning. Our Code is available at this https URL.

[AI-87] Session Risk Memory (SRM): Temporal Authorization for Deterministic Pre-Execution Safety Gates

【速读】:该论文旨在解决传统确定性预执行安全门(deterministic pre-execution safety gates)在多步分布式攻击场景下的局限性,即这些系统虽能有效验证单个动作的授权合规性,但对意图分散于多个单独合规步骤中的复杂攻击(如慢速数据泄露、权限逐步提升等)缺乏检测能力。其解决方案的关键在于引入轻量级的会话风险记忆(Session Risk Memory, SRM),通过维护一个紧凑的语义中心点来表征代理会话的行为演变轨迹,并利用指数移动平均机制累积基线减去后的门控输出风险信号,从而实现基于轨迹的授权一致性评估(temporal authorization consistency)。SRM无需额外模型组件或训练,仅依赖现有门控模块的语义向量表示,在保持高检测率的同时将误报率降至零,且每轮开销低于250微秒,为智能体系统的会话级安全性提供了理论清晰且工程高效的保障机制。

链接: https://arxiv.org/abs/2603.22350
作者: Florin Adrian Chitan
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 12 pages, 3 figures. Companion paper to arXiv:2603.13247 . Benchmark dataset and artifacts available on Zenodo: https://doi.org/10.5281/zenodo.15410944

点击查看摘要

Abstract:Deterministic pre-execution safety gates evaluate whether individual agent actions are compatible with their assigned roles. While effective at per-action authorization, these systems are structurally blind to distributed attacks that decompose harmful intent across multiple individually-compliant steps. This paper introduces Session Risk Memory (SRM), a lightweight deterministic module that extends stateless execution gates with trajectory-level authorization. SRM maintains a compact semantic centroid representing the evolving behavioral profile of an agent session and accumulates a risk signal through exponential moving average over baseline-subtracted gate outputs. It operates on the same semantic vector representation as the underlying gate, requiring no additional model components, training, or probabilistic inference. We evaluate SRM on a multi-turn benchmark of 80 sessions containing slow-burn exfiltration, gradual privilege escalation, and compliance drift scenarios. Results show that ILION+SRM achieves F1 = 1.0000 with 0% false positive rate, compared to stateless ILION at F1 = 0.9756 with 5% FPR, while maintaining 100% detection rate for both systems. Critically, SRM eliminates all false positives with a per-turn overhead under 250 microseconds. The framework introduces a conceptual distinction between spatial authorization consistency (evaluated per action) and temporal authorization consistency (evaluated over trajectory), providing a principled basis for session-level safety in agentic systems.

[AI-88] Intelligence Inertia: Physical Principles and Applications

【速读】:该论文旨在解决先进智能系统在结构重构过程中,维持符号可解释性所导致的超线性乃至爆炸式计算与能耗成本问题。传统基于Landauer原理(Landauer’s principle)和费雪信息(Fisher Information)的框架仅在稀疏规则约束下近似有效,无法刻画实际适应成本与静态信息论估计之间的显著偏差。其解决方案的关键在于提出“智能惯性”(intelligence inertia)这一新物理属性及其底层非交换性根源——即规则与状态之间不可交换的数学本质,并构建了一个以相对论性J型通胀曲线(relativistic J-shaped inflation curve)为核心的非线性成本公式,该公式形式上类比洛伦兹因子(Lorentz factor),揭示了静态模型所忽视的“计算壁垒”(computational wall)。通过三项决定性实验验证了该理论:对比J曲线与经典费雪信息模型、分析神经架构演化的“之字形”轨迹以及实现考虑惯性的调度器封装器,从而为智能体结构适应的计算开销提供了第一性原理解释。

链接: https://arxiv.org/abs/2603.22347
作者: Jipeng Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
备注: 53 pages, 9 figures

点击查看摘要

Abstract:While Landauer’s principle establishes the fundamental thermodynamic floor for information erasure and Fisher Information provides a metric for local curvature in parameter space, these classical frameworks function effectively only as approximations within regimes of sparse rule-constraints. They fail to explain the super-linear, and often explosive, computational and energy costs incurred when maintaining symbolic interpretability during the reconfiguration of advanced intelligent systems. This paper introduces the property of intelligence inertia and its underlying physical principles as foundational characteristics for quantifying the computational weight of intelligence. We demonstrate that this phenomenon is not merely an empirical observation but originates from the fundamental non-commutativity between rules and states, a root cause we have formally organized into a rigorous mathematical framework. By analyzing the growing discrepancy between actual adaptation costs and static information-theoretic estimates, we derive a non-linear cost formula that mirrors the Lorentz factor, characterizing a relativistic J-shaped inflation curve – a “computational wall” that static models are blind to. The validity of these physical principles is examined through a trilogy of decisive experiments: (1) a comparative adjudication of this J-curve inflation against classical Fisher Information models, (2) a geometric analysis of the “Zig-Zag” trajectory of neural architecture evolution, and (3) the implementation of an inertia-aware scheduler wrapper that optimizes the training of deep networks by respecting the agent’s physical resistance to change. Our results suggest a unified physical description for the cost of structural adaptation, offering a first-principle explanation for the computational and interpretability-maintenance overhead in intelligent agents.

[AI-89] First-Mover Bias in Gradient Boosting Explanations: Mechanism Detection and Resolution

【速读】:该论文旨在解决梯度提升模型中SHAP(Shapley Additive Explanations)特征重要性排序在多重共线性下的不稳定性问题,其核心机制是“先发优势偏倚”(first-mover bias)——即在顺序残差拟合过程中,由于相关特征竞争早期分割节点,导致任一特征一旦被选中便获得自我强化的优势,从而集中SHAP重要性于随机选取的特征而非均匀分配给整个相关组。解决方案的关键在于打破这种序列依赖关系:通过引入模型独立性机制,如DASH(Diversified Aggregation of SHAP)方法和简单的种子平均重训练(Stochastic Retrain),使多个独立训练的模型之间相互抵消先发优势的影响,从而显著提升解释稳定性;实验表明,在相关系数ρ=0.9时,DASH与种子平均均实现稳定度0.977,远优于单模型方案(0.938–0.958),且DASH还提供FSI(Feature Stability Index)和IS(Importance-Stability)图等诊断工具以无监督方式识别先发优势偏倚。

链接: https://arxiv.org/abs/2603.22346
作者: Drake Caraker,Bryan Arnold,David Rhoads
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 8 figures, 12 tables; code at this https URL

点击查看摘要

Abstract:We isolate and empirically characterize first-mover bias – a path-dependent concentration of feature importance caused by sequential residual fitting in gradient boosting – as a specific mechanistic cause of the well-known instability of SHAP-based feature rankings under multicollinearity. When correlated features compete for early splits, gradient boosting creates a self-reinforcing advantage for whichever feature is selected first: subsequent trees inherit modified residuals that favor the incumbent, concentrating SHAP importance on an arbitrary feature rather than distributing it across the correlated group. Scaling up a single model amplifies this effect – a Large Single Model with the same total tree count as our method produces the worst explanations of any approach tested. We demonstrate that model independence is sufficient to resolve first-mover bias in the linear regime, and remains the most effective mitigation under nonlinear data-generating processes. Both our proposed method, DASH (Diversified Aggregation of SHAP), and simple seed-averaging (Stochastic Retrain) restore stability by breaking the sequential dependency chain, confirming that the operative mechanism is independence between explained models. At rho=0.9, both achieve stability=0.977, while the single-best workflow degrades to 0.958 and the Large Single Model to 0.938. On the Breast Cancer dataset, DASH improves stability from 0.32 to 0.93 (+0.61) against a tree-count-matched baseline. DASH additionally provides two diagnostic tools – the Feature Stability Index (FSI) and Importance-Stability (IS) Plot – that detect first-mover bias without ground truth, enabling practitioners to audit explanation reliability before acting on feature rankings. Software and reproducible benchmarks are available at this https URL.

[AI-90] Dynamic Fusion-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition in Conversations

【速读】:该论文旨在解决多模态对话情感识别(Multimodal Emotion Recognition in Conversations, MERC)中因固定参数处理不同情绪类型而导致的模态融合动态性不足问题,从而限制模型在特定情绪类别上的性能表现。解决方案的关键在于提出一种动态融合感知图卷积神经网络(Dynamic Fusion-aware Graph Convolutional Neural Network, DF-GCN),其核心创新包括:将常微分方程(Ordinary Differential Equations, ODEs)引入图卷积网络(GCNs)以捕捉话语交互网络中情感依赖的动态特性,并利用话语全局信息向量(Global Information Vector, GIV)生成提示(prompt)来引导多模态特征的动态融合;该机制使模型在推理阶段可根据不同情绪类别自适应调整网络参数,实现更灵活的情绪分类并提升模型泛化能力。

链接: https://arxiv.org/abs/2603.22345
作者: Tao Meng,Weilun Tang,Yuntao Shou,Yilong Tan,Jun Zhou,Wei Ai,Keqin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:Multimodal emotion recognition in conversations (MERC) aims to identify and understand the emotions expressed by speakers during utterance interaction from multiple modalities (e.g., text, audio, images, etc.). Existing studies have shown that GCN can improve the performance of MERC by modeling dependencies between speakers. However, existing methods usually use fixed parameters to process multimodal features for different emotion types, ignoring the dynamics of fusion between different modalities, which forces the model to balance performance between multiple emotion categories, thus limiting the model’s performance on some specific emotions. To this end, we propose a dynamic fusion-aware graph convolutional neural network (DF-GCN) for robust recognition of multimodal emotion features in conversations. Specifically, DF-GCN integrates ordinary differential equations into graph convolutional networks (GCNs) to capture the dynamic nature of emotional dependencies within utterance interaction networks and leverages the prompts generated by the global information vector (GIV) of the utterance to guide the dynamic fusion of multimodal features. This allows our model to dynamically change parameters when processing each utterance feature, so that different network parameters can be equipped for different emotion categories in the inference stage, thereby achieving more flexible emotion classification and enhancing the generalization ability of the model. Comprehensive experiments conducted on two public multimodal conversational datasets confirm that the proposed DF-GCN model delivers superior performance, benefiting significantly from the dynamic fusion mechanism introduced.

[AI-91] Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation ICLR2026

【速读】:该论文旨在解决当前基于状态空间模型(State-space Models, SSMs)的语言模型中,多头循环结构缺乏结构化利用与可解释性的问题,特别是Mamba2虽具备并行计算能力和优异性能,但其多头独立运行导致参数冗余且难以分析各头的功能分工。解决方案的关键在于提出一种受图信号处理(Graph Signal Processing, GSP)启发的新型架构——Hierarchical ADaptive filter bank for Efficient SSMs (HADES),将Mamba2重新诠释为线图上的自适应滤波器组,通过引入两类结构化滤波器:全局低通行为的共享滤波器和局部高通行为的专家滤波器,后者由对参数Δ的结构化偏置实现,从而在保持高性能的同时显著降低参数量(仅需原模型58.9%),实现了高效、分层且可解释的状态空间建模。

链接: https://arxiv.org/abs/2603.22333
作者: Yehjin Shin,Seojin Kim,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The Fourteenth International Conference on Learning Representations (ICLR 2026)

点击查看摘要

Abstract:State-space models (SSMs) offer efficient alternatives to attention with linear-time recurrence. Mamba2, a recent SSM-based language model, uses selective input gating and a multi-head structure, enabling parallel computation and strong benchmark performance. However, its multi-head recurrence operates independently without structured utilization or analysis. In this work, we propose a novel method called Hierarchical ADaptive filter bank for Efficient SSMs (HADES), a Graph Signal Processing (GSP)-inspired framework that reinterprets Mamba2 as an adaptive filter bank on a line graph. Our hierarchical architecture introduces two filter types: shared filters for global low-pass behavior and expert filters for local high-pass behavior, achieved through structured bias on the parameter \Delta. HADES achieves comparable performance to baseline models including Mamba2 across various benchmarks in language modeling, commonsense reasoning, and long-context retrieval, while using only 58.9% of the original parameters. In this regard, HADES bridges GSP and neural sequence modeling, enabling efficient, hierarchical, and interpretable filtering within state-space models.

[AI-92] Large Language Models for Missing Data Imputation: Understanding Behavior Hallucination Effects and Control Mechanisms

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的缺失数据插补方法在可扩展性、跨模型比较和评估一致性方面存在的局限性,尤其是缺乏对多种缺失机制(MCAR、MAR、MNAR)下大规模真实世界与合成数据集的系统性 benchmarking。其解决方案的关键在于提出一种零样本提示工程(zero-shot prompt engineering)方法,并构建了一个涵盖29个数据集(含9个合成数据集)、覆盖三种缺失机制、最高缺失率达20%的大规模基准测试框架,从而全面比较五种主流LLMs与六种先进插补基线方法的表现。结果表明,LLMs如Gemini 3.0 Flash和Claude 4.5 Sonnet在真实世界开放数据集上显著优于传统方法,但其优势依赖于预训练阶段对领域特定模式的学习;而在合成数据上,传统方法如MICE表现更优,说明LLMs的有效性主要源于语义上下文而非纯统计重建,同时揭示了插补质量与计算开销之间的权衡关系。

链接: https://arxiv.org/abs/2603.22332
作者: Arthur Dantas Mangussi,Ricardo Cardoso Pereira,Ana Carolina Lorena,Pedro Henriques Abreu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data imputation is a cornerstone technique for handling missing values in real-world datasets, which are often plagued by missingness. Despite recent progress, prior studies on Large Language Models-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models’ prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.

[AI-93] Conformal Risk Control for Safety-Critical Wildfire Evacuation Mapping: A Comparative Study of Tabular Spatial and Graph-Based Models

【速读】:该论文旨在解决当前 wildfire spread 预测模型缺乏形式化安全保证的问题,即现有方法无法提供对漏报率(False Negative Rate, FNR)的有限样本保障,导致疏散规划者依赖无形式保证的概率阈值。其解决方案的关键在于首次将 conformal risk control (CRC) 应用于 wildfire 预测任务,通过引入分布无关的安全约束,在保证 FNR ≤ 0.05 的前提下显著提升预测可靠性。实验表明,无论模型架构如何(从 LightGBM 到 Hybrid ResGNN-UNet),标准阈值仅能捕获 7–72% 的真实火势蔓延区域,而 CRC 能统一消除这一缺陷,使空间模型在保持约 95% 火灾覆盖率的同时仅标记约 15% 像素,效率提升达 4.2 倍;此外,作者提出了一种面向操作调度的三类分区 CRC 框架(SAFE/MONITOR/EVACUATE),并揭示了极端类别不平衡下预估加权边界的根本局限性。

链接: https://arxiv.org/abs/2603.22331
作者: Baljinnyam Dayan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Every wildfire prediction model deployed today shares a dangerous property: none of these methods provides formal guarantees on how much fire spread is missed. Despite extensive work on wildfire spread prediction using deep learning, no prior study has applied distribution-free safety guarantees to this domain, leaving evacuation planners reliant on probability thresholds with no formal assurance. We address this gap by presenting, to our knowledge, the first application of conformal risk control (CRC) to wildfire spread prediction, providing finite-sample guarantees on false negative rate (FNR = 0.05). We expose a stark failure: across three model families of increasing complexity (tabular: LightGBM, AUROC 0.854; convolutional: Tiny U-Net, AUROC 0.969; and graph-based: Hybrid ResGNN-UNet, AUROC 0.964), standard thresholds capture only 7-72% of true fire spread. CRC eliminates this failure uniformly. Our central finding is that model architecture determines evacuation efficiency, while CRC determines safety: both spatial models with CRC achieve approximately 95% fire coverage while flagging only approximately 15% of total pixels, making them 4.2x more efficient than LightGBM, while the graph model’s additional complexity over a simple U-Net yields no meaningful efficiency gain. We propose a shift-aware three-way CRC framework that assigns SAFE/MONITOR/EVACUATE zones for operational triage, and characterize a fundamental limitation of prevalence-weighted bounds under extreme class imbalance (approximately 5% fire prevalence). All models, calibration code, and evaluation pipelines are released for reproducibility.

[AI-94] rained Persistent Memory for Frozen Decoder-Only LLM s

【速读】:该论文试图解决的问题是:如何在无状态的解码器-only(decoder-only)语言模型中实现持久的潜在空间记忆(persistent latent-space memory),即在不改变主干模型参数的前提下,使模型能够在多个推理会话之间保留和利用先前学习的信息。其解决方案的关键在于将六种不同机制(如前缀注入、KV扩展、Hebbian记忆等)适配到冻结的GPT-2模型中,仅训练一个小型记忆适配器(θ_mem),并通过自注意力机制实现记忆的写入与读取——区别于编码器-解码器架构中的交叉注意力路径,此处所有方法均依赖自注意力通道完成记忆的注入与访问。实验表明,在1倍容量下,具有强结构先验的方法(如Hebbian记忆、槽位写入)表现显著优于其他方法,揭示了架构设计对记忆能力的关键影响,从而确立了跨主流Transformer架构的持久记忆范式。

链接: https://arxiv.org/abs/2603.22329
作者: Hong Jeong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decoder-only language models are stateless: hidden representations are discarded after every forward pass and nothing persists across sessions. Jeong (2026a) showed that trained memory adapters give a frozen encoder-decoder backbone persistent latent-space memory, building on the lateral-memory framework of Jeong (2026b,c). Here we ask whether the same principle transfers to the decoder-only setting, where no cross-attention pathway exists and memory must enter through self-attention alone. We adapt six methods – prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, and slot-based sparse write – to a frozen GPT-2, training only a small adapter \theta_mem . The write rule is shared; only the read injection changes from decoder cross-attention to self-attention KV prefix or parallel branch. On LoCoMo we find a striking inductive-bias dichotomy: at 1\times capacity, three methods with strong architectural priors – cross-attention (M.2), Hebbian (M.4), and slot write (M.6) – achieve retained-memory scores of 7-18% and knowledge gains \Delta K of 7-10 , while the other three fail ( 0.4% ). At 10\times capacity all six converge, showing the gap is architectural, not fundamental. Together with the encoder-decoder results of Jeong (2026a) and the brain-inspired modules of Jeong (2026b,c), these findings establish persistent latent-space memory as a general paradigm spanning major transformer families.

[AI-95] Beyond the Mean: Distribution-Aware Loss Functions for Bimodal Regression

【速读】:该论文旨在解决机器学习模型在预测不确定性估计中的可信度问题,尤其是在预测误差分布呈现双峰(bimodal)特性时,传统回归方法因假设噪声服从单峰高斯分布而出现均值坍缩(mean-collapse)现象,导致无法准确表达预测置信度。解决方案的关键在于提出一族分布感知损失函数,通过融合归一化均方根误差(normalized RMSE)与Wasserstein距离及Cramér距离,使标准深度回归模型能够在不引入混合模型(Mixture Density Networks, MDNs)优化不稳定性的前提下,有效恢复双峰分布结构。实验表明,所提出的Wasserstein损失在保持单峰任务稳定性的同时,在复杂双峰数据集上将Jensen-Shannon散度降低45%,显著优于MDNs,在保真度和鲁棒性上全面占优,为可信人工智能系统中的偶然性不确定性(aleatoric uncertainty)估计提供了可靠工具。

链接: https://arxiv.org/abs/2603.22328
作者: Abolfazl Mohammadi-Seif,Carlos Soares,Rita P. Ribeiro,Ricardo Baeza-Yates
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 28 pages, 27 figures

点击查看摘要

Abstract:Despite the strong predictive performance achieved by machine learning models across many application domains, assessing their trustworthiness through reliable estimates of predictive confidence remains a critical challenge. This issue arises in scenarios where the likelihood of error inferred from learned representations follows a bimodal distribution, resulting from the coexistence of confident and ambiguous predictions. Standard regression approaches often struggle to adequately express this predictive uncertainty, as they implicitly assume unimodal Gaussian noise, leading to mean-collapse behavior in such settings. Although Mixture Density Networks (MDNs) can represent different distributions, they suffer from severe optimization instability. We propose a family of distribution-aware loss functions integrating normalized RMSE with Wasserstein and Cramér distances. When applied to standard deep regression models, our approach recovers bimodal distributions without the volatility of mixture models. Validated across four experimental stages, our results show that the proposed Wasserstein loss establishes a new Pareto efficiency frontier: matching the stability of standard regression losses like MSE in unimodal tasks while reducing Jensen-Shannon Divergence by 45% on complex bimodal datasets. Our framework strictly dominates MDNs in both fidelity and robustness, offering a reliable tool for aleatoric uncertainty estimation in trustworthy AI systems.

[AI-96] A Direct Classification Approach for Reliable Wind Ramp Event Forecasting under Severe Class Imbalance

【速读】:该论文旨在解决风功率爬坡事件(Wind Power Ramp Events, WPREs)预测中因样本类别严重不平衡导致的传统机器学习模型性能下降的问题。其关键解决方案是将WPRE预测建模为多变量时间序列分类任务,并提出一种数据预处理策略,通过提取近期功率观测特征并掩蔽不可用的爬坡信息,使方法可与传统实时爬坡识别工具集成;同时,结合多数类欠采样与集成学习技术,在保持高准确率(>85%)和加权F1分数(88%)的同时有效缓解了类别不平衡问题。

链接: https://arxiv.org/abs/2603.22326
作者: Alejandro Morales-Hernández,Fabrizio De Caroa,Gian Marco Paldino,Pascal Tribel,Alfredo Vaccaro,Gianluca Bontempi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decision support systems are essential for maintaining grid stability in low-carbon power systems, such as wind power plants, by providing real-time alerts to control room operators regarding potential events, including Wind Power Ramp Events (WPREs). These early warnings enable the timely initiation of more detailed system stability assessments and preventive actions. However, forecasting these events is challenging due to the inherent class imbalance in WPRE datasets, where ramp events are less frequent (typically less than 15% of observed events) compared to normal conditions. Ignoring this characteristic undermines the performance of conventional machine learning models, which often favor the majority class. This paper introduces a novel methodology for WPRE forecasting as a multivariate time series classification task and proposes a data preprocessing strategy that extracts features from recent power observations and masks unavailable ramp information, making it integrable with traditional real-time ramp identification tools. Particularly, the proposed methodology combines majority-class undersampling and ensemble learning to enhance wind ramp event forecasting under class imbalance. Numerical simulations conducted on a real-world dataset demonstrate the superiority of our approach, achieving over 85% accuracy and 88% weighted F1 score, outperforming benchmark classifiers.

[AI-97] Hybrid Associative Memories

【速读】:该论文旨在解决传统序列建模方法中RNN(循环神经网络)与自注意力机制在内存利用效率和长程依赖建模能力上的固有矛盾问题:RNN通过压缩历史信息到固定大小状态实现高效计算,但难以处理长序列中的精确信息召回;而自注意力机制虽能有效捕捉上下文依赖,却因KV缓存(Key-Value cache)随序列长度线性增长导致高昂的内存和计算开销。解决方案的关键在于提出一种混合关联记忆(Hybrid Associative Memory, HAM)层,其核心思想是让RNN负责对整个序列进行压缩表示,同时仅将RNN难以预测的、最具价值的信息通过自注意力机制显式存储于KV缓存中,从而实现数据驱动的KV缓存动态增长,并通过单一连续阈值实现对缓存增长速率的精细控制,进而平衡模型性能与资源消耗。

链接: https://arxiv.org/abs/2603.22325
作者: Leon Lufkin,Tomás Figliolia,Beren Millidge,Kamesh Krishnamurthy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 10 figures

点击查看摘要

Abstract:Recurrent neural networks (RNNs) and self-attention are both widely used sequence-mixing layers that maintain an internal memory. However, this memory is constructed using two orthogonal mechanisms: RNNs compress the entire past into a fixed-size state, whereas self-attention’s state stores every past time step growing its state (the KV cache) linearly with the sequence length. This results in orthogonal strengths and weaknesses. Self-attention layers excel at retrieving information in the context but have large memory and computational costs, while RNNs are more efficient but degrade over longer contexts and underperform for precise recall tasks. Prior work combining these mechanisms has focused primarily on naively interleaving them to reduce computational cost without regard to their complementary mechanisms. We propose the Hybrid Associative Memory (HAM) layer, which combines self-attention and RNNs while leveraging their individual strengths: the RNN compresses the entire sequence, while attention supplements it only with information that is difficult for the RNN to predict, which is hence the most valuable information to explicitly store. HAM layers enable data-dependent growth of the KV cache, which can be precisely controlled by the user with a single, continuous threshold. We find that this fine-grained control of the KV cache growth rate has a smooth trade-off with loss and performance. Empirically, we show that our hybrid architecture offers strong, competitive performance relative to RNNs and Transformers even at substantially lower KV-cache usage.

[AI-98] DAQ: Delta-Aware Quantization for Post-Training LLM Weight Compression

【速读】:该论文旨在解决标准后训练量化(Post-Training Quantization, PTQ)方法在压缩模型时对微调阶段学习到的参数增量(ΔW,即小幅度权重变化)造成过度破坏的问题。传统量化目标以最小化重建误差为核心,忽略了基础模型(base model)的信息,导致量化噪声更易扰动 ΔW,从而损害模型在特定任务(如风格迁移)上的能力。解决方案的关键在于提出 Delta-Aware Quantization (DAQ),其引入两个基于 ΔW 的感知指标——符号保持率(Sign Preservation Rate)和余弦相似度(Cosine Similarity),直接优化 ΔW 的方向保真度,且仅需基础模型与微调后的权重矩阵即可实现,无需额外数据。该方法在 FP8 量化实验中成功恢复了因标准量化而丢失的风格特异性能力,同时维持整体性能。

链接: https://arxiv.org/abs/2603.22324
作者: Xiaoming Yu,Shize Tang,Guanghua Yu,Linchuan Xie,Song Liu,Jianchen Zhu,Feng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Delta-Aware Quantization (DAQ), a data-free post-training quantization framework that preserves the knowledge acquired during post-training. Standard quantization objectives minimize reconstruction error but are agnostic to the base model, allowing quantization noise to disproportionately corrupt the small-magnitude parameter deltas ( \Delta W ) that encode post-training behavior – an effect we analyze through the lens of quantization as implicit regularization. DAQ replaces reconstruction-based objectives with two delta-aware metrics – Sign Preservation Rate and Cosine Similarity – that directly optimize for directional fidelity of \Delta W , requiring only the base and post-trained weight matrices. In a pilot FP8 study, DAQ recovers style-specific capabilities lost under standard quantization while maintaining general performance.

[AI-99] A Multi-Task Targeted Learning Framework for Lithium-Ion Battery State-of-Health and Remaining Useful Life

【速读】:该论文旨在解决锂离子电池状态健康(SOH)与剩余使用寿命(RUL)预测中深度学习方法在特征选择性提取和时序依赖建模方面的局限性,尤其是传统循环神经网络(RNN)在长期时序建模中的不足。其解决方案的关键在于提出一种多任务目标学习框架,集成多尺度特征提取模块、改进的扩展长短期记忆网络(improved extended LSTM)以及双流注意力机制(dual-stream attention module),通过极化注意力和稀疏注意力分别聚焦于SOH与RUL相关的关键特征,并借助双任务层实现从输入到两个输出的多对二映射,从而显著提升预测精度。

链接: https://arxiv.org/abs/2603.22323
作者: Chenhan Wang,Zhengyi Bao,Huipin Lin,Jiahao Nie,Chunxiang Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:Accurately predicting the state-of-health (SOH) and remaining useful life (RUL) of lithium-ion batteries is crucial for ensuring the safe and efficient operation of electric vehicles while minimizing associated risks. However, current deep learning methods are limited in their ability to selectively extract features and model time dependencies for these two parameters. Moreover, most existing methods rely on traditional recurrent neural networks, which have inherent shortcomings in long-term time-series modeling. To address these issues, this paper proposes a multi-task targeted learning framework for SOH and RUL prediction, which integrates multiple neural networks, including a multi-scale feature extraction module, an improved extended LSTM, and a dual-stream attention module. First, a feature extraction module with multi-scale CNNs is designed to capture detailed local battery decline patterns. Secondly, an improved extended LSTM network is employed to enhance the model’s ability to retain long-term temporal information, thus improving temporal relationship modeling. Building on this, the dual-stream attention module-comprising polarized attention and sparse attention to selectively focus on key information relevant to SOH and RUL, respectively, by assigning higher weights to important features. Finally, a many-to-two mapping is achieved through the dual-task layer. To optimize the model’s performance and reduce the need for manual hyperparameter tuning, the Hyperopt optimization algorithm is used. Extensive comparative experiments on battery aging datasets demonstrate that the proposed method reduces the average RMSE for SOH and RUL predictions by 111.3% and 33.0%, respectively, compared to traditional and state-of-the-art methods.

[AI-100] AEGIS: An Operational Infrastructure for Post-Market Governance of Adaptive Medical AI Under US and EU Regulations

【速读】:该论文旨在解决医疗设备中部署的机器学习(Machine Learning, ML)系统在确保安全性的同时实现持续迭代优化的问题。当前监管框架如美国FDA的预设变更控制计划(Predetermined Change Control Plan, PCCP)和欧盟人工智能法案第43(4)条虽提供了管理模型更新的机制,但缺乏可执行的治理基础设施以支持安全、可控的持续学习。解决方案的关键在于提出AI/ML评估与治理安全基础设施(AEGIS),其包含三个核心模块:数据集整合与再训练、模型监控以及条件决策逻辑,并引入四类部署决策分类(APPROVE、CONDITIONAL APPROVAL、CLINICAL REVIEW、REJECT)及独立的上市后监测(Post-Market Surveillance, PMS)警报信号,从而能够在模型性能尚未显著下降前识别出关键状态——即无可用部署模型且当前发布模型已处于风险中的临界情况。通过在脓毒症预测和脑肿瘤分割两个异构临床场景中的验证,AEGIS证明了其能够将监管要求转化为可操作的治理流程,支撑适应性医疗AI的安全持续学习。

链接: https://arxiv.org/abs/2603.22322
作者: Fardin Afdideh,Mehdi Astaraki,Fernando Seoane,Farhad Abtahi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Machine learning systems deployed in medical devices require governance frameworks that ensure safety while enabling continuous improvement. Regulatory bodies including the FDA and European Union have introduced mechanisms such as the Predetermined Change Control Plan (PCCP) and Post-Market Surveillance (PMS) to manage iterative model updates without repeated submissions. This paper presents AI/ML Evaluation and Governance Infrastructure for Safety (AEGIS), a governance framework applicable to any healthcare AI system. AEGIS comprises three modules, i.e., dataset assimilation and retraining, model monitoring, and conditional decision, that operationalize FDA PCCP and EU AI Act Article 43(4) provisions. We implement a four-category deployment decision taxonomy (APPROVE, CONDITIONAL APPROVAL, CLINICAL REVIEW, REJECT) with an independent PMS ALARM signal, enabling detection of the critical state in which no deployable model exists while the released model is simultaneously at risk. To illustrate how AEGIS can be instantiated across heterogeneous clinical contexts, we provide two examples: sepsis prediction from electronic health records and brain tumor segmentation from medical imaging. Both cases use identical governance architecture, differing only in configuration. Across 11 simulated iterations on the sepsis example, AEGIS yielded 8 APPROVE, 1 CONDITIONAL APPROVAL, 1 CLINICAL REVIEW, and 1 REJECT decision, exercising all four categories. ALARM signals were co-issued at iterations 8 and 10, including the critical state where no deployable model exists and the released model is simultaneously failing. AEGIS detected drift before observable performance degradation. These results demonstrate that AEGIS translates regulatory change-control concepts into executable governance procedures, supporting safe continuous learning for adaptive medical AI across diverse clinical applications. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) MSC classes: 68T05 (primary), 62P10, 62L10, 62C05, 90B25 (secondary) Cite as: arXiv:2603.22322 [cs.LG] (or arXiv:2603.22322v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.22322 Focus to learn more arXiv-issued DOI via DataCite

[AI-101] Sparsely-Supervised Data Assimilation via Physics-Informed Schrödinger Bridge

【速读】:该论文旨在解决偏微分方程(Partial Differential Equation, PDE)系统中基于稀疏高保真(High-Fidelity, HF)观测的数据同化(Data Assimilation, DA)问题,即如何在尊重物理约束的前提下,从有限观测中重建完整的时空场。传统方法通常需要在推理阶段进行逐实例的优化,这在时间敏感的应用场景中成为瓶颈。为缓解此问题,论文提出物理信息条件Schrödinger桥(Physics-Informed Conditional Schrödinger Bridge, PICSB),其核心创新在于无需任何额外推理时引导即可实现从低保真(Low-Fidelity, LF)先验到观测条件下的高保真后验的映射;关键机制包括:采用迭代代理端点刷新策略以避免对HF终点的依赖,并将PDE残差直接嵌入训练目标,同时通过硬条件约束确保观测一致性。实验表明,PICSB可在保持高精度的同时实现极快的时空场重建速度。

链接: https://arxiv.org/abs/2603.22319
作者: Dohyun Bu,Chanho Kim,Seokun Choi,Jong-Seok Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages

点击查看摘要

Abstract:Data assimilation (DA) for systems governed by partial differential equations (PDE) aims to reconstruct full spatiotemporal fields from sparse high-fidelity (HF) observations while respecting physical constraints. While full-grid low-fidelity (LF) simulations provide informative priors in multi-fidelity settings, recovering an HF field consistent with both sparse observations and the governing PDE typically requires per-instance test-time optimization, which becomes a major bottleneck in time-critical applications. To alleviate this, amortized reconstruction using generative models has recently been proposed; however, such approaches rely on full-field HF supervision during training, which is often impractical in real-world settings. From a more realistic perspective, we propose the Physics-Informed Conditional Schrödinger Bridge (PICSB), which transports an informative LF prior toward an observation-conditioned HF posterior without any additional inference-time guidance. To enable learning without HF endpoints, PICSB employs an iterative surrogate-endpoint refresh scheme, and directly incorporates PDE residuals into the training objective while enforcing observations via hard conditioning throughout sampling. Experiments on fluid PDE benchmarks demonstrate that PICSB enables extremely fast spatiotemporal field reconstruction while maintaining competitive accuracy under sparse HF supervision.

[AI-102] Geometric Mixture-of-Experts with Curvature-Guided Adaptive Routing for Graph Representation Learning

【速读】:该论文旨在解决图结构数据中复杂的拓扑异质性难以在单一黎曼流形(Riemannian manifold)中准确建模的问题。现有混合曲率方法虽尝试捕捉这种多样性,但常依赖任务驱动的隐式路由机制,缺乏基本的几何基础。解决方案的关键在于提出几何混合专家框架(Geometric Mixture-of-Experts, GeoMoE),其核心是利用Ollivier-Ricci曲率(Ollivier-Ricci Curvature, ORC)作为内在几何先验,引导不同专家在多黎曼空间中的自适应融合。具体而言,设计了一个图感知门控网络以分配节点特定的融合权重,并引入基于曲率对齐的正则化损失确保可解释且几何一致的路由;同时,通过构建基于曲率一致性的正负样本对,引入曲率感知对比目标以增强几何判别能力。

链接: https://arxiv.org/abs/2603.22317
作者: Haifang Cao,Yu Wang,Timing Li,Xinjie Yao,Pengfei Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-structured data typically exhibits complex topological heterogeneity, making it difficult to model accurately within a single Riemannian manifold. While emerging mixed-curvature methods attempt to capture such diversity, they often rely on implicit, task-driven routing that lacks fundamental geometric grounding. To address this challenge, we propose a Geometric Mixture-of-Experts framework (GeoMoE) that adaptively fuses node representations across diverse Riemannian spaces to better accommodate multi-scale topological structures. At its core, GeoMoE leverages Ollivier-Ricci Curvature (ORC) as an intrinsic geometric prior to orchestrate the collaboration of specialized experts. Specifically, we design a graph-aware gating network that assigns node-specific fusion weights, regularized by a curvature-guided alignment loss to ensure interpretable and geometry-consistent routing. Additionally, we introduce a curvature-aware contrastive objective that promotes geometric discriminability by constructing positive and negative pairs according to curvature consistency. Extensive experiments on six benchmark datasets demonstrate that GeoMoE outperforms state-of-the-art baselines across diverse graph types.

[AI-103] Emergency Preemption Without Online Exploration: A Decision Transformer Approach

【速读】:该论文旨在解决紧急车辆(Emergency Vehicle, EV)响应时间优化问题,现有信号预emption策略多为反应式且缺乏可控性,难以在保障EV通行效率的同时最小化对民用交通的干扰。其核心解决方案是基于决策变压器(Decision Transformer, DT)提出一种返回条件化的应急走廊优化框架,关键在于将走廊优化建模为离线、返回条件化的序列建模任务,从而实现:(1) 策略学习阶段无需在线环境交互,提升训练效率;(2) 通过单一目标返回标量实现调度级紧迫度控制,灵活调节EV与民用交通间的权衡;(3) 进一步扩展至多智能体场景,利用图注意力机制(graph attention)实现空间协同,显著降低EV行驶时间和停靠次数,同时有效控制民用交通延误。

链接: https://arxiv.org/abs/2603.22315
作者: Haoran Su,Hanxiao Deng,Yandong Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emergency vehicle (EV) response time is a critical determinant of survival outcomes, yet deployed signal preemption strategies remain reactive and uncontrollable. We propose a return-conditioned framework for emergency corridor optimization based on the Decision Transformer (DT). By casting corridor optimization as offline, return-conditioned sequence modeling, our approach (1) eliminates online environment interaction during policy learning, (2) enables dispatch-level urgency control through a single target-return scalar, and (3) extends to multi-agent settings via a Multi-Agent Decision Transformer (MADT) with graph attention for spatial coordination. On the LightSim simulator, DT reduces average EV travel time by 37.7% relative to fixed-timing preemption on a 4x4 grid (88.6 s vs. 142.3 s), achieving the lowest civilian delay (11.3 s/veh) and fewest EV stops (1.2) among all methods, including online RL baselines that require environment interaction. MADT further improves on larger grids, overtaking DT with 45.2% reduction on 8x8 via graph-attention coordination. Return conditioning produces a smooth dispatch interface: varying the target return from 100 to -400 trades EV travel time (72.4-138.2 s) against civilian delay (16.8-5.4 s/veh), requiring no retraining. A Constrained DT extension adds explicit civilian disruption budgets as a second control knob.

[AI-104] Enhancing AI-Based Tropical Cyclone Track and Intensity Forecasting via Systematic Bias Correction

【速读】:该论文旨在解决当前生成式AI在热带气旋(Tropical Cyclone, TC)预报中面临的两大核心问题:一是基于粗分辨率再分析数据(如ERA5,空间分辨率为0.25°)训练的模型受限于固定网格,导致TC路径预测存在显著离散化误差;二是强度预报尤其对强台风的预测能力不足,主要源于气象场平滑效应及回归损失函数导致预测结果偏向条件均值。解决方案的关键在于提出BaguanCyclone框架,其包含两个创新模块:(1) 概率中心精修模块(probabilistic center refinement module),通过建模TC中心的连续空间分布实现更精细的路径精度;(2) 区域感知强度预报模块(region-aware intensity forecasting module),利用动态定义的子网格区域内的高分辨率内部表征捕捉TC核心附近的局部极端强度特征,从而提升对强TC强度变化的预测能力。

链接: https://arxiv.org/abs/2603.22314
作者: Peisong Niu,Haifan Zhang,Yang Zhao,Tian Zhou,Ziqing Ma,Wenqiang Shen,Junping Zhao,Huiling Yuan,Liang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tropical cyclones (TCs) pose severe threats to life, infrastructure, and economies in tropical and subtropical regions, underscoring the critical need for accurate and timely forecasts of both track and intensity. Recent advances in AI-based weather forecasting have shown promise in improving TC track forecasts. However, these systems are typically trained on coarse-resolution reanalysis data (e.g., ERA5 at 0.25 degree), which constrains predicted TC positions to a fixed grid and introduces significant discretization errors. Moreover, intensity forecasting remains limited especially for strong TCs by the smoothing effect of coarse meteorological fields and the use of regression losses that bias predictions toward conditional means. To address these limitations, we propose BaguanCyclone, a novel, unified framework that integrates two key innovations: (1) a probabilistic center refinement module that models the continuous spatial distribution of TC centers, enabling finer track precision; and (2) a region-aware intensity forecasting module that leverages high-resolution internal representations within dynamically defined sub-grid zones around the TC core to better capture localized extremes. Evaluated on the global IBTrACS dataset across six major TC basins, our system consistently outperforms both operational numerical weather prediction (NWP) models and most AI-based baselines, delivering a substantial enhancement in forecast accuracy. Remarkably, BaguanCyclone excels in navigating meteorological complexities, consistently delivering accurate forecasts for re-intensification, sweeping arcs, twin cyclones, and meandering events. Our code is available at this https URL.

[AI-105] A Multi-Modal CNN-LSTM Framework with Multi-Head Attention and Focal Loss for Real-Time Elderly Fall Detection

【速读】:该论文旨在解决老年人跌倒检测中因单一模态加速度数据导致误报率高,以及传统机器学习方法依赖大量手工特征工程的问题。其关键解决方案在于提出了一种多模态深度学习框架 MultiModalFallDetector,通过融合三轴加速度计、陀螺仪和四通道生理信号,结合多尺度卷积神经网络(CNN)提取不同时间分辨率下的运动动态特征,并引入多头自注意力机制实现动态时间加权,同时采用Focal Loss缓解类别不平衡问题,辅以活动分类任务进行正则化,并基于迁移学习从UCI HAR数据集到SisFall数据集提升模型泛化能力。实验表明,该方法在真实场景下对老年人跌倒检测的F1分数达98.7,召回率为98.9,AUC-ROC为99.4,且推理延迟低于50ms,具备实时部署潜力。

链接: https://arxiv.org/abs/2603.22313
作者: Lijie Zhou,Luran Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing global aging population has intensified the demand for reliable health monitoring systems, particularly those capable of detecting critical events such as falls among elderly individuals. Traditional fall detection approaches relying on single-modality acceleration data suffer from high false alarm rates, while conventional machine learning methods require extensive hand-crafted feature engineering. This paper proposes a novel multi-modal deep learning framework, MultiModalFallDetector, designed for real-time elderly fall detection using wearable sensors. Our approach integrates multiple innovations: a multi-scale CNN-based feature extractor capturing motion dynamics at varying temporal resolutions; fusion of tri-axial accelerometer, gyroscope, and four-channel physiological signals; incorporation of a multi-head self-attention mechanism for dynamic temporal weighting; adoption of Focal Loss to mitigate severe class imbalance; introduction of an auxiliary activity classification task for regularization; and implementation of transfer learning from UCI HAR to SisFall dataset. Extensive experiments on the SisFall dataset, which includes real-world simulated fall trials from elderly participants (aged 60-85), demonstrate that our framework achieves an F1-score of 98. 7, Recall of 98. 9, and AUC-ROC of 99. 4, significantly outperforming baseline methods including traditional machine learning and standard deep learning approaches. The model maintains sub- 50ms inference latency on edge devices, confirming its suitability for real-time deployment in geriatric care settings.

[AI-106] UniFluids: Unified Neural Operator Learning with Conditional Flow-matching

【速读】:该论文旨在解决多类偏微分方程(Partial Differential Equation, PDE)在不同维度和物理变量下的解算子(solution operator)学习问题,即如何构建一个统一框架以实现跨多种PDE的高效、高精度建模。其解决方案的关键在于提出UniFluids——一种基于流匹配(flow-matching)的条件生成框架,利用扩散Transformer(diffusion Transformer)的可扩展性,在统一的四维时空表示下对异构PDE数据集进行联合训练与条件编码;同时引入x-预测策略(x-prediction)来优化流匹配过程,显著提升预测精度,并在1D、2D和3D多个PDE数据集上验证了其高准确率、良好可扩展性和跨场景泛化能力。

链接: https://arxiv.org/abs/2603.22309
作者: Haosen Li,Qi Meng,Jiahao Li,Rui Zhang,Ruihua Song,Liang Ma,Zhi-Ming Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint version. Work in progress

点击查看摘要

Abstract:Partial differential equation (PDE) simulation holds extensive significance in scientific research. Currently, the integration of deep neural networks to learn solution operators of PDEs has introduced great potential. In this paper, we present UniFluids, a conditional flow-matching framework that harnesses the scalability of diffusion Transformer to unify learning of solution operators across diverse PDEs with varying dimensionality and physical variables. Unlike the autoregressive PDE foundation models, UniFluids adopts flow-matching to achieve parallel sequence generation, making it the first such approach for unified operator learning. Specifically, the introduction of a unified four-dimensional spatiotemporal representation for the heterogeneous PDE datasets enables joint training and conditional encoding. Furthermore, we find the effective dimension of the PDE dataset is much lower than its patch dimension. We thus employ x -prediction in the flow-matching operator learning, which is verified to significantly improve prediction accuracy. We conduct a large-scale evaluation of UniFluids on several PDE datasets covering spatial dimensions 1D, 2D and 3D. Experimental results show that UniFluids achieves strong prediction accuracy and demonstrates good scalability and cross-scenario generalization capability. The code will be released later.

[AI-107] Memory Bear AI Memory Science Engine for Multimodal Affective Intelligence: A Technical Report

【速读】:该论文旨在解决当前多模态情感识别(Multimodal Emotion Recognition, MER)系统在真实交互场景中面临的局限性,尤其是其对短期推理的过度优化以及缺乏对持续情感记忆、长时依赖建模和不完整输入下鲁棒性解释的支持。现有方法往往将情绪视为瞬时输出标签,忽略了情感意义依赖于先前轨迹、累积上下文和多模态证据的特性,尤其在信号弱、噪声大或模态缺失时性能显著下降。解决方案的关键在于提出一种以记忆为中心的框架——Memory Bear AI Memory Science Engine,其核心创新是将情感信息建模为记忆系统中的结构化且动态演化的变量,通过生成结构化的“情感记忆单元”(Emotion Memory Units, EMUs),实现跨交互周期的情感信息存储、激活与修正,从而支持工作记忆聚合、长期巩固、驱动检索、动态融合校准及持续更新等机制,显著提升了模型在复杂现实场景下的准确性与鲁棒性。

链接: https://arxiv.org/abs/2603.22306
作者: Deliang Wen,Ke Sun,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Affective judgment in real interaction is rarely a purely local prediction problem. Emotional meaning often depends on prior trajectory, accumulated context, and multimodal evidence that may be weak, noisy, or incomplete at the current moment. Although multimodal emotion recognition (MER) has improved the integration of text, speech, and visual signals, many existing systems remain optimized for short-range inference and provide limited support for persistent affective memory, long-horizon dependency modeling, and robust interpretation under imperfect input. This technical report presents the Memory Bear AI Memory Science Engine, a memory-centered framework for multimodal affective intelligence. Instead of treating emotion as a transient output label, the framework models affective information as a structured and evolving variable within a memory system. It organizes processing through structured memory formation, working-memory aggregation, long-term consolidation, memory-driven retrieval, dynamic fusion calibration, and continuous memory updating. At its core, multimodal signals are transformed into structured Emotion Memory Units (EMUs), enabling affective information to be preserved, reactivated, and revised across interaction horizons. Experimental results show consistent gains over comparison systems across benchmark and business-grounded settings, with stronger accuracy and robustness, especially under noisy or missing-modality conditions. The framework offers a practical step from local emotion recognition toward more continuous, robust, and deployment-relevant affective intelligence. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22306 [cs.AI] (or arXiv:2603.22306v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.22306 Focus to learn more arXiv-issued DOI via DataCite

[AI-108] CN-Buzz2Portfolio: A Chinese-Market Dataset and Benchmark for LLM -Based Macro and Sector Asset Allocation from Daily Trending Financial News

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在金融领域从静态自然语言处理任务向动态决策代理演进过程中面临的评估困境:直接实盘交易存在不可复现性和结果偏差(将运气误判为能力),而现有静态基准则局限于个股筛选,忽视了市场整体关注度对资产配置的影响。其解决方案的关键在于构建一个可复现的基准——CN-Buzz2Portfolio,该基准基于中国市场的每日热点新闻映射至宏观与行业层面的资产配置,并采用三阶段CPA(Compression, Perception, Allocation)工作流,使LLMs在ETF等广义资产类别上进行投资决策,从而降低个股特异性波动,更真实地评估模型将宏观叙事转化为投资组合权重的能力。

链接: https://arxiv.org/abs/2603.22305
作者: Liyuan Chen,Shilong Li,Jiangpeng Yan,Shuoling Liu,Qiang Yang,Xiu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly transitioning from static Natural Language Processing (NLP) tasks including sentiment analysis and event extraction to acting as dynamic decision-making agents in complex financial environments. However, the evolution of LLMs into autonomous financial agents faces a significant dilemma in evaluation paradigms. Direct live trading is irreproducible and prone to outcome bias by confounding luck with skill, whereas existing static benchmarks are often confined to entity-level stock picking and ignore broader market attention. To facilitate the rigorous analysis of these challenges, we introduce CN-Buzz2Portfolio, a reproducible benchmark grounded in the Chinese market that maps daily trending news to macro and sector asset allocation. Spanning a rolling horizon from 2024 to mid-2025, our dataset simulates a realistic public attention stream, requiring agents to distill investment logic from high-exposure narratives instead of pre-filtered entity news. We propose a Tri-Stage CPA Agent Workflow involving Compression, Perception, and Allocation to evaluate LLMs on broad asset classes such as Exchange Traded Funds (ETFs) rather than individual stocks, thereby reducing idiosyncratic volatility. Extensive experiments on nine LLMs reveal significant disparities in how models translate macro-level narratives into portfolio weights. This work provides new insights into the alignment between general reasoning and financial decision-making, and all data, codes, and experiments are released to promote sustainable financial agent research.

[AI-109] Sample Transform Cost-Based Training-Free Hallucination Detector for Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的幻觉(Hallucination)问题,即模型生成与事实不符或缺乏依据的内容,这严重制约了LLM在可信场景下的部署。其解决方案的关键在于:提出利用最优传输(Optimal Transport)距离构建样本间token嵌入的Wasserstein距离矩阵,从而量化LLM在给定提示下所定义条件分布的复杂性;基于该矩阵进一步提取两个互补信号——AvgWD(平均成本)和EigenWD(成本复杂度),二者共同作为无需训练的幻觉检测指标。该方法不依赖于模型内部结构或额外标注数据,具有轻量、通用且可扩展至黑盒模型的优势。

链接: https://arxiv.org/abs/2603.22303
作者: Zeyang Ding,Xinglin Hu,Jicong Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Hallucinations in large language models (LLMs) remain a central obstacle to trustworthy deployment, motivating detectors that are accurate, lightweight, and broadly applicable. Since an LLM with a prompt defines a conditional distribution, we argue that the complexity of the distribution is an indicator of hallucination. However, the density of the distribution is unknown and the samples (i.e., responses generated for the prompt) are discrete distributions, which leads to a significant challenge in quantifying the complexity of the distribution. We propose to compute the optimal-transport distances between the sets of token embeddings of pairwise samples, which yields a Wasserstein distance matrix measuring the costs of transforming between the samples. This Wasserstein distance matrix provides a means to quantify the complexity of the distribution defined by the LLM with the prompt. Based on the Wasserstein distance matrix, we derive two complementary signals: AvgWD, measuring the average cost, and EigenWD, measuring the cost complexity. This leads to a training-free detector for hallucinations in LLMs. We further extend the framework to black-box LLMs via teacher forcing with an accessible teacher model. Experiments show that AvgWD and EigenWD are competitive with strong uncertainty baselines and provide complementary behavior across models and datasets, highlighting distribution complexity as an effective signal for LLM truthfulness.

[AI-110] Latent Semantic Manifolds in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在连续向量空间中进行内部计算却输出离散标记(token)所带来的语义失真问题,即词汇离散化与隐状态连续表示之间的根本性不匹配。其解决方案的关键在于构建一个数学框架,将LLM的隐藏状态视为嵌入在潜在语义流形(latent semantic manifold)上的点,该流形是一个配备Fisher信息度量的黎曼子流形,而每个token对应于该流形上的Voronoi区域划分。通过引入“表达能力差距”(expressibility gap)这一几何度量来量化词汇离散化导致的语义扭曲,并基于共面积公式(coarea formula)推导出表达能力差距与模型规模呈线性增长关系,从而从几何角度揭示了LLM内部表示的本质特性及其对性能的影响机制。

链接: https://arxiv.org/abs/2603.22301
作者: Mohamed A. Mabrok
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) perform internal computations in continuous vector spaces yet produce discrete tokens – a fundamental mismatch whose geometric consequences remain poorly understood. We develop a mathematical framework that interprets LLM hidden states as points on a latent semantic manifold: a Riemannian submanifold equipped with the Fisher information metric, where tokens correspond to Voronoi regions partitioning the manifold. We define the expressibility gap, a geometric measure of the semantic distortion from vocabulary discretization, and prove two theorems: a rate-distortion lower bound on distortion for any finite vocabulary, and a linear volume scaling law for the expressibility gap via the coarea formula. We validate these predictions across six transformer architectures (124M-1.5B parameters), confirming universal hourglass intrinsic dimension profiles, smooth curvature structure, and linear gap scaling with slopes 0.87-1.12 (R^2 0.985). The margin distribution across models reveals a persistent hard core of boundary-proximal representations invariant to scale, providing a geometric decomposition of perplexity. We discuss implications for architecture design, model compression, decoding strategies, and scaling laws

[AI-111] Scaling Attention via Feature Sparsity ICLR2026

【速读】:该论文旨在解决Transformer模型在处理超长序列时因自注意力机制(self-attention)计算复杂度高达O(n2d)O(n^2 d)而导致的性能瓶颈问题。现有方法通常通过局部窗口、核近似或token级稀疏化来降低计算成本,但这些策略往往导致模型精度下降。论文提出了一种正交思路——特征稀疏性(feature sparsity),设计了Sparse Feature Attention (SFA),其中查询和键被表示为kk-稀疏编码,在保持高维表达能力的同时,将注意力计算复杂度从Θ(n2d)\Theta(n^2 d)降至Θ(n2k2/d)\Theta(n^2 k^2/d)。其核心创新在于引入FlashSFA,一种IO感知的内核,可在不显式生成稠密分数矩阵的情况下直接处理稀疏重叠区域,从而实现高效扩展。实验表明,SFA在GPT-2与Qwen3预训练任务中达到与密集基线相当的性能,同时提升速度达2.5倍,并减少约50%的FLOPs和KV缓存占用,且在长上下文下的检索准确性和鲁棒性优于短嵌入基线,验证了特征级稀疏性作为高效注意力的新方向。

链接: https://arxiv.org/abs/2603.22300
作者: Yan Xie,Tiansheng Wen,Tangda Huang,Bo Chen,Chenyu You,Stefanie Jegelka,Yifei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 11 figures; Accepted at ICLR 2026

点击查看摘要

Abstract:Scaling Transformers to ultra-long contexts is bottlenecked by the O(n^2 d) cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as k -sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from \Theta(n^2 d) to \Theta(n^2 k^2/d) . To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to 2.5\times and reducing FLOPs and KV-cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss. Code is available at this https URL.

[AI-112] Between the Layers Lies the Truth: Uncertainty Estimation in LLM s Using Intra-Layer Local Information Scores

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中常出现“自信错误”(confidently wrong)的问题,从而强调可靠不确定性估计(Uncertainty Estimation, UE)的重要性。现有方法中,基于输出的启发式策略虽计算成本低但鲁棒性差,而探测内部表示的方法虽有效却因高维性和迁移困难限制了实用性。本文提出一种轻量级、实例级的不确定性估计方法,其核心创新在于:通过单次前向传播即可对内部表示中跨层一致性模式进行评分,从而捕捉可迁移的不确定性信号。该方法在分布内和跨数据集迁移场景下均优于传统探测方法,并在4-bit权重量化下仍保持鲁棒性,展现出良好的实用性与泛化能力。

链接: https://arxiv.org/abs/2603.22299
作者: Zvi N. Badash,Yonatan Belinkov,Moti Freiman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are often confidently wrong, making reliable uncertainty estimation (UE) essential. Output-based heuristics are cheap but brittle, while probing internal representations is effective yet high-dimensional and hard to transfer. We propose a compact, per-instance UE method that scores cross-layer agreement patterns in internal representations using a single forward pass. Across three models, our method matches probing in-distribution, with mean diagonal differences of at most -1.8 AUPRC percentage points and +4.9 Brier score points. Under cross-dataset transfer, it consistently outperforms probing, achieving off-diagonal gains up to +2.86 AUPRC and +21.02 Brier points. Under 4-bit weight-only quantization, it remains robust, improving over probing by +1.94 AUPRC points and +5.33 Brier points on average. Beyond performance, examining specific layer–layer interactions reveals differences in how disparate models encode uncertainty. Altogether, our UE method offers a lightweight, compact means to capture transferable uncertainty in LLMs. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.22299 [cs.LG] (or arXiv:2603.22299v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.22299 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Zvi Badash [view email] [v1] Tue, 17 Mar 2026 08:35:14 UTC (1,006 KB)

[AI-113] Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks

【速读】:该论文旨在解决生成式数据(Synthetic Data)在用于微调小型但计算效率更高的大语言模型(Large Language Models, LLMs)时,如何确保生成数据的质量与多样性这一关键挑战。解决方案的关键在于基于嵌入空间(embedding space)的采样策略:通过分析生成数据在嵌入空间中的分布密度,发现局部区域内的样本密度与模型预测准确率呈强相关性;据此提出一种基于嵌入的针对性采样管道,从而增强数据多样性,并在多个基准测试中稳定提升模型性能。

链接: https://arxiv.org/abs/2603.22294
作者: Srideepika Jayaraman,Achille Fokoue,Dhaval Patel,Jayant Kalagnanam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synthetic Data Generation (SDG), leveraging Large Language Models (LLMs), has recently been recognized and broadly adopted as an effective approach to improve the performance of smaller but more resource and compute efficient LLMs through fine-tuning. A key challenge in SDG is ensuring the quality and diversity of the generated data. In this paper, we analyze the diversity and distribution of generated data in the embedding space, and demonstrate a strong correlation between the density of examples within a specific neighborhood and the accuracy of predictions on examples drawn from that region. Building on this insight, we present a targeted pipeline for embedding-based sampling that enhances data diversity and consistently improves performance across several benchmarks.

[AI-114] Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning ICAPS2026

【速读】:该论文旨在解决现实世界中序贯决策任务中奖励最大化与安全约束之间的冲突问题,此类冲突常导致不稳定的极值优化或对抗性优化。现有基于可达性分析的方法大多仅处理硬性安全约束,难以扩展至累积成本约束。其解决方案的关键在于定义一种安全条件可达集(safety-conditioned reachability set),该集合将奖励最大化与累积安全成本约束解耦,从而在无需不稳定极小极大优化或拉格朗日乘子法的情况下,确保策略的安全性。基于此,作者提出了一种新型离线安全强化学习(offline safe reinforcement learning)算法,能够在固定数据集上训练出满足安全约束的策略,而无需环境交互。实验表明,该方法在标准离线安全强化学习基准和真实世界的海上导航任务中均能实现与最先进方法相当或更优的安全性能。

链接: https://arxiv.org/abs/2603.22292
作者: Janaka Chathuranga Brahmanage,Akshat Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

点击查看摘要

Abstract:Sequential decision making using Markov Decision Process underpins many realworld applications. Both model-based and model free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization with safety constraints, often conflicting objectives, that can lead to unstable min/max, adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe state, action set, ensuring that an agent starting inside this set remains safe indefinitely. Yet, most reachability based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this, first, we define a safetyconditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min/max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks, and a real world maritime navigation task demonstrate that our method matches or outperforms state of the art baselines while maintaining safety.

[AI-115] Automated Microservice Pattern Instance Detection Using Infrastructure-as-Code Artifacts and Large Language Models

【速读】:该论文旨在解决软件架构知识易流失的问题,特别是微服务架构模式实例难以通过源代码分析有效识别的挑战。现有模式检测方法往往复杂且难以扩展,导致实践者难以系统性地记录和复用关键架构信息。解决方案的关键在于开发一个名为MicroPAD的原型工具,利用大语言模型(Large Language Models, LLMs)对基础设施即代码(Infrastructure-as-Code, IaC) artifacts进行分析,从而自动化识别微服务模式实例。该方法显著降低了检测成本,并提升了可扩展性和适用范围,为开发者提供了一种低成本、高效率获取架构知识的途径。

链接: https://arxiv.org/abs/2502.04188
作者: Carlos Eduardo Duarte
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: ICSA 2025 - International Conference on Software Architecture. 6 pages

点击查看摘要

Abstract:Documenting software architecture is essential to preserve architecture knowledge, even though it is frequently costly. Architecture pattern instances, including microservice pattern instances, provide important structural software information. Practitioners should document this information to prevent knowledge vaporization. However, architecture patterns may not be detectable by analyzing source code artifacts, requiring the analysis of other types of artifacts. Moreover, many existing pattern detection instance approaches are complex to extend. This article presents our ongoing PhD research, early experiments, and a prototype for a tool we call MicroPAD for automating the detection of microservice pattern instances. The prototype uses Large Language Models (LLMs) to analyze Infrastructure-as-Code (IaC) artifacts to aid detection, aiming to keep costs low and maximize the scope of detectable patterns. Early experiments ran the prototype thrice in 22 GitHub projects. We verified that 83% of the patterns that the prototype identified were in the project. The costs of detecting the pattern instances were minimal. These results indicate that the approach is likely viable and, by lowering the entry barrier to automating pattern instance detection, could help democratize developer access to this category of architecture knowledge. Finally, we present our overall research methodology, planned future work, and an overview of MicroPAD’s potential industrial impact.

[AI-116] Leverag ing LLM s and Social Media to Understand User Perception of Smartphone-Based Earthquake Early Warnings

【速读】:该论文旨在解决智能手机-based地震预警系统(Earthquake Early Warning, EEW)在实际应用中用户感知准确性与工程定义准确性之间存在的差异问题。其关键解决方案在于利用大型语言模型(Large Language Models, LLMs)对超过500条来自X平台的社交媒体帖子进行分析,提取出42个与用户体验和行为相关的属性,并通过统计分析揭示了用户信任度与预警及时性之间的强相关性,从而明确指出:在用户认知中,“及时性”即等同于“准确性”。这一发现为优化警报设计、制定公众教育策略及未来行为研究提供了可操作的科学依据。

链接: https://arxiv.org/abs/2603.23322
作者: Hanjing Wang,S. Mostafa Mousavi,Patrick Robertson,Richard M. Allen,Alexie Barski,Robert Bosch,Nivetha Thiruverahan,Youngmin Cho,Tajinder Gadh,Steve Malkos,Boone Spooner,Greg Wimpey,Marc Stogaitis
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Geophysics (physics.geo-ph)
备注:

点击查看摘要

Abstract:Android’s Earthquake Alert (AEA) system provided timely early warnings to millions during the Mw 6.2 Marmara Ereglisi, Türkiye earthquake on April 23, 2025. This event, the largest in the region in 25 years, served as a critical real-world test for smartphone-based Earthquake Early Warning (EEW) systems. The AEA system successfully delivered alerts to users with high precision, offering over a minute of warning before the strongest shaking reached urban areas. This study leveraged Large Language Models (LLMs) to analyze more than 500 public social media posts from the X platform, extracting 42 distinct attributes related to user experience and behavior. Statistical analyses revealed significant relationships, notably a strong correlation between user trust and alert timeliness. Our results indicate a distinction between engineering and the user-centric definition of system accuracy. We found that timeliness is accuracy in the user’s mind. Overall, this study provides actionable insights for optimizing alert design, public education campaigns, and future behavioral research to improve the effectiveness of such systems in seismically active regions.

[AI-117] Off-Policy Evaluation and Learning for Survival Outcomes under Censoring

【速读】:该论文旨在解决在右删失(right-censored)生存结果下,基于日志数据进行离策略评估(Off-Policy Evaluation, OPE)和离策略学习(Off-Policy Learning, OPL)时存在的系统性偏差问题。传统估计方法因忽略删失时间后的未观测生存时间,导致对策略性能的低估。其解决方案的关键在于引入逆概率删失加权(Inverse Probability of Censoring Weighting, IPCW)技术,构建IPCW-IPS与IPCW-DR两种新估计器:前者通过权重修正删失偏差,后者进一步实现双重稳健性(double robustness),即只要倾向得分模型或结果模型其中之一正确,即可保证估计的一致性。此外,作者还将该框架扩展至带预算约束的离策略学习任务,从而在实际应用中提升决策优化的可靠性与有效性。

链接: https://arxiv.org/abs/2603.22900
作者: Kohsuke Kubota,Mitsuhiro Takahashi,Yuta Saito
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Preprint

点击查看摘要

Abstract:Optimizing survival outcomes, such as patient survival or customer retention, is a critical objective in data-driven decision-making. Off-Policy Evaluation~(OPE) provides a powerful framework for assessing such decision-making policies using logged data alone, without the need for costly or risky online experiments in high-stakes applications. However, typical estimators are not designed to handle right-censored survival outcomes, as they ignore unobserved survival times beyond the censoring time, leading to systematic underestimation of the true policy performance. To address this issue, we propose a novel framework for OPE and Off-Policy Learning~(OPL) tailored for survival outcomes under censoring. Specifically, we introduce IPCW-IPS and IPCW-DR, which employ the Inverse Probability of Censoring Weighting technique to explicitly deal with censoring bias. We theoretically establish that our estimators are unbiased and that IPCW-DR achieves double robustness, ensuring consistency if either the propensity score or the outcome model is correct. Furthermore, we extend this framework to constrained OPL to optimize policy value under budget constraints. We demonstrate the effectiveness of our proposed methods through simulation studies and illustrate their practical impacts using public real-world data for both evaluation and learning tasks.

[AI-118] Quantum Random Forest for the Regression Problem

【速读】:该论文旨在解决传统随机森林(Random Forest)模型在回归问题中的预测效率瓶颈,提出了一种基于量子计算的算法来加速其测试(预测)过程。解决方案的关键在于利用量子计算的优势,在查询复杂度或运行时间上显著优于经典随机森林模型,从而实现对回归任务更高效的预测性能。

链接: https://arxiv.org/abs/2603.22790
作者: Kamil Khadiev,Liliya Safina
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: Accepted in Quantum Computing - Artificial Intelligence for Industry Applications and Scientific Discovery A Workshop at the IEEE International Conference on Quantum Communications, Networking, and Computing (QCNC) 2026

点击查看摘要

Abstract:The Random Forest model is one of the popular models of Machine learning. We present a quantum algorithm for testing (forecasting) process of the Random Forest machine learning model for the Regression problem. The presented algorithm is more efficient (in terms of query complexity or running time) than the classical counterpart.

[AI-119] Cognitive Training for Language Models: Towards General Capabilities via Cross-Entropy Games

【速读】:该论文旨在解决如何自动构建语言模型的通用能力(general capabilities)这一开放性问题,核心在于设计一种可自动扩展的任务课程(curriculum of tasks),以通过相关技能发现逐步提升模型性能。其解决方案的关键在于提出了一种基于交叉熵博弈(cross-entropy games)的框架,并论证在自然假设下,若能通过贪心优化算法迭代生成课程,则本质上唯一可行的元目标(meta-objective)是“认知训练”(cognitive training)。该方法假设具备足够能力的语言模型和元采样器(meta-samplers)以及充足的训练时间,即可实现系统性的相关技能发现,从而为通用能力的获取提供一种原则性的路径。

链接: https://arxiv.org/abs/2603.22479
作者: Clément Hongler,Franck Gabriel,Valentin Hartmann,Arthur Renard,Andrew Emil
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Defining a constructive process to build general capabilities for language models in an automatic manner is considered an open problem in artificial intelligence. Towards this, we consider the problem of building a curriculum of tasks that grows a model via relevant skill discovery. We provide a concrete framework for this task, using a family of tasks called cross-entropy games, which we postulate is universal in a suitable sense. We show that if it is possible to grow the curriculum for relevant skill discovery by iterating a greedy optimization algorithm, then, under natural assumptions, there is essentially only one meta-objective possible (up to a few hyperparameters). We call the resulting process cognitive training. We postulate that, given sufficiently capable language models as players and meta-samplers and sufficient training time, cognitive training provides a principled way to relevant skill discovery; and hence to the extent general capabilities are achievable via greedy curriculum learning, cognitive training would be a solution.

[AI-120] Latent Style-based Quantum Wasserstein GAN for Drug Design

【速读】:该论文旨在解决传统生成式人工智能模型(如生成对抗网络,GANs)在药物设计中因训练困难(如 barren plateaus)和模式崩溃(mode collapse)导致的性能瓶颈问题。其关键解决方案是提出一种基于风格的量子生成对抗网络(style-based quantum GAN, QGAN)架构:通过在量子电路的每个旋转门中引入噪声编码以增强模型鲁棒性,并在损失函数中加入梯度惩罚项以缓解模式崩溃;同时结合变分自编码器(VAE)将分子结构映射至潜在空间作为QGAN输入,从而提升生成质量与泛化能力。该方案在量子模拟器(最多15量子比特)和真实量子硬件(IBM Heron,156量子比特)上进行了验证,结果通过MOSES基准套件与经典模型对比,显示出优越性能。

链接: https://arxiv.org/abs/2603.22399
作者: Julien Baglio,Yacine Haddad,Richard Polifka
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
备注: Main part: 22 pages, 11 figures, 6 tables. Supplementary material: 16 pages, 15 figures, 14 tables

点击查看摘要

Abstract:The development of new drugs is a tedious, time-consuming, and expensive process, for which the average costs are estimated to be up to around 2.5 billion. The first step in this long process is the design of the new drug, for which de novo drug design, assisted by artificial intelligence, has blossomed in recent years and revolutionized the field. In particular, generative artificial intelligence has delivered promising results in drug discovery and development, reducing costs and the time to solution. However, classical generative models, such as generative adversarial networks (GANs), are difficult to train due to barren plateaus and prone to mode collapse. Quantum computing may be an avenue to overcome these issues and provide models with fewer parameters, thereby enhancing the generalizability of GANs. We propose a new style-based quantum GAN (QGAN) architecture for drug design that implements noise encoding at every rotational gate of the circuit and a gradient penalty in the loss function to mitigate mode collapse. Our pipeline employs a variational autoencoder to represent the molecular structure in a latent space, which is then used as input to our QGAN. Our baseline model runs on up to 15 qubits to validate our architecture on quantum simulators, and a 156-qubit IBM Heron quantum computer in the five-qubit setup is used for inference to investigate the effects of using real quantum hardware on the analysis. We benchmark our results against classical models as provided by the MOSES benchmark suite.

[AI-121] SynLeaF: A Dual-Stage Multimodal Fusion Framework for Synthetic Lethality Prediction Across Pan- and Single-Cancer Contexts

【速读】:该论文旨在解决合成致死(Synthetic Lethality, SL)预测中因异构多源数据融合困难而导致的性能瓶颈问题,尤其针对现有多模态方法因不同模态收敛速度差异引发的“模态惰性”(modality laziness)现象,从而限制了互补信息的有效利用,导致模型在泛癌(pan-cancer)和单癌(single-cancer)SL对预测任务上均难以取得理想效果。其解决方案的关键在于提出一种双阶段多模态融合框架 SynLeaF:首先通过基于变分自编码器(VAE)的交叉编码器结合专家乘积(Product of Experts)机制融合基因表达、突变、甲基化和拷贝数变异(CNV)四类组学数据;同时引入关系图卷积网络(Relational Graph Convolutional Network)从生物医学知识图谱中捕获结构化的基因表示;进一步设计双阶段训练机制,采用特征级知识蒸馏(feature-level knowledge distillation)并结合自适应单模态教师模型与集成策略,有效缓解模态惰性问题,显著提升模型在泛癌与单癌场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.22369
作者: Zheming Xing,Siyuan Zhou,Ruinan Wang,Rui Han,Shiming Zhang,Shiqu Chen,Yurui Huang,Jiahao Ma,Yifan Chen,Xuan Wang,Yadong Wang,Junyi Li
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Accurate prediction of synthetic lethality (SL) is important for guiding the development of cancer drugs and therapies. SL prediction faces significant challenges in the effective fusion of heterogeneous multi-source data. Existing multimodal methods often suffer from “modality laziness” due to disparate convergence speeds, which hinders the exploitation of complementary information. This is also one reason why most existing SL prediction models cannot perform well on both pan-cancer and single-cancer SL pair prediction. In this study, we propose SynLeaF, a dual-stage multimodal fusion framework for SL prediction across pan- and single-cancer contexts. The framework employs a VAE-based cross-encoder with a product of experts mechanism to fuse four omics data types (gene expression, mutation, methylation, and CNV), while simultaneously utilizing a relational graph convolutional network to capture structured gene representations from biomedical knowledge graphs. To mitigate modality laziness, SynLeaF introduces a dual-stage training mechanism employing featurelevel knowledge distillation with adaptive uni-modal teacher and ensemble strategies. In extensive experiments across eight specific cancer types and a pancancer dataset, SynLeaF achieves superior performance in 17 out of 19 scenarios. Ablation studies and gradient analyses further validate the critical contributions of the proposed fusion and distillation mechanisms to model robustness and generalization. To facilitate community use, a web server is available at this https URL.

[AI-122] Modeling Quantum Federated Autoencoder for Anomaly Detection in IoT Networks

【速读】:该论文旨在解决物联网(Internet of Things, IoT)网络中异常检测的隐私保护与通信效率问题。传统集中式方法需上传原始数据至中心服务器进行处理,存在数据泄露风险且通信开销大;而分布式方法则面临模型训练难以协同的挑战。解决方案的关键在于提出一种量子联邦自编码器(Quantum Federated Autoencoder),其核心创新是结合量子自编码器(Quantum Autoencoder)用于高维特征表示、以及联邦学习(Federated Learning)实现去中心化模型训练,从而在边缘设备上完成局部学习而不传输原始数据,既保障了数据隐私,又显著降低了通信负担,并利用量子计算在模式识别中的优势提升了异常检测的灵敏度和鲁棒性。

链接: https://arxiv.org/abs/2603.22366
作者: Devashish Chaudhary,Sutharshan Rajasegarar,Shiva Raj Pokhrel
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted at ICOIN 2026

点击查看摘要

Abstract:We propose a Quantum Federated Autoencoder for Anomaly Detection, a framework that leverages quantum federated learning for efficient, secure, and distributed processing in IoT networks. By harnessing quantum autoencoders for high-dimensional feature representation and federated learning for decentralized model training, the approach transforms localized learning on edge devices without requiring transmission of raw data, thereby preserving privacy and minimizing communication overhead. The model leverages quantum advantage in pattern recognition to enhance detection sensitivity, particularly in complex and dynamic IoT network traffic. Experiments on a real-world IoT dataset show that the proposed method delivers anomaly detection accuracy and robustness comparable to centralized approaches, while ensuring data privacy.

[AI-123] Bridging neuroscience and AI: adaptive culturally sensitive technologies transforming aphasia rehabilitation

【速读】:该论文试图解决失语症(aphasia)康复中面临的两大核心问题:一是传统言语治疗资源匮乏,包括治疗师数量有限及个性化、文化适配性工具稀缺;二是现有干预手段难以满足患者多样化的语言需求与高参与度要求。解决方案的关键在于融合神经认知研究与本地化语言技术,开发出适应当地语言多样性并提升患者参与度的数字治疗原型,并通过神经科学洞察和实地田野研究指导设计,使工具更贴合患者与治疗师的实际需求。该方法强调利用自适应的、人工智能增强的辅助技术来补充传统疗法,从而实现更个性化、可扩展且易获取的康复路径。

链接: https://arxiv.org/abs/2603.22357
作者: Andreea I. Niculescu,Jochen Ehnes,Minghui Dong
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, Proceedings of the 20th International Conference on linguistic resources and tools for natural language processing (ConsILR 2025)

点击查看摘要

Abstract:Aphasia, a language impairment primarily resulting from stroke or brain injury, profoundly disrupts communication and everyday functioning. Despite advances in speech therapy, barriers such as limited therapist availability and the scarcity of personalized, culturally relevant tools continue to hinder optimal rehabilitation outcomes. This paper reviews recent developments in neurocognitive research and language technologies that contribute to the diagnosis and therapy of aphasia. Drawing on findings from our ethnographic field study, we introduce two digital therapy prototypes designed to reflect local linguistic diversity and enhance patient engagement. We also show how insights from neuroscience and the local context guided the design of these tools to better meet patient and therapist needs. Our work highlights the potential of adaptive, AI-enhanced assistive technologies to complement conventional therapy and broaden access to therapy. We conclude by outlining future research directions for advancing personalized and scalable aphasia rehabilitation.

机器学习

[LG-0] Estimating Flow Velocity and Vehicle Angle-of-Attack from Non-invasive Piezoelectric Structural Measurements Using Deep Learning

链接: https://arxiv.org/abs/2603.23496
作者: Chandler B. Smith,S. Hales Swift,Andrew Steyer,Ihab El-Kady
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate estimation of aerodynamic state variables such as freestream velocity and angle of attack (AoA) is important for aerodynamic load prediction, flight control, and model validation. This work presents a non-intrusive method for estimating vehicle velocity and AoA from structural vibration measurements rather than direct flow instrumentation such as pitot tubes. A dense array of piezoelectric sensors mounted on the interior skin of an aeroshell capture vibrations induced by turbulent boundary layer pressure fluctuations, and a convolutional neural network (CNN) is trained to invert these structural responses to recover velocity and AoA. Proof-of-concept is demonstrated through controlled experiments in Sandia’s hypersonic wind tunnel spanning zero and nonzero AoA configurations, Mach~5 and Mach~8 conditions, and both constant and continuously varying tunnel operations. The CNN is trained and evaluated using data from 16 wind tunnel runs, with a temporally centered held-out interval within each run used to form training, validation, and test datasets and assess intra-run temporal generalization. Raw CNN predictions exhibit increased variance during continuously varying conditions; a short-window moving-median post-processing step suppresses this variance and improves robustness. After post-processing, the method achieves a mean velocity error relative to the low-pass filtered reference velocity below 2.27~m/s (0.21%) and a mean AoA error of 0.44^\circ (8.25%) on held-out test data from the same experimental campaign, demonstrating feasibility of vibration-based velocity and AoA estimation in a controlled laboratory environment. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2603.23496 [cs.LG] (or arXiv:2603.23496v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.23496 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-1] Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

链接: https://arxiv.org/abs/2603.23472
作者: Rustem Islamov,Grigory Malinovsky,Alexander Gaponov,Aurelien Lucchi,Peter Richtárik,Eduard Gorbunov
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:Federated Learning (FL) enables heterogeneous clients to collaboratively train a shared model without centralizing their raw data, offering an inherent level of privacy. However, gradients and model updates can still leak sensitive information, while malicious servers may mount adversarial attacks such as Byzantine manipulation. These vulnerabilities highlight the need to address differential privacy (DP) and Byzantine robustness within a unified framework. Existing approaches, however, often rely on unrealistic assumptions such as bounded gradients, require auxiliary server-side datasets, or fail to provide convergence guarantees. We address these limitations by proposing Byz-Clip21-SGD2M, a new algorithm that integrates robust aggregation with double momentum and carefully designed clipping. We prove high-probability convergence guarantees under standard L -smoothness and \sigma -sub-Gaussian gradient noise assumptions, thereby relaxing conditions that dominate prior work. Our analysis recovers state-of-the-art convergence rates in the absence of adversaries and improves utility guarantees under Byzantine and DP settings. Empirical evaluations on CNN and MLP models trained on MNIST further validate the effectiveness of our approach.

[LG-2] End-to-End Efficient RL for Linear Bellm an Complete MDPs with Deterministic Transitions

链接: https://arxiv.org/abs/2603.23461
作者: Zakaria Mhammedi,Alexander Rakhlin,Nneka Okolo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying \emphlinear Bellman completeness – a fundamental setting where the Bellman backup of any linear value function remains linear. While statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with \emphdeterministic transitions, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an \varepsilon -optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and 1/\varepsilon .

[LG-3] CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

链接: https://arxiv.org/abs/2603.23459
作者: Abdul Rahman
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 21 pages including 1 appendix

点击查看摘要

Abstract:AI-driven cybersecurity systems often fail under cross-environment deployment due to fragmented, event-centric telemetry representations. We introduce the Canonical Security Telemetry Substrate (CSTS), an entity-relational abstraction that enforces identity persistence, typed relationships, and temporal state invariants. Across heterogeneous environments, CSTS improves cross-topology transfer for identity-centric detection and prevents collapse under schema perturbation. For zero-day detection, CSTS isolates semantic orientation instability as a modeling, not schema, phenomenon, clarifying layered portability requirements.

[LG-4] Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

链接: https://arxiv.org/abs/2603.23436
作者: Connor Mclaughlin,Nigel Lee,Lili Su
类目: Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Machine learning models often need to adapt to new data after deployment due to structured or unstructured real-world dynamics. The Continual Learning (CL) framework enables continuous model adaptation, but most existing approaches either assume each task contains sufficiently many data samples or that the learning tasks are non-overlapping. In this paper, we address the more general setting where each task may have a limited dataset, and tasks may overlap in an arbitrary manner without a priori knowledge. This general setting is substantially more challenging for two reasons. On the one hand, data scarcity necessitates effective contextualization of general knowledge and efficient knowledge transfer across tasks. On the other hand, unstructured task overlapping can easily result in negative knowledge transfer. To address the above challenges, we propose an adaptive mixture-of-experts (MoE) framework over pre-trained models that progressively establishes similarity awareness among tasks. Our design contains two innovative algorithmic components: incremental global pooling and instance-wise prompt masking. The former mitigates prompt association noise through gradual prompt introduction over time. The latter decomposes incoming task samples into those aligning with current prompts (in-distribution) and those requiring new prompts (out-of-distribution). Together, our design strategically leverages potential task overlaps while actively preventing negative mutual interference in the presence of per-task data scarcity. Experiments across varying data volumes and inter-task similarity show that our method enhances sample efficiency and is broadly applicable.

[LG-5] Central Dogma Transformer III: Interpretable AI Across DNA RNA and Protein

链接: https://arxiv.org/abs/2603.23361
作者: Nobuyuki Ota
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 20 pages, 8 figures

点击查看摘要

Abstract:Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT-III, which extends mechanism-oriented AI across the full central dogma: DNA, RNA, and protein. Its two-stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. On five held-out genes, CDT-III achieves per-gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA-level interpretability, increasing CTCF enrichment by 30%. Applied to in silico CD52 knockdown approximating Alemtuzumab, the model predicts 29/29 protein changes correctly and rediscovers 5 of 7 known clinical side effects without clinical data. Gradient-based side effect profiling requires only unperturbed baseline data (r=0.939), enabling screening of all 2,361 genes without new experiments.

[LG-6] Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection

链接: https://arxiv.org/abs/2603.23318
作者: Rodrigo F. L. Lassance,Jasper De Bock
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Among the different possible strategies for evaluating the reliability of individual predictions of classifiers, robustness quantification stands out as a method that evaluates how much uncertainty a classifier could cope with before changing its prediction. However, its applicability is more limited than some of its alternatives, since it requires the use of generative models and restricts the analyses either to specific model architectures or discrete features. In this work, we propose a new robustness metric applicable to any probabilistic discriminative classifier and any type of features. We demonstrate that this new metric is capable of distinguishing between reliable and unreliable predictions, and use this observation to develop new strategies for dynamic classifier selection.

[LG-7] SynForceNet: A Force-Driven Global-Local Latent Representation Framework for Lithium-Ion Battery Fault Diagnosis

链接: https://arxiv.org/abs/2603.23265
作者: Rongxiu Chen,Yuting Su
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online safety fault diagnosis is essential for lithium-ion batteries in electric vehicles(EVs), particularly under complex and rare safety-critical conditions in real-world operation. In this work, we develop an online battery fault diagnosis network based on a deep anomaly detection framework combining kernel one-class classification and minimum-volume estimation. Mechanical constraints and spike-timing-dependent plasticity(STDP)-based dynamic representations are introduced to improve complex fault characterization and enable a more compact normal-state boundary. The proposed method is validated using 8.6 million valid data points collected from 20 EVs. Compared with several advanced baseline methods, it achieves average improvements of 7.59% in TPR, 27.92% in PPV, 18.28% in F1 score, and 23.68% in AUC. In addition, we analyze the spatial separation of fault representations before and after modeling, and further enhance framework robustness by learning the manifold structure in the latent space. The results also suggest the possible presence of shared causal structures across different fault types, highlighting the promise of integrating deep learning with physical constraints and neural dynamics for battery safety diagnosis.

[LG-8] Permutation-Symmetrized Diffusion for Unconditional Molecular Generation

链接: https://arxiv.org/abs/2603.23255
作者: Gyeonghoon Ko,Juho Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Permutation invariance is fundamental in molecular point-cloud generation, yet most diffusion models enforce it indirectly via permutation-equivariant networks on an ordered space. We propose to model diffusion directly on the quotient manifold \tilde\calX=\sR^d\times N/S_N , where all atom permutations are identified. We show that the heat kernel on \tilde\calX admits an explicit expression as a sum of Euclidean heat kernels over permutations, which clarifies how diffusion on the quotient differs from ordered-particle diffusion. Training requires a permutation-symmetrized score involving an intractable sum over S_N ; we derive an expectation form over a posterior on permutations and approximate it using MCMC in permutation space. We evaluate on unconditional 3D molecule generation on QM9 under the EQGAT-Diff protocol, using SemlaFlow-style backbone and treating all variables continuously. The results demonstrate that quotient-based permutation symmetrization is practical and yields competitive generation quality with improved efficiency.

[LG-9] GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

链接: https://arxiv.org/abs/2603.23232
作者: Haoyu Wang,Jingcheng Wang,Shunyu Wu,Xinwei Xiao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield “in-between” actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together with behavior-normalized support, where the behavior log-likelihood is standardized within each state’s candidate set to yield stable, comparable control across states and candidate budgets. Empirically, GEM is competitive across D4RL benchmarks, and offers a simple inference-time budget knob (candidate count) that trades compute for decision quality without retraining.

[LG-10] A One-Inclusion Graph Approach to Multi-Group Learning

链接: https://arxiv.org/abs/2603.23208
作者: Noah Bergam,Samuel Deng,Daniel Hsu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We prove the tightest-known upper bounds on the sample complexity of multi-group learning. Our algorithm extends the one-inclusion graph prediction strategy using a generalization of bipartite b -matching. In the group-realizable setting, we provide a lower bound confirming that our algorithm’s \log n / n convergence rate is optimal in general. If one relaxes the learning objective such that the group on which we are evaluated is chosen obliviously of the sample, then our algorithm achieves the optimal 1/n convergence rate under group-realizability.

[LG-11] A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control ICLR2026

链接: https://arxiv.org/abs/2603.23173
作者: Louis Claeys,Artur Goldman,Zebang Shen,Niao He
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted to ICLR 2026, code available in this https URL

点击查看摘要

Abstract:High-dimensional stochastic optimal control (SOC) becomes harder with longer planning horizons: existing methods scale linearly in the horizon T , with performance often deteriorating exponentially. We overcome these limitations for a subclass of linearly-solvable SOC problems-those whose uncontrolled drift is the gradient of a potential. In this setting, the Hamilton-Jacobi-Bellman equation reduces to a linear PDE governed by an operator \mathcalL . We prove that, under the gradient drift assumption, \mathcalL is unitarily equivalent to a Schrödinger operator \mathcalS = -\Delta + \mathcalV with purely discrete spectrum, allowing the long-horizon control to be efficiently described via the eigensystem of \mathcalL . This connection provides two key results: first, for a symmetric linear-quadratic regulator (LQR), \mathcalS matches the Hamiltonian of a quantum harmonic oscillator, whose closed-form eigensystem yields an analytic solution to the symmetric LQR with \empharbitrary terminal cost. Second, in a more general setting, we learn the eigensystem of \mathcalL using neural networks. We identify implicit reweighting issues with existing eigenfunction learning losses that degrade performance in control tasks, and propose a novel loss function to mitigate this. We evaluate our method on several long-horizon benchmarks, achieving an order-of-magnitude improvement in control accuracy compared to state-of-the-art methods, while reducing memory usage and runtime complexity from \mathcalO(Td) to \mathcalO(d) .

[LG-12] DAK-UCB: Diversity-Aware Prompt Routing for LLM s and Generative Models ICLR2026

链接: https://arxiv.org/abs/2603.23140
作者: Donya Jafari,Farzan Farnia
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026

点击查看摘要

Abstract:The expansion of generative AI and LLM services underscores the growing need for adaptive mechanisms to select an appropriate available model to respond to a user’s prompts. Recent works have proposed offline and online learning formulations to identify the optimal generative AI model for an input prompt, based solely on maximizing prompt-based fidelity evaluation scores, e.g., CLIP-Score in text-to-image generation. However, such fidelity-based selection methods overlook the diversity of generated outputs, and hence, they can fail to address potential diversity shortcomings in the generated responses. In this paper, we introduce the Diversity-Aware Kernelized Upper Confidence Bound (DAK-UCB) method as a contextual bandit algorithm for the online selection of generative models with diversity considerations. The proposed DAK-UCB method incorporates both fidelity and diversity-related metrics into the selection process. We design this framework based on prompt-aware diversity score functions that decompose to a two-sample-based expectation over prompt-output pairs in the previous generation rounds. Specifically, we illustrate the application of our framework using joint kernel distance and kernel entropy measures. Our experimental results demonstrate the effectiveness of DAK-UCB in promoting diversity-aware model selection while maintaining fidelity in the generations for a sequence of prompts. The code is available at this https URL.

[LG-13] A Bayesian Learning Approach for Drone Coverag e Network: A Case Study on Cardiac Arrest in Scotland

链接: https://arxiv.org/abs/2603.23134
作者: Tathagata Basu,Edoardo Patelli,Gianluca Filippi,Ben Parsonage,Christy Maddock,Massimiliano Vasile,Marco Fossati,Adam Loyd,Shaun Marshall,Paul Gowens
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Drones are becoming popular as a complementary system for \acems. Although several pilot studies and flight trials have shown the feasibility of drone-assisted \acaed delivery, running a full-scale operational network remains challenging due to high capital expenditure and environmental uncertainties. In this paper, we formulate a reliability-informed Bayesian learning framework for designing drone-assisted \acaed delivery networks under environmental and operational uncertainty. We propose our objective function based on the survival probability of \acohca patients to identify the ideal locations of drone stations. Moreover, we consider the coverage of existing \acems infrastructure to improve the response reliability in remote areas. We illustrate our proposed method using geographically referenced cardiac arrest data from Scotland. The result shows how environmental variability and spatial demand patterns influence optimal drone station placement across urban and rural regions. In addition, we assess the robustness of the network and evaluate its economic viability using a cost-effectiveness analysis based on expected \acqaly. The findings suggest that drone-assisted \acaed delivery is expected to be cost-effective and has the potential to significantly improve the emergency response coverage in rural and urban areas with longer ambulance response times.

[LG-14] Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

链接: https://arxiv.org/abs/2603.23129
作者: Aditya Kakade,Vivek Srivastava,Shirish Karande
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gödel agent realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code pat ch repair with conservative checks. Unlike response level self correction or parameter tuning, Polaris makes policy level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.

[LG-15] SpecXMaster Technical Report

链接: https://arxiv.org/abs/2603.23101
作者: Yutang Ge,Yaning Cui,Hanzheng Li,Jun-Jie Wang,Fanjie Xu,Jinhan Dong,Yongqi Jin,Dongxu Cui,Peng Jin,Guojiang Zhao,Hengxing Cai,Rong Zhu,Linfeng Zhang,Xiaohong Ji,Zhifeng Gao
类目: Machine Learning (cs.LG)
*备注: Technical report from DP Technology.21 pages, 5 figures

点击查看摘要

Abstract:Intelligent spectroscopy serves as a pivotal element in AI-driven closed-loop scientific discovery, functioning as the critical bridge between matter structure and artificial intelligence. However, conventional expert-dependent spectral interpretation encounters substantial hurdles, including susceptibility to human bias and error, dependence on limited specialized expertise, and variability across interpreters. To address these challenges, we propose SpecXMaster, an intelligent framework leveraging Agentic Reinforcement Learning (RL) for NMR molecular spectral interpretation. SpecXMaster enables automated extraction of multiplicity information from both 1H and 13C spectra directly from raw FID (free induction decay) data. This end-to-end pipeline enables fully automated interpretation of NMR spectra into chemical structures. It demonstrates superior performance across multiple public NMR interpretation benchmarks and has been refined through iterative evaluations by professional chemical spectroscopists. We believe that SpecXMaster, as a novel methodological paradigm for spectral interpretation, will have a profound impact on the organic chemistry community.

[LG-16] MsFormer: Enabling Robust Predictive Maintenance Services for Industrial Devices

链接: https://arxiv.org/abs/2603.23076
作者: Jiahui Zhou,Dan Li,Ruibing Jin,Jian Lou,Yanran Zhao,Zhenghua Chen,Zigui Jiang,See-Kiong Ng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Providing reliable predictive maintenance is a critical industrial AI service essential for ensuring the high availability of manufacturing devices. Existing deep-learning methods present competitive results on such tasks but lack a general service-oriented framework to capture complex dependencies in industrial IoT sensor data. While Transformer-based models show strong sequence modeling capabilities, their direct deployment as robust AI services faces significant bottlenecks. Specifically, streaming sensor data collected in real-world service environments often exhibits multi-scale temporal correlations driven by machine working principles. Besides, the datasets available for training time-to-failure predictive services are typically limited in size. These issues pose significant challenges for directly applying existing models as robust predictive services. To address these challenges, we propose MsFormer, a lightweight Multi-scale Transformer designed as a unified AI service model for reliable industrial predictive maintenance. MsFormer incorporates a Multi-scale Sampling (MS) module and a tailored position encoding mechanism to capture sequential correlations across multi-streaming service data. Additionally, to accommodate data-scarce service environments, MsFormer adopts a lightweight attention mechanism with straightforward pooling operations instead of self-attention. Extensive experiments on real-world datasets demonstrate that the proposed framework achieves significant performance improvements over state-of-the-art methods. Furthermore, MsFormer outperforms across industrial devices and operating conditions, demonstrating strong generalizability while maintaining a highly reliable Quality of Service (QoS).

[LG-17] Generalization Bounds for Physics-Informed Neural Networks for the Incompressible Navier-Stokes Equations

链接: https://arxiv.org/abs/2603.23072
作者: Sebastien Andre-Sloan,Dibyakanti Kumar,Alejandro F Frangi,Anirbit Mukherjee
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This work establishes rigorous first-of-its-kind upper bounds on the generalization error for the method of approximating solutions to the (d+1)-dimensional incompressible Navier-Stokes equations by training depth-2 neural networks trained via the unsupervised Physics-Informed Neural Network (PINN) framework. This is achieved by bounding the Rademacher complexity of the PINN risk. For appropriately weight bounded net classes our derived generalization bounds do not explicitly depend on the network width and our framework characterizes the generalization gap in terms of the fluid’s kinematic viscosity and loss regularization parameters. In particular, the resulting sample complexity bounds are dimension-independent. Our generalization bounds suggest using novel activation functions for solving fluid dynamics. We provide empirical validation of the suggested activation functions and the corresponding bounds on a PINN setup solving the Taylor-Green vortex benchmark.

[LG-18] Robustness Quantification and Uncertainty Quantification: Comparing Two Methods for Assessing the Reliability of Classifier Predictions

链接: https://arxiv.org/abs/2603.22988
作者: Adrián Detavernier,Jasper De Bock
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider two approaches for assessing the reliability of the individual predictions of a classifier: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). We explain the conceptual differences between the two approaches, compare both approaches on a number of benchmark datasets and show that RQ is capable of outperforming UQ, both in a standard setting and in the presence of distribution shift. Beside showing that RQ can be competitive with UQ, we also demonstrate the complementarity of RQ and UQ by showing that a combination of both approaches can lead to even better reliability assessments.

[LG-19] A Critical Review on the Effectiveness and Privacy Threats of Membership Inference Attacks ESORICS2026

链接: https://arxiv.org/abs/2603.22987
作者: Najeeb Jebreel,David Sánchez,Josep Domingo-Ferrer
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To appear in ESORICS 2026

点击查看摘要

Abstract:Membership inference attacks (MIAs) aim to determine whether a data sample was included in a machine learning (ML) model’s training set and have become the de facto standard for measuring privacy leakages in ML. We propose an evaluation framework that defines the conditions under which MIAs constitute a genuine privacy threat, and review representative MIAs against it. We find that, under the realistic conditions defined in our framework, MIAs represent weak privacy threats. Thus, relying on them as a privacy metric in ML can lead to an overestimation of risk and to unnecessary sacrifices in model utility as a consequence of employing too strong defenses.

[LG-20] Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data

链接: https://arxiv.org/abs/2603.22962
作者: Anand Jerry George,Nicolas Macris
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages

点击查看摘要

Abstract:We study the theoretical behavior of denoising score matching–the learning task associated to diffusion models–when the data distribution is supported on a low-dimensional manifold and the score is parameterized using a random feature neural network. We derive asymptotically exact expressions for the test, train, and score errors in the high-dimensional limit. Our analysis reveals that, for linear manifolds the sample complexity required to learn the score function scales linearly with the intrinsic dimension of the manifold, rather than with the ambient dimension. Perhaps surprisingly, the benefits of low-dimensional structure starts to diminish once we have a non-linear manifold. These results indicate that diffusion models can benefit from structured data; however, the dependence on the specific type of structure is subtle and intricate.

[LG-21] Privacy-Preserving EHR Data Transformation via Geometric Operators: A Human-AI Co-Design Technical Report

链接: https://arxiv.org/abs/2603.22954
作者: Maolin Wang,Beining Bao,Gan Yuan,Hongyu Chen,Bingkun Zhao,Baoshuo Kan,Jiming Xu,Qi Shi,Yinggong Zhao,Yao Wang,Wei Ying Ma,Jun Yan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electronic health records (EHRs) and other real-world clinical data are essential for clinical research, medical artificial intelligence, and life science, but their sharing is severely limited by privacy, governance, and interoperability constraints. These barriers create persistent data silos that hinder multi-center studies, large-scale model development, and broader biomedical discovery. Existing privacy-preserving approaches, including multi-party computation and related cryptographic techniques, provide strong protection but often introduce substantial computational overhead, reducing the efficiency of large-scale machine learning and foundation-model training. In addition, many such methods make data usable for restricted computation while leaving them effectively invisible to clinicians and researchers, limiting their value in workflows that still require direct inspection, exploratory analysis, and human interpretation. We propose a real-world-data transformation framework for privacy-preserving sharing of structured clinical records. Instead of converting data into opaque representations, our approach constructs transformed numeric views that preserve medical semantics and major statistical properties while, under a clearly specified threat model, provably breaking direct linkage between those views and protected patient-level attributes. Through collaboration between computer scientists and the AI agent \textbfSciencePal, acting as a constrained tool inventor under human guidance, we design three transformation operators that are non-reversible within this threat model, together with an additional mixing strategy for high-risk scenarios, supported by theoretical analysis and empirical evaluation under reconstruction, record linkage, membership inference, and attribute inference attacks.

[LG-22] Weak-PDE-Net: Discovering Open-Form PDEs via Differentiable Symbolic Networks and Weak Formulation

链接: https://arxiv.org/abs/2603.22951
作者: Xinxin Li,Xingyu Cui,Jin Qi,Juan Zhang,Da Li,Junping Yin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discovering governing Partial Differential Equations (PDEs) from sparse and noisy data is a challenging issue in data-driven scientific computing. Conventional sparse regression methods often suffer from two major limitations: (i) the instability of numerical differentiation under sparse and noisy data, and (ii) the restricted flexibility of a pre-defined candidate library. We propose Weak-PDE-Net, an end-to-end differentiable framework that can robustly identify open-form PDEs. Weak-PDE-Net consists of two interconnected modules: a forward response learner and a weak-form PDE generator. The learner embeds learnable Gaussian kernels within a lightweight MLP, serving as a surrogate model that adaptively captures system dynamics from sparse observations. Meanwhile, the generator integrates a symbolic network with an integral module to construct weak-form PDEs, avoiding explicit numerical differentiation and improving robustness to noise. To relax the constraints of the pre-defined library, we leverage Differentiable Neural Architecture Search strategy during training to explore the functional space, which enables the efficient discovery of open-form PDEs. The capability of Weak-PDE-Net in multivariable systems discovery is further enhanced by incorporating Galilean Invariance constraints and symmetry equivariance hypotheses to ensure physical consistency. Experiments on several challenging PDE benchmarks demonstrate that Weak-PDE-Net accurately recovers governing equations, even under highly sparse and noisy observations.

[LG-23] VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

链接: https://arxiv.org/abs/2603.22892
作者: Pengsen Liu,Maosen Zeng,Nan Tang,Kaiyuan Li,Jing-Cheng Pang,Yunan Liu,Yang Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.

[LG-24] Conditionally Identifiable Latent Representation for Multivariate Time Series with Structural Dynamics ICLR

链接: https://arxiv.org/abs/2603.22886
作者: Minkey Chang,Jae-Young Kim
类目: Machine Learning (cs.LG); General Finance (q-fin.GN); Statistical Finance (q-fin.ST)
*备注: Accepted paper for 2026 ICLR FINAI workshop

点击查看摘要

Abstract:We propose the Identifiable Variational Dynamic Factor Model (iVDFM), which learns latent factors from multivariate time series with identifiability guarantees. By applying iVAE-style conditioning to the innovation process driving the dynamics rather than to the latent states, we show that factors are identifiable up to permutation and component-wise affine (or monotone invertible) transformations. Linear diagonal dynamics preserve this identifiability and admit scalable computation via companion-matrix and Krylov methods. We demonstrate improved factor recovery on synthetic data, stable intervention accuracy on synthetic SCMs, and competitive probabilistic forecasting on real-world benchmarks.

[LG-25] Balancing Safety and Efficiency in Aircraft Health Diagnosis: A Task Decomposition Framework with Heterogeneous Long-Micro Scale Cascading and Knowledge Distillation-based Interpretability

链接: https://arxiv.org/abs/2603.22885
作者: Xinhang Chen,Zhihuan Wei,Yang Hu,Zhiguo Zeng,Kang Zeng,Suili Yang
类目: Machine Learning (cs.LG)
*备注: Submitted to Reliability Engineering System Safety (RESS)

点击查看摘要

Abstract:Whole-aircraft diagnosis for general aviation faces threefold challenges: data uncertainty, task heterogeneity, and computational inefficiency. Existing end-to-end approaches uniformly model health discrimination and fault characterization, overlooking intrinsic receptive field conflicts between global context modeling and local feature extraction, while incurring prohibitive training costs under severe class imbalance. To address these, this study proposes the Diagnosis Decomposition Framework (DDF), explicitly decoupling diagnosis into Anomaly Detection (AD) and Fault Classification (FC) subtasks via the Long-Micro Scale Diagnostician (LMSD). Employing a “long-range global screening and micro-scale local precise diagnosis” strategy, LMSD utilizes Convolutional Tokenizer with Multi-Head Self-Attention (ConvTokMHSA) for global operational pattern discrimination and Multi-Micro Kernel Network (MMK Net) for local fault feature extraction. Decoupled training separates “large-sample lightweight” and “small-sample complex” optimization pathways, significantly reducing computational overhead. Concurrently, Keyness Extraction Layer (KEL) via knowledge distillation furnishes physically traceable explanations for two-stage decisions, materializing interpretability-by-design. Experiments on the NGAFID real-world aviation dataset demonstrate approximately 4-8% improvement in Multi-Class Weighted Penalty Metric (MCWPM) over baselines with substantially reduced training time, validating comprehensive advantages in task adaptability, interpretability, and efficiency. This provides a deployable methodology for general aviation health management.

[LG-26] orR: Towards Brain-Inspired Task-Oriented Reasoning via Cache-Oriented Algorithm-Architecture Co-design

链接: https://arxiv.org/abs/2603.22855
作者: Hyunwoo Oh,SungHeon Jeong,Suyeon Jang,Hanning Chen,Sanggeon Yun,Tamoghno Das,Mohsen Imani
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted to DAC 2026

点击查看摘要

Abstract:Task-oriented object detection (TOOD) atop CLIP offers open-vocabulary, prompt-driven semantics, yet dense per-window computation and heavy memory traffic hinder real-time, power-limited edge deployment. We present \emphTorR, a brain-inspired \textbfalgorithm–architecture co-design that \textbfreplaces CLIP-style dense alignment with a hyperdimensional (HDC) associative reasoner and turns temporal coherence into reuse. On the \emphalgorithm side, TorR reformulates alignment as HDC similarity and graph composition, introducing \emphpartial-similarity reuse via (i) query caching with per-class score accumulation, (ii) exact \delta -updates when only a small set of hypervector bits change, and (iii) similarity/load-gated bypass under high system load. On the \empharchitecture side, TorR instantiates a lane-scalable, bit-sliced item memory with bank/precision gating and a lightweight controller that schedules bypass/ \delta /full paths to meet RT-30/RT-60 targets as object counts vary. Synthesized in a TSMC 28,nm process and exercised with a cycle-accurate simulator, TorR sustains real-time throughput with millijoule-scale energy per window ( \approx 50,mJ at 60,FPS; \approx 113,mJ at 30,FPS) and low latency jitter, while delivering competitive AP@0.5 across five task prompts (mean 44.27%) within a bounded margin to strong VLM baselines, but at orders-of-magnitude lower energy. The design exposes deployment-time configurability (effective dimension D’ , thresholds, precision) to trade accuracy, latency, and energy for edge budgets.

[LG-27] owards The Implicit Bias on Multiclass Separable Data Under Norm Constraints

链接: https://arxiv.org/abs/2603.22824
作者: Shengping Xie,Zekun Wu,Quan Chen,Kaixu Tang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Implicit bias induced by gradient-based algorithms is essential to the generalization of overparameterized models, yet its mechanisms can be subtle. This work leverages the Normalized Steepest Descent (NSD) framework to investigate how optimization geometry shapes solutions on multiclass separable data. We introduce NucGD, a geometry-aware optimizer designed to enforce low rank structures through nuclear norm constraints. Beyond the algorithm itself, we connect NucGD with emerging low-rank projection methods, providing a unified perspective. To enable scalable training, we derive an efficient SVD-free update rule via asynchronous power iteration. Furthermore, we empirically dissect the impact of stochastic optimization dynamics, characterizing how varying levels of gradient noise induced by mini-batch sampling and momentum modulate the convergence toward the expected maximum margin this http URL code is accessible at: this https URL.

[LG-28] Universal and efficient graph neural networks with dynamic attention for machine learning interatomic potentials

链接: https://arxiv.org/abs/2603.22810
作者: Shuyu Bi,Zhede Zhao,Qiangchao Sun,Tao Hu,Xionggang Lu,Hongwei Cheng
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 6 tables

点击查看摘要

Abstract:The core of molecular dynamics simulation fundamentally lies in the interatomic potential. Traditional empirical potentials lack accuracy, while first-principles methods are computationally prohibitive. Machine learning interatomic potentials (MLIPs) promise near-quantum accuracy at linear cost, but existing models still face challenges in efficiency and stability. We presents Machine Learning Advances Neural Network (MLANet), an efficient and robust graph neural network framework. MLANet introduces a dual-path dynamic attention mechanism for geometry-aware message passing and a multi-perspective pooling strategy to construct comprehensive system representations. This design enables highly accurate modeling of atomic environments while achieving exceptional computational efficiency, making high-fidelity simulations more accessible. Tested across a wide range of datasets spanning diverse systems, including organic molecules (e.g., QM7, MD17), periodic inorganic materials (e.g., Li-containing crystals), two-dimensional materials (e.g., bilayer graphene, black phosphorus), surface catalytic reactions (e.g., formate decomposition), and charged systems, MLANet maintains competitive prediction accuracy while its computational cost is markedly lower than mainstream equivariant models, and it enables stable long-time molecular dynamics simulations. MLANet provides an efficient and practical tool for large-scale, high-accuracy atomic simulations.

[LG-29] Combinatorial Privacy: Private Multi-Party Bitstream Grand Sum by Hiding in Birkhoff Polytopes

链接: https://arxiv.org/abs/2603.22808
作者: Praneeth Vepakomma
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce PolyVeil, a protocol for private Boolean summation across k clients that encodes private bits as permutation matrices in the Birkhoff polytope. A two-layer architecture gives the server perfect simulation-based security (statistical distance zero) while a separate aggregator faces #P-hard likelihood inference via the permanent and mixed discriminant. Two variants (full and compressed) differ in what the aggregator observes. We develop a finite-sample (\varepsilon,\delta) -DP analysis with explicit constants. In the full variant, where the aggregator sees a doubly stochastic matrix per client, the log-Lipschitz constant grows as n^4 K_t and a signal-to-noise analysis shows the DP guarantee is non-vacuous only when the private signal is undetectable. In the compressed variant, where the aggregator sees a single scalar, the univariate density ratio yields non-vacuous \varepsilon at moderate SNR, with the optimal decoy count balancing CLT accuracy against noise concentration. This exposes a fundamental tension. #P-hardness requires the full matrix view (Birkhoff structure visible), while non-vacuous DP requires the scalar view (low dimensionality). Whether both hold simultaneously in one variant remains open. The protocol needs no PKI, has O(k) communication, and outputs exact aggregates. Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2603.22808 [cs.CR] (or arXiv:2603.22808v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.22808 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-30] ransformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models

链接: https://arxiv.org/abs/2603.22801
作者: Chenyang Zhang,Qingyue Zhao,Quanquan Gu,Yuan Cao
类目: Machine Learning (cs.LG)
*备注: 64 pages, 9 figures

点击查看摘要

Abstract:Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile scenarios and tasks, we theoretically investigate utilizing transformers as students to learn from a class of teacher models. Specifically, the teacher models covered in our analysis include convolution layers with average pooling, graph convolution layers, and various classic statistical learning models, including a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025]. When learning from this class of teacher models, we prove that one-layer transformers with simplified "position-only’’ attention can successfully recover all parameter blocks of the teacher models, thus achieving the optimal population loss. Building upon the efficient mimicry of trained transformers towards teacher models, we further demonstrate that they can generalize well to a broad class of out-of-distribution data under mild assumptions. The key in our analysis is to identify a fundamental bilinear structure shared by various learning tasks, which enables us to establish unified learning guarantees for these tasks when treating them as teachers for transformers.

[LG-31] Caterpillar of Thoughts: The Optimal Test-Time Algorithm for Large Language Models

链接: https://arxiv.org/abs/2603.22784
作者: Amir Azarmehr,Soheil Behnezhad,Alma Ghafari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can often produce substantially better outputs when allowed to use additional test-time computation, such as sampling, chain of thought, backtracking, or revising partial solutions. Despite the growing empirical success of such techniques, there is limited theoretical understanding of how inference time computation should be structured, or what constitutes an optimal use of a fixed computation budget. We model test-time computation as an algorithm interacting with a Markov chain: at any point, the algorithm may resume generation from any previously observed state. That is, unlike standard Markov chains where the states are drawn passively, we allow the algorithm to backtrack to any previously observed state of the Markov chain at any time. Many of the existing test-time algorithms, such as Chain-of-Thought (CoT) (Wei et al., 2023), Tree-of-Thoughts (ToT) (Yao et al., 2023), or Best-of- k (Brown et al., 2024) could be seen as specific algorithms in this model. We prove that while backtracking can reduce the number of generations exponentially, a very limited form of backtracking is theoretically sufficient. Namely, we show that the optimal algorithm always generates a caterpillar tree. That is, if we remove the leaves of the state tree generated by the optimal algorithm, we obtain a path. Motivated by our characterization of the optimal algorithm, we present Caterpillar of Thoughts (CaT), a new test-time computation algorithm, reducing the number of token/state generations. Our empirical evaluation shows that CaT, compared to ToT, achieves a better success rate while also reducing the number of token generations. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2603.22784 [cs.LG] (or arXiv:2603.22784v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.22784 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-32] Explainable Threat Attribution for IoT Networks Using Conditional SHAP and Flow Behavior Modelling

链接: https://arxiv.org/abs/2603.22771
作者: Samuel Ozechi,Jennifer Okonkwoabutu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the Internet of Things (IoT) continues to expand across critical infrastructure, smart environments, and consumer devices, securing them against cyber threats has become increasingly vital. Traditional intrusion detection models often treat IoT threats as binary classification problems or rely on opaque models, thereby limiting trust. This work studies multiclass threat attribution in IoT environments using the CICIoT2023 dataset, grouping over 30 attack variants into 8 semantically meaningful classes. We utilize a combination of a gradient boosting model and SHAP (SHapley Additive exPlanations) to deliver both global and class-specific explanations, enabling detailed insight into the features driving each attack classification. The results show that the model distinguishes distinct behavioral signatures of the attacks using flow timing, packet size uniformity, TCP flag dynamics, and statistical variance. Additional analysis that exposes both feature attribution and the decision trajectory per class further validates these observed patterns. Our findings contribute to the development of more accurate and explainable intrusion detection systems, bridging the gap between high-performance machine learning and the need for trust and accountability in AI-driven cybersecurity for IoT environments.

[LG-33] Algorithmic warm starts for Hamiltonian Monte Carlo

链接: https://arxiv.org/abs/2603.22741
作者: Matthew S. Zhang,Jason M. Altschuler,Sinho Chewi
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generating samples from a continuous probability density is a central algorithmic problem across statistics, engineering, and the sciences. For high-dimensional settings, Hamiltonian Monte Carlo (HMC) is the default algorithm across mainstream software packages. However, despite the extensive line of work on HMC and its widespread empirical success, it remains unclear how many iterations of HMC are required as a function of the dimension d . On one hand, a variety of results show that Metropolized HMC converges in O(d^1/4) iterations from a warm start close to stationarity. On the other hand, Metropolized HMC is significantly slower without a warm start, e.g., requiring \Omega(d^1/2) iterations even for simple target distributions such as isotropic Gaussians. Finding a warm start is therefore the computational bottleneck for HMC. We resolve this issue for the well-studied setting of sampling from a probability distribution satisfying strong log-concavity (or isoperimetry) and third-order derivative bounds. We prove that \emphnon-Metropolized HMC generates a warm start in \tildeO(d^1/4) iterations, after which we can exploit the warm start using Metropolized HMC. Our final complexity of \tildeO(d^1/4) is the fastest algorithm for high-accuracy sampling under these assumptions, improving over the prior best of \tildeO(d^1/2) . This closes the long line of work on the dimensional complexity of MHMC for such settings, and also provides a simple warm-start prescription for practical implementations. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML) Cite as: arXiv:2603.22741 [cs.DS] (or arXiv:2603.22741v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2603.22741 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-34] Multitask-Informed Prior for In-Context Learning on Tabular Data: Application to Steel Property Prediction

链接: https://arxiv.org/abs/2603.22738
作者: Dimitrios Sinodinos,Bahareh Nikpour,Jack Yi Wei,Sushant Sinha,Xiaoping Ma,Kashif Rehman,Stephen Yue,Narges Armanfard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of mechanical properties of steel during hot rolling processes, such as Thin Slab Direct Rolling (TSDR), remains challenging due to complex interactions among chemical compositions, processing parameters, and resultant microstructures. Traditional empirical and experimental methodologies, while effective, are often resource-intensive and lack adaptability to varied production conditions. Moreover, most existing approaches do not explicitly leverage the strong correlations among key mechanical properties, missing an opportunity to improve predictive accuracy through multitask learning. To address this, we present a multitask learning framework that injects multitask awareness into the prior of TabPFN–a transformer-based foundation model for in-context learning on tabular data–through novel fine-tuning strategies. Originally designed for single-target regression or classification, we augment TabPFN’s prior with two complementary approaches: (i) target averaging, which provides a unified scalar signal compatible with TabPFN’s single-target architecture, and (ii) task-specific adapters, which introduce task-specific supervision during fine-tuning. These strategies jointly guide the model toward a multitask-informed prior that captures cross-property relationships among key mechanical metrics. Extensive experiments on an industrial TSDR dataset demonstrate that our multitask adaptations outperform classical machine learning methods and recent state-of-the-art tabular learning models across multiple evaluation metrics. Notably, our approach enhances both predictive accuracy and computational efficiency compared to task-specific fine-tuning, demonstrating that multitask-aware prior adaptation enables foundation models for tabular data to deliver scalable, rapid, and reliable deployment for automated industrial quality control and process optimization in TSDR.

[LG-35] Spiking Personalized Federated Learning for Brain-Computer Interface-Enabled Immersive Communication

链接: https://arxiv.org/abs/2603.22727
作者: Chen Shang,Dinh Thai Hoang,Diep N. Nguyen,Jiadong Yu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:This work proposes a novel immersive communication framework that leverages brain-computer interface (BCI) to acquire brain signals for inferring user-centric states (e.g., intention and perception-related discomfort), thereby enabling more personalized and robust immersive adaptation under strong individual variability. Specifically, we develop a personalized federated learning (PFL) model to analyze and process the collected brain signals, which not only accommodates neurodiverse brain-signal data but also prevents the leakage of sensitive brain-signal information. To address the energy bottleneck of continual on-device learning and inference on energy-limited immersive terminals (e.g., head-mounted display), we further embed spiking neural networks (SNNs) into the PFL. By exploiting sparse, event-driven spike computation, the SNN-enabled PFL reduces the computation and energy cost of training and inference while maintaining competitive personalization performance. Experiments on real brain-signal dataset demonstrate that our method achieves the best overall identification accuracy while reducing inference energy by 6.46 \times compared with conventional artificial neural network-based personalized baselines.

[LG-36] Double Coupling Architecture and Training Method for Optimization Problems of Differential Algebraic Equations with Parameters

链接: https://arxiv.org/abs/2603.22724
作者: Wenqiang Yang,Wenyuan Wu,Yong Feng,Changbo Chen
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注: 19pages, 11 figures

点击查看摘要

Abstract:Simulation and modeling are essential in product development, integrated into the design and manufacturing process to enhance efficiency and quality. They are typically represented as complex nonlinear differential algebraic equations. The growing diversity of product requirements demands multi-task optimization, a key challenge in simulation modeling research. A dual physics-informed neural network architecture has been proposed to decouple constraints and objective functions in parametric differential algebraic equation optimization problems. Theoretical analysis shows that introducing a relaxation variable with a global error bound ensures solution equivalence between the network and optimization problem. A genetic algorithm-enhanced training framework for physics-informed neural networks improves training precision and efficiency, avoiding redundant solving of differential algebraic equations. This approach enables generalization for multi-task objectives with a single, training maintaining real-time responsiveness to product requirements.

[LG-37] Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellm an Constraints

链接: https://arxiv.org/abs/2603.22713
作者: Tian Xu,Chenyang Wang,Xiaochen Zhai,Ziniu Li,Yi-Chen Li,Yang Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors in behavioral cloning (BC), but often exhibits training instability due to adversarial optimization. To avoid this issue, a class of non-adversarial Q-based imitation learning (IL) methods, represented by IQ-Learn, has emerged and is widely believed to outperform BC by leveraging online environment interactions. However, this paper revisits IQ-Learn and demonstrates that it provably reduces to BC and suffers from an imitation gap lower bound with quadratic dependence on horizon, therefore still suffering from compounding errors. Theoretical analysis reveals that, despite using online interactions, IQ-Learn uniformly suppresses the Q-values for all actions on states uncovered by demonstrations, thereby failing to generalize. To address this limitation, we introduce a primal-dual framework for distribution matching, yielding a new Q-based IL method, Dual Q-DM. The key mechanism in Dual Q-DM is incorporating Bellman constraints to propagate high Q-values from visited states to unvisited ones, thereby achieving generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions beyond demonstrations, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical results.

[LG-38] Coordinate Encoding on Linear Grids for Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2603.22700
作者: Tetsuro Tsuchino,Motoki Shiga
类目: Machine Learning (cs.LG)
*备注: 21 pages, 11 figures

点击查看摘要

Abstract:In solving partial differential equations (PDEs), machine learning utilizing physical laws has received considerable attention owing to advantages such as mesh-free solutions, unsupervised learning, and feasibility for solving high-dimensional problems. An effective approach is based on physics-informed neural networks (PINNs), which are based on deep neural networks known for their excellent performance in various academic and industrial applications. However, PINNs struggled with model training owing to significantly slow convergence because of a spectral bias problem. In this study, we propose a PINN-based method equipped with a coordinate-encoding layer on linear grid cells. The proposed method improves the training convergence speed by separating the local domains using grid cells. Moreover, it reduces the overall computational cost by using axis-independent linear grid cells. The method also achieves efficient and stable model training by adequately interpolating the encoded coordinates between grid points using natural cubic splines, which guarantees continuous derivative functions of the model computed for the loss functions. The results of numerical experiments demonstrate the effective performance and efficient training convergence speed of the proposed method.

[LG-39] Bounding Box Anomaly Scoring for simple and efficient Out-of-Distribution detection

链接: https://arxiv.org/abs/2603.22660
作者: Mohamed Bahi Yahiaoui,Geoffrey Daniel,Loïc Giraldi,Jérémie Bruyelle,Julyan Arbel
类目: Machine Learning (cs.LG)
*备注: 45 pages, 4 figures, 17 tables

点击查看摘要

Abstract:Out-of-distribution (OOD) detection aims to identify inputs that differ from the training distribution in order to reduce unreliable predictions by deep neural networks. Among post-hoc feature-space approaches, OOD detection is commonly performed by approximating the in-distribution support in the representation space of a pretrained network. Existing methods often reflect a trade-off between compact parametric models, such as Mahalanobis-based scores, and more flexible but reference-based methods, such as k-nearest neighbors. Bounding-box abstraction provides an attractive intermediate perspective by representing in-distribution support through compact axis-aligned summaries of hidden activations. In this paper, we introduce Bounding Box Anomaly Scoring (BBAS), a post-hoc OOD detection method that leverages bounding-box abstraction. BBAS combines graded anomaly scores based on interval exceedances, monitoring variables adapted to convolutional layers, and decoupled clustering and box construction for richer and multi-layer representations. Experiments on image-classification benchmarks show that BBAS provides robust separation between in-distribution and out-of-distribution samples while preserving the simplicity, compactness, and updateability of the bounding-box approach.

[LG-40] ransfer learning via interpolating structures

链接: https://arxiv.org/abs/2603.22621
作者: T.A. Dardeno,A.J. Hughes,L.A. Bull,R.S. Mills,N. Dervilis,K. Worden
类目: Machine Learning (cs.LG)
*备注: preprint submitted to Mechanical Systems and Signal Processing

点击查看摘要

Abstract:Despite recent advances in population-based structural health monitoring (PBSHM), knowledge transfer between highly-disparate structures (i.e., heterogeneous populations) remains a challenge. The current work proposes that heterogeneous transfer may be accomplished via intermediate structures that bridge the gap in information between the structures of interest. A key aspect of the technique is the idea that by varying parameters such as material properties and geometry, one structure can be continuously morphed into another. The approach is demonstrated via a case study involving the parameterisation of (and transfer between) simulated heterogeneous bridge designs (Case 1). Transfer between simplified physical representations of a ‘bridge’ and ‘aeroplane’ is then demonstrated in Case 2, via a chain of finite-element models. The facetious question ‘When is a bridge not an aeroplane?’ has been previously asked in the context of predicting positive transfer based on structural similarity. While the obvious answer to this question is ‘Always,’ the results presented in the current paper show that, in some cases, positive transfer can indeed be achieved between highly-disparate systems.

[LG-41] Precision-Varying Prediction (PVP): Robustifying ASR systems against adversarial attacks

链接: https://arxiv.org/abs/2603.22590
作者: Matías Pizarro,Raghavan Narasimhan,Asja Fischer
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:With the increasing deployment of automated and agentic systems, ensuring the adversarial robustness of automatic speech recognition (ASR) models has become critical. We observe that changing the precision of an ASR model during inference reduces the likelihood of adversarial attacks succeeding. We take advantage of this fact to make the models more robust by simple random sampling of the precision during prediction. Moreover, the insight can be turned into an adversarial example detection strategy by comparing outputs resulting from different precisions and leveraging a simple Gaussian classifier. An experimental analysis demonstrates a significant increase in robustness and competitive detection performance for various ASR models and attack types.

[LG-42] A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

链接: https://arxiv.org/abs/2603.22586
作者: Anish Saha,Konstantin Shmakov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In-context learning (ICL) allows a model to adapt at inference time by conditioning on examples rather than updating parameters. Existing time-series foundation models use implicit positional context, retrieval, or task-specific objectives, but rarely explicit instruction-conditioned demonstrations. We present a foundation model for instruction-conditioned in-context time-series tasks based on a quantile-regression T5 encoder-decoder. Historical examples and queries are encoded with a structured tokenization scheme that marks target series, covariates, context, and task-specific future information. A hierarchical Transformer with per-example encoding, example-level fusion, and cross-example attention conditions decoding on demonstration pairs, enabling forecasting and related tasks without task-specific fine-tuning. We train on large-scale real and synthetic time series using supervised forecasting plus self-supervised tasks, including imputation, reconstruction, classification, anomaly detection, and source demixing. This multi-task training learns a distribution over task mappings and improves adaptation to local structure at inference time. Across diverse datasets, frequencies, and horizons, our method outperforms strong foundation baselines on point and probabilistic forecasting benchmarks, including fev-bench and GIFT-Eval, while remaining competitive on classification and anomaly detection.

[LG-43] MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data

链接: https://arxiv.org/abs/2603.22564
作者: Xingzhi Sun,João Felipe Rocha,Brett Phelan,Dhananjay Bhaskar,Guillaume Huguet,Yanlei Zhang,D.S. Magruder,Alexander Tong,Ke Xu,Oluwadamilola Fasina,Mark Gerstein,Guy Wolf,Natalia Ivanova,Christine L. Chaffer,Smita Krishnaswamy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data’s intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2603.22564 [cs.LG] (or arXiv:2603.22564v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.22564 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-44] Multimodal Training to Unimodal Deployment: Leverag ing Unstructured Data During Training to Optimize Structured Data Only Deployment

链接: https://arxiv.org/abs/2603.22530
作者: Zigui Wang,Minghui Sun,Jiang Shu,Matthew M. Engelhard,Lauren Franz,Benjamin A. Goldstein
类目: Machine Learning (cs.LG)
*备注: 10 pages,3 figures

点击查看摘要

Abstract:Unstructured Electronic Health Record (EHR) data, such as clinical notes, contain clinical contextual observations that are not directly reflected in structured data fields. This additional information can substantially improve model learning. However, due to their unstructured nature, these data are often unavailable or impractical to use when deploying a model. We introduce a multimodal learning framework that leverages unstructured EHR data during training while producing a model that can be deployed using only structured EHR data. Using a cohort of 3,466 children evaluated for late talking, we generated note embeddings with BioClinicalBERT and encoded structured embeddings from demographics and medical codes. A note-based teacher model and a structured-only student model were jointly trained using contrastive learning and contrastive knowledge distillation loss, producing a strong classifier (AUROC = 0.985). Our proposed model reached an AUROC of 0.705, outperforming the structured-only baseline of 0.656. These results demonstrate that incorporating unstructured data during training enhances the model’s capacity to identify task-relevant information within structured EHR data, enabling a deployable structured-only phenotype model.

[LG-45] Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates

链接: https://arxiv.org/abs/2603.22525
作者: Samrendra Roy,Kazuma Kobayashi,Souvik Chakraborty,Rizwan-uddin,Syed Bahauddin Alam
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 56 pages, 14 figures, 22 tables

点击查看摘要

Abstract:Operator learning models are rapidly emerging as the predictive core of digital twins for nuclear and energy systems, promising real-time field reconstruction from sparse sensor measurements. Yet their robustness to adversarial perturbations remains uncharacterized, a critical gap for deployment in safety-critical systems. Here we show that neural operators are acutely vulnerable to extremely sparse (fewer than 1% of inputs), physically plausible perturbations that exploit their sensitivity to boundary conditions. Using gradient-free differential evolution across four operator architectures, we demonstrate that minimal modifications trigger catastrophic prediction failures, increasing relative L_2 error from \sim 1.5% (validated accuracy) to 37-63% while remaining completely undetectable by standard validation metrics. Notably, 100% of successful single-point attacks pass z-score anomaly detection. We introduce the effective perturbation dimension d_\texteff , a Jacobian-based diagnostic that, together with sensitivity magnitude, yields a two-factor vulnerability model explaining why architectures with extreme sensitivity concentration (POD-DeepONet, d_\texteff \approx 1 ) are not necessarily the most exploitable, since low-rank output projections cap maximum error, while moderate concentration with sufficient amplification (S-DeepONet, d_\texteff \approx 4 ) produces the highest attack success. Gradient-free search outperforms gradient-based alternatives (PGD) on architectures with gradient pathologies, while random perturbations of equal magnitude achieve near-zero success rates, confirming that the discovered vulnerabilities are structural. Our findings expose a previously overlooked attack surface in operator learning models and establish that these models require robustness guarantees beyond standard validation before deployment.

[LG-46] OrgForge-IT: A Verifiable Synthetic Benchmark for LLM -Based Insider Threat Detection

链接: https://arxiv.org/abs/2603.22499
作者: Jeffrey Flynt
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic insider threat benchmarks face a consistency problem: corpora generated without an external factual constraint cannot rule out cross-artifact contradictions. The CERT dataset – the field’s canonical benchmark – is also static, lacks cross-surface correlation scenarios, and predates the LLM era. We present OrgForge-IT, a verifiable synthetic benchmark in which a deterministic simulation engine maintains ground truth and language models generate only surface prose, making cross-artifact consistency an architectural guarantee. The corpus spans 51 simulated days, 2,904 telemetry records at a 96.4% noise rate, and four detection scenarios designed to defeat single-surface and single-day triage strategies across three threat classes and eight injectable behaviors. A ten-model leaderboard reveals several findings: (1) triage and verdict accuracy dissociate - eight models achieve identical triage F1=0.80 yet split between verdict F1=1.0 and 0.80; (2) baseline false-positive rate is a necessary companion to verdict F1, with models at identical verdict accuracy differing by two orders of magnitude on triage noise; (3) victim attribution in the vishing scenario separates tiers - Tier A models exonerate the compromised account holder while Tier B models detect the attack but misclassify the victim; (4) rigid multi-signal thresholds structurally exclude single-surface negligent insiders, demonstrating the necessity of parallel, threat-class-specific triage pipelines; and (5) agentic software-engineering training acts as a force multiplier for multi-day temporal correlation, but only when paired with frontier-level parameter scale. Finally, prompt sensitivity analysis reveals that unstructured prompts induce vocabulary hallucination, motivating a two-track scoring framework separating prompt adherence from reasoning capability. OrgForge-IT is open source under the MIT license. Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2603.22499 [cs.CR] (or arXiv:2603.22499v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.22499 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-47] A Theoretical Framework for Energy-Aware Gradient Pruning in Federated Learning

链接: https://arxiv.org/abs/2603.22465
作者: Emmanouil M. Athanasakos
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Machine Learning (stat.ML)
*备注: 8 pages, 2 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Federated Learning (FL) is constrained by the communication and energy limitations of decentralized edge devices. While gradient sparsification via Top-K magnitude pruning effectively reduces the communication payload, it remains inherently energy-agnostic. It assumes all parameter updates incur identical downstream transmission and memory-update costs, ignoring hardware realities. We formalize the pruning process as an energy-constrained projection problem that accounts for the hardware-level disparities between memory-intensive and compute-efficient operations during the post-backpropagation phase. We propose Cost-Weighted Magnitude Pruning (CWMP), a selection rule that prioritizes parameter updates based on their magnitude relative to their physical cost. We demonstrate that CWMP is the optimal greedy solution to this constrained projection and provide a probabilistic analysis of its global energy efficiency. Numerical results on a non-IID CIFAR-10 benchmark show that CWMP consistently establishes a superior performance-energy Pareto frontier compared to the Top-K baseline.

[LG-48] SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale

链接: https://arxiv.org/abs/2603.22455
作者: YanZhao Zheng,ZhenTao Zhang,Chao Ma,YuanQiang Yu,JiHuan Zhu,Baohua Dong,Hangcheng Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As LLM agent ecosystems grow, the number of available skills (tools, plugins) has reached tens of thousands, making it infeasible to inject all skills into an agent’s context. This creates a need for skill routing – retrieving the most relevant skills from a large pool given a user task. The problem is compounded by pervasive functional overlap in community skill repositories, where many skills share similar names and purposes yet differ in implementation details. Despite its practical importance, skill routing remains under-explored. Current agent architectures adopt a progressive disclosure design – exposing only skill names and descriptions to the agent while keeping the full implementation body hidden – implicitly treating metadata as sufficient for selection. We challenge this assumption through a systematic empirical study on a benchmark of ~ 80K skills and 75 expert-verified queries. Our key finding is that the skill body (full implementation text) is the decisive signal: removing it causes 29–44 percentage point degradation across all retrieval methods, and cross-encoder attention analysis reveals 91.7% of attention concentrating on the body field. Motivated by this finding, we propose SkillRouter, a two-stage retrieve-and-rerank pipeline totaling only 1.2B parameters (0.6B encoder + 0.6B reranker). SkillRouter achieves 74.0% top-1 routing accuracy and delivers the strongest average result among the compact and zero-shot baselines we evaluate, while remaining deployable on consumer hardware.

[LG-49] mmFHE: mmWave Sensing with End-to-End Fully Homomorphic Encryption

链接: https://arxiv.org/abs/2603.22437
作者: Tanvir Ahmed,Yixuan Gao,Adnan Armouti,Rajalakshmi Nandakumar
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Under review

点击查看摘要

Abstract:We present mmFHE, the first system that enables fully homomorphic encryption (FHE) for end-to-end mmWave radar sensing. mmFHE encrypts raw range profiles on a lightweight edge device and executes the entire mmWave signal-processing and ML inference pipeline homomorphically on an untrusted cloud that operates exclusively on ciphertexts. At the core of mmFHE is a library of seven composable, data-oblivious FHE kernels that replace standard DSP routines with fixed arithmetic circuits. These kernels can be flexibly composed into different application-specific pipelines. We demonstrate this approach on two representative tasks: vital-sign monitoring and gesture recognition. We formally prove two cryptographic guarantees for any pipeline assembled from this library: input privacy, the cloud learns nothing about the sensor data; and data obliviousness, the execution trace is identical on the cloud regardless of the data being processed. These guarantees effectively neutralize various supervised and unsupervised privacy attacks on raw data, including re-identification and data-dependent privacy leakage. Evaluation on three public radar datasets (270 vital-sign recordings, 600 gesture trials) shows that encryption introduces negligible error: HR/RR MAE 10^-3 bpm versus plaintext, and 84.5% gesture accuracy (vs. 84.7% plaintext) with end-to-end cloud GPU latency of 103s for a 10s vital-sign window and 37s for a 3s gesture window. These results show that privacy-preserving end-to-end mmWave sensing is feasible on commodity hardware today.

[LG-50] Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2603.22430
作者: Rohan Deb,Stephen J. Wright,Arindam Banerjee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function), and apply it at inference time without further refinement. We introduce an inference time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables endto-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.

[LG-51] Neural Structure Embedding for Symbolic Regression via Continuous Structure Search and Coefficient Optimization

链接: https://arxiv.org/abs/2603.22429
作者: Fateme Memar,Tao Zhe,Dongjie Wang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 7 figures

点击查看摘要

Abstract:Symbolic regression aims to discover human-interpretable equations that explain observational data. However, existing approaches rely heavily on discrete structure search (e.g., genetic programming), which often leads to high computational cost, unstable performance, and limited scalability to large equation spaces. To address these challenges, we propose SRCO, a unified embedding-driven framework for symbolic regression that transforms symbolic structures into a continuous, optimizable representation space. The framework consists of three key components: (1) structure embedding: we first generate a large pool of exploratory equations using traditional symbolic regression algorithms and train a Transformer model to compress symbolic structures into a continuous embedding space; (2) continuous structure search: the embedding space enables efficient exploration using gradient-based or sampling-based optimization, significantly reducing the cost of navigating the combinatorial structure space; and (3) coefficient optimization: for each discovered structure, we treat symbolic coefficients as learnable parameters and apply gradient optimization to obtain accurate numerical values. Experiments on synthetic and real-world datasets show that our approach consistently outperforms state-of-the-art methods in equation accuracy, robustness, and search efficiency. This work introduces a new paradigm for symbolic regression by bridging symbolic equation discovery with continuous embedding learning and optimization.

[LG-52] COMPASS-Hedge: Learning Safely Without Knowing the World

链接: https://arxiv.org/abs/2603.22348
作者: Ting Hu,Luanda Cai,Manolis Vlatakis
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Online learning algorithms often faces a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) Minimax-optimal regret in adversarial environments; ii) Instance-optimal, gap-dependent regret in stochastic environments; and iii) \tilde\mathcalO(1) regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment’s nature or the magnitude of the stochastic sub optimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first “best-of-three-world” guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency. Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2603.22348 [cs.LG] (or arXiv:2603.22348v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.22348 Focus to learn more arXiv-issued DOI via DataCite

[LG-53] Cloud-Edge Collaborative Large Models for Robust Photovoltaic Power Forecasting

链接: https://arxiv.org/abs/2603.22343
作者: Nan Qiao,Sijing Duan,Shuning Wang,Xingyuan Hua,Ju Ren
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Photovoltaic (PV) power forecasting in edge-enabled grids requires balancing forecasting accuracy, robustness under weather-driven distribution shifts, and strict latency constraints. Local specialized models are efficient for routine conditions but often degrade under rare ramp events and unseen weather patterns, whereas always relying on cloud-side large models incurs substantial communication delay and cloud overhead. To address this challenge, we propose a risk-aware cloud-edge collaborative framework for latency-sensitive PV forecasting. The framework integrates a site-specific expert predictor for routine cases, a lightweight edge-side model for enhanced local inference, and a cloud-side large retrieval model that provides matched historical context when needed through a retrieval-prediction pipeline. A lightweight screening module estimates predictive uncertainty, out-of-distribution risk, weather mutation intensity, and model disagreement, while a Lyapunov-guided router selectively escalates inference to the edge-small or cloud-assisted branches under long-term latency, communication, and cloud-usage constraints. The outputs of the activated branches are combined through adaptive fusion. Experiments on two real-world PV datasets demonstrate a favorable overall trade-off among forecasting accuracy, routing quality, robustness, and system efficiency.

[LG-54] Bridging the Gap Between Climate Science and Machine Learning in Climate Model Emulation

链接: https://arxiv.org/abs/2603.22320
作者: Luca Schmidt,Nina Effenberger
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While climate models provide insights for climate decision-making, their use is constrained by significant computational and technical demands. Although machine learning (ML) emulators offer a way to bypass the high computational costs, their effective use remains challenging. The hurdles are diverse, ranging from limited accessibility and a lack of specialized knowledge to a general mistrust of ML methods that are perceived as insufficiently physical. Here, we introduce a framework to overcome these barriers by integrating both climate science and machine learning perspectives. We find that designing easy-to-adopt emulators that address a clearly defined task and demonstrating their reliability offers a promising path for bridging the gap between our two fields.

[LG-55] A graph neural network based chemical mechanism reduction method for combustion applications

链接: https://arxiv.org/abs/2603.22318
作者: Manuru Nithin Padiyar,Priyabrat Dash,Konduri Aditya
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct numerical simulations of turbulent reacting flows involving millions of grid points and detailed chemical mechanisms with hundreds of species and thousands of reactions are computationally prohibitive. To address this challenge, we present two data-driven chemical mechanism reduction formulations based on graph neural networks (GNNs) with message-passing transformer layers that learn nonlinear dependencies among species and reactions. The first formulation, GNN-SM, employs a pre-trained surrogate model to guide reduction across a broad range of reactor conditions. The second formulation, GNN-AE, uses an autoencoder formulation to obtain highly compact mechanisms that remain accurate within the thermochemical regimes used during training. The approaches are demonstrated on detailed mechanisms for methane (53 species, 325 reactions), ethylene (96 species, 1054 reactions), and iso-octane (1034 species, 8453 reactions). GNN-SM achieves reductions comparable to the established graph-based method DRGEP while maintaining accuracy across a wide range of thermochemical states. In contrast, GNN-AE achieves up to 95% reduction in species and reactions and outperforms DRGEP within its target conditions. Overall, the proposed framework provides an automated, machine-learning-based pathway for chemical mechanism reduction that can complement traditional expert-guided analytical approaches.

[LG-56] Full waveform inversion method based on diffusion model

链接: https://arxiv.org/abs/2603.22307
作者: Caiyun Liu,Siyang Pei,Qingfeng Yu,Jie Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Seismic full-waveform inversion is a core technology for obtaining high-resolution subsurface model parameters. However, its highly nonlinear characteristics and strong dependence on the initial model often lead to the inversion process getting trapped in local minima. In recent years, generative diffusion models have provided a way to regularize full-waveform inversion by learning implicit prior distributions. However, existing methods mostly use unconditional diffusion processes, ignoring the inherent physical coupling relationship between velocity and density and other physical properties. This paper proposes a full-waveform inversion method based on conditional diffusion model regularization. By improving the backbone network structure of the diffusion model, two-dimensional density information is introduced as a conditional input into the U-Net network. Experimental results show that the full-waveform inversion method based on the conditional diffusion model significantly improves the resolution and structural fidelity of the inversion results, and exhibits stronger stability and robustness when dealing with complex situations. This method effectively utilizes density information to constrain the inversion and has good practical application value. Keywords: Deep learning; Diffusion model; Full waveform inversion. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2603.22307 [cs.LG] (or arXiv:2603.22307v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.22307 Focus to learn more arXiv-issued DOI via DataCite

[LG-57] Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization

链接: https://arxiv.org/abs/2603.22304
作者: Wenhao Zhao,Qiran Zou,Zhouhan Lin,Dianbo Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and diffusion synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the encoder has captured the underlying data manifold. We term this phenomenon Premature Discretization. To resolve this, we propose Progressive Quantization (ProVQ), which incorporates the dynamics of quantization hardness as a fundamental yet previously overlooked axis in VQ training. By treating quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, ProVQ effectively guides the codebook toward the well-expanded manifolds. Extensive experimental results demonstrate the broad effectiveness of ProVQ across diverse modalities. We report improved reconstruction and generative performance on the ImageNet-1K and ImageNet-100 benchmarks, highlighting the ProVQ’s boost for generative modeling. Furthermore, ProVQ proves highly effective for modeling complex biological sequences, establishing a new performance ceiling for protein structure tokenization on the StrutTokenBench leaderboard.

[LG-58] Research on Individual Trait Clustering and Development Pathway Adaptation Based on the K-means Algorithm

链接: https://arxiv.org/abs/2603.22302
作者: Qianru Wei,Jihaoyu Yang,Cheng Zhang,Jinming Yang
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:With the development of information technology, the application of artificial intelligence and machine learning in the field of education shows great potential. This study aims to explore how to utilize K-means clustering algorithm to provide accurate career guidance for college students. Existing methods mostly focus on the prediction of career paths, but there are fewer studies on the fitness of students with different combinations of characteristics in specific career directions. In this study, we analyze the data of more than 3000 students on their CET-4 scores, GPA, personality traits and student cadre experiences, and use the K-means clustering algorithm to classify the students into four main groups. The K-means clustering algorithm groups students with similar characteristics into one group by minimizing the intra-cluster squared error, ensuring that the students within the same cluster are highly similar in their characteristics, and that differences between different clusters are maximized. Based on the clustering results, targeted career guidance suggestions are provided for each group. The results of the study show that students with different combinations of characteristics are suitable for different career directions, which provides a scientific basis for personalized career guidance and effectively enhances students’ employment success rate. Future research can further improve the precision of clustering and the guidance effect by expanding the sample size, increasing the feature variables and considering external factors.

[LG-59] Contextual Graph Matching with Correlated Gaussian Features

链接: https://arxiv.org/abs/2603.23305
作者: Mohammad Hassan Ahmad Yarandi,Luca Ganassali
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate contextual graph matching in the Gaussian setting, where both edge weights and node features are correlated across two networks. We derive precise information-theoretic thresholds for exact recovery, and identify conditions under which almost exact recovery is possible or impossible, in terms of graph and feature correlation strengths, the number of nodes, and feature dimension. Interestingly, whereas an all-or-nothing phase transition is observed in the standard graph-matching scenario, the additional contextual information introduces a richer structure: thresholds for exact and almost exact recovery no longer coincide. Our results provide the first rigorous characterization of how structural and contextual information interact in graph matching, and establish a benchmark for designing efficient algorithms.

[LG-60] Generative Inversion of Spectroscopic Data for Amorphous Structure Elucidation

链接: https://arxiv.org/abs/2603.23210
作者: Jiawei Guo,Daniel Schwalbe-Koda
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 10 pages; SI: 51 pages

点击查看摘要

Abstract:Determining atomistic structures from characterization data is one of the most common yet intricate problems in materials science. Particularly in amorphous materials, proposing structures that balance realism and agreement with experiments requires expert guidance, good interatomic potentials, or both. Here, we introduce GLASS, a generative framework that inverts multi-modal spectroscopic measurements into realistic atomistic structures without knowledge of the potential energy surface. A score-based model learns a structural prior from low-fidelity data and samples out-of-distribution structures conditioned on differentiable spectral targets. Reconstructions using pair distribution functions (PDFs), X-ray absorption spectroscopy, and diffraction measurements quantify the complementarity between spectral modalities and demonstrate that PDFs is the most informative probe for our framework. We use GLASS to rationalize three contested experimental problems: paracrystallinity in amorphous silicon, a liquid-liquid phase transition in sulfur, and ball-milled amorphous ice. In each case, generated structures reproduce experimental measurements and reveal mechanisms inaccessible to diffraction analysis alone.

[LG-61] Between Resolution Collapse and Variance Inflation: Weighted Conformal Anomaly Detection in Low-Data Regimes

链接: https://arxiv.org/abs/2603.23205
作者: Oliver Hennhöfer,Christine Preisach
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 18 pages, 2 figures, 7 tables

点击查看摘要

Abstract:Standard conformal anomaly detection provides marginal finite-sample guarantees under the assumption of exchangeability . However, real-world data often exhibit distribution shifts, necessitating a weighted conformal approach to adapt to local non-stationarity. We show that this adaptation induces a critical trade-off between the minimum attainable p-value and its stability. As importance weights localize to relevant calibration instances, the effective sample size decreases. This can render standard conformal p-values overly conservative for effective error control, while the smoothing technique used to mitigate this issue introduces conditional variance, potentially masking anomalies. We propose a continuous inference relaxation that resolves this dilemma by decoupling local adaptation from tail resolution via continuous weighted kernel density estimation. While relaxing finite-sample exactness to asymptotic validity, our method eliminates Monte Carlo variability and recovers the statistical power lost to discretization. Empirical evaluations confirm that our approach not only restores detection capabilities where discrete baselines yield zero discoveries, but outperforms standard methods in statistical power while maintaining valid marginal error control in practice.

[LG-62] High-Resolution Tensor-Network Fourier Methods for Exponentially Compressed Non-Gaussian Aggregate Distributions

链接: https://arxiv.org/abs/2603.23106
作者: Juan José Rodríguez-Aldavero,Juan José García-Ripoll
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantum Physics (quant-ph)
*备注: 22 pages, 13 figures

点击查看摘要

Abstract:Characteristic functions of weighted sums of independent random variables exhibit low-rank structure in the quantized tensor train (QTT) representation, also known as matrix product states (MPS), enabling up to exponential compression of their fully non-Gaussian probability distributions. Under variable independence, the global characteristic function factorizes into local terms. Its low-rank QTT structure arises from intrinsic spectral smoothness in continuous models, or from spectral energy concentration as the number of components D grows in discrete models. We demonstrate this on weighted sums of Bernoulli and lognormal random variables. In the former, despite an adversarial, incompressible small- D regime, the characteristic function undergoes a sharp bond-dimension collapse for D \gtrsim 300 components, enabling polylogarithmic time and memory scaling. In the latter, the approach reaches high-resolution discretizations of N = 2^30 frequency modes on standard hardware, far beyond the N = 2^24 ceiling of dense implementations. These compressed representations enable efficient computation of Value at Risk (VaR) and Expected Shortfall (ES), supporting applications in quantitative finance and beyond.

[LG-63] Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

链接: https://arxiv.org/abs/2603.23057
作者: Saurabh Kataria,Xiao Hu
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for Zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, 1) we use a simple prompt ensemble and 2) suggest a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.

[LG-64] Post-Selection Distributional Model Evaluation

链接: https://arxiv.org/abs/2603.23055
作者: Amirmohammad Farzaneh,Osvaldo Simeone
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, requiring the reliable estimate of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls post-selection false coverage rate (FCR) for the distributional KPI estimates and is proved to be more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance–reliability trade-offs.

[LG-65] A PAC-Bayesian approach to generalization for quantum models

链接: https://arxiv.org/abs/2603.22964
作者: Pablo Rodriguez-Grasa,Matthias C. Caro,Jens Eisert,Elies Gil-Fuster,Franz J. Schreiber,Carlos Bravo-Prieto
类目: Quantum Physics (quant-ph); Quantum Gases (cond-mat.quant-gas); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15+29 pages, 4 figures

点击查看摘要

Abstract:Generalization is a central concept in machine learning theory, yet for quantum models, it is predominantly analyzed through uniform bounds that depend on a model’s overall capacity rather than the specific function learned. These capacity-based uniform bounds are often too loose and entirely insensitive to the actual training and learning process. Previous theoretical guarantees have failed to provide non-uniform, data-dependent bounds that reflect the specific properties of the learned solution rather than the worst-case behavior of the entire hypothesis class. To address this limitation, we derive the first PAC-Bayesian generalization bounds for a broad class of quantum models by analyzing layered circuits composed of general quantum channels, which include dissipative operations such as mid-circuit measurements and feedforward. Through a channel perturbation analysis, we establish non-uniform bounds that depend on the norms of learned parameter matrices; we extend these results to symmetry-constrained equivariant quantum models; and we validate our theoretical framework with numerical experiments. This work provides actionable model design insights and establishes a foundational tool for a more nuanced understanding of generalization in quantum machine learning.

[LG-66] Stepwise Variational Inference with Vine Copulas

链接: https://arxiv.org/abs/2603.22959
作者: Elisabeth Griesbauer,Leiv Rønneberg,Arnoldo Frigessi,Claudia Czado,Ingrid Hobæk Haff
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose stepwise variational inference (VI) with vine copulas: a universal VI procedure that combines vine copulas with a novel stepwise estimation procedure of the variational parameters. Vine copulas consist of a nested sequence of trees built from copulas, where more complex latent dependence can be modeled with increasing number of trees. We propose to estimate the vine copula approximate posterior in a stepwise fashion, tree by tree along the vine structure. Further, we show that the usual backward Kullback-Leibler divergence cannot recover the correct parameters in the vine copula model, thus the evidence lower bound is defined based on the Rényi divergence. Finally, an intuitive stopping criterion for adding further trees to the vine eliminates the need to pre-define a complexity parameter of the variational distribution, as required for most other approaches. Thus, our method interpolates between mean-field VI (MFVI) and full latent dependence. In many applications, in particular sparse Gaussian processes, our method is parsimonious with parameters, while outperforming MFVI.

[LG-67] REALITrees: Rashomon Ensemble Active Learning for Interpretable Trees

链接: https://arxiv.org/abs/2603.22750
作者: Simon D. Nguyen,Hayden McTavish,Kentaro Hoffman,Cynthia Rudin,Tyler H. McCormick
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Active learning reduces labeling costs by selecting samples that maximize information gain. A dominant framework, Query-by-Committee (QBC), typically relies on perturbation-based diversity by inducing model disagreement through random feature subsetting or data blinding. While this approximates one notion of epistemic uncertainty, it sacrifices direct characterization of the plausible hypothesis space. We propose the complementary approach: Rashomon Ensembled Active Learning (REAL) which constructs a committee by exhaustively enumerating the Rashomon Set of all near-optimal models. To address functional redundancy within this set, we adopt a PAC-Bayesian framework using a Gibbs posterior to weight committee members by their empirical risk. Leveraging recent algorithmic advances, we exactly enumerate this set for the class of sparse decision trees. Across synthetic and established active learning baselines, REAL outperforms randomized ensembles, particularly in moderately noisy environments where it strategically leverages expanded model multiplicity to achieve faster convergence.

[LG-68] Overfitting and Generalizing with (PAC) Bayesian Prediction in Noisy Binary Classification

链接: https://arxiv.org/abs/2603.22644
作者: Xiaohan Zhu,Mesrob I. Ohannessian,Nathan Srebro
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a PAC-Bayes type learning rule for binary classification, balancing the training error of a randomized ‘‘posterior’’ predictor with its KL divergence to a pre-specified ‘‘prior’’. This can be seen as an extension of a modified two-part-code Minimum Description Length (MDL) learning rule, to continuous priors and randomized predictions. With a balancing parameter of \lambda=1 this learning rule recovers an (empirical) Bayes posterior and a modified variant recovers the profile posterior, linking with standard Bayesian prediction (up to the treatment of the single-parameter noise level). However, from a risk-minimization prediction perspective, this Bayesian predictor overfits and can lead to non-vanishing excess loss in the agnostic case. Instead a choice of \lambda \gg 1 , which can be seen as using a sample-size-dependent-prior, ensures uniformly vanishing excess loss even in the agnostic case. We precisely characterize the effect of under-regularizing (and over-regularizing) as a function of the balance parameter \lambda , understanding the regimes in which this under-regularization is tempered or catastrophic. This work extends previous work by Zhu and Srebro [2025] that considered only discrete priors to PAC Bayes type learning rules and, through their rigorous Bayesian interpretation, to Bayesian prediction more generally.

[LG-69] Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

链接: https://arxiv.org/abs/2603.22563
作者: Young Hyun Cho,Will Wei Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.

[LG-70] SPDE Methods for Nonparametric Bayesian Posterior Contraction and Laplace Approximation

链接: https://arxiv.org/abs/2603.22468
作者: Enric Alberola-Boloix,Ioar Casado-Telletxea
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 32 pages, under review

点击查看摘要

Abstract:We derive posterior contraction rates (PCRs) and finite-sample Bernstein von Mises (BvM) results for non-parametric Bayesian models by extending the diffusion-based framework of Mou et al. (2024) to the infinite-dimensional setting. The posterior is represented as the invariant measure of a Langevin stochastic partial differential equation (SPDE) on a separable Hilbert space, which allows us to control posterior moments and obtain non-asymptotic concentration rates in Hilbert norms under various likelihood curvature and regularity conditions. We also establish a quantitative Laplace approximation for the posterior. The theory is illustrated in a nonparametric linear Gaussian inverse problem.

[LG-71] Probabilistic modeling over permutations using quantum computers

链接: https://arxiv.org/abs/2603.22401
作者: Vasilis Belis,Giulio Crognaletti,Matteo Argenton,Michele Grossi,Maria Schuld
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 36 pages, 4 Figures

点击查看摘要

Abstract:Quantum computers provide a super-exponential speedup for performing a Fourier transform over the symmetric group, an ability for which practical use cases have remained elusive so far. In this work, we leverage this ability to unlock spectral methods for machine learning over permutation-structured data, which appear in applications such as multi-object tracking and recommendation systems. It has been shown previously that a powerful way of building probabilistic models over permutations is to use the framework of non-Abelian harmonic analysis, as the model’s group Fourier spectrum captures the interaction complexity: “low frequencies” correspond to low order correlations, and “high frequencies” to more complex ones. This can be used to construct a Markov chain model driven by alternating steps of diffusion (a group-equivariant convolution) and conditioning (a Bayesian update). However, this approach is computationally challenging and hence limited to simple approximations. Here we construct a quantum algorithm that encodes the exact probabilistic model – a classically intractable object – into the amplitudes of a quantum state by making use of the Quantum Fourier Transform (QFT) over the symmetric group. We discuss the scaling, limitations, and practical use of such an approach, which we envision to be a first step towards useful applications of non-Abelian QFTs.

[LG-72] Neutrino Oscillation Parameter Estimation Using Structured Hierarchical Transformers

链接: https://arxiv.org/abs/2603.22342
作者: Giorgio Morales,Gregory Lehaut,Antonin Vacheret,Frederic Jurie,Jalal Fadili
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: Paper accepted to appear in the IEEE International Joint Conference on Neural Networks 2026

点击查看摘要

Abstract:Neutrino oscillations encode fundamental information about neutrino masses and mixing parameters, offering a unique window into physics beyond the Standard Model. Estimating these parameters from oscillation probability maps is, however, computationally challenging due to the maps’ high dimensionality and nonlinear dependence on the underlying physics. Traditional inference methods, such as likelihood-based or Monte Carlo sampling approaches, require extensive simulations to explore the parameter space, creating major bottlenecks for large-scale analyses. In this work, we introduce a data-driven framework that reformulates atmospheric neutrino oscillation parameter inference as a supervised regression task over structured oscillation maps. We propose a hierarchical transformer architecture that explicitly models the two-dimensional structure of these maps, capturing angular dependencies at fixed energies and global correlations across the energy spectrum. To improve physical consistency, the model is trained using a surrogate simulation constraint that enforces agreement between the predicted parameters and the reconstructed oscillation patterns. Furthermore, we introduce a neural network-based uncertainty quantification mechanism that produces distribution-free prediction intervals with formal coverage guarantees. Experiments on simulated oscillation maps under Earth-matter conditions demonstrate that the proposed method is comparable to a Markov Chain Monte Carlo baseline in estimation accuracy, with substantial improvements in computational cost (around 240 \times fewer FLOPs and 33 \times faster in average processing time). Moreover, the conformally calibrated prediction intervals remain narrow while achieving the target nominal coverage of 90%, confirming both the reliability and efficiency of our approach.

[LG-73] Fair splits flip the leaderboard: CHANRG reveals limited generalization in RNA secondary-structure prediction

链接: https://arxiv.org/abs/2603.22330
作者: Zhiyuan Chen,Zhenfeng Deng,Pan Deng,Yue Liao,Xiu Su,Peng Ye,Xihui Liu
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of RNA secondary structure underpins transcriptome annotation, mechanistic analysis of non-coding RNAs, and RNA therapeutic design. Recent gains from deep learning and RNA foundation models are difficult to interpret because current benchmarks may overestimate generalization across RNA families. We present the Comprehensive Hierarchical Annotation of Non-coding RNA Groups (CHANRG), a benchmark of 170,083 structurally non-redundant RNAs curated from more than 10 million sequences in Rfam~15.0 using structure-aware deduplication, genome-aware split design and multiscale structural evaluation. Across 29 predictors, foundation-model methods achieved the highest held-out accuracy but lost most of that advantage out of distribution, whereas structured decoders and direct neural predictors remained markedly more robust. This gap persisted after controlling for sequence length and reflected both loss of structural coverage and incorrect higher-order wiring. Together, CHANRG and a padding-free, symmetry-aware evaluation stack provide a stricter and batch-invariant framework for developing RNA structure predictors with demonstrable out-of-distribution robustness.

附件下载

点击下载今日全部论文列表