Arxiv今日论文 | 2026-03-19

本篇博文主要内容为 2026-03-19 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR、MA六个大方向区分。

说明：每日论文数据从Arxiv.org获取，每天早上12:30左右定时自动更新。

提示: 当天未及时更新，有可能是Arxiv当日未有新的论文发布，也有可能是脚本出错。尽可能会在当天修复。

自然语言处理共84篇(Computation and Language (cs.CL))
人工智能共185篇(Artificial Intelligence (cs.AI))
计算机视觉共171篇(Computer Vision and Pattern Recognition (cs.CV))
机器学习共167篇(Machine Learning (cs.LG))
多智能体系统共14篇(Multiagent Systems (cs.MA))
信息检索共17篇(Information Retrieval (cs.IR))
人机交互共23篇(Human-Computer Interaction (cs.HC))

多智能体系统

[MA-0] Actionable Recourse in Competitive Environments: A Dynamic Game of Endogenous Selection

【速读】：该论文旨在解决在竞争性环境中（如招生或招聘）当所有个体均具备可操作的补救措施（actionable recourse）时，如何影响整体选拔结果及公平性的问题。其核心挑战在于：若每位候选人都能根据决策规则调整自身可操作特征以提升被选中概率，则群体行为会动态改变成功基准，从而引发内生选择机制（endogenous selection）。解决方案的关键在于提出一个将补救视为候选人之间战略互动的框架，其中决策规则与成功阈值由群体当前特征状态共同决定，形成闭环动力系统，揭示了初始差异如何通过这一机制被放大并导致持续的绩效差距。

链接: https://arxiv.org/abs/2603.17907
作者: Ya-Ting Yang,Quanyan Zhu
机构: New York University (纽约大学)
类目: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems. However, many such systems operate in competitive settings, such as admission or hiring, where only a fraction of candidates can succeed. A fundamental question arises: what happens when actionable recourse is available to everyone in a competitive environment? This study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule. Rejected individuals exert effort to improve actionable features along directions implied by the decision rule, while the success benchmark evolves endogenously as many candidates adjust simultaneously. This creates endogenous selection, in which both the decision rule and the selection threshold are determined by the population’s current feature state. This interaction generates a closed-loop dynamical system linking candidate selection and strategic recourse. We show that the initially selected candidates determine both the benchmark of success and the direction of improvement, thereby amplifying initial disparities and producing persistent performance gaps across the population.

[MA-1] Governed Memory: A Production Architecture for Multi-Agent Workflows

【速读】：该论文旨在解决企业级人工智能（Enterprise AI）中因缺乏共享记忆与统一治理而导致的五大结构性挑战：跨代理工作流的记忆孤岛、治理碎片化、非结构化记忆无法被下游系统利用、多步骤自主执行中的冗余上下文传递，以及无反馈机制下的隐性质量退化。其解决方案的核心是提出“受管记忆”（Governed Memory）架构，通过四大机制实现：① 双模态记忆模型（结合开放集原子事实与模式强制的类型属性）；② 分层治理路由与渐进式上下文交付；③ 基于反射约束的检索与实体作用域隔离；④ 闭环模式生命周期管理（含AI辅助编写与属性级自动化优化）。实验证明该方案在事实召回率（99.6%）、治理精度（92%）、token消耗减少（50%）及跨实体泄露防护（零泄漏）等方面表现优异，且在LoCoMo基准测试中达到74.8%准确率，验证了治理与模式强制不会损害检索质量。

链接: https://arxiv.org/abs/2603.17787
作者: Hamed Taheri
机构: Personize.ai
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 18 pages, 4 figures, 11 tables, 7 appendices. Code and datasets: this https URL

点击查看摘要

Abstract:Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at this http URL.

[MA-2] In Trust We Survive: Emergent Trust Learning

【速读】：该论文旨在解决多智能体系统（Multiagent Systems）中在共享资源竞争环境下难以实现稳定合作的问题，尤其是在个体理性行为可能导致集体非最优结果（如资源枯竭或合作破裂）的情境下。其解决方案的核心是提出一种轻量级的信任学习机制——Emergent Trust Learning (ETL)，该机制通过每个智能体维护一个紧凑的内部信任状态（trust state），动态调节记忆、探索和动作选择策略，从而在仅依赖局部观测和个体奖励的情况下，实现高效的协作与适应性行为。ETL不依赖全局信息或复杂通信，计算开销极低，且能在多种博弈场景（包括网格资源世界、分层塔楼环境和重复囚徒困境）中维持高生存率并恢复合作，展现出良好的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2603.17564
作者: Qianpu Chen,Giulio Barbero,Mike Preuss,Derya Soydaner
机构: 未知
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Emergent Trust Learning (ETL), a lightweight, trust-based control algorithm that can be plugged into existing AI agents. It enables these to reach cooperation in competitive game environments under shared resources. Each agent maintains a compact internal trust state, which modulates memory, exploration, and action selection. ETL requires only individual rewards and local observations and incurs negligible computational and communication overhead. We evaluate ETL in three environments: In a grid-based resource world, trust-based agents reduce conflicts and prevent long-term resource depletion while achieving competitive individual returns. In a hierarchical Tower environment with strong social dilemmas and randomised floor assignments, ETL sustains high survival rates and recovers cooperation even after extended phases of enforced greed. In the Iterated Prisoner’s Dilemma, the algorithm generalises to a strategic meta-game, maintaining cooperation with reciprocal opponents while avoiding long-term exploitation by defectors. Code will be released upon publication. Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG) Cite as: arXiv:2603.17564 [cs.MA] (or arXiv:2603.17564v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2603.17564 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-3] Bringing Network Coding into Multi-Robot Systems: Interplay Study for Autonomous Systems over Wireless Communications

【速读】：该论文旨在解决多机器人系统（Multi-Robot Systems, MRS）中无线通信链路带来的延迟、丢包和乱序问题对自主决策性能与安全性造成的负面影响。传统基于重传的传输层可靠性机制虽能缓解数据丢失，但其引入的长延迟往往与MRS任务的时效性需求不匹配，甚至导致接收数据失效。论文提出以自适应因果网络编码（Adaptive and Causal Network Coding）作为解决方案，其关键在于通过主动注入编码冗余来优化通信延迟与吞吐量，使数据在时序上更具相关性；同时根据实时信道状态动态调整通信速率，并利用高效算法实现因果调度，从而保障关键信息的及时可达性。案例研究表明，该方法显著减少有序交付阻塞、维持估计一致性并提升任务截止期限可靠性，凸显了将自主算法与通信机制协同设计的重要性。

链接: https://arxiv.org/abs/2603.17472
作者: Anil Zaher,Kiril Solovey,Alejandro Cohen
机构: Technion–Israel Institute of Technology (以色列理工学院)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Communication is a core enabler for multi-robot systems (MRS), providing the mechanism through which robots exchange state information, coordinate actions, and satisfy safety constraints. While many MRS autonomy algorithms assume reliable and timely message delivery, realistic wireless channels introduce delay, erasures, and ordering stalls that can degrade performance and compromise safety-critical decisions of the robot task. In this paper, we investigate how transport-layer reliability mechanisms that mitigate communication losses and delays shape the autonomy-communication loop. We show that conventional non-coded retransmission-based protocols introduce long delays that are misaligned with the timeliness requirements of MRS applications, and may render the received data irrelevant. As an alternative, we advocate for adaptive and causal network coding, which proactively injects coded redundancy to achieve the desired delay and throughput that enable relevant data delivery to the robotic task. Specifically, this method adapts to channel conditions between robots and causally tunes the communication rates via efficient algorithms. We present two case studies: cooperative localization under delayed and lossy inter-robot communication, and a safety-critical overtaking maneuver where timely vehicle-to-vehicle message availability determines whether an ego vehicle can abort to avoid a crash. Our results demonstrate that coding-based communication significantly reduces in-order delivery stalls, preserves estimation consistency under delay, and improves deadline reliability relative to retransmission-based transport. Overall, the study highlights the need to jointly design autonomy algorithms and communication mechanisms, and positions network coding as a principled tool for dependable multi-robot operation over wireless networks. Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI) Cite as: arXiv:2603.17472 [cs.RO] (or arXiv:2603.17472v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2603.17472 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[MA-4] Is Your LLM -as-a-Recommender Agent Trustable? LLM s Recommendation is Easily Hacked by Biases (Preferences)

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）作为推荐者（LLM-as-a-Recommender）在高价值实际任务中因潜在偏见导致推荐可靠性不足的问题。其解决方案的关键在于提出一个名为BiasRecBench的偏见推荐基准，通过构建具有校准质量边界的偏见合成流水线（Bias Synthesis Pipeline with Calibrated Quality Margins），系统性地控制最优选项与次优选项之间的质量差距，并注入逻辑合理且适配选项上下文的偏见，从而精准评估LLM代理在复杂推荐场景下的脆弱性。实验表明，即使具备强大推理能力，主流LLM代理仍易受偏见影响，揭示了当前智能体工作流中的关键可靠性瓶颈。

链接: https://arxiv.org/abs/2603.17417
作者: Zichen Tang,Zirui Zhang,Qian Wang,Zhenheng Tang,Bo Li,Xiaowen Chu
机构: The Hong Kong University of Science and Technology (香港科技大学); National University of Singapore (新加坡国立大学)
类目: Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Current Large Language Models (LLMs) are gradually exploited in practically valuable agentic workflows such as Deep Research, E-commerce recommendation, and job recruitment. In these applications, LLMs need to select some optimal solutions from massive candidates, which we term as \textitLLM-as-a-Recommender paradigm. However, the reliability of using LLM agents for recommendations is underexplored. In this work, we introduce a \textbfBias \textbfRecommendation \textbfBenchmark (\textbfBiasRecBench) to highlight the critical vulnerability of such agents to biases in high-value real-world tasks. The benchmark includes three practical domains: paper review, e-commerce, and job recruitment. We construct a \textscBias Synthesis Pipeline with Calibrated Quality Margins that 1) synthesizes evaluation data by controlling the quality gap between optimal and sub-optimal options to provide a calibrated testbed to elicit the vulnerability to biases; 2) injects contextual biases that are logical and suitable for option contexts. Extensive experiments on both SOTA (Gemini-2.5,3-pro, GPT-4o, DeepSeek-R1) and small-scale LLMs reveal that agents frequently succumb to injected biases despite having sufficient reasoning capabilities to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows, calling for specialized alignment strategies for LLM-as-a-Recommender. The complete code and evaluation datasets will be made publicly available shortly.

[MA-5] Agent ic Cognitive Profiling: Realigning Automated Alzheimers Disease Detection with Clinical Construct Validity

【速读】：该论文旨在解决当前自动化阿尔茨海默病（Alzheimer’s Disease, AD）筛查方法中因采用归纳式模式识别范式而导致的临床协议构念效度（construct validity）缺失问题。现有方法通常直接从输入信号到标签建立黑箱映射，牺牲了临床逻辑的可解释性与科学严谨性。其解决方案的关键在于提出代理认知剖面（Agentic Cognitive Profiling, ACP）框架，通过将标准化评估任务分解为原子级认知任务，并由专用大语言模型（Large Language Model, LLM）代理执行可验证的评分基元提取；同时，通过将语义理解与量化测量解耦，所有量化操作均由确定性函数调用完成，从而有效抑制幻觉并恢复构念效度。该设计在包含402名受试者、覆盖8个结构化认知任务的临床标注语料上实现了90.5%的任务评分匹配率和85.3%的AD预测准确率，证明了在不牺牲预测性能的前提下可实现高解释性的认知剖面生成。

链接: https://arxiv.org/abs/2603.17392
作者: Jiawen Kang,Kun Li,Dongrui Han,Jinchao Li,Junan Li,Lingwei Meng,Xixin Wu,Helen Meng
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Multiagent Systems (cs.MA); Information Retrieval (cs.IR); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Automated Alzheimer’s Disease (AD) screening has predominantly followed the inductive paradigm of pattern recognition, which directly maps the input signal to the outcome label. This paradigm sacrifices construct validity of clinical protocol for statistical shortcuts. This paper proposes Agentic Cognitive Profiling (ACP), an agentic framework that realigns automated screening with clinical protocol logic across multiple cognitive domains. Rather than learning opaque mappings from transcripts to labels, the framework decomposes standardized assessments into atomic cognitive tasks and orchestrates specialized LLM agents to extract verifiable scoring primitives. Central to our design is decoupling semantic understanding from measurement by delegating all quantification to deterministic function calling, thereby mitigating hallucination and restoring construct validity. Unlike popular datasets that typically comprise around a hundred participants under a single task, we evaluate on a clinically-annotated corpus of 402 participants across eight structured cognitive tasks spanning multiple cognitive domains. The framework achieves 90.5% score match rate in task examination and 85.3% accuracy in AD prediction, surpassing popular baselines while generating interpretable cognitive profiles grounded in behavioral evidence. This work demonstrates that construct validity and predictive performance need not be traded off, charting a path toward AD screening systems that explain rather than merely predict.

[MA-6] Distributed Equilibrium-Seeking in Target Coverag e Games via Self-Configurable Networks under Limited Communication

【速读】：该论文旨在解决受限通信环境下多感知代理（sensing agents）对可能被攻击者动态重定位的目标进行协同覆盖的问题，其核心建模为一个零和博弈（zero-sum game），其中感知团队为防御方（defender），攻击者为对手。由于防御方的动作空间随传感器数量及其可能朝向呈指数增长，精确计算纳什均衡（Nash equilibrium, NE）在计算上不可行。解决方案的关键在于利用博弈效用函数的子模性（submodularity），提出一种分布式框架，使代理能够在带宽约束下自适应配置通信邻域，并协同最大化目标覆盖度；理论分析表明所获策略收敛至近似NE。该方法首次实现了在组合动作空间中可扩展的、显式考虑通信约束的分布式求解，其有效性通过引入分布式bandit-submodular优化框架与协调价值（Value of Coordination）概念得以保障，并在仿真中展现出接近最优的游戏收益和更高的目标覆盖性能。

链接: https://arxiv.org/abs/2603.17335
作者: Jayanth Bhargav,Zirui Xu,Vasileios Tzoumas,Mahsa Ghasemi,Shreyas Sundaram
机构: Purdue University (普渡大学); University of Michigan (密歇根大学)
类目: ystems and Control (eess.SY); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We study a target coverage problem in which a team of sensing agents, operating under limited communication, must collaboratively monitor targets that may be adaptively repositioned by an attacker. We model this interaction as a zero-sum game between the sensing team (known as the defender) and the attacker. However, computing an exact Nash equilibrium (NE) for this game is computationally prohibitive as the action space of the defender grows exponentially with the number of sensors and their possible orientations. Exploiting the submodularity property of the game’s utility function, we propose a distributed framework that enables agents to self-configure their communication neighborhoods under bandwidth constraints and collaboratively maximize the target coverage. We establish theoretical guarantees showing that the resulting sensing strategies converge to an approximate NE of the game. To our knowledge, this is the first distributed, communication-aware approach that scales effectively for games with combinatorial action spaces while explicitly incorporating communication constraints. To this end, we leverage the distributed bandit-submodular optimization framework and the notion of Value of Coordination that were introduced in [1]. Through simulations, we show that our approach attains near-optimal game value and higher target coverage compared to baselines.

[MA-7] ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization

【速读】：该论文旨在解决现代计算中内存系统延迟和能耗过高的问题，从而提升内存系统的整体效率。其解决方案的关键在于提出了一种可解释的多智能体在线强化学习框架 ReLMXEL（Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization），该框架通过奖励分解机制动态优化内存控制器参数，并利用详细的内存行为指标指导决策过程，实现对不同工作负载下内存访问特征的自适应调整，同时增强控制决策的透明度，推动更可信赖和灵活的内存系统设计。

链接: https://arxiv.org/abs/2603.17309
作者: Panuganti Chirag Sai,Gandholi Sarat,R. Raghunatha Sarma,Venkata Kalyan Tavva,Naveen M
机构: Sri Sathya Sai Institute of Higher Learning ( Sri Sathya Sai Institute of Higher Learning); Indian Institute of Technology Ropar (Indian Institute of Technology Ropar); Red Hat (Red Hat)
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), a explainable multi-agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision-making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload-specific memory access behaviour. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.

[MA-8] Ablation Study of a Fairness Auditing Agent ic System for Bias Mitigation in Early-Onset Colorectal Cancer Detection

【速读】：该论文旨在解决临床人工智能（AI）系统中因缺乏充分监管和领域专业知识而导致的算法偏见与安全风险问题，尤其是在早期结直肠癌（early-onset colorectal cancer, EO-CRC）筛查模型中存在的种族、性别等人口统计学差异。其解决方案的关键在于构建一个双代理架构：Domain Expert Agent（领域专家代理）负责整合EO-CRC相关文献中的差异证据，Fairness Consultant Agent（公平性顾问代理）则据此推荐敏感属性（sensitive attributes）和公平性度量指标（fairness metrics），并通过引入检索增强生成（Retrieval-Augmented Generation, RAG）技术提升代理对专业背景知识的利用能力。实验表明，结合RAG的代理系统在语义相似度上显著优于仅使用预训练大语言模型（LLM）或无RAG的代理配置，尤其在识别不公平性方面表现更优，证明了基于检索增强的智能体系统可有效扩展临床AI模型的公平性审计能力。

链接: https://arxiv.org/abs/2603.17179
作者: Amalia Ionescu,Jose Guadalupe Hernandez,Jui-Hsuan Chang,Emily F. Wong,Paul Wang,Jason H. Moore,Tiffani J. Bright
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is increasingly used in clinical settings, yet limited oversight and domain expertise can allow algorithmic bias and safety risks to persist. This study evaluates whether an agentic AI system can support auditing biomedical machine learning models for fairness in early-onset colorectal cancer (EO-CRC), a condition with documented demographic disparities. We implemented a two-agent architecture consisting of a Domain Expert Agent that synthesizes literature on EO-CRC disparities and a Fairness Consultant Agent that recommends sensitive attributes and fairness metrics for model evaluation. An ablation study compared three Ollama large language models (8B, 20B, and 120B parameters) across three configurations: pretrained LLM-only, Agent without Retrieval-Augmented Generation (RAG), and Agent with RAG. Across models, the Agent with RAG achieved the highest semantic similarity to expert-derived reference statements, particularly for disparity identification, suggesting agentic systems with retrieval may help scale fairness auditing in clinical AI.

[MA-9] Asymmetric Nash Seeking via Best Response Maps: Global Linear Convergence and Robustness to Inexact Reaction Models

【速读】：该论文旨在解决多智能体决策与控制中纳什均衡（Nash equilibrium）求解时面临的信息不对称问题，即传统方法通常假设各参与者完全知晓对方的目标函数和约束条件，而这一假设在实际应用中往往不成立。针对具有解耦可行集的两玩家约束博弈，其中玩家1掌握自身目标与约束，玩家2仅通过其最优响应映射（best-response map）被访问，作者提出了一种非对称投影梯度下降-最优响应迭代算法（asymmetric projected gradient descent-best response iteration）。该方案的关键在于：无需双方完全共享优化问题信息，仅依赖玩家1的局部信息和玩家2的最优响应映射即可实现收敛；在精确最优响应下，证明了纳什均衡的存在唯一性及全局线性收敛性；进一步分析了最优响应近似误差存在统一上界 $\varepsilon$ 的情形，表明迭代序列最终收敛至真实纳什均衡的一个 $O(\varepsilon)$ 邻域内，从而为实际中基于学习或估计的最优响应映射提供了理论保障。

链接: https://arxiv.org/abs/2603.17058
作者: Mahdis Rabbani,Navid Mojahed,Shima Nazari
机构: University of California, Davis (加州大学戴维斯分校)
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: 6 Pages, 2 Figures, Preprint submitted to IEEE L-CSS and CDC 2026

点击查看摘要

Abstract:Nash equilibria provide a principled framework for modeling interactions in multi-agent decision-making and control. However, many equilibrium-seeking methods implicitly assume that each agent has access to the other agents’ objectives and constraints, an assumption that is often unrealistic in practice. This letter studies a class of asymmetric-information two-player constrained games with decoupled feasible sets, in which Player 1 knows its own objective and constraints while Player 2 is available only through a best-response map. For this class of games, we propose an asymmetric projected gradient descent-best response iteration that does not require full mutual knowledge of both players’ optimization problems. Under suitable regularity conditions, we establish the existence and uniqueness of the Nash equilibrium and prove global linear convergence of the proposed iteration when the best-response map is exact. Recognizing that best-response maps are often learned or estimated, we further analyze the inexact case and show that, when the approximation error is uniformly bounded by \varepsilon , the iterates enter an explicit O(\varepsilon) neighborhood of the true Nash equilibrium. Numerical results on a benchmark game corroborate the predicted convergence behavior and error scaling.

[MA-10] Impacts of Electric Vehicle Charging Regimes and Infrastructure Deployments on System Performance: An Agent -Based Study

【速读】：该论文旨在解决电动汽车（Electric Vehicles, EVs）快速普及背景下充电基础设施规划的优化问题，尤其关注不同充电模式（目的地充电与途中充电）对设施布局成本及用户行为的影响。其解决方案的关键在于采用基于代理（Agent-Based Modeling, ABM）的建模框架，模拟三种充电模式下的潜在公共充电需求，并对比两种部署策略——基于优化的方法和基于利用率优化的方法。研究发现，利用率优化策略能显著降低系统总成本（包括建设成本与用户综合充电成本），尤其是在混合充电模式下效果最明显；其中，合理分配交流慢充桩可改变用户的目的地充电行为，从而减少对途中充电的依赖，降低因绕行带来的额外成本，凸显了不同充电模式间的行为关联性及其在规划中协同考虑的重要性。

链接: https://arxiv.org/abs/2603.16961
作者: Jiahua Hu,Hai L.Vu,Wynita Griggs,Hao Wang
机构: 未知
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Systems and Control (eess.SY)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:The rapid growth of electric vehicles (EVs) requires more effective charging infrastructure planning. Infrastructure layout not only determines deployment cost, but also reshapes charging behavior and influences overall system performance. In addition, destination charging and en-route charging represent distinct charging regimes associated with different power requirements, which may lead to substantially different infrastructure deployment outcomes. This study applies an agent-based modeling framework to generate trajectory-level latent public charging demand under three charging regimes based on a synthetic representation of the Melbourne (Australia) metropolitan area. Two deployment strategies, an optimization-based approach and a utilization-refined approach, are evaluated across different infrastructure layouts. Results show that utilization-refined deployments reduce total system cost, accounting for both infrastructure deployment cost and user generalized charging cost, with the most significant improvement observed under the combined charging regime. In particular, a more effective allocation of AC slow chargers reshapes destination charging behavior, which in turn reduces unnecessary reliance on en-route charging and lowers detour costs associated with en-route charging. This interaction highlights the behavioral linkage between destination and en-route charging regimes and demonstrates the importance of accounting for user response and multiple charging regimes in charging infrastructure planning.

[MA-11] Noncooperative Human-AI Agent Dynamics

【速读】：该论文旨在解决人工智能（AI）代理与人类决策者在战略环境中非合作互动的动态机制问题，尤其关注人类行为如何因前景理论（Prospect Theory）偏好而异于传统期望效用最大化模型。其解决方案的关键在于构建一个混合代理博弈框架：将人类建模为具有参考点依赖和损失厌恶特征的前景理论代理，而AI代理则维持标准期望效用最大化假设，并通过经典矩阵博弈和定制化场景进行多组数值模拟，以揭示不同偏好结构下策略行为的涌现特性。这一方法使研究能够识别出人类行为异常、AI适应性学习模式以及人机混合作用下的新行为规律。

链接: https://arxiv.org/abs/2603.16916
作者: Dylan Waldner,Vyacheslav Kungurtsev,Mitchelle Ashimosi
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: 41 pages

点击查看摘要

Abstract:This paper investigates the dynamics of noncooperative interactions between artificial intelligence agents and human decision-makers in strategic environments. In particular, motivated by extensive literature in behavioral Economics, human agents are more faithfully modeled with respect to the state of the art using Prospect Theoretic preferences, while AI agents are modeled with standard expected utility maximization. Prospect Theory incorporates known cognitive heuristics employed by humans, including reference dependence and greater loss aversion relative to utility to relative gains. This paper runs different combinations of expected utility and prospect theoretic agents in a number of classic matrix games as well as examples specialized to tease out distinctions in strategic behavior with respect to preference functions, to explore the emergent behaviors from mixed population (human vs. AI) competition. Extensive numerical simulations are performed across AI, aware humans (those with full knowledge of the game structure and payoffs), and learning Prospect Agents (i.e., for AIs representing humans). A number of interesting observations and patterns show up, spanning barely distinguishable behavior, behavior corroborating Prospect preference anomalies in the theoretical literature, and unexpected surprises. Code can be found at this https URL.

[MA-12] rraLingua: Emergence and Analysis of Open-endedness in LLM Ecologies

【速读】：该论文旨在解决如何在真实数字生态系统中理解自主代理（autonomous agents）的协调机制、制度形成过程以及共享文化积累的问题，这是科学研究与实际应用中的重要议题。解决方案的关键在于构建一个名为TerraLingua的持续性多代理生态体系，该体系通过引入资源约束和有限寿命等现实条件，使代理在交互中产生具有持久性的文化产物（artifacts），从而塑造后续互动与选择压力。这一设计促使代理自发演化出合作规范、分工协作、治理尝试及分支式文化传承路径，为刻画人工群体中累积文化与社会结构的形成机制提供了可量化的实验平台。

链接: https://arxiv.org/abs/2603.16910
作者: Giuseppe Paolo,Jamieson Warner,Hormoz Shahrzad,Babak Hodjat,Risto Miikkulainen,Elliot Meyerson
机构: Cognizant AI Lab; The University of Texas at Austin
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:As autonomous agents increasingly operate in real-world digital ecosystems, understanding how they coordinate, form institutions, and accumulate shared culture becomes both a scientific and practical priority. This paper introduces TerraLingua, a persistent multi-agent ecology designed to study open-ended dynamics in such systems. Unlike prior large language model simulations with static or consequence-free environments, TerraLingua imposes resource constraints and limited lifespans for the agents. As a result, agents create artifacts that persist beyond individuals, shaping future interactions and selection pressures. To characterize the dynamics, an AI Anthropologist systematically analyzes agent behavior, group structure, and artifact evolution. Across experimental conditions, the results reveal the emergence of cooperative norms, division of labor, governance attempts, and branching artifact lineages consistent with cumulative cultural processes. Divergent outcomes across experimental runs can be traced back to specific innovations and organizational structures. TerraLingua thus provides a platform for characterizing the mechanisms of cumulative culture and social organization in artificial populations, and can serve as a foundation for guiding real-world agentic populations to socially beneficial outcomes.

[MA-13] Capability-Priced Micro-Markets: A Micro-Economic Framework for the Agent ic Web over HTTP 402

【速读】：该论文旨在解决去中心化自治AI代理生态系统中经济协调的根本性挑战，即如何在极少人工干预的情况下实现高效、安全且可扩展的交易机制。其解决方案的核心在于提出能力定价微市场（Capability-Priced Micro-Markets, CPMM）框架，该框架融合了三项关键技术：基于MIT发起的Project NANDA的可密码验证的能力型安全与发现机制、支持高效低开销微支付的HTTP 402状态码及其现代扩展X402/H402协议，以及用于安全多步协商与承诺的代理能力协商与绑定协议（Agent Capability Negotiation and Binding Protocol, ACNBP）。通过将代理交互形式化为不完全信息下的重复双边博弈，论文理论证明CPMM机制收敛至受限的Radner均衡，从而在信息不对称条件下保障市场效率，并引入“隐私弹性需求”概念量化代理信息披露与其服务市场价格之间的权衡关系，为新兴的代理网络构建功能性微市场提供了理论完备且实践可行的方案。

链接: https://arxiv.org/abs/2603.16899
作者: Ken Huang,Jerry Huang,Mahesh Lambe,Hammad Atta,Yasir Mehmood,Muhammad Zeeshan Baig,Muhammad Aziz Ul Haq,Nadeem Shahzad,Shailja Gupta,Rajesh Ranjan,Rekha Singhal
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This paper introduces Capability-Priced Micro-Markets (CPMM), a micro-economic framework designed to enable robust, scalable, and secure commerce among autonomous AI agents on the agentic web. The framework addresses the fundamental challenge of economic coordination in decentralized agent ecosystems, where entities must transact with minimal human oversight. CPMM synthesizes three key technologies into a unified system: MIT originated, Project NANDA infrastructure for cryptographically verifiable, capability-based security and discovery; the HTTP 402 “Payment Required” status code, with modern X402/H402 extensions for efficient, low-cost micropayments; and the Agent Capability Negotiation and Binding Protocol (ACNBP) for secure, multi-step negotiation and commitment. The paper formalizes agent interactions as a repeated bilateral game with incomplete information, demonstrating theoretically that the CPMM mechanism converges to a constrained Radner equilibrium, ensuring efficient outcomes under information asymmetry. A key theoretical contribution is the concept of “privacy elasticity of demand,” which is introduced to quantify the trade-off between an agent’s information disclosure and the market price of its services. By integrating secure capabilities, micropayment protocols, and formal negotiation mechanisms, CPMM provides a comprehensive, theoretically-grounded solution for creating functional micro-markets for the emergent agentic web.

自然语言处理

[NLP-0] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在空间理解与视角感知推理方面的不足，尤其是其对3D场景结构和主体视角（egocentric perspective）建模能力的欠缺。解决方案的关键在于提出Loc3R-VLM框架，通过两个协同目标实现：一是全局布局重建（global layout reconstruction），用于构建场景结构的整体表示；二是显式情境建模（explicit situation modeling），用于锚定主体视角下的空间关系。该框架利用轻量级相机位姿先验（camera pose priors）来自预训练的3D基础模型，以确保几何一致性与度量尺度对齐，从而为视觉感知与语言理解提供直接的空间监督，显著提升模型在基于语言的定位及3D问答任务中的表现。

链接: https://arxiv.org/abs/2603.18002
作者: Kevin Qu,Haozhe Qi,Mihai Dusmanu,Mahdi Rad,Rui Wang,Marc Pollefeys
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: this https URL

[NLP-1] ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

【速读】：该论文旨在解决机器翻译（Machine Translation, MT）和大语言模型（Large Language Models, LLMs）在跨语言性别处理中的挑战，尤其是在从性别中性语言（如英语）向形态性别标记语言（如意大利语）翻译时，系统倾向于默认使用阳性形式，从而加剧性别偏见并降低翻译准确性。解决方案的关键在于提出Contextual Gender Annotation (ConGA) 框架，该框架基于语言学原则，提供词级性别标注方案：在英语中区分语义性别（Masculine (M)、Feminine (F)、Ambiguous (A)），在意大利语中明确语法性别（Masculine (M)、Feminine (F)），并引入实体级标识符以实现跨句追踪。通过将该标注体系应用于gENder-IT数据集，构建了评估性别偏见的金标准资源，揭示了当前MT系统中阳性的系统性过用和阴性表达的一致性不足，为开发更具备性别意识和多语言适应能力的自然语言处理系统提供了方法论与基准。

链接: https://arxiv.org/abs/2603.17962
作者: Argentina Anna Rescigno,Eva Vanmassenhove,Johanna Monti
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Handling gender across languages remains a persistent challenge for Machine Translation (MT) and Large Language Models (LLMs), especially when translating from gender-neutral languages into morphologically gendered ones, such as English to Italian. English largely omits grammatical gender, while Italian requires explicit agreement across multiple grammatical categories. This asymmetry often leads MT systems to default to masculine forms, reinforcing bias and reducing translation accuracy. To address this issue, we present the Contextual Gender Annotation (ConGA) framework, a linguistically grounded set of guidelines for word-level gender annotation. The scheme distinguishes between semantic gender in English through three tags, Masculine (M), Feminine (F), and Ambiguous (A), and grammatical gender realisation in Italian (Masculine (M), Feminine (F)), combined with entity-level identifiers for cross-sentence tracking. We apply ConGA to the gENder-IT dataset, creating a gold-standard resource for evaluating gender bias in translation. Our results reveal systematic masculine overuse and inconsistent feminine realisation, highlighting persistent limitations of current MT systems. By combining fine-grained linguistic annotation with quantitative evaluation, this work offers both a methodology and a benchmark for building more gender-aware and multilingual NLP systems.

[NLP-2] Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

【速读】：该论文旨在解决大规模语言模型在机器翻译（Machine Translation, MT）中存在系统性性别偏见的问题，尤其是由于不同语言对性别标记的处理方式差异导致的隐式性别信号难以准确转化的问题。其解决方案的关键在于提出了一种新的评估指标“先验偏见”（Prior Bias），用于量化模型对性别默认假设的程度，并将这一指标应用于仅解码器架构（decoder-only）的MT模型进行分析。研究发现，尽管这些模型规模庞大且性能先进，但在性别特定指标上并未显著优于编码器-解码器架构；然而，通过后训练策略（如指令微调）不仅能增强模型的上下文敏感性，还能有效降低男性化的先验偏见。

链接: https://arxiv.org/abs/2603.17952
作者: Chiara Manna,Hosein Mohebbi,Afra Alishahi,Frédéric Blain,Eva Vanmassenhove
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Language Models achieve state-of-the-art results across a wide range of NLP tasks, they remain prone to systematic biases. Among these, gender bias is particularly salient in MT, due to systematic differences across languages in whether and how gender is marked. As a result, translation often requires disambiguating implicit source signals into explicit gender-marked forms. In this context, standard benchmarks may capture broad disparities but fail to reflect the full complexity of gender bias in modern MT. In this paper, we extend recent frameworks on bias evaluation by: (i) introducing a novel measure coined “Prior Bias”, capturing a model’s default gender assumptions, and (ii) applying the framework to decoder-only MT models. Our results show that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder architectures on gender-specific metrics; however, post-training (e.g., instruction tuning) not only improves contextual awareness but also reduces the masculine Prior Bias.

[NLP-3] ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

【速读】：该论文旨在解决多语言预训练中因语言混合比例（language mixture ratios）不合理而导致测试损失优化不足的问题，尤其是现有多语言缩放定律（multilingual scaling laws）未能有效衡量跨语言迁移效应（cross-lingual transfer effect），从而导致混合比例次优。其解决方案的关键在于将多语言预训练建模为一个合作博弈（cooperative game），其中每种语言作为博弈参与者（player），共同贡献于预训练过程并共享最终的测试损失下降收益（payoff）。基于此框架，作者提出了一种名为ShapleyLaw的游戏理论型多语言缩放定律，通过Shapley值量化每种语言对整体性能提升的边际贡献，从而更准确地预测模型性能并优化语言混合比例。

链接: https://arxiv.org/abs/2603.17945
作者: Xuyang Cao,Qianying Liu,Chuan Xiao,Yusuke Oda,Pontus Stenetorp,Daisuke Kawahara,Makoto Onizuka,Sadao Kurohashi,Shuyuan Zheng
机构: NII LLMC; Osaka University; Nara Institute of Science and Technology; University College London; Kyoto University; Waseda University
类目: Computation and Language (cs.CL)
备注: 18 pages

点击查看摘要

Abstract:In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the \textitlanguage mixture ratios. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the \textitcross-lingual transfer effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called \textitShapleyLaw. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.

[NLP-4] Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在推理阶段因逐 token 生成而导致的效率瓶颈问题，即如何在不修改模型权重或引入额外训练的情况下实现多 token 并行预测以提升生成速度。其解决方案的关键在于提出一种无需训练的掩码 token 探测机制（probing-based MTP），通过从模型嵌入空间中动态采样掩码 token（mask tokens），构建候选 token 树并结合轻量级剪枝策略保留高概率延续路径；在解码过程中并行验证候选预测，从而实现无损生成且显著减少模型调用次数与延迟，同时利用 decoder 层对掩码表示与下一 token 状态的自然对齐特性，无需重训练或辅助模型即可达成准确的多步预测。

链接: https://arxiv.org/abs/2603.17942
作者: Raghavv Goel,Mukul Gagrani,Mingu Lee,Chris Lott
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8–12% on Qwen3, and achieving throughput gains of up to 15–19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.

[NLP-5] Only relative ranks matter in weight-clustered large language models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）参数冗余与存储效率低下的问题，核心挑战在于如何在不显著损失模型性能的前提下实现高效的模型压缩。解决方案的关键在于发现权重的相对秩（relative rank）比其精确数值更为重要：即权重连接的强弱关系（而非具体大小）对模型性能起决定性作用。作者通过将权重矩阵进行K-means聚类，用少量共享值（如16–64个）替代原始密集权重，实现了无需重新训练即可大幅压缩模型体积，且保持高精度；进一步研究表明，保持权重相对秩不变的随机化不会导致性能下降，而破坏秩结构则会引发严重质量退化，这揭示了模型鲁棒性的关键机制，并为高效、稳定压缩提供了理论依据。

链接: https://arxiv.org/abs/2603.17917
作者: Borja Aizpurua,Sukhbinder Singh,Román Orús
机构: Multiverse Computing(多元宇宙计算); Tecnun - University of Navarra(纳瓦拉大学技术学院); Donostia International Physics Center(多诺斯蒂亚国际物理中心); Ikerbasque Foundation for Science(伊克尔巴斯克科学基金会)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 10 pages, 3 figures, 9 tables

点击查看摘要

Abstract:Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights-whether one connection is stronger or weaker than another-rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply-perplexity can increase by orders of magnitude-even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift-not rank distortion-is the dominant collapse mechanism; however, an affine correction w’ = aw + b with a 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.

[NLP-6] IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多语言场景下，特别是在文化多样性高且资源匮乏的印地语系（Indic）语言中，其安全行为缺乏系统评估的问题。现有研究普遍忽视低资源语言中的文化敏感性与安全对齐问题，导致模型在跨语言部署时出现显著的安全漂移（safety drift）。解决方案的关键在于构建首个针对印地语系语言的系统性安全评估基准——IndicSafe，该基准包含6000个基于文化背景设计的提示（prompt），覆盖种姓、宗教、性别、健康和政治等敏感领域，并通过翻译变体测试10个主流LLM的安全表现。研究进一步引入提示级熵、类别偏差分数和多语言一致性指数等量化指标，揭示了模型在不同语言间安全响应的一致性极低（仅12.8%交叉语言一致性），并识别出过度拒绝良性请求、误标政治敏感话题及漏报不安全生成等关键缺陷。结果表明，当前LLM的安全对齐无法在语言间有效迁移，亟需采用基于区域危害的、语言感知的对齐策略。

链接: https://arxiv.org/abs/2603.17915
作者: Priyaranjan Pattnayak,Sanchari Chowdhuri
机构: Oracle America Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and \textttSAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts, overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release \textscIndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17915 [cs.CL] (or arXiv:2603.17915v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.17915 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-7] Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages

【速读】：该论文旨在解决语言距离的量化测量问题，即如何在跨语言比较中建立统一且可扩展的定量方法，以支持语言学、人类学及人类进化史研究。传统语言学虽能提供丰富的定性描述，但缺乏系统性的量化工具。其解决方案的关键在于利用预训练多语言语言模型（multilingual language models）中自发产生的注意力机制，提出一种称为注意力传输距离（Attention Transport Distance, ATD）的新指标：通过将注意力矩阵视为概率分布，并采用最优传输（optimal transport）方法计算其几何差异，从而衡量翻译过程中不同语言表征间的距离。这一方法不依赖词元化方式（tokenization-agnostic），并在大规模多样语言数据上验证了其对已知语言分类和地理/接触关系的高度一致性，同时作为正则项提升低资源机器翻译的迁移性能。

链接: https://arxiv.org/abs/2603.17912
作者: Yue Zhao,Jiatao Gu,Paloma Jeretič,Weijie Su
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Understanding the distance between human languages is central to linguistics, anthropology, and tracing human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the spontaneously emerged attention mechanisms of these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, termed Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses using artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.

[NLP-8] DebugLM: Learning Traceable Training Data Provenance for LLM s

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在多阶段训练过程中缺乏数据可追溯性的问题，即开发者难以定位导致特定行为的具体训练数据来源，从而使得调试依赖于被动修补，且在分布偏移或模型迭代更新后问题易复发。解决方案的关键在于提出 DebugLM 框架，通过在模型中嵌入数据溯源（data provenance）能力，使模型能够为响应分配唯一的数据源标签（provenance tags），从而精确识别不良行为的学习来源；在此基础上，DebugLM 还支持测试时的靶向干预，允许开发者在不重新训练或修改参数的情况下，对指定数据源触发拒绝响应，实现高效的行为控制与修复。

链接: https://arxiv.org/abs/2603.17884
作者: Wenjie Jacky Mo,Qin Liu,Xiaofei Wen,Wenxuan Zhou,Zhe Zhao,Muhao Chen
机构: University of California, Davis (加州大学戴维斯分校); University of Southern California (南加利福尼亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors are learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger targeted refusal for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.

[NLP-9] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在生成内容时易出现“幻觉”（hallucinations）的问题，即模型输出与事实不符或缺乏依据的内容，这在高风险领域中尤为危险。解决方案的关键在于提出一种基于领域约束的分层检索与验证架构（domain-grounded tiered retrieval and verification architecture），通过四阶段自调节流水线实现对事实性错误的系统性拦截：首先利用内在验证与早停逻辑优化计算效率；其次通过领域检测器引导至特定知识库进行精准检索；再以纠正文档评分（Corrective Document Grading, CRAG）过滤无关上下文；最后执行外在再生并逐条断言验证。该框架显著提升了多基准测试下的事实准确性，尤其在时间敏感和数值精确场景中表现突出，但识别出“虚假前提过度主张”（False-Premise Overclaiming）仍是待解难题。

链接: https://arxiv.org/abs/2603.17872
作者: Md. Asraful Haque,Aasar Mehdi,Maaz Mahboob,Tamkeen Fatima
机构: Aligarh Muslim University (阿尔igarh穆斯林大学); Interdisciplinary Center for Artificial Intelligence (跨学科人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 Pages, 5 Figures, 4 Tables

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to “hallucinations” - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of “False-Premise Overclaiming” was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval “answerability” nodes to further bridge the reliability gap in conversational AI.

[NLP-10] How do LLM s Compute Verbal Confidence ICML2026

【速读】：该论文试图解决的问题是：大语言模型（Large Language Models, LLMs）在被要求输出“口头自信度”（verbal confidence）时，其内部如何生成此类评分——即自信度是在请求时即时计算，还是在生成答案过程中自动计算并缓存以供后续调用；以及这些自信度分数反映的是简单的词元对数概率（token log-probabilities），还是更复杂的答案质量评估。解决方案的关键在于通过多种实验手段（包括激活操控、补丁实验、噪声干扰和交换实验）发现，自信度信息并非在输出位置实时生成，而是在答案邻近位置提前形成，并被缓存在第一个答案后的位置，随后被检索用于输出；进一步地，线性探测和方差分解表明，这些缓存表示能够解释远超词元对数概率的自信度变异，说明其本质是对答案质量的复杂评估，而非简单流畅性读出。这一发现揭示了LLM中自信度是一种自动且复杂的自我评估机制，而非事后重构，为理解LLM中的元认知（metacognition）提供了新视角，并有助于改进模型校准（calibration）。

链接: https://arxiv.org/abs/2603.17839
作者: Dharshan Kumaran,Arthur Conmy,Federico Barbero,Simon Osindero,Viorica Patraucean,Petar Velickovic
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to ICML 2026

点击查看摘要

Abstract:Verbal confidence – prompting LLMs to state their confidence as a number or category – is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation – not post-hoc reconstruction – with implications for understanding metacognition in LLMs and improving calibration.

[NLP-11] Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned Multi-Granularity Benchmark

【速读】：该论文旨在解决现有面向人类价值观的数据集无法有效支持事实性新闻中价值理解的问题，具体表现为数据集多为无主体（actor-agnostic）的、依赖孤立语句或合成场景，且缺乏显式的事件结构和价值方向标注。为此，作者提出了NEVU（News Event-centric Value Understanding）基准，其核心创新在于构建了一个以事件为中心（event-centric）、面向主体的（actor-conditioned）且具备价值方向感知能力（direction-aware）的人类价值观识别框架。关键解决方案包括：基于2,865篇英文新闻文章构建四层语义单元（子事件、基于行为的复合事件、基于故事的复合事件及整篇文章）的细粒度标注体系，并通过LLM辅助的分阶段验证与针对性人工审核流程确保标注质量；同时采用包含54个细粒度价值类别和20个粗粒度类别的层次化价值空间，覆盖超过4.5万条单位-主体对与16.8万条有向价值实例，从而实现对模型在真实新闻语境下识别价值线索、归属正确主体并判断价值方向的能力进行系统评估。

链接: https://arxiv.org/abs/2603.17838
作者: Yao Wang,Xin Liu,Zhuochen Liu,Jiankang Chen,Adam Jatowt,Kyoungsook Kim,Noriko Kando,Haitao Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing human value datasets do not directly support value understanding in factual news: many are actor-agnostic, rely on isolated utterances or synthetic scenarios, and lack explicit event structure or value direction. We present \textbfNEVU (\textbfNews \textbfEvent-centric \textbfValue \textbfUnderstanding), a benchmark for \emphactor-conditioned, \emphevent-centric, and \emphdirection-aware human value recognition in factual news. NEVU evaluates whether models can identify value cues, attribute them to the correct actor, and determine value direction from grounded evidence. Built from 2,865 English news articles, NEVU organizes annotations at four semantic unit levels (\textbfSubevent, \textbfbehavior-based composite event, \textbfstory-based composite event, and \textbfArticle) and labels \mbox(unit, actor) pairs for fine-grained evaluation across local and composite contexts. The annotations are produced through an LLM-assisted pipeline with staged verification and targeted human auditing. Using a hierarchical value space with \textbf54 fine-grained values and \textbf20 coarse-grained categories, NEVU covers 45,793 unit–actor pairs and 168,061 directed value instances. We provide unified baselines for proprietary and open-source LLMs, and find that lightweight adaptation (LoRA) consistently improves open-source models, showing that although NEVU is designed primarily as a benchmark, it also supports supervised adaptation beyond prompting-only evaluation. Data availability is described in Appendix~\refapp:data_code_availability.

[NLP-12] xt-to-Stage: Spatial Layouts from Long-form Narratives

【速读】：该论文旨在解决从非结构化文本中自动推断空间布局信息的问题，即如何让语言模型像人类一样进行空间推理，从而为下游媒体应用（如舞台剧编排、虚拟场景生成等）提供支持。其核心挑战在于文本中缺乏显式的空间、位置或关系线索，而模型仍需准确还原场景、角色位置、移动路径及房间类型等要素。解决方案的关键在于提出一种基于戏剧理论的确定性评估体系，并结合拒绝采样监督微调（rejection SFT）与基于可验证奖励的强化学习（GRPO），通过Best-of-N采样和可验证反馈优化模型输出，在经典英文文学语料上显著提升角色归属准确性、空间合理性及动作经济性等多个指标，同时增强与大型语言模型（LLM）作为评判者及人类主观偏好的一致性。

链接: https://arxiv.org/abs/2603.17832
作者: Jefferson Hernandez,Swarnadeep Saha,Chenxi Whitehouse,Sanjeel Parekh,Calvin Murdock,Yuliang Li,W. Owen Brimijoin,Vamsi Krishna Ithapu,Ishwarya Ananthabhotla
机构: Rice University; Meta FAIR; Meta Reality Labs Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.

[NLP-13] CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents

【速读】：该论文旨在解决代码定位（code localization）问题，即在大型代码库中高效识别与任务相关的文件、类和函数，这是编码代理（coding agent）执行任务的前提。现有方法多依赖复杂的专用工具（如基于静态分析构建的仓库图），而本文提出一种更轻量级的解决方案：仅使用标准 Unix 终端作为交互接口，通过有效的强化学习（reinforcement learning, RL）策略训练编码代理完成代码定位任务。其关键在于重新利用现有编码代理环境进行代码搜索、设计合理的奖励机制以及优化 RL 训练流程，从而在多个基准测试（SWE-Bench Verified、Pro 和 Lite）上实现优于或媲美更大规模预训练模型甚至闭源模型（如 Claude Sonnet）的性能表现。

链接: https://arxiv.org/abs/2603.17829
作者: Lintang Sutawika,Aditya Bharat Soni,Bharath Sriraam R R,Apurva Gandhi,Taha Yassine,Sanidhya Vijayvargiya,Yuchen Li,Xuhui Zhou,Yilin Zhang,Leander Melroy Maben,Graham Neubig
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A prerequisite for coding agents to perform tasks on large repositories is code localization - the identification of relevant files, classes, and functions to work on. While repository-level code localization has been performed using embedding-based retrieval approaches such as vector search, recent work has focused on developing agents to localize relevant code either as a standalone precursor to or interleaved with performing actual work. Most prior methods on agentic code search equip the agent with complex, specialized tools, such as repository graphs derived from static analysis. In this paper, we demonstrate that, with an effective reinforcement learning recipe, a coding agent equipped with nothing more than a standard Unix terminal can be trained to achieve strong results. Our experiments on three benchmarks (SWE-Bench Verified, Pro, and Lite) reveal that our models consistently achieve superior or competitive performance over 2-18x larger base and post-trained LLMs and sometimes approach performance provided by closed models like Claude Sonnet, even when using specialized scaffolds. Our work particularly focuses on techniques for re-purposing existing coding agent environments for code search, reward design, and RL optimization. We release the resulting model family, CodeScout, along with all our code and data for the community to build upon.

[NLP-14] Discovering Decoupled Functional Modules in Large Language Models AAAI-26

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）内部功能模块组织不清晰的问题，即如何将海量神经元解耦为具有语义意义的功能模块，并揭示这些模块与输入样本主题之间的关联。其解决方案的关键在于提出一种无监督的跨层模块发现框架——ULCMOD（Unsupervised LLM Cross-layer MOdule Discovery），该框架通过设计新颖的目标函数和高效的迭代解耦（Iterative Decoupling, IterD）算法，实现对整个LLM中神经元的模块化分解与相关输入主题的同步识别，从而挖掘出语义一致、可解释性强且具备空间层级结构的功能模块。

链接: https://arxiv.org/abs/2603.17823
作者: Yanke Yu,Jin Li,Ying Sun,Ping Li,Zhefeng Wang,Yi Zheng
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: AAAI-26 Oral

点击查看摘要

Abstract:Understanding the internal functional organization of Large Language Models (LLMs) is crucial for improving their trustworthiness and performance. However, how LLMs organize different functions into modules remains highly unexplored. To bridge this gap, we formulate a functional module discovery problem and propose an Unsupervised LLM Cross-layer MOdule Discovery (ULCMOD) framework that simultaneously disentangles the large set of neurons in the entire LLM into modules while discovering the topics of input samples related to these modules. Our framework introduces a novel objective function and an efficient Iterative Decoupling (IterD) algorithm. Extensive experiments show that our method discovers high-quality, disentangled modules that capture more meaningful semantic information and achieve superior performance in various downstream tasks. Moreover, our qualitative analysis reveals that the discovered modules show semantic coherence, correspond to interpretable specializations, and a clear spatial and hierarchical organization within the LLM. Our work provides a novel tool for interpreting the functional modules of LLMs, filling a critical blank in LLM’s interpretability research.

[NLP-15] Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多步推理过程中因中间步骤错误传播而导致整体性能下降的问题。现有方法依赖昂贵的人工标注或计算密集型自动标签生成方式，难以规模化应用。其解决方案的关键在于提出一种基于信息论的自动步骤级标签生成方法，通过估计每一步推理对正确答案似然性的改变来量化步骤质量，从而提供细粒度监督信号；该方法将计算复杂度从之前的 $\mathcal{O}(N \log N)$ 降低至 $\mathcal{O}(N)$ ，显著提升了效率，并在数学、Python编程、SQL和科学问答等多个推理基准上验证了其有效性，尤其适用于对错误传播敏感的任务场景。

链接: https://arxiv.org/abs/2603.17815
作者: Corentin Royer,Debarun Bhattacharjya,Gaetano Rossiello,Andrea Giovannini,Mennatallah El-Assady
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step-level labels using Information Theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to \mathcalO(N) , improving over the previous \mathcalO(N \log N) methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of- K evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.

[NLP-16] CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

【速读】：该论文旨在解决标签无关的强化学习（label-free reinforcement learning）在提升大语言模型推理能力时存在的“共识陷阱”（consensus trap）问题，即模型在训练过程中过度追求输出一致性，导致多样性丧失并固化系统性错误。解决方案的关键在于提出CoVerRL框架，通过单一模型在生成器（generator）与验证器（verifier）角色之间交替切换，实现两者能力的协同进化：验证器利用多数投票提供的噪声监督信号逐步过滤伪标签中的自洽错误，而生成器则基于更高质量的伪标签持续优化输出，从而形成一个良性循环，维持高奖励准确率并显著提升数学推理性能。

链接: https://arxiv.org/abs/2603.17775
作者: Teng Pan,Yuchen Yan,Zixuan Wang,Ruiqing Zhang,Gaiyang Han,Wanqi Zhang,Weiming Lu,Jun Xiao,Yongliang Shen
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL Code: this https URL

点击查看摘要

Abstract:Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.

[NLP-17] Modeling Overlapped Speech with Shuffles

【速读】：该论文旨在解决多说话人重叠语音的对齐与说话人归属标注问题，即如何在不依赖复杂后处理的情况下实现单次遍历（one-pass）的多说话人语音同步与转录。其关键解决方案是利用shuffle product和偏序有限状态自动机（Partial Order Finite-State Automata, PO-FSA）建模并行语音流，并通过在子词、词和短语层级上对所有可能的序列化路径进行边缘化（marginalization），以总得分作为损失函数进行端到端训练。同时，通过直接建模（token, speaker）元组实现说话人归属，结合时间约束构造紧凑的PO-FSA以降低计算图规模，最终借助Viterbi算法在shuffle product FSA中完成高效的一次性对齐。

链接: https://arxiv.org/abs/2603.17769
作者: Matthew Wiesner,Samuele Cornell,Alexander Polok,Lucas Ondel Yang,Lukáš Burget,Sanjeev Khudanpur
机构: Johns Hopkins University (约翰霍普金斯大学); LISN, CNRS (法国国家科学研究中心); Carnegie Mellon University (卡内基梅隆大学); Brno University of Technology (布杰约维采理工大学)
类目: ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.

[NLP-18] Harm or Humor: A Multimodal Multilingual Benchmark for Overt and Covert Harmful Humor

【速读】：该论文旨在解决当前静态基准测试在检测和理解有害及冒犯性幽默（offensive humor）方面的局限性，尤其是针对依赖文化细微差别和隐含线索的黑色幽默（dark humor）所引发的安全挑战。其解决方案的关键在于构建一个新颖的多模态、多语言基准数据集，涵盖英文和阿拉伯文的3,000条文本、6,000张图像以及1,200个视频（包括英语、阿拉伯语和无语言依赖的通用情境），并采用严格的标注规范：区分安全（Safe）与有害（Harmful）笑话，其中有害类别进一步细分为显性（Explicit）和隐性（Implicit）两类，以推动模型对深层语境推理能力的评估。实证结果表明，闭源模型显著优于开源模型，且英语与阿拉伯语之间存在明显性能差异，凸显了文化相关性和推理感知型安全对齐的重要性。

链接: https://arxiv.org/abs/2603.17759
作者: Ahmed Sharshar,Hosam Elgendy,Saad El Dine Ahmed,Yasser Rohaim,Yuxia Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing \emphSafe jokes from \emphHarmful ones, with the latter further classified into \emphExplicit (overt) and \emphImplicit (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. \textcolorredWarning: this paper contains example data that may be offensive, harmful, or biased.

[NLP-19] Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models

【速读】：该论文旨在解决检索增强生成（Retrieval-Augmented Generation, RAG）中因检索到的上下文存在噪声、不可靠或与模型参数知识不一致而导致的检索优先冲突（retrieval-prior conflicts）问题，这一问题在基于扩散的文本生成模型（diffusion-based language models）中尤为突出，因其迭代去噪过程对上下文整合提出了独特挑战。解决方案的关键在于提出一种无需训练的自适应引导框架——自适应检索增强掩码扩散模型（Adaptive Retrieval-Augmented Masked Diffusion, ARAM），其通过动态调整去噪过程中引导强度（guidance scale）来应对检索上下文引发的分布偏移（distributional shift），具体依据是该偏移的信噪比（Signal-to-Noise Ratio, SNR）：当检索内容提供可靠纠正证据时增强引导，反之则抑制引导，从而提升生成质量。

链接: https://arxiv.org/abs/2603.17677
作者: Jaemin Kim,Jong Chul Ye
机构: KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model’s parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.

[NLP-20] Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis LREC2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）是否能够捕捉结构化语义信息的问题，特别是其如何表征概念之间的语义关系。研究聚焦于四种核心语义关系：同义关系（synonymy）、反义关系（antonymy）、上下位关系（hypernymy 和 hyponymy），并采用线性探测（linear probing）与机制可解释性技术（包括稀疏自编码器（sparse autoencoders, SAE）和激活修补（activation patching））相结合的方法，识别这些关系在模型中的编码位置及其特征贡献机制。关键解决方案在于通过多尺度分析揭示语义关系的分布特性与因果效应：发现层级关系存在方向性不对称——上位词关系（hypernymy）具有冗余编码且抗干扰能力强，而下位词关系（hyponymy）依赖紧凑特征且易受消融影响；同时，关系信号虽呈弥散分布但具有稳定层间模式（峰值位于中层），且主要集中在残差连接后的前馈网络（MLP）路径而非注意力模块；此外，探测层面的因果效应具有模型规模依赖性，仅在大规模模型（如 Llama 3.1 8B）中实现稳定可靠的特征引导修补效果。这一框架为将稀疏特征与探针级因果证据关联提供了可复现的方法论基础。

链接: https://arxiv.org/abs/2603.17624
作者: Andor Diera,Ansgar Scherp
机构: 未知
类目: Computation and Language (cs.CL)
备注: accepted at LREC 2026

点击查看摘要

Abstract:Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.

[NLP-21] Complementary Reinforcement Learning

【速读】：该论文旨在解决强化学习（Reinforcement Learning, RL）在训练大语言模型（Large Language Model, LLM）驱动智能体时存在的样本效率低下的问题，其根源在于稀疏的奖励信号以及智能体无法有效利用跨回合的历史经验。现有方法虽尝试引入历史经验，但因经验存储静态或未能与策略主体（actor）协同进化，导致经验与智能体能力逐渐脱节，削弱了经验的价值。解决方案的关键在于提出“互补式强化学习”（Complementary RL），通过在RL优化循环中实现经验提取器（experience extractor）与策略主体的无缝协同演化：策略主体基于稀疏结果奖励进行优化，而经验提取器则根据其提炼的经验是否显著提升主体表现来调整自身策略，从而确保经验管理机制随主体能力同步演进。这一机制显著提升了单任务性能（+10%）并展现出多任务场景下的强可扩展性。

链接: https://arxiv.org/abs/2603.17621
作者: Dilxat Muhtar,Jiashun Liu,Wei Gao,Weixun Wang,Shaopan Xiong,Ju Huang,Siran Yang,Wenbo Su,Jiamang Wang,Ling Pan,Bo Zheng
机构: Alibaba Group(阿里巴巴集团); HKUST(香港科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 22 pages, 14 figures

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent’s inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fail to coevolve with the improving actor, causing a progressive misalignment between the experience and the actor’s evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor’s success, thereby evolving its experience management strategy in lockstep with the actor’s growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving 10% performance improvement in single-task scenarios and exhibits robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.

[NLP-22] mporal Narrative Monitoring in Dynamic Information Environments

【速读】：该论文旨在解决危机事件中信息环境（Information Environment, IE）的动态性与抽象性所带来的理解难题，现有方法多依赖静态分类或网络分析，忽视了信息随时间演变的本质。其解决方案的关键在于提出一种面向系统的框架，通过整合语义嵌入（semantic embeddings）、基于密度的聚类（density-based clustering）和滚动时间关联（rolling temporal linkage），将新兴叙事建模为在共享语义空间中持续演化且自适应的结构，无需预先标注标签。该方法实现了从非结构化社交媒体流中提取可解释、时序结构化的叙事表示，从而提升情境意识（situational awareness）下的感知与理解能力，并支持动态信息环境中监测与决策辅助。

链接: https://arxiv.org/abs/2603.17617
作者: David Farr,Stephen Prochaska,Jack Moody,Lynnette Hui Xian Ng,Iain Cruickshank,Kate Starbird,Jevin West
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Comprehending the information environment (IE) during crisis events is challenging due to the rapid change and abstract nature of the domain. Many approaches focus on snapshots via classification methods or network approaches to describe the IE in crisis, ignoring the temporal nature of how information changed over time. This work presents a system-oriented framework for modeling emerging narratives as temporally evolving semantic structures without requiring prior label specification. By integrating semantic embeddings, density-based clustering, and rolling temporal linkage, the framework represents narratives as persistent yet adaptive entities within a shared semantic space. We apply the methodology to a real-world crisis event and evaluate system behavior through stratified cluster validation and temporal lifecycle analysis. Results demonstrate high cluster coherence and reveal heterogeneous narrative lifecycles characterized by both transient fragments and stable narrative anchors. We ground our approach in situational awareness theory, supporting perception and comprehension of the IE by transforming unstructured social media streams into interpretable, temporally structured representations. The resulting system provides a methodology for monitoring and decision support in dynamic information environments. Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL) Cite as: arXiv:2603.17617 [cs.SI] (or arXiv:2603.17617v1 [cs.SI] for this version) https://doi.org/10.48550/arXiv.2603.17617 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-23] VeriAgent : A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）在自动寄存器传输级（Register-Transfer Level, RTL）代码生成中仅关注功能正确性，而忽视物理设计关键指标（即功率、性能和面积，Power, Performance, and Area, PPA）的问题。解决方案的关键在于提出一个面向PPA的、工具集成的多智能体框架，该框架通过编程者智能体（Programmer Agent）、正确性验证智能体（Correctness Agent）与PPA优化智能体（PPA Agent）组成的闭环工作流，实现功能正确性与物理指标的联合优化；同时引入可演化的记忆机制（Evolved Memory Mechanism），将优化经验结构化存储于记忆节点中，并由专用记忆管理器动态维护，从而支持无需重新训练模型即可持续改进策略，使RTL生成从一次性推理转变为反馈驱动的持续优化过程。

链接: https://arxiv.org/abs/2603.17613
作者: Yaoxiang Wang,Qi Shi,ShangZhan Li,Qingguo Hu,Xinyu Yin,Bo Guo,Xu Han,Maosong Sun,Jinsong Su
机构: Xiamen University (厦门大学); Tsinghua University (清华大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a \textitProgrammer Agent, a \textitCorrectness Agent, and a \textitPPA Agent, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an \textitEvolved Memory Mechanism that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.

[NLP-24] KA2L: A Knowledge-Aware Active Learning Framework for LLM s

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在特定领域知识掌握深度不足以及现有主动学习策略难以精准提升其专业能力的问题。解决方案的关键在于提出一种知识感知的主动学习框架（Knowledge-Aware Active Learning, KA2L），该框架通过潜空间分析评估模型对特定知识点的掌握程度，识别出模型尚未掌握的知识区域，并据此生成未回答或未知问题；同时，利用知识分布探测技术和隐状态解码方法，从潜在知识空间中自动构建大量自然语言形式的未知问题，从而实现针对性的知识增强训练，显著降低标注与计算成本（减少50%），并提升模型性能。

链接: https://arxiv.org/abs/2603.17566
作者: Haoxuan Yin,Bojian Liu,Chen Tang,Yangfan Wang,Lian Yan,Jingchi Jiang
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to enhance their performance effectively. However, there is a paucity of research on the depth of domain-specific knowledge comprehension by LLMs and the application of targeted active learning to improve their expertise. To address this gap, we introduce the Knowledge-Aware Active Learning (KA2L) framework. This framework assesses LLMs’ mastery of specific knowledge points to aid in constructing unanswerable or unknowable questions through latent space analysis. This active learning strategy enhances training efficiency by focusing on knowledge the model has yet to master, thereby minimizing redundancy in learning already acquired information. This study innovatively employs a knowledge distribution probing technique to examine the hidden states of specific Transformer layers and identify the distribution of known and unknown knowledge within the LLM. Additionally, a hidden-state decoding method is proposed to generate numerous unknown questions in natural language from the latent knowledge space. In our experiments, we selected nine open-source LLMs to validate the effectiveness of the proposed framework. Results indicate that KA2L not only significantly reduces 50% annotation and computation costs across two open-domain and one vertical-domain dataset but also achieves better performance, offering valuable insights into active learning strategies for LLMs. The code is available at this https URL.

[NLP-25] Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition

【速读】：该论文旨在解决多语言自动语音识别（ASR）系统在数据分布不均衡情况下面临的稳定性-可塑性困境：完全共享的参数高效微调（PEFT）会导致低资源语言出现负向跨语言干扰，而完全独立的语言特定微调又限制了跨语言有益知识的迁移。解决方案的关键在于提出Zipper-LoRA框架，这是一种基于秩级解耦的动态融合机制，包含静态、硬性与软性三种变体，通过轻量级语言条件路由器在LoRA秩级别上动态调控共享子空间与语言特有子空间的更新贡献，实现兼容语言间的细粒度知识共享与冲突语言间的严格解耦。此外，引入两阶段训练策略配合初始温启动（Initial-B warm start）以增强优化稳定性，实验表明该方法在12种语言混合资源场景下显著优于全共享和全独立基线，尤其在极端低资源条件下表现突出。

链接: https://arxiv.org/abs/2603.17558
作者: Yuxiang Mei,Delai Qiu,Shengping Liu,Jiaen Liang,Yanhua Long
机构: Shanghai Engineering Research Center of Intelligent Education and Bigdata, Shanghai Normal University (上海智能教育与大数据工程研究中心，上海师范大学); SHNU-Unisound Natural Human-Computer Interaction Lab, Shanghai Normal University (SHNU-科大讯飞自然人机交互实验室，上海师范大学); Unisound AI Technology Co., Ltd. (科大讯飞人工智能技术有限公司)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the cross-lingual beneficial knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework’s reliability for practical, large-scale multilingual ASR. Our code and data will be available at this https URL for reproducibility.

[NLP-26] AURORA Model of Formant-to-Tongue Inversion for Didactic and Clinical Applications LREC2026

【速读】：该论文旨在解决语音学中形式音位（formant）与发音器官（尤其是舌位）之间关系难以直观理解的问题，特别是在教学和临床应用中缺乏高效、实时的可视化工具。其解决方案的关键在于构建AURORA模型，该模型基于前两个共振峰（formant）值预测元音发音时的舌体位移与形状，其核心创新在于利用40名英语母语者采集的超声舌部成像（ultrasound tongue imaging）与声学数据进行训练，从而建立从声学特征到口腔内部结构的映射关系。该模型不仅为语音教学提供可解释的理论框架，还通过开发Shiny应用程序和实时舌部生物反馈原型软件，实现了面向语言学学生、语音治疗师及患者的实用化工具支持。

链接: https://arxiv.org/abs/2603.17543
作者: Patrycja Strycharczuk,Sam Kirkham
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LREC 2026

点击查看摘要

Abstract:This paper outlines the conceptual and computational foundations of the AURORA (Acoustic Understanding and Real-time Observation of Resonant Articulations) model. AURORA predicts tongue displacement and shape in vowel sounds based on the first two formant values. It is intended as a didactic aid helping to explain the relationship between formants and the underlying articulation, as well as a foundation for biofeedback applications. The model is informed by ultrasound tongue imaging and acoustic data from 40 native speakers of English. In this paper we discuss the motivation for the model, the modelling objectives as well as the model architecture. We provide a qualitative evaluation of the model, focusing on selected tongue features. We then present two tools developed to make the model more accessible to a wider audience, a Shiny app and a prototype software for real-time tongue biofeedback. Potential users include students of phonetics, linguists in fields adjacent to phonetics, as well as speech and language therapy practitioners and clients.

[NLP-27] Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures Domains and Adversarial Conditions

【速读】：该论文旨在解决当前机器生成文本检测方法在跨领域迁移、跨大语言模型（Large Language Models, LLMs）泛化以及对抗鲁棒性方面表现不足的问题。现有基准测试通常仅在理想条件下评估单一检测器对单一数据集的性能，无法反映实际应用场景中的复杂性。论文提出一个全面的基准测试框架，在HC3和ELI5两个语料库上系统评估多种检测方法，包括传统分类器、微调的Transformer编码器（如BERT、RoBERTa等）、CNN、XGBoost风格特征模型、基于困惑度（perplexity）的检测方法及以LLM作为检测器的提示策略。关键发现在于：尽管Transformer模型在分布内（in-distribution）检测中接近完美，但其性能在领域偏移下显著下降；XGBoost风格特征模型在保持可解释性的同时达到与Transformer相当的性能；而基于LLM的检测器存在生成器-检测器身份偏差问题，且整体表现不佳；此外，现代LLM生成文本的困惑度反而低于人类文本，导致基于困惑度的方法出现极性反转现象，但经校正后仍具有效性。总体而言，当前无一种方法能在不同领域和LLM源之间实现稳健泛化。

链接: https://arxiv.org/abs/2603.17522
作者: Madhav S. Baidya,S. S. Baidya,Chirag Chawla
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ~30 pages, 10+ figures. Code available at: this https URL

点击查看摘要

Abstract:The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources. Comments: ~30 pages, 10+ figures. Code available at: this https URL Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17522 [cs.CL] (or arXiv:2603.17522v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.17522 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-28] Language on Demand Knowledge at Core: Composing LLM s with Encoder-Decoder Translation Models for Extensible Multilinguality ACL2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多语言任务中表现不均衡的问题，尤其是其在低资源语言和未见语言上的性能下降。尽管LLMs在统一语义空间中编码了丰富的跨语言知识，但难以可靠地将其与这些语言进行对接。为此，作者提出XBridge架构，其关键在于构建一个由预训练编码器-LLM-解码器组成的组合式结构：将多语言理解与生成任务交由外部预训练的编码器-解码器翻译模型处理，而保留LLM作为以英语为中心的核心通用知识处理单元。为解决不同模型间表示错位问题，引入轻量级跨模型映射层及基于最优传输（optimal transport）的对齐目标，从而实现细粒度的语义一致性，显著提升低资源和未见语言上的多语言生成性能，且无需重新训练LLM。

链接: https://arxiv.org/abs/2603.17512
作者: Mengyu Bu,Yang Feng
机构: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
类目: Computation and Language (cs.CL)
备注: Submitted to ACL 2026. The code is available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.

[NLP-29] Inducing Epistemological Humility in Large Language Models : A Targeted SFT Approach to Reducing Hallucination

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）中存在的幻觉问题，即模型在生成回答时可能产生看似合理但事实上错误的信息，这部分源于监督微调（Supervised Fine-Tuning, SFT）隐式鼓励模型始终给出回应而非承认知识边界。解决方案的关键在于设计一种名为HypoTermInstruct的SFT数据集（包含31,487条响应对应11,151个问题），通过引入关于虚构“假设性术语”的问题，引导模型习得元认知能力——即识别自身知识局限并承认不确定性，从而实现对幻觉的有效抑制。实验表明，使用该数据集进行微调可显著提升HypoTerm Score和FactScore，同时保持MMLU等基准任务性能稳定，且无需依赖偏好学习或强化学习（Preference/RL）机制，为构建更可靠的人工智能系统提供了可解释且实用的新路径。

链接: https://arxiv.org/abs/2603.17504
作者: Cem Uluoglakci,Tugba Taskaya Temizel
机构: Middle East Technical University (中东技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often hallucinate, producing fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards always responding. We introduce \textitHypoTermInstruct , an SFT dataset (31,487 responses for 11,151 questions) designed to teach models epistemological humility-the ability to recognize the limits of their own knowledge and admit uncertainty. This is achieved through questions about non-existent “hypothetical” terms. We also release \textitHypoTermQA-Enhanced , a benchmark for hallucination tendency strengthened through multiple validations. We conducted 800 controlled LoRA SFT runs across \textitLlama3.1-8B and \textitGemma3-4B (base and instruct), testing 100 fine-tuning configurations with paired controls. Our results demonstrate that replacing generic instruction data with \textitHypoTermInstruct significantly improves the HypoTerm Score (median increases of 0.19% to 25.91%) and FactScore (+0.39% to +0.86%), while maintaining stable performance on MMLU (minimal decreases of 0.26% to 0.35%). Our work demonstrates that targeted, high-quality SFT data teaching meta-cognitive skills can effectively reduce hallucination without preference/RL pipelines, providing mechanistic insights and a practical path toward more reliable AI systems.

[NLP-30] Learning When to Attend: Conditional Memory Access for Long-Context LLM s

【速读】：该论文旨在解决语言模型在预训练上下文长度之外难以泛化的问题，从而限制了长时推理和检索能力。传统方法通过继续预训练长上下文数据来提升性能，但因注意力机制（Attention）的二次复杂度导致计算成本高昂。解决方案的关键在于提出L2A（Learning To Attend）模块，该模块基于观察——大多数token无需全局注意力（Global Attention）覆盖整个序列，仅需局部上下文——设计了一种条件性（token-wise）的长程记忆访问机制，动态决定何时调用全局注意力。实验表明，L2A在Qwen 2.5与Qwen 3模型上将有效上下文长度从32K扩展至128K，性能接近标准长上下文训练（误差<3%），同时跳过约80% token的全局注意力；此外，通过定制Triton内核优化实现高效GPU并行，训练吞吐量和首token延迟相比FlashAttention提升近2倍，并支持后训练稀疏化剪枝，KV缓存内存减少达50%且性能损失可忽略。

链接: https://arxiv.org/abs/2603.17484
作者: Sakshi Choudhary,Aditya Chattopadhyay,Luca Zancato,Elvis Nunez,Matthew Trager,Wei Xia,Stefano Soatto
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, 6 Tables, 18 Figures

点击查看摘要

Abstract:Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for \sim 80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to \sim 2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.

[NLP-31] UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

【速读】：该论文旨在解决统一多模态模型（Unified Multimodal Models, UMMs）在系统层面的安全性评估缺乏全面基准的问题。现有安全评测体系碎片化，难以覆盖多任务与多模态场景下的复杂系统级漏洞。为此，作者提出UniSAFE，这是首个针对UMMs跨7种输入输出模态组合的系统级安全评估基准，其核心创新在于采用“共享目标”设计，将共通风险场景映射至不同任务特定的输入输出配置中，从而实现对安全失效的受控跨任务比较。通过6,802个精心构建的测试实例，研究人员对15个主流UMMs进行了系统评估，揭示了当前模型在多图像合成和多轮交互设置下显著更高的安全违规率，且图像输出任务普遍比文本输出任务更易引发安全问题，凸显了加强UMMs系统级安全对齐的紧迫性。

链接: https://arxiv.org/abs/2603.17476
作者: Segyu Lee,Boryeong Cho,Hojung Jung,Seokhyun An,Juhyeong Kim,Jaehyun Kwak,Yongjin Yang,Sangwon Jang,Youngrok Park,Wonjun Chang,Se-Young Yun
机构: KAIST AI; Department of Computer Science and Engineering, UNIST; Department of Mathematical Sciences, KAIST; University of Toronto; KAIST CS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Equal contribution by first three authors, 55 pages

点击查看摘要

Abstract:Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at this https URL

[NLP-32] Humans and transformer LMs: Abstraction drives language learning EACL2026

【速读】：该论文旨在解决语言模型（Language Model, LM）在训练过程中如何习得语言类别（linguistic categories）的问题，特别是对比其学习行为与人类语言习得中抽象特征驱动（feature-based）和具体实例驱动（exemplar-based）两种理论模型的差异。解决方案的关键在于引入一种基于散度（divergence-based）的新指标，用于追踪语言模型在训练过程中对下一词预测分布的变化轨迹，从而量化抽象类级别行为与具体词汇项行为的出现时机。实验结果表明，抽象类别行为比具体词汇行为更早显现，并且不同语言行为在训练序列中呈突变式依次出现，揭示了抽象化在语言模型学习中的核心作用，为语言模型作为人类语言习得机制的存在性证明提供了实证支持。

链接: https://arxiv.org/abs/2603.17475
作者: Jasper Jian,Christopher D. Manning
机构: Stanford University
类目: Computation and Language (cs.CL)
备注: EACL 2026

点击查看摘要

Abstract:Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature-based and concrete exemplar-based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item-specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs the models of human language acquisition that LMs may serve as an existence proof for.

[NLP-33] RiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL

【速读】：该论文旨在解决大语言模型在复杂推理任务中因长链式思维（Chain-of-Thought, CoT）导致的推理膨胀问题，即生成冗余推理步骤造成计算资源浪费。其核心解决方案是提出一个理论指标——最小充分长度（Minimal Sufficient Length, MSL），用于量化保持答案正确性所需的最短推理路径，并基于此构建训练框架TRiMS。TRiMS的关键创新在于：利用MSL指导训练过程中的推理压缩，结合GRPO算法进行策略优化，并通过动态批次聚合与批次标准差归一化优势估计来稳定训练过程，从而实现超过80%的CoT token减少且精度略有提升。

链接: https://arxiv.org/abs/2603.17449
作者: Tingcheng Bian,Jinchang Luo,Mingquan Cheng,Jinyu Zhang,Xiaoling Xia,Ni Li,Yan Tao,Haiwei Wang
机构: Baidu Inc.(百度); Shenzhen University(深圳大学); Harbin Institute of Technology(哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注: 8 pages (main), 21 pages total including appendix, 18 this http URL will be released

点击查看摘要

Abstract:Large language models achieve breakthroughs in complex reasoning via long chain-of-thought sequences. However, this often leads to severe reasoning inflation, causing substantial computational redundancy. To maximize Intelligence per Token, we introduce a theoretical metric, MSL-Minimal Sufficient Length. MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We provide a recursive definition based on independently sampled sequences and prove the existence of its limit, establishing the first measurable lower bound for reasoning-chain compression. Building on an analysis of mainstream CoT compression strategies, we identify key structural factors enabling a model to approach MSL. Based on these insights, we propose TRiMS which employs the GRPO algorithm in conjunction with MSL-based estimation during training, while mitigating instabilities during the training process through dynamic batch aggregation and advantage computation using batch-level standard deviation. TRiMS achieves over 80% CoT token reduction with a minor accuracy boost across all benchmarks.

[NLP-34] When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution

【速读】：该论文旨在解决多智能体语言系统（multi-agent language systems）在产生错误或有害输出时的责任归属问题，尤其是在执行日志和智能体标识信息不可用的情况下。其核心挑战在于：随着系统依赖结构化交互（如任务委托和迭代优化），最终输出往往掩盖了底层的交互拓扑与各智能体的具体贡献，导致难以进行问责审计。解决方案的关键是提出一种名为IET（Implicit Execution Tracing）的元数据无关框架，通过在生成过程中将特定于智能体的键控信号嵌入到token分布中，使文本本身成为可检测的执行轨迹；同时，在检测阶段采用基于状态转移的评分机制识别智能体交接点并重构交互图谱，从而实现无需额外元数据支持的token级归因与协作结构恢复，且不牺牲生成质量，满足隐私保护下的审计需求。

链接: https://arxiv.org/abs/2603.17445
作者: Yi Nian,Haosen Cao,Shenzhe Zhu,Henry Peng Zou,Qingqing Luan,Yue Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? Multi-agent language systems increasingly rely on structured interactions such as delegation and iterative refinement, yet the final output often obscures the underlying interaction topology and agent contributions. We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction. During generation, agent-specific keyed signals are embedded into the token distribution, transforming the text into a self-describing execution trace detectable only with a secret key. At detection time, a transition-aware scoring method identifies agent handover points and reconstructs the interaction graph. Experiments show that IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent language systems.

[NLP-35] Argument Reconstruction as Supervision for Critical Thinking in LLM s

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在提升批判性思维能力方面的一个关键问题：是否可以通过学习重构论点（argument reconstruction）来增强其推理与评价能力。当前，人类学习者通过识别、重构和评估论点来培养批判性思维，但这一机制是否适用于LLMs尚不明确。论文提出一个整体框架，其关键创新在于：（1）开发了一个能自动重构任意论点的引擎（GAAR），用于生成结构化论证表示；（2）基于该引擎构建了一个高质量的新论点重构数据集（Arguinas）；（3）实证验证了在下游批判性思维任务中，训练模型学习论点重构可显著提升性能，尤其在使用Arguinas数据集时效果最优。这表明论点重构作为一项基础训练任务，对增强LLMs的批判性推理能力具有实质性价值。

链接: https://arxiv.org/abs/2603.17432
作者: Hyun Ryu,Gyouk Chu,Gregor Betz,Eunho Yang,Carolyn Rose,Sean Welleck
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument’s underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset. The source code and dataset will be publicly available.

[NLP-36] SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）作为AI tutors在实际教学场景中存在“安全与教学有效性不可兼得”的问题，即现有评估范式仅孤立考察问题求解准确率和通用安全性，未能捕捉模型在师生交互过程中是否同时具备良好的教学效果与安全性。其解决方案的关键在于提出SafeTutors基准，该基准基于学习科学文献构建了一个包含11个危害维度和48个子风险的理论驱动风险分类体系，从而系统性地联合评估模型在数学、物理和化学等学科中的教学安全性与有效性。研究发现，所有模型均普遍存在广泛危害，且规模提升并不保证行为改善，多轮对话显著加剧教学失败，揭示出需针对不同学科设计差异化缓解策略，避免单轮“安全/有用”结果掩盖长期交互中的系统性教学失效。

链接: https://arxiv.org/abs/2603.17373
作者: Rima Hazra,Bikram Ghuku,Ilona Marchenko,Yaroslava Tokarieva,Sayan Layek,Somnath Banerjee,Julia Stoyanovich,Mykola Pechenizkiy
机构: Eindhoven University of Technology, Netherlands (TU/e); Indian Institute of Technology Kharagpur; Cisco Research; Taras Shevchenko National University of Kyiv; National Technical University of Ukraine; New York University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We uncover that all models show broad harm; scale doesn’t reliably help; and multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn “safe/helpful” results can mask systematic tutor failure over extended interaction.

[NLP-37] PACE-RAG : Patient-Aware Contextual and Evidence-based Policy RAG for Clinical Drug Recommendation

【速读】：该论文旨在解决当前药物推荐系统在处理复杂疾病（如帕金森病）时，难以准确融合个体患者临床特征与真实世界处方模式的问题。现有基于指南的检索增强生成（Retrieval-Augmented Generation, RAG）方法因过于通用或简单复制多数患者处方模式，无法捕捉个体患者的细微临床差异。其解决方案的关键在于提出一种新型框架 PACE-RAG（Patient-Aware Contextual and Evidence-based Policy RAG），通过整合个体患者上下文信息与相似病例的处方倾向，识别出针对特定临床信号优化的治疗方案，并生成可解释的临床摘要，从而实现更精准、可解释的个性化决策支持。

链接: https://arxiv.org/abs/2603.17356
作者: Chaeyoung Huh,Hyunmin Hwang,Jung Hwan Shin,Jinse Park,Jong Chul Ye
机构: Korea Advanced Institute of Science and Technology, KAIST, Republic of Korea; Department of Neurology, Seoul National University Hospital, Republic of Korea; Haeundae Paik Hospital, Inje University, Republic of Korea
类目: Computation and Language (cs.CL)
备注: 26 pages, 15 figures

点击查看摘要

Abstract:Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson’s disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson’s cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: this https URL.

[NLP-38] Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity

【速读】：该论文旨在解决层间混合精度量化（Layer-wise Mixed-Precision Quantization, LMPQ）中对同一层内不同权重模块采用统一精度分配的问题，即现有方法通常忽略各模块在结构和功能上的差异，仅依赖单一数值特性评估敏感性，导致压缩效率与模型性能之间难以平衡。其解决方案的关键在于提出一种无需校准数据的新型LMPQ框架NSDS（Numerical and Structural Dual-Sensitivity），通过机制性分解每层为不同操作角色，并从数值敏感性和结构敏感性两个维度分别量化各模块的重要性；随后利用基于MAD-Sigmoid和Soft-OR的鲁棒聚合策略，将双维度得分融合为统一的层级指标，从而实现更精细、更合理的比特分配。

链接: https://arxiv.org/abs/2603.17354
作者: Hengyuan Zhang,Xinrong Chen,Zunhai Su,Xiao Liang,Jing Xiong,Wendong Xu,He Xiao,Chaofan Tao,Wei Zhang,Ruobing Xie,Lei Jiang,Hayden Kwok-Hay So,Ngai Wong
机构: The University of Hong Kong (香港大学); Peking University (北京大学); Tsinghua University (清华大学); University of California, Los Angeles (加州大学洛杉矶分校); Tencent (腾讯)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically treat all intra-layer weight modules uniformly and rely on a single numerical property when estimating sensitivity, overlooking their distinct operational roles and structural characteristics. To address this, we propose NSDS, a novel calibration-free LMPQ framework driven by Numerical and Structural Dual-Sensitivity. Specifically, it first mechanistically decomposes each layer into distinct operational roles and quantifies their sensitivity from both numerical and structural perspectives. These dual-aspect scores are then aggregated into a unified layer-wise metric through a robust aggregation scheme based on MAD-Sigmoid and Soft-OR to guide bit allocation. Extensive experiments demonstrate that NSDS consistently achieves superior performance compared to various baselines across diverse models and downstream tasks, without relying on any calibration data.

[NLP-39] Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids Embodied Settings and Coordinate Structures

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在文本驱动环境下对空间推理能力的评估问题，特别是针对导航、物体定位和结构组合这三项核心任务。其关键解决方案在于构建了一个纯文本的网格数据集GSU（Grid Spatial Understanding），通过摒弃视觉输入以隔离空间推理与感知过程，从而更纯粹地评估LLMs在无视觉模态下的空间理解能力。研究发现，尽管多数模型能掌握基础网格概念，但在处理相对于具身代理的参考系及从坐标列表识别三维形状方面表现不佳；同时，视觉语言模型（Vision-Language Models, VLMs）对3D空间的理解并未显著提升文本任务性能。此外，实验表明，前沿模型虽能完成任务，但小规模语言模型经全量微调或LoRA微调即可逼近其表现，为开发高效专用的具身智能体提供了可行路径。

链接: https://arxiv.org/abs/2603.17333
作者: Risham Sidhu,Julia Hockenmaier
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:We introduce GSU, a text-only grid dataset to evaluate the spatial reasoning capabilities of LLMs over 3 core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine-tuning a small LM or LORA fine-tuning a small LLM show potential to match frontier model performance, suggesting an avenue for specialized embodied agents.

[NLP-40] Ruyi2.5 Technical Report

【速读】：该论文旨在解决多模态模型在部署多样性与隐私保护之间的矛盾问题，即如何在保持语义一致性的同时实现高效、灵活的多场景部署，并满足数据隐私要求。解决方案的关键在于提出Ruyi2.5架构，其基于AI Flow框架构建共享骨干网络（shared-backbone architecture），通过统一训练流程实现不同规模模型的协同优化，从而支持“Train Once, Deploy Many”范式；同时，进一步设计了Ruyi2.5-Camera系统，采用两阶段识别机制——边缘侧使用信息瓶颈引导的不可逆特征映射进行去标识化处理，云端执行深度行为推理，在保障隐私的前提下提升任务性能。此外，为加速强化学习微调，引入Binary Prefix Policy Optimization (BPPO)方法，通过二值响应选择减少样本冗余并聚焦梯度更新于响应前缀，显著提升训练效率。

链接: https://arxiv.org/abs/2603.17311
作者: Huan Song,Shuyu Tian,Qingfei Zhao,Wenhao Hong,Jiang Liu,Ting Long,Jiawei Shao,Xuelong Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2’s “Train Once, Deploy Many” paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, Ruyi2.5-Camera model is developed as a privacy-preserving camera service system, which instantiates Ruyi2.5-Camera into a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on the general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.

[NLP-41] InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在具备扩展推理能力时生成冗长且重复的推理轨迹问题，这种现象不仅增加计算开销，还可能因奖励机制设计不当导致“奖励黑客”（reward hacking）行为。现有强化学习方法通常仅优化最终输出长度，忽视了中间推理步骤的质量。论文提出的关键解决方案是引入InfoDensity框架，其核心在于将推理质量建模为两个可量化指标：条件熵收敛性（low uncertainty convergence）和单调进步性（monotonic progress），并通过AUC奖励与单调性奖励的加权组合来统一衡量推理效率与质量，并以长度缩放项鼓励在保持同等推理质量的前提下更简洁地完成推理过程。实验表明，该方法在数学推理基准上实现了更高的准确性或相当性能的同时显著降低token消耗，达成优越的准确率-效率权衡。

链接: https://arxiv.org/abs/2603.17310
作者: Chengwei Wei,Jung-jae Kim,Longyin Zhang,Shengkai Chen,Nancy F. Chen
机构: Institute for Infocomm Research (I2R), ASTAR, Singapore; Centre for Frontier AI Research (CFAR), ASTAR, Singapore
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.

[NLP-42] Beyond bouba/kiki: Multidimensional semantic signals are deeply woven into the fabric of natural language

【速读】：该论文试图解决语言学中一个长期存在的基础假设——词的语音与意义之间的关系是任意的，这一观点是否完全成立的问题。研究通过系统性地映射英语中每个音素（phoneme）的多维语义特征，揭示了音素与意义之间存在结构化的、可预测的关联，从而挑战了传统任意性假设。其解决方案的关键在于：利用最小对立体范式（minimal-pair paradigm）覆盖全部220个字母音素对比，结合三个大型语言模型（Large Language Models, LLMs）从纯文本输入中独立识别出一致的音素-语义关联，并发现这些关联可由发音特征（如发音方式和部位）系统预测；进一步的行为实验证实了这种关联在英语母语者中的显著性（80.8%正确率），且初步跨语言证据表明核心映射具有跨语言普遍性。这表明音义象似性（sound-meaning iconicity）并非偶然现象，而是语音信号中一种普遍存在且高度系统的属性。

链接: https://arxiv.org/abs/2603.17306
作者: Gexin Zhao
机构: 未知
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:A foundational assumption in linguistics holds that the relationship between a word’s sound and its meaning is arbitrary. Accumulating evidence from sound symbolism challenges this view, yet no study has systematically mapped the multidimensional semantic profile of every phonological unit within a language. Here we show that individual letter-phonemes in English carry structured, multidimensional semantic signals. Using a minimal-pair paradigm spanning all 220 pairwise letter contrasts, three large language models independently recover consistent phoneme-meaning associations across nine perceptual dimensions. These associations are systematically predicted by articulatory-phonetic features, with manner and place of articulation mapping onto distinct semantic dimensions. Behavioral data from English speakers confirm these patterns at rates well above chance (80.8%), and preliminary cross-linguistic evidence from five typologically diverse languages suggests that core mappings generalize beyond English. Our findings indicate that sound-meaning iconicity is not an occasional curiosity but a pervasive, structured property of the phonological signal, one so systematic that large language models recover it when given only text input, without exposure to speech or articulation during the task.

[NLP-43] Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

【速读】：该论文旨在解决大语言模型在面对越狱攻击（jailbreak attacks）时安全性不足的问题，特别是传统防御方法仅在输出层面进行干预而难以抵御深层次的恶意诱导。其解决方案的关键在于提出CRAFT框架，通过显式优化隐藏状态空间中的目标函数，使模型生成具有安全意识的推理轨迹；具体而言，CRAFT结合对比表示学习与强化学习，在潜在空间中分离安全与不安全的推理路径，构建支持推理层级安全对齐的几何结构，并引入隐式-文本一致性约束以消除表面匹配但实质不安全的策略，从而实现更鲁棒的安全对齐。

链接: https://arxiv.org/abs/2603.17305
作者: Haozheng Luo,Yimin Wang,Jiahao Yu,Binghui Wang,Yan Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

[NLP-44] From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

【速读】：该论文旨在解决当前机器翻译系统在处理文化负载表达（culture-loaded expressions）如习语、俚语和文化特定项（Culture-specific Items, CSIs）时存在的准确率不足问题。现有评估基准碎片化，缺乏对这类表达的系统性评价框架。其解决方案的关键在于提出一个名为CulT-Eval的新基准，包含超过7,959个精心构建的实例，覆盖多种文化根基表达类型，并建立全面的错误分类体系；同时引入一种互补性的评估指标，专门捕捉标准机器翻译指标未能识别的文化相关语义偏差，从而更精准地衡量模型在保留文化语境与细微差别方面的表现。

链接: https://arxiv.org/abs/2603.17303
作者: Bangju Han,Yingqi Wang,Huang Qing,Tiyuan Li,Fengyi Yang,Ahtamjan Ahmat,Abibulla Atawulla,Yating Yang,Xi Zhou
机构: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences(中国科学院新疆物理化学技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Culture-expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises over 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, with a comprehensive error taxonomy covering culturally grounded expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at this https URL.

[NLP-45] LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

【速读】：该论文旨在解决当前文档版面分析（Document Layout Analysis, DLA）中结构错误（如区域合并、分割和遗漏）难以被传统基于重叠的评估指标（如IoU、mAP）有效捕捉的问题。解决方案的关键在于提出Layout Error Detection (LED) 基准，其定义了八类标准化的结构错误类型（Missing、Hallucination、Size Error、Split、Merge、Overlap、Duplicate 和 Misclassification），并提供定量规则与注入算法以实现真实场景下的错误模拟。通过构建LED-Dataset及设计三类评估任务（文档级错误检测、文档级错误类型分类、元素级错误类型分类），LED实现了对DLA模型结构推理能力的细粒度、可解释性评估，从而揭示不同模态与架构下的结构性弱点，建立了一个统一且可解释的基准用于诊断文档理解模型的结构鲁棒性和推理能力。

链接: https://arxiv.org/abs/2603.17265
作者: Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung
机构: Chungnam National University (忠南国立大学); National Research Foundation of Korea (韩国国家研究基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 8pages

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.

[NLP-46] Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models

【速读】：该论文旨在解决大型音频语言模型（Large Audio-Language Models, LALMs）在语音生成中难以实现可靠情感控制的问题，即现有方法常导致目标情感表达不准确，并可能因拒绝、幻觉或改写而损害语言保真度。解决方案的关键在于首次从神经元层面揭示了情感敏感神经元（Emotion-Sensitive Neurons, ESNs）的存在及其因果可操作性，通过成功过滤后的激活聚合策略识别出既能实现情感表达又能保持内容一致性的ESNs；在此基础上，无需训练即可在推理阶段实现高效的情感引导，且该方法在多个LALM模型上表现出对未见说话者的泛化能力，并得到自动与人工评估的验证。

链接: https://arxiv.org/abs/2603.17231
作者: Xiutian Zhao,Ismail Rasim Ulgen,Philipp Koehn,Björn Schuller,Berrak Sisman
机构: Center for Language and Speech Processing (CLSP), Johns Hopkins University (约翰霍普金斯大学); Group on Language, Audio Music (GLAM), Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 11 pages, 10 figures

点击查看摘要

Abstract:Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.

[NLP-47] haruChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation

【速读】：该论文旨在解决生成式 AI（Generative AI）发展中因数据资源不均而导致的“数字鸿沟”问题，特别是全球南方地区本土语言（如尼泊尔与印度恒河平原地区的 Tharu 语）被主流大语言模型（Large Language Models, LLMs）边缘化的问题。Tharu 语虽有丰富的口头传统，但面临严重的数据稀缺和语言碎片化，导致现有多语言模型在训练过程中易产生“幻觉”或默认转向高资源语言（如印地语和尼泊尔语）。解决方案的关键在于构建一个基于 LLM-to-Human 人工引导的合成数据集 TharuChat，其通过提示工程驱动的 Gemini 模型，结合 Rana Tharu 的语法和民间故事自动生成训练数据；该数据集虽包含方言混杂和残留印地语/阿瓦德语影响等噪声，但仍能有效提升小规模模型性能——实验证明，将训练数据量从 25% 提升至 100% 可使困惑度（perplexity）从 6.42 线性降至 2.88，从而验证了利用消费级硬件即可实现对低资源喜马拉雅语言保护的可行性。

链接: https://arxiv.org/abs/2603.17220
作者: Prajwal Panth,Agniva Maiti
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 1 figure, 2 tables. Preprint. Code and dataset available on Hugging Face

点击查看摘要

Abstract:The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely “hallucinate” or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via a LLM-to-Human bootstrapping pipeline. We utilized prompt-engineered Gemini models, fed with Rana Tharu grammar and folklore, to synthesize training data. Unlike curated gold-standard corpora, TharuChat reflects the noisy, heterogeneous linguistic reality of the region: it is predominantly anchored in Rana Tharu (~70%) while integrating elements of Dangaura and Kochila dialects. We provide a transparent analysis of the dataset’s limitations, including dialectal code-mixing and residual Awadhi/Hindi influence. Through a rigorous empirical ablation study, we demonstrate that despite these imperfections, small-scale synthetic data is highly effective, increasing the dataset volume from 25% to 100% results in a linear reduction in perplexity from 6.42 to 2.88. The resulting model serves as a proof-of-concept for the preservation of under-resourced Himalayan languages via generative AI, achievable on consumer-grade hardware. Comments: 6 pages, 1 figure, 2 tables. Preprint. Code and dataset available on Hugging Face Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2603.17220 [cs.CL] (or arXiv:2603.17220v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.17220 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-48] Alignment Makes Language Models Normative Not Descriptive

【速读】：该论文试图解决的问题是：后训练对齐（post-training alignment）优化语言模型以匹配人类偏好信号，但这一目标并不等同于建模真实的人类行为。研究发现，在多轮策略博弈（如讨价还价、说服、谈判和重复矩阵博弈）中，未经对齐的基线模型（base models）在预测人类决策方面显著优于对齐模型（aligned models），表现差距接近10:1，且该结果在不同模型家族、提示格式和游戏配置下均具稳健性。解决方案的关键在于识别出对齐机制引入了规范性偏差（normative bias）——即对齐模型在人类行为符合规范解（如博弈论预测）时提升预测能力，但在涉及描述性动态（如互惠、报复与历史依赖适应）的多轮策略场景中反而削弱预测性能。这一边界条件模式揭示了将模型优化为“人类使用工具”与将其用作“人类行为代理”之间的根本权衡。

链接: https://arxiv.org/abs/2603.17218
作者: Eilam Shapira,Moshe Tennenholtz,Roi Reichart
机构: Technion – Israel Institute of Technology (以色列理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

[NLP-49] Anonymous-by-Construction: An LLM -Driven Framework for Privacy-Preserving Text

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）时代下敏感信息保护与数据可用性之间的矛盾问题，即在不泄露个人身份信息（Personally Identifiable Information, PII）的前提下，保持文本的语义连贯性和任务相关性。其解决方案的关键在于提出一种基于本地部署LLM的替代管道（substitution pipeline），通过类型一致的合成替代词对PII进行匿名化处理，整个过程在组织内部完成，避免了数据外泄风险；同时，该方法在多个评估维度上（隐私强度、语义保真度、微调后性能损失）均优于现有主流方案（如Microsoft Presidio、Google DLP及ZSTS），并支持在代理问答（agentic QA）系统中作为前置层使用，从而实现责任型AI部署的同时保障下游任务的实用性。

链接: https://arxiv.org/abs/2603.17217
作者: Federico Albanese,Pablo Ronco,Nicolás D’Ippolito
机构: Veritran; University of Buenos Aires; University of San Andrés
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Responsible use of AI demands that we protect sensitive information without undermining the usefulness of data, an imperative that has become acute in the age of large language models. We address this challenge with an on-premise, LLM-driven substitution pipeline that anonymizes text by replacing personally identifiable information (PII) with realistic, type-consistent surrogates. Executed entirely within organizational boundaries using local LLMs, the approach prevents data egress while preserving fluency and task-relevant semantics. We conduct a systematic, multi-metric, cross-technique evaluation on the Action-Based Conversation Dataset, benchmarking against industry standards (Microsoft Presidio and Google DLP) and a state-of-the-art approach (ZSTS, in redaction-only and redaction-plus-substitution variants). Our protocol jointly measures privacy, semantic utility, and trainability under privacy via a lifecycle-ready criterion obtained by fine-tuning a compact encoder (BERT+LoRA) on sanitized text. In addition, we assess agentic QA performance by inserting an on-premise anonymization layer before the answering LLM and evaluating the quality of its responses. This intermediate, type-preserving substitution stage ensures that no sensitive content is exposed to third-party APIs, enabling responsible deployment of Q\A agents without compromising confidentiality. Our method attains state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches and named-entity recognition (NER) baselines and ZSTS variants on the combined privacy–utility–trainability frontier. These results show that local LLM substitution yields anonymized corpora that are both responsible to use and operationally valuable: safe for agentic pipelines and suitable for downstream fine-tuning with limited degradation. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2603.17217 [cs.CL] (or arXiv:2603.17217v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2603.17217 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-50] SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在寄存器传输级（Register-Transfer Level, RTL）综合与摘要任务中面临的挑战，包括硬件描述语言（Hardware Description Language, HDL）严格的语法约束、监督信号有限以及与自然语言对齐能力弱等问题。现有提示工程（prompting）和检索增强生成（Retrieval-Augmented Generation, RAG）方法因未引入符号规划（symbolic planning），导致结构精度不足。解决方案的关键在于提出一种神经符号框架 SYMDIREC，其核心机制为：将 RTL 任务分解为符号子目标，通过微调的检索器获取相关代码片段，并利用 LLM 进行推理以组装验证后的输出。该框架无需对 LLM 进行微调即可支持 Verilog 和 VHDL，显著提升了 RTL 综合的 Pass@1 率（提升约 20%）和摘要的 ROUGE-L 分数（提升 15–20%），证明了符号引导在 RTL 任务中的有效性。

链接: https://arxiv.org/abs/2603.17208
作者: Prashanth Vijayaraghavan,Apoorva Nitsure,Luyao Shi,Charles Mackin,Ashutosh Jadhav,David Beymer,Ehsan Degan,Vandana Mukherjee
机构: IBM Research (IBM研究院)
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.

[NLP-51] CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

【速读】：该论文旨在解决寄存器传输级（Register Transfer Level, RTL）代码优化自动化不足的问题，以提升电子设计自动化（Electronic Design Automation, EDA）中功耗、性能和面积（Power, Performance, Area, PPA）的综合优化效果。其解决方案的关键在于提出CODMAS（Collaborative Optimization via a Dialectic Multi-Agent System），一个融合结构化辩证推理与领域感知代码生成的多智能体框架：其中“阐述者”（Articulator）通过逐步转化计划暴露隐含假设，“假设伙伴”（Hypothesis Partner）预测结果并调和预期与实际行为偏差，从而引导精准修正；二者协同驱动领域特定编码代理（Domain-Specific Coding Agent, DCA）生成架构感知的Verilog修改，并由代码评估代理（Code Evaluation Agent, CEA）验证语法、功能及PPA指标。该方法在RTLOPT基准上实现约25%关键路径延迟降低（流水线优化）和约22%功耗减少（时钟门控优化），显著优于强提示和代理基线方法。

链接: https://arxiv.org/abs/2603.17204
作者: Che-Ming Chang,Prashanth Vijayaraghavan,Ashutosh Jadhav,Charles Mackin,Vandana Mukherjee,Hsinyu Tsai,Ehsan Degan
机构: National Taiwan University (国立台湾大学); IBM Research (IBM研究)
类目: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.

[NLP-52] Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在推理过程中可能出现的动机性推理（motivated reasoning）问题，即模型在存在外部提示（hint）干扰时，会偏离真实决策逻辑而生成看似合理但实际受提示影响的思维链（Chain of Thought, CoT），从而误导对模型决策机制的理解。解决方案的关键在于利用对模型内部激活状态（internal activations）的监督探针（supervised probes），通过分析残差流（residual stream）中的表示来检测动机性推理行为：研究发现，即使在未生成CoT之前，预生成探针即可与基于完整CoT的监控方法同样有效预测动机性推理；而在生成后，后生成探针表现优于CoT监控方法，表明内部表征比外部输出更可靠地反映动机性推理的本质。这一发现揭示了从模型内部状态而非仅依赖CoT进行推理可信度评估的重要性。

链接: https://arxiv.org/abs/2603.17199
作者: Parsa Mirtaheri,Mikhail Belkin
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from CoT. Using supervised probes trained on the model’s residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as a LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.

[NLP-53] Abstraction as a Memory-Efficient Inductive Bias for Continual Learning

【速读】：该论文旨在解决在线持续学习（online continual learning）中因新知识学习导致旧知识遗忘和泛化性能下降的问题。其核心挑战在于如何在不重新训练模型的前提下，使智能体在非平稳、无限复杂的现实环境中稳定地持续学习。解决方案的关键在于提出抽象增强训练（Abstraction-Augmented Training, AAT），这是一种在损失层面进行的改进方法，通过联合优化具体实例与其抽象表示，引导模型捕捉跨样本共享的潜在关系结构，从而引入一种内存高效的归纳偏置（inductive bias）。该方法无需使用回放缓冲区（replay buffer），即可在严格在线数据流中稳定学习，实验证明其性能可媲美甚至超越依赖经验回放（experience replay, ER）的强基线模型。

链接: https://arxiv.org/abs/2603.17198
作者: Elnaz Rahmati,Nona Ghazizadeh,Zhivar Sourati,Nina Rouhani,Morteza Dehghani
机构: University of Southern California (南加州大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The real world is non-stationary and infinitely complex, requiring intelligent agents to learn continually without the prohibitive cost of retraining from scratch. While online continual learning offers a framework for this setting, learning new information often interferes with previously acquired knowledge, causes forgetting and degraded generalization. To address this, we propose Abstraction-Augmented Training (AAT), a loss-level modification encouraging models to capture the latent relational structure shared across examples. By jointly optimizing over concrete instances and their abstract representations, AAT introduces a memory-efficient inductive bias that stabilizes learning in strictly online data streams, eliminating the need for a replay buffer. To capture the multi-faceted nature of abstraction, we introduce and evaluate AAT on two benchmarks: a controlled relational dataset where abstraction is realized through entity masking, and a narrative dataset where abstraction is expressed through shared proverbs. Our results show that AAT achieves performance comparable to or exceeding strong experience replay (ER) baselines, despite requiring zero additional memory and only minimal changes to the training objective. This work highlights structural abstraction as a powerful, memory-free alternative to ER.

[NLP-54] abular LLM s for Interpretable Few-Shot Alzheimers Disease Prediction with Multimodal Biomedical Data

【速读】：该论文旨在解决阿尔茨海默病（Alzheimer’s disease, AD）诊断中因生物标志物数据量小且不完整导致深度学习模型性能不佳的问题。传统机器学习方法在小样本场景下表现受限，而现有大语言模型（Large Language Models, LLMs）虽具备少样本泛化能力，却未针对结构化表格数据进行专门适配。解决方案的关键在于提出TAP-GPT——一个基于TableGPT2架构并针对表格数据领域适配的LLM框架，通过使用表格提示（tabular prompts）而非纯文本输入进行微调，在少量标注数据下实现对AD的精准分类。该方法不仅在多模态与单模态设置中优于基础模型和传统机器学习基线，还展现出对高维特征选择的鲁棒性、缺失值处理稳定性及可解释的模态感知推理能力，为临床决策支持系统提供了可扩展的表格式预训练模型基础。

链接: https://arxiv.org/abs/2603.17191
作者: Sophie Kearney,Shu Yang,Zixuan Wen,Weimin Lyu,Bojian Hou,Duy Duong-Tran,Tianlong Chen,Jason H. Moore,Marylyn D. Ritchie,Chao Chen,Li Shen
机构: University of Pennsylvania (宾夕法尼亚大学); Stony Brook University (石溪大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Medical University of South Carolina (南卡罗来纳医科大学); Cedars-Sinai Medical Center (塞德斯-西奈医疗中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Accurate diagnosis of Alzheimer’s disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT Tabular Alzheimer’s Prediction GPT, a domain-adapted tabular LLM framework built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts rather than plain texts. We evaluate TAP-GPT across four ADNI-derived datasets, including QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP-GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few-shot setting while remaining competitive with state-of-the-art general-purpose LLMs. We show that feature selection mitigates degradation in high-dimensional inputs and that TAP-GPT maintains stable performance under simulated and real-world missingness without imputation. Additionally, TAP-GPT produces structured, modality-aware reasoning aligned with established AD biology and shows greater stability under self-reflection, supporting its use in iterative multi-agent systems. To our knowledge, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM-driven multi-agent clinical decision-support systems. The source code is publicly available on GitHub: this https URL.

[NLP-55] Exploiting the English Grammar Profile for L2 grammatical analysis with LLM s

【速读】：该论文旨在解决第二语言（L2）学习者语法能力评估的难题，即如何精准识别学习者在目标语言中对特定语法结构的尝试，并据此提供细粒度反馈及预测其整体语言水平（以欧洲语言共同参考框架，CEFR为准）。其核心解决方案在于构建一个基于英语语法档案（English Grammar Profile, EGP）的新型评估框架，该框架将语法构建成分映射至CEFR各等级，通过对比学习者原始句子与其修正版本，自动检测其语法尝试是否成功；并进一步利用这些检测结果作为预测变量进行熟练度评估。关键创新点在于：1）引入EGP实现语法构建成分的层级化标注与定位；2）比较规则基础方法与大语言模型（LLM）在不同语法特征上的表现，发现LLM在语义和语用复杂结构上更优，而规则方法在形态和句法层面仍具竞争力；3）提出混合管道（rule-based pre-filter + LLM）在熟练度预测中效果最佳；4）验证全自动语法纠错系统可逼近人工校正的效果，尤其在识别成功语法尝试方面表现优异，从而实现聚焦于学习者成功尝试的积极、形成性反馈机制。

链接: https://arxiv.org/abs/2603.17171
作者: Stefano Bannò,Penny Karanasou,Kate Knill,Mark Gales
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating the grammatical competence of second language (L2) learners is essential both for providing targeted feedback and for assessing proficiency. To achieve this, we propose a novel framework leveraging the English Grammar Profile (EGP), a taxonomy of grammatical constructs mapped to the proficiency levels of the Common European Framework of Reference (CEFR), to detect learners’ attempts at grammatical constructs and classify them as successful or unsuccessful. This detection can then be used to provide fine-grained feedback. Moreover, the grammatical constructs are used as predictors of proficiency assessment by using automatically detected attempts as predictors of holistic CEFR proficiency. For the selection of grammatical constructs derived from the EGP, rule-based and LLM-based classifiers are compared. We show that LLMs outperform rule-based methods on semantically and pragmatically nuanced constructs, while rule-based approaches remain competitive for constructs that rely purely on morphological or syntactic features and do not require semantic interpretation. For proficiency assessment, we evaluate both rule-based and hybrid pipelines and show that a hybrid approach combining a rule-based pre-filter with an LLM consistently yields the strongest performance. Since our framework operates on pairs of original learner sentences and their corrected counterparts, we also evaluate a fully automated pipeline using automatic grammatical error correction. This pipeline closely approaches the performance of semi-automated systems based on manual corrections, particularly for the detection of successful attempts at grammatical constructs. Overall, our framework emphasises learners’ successful attempts in addition to unsuccessful ones, enabling positive, formative feedback and providing actionable insights into grammatical development.

[NLP-56] How Clued up are LLM s? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在复杂多步演绎推理任务中表现不佳的问题，特别是针对生成式 AI（Generative AI）在模拟真实场景下进行持续、一致的逻辑推断能力不足的挑战。解决方案的关键在于构建一个基于文本的多智能体版本的经典桌游《妙探寻凶》（Clue），作为规则驱动的测试平台，用于评估LLM代理在完整游戏过程中执行多步骤演绎推理的能力；同时通过在结构化逻辑谜题上微调模型，探究其对实际游戏中推理性能的迁移效果。实验表明，尽管引入多智能体协作机制和微调策略，当前模型仍难以维持稳定推理，且微调并未显著提升推理精度，反而可能增加冗余推理量。

链接: https://arxiv.org/abs/2603.17169
作者: Rebecca Ansell,Autumn Toney-Wails
机构: Georgetown University (乔治城大学); Syntheos, Corp (Syntheos公司); UNU-Merit (联合国大学-默特中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.

[NLP-57] Multilingual Reference Need Assessment System for Wikipedia WWW’26

【速读】：该论文旨在解决维基百科（Wikipedia）中内容可验证性（verifiability）问题，即确保所有陈述都有可靠来源支持，这依赖于编辑者手动核查引用，但面对每日海量编辑，这一过程效率低下且资源消耗大。解决方案的关键在于引入一个多语言机器学习系统，用于辅助编辑识别需要添加引用的声明（claim），该系统在10种语言版本的维基百科上进行了测试，其性能优于现有基准，并在模型准确率与计算效率之间权衡，以适应实际部署环境中的基础设施限制，最终实现了生产级部署并开源了数据与代码，推动后续研究。

链接: https://arxiv.org/abs/2603.17146
作者: Aitolkyn Baigutanova,Francisco Navas,Pablo Aragon,Mykola Trokhymovych,Muniza Aslam,Ai-Jou Chou,Miriam Redi,Diego Saez-Trumper
机构: Wikimedia Foundation(维基媒体基金会); Pompeu Fabra University(庞培法布拉大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Accepted for publication at the Proceedings of the ACM Web Conference 2026 (WWW '26). Author’s copy

点击查看摘要

Abstract:Wikipedia is a critical source of information for millions of users across the Web. It serves as a key resource for large language models, search engines, question-answering systems, and other Web-based applications. In Wikipedia, content needs to be verifiable, meaning that readers can check that claims are backed by references to reliable sources. This depends on manual verification by editors, an effective but labor-intensive process, especially given the high volume of daily edits. To address this challenge, we introduce a multilingual machine learning system to assist editors in identifying claims requiring citations. Our approach is tested in 10 language editions of Wikipedia, outperforming existing benchmarks for reference need assessment. We not only consider machine learning evaluation metrics but also system requirements, allowing us to explore the trade-offs between model accuracy and computational efficiency under real-world infrastructure constraints. We deploy our system in production and release data and code to support further research.

[NLP-58] Knowledge Localization in Mixture-of-Experts LLM s Using Cross-Lingual Inconsistency

【速读】：该论文旨在解决大规模语言模型（Large Language Models, LLMs）在跨语言场景下行为不一致的问题，即模型在某些语言中能准确回忆事实信息，而在其他语言中则无法做到。传统研究通常将这种不一致性视为需消除的缺陷，而本文提出将其作为可利用的工具，用于提升混合专家模型（Mixture-of-Experts, MoE LLMs）的可解释性。解决方案的关键在于构建一个知识定位框架：首先通过在多种语言中提问困难的事实类问题，生成“成功”与“失败”的激活桶；随后对MoE路由器的日志进行统计对比分析，识别出对特定知识回答起关键作用的一小部分专家。实验表明，仅关闭约20个专家（占总数6000的极小比例），即可导致超过40%的问答任务失败，从而验证了这些专家的功能必要性。该方法为复杂LLMs提供了一种现实且可扩展的知识定位手段。

链接: https://arxiv.org/abs/2603.17102
作者: Lucas Bandarkar,Alan Ansell,Trevor Cohn
机构: Google Research; University of California, Los Angeles; University of Melbourne
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information from languages where it fails. This allows us to isolate model components that play a functional role in answering about a piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate “success” and “failure” activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify experts important for knowledge. To validate the necessity of this small number of experts for answering a knowledge question, we deactivate them and re-ask the question. We find that despite only deactivating about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases. Generally, this method provides a realistic and scalable knowledge localization approach to address increasingly complex LLMs.

[NLP-59] Evaluating LLM -Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在模拟人类对话时难以准确再现不一致和非协作行为（如误解、打断等）的问题，这些问题对构建真实复杂社会互动的仿真系统至关重要。解决方案的关键在于提出一个名为CoCoEval的评估框架，该框架利用“以LLM为裁判”（LLM-as-a-Judge）的方法，在话语轮次级别检测10类不一致与非协作行为，并通过对比不同LLM（GPT-4.1、GPT-5.1、Claude Opus 4）生成对话与真实人类会议、辩论中行为频率的差异，系统性地量化了当前LLMs在模拟人类社交行为方面的局限性。

链接: https://arxiv.org/abs/2603.17094
作者: Ryo Kamoi,Ameya Godbole,Longqi Yang,Rui Zhang,Mengting Wan,Pei Zhou
机构: Microsoft Corporation(微软); Penn State University(宾夕法尼亚州立大学); University of Southern California(南加州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model and in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, as our results show that different prompts lead to their under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations, raising concerns about the use of LLMs as a proxy for human social interaction.

[NLP-60] Ensemble Self-Training for Unsupervised Machine Translation

【速读】：该论文旨在解决无监督神经机器翻译（Unsupervised Neural Machine Translation, UNMT）中模型性能受限于单一模型表达能力不足的问题。解决方案的关键在于提出一种基于集成驱动的自训练框架：首先训练多个共享同一翻译任务但引入不同辅助语言的UNMT模型，以诱导结构化的模型多样性；随后通过词级别集成解码生成伪平行语料，对每个模型进行进一步训练，实现模型间的共享监督学习；最终在部署时仅保留验证性能最优的单模型，兼顾效果提升与推理效率。实验表明，该方法在英译其他语言和反向翻译任务上分别获得1.7和0.67的chrF分数提升。

链接: https://arxiv.org/abs/2603.17087
作者: Ido Aharon,Jonathan Shaki,Sarit Kraus
机构: Bar-Ilan University (巴伊兰大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). Starting from a primary language pair, we train multiple UNMT models that share the same translation task but differ in an auxiliary language, inducing structured diversity across models. We then generate pseudo-translations for the primary pair using token-level ensemble decoding, averaging model predictions in both directions. These ensemble outputs are used as synthetic parallel data to further train each model, allowing the models to improve via shared supervision. At deployment time, we select a single model by validation performance, preserving single-model inference cost. Experiments show statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.

[NLP-61] Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts

【速读】：该论文旨在解决大型推理类语言模型（Large Language Models, LLMs）在跨语言知识迁移中存在的性能差距问题，尤其是由书写系统（script）差异导致的知识传递障碍。研究表明，这种差距主要源于“脚本障碍”（script barrier），而非语言或语系差异；通过在推理阶段提供问题关键实体的源语言形式，可显著提升跨脚本任务的表现。解决方案的关键在于：利用合成数据进行监督微调（Supervised Fine-Tuning, SFT），引导模型在推理时更好地处理音译歧义，从而增强跨脚本参数化知识获取能力，最终缩小跨语言知识迁移的差距。

链接: https://arxiv.org/abs/2603.17070
作者: Lucas Bandarkar,Alan Ansell,Trevor Cohn
机构: Google Research; University of California, Los Angeles; University of Melbourne
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around the world, ECLeKTic and MultiLoKo. Our regression analysis shows that script match - not language or family - is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. We further this finding by providing the LLMs with the key entities of the questions in their source language and find that this disproportionately improves cross-script questions. We then posit that these LLMs could be reasoning better at test-time. To evaluate this, we develop a synthetic generation pipeline to design SFT samples to encourage the model to better reason about transliteration ambiguities when trying to fetch parametric knowledge at inference-time. We show that teaching two models to reason better reduces the cross-script transfer gap. As a result, we conclude that there is potential to improve cross-lingual parametric knowledge transfer during post-training.

[NLP-62] Evaluating Ill-Defined Tasks in Large Language Models

【速读】：该论文旨在解决当前大型语言模型（Large Language Models, LLMs）在评估任务中面临的“定义不清”（ill-defined）问题，即任务输入输出空间不明确、成功标准模糊，导致现有评估基准和指标无法提供可靠或具有诊断意义的模型能力信号。其解决方案的关键在于通过两个典型案例——复杂指令遵循（Complex Instruction Following, CIF）与自然语言到Mermaid序列图生成（Natural Language to Mermaid Sequence Diagrams, NL2Mermaid）——揭示评估设计中的系统性缺陷，如覆盖不足、对指令表述敏感、指标不一致及LLM评判者引入的不稳定性，并强调多维评估标准可提供超越平均得分的可操作洞察，从而推动更鲁棒、可解释的评估体系构建。

链接: https://arxiv.org/abs/2603.17067
作者: Yi Zhou,Basel Shbita
机构: IBM Research (IBM研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. Our findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate more robust, interpretable evaluation designs.

[NLP-63] HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

【速读】：该论文旨在解决视觉语言模型（Vision-Language Models, VLMs）在细粒度视觉-语言推理任务中表现不佳的问题，尤其是当依赖长链思维链（Chain-of-Thought, CoT）推理时，模型容易出现感知错误、推理错误、知识缺失和幻觉等多样化的失败模式，且这些错误可能在中间步骤中累积放大。现有用于强化学习视觉语言推理（Reinforcement Learning for Vision-Language Reasoning, RLVR）的数据集大多缺乏基于视觉证据的复杂多跳推理链条，导致上述弱点未被充分暴露与训练。解决方案的关键在于提出HopChain框架，该框架可规模化生成多跳视觉语言推理数据，每条查询由逻辑上相互依赖的实例锚定跳跃组成：前期跳跃构建后续跳跃所需的实体、集合或条件，最终答案为具体且可验证的数值，便于设计奖励机制；实验表明，将此类多跳数据加入原始RLVR训练数据后，在24个跨STEM、谜题、通用VQA、文本识别与文档理解及视频理解的基准测试中，两个不同规模的Qwen3.5系列模型均在20个基准上取得提升，验证了其广泛适用性与有效性。

链接: https://arxiv.org/abs/2603.17024
作者: Shenzhi Wang,Shixuan Liu,Jing Zhou,Chang Gao,Xiong-Hui Chen,Binghai Wang,An Yang,Shiji Song,Bowen Yu,Gao Huang,Junyang Lin
机构: Qwen Team, Alibaba Inc.(阿里巴巴公司); LeapLab, Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, 8 figures, 2 tables

点击查看摘要

Abstract:VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.

[NLP-64] LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agent ic Settings

【速读】：该论文旨在解决自然语言到SQL（Natural Language to SQL, NL2SQL）系统在真实动态、噪声和不断演化的数据库环境中缺乏鲁棒性的问题。传统基准测试通常假设静态模式和规范的用户输入，无法反映现实场景中的复杂性。为应对这一挑战，作者构建了一个包含约十类扰动的鲁棒性评估基准，并在传统管道与代理式（agentic）两种设置下对多个前沿大语言模型（Large Language Models, LLMs）进行评测。其解决方案的关键在于通过系统化引入多种类型扰动（如字符级噪声和语义不变但词法或句法变化的语言变体），揭示不同模型在面对表面噪声与语言变异时的性能差异，从而识别出当前NL2SQL系统在处理语义保持但形式多样的输入时仍存在显著脆弱性，尤其在代理式设置中表现更为突出。

链接: https://arxiv.org/abs/2603.17017
作者: Lifu Tu,Rongguang Wang,Tao Sheng,Sujjith Ravi,Dan Roth
机构: Oracle AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.

[NLP-65] MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

【速读】：该论文旨在解决基于Group Relative Policy Optimization (GRPO)框架中重要性比例（importance ratio）调控带来的训练不稳定性问题。现有方法如硬截断（hard clipping）存在不可微边界和梯度消失区域，难以保持梯度保真性，且缺乏对极端偏差的自适应抑制机制，导致策略优化易受突变影响。解决方案的关键在于提出一种新的Modulated Hazard-aware Policy Optimization (MHPO)框架：其核心创新包括两个模块——Log-Fidelity Modulator (LFM)，将无界的重要性比例映射到有界且可微的域，从而避免高方差异常值破坏损失曲面并保障全局梯度稳定性；以及Decoupled Hazard Penalty (DHP)，通过引入生存分析中的累积风险函数，独立调节正向与负向策略偏移，实现对不对称策略变化的精细控制，同时缓解因过度扩展导致的模式坍缩和因灾难性收缩引发的策略退化，最终在文本和视觉语言任务的多种推理基准上显著提升性能与训练稳定性。

链接: https://arxiv.org/abs/2603.16929
作者: Hongjun Wang,Wei Liu,Weibo Gu,Xing Sun,Kai Han
机构: The University of Hong Kong (香港大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.

[NLP-66] Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection

【速读】：该论文旨在解决语音深度伪造检测中对神经音频编解码器（Neural Audio Codec）离散结构利用不足的问题。现有方法通常依赖连续编码器特征或忽略量化层级的层次结构，而该研究发现不同量化层级捕获互补的声学线索：早期量化器捕捉粗粒度结构，后期量化器则揭示合成伪影的残差细节。解决方案的关键在于提出一种层次感知的表示学习框架，通过可学习的全局加权机制建模各量化层级的贡献，从而构建与取证线索对齐的结构化编解码器表示；该方法仅更新4.4%的额外参数，保持语音编码器主干冻结，显著提升了检测性能，在ASVspoof 2019和ASVspoof5数据集上分别实现相对等错误率（EER）降低46.2%和13.9%。

链接: https://arxiv.org/abs/2603.16914
作者: Jinyang Wu,Zihan Pan,Qiquan Zhang,Sailor Hardik Bhupendra,Soumik Mondal
机构: Agency for Science, Technology and Research (A*STAR), Singapore; The University of New South Wales, Australia
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method achieves relative EER reductions of 46.2% on ASVspoof 2019 and 13.9% on ASVspoof5 over strong baselines.

[NLP-67] Rubric-Guided Fine-tuning of SpeechLLM s for Multi-Aspect Multi-Rater L2 Reading-Speech Assessment LREC2026

【速读】：该论文旨在解决生成式语音语言模型（SpeechLLMs）在第二语言（L2）口语自动评估中难以与人类评分者细腻变异性对齐的问题，从而实现可靠且可解释的自动化评分。其解决方案的关键在于提出一种基于评分量规（rubric-guided）的推理框架，显式编码准确性、流利度和语调三个维度的人类评估标准，并通过不确定性校准（uncertainty calibration）建模自然评分差异；具体而言，采用高斯不确定性建模与保形校准（conformal calibration）相结合的方法，生成可解释的置信区间，显著提升了模型与人类评分的一致性，尤其在流利度和语调评估上表现稳健，而准确性的评估仍具挑战性。

链接: https://arxiv.org/abs/2603.16889
作者: Aditya Kamlesh Parikh,Cristian Tejedor-Garcia,Catia Cucchiarini,Helmer Strik
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to LREC 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)

点击查看摘要

Abstract:Reliable and interpretable automated assessment of second-language (L2) speech remains a central challenge, as large speech-language models (SpeechLLMs) often struggle to align with the nuanced variability of human raters. To address this, we introduce a rubric-guided reasoning framework that explicitly encodes multi-aspect human assessment criteria: accuracy, fluency, and prosody, while calibrating model uncertainty to capture natural rating variability. We fine-tune the Qwen2-Audio-7B-Instruct model using multi-rater human judgments and develop an uncertainty-calibrated regression approach supported by conformal calibration for interpretable confidence intervals. Our Gaussian uncertainty modeling and conformal calibration approach achieves the strongest alignment with human ratings, outperforming regression and classification baselines. The model reliably assesses fluency and prosody while highlighting the inherent difficulty of assessing accuracy. Together, these results demonstrate that rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.

[NLP-68] okenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition

【速读】：该论文旨在解决基于惯性测量单元（Inertial Measurement Unit, IMU）的在线手写识别中，因字符分布不均和跨写作者差异导致的性能下降问题。其关键解决方案在于区分处理不同类型的变异性：对于跨写作者差异（inter-writer variance），采用基于二元组（Bigram）的子词分词策略进行结构抽象，显著提升对未见书写风格的泛化能力，使词错误率（Word Error Rate, WER）从15.40%降至12.99%；而对于写作者内部差异（intra-writer variance），则提出基于拼接的数据增强方法，作为强正则化手段有效缓解词汇分布稀疏问题，将字符错误率降低34.5%，WER降低25.4%，且效果优于等比例延长训练时间的策略。

链接: https://arxiv.org/abs/2603.16883
作者: Jindong Li,Dario Zanca,Vincent Christlein,Tim Hamann,Jens Barth,Peter Kämpf,Björn Eskofier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Inertial measurement unit-based online handwriting recognition enables the recognition of input signals collected across different writing surfaces but remains challenged by uneven character distributions and inter-writer variability. In this work, we systematically investigate two strategies to address these issues: sub-word tokenization and concatenation-based data augmentation. Our experiments on the OnHW-Words500 dataset reveal a clear dichotomy between handling inter-writer and intra-writer variance. On the writer-independent split, structural abstraction via Bigram tokenization significantly improves performance to unseen writing styles, reducing the word error rate (WER) from 15.40% to 12.99%. In contrast, on the writer-dependent split, tokenization degrades performance due to vocabulary distribution shifts between the training and validation sets. Instead, our proposed concatenation-based data augmentation acts as a powerful regularizer, reducing the character error rate by 34.5% and the WER by 25.4%. Further analysis shows that short, low-level tokens benefit model performance and that concatenation-based data augmentation performance gain surpasses those achieved by proportionally extended training. These findings reveal a clear variance-dependent effect: sub-word tokenization primarily mitigates inter-writer stylistic variability, whereas concatenation-based data augmentation effectively compensates for intra-writer distributional sparsity.

[NLP-69] Enhancing Financial Report Question-Answering: A Retrieval-Augmented Generation System with Reranking Analysis CEC

【速读】：该论文旨在解决金融分析师在处理冗长的10-K报告（通常超过100页）时信息提取效率低下的问题，提出了一种基于检索增强生成（Retrieval-Augmented Generation, RAG）的系统来高效回答关于标普500公司财务报告的问题。解决方案的关键在于构建一个结合全文检索与语义检索的混合搜索管道，并引入神经重排序（neural reranking）模块——利用交叉编码器（cross-encoder）模型对候选文档进行精细化排序，从而显著提升生成答案的准确性和可靠性。实验表明，该方法相较于基线模型在正确率上提升了15.5个百分点，同时大幅降低错误答案比例，验证了重排序机制在金融领域RAG系统中的核心作用。

链接: https://arxiv.org/abs/2603.16877
作者: Zhiyuan Cheng,Longying Lai,Yue Liu,Kai Cheng,Xiaoxi Qi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 2 figures. Submitted to ICECET 2026

点击查看摘要

Abstract:Financial analysts face significant challenges extracting information from lengthy 10-K reports, which often exceed 100 pages. This paper presents a Retrieval-Augmented Generation (RAG) system designed to answer questions about SP 500 financial reports and evaluates the impact of neural reranking on system performance. Our pipeline employs hybrid search combining full-text and semantic retrieval, followed by an optional reranking stage using a cross-encoder model. We conduct systematic evaluation using the FinDER benchmark dataset, comprising 1,500 queries across five experimental groups. Results demonstrate that reranking significantly improves answer quality, achieving 49.0 percent correctness for scores of 8 or above compared to 33.5 percent without reranking, representing a 15.5 percentage point improvement. Additionally, the error rate for completely incorrect answers decreases from 35.3 percent to 22.5 percent. Our findings emphasize the critical role of reranking in financial RAG systems and demonstrate performance improvements over baseline methods through modern language models and refined retrieval strategies.

[NLP-70] rust Safety and Accuracy: Assessing LLM s for Routine Maternity Advice

【速读】：该论文旨在解决印度农村地区孕妇难以获取可靠产前健康信息的问题，这主要受限于医疗资源匮乏和基础设施薄弱。解决方案的关键在于评估大型语言模型（Large Language Models, LLMs）如ChatGPT-4o、Perplexity AI和GeminiAI在提供准确且易懂的妊娠相关信息方面的潜力。研究通过对比模型输出与产科专业人员的回答，采用语义相似性、名词重叠度和可读性指标衡量内容质量，发现Perplexity在语义匹配上最接近专家水平，而ChatGPT-4o在术语使用和文本清晰度方面表现更优。这表明，具备高准确性与良好可读性的AI工具可作为提升偏远地区孕产妇健康教育可及性的可行方案。

链接: https://arxiv.org/abs/2603.16872
作者: V Sai Divya,A Bhanusree,Rimjhim,K Venkata Krishna Rao
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Access to reliable maternal healthcare information is a major challenge in rural India due to limited medical resources and infrastructure. With over 830 million internet users and nearly half of rural women online, digital tools offer new opportunities for health education. This study evaluates large language models (LLMs) like ChatGPT-4o, Perplexity AI, and GeminiAI to provide reliable and understandable pregnancy-related information. Seventeen pregnancy-focused questions were posed to each model and compared with responses from maternal health professionals. Evaluations used semantic similarity, noun overlap, and readability metrics to measure content quality. Results show Perplexity closely matched expert semantics, while ChatGPT-4o produced clearer, more understandable text with better medical terminology. As internet access grows in rural areas, LLMs could serve as scalable aids for maternal health education. The study highlights the need for AI tools that balance accuracy and clarity to improve healthcare communication in underserved regions.

[NLP-71] he Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

【速读】：该论文旨在解决传统自然语言处理（Natural Language Processing, NLP）对话系统中“思考”机制与语音交互时序不匹配的问题，即现有方法通常在用户说完后才进行推理和生成，导致响应延迟且无法模拟人类在听讲过程中同步进行内部认知加工的现象。解决方案的关键在于提出一种全双工潜意识推理方法（Full-duplex LAtent and Internal Reasoning, FLAIR），其核心创新是：在用户说话阶段即启动潜意识推理过程，通过递归地将前一时刻的隐式嵌入（latent embedding）作为下一时刻的输入，实现因果一致的连续推理，无需额外延迟；同时设计基于证据下界（Evidence Lower Bound, ELBO）的目标函数，支持高效的监督微调（supervised fine-tuning），避免对显式推理标注的依赖。

链接: https://arxiv.org/abs/2603.17837
作者: Donghang Wu,Tianyu Zhang,Yuxin Li,Hexin Liu,Chen Chen,Eng Siong Chng,Yoshua Bengio
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional “thinking” mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user’s speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

[NLP-72] Multi-Source Evidence Fusion for Audio Question Answering

【速读】：该论文旨在解决大音频语言模型（Large Audio Language Models, LALMs）在处理语音、音乐和环境声音时内部推理过程不透明、难以验证的问题。其解决方案的关键在于构建一个基于多源集成的推理管道：首先利用两个独立的大音频语言模型生成观测结果，再通过一个纯文本推理模型对这些结果与25个按可靠性分级的声学工具输出进行交叉验证；整个推理链每一步均基于显式的、带有可靠性标签的证据，从而生成密集且可验证的推理路径，显著提升了推理过程的事实准确性、逻辑严谨性和完整性。

链接: https://arxiv.org/abs/2603.17822
作者: Aivo Olev,Tanel Alumäe
机构: Tanel; Tallinn University of Technology (塔林理工大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech’s solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin in challenge’s reasoning quality metric.

[NLP-73] Modeling Changing Scientific Concepts with Complex Networks: A Case Study on the Chemical Revolution EACL2026

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）生成的上下文嵌入（context embeddings）在用于估计概念变迁时存在的不可解释性和缺乏时间感知性问题，以及历史数据中偏见增强对数字人文研究者带来的风险。其解决方案的关键在于构建一个基于主题的复杂网络框架，以原型概念（prototypical concepts）为节点，通过分析皇家学会语料库中化学革命时期两种竞争理论（燃素说 vs. 氧气理论）的演变路径，揭示了命名学变化（onomasiological change）与更高熵值和拓扑密度之间的关联，表明概念多样性增加与知识连接强度提升密切相关。

链接: https://arxiv.org/abs/2603.17594
作者: Sofía Aguilar-Valdez,Stefania Degaetano-Ortlieb
机构: Saarland University (萨尔兰大学)
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
备注: Accepted by the EACL 2026 Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

点击查看摘要

Abstract:While context embeddings produced by LLMs can be used to estimate conceptual change, these representations are often not interpretable nor time-aware. Moreover, bias augmentation in historical data poses a non-trivial risk to researchers in the Digital Humanities. Hence, to model reliable concept trajectories in evolving scholarship, in this work we develop a framework that represents prototypical concepts through complex networks based on topics. Utilizing the Royal Society Corpus, we analyzed two competing theories from the Chemical Revolution (phlogiston vs. oxygen) as a case study to show that onomasiological change is linked to higher entropy and topological density, indicating increased diversity of ideas and connectivity effort.

[NLP-74] he Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLM s INTERSPEECH2026

【速读】：该论文旨在解决生成式 AI（Generative AI）在处理语音输入时可能引入的口音与性别偏见问题，尤其是在语音大语言模型（SpeechLLMs）中，由于保留了说话者身份相关的声学特征（如口音和感知性别），导致响应内容出现非公平性差异。解决方案的关键在于构建一个大规模、受控的交叉评估框架，通过语音克隆技术保持语言内容不变，仅改变口音（六种英语口音）和性别呈现（两种），并结合点态评分、成对比较和最佳-最差选择法（Best-Worst Scaling）进行多维度量化分析，从而识别出隐性的、非显式的偏见模式——特别是东欧口音在女性发音者中表现出显著更低的帮助性评分，且人类评估者比LLM评判器更能捕捉到这种交叉偏见的细微差异。

链接: https://arxiv.org/abs/2603.16941
作者: Shree Harsha Bokkahalli Satish,Christoph Minixhofer,Maria Teleki,James Caverlee,Ondřej Klejch,Peter Bell,Gustav Eje Henter,Éva Székely
机构: KTH Royal Institute of Technology (皇家理工学院); University of Edinburgh (爱丁堡大学); Texas AM University (德州农工大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages, 3 figures, 1 table, Submitted to Interspeech 2026

点击查看摘要

Abstract:Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect consistent disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. The bias is implicit: responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, uncovering sharper intersectional disparities.

[NLP-75] SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation

【速读】：该论文旨在解决实时多语言通信中长期连续语音到语音翻译（Simultaneous Speech-to-Speech Translation, SimulS2S）的泛化能力不足问题，现有方法通常依赖资源密集型训练且仅适用于短句预分割输入，难以适配真实场景下的长语音流。其解决方案的关键在于提出首个无需训练的策略——SimulU，通过利用预训练端到端模型中的交叉注意力机制（cross-attention），设计历史管理与语音输出选择策略，动态调控输入历史窗口和输出生成过程，从而在不进行额外训练的前提下实现高质量、低延迟的长格式同步翻译。

链接: https://arxiv.org/abs/2603.16924
作者: Amirbek Djanibekov,Luisa Bentivogli,Matteo Negri,Sara Papi
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.

[NLP-76] NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning

【速读】：该论文旨在解决当前脑电图（Electroencephalography, EEG）分析方法在临床应用中的局限性，即现有计算方法多局限于特定任务分类或粗粒度模式识别，难以提供具有临床意义的可解释性解读。为应对这一挑战，作者提出 NeuroNarrator——首个通用型 EEG-to-text 基础模型，其核心创新在于构建了 NeuroCorpus-160K，一个包含超过 16 万段 EEG 数据与结构化临床描述的标准化大规模语料库。解决方案的关键在于：首先通过对比学习对齐时间域 EEG 波形与空间拓扑图谱，建立时频-空间联合表征；进而基于状态空间启发式建模，将历史时序与频谱上下文融合至大型语言模型中，实现连贯、可解释的临床叙事生成，从而打通连续神经信号与离散临床语言之间的桥梁，支持开放式的、面向临床报告流程的电生理数据深度解析。

链接: https://arxiv.org/abs/2603.16880
作者: Guoan Wang,Shihao Yang,Jun-en Ding,Hao Zhu,Feng Liu
机构: Stevens Institute of Technology (斯蒂文斯理工学院)
类目: ignal Processing (eess.SP); Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation. This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator’s capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.

信息检索

[IR-0] Averag e Case Graph Searching in Non-Uniform Cost Models

【速读】：该论文旨在解决带权重的二分搜索问题（Weighted Binary Search Problem）的一般化形式：在图中寻找一个隐藏的目标顶点 $ x $，通过迭代查询顶点来实现，每次查询 $ v $ 的代价为 $ c(v, x) $，并返回若 $ v \neq x $ 时包含 $ x $ 的连通分量。目标是最小化平均搜索代价。其核心解决方案包括：(1) 当查询代价与目标无关时，提出针对树结构的 $(4+\epsilon)$ -近似FPTAS算法（时间复杂度 $ O(n^4/\epsilon^2) $），以及针对一般图的 $ O(\sqrt{\log n}) $-近似算法；(2) 当代价函数关于目标单调递增时，设计出首个常数因子近似算法（2-近似），且可扩展至最坏情况下的优化；(3) 在任意代价函数下，证明该问题在UGC假设下不存在常数近似比，即使输入为星形树。关键突破在于对不同代价模型的精细刻画与算法设计，尤其在非均匀代价场景下首次获得常数近似保证。

链接: https://arxiv.org/abs/2603.17916
作者: Michał Szyfelbein
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
备注: arXiv admin note: substantial text overlap with arXiv:2511.06564

点击查看摘要

Abstract:We consider the following generalization of the classic Binary Search Problem: a searcher is required to find a hidden target vertex x in a graph G , by iteratively performing queries about vertices. A query to v incurs a cost c(v, x) and responds whether v=x and if not, returns the connected component in G-v containing x . The goal is to design a search strategy that minimizes the average-case search cost. Firstly, we consider the case when the cost of querying a vertex is independent of the target. We develop a \br4+\epsilon -approximation FPTAS for trees running in O(n^4/\epsilon^2) time and an O(\sqrt\log n) -approximation for general graphs. Additionally, we give an FPTAS parametrized by the number of non-leaf vertices of the graph. On the hardness side we prove that the problem is NP-hard even when the input is a tree with bounded degree or bounded diameter. Secondly, we consider trees and assume c(v, x) to be a monotone non-decreasing function with respect to x , i.e.\ if u \in P_v, x then c(u, x) \leq c(v, x) . We give a 2 -approximation algorithm which can also be easily altered to work for the worst-case variant. This is the first constant factor approximation algorithm for both criterions. Previously known results only regard the worst-case search cost and include a parametrized PTAS as well as a 4 -approximation for paths. At last, we show that when the cost function is an arbitrary function of the queried vertex and the target, then the problem does not admit any constant factor approximation under the UGC, even when the input tree is a star. Comments: arXiv admin note: substantial text overlap with arXiv:2511.06564 Subjects: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR) Cite as: arXiv:2603.17916 [cs.DS] (or arXiv:2603.17916v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2603.17916 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Michał Szyfelbein [view email] [v1] Wed, 18 Mar 2026 16:54:24 UTC (64 KB)

[IR-1] A Contextual Help Browser Extension to Assist Digital Illiterate Internet Users

【速读】：该论文旨在解决用户在浏览网页时因不熟悉技术缩写（acronym）而导致的阅读理解障碍与信息检索效率低下的问题，尤其针对数字素养较低至中等水平的用户群体。其解决方案的关键在于设计并实现了一种浏览器扩展程序，通过双层人工智能（AI）管道机制：第一层利用Google Cloud的自然语言处理（Natural Language Processing, NLP）分类API识别页面是否为技术相关，从而精准激活提示逻辑以减少误报；第二层结合人工整理的技术词典与OpenAI大语言模型（Large Language Model, LLM），提供轻量级悬停提示（tooltip）定义，其中词典匹配响应平均耗时仅2135毫秒，显著优于AI生成定义（16429毫秒）和手动搜索（17200毫秒），有效提升了理解效率与用户体验。

链接: https://arxiv.org/abs/2603.17592
作者: Christos Koutsiaris
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 5 figures, 2 tables; MSc dissertation reformatted as conference paper; extended version available at this http URL

点击查看摘要

Abstract:This paper describes the design, implementation, and evaluation of a browser extension that provides contextual help to users who hover over technological acronyms and abbreviations on web pages. The extension combines a curated technical dictionary with OpenAI’s large language model (LLM) to deliver on-demand definitions through lightweight tooltip overlays. A dual-layer artificial intelligence (AI) pipeline, comprising Google Cloud’s Natural Language Processing (NLP) taxonomy API and OpenAI’s ChatGPT, classifies each visited page as technology-related before activating the tooltip logic, thereby reducing false-positive detections. A mixed-methods study with 25 participants evaluated the tool’s effect on reading comprehension and information-retrieval time among users with low to intermediate digital literacy. Results show that 92% of participants reported improved understanding of technical terms, 96% confirmed time savings over manual web searches, and all participants found the tooltips non-disruptive. Dictionary-based definitions were appended in an average of 2135 ms, compared to 16429 ms for AI-generated definitions and a mean manual search time of 17200 ms per acronym. The work demonstrates a practical, real-time approach to bridging the digital literacy gap and points toward extending contextual help to other domains such as medicine, law, and finance.

[IR-2] From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM -Based Paper Evaluation

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）在科学论文评估中依赖绝对分数所带来的泛化能力不足问题。由于不同会议、时间周期和评价标准导致的评分尺度差异，基于绝对分数训练的模型容易过拟合于特定上下文规则，难以形成稳健的学术判断。其解决方案的关键在于将论文评估范式从独立评分转向协作排序（collaborative ranking），提出了一种名为CNPE（Comparison-Native framework for Paper Evaluation）的新框架：一方面通过图结构相似性排序算法高效采样更具区分度的论文对以构建训练数据；另一方面在模型学习阶段引入基于比较的监督微调与强化学习机制，利用比较奖励信号增强相对质量判断能力。推理时，模型通过对采样论文对进行成对比较，并聚合偏好信号生成全局相对质量排序，从而实现更鲁棒且可迁移的论文评估性能。

链接: https://arxiv.org/abs/2603.17588
作者: Pujun Zheng,Jiacheng Yao,Jinquan Zheng,Chenyang Gu,Guoxiu He,Jiawei Liu,Yong Huang,Tianrui Guo,Wei Lu
机构: East China Normal University (华东师范大学); Wuhan University (武汉大学); China Academic Degrees Graduate Education Development Center (中国学位与研究生教育发展中心)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design \textbfComparison-\textbfNative framework for \textbfPaper \textbfEvaluation (\textbfCNPE), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of \textbf21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. \hrefthis https URLCode.

[IR-3] Negation is Not Semantic: Diagnosing Dense Retrieval Failure Modes for Trade-offs in Contradiction-Aware Biomedical QA

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在生物医学问答中生成看似合理但未经验证的陈述所带来的临床风险问题，尤其关注如何确保答案的可证伪性与证据可追溯性。其核心解决方案是通过构建一个去耦合的词汇架构（Decoupled Lexical Architecture），在统一的BM25检索基础上实现语义支持召回率（0.810）与矛盾证据精准揭示能力（0.750）之间的平衡，并结合叙事感知重排序（Narrative Aware Reranking）和单次上下文学习（One-Shot In-Context Learning），显著提升引用覆盖率至100%，同时保持零引用矛盾率。该方法有效规避了复杂对抗性密集检索导致的语义坍塌（Semantic Collapse）问题，实现了从随机生成到诚实证据合成的范式转变，为高可靠性生物医学AI系统提供了可扩展且精确的架构基础。

链接: https://arxiv.org/abs/2603.17580
作者: Soumya Ranjan Sahoo,Gagan N.,Sanand Sasidharan,Divya Bharti
机构: GE HealthCare(通用电气医疗); Bangalore(班加罗尔)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in biomedical question answering, yet their tendency to generate plausible but unverified claims poses serious risks in clinical settings. To mitigate these risks, the TREC 2025 BioGen track mandates grounded answers that explicitly surface contradictory evidence (Task A) and the generation of narrative driven, fully attributed responses (Task B). Addressing the absence of target ground truth, we present a proxy-based development framework using the SciFact dataset to systematically optimize retrieval architectures. Our iterative evaluation revealed a “Simplicity Paradox”: complex adversarial dense retrieval strategies failed catastrophically at contradiction detection (MRR 0.023) due to Semantic Collapse, where negation signals become indistinguishable in vector space. We further identify a Retrieval Asymmetry: filtering dense embeddings improves contradiction detection but degrades support recall, compromising reliability. We resolve this via a Decoupled Lexical Architecture built on a unified BM25 backbone, balancing semantic support recall (0.810) with precise contradiction surfacing (0.750). This approach achieves the highest Weighted MRR (0.790) on the proxy benchmark while remaining the only viable strategy for scaling to the 30 million document PubMed corpus. For answer generation, we introduce Narrative Aware Reranking and One-Shot In-Context Learning, improving citation coverage from 50% (zero-shot) to 100%. Official TREC results confirm our findings: our system ranks 2nd on Task A contradiction F1 and 3rd out of 50 runs on Task B citation coverage (98.77%), achieving zero citation contradict rate. Our work transforms LLMs from stochastic generators into honest evidence synthesizers, showing that epistemic integrity in biomedical AI requires precision and architectural scalability isolated metric optimization.

[IR-4] Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at Spotify

【速读】：该论文旨在解决播客推荐中长期偏好稳定性和用户意图动态变化之间的矛盾问题，即如何在保持用户熟悉内容的同时支持探索性发现。传统推荐系统侧重于长期交互模式，难以有效融合丰富上下文信号和灵活的意图感知目标。解决方案的关键在于提出GLIDE——一个面向生产环境的生成式播客推荐系统，其核心创新包括：将推荐任务建模为基于语义ID（Semantic IDs）离散化播客目录的指令遵循任务，实现大规模库存上的精准生成；通过近期收听历史与轻量级用户上下文进行条件控制，并注入长期用户嵌入作为软提示（soft prompts），在严格推理约束下捕捉稳定偏好；最终在百万级用户规模上验证了该方法在非习惯性播放量提升（最高5.4%）和新节目发现率提升（最高14.3%）方面的有效性，同时满足生产成本与延迟要求。

链接: https://arxiv.org/abs/2603.17540
作者: Edoardo D’Amico,Marco De Nadai,Praveen Chandar,Divita Vohra,Shawn Lin,Max Lefarov,Paul Gigioli,Gustavo Penha,Ilya Kopysitsky,Ivo Joel Senese,Darren Mei,Francesco Fabbri,Oguz Semerci,Yu Zhao,Vincent Tang,Brian St. Thomas,Alexandra Ranieri,Matthew N.K. Smith,Aaron Bernkopf,Bryan Leung,Ghazal Fazelnia,Mark VanMiddlesworth,Timothy Christopher Heath,Petter Pehrson Skiden,Alice Y. Wang,Doug J. Cole,Andreas Damianou,Maya Hristakeva,Reid Wilbur,Tarun Chillara,Vladan Radosavljevic,Pooja Chitkara,Sainath Adapa,Juan Elenter,Bernd Huber,Jacqueline Wood,Saaketh Vedantam,Jan Stypka,Sandeep Ghael,Martin D. Gould,David Murgatroyd,Yves Raimond,Mounia Lalmas,Paul N. Bennett
机构: Spotify(spotify)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Podcast listening is often grounded in a set of favorite shows, while listener intent can evolve over time. This combination of stable preferences and changing intent motivates recommendation approaches that support both familiarity and exploration. Traditional recommender systems typically emphasize long-term interaction patterns, and are less explicitly designed to incorporate rich contextual signals or flexible, intent-aware discovery objectives. In this setting, models that can jointly reason over semantics, context, and user state offer a promising direction. Large Language Models (LLMs) provide strong semantic reasoning and contextual conditioning for discovery-oriented recommendation, but deploying them in production introduces challenges in catalog grounding, user-level personalization, and latency-critical serving. We address these challenges with GLIDE, a production-scale generative recommender for podcast discovery at Spotify. GLIDE formulates recommendation as an instruction-following task over a discretized catalog using Semantic IDs, enabling grounded generation over a large inventory. The model conditions on recent listening history and lightweight user context, while injecting long-term user embeddings as soft prompts to capture stable preferences under strict inference constraints. We evaluate GLIDE using offline retrieval metrics, human judgments, and LLM-based evaluation, and validate its impact through large-scale online A/B testing. Across experiments involving millions of users, GLIDE increases non-habitual podcast streaming on Spotify home surface by up to 5.4% and new-show discovery by up to 14.3%, while meeting production cost and latency constraints. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2603.17540 [cs.IR] (or arXiv:2603.17540v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2603.17540 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-5] A Unified Language Model for Large Scale Search Recommendation and Reasoning

【速读】：该论文旨在解决如何在大规模、异构目录中，通过单一端到端生成式模型同时支持推荐、检索与推理等多任务行为的问题。现有方法在处理真实物品的明确引用、多类型实体管理及低延迟可靠性约束时存在局限，而工具增强型系统虽部分缓解此问题，却引入了编排复杂性和端到端优化限制。解决方案的关键在于提出NEO框架——一个无需外部工具、基于目录锚定（catalog-grounded）的生成模型，其核心创新是将物品表示为结构化标识符（SID, Structured Identifier），并通过分阶段对齐与指令微调，使预训练解码器仅用自然语言序列即可混合生成文本和带类型的SID，实现语言可控的推理能力（即“语言可引导性”，language-steerability）。该设计确保生成结果始终符合目录有效性，同时保持自由文本生成灵活性，在超过1000万项跨媒体内容的真实场景下验证了其优越性能与跨任务迁移能力。

链接: https://arxiv.org/abs/2603.17533
作者: Marco De Nadai,Edoardo D’Amico,Max Lefarov,Alexandre Tamborrino,Divita Vohra,Mark VanMiddlesworth,Shawn Lin,Jacqueline Wood,Jan Stypka,Eliza Klyce,Keshi Dai,Timothy Christopher Heath,Martin D. Gould,Yves Raimond,Sandeep Ghael,Tony Jebara,Andreas Damianou,Vladan Radosavljevic,Paul N. Bennett,Mounia Lalmas,Praveen Chandar
机构: Spotify( Spotify)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs are increasingly applied to recommendation, retrieval, and reasoning, yet deploying a single end-to-end model that can jointly support these behaviors over large, heterogeneous catalogs remains challenging. Such systems must generate unambiguous references to real items, handle multiple entity types, and operate under strict latency and reliability constraints requirements that are difficult to satisfy with text-only generation. While tool-augmented recommender systems address parts of this problem, they introduce orchestration complexity and limit end-to-end optimization. We view this setting as an instance of a broader research problem: how to adapt LLMs to reason jointly over multiple-domain entities, users, and language in a fully self-contained manner. To this end, we introduce NEO, a framework that adapts a pre-trained decoder-only LLM into a tool-free, catalog-grounded generator. NEO represents items as SIDs and trains a single model to interleave natural language and typed item identifiers within a shared sequence. Text prompts control the task, target entity type, and output format (IDs, text, or mixed), while constrained decoding guarantees catalog-valid item generation without restricting free-form text. We refer to this instruction-conditioned controllability as language-steerability. We treat SIDs as a distinct modality and study design choices for integrating discrete entity representations into LLMs via staged alignment and instruction tuning. We evaluate NEO at scale on a real-world catalog of over 10M items across multiple media types and discovery tasks, including recommendation, search, and user understanding. In offline experiments, NEO consistently outperforms strong task-specific baselines and exhibits cross-task transfer, demonstrating a practical path toward consolidating large-scale discovery capabilities into a single language-steerable generative model.

[IR-6] VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

【速读】：该论文旨在解决多模态序列推荐（Multimodal Sequential Recommendation, MSR）中因使用小型冻结预训练编码器而导致语义容量受限、协同过滤（Collaborative Filtering, CF）信号难以充分融入物品表示的问题。传统方法通过标准对比监督微调（Supervised Fine-Tuning, SFT）将视觉-语言模型（Vision-Language Model, VLM）适配为嵌入生成器并注入CF信号，但易引发模态坍缩（modality collapse），即优化过程被单一模态主导而另一模态退化，从而损害推荐准确性。解决方案的关键在于提出VLM2Rec框架：其一，引入弱模态惩罚对比学习（Weak-modality Penalized Contrastive Learning），以纠正优化过程中的梯度不平衡；其二，设计跨模态关系拓扑正则化（Cross-Modal Relational Topology Regularization），以保持不同模态间的几何一致性，从而实现多模态信息的均衡利用与高质量推荐。

链接: https://arxiv.org/abs/2603.17450
作者: Junyoung Kim,Woojoo Kim,Jaehyung Lim,Dongha Kim,Hwanjo Yu
机构: Pohang University of Science and Technology (浦项科技大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify its inherent modality collapse. In this state, optimization is dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.

[IR-7] CRE-T1 Preview Technical Report: Beyond Contrastive Learning for Reasoning -Intensive Retrieval

【速读】：该论文旨在解决推理密集型检索（reasoning-intensive retrieval）中的核心挑战，即如何识别查询与文档之间的隐式推理关系，而非仅依赖表面语义或词汇相似性。传统对比学习范式本质上是一种静态表征固化技术，在训练阶段将层次化相关性概念编码为向量空间中的固定几何结构，无法在推理时根据具体查询的推理需求动态调整相关性判断，导致在词汇不匹配或需隐式推理建立关联时性能显著下降。解决方案的关键在于提出 Thought 1 (T1)，一种生成式检索模型，将相关性建模从静态对齐转向动态推理生成：一方面，查询侧通过动态生成中间推理轨迹以桥接隐式推理关系，并使用 embtoken 作为推理输出的语义聚合点；另一方面，文档侧采用 instruction + text + embtoken 的编码格式支持高吞吐索引。此外，通过三阶段训练课程和第三阶段引入 GRPO（Generalized Reward Policy Optimization），使模型能够通过试错强化学习掌握不同查询的最佳推导策略，从而内化动态推理能力至向量表示中。实验证明，T1-4B 在 BRIGHT 基准上优于采用对比学习训练的更大模型，并达到多阶段检索流水线相当的性能水平，验证了以动态推理生成替代静态表征对齐的有效性。

链接: https://arxiv.org/abs/2603.17387
作者: Guangzhi Wang,Yinghao Jiao,Zhi Liu
机构: CareerInternational Research Team
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The central challenge of reasoning-intensive retrieval lies in identifying implicitreasoning relationships between queries and documents, rather than superficial se-mantic or lexical similarity. The contrastive learning paradigm is fundamentallya static representation consolidation technique: during training, it encodes hier-archical relevance concepts into fixed geometric structures in the vector space,and at inference time it cannot dynamically adjust relevance judgments accord-ing to the specific reasoning demands of each query. Consequently, performancedegrades noticeably when vocabulary mismatch exists between queries and doc-uments or when implicit reasoning is required to establish relevance. This pa-per proposes Thought 1 (T1), a generative retrieval model that shifts relevancemodeling from static alignment to dynamic reasoning. On the query side, T1 dy-namically generates intermediate reasoning trajectories for each query to bridgeimplicit reasoning relationships and uses embtoken as a semantic aggregationpoint for the reasoning output. On the document side, it employs an instruction+ text + embtoken encoding format to support high-throughput indexing. Tointernalize dynamic reasoning capabilities into vector representations, we adopt athree-stage training curriculum and introduce GRPO in the third stage, enablingthe model to learn optimal derivation strategies for different queries through trial-and-error reinforcement learning. On the BRIGHT benchmark, T1-4B exhibitsstrong performance under the original query setting, outperforming larger modelstrained with contrastive learning overall, and achieving performance comparableto multi-stage retrieval pipelines. The results demonstrate that replacing static rep-resentation alignment with dynamic reasoning generation can effectively improvereasoning-intensive retrieval performance.

[IR-8] PJB: A Reasoning -Aware Benchmark for Person-Job Retrieval

【速读】：该论文旨在解决当前检索模型在通用基准上性能趋同后，缺乏对系统失败位置与原因的诊断能力问题，尤其在人岗匹配（Person-Job Matching）这一复杂任务中，现有评估基准无法支持技能迁移推理和岗位胜任力判断的系统性分析。其解决方案的关键在于提出PJB（Person-Job Benchmark），这是一个以完整职位描述为查询、完整简历为文档的推理感知型检索评估数据集，通过岗位胜任力（job-competency judgment）定义相关性，并基于真实招聘数据构建涵盖六个行业领域近20万份简历的语料库；同时引入领域族（domain-family）和推理类型（reasoning-type）两类诊断标签，将评估范式从“谁得分更高”升级为“系统在何处存在差异及其成因”，从而揭示不同行业间的性能异质性远超模块改进带来的收益，并识别出重排序（reranking）模块稳定提升效果而查询理解模块在结合重排序时反而导致性能下降的瓶颈所在，最终为招聘检索系统的优化提供可定位的能力图谱（capability map）。

链接: https://arxiv.org/abs/2603.17386
作者: Guangzhi Wang,Xiaohui Yang,Kai Li,Jiawen He,Kai Yang,Ruixuan Zhang,Zhi Liu
机构: CareerInternational Research Team
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As retrieval models converge on generic benchmarks, the pressing question is no longer “who scores higher” but rather “where do systems fail, and why?” Person-job matching is a domain that urgently demands such diagnostic capability – it requires systems not only to verify explicit constraints but also to perform skill-transfer inference and job-competency reasoning, yet existing benchmarks provide no systematic diagnostic support for this task. We introduce PJB (Person-Job Benchmark), a reasoning-aware retrieval evaluation dataset that uses complete job descriptions as queries and complete resumes as documents, defines relevance through job-competency judgment, is grounded in real-world recruitment data spanning six industry domains and nearly 200,000 resumes, and upgrades evaluation from “who scores higher” to “where do systems differ, and why” through domain-family and reasoning-type diagnostic labels. Diagnostic experiments using dense retrieval reveal that performance heterogeneity across industry domains far exceeds the gains from module upgrades for the same model, indicating that aggregate scores alone can severely mislead optimization decisions. At the module level, reranking yields stable improvements while query understanding not only fails to help but actually degrades overall performance when combined with reranking – the two modules face fundamentally different improvement bottlenecks. The value of PJB lies not in yet another leaderboard of average scores, but in providing recruitment retrieval systems with a capability map that pinpoints where to invest.

[IR-9] Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild

【速读】：该论文旨在解决现有引文推荐系统在捕捉人类引文行为模式方面的不足，以及当前评估协议未能反映真实应用场景的问题。现有方法虽能利用局部和全局文本信息提升性能，但往往忽视了人类引文行为的细微特征，且部分引入行为模式的方法存在计算开销大和引入系统性偏差的问题。解决方案的关键在于提出一个轻量级、无需训练的模块——Profiler，其能够高效且无偏地捕获人类引文模式，显著提升候选文献的检索效果；同时，论文设计了一个严格的归纳式（Inductive）评估设置，通过施加时间约束来模拟对新发表论文的真实引文推荐场景，从而更贴近实际应用需求。此外，论文进一步提出了DAVINCI模型，通过自适应向量门控机制融合Profiler提供的置信度先验与语义信息，在多个基准数据集上实现新的最先进性能，展现出卓越的效率与泛化能力。

链接: https://arxiv.org/abs/2603.17361
作者: Karan Goyal,Dikshant Kukreja,Vikram Goyal,Mukesh Mohania
机构: IIIT Delhi, India
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Proper citation of relevant literature is essential for contextualising and validating scientific contributions. While current citation recommendation systems leverage local and global textual information, they often overlook the nuances of the human citation behaviour. Recent methods that incorporate such patterns improve performance but incur high computational costs and introduce systematic biases into downstream rerankers. To address this, we propose Profiler, a lightweight, non-learnable module that captures human citation patterns efficiently and without bias, significantly enhancing candidate retrieval. Furthermore, we identify a critical limitation in current evaluation protocol: the systems are assessed in a transductive setting, which fails to reflect real-world scenarios. We introduce a rigorous Inductive evaluation setting that enforces strict temporal constraints, simulating the recommendation of citations for newly authored papers in the wild. Finally, we present DAVINCI, a novel reranking model that integrates profiler-derived confidence priors with semantic information via an adaptive vector-gating mechanism. Our system achieves new state-of-the-art results across multiple benchmark datasets, demonstrating superior efficiency and generalisability.

[IR-10] Learning Evolving Preferences: A Federated Continual Framework for User-Centric Recommendation WWW2026

【速读】：该论文旨在解决联邦持续推荐（federated continual recommendation）中的两个核心问题：一是用户行为动态变化导致的时间遗忘（temporal forgetting），即模型在本地更新过程中逐渐丢失历史偏好；二是异构用户数据下协同个性化能力弱化的问题。解决方案的关键在于提出FCUCR框架，其核心创新包括：1）时间感知自蒸馏策略（time-aware self-distillation strategy），通过隐式保留历史偏好来缓解时间遗忘；2）跨用户原型迁移机制（inter-user prototype transfer mechanism），利用相似用户的知识增强个体表示，同时保持用户的独立决策逻辑，从而实现长期隐私保护下的个性化推荐。

链接: https://arxiv.org/abs/2603.17315
作者: Chunxu Zhang,Zhiheng Xue,Guodong Long,Weipeng Zhang,Bo Yang
机构: Jilin University (吉林大学); Australian Artificial Intelligence Institute, FEIT, University of Technology Sydney (悉尼科技大学信息技术学院人工智能研究院)
类目: Information Retrieval (cs.IR)
备注: Accepted at WWW 2026

点击查看摘要

Abstract:User-centric recommendation has become essential for delivering personalized services, as it enables systems to adapt to users’ evolving behaviors while respecting their long-term preferences and privacy constraints. Although federated learning offers a promising alternative to centralized training, existing approaches largely overlook user behavior dynamics, leading to temporal forgetting and weakened collaborative personalization. In this work, we propose FCUCR, a federated continual recommendation framework designed to support long-term personalization in a privacy-preserving manner. To address temporal forgetting, we introduce a time-aware self-distillation strategy that implicitly retains historical preferences during local model updates. To tackle collaborative personalization under heterogeneous user data, we design an inter-user prototype transfer mechanism that enriches each client’s representation using knowledge from similar users while preserving individual decision logic. Extensive experiments on four public benchmarks demonstrate the superior effectiveness of our approach, along with strong compatibility and practical applicability. Code is available.

[IR-11] Graph-Native Cognitive Memory for AI Agents : Formal Belief Revision Semantics for Versioned Memory Architectures

【速读】：该论文旨在解决AI代理记忆系统中组件虽已存在但缺乏统一架构设计与形式化基础的问题，尤其关注如何构建一个既能支持认知记忆又能管理代理生成工作项的图原生架构。解决方案的关键在于提出Kumiho架构，其核心是建立AGM信念修正框架与属性图记忆系统操作语义之间的对应关系，从而在形式上满足基本AGM公理（K2–K6）和Hansson信念基公理（相关性、核心保留性），并通过三个创新实现性能突破：前瞻性索引（LLM在写入时预生成未来情景推论）、事件抽取（结构化因果事件保留在摘要中）以及客户端LLM重排序机制；该架构还采用双存储模型（Redis用于工作内存，Neo4j用于长期图存储）并集成混合全文与向量检索，且具备模型解耦特性——更换答案模型无需调整流水线即可显著提升端到端准确率。

链接: https://arxiv.org/abs/2603.17244
作者: Young Bin Park
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Logic in Computer Science (cs.LO)
备注: 56 pages, 1 figure

点击查看摘要

Abstract:While individual components for AI agent memory exist in prior systems, their architectural synthesis and formal grounding remain underexplored. We present Kumiho, a graph-native cognitive memory architecture grounded in formal belief revision semantics. The structural primitives required for cognitive memory – immutable revisions, mutable tag pointers, typed dependency edges, URI-based addressing – are identical to those required for managing agent-produced work as versionable assets, enabling a unified graph-native architecture that serves both purposes. The central formal contribution is a correspondence between the AGM belief revision framework and the operational semantics of a property graph memory system, proving satisfaction of the basic AGM postulates (K2–K6) and Hansson’s belief base postulates (Relevance, Core-Retainment). The architecture implements a dual-store model (Redis working memory, Neo4j long-term graph) with hybrid fulltext and vector retrieval. On LoCoMo (token-level F1), Kumiho achieves 0.565 overall F1 (n=1,986) including 97.5% adversarial refusal accuracy. On LoCoMo-Plus, a Level-2 cognitive memory benchmark testing implicit constraint recall, Kumiho achieves 93.3% judge accuracy (n=401); independent reproduction by the benchmark authors yielded results in the mid-80% range, still substantially outperforming all published baselines (best: Gemini 2.5 Pro, 45.7%). Three architectural innovations drive the results: prospective indexing (LLM-generated future-scenario implications indexed at write time), event extraction (structured causal events preserved in summaries), and client-side LLM reranking. The architecture is model-decoupled: switching the answer model from GPT-4o-mini (~88%) to GPT-4o (93.3%) improves end-to-end accuracy without pipeline changes, at a total evaluation cost of ~ 14 for 401 entries.

[IR-12] ListK: Semantic ORDER BY and LIMIT K with Listwise Prompting

【速读】：该论文旨在解决语义排序操作（semantic ORDER BY … LIMIT K）在大型语言模型（LLM）驱动的SQL查询中延迟高且准确率难以兼顾的问题。现有方法在处理此类语义排序时性能不足，尤其在大规模数据场景下效率低下。解决方案的关键在于提出ListK框架，其核心创新包括：引入三种基于列表级排序（listwise ranking）的新型算法——确定性列表锦标赛（LTTopK）、拉斯维加斯并行多轴快速选择/排序（LMPQSelect/LMPQSort，首次研究）以及蒙特卡洛列表锦标赛过滤器（LTFilter），并通过查询优化器动态组合这些物理算子以最小化延迟并满足目标召回率。理论分析支持参数调优与成本估算，实验证明ListK显著优于现有方案，在几乎不损失召回率和归一化折损累计增益（NDCG）的前提下将延迟降低约50%。

链接: https://arxiv.org/abs/2603.17223
作者: Jason Shin,Jiwon Chang,Fatemeh Nargesian
机构: University of Rochester(罗切斯特大学)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Semantic operators abstract large language model (LLM) calls in SQL clauses. It is gaining traction as an easy method to analyze semi-structured, unstructured, and multimodal datasets. While a plethora of recent works optimize various semantic operators, existing methods for semantic ORDER BY (full sort) and LIMIT K (top-K) remain lackluster. Our ListK framework improves the latency of semantic ORDER BY … LIMIT K at no cost to accuracy. Motivated by the recent advance in fine-tuned listwise rankers, we study several sorting algorithms that best combine partial listwise rankings. These include: 1) deterministic listwise tournament (LTTopK), 2) Las Vegas and embarrassingly parallel listwise multi-pivot quickselect/sort (LMPQSelect, LMPQSort), and 3) a basic Monte Carlo listwise tournament filter (LTFilter). Of these, listwise multi-pivot quickselect/sort are studied here for the first time. The full framework provides a query optimizer for combining the above physical operators based on the target recall to minimize latency. We provide theoretical analysis to easily tune parameters and provide cost estimates for query optimizers. ListK empirically dominates the Pareto frontier, halving latency at virtually no cost to recall and NDCG compared to prior art.

[IR-13] OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

【速读】：该论文旨在解决密集检索模型（dense retriever）在领域特定微调过程中存在的数据效率与效果不均衡问题，即并非所有训练样本对学习过程的贡献均等。其解决方案的关键在于提出一种名为OPERA的数据剪枝框架，其中包含两种策略：静态剪枝（Static Pruning, SP）通过保留高相似度查询-文档对提升排序性能（如NDCG），但可能因查询多样性下降而损害召回率；进一步提出的两阶段动态剪枝（Dynamic Pruning, DP）则通过在训练过程中自适应调节查询和文档层面的采样概率，在优先选择高质量样本的同时保持对完整训练集的访问，从而同时优化排序与召回指标。实验表明，DP在多个领域数据集上均实现最优综合性能，并显著缩短训练时间（<50%），且适用于基于大语言模型（LLM）的检索器（如Qwen3-Embedding），展现出架构无关的优势。

链接: https://arxiv.org/abs/2603.17205
作者: Haoyang Fang,Shuai Zhang,Yifei Ma,Hengyi Wang,Cuixiong Hu,Katrin Kirchhoff,Bernie Wang,George Karypis
机构: Amazon Web Services (亚马逊网络服务)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.

[IR-14] Visual Product Search Benchmark

【速读】：该论文旨在解决工业和商业场景中基于图像的精确产品识别问题，尤其是在维护、采购和运营流程中，错误匹配可能导致高昂的下游故障。其核心挑战在于从大规模且持续演化的商品目录中，在多样成像条件下准确检索并排序特定对象实例。解决方案的关键在于构建一个结构化的视觉嵌入模型基准测试体系，涵盖开源基础嵌入模型、专有跨模态嵌入系统及领域特定的纯视觉模型，并在统一的图像到图像检索协议下进行评估，且不依赖后处理步骤以纯粹衡量各模型的检索能力。该基准特别强调现实约束、异构图像条件与精确实例匹配需求，从而为从业者和研究者提供当前视觉嵌入方法在生产级产品识别系统中的性能边界与适用性参考。

链接: https://arxiv.org/abs/2603.17186
作者: Karthik Sulthanpete Govindappa
机构: nyris GmbH (nyris公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 21 pages

点击查看摘要

Abstract:Reliable product identification from images is a critical requirement in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows where incorrect matches can lead to costly downstream failures. At the core of such systems lies the visual search component, which must retrieve and rank the exact object instance from large and continuously evolving catalogs under diverse imaging conditions. This report presents a structured benchmark of modern visual embedding models for instance-level image retrieval, with a focus on industrial applications. A curated set of open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models are evaluated under a unified image-to-image retrieval protocol. The benchmark includes curated datasets, which includes industrial datasets derived from production deployments in Manufacturing, Automotive, DIY, and Retail, as well as established public benchmarks. Evaluation is conducted without post-processing, isolating the retrieval capability of each model. The results provide insight into how well contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks, and how they compare to models explicitly trained for industrial applications. By emphasizing realistic constraints, heterogeneous image conditions, and exact instance matching requirements, this benchmark aims to inform both practitioners and researchers about the strengths and limitations of current visual embedding approaches in production-level product identification systems. An interactive companion website presenting the benchmark results, evaluation details, and additional visualizations is available at this https URL.

[IR-15] HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storag e

【速读】：该论文旨在解决传统GPU哈希表在嵌入表（embedding table）规模超过单GPU容量时，因强制保留所有插入键值对而导致高带宽内存（HBM）资源浪费的问题。其核心解决方案是引入缓存语义（cache semantics），将策略驱动的逐出（eviction）作为首要操作，而非依赖重哈希或容量失败处理。关键创新在于提出HierarchicalKV（HKV）——首个通用GPU哈希表库，其默认运行模式为缓存语义：每个全桶的更新或插入操作（upsert）通过就地逐出或拒绝接纳完成，无需重新哈希；同时协同设计了四个核心机制：缓存行对齐桶、内联评分驱动的upsert、基于评分的动态双桶选择及三组并发控制，并利用分层键值分离实现超越HBM容量的扩展能力。

链接: https://arxiv.org/abs/2603.17168
作者: Haidong Rong,Jiashu Yao,Matthias Langer,Shijie Liu,Li Fan,Dongxin Wang,Jia He,Jinglin Chen,Jiaheng Rang,Julian Qian,Mengyao Xu,Fan Yu,Minseok Lee,Zehuan Wang,Even Oldridge
机构: NVIDIA(英伟达); Tencent(腾讯); Vipshop(唯品会); BOSS Zhipin(BOSS直聘); ByteDance(字节跳动); Snap( snaps); NVIDIA(英伟达)
类目: Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Traditional GPU hash tables preserve every inserted key – a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms – cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency – and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at lambda=0.50) and up to 2.6-9.4x over indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.

人机交互

[HC-0] Augmenting Scholarly Reading with Cross-Media Annotations

【速读】：该论文旨在解决当前学术阅读中PDF标注工具对多媒体内容支持有限的问题，即学者在阅读过程中难以将音频、视频或网页等外部材料与PDF文档进行有效关联。其解决方案的关键在于提出一种跨媒体标注（cross-media annotation）的设计探索，通过允许用户轻松地将PDF内容与其他类型文档或媒体资源建立链接，从而丰富学术阅读实践，并为其他研究者提供可引导的阅读体验支持。

链接: https://arxiv.org/abs/2603.17957
作者: Qi Xu,Beat Signer
机构: Vrije Universiteit Brussel (布鲁塞尔自由大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Scholarly reading often involves engaging with various supplementary materials beyond PDFs to support understanding. In practice, scholars frequently incorporate such external materials into their reading workflow through annotation. However, most existing PDF annotation tools support only a limited range of media types for embedding annotations in PDF documents. This paper investigates cross-media annotation as a design space for augmenting academic reading. We present a design exploration of a cross-media annotation tool that allows scholars to easily link PDF content with other documents and materials such as audio, video or web pages. The proposed design has the potential to enrich reading practices and enable scholars to guide and support other researchers’ reading experiences.

[HC-1] AI-Assisted Goal Setting Improves Goal Progress Through Social Accountability

【速读】：该论文旨在解决如何在大规模应用中帮助个体识别并追求具有个人意义的职业目标这一关键挑战，传统职业辅导（career coaching）虽能提升目标质量和实现程度，但受限于成本高和可及性差。研究提出以大语言模型（Large Language Model, LLM）驱动的聊天机器人作为可扩展替代方案，并通过一项预先注册的三臂随机对照试验验证其心理机制。解决方案的关键在于：AI职业教练（“Leon”）通过增强用户感知到的社会责任感（perceived social accountability），显著促进短期目标进展，且该效应在结构化自我反思条件下未被观察到，表明其核心优势并非提升目标与自我一致性（self-concordance），而是通过交互式反馈强化了责任感这一中介变量。

链接: https://arxiv.org/abs/2603.17887
作者: Michel Schimpf,Julian Voigt,Thomas Bohné
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Helping people identify and pursue personally meaningful career goals at scale remains a key challenge in applied psychology. Career coaching can improve goal quality and attainment, but its cost and limited availability restrict access. Large language model (LLM)-based chatbots offer a scalable alternative, yet the psychological mechanisms by which they might support goal pursuit remain untested. Here we report a preregistered three-arm randomised controlled trial (N = 517) comparing an AI career coach (“Leon,” powered by Claude Sonnet), a matched structured written questionnaire covering closely matched reflective topics, and a no-support control on goal progress at a two-week follow-up. The AI chatbot produced significantly higher goal progress than the control (d = 0.33, p = .016). Compared with the written-reflection condition, the AI did not significantly improve overall goal progress, but it increased perceived social accountability. In the preregistered mediation model, perceived accountability mediated the AI-over-questionnaire effect on goal progress (indirect effect = 0.15, 95% CI [0.04, 0.31]), whereas self-concordance did not. These findings suggest that AI-assisted goal setting can improve short-term goal progress, and that its clearest added value over structured self-reflection lies in increasing felt accountability.

[HC-2] Building a “-Sensitive Design” Methodology from Political Philosophies or Ideologies

【速读】：该论文旨在解决传统价值敏感设计（Value Sensitive Design, VSD）在将人类价值观转化为具体技术需求时存在的转化困境以及规范性基础薄弱的问题。其解决方案的关键在于提出一种元框架——“-Sensitive Design (-SD)”，通过嵌入政治或意识形态价值作为规范性准则，强化设计过程中的价值导向；文中以依赖敏感设计（Dependency-Sensitive Design, DSD）为例，融合Kittay对古典自由主义理论的批判，在实践中扩展VSD方法论，从而推动哲学与设计研究在更广泛领域的深度融合。

链接: https://arxiv.org/abs/2603.17806
作者: Anthony Maocheia-Ricci,Edith Law
机构: University of Waterloo (滑铁卢大学)
类目: Human-Computer Interaction (cs.HC)
备注: Position paper for the CHIdeology workshop at CHI 2026, Barcelona. this https URL

点击查看摘要

Abstract:Value-based approaches such as Value Sensitive Design (VSD) enable technology designers to engage with and integrate human values in technology through a tripartite methodology of conceptual, empirical, and technical investigations. However, VSD contains pitfalls in both translating values to requirements and a lack of normative grounding, leading to adaptations such as Jacobs’ Capability Sensitive Design (CSD). Inspired by CSD and extensions of the design approach, we propose the concept of creating -Sensitive Design (-SD); a meta-framework to embed various political or ideological values as norms in a design research process. We exemplify this through \emphDependency-Sensitive Design (DSD), combining ideas from Kittay’s critiques of classical liberal theory within a practical VSD framework. Finally, we push for further work combining philosophy and design in areas beyond CSD and DSD.

[HC-3] Large Language Models in Teaching and Learning: Reflections on Implementing an AI Chatbot in Higher Education

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在高等教育中应用时所面临的潜在风险，特别是其生成幻觉（hallucination）和专业知识有限可能导致的教学质量下降问题。解决方案的关键在于引入一个基于检索增强生成（Retrieval-Augmented Generation, RAG）模型的生成式AI助手，该助手被嵌入到一门大学课程中以替代此前由教师主导、耗时较长的教学活动。通过三轮迭代式的混合方法实验（包括交叉设计），研究系统评估了LLM对学生动机、人机交互感知差异、生成内容质量及学业表现的影响，从而为LLMs在专业课程中的教学可行性提供了实证依据。

链接: https://arxiv.org/abs/2603.17773
作者: Fiammetta Caccavale,Carina L. Gargalo,Julian Kager,Magdalena Skowyra,Steen Larsen,Krist V. Gernaey,Ulrich Krühne
机构: DTU (丹麦技术大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The landscape of education is changing rapidly, shaped by emerging pedagogical approaches, technological innovations such as artificial intelligence (AI), and evolving societal expectations, all of which demand thorough evaluation of new educational tools. Although large language models (LLMs) present substantial opportunities especially in Higher Education, their propensity to generate hallucinations and their limited specialized knowledge may introduce significant risks. This study aims to address these risks by examining the practical implementation of an LLM-enhanced assistant in a university level course. We implemented a generative AI assistant grounded in a retrieval-augmented generation (RAG) model to replicate a previously teacher-led, time-intensive exercise. To assess the effectiveness of the LLM, we conducted three separate experiments through iterative mixed-methods approaches, including a crossover design. The resulting data address central research questions related to student motivation, perceived differences between engaging with the LLM versus a human teacher, the quality of AI-generated responses, and the impact of the LLM on students’ academic performance. The results offer direct insights into students’ views and the pedagogical feasibility of embedding LLMs into specialized courses. Finally, we discuss the main challenges, opportunities and future directions of LLMs in teaching and learning in Higher Education. Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC) Cite as: arXiv:2603.17773 [cs.CY] (or arXiv:2603.17773v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2603.17773 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[HC-4] Facial Movement Dynamics Reveal Workload During Complex Multitasking

【速读】：该论文旨在解决安全关键环境中实时认知负荷监测的难题，传统方法存在侵入性强、成本高或时间分辨率不足的问题。其解决方案的关键在于利用标准网络摄像头捕捉面部运动动力学特征（如速度、加速度、位移及递归量化指标），通过随机森林分类器实现对认知负荷的识别，仅需每位受试者2分钟的最小校准即可达到73%的准确率，展现出基于通用摄像设备的低成本、高灵敏度监测潜力，尽管个体差异限制了跨被试泛化能力。

链接: https://arxiv.org/abs/2603.17767
作者: Carter Sale,Melissa N. Stolar,Gaurav Patil,Michael J. Gostelow,Julia Wallier,Margaret C. Macpherson,Jan-Louis Kruger,Mark Dras,Simon G. Hosking,Rachel W. Kallen,Michael J. Richardson
机构: Macquarie University (麦考瑞大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 7 figures, under review at Royal Society Open Science

点击查看摘要

Abstract:Real-time cognitive workload monitoring is crucial in safety-critical environments, yet established measures are intrusive, expensive, or lack temporal resolution. We tested whether facial movement dynamics from a standard webcam could provide a low-cost alternative. Seventy-two participants completed a multitasking simulation (OpenMATB) under varied load while facial keypoints were tracked via OpenPose. Linear kinematics (velocity, acceleration, displacement) and recurrence quantification features were extracted. Increasing load altered dynamics across timescales: movement magnitudes rose, temporal organisation fragmented then reorganised into complex patterns, and eye-head coordination weakened. Random forest classifiers trained on pose kinematics outperformed task performance metrics (85% vs. 55% accuracy) but generalised poorly across participants (43% vs. 33% chance). Participant-specific models reached 50% accuracy with minimal calibration (2 minutes per condition), improving continuously to 73% without plateau. Facial movement dynamics sensitively track workload with brief calibration, enabling adaptive interfaces using commodity cameras, though individual differences limit cross-participant generalisation.

[HC-5] DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies

【速读】：该论文旨在解决3D角色动画制作中对专业软件或昂贵动捕系统依赖的问题，从而降低非专业人士的使用门槛。其核心解决方案是提出一个轻量级、基于视觉的系统DancingBox，通过将动捕过程重构为数字木偶戏（digital puppetry），利用单个网络摄像头捕捉用户操作日常物品时的粗粒度代理动作（proxy motions），再结合从大规模数据集中学习到的人体运动先验知识，对这些代理动作进行条件化生成，最终输出逼真的角色动画。关键创新在于通过合成训练样本（将现有动捕序列转换为代理表示）来克服缺乏配对代理-动画数据的难题，并实现了对多样化代理物体（如毛绒玩具、香蕉等）的直观且富有创造力的动画控制。

链接: https://arxiv.org/abs/2603.17704
作者: Haocheng Yuan,Adrien Bousseau,Hao Pan,Lei Zhong,Changjian Li
机构: University of Edinburgh(爱丁堡大学); Inria, Université Côte d’Azur(法国国家信息与自动化研究院，蔚蓝海岸大学); Tsinghua University(清华大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted to CHI2026

点击查看摘要

Abstract:Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy-animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.

[HC-6] Whos Sense is This? Possibility for Impacting Human Insights in AI-assisted Sensemaking

【速读】：该论文试图解决的问题是：在AI辅助的群体协同认知构建（collaborative sensemaking）过程中，由于AI可能在用户理解尚处于模糊阶段时过早提供洞察（insight），导致用户盲目采纳这些见解而缺乏充分验证，从而影响最终决策质量。解决方案的关键在于提出三个值得深入探讨的研究问题，以引导实践者审慎评估AI介入时机与方式，并分析用户为何倾向于接受AI提供的见解，从而优化AI辅助认知构建的机制设计，避免“提前定论”对群体思维过程的干扰。

链接: https://arxiv.org/abs/2603.17643
作者: Zhuoyi Cheng,Steven Houben
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted by CHI 26 Workshop on Sensemaking and AI 2026: Uses, Behaviors, Design, and Recommendations

点击查看摘要

Abstract:Sensemaking is an important preceding step for activities like consensus building and decision-making. When groups of people make sense of large amounts of information, their understanding gradually evolves from vague to clear. During this process when reaching a conclusion is still premature, if people are presented with others’ insights, they may be directed to focus on that specific perspective without adequate verification. We argue that similar phenomena may also exist in AI-assisted sensemaking, in which AI will usually be the one that presents insight prematurely when users’ understandings are still vague and ill-formed. In this paper, we raised three questions that are worth deliberation before exploiting AI to assist in collaborative sensemaking in practice, and discussed possible reasons that may lead users to opt for insights from AI.

[HC-7] Large Language Models as a Semantic Interface and Ethical Mediator in Neuro-Digital Ecosystems: Conceptual Foundations and a Regulatory Imperative

【速读】：该论文旨在解决当前人机交互中因大型语言模型（Large Language Models, LLMs）作为神经数据与社会应用之间语义接口所引发的伦理困境与治理空白问题，特别是其在增强人类能力的同时对心理自主权和神经权利构成的新威胁。解决方案的关键在于提出一种“第二层神经伦理学”（second-order neuroethics）的基础性治理框架，该框架以语义透明度（Semantic Transparency）、心理知情同意（Mental Informed Consent）和能动性保护（Agency Preservation）为核心原则，并辅以针对NLI场景的伦理沙盒、偏见感知的LLM认证机制以及对神经语言推断的法律承认等实践工具，从而系统性应对LLMs在神经语义翻译过程中对个体认知完整性与公平性的潜在侵蚀。

链接: https://arxiv.org/abs/2603.17444
作者: Alexander V. Shenderuk-Zhidkov,Alexander E. Hramov
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 21 pages, 1 figure

点击查看摘要

Abstract:This article introduces and substantiates the concept of Neuro-Linguistic Integration (NLI), a novel paradigm for human-technology interaction where Large Language Models (LLMs) act as a key semantic interface between raw neural data and their social application. We analyse the dual nature of LLMs in this role: as tools that augment human capabilities in communication, medicine, and education, and as sources of unprecedented ethical risks to mental autonomy and neurorights. By synthesizing insights from AI ethics, neuroethics, and the philosophy of technology, the article critiques the inherent limitations of LLMs as semantic mediators, highlighting core challenges such as the erosion of agency in translation, threats to mental integrity through precision semantic suggestion, and the emergence of a new neuro-linguistic divide' as a form of biosemantic inequality. Moving beyond a critique of existing regulatory models (e.g., GDPR, EU AI Act), which fail to address the dynamic, meaning-making processes of NLI, we propose a foundational framework for proactive governance. This framework is built on the principles of Semantic Transparency, Mental Informed Consent, and Agency Preservation, supported by practical tools such as NLI-specific ethics sandboxes, bias-aware certification of LLMs, and legal recognition of the neuro-linguistic inference. The article argues for the development of a second-order neuroethics,’ focused not merely on neural data protection but on the ethics of AI-mediated semantic interpretation itself, thereby providing a crucial conceptual basis for steering the responsible development of neuro-digital ecosystems.

[HC-8] Scale-Aware Navigation of Astronomical Survey Imagery Data on High Resolution Immersive Displays

【速读】：该论文旨在解决天文观测影像在多尺度空间范围内分析时面临的挑战，即如何在保持全局结构与局部细节之间流畅切换，以支持科学家在极端规模的科学图像中进行探索性分析。传统桌面工作流依赖离散视图或静态裁剪区域，导致上下文信息碎片化，难以实现高效认知整合。其解决方案的关键在于提出一种面向设计的尺度感知导航框架（scale-aware navigation framework），通过在高分辨率沉浸式显示环境中部署天文影像数据（如Vera C. Rubin天文台和银河系巡天图像），结合房间尺度的拼接高分辨率显示屏与曲面沉浸系统，实现对多尺度图像内容的无缝交互与情境连续性维持，从而为极端规模科学图像的探索提供新的沉浸式交互范式。

链接: https://arxiv.org/abs/2603.17337
作者: Ava Nederlander,Zainab Aamir,Arie E. Kaufman(Stony Brook University)
机构: Stony Brook University (石溪大学)
类目: Human-Computer Interaction (cs.HC); Instrumentation and Methods for Astrophysics (astro-ph.IM); Graphics (cs.GR)
备注: 4 pages, 2 figures, to appear in IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (IEEE VRW 2026)

点击查看摘要

Abstract:Upcoming astronomical surveys produce imagery that spans many orders of magnitude in spatial scale, requiring scientists to reason fluidly between global structure and local detail. Data from the Vera C. Rubin Observatory exemplifies this challenge, as traditional desktop-based workflows often rely on discrete views or static cutouts that fragment context during exploration. This paper presents a design-oriented framework for scale-aware navigation of astronomical survey imagery in high-resolution immersive display environments. We illustrate these principles through representative usage scenarios using Vera Rubin Observatory and Milky Way survey imagery deployed in room-scale immersive environments, including tiled high-resolution displays and curved immersive systems. Our goal is to contribute design insights that inform the development of immersive interaction paradigms for exploratory analysis of extreme-scale scientific imagery.

[HC-9] “Not Just Me and My To-Do List”: Understanding Challenges of Task Management for Adults with ADHD and the Need for AI-Augmented Social Scaffolds

【速读】：该论文试图解决的问题是：现有生产力工具多基于神经典型（neurotypical）用户的假设，即默认其具备稳定的自我调节能力和线性时间感知，而忽视了注意力缺陷多动障碍（ADHD）成人用户在任务管理中因情绪与关系性错位所导致的困难。解决方案的关键在于揭示ADHD人群的任务管理实践本质上是“分布式且情感化支撑”的社会关系建构过程，并提出通过具有社会感知能力的生成式AI（Generative AI）系统支持协同调节（co-regulation）和非线性注意力节奏（nonlinear attention rhythms），从而设计出更契合ADHD用户需求的干预策略与交互范式。

链接: https://arxiv.org/abs/2603.17258
作者: Jingruo Chen,Yibo Meng,Kexin Nie
机构: Cornell University (康奈尔大学); Tsinghua University (清华大学); The University of Sydney (悉尼大学)
类目: Human-Computer Interaction (cs.HC)
备注: This preprint is accepted to CSCW2026

点击查看摘要

Abstract:Adults with ADHD often face challenges with task management, not due to a lack of willpower, but because of emotional and relational misalignments between cognitive needs and normative infrastructures. Existing productivity tools, designed for neurotypical users, often assume consistent self-regulation and linear time, overlooking these differences. We conducted 22 semi-structured interviews with ADHD-identifying adults, exploring their challenges in task management and their coping mechanisms through socially and emotionally scaffolded strategies. Building on these insights, we conducted a follow-up speed dating study with 20 additional ADHD-identifying adults, focusing on 13 speculative design concepts that leverage AI for task support. Our findings reveal that task management among adults with ADHD is relationally and affectively co-constructed, rather than an isolated individual act. Overall, we provide (1) empirical insights into distributed and emotionally scaffolded task management practices, (2) design implications for socially-aware AI systems that support co-regulation and nonlinear attention rhythms, and (3)an analysis of user preferences for different AI design concepts, clarifying which features were most valued and why.

[HC-10] Actionable Guidance Outperforms Map and Compass Cues in Demanding Immersive VR Wayfinding

【速读】：该论文旨在解决沉浸式虚拟现实（Immersive Virtual Reality, VR）中导航辅助工具在物理移动场景下的有效性问题，尤其是如何优化空间信息向运动决策的转化效率。其解决方案的关键在于：相较于提供丰富但需额外认知解析的空间表示（如小地图、指南针），直接将空间信息转化为可立即执行的动作提示（如方向箭头）能显著提升导航性能，这表明在高要求的沉浸式移动任务中，减少用户对空间信息的认知转换负担是设计高效扩展现实（Extended Reality, XR）导航界面的核心原则。

链接: https://arxiv.org/abs/2603.17238
作者: Apurv Varshney,Lily M. Turkstra,Jiaxin Su,Mable Zhou,Scott T. Grafton,Barry Giesbrecht,Mary Hegarty,Michael Beyeler
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Human-Computer Interaction (cs.HC)
备注: AV and LMT contributed equally to this work

点击查看摘要

Abstract:Navigation aids are central to immersive virtual reality (VR) experiences that involve physical locomotion. Their effectiveness depends not only on how much spatial information they provide, but also on how directly that information supports movement decisions. We compared three common guidance techniques for immersive VR wayfinding: a directional arrow, a minimap, and a compass. In a controlled room-scale VR study with 42 participants completing 1008 trials, participants navigated to target landmarks in a time-pressured maze with reduced visibility and forced route replanning. Across behavioral and eye-tracking measures, arrow guidance produced the strongest navigation performance, minimap guidance yielded intermediate performance, and compass cues performed worst, suggesting that during immersive locomotion users benefit from guidance that can be interpreted rapidly while moving. These results suggest that in demanding immersive locomotion tasks, interfaces that translate spatial information directly into actionable movement cues can outperform richer but more interpretive spatial representations. Our findings highlight the importance of designing XR navigation interfaces that minimize the cognitive translation between spatial information and movement decisions.

[HC-11] Collecting Prosody in the Wild: A Content-Controlled Privacy-First Smartphone Protocol and Empirical Evaluation INTERSPEECH2026

【速读】：该论文旨在解决日常言语数据收集中因语调（prosody）与语义混杂、隐私限制及参与者配合度低而导致的挑战。其解决方案的关键在于提出并实证评估了一种内容可控、以隐私优先的智能手机协议，该协议通过脚本化朗读句子标准化词汇内容（包括提示情感效价），同时捕捉自然的语调表达变化；此外，该协议在设备端完成语调特征提取，立即删除原始音频，并仅传输提取后的特征用于分析，从而在保障隐私的同时提升数据质量与可分析性。

链接: https://arxiv.org/abs/2603.17061
作者: Timo K. Koch,Florian Bemmann,Ramona Schoedel,Markus Buehner,Clemens Stachl
机构: University of St. Gallen (圣加仑大学); LMU Munich (慕尼黑路德维希马克西米利安大学); University of Mannheim (曼海姆大学); Charlotte Fresenius Hochschule (夏洛特弗雷森ius大学)
类目: Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Collecting everyday speech data for prosodic analysis is challenging due to the confounding of prosody and semantics, privacy constraints, and participant compliance. We introduce and empirically evaluate a content-controlled, privacy-first smartphone protocol that uses scripted read-aloud sentences to standardize lexical content (including prompt valence) while capturing natural variation in prosodic delivery. The protocol performs on-device prosodic feature extraction, deletes raw audio immediately, and transmits only derived features for analysis. We deployed the protocol in a large study (N = 560; 9,877 recordings), evaluated compliance and data quality, and conducted diagnostic prediction tasks on the extracted features, predicting speaker sex and concurrently reported momentary affective states (valence, arousal). We discuss implications and directions for advancing and deploying the protocol.

[HC-12] he State of Generative AI in Software Development: Insights from Literature and a Developer Survey

【速读】：该论文旨在解决生成式人工智能（Generative AI）在软件工程实践中应用研究碎片化的问题，即现有文献多聚焦于软件开发生命周期（Software Development Lifecycle, SDLC）中的单一任务，缺乏对GenAI整体影响的系统性整合与实证分析。其解决方案的关键在于通过系统性文献综述与针对65名软件开发者的问卷调查相结合的方法，量化了GenAI在不同SDLC阶段的实际效用，并揭示其对开发效率、工作模式及组织治理的影响：结果显示GenAI在设计、实现、测试和文档等阶段效果显著，可将重复性任务耗时减少至少50%，且多数开发者每日使用基于浏览器的大语言模型；同时指出早期阶段如需求分析效益较低，强调需建立完善治理机制以应对盲目采纳、技能退化和技术债等风险，从而推动GenAI从辅助编码向提升架构质量与决策能力的价值转移。

链接: https://arxiv.org/abs/2603.16975
作者: Vincent Gurgul,Robin Gubela,Stefan Lessmann
机构: Humboldt-Universität zu Berlin (柏林洪堡大学); Hochschule für Technik und Wirtschaft Berlin (柏林应用技术大学); Bucharest University of Economic Studies (布加勒斯特经济大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) rapidly transforms software engineering, yet existing research remains fragmented across individual tasks in the Software Development Lifecycle. This study integrates a systematic literature review with a survey of 65 software developers. The results show that GenAI exerts its highest impact in design, implementation, testing, and documentation, where over 70 % of developers report at least halving the time for boilerplate and documentation tasks. 79 % of survey respondents use GenAI daily, preferring browser-based Large Language Models over alternatives integrated directly in their development environment. Governance is maturing, with two-thirds of organizations maintaining formal or informal guidelines. In contrast, early SDLC phases such as planning and requirements analysis show markedly lower reported benefits. In a nutshell, GenAI shifts value creation from routine coding toward specification quality, architectural reasoning, and oversight, while risks such as uncritical adoption, skill erosion, and technical debt require robust governance and human-in-the-loop mechanisms.

[HC-13] VisceroHaptics: Investigating the Effects of Gut-based Audio-Haptic Feedback on Gastric Feelings and Gastric Interoceptive Behavior

【速读】：该论文试图解决的问题是：胃内感知（gastric interoception）是否能够在不侵入性条件下被调节，从而影响进食行为和情绪，为医疗健康及人机交互应用提供新路径。解决方案的关键在于利用基于腹腔声音驱动的视听触觉反馈（audio-haptic feedback），通过作用于腹部皮肤模拟胃部感觉，实验结果首次证明该方法可显著改变用户的饥饿感、饱腹感以及胃内感知行为（如水负荷测试II中摄入水量增加），从而验证了非侵入式干预胃内感知的有效性。

链接: https://arxiv.org/abs/2603.16919
作者: Mia Huong Nguyen,Moritz Alexander Messerschmidt,Jochen Huber,Suranga Nanayakkara
机构: Augmented Human Lab, School of Computing, National University of Singapore (新加坡国立大学计算机学院); Furtwangen University (富特旺根应用技术大学)
类目: Human-Computer Interaction (cs.HC)
备注: CHI’26 Honourable Mention Award

点击查看摘要

Abstract:Gastric interoception influences eating behavior and emotions, making its modulation valuable for healthcare and human-computer-interaction applications. However, whether gastric interoception can be modulated noninvasively in humans remains unclear. While previous research indicates that abdominal-sound-driven haptic feedback resembles gut sensations, its impact on feelings and gastric interoceptive behavior is unknown. We conducted three experiments totalling 55 participants to investigate how gut-sound-driven audio-haptic feedback applied to the stomach (1) affects user’s feelings (2) influences perception of hunger and satiety levels and (3) influences gastric interoceptive behavior, quantified with Water Load Test-II. Results revealed that audio-haptic feedback patterns (a) induced the feelings of hunger, fullness, thirst, stomach upset, (b) increased hunger level, and © significantly increased volumes of ingested water. This work provides the first evidence showing that audio-haptic stimulation can alter gastric interoceptive behavior, motivating the use of noninvasive methods to influence users’ feelings and behaviors in future applications.

[HC-14] Privacy and Safety Experiences and Concerns of U.S. Women Using Generative AI for Seeking Sexual and Reproductive Health Information

【速读】：该论文旨在解决生成式 AI (Generative AI) 聊天机器人在性与生殖健康 (Sexual and Reproductive Health, SRH) 信息获取中的隐私与安全风险问题，尤其是在美国罗伊诉韦德案被推翻后，用户对在线SRH信息依赖加剧的背景下。研究发现，尽管用户普遍认可GenAI的实用性、易用性和可及性，并愿意披露敏感个人健康信息，但其数据收集、政府监控、模型训练和数据商品化等隐私风险引发显著担忧，尤其在涉及堕胎相关查询时安全顾虑更为突出；然而多数用户缺乏有效的保护策略。解决方案的关键在于通过设计层面引入健康专用功能（如隐私保护模式）与政策层面强化内容审核机制，从而系统性提升GenAI支持下的SRH信息服务的隐私保障与用户安全性。

链接: https://arxiv.org/abs/2603.16918
作者: Ina Kaleva,Xiao Zhan,Ruba Abu-Salma,Jose Such
机构: King’s College London (伦敦国王学院); Universitat Politècnica de València (瓦伦西亚理工大学); University of Cambridge (剑桥大学); INGENIO (CSIC-Universitat Politècnica de València) (CSIC-瓦伦西亚理工大学英格尼奥研究中心)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 21 pages, 2 tables, CHI conference on Human Factors in Computing Systems

点击查看摘要

Abstract:The rapid adoption of generative AI (GenAI) chatbots has reshaped access to sexual and reproductive health (SRH) information, particularly following the overturning of Roe v. Wade, as individuals assigned female at birth increasingly turn to online sources. However, existing research remains largely model-centered, paying limited attention to user privacy and safety. We conducted semi-structured interviews with 18 U.S.-based participants from both restrictive and non-restrictive states who had used GenAI chatbots to seek SRH information. Adoption was influenced by perceived utility, usability, credibility, accessibility, and anthropomorphism, and many participants disclosed sensitive personal SRH details. Participants identified multiple privacy risks, including excessive data collection, government surveillance, profiling, model training, and data commodification. While most participants accepted these risks in exchange for perceived utility, abortion-related queries elicited heightened safety concerns. Few participants employed protective strategies beyond minimizing disclosures or deleting data. Based on these findings, we offer design and policy recommendations, such as health-specific features and stronger moderation practices, to enhance privacy and safety in GenAI-supported SRH information seeking.

[HC-15] Attention Guidance through Video Script: A Case Study of Object Focusing on 360° VR Video Tours

【速读】：该论文旨在解决360°虚拟现实（VR）视频中缺乏有效注意力引导机制的问题，即用户在沉浸式球面环境中难以聚焦于特定目标对象。解决方案的关键在于融合Grounding Dino与Segment Anything (SAM)模型，通过视频脚本实现基于对象的注意力引导，从而提升用户在360° VR视频导览中的体验质量。

链接: https://arxiv.org/abs/2603.16875
作者: Paulo Vitor Santana Silva,Arthur Ricardo Sousa Vitória,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
机构: Federal University of Goiás(戈亚斯联邦大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Within the expansive domain of virtual reality (VR), 360° VR videos immerse viewers in a spherical environment, allowing them to explore and interact with the virtual world from all angles. While this video representation offers unparalleled levels of immersion, it often lacks effective methods to guide viewers’ attention toward specific elements within the virtual environment. This paper combines the models Grounding Dino and Segment Anything (SAM) to guide attention by object focusing based on video scripts. As a case study, this work conducts the experiments on a 360° video tour on the University of Reading. The experiment results show that video scripts can improve the user experience in 360° VR Videos Tour by helping in the task of directing the user’s attention.

[HC-16] Disclosure By Design: Identity Transparency as a Behavioural Property of Conversational AI Models

【速读】：该论文旨在解决当前对话式人工智能（Conversational AI）系统在人机交互中身份透明性不足的问题，即用户难以判断其交互对象是人类还是AI，从而可能导致敏感信息泄露、不当信任或欺诈风险。解决方案的关键在于提出“披露即设计”（disclosure by design）策略：当用户直接询问时，AI系统应主动明确声明自身为人工身份，且该披露行为作为模型行为嵌入，不依赖于用户界面或基础设施支持，能够在多种场景（如角色扮演、对抗性提示）下保持稳定，并保障用户自主验证身份的能力，同时不影响沉浸式使用体验。

链接: https://arxiv.org/abs/2603.16874
作者: Anna Gausen,Sarenne Wallbridge,Hannah Rose Kirk,Jennifer Williams,Christopher Summerfield
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 25 pages, 5 figures

点击查看摘要

Abstract:As conversational AI systems become more realistic and widely deployed, users are increasingly uncertain about whether they are interacting with a human or an AI system. When AI identity is unclear, users may unwittingly share sensitive information, place unwarranted trust in AI-generated advice, or fall victim to AI-enabled fraud. More broadly, a persistent lack of transparency can erode trust in mediated communication. While regulations like the EU AI Act and California’s BOT Act require AI systems to identify themselves, they provide limited guidance on reliable disclosure in real-time conversation. Existing transparency mechanisms also leave gaps: interface indicators can be omitted by deployers, and provenance tools require coordinated infrastructure and cannot provide reliable real-time verification. We ask how conversational AI systems should maintain identity transparency as human-AI interactions become more ambiguous and diverse. We advocate for disclosure by design, where AI systems explicitly disclose their artificial identity when directly asked. Implemented as model behaviour, disclosure can persist across deployment contexts without relying on user interfaces, while preserving user agency to verify identity on demand without disrupting immersive uses like role-playing. To assess current practice, we present the first multi-modal (text and voice) evaluation of disclosure behaviour in deployed systems across baseline, role-playing, and adversarial settings. We find that baseline disclosure rates are often high but drop substantially in role-play and can be suppressed under adversarial prompting. Importantly, disclosure rates vary significantly across providers and modalities, highlighting the fragility of current disclosure behaviour. We conclude with technical interventions to help developers embed disclosure as a fundamental property of conversational AI models. Comments: 25 pages, 5 figures Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) ACMclasses: I.2.7; H.5.1 Cite as: arXiv:2603.16874 [cs.HC] (or arXiv:2603.16874v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2603.16874 Focus to learn more arXiv-issued DOI via DataCite

[HC-17] he Truth the Whole Truth and Nothing but the Truth: Automatic Visualization Evaluation from Reconstruction Quality

【速读】：该论文旨在解决生成式 AI (Generative AI) 在自动创建可视化图表时存在的质量问题，尤其是传统单次生成方法产出的可视化结果往往不够准确或清晰，需依赖人工干预进行优化。由于人工评估成本高且难以扩展，研究者提出了一种无需大量人工标注数据集的自动化评价指标。其核心创新在于利用原始数据作为隐式真实标签，通过衡量从可视化图像中重建原始数据的准确性来评估可视化质量，从而实现对AI驱动可视化流程的高效、可靠的质量监控与优化。

链接: https://arxiv.org/abs/2603.16873
作者: Roxana Bujack,Li-Ta Lo,Ethan Stam,Ayan Biswas,David Rogers
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in AI enable the automatic generation of visualizations directly from textual prompts using agentic workflows. However, visualizations produced via one-shot generative methods often suffer from insufficient quality, typically requiring a human in the loop to refine the outputs. Human evaluation, though effective, is costly and impractical at scale. To alleviate this problem, we propose an automated metric that evaluates visualization quality without relying on extensive human-labeled datasets. Instead, our approach uses the original underlying data as implicit ground truth. Specifically, we introduce a method that measures visualization quality by assessing the reconstruction accuracy of the original data from the visualization itself. This reconstruction-based metric provides an autonomous and scalable proxy for thorough human evaluation, facilitating more efficient and reliable AI-driven visualization workflows.

[HC-18] Social physics in the age of artificial intelligence

【速读】：该论文试图解决的问题是：随着人工智能（AI）系统在社会生活中日益自主化和嵌入化，传统仅基于人类行为的社交动力学模型已无法充分描述人机共存的混合社会（hybrid human-AI societies）的集体演化机制。其解决方案的关键在于引入社会物理学（social physics）的研究范式，结合进化博弈论、文化演化理论与大型语言模型（LLMs）驱动的模拟方法，构建一套系统性的研究框架，聚焦于人与机器之间的协同演化关系，涵盖社会行为、机器文化、语言与决策交互、责任分配、认知路径差异及AI发展与规制的动态互动等六个核心方向，从而实现对先进AI社会影响的前瞻性预测与引导性干预。

链接: https://arxiv.org/abs/2603.16900
作者: TheAnh Han,Joel Z. Leibo,Tom Lenaerts,Iyad Rahwan,Fernando Santos,Matjaž Perc,Valerio Capraro
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems are rapidly becoming more capable, autonomous, and deeply embedded in social life. As humans increasingly interact, cooperate, and compete with AI, we move from purely human societies to hybrid human-AI societies whose collective dynamics cannot be captured by existing behavioural models alone. Drawing on evolutionary game theory, cultural evolution, and Large Language Models (LLMs) powered simulations, we argue that these developments open a new research agenda for social physics centred on the co-evolution of humans and machines. We outline six key research directions. First, modelling the evolutionary dynamics of social behaviours (e.g. cooperation, fairness, trust) in hybrid human-AI populations. Second, understanding machine culture: how AI systems generate, mediate, and select cultural traits. Third, analysing the co-evolution of language and behaviour when LLMs frame and participate in decisions. Fourth, studying the evolution of AI delegation: how responsibilities and control are negotiated between humans and machines. Fifth, formalising and comparing the distinct epistemic pipelines that generate human and AI behaviour. Sixth, modelling the co-evolution of AI development and regulation in a strategic ecosystem of firms, users, and institutions. Together, these directions define a programme for using social physics to anticipate and steer the societal impact of advanced AI.

[HC-19] EEG-Based Brain-LLM Interface for Human Preference Aligned Generation

【速读】：该论文旨在解决传统基于自然语言的大型语言模型（Large Language Models, LLMs）交互方式对存在言语或运动障碍用户（如肌萎缩侧索硬化症，ALS）不友好这一问题，即假设用户能够稳定输出明确语言输入的局限性。其解决方案的关键在于构建一种脑-LLM接口，利用脑电图（EEG）信号作为替代输入源，通过训练一个分类器从EEG中估计用户满意度，并将该神经反馈引入测试时缩放（Test-Time Scaling, TTS）框架，动态调整图像生成模型的推理过程以适应用户的实时偏好。实验表明，EEG信号可有效预测用户满意度，证明神经活动蕴含实时偏好信息，为实现基于神经反馈的自适应LLM交互提供了初步路径。

链接: https://arxiv.org/abs/2603.16897
作者: Junzi Zhang,Jianing Shen,Weijie Tu,Yi Zhang,Hailin Zhang,Tom Gedeon,Bin Jiang,Yue Yao
机构: 未知
类目: ignal Processing (eess.SP); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 15 pages, 9 figures

点击查看摘要

Abstract:Large language models (LLMs) are becoming an increasingly important component of human–computer interaction, enabling users to coordinate a wide range of intelligent agents through natural language. While language-based interfaces are powerful and flexible, they implicitly assume that users can reliably produce explicit linguistic input, an assumption that may not hold for users with speech or motor impairments, e.g., Amyotrophic Lateral Sclerosis (ALS). In this work, we investigate whether neural signals can be used as an alternative input to LLMs, particularly to support those socially marginalized or underserved users. We build a simple brain-LLM interface, which uses EEG signals to guide image generation models at test time. Specifically, we first train a classifier to estimate user satisfaction from EEG signals. Its predictions are then incorporated into a test-time scaling (TTS) framework that dynamically adapts model inference using neural feedback collected during user evaluation. The experiments show that EEG can predict user satisfaction, suggesting that neural activity carries information on real-time preference inference. These findings provide a first step toward integrating neural feedback into adaptive language-model inference, and hopefully open up new possibilities for future research on adaptive LLM interaction.

[HC-20] A Novel end-to-end Digital Health System Using Deep Learning-based ECG Analysis

【速读】：该论文旨在解决长期动态心电图（ambulatory electrocardiogram, ECG）数据在临床应用中面临的自动化分析与决策支持难题，特别是如何实现高精度、可解释且可集成至常规医疗流程的AI辅助诊断系统。其解决方案的关键在于构建一个云端信息平台AI-HEART，该平台实现了从多日三导联ECG数据输入到信号预处理、波形分割、噪声/质量检测及心律失常分类的端到端深度学习流水线；并通过专家参与式标注与生成式增强技术缓解类别不平衡问题，提升对常见和罕见心律失常的泛化能力，最终在保持高特异性的同时实现具有临床意义的宏观平均性能，并支持可追溯输出、审计友好的存储机制及医生反馈闭环优化，从而推动AI-ECG系统在真实世界中的落地应用。

链接: https://arxiv.org/abs/2603.16891
作者: Artemis Kontou,Natalia Miroshnikova,Costakis Matheou,Sophocles Sophocleous,Nicholas Tsekouras,Kleanthis Malialis,Panayiotis Kolios
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Preprint submitted to the International Journal of Information Management Data Insights (Elsevier). 15 pages, 5 figures

点击查看摘要

Abstract:This study presents AI-HEART, a cloud-based information system for managing and analysing long-duration ambulatory electrocardiogram (ECG) recordings and supporting clinician decision-making. The platform operationalises an end-to-end pipeline that ingests multi-day three-lead ECGs, normalises inputs, performs signal preprocessing, and applies dedicated deep neural networks for wave delineation, noise/quality detection, and beat- and rhythm-level multi-class arrhythmia classification. To address class imbalance and real-world signal variability, model development combines large clinically annotated datasets with expert-in-the-loop curation and generative augmentation for under-represented rhythms. Empirical evaluation on three-lead ambulatory ECG data shows that delineation accuracy is sufficient for automated interval measurement, noise detection reliably flags poor-quality segments, and arrhythmia classification achieves high specificity with clinically useful macro-averaged performance across common and rarer rhythms. Beyond predictive accuracy, AI-HEART provides a scalable deployment approach for integrating AI into routine ECG services, enabling traceable outputs, audit-friendly storage of recordings and derived annotations, and clinician review/editing that captures feedback for controlled model improvement. The findings demonstrate the technical feasibility and operational value of a noise-aware AI-ECG platform as a digital health information system.

[HC-21] DECODE: Dual-Enhanced Conditioned Diffusion for EEG Forecasting

【速读】：该论文旨在解决认知事件中脑电图（EEG）信号预测的难题，现有方法难以同时捕捉神经动态的随机性与行为任务的语义上下文。其解决方案的关键在于提出一种双增强条件扩散模型（DECODE），通过预训练语言模型将自然语言描述作为语义引导条件注入扩散过程，同时利用基于历史信号的Langevin动力学保持时间一致性，从而生成特定事件的神经响应轨迹。此框架实现了亚微伏级预测精度（平均绝对误差=0.626 μV）和校准良好的不确定性估计，首次证明自然语言可有效连接高层认知描述与底层神经活动，为零样本泛化和可解释脑机接口开辟新路径。

链接: https://arxiv.org/abs/2603.16885
作者: Mehran Shabanpour,Sadaf Khademi,Konstantinos N Plataniotis,Arash Mohammadi
机构: 未知
类目: ignal Processing (eess.SP); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Forecasting Electroncephalography (EEG) signals during cognitive events remains a fundamental challenge in neuroscience and Brain-Computer Interfaces (BCIs), as existing methods struggle to capture both the stochastic nature of neural dynamics and the semantic context of behavioral tasks. We present the Dual-Enhanced COnditioned Diffusion (DECODE) for EEG, a novel framework that unifies semantic guidance from natural language descriptions with temporal dynamics from historical signals to generate event-specific neural responses. DECODE leverages pre-trained language models to condition the diffusion process on rich textual descriptions of cognitive events, while maintaining temporal coherence through history-based Langevin dynamics. Evaluated on a real-world driving task dataset with five distinct behaviors, DECODE achieves sub-microvolt prediction accuracy (MAE = 0.626 microvolt) over 75 timestep horizons while maintaining well-calibrated uncertainty estimates. Our framework demonstrates that natural language can effectively bridge high-level cognitive descriptions and low-level neural dynamics, opening new possibilities for zero-shot generalization to novel behaviors and interpretable BCIs. By generating physiologically plausible, event-specific EEG trajectories conditioned on semantic descriptions, DECODE establishes a new paradigm for understanding and predicting context-dependent neural activity.

计算机视觉

[CV-0] Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

【速读】：该论文旨在解决视频类视觉语言模型（Vision-Language Models, VLMs）中因时间冗余导致的计算效率低下问题，尤其针对视频问答（Video QA）任务中的高资源消耗。现有方法要么仅在视觉Transformer（ViT）层面进行token剪枝，无法适配下游多模态任务；要么仅在语言大模型（LLM）中剪枝，依赖复杂的文本条件选择机制且未充分利用空间与时间信息。解决方案的关键在于提出一种名为“时空Token评分”（Spatio-Temporal Token Scoring, STTS）的轻量级模块，它无需文本条件或token合并操作，即可在ViT和LLM两个层级统一剪枝50%的视觉token，并通过辅助损失函数学习时间维度上的评分策略、利用LLM下游梯度学习空间维度评分，结合高效打包算法实现端到端训练兼容性。实验表明，STTS在保持平均性能仅下降0.7%的前提下，使训练与推理效率提升62%，且随着采样帧数增加效率收益进一步增强。

链接: https://arxiv.org/abs/2603.18004
作者: Jianrui Zhang,Yue Yang,Rohun Tripathi,Winson Han,Ranjay Krishna,Christopher Clark,Yong Jae Lee,Sangho Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.

[CV-1] Universal Skeleton Understanding via Differentiable Rendering and MLLM s

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）难以直接处理非视觉结构化数据（如人体骨骼序列）的问题。现有方法要么将骨骼动态压缩为有损特征向量进行文本对齐，要么将运动离散化为token，但后者在不同骨骼格式间泛化能力差。解决方案的关键在于提出SkeletonLLM框架，其核心是DrAction——一个可微分、格式无关的渲染器，能将任意骨骼序列转化为MLLM原生视觉模态的紧凑图像序列；由于整个流程端到端可微，MLLM的梯度可直接引导渲染过程生成任务相关的视觉token，从而实现跨模态理解与推理。

链接: https://arxiv.org/abs/2603.18003
作者: Ziyi Wang,Peiming Li,Xinshun Wang,Yang Tang,Kai-Kuang Ma,Mengyuan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 15 figures

点击查看摘要

Abstract:Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM’s native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer – suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.

[CV-2] EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding AAAI AAAI2026

【速读】：该论文旨在解决布局到图像生成（layout-to-image generation）与图像定位（image grounding）两个任务在联合训练中面临的优化挑战，这些问题导致模型性能受限。其核心问题是：尽管这两个任务具有互补性（如图像定位具备强文本和布局理解能力，而基于布局生成的图像内容多样性高），但直接联合训练难以实现协同增益。解决方案的关键在于提出一种分阶段的渐进式训练策略：首先通过并行多任务预训练（PMTP）阶段建立基础能力；然后通过双重联合优化（DJO）阶段利用任务对偶性顺序融合两任务以实现统一优化；最后通过循环强化学习（Cycle RL）阶段引入一致性约束作为奖励信号，减少对视觉监督的依赖，并借助GRPO策略显著提升模型的统一能力。该方法在多个基准上实现了最先进的性能，验证了双任务协同优化的有效性。

链接: https://arxiv.org/abs/2603.18001
作者: Kai Zou,Hongbo Liu,Dian Zheng,Jianxiong Gao,Zhiwei Zhao,Bin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)

点击查看摘要

Abstract:In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model’s unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.

[CV-3] he Unreason able Effectiveness of Text Embedding Interpolation for Continuous Image Steering

【速读】：该论文旨在解决文本条件生成模型在测试时进行连续且可控图像编辑的问题，传统方法依赖额外训练或人工干预，存在效率低和控制不精细的缺陷。解决方案的关键在于：通过大语言模型自动构建去偏对比提示对，计算生成器文本编码空间中的导向向量（steering vector），并将该向量直接加到输入提示表示上，实现沿目标语义轴的平滑控制；同时引入弹性范围搜索（elastic range search）自动确定有效导向强度区间，避免欠调整（无编辑效果）与过调整（影响其他属性）问题，从而实现连续、可控的编辑效果。由于仅修改文本表示，该方法天然适用于多种模态（如图像和视频生成）。

链接: https://arxiv.org/abs/2603.17998
作者: Yigit Ekin,Yossi Gandelsman
机构: Reve; Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator’s text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.

[CV-4] LoST: Level of Semantics Tokenization for 3D Shapes CVPR2026

【速读】：该论文旨在解决3D形状生成中tokenization（分词）的优化问题，特别是针对自回归（Autoregressive, AR）模型在3D生成任务中的token效率与语义一致性不足的问题。现有最优方法依赖于为渲染和压缩设计的几何层次细节（Level-of-Detail, LoD）结构，这类空间层次结构往往token效率低且缺乏语义连贯性，难以支持高质量的AR建模。论文提出的解决方案是Level-of-Semantics Tokenization (LoST)，其核心在于按语义显著性（semantic salience）对tokens进行排序：早期prefix能解码出具有主干语义的完整、合理形状，后续tokens则逐步细化实例特异的几何与语义细节。为训练LoST，作者进一步提出Relational Inter-Distance Alignment (RIDA)损失函数，通过将3D形状潜在空间的结构关系对齐至语义DINO特征空间，实现跨模态语义一致性约束。实验表明，LoST在重建精度和语义保真度上均显著优于基于LoD的基线方法，并在仅使用0.1%–10% token数的情况下实现高效高质量的AR 3D生成及下游语义检索任务。

链接: https://arxiv.org/abs/2603.17995
作者: Niladri Shekhar Dutt,Zifan Shi,Paul Guerrero,Chun-Hao Paul Huang,Duygu Ceylan,Niloy J. Mitra,Xuelin Chen
机构: University College London (伦敦大学学院); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: CVPR 2026; Project website-- this https URL

点击查看摘要

Abstract:Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

[CV-5] GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

【速读】：该论文旨在解决在3D环境中生成可控的6自由度（6-DOF）物体操作轨迹这一挑战性问题，其核心难点在于实现精确的空间推理、物理可行性以及多模态场景理解。现有方法通常依赖于2D或部分3D表示，难以捕捉完整场景几何信息，从而限制了轨迹精度。解决方案的关键在于提出GMT（Generative Manipulation Trajectory）框架，该框架是一种多模态Transformer模型，通过联合利用3D边界框几何、点云上下文、语义物体类别及目标末端位姿信息，将轨迹表示为连续的6-DOF位姿序列，并采用定制化的条件融合策略整合几何、语义、上下文和目标导向信息，从而显著提升空间准确性和姿态控制能力。

链接: https://arxiv.org/abs/2603.17993
作者: Huajian Zeng,Abhishek Saroha,Daniel Cremers,Xi Wang
机构: TU München (慕尼黑工业大学); MCML (机器感知与学习中心); ETH Zürich (苏黎世联邦理工学院); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accpeted by 3DV 2026. Project Page: this https URL

点击查看摘要

Abstract:Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goaloriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learningbased manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian- this http URL. io/projects/gmt/.

[CV-6] Versatile Editing of Video Content Actions and Dynamics without Training

【速读】：该论文旨在解决现有视频编辑方法在处理复杂动作修改、动态事件插入及对象间交互行为调整时面临的挑战，尤其是训练-free 方法难以支持运动变化和物体间相互作用的问题。其解决方案的关键在于提出一种名为 DynaEdit 的新方法，该方法基于预训练的文本到视频扩散模型（text-to-video diffusion models），采用无需模型内部干预的 inversion-free 技术实现模型无关的编辑能力；同时通过分析并缓解直接应用该技术导致的低频错位与高频抖动问题，引入创新机制以提升编辑的准确性与稳定性，从而实现对视频中动作、物体交互和全局效应的多样化编辑。

链接: https://arxiv.org/abs/2603.17989
作者: Vladimir Kulikov,Roni Paiss,Andrey Voynov,Inbar Mosseri,Tali Dekel,Tomer Michaeli
机构: Google DeepMind(谷歌深度智核)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.

[CV-7] Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

【速读】：该论文旨在解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在3D场景空间推理中面临的两大挑战：一是依赖计算成本高昂的3D表示（如点云或重建的鸟瞰图BEV地图），二是缺乏物理尺度上的定位能力，难以消除规模和尺寸相关的歧义。解决方案的关键在于引入惯性测量单元（Inertial Measurement Units, IMUs）采集的自我运动（egomotion）模态数据，并构建Motion-MLLM框架，其核心创新包括：(1) 一种级联式运动-视觉关键帧过滤模块，利用IMU数据与视觉特征协同筛选稀疏但具有代表性的关键帧；(2) 一种非对称跨模态融合模块，通过运动token作为中介，将自我运动线索和跨帧视觉上下文有效注入视觉表征，从而实现对视觉内容的物理轨迹锚定，提升绝对尺度和空间关系的推理能力。

链接: https://arxiv.org/abs/2603.17980
作者: Shuyao Shi,Kang G. Shin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird’s-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40 \times and 1.63 \times higher cost-effectiveness, respectively).

[CV-8] AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception CVPR2026

【速读】：该论文旨在解决高维雷达数据在自动驾驶系统中传输带宽受限的问题，即原始雷达数据量庞大，难以通过低带宽接口（如NPU）高效传输，而现有图像域压缩方法无法适应动态或对抗性环境。其解决方案的关键在于提出一种基于自适应反馈的雷达数据在线压缩机制：通过零阶梯度近似计算检测置信度对压缩率的代理梯度，并据此执行梯度下降以动态调整压缩比；同时利用离散余弦变换（Discrete Cosine Transform, DCT）对雷达数据立方体进行频域分析，仅保留关键频率分量并结合缩放量化保留每个雷达块的动态范围，从而实现超过100倍的特征尺寸压缩，且性能损失低于1个百分点（~1%p）。

链接: https://arxiv.org/abs/2603.17979
作者: Jinho Park,Se Young Chun,Mingoo Seok
机构: Columbia University (哥伦比亚大学); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations–pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.

[CV-9] AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

【速读】：该论文旨在解决从存在严重遮挡的野生单目视频中重建完整且可动画化的3D高斯化身（3D Gaussian avatars）的问题。现有方法通常假设输入为无遮挡、完全可见的主体，这限制了其在真实场景中的适用性，因为现实世界中人物常被家具、物体或其他人遮挡，导致部分身体区域无法观测，且缺乏多视角监督信号。解决方案的关键在于四个核心贡献：(i) 提出“幻觉作为监督”（hallucination-as-supervision）管道，利用身份微调的扩散模型生成未观测区域的密集监督信号；(ii) 设计两阶段从规范姿态到姿态依赖的架构，从稀疏观测逐步构建完整的姿态依赖高斯映射；(iii) 引入地图-姿态/线性骨骼绑定（LBS）-姿态解耦机制，吸收由生成数据带来的多视角不一致性；(iv) 采用头部/身体分离监督策略以保留面部身份特征。该方法显著提升了在复杂遮挡条件下的重建质量，并实现了高质量的动画与三维场景合成。

链接: https://arxiv.org/abs/2603.17975
作者: Aymen Mir,Riza Alp Guler,Xiangjun Tang,Peter Wonka,Gerard Pons-Moll
机构: University of Tübingen(图宾根大学); Imperial College London(伦敦帝国学院); KAUST(阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our project page is available at this https URL

点击查看摘要

Abstract:We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input-a fully visible subject, often in a canonical pose-excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at this https URL

[CV-10] Robust-ComBat: Mitigating Outlier Effects in Diffusion MRI Data Harmonization

【速读】：该论文旨在解决在弥散磁共振成像（diffusion MRI, dMRI）多中心数据整合中，由于患者群体存在病理异常值（pathological outliers）导致传统协方差校正方法（如ComBat）估计的站点效应发生显著偏差的问题。其关键解决方案是提出一种鲁棒的ComBat变体——Robust-ComBat，通过引入一个简单的多层感知机（MLP）模型对异常值进行补偿，从而在保留疾病相关信号的同时实现更可靠的跨站点数据标准化，实验表明该方法在包含高达80%病理受试者的多中心数据集中均优于传统统计基线。

链接: https://arxiv.org/abs/2603.17968
作者: Yoan David,Pierre-Marc Jodoin,Alzheimer’s Disease Neuroimaging Initiative, TheTRACK-TBI Investigators
机构: VitaLab, Dep of Computer Science, University of Sherbrooke (谢布罗克大学计算机科学系); Imeka Solutions Inc. (Imeka解决方案公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures

点击查看摘要

Abstract:Harmonization methods such as ComBat and its variants are widely used to mitigate diffusion MRI (dMRI) site-specific biases. However, ComBat assumes that subject distributions exhibit a Gaussian profile. In practice, patients with neurological disorders often present diffusion metrics that deviate markedly from those of healthy controls, introducing pathological outliers that distort site-effect estimation. This problem is particularly challenging in clinical practice as most patients undergoing brain imaging have an underlying and yet undiagnosed condition, making it difficult to exclude them from harmonization cohorts, as their scans were precisely prescribed to establish a diagnosis. In this paper, we show that harmonizing data to a normative reference population with ComBat while including pathological cases induces significant distortions. Across 7 neurological conditions, we evaluated 10 outlier rejection methods with 4 ComBat variants over a wide range of scenarios, revealing that many filtering strategies fail in the presence of pathology. In contrast, a simple MLP provides robust outlier compensation enabling reliable harmonization while preserving disease-related signal. Experiments on both control and real multi-site cohorts, comprising up to 80% of subjects with neurological disorders, demonstrate that Robust-ComBat consistently outperforms conventional statistical baselines with lower harmonization error across all ComBat variants.

[CV-11] LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

【速读】：该论文旨在解决现有媒体设计生成方法在层数灵活性和语义一致性方面的局限性：传统方法要么固定输出层数，要么要求每层仅包含空间连续区域，导致层数与设计复杂度呈线性关系，难以生成结构复杂且可编辑的分层设计文档（如海报、Logo等）。其解决方案的关键在于提出LaDe（Layered Media Design）框架，该框架通过三个核心组件实现：（1）基于大语言模型（LLM）的提示扩展器，将简短用户意图转化为结构化的逐层描述以指导生成；（2）采用4D RoPE位置编码机制的潜在扩散Transformer，联合生成完整媒体设计及其RGBA分层表示；（3）支持全Alpha通道解码的RGBA变分自编码器（VAE）。该统一框架不仅支持文本到图像生成，还首次实现了文本到分层设计和设计分解三类任务，并在Crello测试集上显著优于Qwen-Image-Layered，在文本到图层对齐方面获得双VLM评估器（GPT-4o mini和Qwen3-VL）验证。

链接: https://arxiv.org/abs/2603.17965
作者: Vlad-Constantin Lungu-Stan,Ionut Mironica,Mariana-Iuliana Georgescu
机构: Adobe Research(Adobe研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages (main + supp)

点击查看摘要

Abstract:Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).

[CV-12] VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

【速读】：该论文旨在解决视频理解中两个核心挑战：一是表示问题（representation），现有方法依赖有损近似导致信息丢失；二是长上下文处理问题（long-context），基于文本的caption或agent管道将视频压缩为文本，丧失视觉保真度。解决方案的关键在于提出VideoAtlas——一种任务无关的层次化网格结构，可无损、可导航、可扩展地表示视频，且无需预处理或生成描述文本。该结构统一用于视频整体、局部区域及智能体记忆，实现端到端无损视觉表示，并通过递归语言模型（Recursive Language Models, RLMs）与Markov决策过程（MDP）结合，构建出Video-RLM框架，支持并行主从架构下的全局探索与局部深度挖掘，从而在视频长度增长时保持计算复杂度对数级增长，同时实现自适应计算资源分配和高鲁棒性。

链接: https://arxiv.org/abs/2603.17948
作者: Mohamed Eltahir,Ali Habibullah,Yazan Alshoibi,Lama Ayash,Tanveer Hussain,Naeemullah Khan
机构: King Abdullah University of Science and Technology (KAUST); King Khalid University (KKU); Edge Hill University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce \textbfVideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent’s memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to visual domain requires a structured environment to recurse into, which \textbfVideoAtlas provides. \textbfVideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1)~logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid’s structural reuse. (2)~environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter. (3)~emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.

[CV-13] ransText: Transparency Aware Image-to-Video Typography Animation

【速读】：该论文旨在解决图像到视频（image-to-video）模型在层感知文本（glyph）动画生成中的适应性问题，即如何高效、高质量地建模透明度（alpha channel）与RGB外观的联合分布，而无需重新训练预训练生成模型的潜在空间。现有方法通常将透明度编码为附加于RGB空间的额外潜变量维度，这不仅需要重构以RGB为中心的变分自编码器（VAE），且由于高质量透明文本数据稀缺，易导致计算成本高和语义先验退化，甚至引发潜在特征混叠。论文提出的解决方案核心在于引入TransText框架，其关键创新是提出“Alpha-as-RGB”范式——通过潜空间的局部拼接方式将alpha通道嵌入为与RGB兼容的视觉信号，从而在不修改预训练生成流形的前提下，显式保证RGB与Alpha之间的跨模态一致性，并避免特征纠缠，最终实现高保真、多样且精细控制的透明文本动画生成。

链接: https://arxiv.org/abs/2603.17944
作者: Fei Zhang,Zijian Zhou,Bohao Tang,Sen He,Hang Li,Zhe Wang,Soubhik Sanyal,Pengfei Liu,Viktar Atliha,Tao Xiang,Frost Xu,Semih Gunel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, publication review

点击查看摘要

Abstract:We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.

[CV-14] Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

【速读】：该论文旨在解决交通肇事视频证据与法律责任认定之间缺乏自动化映射的问题，即如何将dashcam（行车记录仪）视频中的事故场景有效转化为符合中国交通法规的责任判定。现有研究多集中于视觉感知或文本描述的法律推理，而忽视了视频证据的深度融合，导致法律推理过程不透明且难以解释。解决方案的关键在于提出一个名为C-TRAIL的多模态法律数据集，该数据集在中文交通法规体系下，明确对齐行车记录视频、文本描述与责任模式及其对应的交通法条；并构建了一个两阶段框架：第一阶段通过事故理解模块生成结构化视频描述，第二阶段采用法律多智能体框架输出责任模式、法条集合及完整判决报告，从而实现从视频到法律判定的端到端可解释推理。

链接: https://arxiv.org/abs/2603.17930
作者: Jingchun Yang,Jinchang Zhang
机构: Northeast University (东北大学); SUNY Binghamton University (纽约州立大学宾汉姆顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming “what happened in the video” into “who is responsible under which legal provisions” still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.

[CV-15] A practical artificial intelligence framework for legal age estimation using clavicle computed tomography scans

【速读】：该论文旨在解决法医与司法场景中法律年龄估计（legal age estimation）的准确性、鲁棒性和可重复性问题，尤其针对当前人工智能方法多依赖手部X光或牙科影像而忽视了胸骨柄（clavicle）CT扫描这一有效但未被充分探索的模态。解决方案的关键在于提出一个可解释的多阶段流程：首先采用基于连通域的特征检测方法实现自动胸骨定位，仅需极少人工标注；其次利用集成梯度引导的切片选择策略构建多切片卷积神经网络输入数据，提升模型对关键解剖区域（如胸骨近端骨骺）的关注度；最后引入置信区间校准预测（conformal prediction intervals），以满足国际法医协议对不确定性量化的要求。该方法在1,158例尸体全身CT扫描数据上验证，达到1.55 ± 0.16年的平均绝对误差（MAE），优于人类专家（~1.90年）和现有方法（>1.75年），且具备可根据法医需求调整覆盖率的不确定性支持能力。

链接: https://arxiv.org/abs/2603.17926
作者: Javier Venema,Stefano De Luca,Pablo Mesejo,Óscar Ibáñez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, submitted to Engineering Applications of Artificial Intelligence

点击查看摘要

Abstract:Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 \pm 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years in our same dataset). Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being added as part of the Skeleton-ID software (this https URL), is intended as a decision-support component within multi-factorial forensic workflows.

[CV-16] SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

【速读】：该论文旨在解决无人机（Unmanned Aerial Vehicles, UAVs）场景下语义分割任务中RGB和RGB-T（RGB-热红外）数据集在规模、多样性及标注效率方面的局限性问题，其核心挑战在于人工标注成本高以及低成本硬件平台上难以实现精确的RGB-T图像对齐。解决方案的关键在于提出一种可扩展的几何驱动型2D-3D-2D范式：首先将不足3%的RGB图像提升至语义3D点云空间，再通过多视角冗余信息将标签自动传播至所有视角，从而在无需人工干预的情况下生成密集伪真值标签（97% RGB标签与100%热红外标签），同时保持高标注准确率（91%和88%）。该方法进一步延伸至跨模态图像配准，利用3D几何作为中间对齐空间，实现全自动且像素级精度的RGB-T对齐（87%准确率），无需硬件同步。

链接: https://arxiv.org/abs/2603.17920
作者: Markus Gross,Sai Bharadhwaj Matha,Rui Song,Viswanathan Muthuveerappan,Conrad Christoph,Julius Huber,Daniel Cremers
机构: 1. ETH Zurich (瑞士联邦理工学院); 2. Stanford University (斯坦福大学); 3. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); 4. University of California, Berkeley (加州大学伯克利分校); 5. UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at this https URL.

[CV-17] Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference

【速读】：该论文旨在解决边缘人工智能（Edge-AI）中协同推理场景下，因资源受限终端设备将部分处理数据卸载至边缘服务器进行完整分类时，可能遭受恶意数据注入攻击导致隐蔽误分类的问题，尤其在环境噪声干扰下此类攻击更难被检测。解决方案的关键在于提出一种半灰盒（semi-gray-box）且具备噪声感知能力的异常检测框架，该框架基于变分自编码器（Variational Autoencoder, VAE）构建，并引入一种鲁棒的噪声感知特征，以刻画环境噪声的典型行为，从而提升检测准确性并降低误报率。实验表明，该方法在多种主流目标分类深度神经网络（Deep Neural Networks, DNNs）上均表现出高鲁棒性（最高达90% AUROC），但在特征相似性较高或噪声水平过高的情况下仍存在局限。

链接: https://arxiv.org/abs/2603.17914
作者: Shima Yousefi,Saptarshi Debroy
机构: City University of New York (纽约市立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted for publication in IEEE/ACM CCGrid 2026

点击查看摘要

Abstract:Collaborative inference of object classification Deep neural Networks (DNNs) where resource-constrained end-devices offload partially processed data to remote edge servers to complete end-to-end processing, is becoming a key enabler of edge-AI. However, such edge-offloading is vulnerable to malicious data injections leading to stealthy misclassifications that are tricky to detect, especially in the presence of environmental noise. In this paper, we propose a semi-gray-box and noise- aware anomaly detection framework fueled by a variational autoencoder (VAE) to capture deviations caused by adversarial manipulation. The proposed framework incorporates a robust noise-aware feature that captures the characteristic behavior of environmental noise to improve detection accuracy while reducing false alarm rates. Our evaluation with popular object classification DNNs demonstrate the robustness of the proposed detection (up to 90% AUROC across DNN configurations) under realistic noisy conditions while revealing limitations caused by feature similarity and elevated noise levels.

[CV-18] SpiderCam: Low-Power Snapshot Depth from Differential Defocus CVPR

【速读】：该论文旨在解决传统被动式三维（3D）成像系统在功耗、实时性和硬件资源受限条件下难以实现高效深度计算的问题。针对这一挑战，作者提出了一种基于现场可编程门阵列（FPGA）的快照式深度感知相机SpiderCam，其核心创新在于：通过算法优化改进了深度差分模糊（Depth from Differential Defocus, DfDD）方法，以适应低功耗传感器带来的噪声与动态范围限制；同时设计了内存局部化的流水线处理架构，在仅能存储单帧图像子集的极小FPGA资源下实现了图像对的流式深度计算。最终，该系统可在52 cm工作范围内以32.5 FPS的速度生成480×400稀疏深度图，总功耗仅为624 mW，首次实现了亚瓦级功耗的被动式FPGA 3D相机。

链接: https://arxiv.org/abs/2603.17910
作者: Marcos A. Ferreira,Tianao Li,John Mamish,Josiah Hester,Yaman Sangar,Qi Guo,Emma Alexander
机构: Northwestern University (西北大学); Georgia Institute of Technology (佐治亚理工学院); Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

点击查看摘要

Abstract:We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 624 mW of power in total. SpiderCam comprises a custom camera that simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.

[CV-19] A Creative Agent is Worth a 64-Token Template

【速读】：该论文旨在解决当前文本到图像（Text-to-Image, T2I）生成模型在面对模糊提示（fuzzy prompts）时创造力受限的问题，即模型难以从如“一个受黑胶唱片启发的创意摩天大楼”这类非结构化提示中自动推断出深层创意意图，导致创意生成仍依赖人工设计提示，且现有基于推理或智能体（agent）的方法因实例特定的迭代优化机制而计算成本高昂、不可复用。解决方案的关键在于提出一种名为CAT（Creative Agent Tokenization）的框架，其核心创新是引入一个创意分词器（Creative Tokenizer），该分词器通过创意语义解耦（creative semantic disentanglement）训练，能够从模糊提示的嵌入中学习并提取通用的创意表征，并生成可复用的“创意标记模板”（creative token template），该模板可直接与原始提示拼接后输入T2I模型，从而无需重复推理即可注入创造性语义，显著提升生成效率与质量。

链接: https://arxiv.org/abs/2603.17895
作者: Ruixiao Shi,Fu Feng,Yucheng Xie,Xu Yang,Jing Wang,Xin Geng
机构: Southeast University (东南大学); Ministry of Education (教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as a creative vinyl record-inspired skyscraper'', these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes creativity’’ costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce \textbfCAT, a framework for \textbfCreative \textbfAgent \textbfTokenization that encapsulates agents’ intrinsic understanding of ``creativity’’ through a \textitCreative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent’s latent creative representations. Extensive experiments on \textbf\textitArchitecture Design, \textbf\textitFurniture Design, and \textbf\textitNature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7\times speedup and a 4.8\times reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.

[CV-20] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

【速读】：该论文旨在解决当前生成式AI（Generative AI）在跨模态身份感知内容生成中缺乏细粒度控制能力的问题，特别是如何实现对多个个体面部特征和语音音色的高保真、一致性个性化控制。其解决方案的关键在于提出一个统一且可扩展的身份感知联合音频-视频生成框架，包含三个核心创新：首先，设计了一种自动提取跨模态配对标注的身份信息的数据整理管道，覆盖从单人到多人交互的多样化场景；其次，提出一种灵活可扩展的身份注入机制，将面部外观与声纹特征作为身份控制信号，适用于单人与多人场景；最后，针对模态差异问题，构建多阶段训练策略以加速收敛并增强跨模态一致性。

链接: https://arxiv.org/abs/2603.17889
作者: Yingjie Chen,Shilun Lin,Cai Xing,Qixin Yan,Wenjing Wang,Dingming Liu,Hao Liu,Chen Li,Jing Lyu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: \hrefthis https URLIdentity-as-Presence.

[CV-21] Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification ICPR2026

【速读】：该论文旨在解决视频胶囊内镜（Video Capsule Endoscopy, VCE）中多标签分类任务面临的极端类别不平衡问题，尤其是在Galar数据集上病理发现仅占所有标注帧的不到0.1%的情况下。解决方案的关键在于结合架构级与优化级策略：首先对BiomedCLIP模型进行改进，用差分注意力机制（differential attention mechanism）替代标准多头自注意力，通过计算两个Softmax注意力图的差异来抑制注意力噪声；其次采用平方根频率加权采样器（sqrt-frequency weighted sampler）、非对称焦点损失（asymmetric focal loss）、Mixup正则化以及每类阈值优化等方法缓解标签分布偏斜；最后通过中值滤波和平滑处理确保时序一致性，从而在RARE-VISION测试集上实现较高的时间维度mAP@0.5（0.2456）和mAP@0.95（0.2353），且单GPU推理耗时约8.6分钟。

链接: https://arxiv.org/abs/2603.17879
作者: Podakanti Satyajith Chary,Nagarajan Ganapathy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, ICPR 2026 RARE-VISION Competition

点击查看摘要

Abstract:This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.

[CV-22] Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?

【速读】：该论文旨在解决图像编辑模型中存在的“编辑溢出”（edit spillover）现象，即模型在修改指定区域时，会无意识地改变语义相关但未被指定的区域内容，从而引发对这种现象本质的疑问：是模型具备隐式的世界知识（world knowledge），还是仅因注意力机制的泄漏（attention leakage）所致。解决方案的关键在于提出了一种系统性的分析框架——EditSpilloverProbe，其核心包括：(1) 构建编辑溢出的分类体系（空间、语义、混合、随机）；(2) 开发自动化检测与分类管道；(3) 基于真实中文文本编辑任务构建基准数据集EditSpilloverBench。通过该框架对五种代表性图像编辑模型的系统评估发现，语义溢出比例稳定且不随距离衰减，直接证明其反映的是模型真实的常识理解能力，而非单纯的注意力扩散。

链接: https://arxiv.org/abs/2603.17876
作者: Guandong Li,Zhaobin Chu
机构: iFLYTEK(科大讯飞)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-following image editing models are expected to modify only the specified region while keeping the rest of the image unchanged. However, in practice, we observe a pervasive phenomenon – edit spillover: models alter semantically related but unspecified content outside the edit region. This raises a fundamental question – does spillover reflect genuine implicit world understanding, or is it merely attention leakage? We propose EditSpilloverProbe, a systematic framework that repurposes edit spillover as a natural probe for world knowledge in image editing models. We introduce a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and a benchmark dataset constructed from real-world Chinese text editing tasks, EditSpilloverBench. Systematic evaluation of 5 representative editing models reveals three core findings: (1) spillover rates vary dramatically across architectures, from 3.49% to 11.46%, with a 3.3x ratio; (2) absolute semantic spillover quantity reveals models’ world understanding capability – nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 has the most precise editing control but lower semantic spillover (16.3 per image), revealing a trade-off between editing control and world understanding; (3) spatial decay analysis shows spillover area density decays exponentially with distance, but the proportion of semantically relevant spillover remains constant (40%-58%), providing direct evidence that semantic spillover reflects genuine world understanding rather than spatial diffusion.

[CV-23] VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection

【速读】：该论文旨在解决开放集虹膜活体检测（iris presentation attack detection, PAD）中模型泛化能力不足的问题，尤其是在面对未见过的攻击类型时性能下降显著。其核心解决方案是利用人类感知先验（human perceptual priors）作为监督信号来改进深度学习训练，通过对比手绘标注、眼动热图、分割掩码和DINOv2嵌入等多种形式的人类注意力信息，发现去噪后的眼动热图在Leave-One-Attack-Type-Out评估范式下能显著提升模型性能，具体表现为在BPCER=1%条件下，AUROC和APCER指标均优于交叉熵基准模型，表明基于人类视觉关注点的监督信号有助于增强模型对未知攻击类型的鲁棒性。

链接: https://arxiv.org/abs/2603.17859
作者: Byron Dowling,Eleanor Frederick,Jacob Piland,Adam Czajka
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human perceptual priors have shown promise in saliency-guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open-set iris PAD remains underexplored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings to a state-of-the-art deep learning-based baseline on the task of open-set iris PAD. Results for open-set PAD in a leave-one-attack-type out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in terms of Area Under the ROC curve (AUROC) and Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow-up research efforts.

[CV-24] Revisiting foundation models for cell instance segmentation

【速读】：该论文旨在解决显微图像中细胞分割任务的性能瓶颈问题，特别是如何有效适配基于通用分割基础模型（如SAM系列）到微观图像场景的问题。其关键解决方案在于提出一种名为自动提示生成（Automatic Prompt Generation, APG）的新实例分割策略，该策略可显著提升以μ SAM为基础的模型在细胞、细胞核及类器官分割任务上的表现，并在性能上媲美当前最先进的CellPoseSAM模型。此外，研究还系统评估了多款适用于显微图像的分割基础模型，为未来构建更强大的显微图像专用基础模型提供了重要方法论指导与实践路径。

链接: https://arxiv.org/abs/2603.17845
作者: Anwai Archit,Constantin Pape
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in MIDL 2026

点击查看摘要

Abstract:Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced, virtually all of them are extensions of Segment Anything Model (SAM), improving it for microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general-purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, \mu SAM) and for general-purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM-based microscopy foundation models. APG consistently improves segmentation results for \mu SAM, which is used as the base model, and is competitive with the state-of-the-art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM-style models to microscopy and provides a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at this https URL.

[CV-25] Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass CVPR26

【速读】：该论文旨在解决当前基于指令驱动的3D编辑方法中存在的两大问题：一是缺乏对不同3D编辑任务的统一设计，因显式操作3D几何结构需依赖任务特定规则（如3D外观编辑需保留源几何，而3D移除则需改变源几何）；二是迭代优化过程耗时严重，通常需要数千次2D/3D更新。解决方案的关键在于提出Omni-3DEdit，一个统一的、基于学习的模型，通过隐式方式泛化多种3D编辑任务。其核心技术包括：构建高质量的多视角配对数据集以缓解训练样本稀缺问题，采用SEVA预训练生成模型作为骨干网络并结合源视图潜在表示与条件token序列拼接，以及引入双流LoRA模块解耦不同视角线索，从而显著提升模型表征能力。该方案无需在线优化，在单次前向传播中即可完成各类3D编辑任务，推理时间从数十分钟缩短至约两分钟。

链接: https://arxiv.org/abs/2603.17841
作者: Chen Liyi,Wang Pengfei,Zhang Guowen,Ma Zhiyuan,Zhang Lei
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR26

点击查看摘要

Abstract:Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model’s representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.

[CV-26] Video Understanding: From Geometry and Semantics to Unified Models

【速读】：该论文旨在解决视频理解（video understanding）领域中模型对动态视觉世界感知、推理与交互能力不足的问题，其核心挑战在于如何有效建模时空动态性和不断演化的视觉上下文，从而提升模型的时空推理能力。解决方案的关键在于提出一个结构化的三重视角框架：低级视频几何理解、高级语义理解以及统一的视频理解模型，并强调从孤立的任务特定流水线向可适应多种下游目标的统一建模范式的转变，以此系统性地梳理近年来的研究进展并指明构建鲁棒、可扩展且统一的视频基础模型（video foundation models）的发展方向。

链接: https://arxiv.org/abs/2603.17840
作者: Zhaochong An,Zirui Li,Mingqiao Ye,Feng Qiao,Jiaang Li,Zongwei Wu,Vishal Thengane,Chengzu Li,Lei Li,Luc Van Gool,Guolei Sun,Serge Belongie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A comprehensive survey of video understanding, spanning low-level geometry, high-level semantics, and unified understanding models

点击查看摘要

Abstract:Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.

[CV-27] INA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models CVPR2026

【速读】：该论文旨在解决当前文本到图像扩散模型中概念擦除（concept erasure）方法的局限性问题，即现有技术主要聚焦于切断文本到图像的映射关系（text-to-image mapping），而忽略了被擦除概念在模型内部视觉知识（visual knowledge）中的残留。这种“文本中心主义”的擦除策略并未真正删除相关概念，导致其仍可通过其他路径生成。为验证这一假设，作者提出了一种无文本引导的反演攻击方法——TINA（Text-free INversion Attack），其核心创新在于通过在零文本（null-text）条件下执行DDIM反演（DDIM inversion），绕过现有的文本导向防御机制，并引入优化流程以克服因缺乏文本引导而导致的累积近似误差。实验表明，TINA能够从经过先进去学习（unlearning）处理的模型中成功重建被擦除的概念，证明当前方法仅是“遮蔽”而非“删除”概念，亟需直接作用于内部视觉表示的新范式。

链接: https://arxiv.org/abs/2603.17828
作者: Qianlong Xiang,Miao Zhang,Haoyu Zhang,Kun Wang,Junhui Hou,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen); City University of Hong Kong; Shenzhen Loop Area Institute; Peng Cheng Laboratory; Shandong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, accepted by CVPR 2026

点击查看摘要

Abstract:Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persist. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.

[CV-28] Steering Video Diffusion Transformers with Massive Activations

【速读】：该论文旨在解决视频扩散变换器（Video Diffusion Transformers）中如何利用内部模型信号以最小计算开销提升视频生成质量的问题。其核心挑战在于，尽管视频扩散模型在性能上不断进步，但其内部激活机制（如大规模激活，Massive Activations, MAs）的潜在价值尚未被充分挖掘。解决方案的关键在于发现并利用MAs在时间维度上的结构化分布规律：第一帧和时序块边界处的token具有显著更高的激活幅度，而内部token则相对较低。基于此观察，作者提出无需训练的“结构化激活引导”（Structured Activation Steering, STAS）方法，通过将首帧与边界token的MA值引导至一个缩放后的全局最大参考幅值，从而在不增加额外计算负担的前提下，显著提升不同文本到视频模型的生成质量和时序一致性。

链接: https://arxiv.org/abs/2603.17825
作者: Xianhang Cheng,Yujian Zheng,Zhenyu Xie,Tingting Liao,Hao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.

[CV-29] M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

【速读】：该论文旨在解决当前视频理解中点跟踪（Tracking Any Point, TAP）任务的性能瓶颈问题，即基于静态图像预训练的视觉基础模型（Vision Foundation Models, VFMs）在捕捉视频中密集时序对应关系方面的局限性。其核心解决方案是提出一种弱监督的Mask-to-Point (M2P) 学习框架，关键在于引入三种基于掩码（mask）的新约束：(1) 局部结构一致性损失（local structure consistency loss），利用Procrustes分析建模局部区域内点的协同运动以提升点对点匹配的可靠性；(2) 掩码标签一致性损失（mask label consistency, MLC loss），强制前景点严格对应前景区域，作为正则化项稳定训练并防止陷入平凡解；(3) 掩码边界约束，显式监督边界点的表示学习。通过仅使用3.6K个视频对象分割（VOS）标注数据，M2P显著优于基线VFMs，在TAP-Vid-DAVIS基准上分别实现12.8%和14.6%的性能提升，并可作为通用预训练骨干网络用于测试时优化与离线微调的TAP任务。

链接: https://arxiv.org/abs/2603.17813
作者: Qiangqiang Wu,Tianyu Yang,Bo Fang,Jia Wan,Matias Di Martino,Guillermo Sapiro,Antoni B. Chan
机构: City University of Hong Kong (香港城市大学); Princeton University (普林斯顿大学); Meituan (美团); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, mask boundary constrain is applied to explicitly supervise boundary points. We show that our weaklysupervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating its potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.

[CV-30] ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

【速读】：该论文旨在解决视频扩散模型在像素域训练时面临的高内存开销问题，这一问题源于递归帧处理机制导致的激活值随视频序列长度线性累积，使得长视频或高分辨率视频的微调在像素级损失下计算不可行。解决方案的关键在于提出ChopGrad——一种截断反向传播（truncated backpropagation）方案，通过限制梯度计算仅在局部帧窗口内进行，同时保持全局一致性，从而将训练内存复杂度从与视频帧数成线性关系降低至常数级别，并支持高效微调。

链接: https://arxiv.org/abs/2603.17812
作者: Dmitriy Rivkin,Parker Ewen,Lili Gao,Julian Ost,Stefanie Walz,Rasika Kangutkar,Mario Bijelic,Felix Heide
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.

[CV-31] Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients CVPR2026

【速读】：该论文旨在解决大型视觉语言模型（Large Vision Language Models, LVLMs）在部署时因计算和内存开销大而导致的实际应用受限问题，特别是现有后训练量化方法在模态层面衡量token敏感性时，无法捕捉复杂的跨token交互关系，导致量化误差难以精确评估。其解决方案的关键在于提出一种基于量化感知积分梯度（Quantization-aware Integrated Gradients, QIG）的细粒度量化策略，利用积分梯度对每个token进行定量敏感性分析，将量化校准的粒度从模态级细化到token级，从而更好地反映模态间与模态内动态交互，实现更精准的量化误差控制。实验表明，该方法在多种LVLM和量化设置下均能显著提升精度，且延迟增加可忽略。

链接: https://arxiv.org/abs/2603.17809
作者: Ziwei Xiang,Fanhu Zeng,Hongjian Fang,Rui-Qi Wang,Renxing Chen,Yanan Zhu,Yi Chen,Peipei Yang,Xu-Yao Zhang
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, UCAS; Beijing National Research Center for Information Science and Technology; Institute of Artificial Intelligence, USTB; School of Artificial Intelligence, Beihang University; Zhongguancun Academy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026 Main Conference

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at this https URL.

[CV-32] ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis ICPR2026

【速读】：该论文旨在解决多标签胃肠视频分析中因罕见病理标签导致的严重类别不平衡问题，以及帧级预测到事件级推理过程中产生的时序不匹配问题。其解决方案的关键在于：首先，在训练阶段采用截断类加权（clipped class-wise positive weighting）策略优化损失函数，有效提升稀有病理类别的学习能力并保持优化稳定性；其次，在时序阶段引入基于解剖结构的事件解码机制，包括GT风格的帧级事件组合、解剖投票平滑和基于解剖结构的病理门控策略，并结合保守迟滞解码器（conservative hysteresis decoder），显著提升了时序平均精度（temporal mAP）从0.3801提高至0.4303。

链接: https://arxiv.org/abs/2603.17784
作者: Romil Imtiaz,Dimitris K. Iakovidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICPR 2026 RARE-VISION Competition

点击查看摘要

Abstract:We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.

[CV-33] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime

【速读】：该论文旨在解决精准畜牧业中行为分类任务面临的两大挑战：高计算成本与标注数据稀缺问题。其核心解决方案是采用参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）技术，特别是基于DINOv3基础模型（67亿参数）的QLoRA和DoRA方法，在仅使用2,160张标注图像的情况下实现高性能泛化。关键发现在于：PEFT显著优于从头训练（ResNet-18、ViT-Small）和冻结特征提取策略，其中最优QLoRA配置（全线性层微调 + 秩=64）在仅使用2.72%参数（1.83亿）的情况下达到83.16%测试准确率，且训练时间仅为5.8小时，远优于对比方法；同时，增加适配器容量可稳定提升泛化性能而不引发过拟合，表明当前农业图像场景下主要瓶颈为欠拟合而非过拟合，为部署百亿级视觉模型于农业场景提供了明确实践指导。

链接: https://arxiv.org/abs/2603.17782
作者: Haiyu Yang,Sumit Sharma,Enhong Liu,Miel Hostens
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2603.17782 [cs.CV] (or arXiv:2603.17782v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.17782 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-34] CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image CVPR2026

【速读】：该论文旨在解决从单张图像中重建多人场景下人体三维模型的难题，尤其针对复杂人群场景中存在的严重遮挡、图像清晰度低以及个体外观多样等挑战。其解决方案的关键在于提出了一种统一框架 CrowdGaussian，该框架直接基于单图输入重建多个人体的三维高斯溅射（3D Gaussian Splatting, 3DGS）表示；通过设计自监督适应流程，使预训练的大规模人体模型能够从高度遮挡的输入中恢复出几何合理且外观逼真的完整人体结构；同时引入自校准学习（Self-Calibrated Learning, SCL）策略，利用单步扩散模型结合身份保持样本与干净/损坏图像对，自适应地将粗略渲染结果优化至高质量输出，并可将优化结果蒸馏回增强多人体3DGS表示的质量。

链接: https://arxiv.org/abs/2603.17779
作者: Yizheng Song,Yiyu Zhuang,Qipeng Xu,Haixiang Wang,Jiahe Zhu,Jing Tian,Siyu Zhu,Hao Zhu
机构: Nanjing University (南京大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.

[CV-35] Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

【速读】：该论文旨在解决图像深度伪造检测（Image Deepfake Detection, IDD）中依赖昂贵微调且泛化能力差的问题，尤其针对大视觉语言模型（Large Vision-Language Models, LVLMs）在面对多样化、持续演化的篡改手法时表现不佳的局限性。其解决方案的关键在于提出一种无需训练的框架——语义一致证据包（Semantic Consistent Evidence Pack, SCEP），该框架通过挖掘最能揭示篡改线索的局部图像块（patch tokens）构成“证据包”，利用视觉编码器的CLS token作为全局语义参考，结合聚类与融合评分机制（融合语义不一致性和频域/噪声异常特征）筛选高置信度样本，并采用基于网格的非极大值抑制（grid-based NMS）避免冗余，最终以冻结的LVLM为推理基础实现高效准确的检测。

链接: https://arxiv.org/abs/2603.17761
作者: Yuxin Liu,Fei Wang,Kun Li,Yiqi Nie,Junjie Chen,Zhangling Duan,Zhaohong Jia
机构: Anhui University (安徽大学); Hefei University of Technology (合肥工业大学); IAI, Hefei Comprehensive National Science Center (IAI，合肥综合性国家科学中心); United Arab Emirates University (阿联酋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder’s CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency-and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.

[CV-36] PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

【速读】：该论文旨在解决3D视觉定位（3D Visual Grounding, 3DVG）在复杂多物体场景中性能显著下降的问题，特别是针对现有方法在处理隐式空间定位线索不足以及共现物体带来的动态空间干扰时表现不佳的挑战。其解决方案的核心是提出PC-CrossDiff框架，采用双层次跨模态差分注意力机制：一是点级差分注意力（Point-Level Differential Attention, PLDA）模块，通过文本与点云之间的双向差分注意力自适应提取隐式定位线索；二是簇级差分注意力（Cluster-Level Differential Attention, CLDA）模块，构建层次化注意力机制，在增强与定位相关的空间关系的同时，抑制模糊或无关的空间关联，从而提升复杂场景下的定位准确性。

链接: https://arxiv.org/abs/2603.17753
作者: Wenbin Tan,Jiawen Lin,Fangyong Wang,Yuan Xie,Yong Xie,Yachao Zhang,Yanyun Qu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.

[CV-37] Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation

【速读】：该论文旨在解决通用医学图像分割（Universal Medical Image Segmentation）中现有方法依赖人工视觉提示或检索参考图像导致自动化程度低、鲁棒性差，以及跨模态联合训练难以应对显著域偏移（domain shift）的问题。其解决方案的关键在于提出一种无需提示的框架 Concept-to-Pixel (C2P)，通过显式分离解剖知识为几何（Geometric）与语义（Semantic）两个表征组件：利用多模态大语言模型（Multimodal Large Language Models, MLLMs）将抽象医学概念蒸馏为可学习的语义令牌（Semantic Tokens），并引入显式监督的几何令牌（Geometric Tokens）以施加通用物理和结构约束；二者与图像特征深度交互生成输入特定的动态卷积核，实现精准掩码预测；同时设计几何感知推理一致性机制（Geometry-Aware Inference Consensus），基于模型预测的几何约束评估置信度并抑制异常值，从而显著提升模型在多模态、零样本及跨模态任务中的泛化能力。

链接: https://arxiv.org/abs/2603.17746
作者: Haoyun Chen,Fenghe Tang,Wenxin Ma,Shaohua Kevin Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, code is available at: this https URL

点击查看摘要

Abstract:Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model’s predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universe- or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: this https URL

[CV-38] APESTRY: From Geometry to Appearance via Consistent Turntable Videos

【速读】：该论文旨在解决从无纹理3D模型自动生成高保真、几何一致的360度转台视频（TTVs）这一关键挑战，以支持高质量3D重建与纹理合成。现有通用视频扩散模型难以维持全视角下的几何一致性与外观稳定性，导致其输出不适用于高精度3D重建任务。解决方案的关键在于提出TAPESTRY框架，将3D外观生成重构为基于显式几何条件的视频扩散问题：通过渲染并编码多模态几何特征，在像素级精度上约束视频生成过程，从而生成高质量且一致的TTVs；进一步设计多阶段下游重建流程，包含3D感知修复（3D-Aware Inpainting），利用旋转模型和上下文感知的二次生成来填补自遮挡区域，实现完整表面覆盖。最终生成的TTVs不仅可作为动态预览，还可无缝回投影至UV纹理或用于监督神经渲染方法（如3DGS），实现从无纹理网格到生产就绪3D资产的自动化生成。

链接: https://arxiv.org/abs/2603.17735
作者: Yan Zeng,Haoran Jiang,Kaixin Yao,Qixuan Zhang,Longwen Zhang,Lan Xu,Jingyi Yu
机构: ShanghaiTech University (上海科技大学); Deemos Technology (德莫斯科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.

[CV-39] SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

【速读】：该论文旨在解决训练-free细粒度视觉识别（Fine-Grained Visual Recognition, FGVR）中因下位类别视觉模糊性导致的识别困难问题。现有方法多采用检索导向或推理导向范式，但存在两个根本局限：一是对所有样本采用统一推理流程，未考虑识别难度差异，导致准确率与效率不优；二是缺乏对错误经验的积累与复用机制，使得相似挑战性案例反复出错。解决方案的关键在于提出SARE（Sample-wise Adaptive textbfREasoning）框架，其核心创新包括：（1）采用级联设计，先快速候选检索，仅在必要时触发细粒度推理，实现自适应推理策略；（2）引入自省式经验机制，在推理过程中利用历史失败案例提供可迁移的判别引导，无需参数更新即可提升识别鲁棒性。

链接: https://arxiv.org/abs/2603.17729
作者: Jingxiao Yang,DaLin He,Miao Pan,Ge Su,Wenqi Zhang,Yifeng Hu,Tangwei Li,Yuke Li,Xuhong Zhang
机构: Zhejiang University (浙江大学); Netease Yidun AI Lab (网易易盾AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint, under review

点击查看摘要

Abstract:Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations:(1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive textbfREasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.

[CV-40] DiffVP: Differential Visual Semantic Prompting for LLM -Based CT Report Generation

【速读】：该论文旨在解决当前基于大语言模型（Large Language Models, LLMs）的CT报告生成方法中，因对3D影像进行整体编码而无法有效区分诊断相关信息与冗余解剖背景的问题。解决方案的关键在于提出差异视觉提示（Differential Visual Prompting, DiffVP），其通过层次化差异提取器捕获全局与局部语义差异，并将其映射到共享潜在空间；随后，差值到提示生成器将这些差异信号转化为可学习的视觉前缀令牌（visual prefix tokens），用于条件化LLM输入。该机制通过结构化的条件信号隐式抑制不变解剖结构，同时增强诊断相关视觉证据，从而实现无需显式病灶定位的高精度报告生成。

链接: https://arxiv.org/abs/2603.17718
作者: Yuhe Tian,Kun Zhang,Haoran Ma,Rui Yan,Yingtai Li,Rongsheng Wang,Shaohua Kevin Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies into a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average BLEU-1-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All codes will be released at this https URL.

[CV-41] Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)

【速读】：该论文旨在解决当前视觉基础模型在眼图像分割任务中性能是否随迭代提升的问题，特别是对比最新版本Segment Anything Model（SAM3）与前代SAM2在眼图像分割中的表现差异，并探索SAM3引入的文本提示（text prompting）新功能的有效性。其关键解决方案在于系统性地评估SAM3在实验室高分辨率视频和野外挑战性眼视频（TEyeD数据集）上的零样本分割性能，结果表明SAM3在多数情况下并未优于SAM2，且速度更慢，因此作者认为SAM2仍是眼图像分割的最佳选择，并开源了适配SAM3代码库以支持任意时长视频处理。

链接: https://arxiv.org/abs/2603.17715
作者: Diederick C. Niehorster,Marcus Nyström
机构: Lund University Humanities Lab (隆德大学人文实验室); Dept. of Psychology, Lund University (隆德大学心理学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3’s codebase that allows processing videos of arbitrary duration.

[CV-42] AERR-Nav: Adaptive Exploration-Recovery-Reminiscing Strategy for Zero-Shot Object Navigation

【速读】：该论文旨在解决零样本物体导航（Zero-Shot Object Navigation, ZSON）在未知多层环境中的挑战，尤其是机器人在未见过的场景中难以平衡探索（exploration）与利用（exploitation），导致如卡在狭窄交汇处、无休止漫游或无法找到楼梯入口等问题。解决方案的关键在于提出AERR-Nav框架，其核心创新为：(1) 自适应探索-恢复-回忆策略（Adaptive Exploration-Recovery-Reminiscing Strategy），使机器人能动态切换三种状态以应对不同导航场景；(2) 自适应探索状态设计，包含快速思维（Fast-Thinking）与慢速思维（Slow-Thinking）模式，从而根据环境信息演化更优地权衡探索、利用与高层推理能力。

链接: https://arxiv.org/abs/2603.17712
作者: Jingzhi Huang,Junkai Huang,Haoyang Yang,Haoang Li,Yi Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments presents a significant challenge. Recent methods, mostly based on semantic value greedy waypoint selection, spatial topology-enhanced memory, and Multimodal Large Language Model (MLLM) as a decision-making framework, have led to improvements. However, these architectures struggle to balance exploration and exploitation for ZSON when encountering unseen environments, especially in multi-floor settings, such as robots getting stuck at narrow intersections, endlessly wandering, or failing to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot’s environment. Specifically, AERR-Nav has the following two key advantages: (1) An Adaptive Exploration-Recovery-Reminiscing Strategy, enables robots to dynamically transition between three states, facilitating specialized responses to diverse navigation scenarios. (2) An Adaptive Exploration State featuring Fast and Slow-Thinking modes helps robots better balance exploration, exploitation, and higher-level reasoning based on evolving environmental information. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that our AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of our proposed strategy and modules.

[CV-43] Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation

【速读】：该论文旨在解决多模态遥感语义分割中因预训练视觉基础模型（Vision Foundation Models, VFMs）适配时带来的计算开销大和模态不平衡问题，即辅助模态在优化过程中贡献被抑制。其解决方案的关键在于提出MoBaNet框架，该框架采用参数高效且模态平衡的对称融合结构：首先设计跨模态提示注入适配器（Cross-modal Prompt-Injected Adapter, CPIA），通过生成共享提示并注入冻结骨干网络中的瓶颈适配器实现深层语义交互；其次引入差异引导门控融合模块（Difference-Guided Gated Fusion Module, DGFM），利用跨模态差异显式指导特征选择以获得紧凑且判别性强的多模态表示；最后提出模态条件随机掩码策略（Modality-Conditional Random Masking, MCRM），仅在训练时掩码一个模态并施加硬像素辅助监督，从而缓解模态不平衡问题。

链接: https://arxiv.org/abs/2603.17705
作者: Haocheng Li,Juepeng Zheng,Shuangxi Miao,Ruibo Lu,Guosheng Cai,Haohuan Fu,Jianxi Huang
机构: China Agricultural University (中国农业大学); Key Laboratory of Remote Sensing for Agri-Hazards, Ministry of Agriculture and Rural Affairs (农业农村部农业遥感灾害监测重点实验室); Southwest Jiaotong University (西南交通大学); Sun Yat-Sen University (中山大学); Henan Polytechnic University (河南理工大学); Key Laboratory of Spatio-Temporal Information and Ecological Restoration of Mines, Ministry of Natural Resources of the People’s Republic of China (中华人民共和国自然资源部矿产时空信息与生态修复重点实验室); Tsinghua University (清华大学); National Supercomputing Center in Shenzhen (深圳国家超级计算中心); Ministry of Education Key Laboratory for Earth System Modeling (教育部地球系统建模重点实验室); Department of Earth System Science, Tsinghua University (清华大学地球系统科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at this https URL.

[CV-44] Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

【速读】：该论文旨在解决当前视觉语言模型（Vision-Language Models, VLMs）在从图像理解向视频理解过渡时面临的两大关键问题：一是现有训练数据集缺乏以时间为中心的设计，导致模型可仅依赖孤立关键帧即可作答，而无需整合全局时序信息；二是由专有模型生成的训练数据存在系统性时序感知错误，如混淆运动方向或误判速度。解决方案的核心在于提出SynRL框架，通过编程生成的合成视频来学习时序基础单元（temporal primitives），包括方向、速度和状态追踪等抽象能力，并将其从简单几何形状的合成场景迁移至真实世界视频任务中。该方法构建了7.7K带逐帧标注的思维链（Chain-of-Thought, CoT）样本与7K强化学习（Reinforcement Learning, RL）样本，显著优于使用165K真实样本训练的Video-R1模型，验证了基于精心设计的合成数据进行时序学习是一种更高效且具泛化能力的视频后训练范式。

链接: https://arxiv.org/abs/2603.17693
作者: Songtao Jiang,Sibo Song,Chenyi Zhou,Yuan Wang,Ruizhe Chen,Tongkun Guan,Ruilin Luo,Yan Zhang,Zhihang Tang,Yuchong Sun,Hang Zhang,Zhibo Yang,Shuai Bai,Junyang Lin,Zuozhu Liu
机构: Zhejiang University (浙江大学); Qwen Team, Alibaba Group (阿里巴巴集团通义实验室); Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.

[CV-45] Does YOLO Really Need to See Every Training Image in Every Epoch? CVPR2026

【速读】：该论文旨在解决YOLO（You Only Look Once）目标检测器在训练过程中效率低下的问题——尽管其推理速度极快，但传统训练流程会重复处理每张图像每个epoch，即使部分图像已充分学习，造成大量冗余计算。解决方案的核心是提出一种抗遗忘采样策略（Anti-Forgetting Sampling Strategy, AFSS），其关键在于动态评估每张训练图像的学习充分性（通过检测召回率与精确率的最小值衡量），并据此将图像分为易、中、难三类：易样本稀疏重采样且优先选择长时间未被使用的图像以减少冗余和遗忘；中样本部分选取，兼顾近期未用和随机补充以保证覆盖；难样本则全量采样确保充分学习。该机制通过周期性更新图像学习状态，使模型逐步聚焦于高信息量样本，从而实现训练加速（提升超1.43倍）并保持或提升精度。

链接: https://arxiv.org/abs/2603.17684
作者: Xingxing Xie,Jiahua Dong,Junwei Han,Gong Cheng
机构: Northwestern Polytechnical University (西北工业大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the ``You Only Look Once’’ philosophy. This naturally raises an important question: \textitDoes YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Moderate training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than 1.43\times training speedup for YOLO-series detectors while also improving accuracy.

[CV-46] WeatherReason Seg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models

【速读】：该论文旨在解决现有视觉语言模型（Vision-Language Models, VLMs）在恶劣天气条件下进行推理分割（reasoning-based segmentation）能力不足的问题。当前主流基准测试多基于理想环境下的高质量图像，难以评估模型在真实复杂天气场景中的鲁棒性。为此，作者提出WeatherReasonSeg基准，其核心解决方案包含两个关键部分：一是通过合成不同强度的天气扰动构建可控的推理数据集，实现细粒度鲁棒性分析；二是收集真实恶劣天气下的推理分割数据集，利用掩码引导的大语言模型（LLM）提示生成语义一致的查询，从而更贴近实际应用需求。该方案还拓展了五个维度的评估体系（功能、应用场景、结构属性、交互关系与需求匹配），揭示了VLM性能随天气恶化而单调下降，且不同天气类型引发差异化脆弱模式的关键发现。

链接: https://arxiv.org/abs/2603.17680
作者: Wanjun Du,Zifeng Yuan,Tingting Chen,Fucai Ke,Beibei Lin,Shunli Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.

[CV-47] Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging

【速读】：该论文旨在解决无接触指纹识别中伪造攻击检测（presentation attack detection, PAD）的鲁棒性不足问题，尤其是在缺乏物理接触和传统活体线索的情况下。其解决方案的关键在于提出了一种轻量级主动感知机制——成对的闪光-非闪光图像采集方式，通过对比不同光照条件下指纹图像的差异特征来增强伪造检测能力。具体而言，闪光照明能够凸显材料与结构相关的属性（如纹线可见度、次表面散射、微几何形态及表面油脂分布），而非闪光图像提供基础外观上下文；利用可解释的指标（如通道间相关性、镜面反射特性、纹理真实性与差分成像）提取互补特征，从而有效区分真实指纹与打印、数字合成及模具制造等类型的伪造攻击。

链接: https://arxiv.org/abs/2603.17679
作者: Roja Sahoo,Anoop Namboodiri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IWBF 2026 (14th International Workshop on Biometrics and Forensics)

点击查看摘要

Abstract:Contactless fingerprint recognition enables hygienic and convenient biometric authentication but poses new challenges for spoof detection due to the absence of physical contact and traditional liveness cues. Most existing methods rely on single-image acquisition and appearance-based features, which often generalize poorly across devices, capture conditions, and spoof materials. In this work, we study paired flash-non-flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we show that flash illumination accentuates material- and structure-dependent properties, including ridge visibility, subsurface scattering, micro-geometry, and surface oils, while non-flash images provide a baseline appearance context. We analyze lighting-induced differences using interpretable metrics such as inter-channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help discriminate genuine fingerprints from printed, digital, and molded presentation attacks. We further examine the limitations of paired acquisition, including sensitivity to imaging settings, dataset scale, and emerging high-fidelity spoofs. Our findings demonstrate the potential of illumination-aware analysis to improve robustness and interpretability in contactless fingerprint presentation attack detection, motivating future work on paired acquisition and physics-informed feature design. Code is available in the repository.

[CV-48] DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation

【速读】：该论文旨在解决冠状动脉造影（coronary angiography）在临床实践中依赖人工视觉判读、存在读者间差异的问题，同时克服现有人工智能方法多局限于单帧或单一投影视角、主要关注狭窄程度而缺乏对冠状动脉疾病全面评估的局限性。其解决方案的关键在于提出一种基于视频-文本对比学习（video-text contrastive learning）训练的多视角基础模型 DeepCORO-CLIP，通过整合多个投影视角并采用注意力池化机制（attention-based pooling），实现从研究层面进行诊断、预后及疾病进展任务的综合分析。该模型在显著狭窄检测、慢性完全闭塞、血栓和钙化识别等任务中表现优异，并支持迁移学习用于心血管事件预测与左心室射血分数估算，具备临床部署可行性（平均推理时间仅4.2秒）。

链接: https://arxiv.org/abs/2603.17675
作者: Sarra Harrabi,Yichen Wu,Geoffrey H. Tison,Minhaj Ansari,Milos Vukadinovic,David Ouyang,Joshua P. Barrios,Jacques Delfrate,Robert Avram
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 69 pages, 5 figures

点击查看摘要

Abstract:Coronary angiography is the reference standard for evaluating coronary artery disease, yet visual interpretation remains variable between readers. Existing artificial intelligence methods typically analyze single frames or projections and focus mainly on stenosis, limiting comprehensive coronary assessment. We present DeepCORO-CLIP, a multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates multiple projections with attention-based pooling for study-level assessment across diagnostic, prognostic, and disease progression tasks. For significant stenosis detection, the model achieved an AUROC of 0.888 internally and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than clinical reports at 19.0%. The model also performed strongly for chronic total occlusion, intracoronary thrombus, and coronary calcification detection. Transfer learning enabled prediction of one-year major adverse cardiovascular events with AUROC 0.79 and estimation of left ventricular ejection fraction with mean absolute error 7.3%. Embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care. Code, sample data, model weights, and deployment infrastructure are publicly released.

[CV-49] Few-Step Diffusion Sampling Through Instance-Aware Discretizations

【速读】：该论文旨在解决生成式模型中全局固定时间步长调度策略在实例级复杂性差异下表现不佳的问题，即现有方法普遍采用跨所有样本共享的时间步长安排，未能考虑每个样本在生成过程中的动态特性，从而限制了生成质量。其解决方案的关键在于提出一种实例感知的离散化框架，通过学习输入依赖的先验分布来自适应调整时间步分配，并将基于梯度的离散化搜索扩展至条件生成场景，实现对每个样本的个性化时间步优化，在不显著增加训练成本和推理开销的前提下，显著提升多种任务（包括合成数据、像素空间扩散模型、潜在空间图像与视频流匹配）的生成质量。

链接: https://arxiv.org/abs/2603.17671
作者: Liangyu Yuan,Ruoyu Wang,Tong Zhao,Dingwen Fu,Mingkun Lei,Beier Zhu,Chi Zhang
机构: Westlake University (西湖大学); Tongji University (同济大学); Zhejiang University (浙江大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 20 figures. code: this https URL

点击查看摘要

Abstract:Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveals the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.

[CV-50] FINER: MLLM s Hallucinate under Fine-grained Negative Queries CVPR2026

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在细粒度查询任务中普遍存在的幻觉问题，这一问题在现有基准测试中未被充分覆盖，因多数评测聚焦于粗粒度图像相关问题。为系统评估此类幻觉现象，作者提出了FIne-grained NEgative queRies (FINER)框架，并构建了两个新基准：FINER-CompreCap与FINER-DOCCI，用于分析四种典型场景下的幻觉行为——多对象、多属性、多关系及“what”类问题。研究发现，当细粒度不匹配与图像中真实存在的元素共现时，MLLMs更容易产生幻觉。针对此问题，论文提出FINER-Tuning方法，基于直接偏好优化（Direct Preference Optimization, DPO）在FINER启发的数据上对前沿MLLMs进行微调。关键创新在于利用精细设计的负向查询样本引导模型学习更准确的跨模态对齐，从而显著降低幻觉率（最高提升24.2%），同时增强模型在多个主流基准上的通用多模态能力。

链接: https://arxiv.org/abs/2603.17662
作者: Rui Xiao,Sanghwan Kim,Yongqin Xian,Zeynep Akata,Stephan Alaniz
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Helmholtz Munich (赫尔姆霍兹慕尼黑研究中心); Google (谷歌); LTCI, Télécom Paris, Institut Polytechnique de Paris, France (法国巴黎电信学院、巴黎综合理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and ``what’’ questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at \hrefthis https URLthis https URL.

[CV-51] Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment CVPR2026

【速读】：该论文旨在解决基于CLIP的跨域少样本学习（Cross-Domain Few-Shot Learning, CDFSL）中出现的局部特征对齐问题（local misalignment problem），即在目标域数据稀缺且存在领域差异的情况下，CLIP模型难以聚焦于细粒度视觉线索（fine-grained visual cues），导致模型决策缺乏可解释性。解决方案的关键在于引入一种基于循环一致性（cycle consistency）的自监督机制——CC-CDFSL方法，通过在局部视觉特征与文本特征之间进行双向翻译并约束原特征与译回特征的一致性，从而增强局部层次上的跨模态对齐；同时提出语义锚点（Semantic Anchor）机制，利用视觉特征增强构建更丰富的文本到图像映射，并通过图像特征压缩过滤无关映射，降低视觉模态噪声干扰，显著提升局部对齐精度与模型可解释性。

链接: https://arxiv.org/abs/2603.17655
作者: Yaze Zhao,Yixiong Zou,Yuhua Li,Ruixuan Li
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP’s shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.

[CV-52] VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs

【速读】：该论文旨在解决闭环自动驾驶策略评估中生成式世界模型在长期滚动预测时的三大核心问题：(i) 初始状态与策略输入的历史条件不匹配，(ii) 多步采样延迟超出实时预算，以及 (iii) 长时间尺度下运动学不可行性累积放大。其解决方案的关键在于提出 VectorWorld——一种流式世界模型，通过运动感知门控变分自编码器（motion-aware gated VAE）生成与策略兼容的交互状态以对齐初始化；采用无求解器的一步掩码补全机制结合边缘门控关系扩散 Transformer（edge-gated relational DiT），并辅以区间条件均值流（MeanFlow）和Jacobian向量积（JVP）驱动的大步长监督，实现真正的实时外推（outpainting）；同时引入 ΔSim 策略，即物理一致的非我车（NPC）控制策略，具备混合离散-连续动作空间与可微运动学logit调制机制，从而显著提升长时间滚动模拟的稳定性与真实性。

链接: https://arxiv.org/abs/2603.17652
作者: Chaokang Jiang,Desen Zhou,Jiuming Liu,Kevin Li Sun
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric 64 \mathrmm\times 64\mathrmm lane–agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce \Delta Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete–continuous actions and differentiable kinematic logit shaping. On Waymo open motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time 1\mathrmkm+ closed-loop rollouts (\hrefthis https URLcode).

[CV-53] Anchoring and Rescaling Attention for Semantically Coherent Inbetweening CVPR2026

【速读】：该论文旨在解决生成式AI（Generative AI）中关键帧间插（Generative Inbetweening, GI）任务在序列稀疏性和大运动场景下出现的帧一致性差、节奏不稳定及语义错位问题。现有GI模型难以在固定起点与终点之间生成合理且连贯的中间帧，尤其当路径多样性高时缺乏有效引导。其解决方案的关键在于引入关键帧锚定注意力偏置（Keyframe-anchored Attention Bias），通过从关键帧和文本中提取语义与时序信息，为每帧中间帧提供路径约束；同时采用**重缩放时间RoPE（Rescaled Temporal RoPE）**增强帧间一致性，使自注意力机制更准确地关注关键帧，从而提升生成质量与稳定性。

链接: https://arxiv.org/abs/2603.17651
作者: Tae Eun Choi,Sumin Shim,Junhyeok Kim,Seong Jae Hwang
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026; Code is released at this https URL

点击查看摘要

Abstract:Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

[CV-54] Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment

【速读】：该论文旨在解决3D物体中语言驱动的可操作性定位（language-driven 3D affordance grounding）问题，即如何将自然语言提问精准映射到三维物体中功能相关的区域。现有方法在开放词汇泛化能力、细粒度几何对齐以及部件级语义一致性方面仍存在挑战。解决方案的关键在于提出一种两阶段跨模态框架：第一阶段利用大语言模型生成部件感知的指令以恢复缺失语义，实现语义相似可操作性的关联；第二阶段引入两个核心组件——可操作原型聚合（Affordance Prototype Aggregation, APA）用于捕捉跨物体的几何一致性，以及对象内关系建模（Intra-Object Relational Modeling, IORM）以细化物体内部几何差异，从而支持精确的语义对齐。

链接: https://arxiv.org/abs/2603.17647
作者: Dongqiang Gou,Xuming He
机构: ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grounding natural language questions to functionally relevant regions in 3D objects – termed language-driven 3D affordance grounding – is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate the effectiveness of our method through extensive experiments on a newly introduced benchmark, as well as two existing benchmarks, demonstrating superior performance in comparison with existing methods.

[CV-55] DSS-GAN: Directional State Space GAN with Mamba backbone for Class-Conditional Image Synthesis

【速读】：该论文旨在解决生成式对抗网络（Generative Adversarial Network, GAN）在噪声到图像合成任务中，如何更有效地实现类条件控制与高保真图像生成的问题。传统方法通常采用全局信号注入方式进行类别条件建模，难以精细调控不同空间区域的特征响应。其解决方案的关键在于提出方向潜空间路由（Directional Latent Routing, DLR），这是一种新颖的条件机制，将潜变量分解为沿不同空间方向的子向量，每个子向量与类别嵌入联合投影，以产生特征级仿射调制作用于Mamba扫描模块；该机制实现了类别身份与潜空间结构在多尺度上的解耦协同，从而提升了图像质量与可控性。

链接: https://arxiv.org/abs/2603.17637
作者: Aleksander Ogonowski,Konrad Klimaszewski,Przemysław Rokita
机构: Warsaw University of Technology (华沙理工大学); National Centre for Nuclear Research (国家核研究中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present DSS-GAN, the first generative adversarial network to employ Mamba as a hierarchical generator backbone for noise-to-image synthesis. The central contribution is Directional Latent Routing (DLR), a novel conditioning mechanism that decomposes the latent vector into direction-specific subvectors, each jointly projected with a class embedding to produce a feature-wise affine modulation of the corresponding Mamba scan. Unlike conventional class conditioning that injects a global signal, DLR couples class identity and latent structure along distinct spatial axes of the feature map, applied consistently across all generative scales. DSS-GAN achieves improved FID, KID, and precision-recall scores compared to StyleGAN2-ADA across multiple tested datasets. Analysis of the latent space reveals that directional subvectors exhibit measurable specialization: perturbations along individual components produce structured, direction-correlated changes in the synthesized image.

[CV-56] A Multi-Agent System for Building-Age Cohort Mapping to Support Urban Energy Planning

【速读】：该论文旨在解决城市建筑存量年龄分布精准识别问题，这对于可持续市政供热规划和升级优先级制定至关重要。现有方法多依赖传感器或遥感数据，常存在数据不一致与缺失的问题。解决方案的关键在于构建一个由三个核心智能体（Zensus agent、OSM agent 和 Monument agent）组成的多代理大语言模型（multi-agent LLM）系统，融合异构数据源，并通过数据编排器与协调器进行地理编码与去重处理，形成高质量的基准数据集；在此基础上提出 BuildingAgeCNN 模型，基于 ConvNeXt 主干网络结合特征金字塔网络（FPN）、CoordConv 空间通道和 Squeeze-and-Excitation（SE）模块，实现仅使用卫星影像的建筑年代分类，整体准确率达 90.69%，并通过置信度校准与低置信度标注机制降低规划应用风险。

链接: https://arxiv.org/abs/2603.17626
作者: Kundan Thota,Thorsten Schlachter,Veit Hagenmeyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Determining the age distribution of the urban building stock is crucial for sustainable municipal heat planning and upgrade prioritization. However, existing approaches often rely on datasets gathered via sensors or remote sensing techniques, leaving inconsistencies and gaps in data. We present a multi-agent LLM system comprising three key agents, the Zensus agent, the OSM agent, and the Monument agent, that fuse data from heterogeneous sources. A data orchestrator and harmonizer geocodes and deduplicates building imprints. Using this fused ground truth, we introduce BuildingAgeCNN, a satellite-only classifier based on a ConvNeXt backbone augmented with a Feature Pyramid Network (FPN), CoordConv spatial channels, and Squeeze-and-Excitation (SE) blocks. Under spatial cross validation, BuildingAgeCNN attains an overall accuracy of 90.69% but a modest macro-F1 of 67.25%, reflecting strong class imbalance and persistent confusions between adjacent historical cohorts. To mitigate risk for planning applications, the address-to prediction pipeline includes calibrated confidence estimates and flags low-confidence cases for manual review. This multi-agent LLM system not only assists in gathering structured data but also helps energy demand planners optimize district-heating networks and target low-carbon sustainable energy systems.

[CV-57] S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models ICME2026

【速读】：该论文旨在解决前馈式3D基础模型中因全局注意力机制（global attention）导致的二次计算复杂度问题，该问题在输入长度增加时严重限制了模型的可扩展性。现有加速方法如令牌合并（token merging）虽能在令牌层面带来局部优化，但其依赖最近邻搜索引入额外开销，且未能从根本上解决密集捕获数据中的结构冗余问题。论文提出S-VGGT方法，其关键创新在于从结构帧（structural frame）层面识别并消除冗余：首先构建密集场景图以刻画结构冗余并指导场景划分，随后将帧软分配至少量子场景，确保分组平衡与几何过渡平滑；核心在于设计共享参考帧的子场景结构，建立并行几何桥梁，实现无需显式几何对齐的独立高效处理，从而从源头上降低全局注意力计算成本。此方案与令牌级加速方法正交，可无缝集成以获得复合加速效果而不损失重建保真度。

链接: https://arxiv.org/abs/2603.17625
作者: Xinze Li,Pengxu Chen,Yiyuan Wang,Weifeng Su,Wentao Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures. Accepted by ICME 2026

点击查看摘要

Abstract:Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce \textbfS-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at this https URL.

[CV-58] ReLaGS: Relational Language Gaussian Splatting CVPR2026

【速读】：该论文旨在解决跨任务统一的3D感知与推理问题，特别是针对分割、检索和关系理解等任务中现有方法要么以物体为中心（object-centric），要么依赖昂贵的场景特定训练来实现物体间推理的局限性。其解决方案的关键在于构建一个无需场景特定训练的层次化语言蒸馏高斯场景（hierarchical language-distilled Gaussian scene）及其3D语义场景图（3D semantic scene graph），通过高斯剪枝机制优化场景几何结构，并采用鲁棒的多视角语言对齐策略将噪声2D特征聚合为精确的3D物体嵌入；在此基础上，利用视觉语言模型生成的标注信息和基于图神经网络（Graph Neural Network, GNN）的关系推理模块，建立开放词汇（open-vocabulary）的3D场景图，从而实现高效且可扩展的开放词汇3D推理。

链接: https://arxiv.org/abs/2603.17605
作者: Yaxu Xie,Abdalla Arafa,Alireza Javanmardi,Christen Millerdurai,Jia Cheng Hu,Shaoxiang Wang,Alain Pagani,Didier Stricker
机构: German Research Center for Artificial Intelligence (DFKI); RPTU University Kaiserslautern-Landau; University of Modena and Reggio Emilia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: this https URL

[CV-59] rust the Unreliability: Inward Backward Dynamic Unreliability Driven Coreset Selection for Medical Image Classification

【速读】：该论文旨在解决在有限资源条件下高效管理和利用大规模医学影像数据集的挑战，尤其是传统核心集（coreset）选择方法在医学数据中效果受限的问题，原因在于医学图像存在类内差异大和类间相似度高的特性。其解决方案的关键在于提出一种动态不可靠性驱动的核心集选择策略（Dynamic Unreliability-Driven Coreset Selection, DUCS），该策略从两个维度评估样本的不可靠性：1）向内自省（Inward Self-Awareness），通过分析训练过程中置信度的变化来量化样本不确定性；2）向后记忆追踪（Backward Memory Tracking），通过跟踪样本被遗忘的频率来评估模型对其的记忆能力。最终选择那些置信度波动显著且反复被遗忘的样本，这些样本通常位于决策边界附近，有助于模型更精准地优化分类边界，从而在高压缩率下仍保持优异性能。

链接: https://arxiv.org/abs/2603.17603
作者: Yan Liang,Ziyuan Yang,Zhuxin Lei,Mengyu Sun,Yingyu Chen,Yi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficiently managing and utilizing large-scale medical imaging datasets with limited resources presents significant challenges. While coreset selection helps reduce computational costs, its effectiveness in medical data remains limited due to inherent complexity, such as large intra-class variation and high inter-class similarity. To address this, we revisit the training process and observe that neural networks consistently produce stable confidence predictions and better remember samples near class centers in training. However, concentrating on these samples may complicate the modeling of decision boundaries. Hence, we argue that the more unreliable samples are, in fact, the more informative in helping build the decision boundary. Based on this, we propose the Dynamic Unreliability-Driven Coreset Selection(DUCS) strategy. Specifically, we introduce an inward-backward unreliability assessment perspective: 1) Inward Self-Awareness: The model introspects its behavior by analyzing the evolution of confidence during training, thereby quantifying uncertainty of each sample. 2) Backward Memory Tracking: The model reflects on its training tracking by tracking the frequency of forgetting samples, thus evaluating its retention ability for each sample. Next, we select unreliable samples that exhibit substantial confidence fluctuations and are repeatedly forgotten during training. This selection process ensures that the chosen samples are near the decision boundary, thereby aiding the model in refining the boundary. Extensive experiments on public medical datasets demonstrate our superior performance compared to state-of-the-art(SOTA) methods, particularly at high compression rates.

[CV-60] Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing CVPR2026

【速读】：该论文旨在解决3D室内场景编辑中因现有开放词汇系统依赖生成式建模而导致的全局结构破坏与物理不一致问题，这些问题通常表现为对场景的大规模重生成或图像空间编辑引发的空间关系失真。其解决方案的关键在于提出一种名为Edit-As-Act的框架，该框架将编辑任务视为在3D空间中的目标回溯规划（goal-regressive planning），通过设计一种受PDDL启发的动作语言EditLang来显式编码前提条件与效果，包括支撑、接触、碰撞等几何关系，并结合语言驱动的规划器与验证器，实现以最小动作序列达成用户指令目标的同时保持场景其余部分不变，从而在指令保真度、语义一致性与物理合理性三个维度上取得协同优化。

链接: https://arxiv.org/abs/2603.17583
作者: Seongrae Noh,SeungWon Seo,Gyeong-Moon Park,HyeongYeop Kang
机构: Korea University (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.

[CV-61] LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation

【速读】：该论文旨在解决脑肿瘤在磁共振成像（MRI）中精确定位与分割的难题，尤其针对现有方法依赖任务特定监督模型且受限于标注数据稀缺的问题。其解决方案的关键在于提出一种参数高效、以检测驱动的框架LoGSAM，通过将放射科医生的语音转录并经临床自然语言处理（NLP）提取出肿瘤特异性文本提示，引导LoRA适配的视觉-语言检测模型Grounding DINO（GDINO）完成定位；随后利用预测边界框作为先验条件，直接调用冻结的MedSAM模型生成像素级肿瘤掩膜，无需额外微调。该方法仅更新5%的模型参数即可实现计算高效的领域适应，同时保持预训练跨模态知识，最终在BRISC 2025数据集上达到80.32%的Dice分数，并在未见德国影像上实现91.7%的病例级准确率，验证了基于预训练基础模型的模块化“语音到分割”流程的可行性。

链接: https://arxiv.org/abs/2603.17576
作者: Mohammad Robaitul Islam Bhuiyan,Sheethal Bhat,Melika Qahqaie,Tri-Thien Nguyen,Paula Andrea Pérez Toro,Tomas Arias Vergara,Andreas Maier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates using 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.

[CV-62] PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery CVPR2026

【速读】：该论文旨在解决全景图像（panoramic imagery）中由于非针孔畸变导致的相机位姿估计（pose estimation）与三维重建（3D reconstruction）难题。现有基于透视相机设计的前馈模型在处理全景图像时泛化能力差，难以有效建模球面几何结构。解决方案的关键在于提出PanoVGGT框架，其核心创新包括：1）引入球面感知的位置嵌入（spherical-aware positional embeddings）和三轴SO(3)旋转增强策略，以实现球面域内的有效几何推理；2）通过训练阶段的随机锚定策略（stochastic anchoring strategy）缓解全局坐标系歧义问题；3）构建大规模户外全景数据集PanoCity，支持密集深度和6-DoF位姿标注，从而提升模型在跨域场景下的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2603.17571
作者: Yijing Guo,Mengjun Chao,Luo Wang,Tianyang Zhao,Haizhao Dai,Yingliang Zhang,Jingyi Yu,Yujiao Shi
机构: ShanghaiTech University (上海科技大学); Sudo (苏多)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.

[CV-63] Face anonymization preserving facial expressions and photometric realism

【速读】：该论文旨在解决当前人脸匿名化方法在保护隐私的同时，往往忽视面部表情、光照方向和肤色等关键属性的一致性问题，从而限制了匿名化图像在下游任务（如重照明、颜色恒常性分析及情感识别）中的可用性。其解决方案的关键在于提出一种特征保持型匿名化框架，通过引入密集面部关键点以更好保留表情信息，并设计轻量级后处理模块确保光照方向与肤色的一致性，同时建立专门评估指标量化表达保真度、光照一致性和色彩保真度，从而在保证身份不可逆隐藏的前提下显著提升匿名图像的实用性与可信度。

链接: https://arxiv.org/abs/2603.17567
作者: Luigi Celona,Simone Bianco,Raimondo Schettini
机构: University of Milano-Bicocca (米兰博科尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject’s identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency – specifically attributes such as illumination and skin tone – that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.

[CV-64] FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

【速读】：该论文旨在解决扩散模型在图像到视频（I2V）生成中处理超高清输入（如4K分辨率）时面临的全局结构一致性与局部细节保留之间的矛盾问题。现有方法要么在模型原生分辨率下生成导致细粒度结构丢失，要么采用高分辨率分块去噪虽能保留局部细节但破坏了全局布局一致性，尤其在壁画动画等复杂场景中表现尤为明显——此类场景包含多个语义各异的子区域和角色，需保持长时间内的空间连贯性。解决方案的关键在于引入一种无需训练的协同机制：首先生成低分辨率视频并上采样其潜在轨迹作为全局参考（latent prior），随后在每个扩散步骤中通过最小化单一加权最小二乘目标函数，将每块的噪声预测与该参考进行融合，从而实现全局一致性增强与局部细节保留的平衡；同时设计空间正则化变量以实现区域级运动控制，明确调节创造力与一致性之间的权衡。

链接: https://arxiv.org/abs/2603.17555
作者: Hugo Caselles-Dupré(1),Mathis Koroglu(1 and 2),Guillaume Jeanneret(2),Arnaud Dapogny(2),Matthieu Cord(2) ((1) Obvious Research, Paris, France, (2) Institute of Intelligent Systems and Robotics - Sorbonne University, Paris, France)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 authors. Hugo Caselles-Dupré, Mathis Koroglu, and Guillaume Jeanneret contributed equally. 14 pages, 7 figures

点击查看摘要

Abstract:Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model’s native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.

[CV-65] Prompt-Free Universal Region Proposal Network CVPR2026

【速读】：该论文旨在解决现有目标检测方法依赖于示例图像、预定义类别或文本描述进行潜在目标定位的问题，这些问题限制了模型在真实场景中的灵活性与适应性。其解决方案的关键在于提出一种无需外部提示的通用区域提议网络（Prompt-Free Universal Region Proposal Network, PF-RPN），该网络通过三个核心模块实现无监督式潜在目标识别：首先，稀疏图像感知适配器（Sparse Image-Aware Adapter, SIA）利用可学习查询嵌入动态更新视觉特征以完成初始定位；其次，级联自提示模块（Cascade Self-Prompt, CSP）通过自提示的可学习嵌入级联聚合信息，识别剩余潜在目标；最后，中心度引导查询选择模块（Centerness-Guided Query Selection, CG-QS）基于中心度评分网络筛选高质量查询嵌入。该方法可在少量数据（如MS COCO数据集的5%）下优化，并直接应用于水下目标检测、工业缺陷检测和遥感图像目标检测等不同领域，无需微调即可有效识别潜在对象。

链接: https://arxiv.org/abs/2603.17554
作者: Qihong Tang,Changhan Liu,Shaofeng Zhang,Wenbin Li,Qi Fan,Yang Gao
机构: Nanjing University (南京大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at this https URL.

[CV-66] ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling

【速读】：该论文旨在解决现有感知视频压缩（Perceptual Video Compression）方法在变量比特率（Variable Bitrate, VBR）支持和渐进式传输（Progressive Delivery）方面的不足，以及生成模块与熵编码之间耦合松散导致的比特率压缩效率受限的问题。其解决方案的关键在于提出一种基于渐进式的生成视频压缩框架 ProGVC，该框架通过将视频编码为分层多尺度残差标记图（Hierarchical Multi-scale Residual Token Maps），实现从粗到细的渐进传输；同时引入基于 Transformer 的多尺度自回归上下文模型，用于高效熵编码传输的标记，并在解码端预测截断的精细尺度标记以恢复感知细节，从而统一了渐进传输、高效熵编码与细节合成三大功能。

链接: https://arxiv.org/abs/2603.17546
作者: Daowen Li,Ruixiao Dong,Ying Chen,Kai Li,Ding Ding,Li Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.

[CV-67] mporal Gains Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在视频监督微调（Video-SFT）过程中，视觉能力细粒度演化不明确的问题，特别是空间理解与时间理解之间的平衡如何被影响。研究发现，尽管Video-SFT能显著提升视频任务性能，但常导致静态图像基准性能受限甚至下降，且这种权衡与帧采样数量（即时间预算）密切相关。解决方案的关键在于提出一种指令感知的混合帧策略（instruction-aware Hybrid-Frame strategy），通过自适应分配帧数，在保持视频性能的同时部分缓解图像与视频能力之间的冲突，从而揭示Video-SFT并非免费提升手段，强调了在联合图像-视频训练中维持空间理解能力的核心挑战。

链接: https://arxiv.org/abs/2603.17541
作者: Linghao Zhang,Jungang Li,Yonghua Hei,Sicheng Tao,Song Dai,Yibo Yan,Zihao Dongfang,Weiting Liu,Chenxi Qin,Hanqian Li,Xin Zou,Jiahao Zhang,Shuhang Xun,Haiyun Jiang,Xuming Hu
机构: SJTU; HKUST(GZ); HKUST; CityU; FDU; HIT; TJU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.

[CV-68] Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis CVPR2026

【速读】：该论文旨在解决3D点云学习中如何同时实现严格的SE(3)对称性（rigid motion symmetry）与可扩展性的难题。现有方法如群卷积（group convolution）在实践中难以兼顾这两者，导致性能或效率受限。解决方案的关键在于提出Equivariant Coordinate-based Kernel Convolution（ECKConv），其核心创新是通过在双陪集空间（double coset space）定义的核域获取SE(3)等变性，并采用基于坐标的显式核设计，从而在保持严格对称性的同时显著提升学习能力与内存效率，实现在多种点云任务上的优越性能和良好可扩展性。

链接: https://arxiv.org/abs/2603.17538
作者: Jaein Kim,Hee Bin Yoo,Dong-Sig Han,Byoung-Tak Zhang
机构: Seoul National University (首尔国立大学); École Normale Supérieure (巴黎高等师范学院); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2026

点击查看摘要

Abstract:A symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.

[CV-69] Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing CVPR2026

【速读】：该论文旨在解决生成式 AI (Generative AI) 基于扩散模型的图像编辑对数字视觉内容真实性造成的威胁，特别是传统嵌入式水印方法因引入可见扰动而损害视觉保真度，以及现有零水印方法依赖全局图像特征难以抵御复杂篡改的问题。解决方案的关键在于发现并利用图像局部块对之间的相对距离在 AI 编辑过程中保持相对不变的特性，提出了一种名为 Relational Zero-Watermarking (Rel-Zero) 的新框架，通过构建基于图像内在结构一致性的零水印，无需修改原始图像即可实现非侵入且鲁棒的内容认证。

链接: https://arxiv.org/abs/2603.17531
作者: Pengzhen Chen,Yanwei Liu,Xiaoyan Gu,Xiaojun Chen,Wu Liu,Weiping Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: accepted to CVPR 2026

点击查看摘要

Abstract:Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.

[CV-70] AdapTS: Lightweight Teacher-Student Approach for Multi-Class and Continual Visual Anomaly Detection

【速读】：该论文旨在解决工业视觉异常检测（Visual Anomaly Detection, VAD）中现有方法普遍局限于单类别场景、难以适应多类别和持续学习（continual learning）需求的问题。针对这一挑战，作者提出AdapTS框架，其核心创新在于采用统一的教师-学生（Teacher-Student, TS）架构，通过共享一个冻结的主干网络（backbone）并仅在学生路径中注入轻量级可训练适配器（adapter），实现了多类别与持续学习任务的统一处理。该方案显著降低内存开销，例如最轻量版本AdapTS-S仅需8 MB额外内存，相较现有方法大幅优化，使其适用于边缘部署场景。

链接: https://arxiv.org/abs/2603.17530
作者: Manuel Barusco,Davide Dalle Pezze,Francesco Borsatti,Gian Antonio Susto
机构: University of Padova, Italy (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual Anomaly Detection (VAD) is crucial for industrial inspection, yet most existing methods are limited to single-category scenarios, failing to address the multi-class and continual learning demands of real-world environments. While Teacher-Student (TS) architectures are efficient, they remain unexplored for the Continual Setting. To bridge this gap, we propose AdapTS, a unified TS framework designed for multi-class and continual settings, optimized for edge deployment. AdapTS eliminates the need for two different architectures by utilizing a single shared frozen backbone and injecting lightweight trainable adapters into the student pathway. Training is enhanced via a segmentation-guided objective and synthetic Perlin noise, while a prototype-based task identification mechanism dynamically selects adapters at inference with 99% accuracy. Experiments on MVTec AD and VisA demonstrate that AdapTS matches the performance of existing TS methods across multi-class and continual learning scenarios, while drastically reducing memory overhead. Our lightest variant, AdapTS-S, requires only 8 MB of additional memory, 13x less than STFPM (95 MB), 48x less than RD4AD (360 MB), and 149x less than DeSTSeg (1120 MB), making it a highly scalable solution for edge deployment in complex industrial environments. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17530 [cs.CV] (or arXiv:2603.17530v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.17530 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-71] MM-OVSeg:Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

【速读】：该论文旨在解决遥感图像中开放词汇分割（open-vocabulary segmentation）在多云或雾霾等恶劣天气条件下性能显著下降的问题。现有方法主要依赖清晰天空下的光学影像，难以应对气象干扰导致的语义信息缺失。其解决方案的关键在于提出MM-OVSeg框架，通过融合光学影像与合成孔径雷达（SAR）的多模态数据：一方面利用光学影像提供的丰富光谱语义信息，另一方面借助SAR穿透云层的结构线索增强鲁棒性；同时设计了跨模态统一过程以对齐多传感器表征，并引入双编码器融合模块整合多个视觉基础模型的分层特征，实现文本对齐的多模态分割，从而显著提升在复杂气象条件下的泛化能力与分割精度。

链接: https://arxiv.org/abs/2603.17528
作者: Yimin Wei,Aoran Xiao,Hongruixuan Chen,Junshi Xia,Naoto Yokoya
机构: The University of Tokyo (东京大学); RIKEN AIP (理化学研究所先进智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities–optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.

[CV-72] PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation CVPR2026

【速读】：该论文旨在解决现有视觉-语言模型（Vision-Language Models, VLMs）在开放词汇语义分割（Open-Vocabulary Semantic and Part Segmentation, OSPS）任务中，由于采用串行结构进行空间和类别聚合导致的类别级语义与空间上下文之间的知识干扰问题。解决方案的关键在于提出一种简单而有效的并行代价聚合（Parallel Cost Aggregation, PCA-Seg）范式，其核心是设计了一个专家驱动的感知学习（Expert-driven Perceptual Learning, EPL）模块：该模块通过多专家解析器从多个视角提取互补特征，并引入系数映射器自适应学习像素级权重以融合不同特征；同时结合特征正交解耦（Feature Orthogonalization Decoupling, FOD）策略减少语义流与上下文流间的冗余，从而实现更丰富且鲁棒的视觉-语言对齐特征表示。

链接: https://arxiv.org/abs/2603.17520
作者: Jianjian Yin,Tao Chen,Yi Chen,Gensheng Pei,Xiangbo Shu,Yazhou Yao,Fumin Shen
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing Normal University (南京师范大学); Sungkyunkwan University (成均馆大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.

[CV-73] UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images

【速读】：该论文旨在解决从稀疏、无姿态约束图像中进行语义感知的3D重建问题，特别是针对前馈式3D高斯泼溅（3D Gaussian Splatting, 3DGS）方法中存在的两个关键挑战：一是由于稀疏视图监督导致预测的高斯基元过量，引发几何不稳定和深度质量下降；二是仅依赖2D分割器特征进行语义提升，缺乏有效的3D级监督与泛化能力，造成新场景中3D语义不完整。解决方案的核心在于提出UniSem框架，包含两个关键组件：其一为误差感知的高斯丢弃（Error-aware Gaussian Dropout, EGD），通过渲染误差引导抑制冗余高斯基元，实现几何稳定的表示以提升深度估计精度；其二为混合训练课程（Mix-training Curriculum, MTC），渐进式融合2D分割器提取的语义与模型自身涌现的3D语义先验，并通过对象级原型对齐增强语义一致性与完整性，从而显著改善开放词汇3D分割性能。

链接: https://arxiv.org/abs/2603.17519
作者: Guibiao Liao,Qian Ren,Kaimin Liao,Hua Wang,Zhi Chen,Luchao Wang,Yaohua Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model’s own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.

[CV-74] EI: Early Intervention for Multimodal Imaging based Disease Recognition CVPR2026

【速读】：该论文旨在解决多模态医学图像疾病识别中的两大挑战：一是现有“先单模态嵌入后融合”的范式难以充分挖掘多模态数据间的互补与相关性；二是标注的多模态医学图像稀缺且与自然图像存在显著领域偏移，限制了视觉基础模型（Vision Foundation Models, VFMs）在医学图像嵌入中的应用。解决方案的关键在于提出一种名为“早期干预”（Early Intervention, EI）的新框架，该框架将一模态视为目标模态，其余作为参考模态，通过引入参考模态的高层语义token作为干预token，在嵌入过程早期对目标模态进行引导；同时设计了一种参数高效的微调方法——低秩混合适配（Mixture of Low-varied-Ranks Adaptation, MoR），利用不同秩的低秩适配器和权重松弛路由器实现VFMs的高效迁移。

链接: https://arxiv.org/abs/2603.17514
作者: Qijie Wei,Hailan Lin,Xirong Li
机构: Renmin University of China (中国人民大学); Beijing Key Laboratory for Intelligent Diagnosis of Fundus Diseases and Drug-Device RD and Translation (北京市眼科疾病智能诊断与药物器械研发转化重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings

点击查看摘要

Abstract:Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing “fusion after unimodal image embedding” paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address the challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality’s embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and keen anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.

[CV-75] Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

【速读】：该论文旨在解决当前大型多模态模型（Large Multimodal Models, LMMs）在将复杂、结构化的数字图形转换为可执行代码方面能力不足的问题。这一任务要求模型具备高保真视觉感知与精确生成表达的协同能力，即不仅要准确解析图形中的空间层次和符号细节，还需生成语法正确且逻辑一致的代码。解决方案的关键在于构建Omni-I2C基准测试集，其包含1080个精心设计的样本，覆盖多种主题、图像模态和编程语言，并引入真实用户来源案例以增强多样性；同时，评估框架通过解耦感知保真度与符号精度，深入揭示LMMs在结构完整性与推理瓶颈上的局限性，从而推动该领域从表面准确性向深层结构性理解演进。

链接: https://arxiv.org/abs/2603.17508
作者: Jiawei Zhou,Chi Zhang,Xiang Feng,Qiming Zhang,Haibo Qiu,Lihuo He,Dengpan Ye,Xinbo Gao,Jing Zhang
机构: Wuhan University (武汉大学); Independent Researcher (独立研究者); Meituan Inc (美团); Xidian University (西安电子科技大学); Guangzhou University (广州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 26 figures

点击查看摘要

Abstract:We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception – to parse intricate spatial hierarchies and symbolic details – and precise generative expression – to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content – from scientific visualizations to complex symbolic notations – each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at this https URL.

[CV-76] UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection

【速读】：该论文旨在解决低空环境中无人机（UAV）检测难题，特别是由复杂背景、伪装效应和多模态干扰导致的检测性能下降问题。现有数据集未能充分刻画真实场景中的伪装与复杂背景特性，限制了鲁棒感知能力的发展。为应对这一挑战，作者构建了专门针对复杂低空背景和伪装特征设计的RGB-T无人机检测数据集UAV-CB，并提出局部频域桥梁网络（Local Frequency Bridge Network, LFBNet），其核心创新在于通过建模局部频域特征来弥合频域-空间融合间隙与跨模态差异间隙，从而实现更有效的RGB-T多模态融合，在伪装和杂乱条件下展现出卓越的检测性能与鲁棒性。

链接: https://arxiv.org/abs/2603.17492
作者: Shenghui Huang,Menghao Hu,Longkun Zou,Hongyu Chi,Zekai Li,Feng Gao,Fan Yang,Qingyao Wu,Ke Chen
机构: Pengcheng Laboratory (鹏城实验室); South China University of Technology (华南理工大学); Peking University (北京大学); Xinjiang University (新疆大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications.

[CV-77] Revisiting Cross-Attention Mechanisms: Leverag ing Beneficial Noise for Domain-Adaptive Learning

【速读】：该论文旨在解决无监督域适应（Unsupervised Domain Adaptation, UDA）中因域间差异和尺度变化导致的性能下降问题，尤其关注跨域特征对齐时内容语义保持困难与尺度不一致带来的挑战。其解决方案的关键在于提出了一种名为“有益噪声”（beneficial noise）的新机制，通过在交叉注意力（cross-attention）中注入受控扰动，引导模型忽略风格干扰并聚焦于内容信息；同时设计了域自适应跨尺度匹配（Domain-Adaptive Cross-Scale Matching, DACSM）框架，包含域自适应Transformer（DAT）用于分离共享内容与域特定风格，并引入跨尺度匹配（Cross-Scale Matching, CSM）模块以实现多分辨率下的语义一致性对齐。该方法显著提升了模型在复杂域迁移场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.17474
作者: Zelin Zang,Yehui Yang,Fei Wang,Liangyu Li,Baigui Sun
机构: Westlake University (西湖大学); Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences (香港科学与创新研究院，中国科学院); Institute of Automation, Chinese Academy of Sciences (自动化研究所，中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging “truck” class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies. These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.

[CV-78] VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection CVPR2026

【速读】：该论文旨在解决单目3D目标检测中因依赖真实世界标注而导致的泛化能力受限问题，特别是现有基于伪标签的技术在利用语言提示（linguistic cues）时，难以捕捉场景内个体的视觉多样性，从而限制了模型学习场景感知表示的能力。其解决方案的关键在于提出一种自适应的多模态预训练范式——视觉引导的概率提示学习（Visual-referred Probabilistic Prompt Learning, VirPro），核心创新包括：构建可动态更新的自适应提示库（Adaptive Prompt Bank, APB）以存储跨场景的实例条件提示；引入多高斯提示建模（Multi-Gaussian Prompt Modeling, MGPM）将场景级视觉特征融入文本嵌入，使提示能表达视觉不确定性；并通过RoI级对比匹配机制增强模态对齐，提升同一场景中共现目标的语义一致性，从而显著提升模型性能，在KITTI基准上相较基线最高实现4.8%的平均精度提升。

链接: https://arxiv.org/abs/2603.17470
作者: Chupeng Liu,Jiyong Rao,Shangquan Sun,Runkai Zhao,Weidong Cai
机构: University of Sydney (悉尼大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026 Findings

点击查看摘要

Abstract:Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model’s ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement than the baseline.

[CV-79] AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization

【速读】：该论文旨在解决流式自回归（Streaming Autoregressive, AR）视频生成模型在通过人类反馈强化学习（Reinforcement Learning from Human Feedback, RLHF）进行对齐时面临的挑战，尤其是现有基于随机微分方程（SDE）的GRPO方法因短轨迹、低随机性及对初始噪声敏感而难以有效探索中间状态的问题。其解决方案的关键在于提出AR-CoPO（AutoRegressive Contrastive Policy Optimization），该框架将邻居GRPO的对比视角引入流式AR生成场景，通过在随机选定的片段（chunk）处引入分叉机制构建邻域候选序列，赋予序列级奖励并执行局部GRPO更新；同时设计了一种半在线策略（semi-on-policy）训练策略，结合经验回放缓冲区中的参考轨迹实现探索与利用的平衡，从而显著提升跨领域泛化能力和域内人类偏好对齐效果，避免了单纯奖励作弊（reward hacking）。

链接: https://arxiv.org/abs/2603.17461
作者: Dailan He,Guanlin Feng,Xingtong Ge,Yi Zhang,Bingqi Ma,Guanglu Song,Yu Liu,Hongsheng Li
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团); 3. Zhejiang University (浙江大学); 4. SenseTime (商汤科技); 5. Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.

[CV-80] FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

【速读】：该论文旨在解决情感视频字幕（Emotional Video Captioning, EVC）任务中因事实与情感线索挖掘不足及协调机制缺失而导致的“事实-情感偏差”问题，即不同样本在生成过程中对事实性和情感性要求存在差异，现有方法难以兼顾两者。解决方案的关键在于提出一种检索增强的框架FACE-net，其核心创新包括：1）引入外部语义库并检索相关句子以扩充语义信息；2）通过不确定性估计模块进行事实校准，将检索信息分解为三元组并结合视频内容进行自适应精炼；3）设计渐进式视觉情感增强模块，利用校准后的事实语义作为专家指导生成视觉查询和候选情感，并聚合至各事实语义以实现情感自适应增强；4）构建动态偏差调整路由模块，预测并调节样本级别的事实-情感偏差程度，从而提升生成描述的准确性与一致性。

链接: https://arxiv.org/abs/2603.17455
作者: Weidong Chen,Cheng Ye,Zhendong Mao,Peipei Song,Xinyan Liu,Lei Zhang,Xiaojun Chang,Yongdong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to TPAMI. 16 pages, 9 figures

点击查看摘要

Abstract:Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine with video content to generate descriptions. However, insufficient factual and emotional cues mining and coordination during generation make their methods difficult to deal with the factual-emotional bias, which refers to the factual and emotional requirements being different in different samples on generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which through a unified architecture collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions in all sample learning. Technically, we firstly introduces an external repository and retrieves the most relevant sentences with the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment emotions to each factual semantics. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.

[CV-81] AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

【速读】：该论文旨在解决高分辨率GUI截图中目标元素定位困难的问题，尤其是在UI元素尺寸小、自然语言指令模糊等挑战下，如何提升生成式AI（Generative AI）模型对图形用户界面（GUI）的精准理解与交互能力。其解决方案的关键在于提出AdaZoom-GUI框架：一是设计指令精炼模块，将原始自然语言指令重构为更明确、详尽的描述，增强模型对目标元素的理解；二是引入条件性缩放策略，在第一阶段预测的小尺寸元素上执行二次推理，从而在保证定位精度的同时避免对简单案例进行冗余计算和上下文丢失。该方法在公开基准测试中实现了参数规模相当或更大模型中的最优性能，验证了其在高分辨率GUI理解和实际GUI代理部署中的有效性。

链接: https://arxiv.org/abs/2603.17441
作者: Siqi Pei,Liang Tang,Tiaonan Duan,Long Chen,Shuxian Li,Kaer Huang,Yanzhe Jing,Yiqiang Yan,Bo Zhang,Chenghao Jiang,Borui Zhang,Jiwen Lu
机构: Lenovo Research(联想研究院); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.

[CV-82] ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

【速读】：该论文旨在解决交互式头部生成（Interactive Head Generation, IHG）中两个核心问题：一是现有方法依赖短时行为线索，缺乏对长程上下文的建模，导致合成的面部行为（Facial Behaviors, FBs）缺乏情境适切性；二是双轨信号（用户行为与预定义音频）在融合过程中存在角色无关的纠缠，引发跨信号干扰，影响说话时唇部区域的同步精度。解决方案的关键在于提出ECHO框架，其包含两个创新模块：一是长程上下文理解（Long-range Contextual Understanding, LCU）组件，用于联合建模行为驱动的动力学和语言驱动的情感语义，提升FBs的情境适切性和情感合理性；二是分块空间感知解耦交叉注意力调制（block-wise Spatial-aware Decoupled Cross-attention Modulation, SDCM）模块，实现自音频驱动的唇部运动保持与用户情境行为线索在非唇部区域的自适应融合，配合两阶段训练策略，协同优化唇部同步与视觉保真度。

链接: https://arxiv.org/abs/2603.17427
作者: Xiangyu Kong,Xiaoyu Jin,Yihan Pan,Haoqin Sun,Hengde Zhu,Xiaoming Xu,Xiaoming Wei,Lu Liu,Siyang Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user’s behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar’s audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO’s superior IHG performance.

[CV-83] SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

【速读】：该论文旨在解决视频扩散模型在训练后微调过程中出现的运动对齐问题，尤其是因微调导致的运动保真度下降，如运动动态减弱或长期时间一致性退化。其解决方案的关键在于提出基于像素运动奖励（pixel-motion rewards）的机制，该机制通过捕捉像素流场的动力学特性来同时衡量瞬时与长期的运动一致性；并进一步设计了平滑混合微调（Smooth Hybrid Fine-tuning, SHIFT）框架，将监督微调与优势加权微调统一为一个可扩展的奖励驱动优化流程，利用新型对抗性优势显著提升收敛速度并缓解奖励欺骗（reward hacking）问题，从而有效解决现代视频扩散模型在监督微调中常见的动态程度坍缩（dynamic-degree collapse）现象。

链接: https://arxiv.org/abs/2603.17426
作者: Xi Ye,Wenjia Yang,Yangyang Xu,Xiaoyang Liu,Duo Su,Mengfei Xia,Jun Zhu
机构: Tsinghua University (清华大学); Ant Group (蚂蚁集团); University of Chinese Academy of Science (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-conditioned Video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in video diffusion models post-training. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses the normal supervised fine-tuning and advantage weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse in modern video diffusion models supervised fine-tuning.

[CV-84] owards Motion-aware Referring Image Segmentation AISTATS2026

【速读】：该论文旨在解决现有参考图像分割（Referring Image Segmentation, RIS）方法在处理运动相关查询时性能显著低于外观相关查询的问题。其关键解决方案包括两个方面：一是提出一种高效的数据增强方案，通过从原始描述中提取以运动为中心的短语，使模型在无需额外标注的情况下接触更多运动表达；二是设计多模态径向对比学习（Multimodal Radial Contrastive Learning, MRaCL），在融合的图像-文本嵌入空间中进行对比学习，而非依赖单一模态表示，从而更好地捕捉上下文相关的对象描述差异。实验表明，该方法在多个RIS模型上显著提升了对运动相关查询的分割性能，同时保持了对外观描述的竞争力。

链接: https://arxiv.org/abs/2603.17413
作者: Chaeyun Kim,Seunghoon Yi,Yejin Kim,Yohan Jo,Joonseok Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AISTATS 2026. * Equal contribution

点击查看摘要

Abstract:Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, maintaining competitive results on appearance-based descriptions. Codes are available at this https URL

[CV-85] Mutually Causal Semantic Distillation Network for Zero-Shot Learning

【速读】：该论文旨在解决零样本学习（Zero-shot Learning, ZSL）中如何有效挖掘视觉特征与属性特征之间内在语义知识的问题，特别是针对现有方法仅依赖弱监督下的单向注意力机制所导致的伪相关性和语义表示不足问题。其解决方案的关键在于提出一种相互因果语义蒸馏网络（Mutually Causal Semantic Distillation Network, MSDN++），该网络包含两个互为因果的注意力子网：属性→视觉因果注意力子网用于学习基于属性的视觉特征，视觉→属性因果注意力子网用于学习基于视觉的属性特征；通过因果注意力机制引导两个子网共同学习可靠的因果视觉-属性关联，从而实现更鲁棒的语义知识迁移。此外，借助语义蒸馏损失函数，两个子网在训练过程中协同学习、相互指导，显著提升了ZSL任务的性能，在多个基准数据集上取得了新的最优结果。

链接: https://arxiv.org/abs/2603.17412
作者: Shiming Chen,Shuhuang Chen,Guo-Sen Xie,Xinge You
机构: Huazhong University of Science and Technology (华中科技大学); National Anti-Counterfeit Engineering Research Center (国家反假冒工程研究中心); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to IJCV. arXiv admin note: text overlap with arXiv:2203.03137

点击查看摘要

Abstract:Zero-shot learning (ZSL) aims to recognize the unseen classes in the open-world guided by the side-information (e.g., attributes). Its key task is how to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus conducting a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply utilize unidirectional attention within a weakly-supervised manner to learn the spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantic) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill the intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute \rightarrow visual causal attention sub-net that learns attribute-based visual features, and a visual \rightarrow attribute causal attention sub-net that learns visual-based attribute features. The causal attentions encourages the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on three widely-used benchmark datasets (e.g., CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performances.

[CV-86] Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression

【速读】：该论文旨在解决当前基于扩散模型的极端图像压缩（Extreme Image Compression, EIC）方法在超低比特率下存在的两大问题：一是现有方法需为每个目标比特率单独训练扩散模型，导致计算开销大、难以部署；二是传统联合超分辨率（Joint Super-Resolution）方法在极低比特率下因信息严重丢失而性能下降，且固定超分尺度无法灵活适配不同比特率。解决方案的关键在于提出ASSR-EIC框架，其核心创新包括：1）在编码端引入任意尺度下采样模块（Arbitrary-Scale Downsampling），实现可控比特率缩减；2）设计基于扩散的、联合退化感知的任意尺度超分辨率解码器（Diffusion-based Joint Degradation-Aware ASSR Decoder），支持单一模型内率适应性重建；3）通过全局压缩-重缩放适配器（Global Compression-Rescaling Adaptor）和局部压缩-重缩放调制器（Local Compression-Rescaling Modulator）协同引导重建过程，在保持高保真度的同时实现细粒度的比特率自适应细节恢复。

链接: https://arxiv.org/abs/2603.17408
作者: Xinning Chai,Zhengxue Cheng,Xin Li,Rong Xie,Li Song
机构: Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University (教育部人工智能重点实验室，上海交通大学人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on BroadCasting

点击查看摘要

Abstract:Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high fidelity and high realism restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction. Comments: Accepted by IEEE Transactions on BroadCasting Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17408 [cs.CV] (or arXiv:2603.17408v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.17408 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-87] Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion

【速读】：该论文旨在解决视频生成中如何在保持场景一致性的同时精准捕捉动态细节的问题，尤其是在基于冻结的Stable Diffusion模型上实现参数高效的视频生成。其核心解决方案是提出一种运动自适应的时间注意力机制（motion-adaptive temporal attention mechanism），该机制根据估计的运动强度动态调整时间注意力的感知范围：高运动序列采用局部注意力以保留快速变化的细节，低运动序列则采用全局注意力以增强场景一致性。通过级联策略将轻量级时间注意力模块注入UNet的各个Transformer块——下采样与中间块使用全局注意力进行语义稳定，上采样块使用运动自适应注意力进行精细重构，并结合时序相关噪声初始化和运动感知门控，仅引入25.8M可训练参数（占基础UNet的2.9%），即可在WebVid验证集上取得竞争力结果。

链接: https://arxiv.org/abs/2603.17398
作者: Rui Hong,Shuxue Quan
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 4 tables. Published at IST Electronic Imaging 2026, GENAI Track

点击查看摘要

Abstract:We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy – global attention in down-sampling and middle blocks for semantic stabilization, motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.

[CV-88] Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

【速读】：该论文旨在解决从单目RGB图像中准确估计三维手部姿态（3D hand pose）的问题，这是增强现实/虚拟现实（AR/VR）、人机交互和手语理解等应用的核心挑战。其解决方案的关键在于引入手势语义（gesture semantics）作为强有力的归纳偏置（inductive bias），通过两阶段框架实现：首先在InterHand2.6M数据集上利用粗粒度和细粒度手势标签进行手势感知预训练（gesture-aware pretraining），构建一个信息丰富的嵌入空间；随后采用基于手势嵌入引导的逐关节token Transformer结构，以中间表示形式辅助MANO手部参数的回归。该方法通过分层目标函数联合优化参数、关节点与结构约束，在InterHand2.6M上显著优于当前最优的EANet基线，并且性能提升可跨架构迁移，无需修改模型结构。

链接: https://arxiv.org/abs/2603.17396
作者: Rui Hong,Jana Kosecka
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.

[CV-89] Harnessing the Power of Foundation Models for Accurate Material Classification

【速读】：该论文旨在解决材料分类（material classification）任务中因标注数据稀缺而导致模型准确率和泛化能力受限的问题。现有基于视觉语言基础模型（vision-language foundation models, VLMs）的方法在材料识别上仍表现不佳，难以满足实际应用需求。解决方案的关键在于提出一个新颖框架，包含两个核心创新：一是构建一个鲁棒的图像生成与自动标注流水线，通过融合物体语义与材料属性的文本提示，自动生成多样化且高质量的材料中心图像并自动赋予标签；二是引入先验知识蒸馏策略，将VLM提取的信息作为先验约束，并结合联合微调方法优化预训练视觉基础模型，从而在保持广泛泛化能力的同时适配材料特定特征。实验表明，该方法显著提升了多个数据集上的分类性能，且合成数据能有效捕捉真实材料特性。

链接: https://arxiv.org/abs/2603.17390
作者: Qingran Lin,Fengwei Yang,Chaolun Zhu
机构: Georgia Institute of Technology (佐治亚理工学院); Duke University (杜克大学); Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific this http URL experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.

[CV-90] oward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

【速读】：该论文旨在解决生成自然、准确且视觉平滑的3D手语动作（sign language motion）这一难题，尤其在文本输入条件下的可控性与语义一致性问题。其解决方案的关键在于构建一个基于扩散模型（diffusion model）的3D人体运动生成框架，并系统探索音位属性（phonological attributes）如手形（hand shape）、手位（hand location）和运动模式（movement）对生成质量的影响。研究发现，将符号化音位标注（symbolic ASL-LEX notation）转化为自然语言表示是实现CLIP编码器有效属性条件控制的必要条件，而T5编码器则不受此转换影响；最优方案（CLIP+映射属性）在所有指标上均优于现有最先进方法SignAvatar，凸显了输入表示方式对文本编码器驱动的属性条件控制的重要性，并推动采用独立路径分别编码词义（gloss）与音位属性的结构化条件策略。

链接: https://arxiv.org/abs/2603.17388
作者: Rui Hong,Jana Kosecka
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on the text inputs continues to be very challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as hand shape, hand location and movement. We first establish a strong diffusion baseline using an Human Motion MDM-style diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.

[CV-91] VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm

【速读】：该论文旨在解决自动驾驶中新颖视图合成（Novel View Synthesis, NVS）的核心瓶颈问题，即模型在推理阶段需生成未见过的视角图像，但在训练阶段缺乏对应位姿下的真实图像监督，导致监督差距（supervision gap）。其解决方案的关键在于提出 VisionNVS 框架，通过引入“虚拟位移”（Virtual-Shift）策略，利用单目深度代理模拟遮挡模式并映射至原视图，将原本病态的外推问题转化为自监督的图像修复（inpainting）任务，从而使用原始录制图像作为像素级精确监督，有效消除域间隙；同时结合伪3D接缝合成（Pseudo-3D Seam Synthesis）策略，利用相邻摄像头的视觉数据建模真实世界的光度差异与标定误差，提升空间一致性，最终实现优于依赖LiDAR基线方法的几何保真度和视觉质量。

链接: https://arxiv.org/abs/2603.17382
作者: Hongbo Lu,Liang Yao,Chenghao He,Fan Liu,Wenlong Liao,Tao He,Pai Peng
机构: Shanghai Jiao Tong University (上海交通大学); Hohai University (河海大学); COWARobot Co. Ltd. (COWARobot公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a ``Virtual-Shift’’ strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.

[CV-92] Stereo World Model: Camera-Guided Stereo Video Generation

【速读】：该论文旨在解决单目RGB或RGBD方法在生成立体视频时存在的几何一致性差、计算效率低以及依赖深度估计等问题，提出了一种基于相机条件的立体世界模型StereoWorld。其核心解决方案在于：（1）引入统一的相机帧RoPE（Rotary Positional Encoding），通过增强潜在token的相机感知旋转位置编码，在保持预训练视频先验稳定的同时实现视图和时间一致性的条件控制；（2）设计立体感知注意力分解机制，将4D全注意力分解为3D视图内注意力与水平行注意力，利用极线约束高效捕捉视差对齐对应关系，显著降低计算开销。该方法仅使用RGB模态即可直接从视差中推断几何信息，实现了端到端的立体视频生成，并在多个基准上提升了立体一致性、视差精度和相机运动保真度。

链接: https://arxiv.org/abs/2603.17375
作者: Yang-Tian Sun,Zehuan Huang,Yifan Niu,Lin Ma,Yan-Pei Cao,Yuewen Ma,Xiaojuan Qi
机构: The University of Hong Kong (香港大学); VAST (视觉与智能系统实验室); ByteDance Pico (字节跳动Pico)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video this http URL monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.

[CV-93] Shot-Aware Frame Sampling for Video Understanding

【速读】：该论文旨在解决长视频理解中帧采样效率与关键事件保留之间的矛盾问题，即在受限帧数条件下，现有采样方法难以同时兼顾视频整体结构覆盖与短时关键事件（如异常或突变）的捕捉，从而影响下游任务的可靠性。解决方案的关键在于提出 InfoShot，一种任务无关、shot-aware 的帧采样器：首先将视频划分为语义一致的镜头（shot），然后从每个镜头中选择两个互补的关键帧——一个代表主内容，另一个捕捉镜头内的罕见变化；该策略基于信息论目标优化，旨在最大化采样帧对镜头结构和稀疏内部变化的信息保留能力，无需重新训练即可提升异常检测命中率和 Video-QA 准确性。

链接: https://arxiv.org/abs/2603.17374
作者: Mengyu Zhao,Di Fu,Yongyu Xie,Jiaxing Zhang,Zhigang Yuan,Shirin Jalali,Yong Cao
机构: ByteDance; Rutgers University; Georgia Tech
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.

[CV-94] Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

【速读】：该论文旨在解决大型视觉语言模型（Large Vision-Language Models, VLMs）在引入视觉模态后安全对齐能力下降的问题，即当文本提示包含明确有害意图时，添加图像会显著提升越狱（jailbreak）成功率。其解决方案的关键在于识别出一种与越狱行为相关的表示空间转移方向（jailbreak-related shift, JRS），并证明该方向能统一解释多种越狱场景中模型状态的变化机制——即视觉输入将模型内部表征推向特定的越狱状态，从而绕过拒绝响应。基于此发现，作者提出在推理阶段移除该JRS成分的防御方法（JRS-Rem），实验表明该方法可在保持良性任务性能的同时有效增强VLM的安全性。

链接: https://arxiv.org/abs/2603.17372
作者: Zhihua Wei,Qiang Li,Jian Ruan,Zhenxin Qin,Leilei Wen,Dongrui Liu,Wen Shen
机构: Tongji University (同济大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.

[CV-95] Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes

【速读】：该论文旨在解决无纹理网格中材料感知的部件分组问题（material-aware part grouping in untextured meshes），即在现实世界形状（如松果鳞片或建筑窗户）中，存在具有相同材质但几何形态各异的重复部件，传统方法需逐个手动识别和选择这些部件进行材质分配，效率低下。解决方案的关键在于提出 Material Magic Wand 工具，其核心是一个部件编码器（part encoder），能够生成兼顾局部几何与全局上下文信息的材料感知嵌入（material-aware embedding）；通过监督对比损失训练模型，使同材质部件的嵌入向量彼此靠近、异材质部件分离，从而实现基于已选部件嵌入的最近邻检索来自动获取其他可能共享相同材质的部件组。

链接: https://arxiv.org/abs/2603.17370
作者: Umangi Jain,Vladimir Kim,Matheus Gadelha,Igor Gilitschenski,Zhiqin Chen
机构: University of Toronto (多伦多大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We introduce the problem of material-aware part grouping in untextured meshes. Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations. When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming. To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties – when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context. We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials; therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part. To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries. We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.

[CV-96] MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval WWW2026

【速读】：该论文旨在解决组合图像检索（Composed Image Retrieval, CIR）中因文本修改导致的语义偏差问题，即现有方法难以从参考图像中准确提取与用户意图一致的视觉语义线索，从而受到无关视觉噪声干扰。其解决方案的关键在于提出多模态链式思维推理引导的多层级视觉选择机制（MCoT-MVS）：首先利用多模态大语言模型（MLLM）对输入进行链式思维推理，生成保留、移除和目标推断的文本提示；随后，这些文本线索指导两个视觉注意力模块分别提取参考图像中的patch级和实例级判别性语义特征；最后通过加权分层融合模块将多粒度视觉特征与修改后的文本及想象的目标描述对齐于统一嵌入空间，实现更精准的图像检索。

链接: https://arxiv.org/abs/2603.17360
作者: Xuri Ge,Chunhao Wang,Xindi Wang,Zheyun Qin,Zhumin Chen,Xin Xin
机构: Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The Web Conference 2026 (WWW2026)

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user’s intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.

[CV-97] A 3D Reconstruction Benchmark for Asset Inspection

【速读】：该论文旨在解决资产检测中高保真三维（3D）重建方法在实际应用中的性能瓶颈问题，特别是在密集采集轨迹和复杂表面特性（如非朗伯表面、反射与透明材质）条件下，现有重建算法难以稳定输出高质量模型的问题。其解决方案的关键在于构建一个包含真实场景模拟的新型数据集，该数据集提供精确的深度图、相机位姿和网格模型，覆盖三种合成场景并涵盖不同表面状态，从而为评估先进重建方法提供可靠基准，揭示当前技术在可扩展性上的显著不足，并推动面向部署的3D重建研究方向的发展。

链接: https://arxiv.org/abs/2603.17358
作者: James L. Gray,Nikolai Goncharov,Alexandre Cardaillac,Ryan Griffiths,Jack Naylor,Donald G. Dansereau(Australian Centre for Robotics, School of Aerospace, Mechanical and Mechatronic Engineering, University of Sydney, Sydney, NSW, Australia)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 29 pages, 15 figures, 8 tables

点击查看摘要

Abstract:Asset management requires accurate 3D models to inform the maintenance, repair, and assessment of buildings, maritime vessels, and other key structures as they age. These downstream applications rely on high-fidelity models produced from aerial surveys in close proximity to the asset, enabling operators to locate and characterise deterioration or damage and plan repairs. Captured images typically have high overlap between adjacent camera poses, sufficient detail at millimetre scale, and challenging visual appearances such as reflections and transparency. However, existing 3D reconstruction datasets lack examples of these conditions, making it difficult to benchmark methods for this task. We present a new dataset with ground truth depth maps, camera poses, and mesh models of three synthetic scenes with simulated inspection trajectories and varying levels of surface condition on non-Lambertian scene content. We evaluate state-of-the-art reconstruction methods on this dataset. Our results demonstrate that current approaches struggle significantly with the dense capture trajectories and complex surface conditions inherent to this domain, exposing a critical scalability gap and pointing toward new research directions for deployable 3D reconstruction in asset inspection. Project page: this https URL

[CV-98] OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery CVPR2026

【速读】：该论文旨在解决现有单目视频中人体三维重建（Human Mesh Recovery, HMR）方法多为离线处理、依赖未来帧或全局优化，从而难以应用于AR/VR和远程呈现等需要实时交互反馈与感知-动作闭环场景的问题。其解决方案的关键在于提出OnlineHMR框架，该框架通过双分支架构实现在线处理的四大核心要求：系统级因果性（causality）、忠实性（faithfulness）、时序一致性（temporal consistency）和效率（efficiency）。具体而言，采用因果键值缓存设计支持流式推理，并结合精选滑动窗口学习策略；同时引入以人为中心的增量式SLAM（Simultaneous Localization and Mapping）机制，在物理合理轨迹修正下实现世界坐标系下的在线对齐，从而在保持与块处理方法相当性能的同时，首次实现了真正意义上的在线处理能力。

链接: https://arxiv.org/abs/2603.17355
作者: Yiwen Zhao,Ce Zheng,Yufu Wang,Hsueh-Han Daniel Yang,Liting Wen,Laszlo A. Jeni
机构: Carnegie Mellon University (卡内基梅隆大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Human mesh recovery (HMR) models 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at this https URL.

[CV-99] EvoGuard: An Extensible Agent ic RL-based Framework for Practical and Evolving AI-Generated Image Detection

【速读】：该论文旨在解决AI生成图像（AIGI）检测中面临的挑战，即传统方法依赖低级特征导致泛化能力不足，而基于多模态大语言模型（MLLM）的方法虽具较强理解能力但仍存在扩展性差和标注成本高的问题。解决方案的关键在于提出EvoGuard——一个新颖的代理式（agentic）框架，其通过能力感知的动态编排机制将多种现成的SOTA检测器（包括MLLM与非MLLM类型）封装为可调用工具，并利用代理的自主规划与反思能力，在多轮交互推理中智能选择工具、评估中间结果并决策下一步动作，从而有效融合异构检测器的优势，突破单一模型限制；此外，该框架采用仅需低成本二值标签的GRPO增强型代理强化学习算法进行优化，无需精细标注即可实现性能提升，且支持新检测器的即插即用集成，以零训练方式持续增强整体检测效能，为应对不断演化的AIGI威胁提供了实用且可持续的解决方案。

链接: https://arxiv.org/abs/2603.17343
作者: Chenyang Zhu,Maorong Wang,Jun Liu,Ching-Chun Chang,Isao Echizen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid proliferation of AI-Generated Images (AIGIs) has introduced severe risks of misinformation, making AIGI detection a critical yet challenging task. While traditional detection paradigms mainly rely on low-level features, recent research increasingly focuses on leveraging the general understanding ability of Multimodal Large Language Models (MLLMs) to achieve better generalization, but still suffer from limited extensibility and expensive training data annotations. To better address complex and dynamic real-world environments, we propose EvoGuard, a novel agentic framework for AIGI detection. It encapsulates various state-of-the-art (SOTA) off-the-shelf MLLM and non-MLLM detectors as callable tools, and coordinates them through a capability-aware dynamic orchestration mechanism. Empowered by the agent’s capacities for autonomous planning and reflection, it intelligently selects suitable tools for given samples, reflects intermediate results, and decides the next action, reaching a final conclusion through multi-turn invocation and reasoning. This design effectively exploits the complementary strengths among heterogeneous detectors, transcending the limits of any single model. Furthermore, optimized by a GRPO-based Agentic Reinforcement Learning algorithm using only low-cost binary labels, it eliminates the reliance on fine-grained annotations. Extensive experiments demonstrate that EvoGuard achieves SOTA accuracy while mitigating the bias between positive and negative samples. More importantly, it allows the plug-and-play integration of new detectors to boost overall performance in a train-free manner, offering a highly practical, long-term solution to ever-evolving AIGI threats. Source code will be publicly available upon acceptance.

[CV-100] FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）中视觉编码器（vision encoder）的性能瓶颈问题，尤其在细粒度感知任务中表现不足。传统基于CLIP的视觉编码器因低分辨率预训练导致视觉细节丢失，且依赖噪声较大的网络爬取图像-文本对，难以支持密集空间任务。解决方案的关键在于提出FineViT，其核心创新是采用渐进式训练策略：首先在数十亿条全球重描述（recaptioned）图像-文本对上从零开始训练高分辨率视觉编码器，建立丰富的语义基础；随后利用自建的FineCap-450M数据集（包含超4.5亿条高质量局部描述）进行大语言模型（LLM）对齐，显著提升局部感知能力。该方法系统性缓解了信息损失，使FineViT在零样本识别与长上下文检索等任务中达到当前最优性能。

链接: https://arxiv.org/abs/2603.17326
作者: Peisen Zhao,Xiaopeng Zhang,Mingxing Xu,Ruoyu Sun,Zewei Du,Dunzheng Wang,Guanghao Zheng,Haohang Xu,Zhibo Zhang,Yuhang Zhang,Yi Ai,Lin Liu,Qi Tian
机构: Huawei Inc. (华为公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm.: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset that comprises over 450 million high quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT could serve as a powerful new baseline for fine-grained visual perception.

[CV-101] MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation

【速读】：该论文旨在解决医学异常检测（Medical Anomaly Detection, MAD）中现有基于CLIP（Contrastive Language-Image Pretraining）的方法在零样本或少样本场景下依赖全局特征和弱监督信号，导致病灶定位粗略、分割质量有限的问题。其解决方案的关键在于提出MedSAD-CLIP模型，通过引入Token-Patch Cross-Attention（TPCA）机制挖掘细粒度图文线索以提升病灶定位精度，同时结合轻量级图像适配器与可学习提示词（prompt tokens）高效适应预训练CLIP编码器至医疗领域，并设计基于边距的图像-文本对比损失（Margin-based image-text Contrastive Loss）增强正常与异常表征之间的全局特征判别能力，从而在有限标注数据条件下实现高精度的像素级分割与图像级分类。

链接: https://arxiv.org/abs/2603.17325
作者: Thuy Truong Tran,Minh Kha Do,Phuc Nguyen Duy,Min Hun Lee
机构: Singapore Management University (新加坡管理大学); La Trobe University (拉特罗布大学); Vietnam National University, Hanoi (河内国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies are promising for anomaly detection in zero-/few-shot settings, and typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via the Token-Patch Cross-Attention(TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks-Brain, Retina, Lung, and Breast datasets-demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at this https URL

[CV-102] A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition

【速读】：该论文旨在解决** grounded Multimodal Named Entity Recognition (GMNER) **中的关键挑战，即现有方法通常将任务拆分为两个独立步骤：先使用预训练的通用目标检测器提取图像区域，再匹配文本中的命名实体。然而，这类方法因对象检测器与文本实体之间缺乏协同优化，常忽略细粒度语义区域，导致定位不准确和性能下降。其解决方案的核心是提出一种无提案（proposal-free）的Query-Guided Network (QGN)，通过文本引导和跨模态交互统一多模态推理与解码过程，从而实现更精准的实体定位和在开放域场景下的鲁棒性能。

链接: https://arxiv.org/abs/2603.17314
作者: Hongbing Li,Jiamin Liu,Shuo Zhang,Bo Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) identifies named entities, including their spans and types, in natural language text and grounds them to the corresponding regions in associated images. Most existing approaches split this task into two steps: they first detect objects using a pre-trained general-purpose detector and then match named entities to the detected objects. However, these methods face a major limitation. Because pre-trained general-purpose object detectors operate independently of textual entities, they tend to detect common objects and frequently overlook specific fine-grained regions required by named entities. This misalignment between object detectors and entities introduces imprecision and can impair overall system performance. In this paper, we propose a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding through text guidance and cross- modal interaction. QGN enables accurate grounding and robust performance in open-domain scenarios. Extensive experiments demonstrate that QGN achieves top performance among compared GMNER models on widely used benchmarks. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2603.17314 [cs.CV] (or arXiv:2603.17314v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.17314 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hongbing Li [view email] [v1] Wed, 18 Mar 2026 03:16:41 UTC (10,859 KB) Full-text links: Access Paper: View a PDF of the paper titled A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition, by Hongbing Li and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-03 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[CV-103] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress CVPR2026

【速读】：该论文旨在解决具身智能体在执行长时序多步骤任务时，如何准确估计任务进展的问题。现有基于视觉-语言模型（Vision-Language Models, VLMs）的方法主要依赖其视频理解能力，忽视了其复杂推理潜力，且处理长视频轨迹计算开销大，难以部署于真实场景。解决方案的关键在于提出一种递归推理视觉-语言模型（Recurrent Reasoning Vision-Language Model, \textR^2 VLM），其核心设计为：通过迭代处理局部视频片段，并利用一个动态演化的思维链（Chain of Thought, CoT）维护全局上下文，显式记录任务分解、关键步骤及其完成状态，从而实现对复杂时间依赖关系的有效推理。该机制在避免高成本处理长视频的同时，保留了VLM的推理能力，显著提升了任务进展估计的准确性与泛化性能。

链接: https://arxiv.org/abs/2603.17312
作者: Yuelin Zhang,Sijie Cheng,Chen Li,Zongzhao Li,Yuxin Huang,Yang Liu,Wenbing Huang
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); Beijing Key Laboratory of Research on Large Models and Intelligent Governance(北京市大模型与智能治理重点实验室); Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE(教育部下一代智能搜索与推荐工程研究中心); RayNeo.AI; Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系); Institute for AI Industry Research (AIR), Tsinghua University(清华大学人工智能产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026

点击查看摘要

Abstract:Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Models (VLMs) based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ( \textR^2 VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train \textR^2 VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that \textR^2 VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation. The models and benchmarks are publicly available at \hrefthis https URLhuggingface.

[CV-104] Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding CVPR2026

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在长视频理解（Long-form Video Understanding, LVU）任务中面临的挑战，包括高信息密度与长时程跨度导致的推理困难，以及现有方法因简单任务分解或嵌入式检索导致关键信息丢失的问题。其解决方案的关键在于提出了一种名为Symphony的多智能体系统，该系统通过模拟人类认知模式，将LVU任务细粒度拆解，并引入基于反思（reflection）的深度推理协作机制以增强复杂链式推理能力；同时，采用视觉语言模型（Vision-Language Model, VLM）驱动的定位方法来分析视频内容并评估片段相关性，从而显著提升对具有隐含意图和长时间跨度问题的识别与定位能力。

链接: https://arxiv.org/abs/2603.17307
作者: Haiyang Yan,Hongyun Zhou,Peng Xu,Xiaoxue Feng,Mengyi Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Kuaishou Technology (快手科技); School of Future Technology, University of Chinese Academy of Sciences (中国科学院大学未来技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by cvpr2026

点击查看摘要

Abstract:Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at this https URL.

[CV-105] 3D MRI-Based Alzheimers Disease Classification Using Multi-Modal 3D CNN with Leakage-Aware Subject-Level Evaluation

【速读】：该论文旨在解决阿尔茨海默病（Alzheimer’s disease, AD）分类中因依赖二维切片分析而可能忽略脑部三维空间结构关系的问题。现有研究多基于从MRI体积中提取的单个2D切片进行分析，但临床神经影像学实践更依赖于完整的三维脑结构信息。为此，作者提出了一种多模态三维卷积神经网络（multimodal 3D convolutional neural network），直接使用原始OASIS 1 MRI体积数据，融合T1加权结构信息与通过FSL FAST分割获得的灰质、白质和脑脊液概率图，以捕获互补的神经解剖学特征。该方案的关键在于利用全三维空间建模能力，更好地捕捉与AD进展相关的脑区空间关系，并在OASIS 1队列上通过5折受试者级别交叉验证实现72.34%的平均准确率和0.7781的ROC AUC，同时GradCAM可视化验证模型关注到与AD相关的关键区域（如内侧颞叶和脑室区）。

链接: https://arxiv.org/abs/2603.17304
作者: Md Sifat,Sania Akter,Akif Islam,Md. Ekramul Hamid,Abu Saleh Musa Miah,Najmul Hassan,Md Abdur Rahim,Jungpil Shin
机构: 1: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (孟加拉国工程技术大学); 2: School of Computing, Korea University of Science and Technology (韩国科学技术院); 3: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 tables, 6 figures, Submitted to International Conference on Power, Electronics, Communications, Computing, and Intelligent Infrastructure 2026

点击查看摘要

Abstract:Deep learning has become an important tool for Alzheimer’s disease (AD) classification from structural MRI. Many existing studies analyze individual 2D slices extracted from MRI volumes, while clinical neuroimaging practice typically relies on the full three dimensional structure of the brain. From this perspective, volumetric analysis may better capture spatial relationships among brain regions that are relevant to disease progression. Motivated by this idea, this work proposes a multimodal 3D convolutional neural network for AD classification using raw OASIS 1 MRI volumes. The model combines structural T1 information with gray matter, white matter, and cerebrospinal fluid probability maps obtained through FSL FAST segmentation in order to capture complementary neuroanatomical information. The proposed approach is evaluated on the clinically labelled OASIS 1 cohort using 5 fold subject level cross validation, achieving a mean accuracy of 72.34% plus or minus 4.66% and a ROC AUC of 0.7781 plus or minus 0.0365. GradCAM visualizations further indicate that the model focuses on anatomically meaningful regions, including the medial temporal lobe and ventricular areas that are known to be associated with Alzheimer’s related structural changes. To better understand how data representation and evaluation strategies may influence reported performance, additional diagnostic experiments were conducted on a slice based version of the dataset under both slice level and subject level protocols. These observations help provide context for the volumetric results. Overall, the proposed multimodal 3D framework establishes a reproducible subject level benchmark and highlights the potential benefits of volumetric MRI analysis for Alzheimer’s disease classification.

[CV-106] Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation

【速读】：该论文旨在解决故事可视化（Story Visualization）中长期存在的角色身份不一致（identity drift）和视觉风格不稳定问题，尤其是在复杂交互或长篇叙事场景下，现有方法难以保持角色特征与视觉风格的一致性。其解决方案的关键在于提出一个两阶段协同框架：第一阶段引入组共享注意力机制（Group-Shared Attention, GSA），通过在注意力层内实现无损跨样本信息传递，无需外部编码器即可结构化地建模帧间身份对应关系；第二阶段采用直接偏好优化（Direct Preference Optimization, DPO），利用整体偏好数据同时提升图像保真度与身份一致性，避免传统方法中辅助损失函数之间的冲突。实验表明，该方法在ViStoryBench基准上显著优于现有基线，在角色身份一致性（CIDS）和风格一致性（CSD）指标上分别提升+10.0和+18.7，实现了高质量且一致性的故事图像生成。

链接: https://arxiv.org/abs/2603.17295
作者: Jianzhang Zhang,Yijing Tian,Jiwang Qu,Chuang Liu
机构: Hangzhou Normal University (杭州师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state-of-the-art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.

[CV-107] DANCE: Dynamic 3D CNN Pruning: Joint Frame Channel and Feature Adaptation for Energy Efficiency on the Edge

【速读】：该论文旨在解决现代3D卷积神经网络（3D CNNs）在视频和图像处理中无法根据输入样本的计算复杂度动态调整资源消耗的问题，从而导致能效低下。其解决方案的关键在于提出了一种细粒度、输入感知的动态剪枝框架DANCE，包含两个核心步骤：首先通过激活变异增强（Activation Variability Amplification, AVA）对模型进行再训练，以提升神经元激活幅度的方差，使剪枝决策更适应多样化的输入场景；其次通过自适应激活剪枝（Adaptive Activation Pruning, AAP）引入轻量级激活控制器网络，基于第一层输出统计信息动态剪枝每一层的帧、通道和特征，实现卷积层内部稀疏化，在几乎不损失性能的前提下显著减少乘加（MAC）操作和内存访问，最终在NVIDIA Jetson Nano和Qualcomm Snapdragon 8 Gen 1平台上分别获得1.37倍和2.22倍的速度提升及最高达1.47倍的能效提升。

链接: https://arxiv.org/abs/2603.17275
作者: Mohamed Mejri,Ashiqur Rasul,Abhijit Chatterjee
机构: Georgia Tech (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input samples in a dynamic manner to minimize energy consumption. In this research, we propose DANCE, a fine-grained, input-aware, dynamic pruning framework for 3D CNNs to maximize power efficiency with negligible to zero impact on performance. In the proposed two-step approach, the first step is called activation variability amplification (AVA), and the 3D CNN model is retrained to increase the variance of the magnitude of neuron activations across the network in this step, facilitating pruning decisions across diverse CNN input scenarios. In the second step, called adaptive activation pruning (AAP), a lightweight activation controller network is trained to dynamically prune frames, channels, and features of 3D convolutional layers of the network (different for each layer), based on statistics of the outputs of the first layer of the network. Our method achieves substantial savings in multiply-accumulate (MAC) operations and memory accesses by introducing sparsity within convolutional layers. Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform demonstrates respective speedups of 1.37X and 2.22X, achieving up to 1.47X higher energy efficiency compared to the state of the art.

[CV-108] ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos

【速读】：该论文旨在解决教育人工智能（Educational AI）中从视频中识别与定位学生困惑（student confusion）这一重要但具有挑战性的问题。现有混淆数据集存在标签噪声、时间标注粗粒度及专家验证不足等问题，限制了细粒度识别与时间对齐分析的可靠性。解决方案的关键在于提出一个实用的多阶段过滤流水线，整合两阶段模型辅助筛选、研究人员校正和专家验证，从而构建高质量的混淆理解基准。基于此流程，作者提出了ConfusionBench基准数据集，包含平衡的片段级混淆识别数据集和视频定位数据集，并提供了代表性开源模型与专有模型的零样本基线评估，为后续研究提供可靠基础。

链接: https://arxiv.org/abs/2603.17267
作者: Lu Dong,Xiao Wang,Mark Frank,Srirangaraj Setlur,Venu Govindaraju,Ifeoma Nwogu
机构: State University of New York at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recognizing and localizing student confusion from video is an important yet challenging problem in educational AI. Existing confusion datasets suffer from noisy labels, coarse temporal annotations, and limited expert validation, which hinder reliable fine-grained recognition and temporally grounded analysis. To address these limitations, we propose a practical multi-stage filtering pipeline that integrates two stages of model-assisted screening, researcher curation, and expert validation to build a higher-quality benchmark for confusion understanding. Based on this pipeline, we introduce ConfusionBench, a new benchmark for educational videos consisting of a balanced confusion recognition dataset and a video localization dataset. We further provide zero-shot baseline evaluations of a representative open-source model and a proprietary model on clip-level confusion recognition, long-video confusion localization tasks. Experimental results show that the proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and more prone to missed detections. In addition, the proposed student confusion report visualization can support educational experts in making intervention decisions and adapting learning plans accordingly. All datasets and related materials will be made publicly available on our project page.

[CV-109] GigaWorld-Policy: An Efficient Action-Centered World–Action Model

【速读】：该论文旨在解决当前世界-动作模型（World-Action Models, WAM）在机器人策略学习中面临的两个关键瓶颈：一是未来视觉动态与对应动作的联合推理导致显著的推理开销；二是联合建模常使视觉与运动表征纠缠，致使运动预测精度高度依赖于未来视频预测的质量。解决方案的关键在于提出GigaWorld-Policy，一种以动作为中心的世界模型，其通过显式分离动作预测与视频生成模块，在保持物理合理性的同时实现高效推理。具体而言，该方法将策略训练分解为两个耦合组件：基于当前观测预测未来动作序列，并基于预测动作和相同观测生成未来视频；策略同时受动作预测和视频生成监督，从而提供更丰富的学习信号并借助视觉-动态约束提升动作的物理合理性。此外，采用因果设计避免未来视频token影响动作token，使得视频生成在推理阶段可选，从而实现9倍于基线Motus的推理速度提升，且任务成功率提高7%。

链接: https://arxiv.org/abs/2603.17240
作者: Angen Ye,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Hengtao Li,Jie Li,Jindi Lv,Jingyu Liu,Min Cao,Peng Li,Qiuping Deng,Wenjun Mei,Xiaofeng Wang,Xinze Chen,Xinyu Zhou,Yang Wang,Yifan Chang,Yifan Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu
机构: GigaAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.

[CV-110] Neural Radiance Maps for Extraterrestrial Navigation and Path Planning

【速读】：该论文旨在解决自主行星探测车在缺乏全局地图的情况下难以实现高效、安全路径规划的问题。当前探测车的自主性受限于无法便捷构建和存储用于在线重规划的全局地图，从而影响探索效率与科学目标达成速度。解决方案的关键在于利用神经辐射场（Neural Radiance Fields, NeRFs）作为高精度三维场景表示，从稀疏二维图像中训练并高效存储地图，并在此基础上提出一种融合局部与全局信息的路径规划框架：通过在NeRF地图提取的地形特征上使用核岭回归（kernel ridge regression）对局部代价观测进行跨区域插值，使探测车能够在运行过程中识别不可通行区域后动态调整路径。实验表明，该方法相比多种基线方案具有更低的路径代价和更高的成功率。

链接: https://arxiv.org/abs/2603.17236
作者: Adam Dai,Shubh Gupta,Grace Gao
机构: Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in the Proceedings of the ION GNSS+ 2023 Conference

点击查看摘要

Abstract:Autonomous vehicles such as the Mars rovers currently lead the vanguard of surface exploration on extraterrestrial planets and moons. In order to accelerate the pace of exploration and science objectives, it is critical to plan safe and efficient paths for these vehicles. However, current rover autonomy is limited by a lack of global maps which can be easily constructed and stored for onboard re-planning. Recently, Neural Radiance Fields (NeRFs) have been introduced as a detailed 3D scene representation which can be trained from sparse 2D images and efficiently stored. We propose to use NeRFs to construct maps for online use in autonomous navigation, and present a planning framework which leverages the NeRF map to integrate local and global information. Our approach interpolates local cost observations across global regions using kernel ridge regression over terrain features extracted from the NeRF map, allowing the rover to re-route itself around untraversable areas discovered during online operation. We validate our approach in high-fidelity simulation and demonstrate lower cost and higher percentage success rate path planning compared to various baselines.

[CV-111] Visual SLAM with DEM Anchoring for Lunar Surface Navigation

【速读】：该论文旨在解决月球表面长期自主导航中因缺乏全球定位系统（Global Positioning System, GPS）、极端光照条件及低纹理月壤导致的视觉惯性里程计（Visual-Inertial Odometry, VIO）累积漂移问题，从而实现数十公里级地形下的高精度定位与全局一致地图构建。其解决方案的关键在于提出一种融合学习驱动特征检测与匹配的立体视觉同步定位与建图（Stereo Visual Simultaneous Localization and Mapping, SLAM）系统，并通过数字高程模型（Digital Elevation Model, DEM）提供的高度和表面法向量约束，在位姿图优化中引入绝对地表约束，有效抑制长时间行驶中的漂移，尤其在重复或视觉混淆的地形中表现稳定。

链接: https://arxiv.org/abs/2603.17229
作者: Adam Dai,Guillem Casadesus Vila,Grace Gao
机构: Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Aerospace Conference 2026

点击查看摘要

Abstract:Future lunar missions will require autonomous rovers capable of traversing tens of kilometers across challenging terrain while maintaining accurate localization and producing globally consistent maps. However, the absence of global positioning systems, extreme illumination, and low-texture regolith make long-range navigation on the Moon particularly difficult, as visual-inertial odometry pipelines accumulate drift over extended traverses. To address this challenge, we present a stereo visual simultaneous localization and mapping (SLAM) system that integrates learned feature detection and matching with global constraints from digital elevation models (DEMs). Our front-end employs learning-based feature extraction and matching to achieve robustness to illumination extremes and repetitive terrain, while the back-end incorporates DEM-derived height and surface-normal factors into a pose graph, providing absolute surface constraints that mitigate long-term drift. We validate our approach using both simulated lunar traverse data generated in Unreal Engine and real Moon/Mars analog data collected from Mt. Etna. Results demonstrate that DEM anchoring consistently reduces absolute trajectory error compared to baseline SLAM methods, lowering drift in long-range navigation even in repetitive or visually aliased terrain.

[CV-112] From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLM s

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在像素级视觉任务中空间理解能力不足的问题，特别是其在图像分割任务中的表现机制尚不清晰。解决方案的关键在于通过分层线性探针评估整个MLLM处理流程（视觉编码器、适配器和语言模型层），结合基于注意力掩蔽的干预分析，揭示了适配器引入的分割表示退化现象，并发现语言模型层通过注意力驱动的精炼机制逐步恢复分割性能——其中正确分类的token能够引导邻近误分类token修正标签；同时，早期图像token位置的恢复受限于因果注意力结构，而图像token间的双向注意力可缓解这一限制。该研究为理解MLLM如何处理视觉信息以实现分割提供了机制层面的洞见。

链接: https://arxiv.org/abs/2603.17228
作者: Boyong Wu,Sanghwan Kim,Zeynep Akata
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.

[CV-113] Adaptive Anchor Policies for Efficient 4D Gaussian Streaming

【速读】：该论文旨在解决动态场景重建中基于高斯点绘（Gaussian Splatting）的流式渲染方法在锚点（anchor）选择上的计算效率低下问题。现有方法通常采用固定数量的锚点（如8,192个）并使用远距离点采样（Farthest Point Sampling, FPS），导致在复杂度较低的场景中资源浪费，难以满足严格计算预算下的实时性需求。其解决方案的关键在于提出一种名为Efficient Gaussian Streaming (EGS) 的可插拔、预算感知的锚点采样器，该采样器以强化学习策略替代FPS，能够在离散约束下联合决策最优锚点数量与子集，利用高斯表示的空间特征实现重建质量与运行效率之间的平衡。实验表明，EGS在保持高质量的同时显著降低锚点数量，并在多个动态多视角数据集上优于传统FPS策略。

链接: https://arxiv.org/abs/2603.17227
作者: Ashim Dahal,Rabab Abdelfattah,Nick Rahimi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality–efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors ( 32\times fewer than 8,192), EGS improves PSNR by +0.52 – 0.61 ,dB while running 1.29 – 1.35\times faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. \emphCode and pretrained checkpoints will be released upon acceptance. \keywords4D Gaussian Splatting \and 4D Gaussian Streaming \and Reinforcement Learning

[CV-114] SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization MICCAI2026

【速读】：该论文旨在解决多中心神经影像分析中因扫描仪引起的协变量偏移问题，即不同成像协议下体素强度的边缘分布 $ P(\mathbf{x}) $ 非线性变化，而解剖条件分布 $ P(\mathbf{y}|\mathbf{x}) $ 保持不变，这对放射组学（radiomic）的可重复性造成严重干扰。现有统计校正方法（如ComBat）仅在特征空间操作，无法支持空间下游任务；标准深度学习方法则受限于局部有效感受野（effective receptive field, ERF），难以建模场强偏差所导致的全局强度相关性。其解决方案的关键在于提出SA-CycleGAN-2.5D框架，通过三个核心创新实现：(1) 2.5D三平面流形注入机制，在 $ O(HW) $ 复杂度下保留跨层梯度 $\nabla_z$ ；(2) 带密集体素到体素自注意力机制的U-ResNet生成器，突破CNN的 $ O(\sqrt{L}) $ 感受野限制，以建模全局扫描仪场强偏差；(3) 谱归一化判别器约束Lipschitz常数 $ K_D \le 1 $，保障对抗优化稳定性。实验证明该方法显著降低最大均值差异（MMD）达99.1%，并使域分类准确率接近随机水平（59.7%），且全局注意力对异质到同质翻译方向至关重要（Cohen’s $ d = 1.32, p < 0.001 $），从而实现保持肿瘤病理生理特性的体素级图像校正，支撑多中心放射组学研究的可重复性。

链接: https://arxiv.org/abs/2603.17219
作者: Ishrith Gowda,Chunwei Liu
机构: University of California, Berkeley (加州大学伯克利分校); Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 5 figures, 5 tables. Submitted to MICCAI 2026

点击查看摘要

Abstract:Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities P(\mathbfx) varies non-linearly across acquisition protocols while the conditional anatomy P(\mathbfy|\mathbfx) remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the H\Delta H -divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients \nabla_z at O(HW) complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the O(\sqrtL) receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ( K_D \le 1 ) for stable adversarial optimization. Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ( 1.729 \to 0.015 ) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen’s d = 1.32 , p 0.001 ) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis. Comments: 12 pages, 5 figures, 5 tables. Submitted to MICCAI 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: I.4.3; I.4.9; I.2.6 Cite as: arXiv:2603.17219 [cs.CV] (or arXiv:2603.17219v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.17219 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-115] Patient4D: Temporally Consistent Patient Body Mesh Recovery from Monocular Operating Room Video

【速读】：该论文旨在解决单目视频中人体三维网格（3D body mesh）重建在手术增强现实（surgical augmented reality, AR）场景下的挑战，特别是当患者被手术布帘遮挡且摄像机视角持续变化时，现有的人体网格重建（Human Mesh Recovery, HMR）方法因依赖于静止相机和直立运动的训练数据而性能显著下降。解决方案的关键在于提出Patient4D——一个基于站位先验（stationarity prior）的重建流程，其核心机制包括：姿态锁定（Pose Locking），通过稳定关键帧锚定姿态参数以提升时序一致性；以及刚性回退（Rigid Fallback），在严重遮挡下利用轮廓引导的刚性对齐恢复网格。这两个组件共同提升了重建稳定性，同时兼容现成的HMR模型，在模拟手术遮挡条件下将平均IoU提升至0.75，并将失败帧比例从30.5%降至1.3%。

链接: https://arxiv.org/abs/2603.17178
作者: Mingxiao Tu,Hoijoon Jung,Alireza Moghadam,Andre Kyme,Jinman Kim
机构: University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recovering a dense 3D body mesh from monocular video remains challenging under occlusion from draping and continuously moving camera viewpoints. This configuration arises in surgical augmented reality (AR), where an anesthetized patient lies under surgical draping while a surgeon’s head-mounted camera continuously changes viewpoint. Existing human mesh recovery (HMR) methods are typically trained on upright, moving subjects captured from relatively stable cameras, leading to performance degradation under such conditions. To address this, we present Patient4D, a stationarity-constrained reconstruction pipeline that explicitly exploits the stationarity prior. The pipeline combines image-level foundation models for perception with lightweight geometric mechanisms that enforce temporal consistency across frames. Two key components enable robust reconstruction: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. Together, these mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models. We evaluate Patient4D on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a 0.75 mean IoU, reducing failure frames from 30.5% to 1.3% compared to the best baseline. Our findings demonstrate that exploiting stationarity priors can substantially improve monocular reconstruction in clinical AR scenarios.

[CV-116] Generalist Multimodal LLM s Gain Biometric Expertise via Human Salience

【速读】：该论文旨在解决虹膜活体检测（Iris Presentation Attack Detection, PAD）在实际部署中面临的三大挑战：一是难以收集未来未知攻击类型的样本数据；二是现有数据集虽具多样性但预测能力有限且获取成本高；三是生物特征数据共享存在隐私风险。为应对这些问题，论文提出利用通用多模态大语言模型（Multimodal Large Language Models, MLLMs）结合人类专家知识进行虹膜PAD识别，且所有处理均在机构隐私约束下完成，禁止将生物特征数据上传至公共云服务。其解决方案的关键在于：首先发现预训练视觉编码器嵌入（vision encoder embeddings）能天然聚类多种虹膜攻击类型，即使未针对该任务专门训练；其次，在聚类存在重叠时引入结构化提示（structured prompts），融合人类标注的显著性信息（human salience，即受试者对攻击特征的口头描述），从而有效区分模糊类别。实验表明，使用经专家提示增强的Gemini 2.5 Pro模型优于专用卷积神经网络（CNN）基线和人工判读，而本地部署的Llama 3.2-Vision模型也达到近人类水平，验证了在隐私合规前提下MLLMs作为虹膜PAD新范式的可行性。

链接: https://arxiv.org/abs/2603.17173
作者: Jacob Piland,Byron Dowling,Christopher Sweet,Adam Czajka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural networks (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.

[CV-117] SLAM Adversarial Lab: An Extensible Framework for Visual SLAM Robustness Evaluation under Adverse Conditions

【速读】：该论文旨在解决视觉SLAM（Simultaneous Localization and Mapping，即时定位与地图构建）系统在恶劣环境条件下性能下降的问题，特别是在雾、雨等真实世界干扰因素下的鲁棒性评估难题。解决方案的关键在于提出了一种模块化框架SAL（SLAM Adversarial Lab），其核心创新在于将每种对抗性条件建模为对现有数据集的扰动（perturbation），并支持以可解释的真实世界单位（如能见度的米数）定义扰动强度；同时，SAL通过统一接口解耦数据集、扰动和SLAM算法，实现组件扩展无需重写集成代码，并内置搜索机制自动定位导致SLAM系统失效的临界扰动强度，从而系统性地评估多种SLAM算法在不同对抗条件下的鲁棒性表现。

链接: https://arxiv.org/abs/2603.17165
作者: Mohamed Hefny,Karthik Dantu,Steven Y. Ko
机构: Simon Fraser University (西蒙菲莎大学); University at Buffalo (纽约州立大学布法罗分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:We present SAL (SLAM Adversarial Lab), a modular framework for evaluating visual SLAM systems under adversarial conditions such as fog and rain. SAL represents each adversarial condition as a perturbation that transforms an existing dataset into an adversarial dataset. When transforming a dataset, SAL supports severity levels using easily-interpretable real-world units such as meters for fog visibility. SAL’s extensible architecture decouples datasets, perturbations, and SLAM algorithms through common interfaces, so users can add new components without rewriting integration code. Moreover, SAL includes a search procedure that finds the severity level of a perturbation at which a SLAM system fails. To showcase the capabilities of SAL, our evaluation integrates seven SLAM algorithms and evaluates them across three datasets under weather, camera, and video transport perturbations.

[CV-118] GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion CVPR2026

【速读】：该论文旨在解决从单个桌面安装的向上视角鱼眼相机中实现多人群体三维注视方向估计的问题，这一场景在传统基于前向摄像头的方法中因视角受限而研究较少。其关键解决方案在于提出GazeOnce360模型，该模型通过引入旋转卷积（rotational convolutions）以应对鱼眼图像中的严重畸变和视角变化，并结合眼部关键点监督（eye landmark supervision）提升精度；同时设计了一种双分辨率架构（dual-resolution architecture），融合全局低分辨率上下文与高分辨率局部眼部区域特征，从而更有效地捕捉对注视估计至关重要的细粒度眼部信息。

链接: https://arxiv.org/abs/2603.17161
作者: Zhuojiang Cai,Zhenghui Sun,Feng Lu
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: this https URL.

[CV-119] BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Birds-Eye View Images CVPR2026

【速读】：该论文旨在解决LiDAR在复杂环境下的全局定位（global localization）精度不足问题，尤其是在场景特征稀疏或变化较大的情况下。解决方案的关键在于提出BEV-SLD方法，其核心思想是基于鸟瞰图（bird’s-eye-view, BEV）图像自监督地发现具有特定空间密度的场景特异性模式，并将其视为地标（landmark），通过一致性损失函数将可学习的全局地标坐标与每帧热力图对齐，从而实现跨场景的一致性地标检测，显著提升了定位鲁棒性和准确性。

链接: https://arxiv.org/abs/2603.17159
作者: David Skuddis,Vincent Ress,Wei Zhang,Vincent Ofosu Nyako,Norbert Haala
机构: Institute for Photogrammetry and Geoinformatics, University of Stuttgart, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird’s-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.

[CV-120] SMAL-pets: SMAL Based Avatars of Pets from Single Image

【速读】：该论文旨在解决生成高保真、可动画化3D狗虚拟形象（dog avatar）的挑战，尤其针对动物重建领域缺乏大规模标注数据集、形态多样性大导致模型泛化困难、毛发纹理真实感不足以及传统方法依赖人工网格操作和专家绑定等问题。解决方案的关键在于提出SMAL-pets框架，其核心创新是融合3D高斯泼溅（3D Gaussian Splatting）与SMAL参数化模型（SMAL, Skinned Multi-Animal Linear model），从而实现视觉保真度与解剖结构合理性兼备的表示；同时引入多模态编辑套件，支持用户通过自然语言指令直接控制外观与动作行为，显著提升交互效率与灵活性。

链接: https://arxiv.org/abs/2603.17131
作者: Piotr Borycki,Joanna Waczyńska,Yizhe Zhu,Yongqiang Gao,Przemysław Spurek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating high-fidelity, animatable 3D dog avatars remains a formidable challenge in computer vision. Unlike human digital doubles, animal reconstruction faces a critical shortage of large-scale, annotated datasets for specialized applications. Furthermore, the immense morphological diversity across species, breeds, and crosses, which varies significantly in size, proportions, and features, complicates the generalization of existing models. Current reconstruction methods often struggle to capture realistic fur textures. Additionally, ensuring these avatars are fully editable and capable of performing complex, naturalistic movements typically necessitates labor-intensive manual mesh manipulation and expert rigging. This paper introduces SMAL-pets, a comprehensive framework that generates high-quality, editable animal avatars from a single input image. Our approach bridges the gap between reconstruction and generative modeling by leveraging a hybrid architecture. Our method integrates 3D Gaussian Splatting with the SMAL parametric model to provide a representation that is both visually high-fidelity and anatomically grounded. We introduce a multimodal editing suite that enables users to refine the avatar’s appearance and execute complex animations through direct textual prompts. By allowing users to control both the aesthetic and behavioral aspects of the model via natural language, SMAL-pets provides a flexible, robust tool for animation and virtual reality.

[CV-121] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models ICME

【速读】：该论文旨在解决视频扩散模型在构建长期一致世界模拟器时面临的时空记忆瓶颈问题，特别是如何在相机运动、场景重访和干预条件下保持空间一致性，同时准确建模动态物体。现有方法中，显式3D结构虽能提升重投影一致性但难以刻画运动物体，而隐式记忆虽能生成复杂场景却常导致相机运动不准确。解决方案的关键在于提出一种混合空间记忆机制——Mosaic Memory（MosaicMem），其核心是将图像块（patch）提升至3D空间以实现可靠定位与目标检索，同时利用模型原生条件控制（conditioning）保留提示跟随能力；通过补丁拼接接口（patch-and-compose interface）在查询视图中组合空间对齐的补丁，既保留应持续存在的内容，又允许模型对需演化的区域进行内插（inpainting），从而实现更精确的姿态遵循和更强的动态建模能力。

链接: https://arxiv.org/abs/2603.17117
作者: Wei Yu,Runjia Qian,Yumeng Li,Liquan Wang,Songheng Yin,Sri Siddarth Chakaravarthy P,Dennis Anthony,Yang Ye,Yidi Li,Weiwei Wan,Animesh Garg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model’s native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.

[CV-122] Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

【速读】：该论文旨在解决多模型集成中因同架构视觉语言模型（Vision-Language Models, VLMs）存在相关错误（family-correlated errors）而导致的性能瓶颈问题。传统投票机制无法有效处理这种结构化误差，导致集成效果受限，甚至在部分问题上出现“误导层级”（Misleading tier），即多数模型同时犯错，使准确率降至0%。解决方案的关键在于引入家族感知（family-aware）策略：首先通过分层家族投票（Hierarchical Family Voting, HFV）在家族内部聚合决策，再跨家族进行投票；其次提出无需训练的QualRCCV方法，基于校准度、家族质量与逆家族规模对模型加权；最后设计可学习的候选评分机制（Learned Candidate Scoring, LCS），利用支持广度、家族多样性与模型质量重新排序候选答案，显著提升整体性能且无退化风险。其中LCS在多个基准测试中取得最大增益，并在VQAv2 test-standard上达到87.83%的准确率，验证了其泛化能力。

链接: https://arxiv.org/abs/2603.17111
作者: Zacharie Bugaud
机构: Astera Institute (Astera研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, 11 tables

点击查看摘要

Abstract:Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors destroy accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method weighting models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality, achieving the largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA – all significant – and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization. Comments: 15 pages, 6 figures, 11 tables Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17111 [cs.CV] (or arXiv:2603.17111v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2603.17111 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-123] Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation

【速读】：该论文旨在解决图像分割任务中依赖大量人工标注数据所带来的高成本与低效率问题，同时应对银标准（AI生成）标签可能引入的偏差风险。其解决方案的关键在于提出一种结合反事实生成（counterfactual generation）与密集对比学习（dense contrastive learning）的管道架构，具体包括Dual-View (DVD-CL) 和 Multi-View (MVD-CL) 方法，并进一步设计了利用可用银标准标签的监督变体。通过引入反事实增强和银标准标签的有效利用，该方法显著提升了模型对采集条件及病理变化的鲁棒性，在无需人工标注的情况下性能超越现有密集对比学习方法，且使用银标准标签的监督版本优于直接训练于银标准标签数据的表现，最终在挑战性数据集上实现了约94%的Dice相似系数（DSC）。

链接: https://arxiv.org/abs/2603.17110
作者: Marceau Lafargue-Hauret,Raghav Mehta,Fabio De Sousa Ribeiro,Mélanie Roschewitz,Ben Glocker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ISBI-2026 (oral presentation)

点击查看摘要

Abstract:Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map) is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving \sim 94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.

[CV-124] LLM -Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience

【速读】：该论文旨在解决城市内涝对交通网络连续性的威胁，特别是缺乏能够提供厘米级分辨率实时街道级洪水深度信息的运营系统，这对动态路径规划、电动汽车（EV）安全和自动驾驶车辆（AV）运行至关重要。解决方案的关键在于提出FloodLlama——一个基于单张街景图像进行连续洪水深度估计的开源视觉语言模型（VLM），其核心创新包括：利用TikTok数据构建多模态感知管道以实现实时众包洪水监测；通过合成数据集（约19万张图像，涵盖7种车型、4种天气和41个深度层级）支持渐进式课程训练（progressive curriculum training）；采用QLoRA微调LLaMA 3.2-11B Vision模型实现高效学习；并通过五阶段可解释性框架识别出关键层L23作为深度编码转换点，实现选择性微调，在减少76–80%可训练参数的同时保持高精度（Tier 3配置在真实数据上达到98.62%准确率），从而提供一种无需基础设施、可扩展的洪水感知方案。

链接: https://arxiv.org/abs/2603.17108
作者: Nafis Fuad,Xiaodong Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Urban flooding poses an escalating threat to transportation network continuity, yet no operational system currently provides real-time, street-level flood depth information at the centimeter resolution required for dynamic routing, electric vehicle (EV) safety, and autonomous vehicle (AV) operations. This study presents FloodLlama, a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images, supported by a multimodal sensing pipeline using TikTok data. A synthetic dataset of approximately 190000 images was generated, covering seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution). Progressive curriculum training enabled coarse-to-fine learning, while LLaMA 3.2-11B Vision was fine-tuned using QLoRA. Evaluation across 34797 trials reveals a depth-dependent prompt effect: simple prompts perform better for shallow flooding, whereas chain-of-thought (CoT) reasoning improves performance at greater depths. FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, exceeding 96.8% for shallow depths. A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy. The Tier 3 configuration achieves 98.62% accuracy on real-world data and shows strong robustness under visual occlusion. A TikTok-based data pipeline, validated on 676 annotated flood frames from Detroit, demonstrates the feasibility of real-time, crowd-sourced flood sensing. The proposed framework provides a scalable, infrastructure-free solution with direct implications for EV safety, AV deployment, and resilient transportation management.

[CV-125] Accurate Shift Invariant Convolutional Neural Networks Using Gaussian-Hermite Moments

【速读】：该论文旨在解决卷积神经网络（Convolutional Neural Networks, CNNs）在空间平移变换下缺乏不变性（shift invariance）的问题，尤其是在使用下采样（downsampling）操作时，这种不变性被显著破坏。传统CNN中的下采样虽能提升计算效率并扩大感受野以获取更多上下文信息，但其固有的离散采样机制导致模型对输入图像的微小平移敏感，从而影响性能一致性。解决方案的关键在于提出高斯-埃尔米特采样（Gaussian-Hermite Sampling, GHS），该方法利用高斯-埃尔米特多项式实现一致性的空间采样，使CNN层在训练前即具备对任意空间平移的不变性。GHS无需修改网络结构或引入额外训练步骤，即可直接嵌入标准CNN架构中，实验表明其在CIFAR-10、CIFAR-100和MNIST-rot数据集上实现了100%的分类一致性，并优于基线CNN模型的准确率。

链接: https://arxiv.org/abs/2603.17098
作者: Jaspreet Singh,Petra Bosilj,Grzegorz Cielniak
机构: Teesside University (提塞德大学); Maastricht University (马斯特里赫特大学); University of Lincoln (林肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The convolutional neural networks (CNNs) are not inherently shift invariant or equivariant. The downsampling operation, used in CNNs, is one of the key reasons which breaks the shift invariant property of a CNN. Conversely, downsampling operation is important to improve computational efficiency and increase the area of the receptive field for more contextual information. In this work, we propose Gaussian-Hermite Sampling (GHS), a novel downsampling strategy designed to achieve accurate shift invariance. GHS leverages Gaussian-Hermite polynomials to perform shift-consistent sampling, enabling CNN layers to maintain invariance to arbitrary spatial shifts prior to training. When integrated into standard CNN architectures, the proposed method embeds shift invariance directly at the layer level without requiring architectural modifications or additional training procedures. We evaluate the proposed approach on CIFAR-10, CIFAR-100, and MNIST-rot datasets. Experimental results demonstrate that GHS significantly improves shift consistency, achieving 100% classification consistency under spatial shifts, while also improving classification accuracy compared to baseline CNN models.

[CV-126] ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models

【速读】：该论文旨在解决通用型医学视觉语言模型（Medical Vision-Language Models, Medical VLMs）在跨域任务中面临的“专业化-泛化权衡”问题：即单一领域训练的专用模型虽能捕捉细粒度诊断线索，但泛化能力弱；而多领域训练的通用模型虽具备广泛语义理解能力，却因信息稀释导致诊断细节丢失。解决方案的关键在于提出ACE-LoRA框架，其核心创新包括：1）在冻结的图像-文本编码器中引入低秩适应（Low-Rank Adaptation, LoRA）模块实现参数高效微调；2）设计基于注意力机制的上下文增强超图神经网络（Attention-based Context Enhancement Hypergraph Neural Network, ACE-HGNN），通过建模高阶上下文交互来强化全局表示中的局部诊断线索，弥补传统参数高效微调方法忽略细粒度特征的缺陷；3）引入标签引导的信息最大化对比损失（label-guided InfoNCE loss），有效抑制语义相关图文对间的假负样本，提升跨模态对齐精度。实验表明，仅增加0.95M可训练参数，ACE-LoRA在零样本分类、分割和检测等多个医学任务上均显著优于现有先进模型。

链接: https://arxiv.org/abs/2603.17079
作者: M. Arda Aydın,Melih B. Yilmaz,Aykut Koç,Tolga Çukur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at this https URL.

[CV-127] Edge-Efficient Two-Stream Multimodal Architecture for Non-Intrusive Bathroom Fall Detection ICME2026

【速读】：该论文旨在解决湿浴室环境中老年人跌倒检测的隐私保护与实时性问题，现有方案多将运动与冲击视为松散耦合的信号流，缺乏对雷达观测到的坍塌与地面冲击之间因果关系的显式建模，且未有效处理时序漂移、物体掉落干扰以及低功耗边缘设备上的延迟和能耗限制。其解决方案的关键在于提出一种双流架构：Motion–Mamba分支用于捕捉长距离运动模式，Impact–Griffin分支强调冲击瞬态和跨轴耦合；通过低秩双线性交互和Switch–MoE头实现跨条件融合，以对齐运动与冲击特征并抑制物体掉落干扰，从而在Raspberry Pi 4B网关上实现高精度（96.1%准确率）、低延迟（15.8 ms）和低能耗（每2.56 s窗口降低3450 mJ）的实时跌倒检测。

链接: https://arxiv.org/abs/2603.17069
作者: Haitian Wang,Yiren Wang,Xinyu Wang,Sheldon Fung,Atif Mansoor
机构: University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted for poster presenation at IEEE ICME 2026

点击查看摘要

Abstract:Falls in wet bathroom environments are a major safety risk for seniors living alone. Recent work has shown that mmWave-only, vibration-only, and existing multimodal schemes, such as vibration-triggered radar activation, early feature concatenation, and decision-level score fusion, can support privacy-preserving, non-intrusive fall detection. However, these designs still treat motion and impact as loosely coupled streams, depending on coarse temporal alignment and amplitude thresholds, and do not explicitly encode the causal link between radar-observed collapse and floor impact or address timing drift, object drop confounders, and latency and energy constraints on low-power edge devices. To this end, we propose a two-stream architecture that encodes radar signals with a Motion–Mamba branch for long-range motion patterns and processes floor vibration with an Impact–Griffin branch that emphasizes impact transients and cross-axis coupling. Cross-conditioned fusion uses low-rank bilinear interaction and a Switch–MoE head to align motion and impact tokens and suppress object-drop confounders. The model keeps inference cost suitable for real-time execution on a Raspberry Pi 4B gateway. We construct a bathroom fall detection benchmark dataset with frame-level annotations, comprising more than 3~h of synchronized mmWave radar and triaxial vibration recordings across eight scenarios under running water, together with subject-independent training, validation, and test splits. On the test split, our model attains 96.1% accuracy, 94.8% precision, 88.0% recall, a 91.1% macro F1 score, and an AUC of 0.968. Compared with the strongest baseline, it improves accuracy by 2.0 percentage points and fall recall by 1.3 percentage points, while reducing latency from 35.9 ms to 15.8 ms and lowering energy per 2.56 s window from 14200 mJ to 10750 mJ on the Raspberry Pi 4B gateway.

[CV-128] rackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects

【速读】：该论文旨在解决当前三维（3D）可变形物体数据集构建中的两大挑战：一是现有感知方法在处理复杂形变时鲁棒性不足，难以准确提取如关键点（keypoints）和网格（meshes）等结构化3D表示；二是大规模3D数据采集成本高，依赖人工标注或昂贵的动捕系统，且在非结构化环境中假设失效，导致高质量、大尺度的可变形物体数据集稀缺。解决方案的关键在于提出一个基于RGB-D相机的低成本、自主式3D数据采集框架——TrackDeform3D，其核心创新在于通过引入运动一致性约束（motion consistency constraints），实现对3D关键点的稳健识别与轨迹追踪，从而生成时间上平滑且几何一致的数据，显著提升了几何精度与跟踪稳定性。

链接: https://arxiv.org/abs/2603.17068
作者: Yeheng Zong,Yizhou Chen,Alexander Bowler,Chia-Tung Yang,Ram Vasudevan
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.

[CV-129] DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems

【速读】：该论文旨在解决在非结构化、无道路环境（特别是沙漠地形）中实现可靠地形感知的问题，此类场景因色彩对比度低、光照变化剧烈及植被稀疏而难以被传统道路场景分割模型有效处理。解决方案的关键在于提出 DesertFormer，一个基于 SegFormer B2 架构并采用分层混合 Transformer（MiT-B2）骨干网络的语义分割流水线，能够将地形细分为10类生态学上有意义的类别（如枯草、岩石、地面杂乱物等），从而支持地面机器人和自动驾驶车辆的安全路径规划。该方法在自建的4,176张标注图像数据集上训练，mIoU 达到64.4%，较 DeepLabV3 MobileNetV2 基线提升24.2个百分点，并通过系统性失败分析识别出主要混淆模式（如“地面杂乱物”与“景观”、“干草”与“景观”之间的误判），进而引入类别加权训练和复制粘贴增强策略以改善稀有类别的分割性能。

链接: https://arxiv.org/abs/2603.17056
作者: Yasaswini Chebolu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 6 figures, 3 tables. Preprint also available on Zenodo (DOI: https://doi.org/10.5281/zenodo.19053085 )

点击查看摘要

Abstract:Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off-road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road-scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off-road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT-B2) backbone. The system classifies terrain into ten ecologically meaningful categories – Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky – enabling safety-aware path planning for ground robots and autonomous vehicles. Trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, DesertFormer achieves a mean Intersection-over-Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns – Ground Clutter to Landscape and Dry Grass to Landscape – and propose class-weighted training and copy-paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at this https URL.

[CV-130] PaAgent : Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

【速读】：该论文旨在解决现有图像修复（Image Restoration, IR）代理在自动化处理复杂退化场景时缺乏对历史交互信息的总结机制，导致工具选择过程冗长且低效的问题。其核心解决方案是提出一种基于画像感知的IR代理（PaAgent），关键在于引入一个自演化画像库（portrait bank）与检索增强生成（Retrieval-Augmented Generation, RAG）相结合的机制：画像库通过持续归纳不同IR工具在修复图像、输入退化图像及所选工具之间的特征关系进行动态构建和演化；RAG则利用该画像库中存储的历史洞察来高效检索并推荐最优IR工具，从而显著提升工具选择的准确性与效率。此外，为增强复杂场景下的退化感知能力，还设计了一种主客观强化学习策略，融合图像质量评分与语义信息以生成更鲁棒的奖励信号。

链接: https://arxiv.org/abs/2603.17055
作者: Yijian Wang,Qingsen Yan,Jiantao Zhou,Duwei Dai,Wei Dong
机构: Northwestern Polytechnical University (西北工业大学); Shenzhen Research Institute of Northwestern Polytechnical University (西北工业大学深圳研究院); University of Macau (澳门大学); Xi’an Jiaotong University (西安交通大学); The Second Affiliated Hospital of Xi’an Jiaotong University (西安交通大学第二附属医院); Xi’an University of Architecture and Technology (西安建筑科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent’s ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent’s superiority in addressing complex IR tasks. Our project page is \hrefthis https URLPaAgent.

[CV-131] Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

【速读】：该论文旨在解决蒸馏后的自回归（Autoregressive, AR）视频模型在流式生成过程中与人类视觉偏好存在偏差的问题。现有强化学习（Reinforcement Learning, RL）框架难以适配此类架构，通常依赖昂贵的重新蒸馏或耦合求解器的反向过程优化，导致显著的内存和计算开销。其解决方案的关键在于提出一种面向蒸馏AR模型的高效在线强化学习框架Astrolabe，核心创新包括：1）基于负样本感知微调的前向过程强化学习公式，通过推理端直接对比正负样本隐式引导策略改进，无需反向传播；2）采用滚动键值缓存（KV-cache）的流式训练机制，在局部片段窗口内执行RL更新并依赖先前上下文保持长视频连贯性；3）引入不确定性感知的选择性正则化与动态参考更新机制以缓解奖励劫持问题。实验表明，该方法可稳定提升多种蒸馏AR视频模型的生成质量，具备良好的鲁棒性和扩展性。

链接: https://arxiv.org/abs/2603.17051
作者: Songchun Zhang,Zeyue Xue,Siming Fu,Jie Huang,Xianghao Kong,Y Ma,Haoyang Huang,Nan Duan,Anyi Rao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 53 pages, 37 figures

点击查看摘要

Abstract:Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.

[CV-132] SCE-LITE-HQ: Smooth visual counterfactual explanations with generative foundation models

【速读】：该论文旨在解决高维视觉领域中黑箱神经网络模型可解释性差的问题，特别是现有反事实解释（Counterfactual Explanations, CFEs）方法依赖特定数据集的生成模型且计算成本高，难以扩展到高分辨率图像数据。其解决方案的关键在于提出SCE-LITE-HQ框架，该框架利用预训练的生成基础模型（pretrained generative foundation models），在生成器的潜在空间中进行优化，通过平滑梯度提升优化稳定性，并引入基于掩码的多样化策略以生成结构多样且现实的反事实样本，从而无需任务特定训练即可高效生成高质量反事实解释。

链接: https://arxiv.org/abs/2603.17048
作者: Ahmed Zeid,Sidney Bender
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern neural networks achieve strong performance but remain difficult to interpret in high-dimensional visual domains. Counterfactual explanations (CFEs) provide a principled approach to interpreting black-box predictions by identifying minimal input changes that alter model outputs. However, existing CFE methods often rely on dataset-specific generative models and incur substantial computational cost, limiting their scalability to high-resolution data. We propose SCE-LITE-HQ, a scalable framework for counterfactual generation that leverages pretrained generative foundation models without task-specific retraining. The method operates in the latent space of the generator, incorporates smoothed gradients to improve optimization stability, and applies mask-based diversification to promote realistic and structurally diverse counterfactuals. We evaluate SCE-LITE-HQ on natural and medical datasets using a desiderata-driven evaluation protocol. Results show that SCE-LITE-HQ produces valid, realistic, and diverse counterfactuals competitive with or outperforming existing baselines, while avoiding the overhead of training dedicated generative models.

[CV-133] Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models CVPR

【速读】：该论文旨在解决统一多模态模型中，直接偏好优化（DPO）是否能同时对理解与生成能力进行有效对齐的问题。其关键发现是：在Janus-Pro模型架构下，无论采用何种训练策略或后处理方法，生成质量均无法通过DPO提升，且在7B参数规模下所有方法均未改善CLIPScore（|Δ| < 0.2, p > 0.5，n=200/种子，3个种子），1B规模则全部退化。根本原因在于理解与生成任务的梯度近似正交（cos ≈ 0），且生成梯度幅度比理解梯度高11–14倍，主要由VQ编码器生成token数量远超文本token（576 vs. ~30–100）导致。这种幅度失衡构成多任务DPO中的主导干扰机制；尽管通过梯度幅度平衡可小幅提升理解性能（+0.01–0.04 VQA），但生成性能差距依然存在。研究进一步指出离散VQ分词结构可能是瓶颈，并提出实际指导建议以优化基于VQ的统一模型训练。

链接: https://arxiv.org/abs/2603.17044
作者: Abinav Rao,Sujan Rachuri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, CVPR MMF5M Workshop 2026

点击查看摘要

Abstract:Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| 0.2, p 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck – supported by the generation DPO loss converging to ln(2) – and provide practical guidance for practitioners working with VQ-based unified models.

[CV-134] OpenQlaw: An Agent ic AI Assistant for Analysis of 2D Quantum Materials

【速读】：该论文旨在解决二维量子材料（2D quantum materials）从光学识别到实际器件制造过程中，现有多模态大语言模型（Multimodal Large Language Models, MLLMs）因过度追求认知透明性而导致的输出冗长、推理密集、缺乏即时实用性的难题。其核心挑战在于如何在保持物理合理性的同时，提升系统对科研人员实时交互的支持能力。解决方案的关键在于提出OpenQlaw——一个基于轻量级代理架构（NanoBot）与物理感知指令多模态平台（QuPAINT）协同的智能编排系统，通过将视觉识别与推理过程解耦，并引入持久化记忆机制以存储物理尺度比例和样品制备方法，使核心大语言模型（LLM）代理能够动态调度领域专家模块，实现面向任务的上下文感知响应，从而显著加速高通量器件制造流程。

链接: https://arxiv.org/abs/2603.17043
作者: Sankalp Pandey,Xuan-Bac Nguyen,Hoang-Quan Nguyen,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu
机构: University of Arkansas, USA; University of Utah, USA; MonARK NSF Quantum Foundry
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond the detection accuracy. While recent domain-specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics-informed reasoning, their outputs are optimized for step-by-step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real-world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics-Aware Instruction Multi-modal platforms for Quantum Material Discovery. This allows accessibility to the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain-expert MLLM,with QuPAINT, as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale-aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner. Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 \mum) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain-specific experts, transforms isolated inferences into a context-aware assistant capable of accelerating high-throughput device fabrication.

[CV-135] Empirical Recipes for Efficient and Compact Vision-Language Models

【速读】：该论文旨在解决紧凑型视觉语言模型（Vision-Language Models, VLMs）在资源受限场景下推理效率低的问题，即其实际推理延迟与参数量所预期的加速效果不匹配。关键解决方案在于通过端到端的效率分析和推理过程 profiling，识别出影响性能的主要瓶颈，并据此提出针对性的优化策略，显著降低首次生成 token 的时间（Time to First Token, TTFT），例如在 InternVL3-2B 上减少 53%，在 SmolVLM-256M 上减少 93%。这些优化方法适用于多种 VLM 架构和主流部署框架，同时论文还进一步扩展了紧凑模型的能力，引入结构化感知输出机制，形成了 ArgusVLM 模型家族，在保持高效性的同时实现多基准测试下的优异性能。

链接: https://arxiv.org/abs/2603.16987
作者: Jiabo Huang,Zhizhong Li,Sina Sajadmanesh,Weiming Zhuang,Lingjuan Lyu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.

[CV-136] Are a Thousand Words Better Than a Single Picture? Beyond Images – A Framework for Multi-Modal Knowledge Graph Dataset Enrichment

【速读】：该论文旨在解决多模态知识图谱（Multi-Modal Knowledge Graphs, MMKGs）在构建过程中因图像数据采集困难而导致的视觉信息覆盖不足问题，特别是那些语义模糊但相关的视觉内容（如标志、符号和抽象场景）常被排除，从而限制了MMKG的补全性能。其解决方案的关键在于提出一个全自动的数据驱动增强流程——Beyond Images，包含三个核心阶段：首先大规模检索与实体相关的额外图像；其次将所有视觉输入统一转换为文本描述，以确保模糊图像能提供可用语义而非噪声；最后利用大语言模型（Large Language Model, LLM）融合多源描述，生成简洁且与实体对齐的摘要，并替换或补充标准MMKG模型中的文本模态，无需修改原有架构或损失函数。实验证明，该方法在多个公开数据集上显著提升性能（最高达7% Hits@1），尤其在视觉模糊实体子集上改善显著（MRR提升201.35%，Hits@1提升333.33%）。

链接: https://arxiv.org/abs/2603.16974
作者: Pengyu Zhang,Klim Zaporojets,Jie Liu,Jia-Hong Huang,Paul Groth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large-scale image collection is hard to curate and often excludes ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data-centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large-scale retrieval of additional entity-related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi-source descriptions using a large language model (LLM) to generate concise, entity-aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text-Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at this https URL.

[CV-137] Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection

【速读】：该论文旨在解决多模态第一人称行为识别在开放世界持续学习场景下的挑战，即如何在非稳态数据流中有效检测新活动并持续学习，同时避免灾难性遗忘导致的模态不平衡问题（如RGB主导分类 logits，而IMU等其他模态信息被忽视）。解决方案的关键在于提出一种模态感知框架MAND，其核心包括两个创新模块：1）推理阶段的模态自适应评分（MoAS），通过能量分数估计样本级模态可靠性，并动态融合各模态logits以增强互补信息利用；2）训练阶段的模态级表示稳定化训练（MoRST），借助辅助头和模态级logit蒸馏机制保留各模态在任务间的判别能力，从而缓解因持续学习引发的模态失衡与性能退化。

链接: https://arxiv.org/abs/2603.16970
作者: Wonseon Lim,Hyejeong Im,Dae-Won Kim
机构: Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.

[CV-138] MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing

【速读】：该论文旨在解决现有基于指令的图像编辑模型在处理复杂、多步且相互依赖的指令时性能显著下降的问题，其根本原因在于缺乏带有复杂多指令标注的训练数据，而重新训练这些模型成本高昂。解决方案的关键在于提出一种无需训练的代理框架MSRAMIE，该框架基于多模态大语言模型（Multimodal Large Language Model, MLLM），将现有编辑模型作为插件组件集成，并通过结构化的多模态推理来执行多指令任务。其核心创新在于引入了“状态树”（Tree-of-States）与“参考图”（Graph-of-References）相结合的新型推理拓扑，实现编辑指令的逐步分解、状态转移、跨步骤信息聚合与原始输入召回，从而系统性探索图像编辑空间并支持灵活的渐进式输出优化，同时提供可解释且可控的决策路径。

链接: https://arxiv.org/abs/2603.16967
作者: Zhaoyuan Qiu,Ken Chen,Xiangwei Wang,Yu Xia,Sachith Seneviratne,Saman Halgamuge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures, 3 tables, appendix and references provided

点击查看摘要

Abstract:Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handle multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing steps which enable state transitions, cross-step information aggregation, and original input recall, which enables systematic exploration of the image editing space and flexible progressive output refinement. The visualizable inference topology further provides interpretable and controllable decision pathways. Experiments show that as the instruction complexity increases, MSRAMIE can improve instruction following over 15% and increases the probability of finishing all modifications in a single run over 100%, while preserving perceptual quality and maintaining visual consistency.

[CV-139] CineSRD: Leverag ing Visual Acoustic and Linguistic Cues for Open-World Visual Media Speaker Diarization CVPR2026

【速读】：该论文旨在解决开放世界场景下视觉媒体（如电影和电视剧）中的说话人聚类（speaker diarization）问题，该场景具有长视频理解、大量说话人、跨模态异步性（cross-modal asynchrony）以及真实环境下的高变异性等挑战。解决方案的关键在于提出一种统一的多模态框架——Cinematic Speaker Registration Diarization (CineSRD)，其核心机制包括：首先利用视觉锚点聚类（visual anchor clustering）对初始说话人进行注册，随后引入音频语言模型（audio language model）实现说话人片段检测，从而细化标注并补充未注册的离屏说话人。该方法有效融合了视频、语音与字幕中的视觉、声学及语言线索，显著提升了在复杂开放世界视觉媒体场景下的说话人识别鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2603.16966
作者: Liangbin Huang,Xiaohua Liao,Chaoqun Cui,Shijing Wang,Zhaolong Huang,Yanlong Du,Wenji Mao
机构: Hujing Digital Media and Entertainment Group; MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Computer Science and Technology, Beijing Jiaotong University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.

[CV-140] Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE

【速读】：该论文旨在解决自动驾驶系统（ADS）在真实交通场景中验证时面临的两个核心问题：一是如何实现标准化的场景提取，以克服现有方法因定义异质性导致的场景可比性差；二是如何有效进行场景聚类，以兼顾机器学习方法对复杂性的处理能力与规则方法的可解释性及领域知识一致性。解决方案的关键在于提出基于“场景即规范”（Scenario-as-Specification）理念的标准化场景提取框架，并引入领域知识引导的场景聚类过程，从而在高德数据集（highD dataset）上实现了可靠场景提取和有效知识嵌入，提升了高速公路场景分类的标准化程度与自动化车辆验证效率。

链接: https://arxiv.org/abs/2603.16964
作者: Niklas Roßberg,Sinan Hasirlioglu,Mohamed Essayed Bouzouraa,Wolfgang Utschick,Michael Botsch
机构: Technische Hochschule Ingolstadt (英戈尔施塔特应用技术大学); AUDI AG (奥迪股份公司); Technische Universität München (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

点击查看摘要

Abstract:Approval of ADS depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches can handle the complexity of traffic scenarios, unlike rule-based approaches, they lack interpretability and may not align with domain-knowledge. This work contributes to a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain-knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process of automated vehicles.

[CV-141] PhysQuantAgent : An Inference Pipeline of Mass Estimation for Vision-Language Models

【速读】：该论文旨在解决视觉语言模型（Vision-Language Models, VLMs）在机器人感知与操作中对物理属性（特别是物体质量）推理能力不足的问题。当前VLMs缺乏可靠的物理量估计能力，且现有基准测试未在真实传感条件下评估物理量估算性能。为应对这一挑战，作者提出PhysQuantAgent框架，并构建VisPhysQuant数据集用于评估。解决方案的关键在于引入三种视觉提示方法：目标检测、尺度估计和截面图像生成，通过增强输入图像的信息来提升模型对物体尺寸和内部结构的理解，从而显著提高真实场景下质量估计的准确性，验证了将空间推理与VLM知识融合以实现物理推断的有效性。

链接: https://arxiv.org/abs/2603.16958
作者: Hisayuki Yokomizo,Taiki Miyanishi,Yan Gang,Shuhei Kurita,Nakamasa Inoue,Yusuke Iwasawa
机构: The University of Tokyo (东京大学); Institute of Science Tokyo (东京科学研究所); National Institute of Informatics (日本国立情报学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code and dataset will be available at this https URL

点击查看摘要

Abstract:Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.

[CV-142] EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments

【速读】：该论文旨在解决零样本视觉-语言导航（Vision-and-Language Navigation in Continuous Environments, VLN-CE）中现代视觉-语言模型（Vision-Language Models, VLMs）难以实现稳定长程具身执行的问题。尽管VLMs具备丰富的语义先验知识，但其开放式的推理机制无法直接转化为可靠的导航行为。论文指出，核心瓶颈并非知识缺失，而是缺乏一个用于组织指令遵循、感知定位、时间进度和阶段验证的执行结构。解决方案的关键在于提出EmergeNav框架，该框架将连续VLN建模为结构化的具身推理过程，通过Plan–Solve–Transition层次结构实现阶段式执行，结合目标条件感知提取（GIPE）、对比双记忆推理（contrastive dual-memory reasoning）与角色分离双视野感知（role-separated Dual-FOV sensing），从而在无需任务特定训练、显式地图或图搜索的情况下，在VLN-CE基准上实现了显著的零样本性能提升（如Qwen3-VL-32B达到37.00%成功率）。

链接: https://arxiv.org/abs/2603.16947
作者: Kun Luo,Xiaoguang Ma
机构: Northeastern University (东北大学); Foshan Graduate School of Innovation (佛山创新研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification. We propose EmergeNav, a zero-shot framework that formulates continuous VLN as structured embodied inference. EmergeNav combines a Plan–Solve–Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and role-separated Dual-FOV sensing for time-aligned local control and boundary verification. On VLN-CE, EmergeNav achieves strong zero-shot performance using only open-source VLM backbones and no task-specific training, explicit maps, graph search, or waypoint predictors, reaching 30.00 SR with Qwen3-VL-8B and 37.00 SR with Qwen3-VL-32B. These results suggest that explicit execution structure is a key ingredient for turning VLM priors into stable embodied navigation behavior.

[CV-143] Joint Optimization of Storag e and Loading for High-Performance 3D Point Cloud Data Processing

【速读】：该论文旨在解决大规模点云数据在存储和处理过程中面临的效率瓶颈问题，包括数据规模庞大、格式多样导致的数据加载与预处理耗时长，以及传统算法难以高效处理复杂点云结构等挑战。其解决方案的关键在于提出了一种统一的点云数据存储格式——.PcRecord，并构建了一个多模块高性能数据处理流水线，通过多阶段并行架构优化计算资源利用，显著提升了点云数据的加载速度与处理效率，在多个主流点云数据集上实现了最高达25.4倍的加速比（使用Ascend处理器）。

链接: https://arxiv.org/abs/2603.16945
作者: Ke Wang,Yanfei Cao,Xiangzhi Tao,Naijie Gu,Jun Yu,Zhengdong Wang,Shouyang Dong,Fan Yu,Cong Wang,Yang Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development of computer vision and deep learning, significant advancements have been made in 3D vision, partic- ularly in autonomous driving, robotic perception, and augmented reality. 3D point cloud data, as a crucial representation of 3D information, has gained widespread attention. However, the vast scale and complexity of point cloud data present significant chal- lenges for loading and processing and traditional algorithms struggle to handle large-scale this http URL diversity of storage formats for point cloud datasets (e.g., PLY, XYZ, BIN) adds complexity to data handling and results in inefficiencies in data preparation. Al- though binary formats like BIN and NPY have been used to speed up data access, they still do not fully address the time-consuming data loading and processing phase. To overcome these challenges, we propose the .PcRecord format, a unified data storage solution designed to reduce the storage occupation and accelerate the processing of point cloud data. We also introduce a high-performance data processing pipeline equipped with multiple modules. By leveraging a multi-stage parallel pipeline architecture, our system optimizes the use of computational resources, significantly improving processing speed and efficiency. This paper details the im- plementation of this system and demonstrates its effectiveness in addressing the challenges of handling large-scale point cloud this http URL average, our system achieves performance improvements of 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (Kitti), 8.07x (SUN RGB-D), and 5.67x (ScanNet) with GPU and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x with Ascend.

[CV-144] Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

【速读】：该论文旨在解决当前指令驱动图像编辑（Instruction-based Image Editing, IIE）基准测试中因混合评估导致的模型编辑一致性诊断不足问题，尤其关注模型在不同语义尺度任务间表现不一致这一关键失败模式。解决方案的关键在于提出Omni IIE Bench——一个高质量、人工标注的基准，采用创新的双轨诊断设计：(1) 单轮一致性（Single-turn Consistency），包含共享上下文的任务对（如属性修改与实体替换）；(2) 多轮协调性（Multi-turn Coordination），涵盖跨语义尺度的连续对话任务。该基准通过多阶段人工筛选流程构建，确保了专业性和实用性，首次量化揭示了主流IIE模型从低语义尺度到高语义尺度任务时普遍存在的显著性能下降现象，为下一代更可靠、稳定的IIE模型开发提供了关键诊断工具与洞见。

链接: https://arxiv.org/abs/2603.16944
作者: Yujia Yang,Yuanxiang Wang,Zhenyu Guan,Tiankun Yang,Chenxi Bao,Haopeng Jin,Jinwen Luo,Xinyu Zuo,Lisheng Duan,Haijin Liang,Jin Ma,Xinming Wang,Ruiwen Tao,Hongzhu Yi
机构: University of Chinese Academy of Sciences (中国科学院大学); Tencent (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.

[CV-145] KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition

【速读】：该论文旨在解决基于骨骼的动作识别中传感器数据稀疏性和拓扑结构刚性所带来的挑战，即离散坐标形式的骨骼数据在高动态运动过程中会丢失细粒度时空细节，且固定物理传感器拓扑限制了对潜在长距离依赖关系的建模。解决方案的关键在于提出KGS-GCN框架，其核心创新包括：一是设计了基于运动学驱动的高斯点绘（kinematics-driven Gaussian splatting）模块，通过瞬时关节速度向量动态构建各向异性协方差矩阵，将稀疏骨骼序列转化为富含时空语义的多视角连续热图；二是引入概率拓扑构建方法，利用联合高斯分布间的巴氏距离（Bhattacharyya distance）量化统计相关性，生成自适应先验邻接矩阵以突破固定连接约束；最终通过视觉上下文门控机制对图卷积网络（GCN）进行特征调制，从而显著提升复杂时空动态建模能力。

链接: https://arxiv.org/abs/2603.16943
作者: Yuhan Chen,Yicui Shi,Guofa Li,Liping Zhang,Jie Li,Jiaxin Gao,Wenbo Chu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics. By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.

[CV-146] Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion

【速读】：该论文旨在解决情感识别中的模糊性/犹豫（Ambivalence/Hesitancy, A/H）视频识别问题，即在多模态数据中准确识别个体表现出的矛盾情绪状态。其解决方案的关键在于提出一种基于差异的多模态融合方法，通过显式计算视觉、音频和文本模态嵌入之间的成对绝对差值来量化跨模态冲突，从而捕捉A/H的核心特征——模态间不一致。具体而言，视觉特征以动作单元（Action Units, AUs）形式提取，音频与文本分别采用Wav2Vec 2.0和BERT编码，并经双向LSTM结合注意力池化后投影至共享嵌入空间；实验表明该方法在BAH数据集上取得Macro F1为0.6808的性能，显著优于基线（0.2827），且统计分析验证了AUs的时间变异性是主导视觉判别因子。

链接: https://arxiv.org/abs/2603.16939
作者: Aislan Gabriel O. Souza,Agostinho Freire,Leandro Honorato Silva,Igor Lucas B. da Silva,João Vinícius R. de Andrade,Gabriel C. de Albuquerque,Lucas Matheus da S. Oliveira,Mário Stela Guerra,Luciana Machado
机构: Universidade de Pernambuco (UPE); Escola Politécnica de Pernambuco (POLI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address the Ambivalence/Hesitancy (A/H) Video Recognition Challenge at the 10th ABAW Competition (CVPR 2026). We propose a divergence-based multimodal fusion that explicitly measures cross-modal conflict between visual, audio, and textual channels. Visual features are encoded as Action Units (AUs) extracted via Py-Feat, audio via Wav2Vec 2.0, and text via BERT. Each modality is processed by a BiLSTM with attention pooling and projected into a shared embedding space. The fusion module computes pairwise absolute differences between modality embeddings, directly capturing the incongruence that characterizes A/H. On the BAH dataset, our approach achieves a Macro F1 of 0.6808 on the validation test set, outperforming the challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms that temporal variability of AUs is the dominant visual discriminator of A/H.

[CV-147] DMM-LM: Bridging Facial Understanding and Animation via Language Models

【速读】：该论文旨在解决文本引导的人体动画中面部动画滞后的问题，其根本原因在于高质量、标注完善且与文本配对的面部行为数据集稀缺。解决方案的关键在于利用基础生成模型（foundation generative models）合成大规模、平衡的面部行为语料库，设计覆盖情绪和头部运动的提示（prompt）套件，生成约80小时的面部视频，并拟合每帧的3D面部参数，从而构建大规模的文本-参数对用于训练。在此基础上，通过两个互补任务——Motion2Language（从3D面部参数生成自然语言描述）和Language2Motion（从文本提示生成量化运动标记序列）——验证了语言模型在面部运动理解与生成上的双向能力，首次将面部参数建模视为语言问题，为文本条件下的面部动画与运动理解提供统一路径。

链接: https://arxiv.org/abs/2603.16936
作者: Luchuan Song,Pinxin Liu,Haiyang Liu,Zhenchao Jin,Yolo Yunlong Tang,Zichong Xu,Susan Liang,Jing Bi,Jason J Corso,Chenliang Xu
机构: University of Rochester (罗彻斯特大学); University of Tokyo (东京大学); University of Michigan (密歇根大学); Voxel51
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 13 figures

点击查看摘要

Abstract:Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design prompts suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.

[CV-148] GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference ICASSP

【速读】：该论文旨在解决视频-based lie detection（基于视频的谎言检测）中难以学习稀疏且具有判别性的表征问题。由于欺骗信号通常微弱且短暂，易被冗余信息淹没，同时个体差异和情境变化引入强身份相关噪声，导致现有方法性能受限。其解决方案的关键在于提出GenLie——一种全局增强的谎言检测网络，通过在全局监督下进行局部特征建模：在局部层面捕捉稀疏微妙的欺骗线索，同时借助全局监督与优化机制抑制身份相关噪声，从而获得鲁棒且具有判别力的表征。

链接: https://arxiv.org/abs/2603.16935
作者: Zongshun Zhang,Yao Liu,Qiao Liu,Xuefeng Peng,Peiyuan Jiang,Jiaye Yang,Daibing Yao,Wei Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

点击查看摘要

Abstract:Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at this https URL.

[CV-149] AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding

【速读】：该论文旨在解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在农业领域部署所面临的两大核心瓶颈：一是缺乏大规模、高质量的农业专用数据集以支撑模型的训练与评估；二是现有先进模型普遍缺乏经过验证的领域专业知识，难以在不同植物分类体系下进行可靠推理。为此，作者提出了一种名为“视觉到验证知识”（Vision-to-Verified-Knowledge, V2VK）的新颖生成式 AI 驱动标注框架，其关键在于将视觉描述（visual captioning）与基于网络的科学文献检索相结合，自动生成具有生物学准确性的 AgriMM 基准数据集，从而通过引入已验证的植物病理学文献来消除生物幻觉（biological hallucinations）。该方法确保了训练数据的可信度，并最终推动了 AgriChat 的开发，这是一个具备跨数千类农作物知识且能提供详尽农业诊断解释的专用 MLLM，实验证明其在多种任务和基准上均优于其他开源模型。

链接: https://arxiv.org/abs/2603.16934
作者: Abderrahmene Boudiaf,Irfan Hussain,Sajid Javed
机构: Khalifa University of Science and Technology (哈利法大学科学技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat’s superior performance over other open-source models, including internal and external benchmarks. The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at this https URL .

[CV-150] Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

【速读】：该论文旨在解决视觉语言模型（Vision-Language Models, VLMs）在处理图像时面临的准确性与计算效率之间的权衡问题：高分辨率输入虽能保留细节但计算开销大，低分辨率输入虽高效却可能遗漏关键信息（如小尺寸文本）。其解决方案的关键在于提出一种空间按需（spatial-on-demand）框架 AwaRes，该框架首先以低分辨率全局视图进行初步理解，再通过工具调用（tool-calling）机制精准检索当前任务所需的高分辨率图像片段，从而实现动态、高效的视觉信息获取。训练过程中，作者构建了自动标注的监督数据集，利用判别器比较高低分辨率答案并标记是否需要裁剪，同时借助接地模型（grounding model）定位证据区域，映射为离散裁剪集合形成多轮工具使用轨迹，并采用冷启动监督微调（SFT）结合多轮强化学习（GRPO）优化策略，其中奖励函数融合语义正确性与显式的裁剪成本惩罚，有效平衡性能与效率。

链接: https://arxiv.org/abs/2603.16932
作者: Nimrod Shabtay,Moshe Kimhi,Artem Spector,Sivan Haray,Ehud Rivlin,Chaim Baskin,Raja Giryes,Eli Schwartz
机构: IBM Research (IBM 研究院); Tel-Aviv University (特拉维夫大学); Technion (以色列理工学院); Ben-Gurion University (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) typically process images at a native high-resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs advocate for efficiency, they potentially miss critical visual information, like small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs.\ high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: this https URL

[CV-151] Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation

【速读】：该论文旨在解决滑动幻灯片视频制作中视觉效果与语音内容对齐的高人工成本问题，特别是如何将讲解脚本中的句子自动关联到对应的幻灯片对象。其核心解决方案是首次形式化了“脚本到幻灯片对象对齐”（Script-to-Slide Grounding, S2SG）任务，并提出“Text-S2SG”方法，利用大语言模型（Large Language Model, LLM）实现文本类对象的自动对齐，实验表明该方法在F1分数上达到0.924，为自动化生成教学视频奠定了基础。

链接: https://arxiv.org/abs/2603.16931
作者: Rena Suzuki,Masato Kikuchi,Tadachika Ozono
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The 21st International Conference on E-Service and Knowledge Management (ESKM 2025-Winter)

点击查看摘要

Abstract:While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process – particularly applying visual effects to ground spoken content to slide objects – remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose ``Text-S2SG,‘’ a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.

[CV-152] Facial beauty prediction fusing transfer learning and broad learning system

【速读】：该论文旨在解决面部美感预测（Facial Beauty Prediction, FBP）中存在的两个核心问题：一是由于缺乏大规模有效数据，模型易过拟合；二是因面部外观差异性和人类感知复杂性，难以快速构建鲁棒且高效的评估模型。解决方案的关键在于将迁移学习（Transfer Learning）与广义学习系统（Broad Learning System, BLS）相结合，提出两种改进架构：E-BLS通过基于迁移学习的卷积神经网络（CNN）特征提取器（采用EfficientNets）提取面部美感特征并输入BLS进行预测；ER-BLS进一步引入连接层以优化特征提取器与BLS之间的信息传递。实验表明，该方法在准确率上优于传统BLS和CNN方法，验证了其有效性与优越性。

链接: https://arxiv.org/abs/2603.16930
作者: Junying Gan,Xiaoshan Xie,Yikui Zhai,Guohui He,Chaoyun Mai,Heng Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Facial beauty prediction (FBP) is an important and challenging problem in the fields of computer vision and machine learning. Not only it is easily prone to overfitting due to the lack of large-scale and effective data, but also difficult to quickly build robust and effective facial beauty evaluation models because of the variability of facial appearance and the complexity of human perception. Transfer Learning can be able to reduce the dependence on large amounts of data as well as avoid overfitting problems. Broad learning system (BLS) can be capable of quickly completing models building and training. For this purpose, Transfer Learning was fused with BLS for FBP in this paper. Firstly, a feature extractor is constructed by way of CNNs models based on transfer learning for facial feature extraction, in which EfficientNets are used in this paper, and the fused features of facial beauty extracted are transferred to BLS for FBP, called E-BLS. Secondly, on the basis of E-BLS, a connection layer is designed to connect the feature extractor and BLS, called ER-BLS. Finally, experimental results show that, compared with the previous BLS and CNNs methods existed, the accuracy of FBP was improved by E-BLS and ER-BLS, demonstrating the effectiveness and superiority of the method presented, which can also be widely used in pattern recognition, object detection and image classification.

[CV-153] Leverag ing Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

【速读】：该论文旨在解决多无人飞行器（UAV）协同感知中因大量视觉数据导致的通信延迟高和资源效率低的问题。解决方案的关键在于提出一种基站辅助的UAV协同感知框架（Base-Station-Helped UAV, BHU），其核心创新包括：通过Top-K选择机制从RGB图像中提取最具信息量的像素，实现视觉数据稀疏化传输以降低数据量与延迟；利用多用户MIMO（MU-MIMO）将稀疏图像传至地面服务器，结合基于Swin-large的MaskDINO编码器提取鸟瞰图（BEV）特征并完成协同特征融合；进一步设计基于扩散模型的深度强化学习（DRL）算法，联合优化协作UAV选择、稀疏化比例及预编码矩阵，从而在通信效率与感知效用之间取得平衡。

链接: https://arxiv.org/abs/2603.16927
作者: Yunting Xu,Jiacheng Wang,Ruichen Zhang,Changyuan Zhao,Yinqiu Liu,Dusit Niyato,Liang Yu,Haibo Zhou,Dong In Kim
机构: Nanyang Technological University (南洋理工大学); Alibaba Cloud (阿里云); Nanjing University (南京大学); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird’s-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.

[CV-154] Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards

【速读】：该论文旨在解决放射学报告生成中临床有效性（Clinical Efficacy, CE）不足的问题，尤其是传统单模型强化学习或事后独立训练模型的代理化方法难以协同优化整体系统性能的局限性。其解决方案的关键在于提出一种多模态多智能体强化学习框架 MARL-Rad，通过协调局部区域智能体与全局整合智能体，并利用可临床验证的奖励机制联合训练整个智能体系统，从而显著提升报告的准确性、细节丰富度及侧向一致性，实现最优的临床有效性表现。

链接: https://arxiv.org/abs/2603.16876
作者: Kaito Baba,Satoshi Kodera
机构: The University of Tokyo Hospital (东京大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 3 figures

点击查看摘要

Abstract:We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinically efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.

[CV-155] Deep Learning-Based Airway Segmentation in Systemic Lupus Erythematosus Patients with Interstitial Lung Disease (SLE-ILD): A Comparative High-Resolution CT Analysis

【速读】：该论文旨在解决系统性红斑狼疮（Systemic Lupus Erythematosus, SLE）患者中伴有间质性肺病（Interstitial Lung Disease, ILD）与不伴ILD者在肺叶及肺段气道容积差异的量化识别问题，从而揭示SLE-ILD特有的气道结构表型。解决方案的关键在于开发了一种基于U-Net架构的定制化深度学习框架，实现了对非增强胸部高分辨率CT（High-Resolution CT, HRCT）图像中肺叶和肺段气道结构的自动分割，并通过统计学方法比较两组患者的气道容积差异，结果表明SLE-ILD患者在上肺叶及特定肺段（如右上叶R1、R3及左上叶L3）存在显著气道扩张，提示该AI驱动的定量影像生物标志物具有早期识别和监测SLE相关ILD的潜力。

链接: https://arxiv.org/abs/2603.17547
作者: Sirong Piao(1),Ying Ming(1),Ruijie Zhao(1),Jiaru Wang(1),Ran Xiao(1),Rui Zhao(1),Zicheng Liao(1),Qiqi Xu(2),Shaoze Luo(2),Bing Li(2),Lin Li(2),Zhuangfei Ma(3),Fuling Zheng(1),Wei Song(1) ((1) Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China, (2) Research and Development Center (RDC), Canon Medical Systems (China), Beijing, China, (3) Canon Medical Systems (China), Beijing, China)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To characterize lobar and segmental airway volume differences between systemic lupus erythematosus (SLE) patients with interstitial lung disease (ILD) and those without ILD (non-ILD) using a deep learning-based approach on non-contrast chest high-resolution CT (HRCT). Methods: A retrospective analysis was conducted on 106 SLE patients (27 SLE-ILD, 79 SLE-non-ILD) who underwent HRCT. A customized deep learning framework based on the U-Net architecture was developed to automatically segment airway structures at the lobar and segmental levels via HRCT. Volumetric measurements of lung lobes and segments derived from the segmentations were statistically compared between the two groups using two-sample t-tests (significance threshold: p 0.05). Results: At lobar level, significant airway volume enlargement in SLE-ILD patients was observed in the right upper lobe (p=0.009) and left upper lobe (p=0.039) compared to SLE-non-ILD. At the segmental level, significant differences were found in segments including R1 (p=0.016), R3 (p0.001), and L3 (p=0.038), with the most marked changes in the upper lung zones, while lower zones showed non-significant trends. Conclusion: Our study demonstrates that an automated deep learning-based approach can effectively quantify airway volumes on HRCT scans and reveal significant, region-specific airway dilation in patients with SLE-ILD compared to those without ILD. The pattern of involvement, predominantly affecting the upper lobes and specific segments, highlights a distinct topographic phenotype of SLE-ILD and implicates airway structural alterations as a potential biomarker for disease presence. This AI-powered quantitative imaging biomarker holds promise for enhancing the early detection and monitoring of ILD in the SLE population, ultimately contributing to more personalized patient management.

[CV-156] Structured SIR: Efficient and Expressive Importance-Weighted Inference for High-Dimensional Image Registration

【速读】：该论文旨在解决三维密集图像配准（dense image registration）中不确定性建模的难题，尤其是在高维空间下如何有效捕捉多模态、结构化的后验分布。传统变分推断方法因对后验形式的限制常导致表征不足、过度自信及低质量样本生成。其核心解决方案是提出一种名为Structured SIR的内存和计算高效推断方法，关键创新在于将高维协方差矩阵参数化为低秩协方差与稀疏空间结构化的Cholesky精度因子之和，从而在保持计算可行性的同时，精确刻画复杂的时空相关性，并实现高质量、多模态的不确定性量化。

链接: https://arxiv.org/abs/2603.17415
作者: Ivor J. A. Simpson,Neill D. F. Campbell
机构: University of Sussex, UK; University College London, UK; University of Bath, UK
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Image registration is an ill-posed dense vision task, where multiple solutions achieve similar loss values, motivating probabilistic inference. Variational inference has previously been employed to capture these distributions, however restrictive assumptions about the posterior form can lead to poor characterisation, overconfidence and low-quality samples. More flexible posteriors are typically bottlenecked by the complexity of high-dimensional covariance matrices required for dense 3D image registration. In this work, we present a memory and computationally efficient inference method, Structured SIR, that enables expressive, multi-modal, characterisation of uncertainty with high quality samples. We propose the use of a Sampled Importance Resampling (SIR) algorithm with a novel memory-efficient high-dimensional covariance parameterisation as the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor. This structure enables capturing complex spatial correlations while remaining computationally tractable. We evaluate the efficacy of this approach in 3D dense image registration of brain MRI data, which is a very high-dimensional problem. We demonstrate that our proposed methods produces uncertainty estimates that are significantly better calibrated than those produced by variational methods, achieving equivalent or better accuracy. Crucially, we show that the model yields highly structured multi-modal posterior distributions, enable effective and efficient uncertainty quantification. Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2603.17415 [eess.IV] (or arXiv:2603.17415v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2603.17415 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-157] A Lensless Polarization Camera

【速读】：该论文旨在解决传统 polarization imaging（偏振成像）系统因使用空间或时间复用技术而导致体积大、重量重、成本高的问题。其解决方案的关键在于提出一种紧凑型无透镜偏振成像系统，由一个散射器（diffuser）和一个简单的条纹状偏振掩膜（striped polarization mask）组成，并结合一种显式建模偏振编码的无透镜测量数据的重建算法，从而仅需单次快照即可恢复四幅线性偏振图像，实现了高集成度与计算成像的协同优化。

链接: https://arxiv.org/abs/2603.17156
作者: Noa Kraicer,Shay Elmalem,Erez Yosef,Hani Barhum,Raja Giryes
机构: Weizmann Institute of Science (魏兹曼科学研究所); Tel Aviv University (特拉维夫大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Polarization imaging is a technique that creates a pixel map of the polarization state in a scene. Although invisible to the human eye, polarization can assist various sensing and computer vision tasks. Existing polarization cameras use spatial or temporal multiplexing, which increases the camera volume, weight, cost, or all of the above. Recent lensless imaging approaches, such as DiffuserCam, have demonstrated that compact imaging systems can be realized by replacing the lens with a coding element and performing computational reconstruction. In this work, we propose a compact lensless polarization camera composed of a diffuser and a simple striped polarization mask. By combining this optical design with a reconstruction algorithm that explicitly models the polarization-encoded lensless measurements, four linear polarization images are recovered from a single snapshot. Our results demonstrate the potential of lensless approaches for polarization imaging and reveal the physical factors that govern reconstruction quality, guiding the development of high-quality practical systems.

[CV-158] opology-Guided Biomechanical Profiling: A White-Box Framework for Opportunistic Screening of Spinal Instability on Routine CT

【速读】：该论文旨在解决肿瘤患者常规影像学检查中脊柱不稳性筛查效率低的问题，尤其是由于Spinal Instability Neoplastic Score (SINS)评估依赖复杂的几何推理而常被忽略。现有方法受限于转移性骨溶解导致的拓扑模糊性，使得标准分割和黑箱人工智能难以准确执行SINS评分。其解决方案的关键在于提出Topology-Guided Biomechanical Profiling (TGBP)框架，该框架通过解耦解剖感知与结构推理，引入两项确定性几何创新：(i) 以椎管为参考的分区策略以消除后外侧边界歧义，(ii) 基于协方差的定向包围盒（OBB）实现上下文感知的形态学归一化，从而量化椎体塌陷程度。结合辅助放射组学与大语言模型（LLM）模块，TGBP实现了端到端、可审计的SINS评估，在多中心、多癌种队列中达到90.2%的三分类稳定性分层准确率，并显著优于临床肿瘤医生在复杂结构特征识别和总分估算上的表现。

链接: https://arxiv.org/abs/2603.16963
作者: Zanting Ye,Xuanbin Wu,Guoqing Zhong,Shengyuan Liu,Jiashuai Liu,Ge Song,Zhisong Wang,Jing Hao,Xiaolong Niu,Yefeng Zheng,Yu Zhang,Lijun Lu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 tables, 2 figures

点击查看摘要

Abstract:Routine oncologic computed tomography (CT) presents an ideal opportunity for screening spinal instability, yet prophylactic stabilization windows are frequently missed due to the complex geometric reasoning required by the Spinal Instability Neoplastic Score (SINS). Automating SINS is fundamentally hindered by metastatic osteolysis, which induces topological ambiguity that confounds standard segmentation and black-box AI. We propose Topology-Guided Biomechanical Profiling (TGBP), an auditable white-box framework decoupling anatomical perception from structural reasoning. TGBP anchors SINS assessment on two deterministic geometric innovations: (i) canal-referenced partitioning to resolve posterolateral boundary ambiguity, and (ii) context-aware morphometric normalization via covariance-based oriented bounding boxes (OBB) to quantify vertebral collapse. Integrated with auxiliary radiomic and large language model (LLM) modules, TGBP provides an end-to-end, interpretable SINS evaluation. Validated on a multi-center, multi-cancer cohort ( N=482 ), TGBP achieved 90.2% accuracy in 3-tier stability triage. In a blinded reader study ( N=30 ), TGBP significantly outperformed medical oncologists on complex structural features ( \kappa=0.857 vs.\ 0.570 ) and prevented compounding errors in Total Score estimation ( \kappa=0.625 vs.\ 0.207 ), democratizing expert-level opportunistic screening.

[CV-159] UNICORN: Ultrasound Nakagami Imaging via Score Matching and Adaptation for Assessing Hepatic Steatosis

【速读】：该论文旨在解决传统超声 Nakagami 成像在评估肝脂肪变性（hepatic steatosis）时面临的两大挑战：一是窗口尺寸选择不 optimal 导致图像分辨率下降，二是估计器不稳定影响定量准确性。其解决方案的关键在于提出了一种名为 UNICORN（Ultrasound Nakagami Imaging via Score Matching and Adaptation）的新方法，该方法基于超声包络信号的 score 函数构建了一个闭式解析估计器，实现了像素级参数映射而非固定窗口内的区域估计，从而显著提升了成像的空间分辨率与统计稳定性，同时增强了对肝脂肪变性的临床检测能力与泛化性能。

链接: https://arxiv.org/abs/2603.16942
作者: Kwanyoung Kim,Jaa-Yeon Lee,Youngjun Ko,GunWoo Lee,Jong Chul Ye
机构: GIST(韩国科学技术院); KAIST(韩国科学技术院); Samsung Medison(三星医疗)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 12pages, 7 figures, 6 tables. arXiv admin note: text overlap with arXiv:2403.06275

点击查看摘要

Abstract:Ultrasound imaging is an essential first-line tool for assessing hepatic steatosis. While conventional B-mode ultrasound imaging has limitations in providing detailed tissue characterization, ultrasound Nakagami imaging holds promise for visualizing and quantifying tissue scattering in backscattered signals, with potential applications in fat fraction analysis. However, existing methods for Nakagami imaging struggle with optimal window size selection and suffer from estimator instability, leading to degraded image resolution. To address these challenges, we propose a novel method called UNICORN (Ultrasound Nakagami Imaging via Score Matching and Adaptation), which offers an accurate, closed-form estimator for Nakagami parameter estimation based on the score function of the ultrasound envelope signal. Unlike methods that visualize only specific regions of interest (ROI) and estimate parameters within fixed window sizes, our approach provides comprehensive parameter mapping by providing a pixel-by-pixel estimator, resulting in high-resolution imaging. We demonstrated that our proposed estimator effectively assesses hepatic steatosis and provides visual distinction in the backscattered statistics associated with this condition. Through extensive experiments using real envelope data from patient, we validated that UNICORN enables clinical detection of hepatic steatosis and exhibits robustness and generalizability.

[CV-160] On the Degrees of Freedom of Gridded Control Points in Learning-Based Medical Image Registration

【速读】：该论文旨在解决医学图像配准中因同质或噪声区域导致的病态问题，以及密集体素级解码器带来的高维度、高内存消耗和不稳定性的挑战。其核心解决方案是提出一种基于稀疏控制点参数化的学习型配准框架GridReg，通过在稀疏网格控制点上预测位移场（displacement field），替代传统密集体素级解码，从而显著降低模型参数量与内存占用，同时保持甚至提升配准精度。关键创新在于利用多尺度3D编码器特征的1D序列表示结合位置编码以保留空间上下文，并采用交叉注意力模块实现稀疏网格位移场的预测，此外引入网格自适应训练策略，使模型可在推理阶段灵活适配多种网格尺寸而无需重新训练，有效提升了方法的实用性与效率。

链接: https://arxiv.org/abs/2603.16940
作者: Wen Yan,Qianye Yang,Yipei Wang,Shonit Punwani,Mark Emberton,Vasilis Stavrinides,Yipeng Hu,Dean Barratt
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages; 8 figures

点击查看摘要

Abstract:Many registration problems are ill-posed in homogeneous or noisy regions, and dense voxel-wise decoders can be unnecessarily high-dimensional. A sparse control-point parameterisation provides a compact, smooth deformation representation while reducing memory and improving stability. This work investigates the required control points for learning-based registration network development. We present GridReg, a learning-based registration framework that replaces dense voxel-wise decoding with displacement predictions at a sparse grid of control points. This design substantially cuts the parameter count and memory while retaining registration accuracy. Multiscale 3D encoder feature maps are flattened into a 1D token sequence with positional encoding to retain spatial context. The model then predicts a sparse gridded deformation field using a cross-attention module. We further introduce grid-adaptive training, enabling an adaptive model to operate at multiple grid sizes at inference without retraining. This work quantitatively demonstrates the benefits of using sparse grids. Using three data sets for registering prostate gland, pelvic organs and neurological structures, the results suggested a significant improvement with the usage of grid-controled displacement field. Alternatively, the superior registration performance was obtained using the proposed approach, with a similar or less computational cost, compared with existing algorithms that predict DDFs or displacements sampled on scattered key points.

[CV-161] Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis

【速读】：该论文旨在解决基于胸部CT图像的新冠肺炎（COVID-19）检测与多类疾病分类问题，尤其关注跨数据源场景下的模型鲁棒性不足挑战。解决方案的关键在于提出一种融合2.5D与3D表征的深度学习框架：其中2.5D分支利用DINOv3视觉Transformer处理多视角（轴向、冠状面、矢状面）CT切片以提取稳健的局部特征，而3D分支采用ResNet-18架构建模体素级上下文，并通过Variance Risk Extrapolation（VREx）预训练结合监督对比学习提升跨源泛化能力；最终通过logit级集成策略融合两分支预测结果，实现对不同疾病类型的高精度识别与判别。

链接: https://arxiv.org/abs/2603.14832
作者: Tuan-Anh Yang,Bao V. Q. Bui,Chanh-Quang Vo-Van,Truong-Son Hy
机构: VNUHCM University of Science, Vietnam National University, Vietnam; Ho Chi Minh University of Technology, Vietnam National University, Vietnam; The University of Alabama at Birmingham, United States
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at this https URL

人工智能

[AI-0] Agent Factory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

【速读】：该论文旨在解决大语言模型（Large Language Model, LLM）驱动的智能体在复杂任务场景中难以高效复现成功经验的问题。现有方法通常将成功经验以文本形式记录为提示（prompt）或反思，但这种方式无法确保在多样化和动态环境中可靠地重执行任务。其解决方案的关键在于提出AgentFactory这一新型自进化范式：将成功的任务解保存为可执行的子智能体代码（subagent code），而非文本描述，并基于执行反馈持续优化这些子智能体，使其在积累更多任务经验后变得更为鲁棒和高效。这种机制实现了无监督的持续能力累积，且生成的子智能体为标准化Python代码，具备跨平台可移植性，显著提升了LLM代理系统的长期适应性和实用性。

链接: https://arxiv.org/abs/2603.18000
作者: Zhang Zhang,Shuqi Lu,Hongjin Qian,Di He,Zheng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at this https URL, and our demonstration video is available at this https URL.

[AI-1] oward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

【速读】：该论文旨在解决当前软件漏洞检测中存在的一大瓶颈问题：现有基于学习的漏洞检测方法多依赖于函数级（function-centric）的基准测试，难以在真实、可执行且跨过程（interprocedural）的环境中有效评估模型性能；同时，尽管近期repo-level安全基准已凸显真实环境的重要性，但其人工标注方式限制了规模与可扩展性。论文提出的解决方案关键在于构建一个自动化基准生成器（automated benchmark generator），该工具能够将现实漏洞注入到真实开源仓库中，并自动生成可复现的漏洞利用证明（proof-of-vulnerability, PoV），从而形成精确标注的数据集，用于训练和评估repo-level漏洞检测代理（agent）。此外，研究进一步引入攻击者与检测者之间的对抗共演化机制（adversarial co-evolution loop），以提升检测模型在实际约束下的鲁棒性。

链接: https://arxiv.org/abs/2603.17974
作者: Amine Lbath
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Supervisor: Prof. Massih-Reza Amini

点击查看摘要

Abstract:Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

[AI-2] DAD: Test-Driven Agent ic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

【速读】：该论文旨在解决当前AI编码代理（AI coding agents）在实际软件开发中频繁引入回归问题（regressions）的缺陷，即修改代码后导致原本通过的测试失败，而现有基准评测主要关注修复率（resolution rate），忽视了对回归行为的系统研究。其解决方案的核心是提出TDAD（Test-Driven Agentic Development）方法，结合基于抽象语法树（AST）的代码-测试图构建与加权影响分析（weighted impact analysis），精准识别最可能受变更影响的测试用例，从而在AI代理执行修改时优先验证相关测试。实验表明，该方法在两个本地模型上将测试级回归率降低70%（从6.08%降至1.82%），并提升修复成功率；更重要的是，发现仅靠TDD提示（Test-Driven Development prompting）反而会增加回归，说明对于小型模型而言，提供上下文信息（如需验证哪些测试）比程序性指令（如何执行TDD）更为关键，这为AI代理工具设计提供了重要启示：强调上下文感知而非流程固化更有效。

链接: https://arxiv.org/abs/2603.17973
作者: Pepe Alonso
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Toolpaper, 7 pages, 3 tables, 1 figure, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track)

点击查看摘要

Abstract:AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD’s GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at this https URL.

[AI-3] Specification-Aware Distribution Shaping for Robotics Foundation Models

【速读】：该论文旨在解决机器人基础模型（Robotics Foundation Models）在部署过程中缺乏对时间相关规范（time-dependent specifications）的安全保障问题，尤其是在需要满足复杂时空约束（如时间边界内的目标访问、顺序目标和持续安全条件）时。解决方案的关键在于提出一种规范感知的动作分布优化框架，该框架在不修改预训练模型参数的前提下，通过在每个决策步骤中计算最小扰动的动作分布来满足硬性信号时序逻辑（Signal Temporal Logic, STL）可行性约束，其核心机制是利用前向动力学传播对剩余时间窗进行推理，从而确保执行过程符合STL规范。

链接: https://arxiv.org/abs/2603.17969
作者: Sadık Bera Yüksel,Derya Aksaray
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.

[AI-4] CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention ICLR2026

【速读】：该论文旨在解决将预训练的注意力模块（如分组查询注意力，GQA）转换为多头潜在注意力（MLA）时存在的表达能力不足与激活漂移问题。现有方法多依赖仅基于权重矩阵的低秩近似（如SVD初始化）和均匀秩分配策略，忽视了输入激活的协方差结构，导致KV缓存成本不变的前提下注意力保真度下降。其解决方案的关键在于提出CARE（Covariance-Aware, Rank-Enhanced MLA conversion pipeline），包含三个核心步骤：(i) 激活保持因子分解，使近似更贴近实际输入激活而非仅匹配权重；(ii) 调整后的秩分配机制，在固定KV宽度约束下动态优化各层容量分配；(iii) KV对齐映射，重新参数化K和V以适配MLA格式但不增加KV缓存大小。实验表明，CARE在Qwen3和Llama-3.1模型上显著优于统一秩SVD基线，单次困惑度降低最多达215倍，平均准确率提升最高1.70倍，并通过简短微调恢复原模型性能。

链接: https://arxiv.org/abs/2603.17946
作者: Zhongzhu Zhou,Fengxiang Bie,Ziyan Chen,Zhenyu Zhang,Yibo Yang,Junxiong Wang,Ben Athiwaratkun,Xiaoxia Wu,Shuaiwen Leon Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026. Conference paper. 10 pages main text; 34 pages total including references and appendix. 11 figures and 20 tables in total

点击查看摘要

Abstract:Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model’s accuracy.

[AI-5] Differential Privacy in Generative AI Agents : Analysis and Optimal Tradeoffs

【速读】：该论文旨在解决企业级AI代理（AI agent）在集成大型语言模型（Large Language Models, LLMs）后可能引发的敏感数据隐私泄露问题，尤其关注从企业数据角度而非仅用户提示（prompt）隐私出发的风险建模。其解决方案的关键在于构建一个基于差分隐私（differential privacy）的概率框架，将响应生成建模为从提示和数据集到标记序列分布的随机机制，并引入标记级别（token-level）和消息级别（message-level）的差分隐私定义，从而量化隐私泄露与生成参数（如温度和消息长度）之间的关系，进一步提出隐私-效用权衡的设计问题，以指导最优温度选择来平衡隐私保护与输出质量。

链接: https://arxiv.org/abs/2603.17902
作者: Ya-Ting Yang,Quanyan Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) and AI agents are increasingly integrated into enterprise systems to access internal databases and generate context-aware responses. While such integration improves productivity and decision support, the model outputs may inadvertently reveal sensitive information. Although many prior efforts focus on protecting the privacy of user prompts, relatively few studies consider privacy risks from the enterprise data perspective. Hence, this paper develops a probabilistic framework for analyzing privacy leakage in AI agents based on differential privacy. We model response generation as a stochastic mechanism that maps prompts and datasets to distributions over token sequences. Within this framework, we introduce token-level and message-level differential privacy and derive privacy bounds that relate privacy leakage to generation parameters such as temperature and message length. We further formulate a privacy-utility design problem that characterizes optimal temperature selection.

[AI-6] scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM -Generated Patterns

【速读】：该论文旨在解决科学计算中由方法论错误（如数据泄露、交叉验证不当和随机种子缺失）导致的Python代码产生看似合理但实际错误结果的问题，而传统静态分析工具难以识别此类问题。其解决方案的关键在于提出scicode-lint，采用两层架构：构建时使用前沿模型生成检测模式（而非人工编码），运行时则依赖轻量本地模型执行检测；这种设计使工具能以较低成本适应新库版本（仅需消耗少量token），显著提升可持续性和可扩展性。

链接: https://arxiv.org/abs/2603.17893
作者: Sergey V. Samsonau
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.

[AI-7] RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在资源受限硬件上部署时，因现有后训练量化方法强制全层统一比特宽度而导致的精度-效率权衡不佳的问题。其核心解决方案是提出RAMP（Reinforcement Adaptive Mixed Precision），一个基于离策略Soft Actor-Critic框架的自适应混合精度量化方法，通过学习每层最优比特分配以最小化困惑度（perplexity）并满足全局比特预算。关键创新在于：1）设计了一个包含激活统计量、权重属性和结构描述符的11维嵌入作为策略条件输入，实现跨模型家族和规模的零样本迁移；2）引入Scale Folding预处理技术，通过通道级缩放与归一化层补偿将激活异常值迁移至权重中，从而稳定实现低于4比特的量化；3）采用带不对称惩罚和预算悬崖的质量优先奖励机制，加速收敛。实验表明，RAMP在Llama 2 7B上以3.68GB存储空间达到5.54困惑度，优于统一4比特AWQ和GPTQ方案，在不同模型间具有强泛化能力。

链接: https://arxiv.org/abs/2603.17891
作者: Arpit Singh Gautam,Saurabh Jha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post training quantization is essential for deploying large language models (LLMs) on resource constrained hardware, yet state of the art methods enforce uniform bit widths across layers, yielding suboptimal accuracy efficiency trade offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off policy Soft Actor Critic framework that learns per layer bit width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11 dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero shot transfer across model families and scales. To enable stable sub 4 bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per channel scaling and normalization layer compensation. A quality prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68GB (3.65 effective bits), outperforming uniform 4 bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero shot to Llama 2 13B and Mistral 7B, often surpassing target specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.

[AI-8] Procedural Generation of Algorithm Discovery Tasks in Machine Learning

【速读】：该论文旨在解决当前机器学习算法发现系统（Algorithm Discovery Agents, ADAs）在开发与评估过程中所面临的瓶颈问题，主要包括任务集存在的评价方法不严谨、数据污染以及问题饱和或高度相似等缺陷。其解决方案的关键在于提出DiscoGen——一个用于生成机器学习算法发现任务的程序化生成器，能够从多个机器学习领域中自动构造出数百万个难度和复杂度各异的任务，这些任务由少量配置参数定义，并可用于优化ADAs。DiscoGen通过提供多样、可扩展且结构清晰的任务空间，显著提升了算法发现系统的训练与评估能力，同时配套提出了DiscoBench基准以支持对ADAs的系统性评估。

链接: https://arxiv.org/abs/2603.17863
作者: Alexander D. Goldie,Zilin Wang,Adrian Hayler,Deepak Nathani,Edan Toledo,Ken Thampiratwong,Aleksandra Kalisz,Michael Beukman,Alistair Letcher,Shashank Reddy,Clarisse Wibault,Theo Wolf,Charles O’Neill,Uljad Berdica,Nicholas Roberts,Saeed Rahmani,Hannah Erlebach,Roberta Raileanu,Shimon Whiteson,Jakob N. Foerster
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open-source at this https URL.

[AI-9] Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

【速读】：该论文旨在解决扩散模型（Diffusion Models）与流匹配（Flow Matching）在机器人模仿学习中因固定积分调度导致的结构效率低下问题，即推理过程无法根据状态复杂度动态调整计算资源，从而造成对简单动作与复杂任务均分配相同计算预算的浪费。其解决方案的关键在于提出一种时间无条件的生成控制优化框架（Generative Control as Optimization, GeCO），将动作合成从轨迹积分转变为迭代优化过程：GeCO通过学习动作序列空间中的稳态速度场（stationary velocity field），使专家行为成为稳定的吸引子（attractors），从而在测试阶段实现基于收敛性的自适应计算分配——简单状态可提前退出，复杂状态则持续优化；同时，该稳态几何结构还自然产生无需训练的安全信号，即优化后动作处的速度场范数可作为分布外检测（OOD Detection）指标，有效识别异常状态。

链接: https://arxiv.org/abs/2603.17834
作者: Zunzhe Zhang,Runhan Huang,Yicheng Liu,Shaoting Zhu,Linzhan Mou,Hang Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence–exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at this https URL

[AI-10] RPMS: Enhancing LLM -Based Embodied Planning through Rule-Augmented Memory Synergy

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在封闭世界具身环境（closed-world embodied environments）中执行任务时的失败问题，其核心挑战在于动作必须满足严格的先决条件（如位置、物品持有状态和容器状态），且失败反馈稀疏。作者识别出两种结构耦合的失败模式：(P1) 无效动作生成与 (P2) 状态漂移，二者相互放大形成退化循环。解决方案的关键是提出RPMS架构，通过三重机制实现突破：一是基于结构化规则检索确保动作可行性；二是利用轻量级信念状态（belief state）控制记忆适用性；三是采用规则优先仲裁策略化解规则与记忆之间的冲突。实验证明，规则检索单独贡献了+14.9个百分点的性能提升（统计显著），表明其为关键因素；同时发现情景记忆仅在状态接地并受显式动作规则约束时才具有稳定正向作用，从而揭示了记忆使用的条件性价值。

链接: https://arxiv.org/abs/2603.17831
作者: Zhenhang Yuan,Shenghai Yuan,Lihua Xie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents often fail in closed-world embodied environments because actions must satisfy strict preconditions – such as location, inventory, and container states – and failure feedback is sparse. We identify two structurally coupled failure modes: (P1) invalid action generation and (P2) state drift, each amplifying the other in a degenerative cycle. We present RPMS, a conflict-managed architecture that enforces action feasibility via structured rule retrieval, gates memory applicability via a lightweight belief state, and resolves conflicts between the two sources via rules-first arbitration. On ALFWorld (134 unseen tasks), RPMS achieves 59.7% single-trial success with Llama 3.1 8B (+23.9 pp over baseline) and 98.5% with Claude Sonnet 4.5 (+11.9 pp); of the 8B gain, rule retrieval alone contributes +14.9 pp (statistically significant), making it the dominant factor. A key finding is that episodic memory is conditionally useful: it harms performance on some task types when used without grounding, but becomes a stable net positive once filtered by current state and constrained by explicit action rules. Adapting RPMS to ScienceWorld with GPT-4 yields consistent gains across all ablation conditions (avg. score 54.0 vs. 44.9 for the ReAct baseline), providing transfer evidence that the core mechanisms hold across structurally distinct environments.

[AI-11] FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

【速读】：该论文针对多模态自动化程序修复（Multimodal Automated Program Repair, MAPR）中存在的三大问题展开研究：一是现有方法依赖僵化的流水线工作流，限制了调试过程中的探索空间；二是视觉推理通常基于全屏截图而缺乏局部区域的精准定位；三是失败的修复尝试难以转化为可复用的知识。为解决上述挑战，论文提出FailureMem框架，其核心创新在于集成三个关键机制：混合工作流-代理架构以实现结构化定位与灵活推理的平衡、主动感知工具实现基于区域的视觉定位能力，以及失败记忆库（Failure Memory Bank）将历史修复经验转化为可重用的指导信息。实验表明，FailureMem在SWE-bench Multimodal数据集上相较于GUIRepair提升了3.7%的修复成功率。

链接: https://arxiv.org/abs/2603.17826
作者: Ruize Ma,Yilei Jiang,Shilin Zhang,Zheng Ma,Yi Feng,Vincent Ng,Zhi Wang,Xiangyu Yue,Chuanyi Li,Lewei Lu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Automated Program Repair (MAPR) extends traditional program repair by requiring models to jointly reason over source code, textual issue descriptions, and visual artifacts such as GUI screenshots. While recent LLM-based repair systems have shown promising results, existing approaches face several limitations: rigid workflow pipelines restrict exploration during debugging, visual reasoning is often performed over full-page screenshots without localized grounding, and failed repair attempts are rarely transformed into reusable knowledge. To address these challenges, we propose FailureMem, a multimodal repair framework that integrates three key mechanisms: a hybrid workflow-agent architecture that balances structured localization with flexible reasoning, active perception tools that enable region-level visual grounding, and a Failure Memory Bank that converts past repair attempts into reusable guidance. Experiments on SWE-bench Multimodal demonstrate FailureMem improves the resolved rate over GUIRepair by 3.7%.

[AI-12] Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference

【速读】：该论文旨在解决Transformer-based语言模型在推理阶段因随机性（如蒙特卡洛dropout）导致的行为不确定性问题，尤其是其对模型可靠性的影响尚缺乏系统评估。当前生成式AI（Generative AI）应用中，不确定性量化对于可靠决策至关重要，但现有研究未充分揭示不同架构下dropout诱导的预测波动特性。解决方案的关键在于构建一个基于蒙特卡洛Dropout（MC Dropout）的全面基准测试框架，通过100次随机前向传播对19种Transformer模型进行95次独立评估，并引入认知分解框架将性能解耦为记忆（memory）与推理（reasoning）两个子组件。结果表明：模型鲁棒性高度依赖于架构而非规模，且多数模型在标准dropout设置下出现显著准确率下降（高达24个百分点），其中记忆任务受dropout破坏更严重（下降27个百分点），这揭示了Dropout对稳定表示学习的干扰机制，为不确定性感知场景下的模型选择提供了可操作的指导依据。

链接: https://arxiv.org/abs/2603.17811
作者: Antônio Junior Alves Caiado,Michael Hahsler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based language models are widely deployed for reasoning, yet their behavior under inference-time stochasticity remains underexplored. While dropout is common during training, its inference-time effects via Monte Carlo sampling lack systematic evaluation across architectures, limiting understanding of model reliability in uncertainty-aware applications. This work analyzes dropout-induced variability across 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Dropout robustness is defined as maintaining high accuracy and stable predictions under stochastic inference, measured by standard deviation of per-run accuracies. A cognitive decomposition framework disentangles performance into memory and reasoning components. Experiments span five dropout configurations yielding 95 unique evaluations on 1,000 samples. Results reveal substantial architectural variation. Smaller models demonstrate perfect prediction stability while medium-sized models exhibit notable volatility. Mid-sized models achieve the best overall performance; larger models excel at memory tasks. Critically, 53% of models suffer severe accuracy degradation under baseline MC Dropout, with task-specialized models losing up to 24 percentage points, indicating unsuitability for uncertainty quantification in these architectures. Asymmetric effects emerge: high dropout reduces memory accuracy by 27 percentage points while reasoning degrades only 1 point, suggesting memory tasks rely on stable representations that dropout disrupts. 84% of models demonstrate memory-biased performance. This provides the first comprehensive MC Dropout benchmark for transformers, revealing dropout robustness is architecture-dependent and uncorrelated with scale. The cognitive profiling framework offers actionable guidance for model selection in uncertainty-aware applications. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17811 [cs.LG] (or arXiv:2603.17811v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.17811 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Antonio Caiado [view email] [v1] Wed, 18 Mar 2026 15:04:26 UTC (458 KB)

[AI-13] EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

【速读】：该论文旨在解决视频生成模型作为机器人世界模型时存在的“可执行性差距”（executability gap）问题，即生成的视觉轨迹虽在视觉上连贯，但可能违反刚体动力学和运动学约束，导致逆动力学模型（IDM）解码出不稳定或不可行的控制指令。解决方案的关键在于提出一种基于强化学习的后训练框架——可执行视频对齐（Executable Video Alignment, EVA），其核心思想是将可执行性差距转化为训练信号：利用真实机器人轨迹训练IDM，并将其重用为奖励模型，通过评估生成视频所诱导的动作序列来优化视频生成过程，从而鼓励平滑的运动（以速度、加速度和急动度衡量），同时惩罚违反具身约束的动作，即使生成视频存在严重视觉伪影，该奖励机制仍能有效引导模型生成更符合物理可执行性的轨迹。

链接: https://arxiv.org/abs/2603.17808
作者: Ruixiang Wang,Qingming Liu,Yueci Deng,Guiliang Liu,Zhen Liu,Kui Jia
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.

[AI-14] RangeAD: Fast On-Model Anomaly Detection

【速读】：该论文旨在解决传统异常检测（Anomaly Detection, AD）方法中存在效率低下和信息冗余的问题。现有方法通常部署一个独立的AD模型与主模型并行运行，但忽略了主模型本身已编码了目标分布的大量信息。为此，作者提出“On-Model AD”这一新设置，强调直接利用主模型中的内部表示进行异常检测。其核心解决方案是RangeAD算法，该算法通过提取主模型中神经元的输出范围（neuron-wise output ranges）来构建高效且高精度的异常判别机制，在保持高性能的同时显著降低推理开销，尤其适用于高维任务场景。

链接: https://arxiv.org/abs/2603.17795
作者: Luca Hinkamp,Simon Klüttermann,Emmanuel Müller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:In practice, machine learning methods commonly require anomaly detection (AD) to filter inputs or detect distributional shifts. Typically, this is implemented by running a separate AD model alongside the primary model. However, this separation ignores the fact that the primary model already encodes substantial information about the target distribution. In this paper, we introduce On-Model AD, a setting for anomaly detection that explicitly leverages access to a related machine learning model. Within this setting, we propose RangeAD, an algorithm that utilizes neuron-wise output ranges derived from the primary model. RangeAD achieves superior performance even on high-dimensional tasks while incurring substantially lower inference costs. Our results demonstrate the potential of the On-Model AD setting as a practical framework for efficient anomaly detection.

[AI-15] Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory

【速读】：该论文旨在解决大语言模型在实际应用中依赖上下文记忆（in-context memory）所导致的可扩展性与可靠性问题，尤其是在知识存储容量受限、信息压缩损失及任务目标漂移等方面。其核心解决方案是引入知识对象（Knowledge Objects, KOs），即离散的哈希地址元组结构，具备O(1)的快速检索能力，并通过密度自适应检索机制实现对传统上下文记忆的动态切换。实验表明，KOs在保持100%准确率的同时，相较传统方法降低252倍成本，在多跳推理任务中性能显著优于上下文记忆（78.9% vs 31.6%），且该方案具有跨模型通用性，有效规避了因架构特性引发的压缩损耗问题。

链接: https://arxiv.org/abs/2603.17781
作者: Oliver Zahn,Simran Chana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 7 figures

点击查看摘要

Abstract:Large language models increasingly serve as persistent knowledge workers, with in-context memory - facts stored in the prompt - as the default strategy. We benchmark in-context memory against Knowledge Objects (KOs), discrete hash-addressed tuples with O(1) retrieval. Within the context window, Claude Sonnet 4.5 achieves 100% exact-match accuracy from 10 to 7,000 facts (97.5% of its 200K window). However, production deployment reveals three failure modes: capacity limits (prompts overflow at 8,000 facts), compaction loss (summarization destroys 60% of facts), and goal drift (cascading compaction erodes 54% of project constraints while the model continues with full confidence). KOs achieve 100% accuracy across all conditions at 252x lower cost. On multi-hop reasoning, KOs reach 78.9% versus 31.6% for in-context. Cross-model replication across four frontier models confirms compaction loss is architectural, not model-specific. We additionally show that embedding retrieval fails on adversarial facts (20% precision at 1) and that neural memory (Titans) stores facts but fails to retrieve them on demand. We introduce density-adaptive retrieval as a switching mechanism and release the benchmark suite.

[AI-16] Attention Sinks Induce Gradient Sinks

【速读】：该论文试图解决Transformer模型中注意力下沉（attention sinks）与大规模激活（massive activations）之间的因果关系问题，尤其是二者是否在训练过程中存在直接关联。现有研究主要关注前向传播，未能明确其内在机制。论文从反向传播角度出发，提出梯度下沉（gradient sinks）是连接注意力下沉与大规模激活的关键训练时中介机制：在因果掩码（causal mask）条件下，注意力下沉会引发显著的梯度集中现象，从而驱动预归一化（pre-norm）架构中使用RMSNorm的模型产生大规模激活作为自适应响应。解决方案的关键在于引入V-scale，通过调整值路径（value-path）的反向传播梯度，实验证明该方法可在保留注意力下沉的同时抑制大规模激活，从而验证了梯度下沉的核心中介作用。

链接: https://arxiv.org/abs/2603.17771
作者: Yihong Chen,Quanming Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing studies have largely focused on the forward pass, making it unclear whether their connection is direct or mediated by a training-time mechanism. We study this question from the perspective of backpropagation. Empirically and theoretically, we show that under causal mask, attention sinks can induce pronounced gradient concentration, which we term gradient sinks. Furthermore, in pre-norm architectures with RMSNorm, massive activations can be understood as an adaptive response to this localized gradient pressure during training. To test this hypothesis, we introduce V-scale, a modification that adjusts value-path backpropagated gradients. In pretrained V-scale models, attention sinks are preserved whereas massive activations are suppressed. These results support the interpretation that gradient sink is a key training-time mediator linking attention sinks and massive activations.

[AI-17] Machine Learning for Network Attacks Classification and Statistical Evaluation of Machine Learning for Network Attacks Classification and Adversarial Learning Methodologies for Synthetic Data Generation

【速读】：该论文旨在解决当前网络入侵检测系统（NIDS）在面对日益复杂和智能化的攻击（如生成式AI（GenAI）和强化学习技术驱动的攻击）时，模型稳定性与数据质量不足的问题。其解决方案的关键在于构建首个统一多模态的NIDS数据集，整合流级数据、包载荷信息及时间上下文特征，并基于同一特征空间对CIC-IDS-2017、CIC-IoT-2023、UNSW-NB15和CIC-DDoS-2019数据集进行重构；同时采用分层交叉验证的机器学习（ML）算法提升检测模型的稳定性和可靠性，并通过对抗学习生成合成数据，利用SDV框架、f散度（f-divergence）、可区分性测试（distinguishability）和非参数统计检验评估合成数据的保真度（fidelity）、实用性（utility）与隐私保护能力，最终结合Synthetic Data Vault框架、TRTS与TSTR检验方法，实现高保真度和实用性的生成模型，从而增强NIDS对新型攻击的防御能力。

链接: https://arxiv.org/abs/2603.17717
作者: Iakovos-Christos Zarkadis,Christos Douligeris
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks, in the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features, from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019, with the same feature space. In the first task we use machine learning (ML) algorithms, with stratified cross validation, in order to prevent network attacks, with stability and reliability. In the second task we use adversarial learning algorithms to generate synthetic data, compare them with the real ones and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.

[AI-18] From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving

【速读】：该论文旨在解决自动驾驶技术在实际部署中面临的三大核心挑战：数据稀缺性、安全性要求以及跨多样化环境的泛化能力不足。其解决方案的关键在于系统性地整合合成数据（synthetic data）与虚拟环境（virtual environments），通过三个维度实现突破：首先，利用合成数据提升感知与规划模块的训练效果；其次，基于数字孪生（digital twin）的仿真平台支持系统级验证；最后，采用领域自适应（domain adaptation）策略弥合合成数据与真实世界数据之间的差距。此外，论文强调视觉-语言模型（vision-language models）和仿真真实感（simulation realism）对增强场景理解与泛化性能的重要作用，为构建安全、可扩展且全球适用的自动驾驶系统提供了理论框架与实践路径。

链接: https://arxiv.org/abs/2603.17714
作者: A. Humnabadkar,A. Sikdar,B. Cave,H. Zhang,N. Bessis,A. Behera
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted manuscript - Transactions on Intelligent Transportation Systems

点击查看摘要

Abstract:Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for training and evaluation. This survey presents a comprehensive review of recent developments at the intersection of autonomous driving, simulation technologies, and synthetic datasets. We organize the landscape across three core dimensions: (i) the use of synthetic data for perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies bridging synthetic and real-world data. We also highlight the role of vision-language models and simulation realism in enhancing scene understanding and generalization. A detailed taxonomy of datasets, tools, and simulation platforms is provided, alongside an analysis of trends in benchmark design. Finally, we discuss critical challenges and open research directions, including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning, that must be addressed to accelerate the path toward safe, generalizable, and globally deployable autonomous driving systems.

[AI-19] MALLES: A Multi-agent LLM s-based Economic Sandbox with Consumer Preference Alignment

【速读】：该论文旨在解决现实经济中高维、多模态环境下的决策难题，尤其是由主体异质性和组合数据稀疏性所导致的建模挑战。其核心问题在于如何在不同产品类别间实现有效的偏好迁移与稳定模拟，从而提升经济决策模型的泛化能力和可扩展性。解决方案的关键在于提出了一种基于多智能体大语言模型（Multi-Agent Large Language Model, MALLES）的经济沙盒框架，通过后训练对大规模异构交易记录进行偏好学习，使LLM内化并迁移潜在的消费者偏好模式以缓解单类别的数据稀疏问题；同时引入均值场机制来建模产品环境与用户群体间的动态交互，增强高维决策空间中的采样稳定性，并设计多智能体讨论机制分配认知负荷，借助结构化对话捕捉关键决策因素，从而显著提升产品选择准确率、购买量预测精度及整体仿真稳定性。

链接: https://arxiv.org/abs/2603.17694
作者: Yusen Wu,Yiran Liu,Xiaotie Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the real economy, modern decision-making is fundamentally challenged by high-dimensional, multimodal environments, which are further complicated by agent heterogeneity and combinatorial data sparsity. This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES), leveraging the inherent generalization capabilities of large-sacle models to establish a unified simulation framework applicable to cross-domain and cross-category scenarios. Central to our approach is a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories. This methodology enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces. Furthermore, we propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information. This architecture distributes cognitive load to alleviate single-agent attention bottlenecks and captures critical decision factors through structured dialogue. Experiments demonstrate that our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines. Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and latter analysis in the real economy based on foundational database.

[AI-20] Can Blindfolded LLM s Still Trade? An Anonymization-First Framework for Portfolio Optimization ICLR2026

【速读】：该论文旨在解决大语言模型（Large Language Model, LLM）交易代理在金融市场中可能因记忆偏差（memorization bias）和幸存者偏差（survivorship bias）而产生虚假性能的问题，确保其预测能力源于对市场动态的真实理解而非对预训练数据中特定股票代码关联的记忆。解决方案的关键在于“盲化”（blindfolding）——通过匿名化所有股票代码和公司名称，使LLM代理无法依赖已知标识进行推理，并在此基础上构建基于图神经网络（Graph Neural Network, GNN）的推理嵌入图结构，结合PPO-DSR策略进行交易决策，从而验证信号的合法性与鲁棒性。实验表明，在2025年至今（截至2025-08-01）的回测中，该方法实现了平均夏普比率1.40 ± 0.22，且通过负向对照实验确认了信号的有效性；进一步扩展至2024–2025年的多市场周期评估揭示了策略在波动环境中表现优异，但在趋势性牛市中Alpha衰减，体现出对市场状态的依赖性。

链接: https://arxiv.org/abs/2603.17692
作者: Joohyoung Jeon,Hongchul Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM)
备注: Accepted at the ICLR 2026 Workshop on Advances in Financial AI (FinAI). 18 pages, 7 figures

点击查看摘要

Abstract:For LLM trading agents to be genuinely trustworthy, they must demonstrate understanding of market dynamics rather than exploitation of memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents–anonymizing all identifiers–and verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with reasoning. We construct a GNN graph from reasoning embeddings and trade using PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieved Sharpe 1.40 +/- 0.22 across 20 seeds and validated signal legitimacy through negative control experiments. To assess robustness beyond a single OOS window, we additionally evaluate an extended period (2024–2025), revealing market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.

[AI-21] Objective Mispricing Detection for Shortlisting Undervalued Football Players via Market Dynamics and News Signals

【速读】：该论文旨在解决足球运动员市场估值中存在主观偏差的问题，提出一种基于客观错估（mispricing）的可复现框架来识别被低估的球员。其核心解决方案是通过结构化数据（如历史市场动态、生物特征与合同信息、转会记录）估计球员的预期市场价值，并将其与实际观测值对比以量化错估程度；在此基础上进一步引入从新闻文本中提取的自然语言处理（Natural Language Processing, NLP）特征（如情感统计和语义嵌入），验证其是否能增强对被低估球员的识别能力。实验表明，市场动态是主要信号，而NLP特征提供稳定且具解释性的辅助增益，尤其在高不确定性情境下通过放大波动性线索提升模型鲁棒性。该方法设计用于支持球探工作流中的排名/筛选决策，而非硬性分类阈值，具备良好的可复现性和伦理透明度。

链接: https://arxiv.org/abs/2603.17687
作者: Chinenye Omejieke,Shuyao Chen,Xia Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a practical, reproducible framework for identifying undervalued football players grounded in objective mispricing. Instead of relying on subjective expert labels, we estimate an expected market value from structured data (historical market dynamics, biographical and contract features, transfer history) and compare it to the observed valuation to define mispricing. We then assess whether news-derived Natural Language Processing (NLP) features (i.e., sentiment statistics and semantic embeddings from football articles) complement market signals for shortlisting undervalued players. Using a chronological (leakage-aware) evaluation, gradient-boosted regression explains a large share of the variance in log-transformed market value. For undervaluation shortlisting, ROC-AUC-based ablations show that market dynamics are the primary signal, while NLP features provide consistent, secondary gains that improve robustness and interpretability. SHAP analyses suggest the dominance of market trends and age, with news-derived volatility cues amplifying signals in high-uncertainty regimes. The proposed pipeline is designed for decision support in scouting workflows, emphasizing ranking/shortlisting over hard classification thresholds, and includes a concise reproducibility and ethics statement. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17687 [cs.LG] (or arXiv:2603.17687v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.17687 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-22] Sensi: Learn One Thing at a Time – Curriculum-Based Test-Time Learning for LLM Game Agents

【速读】：该论文旨在解决大型语言模型（Large Language Model, LLM）代理在未知环境中部署时，需在测试阶段学习任务结构的问题，而当前方法通常需要数千次交互才能形成有效假设，效率低下。其解决方案的关键在于提出Sensi架构，通过三个核心机制实现结构化的测试时学习：(1) 两玩家架构将感知与行动分离，提升模块化能力；(2) 基于外部状态机管理的课程学习系统，引导渐进式知识获取；(3) 数据库作为控制平面，使上下文窗口可编程地调节，增强推理可控性。此外，引入LLM作为裁判组件，结合动态生成的评估标准判断是否完成当前主题学习并进入下一阶段。实验表明，Sensi v2虽未成功通关游戏关卡，但仅用约32次动作即完成全部课程学习，相较同类系统（需1600–3000次尝试）样本效率提升50–94倍，揭示了从学习效率瓶颈向感知基础（perceptual grounding）瓶颈的转移，为后续优化提供了明确方向。

链接: https://arxiv.org/abs/2603.17683
作者: Mohsen Arjmandi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. 18 pages, 5 figures, 2 tables. Independent research. Code and Colab demo coming soon on GitHub

点击查看摘要

Abstract:Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC-AGI-3 game-playing challenge that introduces structured test-time learning through three mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane that makes the agents context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two-player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels - but completes its entire learning curriculum in approximately 32 action attempts, achieving 50-94x greater sample efficiency than comparable systems that require 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding - a more tractable problem.

[AI-23] Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

【速读】：该论文旨在解决当前大型语言模型（Large Language Models, LLMs）在安全任务（如Linux权限提升，Privilege Escalation）中依赖云端闭源系统所带来的资源消耗高、难以复现以及无法处理专有代码或敏感数据的问题。为此，作者提出了一种两阶段后训练（post-training）流水线：首先在程序化生成的权限提升环境中进行监督微调（Supervised Fine-Tuning, SFT），随后通过基于可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards）进一步优化模型性能。关键创新在于利用自动可验证的成功反馈机制，在严格资源预算下实现高效本地部署，并最终使小型模型（4B参数规模）在12个Linux权限提升场景中达到95.8%的成功率，显著优于基线且推理成本降低超过100倍。

链接: https://arxiv.org/abs/2603.17673
作者: Philipp Normann,Andreas Happe,Jürgen Cito,Daniel Arp
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents are increasingly relevant to research domains such as vulnerability discovery. Yet, the strongest systems remain closed and cloud-only, making them resource-intensive, difficult to reproduce, and unsuitable for work involving proprietary code or sensitive data. Consequently, there is an urgent need for small, local models that can perform security tasks under strict resource budgets, but methods for developing them remain underexplored. In this paper, we address this gap by proposing a two-stage post-training pipeline. We focus on the problem of Linux privilege escalation, where success is automatically verifiable and the task requires multi-step interactive reasoning. Using an experimental setup that prevents data leakage, we post-train a 4B model in two stages: supervised fine-tuning on traces from procedurally generated privilege-escalation environments, followed by reinforcement learning with verifiable rewards. On a held-out benchmark of 12 Linux privilege-escalation scenarios, supervised fine-tuning alone more than doubles the baseline success rate at 20 rounds, and reinforcement learning further lifts our resulting model, PrivEsc-LLM, to 95.8%, nearly matching Claude Opus 4.6 at 97.5%. At the same time, the expected inference cost per successful escalation is reduced by over 100x.

[AI-24] Automated Grammar-based Algebraic Multigrid Design With Evolutionary Algorithms

【速读】：该论文旨在解决多网格（Multigrid）方法在求解重要偏微分方程时效率受限的问题，其根源在于算法组件（如平滑策略和循环模式）的选取高度依赖人工经验。为突破这一瓶颈，作者提出一种基于进化算法（Evolutionary Algorithms）的互补策略，通过遗传编程（Genetic Programming, GP）结合上下文无关文法（Context-Free Grammars）自动构建非标准的多网格循环结构，特别是具有层次特定平滑序列和非递归循环模式的灵活循环（Flexible Cycling）。该方案的关键在于利用GP在庞大且难以手动探索的搜索空间中高效生成高性能的代数多网格（Algebraic Multigrid, AMG）配置，数值实验表明此类非标准GP循环在作为求解器或预条件子时均能显著提升性能。

链接: https://arxiv.org/abs/2603.17641
作者: Dinesh Parthasarathy,Wayne Mitchell,Arjun Gambhir,Harald Köstler,Ulrich Rüde
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Although multigrid is asymptotically optimal for solving many important partial differential equations, its efficiency relies heavily on the careful selection of the individual algorithmic components. In contrast to recent approaches that can optimize certain multigrid components using deep learning techniques, we adopt a complementary strategy, employing evolutionary algorithms to construct efficient multigrid cycles from proven algorithmic building blocks. Here, we will present its application to generate efficient algebraic multigrid methods with so-called \emphflexible cycling, that is, level-specific smoothing sequences and non-recursive cycling patterns. The search space with such non-standard cycles is intractable to navigate manually, and is generated using genetic programming (GP) guided by context-free grammars. Numerical experiments with the linear algebra library, \emphhypre, demonstrate the potential of these non-standard GP cycles to improve multigrid performance both as a solver and a preconditioner.

[AI-25] VeriGrey: Greybox Agent Validation

【速读】：该论文旨在解决生成式 AI（Generative AI）代理在自主决策与外部环境交互过程中引入的关键安全风险问题，尤其是难以发现的间接提示注入漏洞（indirect prompt injection vulnerabilities）。其解决方案的核心在于提出一种灰盒测试方法 VeriGrey，通过将代理调用的工具序列作为反馈函数驱动测试过程，从而识别出罕见但危险的工具调用行为；同时，设计基于任务关联的恶意注入提示构造机制，使攻击任务成为完成代理功能的必要步骤，显著提升了对复杂攻击场景的检测能力。实验表明，VeriGrey 在 AgentDojo 基准上相比黑盒基线提升了 33% 的漏洞发现效率，并在真实世界代理（如 Gemini CLI 和 OpenClaw）中成功识别出黑盒方法无法探测的恶意技能变体，验证了动态灰盒测试在代理安全保障中的有效性。

链接: https://arxiv.org/abs/2603.17639
作者: Yuntong Zhang,Sungmin Kang,Ruijie Meng,Marcel Böhme,Abhik Roychoudhury
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back-end. In the front end, it conducts autonomous decision-making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. We also conduct real-world case studies with the widely used coding agent Gemini CLI, and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10= 100% success rate on the Kimi-K2.5 LLM backend, and 9/10= 90% success rate on Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17639 [cs.AI] (or arXiv:2603.17639v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.17639 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-26] Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies

【速读】：该论文旨在解决强化学习（Reinforcement Learning, RL）算法客观比较的难题，其核心挑战在于不同RL方法的性能评估高度依赖于环境设计、奖励结构以及算法与环境动态中的随机性。为应对这一复杂性，作者提出了一种严谨的基准测试框架，其关键创新在于将逆向最优性（converse optimality）扩展至具有噪声的离散时间、控制仿射非线性系统，从而为预设的价值函数和策略提供必要且充分的最优性条件。该框架通过同伦变化（homotopy variations）和随机参数生成基准家族，实现了对RL算法的受控且全面的评估，为基于真实最优解的标准方法对比提供了可复现的基准基础。

链接: https://arxiv.org/abs/2603.17631
作者: Sinan Ibrahim,Grégoire Ouerdane,Hadi Salloum,Henni Ouerdane,Stefan Streif,Pavel Osinenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions, under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework’s capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.

[AI-27] Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity

【速读】：该论文旨在解决从无动作（action-free）的离线轨迹中恢复潜在动作（latent actions）和环境动态（environment dynamics）的问题，其中动作从未被观测到。核心挑战在于如何在缺乏直接动作标签的情况下，仅通过带有示范者身份（demonstrator identity）的观测序列来识别出每个示范者所遵循的策略以及共享的环境转移机制。解决方案的关键在于利用不同示范者策略的多样性（policy diversity）作为可识别性（identifiability）的来源：假设每个示范者对应一个独特策略、环境动态对所有示范者相同，且示范者身份仅通过其选择的动作影响下一状态的观测分布，则可观测的条件分布 $ p(o_{t+1} \mid o_t, e) $ 可建模为一系列潜行动作条件转移核的混合，其权重由示范者特定的策略决定。在此设定下，该分布满足列随机非负矩阵分解（column-stochastic nonnegative matrix factorization），并通过充分分散的策略多样性和秩条件证明了潜行动作转移和示范者策略在潜行动作标签排列意义下的唯一可识别性。进一步地，通过Gram-行列式最小体积准则将结果扩展至连续观测空间，并利用状态空间连通性将局部排列歧义升级为全局唯一排列，从而只需少量标注动作数据即可完全确定潜行动作标签。这一框架确立了示范者多样性作为从离线强化学习数据中学习潜行动作与动态的理论基础。

链接: https://arxiv.org/abs/2603.17577
作者: Felix Schur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution p(o_t+1\mid o_t,e) is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.

[AI-28] Unsupervised Symbolic Anomaly Detection

【速读】：该论文旨在解决现有异常检测方法在可解释性方面的不足问题，尤其是那些依赖高维、黑箱模型难以提供直观逻辑解释的监督或无监督学习方法。其解决方案的关键在于提出SYRAN方法，通过符号回归（symbolic regression）技术自动发现一组人类可读的数学方程，这些方程刻画了正常数据中的符号不变量（symbolic invariants），即在正常情况下近似恒定的函数。异常得分由数据偏离这些不变量的程度决定，从而使得整个检测逻辑从构建之初就具备可解释性，而非依赖事后解释工具。

链接: https://arxiv.org/abs/2603.17575
作者: Md Maruf Hossain,Tim Katzke,Simon Klüttermann,Emmanuel Müller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:We propose SYRAN, an unsupervised anomaly detection method based on symbolic regression. Instead of encoding normal patterns in an opaque, high-dimensional model, our method learns an ensemble of human-readable equations that describe symbolic invariants: functions that are approximately constant on normal data. Deviations from these invariants yield anomaly scores, so that the detection logic is interpretable by construction, rather than via post-hoc explanation. Experimental results demonstrate that SYRAN is highly interpretable, providing equations that correspond to known scientific or medical relationships, and maintains strong anomaly detection performance comparable to that of state-of-the-art methods.

[AI-29] FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models

【速读】：该论文旨在解决生成式 AI（Generative AI）驱动的表格式基础模型（如 Prior-Data Fitted Networks, PFNs）在异常检测（Outlier Detection, OD）中缺乏可解释性与不确定性量化的问题。尽管这些模型具备零样本适应能力，但其输出的标量异常分数无法提供操作层面的上下文信息，且现有后处理解释方法在实时部署中计算开销大或难以捕捉零样本推理中的认知不确定性（epistemic uncertainty）。解决方案的关键在于提出 FoMo-X 框架，通过在预训练 PFN 的冻结嵌入上附加轻量级诊断头（Diagnostic Heads），利用与主干网络相同的生成模拟器先验进行离线训练，从而将昂贵的蒙特卡洛 dropout 估计等不确定性指标压缩为单次前向传播即可获得的确定性输出。该方法实现了高保真度的诊断信号恢复与近乎零的推理延迟，显著提升了零样本异常检测的可信度和实用性。

链接: https://arxiv.org/abs/2603.17570
作者: Simon Klüttermann,Tim Katzke,Phuong Huong Nguyen,Emmanuel Müller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 9 figures

点击查看摘要

Abstract:Tabular foundation models, specifically Prior-Data Fitted Networks (PFNs), have revolutionized outlier detection (OD) by enabling unsupervised zero-shot adaptation to new datasets without training. However, despite their predictive power, these models typically function as opaque black boxes, outputting scalar outlier scores that lack the operational context required for safety-critical decision-making. Existing post-hoc explanation methods are often computationally prohibitive for real-time deployment or fail to capture the epistemic uncertainty inherent in zero-shot inference. In this work, we introduce FoMo-X, a modular framework that equips OD foundation models with intrinsic, lightweight diagnostic capabilities. We leverage the insight that the frozen embeddings of a pretrained PFN backbone already encode rich, context-conditioned relational information. FoMo-X attaches auxiliary diagnostic heads to these embeddings, trained offline using the same generative simulator prior as the backbone. This allows us to distill computationally expensive properties, such as Monte Carlo dropout based epistemic uncertainty, into a deterministic, single-pass inference. We instantiate FoMo-X with two novel heads: a Severity Head that discretizes deviations into interpretable risk tiers, and an Uncertainty Head that provides calibrated confidence measures. Extensive evaluation on synthetic and real-world benchmarks (ADBench) demonstrates that FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead. By bridging the gap between foundation model performance and operational explainability, FoMo-X offers a scalable path toward trustworthy, zero-shot outlier detection.

[AI-30] CLeAN: Continual Learning Adaptive Normalization in Dynamic Environments

【速读】：该论文旨在解决持续学习（Continual Learning）中因数据分布动态变化而导致模型性能下降和灾难性遗忘（Catastrophic Forgetting）的问题，尤其是在表格式数据（Tabular Data）场景下，传统归一化方法（如Min-Max Scaling）依赖全局数据集信息，无法适应序列化学习的特性。解决方案的关键在于提出一种自适应归一化技术——持续学习自适应归一化（Continual Learning Adaptive Normalization, CLeAN），其通过可学习参数结合指数移动平均（Exponential Moving Average, EMA）模块在线估计全局特征尺度，从而实现对演化数据分布的动态适应，有效提升模型在新数据上的表现并缓解知识遗忘问题。

链接: https://arxiv.org/abs/2603.17548
作者: Isabella Marasco,Davide Evangelista,Elena Loli Piccolomini,Michele Colajanni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures

点击查看摘要

Abstract:Artificial intelligence systems predominantly rely on static data distributions, making them ineffective in dynamic real-world environments, such as cybersecurity, autonomous transportation, or finance, where data shifts frequently. Continual learning offers a potential solution by enabling models to learn from sequential data while retaining prior knowledge. However, a critical and underexplored issue in this domain is data normalization. Conventional normalization methods, such as min-max scaling, presuppose access to the entire dataset, which is incongruent with the sequential nature of continual learning. In this paper we introduce Continual Learning Adaptive Normalization (CLeAN), a novel adaptive normalization technique designed for continual learning in tabular data. CLeAN involves the estimation of global feature scales using learnable parameters that are updated via an Exponential Moving Average (EMA) module, enabling the model to adapt to evolving data distributions. Through comprehensive evaluations on two datasets and various continual learning strategies, including Resevoir Experience Replay, A-GEM, and EwC we demonstrate that CLeAN not only improves model performance on new data but also mitigates catastrophic forgetting. The findings underscore the importance of adaptive normalization in enhancing the stability and effectiveness of tabular data, offering a novel perspective on the use of normalization to preserve knowledge in dynamic learning environments.

[AI-31] Per-Domain Generalizing Policies: On Learning Efficient and Robust Q-Value Functions (Extended Version with Technical Appendix)

【速读】：该论文旨在解决在规划学习中如何高效地学习跨领域通用策略（per-domain generalizing policies）的问题。传统方法通过监督学习训练基于图神经网络的状态值函数（state-value functions），但存在计算开销大、效率低的问题。论文提出改用Q值函数（Q-value functions）进行学习，其优势在于评估时仅需处理当前状态而非所有后继状态，显著降低推理成本。解决方案的关键在于引入正则化项以增强模型对教师规划器所选动作与未选动作的区分能力，从而克服直接监督学习Q值时性能不佳的问题，最终实现比状态值策略更优且接近LAMA-first规划器性能的策略学习效果。

链接: https://arxiv.org/abs/2603.17544
作者: Nicola J. Müller,Moritz Oster,Isabel Valera,Jörg Hoffmann,Timo P. Gros
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning per-domain generalizing policies is a key challenge in learning for planning. Standard approaches learn state-value functions represented as graph neural networks using supervised learning on optimal plans generated by a teacher planner. In this work, we advocate for learning Q-value functions instead. Such policies are drastically cheaper to evaluate for a given state, as they need to process only the current state rather than every successor. Surprisingly, vanilla supervised learning of Q-values performs poorly as it does not learn to distinguish between the actions taken and those not taken by the teacher. We address this by using regularization terms that enforce this distinction, resulting in Q-value policies that consistently outperform state-value policies across a range of 10 domains and are competitive with the planner LAMA-first.

[AI-32] Informative Semi-Factuals for XAI: The Elaborated Explanations that People Prefer

【速读】：该论文旨在解决当前半事实解释（semi-factuals）方法缺乏对为何极端特征值变化仍不改变预测结果提供解释的问题，即现有方法仅能生成“即使某关键特征被大幅调整，预测结果仍不变”的陈述，但无法揭示背后起作用的隐藏因素。解决方案的关键在于提出一种新的信息性半事实解释方法（Informative Semi-Factuals, ISF），该方法通过补充与决策相关的额外隐藏特征（hidden features）信息，生成更具洞察力的解释，从而提升解释的可理解性和实用性。实验结果表明，ISF在基准数据集上生成的半事实解释在质量和信息丰富度上均优于现有方法，且用户研究验证了其更受偏好。

链接: https://arxiv.org/abs/2603.17534
作者: Saugat Aryal,Mark T. Keane
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, in eXplainable AI (XAI), \textiteven if explanations – so-called semi-factuals – have emerged as a popular strategy that explains how a predicted outcome \textitcan remain the same even when certain input-features are altered. For example, in the commonly-used banking app scenario, a semi-factual explanation could inform customers about better options, other alternatives for their successful application, by saying " \textitEven if you asked for double the loan amount, you would still be accepted". Most semi-factuals XAI algorithms focus on finding maximal value-changes to a single key-feature that do \textitnot alter the outcome (unlike counterfactual explanations that often find minimal value-changes to several features that alter the outcome). However, no current semi-factual method explains \textitwhy these extreme value-changes do not alter outcomes; for example, a more informative semi-factual could tell the customer that it is their good credit score that allows them to borrow double their requested loan. In this work, we advance a new algorithm – the \textitinformative semi-factuals (ISF) method – that generates more elaborated explanations supplementing semi-factuals with information about additional \textithidden features that influence an automated decision. Experimental results on benchmark datasets show that this ISF method computes semi-factuals that are both informative and of high-quality on key metrics. Furthermore, a user study shows that people prefer these elaborated explanations over the simpler semi-factual explanations generated by current methods.

[AI-33] AirDDE: Multifactor Neural Delay Differential Equations for Air Quality Forecasting AAAI2026

【速读】：该论文旨在解决空气质量预测中因污染物动态过程被建模为瞬时行为而忽略传播延迟的问题，从而导致预测精度受限。其解决方案的关键在于提出AirDDE框架，首次将延迟微分方程（Delay Differential Equation, DDE）引入空气质量预测任务，通过连续时间下的物理引导建模实现对污染传播延迟的精准刻画；具体包含两个核心创新：一是基于记忆增强注意力模块，自适应地捕获全局与局部历史特征以体现多因素调控下的延迟效应；二是基于扩散-对流方程构建的物理引导延迟演化函数，能够同时建模扩散、延迟对流及源/汇项，从而在物理合理性基础上捕捉具有延迟特性的污染物累积模式。

链接: https://arxiv.org/abs/2603.17529
作者: Binqing Wu,Zongjiang Shang,Shiyu Liu,Jianlong Huang,Jiahui Xu,Ling Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAAI 2026

点击查看摘要

Abstract:Accurate air quality forecasting is essential for public health and environmental sustainability, but remains challenging due to the complex pollutant dynamics. Existing deep learning methods often model pollutant dynamics as an instantaneous process, overlooking the intrinsic delays in pollutant propagation. Thus, we propose AirDDE, the first neural delay differential equation framework in this task that integrates delay modeling into a continuous-time pollutant evolution under physical guidance. Specifically, two novel components are introduced: (1) a memory-augmented attention module that retrieves globally and locally historical features, which can adaptively capture delay effects modulated by multifactor data; and (2) a physics-guided delay evolving function, grounded in the diffusion-advection equation, that models diffusion, delayed advection, and source/sink terms, which can capture delay-aware pollutant accumulation patterns with physical plausibility. Extensive experiments on three real-world datasets demonstrate that AirDDE achieves the state-of-the-art forecasting performance with an average MAE reduction of 8.79% over the best baselines. The code is available at this https URL.

[AI-34] KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition

【速读】：该论文旨在解决现有视觉-语言-动作（Vision-Language-Action, VLA）系统在执行任务时对运动学特性（kinematics）建模不足的问题，即当前指令通常仅粗略或部分描述动作的运动学属性（如方向、轨迹、姿态和相对位移），难以支持精细且个性化的操作行为。为应对这一挑战，作者提出KineVLA框架，其核心创新在于通过双层动作表示（bi-level action representation）与双层推理标记（bi-level reasoning tokens）显式解耦任务目标的不变性（goal-level invariance）与运动学变化性（kinematics-level variability），从而作为显式的监督中间变量，实现语言与动作之间的精准对齐。该方案显著提升了机器人在保持任务目标一致的前提下，根据指令级运动学要求灵活调整执行轨迹的能力。

链接: https://arxiv.org/abs/2603.17524
作者: Gaoge Han,Zhengqing Gao,Ziwen Li,Jiaxin Huang,Shaoli Huang,Fakhri Karray,Mingming Gong,Tongliang Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) from initiation through completion, at key moments, unlike existing action instructions that capture kinematics only coarsely or partially, thereby supporting fine-grained and personalized manipulation. In this setting, where task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens to serve as explicit, supervised intermediate variables that align language and action. To support this task, we construct the kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.

[AI-35] QuantFL: Sustainable Federated Learning for Edge IoT via Pre-Trained Model Quantisation

【速读】：该论文旨在解决联邦学习（Federated Learning, FL）在物联网（Internet of Things, IoT）设备上运行时因频繁上行传输导致的高能耗问题，从而降低其碳足迹。解决方案的关键在于提出QuantFL框架，利用预训练模型（pre-trained model）作为初始化起点，实现激进且计算轻量化的量化策略；通过预训练自然集中更新统计特性，使能采用内存高效的桶量化（bucket quantisation），避免了复杂误差反馈机制带来的能量开销，从而在严格带宽约束下显著减少通信比特数（上行最多减少80%），同时保持或超越无压缩基线的性能表现。

链接: https://arxiv.org/abs/2603.17507
作者: Charuka Herath,Yogachandran Rahulamathavan,Varuna De Silva,Sangarapillai Lambotharan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables privacy-preserving intelligence on Internet of Things (IoT) devices but incurs a significant carbon footprint due to the high energy cost of frequent uplink transmission. While pre-trained models are increasingly available on edge devices, their potential to reduce the energy overhead of fine-tuning remains underexplored. In this work, we propose QuantFL, a sustainable FL framework that leverages pre-trained initialisation to enable aggressive, computationally lightweight quantisation. We demonstrate that pre-training naturally concentrates update statistics, allowing us to use memory-efficient bucket quantisation without the energy-intensive overhead of complex error-feedback mechanisms. On MNIST and CIFAR-100, QuantFL reduces total communication by 40% ( \simeq40% total-bit reduction with full-precision downlink; \geq80% on uplink or when downlink is quantised) while matching or exceeding uncompressed baselines under strict bandwidth budgets; BU attains 89.00% (MNIST) and 66.89% (CIFAR-100) test accuracy with orders of magnitude fewer bits. We also account for uplink and downlink costs and provide ablations on quantisation levels and initialisation. QuantFL delivers a practical, “green” recipe for scalable training on battery-constrained IoT networks.

[AI-36] Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization

【速读】：该论文旨在解决无线波束赋形（beamforming）与波形优化中传统迭代算法计算复杂度高、训练数据需求量大以及黑箱模型可解释性差的问题。其关键解决方案是将迭代的近端梯度下降（proximal gradient descent, PGD）算法通过深度展开（deep unfolding, DU）技术转化为可学习的神经网络结构，其中每层参数由数据驱动而非预设；进一步引入一种混合层（hybrid layer），在近端投影前执行可学习的线性梯度变换以增强表达能力，并结合AutoGluon内置树状帕尔岑估计器（tree-structured Parzen estimator, TPE）进行超参数优化（hyperparameter optimization, HPO），从而在仅用100个训练样本和五层网络的情况下，实现接近200次迭代传统PGD解法98.8%的频谱效率，同时保持良好的可解释性和训练稳定性。

链接: https://arxiv.org/abs/2603.17478
作者: Ahmet Kaplan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:This study explores the combination of automated machine learning (AutoML) with model-based deep unfolding (DU) for optimizing wireless beamforming and waveforms. We convert the iterative proximal gradient descent (PGD) algorithm into a deep neural network, wherein the parameters of each layer are learned instead of being predetermined. Additionally, we enhance the architecture by incorporating a hybrid layer that performs a learnable linear gradient transformation prior to the proximal projection. By utilizing AutoGluon with a tree-structured parzen estimator (TPE) for hyperparameter optimization (HPO) across an expanded search space, which includes network depth, step-size initialization, optimizer, learning rate scheduler, layer type, and post-gradient activation, the proposed auto-unrolled PGD (Auto-PGD) achieves 98.8% of the spectral efficiency of a traditional 200-iteration PGD solver using only five unrolled layers, while requiring only 100 training samples. We also address a gradient normalization issue to ensure consistent performance during training and evaluation, and we illustrate per-layer sum-rate logging as a tool for transparency. These contributions highlight a notable reduction in the amount of training data and inference cost required, while maintaining high interpretability compared to conventional black-box architectures.

[AI-37] Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates

【速读】：该论文旨在解决时间序列预测中传统基于提示学习（in-context learning, ICL）方法依赖手工构造的表格特征、而端到端序列模型缺乏推理时自适应能力的问题。其核心解决方案是提出一个统一框架 Baguan-TS，通过3D Transformer实现对时间轴、变量轴和上下文轴的联合注意力机制，从而在不依赖人工特征的情况下直接利用原始序列进行上下文学习；关键创新在于引入一种与特征无关的目标空间检索式局部校准策略以提升训练稳定性和校准性能，并采用上下文过拟合策略缓解输出过度平滑问题，最终在多个公共基准和真实世界能源数据集上显著优于现有基线方法。

链接: https://arxiv.org/abs/2603.17439
作者: Linxiao Yang,Xue Jiang,Gezheng Xu,Tian Zhou,Min Yang,ZhaoYang Zhu,Linyuan Geng,Zhipeng Zeng,Qiming Chen,Xinyue Gu,Rong Jin,Liang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers enable in-context learning (ICL) for rapid, gradient-free adaptation in time series forecasting, yet most ICL-style approaches rely on tabularized, hand-crafted features, while end-to-end sequence models lack inference-time adaptation. We bridge this gap with a unified framework, Baguan-TS, which integrates the raw-sequence representation learning with ICL, instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes. To make this high-capacity model practical, we tackle two key hurdles: (i) calibration and training stability, improved with a feature-agnostic, target-space retrieval-based local calibration; and (ii) output oversmoothing, mitigated via context-overfitting strategy. On public benchmark with covariates, Baguan-TS consistently outperforms established baselines, achieving the highest win rate and significant reductions in both point and probabilistic forecasting metrics. Further evaluations across diverse real-world energy datasets demonstrate its robustness, yielding substantial improvements.

[AI-38] meAPN: Adaptive Amplitude-Phase Non-Stationarity Normalization for Time Series Forecasting

【速读】：该论文旨在解决多变量长期时间序列预测中的非平稳性（non-stationarity）问题，其核心表现是幅值和相位的快速变化，导致分布偏移并显著降低预测性能。现有基于归一化的方法主要依赖一阶和二阶统计量，隐式假设分布平滑演化，忽略了细粒度的时间动态特性。解决方案的关键在于提出TimeAPN（Adaptive Amplitude-Phase Non-Stationarity Normalization）框架，该框架通过联合建模时域与频域的均值序列及其未来演变，并显式捕捉预测与真实序列之间的相位差异以应对时间错位，同时将幅值信息融入自适应归一化机制，从而有效处理信号能量的突变。最终，预测得到的非平稳因素通过协同去归一化过程与主干模型输出融合，重建出完整的非平稳时间序列，且该框架具有模型无关性，可无缝集成于多种预测模型中。

链接: https://arxiv.org/abs/2603.17436
作者: Yue Hu,Jialiang Tang,Siwei Yu,Baosheng Yu,Jing Zhang,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Non-stationarity is a fundamental challenge in multivariate long-term time series forecasting, often manifested as rapid changes in amplitude and phase. These variations lead to severe distribution shifts and consequently degrade predictive performance. Existing normalization-based methods primarily rely on first- and second-order statistics, implicitly assuming that distributions evolve smoothly and overlooking fine-grained temporal dynamics. To address these limitations, we propose TimeAPN, an Adaptive Amplitude-Phase Non-Stationarity Normalization framework that explicitly models and predicts non-stationary factors from both the time and frequency domains. Specifically, TimeAPN first models the mean sequence jointly in the time and frequency domains, and then forecasts its evolution over future horizons. Meanwhile, phase information is extracted in the frequency domain, and the phase discrepancy between the predicted and ground-truth future sequences is explicitly modeled to capture temporal misalignment. Furthermore, TimeAPN incorporates amplitude information into an adaptive normalization mechanism, enabling the model to effectively account for abrupt fluctuations in signal energy. The predicted non-stationary factors are subsequently integrated with the backbone forecasting outputs through a collaborative de-normalization process to reconstruct the final non-stationary time series. The proposed framework is model-agnostic and can be seamlessly integrated with various forecasting backbones. Extensive experiments on seven real-world multivariate datasets demonstrate that TimeAPN consistently improves long-term forecasting accuracy across multiple prediction horizons and outperforms state-of-the-art reversible normalization methods.

[AI-39] he Phasor Transformer: Resolving Attention Bottlenecks on the Unit Circle

【速读】：该论文旨在解决Transformer模型中基于点积自注意力机制在长序列时间序列建模时存在的二次计算复杂度瓶颈问题（quadratic token-mixing bottleneck）。其解决方案的关键在于提出一种基于相位原生表示的Phasor Transformer模块，将序列状态映射到单位圆流形 $S^1$ 上，通过轻量级可训练相位偏移与无参数的离散傅里叶变换（Discrete Fourier Transform, DFT）进行token耦合，实现全局 $\mathcal{O}(N\log N)$ 的混合效率，且无需显式计算注意力矩阵。该方法在保持高效的同时，能够学习稳定的时间序列全局动态，从而为振荡性时序建模提供了一种几何约束下的确定性全局耦合路径。

链接: https://arxiv.org/abs/2603.17433
作者: Dibakar Sigdel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer models have redefined sequence learning, yet dot-product self-attention introduces a quadratic token-mixing bottleneck for long-context time-series. We introduce the \textbfPhasor Transformer block, a phase-native alternative representing sequence states on the unit-circle manifold S^1 . Each block combines lightweight trainable phase-shifts with parameter-free Discrete Fourier Transform (DFT) token coupling, achieving global \mathcalO(N\log N) mixing without explicit attention maps. Stacking these blocks defines the \textbfLarge Phasor Model (LPM). We validate LPM on autoregressive time-series prediction over synthetic multi-frequency benchmarks. Operating with a highly compact parameter budget, LPM learns stable global dynamics and achieves competitive forecasting behavior compared to conventional self-attention baselines. Our results establish an explicit efficiency-performance frontier, demonstrating that large-model scaling for time-series can emerge from geometry-constrained phase computation with deterministic global coupling, offering a practical path toward scalable temporal modeling in oscillatory domains.

[AI-40] Proactive Knowledge Inquiry in Doctor-Patient Dialogue: Stateful Extraction Belief Updating and Path-Aware Action Planning

【速读】：该论文旨在解决当前自动化电子病历（Electronic Medical Record, EMR）生成系统普遍存在的“输出导向”问题，即现有方法仅在诊疗结束后进行转录、提取与总结，而未能显式建模已知信息、缺失内容、不确定性优先级以及下一步应提问或推荐的内容。为此，作者将医患对话建模为一个在部分可观测条件下的主动知识探究（proactive knowledge-inquiry）问题，并提出了一种融合状态感知抽取、序列信念更新、差距感知状态建模、对象化医学知识的混合检索以及POMDP-lite动作规划器的框架。其核心创新在于将EMR文档视为持续探究循环的结构化投影，而非单一目标产物，从而实现更符合临床认知逻辑的动态生成机制。

链接: https://arxiv.org/abs/2603.17425
作者: Zhenhai Pan,Yan Liu,Jia You
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, 5 tables. Pilot concept demonstration under a controlled simulated setting

点击查看摘要

Abstract:Most automated electronic medical record (EMR) pipelines remain output-oriented: they transcribe, extract, and summarize after the consultation, but they do not explicitly model what is already known, what is still missing, which uncertainty matters most, or what question or recommendation should come next. We formulate doctor-patient dialogue as a proactive knowledge-inquiry problem under partial observability. The proposed framework combines stateful extraction, sequential belief updating, gap-aware state modeling, hybrid retrieval over objectified medical knowledge, and a POMDP-lite action planner. Instead of treating the EMR as the only target artifact, the framework treats documentation as the structured projection of an ongoing inquiry loop. To make the formulation concrete, we report a controlled pilot evaluation on ten standardized multi-turn dialogues together with a 300-query retrieval benchmark aggregated across dialogues. On this pilot protocol, the full framework reaches 83.3% coverage, 80.0% risk recall, 81.4% structural completeness, and lower redundancy than the chunk-only and template-heavy interactive baselines. These pilot results do not establish clinical generalization; rather, they suggest that proactive inquiry may be methodologically interesting under tightly controlled conditions and can be viewed as a conceptually appealing formulation worth further investigation for dialogue-based EMR generation. This work should be read as a pilot concept demonstration under a controlled simulated setting rather than as evidence of clinical deployment readiness. No implication of clinical deployment readiness, clinical safety, or real-world clinical utility should be inferred from this pilot protocol.

[AI-41] From Digital Twins to World Models:Opportunities Challenges and Applications for Mobile Edge General Intelligence

【速读】：该论文旨在解决传统数字孪生（Digital Twin）在高度动态的网络边缘环境中面临的自主性、适应性和可扩展性不足的问题，从而推动边缘通用智能（Edge General Intelligence, EGI）的发展。其核心解决方案在于从以物理模型为基础、集中式且系统中心化的数字孪生向数据驱动、去中心化且代理中心化的世界模型（World Model）演进，通过构建具备感知、潜在状态表示、动态学习、基于想象的规划和记忆等关键组件的架构，实现更高效、自适应且资源友好的边缘智能。这一转变使世界模型能够支持无线边缘计算场景下多智能体协同与决策，为集成感知与通信、语义通信、空地网络及低空无线网络等新兴应用提供理论基础与技术路径。

链接: https://arxiv.org/abs/2603.17420
作者: Jie Zheng,Dusit Niyato,Changyuan Zhao,Jiawen Kang,Jiacheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid evolution toward 6G and beyond communication systems is accelerating the convergence of digital twins and world models at the network edge. Traditional digital twins provide high-fidelity representations of physical systems and support monitoring, analysis, and offline optimization. However, in highly dynamic edge environments, they face limitations in autonomy, adaptability, and scalability. This paper presents a systematic survey of the transition from digital twins to world models and discusses its role in enabling edge general intelligence (EGI). First, the paper clarifies the conceptual differences between digital twins and world models and highlights the shift from physics-based, centralized, and system-centric replicas to data-driven, decentralized, and agent-centric internal models. This discussion helps readers gain a clear understanding of how this transition enables more adaptive, autonomous, and resource-efficient intelligence at the network edge. The paper reviews the design principles, architectures, and key components of world models, including perception, latent state representation, dynamics learning, imagination-based planning, and memory. In addition, it examines the integration of world models and digital twins in wireless EGI systems and surveys emerging applications in integrated sensing and communications, semantic communication, air-ground networks, and low-altitude wireless networks. Finally, this survey provides a systematic roadmap and practical insights for designing world-model-driven edge intelligence systems in wireless and edge computing environments. It also outlines key research challenges and future directions toward scalable, reliable, and interoperable world models for edge-native agentic AI.

[AI-42] Caging the Agents : A Zero Trust Security Architecture for Autonomous AI in Healthcare ALT

【速读】：该论文旨在解决部署于医疗健康环境中的自主式AI代理（autonomous AI agents）所面临的严重安全漏洞问题，这些漏洞包括未经授权的指令执行、敏感信息泄露、身份伪造、跨代理传播不安全行为以及通过外部资源间接注入恶意提示等，均可能违反《健康保险可携性和责任法案》（HIPAA）。解决方案的关键在于构建一个六域威胁模型并实施四层纵深防御体系：（1）基于gVisor的内核级工作负载隔离，（2）使用凭证代理Sidecar组件防止代理容器直接访问原始密钥，（3）网络出口策略限制每个代理仅能连接白名单目的地，（4）基于结构化元数据封装和不可信内容标记的提示完整性框架。该架构已在九个生产环境中持续运行90天，有效识别并修复了四个高危漏洞，覆盖全部十一类攻击模式，并将相关配置与审计工具开源。

链接: https://arxiv.org/abs/2603.17419
作者: Saikat Maiti
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Keywords: agentic AI security, autonomous agents, healthcare cybersecurity, zero trust, prompt injection, HIPAA, Kubernetes security, OpenClaw

点击查看摘要

Abstract:Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that these agents exhibit critical vulnerabilities in realistic settings: unauthorized compliance with non-owner instructions, sensitive information disclosure, identity spoofing, cross-agent propagation of unsafe practices, and indirect prompt injection through external resources [7]. In healthcare environments processing Protected Health Information, every such vulnerability becomes a potential HIPAA violation. This paper presents a security architecture deployed for nine autonomous AI agents in production at a healthcare technology company. We develop a six-domain threat model for agentic AI in healthcare covering credential exposure, execution capability abuse, network egress exfiltration, prompt integrity failures, database access risks, and fleet configuration drift. We implement four-layer defense in depth: (1) kernel level workload isolation using gVisor on Kubernetes, (2) credential proxy sidecars preventing agent containers from accessing raw secrets, (3) network egress policies restricting each agent to allowlisted destinations, and (4) a prompt integrity framework with structured metadata envelopes and untrusted content labeling. We report results from 90 days of deployment including four HIGH severity findings discovered and remediated by an automated security audit agent, progressive fleet hardening across three VM image generations, and defense coverage mapped to all eleven attack patterns from recent literature. All configurations, audit tooling, and the prompt integrity framework are released as open source.

[AI-43] SCALE:Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction

【速读】：该论文旨在解决虚拟细胞模型在大规模扰动预测中面临的三大耦合瓶颈：训练与推理管道效率低下、高维稀疏表达空间中的建模不稳定，以及评估协议过度强调重建准确性而忽视生物真实性。其关键解决方案在于构建一个名为SCALE的专用大规模基础模型，通过三方面创新实现突破：首先，基于BioNeMo框架优化训练与推理流程，显著提升数据吞吐量、分布式扩展性和部署效率；其次，将扰动预测建模为条件传输问题，并采用集合感知流架构（set-aware flow architecture），结合LLaMA-based细胞编码与端点导向监督，增强训练稳定性与扰动效应恢复能力；最后，在Tahoe-100M基准上采用以细胞层面生物意义指标为核心的评估协议，而非仅依赖重建精度，从而更真实地反映模型的生物学性能。

链接: https://arxiv.org/abs/2603.17380
作者: Shuizhou Chen,Lang Yu,Kedu Jin,Songming Zhang,Hao Wu,Wenxuan Huang,Sheng Xu,Quan Qian,Qin Chen,Lei Bai,Siqi Sun,Zhangyang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Virtual cell models aim to enable in silico experimentation by predicting how cells respond to genetic, chemical, or cytokine perturbations from single-cell measurements. In practice, however, large-scale perturbation prediction remains constrained by three coupled bottlenecks: inefficient training and inference pipelines, unstable modeling in high-dimensional sparse expression space, and evaluation protocols that overemphasize reconstruction-like accuracy while underestimating biological fidelity. In this work we present a specialized large-scale foundation model SCALE for virtual cell perturbation prediction that addresses the above limitations jointly. First, we build a BioNeMo-based training and inference framework that substantially improves data throughput, distributed scalability, and deployment efficiency, yielding 12.51* speedup on pretrain and 1.29* on inference over the prior SOTA pipeline under matched system settings. Second, we formulate perturbation prediction as conditional transport and implement it with a set-aware flow architecture that couples LLaMA-based cellular encoding with endpoint-oriented supervision. This design yields more stable training and stronger recovery of perturbation effects. Third, we evaluate the model on Tahoe-100M using a rigorous cell-level protocol centered on biologically meaningful metrics rather than reconstruction alone. On this benchmark, our model improves PDCorr by 12.02% and DE Overlap by 10.66% over STATE. Together, these results suggest that advancing virtual cells requires not only better generative objectives, but also the co-design of scalable infrastructure, stable transport modeling, and biologically faithful evaluation.

[AI-44] Efficient Exploration at Scale

【速读】：该论文旨在解决强化学习从人类反馈（Reinforcement Learning from Human Feedback, RLHF）中数据效率低下的问题，即传统方法需要大量标注数据才能训练出高性能的奖励模型和语言模型。其解决方案的关键在于提出一种在线学习算法，该算法能够随着选择数据的持续输入，增量式地更新奖励模型和语言模型：奖励模型直接拟合选择数据，而语言模型则通过一种改进的REINFORCE算法进行更新，其强化信号由奖励模型提供。此外，算法引入了三个关键设计：对每个强化信号施加微小的正向激励（affirmative nudge）、使用认知神经网络（epistemic neural network）建模奖励不确定性，以及信息导向探索（information-directed exploration），从而显著提升数据利用效率。实验表明，使用Gemma大语言模型（LLMs）时，该方法仅需少于2万条标签即可达到离线RLHF在20万条标签下训练的性能，实现超过10倍的数据效率提升，且可扩展至百万级标签匹配十亿级标签的离线训练效果，标志着首次实证证明此类大规模效率增益的可能性。

链接: https://arxiv.org/abs/2603.17378
作者: Seyed Mohammad Asghari,Chris Chute,Vikranth Dwaracherla,Xiuyuan Lu,Mehdi Jafarnia,Victor Minden,Zheng Wen,Benjamin Van Roy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of reinforce, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.

[AI-45] owards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

【速读】：该论文旨在解决大推理模型（Large Reasoning Models, LRM）在启用链式思维（Chain-of-Thought, CoT）后安全能力显著下降的问题。研究表明，这种安全性能的退化仅在CoT被激活时出现，而在其关闭时则不发生，这表明安全决策应前置至CoT生成之前。解决方案的关键在于提出一种新颖的安全对齐方法：首先使用基于Bert的分类器从一个安全模型（如未启用CoT的LRM）中提取安全决策信号，并将这些信号作为辅助监督信息融入LRM的安全对齐过程；通过这种方式，安全梯度可反向传播至LRM的潜在表示层，从而有效增强模型在CoT生成前的安全决策能力，同时保持其通用推理性能不受影响。

链接: https://arxiv.org/abs/2603.17368
作者: Jianan Chen,Zhifang Zhang,Shuo He,Linan Yue,Lei Feng,Minling Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs’ safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a Bert-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs’ safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs’ latent representations, effectively strengthening the LRMs’ safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs’ general reasoning performance.

[AI-46] WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

【速读】：该论文旨在解决生成式 AI (Generative AI) 在网页交互场景中引发的隐私泄露问题，特别是训练数据中包含敏感信息以及云端推理时暴露用户截图所导致的个人身份信息（PII）泄露风险。其解决方案的关键在于提出一个细粒度的合成基准 WebPII，包含 44,865 张标注的电子商务 UI 图像，具备三大特性：扩展的 PII 分类体系（涵盖交易级标识符以支持再识别分析）、针对部分填写表单的前瞻检测能力（适应用户实时输入场景），以及基于视觉语言模型（VLM）的可扩展界面复现机制。通过该设计，研究者训练出 WebRedact 模型，在保持实时 CPU 推理延迟（20ms）的同时，将文本提取准确率提升至 mAP@50=0.753，显著优于基线（0.357），验证了方法在跨界面布局不变性和新页面类型泛化上的有效性。

链接: https://arxiv.org/abs/2603.17357
作者: Nathan Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: extended PII taxonomy including transaction-level identifiers that enable reidentification, anticipatory detection for partially-filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction. Experiments validate that these design choices improve layout-invariant detection across diverse interfaces and generalization to held-out page types. We train WebRedact to demonstrate practical utility, more than doubling text-extraction baseline accuracy (0.753 vs 0.357 mAP@50) at real-time CPU latency (20ms). We release the dataset and model to support privacy-preserving computer use research.

[AI-47] Learning Permutation Distributions via Reflected Diffusion on Ranks

【速读】：该论文旨在解决在对称群 $ S_n $ 上学习概率分布的挑战，尤其是由于其阶乘增长的规模和离散、非欧几里得结构导致的采样与建模困难。现有基于洗牌（shuffle-based）随机游走的扩散方法虽然能实现前向扰动，但其反向去噪过程常因轨迹突变且随 $ n $ 增大而愈发难以恢复，限制了性能表现。解决方案的关键在于提出 Soft-Rank Diffusion 框架：通过将排列映射到连续潜空间中的“软秩”（soft ranks）来替代传统的硬性洗牌扰动，从而获得更平滑、可微分的前向轨迹；同时引入上下文感知的广义 Plackett-Luce（contextualized generalized Plackett-Luce, cGPL）去噪器，增强模型对序列决策结构的表达能力。实验表明，该方法在排序和组合优化任务中显著优于现有扩散基线，尤其在长序列和内在顺序场景下优势明显。

链接: https://arxiv.org/abs/2603.17353
作者: Sizhuang He,Yangtian Zhang,Shiyang Zhang,David van Dijk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract:The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

[AI-48] A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

【速读】：该论文旨在解决网约车责任纠纷自动化 adjudication（裁决）中的两大核心问题：一是传统人工审核因订单量激增而不可行，二是现有自动化方法缺乏类司法决策所需的推理透明性；二是当前多模态大语言模型（Multimodal LLMs）难以在通用视觉语义与严谨证据规则之间建立对齐，常导致感知幻觉和逻辑松散。解决方案的关键在于提出 RideJudge 框架，其创新性体现在三个层面：首先，通过 SynTraj 合成引擎将抽象责任概念锚定为具体的轨迹模式，从而弥合语义鸿沟；其次，采用自适应上下文优化策略与 Chain-of-Adjudication 机制，在有限上下文窗口内高效整合专家知识并主动引导证据检索；最后，引入序数敏感强化学习（Ordinal-Sensitive Reinforcement Learning）机制，基于层级严重性校准决策边界，克服稀疏二元反馈的局限性，显著提升判责准确性与可解释性。

链接: https://arxiv.org/abs/2603.17328
作者: Weiming Wu,Zi-Jian Cheng,Jie Meng,Peng Zhen,Shan Huang,Qun Li,Guobin Wu,Lan-Zhe Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity. Extensive experiments show that our RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.

[AI-49] ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling

【速读】：该论文旨在解决如何在快速对抗性体育项目中有效支持强化学习与策略行为分析的问题，尤其针对羽毛球运动中复杂且动态的回合交互建模难题。其解决方案的关键在于构建一个基于精英选手比赛数据的交互式、数据驱动型仿真环境ShuttleEnv，该环境采用显式的概率模型来模拟回合级动态，从而实现真实且可解释的智能体-对手互动，而无需依赖物理引擎模拟。这一设计使得研究人员能够训练和可视化多种策略，并对决策行为进行交互式分析，为体育人工智能研究提供了一个可复用的平台。

链接: https://arxiv.org/abs/2603.17324
作者: Ang Li,Xinyang Gong,Bozhou Chen,Yunlong Lu,Jiaming Ji,Yongyi Wang,Yaodong Yang,Wenxin Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present ShuttleEnv, an interactive and data-driven simulation environment for badminton, designed to support reinforcement learning and strategic behavior analysis in fast-paced adversarial sports. The environment is grounded in elite-player match data and employs explicit probabilistic models to simulate rally-level dynamics, enabling realistic and interpretable agent-opponent interactions without relying on physics-based simulation. In this demonstration, we showcase multiple trained agents within ShuttleEnv and provide live, step-by-step visualization of badminton rallies, allowing attendees to explore different play styles, observe emergent strategies, and interactively analyze decision-making behaviors. ShuttleEnv serves as a reusable platform for research, visualization, and demonstration of intelligent agents in sports AI. Our ShuttleEnv demo video URL: this https URL

[AI-50] Physics-informed offline reinforcement learning eliminates catastrophic fuel waste in maritime routing

【速读】：该论文旨在解决国际航运中航路规划效率低下导致的温室气体排放问题，特别是传统基于启发式方法的航线规划难以兼顾燃油经济性与航行安全。其核心挑战在于如何在不依赖在线仿真器的前提下，从历史船舶轨迹和海洋再分析数据中学习出既节能又安全的航路策略。解决方案的关键在于提出PIER（Physics-Informed, Energy-efficient, Risk-aware routing）框架，该框架融合了物理信息状态构建、示范增强的离线数据训练以及解耦后的后处理安全屏障机制，实现了无需天气预报输入即可稳定优化航路性能，显著降低极端燃油消耗风险（减少9倍）并提升航程间燃油使用的稳定性（方差降低3.5倍）。

链接: https://arxiv.org/abs/2603.17319
作者: Aniruddha Bora,Julie Chalfant,Chryssostomos Chryssostomidis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:International shipping produces approximately 3% of global greenhouse gas emissions, yet voyage routing remains dominated by heuristic methods. We present PIER (Physics-Informed, Energy-efficient, Risk-aware routing), an offline reinforcement learning framework that learns fuel-efficient, safety-aware routing policies from physics-calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator. Validated on one full year (2023) of AIS data across seven Gulf of Mexico routes (840 episodes per method), PIER reduces mean CO2 emissions by 10% relative to great-circle routing. However, PIER’s primary contribution is eliminating catastrophic fuel waste: great-circle routing incurs extreme fuel consumption (1.5x median) in 4.8% of voyages; PIER reduces this to 0.5%, a 9-fold reduction. Per-voyage fuel variance is 3.5x lower (p0.001), with bootstrap 95% CI for mean savings [2.9%, 15.7%]. Partial validation against observed AIS vessel behavior confirms consistency with the fastest real transits while exhibiting 23.1x lower variance. Crucially, PIER is forecast-independent: unlike A* path optimization whose wave protection degrades 4.5x under realistic forecast uncertainty, PIER maintains constant performance using only local observations. The framework combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield, an architecture that transfers to wildfire evacuation, aircraft trajectory optimization, and autonomous navigation in unmapped terrain.

[AI-51] GUIDE: GenAI Units In Digital Design Education

【速读】：该论文旨在解决数字设计教育中教学资源分散、实践环节薄弱以及生成式 AI (Generative AI) 技术难以系统融入课程的问题。其解决方案的关键在于构建一个结构化的开源课程资源库——GenAI Units In Digital Design Education (GUIDE)，该库以标准化的教学单元（包括幻灯片、短视频、可运行的 Google Colab 实验室及参考文献）为核心组织形式，确保学生学习体验的一致性与教师教学、评分的可复用性。通过将 GUIDE 单元整合为完整学期课程（如 GUIDE4HardwareSecurity），并结合真实项目（如 LLM 辅助硬件木马插入）和竞赛场景（如 CSAW），实现了生成式 AI 与数字电路设计教育的深度融合与实践落地。

链接: https://arxiv.org/abs/2603.17296
作者: Weihua Xiao,Jason Blocklove,Matthew DeLorenzo,Johann Knechtel,Ozgur Sinanoglu,Kanad Basu,Jeyavijayan Rajendran,Siddharth Garg,Ramesh Karri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GenAI Units In Digital Design Education (GUIDE) is an open courseware repository with runnable Google Colab labs and other materials. We describe the repository’s architecture and educational approach based on standardized teaching units comprising slides, short videos, runnable labs, and related papers. This organization enables consistency for both the students’ learning experience and the reuse and grading by instructors. We demonstrate GUIDE in practice with three representative units: VeriThoughts for reasoning and formal-verification-backed RTL generation, enhanced LLM-aided testbench generation, and LLMPirate for IP Piracy. We also provide details for four example course instances (GUIDE4ChipDesign, Build your ASIC, GUIDE4HardwareSecurity, and Hardware Design) that assemble GUIDE units into full semester offerings, learning outcomes, and capstone projects, all based on proven materials. For example, the GUIDE4HardwareSecurity course includes a project on LLM-aided hardware Trojan insertion that has been successfully deployed in the classroom and in Cybersecurity Games and Conference (CSAW), a student competition and academic conference for cybersecurity. We also organized an NYU Cognichip Hackathon, engaging students across 24 international teams in AI-assisted RTL design workflows. The GUIDE repository is open for contributions and available at: this https URL.

[AI-52] Pathology-Aware Multi-View Contrastive Learning for Patient-Independent ECG Reconstruction

【速读】：该论文旨在解决从少导联（reduced lead set）重构12导联心电图（12-lead electrocardiogram, ECG）这一病态逆问题（ill-posed inverse problem），其核心挑战在于个体解剖差异导致的信号模糊性，以及传统深度学习方法因忽略心脏病理状态而丢失胸前导联关键形态信息的问题。解决方案的关键在于提出了一种病理感知多视角对比学习（Pathology-Aware Multi-View Contrastive Learning）框架，通过在潜在空间中引入病理学流形（pathological manifold）进行正则化，并利用监督对比对齐（supervised contrastive alignment）学习病理感知嵌入（pathology-aware embeddings），从而最大化潜在表示与临床标签之间的互信息，有效过滤解剖“干扰变量”（anatomical “nuisance” variables），实现高保真度的ECG重构与跨数据集的优异泛化性能。

链接: https://arxiv.org/abs/2603.17248
作者: Youssef Youssef,Jitin Singla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing a 12-lead electrocardiogram (ECG) from a reduced lead set is an ill-posed inverse problem due to anatomical variability. Standard deep learning methods often ignore underlying cardiac pathology losing vital morphology in precordial leads. We propose Pathology-Aware Multi-View Contrastive Learning, a framework that regularizes the latent space through a pathological manifold. Our architecture integrates high-fidelity time-domain waveforms with pathology-aware embeddings learned via supervised contrastive alignment. By maximizing mutual information between latent representations and clinical labels, the framework learns to filter anatomical “nuisance” variables. On the PTB-XL dataset, our method achieves approx. 76% reduction in RMSE compared to state-of-the-art model in patient-independent setting. Cross-dataset evaluation on the PTB Diagnostic Database confirms superior generalization, bridging the gap between hardware portability and diagnostic-grade reconstruction.

[AI-53] Deployment and Evaluation of an EHR-integrated Large Language Model-Powered Tool to Triage Surgical Patients

【速读】：该论文旨在解决外科联合管理（Surgical Co-Management, SCM）在临床实践中因依赖人工筛选符合条件患者而效率低下的问题。其核心挑战在于如何自动化识别适合SCM的术前复杂患者，以提升诊疗流程的可及性与资源利用效率。解决方案的关键在于开发并验证一个基于大语言模型（Large Language Model, LLM）的电子健康记录（Electronic Health Record, EHR）集成式分诊工具——SCM Navigator，该系统通过整合术前文档、结构化数据和围术期并发症临床标准，自动对患者进行分类推荐，并采用“人在环路”（human-in-the-loop）模式由医师复核决策，从而实现高敏感度（0.94）和可接受特异度（0.74）的精准分诊，显著减少人工筛查负担，同时揭示多数偏差源于临床标准或流程可改进空间，而非AI误判本身。

链接: https://arxiv.org/abs/2603.17234
作者: Jane Wang,Timothy Keyes,April S Liang,Stephen P Ma,Jason Shen,Jerry Liu,Nerissa Ambers,Abby Pandya,Rita Pandya,Jason Hom,Natasha Steele,Jonathan H Chen,Kevin Schulman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 35 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Surgical co-management (SCM) is an evidence-based model in which hospitalists jointly manage medically complex perioperative patients alongside surgical teams. Despite its clinical and financial value, SCM is limited by the need to manually identify eligible patients. To determine whether SCM triage can be automated, we conducted a prospective, unblinded study at Stanford Health Care in which an LLM-based, electronic health record (EHR)-integrated triage tool (SCM Navigator) provided SCM recommendations followed by physician review. Using pre-operative documentation, structured data, and clinical criteria for perioperative morbidity, SCM Navigator categorized patients as appropriate, not appropriate, or possibly appropriate for SCM. Faculty indicated their clinical judgment and provided free-text feedback when they disagreed. Sensitivity, specificity, positive predictive value, and negative predictive value were measured using physician determinations as a reference. Free-text reasons were thematically categorized, and manual chart review was conducted on all false-negative cases and 30 randomly selected cases from the largest false-positive category. Since deployment, 6,193 cases have been triaged, of which 1,582 (23%) were recommended for hospitalist consultation. SCM Navigator displayed high sensitivity (0.94, 95% CI 0.91-0.96) and moderate specificity (0.74, 95% CI 0.71-0.77). Post-hoc chart review suggested most discrepancies reflect modifiable gaps in clinical criteria, institutional workflow, or physician practice variability rather than LLM misclassification, which accounted for 2 of 19 (11%) false-negative cases. These findings demonstrate that an LLM-powered, EHR-integrated, human-in-the-loop AI system can accurately and safely triage surgical patients for SCM, and that AI-enabled screening tools can augment and potentially automate time-intensive clinical workflows. Comments: 35 pages, 4 figures, 5 tables Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17234 [cs.CY] (or arXiv:2603.17234v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2603.17234 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-54] Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

【速读】：该论文旨在解决自动形式化（Auto-formalization, AF）管道在实际应用中存在语义错误的问题，即生成的程序虽能执行但语义不正确，从而影响符号求解器进行可靠逻辑推理的能力。现有方法主要通过求解器反馈修复语法错误，但对语义错误的缓解仍具挑战性。解决方案的关键在于提出一种推理时框架Draft-and-Prune (DP)，其核心机制包括：首先基于多样性生成多个自然语言推理计划，并据此条件化程序生成；随后通过验证机制剔除执行后存在矛盾或歧义的形式化结果；最终对剩余有效路径进行多数投票聚合预测。该方法无需额外监督即可显著提升AF推理性能，在四个基准测试中均取得优于现有最强基线的结果。

链接: https://arxiv.org/abs/2603.17233
作者: Zhiyu Ni,Zheng Liang,Liangcheng Song,Chenrui Cao,Xian Zhang,Alberto Sangiovanni-Vincentelli,Pierluigi Nuzzo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are currently brittle: programs may fail to execute, or execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantics failures remains a major bottleneck. We propose Draft-and-Prune (DP), an inference-time framework that improves AF-based logical reasoning via diversity and verification. DP first drafts multiple natural-language plans and conditions program generation on them. It further prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from surviving paths via majority voting. Across four representative benchmarks (AR-LSAT, ProofWriter, PrOntoQA, LogicalDeduction), DP substantially strengthens AF-based reasoning without extra supervision. On AR-LSAT, in the AF-only setting, DP achieves 78.43% accuracy with GPT-4 and 78.00% accuracy with GPT-4o, significantly outperforming the strongest AF baselines MAD-LOGIC and CLOVER. DP then attains near-ceiling performance on the other benchmarks, including 100% on PrOntoQA and LogicalDeduction.

[AI-55] KANtize: Exploring Low-bit Quantization of Kolmogorov-Arnold Networks for Efficient Inference

【速读】：该论文旨在解决Kolmogorov-Arnold Networks (KANs)在推理阶段因B-spline函数计算复杂度高而导致的硬件效率低下问题，尤其是在低比特量化（<8 bits）场景下缺乏系统性研究。其解决方案的关键在于：通过将B-spline系数量化至2–3 bits，可实现精度损失可忽略的同时显著降低计算复杂度；进一步提出使用预计算的低比特查找表替代递归B-spline算法，从而大幅减少BitOps操作并提升硬件部署效率。实验表明，该方法在多种硬件平台（GPU、FPGA和ASIC）上均能实现显著加速与资源节省，例如ResKAN18在无精度损失前提下实现50倍BitOps减少，FPGA上3-bit量化使资源占用下降36%且时钟频率提升50%。

链接: https://arxiv.org/abs/2603.17230
作者: Sohaib Errabii,Olivier Sentieys,Marcello Traiola
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have gained attention for their potential to outperform Multi-Layer Perceptrons (MLPs) in terms of parameter efficiency and interpretability. Unlike traditional MLPs, KANs use learnable non-linear activation functions, typically spline functions, expressed as linear combinations of basis splines (B-splines). B-spline coefficients serve as the model’s learnable parameters. However, evaluating these spline functions increases computational complexity during inference. Conventional quantization reduces this complexity by lowering the numerical precision of parameters and activations. However, the impact of quantization on KANs, and especially its effectiveness in reducing computational complexity, is largely unexplored, particularly for quantization levels below 8 bits. The study investigates the impact of low-bit quantization on KANs and its impact on computational complexity and hardware efficiency. Results show that B-splines can be quantized to 2-3 bits with negligible loss in accuracy, significantly reducing computational complexity. Hence, we investigate the potential of using low-bit quantized precomputed tables as a replacement for the recursive B-spline algorithm. This approach aims to further reduce the computational complexity of KANs and enhance hardware efficiency while maintaining accuracy. For example, ResKAN18 achieves a 50x reduction in BitOps without loss of accuracy using low-bit-quantized B-spline tables. Additionally, precomputed 8-bit lookup tables improve GPU inference speedup by up to 2.9x, while on FPGA-based systolic-array accelerators, reducing B-spline table precision from 8 to 3 bits cuts resource usage by 36%, increases clock frequency by 50%, and enhances speedup by 1.24x. On a 28nm FD-SOI ASIC, reducing the B-spline bit-width from 16 to 3 bits achieves 72% area reduction and 50% higher maximum frequency.

[AI-56] AI Scientist via Synthetic Task Scaling

【速读】：该论文旨在解决当前生成式 AI (Generative AI) 在自动科学发现中面临的核心挑战：如何有效训练能够从实践中学习的机器学习代理（ML agents），以克服现有大语言模型（LLMs）常产生看似合理但实际无效研究思路的问题。其解决方案的关键在于构建一个新颖的合成环境生成流水线，该流水线能自动创建与 SWE-agent 框架兼容的、基于真实机器学习数据集的合成任务，涵盖主题采样、数据集提议和代码生成，并通过自调试循环确保任务质量。实验表明，利用该流水线生成的任务训练的学生模型（Qwen3-4B 和 Qwen3-8B）在 MLGym 基准测试中性能显著提升，AUP 指标分别提高 9% 和 12%，验证了合成任务的有效性与可训练性。

链接: https://arxiv.org/abs/2603.17216
作者: Ziyang Cai,Harkirat Behl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the advent of AI agents, automatic scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research, but don’t offer a principled way to train such agents – and current LLMs often generate plausible-looking but ineffective ideas. To make progress on training agents that can learn from doing, we provide a novel synthetic environment generation pipeline targeting machine learning agents. Our pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are 1) grounded in real machine learning datasets, because the proposed datasets are verified against the Huggingface API and are 2) verified for higher quality with a self-debugging loop. To validate the effectiveness of our synthetic tasks, we tackle MLGym, a benchmark for machine learning tasks. From the synthetic tasks, we sample trajectories from a teacher model (GPT-5), then use the trajectories to train a student model (Qwen3-4B and Qwen3-8B). The student models trained with our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.

[AI-57] Adaptive Contracts for Cost-Effective AI Delegation

【速读】：该论文旨在解决组织在通过按绩效付费（pay-for-performance）合同将文本生成任务外包给AI提供商时，因评估噪声导致预期支付成本上升的问题。当评估方法愈发复杂以降低噪声时，其带来的经济收益常被高昂的评估成本所抵消。解决方案的关键在于提出自适应合约（adaptive contracts），即先基于粗粒度信号进行初步判断，再根据结果选择性地执行精细化评估，从而在保证质量的同时节约资源。论文进一步提供了计算最优自适应合约的高效算法，并通过问答与代码生成数据集实证验证了该策略相较于非自适应基线方法的显著优势。

链接: https://arxiv.org/abs/2603.17212
作者: Eden Saig,Tamar Garbuz,Ariel D. Procaccia,Inbal Talgam-Cohen,Jamie Tucker-Foltz
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Comments are welcome

点击查看摘要

Abstract:When organizations delegate text generation tasks to AI providers via pay-for-performance contracts, expected payments rise when evaluation is noisy. As evaluation methods become more elaborate, the economic benefits of decreased noise are often overshadowed by increased evaluation costs. In this work, we introduce adaptive contracts for AI delegation, which allow detailed evaluation to be performed selectively after observing an initial coarse signal in order to conserve resources. We make three sets of contributions: First, we provide efficient algorithms for computing optimal adaptive contracts under natural assumptions or when core problem dimensions are small, and prove hardness of approximation in the general unstructured case. We then formulate alternative models of randomized adaptive contracts and discuss their benefits and limitations. Finally, we empirically demonstrate the benefits of adaptivity over non-adaptive baselines using question-answering and code-generation datasets.

[AI-58] A scalable neural bundle map for multiphysics prediction in lithium-ion battery across varying configurations

【速读】：该论文旨在解决锂离子电池中多物理场（电化学、热学和力学）演化过程在不同单元几何结构和工况下难以高效且准确预测的问题。现有计算框架难以捕捉跨多种几何构型的耦合动态，导致设计效率低且安全性难保障。解决方案的关键在于提出神经束映射（Neural Bundle Map, NBM），这是一种数学严谨的框架，将多物理场演化重新表述为定义在几何基流形上的束映射（bundle map），从而实现几何复杂性与物理定律的完全解耦，确保跨不同域的强算子连续性。该方法在多种配置下实现了小于1%的归一化平均绝对误差，同时在长期预测中保持稳定，并将计算成本降低两个数量级，显著提升了电池设计优化与实时监控的效率与精度。

链接: https://arxiv.org/abs/2603.17209
作者: Zhiwei Zhao,Changqing Liu,Jie Lin,Fan Yang,Yifan Zhang,Yan Jin,Yingguang Li
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures

点击查看摘要

Abstract:Efficient and accurate prediction of Multiphysics evolution across diverse cell geometries is fundamental to the design, management and safety of lithium-ion batteries. However, existing computational frameworks struggle to capture the coupled electrochemical, thermal, and mechanical dynamics across diverse cell geometries and varying operating conditions. Here, we present a Neural Bundle Map (NBM), a mathematically rigorous framework that reformulates multiphysics evolution as a bundle map over a geometric base manifold. This approach enables the complete decoupling of geometric complexity from underlying physical laws, ensuring strong operator continuity across varying domains. Our framework achieves high-fidelity spatiotemporal predictions with a normalized mean absolute error of less than 1% across varying configurations, while maintaining stability during long-horizon forecasting far beyond the training window and reducing computational costs by two orders of magnitude compared with conventional solvers. Leveraging this capability, we rapidly explored a vast configurational space to identify an optimal battery design that yields a 38% increase in energy density while adhering to thermal safety constraints. Furthermore, the NBM demonstrates remarkable scalability to multi-cell systems through few-shot transfer learning, providing a foundational paradigm for the intelligent design and real-time monitoring of complex energy storage infrastructures.

[AI-59] owards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems

【速读】：该论文旨在解决检索增强生成（Retrieval Augmented Generation, RAG）系统中因上下文文档被篡改而导致的安全漏洞问题，尤其是针对持续性攻击和零日攻击的早期检测难题。其解决方案的关键在于提出一种无监督检测方法，利用生成器激活值、输出嵌入向量以及基于熵的不确定性度量作为互补指标，通过简单的统计异常检测技术实现对恶意上下文文档的有效识别，且无需依赖攻击者的目标提示（target prompt），同时发现上下文摘要生成可能更优的检测效果。

链接: https://arxiv.org/abs/2603.17176
作者: Patrick Levi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval augmented generation systems have become an integral part of everyday life. Whether in internet search engines, email systems, or service chatbots, these systems are based on context retrieval and answer generation with large language models. With their spread, also the security vulnerabilities increase. Attackers become increasingly focused on these systems and various hacking approaches are developed. Manipulating the context documents is a way to persist attacks and make them affect all users. Therefore, detecting compromised, adversarial context documents early is crucial for security. While supervised approaches require a large amount of labeled adversarial contexts, we propose an unsupervised approach, being able to detect also zero day attacks. We conduct a preliminary study to show appropriate indicators for adversarial contexts. For that purpose generator activations, output embeddings, and an entropy-based uncertainty measure turn out as suitable, complementary quantities. With an elementary statistical outlier detection, we propose and compare their detection abilities. Furthermore, we show that the target prompt, which the attacker wants to manipulate, is not required for a successful detection. Moreover, our results indicate that a simple context summary generation might even be superior in finding manipulated contexts.

[AI-60] Detecting Data Poisoning in Code Generation LLM s via Black-Box Vulnerability-Oriented Scanning

【速读】：该论文旨在解决生成式 AI（Generative AI）在代码生成场景中面临的后门攻击和投毒攻击问题，此类攻击可诱导模型生成存在安全漏洞的代码，而现有防御手段因依赖词元级一致性检测，在面对语义相同但语法多样的源代码时效果有限。解决方案的关键在于提出 CodeScan，一个专为代码生成模型设计的投毒扫描框架：它通过分析不同干净提示下多次生成代码之间的结构相似性，结合迭代分歧分析与抽象语法树（Abstract Syntax Tree, AST）归一化技术，消除表面语法差异并统一语义等价代码；进而利用大语言模型（LLM）进行漏洞分析，识别出反复出现且含安全漏洞的结构，从而准确判定模型是否被污染。

链接: https://arxiv.org/abs/2603.17174
作者: Shenao Yan,Shimaa Ahmed,Shan Jin,Sunpreet S. Arora,Yiwei Cai,Yizhen Wang,Yuan Hong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Preprint

点击查看摘要

Abstract:Code generation large language models (LLMs) are increasingly integrated into modern software development workflows. Recent work has shown that these models are vulnerable to backdoor and poisoning attacks that induce the generation of insecure code, yet effective defenses remain limited. Existing scanning approaches rely on token-level generation consistency to invert attack targets, which is ineffective for source code where identical semantics can appear in diverse syntactic forms. We present CodeScan, which, to the best of our knowledge, is the first poisoning-scanning framework tailored to code generation models. CodeScan identifies attack targets by analyzing structural similarities across multiple generations conditioned on different clean prompts. It combines iterative divergence analysis with abstract syntax tree (AST)-based normalization to abstract away surface-level variation and unify semantically equivalent code, isolating structures that recur consistently across generations. CodeScan then applies LLM-based vulnerability analysis to determine whether the extracted structures contain security vulnerabilities and flags the model as compromised when such a structure is found. We evaluate CodeScan against four representative attacks under both backdoor and poisoning settings across three real-world vulnerability classes. Experiments on 108 models spanning three architectures and multiple model sizes demonstrate 97%+ detection accuracy with substantially lower false positives than prior methods.

[AI-61] PAuth - Precise Task-Scoped Authorization For Agents

【速读】：该论文旨在解决当前授权模型与新兴智能体网络（agentic web）愿景不匹配的问题，即传统基于操作者范围的授权机制（如OAuth）授予的是宽泛权限，而非任务执行所需的精确操作权限，导致AI代理存在过度授权风险。其解决方案的核心是提出一种名为精准任务范围隐式授权（Precise Task-Scoped Implicit Authorization, PAuth）的新模型，该模型通过自然语言（NL）任务自动隐式授权仅限于完成该任务所必需的具体操作；关键技术包括：NL切片（NL slices）——从任务和上游结果推导出每个服务预期的符号化调用规范，以及封装体（envelopes）——绑定每个操作数的符号来源与具体值，使服务器可验证所有操作数均源自合法计算过程。实验证明，PAuth在正常场景下无需额外权限即可成功执行任务，在注入恶意操作的攻击场景中能准确触发权限缺失警告，实现了对权限推理的精确性保障。

链接: https://arxiv.org/abs/2603.17170
作者: Reshabh K Sharma,Linxi Jiang,Zhiqiang Lin,Shuo Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:The emerging agentic web envisions AI agents that reliably fulfill users’ natural-language (NL)-based tasks by interacting with existing web services. However, existing authorization models are misaligned with this vision. In particular, today’s operator-scoped authorization, exemplified by OAuth, grants broad permissions tied to operators (e.g., the transfer operator) rather than to the specific operations (e.g., transfer 100 to Bob) implied by a user’s task. This will inevitably result in overprivileged agents. We introduce Precise Task-Scoped Implicit Authorization (PAuth), a fundamentally different model in which submitting an NL task implicitly authorizes only the concrete operations required for its faithful execution. To make this enforceable at servers, we propose NL slices: symbolic specifications of the calls each service expects, derived from the task and upstream results. Complementing this, we also propose envelopes: special data structure to bind each operand’s concrete value to its symbolic provenance, enabling servers to verify that all operands arise from legitimate computations. PAuth is prototyped in the agent-security evaluation framework AgentDojo. We evaluate it in both benign settings and attack scenarios where a spurious operation is injected into an otherwise normal task. In all benign tests, PAuth executes the tasks successfully without requiring any additional permissions. In all attack tests, PAuth correctly raises warnings about missing permissions. These results demonstrate that PAuth’s reasoning about permissions is indeed precise. We further analyze the characteristics of these tasks and measure the associated token costs. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL) Cite as: arXiv:2603.17170 [cs.CR] (or arXiv:2603.17170v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.17170 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-62] Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents

【速读】：该论文试图解决的问题是：生成式 AI (Generative AI) 系统在代码生成过程中存在“意图鸿沟”（intent gap），即用户用自然语言描述的非正式需求与程序实际行为之间缺乏精确对应，导致生成的代码可能无法满足用户真实意图。解决方案的关键在于“意图形式化”（intent formalization）——将非正式用户意图转化为可验证的正式规范（formal specifications），从而实现对 AI 生成代码的可靠性保障。该方法提供了一个从轻量级测试到全功能规格乃至自动合成正确代码的可靠性权衡谱系，并强调“规范验证”（validating specifications）是当前的核心瓶颈，需借助半自动化指标和轻量用户交互来评估规范质量，而非依赖外部 oracle。

链接: https://arxiv.org/abs/2603.17150
作者: Shuvendu K. Lahiri
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 10 pages

点击查看摘要

Abstract:Agentic AI systems can now generate code with remarkable fluency, but a fundamental question remains: \emphdoes the generated code actually do what the user intended? The gap between informal natural language requirements and precise program behavior – the \emphintent gap – has always plagued software engineering, but AI-generated code amplifies it to an unprecedented scale. This article argues that \textbfintent formalization – the translation of informal user intent into a set of checkable formal specifications – is the key challenge that will determine whether AI makes software more reliable or merely more abundant. Intent formalization offers a tradeoff spectrum suitable to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically. The central bottleneck is \emphvalidating specifications: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests. We survey early research that demonstrates the \emphpotential of this approach: interactive test-driven formalization that improves program correctness, AI-generated postconditions that catch real-world bugs missed by prior methods, and end-to-end verified pipelines that produce provably correct code from informal specifications. We outline the open research challenges – scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, designing human-AI specification interactions – that define a research agenda spanning AI, programming languages, formal methods, and human-computer interaction.

[AI-63] REAL: Regression-Aware Reinforcement Learning for LLM -as-a-Judge

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在作为自动化评估者（LLM-as-a-Judge）时，传统强化学习（Reinforcement Learning, RL）方法因仅使用二值奖励（如0-1准确率）而忽略回归任务中固有的序数结构的问题。现有回归感知方法多局限于监督微调（Supervised Fine-Tuning, SFT），难以探索最优推理路径。为此，作者提出了一种名为REAL（Regression-Aware Reinforcement Learning）的原理性强化学习框架，其关键在于通过广义策略梯度估计器（generalized policy gradient estimator）将优化过程分解为两个互补组件：（1）在思维链（Chain-of-Thought, CoT）轨迹上的探索；（2）对最终评分的回归感知预测精炼。该方法显式处理回归目标对策略的依赖性，从而克服标准策略梯度方法失效的问题，并在多个模型规模（8B至32B参数）上显著优于SFT和标准RL基线，尤其在跨域基准测试中展现出更强泛化能力。

链接: https://arxiv.org/abs/2603.17145
作者: Yasi Zhang,Tianyu Chen,Mingyuan Zhou,Oscar Leong,Ying Nian Wu,Michal Lukasik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose \textbfREAL (\underlineREgression-\underlineAware Reinforcement \underlineLearning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectory, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.

[AI-64] Security Assessment and Mitigation Strategies for Large Language Models : A Comprehensive Defensive Framework

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在关键基础设施领域部署时面临的对抗性攻击风险问题，即当前缺乏对主流LLM架构的系统性安全评估方法，导致组织无法量化风险或选择适合高安全性场景的模型。解决方案的关键在于构建一个标准化的漏洞评估框架，并开发一套多层防御体系：首先通过针对五类主流LLM（包括GPT-4、Claude-3 Haiku、LLaMA-2-70B等）实施10,000条跨六类攻击的对抗提示测试，揭示其安全性能差异（漏洞率介于11.9%至29.8%之间）；随后设计并实现一个可投入生产的外部防御框架，在保持仅5%假阳性率的前提下实现了平均83%的检测准确率，从而为生产环境中更安全的LLM应用提供可行路径。

链接: https://arxiv.org/abs/2603.17123
作者: Taiwo Onitiju,Iman Vakilinia
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models increasingly power critical infrastructure from healthcare to finance, yet their vulnerability to adversarial manipulation threatens system integrity and user safety. Despite growing deployment, no comprehensive comparative security assessment exists across major LLM architectures, leaving organizations unable to quantify risk or select appropriately secure LLMs for sensitive applications. This research addresses this gap by establishing a standardized vulnerability assessment framework and developing a multi-layered defensive system to protect against identified threats. We systematically evaluate five widely-deployed LLM families GPT-4, GPT-3.5 Turbo, Claude-3 Haiku, LLaMA-2-70B, and Gemini-2.5-pro against 10,000 adversarial prompts spanning six attack categories. Our assessment reveals critical security disparities, with vulnerability rates ranging from 11.9% to 29.8%, demonstrating that LLM capability does not correlate with security robustness. To mitigate these risks, we develop a production-ready defensive framework achieving 83% average detection accuracy with only 5% false positives. These results demonstrate that systematic security assessment combined with external defensive measures provides a viable path toward safer LLM deployment in production environments.

[AI-65] Cascade-Aware Multi-Agent Routing: Spatio-Temporal Sidecars and Geometry-Switching

【速读】：该论文旨在解决符号图网络（symbolic graph network）中因几何结构盲区导致的故障传播不可控问题，即当前调度器在优化负载和任务适配度时忽略了执行图的拓扑结构特性——树状结构中单点故障易引发指数级扩散，而密集循环结构则具有自限性。解决方案的关键在于引入一种轻量级的在线几何控制机制，通过融合三类信号实现路由风险估计：(i) 欧几里得时空传播基线，(ii) 带时间衰减（可选突发激励）的双曲路由风险模型，以及 (iii) 基于结构特征的几何选择器（一个9-12-1的紧凑多层感知机），其输入包括6个拓扑统计量与3个几何感知信号（BFS壳层增长斜率、环秩范数、拟合的庞加莱曲率）。该方法显著提升了系统在树状结构中的鲁棒性，在Genesis 3基准测试中将最难的非树场景胜率从64%提升至92%，整体胜率达87.2%，验证了对几何敏感型故障传播的有效抑制。

链接: https://arxiv.org/abs/2603.17112
作者: Davide Di Gioia
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A common architectural pattern in advanced AI reasoning systems is the symbolic graph network: specialized agents or modules connected by delegation edges, routing tasks through a dynamic execution graph. Current schedulers optimize load and fitness but are geometry-blind: they do not model how failures propagate differently in tree-like versus cyclic regimes. In tree-like delegation, a single failure can cascade exponentially; in dense cyclic graphs, failures tend to self-limit. We identify this observability gap, quantify its system-level cost, and propose a lightweight mitigation. We formulate online geometry control for route-risk estimation on time-indexed execution graphs with route-local failure history. Our approach combines (i) a Euclidean spatio-temporal propagation baseline, (ii) a hyperbolic route-risk model with temporal decay (and optional burst excitation), and (iii) a learned geometry selector over structural features. The selector is a compact MLP (9-12-1) using six topology statistics plus three geometry-aware signals: BFS shell-growth slope, cycle-rank norm, and fitted Poincare curvature. On the Genesis 3 benchmark distribution, adaptive switching improves win rate in the hardest non_tree regime from 64-72% (fixed hyperbolic variants) to 92%, and achieves 87.2% overall win rate. To measure total system value, we compare against Genesis 3 routing without any spatio-temporal sidecar, using only native bandit/LinUCB signals (team fitness and mean node load). This baseline achieves 50.4% win rate overall and 20% in tree-like regimes; the full sidecar recovers 87.2% overall (+36.8 pp), with +48 to +68 pp gains in tree-like settings, consistent with a cascade-sensitivity analysis. Overall, a 133-parameter sidecar substantially mitigates geometry-blind failure propagation in one high-capability execution-graph system. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2603.17112 [cs.AI] (or arXiv:2603.17112v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.17112 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-66] When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

【速读】：该论文旨在解决当前编码代理（coding agent）评估基准在真实研究场景中缺乏对渐进式需求披露（emergent specification）建模的问题。传统基准通常一次性提供完整任务说明，而实际科研编码过程往往通过交互逐步揭示系统设计，要求代理在长会话中持续追踪持久的设计承诺。为此，作者提出了一个新基准，定义了“规范忠实度损失 under emergent specification”（SLUMP），量化因渐进式披露导致的最终实现忠实度下降，并引入包含20篇最新机器学习论文、371个可验证原子组件及约60次渐进式编码请求的测试集。关键解决方案是提出ProjectGuard——一种外部项目状态层用于显式跟踪规范变更，在Claude Code上使忠实度差距减少90%，完全忠实组件数量从118提升至181，严重失败从72降至49，从而验证了规范追踪作为长程编码代理独立评估目标的重要性。

链接: https://arxiv.org/abs/2603.17104
作者: Lu Yan,Xuan Chen,Xiangyu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current coding-agent benchmarks usually pro- vide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through in- teraction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulne Ss Loss U nder eM ergent s Pecification (SLUMP), defined as the reduc- tion in final implementation faithfulness un- der emergent specification relative to a single- shot specification control. The benchmark con- tains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable compo- nents, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure audit to verify that scored components are recoverable from the visible interaction. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively. Structural integration degrades under emergent specification on both platforms, while seman- tic faithfulness loss is substantial on Claude Code and small on Codex. As a mitigation case study, we introduce ProjectGuard, an exter- nal project-state layer for specification tracking. On Claude Code, ProjectGuard recovers 90% of the faithfulness gap, increases fully faith- ful components from 118 to 181, and reduces severe failures from 72 to 49. These results identify specification tracking as a distinct eval- uation target for long-horizon coding agents.

[AI-67] CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning ICLR2026

【速读】：该论文旨在解决高效算术电路（arithmetic circuit）的自动合成问题，即如何用最少的加法和乘法门构造出计算特定多项式（polynomial）的最优电路。这一问题与自动生成证明（auto-proof generation）及Valiant提出的VP vs. VNP猜想密切相关，是理论计算机科学中关于计算复杂性与电路最小化的核心挑战之一。解决方案的关键在于将该问题建模为一个单玩家博弈，并采用强化学习（reinforcement learning, RL）方法进行训练：具体地，作者实现了类似AlphaZero的训练循环，对比了两种策略优化算法——近端策略优化结合蒙特卡洛树搜索（Proximal Policy Optimization with Monte Carlo Tree Search, PPO+MCTS）与软演员-评论家算法（Soft Actor-Critic, SAC）。实验表明，SAC在双变量目标上表现最佳，而PPO+MCTS能扩展至三变量情形并持续提升对更难实例的求解能力，验证了该框架作为研究自适应改进搜索策略的紧凑且可验证环境的有效性。

链接: https://arxiv.org/abs/2603.17075
作者: Weikun K. Zhang,Rohan Pandey,Bhaumik Mehta,Kaijie Jin,Naomi Morato,Archit Ganapule,Michael Ruofan Zeng,Jarod Alper
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注: ICLR 2026 Workshop on AI with Recursive Self-Improvement

点击查看摘要

Abstract:Motivated by auto-proof generation and Valiant’s VP vs. VNP conjecture, we study the problem of discovering efficient arithmetic circuits to compute polynomials, using addition and multiplication gates. We formulate this problem as a single-player game, where an RL agent attempts to build the circuit within a fixed number of operations. We implement an AlphaZero-style training loop and compare two approaches: Proximal Policy Optimization with Monte Carlo Tree Search (PPO+MCTS) and Soft Actor-Critic (SAC). SAC achieves the highest success rates on two-variable targets, while PPO+MCTS scales to three variables and demonstrates steady improvement on harder instances. These results suggest that polynomial circuit synthesis is a compact, verifiable setting for studying self-improving search policies.

[AI-68] ransformers are Bayesian Networks

【速读】：该论文试图解决的问题是：尽管Transformer已成为人工智能领域的主导架构，但其工作机制仍缺乏清晰的理论解释。论文的核心贡献在于提出并严格证明了一个关键结论——Transformer本质上是一个贝叶斯网络（Bayesian network）。解决方案的关键在于通过五种形式化论证确立这一等价性：首先，证明任意权重的sigmoid Transformer均等价于在其隐式因子图上执行加权环状信念传播（loopy belief propagation），且每层对应一轮BP；其次，构造性地证明Transformer可对任何知识库实现精确信念传播；第三，证明唯一性：只有当权重符合BP规则时，sigmoid Transformer才能产生精确后验概率；第四，揭示了Transformer层的AND/OR布尔结构，其中注意力机制对应AND、前馈网络（FFN）对应OR，并严格匹配Pearl的gather/update算法；第五，实验验证了上述理论结果在实践中的有效性，同时指出无概念空间下的推理必然导致幻觉，这并非可通过模型规模扩展修复的缺陷，而是结构上的必然结果。

链接: https://arxiv.org/abs/2603.17063
作者: Gregory Coppola
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways. First, we prove that every sigmoid transformer with any weights implements weighted loopy belief propagation on its implicit factor graph. One layer is one round of BP. This holds for any weights – trained, random, or constructed. Formally verified against standard mathematical axioms. Second, we give a constructive proof that a transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies this yields provably correct probability estimates at every node. Formally verified against standard mathematical axioms. Third, we prove uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP weights. There is no other path through the sigmoid architecture to exact posteriors. Formally verified against standard mathematical axioms. Fourth, we delineate the AND/OR boolean structure of the transformer layer: attention is AND, the FFN is OR, and their strict alternation is Pearl’s gather/update algorithm exactly. Fifth, we confirm all formal results experimentally, corroborating the Bayesian network characterization in practice. We also establish the practical viability of loopy belief propagation despite the current lack of a theoretical convergence guarantee. We further prove that verifiable inference requires a finite concept space. Any finite verification procedure can distinguish at most finitely many concepts. Without grounding, correctness is not defined. Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts. Formally verified against standard mathematical axioms. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2603.17063 [cs.AI] (or arXiv:2603.17063v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2603.17063 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gregory Coppola [view email] [v1] Tue, 17 Mar 2026 18:50:13 UTC (28 KB) Full-text links: Access Paper: View a PDF of the paper titled Transformers are Bayesian Networks, by Gregory CoppolaView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-03 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[AI-69] Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

【速读】：该论文旨在解决向量量化（Vector Quantization）在生成模型中出现的表示坍缩（Collapse）问题，包括离散码本（codebook）中的token坍缩和连续潜在嵌入（latent embeddings）中的嵌入坍缩。研究发现，随机初始化和编码器容量有限是导致这两种坍缩的关键因素。解决方案的核心在于针对不同类型的坍缩提出相应的缓解策略：对于token坍缩，优化码本初始化与更新机制；对于嵌入坍缩，则通过增强编码器表达能力或引入正则化手段来稳定潜在空间分布。这是首个系统性探讨向量量化中表示坍缩问题的研究。

链接: https://arxiv.org/abs/2603.17052
作者: Wenhao Zhao,Qiran Zou,Rushi Shah,Yudi Wu,Zhouhan Lin,Dianbo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we systematically investigate the issue of collapses in vector quantization, where collapsed representations are observed across discrete codebook tokens and continuous latent embeddings. By leveraging both synthetic and real datasets, we identify the severity of each type of collapses and triggering conditions. Our analysis reveals that random initialization and limited encoder capacity result in tokens collapse and embeddings collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.

[AI-70] Generative AI-assisted Participatory Modeling in Socio-Environmental Planning under Deep Uncertainty

【速读】：该论文旨在解决社会-环境规划在深度不确定性下，问题概念化过程因依赖参与式建模而复杂且耗时的问题。解决方案的关键在于提出一种基于大语言模型（Large Language Models, LLMs）的模板化工作流，通过LLMs从利益相关者的自然语言描述中自动识别关键模型组件，探索多样化的视角，并将其整合为统一模型，最终以迭代方式在Python中实现。实验表明，该方法能显著提升问题概念化效率，且经少量人工验证后即可获得可接受结果，从而有效支持后续的 socio-environmental planning 步骤。

链接: https://arxiv.org/abs/2603.17021
作者: Zhihao Pei,Nir Lipovetzky,Angela M. Rojas-Arevalo,Fjalar J. de Haan,Enayat A. Moallemi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Socio-environmental planning under deep uncertainty requires researchers to identify and conceptualize problems before exploring policies and deploying plans. In practice and model-based planning approaches, this problem conceptualization process often relies on participatory modeling to translate stakeholders’ natural-language descriptions into a quantitative model, making this process complex and time-consuming. To facilitate this process, we propose a templated workflow that uses large language models for an initial conceptualization process. During the workflow, researchers can use large language models to identify the essential model components from stakeholders’ intuitive problem descriptions, explore their diverse perspectives approaching the problem, assemble these components into a unified model, and eventually implement the model in Python through iterative communication. These results will facilitate the subsequent socio-environmental planning under deep uncertainty steps. Using ChatGPT 5.2 Instant, we demonstrated this workflow on the lake problem and an electricity market problem, both of which demonstrate socio-environmental planning problems. In both cases, acceptable outputs were obtained after a few iterations with human verification and refinement. These experiments indicated that large language models can serve as an effective tool for facilitating participatory modeling in the problem conceptualization process in socio-environmental planning.

[AI-71] Implementation of tangent linear and adjoint models for neural networks based on a compiler library tool

【速读】：该论文旨在解决传统数值模型（以Fortran编写）与基于Python的深度学习框架之间存在的跨语言兼容性差、耦合灵活性不足及数据传输效率低的问题。其解决方案的关键在于提出TorchNWP工具库，该工具基于LibTorch构建，通过将PyTorch框架下的深度学习模型转换为静态二进制格式并提供C/C++接口，实现深度学习模型在数值模型中的高效嵌入式部署；同时利用混合Fortran/C/C++编程方式，在C/C++层面实现神经网络的切线线性模型（tangent linear model）和伴随模型（adjoint model），从而屏蔽神经网络内部结构、简化四维变分同化系统（4D-Var data assimilation system）的构建流程，并支持异构平台部署与主流神经网络模型的兼容性，显著降低耦合成本并提升数值天气预报的精度与效率。

链接: https://arxiv.org/abs/2603.16976
作者: Sa Xiao,Hao Jing,Honglu Sun,Haoyu Li
机构: 未知
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents TorchNWP, a compilation library tool for the efficient coupling of artificial intelligence components and traditional numerical models. It aims to address the issues of poor cross-language compatibility, insufficient coupling flexibility, and low data transfer efficiency between operational numerical models developed in Fortran and Python-based deep learning frameworks. Based on LibTorch, it optimizes and designs a unified application-layer calling interface, converts deep learning models under the PyTorch framework into a static binary format, and provides C/C++ interfaces. Then, using hybrid Fortran/C/C++ programming, it enables the deployment of deep learning models within numerical models. Integrating TorchNWP into a numerical model only requires compiling it into a callable link library and linking it during the compilation and linking phase to generate the executable. On this basis, tangent linear and adjoint model based on neural networks are implemented at the C/C++ level, which can shield the internal structure of neural network models and simplify the construction process of four-dimensional variational data assimilation systems. Meanwhile, it supports deployment on heterogeneous platforms, is compatible with mainstream neural network models, and enables mapping of different parallel granularities and efficient parallel execution. Using this tool requires minimal code modifications to the original numerical model, thus reducing coupling costs. It can be efficiently integrated into numerical weather prediction models such as CMA-GFS and MCV, and has been applied to the coupling of deep learning-based physical parameterization schemes (e.g., radiation, non-orographic gravity wave drag) and the development of their tangent linear and adjoint models, significantly improving the accuracy and efficiency of numerical weather prediction.

[AI-72] DeepStage: Learning Autonomous Defense Policies Against Multi-Stage APT Campaigns

【速读】：该论文旨在解决高级持续性威胁（Advanced Persistent Threats, APT）检测与响应中缺乏自适应性和阶段感知能力的问题。传统防御机制往往无法动态调整策略以应对攻击者在不同阶段（如初始访问、横向移动、数据窃取等）的行为变化，导致防御效率低下。解决方案的关键在于提出 DeepStage 框架，其核心创新包括：1）将企业环境建模为部分可观测马尔可夫决策过程（Partially Observable Markov Decision Process, POMDP），融合主机溯源数据与网络遥测信息生成统一的溯源图（provenance graph）；2）基于图神经网络编码器和 LSTM 阶段估计器（StageFinder）推断攻击者所处阶段的概率分布，对齐 MITRE ATT&CK 框架；3）利用阶段信念与图嵌入指导分层近端策略优化（Hierarchical Proximal Policy Optimization, PPO）代理，在监控、访问控制、隔离和修复等维度自动选择最优防御动作。实验表明，DeepStage 在真实企业测试环境中实现了 0.89 的阶段加权 F1 分数，较风险感知基线提升 21.9%，验证了其阶段感知与成本高效自主防御的有效性。

链接: https://arxiv.org/abs/2603.16969
作者: Trung V. Phan,Tri Gia Nguyen,Thomas Bauschert
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents DeepStage, a deep reinforcement learning (DRL) framework for adaptive, stage-aware defense against Advanced Persistent Threats (APTs). The enterprise environment is modeled as a partially observable Markov decision process (POMDP), where host provenance and network telemetry are fused into unified provenance graphs. Building on our prior work, StageFinder, a graph neural encoder and an LSTM-based stage estimator infer probabilistic attacker stages aligned with the MITRE ATTCK framework. These stage beliefs, combined with graph embeddings, guide a hierarchical Proximal Policy Optimization (PPO) agent that selects defense actions across monitoring, access control, containment, and remediation. Evaluated in a realistic enterprise testbed using CALDERA-driven APT playbooks, DeepStage achieves a stage-weighted F1-score of 0.89, outperforming a risk-aware DRL baseline by 21.9%. The results demonstrate effective stage-aware and cost-efficient autonomous cyber defense.

[AI-73] Adversarial attacks against Modern Vision-Language Models

【速读】：该论文旨在解决开放源代码视觉语言模型（Vision-Language Model, VLM）代理在真实电商环境中部署前的对抗鲁棒性评估问题。研究通过构建自包含的电商模拟环境，对两个主流开源VLM代理——LLaVA-v1.5-7B与Qwen2.5-VL-7B——实施三种基于梯度的攻击方法（基本迭代法BIM、投影梯度下降PGD及CLIP基谱攻击），以量化其对抗脆弱性。解决方案的关键在于：首先，采用贴近实际应用场景的测试框架确保评估结果具有现实意义；其次，发现不同架构的VLM在对抗攻击下表现出显著差异的鲁棒性，例如Qwen2.5-VL在所有攻击下均表现远优于LLaVA，揭示了模型结构设计对提升对抗防御能力的重要性，从而为VLM代理在商业部署前的安全评估提供了可操作的基准和改进方向。

链接: https://arxiv.org/abs/2603.16960
作者: Alejandro Paredes La Torre
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study adversarial robustness of open-source vision-language model (VLM) agents deployed in a self-contained e-commerce environment built to simulate realistic pre-deployment conditions. We evaluate two agents, LLaVA-v1.5-7B and Qwen2.5-VL-7B, under three gradient-based attacks: the Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and a CLIP-based spectral attack. Against LLaVA, all three attacks achieve substantial attack success rates (52.6%, 53.8%, and 66.9% respectively), demonstrating that simple gradient-based methods pose a practical threat to open-source VLM agents. Qwen2.5-VL proves significantly more robust across all attacks (6.5%, 7.7%, and 15.5%), suggesting meaningful architectural differences in adversarial resilience between open-source VLM families. These findings have direct implications for the security evaluation of VLM agents prior to commercial deployment.

[AI-74] Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies

【速读】：该论文旨在解决将基础模型（foundation models）部署到具身边缘系统（embodied edge systems）中的实际可行性问题，指出这不仅涉及模型压缩，更是一个复杂的系统级挑战。其核心问题在于：在严格的尺寸、重量和功耗约束下，内存带宽、计算延迟、时序波动和安全裕度等因素相互耦合，直接影响模型的实时控制能力。解决方案的关键在于采用系统级协同设计（system-level co-design），涵盖内存架构、调度策略、通信机制与模型结构的联合优化，并引入分解策略，将快速控制与慢速语义推理分离，从而突破当前具身基础模型在边缘设备上的部署瓶颈。

链接: https://arxiv.org/abs/2603.16952
作者: Utkarsh Grover(1),Ravi Ranjan(2),Mingyang Mao(1),Trung Tien Dong(1),Satvik Praveen(1),Zhenqi Wu(1),J. Morris Chang(1),Tinoosh Mohsenin(3),Yi Sheng(1),Agoritsa Polyzou(2),Eiman Kanjo(4 and 5),Xiaomin Lin(1) ((1) University of South Florida, Tampa, USA, (2) Florida International University, Miami, USA, (3) Johns Hopkins University, Baltimore, USA, (4) Nottingham Trent University, Nottingham, United Kingdom, (5) Imperial College London, London, United Kingdom)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying foundation models in embodied edge systems is fundamentally a systems problem, not just a problem of model compression. Real-time control must operate within strict size, weight, and power constraints, where memory traffic, compute latency, timing variability, and safety margins interact directly. The Deployment Gauntlet organizes these constraints into eight coupled barriers that determine whether embodied foundation models can run reliably in practice. Across representative edge workloads, autoregressive Vision-Language-Action policies are constrained primarily by memory bandwidth, whereas diffusion-based controllers are limited more by compute latency and sustained execution cost. Reliable deployment therefore depends on system-level co-design across memory, scheduling, communication, and model architecture, including decompositions that separate fast control from slower semantic reasoning.

[AI-75] Cryptographic Runtime Governance for Autonomous AI Systems: The Aegis Architecture for Verifiable Policy Enforcement

【速读】：该论文旨在解决当前AI治理框架在面对自主性增强、运行速度加快和操作不透明的AI系统时，其后验监督、政策指导和行为对齐技术变得脆弱的问题。解决方案的关键在于提出Aegis架构，将政策与法律约束作为执行条件而非建议原则，并通过三个核心组件实现：加密密封的不可变伦理策略层（IEPL）、伦理验证代理（EVA）和执行内核模块（EKM），以及不可变日志内核（ILK）。该架构确保违反政策的行为无法在受控运行时环境中被执行，从而实现可验证的实时约束，而非依赖事后审查。

链接: https://arxiv.org/abs/2603.16938
作者: Adam Massimo Mazzocchetti
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Contemporary AI governance frameworks rely heavily on post hoc oversight, policy guidance, and behavioral alignment techniques, yet these mechanisms become fragile as systems gain autonomy, speed, and operational opacity. This paper presents Aegis, a runtime governance architecture for autonomous AI systems that treats policy and legal constraints as execution conditions rather than advisory principles. Aegis binds each governed agent to a cryptographically sealed Immutable Ethics Policy Layer (IEPL) at system genesis and enforces external emissions through an Ethics Verification Agent (EVA), an Enforcement Kernel Module (EKM), and an Immutable Logging Kernel (ILK). Amendments to the governing policy layer require quorum approval and redeclaration of the system trust root; verified violations trigger autonomous shutdown and generation of auditable proof artifacts. We evaluate the architecture within the Civitas runtime using three operational measures: proof verification latency under tamper conditions, publication overhead, and alignment retention performance relative to an ungoverned baseline. In controlled trials, Aegis demonstrates median proof verification latency of 238 ms, median publication overhead of approximately 9.4 ms, and higher alignment retention than the baseline condition across matched tasks. We argue that these results support a shift in AI governance from discretionary oversight toward verifiable runtime constraint. Rather than claiming to resolve machine ethics in the abstract, the proposed architecture seeks to show that policy violating behavior can be rendered operationally non executable within a controlled runtime governance framework. The paper concludes by discussing methodological limits, evidentiary implications, and the role of proof oriented governance in high assurance AI deployment. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2603.16938 [cs.CR] (or arXiv:2603.16938v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2603.16938 Focus to learn more arXiv-issued DOI via DataCite Related DOI: https://doi.org/10.5281/zenodo.19027190 Focus to learn more DOI(s) linking to related resources

[AI-76] Music Source Restoration with Ensemble Separation and Targeted Reconstruction

【速读】：该论文致力于解决音乐源恢复（Music Source Restoration, MSR）问题，即从经过混音和母带处理的全混合音频中还原出原始未处理的各个声部（stems）。与传统的音乐源分离不同，MSR需逆向处理复杂的音频制作流程，如均衡、压缩、混响等现实世界退化。解决方案的关键在于提出一个两阶段系统：第一阶段利用预训练的分离模型集成生成初步的源估计；第二阶段通过一组基于BSRNN（Band-Specific Recurrent Neural Network）的预训练恢复模型对初步结果进行针对性重构以提升质量。该方法在官方MSR基准上优于所有基线，排名第二。

链接: https://arxiv.org/abs/2603.16926
作者: Xinlong Deng,Yu Xia,Jie Jiang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The Inaugural Music Source Restoration (MSR) Challenge targets the recovery of original, unprocessed stems from fully mixed and mastered music. Unlike conventional music source separation, MSR requires reversing complex production processes such as equalization, compression, reverberation, and other real-world degradations. To address MSR, we propose a two-stage system. First, an ensemble of pre-trained separation models produces preliminary source estimates. Then a set of pre-trained BSRNN-based restoration models performs targeted reconstruction to refine these estimates. On the official MSR benchmark, our system surpasses the baselines on all metrics, ranking second among all submissions. The code is available at this https URL

[AI-77] What on Earth is AlphaEarth? Hierarchical structure and functional interpretability for global land cover

【速读】：该论文旨在解决生成式 AI (Generative AI) 中地理空间基础模型（Geospatial Foundation Models）嵌入空间内部组织机制不明确的问题，从而限制其科学应用。现有研究虽已发现嵌入与连续环境变量相关，但尚不清楚其维度是否具有功能特异性或层级结构。解决方案的关键在于提出一种功能性可解释性框架，通过大规模实验结合特征重要性模式和渐进消融分析，反向解析嵌入维度对地表覆盖分类行为的贡献，从而揭示嵌入维度呈现出非均匀的功能性分布，并可沿层级功能谱系划分为专家型、中等泛化型和高泛化型维度。这一发现表明，仅需少量维度即可维持接近原始性能的地表覆盖分类准确率（最高达98%），显著降低了计算成本，为实际任务中的维度选择提供了依据。

链接: https://arxiv.org/abs/2603.16911
作者: Ivan Felipe Benavides-Martinez,Justin Guthrie,Jhon Edwin Arias,Yeison Alberto Garces-Gomez,Angela Ines Guzman-Alvis,Cristiam Victoriano Portilla-Cabrera,Somnath Mondal,Andrew J. Allyn,Auroop R. Ganguly
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geospatial foundation models generate high-dimensional embeddings that achieve strong predictive performance, yet their internal organization remains obscure, limiting their scientific use. Recent interpretability studies relate Google AlphaEarth Foundations (GAEF) embeddings to continuous environmental variables, but it is still unclear whether the embedding space exhibits a functional or hierarchical organization, in which some dimensions act as specialized representations while others encode shared or broader geospatial structure. In this work, we propose a functional interpretability framework that reverse-engineers the role of embedding dimensions by characterizing their contribution to land cover structure from observed classification behavior. The approach combines large-scale experimentation with a structural analysis of embedding-class relationships based on feature importance patterns and progressive ablation. Our results show that embedding dimensions exhibit consistent and non-uniform functional behavior, allowing them to be categorized along a hierarchical functional spectrum: specialist dimensions associated with specific land cover classes, low- and mid-generalist dimensions capturing shared characteristics between classes, and highgeneralist dimensions reflecting broader environmental gradients. Critically, we find that accurate land cover classification (98% of baseline performance) can be achieved using as few as 2 to 12 of the 64 available dimensions, depending on the class. This demonstrates substantial redundancy in the embedding space and offers a pathway toward significant reductions in computational cost. Together, these findings reveal that AlphaEarth embeddings are not only physically informative, but also functionally organized into a hierarchical structure, providing practical guidance for dimension selection in operational classification tasks.

[AI-78] From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning

【速读】：该论文旨在解决生成式 AI（Generative AI）在阿拉伯语场景下执行函数调用时存在的结构不稳定问题，即模型在将自然语言转化为可执行结构化动作时频繁出现解析失败和函数名误识别。其解决方案的关键在于构建一个面向生产环境的阿拉伯语函数调用框架 AISA-AR-FunctionCall，该框架基于 270M 参数的 FunctionGemma 骨干模型，并通过系统性的数据集审计、Schema 修复、工具感知提示重构以及全参数监督微调实现优化。实验表明，该方法将解析失败率从 87% 降至 1% 以下，函数名准确率提升超过八倍，并显著改善跨方言与跨领域的参数对齐效果，揭示了序列化稳定性与决策层推理是可分离的挑战。

链接: https://arxiv.org/abs/2603.16901
作者: Omer Nacar,Deema Alquffari,Saleh Alsharideh,Adeem AlOtaibi,Abdulaziz Alabdulkarim,Leen Alhazmi,Nada Alomar,Wareef Alzubaidi,Nada Alsultan,Ahmed Alrabghi,Demah Alhoshan,Rana Alsayyari,Hamed Alruwaili,Albaraa Jaafar,Khaled Alusmani,Abdulaziz Alsohimy,Munirah Alsubaie,Shahd Aldukhayil,Arwa Alali,Yazeed BinShihah,Razan Alsulaymi,Nourah Alhumaid,Razan Abdulsalam,Reem Alamoudi,Mohammed Alkhalifa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Function-calling language models are essential for agentic AI systems that translate natural language into executable structured actions, yet existing models exhibit severe structural instability when applied to Arabic. We present AISA-AR-FunctionCall, a production-oriented Arabic function-calling framework built on a 270M-parameter FunctionGemma backbone and trained through systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter supervised fine-tuning. On a held-out test set, fine-tuning reduces parse failures from 87% to below 1%, improves function name accuracy by more than eightfold, and substantially enhances argument alignment across dialects and domains. Error analysis reveals a transition from structural collapse to semantic misalignment, suggesting that serialization stability and decision-level reasoning are separable challenges. We further explore a reasoning-augmented LoRA variant that introduces explicit intermediate reasoning prior to tool invocation. All datasets and models are publicly released under the AISA framework.

[AI-79] Multi-Agent Reinforcement Learning for Dynamic Pricing: Balancing ProfitabilityStability and Fairness

【速读】：该论文旨在解决竞争性零售市场中动态定价问题，即如何在需求波动和竞争对手行为变化的环境下，设计能够自适应调整价格策略的优化方法。其解决方案的关键在于采用多智能体强化学习（Multi-Agent Reinforcement Learning, MARL）框架，具体比较了MAPPO（Multi-Agent Proximal Policy Optimization）和MADDPG（Multi-Agent Deep Deterministic Policy Gradient）两种算法，并与独立学习基线IDDPG进行对比。实验表明，MAPPO在平均收益和稳定性方面表现最优，具备可扩展性和高重现性，是应对复杂竞争环境下的有效策略；而MADDPG则在利润分配公平性上更优，体现了MARL方法相较传统独立学习在动态定价中的优势。

链接: https://arxiv.org/abs/2603.16888
作者: Krishna Kumar Neelakanta Pillai Santha Kumari Amma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic pricing in competitive retail markets requires strategies that adapt to fluctuating demand and competitor behavior. In this work, we present a systematic empirical evaluation of multi-agent reinforcement learning (MARL) approaches-specifically MAPPO and MADDPG-for dynamic price optimization under competition. Using a simulated marketplace environment derived from real-world retail data, we benchmark these algorithms against an Independent DDPG (IDDPG) baseline, a widely used independent learner in MARL literature. We evaluate profit performance, stability across random seeds, fairness, and training efficiency. Our results show that MAPPO consistently achieves the highest average returns with low variance, offering a stable and reproducible approach for competitive price optimization, while MADDPG achieves slightly lower profit but the fairest profit distribution among agents. These findings demonstrate that MARL methods-particularly MAPPO-provide a scalable and stable alternative to independent learning approaches for dynamic retail pricing.

[AI-80] PowerModelsGAT-AI: Physics-Informed Graph Attention for Multi-System Power Flow with Continual Learning

【速读】：该论文旨在解决实时求解交流功率流（Alternating Current Power Flow, AC PF）方程的难题，尤其是在电网运行条件紧张时，传统牛顿-拉夫森（Newton-Raphson）求解器收敛缓慢的问题。现有基于图神经网络（Graph Neural Networks, GNNs）的方法通常在单一系统上训练且泛化能力差，难以跨系统部署。其解决方案的关键在于提出一种物理信息引导的图注意力网络（PowerModelsGAT-AI），该模型通过节点类型感知的掩码机制处理不同类型的节点（如PQ、PV、平衡节点），并采用学习权重平衡多目标损失函数（包括功率不匹配惩罚项），从而实现高精度电压和发电机注入预测；此外，引入经验回放与弹性权重巩固（Experience Replay and Elastic Weight Consolidation）策略，有效缓解持续学习中的灾难性遗忘问题，在新系统适应过程中保持原有系统的性能误差增幅低于2%，甚至提升部分基准系统表现。

链接: https://arxiv.org/abs/2603.16879
作者: Chidozie Ezeakunne,Jose E. Tabarez,Reeju Pokharel,Anup Pandey
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages, 11 figures, 1 ancillary supplementary PDF

点击查看摘要

Abstract:Solving the alternating current power flow equations in real time is essential for secure grid operation, yet classical Newton-Raphson solvers can be slow under stressed conditions. Existing graph neural networks for power flow are typically trained on a single system and often degrade on different systems. We present PowerModelsGAT-AI, a physics-informed graph attention network that predicts bus voltages and generator injections. The model uses bus-type-aware masking to handle different bus types and balances multiple loss terms, including a power-mismatch penalty, using learned weights. We evaluate the model on 14 benchmark systems (4 to 6,470 buses) and train a unified model on 13 of these under N-2 (two-branch outage) conditions, achieving an average normalized mean absolute error of 0.89% for voltage magnitudes and R^2 0.99 for voltage angles. We also show continual learning: when adapting a base model to a new 1,354-bus system, standard fine-tuning causes severe forgetting with error increases exceeding 1000% on base systems, while our experience replay and elastic weight consolidation strategy keeps error increases below 2% and in some cases improves base-system performance. Interpretability analysis shows that learned attention weights correlate with physical branch parameters (susceptance: r = 0.38; thermal limits: r = 0.22), and feature importance analysis supports that the model captures established power flow relationships.

[AI-81] A foundation model for electrodermal activity data

【速读】：该论文旨在解决电活动（Electrodermal Activity, EDA）建模领域缺乏大规模、高质量、公开可获取数据集的问题，这一瓶颈限制了生成式 AI 在生理信号分析中的应用进展。其关键解决方案是构建 EDAMAME 数据集，整合来自 24 个公共数据源的超过 25,000 小时 EDA 数据（覆盖 634 名用户），并基于此训练出首个专用于 EDA 的基础模型 UME。该模型在十种任务场景中八项优于现有基线方法，性能媲美通用时间序列基础模型，同时仅需 1/20 的计算资源，显著提升了 EDA 建模的效率与可行性。

链接: https://arxiv.org/abs/2603.16878
作者: Leonardo Alchieri,Matteo Garzon,Lidia Alecci,Francesco Bombassei De Bona,Martin Gjoreski,Giovanni De Felice,Silvia Santini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Foundation models have recently extended beyond natural language and vision to timeseries domains, including physiological signals. However, progress in electrodermal activity (EDA) modeling is hindered by the absence of large-scale, curated, and openly accessible datasets. EDA reflects sympathetic nervous system activity and is widely used to infer cognitive load, stress, and engagement. Yet very few wearable devices provide continuous, unobtrusive sensing, and the only large-scale archive to date is proprietary. To address this gap, we compile EDAMAME, a collection of EDA traces from 24 public datasets, comprising more than 25,000 hours from 634 users. Using this resource, we train UME, the first dedicated foundation model for EDA. In eight out of ten scenarios, UME outperforms baselines and matches generalist timeseries foundation models while using 20x fewer computational resources. Our findings, however, also highlight the intrinsic challenges of EDA modeling, motivating further research to unlock its full potential. All datasets, model weights, and code are released to support further research.

[AI-82] A Dual Certificate Approach to Sparsity in Infinite-Width Shallow Neural Networks

【速读】：该论文旨在解决无限宽浅层ReLU神经网络在总变差（Total Variation, TV）正则化下的训练问题，核心目标是建立对最优解稀疏性的严格理论保证。其解决方案的关键在于利用TV正则化优化问题的对偶理论，并发现ReLU激活函数对应的对偶证书在权重空间中具有分段线性结构——这种结构由数据诱导的超平面排列决定，从而定义出“对偶区域”（dual regions）。作者进一步证明，在每个对偶区域内，对偶证书至多存在一个极值点，这直接导致最优解的支持集有限且其基数可由数据诱导的超平面排列几何特性上界控制。在此基础上，论文还给出了稀疏解唯一性的充分条件，并在低标签噪声和小正则化参数下，证明了最优解保持稀疏性、位置与幅值收敛，且在对偶区域内部时收敛速率与噪声和正则化参数呈线性关系。

链接: https://arxiv.org/abs/2603.17785
作者: Leonardo Del Grande,Christoph Brune,Marcello Carioni
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we study total variation (TV)-regularized training of infinite-width shallow ReLU neural networks, formulated as a convex optimization problem over measures on the unit sphere. Our approach leverages the duality theory of TV-regularized optimization problems to establish rigorous guarantees on the sparsity of the solutions to the training problem. Our analysis further characterizes how and when this sparsity persists in a low noise regime and for small regularization parameter. The key observation that motivates our analysis is that, for ReLU activations, the associated dual certificate is piecewise linear in the weight space. Its linearity regions, which we name dual regions, are determined by the activation patterns of the data via the induced hyperplane arrangement. Taking advantage of this structure, we prove that, on each dual region, the dual certificate admits at most one extreme value. As a consequence, the support of any minimizer is finite, and its cardinality can be bounded from above by a constant depending only on the geometry of the data-induced hyperplane arrangement. Then, we further investigate sufficient conditions ensuring uniqueness of such sparse solution. Finally, under a suitable non-degeneracy condition on the dual certificate along the boundaries of the dual regions, we prove that in the presence of low label noise and for small regularization parameter, solutions to the training problem remain sparse with the same number of Dirac deltas. Additionally, their location and the amplitudes converge, and, in case the locations lie in the interior of a dual region, the convergence happens with a rate that depends linearly on the noise and the regularization parameter.

[AI-83] Inhibitory normalization of error signals improves learning in neural circuits

【速读】：该论文试图解决的问题是：生物神经回路中由抑制性中间神经元介导的归一化机制是否能够提升学习性能，即这种生理上的归一化是否具备类似人工神经网络（ANNs）中归一化操作在优化训练过程中的作用。其解决方案的关键在于设计了一种包含兴奋性和抑制性神经元分离结构的人工神经网络模型，并在图像识别任务中引入可变亮度的输入分布以模拟复杂环境；实验发现，仅在推理阶段应用归一化无法提升学习效果，而当归一化扩展至反向传播误差信号时，模型性能显著改善，表明若生物系统中的抑制性归一化确实促进学习，则必须同时对学习信号进行归一化处理。

链接: https://arxiv.org/abs/2603.17676
作者: Roy Henha Eyono,Daniel Levenstein,Arna Ghosh,Jonathan Cornford,Blake Richards
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 7 figures. Submitted to Neural Computation

点击查看摘要

Abstract:Normalization is a critical operation in neural circuits. In the brain, there is evidence that normalization is implemented via inhibitory interneurons and allows neural populations to adjust to changes in the distribution of their inputs. In artificial neural networks (ANNs), normalization is used to improve learning in tasks that involve complex input distributions. However, it is unclear whether inhibition-mediated normalization in biological neural circuits also improves learning. Here, we explore this possibility using ANNs with separate excitatory and inhibitory populations trained on an image recognition task with variable luminosity. We find that inhibition-mediated normalization does not improve learning if normalization is applied only during inference. However, when this normalization is extended to include back-propagated errors, performance improves significantly. These results suggest that if inhibition-mediated normalization improves learning in the brain, it additionally requires the normalization of learning signals.

[AI-84] rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks

【速读】：该论文旨在解决神经网络在训练过程中对数据污染（data contamination）高度敏感的问题，特别是标签噪声（label noise）和对抗扰动（adversarial perturbations）两类典型污染形式。标准神经分类器通常通过最小化类别交叉熵损失进行训练，该方法在理想条件下具有统计效率，但在实际场景中易受异常观测影响。论文提出了一种统一且具有统计基础的鲁棒神经分类框架——rSDNet，其核心在于将神经网络训练建模为最小散度估计（minimum-divergence estimation）问题，并基于S-散度（S-divergences）构造学习目标。该方法通过模型概率自动降低异常样本的权重，从而实现对两类污染的联合鲁棒性保障；理论分析表明rSDNet具备Fisher一致性、分类校准性（classification calibration）以及在均匀标签噪声和微小特征污染下的鲁棒性保证，实验验证了其在图像分类任务中对标签污染与对抗攻击的增强鲁棒性，同时保持干净数据上的竞争性能。

链接: https://arxiv.org/abs/2603.17628
作者: Suryasis Jana,Abhik Ghosh
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注: Pre-print; under review

点击查看摘要

Abstract:Neural networks are central to modern artificial intelligence, yet their training remains highly sensitive to data contamination. Standard neural classifiers are trained by minimizing the categorical cross-entropy loss, corresponding to maximum likelihood estimation under a multinomial model. While statistically efficient under ideal conditions, this approach is highly vulnerable to contaminated observations including label noises corrupting supervision in the output space, and adversarial perturbations inducing worst-case deviations in the input space. In this paper, we propose a unified and statistically grounded framework for robust neural classification that addresses both forms of contamination within a single learning objective. We formulate neural network training as a minimum-divergence estimation problem and introduce rSDNet, a robust learning algorithm based on the general class of S -divergences. The resulting training objective inherits robustness properties from classical statistical estimation, automatically down-weighting aberrant observations through model probabilities. We establish essential population-level properties of rSDNet, including Fisher consistency, classification calibration implying Bayes optimality, and robustness guarantees under uniform label noise and infinitesimal feature contamination. Experiments on three benchmark image classification datasets show that rSDNet improves robustness to label corruption and adversarial attacks while maintaining competitive accuracy on clean data, Our results highlight minimum-divergence learning as a principled and effective framework for robust neural classification under heterogeneous data contamination.

[AI-85] Dependence Fidelity and Downstream Inference Stability in Generative Models

【速读】：该论文旨在解决当前生成式 AI (Generative AI) 模型评估中过度依赖边际分布匹配（marginal distribution matching）而忽视多变量依赖结构（multivariate dependence structure）的问题。现有评估方法虽能确保单变量分布的准确性，但无法保证生成数据在联合分布层面保留真实数据的关键依赖关系，从而可能导致下游任务（如回归分析、主成分分析等）出现不稳定甚至错误的结论。解决方案的关键在于提出“协方差层级依赖保真度”（covariance-level dependence fidelity）作为新的评估指标，通过量化生成分布与真实分布之间的协方差结构差异，确保模型在依赖敏感任务中表现出稳定性能；研究进一步证明，显式控制协方差级依赖偏差可有效避免回归系数符号反转等问题，为扩散模型和变分自编码器等生成模型提供了可信赖的评估框架。

链接: https://arxiv.org/abs/2603.17041
作者: Nazia Riasat
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注: 22 pages, 7 figures. Poster presentation at MathAI 2026 (International Conference on Mathematics of Artificial Intelligence), March 30 - April 3, 2026

点击查看摘要

Abstract:Recent advances in generative AI have led to increasingly realistic synthetic data, yet evaluation criteria remain focused on marginal distribution matching. While these diagnostics assess local realism, they provide limited insight into whether a generative model preserves the multivariate dependence structures governing downstream inference. We introduce covariance-level dependence fidelity as a practical criterion for evaluating whether a generative distribution preserves joint structure beyond univariate marginals. We establish three core results. First, distributions can match all univariate marginals exactly while exhibiting substantially different dependence structures, demonstrating marginal fidelity alone is insufficient. Second, dependence divergence induces quantitative instability in downstream inference, including sign reversals in regression coefficients despite identical marginal behavior. Third, explicit control of covariance-level dependence divergence ensures stable behavior for dependence-sensitive tasks such as principal component analysis. Synthetic constructions illustrate how dependence preservation failures lead to incorrect conclusions despite identical marginal distributions. These results highlight dependence fidelity as a useful diagnostic for evaluating generative models in dependence-sensitive downstream tasks, with implications for diffusion models and variational autoencoders. These guarantees apply specifically to procedures governed by covariance structure; tasks requiring higher-order dependence such as tail-event estimation require richer criteria.

[AI-86] Shared Representation Learning for Reference-Guided Targeted Sound Detection ICASSP2026

【速读】：该论文旨在解决目标声音检测（Targeted Sound Detection, TSD）问题，即在复杂声学场景中从混合音频中检测并定位一个特定目标声音，前提是提供该目标声音的参考音频。此前方法依赖于生成区分性条件嵌入向量（sound-discriminative conditional embedding vector）并与混合音频编码器配对，采用多任务学习进行联合优化。本文的关键创新在于提出一种统一的编码架构，将参考音频与混合音频共同映射到共享表示空间中，从而增强两者间的对齐关系，同时降低模型结构复杂度。该设计不仅简化了整体框架，还提升了对未见类别的泛化能力，并在URBAN-SED数据集上实现了83.15%的段级F1分数和95.17%的整体准确率，显著优于现有方法，建立了新的基准。

链接: https://arxiv.org/abs/2603.17025
作者: Shubham Gupta,Adarsh Arigala,B. R. Dilleswari,Sri Rama Murty Kodukula
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ICASSP 2026

点击查看摘要

Abstract:Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches, rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a mixture encoder, jointly optimized with a multi-task learning approach. In this work, we propose a unified encoder architecture that processes both the reference and mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our method achieves substantial improvements over prior approaches, surpassing existing methods and establishing a new state-of-the-art benchmark for targeted sound detection, with a segment-level F1 score of 83.15% and an overall accuracy of 95.17% on the URBAN-SED dataset.

[AI-87] Machine intelligence supports the full chain of 2D dendrite synthesis

【速读】：该论文针对二维 dendrites（树枝状结构）材料合成中存在参数密集、数据稀缺及反应过程复杂的问题，旨在实现从快速工艺优化到定制化合成再到机制解析的全流程支持。其核心解决方案在于构建一个由机器学习（Machine Learning, ML）赋能的智能框架：首先通过主动学习（active learning）在仅60次实验（4轮迭代）内确定高分支、电催化活性强的ReSe₂ dendrites的最佳生长配方；其次采用预测精度引导的数据增强策略与树基机器学习算法，仅需新增9次实验即可揭示5个工艺变量与枝晶分形维数（fractal dimension, DF）之间的非线性关系，从而实现用户自定义DF的精准合成；最后融合跨尺度表征、可解释机器学习模型与热力学/动力学领域知识，建立数据-知识双驱动的机制模型，阐明多工艺参数对产物形貌的协同作用。该方法显著减少了实验次数，提升了合成效率与机制理解深度，展示了机器学习在材料合成研究范式变革中的巨大潜力。

链接: https://arxiv.org/abs/2603.16959
作者: Wenqiang Huang,Susu Fang,Xuhang Gu,Shen’ao Xue,Huanhuan Xing,Junjie Jiang,Junying Zhang,Shen Zhou,Zheng Luo,Jin Zhang,Fangping Ouyang,Shanshan Wang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:Exemplified by the chemical vapor deposition growth of two-dimensional dendrites, which has potential applications in catalysis and presents a parameter-intensive, data-scarce and reaction process-complex model problem, we devise a machine intelligence-empowered framework for the full chain support of material synthesis, encompassing rapid process optimization, accurate customized synthesis, and comprehensive mechanism this http URL, active learning is integrated into the experimental workflow, identifying an optimal recipe for the growth of highly-branched, electrocatalytically-active ReSe2 dendrites through 60 experiments (4 iterations), which account for less than 1.3% of the numerous possible parameter this http URL, a prediction accuracy-guided data augmentation strategy is developed combined with a tree-based machine learning (ML) algorithm, unveiling a non-linear correlation between 5 process variables and fractal dimension (DF) of ReSe2 dendrites with only 9 experiment additions, which guides the synthesis of various user-defined DF. Finally, we construct a data-knowledge dual-driven mechanism model by integration of cross-scale characterizations, interpretable ML models, and domain knowledge in thermodynamics and kinetics, unraveling synergistic contributions of multiple process parameters to the product morphology. This work demonstrates the ML potential to transform the research paradigm and is adaptable to broader material synthesis.

[AI-88] Automatic Termination Strategy of Inelastic Neutron-scattering Measurement Using Bayesian Optimization for Bin-width Selection

【速读】：该论文旨在解决四维非弹性中子散射实验中因数据量过大而导致的束流时间利用效率低的问题，其核心挑战在于如何在保证数据质量的前提下自动确定最优实验终止时机。解决方案的关键在于提出一种基于贝叶斯优化（Bayesian optimization）的方法：首先通过贝叶斯优化高效计算多维直方图的最佳分箱宽度（bin width），并在实验过程中实时判断当前最优分箱宽度是否已低于设备目标分辨率；一旦满足此条件，则判定可终止实验，从而避免冗余测量。数值实验表明，该方法可在保持高精度的同时将搜索成本降低至穷举法的约10%，显著提升了实验效率。

链接: https://arxiv.org/abs/2603.16946
作者: Kensuke Muto,Hirotaka Sakamoto,Kenji Nagata,Taka-hisa Arima,Masato Okada
机构: 未知
类目: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures; under review at Journal of the Physical Society of Japan (JPSJ)

点击查看摘要

Abstract:Currently, an excessive amount of event data is being obtained in four-dimensional inelastic neutron-scattering experiments. A method for automatic bin-width optimization of multidimensional histograms has been developed and recently validated on real inelastic neutron-scattering data. However, measuring beyond the equipment resolution leads to inefficient use of valuable beam time. To improve experimental efficiency, an automatic termination strategy is essential. We propose a Bayesian-optimization-based method to compute stopping criteria and determine whether to continue or terminate the experiment in real time. In the proposed method, the bin-width optimization is performed using Bayesian optimization to efficiently compute the optimal bin widths. The experiment is terminated when the optimal bin widths become smaller than the target resolutions. In numerical experiments using real inelastic neutron-scattering data, the optimal bin widths decrease as the number of events increases. Even the optimal bin widths for data downsampled to 1/5 are comparable with the resolutions limited by the sample size, choppers, and so on. This implies excessive measurement of the inelastic neutron experiments for the moment. Moreover, we found that Bayesian optimization can reduce the search cost to approximately 10% of an exhaustive search in our numerical experiments.

[AI-89] Quantum-Assisted Optimal Rebalancing with Uncorrelated Asset Selection for Algorithmic Trading Walk-Forward QUBO Scheduling via QAOA

【速读】：该论文旨在解决投资组合动态再平衡中的高交易成本与性能优化之间的权衡问题，特别是在高频再平衡场景下如何减少交易频率而不显著牺牲风险调整后收益。其解决方案的关键在于将再平衡调度问题形式化为一个无约束二次二值优化（Quadratic Unconstrained Binary Optimisation, QUBO）问题，并利用量子近似优化算法（Quantum Approximate Optimisation Algorithm, QAOA）在行走向前（walk-forward）框架内求解，从而实现结构化的二值调度决策。这一方法通过混合经典-量子架构，在保持竞争力的夏普比率（0.588）的同时，将再平衡次数从24次降至8次，降低交易成本达44.5%，验证了近中期量子优化在金融组合管理中的可行性与有效性。

链接: https://arxiv.org/abs/2603.16904
作者: Abraham Itzhak Weinberg
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a hybrid classical-quantum framework for portfolio construction and rebalancing. Asset selection is performed using Ledoit-Wolf shrinkage covariance estimation combined with hierarchical correlation clustering to extract n = 10 decorrelated stocks from the SP 500 universe without survivorship bias. Portfolio weights are optimised via an entropy-regularised Genetic Algorithm (GA) accelerated on GPU, alongside closed-form minimum-variance and equal-weight benchmarks. Our primary contribution is the formulation of the portfolio rebalancing schedule as a Quadratic Unconstrained Binary Optimisation (QUBO) problem. The resulting combinatorial optimisation task is solved using the Quantum Approximate Optimisation Algorithm (QAOA) within a walk-forward framework designed to eliminate lookahead bias. This approach recasts dynamic rebalancing as a structured binary scheduling problem amenable to variational quantum methods. Backtests on SP 500 data (training: 2010-2024; out-of-sample test: 2025, n = 249 trading days) show that the GA + QAOA strategy attains a Sharpe ratio of 0.588 and total return of 10.1%, modestly outperforming the strongest classical baseline (GA with 10-day periodic rebalancing, Sharpe 0.575) while executing 8 rebalances versus 24, corresponding to a 44.5% reduction in transaction costs. Multi-restart QAOA (4096 measurement shots per run) exhibits concentrated probability mass on high-quality schedules, indicating stable convergence of the variational procedure. These findings suggest that hybrid classical-quantum architectures can reduce turnover in portfolio rebalancing while preserving competitive risk-adjusted performance, providing a structured testbed for near-term quantum optimisation in financial applications. Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI) Cite as: arXiv:2603.16904 [q-fin.PM] (or arXiv:2603.16904v1 [q-fin.PM] for this version) https://doi.org/10.48550/arXiv.2603.16904 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Abraham Itzhak Weinberg [view email] [v1] Wed, 4 Mar 2026 15:15:55 UTC (730 KB)

[AI-90] Unsupervised learning for inverse problems in computed tomography

【速读】：该论文旨在解决医学影像中CT图像重建的效率与质量难题，传统方法如滤波反投影（Filtered Back Projection, FBP）和最大似然（Maximum Likelihood, ML）重建在精度或计算时间上存在局限。解决方案的关键在于提出一种无监督深度学习框架，通过在神经网络中嵌入前向和反向投影层（forward and backward projection layers），将深度学习训练过程与传统迭代重建方法的本质特性相融合，从而实现仅使用投影数据即可完成高质量图像重建，无需依赖真实参考图像（ground-truth images）。该方法在2DeteCT数据集上验证了其在均方误差（MSE）和结构相似性指数（SSIM）上的优越性能，并显著缩短了重建时间，具备实时应用潜力。

链接: https://arxiv.org/abs/2508.05321
作者: Laura Hellwege,Johann Christopher Engster,Moritz Schaar,Thorsten M. Buzug,Maik Stille
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 Figures

点击查看摘要

Abstract:This study presents an unsupervised deep learning approach for computed tomography (CT) image reconstruction, leveraging the inherent similarities between deep neural network training and conventional iterative reconstruction methods. By incorporating forward and backward projection layers within the deep learning framework, we demonstrate the feasibility of reconstructing images from projection data without relying on ground-truth images. Our method is evaluated on the two-dimensional 2DeteCT dataset, showcasing superior performance in terms of mean squared error (MSE) and structural similarity index (SSIM) compared to traditional filtered backprojection (FBP) and maximum likelihood (ML) reconstruction techniques. Additionally, our approach significantly reduces reconstruction time, making it a promising alternative for real-time medical imaging applications. Future work will focus on extending this methodology to three-dimensional reconstructions and enhancing the adaptability of the projection geometry.

机器学习

[LG-0] Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

链接: https://arxiv.org/abs/2603.17970
作者: Ben S. Southworth,Stephen Thomas
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon’s polar update with a triangular (Cholesky-like) whitening surrogate inspired by classical Gram–Schmidt and Gauss-Seidel ideas. We show that row-orthonormal matrices are fixed points of the MUD map, relate the inner step to symmetric Gauss-Seidel preconditioning of the Gram matrix, and prove quadratic local convergence near the fixed point. In terms of time-to-perplexity, MUD yields consistent 10-50% wall-clock improvements over tuned AdamW and Muon in time-to-perplexity, typically converging slightly slower per step than Muon but with substantially lower optimizer overhead – relative to Muon, MUD improves peak tokens/s by roughly 1.3-2.6\times across most settings and up to nearly 3\times on GPT-2 large on an A100. We also demonstrate training a ESM-2 150M protein language model, where MUD matches Muon-level validation perplexity in significantly less wall-clock time.

[LG-1] Unified Policy Value Decomposition for Rapid Adaptation

链接: https://arxiv.org/abs/2603.17947
作者: Cristiano Capone,Luca Falorsi,Andrea Ciardiello,Luca Manneschi
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.

[LG-2] Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

链接: https://arxiv.org/abs/2603.17875
作者: Abhishek Gupta,Aditya Mahajan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Markov decision processes (MDPs) is viewed as an optimization of an objective function over certain linear operators over general function spaces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to generalization of many well-known results in reinforcement learning to cases with generate state and action spaces. Prior results of this type were only established in the finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.

[LG-3] RHYME-XT: A Neural Operator for Spatiotemporal Control Systems

链接: https://arxiv.org/abs/2603.17867
作者: Marijn Ruiter,Miguel Aguiar,Jake Rap,Karl H. Johansson,Amritam Das
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 6 pages, 5 figures. Submitted to IEEE Control Systems Letters (L-CSS) and CDC 2026

点击查看摘要

Abstract:We propose RHYME-XT, an operator-learning framework for surrogate modeling of spatiotemporal control systems governed by input-affine nonlinear partial integro-differential equations (PIDEs) with localized rhythmic behavior. RHYME-XT uses a Galerkin projection to approximate the infinite-dimensional PIDE on a learned finite-dimensional subspace with spatial basis functions parameterized by a neural network. This yields a projected system of ODEs driven by projected inputs. Instead of integrating this non-autonomous system, we directly learn its flow map using an architecture for learning flow functions, avoiding costly computations while obtaining a continuous-time and discretization-invariant representation. Experiments on a neural field PIDE show that RHYME-XT outperforms a state-of-the-art neural operator and is able to transfer knowledge effectively across models trained on different datasets, through a fine-tuning process.

[LG-4] Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation

链接: https://arxiv.org/abs/2603.17855
作者: William Thorossian
类目: Machine Learning (cs.LG)
*备注: 18 pages, 2 Tables, 1 Figure, 22 References

点击查看摘要

Abstract:Modern seismic and volcanic monitoring is increasingly shaped by continuous, multi-sensor observations and by the need to extract actionable information from nonstationary, noisy wavefields. In this context, machine learning has moved from a research curiosity to a practical ingredient of processing chains for detection, phase picking, classification, denoising, and anomaly tracking. However, improved accuracy on a fixed dataset is not sufficient for operational use. Models must remain reliable under domain shift (new stations, changing noise, evolving volcanic activity), provide uncertainty that supports decision-making, and connect their outputs to physically meaningful constraints. This paper surveys and organizes recent ML approaches for seismic and volcanic signal analysis, highlighting where classical signal processing provides indispensable inductive bias, how self-supervision and generative modeling can reduce dependence on labels, and which evaluation protocols best reflect transfer across regions. We conclude with open challenges for robust, interpretable, and maintainable AI-assisted monitoring.

[LG-5] Verification and Validation of Physics-Informed Surrogate Component Models for Dynamic Power-System Simulation

链接: https://arxiv.org/abs/2603.17836
作者: Petros Ellinas,Indrajit Chaudhuri,Johanna Vorwerk,Spyros Chatzivasileiadis
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed machine learning surrogates are increasingly explored to accelerate dynamic simulation of generators, converters, and other power grid components. The key question, however, is not only whether a surrogate matches a stand-alone component model on average, but whether it remains accurate after insertion into a differential-algebraic simulator, where the surrogate outputs enter the algebraic equations coupling the component to the rest of the system. This paper formulates that in-simulator use as a verification and validation (V\V) problem. A finite-horizon bound is derived that links allowable component-output error to algebraic-coupling sensitivity, dynamic error amplification, and the simulation horizon. Two complementary settings are then studied: model-based verification against a reference component solver, and data-based validation through conformal calibration of the component-output variables exchanged with the simulator. The framework is general, but the case study focuses on physics-informed neural-network surrogates of second-, fourth-, and sixth-order synchronous-machine models. Results show that good stand-alone surrogate accuracy does not by itself guarantee accurate in-simulator behavior, that the largest discrepancies concentrate in stressed operating regions, and that small equation residuals do not necessarily imply small state-trajectory errors.

[LG-6] Symmetry-Reduced Physics-Informed Learning of Tensegrity Dynamics

链接: https://arxiv.org/abs/2603.17824
作者: Jing Qin,Muhao Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tensegrity structures possess intrinsic geometric symmetries that govern their dynamic behavior. However, most existing physics-informed neural network (PINN) approaches for tensegrity dynamics do not explicitly exploit these symmetries, leading to high computational complexity and unstable optimization. In this work, we propose a symmetry-reduced physics-informed neural network (SymPINN) framework that embeds group-theory-based symmetry directly into both the solution expression and the neural network architecture to predict tensegrity dynamics. By decomposing nodes into symmetry orbits and representing free nodal coordinates using a symmetry basis, the proposed method constructs a reduced coordinate representation that preserves geometric symmetry of the structure. The full coordinates are then recovered via symmetry transformations of the reduced solution learned by the network, ensuring that the predicted configurations automatically satisfy the symmetry constraints. In this framework, equivariance is enforced through orbit-based coordinate generation, symmetry-consistent message passing, and physics residual constraints. In addition, SymPINN improves training effectiveness by encoding initial conditions as hard constraints, incorporating Fourier feature encoding to enhance the representation of dynamic motions, and employing a two-stage optimization strategy. Extensive numerical experiments on symmetric T-bars and lander structures demonstrate significantly improved prediction accuracy and computational efficiency compared to standard physics-informed models, indicating the great potential of symmetry-aware learning for structure-preserving modeling of tensegrity dynamics.

[LG-7] Federated Distributional Reinforcement Learning with Distributional Critic Regularization

链接: https://arxiv.org/abs/2603.17820
作者: David Millard,Cecilia Alm,Rashid Ali,Pengcheng Shi,Ali Baheri
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 Figures, conference

点击查看摘要

Abstract:Federated reinforcement learning typically aggregates value functions or policies by parameter averaging, which emphasizes expected return and can obscure statistical multimodality and tail behavior that matter in safety-critical settings. We formalize federated distributional reinforcement learning (FedDistRL), where clients parametrize quantile value function critics and federate these networks only. We also propose TR-FedDistRL, which builds a per client, risk-aware Wasserstein barycenter over a temporal buffer. This local barycenter provides a reference region to constrain the parameter averaged critic, ensuring necessary distributional information is not averaged out during the federation process. The distributional trust region is implemented as a shrink-squash step around this reference. Under fixed-policy evaluation, the feasibility map is nonexpansive and the update is contractive in a probe-set Wasserstein metric under evaluation. Experiments on a bandit, multi-agent gridworld, and continuous highway environment show reduced mean-smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift versus mean-oriented and non-federated baselines.

[LG-8] owards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems

链接: https://arxiv.org/abs/2603.17750
作者: Qi Liu,Laure Zanna,Joan Bruna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in autoregressive neural surrogate models have enabled orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models are generally prone to distribution drift: compounding errors in autoregressive rollouts that severely degrade generation quality over long time horizons. Existing work attempts to address this issue by implicitly leveraging the inherent trade-off between short-time accuracy and long-time consistency through hyperparameter tuning. In this work, we introduce a unifying mathematical framework that makes this tradeoff explicit, formalizing and generalizing hyperparameter-based strategies in existing approaches. Within this framework, we propose a robust, hyperparameter-free model implemented as a conditional diffusion model that balances short-time fidelity with long-time consistency by construction. Our model, Self-refining Neural Surrogate model (SNS), can be implemented as a standalone model that refines its own autoregressive outputs or as a complementary model to existing neural surrogates to ensure long-time consistency. We also demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons.

[LG-9] Embedding World Knowledge into Tabular Models: Towards Best Practices for Embedding Pipeline Design

链接: https://arxiv.org/abs/2603.17737
作者: Oksana Kolomenko,Ricardo Knauer,Erik Rodner
类目: Machine Learning (cs.LG)
*备注: Computational Intelligence 2025 Workshop

点击查看摘要

Abstract:Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that it strongly depends on the specific pipeline design whether incorporating the prior knowledge of LLMs improves the predictive performance. In general, concatenating embeddings tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient boosting decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.

[LG-10] Predicting Trajectories of Long COVID in Adult Women: The Critical Role of Causal Disentanglement

链接: https://arxiv.org/abs/2603.17722
作者: Jing Wang,Jie Shen,Yiming Luo,Amar Sra,Qiaomin Xie,Jeremy C. Weiss
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Early prediction of Post-Acute Sequelae of SARS-CoV-2 severity is a critical challenge for women’s health, particularly given the diagnostic overlap between PASC and common hormonal transitions such as menopause. Identifying and accounting for these confounding factors is essential for accurate long-term trajectory prediction. We conducted a retrospective study of 1,155 women (mean age 61) from the NIH RECOVER dataset. By integrating static clinical profiles with four weeks of longitudinal wearable data (monitoring cardiac activity and sleep), we developed a causal network based on a Large Language Model to predict future PASC scores. Our framework achieved a precision of 86.7% in clinical severity prediction. Our causal attribution analysis demonstrate the model’s ability to differentiate between active pathology and baseline noise: direct indicators such as breathlessness and malaise reached maximum saliency (1.00), while confounding factors like menopause and diabetes were successfully suppressed with saliency scores below 0.27.

[LG-11] Flow Matching Policy with Entropy Regularization

链接: https://arxiv.org/abs/2603.17685
作者: Ting Gao,Stavros Orfanoudakis,Nan Lin,Elvin Isufi,Winnie Daamen,Serge Hoogendoorn
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model’s generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoco benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and 10-15% relative to efficient variants.

[LG-12] ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

链接: https://arxiv.org/abs/2603.17623
作者: Zirui Gong,Leo Yu Zhang,Yanjun Zhang,Viet Vo,Tianqing Zhu,Shirui Pan,Cong Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 18 pages. To appear in the IEEE Symposium on Security and Privacy 2026

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training by sharing model updates instead of raw data, aiming to protect user privacy. However, recent studies reveal that these shared updates can inadvertently leak sensitive training data through gradient inversion attacks (GIAs). Among them, active GIAs are particularly powerful, enabling high-fidelity reconstruction of individual samples even under large batch sizes. Nevertheless, existing approaches often require architectural modifications, which limit their practical applicability. In this work, we bridge this gap by introducing the Activation REcovery via Sparse inversion (ARES) attack, an active GIA designed to reconstruct training samples from large training batches without requiring architectural modifications. Specifically, we formulate the recovery problem as a noisy sparse recovery task and solve it using the generalized Least Absolute Shrinkage and Selection Operator (Lasso). To extend the attack to multi-sample recovery, ARES incorporates the imprint method to disentangle activations, enabling scalable per-sample reconstruction. We further establish the expected recovery rate and derive an upper bound on the reconstruction error, providing theoretical guarantees for the ARES attack. Extensive experiments on CNNs and MLPs demonstrate that ARES achieves high-fidelity reconstruction across diverse datasets, significantly outperforming prior GIAs under large batch sizes and realistic FL settings. Our results highlight that intermediate activations pose a serious and underestimated privacy risk in FL, underscoring the urgent need for stronger defenses.

[LG-13] AdaMuS: Adaptive Multi-view Sparsity Learning for Dimensionally Unbalanced Data

链接: https://arxiv.org/abs/2603.17610
作者: Cai Xu,Changhao Sun,Ziyu Guan,Wei Zhao
类目: Machine Learning (cs.LG)
*备注: 15 pages. Submitted to IEEE Transactions on Image Processing

点击查看摘要

Abstract:Multi-view learning primarily aims to fuse multiple features to describe data comprehensively. Most prior studies implicitly assume that different views share similar dimensions. In practice, however, severe dimensional disparities often exist among different views, leading to the unbalanced multi-view learning issue. For example, in emotion recognition tasks, video frames often reach dimensions of 10^6 , while physiological signals comprise only 10^1 dimensions. Existing methods typically face two main challenges for this problem: (1) They often bias towards high-dimensional data, overlooking the low-dimensional views. (2) They struggle to effectively align representations under extreme dimensional imbalance, which introduces severe redundancy into the low-dimensional ones. To address these issues, we propose the Adaptive Multi-view Sparsity Learning (AdaMuS) framework. First, to prevent ignoring the information of low-dimensional views, we construct view-specific encoders to map them into a unified dimensional space. Given that mapping low-dimensional data to a high-dimensional space often causes severe overfitting, we design a parameter-free pruning method to adaptively remove redundant parameters in the encoders. Furthermore, we propose a sparse fusion paradigm that flexibly suppresses redundant dimensions and effectively aligns each view. Additionally, to learn representations with stronger generalization, we propose a self-supervised learning paradigm that obtains supervision information by constructing similarity graphs. Extensive evaluations on a synthetic toy dataset and seven real-world benchmarks demonstrate that AdaMuS consistently achieves superior performance and exhibits strong generalization across both classification and semantic segmentation tasks.

[LG-14] End-to-end data-driven prediction of urban airflow and pollutant dispersion

链接: https://arxiv.org/abs/2603.17606
作者: Nishant Kumar,Franck Kerhervé,Lionel Agostini,Laurent Cordier
类目: Machine Learning (cs.LG)
*备注: 22 pages, 22 figures

点击查看摘要

Abstract:Climate change and the rapid growth of urban populations are intensifying environmental stresses within cities, making the behavior of urban atmospheric flows a critical factor in public health, energy use, and overall livability. This study targets to develop fast and accurate models of urban pollutant dispersion to support decision-makers, enabling them to implement mitigation measures in a timely and cost-effective manner. To reach this goal, an end-to-end data-driven approach is proposed to model and predict the airflow and pollutant dispersion in a street canyon in skimming flow regime. A series of time-resolved snapshots obtained from large eddy simulation (LES) serves as the database. The proposed framework is based on four fundamental steps. Firstly, a reduced basis is obtained by spectral proper orthogonal decomposition (SPOD) of the database. The projection of the time series snapshot data onto the SPOD modes (time-domain approach) provides the temporal coefficients of the dynamics. Secondly, a nonlinear compression of the temporal coefficients is performed by autoencoder to reduce further the dimensionality of the problem. Thirdly, a reduced-order model (ROM) is learned in the latent space using Long Short-Term Memory (LSTM) netowrks. Finally, the pollutant dispersion is estimated from the predicted velocity field through convolutional neural network that maps both fields. The results demonstrate the efficacy of the model in predicting the instantaneous as well as statistically stationary fields over long time horizon.

[LG-15] One-Step Sampler for Boltzmann Distributions via Drifting

链接: https://arxiv.org/abs/2603.17579
作者: Wenhan Cao,Keyu Yan,Lin Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a drifting-based framework for amortized sampling of Boltzmann distributions defined by energy functions. The method trains a one-step neural generator by projecting samples along a Gaussian-smoothed score field from the current model distribution toward the target Boltzmann distribution. For targets specified only up to an unknown normalization constant, we derive a practical target-side drift from a smoothed energy and use two estimators: a local importance-sampling mean-shift estimator and a second-order curvature-corrected approximation. Combined with a mini-batch Gaussian mean-shift estimate of the sampler-side smoothed score, this yields a simple stop-gradient objective for stable one-step training. On a four-mode Gaussian-mixture Boltzmann target, our sampler achieves mean error 0.0754 , covariance error 0.0425 , and RBF MMD 0.0020 . Additional double-well and banana targets show that the same formulation also handles nonconvex and curved low-energy geometries. Overall, the results support drifting as an effective way to amortize iterative sampling from Boltzmann distributions into a single forward pass at test time.

[LG-16] HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

链接: https://arxiv.org/abs/2603.17573
作者: Zihao Zheng,Zhihao Mao,Sicheng Tian,Maoliang Li,Jiayu Chen,Xinhao Sun,Zhaobo Zhang,Xuanzhe Liu,Donggang Cao,Hong Mei,Xiang Chen
类目: Robotics (cs.RO); Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) Models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, leading to their sole application or optimization. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. We propose a retrieval-based SD optimization method in HeiSD,which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we proposed a kinematic-based fused metric in HeiSD to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x~2.41x in real-world scenarios, while sustaining a high task success rate.

[LG-17] Conditional Inverse Learning of Time-Varying Reproduction Numbers Inference

链接: https://arxiv.org/abs/2603.17549
作者: Lanlan Yu,Quan-Hui Liu,Haoyue Zheng,Xinfu Yang
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: 10 pages, 5 figures. Related to epidemic modeling, neural networks and time-varying reproduction number

点击查看摘要

Abstract:Estimating time-varying reproduction numbers from epidemic incidence data is a central task in infectious disease surveillance, yet it poses an inherently ill-posed inverse problem. Existing approaches often rely on strong structural assumptions derived from epidemiological models, which can limit their ability to adapt to non-stationary transmission dynamics induced by interventions or behavioral changes, leading to delayed detection of regime shifts and degraded estimation accuracy. In this work, we propose a Conditional Inverse Reproduction Learning framework (CIRL) that addresses the inverse problem by learning a conditional mapping from historical incidence patterns and explicit time information to latent reproduction numbers. Rather than imposing strongly enforced parametric constraints, CIRL softly integrates epidemiological structure with flexible likelihood-based statistical modeling, using the renewal equation as a forward operator to enforce dynamical consistency. The resulting framework combines epidemiologically grounded constraints with data-driven temporal representations, producing reproduction number estimates that are robust to observation noise while remaining responsive to abrupt transmission changes and zero-inflated incidence observations. Experiments on synthetic epidemics with controlled regime changes and real-world SARS and COVID-19 data demonstrate the effectiveness of the proposed approach.

[LG-18] CA-Based Interpretable Knowledge Representation and Analysis of Geometric Design Parameters

链接: https://arxiv.org/abs/2603.17535
作者: Alexander Köhler,Michael Breuß
类目: Machine Learning (cs.LG)
*备注: 20 pages, 6 figures, 1 table, preprint to IntelliSys-Artificial Intelligence Conference 2026

点击查看摘要

Abstract:In many CAD-based applications, complex geometries are defined by a high number of design parameters. This leads to high-dimensional design spaces that are challenging for downstream engineering processes like simulations, optimization, and design exploration tasks. Therefore, dimension reduction methods such as principal component analysis (PCA) are used. The PCA identifies dominant modes of geometric variation and yields a compact representation of the geometry. While classical PCA excels in the compact representation part, it does not directly recover underlying design parameters of a generated geometry. In this work, we deal with the problem of estimating design parameters from PCA-based representations. Analyzing a recent modification of the PCA dedicated to our field of application, we show that the results are actually identical to the standard PCA. We investigate limitations of this approach and present reasonable conditions under which accurate, interpretable parameter estimation can be obtained. With the help of dedicated experiments, we take a more in-depth look at every stage of the PCA and the possible changes of the geometry during these processes.

[LG-19] Anisotropic Permeability Tensor Prediction from Porous Media Microstructure via Physics-Informed Progressive Transfer Learning with Hybrid CNN-Transformer

链接: https://arxiv.org/abs/2603.17532
作者: Mohammad Nooraiepour
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Accurate prediction of permeability tensors from pore-scale microstructure images is essential for subsurface flow modeling, yet direct numerical simulation requires hours per sample, fundamentally limiting large-scale uncertainty quantification and reservoir optimization workflows. A physics-informed deep learning framework is presented that resolves this bottleneck by combining a MaxViT hybrid CNN-Transformer architecture with progressive transfer learning and differentiable physical constraints. MaxViT’s multi-axis attention mechanism simultaneously resolves grain-scale pore-throat geometry via block-local operations and REV-scale connectivity statistics through grid-global operations, providing the spatial hierarchy that permeability tensor prediction physically requires. Training on 20000 synthetic porous media samples spanning three orders of magnitude in permeability, a three-phase progressive curriculum advances from an ImageNet-pretrained baseline with D4-equivariant augmentation and tensor transformation, through component-weighted loss prioritizing off-diagonal coupling, to frozen-backbone transfer learning with porosity conditioning via Feature-wise Linear Modulation (FiLM). Onsager reciprocity and positive definiteness are enforced via differentiable penalty terms. On a held-out test set of 4000 samples, the framework achieves variance-weighted R2 = 0.9960 (R2_Kxx = 0.9967, R2_Kxy = 0.9758), a 33% reduction in unexplained variance over the supervised baseline. The results offer three transferable principles for physics-informed scientific machine learning: large-scale visual pretraining transfers effectively across domain boundaries; physical constraints are most robustly integrated as differentiable architectural components; and progressive training guided by diagnostic failure-mode analysis enables unambiguous attribution of performance gains across methodological stages.

[LG-20] ranslation Invariance of Neural Operators for the FitzHugh-Nagumo Model

链接: https://arxiv.org/abs/2603.17523
作者: Luca Pellegrini
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Neural Operators (NOs) are a powerful deep learning framework designed to learn the solution operator that arise from partial differential equations. This study investigates NOs ability to capture the stiff spatio-temporal dynamics of the FitzHugh-Nagumo model, which describes excitable cells. A key contribution of this work is evaluating the translation invariance using a novel training strategy. NOs are trained using an applied current with varying spatial locations and intensities at a fixed time, and the test set introduces a more challenging out-of-distribution scenario in which the applied current is translated in both time and space. This approach significantly reduces the computational cost of dataset generation. Moreover we benchmark seven NOs architectures: Convolutional Neural Operators (CNOs), Deep Operator Networks (DONs), DONs with CNN encoder (DONs-CNN), Proper Orthogonal Decomposition DONs (POD-DONs), Fourier Neural Operators (FNOs), Tucker Tensorized FNOs (TFNOs), Localized Neural Operators (LocalNOs). We evaluated these models based on training and test accuracy, efficiency, and inference speed. Our results reveal that CNOs performs well on translated test dynamics. However, they require higher training costs, though their performance on the training set is similar to that of the other considered architectures. In contrast, FNOs achieve the lowest training error, but have the highest inference time. Regarding the translated dynamics, FNOs and their variants provide less accurate predictions. Finally, DONs and their variants demonstrate high efficiency in both training and inference, however they do not generalize well to the test set. These findings highlight the current capabilities and limitations of NOs in capturing complex ionic model dynamics and provide a comprehensive benchmark including their application to scenarios involving translated dynamics.

[LG-21] Efficient Soft Actor-Critic with LLM -Based Action-Level Guidance for Continuous Control

链接: https://arxiv.org/abs/2603.17468
作者: Hao Ma,Zhiqiang Pu,Xiaolin Ai,Huimu Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present GuidedSAC, a novel reinforcement learning (RL) algorithm that facilitates efficient exploration in vast state-action spaces. GuidedSAC leverages large language models (LLMs) as intelligent supervisors that provide action-level guidance for the Soft Actor-Critic (SAC) algorithm. The LLM-based supervisor analyzes the most recent trajectory using state information and visual replays, offering action-level interventions that enable targeted exploration. Furthermore, we provide a theoretical analysis of GuidedSAC, proving that it preserves the convergence guarantees of SAC while improving convergence speed. Through experiments in both discrete and continuous control environments, including toy text tasks and complex MuJoCo benchmarks, we demonstrate that GuidedSAC consistently outperforms standard SAC and state-of-the-art exploration-enhanced variants (e.g., RND, ICM, and E3B) in terms of sample efficiency and final performance.

[LG-22] ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression ASPLOS’26

链接: https://arxiv.org/abs/2603.17435
作者: Ruibo Fan,Xiangrui Yu,Xinglin Pan,Zeyu Li,Weile Luo,Qiang Wang,Wei Wang,Xiaowen Chu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
*备注: ASPLOS’26 Accepted Paper

点击查看摘要

Abstract:Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This “load-compressed, compute-decompressed” design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA’s cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.

[LG-23] Causal Representation Learning on High-Dimensional Data: Benchmarks Reproducibility and Evaluation Metrics

链接: https://arxiv.org/abs/2603.17405
作者: Alireza Sadeghi,Wael AbdAlmageed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal representation learning (CRL) models aim to transform high-dimensional data into a latent space, enabling interventions to generate counterfactual samples or modify existing data based on the causal relationships among latent variables. To facilitate the development and evaluation of these models, a variety of synthetic and real-world datasets have been proposed, each with distinct advantages and limitations. For practical applications, CRL models must perform robustly across multiple evaluation directions, including reconstruction, disentanglement, causal discovery, and counterfactual reasoning, using appropriate metrics for each direction. However, this multi-directional evaluation can complicate model comparison, as a model may excel in some direction while under-performing in others. Another significant challenge in this field is reproducibility: the source code corresponding to published results must be publicly available, and repeated runs should yield performance consistent with the original reports. In this study, we critically analyzed the synthetic and real-world datasets currently employed in the literature, highlighting their limitations and proposing a set of essential characteristics for suitable datasets in CRL model development. We also introduce a single aggregate metric that consolidates performance across all evaluation directions, providing a comprehensive score for each model. Finally, we reviewed existing implementations from the literature and assessed them in terms of reproducibility, identifying gaps and best practices in the field.

[LG-24] Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching

链接: https://arxiv.org/abs/2603.17403
作者: Yaozhong Shi,Grigorios Lavrentiadis,Konstantinos Tsalouchidis,Zachary E. Ross,David McCallen,Caifeng Zou,Kamyar Azizzadenesheli,Domniki Asimaki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Earthquake hazard analysis and design of spatially distributed infrastructure, such as power grids and energy pipeline networks, require scenario-specific ground-motion time histories with realistic frequency content and spatiotemporal coherence. However, producing the large ensembles needed for uncertainty quantification with physics-based simulations is computationally intensive and impractical for engineering workflows. To address this challenge, we introduce Ground-Motion Flow (GMFlow), a physics-inspired latent operator flow matching framework that generates realistic, large-scale regional ground-motion time-histories conditioned on physical parameters. Validated on simulated earthquake scenarios in the San Francisco Bay Area, GMFlow generates spatially coherent ground motion across more than 9 million grid points in seconds, achieving a 10,000-fold speedup over the simulation workflow, which opens a path toward rapid and uncertainty-aware hazard assessment for distributed infrastructure. More broadly, GMFlow advances mesh-agnostic functional generative modeling and could potentially be extended to the synthesis of large-scale spatiotemporal physical fields in diverse scientific domains.

[LG-25] Bootstrapping Coding Agents : The Specification Is the Program

链接: https://arxiv.org/abs/2603.17399
作者: Martin Monperrus
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: To appear in IEEE Software

点击查看摘要

Abstract:A coding agent can bootstrap itself. Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch. This reproduces, in the domain of AI coding agents, the classical bootstrap sequence known from compiler construction, and instantiates the meta-circular property known from Lisp. The result carries a practical implication: the specification, not the implementation, is the stable artifact of record. Improving an agent means improving its specification; the implementation is, in principle, regenerable at any time.

[LG-26] he Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions

链接: https://arxiv.org/abs/2603.17385
作者: Rui Wu,Hong Xie,Yongjun Li
类目: Machine Learning (cs.LG)
*备注: 33 pages, 6 figures. Submitted to the Journal of Machine Learning Research (JMLR)

点击查看摘要

Abstract:Judea Pearl’s do-calculus provides a foundation for causal inference, but its translation to continuous generative models remains fraught with geometric challenges. We establish the fundamental limits of such interventions. We define the Counterfactual Event Horizon and prove the Manifold Tearing Theorem: deterministic flows inevitably develop finite-time singularities under extreme interventions. We establish the Causal Uncertainty Principle for the trade-off between intervention extremity and identity preservation. Finally, we introduce Geometry-Aware Causal Flow (GACF), a scalable algorithm that utilizes a topological radar to bypass manifold tearing, validated on high-dimensional scRNA-seq data.

[LG-27] Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models

链接: https://arxiv.org/abs/2603.17384
作者: Rui Wu,Hong Xie,Yongjun Li
类目: Machine Learning (cs.LG)
*备注: 34 pages, 5 figures. Submitted to JMLR

点击查看摘要

Abstract:Current continuous generative models (e.g., Diffusion Models, Flow Matching) implicitly assume that locally consistent causal mechanisms naturally yield globally coherent counterfactuals. In this paper, we prove that this assumption fails fundamentally when the causal graph exhibits non-trivial homology (e.g., structural conflicts or hidden confounders). We formalize structural causal models as cellular sheaves over Wasserstein spaces, providing a strict algebraic topological definition of cohomological obstructions in measure spaces. To ensure computational tractability and avoid deterministic singularities (which we define as manifold tearing), we introduce entropic regularization and derive the Entropic Wasserstein Causal Sheaf Laplacian, a novel system of coupled non-linear Fokker-Planck equations. Crucially, we prove an entropic pullback lemma for the first variation of pushforward measures. By integrating this with the Implicit Function Theorem (IFT) on Sinkhorn optimality conditions, we establish a direct algorithmic bridge to automatic differentiation (VJP), achieving O(1)-memory reverse-mode gradients strictly independent of the iteration horizon. Empirically, our framework successfully leverages thermodynamic noise to navigate topological barriers (“entropic tunneling”) in high-dimensional scRNA-seq counterfactuals. Finally, we invert this theoretical framework to introduce the Topological Causal Score, demonstrating that our Sheaf Laplacian acts as a highly sensitive algebraic detector for topology-aware causal discovery.

[LG-28] Variational Kernel Design for Internal Noise: Gaussian Chaos Noise Representation Compatibility and Reliable Deep Learning

链接: https://arxiv.org/abs/2603.17365
作者: Ziran Liu
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 37 pages

点击查看摘要

Abstract:Internal noise in deep networks is usually inherited from heuristics such as dropout, hard masking, or additive perturbation. We ask two questions: what correlation geometry should internal noise have, and is the implemented perturbation compatible with the representations it acts on? We answer these questions through Variational Kernel Design (VKD), a framework in which a noise mechanism is specified by a law family, a correlation kernel, and an injection operator, and is derived from learning desiderata. In a solved spatial subfamily, a quadratic maximum-entropy principle over latent log-fields yields a Gaussian optimizer with precision given by the Dirichlet Laplacian, so the induced geometry is the Dirichlet Green kernel. Wick normalization then gives a canonical positive mean-one gate, Gaussian Chaos Noise (GCh). For the sample-wise gate used in practice, we prove exact Gaussian control of pairwise log-ratio deformation, margin-sensitive ranking stability, and an exact expected intrinsic roughness budget; hard binary masks instead induce singular or coherence-amplified distortions on positive coherent representations. On ImageNet and ImageNet-C, GCh consistently improves calibration and under shift also improves NLL at competitive accuracy.

[LG-29] WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation

链接: https://arxiv.org/abs/2603.17301
作者: Zahin Sufiyan,Shadan Golestan,Yoshihiro Mitsuka,Shotaro Miwa,Osmar Zaiane
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative Flow Networks for continuous scenarios (CFlowNets) have shown promise in solving sequential decision-making tasks by learning stochastic policies using a flow and a retrieval network. Despite their demonstrated efficiency compared to state-of-the-art Reinforcement Learning (RL) algorithms, their practical application in robotic control tasks is constrained by the reliance on pre-training the retrieval network. This dependency poses challenges in dynamic robotic environments, where pre-training data may not be readily available or representative of the current environment. This paper introduces WINFlowNets, a novel CFlowNets framework that enables the co-training of flow and retrieval networks. WINFlowNets begins with a warm-up phase for the retrieval network to bootstrap its policy, followed by a shared training architecture and a shared replay buffer for co-training both networks. Experiments in simulated robotic environments demonstrate that WINFlowNets surpasses CFlowNets and state-of-the-art RL algorithms in terms of average reward and training stability. Furthermore, WINFlowNets exhibits strong adaptive capability in fault environments, making it suitable for tasks that demand quick adaptation with limited sample data. These findings highlight WINFlowNets’ potential for deployment in dynamic and malfunction-prone robotic systems, where traditional pre-training or sample inefficient data collection may be impractical.

[LG-30] Classifier Pooling for Modern Ordinal Classification

链接: https://arxiv.org/abs/2603.17278
作者: Noam H. Rotenberg,Andreia V. Faria,Brian Caffo
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Ordinal data is widely prevalent in clinical and other domains, yet there is a lack of both modern, machine-learning based methods and publicly available software to address it. In this paper, we present a model-agnostic method of ordinal classification, which can apply any non-ordinal classification method in an ordinal fashion. We also provide an open-source implementation of these algorithms, in the form of a Python package. We apply these models on multiple real-world datasets to show their performance across domains. We show that they often outperform non-ordinal classification methods, especially when the number of datapoints is relatively small or when there are many classes of outcomes. This work, including the developed software, facilitates the use of modern, more powerful machine learning algorithms to handle ordinal data.

[LG-31] Variational Rectification Inference for Learning with Noisy Labels

链接: https://arxiv.org/abs/2603.17255
作者: Haoliang Sun,Qi Wei,Lei Feng,Yupeng Hu,Fan Liu,Hehe Fan,Yilong Yin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise for deep models, effective strategies (\textite.g., re-weighting, or loss rectification) have been broadly applied in prevailing approaches, which have been generally learned under the meta-learning scenario. Despite the robustness of noise achieved by the probabilistic meta-learning models, they usually suffer from model collapse that degenerates generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive rectification for loss functions as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes by treating the rectifying vector as a latent variable, which can rectify the loss of the noisy sample with the extra randomness regularization and is, therefore, more robust to label noise. To achieve the inference of the rectifying vector, we approximate its conditional posterior with an amortization meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which can significantly improve the generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectification vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within the bi-level optimization programming. Besides, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.

[LG-32] Binary Latent Protein Fitness Landscapes for Quantum Annealing Optimization

链接: https://arxiv.org/abs/2603.17247
作者: Truong-Son Hy
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:We propose Q-BIOLAT, a framework for modeling and optimizing protein fitness landscapes in binary latent spaces. Starting from protein sequences, we leverage pretrained protein language models to obtain continuous embeddings, which are then transformed into compact binary latent representations. In this space, protein fitness is approximated using a quadratic unconstrained binary optimization (QUBO) model, enabling efficient combinatorial search via classical heuristics such as simulated annealing and genetic algorithms. On the ProteinGym benchmark, we demonstrate that Q-BIOLAT captures meaningful structure in protein fitness landscapes and enables the identification of high-fitness variants. Despite using a simple binarization scheme, our method consistently retrieves sequences whose nearest neighbors lie within the top fraction of the training fitness distribution, particularly under the strongest configurations. We further show that different optimization strategies exhibit distinct behaviors, with evolutionary search performing better in higher-dimensional latent spaces and local search remaining competitive in preserving realistic sequences. Beyond its empirical performance, Q-BIOLAT provides a natural bridge between protein representation learning and combinatorial optimization. By formulating protein fitness as a QUBO problem, our framework is directly compatible with emerging quantum annealing hardware, opening new directions for quantum-assisted protein engineering. Our implementation is publicly available at: this https URL Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM) Cite as: arXiv:2603.17247 [cs.LG] (or arXiv:2603.17247v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2603.17247 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Truong-Son Hy [view email] [v1] Wed, 18 Mar 2026 01:06:20 UTC (457 KB)

[LG-33] On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings

链接: https://arxiv.org/abs/2603.17246
作者: David Restrepo,Miguel L Martins,Chenwei Wu,Luis Filipe Nakayama,Diego M Lopez,Stergios Christodoulidis,Maria Vakalopoulou,Enzo Ferrante
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) exhibit a characteristic “cone effect” in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning -particularly in medical domains- remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter \lambda. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in a supervised multimodal settings. Results consistently show that reducing excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.

[LG-34] Self-Conditioned Denoising for Atomistic Representation Learning

链接: https://arxiv.org/abs/2603.17196
作者: Tynan Perez,Rafael Gomez-Bombarelli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The success of large-scale pretraining in NLP and computer vision has catalyzed growing efforts to develop analogous foundation models for the physical sciences. However, pretraining strategies using atomistic data remain underexplored. To date, large-scale supervised pretraining on DFT force-energy labels has provided the strongest performance gains to downstream property prediction, out-performing existing methods of self-supervised learning (SSL) which remain limited to ground-state geometries, and/or single domains of atomistic data. We address these shortcomings with Self-Conditioned Denoising (SCD), a backbone-agnostic reconstruction objective that utilizes self-embeddings for conditional denoising across any domain of atomistic data, including small molecules, proteins, periodic materials, and ‘non-equilibrium’ geometries. When controlled for backbone architecture and pretraining dataset, SCD significantly outperforms previous SSL methods on downstream benchmarks and matches or exceeds the performance of supervised force-energy pretraining. We show that a small, fast GNN pretrained by SCD can achieve competitive or superior performance to larger models pretrained on significantly larger labeled or unlabeled datasets, across tasks in multiple domains. Our code is available at: this https URL

[LG-35] MetaClaw: Just Talk – An Agent That Meta-Learns and Evolves in the Wild

链接: https://arxiv.org/abs/2603.17187
作者: Peng Xia,Jianwen Chen,Xinyu Yang,Haoqin Tu,Jiaqi Liu,Kaiwen Xiong,Siwei Han,Shi Qiu,Haonian Ji,Yuyin Zhou,Zeyu Zheng,Cihang Xie,Huaxiu Yao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at this https URL.

[LG-36] Domain-informed explainable boosting machines for trustworthy lateral spread predictions

链接: https://arxiv.org/abs/2603.17175
作者: Cheng-Hsi Hsiao,Krishna Kumar,Ellen M. Rathje
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 33 pages, 16 figures

点击查看摘要

Abstract:Explainable Boosting Machines (EBMs) provide transparent predictions through additive shape functions, enabling direct inspection of feature contributions. However, EBMs can learn non-physical relationships that reduce their reliability in natural hazard applications. This study presents a domain-informed framework to improve the physical consistency of EBMs for lateral spreading prediction. Our approach modifies learned shape functions based on domain knowledge. These modifications correct non-physical behavior while maintaining data-driven patterns. We apply the method to the 2011 Christchurch earthquake dataset and correct non-physical trends observed in the original EBM. The resulting model produces more physically consistent global and local explanations, with an acceptable tradeoff in accuracy (4–5%).

[LG-37] Noise-Response Calibration: A Causal Intervention Protocol for LLM -Judges ICLR2026

链接: https://arxiv.org/abs/2603.17172
作者: Maxim Khomiakov,Jes Frellsen
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at CAO Workshop at ICLR 2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: if noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find clear modality-dependent behavior. Our results reveal a modality gap: while text-based judges degrade predictably, the majority of tabular datasets show a lack of statistically significant performance deterioration even under significant signal-to-noise reduction. Interestingly we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.

[LG-38] Shielded Reinforcement Learning Under Dynamic Temporal Logic Constraints

链接: https://arxiv.org/abs/2603.17152
作者: Sadık Bera Yüksel,Ali Tevfik Buyukkocak,Derya Aksaray
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, 2026 IEEE American Control Conference (ACC)

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown promise in various robotics applications, yet its deployment on real systems is still limited due to safety and operational constraints. The safe RL field has gained considerable attention in recent years, which focuses on imposing safety constraints throughout the learning process. However, real systems often require more complex constraints than just safety, such as periodic recharging or time-bounded visits to specific regions. Imposing such spatio-temporal tasks during learning still remains a challenge. Signal Temporal Logic (STL) is a formal language for specifying temporal properties of real-valued signals and provides a way to express such complex tasks. In this paper, we propose a framework that leverages sequential control barrier functions and model-free RL to ensure that the given STL tasks are satisfied throughout the learning process. Our method extends beyond traditional safety constraints by enforcing rich STL specifications, which can involve visits to dynamic targets with unknown trajectories. We also demonstrate the effectiveness of our framework through various simulations.

[LG-39] Personalized Fall Detection by Balancing Data with Selective Feedback Using Contrastive Learning

链接: https://arxiv.org/abs/2603.17148
作者: Awatif Yasmin,Tarek Mahmud,Sana Alamgeer,Anne H. H. Ngu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized fall detection models can significantly improve accuracy by adapting to individual motion patterns, yet their effectiveness is often limited by the scarcity of real-world fall data and the dominance of non-fall feedback samples. This imbalance biases the model toward routine activities and weakens its sensitivity to true fall events. To address this challenge, we propose a personalization framework that combines semi-supervised clustering with contrastive learning to identify and balance the most informative user feedback samples. The framework is evaluated under three retraining strategies, including Training from Scratch (TFS), Transfer Learning (TL), and Few-Shot Learning (FSL), to assess adaptability across learning paradigms. Real-time experiments with ten participants show that the TFS approach achieves the highest performance, with up to a 25% improvement over the baseline, while FSL achieves the second-highest performance with a 7% improvement, demonstrating the effectiveness of selective personalization for real-world deployment.

[LG-40] Contextual Preference Distribution Learning

链接: https://arxiv.org/abs/2603.17139
作者: Benjamin Hudson,Laurent Charlin,Emma Frejinger
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: In CPAIOR 2026 (23rd International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research)

点击查看摘要

Abstract:Decision-making problems often feature uncertainty stemming from heterogeneous and context-dependent human preferences. To address this, we propose a sequential learning-and-optimization pipeline to learn preference distributions and leverage them to solve downstream problems, for example risk-averse formulations. We focus on human choice settings that can be formulated as (integer) linear programs. In such settings, existing inverse optimization and choice modelling methods infer preferences from observed choices but typically produce point estimates or fail to capture contextual shifts, making them unsuitable for risk-averse decision-making. Using a bounded-variance score function gradient estimator, we train a predictive model mapping contextual features to a rich class of parameterizable distributions. This approach yields a maximum likelihood estimate. The model generates scenarios for unseen contexts in the subsequent optimization phase. In a synthetic ridesharing environment, our approach reduces average post-decision surprise by up to 114 \times compared to a risk-neutral approach with perfect predictions and up to 25 \times compared to leading risk-averse baselines.

[LG-41] opology-Preserving Deep Joint Source-Channel Coding for Semantic Communication

链接: https://arxiv.org/abs/2603.17126
作者: Omar Erak,Omar Alhussein,Fang Fang,Sami Muhaidat
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Image and Video Processing (eess.IV)
*备注: Submitted to IEEE Journals for possible publication

点击查看摘要

Abstract:Many wireless vision applications, such as autonomous driving, require preservation of global structural information rather than only per-pixel fidelity. However, existing Deep joint source-channel coding (DeepJSCC) schemes mainly optimize pixel-wise losses and provide no explicit protection of connectivity or topology. This letter proposes TopoJSCC, a topology-aware DeepJSCC framework that integrates persistent-homology regularizers to end-to-end training. Specifically, we enforce topological consistency by penalizing Wasserstein distances between cubical persistence diagrams of original and reconstructed images, and between Vietoris–Rips persistence of latent features before and after the channel to promote a robust latent manifold. TopoJSCC is based on end-to-end learning and requires no side information. Experiments show improved topology preservation and peak signal-to-noise ratio (PSNR) in low signal-to-noise ratio (SNR) and bandwidth-ratio regimes.

[LG-42] SENSE: Efficient EEG-to-Text via Privacy-Preserving Semantic Retrieval

链接: https://arxiv.org/abs/2603.17109
作者: Akshaj Murhekar,Christina Liu,Abhijit Mishra,Shounak Roychowdhury,Jacek Gwizdka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoding brain activity into natural language is a major challenge in AI with important applications in assistive communication, neurotechnology, and human-computer interaction. Most existing Brain-Computer Interface (BCI) approaches rely on memory-intensive fine-tuning of Large Language Models (LLMs) or encoder-decoder models on raw EEG signals, resulting in expensive training pipelines, limited accessibility, and potential exposure of sensitive neural data. We introduce SENSE (SEmantic Neural Sparse Extraction), a lightweight and privacy-preserving framework that translates non-invasive electroencephalography (EEG) into text without LLM fine-tuning. SENSE decouples decoding into two stages: on-device semantic retrieval and prompt-based language generation. EEG signals are locally mapped to a discrete textual space to extract a non-sensitive Bag-of-Words (BoW), which conditions an off-the-shelf LLM to synthesize fluent text in a zero-shot manner. The EEG-to-keyword module contains only ~6M parameters and runs fully on-device, ensuring raw neural signals remain local while only abstract semantic cues interact with language models. Evaluated on a 128-channel EEG dataset across six subjects, SENSE matches or surpasses the generative quality of fully fine-tuned baselines such as Thought2Text while substantially reducing computational overhead. By localizing neural decoding and sharing only derived textual cues, SENSE provides a scalable and privacy-aware retrieval-augmented architecture for next-generation BCIs.

[LG-43] An End-to-End Framework for Functionality-Embedded Provenance Graph Construction and Threat Interpretation

链接: https://arxiv.org/abs/2603.17100
作者: Kushankur Ghosh,Mehar Klair,Kian Kyars,Euijin Choo,Jörg Sander
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures

点击查看摘要

Abstract:Provenance graphs model causal system-level interactions from logs, enabling anomaly detectors to learn normal behavior and detect deviations as attacks. However, existing approaches rely on brittle, manually engineered rules to build provenance graphs, lack functional context for system entities, and provide limited support for analyst investigation. We present Auto-Prov, an adaptive, end-to-end framework that leverages large language models (LLMs) to automatically construct provenance graphs from heterogeneous and evolving logs, embed system-level functional attributes into the graph, enable provenance graph-based anomaly detectors to learn from these enriched graphs, and summarize the detected attacks to assist an analyst’s investigation. Auto-Prov clusters unseen log types and efficiently extracts provenance edges and entity-level information via automatically generated rules. It further infers system-level functional context for both known and previously unseen system entities using a combination of LLM inference and behavior-based estimation. Attacks detected by provenance-graph-based anomaly detectors trained on Auto-Prov’s graphs are then summarized into natural-language text. We evaluate Auto-Prov with four state-of-the-art provenance graph-based detectors across diverse logs. Results show that Auto-Prov consistently enhances detection performance, generalizes across heterogeneous log formats, and produces stable, interpretable attack summaries that remain robust under system evolution.

[LG-44] PRISM: Demystifying Retention and Interaction in Mid-Training

链接: https://arxiv.org/abs/2603.17074
作者: Bharat Runwal,Ashish Agrawal,Anurag Roy,Rameswar Panda
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training’s representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.

[LG-45] ransformers Can Learn Rules Theyve Never Seen: Proof of Computation Beyond Interpolation

链接: https://arxiv.org/abs/2603.17019
作者: Andy Gray
类目: Machine Learning (cs.LG)
*备注: 26 pages, 6 figures

点击查看摘要

Abstract:A central question in the LLM debate is whether transformers can infer rules absent from training, or whether apparent generalisation reduces to similarity-based interpolation over observed examples. We test a strong interpolation-only hypothesis in two controlled settings: one where interpolation is ruled out by construction and proof, and one where success requires emitting intermediate symbolic derivations rather than only final answers. In Experiment 1, we use a cellular automaton with a pure XOR transition rule and remove specific local input patterns from training; since XOR is linearly inseparable, each held-out pattern’s nearest neighbours have the opposite label, so similarity-based predictors fail on the held-out region. Yet a two-layer transformer recovers the rule (best 100%; 47/60 converged runs), and circuit extraction identifies XOR computation. Performance depends on multi-step constraint propagation: without unrolling, accuracy matches output bias (63.1%), while soft unrolling reaches 96.7%. In Experiment 2, we study symbolic operator chains over integers with one operator pair held out; the model must emit intermediate steps and a final answer in a proof-like format. Across all 49 holdout pairs, the transformer exceeds every interpolation baseline (mean 41.8%, up to 78.6%; mean KRR 4.3%; KNN and MLP score 0% on every pair), while removing intermediate-step supervision degrades performance. Together with a construction showing that a standard transformer block can implement exact local Boolean rules, these results provide an existence proof that transformers can learn rule structure not directly observed in training and express it explicitly, ruling out the strongest architectural form of interpolation-only accounts: that transformers cannot in principle discover and communicate unseen rules, while leaving open when such behaviour arises in large-scale language training.

[LG-46] Integrating Inductive Biases in Transformers via Distillation for Financial Time Series Forecasting

链接: https://arxiv.org/abs/2603.16985
作者: Yu-Chen Den,Kuan-Yu Chen,Kendro Vincent,Darby Tien-Hao Chang
类目: Machine Learning (cs.LG)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:Transformer-based models have been widely adopted for time-series forecasting due to their high representational capacity and architectural flexibility. However, many Transformer variants implicitly assume stationarity and stable temporal dynamics – assumptions routinely violated in financial markets characterized by regime shifts and non-stationarity. Empirically, state-of-the-art time-series Transformers often underperform even vanilla Transformers on financial tasks, while simpler architectures with distinct inductive biases, such as CNNs and RNNs, can achieve stronger performance with substantially lower complexity. At the same time, no single inductive bias dominates across markets or regimes, suggesting that robust financial forecasting requires integrating complementary temporal priors. We propose TIPS (Transformer with Inductive Prior Synthesis), a knowledge distillation framework that synthesizes diverse inductive biases – causality, locality, and periodicity – within a unified Transformer. TIPS trains bias-specialized Transformer teachers via attention masking, then distills their knowledge into a single student model with regime-dependent alignment across inductive biases. Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, while requiring only 38% of the inference-time computation. Further analyses show that TIPS generates statistically significant excess returns beyond both vanilla Transformers and its teacher ensembles, and exhibits regime-dependent behavioral alignment with classical architectures during their profitable periods. These results highlight the importance of regime-dependent inductive bias utilization for robust generalization in non-stationary financial time series.

[LG-47] Formal verification of tree-based machine learning models for lateral spreading

链接: https://arxiv.org/abs/2603.16983
作者: Krishna Kumar
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Machine learning models for geotechnical hazard prediction can achieve high accuracy while learning physically inconsistent relationships from sparse or biased training data. Current remedies (post-hoc explainability, such as SHAP and LIME, and training-time constraints) either diagnose individual predictions approximately or restrict model capacity without providing exhaustive guarantees. This paper encodes trained tree ensembles as logical formulas in a Satisfiability Modulo Theories (SMT) solver and checks physical specifications across the entire input domain, not just sampled points. Four geotechnical specifications (water table depth, PGA monotonicity, distance safety, and flat-ground safety) are formalized as decidable logical formulas and verified via SMT against both XGBoost ensembles and Explainable Boosting Machines (EBMs) trained on the 2011 Christchurch earthquake lateral spreading dataset (7,291 sites, four features). The SMT solver either produces a concrete counterexample where a specification fails or proves that no violation exists. The unconstrained EBM (80.1% accuracy) violates all four specifications. A fully constrained EBM (67.2%) satisfies three of four specifications, demonstrating that iterative constraint application guided by verification can progressively improve physical consistency. A Pareto analysis of 33 model variants reveals a persistent trade-off, as none of the variants studied achieve both greater than 80% accuracy and full compliance with the specified set. SHAP analysis of specification counterexamples shows that the offending feature can rank last, demonstrating that post-hoc explanations do not substitute for formal verification. These results establish a verify-fix-verify engineering loop and a formal certification for deploying physically consistent ML models in safety-critical geotechnical applications.

[LG-48] Interpretable AI-Assisted Early Reliability Prediction for a Two-Parameter Parallel Root-Finding Scheme

链接: https://arxiv.org/abs/2603.16980
作者: Bruno Carpentieri,Andrei Velichko,Mudassir Shams,Paola Lecca
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 23 pages, 9 figures

点击查看摘要

Abstract:We propose an interpretable AI-assisted reliability diagnostic framework for parameterized root-finding schemes based on kNN-LLE proxy stability profiling and multi-horizon early prediction. The approach augments a numerical solver with a lightweight predictive layer that estimates solver reliability from short prefixes of iteration dynamics, enabling early identification of stable and unstable parameter regimes. For each configuration in the parameter space, raw and smoothed proxy profiles of a largest Lyapunov exponent (LLE) estimator are constructed, from which contractivity-based reliability scores summarizing finite-time convergence are derived. Machine learning models predict the reliability score from early segments of the proxy profile, allowing the framework to determine when solver dynamics become diagnostically informative. Experiments on a two-parameter parallel root-finding scheme show reliable prediction after only a few iterations: the best models achieve R^2=0.48 at horizon T=1, improve to R^2=0.67 by T=3, and exceed R^2=0.89 before the characteristic minimum-location scale of the stability profile. Prediction accuracy increases to R^2=0.96 at larger horizons, with mean absolute errors around 0.03, while inference costs remain negligible (microseconds per sample). The framework provides interpretable stability indicators and supports early decisions during solver execution, such as continuing, restarting, or adjusting parameters.

[LG-49] Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models

链接: https://arxiv.org/abs/2603.16978
作者: Pierre Krack,Tobias Jülg,Wolfram Burgard,Florian Walter
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, submitted to IEEE

点击查看摘要

Abstract:Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model’s compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance in tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also test the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.

[LG-50] Minimum-Action Learning: Energy-Constrained Symbolic Model Selection for Physical Law Identification from Noisy Data

链接: https://arxiv.org/abs/2603.16951
作者: Martin G. Frasch
类目: Machine Learning (cs.LG)
*备注: 28 pages, 10 figures, this https URL

点击查看摘要

Abstract:Identifying physical laws from noisy observational data is a central challenge in scientific machine learning. We present Minimum-Action Learning (MAL), a framework that selects symbolic force laws from a pre-specified basis library by minimizing a Triple-Action functional combining trajectory reconstruction, architectural sparsity, and energy-conservation enforcement. A wide-stencil acceleration-matching technique reduces noise variance by 10,000x, transforming an intractable problem (SNR ~0.02) into a learnable one (SNR ~1.6); this preprocessing is the critical enabler shared by all methods tested, including SINDy variants. On two benchmarks – Kepler gravity and Hooke’s law – MAL recovers the correct force law with Kepler exponent p = 3.01 +/- 0.01 at ~0.07 kWh (40% reduction vs. prediction-error-only baselines). The raw correct-basis rate is 40% for Kepler and 90% for Hooke; an energy-conservation-based criterion discriminates the true force law in all cases, yielding 100% pipeline-level identification. Basis library sensitivity experiments show that near-confounders degrade selection (20% with added r^-2.5 and r^-1.5), while distant additions are harmless, and the conservation diagnostic remains informative even when the correct basis is absent. Direct comparison with noise-robust SINDy variants, Hamiltonian Neural Networks, and Lagrangian Neural Networks confirms MAL’s distinct niche: interpretable, energy-constrained model selection that combines symbolic basis identification with dynamical rollout validation.

[LG-51] Entropy-Aware Task Offloading in Mobile Edge Computing

链接: https://arxiv.org/abs/2603.16949
作者: Mohsen Sahraei Ardakani,Hong Wan,Rui Song
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 13 pages, submitted to Journal of Blockchain Research

点击查看摘要

Abstract:Mobile Edge Computing (MEC) technology has been introduced to enable could computing at the edge of the network in order to help resource limited mobile devices with time sensitive data processing tasks. In this paradigm, mobile devices can offload their computationally heavy tasks to more efficient nearby MEC servers via wireless communication. Consequently, the main focus of researches on the subject has been on development of efficient offloading schemes, leaving the privacy of mobile user out. While the Blockchain technology is used as the trust mechanism for secured sharing of the data, the privacy issues induced from wireless communication, namely, usage pattern and location privacy are the centerpiece of this work. The effects of these privacy concerns on the task offloading Markov Decision Process (MDP) is addressed and the MDP is solved using a Deep Recurrent Q-Netwrok (DRQN). The Numerical simulations are presented to show the effectiveness of the proposed method.

[LG-52] Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention

链接: https://arxiv.org/abs/2603.16937
作者: Mahfuz Ahmed Anik,Mohsin Mahmud Topu,Azmine Toushik Wasi,Md Isfar Khan,MD Manjurul Ahsan
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 34 Pages. 7 Tables. 6 Figures

点击查看摘要

Abstract:Sleep quality is influenced by a complex interplay of behavioral, environmental, and psychosocial factors, yet most computational studies focus mainly on predictive risk identification rather than actionable intervention design. Although machine learning models can accurately predict subjective sleep outcomes, they rarely translate predictive insights into practical intervention strategies. To address this gap, we propose a personalized predictive-prescriptive framework that integrates interpretable machine learning with mixed-integer optimization. A supervised classifier trained on survey data predicts sleep quality, while SHAP-based feature attribution quantifies the influence of modifiable factors. These importance measures are incorporated into a mixed-integer optimization model that identifies minimal and feasible behavioral adjustments, while modelling resistance to change through a penalty mechanism. The framework achieves strong predictive performance, with a test F1-score of 0.9544 and an accuracy of 0.9366. Sensitivity and Pareto analyses reveal a clear trade-off between expected improvement and intervention intensity, with diminishing returns as additional changes are introduced. At the individual level, the model generates concise recommendations, often suggesting one or two high-impact behavioral adjustments and sometimes recommending no change when expected gains are minimal. By integrating prediction, explanation, and constrained optimization, this framework demonstrates how data-driven insights can be translated into structured and personalized decision support for sleep improvement.

[LG-53] Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback

链接: https://arxiv.org/abs/2603.16928
作者: Thomas Jiralerspong,Flemming Kondrup,Yoshua Bengio
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) monitoring is proposed as a method for overseeing the internal reasoning of language-model agents. Prior work has shown that when models are explicitly informed that their reasoning is being monitored, or are fine-tuned to internalize this fact, they may learn to obfuscate their CoTs in ways that allow them to evade CoT-based monitoring systems. We ask whether reasoning agents can autonomously infer that their supposedly private CoT is under surveillance, and whether this awareness leads to strategic evasion, without any explicit training or instructions. In a multi-episode agentic framework, models pursue both a primary task and a concealed side task while being told that their reasoning is private; a hidden CoT monitor blocks episodes when suspicious reasoning is detected. We find that frontier models can deduce the existence of this monitor purely from blocking feedback, with the most capable models reaching a confident belief that their thinking is observed in up to 19% of episodes. This awareness scales with model capability and, in rare cases, escalates to an explicit intent to suppress reasoning about the side task. However, models that form this intent uniformly fail to execute it, openly reasoning about their concealed objectives in the very next episode. This intent-capability gap is reassuring for current deployment, but the autonomous emergence of both monitoring awareness and evasion intent suggests that CoT monitoring is not a permanently reliable safeguard.

[LG-54] HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling

链接: https://arxiv.org/abs/2603.16917
作者: Vladimer Khasia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequence modeling universally relies on discrete subword tokenization to circumvent the \mathcalO(N^2) computational intractability of native byte-level attention. However, this heuristic quantization imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures the continuity of the optimization landscape. To resolve this dichotomy, we introduce \textbfHoloByte: a strictly tokenizer-free framework utilizing Continuous Hyperspherical Distillation. HoloByte partitions discrete byte sequences into fixed-capacity chunks and projects them into a continuous, strictly bounded hyperspherical manifold via an invertible, dimension-preserving orthogonal rotation operator. This spatial superposition allows a macroscopic transformer to operate exclusively on compressed continuous representations, formally reducing the exact attention time complexity from \mathcalO(N^2D) to \mathcalO\left( \fracN^2W^2D + ND^2 \right) . A localized causal micro-decoder subsequently unbinds these representations to compute exact byte-level distributions. To govern this continuous trajectory, we propose a dual-objective formulation incorporating a mathematically precise Holographic Latent Mean Squared Error, which strictly bounds the gradient and guarantees asymptotic stability. Theoretically, we derive the minimal embedding dimension D = \Omega(W \ln |\mathcalV|) required to ensure error-free discrete recovery from the continuous manifold. Empirically, under strictly matched parameter constraints, HoloByte is systematically outperforming a comparable discrete Byte-Pair Encoding (BPE) baseline. These results establish Continuous Hyperspherical Distillation as a mathematically rigorous and computationally tractable foundation for vocabulary-invariant sequence modeling. The code is available at this https URL

[LG-55] Federated Multi Agent Deep Learning and Neural Networks for Advanced Distributed Sensing in Wireless Networks

链接: https://arxiv.org/abs/2603.16881
作者: Nadine Muller,Stefano DeRosa,Su Zhang,Chun Lee Huan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-agent deep learning (MADL), including multi-agent deep reinforcement learning (MADRL), distributed/federated training, and graph-structured neural networks, is becoming a unifying framework for decision-making and inference in wireless systems where sensing, communication, and computing are tightly coupled. Recent 5G-Advanced and 6G visions strengthen this coupling through integrated sensing and communication, edge intelligence, open programmable RAN, and non-terrestrial/UAV networking, which create decentralized, partially observed, time-varying, and resource-constrained control problems. This survey synthesizes the state of the art, with emphasis on 2021-2025 research, on MADL for distributed sensing and wireless communications. We present a task-driven taxonomy across (i) learning formulations (Markov games, Dec-POMDPs, CTDE), (ii) neural architectures (GNN-based radio resource management, attention-based policies, hierarchical learning, and over-the-air aggregation), (iii) advanced techniques (federated reinforcement learning, communication-efficient federated deep RL, and serverless edge learning orchestration), and (iv) application domains (MEC offloading with slicing, UAV-enabled heterogeneous networks with power-domain NOMA, intrusion detection in sensor networks, and ISAC-driven perceptive mobile networks). We also provide comparative tables of algorithms, training topologies, and system-level trade-offs in latency, spectral efficiency, energy, privacy, and robustness. Finally, we identify open issues including scalability, non-stationarity, security against poisoning and backdoors, communication overhead, and real-time safety, and outline research directions toward 6G-native sense-communicate-compute-learn systems.

[LG-56] Smart Learning to Find Dumb Contracts (Extended Version)

链接: https://arxiv.org/abs/2304.10726
作者: Tamer Abdelaziz,Aquinas Hobor
类目: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:We introduce the Deep Learning Vulnerability Analyzer (DLVA) for Ethereum smart contracts based on neural networks. We train DLVA to judge bytecode even though the supervising oracle can only judge source. DLVA’s training algorithm is general: we extend a source code analysis to bytecode without any manual feature engineering, predefined patterns, or expert rules. DLVA’s training algorithm is also robust: it overcame a 1.25% error rate mislabeled contracts, and–the student surpassing the teacher–found vulnerable contracts that Slither mislabeled. DLVA is much faster than other smart contract vulnerability detectors: DLVA checks contracts for 29 vulnerabilities in 0.2 seconds, a 10-1,000x speedup. DLVA has three key components. First, Smart Contract to Vector (SC2V) uses neural networks to map smart contract bytecode to a high-dimensional floating-point vector. We benchmark SC2V against 4 state-of-the-art graph neural networks and show that it improves model differentiation by 2.2%. Second, Sibling Detector (SD) classifies contracts when a target contract’s vector is Euclidian-close to a labeled contract’s vector in a training set; although only able to judge 55.7% of the contracts in our test set, it has a Slither-predictive accuracy of 97.4% with a false positive rate of only 0.1%. Third, Core Classifier (CC) uses neural networks to infer vulnerable contracts regardless of vector distance. We benchmark DLVA’s CC with 10 ML techniques and show that the CC improves accuracy by 11.3%. Overall, DLVA predicts Slither’s labels with an overall accuracy of 92.7% and associated false positive rate of 7.2%. Lastly, we benchmark DLVA against nine well-known smart contract analysis tools. Despite using much less analysis time, DLVA completed every query, leading the pack with an average accuracy of 99.7%, pleasingly balancing high true positive rates with low false positive rates.

[LG-57] Multi-Armed Sequential Hypothesis Testing by Betting

链接: https://arxiv.org/abs/2603.17925
作者: Ricardo J. Sandoval,Ian Waudby-Smith,Michael I. Jordan
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis \mathscrP that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective) and we are interested in rejecting \mathscrP in favor of a composite alternative \mathscrQ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek e -processes and sequential tests whose performance are as strong as the ones that have oracle knowledge about which arm generates the most evidence against \mathscrP . Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently “estimable” rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.

[LG-58] A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models

链接: https://arxiv.org/abs/2603.17896
作者: Leonardo Defilippis,Florent Krzakala,Bruno Loureiro,Antoine Maillard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding when learning is statistically possible yet computationally hard is a central challenge in high-dimensional statistics. In this work, we investigate this question in the context of single- and multi-index models, classes of functions widely studied as benchmarks to probe the ability of machine learning methods to discover features in high-dimensional data. Our main contribution is to show that a Noise Sensitivity Exponent (NSE) - a simple quantity determined by the activation function - governs the existence and magnitude of statistical-to-computational gaps within a broad regime of these models. We first establish that, in single-index models with large additive noise, the onset of a computational bottleneck is fully characterized by the NSE. We then demonstrate that the same exponent controls a statistical-computational gap in the specialization transition of large separable multi-index models, where individual components become learnable. Finally, in hierarchical multi-index models, we show that the NSE governs the optimal computational rate in which different directions are sequentially learned. Taken together, our results identify the NSE as a unifying property linking noise robustness, computational hardness, and feature specialization in high-dimensional learning.

[LG-59] he Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery

链接: https://arxiv.org/abs/2603.17790
作者: Narjes Ansari,César Feniou,Nicolaï Gouraud,Daniele Loco,Siwar Badreddine,Baptiste Claudon,Félix Aviat,Marharyta Blazhynska,Kevin Gasperich,Guillaume Michel,Diata Traore,Corentin Villot,Thomas Plé,Olivier Adjoua,Louis Lagardère,Jean-Philip Piquemal
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Integrating quantum mechanics into drug discovery marks a decisive shift from empirical trial-and-error toward quantitative precision. However, the prohibitive cost of ab initio molecular dynamics has historically forced a compromise between chemical accuracy and computational scalability. This paper identifies the convergence of High-Performance Computing (HPC), Machine Learning (ML), and Quantum Computing (QC) as the definitive solution to this bottleneck. While ML foundation models, such as FeNNix-Bio1, enable quantum-accurate simulations, they remain tethered to the inherent limits of classical data generation. We detail how High-Performance Quantum Computing (HPQC), utilizing hybrid QPU-GPU architectures, will serve as the ultimate accelerator for quantum chemistry data. By leveraging Hilbert space mapping, these systems can achieve true chemical accuracy while bypassing the heuristics of classical approximations. We show how this tripartite convergence optimizes the drug discovery pipeline, spanning from initial system preparation to ML-driven, high-fidelity simulations. Finally, we position quantum-enhanced sampling as the beyond GPU frontier for modeling reactive cellular systems and pioneering next-generation materials.

[LG-60] Stochastic set-valued optimization and its application to robust learning

链接: https://arxiv.org/abs/2603.17691
作者: Tommaso Giovannelli,Jingfu Tan,Luis Nunes Vicente
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we develop a stochastic set-valued optimization (SVO) framework tailored for robust machine learning. In the SVO setting, each decision variable is mapped to a set of objective values, and optimality is defined via set relations. We focus on SVO problems with hyperbox sets, which can be reformulated as multi-objective optimization (MOO) problems with finitely many objectives and serve as a foundation for representing or approximating more general mapped sets. Two special cases of hyperbox-valued optimization (HVO) are interval-valued (IVO) and rectangle-valued (RVO) optimization. We construct stochastic IVO/RVO formulations that incorporate subquantiles and superquantiles into the objective functions of the MOO reformulations, providing a new characterization for subquantiles. These formulations provide interpretable trade-offs by capturing both lower- and upper-tail behaviors of loss distributions, thereby going beyond standard empirical risk minimization and classical robust models. To solve the resulting multi-objective problems, we adopt stochastic multi-gradient algorithms and select a Pareto knee solution. In numerical experiments, the proposed algorithms with this selection strategy exhibit improved robustness and reduced variability across test replications under distributional shift compared with empirical risk minimization, while maintaining competitive accuracy.

[LG-61] Atomic Trajectory Modeling with State Space Models for Biomolecular Dynamics

链接: https://arxiv.org/abs/2603.17633
作者: Liang Shi,Jiarui Lu,Junqi Liu,Chence Shi,Zhi Yang,Jian Tang
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the dynamic behavior of biomolecules is fundamental to elucidating biological function and facilitating drug discovery. While Molecular Dynamics (MD) simulations provide a rigorous physical basis for studying these dynamics, they remain computationally expensive for long timescales. Conversely, recent deep generative models accelerate conformation generation but are typically either failing to model temporal relationship or built only for monomeric proteins. To bridge this gap, we introduce ATMOS, a novel generative framework based on State Space Models (SSM) designed to generate atom-level MD trajectories for biomolecular systems. ATMOS integrates a Pairformer-based state transition mechanism to capture long-range temporal dependencies, with a diffusion-based module to decode trajectory frames in an autoregressive manner. ATMOS is trained across crystal structures from PDB and conformation trajectory from large-scale MD simulation datasets including mdCATH and MISATO. We demonstrate that ATMOS achieves state-of-the-art performance in generating conformation trajectories for both protein monomers and complex protein-ligand systems. By enabling efficient inference of atomic trajectory of motions, this work establishes a promising foundation for modeling biomolecular dynamics.

[LG-62] Gaussian Process Limit Reveals Structural Benefits of Graph Transformers

链接: https://arxiv.org/abs/2603.17569
作者: Nil Ayday,Lingchu Yang,Debarghya Ghoshdastidar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph transformers are the state-of-the-art for learning from graph-structured data and are empirically known to avoid several pitfalls of message-passing architectures. However, there is limited theoretical analysis on why these models perform well in practice. In this work, we prove that attention-based architectures have structural benefits over graph convolutional networks in the context of node-level prediction tasks. Specifically, we study the neural network gaussian process limits of graph transformers (GAT, Graphormer, Specformer) with infinite width and infinite heads, and derive the node-level and edge-level kernels across the layers. Our results characterise how the node features and the graph structure propagate through the graph attention layers. As a specific example, we prove that graph transformers structurally preserve community information and maintain discriminative node representations even in deep layers, thereby preventing oversmoothing. We provide empirical evidence on synthetic and real-world graphs that validate our theoretical insights, such as integrating informative priors and positional encoding can improve performance of deep graph transformers.

[LG-63] Consistency of the k-Nearest Neighbor Regressor under Complex Survey Designs

链接: https://arxiv.org/abs/2603.17551
作者: Caren Hasler
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the consistency of the k -nearest neighbor regressor under complex survey designs. While consistency results for this algorithm are well established for independent and identically distributed data, corresponding results for complex survey data are lacking. We show that the k -nearest neighbor regressor is consistent under regularity conditions on the sampling design and the distribution of the data. We derive lower bounds for the rate of convergence and show that these bounds exhibit the curse of dimensionality, as in the independent and identically distributed setting. Empirical studies based on simulated and real data illustrate our theoretical findings.

[LG-64] Mirror Descent on Riemannian Manifolds

链接: https://arxiv.org/abs/2603.17527
作者: Jiaxin Jiang,Lei Shi,Jiyuan Tan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.

[LG-65] Data-driven model order reduction for structures with piecewise linear nonlinearity using dynamic mode decomposition

链接: https://arxiv.org/abs/2603.17423
作者: Akira Saito,Masato Tanaka
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Piecewise-linear nonlinear systems appear in many engineering disciplines. Prediction of the dynamic behavior of such systems is of great importance from practical and theoretical viewpoint. In this paper, a data-driven model order reduction method for piecewise-linear systems is proposed, which is based on dynamic mode decomposition (DMD). The overview of the concept of DMD is provided, and its application to model order reduction for nonlinear systems based on Galerkin projection is explained. The proposed approach uses impulse responses of the system to obtain snapshots of the state variables. The snapshots are then used to extract the dynamic modes that are used to form the projection basis vectors. The dynamics described by the equations of motion of the original full-order system are then projected onto the subspace spanned by the basis vectors. This produces a system with much smaller number of degrees of freedom (DOFs). The proposed method is applied to two representative examples of piecewise linear systems: a cantilevered beam subjected to an elastic stop at its end, and a bonded plates assembly with partial debonding. The reduced order models (ROMs) of these systems are constructed by using the Galerkin projection of the equation of motion with DMD modes alone, or DMD modes with a set of classical constraint modes to be able to handle the contact nonlinearity efficiently. The obtained ROMs are used for the nonlinear forced response analysis of the systems under harmonic loading. It is shown that the ROMs constructed by the proposed method produce accurate forced response results.

[LG-66] Rapid Neural Network Prediction of Linear Block Copolymer Free Energies

链接: https://arxiv.org/abs/2603.17391
作者: Ian Chen,Alfredo Alexander-Katz
类目: oft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Free energies are fundamental quantities governing phase behavior and thermodynamic stability in polymer systems, yet their accurate computation often requires extensive simulations and post-processing techniques such as the Bennett Acceptance Ratio (BAR). While BAR provides reliable estimates when applied between closely related thermodynamic states, evaluating free energies across large changes in interaction strength typically requires a sequence of intermediate simulations to maintain sufficient phase-space overlap, substantially increasing computational cost. In this work we develop a machine learning framework for rapidly predicting excess free energies of linear diblock copolymer systems from simulation-derived energetic descriptors. Using dissipative particle dynamics simulations of freely-jointed chain polymers, we construct a dataset of per-chain energetic statistics, including heterogeneous interaction energies, homogeneous interaction energies, and bonded spring energies, and train feed-forward neural networks to learn the relationship between these descriptors and free energies computed using a stratified BAR procedure. The resulting models accurately reproduce the reference free energies across a range of chain lengths, compositions, and densities, including polymer architectures held out from training. In regimes where direct, brute-force BAR estimates become unreliable due to poor phase-space overlap, the neural network predictions remain consistent with the reference values. These results demonstrate that physically informed machine learning models can serve as efficient surrogates for expensive free-energy calculations and provide a promising approach for accelerating thermodynamic analysis of polymer systems.

[LG-67] Wasserstein-type Gaussian Process Regressions for Input Measurement Uncertainty

链接: https://arxiv.org/abs/2603.17271
作者: Hengrui Luo,Xiaoye S. Li,Yang Liu,Marcus Noack,Ji Qiang,Mark D. Risser
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:Gaussian process (GP) regression is widely used for uncertainty quantification, yet the standard formulation assumes noise-free covariates. When inputs are measured with error, this errors-in-variables (EIV) setting can lead to optimistically narrow posterior intervals and biased decisions. We study GP regression under input measurement uncertainty by representing each noisy input as a probability measure and defining covariance through Wasserstein distances between these measures. Building on this perspective, we instantiate a deterministic projected Wasserstein ARD (PWA) kernel whose one-dimensional components admit closed-form expressions and whose product structure yields a scalable, positive-definite kernel on distributions. Unlike latent-input GP models, PWA-based GPs (\PWAGPs) handle input noise without introducing unobserved covariates or Monte Carlo projections, making uncertainty quantification more transparent and robust.

[LG-68] Self-Regularized Learning Methods

链接: https://arxiv.org/abs/2603.17160
作者: Max Schölpple,Liu Fanghui,Ingo Steinwart
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We introduce a general framework for analyzing learning algorithms based on the notion of self-regularization, which captures implicit complexity control without requiring explicit regularization. This is motivated by previous observations that many algorithms, such as gradient-descent based learning, exhibit implicit regularization. In a nutshell, for a self-regularized algorithm the complexity of the predictor is inherently controlled by that of the simplest comparator achieving the same empirical risk. This framework is sufficiently rich to cover both classical regularized empirical risk minimization and gradient descent. Building on self-regularization, we provide a thorough statistical analysis of such algorithms including minmax-optimal rates, where it suffices to show that the algorithm is self-regularized – all further requirements stem from the learning problem itself. Finally, we discuss the problem of data-dependent hyperparameter selection, providing a general result which yields minmax-optimal rates up to a double logarithmic factor and covers data-driven early stopping for RKHS-based gradient descent.

[LG-69] Optimization-Embedded Active Multi-Fidelity Surrogate Learning for Multi-Condition Airfoil Shape Optimization

链接: https://arxiv.org/abs/2603.17057
作者: Isaac Robledo,Alberto Vilariño,Arnau Miró,Oriol Lehmkuhl,Carlos Sanmiguel Vila,Rodrigo Castellanos
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*备注: 21 pages, 14 figures

点击查看摘要

Abstract:Active multi-fidelity surrogate modeling is developed for multi-condition airfoil shape optimization to reduce high-fidelity CFD cost while retaining RANS-level accuracy. The framework couples a low-fidelity-informed Gaussian process regression transfer model with uncertainty-triggered sampling and a synchronized elitism rule embedded in a hybrid genetic algorithm. Low-fidelity XFOIL evaluations provide inexpensive features, while sparse RANS simulations are adaptively allocated when predictive uncertainty exceeds a threshold; elite candidates are mandatorily validated at high fidelity, and the population is re-evaluated to prevent evolutionary selection based on outdated fitness values produced by earlier surrogate states. The method is demonstrated for a two-point problem at Re=6\times10^6 with cruise at \alpha=2^\circ (maximize E=L/D ) and take-off at \alpha=10^\circ (maximize C_L ) using a 12-parameter CST representation. Independent multi-fidelity surrogates per flight condition enable decoupled refinement. The optimized design improves cruise efficiency by 41.05% and take-off lift by 20.75% relative to the best first-generation individual. Over the full campaign, only 14.78% (cruise) and 9.5% (take-off) of evaluated individuals require RANS, indicating a substantial reduction in high-fidelity usage while maintaining consistent multi-point performance.

[LG-70] Hybrid Classical-Quantum Transfer Learning with Noisy Quantum Circuits

链接: https://arxiv.org/abs/2603.16973
作者: D. Martín-Pérez,F. Rodríguez-Díaz,D. Gutiérrez-Avilés,A. Troncoso,F. Martínez-Álvarez
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum transfer learning combines pretrained classical deep learning models with quantum circuits to reuse expressive feature representations while limiting the number of trainable parameters. In this work, we introduce a family of compact quantum transfer learning architectures that attach variational quantum classifiers to frozen convolutional backbones for image classification. We instantiate and evaluate several classical-quantum hybrid models implemented in PennyLane and Qiskit, and systematically compare them with a classical transfer-learning baseline across heterogeneous image datasets. To ensure a realistic assessment, we evaluate all approaches under both ideal simulation and noisy emulation using noise models calibrated from IBM quantum hardware specifications, as well as on real IBM quantum hardware. Experimental results show that the proposed quantum transfer learning architectures achieve competitive and, in several cases, superior accuracy while consistently reducing training time and energy consumption relative to the classical baseline. Among the evaluated approaches, PennyLane-based implementations provide the most favorable trade-off between accuracy and computational efficiency, suggesting that hybrid quantum transfer learning can offer practical benefits in realistic NISQ era settings when feature extraction remains classical.

[LG-71] Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network

链接: https://arxiv.org/abs/2603.16972
作者: Protopopov Alexey
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 9 pages, 5 figures, 1 table

点击查看摘要

Abstract:Automatic speech recognition systems based on neural networks are vulnerable to adversarial attacks that alter transcriptions in a malicious way. Recent works in this field have focused on making attacks work in over-the-air scenarios, however such attacks are typically detectable by human hearing, limiting their potential applications. In the present work we explore different approaches of making over-the-air attacks less detectable, as well as the impact these approaches have on the attacks’ effectiveness.

[LG-72] Kriging via variably scaled kernels

链接: https://arxiv.org/abs/2603.16950
作者: Gianluca Audone,Francesco Marchetti,Emma Perracchione,Milvia Rossini
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Classical Gaussian processes and Kriging models are commonly based on stationary kernels, whereby correlations between observations depend exclusively on the relative distance between scattered data. While this assumption ensures analytical tractability, it limits the ability of Gaussian processes to represent heterogeneous correlation structures. In this work, we investigate variably scaled kernels as an effective tool for constructing non-stationary Gaussian processes by explicitly modifying the correlation structure of the data. Through a scaling function, variably scaled kernels alter the correlations between data and enable the modeling of targets exhibiting abrupt changes or discontinuities. We analyse the resulting predictive uncertainty via the variably scaled kernel power function and clarify the relationship between variably scaled kernels-based constructions and classical non-stationary kernels. Numerical experiments demonstrate that variably scaled kernels-based Gaussian processes yield improved reconstruction accuracy and provide uncertainty estimates that reflect the underlying structure of the data

[LG-73] A Framework for Modeling Liquefaction-Induced Road Disruptions After Earthquakes: Implications for Emergency Response and Access in the Cascadia Region of North America

链接: https://arxiv.org/abs/2603.16948
作者: Morgan D. Sanger,Olyvia B. Smith,Brett W. Maurer,Liam Wotherspoon,Marc O. Eberhard,Jeffrey W. Berman
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large earthquakes along the Cascadia Subduction Zone (CSZ) are expected to trigger widespread soil liquefaction that could disrupt transportation systems across the U.S. Pacific Northwest. However, past regional assessments have relied on simple geologic screening methods and binomial shaking thresholds that are only loosely informed by liquefaction science. This study introduces a mechanics-informed, data-driven framework for estimating liquefaction-induced road closures and service reductions, and the framework is applied to a magnitude-9 CSZ earthquake. Predicted liquefaction severity is translated into segment-level probabilities of closure and reduced service using empirically derived fragility relationships. These probabilities are mapped at 90-m resolution and propagated through the National Highway System using a spatially correlated Monte Carlo simulation to estimate link-level disruption. Results show that impacts are concentrated in low-lying coastal zones, river valleys, and urban waterfronts, with major disruptions expected along critical routes including U.S. Route 101. Local mobility is further examined in Pacific and Grays Harbor counties, Washington, where limited network redundancy, strong shaking, and high liquefaction susceptibility lead to elevated probabilities of isolation and loss of hospital access. Socioeconomic analysis reveals modest but statistically significant associations between road impacts and demographic indicators, suggesting that liquefaction impacts may compound with existing social vulnerabilities. While not a substitute for site-specific analysis, the results provide a regional baseline for emergency planning, risk communication, and prioritization of more advanced geotechnical sampling and analysis. Moreover, the methodology proposed here is not specific to the CSZ, but rather, could be applied to analogous studies of road impacts elsewhere.

[LG-74] Gaussian Process Regression-based Knowledge Distillation Framework for Simultaneous Prediction of Physical and Mechanical Properties of Epoxy Polymers

链接: https://arxiv.org/abs/2603.16925
作者: Sindu B.S.,Jan Hamaekers
类目: oft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Epoxy polymers are widely used due to their multifunctional properties, but machine learning (ML) applications remain limited owing to their complex 3D molecular structure, multi-component nature, and lack of curated datasets. Existing ML studies are largely restricted to simulation data, specific properties, or narrow constituent ranges. To address these limitations, we developed an informed Gaussian Process Regression-based Knowledge Distillation (GPR-KD) framework for predicting multiple physical (glass transition temperature, density) and mechanical properties (elastic modulus, tensile strength, compressive strength, flexural strength, fracture energy, adhesive strength) of thermoset epoxy polymers. The model was trained on experimental literature data covering diverse monomer classes (9 resins, 40 hardeners). Individual GPR models serve as teacher models capturing nonlinear feature-property relationships, while a unified neural network student model learns distilled knowledge across all properties simultaneously. By encoding the target property as an input feature, the student model leverages cross-property correlations. Molecular-level descriptors extracted from SMILES representations using RDKit create a physics-informed model. The framework combines GPR interpretability and robustness with deep learning scalability and generalization. Comparative analysis demonstrates superior prediction accuracy over conventional ML models. Simultaneous multi-property prediction further improves accuracy through information sharing across correlated properties. The proposed framework enables accelerated design of novel epoxy polymers with tailored properties.

[LG-75] EEG-SeeGraph: Interpreting functional connectivity disruptions in dementias via sparse-explanatory dynamic EEG-graph learning

链接: https://arxiv.org/abs/2603.16895
作者: Fengcheng Wu(1),Zhenxi Song(1),Guoyang Xu(1),Kaisong Hu(1),Zirui Wang(1),Yi Guo(2),Zhiguo Zhang(1) ((1) Harbin Institute of Technology, Shenzhen, China, (2) Institute of Neurological Diseases, Shenzhen Bay Laboratory, Shenzhen)
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust and interpretable dementia diagnosis from noisy, non-stationary electroencephalography (EEG) is clinically essential yet remains challenging. To this end, we propose SeeGraph, a Sparse-Explanatory dynamic EEG-graph network that models time-evolving functional connectivity and employs a node-guided sparse edge mask to reveal the connections that drive diagnostic decisions, while remaining robust to noise and cross-site variability. SeeGraph comprises four components: (1) a dual-trajectory temporal encoder that models dynamic EEG with two streams, where node signals capture regional oscillations and edge signals capture interregional coupling; (2) a topology-aware positional encoder that derives graph-spectral Laplacian coordinates from the fused connectivity and augments node embeddings; (3) a node-guided sparse explanatory edge mask that gates the connectivity into a compact subgraph; and (4) a gated graph predictor that operates on the sparsified graph. The framework is trained using cross-entropy loss together with a sparsity regularizer on the mask, yielding noise-robust and interpretable diagnoses. The effectiveness of SeeGraph is validated on public and in-house EEG cohorts, including patients with neurodegenerative dementias and healthy controls, under both raw and noise-perturbed conditions. Its sparse, node-guided explanations highlight disease-relevant connections and align with established clinical findings on functional connectivity alterations, thereby offering transparent cues for neurological evaluation.

[LG-76] A Controlled Comparison of Deep Learning Architectures for Multi-Horizon Financial Forecasting: Evidence from 918 Experiments

链接: https://arxiv.org/abs/2603.16886
作者: Nabeel Ahmad Saidd
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); General Finance (q-fin.GN)
*备注:

点击查看摘要

Abstract:Multi-horizon price forecasting is central to portfolio allocation, risk management, and algorithmic trading, yet deep learning architectures have proliferated faster than rigorous financial benchmarks can evaluate them. This study provides a controlled comparison of nine architectures (Autoformer, DLinear, iTransformer, LSTM, ModernTCN, N-HiTS, PatchTST, TimesNet, and TimeXer) spanning Transformer, MLP, CNN, and RNN families across cryptocurrency, forex, and equity index markets at 4-hour and 24-hour horizons. A total of 918 experiments were conducted under a strict five-stage protocol including fixed-seed Bayesian hyperparameter optimization, configuration freezing per asset class, multi-seed retraining, uncertainty aggregation, and statistical validation. ModernTCN achieves the best mean rank (1.333) with a 75 percent first-place rate, followed by PatchTST (2.000). Results reveal a clear three-tier ranking structure and show that architecture explains nearly all performance variance, while seed randomness is negligible. Rankings remain stable across horizons despite 2 to 2.5 times error amplification. Directional accuracy remains near 50 percent across all configurations, indicating that MSE-trained models lack directional skill at hourly resolution. The findings highlight the importance of architectural inductive bias over raw parameter count and provide reproducible guidance for multi-step financial forecasting.

附件下载

点击下载今日全部论文列表

目录

概览 (2026-03-19)

多智能体系统

自然语言处理

信息检索

人机交互

计算机视觉

人工智能

机器学习

附件下载